We were turning roundss/sd/ps/pd intrinsics with immediates of 1 or 2 into
llvm.floor/ceil. The llvm.ceil/floor intrinsics are supposed to correspond
to the libm functions. For the libm functions we need to disable the
precision exception so the llvm.floor/ceil functions should always map to
encodings 0x9 and 0xA.
We had a mix of isel patterns where some used 0x9 and 0xA and others used
0x1 and 0x2. We need to be consistent and always use 0x9 and 0xA.
Since we have no way in isel of knowing where the llvm.ceil/floor came
from, we can't map X86 specific intrinsics with encodings 1 or 2 to it.
We could map 0x9 and 0xA to llvm.ceil/floor instead, but I'd really like
to see a use case and optimization advantage first.
I've left the backend test cases to show the blend we now emit without
the extra isel patterns. But I've removed the InstCombine tests completely.
llvm-svn: 361425
As it's causing some bot failures (and per request from kbarton).
This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.
llvm-svn: 358546
First step towards removing the MOVMSK intrinsics completely - this patch expands MOVMSK to the pattern:
e.g. PMOVMSKB(v16i8 x):
%cmp = icmp slt <16 x i8> %x, zeroinitializer
%int = bitcast <16 x i8> %cmp to i16
%res = zext i16 %int to i32
Which is correctly handled by ISel and FastIsel (give or take an annoying movzx move....): https://godbolt.org/z/rkrSFW
Differential Revision: https://reviews.llvm.org/D60256
llvm-svn: 357909
In PR41304:
https://bugs.llvm.org/show_bug.cgi?id=41304
...we have a case where we want to fold a binop of select-shuffle (blended) values.
Rather than try to match commuted variants of the pattern, we can canonicalize the
shuffles and check for mask equality with commuted operands.
We don't produce arbitrary shuffle masks in instcombine, but select-shuffles are a
special case that the backend is required to handle because we already canonicalize
vector select to this shuffle form.
So there should be no codegen difference from this change. It's possible that this
improves CSE in IR though.
Differential Revision: https://reviews.llvm.org/D60016
llvm-svn: 357366
Teach instcombine to propagate demanded elements through a masked load or masked gather instruction. This is in the broader context of improving vector pointer instcombine under https://reviews.llvm.org/D57140.
Differential Revision: https://reviews.llvm.org/D57372
llvm-svn: 356510
Remove test cases that checked for not crashing when immediate operands were passed not an immediate. These are now considered ill-formed in IR.
This was done by manually scanning the intrinsic file for llvm_i32_ty and llvm_i8_ty which are the predominant types we use for immediates. Most of them are on vector intrinsics. I might have missed some other intrinsics.
Differential Revision: https://reviews.llvm.org/D58302
llvm-svn: 355993
If we can reduce the x86-specific intrinsic to the generic op, it allows existing
simplifications and value tracking folds. AFAICT, this always results in identical
x86 codegen in the non-reduced case...which should be true because we semi-generically
(too aggressively IMO) convert to llvm.uadd.with.overflow in CGP, so the DAG/isel must
already combine/lower this intrinsic as expected.
This isn't quite what was requested in:
https://bugs.llvm.org/show_bug.cgi?id=40486
...but we want to have these kinds of folds early for efficiency and to enable greater
simplifications. For the case in the bug report where we have:
_addcarry_u64(0, ahi, 0, &ahi)
...this gets completely simplified away in IR.
Differential Revision: https://reviews.llvm.org/D57453
llvm-svn: 352870
The point is that this simplifies integration of new intrinsics into SimplifiedDemandedVectorElts, and ensures we don't miss any existing ones.
This is intended to be NFC-ish, but as seen from the diffs, can produce slightly different output. This is due to order of transforms w/in instcombine resulting in two slightly different fixed points. That's something we should fix, but isn't a problem w/this patch per se.
Differential Revision: https://reviews.llvm.org/D57398
llvm-svn: 352653
I had a local change I hadn't realized when submitting that auto-update. As such, the auto-update was wrong. This should fix it, and with that, it's clearly time to stop submitting changes and go to bed.
llvm-svn: 352454
This file appears to have been manually editted at some point after being auto-updated. A future change adjusts this file slightly, and all of the updates makes the diff super confusing.
llvm-svn: 352453
As discussed on D55894, this replaces the existing PADDS/PSUBUS intrinsics with the the sadd/ssub.sat generic intrinsics and moves the tests out of the x86 subfolder.
PR40110 has been raised to fix the regression with constant folding vectors containing undef elements.
llvm-svn: 349759
call iM movmsk(sext <N x i1> X) --> zext (bitcast <N x i1> X to iN) to iM
This has the potential to create less-than-8-bit scalar types as shown in
some of the test diffs, but it looks like the backend knows how to deal
with that in these patterns. This is the simple part of the fix suggested in:
https://bugs.llvm.org/show_bug.cgi?id=39927
Differential Revision: https://reviews.llvm.org/D55529
llvm-svn: 348862
We established the (unfortunately complicated) rules for UB/poison
propagation with vector ops in:
D48893
D48987
D49047
It's clear from the affected tests that we are potentially creating
poison where none existed before the transforms. For add/sub/mul,
the answer is simple: just drop the flags because the extra undef
vector lanes are generally more valuable for analysis and codegen.
llvm-svn: 343819
We're a long way from D50992 and D51553, but this is where we have to start.
We weren't back-propagating undefs into binop constant values for anything but
add/sub/mul/and/or/xor.
This is likely because we have to be careful about not introducing UB/poison
with div/rem/shift. But I suspect we already are getting the poison part wrong
for add/sub/mul (although it may not be possible to expose the bug currently
because we use SimplifyDemandedVectorElts from a limited set of opcodes).
See the discussion/implementation from D48987 and D49047.
This patch just enables functionality for FP ops because those do not have
UB/poison potential.
llvm-svn: 343727
Follow-up to rL342324 (D52059):
Missing optimizations with blendv are shown in:
https://bugs.llvm.org/show_bug.cgi?id=38814
This is an easier and more powerful solution than adding pattern matching for a few
special cases in the backend. The potential danger with this transform in IR is that
the condition value can get separated from the select, and the backend might not be
able to make a blendv out of it again.
llvm-svn: 342806
Missing optimizations with blendv are shown in:
https://bugs.llvm.org/show_bug.cgi?id=38814
If this works, it's an easier and more powerful solution than adding pattern matching
for a few special cases in the backend. The potential danger with this transform in IR
is that the condition value can get separated from the select, and the backend might
not be able to make a blendv out of it again. I don't think that's too likely, but
I've kept this patch minimal with a 'TODO', so we can test that theory in the wild
before expanding the transform.
Differential Revision: https://reviews.llvm.org/D52059
llvm-svn: 342324
This converts them to what clang is now using for codegen. Unfortunately, there seem to be a few kinks to work out still. I'll try to address with follow up patches.
llvm-svn: 336871
I think the intrinsics named 'avx512.mask.' should refer to the previous behavior of taking a mask argument in the intrinsic instead of using a 'select' or 'and' instruction in IR to accomplish the masking. This is more consistent with the goal that eventually we will have no intrinsics that have masking builtin. When we reach that goal, we should have no intrinsics named "avx512.mask".
llvm-svn: 335744
This patch replaces calls to X86-specific intrinsics with floor-ceil semantics
with calls to target-independent @llvm.floor.* and @llvm.ceil.* intrinsics. This
doesn't affect the resulting machine code, as those intrinsics are lowered to
the same instructions, but exposes these specific rounding cases to generic
optimizations.
Differential Revision: https://reviews.llvm.org/D48067
llvm-svn: 335039
This is the last step in getting constant pattern matchers to allow
undef elements in constant vectors.
I'm adding a dedicated m_ZeroInt() function and building m_Zero() from
that. In most cases, calling code can be updated to use m_ZeroInt()
directly when there's no need to match pointers, but I'm leaving that
efficiency optimization as a follow-up step because it's not always
clear when that's ok.
There are just enough icmp folds in InstSimplify that can be used for
integer or pointer types, that we probably still want a generic m_Zero()
for those cases. Otherwise, we could eliminate it (and possibly add a
m_NullPtr() as an alias for isa<ConstantPointerNull>()).
We're conservatively returning a full zero vector (zeroinitializer) in
InstSimplify/InstCombine on some of these folds (see diffs in InstSimplify),
but I'm not sure if that's actually necessary in all cases. We may be
able to propagate an undef lane instead. One test where this happens is
marked with 'TODO'.
llvm-svn: 330550
This completes the work started in r329604 and r329605 when we changed clang to no longer use the intrinsics.
We lost some InstCombine SimplifyDemandedBit optimizations through this change as we aren't able to fold 'and', bitcast, shuffle very well.
llvm-svn: 329990
These are uncontroversial and independent of a proposed LangRef edits (D44216).
I tried to fix tests that would fold away:
rL327004
rL327028
rL327030
rL327034
I'm not sure if the Reassociate tests are meaningless yet, but they probably will be
as we add more folds, so if anyone has suggestions or wants to fix those, please do.
Differential Revision: https://reviews.llvm.org/D44258
llvm-svn: 327058
Summary:
This patch changes the signature of the avx512 packed fp compare intrinsics to return a vXi1 vector and no longer take a mask as input. The casts to scalar type will now need to be explicit in the IR. The masking node will now be an explicit and in the IR.
This makes the intrinsic look much more similar to an fcmp instruction that we wish we could use for these but can't. We already use icmp instructions for integer compares.
Previously the lowering step of isel would turn the intrinsic into an X86 specific ISD node and a emit the masking nodes as well as some bitcasts. This means DAG combines can't see the vXi1 type until somewhat late, making it more difficult to combine out gpr<->mask transition sequences. By exposing the vXi1 type explicitly in the IR and initial SelectionDAG we give earlier DAG combines and even InstCombine the chance to see it and optimize it.
This should make any issues with gpr<->mask sequences the same between integer and fp. Meaning we only have to fix them once.
Reviewers: spatel, delena, RKSimon, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43137
llvm-svn: 324827
I've moved the test cases from the InstCombine optimizations to the backend to keep the coverage we had there. It covered every possible immediate so I've preserved the resulting shuffle mask for each of those immediates.
llvm-svn: 313450
Recurse instead of returning on the first found optimization. Also, return early in the caller
instead of continuing because that allows another round of simplification before we might
potentially lose undef information from a shuffle mask by eliminating the shuffle.
As noted in the review, we could probably do better and be more efficient by moving all of
demanded elements into a separate pass, but this is yet another quick fix to instcombine.
Differential Revision: https://reviews.llvm.org/D37236
llvm-svn: 312248
This intrinsic clears the upper bits starting at a specified index. If the index is a constant we can do some simplifications.
This could be in InstSimplify, but we don't handle any target specific intrinsics there today.
Differential Revision: https://reviews.llvm.org/D36069
llvm-svn: 309604
This patch adds simplification support for the BEXTR/BEXTRI intrinsics to match gcc. This only supports cases that fold to 0 or can be fully constant folded. Theoretically we could support converting to AND if the shift part is unused or to only a shift if the mask doesn't modify any bits after an equivalent shl. gcc doesn't do these transformations either.
I put this in InstCombine, but it could be done in InstSimplify. It would be the first target specific intrinsic in InstSimplify.
Differential Revision: https://reviews.llvm.org/D36063
llvm-svn: 309603