This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.
Differential Revision: https://reviews.llvm.org/D61189
llvm-svn: 359666
Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) insead of much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest element initialization in _mm_set_epi32.
So such inefficient lowering leads to significant performance digradations in ceratin cases switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512
Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable since latter is closer match to desired behavior and always efficiently lowered to movd and alike.
Committed on behalf of @Serge_Preis (Serge Preis)
Differential Revision: https://reviews.llvm.org/D60852
llvm-svn: 359545
The MachineFunction wasn't used in getOptimalMemOpType, but more importantly,
this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType.
This is the groundwork for the changes in D59766 and D59787, that allows
implementation of TTI::getMemcpyCost.
Differential Revision: https://reviews.llvm.org/D59785
llvm-svn: 359537
Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support.
This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type
Fixes PR39936 and should help with PR39920/PR39921
Differential Revision: https://reviews.llvm.org/D61245
llvm-svn: 359491
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.
llvm-svn: 359400
An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0.
Differential Revision: https://reviews.llvm.org/D61230
llvm-svn: 359396
As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern.
llvm-svn: 359386
Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector.
We can extend this in the future to handle more opcodes and non-zero selections.
llvm-svn: 359378
Summary: If we have SSE2 we can use a MOVQ to store 64-bits and avoid falling back to a cmpxchg8b loop. If its a seq_cst store we need to insert an mfence after the store.
Reviewers: spatel, RKSimon, reames, jfb, efriedma
Reviewed By: RKSimon
Subscribers: hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60546
llvm-svn: 359368
Create a matchBitOpReduction helper that checks for the pattern with any opcode.
First step towards reusing this code to recognize other scalar reduction patterns.
llvm-svn: 359296
As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs).
Differential Revision: https://reviews.llvm.org/D61068
llvm-svn: 359293
A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380.
Differential Revision: https://reviews.llvm.org/D60512
llvm-svn: 359292
Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type.
We still have the issue identified in PR41594 but D61114 should handle this.
llvm-svn: 359176
If the types don't match, we can't just remove the shuffle.
There may be some other opportunity for optimization here,
but this should prevent the crashing seen in:
https://bugs.llvm.org/show_bug.cgi?id=41414
llvm-svn: 359095
Circling back to a leftover bit from PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859#c1
...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'.
This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca.
We already do this transform for the SETULT predicate, so this makes the code more
symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect
anything but pre-SSE4.1 subtargets.
$ cat before.s
movdqa -16(%rip), %xmm2 ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768]
pxor %xmm0, %xmm2
pcmpgtw -32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255]
pand %xmm2, %xmm0
pandn %xmm1, %xmm2
por %xmm2, %xmm0
$ cat after.s
movdqa -16(%rip), %xmm2 ## xmm2 = [256,256,256,256,256,256,256,256]
psubusw %xmm0, %xmm2
pxor %xmm3, %xmm3
pcmpeqw %xmm2, %xmm3
pand %xmm3, %xmm0
pandn %xmm1, %xmm3
por %xmm3, %xmm0
$ llvm-mca before.s -mcpu=haswell
Iterations: 100
Instructions: 600
Total Cycles: 909
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 0.77
IPC: 0.66
Block RThroughput: 1.8
$ llvm-mca after.s -mcpu=haswell
Iterations: 100
Instructions: 700
Total Cycles: 409
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.71
IPC: 1.71
Block RThroughput: 1.8
Differential Revision: https://reviews.llvm.org/D60838
llvm-svn: 358999
This was supposed to be NFC, but the change in SDLoc
definitions causes instruction scheduling changes.
There's nothing x86-specific in this code, and it can
likely be used from DAGCombiner's simplifyVBinOp().
llvm-svn: 358930
Summary:
If you pass two 1024 bit vectors in IR with AVX2 on Windows 64. Both vectors will be split in four 256 bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory.
The PartOffsets stored for the second argument are all in terms of its original 1024 bit size. So the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8 byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too.
This issue was encountered by ISPC. I'm working on getting a reduce test case, but wanted to go ahead and get feedback on the fix.
Reviewers: rnk
Reviewed By: rnk
Subscribers: dbabokin, llvm-commits, hiraditya
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60801
llvm-svn: 358817
combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together.
This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).
This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.
My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.
Differential Revision: https://reviews.llvm.org/D60375#inline-539623
llvm-svn: 358692
This replaces the MOVMSK combine introduced at D52121/rL342326
(movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))
with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra.
Differential Revision: https://reviews.llvm.org/D60625
llvm-svn: 358651
As discussed on PR41359, this patch renames the pair of shift-mask target feature functions to make their purposes more obvious.
shouldFoldShiftPairToMask -> shouldFoldConstantShiftPairToMask
preferShiftsToClearExtremeBits -> shouldFoldMaskToVariableShiftPair
llvm-svn: 358526
Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.
This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle' that recognises that the op is used for a X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.
Differential Revision: https://reviews.llvm.org/D60562
llvm-svn: 358516
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.
This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.
This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.
Differential Revision: https://reviews.llvm.org/D60610
llvm-svn: 358286
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.
The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.
llvm-svn: 358218
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.
It also removes the pseudo instructions for atomic RMW fadd. Instead just adding isel patterns for folding an atomic load into addss/sd. And relying on the new movss/sd store pattern to handle the write part.
This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.
Differential Revision: https://reviews.llvm.org/D60394
llvm-svn: 358215
With correct test checks this time.
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integ
This matches what gcc and icc do for this case and removes an existing FIXME.
llvm-svn: 358214
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.
This matches what gcc and icc do for this case and removes an existing FIXME.
Differential Revision: https://reviews.llvm.org/D60156
llvm-svn: 358211
Certain optimisations from ConstantHoisting and CGP rely on Selection DAG not
seeing through to the constant in other blocks. Revert this patch while we come
up with a better way to handle that.
I will try to follow this up with some better tests.
llvm-svn: 358113
Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and end up being pruned later. But it was useless unnecessary work to create it.
Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements.
For one of the cases I had to use UNDEF has a dummy value for a result we know is unused. This should get pruned later.
llvm-svn: 357935
I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops.
We've hinted at making all 8-bit ops undesirable for the reason in the code comment:
// TODO: Almost no 8-bit ops are desirable because they have no actual
// size/speed advantages vs. 32-bit ops, but they do have a major
// potential disadvantage by causing partial register stalls.
...but that leads to massive diffs and exposes all kinds of optimization holes itself.
Differential Revision: https://reviews.llvm.org/D60286
llvm-svn: 357912
Previously LowerOperationWrapper took the number of results from the original
node and counted that many results from the new node. This was intended to drop
chain operands from FP_TO_SINT lowering that uses X87 with memory operations to
stack temporaries. The final load had an extra chain output that needs to be
ignored.
Unfortunately, it didn't work with scatter which has 2 result operands, the
mask output which is discarded and a chain output. The chain output is the one
that is needed but it comes second and it would be dropped by the previous
logic here. To workaround this we were doing a ReplaceAllUses in the lowering
code so that the generic legalization code wouldn't see any uses to replace
since it had been given the wrong result/type.
After this change we take the LowerOperation result directly if the original
node has one result. This allows us to directly return the chain from scatter
or the load data from the FP_TO_SINT case. When the original node has multiple
results we'll ensure the returned node has the same number and copy them over.
For cases where the original node has multiple results and the new code for some
reason has even more results, MERGE_VALUES can be used to pass only the needed
results.
llvm-svn: 357887
In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands.
This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved.
Differential Revision: https://reviews.llvm.org/D60333
llvm-svn: 357859
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.
Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.
Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon
Reviewed By: RKSimon
Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60228
llvm-svn: 357802
Summary:
Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual reg used has one def only.
This can be particularly useful when calling isBaseWithConstantOffset()
with the ISD::CopyFromReg argument, as more optimizations may get enabled
in the result.
Also add a missing truncation on X86, found by testing of this patch.
Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa
Reviewers: bogner, craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59535
llvm-svn: 357745
We already promote SRL and SHL to i32.
This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions.
I think there might be some DAG combine improvement opportunities in the test changes here.
Differential Revision: https://reviews.llvm.org/D60278
llvm-svn: 357743
It unnecessarily breaks previously-working code which used varargs,
but didn't pass any float/double arguments (such as EDK2).
Also revert the fixup on top of that:
Revert [X86] Fix a test from r357317
This reverts r357317 (git commit d413f41de6)
This reverts r357380 (git commit 7af32444b9)
llvm-svn: 357718
These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers.
This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes
that take the extra zeroes as inputs. The zeros will then go through isel on their own to select
the MOV32r0 pseudo. Then we just need to mention the physical registers directly
in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary
copies to/from physical registers.
llvm-svn: 357659
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think was largely so we could be pointer
size agnostic in the isel pattern.
To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.
With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.
I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defing rax or vice versa depending on whether we define the instruction to use EAX/RAX.
llvm-svn: 357652
This pattern would show up as a regression if we more
aggressively convert vector FP ops to scalar ops.
There's still a missed optimization for the v4f64 legal
case (AVX) because we create that h-op with an undef operand.
We should probably just duplicate the operands for that
pattern to avoid trouble.
llvm-svn: 357642
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.
The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.
If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.
$ llvm-mca -mcpu=haswell no_movmsk.s -timeline
Iterations: 100
Instructions: 400
Total Cycles: 504
Total uOps: 400
Dispatch Width: 4
uOps Per Cycle: 0.79
IPC: 0.79
Block RThroughput: 1.0
$ llvm-mca -mcpu=haswell movmsk.s -timeline
Iterations: 100
Instructions: 600
Total Cycles: 358
Total uOps: 600
Dispatch Width: 4
uOps Per Cycle: 1.68
IPC: 1.68
Block RThroughput: 1.5
$ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
Iterations: 100
Instructions: 400
Total Cycles: 407
Total uOps: 400
Dispatch Width: 2
uOps Per Cycle: 0.98
IPC: 0.98
Block RThroughput: 2.0
$ llvm-mca -mcpu=btver2 movmsk.s -timeline
Iterations: 100
Instructions: 600
Total Cycles: 311
Total uOps: 600
Dispatch Width: 2
uOps Per Cycle: 1.93
IPC: 1.93
Block RThroughput: 3.0
Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.
Differential Revision: https://reviews.llvm.org/D59997
llvm-svn: 357367
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.
llvm-svn: 357351
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.
llvm-svn: 357317
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.
We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.
There seems to be a missing fold for sext of a bool bit here:
negl %ecx
movslq %ecx, %rax
...but that's an independent transform.
Differential Revision: https://reviews.llvm.org/D59818
llvm-svn: 357172
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.
I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.
From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.
Differential Revision: https://reviews.llvm.org/D59777
llvm-svn: 357129
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
// Also support integer VTs to avoid a int->fp bitcast in the DAG.
llvm-svn: 357128
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.
This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249
llvm-svn: 357057
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.
This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
llvm-svn: 356864
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).
Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.
llvm-svn: 356858
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613
uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0
We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
Differential Revision: https://reviews.llvm.org/D59006
llvm-svn: 356855
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences.
This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
Differential Revision: https://reviews.llvm.org/D59662
llvm-svn: 356808
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.
I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.
So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?
I've left some FIXMEs for improvements we can do without SSE2.
Differential Revision: https://reviews.llvm.org/D59679
llvm-svn: 356807
CMPXCHG8B was introduced on i586/pentium generation.
If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.
Differential Revision: https://reviews.llvm.org/D59576
llvm-svn: 356631
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immedate, GPR to k-register, and masked op.
I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.
This trades a load for the move immediate and GPR move which is higher latency.
But its probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.
Differential Revision: https://reviews.llvm.org/D59479
llvm-svn: 356618
This patch removes the following dag node opcodes from namespace X86ISD:
RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG
The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:
RDTSC/RDTSCP don't implicitly read ECX.
RDTSCP also implicitly writes ECX.
I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.
No functional change intended.
Differential Revision: https://reviews.llvm.org/D59547
llvm-svn: 356546
These changes are related to PR37743 and include:
SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.
Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.
Add promoting the integer ABS node in the LegalizeIntegerType.
Expand-based legalization of integer result for the ABS nodes.
Expand-based legalization of ABS vector operations.
Add some integer abs testcases for different typesizes for Thumb arch
Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
tmp = (SRA, Hi, 31)
Lo = (UADDO tmp, Lo)
Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
Lo = (XOR tmp, Lo)
The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
(ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
Change integer abs testcases for codegen with the ABS node support for AArch64.
Indicate that the ABS is legal for the i64 type when the NEON is supported.
Change the integer abs testcases to show changing of codegen.
Add combine and legalization of ABS nodes for Thumb arch.
Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743
Patch by: @ikulagin (Ivan Kulagin)
Differential Revision: https://reviews.llvm.org/D49837
llvm-svn: 356468
This allows better code size for aarch64 floating point materialization
in a future patch.
Reviewers: evandro
Differential Revision: https://reviews.llvm.org/D58690
llvm-svn: 356389
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780
If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits, andreadb
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59412
llvm-svn: 356301
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.
Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.
Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.
llvm-svn: 356294
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.
llvm-svn: 356292
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.
llvm-svn: 356158
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.
We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.
I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.
Differential Revision: https://reviews.llvm.org/D59308
llvm-svn: 356078
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.
This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.
The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.
llvm-svn: 356064
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700
The bug was introduced recently here:
rL355741
This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8).
llvm-svn: 356053
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.
This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.
Differential Revision: https://reviews.llvm.org/D58888
llvm-svn: 356043
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.
This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.
llvm-svn: 355799
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.
With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.
llvm-svn: 355789
An extension of D58282 noted in PR39665:
https://bugs.llvm.org/show_bug.cgi?id=39665
This doesn't answer the request to use movmsk, but that's an
independent problem. We need this and probably still need
scalarization of FP selects because we can't do that as a
target-independent transform (although it seems likely that
targets besides x86 should have this transform).
llvm-svn: 355741
Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well.
Requested by @spatel on D59006
llvm-svn: 355533
As noticed on D58965
DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point.
Differential Revision: https://reviews.llvm.org/D58974
llvm-svn: 355494
X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo
instructions without accounting for debug intrinsics. This leads to different
codegen with and without option -g, if a DBG_VALUE instruction lands in the
middle of several lowered selects.
Work around this by skipping over debug instructions when looking for CMOV
sequences, and sinking those debug insts into the EmitLoweredSelect sunk block.
This might slightly shift where variables appear in the instruction sequence,
but won't re-order assignments.
Differential Revision: https://reviews.llvm.org/D58672
llvm-svn: 355307
Summary:
This extends the variety of pattern that can generate a SHLD instead of using two shifts.
This fixes a regression that would be introduced by D57367 or D33587
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57389
llvm-svn: 355260
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the tests diffs:
1. Memory loads may get scalarized resulting in more efficient code.
2. Memory stores may get scalarized resulting in more efficient code.
3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.
The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.
Differential Revision: https://reviews.llvm.org/D58282
llvm-svn: 355130
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop.
Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.
Differential Revision: https://reviews.llvm.org/D58705
llvm-svn: 355116
Summary:
The description of KnownBits::zext() and
KnownBits::zextOrTrunc() has confusingly been telling
that the operation is equivalent to zero extending the
value we're tracking. That has not been true, instead
the user has been forced to explicitly set the extended
bits as known zero afterwards.
This patch adds a second argument to KnownBits::zext()
and KnownBits::zextOrTrunc() to control if the extended
bits should be considered as known zero or as unknown.
Reviewers: craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58650
llvm-svn: 355099
A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node.
I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered.
Differential Revision: https://reviews.llvm.org/D58583
llvm-svn: 355015
Summary:
Use a custom calling convention handler for interrupts instead of fixing
up the locations in LowerMemArgument. This way, the offsets are correct
when constructed and we don't need to account for them in as many
places.
Depends on D56883
Replaces D56275
Reviewers: craig.topper, phil-opp
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D56944
llvm-svn: 354837
Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.
Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.
Differential Revision: https://reviews.llvm.org/D58597
llvm-svn: 354771
Its proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.
llvm-svn: 354757
Summary:
When promoting the over flow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.
We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put a the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.
Reviewers: spatel, RKSimon, nikic
Reviewed By: nikic
Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58567
llvm-svn: 354753
r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."
These were reverted in r354713 as their context depended on other patches that were reverted for a bug.
llvm-svn: 354734
Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD.
llvm-svn: 354729
r354363 caused https://crbug.com/934963#c1, which has a plain C reduced
test case.
I also had to revert some dependent changes:
- r354648
- r354647
- r354640
- r354511
llvm-svn: 354713
If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.
Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.
llvm-svn: 354709
As discussed in:
D56864
D58197
Always use the narrow (128-bit) instruction when possible.
We already had the signed int version of this transform.
llvm-svn: 354675
Only the 1st fold is attempted pre-legalization, but it requires
legal (simple) types too, so we don't need an EVT in any of the code.
llvm-svn: 354674
This is a follow-up to D56864.
If we're extracting from a non-zero index before casting to FP,
then shuffle the vector and optionally narrow the vector before doing the cast:
cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0
This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974
Differential Revision: https://reviews.llvm.org/D58197
llvm-svn: 354619
We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement.
llvm-svn: 354575
Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.
This adds an extra UnrollVectorOverflowOp() method to SDAG, because
the general UnrollOverflowOp() method can't deal with multiple results.
Additionally we need to expand UMULO/SMULO during vector op
legalization, as it may result in unrolling, which may need additional
type legalization.
Differential Revision: https://reviews.llvm.org/D57997
llvm-svn: 354513
We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit.
llvm-svn: 354489
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 354363
This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.
It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.
Differential Revision: https://reviews.llvm.org/D58053
llvm-svn: 354340
The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754https://bugs.llvm.org/show_bug.cgi?id=40487
..and those are shown in the IR test file and x86 codegen file.
Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.
This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo
formation by default. Only x86 overrides that setting for now although other targets will likely benefit
by forming usbuo too.
Differential Revision: https://reviews.llvm.org/D57789
llvm-svn: 354298
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.
This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.
Differential Revision: https://reviews.llvm.org/D58181
llvm-svn: 354279
No need for a separate stack slot. The lifetimes don't overlap.
Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot.
llvm-svn: 354234
No need to manually split everything. We can let the type legalizer work for us.
The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86SISD::CMP. This dropped the use count of the setcc, but wasn't what was intended.
llvm-svn: 354197
When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption.
This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block)
Differential Revision: https://reviews.llvm.org/D57788
llvm-svn: 354178
These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there.
Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner.
llvm-svn: 354161
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.
This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.
This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but its the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch.
Differential Revision: https://reviews.llvm.org/D58237
llvm-svn: 354117
When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion.
Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type.
llvm-svn: 354003
In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this.
llvm-svn: 353921
-Pull the final stack load creation from the two callers into the helper.
-Return a single SDValue instead of a std::pair.
-Remove the Replace flag which isn't really needed.
llvm-svn: 353920
A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars.
llvm-svn: 353860
We were using DstTy, but that represents the integer type we are converting to which is i64 in this
case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers.
If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte
access. The store we used to get from SSE to the stack is using the corect size.
While there, consistenly use TheVT in place of Op.getOperand(0).getValueType() throughout the function.
llvm-svn: 353745
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).
The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.
All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.
In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.
I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.
Differential Revision: https://reviews.llvm.org/D57841
llvm-svn: 353641
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.
There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.
llvm-svn: 353626
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 353610
Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner
First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111.
Differential Revision: https://reviews.llvm.org/D57754
llvm-svn: 353457
This is intentionally a small step because it's hard to know exactly
where we might introduce a conflicting transform with the code that
tries to form wider shuffles. But I think this is safe - if we have
a wide shuffle with 2 operands, then we should do better with an
extract + narrow shuffle.
Differential Revision: https://reviews.llvm.org/D57867
llvm-svn: 353427
combineExtractWithShuffle may leave a dangling bitcast which may
prevent further optimization in later passes. Avoid constructing it
unless it is used.
llvm-svn: 353333
The proposal in D56796 may cross the line because we're trying to avoid vectorization
transforms in generic DAG combining. So this is an alternate, later, x86-specific
translation of that patch.
There are several potential follow-ups to enhance this:
1. Allow extraction from non-zero element index.
2. Peek through extends of smaller width integers.
3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P
Differential Revision: https://reviews.llvm.org/D56864
llvm-svn: 353302
rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly.
llvm-svn: 353198
If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free).
Differential Revision: https://reviews.llvm.org/D57663
llvm-svn: 353154
Summary:
We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly.
This could have problems for proper dependency tracking in the machine schedulers though I don't have a test case that shows that.
Reviewers: rnk
Reviewed By: rnk
Subscribers: eraman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57641
llvm-svn: 353141
For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly.
Differential Revision: https://reviews.llvm.org/D57667
llvm-svn: 353051