Same as what we do for vector reductions in combineHorizontalPredicateResult, use movmsk+cmp for scalar (and(extract(x,0),extract(x,1)) reduction patterns.
llvm-svn: 361052
Summary:
This refactors four pieces of code that create SDNodes for references to
symbols:
- normal global address lowering (LEA, MOV, etc)
- callee global address lowering (CALL)
- external symbol address lowering (LEA, MOV, etc)
- external symbol address lowering (CALL)
Each of these pieces of code need to:
- classify the reference
- lower the symbol
- emit a RIP wrapper if needed
- emit a load if needed
- add offsets if needed
I think handling them all in one place will make the code easier to
maintain in the future.
Reviewers: craig.topper, RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61690
llvm-svn: 360952
This patch add the ISD::LROUND and ISD::LLROUND along with new
intrinsics. The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.
The idea is to optimize lround/llround generation for AArch64
in a subsequent patch. Current semantic is just route it to libm
symbol.
llvm-svn: 360889
They encode the same way, but OR32mi8Locked sets hasUnmodeledSideEffects set
which should be stronger than the mayLoad/mayStore on LOCK_OR32mi8. I think
this makes sense since we are using it as a fence.
This also seems to hide the operation from the speculative load hardening pass
so I've reverted r360511.
llvm-svn: 360747
This was the portion split off D58632 so that it could follow the redzone API cleanup. Note that I changed the offset preferred from -8 to -64. The difference should be very minor, but I thought it might help address one concern which had been previously raised.
Differential Revision: https://reviews.llvm.org/D61862
llvm-svn: 360719
D61068 handled vector shifts, this patch does the same for scalars where there are similar number of pipes for shifts as bit ops - this is true almost entirely for AMD targets where the scalar ALUs are well balanced.
This combine avoids AND immediate mask which usually means we reduce encoding size.
Some tests show use of (slow, scaled) LEA instead of SHL in some cases, but thats due to particular shift immediates - shift+mask generate these just as easily.
Differential Revision: https://reviews.llvm.org/D61830
llvm-svn: 360684
This is a follow on to D58632, with the same logic. Given a memory operation which needs ordering, but doesn't need to modify any particular address, prefer to use a locked stack op over an mfence.
Differential Revision: https://reviews.llvm.org/D61863
llvm-svn: 360649
Returning SDValue() makes the caller think that nothing happened and it will
end up executing the Expand path. This generates extra nodes that will need to
be pruned as dead code.
Returning an ISD::MERGE_VALUES will tell the caller that we'd like to make a
change and it will take care of replacing uses. This will prevent falling into
the Expand path.
llvm-svn: 360627
These are updates to match how isel table would emit a LOCK_OR32mi8 node.
-Use i32 for the immediate zero even though only 8 bits are encoded.
-Use i16 for segment register.
-Use LOCK_OR32mi8 for idempotent atomic operations in 32-bit mode to match
64-bit mode. I'm not sure why OR32mi8Locked and LOCK_OR32mi8 both exist. The
only difference seems to be that OR32mi8Locked is marked as UnmodeledSideEffects=1.
-Emit an extra i32 result for the flags output.
I don't know if the types here really matter just noticed it was inconsistent
with normal behavior.
llvm-svn: 360619
Summary:
X86TargetLowering::LowerAsmOperandForConstraint had better support than
TargetLowering::LowerAsmOperandForConstraint for arbitrary depth
getelementpointers for "i", "n", and "s" extended inline assembly
constraints. Hoist its support from the derived class into the base
class.
Link: https://github.com/ClangBuiltLinux/linux/issues/469
Reviewers: echristo, t.p.northover
Reviewed By: t.p.northover
Subscribers: t.p.northover, E5ten, kees, jyknight, nemanjai, javed.absar, eraman, hiraditya, jsji, llvm-commits, void, craig.topper, nathanchance, srhines
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61560
llvm-svn: 360604
Fixes the regression noted in D61782 where a VZEXT_MOVL was being inserted because we weren't discriminating between 'zeroable' and 'all undef' for the upper elts.
Differential Revision: https://reviews.llvm.org/D61782
llvm-svn: 360596
Now that we can use HADD/SUB for scalar additions from any pair of extracted elements (D61263), we can relax the one use limit as we will be able to merge multiple uses into using the same HADD/SUB op.
This exposes a couple of missed opportunities in LowerBuildVectorv4x32 which will be committed separately.
Differential Revision: https://reviews.llvm.org/D61782
llvm-svn: 360594
See if we can simplify the demanded vector elts from the extraction before trying to simplify the demanded bits.
This helps us with target shuffles and hops in particular.
llvm-svn: 360535
If we only use the lower xmm of a ymm hop, then extract the xmm's (for free), perform the xmm hop and then insert back into a ymm (for free).
Fixes some of the regressions noted in D61782
llvm-svn: 360435
The current lowering uses an mfence. mfences are substaintially higher latency than the locked operations originally requested, but we do want to avoid contention on the original cache line. As such, use a locked instruction on a cache line assumed to be thread local.
Differential Revision: https://reviews.llvm.org/D58632
llvm-svn: 360393
As reported on PR39920, "slow horizontal ops" targets tend to internally expand to 2*shuffle+add/sub - so if we can reduce 2*shuffle+add/sub to a hadd/sub then we should do it - similar port usage but reduced instruction count.
This works out in most cases, although the "PR22377" regression in vector-shuffle-combining.ll is annoying - going from 2*shuffle+add+shuffle to hadd+2*shuffle - I've opened PR41813 to cover this.
Differential Revision: https://reviews.llvm.org/D61308
llvm-svn: 360360
Summary:
A COFF stub indirects the reference to a symbol through memory. A
.refptr.$sym global variable pointer is created to refer to $sym.
Typically mingw uses these for external global variable declarations,
but we can use them for weak function declarations as well.
Updates the dso_local classification to add a special case for
extern_weak symbols on COFF in both clang and LLVM.
Fixes PR37598
Reviewers: smeenai, mstorsjo
Subscribers: hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D61615
llvm-svn: 360207
Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining.
llvm-svn: 360134
The FR32/FR64/VR128/VR256 register classes don't contain the upper 16 registers. For most cases we use the default implementation which will find any register class that contains the register in question if the VT is legal for the register class. But if the VT is i32 or i64, we won't find a matching register class and will instead up in the code modified in this patch.
If the requested register is x/y/zmm16-31 we weren't returning a register class that contains those registers and will hit an assertion in the caller.
To fix this, I've changed to use the extended register class instead. I don't believe we need a subtarget check to see if avx512 is enabled. The default implementation just pick whatever register class it finds first. I checked and we currently pick FR32X for XMM0 with an f32 type using the default implementation regardless of whether avx512 is enabled. So I assume its it is ok to do the same for i32.
Differential Revision: https://reviews.llvm.org/D61457
llvm-svn: 360102
Summary:
1. Enable infrastructure of AVX512_BF16, which is supported for BFLOAT16 in Cooper Lake;
2. Enable VCVTNE2PS2BF16, VCVTNEPS2BF16 and DPBF16PS instructions, which are Vector Neural Network Instructions supporting BFLOAT16 inputs and conversion instructions from IEEE single precision.
VCVTNE2PS2BF16: Convert Two Packed Single Data to One Packed BF16 Data.
VCVTNEPS2BF16: Convert Packed Single Data to Packed BF16 Data.
VDPBF16PS: Dot Product of BF16 Pairs Accumulated into Packed Single Precision.
For more details about BF16 isa, please refer to the latest ISE document: https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference
Author: LiuTianle
Reviewers: craig.topper, smaslov, LuoYuanke, wxiao3, annita.zhang, RKSimon, spatel
Reviewed By: craig.topper
Subscribers: kristina, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60550
llvm-svn: 360017
The default impementation in the base class for TargetLowering::getRegForInlineAsmConstraint doesn't work for mask registers when the VT is a scalar type integer types since the only legal mask types are vXi1. So we end up just getting whatever the first register class that contains the register. Currently this appears to be VK1, but its really dependent on the order tablegen outputs the register classes.
Some code in the caller ends up looking up the type for this register class and find v1i1 then generates a copyfromreg from the physical k-register with the v1i1 type. Then it generates an any_extend from v1i1 to the scalar VT which isn't legal. This bad any_extend sticks around until isel where it selects a MOVZX32rr8 with a v1i1 input or maybe a i8 input. Not sure but eventually we pick up a copy from VK1 to GR8 in MachineIR which isn't supported. This leads to a failure in physical register copying.
This patch uses the scalar type to find a VK class of the right size. In the attached test case this will be VK16. This causes a bitcast from vk16 to i16 to be generated instead of an any_extend. This will be properly iseled to a VK16 to GR32 copy and a GR32->GR16 extract_subreg.
Fixes PR41678
Differential Revision: https://reviews.llvm.org/D61453
llvm-svn: 359837
Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting one upper xmm whatever, and we can remove a shuffle by using the hop which is inline with what shouldUseHorizontalOp expects to happen anyway.
Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result:
https://godbolt.org/z/0R-U-K
Differential Revision: https://reviews.llvm.org/D61426
llvm-svn: 359786
We already perform horizontal add/sub if we extract from elements 0 and 1, this patch extends it to non-0/1 element extraction indices (as long as they are from the lowest 128-bit vector).
Differential Revision: https://reviews.llvm.org/D61263
llvm-svn: 359707
This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.
Differential Revision: https://reviews.llvm.org/D61189
llvm-svn: 359666
Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) insead of much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest element initialization in _mm_set_epi32.
So such inefficient lowering leads to significant performance digradations in ceratin cases switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512
Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable since latter is closer match to desired behavior and always efficiently lowered to movd and alike.
Committed on behalf of @Serge_Preis (Serge Preis)
Differential Revision: https://reviews.llvm.org/D60852
llvm-svn: 359545
The MachineFunction wasn't used in getOptimalMemOpType, but more importantly,
this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType.
This is the groundwork for the changes in D59766 and D59787, that allows
implementation of TTI::getMemcpyCost.
Differential Revision: https://reviews.llvm.org/D59785
llvm-svn: 359537
Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support.
This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type
Fixes PR39936 and should help with PR39920/PR39921
Differential Revision: https://reviews.llvm.org/D61245
llvm-svn: 359491
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.
llvm-svn: 359400
An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0.
Differential Revision: https://reviews.llvm.org/D61230
llvm-svn: 359396
As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern.
llvm-svn: 359386
Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector.
We can extend this in the future to handle more opcodes and non-zero selections.
llvm-svn: 359378
Summary: If we have SSE2 we can use a MOVQ to store 64-bits and avoid falling back to a cmpxchg8b loop. If its a seq_cst store we need to insert an mfence after the store.
Reviewers: spatel, RKSimon, reames, jfb, efriedma
Reviewed By: RKSimon
Subscribers: hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60546
llvm-svn: 359368
Create a matchBitOpReduction helper that checks for the pattern with any opcode.
First step towards reusing this code to recognize other scalar reduction patterns.
llvm-svn: 359296
As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs).
Differential Revision: https://reviews.llvm.org/D61068
llvm-svn: 359293
A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380.
Differential Revision: https://reviews.llvm.org/D60512
llvm-svn: 359292
Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type.
We still have the issue identified in PR41594 but D61114 should handle this.
llvm-svn: 359176
If the types don't match, we can't just remove the shuffle.
There may be some other opportunity for optimization here,
but this should prevent the crashing seen in:
https://bugs.llvm.org/show_bug.cgi?id=41414
llvm-svn: 359095
Circling back to a leftover bit from PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859#c1
...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'.
This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca.
We already do this transform for the SETULT predicate, so this makes the code more
symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect
anything but pre-SSE4.1 subtargets.
$ cat before.s
movdqa -16(%rip), %xmm2 ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768]
pxor %xmm0, %xmm2
pcmpgtw -32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255]
pand %xmm2, %xmm0
pandn %xmm1, %xmm2
por %xmm2, %xmm0
$ cat after.s
movdqa -16(%rip), %xmm2 ## xmm2 = [256,256,256,256,256,256,256,256]
psubusw %xmm0, %xmm2
pxor %xmm3, %xmm3
pcmpeqw %xmm2, %xmm3
pand %xmm3, %xmm0
pandn %xmm1, %xmm3
por %xmm3, %xmm0
$ llvm-mca before.s -mcpu=haswell
Iterations: 100
Instructions: 600
Total Cycles: 909
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 0.77
IPC: 0.66
Block RThroughput: 1.8
$ llvm-mca after.s -mcpu=haswell
Iterations: 100
Instructions: 700
Total Cycles: 409
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.71
IPC: 1.71
Block RThroughput: 1.8
Differential Revision: https://reviews.llvm.org/D60838
llvm-svn: 358999
This was supposed to be NFC, but the change in SDLoc
definitions causes instruction scheduling changes.
There's nothing x86-specific in this code, and it can
likely be used from DAGCombiner's simplifyVBinOp().
llvm-svn: 358930
Summary:
If you pass two 1024 bit vectors in IR with AVX2 on Windows 64. Both vectors will be split in four 256 bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory.
The PartOffsets stored for the second argument are all in terms of its original 1024 bit size. So the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8 byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too.
This issue was encountered by ISPC. I'm working on getting a reduce test case, but wanted to go ahead and get feedback on the fix.
Reviewers: rnk
Reviewed By: rnk
Subscribers: dbabokin, llvm-commits, hiraditya
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60801
llvm-svn: 358817
combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together.
This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).
This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.
My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.
Differential Revision: https://reviews.llvm.org/D60375#inline-539623
llvm-svn: 358692
This replaces the MOVMSK combine introduced at D52121/rL342326
(movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))
with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra.
Differential Revision: https://reviews.llvm.org/D60625
llvm-svn: 358651
As discussed on PR41359, this patch renames the pair of shift-mask target feature functions to make their purposes more obvious.
shouldFoldShiftPairToMask -> shouldFoldConstantShiftPairToMask
preferShiftsToClearExtremeBits -> shouldFoldMaskToVariableShiftPair
llvm-svn: 358526
Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.
This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle' that recognises that the op is used for a X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.
Differential Revision: https://reviews.llvm.org/D60562
llvm-svn: 358516
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.
This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.
This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.
Differential Revision: https://reviews.llvm.org/D60610
llvm-svn: 358286
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.
The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.
llvm-svn: 358218
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.
It also removes the pseudo instructions for atomic RMW fadd. Instead just adding isel patterns for folding an atomic load into addss/sd. And relying on the new movss/sd store pattern to handle the write part.
This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.
Differential Revision: https://reviews.llvm.org/D60394
llvm-svn: 358215
With correct test checks this time.
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integ
This matches what gcc and icc do for this case and removes an existing FIXME.
llvm-svn: 358214
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.
This matches what gcc and icc do for this case and removes an existing FIXME.
Differential Revision: https://reviews.llvm.org/D60156
llvm-svn: 358211
Certain optimisations from ConstantHoisting and CGP rely on Selection DAG not
seeing through to the constant in other blocks. Revert this patch while we come
up with a better way to handle that.
I will try to follow this up with some better tests.
llvm-svn: 358113
Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and end up being pruned later. But it was useless unnecessary work to create it.
Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements.
For one of the cases I had to use UNDEF has a dummy value for a result we know is unused. This should get pruned later.
llvm-svn: 357935
I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops.
We've hinted at making all 8-bit ops undesirable for the reason in the code comment:
// TODO: Almost no 8-bit ops are desirable because they have no actual
// size/speed advantages vs. 32-bit ops, but they do have a major
// potential disadvantage by causing partial register stalls.
...but that leads to massive diffs and exposes all kinds of optimization holes itself.
Differential Revision: https://reviews.llvm.org/D60286
llvm-svn: 357912
Previously LowerOperationWrapper took the number of results from the original
node and counted that many results from the new node. This was intended to drop
chain operands from FP_TO_SINT lowering that uses X87 with memory operations to
stack temporaries. The final load had an extra chain output that needs to be
ignored.
Unfortunately, it didn't work with scatter which has 2 result operands, the
mask output which is discarded and a chain output. The chain output is the one
that is needed but it comes second and it would be dropped by the previous
logic here. To workaround this we were doing a ReplaceAllUses in the lowering
code so that the generic legalization code wouldn't see any uses to replace
since it had been given the wrong result/type.
After this change we take the LowerOperation result directly if the original
node has one result. This allows us to directly return the chain from scatter
or the load data from the FP_TO_SINT case. When the original node has multiple
results we'll ensure the returned node has the same number and copy them over.
For cases where the original node has multiple results and the new code for some
reason has even more results, MERGE_VALUES can be used to pass only the needed
results.
llvm-svn: 357887
In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands.
This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved.
Differential Revision: https://reviews.llvm.org/D60333
llvm-svn: 357859
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.
Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.
Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon
Reviewed By: RKSimon
Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60228
llvm-svn: 357802
Summary:
Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual reg used has one def only.
This can be particularly useful when calling isBaseWithConstantOffset()
with the ISD::CopyFromReg argument, as more optimizations may get enabled
in the result.
Also add a missing truncation on X86, found by testing of this patch.
Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa
Reviewers: bogner, craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59535
llvm-svn: 357745
We already promote SRL and SHL to i32.
This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions.
I think there might be some DAG combine improvement opportunities in the test changes here.
Differential Revision: https://reviews.llvm.org/D60278
llvm-svn: 357743
It unnecessarily breaks previously-working code which used varargs,
but didn't pass any float/double arguments (such as EDK2).
Also revert the fixup on top of that:
Revert [X86] Fix a test from r357317
This reverts r357317 (git commit d413f41de6)
This reverts r357380 (git commit 7af32444b9)
llvm-svn: 357718
These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers.
This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes
that take the extra zeroes as inputs. The zeros will then go through isel on their own to select
the MOV32r0 pseudo. Then we just need to mention the physical registers directly
in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary
copies to/from physical registers.
llvm-svn: 357659
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think was largely so we could be pointer
size agnostic in the isel pattern.
To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.
With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.
I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defing rax or vice versa depending on whether we define the instruction to use EAX/RAX.
llvm-svn: 357652
This pattern would show up as a regression if we more
aggressively convert vector FP ops to scalar ops.
There's still a missed optimization for the v4f64 legal
case (AVX) because we create that h-op with an undef operand.
We should probably just duplicate the operands for that
pattern to avoid trouble.
llvm-svn: 357642
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.
The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.
If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.
$ llvm-mca -mcpu=haswell no_movmsk.s -timeline
Iterations: 100
Instructions: 400
Total Cycles: 504
Total uOps: 400
Dispatch Width: 4
uOps Per Cycle: 0.79
IPC: 0.79
Block RThroughput: 1.0
$ llvm-mca -mcpu=haswell movmsk.s -timeline
Iterations: 100
Instructions: 600
Total Cycles: 358
Total uOps: 600
Dispatch Width: 4
uOps Per Cycle: 1.68
IPC: 1.68
Block RThroughput: 1.5
$ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
Iterations: 100
Instructions: 400
Total Cycles: 407
Total uOps: 400
Dispatch Width: 2
uOps Per Cycle: 0.98
IPC: 0.98
Block RThroughput: 2.0
$ llvm-mca -mcpu=btver2 movmsk.s -timeline
Iterations: 100
Instructions: 600
Total Cycles: 311
Total uOps: 600
Dispatch Width: 2
uOps Per Cycle: 1.93
IPC: 1.93
Block RThroughput: 3.0
Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.
Differential Revision: https://reviews.llvm.org/D59997
llvm-svn: 357367
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.
llvm-svn: 357351
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.
llvm-svn: 357317
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.
We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.
There seems to be a missing fold for sext of a bool bit here:
negl %ecx
movslq %ecx, %rax
...but that's an independent transform.
Differential Revision: https://reviews.llvm.org/D59818
llvm-svn: 357172
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.
I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.
From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.
Differential Revision: https://reviews.llvm.org/D59777
llvm-svn: 357129
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
// Also support integer VTs to avoid a int->fp bitcast in the DAG.
llvm-svn: 357128
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.
This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249
llvm-svn: 357057
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.
This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
llvm-svn: 356864
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).
Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.
llvm-svn: 356858
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613
uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0
We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
Differential Revision: https://reviews.llvm.org/D59006
llvm-svn: 356855
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences.
This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
Differential Revision: https://reviews.llvm.org/D59662
llvm-svn: 356808
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.
I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.
So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?
I've left some FIXMEs for improvements we can do without SSE2.
Differential Revision: https://reviews.llvm.org/D59679
llvm-svn: 356807
CMPXCHG8B was introduced on i586/pentium generation.
If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.
Differential Revision: https://reviews.llvm.org/D59576
llvm-svn: 356631
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immedate, GPR to k-register, and masked op.
I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.
This trades a load for the move immediate and GPR move which is higher latency.
But its probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.
Differential Revision: https://reviews.llvm.org/D59479
llvm-svn: 356618
This patch removes the following dag node opcodes from namespace X86ISD:
RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG
The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:
RDTSC/RDTSCP don't implicitly read ECX.
RDTSCP also implicitly writes ECX.
I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.
No functional change intended.
Differential Revision: https://reviews.llvm.org/D59547
llvm-svn: 356546
These changes are related to PR37743 and include:
SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.
Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.
Add promoting the integer ABS node in the LegalizeIntegerType.
Expand-based legalization of integer result for the ABS nodes.
Expand-based legalization of ABS vector operations.
Add some integer abs testcases for different typesizes for Thumb arch
Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
tmp = (SRA, Hi, 31)
Lo = (UADDO tmp, Lo)
Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
Lo = (XOR tmp, Lo)
The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
(ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
Change integer abs testcases for codegen with the ABS node support for AArch64.
Indicate that the ABS is legal for the i64 type when the NEON is supported.
Change the integer abs testcases to show changing of codegen.
Add combine and legalization of ABS nodes for Thumb arch.
Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743
Patch by: @ikulagin (Ivan Kulagin)
Differential Revision: https://reviews.llvm.org/D49837
llvm-svn: 356468
This allows better code size for aarch64 floating point materialization
in a future patch.
Reviewers: evandro
Differential Revision: https://reviews.llvm.org/D58690
llvm-svn: 356389
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780
If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits, andreadb
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59412
llvm-svn: 356301
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.
Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.
Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.
llvm-svn: 356294