llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	b323d5ec7c	[X86] LowerToHorizontalOp - Tidyup calls to getHopForBuildVector. NFCI. Merge the if() tests for the various HADD/SUB + Subtarget tests llvm-svn: 359901	2019-05-03 15:56:06 +00:00
Simon Pilgrim	bfdd0f75a8	[X86] Remove repeated variables. NFCI. llvm-svn: 359889	2019-05-03 14:37:00 +00:00
Simon Pilgrim	aa49be4926	Avoid cppcheck operator precedence warnings. NFCI. Prefer ((X & Y) ? A : B) to (X & Y ? A : B) llvm-svn: 359884	2019-05-03 13:50:38 +00:00
Simon Pilgrim	a359ef192b	[X86] LowerMULH - remove unused Lo/Hi vector indices. NFCI. Leftover from before we had the extract128BitVector helpers. llvm-svn: 359871	2019-05-03 10:32:07 +00:00
Simon Pilgrim	88f9117168	Reduce variable scope to just the if() block its actually used in. NFCI. llvm-svn: 359869	2019-05-03 10:13:41 +00:00
Craig Topper	e1e38d4248	[X86] Correct the register class for specific mask register constraints in getRegForInlineAsmConstraint when the VT is a scalar type The default impementation in the base class for TargetLowering::getRegForInlineAsmConstraint doesn't work for mask registers when the VT is a scalar type integer types since the only legal mask types are vXi1. So we end up just getting whatever the first register class that contains the register. Currently this appears to be VK1, but its really dependent on the order tablegen outputs the register classes. Some code in the caller ends up looking up the type for this register class and find v1i1 then generates a copyfromreg from the physical k-register with the v1i1 type. Then it generates an any_extend from v1i1 to the scalar VT which isn't legal. This bad any_extend sticks around until isel where it selects a MOVZX32rr8 with a v1i1 input or maybe a i8 input. Not sure but eventually we pick up a copy from VK1 to GR8 in MachineIR which isn't supported. This leads to a failure in physical register copying. This patch uses the scalar type to find a VK class of the right size. In the attached test case this will be VK16. This causes a bitcast from vk16 to i16 to be generated instead of an any_extend. This will be properly iseled to a VK16 to GR32 copy and a GR32->GR16 extract_subreg. Fixes PR41678 Differential Revision: https://reviews.llvm.org/D61453 llvm-svn: 359837	2019-05-02 22:26:40 +00:00
Simon Pilgrim	df8daf0ef4	[X86][SSE] lowerAddSubToHorizontalOp - enable ymm extraction+fold Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting one upper xmm whatever, and we can remove a shuffle by using the hop which is inline with what shouldUseHorizontalOp expects to happen anyway. Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result: https://godbolt.org/z/0R-U-K Differential Revision: https://reviews.llvm.org/D61426 llvm-svn: 359786	2019-05-02 14:00:55 +00:00
Simon Pilgrim	9fa56f7829	[X86][SSE] Move shouldUseHorizontalOp inside isHorizontalBinOp. NFCI. Matches what we do for lowerAddSubToHorizontalOp and will make it easier to peek through subvectors to help fix PR39921 llvm-svn: 359782	2019-05-02 12:18:24 +00:00
Simon Pilgrim	9f04d97cd7	[X86][SSE] Fold scalar horizontal add/sub for non-0/1 element extractions We already perform horizontal add/sub if we extract from elements 0 and 1, this patch extends it to non-0/1 element extraction indices (as long as they are from the lowest 128-bit vector). Differential Revision: https://reviews.llvm.org/D61263 llvm-svn: 359707	2019-05-01 17:13:35 +00:00
Simon Pilgrim	f5bdff7747	Fix 80 column violation. NFCI. llvm-svn: 359694	2019-05-01 16:01:49 +00:00
Simon Pilgrim	6711b9699a	[X86][SSE] Add demanded elts support X86ISD::PMULDQ\PMULUDQ Add to SimplifyDemandedVectorEltsForTargetNode and SimplifyDemandedBitsForTargetNode llvm-svn: 359686	2019-05-01 14:50:50 +00:00
Simon Pilgrim	3d6899e369	[X86][SSE] Add SSE vector shift support to SimplifyDemandedVectorEltsForTargetNode vector splitting llvm-svn: 359680	2019-05-01 13:51:09 +00:00
Simon Pilgrim	ba372c6e62	[X86][SSE] Split 512-bit -> 128-bit vector directly in SimplifyDemandedVectorEltsForTargetNode llvm-svn: 359678	2019-05-01 12:48:42 +00:00
Simon Pilgrim	951a6b4579	[X86][SSE] Add 512-bit vector support to SimplifyDemandedVectorEltsForTargetNode vector splitting llvm-svn: 359677	2019-05-01 12:37:41 +00:00
Simon Pilgrim	37c2419cc7	[X86][SSE] Add X86ISD::PACKSS\PACKUS to SimplifyDemandedVectorEltsForTargetNode vector splitting llvm-svn: 359673	2019-05-01 11:29:36 +00:00
Simon Pilgrim	3353cee06c	[X86][SSE] Add X86ISD::UNPCKL\UNPCK to SimplifyDemandedVectorEltsForTargetNode vector splitting llvm-svn: 359670	2019-05-01 11:08:03 +00:00
Simon Pilgrim	f7b978a71b	[X86][SSE] Move extract_subvector(pshufb) fold to SimplifyDemandedVectorEltsForTargetNode This lets us hit more cases than combineExtractSubvector and allows us reuse more code. llvm-svn: 359669	2019-05-01 10:58:38 +00:00
Simon Pilgrim	a7d107a3e0	[X86] SimplifyDemandedVectorEltsForTargetNode - pull out vector halving code. NFCI. Pull out the HADD/HSUB code to halve vector widths if the upper half isn't used - prep work to adding support for other opcodes. llvm-svn: 359667	2019-05-01 10:38:10 +00:00
Simon Pilgrim	99eefe94b5	[X86][SSE] Extract i1 elements from vXi1 bool vectors This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK. Differential Revision: https://reviews.llvm.org/D61189 llvm-svn: 359666	2019-05-01 10:02:22 +00:00
Simon Pilgrim	07ab4e7db8	[X86][SSE] Fold extract_subvector(extend(x)) -> extend_vector_inreg(x) This adds any extend support - folding to zero_extend_vector_inreg (PMOVZX) for legality Minor improvement for PR39709 llvm-svn: 359608	2019-04-30 20:31:07 +00:00
Simon Pilgrim	22641cc194	Fix for bug 41512: lower INSERT_VECTOR_ELT(ZeroVec, 0, Elt) to SCALAR_TO_VECTOR(Elt) for all SSE flavors Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) insead of much simpler movd. INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest element initialization in _mm_set_epi32. So such inefficient lowering leads to significant performance digradations in ceratin cases switching from SSSE3 to SSE4. https://bugs.llvm.org/show_bug.cgi?id=41512 Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable since latter is closer match to desired behavior and always efficiently lowered to movd and alike. Committed on behalf of @Serge_Preis (Serge Preis) Differential Revision: https://reviews.llvm.org/D60852 llvm-svn: 359545	2019-04-30 10:18:25 +00:00
Sjoerd Meijer	180f1ae57c	[TargetLowering] Change getOptimalMemOpType to take a function attribute list The MachineFunction wasn't used in getOptimalMemOpType, but more importantly, this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType. This is the groundwork for the changes in D59766 and D59787, that allows implementation of TTI::getMemcpyCost. Differential Revision: https://reviews.llvm.org/D59785 llvm-svn: 359537	2019-04-30 08:38:12 +00:00
Simon Pilgrim	028485d7b9	[X86][SSE] isHorizontalBinOp - add support for target shuffles Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support. This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type Fixes PR39936 and should help with PR39920/PR39921 Differential Revision: https://reviews.llvm.org/D61245 llvm-svn: 359491	2019-04-29 19:52:59 +00:00
Simon Pilgrim	d5cc753b6d	[X86][SSE] combineExtractVectorElt - add early-out to return zero/undef for out-of-range extraction indices. llvm-svn: 359406	2019-04-28 19:12:58 +00:00
Simon Pilgrim	22d1476bfa	[X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3 Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results. llvm-svn: 359400	2019-04-28 14:31:01 +00:00
Simon Pilgrim	93ad48210c	[X86][SSE] Optimize llvm.experimental.vector.reduce.xor.vXi1 parity reduction (PR38840) An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0. Differential Revision: https://reviews.llvm.org/D61230 llvm-svn: 359396	2019-04-28 10:46:17 +00:00
Simon Pilgrim	03c4e2663c	Revert rL359389: [X86][SSE] Add support for <64 x i1> bool reduction Minor generalization of the existing <32 x i1> pre-AVX2 split code. ........ Causing irregular buildbot failures. llvm-svn: 359391	2019-04-27 20:44:08 +00:00
Simon Pilgrim	4118be3af6	[X86][SSE] Add support for <64 x i1> bool reduction Minor generalization of the existing <32 x i1> pre-AVX2 split code. llvm-svn: 359389	2019-04-27 20:04:44 +00:00
Simon Pilgrim	2a2d422400	[X86][AVX512] Improve vector bool reductions As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern. llvm-svn: 359386	2019-04-27 17:32:46 +00:00
Simon Pilgrim	acc1e6d1c6	[X86][AVX] Merge mask select with shuffles across extract_subvector (PR40332) Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector. We can extend this in the future to handle more opcodes and non-zero selections. llvm-svn: 359378	2019-04-27 13:35:32 +00:00
Craig Topper	063b471ff7	[X86] Use MOVQ for i64 atomic_stores when SSE2 is enabled Summary: If we have SSE2 we can use a MOVQ to store 64-bits and avoid falling back to a cmpxchg8b loop. If its a seq_cst store we need to insert an mfence after the store. Reviewers: spatel, RKSimon, reames, jfb, efriedma Reviewed By: RKSimon Subscribers: hiraditya, dexonsmith, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60546 llvm-svn: 359368	2019-04-27 03:38:15 +00:00
Simon Pilgrim	27e01e675c	[X86][AVX] Fold extract_subvector(broadcast(x)) -> broadcast(x) iff x has one use llvm-svn: 359332	2019-04-26 18:02:14 +00:00
Simon Pilgrim	c3a34c3e07	Fix Wparentheses warning. NFCI. llvm-svn: 359299	2019-04-26 12:23:42 +00:00
Simon Pilgrim	bb230c5e79	[X86][SSE] Pull out OR(EXTRACTELT(X,0),OR(EXTRACTELT(X,1),...)) matching code from LowerVectorAllZeroTest Create a matchBitOpReduction helper that checks for the pattern with any opcode. First step towards reusing this code to recognize other scalar reduction patterns. llvm-svn: 359296	2019-04-26 11:45:54 +00:00
Simon Pilgrim	5d6ef94c36	[X86][SSE] Disable shouldFoldConstantShiftPairToMask for btver1/btver2 targets (PR40758) As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs). Differential Revision: https://reviews.llvm.org/D61068 llvm-svn: 359293	2019-04-26 10:49:13 +00:00
Simon Pilgrim	5e161df9f8	[X86][AVX] Combine shuffles extracted from a common vector A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380. Differential Revision: https://reviews.llvm.org/D60512 llvm-svn: 359292	2019-04-26 09:56:14 +00:00
Simon Pilgrim	0a7d1b3ce1	[X86][SSE] combineBitcastvxi1 - add support for bitcasting to non-scalar integers Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type. We still have the issue identified in PR41594 but D61114 should handle this. llvm-svn: 359176	2019-04-25 09:34:36 +00:00
Sanjay Patel	b1b3368907	[x86] make sure horizontal op and broadcast types match to simplify (PR41414) If the types don't match, we can't just remove the shuffle. There may be some other opportunity for optimization here, but this should prevent the crashing seen in: https://bugs.llvm.org/show_bug.cgi?id=41414 llvm-svn: 359095	2019-04-24 14:05:08 +00:00
Simon Pilgrim	d30745b2a0	[X86] Add shouldFoldConstantShiftPairToMask override placeholder. NFCI. Prep work toward fixing PR40758 llvm-svn: 359088	2019-04-24 12:34:08 +00:00
Sanjay Patel	12a561fa1b	[x86] use psubus for more vsetcc lowering (PR39859) Circling back to a leftover bit from PR39859: https://bugs.llvm.org/show_bug.cgi?id=39859#c1 ...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'. This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca. We already do this transform for the SETULT predicate, so this makes the code more symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect anything but pre-SSE4.1 subtargets. $ cat before.s movdqa -16(%rip), %xmm2 ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768] pxor %xmm0, %xmm2 pcmpgtw -32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255] pand %xmm2, %xmm0 pandn %xmm1, %xmm2 por %xmm2, %xmm0 $ cat after.s movdqa -16(%rip), %xmm2 ## xmm2 = [256,256,256,256,256,256,256,256] psubusw %xmm0, %xmm2 pxor %xmm3, %xmm3 pcmpeqw %xmm2, %xmm3 pand %xmm3, %xmm0 pandn %xmm1, %xmm3 por %xmm3, %xmm0 $ llvm-mca before.s -mcpu=haswell Iterations: 100 Instructions: 600 Total Cycles: 909 Total uOps: 700 Dispatch Width: 4 uOps Per Cycle: 0.77 IPC: 0.66 Block RThroughput: 1.8 $ llvm-mca after.s -mcpu=haswell Iterations: 100 Instructions: 700 Total Cycles: 409 Total uOps: 700 Dispatch Width: 4 uOps Per Cycle: 1.71 IPC: 1.71 Block RThroughput: 1.8 Differential Revision: https://reviews.llvm.org/D60838 llvm-svn: 358999	2019-04-23 15:20:17 +00:00
Simon Pilgrim	0e4992ce27	[X86] Pull out collectConcatOps helper. NFCI. Create collectConcatOps helper that returns all the subvector ops for CONCAT_VECTORS or a INSERT_SUBVECTOR series. llvm-svn: 358989	2019-04-23 14:07:49 +00:00
Sanjay Patel	bf8aacb715	[SelectionDAG] move splat util functions up from x86 lowering This was supposed to be NFC, but the change in SDLoc definitions causes instruction scheduling changes. There's nothing x86-specific in this code, and it can likely be used from DAGCombiner's simplifyVBinOp(). llvm-svn: 358930	2019-04-22 22:43:36 +00:00
Craig Topper	5c43ab337f	[X86] Reject 512-bit types in getRegForInlineAsmConstraint when AVX512 is not enabled. Same for 256 bit and AVX. llvm-svn: 358872	2019-04-22 06:12:02 +00:00
Craig Topper	3980d1ca6b	[X86] Disable argument copy elision for arguments passed via pointers Summary: If you pass two 1024 bit vectors in IR with AVX2 on Windows 64. Both vectors will be split in four 256 bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory. The PartOffsets stored for the second argument are all in terms of its original 1024 bit size. So the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8 byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too. This issue was encountered by ISPC. I'm working on getting a reduce test case, but wanted to go ahead and get feedback on the fix. Reviewers: rnk Reviewed By: rnk Subscribers: dbabokin, llvm-commits, hiraditya Tags: #llvm Differential Revision: https://reviews.llvm.org/D60801 llvm-svn: 358817	2019-04-20 15:26:44 +00:00
Simon Pilgrim	4171a91e92	[X86] combineVectorTruncationWithPACKUS - remove split/concatenation of mask combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together. This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356). This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit. My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great. Differential Revision: https://reviews.llvm.org/D60375#inline-539623 llvm-svn: 358692	2019-04-18 17:23:09 +00:00
Simon Pilgrim	8f87e53462	[X86][SSE] Lower ICMP EQ(AND(X,C),C) -> SRA(SHL(X,LOG2(C)),BW-1) iff C is power-of-2. This replaces the MOVMSK combine introduced at D52121/rL342326 (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)) with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra. Differential Revision: https://reviews.llvm.org/D60625 llvm-svn: 358651	2019-04-18 09:58:59 +00:00
Simon Pilgrim	e5573f4f4e	[TargetLowering] Rename preferShiftsToClearExtremeBits and shouldFoldShiftPairToMask (PR41359) As discussed on PR41359, this patch renames the pair of shift-mask target feature functions to make their purposes more obvious. shouldFoldShiftPairToMask -> shouldFoldConstantShiftPairToMask preferShiftsToClearExtremeBits -> shouldFoldMaskToVariableShiftPair llvm-svn: 358526	2019-04-16 20:57:28 +00:00
Simon Pilgrim	d769bb1e58	[X86][AVX] X86ISD::PERMV/PERMV3 node types can never fold index ops Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register. This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle' that recognises that the op is used for a X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data. Differential Revision: https://reviews.llvm.org/D60562 llvm-svn: 358516	2019-04-16 19:18:53 +00:00
Craig Topper	0495f29e42	[X86] Limit the 'x' inline assembly constraint to zmm0-15 when used for a 512 type. The 'v' constraint is used to select zmm0-31. This makes 512 bit consistent with 128/256-bit.a llvm-svn: 358450	2019-04-15 21:06:32 +00:00
Craig Topper	3d9b47c770	[X86] Block i32/i64 for 'k' and 'Yk' in getRegForInlineAsmConstraint without avx512bw. 32 and 64 bit k-registers require avx512bw. If we don't block this properly, it leads to a crash. llvm-svn: 358436	2019-04-15 18:39:45 +00:00

1 2 3 4 5 ...

6193 Commits