Commit Graph

7379 Commits

Author SHA1 Message Date
Simon Pilgrim abd33bf5ef [X86][AVX] lowerShuffleWithPERMV - pad 128/256-bit shuffles on non-VLX targets
Allow non-VLX targets to use 512-bit VPERMV/VPERMV3 for 128/256-bit shuffles.

TBH I'm not sure these targets actually exist in the wild, but we're testing for them and it's good test coverage for shuffle lowering/combines across different subvector widths.
2020-08-18 15:46:02 +01:00
Simon Pilgrim 011bf4fd96 [X86][AVX] lowerShuffleWithVTRUNC - extend to support v16i16/v32i8 binary shuffles.
This requires a few additional SrcVT vs DstVT padding cases in getAVX512TruncNode.
2020-08-18 15:30:02 +01:00
Simon Pilgrim d5621b83a5 [X86][AVX] lowerShuffleWithVTRUNC - pull out TRUNCATE/VTRUNC creation into helper code. NFCI.
Prep work toward adding v16i16/v32i8 support for lowerShuffleWithVTRUNC and improving lowerShuffleWithVPMOV.
2020-08-18 14:52:42 +01:00
Simon Pilgrim 7db5124736 [X86][AVX] lowerShuffleWithVTRUNC - avoid unnecessary division in element counts. NFCI.
(256 / SrcEltBits) == ((2 * EltSizeInBits * NumElts) / (EltSizeInBits * Scale)) == (2 * (NumElts / Scale)) == NumSrcElts
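For illustration, a standalone check of this identity (the names mirror the message, not the actual LLVM variables; it assumes the 128-bit shuffle types handled here, i.e. EltSizeInBits * NumElts == 128):

  #include <cassert>
  #include <initializer_list>

  int main() {
    for (int EltSizeInBits : {8, 16, 32}) {
      int NumElts = 128 / EltSizeInBits;          // 128-bit shuffle type
      for (int Scale = 2; Scale <= NumElts; Scale *= 2) {
        int SrcEltBits = EltSizeInBits * Scale;   // wider source elements
        assert(256 / SrcEltBits == 2 * (NumElts / Scale)); // == NumSrcElts
      }
    }
  }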
2020-08-18 13:48:22 +01:00
Simon Pilgrim d2057a8015 [X86][AVX] Lower v16i8/v8i16 binary shuffles using VTRUNC/TRUNCATE
This patch adds lowerShuffleWithVTRUNC to handle basic binary shuffles that can be lowered either as a pure ISD::TRUNCATE or a X86ISD::VTRUNC (with undef/zero values in the remaining upper elements).

We concat the binary sources together into a single 256-bit source vector. To avoid regressions we perform this after we've tried to lower with PACKS/PACKUS which typically does a cleaner job than a concat.

For non-AVX512VL cases we have to canonicalize VTRUNC cases to use 512-bit source vectors (inserting undefs/zeros in the upper elements as necessary), truncate and then (possibly) extract the 128-bit result.

This should address the last regressions in D66004

Differential Revision: https://reviews.llvm.org/D86093
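For intuition, a scalar model of why a binary even-element shuffle is the same as truncating the concatenated source (little-endian assumption; illustrative only, not the lowering code):

  #include <cassert>
  #include <cstdint>
  #include <cstring>

  int main() {
    uint16_t X[8], Y[8], Concat[16];
    for (int i = 0; i < 8; ++i) {
      X[i] = uint16_t(i * 3 + 1);
      Y[i] = uint16_t(i * 7 + 2);
    }
    std::memcpy(Concat, X, sizeof(X));      // concat the two binary sources
    std::memcpy(Concat + 8, Y, sizeof(Y));

    uint16_t Shuffled[8];                   // shuffle mask <0,2,4,...,14>
    for (int i = 0; i < 8; ++i)
      Shuffled[i] = Concat[2 * i];

    uint32_t Wide[8];                       // reinterpret concat as 8 x i32
    std::memcpy(Wide, Concat, sizeof(Wide));
    for (int i = 0; i < 8; ++i)
      assert(Shuffled[i] == uint16_t(Wide[i])); // truncate == even-element shuffle
  }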
2020-08-18 11:11:58 +01:00
Craig Topper b673dfbb9a [X86] When manually creating intrinsic nodes in X86ISelLowering, make sure we use getTargetConstant and pointer type for the intrinsic ID.
Doesn't really matter in practice but that's how the nodes are
normally created by SelectionDAGBuilder. So we should match.

Found by temporarily hacking type checks into isel table.
2020-08-17 17:25:53 -07:00
Craig Topper 2ffa5d218f [X86] Rename INTR_TYPE_4OP to INTR_TYPE_4OP_IMM8 and truncate immediates to MVT::i8
This makes sure VPTERNLOG is generated with an MVT::i8 immediate,
as its SDNode declaration in X86InstrFragmentsSIMD.td declares.
2020-08-17 17:25:52 -07:00
Craig Topper bc244f08cf [X86] Truncate immediate to i8 for INTR_TYPE_3OP_IMM8
This is used for DBPSADBW which has an i32 immediate for its
intrinsic and an i8 immediate in tablegen isel patterns.
2020-08-17 17:25:51 -07:00
Simon Pilgrim 1d2ede87ea [X86][AVX] Move lowerShuffleWithVPMOV inside explicit shuffle lowering cases
Perform lowerShuffleWithVPMOV as part of the v16i8/v8i16 shuffle lowering stages, which are the only types that are currently supported.

We need to expand support for lowering shuffles as truncations to fix the remaining regressions in D66004
2020-08-17 11:58:51 +01:00
Craig Topper a206f85091 [X86] Reject dirflag in inline asm constraints other than clobber.
Fixes the crash from PR47195.
2020-08-16 23:33:45 -07:00
Simon Pilgrim f25d47b7ed [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
We can now enable this for AVX1 targets, and it can now assist with canonicalizeShuffleMaskWithHorizOp cleanup.

There are still a few missed opportunities for merging subvector insert/extracts into shuffles, but they shouldn't cause any regressions now.
2020-08-16 15:00:41 +01:00
Simon Pilgrim dca7eb7d60 [X86][SSE] Replace combineShuffleWithHorizOp with canonicalizeShuffleMaskWithHorizOp
Instead of just attempting to fold shuffle(HOP,HOP) for a specific target shuffle, make this part of combineX86ShufflesRecursively so we can perform this on the combined shuffle chain, which is particularly useful for recognising more cases where we're performing multiple HOPs that can be merged, and for pre-AVX targets where we don't have good blend/unary target shuffle support.
2020-08-16 12:26:27 +01:00
Simon Pilgrim c27baa54b7 [X86] isRepeatedTargetShuffleMask - don't require specific MVT type. NFC.
Split isRepeatedTargetShuffleMask into a wrapper variant that takes an MVT describing the mask width, and an internal version that just needs the raw mask element bit size.

This will be necessary for an upcoming change where the horizontal ops element width might not match the shuffle mask element width.
2020-08-16 11:51:44 +01:00
Simon Pilgrim e9eb2dc332 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813

Recommit of rG9bd97d036398 with fixed offset adjustments
2020-08-14 18:43:19 +01:00
Simon Pilgrim cd3b850a4c rG9bd97d0363987b582 - Revert "[X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))"
This reverts commit 9bd97d0363.

Seeing some codegen issues in internal testing.
2020-08-13 15:21:15 +01:00
Simon Pilgrim a31d20e67e [X86][SSE] IsElementEquivalent - add HOP(X,X) support
For HADD/HSUB/PACKS ops with repeated operands, the lower/upper half elements of each lane are known to be equivalent.
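A scalar sketch of the claim for one 128-bit lane of a v4i32 HADD (illustrative model, not the LLVM helper):

  #include <array>
  #include <cassert>

  // hadd for one v4i32 lane: [x0+x1, x2+x3, y0+y1, y2+y3]
  static std::array<int, 4> hadd(const std::array<int, 4> &X,
                                 const std::array<int, 4> &Y) {
    return {X[0] + X[1], X[2] + X[3], Y[0] + Y[1], Y[2] + Y[3]};
  }

  int main() {
    std::array<int, 4> X = {1, 2, 3, 4};
    std::array<int, 4> R = hadd(X, X);      // repeated operand
    assert(R[0] == R[2] && R[1] == R[3]);   // lower/upper half elements match
  }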
2020-08-13 12:42:59 +01:00
Simon Pilgrim 39de63aef9 Fix signed/unsigned comparison warnings. NFC. 2020-08-12 19:22:13 +01:00
Simon Pilgrim 13d6cf0951 [X86][SSE] Pull out BUILD_VECTOR operand equivalence tests. NFC.
Pull out element equivalence code from isShuffleEquivalent/isTargetShuffleEquivalent; I've also removed many of the index modulos where possible.

First step toward simply adding some additional equivalence tests.
2020-08-12 18:20:18 +01:00
Simon Pilgrim 9bd97d0363 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813
2020-08-12 12:16:36 +01:00
Simon Pilgrim a0c2c6aa42 [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
Only do this for AVX2+ targets as we still get some regressions on AVX1 without PERMPD/PERMQ
2020-08-12 11:31:05 +01:00
Simon Pilgrim 2655bd51d6 [X86][SSE] combineShuffleWithHorizOp - canonicalize SHUFFLE(HOP(X,Y),HOP(Y,X)) -> SHUFFLE(HOP(X,Y))
Attempt to canonicalize binary shuffles of HOPs with commuted operands to a unary shuffle.
2020-08-11 18:13:03 +01:00
Eric Christopher 8155cb27a2 Fold Opcode into assert uses to fix an unused variable warning without asserts. 2020-08-11 09:30:51 -07:00
Simon Pilgrim fe1f36986b [X86][SSE] combineShuffleWithHorizOp - avoid unnecessary subtraction. NFCI.
We can safely replace ((M - NumElts) % NumEltsPerLane) with (M % NumEltsPerLane) as the modulo result will be the same.
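A quick standalone check of why the subtraction is redundant, assuming (as in these shuffle masks) that NumElts is a whole multiple of NumEltsPerLane and M indexes the second operand:

  #include <cassert>

  int main() {
    const int NumEltsPerLane = 4;  // e.g. 4 x f32 per 128-bit lane
    const int NumElts = 8;         // e.g. v8f32: two lanes
    for (int M = NumElts; M < 2 * NumElts; ++M)
      assert((M - NumElts) % NumEltsPerLane == M % NumEltsPerLane);
  }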
2020-08-11 17:07:32 +01:00
Simon Pilgrim 91d59cbf1b [X86][SSE] Add HADD/SUB support to combineHorizOpWithShuffle
Handles some HOP(SHUFFLE,SHUFFLE) patterns and sets us up to improve some of the cases mentioned in PR41813.
2020-08-11 16:14:14 +01:00
Kerry McLaughlin 85c7e89f3b [CodeGen] Refactor getMemBasePlusOffset & getObjectPtrOffset to accept a TypeSize
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D85220
2020-08-11 12:17:10 +01:00
Simon Pilgrim 49016eeab6 [X86] Rename combineVectorPackWithShuffle -> combineHorizOpWithShuffle. NFC.
The plan is to use this for (F)HADD/SUB opcodes as well as PACKs - similar to how we use combineShuffleWithHorizOp
2020-08-11 11:38:43 +01:00
Wang, Pengfei 9512525947 [X86][FPEnv] Teach X86 mask compare intrinsics to respect strict FP semantics.
When we use mask compare intrinsics under the strict FP option, the masked
elements shouldn't raise any exception. So, we can't replace the
intrinsic with a full compare + "and" operation.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D85385
2020-08-11 10:28:41 +08:00
Simon Pilgrim 9a368d2b00 [X86][SSE] shuffle(hop,hop) - canonicalize unary hop(x,x) shuffle masks
If a shuffle is referring to both the lower and upper half lanes of a unary horizontal op, then canonicalize the mask to only refer to the lower half.
2020-08-10 16:09:27 +01:00
Simon Pilgrim 07e673a02b [X86][SSE] Pull out shuffle(hop,hop) combine into combineShuffleWithHorizOp helper. NFC. 2020-08-10 15:08:57 +01:00
Simon Pilgrim e6dc2c8ce7 [X86][SSE] combineTargetShuffle - rearrange shuffle(hop,hop) matching to delay shuffle mask manipulation. NFC.
Check that we're shuffling hadd/pack ops first before altering shuffle masks.

First step towards adding extra functionality, plus it avoids costly shuffle mask manipulation if not necessary.
2020-08-10 14:13:19 +01:00
Craig Topper d3153b5ca2 [X86] Remove a DCI.isBeforeLegalize() call from combineVSelectWithAllOnesOrZeros.
This was blocking the isTypeLegal call so that we could do a particular
transform on illegal types before type legalization. But then we
create a target specific node using that type. We shouldn't do
that if the type isn't legal. So I think we should just always
make sure the type is legal.

I suspect that in order to get the condition VT to not be a vector
of i1 we've already completed type legalization anyway, so this probably
doesn't matter much in practice.
2020-08-08 14:19:13 -07:00
Simon Pilgrim cc15380f10 [X86][SSE] combineTargetShuffle - use scaleShuffleMask helper to widen shuffle mask. NFCI.
Use scaleShuffleMask helper for the shuffle(hadd,hadd) canonicalization.
2020-08-08 19:36:18 +01:00
Craig Topper 514b00c439 [X86] Limit the scope of the min/max canonicalization in combineSelect
Previously the transform was doing these two canonicalizations
(x > y) ? x : y -> (x >= y) ? x : y
(x < y) ? x : y -> (x <= y) ? x : y

But those don't seem to be useful generally. And they actively
pessimize the cases in PR47049.

This patch limits it to
(x > 0) ? x : 0 -> (x >= 0) ? x : 0
(x < -1) ? x : -1 -> (x <= -1) ? x : -1

These are the cases mentioned in the comments as the motivation
for the canonicalization. These allow the CMOV to use the S
flag from the compare thus improving opportunities to use a TEST
or the flags from an arithmetic instruction.
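The scalar equivalences behind the remaining canonicalizations can be checked in isolation (illustrative only):

  #include <cassert>

  int main() {
    for (int x = -3; x <= 3; ++x) {
      assert(((x > 0) ? x : 0) == ((x >= 0) ? x : 0));
      assert(((x < -1) ? x : -1) == ((x <= -1) ? x : -1));
    }
  }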
2020-08-07 22:51:49 -07:00
Keno Fischer c58674df14 [X86] Don't produce bad x86andp nodes for i1 vectors
In D85499, I attempted to fix this same issue by canonicalizing
andnp for i1 vectors, but since there was some opposition to such
a change, this commit just fixes the bug by using two different
forms depending on which kind of vector type is in use. We can
then always decide to switch the canonical forms later.

Description of the original bug:
We have a DAG combine that tries to fold (vselect cond, 0000..., X) -> (andnp cond, x).
However, it does so by attempting to create an i64 vector with the number
of elements obtained by truncating division by 64 from the bitwidth. This is
bad for mask vectors like v8i1, since that division is just zero. Besides,
we don't want i64 vectors anyway. For i1 vectors, switch the pattern
to (andnp (not cond), x), which is the canonical form for `kandn`
on mask registers.

Fixes https://github.com/JuliaLang/julia/issues/36955.

Differential Revision: https://reviews.llvm.org/D85553
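For a single i1 lane the fold reduces to a simple boolean identity, which can be sanity-checked in isolation (illustrative only):

  #include <cassert>
  #include <initializer_list>

  int main() {
    // vselect(cond, 0, x) per lane == andn(cond, x) == (!cond) & x
    for (bool cond : {false, true})
      for (bool x : {false, true})
        assert((cond ? false : x) == (!cond && x));
  }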
2020-08-07 20:05:47 -04:00
Craig Topper 0215ae9735 [X86] Remove incomplete custom handling of i128 sdivrem/udivrem on Windows.
We need to have special handling of i128 div/rem on Windows due
to a weird calling convention needed for the libcall. There was
also some code that made it look like we do the same for sdivrem/udivrem,
but the code didn't account for multiple return values of those
functions so it couldn't possibly work. I think this code never
triggers because we don't have libcall names defined for those
functions by default, so DAGCombine never creates DIVREM nodes.
2020-08-05 23:01:07 -07:00
Craig Topper 08b2d0a963 [X86] Disable copy elision in LowerMemArgument for scalarized vectors when the loc VT is a different size than the original element.
For example a v4f16 argument is scalarized to 4 i32 values. So
the values are spread out instead of being packed tightly like
in the original vector.

Fixes PR47000.
2020-08-05 15:44:54 -07:00
Simon Pilgrim b60f998859 [X86][SSE] Fold 128-bit PACK(EXTEND(X),EXTEND(Y)) -> CONCAT(X,Y) subvectors
This is seen in the sub-128-bit vector trunc(ext()) of comparison results

Fixes pr46585.ll regression in D66004
2020-08-05 18:27:40 +01:00
Simon Pilgrim 6a06c7a0a7 [X86] isHorizontalBinOp - only update LHS/RHS references on success
We've had issues in the past where isHorizontalBinOp calls would affect later combines because the LHS/RHS references had been commuted even though the match had failed.
2020-08-05 15:09:52 +01:00
Simon Pilgrim a57bfb44bc [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for integer types 2020-08-05 15:09:51 +01:00
Simon Pilgrim 6f0da46d53 [X86] getFauxShuffleMask - drop unnecessary computeKnownBits OR(X,Y) shuffle decoding.
Now that rG47cea9e82dda941e lets us aggressively decode multi-use shuffles for the OR(SHUFFLE(),SHUFFLE()) case we don't need the computeKnownBits variant any more.
2020-08-04 15:57:47 +01:00
Simon Pilgrim 051f293b78 [X86] Remove unused canScaleShuffleElements helper
The only use was removed at rG36750ba5bd0e9e72

Thanks to @nemanjai for the heads up
2020-08-04 14:51:23 +01:00
Simon Pilgrim 36750ba5bd [X86][AVX] isHorizontalBinOp - relax lane-crossing limits for AVX1-only targets.
Permit lane-crossing post shuffles on AVX1 targets as long as every element comes from the same source lane, which for v8f32/v4f64 cases can be efficiently lowered with the LowerShuffleAsLanePermuteAnd* style methods.
2020-08-04 14:27:01 +01:00
Simon Pilgrim 47cea9e82d Revert rG66e7dce714fab "Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.""
[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero (REAPPLIED)

This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.

Reapplied with fixed signed/unsigned comparisons.
2020-08-04 10:32:39 +01:00
Wang, Pengfei 6bc7ea2d8d [X86][AVX512] Fix build fail after D81548
Test function mask_cmp_128 failed during ISEL:
LLVM ERROR: Cannot select: t37: v8i1 = X86ISD::KSHIFTL t48, TargetConstant:i8<4>
due to v8i1 only being available under AVX512DQ.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84922
2020-08-04 12:31:04 +08:00
Mitch Phillips 66e7dce714 Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero."
This reverts commit 219f32f4b6.

Commit contains unsigned comparisons that break bots that build with
-Wsign-compare.
2020-08-03 13:48:30 -07:00
Simon Pilgrim 219f32f4b6 [X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.
This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.
2020-08-03 18:32:47 +01:00
Simon Pilgrim 99a971cadf [X86][SSE] Start shuffle combining from ANY_EXTEND_VECTOR_INREG on SSE targets
We already do this on AVX (+ for ZERO_EXTEND_VECTOR_INREG), but this enables it for all SSE targets - we attempted something similar back at rL357057 but hit issues with the ZERO_EXTEND_VECTOR_INREG handling (PR41249).

I'm still looking at the vector-mul.ll regression - which is due to 32-bit targets performing the load as an f64, resulting in the shuffle combiner thinking it has to create a shuffle in the float domain.
2020-08-03 13:41:48 +01:00
Craig Topper 64516ec7c1 [X86] Use parity flag from byte test/cmp instruction for __builtin_parity when input fits in 8 bits.
If the upper bits of the __builtin_parity idiom are known to be
0, we were previously emitting an xor with 0 to get the parity flag.
But we can use cmp/test instead, which may expose opportunities for
load folding or combining an AND.
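One source pattern of the kind this targets, assuming a GCC/Clang-style __builtin_parity (an illustrative example, not taken from the commit's tests):

  #include <cassert>

  // After the mask the upper bits are known zero, so the backend can test the
  // low byte with an 8-bit cmp/test and read the parity flag directly.
  static int low_byte_parity(unsigned x) {
    return __builtin_parity(x & 0xff);
  }

  int main() {
    assert(low_byte_parity(0x07) == 1);   // 3 set bits -> odd parity
    assert(low_byte_parity(0x103) == 0);  // masks to 0x03 -> even parity
  }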
2020-08-02 10:45:04 -07:00
Simon Pilgrim d14a22da5e [DAG] TargetLowering::LowerAsmOutputForConstraint - pass SDLoc as const&
Try to be more consistent with the SDLoc param in the TargetLowering methods.
2020-08-02 15:12:02 +01:00
Simon Pilgrim 20fbbbc583 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:23 +01:00
Simon Pilgrim d7e2616741 [X86] Pass SDLoc by const reference. NFCI. 2020-08-02 14:32:22 +01:00
Simon Pilgrim 3f276840b6 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:22 +01:00
Simon Pilgrim 2700311cce [X86] combineX86ShuffleChain - pull out repeated RootVT.getSizeInBits() calls. NFCI. 2020-08-02 14:32:22 +01:00
Craig Topper 56166a3a52 [X86] Improve parity idiom recognition to handle (and (truncate (ctpop X)), 1).
Fixes part of PR46954
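A plausible C-level idiom that produces this DAG shape, assuming GCC/Clang builtins (illustrative, not the PR46954 test case):

  #include <cassert>

  // popcount followed by "& 1" spells parity; the cast to a narrower type
  // introduces the truncate in (and (truncate (ctpop X)), 1).
  static int parity(unsigned x) {
    return (unsigned char)__builtin_popcount(x) & 1;
  }

  int main() {
    assert(parity(0x7) == 1);     // 3 set bits
    assert(parity(0xF0F0) == 0);  // 8 set bits
  }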
2020-08-01 22:59:43 -07:00
Simon Pilgrim 82a5c848e7 [X86][AVX512] Fold concat(and(x,y),and(z,w)) -> and(concat(x,z),concat(y,w)) for 512-bit vectors
Helps vpternlog folding on non-AVX512BW targets
2020-08-01 20:34:39 +01:00
Simon Pilgrim bb13c34c3a [X86][AVX] Ensure we only combine to PSHUFLW/PSHUFHW on supporting targets
Noticed while investigating combining from concatenated shuffle vectors: we weren't checking that PSHUFLW/PSHUFHW was legal - we were depending on lowering splitting to subvectors.
2020-08-01 19:18:11 +01:00
Simon Pilgrim 1b1901536a [X86][AVX] Extend v2f64 BROADCAST(LOAD) -> BROADCAST_LOAD to v2i64/v4f32/v4i32
Minor precursor fix for D66004, but it helps the SSE41 tests as well since they run with -disable-peephole.
2020-08-01 12:28:29 +01:00
Matt Arsenault 57bd64ff84 Support addrspacecast initializers with isNoopAddrSpaceCast
Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs
with the DataLayout.
2020-07-31 10:42:43 -04:00
Simon Pilgrim 2dec72ba5c [X86][SSE] combineExtractWithShuffle - extend extract(truncate(x),0) for any source vector size
As long as we can extract the lowest 128-bit subvector from the pre-truncated source vector, then we don't care what size it is.

The next stage will be to support non-zero extraction indices, as long as its still coming from the lowest 128-bit subvector.
2020-07-30 12:27:49 +01:00
Simon Pilgrim a1c9529e60 [X86][AVX] isHorizontalBinOp - relax no-lane-crossing limit for AVX1-only targets.
Instead of never accepting v8f32/v4f64 FHADD/FHSUB if the input shuffle masks cross lanes, perform the matching and determine if the post shuffle mask simplifies to a 'whole lane shuffle' mask - in which case we are guaranteed to cheaply perform this as a VPERM2F128 shuffle.
2020-07-29 20:49:10 +01:00
Craig Topper c4823b24a4 [X86] Add custom lowering for llvm.roundeven with sse4.1.
We can use the roundss/sd/ps/pd instructions like we do for
ceil/floor/trunc/rint/nearbyint.

Differential Revision: https://reviews.llvm.org/D84592
2020-07-29 10:23:08 -07:00
Simon Pilgrim 0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but the broadcast case was a simpler win.
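A sketch of the 2-bits-per-element imm8 encoding and the broadcast canonicalization (the undef handling here is a simplified model, not the actual helper):

  #include <cassert>

  // PSHUFD/SHUFPS-style imm8: two bits per destination element.
  static unsigned buildImm8(const int Mask[4], const int UndefFill[4]) {
    unsigned Imm = 0;
    for (int i = 0; i < 4; ++i) {
      int M = (Mask[i] < 0) ? UndefFill[i] : Mask[i]; // -1 == undef
      Imm |= unsigned(M) << (2 * i);
    }
    return Imm;
  }

  int main() {
    const int Mask[4] = {2, -1, -1, -1};  // only element 2 is demanded
    const int Inline[4] = {0, 1, 2, 3};   // undef -> in-place index
    const int Bcast[4] = {2, 2, 2, 2};    // undef -> broadcast of element 2
    assert(buildImm8(Mask, Inline) == 0xE6); // {2,1,2,3}
    assert(buildImm8(Mask, Bcast) == 0xAA);  // {2,2,2,2} - full broadcast
  }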
2020-07-29 11:32:44 +01:00
Simon Pilgrim 4838cd46a9 [X86][XOP] Shuffle v16i8 using VPPERM(X,Y) instead of OR(PSHUFB(X),PSHUFB(Y)) 2020-07-28 19:56:10 +01:00
Simon Pilgrim 182111777b [X86][SSE] Attempt to match OP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
An initial backend patch towards fixing the various poor HADD combines (PR34724, PR41813, PR45747 etc.).

This extends isHorizontalBinOp to check if we have per-element horizontal ops (odd+even element pairs), but not in the expected serial order - in which case we build a "post shuffle mask" that we can apply to the HOP result, assuming we have fast-hops/optsize etc.

The next step will be to extend the SHUFFLE(HOP(X,Y)) combines as suggested on PR41813 - accepting more post-shuffle masks even on slow-hop targets if we can fold it into another shuffle.

Differential Revision: https://reviews.llvm.org/D83789
2020-07-28 10:04:14 +01:00
Craig Topper 647e861e08 [X86] Detect if EFLAGs is live across XBEGIN pseudo instruction. Add it as livein to the basic blocks created when expanding the pseudo
XBEGIN causes several basic blocks to be inserted. If flags are live across it, we need to make EFLAGS live-in to the new basic blocks to avoid machine verifier errors.

Fixes PR46827

Reviewed By: ivanbaev

Differential Revision: https://reviews.llvm.org/D84479
2020-07-27 21:15:35 -07:00
Simon Pilgrim 4d84d94969 [X86][SSE] Relax 128-bit restriction on extract_subvector(ext_vector_inreg(X),0) -> ext_vector_inreg(extract_subvector(X,0)) fold
We only need to ensure that the source is larger than the subvector result type
2020-07-27 17:50:36 +01:00
Simon Pilgrim ab4ffa52f0 [X86][AVX] Fold extract_subvector(truncate(x),0) -> truncate(extract_subvector(x),0)
This is currently only supported for VLX targets where the op should be legal.
2020-07-27 14:51:29 +01:00
Simon Pilgrim f720c9c68c [X86] combineExtractSubvector - pull out repeated getSizeInBits() calls. NFCI. 2020-07-27 14:51:28 +01:00
Simon Pilgrim 17eafe0841 [X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening
If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0 (elements 0,1 of the v4i32), which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls.

By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.
2020-07-26 16:04:22 +01:00
Simon Pilgrim 3b21823e4a [X86][SSE] combineX86ShufflesRecursively - move all Root node asserts to the same location. NFCI.
Minor tidyup for some upcoming shuffle combine improvements.
2020-07-25 12:48:14 +01:00
Simon Pilgrim 66998ae59f [X86][SSE] getFauxShuffle - ignore undemanded sources for PACKSS/PACKUS faux shuffles
If we don't care about an entire LHS/RHS of the PACK op, then we can just treat it the same as undef (we don't care if it saturates) and it is safe to treat as a shuffle.

This can happen if we attempt to decode as a faux shuffle before SimplifyDemandedVectorElts has been called on the PACK, which should replace the source with UNDEF entirely.
2020-07-25 10:51:14 +01:00
Simon Pilgrim d720ba1e4b [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add SSE shift multiple use handling
Add SimplifyMultipleUseDemandedVectorElts peek through for imm/var SSE shifts
2020-07-23 14:39:03 +01:00
Simon Pilgrim 5b5dc2442a [X86][AVX] getTargetShuffleMask - don't decode VBROADCAST(EXTRACT_SUBVECTOR(X,0)) patterns.
getTargetShuffleMask is used by the various "SimplifyDemanded" folds so we can't assume that the bypassed extract_subvector can be safely simplified - getFauxShuffleMask performs a more general decode that allows us to more safely catch many of these cases so the impact is minimal.
2020-07-21 21:55:44 +01:00
Sanjay Patel 50afa18772 [x86] split FMA with fast-math-flags to avoid libcall
fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware)

C/C++ code may use explicit fma() calls (which become LLVM fma
intrinsics in IR) but then gets compiled with -ffast-math or similar.
For targets that do not have FMA hardware, we don't want to go out to
the math library for a precise but slow FMA result.

I tried this as a generic DAGCombine, but it caused infinite looping
on more than 1 other target, so there's likely some over-reaching fma
formation happening.

There's also a potential intersection of strict FP with fast-math here.
Deferring to current behavior for that case (assuming that strict-ness
overrides fast-ness).

Differential Revision: https://reviews.llvm.org/D83981
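A minimal source-level example of the situation being described, assuming a reassoc-enabling flag such as -ffast-math and a target without FMA hardware (illustrative only):

  #include <cassert>
  #include <cmath>

  // With reassociation allowed and no FMA hardware this can now lower to a
  // multiply + add instead of a call into the math library.
  static double muladd(double a, double b, double c) {
    return std::fma(a, b, c);
  }

  int main() {
    assert(muladd(2.0, 3.0, 4.0) == 10.0);
  }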
2020-07-19 10:03:55 -04:00
Craig Topper 5408024fa8 [X86] Move integer hadd/hsub formation into a helper function shared by combineAdd and combineSub.
There was a lot of duplicate code here for checking the VT and
subtarget. Moving it into a helper avoids that.

It also fixes a bug where combineAdd reused Op0/Op1 after a call
to isHorizontalBinOp may have changed them. The new helper function
has its own local versions of Op0/Op1 that aren't shared by other
code.

Fixes PR46455.

Reviewed By: spatel, bkramer

Differential Revision: https://reviews.llvm.org/D83971
2020-07-16 13:27:27 -07:00
Craig Topper 9adf7461f7 [X86] Add test case for PR46455. 2020-07-16 11:06:55 -07:00
Craig Topper f8f007e378 [X86] Consistently use 128 as the PSHUFB/VPPERM index for zero
Bit 7 of the index controls zeroing; the other bits are ignored when bit 7 is set. Shuffle lowering was using 128 and shuffle combining was using 255. Seems like we should be consistent.

This patch changes shuffle combining to use 128 to match lowering.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83587
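A byte-level model of the index semantics, per the ISA description of bit 7 zeroing (illustrative, not the shuffle combining code):

  #include <cassert>
  #include <cstdint>

  // One byte of a 128-bit PSHUFB: bit 7 of the index zeroes the result,
  // otherwise the low 4 bits select a source byte.
  static uint8_t pshufb_byte(const uint8_t Src[16], uint8_t Idx) {
    return (Idx & 0x80) ? 0 : Src[Idx & 0x0F];
  }

  int main() {
    uint8_t Src[16];
    for (int i = 0; i < 16; ++i)
      Src[i] = uint8_t(i + 1);
    assert(pshufb_byte(Src, 128) == 0);  // index used by lowering
    assert(pshufb_byte(Src, 255) == 0);  // index used by combining
  }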
2020-07-12 10:52:43 -07:00
Craig Topper 04013a07ac [X86] Fix two places that appear to misuse peekThroughOneUseBitcasts
peekThroughOneUseBitcasts checks the use count of the operand of the bitcast, not the bitcast itself. So I think that means we need to do any outside hasOneUse checks before calling the function, not after.

I was working on another patch where I misused the function and did a very quick audit to see if there were other similar mistakes.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83598
2020-07-12 10:52:43 -07:00
Wang, Pengfei e628092524 [X86][MMX] Optimize MMX shift intrinsics.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D83534
2020-07-11 11:16:23 +08:00
Simon Pilgrim 4cc26a44ca [X86][SSE] Use shouldUseHorizontalOp helper to determine whether to use (F)HADD. NFCI. 2020-07-10 12:13:34 +01:00
Simon Pilgrim 77133cc1e2 [X86][AVX] Attempt to fold PACK(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(PACK(X,Y)).
Truncations lowered as shuffles of multiple (concatenated) vectors often leave us with lane-crossing shuffles that feed a PACKSS/PACKUS. If both shuffles are fed from the same 2 vector sources, then we can PACK the sources directly and shuffle the result instead.

This is currently limited to whole i128 lanes in a 256-bit vector, but we can extend this if the need arises (but I'm not seeing many examples in real world code).
2020-07-10 09:33:27 +01:00
Craig Topper 918e653186 [X86] Immediately call LowerShift from lowerBuildVectorToBitOp.
If we don't immediately lower the vector shift, the splat
constant vector we created may get turned into a constant pool
load before we get around to lowering the shift. This makes it
a lot more difficult to create a shift by constant. Sometimes we
fail to see through the constant pool at all and end up trying
to lower as if it was a variable shift. This requires custom
handling and may create an unsupported vselect on pre-sse-4.1
targets. Since we're after LegalizeVectorOps we are unable to
legalize the unsupported vselect as that code is in LegalizeVectorOps
rather than LegalizeDAG.

So calling LowerShift immediately ensures that we get to see the
splat constant.

Fixes PR46527.

Differential Revision: https://reviews.llvm.org/D83455
2020-07-09 10:51:29 -07:00
Hiroshi Yamauchi 2c1a9006dd [PGO][PGSO] Add profile guided size optimization to X86 ISel Lowering. 2020-07-09 10:43:45 -07:00
Craig Topper 3e75912005 [X86] Directly emit X86ISD::BLENDV instead of VSELECT in a few places that were emitting sign bit tests.
Technically a VSELECT expects a vector of all-1s or all-0s elements
for its condition. But we aren't guaranteeing that the sign bit
and the non-sign bits match in these locations. So we should use
BLENDV, which is more relaxed.

Differential Revision: https://reviews.llvm.org/D83447
2020-07-09 10:40:09 -07:00
Simon Pilgrim f54402b63a [X86][AVX] Attempt to fold extract_subvector(shuffle(X)) -> extract_subvector(X)
If we're extracting a subvector from a shuffle that is shuffling entire subvectors we can peek through and extract the subvector from the shuffle source instead.

This helps remove some cases where concat_vectors(extract_subvector(),extract_subvector()) legalization has resulted in BLEND/VPERM2F128 shuffles of the subvectors.
2020-07-09 14:09:24 +01:00
Benjamin Kramer b44470547e Make helpers static. NFC. 2020-07-09 13:48:56 +02:00
Simon Pilgrim 800fb68420 [X86][SSE] Pull out PACK(SHUFFLE(),SHUFFLE()) folds into its own function. NFC.
Future patches will extend this, so declutter combineVectorPack before we start.
2020-07-08 17:42:42 +01:00
Simon Pilgrim 08a2c9ce5c [X86] Fix copy+paste typo in combineVectorPack assert message. NFC. 2020-07-08 17:42:42 +01:00
Sanjay Patel 9114900287 [x86] improve codegen for non-splat bit-masked vector compare and select (PR46531)
vselect ((X & Pow2C) == 0), LHS, RHS --> vselect ((shl X, C') < 0), RHS, LHS

Follow-up to D83073 - the non-splat mask cases where we actually see an
improvement are quite limited from what I can tell. AVX1 needs multiply
and blend capabilities and AVX2 needs vector shift and blend capabilities.
The intersection of those 2 constraints is only vectors with 32-bit or
64-bit elements.

XOP is/was better.

Differential Revision: https://reviews.llvm.org/D83181
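A scalar check of the per-element equivalence behind the fold, taking C' = BitWidth - 1 - log2(Pow2C) (an illustrative model, not the DAG code):

  #include <cassert>
  #include <cstdint>
  #include <initializer_list>

  int main() {
    const int LHS = 10, RHS = 20;
    for (uint32_t X : {0u, 1u, 4u, 5u, 0xFFFFFFFFu}) {
      for (int k = 0; k < 32; ++k) {
        uint32_t Pow2C = uint32_t(1) << k;
        int Cp = 31 - k;                                // move tested bit to sign bit
        int Orig = ((X & Pow2C) == 0) ? LHS : RHS;
        int Xform = (int32_t(X << Cp) < 0) ? RHS : LHS; // operands swapped
        assert(Orig == Xform);
      }
    }
  }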
2020-07-08 08:20:49 -04:00
Simon Pilgrim 9dc250db9d [X86][AVX] SimplifyDemandedVectorEltsForTargetShuffle - ensure mask is same size as constant size
Fixes test regression reported on D81791
2020-07-08 11:47:59 +01:00
Simon Pilgrim c00a27752e [X86][AVX] Remove redundant EXTRACT_VECTOR_ELT(VBROADCAST(SCALAR())) fold
Noticed while looking for similar cases to rG931ec74f7a29 - SimplifyDemandedVectorElts and shuffle combining both should handle this now.
2020-07-08 10:18:36 +01:00
Simon Pilgrim 931ec74f7a [X86][AVX] Don't fold PEXTR(VBROADCAST_LOAD(X)) -> LOAD(X).
We were checking the VBROADCAST_LOAD element size against the extraction destination size instead of the extracted vector element size - PEXTRW/PEXTRB have implicit zext'ing so they have i32 destination sizes for v8i16/v16i8 vectors, resulting in us extracting from the wrong part of a load.

This patch bails from the fold if the vector element sizes don't match, and we now use the target constant extraction code later on like the pre-AVX2 targets, fixing the test case.

Found by internal fuzzing tests.
2020-07-07 19:10:03 +01:00
Sanjay Patel 642eed3713 [x86] fix miscompile in buildvector v16i8 lowering
In the test based on PR46586:
https://bugs.llvm.org/show_bug.cgi?id=46586
...we are inserting 16-bits into the high element of the vector, shuffling it
to element 0, and extracting 32-bits. But xmm1 was never initialized, so the
top 16-bits of the extract are undef without this patch.

(It seems like we could do better than this by recognizing that we only demand
a subsection of the build vector, but I want to make sure we fix the
miscompile 1st.)

This path is only used for pre-SSE4.1, and simpler patterns get squashed
somewhere along the way, so the test still includes a 'urem' as it did in the
original test from the bug report.

Differential Revision: https://reviews.llvm.org/D83319
2020-07-07 13:02:31 -04:00
Liu, Chen3 ea85ff82c8 [X86] Fix a bug that when lowering byval argument
When an argument has the 'byval' attribute and should be
passed on the stack according to the calling convention,
a stack copy would be emitted twice. This causes the real
value to be put onto the stack where the pointer should
be passed.

Differential Revision: https://reviews.llvm.org/D83175
2020-07-07 21:49:31 +08:00
Xiang1 Zhang 939d8309db [X86-64] Support Intel AMX Intrinsic
INTEL ADVANCED MATRIX EXTENSIONS (AMX).
AMX is a new programming paradigm: it has a set of 2-dimensional registers
(TILES) representing sub-arrays from a larger 2-dimensional memory image,
and it operates on those TILES.

These intrinsics use direct TMM register numbers as their params.

Spec can be found in Chapter 3 here https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D83111
2020-07-07 10:13:40 +08:00
Craig Topper e652c0f8f3 [X86] Teach lowerShuffleAsBlend to use bit blend for v16i8/v32i8/v16i16 when avx512vl is enabled but not avx512bw.
Probably not super important since there are no real CPUs with
avx512vl and not avx512bw. But vpternlog should be better than
vblendvb.

I do wonder if we should use vpternlog even with BWI. We
currently use vpblendmb or vpblendmw by putting the mask into a GPR
and moving it to a k-register. But I don't think we hoist the
GPR to k-register copy in machine LICM. Using VPTERNLOG would use
a constant pool load, but has the advantage that we're pretty good
at hoisting and rematerializing those.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83156
2020-07-04 10:26:56 -07:00
Craig Topper b4eb415a99 [X86] Disable VPBLENDVB formation in combineLogicBlendIntoPBLENDV if VPTERNLOG is supported.
VPBLENDVB is multiple uops while VPTERNLOG is a single uop. So
we should use that instead.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83155
2020-07-04 10:12:19 -07:00
Simon Pilgrim 71f342d6c3 [X86][AVX] Fold PACK(LOSUBVECTOR(SHUFFLE(X)),HISUBVECTOR(SHUFFLE(X))) -> SHUFFLE(PACK(LOSUBVECTOR(X),HISUBVECTOR(X)))
Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed.

This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles.

A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.
2020-07-04 13:54:30 +01:00
Craig Topper fed432523e [X86] Directly emit VPTERNLOG from canonicalizeBitSelect when possible.
Seems to produce better results on some rotate tests. And is
neutral for other tests.
2020-07-03 22:08:28 -07:00
Sanjay Patel 26543f1c0c [x86] improve codegen for bit-masked vector compare and select (PR46531)
We canonicalize patterns like:
  %s = lshr i32 %a0, 1
  %t = trunc i32 %s to i1

to:
  %a = and i32 %a0, 2
  %c = icmp ne i32 %a, 0

...in IR, but the bit-shifting original sequence may be better for x86 vector codegen.

I tried several variants of the transform, and it's tricky to not induce regressions.
In particular, I did not find a way to cleanly handle non-splat constants, so I've left
that as a TODO item here (currently negative tests for those are included). AVX512
resulted in some diffs, but didn't look meaningful, so I left that out too. Some of
the 256-bit AVX1 diffs are questionable, but close enough that they are probably
insignificant.

Differential Revision: https://reviews.llvm.org/D83073
2020-07-03 17:31:57 -04:00