Commit Graph

7379 Commits

Author SHA1 Message Date
Simon Pilgrim abd33bf5ef [X86][AVX] lowerShuffleWithPERMV - pad 128/256-bit shuffles on non-VLX targets
Allow non-VLX targets to use 512-bit VPERMV/VPERMV3 for 128/256-bit shuffles.

TBH I'm not sure these targets actually exist in the wild, but we're testing for them and it's good test coverage for shuffle lowering/combines across different subvector widths.
2020-08-18 15:46:02 +01:00
Simon Pilgrim 011bf4fd96 [X86][AVX] lowerShuffleWithVTRUNC - extend to support v16i16/v32i8 binary shuffles.
This requires a few additional SrcVT vs DstVT padding cases in getAVX512TruncNode.
2020-08-18 15:30:02 +01:00
Simon Pilgrim d5621b83a5 [X86][AVX] lowerShuffleWithVTRUNC - pull out TRUNCATE/VTRUNC creation into helper code. NFCI.
Prep work toward adding v16i16/v32i8 support for lowerShuffleWithVTRUNC and improving lowerShuffleWithVPMOV.
2020-08-18 14:52:42 +01:00
Simon Pilgrim 7db5124736 [X86][AVX] lowerShuffleWithVTRUNC - avoid unnecessary division in element counts. NFCI.
(256 / SrcEltBits) == ((2 * EltSizeInBits * NumElts) / (EltSizeInBits * Scale)) == (2 * (NumElts / Scale)) == NumSrcElts
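For illustration, a standalone check of this identity (the names mirror the message, not the actual LLVM variables; it assumes the 128-bit shuffle types handled here, i.e. EltSizeInBits * NumElts == 128):

  #include <cassert>
  #include <initializer_list>

  int main() {
    for (int EltSizeInBits : {8, 16, 32}) {
      int NumElts = 128 / EltSizeInBits;          // 128-bit shuffle type
      for (int Scale = 2; Scale <= NumElts; Scale *= 2) {
        int SrcEltBits = EltSizeInBits * Scale;   // wider source elements
        assert(256 / SrcEltBits == 2 * (NumElts / Scale)); // == NumSrcElts
      }
    }
  }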
2020-08-18 13:48:22 +01:00
Simon Pilgrim d2057a8015 [X86][AVX] Lower v16i8/v8i16 binary shuffles using VTRUNC/TRUNCATE
This patch adds lowerShuffleWithVTRUNC to handle basic binary shuffles that can be lowered either as a pure ISD::TRUNCATE or a X86ISD::VTRUNC (with undef/zero values in the remaining upper elements).

We concat the binary sources together into a single 256-bit source vector. To avoid regressions we perform this after we've tried to lower with PACKS/PACKUS which typically does a cleaner job than a concat.

For non-AVX512VL cases we have to canonicalize VTRUNC cases to use 512-bit source vectors (inserting undefs/zeros in the upper elements as necessary), truncate and then (possibly) extract the 128-bit result.

This should address the last regressions in D66004

Differential Revision: https://reviews.llvm.org/D86093
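For intuition, a scalar model of why a binary even-element shuffle is the same as truncating the concatenated source (little-endian assumption; illustrative only, not the lowering code):

  #include <cassert>
  #include <cstdint>
  #include <cstring>

  int main() {
    uint16_t X[8], Y[8], Concat[16];
    for (int i = 0; i < 8; ++i) {
      X[i] = uint16_t(i * 3 + 1);
      Y[i] = uint16_t(i * 7 + 2);
    }
    std::memcpy(Concat, X, sizeof(X));      // concat the two binary sources
    std::memcpy(Concat + 8, Y, sizeof(Y));

    uint16_t Shuffled[8];                   // shuffle mask <0,2,4,...,14>
    for (int i = 0; i < 8; ++i)
      Shuffled[i] = Concat[2 * i];

    uint32_t Wide[8];                       // reinterpret concat as 8 x i32
    std::memcpy(Wide, Concat, sizeof(Wide));
    for (int i = 0; i < 8; ++i)
      assert(Shuffled[i] == uint16_t(Wide[i])); // truncate == even-element shuffle
  }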
2020-08-18 11:11:58 +01:00
Craig Topper b673dfbb9a [X86] When manually creating intrinsic nodes in X86ISelLowering, make sure we use getTargetConstant and pointer type for the intrinsic ID.
Doesn't really matter in practice but that's how the nodes are
normally created by SelectionDAGBuilder. So we should match.

Found by temporarily hacking type checks into isel table.
2020-08-17 17:25:53 -07:00
Craig Topper 2ffa5d218f [X86] Rename INTR_TYPE_4OP to INTR_TYPE_4OP_IMM8 and truncate immediates to MVT::i8
This makes sure VPTERNLOG is generated with an MVT::i8 immediate,
as its SDNode declaration in X86InstrFragmentsSIMD.td declares.
2020-08-17 17:25:52 -07:00
Craig Topper bc244f08cf [X86] Truncate immediate to i8 for INTR_TYPE_3OP_IMM8
This is used for DBPSADBW which has an i32 immediate for its
intrinsic and an i8 immediate in tablegen isel patterns.
2020-08-17 17:25:51 -07:00
Simon Pilgrim 1d2ede87ea [X86][AVX] Move lowerShuffleWithVPMOV inside explicit shuffle lowering cases
Perform lowerShuffleWithVPMOV as part of the v16i8/v8i16 shuffle lowering stages, which are the only types that are currently supported.

We need to expand support for lowering shuffles as truncations to fix the remaining regressions in D66004
2020-08-17 11:58:51 +01:00
Craig Topper a206f85091 [X86] Reject dirflag in inline asm constraints other than clobber.
Fixes the crash from PR47195.
2020-08-16 23:33:45 -07:00
Simon Pilgrim f25d47b7ed [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
We can now enable this for AVX1 targets, and it can now assist with canonicalizeShuffleMaskWithHorizOp cleanup.

There are still a few missed opportunities for merging subvector insert/extracts into shuffles, but they shouldn't cause any regressions now.
2020-08-16 15:00:41 +01:00
Simon Pilgrim dca7eb7d60 [X86][SSE] Replace combineShuffleWithHorizOp with canonicalizeShuffleMaskWithHorizOp
Instead of just attempting to fold shuffle(HOP,HOP) for a specific target shuffle, make this part of combineX86ShufflesRecursively so we can perform this on the combined shuffle chain, which is particularly useful for recognising more cases where we're performing multiple HOPs that can be merged, and for pre-AVX targets where we don't have good blend/unary target shuffle support.
2020-08-16 12:26:27 +01:00
Simon Pilgrim c27baa54b7 [X86] isRepeatedTargetShuffleMask - don't require specific MVT type. NFC.
Split isRepeatedTargetShuffleMask into a wrapper variant that takes an MVT describing the mask width, and an internal version that just needs the raw mask element bit size.

This will be necessary for an upcoming change where the horizontal ops element width might not match the shuffle mask element width.
2020-08-16 11:51:44 +01:00
Simon Pilgrim e9eb2dc332 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813

Recommit of rG9bd97d036398 with fixed offset adjustments
2020-08-14 18:43:19 +01:00
Simon Pilgrim cd3b850a4c rG9bd97d0363987b582 - Revert "[X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))"
This reverts commit 9bd97d0363.

Seeing some codegen issues in internal testing.
2020-08-13 15:21:15 +01:00
Simon Pilgrim a31d20e67e [X86][SSE] IsElementEquivalent - add HOP(X,X) support
For HADD/HSUB/PACKS ops with repeated operands, the lower/upper half elements of each lane are known to be equivalent.
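A scalar sketch of the claim for one 128-bit lane of a v4i32 HADD (illustrative model, not the LLVM helper):

  #include <array>
  #include <cassert>

  // hadd for one v4i32 lane: [x0+x1, x2+x3, y0+y1, y2+y3]
  static std::array<int, 4> hadd(const std::array<int, 4> &X,
                                 const std::array<int, 4> &Y) {
    return {X[0] + X[1], X[2] + X[3], Y[0] + Y[1], Y[2] + Y[3]};
  }

  int main() {
    std::array<int, 4> X = {1, 2, 3, 4};
    std::array<int, 4> R = hadd(X, X);      // repeated operand
    assert(R[0] == R[2] && R[1] == R[3]);   // lower/upper half elements match
  }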
2020-08-13 12:42:59 +01:00
Simon Pilgrim 39de63aef9 Fix signed/unsigned comparison warnings. NFC. 2020-08-12 19:22:13 +01:00
Simon Pilgrim 13d6cf0951 [X86][SSE] Pull out BUILD_VECTOR operand equivalence tests. NFC.
Pull out element equivalence code from isShuffleEquivalent/isTargetShuffleEquivalent; I've also removed many of the index modulos where possible.

First step toward simply adding some additional equivalence tests.
2020-08-12 18:20:18 +01:00
Simon Pilgrim 9bd97d0363 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813
2020-08-12 12:16:36 +01:00
Simon Pilgrim a0c2c6aa42 [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
Only do this for AVX2+ targets as we still get some regressions on AVX1 without PERMPD/PERMQ
2020-08-12 11:31:05 +01:00
Simon Pilgrim 2655bd51d6 [X86][SSE] combineShuffleWithHorizOp - canonicalize SHUFFLE(HOP(X,Y),HOP(Y,X)) -> SHUFFLE(HOP(X,Y))
Attempt to canonicalize binary shuffles of HOPs with commuted operands to a unary shuffle.
2020-08-11 18:13:03 +01:00
Eric Christopher 8155cb27a2 Fold Opcode into assert uses to fix an unused variable warning without asserts. 2020-08-11 09:30:51 -07:00
Simon Pilgrim fe1f36986b [X86][SSE] combineShuffleWithHorizOp - avoid unnecessary subtraction. NFCI.
We can safely replace ((M - NumElts) % NumEltsPerLane) with (M % NumEltsPerLane) as the modulo result will be the same.
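A quick standalone check of why the subtraction is redundant, assuming (as in these shuffle masks) that NumElts is a whole multiple of NumEltsPerLane and M indexes the second operand:

  #include <cassert>

  int main() {
    const int NumEltsPerLane = 4;  // e.g. 4 x f32 per 128-bit lane
    const int NumElts = 8;         // e.g. v8f32: two lanes
    for (int M = NumElts; M < 2 * NumElts; ++M)
      assert((M - NumElts) % NumEltsPerLane == M % NumEltsPerLane);
  }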
2020-08-11 17:07:32 +01:00
Simon Pilgrim 91d59cbf1b [X86][SSE] Add HADD/SUB support to combineHorizOpWithShuffle
Handles some HOP(SHUFFLE,SHUFFLE) patterns and sets us up to improve some of the cases mentioned in PR41813.
2020-08-11 16:14:14 +01:00
Kerry McLaughlin 85c7e89f3b [CodeGen] Refactor getMemBasePlusOffset & getObjectPtrOffset to accept a TypeSize
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D85220
2020-08-11 12:17:10 +01:00
Simon Pilgrim 49016eeab6 [X86] Rename combineVectorPackWithShuffle -> combineHorizOpWithShuffle. NFC.
The plan is to use this for (F)HADD/SUB opcodes as well as PACKs - similar to how we use combineShuffleWithHorizOp
2020-08-11 11:38:43 +01:00
Wang, Pengfei 9512525947 [X86][FPEnv] Teach X86 mask compare intrinsics to respect strict FP semantics.
When we use mask compare intrinsics under the strict FP option, the masked
elements shouldn't raise any exception. So, we can't replace the
intrinsic with a full compare + "and" operation.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D85385
2020-08-11 10:28:41 +08:00
Simon Pilgrim 9a368d2b00 [X86][SSE] shuffle(hop,hop) - canonicalize unary hop(x,x) shuffle masks
If a shuffle is referring to both the lower and upper half lanes of a unary horizontal op, then canonicalize the mask to only refer to the lower half.
2020-08-10 16:09:27 +01:00
Simon Pilgrim 07e673a02b [X86][SSE] Pull out shuffle(hop,hop) combine into combineShuffleWithHorizOp helper. NFC. 2020-08-10 15:08:57 +01:00
Simon Pilgrim e6dc2c8ce7 [X86][SSE] combineTargetShuffle - rearrange shuffle(hop,hop) matching to delay shuffle mask manipulation. NFC.
Check that we're shuffling hadd/pack ops first before altering shuffle masks.

First step towards adding extra functionality, plus it avoids costly shuffle mask manipulation if not necessary.
2020-08-10 14:13:19 +01:00
Craig Topper d3153b5ca2 [X86] Remove a DCI.isBeforeLegalize() call from combineVSelectWithAllOnesOrZeros.
This was blocking the isTypeLegal call so that we could do a particular
transform on illegal types before type legalization. But then we
create a target specific node using that type. We shouldn't do
that if the type isn't legal. So I think we should just always
make sure the type is legal.

I suspect that in order to get the condition VT to not be a vector
of i1 we've already completed type legalization anyway, so this probably
doesn't matter much in practice.
2020-08-08 14:19:13 -07:00
Simon Pilgrim cc15380f10 [X86][SSE] combineTargetShuffle - use scaleShuffleMask helper to widen shuffle mask. NFCI.
Use scaleShuffleMask helper for the shuffle(hadd,hadd) canonicalization.
2020-08-08 19:36:18 +01:00
Craig Topper 514b00c439 [X86] Limit the scope of the min/max canonicalization in combineSelect
Previously the transform was doing these two canonicalizations
(x > y) ? x : y -> (x >= y) ? x : y
(x < y) ? x : y -> (x <= y) ? x : y

But those don't seem to be useful generally. And they actively
pessimize the cases in PR47049.

This patch limits it to
(x > 0) ? x : 0 -> (x >= 0) ? x : 0
(x < -1) ? x : -1 -> (x <= -1) ? x : -1

These are the cases mentioned in the comments as the motivation
for the canonicalization. These allow the CMOV to use the S
flag from the compare thus improving opportunities to use a TEST
or the flags from an arithmetic instruction.
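The scalar equivalences behind the remaining canonicalizations can be checked in isolation (illustrative only):

  #include <cassert>

  int main() {
    for (int x = -3; x <= 3; ++x) {
      assert(((x > 0) ? x : 0) == ((x >= 0) ? x : 0));
      assert(((x < -1) ? x : -1) == ((x <= -1) ? x : -1));
    }
  }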
2020-08-07 22:51:49 -07:00
Keno Fischer c58674df14 [X86] Don't produce bad x86andp nodes for i1 vectors
In D85499, I attempted to fix this same issue by canonicalizing
andnp for i1 vectors, but since there was some opposition to such
a change, this commit just fixes the bug by using two different
forms depending on which kind of vector type is in use. We can
then always decide to switch the canonical forms later.

Description of the original bug:
We have a DAG combine that tries to fold (vselect cond, 0000..., X) -> (andnp cond, x).
However, it does so by attempting to create an i64 vector with the number
of elements obtained by truncating division by 64 from the bitwidth. This is
bad for mask vectors like v8i1, since that division is just zero. Besides,
we don't want i64 vectors anyway. For i1 vectors, switch the pattern
to (andnp (not cond), x), which is the canonical form for `kandn`
on mask registers.

Fixes https://github.com/JuliaLang/julia/issues/36955.

Differential Revision: https://reviews.llvm.org/D85553
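For a single i1 lane the fold reduces to a simple boolean identity, which can be sanity-checked in isolation (illustrative only):

  #include <cassert>
  #include <initializer_list>

  int main() {
    // vselect(cond, 0, x) per lane == andn(cond, x) == (!cond) & x
    for (bool cond : {false, true})
      for (bool x : {false, true})
        assert((cond ? false : x) == (!cond && x));
  }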
2020-08-07 20:05:47 -04:00
Craig Topper 0215ae9735 [X86] Remove incomplete custom handling of i128 sdivrem/udivrem on Windows.
We need to have special handling of i128 div/rem on Windows due
to a weird calling convention needed for the libcall. There was
also some code that made it look like we do the same for sdivrem/udivrem,
but the code didn't account for multiple return values of those
functions so it couldn't possibly work. I think this code never
triggers because we don't have libcall names defined for those
functions by default, so DAGCombine never creates DIVREM nodes.
2020-08-05 23:01:07 -07:00
Craig Topper 08b2d0a963 [X86] Disable copy elision in LowerMemArgument for scalarized vectors when the loc VT is a different size than the original element.
For example a v4f16 argument is scalarized to 4 i32 values. So
the values are spread out instead of being packed tightly like
in the original vector.

Fixes PR47000.
2020-08-05 15:44:54 -07:00
Simon Pilgrim b60f998859 [X86][SSE] Fold 128-bit PACK(EXTEND(X),EXTEND(Y)) -> CONCAT(X,Y) subvectors
This is seen in the sub-128-bit vector trunc(ext()) of comparison results

Fixes pr46585.ll regression in D66004
2020-08-05 18:27:40 +01:00
Simon Pilgrim 6a06c7a0a7 [X86] isHorizontalBinOp - only update LHS/RHS references on success
We've had issues in the past where isHorizontalBinOp calls would affect later combines because the LHS/RHS references had been commuted even though the match had failed.
2020-08-05 15:09:52 +01:00
Simon Pilgrim a57bfb44bc [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for integer types 2020-08-05 15:09:51 +01:00
Simon Pilgrim 6f0da46d53 [X86] getFauxShuffleMask - drop unnecessary computeKnownBits OR(X,Y) shuffle decoding.
Now that rG47cea9e82dda941e lets us aggressively decode multi-use shuffles for the OR(SHUFFLE(),SHUFFLE()) case we don't need the computeKnownBits variant any more.
2020-08-04 15:57:47 +01:00
Simon Pilgrim 051f293b78 [X86] Remove unused canScaleShuffleElements helper
The only use was removed at rG36750ba5bd0e9e72

Thanks to @nemanjai for the heads up
2020-08-04 14:51:23 +01:00
Simon Pilgrim 36750ba5bd [X86][AVX] isHorizontalBinOp - relax lane-crossing limits for AVX1-only targets.
Permit lane-crossing post shuffles on AVX1 targets as long as every element comes from the same source lane, which for v8f32/v4f64 cases can be efficiently lowered with the LowerShuffleAsLanePermuteAnd* style methods.
2020-08-04 14:27:01 +01:00
Simon Pilgrim 47cea9e82d Revert rG66e7dce714fab "Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.""
[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero (REAPPLIED)

This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.

Reapplied with fixed signed/unsigned comparisons.
2020-08-04 10:32:39 +01:00
Wang, Pengfei 6bc7ea2d8d [X86][AVX512] Fix build fail after D81548
Test function mask_cmp_128 failed during ISEL:
LLVM ERROR: Cannot select: t37: v8i1 = X86ISD::KSHIFTL t48, TargetConstant:i8<4>
due to v8i1 only being available under AVX512DQ.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84922
2020-08-04 12:31:04 +08:00
Mitch Phillips 66e7dce714 Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero."
This reverts commit 219f32f4b6.

Commit contains unsigned comparisons that break bots that build with
-Wsign-compare.
2020-08-03 13:48:30 -07:00
Simon Pilgrim 219f32f4b6 [X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.
This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.
2020-08-03 18:32:47 +01:00
Simon Pilgrim 99a971cadf [X86][SSE] Start shuffle combining from ANY_EXTEND_VECTOR_INREG on SSE targets
We already do this on AVX (+ for ZERO_EXTEND_VECTOR_INREG), but this enables it for all SSE targets - we attempted something similar back at rL357057 but hit issues with the ZERO_EXTEND_VECTOR_INREG handling (PR41249).

I'm still looking at the vector-mul.ll regression - which is due to 32-bit targets performing the load as an f64, resulting in the shuffle combiner thinking it has to create a shuffle in the float domain.
2020-08-03 13:41:48 +01:00
Craig Topper 64516ec7c1 [X86] Use parity flag from byte test/cmp instruction for __builtin_parity when input fits in 8 bits.
If the upper bits of the __builtin_parity idiom are known to be
0, we were previously emitting an xor with 0 to get the parity flag.
But we can use cmp/test instead, which may expose opportunities for
load folding or combining an AND.
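One source pattern of the kind this targets, assuming a GCC/Clang-style __builtin_parity (an illustrative example, not taken from the commit's tests):

  #include <cassert>

  // After the mask the upper bits are known zero, so the backend can test the
  // low byte with an 8-bit cmp/test and read the parity flag directly.
  static int low_byte_parity(unsigned x) {
    return __builtin_parity(x & 0xff);
  }

  int main() {
    assert(low_byte_parity(0x07) == 1);   // 3 set bits -> odd parity
    assert(low_byte_parity(0x103) == 0);  // masks to 0x03 -> even parity
  }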
2020-08-02 10:45:04 -07:00
Simon Pilgrim d14a22da5e [DAG] TargetLowering::LowerAsmOutputForConstraint - pass SDLoc as const&
Try to be more consistent with the SDLoc param in the TargetLowering methods.
2020-08-02 15:12:02 +01:00
Simon Pilgrim 20fbbbc583 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:23 +01:00
Simon Pilgrim d7e2616741 [X86] Pass SDLoc by const reference. NFCI. 2020-08-02 14:32:22 +01:00
Simon Pilgrim 3f276840b6 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:22 +01:00
Simon Pilgrim 2700311cce [X86] combineX86ShuffleChain - pull out repeated RootVT.getSizeInBits() calls. NFCI. 2020-08-02 14:32:22 +01:00
Craig Topper 56166a3a52 [X86] Improve parity idiom recognition to handle (and (truncate (ctpop X)), 1).
Fixes part of PR46954
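A plausible C-level idiom that produces this DAG shape, assuming GCC/Clang builtins (illustrative, not the PR46954 test case):

  #include <cassert>

  // popcount followed by "& 1" spells parity; the cast to a narrower type
  // introduces the truncate in (and (truncate (ctpop X)), 1).
  static int parity(unsigned x) {
    return (unsigned char)__builtin_popcount(x) & 1;
  }

  int main() {
    assert(parity(0x7) == 1);     // 3 set bits
    assert(parity(0xF0F0) == 0);  // 8 set bits
  }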
2020-08-01 22:59:43 -07:00
Simon Pilgrim 82a5c848e7 [X86][AVX512] Fold concat(and(x,y),and(z,w)) -> and(concat(x,z),concat(y,w)) for 512-bit vectors
Helps vpternlog folding on non-AVX512BW targets
2020-08-01 20:34:39 +01:00
Simon Pilgrim bb13c34c3a [X86][AVX] Ensure we only combine to PSHUFLW/PSHUFHW on supporting targets
Noticed while investigating combining from concatenated shuffle vectors: we weren't checking that PSHUFLW/PSHUFHW was legal - we were depending on lowering splitting to subvectors.
2020-08-01 19:18:11 +01:00
Simon Pilgrim 1b1901536a [X86][AVX] Extend v2f64 BROADCAST(LOAD) -> BROADCAST_LOAD to v2i64/v4f32/v4i32
Minor precursor fix for D66004, but it helps the SSE41 tests as well since they run with -disable-peephole.
2020-08-01 12:28:29 +01:00
Matt Arsenault 57bd64ff84 Support addrspacecast initializers with isNoopAddrSpaceCast
Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs
with the DataLayout.
2020-07-31 10:42:43 -04:00
Simon Pilgrim 2dec72ba5c [X86][SSE] combineExtractWithShuffle - extend extract(truncate(x),0) for any source vector size
As long as we can extract the lowest 128-bit subvector from the pre-truncated source vector, then we don't care what size it is.

The next stage will be to support non-zero extraction indices, as long as its still coming from the lowest 128-bit subvector.
2020-07-30 12:27:49 +01:00
Simon Pilgrim a1c9529e60 [X86][AVX] isHorizontalBinOp - relax no-lane-crossing limit for AVX1-only targets.
Instead of never accepting v8f32/v4f64 FHADD/FHSUB if the input shuffle masks cross lanes, perform the matching and determine if the post shuffle mask simplifies to a 'whole lane shuffle' mask - in which case we are guaranteed to cheaply perform this as a VPERM2F128 shuffle.
2020-07-29 20:49:10 +01:00
Craig Topper c4823b24a4 [X86] Add custom lowering for llvm.roundeven with sse4.1.
We can use the roundss/sd/ps/pd instructions like we do for
ceil/floor/trunc/rint/nearbyint.

Differential Revision: https://reviews.llvm.org/D84592
2020-07-29 10:23:08 -07:00
Simon Pilgrim 0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but the broadcast case was a simpler win.
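A sketch of the 2-bits-per-element imm8 encoding and the broadcast canonicalization (the undef handling here is a simplified model, not the actual helper):

  #include <cassert>

  // PSHUFD/SHUFPS-style imm8: two bits per destination element.
  static unsigned buildImm8(const int Mask[4], const int UndefFill[4]) {
    unsigned Imm = 0;
    for (int i = 0; i < 4; ++i) {
      int M = (Mask[i] < 0) ? UndefFill[i] : Mask[i]; // -1 == undef
      Imm |= unsigned(M) << (2 * i);
    }
    return Imm;
  }

  int main() {
    const int Mask[4] = {2, -1, -1, -1};  // only element 2 is demanded
    const int Inline[4] = {0, 1, 2, 3};   // undef -> in-place index
    const int Bcast[4] = {2, 2, 2, 2};    // undef -> broadcast of element 2
    assert(buildImm8(Mask, Inline) == 0xE6); // {2,1,2,3}
    assert(buildImm8(Mask, Bcast) == 0xAA);  // {2,2,2,2} - full broadcast
  }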
2020-07-29 11:32:44 +01:00
Simon Pilgrim 4838cd46a9 [X86][XOP] Shuffle v16i8 using VPPERM(X,Y) instead of OR(PSHUFB(X),PSHUFB(Y)) 2020-07-28 19:56:10 +01:00
Simon Pilgrim 182111777b [X86][SSE] Attempt to match OP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
An initial backend patch towards fixing the various poor HADD combines (PR34724, PR41813, PR45747 etc.).

This extends isHorizontalBinOp to check if we have per-element horizontal ops (odd+even element pairs), but not in the expected serial order - in which case we build a "post shuffle mask" that we can apply to the HOP result, assuming we have fast-hops/optsize etc.

The next step will be to extend the SHUFFLE(HOP(X,Y)) combines as suggested on PR41813 - accepting more post-shuffle masks even on slow-hop targets if we can fold it into another shuffle.

Differential Revision: https://reviews.llvm.org/D83789
2020-07-28 10:04:14 +01:00
Craig Topper 647e861e08 [X86] Detect if EFLAGs is live across XBEGIN pseudo instruction. Add it as livein to the basic blocks created when expanding the pseudo
XBEGIN causes several basic blocks to be inserted. If flags are live across it, we need to make EFLAGS live-in to the new basic blocks to avoid machine verifier errors.

Fixes PR46827

Reviewed By: ivanbaev

Differential Revision: https://reviews.llvm.org/D84479
2020-07-27 21:15:35 -07:00
Simon Pilgrim 4d84d94969 [X86][SSE] Relax 128-bit restriction on extract_subvector(ext_vector_inreg(X),0) -> ext_vector_inreg(extract_subvector(X,0)) fold
We only need to ensure that the source is larger than the subvector result type
2020-07-27 17:50:36 +01:00
Simon Pilgrim ab4ffa52f0 [X86][AVX] Fold extract_subvector(truncate(x),0) -> truncate(extract_subvector(x),0)
This is currently only supported for VLX targets where the op should be legal.
2020-07-27 14:51:29 +01:00
Simon Pilgrim f720c9c68c [X86] combineExtractSubvector - pull out repeated getSizeInBits() calls. NFCI. 2020-07-27 14:51:28 +01:00
Simon Pilgrim 17eafe0841 [X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening
If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0 (elements 0,1 of the v4i32), which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls.

By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.
2020-07-26 16:04:22 +01:00
Simon Pilgrim 3b21823e4a [X86][SSE] combineX86ShufflesRecursively - move all Root node asserts to the same location. NFCI.
Minor tidyup for some upcoming shuffle combine improvements.
2020-07-25 12:48:14 +01:00
Simon Pilgrim 66998ae59f [X86][SSE] getFauxShuffle - ignore undemanded sources for PACKSS/PACKUS faux shuffles
If we don't care about an entire LHS/RHS of the PACK op, then we can just treat it the same as undef (we don't care if it saturates) and it is safe to treat as a shuffle.

This can happen if we attempt to decode as a faux shuffle before SimplifyDemandedVectorElts has been called on the PACK, which should replace the source with UNDEF entirely.
2020-07-25 10:51:14 +01:00
Simon Pilgrim d720ba1e4b [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add SSE shift multiple use handling
Add SimplifyMultipleUseDemandedVectorElts peek through for imm/var SSE shifts
2020-07-23 14:39:03 +01:00
Simon Pilgrim 5b5dc2442a [X86][AVX] getTargetShuffleMask - don't decode VBROADCAST(EXTRACT_SUBVECTOR(X,0)) patterns.
getTargetShuffleMask is used by the various "SimplifyDemanded" folds so we can't assume that the bypassed extract_subvector can be safely simplified - getFauxShuffleMask performs a more general decode that allows us to more safely catch many of these cases so the impact is minimal.
2020-07-21 21:55:44 +01:00
Sanjay Patel 50afa18772 [x86] split FMA with fast-math-flags to avoid libcall
fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware)

C/C++ code may use explicit fma() calls (which become LLVM fma
intrinsics in IR) but then gets compiled with -ffast-math or similar.
For targets that do not have FMA hardware, we don't want to go out to
the math library for a precise but slow FMA result.

I tried this as a generic DAGCombine, but it caused infinite looping
on more than 1 other target, so there's likely some over-reaching fma
formation happening.

There's also a potential intersection of strict FP with fast-math here.
Deferring to current behavior for that case (assuming that strict-ness
overrides fast-ness).

Differential Revision: https://reviews.llvm.org/D83981
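A minimal source-level example of the situation being described, assuming a reassoc-enabling flag such as -ffast-math and a target without FMA hardware (illustrative only):

  #include <cassert>
  #include <cmath>

  // With reassociation allowed and no FMA hardware this can now lower to a
  // multiply + add instead of a call into the math library.
  static double muladd(double a, double b, double c) {
    return std::fma(a, b, c);
  }

  int main() {
    assert(muladd(2.0, 3.0, 4.0) == 10.0);
  }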
2020-07-19 10:03:55 -04:00
Craig Topper 5408024fa8 [X86] Move integer hadd/hsub formation into a helper function shared by combineAdd and combineSub.
There was a lot of duplicate code here for checking the VT and
subtarget. Moving it into a helper avoids that.

It also fixes a bug where combineAdd reused Op0/Op1 after a call
to isHorizontalBinOp may have changed them. The new helper function
has its own local versions of Op0/Op1 that aren't shared by other
code.

Fixes PR46455.

Reviewed By: spatel, bkramer

Differential Revision: https://reviews.llvm.org/D83971
2020-07-16 13:27:27 -07:00
Craig Topper 9adf7461f7 [X86] Add test case for PR46455. 2020-07-16 11:06:55 -07:00
Craig Topper f8f007e378 [X86] Consistently use 128 as the PSHUFB/VPPERM index for zero
Bit 7 of the index controls zeroing; the other bits are ignored when bit 7 is set. Shuffle lowering was using 128 and shuffle combining was using 255. Seems like we should be consistent.

This patch changes shuffle combining to use 128 to match lowering.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83587
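A byte-level model of the index semantics, per the ISA description of bit 7 zeroing (illustrative, not the shuffle combining code):

  #include <cassert>
  #include <cstdint>

  // One byte of a 128-bit PSHUFB: bit 7 of the index zeroes the result,
  // otherwise the low 4 bits select a source byte.
  static uint8_t pshufb_byte(const uint8_t Src[16], uint8_t Idx) {
    return (Idx & 0x80) ? 0 : Src[Idx & 0x0F];
  }

  int main() {
    uint8_t Src[16];
    for (int i = 0; i < 16; ++i)
      Src[i] = uint8_t(i + 1);
    assert(pshufb_byte(Src, 128) == 0);  // index used by lowering
    assert(pshufb_byte(Src, 255) == 0);  // index used by combining
  }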
2020-07-12 10:52:43 -07:00
Craig Topper 04013a07ac [X86] Fix two places that appear to misuse peekThroughOneUseBitcasts
peekThroughOneUseBitcasts checks the use count of the operand of the bitcast, not the bitcast itself. So I think that means we need to do any outside hasOneUse checks before calling the function, not after.

I was working on another patch where I misused the function and did a very quick audit to see if there were other similar mistakes.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83598
2020-07-12 10:52:43 -07:00
Wang, Pengfei e628092524 [X86][MMX] Optimize MMX shift intrinsics.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D83534
2020-07-11 11:16:23 +08:00
Simon Pilgrim 4cc26a44ca [X86][SSE] Use shouldUseHorizontalOp helper to determine whether to use (F)HADD. NFCI. 2020-07-10 12:13:34 +01:00
Simon Pilgrim 77133cc1e2 [X86][AVX] Attempt to fold PACK(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(PACK(X,Y)).
Truncations lowered as shuffles of multiple (concatenated) vectors often leave us with lane-crossing shuffles that feed a PACKSS/PACKUS. If both shuffles are fed from the same 2 vector sources, then we can PACK the sources directly and shuffle the result instead.

This is currently limited to whole i128 lanes in a 256-bit vector, but we can extend this if the need arises (but I'm not seeing many examples in real world code).
2020-07-10 09:33:27 +01:00
Craig Topper 918e653186 [X86] Immediately call LowerShift from lowerBuildVectorToBitOp.
If we don't immediately lower the vector shift, the splat
constant vector we created may get turned into a constant pool
load before we get around to lowering the shift. This makes it
a lot more difficult to create a shift by constant. Sometimes we
fail to see through the constant pool at all and end up trying
to lower as if it was a variable shift. This requires custom
handling and may create an unsupported vselect on pre-sse-4.1
targets. Since we're after LegalizeVectorOps we are unable to
legalize the unsupported vselect as that code is in LegalizeVectorOps
rather than LegalizeDAG.

So calling LowerShift immediately ensures that we get to see the
splat constant.

Fixes PR46527.

Differential Revision: https://reviews.llvm.org/D83455
2020-07-09 10:51:29 -07:00
Hiroshi Yamauchi 2c1a9006dd [PGO][PGSO] Add profile guided size optimization to X86 ISel Lowering. 2020-07-09 10:43:45 -07:00
Craig Topper 3e75912005 [X86] Directly emit X86ISD::BLENDV instead of VSELECT in a few places that were emitting sign bit tests.
Technically a VSELECT expects a vector of all-1s or all-0s elements
for its condition. But we aren't guaranteeing that the sign bit
and the non-sign bits match in these locations. So we should use
BLENDV, which is more relaxed.

Differential Revision: https://reviews.llvm.org/D83447
2020-07-09 10:40:09 -07:00
Simon Pilgrim f54402b63a [X86][AVX] Attempt to fold extract_subvector(shuffle(X)) -> extract_subvector(X)
If we're extracting a subvector from a shuffle that is shuffling entire subvectors we can peek through and extract the subvector from the shuffle source instead.

This helps remove some cases where concat_vectors(extract_subvector(),extract_subvector()) legalization has resulted in BLEND/VPERM2F128 shuffles of the subvectors.
2020-07-09 14:09:24 +01:00
Benjamin Kramer b44470547e Make helpers static. NFC. 2020-07-09 13:48:56 +02:00
Simon Pilgrim 800fb68420 [X86][SSE] Pull out PACK(SHUFFLE(),SHUFFLE()) folds into its own function. NFC.
Future patches will extend this, so declutter combineVectorPack before we start.
2020-07-08 17:42:42 +01:00
Simon Pilgrim 08a2c9ce5c [X86] Fix copy+paste typo in combineVectorPack assert message. NFC. 2020-07-08 17:42:42 +01:00
Sanjay Patel 9114900287 [x86] improve codegen for non-splat bit-masked vector compare and select (PR46531)
vselect ((X & Pow2C) == 0), LHS, RHS --> vselect ((shl X, C') < 0), RHS, LHS

Follow-up to D83073 - the non-splat mask cases where we actually see an
improvement are quite limited from what I can tell. AVX1 needs multiply
and blend capabilities and AVX2 needs vector shift and blend capabilities.
The intersection of those 2 constraints is only vectors with 32-bit or
64-bit elements.

XOP is/was better.

Differential Revision: https://reviews.llvm.org/D83181
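A scalar check of the per-element equivalence behind the fold, taking C' = BitWidth - 1 - log2(Pow2C) (an illustrative model, not the DAG code):

  #include <cassert>
  #include <cstdint>
  #include <initializer_list>

  int main() {
    const int LHS = 10, RHS = 20;
    for (uint32_t X : {0u, 1u, 4u, 5u, 0xFFFFFFFFu}) {
      for (int k = 0; k < 32; ++k) {
        uint32_t Pow2C = uint32_t(1) << k;
        int Cp = 31 - k;                                // move tested bit to sign bit
        int Orig = ((X & Pow2C) == 0) ? LHS : RHS;
        int Xform = (int32_t(X << Cp) < 0) ? RHS : LHS; // operands swapped
        assert(Orig == Xform);
      }
    }
  }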
2020-07-08 08:20:49 -04:00
Simon Pilgrim 9dc250db9d [X86][AVX] SimplifyDemandedVectorEltsForTargetShuffle - ensure mask is same size as constant size
Fixes test regression reported on D81791
2020-07-08 11:47:59 +01:00
Simon Pilgrim c00a27752e [X86][AVX] Remove redundant EXTRACT_VECTOR_ELT(VBROADCAST(SCALAR())) fold
Noticed while looking for similar cases to rG931ec74f7a29 - SimplifyDemandedVectorElts and shuffle combining both should handle this now.
2020-07-08 10:18:36 +01:00
Simon Pilgrim 931ec74f7a [X86][AVX] Don't fold PEXTR(VBROADCAST_LOAD(X)) -> LOAD(X).
We were checking the VBROADCAST_LOAD element size against the extraction destination size instead of the extracted vector element size - PEXTRW/PEXTRB have implicit zext'ing so they have i32 destination sizes for v8i16/v16i8 vectors, resulting in us extracting from the wrong part of a load.

This patch bails from the fold if the vector element sizes don't match, and we now use the target constant extraction code later on like the pre-AVX2 targets, fixing the test case.

Found by internal fuzzing tests.
2020-07-07 19:10:03 +01:00
Sanjay Patel 642eed3713 [x86] fix miscompile in buildvector v16i8 lowering
In the test based on PR46586:
https://bugs.llvm.org/show_bug.cgi?id=46586
...we are inserting 16-bits into the high element of the vector, shuffling it
to element 0, and extracting 32-bits. But xmm1 was never initialized, so the
top 16-bits of the extract are undef without this patch.

(It seems like we could do better than this by recognizing that we only demand
a subsection of the build vector, but I want to make sure we fix the
miscompile 1st.)

This path is only used for pre-SSE4.1, and simpler patterns get squashed
somewhere along the way, so the test still includes a 'urem' as it did in the
original test from the bug report.

Differential Revision: https://reviews.llvm.org/D83319
2020-07-07 13:02:31 -04:00
Liu, Chen3 ea85ff82c8 [X86] Fix a bug that when lowering byval argument
When an argument has the 'byval' attribute and should be
passed on the stack according to the calling convention,
a stack copy would be emitted twice. This causes the real
value to be put onto the stack where the pointer should
be passed.

Differential Revision: https://reviews.llvm.org/D83175
2020-07-07 21:49:31 +08:00
Xiang1 Zhang 939d8309db [X86-64] Support Intel AMX Intrinsic
INTEL ADVANCED MATRIX EXTENSIONS (AMX).
AMX is a new programming paradigm: it has a set of 2-dimensional registers
(TILES) representing sub-arrays from a larger 2-dimensional memory image,
and it operates on those TILES.

These intrinsics use direct TMM register numbers as their params.

Spec can be found in Chapter 3 here https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D83111
2020-07-07 10:13:40 +08:00
Craig Topper e652c0f8f3 [X86] Teach lowerShuffleAsBlend to use bit blend for v16i8/v32i8/v16i16 when avx512vl is enabled but not avx512bw.
Probably not super important since there are no real CPUs with
avx512vl and not avx512bw. But vpternlog should be better than
vblendvb.

I do wonder if we should use vpternlog even with BWI. We
currently use vpblendmb or vpblendmw by putting the mask into a GPR
and moving it to a k-register. But I don't think we hoist the
GPR to k-register copy in machine LICM. Using VPTERNLOG would use
a constant pool load, but has the advantage that we're pretty good
at hoisting and rematerializing those.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83156
2020-07-04 10:26:56 -07:00
Craig Topper b4eb415a99 [X86] Disable VPBLENDVB formation in combineLogicBlendIntoPBLENDV if VPTERNLOG is supported.
VPBLENDVB is multiple uops while VPTERNLOG is a single uop. So
we should use that instead.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83155
2020-07-04 10:12:19 -07:00
Simon Pilgrim 71f342d6c3 [X86][AVX] Fold PACK(LOSUBVECTOR(SHUFFLE(X)),HISUBVECTOR(SHUFFLE(X))) -> SHUFFLE(PACK(LOSUBVECTOR(X),HISUBVECTOR(X)))
Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed.

This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles.

A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.
2020-07-04 13:54:30 +01:00
Craig Topper fed432523e [X86] Directly emit VPTERNLOG from canonicalizeBitSelect when possible.
Seems to produce better results on some rotate tests. And is
neutral for other tests.
2020-07-03 22:08:28 -07:00
Sanjay Patel 26543f1c0c [x86] improve codegen for bit-masked vector compare and select (PR46531)
We canonicalize patterns like:
  %s = lshr i32 %a0, 1
  %t = trunc i32 %s to i1

to:
  %a = and i32 %a0, 2
  %c = icmp ne i32 %a, 0

...in IR, but the bit-shifting original sequence may be better for x86 vector codegen.

I tried several variants of the transform, and it's tricky to not induce regressions.
In particular, I did not find a way to cleanly handle non-splat constants, so I've left
that as a TODO item here (currently negative tests for those are included). AVX512
resulted in some diffs, but didn't look meaningful, so I left that out too. Some of
the 256-bit AVX1 diffs are questionable, but close enough that they are probably
insignificant.

Differential Revision: https://reviews.llvm.org/D83073
2020-07-03 17:31:57 -04:00