This replaces the attempt in 20af71f8ec to use combineToExtendBoolVectorInReg to create X86ISD::BLENDV masks directly; instead we use it to canonicalize the iX bitcast to a sign-extended mask and then truncate it back to vXi1 prior to legalization breaking it apart.
Fixes #53760
If we're comparing a value against zero, strip away any zero-extension and perform the comparison on the pre-extended value
Fixes #38308
Differential Revision: https://reviews.llvm.org/D121472
If the SETCC fp-condcode is supported on SSE as a single CMPPS/PD op then we can use convertIntLogicToFPLogic to reduce EFLAGS and XMM->GPR traffic like we do for AVX targets.
Differential Revision: https://reviews.llvm.org/D121210
If the shift amount has been zero-extended, peek through as this might help us further canonicalize the shift amount.
Fixes regression mentioned in rG147cfcbef1255ba2b4875b76708dab1a685085f5
This completes the removal of uses of SelectionDAG::getSplatValue started in D119090 - by avoiding extracting the splatted element we make it a lot easier to zero-extend the bottom 64 bits of the shift amount and fix issues we had on 32-bit targets where i64 isn't legal.
I've removed the old version of getTargetVShiftNode that took the scalar shift amount argument and LowerRotate can finally efficiently handle vXi16 rotates-by-scalar (using the same code as general funnel-shifts).
The only regression we see is in the X86-AVX2 PR52719 test case in vector-shift-ashr-256.ll - this is now hitting the same problem as the X86-AVX1 case (failure to simplify a multi-use X86ISD::VBROADCAST_LOAD), which I intend to address in a follow-up patch.
For i16/32/64 vectors, if the upper bits are known to be zero, then we can try to truncate to vXi8 (if it's worth it) and perform this as a PSADBW to add+zext each v4i8 subvector to an i64 sum, which we can then reduce together.
This addresses some of the PR42674 test cases where the source data was vXi8 but had been extended to match a wider unsigned integer accumulator.
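For reference, a minimal SSE2 intrinsics sketch of the underlying PSADBW trick (illustrative only, not the lowering code itself): summing a byte vector against zero leaves a byte-sum in each 64-bit half, which can then be added together.

// Illustrative sketch: PSADBW against zero sums the bytes of each 64-bit
// half; the two partial sums are then added to give the full byte sum.
#include <emmintrin.h>
#include <stdint.h>

static uint32_t sum_bytes_16(__m128i v) {
  __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
  return (uint32_t)_mm_cvtsi128_si32(sad) +        // sum of bytes 0..7
         (uint32_t)_mm_extract_epi16(sad, 4);      // sum of bytes 8..15
}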
Differential Revision: https://reviews.llvm.org/D120193
Using getSplatValue causes poor codegen due to not always being able to remove the EXTRACT_VECTOR_ELT created inside getSplatValue.
The vXi16 shifts/rotates are still showing occasional regressions but vXi8 is a definite improvement.
combineX86ShuffleChain no longer has to assume that the shuffle inputs are the right size, so don't create unnecessary nodes messing up oneuse limits as detailed on Issue #45319
Removing widening from combineX86ShufflesRecursively will be the next step, followed by removing combineX86ShuffleChainWithExtract entirely
With only a load-fold the diffs look neutral. If there's a load and store (rmw)
fold opportunity as shown in the test based on #53862, then we end up with an
extra instruction.
Fixes #53862
Differential Revision: https://reviews.llvm.org/D120281
Peek through if we're extracting a non-zero'th subvector in an attempt to fold the extract into a lane-crossing shuffle
This also exposes a failure to fold extract_subvector(movddup(x),c) -> movddup(extract_subvector(x,c))
Extension to PR45974: unless we actually combine the target shuffles, we shouldn't be generating temporary nodes as they may interfere with the one-use checks in the shuffle recursions
When building 32b x86 code as PIC, the existing handling of "i"
constraints is conservative since generally we have to go through the
GOT to find references to functions.
But generally, BlockAddresses from C code refer to the Function in the
current TU. Permit BlockAddresses to be used with the "i" constraint
for those cases.
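As a hypothetical illustration (not the test case from the patch), this is the kind of construct that should now be accepted when building 32-bit PIC code:

// Hypothetical example: a BlockAddress for a label in the current function
// used with the "i" constraint; names and asm text are illustrative only.
void record_label(void) {
  asm volatile("# block address: %0" ::"i"(&&resume));
resume:;
}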
I regressed this in
commit 4edb9983cb ("[SelectionDAG] treat X constrained labels as i for asm")
Fixes: https://github.com/llvm/llvm-project/issues/53868
Reviewed By: efriedma, MaskRay
Differential Revision: https://reviews.llvm.org/D119905
Extend the existing split where we already do this for v32i16/v64i8
We can end up trying to use PCMPEQ/GT if the result needs to be sign-extended (typically due to the DAGCombiner::foldSextSetcc fold).
Fixes #53842
This is a retry of b4b97ec813 - that was reverted because it
could cause miscompiles by illegally reordering memory operations.
A new test based on #53695 is added here to verify we do not have
that same problem.
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
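Roughly, at the C/intrinsics level (an assumed-equivalent sketch, not the motivating case from #50310):

// Sketch of the fold: extracting one element of a loaded vector can instead
// be a scalar load at the element's offset, even if the vector load has
// other uses.
#include <smmintrin.h>  // SSE4.1 for _mm_extract_epi32

int extract_elt2(const int *p) {
  __m128i v = _mm_loadu_si128((const __m128i *)p);
  return _mm_extract_epi32(v, 2);   // equivalent to the scalar load p[2]
}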
Fixes #50310
Fixes #53695
Differential Revision: https://reviews.llvm.org/D118376
D108887 fixed an alignment mismatch by changing the caller's alignment in
the ABI. However, we found some cases that still assume the alignment is
the vector size. This patch fixes them to avoid the runtime crash.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D114536
Extend the existing fold to use SimplifyMultipleUseDemandedBits as well as SimplifyDemandedVectorElts/SimplifyDemandedBits when attempting to simplify based off known zero vector elements.
To find uniform shift/rotation amounts, we currently use SelectionDAG::getSplatValue, which creates a node that extracts the scalar value from the source vector; this makes it more difficult for later combines to remove the extraction and stay on the SIMD unit, and can be a problem when the scalar type is illegal (i.e. i64 vs v2i64 on 32-bit targets).
This patch begins to use SelectionDAG::getSplatSourceVector (which SelectionDAG::getSplatValue uses internally) and adds a new variant of getTargetVShiftNode that takes the source vector and the splat index, and adjusts the vector in place to create the zero-extended value suitable for the SSE PSLL/PSRL/PSRA uniform instructions.
I'm still addressing a number of regressions when used for normal vector shifts, so I've just handled the funnelshift/rotation lowering for this first patch. I can then focus on the yak shaving (SimplifyDemandedBits/Elts in particular) necessary to always use SelectionDAG::getSplatSourceVector.
Differential Revision: https://reviews.llvm.org/D119090
Pulled out of D106237, this replaces the X86ISD::AVG DAG node with the
generic ISD::AVGCEILU. It doesn't remove the detectAVGPattern method,
but the extra generic ISel matching does alter the existing test.
Differential Revision: https://reviews.llvm.org/D119073
Replace the *_EXTEND node with the raw operands, this will make it easier to use combineToExtendBoolVectorInReg for any boolvec extension combine.
Cleanup prep for Issue #53760
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
This reverts commit b4b97ec813.
As discussed in post-commit feedback at:
https://reviews.llvm.org/D118376
...there's a stage 2 failure on a Mac running a clang-refactor tool test.
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a 1st
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actually touch these (they're purely used internally down the recursive call) - it's trivial to add another wrapper if anything ever does want to track known elements.
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
Without AVX512 (which can efficiently extend/truncate to vXi16/vXi32), unpacking/packing to vXi16 is more efficient than relying on the (uops-heavy) PBLENDV shift expansion
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes #50310
Differential Revision: https://reviews.llvm.org/D118376
Previous folds by combineSetCCMOVMSK might have converted these to CMP when changing the bitwidth, and the CMP->SUB fold might not have happened yet (or might not happen at all)
rG9103b73fe052 was assuming that we could OR/AND with the source vector, but that will fail on float/double vectors without bitcasting - it also missed the case that any_of checks might be testing less than all the source elements
InstCombine performs this more generally with SimplifyUsingDistributiveLaws, but we don't need anything that complex here - this is mainly to fix up cases where logic ops get created late on during lowering, often in conjunction with sext/zext ops for type legalization.
https://alive2.llvm.org/ce/z/gGpY5v
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
There are no further reasons to limit this to cmpeq-with-zero; the outstanding regressions with lowering to PTEST have now been addressed
Improves codegen for Issue #53379
vXi16 allof(cmp()) reduction patterns will have to pack the comparison results to vXi8 to use PMOVMSKB.
If we're reducing cmpeq(), then we can compare the vXi8 halves directly - similar to what we already do for vXi64 -> vXi32 for cases without PCMPEQQ.
As suggested on PR53379, for all-of icmp-eq patterns, we can use ptest(sub(x,y)) on SSE41+ targets
This is a generalization of the existing allof(cmpeq(x,0)) -> ptest(x) pattern
We can probably extend this further, in particular to handle 256-bit cases on pre-AVX2 targets, but this part of the generalization is pretty trivial
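A minimal intrinsics sketch of the equivalence being exploited (illustrative, assuming i32 elements):

// Sketch: all-of(x == y) across a v4i32 is the same as checking that the
// whole vector x - y is zero, which PTEST can do directly (SSE4.1).
#include <smmintrin.h>

int all_lanes_equal(__m128i x, __m128i y) {
  __m128i d = _mm_sub_epi32(x, y);
  return _mm_testz_si128(d, d);   // ZF set iff every bit of d is zero
}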
Fixes Issue #53379
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
AVX512 doesn't provide an ADDSUB instruction, but if we've built this from a build vector of scalar fsub/fadd elements we can still lower to blend(fsub,fadd)
This allows us to match shuffle<1,3,5,7,9,11,13,15> style shift+trunc/pack patterns as well as the existing shuffle<0,2,4,6,8,10,12,14> style shuffle trunc/pack patterns
In the future, interleaving patterns might benefit from an even more general implementation for higher strides
Allows shuffle combining through per-element shift nodes
This exposed a number of issues with shuffle combining with target intrinsics that are lowered to nodes later during legalization - in particular shuffle combining and SimplifyDemandedVectorElts were being called after canonicalizeShuffleWithBinOps, meaning that shuffles didn't have a chance to be combined away before the shuffle(binop(x,y)) -> binop(shuffle(x),shuffle(y)) fold.
Partial element rotate patterns (e.g. for element insertion on Issue #53124) were being split if every lane wasn't crossing, but really there's a good repeated mask hiding in there.
This is a suggested follow-up to D116765.
This removes a clear of the register operand, so it is better
for code size, but it does potentially create a false register
dependency on surrounding code. If that is a problem, it should
be solvable using dependency-breaking code that is used for
other instructions.
Differential Revision: https://reviews.llvm.org/D116804
To enable this on all targets there are still a number of regressions due to getSplatValue/getTargetVShiftNode, but these don't really affect pre-AVX targets.
This is very similar to the existing ROTL/ROTR support for scalar shifts in LowerRotate, I think as time goes on we should be able to share much of this code in helpers between Funnel Shift + Rotation lowering.
select (X != 0), -1, Y --> 0 - X; or (sbb), Y
select (X != 0), Y, -1 --> X - 1; or (sbb), Y
We already had these x86 carry-flag transforms, but one was over-specified to
handle a "0" select arm only. That's just a special-case of the more general
pattern (the 'or' will be deleted if Y is zero).
This is part of solving #53006, but it misses that example because some other
combine has already converted that exact pattern into math ops.
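A scalar model of the first pattern, purely as an illustrative sanity check:

// Scalar model of "select (X != 0), -1, Y --> 0 - X (sets carry); sbb; or":
// the sbb result is all-ones when X != 0 and zero otherwise, so OR-ing it
// with Y yields the select.
#include <assert.h>
#include <stdint.h>

static uint32_t select_neg1_or_y(uint32_t x, uint32_t y) {
  uint32_t sbb = (uint32_t)0 - (x != 0);  // -1 if x != 0, else 0
  return sbb | y;
}

int main(void) {
  assert(select_neg1_or_y(0, 42) == 42);
  assert(select_neg1_or_y(7, 42) == UINT32_MAX);
  return 0;
}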
Differential Revision: https://reviews.llvm.org/D116765
For the C code below, we can use VNNI to combine the mul and add operations.

int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c,
                  int n) {
  int i;
  for (i = 0; i < 32; i++) {
    c += ((int)a[i] * (int)b[i]);
  }
  return c;
}
We don't support combining across basic blocks in this patch.
Differential Revision: https://reviews.llvm.org/D116039
SelectionDAG::getConstant creates an APInt internally anyway, and getLowBitsSet helps assert for legal bitwidths. Plus it silences static analyzer out-of-bounds shift warnings.
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
Fix issue in TargetLowering::expandROT where we only attempt to flip a rotation if the other direction has better support - this matches TargetLowering::expandFunnelShift
This allows us to enable ISD::ROTR lowering on SSE targets, which particularly simplifies/improves codegen for splat amount and AVX2 per-element shifts.
Move shift-by-constant handling into its only user (VSHIFT intrinsics lowering).
This is some prep-work for getTargetVShiftNode to no longer take a scalar shift amount - we're introducing temporary ISD::EXTRACT_VECTOR_ELT nodes via SelectionDAG::getSplatValue to accommodate this which can cause various issues, including unnecessary scalarization and xmm->gpr->xmm transfers, and causes problems for 32-bit codegen if we fail to remove an (illegal) i64 scalar extracted from a (legal) vXi64 vector.
Pull out the "rotl(x,y) --> (unpack(x,x) << zext(splat(y % bw))) >> bw" special case from vXi8 lowering so we can reuse it for vXi32 types as well.
There are still some regressions with vXi16 to handle before this becomes entirely general.
It also allows us to remove the now unnecessary hack for handling amount-modulo before splatting.
Instead of bailing and using the default expansion, we can more efficiently use the shl(unpack(x,x),unpack(amt,zero)) pattern for vXi8 rotl, as we'll then use vXi16 fast PMULLW (or PSLLVW).
This required some minor changes to improve constant folding during unpack shuffle creation and convertShiftLeftToScale to support constants that have already been lowered to constant pools.
We currently rely on generic promotion to vXi16/vXi32 types for rotation lowering on various AVX512 targets.
We can more efficiently perform this by making use of the shl(unpack(x,x),amt) style pattern that we already use for vXi8 rotation by splat amounts, either by widening to a larger vector type or unpacking lo/hi halves of the subvectors so we can access whatever vXi16/vXi32 per-element shifts are supported.
This uncovered an issue in the supportedVectorShiftWithImm/supportedVectorVarShift legality checkers which was using hasAVX512() instead of useAVX512Regs() to detect support for 512-bit vector shifts.
NOTE: I'm actually hoping to eventually reuse this code for shl(unpack(y,x),amt) funnel shift lowering (vXi8 and wider), but initially I just want to ensure we have efficient ISD::ROTL lowering for all targets.
Differential Revision: https://reviews.llvm.org/D115180
If either operand has an element with allbits set, then we don't need the equivalent element from the other operand, as allbits are guaranteed to be set.
We have reasonably fast sqrt and accurate rsqrt for the half type due to the
limited fraction bits. So we need neither multi-step refinement for rsqrt
nor to replace sqrt with rsqrt.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114844
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULUW if the source operands are SEXT/ZEXT instructions for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
Differential Revision: https://reviews.llvm.org/D113371
We have a transform that tries to turn a pmovmskb into movmskps/pd or
movmskps into movmskpd. This transform isn't valid if the truncate
discarded bits that might be set by the original movmsk.
We could fix this by inserting an AND after the new movmsk to discard
the equivalent of the truncated bits, but I've left that for a later
patch.
Fixes PR52567.
Differential Revision: https://reviews.llvm.org/D114306
For tagged-globals, we only need to disable relaxation for globals that
we actually tag. With this patch, function pointer relocations, which
we do not instrument, can be relaxed.
This patch also makes tagged-globals work properly with LTO, as
-Wa,-mrelax-relocations=no doesn't work with LTO.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D113220
Check for a hidden ISD::ROTR (rotl(sub(0,x))) - vXi8 lowering can handle both (it's always beneficial for splats, but otherwise only if we have VPTERNLOG).
We currently hit infinite loops in TargetLowering::expandROT if we set ISD::ROTR to custom, which needs addressing before we extend this much further.
This is aligned with GCC's behavior.
Also, alias `-mno-fp-ret-in-387` to `-mno-x87`, by which we can fix pr51498.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112143
AMD64 ABI mandates caller to specify the number of used SSE registers
when passing variable arguments.
GCC also provides option -mskip-rax-setup to skip the setup of rax when
SSE is disabled. This helps to reduce the code size, see pr23258.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112413
If we're rotating vXi8 by a splatted amount, then unpack to vXi16, perform a SHL by the (extended) scalar, and then pack the results.
This is a vector equivalent to the "rotl(x,y) -> (((aext(x) << bw) | zext(x)) << (y & (bw-1))) >> bw" style expansion we do for scalars in LowerFunnelShift.
I think we can usefully use this for other vector types and vector funnel-shifts in the future, depending how we expand beyond D113192 for matching rotations/funnel-shifts for more type/ops.
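For reference, a scalar sketch of that expansion for a single 8-bit rotate (illustrative only, not the vector lowering itself):

// Scalar model of the expansion: concatenate x with itself into the wider
// type, shift left by the masked amount, and take the upper byte.
#include <stdint.h>

static uint8_t rotl8(uint8_t x, unsigned y) {
  unsigned doubled = ((unsigned)x << 8) | x;       // (aext(x) << bw) | zext(x)
  return (uint8_t)((doubled << (y & 7)) >> 8);     // (... << (y & (bw-1))) >> bw
}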
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.
Avoid code duplication and makes it easier to split other vector instruction forms in the future.
The first try at this patch ( bf5748a1af ) was reverted ( 5be64d4164 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.
This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.
A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.
Original commit message:
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
If an operand is a bitcasted or widened constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.
I've refactored getAVX512Node so that VLX targets can make better use of this as well.
NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.
Differential Revision: https://reviews.llvm.org/D113845
This caused assertion failures:
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
&& "Cannot use this version of ReplaceAllUsesWith!"' failed.
See comment on the code review.
(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
manually due to other changes having landed after the reverted one.)
> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603
This reverts commit bf5748a1af.
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.
It also optionally packs the upper half instead of the lower half.
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.
Fixes regressions in D113192
Differential Revision: https://reviews.llvm.org/D113827
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).
I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.
We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.
This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.
This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.
Differential Revision: https://reviews.llvm.org/D113748
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
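An intrinsics sketch of the equivalence (SSE2, illustrative function names):

// Sketch: the pcmpgt mask "X > -1" selects non-negative lanes, which is the
// complement of the arithmetic-shift sign mask, so pandn(vsrai(X, 31), Y)
// produces the same result without the -1 constant vector.
#include <emmintrin.h>

__m128i mask_nonneg_before(__m128i x, __m128i y) {
  return _mm_and_si128(_mm_cmpgt_epi32(x, _mm_set1_epi32(-1)), y);
}

__m128i mask_nonneg_after(__m128i x, __m128i y) {
  return _mm_andnot_si128(_mm_srai_epi32(x, 31), y);  // pandn(vsrai(x,31), y)
}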
Differential Revision: https://reviews.llvm.org/D113603
This fixes the crash due to lacking VZEXT_MOVL support with i16.
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D113661
combineMulToPMADDWD is currently limited to legal types, but there's no reason why we can't handle any larger type that the existing SplitOpsAndApply code can use to split to legal X86ISD::VPMADDWD ops.
This also exposed a missed opportunity for pre-SSE41 targets to handle SEXT ops from types smaller than vXi16 - without PMOVSX instructions these will always be expanded to unpack+shifts, so we can cheat and convert this into a ZEXT(SEXT()) sequence to make it a valid PMADDWD op.
Differential Revision: https://reviews.llvm.org/D110995
Use ComputeMinSignedBits() to ensure the mul source operands at least sign-extend down from the bottom 16 bits.
This will make it easier if/when we try to support handling of source types larger than 32-bits.
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage; it's not exhaustive, just the more obvious cases where the intent is clear.
Differential Revision: https://reviews.llvm.org/D113396
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.
This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.
Differential Revision: https://reviews.llvm.org/D113351
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.
With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
The check for whether a zero extension was needed was subtly wrong and
saw a value that was already 64 bits, so did not extend.
Fixes PR52357.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112860
The changes in D80163 deferred the assignment of the MachineMemOperand (MMO)
until the X86ExpandPseudo pass. This will result in a crash due to the prolog
insert point being sunk across the pseudo instruction VASTART_SAVE_XMM_REGS.
Moving the assignment to the creation of the node can avoid the problem.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D112859
This is part of rG1cfecf4fc427 that was reverted to fix PR51226 - concatenating the broadcasts is OK, it's the splatted loads that crash (we're not detecting extloads). I'm still creating a reduced test case so haven't added the load handling again yet.
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.
This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.
There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?
Noticed while investigating the quality of interleaved load/store codegen.
Differential Revision: https://reviews.llvm.org/D111960
As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe.
If the index has sufficient sign bits then folding the scale could be safe - I'll investigate this.
If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead.
Fixes PR13310.
Differential Revision: https://reviews.llvm.org/D108539
In the instruction encoding, the passthru register is always
tied to the destination register. The CPU scheduler has to wait
for the last writer of this register to finish executing before
the gather can start. This is true even if the initial mask is
all ones so that the passthru will never be used.
By explicitly zeroing the register we can break the false
dependency. The zero idiom is executed completely by the
register renamer and so is immediately considered ready.
Authored by Craig.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D112505
As noted in D112464, a pre-AVX target may not be able to fold an
under-aligned vector load into another op, so we shouldn't report
that as a load folding candidate. I only found one caller where
this would make a difference -- combineCommutableSHUFP() -- so
that's where I added a test to show the (minor) regression.
Differential Revision: https://reviews.llvm.org/D112545
This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant.
Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.
Authored by Craig.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111595
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.
That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.
Differential Revision: https://reviews.llvm.org/D112138
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)
This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD
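A scalar sanity-check sketch of that identity (illustrative only):

// Scalar check: subtracting the sign mask with unsigned saturation clears
// the sign bit when it is set and yields zero otherwise, which is exactly
// (x ^ SMIN) & (x s>> BW-1).
#include <assert.h>
#include <stdint.h>

static uint32_t usubsat_smin(uint32_t x) {
  const uint32_t SMIN = 0x80000000u;
  return x >= SMIN ? x - SMIN : 0;                      // usubsat(x, SMIN)
}

static uint32_t xor_and_form(uint32_t x) {
  return (x ^ 0x80000000u) & (uint32_t)((int32_t)x >> 31);
}

int main(void) {
  uint32_t tests[] = {0, 1, 0x7FFFFFFFu, 0x80000000u, 0x80000001u, 0xFFFFFFFFu};
  for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i)
    assert(usubsat_smin(tests[i]) == xor_and_form(tests[i]));
  return 0;
}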
Differential Revision: https://reviews.llvm.org/D112095
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.
The lshr also has a better chance of folding with shuffles etc.
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.
There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.
Fixes PR50991 for X86.
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D111858
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with
other recent changes to use a single flag where possible. It also
matches gcc codegen.
I believe this technically changes whether the conditional move happens
on INT_MIN, but for INT_MIN both registers are the same so it doesn't
matter.
Differential Revision: https://reviews.llvm.org/D111826
As noted in https://github.com/halide/Halide/pull/6302,
we hilariously fail to match PAVG if we even as much
as look at it the wrong way.
In this particular case, the problem stems from the fact that
`PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s,
and InstCombine really loves to get rid of both of these,
for example replace them with a bit mask. So we may not have
said `zext`.
Instead of checking for that + type match,
I think we should rely on the actual active type,
as per the knownbits.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D111571
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.
PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
Without popcnt we had a special case for using the parity flag from a single i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.
To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.
Fixes PR52093
Differential Revision: https://reviews.llvm.org/D111249
The X86 backend only needs to know whether structure return is via an
sret pointer. This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.
Differential Revision: https://reviews.llvm.org/D109966
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.
Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.
Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one-use checks.
Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.
Unfortunately this isn't great for SSE41+ targets, which have PMULLD for vXi32 multiplies, but it is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even the latest IceLake has really poor latencies for PMULLQ.
Differential Revision: https://reviews.llvm.org/D110588
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.
This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.
Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).
X86CallingConv.td doesn't seem to handle i128 at all, otherwise
the windows specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.
This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.
Differential Revision: https://reviews.llvm.org/D110413
The intention of this code turns out to be superfluous, as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.
Fixes PR51974
When AVX512FP16 is enabled, FROUND(f16) cannot be handled by type
legalization, and no libcall in libm is ready for fround(f16) yet.
FROUNDEVEN(f16) has a related instruction in AVX512FP16.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110312
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. it's zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.
There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.
We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.
By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.
Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.
Differential Revision: https://reviews.llvm.org/D110342
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
This patch is to support transforming something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).
Differential Revision: https://reviews.llvm.org/D109953
This change adds the ASan intrinsic to the list which sets hasCopyImplyingStackAdjustment.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D110012
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.
Differential Revision: https://reviews.llvm.org/D109993
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.
Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.
It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be present in several elements,
not just the 0'th one.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108411
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108382
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao
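A small intrinsics check of the first case, with values chosen to satisfy the zero precondition (illustrative only):

// Illustrative check: when every lane we are not picking is zero, BLENDV and
// OR produce identical results (SSE4.1).
#include <smmintrin.h>
#include <assert.h>
#include <string.h>

int main(void) {
  __m128i mask = _mm_set_epi32(0, -1, 0, -1);   // pick b in lanes 0 and 2
  __m128i a = _mm_set_epi32(7, 0, 9, 0);        // zero in the lanes where b is picked
  __m128i b = _mm_set_epi32(0, 0, 0, 3);        // zero in the lanes where a is picked
  __m128i blend = _mm_blendv_epi8(a, b, mask);
  __m128i orred = _mm_or_si128(a, b);
  assert(memcmp(&blend, &orred, sizeof(blend)) == 0);
  return 0;
}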
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D109726
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
Soft deprecate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
This library function only exists in compiler-rt not libgcc. So
this would fail to link unless we were linking with compiler-rt.
This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.
I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D109385
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`. This achieves two things:
1) This starts standardizing predicates across the LLVM codebase,
following (in this case) ConstantInt. The word "Value" doesn't
convey anything of merit, and is missing in some of the other things.
2) Calling an integer "null" doesn't make any sense. The original sin
here is mine and I've regretted it for years. This moves us to calling
it "zero" instead, which is correct!
APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go. As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.
Included in this patch are changes to a bunch of the codebase, but there are
more. We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.
Differential Revision: https://reviews.llvm.org/D109483
integer 0/1 for the operand of bundle "clang.arc.attachedcall"
https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.
This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.
Differential Revision: https://reviews.llvm.org/D103000
Similar to D108842, D108844, and D108926.
__has_builtin(__builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using __builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: lebedev.ri, RKSimon
Differential Revision: https://reviews.llvm.org/D108928
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. in each 2xi16 pair, the first half is positive and the second half is zero); this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
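A scalar model of why that is safe for a single lane (illustrative, not the combine itself):

// Scalar model of one PMADDWD lane: if x is zero-extended from i16 (upper 17
// bits zero) and y has at least 17 sign bits, the high-half product is zero
// and the low-half product equals the full i32 multiply.
#include <assert.h>
#include <stdint.h>

static int32_t pmaddwd_lane(uint32_t x, uint32_t y) {
  int32_t lo = (int32_t)(int16_t)(x & 0xFFFF) * (int32_t)(int16_t)(y & 0xFFFF);
  int32_t hi = (int32_t)(int16_t)(x >> 16) * (int32_t)(int16_t)(y >> 16);
  return lo + hi;
}

int main(void) {
  uint32_t x = 0x5ABC;                    // upper 17 bits known zero
  uint32_t y = (uint32_t)(int32_t)-1234;  // at least 17 sign bits (fits in i16)
  assert(pmaddwd_lane(x, y) == (int32_t)(x * y));
  return 0;
}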
Noticed while investigating PR47437.
Differential Revision: https://reviews.llvm.org/D108522
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in a broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore I take it upon myself
to revert the tree back to the last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.