Commit Graph

8142 Commits

Author SHA1 Message Date
Fangrui Song fee52fe0ad [X86] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off build. NFC 2021-11-15 13:10:47 -08:00
Simon Pilgrim 441de2536b [X86] Add generic splitVectorOp helper. NFC
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.

Avoid code duplication and makes it easier to split other vector instruction forms in the future.
2021-11-15 17:59:23 +00:00
Simon Pilgrim fc7c1cebbc [X86] LowerFunnelShift - pull out repeated EltSizeInBits variable. NFC. 2021-11-15 17:11:44 +00:00
Sanjay Patel 3d01507c2d [x86] fold vector (X > -1) & Y to shift+andn (2nd try)
The first try at this patch ( bf5748a1af ) was reverted ( 5be64d4164 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.

This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.

A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.

Original commit message:

and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-15 11:09:32 -05:00
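A minimal standalone sketch (illustrative C++, not LLVM code) of the scalar identity behind this fold, modeled on a single i32 lane; the vector instructions apply the same rewrite per element:
```
#include <cassert>
#include <cstdint>

// and(pcmpgt(X, -1), Y) --> pandn(vsrai(X, 31), Y), modeled on one i32 lane.
// pcmpgt(X, -1) is all-ones iff X > -1 (i.e. X is non-negative);
// vsrai(X, 31) replicates the sign bit, so its complement is the same mask.
uint32_t cmp_then_and(int32_t x, uint32_t y) {
  uint32_t mask = (x > -1) ? 0xFFFFFFFFu : 0u; // pcmpgt X, -1
  return mask & y;                             // pand
}

uint32_t shift_then_andn(int32_t x, uint32_t y) {
  uint32_t sign = (uint32_t)(x >> 31);         // vsrai X, 31 (0 or all-ones)
  return ~sign & y;                            // pandn
}

int main() {
  for (int32_t x : {INT32_MIN, -7, -1, 0, 1, 42, INT32_MAX})
    for (uint32_t y : {0u, 1u, 0xDEADBEEFu, 0xFFFFFFFFu})
      assert(cmp_then_and(x, y) == shift_then_andn(x, y));
}
```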
Simon Pilgrim ea9e6aa423 [X86] getAVX512Node() - find constant broadcasts to encourage load-folding
If an operand is a bitcasted or widened constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.

I've refactored getAVX512Node so that VLX targets can make better use of this as well.

NOTE: In the future, I think we should consider removing the broadcast of constant data from the DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets have similar problems with missed (whole vector) folds that need to be improved as well.

Differential Revision: https://reviews.llvm.org/D113845
2021-11-15 15:52:03 +00:00
Hans Wennborg 5be64d4164 Revert "[x86] fold vector (X > -1) & Y to shift+andn"
This caused assertion failures:

  llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
  void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
  Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
  && "Cannot use this version of ReplaceAllUsesWith!"' failed.

See comment on the code review.

(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
 manually due to other changes having landed after the reverted one.)

> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603

This reverts commit bf5748a1af.
2021-11-15 12:35:49 +01:00
Simon Pilgrim c3a772fdf5 [X86] Add getPack helper
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.

It also optionally packs the upper half instead of the lower half.
2021-11-14 21:27:15 +00:00
Simon Pilgrim f4143ffed7 [X86] Widen 128/256-bit VPTERNLOG patterns to 512-bit on non-VLX targets
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.

Fixes regressions in D113192

Differential Revision: https://reviews.llvm.org/D113827
2021-11-14 13:40:53 +00:00
Simon Pilgrim a310cbae02 [X86] Add getAVX512Node helper. NFC.
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).

I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
2021-11-13 13:59:42 +00:00
Craig Topper 82bc6a094e [X86] Promote f16 STRICT_FROUND to f32 and call libc.
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D113817
2021-11-12 21:37:03 -08:00
Kazu Hirata efa896e5f7 [Target] Use SDNode::uses (NFC) 2021-11-12 21:23:04 -08:00
Simon Pilgrim 6bb71738e2 [X86] convertShiftLeftToScale - improve vXi8 constant handling
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.

We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
2021-11-12 16:48:10 +00:00
Simon Pilgrim 59087dce3b [X86] combineX86ShufflesConstants - constant fold from target shuffles unless optsize = true
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.

This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.

This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.

Differential Revision: https://reviews.llvm.org/D113748
2021-11-12 14:02:43 +00:00
Sanjay Patel bf5748a1af [x86] fold vector (X > -1) & Y to shift+andn
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-12 08:17:46 -05:00
Phoebe Wang 74b979abcd [X86][FP16] Avoid to generate VZEXT_MOVL with i16
This fixes the crash due to the lack of VZEXT_MOVL support for i16.

Reviewed By: LuoYuanke, RKSimon

Differential Revision: https://reviews.llvm.org/D113661
2021-11-12 09:32:29 +08:00
Simon Pilgrim 94a901a50a [X86] Move LowerFunnelShift below LowerShift. NFC.
Makes it easier to reuse the various vector shift helpers defined above LowerShift
2021-11-11 18:45:51 +00:00
Sanjay Patel a8abd19b10 [x86] simplify code; NFC
We bail out if the types don't match, so it's clearer to
have a single variable to show that common type.
2021-11-10 13:29:57 -05:00
Sanjay Patel 5424fb164a [x86] fix formatting; NFC 2021-11-10 13:29:57 -05:00
Simon Pilgrim a1e0aa75ca [X86] combineMulToPMADDWD - remove useless TODO
We should always be able to use PMULUDQ/PMULDQ in PMADDWD patterns with greater than 32-bit extended integer sources
2021-11-10 16:56:44 +00:00
Sanjay Patel be9e892e9d [x86] shorten function name; NFC 2021-11-10 09:44:55 -05:00
Simon Pilgrim d510fd2bed [X86] combineMulToPMADDWD - handle any pow2 vector type and split to legal types
combineMulToPMADDWD is currently limited to legal types, but there's no reason why we can't handle any larger type that the existing SplitOpsAndApply code can use to split to legal X86ISD::VPMADDWD ops.

This also exposed a missed opportunity for pre-SSE41 targets to handle SEXT ops from types smaller than vXi16 - without PMOVSX instructions these will always be expanded to unpack+shifts, so we can cheat and convert this into a ZEXT(SEXT()) sequence to make it a valid PMADDWD op.

Differential Revision: https://reviews.llvm.org/D110995
2021-11-09 15:20:43 +00:00
Simon Pilgrim 7f32edea23 [X86] combineMulToPMADDWD - use ComputeMinSignedBits(). NFCI.
Use ComputeMinSignedBits() to ensure the mul source operands at least sign-extend down from the bottom 16 bits.

This will make it easier if/when we try to support handling of source types larger than 32-bits.
2021-11-08 15:28:31 +00:00
Simon Pilgrim f059b04f7b [DAG] Add SelectionDAG::ComputeMinSignedBits helper
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.

I've included some usage; it's not exhaustive, just the more obvious cases.

Differential Revision: https://reviews.llvm.org/D113396
2021-11-08 14:12:45 +00:00
Simon Pilgrim f60d3ec0c7 [DAG] Add BuildVectorSDNode::getConstantRawBits helper
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.

This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.

Differential Revision: https://reviews.llvm.org/D113351
2021-11-08 12:07:38 +00:00
Simon Pilgrim 55e4cd8485 [X86][AVX2] Recognise 256-bit truncation shuffles and mask 256-bit source
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.

With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
2021-11-07 21:24:55 +00:00
Kazu Hirata 41ef3187e0 [ARM, X86] Use MachineBasicBlock::{predecessors,successors} (NFC) 2021-11-07 09:53:16 -08:00
Kazu Hirata 2c4ba3e9d3 [Target] Use make_early_inc_range (NFC) 2021-11-05 09:14:32 -07:00
Simon Pilgrim 5e9ac7c0a5 [X86] Enable v32i16 rotate lowering on non-BWI targets
Fixes one of the regressions in D113192
2021-11-05 11:00:31 +00:00
Simon Pilgrim 87d5bb66eb [X86][SSE] Improve PMADDWD SimplifyDemandedVectorElts handling
Check both operands for zero elements to remove unnecessary demanded elts.

Try to help reduce some minor regressions noticed in D110995
2021-11-04 12:56:31 +00:00
Harald van Dijk 889c2b97bd
[X86] Fix X32 indirect call generation
The check for whether a zero extension was needed was subtly wrong and
saw a value that was already 64 bits, so did not extend.

Fixes PR52357.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112860
2021-11-03 16:43:44 +00:00
Phoebe Wang 8f101971b6 [X86][VARARG] Assign MMO earlier to avoid prolog insert point being sunk across VASTART_SAVE_XMM_REGS
The changes in D80163 deferred the assignment of MachineMemOperand (MMO)
until the X86ExpandPseudo pass. This can result in a crash due to the prolog
insert point being sunk across the pseudo instruction VASTART_SAVE_XMM_REGS.

Moving the assignment to the creation of the node can avoid the problem.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D112859
2021-11-03 10:13:32 +08:00
Simon Pilgrim 53900a19fd [X86][AVX] combineConcatVectorOps - use getBROADCAST_LOAD helper for splat of normal vector loads. NFCI.
Reapplied from rG1cfecf4fc427 with fix for PR51226 - ensure the load is a normal (non-ext) load.
2021-11-02 20:03:25 +00:00
Simon Pilgrim 82e0eb22af [X86][AVX] combineConcatVectorOps - use getBROADCAST_LOAD helper. NFCI.
This is part of rG1cfecf4fc427 that was reverted to fix PR51226 - concatenating the broadcasts is OK, it's the splatted loads that crash (we're not detecting extloads). I'm still creating a reduced test case so haven't added the load handling again yet.
2021-11-02 18:04:35 +00:00
Simon Pilgrim e173631dd1 [X86][AVX] SimplifyDemandedVectorEltsForTargetNode - use getBROADCAST_LOAD helper. NFCI.
Reduce width of X86ISD::SUBV_BROADCAST_LOAD node.
2021-11-02 14:07:22 +00:00
Simon Pilgrim 8ca666a280 [X86][AVX] lowerV2X128Shuffle - use getBROADCAST_LOAD helper. NFCI. 2021-11-02 14:07:21 +00:00
Simon Pilgrim fd485d8cda [X86][AVX] Prefer VINSERTF128 over VPERM2F128 for 128->256 subvector concatenations
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.

This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.

There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?

Noticed while investigating the quality of interleaved load/store codegen.

Differential Revision: https://reviews.llvm.org/D111960
2021-11-01 10:45:50 +00:00
Sanjay Patel 837518d6a0 [x86] make mayFold* helpers visible to more files; NFC
The first function is needed for D112464, but we might
as well keep these together in case the others can be
used someday.
2021-10-29 15:48:35 -04:00
Simon Pilgrim 154c036ebb [X86] combineX86GatherScatter - only fold scale if the index isn't extended
As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe.

If the index has sufficient sign bits then folding the scale could be safe - I'll investigate this.
2021-10-29 11:48:05 +01:00
Simon Pilgrim d29ccbecd0 [X86][AVX] Attempt to fold a scaled index into a gather/scatter scale immediate (PR13310)
If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead.

Fixes PR13310.

Differential Revision: https://reviews.llvm.org/D108539
2021-10-28 14:07:17 +01:00
Phoebe Wang 2bc28c6f82 [X86] Add a dependency breaking xor before any gathers with an undef passthru value.
In the instruction encoding, the passthru register is always
tied to the destination register. The CPU scheduler has to wait
for the last writer of this register to finish executing before
the gather can start. This is true even if the initial mask is
all ones so that the passthru will never be used.

By explicitly zeroing the register we can break the false
dependency. The zero idiom is executed completely by the
register renamer and so is immediately considered ready.

Authored by Craig.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D112505
2021-10-28 11:44:52 +08:00
Sanjay Patel 6c0a2c2804 [x86] enhance mayFoldLoad to check alignment
As noted in D112464, a pre-AVX target may not be able to fold an
under-aligned vector load into another op, so we shouldn't report
that as a load folding candidate. I only found one caller where
this would make a difference -- combineCommutableSHUFP() -- so
that's where I added a test to show the (minor) regression.

Differential Revision: https://reviews.llvm.org/D112545
2021-10-27 07:54:25 -04:00
Sanjay Patel 2ab0148c14 [x86] use cast instead of dyn_cast for unchecked usage; NFC
This was noted as an independent clean-up in D112464.
2021-10-26 08:20:19 -04:00
Phoebe Wang 79f9dfef0d [X86] Move splat addends from the gather/scatter index operand to the base address
This can avoid a vector add and a constant pool load, or an explicit broadcast in the non-constant case.

Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.

Authored by Craig.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D111595
2021-10-26 12:35:57 +08:00
Simon Pilgrim b09f2ee57c [X86] findEltLoadSrc - fix shift amount variable name. NFCI.
Fix the copy + paste, renaming shift amt from Idx to Amt
2021-10-23 21:24:37 +01:00
Craig Topper cd824f9e39 [X86] Fix bad formatting. NFC 2021-10-22 13:16:35 -07:00
Sanjay Patel 40163f1df8 [x86] add special-case lowering for usubsat for AVX512
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.

That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.

Differential Revision: https://reviews.llvm.org/D112138
2021-10-20 16:41:13 -04:00
Sanjay Patel 3efd2a0bec [x86] make helper for useVPTERNLOG; NFC
See D112085 for another use case.
2021-10-20 10:26:53 -04:00
Sanjay Patel 92a0389b04 [x86] add special-case lowering for usubsat for pre-SSE4
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)

This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD

Differential Revision: https://reviews.llvm.org/D112095
2021-10-19 17:13:16 -04:00
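Standalone sketch of the special-case identity, modeled on one i8 lane with SMIN = 0x80 (the sign mask):
```
#include <cassert>
#include <cstdint>

// usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1), on one i8 lane.
uint8_t usubsat_smin(uint8_t x) {           // reference: unsigned saturating sub
  return x >= 0x80 ? (uint8_t)(x - 0x80) : 0;
}

uint8_t xor_and_ashr(uint8_t x) {
  uint8_t ashr = (uint8_t)((int8_t)x >> 7); // 0x00 or 0xFF (sign spread)
  return (uint8_t)((x ^ 0x80) & ashr);
}

int main() {
  for (unsigned x = 0; x < 256; ++x)
    assert(usubsat_smin((uint8_t)x) == xor_and_ashr((uint8_t)x));
}
```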
Simon Pilgrim 71e39e3f18 [ADT] Add APInt::isNegatedPowerOf2() helper
Inspired by D111968, provide an isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....

Differential Revision: https://reviews.llvm.org/D111998
2021-10-19 14:38:21 +01:00
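A plain-integer illustration of the predicate the new APInt helper provides (the real helper operates on APInt; this scalar version is just for intuition):
```
#include <cassert>
#include <cstdint>

// True iff -V is a (non-zero) power of two, i.e. V is -1, -2, -4, -8, ...
bool isNegatedPowerOf2(int64_t v) {
  if (v >= 0)
    return false;
  uint64_t neg = (uint64_t)(-(v + 1)) + 1;  // -v without overflowing on INT64_MIN
  return (neg & (neg - 1)) == 0;
}

int main() {
  assert(isNegatedPowerOf2(-1) && isNegatedPowerOf2(-8) && isNegatedPowerOf2(INT64_MIN));
  assert(!isNegatedPowerOf2(0) && !isNegatedPowerOf2(8) && !isNegatedPowerOf2(-6));
}
```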
Simon Pilgrim a83384498b [X86] combineMulToPMADDWD - replace ASHR(X,16) -> LSHR(X,16)
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace it with an lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.

The lshr also has a better chance of folding with shuffles etc.
2021-10-18 22:12:56 +01:00
Craig Topper beb7862db5 [X86] Add DAG combine for negation of CMOV absolute value pattern.
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.

There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.

Fixes PR50991 for X86.

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D111858
2021-10-16 13:31:43 -07:00
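Scalar sketch of the rewrite this combine performs: swapping the CMOV arms turns abs(x) into -abs(x), so the outer subtract becomes an add:
```
#include <cassert>
#include <cstdint>

// The rewrite: a - abs(x)  ==  a + (x < 0 ? x : -x).
int32_t sub_abs(int32_t a, int32_t x) {
  int32_t abs_x = x < 0 ? -x : x;       // CMOV-style abs
  return a - abs_x;
}

int32_t add_negabs(int32_t a, int32_t x) {
  int32_t negabs_x = x < 0 ? x : -x;    // same CMOV with the arms swapped
  return a + negabs_x;
}

int main() {
  for (int32_t a : {-100, 0, 7, 12345})
    for (int32_t x : {-42, -1, 0, 1, 99})
      assert(sub_abs(a, x) == add_negabs(a, x));
}
```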
Dávid Bolvanský 6678db00e6 [X86] Enable promotion of i16 popcnt (PR52056)
Solves https://bugs.llvm.org/show_bug.cgi?id=52056

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111507
2021-10-15 15:41:37 +02:00
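Illustration of why the promotion is safe: zero-extending an i16 to i32 adds no set bits, so a 32-bit popcount of the widened value gives the same answer (reference loop implementations, not the real lowering):
```
#include <cassert>
#include <cstdint>

unsigned popcount16(uint16_t v) {
  unsigned n = 0;
  for (int i = 0; i < 16; ++i)
    n += (v >> i) & 1u;
  return n;
}

unsigned popcount32(uint32_t v) {
  unsigned n = 0;
  for (; v; v &= v - 1)   // clear the lowest set bit each iteration
    ++n;
  return n;
}

int main() {
  for (uint32_t x = 0; x <= 0xFFFF; ++x)
    assert(popcount16((uint16_t)x) == popcount32(x));  // zext adds no set bits
}
```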
Kazu Hirata 81e9c90686 [llvm] Use llvm::is_contained (NFC) 2021-10-14 22:44:09 -07:00
Craig Topper 3ff9cc01f2 [X86] Use CMOVNS for abs instead of CMOVGE.
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with
other recent changes to use a single flag where possible. It also
matches gcc codegen.

I believe this technically changes whether the conditional move happens
on INT_MIN, but for INT_MIN both registers are the same so it doesn't
matter.

Differential Revision: https://reviews.llvm.org/D111826
2021-10-14 12:28:28 -07:00
Simon Pilgrim fb2539b9d8 [X86][SSE] Add X86ISD::AVG to isCommutativeBinOp to support folding shuffles through the binop 2021-10-13 10:54:37 +01:00
Roman Lebedev 958da6598f
[X86] `detectAVGPattern()`: don't require zext in the with-constant case 2021-10-12 20:24:17 +03:00
Roman Lebedev a1d57f75d1
[NFC][X86] `detectAVGPattern()`: rely on `AVGSplitter()` to perform truncation 2021-10-12 20:12:09 +03:00
Roman Lebedev 5f4f5da634
[X86] `detectAVGPattern()`: support basic case of PAVG chaining (PR52131)
As noted in https://github.com/halide/Halide/pull/6302,
we hilariously fail to match PAVG if we so much
as look at it the wrong way.

In this particular case, the problem stems from the fact that
`PAVG` root (def) is a `trunc`, and its leaves (uses) are `zext`s,
and InstCombine really loves to get rid of both of these,
for example replacing them with a bit mask. So we may not have
said `zext`.

Instead of checking for that + type match,
I think we should rely on the actual active type,
as per the knownbits.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111571
2021-10-12 19:51:43 +03:00
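Reference model of the per-byte PAVGB semantics this combine looks for, plus an equivalent mask/shift form that avoids the widening (illustrative only; not necessarily the exact form InstCombine produces):
```
#include <cassert>
#include <cstdint>

// Per-byte PAVGB: average rounded up, computed without overflow by widening.
uint8_t pavgb(uint8_t a, uint8_t b) {
  return (uint8_t)(((uint16_t)a + (uint16_t)b + 1) >> 1);
}

// Equivalent form without widening: avg = (a | b) - ((a ^ b) >> 1).
uint8_t pavgb_no_widen(uint8_t a, uint8_t b) {
  return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

int main() {
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b)
      assert(pavgb((uint8_t)a, (uint8_t)b) == pavgb_no_widen((uint8_t)a, (uint8_t)b));
}
```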
Roman Lebedev 7964c3ed82
[X86] `detectAVGPattern()`: small preparatory NFC refactor 2021-10-12 19:51:43 +03:00
Freddy Ye d57a87ea89 [X86][ISel] Lowering llvm.thread.pointer
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D110681
2021-10-12 11:01:18 +08:00
Simon Pilgrim ad16c6e52f [X86][AVX] Ensure we retain zero elements in select(pshufb,pshufb) -> or(pshufb,pshufb) fold (PR52122)
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.

PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
2021-10-11 14:50:24 +01:00
Arthur Eubanks a0a4935182 Make more places that use alignment use uint64_t
Followup to D110451.
2021-10-08 16:35:19 -07:00
Wang, Pengfei c236883b6b [X86] Optimize fdiv with reciprocal instructions for half type
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110557
2021-10-08 09:41:13 +08:00
Craig Topper 58b68e70eb [X86] Don't use popcnt for parity if only bits 7:0 of the input can be non-zero.
Without popcnt we had a special case for using the parity flag from a single i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.

To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.

Fixes PR52093

Differential Revision: https://reviews.llvm.org/D111249
2021-10-06 14:39:57 -07:00
Wolfgang Pieb f53d05135e [UBSAN][PS4] For the PS4 target, emit the ud2 opcode for ubsan traps.
Reviewed By: probinson

Differential Revision: https://reviews.llvm.org/D111172
2021-10-06 11:49:36 -07:00
Nathan Sidwell c11e7b59d2 [X86][NFC] structure-return simplification
The X86 backend only needs to know whether structure return is via an
sret pointer.  This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.

Differential Revision: https://reviews.llvm.org/D109966
2021-10-06 03:12:48 -07:00
Simon Pilgrim bf30c48419 [X86] SimplifyDemandedVectorEltsForTargetNode - simplify PMADDWD for known zero elements
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.

Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
2021-10-04 14:36:45 +01:00
Jay Foad 566690b067 [APFloat] Remove BitWidth argument from getAllOnesValue
There's no need to pass this in explicitly because it is
trivially available from the semantics.
2021-10-04 11:32:16 +01:00
Jay Foad a9bceb2b05 [APInt] Stop using soft-deprecated constructors and methods in llvm. NFC.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.

Differential Revision: https://reviews.llvm.org/D110807
2021-10-04 08:57:44 +01:00
Simon Pilgrim 9452ec722c [X86][SSE] Fix typo + infinite-loop in HOP(HOP'(X,X),HOP'(Y,Y)) fold (PR52040)
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.

Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one-use checks.

Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
2021-10-02 15:31:12 +01:00
Simon Pilgrim bb42cc2090 [X86] decomposeMulByConstant - decompose legal vXi32 multiplies on SlowPMULLD targets and all vXi64 multiplies
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.

Unfortunately this isn't great for SSE41+ targets which have PMULLD for vXi32 multiplies, but it is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even the latest IceLake has really poor latencies for PMULLQ.

Differential Revision: https://reviews.llvm.org/D110588
2021-10-02 12:35:25 +01:00
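One-lane illustration of the shift+add/sub decomposition this patch now allows even when the vector multiply is legal (the constants 9 and 31 are arbitrary examples):
```
#include <cassert>
#include <cstdint>

// Decomposing a multiply-by-constant into shift+add/sub, the alternative
// this patch prefers when PMULLD/PMULLQ would be slow.
uint32_t mul9(uint32_t x)  { return x * 9; }
uint32_t mul9_decomposed(uint32_t x)  { return (x << 3) + x; }   // 9  = 8 + 1

uint32_t mul31(uint32_t x) { return x * 31; }
uint32_t mul31_decomposed(uint32_t x) { return (x << 5) - x; }   // 31 = 32 - 1

int main() {
  for (uint32_t x : {0u, 1u, 7u, 1000u, 0xFFFFFFFFu}) {
    assert(mul9(x) == mul9_decomposed(x));
    assert(mul31(x) == mul31_decomposed(x));
  }
}
```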
Martin Storsjö d6216e2cd1 [X86] Fix handling of i128<->fp on Windows
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.

This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.

Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).

X86CallingConv.td doesn't seem to handle i128 at all, otherwise
the Windows-specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.

This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.

Differential Revision: https://reviews.llvm.org/D110413
2021-09-29 13:05:59 +03:00
Liu, Chen3 57e8f840b6 [X86][FP16] Fix a bug when combining FADD(A, FMA(B, C, 0)) to FMA(B, C, A).
This bug was introduced by D109953. The operand order of the generated FMA
was wrong.

Differential Revision: https://reviews.llvm.org/D110606
2021-09-28 11:38:53 +08:00
Simon Pilgrim 468ff703e1 [X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974)
The intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.

Fixes PR51974
2021-09-27 10:41:22 +01:00
Freddy Ye 902ec6142a [X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16)
When AVX512FP16 is enabled, FROUND(f16) cannot be handled by type
legalization, and no libcall in libm is available for fround(f16) yet.
FROUNDEVEN(f16) has a corresponding instruction in AVX512FP16.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110312
2021-09-27 13:35:03 +08:00
Simon Pilgrim c0eff50fc5 [X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
2021-09-26 19:37:23 +01:00
Simon Pilgrim ed3e4917b3 [X86] Fold PACK(*_EXTEND_VECTOR_INREG, UNDEF) -> *_EXTEND_VECTOR_INREG
For 128-bit vectors, we can remove a PACK of a EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.
2021-09-26 19:37:22 +01:00
Simon Pilgrim 3fe9767204 [X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W))
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. it's zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.

There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
2021-09-26 18:08:29 +01:00
Simon Pilgrim eb7c78c2c5 [X86][SSE] combineMulToPMADDWD - mask off upper bits of sign-extended vXi32 constants
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
2021-09-25 15:50:45 +01:00
Simon Pilgrim 2a4fa0c27c [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on sub-128 bit vectors 2021-09-25 15:50:45 +01:00
Simon Pilgrim f5a26ccae2 [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on pre-SSE41 targets
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
2021-09-25 14:35:31 +01:00
Simon Pilgrim a25f25c3b7 [X86] combineShiftToPMULH - relax from ISA from SSE41 to SSE2
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.

We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
2021-09-25 14:08:03 +01:00
Simon Pilgrim d8fc9f8727 [X86][SSE] combineMulToPMADDWD - replace sext(v8i16) -> zext(v8i16)
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
2021-09-24 16:42:01 +01:00
Sanjay Patel 09e71c367a [x86] convert logic-of-FP-compares to FP logic-of-vector-compares
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.

By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.

Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.

Differential Revision: https://reviews.llvm.org/D110342
2021-09-24 11:38:19 -04:00
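A rough sketch of the shape of the transform using SSE intrinsics (assumes an x86 host; the actual patch targets AVX, where the full set of compare predicates is available): two scalar compares plus a logical AND become one packed compare plus a single movmskps.
```
#include <cassert>
#include <immintrin.h>

// Scalar form: two 'set' results and a logical AND.
bool both_less_scalar(float a0, float b0, float a1, float b1) {
  return (a0 < b0) && (a1 < b1);
}

// Vector form: one packed compare and one movmskps, then test both lanes.
bool both_less_vector(float a0, float b0, float a1, float b1) {
  __m128 a = _mm_setr_ps(a0, a1, 0.0f, 0.0f);
  __m128 b = _mm_setr_ps(b0, b1, 1.0f, 1.0f);  // pad lanes; only bits 0-1 are used
  int mask = _mm_movemask_ps(_mm_cmplt_ps(a, b));
  return (mask & 0x3) == 0x3;
}

int main() {
  assert(both_less_scalar(1, 2, 3, 4) == both_less_vector(1, 2, 3, 4));
  assert(both_less_scalar(1, 2, 5, 4) == both_less_vector(1, 2, 5, 4));
  assert(both_less_scalar(9, 2, 3, 4) == both_less_vector(9, 2, 3, 4));
}
```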
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch supports transforming something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
into _mm512_fmadd_pch(a, b, acc).

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list which sets hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Amara Emerson 4ceea77409 [X86] Rename the X86WinAllocaExpander pass and related symbols to "DynAlloca". NFC.
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.

Differential Revision: https://reviews.llvm.org/D109993
2021-09-20 16:19:28 -07:00
Simon Pilgrim 0e89ff8195 [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one use.
Helps with the regression noted on D109065 - don't truncate a broadcast source if the source has multiple uses.
2021-09-19 22:53:30 +01:00
Simon Pilgrim b7342e3137 [X86] Fold SHUFPS(shuffle(x),shuffle(y),mask) -> SHUFPS(x,y,mask')
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.

Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
2021-09-19 20:39:19 +01:00
Roman Lebedev 5f2fe48d06
[X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.

It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be present in several elements,
not just the 0'th one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108411
2021-09-19 17:38:32 +03:00
Roman Lebedev 07f1d8f0ca
[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108382
2021-09-19 17:35:37 +03:00
Roman Lebedev 0852313e47
[NFC] combineX86ShufflesRecursively(): actually address nits for previous patch 2021-09-19 17:29:18 +03:00
Roman Lebedev 1e72ca94e5
[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() after finishing recursing
This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
and it avoids regressions in that patch.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109065
2021-09-19 17:24:58 +03:00
Roman Lebedev 6a2c2263fb
[X86] Improve i8 all-ones element insertion in pre-SSE4.1
Should avoid some regressions in D109065

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109989
2021-09-18 22:24:06 +03:00
Roman Lebedev 358df06f4e
[X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109726
2021-09-17 19:13:33 +03:00
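Per-element sketch of the observation (plain integers standing in for vector lanes):
```
#include <cassert>
#include <cstdint>

// blend(a, b, pick_b) can become (a | b) when the lane we are NOT picking is
// known zero, or when the lane we ARE picking is all-ones.
int main() {
  // Case 1: picking b, and a (the unpicked lane) is zero.
  uint32_t a = 0, b = 0x12345678u;
  assert((a | b) == b);                 // OR yields the picked element

  // Case 2: picking an all-ones lane; whatever the other lane holds is absorbed.
  uint32_t ones = 0xFFFFFFFFu, other = 0xCAFEBABEu;
  assert((ones | other) == ones);
}
```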
Simon Pilgrim 1ef62cb200 [X86] SimplifyDemandedVectorEltsForTargetNode - add PSADBW handling
Peek through PSADBW operands to handle non demanded elements.
2021-09-16 11:28:31 +01:00
Simon Pilgrim dcba994184 [X86] combineX86ShuffleChain - ensure we only peek through bitcasts to vectors (PR51858)
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
2021-09-15 10:21:05 +01:00
Craig Topper 9af8f1b18e [SelectionDAG] Add isZero/isAllOnes methods to ConstantSDNode.
Soft deprecate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.

Reviewed By: lattner

Differential Revision: https://reviews.llvm.org/D109535
2021-09-09 13:28:30 -07:00
Chris Lattner d51da74889 [CodeGen] Use DAG.getAllOnesConstant where possible to simplify code. NFC. 2021-09-09 10:22:51 -07:00
Craig Topper 124bcc1a13 [X86] Disable muloti4 libcalls for x86-64.
This library function only exists in compiler-rt not libgcc. So
this would fail to link unless we were linking with compiler-rt.

This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.

I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D109385
2021-09-09 10:03:15 -07:00
Chris Lattner 735f46715d [APInt] Normalize naming on keep constructors / predicate methods.
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`.  This achieves two things:

1) This starts standardizing predicates across the LLVM codebase,
   following (in this case) ConstantInt.  The word "Value" doesn't
   convey anything of merit, and is missing in some of the other things.

2) Calling an integer "null" doesn't make any sense.  The original sin
   here is mine and I've regretted it for years.  This moves us to calling
   it "zero" instead, which is correct!

APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go.  As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.

Included in this patch are changes to a bunch of the codebase, but there are
more.  We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.

Differential Revision: https://reviews.llvm.org/D109483
2021-09-09 09:50:24 -07:00
Akira Hatanaka dea6f71af0 [ObjC][ARC] Use the addresses of the ARC runtime functions instead of
integer 0/1 for the operand of bundle "clang.arc.attachedcall"

https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.

This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.

Differential Revision: https://reviews.llvm.org/D103000
2021-09-08 11:58:03 -07:00
Nick Desaulniers d0eeb64be5 [X86ISelLowering] avoid emitting libcalls to __mulodi4()
Similar to D108842, D108844, and D108926.

__has_builtin(__builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using __builtin_mul_overflow with these types for these targets.

If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.

This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.

Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438

Reviewed By: lebedev.ri, RKSimon

Differential Revision: https://reviews.llvm.org/D108928
2021-09-07 10:44:54 -07:00
Simon Pilgrim d66d520fe1 [X86][SSE] combineMulToPMADDWD - improve recognition of sign/zero extended upper bits
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }

Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair); this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.

That way the sign of the result is still preserved, and the second half is still zero.

Noticed while investigating PR47437.

Differential Revision: https://reviews.llvm.org/D108522
2021-09-02 17:36:22 +01:00
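Reference model of PMADDWD on a single 2xi16 pair, matching the formula above; the i16*i16 -> i32 multiply that combineMulToPMADDWD targets is the special case where one element of each pair is zero:
```
#include <cassert>
#include <cstdint>

// One 2xi16 pair -> one i32 result: each i16 is sign-extended, multiplied,
// and the two products are added.
int32_t pmaddwd_pair(int16_t x0, int16_t x1, int16_t y0, int16_t y1) {
  return (int32_t)x0 * (int32_t)y0 + (int32_t)x1 * (int32_t)y1;
}

int main() {
  // A plain i16*i16 -> i32 multiply is the special case where the second
  // element of each pair is zero.
  int16_t x = -1234, y = 321;
  assert(pmaddwd_pair(x, 0, y, 0) == (int32_t)x * (int32_t)y);
}
```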
Roman Lebedev 3f1f08f0ed
Revert @llvm.isnan intrinsic patchset.
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)

TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.

While the end result of discussion may lead back to the current design,
it may also not lead to the current design.

Therefore i take it upon myself
to revert the tree back to last known good state.

This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
2021-09-02 13:53:56 +03:00
Simon Pilgrim b0acd6c369 [X86] Fold PMADD(x,0) or PMADD(0,x) -> 0
Pulled out of D108522 - handle zero-operand cases for PMADDWD/VPMADDUBSW ops
2021-09-02 10:48:50 +01:00
Wang, Pengfei 74043caef2 [X86] Enable half type support in inline assembly constraints
Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105799
2021-09-01 09:29:31 +08:00
Wang, Pengfei ab40dbfe03 [X86] AVX512FP16 instructions enabling 6/6
Enable FP16 complex FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105269
2021-08-30 13:08:45 +08:00
Roman Lebedev 6734018041
[Codegen][X86] EltsFromConsecutiveLoads(): if only have AVX1, ensure that the "load" is actually foldable (PR51615)
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390

In this case, we completely fail to check that the "broadcast-from-mem" we create
can actually fold the load. Here, its operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
  t7: i64 = extract_vector_elt t5, Constant:i64<0>
    t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t1: i64 = Register %0
      t4: i64 = undef
    t3: i64 = Constant<0>
Combining: t15: v8i32 = undef

```

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108821
2021-08-27 20:26:53 +03:00
Serge Pavlov cdbe569fb6 [X86] Implement llvm.isnan(x86_fp80) as unordered comparison
The x86_fp80 format allows values that do not fit any IEEE-754 category.
Previously they were recognized by the intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using the instruction FXAM, which
distinguishes between NaNs and unsupported values. This can make some
programs behave differently.

As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __builtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.

Differential Revision: https://reviews.llvm.org/D108037
2021-08-27 18:06:07 +07:00
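Sketch of the unordered-comparison lowering used when FP exceptions are ignored: a NaN is the only value that compares unequal to itself (this is also exactly the form '-ffast-math' may fold away, as discussed in the llvm.isnan commits elsewhere in this log).
```
#include <cassert>
#include <limits>

// 'fcmp uno x, x' at the IR level: NaN is the only value unequal to itself.
// Note: may raise FP exceptions for signaling NaNs, hence FXAM in strictfp code.
bool isnan_via_compare(double x) {
  return x != x;
}

int main() {
  assert(isnan_via_compare(std::numeric_limits<double>::quiet_NaN()));
  assert(!isnan_via_compare(0.0));
  assert(!isnan_via_compare(std::numeric_limits<double>::infinity()));
}
```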
Nathan Sidwell 199ac3a839 [NFC][X86] Sret return register cleanup
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.

Differential Revision: https://reviews.llvm.org/D108762
2021-08-27 04:03:49 -07:00
Roman Lebedev a8125bf4a8
[X86][Codegen] PR51615: don't replace wide volatile load with narrow broadcast-from-memory
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.

We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D108757
2021-08-26 18:46:49 +03:00
Nathan Sidwell ab55cc6cef [X86] pr51000 in-register struct return tailcalling
In-register structure returns are not special, and handled by lowering
to multiple-value tuples.  We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.

Differential Revision: https://reviews.llvm.org/D105807
2021-08-25 10:15:50 -07:00
Simon Pilgrim 307890f85b [X86] Freeze vXi8 shl(x,1) -> add(x,x) vector fold (PR50468)
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.

Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.

Differential Revision: https://reviews.llvm.org/D108139
2021-08-24 16:08:24 +01:00
Liu, Chen3 b7795eb646 [X86] Building a constant vector whose element type is half will cause an assertion failure.
Fix an assertion failure when building a constant vector whose element type is half.

Differential Revision: https://reviews.llvm.org/D108612
2021-08-24 14:34:30 +08:00
Wang, Pengfei c728bd5bba [X86] AVX512FP16 instructions enabling 5/6
Enable FP16 FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105268
2021-08-24 09:07:19 +08:00
Simon Pilgrim 805fb1f6c1 [X86] combineMul - move MUL_IMM comment inside function. NFC.
combineMul is now used for other things as well as the mul-with-constant expansion - move the comment to where it's actually relevant.
2021-08-22 18:27:03 +01:00
Simon Pilgrim 352df10a23 [X86][AVX] matchShuffleAsBlend - use isElementEquivalent to help match broadcast/repeated elements
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.

This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
2021-08-22 15:26:17 +01:00
Simon Pilgrim 96fb3eef66 Fix signed/unsigned comparison warning. NFCI. 2021-08-22 15:02:19 +01:00
Simon Pilgrim a1c892b439 [X86][SSE] lowerVECTOR_SHUFFLE - canonicalize with horizontal ops.
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
2021-08-22 14:54:48 +01:00
Wang, Pengfei b088536ce9 [X86] AVX512FP16 instructions enabling 4/6
Enable FP16 unary operator instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105267
2021-08-22 08:59:35 +08:00
David Green d10f23a25d [ISel] Expand saddsat and ssubsat via asr and xor
This changes the lowering of saddsat and ssubsat so that instead of
using:
  r,o = saddo x, y
  c = setcc r < 0
  s = c ? INTMAX : INTMIN
  ret o ? s : r
it uses asr and xor to materialize the INTMAX/INTMIN constants:
  r,o = saddo x, y
  s = ashr r, BW-1
  x = xor s, INTMIN
  ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD

This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.

Differential Revision: https://reviews.llvm.org/D105853
2021-08-19 16:08:07 +01:00
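Scalar sketch of the new expansion, modeled on i32 and using the GCC/Clang __builtin_add_overflow builtin for the 'saddo' step:
```
#include <cassert>
#include <cstdint>

// saddsat expansion via asr + xor:
//   r, o = saddo x, y
//   s    = ashr r, 31        // 0 or -1, from the wrapped sum's sign
//   sat  = s ^ INT32_MIN     // INT32_MIN if r >= 0, INT32_MAX if r < 0
//   ret    o ? sat : r
int32_t saddsat(int32_t x, int32_t y) {
  int32_t r;
  bool o = __builtin_add_overflow(x, y, &r);  // GCC/Clang builtin
  int32_t s = r >> 31;                        // arithmetic shift on these compilers
  int32_t sat = s ^ INT32_MIN;
  return o ? sat : r;
}

int main() {
  assert(saddsat(INT32_MAX, 1) == INT32_MAX);
  assert(saddsat(INT32_MIN, -1) == INT32_MIN);
  assert(saddsat(100, -42) == 58);
}
```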
Arthur Eubanks 46cf82532c [NFC] Replace Function handling of attributes with less confusing calls
To avoid magic constants and confusing indexes.
2021-08-17 21:05:40 -07:00
Wang, Pengfei 2379949aad [X86] AVX512FP16 instructions enabling 3/6
Enable FP16 conversion instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105265
2021-08-18 09:03:41 +08:00
Simon Pilgrim 359cfa2af7 [X86] EmitInstrWithCustomInserter - silence uninitialized variable warnings. NFCI.
Consistently use llvm_unreachable for the default case in the inner switch statements.
2021-08-17 22:01:07 +01:00
Roman Lebedev 2078c4ecfd
[X86] Lower insertions into upper half of an 256-bit vector as broadcast+blend (PR50971)
Broadcast is not worse than extract+insert of subvector.
https://godbolt.org/z/aPq98G6Yh

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D105390
2021-08-17 18:45:10 +03:00
Wang, Pengfei f1de9d6dae [X86] AVX512FP16 instructions enabling 2/6
Enable FP16 binary operator instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105264
2021-08-15 08:56:33 +08:00
Arthur Eubanks d7593ebaee [NFC] Clean up users of AttributeList::hasAttribute()
AttributeList::hasAttribute() is confusing, use clearer methods like
hasParamAttr()/hasRetAttr().

Add hasRetAttr() since it was missing from AttributeList.
2021-08-13 11:59:18 -07:00
Arthur Eubanks 92ce6db9ee [NFC] Rename AttributeList::hasFnAttribute() -> hasFnAttr()
This is more consistent with similar methods.
2021-08-13 11:09:18 -07:00
Wang, Pengfei 6f7f5b54c8 [X86] AVX512FP16 instructions enabling 1/6
1. Enable FP16 type support and basic declarations used by following patches.
2. Enable new instructions VMOVW and VMOVSH.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105263
2021-08-10 12:46:01 +08:00
Craig Topper 24dfba8d50 [X86] Teach shouldSinkOperands to recognize pmuldq/pmuludq patterns.
The IR for pmuldq/pmuludq intrinsics uses a sext_inreg/zext_inreg
pattern on the inputs. Ideally we pattern match these away during
isel. It is possible for LICM or other middle end optimizations
to separate the extend from the mul. This prevents SelectionDAG
from removing it or depending on how the extend is lowered, we
may not be able to generate an AssertSExt/AssertZExt in the
mul basic block. This will prevent pmuldq/pmuludq from being
formed at all.

This patch teaches shouldSinkOperands to recognize this so
that CodeGenPrepare will clone the extend into the same basic
block as the mul.

Fixes PR51371.

Differential Revision: https://reviews.llvm.org/D107689
2021-08-07 08:45:56 -07:00
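Scalar equivalent of the per-lane pattern isel matches for PMULUDQ: a 64-bit multiply whose operands are zero-extended from 32 bits (the zext_inreg is just an AND with 0xFFFFFFFF). The point of the patch is keeping those extends in the same block as the mul so this shape survives to isel.
```
#include <cassert>
#include <cstdint>

// Per-lane PMULUDQ: the full 64-bit product of the low 32 bits of each operand.
uint64_t pmuludq_lane(uint64_t a, uint64_t b) {
  return (a & 0xFFFFFFFFull) * (b & 0xFFFFFFFFull);
}

int main() {
  assert(pmuludq_lane(0x100000005ull, 0x200000003ull) == 15ull);
  assert(pmuludq_lane(0xFFFFFFFFull, 0xFFFFFFFFull) == 0xFFFFFFFE00000001ull);
}
```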
Amara Emerson 2b067e3335 Change TargetLowering::canMergeStoresTo() to take a MF instead of DAG.
DAG is unnecessary and we need this hook to implement store merging on GlobalISel too.
2021-08-06 12:57:53 -07:00
Craig Topper b2ca4dc935 [LegalizeTypes] Add a simple expansion for SMULO when a libcall isn't available.
This isn't optimal, but prevents crashing when the libcall isn't
available. It just calculates the full product and makes sure the high bits
match the sign of the low half. Each of the pieces should go through their own
type legalization.

This can make D107420 unnecessary.

Needs tests, but I wanted to start discussion about D107420.

Reviewed By: FreddyYe

Differential Revision: https://reviews.llvm.org/D107581
2021-08-06 09:43:01 -07:00
Simon Pilgrim 18e6a03b1a [X86][AVX] Extract SUBV_BROADCAST constant bits from just the lower subvector range (PR51281)
As reported on PR51281, an internal fuzz test encountered an issue when extracting constant bits from a SUBV_BROADCAST node from a constant pool source larger than the broadcasted subvector width.

getTargetConstantBitsFromNode was assuming that the Constant would be the same size as the subvector, resulting in incorrect packing of the per-element bits data.

This patch attempts to solve this by using the SUBV_BROADCAST node to determine the subvector width, and then ensuring we extract only the lowest bits from Constant of that subvector bitsize.

Differential Revision: https://reviews.llvm.org/D107158
2021-08-06 11:21:31 +01:00
Serge Pavlov 4c4093e6e3 Introduce intrinsic llvm.isnan
This is recommit of the patch 16ff91ebcc,
reverted in 0c28a7c990 because it had
an error in call of getFastMathFlags (base type should be FPMathOperator
but not Instruction). The original commit message is duplicated below:

    Clang has builtin function '__builtin_isnan', which implements C
    library function 'isnan'. This function now is implemented entirely in
    clang codegen, which expands the function into set of IR operations.
    There are three mechanisms by which the expansion can be made.

    * The most common mechanism is using an unordered comparison made by
      instruction 'fcmp uno'. This simple solution is target-independent
      and works well in most cases. It however is not suitable if floating
      point exceptions are tracked. Corresponding IEEE 754 operation and C
      function must never raise FP exception, even if the argument is a
      signaling NaN. Compare instructions usually do not have such a
      property; they raise an 'invalid' exception in such cases. So this
      mechanism is unsuitable when exception behavior is strict. In
      particular it could result in unexpected trapping if argument is SNaN.

    * Another solution was implemented in https://reviews.llvm.org/D95948.
      It is used in the cases when raising FP exceptions by 'isnan' is not
      allowed. This solution implements 'isnan' using integer operations.
      It solves the problem of exceptions, but offers one solution for all
      targets, however some can do the check in more efficient way.

    * Solution implemented by https://reviews.llvm.org/D96568 introduced a
      hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
      specific code into IR. Now only SystemZ implements this hook and it
      generates a call to target specific intrinsic function.

    Although these mechanisms allow implementing 'isnan' with enough
    efficiency, expanding 'isnan' in clang has drawbacks:

    * The operation 'isnan' is hidden behind generic integer operations or
      target-specific intrinsics. It complicates analysis and can prevent
      some optimizations.

    * IR can be created by tools other than clang, in this case treatment
      of 'isnan' has to be duplicated in that tool.

    Another issue with the current implementation of 'isnan' comes from the
    use of options '-ffast-math' or '-fno-honor-nans'. If such option is
    specified, 'fcmp uno' may be optimized to 'false'. This is a valid
    optimization in general, but it results in 'isnan' always returning
    'false'. For example, in some libc++ implementations the following code
    returns 'false':

        std::isnan(std::numeric_limits<float>::quiet_NaN())

    The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
    operands are never NaNs. This assumption however should not be applied
    to the functions that check FP number properties, including 'isnan'. If
    such a function returns the expected result instead of actually making
    checks, it becomes useless in many cases. The option '-ffast-math' is
    often used for performance critical code, as it can speed up execution
    at the expense of manual treatment of corner cases. If 'isnan' returns
    assumed result, a user cannot use it in the manual treatment of NaNs
    and has to invent replacements, like making the check using integer
    operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
    which also expresses the opinion, that limitations imposed by
    '-ffast-math' should be applied only to 'math' functions but not to
    'tests'.

    To overcome these drawbacks, this change introduces a new IR intrinsic
    function 'llvm.isnan', which realizes the check as specified by IEEE-754
    and C standards in target-agnostic way. During IR transformations it
    does not undergo undesirable optimizations. It reaches instruction
    selection, where it is lowered in a target-dependent way. The lowering can
    vary depending on options like '-ffast-math' or '-ffp-model' so the
    resulting code satisfies requested semantics.

    Differential Revision: https://reviews.llvm.org/D104854
2021-08-06 14:32:27 +07:00
Roman Lebedev c0586ff05d
[NFC][X86] combineX86ShuffleChain(): hoist Mask variable higher up
Having `NewMask` outside of an if and rebinding the `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that previously had no access to it, but now does,
to use `Mask` instead of `BaseMask`.

This has no other intentional changes.

This is a recommit of 35c0848b57,
that was reverted to simplify reversion of an earlier change.
2021-08-05 20:37:51 +03:00
Benjamin Kramer bd17ced1db Revert "[X86] combineX86ShuffleChain(): canonicalize mask elts picking from splats"
This reverts commits f819e4c7d0 and
35c0848b57. It triggers an infinite loop during
compilation.

$ cat t.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @MaxPoolGradGrad_1.65() local_unnamed_addr #0 {
entry:
  %wide.vec78 = load <64 x i32>, <64 x i32>* null, align 16
  %strided.vec83 = shufflevector <64 x i32> %wide.vec78, <64 x i32> poison, <8 x i32> <i32 4, i32 12, i32 20, i32 28, i32 36, i32 44, i32 52, i32 60>
  %0 = lshr <8 x i32> %strided.vec83, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
  %1 = add <8 x i32> zeroinitializer, %0
  %2 = shufflevector <8 x i32> %1, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %3 = shufflevector <16 x i32> %2, <16 x i32> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
  %interleaved.vec = shufflevector <32 x i32> undef, <32 x i32> %3, <64 x i32> <i32 0, i32 8, i32 16, i32 24, i32 32, i32 40, i32 48, i32 56, i32 1, i32 9, i32 17, i32 25, i32 33, i32 41, i32 49, i32 57, i32 2, i32 10, i32 18, i32 26, i32 34, i32 42, i32 50, i32 58, i32 3, i32 11, i32 19, i32 27, i32 35, i32 43, i32 51, i32 59, i32 4, i32 12, i32 20, i32 28, i32 36, i32 44, i32 52, i32 60, i32 5, i32 13, i32 21, i32 29, i32 37, i32 45, i32 53, i32 61, i32 6, i32 14, i32 22, i32 30, i32 38, i32 46, i32 54, i32 62, i32 7, i32 15, i32 23, i32 31, i32 39, i32 47, i32 55, i32 63>
  store <64 x i32> %interleaved.vec, <64 x i32>* undef, align 16
  unreachable
}

$ llc < t.ll -mcpu=skylake
<hang>
2021-08-05 18:58:08 +02:00
Simon Pilgrim 2cbf9fd402 [DAG] DAGCombiner::visitVECTOR_SHUFFLE - recognise INSERT_SUBVECTOR patterns
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.

This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.

This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.

Fixes PR50053

Differential Revision: https://reviews.llvm.org/D107068
2021-08-05 15:40:48 +01:00
Fangrui Song 9c19b36f1c [X86] Remove -x86-experimental-pref-loop-alignment in favor of -align-loops 2021-08-04 13:23:57 -07:00
Roman Lebedev 35c0848b57
[NFC][X86] combineX86ShuffleChain(): hoist Mask variable higher up
Having `NewMask` outside of an if and rebinding the `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that previously had no access to it, but now does,
to use `Mask` instead of `BaseMask`.

This has no other intentional changes.
2021-08-04 17:15:12 +03:00
Roman Lebedev 916cdc3d4b
[NFC][X86] combineX86ShuffleChain(): rename inner Mask to avoid future shadowing
I want to hoist `Mask` variable higher up,
but then it would clash with this one.
So let's rename this one first.

There are no other intentional changes here other than said rename.
2021-08-04 17:15:12 +03:00
Roman Lebedev f819e4c7d0
[X86] combineX86ShuffleChain(): canonicalize mask elts picking from splats
Given a shuffle mask, if it is picking from an input that is splat
given the current granularity of the shuffle, then adjust the mask
to pick from the same lane of the input as the mask element is in.
This may result in a shuffle being simplified into a blend.

I believe this is correct given that the splat detection matches the one
just above the new code.

My basic thought is that we might be able to get fewer regressions
by handling multiple insertions of the same value into a vector
if we form broadcasts+blend here, as opposed to D105390,
but I have not really thought this through,
and have not tried implementing it yet.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D107009
2021-08-04 16:55:04 +03:00
Serge Pavlov 0c28a7c990 Revert "Introduce intrinsic llvm.isnan"
This reverts commit 16ff91ebcc.
Several errors were reported mainly test-suite execution time. Reverted
for investigation.
2021-08-04 17:18:15 +07:00
Serge Pavlov 16ff91ebcc Introduce intrinsic llvm.isnan
Clang has the builtin function '__builtin_isnan', which implements the C
library function 'isnan'. This function is currently implemented entirely
in clang codegen, which expands the function into a set of IR operations.
There are three mechanisms by which the expansion can be made.

* The most common mechanism is using an unordered comparison made by
  the instruction 'fcmp uno'. This simple solution is target-independent
  and works well in most cases. However, it is not suitable if floating
  point exceptions are tracked. The corresponding IEEE 754 operation and
  C function must never raise an FP exception, even if the argument is a
  signaling NaN. Compare instructions usually do not have this property;
  they raise the 'invalid' exception in that case. So this mechanism is
  unsuitable when exception behavior is strict; in particular, it could
  result in unexpected trapping if the argument is an SNaN.

* Another solution was implemented in https://reviews.llvm.org/D95948.
  It is used in cases where 'isnan' is not allowed to raise FP exceptions.
  This solution implements 'isnan' using integer operations. It solves the
  problem of exceptions, but it offers a single solution for all targets,
  even though some could do the check more efficiently.

* The solution implemented by https://reviews.llvm.org/D96568 introduced a
  hook, 'clang::TargetCodeGenInfo::testFPKind', which injects
  target-specific code into IR. Currently only SystemZ implements this
  hook, and it generates a call to a target-specific intrinsic function.

Although these mechanisms allow 'isnan' to be implemented efficiently
enough, expanding 'isnan' in clang has drawbacks:

* The operation 'isnan' is hidden behind generic integer operations or
  target-specific intrinsics. This complicates analysis and can prevent
  some optimizations.

* IR can be created by tools other than clang; in that case the treatment
  of 'isnan' has to be duplicated in each such tool.

Another issue with the current implementation of 'isnan' comes from the
use of the options '-ffast-math' or '-fno-honor-nans'. If such an option
is specified, 'fcmp uno' may be optimized to 'false'. This is a valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':

    std::isnan(std::numeric_limits<float>::quiet_NaN())

The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption, however, should not be applied
to functions that check FP number properties, including 'isnan'. If such
a function returns the expected result instead of actually performing the
check, it becomes useless in many cases. The option '-ffast-math' is
often used for performance-critical code, as it can speed up execution at
the expense of manual treatment of corner cases. If 'isnan' returns the
assumed result, a user cannot use it in that manual treatment of NaNs and
has to invent replacements, such as making the check with integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418
which also expresses the opinion that the limitations imposed by
'-ffast-math' should apply only to 'math' functions but not to 'tests'.

To overcome these drawbacks, this change introduces a new IR intrinsic
function, 'llvm.isnan', which implements the check as specified by the
IEEE-754 and C standards in a target-agnostic way. During IR
transformations it does not undergo undesirable optimizations. It reaches
instruction selection, where it is lowered in a target-dependent way. The
lowering can vary depending on options like '-ffast-math' or '-ffp-model',
so the resulting code satisfies the requested semantics.

Differential Revision: https://reviews.llvm.org/D104854
2021-08-04 15:27:49 +07:00
Sanjay Patel 4c41caa287 [x86] improve CMOV codegen by pushing add into operands, part 3
In this episode, we are trying to avoid an x86 micro-arch quirk where complex
(3 operand) LEA potentially costs significantly more than simple LEA. So we
simultaneously push and pull the math around the CMOV to balance the operations.

I looked at the debug spew during instruction selection and decided against
trying a later DAGToDAG transform -- it seems very difficult to match if the
trailing memops are already selected and managing the creation of extra
instructions at that level is always tricky.

Differential Revision: https://reviews.llvm.org/D106918
2021-07-28 09:10:33 -04:00
Xiang1 Zhang 3223d41017 [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT
Differential Revision: https://reviews.llvm.org/D106780
2021-07-28 08:16:59 +08:00
Xiang1 Zhang 2ca3937131 Revert "[X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT"
This reverts commit 6ff73efea9.
2021-07-28 08:12:29 +08:00
Xiang1 Zhang 6ff73efea9 [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT 2021-07-28 08:08:30 +08:00
Sanjay Patel 156ba620b3 [x86] update stale code comment; NFC
The transform was generalized with:
1ce05ad619
2021-07-27 16:45:52 -04:00
Tres Popp d225de60c9 Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD

This is not a perfect revert. The new function is left as other uses of
it exist now.
2021-07-27 16:55:50 +02:00
Tres Popp 70fa9479b2 Revert "Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.""
This reverts commit d7bbb1230a.

There were follow up uses of a deleted method and I didn't run the
tests. Undo the revert, so I can do it properly.
2021-07-27 16:48:31 +02:00
Tres Popp d7bbb1230a Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD
2021-07-27 16:22:25 +02:00
Simon Pilgrim c8472db0a8 [X86][AVX] Prefer vinsertf128 to vperm2f128 on AVX1 targets
Splatting the lower xmm with vinsertf128 is at least as quick as vperm2f128, and a lot faster on some AMD targets.

First step towards PR50053
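
For illustration only, the difference at the intrinsics level (not the backend code itself) might look roughly like this, assuming AVX and the usual <immintrin.h> names:

    #include <immintrin.h>

    // Splat the low 128-bit half of a 256-bit vector across both halves.
    __m256 splat_low_xmm(__m256 v) {
      __m128 lo = _mm256_castps256_ps128(v);   // take the low xmm half
      // vinsertf128: reinsert the low half into the upper lane.
      return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), lo, 1);
      // The alternative lane shuffle this patch moves away from:
      //   _mm256_permute2f128_ps(v, v, 0x00);  // vperm2f128
    }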
2021-07-26 11:11:56 +01:00
Simon Pilgrim 1cfecf4fc4 [X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.
Begin replacing individual getMemIntrinsicNode calls and setup (for X86ISD::VBROADCAST_LOAD + X86ISD::SUBV_BROADCAST_LOAD opcodes) with this getBROADCAST_LOAD helper.
2021-07-25 20:37:58 +01:00
Simon Pilgrim b95f66ad78 [X86][SSE] LowerRotate - perform modulo on the amount splat source directly.
If the rotation amount is a known splat, perform the modulo on the splat source, and then perform the splat. That way the amount-extension performed later by LowerScalarVariableShift can fold the splats away without any multiple-use issues.

Fixes one of the concerns raised on D104156
2021-07-25 17:30:32 +01:00
Sanjay Patel 1ce05ad619 [x86] improve CMOV codegen by pushing add into operands, part 2
This is a minimal extension of D106607 to allow folding for
2 non-zero constants that can be materialized as immediates.

In the reduced test examples, we save 1 instruction by rolling
the constants into LEA/ADD. In the motivating test from the bullet
benchmark, we absorb both of the constant moves into add ops via
LEA magic, so we reduce by 2 instructions.

Differential Revision: https://reviews.llvm.org/D106684
2021-07-25 10:05:41 -04:00
Simon Pilgrim 15b883f457 [X86][AVX] Adjust AllowBWIVPERMV3 tolerance to account for VariableCrossLaneShuffleDepth
As noticed on D105390 - we were hardwiring the depth limit for combining to VPERMI2W/VPERMI2B instructions. Not only had we made the limit too low, we hadn't accounted for slow/fast shuffles via the VariableCrossLaneShuffleDepth control
2021-07-25 14:05:11 +01:00
Sanjay Patel f060aa1cf3 [x86] improve CMOV codegen by pushing add into operands
This is not the transform direction we want in general,
but by the time we have a CMOV, we've already tried
everything else that could be better.
The transform increases the uses of the other add operand,
but that is safe according to Alive2:
https://alive2.llvm.org/ce/z/Yn6p-A

We could probably extend this to other binops (not just add).
This is the motivating pattern discussed in:
https://llvm.org/PR51069

The test with i8 shows a missed fold because there's a trunc
sitting in front of the add. That can be handled with a small
follow-up.

Differential Revision: https://reviews.llvm.org/D106607
2021-07-23 09:39:32 -04:00
Simon Pilgrim 71d0fd3564 [X86][AVX] lowerV2X128Shuffle - attempt to recognise broadcastf128 subvector load
As noticed on PR50053 we were failing to recognise when a shuffle of a load was really a subvector broadcast load
2021-07-23 13:10:38 +01:00
Simon Pilgrim 142e60f40b [X86] Fix case of IsAfterLegalize argument. NFC.
Pulled out of D106280
2021-07-19 17:15:28 +01:00
Eli Friedman 6601be4419 [X86] Remove incorrect use of known bits in shuffle simplification.
This reverts commit 2a419a0b99.

The result of a shufflevector must not propagate poison from any element
other than the one noted in the shuffle mask.

The regressions outside of fptoui-may-overflow.ll can probably be
recovered some other way; for example, using isGuaranteedNotToBePoison.

See discussion on https://reviews.llvm.org/D106053 for more background.

Differential Revision: https://reviews.llvm.org/D106222
2021-07-18 18:13:11 -07:00
Simon Pilgrim 51a12d2ff0 [X86][SSE] matchShuffleWithPACK - avoid poison pollution from bitcasting multiple elements together.
D106053 exposed that we've not been taking into account that by bitcasting smaller elements together and then performing a ComputeKnownBits on the result we'd be allowing a poison element to influence other neighbouring elements being used in the pack. Instead we now peek through any existing bitcast to ensure that the source type already matches the width source of the pack node we're trying to match.

This has also been a chance to stop matchShuffleWithPACK creating unused nodes on the fly which could affect oneuse tests during shuffle lowering/combining.

The only regression we're seeing is due to being unable to peek through a bitcast as it's on the other side of an extract_subvector - which should go away once we finally allow shuffle combining across different vector widths (by making matchShuffleWithPACK use const SelectionDAG& we've gotten closer to this - see PR45974).
2021-07-18 14:25:28 +01:00
Simon Pilgrim d2458bcdc6 [X86][SSE] combineX86ShufflesRecursively - bail if constant folding fails due to oneuse limits.
Fixes an issue reported on D105827, where a single shuffle of a constant (with multiple uses) was caught in an infinite loop: one shuffle (UNPCKL) used an undef arg, but that then got recombined to SHUFPS as the constant value had its own undef that confused matching.
2021-07-16 19:21:46 +01:00
Simon Pilgrim ee71c1bbcc [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
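
A hedged intrinsic-level sketch of the same idea for one v4f32 vector (illustrative only, not the DAG lowering code), assuming SSE2 and the out-of-range CVTTPS2SI behaviour described above:

    #include <immintrin.h>

    // Unsigned v4f32 -> v4i32 conversion built from signed CVTTPS2SI.
    __m128i fp_to_uint_v4(__m128 x) {
      __m128i small = _mm_cvttps_epi32(x);              // 0x80000000 where x is out of signed range
      __m128i big   = _mm_cvttps_epi32(_mm_sub_ps(x, _mm_set1_ps(2147483648.0f)));
      __m128i sign  = _mm_srai_epi32(small, 31);        // all-ones lanes where 'small' overflowed
      return _mm_or_si128(small, _mm_and_si128(big, sign));
    }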

Even on targets where "PBLENDVPS"/"PBLENDVB" exist, they are often latency-2, low-throughput instructions, so this logic is applied there too (in particular also for AVX2). It furthermore gets rid of one high-latency floating point comparison from the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
Simon Pilgrim 3cee36c5ac [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result (REAPPLIED)
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))

Reapplied rGe4aa6ad13216 with fix for intrinsic variants of the opcode which uses a vector return type
2021-07-13 12:31:09 +01:00
Vitaly Buka 606551ee98 Revert "[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result"
Fails here https://lab.llvm.org/buildbot/#/builders/37/builds/5267

This reverts commit e4aa6ad132.
2021-07-12 22:26:54 -07:00
Simon Pilgrim e4aa6ad132 [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))
2021-07-12 09:56:59 +01:00
Simon Pilgrim 9dbeac16ba [X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to let us add AssertSext/AssertZext
It's proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.

This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).

This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
2021-07-09 12:07:33 +01:00
Wang, Pengfei 9ab99f773f [X86] Twist shuffle mask when fold HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
This patch fixes PR50823.

The shuffle mask needs to be twisted twice to get the correct one, due to the difference between the inner HOP and the outer one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104903
2021-07-05 21:29:42 +08:00
Simon Pilgrim 59fa435ea6 [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760)
This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE, which should reduce the number of EFLAGS bits we need to read.

As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.

I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.
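
As a rough source-level illustration of the canonicalization (the actual change operates on DAG condition codes, and the function names below are made up):

    // For a constant RHS the two forms below are equivalent, but the x86
    // 'greater-than' condition (SETG/CMOVG) reads ZF, SF and OF, while
    // 'greater-or-equal' (SETGE/CMOVGE) only reads SF and OF.
    bool gt(int x) { return x > 41;  }   // canonicalized by this fold into...
    bool ge(int x) { return x >= 42; }   // ...this form, with a cheaper EFLAGS read.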

Differential Revision: https://reviews.llvm.org/D101074
2021-06-30 18:46:50 +01:00
Craig Topper 81f6d7c082 [X86] Tighten up some inline assembly constraint handling.
Don't allow vectors to be split into GPRs for 'r' and other scalar
constraints. This prevents an assertion in getCopyToPartsVector.

Makes PR50907 give a better error instead of crashing.
2021-06-26 22:57:22 -07:00
Martin Storsjö 42f74e8249 [llvm] Rename StringRef _lower() method calls to _insensitive()
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class, however these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
2021-06-25 00:22:01 +03:00
Florian Hahn a54c6fc083
[X86] Exclude invalid element types for bitcast/broadcast folding.
It looks like the fold introduced in 63f3383ece can cause crashes
if the type of the bitcasted value is not a valid vector element type,
like x86_mmx.

To resolve the crash, reject invalid vector element types. The way it is
done in the patch is a bit clunky. Perhaps there's a better way to
check?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104792
2021-06-24 12:39:01 +01:00
Simon Pilgrim c4d3eedc7f [X86] Fold nested select_cc to select (cmp*ge/le Cond0, Cond1), LHS, Y)
select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y)
etc,

We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes.

Pulled out of D101074 to help address the remaining regressions.

Differential Revision: https://reviews.llvm.org/D104707
2021-06-24 11:27:57 +01:00
Eli Friedman 74909e4b6e Rename MachineMemOperand::getOrdering -> getSuccessOrdering.
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving.  This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.

We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.

Differential Revision: https://reviews.llvm.org/D103338
2021-06-21 16:49:27 -07:00
Simon Pilgrim cdb4fcf9a1 [X86] combineSelect - refactor MIN/MAX detection code to make it easier to add additional select(setcc,x,y) folds. NFCI.
I need to add some additional handling to address some of the regressions from D101074
2021-06-17 13:50:59 +01:00
Roman Lebedev 308f6a5245
[NFC][X86] lowerVECTOR_SHUFFLE(): drop FIXME about widening to i128 (YMM half) element type
As per the discussion in D103818, so far, this does not appear to be worthwhile.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103818
2021-06-16 10:24:33 +03:00
Bob Haarman 3bc899b4de [X86] avoid assert with varargs, soft float, and no-implicit-float
Fixes:
 - PR36507 Floating point varargs are not handled correctly with
   -mno-implicit-float
 - PR48528 __builtin_va_start assumes it can pass SSE registers
   when using -Xclang -msoft-float -Xclang -no-implicit-float

On x86_64, floating-point parameters are normally passed in XMM
registers. For va_start, we spill those to memory so va_arg can
find them. There is an interaction here with -msoft-float and
-no-implicit-float:

When -msoft-float is in effect, instead of passing floating-point
parameters in XMM registers, they are passed in general-purpose
registers.

When -no-implicit-float is in effect, it "disables implicit
floating-point instructions" (per the LangRef). The intended
effect is to not have the compiler generate floating-point code
unless explicit floating-point operations are present in the
source code, but what exactly counts as an explicit floating-point
operation is not specified. The existing behavior of LLVM here has
led to some surprises and PRs.

This change modifies the behavior as follows:

  | soft | no-implicit | old behavior    | new behavior    |
  |  no  |   no        | spill XMM regs  | spill XMM regs  |
  | yes  |   no        | don't spill XMM | don't spill XMM |
  |  no  |  yes        | don't spill XMM | spill XMM regs  |
  | yes  |  yes        | assert          | don't spill XMM |

In particular, this avoids the assert that happens when
-msoft-float and -no-implicit-float are both in effect. This
seems like a perfectly reasonable combination: If we don't want
to rely on hardware floating-point support, we want to both
avoid using float registers to pass parameters and avoid having
the compiler generate floating-point code that wasn't in the
original program. Instead of crashing the compiler, the new
behavior is to not synthesize floating-point code in this
case. This fixes PR48528.

The other interesting case is when -no-implicit-float is in
effect, but -msoft-float is not. In that case, any floating-point
parameters that are present will be in XMM registers, and so we
have to spill them to correctly handle those. This fixes
PR36507. The spill is conditional on %al indicating that
parameters are present in XMM registers, so no floating-point
code will be executed unless the function is called with
floating-point parameters.
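
A minimal (hypothetical) example of the kind of function affected - a variadic function that actually consumes floating-point arguments - which is compiled according to the table above depending on -msoft-float / -mno-implicit-float:

    #include <stdarg.h>

    /* Sums 'n' double arguments passed through varargs. With this change the
       XMM spill in the va_start prologue is emitted (or not) per the table
       above, and at run time it is guarded by %al indicating whether any
       vector registers actually hold arguments. */
    double sum(int n, ...) {
      va_list ap;
      va_start(ap, n);
      double s = 0.0;
      for (int i = 0; i < n; ++i)
        s += va_arg(ap, double);
      va_end(ap);
      return s;
    }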

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104001
2021-06-15 11:27:35 -07:00
Craig Topper 4017d0335a [X86] Use EVT::getVectorVT instead of changeVectorElementType in reduceVMULWidth.
Changing vector element type doesn't work for v6i32->v6i16 now
that v6i32 is an MVT and v6i16 is not.

I would like to fix this in changeVectorElementType, but you
need an LLVMContext to call getVectorVT, which we can't get from
an MVT.

Fixes PR50709.
2021-06-14 22:07:04 -07:00
Craig Topper 765ef4bb2a [X86] Check destination element type before forming VTRUNCS/VTRUNCUS in combineTruncateWithSat.
Fixes crash reported here https://reviews.llvm.org/D73607

Using a store to keep the trunc intact. Returning v16i24 would
cause the trunc to be optimized away in SelectionDAGBuilder.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103940
2021-06-09 07:08:17 -07:00
Simon Pilgrim 8ab8b3fad7 [X86][SSE] LowerFP_TO_INT - remove dead code. NFCI.
Non-Strict v2f32->v2i64 cases have already early-returned to be handled by legalization.
2021-06-06 20:04:15 +01:00
Simon Pilgrim 4879c8f3b0 [X86][SSE] combineVectorTruncation - simplify PSHUFB-is-better logic. NFCI.
OutSVT is guaranteed to be i8/i16 and we accept any InSVT that isn't i64
2021-06-06 20:04:14 +01:00
Nikita Popov 1ffa6499ea [TargetLowering] Use IRBuilderBase instead of IRBuilder<> (NFC)
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.

Differential Revision: https://reviews.llvm.org/D103759
2021-06-06 16:29:50 +02:00
Nikita Popov 9914200393 [CodeGen] Add missing includes (NFC)
These currently rely on the IRBuilder.h include in TargetLowering.h.
Make them explicit.
2021-06-06 15:48:27 +02:00
Simon Pilgrim 9f5d783d46 [X86][SSE] combineScalarToVector - only reuse broadcasts for scalar_to_vector if the source operands scalar types match
We were hitting an issue when the scalar_to_vector source was being implicitly truncated (in this case i8 to vXi1) but we were also using the i8 source in a broadcast to a vXi8 value.

Fixes PR50374
2021-06-02 22:05:40 +01:00
Roman Lebedev cf9b1f7a0e
[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants
Currently, the X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature,
which controls the profitability of both cross-lane and per-lane variable shuffles.
I guess this has been fine so far.

But at least on AMD Zen 3, per-lane variable shuffles (e.g. `VPSHUFB`)
are as fast as shuffles with a fixed/immediate mask,
while lane-crossing shuffles, e.g. `VPERMPS`, perform worse.

So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
split the feature flag into two, as suggested by @RKSimon.
Differential Revision: https://reviews.llvm.org/D103274
2021-06-01 10:39:36 +03:00
Tim Northover 9ff2eb1ea5 SwiftTailCC: teach verifier musttail rules applicable to this CC.
SwiftTailCC has a different set of requirements than the C calling convention
for a tail call. The exact argument sequence doesn't have to match, but fewer
ABI-affecting attributes are allowed.

Also make sure the musttail diagnostic triggers if a musttail call isn't
actually a tail call.
2021-05-28 11:12:00 +01:00
Craig Topper a105d3024e [X86] Fold (shift undef, X)->0 for vector shifts by immediate.
We could previously do this by accident through the later
call to getTargetConstantBitsFromNode I think, but that only worked
if N0 had a single use. This patch makes it explicit for undef and
doesn't have a use count check.

I think this is needed to move the (shl X, 1)->(add X, X)
fold to isel for PR50468. We need to be sure X won't be IMPLICIT_DEF
which might prevent the same vreg from being used for both operands.

Differential Revision: https://reviews.llvm.org/D103192
2021-05-27 09:31:47 -07:00
Nick Desaulniers 033138ea45 [IR] make stack-protector-guard-* flags into module attrs
D88631 added initial support for:

- -mstack-protector-guard=
- -mstack-protector-guard-reg=
- -mstack-protector-guard-offset=

flags, and D100919 extended these to AArch64. Unfortunately, these flags
aren't retained for LTO. Make them module attributes rather than
TargetOptions.

Link: https://github.com/ClangBuiltLinux/linux/issues/1378

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D102742
2021-05-21 15:53:30 -07:00
Florian Hahn c2d44bd230
[X86] Lower calls with clang.arc.attachedcall bundle
This patch adds support for lowering function calls with the
`clang.arc.attachedcall` bundle. The goal is to expand such calls to the
following sequence of instructions:

    callq   @fn
    movq  %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue / _objc_unsafeClaimAutoreleasedReturnValue

This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the CALL64_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above.

The ObjC runtime function to call is determined by the
argument in the bundle, which is passed through as a
target constant to the pseudo.

@ahatanak is working on using this attribute in the front- & middle-end.

Together with the front- & middle-end changes, this should address
PR31925 for X86.

This is the X86 version of 46bc40e502,
which added similar support for AArch64.

Reviewed By: ab

Differential Revision: https://reviews.llvm.org/D94597
2021-05-21 16:33:58 +01:00
Jim Lin 4456805938 [X86] Don't fold (fneg (fma (fneg X), Y, (fneg Z))) to (fma X, Y, Z)
Check whether it has the no-signed-zeros flag (nsz) in getNegatedExpression for x86.
This patch fixes a miscompilation: https://alive2.llvm.org/ce/z/XxwBAJ

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D90901
2021-05-21 23:02:19 +08:00
Sanjay Patel f12f9beb04 [x86] propagate FMF from x86-specific intrinsic nodes to others during combining
This is another FMF gap exposed by D90901, but I don't see a way
to show the difference in a regression test as with:
f66ba4c
6025663

We will see an asm difference if we add a test as part of D90901.
2021-05-19 14:25:09 -04:00
Sanjay Patel f66ba4cfa7 [x86] propagate FMF from x86-specific intrinsic nodes to others during lowering
This is another fast-math-flags failure exposed by D90901.
2021-05-19 13:11:15 -04:00
Simon Pilgrim ab4e04a0f3 [X86][AVX] createVariablePermute - generalize the PR50356 fix for smaller indices vector as well
Generalize the fix from rGd0902a8665b1 by ensuring we widen/narrow the indices subvector first and then perform the ZERO_EXTEND_VECTOR_INREG (if necessary), which should allow us to perform the variable permutes with source/destination/indices vectors of any widths.
2021-05-19 14:39:41 +01:00
Tim Northover c1dc267258 MachineBasicBlock: add liveout iterator aware of which liveins are defined by the runtime.
Using this in RegAlloc fast reduces register pressure, and in some cases allows
x86 code to compile that wouldn't before.
2021-05-19 11:00:24 +01:00
Simon Pilgrim d0902a8665 [X86][AVX] createVariablePermute - correctly extend same-sized-vector indices (PR50356)
D101838 incorrectly handled indices vectors of the same size but with higher element counts by just bitcasting to the target indices type instead of performing a ZERO_EXTEND_VECTOR_INREG
2021-05-18 20:30:46 +01:00
Tim Northover ba1509da7b Recommit X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specifically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).

Recommited with a fix for unwind info when i386 pc-rel thunks are
adjacent to a prologue.
2021-05-18 15:19:05 +01:00
Mitch Phillips 6791a6b309 Revert "X86: support Swift Async context"
This reverts commit 747e5cfb9f.

Reason: New frame layout broke the sanitizer unwinder. Not clear why,
but it seems like some of the changes aren't always guarded by Swift
checks. See
https://reviews.llvm.org/rG747e5cfb9f5d944b47fe014925b0d5dc2fda74d7 for
more information.
2021-05-17 12:44:57 -07:00
Simon Pilgrim 41587466aa [X86] Don't dereference a dyn_cast<> - use a cast<> instead. NFCI.
dyn_cast<> can return null if the cast fails; by using cast<> we assert that the cast is correct, helping to avoid a potential null dereference.
2021-05-17 15:58:32 +01:00
Tim Northover 747e5cfb9f X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specifically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
2021-05-17 11:56:16 +01:00
Tim Northover 82a0e808bb IR/AArch64/X86: add "swifttailcc" calling convention.
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".

Support is added for AArch64 and X86 for now.
2021-05-17 10:48:34 +01:00
Simon Pilgrim 262e4200d1 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines (REAPPLIED). NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.

Reapplies rGb95a103808ac (after reversion at rGc012a388a15b), with SSE3/SSSE3 typo fix, test added at rG0afb10de1449.
2021-05-16 13:50:58 +01:00
Nico Weber c012a388a1 Revert "[X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI."
This reverts commit b95a103808.
Makes clang assert very early in a Chromium build. See
https://bugs.chromium.org/p/chromium/issues/detail?id=1209490#c1
for a standalone repro.
2021-05-15 12:20:02 -04:00
Simon Pilgrim b95a103808 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.
2021-05-14 16:52:55 +01:00
Serge Pavlov 12537ab772 [FPEnv][X86] Implement lowering of llvm.set.rounding
Differential Revision: https://reviews.llvm.org/D74730
2021-05-13 14:30:38 +07:00
Fangrui Song 0fe6649bc5 [X86] Fix -Wunused-lambda-capture 2021-05-12 10:34:32 -07:00
Simon Pilgrim fb1d61b725 [X86][AVX] Fold concat(ps*lq(x,32),ps*lq(y,32)) -> shuffle(concat(x,y),zero) (PR46621)
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.

I was hoping to put this in LowerScalarImmediateShift, but performing that early causes regressions where other instructions were re-splitting the subvectors.
2021-05-12 18:04:40 +01:00
Simon Pilgrim 7bff9bdd34 [X86][AVX] combineConcatVectorOps - add ConcatSubOperand helper. NFCI.
Pull out repeated code to create a concat_vectors of the same operand from all subvecs.
2021-05-12 16:42:18 +01:00
Craig Topper 44e0e91db0 [ValueTypes] Rename MVT::getVectorNumElements() to MVT::getVectorMinNumElements(). Fix some misuses of getVectorNumElements()
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.

The changes to isPow2VectorType and getPow2VectorType are copied from EVT.

The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.

This was motivated by the bug I fixed yesterday in 80b9510806

Reviewed By: frasercrmck, sdesmalen

Differential Revision: https://reviews.llvm.org/D102262
2021-05-12 07:46:45 -07:00
Sanjay Patel f58e0513dd [x86] try harder to lower to PCMPGT instead of not-of-PCMPEQ
This is motivated by the example in https://llvm.org/PR50055 ,
but it doesn't do anything for that bug currently because we
don't actually have a zero-extended setcc there.

Proof for the generic transform (inverse of what we would
try to do in combining):
https://alive2.llvm.org/ce/z/aBL-Mg

Differential Revision: https://reviews.llvm.org/D102275
2021-05-12 08:25:29 -04:00
Simon Pilgrim 72e242a286 [X86][AVX] canonicalizeShuffleMaskWithHorizOp - improve support for 256/512-bit vectors
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.

This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.

This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
2021-05-12 12:13:24 +01:00
Simon Pilgrim 9acc03ad92 [X86][SSE] Replace foldShuffleOfHorizOp with generalized version in canonicalizeShuffleMaskWithHorizOp
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffle masks, float/integer domains and support shuffle-of-packs as well.

The next step will be to support 256/512-bit vector cases.
2021-05-11 14:18:45 +01:00
Simon Pilgrim e32374ed5c [X86][SSE] canonicalizeShuffleMaskWithHorizOp - add TODO for better 256/512-bit shuffle+hop folding support. NFC. 2021-05-10 18:43:16 +01:00
Simon Pilgrim 4524d8b755 [X86] combineHorizOpWithShuffle - generalize HOP(SHUFFLE(X),SHUFFLE(Y)) -> SHUFFLE(HOP(X,Y)) fold.
For 128-bit types, generalize the fold to recognise duplicate operands in either shuffle.
2021-05-08 16:23:27 +01:00
Simon Pilgrim f744723f75 [X86] combineXor - limit fold to non-opaque constants (PR50254)
Ensure we don't try to fold when one might be an opaque constant - the constant fold will fail and then the reverse fold will happen in DAGCombine.....
2021-05-07 16:39:24 +01:00
Simon Pilgrim 280aa3415e [DAG] Add a generic expansion for SHIFT_PARTS opcodes using funnel shifts
Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts.

Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication.

I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to).

NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly.

Differential Revision: https://reviews.llvm.org/D101987
2021-05-07 13:12:30 +01:00
Simon Pilgrim 8e42024f79 [X86] Ensure we pass DebugLoc by const reference where possible. NFCI.
Avoids a lot of unnecessary tracking increments/decrements of the underlying TrackingMDNodeRef
2021-05-07 12:32:59 +01:00
Simon Pilgrim 85460a2f5b [X86][SSE] Move unpack(hop,hop) fold from foldShuffleOfHorizOp to combineTargetShuffle
By moving this after more of the shuffle canonicalization we reduce the demanded vector elts, avoiding a few unnecessary copies/moves etc.
2021-05-05 13:36:09 +01:00
Alexey Bataev 13a51e017c [X86] Fix a crash trying to convert indices to proper type.
Need to perform a bitcast on IndicesVec rather than a subvector extract
if the original size of the IndicesVec is the same as the size of the
destination type.

Differential Revision: https://reviews.llvm.org/D101838
2021-05-05 05:14:42 -07:00
Dávid Bolvanský 2af95a5275 [X86] Promote 16-bit CTTZ_ZERO_UNDEF to 32-bit variant
Related to PR50172.

Protects us against regressions once we start doing the cttz(zext(x)) -> zext(cttz(x)) transformation in the middle-end.
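
Roughly, the scalar identity being relied on (assuming a non-zero input, which is what the ZERO_UNDEF variant requires; the helper name is made up):

    #include <cstdint>

    // Counting trailing zeros of a non-zero i16 value gives the same answer
    // on the zero-extended i32 value, so the 16-bit node can be promoted.
    unsigned cttz16(std::uint16_t x) {   // precondition: x != 0
      return __builtin_ctz(static_cast<std::uint32_t>(x));
    }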

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D101662
2021-05-01 00:42:15 +02:00
Craig Topper 3067520bf4 [SelectionDAG] Use a VTSDNode to store the saturation width for FP_TO_SINT_SAT/FP_TO_UINT_SAT
Previously we used an i32 constant to store the saturation width, but i32 isn't
legal on RISCV64. This wasn't a big deal to fix, but it is extra work for the
type legalizer.

This patch uses a VTSDNode to store the type similar to SEXT_INREG. This makes
it opaque to the type legalizer.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D101262
2021-04-27 14:38:42 -07:00
Nick Desaulniers ea8416bf4d [CodeGenOptions] make StackProtectorGuardOffset signed
GCC supports negative values for -mstack-protector-guard-offset=, so this
should be a signed value. Pre-req for D100919.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101325
2021-04-27 10:12:58 -07:00
Sander de Smalen 43ace8b5ce [TTI] NFC: Change getScalingFactorCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100564
2021-04-23 16:06:36 +01:00
Simon Pilgrim 7b32e8b96a [X86] combineSetCCAtomicArith - pull out repeated ops. NFCI.
Reduces diff in D101074
2021-04-23 14:19:24 +01:00
Simon Pilgrim a511b55cfd [X86][SSE] getFauxShuffleMask - don't decode OR(SHUFFLE,SHUFFLE) containing UNDEFs. (PR50049)
PR50049 demonstrated an infinite loop between OR(SHUFFLE,SHUFFLE) <-> BLEND(SHUFFLE,SHUFFLE) patterns.

The UNDEF elements were allowing a combined shuffle mask to be widened, which lost the undef element, resulting in us needing to use the BLEND pattern (as the undef element would need to be zero for the OR pattern). But then bitcast folds would re-expose the undef element, allowing us to use OR again.
2021-04-21 18:47:00 +01:00
Simon Pilgrim 2a419a0b99 [X86][SSE] combineX86ShuffleChain - check if we're blending with zero into already zero elements
Add a SelectionDAG::MaskedElementsAreZero helper that wraps SelectionDAG::MaskedValueIsZero testing for entirely zero vector elements
2021-04-20 17:09:49 +01:00
Simon Pilgrim 9d57a77b81 [X86] combineCMP - fold cmpEQ/NE(TRUNC(X),0) -> cmpEQ/NE(X,0)
If we are truncating from an i32 source before comparing the result against zero, then see if we can directly compare the source value against zero.

If the upper (truncated) bits are known to be zero then we can compare against that, hopefully increasing the chances of us folding the compare into an EFLAGS result of the source's operation.

Fixes PR49028.

Differential Revision: https://reviews.llvm.org/D100491
2021-04-15 13:55:51 +01:00
Simon Pilgrim 4fbe761572 [X86][SSE] canonicalizeShuffleWithBinOps - check for more combos of merge-able binary shuffles.
In the fold SHUFFLE(BINOP(X,Y),BINOP(Z,W)) -> BINOP(SHUFFLE(X,Z),SHUFFLE(Y,W)), check if both X/Z AND Y/W have at least one merge-able shuffle, in which case the total number of shuffles should still fall.

Helps with instruction count regressions we saw while fixing PR48823
2021-04-14 15:24:41 +01:00
Simon Pilgrim 73737fe990 [X86] Fold cmpeq/ne(trunc(x),0) --> cmpeq/ne(x,0)
Relax the fold from rGbaadbe04bf75 to compare any op, not just logic ops, now that the movmsk regressions have been handled.
2021-04-14 11:02:02 +01:00
Simon Pilgrim 016ceb8382 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) fold
Extension to rG74f98391a7a4, we can also include any of the upper (known zero) bits in the comparison in the shuffle removal fold, just as long as we demand all the elements of the movmsk source vector.
2021-04-14 11:02:01 +01:00
Simon Pilgrim 74f98391a7 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in CMP(MOVMSK(PACKSS())) -> CMP(MOVMSK()) fold
We already allow the comparison of the upper bits of 'IsAllOf' (allbits) patterns, but we can safely compare the known zero bits for 'IsAnyOf' (zerobits) patterns as well.

This fixes an issues where we are comparing a type wide than the number of vector elements, which avoids a regression mentioned in rGbaadbe04bf75.
2021-04-13 17:37:24 +01:00
Simon Pilgrim baadbe04bf [X86] Fold cmpeq/ne(trunc(logic(x)),0) --> cmpeq/ne(logic(x),0)
Fixes the issues noted in PR48768, where the and/or/xor instruction had been promoted to avoid i8/i16 partial-dependencies, but the test against zero had not.

We can almost certainly relax this fold to work for any truncation, although it breaks a number of existing folds (notably movmsk folds, which tend to rely on the truncate to determine the demanded bits/elts in the source vector).

There is a reverse combine in TargetLowering.SimplifySetCC so we must wait until after legalization before attempting this.
2021-04-12 16:05:34 +01:00
Simon Pilgrim 231b87618b [X86][AVX512] Fold not(kmov(x)) -> kmov(not(x)) and not(widen_subvector(x)) -> widen_subvector(not(x))
Improve AVX512 mask inversion: rG38c799bce801 exposed some missing opportunities to move scalar not() back onto the boolvector types for folding with setcc etc.
2021-04-11 20:07:09 +01:00
Simon Pilgrim 13bdac5709 [X86] combineXor - Pull out repeated getOperand() calls. NFCI. 2021-04-11 19:01:59 +01:00
Simon Pilgrim 38c799bce8 [X86] Fold cmpeq/ne(and(X,Y),Y) --> cmpeq/ne(and(~X,Y),0)
Followup to D100177, handle a similar (De Morgan inverse style) case from PR47797 as well

The AVX512 test cases could be further improved if we folded not(iX bitcast(vXi1)) -> (iX bitcast(not(vXi1)))

Alive2: https://alive2.llvm.org/ce/z/AnA_-W
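
A scalar sketch of the equivalence (the patch itself operates on the DAG; these helpers are just for illustration):

    // (x & y) == y holds exactly when y has no bits set outside x,
    // i.e. (~x & y) == 0, which maps onto ANDN plus a test against zero.
    bool before(unsigned x, unsigned y) { return (x & y) == y; }
    bool after (unsigned x, unsigned y) { return (~x & y) == 0; }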
2021-04-11 18:42:01 +01:00
Simon Pilgrim d8bc4de3cf [X86] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) on non-BMI targets (PR44136)
Followup to D100177, enable the fold for non-BMI targets as well.
2021-04-09 16:11:11 +01:00
Simon Pilgrim 245036950a [X86][BMI] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) (PR44136)
I've initially just enabled this for BMI which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what we'd get when we enable it in all cases (I'll do this as a later commit).

Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.

https://alive2.llvm.org/ce/z/QVWHP_
https://alive2.llvm.org/ce/z/pLngT-

Vector cases do not appear to benefit from this as we end up having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.

Differential Revision: https://reviews.llvm.org/D100177
2021-04-09 15:52:03 +01:00
Simon Pilgrim 3ae0a405fc [X86] combineHorizOpWithShuffle - peek through one use bitcasts when decoding shuffles.
After checking for one use, peek through bitcasts of the horizop args; this allows us to merge shuffles of different widths through the horizop.
2021-04-09 10:51:04 +01:00
Simon Pilgrim 53283cc2f1 [X86][SSE] canonicalizeShuffleWithBinOps - add MOVSD/MOVSS handling. 2021-04-06 16:42:18 +01:00
Simon Pilgrim ddbb58736a [KnownBits] Rename KnownBits::computeForMul to KnownBits::mul. NFCI.
As promised in D98866
2021-04-06 10:11:41 +01:00
Simon Pilgrim 36d4f6d7f8 [X86] Fold xor(zext(xor(x,c1)),c2) -> xor(zext(x),xor(zext(c1),c2))
Fixes PR47603 (second case) by extending rG89afec348dbd3e5078f176e978971ee2d3b5dec8
2021-04-05 11:40:37 +01:00
Simon Pilgrim 89afec348d [X86] Fold xor(truncate(xor(x,c1)),c2) -> xor(truncate(x),xor(truncate(c1),c2))
Fixes PR47603

This should probably be transferable to DAGCombine - the main limitation with the existing trunc(logicop) DAG fold is we don't know if legalization has tried to promote truncated logicops already. We might be able to peek through extensions as well.
2021-04-03 12:43:05 +01:00
Simon Pilgrim 7c17f1ea84 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper (REAPPLIED)
Use the getTargetShuffleInputs helper for all shuffle decoding

Reapplied (after reversion in rGfa0aff6d6960) with fix+test for subvector splitting - we weren't accounting for peeking through bitcasts changing the vector element count of the shuffle sources.
2021-04-03 11:59:19 +01:00
Nico Weber fa0aff6d69 Revert "[X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper"
This reverts commit 500969f1d0.
Makes clang assert compiling avx2 code, see
https://bugs.chromium.org/p/chromium/issues/detail?id=1195353#c4
for a standalone repro.
2021-04-02 09:55:55 -04:00
Simon Pilgrim 500969f1d0 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper
Use the getTargetShuffleInputs helper for all shuffle decoding
2021-04-02 11:50:18 +01:00
Yang Fan bc6001ce1e
[X86] Fix -Wunused-function warning (NFC)
GCC warning:
```
/llvm-project/llvm/lib/Target/X86/X86ISelLowering.cpp:9212:13: warning: ‘bool isHorizOp(unsigned int)’ defined but not used [-Wunused-function]
 9212 | static bool isHorizOp(unsigned Opcode) {
      |             ^~~~~~~~~
```
2021-04-02 09:38:12 +08:00
Simon Pilgrim abbe80fa52 [X86][SSE] Fold HOP(HOP(X,X),HOP(Y,Y)) -> HOP(PERMUTE(HOP(X,Y)),PERMUTE(HOP(X,Y))
For slow-hop targets, attempt to merge HADD/SUB pairs used in chains.
2021-04-01 11:54:10 +01:00
Simon Pilgrim 301319840e [X86][SSE] Enable (F)HADD/SUB handling to SimplifyMultipleUseDemandedVectorElts
Attempt to bypass unused horiz-op operands.

This is very similar to the PACKSS/PACKUS handling - we should try to merge these.
2021-04-01 11:54:09 +01:00
Simon Pilgrim f7aeaced65 [X86][SSE] Add isHorizOp helper function. NFCI. 2021-04-01 11:54:09 +01:00
Craig Topper 437958d9fd [X86] Improve SMULO/UMULO codegen for vXi8 vectors.
The default expansion creates a MUL and either a MULHS or MULHU. Each
of those separately expands to sequences that use one or more
PMULLW instructions as well as additional instructions to
extend the types to vXi16. The MULHS/MULHU expansion computes the
whole 16-bit product, but only keeps the high part.

We can improve the lowering of SMULO/UMULO for some cases by using the
MULHS/MULHU expansion but keeping both the high and low parts, and then
using those parts to calculate the overflow.
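
A scalar model of the signed i8 case, just to show how keeping both halves gives the overflow bit (the actual patch works on whole vectors; names are illustrative):

    #include <cstdint>

    // Overflow occurred iff the 16-bit product's high half is not simply the
    // sign-extension of its low half.
    bool smulo_i8(std::int8_t a, std::int8_t b, std::int8_t &lo) {
      std::int16_t product = std::int16_t(std::int16_t(a) * std::int16_t(b));
      lo = std::int8_t(product);              // low half: the multiply result
      return product != std::int16_t(lo);     // compare against sign-extended low half
    }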

For AVX512 we might have vXi1 overflow outputs. We can improve those by using
vpcmpeqw to produce a k register if AVX512BW is enabled. This is a little better
than truncating the high result to use vpcmpeqb. If we don't have avx512bw we
can extend up to v16i32 to use vpcmpeqd to produce a k register.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97624
2021-03-31 10:13:50 -07:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
its name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.
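
A minimal mock sketch of how the three queries compose (the real methods live on TargetRegisterInfo and take a MachineFunction; this struct is purely illustrative):

    struct RegInfoSketch {
      virtual bool shouldRealignStack() const { return false; } // any reason to realign?
      virtual bool canRealignStack() const { return true; }     // still able to realign?
      bool hasStackRealignment() const {                        // not target customisable
        return shouldRealignStack() && canRealignStack();
      }
      virtual ~RegInfoSketch() = default;
    };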

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sanjay Patel e694e19a79 [x86] enhance matching of pmaddwd
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583 ,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.

This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.

Differential Revision: https://reviews.llvm.org/D99531
2021-03-30 07:28:33 -04:00
Simon Pilgrim 805148eaf2 [X86][SSE] combineHorizOpWithShuffle - consistently use getTargetShuffleInputs to decode shuffles
Minor cleanup before I start trying to merge the unary/binary shuffle combining paths.
2021-03-29 11:31:19 +01:00
Craig Topper 69bdf35dc7 [X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshufd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.
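
An intrinsic-level sketch of the trick for one 128-bit vector (illustrative only, not the backend code), assuming SSE2:

    #include <immintrin.h>

    // Signed high-multiply of packed i8: place each byte in the high byte of
    // a 16-bit lane (low byte zero), so PMULHW yields the full 16-bit product
    // of the original bytes.
    __m128i mulhs_v16i8(__m128i a, __m128i b) {
      __m128i zero = _mm_setzero_si128();
      __m128i alo = _mm_unpacklo_epi8(zero, a), blo = _mm_unpacklo_epi8(zero, b);
      __m128i ahi = _mm_unpackhi_epi8(zero, a), bhi = _mm_unpackhi_epi8(zero, b);
      __m128i plo = _mm_mulhi_epi16(alo, blo);  // full i8*i8 product, 16 bits
      __m128i phi = _mm_mulhi_epi16(ahi, bhi);
      // The i8 "high half" is bits [15:8] of each word: shift and repack.
      return _mm_packs_epi16(_mm_srai_epi16(plo, 8), _mm_srai_epi16(phi, 8));
    }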

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98587
2021-03-28 11:41:29 -07:00
Simon Pilgrim 2a0d5da917 [X86][SSE] foldShuffleOfHorizOp - remove broadcast handling.
Remove VBROADCAST/MOVDDUP/splat-shuffle handling from foldShuffleOfHorizOp

This can all be handled by canonicalizeShuffleMaskWithHorizOp as long as we check that the HADD/SUB are only used once (to prevent infinite loops on slow-horizop targets which will try to reuse the nodes again followed by a post-hop shuffle).
2021-03-27 15:09:23 +00:00
Simon Pilgrim 41146bfe82 [X86][SSE] combineX86ShuffleChain - attempt to recognise 'hidden' identity shuffles
See if the combined shuffle mask is equivalent to an identity shuffle; typically this is due to repeated LHS/RHS ops in horiz-ops, but isTargetShuffleEquivalent might see other patterns as well.

This is another small step towards getting rid of foldShuffleOfHorizOp and relying on canonicalizeShuffleMaskWithHorizOp and generic shuffle combining.
2021-03-27 11:09:30 +00:00
Sanjay Patel a283d72583 [x86] prevent crashing while matching pmaddwd
This could crash in 2 ways: either one or both of
the input vectors could be a different size than
the math ops.

https://llvm.org/PR49716
2021-03-27 05:27:14 -04:00
Simon Pilgrim c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode
2021-03-26 17:23:54 +00:00
Simon Pilgrim 36e3c6c841 [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
Until AVX512 we don't have any vector truncation instructions, so we always lower truncation using shuffles instead.

combineVectorTruncation performs this earlier than lowering, as it makes it easier to use any sign/zero-extended bits in the value being truncated with PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now all be resolved with combineHorizOpWithShuffle, and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.
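
As a rough intrinsics sketch (not the code this patch emits, and the helper
name is made up), a v8i32 -> v8i16 truncation whose inputs are already
zero-extended 16-bit values can be done on AVX2 with a per-lane vpackusdw
plus one cross-lane fixup:

#include <immintrin.h>

// Truncate eight 32-bit lanes that are already zero-extended 16-bit values
// down to eight 16-bit lanes. VPACKUSDW packs within each 128-bit lane, so a
// single 64-bit-element permute fixes up the lane order afterwards.
static __m128i trunc_v8i32_to_v8i16(__m256i v) {
  __m256i packed = _mm256_packus_epi32(v, v);                    // per-lane pack
  packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(2, 0, 2, 0));
  return _mm256_castsi256_si128(packed);                         // low 128 bits
}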

Differential Revision: https://reviews.llvm.org/D96609
2021-03-25 10:34:34 +00:00
Simon Pilgrim 9fde88c3e2 [X86][AVX] splitIntVSETCC - handle separate (canonicalized) SETCC operands
LowerVSETCC calls splitIntVSETCC after canonicalizing certain patterns, in particular (X & CPow2 != 0) -> (X & CPow2 == CPow2).

Unfortunately if we're splitting for AVX1/non-AVX512BW cases, we lose these canonicalizations as we call the split with the original SetCC node, and when the split nodes are later lowered in LowerVSETCC the patterns are lost behind extract_subvector etc. But if we pass the canonicalized operands for splitting we retain the optimizations.
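
The canonicalization itself is a simple power-of-two mask identity; a quick
scalar check (illustrative only):

#include <cassert>
#include <cstdint>

int main() {
  // For a power-of-two mask CPow2, (X & CPow2) != 0 and (X & CPow2) == CPow2
  // test exactly the same bit, so the canonicalized form is equivalent.
  const uint32_t CPow2 = 0x10;
  for (uint32_t x = 0; x <= 0xFFFF; ++x)
    assert(((x & CPow2) != 0) == ((x & CPow2) == CPow2));
  return 0;
}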

Differential Revision: https://reviews.llvm.org/D99256
2021-03-25 10:18:44 +00:00
Simon Pilgrim 7920527796 [X86][AVX] combineBitcastvxi1 - improve handling of vectors truncated to vXi1
If we're truncating to vXi1 from a wider type, then prefer the original wider vector as it simplifies folding the separate truncations + extensions.

On AVX1 this is only worth it for v8i1 cases, not v4i1, where we're always better off truncating down to v4i32 for movmsk.

Helps with some regressions encountered in D96609
2021-03-24 14:05:59 +00:00
Simon Pilgrim e9015bd595 [X86][AVX] lowerShuffleAsBroadcast - MOVDDUP(SCALAR_TO_VECTOR(X)) -> BROADCAST(X)
Prefer broadcast from scalar on AVX targets as this makes it easier for later folds to strip away bitcasts etc.

This helps a lot with the AVX1 poor codegen from PR49658.

There's a trivial regression in the bitcast-int-to-vector-bool-*ext.ll tests due to SimplifyDemandedBits not being able to see a multi-use case, but there are bigger existing codegen issues to be addressed first in those tests (unnecessary NOTs).
2021-03-24 11:31:56 +00:00
Simon Pilgrim c1ef642ad8 [X86] Remove unused 'OneUse' option from IsNOT helper. NFCI. 2021-03-24 11:14:38 +00:00
Simon Pilgrim 080cb83e52 [X86][AVX] Narrow VPBROADCASTQ->VPBROADCASTD if we don't need the upper bits.
Helps fix cases where we've splatted smaller types to a wider vector element type without needing the upper bits.

Avoid this on AVX512 targets as that can affect broadcast folding.
2021-03-23 09:41:02 +00:00
Simon Pilgrim 3179588947 [X86][AVX] ComputeNumSignBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Added as an extension to PR49658
2021-03-21 12:22:51 +00:00
Simon Pilgrim 297b9bc3fa [X86][AVX] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Suggested by @craig.topper on PR49658
2021-03-21 10:40:57 +00:00
Simon Pilgrim 54a05f2ec8 [X86] computeKnownBitsForTargetNode - add X86ISD::PMULUDQ handling
Reuse the existing KnownBits multiplication code to handle what is effectively an ISD::UMUL_LOHI variant
2021-03-21 09:57:20 +00:00
Simon Pilgrim 64687f2cc3 [X86][SSE] canonicalizeShuffleWithBinOps - add PERMILPS/PERMILPD + PERMPD/PERMQ + INSERTPS handling.
Bail if the INSERTPS would introduce zeros across the binop.
2021-03-16 13:52:08 +00:00
Simon Pilgrim 772155793b [X86][SSE] isHorizontalBinOp - ensure we clear any unused source operands to improve HADD/SUB matching
Our shuffle matching for HADD/SUB patterns wasn't clearing repeated ops in 'fake unary' style shuffle masks (unpack(x,x) etc.), preventing matching of add(fakeunary(),fakeunary()) style patterns.
2021-03-15 16:24:29 +00:00
Simon Pilgrim 814339454d [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 15:01:29 +00:00
Simon Pilgrim 07232f4507 [X86][SSE] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling.
Recommit rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 with an early-out if the pshufb would introduce zeros across the binop.
2021-03-15 12:43:30 +00:00
Simon Pilgrim 75a184dacf Revert rG9ba577eca2e339726bfaad4e615c6324a705b292 "[X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI."
Sorry this wasn't supposed to be committed yet (and certainly not tagged as NFCI....)
2021-03-15 12:23:44 +00:00
Simon Pilgrim 9ba577eca2 [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 11:59:25 +00:00
Simon Pilgrim 6878be5dc3 [X86][SSE] Attempt to merge single-op hops for slow targets.
For slow-hop targets, see if any single-op hops are duplicating work already done on another (dual-op) hop, which can sometimes occur as isHorizontalBinOp tries to find potential duplicates (but can't merge them itself). If so, reuse the other hop and shuffle the result.
2021-03-15 09:30:20 +00:00
Simon Pilgrim 6cb7dddaf4 [X86][AVX] Insert zero byte elements into 256/512-bit vectors using shuffle/and
Avoid extracting/inserting subvectors which makes it more difficult for shuffle combining to merge them together.
2021-03-12 15:16:36 +00:00
Simon Pilgrim 33dcdd414c [X86] Provide lighter weight getTargetShuffleMask wrapper. NFCI.
Most callers to getTargetShuffleMask don't use the IsUnary flag.
2021-03-12 15:16:35 +00:00
Simon Pilgrim bc5e9ec2dc Revert rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 "[X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling."
Investigating an issue reported by @bkramer, possibly when the PSHUFB mask generates zero elements.
2021-03-11 13:14:00 +00:00
Simon Pilgrim 77394c12a4 [X86] Don't attempt to fold sub(C1, xor(X, C2)) with opaque constants
Fixes PR49451
2021-03-11 12:06:40 +00:00
Simon Pilgrim d0884541cc [X86] canonicalizeShuffleWithBinOps - add binary shuffle handling 2021-03-09 13:57:03 +00:00
Simon Pilgrim f71cee136d [X86] Break if-else chain. NFCI.
Both if blocks affect control flow - we don't need the else.

Fixes clang-tidy warning.
2021-03-08 11:44:31 +00:00
Simon Pilgrim cd938ab162 [X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling. 2021-03-07 12:56:35 +00:00
Simon Pilgrim 772a501bf4 [X86] canonicalizeShuffleWithBinOps - shuffle oneuse constants.
We can freely shuffle all ones/zeros constants but we can also freely shuffle other constants as long as they only have one use.
2021-03-07 11:17:03 +00:00
Alexey Lapshin cf7cdaff64 [X86][VARARG] Avoid spilling xmm registers for va_start.
This patch is extracted from D69372.
It fixes https://bugs.llvm.org/show_bug.cgi?id=42219.

In noimplicitfloat mode, the compiler must not generate floating-point code
unless it is directly asked to do so. This rule currently breaks down for
variable function arguments. Although the compiler correctly guards the block
of code that copies xmm vararg parameters behind a check of %al, it does not
protect the spills of those xmm registers. Such spills are therefore generated
in unprotected areas and can break code that does not expect floating-point
data. The problem shows up at -O0, where the FastRegisterAllocator is used: it
spills virtual registers at basic block boundaries and does not protect the
spills with additional control-flow modifications. To resolve the problem,
this patch stops copying the incoming physical xmm registers into virtual
registers and instead stores them to memory directly.
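
A hypothetical reduced example of the affected shape - a variadic function
that never touches floating point, built with something like
clang -O0 -mno-implicit-float:

#include <cstdarg>

// Hypothetical reduced example: a variadic function that never uses floating
// point. With noimplicitfloat the va_start prologue must not introduce
// unguarded xmm register spills.
static int sum_ints(int count, ...) {
  va_list ap;
  va_start(ap, count);
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += va_arg(ap, int); // integer-only varargs, no fp expected
  va_end(ap);
  return total;
}

int main() {
  return sum_ints(3, 1, 2, 3) == 6 ? 0 : 1;
}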

Differential Revision: https://reviews.llvm.org/D80163
2021-03-06 15:25:47 +03:00
Simon Pilgrim d7b8cb4d57 [X86] X86ISelLowering.cpp - try to use for-range loops. NFCI. 2021-03-05 11:09:14 +00:00
Simon Pilgrim 7cbc5df438 [X86] X86TargetLowering::isSafeMemOpType - break if-else chain. NFCI.
All if-else blocks return - fixes clang-tidy warning.
2021-03-04 12:15:08 +00:00
Simon Pilgrim 1584e55a26 [X86] canonicalizeShuffleWithBinOps - handle general unaryshuffle(binop(x,c)) patterns not just xor(x,-1)
Generalize the shuffle(not(x)) -> not(shuffle(x)) fold to handle any binop with 0/-1.

Hopefully we can further generalize to help push target unary/binary shuffles through binops similar to what we do in DAGCombiner::visitVECTOR_SHUFFLE
2021-03-04 10:44:38 +00:00
Simon Pilgrim aa4afebbf9 [X86] Fold scalar_to_vector(x) -> extract_subvector(broadcast(x),0) iff broadcast(x) exists
Add handling for reusing an existing broadcast(x) to a wider vector.
2021-03-03 15:50:37 +00:00
Benjamin Kramer 10c256ccaf Revert "[X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))"
This reverts commit 925093d88a.

Causes an infinite loop when compiling some shuffles:

$ cat bugpoint-reduced-simplified.ll
target triple = "x86_64-unknown-linux-gnu"

define void @foo() {
entry:
  %0 = load i8, i8* undef, align 1
  %broadcast.splatinsert = insertelement <16 x i8> poison, i8 %0, i32 0
  %1 = icmp ne <16 x i8> %broadcast.splatinsert, zeroinitializer
  %2 = shufflevector <16 x i1> %1, <16 x i1> undef, <16 x i32> zeroinitializer
  %wide.load = load <16 x i8>, <16 x i8>* undef, align 1
  %3 = icmp ne <16 x i8> %wide.load, zeroinitializer
  %4 = and <16 x i1> %3, %2
  %5 = zext <16 x i1> %4 to <16 x i8>
  store <16 x i8> %5, <16 x i8>* undef, align 1
  ret void
}

$ llc < bugpoint-reduced-simplified.ll
<timeout>
2021-03-02 11:24:07 +01:00
Simon Pilgrim 925093d88a [X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))
Move NOT out to expose more AND -> ANDN folds
2021-03-01 14:47:39 +00:00
Simon Pilgrim ab3ea27b6f [X86][AVX] Reuse existing VBROADCAST(x) for SCALAR_TO_VECTOR(x)
Similar to what we already do for BROADCASTs of different vector sizes - if we're going to broadcast it anyway, we might as well reuse it.
2021-02-28 11:37:27 +00:00
Craig Topper 993f4d8ffa [X86] Fix a couple comments that said LHS where they meant RHS. NFC 2021-02-27 17:14:17 -08:00
James Y Knight 6de6455752 Use getAlign() on atomicrmw/cmpxchg instructions, now that it's available.
These locations were missed as part of adding alignment to the
instructions, and were still making their own alignment assumptions.
2021-02-26 15:06:15 -05:00
Simon Pilgrim ed1f45bce9 [X86][AVX] SimplifyDemandedBitsForTargetNode - add basic X86ISD::VBROADCAST handling.
Simplify through to the scalar/vector source operand.
2021-02-26 16:13:14 +00:00
Simon Pilgrim 7ac4c956af [X86] Remove unnecessary custom lowering of vXi1 SADDSAT/SSUBSAT/UADDSAT/USUBSAT
As discussed on D97478. The removal of the custom tag causes some changes in the add/sub-overflow expansion as it no longer expands to sat-arith codegen.
2021-02-26 12:10:23 +00:00
Simon Pilgrim aefe8f2f6c [DAG] Fold vXi1 multiplies -> and
This allows us to remove X86 custom lowering of vXi1 MUL, which helps simplify a load of mask math.

Mentioned in D97478 post review.
2021-02-26 11:46:12 +00:00
Simon Pilgrim 40b8b4a466 [X86] Remove unnecessary custom lowering of v16i1/v32i1 ADD/SUB
These were missed in D97478
2021-02-26 11:46:11 +00:00
Craig Topper ceaedfb5fc [X86] Remove custom lowering of vXi1 ADD/SUB now that they are canonicalized to XOR in getNode.
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97478
2021-02-25 08:52:41 -08:00
Simon Pilgrim 8b82669d56 [X86][SSE] Move unaryshuffle(xor(x,-1)) -> xor(unaryshuffle(x),-1) fold into helper. NFCI.
We should be able to extend this "canonicalizeShuffleWithBinOps" to handle more generic binop cases where either/both operands can be cheaply shuffled.
2021-02-25 10:56:23 +00:00
Simon Pilgrim b568d3d6c9 [X86] Add vector support to sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. 2021-02-21 21:51:27 +00:00
Simon Pilgrim 3ab32c94a4 [X86] Replace explicit constant handling in sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. NFCI.
NFC cleanup before adding vector support - rely on the SelectionDAG to handle everything for us.
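
The identity behind the fold can be checked with a quick scalar model
(illustrative only; it holds modulo 2^N for any bit width):

#include <cassert>
#include <cstdint>

int main() {
  // Check sub(C1, xor(X, C2)) == add(xor(X, ~C2), C1 + 1) over 8-bit values.
  for (unsigned c1 = 0; c1 < 256; ++c1)
    for (unsigned c2 = 0; c2 < 256; ++c2)
      for (unsigned x = 0; x < 256; ++x) {
        uint8_t lhs = (uint8_t)(c1 - (x ^ c2));
        uint8_t rhs = (uint8_t)((x ^ (uint8_t)~c2) + c1 + 1);
        assert(lhs == rhs);
      }
  return 0;
}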
2021-02-21 21:40:32 +00:00
Simon Pilgrim bae04a3e2d [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - remove unnecessary BITCASTs.
In conjunction with the 'vperm2x128(bitcast(x),bitcast(y),c) -> bitcast(vperm2x128(x,y,c))' fold in combineTargetShuffle, this should remove any unnecessary bitcasts around vperm2x128 lane shuffles.
2021-02-21 18:40:32 +00:00
Simon Pilgrim a6a258f1da [X86][AVX] Fold concat(extract_subvector(v0,c0), extract_subvector(v1,c1)) -> vperm2x128
Fixes regression exposed by removing bitcasts across logic-ops in D96206.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:50:43 +00:00
Simon Pilgrim 2885d1251f [X86] Fold bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y)) for int-int bitcasts
Extend the existing combine that handles bitcasting for fp-logic ops to also help remove logic ops across bitcasts to/from the same integer types.

This helps improve AVX512 predicate handling for D/Q logic ops and also allows DAGCombine's scalarizeExtractedBinop to remove some annoying gpr->simd->gpr transfers.

The concat_vectors regression in pr40891.ll will be addressed in a followup commit on this patch.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:40:54 +00:00
Simon Pilgrim 761bbed264 [DAG] foldSubToUSubSat - fold sub(a,trunc(umin(zext(a),b))) -> usubsat(a,trunc(umin(b,SatLimit)))
This moves the last custom x86 USUBSAT fold to generic DAGCombine.

Completes PR40111

Differential Revision: https://reviews.llvm.org/D96703
2021-02-20 12:02:07 +00:00
Simon Pilgrim 2258b367db [X86][AVX] getFauxShuffleMask - decode VBROADCAST(EXTRACT_VECTOR_ELT(V,0))
Handle the case where we're broadcasting a scalar extracted from another vector.
2021-02-19 11:06:53 +00:00
Wang, Pengfei c98644c2ec [X86] Fix a codegen crash in getSetCCResultType
This patch fixes some crashes coming from
X86ISelLowering::getSetCCResultType, which would occasionally return
an EVT constructed from an invalid MVT, which has a null Type pointer.

This patch refers to D95434.

Differential Revision: https://reviews.llvm.org/D97036
2021-02-19 17:30:10 +08:00
Simon Pilgrim 05c64ea672 [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) (REAPPLIED)
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-17 11:42:43 +00:00
Sterling Augustine 5aa8f4c084 Revert "[DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))"
This reverts commit 5dfba562dd.

That commit causes an assertion failure with the following repro:

typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
  b k, l, m, n, o[6], p, q;
  m = d[5];
  b r = m;
  b s = f(r, 8);
  q = s;
  l = d[1];
  p = l;
  t(q);
  n = c(m, l);
  o[1] = c(s, f(p, 8));
  k = __builtin_shufflevector(n, o[1], 0, 2);
  e = __builtin_ia32_psrlwi128(k, j);
}

./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
2021-02-16 12:48:15 -08:00
Simon Pilgrim 5dfba562dd [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-16 15:46:34 +00:00
Simon Pilgrim 4841a225b7 [DAG] Move basic USUBSAT pattern matches from X86 to DAGCombine
Begin transitioning the X86 vector code to recognise sub(umax(a,b),b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.

This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.

We can handle the trunc(sub(..)) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.

The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.
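
A scalar model of the patterns being recognised (illustrative only;
usubsat16 is a made-up helper):

#include <algorithm>
#include <cassert>
#include <cstdint>

// usubsat(a, b): unsigned subtraction that clamps at zero instead of wrapping.
static uint16_t usubsat16(uint16_t a, uint16_t b) {
  return a > b ? (uint16_t)(a - b) : (uint16_t)0;
}

int main() {
  for (uint32_t a = 0; a <= 0xFFFF; ++a)
    for (uint32_t b = 0; b <= 0xFFFF; b += 131) {
      uint16_t ua = (uint16_t)a, ub = (uint16_t)b;
      // Both source patterns compute the same saturating subtraction.
      assert((uint16_t)(std::max(ua, ub) - ub) == usubsat16(ua, ub)); // sub(umax(a,b),b)
      assert((uint16_t)(ua - std::min(ua, ub)) == usubsat16(ua, ub)); // sub(a,umin(a,b))
    }
  return 0;
}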

Differential Revision: https://reviews.llvm.org/D96413
2021-02-12 18:22:57 +00:00
Simon Pilgrim eb31c3c5cb Revert rGe1172959226689a "[X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks."
Revert this while I investigate a downstream breakage report.
2021-02-10 10:26:44 +00:00
Simon Pilgrim 89d9ff8229 [X86][SSE] foldShuffleOfHorizOp - add SHUFPS v4f32 handling
Fold shufps(hop(x,y),hop(z,w)) -> permute(hop(x,z)) - this is very similar to the equivalent unpack fold.

I did start trying to convert foldShuffleOfHorizOp to handle generic shuffle masks but we're relying on a lot of special cases at the moment.
2021-02-09 14:18:45 +00:00
Simon Pilgrim 598ceb25d4 [X86][AVX] Fold extract_subvector(splat, c) -> extract_subvector(splat, 0)
We already do this for VBROADCASTs, extend this for any splat that SelectionDAG::isSplatValue recognises as well.
2021-02-07 11:42:41 +00:00
Craig Topper 6f4f0efd89 [X86] Don't pass a 1 to the second argument of ISD::FP_ROUND in LowerFCOPYSIGN.
I don't think we have any reason to believe the FP_ROUND here doesn't change the value.

Found while trying to see if we still need the fp128 block in CanCombineFCOPYSIGN_EXTEND_ROUND.
Removing that check caused this FP_ROUND to fire for fp128, which introduced a libcall expansion that asserted because this argument was 1.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D96098
2021-02-06 10:29:01 -08:00
Simon Pilgrim e117295922 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks.
Now that PR48908 has been dealt with, we can handle v4f64 permute cases by extracting the low/high lane VPERMILPD masks and creating a new mask based on which lanes are referenced by the VPERM2F128 mask.
2021-02-06 15:58:02 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Simon Pilgrim 31b85e1c0b [X86] Use VT::changeVectorElementType helper where possible. NFCI. 2021-02-04 15:03:56 +00:00
Simon Pilgrim fa2cdb8140 [X86] Remove stale TODO comment. NFC.
We now handle implicit zero-extension shuffle mask cases.
2021-02-04 12:14:05 +00:00
Simon Pilgrim 32b7c2fa42 [X86][SSE] Support variable-index float/double vector insertion on SSE41+ targets (PR47924)
Extends D95779 to permit insertion into float/double vectors while avoiding a lot of aliased memory traffic.

The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.

SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.
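
A rough intrinsics sketch of the idea for v4f32 (illustrative only, not the
exact sequence the lowering emits; the helper name is made up): splat the
index, compare it against the lane numbers, and blend.

#include <immintrin.h>

// Insert scalar s at runtime lane idx of a v4f32 without going through
// memory: splat the index, compare it against the lane numbers, and blend
// (requires SSE4.1 for blendvps).
static __m128 insert_f32_var_idx(__m128 v, float s, int idx) {
  __m128i vidx = _mm_set1_epi32(idx);                 // broadcast the index
  __m128i lanes = _mm_setr_epi32(0, 1, 2, 3);         // lane numbers 0..3
  __m128 mask = _mm_castsi128_ps(_mm_cmpeq_epi32(lanes, vidx));
  return _mm_blendv_ps(v, _mm_set1_ps(s), mask);      // select s at lane idx
}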

Differential Revision: https://reviews.llvm.org/D95866
2021-02-03 14:14:35 +00:00
Simon Pilgrim 8c2e075c2c [X86][SSE] LowerINSERT_VECTOR_ELT - pull out repeated EltSizeInBits calls. NFCI. 2021-02-02 13:45:18 +00:00
Simon Pilgrim d46a6b3d55 [X86][AVX512] Support variable-index vector insertion on AVX512 targets (PR47924)
With predicate masks, AVX512 can efficiently perform variable-index vector insertion with 2 broadcasts + 1 comparison, avoiding a lot of aliased memory traffic.

Differential Revision: https://reviews.llvm.org/D95779
2021-02-02 11:41:18 +00:00
Philip Reames 46e764a628 [x86] introduce no_callee_saved_registers attribute
This is directly analogous to the existing no_caller_saved_registers, but with the opposite intention.  A function or call so marked shifts the responsibility of spilling the usual CSRs to its caller.

An indirect call site and callee that do not agree on the attribute are ill-defined.

The motivation for this change is that being able to prune callee saves (without modifying other details of the calling convention) is sometimes useful when generating stubs and adapters.  There's no intention to expose this as a source language feature; this is expected to be used by frontends to implement adapters where warranted.

Some specific examples of use cases:
* GC-compatible compiled code wants to call an externally defined library function without needing to track pointer values through CSRs.
* debug-enabled code wants to call a precompiled library which doesn't provide enough information to track CSRs while preserving debug quality in the caller.
* an adapter stub entering hand-written assembly which doesn't follow normal calling conventions.
2021-02-01 16:19:14 -08:00
Philip Reames 9d09db941f [NFC][X86] Use CallBase interface to simplify code 2021-02-01 15:24:41 -08:00
Philip Reames bb6c23b1f5 [NFC][X86] Avoid redundant work inspecting callee 2021-02-01 15:24:41 -08:00
Simon Pilgrim e640b209b2 [X86][SSE] LowerScalarImmediateShift - use APInt::getLowBitsSet for vXi8 ISD::SRL mask generation. NFCI.
Match what we do for ISD::SHL
2021-02-01 18:17:40 +00:00
Simon Pilgrim 5211af4818 [X86][AVX] combineExtractWithShuffle - combine extracts from 256/512-bit vector shuffles.
We can only legally extract from the lowest 128-bit subvector, so extract the correct subvector to allow us to handle 256/512-bit vector element extracts.
2021-02-01 10:31:43 +00:00
Simon Pilgrim d6b68d1344 [X86][SSE] combineExtractWithShuffle - support zero-extending to allow extracting from narrow shuffle masks
If the shuffle mask can't be widened to match the original extracted element width, see if the upper bits are zeroable - which allows us to extract+zero-extend the smaller extraction.
2021-01-29 14:22:10 +00:00
Simon Pilgrim f84efe97bc [X86][AVX] combineHorizOpWithShuffle - fix valuetype comparison typo.
Ensure we check the valuetypes of all the HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) shuffle input ops - there was a copy+paste typo (noticed by MSVC analyzer) that meant we were checking the same input from one of the shuffles twice.

I haven't been able to create a test case for this yet - I don't think it's currently possible to create a target/faux binary shuffle that scales to a 2x128 shuffle mask from two different value types.
2021-01-28 16:36:23 +00:00
Simon Pilgrim 6663330bc8 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - don't merge VPERMILPD ops with different low/high masks.
Unlike VPERMILPS, VPERMILPD can have non-repeating masks in each 128-bit subvector; we weren't accounting for this when folding vperm2f128(vpermilpd(x,c),vpermilpd(y,c)) -> vpermilpd(vperm2f128(x,y),c).

I'm intending to add support for this but wanted to get a minimal fix in first for merging into 12.xx.

Fixes PR48908
2021-01-28 12:11:31 +00:00
Freddy Ye b3b0acdc6f [NFC] Refine some uninitialized used variables.
These warnings are reported by the static code analysis tool Klocwork.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D95421
2021-01-26 16:51:05 +08:00
Simon Pilgrim 13f2aee783 [X86][AVX] Generalize vperm2f128/vperm2i128 patterns to support all legal 256-bit vector types
Remove bitcasts to/from v4x64 types through vperm2f128/vperm2i128 ops to help improve shuffle combining and demanded vector elts folding.
2021-01-25 15:35:36 +00:00
Simon Pilgrim 821a51a9ca [X86][AVX] combineX86ShuffleChainWithExtract - widen to at least original root size. NFCI.
We're relying on the source inputs for shuffle combining having already been widened to the root size (otherwise the offset logic falls over) - we're going to be supporting different sized shuffle inputs soon, so we need to explicitly make the minimum widened width the original root size.
2021-01-25 13:45:37 +00:00
Simon Pilgrim 1b780cf32e [X86][AVX] LowerTRUNCATE - avoid bitcasts around extract_subvectors.
We allow extract_subvector lowering of all legal types, so pre-bitcast the source type to try and reduce bitcast pollution.
2021-01-25 12:10:36 +00:00
Simon Pilgrim f461e35cba [X86][AVX] combineX86ShuffleChain - avoid bitcasts around insert_subvector() shuffle patterns.
We allow insert_subvector lowering of all legal types, so don't always cast to the vXi64/vXf64 shuffle types - this is only necessary for X86ISD::SHUF128/X86ISD::VPERM2X128 patterns later.
2021-01-25 11:35:45 +00:00
Fangrui Song d745b82de1 [XRay] Support DW_TAG_call_site and delete unneeded PATCHABLE_EVENT_CALL/PATCHABLE_TYPED_EVENT_CALL lowering 2021-01-25 00:49:18 -08:00
Simon Pilgrim bd122f6d21 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle vperm2x128(movddup(x),movddup(y)) cases
Fold vperm2x128(movddup(x),movddup(y)) -> movddup(vperm2x128(x,y))
2021-01-22 16:05:19 +00:00
Simon Pilgrim c33d36e066 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle unary vperm2x128(permute/shift(x,c),undef) cases
Fold vperm2x128(permute/shift(x,c),undef) -> permute/shift(vperm2x128(x,undef),c)
2021-01-22 15:47:23 +00:00
Simon Pilgrim 4846f6ab81 [X86][AVX] combineTargetShuffle - simplify the X86ISD::VPERM2X128 subvector matching
Simplify vperm2x128(concat(X,Y),concat(Z,W)) folding.

Use collectConcatOps / ISD::INSERT_SUBVECTOR to find the source subvectors instead of hardcoded immediate matching.
2021-01-22 15:47:22 +00:00
Simon Pilgrim b1166e1317 [X86][AVX] combineX86ShufflesRecursively - attempt to constant fold before widening shuffle inputs
combineX86ShufflesConstants/canonicalizeShuffleMaskWithHorizOp can both handle/earlyout shuffles with inputs of different widths, so delay widening as late as possible to make it easier to match constant folds etc.

The plan is to eventually move the widening inside combineX86ShuffleChain so that we don't create any new nodes unless we successfully combine the shuffles.
2021-01-22 13:19:35 +00:00
Simon Pilgrim ffe72f987f [X86][SSE] Don't fold shuffle(binop(),binop()) -> binop(shuffle(),shuffle()) if the shuffles are splats
rGbe69e66b1cd8 added the fold, but DAGCombiner.visitVECTOR_SHUFFLE doesn't merge shuffles if the inner shuffle is a splat, so we need to bail.

The non-fast-horiz-ops paths see some minor regressions, we might be able to improve on this after lowering to target shuffles.

Fix PR48823
2021-01-22 11:31:38 +00:00
Simon Pilgrim 86021d98d3 [X86] Avoid a std::string copy by replacing auto with const auto&. NFC.
Fixes msvc analyzer warning.
2021-01-21 11:04:07 +00:00
Max Kazantsev d6bb96e677 [X86] Add experimental option to separately tune alignment of innermost loops
We already have an experimental option to tune loop alignment. Its impact
is very wide (and there is a suspicion that it's not always profitable). We want
something narrower to play with. This patch adds a similar option that
overrides the preferred alignment for innermost loops. This is for experimental
purposes; the default values do not change the existing behavior.

Differential Revision: https://reviews.llvm.org/D94895
Reviewed By: pengfei
2021-01-21 11:15:16 +07:00
Simon Pilgrim b8b5e87e6b [X86][AVX] Handle vperm2x128 shuffling of a subvector splat.
We already handle "vperm2x128 (ins ?, X, C1), (ins ?, X, C1), 0x31" for shuffling of the upper subvectors, but we weren't dealing with the case when we were splatting the upper subvector from a single source.
2021-01-20 18:16:33 +00:00
Simon Pilgrim 19d02842ee [X86][AVX] Fold extract_subvector(VSRLI/VSHLI(x,32)) -> VSRLI/VSHLI(extract_subvector(x),32)
As discussed on D56387, if we're shifting to extract the upper/lower half of a vXi64 vector then we're actually better off performing this at the subvector level as it's very likely to fold into something.

combineConcatVectorOps can perform this in reverse if necessary.
2021-01-20 14:34:54 +00:00
Simon Pilgrim 5626adcd6b [X86][SSE] combineVectorSignBitsTruncation - fold trunc(srl(x,c)) -> packss(sra(x,c))
If a srl doesn't introduce any sign bits into the truncated result, then replace with a sra to let us use a PACKSS truncation - fixes a regression noticed in D56387 on pre-SSE41 targets that don't have PACKUSDW.
2021-01-19 11:04:13 +00:00
Sanjay Patel d27bb5c375 [x86] add cast to avoid compile-time warning; NFC 2021-01-18 17:47:04 -05:00
Simon Pilgrim ce06475da9 [X86][AVX] IsElementEquivalent - add matchShuffleWithUNPCK + VBROADCAST/VBROADCAST_LOAD handling
Specify LHS/RHS operands in matchShuffleWithUNPCK's calls to isTargetShuffleEquivalent, and handle VBROADCAST/VBROADCAST_LOAD matching in IsElementEquivalent
2021-01-18 15:55:00 +00:00
Simon Pilgrim 770d1e0a88 [X86][SSE] isHorizontalBinOp - reuse any existing horizontal ops.
If we already have similar horizontal ops using the same args, then match that, even if we are on a target with slow horizontal ops.
2021-01-18 10:14:45 +00:00