Commit Graph

8184 Commits

Author SHA1 Message Date
Roman Lebedev 7964c3ed82
[X86] `detectAVGPattern()`: small preparatory NFC refactor 2021-10-12 19:51:43 +03:00
Freddy Ye d57a87ea89 [X86][ISel] Lowering llvm.thread.pointer
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D110681
2021-10-12 11:01:18 +08:00
Simon Pilgrim ad16c6e52f [X86][AVX] Ensure we retain zero elements in select(pshufb,pshufb) -> or(pshufb,pshufb) fold (PR52122)
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.

PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
2021-10-11 14:50:24 +01:00
Arthur Eubanks a0a4935182 Make more places that use alignment use uint64_t
Followup to D110451.
2021-10-08 16:35:19 -07:00
Wang, Pengfei c236883b6b [X86] Optimize fdiv with reciprocal instructions for half type
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110557
2021-10-08 09:41:13 +08:00
Craig Topper 58b68e70eb [X86] Don't use popcnt for parity if only bits 7:0 of the input can be non-zero.
Without popcnt we had a special case for using the parity flag from a single i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.

To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.

Fixes PR52093

Differential Revision: https://reviews.llvm.org/D111249
2021-10-06 14:39:57 -07:00
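
As a rough scalar illustration of the special case above (our own sketch, not the backend code): the parity of the low 8 bits is exactly what a single 8-bit TEST exposes via the x86 PF flag, so no POPCNT is needed when the upper bits are known zero.

```cpp
#include <cstdint>
#include <cstdio>

// Reference semantics for "parity of bits 7:0": returns 1 when an odd
// number of bits are set. The backend reads this (inverted) off PF after
// a single i8 test instruction.
static bool parity8(uint8_t x) {
  x ^= x >> 4;
  x ^= x >> 2;
  x ^= x >> 1;
  return x & 1;
}

int main() {
  for (unsigned v = 0; v < 256; ++v)
    if (parity8(uint8_t(v)) != bool(__builtin_popcount(v) & 1))
      return 1;
  std::puts("parity8 agrees with popcount&1 on all i8 values");
  return 0;
}
```
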
Wolfgang Pieb f53d05135e [UBSAN][PS4] For the PS4 target, emit the ud2 opcode for ubsan traps.
Reviewed By: probinson

Differential Revision: https://reviews.llvm.org/D111172
2021-10-06 11:49:36 -07:00
Nathan Sidwell c11e7b59d2 [X86][NFC] structure-return simplification
The X86 backend only needs to know whether structure return is via an
sret pointer.  This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.

Differential Revision: https://reviews.llvm.org/D109966
2021-10-06 03:12:48 -07:00
Simon Pilgrim bf30c48419 [X86] SimplifyDemandedVectorEltsForTargetNode - simplify PMADDWD for known zero elements
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.

Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
2021-10-04 14:36:45 +01:00
Jay Foad 566690b067 [APFloat] Remove BitWidth argument from getAllOnesValue
There's no need to pass this in explicitly because it is
trivially available from the semantics.
2021-10-04 11:32:16 +01:00
Jay Foad a9bceb2b05 [APInt] Stop using soft-deprecated constructors and methods in llvm. NFC.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.

Differential Revision: https://reviews.llvm.org/D110807
2021-10-04 08:57:44 +01:00
Simon Pilgrim 9452ec722c [X86][SSE] Fix typo + infinite-loop in HOP(HOP'(X,X),HOP'(Y,Y)) fold (PR52040)
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.

Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one-use checks.

Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
2021-10-02 15:31:12 +01:00
Simon Pilgrim bb42cc2090 [X86] decomposeMulByConstant - decompose legal vXi32 multiplies on SlowPMULLD targets and all vXi64 multiplies
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.

Unfortunately this isn't great for SSE41+ targets, which have PMULLD for vXi32 multiplies, but where it is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even the latest IceLake has really poor latencies for PMULLQ.

Differential Revision: https://reviews.llvm.org/D110588
2021-10-02 12:35:25 +01:00
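
For reference, this is the kind of shift+add/sub decomposition being enabled (a scalar sketch of our own; the actual transform operates on vector DAG nodes):

```cpp
#include <cstdint>

// Multiply-by-constant rewritten as shift plus add/sub, the form preferred
// on targets where the vector multiply itself is slow.
uint32_t mul5(uint32_t x) { return (x << 2) + x; } // x * 5
uint32_t mul7(uint32_t x) { return (x << 3) - x; } // x * 7
uint64_t mul9(uint64_t x) { return (x << 3) + x; } // x * 9
```
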
Martin Storsjö d6216e2cd1 [X86] Fix handling of i128<->fp on Windows
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.

This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.

Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).

X86CallingConv.td doesn't seem to handle i128 at all; otherwise
the Windows-specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.

This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.

Differential Revision: https://reviews.llvm.org/D110413
2021-09-29 13:05:59 +03:00
Liu, Chen3 57e8f840b6 [X86][FP16] Fix a bug when combining FADD(A, FMA(B, C, 0)) to FMA(B, C, A).
This bug was introduced by D109953. The operand order of the generated FMA
is wrong.

Differential Revision: https://reviews.llvm.org/D110606
2021-09-28 11:38:53 +08:00
Simon Pilgrim 468ff703e1 [X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974)
The intention of this code turns out to be superfluous, as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.

Fixes PR51974
2021-09-27 10:41:22 +01:00
Freddy Ye 902ec6142a [X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16)
When AVX512FP16 is enabled, FROUND(f16) cannot be handled by type
legalization, and no libcall in libm is available for fround(f16) yet.
FROUNDEVEN(f16) has a corresponding instruction in AVX512FP16.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110312
2021-09-27 13:35:03 +08:00
Simon Pilgrim c0eff50fc5 [X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
2021-09-26 19:37:23 +01:00
Simon Pilgrim ed3e4917b3 [X86] Fold PACK(*_EXTEND_VECTOR_INREG, UNDEF) -> *_EXTEND_VECTOR_INREG
For 128-bit vectors, we can remove a PACK of a EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.
2021-09-26 19:37:22 +01:00
Simon Pilgrim 3fe9767204 [X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W))
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. it's zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.

There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
2021-09-26 18:08:29 +01:00
Simon Pilgrim eb7c78c2c5 [X86][SSE] combineMulToPMADDWD - mask off upper bits of sign-extended vXi32 constants
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
2021-09-25 15:50:45 +01:00
Simon Pilgrim 2a4fa0c27c [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on sub-128 bit vectors 2021-09-25 15:50:45 +01:00
Simon Pilgrim f5a26ccae2 [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on pre-SSE41 targets
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
2021-09-25 14:35:31 +01:00
Simon Pilgrim a25f25c3b7 [X86] combineShiftToPMULH - relax from ISA from SSE41 to SSE2
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.

We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
2021-09-25 14:08:03 +01:00
Simon Pilgrim d8fc9f8727 [X86][SSE] combineMulToPMADDWD - replace sext(v8i16) -> zext(v8i16)
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
2021-09-24 16:42:01 +01:00
Sanjay Patel 09e71c367a [x86] convert logic-of-FP-compares to FP logic-of-vector-compares
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.

By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.

Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.

Differential Revision: https://reviews.llvm.org/D110342
2021-09-24 11:38:19 -04:00
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch supports transforming something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list which sets hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Amara Emerson 4ceea77409 [X86] Rename the X86WinAllocaExpander pass and related symbols to "DynAlloca". NFC.
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.

Differential Revision: https://reviews.llvm.org/D109993
2021-09-20 16:19:28 -07:00
Simon Pilgrim 0e89ff8195 [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one use.
Helps with the regression noted on D109065 - don't truncate a broadcast source if the source has multiple uses.
2021-09-19 22:53:30 +01:00
Simon Pilgrim b7342e3137 [X86] Fold SHUFPS(shuffle(x),shuffle(y),mask) -> SHUFPS(x,y,mask')
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.

Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
2021-09-19 20:39:19 +01:00
Roman Lebedev 5f2fe48d06
[X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.

It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be present in several elements,
not just the 0'th one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108411
2021-09-19 17:38:32 +03:00
Roman Lebedev 07f1d8f0ca
[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108382
2021-09-19 17:35:37 +03:00
Roman Lebedev 0852313e47
[NFC] combineX86ShufflesRecursively(): actually address nits for previous patch 2021-09-19 17:29:18 +03:00
Roman Lebedev 1e72ca94e5
[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() after finishing recursing
This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
and it avoids regressions in that patch.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109065
2021-09-19 17:24:58 +03:00
Roman Lebedev 6a2c2263fb
[X86] Improve i8 all-ones element insertion in pre-SSE4.1
Should avoid some regressions in D109065

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109989
2021-09-18 22:24:06 +03:00
Roman Lebedev 358df06f4e
[X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away),
or the element we are picking overwhelms whatever we are not picking (e.g. it's all-ones):
https://alive2.llvm.org/ce/z/RKejao

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109726
2021-09-17 19:13:33 +03:00
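
A per-lane scalar sketch of why that substitution is sound (illustrative only; not the matcher code):

```cpp
#include <cstdint>

// If the lane we are not picking is known zero, OR returns the picked lane
// unchanged, so BLEND(x, y) == OR(x, y) for that lane.
uint8_t pick_x_when_y_zero(uint8_t x) {
  const uint8_t y = 0; // lane known zero / masked away
  return x | y;        // == x, the blend result
}

// Likewise, an all-ones picked lane overwhelms whatever the other lane holds.
uint8_t pick_allones(uint8_t y) {
  const uint8_t x = 0xFF; // picked lane known all-ones
  return x | y;           // == 0xFF, the blend result
}
```
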
Simon Pilgrim 1ef62cb200 [X86] SimplifyDemandedVectorEltsForTargetNode - add PSADBW handling
Peek through PSADBW operands to handle non-demanded elements.
2021-09-16 11:28:31 +01:00
Simon Pilgrim dcba994184 [X86] combineX86ShuffleChain - ensure we only peek through bitcasts to vectors (PR51858)
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
2021-09-15 10:21:05 +01:00
Craig Topper 9af8f1b18e [SelectionDAG] Add isZero/isAllOnes methods to ConstantSDNode.
Soft deprecate isNullValue/isAllOnesValue and update in-tree
callers. This matches the changes to the APInt interface from
D109483.

Reviewed By: lattner

Differential Revision: https://reviews.llvm.org/D109535
2021-09-09 13:28:30 -07:00
Chris Lattner d51da74889 [CodeGen] Use DAG.getAllOnesConstant where possible to simplify code. NFC. 2021-09-09 10:22:51 -07:00
Craig Topper 124bcc1a13 [X86] Disable muloti4 libcalls for x86-64.
This library function only exists in compiler-rt, not libgcc. So
this would fail to link unless we were linking with compiler-rt.

This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.

I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D109385
2021-09-09 10:03:15 -07:00
Chris Lattner 735f46715d [APInt] Normalize naming on keep constructors / predicate methods.
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`.  This achieves two things:

1) This starts standardizing predicates across the LLVM codebase,
   following (in this case) ConstantInt.  The word "Value" doesn't
   convey anything of merit, and is missing in some of the other things.

2) Calling an integer "null" doesn't make any sense.  The original sin
   here is mine and I've regretted it for years.  This moves us to calling
   it "zero" instead, which is correct!

APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go.  As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.

Included in this patch are changes to a bunch of the codebase, but there are
more.  We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.

Differential Revision: https://reviews.llvm.org/D109483
2021-09-09 09:50:24 -07:00
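
A small usage sketch of the renamed APInt entry points (assuming the LLVM headers are available; the old names remain but are soft-deprecated):

```cpp
#include "llvm/ADT/APInt.h"

void demo() {
  llvm::APInt Zero = llvm::APInt::getZero(32);    // was getNullValue(32)
  llvm::APInt Ones = llvm::APInt::getAllOnes(32); // was getAllOnesValue(32)
  bool AllSet = Ones.isAllOnes();                 // was isAllOnesValue()
  bool IsZero = Zero.isZero();                    // was isNullValue()
  (void)AllSet;
  (void)IsZero;
}
```
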
Akira Hatanaka dea6f71af0 [ObjC][ARC] Use the addresses of the ARC runtime functions instead of
integer 0/1 for the operand of bundle "clang.arc.attachedcall"

https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.

This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.

Differential Revision: https://reviews.llvm.org/D103000
2021-09-08 11:58:03 -07:00
Nick Desaulniers d0eeb64be5 [X86ISelLowering] avoid emitting libcalls to __mulodi4()
Similar to D108842, D108844, and D108926.

__has_builtin(__builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using __builtin_mul_overflow with these types for these targets.

If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.

This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.

Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438

Reviewed By: lebedev.ri, RKSimon

Differential Revision: https://reviews.llvm.org/D108928
2021-09-07 10:44:54 -07:00
Simon Pilgrim d66d520fe1 [X86][SSE] combineMulToPMADDWD - improve recognition of sign/zero extended upper bits
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }

Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair); this can be relaxed to require only one zero-extended input if the other input has at least 17 sign bits.

That way the sign of the result is still preserved, and the second half is still zero.

Noticed while investigating PR47437.

Differential Revision: https://reviews.llvm.org/D108522
2021-09-02 17:36:22 +01:00
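
A scalar reference model of the PMADDWD semantics quoted above (a sketch of ours, not backend code):

```cpp
#include <array>
#include <cstdint>

// PMADDWD: multiply adjacent i16 pairs, widen to i32, add horizontally.
std::array<int32_t, 4> pmaddwd(const std::array<int16_t, 8> &x,
                               const std::array<int16_t, 8> &y) {
  std::array<int32_t, 4> r{};
  for (int i = 0; i < 4; ++i) {
    int32_t lo = int32_t(x[2 * i]) * int32_t(y[2 * i]);
    int32_t hi = int32_t(x[2 * i + 1]) * int32_t(y[2 * i + 1]);
    // Sum in unsigned so the single possible overflow case (all four
    // inputs -32768) wraps to 0x80000000 as the instruction does.
    r[i] = int32_t(uint32_t(lo) + uint32_t(hi));
  }
  return r;
}
```
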
Roman Lebedev 3f1f08f0ed
Revert @llvm.isnan intrinsic patchset.
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)

TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.

While the end result of discussion may lead back to the current design,
it may also not lead to the current design.

Therefore I take it upon myself
to revert the tree back to the last known good state.

This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
2021-09-02 13:53:56 +03:00
Simon Pilgrim b0acd6c369 [X86] Fold PMADD(x,0) or PMADD(0,x) -> 0
Pulled out of D108522 - handle zero-operand cases for PMADDWD/VPMADDUBSW ops
2021-09-02 10:48:50 +01:00
Wang, Pengfei 74043caef2 [X86] Enable half type support in inline assembly constraints
Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105799
2021-09-01 09:29:31 +08:00
Wang, Pengfei ab40dbfe03 [X86] AVX512FP16 instructions enabling 6/6
Enable FP16 complex FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105269
2021-08-30 13:08:45 +08:00
Roman Lebedev 6734018041
[Codegen][X86] EltsFromConsecutiveLoads(): if only have AVX1, ensure that the "load" is actually foldable (PR51615)
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390

In this case, we completely fail to check that the "broadcast-from-mem" we create
can actually fold the load. Here, its operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
  t7: i64 = extract_vector_elt t5, Constant:i64<0>
    t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t1: i64 = Register %0
      t4: i64 = undef
    t3: i64 = Constant<0>
Combining: t15: v8i32 = undef

```

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108821
2021-08-27 20:26:53 +03:00
Serge Pavlov cdbe569fb6 [X86] Implement llvm.isnan(x86_fp80) as unordered comparison
x86_fp80 format allows values that do not fit any of IEEE-754 category.
Previously they were recognized by intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using the FXAM instruction, which
distinguishes between NaNs and unsupported values. It can make some
programs behave differently.

As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __builtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.

Differential Revision: https://reviews.llvm.org/D108037
2021-08-27 18:06:07 +07:00
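
The unordered-comparison lowering mentioned above has a simple scalar equivalent (an illustrative sketch, not the lowering code): only a NaN compares unordered with itself.

```cpp
#include <cassert>
#include <cmath>

// x != x is an unordered compare: true only when x is NaN.
// Note this is exactly what -ffast-math folds away, which is part of
// why the dedicated intrinsic exists.
bool isnan_via_unordered(long double x) { return x != x; }

int main() {
  assert(isnan_via_unordered(std::nanl("")));
  assert(!isnan_via_unordered(1.0L));
  return 0;
}
```
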
Nathan Sidwell 199ac3a839 [NFC][X86] Sret return register cleanup
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.

Differential Revision: https://reviews.llvm.org/D108762
2021-08-27 04:03:49 -07:00
Roman Lebedev a8125bf4a8
[X86][Codegen] PR51615: don't replace wide volatile load with narrow broadcast-from-memory
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.

We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D108757
2021-08-26 18:46:49 +03:00
Nathan Sidwell ab55cc6cef [X86] pr51000 in-register struct return tailcalling
In-register structure returns are not special, and are handled by lowering
to multiple-value tuples.  We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.

Differential Revision: https://reviews.llvm.org/D105807
2021-08-25 10:15:50 -07:00
Simon Pilgrim 307890f85b [X86] Freeze vXi8 shl(x,1) -> add(x,x) vector fold (PR50468)
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.

Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.

Differential Revision: https://reviews.llvm.org/D108139
2021-08-24 16:08:24 +01:00
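
In scalar terms the fold is just the identity below; the freeze part only matters at the IR level, where an undef input must not be observed as two different values by the two add operands (hedged sketch):

```cpp
#include <cstdint>

// There is no vXi8 shift instruction, so x << 1 is emitted as x + x.
uint8_t shl1(uint8_t x) { return uint8_t(x + x); } // == uint8_t(x << 1)
```
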
Liu, Chen3 b7795eb646 [X86] Building a constant vector whose element type is half causes an assertion failure.
Fix an assertion failure when building a constant vector whose element type is half.

Differential Revision: https://reviews.llvm.org/D108612
2021-08-24 14:34:30 +08:00
Wang, Pengfei c728bd5bba [X86] AVX512FP16 instructions enabling 5/6
Enable FP16 FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105268
2021-08-24 09:07:19 +08:00
Simon Pilgrim 805fb1f6c1 [X86] combineMul - move MUL_IMM comment inside function. NFC.
combineMul is now used for other things as well as the mul-with-constant expansion - move the comment to where it's actually relevant.
2021-08-22 18:27:03 +01:00
Simon Pilgrim 352df10a23 [X86][AVX] matchShuffleAsBlend - use isElementEquivalent to help match broadcast/repeated elements
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.

This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
2021-08-22 15:26:17 +01:00
Simon Pilgrim 96fb3eef66 Fix signed/unsigned comparison warning. NFCI. 2021-08-22 15:02:19 +01:00
Simon Pilgrim a1c892b439 [X86][SSE] lowerVECTOR_SHUFFLE - canonicalize with horizontal ops.
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
2021-08-22 14:54:48 +01:00
Wang, Pengfei b088536ce9 [X86] AVX512FP16 instructions enabling 4/6
Enable FP16 unary operator instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105267
2021-08-22 08:59:35 +08:00
David Green d10f23a25d [ISel] Expand saddsat and ssubsat via asr and xor
This changes the lowering of saddsat and ssubsat so that instead of
using:
  r,o = saddo x, y
  c = setcc r < 0
  s = c ? INTMAX : INTMIN
  ret o ? s : r
it uses asr and xor to materialize the INTMAX/INTMIN constants:
  r,o = saddo x, y
  s = ashr r, BW-1
  x = xor s, INTMIN
  ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD

This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.

Differential Revision: https://reviews.llvm.org/D105853
2021-08-19 16:08:07 +01:00
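
A scalar sketch of the new lowering shape (assuming 32 bits; our naming). A positive overflow wraps the sum negative, so the arithmetic shift yields all-ones and the xor produces INT_MAX; a negative overflow wraps positive, yielding zero and then INT_MIN:

```cpp
#include <cstdint>
#include <limits>

int32_t saddsat(int32_t a, int32_t b) {
  int32_t r;
  bool o = __builtin_add_overflow(a, b, &r);            // r,o = saddo x, y
  int32_t s = r >> 31;                                  // s = ashr r, BW-1
  int32_t x = s ^ std::numeric_limits<int32_t>::min();  // x = xor s, INTMIN
  return o ? x : r;                                     // ret o ? x : r
}
```
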
Arthur Eubanks 46cf82532c [NFC] Replace Function handling of attributes with less confusing calls
To avoid magic constants and confusing indexes.
2021-08-17 21:05:40 -07:00
Wang, Pengfei 2379949aad [X86] AVX512FP16 instructions enabling 3/6
Enable FP16 conversion instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105265
2021-08-18 09:03:41 +08:00
Simon Pilgrim 359cfa2af7 [X86] EmitInstrWithCustomInserter - silence uninitialized variable warnings. NFCI.
Consistently use llvm_unreachable for the default case in the inner switch statements.
2021-08-17 22:01:07 +01:00
Roman Lebedev 2078c4ecfd
[X86] Lower insertions into upper half of an 256-bit vector as broadcast+blend (PR50971)
Broadcast is not worse than extract+insert of subvector.
https://godbolt.org/z/aPq98G6Yh

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D105390
2021-08-17 18:45:10 +03:00
Wang, Pengfei f1de9d6dae [X86] AVX512FP16 instructions enabling 2/6
Enable FP16 binary operator instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105264
2021-08-15 08:56:33 +08:00
Arthur Eubanks d7593ebaee [NFC] Clean up users of AttributeList::hasAttribute()
AttributeList::hasAttribute() is confusing, use clearer methods like
hasParamAttr()/hasRetAttr().

Add hasRetAttr() since it was missing from AttributeList.
2021-08-13 11:59:18 -07:00
Arthur Eubanks 92ce6db9ee [NFC] Rename AttributeList::hasFnAttribute() -> hasFnAttr()
This is more consistent with similar methods.
2021-08-13 11:09:18 -07:00
Wang, Pengfei 6f7f5b54c8 [X86] AVX512FP16 instructions enabling 1/6
1. Enable FP16 type support and basic declarations used by following patches.
2. Enable new instructions VMOVW and VMOVSH.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105263
2021-08-10 12:46:01 +08:00
Craig Topper 24dfba8d50 [X86] Teach shouldSinkOperands to recognize pmuldq/pmuludq patterns.
The IR for pmuldq/pmuludq intrinsics uses a sext_inreg/zext_inreg
pattern on the inputs. Ideally we pattern match these away during
isel. It is possible for LICM or other middle end optimizations
to separate the extend from the mul. This prevents SelectionDAG
from removing it, or, depending on how the extend is lowered, we
may not be able to generate an AssertSExt/AssertZExt in the
mul basic block. This will prevent pmuldq/pmuludq from being
formed at all.

This patch teaches shouldSinkOperands to recognize this so
that CodeGenPrepare will clone the extend into the same basic
block as the mul.

Fixes PR51371.

Differential Revision: https://reviews.llvm.org/D107689
2021-08-07 08:45:56 -07:00
Amara Emerson 2b067e3335 Change TargetLowering::canMergeStoresTo() to take a MF instead of DAG.
DAG is unnecessary and we need this hook to implement store merging on GlobalISel too.
2021-08-06 12:57:53 -07:00
Craig Topper b2ca4dc935 [LegalizeTypes] Add a simple expansion for SMULO when a libcall isn't available.
This isn't optimal, but prevents crashing when the libcall isn't
available. It just calculates the full product and makes sure the high bits
match the sign of the low half. Each of the pieces should go through their own
type legalization.

This can make D107420 unnecessary.

Needs tests, but I wanted to start discussion about D107420.

Reviewed By: FreddyYe

Differential Revision: https://reviews.llvm.org/D107581
2021-08-06 09:43:01 -07:00
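
A scalar model of that expansion (a sketch under the assumption of 32-bit operands): compute the double-width product and flag overflow when the high half is not the sign-extension of the low half.

```cpp
#include <cstdint>

bool smulo32(int32_t a, int32_t b, int32_t &low) {
  int64_t full = int64_t(a) * int64_t(b); // full 64-bit product
  low = int32_t(full);                    // truncated result
  // Overflow iff the high half differs from the sign of the low half.
  return (full >> 32) != int64_t(low >> 31);
}
```
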
Simon Pilgrim 18e6a03b1a [X86][AVX] Extract SUBV_BROADCAST constant bits from just the lower subvector range (PR51281)
As reported on PR51281, an internal fuzz test encountered an issue when extracting constant bits from a SUBV_BROADCAST node from a constant pool source larger than the broadcasted subvector width.

getTargetConstantBitsFromNode assumed that the Constant would be the same size as the subvector, resulting in incorrect packing of the per-element bit data.

This patch attempts to solve this by using the SUBV_BROADCAST node to determine the subvector width, and then ensuring we extract only the lowest bits from the Constant of that subvector bitsize.

Differential Revision: https://reviews.llvm.org/D107158
2021-08-06 11:21:31 +01:00
Serge Pavlov 4c4093e6e3 Introduce intrinsic llvm.isnan
This is a recommit of the patch 16ff91ebcc,
reverted in 0c28a7c990 because it had
an error in a call of getFastMathFlags (the base type should be
FPMathOperator, not Instruction). The original commit message is duplicated below:

    Clang has builtin function '__builtin_isnan', which implements C
    library function 'isnan'. This function now is implemented entirely in
    clang codegen, which expands the function into set of IR operations.
    There are three mechanisms by which the expansion can be made.

    * The most common mechanism is using an unordered comparison made by
      instruction 'fcmp uno'. This simple solution is target-independent
      and works well in most cases. It however is not suitable if floating
      point exceptions are tracked. Corresponding IEEE 754 operation and C
      function must never raise FP exception, even if the argument is a
      signaling NaN. Compare instructions usually do not have such a
      property; they raise an 'invalid' exception in such cases. So this
      mechanism is unsuitable when exception behavior is strict. In
      particular it could result in unexpected trapping if the argument is SNaN.

    * Another solution was implemented in https://reviews.llvm.org/D95948.
      It is used in the cases when raising FP exceptions by 'isnan' is not
      allowed. This solution implements 'isnan' using integer operations.
      It solves the problem of exceptions, but offers one solution for all
      targets, however some can do the check in more efficient way.

    * Solution implemented by https://reviews.llvm.org/D96568 introduced a
      hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
      specific code into IR. Now only SystemZ implements this hook and it
      generates a call to target specific intrinsic function.

    Although these mechanisms allow implementing 'isnan' with enough
    efficiency, expanding 'isnan' in clang has drawbacks:

    * The operation 'isnan' is hidden behind generic integer operations or
      target-specific intrinsics. It complicates analysis and can prevent
      some optimizations.

    * IR can be created by tools other than clang, in this case treatment
      of 'isnan' has to be duplicated in that tool.

    Another issue with the current implementation of 'isnan' comes from the
    use of options '-ffast-math' or '-fno-honor-nans'. If such option is
    specified, 'fcmp uno' may be optimized to 'false'. It is valid
    optimization in general, but it results in 'isnan' always returning
    'false'. For example, in some libc++ implementations the following code
    returns 'false':

        std::isnan(std::numeric_limits<float>::quiet_NaN())

    The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
    operands are never NaNs. This assumption however should not be applied
    to the functions that check FP number properties, including 'isnan'. If
    such function returns expected result instead of actually making
    checks, it becomes useless in many cases. The option '-ffast-math' is
    often used for performance critical code, as it can speed up execution
    by the expense of manual treatment of corner cases. If 'isnan' returns
    assumed result, a user cannot use it in the manual treatment of NaNs
    and has to invent replacements, like making the check using integer
    operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
    which also expresses the opinion, that limitations imposed by
    '-ffast-math' should be applied only to 'math' functions but not to
    'tests'.

    To overcome these drawbacks, this change introduces a new IR intrinsic
    function 'llvm.isnan', which realizes the check as specified by IEEE-754
    and C standards in target-agnostic way. During IR transformations it
    does not undergo undesirable optimizations. It reaches instruction
    selection, where it is lowered in a target-dependent way. The lowering can
    vary depending on options like '-ffast-math' or '-ffp-model' so the
    resulting code satisfies requested semantics.

    Differential Revision: https://reviews.llvm.org/D104854
2021-08-06 14:32:27 +07:00
Roman Lebedev c0586ff05d
[NFC][X86] combineX86ShuffleChain(): hoist Mask variable higher up
Having `NewMask` outside of an if and rebinding `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that earlier had no access to it but now does
to use `Mask` instead of `BaseMask`.

This has no other intentional changes.

This is a recommit of 35c0848b57,
that was reverted to simplify reversion of an earlier change.
2021-08-05 20:37:51 +03:00
Benjamin Kramer bd17ced1db Revert "[X86] combineX86ShuffleChain(): canonicalize mask elts picking from splats"
This reverts commits f819e4c7d0 and
35c0848b57. It triggers an infinite loop during
compilation.

$ cat t.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @MaxPoolGradGrad_1.65() local_unnamed_addr #0 {
entry:
  %wide.vec78 = load <64 x i32>, <64 x i32>* null, align 16
  %strided.vec83 = shufflevector <64 x i32> %wide.vec78, <64 x i32> poison, <8 x i32> <i32 4, i32 12, i32 20, i32 28, i32 36, i32 44, i32 52, i32 60>
  %0 = lshr <8 x i32> %strided.vec83, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
  %1 = add <8 x i32> zeroinitializer, %0
  %2 = shufflevector <8 x i32> %1, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %3 = shufflevector <16 x i32> %2, <16 x i32> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
  %interleaved.vec = shufflevector <32 x i32> undef, <32 x i32> %3, <64 x i32> <i32 0, i32 8, i32 16, i32 24, i32 32, i32 40, i32 48, i32 56, i32 1, i32 9, i32 17, i32 25, i32 33, i32 41, i32 49, i32 57, i32 2, i32 10, i32 18, i32 26, i32 34, i32 42, i32 50, i32 58, i32 3, i32 11, i32 19, i32 27, i32 35, i32 43, i32 51, i32 59, i32 4, i32 12, i32 20, i32 28, i32 36, i32 44, i32 52, i32 60, i32 5, i32 13, i32 21, i32 29, i32 37, i32 45, i32 53, i32 61, i32 6, i32 14, i32 22, i32 30, i32 38, i32 46, i32 54, i32 62, i32 7, i32 15, i32 23, i32 31, i32 39, i32 47, i32 55, i32 63>
  store <64 x i32> %interleaved.vec, <64 x i32>* undef, align 16
  unreachable
}

$ llc < t.ll -mcpu=skylake
<hang>
2021-08-05 18:58:08 +02:00
Simon Pilgrim 2cbf9fd402 [DAG] DAGCombiner::visitVECTOR_SHUFFLE - recognise INSERT_SUBVECTOR patterns
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.

This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.

This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.

Fixes PR50053

Differential Revision: https://reviews.llvm.org/D107068
2021-08-05 15:40:48 +01:00
Fangrui Song 9c19b36f1c [X86] Remove -x86-experimental-pref-loop-alignment in favor of -align-loops 2021-08-04 13:23:57 -07:00
Roman Lebedev 35c0848b57
[NFC][X86] combineX86ShuffleChain(): hoist Mask variable higher up
Having `NewMask` outside of an if and rebinding `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that earlier had no access to it but now does
to use `Mask` instead of `BaseMask`.

This has no other intentional changes.
2021-08-04 17:15:12 +03:00
Roman Lebedev 916cdc3d4b
[NFC][X86] combineX86ShuffleChain(): rename inner Mask to avoid future shadowing
I want to hoist `Mask` variable higher up,
but then it would clash with this one.
So let's rename this one first.

There are no other intentional changes here other than said rename.
2021-08-04 17:15:12 +03:00
Roman Lebedev f819e4c7d0
[X86] combineX86ShuffleChain(): canonicalize mask elts picking from splats
Given a shuffle mask, if it is picking from an input that is splat
given the current granularity of the shuffle, then adjust the mask
to pick from the same lane of the input as the mask element is in.
This may result in a shuffle being simplified into a blend.

I believe this is correct given that the splat detection matches the one
just above the new code.

My basic thought is that we might be able to get fewer regressions
by handling multiple insertions of the same value into a vector
if we form broadcasts+blend here, as opposed to D105390,
but I have not really thought this through,
and did not try implementing it yet.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D107009
2021-08-04 16:55:04 +03:00
Serge Pavlov 0c28a7c990 Revert "Introduce intrinsic llvm.isnan"
This reverts commit 16ff91ebcc.
Several errors were reported, mainly in test-suite execution time. Reverted
for investigation.
2021-08-04 17:18:15 +07:00
Serge Pavlov 16ff91ebcc Introduce intrinsic llvm.isnan
Clang has builtin function '__builtin_isnan', which implements C
library function 'isnan'. This function now is implemented entirely in
clang codegen, which expands the function into set of IR operations.
There are three mechanisms by which the expansion can be made.

* The most common mechanism is using an unordered comparison made by
  instruction 'fcmp uno'. This simple solution is target-independent
  and works well in most cases. It however is not suitable if floating
  point exceptions are tracked. Corresponding IEEE 754 operation and C
  function must never raise FP exception, even if the argument is a
  signaling NaN. Compare instructions usually do not have such a
  property; they raise an 'invalid' exception in such cases. So this
  mechanism is unsuitable when exception behavior is strict. In
  particular it could result in unexpected trapping if the argument is SNaN.

* Another solution was implemented in https://reviews.llvm.org/D95948.
  It is used in the cases when raising FP exceptions by 'isnan' is not
  allowed. This solution implements 'isnan' using integer operations.
  It solves the problem of exceptions, but offers one solution for all
  targets, however some can do the check in more efficient way.

* Solution implemented by https://reviews.llvm.org/D96568 introduced a
  hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
  specific code into IR. Now only SystemZ implements this hook and it
  generates a call to target specific intrinsic function.

Although these mechanisms allow implementing 'isnan' with enough
efficiency, expanding 'isnan' in clang has drawbacks:

* The operation 'isnan' is hidden behind generic integer operations or
  target-specific intrinsics. It complicates analysis and can prevent
  some optimizations.

* IR can be created by tools other than clang, in this case treatment
  of 'isnan' has to be duplicated in that tool.

Another issue with the current implementation of 'isnan' comes from the
use of options '-ffast-math' or '-fno-honor-nans'. If such option is
specified, 'fcmp uno' may be optimized to 'false'. It is valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':

    std::isnan(std::numeric_limits<float>::quiet_NaN())

The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption however should not be applied
to the functions that check FP number properties, including 'isnan'. If
such function returns expected result instead of actually making
checks, it becomes useless in many cases. The option '-ffast-math' is
often used for performance critical code, as it can speed up execution
by the expense of manual treatment of corner cases. If 'isnan' returns
assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion, that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.

To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by IEEE-754
and C standards in target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where it is lowered in a target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so the
resulting code satisfies requested semantics.

Differential Revision: https://reviews.llvm.org/D104854
2021-08-04 15:27:49 +07:00
Sanjay Patel 4c41caa287 [x86] improve CMOV codegen by pushing add into operands, part 3
In this episode, we are trying to avoid an x86 micro-arch quirk where complex
(3 operand) LEA potentially costs significantly more than simple LEA. So we
simultaneously push and pull the math around the CMOV to balance the operations.

I looked at the debug spew during instruction selection and decided against
trying a later DAGToDAG transform -- it seems very difficult to match if the
trailing memops are already selected and managing the creation of extra
instructions at that level is always tricky.

Differential Revision: https://reviews.llvm.org/D106918
2021-07-28 09:10:33 -04:00
Xiang1 Zhang 3223d41017 [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT
Differential Revision: https://reviews.llvm.org/D106780
2021-07-28 08:16:59 +08:00
Xiang1 Zhang 2ca3937131 Revert "[X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT"
This reverts commit 6ff73efea9.
2021-07-28 08:12:29 +08:00
Xiang1 Zhang 6ff73efea9 [X86] Fix lowering to illegal type in LowerINSERT_VECTOR_ELT 2021-07-28 08:08:30 +08:00
Sanjay Patel 156ba620b3 [x86] update stale code comment; NFC
The transform was generalized with:
1ce05ad619
2021-07-27 16:45:52 -04:00
Tres Popp d225de60c9 Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD

This is not a perfect revert. The new function is left as other uses of
it exist now.
2021-07-27 16:55:50 +02:00
Tres Popp 70fa9479b2 Revert "Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.""
This reverts commit d7bbb1230a.

There were follow-up uses of a deleted method and I didn't run the
tests. Undo the revert, so I can do it properly.
2021-07-27 16:48:31 +02:00
Tres Popp d7bbb1230a Revert "[X86][AVX] Add getBROADCAST_LOAD helper function. NFCI."
This reverts commit 1cfecf4fc4.

This commit broke LLVM code generated through XLA by removing a
conditional on Ld->getExtensionType() == ISD::NON_EXTLOAD
2021-07-27 16:22:25 +02:00
Simon Pilgrim c8472db0a8 [X86][AVX] Prefer vinsertf128 to vperm2f128 on AVX1 targets
Splatting the lower xmm with vinsertf128 is at least as quick as vperm2f128, and a lot faster on some AMD targets.

First step towards PR50053
2021-07-26 11:11:56 +01:00
Simon Pilgrim 1cfecf4fc4 [X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.
Begin replacing individual getMemIntrinsicNode calls and setup (for X86ISD::VBROADCAST_LOAD + X86ISD::SUBV_BROADCAST_LOAD opcodes) with this getBROADCAST_LOAD helper.
2021-07-25 20:37:58 +01:00
Simon Pilgrim b95f66ad78 [X86][SSE] LowerRotate - perform modulo on the amount splat source directly.
If the rotation amount is a known splat, perform the modulo on the splat source, and then perform the splat. That way the amount-extension performed later by LowerScalarVariableShift can fold the splats away without any multiple-use issues.

Fixes one of the concerns raised on D104156
2021-07-25 17:30:32 +01:00
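
In scalar form, the point is that the mask-by-bitwidth modulo can be applied to the amount before it is splatted (hedged sketch, our naming):

```cpp
#include <cstdint>

uint32_t rotl32(uint32_t x, uint32_t amt) {
  amt &= 31; // modulo on the (scalar) splat source, before any splatting
  return (x << amt) | (x >> ((32 - amt) & 31)); // UB-free rotate
}
```
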
Sanjay Patel 1ce05ad619 [x86] improve CMOV codegen by pushing add into operands, part 2
This is a minimal extension of D106607 to allow folding for
2 non-zero constants that can be materialized as immediates.

In the reduced test examples, we save 1 instruction by rolling
the constants into LEA/ADD. In the motivating test from the bullet
benchmark, we absorb both of the constant moves into add ops via
LEA magic, so we reduce by 2 instructions.

Differential Revision: https://reviews.llvm.org/D106684
2021-07-25 10:05:41 -04:00
Simon Pilgrim 15b883f457 [X86][AVX] Adjust AllowBWIVPERMV3 tolerance to account for VariableCrossLaneShuffleDepth
As noticed on D105390 - we were hardwiring the depth limit for combining to VPERMI2W/VPERMI2B instructions. Not only had we made the limit too low, we hadn't accounted for slow/fast shuffles via the VariableCrossLaneShuffleDepth control
2021-07-25 14:05:11 +01:00
Sanjay Patel f060aa1cf3 [x86] improve CMOV codegen by pushing add into operands
This is not the transform direction we want in general,
but by the time we have a CMOV, we've already tried
everything else that could be better.
The transform increases the uses of the other add operand,
but that is safe according to Alive2:
https://alive2.llvm.org/ce/z/Yn6p-A

We could probably extend this to other binops (not just add).
This is the motivating pattern discussed in:
https://llvm.org/PR51069

The test with i8 shows a missed fold because there's a trunc
sitting in front of the add. That can be handled with a small
follow-up.

Differential Revision: https://reviews.llvm.org/D106607
2021-07-23 09:39:32 -04:00
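
The shape of the transform, in scalar terms (our own toy example, not from the patch):

```cpp
// Before: select first, then add - the add stays live past the cmov.
int before(bool c, int x) { return x + (c ? 3 : 5); }

// After: push the add into both arms; each constant folds into a LEA/ADD.
int after(bool c, int x) { return c ? (x + 3) : (x + 5); }
```
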
Simon Pilgrim 71d0fd3564 [X86][AVX] lowerV2X128Shuffle - attempt to recognise broadcastf128 subvector load
As noticed on PR50053 we were failing to recognise when a shuffle of a load was really a subvector broadcast load
2021-07-23 13:10:38 +01:00
Simon Pilgrim 142e60f40b [X86] Fix case of IsAfterLegalize argument. NFC.
Pulled out of D106280
2021-07-19 17:15:28 +01:00
Eli Friedman 6601be4419 [X86] Remove incorrect use of known bits in shuffle simplification.
This reverts commit 2a419a0b99.

The result of a shufflevector must not propagate poison from any element
other than the one noted in the shuffle mask.

The regressions outside of fptoui-may-overflow.ll can probably be
recovered some other way; for example, using isGuaranteedNotToBePoison.

See discussion on https://reviews.llvm.org/D106053 for more background.

Differential Revision: https://reviews.llvm.org/D106222
2021-07-18 18:13:11 -07:00
Simon Pilgrim 51a12d2ff0 [X86][SSE] matchShuffleWithPACK - avoid poison pollution from bitcasting multiple elements together.
D106053 exposed that we've not been taking into account that by bitcasting smaller elements together and then performing a ComputeKnownBits on the result we'd be allowing a poison element to influence other neighbouring elements being used in the pack. Instead we now peek through any existing bitcast to ensure that the source type already matches the width source of the pack node we're trying to match.

This has also been a chance to stop matchShuffleWithPACK creating unused nodes on the fly which could affect oneuse tests during shuffle lowering/combining.

The only regression we're seeing is due to being unable to peek through a bitcast as it's on the other side of an extract_subvector - which should go away once we finally allow shuffle combining across different vector widths (by making matchShuffleWithPACK use const SelectionDAG& we've gotten closer to this - see PR45974).
2021-07-18 14:25:28 +01:00
Simon Pilgrim d2458bcdc6 [X86][SSE] combineX86ShufflesRecursively - bail if constant folding fails due to oneuse limits.
Fixes an issue reported on D105827 where a single shuffle of a constant (with multiple uses) was caught in an infinite loop: one shuffle (UNPCKL) used an undef arg, but then that got recombined to SHUFPS as the constant value had its own undef that confused matching.
2021-07-16 19:21:46 +01:00
Simon Pilgrim ee71c1bbcc [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
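
A scalar model of the logic quoted above (a sketch of ours; cvttss2si here mimics the documented out-of-range behaviour of CVTTPS2SI):

```cpp
#include <cstdint>

// CVTTPS2SI returns 0x80000000 (INT32_MIN) for out-of-range inputs.
static int32_t cvttss2si(float x) {
  if (!(x >= -2147483648.0f && x < 2147483648.0f))
    return INT32_MIN;
  return static_cast<int32_t>(x);
}

uint32_t fp_to_uint(float x) {
  int32_t small = cvttss2si(x);               // valid for x < 2^31
  int32_t big = cvttss2si(x - 2147483648.0f); // valid for x >= 2^31
  // For x < 2^31: small >= 0, the mask is 0, and we keep `small`.
  // For x >= 2^31: small == INT32_MIN, the mask is all-ones, and the
  // result is 0x80000000 | big, i.e. big + 2^31.
  return uint32_t(small | (big & (small >> 31)));
}
```
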
Simon Pilgrim 3cee36c5ac [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result (REAPPLIED)
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))

Reapplied rGe4aa6ad13216 with fix for intrinsic variants of the opcode which uses a vector return type
2021-07-13 12:31:09 +01:00
Vitaly Buka 606551ee98 Revert "[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result"
Fails here https://lab.llvm.org/buildbot/#/builders/37/builds/5267

This reverts commit e4aa6ad132.
2021-07-12 22:26:54 -07:00
Simon Pilgrim e4aa6ad132 [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))
2021-07-12 09:56:59 +01:00
Simon Pilgrim 9dbeac16ba [X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to let us add AssertSext/AssertZext
It's proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.

This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).

This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
2021-07-09 12:07:33 +01:00
Wang, Pengfei 9ab99f773f [X86] Twist shuffle mask when folding HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
This patch fixes PR50823.

The shuffle mask must be twisted twice to obtain the correct one, due to the difference between the inner and outer HOPs.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104903
2021-07-05 21:29:42 +08:00
Simon Pilgrim 59fa435ea6 [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGS reads. (PR48760)
This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE, which should reduce the number of EFLAGS bits we need to read.

As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.

I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.

Differential Revision: https://reviews.llvm.org/D101074
2021-06-30 18:46:50 +01:00
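
In scalar terms (a toy example): for a constant C where C+1 doesn't overflow, `x > C` and `x >= C+1` are the same predicate, but the latter maps to a condition code that reads fewer independent EFLAGS bits:

```cpp
#include <cstdint>

bool sgt(int32_t x) { return x > 5; }  // SGT: setg reads ZF, SF and OF
bool sge(int32_t x) { return x >= 6; } // SGE: setge reads only SF and OF
```
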
Craig Topper 81f6d7c082 [X86] Tighten up some inline assembly constraint handling.
Don't allow vectors to split into GPRs for 'r' and other scalar
constraints. Prevents assertion in getCopyToPartsVector.

Makes PR50907 give a better error instead of crashing.
2021-06-26 22:57:22 -07:00
Martin Storsjö 42f74e8249 [llvm] Rename StringRef _lower() method calls to _insensitive()
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class, however these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
2021-06-25 00:22:01 +03:00
Florian Hahn a54c6fc083
[X86] Exclude invalid element types for bitcast/broadcast folding.
It looks like the fold introduced in 63f3383ece can cause crashes
if the type of the bitcasted value is not a valid vector element type,
like x86_mmx.

To resolve the crash, reject invalid vector element types. The way it is
done in the patch is a bit clunky. Perhaps there's a better way to
check?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104792
2021-06-24 12:39:01 +01:00
Simon Pilgrim c4d3eedc7f [X86] Fold nested select_cc to select (cmp*ge/le Cond0, Cond1), LHS, Y)
select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y)
etc,

We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes.

Pulled out of D101074 to help address the remaining regressions.

Differential Revision: https://reviews.llvm.org/D104707
2021-06-24 11:27:57 +01:00
Eli Friedman 74909e4b6e Rename MachineMemOperand::getOrdering -> getSuccessOrdering.
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving.  This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.

We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.

Differential Revision: https://reviews.llvm.org/D103338
2021-06-21 16:49:27 -07:00
Simon Pilgrim cdb4fcf9a1 [X86] combineSelect - refactor MIN/MAX detection code to make it easier to add additional select(setcc,x,y) folds. NFCI.
I need to add some additional handling to address some of the regressions from D101074
2021-06-17 13:50:59 +01:00
Roman Lebedev 308f6a5245
[NFC][X86] lowerVECTOR_SHUFFLE(): drop FIXME about widening to i128 (YMM half) element type
As per the discussion in D103818, so far, this does not appear to be worthwhile.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103818
2021-06-16 10:24:33 +03:00
Bob Haarman 3bc899b4de [X86] avoid assert with varargs, soft float, and no-implicit-float
Fixes:
 - PR36507 Floating point varargs are not handled correctly with
   -mno-implicit-float
 - PR48528 __builtin_va_start assumes it can pass SSE registers
   when using -Xclang -msoft-float -Xclang -no-implicit-float

On x86_64, floating-point parameters are normally passed in XMM
registers. For va_start, we spill those to memory so va_arg can
find them. There is an interaction here with -msoft-float and
-no-implicit-float:

When -msoft-float is in effect, instead of passing floating-point
parameters in XMM registers, they are passed in general-purpose
registers.

When -no-implicit-float is in effect, it "disables implicit
floating-point instructions" (per the LangRef). The intended
effect is to not have the compiler generate floating-point code
unless explicit floating-point operations are present in the
source code, but what exactly counts as an explicit floating-point
operation is not specified. The existing behavior of LLVM here has
led to some surprises and PRs.

This change modifies the behavior as follows:

  | soft | no-implicit | old behavior    | new behavior    |
  |  no  |   no        | spill XMM regs  | spill XMM regs  |
  | yes  |   no        | don't spill XMM | don't spill XMM |
  |  no  |  yes        | don't spill XMM | spill XMM regs  |
  | yes  |  yes        | assert          | don't spill XMM |

In particular, this avoids the assert that happens when
-msoft-float and -no-implicit-float are both in effect. This
seems like a perfectly reasonable combination: If we don't want
to rely on hardware floating-point support, we want to both
avoid using float registers to pass parameters and avoid having
the compiler generate floating-point code that wasn't in the
original program. Instead of crashing the compiler, the new
behavior is to not synthesize floating-point code in this
case. This fixes PR48528.

The other interesting case is when -no-implicit-float is in
effect, but -msoft-float is not. In that case, any floating-point
parameters that are present will be in XMM registers, and so we
have to spill them to correctly handle those. This fixes
PR36507. The spill is conditional on %al indicating that
parameters are present in XMM registers, so no floating-point
code will be executed unless the function is called with
floating-point parameters.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104001
2021-06-15 11:27:35 -07:00
Craig Topper 4017d0335a [X86] Use EVT::getVectorVT instead of changeVectorElementType in reduceVMULWidth.
Changing vector element type doesn't work for v6i32->v6i16 now
that v6i32 is an MVT and v6i16 is not.

I would like to fix this in changeVectorElementType, but you
need a LLVMContext to call getVectorVT which we can't get from
an MVT.

Fixes PR50709.
2021-06-14 22:07:04 -07:00
Craig Topper 765ef4bb2a [X86] Check destination element type before forming VTRUNCS/VTRUNCUS in combineTruncateWithSat.
Fixes crash reported here https://reviews.llvm.org/D73607

Using a store to keep the trunc intact. Returning v16i24 would
cause the trunc to be optimized away in SelectionDAGBuilder.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103940
2021-06-09 07:08:17 -07:00
Simon Pilgrim 8ab8b3fad7 [X86][SSE] LowerFP_TO_INT - remove dead code. NFCI.
Non-Strict v2f32->v2i64 cases have already early-returned to be handled by legalization.
2021-06-06 20:04:15 +01:00
Simon Pilgrim 4879c8f3b0 [X86][SSE] combineVectorTruncation - simplify PSHUFB-is-better logic. NFCI.
OutSVT is guaranteed to be i8/i16 and we accept any InSVT that isn't i64
2021-06-06 20:04:14 +01:00
Nikita Popov 1ffa6499ea [TargetLowering] Use IRBuilderBase instead of IRBuilder<> (NFC)
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.

Differential Revision: https://reviews.llvm.org/D103759
2021-06-06 16:29:50 +02:00
Nikita Popov 9914200393 [CodeGen] Add missing includes (NFC)
These currently rely on the IRBuilder.h include in TargetLowering.h.
Make them explicit.
2021-06-06 15:48:27 +02:00
Simon Pilgrim 9f5d783d46 [X86][SSE] combineScalarToVector - only reuse broadcasts for scalar_to_vector if the source operands scalar types match
We were hitting an issue when the scalar_to_vector source was being implicitly truncated (in this case to i8 to vXi1) but we were also using the i8 source in a broadcast to a vXi8 value.

Fixes PR50374
2021-06-02 22:05:40 +01:00
Roman Lebedev cf9b1f7a0e
[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants
Currently, X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess, this has been fine so far.

But at least on AMD Zen 3, while per-line variable shuffles (e.g. `VPSHUFB`)
are as fast as as shuffles with fixed/immediate mask,
while lane-crossing shuffles, e.g. `VPERMPS` is performing worse.

So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.

Differential Revision: https://reviews.llvm.org/D103274
2021-06-01 10:39:36 +03:00
Tim Northover 9ff2eb1ea5 SwiftTailCC: teach verifier musttail rules applicable to this CC.
SwiftTailCC has a different set of requirements than the C calling convention
for a tail call. The exact argument sequence doesn't have to match, but fewer
ABI-affecting attributes are allowed.

Also make sure the musttail diagnostic triggers if a musttail call isn't
actually a tail call.
2021-05-28 11:12:00 +01:00
Craig Topper a105d3024e [X86] Fold (shift undef, X)->0 for vector shifts by immediate.
We could previously do this by accident through the later
call to getTargetConstantBitsFromNode I think, but that only worked
if N0 had a single use. This patch makes it explicit for undef and
doesn't have a use count check.

I think this is needed to move the (shl X, 1)->(add X, X)
fold to isel for PR50468. We need to be sure X won't be IMPLICIT_DEF
which might prevent the same vreg from being used for both operands.

Differential Revision: https://reviews.llvm.org/D103192
2021-05-27 09:31:47 -07:00
Nick Desaulniers 033138ea45 [IR] make stack-protector-guard-* flags into module attrs
D88631 added initial support for:

- -mstack-protector-guard=
- -mstack-protector-guard-reg=
- -mstack-protector-guard-offset=

flags, and D100919 extended these to AArch64. Unfortunately, these flags
aren't retained for LTO. Make them module attributes rather than
TargetOptions.

Link: https://github.com/ClangBuiltLinux/linux/issues/1378

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D102742
2021-05-21 15:53:30 -07:00
Florian Hahn c2d44bd230
[X86] Lower calls with clang.arc.attachedcall bundle
This patch adds support for lowering function calls with the
`clang.arc.attachedcall` bundle. The goal is to expand such calls to the
following sequence of instructions:

    callq   @fn
    movq  %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue / _objc_unsafeClaimAutoreleasedReturnValue

This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the CALL64_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above.

The ObjC runtime function to call is determined by the
argument in the bundle, which is passed through as a
target constant to the pseudo.

@ahatanak is working on using this attribute in the front- & middle-end.

Together with the front- & middle-end changes, this should address
PR31925 for X86.

This is the X86 version of 46bc40e502,
which added similar support for AArch64.

Reviewed By: ab

Differential Revision: https://reviews.llvm.org/D94597
2021-05-21 16:33:58 +01:00
Jim Lin 4456805938 [X86] Don't fold (fneg (fma (fneg X), Y, (fneg Z))) to (fma X, Y, Z)
Check if it has no signed zeros flag (nsz) in getNegatedExpression for x86.
This patch fixed miscompilation: https://alive2.llvm.org/ce/z/XxwBAJ

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D90901
2021-05-21 23:02:19 +08:00
Sanjay Patel f12f9beb04 [x86] propagate FMF from x86-specific intrinsic nodes to others during combining
This is another FMF gap exposed by D90901, but I don't see a way
to show the difference in a regression test as with:
f66ba4c
6025663

We will see an asm difference if we add a test as part of D90901.
2021-05-19 14:25:09 -04:00
Sanjay Patel f66ba4cfa7 [x86] propagate FMF from x86-specific intrinsic nodes to others during lowering
This is another fast-math-flags failure exposed by D90901.
2021-05-19 13:11:15 -04:00
Simon Pilgrim ab4e04a0f3 [X86][AVX] createVariablePermute - generalize the PR50356 fix for smaller indices vector as well
Generalize the fix from rGd0902a8665b1 by ensuring we widen/narrow the indices subvector first and then perform the ZERO_EXTEND_VECTOR_INREG (if necessary), which should allow us to perform the variable permutes with source/destination/indices vectors of any widths.
2021-05-19 14:39:41 +01:00
Tim Northover c1dc267258 MachineBasicBlock: add liveout iterator aware of which liveins are defined by the runtime.
Using this in RegAlloc fast reduces register pressure, and in some cases allows
x86 code to compile that wouldn't before.
2021-05-19 11:00:24 +01:00
Simon Pilgrim d0902a8665 [X86][AVX] createVariablePermute - correctly extend same-sized-vector indices (PR50356)
D101838 incorrectly handled indices vectors of the same size but with higher element counts to just bitcast to the target indices type instead of performing a ZERO_EXTEND_VECTOR_INREG
2021-05-18 20:30:46 +01:00
Tim Northover ba1509da7b Recommit X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specfically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).

Recommited with a fix for unwind info when i386 pc-rel thunks are
adjacent to a prologue.
2021-05-18 15:19:05 +01:00
Mitch Phillips 6791a6b309 Revert "X86: support Swift Async context"
This reverts commit 747e5cfb9f.

Reason: New frame layout broke the sanitizer unwinder. Not clear why,
but seems like some of the changes aren't always guarded by Swyft
checks. See
https://reviews.llvm.org/rG747e5cfb9f5d944b47fe014925b0d5dc2fda74d7 for
more information.
2021-05-17 12:44:57 -07:00
Simon Pilgrim 41587466aa [X86] Don't dereference a dyn_cast<> - use a cast<> instead. NFCI.
dyn_cast<> can return null if the cast fails, by using cast<> we assert that the cast is correct helping to avoid a potential null dereference.
2021-05-17 15:58:32 +01:00
Tim Northover 747e5cfb9f X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specfically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
2021-05-17 11:56:16 +01:00
Tim Northover 82a0e808bb IR/AArch64/X86: add "swifttailcc" calling convention.
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".

Support is added for AArch64 and X86 for now.
2021-05-17 10:48:34 +01:00
Simon Pilgrim 262e4200d1 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines (REAPPLIED). NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.

Reapplies rGb95a103808ac (after reversion at rGc012a388a15b), with SSE3/SSSE3 typo fix, test added at rG0afb10de1449.
2021-05-16 13:50:58 +01:00
Nico Weber c012a388a1 Revert "[X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI."
This reverts commit b95a103808.
Makes clang assert very early in a Chromium build. See
https://bugs.chromium.org/p/chromium/issues/detail?id=1209490#c1
for a standalone repro.
2021-05-15 12:20:02 -04:00
Simon Pilgrim b95a103808 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.
2021-05-14 16:52:55 +01:00
Serge Pavlov 12537ab772 [FPEnv][X86] Implement lowering of llvm.set.rounding
Differential Revision: https://reviews.llvm.org/D74730
2021-05-13 14:30:38 +07:00
Fangrui Song 0fe6649bc5 [X86] Fix -Wunused-lambda-capture 2021-05-12 10:34:32 -07:00
Simon Pilgrim fb1d61b725 [X86][AVX] Fold concat(ps*lq(x,32),ps*lq(y,32)) -> shuffle(concat(x,y),zero) (PR46621)
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.

I was hoping to put this in LowerScalarImmediateShift, but performing that early causes regressions where other instructions were respliting the subvectors.
2021-05-12 18:04:40 +01:00
Simon Pilgrim 7bff9bdd34 [X86][AVX] combineConcatVectorOps - add ConcatSubOperand helper. NFCI.
Pull out repeated code to create a concat_vectors of the same operand from all subvecs.
2021-05-12 16:42:18 +01:00
Craig Topper 44e0e91db0 [ValueTypes] Rename MVT::getVectorNumElements() to MVT::getVectorMinNumElements(). Fix some misuses of getVectorNumElements()
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.

The changes to isPow2VectorType and getPow2VectorType are copied from EVT.

The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.

This was motivated by the bug I fixed yesterday in 80b9510806

Reviewed By: frasercrmck, sdesmalen

Differential Revision: https://reviews.llvm.org/D102262
2021-05-12 07:46:45 -07:00
Sanjay Patel f58e0513dd [x86] try harder to lower to PCMPGT instead of not-of-PCMPEQ
This is motivated by the example in https://llvm.org/PR50055 ,
but it doesn't do anything for that bug currently because we
don't actually have a zero-extended setcc there.

Proof for the generic transform (inverse of what we would
try to do in combining):
https://alive2.llvm.org/ce/z/aBL-Mg

Differential Revision: https://reviews.llvm.org/D102275
2021-05-12 08:25:29 -04:00
Simon Pilgrim 72e242a286 [X86][AVX] canonicalizeShuffleMaskWithHorizOp - improve support for 256/512-bit vectors
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.

This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.

This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
2021-05-12 12:13:24 +01:00
Simon Pilgrim 9acc03ad92 [X86][SSE] Replace foldShuffleOfHorizOp with generalized version in canonicalizeShuffleMaskWithHorizOp
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffles masks, float/integer domains and support shuffle-of-packs as well.

The next step will be to support 256/512-bit vector cases.
2021-05-11 14:18:45 +01:00
Simon Pilgrim e32374ed5c [X86][SSE] canonicalizeShuffleMaskWithHorizOp - add TODO for better 256/512-bit shuffle+hop folding support. NFC. 2021-05-10 18:43:16 +01:00
Simon Pilgrim 4524d8b755 [X86] combineHorizOpWithShuffle - generalize HOP(SHUFFLE(X),SHUFFLE(Y)) -> SHUFFLE(HOP(X,Y)) fold.
For 128-bit types, generalize the fold to recognise duplicate operands in either shuffle.
2021-05-08 16:23:27 +01:00
Simon Pilgrim f744723f75 [X86] combineXor - limit fold to non-opaque constants (PR50254)
Ensure we don't try to fold when one might be an opaque constant - the constant fold will fail and then the reverse fold will happen in DAGCombine.....
2021-05-07 16:39:24 +01:00
Simon Pilgrim 280aa3415e [DAG] Add a generic expansion for SHIFT_PARTS opcodes using funnel shifts
Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts.

Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication.

I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to).

NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly.

Differential Revision: https://reviews.llvm.org/D101987
2021-05-07 13:12:30 +01:00
Simon Pilgrim 8e42024f79 [X86] Ensure we pass DebugLoc by const reference where possible. NFCI.
Avoids a lot of unnecessary tracking increments/decrements of the underlying TrackingMDNodeRef
2021-05-07 12:32:59 +01:00
Simon Pilgrim 85460a2f5b [X86][SSE] Move unpack(hop,hop) fold from foldShuffleOfHorizOp to combineTargetShuffle
By moving this after more of the shuffle canonicalization we reduce the demanded vector elts, avoiding a few unnecessary copies/moves etc.
2021-05-05 13:36:09 +01:00
Alexey Bataev 13a51e017c [X86]Fix a crash trying to convert indices to proper type.
Need to perfortm a bitcast on IndicesVec rather than subvector extract
if the original size of the IndicesVec is the same as the size of the
  destination type.

Differential Revision: https://reviews.llvm.org/D101838
2021-05-05 05:14:42 -07:00
Dávid Bolvanský 2af95a5275 [X86] Promote 16-bit CTTZ_ZERO_UNDEF to 32-bit variant
Related to PR50172.

Protects us against regressions after we will start doing cttz(zext(x)) -> zext(cttz(x)) transformation in the middle-end.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D101662
2021-05-01 00:42:15 +02:00
Craig Topper 3067520bf4 [SelectionDAG] Use a VTSDNode to store the saturation width for FP_TO_SINT_SAT/FP_TO_UINT_SAT
Previously we used an i32 constant to store the saturation width, but i32 isn't
legal on RISCV64. This wasn't a big deal to fix, but it is extra work for the
type legalizer.

This patch uses a VTSDNode to store the type similar to SEXT_INREG. This makes
it opaque to the type legalizer.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D101262
2021-04-27 14:38:42 -07:00
Nick Desaulniers ea8416bf4d [CodeGenOptions] make StackProtectorGuardOffset signed
GCC supports negative values for -mstack-protector-guard-offset=, this
should be a signed value. Pre-req to D100919.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101325
2021-04-27 10:12:58 -07:00
Sander de Smalen 43ace8b5ce [TTI] NFC: Change getScalingFactorCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100564
2021-04-23 16:06:36 +01:00
Simon Pilgrim 7b32e8b96a [X86] combineSetCCAtomicArith - pull out repeated ops. NFCI.
Reduces diff in D101074
2021-04-23 14:19:24 +01:00
Simon Pilgrim a511b55cfd [X86][SSE] getFauxShuffleMask - don't decode OR(SHUFFLE,SHUFFLE) containing UNDEFs. (PR50049)
PR50049 demonstrated an infinite loop between OR(SHUFFLE,SHUFFLE) <-> BLEND(SHUFFLE,SHUFFLE) patterns.

The UNDEF elements were allowing a combined shuffle mask to be widened which lost the undef element, resulting us needing to use the BLEND pattern (as the undef element would need to be zero for the OR pattern). But then bitcast folds would re-expose the undef element allowing us to use OR again.....
2021-04-21 18:47:00 +01:00
Simon Pilgrim 2a419a0b99 [X86][SSE] combineX86ShuffleChain - check if we're blending with zero into already zero elements
Add a SelectionDAG::MaskedElementsAreZero helper that wraps SelectionDAG::MaskedValueIsZero testing for entirely zero vector elements
2021-04-20 17:09:49 +01:00
Simon Pilgrim 9d57a77b81 [X86] combineCMP - fold cmpEQ/NE(TRUNC(X),0) -> cmpEQ/NE(X,0)
If we are truncating from a i32 source before comparing the result against zero, then see if we can directly compare the source value against zero.

If the upper (truncated) bits are known to be zero then we can compare against that, hopefully increasing the chances of us folding the compare into a EFLAG result of the source's operation.

Fixes PR49028.

Differential Revision: https://reviews.llvm.org/D100491
2021-04-15 13:55:51 +01:00
Simon Pilgrim 4fbe761572 [X86][SSE] canonicalizeShuffleWithBinOps - check for more combos of merge-able binary shuffles.
In the fold SHUFFLE(BINOP(X,Y),BINOP(Z,W)) -> BINOP(SHUFFLE(X,Z),SHUFFLE(Y,W)), check if both X/Z AND Y/W have at least one merge-able shuffle in which case the total number of shuffle should still fall.

Helps with instruction count regressions we saw while fixing PR48823
2021-04-14 15:24:41 +01:00
Simon Pilgrim 73737fe990 [X86] Fold cmpeq/ne(trunc(x),0) --> cmpeq/ne(x,0)
Relax the fold from rGbaadbe04bf75 to compare any op, not just logic ops, now that the movmsk regressions have been handled.
2021-04-14 11:02:02 +01:00
Simon Pilgrim 016ceb8382 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) fold
Extension to rG74f98391a7a4, we can also include any of the upper (known zero) bits in the comparison in the shuffle removal fold, just as long as we demand all the elements of the movmsk source vector.
2021-04-14 11:02:01 +01:00
Simon Pilgrim 74f98391a7 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in CMP(MOVMSK(PACKSS())) -> CMP(MOVMSK()) fold
We already allow the comparison of the upper bits of 'IsAllOf' (allbits) patterns, but we can safely compare the known zero bits for 'IsAnyOf' (zerobits) patterns as well.

This fixes an issues where we are comparing a type wide than the number of vector elements, which avoids a regression mentioned in rGbaadbe04bf75.
2021-04-13 17:37:24 +01:00
Simon Pilgrim baadbe04bf [X86] Fold cmpeq/ne(trunc(logic(x)),0) --> cmpeq/ne(logic(x),0)
Fixes the issues noted in PR48768, where the and/or/xor instruction had been promoted to avoid i8/i16 partial-dependencies, but the test against zero had not.

We can almost certainly relax this fold to work for any truncation, although it breaks a number of existing folds (notable movmsk folds which tend to rely on the truncate to determine the demanded bits/elts in the source vector).

There is a reverse combine in TargetLowering.SimplifySetCC so we must wait until after legalization before attempting this.
2021-04-12 16:05:34 +01:00
Simon Pilgrim 231b87618b [X86][AVX512] Fold not(kmov(x)) -> kmov(not(x)) and not(widen_subvector(x)) -> widen_subvector(not(x))
Improve AVX512 mask inversion, rG38c799bce801 exposed some missing opportunities to move scalar not() back onto the boolvector types for folding with setcc etc.
2021-04-11 20:07:09 +01:00
Simon Pilgrim 13bdac5709 [X86] combineXor - Pull out repeated getOperand() calls. NFCI. 2021-04-11 19:01:59 +01:00
Simon Pilgrim 38c799bce8 [X86] Fold cmpeq/ne(and(X,Y),Y) --> cmpeq/ne(and(~X,Y),0)
Followup to D100177, handle an similar (demorgan inverse style) case from PR47797 as well

The AVX512 test cases could be further improved if we folded not(iX bitcast(vXi1)) -> (iX bitcast(not(vXi1)))

Alive2: https://alive2.llvm.org/ce/z/AnA_-W
2021-04-11 18:42:01 +01:00
Simon Pilgrim d8bc4de3cf [X86] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) on non-BMI targets (PR44136)
Followup to D100177, enable the fold for non-BMI targets as well.
2021-04-09 16:11:11 +01:00
Simon Pilgrim 245036950a [X86][BMI] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) (PR44136)
I've initially just enabled this for BMI which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what'd we get when we enable it in all cases (I'll do this as a later commit).

Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.

https://alive2.llvm.org/ce/z/QVWHP_
https://alive2.llvm.org/ce/z/pLngT-

Vector cases do not appear to benefit from this as we end up with having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.

Differential Revision: https://reviews.llvm.org/D100177
2021-04-09 15:52:03 +01:00
Simon Pilgrim 3ae0a405fc [X86] combineHorizOpWithShuffle - peek through one use bitcasts when decoding shuffles.
Checking for one use, peek through bitcasts of the horizop args to allows us to merge shuffles of different widths through the horizop.
2021-04-09 10:51:04 +01:00
Simon Pilgrim 53283cc2f1 [X86][SSE] canonicalizeShuffleWithBinOps - add MOVSD/MOVSS handling. 2021-04-06 16:42:18 +01:00
Simon Pilgrim ddbb58736a [KnownBits] Rename KnownBits::computeForMul to KnownBits::mul. NFCI.
As promised in D98866
2021-04-06 10:11:41 +01:00
Simon Pilgrim 36d4f6d7f8 [X86] Fold xor(zext(xor(x,c1)),c2) -> xor(zext(x),xor(zext(c1),c2))
Fixes PR47603 (second case) by extending rG89afec348dbd3e5078f176e978971ee2d3b5dec8
2021-04-05 11:40:37 +01:00
Simon Pilgrim 89afec348d [X86] Fold xor(truncate(xor(x,c1)),c2) -> xor(truncate(x),xor(truncate(c1),c2))
Fixes PR47603

This should probably be transferable to DAGCombine - the main limitation with the existing trunc(logicop) DAG fold is we don't know if legalization has tried to promote truncated logicops already. We might be able to peek through extensions as well.
2021-04-03 12:43:05 +01:00
Simon Pilgrim 7c17f1ea84 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper (REAPPLIED)
Use the getTargetShuffleInputs helper for all shuffle decoding

Reapplied (after reversion in rGfa0aff6d6960) with fix+test for subvector splitting - we weren't accounting for peeking through bitcasts changing the vector element count of the shuffle sources.
2021-04-03 11:59:19 +01:00
Nico Weber fa0aff6d69 Revert "[X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper"
This reverts commit 500969f1d0.
Makes clang assert compiling avx2 code, see
https://bugs.chromium.org/p/chromium/issues/detail?id=1195353#c4
for a standalone repro.
2021-04-02 09:55:55 -04:00
Simon Pilgrim 500969f1d0 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper
Use the getTargetShuffleInputs helper for all shuffle decoding
2021-04-02 11:50:18 +01:00
Yang Fan bc6001ce1e
[X86] Fix -Wunused-function warning (NFC)
GCC warning:
```
/llvm-project/llvm/lib/Target/X86/X86ISelLowering.cpp:9212:13: warning: ‘bool isHorizOp(unsigned int)’ defined but not used [-Wunused-function]
 9212 | static bool isHorizOp(unsigned Opcode) {
      |             ^~~~~~~~~
```
2021-04-02 09:38:12 +08:00
Simon Pilgrim abbe80fa52 [X86][SSE] Fold HOP(HOP(X,X),HOP(Y,Y)) -> HOP(PERMUTE(HOP(X,Y)),PERMUTE(HOP(X,Y))
For slow-hop targets, attempt to merge HADD/SUB pairs used in chains.
2021-04-01 11:54:10 +01:00
Simon Pilgrim 301319840e [X86][SSE] Enable (F)HADD/SUB handling to SimplifyMultipleUseDemandedVectorElts
Attempt to bypass unused horiz-op operands.

This is very similar to the PACKSS/PACKUS handling - we should try to merge these.
2021-04-01 11:54:09 +01:00
Simon Pilgrim f7aeaced65 [X86][SSE] Add isHorizOp helper function. NFCI. 2021-04-01 11:54:09 +01:00
Craig Topper 437958d9fd [X86] Improve SMULO/UMULO codegen for vXi8 vectors.
The default expansion creates a MUL and either a MULHS/MULHU. Each
of those separately expand to sequences that use one or more
PMULLW instructions as well as additional instructions to
extend the types to vXi16. The MULHS/MULHU expansion computes the
whole 16-bit product, but only keeps the high part.

We can improve the lowering of SMULO/UMULO for some cases by using the MULHS/MULHU
expansion, but keep both the high and low parts. And we can use
those parts to calculate the overflow.

For AVX512 we might have vXi1 overflow outputs. We can improve those by using
vpcmpeqw to produce a k register if AVX512BW is enabled. This is a little better
than truncating the high result to use vpcmpeqb. If we don't have avx512bw we
can extend up to v16i32 to use vpcmpeqd to produce a k register.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97624
2021-03-31 10:13:50 -07:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sanjay Patel e694e19a79 [x86] enhance matching of pmaddwd
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583 ,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.

This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.

Differential Revision: https://reviews.llvm.org/D99531
2021-03-30 07:28:33 -04:00
Simon Pilgrim 805148eaf2 [X86][SSE] combineHorizOpWithShuffle - consistently use getTargetShuffleInputs to decode shuffles
Minor cleanup before I start trying to merge the unary/binary shuffle combining paths.
2021-03-29 11:31:19 +01:00
Craig Topper 69bdf35dc7 [X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshudfd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98587
2021-03-28 11:41:29 -07:00
Simon Pilgrim 2a0d5da917 [X86][SSE] foldShuffleOfHorizOp - remove broadcast handling.
Remove VBROADCAST/MOVDDUP/splat-shuffle handling from foldShuffleOfHorizOp

This can all be handled by canonicalizeShuffleMaskWithHorizOp along as we check that the HADD/SUB are only used once (to prevent infinite loops on slow-horizop targets which will try to reuse the nodes again followed by a post-hop shuffle).
2021-03-27 15:09:23 +00:00
Simon Pilgrim 41146bfe82 [X86][SSE] combineX86ShuffleChain - attempt to recognise 'hidden' identity shuffles
See if the combined shuffle mask is equivalent to an identity shuffle, typically this is due to repeated LHS/RHS ops in horiz-ops, but isTargetShuffleEquivalent might see other patterns as well.

This is another small step towards getting rid of foldShuffleOfHorizOp and relying on canonicalizeShuffleMaskWithHorizOp and generic shuffle combining.
2021-03-27 11:09:30 +00:00
Sanjay Patel a283d72583 [x86] prevent crashing while matching pmaddwd
This could crash in 2 ways: either one or both of
the input vectors could be a different size than
the math ops.

https://llvm.org/PR49716
2021-03-27 05:27:14 -04:00
Simon Pilgrim c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode
2021-03-26 17:23:54 +00:00
Simon Pilgrim 36e3c6c841 [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering as it makes it easier to use any sign/zero-extended bits in the truncated bits with PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now be all resolved with combineHorizOpWithShuffle and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.

Differential Revision: https://reviews.llvm.org/D96609
2021-03-25 10:34:34 +00:00
Simon Pilgrim 9fde88c3e2 [X86][AVX] splitIntVSETCC - handle separate (canonicalized) SETCC operands
LowerVSETCC calls splitIntVSETCC after canonicalizing certain patterns, in particular (X & CPow2 != 0) -> (X & CPow2 == CPow2).

Unfortunately if we're splitting for AVX1/non-AVX512BW cases, we lose these canonicalizations as we call the split with the original SetCC node, and when the split nodes are later lowered in LowerVSETCC the patterns are lost behind extract_subvector etc. But if we pass the canonicalized operands for splitting we retain the optimizations.

Differential Revision: https://reviews.llvm.org/D99256
2021-03-25 10:18:44 +00:00
Simon Pilgrim 7920527796 [X86][AVX] combineBitcastvxi1 - improve handling of vectors truncated to vXi1
If we're truncating to vXi1 from a wider type, then prefer the original wider vector as is simplifies folding the separate truncations + extensions.

AVX1 this is only worth it for v8i1 cases, not v4i1 where we're always better off truncating down to v4i32 for movmsk.

Helps with some regressions encountered in D96609
2021-03-24 14:05:59 +00:00
Simon Pilgrim e9015bd595 [X86][AVX] lowerShuffleAsBroadcast - MOVDDUP(SCALAR_TO_VECTOR(X)) -> BROADCAST(X)
Prefer broadcast from scalar on AVX targets as this makes it easier for later folds to strip away bitcasts etc.

This helps a lot with the AVX1 poor codegen from PR49658.

There's a trivial regression in bitcast-int-to-vector-bool-*ext.ll tests due to SimplifyDemandedBits not being able to see a multi-use case, but there's bigger existing codegen issues to be addressed first in those tests (unnecessary NOTs).
2021-03-24 11:31:56 +00:00
Simon Pilgrim c1ef642ad8 [X86] Remove unused 'OneUse' option from IsNOT helper. NFCI. 2021-03-24 11:14:38 +00:00
Simon Pilgrim 080cb83e52 [X86][AVX] Narrow VPBROADCASTQ->VPBROADCASTD if we don't need the upper bits.
Helps fix cases where we've splatted smaller types to a wider vector element type without needing the upper bits.

Avoid this on AVX512 targets as that can affect broadcast folding.
2021-03-23 09:41:02 +00:00
Simon Pilgrim 3179588947 [X86][AVX] ComputeNumSignBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Added as an extension to PR49658
2021-03-21 12:22:51 +00:00
Simon Pilgrim 297b9bc3fa [X86][AVX] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Suggested by @craig.topper on PR49658
2021-03-21 10:40:57 +00:00
Simon Pilgrim 54a05f2ec8 [X86] computeKnownBitsForTargetNode - add X86ISD::PMULUDQ handling
Reuse the existing KnownBits multiplication code to handle what is effectively a ISD::UMUL_LOHI varient
2021-03-21 09:57:20 +00:00
Simon Pilgrim 64687f2cc3 [X86][SSE] canonicalizeShuffleWithBinOps - add PERMILPS/PERMILPD + PERMPD/PERMQ + INSERTPS handling.
Bail if the INSERTPS would introduce zeros across the binop.
2021-03-16 13:52:08 +00:00
Simon Pilgrim 772155793b [X86][SSE] isHorizontalBinOp - ensure we clear any unused source operands to improve HADD/SUB matching
Our shuffle matching for HADD/SUB patterns wasn't clearing repeated ops in 'fake unary' style shuffle masks (unpack(x,x) etc.), preventing matching of add(fakeunary(),fakeunary()) style patterns.
2021-03-15 16:24:29 +00:00
Simon Pilgrim 814339454d [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 15:01:29 +00:00
Simon Pilgrim 07232f4507 [X86][SSE] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling.
Recommit rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 with an early-out if the pshub would introduce zeros across the binop.
2021-03-15 12:43:30 +00:00
Simon Pilgrim 75a184dacf Revert rG9ba577eca2e339726bfaad4e615c6324a705b292 "[X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI."
Sorry this wasn't supposed to be committed yet (and certainly not tagged as NFCI....)
2021-03-15 12:23:44 +00:00
Simon Pilgrim 9ba577eca2 [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 11:59:25 +00:00
Simon Pilgrim 6878be5dc3 [X86][SSE] Attempt to merge single-op hops for slow targets.
For slow-hop targets, see if any single-op hops are duplicating work already done on another (dual-op) hop, which can sometimes occur as isHorizontalBinOp tries to find potential duplicates (but can't merge them itself). If so, reuse the other hop and shuffle the result.
2021-03-15 09:30:20 +00:00
Simon Pilgrim 6cb7dddaf4 [X86][AVX] Insert zeros byte elements into 256/512-bit vectors using shuffle/and
Avoid extracting/inserting subvectors which makes it more difficult for shuffle combining to merge them together.
2021-03-12 15:16:36 +00:00
Simon Pilgrim 33dcdd414c [X86] Provide lighter weight getTargetShuffleMask wrapper. NFCI.
Most callers to getTargetShuffleMask don't use the IsUnary flag.
2021-03-12 15:16:35 +00:00
Simon Pilgrim bc5e9ec2dc Revert rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 "[X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling."
Investigating an issue reported by @bkramer, possibly when the PSHUFB mask generates zero elements.
2021-03-11 13:14:00 +00:00
Simon Pilgrim 77394c12a4 [X86] Don't attempt to fold sub(C1, xor(X, C2)) with opaque constants
Fixes PR49451
2021-03-11 12:06:40 +00:00
Simon Pilgrim d0884541cc [X86] canonicalizeShuffleWithBinOps - add binary shuffle handling 2021-03-09 13:57:03 +00:00
Simon Pilgrim f71cee136d [X86] Break if-else chain. NFCI.
Both if blocks affect control flow - we don't need the else.

Fixes clang-tidy warning.
2021-03-08 11:44:31 +00:00
Simon Pilgrim cd938ab162 [X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling. 2021-03-07 12:56:35 +00:00
Simon Pilgrim 772a501bf4 [X86] canonicalizeShuffleWithBinOps - shuffle oneuse constants.
We can freely shuffle all ones/zeros constants but we can also freely shuffle other constants as long as they only have one use.
2021-03-07 11:17:03 +00:00
Alexey Lapshin cf7cdaff64 [X86][VARARG] Avoid spilling xmm registers for va_start.
That review is extracted from D69372.
It fixes https://bugs.llvm.org/show_bug.cgi?id=42219 bug.

For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.
This rule does not work with variable function arguments currently.
Though compiler correctly guards block of code, which copies xmm vararg
parameters with a check for %al, it does not protect spills for xmm registers.
Thus, such spills are generated in non-protected areas and could break code,
which does not expect floating-point data. The problem happens in -O0
optimization mode. With this optimization level there is used
FastRegisterAllocator, which spills virtual registers at basic block boundaries.
Register Allocator does not protect spills with additional control-flow modifications.
Thus to resolve that problem, it is suggested to not copy incoming physical
registers into virtual registers. Instead, store incoming physical xmm registers
into the memory from scratch.

Differential Revision: https://reviews.llvm.org/D80163
2021-03-06 15:25:47 +03:00
Simon Pilgrim d7b8cb4d57 [X86] X86ISelLowering.cpp - try to use for-range loops. NFCI. 2021-03-05 11:09:14 +00:00
Simon Pilgrim 7cbc5df438 [X86] X86TargetLowering::isSafeMemOpType - break if-else chain. NFCI.
All if-else blocks return - fixes clang-tidy warning.
2021-03-04 12:15:08 +00:00
Simon Pilgrim 1584e55a26 [X86] canonicalizeShuffleWithBinOps - handle general unaryshuffle(binop(x,c)) patterns not just xor(x,-1)
Generalize the shuffle(not(x)) -> not(shuffle(x)) fold to handle any binop with 0/-1.

Hopefully we can further generalize to help push target unary/binary shuffles through binops similar to what we do in DAGCombiner::visitVECTOR_SHUFFLE
2021-03-04 10:44:38 +00:00
Simon Pilgrim aa4afebbf9 [X86] Fold scalar_to_vector(x) -> extract_subvector(broadcast(x),0) iff broadcast(x) exists
Add handling for reusing an existing broadcast(x) to a wider vector.
2021-03-03 15:50:37 +00:00
Benjamin Kramer 10c256ccaf Revert "[X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))"
This reverts commit 925093d88a.

Causes an infinite loop when compiling some shuffles:

$ cat bugpoint-reduced-simplified.ll
target triple = "x86_64-unknown-linux-gnu"

define void @foo() {
entry:
  %0 = load i8, i8* undef, align 1
  %broadcast.splatinsert = insertelement <16 x i8> poison, i8 %0, i32 0
  %1 = icmp ne <16 x i8> %broadcast.splatinsert, zeroinitializer
  %2 = shufflevector <16 x i1> %1, <16 x i1> undef, <16 x i32> zeroinitializer
  %wide.load = load <16 x i8>, <16 x i8>* undef, align 1
  %3 = icmp ne <16 x i8> %wide.load, zeroinitializer
  %4 = and <16 x i1> %3, %2
  %5 = zext <16 x i1> %4 to <16 x i8>
  store <16 x i8> %5, <16 x i8>* undef, align 1
  ret void
}

$ llc < bugpoint-reduced-simplified.ll
<timeout>
2021-03-02 11:24:07 +01:00
Simon Pilgrim 925093d88a [X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))
Move NOT out to expose more AND -> ANDN folds
2021-03-01 14:47:39 +00:00
Simon Pilgrim ab3ea27b6f [X86][AVX] Reuse existing VBROADCAST(x) for SCALAR_TO_VECTOR(x)
Similar to what we already do for BROADCASTs of different vector sizes - if we're going to broadcast it anyway might as well reuse it.
2021-02-28 11:37:27 +00:00
Craig Topper 993f4d8ffa [X86] Fix a couple comments that said LHS where they meant RHS. NFC 2021-02-27 17:14:17 -08:00
James Y Knight 6de6455752 Use getAlign() on atomicrmw/cmpxchg instructions, now that it's available.
These locations were missed as part of adding alignment to the
instructions, and were still making their own alignment assumptions.
2021-02-26 15:06:15 -05:00
Simon Pilgrim ed1f45bce9 [X86][AVX] SimplifyDemandedBitsForTargetNode - add basic X86ISD::VBROADCAST handling.
Simplify through to the scalar/vector source operand.
2021-02-26 16:13:14 +00:00
Simon Pilgrim 7ac4c956af [X86] Remove unnecessary custom lowering of vXi1 SADDSAT/SSUBSAT/UADDSAT/USUBSAT
As discussed on D97478. The removal of the custom tag causes some changes in the add/sub-overflow expansion as it no longer expands to sat-arith codegen.
2021-02-26 12:10:23 +00:00
Simon Pilgrim aefe8f2f6c [DAG] Fold vXi1 multiplies -> and
This allows us to remove X86 custom lowering of vXi1 MUL, which helps simplify a load of mask math.

Mentioned in D97478 post review.
2021-02-26 11:46:12 +00:00
Simon Pilgrim 40b8b4a466 [X86] Remove unnecessary custom lowering of v16i1/v32i1 ADD/SUB
These were missed in D97478
2021-02-26 11:46:11 +00:00
Craig Topper ceaedfb5fc [X86] Remove custom lowering of vXi1 ADD/SUB now that they are canonicalized to XOR in getNode.
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97478
2021-02-25 08:52:41 -08:00
Simon Pilgrim 8b82669d56 [X86][SSE] Move unaryshuffle(xor(x,-1)) -> xor(unaryshuffle(x),-1) fold into helper. NFCI.
We should be able to extend this "canonicalizeShuffleWithBinOps" to handle more generic binop cases where either/both operands can be cheaply shuffled.
2021-02-25 10:56:23 +00:00
Simon Pilgrim b568d3d6c9 [X86] Add vector support to sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. 2021-02-21 21:51:27 +00:00
Simon Pilgrim 3ab32c94a4 [X86] Replace explicit constant handling in sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. NFCI.
NFC cleanup before adding vector support - rely on the SelectionDAG to handle everything for us.
2021-02-21 21:40:32 +00:00
Simon Pilgrim bae04a3e2d [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - remove unnecessary BITCASTs.
In conjunction with the 'vperm2x128(bitcast(x),bitcast(y),c) -> bitcast(vperm2x128(x,y,c))' fold in combineTargetShuffle, this should remove any unnecessary bitcasts around vperm2x128 lane shuffles.
2021-02-21 18:40:32 +00:00
Simon Pilgrim a6a258f1da [X86][AVX] Fold concat(extract_subvector(v0,c0), extract_subvector(v1,c1)) -> vperm2x128
Fixes regression exposed by removing bitcasts across logic-ops in D96206.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:50:43 +00:00
Simon Pilgrim 2885d1251f [X86] Fold bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y)) for int-int bitcasts
Extend the existing combine that handles bitcasting for fp-logic ops to also help remove logic ops across bitcasts to/from the same integer types.

This helps improve AVX512 predicate handling for D/Q logic ops and also allows DAGCombine's scalarizeExtractedBinop to remove some annoying gpr->simd->gpr transfers.

The concat_vectors regression in pr40891.ll will be addressed in a followup commit on this patch.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:40:54 +00:00
Simon Pilgrim 761bbed264 [DAG] foldSubToUSubSat - fold sub(a,trunc(umin(zext(a),b))) -> usubsat(a,trunc(umin(b,SatLimit)))
This moves the last custom x86 USUBSAT fold to generic DAGCombine.

Completes PR40111

Differential Revision: https://reviews.llvm.org/D96703
2021-02-20 12:02:07 +00:00
Simon Pilgrim 2258b367db [X86][AVX] getFauxShuffleMask - decode VBROADCAST(EXTRACT_VECTOR_ELT(V,0))
Handle the case where we're broadcasting a scalar extracted from another vector.
2021-02-19 11:06:53 +00:00
Wang, Pengfei c98644c2ec [X86] Fix a codegen crash in getSetCCResultType
This patch fixes some crashes coming from
X86ISelLowering::getSetCCResultType, which would occasionally return
an EVT constructed from an invalid MVT, which has a null Type pointer.

This patch refers to D95434.

Differential Revision: https://reviews.llvm.org/D97036
2021-02-19 17:30:10 +08:00
Simon Pilgrim 05c64ea672 [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) (REAPPLIED)
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-17 11:42:43 +00:00
Sterling Augustine 5aa8f4c084 Revert "[DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))"
This reverts commit 5dfba562dd.

That commit causes an assertion failure with the following repro:

typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
  b k, l, m, n, o[6], p, q;
  m = d[5];
  b r = m;
  b s = f(r, 8);
  q = s;
  l = d[1];
  p = l;
  t(q);
  n = c(m, l);
  o[1] = c(s, f(p, 8));
  k = __builtin_shufflevector(n, o[1], 0, 2);
  e = __builtin_ia32_psrlwi128(k, j);
}

./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
2021-02-16 12:48:15 -08:00
Simon Pilgrim 5dfba562dd [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-16 15:46:34 +00:00
Simon Pilgrim 4841a225b7 [DAG] Move basic USUBSAT pattern matches from X86 to DAGCombine
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.

This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.

We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.

The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.

Differential Revision: https://reviews.llvm.org/D96413
2021-02-12 18:22:57 +00:00
Simon Pilgrim eb31c3c5cb Revert rGe1172959226689a "[X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks."
Revert this while I investigate a downstream breakage report.
2021-02-10 10:26:44 +00:00
Simon Pilgrim 89d9ff8229 [X86][SSE] foldShuffleOfHorizOp - add SHUFPS v4f32 handling
Fold shufps(hop(x,y),hop(z,w)) -> permute(hop(x,z)) - this is very similar to the equivalent unpack fold.

I did start trying to convert foldShuffleOfHorizOp to handle generic shuffle masks but we're relying on a lot of special cases at the moment.
2021-02-09 14:18:45 +00:00
Simon Pilgrim 598ceb25d4 [X86][AVX] Fold extract_subvector(splat, c) -> extract_subvector(splat, 0)
We already do this for VBROADCASTs, extend this for any splat that SelectionDAG::isSplatValue recognises as well.
2021-02-07 11:42:41 +00:00
Craig Topper 6f4f0efd89 [X86] Don't pass a 1 to the second argument of ISD::FP_ROUND in LowerFCOPYSIGN.
I don't think we have any reason to believe the FP_ROUND here doesn't change the value.

Found while trying to see if we still need the fp128 block in CanCombineFCOPYSIGN_EXTEND_ROUND.
Removing that check caused this FP_ROUND to fire for fp128 which introduced a libcall expansion that asserted for this being a 1.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D96098
2021-02-06 10:29:01 -08:00
Simon Pilgrim e117295922 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks.
Now that PR48908 has been dealt with, we can handle v4f64 permute cases by extracting the low/high lane VPERMILPD masks and creating a new mask based on which lanes are referenced by the VPERM2F128 mask.
2021-02-06 15:58:02 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Simon Pilgrim 31b85e1c0b [X86] Use VT::changeVectorElementType helper where possible. NFCI. 2021-02-04 15:03:56 +00:00
Simon Pilgrim fa2cdb8140 [X86] Remove stale TODO comment. NFC.
We now handle implicit zero-extension shuffle mask cases.
2021-02-04 12:14:05 +00:00
Simon Pilgrim 32b7c2fa42 [X86][SSE] Support variable-index float/double vector insertion on SSE41+ targets (PR47924)
Extends D95779 to permit insertion into float/doubles vectors while avoiding a lot of aliased memory traffic.

The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.

SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.

Differential Revision: https://reviews.llvm.org/D95866
2021-02-03 14:14:35 +00:00
Simon Pilgrim 8c2e075c2c [X86][SSE] LowerINSERT_VECTOR_ELT - pull out repeated EltSizeInBits calls. NFCI. 2021-02-02 13:45:18 +00:00
Simon Pilgrim d46a6b3d55 [X86][AVX512] Support variable-index vector insertion on AVX512 targets (PR47924)
With predicate masks, AVX512 can efficiently perform variable-index vector insertion with 2 broadcasts + 1 comparison, avoiding a lot of aliased memory traffic.

Differential Revision: https://reviews.llvm.org/D95779
2021-02-02 11:41:18 +00:00
Philip Reames 46e764a628 [x86] introduce no_callee_saved_registers attribute
This is directly analogous to the existing no_caller_saved_registers, but with the opposite intention.  A function or call so marked shifts the responsibility of spilling the usual CSRs to it's caller.

An indirect call site and callee which don't agree on the attribute is ill defined.

The motivation for this change is that being able to prune callee saves (without modifying other details of the calling convention) is sometimes useful when generating stubs and adapters.  There's no intention to expose this as a source language feature; this is expected to be used by frontends to implement adapters where warranted.

Some specific examples of use cases:
* GC compatible compiled code wants to call an externally defined library function without needing to track pointer values through CSRs.
* debug enabled code wants to call precompiled library which doesn't provide enough information to track CSRs while preserving debug quality in caller.
* adapter stub entering hand written assembler which doesn't follow normal calling conventions.
2021-02-01 16:19:14 -08:00
Philip Reames 9d09db941f [NFC][X86] Use CallBase interface to simplify code 2021-02-01 15:24:41 -08:00
Philip Reames bb6c23b1f5 [NFC][X86] Avoid redundant work inspecting callee 2021-02-01 15:24:41 -08:00
Simon Pilgrim e640b209b2 [X86][SSE] LowerScalarImmediateShift - use APInt::getLowBitsSet for vXi8 ISD::SRL mask generation. NFCI.
Match what we do for ISD::SHL
2021-02-01 18:17:40 +00:00
Simon Pilgrim 5211af4818 [X86][AVX] combineExtractWithShuffle - combine extracts from 256/512-bit vector shuffles.
We can only legally extract from the lowest 128-bit subvector, so extract the correct subvector to allow us to handle 256/512-bit vector element extracts.
2021-02-01 10:31:43 +00:00
Simon Pilgrim d6b68d1344 [X86][SSE] combineExtractWithShuffle - support zero-extending to allow extracting from narrow shuffle masks
If the shuffle mask can't be widened to match the original extracted element width, see if the upper bits are zeroable - which allows us to extract+zero-extend the smaller extraction.
2021-01-29 14:22:10 +00:00
Simon Pilgrim f84efe97bc [X86][AVX] combineHorizOpWithShuffle - fix valuetype comparison typo.
Ensure we check the valuetypes of all the HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) shuffle input ops - there was a copy+paste typo (noticed by MSVC analyzer) that meant we were checking the same input from one of the shuffles twice.

I haven't been able to create a test case for this yet - I don't think its currently possible to create a target/faux binary shuffle that scales to a 2x128 shuffle mask from two different value types.
2021-01-28 16:36:23 +00:00
Simon Pilgrim 6663330bc8 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - don't merge VPERMILPD ops with different low/high masks.
Unlike VPERMILPS, VPERMILPD can have non-repeating masks in each 128-bit subvector, we weren't accounting for this when folding vperm2f128(vpermilpd(x,c),vpermilpd(y,c)) -> vpermilpd(vperm2f128(x,y),c).

I'm intending to add support for this but wanted to get a minimal fix in first for merging into 12.xx.

Fixes PR48908
2021-01-28 12:11:31 +00:00
Freddy Ye b3b0acdc6f [NFC] Refine some uninitialized used variables.
These warning are reported by static code analysis tool: Klocwork

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D95421
2021-01-26 16:51:05 +08:00
Simon Pilgrim 13f2aee783 [X86][AVX] Generalize vperm2f128/vperm2i128 patterns to support all legal 256-bit vector types
Remove bitcasts to/from v4x64 types through vperm2f128/vperm2i128 ops to help improve shuffle combining and demanded vector elts folding.
2021-01-25 15:35:36 +00:00
Simon Pilgrim 821a51a9ca [X86][AVX] combineX86ShuffleChainWithExtract - widen to at least original root size. NFCI.
We're relying on the source inputs for shuffle combining having already been widened to the root size (otherwise the offset logic falls over) - we're going to be supporting different sized shuffle inputs soon, so we need to explicitly make the minimum widened width the original root size.
2021-01-25 13:45:37 +00:00
Simon Pilgrim 1b780cf32e [X86][AVX] LowerTRUNCATE - avoid bitcasts around extract_subvectors.
We allow extract_subvector lowering of all legal types, so pre-bitcast the source type to try and reduce bitcast pollution.
2021-01-25 12:10:36 +00:00
Simon Pilgrim f461e35cba [X86][AVX] combineX86ShuffleChain - avoid bitcasts around insert_subvector() shuffle patterns.
We allow insert_subvector lowering of all legal types, so don't always cast to the vXi64/vXf64 shuffle types - this is only necessary for X86ISD::SHUF128/X86ISD::VPERM2X128 patterns later.
2021-01-25 11:35:45 +00:00
Fangrui Song d745b82de1 [XRay] Support DW_TAG_call_site and delete unneeded PATCHABLE_EVENT_CALL/PATCHABLE_TYPED_EVENT_CALL lowering 2021-01-25 00:49:18 -08:00
Simon Pilgrim bd122f6d21 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle vperm2x128(movddup(x),movddup(y)) cases
Fold vperm2x128(movddup(x),movddup(y)) -> movddup(vperm2x128(x,y))
2021-01-22 16:05:19 +00:00
Simon Pilgrim c33d36e066 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle unary vperm2x128(permute/shift(x,c),undef) cases
Fold vperm2x128(permute/shift(x,c),undef) -> permute/shift(vperm2x128(x,undef),c)
2021-01-22 15:47:23 +00:00
Simon Pilgrim 4846f6ab81 [X86][AVX] combineTargetShuffle - simplify the X86ISD::VPERM2X128 subvector matching
Simplify vperm2x128(concat(X,Y),concat(Z,W)) folding.

Use collectConcatOps / ISD::INSERT_SUBVECTOR to find the source subvectors instead of hardcoded immediate matching.
2021-01-22 15:47:22 +00:00
Simon Pilgrim b1166e1317 [X86][AVX] combineX86ShufflesRecursively - attempt to constant fold before widening shuffle inputs
combineX86ShufflesConstants/canonicalizeShuffleMaskWithHorizOp can both handle/earlyout shuffles with inputs of different widths, so delay widening as late as possible to make it easier to match constant folds etc.

The plan is to eventually move the widening inside combineX86ShuffleChain so that we don't create any new nodes unless we successfully combine the shuffles.
2021-01-22 13:19:35 +00:00
Simon Pilgrim ffe72f987f [X86][SSE] Don't fold shuffle(binop(),binop()) -> binop(shuffle(),shuffle()) if the shuffle are splats
rGbe69e66b1cd8 added the fold, but DAGCombiner.visitVECTOR_SHUFFLE doesn't merge shuffles if the inner shuffle is a splat, so we need to bail.

The non-fast-horiz-ops paths see some minor regressions, we might be able to improve on this after lowering to target shuffles.

Fix PR48823
2021-01-22 11:31:38 +00:00
Simon Pilgrim 86021d98d3 [X86] Avoid a std::string copy by replacing auto with const auto&. NFC.
Fixes msvc analyzer warning.
2021-01-21 11:04:07 +00:00
Max Kazantsev d6bb96e677 [X86] Add experimental option to separately tune alignment of innermost loops
We already have an experimental option to tune loop alignment. Its impact
is very wide (and there is a suspicion that it's not always profitable). We want
to have something more narrow to play with. This patch adds similar option that
overrides preferred alignment for innermost loops. This is for experimental
purposes, default values do not change the existing behavior.

Differential Revision: https://reviews.llvm.org/D94895
Reviewed By: pengfei
2021-01-21 11:15:16 +07:00
Simon Pilgrim b8b5e87e6b [X86][AVX] Handle vperm2x128 shuffling of a subvector splat.
We already handle "vperm2x128 (ins ?, X, C1), (ins ?, X, C1), 0x31" for shuffling of the upper subvectors, but we weren't dealing with the case when we were splatting the upper subvector from a single source.
2021-01-20 18:16:33 +00:00
Simon Pilgrim 19d02842ee [X86][AVX] Fold extract_subvector(VSRLI/VSHLI(x,32)) -> VSRLI/VSHLI(extract_subvector(x),32)
As discussed on D56387, if we're shifting to extract the upper/lower half of a vXi64 vector then we're actually better off performing this at the subvector level as its very likely to fold into something.

combineConcatVectorOps can perform this in reverse if necessary.
2021-01-20 14:34:54 +00:00
Simon Pilgrim 5626adcd6b [X86][SSE] combineVectorSignBitsTruncation - fold trunc(srl(x,c)) -> packss(sra(x,c))
If a srl doesn't introduce any sign bits into the truncated result, then replace with a sra to let us use a PACKSS truncation - fixes a regression noticed in D56387 on pre-SSE41 targets that don't have PACKUSDW.
2021-01-19 11:04:13 +00:00
Sanjay Patel d27bb5c375 [x86] add cast to avoid compile-time warning; NFC 2021-01-18 17:47:04 -05:00
Simon Pilgrim ce06475da9 [X86][AVX] IsElementEquivalent - add matchShuffleWithUNPCK + VBROADCAST/VBROADCAST_LOAD handling
Specify LHS/RHS operands in matchShuffleWithUNPCK's calls to isTargetShuffleEquivalent, and handle VBROADCAST/VBROADCAST_LOAD matching in IsElementEquivalent
2021-01-18 15:55:00 +00:00
Simon Pilgrim 770d1e0a88 [X86][SSE] isHorizontalBinOp - reuse any existing horizontal ops.
If we already have similar horizontal ops using the same args, then match that, even if we are on a target with slow horizontal ops.
2021-01-18 10:14:45 +00:00
Kazu Hirata 2082b10d10 [llvm] Use *::empty (NFC) 2021-01-16 09:40:55 -08:00
Simon Pilgrim be69e66b1c [X86][SSE] Attempt to fold shuffle(binop(),binop()) -> binop(shuffle(),shuffle())
If this will help us fold shuffles together, then push the shuffle through the merged binops.

Ideally this would be performed in DAGCombiner::visitVECTOR_SHUFFLE but getting an efficient+legal merged shuffle can be tricky - on SSE we can be confident that for 32/64-bit elements vectors shuffles should easily fold.
2021-01-15 16:25:25 +00:00
Simon Pilgrim 1dfd5c9ad8 [X86][AVX] combineHorizOpWithShuffle - support target shuffles in HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
Be more aggressive on (AVX2+) folds of lane shuffles of 256-bit horizontal ops by working on target/faux shuffles as well.
2021-01-15 13:55:30 +00:00
Simon Pilgrim cbbfc82586 [X86][SSE] canonicalizeShuffleMaskWithHorizOp - simplify shuffle(HOP(HOP(X,Y),HOP(Z,W))) style chains.
See if we can remove the shuffle by resorting a HOP chain so that the HOP args are pre-shuffled.

This initial version just handles (the most common) v4i32/v4f32 hadd/hsub reduction patterns - future work can extend this to v8i16 types plus PACK chains (2f64 HADD/HSUB should already be handled in the half-lane combine code later on).
2021-01-13 17:19:40 +00:00
Simon Pilgrim 0a0ee7f5a5 [X86] canonicalizeShuffleMaskWithHorizOp - minor refactor to support multiple src ops. NFCI.
canonicalizeShuffleMaskWithHorizOp currently only supports shuffles with 1 or 2 sources, but PR41813 will require us to support higher numbers of sources.

This patch just generalizes the initial setup stages to ensure all src ops are the same type and opcode and then will continue to early out if we have more than 2 sources.
2021-01-13 13:59:56 +00:00
Simon Pilgrim 0f59d09957 [X86][AVX] combineVectorSignBitsTruncation - limit AVX512 truncations to 128-bits (PR48727)
rG73a44f437bf1 result in 256-bit packss/packus ops with additional shuffles that shuffle combining can sometimes try to convert back into a truncation.
2021-01-13 10:38:23 +00:00
Bevin Hansson 07605ea1f3 [X86] Improved lowering for saturating float to int.
Adapted from D54696 by @nikic.

This patch improves lowering of saturating float to
int conversions, FP_TO_[SU]INT_SAT, for X86.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D86079
2021-01-12 15:44:41 +01:00
Simon Pilgrim 2ed914cb7e [X86][SSE] getFauxShuffleMask - handle PACKSS(SRAI(),SRAI()) shuffle patterns.
We can't easily treat ASHR a faux shuffle, but if it was just feeding a PACKSS then it was likely being used as sign-extension for a truncation, so just peek through and adjust the mask accordingly.
2021-01-12 14:07:53 +00:00
Simon Pilgrim 7e44208115 [X86][SSE] combineSubToSubus - add v16i32 handling on pre-AVX512BW targets.
v16i32 -> v16i16/v8i16 truncation is now good enough using PACKSS/PACKUS + shuffle combining that its no longer necessary to early-out on pre-AVX512BW targets.

This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
2021-01-12 13:44:11 +00:00
Simon Pilgrim a5212b5c91 [X86][SSE] combineSubToSubus - remove SSE2 early-out.
SSE2 truncation codegen has improved over the past few years (mainly due to better shuffle lowering/combining and computeKnownBits) - its no longer necessary to early-out from v8i32/v8i64 truncations.

This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
2021-01-12 12:52:11 +00:00
Simon Pilgrim 4214ca9614 [X86][AVX] Attempt to fold vpermf128(op(x,i),op(y,i)) -> op(vpermf128(x,y),i)
If vpermf128/vpermi128 is acting on 2 similar 'inlane' ops, then try to perform the vpermf128 first which will allow us to merge the ops.

This will help us fix one of the regressions in D56387
2021-01-11 16:59:25 +00:00
Simon Pilgrim 41bf338dd1 Revert rGd43a264a5dd3 "Revert "[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())""
This reapplies commit rG80dee7965dffdfb866afa9d74f3a4a97453708b2.

[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())

UNPCKL/UNPCKH only uses one op from each hop, so we can merge the hops and then permute the result.

REAPPLIED with a fix for unary unpacks of HOP.
2021-01-11 11:29:04 +00:00
Nico Weber d43a264a5d Revert "[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())"
This reverts commit 80dee7965d.
Makes clang sometimes hang forever. See
https://bugs.chromium.org/p/chromium/issues/detail?id=1164786#c6 for a
stand-alone repro.
2021-01-10 20:22:53 -05:00
Simon Pilgrim 80dee7965d [X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())
UNPCKL/UNPCKH only uses one op from each hop, so we can merge the hops and then permute the result.
2021-01-08 15:22:17 +00:00
Simon Pilgrim 73a44f437b [X86][AVX] combineVectorSignBitsTruncation - use PACKSS/PACKUS in more AVX cases
AVX512 has fast truncation ops, but if the truncation source is a concatenation of subvectors then its likely that we can use PACK more efficiently.

This is only guaranteed to work for truncations to 128/256-bit vectors as the PACK works across 128-bit sub-lanes, for now I've just disabled 512-bit truncation cases but we need to get them working eventually for D61129.
2021-01-05 15:01:45 +00:00
Kazu Hirata 985f899bf2 [Target] Use llvm::append_range (NFC) 2021-01-03 09:57:43 -08:00
Fangrui Song 6be0b9a8dd [X86] Don't fold negative offset into 32-bit absolute address (e.g. movl $foo-1, %eax)
When building abseil-cpp `bin/absl_hash_test` with Clang in -fno-pic
mode, an instruction like `movl $foo-2147483648, $eax` may be produced
(subtracting a number from the address of a static variable). If foo's
address is smaller than 2147483648, GNU ld/gold/LLD will error because
R_X86_64_32 cannot represent a negative value.

```
using absl::Hash;
struct NoOp {
  template < typename HashCode >
  friend HashCode AbslHashValue(HashCode , NoOp );
};
template <typename> class HashIntTest : public testing::Test {};
TYPED_TEST_SUITE_P(HashIntTest);
TYPED_TEST_P(HashIntTest, BasicUsage) {
  if (std::numeric_limits< TypeParam >::min )
    EXPECT_NE(Hash< NoOp >()({}),
              Hash< TypeParam >()(std::numeric_limits< TypeParam >::min()));
}
REGISTER_TYPED_TEST_CASE_P(HashIntTest, BasicUsage);
using IntTypes = testing::Types< int32_t>;
INSTANTIATE_TYPED_TEST_CASE_P(My, HashIntTest, IntTypes);

ld: error: hash_test.cc:(function (anonymous namespace)::gtest_suite_HashIntTest_::BasicUsage<int>::TestBody(): .text+0x4E472): relocation R_X86_64_32 out of range: 18446744071564237392 is not in [0, 4294967295]; references absl::hash_internal::HashState::kSeed
```

Actually any negative offset is not allowed because the symbol address
can be zero (e.g. set by `-Wl,--defsym=foo=0`). So disallow such folding.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D93931
2020-12-30 18:47:26 -08:00
Luo, Yuanke 981a0bd858 [X86] Add x86_amx type for intel AMX.
The x86_amx is used for AMX intrisics. <256 x i32> is bitcast to x86_amx when
it is used by AMX intrinsics, and x86_amx is bitcast to <256 x i32> when it
is used by load/store instruction. So amx intrinsics only operate on type x86_amx.
It can help to separate amx intrinsics from llvm IR instructions (+-*/).
Thank Craig for the idea. This patch depend on https://reviews.llvm.org/D87981.

Differential Revision: https://reviews.llvm.org/D91927
2020-12-30 13:52:13 +08:00
Simon Pilgrim 8767f3bb97 [X86][AVX] Remove X86ISD::SUBV_BROADCAST (PR38969)
Followup to D92645 - remove the remaining places where we create X86ISD::SUBV_BROADCAST, and fold splatted vector loads to X86ISD::SUBV_BROADCAST_LOAD instead.

Remove all the X86SubVBroadcast isel patterns, including all the fallbacks for if memory folding failed.
2020-12-18 15:49:53 +00:00
Simon Pilgrim 992fad03e2 [X86][AVX] Replace extract_subvector(broadcast(), 0) folds with generic SimplifyDemandedVectorEltsForTargetNode handling.
Simplifies a few more cases, notably shuffle demanded elts cases.
2020-12-18 11:51:10 +00:00
Simon Pilgrim 931e66bd89 [X86] Remove extract_subvector(subv_broadcast_load()) fold.
This was needed in an earlier version of D92645, but isn't now - and I've just noticed that it was potentially flawed depending on the relevant widths of the broadcasted and extracted subvectors.
2020-12-17 11:02:49 +00:00
Simon Pilgrim cdb692ee0c [X86] Add X86ISD::SUBV_BROADCAST_LOAD and begin removing X86ISD::SUBV_BROADCAST (PR38969)
Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.

This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.

As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.

Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.

Differential Revision: https://reviews.llvm.org/D92645
2020-12-17 10:25:25 +00:00
QingShan Zhang ebdd20f430 Expand the fp_to_int/int_to_fp/fp_round/fp_extend as libcall for fp128
X86 and AArch64 expand it as libcall inside the target. And PowerPC also
want to expand them as libcall for P8. So, propose an implement in the
legalizer to common the logic and remove the code for X86/AArch64 to
avoid the duplicate code.

Reviewed By: Craig Topper

Differential Revision: https://reviews.llvm.org/D91331
2020-12-17 07:59:30 +00:00
Simon Pilgrim 553808d456 [X86] Rename reduction combiners to make it clearer whats happening. NFCI.
Since these are all working on reduction patterns, actually use that term in the function name to make them easier to search for.

At some point we're likely to start working with the ISD::VECREDUCE_* opcodes directly in the x86 backend, but that is still some way off.
2020-12-16 14:48:21 +00:00
Simon Pilgrim e55f7de946 [X86][SSE] combineReductionToHorizontal - don't rely on widenSubVector to handle illegal vector types.
Thanks to @asbirlea for reporting the bug.
2020-12-16 11:24:40 +00:00
Simon Pilgrim 712117338a [X86] Explicitly use SDValue instead of auto. NFCI.
Fix static analyzer warning about not using a SDValue&
2020-12-15 17:27:25 +00:00
Simon Pilgrim b0e5aea557 [X86] Remove unnecessary SUBV_BROADCAST combines. NFCI.
Noticed while dealing with D92645 - these are now handled by getFauxShuffleMask + shuffle combining code.
2020-12-15 16:54:34 +00:00
Simon Pilgrim bd07092669 [X86] Remove trailing whitespace. NFC. 2020-12-15 10:11:38 +00:00
Simon Pilgrim 15a31389b2 [X86][AVX] LowerBUILD_VECTOR - reduce 256/512-bit build vectors with zero/undef upper elements + pad.
As discussed on D92645, we don't do a good job of recognising when we don't require the full width of a ymm/zmm build vector because the upper elements are undef/zero.

This commit allows us to make use of implicit zeroing of upper elements with AVX instructions, which we emulate in DAG with a INSERT_SUBVECTOR into the bottom of a undef/zero vector of the original type.

This exposed a limitation in getTargetConstantBitsFromNode which didn't extract bits from INSERT_SUBVECTORs of different element widths which I've included as well to prevent a couple of regressions.
2020-12-15 10:11:38 +00:00
Harald van Dijk 9eac818370
[X86] Fix variadic argument handling for x32
The X86-64 ABI defines va_list as

  typedef struct {
    unsigned int gp_offset;
    unsigned int fp_offset;
    void *overflow_arg_area;
    void *reg_save_area;
  } va_list[1];

This means the size, alignment, and reg_save_area offset will depend on
whether we are in LP64 or in ILP32 mode, so this commit adds the checks.
Additionally, the VAARG_64 pseudo-instruction assumed 64-bit pointers, so
this commit adds a VAARG_X32 pseudo-instruction that behaves just like
VAARG_64, except for assuming 32-bit pointers.

Some of these changes were originally done by
Michael Liao <michael.hliao@gmail.com>.

Fixes https://bugs.llvm.org/show_bug.cgi?id=48428.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D93160
2020-12-14 23:47:27 +00:00
Simon Pilgrim 5f5a2547c1 [X86] LowerBUILD_VECTOR - track zero/nonzero elements with APInt masks. NFCI.
Prep work for undef/zero 'upper elements' handling as proposed in D92645.
2020-12-14 16:28:45 +00:00
Kazu Hirata 913515e465 [Target] Use llvm::is_contained (NFC) 2020-12-13 19:35:10 -08:00
Simon Pilgrim d5c434d7dd [X86][SSE] combineX86ShufflesRecursively - add basic handling for combining shuffles of different widths (PR45974)
If a faux shuffle uses smaller shuffle inputs, try to recursively combine with those inputs directly instead of widening them immediately. Then widen all smaller inputs at the bottom of the recursion.

This will still mean we're generating nodes on the fly (PR45974) even if we don't combine to a new shuffle but it does help AVX2+ targets combine across xmm/ymm/zmm types, mainly as variable shuffles.
2020-12-13 17:18:07 +00:00
Simon Pilgrim 47321c311b [X86][SSE] combineReductionToHorizontal - add vXi8 ISD::MUL reduction handling (PR39709)
Default expansion leads to repeated extensions/truncations to/from vXi16 which shuffle combining and demanded elts can't completely unravel.

Better just to promote (any_extend) the input and perform a vXi16 reduction.

We'll be able to remove a lot of this if we ever get decent legalization support for reduction intrinsics in SelectionDAG.
2020-12-13 15:22:54 +00:00
Luo, Yuanke f80b29878b [X86] AMX programming model.
This patch implements amx programming model that discussed in llvm-dev
 (http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
 Thank Hal for the good suggestion in the RA. The fast RA is not in the patch yet.
 This patch implemeted 7 components.

1. The c interface to end user.
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
   type into two <128 x i32>.
4. The Lowering from AMX intrinsics to AMX pseudo instruction.
5. Insert psuedo ldtilecfg and build the def-use between ldtilecfg to amx
   intruction.
6. The register allocation for tile register.
7. Morph AMX pseudo instruction to AMX real instruction.

Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0

Differential Revision: https://reviews.llvm.org/D87981
2020-12-10 17:01:54 +08:00
Saleem Abdulrasool ee74d1b420 X86: use a data driven configuration of Windows x86 libcalls (NFC)
Rather than creating a series of associated calls and ensuring that
everything is lined up, use a table driven approach that ensures that
they two always stay in sync.
2020-12-09 22:49:11 +00:00
Simon Pilgrim 24184dbb82 [X86] Fold CONCAT(VPERMV3(X,Y,M0),VPERMV3(Z,W,M1)) -> VPERMV3(CONCAT(X,Z),CONCAT(Y,W),CONCAT(M0,M1))
Further prep work toward supporting different subvector sizes in combineX86ShufflesRecursively
2020-12-09 14:29:32 +00:00
Kerry McLaughlin 4519ff4b6f [SVE][CodeGen] Add the ExtensionType flag to MGATHER
Adds the ExtensionType flag, which reflects the LoadExtType of a MaskedGatherSDNode.
Also updated SelectionDAGDumper::print_details so that details of the gather
load (is signed, is scaled & extension type) are printed.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D91084
2020-12-09 11:19:08 +00:00
Harald van Dijk 29c8ea6f1a
[X86] Handle localdynamic TLS model in x32 mode
D92346 added TLS_(base_)addrX32 to handle TLS in x32 mode, but missed the
different TLS models. This diff fixes the logic for the local dynamic model
where `RAX` was used when `EAX` should be, and extends the tests to cover
all four TLS models.

Fixes https://bugs.llvm.org/show_bug.cgi?id=26472.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D92737
2020-12-08 21:06:00 +00:00
Tim Northover c5978f42ec UBSAN: emit distinctive traps
Sometimes people get minimal crash reports after a UBSAN incident. This change
tags each trap with an integer representing the kind of failure encountered,
which can aid in tracking down the root cause of the problem.
2020-12-08 10:28:26 +00:00
Simon Pilgrim 0101fb73de [X86] Fold MOVMSK(ICMP_SGT(X,-1)) -> NOT(MOVMSK(X)))
Noticed while triaging PR37506
2020-12-06 17:56:41 +00:00
Layton Kifer ac522f8700 [DAGCombiner] Fold (sext (not i1 x)) -> (add (zext i1 x), -1)
Move fold of (sext (not i1 x)) -> (add (zext i1 x), -1) from X86 to DAGCombiner to improve codegen on other targets.

Differential Revision: https://reviews.llvm.org/D91589
2020-12-06 11:52:10 -05:00
Simon Pilgrim b96a521077 [X86] LowerRotate - enable custom lowering of ROTL/ROTR vXi16 on VBMI2 targets. 2020-12-04 12:16:59 +00:00
Simon Pilgrim d073805be6 [X86] LowerRotate - VBMI2 targets can lower vXi16 rotates using funnel shifts.
Ideally we'd do this inside DAGCombine but until we can make the FSHL/FSHR opcodes legal for VBMI2 it won't help us.
2020-12-04 11:29:23 +00:00
Simon Pilgrim df1ddc4234 [X86] Let VBMI2 non-VLX targets still use funnel shifts instructions 2020-12-04 11:06:43 +00:00
Simon Pilgrim 8eedd18fcb [X86] Remove unnecessary bitcast. NFC.
The X86ISD::SUBV_BROADCAST node is already VT
2020-12-04 09:44:57 +00:00
Xiang1 Zhang f2e2924463 [X86] Unbind the ebx with GOT address in regcall calling convention
No register can be allocated for indirect call when it use regcall calling
convention and passed 5/5+ args.
For example:
call vreg (ag1, ag2, ag3, ag4, ag5, ...) --> 5 regs (EAX, ECX, EDX, ESI, EDI)
used for pass args, 1 reg (EBX )used for hold GOT point, so no regs can be
allocated to vreg.

The Intel386 architecture provides 8 general purpose 32-bit registers. RA
mostly use 6 of them (EAX, EBX, ECX, EDX, ESI, EDI). 5 of this regs can be
used to pass function arguments (EAX, ECX, EDX, ESI, EDI).
EBX used to hold the GOT pointer when making function calls via the PLT.
ESP and EBP usually be "reserved" in register allocation.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D91020
2020-12-04 10:00:13 +08:00
Harald van Dijk c9be4ef184
[X86] Add TLS_(base_)addrX32 for X32 mode
LLVM has TLS_(base_)addr32 for 32-bit TLS addresses in 32-bit mode, and
TLS_(base_)addr64 for 64-bit TLS addresses in 64-bit mode. x32 mode wants 32-bit
TLS addresses in 64-bit mode, which were not yet handled. This adds
TLS_(base_)addrX32 as copies of TLS_(base_)addr64, except that they use
tls32(base)addr rather than tls64(base)addr, and then restricts
TLS_(base_)addr64 to 64-bit LP64 mode, TLS_(base_)addrX32 to 64-bit ILP32 mode.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D92346
2020-12-02 22:20:36 +00:00
Simon Pilgrim f019362329 [X86] EltsFromConsecutiveLoads - remove old FIXME comment. NFC.
Its unlikely an undef element in a zero vector will be any use.
2020-12-02 17:21:41 +00:00
Simon Pilgrim 3900ec6f05 [X86] combineX86ShufflesRecursively - remove old FIXME comment. NFC.
Its unlikely an undef element in a zero vector will be any use, and SimplifyDemandedVectorElts now calls combineX86ShufflesRecursively so its unlikely we actually have a dependency on these specific elements.
2020-12-02 16:29:38 +00:00
Simon Pilgrim 0dab7ecc5d [X86] EltsFromConsecutiveLoads - pull out repeated NumLoadedElts. NFCI. 2020-12-02 16:29:37 +00:00
Simon Pilgrim 1b209ff9e3 [DAG] Move vselect(icmp_ult, 0, sub(x,y)) -> usubsat(x,y) to DAGCombine (PR40111)
Move the X86 VSELECT->USUBSAT fold to DAGCombiner - there's nothing target specific about these folds.
2020-12-01 14:25:29 +00:00
Simon Pilgrim 6dbd0d36a1 [DAG] Move vselect(icmp_ult, -1, add(x,y)) -> uaddsat(x,y) to DAGCombine (PR40111)
Move the X86 VSELECT->UADDSAT fold to DAGCombiner - there's nothing target specific about these folds.

The SSE42 test diffs are relatively benign - its avoiding an extra constant load in exchange for an extra xor operation - there are extra register moves, which is annoying as all those operations should commute them away.

Differential Revision: https://reviews.llvm.org/D91876
2020-12-01 11:56:26 +00:00
Harald van Dijk cdac34bd47
[X86] Zero-extend pointers to i64 for x86_64
For LP64 mode, this has no effect as pointers are already 64 bits.
For ILP32 mode (x32), this extension is specified by the ABI.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D91338
2020-11-30 18:51:23 +00:00
Simon Pilgrim 83d79ca5bf [X86][AVX512] Only lower to VPALIGNR if we have BWI (PR48322) 2020-11-30 10:51:24 +00:00
Simon Pilgrim 969918e177 [DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.

Allows us to move the x86-specific lowering to the generic expansion code.

Differential Revision: https://reviews.llvm.org/D92183
2020-11-27 11:18:58 +00:00
Simon Pilgrim 8057ebf4a0 Revert rG12d59b696b330 "[DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal"
This reverts commit 12d59b696b.

Prematurely pushed this to trunk
2020-11-26 15:07:45 +00:00
Simon Pilgrim 12d59b696b [DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.

Allows us to move the x86-specific lowering to the generic expansion code.
2020-11-26 14:47:28 +00:00
Simon Pilgrim 791040cd8b [DAG] LowerMINMAX - move default expansion to generic TargetLowering::expandIntMINMAX
This is part of the discussion on D91876 about trying to reduce custom lowering of MIN/MAX ops on older SSE targets - if we can improve generic vector expansion we should be able to relax the limitations in SelectionDAGBuilder when it will let MIN/MAX ops be generated, and avoid having to flag so many ops as 'custom'.
2020-11-22 13:02:27 +00:00
Simon Pilgrim 0341029bb4 [X86][AVX] LowerADDSAT_SUBSAT - avoid X86ISD::BLENDV in UADDSAT/USUBSAT v8i32/v4i64 lowering
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on SSE targets.

Enable custom lowering for v8i32/v4i64 and generalize the 128-bit lowering code for any vector size - this also lets us use the slightly cheaper codegen for icmp_ugt instead of umin/umax.
2020-11-20 18:16:44 +00:00
Craig Topper a7eae62a42 [SelectionDAG][X86][PowerPC][Mips] Replace the default implementation of LowerOperationWrapper with the X86 and PowerPC version.
The default version only works if the returned node has a single
result. The X86 and PowerPC versions support multiple results
and allow a single result to be returned from a node with
multiple outputs. And allow a single result that is not result 0
of the node.

Also replace the Mips version since the new version should work
for it. The original version handled multiple results, but only
if the new node and original node had the same number of results.

Differential Revision: https://reviews.llvm.org/D91846
2020-11-20 10:06:53 -08:00
Simon Pilgrim 09a081f221 [X86][SSE] LowerADDSAT_SUBSAT - avoid X86ISD::BLENDV in UADDSAT/USUBSAT custom lowering
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on pre-SSE4 targets.

We're still using X86ISD::BLENDV on some AVX targets as we don't do custom lowering for >= 256-bit vectors.

Really this (and combineVSelectWithAllOnesOrZeros) needs moving to DAGCombiner, but pre-SSE42 we see the vXi64 comparison type as a 2 x 32-bits result so we can't just rely on ComputeNumSignBits to give us the 'all bits' result we need.
2020-11-20 16:53:01 +00:00
Simon Pilgrim 14ae02fb33 [X86][AVX] Only share broadcasts of different widths from the same SDValue of the same SDNode (PR48215)
D57663 allowed us to reuse broadcasts of the same scalar value by extracting low subvectors from the widest type.

Unfortunately we weren't ensuring the broadcasts were from the same SDValue, just the same SDNode - which failed on multiple-value nodes like ISD::SDIVREM

FYI: I intend to request this be merged into the 11.x release branch.

Differential Revision: https://reviews.llvm.org/D91709
2020-11-19 12:15:18 +00:00
Craig Topper f0b0bab34d [X86] Use GF2P8AFFINEQB to implement vector bitreverse.
We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.

We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW.

I think we might be able to use this for scalars too by moving into a vector and back. But I'll save that for a follow up as its a little more involved.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D91515
2020-11-17 23:49:06 -08:00
Craig Topper 57c0c4a275 [X86] Fix crash with i64 bitreverse on 32-bit targets with XOP.
We unconditionally marked i64 as Custom, but did not install a
handler in ReplaceNodeResults when i64 isn't legal type. This
leads to ReplaceNodeResults asserting.

We have two options to fix this. Only mark i64 as Custom on
64-bit targets and let it expand to two i32 bitreverses which
each need a VPPERM. Or the other option is to add the Custom
handling to ReplaceNodeResults. This is what I went with.
2020-11-15 19:02:34 -08:00
Craig Topper 114f044640 [X86] Use EVT::getIntegerVT instead of MVT::getIntegerVT where the type can be i2 or i4.
This was a mistake introduced in D91294. I'm not sure how to
exercise this with the existing code, but I hit it while trying
some follow up experiments.
2020-11-12 21:48:45 -08:00
Craig Topper a4124e455e [X86] When storing v1i1/v2i1/v4i1 to memory, make sure we store zeros in the rest of the byte
We can't store garbage in the unused bits. It possible that something like zextload from i1/i2/i4 is created to read the memory. Those zextloads would be legalized assuming the extra bits are 0.

I'm not sure that the code in lowerStore is executed for the v1i1/v2i1/v4i1 case. It looks like the DAG combine in combineStore may have converted them to v8i1 first. And I think we're missing some cases to avoid going to the stack in the first place. But I don't have time to investigate those things at the moment so I wanted to focus on the correctness issue.

Should fix PR48147.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D91294
2020-11-12 21:28:18 -08:00
Simon Pilgrim 1a62ca65c1 [KnownBits] Add KnownBits::commonBits helper. NFCI.
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them........
2020-11-11 12:15:54 +00:00
Kerry McLaughlin ffbbfc76ca [SVE][CodeGen] Add the isTruncatingStore flag to MSCATTER
This patch adds the IsTruncatingStore flag to MaskedScatterSDNode, set by getMaskedScatter().
Updated SelectionDAGDumper::print_details for MaskedScatterSDNode to print
the details of masked scatters (is truncating, signed or scaled).

This is the first in a series of patches which adds support for scalable masked scatters

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D90939
2020-11-11 10:58:24 +00:00
Gaurav Jain 3726b14428 [NFC] Use [MC]Register for x86 target
Differential Revision: https://reviews.llvm.org/D91161
2020-11-10 15:49:39 -08:00
Craig Topper f40925aa8b [X86] Improve lowering of fptoui
Invert the select condition when masking in the sign bit of a fptoui operation. Also, rather than lowering the sign mask to select/xor and expecting the select to get cleaned up later, directly lower to shift/xor.

Patch by Layton Kifer!

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D90658
2020-11-07 23:50:03 -08:00
Simon Pilgrim 9e406ee808 [X86] Make some basic VarArgsLoweringHelper helper methods const. NFCI.
Fixes a number of cppcheck remarks.
2020-10-31 12:16:49 +00:00
serge-sans-paille 0f60bcc36c [stack-clash] Fix probing of dynamic alloca
- Perform the probing in the correct direction.
  Related to https://github.com/rust-lang/rust/pull/77885#issuecomment-711062924

- The first touch on a dynamic alloca cannot use a mov because it clobbers
  existing space. Use a xor 0 instead

Differential Revision: https://reviews.llvm.org/D90216
2020-10-30 15:34:00 +01:00
Benjamin Kramer 35f7cbf9df [X86] Don't crash on CVTPS2PH with wide vector inputs. 2020-10-27 14:42:02 +01:00
Craig Topper 63ba82ed00 [X86] Use TargetConstant for immediates for VASTART_SAVE_XMM_REGS. 2020-10-25 12:52:56 -07:00
Craig Topper 2ed16aa66f [X86] Use TargetConstant instead of Constant for operands to X86vaarg64. 2020-10-25 12:24:59 -07:00
Craig Topper a222d832d5 [X86] Use TargetConstant for FPDiff with X86::TC_RETURN.
It's required to be a constant and can never be in a register so
make it explicit.
2020-10-25 00:29:11 -07:00
Simon Pilgrim ce356e1546 [DAG] Add BuildVectorSDNode::getRepeatedSequence helper to recognise multi-element splat patterns
Replace the X86 specific isSplatZeroExtended helper with a generic BuildVectorSDNode method.

I've just used this to simplify the X86ISD::BROADCASTM lowering so far (and remove isSplatZeroExtended), but we should be able to use this in more places to lower to complex broadcast patterns.

Differential Revision: https://reviews.llvm.org/D87930
2020-10-24 12:23:09 +01:00
Simon Pilgrim 936ef89ebe [X86] lowerShuffleWithPERMV - use MVT::changeTypeToInteger helper. NFCI. 2020-10-23 12:35:27 +01:00
Simon Pilgrim 794dc7ad26 [CodeGen] Split MVT::changeTypeToInteger() functionality from EVT::changeTypeToInteger().
Add the MVT equivalent handling for EVT changeTypeToInteger/changeVectorElementType/changeVectorElementTypeToInteger.

All the SimpleVT code already exists inside the EVT equivalents, but by splitting this out we can use these directly inside MVT types without converting to/from EVT.
2020-10-22 14:27:42 +01:00
Tianqing Wang be39a6fe6f [X86] Add User Interrupts(UINTR) instructions
For more details about these instructions, please refer to the latest
ISE document:
https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D89301
2020-10-22 17:33:07 +08:00
Xiang1 Zhang 7c3fea7721 [X86] Support customizing stack protector guard
Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D88631
2020-10-22 10:08:14 +08:00
Craig Topper 9e884169a2 [FPEnv][X86][SystemZ] Use different algorithms for i64->double uint_to_fp under strictfp to avoid producing -0.0 when rounding toward negative infinity
Some of our conversion algorithms produce -0.0 when converting unsigned i64 to double when the rounding mode is round toward negative. This switches them to other algorithms that don't have this problem. Since it is undefined behavior to change rounding mode with the non-strict nodes, this patch only changes the behavior for strict nodes.

There are still problems with unsigned i32 conversions too which I'll try to fix in another patch.

Fixes part of PR47393

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D87115
2020-10-21 18:12:54 -07:00
Gaurav Jain 4634ad6c0b [NFC] Set return type of getStackPointerRegisterToSaveRestore to Register
Differential Revision: https://reviews.llvm.org/D89858
2020-10-21 16:19:38 -07:00
Wang, Pengfei 3a85472af2 [X86] Fix assert fail when element type is i1.
extract_vector_elt will turn type vxi1 into i8, which triggers the assertion fail.
Since we don't really handle vxi1 cases in below code, we can just return from here.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D89096
2020-10-20 09:26:32 +08:00
David Sherwood 47f2dc7e5f [SVE][NFC] Replace some TypeSize comparisons in non-AArch64 Targets
In most of lib/Target we know that we are not dealing with scalable
types so it's perfectly fine to replace TypeSize comparison operators
with their fixed width equivalents, making use of getFixedSize()
and so on.

Differential Revision: https://reviews.llvm.org/D89101
2020-10-15 09:01:21 +01:00
Craig Topper 1687a8d83b [X86][SelectionDAG] Add SADDO_CARRY and SSUBO_CARRY to support multipart signed add/sub overflow legalization.
This passes existing X86 test but I'm not sure if it handles all type
legalization cases it needs to.

Alternative to D89200

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D89222
2020-10-12 23:18:29 -07:00
Simon Pilgrim 913d7a110e [X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16.
This is my first LLVM patch, so please tell me if there are any process issues.

The main observation for this patch is that we can lower UMIN/UMAX with v8i16 by using unsigned saturated subtractions in a clever way. Previously this operation was lowered by turning the signbit of both inputs and the output which turns the unsigned minimum/maximum into a signed one.

We could use this trick in reverse for lowering SMIN/SMAX with v16i8 instead. In terms of latency/throughput this is the needs one large move instruction. It's just that the sign bit turning has an increased chance of being optimized further. This is particularly apparent in the "reduce" test cases. However due to the slight regression in the single use case, this patch no longer proposes this.

Unfortunately this argument also applies in reverse to the new lowering of UMIN/UMAX with v8i16 which regresses the "horizontal-reduce-umax", "horizontal-reduce-umin", "vector-reduce-umin" and "vector-reduce-umax" test cases a bit with this patch. Maybe some extra casework would be possible to avoid this. However independent of that I believe that the benefits in the common case of just 1 to 3 chained min/max instructions outweighs the downsides in that specific case.

Patch By: @TomHender (Tom Hender) ActuallyaDeviloper

Differential Revision: https://reviews.llvm.org/D87236
2020-10-11 11:21:23 +01:00
Craig Topper 9895327914 [X86] Redefine X86ISD::PEXTRB/W and X86ISD::PINSRB/PINSRW to use a i8 TargetConstant for the immediate instead of a ptr constant.
This is more consistent with other target specific ISD opcodes that
require immediates.
2020-10-10 21:50:58 -07:00
Craig Topper 375849518d [X86] Add a X86ISD::BEXTRI to distinquish the case where the control must be a constant.
The bextri intrinsic has a ImmArg attribute which will be converted
in SelectionDAG using TargetConstant. We previously converted this
to a plain Constant to allow X86ISD::BEXTR to call SimplifyDemandedBits
on it.

But while trying to decide if D89178 was safe, I realized that
this conversion of TargetConstant to Constant would be one case
where that would break.

So this patch adds a new opcode specifically for the immediate case.
And then teaches computeKnownBits and SimplifyDemandedBits to also
handle it, but not try to SimplifyDemandedBits on it. To make up
for that, I immediately masked the constant to 16 bits when
converting from the intrinsic node to the X86ISD node.
2020-10-10 19:18:06 -07:00
Joao Moreira e0b89df2e0 [X86] Check if call is indirect before emitting NT_CALL
The notrack prefix is a relaxation of CET policies which makes it possible to indirectly call targets which do not have an ENDBR instruction in the landing address. To emit a call with this prefix, the special attribute "nocf_check" is used. When used as a function attribute, a CallInst targeting the respective function will return true for the method "doesNoCfCheck()", no matter if it is a direct call (and such should remain like this, as the information that the to-be-called function won't perform control-flow checks is useful in other contexts). Yet, when emitting an X86ISD::NT_CALL, the respective CallInst should be verified for its indirection, allowing that the prefixed calls are only emitted in the right situations.

Update the respective testing unit to also verify for direct calls to functions with ''nocf_check'' attributes.

The bug can also be reproduced through compiling the following C code using the -fcf-protection=full flag.

int __attribute__((nocf_check)) foo(int a) {};

int main() {
  foo(42);
}

Differential Revision: https://reviews.llvm.org/D87320
2020-10-09 15:54:23 -07:00
Craig Topper f34bb06935 [X86] When expanding LCMPXCHG16B_NO_RBX in EmitInstrWithCustomInserter, directly copy address operands instead of going through X86AddressMode.
I suspect getAddressFromInstr and addFullAddress are not handling
all addresses cases properly based on a report from MaskRay.

So just copy the operands directly. This should be more efficient
anyway.
2020-10-09 11:55:24 -07:00
Fangrui Song e36a41b3cf [X86] Fix some clang-tidy bugprone-argument-comment issues 2020-10-08 15:26:50 -07:00
Craig Topper 68e1a8d207 [X86] Defer the creation of LCMPXCHG16B_SAVE_RBX until finalize-isel
We need to use LCMPXCHG16B_SAVE_RBX if RBX/EBX is being used as
the frame pointer. We previously checked for this during type
legalization, but that's too early to know for sure if the base
pointer is needed.

This patch adds a new pseudo instruction to emit from isel that
uses a virtual register for the RBX input. Then we use the custom
inserter hook to emit LCMPXCHG16B if RBX isn't needed as a base
pointer or LCMPXCHG16B_SAVE_RBX if it is.

Fixes PR42064.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D88808
2020-10-07 17:00:43 -07:00
Simon Pilgrim 6c7d713cf5 [X86][SSE] combineX86ShuffleChain add 'CanonicalizeShuffleInput' helper. NFCI.
As part of PR45974, we're getting closer to not creating 'padded' vectors on-the-fly in combineX86ShufflesRecursively, and only pad the source inputs if we have a definite match inside combineX86ShuffleChain.

At the moment combineX86ShuffleChain just has to bitcast an input to the correct shuffle type, but eventually we'll need to pad them as well. So, move the bitcast into a 'CanonicalizeShuffleInput helper for now, making the diff for future padding support a lot smaller.
2020-10-06 17:47:24 +01:00
Craig Topper 4da4e7cb20 [X86] Remove X86ISD::LCMPXCHG8_SAVE_EBX_DAG and LCMPXCHG8B_SAVE_EBX pseudo instruction
This and its friend X86ISD::LCMPXCHG8_SAVE_RBX_DAG are used if we need to avoid clobbering the frame pointer in EBX/RBX. EBX/RBX are only used a frame pointer in 64-bit mode. In 64-bit mode we don't use CMPXCHG8B since we have a GR64 cmpxchg available. So we don't need special handling for LCMPXCHG8B.

Split from D88808

Differential Revision: https://reviews.llvm.org/D88853
2020-10-05 15:03:07 -07:00
Craig Topper 1127662c6d [SelectionDAG] Make sure FMF are propagated when getSetcc canonicalizes FP constants to RHS.
getNode handling for ISD:SETCC calls FoldSETCC which can canonicalize
FP constants to the RHS. When this happens we should create the node
with the FMF that was requested. By using FlagInserter when can ensure
any calls to getNode/getSetcc during canonicalization will also get the flags.

Differential Revision: https://reviews.llvm.org/D88063
2020-10-05 14:55:23 -07:00
Simon Pilgrim 0ac210e580 [X86] isTargetShuffleEquivalent - merge duplicate array accesses. NFCI. 2020-10-05 17:22:14 +01:00
Craig Topper 4b38ceb0eb [X86] Remove MWAITX_SAVE_EBX pseudo instruction. Always save/restore the full %rbx register even in gnux32.
ebx/rbx only needs to be saved when 64-bit registers are supported
anyway. It should be fine to save/restore the whole rbx register
even in gnux32 where the base is technically just ebx.

This matches what we do for cmpxchg16b where rbx is saved/restored
regardless of gnux32.
2020-10-04 16:28:15 -07:00
Simon Pilgrim e4e5c42896 [X86][SSE] isTargetShuffleEquivalent - ensure shuffle inputs are the correct size.
Preliminary patch for the next stage of PR45974 - we don't want to be creating 'padded' vectors on-the-fly at all in combineX86ShufflesRecursively, and only pad the source inputs if we have a definite match inside combineX86ShuffleChain.

This means that the inputs to combineX86ShuffleChain might soon be smaller than the final root value type, so we should ensure that isTargetShuffleEquivalent only matches with the inputs if they are the correct size.
2020-10-04 15:32:05 +01:00
Craig Topper a7e45ea30d [X86] Add memory operand to AESENC/AESDEC Key Locker instructions.
This removes FIXMEs from selectAddr.
2020-10-03 21:42:16 -07:00
Craig Topper 39fc4a0b0a [X86] Move ENCODEKEY128/256 handling from lowering to selection.
We should avoid emitting MachineSDNodes from lowering.

We can use the the implicit def handling in InstrEmitter to avoid
manually copying from each xmm result register. We only need to
manually emit the copies for the implicit uses.
2020-10-03 18:44:53 -07:00
Craig Topper 7f3da48885 [X86] Remove X86ISD::MWAITX_DAG. Just match the intrinsic to the custom inserter pseudo instruction during isel. 2020-10-03 18:44:53 -07:00
Craig Topper adccc0bfa3 [X86] Add X86ISD opcodes for the Key Locker AESENC*KL and AESDEC*KL instructions
Instead of emitting MachineSDNodes during lowering, emit X86ISD
opcodes. These opcodes will either be selected by tablegen
patterns or custom selection code.

Emitting MachineSDNodes during lowering is uncommon so this makes
things more consistent. It also allows selectAddr to be called to
perform address matching during instruction selection.

I had trouble getting tablegen to accept XMM0-XMM7 as results in
an isel pattern for the WIDE instructions so I had to use custom
instruction selection.
2020-10-03 16:55:19 -07:00
serge-sans-paille 9573c9f2a3 Fix limit behavior of dynamic alloca
When the allocation size is 0, we shouldn't probe. Within [1,  PAGE_SIZE], we
should probe once etc.

This fixes https://bugs.llvm.org/show_bug.cgi?id=47657

Differential Revision: https://reviews.llvm.org/D88548
2020-10-02 11:10:02 +02:00
Craig Topper d1d7fc9832 [X86] Canonicalize (x > 1) ? x : 1 -> (x >= 1) ? x : 1 for sign and unsigned to enable the use of test instructions for the compare.
This will be further canonicalized to a compare involving 0
which will enable the use of test instructions. Either using
cmovg for signed for cmovne for unsigned.

Fixes more case for PR47049
2020-09-30 13:50:52 -07:00
Xiang1 Zhang 413577a879 [X86] Support Intel Key Locker
Key Locker provides a mechanism to encrypt and decrypt data with an AES key without having access
to the raw key value by converting AES keys into “handles”. These handles can be used to perform the
same encryption and decryption operations as the original AES keys, but they only work on the current
system and only until they are revoked. If software revokes Key Locker handles (e.g., on a reboot),
then any previous handles can no longer be used.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D88398
2020-09-30 18:08:45 +08:00
Craig Topper 618a890b72 [X86] Increase the depth threshold required to form VPERMI2W/VPERMI2B in shuffle combining
These instructions are implemented with two port 5 uops and one port 015 uop so they are more complicated that most shuffles.

This patch increases the depth threshold for when we form them during shuffle combining to try to limit increasing the number of uops especially on port 5.

Differential Revision: https://reviews.llvm.org/D88503
2020-09-29 18:37:23 -07:00
Craig Topper 82da0cabb9 [X86] Add computeKnownBits support for PEXT.
The number of zeros in the mask provides a lower bound on the number
of leading zeros in the result.
2020-09-28 22:54:07 -07:00
Craig Topper e53196b1e8 [X86] Add support for calling SimplifyDemandedBits on the input of PDEP with a constant mask.
We can do several optimizations for PDEP using computeKnownBits and SimplifyDemandedBits

-If the MSBs of the output aren't demanded, those MSBs of the mask input aren't demanded either. We need to keep the most significant demanded bit of the mask and any mask bits before it.
-The number of possible ones in the mask determines how many bits of the lsbs of the other operand are demanded. Any bits of the mask we don't demand by the previous rule should not be counted.
-The result will have zeros in any position that the mask is zero.
-Since non-mask input bits can only be output in the original position or a higher bit position, the result will have at least as many trailing zeroes as the non-mask input.

Differential Revision: https://reviews.llvm.org/D87883
2020-09-28 14:21:30 -07:00
Simon Pilgrim e0820d87e3 [X86] Flip isShuffleEquivalent argument order to match isTargetShuffleEquivalent
A while ago, we converted isShuffleEquivalent/isTargetShuffleEquivalent to both use IsElementEquivalent internally.

This allows us to make the shuffle args optional like isTargetShuffleEquivalent and update foldShuffleOfHorizOp to use isShuffleEquivalent (which it should as its using a ISD::VECTOR_SHUFFLE mask).
2020-09-28 12:53:56 +01:00
Simon Pilgrim 6b5198f06b [X86] Simplify broadcast mask detection with isUndefOrEqual helper.
Add an additional isUndefOrEqual variant that matches an entire mask, not just a single value.
2020-09-28 12:53:56 +01:00
Simon Pilgrim 283036394e [X86][SSE] combineVectorTruncation - enable (pre-SSSE3) vXi16->vXi8 truncation.
Shuffle combining can now handle this output, and by performing this early in combineVectorTruncation we avoid a scalarization that caused a regression on D87502.
2020-09-24 15:51:36 +01:00
Craig Topper f21f835ee8 [X86] Improve demanded bits for X86ISD::BEXTR.
If the control is constant we can figure out exactly which bits
of the input are demanded.

Differential Revision: https://reviews.llvm.org/D88072
2020-09-23 10:51:02 -07:00
Craig Topper a74b1faba2 [X86] Make reduceMaskedLoadToScalarLoad/reduceMaskedStoreToScalarStore work for avx512 after type legalization.
The scalar elements of the vXi1 build_vector will have been type legalized to i8 by padding with 0s. So we can't check for all ones. Instead we should just look at bit 0 of the constant.

Differential Revision: https://reviews.llvm.org/D87863
2020-09-20 13:54:20 -07:00
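In scalar terms the fix amounts to this (an illustrative sketch, not the combine itself):

    #include <cstdint>

    // After type legalization each i1 lane is an i8 constant padded
    // with zeros, so an active lane is 0x01, not 0xff. Test bit 0
    // instead of comparing against all-ones.
    bool laneIsActive(uint8_t LegalizedElt) {
      return (LegalizedElt & 1) != 0;
    }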
Craig Topper 4e8c028158 [X86] Stop reduceMaskedLoadToScalarLoad/reduceMaskedStoreToScalarStore from creating scalar i64 load/stores in 32-bit mode
If we emit a scalar i64 load/store, it will get type legalized to two i32 load/stores.

Differential Revision: https://reviews.llvm.org/D87862
2020-09-20 13:46:59 -07:00
Simon Pilgrim 0bfeede669 [X86][SSE] Fold EXTEND_VECTOR_INREG(EXTRACT_SUBVECTOR(EXTEND(X),0)) -> EXTEND_VECTOR_INREG(X) 2020-09-20 18:39:12 +01:00
Simon Pilgrim bb0078e591 [X86][SSE] Fold SIGN_EXTEND(SIGN_EXTEND_VECTOR_INREG(X)) -> SIGN_EXTEND_VECTOR_INREG(X)
It should be possible to make this generic, but we're not great at checking legality of *_EXTEND_VECTOR_INREG ops so I'm conservatively putting this inside X86ISelLowering.cpp
2020-09-20 18:39:12 +01:00
Simon Pilgrim 15c8306056 [X86][SSE] Fold EXTEND_VECTOR_INREG(EXTEND_VECTOR_INREG(X)) -> EXTEND_VECTOR_INREG(X)
It should be possible to make this generic, but we're not great at checking legality of *_EXTEND_VECTOR_INREG ops so I'm conservatively putting this inside X86ISelLowering.cpp
2020-09-20 16:33:02 +01:00
Simon Pilgrim a0c8793ce6 [X86][SSE] Enable ZERO_EXTEND_VECTOR_INREG shuffle combining on SSE41 targets.
Allows ZERO_EXTEND_VECTOR_INREG to be shuffle combined on all targets where it is legal.
2020-09-20 16:05:10 +01:00
Simon Pilgrim 2b634a9d0e [X86] Rename getExtendInVec to getEXTEND_VECTOR_INREG. NFCI.
Make it easier to find the method by naming it after the ops it actually handles. We already do this for lowering/combining.
2020-09-20 15:19:39 +01:00
Simon Pilgrim 91720ee561 [X86] combineX86ShufflesRecursively - fix use after move warning. NFCI.
After the move, WidenedMask is in an undefined state, so reduce the scope of the variable so it's reinitialized every iteration - we should still retain any memory allocation savings.
2020-09-20 14:06:50 +01:00
Simon Pilgrim e17686ae60 [X86] Rename combineExtInVec to combineEXTEND_VECTOR_INREG. NFCI.
Make it easier to find the method by naming it after the ops it actually handles. We already do this for lowering.
2020-09-20 12:16:00 +01:00
Craig Topper 721d57f952 [X86] Return from SimplifyDemandedBitsForTargetNode after calculating known bits for VSHLI/VSRAI/VSRLI.
We were breaking out of the switch which falls into the default
implementation of SimplifyDemandedBitsForTargetNode which is a
wrapper around computeKnownBits. So we end up doing the recursion
and known bits calculation all over again. Instead we should return
with the known bits we calculated in the switch.
2020-09-18 23:57:01 -07:00
Craig Topper 58ecbbcdcd [X86] Fix copy paste mistake in @ccnp flag.
We were treating @ccp and @ccnp the same.
2020-09-18 21:28:01 -07:00
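A hypothetical demo of the two flag-output constraints being distinguished (GCC/Clang extended asm on x86): @ccp yields 1 when PF is set, @ccnp when it is clear, and TEST sets PF from the low byte of its result.

    // Returns 1 if the low 8 bits of X contain an even number of ones
    // (x86 sets PF on even parity of the low result byte).
    int lowByteHasEvenParity(unsigned X) {
      int P;
      asm("testl %1, %1" : "=@ccp"(P) : "r"(X));
      return P;
    }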
Simon Pilgrim 4ebd30722a [X86][AVX] lowerBuildVectorAsBroadcast - improve BROADCASTM lowering on non-VLX targets
Broadcast to a ZMM type then extract the low subvector.
2020-09-18 19:52:02 +01:00
Simon Pilgrim ceadd98c2f [X86][AVX] lowerBuildVectorAsBroadcast - improve i64 BROADCASTM lowering on 32-bit targets
We already handle the cases where we have a 'zero extended splat' build vector (a, 0, 0, 0, a, 0, 0, 0, ...) but were missing the case where the 'a' scalar was zero-extended as well - such as i64 -> vXi64 splat cases on 32-bit targets.
2020-09-18 16:59:57 +01:00
Craig Topper 3783d3bc7b [X86] Don't match x87 register inline asm constraints unless the VT is floating point or it's a clobber
The register class picked will be the RFP80 register class which has a f80 VT. The code in SelectionDAGBuilder that generates copies around inline assembly doesn't know how to handle an integer and floating point type of different bit widths.

The test case is derived from this https://godbolt.org/z/sEa659 which gcc accepts but clang crashes on. This patch just gives a more graceful error. I'm not sure if the single element struct case is special in gcc. Adding another field to the struct makes gcc reject it. If we want to support this correctly I think we need a change in the frontend to give us the true element type. Right now the frontend just realizes the constraint can take a memory argument so creates an integer type of the same size and bitcasts.

Differential Revision: https://reviews.llvm.org/D87485
2020-09-17 11:26:50 -07:00
Simon Pilgrim b2c931eff3 [X86] EmitInstrWithCustomInserter - remove redundant getDebugLoc() calls. NFCI.
Use the same DebugLoc that is called at the top of the method.

Fixes some Wshadow static analyzer warnings.
2020-09-16 16:29:56 +01:00
Simon Pilgrim aa4b0b755a [X86][SSE] Move VZEXT_MOVL(INSERT_SUBVECTOR(UNDEF,X,0)) handling into combineTargetShuffle.
Now that we're getting better at combining shuffles of different vector widths, this can now be performed as part of the standard target shuffle combines and isn't required for cleanup.

Exposed a minor issue in combineX86ShufflesRecursively where we failed to check if a shuffle's src ops were simple types.
2020-09-16 16:08:31 +01:00
Craig Topper 05134877e6 [X86] Use Align in reduceMaskedLoadToScalarLoad/reduceMaskedStoreToScalarStore. Correct pointer info.
If we offset the pointer, we also need to offset the pointer info.

Differential Revision: https://reviews.llvm.org/D87593
2020-09-15 11:22:02 -07:00
Simon Pilgrim a43e68b58b [X86][AVX] lowerShuffleWithSHUFPS - handle missed canonicalization cases.
PR47534 exposes a case where calling lowerShuffleWithSHUFPS directly from a derived repeated mask (found by is128BitLaneRepeatedShuffleMask) results in us using a non-canonicalized mask.

The missed canonicalization in this case is trivial - just commute the mask so we have more (swapped) LHS than RHS references so lowerShuffleWithSHUFPS can handle it.
2020-09-15 17:31:08 +01:00
Simon Pilgrim fc446935d7 [X86] detectAVGPattern - accept non-pow2 vectors by padding.
Drop the pow2 vector limitation for AVG generation by padding the vector to the next pow2, creating the PAVG nodes and then extracting the final subvector.

Fixes some poor codegen that has been annoying me for years.....
2020-09-15 10:07:03 +01:00
Craig Topper c193a689b4 [SelectionDAG] Use Align/MaybeAlign in calls to getLoad/getStore/getExtLoad/getTruncStore.
The versions that take 'unsigned' will be removed in the future.

I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D87592
2020-09-14 13:54:50 -07:00
Craig Topper 758732a34e [X86] Use ISD::PARITY directly instead of emitting CTPOP and AND from combineHorizontalPredicateResult.
We have a PARITY ISD node now so might as well use it. It will
get re-expanded later.
2020-09-12 20:01:17 -07:00
Craig Topper ad3d6f993d [SelectionDAG][X86][ARM][AArch64] Add ISD opcode for __builtin_parity. Expand it to shifts and xors.
Clang emits (and (ctpop X), 1) for __builtin_parity. If ctpop
isn't natively supported by the target, this leads to poor codegen
due to the expansion of ctpop being more complex than what is needed
for parity.

This adds a DAG combine to convert the pattern to ISD::PARITY
before operation legalization. Type legalization is updated
to handled Expanding and Promoting this operation. If after type
legalization, CTPOP is supported for this type, LegalizeDAG will
turn it back into CTPOP+AND. Otherwise LegalizeDAG will emit a
series of shifts and xors followed by an AND with 1.

I've avoided vectors in this patch to avoid more legalization
complexity for this patch.

X86 previously had a custom DAG combiner for this. This is now
moved to Custom lowering for the new opcode. There is a minor
regression in vector-reduce-xor-bool.ll, but a follow up patch
can easily fix that.

Fixes PR47433

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D87209
2020-09-12 11:42:18 -07:00
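The shift-and-xor expansion the commit describes, written out for a 32-bit value (a reference sketch, not the legalizer code):

    #include <cstdint>

    // Fold the word onto itself so the XOR of all 32 bits lands in
    // bit 0, then mask with 1: the result is 1 iff an odd number of
    // bits were set.
    uint32_t parity32(uint32_t X) {
      X ^= X >> 16;
      X ^= X >> 8;
      X ^= X >> 4;
      X ^= X >> 2;
      X ^= X >> 1;
      return X & 1;
    }

With a native CTPOP the cheaper form is simply popcount(X) & 1, which is the CTPOP+AND shape LegalizeDAG restores when CTPOP is supported.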
Simon Pilgrim 35dc91aee2 [X86][SSE] lowerShuffleAsDecomposedShuffleBlend - support decomposed unpacks for some vXi8/vXi16 cases
Follow up to D86429 to handle the remaining regressions.

This patch generalizes lowerShuffleAsDecomposedShuffleBlend to lowerShuffleAsDecomposedShuffleMerge, and attempts to use an UNPCKL shuffle mask instead of a blend for the cases where the inputs are coming from alternating vXi8/vXi16 sources. Technically they don't have to be alternating (just as long as they can fit into a lower lane half for the unpack), but I didn't find as many general cases and it would have required altering a lot more of the function.

For vXi32/vXi64 cases this could still be beneficial but in most cases the existing permute+blend approach was better.

Differential Revision: https://reviews.llvm.org/D87405
2020-09-12 13:39:33 +01:00
Simon Pilgrim 70a05ee288 [X86] Keep variables from getDataLayout/getDebugLoc calls as const reference. NFCI.
These are only ever used as references in the called functions, so just pass the original reference instead of copying it.
2020-09-11 10:44:42 +01:00
Simon Pilgrim b585fdae24 [X86] Use Register instead of unsigned. NFCI.
Fixes llvm-prefer-register-over-unsigned clang-tidy warnings.
2020-09-10 16:05:33 +01:00
Simon Pilgrim fc49abee56 [X86][SSE] lowerShuffleAsSplitOrBlend always returns a shuffle.
lowerShuffleAsSplitOrBlend always returns a target shuffle result (and is the default operation for lowering some shuffle types), so we don't need to check for null.
2020-09-10 11:45:08 +01:00
Hiroshi Yamauchi 0ab6a15698 [X86] Add support for using fast short rep mov for memcpy lowering.
Disabled by default behind an option.

Differential Revision: https://reviews.llvm.org/D86883
2020-09-09 12:46:40 -07:00
Craig Topper b1e68f885b [SelectionDAGBuilder] Pass fast math flags to getNode calls rather than trying to set them after the fact.
This removes the after-the-fact FMF handling from D46854 in favor of passing fast math flags to getNode. This should be a superset of D87130.

This required adding a SDNodeFlags to SelectionDAG::getSetCC.

Now we manage to constant fold some undefs during the initial getNode
that we don't fold in later DAG combines.

Differential Revision: https://reviews.llvm.org/D87200
2020-09-08 15:27:21 -07:00
Craig Topper da79b1eecc [SelectionDAG][X86][ARM] Teach ExpandIntRes_ABS to use sra+add+xor expansion when ADDCARRY is supported.
Rather than using SELECT instructions, use SRA, UADDO/ADDCARRY and
XORs to expand ABS. This is the multi-part version of the sequence
we use in LegalizeDAG.

It's also the same as the Custom sequence used for i64 on 32-bit
and i128 on 64-bit. So we can remove the X86 customization.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D87215
2020-09-07 13:15:26 -07:00
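The single-word form of the sra+add+xor sequence (a sketch; the patch chains the add carries across parts with UADDO/ADDCARRY for multi-word integers):

    #include <cstdint>

    // S is all-ones when X is negative and zero otherwise, so
    // (X + S) ^ S negates negative values and leaves non-negative
    // values unchanged. Unsigned arithmetic keeps the add wrap-safe.
    int64_t absViaSraAddXor(int64_t X) {
      uint64_t S = static_cast<uint64_t>(X >> 63); // sra of the sign bit
      return static_cast<int64_t>((static_cast<uint64_t>(X) + S) ^ S);
    }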
Craig Topper 01b3e16757 [X86] Use the same sequence for i128 ISD::ABS on 64-bit targets as we use for i64 on 32-bit targets.
Differential Revision: https://reviews.llvm.org/D87214
2020-09-07 11:14:05 -07:00
Simon Pilgrim 9de0a3da6a [X86][SSE] Don't use LowerVSETCCWithSUBUS for unsigned compare with +ve operands (PR47448)
We already simplify the unsigned comparisons if we've found the operands are non-negative, but we were still calling LowerVSETCCWithSUBUS which resulted in the PR47448 regressions.
2020-09-07 16:11:40 +01:00
Simon Pilgrim 9b645ebfff [X86][AVX] Use lowerShuffleWithPERMV in shuffle combining to support non-VLX targets
lowerShuffleWithPERMV allows us to use the ZMM variants for 128/256-bit variable shuffles on non-VLX AVX512 targets.

This is another step towards shuffle combining between vector widths - we still end up with an annoying regression (combine_vpermilvar_vperm2f128_zero_8f32) but we're going in the right direction....
2020-09-07 12:50:50 +01:00
Simon Pilgrim 71dfdbe2c7 [X86] getFauxShuffleMask - handle insert_subvector(zero, sub, C)
Directly use SM_SentinelZero elements if we're (widening) inserting into a zero vector.
2020-09-07 11:10:40 +01:00
Simon Pilgrim ecac5c2808 [X86][AVX] lowerShuffleWithPERMV - adjust binary shuffle masks to account for widening on non-VLX targets
rGabd33bf5eff2 enabled us to pad 128/256-bit shuffles to 512-bit on non-VLX targets, but wasn't updating binary shuffles to account for the new vector width.
2020-09-06 14:52:25 +01:00
Craig Topper 35b35a373d [X86] Prevent shuffle combining from creating an identical X86ISD::SHUF128.
This can cause an infinite loop if SimplifyDemandedElts asks
for the node to replace itself.

A similar protection exists in other places in shuffle combining.

Fixes ISPC https://github.com/ispc/ispc/issues/1864
2020-09-04 14:12:49 -07:00
Simon Pilgrim 740625fecd [X86] Make lowerShuffleAsLanePermuteAndPermute use sublanes on AVX2
Extends lowerShuffleAsLanePermuteAndPermute to search for opportunities to use vpermq (64-bit cross-lane shuffle) and vpermd (32-bit cross-lane shuffle) to get elements into the correct lane, in addition to the 128-bit full-lane permutes it previously searched for.

This is especially helpful in cross-lane byte shuffles, where the alternative tends to be "vpshufb both lanes separately and blend them with a vpblendvb", which is very expensive, especially on Haswell where vpblendvb uses the same execution port as all the shuffles.

Addresses PR47262

Patch By: @TellowKrinkle (TellowKrinkle)

Differential Revision: https://reviews.llvm.org/D86429
2020-09-04 11:41:26 +01:00
Craig Topper 0851350557 [X86] Update stale comment. NFC
The optimization in ExpandIntOp_UINT_TO_FP was removed in D72728
in January 2020.
2020-09-03 16:19:10 -07:00
Simon Pilgrim e56edb801b [X86][SSE] Fold select(X > -1, A, B) -> select(0 > X, B, A) (PR47404)
Help PBLENDVB peek through to the sign bit source of the selection mask by swapping the select condition and inputs.
2020-09-03 13:02:08 +01:00
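In scalar terms the swap looks like this (an illustrative sketch): "X > -1" tests that the sign bit is clear, "0 > X" that it is set, so exchanging the arms preserves the value while the condition becomes the mask's sign bit, which is what PBLENDVB consumes.

    int selectFolded(int X, int A, int B) {
      // Before: (X > -1) ? A : B
      // After: the same value, but the test is now "sign bit set".
      return (0 > X) ? B : A;
    }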
Simon Pilgrim 888049b97a [X86][SSE] Fold vselect(pshufb,pshufb) -> or(pshufb,pshufb)
If the PSHUFBs have no other uses, then we can force the unselected elements to zero to OR them instead, avoiding both an extra mask load and a costly variable blend.

Eventually we should try to bring this into shuffle combining, once we can more easily convert between shuffles + select patterns.
2020-09-02 16:55:00 +01:00
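An intrinsics-level sketch of the resulting pattern (illustrative, not the DAG combine; it assumes the two masks have already been rewritten so each byte lane survives in exactly one shuffle): setting a PSHUFB index's high bit (0x80) zeroes that lane, so the variable blend collapses to an OR.

    #include <tmmintrin.h> // SSSE3 _mm_shuffle_epi8

    __m128i orOfShuffles(__m128i A, __m128i MaskA,
                         __m128i B, __m128i MaskB) {
      // Lanes forced to zero by one mask are supplied by the other.
      return _mm_or_si128(_mm_shuffle_epi8(A, MaskA),
                          _mm_shuffle_epi8(B, MaskB));
    }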
Martin Storsjö 4820af2bfc [X86] Remove superfluous trailing semicolons, fixing warnings. NFC. 2020-09-02 11:43:27 +03:00
Simon Pilgrim 21d02dc595 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support
This patch uses partial DemandedElts masks to further simplify target shuffle chains and finally starts making target shuffle combining part of SimplifyDemandedBits/SimplifyDemandedVectorElts.

We already manage this for Depth == 0 cases, where combineX86ShuffleChain would early-out if the shuffle combined to the same op, but the patch generalizes this by manipulating the depth handling of combineX86ShufflesRecursively - calling with a new Depth = 0 and reducing the maximum shuffle combine depth accordingly.

Differential Revision: https://reviews.llvm.org/D66004
2020-09-02 09:24:46 +01:00
Pierre Gousseau cda6b09242 [X86] Make sure we do not clobber RBX with mwaitx when used as a base
pointer.

mwaitx uses EBX as one of its arguments.
Using this instruction clobbers RBX as it is defined to hold one of the
inputs. When the backend uses a dynamically allocated stack, RBX is used as
a reserved register for the base pointer.

This patch is adapted from @qcolombet patch for cmpxchg at r263325.

This fixes PR43528.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D73475
2020-08-26 11:20:31 +01:00
Craig Topper b8ec8f5776 [X86] Remove extra getOperand(0) call from recently introduced store(extract_element(vtrunc)) to truncated store combine.
The IsExtractedElement check already called getOperand(0), so Extract
here is the source vector. We shouldn't call getOperand(0). This
worked for the original test cases because the result was a
bitcast, so the getOperand(0) accidentally peeked through the bitcast,
which is what we wanted.

In the failing case here, the operand turns out to be undef so
the getOperand(0) asserts because undef has no operands.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25184

Differential Revision: https://reviews.llvm.org/D86428
2020-08-25 16:16:54 -07:00
Simon Pilgrim 057bdd63a4 [X86][AVX] lowerShuffleWithVPMOV - minor refactor to more closely match lowerShuffleAsVTRUNC
Replace isBuildVectorAllZeros check by using the Zeroable bitmask instead.
2020-08-19 14:34:32 +01:00