llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	6a06c7a0a7	[X86] isHorizontalBinOp - only update LHS/RHS references on success We've had issues in the past where isHorizontalBinOp calls would affect later combines as the LHS/RHS references had been commuted but still failed to match.	2020-08-05 15:09:52 +01:00
Simon Pilgrim	a57bfb44bc	[X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for integer types	2020-08-05 15:09:51 +01:00
Simon Pilgrim	6f0da46d53	[X86] getFauxShuffleMask - drop unnecessary computeKnownBits OR(X,Y) shuffle decoding. Now that rG47cea9e82dda941e lets us aggressively decode multi-use shuffles for the OR(SHUFFLE(),SHUFFLE()) case we don't need the computeKnownBits variant any more.	2020-08-04 15:57:47 +01:00
Simon Pilgrim	051f293b78	[X86] Remove unused canScaleShuffleElements helper The only use was removed at rG36750ba5bd0e9e72 Thanks to @nemanjai for the heads up	2020-08-04 14:51:23 +01:00
Simon Pilgrim	36750ba5bd	[X86][AVX] isHorizontalBinOp - relax lane-crossing limits for AVX1-only targets. Permit lane-crossing post shuffles on AVX1 targets as long as every element comes from the same source lane, which for v8f32/v4f64 cases can be efficiently lowered with the LowerShuffleAsLanePermuteAnd* style methods.	2020-08-04 14:27:01 +01:00
Simon Pilgrim	47cea9e82d	Revert rG66e7dce714fab "Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero."" [X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero (REAPPLIED) This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR(). This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle. Reapplied with fixed signed/unsigned comparisons.	2020-08-04 10:32:39 +01:00
Wang, Pengfei	6bc7ea2d8d	[X86][AVX512] Fix build fail after D81548 Test function mask_cmp_128 failed during ISEL LLVM ERROR: Cannot select: t37: v8i1 = X86ISD::KSHIFTL t48, TargetConstant:i8<4> due to v8i1 only available under AVX512DQ. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D84922	2020-08-04 12:31:04 +08:00
Mitch Phillips	66e7dce714	Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero." This reverts commit `219f32f4b6`. Commit contains unsigned comparisons that break bots that build with -Wsign-compare.	2020-08-03 13:48:30 -07:00
Simon Pilgrim	219f32f4b6	[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero. This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR(). This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.	2020-08-03 18:32:47 +01:00
Simon Pilgrim	99a971cadf	[X86][SSE] Start shuffle combining from ANY_EXTEND_VECTOR_INREG on SSE targets We already do this on AVX (+ for ZERO_EXTEND_VECTOR_INREG), but this enables it for all SSE targets - we attempted something similar back at rL357057 but hit issues with the ZERO_EXTEND_VECTOR_INREG handling (PR41249). I'm still looking at the vector-mul.ll regression - which is due to 32-bit targets performing the load as a f64, resulting in the shuffle combiner thinking it has to create a shuffle in the float domain.	2020-08-03 13:41:48 +01:00
Craig Topper	64516ec7c1	[X86] Use parity flag from byte test/cmp instruction for __builtin_parity when input fits in 8 bits. If the upper bits of the __builtin_parity idiom are known to be 0 we were previously emitting an xor with 0 to get the parity flag. But we can use cmp/test instead which may expose opportunities for load folding or combining an AND.	2020-08-02 10:45:04 -07:00
Simon Pilgrim	d14a22da5e	[DAG] TargetLowering::LowerAsmOutputForConstraint - pass SDLoc as const& Try to be more consistent with the SDLoc param in the TargetLowering methods.	2020-08-02 15:12:02 +01:00
Simon Pilgrim	20fbbbc583	[X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI. Fixes clang-tidy warning.	2020-08-02 14:32:23 +01:00
Simon Pilgrim	d7e2616741	[X86] Pass SDLoc by const reference. NFCI.	2020-08-02 14:32:22 +01:00
Simon Pilgrim	3f276840b6	[X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI. Fixes clang-tidy warning.	2020-08-02 14:32:22 +01:00
Simon Pilgrim	2700311cce	[X86] combineX86ShuffleChain - pull out repeated RootVT.getSizeInBits() calls. NFCI.	2020-08-02 14:32:22 +01:00
Craig Topper	56166a3a52	[X86] Improve parity idiom recognition to handle (and (truncate (ctpop X)), 1). Fixes part of PR46954	2020-08-01 22:59:43 -07:00
Simon Pilgrim	82a5c848e7	[X86][AVX512] Fold concat(and(x,y),and(z,w)) -> and(concat(x,z),concat(y,w)) for 512-bit vectors Helps vpternlog folding on non-AVX512BW targets	2020-08-01 20:34:39 +01:00
Simon Pilgrim	bb13c34c3a	[X86][AVX] Ensure we only combine to PSHUFLW/PSHUFHW on supporting targets Noticed while investigating combining from concatenated shuffle vectors, we weren't checking that PSHUFLW/PSHUFHW was legal - we were depending on lowering splitting to subvectors.	2020-08-01 19:18:11 +01:00
Simon Pilgrim	1b1901536a	[X86][AVX] Extend v2f64 BROADCAST(LOAD) -> BROADCAST_LOAD to v2i64/v4f32/v4i32 Minor precursor fix for D66004, but helps the SSE41 tests as well as they run with -disable-peephole	2020-08-01 12:28:29 +01:00
Matt Arsenault	57bd64ff84	Support addrspacecast initializers with isNoopAddrSpaceCast Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs with the DataLayout.	2020-07-31 10:42:43 -04:00
Simon Pilgrim	2dec72ba5c	[X86][SSE] combineExtractWithShuffle - extend extract(truncate(x),0) for any source vector size As long as we can extract the lowest 128-bit subvector from the pre-truncated source vector, then we don't care what size it is. The next stage will be to support non-zero extraction indices, as long as it's still coming from the lowest 128-bit subvector.	2020-07-30 12:27:49 +01:00
Simon Pilgrim	a1c9529e60	[X86][AVX] isHorizontalBinOp - relax no-lane-crossing limit for AVX1-only targets. Instead of never accepting v8f32/v4f64 FHADD/FHSUB if the input shuffle masks cross lanes, perform the matching and determine if the post shuffle mask simplifies to a 'whole lane shuffle' mask - in which case we are guaranteed to cheaply perform this as a VPERM2F128 shuffle.	2020-07-29 20:49:10 +01:00
Craig Topper	c4823b24a4	[X86] Add custom lowering for llvm.roundeven with sse4.1. We can use the roundss/sd/ps/pd instructions like we do for ceil/floor/trunc/rint/nearbyint. Differential Revision: https://reviews.llvm.org/D84592	2020-07-29 10:23:08 -07:00
Simon Pilgrim	0c005be6eb	[X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast. getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements. I'm still investigating what we should do more generally to avoid these undemanded elements, but broadcast cases was a simpler win.	2020-07-29 11:32:44 +01:00
Simon Pilgrim	4838cd46a9	[X86][XOP] Shuffle v16i8 using VPPERM(X,Y) instead of OR(PSHUFB(X),PSHUFB(Y))	2020-07-28 19:56:10 +01:00
Simon Pilgrim	182111777b	[X86][SSE] Attempt to match OP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y)) An initial backend patch towards fixing the various poor HADD combines (PR34724, PR41813, PR45747 etc.). This extends isHorizontalBinOp to check if we have per-element horizontal ops (odd+even element pairs), but not in the expected serial order - in which case we build a "post shuffle mask" that we can apply to the HOP result, assuming we have fast-hops/optsize etc. The next step will be to extend the SHUFFLE(HOP(X,Y)) combines as suggested on PR41813 - accepting more post-shuffle masks even on slow-hop targets if we can fold it into another shuffle. Differential Revision: https://reviews.llvm.org/D83789	2020-07-28 10:04:14 +01:00
Craig Topper	647e861e08	[X86] Detect if EFLAGs is live across XBEGIN pseudo instruction. Add it as livein to the basic blocks created when expanding the pseudo. XBEGIN causes several basic blocks to be inserted. If flags are live across it we need to make eflags live in the new basic blocks to avoid machine verifier errors. Fixes PR46827 Reviewed By: ivanbaev Differential Revision: https://reviews.llvm.org/D84479	2020-07-27 21:15:35 -07:00
Simon Pilgrim	4d84d94969	[X86][SSE] Relax 128-bit restriction on extract_subvector(ext_vector_inreg(X),0) -> ext_vector_inreg(extract_subvector(X,0)) fold We only need to ensure that the source is larger than the subvector result type	2020-07-27 17:50:36 +01:00
Simon Pilgrim	ab4ffa52f0	[X86][AVX] Fold extract_subvector(truncate(x),0) -> truncate(extract_subvector(x),0) This is currently only supported for VLX targets where the op should be legal.	2020-07-27 14:51:29 +01:00
Simon Pilgrim	f720c9c68c	[X86] combineExtractSubvector - pull out repeated getSizeInBits() calls. NFCI.	2020-07-27 14:51:28 +01:00
Simon Pilgrim	17eafe0841	[X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0, (elements 0,1 of the v4i32) which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls. By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.	2020-07-26 16:04:22 +01:00
Simon Pilgrim	3b21823e4a	[X86][SSE] combineX86ShufflesRecursively - move all Root node asserts to the same location. NFCI. Minor tidyup for some upcoming shuffle combine improvements.	2020-07-25 12:48:14 +01:00
Simon Pilgrim	66998ae59f	[X86][SSE] getFauxShuffle - ignore undemanded sources for PACKSS/PACKUS faux shuffles If we don't care about an entire LHS/RHS of the PACK op, then we can just treat it the same as undef (we don't care if it saturates) and it is safe to treat as a shuffle. This can happen if we attempt to decode as a faux shuffle before SimplifyDemandedVectorElts has been called on the PACK which should replace the source with UNDEF entirely.	2020-07-25 10:51:14 +01:00
Simon Pilgrim	d720ba1e4b	[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add SSE shift multiple use handling Add SimplifyMultipleUseDemandedVectorElts peek through for imm/var SSE shifts	2020-07-23 14:39:03 +01:00
Simon Pilgrim	5b5dc2442a	[X86][AVX] getTargetShuffleMask - don't decode VBROADCAST(EXTRACT_SUBVECTOR(X,0)) patterns. getTargetShuffleMask is used by the various "SimplifyDemanded" folds so we can't assume that the bypassed extract_subvector can be safely simplified - getFauxShuffleMask performs a more general decode that allows us to more safely catch many of these cases so the impact is minimal.	2020-07-21 21:55:44 +01:00
Sanjay Patel	50afa18772	[x86] split FMA with fast-math-flags to avoid libcall fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware) C/C++ code may use explicit fma() calls (which become LLVM fma intrinsics in IR) but then gets compiled with -ffast-math or similar. For targets that do not have FMA hardware, we don't want to go out to the math library for a precise but slow FMA result. I tried this as a generic DAGCombine, but it caused infinite looping on more than 1 other target, so there's likely some over-reaching fma formation happening. There's also a potential intersection of strict FP with fast-math here. Deferring to current behavior for that case (assuming that strict-ness overrides fast-ness). Differential Revision: https://reviews.llvm.org/D83981	2020-07-19 10:03:55 -04:00
Craig Topper	5408024fa8	[X86] Move integer hadd/hsub formation into a helper function shared by combineAdd and combineSub. There was a lot of duplicate code here for checking the VT and subtarget. Moving it into a helper avoids that. It also fixes a bug that combineAdd reused Op0/Op1 after a call to isHorizontalBinOp may have changed it. The new helper function has its own local version of Op0/Op1 that aren't shared by other code. Fixes PR46455. Reviewed By: spatel, bkramer Differential Revision: https://reviews.llvm.org/D83971	2020-07-16 13:27:27 -07:00
Craig Topper	9adf7461f7	[X86] Add test case for PR46455.	2020-07-16 11:06:55 -07:00
Craig Topper	f8f007e378	[X86] Consistently use 128 as the PSHUFB/VPPERM index for zero Bit 7 of the index controls zeroing, the other bits are ignored when bit 7 is set. Shuffle lowering was using 128 and shuffle combining was using 255. Seems like we should be consistent. This patch changes shuffle combining to use 128 to match lowering. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D83587	2020-07-12 10:52:43 -07:00
Craig Topper	04013a07ac	[X86] Fix two places that appear to misuse peekThroughOneUseBitcasts peekThroughOneUseBitcasts checks the use count of the operand of the bitcast. Not the bitcast itself. So I think that means we need to do any outside hasOneUse checks before calling the function not after. I was working on another patch where I misused the function and did a very quick audit to see if there were other similar mistakes. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D83598	2020-07-12 10:52:43 -07:00
Wang, Pengfei	e628092524	[X86][MMX] Optimize MMX shift intrinsics. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D83534	2020-07-11 11:16:23 +08:00
Simon Pilgrim	4cc26a44ca	[X86][SSE] Use shouldUseHorizontalOp helper to determine whether to use (F)HADD. NFCI.	2020-07-10 12:13:34 +01:00
Simon Pilgrim	77133cc1e2	[X86][AVX] Attempt to fold PACK(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(PACK(X,Y)). Truncations lowered as shuffles of multiple (concatenated) vectors often leave us with lane-crossing shuffles that feed a PACKSS/PACKUS, if both shuffles are fed from the same 2 vector sources, then we can PACK the sources directly and shuffle the result instead. This is currently limited to whole i128 lanes in a 256-bit vector, but we can extend this if the need arises (but I'm not seeing many examples in real world code).	2020-07-10 09:33:27 +01:00
Craig Topper	918e653186	[X86] Immediately call LowerShift from lowerBuildVectorToBitOp. If we don't immediately lower the vector shift, the splat constant vector we created may get turned into a constant pool load before we get around to lowering the shift. This makes it a lot more difficult to create a shift by constant. Sometimes we fail to see through the constant pool at all and end up trying to lower as if it was a variable shift. This requires custom handling and may create an unsupported vselect on pre-sse-4.1 targets. Since we're after LegalizeVectorOps we are unable to legalize the unsupported vselect as that code is in LegalizeVectorOps rather than LegalizeDAG. So calling LowerShift immediately ensures that we get to see the splat constant. Fixes PR46527. Differential Revision: https://reviews.llvm.org/D83455	2020-07-09 10:51:29 -07:00
Hiroshi Yamauchi	2c1a9006dd	[PGO][PGSO] Add profile guided size optimization to X86 ISel Lowering.	2020-07-09 10:43:45 -07:00
Craig Topper	3e75912005	[X86] Directly emit X86ISD::BLENDV instead of VSELECT in a few places that were emitting sign bit tests. Technically a VSELECT expects a vector of all 1s or 0s elements for its condition. But we aren't guaranteeing that the sign bit and the non sign bits match in these locations. So we should use BLENDV which is more relaxed. Differential Revision: https://reviews.llvm.org/D83447	2020-07-09 10:40:09 -07:00
Simon Pilgrim	f54402b63a	[X86][AVX] Attempt to fold extract_subvector(shuffle(X)) -> extract_subvector(X) If we're extracting a subvector from a shuffle that is shuffling entire subvectors we can peek through and extract the subvector from the shuffle source instead. This helps remove some cases where concat_vectors(extract_subvector(),extract_subvector()) legalization has resulted in BLEND/VPERM2F128 shuffles of the subvectors.	2020-07-09 14:09:24 +01:00
Benjamin Kramer	b44470547e	Make helpers static. NFC.	2020-07-09 13:48:56 +02:00
Simon Pilgrim	800fb68420	[X86][SSE] Pull out PACK(SHUFFLE(),SHUFFLE()) folds into its own function. NFC. Future patches will extend this so declutter combineVectorPack before we start.	2020-07-08 17:42:42 +01:00
Simon Pilgrim	08a2c9ce5c	[X86] Fix copy+paste typo in combineVectorPack assert message. NFC.	2020-07-08 17:42:42 +01:00
Sanjay Patel	9114900287	[x86] improve codegen for non-splat bit-masked vector compare and select (PR46531) vselect ((X & Pow2C) == 0), LHS, RHS --> vselect ((shl X, C') < 0), RHS, LHS Follow-up to D83073 - the non-splat mask cases where we actually see an improvement are quite limited from what I can tell. AVX1 needs multiply and blend capabilities and AVX2 needs vector shift and blend capabilities. The intersection of those 2 constraints is only vectors with 32-bit or 64-bit elements. XOP is/was better. Differential Revision: https://reviews.llvm.org/D83181	2020-07-08 08:20:49 -04:00
Simon Pilgrim	9dc250db9d	[X86][AVX] SimplifyDemandedVectorEltsForTargetShuffle - ensure mask is same size as constant size Fixes test regression reported on D81791	2020-07-08 11:47:59 +01:00
Simon Pilgrim	c00a27752e	[X86][AVX] Remove redundant EXTRACT_VECTOR_ELT(VBROADCAST(SCALAR())) fold Noticed while looking for similar cases to rG931ec74f7a29 - SimplifyDemandedVectorElts and shuffle combining both should handle this now.	2020-07-08 10:18:36 +01:00
Simon Pilgrim	931ec74f7a	[X86][AVX] Don't fold PEXTR(VBROADCAST_LOAD(X)) -> LOAD(X). We were checking the VBROADCAST_LOAD element size against the extraction destination size instead of the extracted vector element size - PEXTRW/PEXTRB have implicit zext'ing so have i32 destination sizes for v8i16/v16i8 vectors, resulting in us extracting from the wrong part of a load. This patch bails from the fold if the vector element sizes don't match, and we now use the target constant extraction code later on like the pre-AVX2 targets, fixing the test case. Found by internal fuzzing tests.	2020-07-07 19:10:03 +01:00
Sanjay Patel	642eed3713	[x86] fix miscompile in buildvector v16i8 lowering In the test based on PR46586: https://bugs.llvm.org/show_bug.cgi?id=46586 ...we are inserting 16-bits into the high element of the vector, shuffling it to element 0, and extracting 32-bits. But xmm1 was never initialized, so the top 16-bits of the extract are undef without this patch. (It seems like we could do better than this by recognizing that we only demand a subsection of the build vector, but I want to make sure we fix the miscompile 1st.) This path is only used for pre-SSE4.1, and simpler patterns get squashed somewhere along the way, so the test still includes a 'urem' as it did in the original test from the bug report. Differential Revision: https://reviews.llvm.org/D83319	2020-07-07 13:02:31 -04:00
Liu, Chen3	ea85ff82c8	[X86] Fix a bug when lowering byval arguments When an argument has the 'byval' attribute and should be passed on the stack according to the calling convention, a stack copy would be emitted twice. This causes the real value to be put on the stack where the pointer should be passed. Differential Revision: https://reviews.llvm.org/D83175	2020-07-07 21:49:31 +08:00
Xiang1 Zhang	939d8309db	[X86-64] Support Intel AMX Intrinsic INTEL ADVANCED MATRIX EXTENSIONS (AMX). AMX is a new programming paradigm, it has a set of 2-dimensional registers (TILES) representing sub-arrays from a larger 2-dimensional memory image and operate on TILES. These intrinsics use direct TMM register number as its params. Spec can be found in Chapter 3 here https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D83111	2020-07-07 10:13:40 +08:00
Craig Topper	e652c0f8f3	[X86] Teach lowerShuffleAsBlend to use bit blend for v16i8/v32i8/v16i16 when avx512vl is enabled but not avx512bw. Probably not super important since there are no real CPUs with avx512vl and not avx512bw. But vpternlog should be better than vblendvb. I do wonder if we should use vpternlog even with BWI. We currently use vblendmb or vpblendmw by putting the mask into a GPR and moving it to a k-register. But I don't think we hoist the GPR to k-register copy in machine LICM. Using VPTERNLOG would use a constant pool load, but has the advantage that we're pretty good at hoisting and rematerializing those. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D83156	2020-07-04 10:26:56 -07:00
Craig Topper	b4eb415a99	[X86] Disable VPBLENDVB formation in combineLogicBlendIntoPBLENDV if VPTERNLOG is supported. VPBLENDVB is multiple uops while VPTERNLOG is a single uop. So we should use that instead. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D83155	2020-07-04 10:12:19 -07:00
Simon Pilgrim	71f342d6c3	[X86][AVX] Fold PACK(LOSUBVECTOR(SHUFFLE(X)),HISUBVECTOR(SHUFFLE(X))) -> SHUFFLE(PACK(LOSUBVECTOR(X),HISUBVECTOR(X))) Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed. This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles. A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.	2020-07-04 13:54:30 +01:00
Craig Topper	fed432523e	[X86] Directly emit VPTERNLOG from canonicalizeBitSelect when possible. Seems to produce better results on some rotate tests. And is neutral for other tests.	2020-07-03 22:08:28 -07:00
Sanjay Patel	26543f1c0c	[x86] improve codegen for bit-masked vector compare and select (PR46531) We canonicalize patterns like: %s = lshr i32 %a0, 1 %t = trunc i32 %s to i1 to: %a = and i32 %a0, 2 %c = icmp ne i32 %a, 0 ...in IR, but the bit-shifting original sequence may be better for x86 vector codegen. I tried several variants of the transform, and it's tricky to not induce regressions. In particular, I did not find a way to cleanly handle non-splat constants, so I've left that as a TODO item here (currently negative tests for those are included). AVX512 resulted in some diffs, but didn't look meaningful, so I left that out too. Some of the 256-bit AVX1 diffs are questionable, but close enough that they are probably insignificant. Differential Revision: https://reviews.llvm.org/D83073.	2020-07-03 17:31:57 -04:00
serge-sans-paille	c8ef3d5a2f	Fix stack-clash probing for large static alloca Differential Revision: https://reviews.llvm.org/D82867	2020-07-03 09:22:03 +02:00
Craig Topper	acf6c94a38	[X86] Teach lower512BitShuffle to try bitmask and bitblend before splitting v32i16/v64i8 on avx512f only targets. We consider v32i16/v64i8 to be legal types on avx512f, but we don't have most operations until avx512bw. But we can use and/or/xor operations. So try those before splitting. This is especially helpful since we turn some ands with constant masks into shuffles in early DAG combines. So we should make sure we recover those back to AND.	2020-07-02 15:35:48 -07:00
Craig Topper	204a21317a	[X86] Modify the conditions for when we stop making v16i8/v32i8 rotate Custom based on having avx512 features. The comments here indicate that we prefer to promote the shifts instead of allowing rotate to be pattern matched. But we weren't taking into account whether 512-bit registers are enabled or whether we have vpsllvw/vpsrlvw instructions. splatvar_rotate_v32i8 is a slight regression, but the other cases are neutral or improved.	2020-07-02 13:07:51 -07:00
Simon Pilgrim	b485586482	[X86][SSE] Fix targetShrinkDemandedConstant constant vector sign extensions D82257/rG3521ecf1f8a3 was incorrectly sign-extending a constant vector from the lsb, this is fine if all the constant elements are 'allsignbits' in the active bits, but if only some of the elements are, then we are corrupting the constant values for those elements. This fix ensures we sign extend from the msb of the active/demanded bits instead.	2020-07-01 12:12:53 +01:00
Guillaume Chatelet	28de229bc6	[Alignment][NFC] Migrate MachineFrameInfo::CreateStackObject to Align This patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Differential Revision: https://reviews.llvm.org/D82894	2020-07-01 07:28:11 +00:00
Matt Arsenault	249933f254	X86: Use Register	2020-06-30 12:13:08 -04:00
Simon Pilgrim	82de018954	[X86][SSE] LowerVectorAllZero - add support for masked OR-reductions If we're masking the result of an OR-reduction before comparing against zero, we can fold this into the PTEST() / MOVMSK(CMPEQ()) codegen by pre-masking the source value. This works particularly well on PTEST which performs the AND as part of its operation, but the MOVMSK variant also benefits for non-V2I64 cases. Fixes PR44781	2020-06-30 14:38:52 +01:00
Guillaume Chatelet	a976ea3209	[Alignment][NFC] Migrate PPC, X86 and XCore backends to Align This patch is part of a series to introduce an Alignment type. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html See this patch for the introduction of the type: https://reviews.llvm.org/D64790 Differential Revision: https://reviews.llvm.org/D82779	2020-06-30 08:08:45 +00:00
Craig Topper	9b04d69cce	[X86] Prefer AND over PSHUFB for v64i8 when possible If the shuffle is a blend and one input is a 0 vector, we should prefer AND over PSHUFB since its available on more execution ports. Differential Revision: https://reviews.llvm.org/D82798	2020-06-29 16:26:53 -07:00
Matt Arsenault	2790516418	X86: Use MOV32r0 pseudo instead of directly emitting xor This was producing reg = xor undef reg, undef reg. This looks similar to a use of a value to define itself, and I want to disallow undef uses for SSA virtual registers. If this were to use implicit_def, there's no guarantee the two operands end up using the same register (I think no guarantee exists even if the two operands start out as the same register, but this was violated when I switched this to use an explicit implicit_def). The MOV32r0 pseudo evidently exists to handle this case, so use it instead. This was more work than I expected for the 64-bit case, but I didn't see any helper for materializing a 64-bit 0.	2020-06-29 14:45:20 -04:00
Simon Pilgrim	333aa690f4	[X86][SSE] MatchVectorAllZeroTest - handle OR vector reductions (REAPPLIED) This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero. Reapplied with a fix for a chromium regression due to a missing isNullConstant() check in combineSetCC: https://bugs.chromium.org/p/chromium/issues/detail?id=1097758 Fixes PR45378 Differential Revision: https://reviews.llvm.org/D81547	2020-06-29 15:50:44 +01:00
Simon Pilgrim	3521ecf1f8	[X86] Add vector support to targetShrinkDemandedConstant for OR/XOR opcodes If a constant is only allsignbits in the demanded/active bits, then sign extend it to an allsignbits bool pattern for OR/XOR ops. This also requires SimplifyDemandedBits XOR handling to be modified to call ShrinkDemandedConstant on any (non-NOT) XOR pattern to account for non-splat cases. Next step towards fixing PR45808 - with this patch we now get a <-1,-1,0,0> v4i64 constant instead of <1,1,0,0>. Differential Revision: https://reviews.llvm.org/D82257	2020-06-29 12:19:05 +01:00
Simon Pilgrim	973685fc78	[TargetLowering] Add DemandedElts arg to ShrinkDemandedConstant Pre-commit for D82257, this adds a DemandedElts arg to ShrinkDemandedConstant/targetShrinkDemandedConstant which will allow future patches to (optionally) add vector support.	2020-06-29 11:46:58 +01:00
Simon Pilgrim	e07a982693	[X86] combineScalarToVector - handle (v2i64 scalar_to_vector(aextload)) as well as (v2i64 scalar_to_vector(aext)) We already fold (v2i64 scalar_to_vector(aext)) -> (v2i64 bitcast(v4i32 scalar_to_vector(x))), this adds support for similar aextload cases and also handles v2f64 cases that wrap the i64 extension behind bitcasts. Fixes the remaining issue with PR39016	2020-06-28 13:00:32 +01:00
Simon Pilgrim	393b4bd136	[X86] SimplifyDemandedVectorEltsForTargetNode - merge shuffle/pack lower demanded elements handling. Generalize the vector operand extraction code for shuffle/pack ops - we can assume that the vector operands are the same width as the result, and any non-vector values can be reused directly in the smaller width op.	2020-06-27 19:10:13 +01:00
Simon Pilgrim	e855efe424	[X86][AVX] SimplifyDemandedVectorEltsForTargetNode - reduce width of X86ISD::VPERMIL2 If we don't need the elements of the upper lanes, reduce the width of the X86ISD::VPERMIL2 node.	2020-06-27 15:06:49 +01:00
Simon Pilgrim	d56c6475a6	[X86][AVX] SimplifyDemandedVectorEltsForTargetNode - reduce width of X86ISD::VPERMILPV If we don't need the elements of the upper lanes, reduce the width of the X86ISD::VPERMILPV node.	2020-06-27 14:43:03 +01:00
Amy Huang	8b59c26bf3	Extend or truncate __ptr32/__ptr64 pointers when dereferenced. Summary: A while ago I implemented the functionality to lower Microsoft __ptr32 and __ptr64 pointers, which are stored as 32-bit and 64-bit pointer and are extended/truncated to the appropriate pointer size when dereferenced. This patch adds an addrspacecast to cast from the __ptr32/__ptr64 pointer to a default address space when dereferencing. Bug: https://bugs.llvm.org/show_bug.cgi?id=42359 Reviewers: hans, arsenm, RKSimon Subscribers: wdng, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81517	2020-06-26 13:33:54 -07:00
Simon Pilgrim	e7e204a373	[X86][AVX] Attempt to lower v16i32/v16f32 shuffles with lowerShuffleAsRepeatedMaskAndLanePermute Avoids prematurely creating permps/permd variable shuffles. Fixes PR46249	2020-06-23 18:33:50 +01:00
Simon Pilgrim	4c257bb44e	[X86] truncateVectorWithPACK - fix outdated comment. NFC. We perform PACKSS/PACKUS on AVX512 targets if the calling function wants to.	2020-06-23 10:45:27 +01:00
Hans Wennborg	1357c06578	Revert "[X86][SSE] MatchVectorAllZeroTest - handle OR vector reductions" This caused a Chromium test to miscompile. See discussion on the Phabricator review. > This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero. > > Fixes PR45378 > > Differential Revision: https://reviews.llvm.org/D81547 This reverts `057c9c7ee0`	2020-06-22 21:27:11 +02:00
Simon Pilgrim	48d1a2d6d0	[DAG] Add SimplifyMultipleUseDemandedVectorElts helper for SimplifyMultipleUseDemandedBits. NFCI. We have many cases where we call SimplifyMultipleUseDemandedBits and demand specific vector elements, but all the bits from them - this adds a helper wrapper to handle this.	2020-06-22 14:24:39 +01:00
David Green	730ecb63ec	[CGP] Convert phi types If a collection of interconnected phi nodes is only ever loaded, stored or bitcast then we can convert the whole set to the bitcast type, potentially helping to reduce the number of register moves needed as the phis are passed across basic block boundaries. This has to be done in CodegenPrepare as it naturally straddles basic blocks. The algorithm just looks from phi nodes, looking at uses and operands for a collection of nodes that all together are bitcast between float and integer types. We record visited phi nodes to not have to process them more than once. The whole subgraph is then replaced with a new type. Loads and Stores are bitcast to the correct type, which should then be folded into the load/store, changing its type. This comes up in the biquad testcase due to the way MVE needs to keep values in integer registers. I have also seen it come up from aarch64 partner example code, where a complicated set of sroa/inlining produced integer phis, where float would have been a better choice. I also added undef and extract element handling which increased the potency in some cases. This adds it with an option that defaults to off, and disabled for 32bit X86 due to potential issues around canonicalizing NaNs. Differential Revision: https://reviews.llvm.org/D81827	2020-06-21 15:54:17 +01:00
Simon Pilgrim	fb9f9dc318	[X86][SSE] Add SimplifyDemandedVectorEltsForTargetShuffle to handle target shuffle variable masks Pulled out from the ongoing work on D66004, currently we don't do a good job of simplifying variable shuffle masks that have already lowered to constant pool entries. This patch adds SimplifyDemandedVectorEltsForTargetShuffle (a custom x86 helper) to first try SimplifyDemandedVectorElts (which we already do) and then constant pool simplification to help mark undefined elements. To prevent lowering/combines infinite loops, we only handle basic constant pool loads instead of creating new BUILD_VECTOR nodes for lowering - e.g. we don't try to convert them to broadcast/vzext_load - there might be some benefit to this but if so I'd rather we come up with some way to reuse existing code than reimplement a lot of BUILD_VECTOR code. Differential Revision: https://reviews.llvm.org/D81791	2020-06-21 11:16:07 +01:00
Simon Pilgrim	89dcbdfcfd	[X86] combineSetCCMOVMSK - consistently use CmpBits variable. NFCI. The comparison value should be the same size - I've added an assert to be absolutely certain.	2020-06-20 12:35:24 +01:00
Simon Pilgrim	56a9332328	[X86][SSE] Fold MOVMSK(PCMPEQ(X,0)) != -1 -> !PTESTZ(X,X) allof patterns	2020-06-20 12:17:32 +01:00
Eric Christopher	cf23852587	[Target] As part of using inclusive language within the llvm project, migrate away from the use of blacklist and whitelist. This change affects an internal llvm command line option.	2020-06-20 00:06:39 -07:00
Simon Pilgrim	c143db3b10	[X86][SSE] combineHorizontalPredicateResult - improve all_of(X == 0) for vXi64 on pre-SSE41 targets Without SSE41 we don't have the PCMPEQQ instruction, making cmp-with-zero reductions more complicated than necessary. We can compare as vXi32 (PCMPEQD) and tweak the MOVMSK comparison to test upper/lower DWORD comparisons. This pre-fixes something that occurs with null tests for vectors of (64-bit) pointers such as in PR35129.	2020-06-19 11:43:25 +01:00
Simon Pilgrim	cad2038700	[X86][SSE] combineSetCCMOVMSK - fold MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) If we're permuting ALL the elements of a single vector, then for allof/anyof MOVMSK tests we can avoid the shuffle entirely.	2020-06-19 10:57:52 +01:00
Simon Pilgrim	fe0a85faf4	[X86][SSE] Fold MOVMSK(PCMPEQ(X,0)) == -1 -> PTESTZ(X,X) Allow combineSetCCMOVMSK to handle 'allof' X == 0 patterns to be replaced with PTESTZ This is a preliminary patch before properly handling PR35129	2020-06-18 15:38:32 +01:00
Simon Pilgrim	9d11822f09	Fix comment typo - Uexpected -> Unexpected. NFC.	2020-06-16 12:14:51 +01:00
Simon Pilgrim	379c5b31f7	[X86][SSE] combineVectorSizedSetCCEquality - remove unused AVX2 MOVMSK path. NFCI. If PTEST is not available, then we're guaranteed to be performing a 128-bit vector comparison using MOVMSK(PCMPEQB(v16i8)).	2020-06-16 10:07:41 +01:00
Simon Pilgrim	057c9c7ee0	[X86][SSE] MatchVectorAllZeroTest - handle OR vector reductions This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero. Fixes PR45378 Differential Revision: https://reviews.llvm.org/D81547	2020-06-16 09:42:34 +01:00
Simon Pilgrim	65c3fa849b	[X86][SSE] combineVectorSizedSetCCEquality - move single Subtarget.hasAVX() use into condition. NFC. We already have Subtarget.hasSSE2() and Subtarget.useAVX512Regs() in the condition - seems to be a legacy from when we had multiple uses.	2020-06-16 09:42:33 +01:00
Craig Topper	255d5dbae1	[X86] Add support for inline assembly 'x' constraint for i128. Limiting to x86-64 since that's when __int128 is legal in clang. Differential Revision: https://reviews.llvm.org/D81817	2020-06-15 19:34:02 -07:00
Simon Pilgrim	cb8a0ba829	[X86][SSE] Add LowerVectorAllZero helper for checking if all bits of a vector are zero. Pull the lowering code out of LowerVectorAllZeroTest (and rename it MatchVectorAllZeroTest). We should be able to reuse this in combineVectorSizedSetCCEquality as well. Another cleanup to simplify D81547.	2020-06-15 15:54:38 +01:00
Simon Pilgrim	ae33cbc494	[X86][SSE] LowerVectorAllZeroTest - add support for >256-bit vectors Reduce by splitting the vector until we reach the target size for PTEST/MOVMSK_PCMPEQ. There might be some cases where AVX512 can perform this with 512-bit vectors but so far I haven't encountered any such pattern that reaches LowerVectorAllZeroTest. Prep work for D81547	2020-06-15 15:30:24 +01:00
Simon Pilgrim	0b806549b5	[X86][SSE] LowerVectorAllZeroTest - remove unnecessary bitcasts matchScalarReduction should return all its source vectors with the same type, so we can safely perform the OR reduction with the original type. So we just need to bitcast for PTEST/PCMPEQB with the final reduced vector.	2020-06-15 15:13:13 +01:00
Nikita Popov	7cac7e0cfc	[IR] Prefer hasFnAttribute() where possible (NFC) When checking for an enum function attribute, use hasFnAttribute() rather than hasAttribute() at FunctionIndex, because it is significantly faster (and more concise to boot).	2020-06-15 09:30:35 +02:00
Simon Pilgrim	3d8149c2a1	[X86][SSE] Fold BITOP(MOVMSK(X),MOVMSK(Y)) -> MOVMSK(BITOP(X,Y)) Reduce XMM->GPR traffic by performing bitops on the vectors, and using a single MOVMSK call. This requires us to use vectors of the same size and element width, but we can mix fp/int type equivalents with suitable bitcasting.	2020-06-14 21:37:58 +01:00
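A scalar model may help show why the BITOP(MOVMSK(X),MOVMSK(Y)) -> MOVMSK(BITOP(X,Y)) fold is sound: the sign bit of a lane-wise AND/OR/XOR is the AND/OR/XOR of the input sign bits, so the bitop can be done once on the vectors before a single MOVMSK. This is only a sketch, not the LLVM implementation; the `V4` type and the `movmsk`, `lanewise_*` and `fold_holds` helpers are hypothetical names that model a 4 x 32-bit vector in plain C++.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar stand-in for a 4 x i32 vector register.
using V4 = std::array<uint32_t, 4>;

// Models MOVMSKPS: gather the sign bit of each lane into the low 4 bits.
unsigned movmsk(const V4& v) {
    unsigned mask = 0;
    for (unsigned i = 0; i < 4; ++i)
        mask |= (v[i] >> 31) << i;
    return mask;
}

// Lane-wise bitwise operations on the modelled vectors.
V4 lanewise_and(const V4& a, const V4& b) {
    return {a[0] & b[0], a[1] & b[1], a[2] & b[2], a[3] & b[3]};
}
V4 lanewise_or(const V4& a, const V4& b) {
    return {a[0] | b[0], a[1] | b[1], a[2] | b[2], a[3] | b[3]};
}
V4 lanewise_xor(const V4& a, const V4& b) {
    return {a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]};
}

// True if two MOVMSKs plus a scalar bitop match one vector bitop plus one MOVMSK.
bool fold_holds(const V4& a, const V4& b) {
    return (movmsk(a) & movmsk(b)) == movmsk(lanewise_and(a, b)) &&
           (movmsk(a) | movmsk(b)) == movmsk(lanewise_or(a, b)) &&
           (movmsk(a) ^ movmsk(b)) == movmsk(lanewise_xor(a, b));
}
```

The right-hand side of each comparison needs only one vector-to-scalar transfer, which is the XMM->GPR traffic the commit message refers to.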
Simon Pilgrim	e0cff30c17	[X86][SSE] LowerVectorAllZeroTest - add support for pre-SSE41 targets Even without PTEST, we can still efficiently perform an OR reduction as PMOVMSKB(PCMPEQB(X,0)) == 0, avoiding xmm->gpr extractions.	2020-06-14 13:41:56 +01:00
Craig Topper	cb5072d187	[X86] Teach combineBitcastvxi1 to prefer movmsk on avx512 in more cases If the input to the bitcast is a sign bit test, it makes sense to directly use vpmovmskb or vmovmskps/pd. This removes the need to copy the sign bits to a k-register and then to a GPR. Fixes PR46200. Differential Revision: https://reviews.llvm.org/D81327	2020-06-13 14:50:13 -07:00
Simon Pilgrim	8d30945ab9	[X86][SSE] combineX86ShuffleChain - combine INSERT_VECTOR_ELT patterns to INSERTPS Noticed while trying to cleanup D66004 - if a shuffle operand came from a scalar, we're better off using INSERTPS vs UNPCKLPS as this is more likely to load fold later on. It also matches our existing BUILD_VECTOR lowering. We can extend this to other PINSRB/D/Q/W cases in the future as the need arises.	2020-06-12 11:59:01 +01:00
Simon Pilgrim	7706c7af74	[X86] Fold vXi1 OR(KSHIFTL(X,NumElts/2),Y) -> KUNPCK Convert shift+or bool vector patterns into CONCAT_VECTORS if we know this will be lowered to KUNPCK (which requires 16+ vector elements). Fixes PR32547	2020-06-11 15:47:20 +01:00
Simon Pilgrim	5cca9828ff	[X86][AVX512] Avoid bitcasts between scalar and vXi1 bool vectors AVX512 mask types are often bitcasted to scalar integers for various ops before being bitcast back to be used as a predicate. In many cases we can avoid these KMASK<->GPR transfers and perform equivalent operations on the mask unit. If the destination mask type is legal, and we can confirm that the scalar op originally came from a mask/vector/float/double type then we should try to avoid the scalar entirely. This avoids some codegen issues noticed while working on PTEST/MOVMSK improvements. Partially fixes PR32547 - we don't create a KUNPCK yet, but OR(X,KSHIFTL(Y)) can be handled in a separate patch. Differential Revision: https://reviews.llvm.org/D81548	2020-06-11 10:22:55 +01:00
Craig Topper	1c3dd7bf37	[X86] Call LowerADDRSPACECAST directly from ReplaceNodeResults to avoid repeating identical code. NFC	2020-06-10 14:39:02 -07:00
serge-sans-paille	60fe25cb0c	Fix dynamic probing scheme If we probe after each static stack allocation, we need to probe before each dynamic stack allocation. Provide a scheme to describe the possible scenario. Thanks a lot to @jonpa for motivating this fix. Differential Revision: https://reviews.llvm.org/D81067	2020-06-10 21:37:09 +02:00
Craig Topper	2328cab16c	[X86] Prevent LowerSELECT from causing suboptimal codegen for __builtin_ffs(X) - 1. LowerSELECT sees the CMP with 0 and wants to use a trick with SUB and SBB. But we can use the flags from the BSF/TZCNT. Fixes PR46203. Differential Revision: https://reviews.llvm.org/D81312	2020-06-08 11:46:56 -07:00
Guillaume Chatelet	1778564f91	[Alignment][NFC] Migrate the rest of backends Summary: This is a followup on D81196 Reviewers: courbet Subscribers: arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81278	2020-06-08 07:17:20 +00:00
Craig Topper	b0eea7213b	[X86] Support load shrinking for strict fp nodes in combineCVTPH2PS	2020-06-07 21:09:55 -07:00
Craig Topper	e3aece06cf	[X86] Improve (vzmovl (insert_subvector)) combine to handle a bitcast between the vzmovl and insert This combine tries to shrink a vzmovl if its input is an insert_subvector. This patch improves it to turn (vzmovl (bitcast (insert_subvector))) into (insert_subvector (vzmovl (bitcast))) potentially allowing the bitcast to be folded with a load.	2020-06-07 19:31:06 -07:00
Craig Topper	22987babd5	[X86] Teach combineCVTP2I_CVTTP2I to handle STRICT_CVTTP2SI/STRICT_CVTTP2UI Allows us to shrink 128-bit simple load to enable folding for v2f32->v2i64 vcvttps2qq/vcvttps2uqq.	2020-06-07 19:31:06 -07:00
Craig Topper	a135c4a2cf	[X86] Don't scalarize v2f32->v2i64 strict_fp_to_sint/uint with avx512dq and not avx512vl. We can pad the v2f32 with 0s up to v8f32 and use a v8f32->v8i64 operation. This is what we end up with on non-strict nodes except we don't pad with 0s since we don't care about exceptions.	2020-06-07 14:45:26 -07:00
Simon Pilgrim	ce677ef532	[X86][AVX2] combineSetCCMOVMSK - handle all_of patterns for PMOVMSKB(PACKSSBW(LO(X), HI(X))) In the sign splat case, we can fold PMOVMSKB(PACKSSBW(LO(X), HI(X))) -> PMOVMSKB(BITCAST_v32i8(X)) without introducing a signmask + comparison (which unlike for any_of won't fold into a single TEST).	2020-06-07 21:08:53 +01:00
Simon Pilgrim	3a28ae091b	[X86][SSE] combineSetCCMOVMSK - add initial support for allof patterns. Handle MOVMSK 'allof' comparisons (X86ISD::SUB X, AllBitsMask) as well as 'anyof' patterns. This allows us to handle these patterns in the MOVMSK(BITCAST(X)) pattern to fix PR37087.	2020-06-07 16:10:13 +01:00
Simon Pilgrim	0741b75ad5	[X86][SSE] Attempt to widen MOVMSK vector input if the signbits are splatted. As shown on PR37087, if we have a MOVMSK(BITCAST(X)) from a wider vector, then by using MOVMSK from the wider type (32/64-bit elements) we can improve the chances of further combines with SimplifyDemandedBits/Elts and on some targets (skylake) can be more efficient.	2020-06-07 11:44:43 +01:00
Craig Topper	3408dcbdf0	[X86] Fold undef elts to 0 in getTargetVShiftByConstNode. Similar to D81212. Differential Revision: https://reviews.llvm.org/D81292	2020-06-05 13:39:40 -07:00
Craig Topper	7c9a89fed8	[X86] Teach combineVectorShiftImm to constant fold undef elements to 0 not undef. Shifts are supposed to always shift in zeros or sign bits regardless of their inputs. It's possible the input value may have been replaced with undef by SimplifyDemandedBits, but the shifted-in zeros are still demanded. This issue was reported to me by ispc from 10.0. Unfortunately their failing test does not fail on trunk. Seems to be because the shl is optimized out earlier now and doesn't become VSHLI. ispc bug https://github.com/ispc/ispc/issues/1771 Differential Revision: https://reviews.llvm.org/D81212	2020-06-05 11:29:55 -07:00
Simon Pilgrim	d194ff31cf	[X86][SSE] Simplify MOVMSK patterns based on comparison An initial patch adding combineSetCCMOVMSK to simplify MOVMSK and its vector input based on the comparison of the MOVMSK result. This first stage just adds support for some simple MOVMSK(PACKSSBW()) cases where we remove the PACKSS if we're comparing ne/eq zero (any_of patterns), allowing us to directly compare against the v8i16 source vector(s) bitcasted to v16i8, with suitable masking to take into account which signbits are valid. Future combines could peek through further PACKSS, target shuffles, handle all_of patterns (ne/eq -1), optimize to a PTEST op, etc. Differential Revision: https://reviews.llvm.org/D81171	2020-06-05 16:53:22 +01:00
Simon Pilgrim	ea80b40669	[DAG] SimplifyDemandedBits - peek through SHL if we only demand sign bits. If we're only demanding the (shifted) sign bits of the shift source value, then we can use the value directly. This handles SimplifyDemandedBits/SimplifyMultipleUseDemandedBits for both ISD::SHL and X86ISD::VSHLI. Differential Revision: https://reviews.llvm.org/D80869	2020-06-03 16:11:54 +01:00
Simon Pilgrim	d9d28b3559	[X86][AVX] getFauxShuffleMask - fix sub vector size check in INSERT_SUBVECTOR(X,SHUFFLE(Y,Z)) We were bailing on subvector shuffle inputs that were smaller than the subvector type instead of larger than it. Fixes PR46178	2020-06-03 15:26:22 +01:00
Craig Topper	e51d5bc7a4	[X86] Fix a few recursivelyDeleteUnusedNodes calls that were trying to delete nodes before their user was really gone. We looked through a truncate to get to the load. So we should be deleting the truncate first. There is a check that the node is really unused before deleting so this didn't cause a functional issue.	2020-06-01 21:55:13 -07:00
Simon Pilgrim	22e50833e9	[X86][AVX] Reduce unary target shuffles width if the upper elements aren't demanded.	2020-05-31 20:19:24 +01:00
Simon Pilgrim	8f2f613a6e	[X86][AVX] combineX86ShufflesRecursively - peekThroughOneUseBitcasts subvector before widening. This matches what we do for the full sized vector ops at the start of combineX86ShufflesRecursively, and helps getFauxShuffleMask extract more INSERT_SUBVECTOR patterns.	2020-05-31 19:58:33 +01:00
Simon Pilgrim	4a2673d79f	[X86][AVX] Add SimplifyMultipleUseDemandedBits VBROADCAST handling to SimplifyDemandedVectorElts. As suggested on D79987.	2020-05-31 14:20:15 +01:00
Simon Pilgrim	f046326847	[X86] getFauxShuffleMask/getTargetShuffleInputs - make SelectionDAG const (PR45974). Try to prevent future node creation issues (as detailed in PR45974) by making the SelectionDAG reference const, so it can still be used for analysis, but not node creation.	2020-05-31 13:51:01 +01:00
Simon Pilgrim	d33ba1aa0b	[X86][AVX] getFauxShuffleMask - don't widen shuffle inputs from INSERT_SUBVECTOR(X,SHUFFLE(Y,Z)) Don't create nodes on the fly when decoding INSERT_SUBVECTOR as faux shuffles.	2020-05-31 13:19:18 +01:00
Simon Pilgrim	45ebe38ffc	[X86][AVX] Pad small shuffle inputs in combineX86ShufflesRecursively As detailed on PR45974 and D79987, getFauxShuffleMask is creating nodes on the fly to create shuffles with inputs the same size as the result, causing problems for hasOneUse() checks in later simplification stages. Currently only combineX86ShufflesRecursively benefits from these widened inputs so I've begun moving the functionality there, and out of getFauxShuffleMask. This allows us to remove the widening from VBROADCAST and EXTEND faux shuffle cases. This just leaves the INSERT_SUBVECTOR case in getFauxShuffleMask still creating nodes, which will require more extensive refactoring.	2020-05-31 11:43:47 +01:00
Craig Topper	7c3b8077cc	[X86] Add DAG combine to turn (v2i64 (scalar_to_vector (i64 (bitconvert (mmx))))) to MOVQ2DQ. Remove unneeded isel patterns. We already had a DAG combine for (mmx (bitconvert (i64 (extractelement v2i64)))) to MOVDQ2Q. Remove patterns for MMX_MOVQ2DQrr/MMX_MOVDQ2Qrr that use scalar_to_vector/extractelement involving i64 scalar type with v2i64 and x86mmx.	2020-05-30 19:47:08 -07:00
Craig Topper	af1accdd86	[X86] Teach computeKnownBitsForTargetNode that the upper half of X86ISD::MOVQ2DQ is all zero.	2020-05-30 19:47:07 -07:00
Craig Topper	1ecf39d607	[X86] Fix a place where we created MOVQ2DQ with a DstVT other than v2i64. The type profile and isel pattern have this type declared as being MVT::v2i64. But isel skips the explicit type check due to the type profile.	2020-05-30 19:47:07 -07:00
Christopher Tetreault	5a99ec10f5	[SVE] Eliminate calls to default-false VectorType::get() from X86 Reviewers: efriedma, sdesmalen, c-rhodes, craig.topper Reviewed By: craig.topper Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D80331	2020-05-29 16:16:07 -07:00
Zequan Wu	80e107ccd0	Add NoMerge MIFlag to avoid MIR branch folding Let the codegen recognize the nomerge attribute and disable branch folding when the attribute is given Differential Revision: https://reviews.llvm.org/D79537	2020-05-29 12:31:06 -07:00
Craig Topper	87e4ad4d5c	[X86] Remove isel pattern for MMX_X86movdq2q+simple_load. Replace with DAG combine to loadmmx. Only 64 bits will be loaded, not the whole 128 bits. We can just combine it to plain mmx load. This has the side effect of enabling isel load folding for it. This is part of my desire to get rid of isel patterns that shrink loads.	2020-05-29 10:20:03 -07:00
Simon Pilgrim	1ddac9563d	[X86][SSE] Peek through MOVMSK source sign bits using SimplifyMultipleUseDemandedBits Allows SimplifyDemandedBitsForTargetNode to peek through multi-use ops where MOVMSK only demands the signbit of each vector element.	2020-05-28 13:42:24 +01:00
Simon Pilgrim	410667f1b7	[X86][SSE] Convert PTEST to MOVMSK for allsign bits vector results If we are using PTEST to check 'allsign bits' vector elements we can use MOVMSK to extract the signbits directly and perform the comparison on the scalar value. For vXi16 cases, as we don't have a MOVMSK for this type, we must mask each signbit out of a PMOVMSKB v2Xi8 result, which folds into the TEST comparison. If this allows us to remove a vector op (via the SimplifyMultipleUseDemandedBits call) this is consistently faster than a PTEST (https://godbolt.org/z/ziJUst). I'm investigating whether we ever get regressions without the SimplifyMultipleUseDemandedBits call, even if this means we don't remove a vector op, but that has exposed some other poor codegen issues that I'm still investigating and would have to wait for a later patch. Suggested on PR42035 to avoid unnecessary ashr(x,bw-1)/pcmpgt(0,x) sign splat patterns feeding into ptest. Differential Revision: https://reviews.llvm.org/D80563	2020-05-27 11:06:16 +01:00
Craig Topper	a1dfd6d828	[X86] Add helper function to reduce some code duplication when shrinking a vector load to a vzext_load. There's more code for calling CombineTo and replacing the nodes that I'd like to share, but it's complicated by the getNode call in the middle that needs to be specific to each opcode. While there, also make sure we recursively delete the load we're replacing. It eventually gets removed by a RemoveDeadNodes call at the end of DAG combine, but we should be more eager about it. We were inconsistently doing this in some places but not all.	2020-05-27 01:32:13 -07:00
Simon Pilgrim	6f802ec433	[X86] Fix fshr comment copy+paste typo. NFC. Noticed by @foad on D80466.	2020-05-26 10:55:57 +01:00
Craig Topper	51a276c759	[X86] Teach combineTruncatedArithmetic to push truncate through subtracts where only one of the inputs is free to truncate. Fix combineSubToSubus to handle the new DAG to avoid a regression. There are still regressions in test14/test15/test16, where it looks like we were trying to set up cases we could match to umin+trunc+subus but the handling was never finished. The regression here isn't unique to sub. It's a lost opportunity for taking an AND with two truncated inputs and producing a larger AND with a single truncate. The same thing could happen with any other node we handle in combineTruncatedArithmetic since we are moving the truncate up the DAG. Differential Revision: https://reviews.llvm.org/D80483	2020-05-25 11:42:42 -07:00
serge-sans-paille	8eae32188b	Improve stack-clash implementation on x86 - test both 32 and 64 bit version - probe the tail in dynamic-alloca - generate more concise code Differential Revision: https://reviews.llvm.org/D79482	2020-05-25 14:48:14 +02:00
Sanjay Patel	fa038e0350	[x86] favor vector constant load to avoid GPR to XMM transfer, part 2 This replaces the build_vector lowering code that was just added in D80013 and matches the pattern later from the x86-specific "vzext_movl". That seems to result in the same or better improvements and gets rid of the 'TODO' items from that patch. AFAICT, we always shrink wider constant vectors to 128-bit on these patterns, so we still get the implicit zero-extension to ymm/zmm without wasting space on larger vector constants. There's a trade-off there because that means we miss potential load-folding. Similarly, we could load scalar constants here with implicit zero-extension even to 128-bit. That saves constant space, but it means we forego load-folding, and so it increases register pressure. This seems like a good middle-ground between those 2 options. Differential Revision: https://reviews.llvm.org/D80131	2020-05-25 08:01:48 -04:00
Simon Pilgrim	8a5aea7b50	[X86][AVX] Fold extract_subvector(subv_broadcast(x),c) -> (x) If we're extracting a subvector from a broadcasted subvector of the same type then we can use the source vector directly.	2020-05-24 18:49:39 +01:00
Simon Pilgrim	e508d643cf	[X86][AVX] Fold extract_subvector(broadcast(x),c) -> extract_subvector(broadcast(x),0) iff c != 0 If we're extracting an upper subvector from a broadcast we're better off extracting the lowest subvector instead as it avoids an actual extract instruction and might help SimplifyDemandedVectorElts further simplify the code.	2020-05-24 18:05:54 +01:00
Simon Pilgrim	1e7865d946	[X86] SimplifyMultipleUseDemandedBitsForTargetNode - add initial X86ISD::VSRAI handling. This initial version only peeks through cases where we just demand the sign bit of an ashr shift, but we could generalize this further depending on how many sign bits we already have. The pr18014.ll case is a minor annoyance - we've failed to move the psrad/paddd after the blendvps which would have avoided the extra move, but we have still increased the ILP.	2020-05-24 16:07:46 +01:00
Simon Pilgrim	478f2ce5d3	[X86] Pull out repeated DemandedBits signmask variable. NFC. Both paths always create the same DemandedBits mask.	2020-05-24 12:01:58 +01:00
Simon Pilgrim	ffb367217d	[X86] Move CONCAT_VECTORS/INSERT_SUBVECTOR actions inside loop. NFC. CONCAT_VECTORS/INSERT_SUBVECTOR both are custom on v32i1/v64i1 like the other ops in the loop.	2020-05-24 10:59:33 +01:00
Simon Pilgrim	8310c9b741	[X86][AVX] Call SimplifyDemandedBits on MaskedLoadSDNode with non-boolean masks On X86 (AVX1/AVX2), non-boolean masked loads only demand the sign bit of the mask, we already do the equivalent for masked stores. Annoyingly I can't easily handle this inside TargetLowering::SimplifyDemandedBits as this is an x86 specific case for a generic node. Differential Revision: https://reviews.llvm.org/D80478	2020-05-24 09:51:21 +01:00
Simon Pilgrim	cc65a7a5ea	[X86] Improve i8 + 'slow' i16 funnel shift codegen This is a preliminary patch before I deal with the xor+and issue raised in D77301. We get much better code for i8/i16 funnel shifts by concatenating the operands together and performing the shift as a double width type, it avoids repeated use of the shift amount and partial registers. fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw. fshr(x,y,z) -> (((zext(x) << bw) | zext(y)) >> (z & (bw-1))). Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn This doesn't do as well for i32 cases on x86_64 (the xor+and followup patch is much better) so I haven't bothered with that. Cases with constant amounts are more dubious as well so I haven't currently bothered with those - it's these kinds of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift. Differential Revision: https://reviews.llvm.org/D80466	2020-05-24 08:08:53 +01:00
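The widening trick described for the funnel-shift commit can be sketched in scalar C++ for the i8 case (bw = 8). This is an illustration of the arithmetic only, not the X86 lowering code; `fshl8_wide`, `fshr8_wide` and the `*_ref` reference functions are hypothetical names. For fshr the result is the low byte of the shifted double-width value, following LLVM's funnel-shift semantics, so the sketch truncates instead of shifting right by bw a second time.

```cpp
#include <cstdint>

// Funnel shift left via a double-width value: concatenate x:y,
// shift left by (z mod 8), and return the high byte.
uint8_t fshl8_wide(uint8_t x, uint8_t y, uint8_t z) {
    uint32_t wide = (uint32_t(x) << 8) | y;
    return uint8_t((wide << (z & 7)) >> 8);
}

// Funnel shift right via a double-width value: concatenate x:y,
// shift right by (z mod 8), and return the low byte.
uint8_t fshr8_wide(uint8_t x, uint8_t y, uint8_t z) {
    uint32_t wide = (uint32_t(x) << 8) | y;
    return uint8_t(wide >> (z & 7));
}

// Reference definitions written with two shifts and an explicit
// zero-amount case, used only to cross-check the widened forms.
uint8_t fshl8_ref(uint8_t x, uint8_t y, uint8_t z) {
    unsigned s = z & 7;
    return s == 0 ? x : uint8_t((x << s) | (y >> (8 - s)));
}
uint8_t fshr8_ref(uint8_t x, uint8_t y, uint8_t z) {
    unsigned s = z & 7;
    return s == 0 ? y : uint8_t((x << (8 - s)) | (y >> s));
}
```

The widened forms use the shift amount once and need no special case for a zero amount, which matches the motivation given in the commit message.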
Craig Topper	7392820f98	[Align] Remove operations on MaybeAlign that asserted that it had a defined value. If the caller needs to be responsible for making sure the MaybeAlign has a value, then we should just make the caller convert it to an Align with operator*. I explicitly deleted the relational comparison operators that were being inherited from Optional. It's unclear what the meaning of comparing two MaybeAligns where one is defined and the other isn't should be. So make the caller responsible for defining the behavior. I left the ==/!= operators from Optional. But now that exposed a weird quirk that ==/!= between Align and MaybeAlign required the MaybeAlign to be defined. But now we use the operator== from Optional that takes an Optional and the Value. Differential Revision: https://reviews.llvm.org/D80455	2020-05-22 21:54:28 -07:00
aartbik	645bba8d3d	[llvm] [CodeGen] [X86] Fix issues with v4i1 instruction selection Summary: Fixes issue https://bugs.llvm.org/show_bug.cgi?id=45995 Reviewers: mehdi_amini, nicolasvasilache, reidtatge, craig.topper, ftynse, bkramer Reviewed By: craig.topper Subscribers: RKSimon, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D80231	2020-05-20 11:34:56 -07:00
Arthur Eubanks	8a88755610	Reland [X86] Codegen for preallocated See https://reviews.llvm.org/D74651 for the preallocated IR constructs and LangRef changes. In X86TargetLowering::LowerCall(), if a call is preallocated, record each argument's offset from the stack pointer and the total stack adjustment. Associate the call Value with an integer index. Store the info in X86MachineFunctionInfo with the integer index as the key. This adds two new target independent ISDOpcodes and two new target dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}. The setup ISelDAG node takes in a chain and outputs a chain and a SrcValue of the preallocated call Value. It is lowered to a target dependent node with the SrcValue replaced with the integer index key by looking in X86MachineFunctionInfo. In X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an %esp adjustment, the exact amount determined by looking in X86MachineFunctionInfo with the integer index key. The arg ISelDAG node takes in a chain, a SrcValue of the preallocated call Value, and the arg index int constant. It produces a chain and the pointer fo the arg. It is lowered to a target dependent node with the SrcValue replaced with the integer index key by looking in X86MachineFunctionInfo. In X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a lea of the stack pointer plus an offset determined by looking in X86MachineFunctionInfo with the integer index key. Force any function containing a preallocated call to use the frame pointer. Does not yet handle a setup without a call, or a conditional call. Does not yet handle musttail. That requires a LangRef change first. Tried to look at all references to inalloca and see if they apply to preallocated. I've made preallocated versions of tests testing inalloca whenever possible and when they make sense (e.g. not alloca related, inalloca edge cases). 
Aside from the tests added here, I checked that this codegen produces correct code for something like ``` struct A { A(); A(A&&); ~A(); }; void bar() { foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8); } ``` by replacing the inalloca version of the .ll file with the appropriate preallocated code. Running the executable produces the same results as using the current inalloca implementation. Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland. Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D77689	2020-05-20 11:25:44 -07:00
Arthur Eubanks	b8cbff51d3	Revert "[X86] Codegen for preallocated" This reverts commit `810567dc69`. Some tests are unexpectedly passing	2020-05-20 10:04:55 -07:00
Arthur Eubanks	810567dc69	[X86] Codegen for preallocated See https://reviews.llvm.org/D74651 for the preallocated IR constructs and LangRef changes. In X86TargetLowering::LowerCall(), if a call is preallocated, record each argument's offset from the stack pointer and the total stack adjustment. Associate the call Value with an integer index. Store the info in X86MachineFunctionInfo with the integer index as the key. This adds two new target independent ISDOpcodes and two new target dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}. The setup ISelDAG node takes in a chain and outputs a chain and a SrcValue of the preallocated call Value. It is lowered to a target dependent node with the SrcValue replaced with the integer index key by looking in X86MachineFunctionInfo. In X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an %esp adjustment, the exact amount determined by looking in X86MachineFunctionInfo with the integer index key. The arg ISelDAG node takes in a chain, a SrcValue of the preallocated call Value, and the arg index int constant. It produces a chain and the pointer fo the arg. It is lowered to a target dependent node with the SrcValue replaced with the integer index key by looking in X86MachineFunctionInfo. In X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a lea of the stack pointer plus an offset determined by looking in X86MachineFunctionInfo with the integer index key. Force any function containing a preallocated call to use the frame pointer. Does not yet handle a setup without a call, or a conditional call. Does not yet handle musttail. That requires a LangRef change first. Tried to look at all references to inalloca and see if they apply to preallocated. I've made preallocated versions of tests testing inalloca whenever possible and when they make sense (e.g. not alloca related, inalloca edge cases). 
Aside from the tests added here, I checked that this codegen produces correct code for something like ``` struct A { A(); A(A&&); ~A(); }; void bar() { foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8); } ``` by replacing the inalloca version of the .ll file with the appropriate preallocated code. Running the executable produces the same results as using the current inalloca implementation. Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D77689	2020-05-20 09:20:38 -07:00
QingShan Zhang	2b59e9f1bd	[DAGCombine] Remove the getNegatibleCost to avoid the out of sync with getNegatedExpression We have the getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression. However, during negating the expression, the cost might change as we are changing the DAG, and then we hit the assertion if we negated the wrong expression, as the cost is not trustworthy anymore. This patch aims to remove the getNegatibleCost to avoid the out of sync with getNegatedExpression, and check the cost during negating the expression. It also reduces the duplicated code between getNegatibleCost and getNegatedExpression, and fixes the crash for the test in D76638 Reviewed By: RKSimon, spatel Differential Revision: https://reviews.llvm.org/D77319	2020-05-20 02:12:16 +00:00
Benjamin Kramer	350dadaa8a	Give helpers internal linkage. NFC.	2020-05-19 22:16:37 +02:00
Sanjay Patel	57c3fe76a3	[x86] favor vector constant load to avoid GPR to XMM transfer This build vector lowering pattern came up in D79886. I've tried to limit the improvement to cases where it looks clearly better to load, but we could remove the 'TODO' predicates already if we are willing to overlook some corner cases. Differential Revision: https://reviews.llvm.org/D80013	2020-05-17 11:56:26 -04:00
Simon Pilgrim	6f02633a4f	[X86] Add getTargetConstantFromBasePtr helper. NFC. Allows us to share code from LoadSDNode and MemIntrinsicSDNode constant pool loads.	2020-05-17 14:58:31 +01:00
Simon Pilgrim	9aca5b68ee	[X86] getTargetConstantBitsFromNode - remove unnecessary X86ISD::VBROADCAST handling. We create X86ISD::VBROADCAST_LOAD for constant pool folds now.	2020-05-17 14:58:30 +01:00
Simon Pilgrim	330b7491d5	[X86] Remove some duplicate ConstantSDNode casts. NFC. Avoid repeated isa<> and cast<> by just performing a dyn_cast<ConstantSDNode>	2020-05-15 18:23:35 +01:00
Simon Pilgrim	9825d3daa8	[X86] Use getConstantOperandVal helper in a few places. NFC. Avoid raw cast<ConstantSDNode> calls.	2020-05-15 17:31:27 +01:00
Simon Pilgrim	4580b0f5b6	[X86] getFauxShuffle - remove (unused) ISD::TRUNCATE shuffle decoding.	2020-05-15 17:31:26 +01:00
Ties Stuij	8c24f33158	[IR][BFloat] Add BFloat IR type Summary: The BFloat IR type is introduced to provide support for, initially, the BFloat16 datatype introduced with the Armv8.6 architecture (optional from Armv8.2 onwards). It has an 8-bit exponent and a 7-bit mantissa and behaves like an IEEE 754 floating point IR type. This is part of a patch series upstreaming Armv8.6 features. Subsequent patches will upstream intrinsics support and C-lang support for BFloat. Reviewers: SjoerdMeijer, rjmccall, rsmith, liutianle, RKSimon, craig.topper, jfb, LukeGeeson, sdesmalen, deadalnix, ctetreau Subscribers: hiraditya, llvm-commits, danielkiss, arphaman, kristof.beyls, dexonsmith Tags: #llvm Differential Revision: https://reviews.llvm.org/D78190	2020-05-15 14:43:43 +01:00
Simon Pilgrim	1024e82469	X86ISelLowering.cpp - remove non-constant EXTRACT_SUBVECTOR/INSERT_SUBVECTOR handling. NFC. Now that D79814 has landed, we can assume that subvector ops use constant, in-range indices.	2020-05-15 11:50:00 +01:00
Craig Topper	2fdeee9c82	[X86] Add support for forming vXi16 PMULH instructions from shifts. We already form PMULH when the shift is truncated. But we can also do it from just a shift by extending the result. Unfortunately, I get regressions if I try to replace the truncate combine with this as we turn the truncate into a more complicated sequence first. Then we are unable to combine that sequence with the extend produced at the end of this combine. Differential Revision: https://reviews.llvm.org/D79682	2020-05-14 10:58:00 -07:00
Sanjay Patel	26e742fd84	[x86][CGP] improve sinking of splatted vector shift amount operand Expands on the enablement of the shouldSinkOperands() TLI hook in: D79718 The last codegen/IR test diff shows what I suspected could happen - we were sinking all splat shift operands into a loop. But that's not what we want in general; we only want to sink the shift amount operand if it is a splat. Differential Revision: https://reviews.llvm.org/D79827	2020-05-14 08:36:03 -04:00
Craig Topper	028bfdd891	[X86] Only allow f32, f64, or f80 to be used with 'f' inline assembly constraint. Avoids crash when using i128. Gives better error than 'scalar-to-vector conversion failed' for other types.	2020-05-13 13:27:13 -07:00
Craig Topper	38e0ab2f3a	[X86] Don't allow f80 to be used with the 'q', 'r', 'l', 'Q' or 'q' inline assembly constraints. It was previously trying to use the 64-bit class, but 80 isn't evenly divisible by 64 so it will trigger a crash.	2020-05-13 12:19:57 -07:00
Craig Topper	47985451ed	[X86] Make the if statement structure for inline assembly constraints 'l', 'r', 'q', 'Q', and 'R' the same. These did similar things but had slight differences. For example 'Q' didn't allow f64, but the others did.	2020-05-13 12:19:57 -07:00
Sanjay Patel	f490ca76b0	[x86][CGP] enable target hook to sink funnel shift intrinsic's splatted shift amount SDAG suffers when it can't see that a funnel operand is a splat value (due to single-basic-block visibility), so invert the normal loop hoisting rules to move a splat op closer to its use. This would be part 1 of an enhancement similar to D63233. This is needed to re-fix PR37426: https://bugs.llvm.org/show_bug.cgi?id=37426 ...because we got better at canonicalizing IR to funnel shift intrinsics. The existing CGP code for shift opcodes is likely overstepping what it was intended to do, so that will be fixed in a follow-up. Differential Revision: https://reviews.llvm.org/D79718	2020-05-12 18:40:40 -04:00
Alexey Lapshin	1c44430e73	Fix buildbots #2 after `aa1eb5152d`.	2020-05-13 01:23:39 +03:00
Alexey Lapshin	293c6d3821	Fix buildbots after `aa1eb5152d`.	2020-05-13 01:11:01 +03:00
Alexey Lapshin	aa1eb5152d	[X86][ISelLowering] refactor Varargs handling in X86ISelLowering.cpp Summary: This patch refactors handling of VarArgs in X86TargetLowering::LowerFormalArguments. That refactoring was requested while reviewing D69372. Code related to varargs handling is removed from X86TargetLowering::LowerFormalArguments and is divided into smaller routines. Reviewed By: aeubanks Differential Revision: https://reviews.llvm.org/D74794	2020-05-13 00:32:00 +03:00
Craig Topper	01636c1eea	[X86] Remove the v16i8->v16i16 path for MULHS with AVX2. We have a couple main strategies for legalizing MULH. -If the vXi16 type is legal, extend to do the full i16 multiply and then shift and truncate the results. -Use unpcks to split each 128 bit lane into high and low halves. For signed we have an extra case to split a v32i8 to v16i8 and then use the extending to v16i16 strategy. This patch proposes to use the unpck strategy instead. Which is what we already do for unsigned. This seems to be 1 instruction shorter when the RHS is constant like the idiv case. It's 1 instruction longer for the smulo case. But we're trading cross lane shuffles for inlane shuffles and a shift. Differential Revision: https://reviews.llvm.org/D79652	2020-05-12 10:32:01 -07:00
Simon Pilgrim	0387df7f02	[X86] combineX86ShuffleChain - use narrowShuffleMaskElts scale == 1 builtin handling. NFC. narrowShuffleMaskElts already has the fast-path for scale == 1, no need to reimplement it here.	2020-05-12 13:45:40 +01:00
Simon Pilgrim	45aa1b8853	[X86][AVX] Use X86ISD::VPERM2X128 for blend-with-zero if optimizing for size Last part of PR22984 - avoid the zero-register dependency if optimizing for size	2020-05-12 13:03:50 +01:00
Sam McCall	728cf6d86b	Revert "[DAGCombine] Remove the getNegatibleCost to avoid the out of sync with getNegatedExpression" This reverts commit `3c44c441db`. Causes infloops on some inputs, see https://reviews.llvm.org/D77319 for repro	2020-05-11 16:44:01 +02:00
QingShan Zhang	3c44c441db	[DAGCombine] Remove the getNegatibleCost to avoid the out of sync with getNegatedExpression We have getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression. However, while negating the expression the cost might change, because we are changing the DAG, and we then hit the assertion if we negated the wrong expression, as the cost can no longer be trusted. This patch removes getNegatibleCost so that it cannot get out of sync with getNegatedExpression, and checks the cost while negating the expression instead. It also reduces the duplicated code between getNegatibleCost and getNegatedExpression, and fixes the crash for the test in D76638. Reviewed By: RKSimon, spatel Differential Revision: https://reviews.llvm.org/D77319	2020-05-11 02:41:10 +00:00
Fangrui Song	f40fc7b8d6	[X86] Fix combineVectorCompareAndMaskUnaryOp regression after `0e8e731449`	2020-05-10 19:03:38 -07:00
Simon Pilgrim	9237d88001	[X86] isVectorShiftByScalarCheap - don't limit fast XOP vector shifts to 128-bit vectors XOP targets have fast per-element vector shifts and we're better off splitting to 128-bit shifts where necessary (which is what we already do in LowerShift).	2020-05-09 22:24:08 +01:00
Craig Topper	56bf0b58c2	[X86] Add an assert that v32i16/v64i8 splitting in LowerVSETCC should only occur when AVX512BW is disabled. NFC With BWI we should only get a v32i1/v64i1 result type.	2020-05-09 11:24:16 -07:00
Simon Pilgrim	0e8e731449	[X86] Allow combineVectorCompareAndMaskUnaryOp to handle 'all-bits' general case For the sint_to_fp(and(X,C)) -> and(X,sint_to_fp(C)) fold, allow combineVectorCompareAndMaskUnaryOp to match any X that ComputeNumSignBits says is all-bits, not just SETCC. Noticed while investigating mask promotion issues in PR45808	2020-05-09 14:53:25 +01:00
Craig Topper	bebdc62c3f	[SelectionDAG] Remove ConstantPoolSDNode::getAlignment. Use getAlign instead. Differential Revision: https://reviews.llvm.org/D79459	2020-05-08 16:04:11 -07:00
Craig Topper	d1119980e5	[SelectionDAG] Use Align/MaybeAlign for ConstantPoolSDNode. This patch stores the alignment for ConstantPoolSDNode as an Align and updates the getConstantPool interface to take a MaybeAlign. Removing getAlignment() will be done as a follow up. Differential Revision: https://reviews.llvm.org/D79436	2020-05-08 16:04:11 -07:00
Simon Pilgrim	5f9f37c42a	[X86][AVX] Don't let X86ISD::BROADCAST peek through bitcasts to illegal types. This was an existing bug exposed by the more aggressive X86ISD::BROADCAST generation by rG8817334ce3c7 Original test case thanks to @mstorsjo	2020-05-08 12:30:50 +01:00
Simon Pilgrim	b8a725274c	[X86][AVX] combineSignExtendInReg - promote mask arithmetic before v4i64 canonicalization We rely on the combine (sext_in_reg (v4i64 a/sext (v4i32 x)), v4i1) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x, ExtraVT))) to avoid complex v4i64 ashr codegen, but doing so prevents v4i64 comparison mask promotion, so ensure we attempt to promote before canonicalizing the (hopefully now redundant sext_in_reg). Helps with the poor codegen in PR45808.	2020-05-07 13:16:36 +01:00
Craig Topper	350645594e	[X86] Enable combinePMULH to match multiplies with elements larger than i32. We're truncating so the extra bits will be discarded.	2020-05-06 23:13:59 -07:00
Craig Topper	16c800b8b7	[X86] Remove support for Y0 constraint as an alias for Yz in inline assembly. Neither gcc nor icc supports this. Split out from D79472. I want to remove more, but it looks like icc does support some things gcc doesn't and I need to double check our internal test suites.	2020-05-06 14:58:53 -07:00
Craig Topper	9bb9ff0957	[X86] Remove incomplete support for 'Y' as an inline assembly constraint by itself. Y is the start of several 2 letter constraints, but we also had partial support to recognize it by itself. But it doesn't look like it can get through clang as a single letter so the backend support for this was effectively dead.	2020-05-06 14:23:04 -07:00
Simon Pilgrim	8817334ce3	[X86] getShuffleScalarElt - add CONCAT_VECTORS/INSERT_VECTOR_ELT support. This helped fix some i686 vXi64 broadcast folds that were becoming v2Xi32 broadcasts because we didn't match the broadcast until after SimplifyDemandedBits worked out we only used the bottom 32-bits in PMUL(U)DQ and type legalization had split the original i64 load. A couple of regressions occurred which required some fixups - adding concat_vectors(broadcast_load,broadcast_load) splat support and recognising (unnecessary) unary shuffles of already broadcasted vectors. This came about as part of the work investigating vector load combining from shuffles for PR42550.	2020-05-06 18:13:33 +01:00
Simon Pilgrim	8c71c2291e	[X86] getShuffleScalarElt - consistently use SDValue. NFC. We never need to call this from anything but ISD::SHUFFLE_VECTOR or target shuffles so shouldn't need to address SDNode directly.	2020-05-06 18:13:33 +01:00
Simon Pilgrim	f5f7fd990e	[X86][SSE] combineX86ShuffleChain - remove unused shuffle(vzext_load(),undef) combine. This should always be caught by the various VZEXT_MOVL handling in combineTargetShuffle and SimplifyDemandedVectorEltsForTargetNode.	2020-05-06 15:20:29 +01:00
Simon Pilgrim	8650b36935	[X86][SSE] Move VZEXT_MOVL removal into SimplifyDemandedVectorEltsForTargetNode This patch replaces the VZEXT_MOVL removal from combineShuffle with a more general version based in SimplifyDemandedVectorEltsForTargetNode. By using computeKnownBits we can always remove the VZEXT_MOVL if the upper elements of the source operand are known to be zero. This requires us to add the conversion ops to computeKnownBitsForTargetNode as well. Reviewed By: @craig.topper Differential Revision: https://reviews.llvm.org/D79335	2020-05-06 14:05:07 +01:00
Simon Pilgrim	1c4f118d89	[X86][SSE] getShuffleScalarElt - minor NFC cleanup. Use SelectionDAG::MaxRecursionDepth instead of (equal) hard coded constant. clang-format	2020-05-06 14:05:07 +01:00
Craig Topper	b55009df66	[X86] Add v32i16/v64i8 into the handling for 512-bit inline assembly constraints.	2020-05-05 21:41:31 -07:00
Craig Topper	0fac1c1912	[X86] Allow Yz inline assembly constraint to choose ymm0 or zmm0 when avx/avx512 are enabled and type is 256 or 512 bits gcc supports selecting ymm0/zmm0 for the Yz constraint when used with 256 or 512 bit vector types. Fixes PR45806 Differential Revision: https://reviews.llvm.org/D79448	2020-05-05 21:12:30 -07:00
Craig Topper	a4286fc952	[X86] Fix usage of Align constructing MachineMemOperands. Similar to D77687, but for the X86 specific code. Differential Revision: https://reviews.llvm.org/D79381	2020-05-05 15:25:02 -07:00
Simon Pilgrim	e53d4869a0	[X86][AVX] combineVectorSignBitsTruncation - avoid complex vXi64->vXi32 PACKSS truncations (PR45794) Unless we're truncating an 'all-bits' result, using PACKSS for vXi64->vXi32 truncation causes problems with later combines as ComputeNumSignBits struggles to see through BITCASTs to smaller types. If we don't use PACKSS in these cases then we fall back to shuffles which are usually just as good.	2020-05-05 11:57:25 +01:00

7442 Commits