llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	c693a23025	[X86] Simplify the end of custom type legalization for (v2i32/v4i16/v8i8 (bitcast (f64))) by just emitting an EXTRACT_SUBVECTOR instead of a BUILD_VECTOR. Generic legalization should be able to finish legalizing the EXTRACT_SUBVECTOR probably by turning it into a BUILD_VECTOR. But we should emit the simplest sequence. llvm-svn: 344424	2018-10-12 22:00:00 +00:00
Craig Topper	a8a44f1bec	[X86] Skip (v2i32/v4i16/v8i8 (bitcast (f64))) handling in ReplaceNodeResults if the dest type can be widened by generic legalization. NFCI The algorithm we would do previously was identical to generic legalization. If we ever switch to legalizing integer vectors via widening we'll be able to kill off the code since it now only runs for promotion. llvm-svn: 344423	2018-10-12 21:59:58 +00:00
Sanjay Patel	e28c8ecd72	[x86] add and use fast horizontal vector math subtarget feature This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen by default. AMD Jaguar (btver2) should have no difference with this patch because it has fast-hops. (If we want to set that bit for other CPUs, let me know.) The code changes are small, but there are many test diffs. For files that are specifically testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences side-by-side. For files that are primarily concerned with codegen other than hops, I just updated the CHECK lines to reflect the new default codegen. To recap the recent horizontal op story: 1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. Hops were likely not optimal for all targets though. 2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195). 3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but probably bad for other CPUs. 4. This patch allows us to distinguish when we want to produce hops, so everyone can be happy. I'm not sure if we have the best predicate here, but the intent is to undo the extra hop-iness that was enabled by r344141. Differential Revision: https://reviews.llvm.org/D53095 llvm-svn: 344361	2018-10-12 16:41:02 +00:00
Eric Liu	55ab86b72b	Fix unused variable warning after r344348 llvm-svn: 344350	2018-10-12 15:01:11 +00:00
Simon Pilgrim	78b5a3c3ef	[X86][SSE] LowerVectorCTPOP - pull out repeated byte sum stage. Pull out repeated byte sum stage for popcount of vector elements > 8bits. This allows us to simplify the LUT/BITMATH popcnt code to always assume vXi8 vectors, and also improves avx512bitalg codegen which only has access to vpopcntb/vpopcntw. llvm-svn: 344348	2018-10-12 14:18:47 +00:00
Simon Pilgrim	29279f29c8	[X86][SSE] Add extract_subvector(PSHUFB) -> PSHUFB(extract_subvector()) combine Fixes PR32160 by reducing the size of PSHUFB if we only use one of the lanes. This approach can probably be generalized to handle any target shuffle (and any subvector index) but we have no test coverage at the moment. llvm-svn: 344336	2018-10-12 12:10:34 +00:00
Richard Trieu	dfd1760b5f	Inline variable into assert to avoid unused variable warning. llvm-svn: 344308	2018-10-11 22:42:41 +00:00
Craig Topper	35d513c7e4	[X86] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector. On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success. This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency and to keep the load in the fp domain. There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately. Differential Revision: https://reviews.llvm.org/D52528 llvm-svn: 344291	2018-10-11 20:36:06 +00:00
Craig Topper	fb2ac8969e	[X86] Restore X86ISelDAGToDAG::matchBEXTRFromAnd. Teach address matching to create a BEXTR pattern from a (shl (and X, mask >> C1) if C1 can be folded into addressing mode. This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about. We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR. I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that. I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet. Differential Revision: https://reviews.llvm.org/D53126 llvm-svn: 344270	2018-10-11 18:06:07 +00:00
Roman Lebedev	33d84c6dac	[X86] Move X86DAGToDAGISel::matchBEXTRFromAnd() into X86ISelLowering Summary: As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 \| PR38938 ]], we fail to emit `BEXTR` if the mask is shifted. We can't deal with that in `X86DAGToDAGISel` `before the address mode for the inc is selected`, and we can't really do it in the normal DAGCombine, because we don't have generic `ISD::BitFieldExtract` node, and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back. So it would seem X86ISelLowering is the place to handle this. This patch only moves the matchBEXTRFromAnd() from X86DAGToDAGISel to X86ISelLowering. It does not add support for the 'shifted mask' pattern. Reviewers: RKSimon, craig.topper, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D52426 llvm-svn: 344179	2018-10-10 20:40:12 +00:00
Sanjay Patel	6cca8af227	[x86] allow single source horizontal op matching (PR39195) This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in: rL343727 As noted in PR39195: https://bugs.llvm.org/show_bug.cgi?id=39195 ...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature bit to deal with it here in the DAG. Differential Revision: https://reviews.llvm.org/D52997 llvm-svn: 344141	2018-10-10 13:39:59 +00:00
Simon Pilgrim	5cb3a82892	[TargetLowering] Add root node back to work list after successful SimplifyDemandedBits/SimplifyDemandedVectorElts Similar to what already happens in the DAGCombiner wrappers, this patch adds the root nodes back onto the worklist if the DCI wrappers' SimplifyDemandedBits/SimplifyDemandedVectorElts were successful. Differential Revision: https://reviews.llvm.org/D53026 llvm-svn: 344132	2018-10-10 10:44:15 +00:00
Craig Topper	f6d8400869	[X86] When lowering unsigned v2i64 setcc without SSE42, flip the sign bits in the v2i64 type then bitcast to v4i32. This may give slightly better opportunities for DAG combine to simplify with the operations before the setcc. It also matches the type the xors will eventually be promoted to anyway so it saves a legalization step. Almost all of the test changes are because our constant pool entry is now v2i64 instead of v4i32 on 64-bit targets. On 32-bit targets getConstant should be emitting a v4i32 build_vector and a v4i32->v2i64 bitcast. There are a couple test cases where it appears we now combine a bitwise not with one of these xors which caused a new constant vector to be generated. This prevented a constant pool entry from being shared. But if that's an issue we're concerned about, it seems we need to address it another way that just relying a bitcast to hide it. This came about from experiments I've been trying with pushing the promotion of and/or/xor to vXi64 later than LegalizeVectorOps where it is today. We run LegalizeVectorOps in a bottom up order. So the and/or/xor are promoted before their users are legalized. The bitcasts added for the promotion act as a barrier to computeKnownBits if we try to use it during vector legalization of a later operation. So by moving the promotion out we can hopefully get better results from computeKnownBits/computeNumSignBits like in LowerTruncate on AVX512. I've also looked at running LegalizeVectorOps in a top down order like LegalizeDAG, but thats showing some other issues. llvm-svn: 344071	2018-10-09 19:05:50 +00:00
Sanjay Patel	f5fac1826a	[x86] use demanded bits to simplify masked store codegen As noted in D52747, if we prefer IR to use trunc for bool vectors rather than and+icmp, we can expose codegen shortcomings as seen here with masked store. Replace a hard-coded PCMPGT simplification with the more general demanded bits call to improve things. Differential Revision: https://reviews.llvm.org/D52964 llvm-svn: 344048	2018-10-09 14:04:14 +00:00
Simon Pilgrim	720db8ed7b	[X86][AVX1] Enable _EXTEND_VECTOR_INREG lowering of 256-bit vectors As discussed on D52964, this adds 256-bit _EXTEND_VECTOR_INREG lowering support for AVX1 targets to help improve SimplifyDemandedBits handling. Differential Revision: https://reviews.llvm.org/D52980 llvm-svn: 344019	2018-10-09 07:42:01 +00:00
Craig Topper	ff9f02580d	[X86] Prefer isTypeLegal over checking isSimple in a DAG combine. Simple types are a superset of what all in tree targets in LLVM could possibly have a legal type. This means the behavior of using isSimple to check for a supported type for X86 could change over time. For example, this could would change if a v256i1 type was added to MVT in the future. llvm-svn: 343995	2018-10-08 20:02:59 +00:00
Simon Pilgrim	6fc8d05565	[X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors Some necessary yak shaving before lowering *_EXTEND_VECTOR_INREG 256-bit vectors on AVX1 targets as suggested by D52964. Differential Revision: https://reviews.llvm.org/D52970 llvm-svn: 343991	2018-10-08 18:40:50 +00:00
Sanjay Patel	43bf9917cc	[x86] make horizontal binop matching clearer; NFCI The instructions are complicated, so this code will probably never be very obvious, but hopefully this makes it better. As shown in PR39195: https://bugs.llvm.org/show_bug.cgi?id=39195 ...we need to improve the matching to not miss cases where we're h-opping on 1 source vector, and that should be a small patch after this rearranging. llvm-svn: 343989	2018-10-08 18:08:02 +00:00
Simon Pilgrim	9fa1c66421	[X86] getFauxShuffleMask - Handle undef + sentinel values in subvector insertion llvm-svn: 343926	2018-10-06 22:13:44 +00:00
Simon Pilgrim	a30e8d23e2	[X86][AVX] Ensure resolveTargetShuffleInputs shuffle masks are the correct width Don't handle ZERO_EXTEND style shuffles until we support bitcasts. Found by inspection. llvm-svn: 343924	2018-10-06 17:18:41 +00:00
Simon Pilgrim	62d199f4e5	[X86] combinePMULDQ - add op back to worklist if SimplifyDemandedBits succeeds on either operand Prevents missing other simplifications that may occur deep in the operand chain where CommitTargetLoweringOpt won't add the PMULDQ back to the worklist itself llvm-svn: 343922	2018-10-06 14:51:14 +00:00
Simon Pilgrim	0cc0a24b55	[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - simplify PSHUFB masks Attempt to simplify PSHUFB masks (even non-constant ones) - we should probably be able to simplify other variable shuffles as well as the need arises. llvm-svn: 343919	2018-10-06 13:49:31 +00:00
Simon Pilgrim	ae78d709b4	[X86] Use the SimplifyDemandedBits wrappers where possible. NFCI. Leave the wrapper to handle TargetLowering::TargetLoweringOpt and CommitTargetLoweringOpt. llvm-svn: 343918	2018-10-06 13:29:08 +00:00
Simon Pilgrim	dc97118efe	[X86][AVX] Limit getFauxShuffleMask INSERT_SUBVECTOR support to 2 inputs rL343853 didn't limit the number of subinputs, but we don't currently support faux shuffles with more than 2 total inputs, so put a limiter in place until this is fixed. Found by Artem Dergachev. llvm-svn: 343891	2018-10-05 21:44:19 +00:00
Craig Topper	0ed892da70	[X86] Don't promote i16 compares to i32 if the immediate will fit in 8 bits. The comments in this code say we were trying to avoid 16-bit immediates, but if the immediate fits in 8-bits this isn't an issue. This avoids creating a zero extend that probably won't go away. The movmskb related changes are interesting. The movmskb instruction writes a 32-bit result, but fills the upper bits with 0. So the zero_extend we were previously emitting was free, but we turned a -1 immediate that would fit in 8-bits into a 32-bit immediate so it was still bad. llvm-svn: 343871	2018-10-05 18:13:36 +00:00
Simon Pilgrim	6c5ab48fe7	[X86][AVX] getFauxShuffleMask - add support for INSERT_SUBVECTOR subvector shuffles Decode subvector shuffles from INSERT_SUBVECTOR(SRC0, SHUFFLE(EXTRACT_SUBVECTOR(SRC1)) This was found necessary while investigating PR39161 llvm-svn: 343853	2018-10-05 14:41:00 +00:00
Craig Topper	7d2155e3f9	[X86][LegalizeVectorOps] Use MERGE_VALUES to return two results from LowerLoad. Remove special case code in LegalizeVectorOps that allowed us to only return one result. Previously we replaced the chain use ourself and return the data result. LegalizeVectorOps then detected that we'd done this and assumed the chain had already been handled. This commit instead returns a MERGE_VALUES node with two results joined from nodes. This allows LegalizeVectorOps to do all the replacements for us without any special casing. The MERGE_VALUES will be removed by DAG combine. llvm-svn: 343817	2018-10-04 21:24:24 +00:00
David Greene	4f916df29e	[X86] Set correct MMO offset on scalarized load pieces When scalarizing a load, be sure to update the offset in the MachineMemOperand for each scalar load. llvm-svn: 343776	2018-10-04 14:07:59 +00:00
Craig Topper	8b3c46f0a8	[X86] Merge matchANDXORWithAllOnesAsANDNP into combineANDXORWithAllOnesIntoANDNP. NFCI It's the only caller and the logic pretty easy to combine. llvm-svn: 343754	2018-10-04 06:13:27 +00:00
Craig Topper	a65c2dbfd6	[X86] Stop promoting vector ISD::SELECT to vXi64. The additional patterns needed for this aren't overwhelming and introducing extra bitcasts during lowering limits our ability to do computeNumSignBits. Not that I have a good example of that for select. I'm just becoming increasingly grumpy about promotion of AND/OR/XOR. SELECT was just a lot easier to fix. llvm-svn: 343723	2018-10-03 21:10:29 +00:00
Craig Topper	c39dc41b63	[X86] Add CMOV_VK2/VK4 pseudos and remove lowering code that turned v2i1/v4i1 SELECT into v8i1. llvm-svn: 343713	2018-10-03 20:28:43 +00:00
Craig Topper	703fbde3cb	[X86] Add CMOV pseudos for VR128X and VR256X register classes. Use them when AVX512VL is enabled. This allows the phi nodes to be generated with the correct register class when expanded. llvm-svn: 343710	2018-10-03 19:48:26 +00:00
Craig Topper	4b62c2dbda	[X86] Don't break CMOV pseudo instructions down by type. Just by register class. The register class is all that's important for the pseudo instructions. We can use patterns to handle the different types. llvm-svn: 343709	2018-10-03 19:48:23 +00:00
Nirav Dave	925b64be64	[X86] Correctly use SSE registers if no-x87 is selected. Fix use of SSE1 registers for f32 ops in no-x87 mode. Notably, allow use of SSE instructions for f32 operations in 64-bit mode (but not 32-bit which is disallowed by callign convention). Also avoid translating memset/memcopy/memmove into SSE registers without X87 for 32-bit mode. This fixes PR38738. Reviewers: nickdesaulniers, craig.topper Subscribers: hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D52555 llvm-svn: 343689	2018-10-03 14:13:30 +00:00
Simon Pilgrim	a2efe82b81	[X86] SimplifyDemandedVectorEltsForTargetNode - remove identity target shuffles before simplifying inputs By removing demanded target shuffles that simplify to zero/undef/identity before simplifying its inputs we improve chances of further simplification, as only the immediate parent user of the combined is added back to the work list - this still doesn't help us if its passed through other ops though (bitcasts....). llvm-svn: 343390	2018-09-29 18:15:26 +00:00
Simon Pilgrim	a93407fadf	[X86][SSE] LowerScalarImmediateShift - remove 32-bit vXi64 special case handling. This is all handled generally by getTargetConstantBitsFromNode now llvm-svn: 343387	2018-09-29 17:36:22 +00:00
Simon Pilgrim	b5737007cd	Fix signed/unsigned mismatch warning. NFCI. llvm-svn: 343385	2018-09-29 17:11:19 +00:00
Simon Pilgrim	d633e290c8	[X86] getTargetConstantBitsFromNode - add support for rearranging constant bits via shuffles Exposed an issue that recursive calls to getTargetConstantBitsFromNode don't handle changes to EltSizeInBits yet. llvm-svn: 343384	2018-09-29 17:01:55 +00:00
Simon Pilgrim	ae34ae12ef	[X86][SSE] LowerScalarImmediateShift - use getTargetConstantBitsFromNode to get immediate data Don't just attempt to find a splat build vector. First step towards getting rid of all the 32-bit special case code. llvm-svn: 343383	2018-09-29 16:40:35 +00:00
Simon Pilgrim	a731940c60	[X86] getTargetConstantBitsFromNode - fix self-move assertions from gcc builds due to rL343375 llvm-svn: 343377	2018-09-29 14:51:09 +00:00
Simon Pilgrim	22d51014af	[X86] getTargetConstantBitsFromNode - add support for peeking through ISD::EXTRACT_SUBVECTOR llvm-svn: 343375	2018-09-29 14:17:32 +00:00
Simon Pilgrim	aa77033a6b	[X86][SSE] Fixed issue with v2i64 variable shifts on 32-bit targets The shift amount might have peeked through a extract_subvector, altering the number of vector elements in the 'Amt' variable - so we were incorrectly calculating the ratio when peeking through bitcasts, resulting in incorrectly detecting splats. llvm-svn: 343373	2018-09-29 13:25:22 +00:00
Simon Pilgrim	ebabd79f43	[X86][SSE] canReduceVMulWidth - use ComputeNumSignBits/SignBitIsZero directly Don't reinvent the wheel for BUILD_VECTOR/ZERO_EXTEND - its only the ANY_EXTEND special case that needs handling. llvm-svn: 343096	2018-09-26 11:48:52 +00:00
Simon Pilgrim	5beaac433d	[X86][SSE] Use ISD::MULHS for constant vXi16 ISD::SRA lowering (PR38151) Similar to the existing ISD::SRL constant vector shifts from D49562, this patch adds ISD::SRA support with ISD::MULHS. As we're dealing with signed values, we have to handle shift by zero and shift by one special cases, so XOP+AVX2/AVX512 splitting/extension is still a better solution - really we should still use ISD::MULHS if one of the special cases are used but for now I've just left a TODO and filtered by isKnownNeverZero. Differential Revision: https://reviews.llvm.org/D52171 llvm-svn: 343093	2018-09-26 10:57:05 +00:00
Craig Topper	12c18840fa	[X86] Allow movmskpd/ps ISD nodes to be created and selected with integer input types. This removes an int->fp bitcast between the surrounding code and the movmsk. I had already added a hack to combineMOVMSK to try to look through this bitcast to improve the SimplifyDemandedBits there. But I found an additional issue where the bitcast was preventing combineMOVMSK from being called again after earlier nodes in the DAG are optimized. The bitcast gets revisted, but not the user of the bitcast. By using integer types throughout, the bitcast doesn't get in the way. llvm-svn: 343046	2018-09-25 23:28:27 +00:00
Simon Pilgrim	96335dd1ec	[X86] combineUIntToFP - Fix UINT_TO_FP(vXi1) comment (PR39078) llvm-svn: 343026	2018-09-25 20:52:08 +00:00
Sanjay Patel	10c11b867a	[x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449) This is the final (I hope!) problem pattern mentioned in PR37749: https://bugs.llvm.org/show_bug.cgi?id=37749 We are trying to avoid an AVX1 sinkhole caused by having 256-bit bitwise logic ops but no other 256-bit integer ops. We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches that can over-reach. Ie, splitting to 128-bit does not look like a win in most cases with >1 256-bit op. The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test, we have this vector-type-legalized sequence: t29: v8i32 = concat_vectors t27, t28 t30: v4i64 = bitcast t29 t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, ... t31: v4i64 = bitcast t18 t32: v4i64 = xor t30, t31 t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, ... t34: v4i64 = bitcast t9 t35: v4i64 = and t32, t34 t36: v8i32 = bitcast t35 t37: v4i32 = extract_subvector t36, Constant:i64<0> t38: v4i32 = extract_subvector t36, Constant:i64<4> Differential Revision: https://reviews.llvm.org/D52318 llvm-svn: 343008	2018-09-25 19:09:34 +00:00
Craig Topper	6fb1358a98	[X86] Add AVX512 support to combineVectorSizedSetCCEquality. Reviewers: spatel, RKSimon Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D52424 llvm-svn: 342989	2018-09-25 16:27:12 +00:00
Craig Topper	9ce5da7b62	[X86] Don't create FILD ISD nodes when X87 is disabled. The included test case previously asserted because the type legalizer tried to soften the FILD ISD node. Fixes PR38819. llvm-svn: 342934	2018-09-25 00:16:57 +00:00
Craig Topper	aeb4930b47	[X86] Remove superfluous curly braces. NFC llvm-svn: 342933	2018-09-25 00:16:54 +00:00
Craig Topper	b7e2499e80	[X86] Update comment. Use 'glued' instead of 'flagged' NFC llvm-svn: 342932	2018-09-25 00:16:52 +00:00
Sanjay Patel	0027946915	[DAGCombiner][x86] extend decompose of integer multiply into shift/add with negation This is an alternative to https://reviews.llvm.org/D37896. We can't decompose multiplies generically without a target hook to tell us when it's profitable. ARM and AArch64 may be able to remove some existing code that overlaps with this transform. This extends D52195 and may resolve PR34474: https://bugs.llvm.org/show_bug.cgi?id=34474 (still an open question about transforming legal vector multiplies, but we could open another bug report for those) llvm-svn: 342844	2018-09-23 18:41:38 +00:00
Craig Topper	3e0b4b0eb7	[X86] Fix a few typos in comments. llvm-svn: 342829	2018-09-23 06:49:47 +00:00
Craig Topper	9995760df4	[X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)) for vXi8 vectors. We don't have a vXi8 shift left so we need to bitcast to a vXi16 vector to perform the shift. If we let lowering legalize the vXi8 shift we get an extra and that we don't need and fail to remove. llvm-svn: 342795	2018-09-22 05:08:38 +00:00
Sanjay Patel	8a1227ccc8	[SelectionDAG] replace duplicated peekThroughBitcast helper functions; NFCI x86 had 2 versions of peekThroughBitcast. DAGCombiner had 1. Plus, it had a 1-off implementation for the one-use variant. Move the x86 versions of the code to SelectionDAG, so we don't have different copies of the code. No functional change intended. I'm putting this next to isBitwiseNot() because I am planning to use it in there. Another option is next to the helpers in the ISD namespace (eg, ISD::isConstantSplatVector()). But if there's no good reason for those to be there, I'd prefer to pull other helpers over to SelectionDAG in follow-up steps. Differential Revision: https://reviews.llvm.org/D52285 llvm-svn: 342669	2018-09-20 17:34:08 +00:00
Simon Pilgrim	3e2de767f6	[X86][SSE] Remove UNPCKL(SHUFFLE)->UNPCKH custom combine This can be achieved more generally by combineX86ShufflesRecursively. llvm-svn: 342645	2018-09-20 13:10:22 +00:00
Simon Pilgrim	46c1dcb1af	[X86][SSE] Remove PSHUFLW/PSHUFHW combineRedundantHalfShuffle combine This can be achieved more generally by combineX86ShufflesRecursively and was causing a fuzz test failure found by Mikael Holmén. llvm-svn: 342642	2018-09-20 12:11:38 +00:00
Sanjay Patel	1a1c0ee599	[x86] change names of vector splitting helper functions; NFC As the code comments suggest, these are about splitting, and they are not necessarily limited to lowering, so that misled me. There's nothing that's actually x86-specific in these either, so they might be better placed in a common header so any target can use them. llvm-svn: 342575	2018-09-19 18:52:00 +00:00
Simon Pilgrim	8191d63c3b	[X86] Add initial SimplifyDemandedVectorEltsForTargetNode support This patch adds an initial x86 SimplifyDemandedVectorEltsForTargetNode implementation to handle target shuffles. Currently the patch only decodes a target shuffle, calls SimplifyDemandedVectorElts on its input operands and removes any shuffle that reduces to undef/zero/identity. Future work will need to integrate this with combineX86ShufflesRecursively, add support for other x86 ops, etc. NOTE: There is a minor regression that appears to be affecting further (extractelement?) combines which I haven't been able to solve yet - possibly something to do with how nodes are added to the worklist after simplification. Differential Revision: https://reviews.llvm.org/D52140 llvm-svn: 342564	2018-09-19 18:11:34 +00:00
Sanjay Patel	4fd2e2a498	[DAGCombiner][x86] add transform/hook to decompose integer multiply into shift/add This is an alternative to D37896. I don't see a way to decompose multiplies generically without a target hook to tell us when it's profitable. ARM and AArch64 may be able to remove some duplicate code that overlaps with this transform. As a first step, we're only getting the most clear wins on the vector examples requested in PR34474: https://bugs.llvm.org/show_bug.cgi?id=34474 As noted in the code comment, it's likely that the x86 constraints are tighter than necessary, but it may not always be a win to replace a pmullw/pmulld. Differential Revision: https://reviews.llvm.org/D52195 llvm-svn: 342554	2018-09-19 15:57:40 +00:00
Simon Pilgrim	e9bf71e761	[X86][SSE] LowerShift - pull out repeated getTargetVShiftUniformOpcode calls. NFCI. llvm-svn: 342462	2018-09-18 10:44:44 +00:00
Keno Fischer	c8ccaed325	[X86ISel] Implement byval lowering for Win64 calling convention Summary: The IR reference for the `byval` attribute states: ``` This indicates that the pointer parameter should really be passed by value to the function. The attribute implies that a hidden copy of the pointee is made between the caller and the callee, so the callee is unable to modify the value in the caller. This attribute is only valid on LLVM pointer arguments. ``` However, on Win64, this attribute is unimplemented and the raw pointer is passed to the callee instead. This is problematic, because frontend authors relying on the implicit hidden copy (as happens for every other calling convention) will see the passed value silently (if mutable memory) or loudly (by means of a crash) modified because the callee treats the location as scratch memory space it is allowed to mutate. At this point, it's worth taking a step back to understand the context. In most calling conventions, aggregates that are too large to be passed in registers, instead get copied to the stack at a fixed (computable from the signature) offset of the stack pointer. At the LLVM, we hide this hidden copy behind the byval attribute. The caller passes a pointer to the desired data and the callee receives a pointer, but these pointers are not the same. In particular, the pointer that the callee receives points to temporary stack memory allocated as part of the call lowering. In most calling conventions, this pointer is never realized in registers or memory. The temporary memory is simply defined by an implicit offset from the stack pointer at function entry. Win64, uniquely, works differently. The structure is still passed in memory, but instead of being stored at an implicit memory offset, the caller computes a pointer to the temporary memory and passes it to the callee as a regular pointer (taking up a register, or if all registers are taken up, an additional stack slot). Presumably, this was done to allow eliding the copy when passing aggregates through several functions on the stack. This explains why ignoring the `byval` attribute mostly works on Win64. The argument simply gets passed as a pointer and as long as we're ok with the callee trampling all over that memory, there are no ill effects. However, it does contradict the documentation of the `byval` attribute which specifies that there is to be an implicit copy. Frontends can of course work around this by never emitting the `byval` attribute for Win64 and creating `alloca`s for the requisite temporary stack slots (and that does appear to be what frontends are doing). However, the presence of the `byval` attribute is not a trap for frontend authors, since it seems to work, but silently modifies the passed memory contrary to documentation. I see two solutions: - Disallow the `byval` attribute in the verifier if using the Win64 calling convention. - Make it work by simply emitting a temporary stack copy as we would with any other calling convention (frontends can of course always not use the attribute if they want to elide the copy). This patch implements the second option (make it work), though I would be fine with the first also. Ref: https://github.com/JuliaLang/julia/issues/28338 Reviewers: rnk Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D51842 llvm-svn: 342402	2018-09-17 17:37:14 +00:00
Simon Pilgrim	cffa206423	[X86][SSE] Always enable ISD::SRL -> ISD::MULHU for v8i16 For constant non-uniform cases we'll never introduce more and/andn/or selects than already occur in generic pre-SSE41 ISD::SRL lowering. llvm-svn: 342352	2018-09-16 20:28:38 +00:00
Simon Pilgrim	ea069ffd44	[X86][AVX] Enable ISD::SRL -> ISD::MULHU for v16i16 Now that rL340913 has landed with improved v16i16 selects as shuffles. llvm-svn: 342349	2018-09-16 19:20:47 +00:00
Sanjay Patel	bfee5a9b42	[x86] fix uses check in broadcast transform (PR38949) https://bugs.llvm.org/show_bug.cgi?id=38949 It's not clear to me that we even need a one-use check in this fold. Ie, 2 independent loads might be better than a load+dependent shuffle. Note that the existing re-use tests are not affected. We actually do form a broadcast node in those tests now because there's no extra use of the insert_subvector node in those cases. But something later in isel pattern matching decides that it is not worth using a broadcast for the full load in those tests: Legalized selection DAG: %bb.0 'test_broadcast_2f64_4f64_reuse:' t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64 t4: i64,ch = CopyFromReg t0, Register:i64 %1 t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64 t18: v4f64 = insert_subvector undef:v4f64, t7, Constant:i64<0> t20: v4f64 = insert_subvector t18, t7, Constant:i64<2> Becomes: t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64 t4: i64,ch = CopyFromReg t0, Register:i64 %1 t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64 t21: v4f64 = X86ISD::SUBV_BROADCAST t7 ISEL: Starting selection on root node: t21: v4f64 = X86ISD::SUBV_BROADCAST t7 ... Created node: t27: v4f64 = INSERT_SUBREG IMPLICIT_DEF:v4f64, t7, TargetConstant:i32<7> Morphed node: t21: v4f64 = VINSERTF128rr t27, t7, TargetConstant:i8<1> llvm-svn: 342347	2018-09-16 15:41:56 +00:00
Craig Topper	fe0b973fbf	[X86] Remove an fp->int->fp domain crossing in LowerUINT_TO_FP_i64. Summary: This unfortunately adds a move, but isn't that better than going to the int domain and back? Reviewers: RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D52134 llvm-svn: 342327	2018-09-15 16:23:35 +00:00
Craig Topper	273f755da3	[X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)) Summary: MOVMSK only care about the sign bit so we don't need the setcc to fill the whole element with 0s/1s. We can just shift the bit we're looking for into the sign bit. This saves a constant pool load. Inspired by PR38840. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: lebedev.ri, llvm-commits Differential Revision: https://reviews.llvm.org/D52121 llvm-svn: 342326	2018-09-15 16:23:33 +00:00
Simon Pilgrim	32857c54d2	[X86][SSE] Lower shuffles to permute(unpack(x,y)) (PR31151) Attempt to lower a shuffle as an unpack of elements from two inputs followed by a single-input (wider) permutation. As long as the permutation is wider this is a win - there may be some circumstances where same size permutations would also be useful but I've left that for future work. Differential Revision: https://reviews.llvm.org/D52043 llvm-svn: 342257	2018-09-14 18:33:31 +00:00
Nirav Dave	59ad1c8457	[X86] Fix register resizings for inline assembly register operands. When replacing a named register input to the appropriately sized sub/super-register. In the case of a 64-bit value being assigned to a register in 32-bit mode, match GCC's assignment. Reviewers: eli.friedman, craig.topper Subscribers: nickdesaulniers, llvm-commits, hiraditya Differential Revision: https://reviews.llvm.org/D51502 llvm-svn: 342175	2018-09-13 20:33:56 +00:00
Nirav Dave	2060a16dfd	[X86] Cleanup pair returns. NFCI. llvm-svn: 342174	2018-09-13 20:33:27 +00:00
Craig Topper	f107123a88	[X86] Type legalize v2i32 div/rem by scalarizing rather than promoting Summary: Previously we type legalized v2i32 div/rem by promoting to v2i64. But we don't support div/rem of vectors so op legalization would then scalarize it using i64 scalar ops since it doesn't know about the original promotion. 64-bit scalar divides on Intel hardware are known to be slow and in 32-bit mode they require a libcall. This patch switches type legalization to do the scalarizing itself using i32. It looks like the division by power of 2 optimization is still kicking in and leaving the code as a vector. The division by other constant optimization doesn't kick in pre type legalization since it ignores illegal types. And previously, after type legalization we scalarized the v2i64 since we don't have v2i64 MULHS/MULHU support. Another option might be to widen v2i32 to v4i32 so we could do division by constant optimizations, but we'd have to be careful to only do that for constant divisors or we risk scalaring to 4 scalar divides. Reviewers: RKSimon, spatel Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D51325 llvm-svn: 342114	2018-09-13 06:13:37 +00:00
Craig Topper	d7362a3e5f	[X86] Correct the one use check from r341915. The one use check should be on the bitcast, not the input to the bitcast. llvm-svn: 341956	2018-09-11 16:05:03 +00:00
Craig Topper	844f035e1e	[X86] In combineMOVMSK, look through int->fp bitcasts before callling SimplifyDemandedBits. MOVMSKPS and MOVMSKPD both take FP types, but likely the operations before it are on integer types with just a int->fp bitcast between them. If the bitcast isn't used by anything else and doesn't change the element width we can look through it to simplify the integer ops. llvm-svn: 341915	2018-09-11 08:20:02 +00:00
Craig Topper	a5ae613c15	[X86] Mark the ISD::SETLT/SETLE condition codes as illegal for v32i16/v64i8 to match the other vector types. I'm having a hard time finding a test case for this, but we should be consistent here. The fact that we canonicalize all zeros and all ones constants to vXi32 and all other constants to loads makes this hard to hit the easy DAG combine infinite loop we get for some of the other types. llvm-svn: 341859	2018-09-10 20:31:27 +00:00
Craig Topper	3823516103	[X86] Custom type legalize (v2i32 (fp_to_uint v2f64))) without avx512vl by widening to v4i32 and v4f64 instead of v8i32 and v8f64. Make it aware of x86-experimental-vector-widening-legalization We have isel patterns for v4i32/v4f64 that artificially widen to v8i32/v8f64 so just use that. If x86-experimental-vector-widening-legalization is enabled, we don't need any custom legalization and can just return. I've modified the test RUN lines to cover this case. llvm-svn: 341765	2018-09-09 20:36:36 +00:00
Craig Topper	7af5e333e7	[X86] Create paddus/psubus from narrower vectors with i8/i16 element types. Summary: This patch allows vectors with a power of 2 number of elements and i8/i16 element type to select paddus/psubus instructions. ReplaceNodeResults has been updated to custom widen these operations up to 128 bits like we already do for PAVG. Another step towards fixing PR38691 Reviewers: RKSimon, spatel Reviewed By: RKSimon, spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D51818 llvm-svn: 341753	2018-09-08 19:32:58 +00:00
Craig Topper	5cbce81c91	[X86] Don't create ZERO_EXTEND_INREG/SIGN_EXTEND_INREG for v1iX vectors. The generic type legalizer will scalarize vXi1 instructions getting rid of the vector entirely. Creating wider vector instructions is just going to prevent that. llvm-svn: 341705	2018-09-07 20:56:03 +00:00
Craig Topper	39f48fdcbc	[X86] Don't create X86ISD::AVG nodes from v1iX vectors. The type legalizer will try to scalarize this and fail. It looks like there's some other v1iX oddities out there too since we still generated some vector instructions. llvm-svn: 341704	2018-09-07 20:56:01 +00:00
Craig Topper	4863313b35	[X86] Modify the the rdtscp intrinsic to return values instead of taking a pointer argument Similar to what was recently done for addcarry/subborrow and has been done for rdrand/rdseed for a while. It's better to use two results and an explicit store in IR when the store isn't part of the semantics of the instruction. This allows store->load forwarding to happen in the middle end. Or the store to be removed if its never loaded. Differential Revision: https://reviews.llvm.org/D51803 llvm-svn: 341698	2018-09-07 19:14:15 +00:00
Craig Topper	72964ae99e	[X86] Change the addcarry and subborrow intrinsics to return 2 results and remove the pointer argument. We should represent the store directly in IR instead. This gives the middle end a chance to remove it if it can see a load from the same address. Differential Revision: https://reviews.llvm.org/D51769 llvm-svn: 341677	2018-09-07 16:58:39 +00:00
Craig Topper	caf6672779	[X86] Add intrinsics for KTEST instructions. These intrinsics use the same implementation as PTEST intrinsics, but use vXi1 vectors. New clang builtins will be accompanying them shortly. llvm-svn: 341259	2018-08-31 21:31:53 +00:00
Craig Topper	b7bb9f0078	[X86] Add support for turning vXi1 shuffles into KSHIFTL/KSHIFTR. This patch recognizes shuffles that shift elements and fill with zeros. I've copied and modified the shift matching code we use for normal vector registers to do this. I'm not sure if there's a good way to share more of this code without making the existing function more complex than it already is. This will be used to enable kshift intrinsics in clang. Differential Revision: https://reviews.llvm.org/D51401 llvm-svn: 341227	2018-08-31 17:17:21 +00:00
Craig Topper	83e9f928ba	[X86] Don't do anything in ReplaceNodeResults for (v2i32 (fptoui/fptosi v2f32)) when -x86-experimental-vector-widening-legalization is on. We don't need to do our own widening, the generic legalizer can do it. llvm-svn: 341174	2018-08-31 07:05:39 +00:00
Craig Topper	e9af89a78b	[X86] Don't custom widen (v2i32 (setcc v2f32)) when -x86-experimental-vector-widening-legalization is in effect. We aren't doing anything than what the generic legalizer will do so just let it do it. llvm-svn: 341172	2018-08-31 07:05:37 +00:00
Craig Topper	1a8c99e670	[X86] Weaken an overly aggressive assert. This assert tried to check that AND constants are only on the RHS. But its possible for both operands to be constants if one is opaque which will prevent the AND from being constant folded. Fixes PR38771 llvm-svn: 341102	2018-08-30 19:35:38 +00:00
Simon Pilgrim	6b9bf7ecbc	[X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead. TODO: As discussed on the patch, we should investigate adding VPBLENDVB handling to target shuffle combining as well, that will allow us to extend this to VPBLENDW+VPBLENDW+VPBLENDD. Differential Revision: https://reviews.llvm.org/D50074 llvm-svn: 340913	2018-08-29 10:51:08 +00:00
Simon Pilgrim	af98587095	[X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694) This patch creates the shift mask and actual shift using the vXi16 vector shift ops. Differential Revision: https://reviews.llvm.org/D51263 llvm-svn: 340813	2018-08-28 10:37:29 +00:00
Simon Pilgrim	f119e27d80	[X86][SSE] Avoid vector extraction/insertion for non-constant uniform shifts As discussed on D51263, we're better off using byte shifts to clear the upper bits on pre-SSE41 hardware. llvm-svn: 340810	2018-08-28 10:14:09 +00:00
Craig Topper	c1436db753	[X86] Fix some comments to refer to KORTEST not KTEST. NFC KTEST is a different instruction. All of this code uses KORTEST. llvm-svn: 340799	2018-08-28 06:39:35 +00:00
Craig Topper	4be11c0585	[X86] When lowering v32i8 MULHS/MULHU, shuffle after the PACKUS rather than before. We're using a 256-bit PACKUS to do the truncation, but that instruction operates on 128-bit lanes. So previously we shuffled first to rearrange the lanes. But that requires 2 shuffles. Instead we can shuffle after the PACKUS using a single VPERMQ. This matches what our normal LowerTRUNCATE code does when it uses PACKUS. Differential Revision: https://reviews.llvm.org/D51284 llvm-svn: 340757	2018-08-27 17:20:41 +00:00
Craig Topper	fff90377fd	[X86] Add support for matching paddus patterns where one of the vectors is a constant. InstCombine mucks these up a bit. So we need to do some additional pattern matching to fix it. There are a still a few special cases not handled, but this covers the general case. Differential Revision: https://reviews.llvm.org/D50952 llvm-svn: 340756	2018-08-27 17:20:38 +00:00
Craig Topper	73ed2a2a6b	[X86] Cleanup the LowerMULH code by hoisting some commonalities between the vXi32 and vXi8 handling. NFCI vXi32 support was recently moved from LowerMUL_LOHI to LowerMULH. This commit shares the getOperand calls, switches both to use common IsSigned flag, and hoists the NumElems/NumElts variable. llvm-svn: 340720	2018-08-27 06:35:02 +00:00
Sanjay Patel	113cac3b15	[SelectionDAG][x86] turn insertelement into undef with variable index into splat I noticed this along with the patterns in D51125, but when the index is variable, we don't convert insertelement into a build_vector. For x86, that means these get expanded at legalization time into the loading/spilling code that we see in the tests. I think it's always better to avoid going to memory on these, and we get the optimal 'broadcast' if it's available. I suspect other targets may want to look at enabling the hook. AArch64 and AMDGPU have regression tests that would be affected (although I did not check what would happen in those cases). In the most basic cases shown here, AArch64 would probably do much better with a splat. Differential Revision: https://reviews.llvm.org/D51186 llvm-svn: 340705	2018-08-26 18:20:41 +00:00
Craig Topper	4240ecb909	[X86] Fix typo in comment, expect->except. NFC llvm-svn: 340695	2018-08-26 03:43:23 +00:00
Craig Topper	ebec2793d1	[X86] Replace support for vXi32 SMUL_LOHI/UMUL_LOHI with MULHS/MULHU support instead. Summary: The only time vector SMUL_LOHI/UMUL_LOHI nodes are created is during division/remainder lowering. If its created before op legalization, generic DAGCombine immediately turns that SMUL_LOHI/UMUL_LOHI into a MULHS/MULHU since only the upper half is used. That node will stick around through vector op legalization and will be turned back into UMUL_LOHI/SMUL_LOHI during op legalization. It will then be custom lowered by the X86 backend. Due to this two step lowering the vector shuffles created by the custom lowering get legalized after their inputs rather than before. This prevents the shuffles from being combined with any build_vector of constants. This patch uses changes vXi32 to use MULHS/MULHU instead. This is what the later DAG combine did anyway. But by skipping the change back to UMUL_LOHI/SMUL_LOHI we lower it before any constant BUILD_VECTORS. This allows the vector_shuffle creation to constant fold with the build_vectors. This accounts for the test changes here. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D51254 llvm-svn: 340690	2018-08-25 18:01:24 +00:00
Craig Topper	a11a3b3818	[SelectionDAG][X86] Reorder the operands the MaskedStoreSDNode to put the value first. Summary: Previously the value being stored is the last operand in SDNode. This causes the type legalizer to visit the mask operand before the value operand. The type legalizer was more complicated because of this since we want the type of the value to drive the decisions. This patch moves the value to be the first operand so we visit it first during type legalization. It also simplifies the type legalization code accordingly. X86 is currently the only in tree target that uses this SDNode. Not sure if there are any users out of tree. Reviewers: RKSimon, delena, hfinkel, eli.friedman Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D50402 llvm-svn: 340689	2018-08-25 17:48:17 +00:00
Craig Topper	bce8680605	[X86] Make sure type is a vector before calling VT.getVectorNumElements() in combineLoopMAddPattern Fixes PR38700. llvm-svn: 340688	2018-08-25 17:23:43 +00:00
Sanjay Patel	8a84c747d2	[x86] try harder to use broadcast to load a scalar into vector reg This is a preliminary step for a preliminary step for D50992. I noticed that x86 often misses chances to load a scalar directly into a vector register. So this patch is just allowing more of those cases to match a broadcast op in lowerBuildVectorAsBroadcast(). The old code comment said it doesn't make sense to use a broadcast when we're loading a single element and everything else is undef, but I think that's the best case in the improved tests in insert-loaded-scalar.ll. We avoid scalar-to-vector-register move and/or less efficient shuffling. Note that there are some existing types that were already producing a broadcast, but that happens semi-accidentally. Ie, it's not happening as part of lowerBuildVectorAsBroadcast(). The build vector gets expanded into load + shuffle, and then shuffle lowering produces the broadcast. Description of the other test diffs: 1. avx-basic.ll - replacing load+shufle is a win. 2. sse3-avx-addsub-2.ll - vmovddup vs. vbroadcastss is neutral 3. sse41.ll - don't care - we convert that intrinsic to generic IR now, so this test is deprecated 4. vector-shuffle-128-v8.ll / vector-shuffle-256-v16.ll - pshufb alternatives with an extra instruction are not obviously bad Differential Revision: https://reviews.llvm.org/D51125 llvm-svn: 340685	2018-08-25 14:56:05 +00:00
Craig Topper	4058e29e7d	[X86] Teach combineLoopMAddPattern to handle cases where there is no loop and the add has two multiply inputs Differential Revision: https://reviews.llvm.org/D50868 llvm-svn: 340631	2018-08-24 18:05:04 +00:00
Chandler Carruth	ae0cafece8	[x86/retpoline] Split the LLVM concept of retpolines into separate subtarget features for indirect calls and indirect branches. This is in preparation for enabling only the call retpolines when using speculative load hardening. I've continued to use subtarget features for now as they continue to seem the best fit given the lack of other retpoline like constructs so far. The LLVM side is pretty simple. I'd like to eventually get rid of the old feature, but not sure what backwards compatibility issues that will cause. This does remove the "implies" from requesting an external thunk. This always seemed somewhat questionable and is now clearly not desirable -- you specify a thunk the same way no matter which set of things are getting retpolines. I really want to keep this nicely isolated from end users and just an LLVM implementation detail, so I've moved the `-mretpoline` flag in Clang to no longer rely on a specific subtarget feature by that name and instead to be directly handled. In some ways this is simpler, but in order to preserve existing behavior I've had to add some fallback code so that users who relied on merely passing -mretpoline-external-thunk continue to get the same behavior. We should eventually remove this I suspect (we have never tested that it works!) but I've not done that in this patch. Differential Revision: https://reviews.llvm.org/D51150 llvm-svn: 340515	2018-08-23 06:06:38 +00:00
Craig Topper	cf9df99d79	[X86] Teach combineLoopSADPattern to handle cases where there is no loop and the add has two absolute difference inputs Previously we asumed a vector reduction add is part of a loop and one of the input is a phi. But the code in SelectionDAGBuilder that sets vector reduction flag handles more cases than that. It just requires that the use chain ends in a horizontal reduction. And there are no other uses. This means it can handle unrolled reduction loops. If the initial value of the reduction was 0, an unrolled loop would begin with a vector reduction add that has two sad inputs. Previously we would only transform one side of the add, but for this case we need to transform both sides. I've created a lambda to reuse some of the code for both sides. And fixed the variables names to remove reference to "phi". Differential Revision: https://reviews.llvm.org/D50817 llvm-svn: 340478	2018-08-22 23:19:01 +00:00
Simon Pilgrim	ffdfe45645	[X86][SSE] LowerMULH vXi8 - use SSE shifts directly. We know these vXi16 extended cases are legal constant splat shifts. llvm-svn: 340414	2018-08-22 15:37:11 +00:00
Simon Pilgrim	50eba6b380	[X86][SSE] Lower vXi8 general shifts to SSE shifts directly. NFCI. Most of these shifts are extended to vXi16 so we don't gain anything from forcing another round of generic shift lowering - we know these extended cases are legal constant splat shifts. llvm-svn: 340307	2018-08-21 17:27:03 +00:00
Simon Pilgrim	98eb4ae499	[X86][SSE] Lower v8i16 general shifts to SSE shifts directly. NFCI. We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts. llvm-svn: 340302	2018-08-21 17:05:07 +00:00
Simon Pilgrim	dbe4e9e3ff	[X86][SSE] Lower directly to SSE shifts in the BLEND(SHIFT, SHIFT) combine. NFCI. We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts. llvm-svn: 340300	2018-08-21 16:46:48 +00:00
Simon Pilgrim	5a83a1fd13	[X86][SSE] Add helper function to convert to/between the SSE vector shift opcodes. NFCI. Also remove some more getOpcode calls from LowerShift when we already have Opc. llvm-svn: 340290	2018-08-21 15:57:33 +00:00
Craig Topper	210ccfe3db	[X86] Prevent lowerVectorShuffleByMerging128BitLanes from creating cycles Due to some splat handling code in getVectorShuffle, its possible for NewV1/NewV2 to have their mask modified from what is requested. This can lead to cycles being created in the DAG. This patch examines the returned mask and makes sure its different. Long term we may need to look closer at that splat code in getVectorShuffle, or add more splat awareness to getVectorShuffle. Fixes PR38639 Differential Revision: https://reviews.llvm.org/D50981 llvm-svn: 340214	2018-08-20 21:08:35 +00:00
Craig Topper	7dcb2c4b0a	[X86] Teach combineTruncatedArithmetic to handle some cases of ISD::SUB We can safely avoid interfering with the subus combine if both inputs are freely truncatable. Either both extends, or an extend and a constant vector. Differential Revision: https://reviews.llvm.org/D50878 llvm-svn: 340212	2018-08-20 20:57:35 +00:00
Craig Topper	803912ea57	[X86] Fix an issue in the matching for ADDUS. We were basically assuming only one operand of the compare could be an ADD node and using that to swap operands. But we can have a normal add followed by a saturing add. This rewrites the canonicalization to just be based on the condition code. llvm-svn: 340134	2018-08-19 04:26:31 +00:00
Craig Topper	2b03df9b05	[X86] Use SDValue::operator== instead of DAG.isEqualTo in strictly integer matching. isEqualTo is more useful for floating point. operator== is sufficient for integer. llvm-svn: 340130	2018-08-18 19:16:56 +00:00
Craig Topper	3e299d896f	[X86] Simplify the PADDUS legality check in combineSelect to match PSUBUS. NFC While there remove some trailing whitespace. llvm-svn: 340129	2018-08-18 18:51:04 +00:00
Craig Topper	40c9559b74	[X86] Add support for using 512-bit PSUBUS to combineSelect. The code already support 128 and 256 and even knows to split 256 for AVX1. So we really just needed to stop looking for specific VTs and subtarget features and just look for legal VTs with i8/i16 elements. While there, add some curly braces around outer if statement bodies that contain only another if. It makes all the closing curly braces look more regular. llvm-svn: 340128	2018-08-18 18:51:03 +00:00
Craig Topper	62dd1b1b4f	[X86] Remove detectAddSubSatPattern. This was added very recently in r339650, but appears to be completely untested and has at least one bug in it. llvm-svn: 340086	2018-08-17 21:19:28 +00:00
Simon Pilgrim	2f48122cc9	[X86][SSE] Lower constant vXi8 ISD::SRL/ISD::SRA using PMULLW Extending the concept introduced in D49562, this patch lowers constant vXi8 ISD::SRL/ISD::SRA by zero/sign extending to vXi16 and using PMULLW and then truncating the high 8 bits of the result. Differential Revision: https://reviews.llvm.org/D50781 llvm-svn: 340062	2018-08-17 18:03:11 +00:00
Craig Topper	730890dbdb	[X86] Use hasOneUse instead of isOnlyUserOf. NFCI isOnlyUserOf is a little heavier because it allows the node to be used multiple times by the other node. In this case we are looking at a truncate which only has one operand so we know it can only use it once. Thus hasOneUse is better. llvm-svn: 340059	2018-08-17 17:57:25 +00:00
Francis Visoiu Mistrih	8bff832534	[X86] Fix liveness information when expanding X86::EH_SjLj_LongJmp64 test/CodeGen/X86/shadow-stack.ll has the following machine verifier errors: ``` * Bad machine code: Using a killed virtual register * - function: bar - basic block: %bb.6 entry (0x7fdc81857818) - instruction: %3:gr64 = MOV64rm killed %2:gr64, 1, $noreg, 8, $noreg - operand 1: killed %2:gr64 * Bad machine code: Using a killed virtual register * - function: bar - basic block: %bb.6 entry (0x7fdc81857818) - instruction: $rsp = MOV64rm killed %2:gr64, 1, $noreg, 16, $noreg - operand 1: killed %2:gr64 * Bad machine code: Virtual register killed in block, but needed live out. * - function: bar - basic block: %bb.2 entry (0x7fdc818574f8) Virtual register %2 is used after the block. ``` The fix here is to only copy the machine operand's register without the kill flags for all the instructions except the very last one of the sequence. I had to insert dummy PHIs in the test case to force the NoPHI function property to be set to false. More on this here: https://llvm.org/PR38439 Differential Revision: https://reviews.llvm.org/D50260 llvm-svn: 340033	2018-08-17 14:46:56 +00:00
Chandler Carruth	c73c0307fe	[MI] Change the array of `MachineMemOperand` pointers to be a generically extensible collection of extra info attached to a `MachineInstr`. The primary change here is cleaning up the APIs used for setting and manipulating the `MachineMemOperand` pointer arrays so chat we can change how they are allocated. Then we introduce an extra info object that using the trailing object pattern to attach some number of MMOs but also other extra info. The design of this is specifically so that this extra info has a fixed necessary cost (the header tracking what extra info is included) and everything else can be tail allocated. This pattern works especially well with a `BumpPtrAllocator` which we use here. I've also added the basic scaffolding for putting interesting pointers into this, namely pre- and post-instruction symbols. These aren't used anywhere yet, they're just there to ensure I've actually gotten the data structure types correct. I'll flesh out support for these in a subsequent patch (MIR dumping, parsing, the works). Finally, I've included an optimization where we store any single pointer inline in the `MachineInstr` to avoid the allocation overhead. This is expected to be the overwhelmingly most common case and so should avoid any memory usage growth due to slightly less clever / dense allocation when dealing with >1 MMO. This did require several ergonomic improvements to the `PointerSumType` to reasonably support the various usage models. This also has a side effect of freeing up 8 bits within the `MachineInstr` which could be repurposed for something else. The suggested direction here came largely from Hal Finkel. I hope it was worth it. ;] It does hopefully clear a path for subsequent extensions w/o nearly as much leg work. Lots of thanks to Reid and Justin for careful reviews and ideas about how to do all of this. Differential Revision: https://reviews.llvm.org/D50701 llvm-svn: 339940	2018-08-16 21:30:05 +00:00
Craig Topper	08e082619a	[X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result. This fixes PR35833 Differential Revison: https://reviews.llvm.org/D41794 llvm-svn: 339818	2018-08-15 21:21:52 +00:00
Simon Pilgrim	f3b5943ffc	Remove lambda default argument to fix gcc pedantic warning. llvm-svn: 339767	2018-08-15 12:32:09 +00:00
Craig Topper	633fe98e27	[X86] Change legacy SSE scalar fp to integer intrinsics to use specific ISD opcodes instead of keeping as intrinsics. Unify SSE and AVX512 isel patterns. AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics. The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics. Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns. llvm-svn: 339749	2018-08-15 01:23:00 +00:00
Simon Pilgrim	2ce3d6e135	[X86][SSE] Avoid duplicate shuffle input sources in combineX86ShufflesRecursively rL339686 added the case where a faux shuffle might have repeated shuffle inputs coming from either side of the OR(). This patch improves the insertion of the inputs into the source ops lists to account for this, as well as making it trivial to add support for shuffles with more than 2 inputs in the future. llvm-svn: 339696	2018-08-14 17:22:37 +00:00
Simon Pilgrim	ed55138247	[X86][SSE] Add shuffle combine support for OR(PSHUFB,PSHUFB) style patterns. If each element is zero from one (or both) inputs then we can combine these into a single shuffle mask. llvm-svn: 339686	2018-08-14 16:00:05 +00:00
Simon Pilgrim	df9880f257	[X86][SSE] Generalize lowerVectorShuffleAsBlendOfPSHUFBs to work with any vXi8 type. We still only use this for v16i8, but this cleans up the code to support v32i8/v64i8 sometime in the future. llvm-svn: 339679	2018-08-14 14:00:14 +00:00
Tomasz Krupa	86a63889f3	[X86] Lowering addus/subus intrinsics to native IR Summary: This revision improves previous version (rL330322) which has been reverted due to crashes. This is the patch that lowers x86 intrinsics to native IR in order to enable optimizations. The patch also includes folding of previously missing saturation patterns so that IR emits the same machine instructions as the intrinsics. Reviewers: craig.topper, spatel, RKSimon Reviewed By: craig.topper Subscribers: mike.dvoretsky, DavidKreitzer, sroland, llvm-commits Differential Revision: https://reviews.llvm.org/D46179 llvm-svn: 339650	2018-08-14 08:00:56 +00:00
Simon Pilgrim	130b00bc43	[X86][SSE] Pull out repeated shift getOpcode() calls. NFCI. llvm-svn: 339425	2018-08-10 11:42:42 +00:00
Craig Topper	9a8136f7b4	[X86] Qualify one of the heuristics in combineMul to only apply to positive multiply amounts. This seems to slightly help the performance of one of our internal benchmarks. We probably need better heuristics here. llvm-svn: 339406	2018-08-09 23:27:42 +00:00
Simon Pilgrim	511c3fc529	[X86][SSE] Remove PMULDQ/PMULUDQ by zero Exposed by D50328 Differential Revision: https://reviews.llvm.org/D50328 llvm-svn: 339337	2018-08-09 12:37:36 +00:00
Simon Pilgrim	01ae462fef	[X86][SSE] Combine (some) target shuffles with multiple uses As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses. This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool. However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move. This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit. Differential Revision: https://reviews.llvm.org/D50328 llvm-svn: 339335	2018-08-09 12:30:02 +00:00
Craig Topper	9de1797c50	[SelectionDAG][X86] Rename MaskedLoadSDNode::getSrc0 to getPassThru. Src0 doesn't really convey any meaning to what the operand is. Passthru matches what's used in the documentation for the intrinsic this comes from. llvm-svn: 339101	2018-08-07 06:52:49 +00:00
Craig Topper	17989208a9	[SelectionDAG][X86] Rename getValue to getPassThru for gather SDNodes. getValue is more meaningful name for scatter than it is for gather. Split them and use getPassThru for gather. llvm-svn: 339096	2018-08-07 06:13:40 +00:00
Easwaran Raman	10fd92dd94	[X86] Recognize a splat of negate in isFNEG Summary: Expand isFNEG so that we generate the appropriate F(N)M(ADD\|SUB) instructions in more cases. For example, the following sequence a = _mm256_broadcast_ss(f) d = _mm256_fnmadd_ps(a, b, c) generates an fsub and fma without this patch and an fnma with this change. Reviewers: craig.topper Subscribers: llvm-commits, davidxl, wmi Differential Revision: https://reviews.llvm.org/D48467 llvm-svn: 339043	2018-08-06 19:23:38 +00:00
Craig Topper	feb2a58860	[X86] Add a DAG combine for the __builtin_parity idiom used by clang to enable better codegen Clang uses "ctpop & 1" to implement __builtin_parity. If the popcnt instruction isn't supported this generates a large amount of code to calculate the population count. Instead we can bisect the data down to a single byte using xor and then check the parity flag. Even when popcnt is supported, its still a good idea to split 64-bit data on 32-bit targets using an xor in front of a single popcnt. Otherwise we get two popcnts and an add before the and. I've specifically targeted this at the sizes supported by clang builtins, but we could generalize this if we think that's useful. Differential Revision: https://reviews.llvm.org/D50165 llvm-svn: 338907	2018-08-03 18:00:29 +00:00
Craig Topper	e902b7d0b0	[X86] Support fp128 and/or/xor/load/store with VEX and EVEX encoded instructions. Move all the patterns to X86InstrVecCompiler.td so we can keep SSE/AVX/AVX512 all in one place. To save some patterns we'll use an existing DAG combine to convert f128 fand/for/fxor to integer when sse2 is enabled. This allows use to reuse all the existing patterns for v2i64. I believe this now makes SHA instructions the only case where VEX/EVEX and legacy encoded instructions could be generated simultaneously. llvm-svn: 338821	2018-08-03 06:12:56 +00:00
Craig Topper	2c095444a4	[X86] Prevent promotion of i16 add/sub/and/or/xor to i32 if we can fold an atomic load and atomic store. This makes them consistent with i8/i32/i64. Which still seems to be more aggressive on folding than icc, gcc, or MSVC. llvm-svn: 338795	2018-08-03 00:37:34 +00:00
Simon Pilgrim	8b16e15d47	[X86][SSE] Pull out duplicate VSELECT to shuffle mask code. NFCI. llvm-svn: 338693	2018-08-02 09:20:27 +00:00
Reid Kleckner	a30a6d2c29	Load from the GOT for external symbols in the large, PIC code model Do the same handling for external symbols that we do for jump table symbols and global values. Fixes one of the cases in PR38385 llvm-svn: 338651	2018-08-01 22:56:05 +00:00
Craig Topper	c985d42903	[X86] Canonicalize the pattern for __builtin_ffs in a similar way to '__builtin_ffs + 5' We now emit a move of -1 before the cmov and do the addition after the cmov just like the case with an extra addition. This may be slightly worse for code size, but is more consistent with other compilers. And we might be able to hoist the mov -1 outside of loops. llvm-svn: 338613	2018-08-01 18:38:46 +00:00
Simon Pilgrim	931ebe3be1	[X86] Assign from a brace initializer to match style guide. NFCI. llvm-svn: 338598	2018-08-01 17:43:38 +00:00
Simon Pilgrim	a3548c960e	[SelectionDAG] Make binop reduction matcher available to all targets There is nothing x86-specific about this code, so it'd be nice to make this available for other targets to use in the future (and get it out of X86ISelLowering!). Differential Revision: https://reviews.llvm.org/D50083 llvm-svn: 338586	2018-08-01 16:52:28 +00:00
Simon Pilgrim	564762cf32	[X86] Use isNullConstant helper. NFCI. llvm-svn: 338530	2018-08-01 13:06:14 +00:00
Simon Pilgrim	e447a273bd	[X86] Use isNullConstant helper. NFCI. llvm-svn: 338516	2018-08-01 11:24:11 +00:00
Craig Topper	65a1388881	[X86] When looking for (CMOV C-1, (ADD (CTTZ X), C), (X != 0)) -> (ADD (CMOV (CTTZ X), -1, (X != 0)), C), make sure we really have a compare with 0. It's not strictly required by the transform of the cmov and the add, but it makes sure we restrict it to the cases we know we want to match. While there canonicalize the operand order of the cmov to simplify the matching and emitting code. llvm-svn: 338492	2018-08-01 06:36:20 +00:00
Simon Pilgrim	5d9b00d15b	[X86][SSE] Use ISD::MULHU for constant/non-zero ISD::SRL lowering (PR38151) As was done for vector rotations, we can efficiently use ISD::MULHU for vXi8/vXi16 ISD::SRL lowering. Shift-by-zero cases are still problematic (mainly on v32i8 due to extra AND/ANDN/OR or VPBLENDVB blend masks but v8i16/v16i16 aren't great either if PBLENDW fails) so I've limited this first patch to known non-zero cases if we can't easily use PBLENDW. Differential Revision: https://reviews.llvm.org/D49562 llvm-svn: 338407	2018-07-31 18:05:56 +00:00
Craig Topper	bef126fb71	[X86] Add pattern matching for PMADDUBSW Summary: Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned. A C example that triggers this pattern ``` static const int N = 128; int8_t A[2N]; uint8_t B[2N]; int16_t C[N]; void foo() { for (int i = 0; i != N; ++i) C[i] = MIN(MAX((int16_t)A[2i](int16_t)B[2i] + (int16_t)A[2i+1](int16_t)B[2i+1], -32768), 32767); } ``` Reviewers: RKSimon, spatel, zvi Reviewed By: RKSimon, zvi Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D49829 llvm-svn: 338402	2018-07-31 17:12:08 +00:00
Simon Pilgrim	99d475f97d	[X86][SSE] isFNEG - Use getTargetConstantBitsFromNode to handle all constant cases isFNEG was duplicating much of what was done by getTargetConstantBitsFromNode in its own calls to getTargetConstantFromNode. Noticed while reviewing D48467. llvm-svn: 338358	2018-07-31 10:13:17 +00:00
Fangrui Song	f78650a8de	Remove trailing space sed -Ei 's/[[:space:]]+$//' include/*/.{def,h,td} lib/*/.{cpp,h} llvm-svn: 338293	2018-07-30 19:41:25 +00:00
Craig Topper	f014ec9b3b	[X86] Fix typo in comment. NFC llvm-svn: 338274	2018-07-30 17:34:31 +00:00
Matt Arsenault	81920b0a25	DAG: Add calling convention argument to calling convention funcs This seems like a pretty glaring omission, and AMDGPU wants to treat kernels differently from other calling conventions. llvm-svn: 338194	2018-07-28 13:25:19 +00:00
Craig Topper	c3e11bf3f7	[X86] Add support expanding multiplies by constant where the constant is -3/-5/-9 multplied by a power of 2. These can be replaced with an LEA, a shift, and a negate. This seems to match what gcc and icc would do. llvm-svn: 338174	2018-07-27 23:04:59 +00:00
Craig Topper	561e298e29	[X86] Remove an unnecessary 'if' that prevented treating INT64_MAX and -INT64_MAX as power of 2 minus 1 in the multiply expansion code. Not sure why they were being explicitly excluded, but I believe all the math inside the if works. I changed the absolute value to be uint64_t instead of int64_t so INT64_MIN+1 wouldn't be signed wrap. llvm-svn: 338101	2018-07-27 05:56:27 +00:00
Craig Topper	e364baa88b	[X86] Add matching for another pattern of PMADDWD. Summary: This is the pattern you get from the loop vectorizer for something like this int16_t A[1024]; int16_t B[1024]; int32_t C[512]; void pmaddwd() { for (int i = 0; i != 512; ++i) C[i] = (A[2i]B[2i]) + (A[2i+1]B[2i+1]); } In this case we will have (add (mul (build_vector), (build_vector)), (mul (build_vector), (build_vector))). This is different than the pattern we currently match which has the build_vectors between an add and a single multiply. I'm not sure what C code would get you that pattern. Reviewers: RKSimon, spatel, zvi Reviewed By: zvi Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D49636 llvm-svn: 338097	2018-07-27 04:29:10 +00:00
Craig Topper	f7bc550223	[X86] When removing sign extends from gather/scatter indices, make sure we handle UpdateNodeOperands finding an existing node to CSE with. If this happens the operands aren't updated and the existing node is returned. Make sure we pass this existing node up to the DAG combiner so that a proper replacement happens. Otherwise we get stuck in an infinite loop with an unoptimized node. llvm-svn: 338090	2018-07-27 00:00:30 +00:00
Craig Topper	4e687d5bb2	[X86] Don't use CombineTo to skip adding new nodes to the DAGCombiner worklist in combineMul. I'm not sure if this was trying to avoid optimizing the new nodes further or what. Or maybe to prevent a cycle if something tried to reform the multiply? But I don't think its a reliable way to do that. If the user of the expanded multiply is visited by the DAGCombiner after this conversion happens, the DAGCombiner will check its operands, see that they haven't been visited by the DAGCombiner before and it will then add the first node to the worklist. This process will repeat until all the new nodes are visited. So this seems like an unreliable prevention at best. So this patch just returns the new nodes like any other combine. If this starts causing problems we can try to add target specific nodes or something to more directly prevent optimizations. Now that we handle the combine normally, we can combine any negates the mul expansion creates into their users since those will be visited now. llvm-svn: 338007	2018-07-26 05:40:10 +00:00
Craig Topper	370bdd3a0f	[X86] Remove some unnecessary explicit calls to DCI.AddToWorkList. These calls were making sure some newly created nodes were added to worklist, but the DAGCombiner has internal support for ensuring it has visited all nodes. Any time it visits a node it ensures the operands have been queued to be visited as well. This means if we only need to return the last new node. The DAGCombiner will take care of adding its inputs thus walking backwards through all the new nodes. llvm-svn: 337996	2018-07-26 03:20:27 +00:00
Matthias Braun	57dd5b3dea	CodeGen: Cleanup regmask construction; NFC - Avoid duplication of regmask size calculation. - Simplify allocateRegisterMask() call. - Rename allocateRegisterMask() to allocateRegMask() to be consistent with naming in MachineOperand. llvm-svn: 337986	2018-07-26 00:27:47 +00:00
Craig Topper	dc0e8a601d	[X86] Use X86ISD::MUL_IMM instead of ISD::MUL for multiply we intend to be selected to LEA. This prevents other combines from possibly disturbing it. llvm-svn: 337890	2018-07-25 05:33:36 +00:00
Craig Topper	fc501a9223	[X86] Use a shift plus an lea for multiplying by a constant that is a power of 2 plus 2/4/8. The LEA allows us to combine an add and the multiply by 2/4/8 together so we just need a shift for the larger power of 2. llvm-svn: 337875	2018-07-25 01:15:38 +00:00
Craig Topper	5be253d988	[X86] Expand mul by pow2 + 2 using a shift and two adds similar to what we do for pow2 - 2. llvm-svn: 337874	2018-07-25 01:15:35 +00:00
Craig Topper	56c104f104	[X86] Use a two lea sequence for multiply by 37, 41, and 73. These fit a pattern used by 11, 21, and 19. llvm-svn: 337871	2018-07-24 23:44:17 +00:00
Craig Topper	f8fcee70a3	[X86] Change multiply by 26 to use two multiplies by 5 and an add instead of multiply by 3 and 9 and a subtract. Same number of operations, but ending in an add is friendlier due to it being commutable. llvm-svn: 337869	2018-07-24 23:44:12 +00:00
Craig Topper	5ddc0a2b14	[X86] When expanding a multiply by a negative of one less than a power of 2, like 31, don't generate a negate of a subtract that we'll never optimize. We generated a subtract for the power of 2 minus one then negated the result. The negate can be optimized away by swapping the subtract operands, but DAG combine doesn't know how to do that and we don't add any of the new nodes to the worklist anyway. This patch makes use explicitly emit the swapped subtract. llvm-svn: 337858	2018-07-24 21:31:21 +00:00
Craig Topper	6d29891bef	[X86] Generalize the multiply by 30 lowering to generic multipy by power 2 minus 2. Use a left shift and 2 subtracts like we do for 30. Move this out from behind the slow lea check since it doesn't even use an LEA. Use this for multiply by 14 as well. llvm-svn: 337856	2018-07-24 21:15:41 +00:00
Craig Topper	86d6320b94	[X86] Change multiply by 19 to use (9 * X) * 2 + X instead of (5 * X) * 4 - 1. The new lowering can be done in 2 LEAs. The old code took 1 LEA, 1 shift, and 1 sub. llvm-svn: 337851	2018-07-24 20:31:48 +00:00
Craig Topper	b2a626b52e	[X86] Remove the max vector width restriction from combineLoopMAddPattern and rely splitOpsAndApply to handle splitting. This seems to be a net improvement. There's still an issue under avx512f where we have a 512-bit vpaddd, but not vpmaddwd so we end up doing two 256-bit vpmaddwds and inserting the results before a 512-bit vpaddd. It might be better to do two 512-bits paddds with zeros in the upper half. Same number of instructions, but breaks a dependency. llvm-svn: 337656	2018-07-22 19:44:35 +00:00
Benjamin Kramer	64c7fa3201	Revert "[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts" This reverts commit r337547. It triggers an infinite loop. llvm-svn: 337617	2018-07-20 20:59:46 +00:00
Craig Topper	28ac623f6f	[X86] Remove isel patterns for MOVSS/MOVSD ISD opcodes with integer types. Ideally our ISD node types going into the isel table would have types consistent with their instruction domain. This prevents us having to duplicate patterns with different types for the same instruction. Unfortunately, it seems our shuffle combining is currently relying on this a little remove some bitcasts. This seems to enable some switching between shufps and shufd. Hopefully there's some way we can address this in the combining. Differential Revision: https://reviews.llvm.org/D49280 llvm-svn: 337590	2018-07-20 17:57:53 +00:00
Craig Topper	6194ccf8c7	[X86] Remove what appear to be unnecessary uses of DCI.CombineTo CombineTo is most useful when you need to replace multiple results, avoid the worklist management, or you need to something else after the combine, etc. Otherwise you should be able to just return the new node and let DAGCombiner go through its usual worklist code. All of the places changed in this patch look to be standard cases where we should be able to use the more stand behavior of just returning the new node. Differential Revision: https://reviews.llvm.org/D49569 llvm-svn: 337589	2018-07-20 17:57:42 +00:00
Simon Pilgrim	70fcd0f481	[X86][XOP] Fix SUB constant folding for VPSHA/VPSHL shift lowering We can safely use getConstant here as we're still lowering, which allows constant folding to kick in and simplify the vector shift codegen. Noticed while working on D49562. llvm-svn: 337578	2018-07-20 16:55:18 +00:00
Simon Pilgrim	c7132031a2	[X86][SSE] Use SplitOpsAndApply to improve HADD/HSUB lowering Improve AVX1 256-bit vector HADD/HSUB matching by using SplitOpsAndApply to split into 128-bit instructions. llvm-svn: 337568	2018-07-20 16:20:45 +00:00
Simon Pilgrim	a85b86a982	[X86][AVX] Add support for i16 256-bit vector horizontal op redundant shuffle removal llvm-svn: 337566	2018-07-20 15:51:01 +00:00
Simon Pilgrim	7c56bce996	[X86][AVX] Add support for 32/64 bits 256-bit vector horizontal op redundant shuffle removal llvm-svn: 337561	2018-07-20 15:24:12 +00:00
Simon Pilgrim	6fb8b68b2d	[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts This is an early step towards using SimplifyDemandedVectorElts for target shuffle combining - this merely moves the existing X86ISD::VBROADCAST simplification code to use the SimplifyDemandedVectorElts mechanism. Adds X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode to handle X86ISD::VBROADCAST - in time we can support all target shuffles (and other ops) here. llvm-svn: 337547	2018-07-20 13:26:51 +00:00
Simon Pilgrim	1d181bc992	[X86][AVX] Use extract_subvector to reduce vector op widths (PR36761) We have a number of cases where we fail to reduce vector op widths, performing the op in a larger vector and then extracting a subvector. This is often because by default it would create illegal types. This peephole patch attempts to handle a few common cases detailed in PR36761, which typically involved extension+conversion to vX2f64 types. Differential Revision: https://reviews.llvm.org/D49556 llvm-svn: 337500	2018-07-19 21:52:06 +00:00
Craig Topper	9888670c6b	[X86] Fix some 'return SDValue()' after DCI.CombineTo instead return the output of CombineTo Returning SDValue() means nothing was changed. Returning the result of CombineTo returns the first argument of CombineTo. This is specially detected by DAGCombiner as meaning that something changed, but worklist management was already taken care of. I think the only real effect of this change is that we now properly update the Statistic the counts the number of combines performed. That's the only thing between the check for null and the check for N in the DAGCombiner. llvm-svn: 337491	2018-07-19 20:10:44 +00:00
Simon Pilgrim	d4b82da113	[X86][SSE] Canonicalize scalar fp arithmetic shuffle patterns As discussed on PR38197, this canonicalizes MOVS(N0, OP(N0, N1)) --> MOVS(N0, SCALAR_TO_VECTOR(OP(N0[0], N1[0]))) This returns the scalar-fp codegen lost by rL336971. Additionally it handles the OP(N1, N0)) case for commutable (FADD/FMUL) ops. Differential Revision: https://reviews.llvm.org/D49474 llvm-svn: 337419	2018-07-18 19:55:19 +00:00
Simon Pilgrim	3a45369b9e	[X86][SSE] Remove BLENDPD canonicalization from combineTargetShuffle When rL336971 removed the scalar-fp isel patterns, we lost the need for this canonicalization - commutation/folding can handle everything else. llvm-svn: 337387	2018-07-18 13:01:20 +00:00
Craig Topper	1425e10cc6	[X86] Generate v2f64 X86ISD::UNPCKL/UNPCKH instead of X86ISD::MOVLHPS/MOVHLPS for unary v2f64 {0,0} and {1,1} shuffles with SSE2. I'm trying to restrict the MOVLHPS/MOVHLPS ISD nodes to SSE1 only. With SSE2 we can use unpcks. I believe this will allow some patterns to be cleaned up to require fewer bitcasts. I've put in an odd isel hack to still select MOVHLPS instruction from the unpckh node to avoid changing tests and because movhlps is a shorter encoding. Ideally we'd do execution domain switching on this, but the operands are in the wrong order and are tied. We might be able to try a commute in the domain switching using custom code. We already support domain switching for UNPCKLPD and MOVLHPS. llvm-svn: 337348	2018-07-18 05:10:51 +00:00
Craig Topper	07a1787501	[X86] Merge the FR128 and VR128 regclass since they have identical spill and alignment characteristics. This unfortunately requires a bunch of bitcasts to be added added to SUBREG_TO_REG, COPY_TO_REGCLASS, and instructions in output patterns. Otherwise tablegen seems to default to picking f128 and then we fail when something tries to get the register class for f128 which isn't always valid. The test changes are because we were previously mixing fr128 and vr128 due to contrainRegClass finding FR128 first and passes like live range shrinking weren't handling that well. llvm-svn: 337147	2018-07-16 06:56:09 +00:00
Fangrui Song	dcdc9ac7a2	[X86] Correct comment of TEST elimination in BSF/TZCNT llvm-svn: 337052	2018-07-13 21:40:08 +00:00
Fangrui Song	90d9c201dc	[X86] Try fixing r336768 llvm-svn: 337043	2018-07-13 20:54:24 +00:00
Simon Pilgrim	44b89fa900	[X86][SSE] Utilize ZeroableElements for canWidenShuffleElements canWidenShuffleElements can do a better job if given a mask with ZeroableElements info. Apparently, ZeroableElements was being only used to identify AllZero candidates, but possibly we could plug it into more shuffle matchers. Original Patch by Zvi Rackover @zvi Differential Revision: https://reviews.llvm.org/D42044 llvm-svn: 336903	2018-07-12 13:29:41 +00:00
Simon Pilgrim	8a463897e9	[X86][AVX] Use Zeroable mask to improve shuffle mask widening Noticed while updating D42044, lowerV2X128VectorShuffle can improve the shuffle mask with the zeroable data to create a target shuffle mask to recognise more 'zero upper 128' patterns. NOTE: lowerV4X128VectorShuffle could benefit as well but the code needs refactoring first to discriminate between SM_SentinelUndef and SM_SentinelZero for negative shuffle indices. Differential Revision: https://reviews.llvm.org/D49092 llvm-svn: 336900	2018-07-12 13:03:58 +00:00
Craig Topper	73347ec081	[X86] Remove patterns and ISD nodes for the old scalar FMA intrinsic lowering. We now use llvm.fma.f32/f64 or llvm.x86.fmadd.f32/f64 intrinsics that use scalar types rather than vector types. So we don't these special ISD nodes that operate on the lowest element of a vector. llvm-svn: 336883	2018-07-12 03:42:41 +00:00
Craig Topper	034adf2683	[X86] Remove and autoupgrade the scalar fma intrinsics with masking. This converts them to what clang is now using for codegen. Unfortunately, there seem to be a few kinks to work out still. I'll try to address with follow up patches. llvm-svn: 336871	2018-07-12 00:29:56 +00:00
Craig Topper	02867f0fa3	[X86] The TEST instruction is eliminated when BSF/TZCNT is used Summary: These changes cover the PR#31399. Now the ffs(x) function is lowered to (x != 0) ? llvm.cttz(x) + 1 : 0 and it corresponds to the following llvm code: %cnt = tail call i32 @llvm.cttz.i32(i32 %v, i1 true) %tobool = icmp eq i32 %v, 0 %.op = add nuw nsw i32 %cnt, 1 %add = select i1 %tobool, i32 0, i32 %.op and x86 asm code: bsfl %edi, %ecx addl $1, %ecx testl %edi, %edi movl $0, %eax cmovnel %ecx, %eax In this case the 'test' instruction can't be eliminated because the 'add' instruction modifies the EFLAGS, namely, ZF flag that is set by the 'bsf' instruction when 'x' is zero. We now produce the following code: bsfl %edi, %ecx movl $-1, %eax cmovnel %ecx, %eax addl $1, %eax Patch by Ivan Kulagin Reviewers: davide, craig.topper, spatel, RKSimon Reviewed By: craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D48765 llvm-svn: 336768	2018-07-11 06:57:42 +00:00
Craig Topper	dea0b88b04	[X86] Remove X86ISD::MOVLPS and X86ISD::MOVLPD. NFCI These ISD nodes try to select the MOVLPS and MOVLPD instructions which are special load only instructions. They load data and merge it into the lower 64-bits of an XMM register. They are logically equivalent to our MOVSD node plus a load. There was only one place in X86ISelLowering that used MOVLPD and no places that selected MOVLPS. The one place that selected MOVLPD had to choose between it and MOVSD based on whether there was a load. But lowering is too early to tell if the load can really be folded. So in isel we have patterns that use MOVSD for MOVLPD if we can't find a load. We also had patterns that select the MOVLPD instruction for a MOVSD if we can find a load, but didn't choose the MOVLPD ISD opcode for some reason. So it seems better to just standardize on MOVSD ISD opcode and manage MOVSD vs MOVLPD instruction with isel patterns. llvm-svn: 336728	2018-07-10 21:00:22 +00:00
Simon Pilgrim	d32ca2c0b7	[X86][SSE] Prefer BLEND(SHL(v,c1),SHL(v,c2)) over MUL(v, c3) Now that rL336250 has landed, we should prefer 2 immediate shifts + a shuffle blend over performing a multiply. Despite the increase in instructions, this is quicker (especially for slow v4i32 multiplies), avoid loads and constant pool usage. It does mean however that we increase register pressure. The code size will go up a little but by less than what we save on the constant pool data. This patch also adds support for v16i16 to the BLEND(SHIFT(v,c1),SHIFT(v,c2)) combine, and also prevents blending on pre-SSE41 shifts if it would introduce extra blend masks/constant pool usage. Differential Revision: https://reviews.llvm.org/D48936 llvm-svn: 336642	2018-07-10 07:58:33 +00:00
Roman Lebedev	5ccae1750b	[X86][TLI] DAGCombine: Unfold variable bit-clearing mask to two shifts. Summary: This adds a reverse transform for the instcombine canonicalizations that were added in D47980, D47981. As discussed later, that was worse at least for the code size, and potentially for the performance, too. https://rise4fun.com/Alive/Zmpl Reviewers: craig.topper, RKSimon, spatel Reviewed By: spatel Subscribers: reames, llvm-commits Differential Revision: https://reviews.llvm.org/D48768 llvm-svn: 336585	2018-07-09 19:06:42 +00:00
Craig Topper	47170b3153	[X86] In combineFMA, make sure we bitcast the result of isFNEG back the expected type before creating the new FMA node. Previously, we were creating malformed SDNodes, but nothing noticed because the type constraints prevented isel from noticing. llvm-svn: 336566	2018-07-09 17:43:24 +00:00
Craig Topper	b8145ec667	[X86] Improve the message for some asserts. Remove an if that is guaranteed true by said asserts. This replaces some asserts in lowerV2F64VectorShuffle with the similar asserts from lowerVIF64VectorShuffle which are more readable. The original asserts mentioned a blend, but there's no guarantee that it is a blend. Also remove an if that the asserts prove is always true. Mask[0] is always less than 2 and Mask[1] is always at least 2. Therefore (Mask[0] >= 2) + (Mask[1] >= 2) == 1 must wlays be true. llvm-svn: 336517	2018-07-09 01:52:56 +00:00
Craig Topper	9e17073c21	[X86] Enhance combineFMA to look for FNEG behind an EXTRACT_VECTOR_ELT. llvm-svn: 336514	2018-07-08 18:04:00 +00:00
Simon Pilgrim	2eced71ecf	[X86][SSE] Combine v16i8 SHL by constants to multiplies Pre-AVX512 (which can perform a quick extend/shift/truncate), extending to 2 v8i16 for the PMULLW and then truncating is more performant than relying on the generic PBLENDVB vXi8 shift path and uses a similar amount of mask constant pool data. Differential Revision: https://reviews.llvm.org/D48963 llvm-svn: 336513	2018-07-08 12:47:50 +00:00
Simon Pilgrim	23f9eddabe	[SelectionDAG] Split float and integer isKnownNeverZero tests Splits off isKnownNeverZeroFloat to handle +/- 0 float cases. This will make it easier to be more aggressive with the integer isKnownNeverZero tests (similar to ValueTracking), use computeKnownBits etc. Differential Revision: https://reviews.llvm.org/D48969 llvm-svn: 336492	2018-07-07 18:17:14 +00:00
Craig Topper	2c27e33a58	[X86] Merge INTR_TYPE_3OP_RM with INTR_TYPE_3OP. Remove unused INTR_TYPE_1OP_RM. llvm-svn: 336476	2018-07-07 01:04:22 +00:00
Vedant Kumar	b3091da3af	Use Type::isIntOrPtrTy where possible, NFC It's a bit neater to write T.isIntOrPtrTy() over `T.isIntegerTy() \|\| T.isPointerTy()`. I used Python's re.sub with this regex to update users: r'([\w.\->()]+)isIntegerTy\s\\|\\|\s\1isPointerTy' llvm-svn: 336462	2018-07-06 20:17:42 +00:00
Craig Topper	c60e1807b3	[X86] Remove FMA4 scalar intrinsics. Use llvm.fma intrinsic instead. The intrinsics can be implemented with a f32/f64 llvm.fma intrinsic and an insert into a zero vector. There are a couple regressions here due to SelectionDAG not being able to pull an fneg through an extract_vector_elt. I'm not super worried about this though as InstCombine should be able to do it before we get to SelectionDAG. llvm-svn: 336416	2018-07-06 07:14:41 +00:00
Craig Topper	7b35585ff1	[X86] Remove all of the avx512 masked packed fma intrinsics. Use llvm.fma or unmasked 512-bit intrinsics with rounding mode. This upgrades all of the intrinsics to use fneg instructions to convert fma into fmsub/fnmsub/fnmadd/fmsubadd. And uses a select instruction for masking. This matches how clang uses the intrinsics these days. llvm-svn: 336409	2018-07-06 03:42:09 +00:00
Craig Topper	4fe321d1ce	[X86] Add SHUF128 to target shuffle decoding. Differential Revision: https://reviews.llvm.org/D48954 llvm-svn: 336376	2018-07-05 17:10:17 +00:00
Craig Topper	95eb88abfe	[X86] Add support for combining FMSUB/FNMADD/FNMSUB ISD nodes with an fneg input. Previously we could only negate the FMADD opcodes. This used to be mostly ok when we lowered FMA intrinsics during lowering. But with the move to llvm.fma from target specific intrinsics, we can combine (fneg (fma)) to (fmsub) earlier. So if we start with (fneg (fma (fneg))) we would get stuck at (fmsub (fneg)). This patch fixes that so we can also combine things like (fmsub (fneg)). llvm-svn: 336304	2018-07-05 02:52:56 +00:00
Simon Pilgrim	c3e1617bf9	[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values (REAPPLIED) We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case. Reapplied with a fixed (extra null tests) version of rL336113 after reversion in rL336189 - extra test case added at rL336247. llvm-svn: 336250	2018-07-04 09:12:48 +00:00
Craig Topper	e317533dcf	[X86] Remove repeated 'the' from multiple comments that have been copy and pasted. NFC llvm-svn: 336226	2018-07-03 20:39:55 +00:00
Benjamin Kramer	fd171f2f89	Revert "[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values" This reverts commit r336113. It causes crashes. llvm-svn: 336189	2018-07-03 11:15:17 +00:00
Simon Pilgrim	2bc8e079f2	[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case. llvm-svn: 336113	2018-07-02 15:14:07 +00:00
Craig Topper	1b7b9b8596	[X86] Use MVT::i8 for scalar shift amounts since that is what they ultimately need to legalize to. I believe all of these are constants so legalizing them should be pretty trivial, but this saves a step. In one case it looks like we may have been creating a shift amount larger than the shift input itself. llvm-svn: 336052	2018-06-30 18:30:31 +00:00
Craig Topper	5f28d50d27	[X86] When combining load to BZHI, make sure we create the shift instruction with an i8 type. This combine runs pretty late and causes us to introduce a shift after the op legalization phase has run. We need to be sure we create the shift with the proper type for the shift amount. If we don't do this, we will still re-legalize the operation properly, but we won't get a chance to fully optimize the truncate that gets inserted. So this patch adds the necessary truncate when the shift is created. I've also narrowed the subtract that gets created to always be an i32 type. The truncate would have trigered SimplifyDemandedBits to optimize it anyway. But using a more appropriate VT here is free and saves an optimization step. llvm-svn: 336051	2018-06-30 17:49:42 +00:00
Craig Topper	59f2f38fe0	[X86] Remove masking from avx512 rotate intrinsics. Use select in IR instead. llvm-svn: 336035	2018-06-30 01:32:04 +00:00
Craig Topper	87b107dd69	[X86] Limit the number of target specific nodes emitted in LowerShiftParts The important part is the creation of the SHLD/SHRD nodes. The compare and the conditional move can use target independent nodes that can be legalized on their own. This gives some opportunities to trigger the optimizations present in the lowering for those things. And its just better to limit the number of places we emit target specific nodes. The changed test cases still aren't optimal. Differential Revision: https://reviews.llvm.org/D48619 llvm-svn: 335998	2018-06-29 17:24:07 +00:00
Simon Pilgrim	aab8660e23	[X86][SSE] Support v16i8/v32i8 vector rotations This uses the same technique as for shifts - split the rotation into 4/2/1-bit partial rotations and select those partials based on the amount bit, making use of PBLENDVB if available. This halves the use of PBLENDVB compared to expanding to shifts, which can be a slow op. Unfortunately I haven't found a decent way to share much of this code with the shift equivalent. Differential Revision: https://reviews.llvm.org/D48655 llvm-svn: 335957	2018-06-29 09:36:39 +00:00
Craig Topper	875e9f8fa4	[X86] Remove masking from the avx512 packed sqrt intrinsics. Use select in IR instead. While there improve the coverage of the intrinsic testing and add fast-isel tests. llvm-svn: 335944	2018-06-29 05:43:26 +00:00
Simon Pilgrim	aa2bf2be31	[TargetLowering] isVectorClearMaskLegal - use ArrayRef<int> instead of const SmallVectorImpl<int>& This is more generic and matches isShuffleMaskLegal. Differential Revision: https://reviews.llvm.org/D48591 llvm-svn: 335605	2018-06-26 14:15:31 +00:00
Simon Pilgrim	bfaa09220b	[X86] Just use ArrayRef instead of SmallVectorImpl in a few static method arguments. NFCI. llvm-svn: 335590	2018-06-26 10:45:41 +00:00
Craig Topper	08dae1682d	[X86] Don't use getScalarShiftAmountTy to get the immediate type for target specific VSHLDQ/VSRLDQ nodes. These opcodes have a fixed type of i8 for their immediate and shouldn't have anything to do with the scalar shift amount used by target independent shift nodes. llvm-svn: 335578	2018-06-26 04:53:42 +00:00
Craig Topper	689e363ff2	[X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction. This recommits r335562 and 335563 as a single commit. The frontend will surround the intrinsic with the appropriate marshalling to/from a scalar type to match the sigature of the builtin that software expects. By exposing the vXi1 type directly in the llvm intrinsic we make it available to optimizers much earlier. This can enable the scalar marshalling code to be optimized away. llvm-svn: 335568	2018-06-26 01:37:02 +00:00
Craig Topper	6f4fdfa9af	Revert r335562 and 335563 "[X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction." These were supposed to have been squashed to a single commit. llvm-svn: 335566	2018-06-26 01:31:53 +00:00
Craig Topper	9b4322ce31	foo llvm-svn: 335562	2018-06-26 00:43:34 +00:00
Craig Topper	3b18bdc46d	[X86] Simplify some code by using isOneConstant. NFC llvm-svn: 335437	2018-06-25 01:01:47 +00:00
Craig Topper	4331d6218d	[X86] Remove the changes to combineScalarToVector made in r335037. They appear to be untested other than the test case for p37879.ll and I believe we should be using SimplifyDemandedElts here to handle these cases. llvm-svn: 335436	2018-06-25 00:21:53 +00:00
Mikhail Dvoretckii	0963562083	[X86] Changing the check for valid inputs in combineScalarToVector Changing the logic of scalar mask folding to check for valid input types rather than against invalid ones, making it more robust and fixing PR37879. Differential Revision: https://reviews.llvm.org/D48366 llvm-svn: 335323	2018-06-22 08:28:05 +00:00
Mikhail Dvoretckii	22c82af5c8	[x86] Lower some trunc + shuffle patterns to vpmov[q\|d][b\|w] This should help in lowering the following four intrinsics: _mm256_cvtepi32_epi8 _mm256_cvtepi64_epi16 _mm256_cvtepi64_epi8 _mm512_cvtepi64_epi8 Differential Revision: https://reviews.llvm.org/D46957 llvm-svn: 335238	2018-06-21 14:16:45 +00:00
Craig Topper	c2696d577b	[X86] Use setcc ISD opcode for AVX512 integer comparisons all the way to isel I don't believe there is any real reason to have separate X86 specific opcodes for vector compares. Setcc has the same behavior just uses a different encoding for the condition code. I had to change the CondCodeAction for SETLT and SETLE to prevent some transforms from changing SETGT lowering. Differential Revision: https://reviews.llvm.org/D43608 llvm-svn: 335173	2018-06-20 21:05:02 +00:00
Mikhail Dvoretckii	b1ce7765be	[X86] VRNDSCALE* folding from masked and scalar ffloor and fceil patterns This patch handles back-end folding of generic patterns created by lowering the X86 rounding intrinsics to native IR in cases where the instruction isn't a straightforward packed values rounding operation, but a masked operation or a scalar operation. Differential Revision: https://reviews.llvm.org/D45203 llvm-svn: 335037	2018-06-19 10:37:52 +00:00
Mikhail Dvoretckii	bd8ed2dbaa	Test commit. llvm-svn: 335026	2018-06-19 07:55:10 +00:00
Sanjay Patel	f85ca6abee	[x86] be more selective about converting 'and' to shuffle (PR37749) isVectorClearMaskLegal() is the TLI hook used by the generic DAGCombiner::XformToShuffleWithZero(). We've grown to accomodate/expect this transform to shuffle (disabling it more generally results in many regressions). So I'm narrowly excluding the 256-bit types that clearly are not worthwhile for AVX1. I think in most cases we are able to recover by converting the shuffle back into 'and' ops, but the cases in: https://bugs.llvm.org/show_bug.cgi?id=37749 ...show that there are cracks. llvm-svn: 334759	2018-06-14 19:55:02 +00:00
Sanjay Patel	b983ac6fe1	[x86] eliminate even more sign-bit tests with vector select This shortcoming was noted in D47330, and the test diffs show we already had other examples where we failed to fold to a SHRUNKBLEND: /// Dynamic (non-constant condition) vector blend where only the sign bits /// of the condition elements are used. This is used to enforce that the /// condition mask is not valid for generic VSELECT optimizations. This patch implements an idea from D48043 and would obsolete that patch because it catches more cases (notable the AVX1 case that was missed there). All we're doing is allowing the existing transform to fire more often by removing the post-legalize constraint. All of the relevant feature checks and other predicates are left as-is. Differential Revision: https://reviews.llvm.org/D48078 llvm-svn: 334592	2018-06-13 12:28:32 +00:00
Craig Topper	3829d258ee	[X86] Remove masking from avx512vbmi2 concat and shift by immediate intrinsics. Use select in IR instead. llvm-svn: 334576	2018-06-13 07:19:21 +00:00
Sanjay Patel	c3466d2568	[x86] move shrunkblend transform to helper function; NFCI We should be able to obsolete D48043 by easing the constraints on this existing code. llvm-svn: 334504	2018-06-12 14:21:51 +00:00
Craig Topper	0e25c8239a	[X86] Remove masking from dbpsadbw intrinsics, use select in IR instead. llvm-svn: 334384	2018-06-11 06:18:22 +00:00
Craig Topper	e71ad1f6d0	[X86] Remove and autoupgrade the expandload and compressstore intrinsics. We use the target independent intrinsics now. llvm-svn: 334381	2018-06-11 01:25:22 +00:00
Craig Topper	98a79934af	[X86] Remove masking from the 512-bit masked floating point add/sub/mul/div intrinsics. Use a select in IR instead. llvm-svn: 334358	2018-06-10 06:01:36 +00:00
Simon Pilgrim	5c32989c91	[X86][SSE] Support v8i16/v16i16 rotations Extension to D46954 (PR37426), this patch adds support for v8i16/v16i16 rotations in a similar manner - the conversion of the shift/rotate amount to a multiplication factor and the use of PMULLW to shift left and PMULHUW (ISD::MULHU) to shift the wrapped bits back around to be ORd together. Differential Revision: https://reviews.llvm.org/D47822 llvm-svn: 334309	2018-06-08 17:58:42 +00:00
Simon Pilgrim	a6afa310c9	[X86][SSE] Simplify combineVectorTruncationWithPACKUS to reduce code duplication Simplify combineVectorTruncationWithPACKUS to mask the upper bits followed by calling truncateVectorWithPACK instead of duplicating with similar code. This results in the codegen using (V)PACKUSDW on SSE41+ targets for vXi64/vXi32 inputs where before it always used PACKUSWB (along with a lot more bitcasting). I've raised PR37749 as until we avoid unnecessary concats back to 256-bit for bitwise ops, we can't avoid splitting the input value into 128-bit subvectors for masking. llvm-svn: 334289	2018-06-08 13:59:11 +00:00
Simon Pilgrim	ad45efc445	[X86][SSE] Consistently prefer lowering to PACKUS over PACKSS We have some combines/lowerings that attempt to use PACKSS-then-PACKUS and others that use PACKUS-then-PACKSS. PACKUS is much easier to combine with if we know the upper bits are zero as ComputeKnownBits can easily see through BITCASTs etc. especially now that rL333995 and rL334007 have landed. It also effectively works at byte level which further simplifies shuffle combines. The only (minor) annoyances are that ComputeKnownBits can sometimes take longer as it doesn't fail as quickly as ComputeNumSignBits (but I'm not seeing any actual regressions in tests) and PACKUSDW only became available after SSE41 so we have more codegen diffs between targets. llvm-svn: 334276	2018-06-08 10:29:00 +00:00
Simon Pilgrim	51ceef775b	[X86][SSE] Updated comment - combineVectorSignBitsTruncation handles PACKSS and PACKUS. NFCI. llvm-svn: 334204	2018-06-07 16:08:40 +00:00
Simon Pilgrim	51ff15f472	[X86][SSE] Simplify combineVectorTruncationWithPACKUS. NFCI. Move code only used by combineVectorTruncationWithPACKUS out of combineVectorTruncation. llvm-svn: 334201	2018-06-07 14:53:32 +00:00
Simon Pilgrim	09953d8412	[X86][SSE] Simplify combineVectorTruncationWithPACKSS to reduce code duplication Simplify combineVectorTruncationWithPACKSS to just a SIGN_EXTEND_INREG followed by using the existing truncateVectorWithPACK instead of duplicating code. llvm-svn: 334193	2018-06-07 13:01:42 +00:00
Tomasz Krupa	145825162a	Test commit access. Added a bunch of periods after comments. llvm-svn: 334171	2018-06-07 08:20:28 +00:00
Simon Pilgrim	3d14158891	[X86][BMI][TBM] Only demand bottom 16-bits of the BEXTR control op (PR34042) Only the bottom 16-bits of BEXTR's control op are required (0:8 INDEX, 15:8 LENGTH). Differential Revision: https://reviews.llvm.org/D47690 llvm-svn: 334083	2018-06-06 10:52:10 +00:00
Simon Pilgrim	f2f043acbb	[X86][SSE] Use multiplication scale factors for v8i16 SHL on pre-AVX2 targets. Similar to v4i32 SHL, convert v8i16 shift amounts to scale factors instead to improve performance and reduce instruction count. We were already doing this for constant shifts, this adds variable shift support. Reduces the serial nature of the codegen, which relies on chains of plendvb/pand+pandn+por shifts. This is a step towards adding support for vXi16 vector rotates. Differential Revision: https://reviews.llvm.org/D47546 llvm-svn: 334023	2018-06-05 15:17:39 +00:00
Simon Pilgrim	fef9b6eea6	[X86][SSE] Add target shuffle support to X86TargetLowering::computeKnownBitsForTargetNode Ideally we'd use resolveTargetShuffleInputs to handle faux shuffles as well but: (a) that code path doesn't handle general/pre-legalized ops/types very well. (b) I'm concerned about the compute time as they recurse to calls to computeKnownBits/ComputeNumSignBits which would need depth limiting somehow. llvm-svn: 334007	2018-06-05 10:52:29 +00:00
Simon Pilgrim	7bbe7a2920	[X86][SSE] Add basic PACKUS support to X86TargetLowering::computeKnownBitsForTargetNode Helps improve analysis of saturation ops llvm-svn: 333995	2018-06-05 09:45:03 +00:00
Alexander Ivchenko	964b27fa21	[X86][CET] Shadow stack fix for setjmp/longjmp This is the new version of D46181, allowing setjmp/longjmp to work correctly with the Intel CET shadow stack by storing SSP on setjmp and fixing it on longjmp. The patch has been updated to use the cf-protection-return module flag instead of HasSHSTK, and the bug that caused D46181 to be reverted has been fixed with the test expanded to track that fix. patch by mike.dvoretsky Differential Revision: https://reviews.llvm.org/D47311 llvm-svn: 333990	2018-06-05 09:22:30 +00:00
Craig Topper	1f956e2c5f	[X86] Don't pass ParitySrc array into isAddSubOrSubAddMask. Instead use a bool output parameter to get the real piece of info we care about. NFC The ParitySrc array is more of an implementation detail. A single bool to get the final parity is sufficient. llvm-svn: 333935	2018-06-04 17:58:45 +00:00
Simon Pilgrim	7c000d4267	[X86] Only accept const SelectionDAG to resolveTargetShuffleInputs/getFauxShuffleMask These methods should only be using SelectionDAG for analysis (known/sign bits etc), not node creation. llvm-svn: 333925	2018-06-04 16:48:13 +00:00
Craig Topper	3828ce7eab	[X86] Do something sensible when an expand load intrinsic is passed a 0 mask. Previously we just returned undef, but really we should be returning the pass thru input. We also need to make sure we preserve the chain output that the original intrinsic node had to maintain connectivity in the DAG. So we should just return the incoming chain as the output chain. llvm-svn: 333804	2018-06-01 22:59:07 +00:00
Simon Pilgrim	ff0623cd29	[X86][SSE] Recognise splat rotations and expand back to shift ops. Noticed while fixing PR37426, for splat rotations (rotation by an uniform value) its better to just expand back to shift ops than performing as a general non-uniform rotation. llvm-svn: 333661	2018-05-31 15:47:17 +00:00
Simon Pilgrim	c34395d889	[X86][AVX] Add peekThroughEXTRACT_SUBVECTORs helper (NFCI) We often need this for AVX1 128-bit integer ops as they may have been split from a 256-bit source. llvm-svn: 333660	2018-05-31 15:15:49 +00:00
Simon Pilgrim	346886bc0d	[X86][SSE] Add support for detecting SUB(SPLAT_BV, SPLAT) cases for shift-rotate patterns. This improves splat rotations (rotation by an uniform value), to avoid having to use the generic non-uniform shift code (extension to PR37426). llvm-svn: 333641	2018-05-31 11:25:16 +00:00
Simon Pilgrim	5e9f459c62	[X86][SSE] Pulled out splat detection helper from LowerScalarVariableShift (NFCI) Created the IsSplatValue helper from the splat detection code in LowerScalarVariableShift as a first NFC step towards improving support for splat rotations, which is an extension of PR37426. llvm-svn: 333580	2018-05-30 19:16:59 +00:00
Gabor Buella	890e363e11	[X86] Lowering FMA intrinsics to native IR (LLVM part) Support for Clang lowering of fused intrinsics. This patch: 1. Removes bindings to clang fma intrinsics. 2. Introduces new LLVM unmasked intrinsics with rounding mode: int_x86_avx512_vfmadd_pd_512 int_x86_avx512_vfmadd_ps_512 int_x86_avx512_vfmaddsub_pd_512 int_x86_avx512_vfmaddsub_ps_512 supported with a new intrinsic type (INTR_TYPE_3OP_RM). 3. Introduces new x86 fmaddsub/fmsubadd folding. 4. Introduces new tests for code emitted by sequentions introduced in Clang part. Patch by tkrupa Reviewers: craig.topper, sroland, spatel, RKSimon Reviewed By: craig.topper, RKSimon Differential Revision: https://reviews.llvm.org/D47443 llvm-svn: 333554	2018-05-30 15:25:16 +00:00
Craig Topper	5439b3d1e5	[X86] Fix a potential crash that occur after r333419. The code could issue a truncate from a small type to larger type. We need to extend in that case instead. llvm-svn: 333460	2018-05-29 20:04:10 +00:00
Matt Arsenault	ab2b79cb97	DAG: Remove redundant version of getRegisterTypeForCallingConv There seems to be no real reason to have these separate copies. The existing implementations just copy each other for x86. For Mips there is a subtle difference, which is just a bug since it changes based on the context where which one was called. Dropping this version, all tests pass. If I try to merge them to match the removed version, a test fails. llvm-svn: 333440	2018-05-29 17:42:26 +00:00
Alexander Ivchenko	96062eaa8e	[X86] Scalar mask and scalar move optimizations 1. Introduction of mask scalar TableGen patterns. 2. Introduction of new scalar move TableGen patterns and refactoring of existing ones. 3. Folding of pattern created by introducing scalar masking in Clang header files. Patch by tkrupa Differential Revision: https://reviews.llvm.org/D47012 llvm-svn: 333419	2018-05-29 14:27:11 +00:00
Craig Topper	a34f8731c7	[X86] Disable a DAG combine to allow packed AVX512DQ instructions to be consistently used for i64->float/double conversions. Summary: We already get this right if the i64 didn't come from a load. Reviewers: RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D47439 llvm-svn: 333393	2018-05-29 06:22:45 +00:00
Craig Topper	21aeddc3dc	[X86] Remove masked vpermi2var/vpermt2var intrinsics and autoupgrade. We have unmasked intrinsics now and wrap them with a select. This is a net reduction of 36 intrinsics from before the unmasked intrinsics were added. llvm-svn: 333388	2018-05-29 05:22:05 +00:00
Craig Topper	dcfcfdb0d1	[X86] Converge X86ISD::VPERMV3 and X86ISD::VPERMIV3 to a single opcode. These do the same thing with the first and second sources swapped. They previously came from separate intrinsics that specified different masking behavior. But we can cover that with isel patterns and a single node. This is a step towards reducing the number of intrinsics needed. A bunch of tests change because we are now biased to choosing VPERMT over VPERMI when there is nothing to signal that commuting is beneficial. llvm-svn: 333383	2018-05-28 19:33:11 +00:00
Craig Topper	26bc84860a	[X86] Stop forcing X86VPermi2X node index operand to match destination type to make masking pattern matching easier. Add extra patterns with bitcasts instead. This basically reverts r280696 in favor of using extra patterns as mentioned as an alternative in that commit message. For now I've only added the cases we have test cases for, but it should be easy to add more in the future. This will help to convert VPERMI2PS/VPERMT2PS intrinsics to use a single ISD node opcode. And hopefully allow some intrinsics to be removed. llvm-svn: 333365	2018-05-28 05:37:25 +00:00
Craig Topper	51eddb8749	[X86] Remove masking from avx512ifma intrinsics. Use a select instead. This allows us to avoid having mask and maskz variant. Reducing from 12 intrinsics to 6. llvm-svn: 333346	2018-05-26 18:55:19 +00:00
Simon Pilgrim	b8c7c9c369	[X86][SSE] Pull out (AND (XOR X, -1), Y) matching into a helper function. NFC. llvm-svn: 333201	2018-05-24 16:16:42 +00:00
Simon Pilgrim	8bd73573c3	Fix unused variable warnings. NFCI. llvm-svn: 333195	2018-05-24 15:34:50 +00:00
Simon Pilgrim	0c72316a21	[X86][SSE] Pull out OR(AND(~MASK,X),AND(MASK,Y)) matching into a helper function. NFC. First stage towards matching more variants of the bitselect pattern for combineLogicBlendIntoPBLENDV (PR37549) llvm-svn: 333191	2018-05-24 15:12:48 +00:00
Roman Lebedev	7772de25d0	[DAGCombine][X86][AArch64] Masked merge unfolding: vector edition. Summary: This appears to be the last missing piece for the masked merge pattern handling in the backend. This is [[ https://bugs.llvm.org/show_bug.cgi?id=37104 \| PR37104 ]]. [[ https://bugs.llvm.org/show_bug.cgi?id=6773 \| PR6773 ]] will introduce an IR canonicalization that is likely bad for the end assembly. Previously, `andps`+`andnps` / `bsl` would be generated. (see `@out`) Now, they would no longer be generated (see `@in`), and we need to make sure that they are generated. Differential Revision: https://reviews.llvm.org/D46528 llvm-svn: 332904	2018-05-21 21:41:02 +00:00
Craig Topper	aad3aefaeb	[X86] Remove masking from vpternlog intrinsics. Use a select in IR instead. This removes 6 intrinsics since we no longer need separate mask and maskz intrinsics. Differential Revision: https://reviews.llvm.org/D47124 llvm-svn: 332890	2018-05-21 20:58:09 +00:00
Simon Pilgrim	a8869e68a9	[X86][SSE] Add an assert to ensure that rotation amount is converted to a scale Missed in rL332832 where we added SSE v4i32 rotations for PR37426. llvm-svn: 332844	2018-05-21 15:17:23 +00:00
Simon Pilgrim	5aa7cdfd70	[X86][SSE] Support v4i32 rotations (PR37426) As suggested by Fabian on PR37426, we can use PMULUDQ to perform v4i32 vector rotations as the upper 32bits of the multiply will contain the 'wrapped' bits of the rotation. v8i16/v16i8 rotations would be straightforward to add to lowerRotate in the future - ideally we'd mostly share code with the vector shifts lowering. Differential Revision: https://reviews.llvm.org/D46954 llvm-svn: 332832	2018-05-21 09:45:59 +00:00
Craig Topper	e4c045b7df	[X86] Remove mask arguments from permvar builtins/intrinsics. Use a select in IR instead. Someday maybe we'll use selects for all intrinsics. llvm-svn: 332824	2018-05-20 23:34:04 +00:00
Craig Topper	f94ed26ea9	[X86] Directly legalize v16i16/v8i16 vselect to vXi8 vselect to use VPBLENDVB The intrinsic legalization for masked truncate uses ISD::TRUNCATE which can be constant folded by getNode. This prevents getVectorMaskingNode from seeing the ISD::TRUNCATE special case where it should emit X86ISD::SELECT instead of ISD::VSELECT. This causes a vselect with a v16i1 or v8i1 condition to be emitted during vector legalization. but vector legalization doesn't revisit nodes it creates. DAG combine will then promote this condition to match the result type. Then op legalization will try to legalize it, but the custom lowering hook returned SDValue(). But op legalization doesn't have an Expand for VSELECT because it expects vector legalization to have taken care of it. So the operation sticks around and fails in isel. This patch adds a custom legalization hook to morph it to a vXi8 vselect instead. This also simplifies the normal vXi16 vselect handling because vector legalization was normally expanding to AND/ANDN/OR and DAG combine was turning that into VBLENDVB. So we can skip a step by doing it directly. Fixes PR37499 Differential Revision: https://reviews.llvm.org/D47025 llvm-svn: 332743	2018-05-18 17:48:06 +00:00
Simon Pilgrim	2e0f6c9b21	[X86][SSE] Reduce instruction/register usages for v4i32 vector shifts (PR37441) As suggested by Fabian on PR37441, use PSHUFLW to extend shift amount types for use with PSRAD/PSRLD to reduce register pressure. Some of this ideally would be done by combineTargetShuffle but its tricky to do as most of the shuffles are sharing inputs. Differential Revision: https://reviews.llvm.org/D46959 llvm-svn: 332524	2018-05-16 20:52:52 +00:00
Craig Topper	67aa726f8c	[X86][AVX512DQ] Use packed instructions for scalar FP<->i64 conversions on 32-bit targets As i64 types are not legal on 32-bit targets, insert these into a suitable zero vector and use the packed vXi64<->FP conversion instructions instead. Fixes PR3163. Differential Revision: https://reviews.llvm.org/D43441 llvm-svn: 332498	2018-05-16 17:40:07 +00:00
Mikael Holmen	e01131decf	Remove unused variable introduced in r332336 The unused variable caused a compilation warning: ../lib/Target/X86/X86ISelLowering.cpp:34614:17: error: unused variable 'SMax' [-Werror,-Wunused-variable] if (SDValue SMax = MatchMinMax(SMin, ISD::SMAX, C1)) ^ 1 error generated. llvm-svn: 332431	2018-05-16 06:36:11 +00:00
Artur Gainullin	243a3d56d8	[X86] Improve unsigned saturation downconvert detection. Summary: New unsigned saturation downconvert patterns detection was implemented in X86 Codegen: (truncate (smin (smax (x, C1), C2)) to dest_type), where C1 >= 0 and C2 is unsigned max of destination type. (truncate (smax (smin (x, C2), C1)) to dest_type) where C1 >= 0, C2 is unsigned max of destination type and C1 <= C2. These two patterns are equivalent to: (truncate (umin (smax(x, C1), unsigned_max_of_dest_type)) to dest_type) Reviewers: RKSimon Subscribers: llvm-commits, a.elovikov Differential Revision: https://reviews.llvm.org/D45315 llvm-svn: 332336	2018-05-15 10:24:12 +00:00
Sanjay Patel	b4e7893ba8	[x86] fix fmaxnum/fminnum with nnan With nnan, there's no need for the masked merge / blend sequence (that probably costs much more than the min/max instruction). Somewhere between clang 5.0 and 6.0, we started producing these intrinsics for fmax()/fmin() in C source instead of libcalls or fcmp/select. The backend wasn't prepared for that, so we regressed perf in those cases. Note: it's possible that other targets have similar problems as seen here. Noticed while investigating PR37403 and related bugs: https://bugs.llvm.org/show_bug.cgi?id=37403 The IR FMF propagation cases still don't work. There's a proposal that might fix those cases in D46563. llvm-svn: 331992	2018-05-10 15:40:49 +00:00
Craig Topper	b9a473d186	[X86] Combine (vXi1 (bitcast (-1)))) and (vXi1 (bitcast (0))) to all ones or all zeros vXi1 vector. llvm-svn: 331847	2018-05-09 06:07:20 +00:00
Shiva Chen	801bf7ebbe	[DebugInfo] Examine all uses of isDebugValue() for debug instructions. Because we create a new kind of debug instruction, DBG_LABEL, we need to check all passes which use isDebugValue() to check MachineInstr is debug instruction or not. When expelling debug instructions, we should expel both DBG_VALUE and DBG_LABEL. So, I create a new function, isDebugInstr(), in MachineInstr to check whether the MachineInstr is debug instruction or not. This patch has no new test case. I have run regression test and there is no difference in regression test. Differential Revision: https://reviews.llvm.org/D45342 Patch by Hsiangkai Wang. llvm-svn: 331844	2018-05-09 02:42:00 +00:00
Jessica Paquette	ec37c640dd	Revert "[X86][CET] Shadow stack fix for setjmp/longjmp" This reverts commit 30962eca38ef02666ebcdded72a94f2cd0292d68. This commit has been causing test asan failures on a build bot. http://green.lab.llvm.org/green/job/clang-stage1-configure-RA/45108/ Original commit: https://reviews.llvm.org/D46181 llvm-svn: 331813	2018-05-08 22:00:57 +00:00
Jeremy Morse	4f799c027e	[X86] Mark all byval parameters as aliased This is a fix for PR30290: by marking all byval stack slots as being aliased, the instruction scheduler is more conservative about rescheduling memory accesses to such stack slots as an LLVM Value* might alias it. This fixes errors such as in the patched test case, where reads and writes to a data structure are illegally mixed. This could be fixed better in the future with better analysis for the instruction scheduler to know what Values alias what stack slots. Differential Revision: https://reviews.llvm.org/D45022 llvm-svn: 331749	2018-05-08 09:18:01 +00:00
Alexander Ivchenko	c47f799289	[X86][CET] Shadow stack fix for setjmp/longjmp This patch adds a shadow stack fix when compiling setjmp/longjmp with the shadow stack enabled. This allows setjmp/longjmp to work correctly with CET. Patch by mike.dvoretsky Differential Revision: https://reviews.llvm.org/D46181 llvm-svn: 331748	2018-05-08 09:04:07 +00:00
Roman Lebedev	cc42d08b1d	[DagCombiner] Not all 'andn''s work with immediates. Summary: Split off from D46031. In masked merge case, this degrades IPC by decreasing instruction count. {F6108777} The next patch should be able to recover and improve this. This also affects the transform @spatel have added in D27489 / rL289738, and the test coverage for X86 was missing. But after i have added it, and looked at the changes in MCA, i'm somewhat confused. {F6093591} {F6093592} {F6093593} I'd say this regression is an improvement, since `IPC` increased in that case? Reviewers: spatel, craig.topper Reviewed By: spatel Subscribers: andreadb, llvm-commits, spatel Differential Revision: https://reviews.llvm.org/D46493 llvm-svn: 331684	2018-05-07 21:52:11 +00:00
Craig Topper	c882014f43	[X86] Fix copy/paste mistake in comment. NFC llvm-svn: 331611	2018-05-07 00:47:02 +00:00
Craig Topper	cb2abc7977	[X86] Enable reciprocal estimates for v16f32 vectors by using VRCP14PS/VRSQRT14PS Summary: The legacy VRCPPS/VRSQRTPS instructions aren't available in 512-bit versions. The new increased precision versions are. So we can use those to implement v16f32 reciprocal estimates. For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch. Reviewers: spatel Reviewed By: spatel Subscribers: RKSimon, llvm-commits, mehdi_amini Differential Revision: https://reviews.llvm.org/D46498 llvm-svn: 331606	2018-05-06 17:48:21 +00:00
Adrian Prantl	5f8f34e459	Remove \brief commands from doxygen comments. We've been running doxygen with the autobrief option for a couple of years now. This makes the \brief markers into our comments redundant. Since they are a visual distraction and we don't want to encourage more \brief markers in new code either, this patch removes them all. Patch produced by for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done Differential Revision: https://reviews.llvm.org/D46290 llvm-svn: 331272	2018-05-01 15:54:18 +00:00
Craig Topper	d656410293	[X86] Make the STTNI flag intrinsics use the flags from pcmpestrm/pcmpistrm if the mask instrinsics are also used in the same basic block. Summary: Previously the flag intrinsics always used the index instructions even if a mask instruction also exists. To fix fix this I've created a single ISD node type that returns index, mask, and flags. The SelectionDAG CSE process will merge all flavors of intrinsics with the same inputs to a s ingle node. Then during isel we just have to look at which results are used to know what instruction to generate. If both mask and index are used we'll need to emit two instructions. But for all other cases we can emit a single instruction. Since I had to do manual isel anyway, I've removed the pseudo instructions and custom inserter code that was working around tablegen limitations with multiple implicit defs. I've also renamed the recently added sse42.ll test case to sttni.ll since it focuses on that subset of the sse4.2 instructions. Reviewers: chandlerc, RKSimon, spatel Reviewed By: chandlerc Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D46202 llvm-svn: 331091	2018-04-27 22:15:33 +00:00
Chandler Carruth	16429acacb	[x86] Revert r330322 (& r330323): Lowering x86 adds/addus/subs/subus intrinsics The LLVM commit introduces a crash in LLVM's instruction selection. I filed http://llvm.org/PR37260 with the test case. llvm-svn: 330997	2018-04-26 21:46:01 +00:00
Craig Topper	300e20d61c	[X86] Form MUL_IMM for multiplies with 3/5/9 to encourage LEA formation over load folding. Previously we only formed MUL_IMM when we split a constant. This blocked load folding on those cases. We should also form MUL_IMM for 3/5/9 to favor LEA over load folding. Differential Revision: https://reviews.llvm.org/D46040 llvm-svn: 330850	2018-04-25 17:35:03 +00:00
Alexander Ivchenko	5717fbaf4c	[X86] Replace action Promote with Expand for operation ISD::SINT_TO_FP Summary: If attribute "use-soft-float"="true" is set then X86ISelLowering.cpp sets 'Promote' action for ISD::SINT_TO_FP operation on type i32. But 'Promote' action is not proper in this case since lib function __floatsidf is available for casting from signed int to float type. Thus Expand action is more suitable here. The Expand action should be set for ISD::UINT_TO_FP for soft float as well. If function attribute "use-soft-float"="true" is set then infinite looping can happen in DAG combining, function visitSINT_TO_FP() replaces SINT_TO_FP node with UINT_TO_FP node and function combineUIntToFP() replace vice versa in cycle. The fix prevents it. Patch by vrybalov Differential Revision: https://reviews.llvm.org/D45572 llvm-svn: 330711	2018-04-24 12:57:51 +00:00
Craig Topper	fe59bea07b	[X86] Add DAG combine to turn (trunc (srl (mul ext, ext), 16) into PMULHW/PMULHUW. Ultimately I want to use this to remove the intrinsics for these instructions. llvm-svn: 330520	2018-04-21 18:39:21 +00:00
Gabor Buella	31fa8025ba	[X86] WaitPKG instructions Three new instructions: umonitor - Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a writeback memory caching type. umwait - A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events. tpause - Directs the processor to enter an implementation-dependent optimized state until the TSC reaches the value in EDX:EAX. Also modifying the description of the mfence instruction, as the rep prefix (0xF3) was allowed before, which would conflict with umonitor during disassembly. Before: $ echo 0xf3,0x0f,0xae,0xf0 \| llvm-mc -disassemble .text mfence After: $ echo 0xf3,0x0f,0xae,0xf0 \| llvm-mc -disassemble .text umonitor %rax Reviewers: craig.topper, zvi Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D45253 llvm-svn: 330462	2018-04-20 18:42:47 +00:00
Alexander Ivchenko	e8fed1546e	Lowering x86 adds/addus/subs/subus intrinsics (llvm part) This is the patch that lowers x86 intrinsics to native IR in order to enable optimizations. The patch also includes folding of previously missing saturation patterns so that IR emits the same machine instructions as the intrinsics. Patch by tkrupa Differential Revision: https://reviews.llvm.org/D44785 llvm-svn: 330322	2018-04-19 12:13:30 +00:00
Keith Wyss	3d86823f3d	[XRay] Typed event logging intrinsic Summary: Add an LLVM intrinsic for type discriminated event logging with XRay. Similar to the existing intrinsic for custom events, but also accepts a type tag argument to allow plugins to be aware of different types and semantically interpret logged events they know about without choking on those they don't. Relies on a symbol defined in compiler-rt patch D43668. I may wait to submit before I can see demo everything working together including a still to come clang patch. Reviewers: dberris, pelikan, eizan, rSerge, timshen Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D45633 llvm-svn: 330219	2018-04-17 21:30:29 +00:00
Hiroshi Inoue	ae17900997	[NFC] fix trivial typos in document and comments "not not" -> "not" etc llvm-svn: 330083	2018-04-14 08:59:00 +00:00
Craig Topper	254ed028a4	[X86] Remove the pmuldq/pmuldq intrinsics and replace with native IR. This completes the work started in r329604 and r329605 when we changed clang to no longer use the intrinsics. We lost some InstCombine SimplifyDemandedBit optimizations through this change as we aren't able to fold 'and', bitcast, shuffle very well. llvm-svn: 329990	2018-04-13 06:07:18 +00:00
Sriraman Tallam	d693093a65	GOTPCREL references must always use RIP. With -fno-plt, global value references can use GOTPCREL and RIP must be used. Differential Revision: https://reviews.llvm.org/D45460 llvm-svn: 329765	2018-04-10 22:50:05 +00:00
Chandler Carruth	0ca3bd0729	[x86] Model the direction flag (DF) separately from the rest of EFLAGS. This cleans up a number of operations that only claimed te use EFLAGS due to using DF. But no instructions which we think of us setting EFLAGS actually modify DF (other than things like popf) and so this needlessly creates uses of EFLAGS that aren't really there. In fact, DF is so restrictive it is pretty easy to model. Only STD, CLD, and the whole-flags writes (WRFLAGS and POPF) need to model this. I've also somewhat cleaned up some of the flag management instruction definitions to be in the correct .td file. Adding this extra register also uncovered a failure to use the correct datatype to hold X86 registers, and I've corrected that as necessary here. Differential Revision: https://reviews.llvm.org/D45154 llvm-svn: 329673	2018-04-10 06:40:51 +00:00
Chandler Carruth	19618fc639	[x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues. The key idea is to lower COPY nodes populating EFLAGS by scanning the uses of EFLAGS and introducing dedicated code to preserve the necessary state in a GPR. In the vast majority of cases, these uses are cmovCC and jCC instructions. For such cases, we can very easily save and restore the necessary information by simply inserting a setCC into a GPR where the original flags are live, and then testing that GPR directly to feed the cmov or conditional branch. However, things are a bit more tricky if arithmetic is using the flags. This patch handles the vast majority of cases that seem to come up in practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of partially preserved EFLAGS as LLVM doesn't currently model that at all. There are a large number of operations that techinaclly observe EFLAGS currently but shouldn't in this case -- they typically are using DF. Currently, they will not be handled by this approach. However, I have never seen this issue come up in practice. It is already pretty rare to have these patterns come up in practical code with LLVM. I had to resort to writing MIR tests to cover most of the logic in this pass already. I suspect even with its current amount of coverage of arithmetic users of EFLAGS it will be a significant improvement over the current use of pushf/popf. It will also produce substantially faster code in most of the common patterns. This patch also removes all of the old lowering for EFLAGS copies, and the hack that forced us to use a frame pointer when EFLAGS copies were found anywhere in a function so that the dynamic stack adjustment wasn't a problem. None of this is needed as we now lower all of these copies directly in MI and without require stack adjustments. Lots of thanks to Reid who came up with several aspects of this approach, and Craig who helped me work out a couple of things tripping me up while working on this. Differential Revision: https://reviews.llvm.org/D45146 llvm-svn: 329657	2018-04-10 01:41:17 +00:00
Craig Topper	47b2f9d836	[X86] Don't use Lower512IntUnary to split bitcasts with v32i16/v64i8 types on targets without AVX512BW. LowerIntUnary as its name says has an assert for integer types. But for the bitcast case one side might be an FP type. Rather than making sure the function really works for fp types and renaming it. Just do really basic splitting directly. The LowerIntUnary has the advantage that it can peek through BUILD_VECTOR because every other call is during Lowering. But these calls are during legalization and will be followed by a DAG combine round. Revert some change to LowerVectorIntUnary that were originally made just to make these two calls work even in pure integer cases. This was found purely by compiling the avx512f-builtins.c test from clang so I've copied over the offending function from that. llvm-svn: 329616	2018-04-09 20:37:14 +00:00
Mandeep Singh Grang	68a151a13c	[X86] Change std::sort to llvm::sort in response to r327219 Summary: r327219 added wrappers to std::sort which randomly shuffle the container before sorting. This will help in uncovering non-determinism caused due to undefined sorting order of objects having the same key. To make use of that infrastructure we need to invoke llvm::sort instead of std::sort. Note: This patch is one of a series of patches to replace all std::sort to llvm::sort. Refer the comments section in D44363 for a list of all the required patches. Reviewers: chandlerc, craig.topper, RKSimon Reviewed By: chandlerc, craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44874 llvm-svn: 329534	2018-04-08 16:42:52 +00:00
Craig Topper	ef37aebc96	[X86] Combine vXi64 multiplies to MULDQ/MULUDQ during DAG combine instead of lowering. Previously we used a custom lowering for this because of the AVX1 splitting requirement. But we can do the split during DAG combine if we check the types and subtarget llvm-svn: 329510	2018-04-07 19:09:52 +00:00
Benjamin Kramer	1fc0da4849	Make helpers static. NFC. llvm-svn: 329170	2018-04-04 11:45:11 +00:00
Craig Topper	3064c15dc3	[X86] Remove some code that was only needed when i1 was a legal type. NFC llvm-svn: 329146	2018-04-04 04:38:54 +00:00
Craig Topper	9b8cd5fe55	[X86] Don't check for folding into a store when deciding if we can promote an i16 mul. There's no RMW mul operation. llvm-svn: 328931	2018-04-01 06:29:32 +00:00
Craig Topper	db6caabccc	[X86] Check if the load and store are to the same pointer before preventing i16 RMW shifts and subtracts from being promoted. llvm-svn: 328930	2018-04-01 06:29:28 +00:00
Craig Topper	ae2de57db0	[X86] Allow i16 subtracts to be promoted if the load is on the LHS and its not being stored. llvm-svn: 328928	2018-04-01 06:29:25 +00:00
Craig Topper	9bc0d881a3	[X86] Remove unneeded temporary variable. NFC This Promote flag was alwasys set to true except in the default case. But in the default case we don't need to set PVT and can just return false. llvm-svn: 328926	2018-04-01 06:29:21 +00:00
Simon Pilgrim	8c8ebd7945	Fix trailing whitespace. NFCI. llvm-svn: 328917	2018-03-31 09:14:14 +00:00
Simon Pilgrim	71c5f3fffd	[X86][SSE] Don't bother re-adding combined target shuffles to the work list We are re-adding all the bitcasts, constant masks and target shuffles to the work list for no apparent gain. Found while investigating adding SimplifyDemandedVectorElts to target shuffles. Differential Revision: https://reviews.llvm.org/D44942 llvm-svn: 328771	2018-03-29 11:18:41 +00:00
Reid Kleckner	41fb2dba9c	[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32 Summary: Re-lands r328386 and r328443, reverting r328482. Incorporates fixes from @mstorsjo in D44876 (thanks!) so that small parameters in i8 and i16 do not end up in the SysV register parameters (EDI, ESI, etc). I added tests for how we receive small parameters, since that is the important part. It's always safe to store more bytes than will be read, but the assumptions you make when loading them are what really matter. I also tested this by self-hosting clang and it passed tests on win64. Reviewers: mstorsjo, hans Subscribers: hiraditya, mstorsjo, llvm-commits Differential Revision: https://reviews.llvm.org/D44900 llvm-svn: 328570	2018-03-26 18:49:48 +00:00
Hans Wennborg	311b63f13b	Revert r328386 "[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32" This broke Chromium (see crbug.com/825748). It looks like mstorsjo's follow-up patch at D44876 fixes this, but let's revert back to green for now until that's ready to land. (Also reverts r328443.) > Both GCC and MSVC only look at the low byte of a boolean when it is > passed. llvm-svn: 328482	2018-03-26 10:07:51 +00:00
Simon Pilgrim	854ac7490d	[X86] Add missing full stop to comment. NFCI. llvm-svn: 328456	2018-03-25 18:49:48 +00:00
Craig Topper	2c0a62ab9a	[X86] Add a DAG combine to simplify PMULDQ/PMULUDQ nodes These nodes only use the lower 32 bits of their inputs so we can use SimplifyDemandedBits to simplify them. Differential Revision: https://reviews.llvm.org/D44375 llvm-svn: 328405	2018-03-24 01:52:01 +00:00
Reid Kleckner	e27b410661	[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32 Both GCC and MSVC only look at the low byte of a boolean when it is passed. llvm-svn: 328386	2018-03-23 23:38:53 +00:00
Martin Storsjo	07589fc496	[X86] Don't use the MSVC stack protector names on mingw Mingw uses the same stack protector functions as GCC provides on other platforms as well. Patch by Valentin Churavy! Differential Revision: https://reviews.llvm.org/D27296 llvm-svn: 328039	2018-03-20 20:37:51 +00:00
Craig Topper	ab6076514d	[X86] Simplify the AVX512 code in LowerTruncate a little. We don't need to create an ISD::TRUNCATE node to return, we started with one and can return it. Also remove the call to getExtendInVec, the result is just going to be a getNode of that value passed in. llvm-svn: 327914	2018-03-19 21:58:02 +00:00
Craig Topper	3b967466d5	[X86] Replace a couple calls to getExtendInVec with getNode and the appropriate target independent EXTEND_VECTOR_INREG opcode. llvm-svn: 327899	2018-03-19 20:20:22 +00:00
Craig Topper	259eaa6e7c	[X86] Remove sse41 specific code from lowering v16i8 multiply With the SRAs removed from the SSE2 code in D44267, then there doesn't appear to be any advantage to the sse41 code. The punpcklbw instruction and pmovsx seem to have the same latency and throughput on most CPUs. And the SSE41 code requires moving the upper 64-bits into the lower 64-bit before the sign extend can be done. The unpckhbw in sse2 code can do better than that. llvm-svn: 327869	2018-03-19 17:31:41 +00:00
Oren Ben Simhon	fdd72fd522	[X86] Added support for nocf_check attribute for indirect Branch Tracking X86 Supports Indirect Branch Tracking (IBT) as part of Control-Flow Enforcement Technology (CET). IBT instruments ENDBR instructions used to specify valid targets of indirect call / jmp. The `nocf_check` attribute has two roles in the context of X86 IBT technology: 1. Appertains to a function - do not add ENDBR instruction at the beginning of the function. 2. Appertains to a function pointer - do not track the target function of this pointer by adding nocf_check prefix to the indirect-call instruction. This patch implements `nocf_check` context for Indirect Branch Tracking. It also auto generates `nocf_check` prefixes before indirect branchs to jump tables that are guarded by range checks. Differential Revision: https://reviews.llvm.org/D41879 llvm-svn: 327767	2018-03-17 13:29:46 +00:00
Craig Topper	f0815e01d8	[X86] Merge ADDSUB/SUBADD detection into single methods that can detect either and indicate what they found. Previously, we called the same functions twice with a bool flag determining whether we should look for ADDSUB or SUBADD. It would be more efficient to run the code once and detect either pattern with a flag to tell which type it found. Differential Revision: https://reviews.llvm.org/D44540 llvm-svn: 327730	2018-03-16 18:25:59 +00:00
Craig Topper	1b8cf49704	[SelectionDAG][ARM][X86] Teach PromoteIntRes_SETCC to do a better job picking the result type for the setcc. Previously if getSetccResultType returned an illegal type we just fell back to using the default promoted type. This appears to have been to handle the case where for vectors getSetccResultType returns the input type, but the input type itself isn't legal and will need to be promoted. Without the legality check we would never reach a legal type. But just picking the promoted type to be the setcc type can create strange setccs where the result type is 128 bits and the operand type is 256 bits. If for example the result type was promoted to v8i16 from v8i1, but the input type was promoted from v8i23 to v8i32. We currently handle this with custom lowering code in X86. This legality check also caused us reject the getSetccResultType when the input type needed to be widened or split. Even though that result wouldn't have caused legalization to get stuck. This patch tries to fix this by detecting the getSetccResultType needs to be promoted. If its input type also needs to be promoted we'll try a ask for a new setcc result type based on its eventual promoted value. Otherwise we fall back to default type to promote to. For any other illegal values we might get back from the initial call to getSetccResultType we just keep and allow it to be re-legalized later via splitting or widening or scalarizing. llvm-svn: 327683	2018-03-15 23:04:11 +00:00
Craig Topper	c3983c34cd	[X86] Make sure we use FSUB instruction as the reference for operand order in isAddSubOrSubAdd when recognizing subadd The FADD part of the addsub/subadd pattern can have its operands commuted, but when checking for fsubadd we were using the fadd as reference and commuting the fsub node. llvm-svn: 327660	2018-03-15 20:30:54 +00:00
Craig Topper	5a0251fe67	[X86] Simplify the type legality checking for (FM)ADDSUB/SUBADD matching. NFCI Rather than enumerating all specific types, for the DAG combine we can just use TLI::isTypeLegal and an SSE3 check. For the BUILD_VECTOR version we already know the type is legal so we just need to check SSE3. llvm-svn: 327649	2018-03-15 17:38:59 +00:00
Craig Topper	627e001fad	[X86] Fix 80 column violations. llvm-svn: 327648	2018-03-15 17:38:55 +00:00
Craig Topper	26a3a80c87	[X86] Add support for matching FMSUBADD from build_vector. llvm-svn: 327604	2018-03-15 06:14:55 +00:00
Craig Topper	a5e712f402	[X86] Remove old TODO. We have coverage for this now. Coverage was added in r320950. llvm-svn: 327603	2018-03-15 06:14:53 +00:00
Craig Topper	b9526e9fdb	[X86] Use MVT in a couple places where we know the type is legal. llvm-svn: 327602	2018-03-15 06:14:51 +00:00
Craig Topper	b36cb20ef9	[X86] Teach X86TargetLowering::targetShrinkDemandedConstant to set non-demanded bits if it helps created an and mask that can be matched as a zero extend. I had to modify the bswap recognition to allow unshrunk masks to make this work. Fixes PR36689. Differential Revision: https://reviews.llvm.org/D44442 llvm-svn: 327530	2018-03-14 16:55:15 +00:00
Matt Arsenault	41e5ac4fa4	TargetMachine: Add address space to getPointerSize llvm-svn: 327467	2018-03-14 00:36:23 +00:00
Craig Topper	ec4881ad53	[X86] Simplify the LowerAVXCONCAT_VECTORS code a little by creating a single path for insert_subvector handling. We now only create recursive concats if we have more than two non-zero values. This keeps our subvector broadcast DAG combine functioning. llvm-svn: 327457	2018-03-13 22:36:07 +00:00
Craig Topper	cc060e921b	[X86] Rewrite LowerAVXCONCAT_VECTORS similar to how we handle vXi1 concats. This better able to detect undef and zeros pieces in the concat. Or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs. This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best. We probably want to merge this with the vXi1 concat code since they are very similar. llvm-svn: 327454	2018-03-13 22:05:25 +00:00
Craig Topper	7e711a6822	[X86] Remove SplitBinaryOpsAndApply and use SplitOpsAndApply by adding curly braces around the ops. Summary: Unless you were intentionally avoiding this syntax? I saw you mentioned makeArrayRef in your commit that added SplitOpsAndApply. Reviewers: RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44403 llvm-svn: 327418	2018-03-13 16:23:27 +00:00
Simon Pilgrim	93bd7187f4	[X86][SSE41] createVariablePermute v2X64 - PCMPEQQ can test for index 0/1 and select between them. llvm-svn: 327385	2018-03-13 12:22:58 +00:00
Craig Topper	acaba3b402	[X86] Remove use of MVT class from the ShuffleDecode library. MVT belongs to the CodeGen layer, but ShuffleDecode is used by the X86 InstPrinter which is part of the MC layer. This only worked because MVT is completely implemented in a header file with no other library dependencies. Differential Revision: https://reviews.llvm.org/D44353 llvm-svn: 327292	2018-03-12 16:43:11 +00:00
Simon Pilgrim	6618e2a09c	[X86][SSE] createVariablePermute - PSHUFB requires SSSE3 not just SSE3 llvm-svn: 327259	2018-03-12 12:30:04 +00:00
Craig Topper	7cc1b1fc84	[X86] Don't compute known bits twice for the same SDValue in LowerMUL. We called MaskedValueIsZero with two different masks, but underneath that calls computeKnownBits before applying the mask. This means we compute the same known bits twice due to the two calls. Instead just call computeKnownBits directly and apply the two masks ourselves. llvm-svn: 327251	2018-03-12 05:35:02 +00:00
Simon Pilgrim	d09cc9c62c	[X86][MMX] Support MMX build vectors to avoid SSE usage (PR29222) 64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type. This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics. We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover. Differential Revision: https://reviews.llvm.org/D43618 llvm-svn: 327247	2018-03-11 19:22:13 +00:00
Simon Pilgrim	30f74c14ff	[X86][AVX] createVariablePermute - scale v16i16 variable permutes to use v32i8 codegen XOP was already doing this, and now AVX performs v32i8 variable permutes as well. llvm-svn: 327245	2018-03-11 17:23:54 +00:00
Simon Pilgrim	b306501796	[X86][AVX] createVariablePermute - widen permutes for cases where the source vector is wider than the destination type llvm-svn: 327244	2018-03-11 17:00:46 +00:00
Simon Pilgrim	9a5d0c7540	[X86][AVX] createVariablePermute - use PSHUFB+PCMPGT+SELECT for v32i8 variable permutes Same as the VPERMILPS/VPERMILPD approach for v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. The select between the lo/hi permute results based on the index size. llvm-svn: 327242	2018-03-11 16:28:11 +00:00
Simon Pilgrim	d2fbd87ce8	Fix for buildbots which didn't like makeArrayRef with initializer lists. llvm-svn: 327241	2018-03-11 14:31:55 +00:00
Simon Pilgrim	e60afdf9eb	[X86][SSE] Generalized SplitBinaryOpsAndApply to SplitOpsAndApply to support any number of ops. I've kept SplitBinaryOpsAndApply as a wrapper to avoid a lot of makeArrayRef code. llvm-svn: 327240	2018-03-11 14:04:53 +00:00
Simon Pilgrim	f9cc80d218	[X86][AVX] createVariablePermute - use 2xVPERMIL+PCMPGT+SELECT for v8i32/v8f32 and v4i64/v4f64 variable permutes As VPERMILPS/VPERMILPD only selects elements based on the bits[1:0]/bit[1] then we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for for lo/hi. For v4i64/v4f64 this avoids some rather nasty v4i64 multiples on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends. llvm-svn: 327239	2018-03-11 11:52:26 +00:00
Simon Pilgrim	2565bd421e	[X86][AVX512] createVariablePermute - Non-VLX targets can widen v4i64/v8f64 variable permutes to v8i64/v8f64 Permutes in the upper elements will be undefined, but they will be discarded anyway. llvm-svn: 327238	2018-03-11 11:19:19 +00:00
Simon Pilgrim	64b899f0f3	[x86][SSE] Add widenSubVector helper. NFCI. Helper function to insert a subvector into the bottom elements of a larger zero/undef vector with the same scalar type. I've converted a couple of INSERT_SUBVECTOR calls to use it, there are plenty more although in some cases I was worried it might make the code more ambiguous. llvm-svn: 327236	2018-03-11 10:50:48 +00:00
Simon Pilgrim	de7f3f0f91	[X86][XOP] createVariablePermute - use VPERMIL2 for v8i32/v4i64 variable permutes llvm-svn: 327222	2018-03-10 19:49:59 +00:00
Simon Pilgrim	ff1248f82f	[X86][XOP] createVariablePermute - use VPPERM for v16i16 variable permutes llvm-svn: 327218	2018-03-10 18:33:29 +00:00
Simon Pilgrim	d9dc114e2f	[X86][SSE] createVariablePermute - create index scaling helper. NFCI. This will help in some future changes for custom lowering. llvm-svn: 327217	2018-03-10 18:12:35 +00:00
Simon Pilgrim	8224241f75	[X86][XOP] createVariablePermute - use VPPERM for v32i8 variable permutes llvm-svn: 327213	2018-03-10 16:51:45 +00:00
Simon Pilgrim	2cd489feb2	[X86][AVX] createVariablePermute - fix v2i64/v2f64 VPERMILPD index creation. The input indices vector will put the index in bit0, but VPERMILPD actually selects off bit1 - so we need to scale accordingly. llvm-svn: 327159	2018-03-09 18:37:56 +00:00
Simon Pilgrim	230d38b559	[X86][SSE] createVariablePermute - move source vector canonicalization to top of function. NFCI. This is to make it easier to return early from the switch statement with custom lowering. llvm-svn: 327157	2018-03-09 18:08:08 +00:00
Simon Pilgrim	033a4167d2	Tidyup comment that was destroyed by clang-format. NFCI. llvm-svn: 327141	2018-03-09 15:50:09 +00:00
Simon Pilgrim	322c521ed7	[X86][SSE] createVariablePermute - move index vector canonicalization to top of function. NFCI. This is to make it easier to return early from the switch statement with custom lowering. llvm-svn: 327140	2018-03-09 15:48:56 +00:00
Craig Topper	784f1bbf5e	[X86] Remove SRAs from v16i8 multiply lowering on sse2 targets Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did an v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The similar was done for all the odd bytes. The results are then packed together with packuswb Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does. Differential Revision: https://reviews.llvm.org/D44267 llvm-svn: 327093	2018-03-09 01:22:31 +00:00
Simon Pilgrim	c286680032	[X86][AVX] Pull out variable permute creation from LowerBUILD_VECTORAsVariablePermute. NFCI. This will make it easier to handle more complex cases than basic scaling or index masks. llvm-svn: 327054	2018-03-08 20:07:06 +00:00

... 5 6 7 8 9 ...

5955 Commits