llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	f3b5943ffc	Remove lambda default argument to fix gcc pedantic warning. llvm-svn: 339767	2018-08-15 12:32:09 +00:00
Craig Topper	633fe98e27	[X86] Change legacy SSE scalar fp to integer intrinsics to use specific ISD opcodes instead of keeping as intrinsics. Unify SSE and AVX512 isel patterns. AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics. The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics. Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns. llvm-svn: 339749	2018-08-15 01:23:00 +00:00
Simon Pilgrim	2ce3d6e135	[X86][SSE] Avoid duplicate shuffle input sources in combineX86ShufflesRecursively rL339686 added the case where a faux shuffle might have repeated shuffle inputs coming from either side of the OR(). This patch improves the insertion of the inputs into the source ops lists to account for this, as well as making it trivial to add support for shuffles with more than 2 inputs in the future. llvm-svn: 339696	2018-08-14 17:22:37 +00:00
Simon Pilgrim	ed55138247	[X86][SSE] Add shuffle combine support for OR(PSHUFB,PSHUFB) style patterns. If each element is zero from one (or both) inputs then we can combine these into a single shuffle mask. llvm-svn: 339686	2018-08-14 16:00:05 +00:00
Simon Pilgrim	df9880f257	[X86][SSE] Generalize lowerVectorShuffleAsBlendOfPSHUFBs to work with any vXi8 type. We still only use this for v16i8, but this cleans up the code to support v32i8/v64i8 sometime in the future. llvm-svn: 339679	2018-08-14 14:00:14 +00:00
Tomasz Krupa	86a63889f3	[X86] Lowering addus/subus intrinsics to native IR Summary: This revision improves previous version (rL330322) which has been reverted due to crashes. This is the patch that lowers x86 intrinsics to native IR in order to enable optimizations. The patch also includes folding of previously missing saturation patterns so that IR emits the same machine instructions as the intrinsics. Reviewers: craig.topper, spatel, RKSimon Reviewed By: craig.topper Subscribers: mike.dvoretsky, DavidKreitzer, sroland, llvm-commits Differential Revision: https://reviews.llvm.org/D46179 llvm-svn: 339650	2018-08-14 08:00:56 +00:00
Simon Pilgrim	130b00bc43	[X86][SSE] Pull out repeated shift getOpcode() calls. NFCI. llvm-svn: 339425	2018-08-10 11:42:42 +00:00
Craig Topper	9a8136f7b4	[X86] Qualify one of the heuristics in combineMul to only apply to positive multiply amounts. This seems to slightly help the performance of one of our internal benchmarks. We probably need better heuristics here. llvm-svn: 339406	2018-08-09 23:27:42 +00:00
Simon Pilgrim	511c3fc529	[X86][SSE] Remove PMULDQ/PMULUDQ by zero Exposed by D50328 Differential Revision: https://reviews.llvm.org/D50328 llvm-svn: 339337	2018-08-09 12:37:36 +00:00
Simon Pilgrim	01ae462fef	[X86][SSE] Combine (some) target shuffles with multiple uses As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses. This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool. However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move. This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit. Differential Revision: https://reviews.llvm.org/D50328 llvm-svn: 339335	2018-08-09 12:30:02 +00:00
Craig Topper	9de1797c50	[SelectionDAG][X86] Rename MaskedLoadSDNode::getSrc0 to getPassThru. Src0 doesn't really convey any meaning to what the operand is. Passthru matches what's used in the documentation for the intrinsic this comes from. llvm-svn: 339101	2018-08-07 06:52:49 +00:00
Craig Topper	17989208a9	[SelectionDAG][X86] Rename getValue to getPassThru for gather SDNodes. getValue is more meaningful name for scatter than it is for gather. Split them and use getPassThru for gather. llvm-svn: 339096	2018-08-07 06:13:40 +00:00
Easwaran Raman	10fd92dd94	[X86] Recognize a splat of negate in isFNEG Summary: Expand isFNEG so that we generate the appropriate F(N)M(ADD\|SUB) instructions in more cases. For example, the following sequence a = _mm256_broadcast_ss(f) d = _mm256_fnmadd_ps(a, b, c) generates an fsub and fma without this patch and an fnma with this change. Reviewers: craig.topper Subscribers: llvm-commits, davidxl, wmi Differential Revision: https://reviews.llvm.org/D48467 llvm-svn: 339043	2018-08-06 19:23:38 +00:00
Craig Topper	feb2a58860	[X86] Add a DAG combine for the __builtin_parity idiom used by clang to enable better codegen Clang uses "ctpop & 1" to implement __builtin_parity. If the popcnt instruction isn't supported this generates a large amount of code to calculate the population count. Instead we can bisect the data down to a single byte using xor and then check the parity flag. Even when popcnt is supported, its still a good idea to split 64-bit data on 32-bit targets using an xor in front of a single popcnt. Otherwise we get two popcnts and an add before the and. I've specifically targeted this at the sizes supported by clang builtins, but we could generalize this if we think that's useful. Differential Revision: https://reviews.llvm.org/D50165 llvm-svn: 338907	2018-08-03 18:00:29 +00:00
Craig Topper	e902b7d0b0	[X86] Support fp128 and/or/xor/load/store with VEX and EVEX encoded instructions. Move all the patterns to X86InstrVecCompiler.td so we can keep SSE/AVX/AVX512 all in one place. To save some patterns we'll use an existing DAG combine to convert f128 fand/for/fxor to integer when sse2 is enabled. This allows use to reuse all the existing patterns for v2i64. I believe this now makes SHA instructions the only case where VEX/EVEX and legacy encoded instructions could be generated simultaneously. llvm-svn: 338821	2018-08-03 06:12:56 +00:00
Craig Topper	2c095444a4	[X86] Prevent promotion of i16 add/sub/and/or/xor to i32 if we can fold an atomic load and atomic store. This makes them consistent with i8/i32/i64. Which still seems to be more aggressive on folding than icc, gcc, or MSVC. llvm-svn: 338795	2018-08-03 00:37:34 +00:00
Simon Pilgrim	8b16e15d47	[X86][SSE] Pull out duplicate VSELECT to shuffle mask code. NFCI. llvm-svn: 338693	2018-08-02 09:20:27 +00:00
Reid Kleckner	a30a6d2c29	Load from the GOT for external symbols in the large, PIC code model Do the same handling for external symbols that we do for jump table symbols and global values. Fixes one of the cases in PR38385 llvm-svn: 338651	2018-08-01 22:56:05 +00:00
Craig Topper	c985d42903	[X86] Canonicalize the pattern for __builtin_ffs in a similar way to '__builtin_ffs + 5' We now emit a move of -1 before the cmov and do the addition after the cmov just like the case with an extra addition. This may be slightly worse for code size, but is more consistent with other compilers. And we might be able to hoist the mov -1 outside of loops. llvm-svn: 338613	2018-08-01 18:38:46 +00:00
Simon Pilgrim	931ebe3be1	[X86] Assign from a brace initializer to match style guide. NFCI. llvm-svn: 338598	2018-08-01 17:43:38 +00:00
Simon Pilgrim	a3548c960e	[SelectionDAG] Make binop reduction matcher available to all targets There is nothing x86-specific about this code, so it'd be nice to make this available for other targets to use in the future (and get it out of X86ISelLowering!). Differential Revision: https://reviews.llvm.org/D50083 llvm-svn: 338586	2018-08-01 16:52:28 +00:00
Simon Pilgrim	564762cf32	[X86] Use isNullConstant helper. NFCI. llvm-svn: 338530	2018-08-01 13:06:14 +00:00
Simon Pilgrim	e447a273bd	[X86] Use isNullConstant helper. NFCI. llvm-svn: 338516	2018-08-01 11:24:11 +00:00
Craig Topper	65a1388881	[X86] When looking for (CMOV C-1, (ADD (CTTZ X), C), (X != 0)) -> (ADD (CMOV (CTTZ X), -1, (X != 0)), C), make sure we really have a compare with 0. It's not strictly required by the transform of the cmov and the add, but it makes sure we restrict it to the cases we know we want to match. While there canonicalize the operand order of the cmov to simplify the matching and emitting code. llvm-svn: 338492	2018-08-01 06:36:20 +00:00
Simon Pilgrim	5d9b00d15b	[X86][SSE] Use ISD::MULHU for constant/non-zero ISD::SRL lowering (PR38151) As was done for vector rotations, we can efficiently use ISD::MULHU for vXi8/vXi16 ISD::SRL lowering. Shift-by-zero cases are still problematic (mainly on v32i8 due to extra AND/ANDN/OR or VPBLENDVB blend masks but v8i16/v16i16 aren't great either if PBLENDW fails) so I've limited this first patch to known non-zero cases if we can't easily use PBLENDW. Differential Revision: https://reviews.llvm.org/D49562 llvm-svn: 338407	2018-07-31 18:05:56 +00:00
Craig Topper	bef126fb71	[X86] Add pattern matching for PMADDUBSW Summary: Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned. A C example that triggers this pattern ``` static const int N = 128; int8_t A[2N]; uint8_t B[2N]; int16_t C[N]; void foo() { for (int i = 0; i != N; ++i) C[i] = MIN(MAX((int16_t)A[2i](int16_t)B[2i] + (int16_t)A[2i+1](int16_t)B[2i+1], -32768), 32767); } ``` Reviewers: RKSimon, spatel, zvi Reviewed By: RKSimon, zvi Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D49829 llvm-svn: 338402	2018-07-31 17:12:08 +00:00
Simon Pilgrim	99d475f97d	[X86][SSE] isFNEG - Use getTargetConstantBitsFromNode to handle all constant cases isFNEG was duplicating much of what was done by getTargetConstantBitsFromNode in its own calls to getTargetConstantFromNode. Noticed while reviewing D48467. llvm-svn: 338358	2018-07-31 10:13:17 +00:00
Fangrui Song	f78650a8de	Remove trailing space sed -Ei 's/[[:space:]]+$//' include/*/.{def,h,td} lib/*/.{cpp,h} llvm-svn: 338293	2018-07-30 19:41:25 +00:00
Craig Topper	f014ec9b3b	[X86] Fix typo in comment. NFC llvm-svn: 338274	2018-07-30 17:34:31 +00:00
Matt Arsenault	81920b0a25	DAG: Add calling convention argument to calling convention funcs This seems like a pretty glaring omission, and AMDGPU wants to treat kernels differently from other calling conventions. llvm-svn: 338194	2018-07-28 13:25:19 +00:00
Craig Topper	c3e11bf3f7	[X86] Add support expanding multiplies by constant where the constant is -3/-5/-9 multplied by a power of 2. These can be replaced with an LEA, a shift, and a negate. This seems to match what gcc and icc would do. llvm-svn: 338174	2018-07-27 23:04:59 +00:00
Craig Topper	561e298e29	[X86] Remove an unnecessary 'if' that prevented treating INT64_MAX and -INT64_MAX as power of 2 minus 1 in the multiply expansion code. Not sure why they were being explicitly excluded, but I believe all the math inside the if works. I changed the absolute value to be uint64_t instead of int64_t so INT64_MIN+1 wouldn't be signed wrap. llvm-svn: 338101	2018-07-27 05:56:27 +00:00
Craig Topper	e364baa88b	[X86] Add matching for another pattern of PMADDWD. Summary: This is the pattern you get from the loop vectorizer for something like this int16_t A[1024]; int16_t B[1024]; int32_t C[512]; void pmaddwd() { for (int i = 0; i != 512; ++i) C[i] = (A[2i]B[2i]) + (A[2i+1]B[2i+1]); } In this case we will have (add (mul (build_vector), (build_vector)), (mul (build_vector), (build_vector))). This is different than the pattern we currently match which has the build_vectors between an add and a single multiply. I'm not sure what C code would get you that pattern. Reviewers: RKSimon, spatel, zvi Reviewed By: zvi Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D49636 llvm-svn: 338097	2018-07-27 04:29:10 +00:00
Craig Topper	f7bc550223	[X86] When removing sign extends from gather/scatter indices, make sure we handle UpdateNodeOperands finding an existing node to CSE with. If this happens the operands aren't updated and the existing node is returned. Make sure we pass this existing node up to the DAG combiner so that a proper replacement happens. Otherwise we get stuck in an infinite loop with an unoptimized node. llvm-svn: 338090	2018-07-27 00:00:30 +00:00
Craig Topper	4e687d5bb2	[X86] Don't use CombineTo to skip adding new nodes to the DAGCombiner worklist in combineMul. I'm not sure if this was trying to avoid optimizing the new nodes further or what. Or maybe to prevent a cycle if something tried to reform the multiply? But I don't think its a reliable way to do that. If the user of the expanded multiply is visited by the DAGCombiner after this conversion happens, the DAGCombiner will check its operands, see that they haven't been visited by the DAGCombiner before and it will then add the first node to the worklist. This process will repeat until all the new nodes are visited. So this seems like an unreliable prevention at best. So this patch just returns the new nodes like any other combine. If this starts causing problems we can try to add target specific nodes or something to more directly prevent optimizations. Now that we handle the combine normally, we can combine any negates the mul expansion creates into their users since those will be visited now. llvm-svn: 338007	2018-07-26 05:40:10 +00:00
Craig Topper	370bdd3a0f	[X86] Remove some unnecessary explicit calls to DCI.AddToWorkList. These calls were making sure some newly created nodes were added to worklist, but the DAGCombiner has internal support for ensuring it has visited all nodes. Any time it visits a node it ensures the operands have been queued to be visited as well. This means if we only need to return the last new node. The DAGCombiner will take care of adding its inputs thus walking backwards through all the new nodes. llvm-svn: 337996	2018-07-26 03:20:27 +00:00
Matthias Braun	57dd5b3dea	CodeGen: Cleanup regmask construction; NFC - Avoid duplication of regmask size calculation. - Simplify allocateRegisterMask() call. - Rename allocateRegisterMask() to allocateRegMask() to be consistent with naming in MachineOperand. llvm-svn: 337986	2018-07-26 00:27:47 +00:00
Craig Topper	dc0e8a601d	[X86] Use X86ISD::MUL_IMM instead of ISD::MUL for multiply we intend to be selected to LEA. This prevents other combines from possibly disturbing it. llvm-svn: 337890	2018-07-25 05:33:36 +00:00
Craig Topper	fc501a9223	[X86] Use a shift plus an lea for multiplying by a constant that is a power of 2 plus 2/4/8. The LEA allows us to combine an add and the multiply by 2/4/8 together so we just need a shift for the larger power of 2. llvm-svn: 337875	2018-07-25 01:15:38 +00:00
Craig Topper	5be253d988	[X86] Expand mul by pow2 + 2 using a shift and two adds similar to what we do for pow2 - 2. llvm-svn: 337874	2018-07-25 01:15:35 +00:00
Craig Topper	56c104f104	[X86] Use a two lea sequence for multiply by 37, 41, and 73. These fit a pattern used by 11, 21, and 19. llvm-svn: 337871	2018-07-24 23:44:17 +00:00
Craig Topper	f8fcee70a3	[X86] Change multiply by 26 to use two multiplies by 5 and an add instead of multiply by 3 and 9 and a subtract. Same number of operations, but ending in an add is friendlier due to it being commutable. llvm-svn: 337869	2018-07-24 23:44:12 +00:00
Craig Topper	5ddc0a2b14	[X86] When expanding a multiply by a negative of one less than a power of 2, like 31, don't generate a negate of a subtract that we'll never optimize. We generated a subtract for the power of 2 minus one then negated the result. The negate can be optimized away by swapping the subtract operands, but DAG combine doesn't know how to do that and we don't add any of the new nodes to the worklist anyway. This patch makes use explicitly emit the swapped subtract. llvm-svn: 337858	2018-07-24 21:31:21 +00:00
Craig Topper	6d29891bef	[X86] Generalize the multiply by 30 lowering to generic multipy by power 2 minus 2. Use a left shift and 2 subtracts like we do for 30. Move this out from behind the slow lea check since it doesn't even use an LEA. Use this for multiply by 14 as well. llvm-svn: 337856	2018-07-24 21:15:41 +00:00
Craig Topper	86d6320b94	[X86] Change multiply by 19 to use (9 * X) * 2 + X instead of (5 * X) * 4 - 1. The new lowering can be done in 2 LEAs. The old code took 1 LEA, 1 shift, and 1 sub. llvm-svn: 337851	2018-07-24 20:31:48 +00:00
Craig Topper	b2a626b52e	[X86] Remove the max vector width restriction from combineLoopMAddPattern and rely splitOpsAndApply to handle splitting. This seems to be a net improvement. There's still an issue under avx512f where we have a 512-bit vpaddd, but not vpmaddwd so we end up doing two 256-bit vpmaddwds and inserting the results before a 512-bit vpaddd. It might be better to do two 512-bits paddds with zeros in the upper half. Same number of instructions, but breaks a dependency. llvm-svn: 337656	2018-07-22 19:44:35 +00:00
Benjamin Kramer	64c7fa3201	Revert "[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts" This reverts commit r337547. It triggers an infinite loop. llvm-svn: 337617	2018-07-20 20:59:46 +00:00
Craig Topper	28ac623f6f	[X86] Remove isel patterns for MOVSS/MOVSD ISD opcodes with integer types. Ideally our ISD node types going into the isel table would have types consistent with their instruction domain. This prevents us having to duplicate patterns with different types for the same instruction. Unfortunately, it seems our shuffle combining is currently relying on this a little remove some bitcasts. This seems to enable some switching between shufps and shufd. Hopefully there's some way we can address this in the combining. Differential Revision: https://reviews.llvm.org/D49280 llvm-svn: 337590	2018-07-20 17:57:53 +00:00
Craig Topper	6194ccf8c7	[X86] Remove what appear to be unnecessary uses of DCI.CombineTo CombineTo is most useful when you need to replace multiple results, avoid the worklist management, or you need to something else after the combine, etc. Otherwise you should be able to just return the new node and let DAGCombiner go through its usual worklist code. All of the places changed in this patch look to be standard cases where we should be able to use the more stand behavior of just returning the new node. Differential Revision: https://reviews.llvm.org/D49569 llvm-svn: 337589	2018-07-20 17:57:42 +00:00
Simon Pilgrim	70fcd0f481	[X86][XOP] Fix SUB constant folding for VPSHA/VPSHL shift lowering We can safely use getConstant here as we're still lowering, which allows constant folding to kick in and simplify the vector shift codegen. Noticed while working on D49562. llvm-svn: 337578	2018-07-20 16:55:18 +00:00
Simon Pilgrim	c7132031a2	[X86][SSE] Use SplitOpsAndApply to improve HADD/HSUB lowering Improve AVX1 256-bit vector HADD/HSUB matching by using SplitOpsAndApply to split into 128-bit instructions. llvm-svn: 337568	2018-07-20 16:20:45 +00:00
Simon Pilgrim	a85b86a982	[X86][AVX] Add support for i16 256-bit vector horizontal op redundant shuffle removal llvm-svn: 337566	2018-07-20 15:51:01 +00:00
Simon Pilgrim	7c56bce996	[X86][AVX] Add support for 32/64 bits 256-bit vector horizontal op redundant shuffle removal llvm-svn: 337561	2018-07-20 15:24:12 +00:00
Simon Pilgrim	6fb8b68b2d	[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts This is an early step towards using SimplifyDemandedVectorElts for target shuffle combining - this merely moves the existing X86ISD::VBROADCAST simplification code to use the SimplifyDemandedVectorElts mechanism. Adds X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode to handle X86ISD::VBROADCAST - in time we can support all target shuffles (and other ops) here. llvm-svn: 337547	2018-07-20 13:26:51 +00:00
Simon Pilgrim	1d181bc992	[X86][AVX] Use extract_subvector to reduce vector op widths (PR36761) We have a number of cases where we fail to reduce vector op widths, performing the op in a larger vector and then extracting a subvector. This is often because by default it would create illegal types. This peephole patch attempts to handle a few common cases detailed in PR36761, which typically involved extension+conversion to vX2f64 types. Differential Revision: https://reviews.llvm.org/D49556 llvm-svn: 337500	2018-07-19 21:52:06 +00:00
Craig Topper	9888670c6b	[X86] Fix some 'return SDValue()' after DCI.CombineTo instead return the output of CombineTo Returning SDValue() means nothing was changed. Returning the result of CombineTo returns the first argument of CombineTo. This is specially detected by DAGCombiner as meaning that something changed, but worklist management was already taken care of. I think the only real effect of this change is that we now properly update the Statistic the counts the number of combines performed. That's the only thing between the check for null and the check for N in the DAGCombiner. llvm-svn: 337491	2018-07-19 20:10:44 +00:00
Simon Pilgrim	d4b82da113	[X86][SSE] Canonicalize scalar fp arithmetic shuffle patterns As discussed on PR38197, this canonicalizes MOVS(N0, OP(N0, N1)) --> MOVS(N0, SCALAR_TO_VECTOR(OP(N0[0], N1[0]))) This returns the scalar-fp codegen lost by rL336971. Additionally it handles the OP(N1, N0)) case for commutable (FADD/FMUL) ops. Differential Revision: https://reviews.llvm.org/D49474 llvm-svn: 337419	2018-07-18 19:55:19 +00:00
Simon Pilgrim	3a45369b9e	[X86][SSE] Remove BLENDPD canonicalization from combineTargetShuffle When rL336971 removed the scalar-fp isel patterns, we lost the need for this canonicalization - commutation/folding can handle everything else. llvm-svn: 337387	2018-07-18 13:01:20 +00:00
Craig Topper	1425e10cc6	[X86] Generate v2f64 X86ISD::UNPCKL/UNPCKH instead of X86ISD::MOVLHPS/MOVHLPS for unary v2f64 {0,0} and {1,1} shuffles with SSE2. I'm trying to restrict the MOVLHPS/MOVHLPS ISD nodes to SSE1 only. With SSE2 we can use unpcks. I believe this will allow some patterns to be cleaned up to require fewer bitcasts. I've put in an odd isel hack to still select MOVHLPS instruction from the unpckh node to avoid changing tests and because movhlps is a shorter encoding. Ideally we'd do execution domain switching on this, but the operands are in the wrong order and are tied. We might be able to try a commute in the domain switching using custom code. We already support domain switching for UNPCKLPD and MOVLHPS. llvm-svn: 337348	2018-07-18 05:10:51 +00:00
Craig Topper	07a1787501	[X86] Merge the FR128 and VR128 regclass since they have identical spill and alignment characteristics. This unfortunately requires a bunch of bitcasts to be added added to SUBREG_TO_REG, COPY_TO_REGCLASS, and instructions in output patterns. Otherwise tablegen seems to default to picking f128 and then we fail when something tries to get the register class for f128 which isn't always valid. The test changes are because we were previously mixing fr128 and vr128 due to contrainRegClass finding FR128 first and passes like live range shrinking weren't handling that well. llvm-svn: 337147	2018-07-16 06:56:09 +00:00
Fangrui Song	dcdc9ac7a2	[X86] Correct comment of TEST elimination in BSF/TZCNT llvm-svn: 337052	2018-07-13 21:40:08 +00:00
Fangrui Song	90d9c201dc	[X86] Try fixing r336768 llvm-svn: 337043	2018-07-13 20:54:24 +00:00
Simon Pilgrim	44b89fa900	[X86][SSE] Utilize ZeroableElements for canWidenShuffleElements canWidenShuffleElements can do a better job if given a mask with ZeroableElements info. Apparently, ZeroableElements was being only used to identify AllZero candidates, but possibly we could plug it into more shuffle matchers. Original Patch by Zvi Rackover @zvi Differential Revision: https://reviews.llvm.org/D42044 llvm-svn: 336903	2018-07-12 13:29:41 +00:00
Simon Pilgrim	8a463897e9	[X86][AVX] Use Zeroable mask to improve shuffle mask widening Noticed while updating D42044, lowerV2X128VectorShuffle can improve the shuffle mask with the zeroable data to create a target shuffle mask to recognise more 'zero upper 128' patterns. NOTE: lowerV4X128VectorShuffle could benefit as well but the code needs refactoring first to discriminate between SM_SentinelUndef and SM_SentinelZero for negative shuffle indices. Differential Revision: https://reviews.llvm.org/D49092 llvm-svn: 336900	2018-07-12 13:03:58 +00:00
Craig Topper	73347ec081	[X86] Remove patterns and ISD nodes for the old scalar FMA intrinsic lowering. We now use llvm.fma.f32/f64 or llvm.x86.fmadd.f32/f64 intrinsics that use scalar types rather than vector types. So we don't these special ISD nodes that operate on the lowest element of a vector. llvm-svn: 336883	2018-07-12 03:42:41 +00:00
Craig Topper	034adf2683	[X86] Remove and autoupgrade the scalar fma intrinsics with masking. This converts them to what clang is now using for codegen. Unfortunately, there seem to be a few kinks to work out still. I'll try to address with follow up patches. llvm-svn: 336871	2018-07-12 00:29:56 +00:00
Craig Topper	02867f0fa3	[X86] The TEST instruction is eliminated when BSF/TZCNT is used Summary: These changes cover the PR#31399. Now the ffs(x) function is lowered to (x != 0) ? llvm.cttz(x) + 1 : 0 and it corresponds to the following llvm code: %cnt = tail call i32 @llvm.cttz.i32(i32 %v, i1 true) %tobool = icmp eq i32 %v, 0 %.op = add nuw nsw i32 %cnt, 1 %add = select i1 %tobool, i32 0, i32 %.op and x86 asm code: bsfl %edi, %ecx addl $1, %ecx testl %edi, %edi movl $0, %eax cmovnel %ecx, %eax In this case the 'test' instruction can't be eliminated because the 'add' instruction modifies the EFLAGS, namely, ZF flag that is set by the 'bsf' instruction when 'x' is zero. We now produce the following code: bsfl %edi, %ecx movl $-1, %eax cmovnel %ecx, %eax addl $1, %eax Patch by Ivan Kulagin Reviewers: davide, craig.topper, spatel, RKSimon Reviewed By: craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D48765 llvm-svn: 336768	2018-07-11 06:57:42 +00:00
Craig Topper	dea0b88b04	[X86] Remove X86ISD::MOVLPS and X86ISD::MOVLPD. NFCI These ISD nodes try to select the MOVLPS and MOVLPD instructions which are special load only instructions. They load data and merge it into the lower 64-bits of an XMM register. They are logically equivalent to our MOVSD node plus a load. There was only one place in X86ISelLowering that used MOVLPD and no places that selected MOVLPS. The one place that selected MOVLPD had to choose between it and MOVSD based on whether there was a load. But lowering is too early to tell if the load can really be folded. So in isel we have patterns that use MOVSD for MOVLPD if we can't find a load. We also had patterns that select the MOVLPD instruction for a MOVSD if we can find a load, but didn't choose the MOVLPD ISD opcode for some reason. So it seems better to just standardize on MOVSD ISD opcode and manage MOVSD vs MOVLPD instruction with isel patterns. llvm-svn: 336728	2018-07-10 21:00:22 +00:00
Simon Pilgrim	d32ca2c0b7	[X86][SSE] Prefer BLEND(SHL(v,c1),SHL(v,c2)) over MUL(v, c3) Now that rL336250 has landed, we should prefer 2 immediate shifts + a shuffle blend over performing a multiply. Despite the increase in instructions, this is quicker (especially for slow v4i32 multiplies), avoid loads and constant pool usage. It does mean however that we increase register pressure. The code size will go up a little but by less than what we save on the constant pool data. This patch also adds support for v16i16 to the BLEND(SHIFT(v,c1),SHIFT(v,c2)) combine, and also prevents blending on pre-SSE41 shifts if it would introduce extra blend masks/constant pool usage. Differential Revision: https://reviews.llvm.org/D48936 llvm-svn: 336642	2018-07-10 07:58:33 +00:00
Roman Lebedev	5ccae1750b	[X86][TLI] DAGCombine: Unfold variable bit-clearing mask to two shifts. Summary: This adds a reverse transform for the instcombine canonicalizations that were added in D47980, D47981. As discussed later, that was worse at least for the code size, and potentially for the performance, too. https://rise4fun.com/Alive/Zmpl Reviewers: craig.topper, RKSimon, spatel Reviewed By: spatel Subscribers: reames, llvm-commits Differential Revision: https://reviews.llvm.org/D48768 llvm-svn: 336585	2018-07-09 19:06:42 +00:00
Craig Topper	47170b3153	[X86] In combineFMA, make sure we bitcast the result of isFNEG back the expected type before creating the new FMA node. Previously, we were creating malformed SDNodes, but nothing noticed because the type constraints prevented isel from noticing. llvm-svn: 336566	2018-07-09 17:43:24 +00:00
Craig Topper	b8145ec667	[X86] Improve the message for some asserts. Remove an if that is guaranteed true by said asserts. This replaces some asserts in lowerV2F64VectorShuffle with the similar asserts from lowerVIF64VectorShuffle which are more readable. The original asserts mentioned a blend, but there's no guarantee that it is a blend. Also remove an if that the asserts prove is always true. Mask[0] is always less than 2 and Mask[1] is always at least 2. Therefore (Mask[0] >= 2) + (Mask[1] >= 2) == 1 must wlays be true. llvm-svn: 336517	2018-07-09 01:52:56 +00:00
Craig Topper	9e17073c21	[X86] Enhance combineFMA to look for FNEG behind an EXTRACT_VECTOR_ELT. llvm-svn: 336514	2018-07-08 18:04:00 +00:00
Simon Pilgrim	2eced71ecf	[X86][SSE] Combine v16i8 SHL by constants to multiplies Pre-AVX512 (which can perform a quick extend/shift/truncate), extending to 2 v8i16 for the PMULLW and then truncating is more performant than relying on the generic PBLENDVB vXi8 shift path and uses a similar amount of mask constant pool data. Differential Revision: https://reviews.llvm.org/D48963 llvm-svn: 336513	2018-07-08 12:47:50 +00:00
Simon Pilgrim	23f9eddabe	[SelectionDAG] Split float and integer isKnownNeverZero tests Splits off isKnownNeverZeroFloat to handle +/- 0 float cases. This will make it easier to be more aggressive with the integer isKnownNeverZero tests (similar to ValueTracking), use computeKnownBits etc. Differential Revision: https://reviews.llvm.org/D48969 llvm-svn: 336492	2018-07-07 18:17:14 +00:00
Craig Topper	2c27e33a58	[X86] Merge INTR_TYPE_3OP_RM with INTR_TYPE_3OP. Remove unused INTR_TYPE_1OP_RM. llvm-svn: 336476	2018-07-07 01:04:22 +00:00
Vedant Kumar	b3091da3af	Use Type::isIntOrPtrTy where possible, NFC It's a bit neater to write T.isIntOrPtrTy() over `T.isIntegerTy() \|\| T.isPointerTy()`. I used Python's re.sub with this regex to update users: r'([\w.\->()]+)isIntegerTy\s\\|\\|\s\1isPointerTy' llvm-svn: 336462	2018-07-06 20:17:42 +00:00
Craig Topper	c60e1807b3	[X86] Remove FMA4 scalar intrinsics. Use llvm.fma intrinsic instead. The intrinsics can be implemented with a f32/f64 llvm.fma intrinsic and an insert into a zero vector. There are a couple regressions here due to SelectionDAG not being able to pull an fneg through an extract_vector_elt. I'm not super worried about this though as InstCombine should be able to do it before we get to SelectionDAG. llvm-svn: 336416	2018-07-06 07:14:41 +00:00
Craig Topper	7b35585ff1	[X86] Remove all of the avx512 masked packed fma intrinsics. Use llvm.fma or unmasked 512-bit intrinsics with rounding mode. This upgrades all of the intrinsics to use fneg instructions to convert fma into fmsub/fnmsub/fnmadd/fmsubadd. And uses a select instruction for masking. This matches how clang uses the intrinsics these days. llvm-svn: 336409	2018-07-06 03:42:09 +00:00
Craig Topper	4fe321d1ce	[X86] Add SHUF128 to target shuffle decoding. Differential Revision: https://reviews.llvm.org/D48954 llvm-svn: 336376	2018-07-05 17:10:17 +00:00
Craig Topper	95eb88abfe	[X86] Add support for combining FMSUB/FNMADD/FNMSUB ISD nodes with an fneg input. Previously we could only negate the FMADD opcodes. This used to be mostly ok when we lowered FMA intrinsics during lowering. But with the move to llvm.fma from target specific intrinsics, we can combine (fneg (fma)) to (fmsub) earlier. So if we start with (fneg (fma (fneg))) we would get stuck at (fmsub (fneg)). This patch fixes that so we can also combine things like (fmsub (fneg)). llvm-svn: 336304	2018-07-05 02:52:56 +00:00
Simon Pilgrim	c3e1617bf9	[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values (REAPPLIED) We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case. Reapplied with a fixed (extra null tests) version of rL336113 after reversion in rL336189 - extra test case added at rL336247. llvm-svn: 336250	2018-07-04 09:12:48 +00:00
Craig Topper	e317533dcf	[X86] Remove repeated 'the' from multiple comments that have been copy and pasted. NFC llvm-svn: 336226	2018-07-03 20:39:55 +00:00
Benjamin Kramer	fd171f2f89	Revert "[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values" This reverts commit r336113. It causes crashes. llvm-svn: 336189	2018-07-03 11:15:17 +00:00
Simon Pilgrim	2bc8e079f2	[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case. llvm-svn: 336113	2018-07-02 15:14:07 +00:00
Craig Topper	1b7b9b8596	[X86] Use MVT::i8 for scalar shift amounts since that is what they ultimately need to legalize to. I believe all of these are constants so legalizing them should be pretty trivial, but this saves a step. In one case it looks like we may have been creating a shift amount larger than the shift input itself. llvm-svn: 336052	2018-06-30 18:30:31 +00:00
Craig Topper	5f28d50d27	[X86] When combining load to BZHI, make sure we create the shift instruction with an i8 type. This combine runs pretty late and causes us to introduce a shift after the op legalization phase has run. We need to be sure we create the shift with the proper type for the shift amount. If we don't do this, we will still re-legalize the operation properly, but we won't get a chance to fully optimize the truncate that gets inserted. So this patch adds the necessary truncate when the shift is created. I've also narrowed the subtract that gets created to always be an i32 type. The truncate would have trigered SimplifyDemandedBits to optimize it anyway. But using a more appropriate VT here is free and saves an optimization step. llvm-svn: 336051	2018-06-30 17:49:42 +00:00
Craig Topper	59f2f38fe0	[X86] Remove masking from avx512 rotate intrinsics. Use select in IR instead. llvm-svn: 336035	2018-06-30 01:32:04 +00:00
Craig Topper	87b107dd69	[X86] Limit the number of target specific nodes emitted in LowerShiftParts The important part is the creation of the SHLD/SHRD nodes. The compare and the conditional move can use target independent nodes that can be legalized on their own. This gives some opportunities to trigger the optimizations present in the lowering for those things. And its just better to limit the number of places we emit target specific nodes. The changed test cases still aren't optimal. Differential Revision: https://reviews.llvm.org/D48619 llvm-svn: 335998	2018-06-29 17:24:07 +00:00
Simon Pilgrim	aab8660e23	[X86][SSE] Support v16i8/v32i8 vector rotations This uses the same technique as for shifts - split the rotation into 4/2/1-bit partial rotations and select those partials based on the amount bit, making use of PBLENDVB if available. This halves the use of PBLENDVB compared to expanding to shifts, which can be a slow op. Unfortunately I haven't found a decent way to share much of this code with the shift equivalent. Differential Revision: https://reviews.llvm.org/D48655 llvm-svn: 335957	2018-06-29 09:36:39 +00:00
Craig Topper	875e9f8fa4	[X86] Remove masking from the avx512 packed sqrt intrinsics. Use select in IR instead. While there improve the coverage of the intrinsic testing and add fast-isel tests. llvm-svn: 335944	2018-06-29 05:43:26 +00:00
Simon Pilgrim	aa2bf2be31	[TargetLowering] isVectorClearMaskLegal - use ArrayRef<int> instead of const SmallVectorImpl<int>& This is more generic and matches isShuffleMaskLegal. Differential Revision: https://reviews.llvm.org/D48591 llvm-svn: 335605	2018-06-26 14:15:31 +00:00
Simon Pilgrim	bfaa09220b	[X86] Just use ArrayRef instead of SmallVectorImpl in a few static method arguments. NFCI. llvm-svn: 335590	2018-06-26 10:45:41 +00:00
Craig Topper	08dae1682d	[X86] Don't use getScalarShiftAmountTy to get the immediate type for target specific VSHLDQ/VSRLDQ nodes. These opcodes have a fixed type of i8 for their immediate and shouldn't have anything to do with the scalar shift amount used by target independent shift nodes. llvm-svn: 335578	2018-06-26 04:53:42 +00:00
Craig Topper	689e363ff2	[X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction. This recommits r335562 and 335563 as a single commit. The frontend will surround the intrinsic with the appropriate marshalling to/from a scalar type to match the sigature of the builtin that software expects. By exposing the vXi1 type directly in the llvm intrinsic we make it available to optimizers much earlier. This can enable the scalar marshalling code to be optimized away. llvm-svn: 335568	2018-06-26 01:37:02 +00:00
Craig Topper	6f4fdfa9af	Revert r335562 and 335563 "[X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction." These were supposed to have been squashed to a single commit. llvm-svn: 335566	2018-06-26 01:31:53 +00:00
Craig Topper	9b4322ce31	foo llvm-svn: 335562	2018-06-26 00:43:34 +00:00
Craig Topper	3b18bdc46d	[X86] Simplify some code by using isOneConstant. NFC llvm-svn: 335437	2018-06-25 01:01:47 +00:00
Craig Topper	4331d6218d	[X86] Remove the changes to combineScalarToVector made in r335037. They appear to be untested other than the test case for p37879.ll and I believe we should be using SimplifyDemandedElts here to handle these cases. llvm-svn: 335436	2018-06-25 00:21:53 +00:00
Mikhail Dvoretckii	0963562083	[X86] Changing the check for valid inputs in combineScalarToVector Changing the logic of scalar mask folding to check for valid input types rather than against invalid ones, making it more robust and fixing PR37879. Differential Revision: https://reviews.llvm.org/D48366 llvm-svn: 335323	2018-06-22 08:28:05 +00:00

1 2 3 4 5 ...

5587 Commits