Commit Graph

5955 Commits

Author SHA1 Message Date
Craig Topper 006bac6880 [X86] Return false from hasAndNotCompare if the comparision value is a constant.
We won't end up using an ANDN instruction in this case so we should generate the same code we do for pre-BMI targets.

llvm-svn: 350018
2018-12-23 05:52:55 +00:00
Craig Topper 3cc92a28ce [X86] Fix an old FIXME about folding the zero constant into the OR instruction we use for sequentially consistent fence in 32-bit mode without SSE2.
llvm-svn: 350013
2018-12-23 01:54:43 +00:00
Sanjay Patel 80187b8a17 [x86] add movddup specialization for build vector lowering (PR37502)
This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right. 
In the motivating example, note that we have movddup before SSE4.1 and 
again with AVX2. That's because insertps isn't available pre-SSE41 and 
vbroadcast is (more generally) available with AVX2 (and the splat is 
reduced to movddup via isel pattern).

Differential Revision: https://reviews.llvm.org/D55898

llvm-svn: 349937
2018-12-21 18:48:32 +00:00
Simon Pilgrim 57733507fe [X86] Always use the version of computeKnownBits that returns a value. NFCI.
Continues the work started by @bogner in rL340594 to remove uses of the old KnownBits output paramater version.

llvm-svn: 349902
2018-12-21 14:25:14 +00:00
Simon Pilgrim 09c081176a [X86][AVX512] Don't custom lower v16i8 rotations.
As discussed on D55747, the expansion to (wider) shifts is better on all AVX512 cases, not just BWI.

llvm-svn: 349763
2018-12-20 14:38:35 +00:00
Craig Topper 9ca2f5605e [X86] Disable custom widening of signed/unsigned add/sub saturation intrinsics under -x86-experimental-vector-widening-legalization.
Generic legalization should take care of this.

llvm-svn: 349714
2018-12-20 01:32:06 +00:00
Craig Topper 217b3b20d8 [X86] Remove TLI variable from ReplaceNodeResults. NFC
We're already in X86TargetLowering which is a derived class of TargetLowering. We can just call methods directly.

llvm-svn: 349695
2018-12-19 23:13:03 +00:00
Craig Topper d16da2b479 [X86] Remove a bunch of 'else' after returns in reduceVMULWidth. NFC
This reduces indentation and makes it obvious this function always returns something.

llvm-svn: 349671
2018-12-19 19:39:34 +00:00
Craig Topper 8434ef7d1e [X86] Don't use SplitOpsAndApply to create ISD::UADDSAT/ISD::USUBSAT nodes. Let type legalization and op legalization deal with it.
Now that we've switched to target independent nodes we can rely on generic infrastructure to do the legalization for us.

llvm-svn: 349526
2018-12-18 19:29:08 +00:00
Nikita Popov f6058ff140 [X86] Use SADDSAT/SSUBSAT instead of ADDS/SUBS
Migrate the X86 backend from X86ISD opcodes ADDS and SUBS to generic
ISD opcodes SADDSAT and SSUBSAT. This also improves scodegen for
@llvm.sadd.sat() and @llvm.ssub.sat() intrinsics.

This is a followup to D55787 and part of PR40056.

Differential Revision: https://reviews.llvm.org/D55833

llvm-svn: 349520
2018-12-18 18:28:22 +00:00
Craig Topper 20a6db5a84 [X86] Create PSUBUS from (add (umax X, C), -C)
InstCombine seems to canonicalize or PSUB patter into a max with the cosntant and an add with an inverse of the constant.

This patch recognizes this pattern and turns it into PSUBUS. Future work could improve undef element handling.

Fixes some of PR40053

Differential Revision: https://reviews.llvm.org/D55780

llvm-svn: 349519
2018-12-18 18:26:25 +00:00
Simon Pilgrim 1411917431 [X86][SSE] Don't use 'sign bit select' vXi8 ROTL lowering for constant rotation amounts
Noticed by @spatel on D55747 - we get much better codegen if we use the regular shift expansion.

llvm-svn: 349510
2018-12-18 17:31:11 +00:00
Simon Pilgrim e9effe9744 [X86][SSE] Don't use 'sign bit select' vXi8 ROTL lowering for splat rotation amounts
Noticed by @spatel on D55747 - we get much better codegen if we use the regular shift expansion.

llvm-svn: 349500
2018-12-18 16:02:23 +00:00
Nikita Popov 665ab08178 [X86] Use UADDSAT/USUBSAT instead of ADDUS/SUBUS
Replace the X86ISD opcodes ADDUS and SUBUS with generic ISD opcodes
UADDSAT and USUBSAT. As a side-effect, this also makes codegen for
the @llvm.uadd.sat and @llvm.usub.sat intrinsics reasonable.

This only replaces use in the X86 backend, and does not move any of
the ADDUS/SUBUS X86 specific combines into generic codegen.

Differential Revision: https://reviews.llvm.org/D55787

llvm-svn: 349481
2018-12-18 13:23:03 +00:00
Simon Pilgrim 8488a44c34 [X86][SSE] Move VSRAI sign extend in reg fold into SimplifyDemandedBits
(VSRAI (VSHLI X, C1), C1) --> X iff NumSignBits(X) > C1

This works better as part of SimplifyDemandedBits than part of the general combine.

llvm-svn: 349462
2018-12-18 09:11:34 +00:00
Simon Pilgrim 26c630f416 [X86][SSE] Replace (VSRLI (VSRAI X, Y), 31) -> (VSRLI X, 31) fold.
This fold was incredibly specific - replace with a SimplifyDemandedBits fold to remove a VSRAI if only the original sign bit is demanded (its guaranteed to stay the same).

Test change is merely a rescheduling.

llvm-svn: 349459
2018-12-18 08:55:47 +00:00
Simon Pilgrim 7e2975a44c [X86][SSE] Improve immediate vector shift known bits handling.
Convert VSRAI to VSRLI is the sign bit is known zero and improve KnownBits output for all shift instruction.

Fixes the poor codegen comments in D55768.

llvm-svn: 349407
2018-12-17 22:09:47 +00:00
Simon Pilgrim 6b5e0b7b2b [X86][SSE] Split SimplifyDemandedBitsForTargetNode X86ISD::VSRLI/VSRAI handling.
First step towards adding more capable combines to fix comments in D55768.

llvm-svn: 349400
2018-12-17 21:36:17 +00:00
Craig Topper 728cbc0378 Convert (CMP (srl/shl X, C), 0) to (CMP (and X, C'), 0) when only the zero flag is used.
This allows a TEST to be used and can be combined with any AND that may already exist as an input to the shift.

This was already done in EmitTest, but was easily tricked by multiple uses because the setcc might be used by multiple instructions. Once the SETCC and users are legalized then we can look for the shift to be used by a single CMP, but the CMP itself can have multiple users.

This appears to fix the case in PR39968.

llvm-svn: 349385
2018-12-17 20:02:16 +00:00
Simon Pilgrim 9274f17a5e [TargetLowering] Add DemandedElts mask to SimplifyDemandedBits (PR40000)
This is an initial patch to add the necessary support for a DemandedElts argument to SimplifyDemandedBits, more closely matching computeKnownBits and to help improve vector codegen.

I've added only a small amount of the changes necessary to get at least one test to update - a lot more can be done but I'd like to add these methodically with proper test coverage, at the same time the hope is to slowly move some/all of SimplifyDemandedVectorElts into SimplifyDemandedBits as well.

Differential Revision: https://reviews.llvm.org/D55768

llvm-svn: 349374
2018-12-17 18:43:43 +00:00
Craig Topper fa4907d671 [X86] Fix bad operand lookup for cmov introduced in r349315
The CC is operand 2 not operand 3.

llvm-svn: 349330
2018-12-17 06:40:35 +00:00
Simon Pilgrim d0c9e43b1c [X86] Pull out constant splat rotation detection.
We had 3 different approaches - consistently use getTargetConstantBitsFromNode and allow undef elts.

llvm-svn: 349319
2018-12-16 19:46:04 +00:00
Craig Topper 10f8892837 [X86] Remove truncation handling from EmitTest. Replace it with a DAG combine.
I'd like to try to move a lot of the flag matching out of EmitTest and push it to isel or isel preprocessing. This is a step towards that.

The test-shrink-bug.ll changie is an improvement because we are no longer interfering with test shrink handling in isel.

The pr34137.ll change is a regression, but the IR came from -O0 and was not reduced by InstCombine. So it contains a lot of redundancies like duplicate loads that made it combine poorly.

llvm-svn: 349315
2018-12-16 18:35:55 +00:00
Sanjay Patel 13ac2f15b0 [x86] increment/decrement constant vector with min/max in vsetcc lowering (PR39859)
This is part of fixing PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859

We have a crippled vector ISA, so we have to invert a typical fold and create min/max here.

As discussed in the bug report, we can probably do better by using saturating subtract when 
it's available, but we should have this improvement for the min/max patterns regardless.

Alive proofs:
https://rise4fun.com/Alive/zsf
https://rise4fun.com/Alive/Qrl

Differential Revision: https://reviews.llvm.org/D55515

llvm-svn: 349304
2018-12-16 15:05:48 +00:00
Simon Pilgrim 52c982406e [X86] Begin cleaning up combineOr -> SHLD/SHRD. NFCI.
In preparation for converting to funnel shifts.

llvm-svn: 349286
2018-12-15 21:11:49 +00:00
Simon Pilgrim ef7b5949e5 [X86] Lower to SHLD/SHRD on slow machines for optsize
Use consistent rules for when to lower to SHLD/SHRD for slow machines - fixes a weird issue where funnel shift gets expanded but then X86ISelLowering's combineOr sees the optsize and combines to SHLD/SHRD, but now with the modulo amount guard......

llvm-svn: 349285
2018-12-15 19:43:44 +00:00
Craig Topper 257ce3871e [DAGCombiner][X86] Prevent visitSIGN_EXTEND from returning N when (sext (setcc)) already has the target desired type for the setcc
Summary:
If the setcc already has the target desired type we can reach the getSetCC/getSExtOrTrunc after the MatchingVecType check with the exact same types as the nodes we started with. This causes those causes VsetCC to be CSEd to N0 and the getSExtOrTrunc will CSE to N. When we return N, the caller will think that meant we called CombineTo and did our own worklist management. But that's not what happened. This prevents target hooks from being called for the node.

To fix this, I've now returned SDValue if the setcc is already the desired type. But to avoid some regressions in X86 I've had to disable one of the target combines that wasn't being reached before in the case of a (sext (setcc)). If we get vector widening legalization enabled that entire function will be deleted anyway so hopefully this is only for the short term.

Reviewers: RKSimon, spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D55459

llvm-svn: 349137
2018-12-14 08:28:24 +00:00
Craig Topper 178abc59ac [X86] Demote EmitTest to a helper function of EmitCmp. Route all callers except EmitCmp through EmitCmp.
This requires the two callers to manifest a 0 to make EmitCmp call EmitTest.

I'm looking into changing how we combine TEST and flag setting instructions to not be part of lowering. And instead be part of DAG combine or isel. Which will mean EmitTest will probably become gutted and maybe disappear entirely.

llvm-svn: 349094
2018-12-13 23:55:30 +00:00
Simon Pilgrim b5aaa673c6 [X86][SSE] Add SSE vector imm/var shift support to SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 349057
2018-12-13 16:39:29 +00:00
Simon Pilgrim b0b2f1503a [X86][SSE] Fix all remaining modulo vector rotation amounts (PR38243)
There's still a couple of minor SimplifyDemandedElts regressions in some of the shift amount splats that will be fixed in future patches.

llvm-svn: 349052
2018-12-13 15:50:31 +00:00
Simon Pilgrim ba91ff4a86 [X86][SSE] Fix modulo rotation amounts for v8i16/v16i16/v4i32 (PR38243)
llvm-svn: 349047
2018-12-13 15:23:09 +00:00
Simon Pilgrim 7c84f7ae3a [X86][SSE] Merge the vXi16/vXi32 vector rotation expansion cases. NFCI.
Merged the repeated code into a single if().

llvm-svn: 349040
2018-12-13 14:51:28 +00:00
Simon Pilgrim 320fd7383f [X86][BWI] Don't custom lower vXi8 rotations.
We always expand to shifts anyhow - test changes are just different scheduling only.

llvm-svn: 349034
2018-12-13 13:44:33 +00:00
Simon Pilgrim ab973a45b9 [DAGCombine] Moved X86 rotate_amount % bitwidth == 0 early out to DAGCombiner
Remove common code from custom lowering (code is still safe if somehow a zero value gets used).

llvm-svn: 349028
2018-12-13 12:23:32 +00:00
Simon Pilgrim 77fc551d1a [TargetLowering] Add ISD::ROTL/ROTR vector expansion
Move existing rotation expansion code into TargetLowering and set it up for vectors as well.

Ideally this would share more of the funnel shift expansion, but we handle the shift amount modulo quite differently at the moment.

Begun removing x86 vector rotate custom lowering to use the expansion.

llvm-svn: 349025
2018-12-13 11:20:48 +00:00
Craig Topper a048d58de7 [X86] Remove assert leftover from when i1 was a legal type. Add more accurate assert. NFC
llvm-svn: 349007
2018-12-13 06:14:25 +00:00
Craig Topper 4937adf75f [X86] Emit SBB instead of SETCC_CARRY from LowerSELECT. Break false dependency on the SBB input.
I'm hoping we can just replace SETCC_CARRY with SBB. This is another step towards that.

I've explicitly used zero as the input to the setcc to avoid a false dependency that we've had with the SETCC_CARRY. I changed one of the patterns that used NEG to instead use an explicit compare with 0 on the LHS. We needed the zero anyway to avoid the false dependency. The negate would clobber its input register. By using a CMP we can avoid that which could be useful.

Differential Revision: https://reviews.llvm.org/D55414

llvm-svn: 348959
2018-12-12 19:20:21 +00:00
Simon Pilgrim eb508f8ccb [SelectionDAG] Add a generic isSplatValue function
This patch introduces a generic function to determine whether a given vector type is known to be a splat value for the specified demanded elements, recursing up the DAG looking for BUILD_VECTOR or VECTOR_SHUFFLE splat patterns.

It also keeps track of the elements that are known to be UNDEF - it returns true if all the demanded elements are UNDEF (as this may be useful under some circumstances), so this needs to be handled by the caller.

A wrapper variant is also provided that doesn't take the DemandedElts or UndefElts arguments for cases where we just want to know if the SDValue is a splat or not (with/without UNDEFS).

I had hoped to completely remove the X86 local version of this function, but I'm seeing some regressions in shift/rotate codegen that will take a little longer to fix and I hope to get this in sooner so I can continue work on PR38243 which needs more capable splat detection.

Differential Revision: https://reviews.llvm.org/D55426

llvm-svn: 348953
2018-12-12 18:32:29 +00:00
Craig Topper 1fe466689b [X86] Combine vpmovdw+vpacksswb into vpmovdb.
This is similar to the combine we already have for vpmovdw+vpackuswb.

llvm-svn: 348910
2018-12-12 05:56:01 +00:00
Sanjay Patel 134f56e702 [x86] fix formatting; NFC
This should really be generalized to allow increment and/or
we should replace it by using ISD::matchUnaryPredicate().
See D55515 for context.

llvm-svn: 348776
2018-12-10 17:23:44 +00:00
Craig Topper 2b09d17d93 [X86] If the carry input to an addcarry/subborrow intrinsic is known to be 0, emit a flag setting ADD/SUB instead of ADC/SBB.
Previously we had to take the carry in and add -1 to it to set the carry flag so we could use it with ADC/SBB. But if we know its 0 then we don't need to bother.

This should go a long way towards fixing PR24545.

llvm-svn: 348727
2018-12-09 18:02:37 +00:00
Craig Topper 2c7a9476e0 [X86] Directly create ADC/SBB nodes instead of using ADD/SUB with (and SETCC_CARRY, 1)
This addresses a FIXME and avoids depending on an isel pattern match I think. I've remove the isel patterns too since he have no lit tests left that cover them. Hopefully that really means they are unused.

I'm trying to decide if we need SETCC_CARRY. This removes one of its usages.

Differential Revision: https://reviews.llvm.org/D55355

llvm-svn: 348536
2018-12-06 22:26:59 +00:00
Simon Pilgrim bb650daeaf [X86] Refactored IsSplatVector to use switch. NFCI.
Initial step towards making the function more generic (and probably move into SelectionDAG).

This is necessary to avoid massive codegen bloat for PR38243 (Add modulo rotate support to LowerRotate).

llvm-svn: 348498
2018-12-06 16:29:14 +00:00
Craig Topper 6a6d77b851 [X86] Remove some leftover code for handling an i1 setcc type. NFC
We should only need to handle i8 now.

llvm-svn: 348460
2018-12-06 07:00:02 +00:00
Simon Pilgrim 32483668d7 [X86][SSE] Begun adding modulo rotate support to LowerRotate
Prep work for PR38243 - mainly adding comments on where we need to add modulo support (doing so at the moment causes massive codegen regressions).

I've also consistently added support for modulo folding for uniform constants (although at the moment we have no way to trigger this) and removed the old assertions.

llvm-svn: 348366
2018-12-05 14:46:37 +00:00
Simon Pilgrim 180639afe5 [SelectionDAG] Initial support for FSHL/FSHR funnel shift opcodes (PR39467)
This is an initial patch to add a minimum level of support for funnel shifts to the SelectionDAG and to begin wiring it up to the X86 SHLD/SHRD instructions.

Some partial legalization code has been added to handle the case for 'SlowSHLD' where we want to expand instead and I've added a few DAG combines so we don't get regressions from the existing DAG builder expansion code.

Differential Revision: https://reviews.llvm.org/D54698

llvm-svn: 348353
2018-12-05 11:12:12 +00:00
Nirav Dave ce26c27b2a [SelectionDAG] Redefine isGAPlusOffset in terms of unwrapAddress. NFCI.
llvm-svn: 348288
2018-12-04 17:59:43 +00:00
Simon Pilgrim 07843640d5 [X86][SSE] Add SimplifyDemandedBitsForTargetNode handling for MOVMSK
Moves existing SimplifyDemandedBits call out of combineMOVMSK and add SimplifyDemandedVectorElts call based on the sign bits we need.

llvm-svn: 348282
2018-12-04 16:52:32 +00:00
Simon Pilgrim b1d6db7693 [X86] Remove unnecessary peekThroughEXTRACT_SUBVECTORs call.
The GetSplatValue/IsSplatVector call will call this anyhow and the later code is just for a v2i64 type so doesn't need it.

llvm-svn: 348253
2018-12-04 12:21:43 +00:00
Simon Pilgrim 0add090e24 [TargetLowering] expandFP_TO_UINT - avoid FPE due to out of range conversion (PR17686)
PR17686 demonstrates that for some targets FP exceptions can fire in cases where the FP_TO_UINT is expanded using a FP_TO_SINT instruction.

The existing code converts both the inrange and outofrange cases using FP_TO_SINT and then selects the result, this patch changes this for 'strict' cases to pre-select the FP_TO_SINT input and the offset adjustment.

The X87 cases don't need the strict flag but generates much nicer code with it....

Differential Revision: https://reviews.llvm.org/D53794

llvm-svn: 348251
2018-12-04 11:21:30 +00:00
Craig Topper 35585aff34 [X86] Remove custom DAG combine for SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG.
We only needed this because it provided really aggressive constant folding even through constant pool entries created from build_vectors. The main case was for vXi8 MULH legalization which was happening as part of legalize DAG instead of as part of legalize vector ops. Now its part of vector op legalization and we've added special handling for build vectors of all constants there. This has removed the need for this code on the list tests we have.

llvm-svn: 348237
2018-12-04 04:51:07 +00:00
Sanjay Patel d24f63477d [DAGCombiner] narrow truncated vector binops when legal
This is the smallest vector enhancement I could find to D54640.
Here, we're allowing narrowing to only legal vector ops because we'll see
regressions without that. All of the test diffs are wins from what I can tell.
With AVX/AVX512, we can shrink ymm/zmm ops to xmm.

x86 vector multiplies are the problem case that we're avoiding due to the
patchwork ISA, and it's not clear to me if we can dance around those
regressions using TLI hooks or if we need preliminary patches to plug those
holes.

Differential Revision: https://reviews.llvm.org/D55126

llvm-svn: 348195
2018-12-03 21:57:35 +00:00
Craig Topper 5440b63fa8 [X86] Teach LowerMUL/LowerMULH for vXi8 to unpack constant RHS.
Summary:
We need to unpackl and unpackh the operands to use two vXi16 multiplies. Previously it looks like the low unpack would get constant folded at least in the 128-bit case after shuffle lowering turned the unpackl into ZERO_EXTEND_VECTOR_INREG and X86 custom DAG combined it. The same doesn't happen for the high half. So we'd load a constant and then shuffle it. But the low half would just be loaded and used by the multiply directly.

After this patch we now end up with a constant pool entry for the low and high unpacks separately with no shuffle operations.

This is a step towards removing custom constant folding for ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG in the X86 backend.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D55165

llvm-svn: 348159
2018-12-03 18:26:27 +00:00
Craig Topper e35b01f8ea [X86] Add DAG combine to combine a v8i32->v8i16 truncate with a packuswb that truncates v8i16->v8i8.
Summary:
Under -x86-experimental-vector-widening-legalization, fp_to_uint/fp_to_sint with a smaller than 128 bit vector type results are custom type legalized by promoting the result to a 128 bit vector by promoting the elements, inserting an assertzext/assertsext, then truncating back to original type. The truncate will be further legalizdd to a pack shuffle. In the case of a v8i8 result type, we'll end up with a v8i16 fp_to_sint. This will need to be further legalized during vector op legalization by promoting to v8i32 and then truncating again. Under avx2 this produces good code with two pack instructions, but Under avx512 this will result in a truncate instruction and a packuswb instruction. But we should be able to get away with a single truncate instruction.

The other option is to promote all the way to vXi32 result type during the first type legalization. But in some experimentation that seemed to require more work to produce good code for other configurations.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54836

llvm-svn: 348158
2018-12-03 18:26:24 +00:00
Craig Topper 959b415e2f [X86] Add a DAG combine to turn stores of vXi1 on pre-avx512 targets into a bitcast and a store of a iX scalar.
llvm-svn: 348104
2018-12-02 19:47:14 +00:00
Craig Topper 6f54ff57fd [X86] Fix bad comment. NFC
llvm-svn: 348103
2018-12-02 19:47:13 +00:00
Craig Topper 204e4110e0 [X86] Simplify LowerBITCAST code for v2i32/v4i16/v8i8/i64->mmx/i64/f64 bitcast.
Previously this code generated its own extracts and build_vector. But we can use a simpler concat_vectors or scalar_to_vector operation and let type legalization do additional legalization of those operations.

llvm-svn: 348087
2018-12-02 07:52:39 +00:00
Craig Topper 4bb077910a [X86] Add custom type legalization for v2i32/v4i16/v8i8->mmx bitcasts to avoid a store/load to/from the stack.
Widen the input to a 128 bit vector by padding with undef elements. Then use a movdq2q to convert from xmm register to mmx register.

llvm-svn: 348086
2018-12-02 05:46:50 +00:00
Craig Topper ec096a1dae [X86] Custom type legalize v2i32/v4i16/v8i8->i64 bitcasts in 64-bit mode similar to what's done when the destination is f64.
The generic legalizer will fall back to a stack spill that uses a truncating store. That store will get expanded into a shuffle and non-truncating store on pre-avx512 targets. Once that happens the stack store/load pair will be combined away leaving behind the shuffle and bitcasts. On avx512 targets the truncating store is legal so doesn't get folded away.

By custom legalizing it we can avoid this churn and maybe produce better code.

llvm-svn: 348085
2018-12-02 05:46:48 +00:00
Craig Topper f4b13927e7 [X86] Don't use zero_extend_vector_inreg for mulhu lowering with sse 4.1
Summary: With sse4.1 we use two zero_extend_vector_inreg and a pshufd to expand the v16i8 input into two v8i16 vectors for the multiply. That's 3 shuffles to extend one operand. The other operand is usually constant as this is mostly used by division by constant optimization. Pre sse4.1 we use a punpckhbw and a punpcklbw with a zero vector. That's two shuffles and an xor and a copy due to tied register constraints. That seems maybe better than the 3 shuffles. With AVX we avoid the copy so that's obviously better.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D55138

llvm-svn: 348079
2018-12-01 19:26:31 +00:00
Craig Topper 4d80f199e8 [X86] Change vXi8 MULHU lowering to unpack high and low half of lanes instead of extracting and concating low and high half registers.
This reduces the number of shuffle operations that need to be done. The splitting strategy requires the shuffle unit for the extraction and the extension. With the unpack strategy the unpacks accomplish a splitting and extending in one operation.

llvm-svn: 348019
2018-11-30 18:43:18 +00:00
Craig Topper 8191307d09 [X86] Prefer lowerVectorShuffleAsBitMask over using a avx512 masked operation when avx512bw/avx512vl is enabled.
This does require a constant pool load instead of loading an immediate into a gpr, moving to a k register and masking. But its less instructions and more consistent with previous ISAs. It probably opens up more combine opportunities as one of the test cases demonstrates.

llvm-svn: 348018
2018-11-30 18:43:15 +00:00
Craig Topper a2133061c0 [X86] Emit PACKUS directly from the v16i8 LowerMULH code instead of using a shuffle.
llvm-svn: 347967
2018-11-30 08:32:05 +00:00
Craig Topper 6e4b266a0d [X86] Change the pre-sse4.1 code in the v16i8 MULHU lowering to be what we get after DAG combine cleans it up.
Previously we emitted a punpcklbw/punpckhbw to move the byte elements into the upper half of 16 bit elements then shifted right by 8 to zero the upper bits. After DAG combine we end up with punpcklbw/punpckhbw into the lower bits with zeros in the uppers bits and no shifts. So just emit that directly.

llvm-svn: 347966
2018-11-30 08:32:01 +00:00
Craig Topper 0850e8a6b6 [X86] Fix a couple types in SimplifyDemandedVectorEltsForTargetNode. NFCI
We had a EVT variable capturing the result of getSimpleValueType which returns an MVT. Another place using EVT that could have been MVT. And an 'int' that should be 'unsigned'.

llvm-svn: 347959
2018-11-30 06:23:55 +00:00
Craig Topper 73c1d75d58 [X86] Change the pre-type legalization DAG combine added in r347898 into a custom type legalization operation instead.
This seems to produce the same results on the tests we have.

llvm-svn: 347912
2018-11-29 20:18:58 +00:00
Craig Topper 129d529ab3 [SelectionDAG][AArch64][X86] Move legalization of vector MULHS/MULHU from LegalizeDAG to LegalizeVectorOps
I believe we should be legalizing these with the rest of vector binary operations. If any custom lowering is required for these nodes, this will give the DAG combine between LegalizeVectorOps and LegalizeDAG to run on the custom code before constant build_vectors are lowered in LegalizeDAG.

I've moved MULHU/MULHS handling in AArch64 from Lowering to isel. Moving the lowering earlier caused build_vector+extract_subvector simplifications to kick in which made the generated code worse.

Differential Revision: https://reviews.llvm.org/D54276

llvm-svn: 347902
2018-11-29 19:36:17 +00:00
Craig Topper 6cd0b17078 [X86] Add a DAG combine pre type legalization to widen division by constant splat on narrow vectors to avoid scalarization
This is another patch for -x86-experimental-vector-widening. This pre widens narrow division by constants so that we can get pass the legal type check in the generic DAG combiner. Otherwise we end up scalarizing.

I've restricted this to splats for now because it was easy to just call DAG.getConstant. Not sure what we should do for non-splat? Increase the element size?Widen the constant vector by padding with 1?

Differential Revision: https://reviews.llvm.org/D54919

llvm-svn: 347898
2018-11-29 19:13:38 +00:00
Craig Topper c2540995ed [X86] Correct comment. NFC
llvm-svn: 347835
2018-11-29 05:56:03 +00:00
Sanjay Patel 2de209313e [x86] try select simplification for target-specific nodes
This failed to select (which might be a separate bug) in
X86ISelDAGToDAG because we try to create a select node
that can be simplified away after rL347227.

This change avoids the problem by simplifying the SHRUNKBLEND
node sooner. In the test case, we manage to realize that the
true/false values of the select (SHRUNKBLEND) are the same thing,
so it simplifies away completely.

llvm-svn: 347818
2018-11-28 22:51:04 +00:00
Craig Topper f3b6f583e2 [X86] Add a combine for back to back VSRAI instructions
Expansion of SIGN_EXTEND_INREG can create a VSRAI instruction. If there is already a VSRAI after it, we should combine them into a larger VSRAI

Differential Revision: https://reviews.llvm.org/D54959

llvm-svn: 347784
2018-11-28 18:03:38 +00:00
Craig Topper 7ceef03dc9 [X86] Replace an APInt that is guaranteed to be 8-bits with just an 'unsigned'
We're already mixing this APInt with other 'unsigned' variables. This allows us to use regular comparison operators instead of needing to use APInt::ult or APInt::uge. And it removes a later conversion from APInt to unsigned.

I might be adding another combine to this function and this will probably simplify the logic required for that.

llvm-svn: 347684
2018-11-27 18:24:56 +00:00
Craig Topper 196fd31e33 [X86] Use getUnpackl/getUnpackh instead of directly creating UNPCKL/UNPCKH nodes.
llvm-svn: 347642
2018-11-27 06:24:56 +00:00
Craig Topper 4325505f05 [X86] Prevent DAG combine from folding a bitcast from vXi1 to iX with a store on pre-AVX512 targets.
If we fold the bitcast into the store we'll end up creating a truncating store to vXi1 that will get scalarized. Instead allow the bitcast to be turned into a movmsk.

We probably need to do something if the store itself is a vXi1 type, but I'll leave that til a testcase appears.

llvm-svn: 347632
2018-11-27 02:57:27 +00:00
Craig Topper b955bf382c [LegalizeVectorTypes][X86][ARM][AArch64][PowerPC] Don't use SplitVecOp_TruncateHelper for FP_TO_SINT/UINT.
SplitVecOp_TruncateHelper tries to promote the result type while splitting FP_TO_SINT/UINT. It then concatenates the result and introduces a truncate to the original result type. But it does this without inserting the AssertZExt/AssertSExt that the regular result type promotion would insert. Nor does it turn FP_TO_UINT into FP_TO_SINT the way normal result type promotion for these operations does. This is bad on X86 which doesn't support FP_TO_SINT until AVX512.

This patch disables the use of SplitVecOp_TruncateHelper for these operations and just lets normal promotion handle it. I've tweaked a couple things in X86ISelLowering to avoid a few obvious regressions there. I believe all the changes on X86 are improvements. The other targets look neutral.

Differential Revision: https://reviews.llvm.org/D54906

llvm-svn: 347593
2018-11-26 21:12:39 +00:00
Sanjay Patel d31220e0de [x86] promote all multiply i8 by constant to i32
We have these 2 "isDesirable" promotion hooks (I'm not sure why we need both of them, but that's 
independent of this patch), and we can adjust them to promote "mul i8 X, C" to i32. Then, all of 
our existing LEA and other multiply expansion magic happens as it would for i32 ops.

Some of the test diffs show that we could end up with an actual 32-bit mul instruction here 
because we choose not to expand to simpler ops. That instruction could be slower depending on the 
subtarget. On the plus side, this means we don't need a separate instruction to load the constant 
operand and possibly an extra instruction to move the result. If we need to tune mul i32 further, 
we could add a later transform that tries to shrink it back to i8 based on subtarget timing.

I did not bother to duplicate all of the 32-bit test file RUNs and target settings that exist to 
test whether LEA expansion is cheap or not. The diffs here assume a default target, so that means 
LEA is generally cheap.

Differential Revision: https://reviews.llvm.org/D54803

llvm-svn: 347557
2018-11-26 15:22:30 +00:00
Sanjay Patel 7336e7c67a [x86] limit transform for select-of-fp-constants
This should likely be adjusted to limit this transform
further, but these diffs should be clear wins.

If we have blendv/conditional move, then we should assume 
those are cheap ops. The loads become independent of the
compare, so those can be speculated before we need to use 
the values in the blend/mov.

llvm-svn: 347526
2018-11-25 17:27:02 +00:00
Sanjay Patel cadf62f360 [x86] fix predicate for avoiding vblendv
It only makes sense to produce the logic ops when 1 of the
constants is +0.0. Otherwise, go with vblendv to reduce code.

llvm-svn: 347403
2018-11-21 18:02:50 +00:00
Simon Pilgrim 66bae9aee8 [X86][AVX] Remove BROADCAST if we only need the 0'th element
We don't catch this with target shuffle simplification if the src/dst types are different.

llvm-svn: 347386
2018-11-21 11:00:09 +00:00
Craig Topper e9b4001a82 [X86] In getScalarMaskingNode, replace scalar_to_vector with a bitcast to v8i1 and an extract_subvector to convert i8 to v1i1.
The bitcast can be nicely merged with any i8 loads that exist for argument passing in 32 mode for example.

llvm-svn: 347380
2018-11-21 07:01:22 +00:00
Craig Topper aa52ee2770 [X86] Emit a PACKUS instead of a VECTOR_SHUFFLE from LowerTRUNCATE for v16i16->v16i8.
We can't guarantee that demanded bits passing through the vector shuffle won't cause the AND in front of this to be removed. This would prevent the PACKUS from being matched during shuffle lowering.

Unfortunately, this adds a packuswb to one of the vector-reduce-mul.ll tests since we were removing the shuffle via SimplifyDemandedVectorElts. We appear to have similar issues with vpmovwb on the same test case on other targets.

llvm-svn: 347361
2018-11-20 22:57:48 +00:00
Craig Topper 24b346da42 [X86] Emit a single shuffle for the v16i8->v4i32 step of a SIGN_EXTEND_VECTOR_INREG lowering on pre-sse4.1 targets.
Previously we emitted to separate shuffles, one for unpcklbw and one for unpcklwd. Instead emit a single shuffle equivalent to both of the original shuffles. Shuffle lowering seems able to handle it. This avoids a bitcast between the two shuffles which seems helpful to DAG combine.

Remove the custom type legalization for v8i8->v8i32. I had put that in to avoid some almost duplicate punpcklbw instructions I was seeing, but this lowering change seems to fix that. It also fixes some duplicate shuffles seen in vector-sext.ll

llvm-svn: 347348
2018-11-20 21:21:52 +00:00
Simon Pilgrim ee8b96f253 [X86][SSE] Add computeKnownBits/ComputeNumSignBits support for PACKSS/PACKUS instructions.
Pull out getPackDemandedElts demanded elts remapping helper from computeKnownBitsForTargetNode and use in computeKnownBits/ComputeNumSignBits.

llvm-svn: 347303
2018-11-20 13:23:37 +00:00
Simon Pilgrim ed7e2fda18 [X86][SSE] XFormVExtractWithShuffleIntoLoad - getVectorShuffle won't accept SM_SentinelZero
Noticed while working on improving demanded elts target shuffle shuffle combining

llvm-svn: 347302
2018-11-20 12:17:50 +00:00
Simon Pilgrim a6fb85ffa7 [X86][SSE] Lower immediately to PACKUS instead of VECTOR_SHUFFLE.
As discussed on rL347240, this avoids some regressions on D54679 and also helps some combines to kick in a bit earlier.

llvm-svn: 347300
2018-11-20 11:46:37 +00:00
Simon Pilgrim 7198506ba8 [X86][SSE] Add SimplifyDemandedVectorElts support for PACKSS/PACKUS instructions.
As discussed on rL347240.

llvm-svn: 347299
2018-11-20 11:09:46 +00:00
Craig Topper 17fa42a69b [X86] Preserve undef information when creating a punpckl/hbw from a v16i8 where all the even or odd elements are undef.
Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output.

As near as I can tell this makes v16i8 behavior consistent with every other VT now.

This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types.

llvm-svn: 347296
2018-11-20 09:04:01 +00:00
Craig Topper b06d1aa3a1 [X86] Add custom type legalization for v8i8->v8i32 sign extend pre-SSE4.1
This helps with a future patch and makes us less reliant on DAG combine merging shuffles.

llvm-svn: 347295
2018-11-20 09:03:58 +00:00
Craig Topper c733c7bf94 [X86] Replace more calls to getZeroVector with regular getConstant.
getZeroVector produces a specifically canonicalized zero vector, but we can just let DAG legalization take care of it.

The test changes are because MULH lowering happens later than it should and this change gave us the opportunity to constant fold away a multiply during a DAG combine before the build_vector got legalized with a bitcast.

llvm-svn: 347290
2018-11-20 06:54:01 +00:00
Craig Topper 808d0dd689 [X86] Rename combineVSZext->combineExtendVectorInreg. NFC
Now that we no longer have target specific vector extend nodes let's make the function name match the nodes we do use.

llvm-svn: 347268
2018-11-19 22:18:47 +00:00
Simon Pilgrim c4861ab170 [X86][SSE] Remove unnecessary bit-and in pshufb vector ctlz (PR39703)
SSE PSHUFB vector ctlz lowering works at the i4 nibble level. As detailed in PR39703, we were masking the lower nibble off but we only actually use it in the case where the upper nibble is known to be zero, making it safe to remove the mask and save an instruction.

Differential Revision: https://reviews.llvm.org/D54707

llvm-svn: 347242
2018-11-19 18:40:59 +00:00
Craig Topper 311bbcd535 [X86] Attempt to improve v32i8/v64i8 multiply lowering by applying the v16i8 non-avx2 algorithm to each 128-bit lane.
Previously we split the vectors in half to allow the two halves to be any extended then concatenated the results back together.

This patch instead instead extends the v16i8 sse algorithm to extend half of each 128-bit lane using punpcklbw/punpckhbw. Multiplies all the low half lanes and high half lanes together in separate operations. Then merges the half lane results back together using packuswb.

Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split.

Differential Revision: https://reviews.llvm.org/D54668

llvm-svn: 347240
2018-11-19 18:32:53 +00:00
Craig Topper 8b22bcd39f [X86] Use a pcmpgt with 0 instead of psrad 31, to fill elements with the sign bit in v4i32 MULH lowering.
The shift requires a copy to avoid clobbering a register. Comparing with 0 uses an xor to produce 0 that will be overwritten with the compare results. So still requires 2 instructions, but should be one byte shorter since it doesn't need to encode an immediate.

llvm-svn: 347185
2018-11-19 07:22:26 +00:00
Craig Topper 3616891046 [X86] Use compare with 0 to fill an element with sign bits when sign extending to v2i64 pre-sse4.1
Previously we used an arithmetic shift right by 31, but that requires a copy to preserve the input. So we might as well materialize a zero and compare to it since the comparison will overwrite the register that contains the zeros. This should be one byte shorter.

llvm-svn: 347181
2018-11-19 04:33:20 +00:00
Craig Topper 053f1eea96 [X86] Remove most of the SEXTLOAD Custom setOperationAction calls under -x86-experimental-vector-widening-legalization.
Leave just the v4i8->v4i64 and v8i8->v8i64, but only enable them on pre-sse4.1 targets when 64-bit mode is enabled. In those cases we end up creating sext loads that get scalarized to code that looks better than what we get from loading into a vector register and doing a multiple step sign extend using unpacks and shifts.

llvm-svn: 347180
2018-11-19 00:33:16 +00:00
Simon Pilgrim 7f92efa5a9 [X86][SSE] Add SimplifyDemandedVectorElts support for SSE packed i2fp conversions.
llvm-svn: 347177
2018-11-18 22:13:31 +00:00
Craig Topper 0468c860b7 [X86] Add custom type legalization for extending v4i8/v4i16->v4i64.
Pre-SSE4.1 sext_invec for v2i64 is complicated because we don't have a v2i64 sra instruction. So instead we sign extend to i32 using unpack and sra, then copy the elements and do a v4i32 sra to fill with sign bits, then interleave the i32 sign extend and the sign bits. So really we're doing to two sign extends but only using half of the v4i32 intermediate result.

When the result is more than 128 bits, default type legalization would prefer to split the destination type all the way down to v2i64 with shuffles followed by v16i8/v8i16->v2i64 sext_inreg operations. This results in more instructions than necessary because we are only utilizing the lower 2 elements of the v4i32 intermediate result. Instead we can custom split a v4i8/v4i16->v4i64 sign_extend. Then we can sign extend v4i8/v4i16->v4i32 invec producing a full v4i32 result. Create the sign bit vector as a v4i32 then split and interleave with the sign bits using an punpackldq and punpackhdq.

llvm-svn: 347176
2018-11-18 21:28:50 +00:00
Simon Pilgrim b31bdbd2e9 [X86][SSE] Add SimplifyDemandedVectorElts support for SSE splat-vector-shifts.
SSE vector shifts only use the bottom 64-bits of the shift amount vector.

llvm-svn: 347173
2018-11-18 20:21:52 +00:00
Craig Topper 11d50948e2 [X86] Disable combineToExtendVectorInReg under -x86-experimental-vector-widening-legalization. Add custom type legalization for extends.
If we widen illegal types instead of promoting, we should be able to rely on the type legalizer to create the vector_inreg operations for us with some caveats.

This patch disables combineToExtendVectorInReg when we are using widening.

I've enabled custom legalization for v8i8->v8i64 extends under avx512f since the type legalizer would want to create a vector_inreg with a v64i8 input type which isn't legal without avx512bw. So we go to v16i8 with custom code using the relaxation of rules we get from D54346.

I've also enable custom legalization of v8i64 and v16i32 operations with with AVX. When the input type is 128 bits, the default splitting legalization would extend first 128->256, then do the a split to two 128 pieces. Extend each half to 256 and then concat the result. The custom legalization I've added instead uses a 128->256 bit vector_inreg extend that only reads the lower 64-bits for the low half of the split. Then shuffles the high 64-bits to the low 64-bits and does another vector_inreg extend.

llvm-svn: 347172
2018-11-18 18:11:25 +00:00
Craig Topper bc8148f7b0 [X86] Lower v16i16->v8i16 truncate using an 'and' with 255, an extract_subvector, and a packuswb instruction.
Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54671

llvm-svn: 347171
2018-11-18 17:59:28 +00:00
Simon Pilgrim ec808cf541 Remove unused variable. NFCI.
llvm-svn: 347169
2018-11-18 17:24:59 +00:00
Simon Pilgrim 50828c75d0 [X86][SSE] Split IsSplatValue into GetSplatValue and IsSplatVector
Refactor towards making this recursive (necessary for PR38243 rotation splat detection).
IsSplatVector returns the original vector source of the splat and the splat index.
GetSplatValue returns the scalar splatted value as an extraction from IsSplatVector.

llvm-svn: 347168
2018-11-18 17:15:06 +00:00
Simon Pilgrim fec9f8657b [X86][SSE] Relax IsSplatValue - remove the 'variable shift' limit on subtracts.
Means we don't use the per-lane-shifts as much when we can cheaply use the older splat-variable-shifts.

llvm-svn: 347162
2018-11-18 15:52:08 +00:00
Simon Pilgrim cc1f5d2407 [X86][SSE] Use raw shuffle mask decode in SimplifyDemandedVectorEltsForTargetNode (PR39549)
We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs.

llvm-svn: 347158
2018-11-18 13:34:53 +00:00
Craig Topper cd94a7c227 [X86] Add -x86-experimental-vector-widening-legalization check to combineSelect and combineSetCC to cover vXi16/vXi8 promotion without BWI.
I don't yet have any test cases for this, but its the right thing to do based on log file inspection.

llvm-svn: 347151
2018-11-18 08:30:09 +00:00
Craig Topper b03f80a21c [X86] Rename WidenMaskArithmetic->PromoteMaskArithmetic since we usually use widen to refer to adding elements not making elements larger. NFC
llvm-svn: 347150
2018-11-18 07:35:08 +00:00
Craig Topper f56a57518d [X86] Don't use a pmaddwd for vXi32 multiply if the inputs are zero extends from i8 or smaller without SSE4.1. Prefer to shrink the mul instead.
The zero extend will require two stages of unpacks to implement. So its better to shrink the multiply using pmullw and then extend that result back to v4i32 using a single unpack.

llvm-svn: 347149
2018-11-18 05:53:21 +00:00
Craig Topper 0438d791fa [X86] Add support for matching PACKUSWB from a v64i8 shuffle.
llvm-svn: 347143
2018-11-17 18:54:43 +00:00
Craig Topper dd61f11642 [X86] Don't extend v32i8 multiplies to v32i16 with avx512bw and prefer-vector-width=256.
llvm-svn: 347131
2018-11-17 02:36:07 +00:00
Craig Topper b05ea28f1f [X86] Use getUnpackl/getUnpackh instead of hardcoding a shuffle mask.
llvm-svn: 347127
2018-11-17 02:18:12 +00:00
Fangrui Song 7570932977 Use llvm::copy. NFC
llvm-svn: 347126
2018-11-17 01:44:25 +00:00
Craig Topper ee0333b4a9 [X86] Add custom promotion of narrow fp_to_uint/fp_to_sint operations under -x86-experimental-vector-widening-legalization.
This tries to force the result type to vXi32 followed by a truncate. This can help avoid scalarization that would otherwise occur.

There's some annoying examples of an avx512 truncate instruction followed by a packus where we should really be able to just use one truncate. But overall this is still a net improvement.

llvm-svn: 347105
2018-11-16 22:53:00 +00:00
Craig Topper 87bc07b3dd [X86] Qualify part of the masked gather handling in ReplaceNodeResults with a getTypeAction call to know if we can use default legalization.
If we managed to switch to -x86-experimental-vector-widening-legalization this block can be removed.

llvm-svn: 347100
2018-11-16 22:04:29 +00:00
Craig Topper 567aaeb40d [X86] Remove a branch on SSE4.1 from LowerLoad
We should be able to use getExtendInVec with or without sse4.1 to produce a SIGN_EXTEND_VECTOR_INREG.

llvm-svn: 347095
2018-11-16 21:05:00 +00:00
Craig Topper 7fff9a9aef [X86] In LowerLoad, fix assert messages and rename a variable that use Zize instead of Size. NFC
llvm-svn: 347093
2018-11-16 21:04:56 +00:00
Simon Pilgrim bcd6631a2a [X86][SSE] Move number of input limit out of resolveTargetShuffleInputs.
Only combineX86ShufflesRecursively needs this limit.

llvm-svn: 347054
2018-11-16 15:01:05 +00:00
Craig Topper 079c37da58 [X86] Add custom type legalization for v2i8/v4i8/v8i8 mul under -x86-experimental-vector-widening.
By early promoting the multiply to use an i16 element type we can avoid op legalization emit a second multiply for the 8 upper elements of the v16i8 type we would otherwise get.

llvm-svn: 347032
2018-11-16 06:15:21 +00:00
Craig Topper 5802b82b40 [X86] Use ANY_EXTEND instead of SIGN_EXTEND in the AVX2 and later path for legalizing vXi8 multiply.
We aren't going to use the upper bits of the multiply result that the extend would effect. So we don't need a specific type of extend.

This makes some reduction test cases shorter because we were previously trying to sign_extend a truncate which we can't eliminate.

llvm-svn: 347011
2018-11-16 01:16:59 +00:00
Craig Topper 1acafd863f [X86] Update a couple comments to remove a mention of a sign extending that no longer happens. NFC
llvm-svn: 347010
2018-11-16 01:16:51 +00:00
Craig Topper 22bfa99448 [X86] Remove ANY_EXTEND special case from canReduceVMulWidth
Removing this code doesn't affect any lit tests so it doesn't appear to be tested anymore. I assume it was when it was added, but I guess something else changed? Code coverage report also says its unused.

I mostly didn't like that it seemed to count the sign bits as if it was a sign_extend, but then set isPositive as if it was a zero_extend. It feels like we should have picked one interpretation?

Differential Revision: https://reviews.llvm.org/D54596

llvm-svn: 346995
2018-11-15 21:19:32 +00:00
Craig Topper b144c7a6fb [X86] Minor cleanup to getExtendInVec. NFCI
Use unsigned to calculate the subvector index to avoid a cast.

Remove an unnecessary condition and replace it with a stronger assert.

Use the InVT variable we updated when we extracted instead of grabbing it from the In SDValue.

llvm-svn: 346983
2018-11-15 19:20:22 +00:00
Craig Topper 73bb04ab6f [X86] Add -x86-experimental-vector-widening support to reduceVMULWidth and combineMulToPMADDWD
In reduceVMULWidth, we no longer need to worry about extending the vector to 128 bits first. Regular widening of extends, muls and shuffles will take care of that for us.

In combineMulToPMADDWD, we can handle v2i32 multiplies and allow the VPMADDWD to be widened to v4i32 during type legalization by adding custom widening like we do have for AVG/ADDUS/SUBUS. I had to modify that code a little to allow different and output VTs.

Differential Revision: https://reviews.llvm.org/D54512

llvm-svn: 346980
2018-11-15 18:59:31 +00:00
Craig Topper 553ac560aa [X86] Add some custom type legalization rules for truncate with -x86-experimental-vector-widening-legalization.
This avoids some nasty shuffles when we have avx512. It will also prevent using zmm truncate instructions when a ymm instruction that zeroes part of an xmm register will do. Also avoid using avx512 truncate instructions when the input is 128 bits or less. These instructions are 2 uops on skx so we can probably find a better single uop shuffle like pshufb.

llvm-svn: 346936
2018-11-15 08:23:40 +00:00
Craig Topper ea6ced9d1a [X86] Don't mark SEXTLOADS with narrow types as Custom with -x86-experimental-vector-widening-legalization.
The narrow types end up requesting widening, but generic legalization will end up scalaring and using a build_vector to do the widening.

llvm-svn: 346916
2018-11-15 00:21:41 +00:00
Benjamin Kramer 6b7d6fe079 [X86] Remove unused variable
llvm-svn: 346909
2018-11-14 23:13:27 +00:00
Craig Topper 0b2089da4b [X86] Support v2i32/v4i16/v8i8 load/store using f64 on 32-bit targets under -x86-experimental-vector-widening-legalization.
On 64-bit targets the type legalizer will use i64 to legalize these. But when i64 isn't legal, the type legalizer won't try an FP type. So do it manually instead.

There are a few regressions in here due to some v2i32 operations like mul and div now being reassembled into a full vector just to store instead of storing the pieces. But this was already occuring in 64-bit mode so its not a new issue.

llvm-svn: 346908
2018-11-14 23:02:09 +00:00
Craig Topper 6c94264b1f [X86] Allow pmulh to be formed from narrow vXi16 vectors under -x86-experimental-vector-widening-legalization
Narrower vectors will be widened to 128 bits without changing the element size. And generic type legalization can already handle widening mulhu/mulhs.

Differential Revision: https://reviews.llvm.org/D54513

llvm-svn: 346879
2018-11-14 18:16:21 +00:00
Simon Pilgrim 7501780ec6 [X86][AVX512] Remove constant pool shuffle decoding from SelectionDAG
This patch removes the last use of the constant pool shuffle decode helper and consistently uses the 'getTargetShuffleMaskIndices' versions instead. The constant pool versions are now purely used for assembly comments.

The avx512vbmi intrinsic upgrades had to be altered as they were being decoded as broadcasts, similar to what I fixed in rL346032. I don't think the change is critical - although its annoying that we lose the {k}{z} instruction test coverage as they are tricky to generate....

Differential Revision: https://reviews.llvm.org/D54083

llvm-svn: 346850
2018-11-14 11:26:35 +00:00
Craig Topper aca8390216 [SelectionDAG][X86] Relax restriction on the width of an input to *_EXTEND_VECTOR_INREG. Use them and regular *_EXTEND to replace the X86 specific VSEXT/VZEXT opcodes
Previously, the extend_vector_inreg opcode required their input register to be the same total width as their output. But this doesn't match up with how the X86 instructions are defined. For X86 the input just needs to be a legal type with at least enough elements to cover the output.

This patch weakens the check on these nodes and allows them to be used as long as they have more input elements than output elements. I haven't changed type legalization behavior so it will still create them with matching input and output sizes.

X86 will custom legalize these nodes by shrinking the input to be a 128 bit vector and once we've done that we treat them as legal operations. We still have one case during type legalization where we must custom handle v64i8 on avx512f targets without avx512bw where v64i8 isn't a legal type. In this case we will custom type legalize to a *extend_vector_inreg with a v16i8 input. After that the input is a legal type so type legalization should ignore the node and doesn't need to know about the relaxed restriction. We are no longer allowed to use the default expansion for these nodes during vector op legalization since the default expansion uses a shuffle which required the widths to match. Custom legalization for all types will prevent us from reaching the default expansion code.

I believe DAG combine works correctly with the released restriction because it doesn't check the number of input elements.

The rest of the patch is changing X86 to use either the vector_inreg nodes or the regular zero_extend/sign_extend nodes. I had to add additional isel patterns to handle any_extend during isel since simplifydemandedbits can create them at any time so we can't legalize to zero_extend before isel. We don't yet create any_extend_vector_inreg in simplifydemandedbits.

Differential Revision: https://reviews.llvm.org/D54346

llvm-svn: 346784
2018-11-13 19:45:21 +00:00
Simon Pilgrim e565e5a962 [X86][SSE] Add lowerVectorShuffleAsByteRotateAndPermute (PR39387)
This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single input permute to put them in the right location.

Differential Revision: https://reviews.llvm.org/D54267

llvm-svn: 346706
2018-11-12 21:12:38 +00:00
Craig Topper c48712b341 [X86] In LowerMULH, use generic truncate and vector shuffle nodes instead of directly emitting PACKUS.
Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis.

This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not.

llvm-svn: 346697
2018-11-12 19:37:29 +00:00
Craig Topper 2eab39f77b [X86] Use DAG.getConstant instead of getZeroVector.
llvm-svn: 346605
2018-11-11 07:24:36 +00:00
Craig Topper ef33a190bc [X86] Replace calls to getOnesVector/getZeroVector with getConstant.
getConstant will create a BUILD_VECTOR for us and use a legal type if necessary. So just create the simple node and let BUILD_VECTOR legalization do the canonicalization.

llvm-svn: 346603
2018-11-11 01:40:04 +00:00
Benjamin Kramer 37c691e867 [X86] Remove unused variable
llvm-svn: 346592
2018-11-10 18:11:11 +00:00
Craig Topper 7956a256e9 [X86] Remove apparently unneeded code from combineVSZext.
No lit tests fail with this code removed.

This is a pre-commit for D54346.

llvm-svn: 346590
2018-11-10 17:44:28 +00:00
Craig Topper 0364085281 [X86] In LowerHorizontalByteSum, emit vector_shuffle nodes instead of directly using X86ISD::UNPCKL/X86ISD::UNPCKH.
This gives shuffle lowering the freedom to use zero_extend_vector_inreg for the unpckl shuffle. Shuffle combining usually makes this swap later, but not when AVX512 is enabled it seems.

While there also use DAG.getConstant to create a 0 vector instead of using the helper the forces a specific BUILD_VECTOR. I don't think that helper is usually needed. We're basically free to create a constant build_vector anytime and it will be legalized on its own.

llvm-svn: 346574
2018-11-10 00:26:42 +00:00
Craig Topper 17d64c71c5 [X86] Move the promotion of v16i16->v16i8 for avx512f but not avx512bw from lowering to isel. Change to use vpmovzx instead of vpmovsx.
With avx512f but not avx512bw we need to extend to v16i32 then truncate that to to v16i8. Previously we emitted both nodes during lowering, but I'm trying to switch to using target independent nodes and with that switched the extend+truncate wou

This patch changes the implementation to what will be necessary with that patch which helps minimize test diffs.

llvm-svn: 346552
2018-11-09 20:09:53 +00:00
Craig Topper 731ea7dbc1 [X86] Turn X86ISD::VSEXT into X86ISD::VZEXT if the upper bits aren't demanded.
This makes X86ISD::VSEXT more similar to ISD::SIGN_EXTEND and ISD::ZERO_EXTEND.

I'm hoping to replace X86ISD::VSEXT/VZEXT with target independent nodes. Making the target specific nodes similar to the target independent nodes helps minimize test diffs in that patch.

llvm-svn: 346539
2018-11-09 19:05:51 +00:00
Sanjay Patel fa1c0fe478 [x86] try to form broadcast before widening shuffle elements
I noticed that we weren't generating broadcasts as much I thought we would with 
D54271, and this is part of the problem.

Widening the shuffle elements means adding bitcasts and hiding the relationship 
between a splatted scalar and the vector. If we can form a broadcast, do that 
before going through the rest of the shuffle lowering because broadcasts should 
be cheap and can often be load-folded.

Differential Revision: https://reviews.llvm.org/D54280

llvm-svn: 346498
2018-11-09 14:54:58 +00:00
Simon Pilgrim ea51f98b9b [X86] Add Subtarget to more lowerVectorShuffle functions. NFCI.
This will be necessary for an update to D54267

llvm-svn: 346490
2018-11-09 13:19:03 +00:00
Sanjay Patel b5535dc7b3 [x86] use shuffles for scalar insertion into high elements of a constant vector
As discussed in D54073, we have a potential regression from more aggressive vector narrowing here, so let's try to avoid that by changing build-vector lowering slightly.

Insert-vector-element lowering always does this since there's no "pinsr" for ymm/zmm:

// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.

...but we can sometimes do better for insert-into-constant-vector by using shuffle lowering.

Differential Revision: https://reviews.llvm.org/D54271

llvm-svn: 346433
2018-11-08 19:16:27 +00:00
Craig Topper 6428a2cd9a [X86] Add custom promotion of v2i8/v2i16 fp_to_sint to avoid over promotion to v2i64 which would force scalarization.
llvm-svn: 346259
2018-11-06 19:24:21 +00:00
Craig Topper 0b5f8169b0 [TargetLowering] Change TargetLoweringBase::getPreferredVectorAction to take an MVT instead of an EVT. NFC
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.

llvm-svn: 346180
2018-11-05 23:26:13 +00:00
Craig Topper def82a81af [X86] Don't turn any_extend from a mask register into a sign_extend during lowering. Add patterns to match any_extend during isel instead.
SimplifyDemandedBits can turn a sign_extend back into an any_extend and trigger an infinite loop. So instead legalize it the same way as a sign_extend, but preserve the opcode. Then just pattern match it the same as sign_extend during isel.

I don't have a reduced test case for such an infinite loop yet.

llvm-svn: 346170
2018-11-05 22:08:17 +00:00
Craig Topper 30b627e5c9 [X86] Custom type legalize v2i8/v2i16/v2i32 mul to use to pmuludq.
v2i8/v2i16/v2i32 are promoted to v2i64. pmuludq takes a v2i64 input and produces a v2i64 output. Since we don't about the upper bits of the type legalized multiply we can use the pmuludq to produce the multiply result for the bits we do care about.

llvm-svn: 346115
2018-11-05 05:02:12 +00:00
Craig Topper ed6a0a817f [X86] Add vector shift by immediate to SimplifyDemandedBitsForTargetNode.
Summary: This also enables some constant folding from KnownBits propagation. This helps on some cases vXi64 case in 32-bit mode where constant vectors appear as vXi32 and a bitcast. This can prevent getNode from constant folding sra/shl/srl.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54069

llvm-svn: 346102
2018-11-04 17:31:27 +00:00
Craig Topper 1ba86188cf [SelectionDAG] Remove special methods for creating *_EXTEND_VECTOR_INREG nodes. Move asserts into getNode.
These methods were just wrappers around getNode with additional asserts (identical and repeated 3 times). But getNode already has a switch that can be used to hold these asserts that allows them to be shared for all 3 opcodes. This also enables checking on the places that create these nodes without using the wrappers.

The rest of the patch is just changing all callers to use getNode directly.

llvm-svn: 346087
2018-11-04 02:10:18 +00:00
Craig Topper 7aed9e600b [X86] Update comment I forgot to change in r346043. NFC
llvm-svn: 346073
2018-11-03 19:49:13 +00:00
Craig Topper f7108aef14 [X86] In LowerEXTEND_VECTOR_INREG, emit a vector shuffle instead of directly using X86ISD::UNPCKL
The majority of the changes are because the rest of shuffle lowering/combining prefers to replace the undef input with the other operand. Using UNPCKL directly seemed to avoid this and just grabbed a randomish register for the undef which can create false dependencies.

llvm-svn: 346050
2018-11-02 22:48:02 +00:00
Craig Topper 60c202a494 [X86] Don't emit *_extend_vector_inreg nodes when both the input and output types are legal with AVX1
We already have custom lowering for the AVX case in LegalizeVectorOps. So its better to keep the regular extend op around as long as possible.

I had to qualify one place in DAG combine that created illegal vector extending load operations. This change by itself had no effect on any tests which is why its included here.

I've made a few cleanups to the custom lowering. The sign extend code no longer creates an identity shuffle with undef elements. The zero extend code now emits a zero_extend_vector_inreg instead of an unpckl with a zero vector.

For the high half of the custom lowering of zero_extend/any_extend, we're now using an unpckh with a zero vector or undef. Previously we used used a pshufd to move the upper 64-bits to the lower 64-bits and then used a zero_extend_vector_inreg. I think the zero vector should require less execution resources and be smaller code size.

Differential Revision: https://reviews.llvm.org/D54024

llvm-svn: 346043
2018-11-02 21:09:49 +00:00