Commit Graph

5955 Commits

Author SHA1 Message Date
Simon Pilgrim ec808cf541 Remove unused variable. NFCI.
llvm-svn: 347169
2018-11-18 17:24:59 +00:00
Simon Pilgrim 50828c75d0 [X86][SSE] Split IsSplatValue into GetSplatValue and IsSplatVector
Refactor towards making this recursive (necessary for PR38243 rotation splat detection).
IsSplatVector returns the original vector source of the splat and the splat index.
GetSplatValue returns the scalar splatted value as an extraction from IsSplatVector.

llvm-svn: 347168
2018-11-18 17:15:06 +00:00
Simon Pilgrim fec9f8657b [X86][SSE] Relax IsSplatValue - remove the 'variable shift' limit on subtracts.
Means we don't use the per-lane-shifts as much when we can cheaply use the older splat-variable-shifts.

llvm-svn: 347162
2018-11-18 15:52:08 +00:00
Simon Pilgrim cc1f5d2407 [X86][SSE] Use raw shuffle mask decode in SimplifyDemandedVectorEltsForTargetNode (PR39549)
We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs.

llvm-svn: 347158
2018-11-18 13:34:53 +00:00
Craig Topper cd94a7c227 [X86] Add -x86-experimental-vector-widening-legalization check to combineSelect and combineSetCC to cover vXi16/vXi8 promotion without BWI.
I don't yet have any test cases for this, but its the right thing to do based on log file inspection.

llvm-svn: 347151
2018-11-18 08:30:09 +00:00
Craig Topper b03f80a21c [X86] Rename WidenMaskArithmetic->PromoteMaskArithmetic since we usually use widen to refer to adding elements not making elements larger. NFC
llvm-svn: 347150
2018-11-18 07:35:08 +00:00
Craig Topper f56a57518d [X86] Don't use a pmaddwd for vXi32 multiply if the inputs are zero extends from i8 or smaller without SSE4.1. Prefer to shrink the mul instead.
The zero extend will require two stages of unpacks to implement. So its better to shrink the multiply using pmullw and then extend that result back to v4i32 using a single unpack.

llvm-svn: 347149
2018-11-18 05:53:21 +00:00
Craig Topper 0438d791fa [X86] Add support for matching PACKUSWB from a v64i8 shuffle.
llvm-svn: 347143
2018-11-17 18:54:43 +00:00
Craig Topper dd61f11642 [X86] Don't extend v32i8 multiplies to v32i16 with avx512bw and prefer-vector-width=256.
llvm-svn: 347131
2018-11-17 02:36:07 +00:00
Craig Topper b05ea28f1f [X86] Use getUnpackl/getUnpackh instead of hardcoding a shuffle mask.
llvm-svn: 347127
2018-11-17 02:18:12 +00:00
Fangrui Song 7570932977 Use llvm::copy. NFC
llvm-svn: 347126
2018-11-17 01:44:25 +00:00
Craig Topper ee0333b4a9 [X86] Add custom promotion of narrow fp_to_uint/fp_to_sint operations under -x86-experimental-vector-widening-legalization.
This tries to force the result type to vXi32 followed by a truncate. This can help avoid scalarization that would otherwise occur.

There's some annoying examples of an avx512 truncate instruction followed by a packus where we should really be able to just use one truncate. But overall this is still a net improvement.

llvm-svn: 347105
2018-11-16 22:53:00 +00:00
Craig Topper 87bc07b3dd [X86] Qualify part of the masked gather handling in ReplaceNodeResults with a getTypeAction call to know if we can use default legalization.
If we managed to switch to -x86-experimental-vector-widening-legalization this block can be removed.

llvm-svn: 347100
2018-11-16 22:04:29 +00:00
Craig Topper 567aaeb40d [X86] Remove a branch on SSE4.1 from LowerLoad
We should be able to use getExtendInVec with or without sse4.1 to produce a SIGN_EXTEND_VECTOR_INREG.

llvm-svn: 347095
2018-11-16 21:05:00 +00:00
Craig Topper 7fff9a9aef [X86] In LowerLoad, fix assert messages and rename a variable that use Zize instead of Size. NFC
llvm-svn: 347093
2018-11-16 21:04:56 +00:00
Simon Pilgrim bcd6631a2a [X86][SSE] Move number of input limit out of resolveTargetShuffleInputs.
Only combineX86ShufflesRecursively needs this limit.

llvm-svn: 347054
2018-11-16 15:01:05 +00:00
Craig Topper 079c37da58 [X86] Add custom type legalization for v2i8/v4i8/v8i8 mul under -x86-experimental-vector-widening.
By early promoting the multiply to use an i16 element type we can avoid op legalization emit a second multiply for the 8 upper elements of the v16i8 type we would otherwise get.

llvm-svn: 347032
2018-11-16 06:15:21 +00:00
Craig Topper 5802b82b40 [X86] Use ANY_EXTEND instead of SIGN_EXTEND in the AVX2 and later path for legalizing vXi8 multiply.
We aren't going to use the upper bits of the multiply result that the extend would effect. So we don't need a specific type of extend.

This makes some reduction test cases shorter because we were previously trying to sign_extend a truncate which we can't eliminate.

llvm-svn: 347011
2018-11-16 01:16:59 +00:00
Craig Topper 1acafd863f [X86] Update a couple comments to remove a mention of a sign extending that no longer happens. NFC
llvm-svn: 347010
2018-11-16 01:16:51 +00:00
Craig Topper 22bfa99448 [X86] Remove ANY_EXTEND special case from canReduceVMulWidth
Removing this code doesn't affect any lit tests so it doesn't appear to be tested anymore. I assume it was when it was added, but I guess something else changed? Code coverage report also says its unused.

I mostly didn't like that it seemed to count the sign bits as if it was a sign_extend, but then set isPositive as if it was a zero_extend. It feels like we should have picked one interpretation?

Differential Revision: https://reviews.llvm.org/D54596

llvm-svn: 346995
2018-11-15 21:19:32 +00:00
Craig Topper b144c7a6fb [X86] Minor cleanup to getExtendInVec. NFCI
Use unsigned to calculate the subvector index to avoid a cast.

Remove an unnecessary condition and replace it with a stronger assert.

Use the InVT variable we updated when we extracted instead of grabbing it from the In SDValue.

llvm-svn: 346983
2018-11-15 19:20:22 +00:00
Craig Topper 73bb04ab6f [X86] Add -x86-experimental-vector-widening support to reduceVMULWidth and combineMulToPMADDWD
In reduceVMULWidth, we no longer need to worry about extending the vector to 128 bits first. Regular widening of extends, muls and shuffles will take care of that for us.

In combineMulToPMADDWD, we can handle v2i32 multiplies and allow the VPMADDWD to be widened to v4i32 during type legalization by adding custom widening like we do have for AVG/ADDUS/SUBUS. I had to modify that code a little to allow different and output VTs.

Differential Revision: https://reviews.llvm.org/D54512

llvm-svn: 346980
2018-11-15 18:59:31 +00:00
Craig Topper 553ac560aa [X86] Add some custom type legalization rules for truncate with -x86-experimental-vector-widening-legalization.
This avoids some nasty shuffles when we have avx512. It will also prevent using zmm truncate instructions when a ymm instruction that zeroes part of an xmm register will do. Also avoid using avx512 truncate instructions when the input is 128 bits or less. These instructions are 2 uops on skx so we can probably find a better single uop shuffle like pshufb.

llvm-svn: 346936
2018-11-15 08:23:40 +00:00
Craig Topper ea6ced9d1a [X86] Don't mark SEXTLOADS with narrow types as Custom with -x86-experimental-vector-widening-legalization.
The narrow types end up requesting widening, but generic legalization will end up scalaring and using a build_vector to do the widening.

llvm-svn: 346916
2018-11-15 00:21:41 +00:00
Benjamin Kramer 6b7d6fe079 [X86] Remove unused variable
llvm-svn: 346909
2018-11-14 23:13:27 +00:00
Craig Topper 0b2089da4b [X86] Support v2i32/v4i16/v8i8 load/store using f64 on 32-bit targets under -x86-experimental-vector-widening-legalization.
On 64-bit targets the type legalizer will use i64 to legalize these. But when i64 isn't legal, the type legalizer won't try an FP type. So do it manually instead.

There are a few regressions in here due to some v2i32 operations like mul and div now being reassembled into a full vector just to store instead of storing the pieces. But this was already occuring in 64-bit mode so its not a new issue.

llvm-svn: 346908
2018-11-14 23:02:09 +00:00
Craig Topper 6c94264b1f [X86] Allow pmulh to be formed from narrow vXi16 vectors under -x86-experimental-vector-widening-legalization
Narrower vectors will be widened to 128 bits without changing the element size. And generic type legalization can already handle widening mulhu/mulhs.

Differential Revision: https://reviews.llvm.org/D54513

llvm-svn: 346879
2018-11-14 18:16:21 +00:00
Simon Pilgrim 7501780ec6 [X86][AVX512] Remove constant pool shuffle decoding from SelectionDAG
This patch removes the last use of the constant pool shuffle decode helper and consistently uses the 'getTargetShuffleMaskIndices' versions instead. The constant pool versions are now purely used for assembly comments.

The avx512vbmi intrinsic upgrades had to be altered as they were being decoded as broadcasts, similar to what I fixed in rL346032. I don't think the change is critical - although its annoying that we lose the {k}{z} instruction test coverage as they are tricky to generate....

Differential Revision: https://reviews.llvm.org/D54083

llvm-svn: 346850
2018-11-14 11:26:35 +00:00
Craig Topper aca8390216 [SelectionDAG][X86] Relax restriction on the width of an input to *_EXTEND_VECTOR_INREG. Use them and regular *_EXTEND to replace the X86 specific VSEXT/VZEXT opcodes
Previously, the extend_vector_inreg opcode required their input register to be the same total width as their output. But this doesn't match up with how the X86 instructions are defined. For X86 the input just needs to be a legal type with at least enough elements to cover the output.

This patch weakens the check on these nodes and allows them to be used as long as they have more input elements than output elements. I haven't changed type legalization behavior so it will still create them with matching input and output sizes.

X86 will custom legalize these nodes by shrinking the input to be a 128 bit vector and once we've done that we treat them as legal operations. We still have one case during type legalization where we must custom handle v64i8 on avx512f targets without avx512bw where v64i8 isn't a legal type. In this case we will custom type legalize to a *extend_vector_inreg with a v16i8 input. After that the input is a legal type so type legalization should ignore the node and doesn't need to know about the relaxed restriction. We are no longer allowed to use the default expansion for these nodes during vector op legalization since the default expansion uses a shuffle which required the widths to match. Custom legalization for all types will prevent us from reaching the default expansion code.

I believe DAG combine works correctly with the released restriction because it doesn't check the number of input elements.

The rest of the patch is changing X86 to use either the vector_inreg nodes or the regular zero_extend/sign_extend nodes. I had to add additional isel patterns to handle any_extend during isel since simplifydemandedbits can create them at any time so we can't legalize to zero_extend before isel. We don't yet create any_extend_vector_inreg in simplifydemandedbits.

Differential Revision: https://reviews.llvm.org/D54346

llvm-svn: 346784
2018-11-13 19:45:21 +00:00
Simon Pilgrim e565e5a962 [X86][SSE] Add lowerVectorShuffleAsByteRotateAndPermute (PR39387)
This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single input permute to put them in the right location.

Differential Revision: https://reviews.llvm.org/D54267

llvm-svn: 346706
2018-11-12 21:12:38 +00:00
Craig Topper c48712b341 [X86] In LowerMULH, use generic truncate and vector shuffle nodes instead of directly emitting PACKUS.
Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis.

This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not.

llvm-svn: 346697
2018-11-12 19:37:29 +00:00
Craig Topper 2eab39f77b [X86] Use DAG.getConstant instead of getZeroVector.
llvm-svn: 346605
2018-11-11 07:24:36 +00:00
Craig Topper ef33a190bc [X86] Replace calls to getOnesVector/getZeroVector with getConstant.
getConstant will create a BUILD_VECTOR for us and use a legal type if necessary. So just create the simple node and let BUILD_VECTOR legalization do the canonicalization.

llvm-svn: 346603
2018-11-11 01:40:04 +00:00
Benjamin Kramer 37c691e867 [X86] Remove unused variable
llvm-svn: 346592
2018-11-10 18:11:11 +00:00
Craig Topper 7956a256e9 [X86] Remove apparently unneeded code from combineVSZext.
No lit tests fail with this code removed.

This is a pre-commit for D54346.

llvm-svn: 346590
2018-11-10 17:44:28 +00:00
Craig Topper 0364085281 [X86] In LowerHorizontalByteSum, emit vector_shuffle nodes instead of directly using X86ISD::UNPCKL/X86ISD::UNPCKH.
This gives shuffle lowering the freedom to use zero_extend_vector_inreg for the unpckl shuffle. Shuffle combining usually makes this swap later, but not when AVX512 is enabled it seems.

While there also use DAG.getConstant to create a 0 vector instead of using the helper the forces a specific BUILD_VECTOR. I don't think that helper is usually needed. We're basically free to create a constant build_vector anytime and it will be legalized on its own.

llvm-svn: 346574
2018-11-10 00:26:42 +00:00
Craig Topper 17d64c71c5 [X86] Move the promotion of v16i16->v16i8 for avx512f but not avx512bw from lowering to isel. Change to use vpmovzx instead of vpmovsx.
With avx512f but not avx512bw we need to extend to v16i32 then truncate that to to v16i8. Previously we emitted both nodes during lowering, but I'm trying to switch to using target independent nodes and with that switched the extend+truncate wou

This patch changes the implementation to what will be necessary with that patch which helps minimize test diffs.

llvm-svn: 346552
2018-11-09 20:09:53 +00:00
Craig Topper 731ea7dbc1 [X86] Turn X86ISD::VSEXT into X86ISD::VZEXT if the upper bits aren't demanded.
This makes X86ISD::VSEXT more similar to ISD::SIGN_EXTEND and ISD::ZERO_EXTEND.

I'm hoping to replace X86ISD::VSEXT/VZEXT with target independent nodes. Making the target specific nodes similar to the target independent nodes helps minimize test diffs in that patch.

llvm-svn: 346539
2018-11-09 19:05:51 +00:00
Sanjay Patel fa1c0fe478 [x86] try to form broadcast before widening shuffle elements
I noticed that we weren't generating broadcasts as much I thought we would with 
D54271, and this is part of the problem.

Widening the shuffle elements means adding bitcasts and hiding the relationship 
between a splatted scalar and the vector. If we can form a broadcast, do that 
before going through the rest of the shuffle lowering because broadcasts should 
be cheap and can often be load-folded.

Differential Revision: https://reviews.llvm.org/D54280

llvm-svn: 346498
2018-11-09 14:54:58 +00:00
Simon Pilgrim ea51f98b9b [X86] Add Subtarget to more lowerVectorShuffle functions. NFCI.
This will be necessary for an update to D54267

llvm-svn: 346490
2018-11-09 13:19:03 +00:00
Sanjay Patel b5535dc7b3 [x86] use shuffles for scalar insertion into high elements of a constant vector
As discussed in D54073, we have a potential regression from more aggressive vector narrowing here, so let's try to avoid that by changing build-vector lowering slightly.

Insert-vector-element lowering always does this since there's no "pinsr" for ymm/zmm:

// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.

...but we can sometimes do better for insert-into-constant-vector by using shuffle lowering.

Differential Revision: https://reviews.llvm.org/D54271

llvm-svn: 346433
2018-11-08 19:16:27 +00:00
Craig Topper 6428a2cd9a [X86] Add custom promotion of v2i8/v2i16 fp_to_sint to avoid over promotion to v2i64 which would force scalarization.
llvm-svn: 346259
2018-11-06 19:24:21 +00:00
Craig Topper 0b5f8169b0 [TargetLowering] Change TargetLoweringBase::getPreferredVectorAction to take an MVT instead of an EVT. NFC
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.

llvm-svn: 346180
2018-11-05 23:26:13 +00:00
Craig Topper def82a81af [X86] Don't turn any_extend from a mask register into a sign_extend during lowering. Add patterns to match any_extend during isel instead.
SimplifyDemandedBits can turn a sign_extend back into an any_extend and trigger an infinite loop. So instead legalize it the same way as a sign_extend, but preserve the opcode. Then just pattern match it the same as sign_extend during isel.

I don't have a reduced test case for such an infinite loop yet.

llvm-svn: 346170
2018-11-05 22:08:17 +00:00
Craig Topper 30b627e5c9 [X86] Custom type legalize v2i8/v2i16/v2i32 mul to use to pmuludq.
v2i8/v2i16/v2i32 are promoted to v2i64. pmuludq takes a v2i64 input and produces a v2i64 output. Since we don't about the upper bits of the type legalized multiply we can use the pmuludq to produce the multiply result for the bits we do care about.

llvm-svn: 346115
2018-11-05 05:02:12 +00:00
Craig Topper ed6a0a817f [X86] Add vector shift by immediate to SimplifyDemandedBitsForTargetNode.
Summary: This also enables some constant folding from KnownBits propagation. This helps on some cases vXi64 case in 32-bit mode where constant vectors appear as vXi32 and a bitcast. This can prevent getNode from constant folding sra/shl/srl.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54069

llvm-svn: 346102
2018-11-04 17:31:27 +00:00
Craig Topper 1ba86188cf [SelectionDAG] Remove special methods for creating *_EXTEND_VECTOR_INREG nodes. Move asserts into getNode.
These methods were just wrappers around getNode with additional asserts (identical and repeated 3 times). But getNode already has a switch that can be used to hold these asserts that allows them to be shared for all 3 opcodes. This also enables checking on the places that create these nodes without using the wrappers.

The rest of the patch is just changing all callers to use getNode directly.

llvm-svn: 346087
2018-11-04 02:10:18 +00:00
Craig Topper 7aed9e600b [X86] Update comment I forgot to change in r346043. NFC
llvm-svn: 346073
2018-11-03 19:49:13 +00:00
Craig Topper f7108aef14 [X86] In LowerEXTEND_VECTOR_INREG, emit a vector shuffle instead of directly using X86ISD::UNPCKL
The majority of the changes are because the rest of shuffle lowering/combining prefers to replace the undef input with the other operand. Using UNPCKL directly seemed to avoid this and just grabbed a randomish register for the undef which can create false dependencies.

llvm-svn: 346050
2018-11-02 22:48:02 +00:00
Craig Topper 60c202a494 [X86] Don't emit *_extend_vector_inreg nodes when both the input and output types are legal with AVX1
We already have custom lowering for the AVX case in LegalizeVectorOps. So its better to keep the regular extend op around as long as possible.

I had to qualify one place in DAG combine that created illegal vector extending load operations. This change by itself had no effect on any tests which is why its included here.

I've made a few cleanups to the custom lowering. The sign extend code no longer creates an identity shuffle with undef elements. The zero extend code now emits a zero_extend_vector_inreg instead of an unpckl with a zero vector.

For the high half of the custom lowering of zero_extend/any_extend, we're now using an unpckh with a zero vector or undef. Previously we used used a pshufd to move the upper 64-bits to the lower 64-bits and then used a zero_extend_vector_inreg. I think the zero vector should require less execution resources and be smaller code size.

Differential Revision: https://reviews.llvm.org/D54024

llvm-svn: 346043
2018-11-02 21:09:49 +00:00
Simon Pilgrim b34a052852 [LegalizeDAG] Add generic vector CTPOP expansion (PR32655)
This patch adds support for expanding vector CTPOP instructions and removes the x86 'bitmath' lowering which replicates the same expansion.

Differential Revision: https://reviews.llvm.org/D53258

llvm-svn: 345869
2018-11-01 18:22:11 +00:00
Simon Pilgrim 1f0a8421ad [X86][SSE] Move 2-input limit up from getFauxShuffleMask to resolveTargetShuffleInputs (reapplied)
Reapplying an updated version of rL345395 (reverted in rL345451), now the issues noticed in PR39483 have been fixed. 

This patch allows resolveTargetShuffleInputs to remove UNDEF inputs from cases where we have more than 2 inputs.

llvm-svn: 345824
2018-11-01 11:52:09 +00:00
Craig Topper 6958b5ffa9 [X86] In lowerVectorShuffleAsBroadcast, make peeking through CONCAT_VECTORS work correctly if we already walked through a bitcast that changed the element size.
The CONCAT_VECTORS case was using the original mask element count to determine how to adjust the broadcast index. But if we looked through a bitcast the original mask size doesn't tell us anything about the concat_vectors.

This patch switchs to using the concat_vectors input element count directly instead.

Differential Revision: https://reviews.llvm.org/D53823

llvm-svn: 345626
2018-10-30 18:48:42 +00:00
Craig Topper b293322cee [LegalizeTypes] Teach PromoteIntRes_BITCAST to better handle a bitcast with vector output type and a vector input type that needs to be widened
Summary: Previously if we had a bitcast vector output type that needs promotion and a vector input type that needs widening we would just do a stack store and load to handle the conversion. We can do a little better if we can widen the bitcast to a legal vector type the same size as the widened input type. Then we can do the bitcast between this widened type and the widened input type. Afterwards we can extract_subvector back to the original output and any_extend that. Type legalization will then circle back and handle promotion of the extract_subvector and the any_extend will just be removed. This will avoid going through the stack and allows us to remove a custom version of this legalization from X86.

Reviewers: efriedma, RKSimon

Reviewed By: efriedma

Subscribers: javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D53229

llvm-svn: 345567
2018-10-30 03:27:15 +00:00
Craig Topper 67c2878501 [X86] Cleanup the code in LowerFABSorFNEG and LowerFCOPYSIGN a little. NFC
Use SelectionDAG::EVTToAPFloatSemantics. Make the LogicVT calculation in LowerFABSorFNEG similar to LowerFCOPYSIGN. Use APInt::getSignedMaxValue instead of ~APInt::getSignMask.

llvm-svn: 345565
2018-10-30 03:27:12 +00:00
Craig Topper 676d7a7a43 [X86] Stop changing f128 fand/for/fxor to v2i64.
The additional patterns don't cost us much and it seems better than changing element widths.

llvm-svn: 345564
2018-10-30 03:27:11 +00:00
Simon Pilgrim 3a2f3c2c0a [X86][SSE] getFauxShuffleMask - Fix shuffle mask adjustment for multiple inserted subvectors
Part of the issue discovered in PR39483, although its not fully exposed until I reapply rL345395 (by reverting rL345451)

llvm-svn: 345520
2018-10-29 18:25:48 +00:00
Craig Topper 42aa87143d [X86] Recognize constant splats in LowerFCOPYSIGN.
llvm-svn: 345484
2018-10-28 23:51:35 +00:00
Simon Pilgrim 9b77f0c291 [VectorLegalizer] Enable TargetLowering::expandFP_TO_UINT support.
Add vector support to TargetLowering::expandFP_TO_UINT.

This exposes an issue in X86TargetLowering::LowerVSELECT which was assuming that the select mask was the same width as the LHS/RHS ops - as long as the result is a sign splat we can easily sext/trunk this.

llvm-svn: 345473
2018-10-28 13:07:25 +00:00
Simon Pilgrim a365719a24 [X86][SSE] LowerVSELECT - pull out repeated getOperand(). NFCI.
llvm-svn: 345458
2018-10-27 18:37:59 +00:00
Simon Pilgrim 88116e905e Revert rL345395: [X86][SSE] Move 2-input limit up from getFauxShuffleMask to resolveTargetShuffleInputs
Makes no difference to actual shuffle decoding yet, but merges all the existing limits in one place for when proper support is fixed.
........
Its been reported that this is causing out of trunk failures.

llvm-svn: 345451
2018-10-27 07:10:48 +00:00
Craig Topper 8315d9990c [X86] Stop promoting vector and/or/xor/andn to vXi64.
These promotions add additional bitcasts to the SelectionDAG that can pessimize computeKnownBits/computeNumSignBits. It also seems to interfere with broadcast formation.

This patch removes the promotion and adds isel patterns instead.

The increased table size is more than I would like, but hopefully we can find some canonicalizations or other tricks to start pruning out patterns going forward.

Differential Revision: https://reviews.llvm.org/D53268

llvm-svn: 345408
2018-10-26 17:21:26 +00:00
Simon Pilgrim 5d1be4f8d4 [X86][SSE] Move 2-input limit up from getFauxShuffleMask to resolveTargetShuffleInputs
Makes no difference to actual shuffle decoding yet, but merges all the existing limits in one place for when proper support is fixed.

llvm-svn: 345395
2018-10-26 15:19:02 +00:00
Sanjay Patel 6b40768f5a [x86] commute blendvb with constant condition op to allow load folding
This is a narrow fix for 1 of the problems mentioned in PR27780:
https://bugs.llvm.org/show_bug.cgi?id=27780

I looked at more general solutions, but it's a mess. We canonicalize shuffle masks
based on the number of elements accessed from each operand, and that's not optional.
If you remove that, we'll crash because we fail to match isel patterns. So I'm
waiting until we're sure that we have blendvb with constant condition and then
commuting based on the load potential. Other cases like blend-with-immediate are
already handled elsewhere, so this is probably not a common problem anyway.

I didn't use "MayFoldLoad" because that checks for one-use and in these cases, we've
screwed that up by creating a temporary PSHUFB using these operands that we're counting
on to be killed later. Undoing that didn't look like a simple task because it's
intertwined with determining if we actually use both operands of the shuffle or not.a

Differential Revision: https://reviews.llvm.org/D53737

llvm-svn: 345390
2018-10-26 14:58:13 +00:00
Simon Pilgrim 7575c6d01b [X86] Use existing pulled out VT variables. NFCI.
llvm-svn: 345388
2018-10-26 14:39:28 +00:00
Craig Topper c10de9a37a [X86] Remove ProcIntelKNL and replace with a SlowPMADDWD flag to use in the one place it was checked.
llvm-svn: 345286
2018-10-25 17:29:00 +00:00
Craig Topper 7ae43cad65 [X86] Don't use the OriginalDemandedBits to calculate the DemandedMask for PMULUDQ/PMULDQ inputs.
Multiply a is complex operation so just because some bit of the output isn't used doesn't mean that bit of the input isn't used.

We might able to bound it, but it will require some more thought.

llvm-svn: 345241
2018-10-25 07:00:09 +00:00
Simon Pilgrim c5bb362b13 [X86][SSE] Add SimplifyDemandedBitsForTargetNode PMULDQ/PMULUDQ handling
Add X86 SimplifyDemandedBitsForTargetNode and use it to simplify PMULDQ/PMULUDQ target nodes.

This enables us to repeatedly simplify the node's arguments after the previous approach had to be reverted due to PR39398.

Differential Revision: https://reviews.llvm.org/D53643

llvm-svn: 345182
2018-10-24 19:11:28 +00:00
Matthias Braun 4f82406c46 SelectionDAG: Reuse bigger sized constants in memset expansion.
When implementing memset's today we often see this pattern:
$x0 = MOV 0xXYXYXYXYXYXYXYXY
store $x0, ...
$w1 = MOV 0xXYXYXYXY
store $w1, ...

We first create a 64bit constant in a 64bit register with all bytes the
same and then create a 32bit constant with all bytes the same in a 32bit
register. In many targets we could just access the lower byte of the
64bit register instead.

- Ideally this would be handled by the ConstantHoist pass but it runs
  too early when memset isn't expanded yet.
- The memset expansion code already had this optimization implemented,
  however SelectionDAG constantfolding would constantfold the
  "trunc(bigconstnat)" pattern to "smallconstant".
- This patch makes the memset expansion mark the constant as Opaque and
  stop DAGCombiner from constant folding in this situation. (Similar to
  how ConstantHoisting marks things as Opaque to avoid folding
  ADD/SUB/etc.)

Differential Revision: https://reviews.llvm.org/D53181

llvm-svn: 345102
2018-10-23 23:19:23 +00:00
Simon Pilgrim b6c57075c0 [X86][SSE] Revert rL343922 combinePMULDQ AddToWorklist (PR39398)
We can't add the MULDQ node back to the worklist after the demanded bits change has been committed in case the node has been removed entirely. This will have to wait until we have SimplifyDemandedBitsForTargetNode.

llvm-svn: 345070
2018-10-23 19:07:53 +00:00
Simon Pilgrim f85ee9f8b4 [X86][SSE] Update raw mask shuffle decoders to handle UNDEF mask elts
Matches the approach taken in the constant pool shuffle decoders, and uses an UndefElts mask instead of uint64_t(-1) raw mask values, which doesn't work safely for i32/i64 shuffle mask sizes (as the -1 value is legal).

This allows us to remove the constant pool shuffle decoders from most of the getTargetShuffleMask variable shuffle cases (X86ISD::VPERMV3 will be handled in a future commit).

llvm-svn: 345018
2018-10-23 11:33:38 +00:00
Craig Topper c8e183f9ee Recommit r344877 "[X86] Stop promoting integer loads to vXi64"
I've included a fix to DAGCombiner::ForwardStoreValueToDirectLoad that I believe will prevent the previous miscompile.

Original commit message:

Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to rem

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the lo

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344965
2018-10-22 22:14:05 +00:00
Simon Pilgrim 3b91e9676b Revert rL344931 from llvm/trunk: [X86][SSE] getTargetShuffleMaskIndices - allow opt-in support for whole undef shuffle mask elements
We can't safely assume that certain RawMask entries are UNDEF as most variable shuffles ignore non-index bits - PSHUFB only works on i8 elts so it'd be safe to use but I'm intending to come up with an alternative approach that works for all.
........
Enable this for PSHUFB constant mask decoding and remove the ConstantPool DecodePSHUFBMask

llvm-svn: 344937
2018-10-22 19:01:25 +00:00
Simon Pilgrim 794f85cd93 Revert rL344933 from llvm/trunk: [X86][SSE] Tidyup DecodeVPERMILPMask shuffle mask decoding
We can't safely assume that certain RawMask entries are UNDEF as most variable shuffles ignore non-index bits.
........
Add support for UNDEF raw mask elements and remove the ConstantPool DecodeVPERMILPMask usage in X86ISelLowering.cpp

llvm-svn: 344936
2018-10-22 18:58:32 +00:00
Simon Pilgrim 476c9f42fc [X86][SSE] Tidyup DecodeVPERMILPMask shuffle mask decoding
Add support for UNDEF raw mask elements and remove the ConstantPool DecodeVPERMILPMask usage in X86ISelLowering.cpp

llvm-svn: 344933
2018-10-22 18:35:13 +00:00
Simon Pilgrim 3521367ff3 [X86][SSE] getTargetShuffleMaskIndices - allow opt-in support for whole undef shuffle mask elements
Enable this for PSHUFB constant mask decoding and remove the ConstantPool DecodePSHUFBMask

llvm-svn: 344931
2018-10-22 18:09:02 +00:00
Simon Pilgrim 5dff767c25 [X86] getTargetConstantBitsFromNode - handle extraction from larger constant pool entries
First step towards removing X86ShuffleDecodeConstantPool usage from X86ISelLowering.cpp

llvm-svn: 344924
2018-10-22 17:43:33 +00:00
Craig Topper 8d8dcfe690 Revert r344877 "[X86] Stop promoting integer loads to vXi64"
Sam McCall reported miscompiles in some tensorflow code. Reverting while I try to figure out.

llvm-svn: 344921
2018-10-22 16:59:24 +00:00
Simon Pilgrim 6f5cd7c67f [X86][SSE] getTargetShuffleMask - pull out repeated shuffle mask element size. NFCI.
llvm-svn: 344910
2018-10-22 15:33:30 +00:00
Roman Lebedev 898808504d [X86] X86DAGToDAGISel: handle BZHI selection too, not just BEXTR.
Summary:
As discussed in D52304 / IRC, we now have pattern matching for
'bit extract' in two places - tablegen and `X86DAGToDAGISel`.
There are 4 patterns.
And we will have a problem with `x &  (-1 >> (32 - y))` pattern.
* If the mask is one-use, then it is always unfolded into `x << (32 - y) >> (32 - y)` first.
  Thus, the existing test coverage is already broken.
* If it is not one-use, then it is not unfolded, and is matched as BZHI.
* If it is not one-use, we will not match it as BEXTR. And if it is one-use, it will have been unfolded already.
So we will either not handle that pattern for BEXTR, or not have test coverage for it.
This is bad.

As discussed with @craig.topper, let's unify this matching, and do everything in `X86DAGToDAGISel`.
Then we will not have code duplication, and will have proper test coverage.

This indeed does not affect any tests, and this is great.
It means that for these two patterns, the `X86DAGToDAGISel` is identical to the tablegen version.

Please review carefully, i'm not fully sure about that intrinsic change, and introduction of the new `X86ISD` opcode.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: craig.topper

Subscribers: llvm-commits, craig.topper

Differential Revision: https://reviews.llvm.org/D53164

llvm-svn: 344904
2018-10-22 14:12:44 +00:00
Craig Topper 321df5b0d4 [X86] Stop promoting integer loads to vXi64
Summary:
Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344877
2018-10-21 21:30:26 +00:00
Craig Topper 8de07b4db1 Revert r344873 "foo"
Rebase gone wrong left this in my tree.

llvm-svn: 344875
2018-10-21 21:08:37 +00:00
Craig Topper 5eea94edd4 [X86] Remove SDIVREM8_SEXT_HREG/UDIVREM8_ZEXT_HREG and their associated DAG combine and target bits support. Use a post isel peephole instead.
Summary:
These nodes exist to overcome an isel problem where we can generate a zero extend of an AH register followed by an extract subreg, and another zero extend. The first zero extend exists to avoid a partial register update copying the AH register into the low 8-bits. The second zero extend exists if the user wanted the remainder zero extended.

To make this work we had a DAG combine to morph the DIVREM opcode to a special opcode that included the extend. But then we had to add the new node to computeKnownBits and computeNumSignBits to process the extension portion.

This patch instead removes all of that and adds a late peephole to detect the two extends.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53449

llvm-svn: 344874
2018-10-21 21:07:27 +00:00
Craig Topper e367039fe5 foo
llvm-svn: 344873
2018-10-21 21:07:25 +00:00
Simon Pilgrim eb806d5f30 [X86][AVX] Enable lowerVectorShuffleAsLanePermuteAndPermute v16i16/v32i8 unary shuffle lowering
llvm-svn: 344868
2018-10-21 17:07:50 +00:00
Simon Pilgrim abc24fdb94 [X86] Only extract constant pool shuffle mask data with zero offsets
D53306 exposes an issue where we sometimes use constant pool data from bigger vectors than the target shuffle mask. This should be safe to do, but we have to be certain that we're using the bottom most part of the vector as the shuffle mask decoders have no way to peek into subvectors with non-zero offsets.

llvm-svn: 344867
2018-10-21 11:55:56 +00:00
Craig Topper 06aea1720a [X86] Move promotion of vector and/or/xor from legalization to DAG combine
Summary:
I've noticed that the bitcasts we introduce for these make computeKnownBits and computeNumSignBits not work well in LegalizeVectorOps. LegalizeVectorOps legalizes bottom up while LegalizeDAG legalizes top down. The bottom up strategy for LegalizeVectorOps means operands are legalized before their uses. So we promote and/or/xor before we legalize the operands that use them making computeKnownBits/computeNumSignBits in places like LowerTruncate suboptimal. I looked at changing LegalizeVectorOps to be top down as well, but that was more disruptive and caused some regressions. I also looked at just moving promotion of binops to LegalizeDAG, but that had a few issues one around matching AND,ANDN,OR into VSELECT because I had to create ANDN as vXi64, but the other nodes hadn't legalized yet, I didn't look too hard at fixing that.

This patch seems to produce better results overall than my other attempts. We now form broadcasts of constants better in some cases. For at least some of them the AND was being introduced in LegalizeDAG, promoted to vXi64, and the BUILD_VECTOR was also legalized there. I think we got bad ordering of that. Now the promotion is out of the legalizer so we handle this better.

In the longer term I think we really should evaluate whether we should be doing this promotion at all. It's really there to reduce isel pattern count, but I'm wondering if we'd be better served just eating the pattern cost or doing C++ based isel for vector and/or/xor in X86ISelDAGToDAG. The masked and/or/xor will definitely be difficult in patterns if a bitcast gets between the vselect and the and/or/xor node. That becomes a lot of permutations to cover.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53107

llvm-svn: 344487
2018-10-15 01:51:58 +00:00
Simon Pilgrim 861cd0ba44 [X86][AVX] Enable lowerVectorShuffleAsLanePermuteAndPermute v16i16/v32i8 shuffle lowering
Extends D53148 from v4f64 now that we have test coverage for v16i16/v32i8 shuffles.

llvm-svn: 344481
2018-10-14 17:34:20 +00:00
Craig Topper 20fa085d74 [X86] Fix bad indentation. NFC
llvm-svn: 344471
2018-10-14 04:01:40 +00:00
Craig Topper ec4b75f47a [X86] Type legalize v2f32 stores by widening to v4f32, casting to v2f64, extracting f64 and storing.
Summary: This is similar to what D52528 did for loads. It should match what generic type legalization does in 64-bit mode where it uses a v2i64 cast and an i64 store.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53173

llvm-svn: 344470
2018-10-14 03:36:27 +00:00
Simon Pilgrim c5d7c6e5f6 [X86][SSE] Remove most of vector CTTZ custom lowering and use LegalizeDAG instead.
There is one remnant - AVX1 custom splitting of 256-bit vectors - which is due to a regression where the X86ISD::ANDNP is still performed as a YMM.

I've also tightened the CTLZ or CTPOP lowering in SelectionDAGLegalize::ExpandBitCount to require a legal CTLZ - it doesn't affect existing users and fixes an issue with AVX512 codegen.

llvm-svn: 344457
2018-10-13 16:11:15 +00:00
Simon Pilgrim 1c2051ead7 [X86][SSE] Begin removing vector CTTZ custom lowering and use LegalizeDAG instead.
Adds CTTZ vector legalization support and begins the removal of the X86/SSE custom lowering. 

llvm-svn: 344453
2018-10-13 15:16:55 +00:00
Simon Pilgrim 1c6d320351 [X86][SSE] combineIncDecVector - use isConstantSplat
Use isConstantSplat instead of ISD::isConstantSplatVector to let us us peek through to illegal types (in this case for i686 targets to recognise i64 constants)

llvm-svn: 344452
2018-10-13 14:45:44 +00:00
Simon Pilgrim a03379527a [X86] Pull out target constant splat helper function. NFCI.
The code in LowerScalarImmediateShift is just a more powerful version of ISD::isConstantSplatVector.

llvm-svn: 344451
2018-10-13 14:28:40 +00:00
Simon Pilgrim 10434cbae1 Pull out repeated getOperand(). NFCI.
llvm-svn: 344450
2018-10-13 13:33:32 +00:00
Simon Pilgrim bc141724c0 Remove unused variable. NFCI.
llvm-svn: 344449
2018-10-13 13:30:10 +00:00
Simon Pilgrim f64e654d62 [X86][SSE] Improve CTTZ lowering when CTLZ is legal
If we have better CTLZ support than CTPOP, then use cttz(x) = width - ctlz(~x & (x - 1)) - and remove the CTTZ_ZERO_UNDEF handling as it no longer gives better codegen.

Similar to rL344447, this is also closer to LegalizeDAG's approach

llvm-svn: 344448
2018-10-13 13:05:19 +00:00
Simon Pilgrim afead139cf [X86][SSE] Change CTTZ vector lowering to cttz(x) = ctpop(~x & (x - 1))
This patch changes the vector CTTZ lowering from:

cttz(x) = ctpop((x & -x) - 1)

to:

cttz(x) = ctpop(~x & (x - 1))

Not only does this make better use of the PANDN instruction, but it also matches the LegalizeDAG method which should allow us to remove the x86 specific code at some point in the future (we need to fix some issues with the bitcasted logic ops and CTPOP lowering first).

Differential Revision: https://reviews.llvm.org/D53214

llvm-svn: 344447
2018-10-13 12:12:06 +00:00
Simon Pilgrim f3952413f7 [X86][AVX] Add lowerVectorShuffleAsLanePermuteAndPermute for v4f64 shuffles (PR39161)
Add shuffle lowering for the case where we can shuffle the lanes into place followed by an in-lane permute.

This is mainly for cases where we can have non-repeating permutes in each lane, but for now I've just enabled it for v4f64 unary shuffles to fix PR39161 - there is no test coverage for other shuffles that might benefit yet.

We now have several cross-lane shuffle lowering methods that all do something similar - I've looked at merging some of these (notably by making the repeated mask mechanism in lowerVectorShuffleByMerging128BitLanes optional), but there is a lot of assertions/assumptions in the way that makes this tricky - I ended up going for adding yet another relatively simple method instead.

Differential Revision: https://reviews.llvm.org/D53148

llvm-svn: 344446
2018-10-13 11:38:10 +00:00
Craig Topper 3e76b2d736 [X86] Improve type legalization of (v2i32/v4i16/v8i16 (bitcast (v2f32))) to avoid a stack stack temporary.
llvm-svn: 344425
2018-10-12 22:00:04 +00:00
Craig Topper c693a23025 [X86] Simplify the end of custom type legalization for (v2i32/v4i16/v8i8 (bitcast (f64))) by just emitting an EXTRACT_SUBVECTOR instead of a BUILD_VECTOR.
Generic legalization should be able to finish legalizing the EXTRACT_SUBVECTOR probably by turning it into a BUILD_VECTOR. But we should emit the simplest sequence.

llvm-svn: 344424
2018-10-12 22:00:00 +00:00
Craig Topper a8a44f1bec [X86] Skip (v2i32/v4i16/v8i8 (bitcast (f64))) handling in ReplaceNodeResults if the dest type can be widened by generic legalization. NFCI
The algorithm we would do previously was identical to generic legalization. If we ever switch to legalizing integer vectors via widening we'll be able to kill off the code since it now only runs for promotion.

llvm-svn: 344423
2018-10-12 21:59:58 +00:00
Sanjay Patel e28c8ecd72 [x86] add and use fast horizontal vector math subtarget feature
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen 
by default. AMD Jaguar (btver2) should have no difference with this patch because it has 
fast-hops. (If we want to set that bit for other CPUs, let me know.)

The code changes are small, but there are many test diffs. For files that are specifically 
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences 
side-by-side. For files that are primarily concerned with codegen other than hops, I just 
updated the CHECK lines to reflect the new default codegen.

To recap the recent horizontal op story:

1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. 
   Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so 
   we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but 
   probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be 
   happy. I'm not sure if we have the best predicate here, but the intent is to undo the 
   extra hop-iness that was enabled by r344141.

Differential Revision: https://reviews.llvm.org/D53095

llvm-svn: 344361
2018-10-12 16:41:02 +00:00
Eric Liu 55ab86b72b Fix unused variable warning after r344348
llvm-svn: 344350
2018-10-12 15:01:11 +00:00
Simon Pilgrim 78b5a3c3ef [X86][SSE] LowerVectorCTPOP - pull out repeated byte sum stage.
Pull out repeated byte sum stage for popcount of vector elements > 8bits.

This allows us to simplify the LUT/BITMATH popcnt code to always assume vXi8 vectors, and also improves avx512bitalg codegen which only has access to vpopcntb/vpopcntw.

llvm-svn: 344348
2018-10-12 14:18:47 +00:00
Simon Pilgrim 29279f29c8 [X86][SSE] Add extract_subvector(PSHUFB) -> PSHUFB(extract_subvector()) combine
Fixes PR32160 by reducing the size of PSHUFB if we only use one of the lanes.

This approach can probably be generalized to handle any target shuffle (and any subvector index) but we have no test coverage at the moment.

llvm-svn: 344336
2018-10-12 12:10:34 +00:00
Richard Trieu dfd1760b5f Inline variable into assert to avoid unused variable warning.
llvm-svn: 344308
2018-10-11 22:42:41 +00:00
Craig Topper 35d513c7e4 [X86] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector.
On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success.

This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency and to keep the load in the fp domain.

There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately.

Differential Revision: https://reviews.llvm.org/D52528

llvm-svn: 344291
2018-10-11 20:36:06 +00:00
Craig Topper fb2ac8969e [X86] Restore X86ISelDAGToDAG::matchBEXTRFromAnd. Teach address matching to create a BEXTR pattern from a (shl (and X, mask >> C1) if C1 can be folded into addressing mode.
This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about.

We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR.

I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that.

I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet.

Differential Revision: https://reviews.llvm.org/D53126

llvm-svn: 344270
2018-10-11 18:06:07 +00:00
Roman Lebedev 33d84c6dac [X86] Move X86DAGToDAGISel::matchBEXTRFromAnd() into X86ISelLowering
Summary:
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 | PR38938 ]],
we fail to emit `BEXTR` if the mask is shifted.
We can't deal with that in `X86DAGToDAGISel` `before the address mode for the inc is selected`,
and we can't really do it in the normal DAGCombine, because we don't have generic `ISD::BitFieldExtract` node,
and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back.
So it would seem X86ISelLowering is the place to handle this.

This patch only moves the matchBEXTRFromAnd()
from X86DAGToDAGISel to X86ISelLowering.
It does not add support for the 'shifted mask' pattern.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52426

llvm-svn: 344179
2018-10-10 20:40:12 +00:00
Sanjay Patel 6cca8af227 [x86] allow single source horizontal op matching (PR39195)
This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd 
solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature
bit to deal with it here in the DAG.

Differential Revision: https://reviews.llvm.org/D52997

llvm-svn: 344141
2018-10-10 13:39:59 +00:00
Simon Pilgrim 5cb3a82892 [TargetLowering] Add root node back to work list after successful SimplifyDemandedBits/SimplifyDemandedVectorElts
Similar to what already happens in the DAGCombiner wrappers, this patch adds the root nodes back onto the worklist if the DCI wrappers' SimplifyDemandedBits/SimplifyDemandedVectorElts were successful.

Differential Revision: https://reviews.llvm.org/D53026

llvm-svn: 344132
2018-10-10 10:44:15 +00:00
Craig Topper f6d8400869 [X86] When lowering unsigned v2i64 setcc without SSE42, flip the sign bits in the v2i64 type then bitcast to v4i32.
This may give slightly better opportunities for DAG combine to simplify with the operations before the setcc. It also matches the type the xors will eventually be promoted to anyway so it saves a legalization step.

Almost all of the test changes are because our constant pool entry is now v2i64 instead of v4i32 on 64-bit targets. On 32-bit targets getConstant should be emitting a v4i32 build_vector and a v4i32->v2i64 bitcast.

There are a couple test cases where it appears we now combine a bitwise not with one of these xors which caused a new constant vector to be generated. This prevented a constant pool entry from being shared. But if that's an issue we're concerned about, it seems we need to address it another way that just relying a bitcast to hide it.

This came about from experiments I've been trying with pushing the promotion of and/or/xor to vXi64 later than LegalizeVectorOps where it is today. We run LegalizeVectorOps in a bottom up order. So the and/or/xor are promoted before their users are legalized. The bitcasts added for the promotion act as a barrier to computeKnownBits if we try to use it during vector legalization of a later operation. So by moving the promotion out we can hopefully get better results from computeKnownBits/computeNumSignBits like in LowerTruncate on AVX512. I've also looked at running LegalizeVectorOps in a top down order like LegalizeDAG, but thats showing some other issues.

llvm-svn: 344071
2018-10-09 19:05:50 +00:00
Sanjay Patel f5fac1826a [x86] use demanded bits to simplify masked store codegen
As noted in D52747, if we prefer IR to use trunc for bool vectors rather 
than and+icmp, we can expose codegen shortcomings as seen here with masked store.

Replace a hard-coded PCMPGT simplification with the more general demanded bits call
to improve things.

Differential Revision: https://reviews.llvm.org/D52964

llvm-svn: 344048
2018-10-09 14:04:14 +00:00
Simon Pilgrim 720db8ed7b [X86][AVX1] Enable *_EXTEND_VECTOR_INREG lowering of 256-bit vectors
As discussed on D52964, this adds 256-bit *_EXTEND_VECTOR_INREG lowering support for AVX1 targets to help improve SimplifyDemandedBits handling.

Differential Revision: https://reviews.llvm.org/D52980

llvm-svn: 344019
2018-10-09 07:42:01 +00:00
Craig Topper ff9f02580d [X86] Prefer isTypeLegal over checking isSimple in a DAG combine.
Simple types are a superset of what all in tree targets in LLVM could possibly have a legal type. This means the behavior of using isSimple to check for a supported type for X86 could change over time. For example, this could would change if a v256i1 type was added to MVT in the future.

llvm-svn: 343995
2018-10-08 20:02:59 +00:00
Simon Pilgrim 6fc8d05565 [X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors
Some necessary yak shaving before lowering *_EXTEND_VECTOR_INREG 256-bit vectors on AVX1 targets as suggested by D52964.

Differential Revision: https://reviews.llvm.org/D52970

llvm-svn: 343991
2018-10-08 18:40:50 +00:00
Sanjay Patel 43bf9917cc [x86] make horizontal binop matching clearer; NFCI
The instructions are complicated, so this code will
probably never be very obvious, but hopefully this
makes it better. 

As shown in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...we need to improve the matching to not miss cases
where we're h-opping on 1 source vector, and that
should be a small patch after this rearranging.

llvm-svn: 343989
2018-10-08 18:08:02 +00:00
Simon Pilgrim 9fa1c66421 [X86] getFauxShuffleMask - Handle undef + sentinel values in subvector insertion
llvm-svn: 343926
2018-10-06 22:13:44 +00:00
Simon Pilgrim a30e8d23e2 [X86][AVX] Ensure resolveTargetShuffleInputs shuffle masks are the correct width
Don't handle ZERO_EXTEND style shuffles until we support bitcasts. Found by inspection.

llvm-svn: 343924
2018-10-06 17:18:41 +00:00
Simon Pilgrim 62d199f4e5 [X86] combinePMULDQ - add op back to worklist if SimplifyDemandedBits succeeds on either operand
Prevents missing other simplifications that may occur deep in the operand chain where CommitTargetLoweringOpt won't add the PMULDQ back to the worklist itself

llvm-svn: 343922
2018-10-06 14:51:14 +00:00
Simon Pilgrim 0cc0a24b55 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - simplify PSHUFB masks
Attempt to simplify PSHUFB masks (even non-constant ones) - we should probably be able to simplify other variable shuffles as well as the need arises.

llvm-svn: 343919
2018-10-06 13:49:31 +00:00
Simon Pilgrim ae78d709b4 [X86] Use the SimplifyDemandedBits wrappers where possible. NFCI.
Leave the wrapper to handle TargetLowering::TargetLoweringOpt and CommitTargetLoweringOpt.

llvm-svn: 343918
2018-10-06 13:29:08 +00:00
Simon Pilgrim dc97118efe [X86][AVX] Limit getFauxShuffleMask INSERT_SUBVECTOR support to 2 inputs
rL343853 didn't limit the number of subinputs, but we don't currently support faux shuffles with more than 2 total inputs, so put a limiter in place until this is fixed.

Found by Artem Dergachev.

llvm-svn: 343891
2018-10-05 21:44:19 +00:00
Craig Topper 0ed892da70 [X86] Don't promote i16 compares to i32 if the immediate will fit in 8 bits.
The comments in this code say we were trying to avoid 16-bit immediates, but if the immediate fits in 8-bits this isn't an issue. This avoids creating a zero extend that probably won't go away.

The movmskb related changes are interesting. The movmskb instruction writes a 32-bit result, but fills the upper bits with 0. So the zero_extend we were previously emitting was free, but we turned a -1 immediate that would fit in 8-bits into a 32-bit immediate so it was still bad.

llvm-svn: 343871
2018-10-05 18:13:36 +00:00
Simon Pilgrim 6c5ab48fe7 [X86][AVX] getFauxShuffleMask - add support for INSERT_SUBVECTOR subvector shuffles
Decode subvector shuffles from INSERT_SUBVECTOR(SRC0, SHUFFLE(EXTRACT_SUBVECTOR(SRC1))

This was found necessary while investigating PR39161

llvm-svn: 343853
2018-10-05 14:41:00 +00:00
Craig Topper 7d2155e3f9 [X86][LegalizeVectorOps] Use MERGE_VALUES to return two results from LowerLoad. Remove special case code in LegalizeVectorOps that allowed us to only return one result.
Previously we replaced the chain use ourself and return the data result. LegalizeVectorOps then detected that we'd done this and assumed the chain had already been handled.

This commit instead returns a MERGE_VALUES node with two results joined from nodes. This allows LegalizeVectorOps to do all the replacements for us without any special casing. The MERGE_VALUES will be removed by DAG combine.

llvm-svn: 343817
2018-10-04 21:24:24 +00:00
David Greene 4f916df29e [X86] Set correct MMO offset on scalarized load pieces
When scalarizing a load, be sure to update the offset in the
MachineMemOperand for each scalar load.

llvm-svn: 343776
2018-10-04 14:07:59 +00:00
Craig Topper 8b3c46f0a8 [X86] Merge matchANDXORWithAllOnesAsANDNP into combineANDXORWithAllOnesIntoANDNP. NFCI
It's the only caller and the logic pretty easy to combine.

llvm-svn: 343754
2018-10-04 06:13:27 +00:00
Craig Topper a65c2dbfd6 [X86] Stop promoting vector ISD::SELECT to vXi64.
The additional patterns needed for this aren't overwhelming and introducing extra bitcasts during lowering limits our ability to do computeNumSignBits. Not that I have a good example of that for select. I'm just becoming increasingly grumpy about promotion of AND/OR/XOR. SELECT was just a lot easier to fix.

llvm-svn: 343723
2018-10-03 21:10:29 +00:00
Craig Topper c39dc41b63 [X86] Add CMOV_VK2/VK4 pseudos and remove lowering code that turned v2i1/v4i1 SELECT into v8i1.
llvm-svn: 343713
2018-10-03 20:28:43 +00:00
Craig Topper 703fbde3cb [X86] Add CMOV pseudos for VR128X and VR256X register classes. Use them when AVX512VL is enabled.
This allows the phi nodes to be generated with the correct register class when expanded.

llvm-svn: 343710
2018-10-03 19:48:26 +00:00
Craig Topper 4b62c2dbda [X86] Don't break CMOV pseudo instructions down by type. Just by register class.
The register class is all that's important for the pseudo instructions. We can use patterns to handle the different types.

llvm-svn: 343709
2018-10-03 19:48:23 +00:00
Nirav Dave 925b64be64 [X86] Correctly use SSE registers if no-x87 is selected.
Fix use of SSE1 registers for f32 ops in no-x87 mode.

Notably, allow use of SSE instructions for f32 operations in 64-bit
mode (but not 32-bit which is disallowed by callign convention).

Also avoid translating memset/memcopy/memmove into SSE registers
without X87 for 32-bit mode.

This fixes PR38738.

Reviewers: nickdesaulniers, craig.topper

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D52555

llvm-svn: 343689
2018-10-03 14:13:30 +00:00
Simon Pilgrim a2efe82b81 [X86] SimplifyDemandedVectorEltsForTargetNode - remove identity target shuffles before simplifying inputs
By removing demanded target shuffles that simplify to zero/undef/identity before simplifying its inputs we improve chances of further simplification, as only the immediate parent user of the combined is added back to the work list - this still doesn't help us if its passed through other ops though (bitcasts....).

llvm-svn: 343390
2018-09-29 18:15:26 +00:00
Simon Pilgrim a93407fadf [X86][SSE] LowerScalarImmediateShift - remove 32-bit vXi64 special case handling.
This is all handled generally by getTargetConstantBitsFromNode now

llvm-svn: 343387
2018-09-29 17:36:22 +00:00
Simon Pilgrim b5737007cd Fix signed/unsigned mismatch warning. NFCI.
llvm-svn: 343385
2018-09-29 17:11:19 +00:00
Simon Pilgrim d633e290c8 [X86] getTargetConstantBitsFromNode - add support for rearranging constant bits via shuffles
Exposed an issue that recursive calls to getTargetConstantBitsFromNode don't handle changes to EltSizeInBits yet.

llvm-svn: 343384
2018-09-29 17:01:55 +00:00
Simon Pilgrim ae34ae12ef [X86][SSE] LowerScalarImmediateShift - use getTargetConstantBitsFromNode to get immediate data
Don't just attempt to find a splat build vector.

First step towards getting rid of all the 32-bit special case code.

llvm-svn: 343383
2018-09-29 16:40:35 +00:00
Simon Pilgrim a731940c60 [X86] getTargetConstantBitsFromNode - fix self-move assertions from gcc builds due to rL343375
llvm-svn: 343377
2018-09-29 14:51:09 +00:00
Simon Pilgrim 22d51014af [X86] getTargetConstantBitsFromNode - add support for peeking through ISD::EXTRACT_SUBVECTOR
llvm-svn: 343375
2018-09-29 14:17:32 +00:00
Simon Pilgrim aa77033a6b [X86][SSE] Fixed issue with v2i64 variable shifts on 32-bit targets
The shift amount might have peeked through a extract_subvector, altering the number of vector elements in the 'Amt' variable - so we were incorrectly calculating the ratio when peeking through bitcasts, resulting in incorrectly detecting splats.

llvm-svn: 343373
2018-09-29 13:25:22 +00:00
Simon Pilgrim ebabd79f43 [X86][SSE] canReduceVMulWidth - use ComputeNumSignBits/SignBitIsZero directly
Don't reinvent the wheel for BUILD_VECTOR/ZERO_EXTEND - its only the ANY_EXTEND special case that needs handling.

llvm-svn: 343096
2018-09-26 11:48:52 +00:00
Simon Pilgrim 5beaac433d [X86][SSE] Use ISD::MULHS for constant vXi16 ISD::SRA lowering (PR38151)
Similar to the existing ISD::SRL constant vector shifts from D49562, this patch adds ISD::SRA support with ISD::MULHS.

As we're dealing with signed values, we have to handle shift by zero and shift by one special cases, so XOP+AVX2/AVX512 splitting/extension is still a better solution - really we should still use ISD::MULHS if one of the special cases are used but for now I've just left a TODO and filtered by isKnownNeverZero.

Differential Revision: https://reviews.llvm.org/D52171

llvm-svn: 343093
2018-09-26 10:57:05 +00:00
Craig Topper 12c18840fa [X86] Allow movmskpd/ps ISD nodes to be created and selected with integer input types.
This removes an int->fp bitcast between the surrounding code and the movmsk. I had already added a hack to combineMOVMSK to try to look through this bitcast to improve the SimplifyDemandedBits there.

But I found an additional issue where the bitcast was preventing combineMOVMSK from being called again after earlier nodes in the DAG are optimized. The bitcast gets revisted, but not the user of the bitcast. By using integer types throughout, the bitcast doesn't get in the way.

llvm-svn: 343046
2018-09-25 23:28:27 +00:00
Simon Pilgrim 96335dd1ec [X86] combineUIntToFP - Fix UINT_TO_FP(vXi1) comment (PR39078)
llvm-svn: 343026
2018-09-25 20:52:08 +00:00
Sanjay Patel 10c11b867a [x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449)
This is the final (I hope!) problem pattern mentioned in PR37749:
https://bugs.llvm.org/show_bug.cgi?id=37749

We are trying to avoid an AVX1 sinkhole caused by having 256-bit bitwise logic ops but no other 256-bit integer ops. 
We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like 
extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches 
that can over-reach. Ie, splitting to 128-bit does not look like a win in most cases with >1 256-bit op.

The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test, 
we have this vector-type-legalized sequence:

        t29: v8i32 = concat_vectors t27, t28
      t30: v4i64 = bitcast t29
        t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, ...
      t31: v4i64 = bitcast t18
    t32: v4i64 = xor t30, t31
      t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, ...
    t34: v4i64 = bitcast t9
  t35: v4i64 = and t32, t34
t36: v8i32 = bitcast t35
      t37: v4i32 = extract_subvector t36, Constant:i64<0>
      t38: v4i32 = extract_subvector t36, Constant:i64<4>

Differential Revision: https://reviews.llvm.org/D52318

llvm-svn: 343008
2018-09-25 19:09:34 +00:00
Craig Topper 6fb1358a98 [X86] Add AVX512 support to combineVectorSizedSetCCEquality.
Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52424

llvm-svn: 342989
2018-09-25 16:27:12 +00:00
Craig Topper 9ce5da7b62 [X86] Don't create FILD ISD nodes when X87 is disabled.
The included test case previously asserted because the type legalizer tried to soften the FILD ISD node.

Fixes PR38819.

llvm-svn: 342934
2018-09-25 00:16:57 +00:00
Craig Topper aeb4930b47 [X86] Remove superfluous curly braces. NFC
llvm-svn: 342933
2018-09-25 00:16:54 +00:00
Craig Topper b7e2499e80 [X86] Update comment. Use 'glued' instead of 'flagged' NFC
llvm-svn: 342932
2018-09-25 00:16:52 +00:00
Sanjay Patel 0027946915 [DAGCombiner][x86] extend decompose of integer multiply into shift/add with negation
This is an alternative to https://reviews.llvm.org/D37896. We can't decompose 
multiplies generically without a target hook to tell us when it's profitable.

ARM and AArch64 may be able to remove some existing code that overlaps with
this transform.

This extends D52195 and may resolve PR34474: 
https://bugs.llvm.org/show_bug.cgi?id=34474
(still an open question about transforming legal vector multiplies, but we
could open another bug report for those)

llvm-svn: 342844
2018-09-23 18:41:38 +00:00
Craig Topper 3e0b4b0eb7 [X86] Fix a few typos in comments.
llvm-svn: 342829
2018-09-23 06:49:47 +00:00
Craig Topper 9995760df4 [X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)) for vXi8 vectors.
We don't have a vXi8 shift left so we need to bitcast to a vXi16 vector to perform the shift. If we let lowering legalize the vXi8 shift we get an extra and that we don't need and fail to remove.

llvm-svn: 342795
2018-09-22 05:08:38 +00:00
Sanjay Patel 8a1227ccc8 [SelectionDAG] replace duplicated peekThroughBitcast helper functions; NFCI
x86 had 2 versions of peekThroughBitcast. DAGCombiner had 1. Plus, it had a 1-off implementation for the one-use variant.
Move the x86 versions of the code to SelectionDAG, so we don't have different copies of the code. 
No functional change intended.

I'm putting this next to isBitwiseNot() because I am planning to use it in there. Another option is next to the
helpers in the ISD namespace (eg, ISD::isConstantSplatVector()). But if there's no good reason for those to be 
there, I'd prefer to pull other helpers over to SelectionDAG in follow-up steps.

Differential Revision: https://reviews.llvm.org/D52285

llvm-svn: 342669
2018-09-20 17:34:08 +00:00
Simon Pilgrim 3e2de767f6 [X86][SSE] Remove UNPCKL(SHUFFLE)->UNPCKH custom combine
This can be achieved more generally by combineX86ShufflesRecursively.

llvm-svn: 342645
2018-09-20 13:10:22 +00:00
Simon Pilgrim 46c1dcb1af [X86][SSE] Remove PSHUFLW/PSHUFHW combineRedundantHalfShuffle combine
This can be achieved more generally by combineX86ShufflesRecursively and was causing a fuzz test failure found by Mikael Holmén.

llvm-svn: 342642
2018-09-20 12:11:38 +00:00
Sanjay Patel 1a1c0ee599 [x86] change names of vector splitting helper functions; NFC
As the code comments suggest, these are about splitting, and they
are not necessarily limited to lowering, so that misled me.

There's nothing that's actually x86-specific in these either, so 
they might be better placed in a common header so any target can 
use them.

llvm-svn: 342575
2018-09-19 18:52:00 +00:00
Simon Pilgrim 8191d63c3b [X86] Add initial SimplifyDemandedVectorEltsForTargetNode support
This patch adds an initial x86 SimplifyDemandedVectorEltsForTargetNode implementation to handle target shuffles.

Currently the patch only decodes a target shuffle, calls SimplifyDemandedVectorElts on its input operands and removes any shuffle that reduces to undef/zero/identity.

Future work will need to integrate this with combineX86ShufflesRecursively, add support for other x86 ops, etc.

NOTE: There is a minor regression that appears to be affecting further (extractelement?) combines which I haven't been able to solve yet - possibly something to do with how nodes are added to the worklist after simplification.

Differential Revision: https://reviews.llvm.org/D52140

llvm-svn: 342564
2018-09-19 18:11:34 +00:00
Sanjay Patel 4fd2e2a498 [DAGCombiner][x86] add transform/hook to decompose integer multiply into shift/add
This is an alternative to D37896. I don't see a way to decompose multiplies 
generically without a target hook to tell us when it's profitable. 

ARM and AArch64 may be able to remove some duplicate code that overlaps with 
this transform.

As a first step, we're only getting the most clear wins on the vector examples
requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474

As noted in the code comment, it's likely that the x86 constraints are tighter
than necessary, but it may not always be a win to replace a pmullw/pmulld.

Differential Revision: https://reviews.llvm.org/D52195

llvm-svn: 342554
2018-09-19 15:57:40 +00:00
Simon Pilgrim e9bf71e761 [X86][SSE] LowerShift - pull out repeated getTargetVShiftUniformOpcode calls. NFCI.
llvm-svn: 342462
2018-09-18 10:44:44 +00:00
Keno Fischer c8ccaed325 [X86ISel] Implement byval lowering for Win64 calling convention
Summary:
The IR reference for the `byval` attribute states:

```
This indicates that the pointer parameter should really be passed by value
to the function. The attribute implies that a hidden copy of the pointee is
made between the caller and the callee, so the callee is unable to modify
the value in the caller. This attribute is only valid on LLVM pointer arguments.
```

However, on Win64, this attribute is unimplemented and the raw pointer is
passed to the callee instead. This is problematic, because frontend authors
relying on the implicit hidden copy (as happens for every other calling
convention) will see the passed value silently (if mutable memory) or
loudly (by means of a crash) modified because the callee treats the
location as scratch memory space it is allowed to mutate.

At this point, it's worth taking a step back to understand the context.
In most calling conventions, aggregates that are too large to be passed
in registers, instead get *copied* to the stack at a fixed (computable
from the signature) offset of the stack pointer. At the LLVM, we hide
this hidden copy behind the byval attribute. The caller passes a pointer
to the desired data and the callee receives a pointer, but these pointers
are not the same. In particular, the pointer that the callee receives
points to temporary stack memory allocated as part of the call lowering.
In most calling conventions, this pointer is never realized in registers
or memory. The temporary memory is simply defined by an implicit
offset from the stack pointer at function entry.

Win64, uniquely, works differently. The structure is still passed in
memory, but instead of being stored at an implicit memory offset, the
caller computes a pointer to the temporary memory and passes it to
the callee as a regular pointer (taking up a register, or if all
registers are taken up, an additional stack slot). Presumably, this
was done to allow eliding the copy when passing aggregates through
several functions on the stack.

This explains why ignoring the `byval` attribute mostly works on Win64.
The argument simply gets passed as a pointer and as long as we're ok
with the callee trampling all over that memory, there are no ill effects.
However, it does contradict the documentation of the `byval` attribute
which specifies that there is to be an implicit copy.

Frontends can of course work around this by never emitting the `byval`
attribute for Win64 and creating `alloca`s for the requisite temporary
stack slots (and that does appear to be what frontends are doing).
However, the presence of the `byval` attribute is not a trap for
frontend authors, since it seems to work, but silently modifies the
passed memory contrary to documentation.

I see two solutions:
- Disallow the `byval` attribute in the verifier if using the Win64
  calling convention.
- Make it work by simply emitting a temporary stack copy as we would
  with any other calling convention (frontends can of course always
  not use the attribute if they want to elide the copy).

This patch implements the second option (make it work), though I would
be fine with the first also.

Ref: https://github.com/JuliaLang/julia/issues/28338

Reviewers: rnk

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51842

llvm-svn: 342402
2018-09-17 17:37:14 +00:00
Simon Pilgrim cffa206423 [X86][SSE] Always enable ISD::SRL -> ISD::MULHU for v8i16
For constant non-uniform cases we'll never introduce more and/andn/or selects than already occur in generic pre-SSE41 ISD::SRL lowering.

llvm-svn: 342352
2018-09-16 20:28:38 +00:00
Simon Pilgrim ea069ffd44 [X86][AVX] Enable ISD::SRL -> ISD::MULHU for v16i16
Now that rL340913 has landed with improved v16i16 selects as shuffles.

llvm-svn: 342349
2018-09-16 19:20:47 +00:00
Sanjay Patel bfee5a9b42 [x86] fix uses check in broadcast transform (PR38949)
https://bugs.llvm.org/show_bug.cgi?id=38949

It's not clear to me that we even need a one-use check in this fold.
Ie, 2 independent loads might be better than a load+dependent shuffle.

Note that the existing re-use tests are not affected. We actually do form a
broadcast node in those tests now because there's no extra use of the 
insert_subvector node in those cases. But something later in isel pattern 
matching decides that it is not worth using a broadcast for the full load in 
those tests:

Legalized selection DAG: %bb.0 'test_broadcast_2f64_4f64_reuse:'
  t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
      t18: v4f64 = insert_subvector undef:v4f64, t7, Constant:i64<0>
    t20: v4f64 = insert_subvector t18, t7, Constant:i64<2>

Becomes:
  t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
    t21: v4f64 = X86ISD::SUBV_BROADCAST t7

ISEL: Starting selection on root node: t21: v4f64 = X86ISD::SUBV_BROADCAST t7
...
  Created node: t27: v4f64 = INSERT_SUBREG IMPLICIT_DEF:v4f64, t7, TargetConstant:i32<7>
  Morphed node: t21: v4f64 = VINSERTF128rr t27, t7, TargetConstant:i8<1>

llvm-svn: 342347
2018-09-16 15:41:56 +00:00
Craig Topper fe0b973fbf [X86] Remove an fp->int->fp domain crossing in LowerUINT_TO_FP_i64.
Summary: This unfortunately adds a move, but isn't that better than going to the int domain and back?

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52134

llvm-svn: 342327
2018-09-15 16:23:35 +00:00
Craig Topper 273f755da3 [X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))
Summary:
MOVMSK only care about the sign bit so we don't need the setcc to fill the whole element with 0s/1s. We can just shift the bit we're looking for into the sign bit. This saves a constant pool load.

Inspired by PR38840.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D52121

llvm-svn: 342326
2018-09-15 16:23:33 +00:00
Simon Pilgrim 32857c54d2 [X86][SSE] Lower shuffles to permute(unpack(x,y)) (PR31151)
Attempt to lower a shuffle as an unpack of elements from two inputs followed by a single-input (wider) permutation.

As long as the permutation is wider this is a win - there may be some circumstances where same size permutations would also be useful but I've left that for future work.

Differential Revision: https://reviews.llvm.org/D52043

llvm-svn: 342257
2018-09-14 18:33:31 +00:00
Nirav Dave 59ad1c8457 [X86] Fix register resizings for inline assembly register operands.
When replacing a named register input to the appropriately sized
sub/super-register. In the case of a 64-bit value being assigned to a
register in 32-bit mode, match GCC's assignment.

Reviewers: eli.friedman, craig.topper

Subscribers: nickdesaulniers, llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D51502

llvm-svn: 342175
2018-09-13 20:33:56 +00:00
Nirav Dave 2060a16dfd [X86] Cleanup pair returns. NFCI.
llvm-svn: 342174
2018-09-13 20:33:27 +00:00
Craig Topper f107123a88 [X86] Type legalize v2i32 div/rem by scalarizing rather than promoting
Summary:
Previously we type legalized v2i32 div/rem by promoting to v2i64. But we don't support div/rem of vectors so op legalization would then scalarize it using i64 scalar ops since it doesn't know about the original promotion. 64-bit scalar divides on Intel hardware are known to be slow and in 32-bit mode they require a libcall.

This patch switches type legalization to do the scalarizing itself using i32.

It looks like the division by power of 2 optimization is still kicking in and leaving the code as a vector. The division by other constant optimization doesn't kick in pre type legalization since it ignores illegal types. And previously, after type legalization we scalarized the v2i64 since we don't have v2i64 MULHS/MULHU support.

Another option might be to widen v2i32 to v4i32 so we could do division by constant optimizations, but we'd have to be careful to only do that for constant divisors or we risk scalaring to 4 scalar divides.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51325

llvm-svn: 342114
2018-09-13 06:13:37 +00:00
Craig Topper d7362a3e5f [X86] Correct the one use check from r341915.
The one use check should be on the bitcast, not the input to the bitcast.

llvm-svn: 341956
2018-09-11 16:05:03 +00:00
Craig Topper 844f035e1e [X86] In combineMOVMSK, look through int->fp bitcasts before callling SimplifyDemandedBits.
MOVMSKPS and MOVMSKPD both take FP types, but likely the operations before it are on integer types with just a int->fp bitcast between them. If the bitcast isn't used by anything else and doesn't change the element width we can look through it to simplify the integer ops.

llvm-svn: 341915
2018-09-11 08:20:02 +00:00
Craig Topper a5ae613c15 [X86] Mark the ISD::SETLT/SETLE condition codes as illegal for v32i16/v64i8 to match the other vector types.
I'm having a hard time finding a test case for this, but we should be consistent here. The fact that we canonicalize all zeros and all ones constants to vXi32 and all other constants to loads makes this hard to hit the easy DAG combine infinite loop we get for some of the other types.

llvm-svn: 341859
2018-09-10 20:31:27 +00:00
Craig Topper 3823516103 [X86] Custom type legalize (v2i32 (fp_to_uint v2f64))) without avx512vl by widening to v4i32 and v4f64 instead of v8i32 and v8f64. Make it aware of x86-experimental-vector-widening-legalization
We have isel patterns for v4i32/v4f64 that artificially widen to v8i32/v8f64 so just use that.

If x86-experimental-vector-widening-legalization is enabled, we don't need any custom legalization and can just return. I've modified the test RUN lines to cover this case.

llvm-svn: 341765
2018-09-09 20:36:36 +00:00
Craig Topper 7af5e333e7 [X86] Create paddus/psubus from narrower vectors with i8/i16 element types.
Summary:
This patch allows vectors with a power of 2 number of elements and i8/i16 element type to select paddus/psubus instructions. ReplaceNodeResults has been updated to custom widen these operations up to 128 bits like we already do for PAVG.

Another step towards fixing PR38691

Reviewers: RKSimon, spatel

Reviewed By: RKSimon, spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51818

llvm-svn: 341753
2018-09-08 19:32:58 +00:00
Craig Topper 5cbce81c91 [X86] Don't create ZERO_EXTEND_INREG/SIGN_EXTEND_INREG for v1iX vectors.
The generic type legalizer will scalarize vXi1 instructions getting rid of the vector entirely. Creating wider vector instructions is just going to prevent that.

llvm-svn: 341705
2018-09-07 20:56:03 +00:00
Craig Topper 39f48fdcbc [X86] Don't create X86ISD::AVG nodes from v1iX vectors.
The type legalizer will try to scalarize this and fail.

It looks like there's some other v1iX oddities out there too since we still generated some vector instructions.

llvm-svn: 341704
2018-09-07 20:56:01 +00:00
Craig Topper 4863313b35 [X86] Modify the the rdtscp intrinsic to return values instead of taking a pointer argument
Similar to what was recently done for addcarry/subborrow and has been done for rdrand/rdseed for a while. It's better to use two results and an explicit store in IR when the store isn't part of the semantics of the instruction. This allows store->load forwarding to happen in the middle end. Or the store to be removed if its never loaded.

Differential Revision: https://reviews.llvm.org/D51803

llvm-svn: 341698
2018-09-07 19:14:15 +00:00
Craig Topper 72964ae99e [X86] Change the addcarry and subborrow intrinsics to return 2 results and remove the pointer argument.
We should represent the store directly in IR instead. This gives the middle end a chance to remove it if it can see a load from the same address.

Differential Revision: https://reviews.llvm.org/D51769

llvm-svn: 341677
2018-09-07 16:58:39 +00:00
Craig Topper caf6672779 [X86] Add intrinsics for KTEST instructions.
These intrinsics use the same implementation as PTEST intrinsics, but use vXi1 vectors.

New clang builtins will be accompanying them shortly.

llvm-svn: 341259
2018-08-31 21:31:53 +00:00
Craig Topper b7bb9f0078 [X86] Add support for turning vXi1 shuffles into KSHIFTL/KSHIFTR.
This patch recognizes shuffles that shift elements and fill with zeros. I've copied and modified the shift matching code we use for normal vector registers to do this. I'm not sure if there's a good way to share more of this code without making the existing function more complex than it already is.

This will be used to enable kshift intrinsics in clang.

Differential Revision: https://reviews.llvm.org/D51401

llvm-svn: 341227
2018-08-31 17:17:21 +00:00
Craig Topper 83e9f928ba [X86] Don't do anything in ReplaceNodeResults for (v2i32 (fptoui/fptosi v2f32)) when -x86-experimental-vector-widening-legalization is on.
We don't need to do our own widening, the generic legalizer can do it.

llvm-svn: 341174
2018-08-31 07:05:39 +00:00
Craig Topper e9af89a78b [X86] Don't custom widen (v2i32 (setcc v2f32)) when -x86-experimental-vector-widening-legalization is in effect.
We aren't doing anything than what the generic legalizer will do so just let it do it.

llvm-svn: 341172
2018-08-31 07:05:37 +00:00
Craig Topper 1a8c99e670 [X86] Weaken an overly aggressive assert.
This assert tried to check that AND constants are only on the RHS. But its possible for both operands to be constants if one is opaque which will prevent the AND from being constant folded.

Fixes PR38771

llvm-svn: 341102
2018-08-30 19:35:38 +00:00
Simon Pilgrim 6b9bf7ecbc [X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.

TODO: As discussed on the patch, we should investigate adding VPBLENDVB handling to target shuffle combining as well, that will allow us to extend this to VPBLENDW+VPBLENDW+VPBLENDD.

Differential Revision: https://reviews.llvm.org/D50074

llvm-svn: 340913
2018-08-29 10:51:08 +00:00
Simon Pilgrim af98587095 [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
This patch creates the shift mask and actual shift using the vXi16 vector shift ops.

Differential Revision: https://reviews.llvm.org/D51263

llvm-svn: 340813
2018-08-28 10:37:29 +00:00
Simon Pilgrim f119e27d80 [X86][SSE] Avoid vector extraction/insertion for non-constant uniform shifts
As discussed on D51263, we're better off using byte shifts to clear the upper bits on pre-SSE41 hardware.

llvm-svn: 340810
2018-08-28 10:14:09 +00:00
Craig Topper c1436db753 [X86] Fix some comments to refer to KORTEST not KTEST. NFC
KTEST is a different instruction. All of this code uses KORTEST.

llvm-svn: 340799
2018-08-28 06:39:35 +00:00
Craig Topper 4be11c0585 [X86] When lowering v32i8 MULHS/MULHU, shuffle after the PACKUS rather than before.
We're using a 256-bit PACKUS to do the truncation, but that instruction operates on 128-bit lanes. So previously we shuffled first to rearrange the lanes. But that requires 2 shuffles. Instead we can shuffle after the PACKUS using a single VPERMQ. This matches what our normal LowerTRUNCATE code does when it uses PACKUS.

Differential Revision: https://reviews.llvm.org/D51284

llvm-svn: 340757
2018-08-27 17:20:41 +00:00
Craig Topper fff90377fd [X86] Add support for matching paddus patterns where one of the vectors is a constant.
InstCombine mucks these up a bit. So we need to do some additional pattern matching to fix it. There are a still a few special cases not handled, but this covers the general case.

Differential Revision: https://reviews.llvm.org/D50952

llvm-svn: 340756
2018-08-27 17:20:38 +00:00
Craig Topper 73ed2a2a6b [X86] Cleanup the LowerMULH code by hoisting some commonalities between the vXi32 and vXi8 handling. NFCI
vXi32 support was recently moved from LowerMUL_LOHI to LowerMULH.

This commit shares the getOperand calls, switches both to use common IsSigned flag, and hoists the NumElems/NumElts variable.

llvm-svn: 340720
2018-08-27 06:35:02 +00:00
Sanjay Patel 113cac3b15 [SelectionDAG][x86] turn insertelement into undef with variable index into splat
I noticed this along with the patterns in D51125, but when the index is variable, 
we don't convert insertelement into a build_vector.

For x86, that means these get expanded at legalization time into the loading/spilling 
code that we see in the tests. I think it's always better to avoid going to memory on 
these, and we get the optimal 'broadcast' if it's available.

I suspect other targets may want to look at enabling the hook. AArch64 and AMDGPU have 
regression tests that would be affected (although I did not check what would happen in 
those cases). In the most basic cases shown here, AArch64 would probably do much 
better with a splat.

Differential Revision: https://reviews.llvm.org/D51186

llvm-svn: 340705
2018-08-26 18:20:41 +00:00
Craig Topper 4240ecb909 [X86] Fix typo in comment, expect->except. NFC
llvm-svn: 340695
2018-08-26 03:43:23 +00:00
Craig Topper ebec2793d1 [X86] Replace support for vXi32 SMUL_LOHI/UMUL_LOHI with MULHS/MULHU support instead.
Summary:
The only time vector SMUL_LOHI/UMUL_LOHI nodes are created is during division/remainder lowering. If its created before op legalization, generic DAGCombine immediately turns that SMUL_LOHI/UMUL_LOHI into a MULHS/MULHU since only the upper half is used. That node will stick around through vector op legalization and will be turned back into UMUL_LOHI/SMUL_LOHI during op legalization. It will then be custom lowered by the X86 backend. Due to this two step lowering the vector shuffles created by the custom lowering get legalized after their inputs rather than before. This prevents the shuffles from being combined with any build_vector of constants.

This patch uses changes vXi32 to use MULHS/MULHU instead. This is what the later DAG combine did anyway. But by skipping the change back to UMUL_LOHI/SMUL_LOHI we lower it before any constant BUILD_VECTORS. This allows the vector_shuffle creation to constant fold with the build_vectors. This accounts for the test changes here.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51254

llvm-svn: 340690
2018-08-25 18:01:24 +00:00
Craig Topper a11a3b3818 [SelectionDAG][X86] Reorder the operands the MaskedStoreSDNode to put the value first.
Summary:
Previously the value being stored is the last operand in SDNode. This causes the type legalizer to visit the mask operand before the value operand. The type legalizer was more complicated because of this since we want the type of the value to drive the decisions.

This patch moves the value to be the first operand so we visit it first during type legalization. It also simplifies the type legalization code accordingly.

X86 is currently the only in tree target that uses this SDNode. Not sure if there are any users out of tree.

Reviewers: RKSimon, delena, hfinkel, eli.friedman

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D50402

llvm-svn: 340689
2018-08-25 17:48:17 +00:00
Craig Topper bce8680605 [X86] Make sure type is a vector before calling VT.getVectorNumElements() in combineLoopMAddPattern
Fixes PR38700.

llvm-svn: 340688
2018-08-25 17:23:43 +00:00
Sanjay Patel 8a84c747d2 [x86] try harder to use broadcast to load a scalar into vector reg
This is a preliminary step for a preliminary step for D50992. 
I noticed that x86 often misses chances to load a scalar directly 
into a vector register.

So this patch is just allowing more of those cases to match a 
broadcast op in lowerBuildVectorAsBroadcast(). The old code comment 
said it doesn't make sense to use a broadcast when we're loading a 
single element and everything else is undef, but I think that's the 
best case in the improved tests in insert-loaded-scalar.ll. We avoid 
scalar-to-vector-register move and/or less efficient shuffling.

Note that there are some existing types that were already producing 
a broadcast, but that happens semi-accidentally. Ie, it's not 
happening as part of lowerBuildVectorAsBroadcast(). The build vector 
gets expanded into load + shuffle, and then shuffle lowering produces 
the broadcast.

Description of the other test diffs:
1. avx-basic.ll - replacing load+shufle is a win.
2. sse3-avx-addsub-2.ll - vmovddup vs. vbroadcastss is neutral
3. sse41.ll - don't care - we convert that intrinsic to generic IR now, so this test is deprecated
4. vector-shuffle-128-v8.ll / vector-shuffle-256-v16.ll - pshufb alternatives with an extra instruction are not obviously bad

Differential Revision: https://reviews.llvm.org/D51125

llvm-svn: 340685
2018-08-25 14:56:05 +00:00
Craig Topper 4058e29e7d [X86] Teach combineLoopMAddPattern to handle cases where there is no loop and the add has two multiply inputs
Differential Revision: https://reviews.llvm.org/D50868

llvm-svn: 340631
2018-08-24 18:05:04 +00:00
Chandler Carruth ae0cafece8 [x86/retpoline] Split the LLVM concept of retpolines into separate
subtarget features for indirect calls and indirect branches.

This is in preparation for enabling *only* the call retpolines when
using speculative load hardening.

I've continued to use subtarget features for now as they continue to
seem the best fit given the lack of other retpoline like constructs so
far.

The LLVM side is pretty simple. I'd like to eventually get rid of the
old feature, but not sure what backwards compatibility issues that will
cause.

This does remove the "implies" from requesting an external thunk. This
always seemed somewhat questionable and is now clearly not desirable --
you specify a thunk the same way no matter which set of things are
getting retpolines.

I really want to keep this nicely isolated from end users and just an
LLVM implementation detail, so I've moved the `-mretpoline` flag in
Clang to no longer rely on a specific subtarget feature by that name and
instead to be directly handled. In some ways this is simpler, but in
order to preserve existing behavior I've had to add some fallback code
so that users who relied on merely passing -mretpoline-external-thunk
continue to get the same behavior. We should eventually remove this
I suspect (we have never tested that it works!) but I've not done that
in this patch.

Differential Revision: https://reviews.llvm.org/D51150

llvm-svn: 340515
2018-08-23 06:06:38 +00:00
Craig Topper cf9df99d79 [X86] Teach combineLoopSADPattern to handle cases where there is no loop and the add has two absolute difference inputs
Previously we asumed a vector reduction add is part of a loop and one of the input is a phi. But the code in SelectionDAGBuilder that sets vector reduction flag handles more cases than that. It just requires that the use chain ends in a horizontal reduction. And there are no other uses. This means it can handle unrolled reduction loops.

If the initial value of the reduction was 0, an unrolled loop would begin with a vector reduction add that has two sad inputs. Previously we would only transform one side of the add, but for this case we need to transform both sides.

I've created a lambda to reuse some of the code for both sides. And fixed the variables names to remove reference to "phi".

Differential Revision: https://reviews.llvm.org/D50817

llvm-svn: 340478
2018-08-22 23:19:01 +00:00
Simon Pilgrim ffdfe45645 [X86][SSE] LowerMULH vXi8 - use SSE shifts directly.
We know these vXi16 extended cases are legal constant splat shifts.

llvm-svn: 340414
2018-08-22 15:37:11 +00:00
Simon Pilgrim 50eba6b380 [X86][SSE] Lower vXi8 general shifts to SSE shifts directly. NFCI.
Most of these shifts are extended to vXi16 so we don't gain anything from forcing another round of generic shift lowering - we know these extended cases are legal constant splat shifts.

llvm-svn: 340307
2018-08-21 17:27:03 +00:00
Simon Pilgrim 98eb4ae499 [X86][SSE] Lower v8i16 general shifts to SSE shifts directly. NFCI.
We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts.

llvm-svn: 340302
2018-08-21 17:05:07 +00:00
Simon Pilgrim dbe4e9e3ff [X86][SSE] Lower directly to SSE shifts in the BLEND(SHIFT, SHIFT) combine. NFCI.
We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts.

llvm-svn: 340300
2018-08-21 16:46:48 +00:00
Simon Pilgrim 5a83a1fd13 [X86][SSE] Add helper function to convert to/between the SSE vector shift opcodes. NFCI.
Also remove some more getOpcode calls from LowerShift when we already have Opc.

llvm-svn: 340290
2018-08-21 15:57:33 +00:00
Craig Topper 210ccfe3db [X86] Prevent lowerVectorShuffleByMerging128BitLanes from creating cycles
Due to some splat handling code in getVectorShuffle, its possible for NewV1/NewV2 to have their mask modified from what is requested. This can lead to cycles being created in the DAG.

This patch examines the returned mask and makes sure its different. Long term we may need to look closer at that splat code in getVectorShuffle, or add more splat awareness to getVectorShuffle.

Fixes PR38639

Differential Revision: https://reviews.llvm.org/D50981

llvm-svn: 340214
2018-08-20 21:08:35 +00:00
Craig Topper 7dcb2c4b0a [X86] Teach combineTruncatedArithmetic to handle some cases of ISD::SUB
We can safely avoid interfering with the subus combine if both inputs are freely truncatable. Either both extends, or an extend and a constant vector.

Differential Revision: https://reviews.llvm.org/D50878

llvm-svn: 340212
2018-08-20 20:57:35 +00:00
Craig Topper 803912ea57 [X86] Fix an issue in the matching for ADDUS.
We were basically assuming only one operand of the compare could be an ADD node and using that to swap operands. But we can have a normal add followed by a saturing add.

This rewrites the canonicalization to just be based on the condition code.

llvm-svn: 340134
2018-08-19 04:26:31 +00:00
Craig Topper 2b03df9b05 [X86] Use SDValue::operator== instead of DAG.isEqualTo in strictly integer matching.
isEqualTo is more useful for floating point. operator== is sufficient for integer.

llvm-svn: 340130
2018-08-18 19:16:56 +00:00
Craig Topper 3e299d896f [X86] Simplify the PADDUS legality check in combineSelect to match PSUBUS. NFC
While there remove some trailing whitespace.

llvm-svn: 340129
2018-08-18 18:51:04 +00:00
Craig Topper 40c9559b74 [X86] Add support for using 512-bit PSUBUS to combineSelect.
The code already support 128 and 256 and even knows to split 256 for AVX1. So we really just needed to stop looking for specific VTs and subtarget features and just look for legal VTs with i8/i16 elements.

While there, add some curly braces around outer if statement bodies that contain only another if. It makes all the closing curly braces look more regular.

llvm-svn: 340128
2018-08-18 18:51:03 +00:00
Craig Topper 62dd1b1b4f [X86] Remove detectAddSubSatPattern.
This was added very recently in r339650, but appears to be completely untested and has at least one bug in it.

llvm-svn: 340086
2018-08-17 21:19:28 +00:00
Simon Pilgrim 2f48122cc9 [X86][SSE] Lower constant vXi8 ISD::SRL/ISD::SRA using PMULLW
Extending the concept introduced in D49562, this patch lowers constant vXi8 ISD::SRL/ISD::SRA by zero/sign extending to vXi16 and using PMULLW and then truncating the high 8 bits of the result.

Differential Revision: https://reviews.llvm.org/D50781

llvm-svn: 340062
2018-08-17 18:03:11 +00:00
Craig Topper 730890dbdb [X86] Use hasOneUse instead of isOnlyUserOf. NFCI
isOnlyUserOf is a little heavier because it allows the node to be used multiple times by the other node. In this case we are looking at a truncate which only has one operand so we know it can only use it once. Thus hasOneUse is better.

llvm-svn: 340059
2018-08-17 17:57:25 +00:00
Francis Visoiu Mistrih 8bff832534 [X86] Fix liveness information when expanding X86::EH_SjLj_LongJmp64
test/CodeGen/X86/shadow-stack.ll has the following machine verifier
errors:

```
*** Bad machine code: Using a killed virtual register ***
- function:    bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: %3:gr64 = MOV64rm killed %2:gr64, 1, $noreg, 8, $noreg
- operand 1:   killed %2:gr64

*** Bad machine code: Using a killed virtual register ***
- function:    bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: $rsp = MOV64rm killed %2:gr64, 1, $noreg, 16, $noreg
- operand 1:   killed %2:gr64

*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function:    bar
- basic block: %bb.2 entry (0x7fdc818574f8)
Virtual register %2 is used after the block.
```

The fix here is to only copy the machine operand's register without the
kill flags for all the instructions except the very last one of the
sequence.

I had to insert dummy PHIs in the test case to force the NoPHI function
property to be set to false. More on this here: https://llvm.org/PR38439

Differential Revision: https://reviews.llvm.org/D50260

llvm-svn: 340033
2018-08-17 14:46:56 +00:00
Chandler Carruth c73c0307fe [MI] Change the array of `MachineMemOperand` pointers to be
a generically extensible collection of extra info attached to
a `MachineInstr`.

The primary change here is cleaning up the APIs used for setting and
manipulating the `MachineMemOperand` pointer arrays so chat we can
change how they are allocated.

Then we introduce an extra info object that using the trailing object
pattern to attach some number of MMOs but also other extra info. The
design of this is specifically so that this extra info has a fixed
necessary cost (the header tracking what extra info is included) and
everything else can be tail allocated. This pattern works especially
well with a `BumpPtrAllocator` which we use here.

I've also added the basic scaffolding for putting interesting pointers
into this, namely pre- and post-instruction symbols. These aren't used
anywhere yet, they're just there to ensure I've actually gotten the data
structure types correct. I'll flesh out support for these in
a subsequent patch (MIR dumping, parsing, the works).

Finally, I've included an optimization where we store any single pointer
inline in the `MachineInstr` to avoid the allocation overhead. This is
expected to be the overwhelmingly most common case and so should avoid
any memory usage growth due to slightly less clever / dense allocation
when dealing with >1 MMO. This did require several ergonomic
improvements to the `PointerSumType` to reasonably support the various
usage models.

This also has a side effect of freeing up 8 bits within the
`MachineInstr` which could be repurposed for something else.

The suggested direction here came largely from Hal Finkel. I hope it was
worth it. ;] It does hopefully clear a path for subsequent extensions
w/o nearly as much leg work. Lots of thanks to Reid and Justin for
careful reviews and ideas about how to do all of this.

Differential Revision: https://reviews.llvm.org/D50701

llvm-svn: 339940
2018-08-16 21:30:05 +00:00
Craig Topper 08e082619a [X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation
To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.

This fixes PR35833

Differential Revison: https://reviews.llvm.org/D41794

llvm-svn: 339818
2018-08-15 21:21:52 +00:00
Simon Pilgrim f3b5943ffc Remove lambda default argument to fix gcc pedantic warning.
llvm-svn: 339767
2018-08-15 12:32:09 +00:00
Craig Topper 633fe98e27 [X86] Change legacy SSE scalar fp to integer intrinsics to use specific ISD opcodes instead of keeping as intrinsics. Unify SSE and AVX512 isel patterns.
AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics.

The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics.

Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns.

llvm-svn: 339749
2018-08-15 01:23:00 +00:00
Simon Pilgrim 2ce3d6e135 [X86][SSE] Avoid duplicate shuffle input sources in combineX86ShufflesRecursively
rL339686 added the case where a faux shuffle might have repeated shuffle inputs coming from either side of the OR().

This patch improves the insertion of the inputs into the source ops lists to account for this, as well as making it trivial to add support for shuffles with more than 2 inputs in the future.

llvm-svn: 339696
2018-08-14 17:22:37 +00:00
Simon Pilgrim ed55138247 [X86][SSE] Add shuffle combine support for OR(PSHUFB,PSHUFB) style patterns.
If each element is zero from one (or both) inputs then we can combine these into a single shuffle mask.

llvm-svn: 339686
2018-08-14 16:00:05 +00:00
Simon Pilgrim df9880f257 [X86][SSE] Generalize lowerVectorShuffleAsBlendOfPSHUFBs to work with any vXi8 type.
We still only use this for v16i8, but this cleans up the code to support v32i8/v64i8 sometime in the future.

llvm-svn: 339679
2018-08-14 14:00:14 +00:00
Tomasz Krupa 86a63889f3 [X86] Lowering addus/subus intrinsics to native IR
Summary: This revision improves previous version (rL330322) which has been reverted due to crashes.

This is the patch that lowers x86 intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that IR emits the same
machine instructions as the intrinsics.

Reviewers: craig.topper, spatel, RKSimon

Reviewed By: craig.topper

Subscribers: mike.dvoretsky, DavidKreitzer, sroland, llvm-commits

Differential Revision: https://reviews.llvm.org/D46179

llvm-svn: 339650
2018-08-14 08:00:56 +00:00
Simon Pilgrim 130b00bc43 [X86][SSE] Pull out repeated shift getOpcode() calls. NFCI.
llvm-svn: 339425
2018-08-10 11:42:42 +00:00
Craig Topper 9a8136f7b4 [X86] Qualify one of the heuristics in combineMul to only apply to positive multiply amounts.
This seems to slightly help the performance of one of our internal benchmarks. We probably need better heuristics here.

llvm-svn: 339406
2018-08-09 23:27:42 +00:00
Simon Pilgrim 511c3fc529 [X86][SSE] Remove PMULDQ/PMULUDQ by zero
Exposed by D50328

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339337
2018-08-09 12:37:36 +00:00
Simon Pilgrim 01ae462fef [X86][SSE] Combine (some) target shuffles with multiple uses
As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.

This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool.

However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit.

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339335
2018-08-09 12:30:02 +00:00
Craig Topper 9de1797c50 [SelectionDAG][X86] Rename MaskedLoadSDNode::getSrc0 to getPassThru.
Src0 doesn't really convey any meaning to what the operand is. Passthru matches what's used in the documentation for the intrinsic this comes from.

llvm-svn: 339101
2018-08-07 06:52:49 +00:00
Craig Topper 17989208a9 [SelectionDAG][X86] Rename getValue to getPassThru for gather SDNodes.
getValue is more meaningful name for scatter than it is for gather. Split them and use getPassThru for gather.

llvm-svn: 339096
2018-08-07 06:13:40 +00:00
Easwaran Raman 10fd92dd94 [X86] Recognize a splat of negate in isFNEG
Summary:
Expand isFNEG so that we generate the appropriate F(N)M(ADD|SUB)
instructions in more cases. For example, the following sequence
a = _mm256_broadcast_ss(f)
d = _mm256_fnmadd_ps(a, b, c)

generates an fsub and fma without this patch and an fnma with this
change.

Reviewers: craig.topper

Subscribers: llvm-commits, davidxl, wmi

Differential Revision: https://reviews.llvm.org/D48467

llvm-svn: 339043
2018-08-06 19:23:38 +00:00
Craig Topper feb2a58860 [X86] Add a DAG combine for the __builtin_parity idiom used by clang to enable better codegen
Clang uses "ctpop & 1" to implement __builtin_parity. If the popcnt instruction isn't supported this generates a large amount of code to calculate the population count. Instead we can bisect the data down to a single byte using xor and then check the parity flag.

Even when popcnt is supported, its still a good idea to split 64-bit data on 32-bit targets using an xor in front of a single popcnt. Otherwise we get two popcnts and an add before the and.

I've specifically targeted this at the sizes supported by clang builtins, but we could generalize this if we think that's useful.

Differential Revision: https://reviews.llvm.org/D50165

llvm-svn: 338907
2018-08-03 18:00:29 +00:00
Craig Topper e902b7d0b0 [X86] Support fp128 and/or/xor/load/store with VEX and EVEX encoded instructions.
Move all the patterns to X86InstrVecCompiler.td so we can keep SSE/AVX/AVX512 all in one place.

To save some patterns we'll use an existing DAG combine to convert f128 fand/for/fxor to integer when sse2 is enabled. This allows use to reuse all the existing patterns for v2i64.

I believe this now makes SHA instructions the only case where VEX/EVEX and legacy encoded instructions could be generated simultaneously.

llvm-svn: 338821
2018-08-03 06:12:56 +00:00
Craig Topper 2c095444a4 [X86] Prevent promotion of i16 add/sub/and/or/xor to i32 if we can fold an atomic load and atomic store.
This makes them consistent with i8/i32/i64. Which still seems to be more aggressive on folding than icc, gcc, or MSVC.

llvm-svn: 338795
2018-08-03 00:37:34 +00:00
Simon Pilgrim 8b16e15d47 [X86][SSE] Pull out duplicate VSELECT to shuffle mask code. NFCI.
llvm-svn: 338693
2018-08-02 09:20:27 +00:00
Reid Kleckner a30a6d2c29 Load from the GOT for external symbols in the large, PIC code model
Do the same handling for external symbols that we do for jump table
symbols and global values.

Fixes one of the cases in PR38385

llvm-svn: 338651
2018-08-01 22:56:05 +00:00
Craig Topper c985d42903 [X86] Canonicalize the pattern for __builtin_ffs in a similar way to '__builtin_ffs + 5'
We now emit a move of -1 before the cmov and do the addition after the cmov just like the case with an extra addition.

This may be slightly worse for code size, but is more consistent with other compilers. And we might be able to hoist the mov -1 outside of loops.

llvm-svn: 338613
2018-08-01 18:38:46 +00:00
Simon Pilgrim 931ebe3be1 [X86] Assign from a brace initializer to match style guide. NFCI.
llvm-svn: 338598
2018-08-01 17:43:38 +00:00
Simon Pilgrim a3548c960e [SelectionDAG] Make binop reduction matcher available to all targets
There is nothing x86-specific about this code, so it'd be nice to make this available for other targets to use in the future (and get it out of X86ISelLowering!).

Differential Revision: https://reviews.llvm.org/D50083

llvm-svn: 338586
2018-08-01 16:52:28 +00:00
Simon Pilgrim 564762cf32 [X86] Use isNullConstant helper. NFCI.
llvm-svn: 338530
2018-08-01 13:06:14 +00:00
Simon Pilgrim e447a273bd [X86] Use isNullConstant helper. NFCI.
llvm-svn: 338516
2018-08-01 11:24:11 +00:00
Craig Topper 65a1388881 [X86] When looking for (CMOV C-1, (ADD (CTTZ X), C), (X != 0)) -> (ADD (CMOV (CTTZ X), -1, (X != 0)), C), make sure we really have a compare with 0.
It's not strictly required by the transform of the cmov and the add, but it makes sure we restrict it to the cases we know we want to match.

While there canonicalize the operand order of the cmov to simplify the matching and emitting code.

llvm-svn: 338492
2018-08-01 06:36:20 +00:00
Simon Pilgrim 5d9b00d15b [X86][SSE] Use ISD::MULHU for constant/non-zero ISD::SRL lowering (PR38151)
As was done for vector rotations, we can efficiently use ISD::MULHU for vXi8/vXi16 ISD::SRL lowering.

Shift-by-zero cases are still problematic (mainly on v32i8 due to extra AND/ANDN/OR or VPBLENDVB blend masks but v8i16/v16i16 aren't great either if PBLENDW fails) so I've limited this first patch to known non-zero cases if we can't easily use PBLENDW.

Differential Revision: https://reviews.llvm.org/D49562

llvm-svn: 338407
2018-07-31 18:05:56 +00:00
Craig Topper bef126fb71 [X86] Add pattern matching for PMADDUBSW
Summary:
Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned.

A C example that triggers this pattern

```
static const int N = 128;

int8_t A[2*N];
uint8_t B[2*N];
int16_t C[N];

void foo() {
  for (int i = 0; i != N; ++i)
    C[i] = MIN(MAX((int16_t)A[2*i]*(int16_t)B[2*i] + (int16_t)A[2*i+1]*(int16_t)B[2*i+1], -32768), 32767);
}
```

Reviewers: RKSimon, spatel, zvi

Reviewed By: RKSimon, zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49829

llvm-svn: 338402
2018-07-31 17:12:08 +00:00
Simon Pilgrim 99d475f97d [X86][SSE] isFNEG - Use getTargetConstantBitsFromNode to handle all constant cases
isFNEG was duplicating much of what was done by getTargetConstantBitsFromNode in its own calls to getTargetConstantFromNode.

Noticed while reviewing D48467.

llvm-svn: 338358
2018-07-31 10:13:17 +00:00
Fangrui Song f78650a8de Remove trailing space
sed -Ei 's/[[:space:]]+$//' include/**/*.{def,h,td} lib/**/*.{cpp,h}

llvm-svn: 338293
2018-07-30 19:41:25 +00:00
Craig Topper f014ec9b3b [X86] Fix typo in comment. NFC
llvm-svn: 338274
2018-07-30 17:34:31 +00:00
Matt Arsenault 81920b0a25 DAG: Add calling convention argument to calling convention funcs
This seems like a pretty glaring omission, and AMDGPU
wants to treat kernels differently from other calling
conventions.

llvm-svn: 338194
2018-07-28 13:25:19 +00:00
Craig Topper c3e11bf3f7 [X86] Add support expanding multiplies by constant where the constant is -3/-5/-9 multplied by a power of 2.
These can be replaced with an LEA, a shift, and a negate. This seems to match what gcc and icc would do.

llvm-svn: 338174
2018-07-27 23:04:59 +00:00
Craig Topper 561e298e29 [X86] Remove an unnecessary 'if' that prevented treating INT64_MAX and -INT64_MAX as power of 2 minus 1 in the multiply expansion code.
Not sure why they were being explicitly excluded, but I believe all the math inside the if works. I changed the absolute value to be uint64_t instead of int64_t so INT64_MIN+1 wouldn't be signed wrap.

llvm-svn: 338101
2018-07-27 05:56:27 +00:00