Commit Graph

5955 Commits

Author SHA1 Message Date
Craig Topper c693a23025 [X86] Simplify the end of custom type legalization for (v2i32/v4i16/v8i8 (bitcast (f64))) by just emitting an EXTRACT_SUBVECTOR instead of a BUILD_VECTOR.
Generic legalization should be able to finish legalizing the EXTRACT_SUBVECTOR probably by turning it into a BUILD_VECTOR. But we should emit the simplest sequence.

llvm-svn: 344424
2018-10-12 22:00:00 +00:00
Craig Topper a8a44f1bec [X86] Skip (v2i32/v4i16/v8i8 (bitcast (f64))) handling in ReplaceNodeResults if the dest type can be widened by generic legalization. NFCI
The algorithm we would do previously was identical to generic legalization. If we ever switch to legalizing integer vectors via widening we'll be able to kill off the code since it now only runs for promotion.

llvm-svn: 344423
2018-10-12 21:59:58 +00:00
Sanjay Patel e28c8ecd72 [x86] add and use fast horizontal vector math subtarget feature
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen 
by default. AMD Jaguar (btver2) should have no difference with this patch because it has 
fast-hops. (If we want to set that bit for other CPUs, let me know.)

The code changes are small, but there are many test diffs. For files that are specifically 
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences 
side-by-side. For files that are primarily concerned with codegen other than hops, I just 
updated the CHECK lines to reflect the new default codegen.

To recap the recent horizontal op story:

1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. 
   Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so 
   we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but 
   probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be 
   happy. I'm not sure if we have the best predicate here, but the intent is to undo the 
   extra hop-iness that was enabled by r344141.

Differential Revision: https://reviews.llvm.org/D53095

llvm-svn: 344361
2018-10-12 16:41:02 +00:00
Eric Liu 55ab86b72b Fix unused variable warning after r344348
llvm-svn: 344350
2018-10-12 15:01:11 +00:00
Simon Pilgrim 78b5a3c3ef [X86][SSE] LowerVectorCTPOP - pull out repeated byte sum stage.
Pull out repeated byte sum stage for popcount of vector elements > 8bits.

This allows us to simplify the LUT/BITMATH popcnt code to always assume vXi8 vectors, and also improves avx512bitalg codegen which only has access to vpopcntb/vpopcntw.

llvm-svn: 344348
2018-10-12 14:18:47 +00:00
Simon Pilgrim 29279f29c8 [X86][SSE] Add extract_subvector(PSHUFB) -> PSHUFB(extract_subvector()) combine
Fixes PR32160 by reducing the size of PSHUFB if we only use one of the lanes.

This approach can probably be generalized to handle any target shuffle (and any subvector index) but we have no test coverage at the moment.

llvm-svn: 344336
2018-10-12 12:10:34 +00:00
Richard Trieu dfd1760b5f Inline variable into assert to avoid unused variable warning.
llvm-svn: 344308
2018-10-11 22:42:41 +00:00
Craig Topper 35d513c7e4 [X86] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector.
On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success.

This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency and to keep the load in the fp domain.

There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately.

Differential Revision: https://reviews.llvm.org/D52528

llvm-svn: 344291
2018-10-11 20:36:06 +00:00
Craig Topper fb2ac8969e [X86] Restore X86ISelDAGToDAG::matchBEXTRFromAnd. Teach address matching to create a BEXTR pattern from a (shl (and X, mask >> C1) if C1 can be folded into addressing mode.
This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about.

We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR.

I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that.

I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet.

Differential Revision: https://reviews.llvm.org/D53126

llvm-svn: 344270
2018-10-11 18:06:07 +00:00
Roman Lebedev 33d84c6dac [X86] Move X86DAGToDAGISel::matchBEXTRFromAnd() into X86ISelLowering
Summary:
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 | PR38938 ]],
we fail to emit `BEXTR` if the mask is shifted.
We can't deal with that in `X86DAGToDAGISel` `before the address mode for the inc is selected`,
and we can't really do it in the normal DAGCombine, because we don't have generic `ISD::BitFieldExtract` node,
and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back.
So it would seem X86ISelLowering is the place to handle this.

This patch only moves the matchBEXTRFromAnd()
from X86DAGToDAGISel to X86ISelLowering.
It does not add support for the 'shifted mask' pattern.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52426

llvm-svn: 344179
2018-10-10 20:40:12 +00:00
Sanjay Patel 6cca8af227 [x86] allow single source horizontal op matching (PR39195)
This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd 
solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature
bit to deal with it here in the DAG.

Differential Revision: https://reviews.llvm.org/D52997

llvm-svn: 344141
2018-10-10 13:39:59 +00:00
Simon Pilgrim 5cb3a82892 [TargetLowering] Add root node back to work list after successful SimplifyDemandedBits/SimplifyDemandedVectorElts
Similar to what already happens in the DAGCombiner wrappers, this patch adds the root nodes back onto the worklist if the DCI wrappers' SimplifyDemandedBits/SimplifyDemandedVectorElts were successful.

Differential Revision: https://reviews.llvm.org/D53026

llvm-svn: 344132
2018-10-10 10:44:15 +00:00
Craig Topper f6d8400869 [X86] When lowering unsigned v2i64 setcc without SSE42, flip the sign bits in the v2i64 type then bitcast to v4i32.
This may give slightly better opportunities for DAG combine to simplify with the operations before the setcc. It also matches the type the xors will eventually be promoted to anyway so it saves a legalization step.

Almost all of the test changes are because our constant pool entry is now v2i64 instead of v4i32 on 64-bit targets. On 32-bit targets getConstant should be emitting a v4i32 build_vector and a v4i32->v2i64 bitcast.

There are a couple test cases where it appears we now combine a bitwise not with one of these xors which caused a new constant vector to be generated. This prevented a constant pool entry from being shared. But if that's an issue we're concerned about, it seems we need to address it another way that just relying a bitcast to hide it.

This came about from experiments I've been trying with pushing the promotion of and/or/xor to vXi64 later than LegalizeVectorOps where it is today. We run LegalizeVectorOps in a bottom up order. So the and/or/xor are promoted before their users are legalized. The bitcasts added for the promotion act as a barrier to computeKnownBits if we try to use it during vector legalization of a later operation. So by moving the promotion out we can hopefully get better results from computeKnownBits/computeNumSignBits like in LowerTruncate on AVX512. I've also looked at running LegalizeVectorOps in a top down order like LegalizeDAG, but thats showing some other issues.

llvm-svn: 344071
2018-10-09 19:05:50 +00:00
Sanjay Patel f5fac1826a [x86] use demanded bits to simplify masked store codegen
As noted in D52747, if we prefer IR to use trunc for bool vectors rather 
than and+icmp, we can expose codegen shortcomings as seen here with masked store.

Replace a hard-coded PCMPGT simplification with the more general demanded bits call
to improve things.

Differential Revision: https://reviews.llvm.org/D52964

llvm-svn: 344048
2018-10-09 14:04:14 +00:00
Simon Pilgrim 720db8ed7b [X86][AVX1] Enable *_EXTEND_VECTOR_INREG lowering of 256-bit vectors
As discussed on D52964, this adds 256-bit *_EXTEND_VECTOR_INREG lowering support for AVX1 targets to help improve SimplifyDemandedBits handling.

Differential Revision: https://reviews.llvm.org/D52980

llvm-svn: 344019
2018-10-09 07:42:01 +00:00
Craig Topper ff9f02580d [X86] Prefer isTypeLegal over checking isSimple in a DAG combine.
Simple types are a superset of what all in tree targets in LLVM could possibly have a legal type. This means the behavior of using isSimple to check for a supported type for X86 could change over time. For example, this could would change if a v256i1 type was added to MVT in the future.

llvm-svn: 343995
2018-10-08 20:02:59 +00:00
Simon Pilgrim 6fc8d05565 [X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors
Some necessary yak shaving before lowering *_EXTEND_VECTOR_INREG 256-bit vectors on AVX1 targets as suggested by D52964.

Differential Revision: https://reviews.llvm.org/D52970

llvm-svn: 343991
2018-10-08 18:40:50 +00:00
Sanjay Patel 43bf9917cc [x86] make horizontal binop matching clearer; NFCI
The instructions are complicated, so this code will
probably never be very obvious, but hopefully this
makes it better. 

As shown in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...we need to improve the matching to not miss cases
where we're h-opping on 1 source vector, and that
should be a small patch after this rearranging.

llvm-svn: 343989
2018-10-08 18:08:02 +00:00
Simon Pilgrim 9fa1c66421 [X86] getFauxShuffleMask - Handle undef + sentinel values in subvector insertion
llvm-svn: 343926
2018-10-06 22:13:44 +00:00
Simon Pilgrim a30e8d23e2 [X86][AVX] Ensure resolveTargetShuffleInputs shuffle masks are the correct width
Don't handle ZERO_EXTEND style shuffles until we support bitcasts. Found by inspection.

llvm-svn: 343924
2018-10-06 17:18:41 +00:00
Simon Pilgrim 62d199f4e5 [X86] combinePMULDQ - add op back to worklist if SimplifyDemandedBits succeeds on either operand
Prevents missing other simplifications that may occur deep in the operand chain where CommitTargetLoweringOpt won't add the PMULDQ back to the worklist itself

llvm-svn: 343922
2018-10-06 14:51:14 +00:00
Simon Pilgrim 0cc0a24b55 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - simplify PSHUFB masks
Attempt to simplify PSHUFB masks (even non-constant ones) - we should probably be able to simplify other variable shuffles as well as the need arises.

llvm-svn: 343919
2018-10-06 13:49:31 +00:00
Simon Pilgrim ae78d709b4 [X86] Use the SimplifyDemandedBits wrappers where possible. NFCI.
Leave the wrapper to handle TargetLowering::TargetLoweringOpt and CommitTargetLoweringOpt.

llvm-svn: 343918
2018-10-06 13:29:08 +00:00
Simon Pilgrim dc97118efe [X86][AVX] Limit getFauxShuffleMask INSERT_SUBVECTOR support to 2 inputs
rL343853 didn't limit the number of subinputs, but we don't currently support faux shuffles with more than 2 total inputs, so put a limiter in place until this is fixed.

Found by Artem Dergachev.

llvm-svn: 343891
2018-10-05 21:44:19 +00:00
Craig Topper 0ed892da70 [X86] Don't promote i16 compares to i32 if the immediate will fit in 8 bits.
The comments in this code say we were trying to avoid 16-bit immediates, but if the immediate fits in 8-bits this isn't an issue. This avoids creating a zero extend that probably won't go away.

The movmskb related changes are interesting. The movmskb instruction writes a 32-bit result, but fills the upper bits with 0. So the zero_extend we were previously emitting was free, but we turned a -1 immediate that would fit in 8-bits into a 32-bit immediate so it was still bad.

llvm-svn: 343871
2018-10-05 18:13:36 +00:00
Simon Pilgrim 6c5ab48fe7 [X86][AVX] getFauxShuffleMask - add support for INSERT_SUBVECTOR subvector shuffles
Decode subvector shuffles from INSERT_SUBVECTOR(SRC0, SHUFFLE(EXTRACT_SUBVECTOR(SRC1))

This was found necessary while investigating PR39161

llvm-svn: 343853
2018-10-05 14:41:00 +00:00
Craig Topper 7d2155e3f9 [X86][LegalizeVectorOps] Use MERGE_VALUES to return two results from LowerLoad. Remove special case code in LegalizeVectorOps that allowed us to only return one result.
Previously we replaced the chain use ourself and return the data result. LegalizeVectorOps then detected that we'd done this and assumed the chain had already been handled.

This commit instead returns a MERGE_VALUES node with two results joined from nodes. This allows LegalizeVectorOps to do all the replacements for us without any special casing. The MERGE_VALUES will be removed by DAG combine.

llvm-svn: 343817
2018-10-04 21:24:24 +00:00
David Greene 4f916df29e [X86] Set correct MMO offset on scalarized load pieces
When scalarizing a load, be sure to update the offset in the
MachineMemOperand for each scalar load.

llvm-svn: 343776
2018-10-04 14:07:59 +00:00
Craig Topper 8b3c46f0a8 [X86] Merge matchANDXORWithAllOnesAsANDNP into combineANDXORWithAllOnesIntoANDNP. NFCI
It's the only caller and the logic pretty easy to combine.

llvm-svn: 343754
2018-10-04 06:13:27 +00:00
Craig Topper a65c2dbfd6 [X86] Stop promoting vector ISD::SELECT to vXi64.
The additional patterns needed for this aren't overwhelming and introducing extra bitcasts during lowering limits our ability to do computeNumSignBits. Not that I have a good example of that for select. I'm just becoming increasingly grumpy about promotion of AND/OR/XOR. SELECT was just a lot easier to fix.

llvm-svn: 343723
2018-10-03 21:10:29 +00:00
Craig Topper c39dc41b63 [X86] Add CMOV_VK2/VK4 pseudos and remove lowering code that turned v2i1/v4i1 SELECT into v8i1.
llvm-svn: 343713
2018-10-03 20:28:43 +00:00
Craig Topper 703fbde3cb [X86] Add CMOV pseudos for VR128X and VR256X register classes. Use them when AVX512VL is enabled.
This allows the phi nodes to be generated with the correct register class when expanded.

llvm-svn: 343710
2018-10-03 19:48:26 +00:00
Craig Topper 4b62c2dbda [X86] Don't break CMOV pseudo instructions down by type. Just by register class.
The register class is all that's important for the pseudo instructions. We can use patterns to handle the different types.

llvm-svn: 343709
2018-10-03 19:48:23 +00:00
Nirav Dave 925b64be64 [X86] Correctly use SSE registers if no-x87 is selected.
Fix use of SSE1 registers for f32 ops in no-x87 mode.

Notably, allow use of SSE instructions for f32 operations in 64-bit
mode (but not 32-bit which is disallowed by callign convention).

Also avoid translating memset/memcopy/memmove into SSE registers
without X87 for 32-bit mode.

This fixes PR38738.

Reviewers: nickdesaulniers, craig.topper

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D52555

llvm-svn: 343689
2018-10-03 14:13:30 +00:00
Simon Pilgrim a2efe82b81 [X86] SimplifyDemandedVectorEltsForTargetNode - remove identity target shuffles before simplifying inputs
By removing demanded target shuffles that simplify to zero/undef/identity before simplifying its inputs we improve chances of further simplification, as only the immediate parent user of the combined is added back to the work list - this still doesn't help us if its passed through other ops though (bitcasts....).

llvm-svn: 343390
2018-09-29 18:15:26 +00:00
Simon Pilgrim a93407fadf [X86][SSE] LowerScalarImmediateShift - remove 32-bit vXi64 special case handling.
This is all handled generally by getTargetConstantBitsFromNode now

llvm-svn: 343387
2018-09-29 17:36:22 +00:00
Simon Pilgrim b5737007cd Fix signed/unsigned mismatch warning. NFCI.
llvm-svn: 343385
2018-09-29 17:11:19 +00:00
Simon Pilgrim d633e290c8 [X86] getTargetConstantBitsFromNode - add support for rearranging constant bits via shuffles
Exposed an issue that recursive calls to getTargetConstantBitsFromNode don't handle changes to EltSizeInBits yet.

llvm-svn: 343384
2018-09-29 17:01:55 +00:00
Simon Pilgrim ae34ae12ef [X86][SSE] LowerScalarImmediateShift - use getTargetConstantBitsFromNode to get immediate data
Don't just attempt to find a splat build vector.

First step towards getting rid of all the 32-bit special case code.

llvm-svn: 343383
2018-09-29 16:40:35 +00:00
Simon Pilgrim a731940c60 [X86] getTargetConstantBitsFromNode - fix self-move assertions from gcc builds due to rL343375
llvm-svn: 343377
2018-09-29 14:51:09 +00:00
Simon Pilgrim 22d51014af [X86] getTargetConstantBitsFromNode - add support for peeking through ISD::EXTRACT_SUBVECTOR
llvm-svn: 343375
2018-09-29 14:17:32 +00:00
Simon Pilgrim aa77033a6b [X86][SSE] Fixed issue with v2i64 variable shifts on 32-bit targets
The shift amount might have peeked through a extract_subvector, altering the number of vector elements in the 'Amt' variable - so we were incorrectly calculating the ratio when peeking through bitcasts, resulting in incorrectly detecting splats.

llvm-svn: 343373
2018-09-29 13:25:22 +00:00
Simon Pilgrim ebabd79f43 [X86][SSE] canReduceVMulWidth - use ComputeNumSignBits/SignBitIsZero directly
Don't reinvent the wheel for BUILD_VECTOR/ZERO_EXTEND - its only the ANY_EXTEND special case that needs handling.

llvm-svn: 343096
2018-09-26 11:48:52 +00:00
Simon Pilgrim 5beaac433d [X86][SSE] Use ISD::MULHS for constant vXi16 ISD::SRA lowering (PR38151)
Similar to the existing ISD::SRL constant vector shifts from D49562, this patch adds ISD::SRA support with ISD::MULHS.

As we're dealing with signed values, we have to handle shift by zero and shift by one special cases, so XOP+AVX2/AVX512 splitting/extension is still a better solution - really we should still use ISD::MULHS if one of the special cases are used but for now I've just left a TODO and filtered by isKnownNeverZero.

Differential Revision: https://reviews.llvm.org/D52171

llvm-svn: 343093
2018-09-26 10:57:05 +00:00
Craig Topper 12c18840fa [X86] Allow movmskpd/ps ISD nodes to be created and selected with integer input types.
This removes an int->fp bitcast between the surrounding code and the movmsk. I had already added a hack to combineMOVMSK to try to look through this bitcast to improve the SimplifyDemandedBits there.

But I found an additional issue where the bitcast was preventing combineMOVMSK from being called again after earlier nodes in the DAG are optimized. The bitcast gets revisted, but not the user of the bitcast. By using integer types throughout, the bitcast doesn't get in the way.

llvm-svn: 343046
2018-09-25 23:28:27 +00:00
Simon Pilgrim 96335dd1ec [X86] combineUIntToFP - Fix UINT_TO_FP(vXi1) comment (PR39078)
llvm-svn: 343026
2018-09-25 20:52:08 +00:00
Sanjay Patel 10c11b867a [x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449)
This is the final (I hope!) problem pattern mentioned in PR37749:
https://bugs.llvm.org/show_bug.cgi?id=37749

We are trying to avoid an AVX1 sinkhole caused by having 256-bit bitwise logic ops but no other 256-bit integer ops. 
We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like 
extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches 
that can over-reach. Ie, splitting to 128-bit does not look like a win in most cases with >1 256-bit op.

The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test, 
we have this vector-type-legalized sequence:

        t29: v8i32 = concat_vectors t27, t28
      t30: v4i64 = bitcast t29
        t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, ...
      t31: v4i64 = bitcast t18
    t32: v4i64 = xor t30, t31
      t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, ...
    t34: v4i64 = bitcast t9
  t35: v4i64 = and t32, t34
t36: v8i32 = bitcast t35
      t37: v4i32 = extract_subvector t36, Constant:i64<0>
      t38: v4i32 = extract_subvector t36, Constant:i64<4>

Differential Revision: https://reviews.llvm.org/D52318

llvm-svn: 343008
2018-09-25 19:09:34 +00:00
Craig Topper 6fb1358a98 [X86] Add AVX512 support to combineVectorSizedSetCCEquality.
Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52424

llvm-svn: 342989
2018-09-25 16:27:12 +00:00
Craig Topper 9ce5da7b62 [X86] Don't create FILD ISD nodes when X87 is disabled.
The included test case previously asserted because the type legalizer tried to soften the FILD ISD node.

Fixes PR38819.

llvm-svn: 342934
2018-09-25 00:16:57 +00:00
Craig Topper aeb4930b47 [X86] Remove superfluous curly braces. NFC
llvm-svn: 342933
2018-09-25 00:16:54 +00:00
Craig Topper b7e2499e80 [X86] Update comment. Use 'glued' instead of 'flagged' NFC
llvm-svn: 342932
2018-09-25 00:16:52 +00:00
Sanjay Patel 0027946915 [DAGCombiner][x86] extend decompose of integer multiply into shift/add with negation
This is an alternative to https://reviews.llvm.org/D37896. We can't decompose 
multiplies generically without a target hook to tell us when it's profitable.

ARM and AArch64 may be able to remove some existing code that overlaps with
this transform.

This extends D52195 and may resolve PR34474: 
https://bugs.llvm.org/show_bug.cgi?id=34474
(still an open question about transforming legal vector multiplies, but we
could open another bug report for those)

llvm-svn: 342844
2018-09-23 18:41:38 +00:00
Craig Topper 3e0b4b0eb7 [X86] Fix a few typos in comments.
llvm-svn: 342829
2018-09-23 06:49:47 +00:00
Craig Topper 9995760df4 [X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)) for vXi8 vectors.
We don't have a vXi8 shift left so we need to bitcast to a vXi16 vector to perform the shift. If we let lowering legalize the vXi8 shift we get an extra and that we don't need and fail to remove.

llvm-svn: 342795
2018-09-22 05:08:38 +00:00
Sanjay Patel 8a1227ccc8 [SelectionDAG] replace duplicated peekThroughBitcast helper functions; NFCI
x86 had 2 versions of peekThroughBitcast. DAGCombiner had 1. Plus, it had a 1-off implementation for the one-use variant.
Move the x86 versions of the code to SelectionDAG, so we don't have different copies of the code. 
No functional change intended.

I'm putting this next to isBitwiseNot() because I am planning to use it in there. Another option is next to the
helpers in the ISD namespace (eg, ISD::isConstantSplatVector()). But if there's no good reason for those to be 
there, I'd prefer to pull other helpers over to SelectionDAG in follow-up steps.

Differential Revision: https://reviews.llvm.org/D52285

llvm-svn: 342669
2018-09-20 17:34:08 +00:00
Simon Pilgrim 3e2de767f6 [X86][SSE] Remove UNPCKL(SHUFFLE)->UNPCKH custom combine
This can be achieved more generally by combineX86ShufflesRecursively.

llvm-svn: 342645
2018-09-20 13:10:22 +00:00
Simon Pilgrim 46c1dcb1af [X86][SSE] Remove PSHUFLW/PSHUFHW combineRedundantHalfShuffle combine
This can be achieved more generally by combineX86ShufflesRecursively and was causing a fuzz test failure found by Mikael Holmén.

llvm-svn: 342642
2018-09-20 12:11:38 +00:00
Sanjay Patel 1a1c0ee599 [x86] change names of vector splitting helper functions; NFC
As the code comments suggest, these are about splitting, and they
are not necessarily limited to lowering, so that misled me.

There's nothing that's actually x86-specific in these either, so 
they might be better placed in a common header so any target can 
use them.

llvm-svn: 342575
2018-09-19 18:52:00 +00:00
Simon Pilgrim 8191d63c3b [X86] Add initial SimplifyDemandedVectorEltsForTargetNode support
This patch adds an initial x86 SimplifyDemandedVectorEltsForTargetNode implementation to handle target shuffles.

Currently the patch only decodes a target shuffle, calls SimplifyDemandedVectorElts on its input operands and removes any shuffle that reduces to undef/zero/identity.

Future work will need to integrate this with combineX86ShufflesRecursively, add support for other x86 ops, etc.

NOTE: There is a minor regression that appears to be affecting further (extractelement?) combines which I haven't been able to solve yet - possibly something to do with how nodes are added to the worklist after simplification.

Differential Revision: https://reviews.llvm.org/D52140

llvm-svn: 342564
2018-09-19 18:11:34 +00:00
Sanjay Patel 4fd2e2a498 [DAGCombiner][x86] add transform/hook to decompose integer multiply into shift/add
This is an alternative to D37896. I don't see a way to decompose multiplies 
generically without a target hook to tell us when it's profitable. 

ARM and AArch64 may be able to remove some duplicate code that overlaps with 
this transform.

As a first step, we're only getting the most clear wins on the vector examples
requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474

As noted in the code comment, it's likely that the x86 constraints are tighter
than necessary, but it may not always be a win to replace a pmullw/pmulld.

Differential Revision: https://reviews.llvm.org/D52195

llvm-svn: 342554
2018-09-19 15:57:40 +00:00
Simon Pilgrim e9bf71e761 [X86][SSE] LowerShift - pull out repeated getTargetVShiftUniformOpcode calls. NFCI.
llvm-svn: 342462
2018-09-18 10:44:44 +00:00
Keno Fischer c8ccaed325 [X86ISel] Implement byval lowering for Win64 calling convention
Summary:
The IR reference for the `byval` attribute states:

```
This indicates that the pointer parameter should really be passed by value
to the function. The attribute implies that a hidden copy of the pointee is
made between the caller and the callee, so the callee is unable to modify
the value in the caller. This attribute is only valid on LLVM pointer arguments.
```

However, on Win64, this attribute is unimplemented and the raw pointer is
passed to the callee instead. This is problematic, because frontend authors
relying on the implicit hidden copy (as happens for every other calling
convention) will see the passed value silently (if mutable memory) or
loudly (by means of a crash) modified because the callee treats the
location as scratch memory space it is allowed to mutate.

At this point, it's worth taking a step back to understand the context.
In most calling conventions, aggregates that are too large to be passed
in registers, instead get *copied* to the stack at a fixed (computable
from the signature) offset of the stack pointer. At the LLVM, we hide
this hidden copy behind the byval attribute. The caller passes a pointer
to the desired data and the callee receives a pointer, but these pointers
are not the same. In particular, the pointer that the callee receives
points to temporary stack memory allocated as part of the call lowering.
In most calling conventions, this pointer is never realized in registers
or memory. The temporary memory is simply defined by an implicit
offset from the stack pointer at function entry.

Win64, uniquely, works differently. The structure is still passed in
memory, but instead of being stored at an implicit memory offset, the
caller computes a pointer to the temporary memory and passes it to
the callee as a regular pointer (taking up a register, or if all
registers are taken up, an additional stack slot). Presumably, this
was done to allow eliding the copy when passing aggregates through
several functions on the stack.

This explains why ignoring the `byval` attribute mostly works on Win64.
The argument simply gets passed as a pointer and as long as we're ok
with the callee trampling all over that memory, there are no ill effects.
However, it does contradict the documentation of the `byval` attribute
which specifies that there is to be an implicit copy.

Frontends can of course work around this by never emitting the `byval`
attribute for Win64 and creating `alloca`s for the requisite temporary
stack slots (and that does appear to be what frontends are doing).
However, the presence of the `byval` attribute is not a trap for
frontend authors, since it seems to work, but silently modifies the
passed memory contrary to documentation.

I see two solutions:
- Disallow the `byval` attribute in the verifier if using the Win64
  calling convention.
- Make it work by simply emitting a temporary stack copy as we would
  with any other calling convention (frontends can of course always
  not use the attribute if they want to elide the copy).

This patch implements the second option (make it work), though I would
be fine with the first also.

Ref: https://github.com/JuliaLang/julia/issues/28338

Reviewers: rnk

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51842

llvm-svn: 342402
2018-09-17 17:37:14 +00:00
Simon Pilgrim cffa206423 [X86][SSE] Always enable ISD::SRL -> ISD::MULHU for v8i16
For constant non-uniform cases we'll never introduce more and/andn/or selects than already occur in generic pre-SSE41 ISD::SRL lowering.

llvm-svn: 342352
2018-09-16 20:28:38 +00:00
Simon Pilgrim ea069ffd44 [X86][AVX] Enable ISD::SRL -> ISD::MULHU for v16i16
Now that rL340913 has landed with improved v16i16 selects as shuffles.

llvm-svn: 342349
2018-09-16 19:20:47 +00:00
Sanjay Patel bfee5a9b42 [x86] fix uses check in broadcast transform (PR38949)
https://bugs.llvm.org/show_bug.cgi?id=38949

It's not clear to me that we even need a one-use check in this fold.
Ie, 2 independent loads might be better than a load+dependent shuffle.

Note that the existing re-use tests are not affected. We actually do form a
broadcast node in those tests now because there's no extra use of the 
insert_subvector node in those cases. But something later in isel pattern 
matching decides that it is not worth using a broadcast for the full load in 
those tests:

Legalized selection DAG: %bb.0 'test_broadcast_2f64_4f64_reuse:'
  t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
      t18: v4f64 = insert_subvector undef:v4f64, t7, Constant:i64<0>
    t20: v4f64 = insert_subvector t18, t7, Constant:i64<2>

Becomes:
  t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
    t21: v4f64 = X86ISD::SUBV_BROADCAST t7

ISEL: Starting selection on root node: t21: v4f64 = X86ISD::SUBV_BROADCAST t7
...
  Created node: t27: v4f64 = INSERT_SUBREG IMPLICIT_DEF:v4f64, t7, TargetConstant:i32<7>
  Morphed node: t21: v4f64 = VINSERTF128rr t27, t7, TargetConstant:i8<1>

llvm-svn: 342347
2018-09-16 15:41:56 +00:00
Craig Topper fe0b973fbf [X86] Remove an fp->int->fp domain crossing in LowerUINT_TO_FP_i64.
Summary: This unfortunately adds a move, but isn't that better than going to the int domain and back?

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52134

llvm-svn: 342327
2018-09-15 16:23:35 +00:00
Craig Topper 273f755da3 [X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))
Summary:
MOVMSK only care about the sign bit so we don't need the setcc to fill the whole element with 0s/1s. We can just shift the bit we're looking for into the sign bit. This saves a constant pool load.

Inspired by PR38840.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D52121

llvm-svn: 342326
2018-09-15 16:23:33 +00:00
Simon Pilgrim 32857c54d2 [X86][SSE] Lower shuffles to permute(unpack(x,y)) (PR31151)
Attempt to lower a shuffle as an unpack of elements from two inputs followed by a single-input (wider) permutation.

As long as the permutation is wider this is a win - there may be some circumstances where same size permutations would also be useful but I've left that for future work.

Differential Revision: https://reviews.llvm.org/D52043

llvm-svn: 342257
2018-09-14 18:33:31 +00:00
Nirav Dave 59ad1c8457 [X86] Fix register resizings for inline assembly register operands.
When replacing a named register input to the appropriately sized
sub/super-register. In the case of a 64-bit value being assigned to a
register in 32-bit mode, match GCC's assignment.

Reviewers: eli.friedman, craig.topper

Subscribers: nickdesaulniers, llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D51502

llvm-svn: 342175
2018-09-13 20:33:56 +00:00
Nirav Dave 2060a16dfd [X86] Cleanup pair returns. NFCI.
llvm-svn: 342174
2018-09-13 20:33:27 +00:00
Craig Topper f107123a88 [X86] Type legalize v2i32 div/rem by scalarizing rather than promoting
Summary:
Previously we type legalized v2i32 div/rem by promoting to v2i64. But we don't support div/rem of vectors so op legalization would then scalarize it using i64 scalar ops since it doesn't know about the original promotion. 64-bit scalar divides on Intel hardware are known to be slow and in 32-bit mode they require a libcall.

This patch switches type legalization to do the scalarizing itself using i32.

It looks like the division by power of 2 optimization is still kicking in and leaving the code as a vector. The division by other constant optimization doesn't kick in pre type legalization since it ignores illegal types. And previously, after type legalization we scalarized the v2i64 since we don't have v2i64 MULHS/MULHU support.

Another option might be to widen v2i32 to v4i32 so we could do division by constant optimizations, but we'd have to be careful to only do that for constant divisors or we risk scalaring to 4 scalar divides.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51325

llvm-svn: 342114
2018-09-13 06:13:37 +00:00
Craig Topper d7362a3e5f [X86] Correct the one use check from r341915.
The one use check should be on the bitcast, not the input to the bitcast.

llvm-svn: 341956
2018-09-11 16:05:03 +00:00
Craig Topper 844f035e1e [X86] In combineMOVMSK, look through int->fp bitcasts before callling SimplifyDemandedBits.
MOVMSKPS and MOVMSKPD both take FP types, but likely the operations before it are on integer types with just a int->fp bitcast between them. If the bitcast isn't used by anything else and doesn't change the element width we can look through it to simplify the integer ops.

llvm-svn: 341915
2018-09-11 08:20:02 +00:00
Craig Topper a5ae613c15 [X86] Mark the ISD::SETLT/SETLE condition codes as illegal for v32i16/v64i8 to match the other vector types.
I'm having a hard time finding a test case for this, but we should be consistent here. The fact that we canonicalize all zeros and all ones constants to vXi32 and all other constants to loads makes this hard to hit the easy DAG combine infinite loop we get for some of the other types.

llvm-svn: 341859
2018-09-10 20:31:27 +00:00
Craig Topper 3823516103 [X86] Custom type legalize (v2i32 (fp_to_uint v2f64))) without avx512vl by widening to v4i32 and v4f64 instead of v8i32 and v8f64. Make it aware of x86-experimental-vector-widening-legalization
We have isel patterns for v4i32/v4f64 that artificially widen to v8i32/v8f64 so just use that.

If x86-experimental-vector-widening-legalization is enabled, we don't need any custom legalization and can just return. I've modified the test RUN lines to cover this case.

llvm-svn: 341765
2018-09-09 20:36:36 +00:00
Craig Topper 7af5e333e7 [X86] Create paddus/psubus from narrower vectors with i8/i16 element types.
Summary:
This patch allows vectors with a power of 2 number of elements and i8/i16 element type to select paddus/psubus instructions. ReplaceNodeResults has been updated to custom widen these operations up to 128 bits like we already do for PAVG.

Another step towards fixing PR38691

Reviewers: RKSimon, spatel

Reviewed By: RKSimon, spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51818

llvm-svn: 341753
2018-09-08 19:32:58 +00:00
Craig Topper 5cbce81c91 [X86] Don't create ZERO_EXTEND_INREG/SIGN_EXTEND_INREG for v1iX vectors.
The generic type legalizer will scalarize vXi1 instructions getting rid of the vector entirely. Creating wider vector instructions is just going to prevent that.

llvm-svn: 341705
2018-09-07 20:56:03 +00:00
Craig Topper 39f48fdcbc [X86] Don't create X86ISD::AVG nodes from v1iX vectors.
The type legalizer will try to scalarize this and fail.

It looks like there's some other v1iX oddities out there too since we still generated some vector instructions.

llvm-svn: 341704
2018-09-07 20:56:01 +00:00
Craig Topper 4863313b35 [X86] Modify the the rdtscp intrinsic to return values instead of taking a pointer argument
Similar to what was recently done for addcarry/subborrow and has been done for rdrand/rdseed for a while. It's better to use two results and an explicit store in IR when the store isn't part of the semantics of the instruction. This allows store->load forwarding to happen in the middle end. Or the store to be removed if its never loaded.

Differential Revision: https://reviews.llvm.org/D51803

llvm-svn: 341698
2018-09-07 19:14:15 +00:00
Craig Topper 72964ae99e [X86] Change the addcarry and subborrow intrinsics to return 2 results and remove the pointer argument.
We should represent the store directly in IR instead. This gives the middle end a chance to remove it if it can see a load from the same address.

Differential Revision: https://reviews.llvm.org/D51769

llvm-svn: 341677
2018-09-07 16:58:39 +00:00
Craig Topper caf6672779 [X86] Add intrinsics for KTEST instructions.
These intrinsics use the same implementation as PTEST intrinsics, but use vXi1 vectors.

New clang builtins will be accompanying them shortly.

llvm-svn: 341259
2018-08-31 21:31:53 +00:00
Craig Topper b7bb9f0078 [X86] Add support for turning vXi1 shuffles into KSHIFTL/KSHIFTR.
This patch recognizes shuffles that shift elements and fill with zeros. I've copied and modified the shift matching code we use for normal vector registers to do this. I'm not sure if there's a good way to share more of this code without making the existing function more complex than it already is.

This will be used to enable kshift intrinsics in clang.

Differential Revision: https://reviews.llvm.org/D51401

llvm-svn: 341227
2018-08-31 17:17:21 +00:00
Craig Topper 83e9f928ba [X86] Don't do anything in ReplaceNodeResults for (v2i32 (fptoui/fptosi v2f32)) when -x86-experimental-vector-widening-legalization is on.
We don't need to do our own widening, the generic legalizer can do it.

llvm-svn: 341174
2018-08-31 07:05:39 +00:00
Craig Topper e9af89a78b [X86] Don't custom widen (v2i32 (setcc v2f32)) when -x86-experimental-vector-widening-legalization is in effect.
We aren't doing anything than what the generic legalizer will do so just let it do it.

llvm-svn: 341172
2018-08-31 07:05:37 +00:00
Craig Topper 1a8c99e670 [X86] Weaken an overly aggressive assert.
This assert tried to check that AND constants are only on the RHS. But its possible for both operands to be constants if one is opaque which will prevent the AND from being constant folded.

Fixes PR38771

llvm-svn: 341102
2018-08-30 19:35:38 +00:00
Simon Pilgrim 6b9bf7ecbc [X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.

TODO: As discussed on the patch, we should investigate adding VPBLENDVB handling to target shuffle combining as well, that will allow us to extend this to VPBLENDW+VPBLENDW+VPBLENDD.

Differential Revision: https://reviews.llvm.org/D50074

llvm-svn: 340913
2018-08-29 10:51:08 +00:00
Simon Pilgrim af98587095 [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
This patch creates the shift mask and actual shift using the vXi16 vector shift ops.

Differential Revision: https://reviews.llvm.org/D51263

llvm-svn: 340813
2018-08-28 10:37:29 +00:00
Simon Pilgrim f119e27d80 [X86][SSE] Avoid vector extraction/insertion for non-constant uniform shifts
As discussed on D51263, we're better off using byte shifts to clear the upper bits on pre-SSE41 hardware.

llvm-svn: 340810
2018-08-28 10:14:09 +00:00
Craig Topper c1436db753 [X86] Fix some comments to refer to KORTEST not KTEST. NFC
KTEST is a different instruction. All of this code uses KORTEST.

llvm-svn: 340799
2018-08-28 06:39:35 +00:00
Craig Topper 4be11c0585 [X86] When lowering v32i8 MULHS/MULHU, shuffle after the PACKUS rather than before.
We're using a 256-bit PACKUS to do the truncation, but that instruction operates on 128-bit lanes. So previously we shuffled first to rearrange the lanes. But that requires 2 shuffles. Instead we can shuffle after the PACKUS using a single VPERMQ. This matches what our normal LowerTRUNCATE code does when it uses PACKUS.

Differential Revision: https://reviews.llvm.org/D51284

llvm-svn: 340757
2018-08-27 17:20:41 +00:00
Craig Topper fff90377fd [X86] Add support for matching paddus patterns where one of the vectors is a constant.
InstCombine mucks these up a bit. So we need to do some additional pattern matching to fix it. There are a still a few special cases not handled, but this covers the general case.

Differential Revision: https://reviews.llvm.org/D50952

llvm-svn: 340756
2018-08-27 17:20:38 +00:00
Craig Topper 73ed2a2a6b [X86] Cleanup the LowerMULH code by hoisting some commonalities between the vXi32 and vXi8 handling. NFCI
vXi32 support was recently moved from LowerMUL_LOHI to LowerMULH.

This commit shares the getOperand calls, switches both to use common IsSigned flag, and hoists the NumElems/NumElts variable.

llvm-svn: 340720
2018-08-27 06:35:02 +00:00
Sanjay Patel 113cac3b15 [SelectionDAG][x86] turn insertelement into undef with variable index into splat
I noticed this along with the patterns in D51125, but when the index is variable, 
we don't convert insertelement into a build_vector.

For x86, that means these get expanded at legalization time into the loading/spilling 
code that we see in the tests. I think it's always better to avoid going to memory on 
these, and we get the optimal 'broadcast' if it's available.

I suspect other targets may want to look at enabling the hook. AArch64 and AMDGPU have 
regression tests that would be affected (although I did not check what would happen in 
those cases). In the most basic cases shown here, AArch64 would probably do much 
better with a splat.

Differential Revision: https://reviews.llvm.org/D51186

llvm-svn: 340705
2018-08-26 18:20:41 +00:00
Craig Topper 4240ecb909 [X86] Fix typo in comment, expect->except. NFC
llvm-svn: 340695
2018-08-26 03:43:23 +00:00
Craig Topper ebec2793d1 [X86] Replace support for vXi32 SMUL_LOHI/UMUL_LOHI with MULHS/MULHU support instead.
Summary:
The only time vector SMUL_LOHI/UMUL_LOHI nodes are created is during division/remainder lowering. If its created before op legalization, generic DAGCombine immediately turns that SMUL_LOHI/UMUL_LOHI into a MULHS/MULHU since only the upper half is used. That node will stick around through vector op legalization and will be turned back into UMUL_LOHI/SMUL_LOHI during op legalization. It will then be custom lowered by the X86 backend. Due to this two step lowering the vector shuffles created by the custom lowering get legalized after their inputs rather than before. This prevents the shuffles from being combined with any build_vector of constants.

This patch uses changes vXi32 to use MULHS/MULHU instead. This is what the later DAG combine did anyway. But by skipping the change back to UMUL_LOHI/SMUL_LOHI we lower it before any constant BUILD_VECTORS. This allows the vector_shuffle creation to constant fold with the build_vectors. This accounts for the test changes here.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D51254

llvm-svn: 340690
2018-08-25 18:01:24 +00:00
Craig Topper a11a3b3818 [SelectionDAG][X86] Reorder the operands the MaskedStoreSDNode to put the value first.
Summary:
Previously the value being stored is the last operand in SDNode. This causes the type legalizer to visit the mask operand before the value operand. The type legalizer was more complicated because of this since we want the type of the value to drive the decisions.

This patch moves the value to be the first operand so we visit it first during type legalization. It also simplifies the type legalization code accordingly.

X86 is currently the only in tree target that uses this SDNode. Not sure if there are any users out of tree.

Reviewers: RKSimon, delena, hfinkel, eli.friedman

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D50402

llvm-svn: 340689
2018-08-25 17:48:17 +00:00
Craig Topper bce8680605 [X86] Make sure type is a vector before calling VT.getVectorNumElements() in combineLoopMAddPattern
Fixes PR38700.

llvm-svn: 340688
2018-08-25 17:23:43 +00:00
Sanjay Patel 8a84c747d2 [x86] try harder to use broadcast to load a scalar into vector reg
This is a preliminary step for a preliminary step for D50992. 
I noticed that x86 often misses chances to load a scalar directly 
into a vector register.

So this patch is just allowing more of those cases to match a 
broadcast op in lowerBuildVectorAsBroadcast(). The old code comment 
said it doesn't make sense to use a broadcast when we're loading a 
single element and everything else is undef, but I think that's the 
best case in the improved tests in insert-loaded-scalar.ll. We avoid 
scalar-to-vector-register move and/or less efficient shuffling.

Note that there are some existing types that were already producing 
a broadcast, but that happens semi-accidentally. Ie, it's not 
happening as part of lowerBuildVectorAsBroadcast(). The build vector 
gets expanded into load + shuffle, and then shuffle lowering produces 
the broadcast.

Description of the other test diffs:
1. avx-basic.ll - replacing load+shufle is a win.
2. sse3-avx-addsub-2.ll - vmovddup vs. vbroadcastss is neutral
3. sse41.ll - don't care - we convert that intrinsic to generic IR now, so this test is deprecated
4. vector-shuffle-128-v8.ll / vector-shuffle-256-v16.ll - pshufb alternatives with an extra instruction are not obviously bad

Differential Revision: https://reviews.llvm.org/D51125

llvm-svn: 340685
2018-08-25 14:56:05 +00:00
Craig Topper 4058e29e7d [X86] Teach combineLoopMAddPattern to handle cases where there is no loop and the add has two multiply inputs
Differential Revision: https://reviews.llvm.org/D50868

llvm-svn: 340631
2018-08-24 18:05:04 +00:00
Chandler Carruth ae0cafece8 [x86/retpoline] Split the LLVM concept of retpolines into separate
subtarget features for indirect calls and indirect branches.

This is in preparation for enabling *only* the call retpolines when
using speculative load hardening.

I've continued to use subtarget features for now as they continue to
seem the best fit given the lack of other retpoline like constructs so
far.

The LLVM side is pretty simple. I'd like to eventually get rid of the
old feature, but not sure what backwards compatibility issues that will
cause.

This does remove the "implies" from requesting an external thunk. This
always seemed somewhat questionable and is now clearly not desirable --
you specify a thunk the same way no matter which set of things are
getting retpolines.

I really want to keep this nicely isolated from end users and just an
LLVM implementation detail, so I've moved the `-mretpoline` flag in
Clang to no longer rely on a specific subtarget feature by that name and
instead to be directly handled. In some ways this is simpler, but in
order to preserve existing behavior I've had to add some fallback code
so that users who relied on merely passing -mretpoline-external-thunk
continue to get the same behavior. We should eventually remove this
I suspect (we have never tested that it works!) but I've not done that
in this patch.

Differential Revision: https://reviews.llvm.org/D51150

llvm-svn: 340515
2018-08-23 06:06:38 +00:00
Craig Topper cf9df99d79 [X86] Teach combineLoopSADPattern to handle cases where there is no loop and the add has two absolute difference inputs
Previously we asumed a vector reduction add is part of a loop and one of the input is a phi. But the code in SelectionDAGBuilder that sets vector reduction flag handles more cases than that. It just requires that the use chain ends in a horizontal reduction. And there are no other uses. This means it can handle unrolled reduction loops.

If the initial value of the reduction was 0, an unrolled loop would begin with a vector reduction add that has two sad inputs. Previously we would only transform one side of the add, but for this case we need to transform both sides.

I've created a lambda to reuse some of the code for both sides. And fixed the variables names to remove reference to "phi".

Differential Revision: https://reviews.llvm.org/D50817

llvm-svn: 340478
2018-08-22 23:19:01 +00:00
Simon Pilgrim ffdfe45645 [X86][SSE] LowerMULH vXi8 - use SSE shifts directly.
We know these vXi16 extended cases are legal constant splat shifts.

llvm-svn: 340414
2018-08-22 15:37:11 +00:00
Simon Pilgrim 50eba6b380 [X86][SSE] Lower vXi8 general shifts to SSE shifts directly. NFCI.
Most of these shifts are extended to vXi16 so we don't gain anything from forcing another round of generic shift lowering - we know these extended cases are legal constant splat shifts.

llvm-svn: 340307
2018-08-21 17:27:03 +00:00
Simon Pilgrim 98eb4ae499 [X86][SSE] Lower v8i16 general shifts to SSE shifts directly. NFCI.
We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts.

llvm-svn: 340302
2018-08-21 17:05:07 +00:00
Simon Pilgrim dbe4e9e3ff [X86][SSE] Lower directly to SSE shifts in the BLEND(SHIFT, SHIFT) combine. NFCI.
We don't gain anything from forcing another round of generic shift lowering - we know these are legal constant splat shifts.

llvm-svn: 340300
2018-08-21 16:46:48 +00:00
Simon Pilgrim 5a83a1fd13 [X86][SSE] Add helper function to convert to/between the SSE vector shift opcodes. NFCI.
Also remove some more getOpcode calls from LowerShift when we already have Opc.

llvm-svn: 340290
2018-08-21 15:57:33 +00:00
Craig Topper 210ccfe3db [X86] Prevent lowerVectorShuffleByMerging128BitLanes from creating cycles
Due to some splat handling code in getVectorShuffle, its possible for NewV1/NewV2 to have their mask modified from what is requested. This can lead to cycles being created in the DAG.

This patch examines the returned mask and makes sure its different. Long term we may need to look closer at that splat code in getVectorShuffle, or add more splat awareness to getVectorShuffle.

Fixes PR38639

Differential Revision: https://reviews.llvm.org/D50981

llvm-svn: 340214
2018-08-20 21:08:35 +00:00
Craig Topper 7dcb2c4b0a [X86] Teach combineTruncatedArithmetic to handle some cases of ISD::SUB
We can safely avoid interfering with the subus combine if both inputs are freely truncatable. Either both extends, or an extend and a constant vector.

Differential Revision: https://reviews.llvm.org/D50878

llvm-svn: 340212
2018-08-20 20:57:35 +00:00
Craig Topper 803912ea57 [X86] Fix an issue in the matching for ADDUS.
We were basically assuming only one operand of the compare could be an ADD node and using that to swap operands. But we can have a normal add followed by a saturing add.

This rewrites the canonicalization to just be based on the condition code.

llvm-svn: 340134
2018-08-19 04:26:31 +00:00
Craig Topper 2b03df9b05 [X86] Use SDValue::operator== instead of DAG.isEqualTo in strictly integer matching.
isEqualTo is more useful for floating point. operator== is sufficient for integer.

llvm-svn: 340130
2018-08-18 19:16:56 +00:00
Craig Topper 3e299d896f [X86] Simplify the PADDUS legality check in combineSelect to match PSUBUS. NFC
While there remove some trailing whitespace.

llvm-svn: 340129
2018-08-18 18:51:04 +00:00
Craig Topper 40c9559b74 [X86] Add support for using 512-bit PSUBUS to combineSelect.
The code already support 128 and 256 and even knows to split 256 for AVX1. So we really just needed to stop looking for specific VTs and subtarget features and just look for legal VTs with i8/i16 elements.

While there, add some curly braces around outer if statement bodies that contain only another if. It makes all the closing curly braces look more regular.

llvm-svn: 340128
2018-08-18 18:51:03 +00:00
Craig Topper 62dd1b1b4f [X86] Remove detectAddSubSatPattern.
This was added very recently in r339650, but appears to be completely untested and has at least one bug in it.

llvm-svn: 340086
2018-08-17 21:19:28 +00:00
Simon Pilgrim 2f48122cc9 [X86][SSE] Lower constant vXi8 ISD::SRL/ISD::SRA using PMULLW
Extending the concept introduced in D49562, this patch lowers constant vXi8 ISD::SRL/ISD::SRA by zero/sign extending to vXi16 and using PMULLW and then truncating the high 8 bits of the result.

Differential Revision: https://reviews.llvm.org/D50781

llvm-svn: 340062
2018-08-17 18:03:11 +00:00
Craig Topper 730890dbdb [X86] Use hasOneUse instead of isOnlyUserOf. NFCI
isOnlyUserOf is a little heavier because it allows the node to be used multiple times by the other node. In this case we are looking at a truncate which only has one operand so we know it can only use it once. Thus hasOneUse is better.

llvm-svn: 340059
2018-08-17 17:57:25 +00:00
Francis Visoiu Mistrih 8bff832534 [X86] Fix liveness information when expanding X86::EH_SjLj_LongJmp64
test/CodeGen/X86/shadow-stack.ll has the following machine verifier
errors:

```
*** Bad machine code: Using a killed virtual register ***
- function:    bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: %3:gr64 = MOV64rm killed %2:gr64, 1, $noreg, 8, $noreg
- operand 1:   killed %2:gr64

*** Bad machine code: Using a killed virtual register ***
- function:    bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: $rsp = MOV64rm killed %2:gr64, 1, $noreg, 16, $noreg
- operand 1:   killed %2:gr64

*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function:    bar
- basic block: %bb.2 entry (0x7fdc818574f8)
Virtual register %2 is used after the block.
```

The fix here is to only copy the machine operand's register without the
kill flags for all the instructions except the very last one of the
sequence.

I had to insert dummy PHIs in the test case to force the NoPHI function
property to be set to false. More on this here: https://llvm.org/PR38439

Differential Revision: https://reviews.llvm.org/D50260

llvm-svn: 340033
2018-08-17 14:46:56 +00:00
Chandler Carruth c73c0307fe [MI] Change the array of `MachineMemOperand` pointers to be
a generically extensible collection of extra info attached to
a `MachineInstr`.

The primary change here is cleaning up the APIs used for setting and
manipulating the `MachineMemOperand` pointer arrays so chat we can
change how they are allocated.

Then we introduce an extra info object that using the trailing object
pattern to attach some number of MMOs but also other extra info. The
design of this is specifically so that this extra info has a fixed
necessary cost (the header tracking what extra info is included) and
everything else can be tail allocated. This pattern works especially
well with a `BumpPtrAllocator` which we use here.

I've also added the basic scaffolding for putting interesting pointers
into this, namely pre- and post-instruction symbols. These aren't used
anywhere yet, they're just there to ensure I've actually gotten the data
structure types correct. I'll flesh out support for these in
a subsequent patch (MIR dumping, parsing, the works).

Finally, I've included an optimization where we store any single pointer
inline in the `MachineInstr` to avoid the allocation overhead. This is
expected to be the overwhelmingly most common case and so should avoid
any memory usage growth due to slightly less clever / dense allocation
when dealing with >1 MMO. This did require several ergonomic
improvements to the `PointerSumType` to reasonably support the various
usage models.

This also has a side effect of freeing up 8 bits within the
`MachineInstr` which could be repurposed for something else.

The suggested direction here came largely from Hal Finkel. I hope it was
worth it. ;] It does hopefully clear a path for subsequent extensions
w/o nearly as much leg work. Lots of thanks to Reid and Justin for
careful reviews and ideas about how to do all of this.

Differential Revision: https://reviews.llvm.org/D50701

llvm-svn: 339940
2018-08-16 21:30:05 +00:00
Craig Topper 08e082619a [X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation
To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.

This fixes PR35833

Differential Revison: https://reviews.llvm.org/D41794

llvm-svn: 339818
2018-08-15 21:21:52 +00:00
Simon Pilgrim f3b5943ffc Remove lambda default argument to fix gcc pedantic warning.
llvm-svn: 339767
2018-08-15 12:32:09 +00:00
Craig Topper 633fe98e27 [X86] Change legacy SSE scalar fp to integer intrinsics to use specific ISD opcodes instead of keeping as intrinsics. Unify SSE and AVX512 isel patterns.
AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics.

The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics.

Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns.

llvm-svn: 339749
2018-08-15 01:23:00 +00:00
Simon Pilgrim 2ce3d6e135 [X86][SSE] Avoid duplicate shuffle input sources in combineX86ShufflesRecursively
rL339686 added the case where a faux shuffle might have repeated shuffle inputs coming from either side of the OR().

This patch improves the insertion of the inputs into the source ops lists to account for this, as well as making it trivial to add support for shuffles with more than 2 inputs in the future.

llvm-svn: 339696
2018-08-14 17:22:37 +00:00
Simon Pilgrim ed55138247 [X86][SSE] Add shuffle combine support for OR(PSHUFB,PSHUFB) style patterns.
If each element is zero from one (or both) inputs then we can combine these into a single shuffle mask.

llvm-svn: 339686
2018-08-14 16:00:05 +00:00
Simon Pilgrim df9880f257 [X86][SSE] Generalize lowerVectorShuffleAsBlendOfPSHUFBs to work with any vXi8 type.
We still only use this for v16i8, but this cleans up the code to support v32i8/v64i8 sometime in the future.

llvm-svn: 339679
2018-08-14 14:00:14 +00:00
Tomasz Krupa 86a63889f3 [X86] Lowering addus/subus intrinsics to native IR
Summary: This revision improves previous version (rL330322) which has been reverted due to crashes.

This is the patch that lowers x86 intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that IR emits the same
machine instructions as the intrinsics.

Reviewers: craig.topper, spatel, RKSimon

Reviewed By: craig.topper

Subscribers: mike.dvoretsky, DavidKreitzer, sroland, llvm-commits

Differential Revision: https://reviews.llvm.org/D46179

llvm-svn: 339650
2018-08-14 08:00:56 +00:00
Simon Pilgrim 130b00bc43 [X86][SSE] Pull out repeated shift getOpcode() calls. NFCI.
llvm-svn: 339425
2018-08-10 11:42:42 +00:00
Craig Topper 9a8136f7b4 [X86] Qualify one of the heuristics in combineMul to only apply to positive multiply amounts.
This seems to slightly help the performance of one of our internal benchmarks. We probably need better heuristics here.

llvm-svn: 339406
2018-08-09 23:27:42 +00:00
Simon Pilgrim 511c3fc529 [X86][SSE] Remove PMULDQ/PMULUDQ by zero
Exposed by D50328

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339337
2018-08-09 12:37:36 +00:00
Simon Pilgrim 01ae462fef [X86][SSE] Combine (some) target shuffles with multiple uses
As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.

This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool.

However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit.

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339335
2018-08-09 12:30:02 +00:00
Craig Topper 9de1797c50 [SelectionDAG][X86] Rename MaskedLoadSDNode::getSrc0 to getPassThru.
Src0 doesn't really convey any meaning to what the operand is. Passthru matches what's used in the documentation for the intrinsic this comes from.

llvm-svn: 339101
2018-08-07 06:52:49 +00:00
Craig Topper 17989208a9 [SelectionDAG][X86] Rename getValue to getPassThru for gather SDNodes.
getValue is more meaningful name for scatter than it is for gather. Split them and use getPassThru for gather.

llvm-svn: 339096
2018-08-07 06:13:40 +00:00
Easwaran Raman 10fd92dd94 [X86] Recognize a splat of negate in isFNEG
Summary:
Expand isFNEG so that we generate the appropriate F(N)M(ADD|SUB)
instructions in more cases. For example, the following sequence
a = _mm256_broadcast_ss(f)
d = _mm256_fnmadd_ps(a, b, c)

generates an fsub and fma without this patch and an fnma with this
change.

Reviewers: craig.topper

Subscribers: llvm-commits, davidxl, wmi

Differential Revision: https://reviews.llvm.org/D48467

llvm-svn: 339043
2018-08-06 19:23:38 +00:00
Craig Topper feb2a58860 [X86] Add a DAG combine for the __builtin_parity idiom used by clang to enable better codegen
Clang uses "ctpop & 1" to implement __builtin_parity. If the popcnt instruction isn't supported this generates a large amount of code to calculate the population count. Instead we can bisect the data down to a single byte using xor and then check the parity flag.

Even when popcnt is supported, its still a good idea to split 64-bit data on 32-bit targets using an xor in front of a single popcnt. Otherwise we get two popcnts and an add before the and.

I've specifically targeted this at the sizes supported by clang builtins, but we could generalize this if we think that's useful.

Differential Revision: https://reviews.llvm.org/D50165

llvm-svn: 338907
2018-08-03 18:00:29 +00:00
Craig Topper e902b7d0b0 [X86] Support fp128 and/or/xor/load/store with VEX and EVEX encoded instructions.
Move all the patterns to X86InstrVecCompiler.td so we can keep SSE/AVX/AVX512 all in one place.

To save some patterns we'll use an existing DAG combine to convert f128 fand/for/fxor to integer when sse2 is enabled. This allows use to reuse all the existing patterns for v2i64.

I believe this now makes SHA instructions the only case where VEX/EVEX and legacy encoded instructions could be generated simultaneously.

llvm-svn: 338821
2018-08-03 06:12:56 +00:00
Craig Topper 2c095444a4 [X86] Prevent promotion of i16 add/sub/and/or/xor to i32 if we can fold an atomic load and atomic store.
This makes them consistent with i8/i32/i64. Which still seems to be more aggressive on folding than icc, gcc, or MSVC.

llvm-svn: 338795
2018-08-03 00:37:34 +00:00
Simon Pilgrim 8b16e15d47 [X86][SSE] Pull out duplicate VSELECT to shuffle mask code. NFCI.
llvm-svn: 338693
2018-08-02 09:20:27 +00:00
Reid Kleckner a30a6d2c29 Load from the GOT for external symbols in the large, PIC code model
Do the same handling for external symbols that we do for jump table
symbols and global values.

Fixes one of the cases in PR38385

llvm-svn: 338651
2018-08-01 22:56:05 +00:00
Craig Topper c985d42903 [X86] Canonicalize the pattern for __builtin_ffs in a similar way to '__builtin_ffs + 5'
We now emit a move of -1 before the cmov and do the addition after the cmov just like the case with an extra addition.

This may be slightly worse for code size, but is more consistent with other compilers. And we might be able to hoist the mov -1 outside of loops.

llvm-svn: 338613
2018-08-01 18:38:46 +00:00
Simon Pilgrim 931ebe3be1 [X86] Assign from a brace initializer to match style guide. NFCI.
llvm-svn: 338598
2018-08-01 17:43:38 +00:00
Simon Pilgrim a3548c960e [SelectionDAG] Make binop reduction matcher available to all targets
There is nothing x86-specific about this code, so it'd be nice to make this available for other targets to use in the future (and get it out of X86ISelLowering!).

Differential Revision: https://reviews.llvm.org/D50083

llvm-svn: 338586
2018-08-01 16:52:28 +00:00
Simon Pilgrim 564762cf32 [X86] Use isNullConstant helper. NFCI.
llvm-svn: 338530
2018-08-01 13:06:14 +00:00
Simon Pilgrim e447a273bd [X86] Use isNullConstant helper. NFCI.
llvm-svn: 338516
2018-08-01 11:24:11 +00:00
Craig Topper 65a1388881 [X86] When looking for (CMOV C-1, (ADD (CTTZ X), C), (X != 0)) -> (ADD (CMOV (CTTZ X), -1, (X != 0)), C), make sure we really have a compare with 0.
It's not strictly required by the transform of the cmov and the add, but it makes sure we restrict it to the cases we know we want to match.

While there canonicalize the operand order of the cmov to simplify the matching and emitting code.

llvm-svn: 338492
2018-08-01 06:36:20 +00:00
Simon Pilgrim 5d9b00d15b [X86][SSE] Use ISD::MULHU for constant/non-zero ISD::SRL lowering (PR38151)
As was done for vector rotations, we can efficiently use ISD::MULHU for vXi8/vXi16 ISD::SRL lowering.

Shift-by-zero cases are still problematic (mainly on v32i8 due to extra AND/ANDN/OR or VPBLENDVB blend masks but v8i16/v16i16 aren't great either if PBLENDW fails) so I've limited this first patch to known non-zero cases if we can't easily use PBLENDW.

Differential Revision: https://reviews.llvm.org/D49562

llvm-svn: 338407
2018-07-31 18:05:56 +00:00
Craig Topper bef126fb71 [X86] Add pattern matching for PMADDUBSW
Summary:
Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned.

A C example that triggers this pattern

```
static const int N = 128;

int8_t A[2*N];
uint8_t B[2*N];
int16_t C[N];

void foo() {
  for (int i = 0; i != N; ++i)
    C[i] = MIN(MAX((int16_t)A[2*i]*(int16_t)B[2*i] + (int16_t)A[2*i+1]*(int16_t)B[2*i+1], -32768), 32767);
}
```

Reviewers: RKSimon, spatel, zvi

Reviewed By: RKSimon, zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49829

llvm-svn: 338402
2018-07-31 17:12:08 +00:00
Simon Pilgrim 99d475f97d [X86][SSE] isFNEG - Use getTargetConstantBitsFromNode to handle all constant cases
isFNEG was duplicating much of what was done by getTargetConstantBitsFromNode in its own calls to getTargetConstantFromNode.

Noticed while reviewing D48467.

llvm-svn: 338358
2018-07-31 10:13:17 +00:00
Fangrui Song f78650a8de Remove trailing space
sed -Ei 's/[[:space:]]+$//' include/**/*.{def,h,td} lib/**/*.{cpp,h}

llvm-svn: 338293
2018-07-30 19:41:25 +00:00
Craig Topper f014ec9b3b [X86] Fix typo in comment. NFC
llvm-svn: 338274
2018-07-30 17:34:31 +00:00
Matt Arsenault 81920b0a25 DAG: Add calling convention argument to calling convention funcs
This seems like a pretty glaring omission, and AMDGPU
wants to treat kernels differently from other calling
conventions.

llvm-svn: 338194
2018-07-28 13:25:19 +00:00
Craig Topper c3e11bf3f7 [X86] Add support expanding multiplies by constant where the constant is -3/-5/-9 multplied by a power of 2.
These can be replaced with an LEA, a shift, and a negate. This seems to match what gcc and icc would do.

llvm-svn: 338174
2018-07-27 23:04:59 +00:00
Craig Topper 561e298e29 [X86] Remove an unnecessary 'if' that prevented treating INT64_MAX and -INT64_MAX as power of 2 minus 1 in the multiply expansion code.
Not sure why they were being explicitly excluded, but I believe all the math inside the if works. I changed the absolute value to be uint64_t instead of int64_t so INT64_MIN+1 wouldn't be signed wrap.

llvm-svn: 338101
2018-07-27 05:56:27 +00:00
Craig Topper e364baa88b [X86] Add matching for another pattern of PMADDWD.
Summary:
This is the pattern you get from the loop vectorizer for something like this

int16_t A[1024];
int16_t B[1024];
int32_t C[512];

void pmaddwd() {
  for (int i = 0; i != 512; ++i)
    C[i] = (A[2*i]*B[2*i]) + (A[2*i+1]*B[2*i+1]);
}

In this case we will have (add (mul (build_vector), (build_vector)), (mul (build_vector), (build_vector))). This is different than the pattern we currently match which has the build_vectors between an add and a single multiply. I'm not sure what C code would get you that pattern.

Reviewers: RKSimon, spatel, zvi

Reviewed By: zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49636

llvm-svn: 338097
2018-07-27 04:29:10 +00:00
Craig Topper f7bc550223 [X86] When removing sign extends from gather/scatter indices, make sure we handle UpdateNodeOperands finding an existing node to CSE with.
If this happens the operands aren't updated and the existing node is returned. Make sure we pass this existing node up to the DAG combiner so that a proper replacement happens. Otherwise we get stuck in an infinite loop with an unoptimized node.

llvm-svn: 338090
2018-07-27 00:00:30 +00:00
Craig Topper 4e687d5bb2 [X86] Don't use CombineTo to skip adding new nodes to the DAGCombiner worklist in combineMul.
I'm not sure if this was trying to avoid optimizing the new nodes further or what. Or maybe to prevent a cycle if something tried to reform the multiply? But I don't think its a reliable way to do that. If the user of the expanded multiply is visited by the DAGCombiner after this conversion happens, the DAGCombiner will check its operands, see that they haven't been visited by the DAGCombiner before and it will then add the first node to the worklist. This process will repeat until all the new nodes are visited.

So this seems like an unreliable prevention at best. So this patch just returns the new nodes like any other combine. If this starts causing problems we can try to add target specific nodes or something to more directly prevent optimizations.

Now that we handle the combine normally, we can combine any negates the mul expansion creates into their users since those will be visited now.

llvm-svn: 338007
2018-07-26 05:40:10 +00:00
Craig Topper 370bdd3a0f [X86] Remove some unnecessary explicit calls to DCI.AddToWorkList.
These calls were making sure some newly created nodes were added to worklist, but the DAGCombiner has internal support for ensuring it has visited all nodes. Any time it visits a node it ensures the operands have been queued to be visited as well. This means if we only need to return the last new node. The DAGCombiner will take care of adding its inputs thus walking backwards through all the new nodes.

llvm-svn: 337996
2018-07-26 03:20:27 +00:00
Matthias Braun 57dd5b3dea CodeGen: Cleanup regmask construction; NFC
- Avoid duplication of regmask size calculation.
- Simplify allocateRegisterMask() call.
- Rename allocateRegisterMask() to allocateRegMask() to be consistent
  with naming in MachineOperand.

llvm-svn: 337986
2018-07-26 00:27:47 +00:00
Craig Topper dc0e8a601d [X86] Use X86ISD::MUL_IMM instead of ISD::MUL for multiply we intend to be selected to LEA.
This prevents other combines from possibly disturbing it.

llvm-svn: 337890
2018-07-25 05:33:36 +00:00
Craig Topper fc501a9223 [X86] Use a shift plus an lea for multiplying by a constant that is a power of 2 plus 2/4/8.
The LEA allows us to combine an add and the multiply by 2/4/8 together so we just need a shift for the larger power of 2.

llvm-svn: 337875
2018-07-25 01:15:38 +00:00
Craig Topper 5be253d988 [X86] Expand mul by pow2 + 2 using a shift and two adds similar to what we do for pow2 - 2.
llvm-svn: 337874
2018-07-25 01:15:35 +00:00
Craig Topper 56c104f104 [X86] Use a two lea sequence for multiply by 37, 41, and 73.
These fit a pattern used by 11, 21, and 19.

llvm-svn: 337871
2018-07-24 23:44:17 +00:00
Craig Topper f8fcee70a3 [X86] Change multiply by 26 to use two multiplies by 5 and an add instead of multiply by 3 and 9 and a subtract.
Same number of operations, but ending in an add is friendlier due to it being commutable.

llvm-svn: 337869
2018-07-24 23:44:12 +00:00
Craig Topper 5ddc0a2b14 [X86] When expanding a multiply by a negative of one less than a power of 2, like 31, don't generate a negate of a subtract that we'll never optimize.
We generated a subtract for the power of 2 minus one then negated the result. The negate can be optimized away by swapping the subtract operands, but DAG combine doesn't know how to do that and we don't add any of the new nodes to the worklist anyway.

This patch makes use explicitly emit the swapped subtract.

llvm-svn: 337858
2018-07-24 21:31:21 +00:00
Craig Topper 6d29891bef [X86] Generalize the multiply by 30 lowering to generic multipy by power 2 minus 2.
Use a left shift and 2 subtracts like we do for 30. Move this out from behind the slow lea check since it doesn't even use an LEA.

Use this for multiply by 14 as well.

llvm-svn: 337856
2018-07-24 21:15:41 +00:00
Craig Topper 86d6320b94 [X86] Change multiply by 19 to use (9 * X) * 2 + X instead of (5 * X) * 4 - 1.
The new lowering can be done in 2 LEAs. The old code took 1 LEA, 1 shift, and 1 sub.

llvm-svn: 337851
2018-07-24 20:31:48 +00:00
Craig Topper b2a626b52e [X86] Remove the max vector width restriction from combineLoopMAddPattern and rely splitOpsAndApply to handle splitting.
This seems to be a net improvement. There's still an issue under avx512f where we have a 512-bit vpaddd, but not vpmaddwd so we end up doing two 256-bit vpmaddwds and inserting the results before a 512-bit vpaddd. It might be better to do two 512-bits paddds with zeros in the upper half. Same number of instructions, but breaks a dependency.

llvm-svn: 337656
2018-07-22 19:44:35 +00:00
Benjamin Kramer 64c7fa3201 Revert "[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts"
This reverts commit r337547. It triggers an infinite loop.

llvm-svn: 337617
2018-07-20 20:59:46 +00:00
Craig Topper 28ac623f6f [X86] Remove isel patterns for MOVSS/MOVSD ISD opcodes with integer types.
Ideally our ISD node types going into the isel table would have types consistent with their instruction domain. This prevents us having to duplicate patterns with different types for the same instruction.

Unfortunately, it seems our shuffle combining is currently relying on this a little remove some bitcasts. This seems to enable some switching between shufps and shufd. Hopefully there's some way we can address this in the combining.

Differential Revision: https://reviews.llvm.org/D49280

llvm-svn: 337590
2018-07-20 17:57:53 +00:00
Craig Topper 6194ccf8c7 [X86] Remove what appear to be unnecessary uses of DCI.CombineTo
CombineTo is most useful when you need to replace multiple results, avoid the worklist management, or you need to something else after the combine, etc. Otherwise you should be able to just return the new node and let DAGCombiner go through its usual worklist code.

All of the places changed in this patch look to be standard cases where we should be able to use the more stand behavior of just returning the new node.

Differential Revision: https://reviews.llvm.org/D49569

llvm-svn: 337589
2018-07-20 17:57:42 +00:00
Simon Pilgrim 70fcd0f481 [X86][XOP] Fix SUB constant folding for VPSHA/VPSHL shift lowering
We can safely use getConstant here as we're still lowering, which allows constant folding to kick in and simplify the vector shift codegen.

Noticed while working on D49562.

llvm-svn: 337578
2018-07-20 16:55:18 +00:00
Simon Pilgrim c7132031a2 [X86][SSE] Use SplitOpsAndApply to improve HADD/HSUB lowering
Improve AVX1 256-bit vector HADD/HSUB matching by using SplitOpsAndApply to split into 128-bit instructions.

llvm-svn: 337568
2018-07-20 16:20:45 +00:00
Simon Pilgrim a85b86a982 [X86][AVX] Add support for i16 256-bit vector horizontal op redundant shuffle removal
llvm-svn: 337566
2018-07-20 15:51:01 +00:00
Simon Pilgrim 7c56bce996 [X86][AVX] Add support for 32/64 bits 256-bit vector horizontal op redundant shuffle removal
llvm-svn: 337561
2018-07-20 15:24:12 +00:00
Simon Pilgrim 6fb8b68b2d [X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts
This is an early step towards using SimplifyDemandedVectorElts for target shuffle combining - this merely moves the existing X86ISD::VBROADCAST simplification code to use the SimplifyDemandedVectorElts mechanism.

Adds X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode to handle X86ISD::VBROADCAST - in time we can support all target shuffles (and other ops) here.

llvm-svn: 337547
2018-07-20 13:26:51 +00:00
Simon Pilgrim 1d181bc992 [X86][AVX] Use extract_subvector to reduce vector op widths (PR36761)
We have a number of cases where we fail to reduce vector op widths, performing the op in a larger vector and then extracting a subvector. This is often because by default it would create illegal types.

This peephole patch attempts to handle a few common cases detailed in PR36761, which typically involved extension+conversion to vX2f64 types.

Differential Revision: https://reviews.llvm.org/D49556

llvm-svn: 337500
2018-07-19 21:52:06 +00:00
Craig Topper 9888670c6b [X86] Fix some 'return SDValue()' after DCI.CombineTo instead return the output of CombineTo
Returning SDValue() means nothing was changed. Returning the result of CombineTo returns the first argument of CombineTo. This is specially detected by DAGCombiner as meaning that something changed, but worklist management was already taken care of.

I think the only real effect of this change is that we now properly update the Statistic the counts the number of combines performed. That's the only thing between the check for null and the check for N in the DAGCombiner.

llvm-svn: 337491
2018-07-19 20:10:44 +00:00
Simon Pilgrim d4b82da113 [X86][SSE] Canonicalize scalar fp arithmetic shuffle patterns
As discussed on PR38197, this canonicalizes MOVS*(N0, OP(N0, N1)) --> MOVS*(N0, SCALAR_TO_VECTOR(OP(N0[0], N1[0])))

This returns the scalar-fp codegen lost by rL336971.

Additionally it handles the OP(N1, N0)) case for commutable (FADD/FMUL) ops.

Differential Revision: https://reviews.llvm.org/D49474

llvm-svn: 337419
2018-07-18 19:55:19 +00:00
Simon Pilgrim 3a45369b9e [X86][SSE] Remove BLENDPD canonicalization from combineTargetShuffle
When rL336971 removed the scalar-fp isel patterns, we lost the need for this canonicalization - commutation/folding can handle everything else.

llvm-svn: 337387
2018-07-18 13:01:20 +00:00
Craig Topper 1425e10cc6 [X86] Generate v2f64 X86ISD::UNPCKL/UNPCKH instead of X86ISD::MOVLHPS/MOVHLPS for unary v2f64 {0,0} and {1,1} shuffles with SSE2.
I'm trying to restrict the MOVLHPS/MOVHLPS ISD nodes to SSE1 only. With SSE2 we can use unpcks. I believe this will allow some patterns to be cleaned up to require fewer bitcasts.

I've put in an odd isel hack to still select MOVHLPS instruction from the unpckh node to avoid changing tests and because movhlps is a shorter encoding. Ideally we'd do execution domain switching on this, but the operands are in the wrong order and are tied. We might be able to try a commute in the domain switching using custom code.

We already support domain switching for UNPCKLPD and MOVLHPS.

llvm-svn: 337348
2018-07-18 05:10:51 +00:00
Craig Topper 07a1787501 [X86] Merge the FR128 and VR128 regclass since they have identical spill and alignment characteristics.
This unfortunately requires a bunch of bitcasts to be added added to SUBREG_TO_REG, COPY_TO_REGCLASS, and instructions in output patterns. Otherwise tablegen seems to default to picking f128 and then we fail when something tries to get the register class for f128 which isn't always valid.

The test changes are because we were previously mixing fr128 and vr128 due to contrainRegClass finding FR128 first and passes like live range shrinking weren't handling that well.

llvm-svn: 337147
2018-07-16 06:56:09 +00:00
Fangrui Song dcdc9ac7a2 [X86] Correct comment of TEST elimination in BSF/TZCNT
llvm-svn: 337052
2018-07-13 21:40:08 +00:00
Fangrui Song 90d9c201dc [X86] Try fixing r336768
llvm-svn: 337043
2018-07-13 20:54:24 +00:00
Simon Pilgrim 44b89fa900 [X86][SSE] Utilize ZeroableElements for canWidenShuffleElements
canWidenShuffleElements can do a better job if given a mask with ZeroableElements info. Apparently, ZeroableElements was being only used to identify AllZero candidates, but possibly we could plug it into more shuffle matchers.

Original Patch by Zvi Rackover @zvi

Differential Revision: https://reviews.llvm.org/D42044

llvm-svn: 336903
2018-07-12 13:29:41 +00:00
Simon Pilgrim 8a463897e9 [X86][AVX] Use Zeroable mask to improve shuffle mask widening
Noticed while updating D42044, lowerV2X128VectorShuffle can improve the shuffle mask with the zeroable data to create a target shuffle mask to recognise more 'zero upper 128' patterns.

NOTE: lowerV4X128VectorShuffle could benefit as well but the code needs refactoring first to discriminate between SM_SentinelUndef and SM_SentinelZero for negative shuffle indices.

Differential Revision: https://reviews.llvm.org/D49092

llvm-svn: 336900
2018-07-12 13:03:58 +00:00
Craig Topper 73347ec081 [X86] Remove patterns and ISD nodes for the old scalar FMA intrinsic lowering.
We now use llvm.fma.f32/f64 or llvm.x86.fmadd.f32/f64 intrinsics that use scalar types rather than vector types. So we don't these special ISD nodes that operate on the lowest element of a vector.

llvm-svn: 336883
2018-07-12 03:42:41 +00:00
Craig Topper 034adf2683 [X86] Remove and autoupgrade the scalar fma intrinsics with masking.
This converts them to what clang is now using for codegen. Unfortunately, there seem to be a few kinks to work out still. I'll try to address with follow up patches.

llvm-svn: 336871
2018-07-12 00:29:56 +00:00
Craig Topper 02867f0fa3 [X86] The TEST instruction is eliminated when BSF/TZCNT is used
Summary:
These changes cover the PR#31399.
Now the ffs(x) function is lowered to (x != 0) ? llvm.cttz(x) + 1 : 0
and it corresponds to the following llvm code:
  %cnt = tail call i32 @llvm.cttz.i32(i32 %v, i1 true)
  %tobool = icmp eq i32 %v, 0
  %.op = add nuw nsw i32 %cnt, 1
  %add = select i1 %tobool, i32 0, i32 %.op
and x86 asm code:
  bsfl     %edi, %ecx
  addl     $1, %ecx
  testl    %edi, %edi
  movl     $0, %eax
  cmovnel  %ecx, %eax
In this case the 'test' instruction can't be eliminated because
the 'add' instruction modifies the EFLAGS, namely, ZF flag
that is set by the 'bsf' instruction when 'x' is zero.

We now produce the following code:
  bsfl     %edi, %ecx
  movl     $-1, %eax
  cmovnel  %ecx, %eax
  addl     $1, %eax

Patch by Ivan Kulagin

Reviewers: davide, craig.topper, spatel, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D48765

llvm-svn: 336768
2018-07-11 06:57:42 +00:00
Craig Topper dea0b88b04 [X86] Remove X86ISD::MOVLPS and X86ISD::MOVLPD. NFCI
These ISD nodes try to select the MOVLPS and MOVLPD instructions which are special load only instructions. They load data and merge it into the lower 64-bits of an XMM register. They are logically equivalent to our MOVSD node plus a load.

There was only one place in X86ISelLowering that used MOVLPD and no places that selected MOVLPS. The one place that selected MOVLPD had to choose between it and MOVSD based on whether there was a load. But lowering is too early to tell if the load can really be folded. So in isel we have patterns that use MOVSD for MOVLPD if we can't find a load.

We also had patterns that select the MOVLPD instruction for a MOVSD if we can find a load, but didn't choose the MOVLPD ISD opcode for some reason.

So it seems better to just standardize on MOVSD ISD opcode and manage MOVSD vs MOVLPD instruction with isel patterns.

llvm-svn: 336728
2018-07-10 21:00:22 +00:00
Simon Pilgrim d32ca2c0b7 [X86][SSE] Prefer BLEND(SHL(v,c1),SHL(v,c2)) over MUL(v, c3)
Now that rL336250 has landed, we should prefer 2 immediate shifts + a shuffle blend over performing a multiply. Despite the increase in instructions, this is quicker (especially for slow v4i32 multiplies), avoid loads and constant pool usage. It does mean however that we increase register pressure. The code size will go up a little but by less than what we save on the constant pool data.

This patch also adds support for v16i16 to the BLEND(SHIFT(v,c1),SHIFT(v,c2)) combine, and also prevents blending on pre-SSE41 shifts if it would introduce extra blend masks/constant pool usage.

Differential Revision: https://reviews.llvm.org/D48936

llvm-svn: 336642
2018-07-10 07:58:33 +00:00
Roman Lebedev 5ccae1750b [X86][TLI] DAGCombine: Unfold variable bit-clearing mask to two shifts.
Summary:
This adds a reverse transform for the instcombine canonicalizations
that were added in D47980, D47981.

As discussed later, that was worse at least for the code size,
and potentially for the performance, too.

https://rise4fun.com/Alive/Zmpl

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: spatel

Subscribers: reames, llvm-commits

Differential Revision: https://reviews.llvm.org/D48768

llvm-svn: 336585
2018-07-09 19:06:42 +00:00
Craig Topper 47170b3153 [X86] In combineFMA, make sure we bitcast the result of isFNEG back the expected type before creating the new FMA node.
Previously, we were creating malformed SDNodes, but nothing noticed because the type constraints prevented isel from noticing.

llvm-svn: 336566
2018-07-09 17:43:24 +00:00
Craig Topper b8145ec667 [X86] Improve the message for some asserts. Remove an if that is guaranteed true by said asserts.
This replaces some asserts in lowerV2F64VectorShuffle with the similar asserts from lowerVIF64VectorShuffle which are more readable. The original asserts mentioned a blend, but there's no guarantee that it is a blend.

Also remove an if that the asserts prove is always true. Mask[0] is always less than 2 and Mask[1] is always at least 2. Therefore (Mask[0] >= 2) + (Mask[1] >= 2) == 1 must wlays be true.

llvm-svn: 336517
2018-07-09 01:52:56 +00:00
Craig Topper 9e17073c21 [X86] Enhance combineFMA to look for FNEG behind an EXTRACT_VECTOR_ELT.
llvm-svn: 336514
2018-07-08 18:04:00 +00:00
Simon Pilgrim 2eced71ecf [X86][SSE] Combine v16i8 SHL by constants to multiplies
Pre-AVX512 (which can perform a quick extend/shift/truncate), extending to 2 v8i16 for the PMULLW and then truncating is more performant than relying on the generic PBLENDVB vXi8 shift path and uses a similar amount of mask constant pool data.

Differential Revision: https://reviews.llvm.org/D48963

llvm-svn: 336513
2018-07-08 12:47:50 +00:00
Simon Pilgrim 23f9eddabe [SelectionDAG] Split float and integer isKnownNeverZero tests
Splits off isKnownNeverZeroFloat to handle +/- 0 float cases.

This will make it easier to be more aggressive with the integer isKnownNeverZero tests (similar to ValueTracking), use computeKnownBits etc.

Differential Revision: https://reviews.llvm.org/D48969

llvm-svn: 336492
2018-07-07 18:17:14 +00:00
Craig Topper 2c27e33a58 [X86] Merge INTR_TYPE_3OP_RM with INTR_TYPE_3OP. Remove unused INTR_TYPE_1OP_RM.
llvm-svn: 336476
2018-07-07 01:04:22 +00:00
Vedant Kumar b3091da3af Use Type::isIntOrPtrTy where possible, NFC
It's a bit neater to write T.isIntOrPtrTy() over `T.isIntegerTy() ||
T.isPointerTy()`.

I used Python's re.sub with this regex to update users:

  r'([\w.\->()]+)isIntegerTy\(\)\s*\|\|\s*\1isPointerTy\(\)'

llvm-svn: 336462
2018-07-06 20:17:42 +00:00
Craig Topper c60e1807b3 [X86] Remove FMA4 scalar intrinsics. Use llvm.fma intrinsic instead.
The intrinsics can be implemented with a f32/f64 llvm.fma intrinsic and an insert into a zero vector.

There are a couple regressions here due to SelectionDAG not being able to pull an fneg through an extract_vector_elt. I'm not super worried about this though as InstCombine should be able to do it before we get to SelectionDAG.

llvm-svn: 336416
2018-07-06 07:14:41 +00:00
Craig Topper 7b35585ff1 [X86] Remove all of the avx512 masked packed fma intrinsics. Use llvm.fma or unmasked 512-bit intrinsics with rounding mode.
This upgrades all of the intrinsics to use fneg instructions to convert fma into fmsub/fnmsub/fnmadd/fmsubadd. And uses a select instruction for masking.

This matches how clang uses the intrinsics these days.

llvm-svn: 336409
2018-07-06 03:42:09 +00:00
Craig Topper 4fe321d1ce [X86] Add SHUF128 to target shuffle decoding.
Differential Revision: https://reviews.llvm.org/D48954

llvm-svn: 336376
2018-07-05 17:10:17 +00:00
Craig Topper 95eb88abfe [X86] Add support for combining FMSUB/FNMADD/FNMSUB ISD nodes with an fneg input.
Previously we could only negate the FMADD opcodes. This used to be mostly ok when we lowered FMA intrinsics during lowering. But with the move to llvm.fma from target specific intrinsics, we can combine (fneg (fma)) to (fmsub) earlier. So if we start with (fneg (fma (fneg))) we would get stuck at (fmsub (fneg)).

This patch fixes that so we can also combine things like (fmsub (fneg)).

llvm-svn: 336304
2018-07-05 02:52:56 +00:00
Simon Pilgrim c3e1617bf9 [X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values (REAPPLIED)
We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case.

Reapplied with a fixed (extra null tests) version of rL336113 after reversion in rL336189 - extra test case added at rL336247.

llvm-svn: 336250
2018-07-04 09:12:48 +00:00
Craig Topper e317533dcf [X86] Remove repeated 'the' from multiple comments that have been copy and pasted. NFC
llvm-svn: 336226
2018-07-03 20:39:55 +00:00
Benjamin Kramer fd171f2f89 Revert "[X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values"
This reverts commit r336113. It causes crashes.

llvm-svn: 336189
2018-07-03 11:15:17 +00:00
Simon Pilgrim 2bc8e079f2 [X86][SSE] Blend any v8i16/v4i32 shift with 2 shift unique values
We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case.

llvm-svn: 336113
2018-07-02 15:14:07 +00:00
Craig Topper 1b7b9b8596 [X86] Use MVT::i8 for scalar shift amounts since that is what they ultimately need to legalize to.
I believe all of these are constants so legalizing them should be pretty trivial, but this saves a step.

In one case it looks like we may have been creating a shift amount larger than the shift input itself.

llvm-svn: 336052
2018-06-30 18:30:31 +00:00
Craig Topper 5f28d50d27 [X86] When combining load to BZHI, make sure we create the shift instruction with an i8 type.
This combine runs pretty late and causes us to introduce a shift after the op legalization phase has run. We need to be sure we create the shift with the proper type for the shift amount. If we don't do this, we will still re-legalize the operation properly, but we won't get a chance to fully optimize the truncate that gets inserted.

So this patch adds the necessary truncate when the shift is created. I've also narrowed the subtract that gets created to always be an i32 type. The truncate would have trigered SimplifyDemandedBits to optimize it anyway. But using a more appropriate VT here is free and saves an optimization step.

llvm-svn: 336051
2018-06-30 17:49:42 +00:00
Craig Topper 59f2f38fe0 [X86] Remove masking from avx512 rotate intrinsics. Use select in IR instead.
llvm-svn: 336035
2018-06-30 01:32:04 +00:00
Craig Topper 87b107dd69 [X86] Limit the number of target specific nodes emitted in LowerShiftParts
The important part is the creation of the SHLD/SHRD nodes. The compare and the conditional move can use target independent nodes that can be legalized on their own. This gives some opportunities to trigger the optimizations present in the lowering for those things. And its just better to limit the number of places we emit target specific nodes.

The changed test cases still aren't optimal.

Differential Revision: https://reviews.llvm.org/D48619

llvm-svn: 335998
2018-06-29 17:24:07 +00:00
Simon Pilgrim aab8660e23 [X86][SSE] Support v16i8/v32i8 vector rotations
This uses the same technique as for shifts - split the rotation into 4/2/1-bit partial rotations and select those partials based on the amount bit, making use of PBLENDVB if available. This halves the use of PBLENDVB compared to expanding to shifts, which can be a slow op.

Unfortunately I haven't found a decent way to share much of this code with the shift equivalent.

Differential Revision: https://reviews.llvm.org/D48655

llvm-svn: 335957
2018-06-29 09:36:39 +00:00
Craig Topper 875e9f8fa4 [X86] Remove masking from the avx512 packed sqrt intrinsics. Use select in IR instead.
While there improve the coverage of the intrinsic testing and add fast-isel tests.

llvm-svn: 335944
2018-06-29 05:43:26 +00:00
Simon Pilgrim aa2bf2be31 [TargetLowering] isVectorClearMaskLegal - use ArrayRef<int> instead of const SmallVectorImpl<int>&
This is more generic and matches isShuffleMaskLegal.

Differential Revision: https://reviews.llvm.org/D48591

llvm-svn: 335605
2018-06-26 14:15:31 +00:00
Simon Pilgrim bfaa09220b [X86] Just use ArrayRef instead of SmallVectorImpl in a few static method arguments. NFCI.
llvm-svn: 335590
2018-06-26 10:45:41 +00:00
Craig Topper 08dae1682d [X86] Don't use getScalarShiftAmountTy to get the immediate type for target specific VSHLDQ/VSRLDQ nodes.
These opcodes have a fixed type of i8 for their immediate and shouldn't have anything to do with the scalar shift amount used by target independent shift nodes.

llvm-svn: 335578
2018-06-26 04:53:42 +00:00
Craig Topper 689e363ff2 [X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction.
This recommits r335562 and 335563 as a single commit.

The frontend will surround the intrinsic with the appropriate marshalling to/from a scalar type to match the sigature of the builtin that software expects.

By exposing the vXi1 type directly in the llvm intrinsic we make it available to optimizers much earlier. This can enable the scalar marshalling code to be optimized away.

llvm-svn: 335568
2018-06-26 01:37:02 +00:00
Craig Topper 6f4fdfa9af Revert r335562 and 335563 "[X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction."
These were supposed to have been squashed to a single commit.

llvm-svn: 335566
2018-06-26 01:31:53 +00:00
Craig Topper 9b4322ce31 foo
llvm-svn: 335562
2018-06-26 00:43:34 +00:00
Craig Topper 3b18bdc46d [X86] Simplify some code by using isOneConstant. NFC
llvm-svn: 335437
2018-06-25 01:01:47 +00:00
Craig Topper 4331d6218d [X86] Remove the changes to combineScalarToVector made in r335037.
They appear to be untested other than the test case for p37879.ll and I believe we should be using SimplifyDemandedElts here to handle these cases.

llvm-svn: 335436
2018-06-25 00:21:53 +00:00
Mikhail Dvoretckii 0963562083 [X86] Changing the check for valid inputs in combineScalarToVector
Changing the logic of scalar mask folding to check for valid input types rather
than against invalid ones, making it more robust and fixing PR37879.

Differential Revision: https://reviews.llvm.org/D48366

llvm-svn: 335323
2018-06-22 08:28:05 +00:00
Mikhail Dvoretckii 22c82af5c8 [x86] Lower some trunc + shuffle patterns to vpmov[q|d][b|w]
This should help in lowering the following four intrinsics:
 _mm256_cvtepi32_epi8
 _mm256_cvtepi64_epi16
 _mm256_cvtepi64_epi8
 _mm512_cvtepi64_epi8

Differential Revision: https://reviews.llvm.org/D46957

llvm-svn: 335238
2018-06-21 14:16:45 +00:00
Craig Topper c2696d577b [X86] Use setcc ISD opcode for AVX512 integer comparisons all the way to isel
I don't believe there is any real reason to have separate X86 specific opcodes for vector compares. Setcc has the same behavior just uses a different encoding for the condition code.

I had to change the CondCodeAction for SETLT and SETLE to prevent some transforms from changing SETGT lowering.

Differential Revision: https://reviews.llvm.org/D43608

llvm-svn: 335173
2018-06-20 21:05:02 +00:00
Mikhail Dvoretckii b1ce7765be [X86] VRNDSCALE* folding from masked and scalar ffloor and fceil patterns
This patch handles back-end folding of generic patterns created by lowering the
X86 rounding intrinsics to native IR in cases where the instruction isn't a
straightforward packed values rounding operation, but a masked operation or a
scalar operation.

Differential Revision: https://reviews.llvm.org/D45203

llvm-svn: 335037
2018-06-19 10:37:52 +00:00
Mikhail Dvoretckii bd8ed2dbaa Test commit.
llvm-svn: 335026
2018-06-19 07:55:10 +00:00
Sanjay Patel f85ca6abee [x86] be more selective about converting 'and' to shuffle (PR37749)
isVectorClearMaskLegal() is the TLI hook used by the generic
DAGCombiner::XformToShuffleWithZero().

We've grown to accomodate/expect this transform to shuffle
(disabling it more generally results in many regressions).
So I'm narrowly excluding the 256-bit types that clearly 
are not worthwhile for AVX1. 

I think in most cases we are able to recover by converting 
the shuffle back into 'and' ops, but the cases in:
https://bugs.llvm.org/show_bug.cgi?id=37749
...show that there are cracks.

llvm-svn: 334759
2018-06-14 19:55:02 +00:00
Sanjay Patel b983ac6fe1 [x86] eliminate even more sign-bit tests with vector select
This shortcoming was noted in D47330, and the test diffs show we already 
had other examples where we failed to fold to a SHRUNKBLEND:

/// Dynamic (non-constant condition) vector blend where only the sign bits
/// of the condition elements are used. This is used to enforce that the
/// condition mask is not valid for generic VSELECT optimizations.

This patch implements an idea from D48043 and would obsolete that patch 
because it catches more cases (notable the AVX1 case that was missed there). 
All we're doing is allowing the existing transform to fire more often by 
removing the post-legalize constraint. All of the relevant feature checks 
and other predicates are left as-is.

Differential Revision: https://reviews.llvm.org/D48078

llvm-svn: 334592
2018-06-13 12:28:32 +00:00
Craig Topper 3829d258ee [X86] Remove masking from avx512vbmi2 concat and shift by immediate intrinsics. Use select in IR instead.
llvm-svn: 334576
2018-06-13 07:19:21 +00:00
Sanjay Patel c3466d2568 [x86] move shrunkblend transform to helper function; NFCI
We should be able to obsolete D48043 by easing the constraints
on this existing code. 

llvm-svn: 334504
2018-06-12 14:21:51 +00:00
Craig Topper 0e25c8239a [X86] Remove masking from dbpsadbw intrinsics, use select in IR instead.
llvm-svn: 334384
2018-06-11 06:18:22 +00:00
Craig Topper e71ad1f6d0 [X86] Remove and autoupgrade the expandload and compressstore intrinsics.
We use the target independent intrinsics now.

llvm-svn: 334381
2018-06-11 01:25:22 +00:00
Craig Topper 98a79934af [X86] Remove masking from the 512-bit masked floating point add/sub/mul/div intrinsics. Use a select in IR instead.
llvm-svn: 334358
2018-06-10 06:01:36 +00:00
Simon Pilgrim 5c32989c91 [X86][SSE] Support v8i16/v16i16 rotations
Extension to D46954 (PR37426), this patch adds support for v8i16/v16i16 rotations in a similar manner - the conversion of the shift/rotate amount to a multiplication factor and the use of PMULLW to shift left and PMULHUW (ISD::MULHU) to shift the wrapped bits back around to be ORd together.

Differential Revision: https://reviews.llvm.org/D47822

llvm-svn: 334309
2018-06-08 17:58:42 +00:00
Simon Pilgrim a6afa310c9 [X86][SSE] Simplify combineVectorTruncationWithPACKUS to reduce code duplication
Simplify combineVectorTruncationWithPACKUS to mask the upper bits followed by calling truncateVectorWithPACK instead of duplicating with similar code.

This results in the codegen using (V)PACKUSDW on SSE41+ targets for vXi64/vXi32 inputs where before it always used PACKUSWB (along with a lot more bitcasting).

I've raised PR37749 as until we avoid unnecessary concats back to 256-bit for bitwise ops, we can't avoid splitting the input value into 128-bit subvectors for masking.

llvm-svn: 334289
2018-06-08 13:59:11 +00:00
Simon Pilgrim ad45efc445 [X86][SSE] Consistently prefer lowering to PACKUS over PACKSS
We have some combines/lowerings that attempt to use PACKSS-then-PACKUS and others that use PACKUS-then-PACKSS.

PACKUS is much easier to combine with if we know the upper bits are zero as ComputeKnownBits can easily see through BITCASTs etc. especially now that rL333995 and rL334007 have landed. It also effectively works at byte level which further simplifies shuffle combines.

The only (minor) annoyances are that ComputeKnownBits can sometimes take longer as it doesn't fail as quickly as ComputeNumSignBits (but I'm not seeing any actual regressions in tests) and PACKUSDW only became available after SSE41 so we have more codegen diffs between targets.

llvm-svn: 334276
2018-06-08 10:29:00 +00:00
Simon Pilgrim 51ceef775b [X86][SSE] Updated comment - combineVectorSignBitsTruncation handles PACKSS and PACKUS. NFCI.
llvm-svn: 334204
2018-06-07 16:08:40 +00:00
Simon Pilgrim 51ff15f472 [X86][SSE] Simplify combineVectorTruncationWithPACKUS. NFCI.
Move code only used by combineVectorTruncationWithPACKUS out of combineVectorTruncation.

llvm-svn: 334201
2018-06-07 14:53:32 +00:00
Simon Pilgrim 09953d8412 [X86][SSE] Simplify combineVectorTruncationWithPACKSS to reduce code duplication
Simplify combineVectorTruncationWithPACKSS to just a SIGN_EXTEND_INREG followed by using the existing truncateVectorWithPACK instead of duplicating code.

llvm-svn: 334193
2018-06-07 13:01:42 +00:00
Tomasz Krupa 145825162a Test commit access.
Added a bunch of periods after comments.

llvm-svn: 334171
2018-06-07 08:20:28 +00:00
Simon Pilgrim 3d14158891 [X86][BMI][TBM] Only demand bottom 16-bits of the BEXTR control op (PR34042)
Only the bottom 16-bits of BEXTR's control op are required (0:8 INDEX, 15:8 LENGTH).

Differential Revision: https://reviews.llvm.org/D47690

llvm-svn: 334083
2018-06-06 10:52:10 +00:00
Simon Pilgrim f2f043acbb [X86][SSE] Use multiplication scale factors for v8i16 SHL on pre-AVX2 targets.
Similar to v4i32 SHL, convert v8i16 shift amounts to scale factors instead to improve performance and reduce instruction count. We were already doing this for constant shifts, this adds variable shift support.

Reduces the serial nature of the codegen, which relies on chains of plendvb/pand+pandn+por shifts.

This is a step towards adding support for vXi16 vector rotates.

Differential Revision: https://reviews.llvm.org/D47546

llvm-svn: 334023
2018-06-05 15:17:39 +00:00
Simon Pilgrim fef9b6eea6 [X86][SSE] Add target shuffle support to X86TargetLowering::computeKnownBitsForTargetNode
Ideally we'd use resolveTargetShuffleInputs to handle faux shuffles as well but:
(a) that code path doesn't handle general/pre-legalized ops/types very well.
(b) I'm concerned about the compute time as they recurse to calls to computeKnownBits/ComputeNumSignBits which would need depth limiting somehow.

llvm-svn: 334007
2018-06-05 10:52:29 +00:00
Simon Pilgrim 7bbe7a2920 [X86][SSE] Add basic PACKUS support to X86TargetLowering::computeKnownBitsForTargetNode
Helps improve analysis of saturation ops

llvm-svn: 333995
2018-06-05 09:45:03 +00:00
Alexander Ivchenko 964b27fa21 [X86][CET] Shadow stack fix for setjmp/longjmp
This is the new version of D46181, allowing setjmp/longjmp
to work correctly with the Intel CET shadow stack by storing
SSP on setjmp and fixing it on longjmp. The patch has been
updated to use the cf-protection-return module flag instead
of HasSHSTK, and the bug that caused D46181 to be reverted
has been fixed with the test expanded to track that fix.

patch by mike.dvoretsky

Differential Revision: https://reviews.llvm.org/D47311

llvm-svn: 333990
2018-06-05 09:22:30 +00:00
Craig Topper 1f956e2c5f [X86] Don't pass ParitySrc array into isAddSubOrSubAddMask. Instead use a bool output parameter to get the real piece of info we care about. NFC
The ParitySrc array is more of an implementation detail. A single bool to get the final parity is sufficient.

llvm-svn: 333935
2018-06-04 17:58:45 +00:00
Simon Pilgrim 7c000d4267 [X86] Only accept const SelectionDAG to resolveTargetShuffleInputs/getFauxShuffleMask
These methods should only be using SelectionDAG for analysis (known/sign bits etc), not node creation.

llvm-svn: 333925
2018-06-04 16:48:13 +00:00
Craig Topper 3828ce7eab [X86] Do something sensible when an expand load intrinsic is passed a 0 mask.
Previously we just returned undef, but really we should be returning the pass thru input. We also need to make sure we preserve the chain output that the original intrinsic node had to maintain connectivity in the DAG. So we should just return the incoming chain as the output chain.

llvm-svn: 333804
2018-06-01 22:59:07 +00:00
Simon Pilgrim ff0623cd29 [X86][SSE] Recognise splat rotations and expand back to shift ops.
Noticed while fixing PR37426, for splat rotations (rotation by an uniform value) its better to just expand back to shift ops than performing as a general non-uniform rotation.

llvm-svn: 333661
2018-05-31 15:47:17 +00:00
Simon Pilgrim c34395d889 [X86][AVX] Add peekThroughEXTRACT_SUBVECTORs helper (NFCI)
We often need this for AVX1 128-bit integer ops as they may have been split from a 256-bit source.

llvm-svn: 333660
2018-05-31 15:15:49 +00:00
Simon Pilgrim 346886bc0d [X86][SSE] Add support for detecting SUB(SPLAT_BV, SPLAT) cases for shift-rotate patterns.
This improves splat rotations (rotation by an uniform value), to avoid having to use the generic non-uniform shift code (extension to PR37426).

llvm-svn: 333641
2018-05-31 11:25:16 +00:00
Simon Pilgrim 5e9f459c62 [X86][SSE] Pulled out splat detection helper from LowerScalarVariableShift (NFCI)
Created the IsSplatValue helper from the splat detection code in LowerScalarVariableShift as a first NFC step towards improving support for splat rotations, which is an extension of PR37426.

llvm-svn: 333580
2018-05-30 19:16:59 +00:00
Gabor Buella 890e363e11 [X86] Lowering FMA intrinsics to native IR (LLVM part)
Support for Clang lowering of fused intrinsics. This patch:

1. Removes bindings to clang fma intrinsics.
2. Introduces new LLVM unmasked intrinsics with rounding mode:
     int_x86_avx512_vfmadd_pd_512
     int_x86_avx512_vfmadd_ps_512
     int_x86_avx512_vfmaddsub_pd_512
     int_x86_avx512_vfmaddsub_ps_512
     supported with a new intrinsic type (INTR_TYPE_3OP_RM).
3. Introduces new x86 fmaddsub/fmsubadd folding.
4. Introduces new tests for code emitted by sequentions introduced in Clang part.

Patch by tkrupa

Reviewers: craig.topper, sroland, spatel, RKSimon

Reviewed By: craig.topper, RKSimon

Differential Revision: https://reviews.llvm.org/D47443

llvm-svn: 333554
2018-05-30 15:25:16 +00:00
Craig Topper 5439b3d1e5 [X86] Fix a potential crash that occur after r333419.
The code could issue a truncate from a small type to larger type. We need to extend in that case instead.

llvm-svn: 333460
2018-05-29 20:04:10 +00:00
Matt Arsenault ab2b79cb97 DAG: Remove redundant version of getRegisterTypeForCallingConv
There seems to be no real reason to have these separate copies.
The existing implementations just copy each other for x86.
For Mips there is a subtle difference, which is just a bug
since it changes based on the context where which one was called.
Dropping this version, all tests pass. If I try to merge them
to match the removed version, a test fails.

llvm-svn: 333440
2018-05-29 17:42:26 +00:00
Alexander Ivchenko 96062eaa8e [X86] Scalar mask and scalar move optimizations
1. Introduction of mask scalar TableGen patterns.
2. Introduction of new scalar move TableGen patterns
   and refactoring of existing ones.
3. Folding of pattern created by introducing scalar
   masking in Clang header files.

Patch by tkrupa

Differential Revision: https://reviews.llvm.org/D47012

llvm-svn: 333419
2018-05-29 14:27:11 +00:00
Craig Topper a34f8731c7 [X86] Disable a DAG combine to allow packed AVX512DQ instructions to be consistently used for i64->float/double conversions.
Summary: We already get this right if the i64 didn't come from a load.

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D47439

llvm-svn: 333393
2018-05-29 06:22:45 +00:00
Craig Topper 21aeddc3dc [X86] Remove masked vpermi2var/vpermt2var intrinsics and autoupgrade.
We have unmasked intrinsics now and wrap them with a select. This is a net reduction of 36 intrinsics from before the unmasked intrinsics were added.

llvm-svn: 333388
2018-05-29 05:22:05 +00:00
Craig Topper dcfcfdb0d1 [X86] Converge X86ISD::VPERMV3 and X86ISD::VPERMIV3 to a single opcode.
These do the same thing with the first and second sources swapped. They previously came from separate intrinsics that specified different masking behavior. But we can cover that with isel patterns and a single node.

This is a step towards reducing the number of intrinsics needed.

A bunch of tests change because we are now biased to choosing VPERMT over VPERMI when there is nothing to signal that commuting is beneficial.

llvm-svn: 333383
2018-05-28 19:33:11 +00:00
Craig Topper 26bc84860a [X86] Stop forcing X86VPermi2X node index operand to match destination type to make masking pattern matching easier. Add extra patterns with bitcasts instead.
This basically reverts r280696 in favor of using extra patterns as mentioned as an alternative in that commit message. For now I've only added the cases we have test cases for, but it should be easy to add more in the future.

This will help to convert VPERMI2PS/VPERMT2PS intrinsics to use a single ISD node opcode. And hopefully allow some intrinsics to be removed.

llvm-svn: 333365
2018-05-28 05:37:25 +00:00
Craig Topper 51eddb8749 [X86] Remove masking from avx512ifma intrinsics. Use a select instead.
This allows us to avoid having mask and maskz variant. Reducing from 12 intrinsics to 6.

llvm-svn: 333346
2018-05-26 18:55:19 +00:00
Simon Pilgrim b8c7c9c369 [X86][SSE] Pull out (AND (XOR X, -1), Y) matching into a helper function. NFC.
llvm-svn: 333201
2018-05-24 16:16:42 +00:00
Simon Pilgrim 8bd73573c3 Fix unused variable warnings. NFCI.
llvm-svn: 333195
2018-05-24 15:34:50 +00:00
Simon Pilgrim 0c72316a21 [X86][SSE] Pull out OR(AND(~MASK,X),AND(MASK,Y)) matching into a helper function. NFC.
First stage towards matching more variants of the bitselect pattern for combineLogicBlendIntoPBLENDV (PR37549)

llvm-svn: 333191
2018-05-24 15:12:48 +00:00
Roman Lebedev 7772de25d0 [DAGCombine][X86][AArch64] Masked merge unfolding: vector edition.
Summary:
This **appears** to be the last missing piece for the masked merge pattern handling in the backend.

This is [[ https://bugs.llvm.org/show_bug.cgi?id=37104 | PR37104 ]].

[[ https://bugs.llvm.org/show_bug.cgi?id=6773 | PR6773 ]] will introduce an IR canonicalization that is likely bad for the end assembly.
Previously, `andps`+`andnps` / `bsl` would be generated. (see `@out`)
Now, they would no longer be generated  (see `@in`), and we need to make sure that they are generated.

Differential Revision: https://reviews.llvm.org/D46528

llvm-svn: 332904
2018-05-21 21:41:02 +00:00
Craig Topper aad3aefaeb [X86] Remove masking from vpternlog intrinsics. Use a select in IR instead.
This removes 6 intrinsics since we no longer need separate mask and maskz intrinsics.

Differential Revision: https://reviews.llvm.org/D47124

llvm-svn: 332890
2018-05-21 20:58:09 +00:00
Simon Pilgrim a8869e68a9 [X86][SSE] Add an assert to ensure that rotation amount is converted to a scale
Missed in rL332832 where we added SSE v4i32 rotations for PR37426.

llvm-svn: 332844
2018-05-21 15:17:23 +00:00
Simon Pilgrim 5aa7cdfd70 [X86][SSE] Support v4i32 rotations (PR37426)
As suggested by Fabian on PR37426, we can use PMULUDQ to perform v4i32 vector rotations as the upper 32bits of the multiply will contain the 'wrapped' bits of the rotation.

v8i16/v16i8 rotations would be straightforward to add to lowerRotate in the future - ideally we'd mostly share code with the vector shifts lowering.

Differential Revision: https://reviews.llvm.org/D46954

llvm-svn: 332832
2018-05-21 09:45:59 +00:00
Craig Topper e4c045b7df [X86] Remove mask arguments from permvar builtins/intrinsics. Use a select in IR instead.
Someday maybe we'll use selects for all intrinsics.

llvm-svn: 332824
2018-05-20 23:34:04 +00:00
Craig Topper f94ed26ea9 [X86] Directly legalize v16i16/v8i16 vselect to vXi8 vselect to use VPBLENDVB
The intrinsic legalization for masked truncate uses ISD::TRUNCATE which can be constant folded by getNode. This prevents getVectorMaskingNode from seeing the ISD::TRUNCATE special case where it should emit X86ISD::SELECT instead of ISD::VSELECT. This causes a vselect with a v16i1 or v8i1 condition to be emitted during vector legalization. but vector legalization doesn't revisit nodes it creates. DAG combine will then promote this condition to match the result type. Then op legalization will try to legalize it, but the custom lowering hook returned SDValue(). But op legalization doesn't have an Expand for VSELECT because it expects vector legalization to have taken care of it. So the operation sticks around and fails in isel.

This patch adds a custom legalization hook to morph it to a vXi8 vselect instead.

This also simplifies the normal vXi16 vselect handling because vector legalization was normally expanding to AND/ANDN/OR and DAG combine was turning that into VBLENDVB. So we can skip a step by doing it directly.

Fixes PR37499

Differential Revision: https://reviews.llvm.org/D47025

llvm-svn: 332743
2018-05-18 17:48:06 +00:00
Simon Pilgrim 2e0f6c9b21 [X86][SSE] Reduce instruction/register usages for v4i32 vector shifts (PR37441)
As suggested by Fabian on PR37441, use PSHUFLW to extend shift amount types for use with PSRAD/PSRLD to reduce register pressure.

Some of this ideally would be done by combineTargetShuffle but its tricky to do as most of the shuffles are sharing inputs.

Differential Revision: https://reviews.llvm.org/D46959

llvm-svn: 332524
2018-05-16 20:52:52 +00:00
Craig Topper 67aa726f8c [X86][AVX512DQ] Use packed instructions for scalar FP<->i64 conversions on 32-bit targets
As i64 types are not legal on 32-bit targets, insert these into a suitable zero vector and use the packed vXi64<->FP conversion instructions instead.

Fixes PR3163.

Differential Revision: https://reviews.llvm.org/D43441

llvm-svn: 332498
2018-05-16 17:40:07 +00:00
Mikael Holmen e01131decf Remove unused variable introduced in r332336
The unused variable caused a compilation warning:

../lib/Target/X86/X86ISelLowering.cpp:34614:17: error: unused variable 'SMax' [-Werror,-Wunused-variable]
    if (SDValue SMax = MatchMinMax(SMin, ISD::SMAX, C1))
                ^
1 error generated.

llvm-svn: 332431
2018-05-16 06:36:11 +00:00
Artur Gainullin 243a3d56d8 [X86] Improve unsigned saturation downconvert detection.
Summary:
New unsigned saturation downconvert patterns detection was implemented in
X86 Codegen:

(truncate (smin (smax (x, C1), C2)) to dest_type),
where C1 >= 0 and C2 is unsigned max of destination type.

(truncate (smax (smin (x, C2), C1)) to dest_type)
where C1 >= 0, C2 is unsigned max of destination type and C1 <= C2.
These two patterns are equivalent to:

(truncate (umin (smax(x, C1), unsigned_max_of_dest_type)) to dest_type)

Reviewers: RKSimon

Subscribers: llvm-commits, a.elovikov

Differential Revision: https://reviews.llvm.org/D45315

llvm-svn: 332336
2018-05-15 10:24:12 +00:00
Sanjay Patel b4e7893ba8 [x86] fix fmaxnum/fminnum with nnan
With nnan, there's no need for the masked merge / blend
sequence (that probably costs much more than the min/max
instruction).

Somewhere between clang 5.0 and 6.0, we started producing
these intrinsics for fmax()/fmin() in C source instead of
libcalls or fcmp/select. The backend wasn't prepared for
that, so we regressed perf in those cases.

Note: it's possible that other targets have similar problems
as seen here. 

Noticed while investigating PR37403 and related bugs:
https://bugs.llvm.org/show_bug.cgi?id=37403

The IR FMF propagation cases still don't work. There's
a proposal that might fix those cases in D46563.

llvm-svn: 331992
2018-05-10 15:40:49 +00:00
Craig Topper b9a473d186 [X86] Combine (vXi1 (bitcast (-1)))) and (vXi1 (bitcast (0))) to all ones or all zeros vXi1 vector.
llvm-svn: 331847
2018-05-09 06:07:20 +00:00
Shiva Chen 801bf7ebbe [DebugInfo] Examine all uses of isDebugValue() for debug instructions.
Because we create a new kind of debug instruction, DBG_LABEL, we need to
check all passes which use isDebugValue() to check MachineInstr is debug
instruction or not. When expelling debug instructions, we should expel
both DBG_VALUE and DBG_LABEL. So, I create a new function,
isDebugInstr(), in MachineInstr to check whether the MachineInstr is
debug instruction or not.

This patch has no new test case. I have run regression test and there is
no difference in regression test.

Differential Revision: https://reviews.llvm.org/D45342

Patch by Hsiangkai Wang.

llvm-svn: 331844
2018-05-09 02:42:00 +00:00
Jessica Paquette ec37c640dd Revert "[X86][CET] Shadow stack fix for setjmp/longjmp"
This reverts commit 30962eca38ef02666ebcdded72a94f2cd0292d68.

This commit has been causing test asan failures on a build bot.

http://green.lab.llvm.org/green/job/clang-stage1-configure-RA/45108/

Original commit: https://reviews.llvm.org/D46181

llvm-svn: 331813
2018-05-08 22:00:57 +00:00
Jeremy Morse 4f799c027e [X86] Mark all byval parameters as aliased
This is a fix for PR30290: by marking all byval stack slots as being aliased,
the instruction scheduler is more conservative about rescheduling memory
accesses to such stack slots as an LLVM Value* might alias it. This fixes
errors such as in the patched test case, where reads and writes to a data
structure are illegally mixed.

This could be fixed better in the future with better analysis for the
instruction scheduler to know what Values alias what stack slots.

Differential Revision: https://reviews.llvm.org/D45022

llvm-svn: 331749
2018-05-08 09:18:01 +00:00
Alexander Ivchenko c47f799289 [X86][CET] Shadow stack fix for setjmp/longjmp
This patch adds a shadow stack fix when compiling
setjmp/longjmp with the shadow stack enabled. This
allows setjmp/longjmp to work correctly with CET.

Patch by mike.dvoretsky

Differential Revision: https://reviews.llvm.org/D46181

llvm-svn: 331748
2018-05-08 09:04:07 +00:00
Roman Lebedev cc42d08b1d [DagCombiner] Not all 'andn''s work with immediates.
Summary:
Split off from D46031.

In masked merge case, this degrades IPC by decreasing instruction count.
{F6108777}
The next patch should be able to recover and improve this.

This also affects the transform @spatel have added in D27489 / rL289738,
and the test coverage for X86 was missing.
But after i have added it, and looked at the changes in MCA, i'm somewhat confused.
{F6093591} {F6093592} {F6093593}
I'd say this regression is an improvement, since `IPC` increased in that case?

Reviewers: spatel, craig.topper

Reviewed By: spatel

Subscribers: andreadb, llvm-commits, spatel

Differential Revision: https://reviews.llvm.org/D46493

llvm-svn: 331684
2018-05-07 21:52:11 +00:00
Craig Topper c882014f43 [X86] Fix copy/paste mistake in comment. NFC
llvm-svn: 331611
2018-05-07 00:47:02 +00:00
Craig Topper cb2abc7977 [X86] Enable reciprocal estimates for v16f32 vectors by using VRCP14PS/VRSQRT14PS
Summary:
The legacy VRCPPS/VRSQRTPS instructions aren't available in 512-bit versions. The new increased precision versions are. So we can use those to implement v16f32 reciprocal estimates.

For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch.

Reviewers: spatel

Reviewed By: spatel

Subscribers: RKSimon, llvm-commits, mehdi_amini

Differential Revision: https://reviews.llvm.org/D46498

llvm-svn: 331606
2018-05-06 17:48:21 +00:00
Adrian Prantl 5f8f34e459 Remove \brief commands from doxygen comments.
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.

Patch produced by

  for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done

Differential Revision: https://reviews.llvm.org/D46290

llvm-svn: 331272
2018-05-01 15:54:18 +00:00
Craig Topper d656410293 [X86] Make the STTNI flag intrinsics use the flags from pcmpestrm/pcmpistrm if the mask instrinsics are also used in the same basic block.
Summary:
Previously the flag intrinsics always used the index instructions even if a mask instruction also exists.

To fix fix this I've created a single ISD node type that returns index, mask, and flags. The SelectionDAG CSE process will merge all flavors of intrinsics with the same inputs to a s ingle node. Then during isel we just have to look at which results are used to know what instruction to generate. If both mask and index are used we'll need to emit two instructions. But for all other cases we can emit a single instruction.

Since I had to do manual isel anyway, I've removed the pseudo instructions and custom inserter code that was working around tablegen limitations with multiple implicit defs.

I've also renamed the recently added sse42.ll test case to sttni.ll since it focuses on that subset of the sse4.2 instructions.

Reviewers: chandlerc, RKSimon, spatel

Reviewed By: chandlerc

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D46202

llvm-svn: 331091
2018-04-27 22:15:33 +00:00
Chandler Carruth 16429acacb [x86] Revert r330322 (& r330323): Lowering x86 adds/addus/subs/subus intrinsics
The LLVM commit introduces a crash in LLVM's instruction selection.

I filed http://llvm.org/PR37260 with the test case.

llvm-svn: 330997
2018-04-26 21:46:01 +00:00
Craig Topper 300e20d61c [X86] Form MUL_IMM for multiplies with 3/5/9 to encourage LEA formation over load folding.
Previously we only formed MUL_IMM when we split a constant. This blocked load folding on those cases. We should also form MUL_IMM for 3/5/9 to favor LEA over load folding.

Differential Revision: https://reviews.llvm.org/D46040

llvm-svn: 330850
2018-04-25 17:35:03 +00:00
Alexander Ivchenko 5717fbaf4c [X86] Replace action Promote with Expand for operation ISD::SINT_TO_FP
Summary:
If attribute "use-soft-float"="true" is set then X86ISelLowering.cpp sets
'Promote' action for ISD::SINT_TO_FP operation on type i32.

But 'Promote' action is not proper in this case since lib function
__floatsidf is available for casting from signed int to float type.
Thus Expand action is more suitable here.

The Expand action should be set for ISD::UINT_TO_FP for soft float as well.

If function attribute "use-soft-float"="true" is set then infinite looping
can happen in DAG combining, function visitSINT_TO_FP() replaces SINT_TO_FP
node with UINT_TO_FP node and function combineUIntToFP() replace vice versa in cycle.
The fix prevents it.

Patch by vrybalov

Differential Revision: https://reviews.llvm.org/D45572

llvm-svn: 330711
2018-04-24 12:57:51 +00:00
Craig Topper fe59bea07b [X86] Add DAG combine to turn (trunc (srl (mul ext, ext), 16) into PMULHW/PMULHUW.
Ultimately I want to use this to remove the intrinsics for these instructions.

llvm-svn: 330520
2018-04-21 18:39:21 +00:00
Gabor Buella 31fa8025ba [X86] WaitPKG instructions
Three new instructions:

umonitor - Sets up a linear address range to be
monitored by hardware and activates the monitor.
The address range should be a writeback memory
caching type.

umwait - A hint that allows the processor to
stop instruction execution and enter an
implementation-dependent optimized state
until occurrence of a class of events.

tpause - Directs the processor to enter an
implementation-dependent optimized state
until the TSC reaches the value in EDX:EAX.

Also modifying the description of the mfence
instruction, as the rep prefix (0xF3) was allowed
before, which would conflict with umonitor during
disassembly.

Before:
$ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble
.text
mfence

After:
$ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble
.text
umonitor        %rax

Reviewers: craig.topper, zvi

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D45253

llvm-svn: 330462
2018-04-20 18:42:47 +00:00
Alexander Ivchenko e8fed1546e Lowering x86 adds/addus/subs/subus intrinsics (llvm part)
This is the patch that lowers x86 intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that IR emits the same
machine instructions as the intrinsics.

Patch by tkrupa

Differential Revision: https://reviews.llvm.org/D44785

llvm-svn: 330322
2018-04-19 12:13:30 +00:00
Keith Wyss 3d86823f3d [XRay] Typed event logging intrinsic
Summary:
Add an LLVM intrinsic for type discriminated event logging with XRay.
Similar to the existing intrinsic for custom events, but also accepts
a type tag argument to allow plugins to be aware of different types
and semantically interpret logged events they know about without
choking on those they don't.

Relies on a symbol defined in compiler-rt patch D43668. I may wait
to submit before I can see demo everything working together including
a still to come clang patch.

Reviewers: dberris, pelikan, eizan, rSerge, timshen

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D45633

llvm-svn: 330219
2018-04-17 21:30:29 +00:00
Hiroshi Inoue ae17900997 [NFC] fix trivial typos in document and comments
"not not" -> "not" etc

llvm-svn: 330083
2018-04-14 08:59:00 +00:00
Craig Topper 254ed028a4 [X86] Remove the pmuldq/pmuldq intrinsics and replace with native IR.
This completes the work started in r329604 and r329605 when we changed clang to no longer use the intrinsics.

We lost some InstCombine SimplifyDemandedBit optimizations through this change as we aren't able to fold 'and', bitcast, shuffle very well.

llvm-svn: 329990
2018-04-13 06:07:18 +00:00
Sriraman Tallam d693093a65 GOTPCREL references must always use RIP.
With -fno-plt, global value references can use GOTPCREL and RIP must be used.

Differential Revision: https://reviews.llvm.org/D45460

llvm-svn: 329765
2018-04-10 22:50:05 +00:00
Chandler Carruth 0ca3bd0729 [x86] Model the direction flag (DF) separately from the rest of EFLAGS.
This cleans up a number of operations that only claimed te use EFLAGS
due to using DF. But no instructions which we think of us setting EFLAGS
actually modify DF (other than things like popf) and so this needlessly
creates uses of EFLAGS that aren't really there.

In fact, DF is so restrictive it is pretty easy to model. Only STD, CLD,
and the whole-flags writes (WRFLAGS and POPF) need to model this.

I've also somewhat cleaned up some of the flag management instruction
definitions to be in the correct .td file.

Adding this extra register also uncovered a failure to use the correct
datatype to hold X86 registers, and I've corrected that as necessary
here.

Differential Revision: https://reviews.llvm.org/D45154

llvm-svn: 329673
2018-04-10 06:40:51 +00:00
Chandler Carruth 19618fc639 [x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues.
The key idea is to lower COPY nodes populating EFLAGS by scanning the
uses of EFLAGS and introducing dedicated code to preserve the necessary
state in a GPR. In the vast majority of cases, these uses are cmovCC and
jCC instructions. For such cases, we can very easily save and restore
the necessary information by simply inserting a setCC into a GPR where
the original flags are live, and then testing that GPR directly to feed
the cmov or conditional branch.

However, things are a bit more tricky if arithmetic is using the flags.
This patch handles the vast majority of cases that seem to come up in
practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of
partially preserved EFLAGS as LLVM doesn't currently model that at all.

There are a large number of operations that techinaclly observe EFLAGS
currently but shouldn't in this case -- they typically are using DF.
Currently, they will not be handled by this approach. However, I have
never seen this issue come up in practice. It is already pretty rare to
have these patterns come up in practical code with LLVM. I had to resort
to writing MIR tests to cover most of the logic in this pass already.
I suspect even with its current amount of coverage of arithmetic users
of EFLAGS it will be a significant improvement over the current use of
pushf/popf. It will also produce substantially faster code in most of
the common patterns.

This patch also removes all of the old lowering for EFLAGS copies, and
the hack that forced us to use a frame pointer when EFLAGS copies were
found anywhere in a function so that the dynamic stack adjustment wasn't
a problem. None of this is needed as we now lower all of these copies
directly in MI and without require stack adjustments.

Lots of thanks to Reid who came up with several aspects of this
approach, and Craig who helped me work out a couple of things tripping
me up while working on this.

Differential Revision: https://reviews.llvm.org/D45146

llvm-svn: 329657
2018-04-10 01:41:17 +00:00
Craig Topper 47b2f9d836 [X86] Don't use Lower512IntUnary to split bitcasts with v32i16/v64i8 types on targets without AVX512BW.
LowerIntUnary as its name says has an assert for integer types. But for the bitcast case one side might be an FP type.

Rather than making sure the function really works for fp types and renaming it. Just do really basic splitting directly. The LowerIntUnary has the advantage that it can peek through BUILD_VECTOR because every other call is during Lowering. But these calls are during legalization and will be followed by a DAG combine round.

Revert some change to LowerVectorIntUnary that were originally made just to make these two calls work even in pure integer cases.

This was found purely by compiling the avx512f-builtins.c test from clang so I've copied over the offending function from that.

llvm-svn: 329616
2018-04-09 20:37:14 +00:00
Mandeep Singh Grang 68a151a13c [X86] Change std::sort to llvm::sort in response to r327219
Summary:
r327219 added wrappers to std::sort which randomly shuffle the container before sorting.
This will help in uncovering non-determinism caused due to undefined sorting
order of objects having the same key.

To make use of that infrastructure we need to invoke llvm::sort instead of std::sort.

Note: This patch is one of a series of patches to replace *all* std::sort to llvm::sort.
Refer the comments section in D44363 for a list of all the required patches.

Reviewers: chandlerc, craig.topper, RKSimon

Reviewed By: chandlerc, craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D44874

llvm-svn: 329534
2018-04-08 16:42:52 +00:00
Craig Topper ef37aebc96 [X86] Combine vXi64 multiplies to MULDQ/MULUDQ during DAG combine instead of lowering.
Previously we used a custom lowering for this because of the AVX1 splitting requirement. But we can do the split during DAG combine if we check the types and subtarget

llvm-svn: 329510
2018-04-07 19:09:52 +00:00
Benjamin Kramer 1fc0da4849 Make helpers static. NFC.
llvm-svn: 329170
2018-04-04 11:45:11 +00:00
Craig Topper 3064c15dc3 [X86] Remove some code that was only needed when i1 was a legal type. NFC
llvm-svn: 329146
2018-04-04 04:38:54 +00:00
Craig Topper 9b8cd5fe55 [X86] Don't check for folding into a store when deciding if we can promote an i16 mul.
There's no RMW mul operation.

llvm-svn: 328931
2018-04-01 06:29:32 +00:00
Craig Topper db6caabccc [X86] Check if the load and store are to the same pointer before preventing i16 RMW shifts and subtracts from being promoted.
llvm-svn: 328930
2018-04-01 06:29:28 +00:00
Craig Topper ae2de57db0 [X86] Allow i16 subtracts to be promoted if the load is on the LHS and its not being stored.
llvm-svn: 328928
2018-04-01 06:29:25 +00:00
Craig Topper 9bc0d881a3 [X86] Remove unneeded temporary variable. NFC
This Promote flag was alwasys set to true except in the default case. But in the default case we don't need to set PVT and can just return false.

llvm-svn: 328926
2018-04-01 06:29:21 +00:00
Simon Pilgrim 8c8ebd7945 Fix trailing whitespace. NFCI.
llvm-svn: 328917
2018-03-31 09:14:14 +00:00
Simon Pilgrim 71c5f3fffd [X86][SSE] Don't bother re-adding combined target shuffles to the work list
We are re-adding all the bitcasts, constant masks and target shuffles to the work list for no apparent gain.

Found while investigating adding SimplifyDemandedVectorElts to target shuffles.

Differential Revision: https://reviews.llvm.org/D44942

llvm-svn: 328771
2018-03-29 11:18:41 +00:00
Reid Kleckner 41fb2dba9c [X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32
Summary:
Re-lands r328386 and r328443, reverting r328482.

Incorporates fixes from @mstorsjo in D44876 (thanks!) so that small
parameters in i8 and i16 do not end up in the SysV register parameters
(EDI, ESI, etc).

I added tests for how we receive small parameters, since that is the
important part. It's always safe to store more bytes than will be read,
but the assumptions you make when loading them are what really matter.

I also tested this by self-hosting clang and it passed tests on win64.

Reviewers: mstorsjo, hans

Subscribers: hiraditya, mstorsjo, llvm-commits

Differential Revision: https://reviews.llvm.org/D44900

llvm-svn: 328570
2018-03-26 18:49:48 +00:00
Hans Wennborg 311b63f13b Revert r328386 "[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32"
This broke Chromium (see crbug.com/825748). It looks like mstorsjo's follow-up
patch at D44876 fixes this, but let's revert back to green for now until that's
ready to land.

(Also reverts r328443.)

> Both GCC and MSVC only look at the low byte of a boolean when it is
> passed.

llvm-svn: 328482
2018-03-26 10:07:51 +00:00
Simon Pilgrim 854ac7490d [X86] Add missing full stop to comment. NFCI.
llvm-svn: 328456
2018-03-25 18:49:48 +00:00
Craig Topper 2c0a62ab9a [X86] Add a DAG combine to simplify PMULDQ/PMULUDQ nodes
These nodes only use the lower 32 bits of their inputs so we can use SimplifyDemandedBits to simplify them.

Differential Revision: https://reviews.llvm.org/D44375

llvm-svn: 328405
2018-03-24 01:52:01 +00:00
Reid Kleckner e27b410661 [X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32
Both GCC and MSVC only look at the low byte of a boolean when it is
passed.

llvm-svn: 328386
2018-03-23 23:38:53 +00:00
Martin Storsjo 07589fc496 [X86] Don't use the MSVC stack protector names on mingw
Mingw uses the same stack protector functions as GCC provides
on other platforms as well.

Patch by Valentin Churavy!

Differential Revision: https://reviews.llvm.org/D27296

llvm-svn: 328039
2018-03-20 20:37:51 +00:00
Craig Topper ab6076514d [X86] Simplify the AVX512 code in LowerTruncate a little.
We don't need to create an ISD::TRUNCATE node to return, we started with one and can return it. Also remove the call to getExtendInVec, the result is just going to be a getNode of that value passed in.

llvm-svn: 327914
2018-03-19 21:58:02 +00:00
Craig Topper 3b967466d5 [X86] Replace a couple calls to getExtendInVec with getNode and the appropriate target independent EXTEND_VECTOR_INREG opcode.
llvm-svn: 327899
2018-03-19 20:20:22 +00:00
Craig Topper 259eaa6e7c [X86] Remove sse41 specific code from lowering v16i8 multiply
With the SRAs removed from the SSE2 code in D44267, then there doesn't appear to be any advantage to the sse41 code. The punpcklbw instruction and pmovsx seem to have the same latency and throughput on most CPUs. And the SSE41 code requires moving the upper 64-bits into the lower 64-bit before the sign extend can be done. The unpckhbw in sse2 code can do better than that.

llvm-svn: 327869
2018-03-19 17:31:41 +00:00
Oren Ben Simhon fdd72fd522 [X86] Added support for nocf_check attribute for indirect Branch Tracking
X86 Supports Indirect Branch Tracking (IBT) as part of Control-Flow Enforcement Technology (CET).
IBT instruments ENDBR instructions used to specify valid targets of indirect call / jmp.
The `nocf_check` attribute has two roles in the context of X86 IBT technology:
	1. Appertains to a function - do not add ENDBR instruction at the beginning of the function.
	2. Appertains to a function pointer - do not track the target function of this pointer by adding nocf_check prefix to the indirect-call instruction.

This patch implements `nocf_check` context for Indirect Branch Tracking.
It also auto generates `nocf_check` prefixes before indirect branchs to jump tables that are guarded by range checks.

Differential Revision: https://reviews.llvm.org/D41879

llvm-svn: 327767
2018-03-17 13:29:46 +00:00
Craig Topper f0815e01d8 [X86] Merge ADDSUB/SUBADD detection into single methods that can detect either and indicate what they found.
Previously, we called the same functions twice with a bool flag determining whether we should look for ADDSUB or SUBADD. It would be more efficient to run the code once and detect either pattern with a flag to tell which type it found.

Differential Revision: https://reviews.llvm.org/D44540

llvm-svn: 327730
2018-03-16 18:25:59 +00:00
Craig Topper 1b8cf49704 [SelectionDAG][ARM][X86] Teach PromoteIntRes_SETCC to do a better job picking the result type for the setcc.
Previously if getSetccResultType returned an illegal type we just fell back to using the default promoted type. This appears to have been to handle the case where for vectors getSetccResultType returns the input type, but the input type itself isn't legal and will need to be promoted. Without the legality check we would never reach a legal type.

But just picking the promoted type to be the setcc type can create strange setccs where the result type is 128 bits and the operand type is 256 bits. If for example the result type was promoted to v8i16 from v8i1, but the input type was promoted from v8i23 to v8i32. We currently handle this with custom lowering code in X86.

This legality check also caused us reject the getSetccResultType when the input type needed to be widened or split. Even though that result wouldn't have caused legalization to get stuck.

This patch tries to fix this by detecting the getSetccResultType needs to be promoted. If its input type also needs to be promoted we'll try a ask for a new setcc result type based on its eventual promoted value. Otherwise we fall back to default type to promote to.

For any other illegal values we might get back from the initial call to getSetccResultType we just keep and allow it to be re-legalized later via splitting or widening or scalarizing.

llvm-svn: 327683
2018-03-15 23:04:11 +00:00
Craig Topper c3983c34cd [X86] Make sure we use FSUB instruction as the reference for operand order in isAddSubOrSubAdd when recognizing subadd
The FADD part of the addsub/subadd pattern can have its operands commuted, but when checking for fsubadd we were using the fadd as reference and commuting the fsub node.

llvm-svn: 327660
2018-03-15 20:30:54 +00:00
Craig Topper 5a0251fe67 [X86] Simplify the type legality checking for (FM)ADDSUB/SUBADD matching. NFCI
Rather than enumerating all specific types, for the DAG combine we can just use TLI::isTypeLegal and an SSE3 check. For the BUILD_VECTOR version we already know the type is legal so we just need to check SSE3.

llvm-svn: 327649
2018-03-15 17:38:59 +00:00
Craig Topper 627e001fad [X86] Fix 80 column violations.
llvm-svn: 327648
2018-03-15 17:38:55 +00:00
Craig Topper 26a3a80c87 [X86] Add support for matching FMSUBADD from build_vector.
llvm-svn: 327604
2018-03-15 06:14:55 +00:00
Craig Topper a5e712f402 [X86] Remove old TODO. We have coverage for this now.
Coverage was added in r320950.

llvm-svn: 327603
2018-03-15 06:14:53 +00:00
Craig Topper b9526e9fdb [X86] Use MVT in a couple places where we know the type is legal.
llvm-svn: 327602
2018-03-15 06:14:51 +00:00
Craig Topper b36cb20ef9 [X86] Teach X86TargetLowering::targetShrinkDemandedConstant to set non-demanded bits if it helps created an and mask that can be matched as a zero extend.
I had to modify the bswap recognition to allow unshrunk masks to make this work.

Fixes PR36689.

Differential Revision: https://reviews.llvm.org/D44442

llvm-svn: 327530
2018-03-14 16:55:15 +00:00
Matt Arsenault 41e5ac4fa4 TargetMachine: Add address space to getPointerSize
llvm-svn: 327467
2018-03-14 00:36:23 +00:00
Craig Topper ec4881ad53 [X86] Simplify the LowerAVXCONCAT_VECTORS code a little by creating a single path for insert_subvector handling.
We now only create recursive concats if we have more than two non-zero values. This keeps our subvector broadcast DAG combine functioning.

llvm-svn: 327457
2018-03-13 22:36:07 +00:00
Craig Topper cc060e921b [X86] Rewrite LowerAVXCONCAT_VECTORS similar to how we handle vXi1 concats.
This better able to detect undef and zeros pieces in the concat. Or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs.

This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best.

We probably want to merge this with the vXi1 concat code since they are very similar.

llvm-svn: 327454
2018-03-13 22:05:25 +00:00
Craig Topper 7e711a6822 [X86] Remove SplitBinaryOpsAndApply and use SplitOpsAndApply by adding curly braces around the ops.
Summary: Unless you were intentionally avoiding this syntax? I saw you mentioned makeArrayRef in your commit that added SplitOpsAndApply.

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D44403

llvm-svn: 327418
2018-03-13 16:23:27 +00:00
Simon Pilgrim 93bd7187f4 [X86][SSE41] createVariablePermute v2X64 - PCMPEQQ can test for index 0/1 and select between them.
llvm-svn: 327385
2018-03-13 12:22:58 +00:00
Craig Topper acaba3b402 [X86] Remove use of MVT class from the ShuffleDecode library.
MVT belongs to the CodeGen layer, but ShuffleDecode is used by the X86 InstPrinter which is part of the MC layer. This only worked because MVT is completely implemented in a header file with no other library dependencies.

Differential Revision: https://reviews.llvm.org/D44353

llvm-svn: 327292
2018-03-12 16:43:11 +00:00
Simon Pilgrim 6618e2a09c [X86][SSE] createVariablePermute - PSHUFB requires SSSE3 not just SSE3
llvm-svn: 327259
2018-03-12 12:30:04 +00:00
Craig Topper 7cc1b1fc84 [X86] Don't compute known bits twice for the same SDValue in LowerMUL.
We called MaskedValueIsZero with two different masks, but underneath that calls computeKnownBits before applying the mask. This means we compute the same known bits twice due to the two calls. Instead just call computeKnownBits directly and apply the two masks ourselves.

llvm-svn: 327251
2018-03-12 05:35:02 +00:00
Simon Pilgrim d09cc9c62c [X86][MMX] Support MMX build vectors to avoid SSE usage (PR29222)
64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type.

This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics.

We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover.

Differential Revision: https://reviews.llvm.org/D43618

llvm-svn: 327247
2018-03-11 19:22:13 +00:00
Simon Pilgrim 30f74c14ff [X86][AVX] createVariablePermute - scale v16i16 variable permutes to use v32i8 codegen
XOP was already doing this, and now AVX performs v32i8 variable permutes as well.

llvm-svn: 327245
2018-03-11 17:23:54 +00:00
Simon Pilgrim b306501796 [X86][AVX] createVariablePermute - widen permutes for cases where the source vector is wider than the destination type
llvm-svn: 327244
2018-03-11 17:00:46 +00:00
Simon Pilgrim 9a5d0c7540 [X86][AVX] createVariablePermute - use PSHUFB+PCMPGT+SELECT for v32i8 variable permutes
Same as the VPERMILPS/VPERMILPD approach for v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. The select between the lo/hi permute results based on the index size.

llvm-svn: 327242
2018-03-11 16:28:11 +00:00
Simon Pilgrim d2fbd87ce8 Fix for buildbots which didn't like makeArrayRef with initializer lists.
llvm-svn: 327241
2018-03-11 14:31:55 +00:00
Simon Pilgrim e60afdf9eb [X86][SSE] Generalized SplitBinaryOpsAndApply to SplitOpsAndApply to support any number of ops.
I've kept SplitBinaryOpsAndApply as a wrapper to avoid a lot of makeArrayRef code.

llvm-svn: 327240
2018-03-11 14:04:53 +00:00
Simon Pilgrim f9cc80d218 [X86][AVX] createVariablePermute - use 2xVPERMIL+PCMPGT+SELECT for v8i32/v8f32 and v4i64/v4f64 variable permutes
As VPERMILPS/VPERMILPD only selects elements based on the bits[1:0]/bit[1] then we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for for lo/hi.

For v4i64/v4f64 this avoids some rather nasty v4i64 multiples on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends.

llvm-svn: 327239
2018-03-11 11:52:26 +00:00
Simon Pilgrim 2565bd421e [X86][AVX512] createVariablePermute - Non-VLX targets can widen v4i64/v8f64 variable permutes to v8i64/v8f64
Permutes in the upper elements will be undefined, but they will be discarded anyway.

llvm-svn: 327238
2018-03-11 11:19:19 +00:00
Simon Pilgrim 64b899f0f3 [x86][SSE] Add widenSubVector helper. NFCI.
Helper function to insert a subvector into the bottom elements of a larger zero/undef vector with the same scalar type.

I've converted a couple of INSERT_SUBVECTOR calls to use it, there are plenty more although in some cases I was worried it might make the code more ambiguous. 

llvm-svn: 327236
2018-03-11 10:50:48 +00:00
Simon Pilgrim de7f3f0f91 [X86][XOP] createVariablePermute - use VPERMIL2 for v8i32/v4i64 variable permutes
llvm-svn: 327222
2018-03-10 19:49:59 +00:00
Simon Pilgrim ff1248f82f [X86][XOP] createVariablePermute - use VPPERM for v16i16 variable permutes
llvm-svn: 327218
2018-03-10 18:33:29 +00:00
Simon Pilgrim d9dc114e2f [X86][SSE] createVariablePermute - create index scaling helper. NFCI.
This will help in some future changes for custom lowering.

llvm-svn: 327217
2018-03-10 18:12:35 +00:00
Simon Pilgrim 8224241f75 [X86][XOP] createVariablePermute - use VPPERM for v32i8 variable permutes
llvm-svn: 327213
2018-03-10 16:51:45 +00:00
Simon Pilgrim 2cd489feb2 [X86][AVX] createVariablePermute - fix v2i64/v2f64 VPERMILPD index creation.
The input indices vector will put the index in bit0, but VPERMILPD actually selects off bit1 - so we need to scale accordingly.

llvm-svn: 327159
2018-03-09 18:37:56 +00:00
Simon Pilgrim 230d38b559 [X86][SSE] createVariablePermute - move source vector canonicalization to top of function. NFCI.
This is to make it easier to return early from the switch statement with custom lowering.

llvm-svn: 327157
2018-03-09 18:08:08 +00:00
Simon Pilgrim 033a4167d2 Tidyup comment that was destroyed by clang-format. NFCI.
llvm-svn: 327141
2018-03-09 15:50:09 +00:00
Simon Pilgrim 322c521ed7 [X86][SSE] createVariablePermute - move index vector canonicalization to top of function. NFCI.
This is to make it easier to return early from the switch statement with custom lowering.

llvm-svn: 327140
2018-03-09 15:48:56 +00:00
Craig Topper 784f1bbf5e [X86] Remove SRAs from v16i8 multiply lowering on sse2 targets
Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did an v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The similar was done for all the odd bytes. The results are then packed together with packuswb

Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does.

Differential Revision: https://reviews.llvm.org/D44267

llvm-svn: 327093
2018-03-09 01:22:31 +00:00
Simon Pilgrim c286680032 [X86][AVX] Pull out variable permute creation from LowerBUILD_VECTORAsVariablePermute. NFCI.
This will make it easier to handle more complex cases than basic scaling or index masks.

llvm-svn: 327054
2018-03-08 20:07:06 +00:00