Commit Graph

7719 Commits

Author SHA1 Message Date
aartbik 8387bee94d [llvm] [X86] Fixed type bug in vselect for AVX masked load
Summary:
Bugzilla issue 45563
https://bugs.llvm.org/show_bug.cgi?id=45563

Reviewers: nicolasvasilache, mehdi_amini, craig.topper

Reviewed By: craig.topper

Subscribers: RKSimon, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D78527
2020-04-21 11:11:35 -07:00
Simon Pilgrim c74acd8fc9 X86ISelLowering.cpp - clang-format to fix col80 limit. NFC. 2020-04-21 15:18:23 +01:00
Craig Topper d7e2d937bc [X86] Add X86ISD nodes for PDEP and PEXT.
This will allow use to add DAG combines for these instructions.
2020-04-19 16:14:13 -07:00
Sanjay Patel 720015e537 [x86] avoid build warning for enum mismatch; NFC
gcc may warn here because X86ISD::NodeType is specified as "unsigned",
but ISD::NodeType is a naked C enum (although passed as an "unsigned"
throughout SDAG).
2020-04-19 10:22:11 -04:00
Simon Pilgrim e71dd7c011 [X86][SSE] getFauxShuffle - don't combine shuffles with small truncated scalars (PR45604)
getFauxShuffle attempts to combine INSERT_VECTOR_ELT(TRUNCATE/EXTEND(EXTRACT_VECTOR_ELT(x))) patterns into a target shuffle chain.

PR45604 identified an issue where the scalar was truncated to a size smaller than the destination vector element and then zero extended back, which requires the upper bits to be zero'd which we don't currently do.

To avoid the bug I've added an early out in these truncation cases, a future commit should allow us to handle this by inserting the necessary SM_SentinelZero padding.
2020-04-19 13:35:22 +01:00
Sanjay Patel cceb630a07 [x86] use vector instructions to lower more FP->int->FP casts
This is an enhancement to D77895 to avoid another
round-trip from XMM->GPR->XMM. This time we handle
the case of starting/ending with an f64 and casting
to signed i32 as the intermediate value.

It's a bit more involved than I initially assumed
because we need to use target-specific opcodes to
represent the non-standard cast ops.

Differential Revision: https://reviews.llvm.org/D78362
2020-04-19 08:33:17 -04:00
Simon Pilgrim 9559557014 X86InstrFMA3Info.h - remove unnecessary includes. NFC.
There were a number of cpp files explicitly relying on X86InstrFMA3Info.h to include the X86.h header - so I've had to add it locally.
2020-04-19 12:17:56 +01:00
Sanjay Patel 818126ae97 [x86] rename variables for types for readability; NFC
This gets harder to follow if we allow changing types/sizes
between source, dest, and intermediate value.
2020-04-17 08:41:18 -04:00
Christopher Tetreault 85247c1e89 [SVE] Remove calls to getBitWidth from x86
Reviewers: efriedma, RKSimon, sdesmalen

Reviewed By: RKSimon

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77901
2020-04-15 15:48:48 -07:00
Craig Topper 8dfb9627b7 [X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead.
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.

This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.

Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.

Differential Revision: https://reviews.llvm.org/D76212
2020-04-15 12:17:18 -07:00
Craig Topper a916e81927 [X86] Various improvements to our vector splitting helpers for lowering. NFC
-Consistently name the functions as split*
-Add a helper for doing the two extractSubvector calls and determining the size of the split
-Use getSplitDestVTs to get the result type for the split node.
-Move the binary and unary helper to one place in the file near the extractSubvector functions. Left the VSETCC one near LowerVSETCC since that's its only caller.
-Remove the 256/512 wrappers that just had asserts. I don't think they provided a lot of value and now with the routines called split* the call sites are more obvious what they do.
-Make the unary routine support different source and dest types to support D76212.
-Add some weaker asserts into the helpers to make up for losing the very specific asserts from the 256/512 wrappers.

Differential Revision: https://reviews.llvm.org/D78176
2020-04-15 10:57:53 -07:00
Georgii Rymar 1647ff6e27 [ADT/STLExtras.h] - Add llvm::is_sorted wrapper and update callers.
It can be used to avoid passing the begin and end of a range.
This makes the code shorter and it is consistent with another
wrappers we already have.

Differential revision: https://reviews.llvm.org/D78016
2020-04-14 14:11:02 +03:00
Craig Topper 113f37a1f9 [CallSite removal][TargetLowering] Replace ImmutableCallSite with CallBase
Differential Revision: https://reviews.llvm.org/D77995
2020-04-13 13:50:15 -07:00
Craig Topper 6dbf1a1229 [X86] Move X86ShuffleDecode.cpp/h into MCTargetDesc and remove X86Utils library. NFC
The shuffle decoding is used by X86ISelLowering and
MCTargetDesc/X86InstComments. The latter used to be in a
separate InstPrinter library. The Utils library existed to allow
InstPrinter and CodeGen to share the shuffle decoding. Since
X86InstComments now lives in the MCTargetDesc, which CodeGen
already depends on, we can sink the shuffle decoding there as well.

Differential Revision: https://reviews.llvm.org/D77980
2020-04-13 10:14:08 -07:00
Jay Foad bc78baec4c [X86] Improve combineVectorShiftImm
Summary:
Fold (shift (shift X, C2), C1) -> (shift X, (C1 + C2)) for logical as
well as arithmetic shifts. This is needed to prevent regressions from
an upcoming funnel shift expansion change.

While we're here, fold (VSRAI -1, C) -> -1 too.

Reviewers: RKSimon, craig.topper

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77300
2020-04-13 15:54:55 +01:00
Simon Pilgrim 401cbe373b [X86][AVX] Attempt to scale masked shuffles to match the root type
Improve the chances of folding the writemask into the combined shuffle by scaling a wider shuffle mask to match the root's original type.

This creates a few minor issues with variable shuffles, preventing combines of shuffles because of the more limited support binary shuffle types. In most cases we're probably better off combining the shuffles and losing the writemask fold, but this isn't always going to be true.
2020-04-13 14:57:25 +01:00
Simon Pilgrim fdd9ff9700 [X86][AVX] Create splitVectorIntBinary helper.
Removes duplicate code from split256IntArith/split512IntArith.
2020-04-13 13:09:38 +01:00
Sanjay Patel d04db4825a [x86] use vector instructions to lower FP->int->FP casts
As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM
by using the vector versions of the convert instructions.

Based on experimental results from recent Intel/AMD chips, we don't
need to worry about triggering denorm stalls while operating on
garbage data in the high lanes with convert instructions, so this is
expected to always be as good or better perf than the scalar
instruction equivalent. FP exceptions are also not a concern because
strict code should not be using the regular SDAG opcodes.

Differential Revision: https://reviews.llvm.org/D77895
2020-04-12 10:26:43 -04:00
Simon Pilgrim 40581a0a2b [X86] Use isAnyZero shuffle mask helper where possible. NFC. 2020-04-12 10:57:29 +01:00
Craig Topper d3465e0691 [X86] Enable shuffle combining for AVX512 unless the root is used by a vselect
A lot of vectorized code doesn't use masks so we shouldn't penalize them by not doing shuffle combining on avx512 targets.

I've added support for VALIGNQ/VALIGND and 512-bit SHUF128 to prevent some regressions. I also prevented recombining 256-bit SHUF128 to PERM2X128. We may not need to add 256-bit SHUF128 support, but I don't think I found any cases requiring that in my testing.

Differential Revision: https://reviews.llvm.org/D77928
2020-04-11 20:05:10 -07:00
Sanjay Patel 1318ddbc14 [VectorUtils] rename scaleShuffleMask to narrowShuffleMaskElts; NFC
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.

While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
2020-04-11 10:05:49 -04:00
Craig Topper a6732069ee [CallSite removal][X86] Remove unneeded use of CallSite. NFC
We already have a CallInst, we can just get the calling convention from it.
2020-04-10 10:27:21 -07:00
Jay Foad c63aed890e [KnownBits] Move AND, OR and XOR logic into KnownBits
Summary:
There are at least three clients for KnownBits calculations:
ValueTracking, SelectionDAG and GlobalISel. To reduce duplication the
common logic should be moved out of these clients and into KnownBits
itself.

This patch does this for AND, OR and XOR calculations by implementing
and using appropriate operator overloads KnownBits::operator& etc.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74060
2020-04-09 10:10:37 +01:00
Matt Arsenault 84aa58cbe2 CodeGen: Use Register in TargetLowering 2020-04-08 12:10:58 -04:00
Simon Pilgrim 66c18c729d [X86][SSE] Combine PTEST(AND(X,Y),AND(X,Y)) -> PTEST(X,Y) and ANDN equivalents
Tests derived from PR42035 examples
2020-04-08 12:42:22 +01:00
Craig Topper c41685b16f [SelectionDAG] Make getZeroExtendInReg take a vector VT if the operand VT is a vector.
This removes a call to getScalarType from a bunch of call sites.
It also makes the behavior consistent with SIGN_EXTEND_INREG.

Differential Revision: https://reviews.llvm.org/D77631
2020-04-07 11:34:08 -07:00
Simon Pilgrim e3b6059776 [X86][SSE] combineX86ShufflesConstants - early out for zeroable vectors (PR45443)
Shuffle combining can insert zero byte sized elements into the shuffle mask, which combineX86ShufflesConstants will attempt to fold without taking into account whether the byte-sized type is legal (e.g. AVX512F only targets).

If we have a full-zeroable vector then we should just return a zero version of the root type, otherwise if the type isn't valid we should bail.

Fixes PR45443
2020-04-07 14:45:29 +01:00
David Blaikie 5aead592f0 X86ISelLowering: Minor refactor to avoid redundant initialization while ensuring compiler warnings can hopefully still prove initialization
Based on post-commit review/discussion in fabe52a7412b
2020-04-06 14:25:52 -07:00
Simon Pilgrim 9bc5b1a489 [X86][SSE] combineVectorSignBitsTruncation - remove minimum vector length limitations
truncateVectorWithPACK has its own vector length controls, so we can rely on those directly. This helps some existing truncation to subvector tests, which were being combined later during shuffle lowering at which point the sign/zero bit detection had become obscured preventing lowerShuffleWithPACK working as well as it could.
2020-04-06 12:45:23 +01:00
Simon Pilgrim a43e233606 Remove unused function 'isInRange'. NFCI. 2020-04-05 23:11:24 +01:00
Simon Pilgrim 4431a29c60 [X86][SSE] Combine unary shuffle(HORIZOP,HORIZOP) -> HORIZOP
We had previously limited the shuffle(HORIZOP,HORIZOP) combine to binary shuffles, but we can often merge unary shuffles just as well, folding in UNDEF/ZERO values into the 64-bit half lanes.

For the (P)HADD/HSUB cases this is limited to fast-horizontal cases but PACKSS/PACKUS combines under all cases.
2020-04-05 22:49:46 +01:00
Benjamin Kramer ff889df356 [X86] Roll some loops. NFCI. 2020-04-05 13:59:50 +02:00
Simon Pilgrim 3079e51858 [X86][SSE] Generalize shuffle(HORIZOP,HORIZOP) -> HORIZOP combine
Our existing combine allows to merge the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.

This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
2020-04-05 12:09:19 +01:00
Simon Pilgrim a17de6b91c [X86][SSE] truncateVectorWithPACK - upper undef for 128->64 packing
If we're packing from 128-bits to 64-bits then we don't need the RHS argument. This helps with register allocation, especially as we avoid repeating a use of the input value.
2020-04-05 11:47:36 +01:00
Simon Pilgrim e5e719d885 [X86][SSE] lowerV8I16Shuffle - lower compaction shuffles using PACKUSDW(PBLENDW,PBLENDW) on SSE41+
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.

This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.

Noticed in codegen while dealing with PR31443.
2020-04-04 13:08:25 +01:00
Simon Pilgrim 34a497b765 [X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations
Extend lowerShuffleWithPACK/matchShuffleWithPACK/createPackShuffleMask to handle compaction style shuffle masks that can be lowered to chains of PACKSS/PACKUS if their inputs are suitably sign/zero extended.

This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask should recognise the PACKSS/PACKUS chains.
2020-04-03 18:26:10 +01:00
Scott Constable 5b519cf1fc [X86] Add Indirect Thunk Support to X86 to mitigate Load Value Injection (LVI)
This pass replaces each indirect call/jump with a direct call to a thunk that looks like:

lfence
jmpq *%r11

This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity)
The feature can be added via clang CLI using -mlvi-cfi.

This is an alternate implementation to https://reviews.llvm.org/D75934 That merges the thunk insertion functionality with the existing X86 retpoline code.

Differential Revision: https://reviews.llvm.org/D76812
2020-04-03 00:34:39 -07:00
Scott Constable 71e8021d82 [X86][NFC] Generalize the naming of "Retpoline Thunks" and related code to "Indirect Thunks"
There are applications for indirect call/branch thunks other than retpoline for Spectre v2, e.g.,

https://software.intel.com/security-software-guidance/software-guidance/load-value-injection

Therefore it makes sense to refactor X86RetpolineThunks as a more general capability.

Differential Revision: https://reviews.llvm.org/D76810
2020-04-02 21:55:13 -07:00
Craig Topper 4fdb63bbf0 [X86] Enable combineExtSetcc for vectors larger than 256 bits when we've disabled 512 bit vectors.
The compares are going to be type legalized to 256 bits so we
might as well fold the extend.
2020-04-02 12:44:27 -07:00
Guillaume Chatelet fc63c4d8ce [Alignment][NFC] Remove remaining uses of MachineFrameInfo::setObjectAlignment
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77217
2020-04-01 14:38:05 +00:00
Simon Pilgrim eb8880562e [X86][SSE] combinePTESTCC - fold TESTZ(X,~Y) -> TESTC(Y,X) 2020-04-01 15:10:53 +01:00
Guillaume Chatelet 3a78f44daf [Alignment][NFC] Convert SelectionDAG::InferPtrAlignment to MaybeAlign
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77212
2020-04-01 13:22:11 +00:00
Simon Pilgrim 481413d394 [X86][SSE] matchShuffleWithPACK - generalize zero/signbits matching for any packed src type
First step toward making use of canLowerByDroppingEvenElements to match chains of PACKSS/PACKUS for compaction shuffles.

At the moment we still only match a single stage but the MatchPACK is now more general.
2020-04-01 14:10:32 +01:00
Simon Pilgrim 918ccb64b0 [X86][SSE] Handle basic inversion of PTEST/TESTP operands (PR38522)
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0

If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correct handle the inversion implicitly.

Additionally, for "TESTZ" (ZF) cases, the allones case, PTEST(X,-1) can be simplified to PTEST(X,X).

We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.

Differential Revision: https://reviews.llvm.org/D76984
2020-04-01 11:33:28 +01:00
Bjorn Pettersson ef49895da8 [X86] Do not assume types are legal in getFauxShuffleMask
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77136
2020-04-01 11:40:18 +02:00
Guillaume Chatelet c7468c1696 [Alignment][NFC] Use Align in SelectionDAG::getMemIntrinsicNode
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: jholewinski, nemanjai, hiraditya, kbarton, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77149
2020-04-01 09:32:05 +00:00
Craig Topper f92563f907 [VectorUtils][X86] De-templatize scaleShuffleMask and 2 X86 shuffle mask helpers and move their implementation to cpp files
Summary: These were templated due to SelectionDAG using int masks for shuffles and IR using unsigned masks for shuffles. But now that D72467 has landed we have an int mask version of IRBuilder::CreateShuffleVector. So just use int instead of a template

Reviewers: spatel, efriedma, RKSimon

Reviewed By: efriedma

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D77183
2020-04-01 00:46:48 -07:00
Simon Pilgrim f9f401dba1 [X86][AVX] Add additional 256/512-bit test cases for PACKSS/PACKUS shuffle patterns
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
2020-04-01 08:19:03 +01:00
Simon Pilgrim 7e0e5fa499 Revert rGefe59d6717dcdf7777acb9b7a734e1a520bdf22a "[X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations"
This might be causing an issue on the fuchsia-x86_64-linux buildbot - reverting to see what happens.
2020-03-31 15:47:30 +01:00
Simon Pilgrim 38619fa7da Fix enumeral mismatch warning. NFCI.
Don't mix llvm::ISD and llvm::X86ISD.
2020-03-31 15:38:02 +01:00
Simon Pilgrim efe59d6717 [X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.

This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
2020-03-31 14:48:48 +01:00
Simon Pilgrim 98357dee1c [X86] Combine concat(palignr,palignr) -> palignr(concat,concat)
combineX86ShufflesRecursively should handle this someday
2020-03-31 11:06:35 +01:00
Simon Pilgrim 7a4a98a9c4 [X86] Move canLowerByDroppingEvenElements earlier to be with matchShuffleWithPACK. NFCI.
Make sure its defined earlier so more shuffle lowering methods can use it.
2020-03-31 10:56:35 +01:00
Guillaume Chatelet bdf77209b9 [Alignment][NFC] Use Align version of getMachineMemOperand
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: jyknight, sdardis, nemanjai, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, jfb, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77059
2020-03-30 15:46:27 +00:00
Simon Pilgrim e95d04f4f1 [X86][AVX] lowerV4X128Shuffle - attempt to widen to 2x256 to simplify shuffles
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
2020-03-30 12:22:26 +01:00
Simon Pilgrim 9c8ec99c80 [X86][AVX] Combine 128/256-bit lane shuffles with zeroable upper subvectors to EXTRACT_SUBVECTOR (PR40720)
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128/SHUF128, and we can use the implicit zeroing of the uppers.
2020-03-29 19:51:38 +01:00
Simon Pilgrim 8206c50cde [X86] Add isAnyZero shuffle mask helper 2020-03-29 19:51:37 +01:00
Simon Pilgrim 7734e4b3a3 [X86][AVX] Combine 128-bit lane shuffles with a zeroable upper half to EXTRACT_SUBVECTOR (PR40720)
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.

I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
2020-03-29 16:41:59 +01:00
Simon Pilgrim da4c7db793 [X86] Rename matchShuffleAsByteRotate to matchShuffleAsElementRotate. NFC.
This was an inner helper function for the real matchShuffleAsByteRotate function, but it is more generic and is used directly for VALIGN lowering which doesn't work at the byte level.
2020-03-29 16:41:58 +01:00
Simon Pilgrim 10439f9e32 [X86][AVX] Add X86ISD::VALIGN target shuffle decode support
Allows us to combine VALIGN instructions with other shuffles - the combiner doesn't create VALIGN yet though.
2020-03-29 16:41:58 +01:00
Craig Topper 9f7d4150b9 [X86] Move combineLoopMAddPattern and combineLoopSADPattern to an IR pass before SelecitonDAG.
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.

This pass adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.

An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.

Differential Revision: https://reviews.llvm.org/D76649
2020-03-26 14:10:20 -07:00
Simon Pilgrim ad36491ebb [X86] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on all targets
Extends rG9d1721ce3926 to support AVX2+ targets.
2020-03-26 20:46:24 +00:00
Simon Pilgrim 39a52a19ed [X86] lowerV16I8Shuffle - create v8i16 mask for PACKUS(AND(),AND()) patterns.
We can improve computeKnownBits results by avoiding excess bitcasts.

For this pattern we were doing:

  (v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))

By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on.

This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
2020-03-26 19:59:57 +00:00
Simon Pilgrim 9d1721ce39 [X86][SSE] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on pre-AVX2 targets
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.

The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.

We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
2020-03-26 15:47:43 +00:00
Simon Pilgrim e30d29ebc1 [X86][SSE] getFauxShuffleMask - peek through TRUNCATE/AEXT/ZEXT for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT())
As long we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
2020-03-26 11:57:45 +00:00
Simon Pilgrim c6e5531f9b [X86][AVX] Combine shuffles to TRUNCATE/VTRUNC patterns
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.

We could probably extend this further to handle non-VLX truncation cases.
2020-03-25 17:41:51 +00:00
Reid Kleckner 597718aae0 Re-land "Avoid emitting unreachable SP adjustments after `throw`"
This reverts commit 4e0fe038f4. Re-lands
65b21282c7.

After landing 5ff5ddd0ad to add int3 into
trailing unreachable blocks, we can now remove these extra stack
adjustments without confusing the Win64 unwinder. See
https://llvm.org/45064#c4 or X86AvoidTrailingCall.cpp for a full
explanation.

Fixes PR45064.
2020-03-24 12:04:43 -07:00
Simon Pilgrim 714402147d [X86][SSE1] Add support for logic+movmsk patterns (PR42870)
rL368506 handled the basic case, but we need to account for boolean logic patterns as well.
2020-03-24 14:28:40 +00:00
Guillaume Chatelet 3ba550a05a [Alignment][NFC] Use TFL::getStackAlign()
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: dylanmckay, sdardis, nemanjai, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D76551
2020-03-23 13:48:29 +01:00
Craig Topper e2cb121374 [X86] Remove maximum vector length limit from combineBasicSADPattern.
createPSADBW uses SplitsOpsAndApply so should be able to handle
any size.

Restrict the extract result type to i32 or i64 since that's what
we have coverage for today and probably matches what the
isSimple() check gave us before.

Differential Revision: https://reviews.llvm.org/D76560
2020-03-22 15:02:05 -07:00
Craig Topper b89ae50795 [X86] Remove maximum vector width restriction from combineLoopSADPattern.
SplitsOpsAndApply will take care of any needed splitting correctly.
All that we need to check is that the vector element count is a
power of 2.

Differential Revision: https://reviews.llvm.org/D76558
2020-03-22 11:09:14 -07:00
Simon Pilgrim 25eb9056d7 [X86] getTargetShuffleAndZeroables - add insert_subvector(undef, sub, c) handling.
We often widen xmm/ymm vectors to ymm/zmm by insertion into an undef base vector. By letting getTargetShuffleAndZeroables track the undef elts we can help avoid a lot of unnecessary cross-lane shuffles.

Fixes PR44694
2020-03-21 19:11:42 +00:00
Simon Pilgrim 4ceade0428 [X86] Combine concat(shufps,shufps) -> shufps(concat,concat)
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.

Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
2020-03-21 12:44:10 +00:00
Simon Pilgrim f424d51c3e Revert rGe6a7e3b5e3e7 "[X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles."
This reverts commit e6a7e3b5e3.

Avoids register pressure regression reported at PR45263
2020-03-21 12:14:19 +00:00
Craig Topper 32fbea1548 [X86] Prevent (bitcast (broadcast_load)) combine from producing vXf16 broadcast instructions.
The combine tries to put the broadcast in either the integer or
fp domain to match the bitcast domain. But we can only do this
if the broadcast size is 32 or larger.
2020-03-20 09:15:07 -07:00
Nico Weber 4e0fe038f4 Revert "Avoid emitting unreachable SP adjustments after `throw`"
This reverts commit 65b21282c7.
Breaks sanitizer bots (https://reviews.llvm.org/D75712#1927668)
and causes https://crbug.com/1062021 (which may or may not
be a compiler bug, not clear yet).
2020-03-17 20:49:22 -04:00
Simon Pilgrim c9656a3b31 [DAGCombiner] matchRotateSub - handle shift amount truncation
Under certain circumstances we'll end up in the position where the negated shift amount will get truncated to the type specified getScalarShiftAmountTy(), so we need to test for a truncated version of the shift amount as well.

This allows us to remove half of the remaining patterns tested for by X86ISelLowering's combineOrShiftToFunnelShift.
2020-03-17 16:01:23 +00:00
Simon Pilgrim ebb181cf40 [X86] matchScalarReduction - add support for partial reductions
Add optional support for opt-in partial reduction cases by providing an optional partial mask to indicate which elements have been extracted for the scalar reduction.
2020-03-16 18:01:02 +00:00
Simon Pilgrim e43a085781 [X86] X86::isConstantSplat - enable partial undef bit handling by default.
We currently only ever use this for lowering constant uniform values (shift/rotate by immediate) so we can safely enable it by default (it treats the undef bits as zero when extracting constants).

This is necessary for an upcoming patch that will use SimplifyDemandedBits more aggressively on funnel shift amounts and causes regressions in vXi64 constant without it.
2020-03-16 12:56:24 +00:00
Simon Pilgrim ac4609cb1d [X86] LowerRotate - use X86::isConstantSplat to detect constant splat rotation amounts.
Avoid code duplication and matches what we do for the similar LowerFunnelShift and LowerScalarImmediateShift methods.
2020-03-16 12:56:23 +00:00
Simon Pilgrim ee862adf60 Fix signed/unsigned comparison warning. 2020-03-14 18:42:27 +00:00
Simon Pilgrim 0cb2f089c1 [X86] getFauxShuffleMask - pull out repeated byte sizes varaibles. NFC. 2020-03-14 17:36:17 +00:00
Simon Pilgrim f47f4c137b [X86] getFauxShuffleMask - merge insertelement paths
Merge the INSERT_VECTOR_ELT/SCALAR_TO_VECTOR and PINSRW/PINSRB shuffle mask paths - they both do the same thing (find source vector + handle implicit zero extension). The PINSRW/PINSRB path also handled in the insertion of zero case which needed to be added to the general case as well.
2020-03-14 13:11:03 +00:00
Craig Topper 755e00876c [X86] Remove isel patterns for X86VBroadcast+trunc+extload. Replace with DAG combines.
This is a little more complicated than I'd like it to be. We have
to manually match a trunc+srl+load pattern that generic DAG
combine won't do for us due to isTypeDesirableForOp.
2020-03-13 18:12:16 -07:00
Simon Pilgrim 05c0d34918 [X86][SSE] Prefer trunc(movd(x)) to pextrb(x,0)
If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB.

The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4).

This matches what we already do for PEXTRW.

Differential Revision: https://reviews.llvm.org/D76138
2020-03-13 18:43:04 +00:00
Simon Pilgrim 846c614f54 [X86] combineExtractWithShuffle - pull out repeated getSizeInBits() call. NFC. 2020-03-13 15:36:04 +00:00
Simon Pilgrim fe047fbccc [X86] LowerEXTRACT_VECTOR_ELT - pull out repeated getOperand() calls. NFC.
Also, cleanup LowerEXTRACT_VECTOR_ELT_SSE4 comments which had references to non-constant extraction indices.
2020-03-13 15:36:02 +00:00
Simon Pilgrim 4689eae820 [X86] combineOrShiftToFunnelShift - remove shift by immediate handling.
Now that D75114 has landed, DAGCombiner handles this case so the code is redundant.
2020-03-12 11:46:51 +00:00
Simon Pilgrim b3b4727a3e [X86] Replace (most) X86ISD::SHLD/SHRD usage with ISD::FSHL/FSHR generic opcodes (PR39467)
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly, we just need to account for the operand commutation for SHRD.

The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs funnel shift modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which matches the generic implementation in all other terms.

Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this but it allows the DAG combines to continue to operate after legalization in a lot more cases.

The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough for the branch fall through code to mess up.

Differential Revision: https://reviews.llvm.org/D75748
2020-03-11 11:17:49 +00:00
Simon Pilgrim c8ede5e485 [X86][SSE] getFauxShuffleMask - add support for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT) shuffle pattern
We already do this for PINSRB/PINSRW and SCALAR_TO_VECTOR.
2020-03-10 15:42:37 +00:00
Simon Pilgrim e6a7e3b5e3 [X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles.
This causes one minor test change but is mainly necessary for an upcoming patch.
2020-03-10 15:42:36 +00:00
Simon Pilgrim 18c19441d1 [X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.

At some point soon we're going to have make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
2020-03-10 10:44:28 +00:00
Craig Topper ef4f939d38 [X86] Remove isel patterns for (X86VBroadcast (i16 (trunc (i32 (load))))). Replace with a DAG combine to form VBROADCAST_LOAD.
isTypeDesirableForOp prevents loads from being shrunk to i16 by DAG
combine. Because of this we can't just match the broadcast and a
scalar load. So look for broadcast+truncate+load and form a
vbroadcast_load during DAG combine. This replaces what was
previously done as an isel pattern and I think fixes it so we
won't change the size of a volatile load. But my main motivation
is just to clean up our isel patterns.
2020-03-10 00:07:07 -07:00
Simon Pilgrim 4b130b883d [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - reduce vector width of X86ISD::BLENDI
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.

This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is I don't consider that a showstopper, just an annoyance (and there's plenty of other poor codegen cases in that file).
2020-03-09 18:29:28 +00:00
Craig Topper 3dcc0db15e [X86] Teach combineToExtendBoolVectorInReg to create opportunities for using broadcast load instructions.
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.

Previously we any_extended in the scalar domain before inserting.

This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.

This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.

Differential Revision: https://reviews.llvm.org/D75835
2020-03-09 11:26:12 -07:00
Djordje Todorovic c15c68abdc [CallSiteInfo] Enable the call site info only for -g + optimizations
Emit call site info only in the case of '-g' + 'O>0' level.

Differential Revision: https://reviews.llvm.org/D75175
2020-03-09 12:12:44 +01:00
Craig Topper 70e4fb8a53 [X86] Add DAG combine to turn (vzext_movl (vbroadcast_load)) -> vzext_load.
If we're zeroing the other elements then we don't need the broadcast.
2020-03-08 00:35:40 -08:00
Craig Topper d81d451442 [X86] Add DAG combine to replace vXi64 vzext_movl+scalar_to_vector with vYi32 vzext_movl+scalar_to_vector if the upper 32 bits of the scalar are zero.
We can just use a 32-bit copy and zero in the SSE domain when we
zero the upper bits.

Remove an isel pattern that becomes dead with this.
2020-03-07 16:14:26 -08:00
Craig Topper d41ea65ee8 [X86] Add DAG combines to enable removing of movddup/vbroadcast + simple_load isel patterns. 2020-03-07 15:22:02 -08:00
Craig Topper bc65b68661 [X86] Add a DAG combine to turn vbroadcast(vzload X) -> vbroadcast_load
Remove now unneeded isel patterns.
2020-03-07 15:22:02 -08:00
Craig Topper ec1d1f6ae7 [X86] Use MVT instead of EVT in a couple shuffle lowering functions. 2020-03-07 09:50:53 -08:00
Reid Kleckner 65b21282c7 Avoid emitting unreachable SP adjustments after `throw`
In 172eee9c, we tried to avoid these by modelling the callee as
internally resetting the stack pointer.

However, for the majority of functions with reserved stack frames, this
would lead LLVM to emit extra SP adjustments to undo the callee's
internal adjustment. This lead us to fix the problem further on down the
pipeline in eliminateCallFramePseudoInstr. In 5b79e603d3, I added
use a heuristic to try to detect when the adjustment would be
unreachable.

This heuristic is imperfect, and when exception handling is involved, it
fails to fire. The new test is an example of this. Simply throwing an
exception with an active cleanup emits dead SP adjustments after the
throw. Not only are they dead, but if they were executed, they would be
incorrect, so they are confusing.

This change essentially reverts 172eee9c and makes the 5b79e603d3
heuristic responsible for preventing unreachable stack adjustments. This
means we may emit unreachable stack adjustments for functions using EH
with unreserved call frames, but that is not very many these days. Back
in 2016 when this change was added, we were focused on 32-bit, which we
observed to have fewer reserved frames.

Fixes PR45064

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D75712
2020-03-06 13:33:45 -08:00
Craig Topper 4c7c87f245 [X86] Simplify the code at the end of lowerShuffleAsBroadcast.
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.

Now we handle the scalar case directly by broadcasting using the
scalar type as the element type. Then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.

For the vector case, we cast to VT or subvector of VT. And then
do the broadcast.

I think this all matches what we generated before, just in a more
readable way.
2020-03-04 20:45:02 -08:00
Craig Topper eadea7868f [X86] Convert vXi1 vectors to xmm/ymm/zmm types via getRegisterTypeForCallingConv rather than using CCPromoteToType in the td file
Previously we tried to promote these to xmm/ymm/zmm by promoting
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a non-sensical scalar to vector. This lead
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.

Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved. Except when the calling convention is one
that passes the vXi1 type in a k register.

The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we go to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.

Differential Revision: https://reviews.llvm.org/D75154
2020-03-04 15:02:32 -08:00
Craig Topper 06de426426 [X86] Directly form VBROADCAST_LOAD in lowerShuffleAsBroadcast on AVX targets.
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.

I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
2020-03-04 09:11:57 -08:00
Craig Topper 9284abd004 [X86] Directly form VBROADCAST_LOAD for BUILD_VECTOR of splat loads in lowerBuildVectorAsBroadcast. 2020-03-03 22:27:34 -08:00
Craig Topper 3c4e635593 [X86] Always emit an integer vbroadcast_load from lowerBuildVectorAsBroadcast regardless of AVX vs AVX2
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.

Differential Revision: https://reviews.llvm.org/D75413
2020-03-03 10:39:11 -08:00
Craig Topper 56cd3bc209 [X86] Directly emit VBROADCAST_LOAD from constant pool in lowerBuildVectorAsBroadcast
Also add a DAG combine to combine different sized broadcasts from
constant pool to avoid a regression.

Differential Revision: https://reviews.llvm.org/D75412
2020-03-03 10:39:10 -08:00
Craig Topper 68aeaab888 [X86] Don't count the chain uses when forming broadcast loads in lowerBuildVectorAsBroadcast.
The build_vector needs to be the only user of the data, but the
chain will likely have another use. So we can't make sure the
build_vector is the only user of the node.
2020-03-03 08:41:31 -08:00
Craig Topper 2f4f8fcf64 [X86] Don't add DELETED_NODES to DAG combine worklist after calling SimplifyDemandedBits/SimplifyDemandedVectorElts.
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.

Fixes PR45067.
2020-03-01 00:06:32 -08:00
Craig Topper f2d45e5097 [X86] Canonicalize (bitcast (vbroadcast_load)) so that the cast and vbroadcast_load are both integer or fp.
Helps a little with some isel pattern matching. Especially on
32-bit targets where we sometimes use f64 loads.
2020-02-28 15:07:49 -08:00
Craig Topper b68eeff05c [X86] Cleanup a comment around bitcasting X86ISD::VBROADCAST_LOAD and add an assert to make sure memory VT size doesn't change. 2020-02-28 15:07:49 -08:00
Craig Topper c0d0e6b198 [X86] Recognize CVTPH2PS from STRICT_FP_EXTEND
This should avoid scalarizing the cvtph2ps intrinsics with D75162

Differential Revision: https://reviews.llvm.org/D75304
2020-02-28 10:19:57 -08:00
Simon Pilgrim f90cc633de Fix cppcheck definition/declaration arg mismatch warnings. NFCI. 2020-02-27 14:35:20 +00:00
Simon Pilgrim fe6bcfaf3b [X86] Use Subtarget.useSoftFloat() in X86TargetLowering constructor
Avoid use of X86TargetLowering::useSoftFloat() in the constructor as its a virtual function
2020-02-27 14:35:20 +00:00
Simon Pilgrim e61e7f0794 Fix shadow variable warning. NFC. 2020-02-27 14:23:05 +00:00
Simon Pilgrim dc7ac563ac Fix shadow variable warnings. NFC. 2020-02-27 14:21:30 +00:00
Simon Pilgrim efe2f59ec4 [X86] LowerMSCATTER/MGATHER - reduce scope of MaskVT. NFCI.
Fixes cppcheck warning.
2020-02-27 14:20:44 +00:00
Simon Pilgrim fabe52a741 Fix uninitialized variable warning. NFC. 2020-02-27 14:20:43 +00:00
Simon Pilgrim 6bdd63dc28 [X86] createVariablePermute - handle case where recursive createVariablePermute call fails
Account for the case where a recursive createVariablePermute call with a wider vector type fails.

Original test case from @craig.topper (Craig Topper)
2020-02-27 13:52:31 +00:00
Craig Topper 82a21c1655 [X86] Add proper MachinePointerInfo to stack store created in LowerWin64_i128OP. 2020-02-26 16:55:24 -08:00
Craig Topper 870363a22d [X86] Explicitly pass Destination VT and debug location to BuildFILD. NFC
We'd already passed most everything else. Might was well pass
these two things and stop passing Op.
2020-02-26 16:26:46 -08:00
Craig Topper 15e2831fcd [X86] Explicitly pass Pointer, MachinePointerInfo and Alignment to BuildFILD.
Previously this code was called into two ways, either a FrameIndexSDNode
was passed in StackSlot. Or a load node was passed in the argument
called StackSlot. This was determined by a dyn_cast to FrameIndexSDNode.

In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.

For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.

This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guess work and reverse engineering.

I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just clean up to make the code less surprising.
2020-02-26 16:26:26 -08:00
Craig Topper 77d9b7b2cd [X86] Query constant pool object alignment instead of hardcoding. 2020-02-26 14:45:39 -08:00
Craig Topper 9c1a707ba3 [X86] Use proper alignment for stack temporary and correct MachinePointerInfo for stack accesses in LowerUINT_TO_FP. 2020-02-26 14:45:38 -08:00
Craig Topper a8186935ae [X86] Use correct MachineMemOperand for stack load in LowerFLT_ROUNDS_ 2020-02-26 14:45:38 -08:00
Craig Topper 735d27dc40 [SelectionDAG][PowerPC][AArch64][X86][ARM] Add chain input and output the ISD::FLT_ROUNDS_
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.

This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've update all in-tree targets to connect their chain through their lowering code.

Differential Revision: https://reviews.llvm.org/D75132
2020-02-25 16:58:23 -08:00
Craig Topper 9238dfb4d8 [X86] Remove mask output from X86 gather/scatter ISD opcodes.
Instead add it when we make the machine nodes during instruction
selections.

This makes this ISD node closer to ISD::MGATHER. Trying to see
if we remove the X86 specific ones.
2020-02-24 23:56:28 -08:00
Simon Pilgrim daac8dba77 [X86] combineX86ShuffleChain - select X86ISD::FAND/ISD::AND based on MaskVT
Noticed by inspection, we shouldn't use FloatDomain directly, we've already bitcast both inputs to MaskVT so select the opcode using that.
2020-02-24 18:24:44 +00:00
Simon Pilgrim 59d8d13c7b [X86] getTargetShuffleInputs - check that the source inputs are all the right size.
I'm hoping to begin improving shuffle combining across different vector sizes, but before that we must ensure that all existing getTargetShuffleInputs calls must bail if the inputs aren't the same size.
2020-02-24 16:26:10 +00:00
Craig Topper 7a7146cf72 [X86] When creating X86ISD::MGATHER nodes from AVX2 gather intrinsics, cast the mask to integer type.
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming its an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
2020-02-23 23:00:41 -08:00
Craig Topper 5a70518660 [X86] Remove most X86 specific subclasses of MemSDNode. Just use a MemIntrinsicSDNode as we usually do.
Leave the gather/scatter subclasses, but make them inherit from
MemIntrinsicSDNode and delete their constructor and destructor.
This way we can still have the getIndex, getMask, etc. convenience
functions.
2020-02-23 15:13:32 -08:00
Craig Topper 15b6aa7448 [X86] Enable the use of movlps for i64 atomic load on 32-bit targets with sse1.
Still a little room for improvement by using movlps to store to
the stack temporary needed to move data out of the xmm register
after the load.
2020-02-23 15:11:38 -08:00
Craig Topper 2a10f8019d [X86] Use FIST for i64 atomic stores on 32-bit targets without SSE. 2020-02-23 15:11:38 -08:00
Craig Topper 84cd968f75 [X86] Add AddToWorklist(N) after calls to SimplifyDemandedBits/SimplifyDemandedVectorElts that are called on an operand of N.
If a simplication occurs the operand will be added to the worklist.
But since the demanded mask was based on N, we need to make sure
we revisit N in case there are more simplifications to be done.
Returning SDValue(N, 0) as we do, only tells DAG combine that
something changed, but that won't make it add anything to the
worklist.

Found while playing around with using VEXTRACT_STORE in more cases.
But I guess this doesn't affect any of our existing tests.
2020-02-22 21:42:59 -08:00
Craig Topper bdb1729c83 [X86] Teach EltsFromConsecutiveLoads that it's ok to form a v4f32 VZEXT_LOAD with a 64 bit memory size on SSE1 targets.
We can use MOVLPS which will load 64 bits, but we need a v4f32
result type. We already have isel patterns for this.

The code here is a little hacky. We can probably improve it with
more isel patterns.
2020-02-22 18:50:52 -08:00
Craig Topper e7a184fc7c [X86] Use movlps for i64 atomic stores on 32-targets with sse1.
This is similar to using movd which we do for sse2 targets.

I've added a DAG combine for VEXTRACT_STORE to use SimplifyDemandedVectorElts
to clean up some artifacts from type legalization.
2020-02-22 18:22:47 -08:00
Craig Topper 228a2bc9b7 [X86] Teach combineCVTPH2PS to shrink v8i16 loads when the output type is v4f32. Remove extra isel patterns.
Similar to what do for other operations that use a subset of bits.
Allows us to remove a pattern that shrinks a load. Which was
incorrect if the load was volatile.
2020-02-21 18:11:07 -08:00
Nikita Popov c90ea87cfd [X86] Fix SDLoc initialization
Fixes -Wparentheses warning, in this case indicating a genuine
bug.
2020-02-21 18:26:05 +01:00
Craig Topper 97f11600e0 [X86] Don't bother avoiding illegal FCMOVs if we don't have the cmov subtarget feature.
We'll be forced to emit branches so we might as well use the most
direct condition.
2020-02-21 00:34:15 -08:00
Craig Topper 263bef2bbc [X86] Make combineCMov not create unsupported FCMOVs when f32/f64 are using X87.
This makes the behavior consistent with what's in LowerSELECT.
2020-02-21 00:34:15 -08:00
Craig Topper 4576606831 [X86] Remove unnecessary isNullConstant in LowerSelect. NFC
At this point in the code we know that Op1 or Op2 is
all ones. Y points to the other operand. In the case that
Op2 is zero, Op1 must be all ones and Y is Op2. The OR
ORs Y into Res. But if Y is 0 the OR will be folded away by
getNode so we don't need to check for it.
2020-02-20 21:41:13 -08:00
Craig Topper 78be618717 [X86] Add CMOV_VR64 pseudo instruction for MMX. Remove mmx handling from combineSelect.
The combineSelect code was casting to i64 without any check that
i64 was legal. This can break after type legalization.

It also required splitting the mmx register on 32-bit targets.
It's not clear that this makes sense. Instead switch to using
a cmov pseudo like we do for XMM/YMM/ZMM.
2020-02-20 20:30:56 -08:00
Craig Topper e5782377f3 [X86] Add CMOV_VK1 pseudo so we don't crash on v1i1 ISD::SELECT 2020-02-20 15:13:48 -08:00
Craig Topper 7e92769862 [X86] Expand vselect of v1i1 under avx512.
We already do this for v2i1, v4i1, etc.
2020-02-20 15:13:47 -08:00
Craig Topper b00ef8951b [X86] Custom legalize v1i1 UADDSAT/USUBSAT/SADDSAT/UADDSAT to match v2i1/v4i1/v8i1 etc. 2020-02-20 15:13:46 -08:00
Craig Topper d95a10a7f9 [X86] Custom legalize v1i1 add/sub/mul to xor/xor/and with avx512.
We already did this for v2i1, v4i1, v8i1, etc.
2020-02-20 15:13:44 -08:00
Craig Topper c7b54a196e Recommit "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT""
With the correct author this time
2020-02-20 12:28:54 -08:00
Craig Topper 1d8860f90b Revert 714265dabb "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT"
I accidentally messed up the author on the previous commit somehow.
2020-02-20 12:28:33 -08:00
Quentin Colombet 714265dabb [X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT
The type here isn't guaranteed to be a simple type.

Fixes PR44976
2020-02-20 12:25:37 -08:00
Sanjay Patel 064cd2ecdb [x86] allow peeking through an extract_subvector to find a splatted operand
The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)

There are other potential improvements visible like allowing scalarization or vector
narrowing.

Differential Revision: https://reviews.llvm.org/D74909
2020-02-20 13:59:59 -05:00
Craig Topper 0ed7a61543 [X86] Fix a -Wparentheses warning. NFC 2020-02-20 09:32:03 -08:00
Craig Topper 3543ac9ab5 [X86] Rewrite LowerBRCOND to remove dead code and handle ISD::SETCC and overflow ops directly.
There's a lot of old leftover code in LowerBRCOND. Especially
the detecting or AND or OR of X86ISD::SETCC nodes. Those were
needed before LegalizeDAG was changed to visit nodes before
their operands.

It also relied on reversing the output of LowerSETCC to find the
flags producing node to use for the X86ISD::BRCOND node.

Rather than using LowerSETCC this patch uses emitFlagsForSetcc to
handle the integer ISD::SETCC case. This gives the flag producer
and the comparison code to use directly. I've removed the addTest
flag and just produce a X86ISD::BRCOND and return immediately.

Floating point ISD::SETCC case is just an X86ISD::FCMP with special
care for OEQ and UNE derived from the previous code. I've left
f128 out so it will emit a test. And LowerSETCC will be called
later to produce a libcall and X86ISD::SETCC. We have combines
that can merge the test and X86ISD::SETCC.

We need to handle two cases for overflow ops. Either they are used
directly or they have a seteq 0 or setne 1 to invert the overflow.
The old code did not handle the setne 1 case, but I think some
other combines were making up for it.

If we fail to find a condition, we'll wrap an AND with 1 on the
original condition and tell emitFlagsForSetcc to emit a compare
with 0. This will pickup the LowerAndToBT and or the EmitTest case.
I kept the isTruncWithZeroHighBitsInput call, but we might be able
to fold that in to emitFlagsForSetcc.

Differential Revision: https://reviews.llvm.org/D74750
2020-02-20 08:50:18 -08:00
Craig Topper 12cc105f80 [X86] Add DAG combines to form CVTPH2PS/CVTPS2PH from vXf16->vXf32/vXf64 fp_extends and vXf32->vXf16 fp_round.
Only handle power of 2 element count for simplicity. Not sure what to do with vXf64->vXf16 fp_round to avoid double rounding

Differential Revision: https://reviews.llvm.org/D74886
2020-02-20 08:26:17 -08:00
Djordje Todorovic 2f215cf36a Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGfaff707db82d.
A failure found on an ARM 2-stage buildbot.
The investigation is needed.
2020-02-20 14:41:39 +01:00
Craig Topper f559cecc3e [X86] Add DCI.isBeforeLegalize() check to the v64i1 constant splitting code in combineStore.
We only need to split after type legalization. If we're before
we can just use a wide store and type legalization will split it.

Add a v128i1 test to exercise it post type legalization.
2020-02-19 09:18:16 -08:00
Florian Hahn 216afd3301 [TargetLower] Update shouldFormOverflowOp check if math is used.
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.

This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.

Reviewers: nikic, RKSimon, lebedev.ri, spatel

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D74722
2020-02-19 11:28:33 +01:00
Djordje Todorovic faff707db8 Reland "[DebugInfo] Enable the debug entry values feature by default"
Differential Revision: https://reviews.llvm.org/D73534
2020-02-19 11:12:26 +01:00
Craig Topper f69a29da5a [X86] Remove vXi1 select optimization from LowerSELECT. Move it to DAG combine. 2020-02-19 00:00:55 -08:00
Craig Topper 0dbc4658d8 [X86] Handle splats in LowerBUILD_VECTORvXi1 by directly emitting scalar selects instead of deferring that to LowerSELECT.
LoweSELECT will detect the constant inputs and convert to scalar
selects, but we can do it directly here.

I might remove some of the code from LowerSELECT and move it to
DAG combine so doing this explicitly will make us less dependent
on it happening in lowering.
2020-02-18 22:39:30 -08:00
Simon Pilgrim d6eef0614f [TargetLowering] Add SimplifyMultipleUseDemandedBits 'all elements' helper wrapper. NFC. 2020-02-18 19:53:50 +00:00
Craig Topper 89ab5c69c8 [X86] Add a helper function to pull some repeated code out of combineGatherScatter. NFC 2020-02-18 11:10:40 -08:00
Djordje Todorovic 2bf44d11cb Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGa82d3e8a6e67.
2020-02-18 16:38:11 +01:00
Djordje Todorovic a82d3e8a6e Reland "[DebugInfo] Enable the debug entry values feature by default"
This patch enables the debug entry values feature.

  - Remove the (CC1) experimental -femit-debug-entry-values option
  - Enable it for x86, arm and aarch64 targets
  - Resolve the test failures
  - Leave the llc experimental option for targets that do not
    support the CallSiteInfo yet

Differential Revision: https://reviews.llvm.org/D73534
2020-02-18 14:41:08 +01:00
Craig Topper e90dc7c48b [X86] Move avx512 code that forces zeros to the false side of vselects above a check for legal types.
This helps this transform occur earlier so we can fold the not
with setcc. If we delay it until after type legalization we might
have introduced instructions to widen the mask if the vselect was
widened. This can prevent the not from making it to the setcc.

We could of course add more DAG combines to handle that, but
moving this earlier is easier.
2020-02-17 22:24:21 -08:00
Craig Topper b0840934a7 [X86] Use isScalarFPTypeInSSEReg to simplify code in LowerSELECT. NFC 2020-02-17 19:43:57 -08:00
Craig Topper 3f4490d384 [X86] Add one use check to '0-x == y --> x+y == 0' in EmitCmp.
I failed to copy it when I moved this in
b62de210cf
2020-02-17 18:16:42 -08:00
Craig Topper 43e948c4b7 [X86] Change how the alignment for the stack object is created in LowerFLT_ROUNDS_.
We don't need FrameInfo's concept of the stack alignment. We just
need to tell it the desired alignment. Which in this case is 2.
2020-02-17 11:27:34 -08:00
Craig Topper b62de210cf [X86] Move '0-x == y --> x+y == 0' and similar combines to EmitCmp.
AArch64 handles this pattern in their lowering code. By emitting
CMN. ARM handles it as an isel pattern.
2020-02-17 11:27:34 -08:00
Nikita Popov 80397d2d12 [IRBuilder] Delete copy constructor
D73835 will make IRBuilder no longer trivially copyable. This patch
deletes the copy constructor in advance, to separate out the breakage.

Currently, the IRBuilder copy constructor is usually used by accident,
not by intention.  In rG7c362b25d7a9 I've fixed a number of cases where
functions accepted IRBuilder rather than IRBuilder &, thus performing
an unnecessary copy. In rG5f7b92b1b4d6 I've fixed cases where an
IRBuilder was copied, while an InsertPointGuard should have been used
instead.

The only non-trivial use of the copy constructor is the
getIRBForDbgInsertion() helper, for which I separated construction and
setting of the insertion point in this patch.

Differential Revision: https://reviews.llvm.org/D74693
2020-02-17 18:14:48 +01:00
Craig Topper 464729cf7c [X86] Remove unnecessary check for null SDValue. NFC 2020-02-16 20:25:24 -08:00
Craig Topper 272d35aef5 [X86] Separate floating point handling out of EmitCmp and emitFlagsForSetcc.
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.

This removes the need to pass Chain and IsSignaling all the way
down.
2020-02-16 10:51:05 -08:00
Craig Topper d26f11108b [X86] Split X86ISD::CMP into an integer and FP opcode. 2020-02-16 10:10:19 -08:00
Simon Pilgrim b85df2e185 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to PALIGNR 2020-02-16 16:13:26 +00:00
Simon Pilgrim c9c1c2b335 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to bit shifts 2020-02-16 16:13:25 +00:00
Sanjay Patel e48b536be6 [x86] form broadcast of scalar memop even with >1 use
The unseen logic diff occurs because MayFoldLoad() is defined like this:

static bool MayFoldLoad(SDValue Op) {
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}

The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.

The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.

Differential Revision: https://reviews.llvm.org/D74088
2020-02-16 10:32:56 -05:00
Simon Pilgrim 34a054ce71 [X86] combineX86ShuffleChain - add support for combining to X86ISD::ROTLI
Refactors matchShuffleAsBitRotate to allow use by both lowerShuffleAsBitRotate and matchUnaryPermuteShuffle.
2020-02-15 20:04:54 +00:00
Simon Pilgrim 2492075add [X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.

REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
2020-02-14 11:55:18 +00:00
Craig Topper c2e8a421ac [X86] Don't widen 128/256-bit strict compares with vXi1 result to 512-bits on KNL.
If we widen the compare we might trigger a spurious exception from
the garbage data.

We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.

I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.

Differential Revision: https://reviews.llvm.org/D74522
2020-02-13 13:26:40 -08:00
Amy Huang de1d90299b Revert "[X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets"
This reverts commit 11c16e7159 because it
causes a crash in chromium code. See
https://reviews.llvm.org/rG11c16e71598d51f15b4cfd0f719c4dabcc0bebf7.
2020-02-12 17:00:37 -08:00
Jay Foad 32aac25637 [KnownBits] Introduce anyext instead of passing a flag into zext
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.

I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.

NFC.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74482
2020-02-12 19:06:53 +00:00
Simon Pilgrim ff307c8120 [X86] combineFneg - generalize FMA negations with isNegatibleForFree/getNegatedExpression
This has a really interesting side effect in that it improves some UMAX/UMIN reduction code which had redundant XOR(SHUFFLE(XOR(X,SIGNMASK)),SIGNMASK) patterns - the getNegatibleCost recognises it as FNEG(SHUFFLE(FNEG(X))).... We have a lot of FNEG patterns bitcasted to the integer domain for XOR signbit twiddling which is similar to what we do to allow UMAX/UMIN to be lowered using SMAX/SMIN.

Differential Revision: https://reviews.llvm.org/D74231
2020-02-12 16:07:27 +00:00
Simon Pilgrim 9eb426c88c [TargetLowering] Add NegatibleCost enum for isNegatibleForFree return codes
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.

This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.

It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.

Differential Revision: https://reviews.llvm.org/D74221
2020-02-12 11:51:42 +00:00
Djordje Todorovic 97ed706a96 Revert "[DebugInfo] Enable the debug entry values feature by default"
This reverts commit rG9f6ff07f8a39.

Found a test failure on clang-with-thin-lto-ubuntu buildbot.
2020-02-12 11:59:04 +01:00
Djordje Todorovic 9f6ff07f8a [DebugInfo] Enable the debug entry values feature by default
This patch enables the debug entry values feature.

  - Remove the (CC1) experimental -femit-debug-entry-values option
  - Enable it for x86, arm and aarch64 targets
  - Resolve the test failures
  - Leave the llc experimental option for targets that do not
    support the CallSiteInfo yet

Differential Revision: https://reviews.llvm.org/D73534
2020-02-12 10:25:14 +01:00
Craig Topper 0daf9b8e41 [X86][LegalizeTypes] Add SoftPromoteHalf support STRICT_FP_EXTEND and STRICT_FP_ROUND
This adds a strict version of FP16_TO_FP and FP_TO_FP16 and uses
them to implement soft promotion for the half type. This is
enough to provide basic support for __fp16 with strictfp.

Add the necessary X86 support to use VCVTPS2PH/VCVTPH2PS when F16C
is enabled.
2020-02-11 22:30:04 -08:00
Craig Topper 846d0ac43e [X86] Don't disable code in combineHorizontalPredicateResult just because we have avx512
We aren't doing a good job of optimizing AVX512 outside of this code. So remove the bail out for AVX512 and replace with a FIXME. This at least gets us the AVX2 codegen.

Differential Revision: https://reviews.llvm.org/D74431
2020-02-11 14:36:29 -08:00
Simon Pilgrim fa620fc8e2 [X86] combineConcatVectorOps - reuse IsSplat and remove duplicate code. NFC. 2020-02-11 13:37:57 +00:00
Simon Pilgrim 11c16e7159 [X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
2020-02-11 12:21:03 +00:00
Craig Topper 798305d29b [X86] Custom lower ISD::FP16_TO_FP and ISD::FP_TO_FP16 on f16c targets instead of using isel patterns.
We need to use vector instructions for these operations. Previously
we handled this with isel patterns that used extra instructions
and copies to handle the the conversions.

Now we use custom lowering to emit the conversions. This allows
them to be pattern matched and optimized on their own. For
example we can now emit vpextrw to store the result if its going
directly to memory.

I've forced the upper elements to VCVTPHS2PS to zero to keep some
code similar. Zeroes will be needed for strictfp. I've added a
DAG combine for (fp16_to_fp (fp_to_fp16 X)) to avoid extra
instructions in between to be closer to the previous codegen.

This is a step towards strictfp support for f16 conversions.
2020-02-10 22:01:48 -08:00
Simon Pilgrim f319074824 [X86] combineConcatVectorOps - combine X86ISD::PACKSS ops 2020-02-10 17:48:02 +00:00
Simon Pilgrim 74c0f98cf5 [X86] combineConcatVectorOps - combine X86ISD::VPERMI ops 2020-02-10 17:48:01 +00:00
Simon Pilgrim 2463b8c97d [X86] combineConcatVectorOps - combine VSHLI/VSRAI/VSRLI ops
Non-AVX512BW targets failed to concatenate 256-bit shifts back to 512-bits (split during 512-bit shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-10 16:59:09 +00:00
Simon Pilgrim 06617c4522 [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

REAPPLIED rGe82e17d4d4ca after reversion at rG39eade73a567 - fixed offset matching in matchShuffleAsBitRotate.
2020-02-10 16:16:56 +00:00
Simon Pilgrim 39eade73a5 Revert rGe82e17d4d4cac8b2df00094e80d5e1cb22795664 - [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
---
Internal shuffle tests indicate theres a bug somewhere that I haven't been able to track down yet.
2020-02-10 12:14:26 +00:00
Craig Topper 06ba969c9d [X86] Make (insert_vector_elt (v8i16 zerovec), i16 %x, 0) generate the same code as (v8i16 (build_vector %x, 0, 0, 0, 0, 0, 0, 0)).
Instead of using a insrw to element 0, use movzx and movd.

Same for v16i8.
2020-02-09 21:52:11 -08:00
Simon Pilgrim 29e646fe65 [X86] combineConcatVectorOps - combine VROTLI/VROTRI ops
Fix issue mentioned on rGe82e17d4d4ca - non-AVX512BW targets failed to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-09 21:50:10 +00:00
Simon Pilgrim e82e17d4d4 [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-09 21:15:03 +00:00
Simon Pilgrim 29621b2534 [X86] Rename matchShuffleAsRotate - matchShuffleAsByteRotate. NFCI.
A matchShuffleAsBitRotate variant will be added soon and we need to make the difference more obvious.
2020-02-09 18:35:50 +00:00
Simon Pilgrim 3ec6de07e9 Fix signed/unsigned warning. 2020-02-09 13:35:03 +00:00
Simon Pilgrim 644d56b432 [X86] Recognise ROTLI/ROTRI rotations as faux shuffles
Allows us to combine rotations with shuffles.

One of many things necessary to fix PR44379 (lowering shuffles to rotations)
2020-02-09 12:25:49 +00:00
serge_sans_paille e67cbac812 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-09 10:42:45 +01:00
serge-sans-paille 4546211600 Revert "Support -fstack-clash-protection for x86"
This reverts commit 0fd51a4554.

Failures:

http://lab.llvm.org:8011/builders/llvm-clang-win-x-armv7l/builds/4354
2020-02-09 10:06:31 +01:00
serge_sans_paille 0fd51a4554 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-09 09:35:42 +01:00
Craig Topper eeb63944e4 [LegalizeTypes][ARM][AArch64][PowerPC][RISCV][X86] Use BUILD_PAIR to return expanded integer results from ReplaceNodeResults instead of just returning two results.
Remove code from LegalizeTypes that allowed this to work.

We were already using BUILD_PAIR for this in some places so this
standardizes on a single way to do this.
2020-02-08 09:52:31 -08:00
serge-sans-paille 658495e6ec Revert "Support -fstack-clash-protection for x86"
This reverts commit e229017732.

Failures:

http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-debian/builds/2604
http://lab.llvm.org:8011/builders/llvm-clang-win-x-aarch64/builds/4308
2020-02-08 14:26:22 +01:00
serge_sans_paille e229017732 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This a recommit of 39f50da2a3 with better option
handling and more portable testing

Differential Revision: https://reviews.llvm.org/D68720
2020-02-08 13:31:52 +01:00
Simon Pilgrim 7f5b3fa73c [X86][SSE] Add X86ISD::FRCP handling to isNegatibleForFree
Peek through X86ISD::FRCP nodes to see if there is a negatible input.
2020-02-08 10:56:27 +00:00
Simon Pilgrim 4229f12a22 [TargetLowering] Remove isDesirableToCombineBuildVectorToShuffleTruncate target hook. NFC.
This hasn't been used for years, its original implementation, D35700, had bugs that caused the reversion of most of the code, and since then x86 shuffle lowering/combining has handled most cases and can deal with the rest as well.
2020-02-08 08:55:51 +00:00
Nico Weber b03c3d8c62 Revert "Support -fstack-clash-protection for x86"
This reverts commit 4a1a0690ad.
Breaks tests on mac and win, see https://reviews.llvm.org/D68720
2020-02-07 14:49:38 -05:00
serge_sans_paille 4a1a0690ad Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This a recommit of 39f50da2a3 with correct option
flags set.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-07 19:54:39 +01:00
Sanjay Patel de6f7eb47e [x86] don't create an unused constant vector
Noticed while scanning through debug spew. Creating unused
nodes is inefficient and makes following the debug output harder.
2020-02-07 12:05:02 -05:00
Simon Pilgrim c96001035d [X86] isNegatibleForFree - allow pre-legalized FMA negation
As long as the FMA operation is legal (which we can proxy for the FMA3/FMA4 variants as well), we don't have to wait for the LegalOperations stage.
2020-02-07 17:04:17 +00:00
serge-sans-paille f6d98429fc Revert "Support -fstack-clash-protection for x86"
This reverts commit 39f50da2a3.

The -fstack-clash-protection is being passed to the linker too, which
is not intended.

Reverting and fixing that in a later commit.
2020-02-07 11:36:53 +01:00
Guillaume Chatelet f85d3408e6 [NFC] Introduce an API for MemOp
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73964
2020-02-07 11:32:27 +01:00
serge_sans_paille 39f50da2a3 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

Differential Revision: https://reviews.llvm.org/D68720
2020-02-07 10:56:15 +01:00
Craig Topper 3f62028f2f [X86] Use SelectionDAG::getAllOnesConstant to simplify some code. NFC 2020-02-06 21:32:53 -08:00
Craig Topper ec9a94af4d [X86] Use MVT::i8 instead of MVT::i64 for shift amount in BuildSDIVPow2
X86 uses i8 for shift amounts. This code can fail on a 32-bit target
if it runs after type legalization.

This code was copied from AArch64 and modified for X86, but the
shift amount wasn't changed to the correct type for X86.

Fixes PR44812
2020-02-06 13:32:13 -08:00
Craig Topper 4175d7e22e [X86] Custom isel floating point X86ISD::CMP on pre-CMOV targets. Eliminate ConvertCmpIfNecessary
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.

Previously this was done by calling ConvertCmpIfNecessary in
multiple places which would emit the extra code for the FNSTSW,
a shift, a truncate, and a SAHF instructions. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.

This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.

Differential Revision: https://reviews.llvm.org/D73863
2020-02-06 10:43:06 -08:00
Sanjay Patel 0a389c81cd [x86] use getSplatIndex() in lowerShuffleAsBroadcast()
The old code was doing an N^2 search for splat index.

Differential Revision: https://reviews.llvm.org/D74064
2020-02-05 14:55:02 -05:00
Craig Topper a3d489e87e [X86] Add a DAG combine for (i32 (sext (i8 (x86isd::setcc_carry)))) -> (i32 (x86isd::setcc_carry)) and remove isel patterns.
Same for any_extend though we don't have coverage for that.

The test changes are because isel didn't check one use of the
setcc_carry. So in isel we would end up with two different
sized setcc_carry instructions. And since it clobbers
the flags we would need to recreate the flags for the second
instruction.

This code handles additional uses by truncating the new wide
setcc_carry back to the original size for those uses.
2020-02-04 22:40:36 -08:00
Craig Topper 016f42e3dc [X86] Add custom lowering for lrint/llrint to either cvtss2si/cvtsd2si or fist.
lrint/llrint are defined as rounding using the current rounding
mode. Numbers that can't be converted raise FE_INVALID and an
implementation defined value is returned. They may also write to
errno.

I believe this means we can use cvtss2si/cvtsd2si or fist to
convert as long as -fno-math-errno is passed on the command line.
Clang will leave them as libcalls if errno is enabled so they
won't become ISD::LRINT/LLRINT in SelectionDAG.

For 64-bit results on a 32-bit target we can't use cvtss2si/cvtsd2si
but we can use fist since it can write to a 64-bit memory location.
Though maybe we could consider using vcvtps2qq/vcvtpd2qq on avx512dq
targets?

gcc also does this optimization.

I think we might be able to do this with STRICT_LRINT/LLRINT as
well, but I've left that for future work.

Differential Revision: https://reviews.llvm.org/D73859
2020-02-04 16:15:40 -08:00
Reid Kleckner 2d89e0a098 [SEH] Remove CATCHPAD SDNode and X86::EH_RESTORE MachineInstr
The CATCHPAD node mostly existed to be selected into the EH_RESTORE
instruction, which sets the frame back up when 32-bit Windows exceptions
return to the parent function. However, creating this MachineInstr early
increases the risk that other passes will come along and insert
instructions that use the stack before ESP and EBP are restored. That
happened in PR44697.

Instead of representing these in the instruction stream early, delay it
until PEI. Mark the blocks where this needs to happen as EHPads, but not
funclet entry blocks. Passes after PEI have to be careful not to hoist
instructions that can use stack across frame setup instructions, so this
should be relatively reliable.

Fixes PR44697

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D73752
2020-02-04 15:13:12 -08:00
Craig Topper e195ff98f6 Recommit "[X86] Use X86ISD::SUB instead of X86ISD::CMP in some places."
This time with correct types for the data result from the SUB.

Original commit message:

Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.

This commit makes other places that create X86ISD::CMP have the
same behavior.
2020-02-04 12:19:34 -08:00
Kadir Cetinkaya d2b6ac6ccd
Revert "[X86] Use X86ISD::SUB instead of X86ISD::CMP in some places."
This reverts commit 8413116bf1.

this seems to be causing crashes while compiling ncurses.
```
$ ./bin/llc bugpoint-reduced-simplified.ll
LLVM ERROR: Cannot emit physreg copy instruction
```

Here are the crashers: https://gist.github.com/kadircet/918f5bb97a2afe048cb875490edba46e

executing with an llc compiled at 904d54de9b works fine.
2020-02-04 11:22:53 +01:00
Guillaume Chatelet b8144c0536 [NFC] Encapsulate MemOp logic
Summary:
This patch simply introduces functions instead of directly accessing the fields.
This helps introducing additional check logic. A second patch will add simplifying functions.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73945
2020-02-04 10:36:26 +01:00
Craig Topper cd14b4a62b [X86] Remove unneeded code that looks for (and (i8 (X86setcc_c))
I don't believe we use this construct anymore so I don't think
we need to look for it.
2020-02-03 23:18:11 -08:00
Craig Topper 4581d97416 [X86] Remove some uncovered and possibly broken code from combineZext.
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating too. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.

I believe this code is a leftover from when i1 was a legal type.
2020-02-03 22:59:39 -08:00
Craig Topper 8413116bf1 [X86] Use X86ISD::SUB instead of X86ISD::CMP in some places.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.

This commit makes other places that create X86ISD::CMP have the
same behavior.
2020-02-03 21:01:11 -08:00
Craig Topper c3a47221e0 [X86] Don't emit two X86ISD::COMI/UCOMI nodes when handling comi/ucomi intrinsics.
We were creating two with different operand orders, and then only
using one of them.

Instead just swap the operands when needed and create a single node.
2020-02-03 20:08:01 -08:00
Simon Pilgrim 3ece5a23bd [X86] getTargetShuffleMask - use getConstantOperandVal helper. NFCI. 2020-02-03 18:06:47 +00:00
Simon Pilgrim 8c0e715eb2 [X86] BEXTR SimplifyDemandedBitsForTargetNode - length == 0 -> result = 0 2020-02-03 16:50:03 +00:00
Guillaume Chatelet 333f2ad8b8 [Alignment][NFC] Use Align for getMemcpy/Memmove/Memset
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73885
2020-02-03 17:13:19 +01:00
Simon Pilgrim 8ead5df0b1 [X86] computeKnownBitsForTargetNode - add BEXTR support (PR39153)
Add a KnownBits::extractBits helper
2020-02-03 15:43:59 +00:00
Simon Pilgrim a9ee3ffbc0 [X86] Move BEXTR DemandedBits handling inside SimplifyDemandedBitsForTargetNode
Some prep work for PR39153.
2020-02-03 15:16:40 +00:00
Craig Topper cf20fde1d1 [X86] Remove a couple unnecessary calls to ConvertCmpIfNecessary.
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
2020-02-02 21:36:51 -08:00
Craig Topper ee85415dbb [X86] Use MVT::f80 for the result type of the FLD used to convert from SSE register to X87 register in FP_TO_INTHelper. 2020-02-02 13:24:37 -08:00
Simon Pilgrim 5d86ac82a6 Fix a few spelling mistakes in comments. NFCI. 2020-02-02 18:27:43 +00:00
Simon Pilgrim 17e91b7dd2 [X86][SSE] combineBitcastvxi1 - add pre-AVX512 v64i1 handling 2020-02-02 18:00:09 +00:00
Guillaume Chatelet 3c89b75f23 [NFC] Introduce a type to model memory operation
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.

Reviewers: courbet

Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73785
2020-01-31 17:29:01 +01:00
Craig Topper 90c31b0f42 [X86] Custom lower ISD::FROUND with SSE4.1 to avoid a libcall.
ISD::FROUND is defined to round to nearest with ties rounding
away from 0. This mode isn't supported in hardware on X86.

But as long as we aren't compiling with trapping math, we can
emulate this with floor(X + copysign(nextafter(0.5, 0.0), X)).

We have to use nextafter to avoid some corner cases that adding
0.5 would have. For example, if X is nextafter(0.5, 0.0) it should
round to 0.0, but adding 0.5 would need one extra bit of mantissa
than can be stored so it rounds to 1.0. Adding nextafter(0.5, 0.0)
instead will just increase the exponent by 1 and leave the mantissa
as all 1s. This would be nextafter(1.0, 0.0) which will floor to 0.0.

Techically this requires -fno-trapping-math which isn't our default.
But if we care about exceptions we should be using constrained
intrinsics. Constrained intrinsics would use STRICT_FROUND which
won't go through this code.

Fixes PR42195.

Differential Revision: https://reviews.llvm.org/D73607
2020-01-29 09:10:02 -08:00
Craig Topper e5edd641fd [X86] Use a shorter sequence to implement FLT_ROUNDS
This code needs to map from the FPCW 2-bit encoding for rounding mode to the 2-bit encoding defined for FLT_ROUNDS. The previous implementation did some clever swapping of bits and adding 1 modulo 4 to do the mapping.

This patch instead uses an 8-bit immediate as a lookup table of four 2-bit values. Then we use the 2-bit FPCW encoding to index the lookup table by using a right shift and an AND. This requires extracting the 2-bit value from FPCW and multipying it by 2 to make it usable as a shift amount. But still results in less code.

Differential Revision: https://reviews.llvm.org/D73599
2020-01-29 08:56:33 -08:00
Craig Topper ca2abea29a [X86] Use SelectionDAG::getZExtOrTrunc to simplify some code. NFCI 2020-01-28 16:27:59 -08:00
Wang, Pengfei 3d1f0ce3b9 [X86] Add combination for fma and fneg on X86 under strict FP.
Summary: X86 has instructions to calculate fma and fneg at the same time. But we combine the fneg and fma only when fneg is the source operand under strict FP.

Reviewers: craig.topper, andrew.w.kaylor, uweigand, RKSimon, LiuChen3

Subscribers: LuoYuanke, llvm-commits, cfe-commits, jdoerfert, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72824
2020-01-28 20:09:56 +08:00
Simon Pilgrim 2d5e281b0f [X86][AVX] Add a more aggressive SimplifyMultipleUseDemandedBits to simplify masked store masks.
Fixes a poor codegen issue noticed in PR11210.
2020-01-27 16:44:25 +00:00
Simon Pilgrim fa19d67a2a [X86][AVX] Extend combineCommutableSHUFP to handle v8f32 and v16f32 commutable shufps patterns 2020-01-26 19:04:12 +00:00
Simon Pilgrim 1a81b296cd [X86][SSE] combineCommutableSHUFP - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Pull out combineTargetShuffle code added in rG3fd5d1c6e7db into a helper function and extend it to handle shufps(shufps(load(),x),y) and shufps(y,shufps(load(),x)) cases as well.
2020-01-26 14:36:23 +00:00
Craig Topper 3fdd435a4b [X86] Use a macro to convert X86ISD names to strings in getTargetNodeName.
Every case in the switch had a string version of themselves. Two
of them had a typo that used : instead of ::

By using a macro we can automate the string creation and avoid
the possibility of typos like this.

This is similar to what is done on the AMDGPU target.
2020-01-25 18:27:29 -08:00
Craig Topper 2c1decc040 [X86] Break the loop in LowerReturn into 2 loops. NFCI
I believe for STRICT_FP I need to use a STRICT_FP_EXTEND for the extending to f80 for returning f32/f64 in 32-bit mode when SSE is enabled. The STRICT_FP_EXTEND node requires a Chain. I need to get that node onto the chain before any CopyToRegs are emitted. This is because all the CopyToRegs are glued and chained together. So I can't put a STRICT_FP_EXTEND on the chain between the glued nodes without also glueing the STRICT_ FP_EXTEND.

This patch moves all the extend creation to a first pass and then creates the copytoregs and fills out RetOps in a second pass.

Differential Revision: https://reviews.llvm.org/D72665
2020-01-24 14:44:38 -08:00
Simon Pilgrim 3fd5d1c6e7 [X86][SSE] combineTargetShuffle - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Moves lowerShuffleWithSHUFPS commutation code from rG30fcd29fe479 to catch cases during combine
2020-01-24 15:23:20 +00:00
Simon Pilgrim 30fcd29fe4 [X86][SSE] lowerShuffleWithSHUFPS - commute '2*V1+2*V2 elements' mask if it allows a loaded fold
As mentioned on D73023.
2020-01-24 12:04:10 +00:00
Guillaume Chatelet 805c157e8a [Alignment][NFC] Deprecate Align::None()
Summary:
This is a follow up on https://reviews.llvm.org/D71473#inline-647262.
There's a caveat here that `Align(1)` relies on the compiler understanding of `Log2_64` implementation to produce good code. One could use `Align()` as a replacement but I believe it is less clear that the alignment is one in that case.

Reviewers: xbolva00, courbet, bollu

Subscribers: arsenm, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, Jim, kerbowa, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73099
2020-01-24 12:53:58 +01:00
Simon Pilgrim 0ec25a0316 [X86] LowerRotate - early out for vector rotates by zero 2020-01-23 17:48:09 +00:00
Guillaume Chatelet 279fa8e006 [Alignement][NFC] Deprecate untyped CreateAlignedLoad
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73260
2020-01-23 13:34:32 +01:00
Sanjay Patel 363d27c871 [x86] fold vperm2x128 to concat of 128-bit high half vectors
vperm (ins ?, X, C), (ins ?, Y, C), 0x31 --> concat X, Y

This is another shuffle problem seen with PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

We have this small crack in legalization/lowering/combining/demanded
that allows forming a vperm2f128 of high halves with AVX1 when we
could do better by peeking through the insert_subvector nodes.
AFAICT, it requires IR as shown in the diffs - much larger than legal
vectors - to avoid all of the usual folds.

Another option would prevent forming the 256-bit vperm in lowering.

Differential Revision: https://reviews.llvm.org/D73197
2020-01-22 15:35:50 -05:00
Simon Pilgrim 5340434c94 [X86][SSE] combineExtractWithShuffle - extract(bitcast(broadcast(x))) --> x
Removes some unnecessary gpr<-->fpu traffic
2020-01-22 18:02:58 +00:00
Simon Pilgrim a14aa7dabd [X86][SSE] combineExtractWithShuffle - extract(bictcast(scalar_to_vector(x))) --> x
Removes some unnecessary gpr<-->fpu traffic
2020-01-22 16:11:08 +00:00
Simon Pilgrim c784e5451b Use SelectionDAG::getShiftAmountConstant(). NFCI. 2020-01-22 13:52:43 +00:00
Simon Pilgrim 963f268186 [X86][SSE] combineExtractWithShuffle - pull out repeated extract index code. NFCI. 2020-01-22 12:08:58 +00:00
Simon Pilgrim b065902ed4 [X86] combineBT - use SimplifyDemandedBits instead of GetDemandedBits
Another step towards removing SelectionDAG::GetDemandedBits entirely
2020-01-21 14:24:46 +00:00
Simon Pilgrim eaa4548459 [X86][SSE] Add PACKSS SimplifyMultipleUseDemandedBits 'sign bit' handling.
Attempt to use SimplifyMultipleUseDemandedBits to simplify PACKSS if we're only after the sign bit.
2020-01-20 10:48:54 +00:00
Florian Hahn 0ee1db2d1d [X86] Try to avoid casts around logical vector ops recursively.
Currently PromoteMaskArithemtic only looks at a single operation to
skip casts. This means we miss cases where we combine multiple masks.

This patch updates PromoteMaskArithemtic to try to recursively promote
AND/XOR/AND nodes that terminate in truncates of the right size or
constant vectors.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D72524
2020-01-19 17:22:43 -08:00
Craig Topper 5fa2022ec0 [X86] Remove X86ISD::FILD_FLAG and stop gluing nodes together.
Summary:
I think whatever problem the gluing was fixing has long since been fixed. We don't have any of the restrictions on FP stack stuff that existed back when this was first added.

I had to change which type we use for FILD in BuildFILD when X86 was enabled because most of the isel patterns block f32/f64 instructions when SSE1/SSE2 are enabled. So I needed to use the f80 pattern, but this shouldn't have an effect the generated code since there is only one FILD instruction anyway. We already use f80 explicitly in other other places.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: andrew.w.kaylor, scanon, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72805
2020-01-18 23:44:05 -06:00
Simon Pilgrim 69bc450882 [X86] Rename lowerShuffleAsRotate -> lowerShuffleAsVALIGN
Since it can only ever create VALIGN nodes.
2020-01-18 11:29:14 +00:00
Michael Liao 6d0d86a64d [DAG] Add helper for creating constant vector index with correct type. NFC. 2020-01-18 01:23:36 -05:00
Sanjay Patel 43f60e614a [x86] try harder to form 256-bit unpck*
This is another part of a problem noted in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

The AVX2 code may use awkward 256-bit shuffles vs. the AVX code that gets split
into the expected 128-bit unpack instructions. We have to be selective in
matching the types where we try to do this though. Otherwise, we can end up
with more instructions (in the case of v8x32/v4x64).

Differential Revision: https://reviews.llvm.org/D72575
2020-01-17 10:42:39 -05:00
Craig Topper e445447921 [X86] When handling i64->f32 sint_to_fp on 32-bit targets only bitcast to f64 if sse2 is enabled.
The code is trying to copy the i64 value to an xmm register to
use a 64-bit store so that the 64-bit fild can benefit from
store forwarding.

But this trick only works if f64 is going to be stored in an
XMM register. If we only have SSE1 then only float is in xmm
register. So this trick just causes 2 stores i32 stores, an f64
load into the x87, an f64 from x87, and a 64-bit fild. So we end
up with an extra stack temporary and still didn't get store forwarding.

We might be able to use v2f32 here instead, but I didn't check. I
just wanted the code to make sense.

Found by inspection as I continue to stare too hard at our
int_to_fp conversions.
2020-01-15 18:26:28 -08:00
Craig Topper be8f217b18 [X86] Don't call LowerUINT_TO_FP_i32 for i32->f80 on 32-bit targets with sse2.
We were performing an emulated i32->f64 in the SSE registers, then
storing that value to memory and doing a extload into the X87
domain.

After this patch we'll now just store the i32 to memory along
with an i32 0. Then do a 64-bit FILD to f80 completely in the X87
unit. This matches what we do without SSE.
2020-01-15 00:43:07 -08:00
Reid Kleckner 40cd26c700 [Win64] Handle FP arguments more gracefully under -mno-sse
Pass small FP values in GPRs or stack memory according the the normal
convention. This is what gcc -mno-sse does on Win64.

I adjusted the conditions under which we emit an error to check if the
argument or return value would be passed in an XMM register when SSE is
disabled. This has a side effect of no longer emitting an error for FP
arguments marked 'inreg' when targetting x86 with SSE disabled. Our
calling convention logic was already assigning it to FP0/FP1, and then
we emitted this error. That seems unnecessary, we can ignore 'inreg' and
compile it without SSE.

Reviewers: jyknight, aemerson

Differential Revision: https://reviews.llvm.org/D70465
2020-01-14 17:19:35 -08:00
Craig Topper 76291e1158 [X86] Drop an unneeded FIXME. NFC
The extload on X87 is free.
2020-01-14 17:05:46 -08:00
Craig Topper 57eb56b839 [X86] Swap the 0 and the fudge factor in the constant pool for the 32-bit mode i64->f32/f64/f80 uint_to_fp algorithm.
This allows us to generate better code for selecting the fixup
to load.

Previously when the sign was set we had to load offset 0. And
when it was clear we had to load offset 4. This required a testl,
setns, zero extend, and finally a mul by 4. By switching the offsets
we can just shift the sign bit into the lsb and multiply it by 4.
2020-01-14 17:05:23 -08:00
Craig Topper 98c54fb1fe [X86] Directly emit a BROADCAST_LOAD from constant pool in lowerUINT_TO_FP_vXi32 to avoid double loads seen in D71971
By directly emitting the constants as a constant pool load we seem to avoid the build_vector/extract_subvector combines that resulted in the duplicate loads we had before.

Differential Revision: https://reviews.llvm.org/D72307
2020-01-14 10:50:39 -08:00
Simon Pilgrim 66e39067ed [X86][AVX] Use lowerShuffleAsLanePermuteAndSHUFP to lower binary v4f64 shuffles.
Only perform this if we are shuffling lower and upper lane elements across the lanes (otherwise splitting to lower xmm shuffles would be better).

This is a regression if we shuffle build_vectors due to getVectorShuffle canonicalizing 'blend of splat' build vectors, for now I've set this not to shuffle build_vector nodes at all to avoid this.
2020-01-12 12:29:41 +00:00
Simon Pilgrim b375f28b0e [X86][AVX] lowerShuffleAsLanePermuteAndSHUFP - only set the demanded elements of the lane mask.
Fixes an cyclic dependency issue with an upcoming patch where getVectorShuffle canonicalizes masks with splat build vector sources.
2020-01-12 09:41:40 +00:00
Craig Topper d692f0f6c8 [X86] Don't call LowerSETCC from LowerSELECT for STRICT_FSETCC/STRICT_FSETCCS nodes.
This causes the STRICT_FSETCC/STRICT_FSETCCS nodes to lowered
early while lowering SELECT, but the output chain doesn't get
connected. Then we visit the node again when it is its turn
because we haven't replaced the use of the chain result. In the
case of the fp128 libcall lowering, after D72341 this will cause
the libcall to be emitted twice.
2020-01-11 20:43:00 -08:00
Craig Topper ed679804d5 [TargetLowering][X86] Connect the chain from STRICT_FSETCC in TargetLowering::expandFP_TO_UINT and X86TargetLowering::FP_TO_INTHelper. 2020-01-11 17:50:20 -08:00
Simon Pilgrim 24763734e7 [X86] Fix outdated comment
The generic saturated math opcodes are no longer widened inside X86TargetLowering
2020-01-11 14:37:18 +00:00
Simon Pilgrim ce35010d78 [X86][AVX] Add lowerShuffleAsLanePermuteAndSHUFP lowering
Add initial support for lowering v4f64 shuffles to SHUFPD(VPERM2F128(V1, V2), VPERM2F128(V1, V2)), eventually this could be used for v8f32 (and maybe v8f64/v16f32) but I'm being conservative for the initial implementation as only v4f64 can always succeed.

This currently is only called from lowerShuffleAsLanePermuteAndShuffle so only gets used for unary shuffles, and we limit this to cases where we use upper elements as otherwise concating 2 xmm shuffles is probably the better case.

Helps with poor shuffles mentioned in D66004.
2020-01-11 12:42:00 +00:00
Simon Pilgrim a5bdada09d [X86][AVX] lowerShuffleAsLanePermuteAndShuffle - consistently normalize multi-input shuffle elements
We only use lowerShuffleAsLanePermuteAndShuffle for unary shuffles at the moment, but we should consistently handle lane index calculations for multiple inputs in both the AVX1 and AVX2 paths.

Minor (almost NFC) tidyup as I'm hoping to use lowerShuffleAsLanePermuteAndShuffle for binary shuffles soon.
2020-01-10 17:21:20 +00:00
Matt Arsenault 255cc5a760 CodeGen: Use LLT instead of EVT in getRegisterByName
Only PPC seems to be using it, and only checks some simple cases and
doesn't distinguish between FP. Just switch to using LLT to simplify
use from GlobalISel.
2020-01-09 17:37:52 -05:00
Craig Topper 3811417f39 [X86] Custom type legalize v4i64->v4f32 uint_to_fp on sse4.1 targets in 64-bit mode
For v4i64->v4f32 uint_to_fp on pre-avx targets where v4i64 isn't legal we create to v2i64->v2f32 uint_to_fp that need to be shuffled together. Our codegen for v2i64->v2f32 involves detecting if the number is larger than (2^31 - 1), if so we do a special divison by 2 so we can do a signed conversion which we need to scalarize, then do a multiply by 2 at the end if we divided earlier.

When v4i64 isn't legal we need to split the checking for a larger number and dividing by 2 into two v2i64 vectors. The scalar part can extract the 4 i64 values from those 4 splits. But we can reassemble the 4 scalar f32 results directly into a single v432 vector. Then we just need to combine the fixup indications from the 2 halves and we can do the final multiply by 2 fixup on all 4 values if needed at once using a single v4f32 blend and v4f32 fadd.

Differential Revision: https://reviews.llvm.org/D72368
2020-01-08 10:06:01 -08:00
Wang, Pengfei 9a621de1ec [X86] Adding fp128 support for strict fcmp
Summary: Adding fp128 support for strict fcmp

Reviewers: craig.topper, LiuChen3, andrew.w.kaylor, RKSimon, uweigand

Subscribers: hiraditya, llvm-commits, LuoYuanke

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71897
2020-01-08 12:59:31 +08:00
Craig Topper 9685cf709f [X86] Enable v2i64->v2f32 uint_to_fp code in ReplaceNodeResults on SSE4.1 target
Now that we generate decent code for (v2i64 (setlt zero, X)) on pre-sse4.2 targets I think we can use this now.

Differential Revision: https://reviews.llvm.org/D72354
2020-01-07 13:25:29 -08:00
Craig Topper afa8211e97 [X86] Improve lowering of (v2i64 (setgt X, -1)) on pre-SSE2 targets. Enable v2i64 in foldVectorXorShiftIntoCmp.
Similar to D72302 but for the canonical form for the opposite case. I've changed foldVectorXorShiftIntoCmp to form a target independent setcc node instead of PCMPGT now and enabled its for v2i64 on pre-SSE4.2 targets. The setcc should eventually get lowered to PCMPGT or the new v2i64 sequence.

Differential Revision: https://reviews.llvm.org/D72318
2020-01-07 11:22:04 -08:00
Craig Topper b9376690a0 [X86] Improve lowering of v2i64 sign bit tests on pre-sse4.2 targets
Without sse4.2 a v2i64 setlt needs to expand into a pcmpgtd, pcmpeqd, 3 shuffles, and 2 logic ops. But if we're only interested in the sign bit of the i64 elements, we can just use one pcmpgtd and shuffle the odd elements to the even elements.

Differential Revision: https://reviews.llvm.org/D72302
2020-01-07 11:22:03 -08:00
Simon Pilgrim 0e912e22b6 [X86] Pull out repeated SrcVT.getVectorNumElements() call. NFCI. 2020-01-07 16:51:10 +00:00
Simon Pilgrim c0365aaaa4 [X86] Standardize shuffle match/lowering function names. NFC.
We mainly use lowerShuffle*/matchShuffle* - replace the (few) lowerVectorShuffle*/matchVectorShuffle* cases to be consistent.
2020-01-07 13:41:52 +00:00
Craig Topper 6a0564adcf [X86] Improve v4i32->v4f64 uint_to_fp for AVX1/AVX2 targets.
Use zext+or+fsub to do the conversion. Similar to D71971.

Differential Revision: https://reviews.llvm.org/D71971
2020-01-06 14:07:35 -08:00
Craig Topper 95840866b7 [X86] Improve v2i64->v2f32 and v4i64->v4f32 uint_to_fp on avx and avx2 targets.
Summary:
Based on Simon's D52965, but improved to handle strict fp and improve some of the shuffling.

Rather than use v2i1/v4i1 and let type legalization continue, just generate all the code with legal types and use an explicit shuffle.

I also added an explicit setcc to the v4i64 code to match the semantics of vselect which doesn't just use the sign bit. I'm also using a v4i64->v4i32 truncate instead of the shuffle in Simon's original code. With the setcc this will become a pack.

Future work can look into using X86ISD::BLENDV and a different shuffle that only moves the sign bit.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71956
2020-01-05 17:44:08 -08:00
Liu, Chen3 ca3bf289a7 [NFC] Modify the format:
Drop the else since we alerady returned in the if.
2020-01-06 09:35:19 +08:00
Simon Pilgrim e3bd011890 [X86][SSE] Combine combineLogicBlendIntoConditionalNegate for VSELECT nodes (PR43660)
Attempt to use combineLogicBlendIntoConditionalNegate for (select M, (sub 0, X), X) -> (sub (xor X, M), M)

We limit this to cases that can't easily replace the VSELECT with a shuffle (non-constant masks) or where a BLENDV is likely to occur (which tends to result in slower codegen).
2020-01-05 18:50:44 +00:00
Simon Pilgrim 6a6e6f04ec [X86] Move combineLogicBlendIntoConditionalNegate before combineSelect. NFCI.
Updates function order in preparation of future fix for PR43660
2020-01-05 17:17:41 +00:00
Simon Pilgrim 3db84f142a [X86] Merge (identical) LowerGC_TRANSITION_START and LowerGC_TRANSITION_END (NFC)
Silences a copy+paste analyzer warning - all they are doing are inserting NOOPs in exactly the same way.
2020-01-05 15:24:57 +00:00
Craig Topper 2875cc6b29 [X86] Improve for v2i32->v2f64 uint_to_fp
This uses an alternative implementation of this conversion derived
from our v2i32->v2f32 handling. We can zero extend the v2i32 to
v2i64, or it with the bit representation of 2.0^52 which will give
us 2.0^52 plus the 32-bit integer since double's mantissa is 52 bits.
Then we just need to subtract 2.0^52 as a double and let the floating
point unit normalize the remaining bits into a valid double.

This is less instructions then our previous code, but does require
a port 5 shuffle for the zero extend or unpack.

Differential Revision: https://reviews.llvm.org/D71945
2020-01-03 11:39:08 -08:00
Reid Kleckner 9c2b72821b Move tail call disabling code to target independent code
When the "disable-tail-calls" attribute was added, checks were added for
it in various backends. Now this code has proliferated, and it is
something the target is responsible for checking. Move that
responsibility back to the ISels (fast, global, and SD).

There's no major functionality change, except for targets that never
implemented this check.

This LLVM attribute was originally added in
d9699bc7bd (2015).

Reviewers: echristo, MaskRay

Differential Revision: https://reviews.llvm.org/D72118
2020-01-03 11:27:41 -08:00
Craig Topper bd46e29742 [X86] Re-enable lowerUINT_TO_FP_vXi32 under fast-math by using an FSUB instead of an FADD.
Summary:
We previously disabled this under fast math due to aggressive
reassociation by the machine combiner. But I think we can work
around this by using a FSUB instead of FADD for the first
operation.

This matches the similar algorithm we do for uint_to_fp i64->f64
in TargetLowering::expandUINT_TO_FP. If reassociation hasn't
been a problem for that, hopefully its not a problem here.

Reviewers: RKSimon, spatel, scanon

Reviewed By: spatel

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71968
2020-01-02 21:46:53 -08:00
Wang, Pengfei 60333a5317 [X86] Enable strict FP by default and remove option -disable-strictnode-mutation. NFCI. 2020-01-03 10:59:34 +08:00
Wang, Pengfei 9dc9e0ea64 [X86] Optimization of inserting vxi1 sub vector into vXi1 vector
Summary:
After bugfix the undef value case here, we used more operations to implement inserting vxi1 sub vector into vXi1 vector, I optimize it by use less operations.

The history information at https://reviews.llvm.org/D68311

Reviewers: craig.topper, LuoYuanke, yubing, annita.zhang, pengfei, LiuChen3, RKSimon

Reviewed By: craig.topper

Subscribers: hiraditya, llvm-commits

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D71917
2020-01-03 09:25:25 +08:00
Craig Topper c36763d894 [X86] Call SimplifyMultipleUseDemandedBits from combineVSelectToBLENDV if the condition is used by something other than select conditions.
We might be able to bypass some nodes on the condition path.

Differential Revision: https://reviews.llvm.org/D71984
2020-01-01 11:16:52 -08:00
Liu, Chen3 8af492ade1 add strict float for round operation
Differential Revision: https://reviews.llvm.org/D72026
2020-01-01 20:42:12 +08:00
Craig Topper 26bdc603f7 [X86] Constant fold KSHIFT of an all zeros vector to just an all zeros vector. 2019-12-31 15:57:39 -08:00
Craig Topper 1cc8a74de3 [X86] Use carry flag from add for (seteq (add X, -1), -1).
If we just subtracted 1 and are checking if the result is -1. We can use the carry flag from the ADD instead of an explicit CMP. I'm using the same checks for the add users as EmitTest.

Fixes one case from PR44412

Differential Revision: https://reviews.llvm.org/D72019
2019-12-31 15:05:23 -08:00
Craig Topper e898ba2d15 [X86] Slightly improve our attempted error recovery for 64-bit -mno-sse2 in LowerCallResult to use FP1 if there are two return values.
If the return value is a struct of 2 doubles we need two return
registers.

If SSE2 is disabled we can't return in XMM registers like the ABI says.
After logging an error we attempt to recover by using FP0 instead
of an XMM register. But if the return needs two registers, we may have
already used FP0. So if the register we were supposed to copy to is
XMM1, copy to FP1 in the recovery instead.

This seems to fix the assertion/crash in PR44413.
2019-12-31 00:16:13 -08:00
Craig Topper 47a2fd2df4 [X86] Add X86ISD::PCMPGT to SimplifyMultipleUseDemandedBitsForTargetNode.
If only the sign bit is demanded, and the LHS is all zeroes, then
we can bypass the PCMPGT.
2019-12-30 10:50:25 -08:00
Craig Topper 266cd7717c [X86] Use APInt::isOneValue and ConstantSDNode::isOne. NFC
These are implemented slightly more efficiently than comparing
to 1 in the case that the value is more than 64 bits.
2019-12-29 17:35:49 -08:00
Craig Topper b2f19320dc [X86] Use isOneConstant to simplify some code. NFC 2019-12-29 16:53:38 -08:00
Craig Topper 599d070910 [X86] Remove dyn_casts to ConstantSDNode for operand 1 of X86ISD::VSRLI/VSRAI/VSRLI. Use getConstantOperandVal and APInt operations.
These nodes should only ever be formed with an i8 TargetConstant
so we don't need to check for it to be a constant. It's also
always 8-bits so we don't need to use APInt compare functions.
2019-12-29 16:53:38 -08:00
Craig Topper a5c96e326a [X86] Stop accidentally custom type legalizing v4i32->v4f32 on SSE1 only targets.
We had a Custom operation action for v4i32 on SSE1. But since
v4i32 isn't legal until SSE2 this was not what was intended. The
code that get executed was intended for op legalization and
creates a bunch of v4i32 nodes that all end up scalarized.
2019-12-28 23:11:48 -08:00
Craig Topper ae321faeed [X86] Remove a redundant (scalar_to_vector (extract_vector_elt X))) in LowerUINT_TO_FP_i32. NFCI 2019-12-28 21:49:22 -08:00
Craig Topper fca4736874 [X86] Allow v2i32->v2f32 strict and non-strict uint_to_fp to be widened to v4i32->v4f32 under avx512.
With avx512vl we get v4i32->v4f32 uint_to_fp instructions. With
avx512f we get v16i32->v16f32 instructions which we can use to
emulate v4i32->v4f32.
2019-12-27 00:28:44 -08:00
Craig Topper 20aab49492 [X86] Custom widen v2i32->v2f32 strict_sint_to_fp to avoid scalarization. 2019-12-27 00:28:44 -08:00
Fangrui Song 7a7334663c Delete llvm.{sig,}{setjmp,longjmp} remnant after r136821
Intrinsic has incorrect argument type!
  i32 (i32*)* @llvm.setjmp

*wipes tear*
2019-12-27 00:00:14 -08:00
Craig Topper ecbaf152f8 [X86] Custom widen 128/256-bit vXi32 fp_to_uint on avx512f targets without avx512vl. Similar for vXi64 on avx512dq without avx512vl.
Summary:
Previously we did this with isel patterns that used garbage in
the widened part of the source. But that's not valid for strictfp.
So now we custom widen and use zeroes for the widened elemens for
strictfp.

This replaces D71864.

Reviewers: RKSimon, spatel, andrew.w.kaylor, pengfei, LiuChen3

Reviewed By: pengfei

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71879
2019-12-26 22:04:40 -08:00
Craig Topper 50fb3957c1 [X86] Custom widen strict v2f32->v2i32 by padding with zeroes.
For non-strict, generic type legalization will take care of this,
but that doesn't happen currently for strict nodes.
2019-12-26 21:45:18 -08:00
Fangrui Song c4a97b64e3 [X86] Fix -Wmisleading-indentation after D71892 2019-12-26 21:41:16 -08:00
Craig Topper 53ee806d93 [X86][FPEnv] Promote some float strictfp operations to double on i686-pc-windows-msvc to match what we do for non-strict.
The float libcalls are inlined in MSVC's math header where they
just cast to double and use the double libcall. Do the same when
we emit libcalls.
2019-12-26 20:22:24 -08:00
Craig Topper a5d266b9cf [X86] Add custom legalization for strict_uint_to_fp v2i32->v2f32.
I believe the algorithm we use for non-strict is exception safe
for strict. The fsub won't generate any exceptions. After it we
will have an exact version of the i32 integer in a double. Then
we just round it to f32. That rounding will generate a precision
exception if it can't be represented exactly.
2019-12-26 19:10:26 -08:00
Liu, Chen3 1a7b69f5dd add custom operation for strict fpextend/fpround
Differential Revision: https://reviews.llvm.org/D71892
2019-12-27 08:28:33 +08:00
Eric Christopher 1584e2f987 Remove SrcVT only used in an assert and propagate query. 2019-12-26 15:28:32 -08:00
Craig Topper f953882113 [X86] Custom widen 128/256-bit vXi32 uint_to_fp on avx512f targets without avx512vl. Similar for vXi64 sint_to_fp/uint_to_fp on avx512dq without avx512vl.
Previously we widened these through isel patterns, but that
didn't work for STRICT_ nodes. Those need to be padded with
zeroes in the upper bits which is harder to do in isel patterns.
2019-12-26 14:46:56 -08:00
Craig Topper 90ff34e6ab [X86] Add custom widening for v2i32->v2f64 strict_uint_to_fp with AVX512F, but not AVX512VL.
Previously we were widening with isel patterns, but that wasn't
exception safe for strict FP. So now we widen to v4i32->v4f64
during type legalization. And then let op legalization further
widen to v8i32->v8f64.

The vec_int_to_fp.ll changes are caused by us no longer narrowing
extracts of strict_uint_to_fp to the v4i32->v2f64 instruction
without AVX512VL only to have isel rewiden it. Now we just keep
it wide throughout. So we don't have an opportunity to narrow
the load.
2019-12-26 13:40:56 -08:00
Craig Topper bb0138729b [X86] Add custom widening for v2f64->v2i32 strict_fp_to_uint with avx512f, but not avx512vl.
AVX512F added instruction for vector fp_to_uint conversions. With
AVX512VL we can use a specific instruction that does v2f64->v4i32 with
zeroes in the 2 extra elements. For non-strict nodes without AVX512VL
we relied on type legalization to turn it to v4f64->v4i32 which would
later be widened by op legalization to v8f64->v8i32. But type legalization
doesn't currently widen strict nodes since it doesn't know how to
safely and efficiently pad the extra elements. But for X86 we know
padding with zeroes is safe and efficient so do that ourselves.
2019-12-26 12:42:27 -08:00
Craig Topper c91bf72e2c [X86] Merge the SINT_TO_FP/UINT_TO_FP handlers in ReplaceNodeResults since the AVX512DQ+AVX512VL code is very similar in both. NFC 2019-12-26 08:58:34 -08:00
Craig Topper 4e6b0dd681 [X86] Add custom lowering for v2i64->v2f32 strict_sint_to_fp/strict_uint_to_fp for avx512dq+avx512vl targets.
With avx512dq+avx512vl we have instruction that implements this and
places zeroes in the upper 64-bits of the destination xmm register.
2019-12-26 08:58:34 -08:00
Wang, Pengfei 472bded3ed [X86] Enable STRICT_SINT_TO_FP/STRICT_UINT_TO_FP on X86 backend
Summary: Enable STRICT_SINT_TO_FP/STRICT_UINT_TO_FP on X86 backend

Reviewers: craig.topper, RKSimon, LiuChen3, uweigand, andrew.w.kaylor

Subscribers: hiraditya, llvm-commits, LuoYuanke

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71871
2019-12-26 08:15:13 +08:00
Craig Topper c5b4a2386b [X86] Use zero vector to extend to 512-bits for strict_fp_to_uint v2i1->v2f64 on targets with AVX512F, but not AVX512VL.
In the worst case, this requires a 128-bit move instruction to
implicitly zero the upper bits. In the common case, we should
recognize the producing instruction already zeroed the upper bits.
2019-12-25 10:46:00 -08:00
Craig Topper 2498d88259 [X86] Merge together some common code in LowerFP_TO_INT now that we have STRICT_CVTTP2SI/STRICT_CVTTP2UI nodes. NFC 2019-12-25 09:57:27 -08:00
Liu, Chen3 8304781cae Add missing strict_fp_to_int
Differential Revision: https://reviews.llvm.org/D71867
2019-12-25 16:10:10 +08:00
Craig Topper c06e53119b [X86] Use 128-bit vector instructions for f32/f64->i64 conversions on 32-bit targets with avx512dq and avx512vl instructions.
On 32-bit targets we can't use the scalar instruction so we
insert the scalar into a vector and use packed conversions.
Previously we used either v4f32->v4i64 or v4f64->v4i64 to avoid
some complexity creating target specific ISD opcodes for
v4f32->v2i64. But this causes extra vzeroupper instructions and
possibly frequency throttling on Intel CPUs.

This patch changes this to create a 128-bit vector and uses a
target specific ISD opcode if needed.
2019-12-24 11:20:10 -08:00
Craig Topper a21beccea2 [X86] Add STRICT versions of CVTTP2SI, CVTTP2UI, CMPM, and CMPP.
Differential Revision: https://reviews.llvm.org/D71850
2019-12-24 10:07:04 -08:00
Ulrich Weigand 0d3f782e41 [FPEnv][X86] More strict int <-> FP conversion fixes
Fix several several additional problems with the int <-> FP conversion
logic both in common code and in the X86 target. In particular:

- The STRICT_FP_TO_UINT expansion emits a floating-point compare. This
  compare can raise exceptions and therefore needs to be a strict compare.
  I've made it signaling (even though quiet would also be correct) as
  signaling is the more usual default for an LT. This code exists both
  in common code and in the X86 target.

- The STRICT_UINT_TO_FP expansion algorithm was incorrect for strict mode:
  it emitted two STRICT_SINT_TO_FP nodes and then used a select to choose one
  of the results. This can cause spurious exceptions by the STRICT_SINT_TO_FP
  that ends up not chosen. I've fixed the algorithm to use only a single
  STRICT_SINT_TO_FP instead.

- The !isStrictFPEnabled logic in DoInstructionSelection would sometimes do
  the wrong thing because it calls getOperationAction using the result VT.
  But for some opcodes, incuding [SU]INT_TO_FP, getOperationAction needs to
  be called using the operand VT.

- Remove some (obsolete) code in X86DAGToDAGISel::Select that would mutate
  STRICT_FP_TO_[SU]INT to non-strict versions unnecessarily.

Reviewed by: craig.topper

Differential Revision: https://reviews.llvm.org/D71840
2019-12-23 21:11:45 +01:00
Sanjay Patel 8cefc37be5 [DAGCombine] visitEXTRACT_SUBVECTOR - 'little to big' extract_subvector(bitcast()) support
This moves the X86 specific transform from rL364407
into DAGCombiner to generically handle 'little to big' cases
(for example: extract_subvector(v2i64 bitcast(v16i8))). This
allows us to remove both the x86 implementation and the aarch64
bitcast(extract_subvector(bitcast())) combine.

Earlier patches that dealt with regressions initially exposed
by this patch:
rG5e5e99c041e4
rG0b38af89e2c0

Patch by: @RKSimon (Simon Pilgrim)

Differential Revision: https://reviews.llvm.org/D63815
2019-12-23 10:11:45 -05:00
Craig Topper de2378b4f3 [X86] Fix a KNL miscompile caused by combineSetCC swapping LHS/RHS variables before a later use.
The setcc operands are copied into LHS and RHS variables at the top of the function. We also capture the condition code.

A later piece of code swaps the operands and changing the CC variable as part of a canonicalization to make some other checks simpler. But we might not make the transform we canonicalized for. So we continue on through the function where we can use the swapped LHS/RHS variables and access the original condition code operand instead of the modified CC variable. This leads to a setcc being created with the original condition code, but with swapped operands.

To mitigate this, this patch does a couple things. The LHS/RHS/CC variables are made const to keep them from being modified like this again. The transform that needs the swap now uses temporary copies of the variables. And the transform that used the original condition code operand has been altered to use the CC variable we cached originally. Either of these changes are enough to fix the issue, but doing both to make this code very safe.

I also considered rewriting the swap code in some way to check both permutations without explicitly swapping or needing temporary variables, but held off on that.

Differential Revision: https://reviews.llvm.org/D71736
2019-12-20 11:24:45 -08:00
Craig Topper bf507d4259 [X86] Make EmitCmp into a static function and explicitly return chain result for STRICT_FCMP. NFCI
The only thing its getting from the X86TargetLowering class is
the subtarget which we can easily pass. This function only has
one call site now since this might help the compiler inline it.

Explicitly return both the flag result and the chain result for
STRICT_FCMP nodes. This removes an assumption in the caller that
getValue(1) is the right way to get the chain.
2019-12-19 23:03:15 -08:00
Craig Topper 9b6fafa399 [X86] Directly call EmitTest in two places instead of creating a null constant and calling EmitCmp. NFCI
EmitCmp will just immediately call EmitTest and discard the null
constant only to have EmitTest create it again if it doesn't fold.

So just skip all that and go directly to EmitTest.
2019-12-19 23:03:06 -08:00
Liu, Chen3 2f932b5729 Enable STRICT_FP_TO_SINT/UINT on X86 backend
This patch is mainly for custom lowering the vector operation.

Differential Revision: https://reviews.llvm.org/D71592
2019-12-19 14:49:13 +08:00
Wang, Pengfei 1949235d13 [X86] Add strict fma support
Summary: Add strict fma support

Reviewers: craig.topper, RKSimon, LiuChen3

Subscribers: hiraditya, llvm-commits, LuoYuanke

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71604
2019-12-18 11:44:00 +08:00
Craig Topper 004fdbe041 [X86] Manually format some setOperationAction calls to line up arguments to improve readability. NFC 2019-12-17 16:11:31 -08:00
Kevin P. Neal b1d8576b0a This adds constrained intrinsics for the signed and unsigned conversions
of integers to floating point.

This includes some of Craig Topper's changes for promotion support from
D71130.

Differential Revision: https://reviews.llvm.org/D69275
2019-12-17 10:06:51 -05:00
Alex Richardson be15dfa88f [NFC] Use EVT instead of bool for getSetCCInverse()
Summary:
The use of a boolean isInteger flag (generally initialized using
VT.isInteger()) caused errors in our out-of-tree CHERI backend
(https://github.com/CTSRD-CHERI/llvm-project).

In our backend, pointers use a separate ValueType (iFATPTR) and therefore
.isInteger() returns false. This meant that getSetCCInverse() was using the
floating-point variant and generated incorrect code for us:
`(void *)0x12033091e < (void *)0xffffffffffffffff` would return false.

Committing this change will significantly reduce our merge conflicts
for each upstream merge.

Reviewers: spatel, bogner

Reviewed By: bogner

Subscribers: wuzish, arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70917
2019-12-13 12:22:03 +00:00
Sanjay Patel cdf5cfea8e Revert "[SDAG] remove use restriction in isNegatibleForFree() when called from getNegatedExpression()"
This reverts commit d1f0bdf2d2.
The patch can cause infinite loops in DAGCombiner.
2019-12-11 16:56:58 -05:00
Sanjay Patel d1f0bdf2d2 [SDAG] remove use restriction in isNegatibleForFree() when called from getNegatedExpression()
This is an alternate fix for the bug discussed in D70595.
This also includes minimal tests for other in-tree targets
to show the problem more generally.

We check the number of uses as a predicate for whether some
value is free to negate, but that use count can change as we
rewrite the expression in getNegatedExpression(). So something
that was marked free to negate during the cost evaluation
phase becomes not free to negate during the rewrite phase (or
the inverse - something that was not free becomes free).
This can lead to a crash/assert because we expect that
everything in an expression that is negatible to be handled
in the corresponding code within getNegatedExpression().

This patch skips the use check during the rewrite phase.
So we determine that some expression isNegatibleForFree
(identically to without this patch), but during the rewrite,
don't rely on use counts to decide how to create the optimal
expression.

Differential Revision: https://reviews.llvm.org/D70975
2019-12-11 13:30:39 -05:00
Craig Topper 935d41e4bd [X86] Split v64i1 arguments into 2 v32i1s that will be promoted to v32i8 under min-legal-vector-width=256
This is an improvement to 88dacbd436
2019-12-10 17:29:02 -08:00
Wang, Pengfei 21bc8631fe [FPEnv][X86] Constrained FCmp intrinsics enabling on X86
Summary: This is a follow up of D69281, it enables the X86 backend support for the FP comparision.

Reviewers: uweigand, kpn, craig.topper, RKSimon, cameron.mcinally, andrew.w.kaylor

Subscribers: hiraditya, llvm-commits, annita.zhang, LuoYuanke, LiuChen3

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70582
2019-12-11 08:23:09 +08:00
Craig Topper 88dacbd436 [X86] Go back to considering v64i1 as a legal type under min-legal-vector-width=256. Scalarize v64i1 arguments and shuffles under min-legal-vector-width=256.
This reverts 3e1aee2ba7 in favor
of a different approach.

Scalarizing isn't great codegen, but making the type illegal was
interfering with k constraint in inline assembly.
2019-12-10 15:07:55 -08:00
Liu, Chen3 bbf7860b93 add support for strict operation fpextend/fpround/fsqrt on X86 backend
Differential Revision: https://reviews.llvm.org/D71184
2019-12-10 09:04:28 +08:00
Amara Emerson 84fdd9d7a5 [X86] Fix prolog/epilog mismatch for stack protectors on win32-macho.
The xor'ing behaviour is only used for msvc/crt environments, when we're targeting
macho the guard load code doesn't know about the xor in the epilog. Disable xor'ing
when targeting win32-macho to be consistent.

Differential Revision: https://reviews.llvm.org/D71095
2019-12-06 14:44:56 -08:00
Craig Topper 28b573d249 [TargetLowering] Fix another potential FPE in expandFP_TO_UINT
D53794 introduced code to perform the FP_TO_UINT expansion via FP_TO_SINT in a way that would never expose floating-point exceptions in the intermediate steps. Unfortunately, I just noticed there is still a way this can happen. As discussed in D53794, the compiler now generates this sequence:

// Sel = Src < 0x8000000000000000
// Val = select Sel, Src, Src - 0x8000000000000000
// Ofs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val) ^ Ofs
The problem is with the Src - 0x8000000000000000 expression. As I mentioned in the original review, that expression can never overflow or underflow if the original value is in range for FP_TO_UINT. But I missed that we can get an Inexact exception in the case where Src is a very small positive value. (In this case the result of the sub is ignored, but that doesn't help.)

Instead, I'd suggest to use the following sequence:

// Sel = Src < 0x8000000000000000
// FltOfs = select Sel, 0, 0x8000000000000000
// IntOfs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val - FltOfs) ^ IntOfs
In the case where the value is already in range of FP_TO_SINT, we now simply compute Val - 0, which now definitely cannot trap (unless Val is a NaN in which case we'd want to trap anyway).

In the case where the value is not in range of FP_TO_SINT, but still in range of FP_TO_UINT, the sub can never be inexact, as Val is between 2^(n-1) and (2^n)-1, i.e. always has the 2^(n-1) bit set, and the sub is always simply clearing that bit.

There is a slight complication in the case where Val is a constant, so we know at compile time whether Sel is true or false. In that scenario, the old code would automatically optimize the sub away, while this no longer happens with the new code. Instead, I've added extra code to check for this case and then just fall back to FP_TO_SINT directly. (This seems to catch even slightly more cases.)

Original version of the patch by Ulrich Weigand. X86 changes added by Craig Topper

Differential Revision: https://reviews.llvm.org/D67105
2019-12-06 14:11:04 -08:00
Reid Kleckner c089f02898 [X86] Don't setup and teardown memory for a musttail call
Summary:
musttail calls should not require allocating extra stack for arguments.
Updates to arguments passed in memory should happen in place before the
epilogue.

This bug was mostly a missed optimization, unless inalloca was used and
store to push conversion fired.

If a reserved call frame was used for an inalloca musttail call, the
call setup and teardown instructions would be deleted, and SP
adjustments would be inserted in the prologue and epilogue. You can see
these are removed from several test cases in this change.

In the case where the stack frame was not reserved, i.e. call frame
optimization fires and turns argument stores into pushes, then the
imbalanced call frame setup instructions created for inalloca calls
become a problem. They remain in the instruction stream, resulting in a
call setup that allocates zero bytes (expected for inalloca), and a call
teardown that deallocates the inalloca pack. This deallocation was
unbalanced, leading to subsequent crashes.

Reviewers: hans

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71097
2019-12-06 12:58:54 -08:00
Craig Topper 8267be2995 [X86] Make X86TargetLowering::BuildFILD return a std::pair of SDValues so we explicitly return the chain instead of calling getValue on the single SDValue.
We shouldn't assume that the returned result can be used to get
the other result.

This is prep-work for strict FP where we will also need to pass
the chain result along in more cases.
2019-12-05 17:54:21 -08:00
Liu, Chen3 3041434450 Add strict fp support for instructions fadd/fsub/fmul/fdiv
Differential Revision: https://reviews.llvm.org/D68757
2019-12-06 09:44:33 +08:00
Craig Topper 3d43c73f26 [X86] Remove override of shouldUseStrictFP_TO_INT for fp80. NFC
I suspect this became unnecessary after r354161. Prior to that
we may have been going through the default expansion of FP_TO_UINT
on 64-bit targets and then ending up back in Custom X86 handling
to handle the FP_TO_SINT for it. Now we just Custom handle the
FP_TO_UINT directly. We already need to handle it for 32-bit mode
during type legalization so we wouldn't save any code by using
the default expansion on 64-bit.
2019-12-04 17:58:10 -08:00
Amy Huang 9e978bb01c Add support for lowering 32-bit/64-bit pointers
Summary:
This follows a previous patch that changes the X86 datalayout to represent
mixed size pointers (32-bit sext, 32-bit zext, and 64-bit) with address spaces
(https://reviews.llvm.org/D64931)

This patch implements the address space cast lowering to the corresponding
sign extension, zero extension, or truncate instructions.

Related to https://bugs.llvm.org/show_bug.cgi?id=42359

Reviewers: rnk, craig.topper, RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69639
2019-12-04 11:39:03 -08:00
Craig Topper 8f73a93b2d [X86] Add support for STRICT_FP_TO_UINT/SINT from fp128. 2019-11-27 18:38:32 -08:00
Craig Topper cfce8f2cfb [X86] Add strict fp support for operations of X87 instructions
This is the following patch of D68854.

This patch adds basic operations of X87 instructions, including +, -, *, / , fp extensions and fp truncations.

Patch by Chen Liu(LiuChen3)

Differential Revision: https://reviews.llvm.org/D68857
2019-11-26 10:59:41 -08:00
David Green b5315ae8ff [Codegen][ARM] Add addressing modes from masked loads and stores
MVE has a basic symmetry between it's normal loads/store operations and
the masked variants. This means that masked loads and stores can use
pre-inc and post-inc addressing modes, just like the standard loads and
stores already do.

To enable that, this patch adds all the relevant infrastructure for
treating masked loads/stores addressing modes in the same way as normal
loads/stores.

This involves:
- Adding an AddressingMode to MaskedLoadStoreSDNode, along with an extra
   Offset operand that is added after the PtrBase.
- Extending the IndexedModeActions from 8bits to 16bits to store the
   legality of masked operations as well as normal ones. This array is
   fairly small, so doubling the size still won't make it very large.
   Offset masked loads can then be controlled with
   setIndexedMaskedLoadAction, similar to standard loads.
- The same methods that combine to indexed loads, such as
   CombineToPostIndexedLoadStore, are adjusted to handle masked loads in
   the same way.
- The ARM backend is then adjusted to make use of these indexed masked
   loads/stores.
- The X86 backend is adjusted to hopefully be no functional changes.

Differential Revision: https://reviews.llvm.org/D70176
2019-11-26 16:21:01 +00:00
Craig Topper 1b20908334 [X86] Return Op instead of SDValue() for lowering flags_read/write intrinsics
Returning SDValue() means we didn't handle it and the common
code should try to expand it. But its a target intrinsic so
expanding won't do anything and just leave the node alone. But
it will print confusing debug messages.

By returning Op we tell the common code that the node is legal
and shouldn't receive any further processing.
2019-11-25 23:13:30 -08:00
Craig Topper c43b8ec735 [X86] Add support for STRICT_FP_ROUND/STRICT_FP_EXTEND from/to fp128 to/from f32/f64/f80 in 64-bit mode.
These need to emit a libcall like we do for the non-strict version.

32-bit mode needs to SoftenFloat support to be implemented for strict FP nodes.

Differential Revision: https://reviews.llvm.org/D70504
2019-11-25 18:18:39 -08:00
Simon Pilgrim 5d9a259ad5 [X86][SSE] Split off generic isLaneCrossingShuffleMask helper. NFC.
Avoid MVT dependency which will be needed in a future patch.
2019-11-23 12:41:03 +00:00
Craig Topper b29e5cdb7c [X86] Add test cases for most of the constrained fp libcalls with fp128.
Add explicit setOperation actions for some to match their none
strict counterparts. This isn't required, but makes the code
self documenting that we didn't forget about strict fp. I've
used LibCall instead of Expand since that's more explicitly what
we want.

Only lrint/llrint/lround/llround are missing now.
2019-11-21 18:17:59 -08:00
Craig Topper fc4020dbbe [X86] Mark fp128 FMA as LibCall instead of Expand. Add STRICT_FMA as well.
The Expand code would fall back to LibCall, but this makes it
more explicit.
2019-11-21 18:17:57 -08:00
Craig Topper 7696b99258 [LegalizeDAG][X86] Add support for turning STRICT_FADD/SUB/MUL/DIV into libcalls. Use it for fp128 on x86-64.
This requires a minor hack for f32/f64 strict fadd/fsub to avoid
turning those into libcalls.
2019-11-21 16:19:25 -08:00
Craig Topper 95f44cf44a [X86] Mark vector STRICT_FADD/STRICT_FSUB as Legal and add mutation to X86ISelDAGToDAG
The prevents LegalizeVectorOps from scalarizing them. We'll need
to remove the X86 mutation code when we add isel patterns.
2019-11-21 16:19:18 -08:00
Hiroshi Yamauchi 52e377497d [PGO][PGSO] DAG.shouldOptForSize part.
Summary:
(Split of off D67120)

SelectionDAG::shouldOptForSize changes for profile guided size optimization.

Reviewers: davidxl

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70095
2019-11-21 14:16:00 -08:00
Craig Topper 1439059cc7 [X86] Change legalization action for f128 fadd/fsub/fmul/fdiv from Custom to LibCall.
The custom code just emits a libcall, but we can do the same
with generic code. The only difference is that the generic code
can form tail calls where the custom code couldn't. This is
responsible for the test changes.

This avoids needing to modify the Custom handling for strict fp.
2019-11-21 11:44:29 -08:00
Craig Topper 27da569a7a [X86] Fix i16->f128 sitofp to promote the i16 to i32 before trying to form a libcall.
Previously one of the test cases added here gave an error.
2019-11-20 17:09:32 -08:00
Craig Topper 5f3bf5967b [X86] Fix f128->i16 fptosi to promote the i16 to i32 before trying to form a libcall.
Previously one of the test cases added here gave an error.
2019-11-20 17:09:31 -08:00
Craig Topper 7488c0a6f5 [X86] Mark vector STRICT_FP_ROUND as Legal instead of Custom.
The Custom handler doesn't do anything for these nodes anyway.

SelectionDAGISel won't mutate them if they are Legal or Custom.
X86 has custom code for mutating them due to missing isel patterns.
When the isel patterns are added Legal will be the right answer.
So go ahead a change it now since that's where we'll end up.
2019-11-20 13:03:51 -08:00
Reid Kleckner 606a2bd621 [musttail] Don't forward AL on Win64
AL is only used for varargs on SysV platforms. Don't forward it on
Windows. This allows control flow guard to set up an extra hidden
parameter in RAX, as described in PR44049.

This also has the effect of freeing up RAX for use in virtual member
pointer thunks, which may also be a nice little code size improvement on
Win64.

Fixes PR44049

Reviewers: ajpaverd, efriedma, hans

Differential Revision: https://reviews.llvm.org/D70413
2019-11-19 16:54:00 -08:00
Craig Topper c4b41e8d1d [LegalizeDAG][X86] Enable STRICT_FP_TO_SINT/UINT to be promoted
Differential Revision: https://reviews.llvm.org/D70220
2019-11-19 16:14:37 -08:00
Craig Topper 85589f8077 [X86] Add custom type legalization and lowering for scalar STRICT_FP_TO_SINT/UINT
This is a first pass at Custom lowering for these operations. I also updated some of the vector code where it was obviously easy and straightforward. More work needed in follow up.

This enables these operations to be handled with X87 where special rounding control adjustments are needed to perform a truncate.

Still need to fix Promotion in the target independent code in LegalizeDAG.
llrint/llround split into separate test file because we can't make a strict libcall properly yet either and we need to do that when i64 isn't a legal type.

This does not include any isel support. So we still rely on the mutation in SelectionDAGIsel to remove the strict from this stuff later. Except for the X87 stuff which goes through custom nodes that already had chains.

Differential Revision: https://reviews.llvm.org/D70214
2019-11-19 16:05:22 -08:00
Matt Arsenault b696b9dba7 DAG: Add function context to isFMAFasterThanFMulAndFAdd
AMDGPU needs to know the FP mode for the function to answer this
correctly when this is removed from the subtarget.

AArch64 had to make this more complicated by using this from an IR
hook, so add an IR typed overload.
2019-11-19 19:25:26 +05:30
Simon Pilgrim bbf4af3109 [X86][SSE] Remove XFormVExtractWithShuffleIntoLoad to prevent legalization infinite loops (PR43971)
As detailed in PR43971/D70267, the use of XFormVExtractWithShuffleIntoLoad causes issues where we end up in infinite loops of extract(targetshuffle(vecload)) -> extract(shuffle(vecload)) -> extract(vecload) -> extract(targetshuffle(vecload)), there are just too many legalization checks at every stage that we can't guarantee that extract(shuffle(vecload)) -> scalarload can occur.

At the moment we see a number of minor regressions as we don't fold extract(shuffle(vecload)) -> scalarload before legal ops, these can be addressed in future patches and extension of X86ISelLowering's combineExtractWithShuffle.
2019-11-19 11:55:44 +00:00
Graham Hunter 3f08ad611a [SVE][CodeGen] Scalable vector MVT size queries
* Implements scalable size queries for MVTs, split out from D53137.

* Contains a fix for FindMemType to avoid using scalable vector type
  to contain non-scalable types.

* Explicit casts for several places where implicit integer sign
  changes or promotion from 32 to 64 bits caused problems.

* CodeGenDAGPatterns will treat scalable and non-scalable vector types
  as different.

Reviewers: greened, cameron.mcinally, sdesmalen, rovka

Reviewed By: rovka

Differential Revision: https://reviews.llvm.org/D66871
2019-11-18 12:30:59 +00:00
Craig Topper f7e9d81a8e [X86] Don't set the operation action for i16 SINT_TO_FP to Promote just because SSE1 is enabled.
Instead do custom promotion in the handler so that we can still
allow i16 to be used with fp80. And f64 without sse2.
2019-11-13 14:07:56 -08:00
Craig Topper 787595b2e7 [X86] Fix typo in comment. NFC 2019-11-13 14:07:55 -08:00
Craig Topper fee9067261 [X86] Move all the FP_TO_XINT/XINT_TO_FP setOperationActions into the same !useSoftFloat block. Qualify all of the Promote actions for these with !useSoftFloat too. NFCI
The Promote action doesn't apply until LegalizeDAG. By the time
we get there, we would have already softened all the FP operations
if useSoftFloat was true. So there wouldn't be any operation left
to Promote.
2019-11-13 14:07:54 -08:00
Craig Topper a4b7613a49 [X86] Remove setOperationAction for FP_TO_SINT v8i16.
This is no longer needed after widening legalization as we
custom legalize v8i8 ourselves.

Added entries to the cost model, but bumped the cost slightly
to account for the truncate shuffle that wasn't costed before.
2019-11-12 22:45:52 -08:00
Craig Topper 3e1aee2ba7 [X86] Don't consider v64i1 as a legal type unless v64i8 is also a legal type.
This avoids some nasty issues with argument passing and lowering of
arbitrary v64i8 shuffles.
2019-11-12 14:56:02 -08:00
Craig Topper 0f04ffc073 [X86] Only pass v64i8/v32i16 as v16i32 on non-avx512bw targets if the v16i32 type won't be split by prefer-vector-width=256
Otherwise just let the v64i8/v32i16 types be split to v32i8/v16i16.

In reality this shouldn't happen because it means we have a 512-bit
vector argument, but min-legal-vector-width says a value less than
512. But a 512-bit argument should have been factored into the
preferred vector width.
2019-11-12 14:56:01 -08:00
Craig Topper ff1504da6f [X86] Update stale comment. NFC 2019-11-11 23:55:12 -08:00
Craig Topper 578f3b5dce [X86] Remove setOperationAction lines that say to promote MVT::i1
MVT::i1 should be removed by type legalization before we reach
any code that would act on the promote action.

Mainly to avoid replicating this for strict FP versions of these
operations.
2019-11-11 18:35:57 -08:00
Craig Topper 6c86d6efaf [X86] Remove some else branches after checking for !useSoftFloat() that set operations to Expand.
If we're using soft floats, then these operations shoudl be
softened during type legalization. They'll never get to
LegalizeVectorOps or LegalizeDAG so they don't need to be
Expanded there.
2019-11-11 16:32:19 -08:00
Eli Friedman 5df3a87224 [AArch64][X86] Don't assume __powidf2 is available on Windows.
We had some code for this for 32-bit ARM, but this doesn't really need
to be in target-specific code; generalize it.

(I think this started showing up recently because we added an
optimization that converts pow to powi.)

Differential Revision: https://reviews.llvm.org/D69013
2019-11-08 12:43:21 -08:00
Craig Topper 17eb12fa6d [X86] Remove unused variable. NFC 2019-11-06 22:53:48 -08:00
Craig Topper 1c8460d6e1 [X86] Remove dead code from combineStore.
Leftovers from before we switched to widening legalization.

Fixes PR43919.
2019-11-06 22:24:47 -08:00
Craig Topper 641d2e5232 [X86] Clamp large constant shift amounts for MMX shift intrinsics to 8-bits.
The MMX intrinsics for shift by immediate take a 32-bit shift
amount but the hardware for shifting by immediate only encodes
8-bits. For the intrinsic we don't require the shift amount to
fit in 8-bits in the frontend because we don't check that its an
immediate in the frontend. If its is not an immediate we move it
to an MMX register and use the shift by register.

But if it is an immediate we'll use the shift by immediate
instruction. But we need to change the shift amount to 8-bits.
We were previously doing this accidentally by masking it in the
encoder. But this can make a large shift amount into a small
in bounds shift amount. Instead we should clamp larger shift
amounts to 255 so that the they don't become in bounds.

Fixes PR43922
2019-11-06 13:03:18 -08:00
Dávid Bolvanský ca7f5becf9 [X86ISelLowering] Fixed typo in assert. NFCI. 2019-11-06 20:04:15 +01:00
Sanjay Patel 8e34dd941c [x86] avoid crashing when splitting AVX stores with non-simple type (PR43916)
The store splitting transform was assuming a simple type (MVT),
but that's not necessarily the case as shown in the test.
2019-11-06 09:28:41 -05:00
Simon Pilgrim 37cdac6344 [X86] LowerAVXExtend - fix dodgy self-comparison assert.
PVS Studio noticed that we were asserting "VT.getVectorNumElements() == VT.getVectorNumElements()" instead of "VT.getVectorNumElements() == InVT.getVectorNumElements()".
2019-11-06 12:50:29 +00:00
Benjamin Kramer 5f158d8e21 [X86] Gate select->fmin/fmax transform on NoSignedZeros instead of UnsafeFPMath 2019-11-05 21:28:41 +01:00
Philip Reames 027aa27d95 [X86/Atomics] (Semantically) revert G246098, switch back to the old atomic example
When writing an email for a follow up proposal, I realized one of the diffs in the committed change was incorrect.  Digging into it revealed that the fix is complicated enough to require some thought, so reverting in the meantime.

The problem is visible in this diff (from the revert):
 ; X64-SSE-LABEL: store_fp128:
 ; X64-SSE:       # %bb.0:
-; X64-SSE-NEXT:    movaps %xmm0, (%rdi)
+; X64-SSE-NEXT:    subq $24, %rsp
+; X64-SSE-NEXT:    .cfi_def_cfa_offset 32
+; X64-SSE-NEXT:    movaps %xmm0, (%rsp)
+; X64-SSE-NEXT:    movq (%rsp), %rsi
+; X64-SSE-NEXT:    movq {{[0-9]+}}(%rsp), %rdx
+; X64-SSE-NEXT:    callq __sync_lock_test_and_set_16
+; X64-SSE-NEXT:    addq $24, %rsp
+; X64-SSE-NEXT:    .cfi_def_cfa_offset 8
 ; X64-SSE-NEXT:    retq
   store atomic fp128 %v, fp128* %fptr unordered, align 16
   ret void

The problem here is three fold:
1) x86-64 doesn't guarantee atomicity of anything larger than 8 bytes.  Some platforms observably break this guarantee, others don't, but the codegen isn't considering this, so it's wrong on at least some platforms.
2) When I started to track down the problem, I discovered that DAGCombiner had stripped the atomicity off the store entirely.  This comes down to idiomatic usage of DAG.getStore passing all MMO components separately as opposed to just passing the MMO.
3) On x86 (not -64), there are cases where 8 byte atomiciy is supported, but only for floating point operations.  This would seem to imply that operation typing matters for correctness, and DAGCombine happily folds away bitcasts.  I'm not 100% sure there's a problem here, but I'm not entirely sure there isn't either.

I plan on returning to each issue in turn;  sorry for the churn here.
2019-11-05 11:24:27 -08:00
Benjamin Kramer 00e53d912d [X86] Specifically limit fmin/fmax commutativity to NoNaNs + NoSignedZeros
The backend UnsafeFPMath flag is not a superset of all the others, so
limit it to the exact bits needed.
2019-11-05 19:34:06 +01:00
Simon Pilgrim 9ad9d1531b [X86] Convert ShrinkMode to scoped enum class. NFCI. 2019-11-04 15:35:20 +00:00
Simon Pilgrim 31ed36d044 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (REAPPLIED)
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts (see D66004).

This reapplies rL368307 (reverted at rL369167) after the fix for the infinite loop reported at PR43024 was applied at rG3f087e38a2e7b87a5adaaac1c1b61e51220e7ff3
2019-11-04 11:37:57 +00:00
Simon Pilgrim 3f087e38a2 [X86][SSE] combineX86ShufflesRecursively - at Depth==0, only resolve KnownZero if it removes an input.
This stops infinite loops where KnownUndef elements are converted to Zeroable, resulting in KnownZero elements which are then simplified (via SimplifyDemandedElts etc.) back to KnownUndef elements........

Prep fix for PR43024 which will allow rL368307 to be re-applied.
2019-11-03 21:10:47 +00:00
Simon Pilgrim 8f29e4407c [X86][SSE] combineX86ShufflesRecursively - don't bother merging shuffles with empty roots. NFCI.
This doesn't affect actual codegen, but is a minor refactor toward fixing PR43024 where we need to avoid excess changes (folding zeroables etc.) to the shuffle mask at Depth == 0.
2019-11-03 17:46:00 +00:00
Simon Pilgrim 297d96bb60 Fix uninitialized variable warning. NFCI. 2019-11-03 11:15:55 +00:00
Simon Pilgrim 254b8461ac [X86] Move computeZeroableShuffleElements before getTargetShuffleAndZeroables. NFCI.
Prep work toward merging some of the functionality.
2019-11-02 13:38:35 +00:00
Craig Topper eeeb18cd07 [X86] Change the behavior of canWidenShuffleElements used by lowerV2X128Shuffle to match the behavior in lowerVectorShuffle with regards to zeroable elements.
Previously we marked zeroable elements in a way that prevented
the widening check from recognizing that it could widen. Now
we only mark them zeroable if V2 is an all zeros vector. This
matches what we do for widening elements in lowerVectorShuffle.

Fixes PR43866.
2019-11-01 13:06:03 -07:00
Simon Pilgrim 9b0dfdf5e1 [X86][AVX] Add support for and/or scalar bool reduction with AVX512 mask registers
combineBitcastvxi1 only handles bitcast->MOVMSK combines, with mask registers we use BITCAST directly.
2019-11-01 17:55:31 +00:00
Simon Pilgrim ea27d82814 [X86] isFNEG - use switch() instead of if-else tree. NFCI.
In a future patch this will avoid some checks which don't need to be done for some opcodes.
2019-11-01 17:09:04 +00:00
Simon Pilgrim a780b94cd1 [X86][SSE] Convert computeZeroableShuffleElements to emit KnownUndef and KnownZero 2019-10-31 11:21:39 +00:00
Simon Pilgrim f25f3d39df [X86] Add FIXME comment to merge more of computeZeroableShuffleElements and getTargetShuffleAndZeroables 2019-10-30 18:30:01 +00:00
Simon Pilgrim 94a4a2c97f [X86][SSE] combineX86ShuffleChain - use resolveZeroablesFromTargetShuffle helper. NFCI. 2019-10-30 18:30:01 +00:00
Simon Pilgrim 81399002ae [X86] combineOrShiftToFunnelShift - use isOperationLegalOrCustom to check FSHL/FSHR support
Remove hard wired legality check.
2019-10-30 11:52:22 +00:00
Simon Pilgrim 26655376fe [X86] combineOrShiftToFunnelShift - use getShiftAmountTy instead of hardwiring to MVT::i8 2019-10-30 11:52:22 +00:00
Guillaume Chatelet 119b436da1 [Alignment] Use Align for TFI.getStackAlignment() in X86ISelLowering
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet, craig.topper, rnk

Reviewed By: rnk

Subscribers: rnk, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69034
2019-10-30 10:35:13 +01:00
David Zarzycki f68925d450
[X86] Make memcmp vector lowering handle arbitrary expansions
Teach combineVectorSizedSetCCEquality() to handle arbitrary memcmp
expansions but do not change any default policy for now.

This also fixes a bug in the memcmp expansion itself when large
displacements are needed.

https://reviews.llvm.org/D69507
2019-10-30 09:12:57 +02:00
Philip Reames 2460989eab [SelectionDAG] Enable lowering unordered atomics loads w/LoadSDNode (and stores w/StoreSDNode) by default
Enable the new SelectionDAG representation for unordered loads and stores introduced in r371441 by default.  As a reminder, the new lowering changes the representation of an unordered atomic load from an AtomicSDNode - which is essentially a black box which gets passed through without combines messing with it - to a LoadSDNode w/a atomic marker on the MMO. The later parallels the way we handle volatiles, and I've audited the code to ensure that every location which checks one checks the other.

This has been fairly heavily fuzzed, and I examined diffs in a reasonable large corpus of assembly by hand, so I'm reasonable sure this is correct for the common case.  Late in the review for this, it was discovered that I hadn't correctly handled cases which could be legalized into CAS operations.  This points out that there's a strong bias in the IR of the frontend I'm working with towards only legal atomics.  If there are problems with this patch, the most likely area will be legalization.

Differential Revision: https://reviews.llvm.org/D69219
2019-10-29 12:46:24 -07:00
Craig Topper 772533d921 [X86] Narrow i64 compares with constant to i32 when the upper 32-bits are known zero.
This catches some cases. There are probably ways to improve this.
I tried doing it as a combine on the setcc, but that broke
some cases involving flag reuse in place of test.

I renamed the isX86CCUnsigned to isX86CCSigned and flipped its
polarity to make it consistent with the similar functions for
ISD::SETCC. This avoids calling EQ/NE as being signed or unsigned.

Fixes PR43823.

Differential Revision: https://reviews.llvm.org/D69499
2019-10-29 11:38:15 -07:00
Simon Pilgrim 501cf25839 [X86] Pull out combineOrShiftToFunnelShift helper. NFCI. 2019-10-29 15:29:51 +00:00
Craig Topper 3da269a248 [X86] Add a DAG combine to turn (and (bitcast (vXi1 (concat_vectors (vYi1 setcc), undef,))), C) into (bitcast (vXi1 (concat_vectors (vYi1 setcc), zero,)))
The legalization of v2i1->i2 or v4i1->i4 bitcasts followed by a setcc can create an and after the bitcast. If we're lucky enough that the input to the bitcast is a concat_vectors where the first operand is a setcc that can natively 0 all the upper bits of ak-register, then we should replace the other operands of the concat_vectors with zero in order to remove the AND.

With the AND removed we might be able to use a kortest on the result.

Differential Revision: https://reviews.llvm.org/D69205
2019-10-28 11:27:01 -07:00
David Zarzycki 657e4240b1 [X86] Fix 48/96 byte memcmp code gen
Detect scalar ISD::ZERO_EXTEND generated by memcmp lowering and convert
it to ISD::INSERT_SUBVECTOR.

https://reviews.llvm.org/D69464
2019-10-28 08:41:45 +02:00
David Zarzycki 11c920207a [X86] Prefer KORTEST on Knights Landing or later for memcmp()
PTEST and especially the MOVMSK instructions are slow on Knights Landing
or later. As a bonus, this patch increases instruction parallelism by
emitting:
    KORTEST(PCMPNEQ(a, b), PCMPNEQ(c, d)) == 0
Instead of:
    KORTEST(AND(PCMPEQ(a, b), PCMPEQ(c, d))) == ~0

https://reviews.llvm.org/D69157
2019-10-26 21:14:57 +03:00
Craig Topper 3dd0a896b6 [X86] Add a check for SSE2 to the top of combineReductionToHorizontal.
Without this, we can create a PSADBW node that isn't legal.
2019-10-25 11:11:32 -07:00
Simon Pilgrim a4d55a2c36 [X86] combineX86ShufflesRecursively - assert the root mask is legal. NFCI. 2019-10-23 07:33:29 -07:00
Simon Pilgrim b446356bf3 [X86][SSE] Add OR(EXTRACTELT(X,0),OR(EXTRACTELT(X,1))) -> MOVMSK+CMP reduction combine
llvm-svn: 375463
2019-10-21 22:36:31 +00:00
Simon Pilgrim 7c15c4fb17 [X86] Rename matchBitOpReduction to matchScalarReduction. NFCI.
This doesn't need to be just for bitops, but the ops do need to be fully associative.

llvm-svn: 375445
2019-10-21 19:19:50 +00:00
Craig Topper e78414622d [X86] Check Subtarget.hasSSE3() before calling shouldUseHorizontalOp and emitting X86ISD::FHADD in LowerUINT_TO_FP_i64.
This was a regression from r375341.

Fixes PR43729.

llvm-svn: 375381
2019-10-20 23:54:19 +00:00
Simon Pilgrim 10213b9073 [X86] Pulled out helper to decode target shuffle element sentinel values to 'Zeroable' known undef/zero bits. NFCI.
Renamed 'resolveTargetShuffleAndZeroables' to 'resolveTargetShuffleFromZeroables' to match.

llvm-svn: 375348
2019-10-19 16:58:24 +00:00
Simon Pilgrim b5088aa944 [X86][SSE] lowerV16I8Shuffle - tryToWidenViaDuplication - undef unpack args
tryToWidenViaDuplication lowers using the shuffle_v8i16(unpack_v16i8(shuffle_v8i16(x),shuffle_v8i16(x))) pattern, but the unpack only needs the even/odd 16i8 args if the original v16i8 shuffle mask references the even/odd elements - which isn't true for many extension style shuffles.

llvm-svn: 375342
2019-10-19 13:18:02 +00:00
Simon Pilgrim 6ada70d1b5 [X86][SSE] LowerUINT_TO_FP_i64 - only use HADDPD for size/fast-hops
We were always generating a single source HADDPD, but really we should only do this if shouldUseHorizontalOp says its a good idea.

Differential Revision: https://reviews.llvm.org/D69175

llvm-svn: 375341
2019-10-19 11:53:48 +00:00
Simon Pilgrim 696794b66e [X86] combineX86ShufflesRecursively - pull out isTargetShuffleVariableMask. NFCI.
llvm-svn: 375253
2019-10-18 16:39:01 +00:00
David Zarzycki 7b9fd37fa1 [X86] Emit KTEST when possible
https://reviews.llvm.org/D69111

llvm-svn: 375197
2019-10-18 03:45:52 +00:00
Sam Parker 39af8a3a3b [DAGCombine][ARM] Enable extending masked loads
Add generic DAG combine for extending masked loads.

Allow us to generate sext/zext masked loads which can access v4i8,
v8i8 and v4i16 memory to produce v4i32, v8i16 and v4i32 respectively.

Differential Revision: https://reviews.llvm.org/D68337

llvm-svn: 375085
2019-10-17 07:55:55 +00:00
Simon Pilgrim 50dc09dd16 [X86] combineX86ShufflesRecursively - split the getTargetShuffleInputs call from the resolveTargetShuffleAndZeroables call.
Exposes an issue in getFauxShuffleMask where the OR(SHUFFLE,SHUFFLE) decode should always resolve zero/undef elements.

Part of the fix for PR43024 where ideally we shouldn't call resolveTargetShuffleAndZeroables for Depth == 0

llvm-svn: 374928
2019-10-15 17:59:13 +00:00
David Zarzycki 59390efef2 [X86] Make memcmp() use PTEST if possible and also enable AVX1
llvm-svn: 374922
2019-10-15 17:40:12 +00:00
Simon Pilgrim 70778444c7 [X86] Resolve KnownUndef/KnownZero bits into target shuffle masks in helper. NFCI.
llvm-svn: 374878
2019-10-15 11:13:51 +00:00
Craig Topper b2661a2d15 [X86] Don't check for VBROADCAST_LOAD being a user of the source of a VBROADCAST when trying to share broadcasts.
The only things VBROADCAST_LOAD uses is an address and a chain
node. It has no vector inputs.

So if its a user of the source of another broadcast that could
only mean one of two things. The other broadcast is broadcasting
the address of the broadcast_load. Or the source is a load and
the use we're seeing is the chain result from that load. Neither
of these cases make sense to combine here.

This issue was reported post-commit r373871. Test case has not
been reduced yet.

llvm-svn: 374862
2019-10-15 06:10:11 +00:00
Craig Topper f4d03213f3 [X86] Teach EmitTest to handle ISD::SSUBO/USUBO in order to use the Z flag from the subtract directly during isel.
This prevents isel from emitting a TEST instruction that
optimizeCompareInstr will need to remove later.

In some of the modified tests, the SUB gets duplicated due to
the flags being needed in two places and being clobbered in
between. optimizeCompareInstr was able to optimize away the TEST
that was using the result of one of them, but optimizeCompareInstr
doesn't know to turn SUB into CMP after removing the TEST. It
only knows how to turn SUB into CMP if the result was already
dead.

With this change the TEST never exists, so optimizeCompareInstr
doesn't have to remove it. Then it can just turn the SUB into
CMP immediately.

Fixes PR43649.

llvm-svn: 374755
2019-10-14 06:47:56 +00:00
Simon Pilgrim 11495e5acb [X86] getTargetShuffleInputs - Control KnownUndef mask element resolution as well as KnownZero.
We were already controlling whether the KnownZero elements were being written to the target mask, this extends it to the KnownUndef elements as well so we can prevent the target shuffle mask being manipulated at all.

llvm-svn: 374732
2019-10-13 19:35:35 +00:00
Craig Topper 25eb219959 [X86] Enable use of avx512 saturating truncate instructions in more cases.
This enables use of the saturating truncate instructions when the
result type is less than 128 bits. It also enables the use of
saturating truncate instructions on KNL when the input is less
than 512 bits. We can do this by widening the input and then
extracting the result.

llvm-svn: 374731
2019-10-13 19:07:28 +00:00
Simon Pilgrim 3efafd6c38 [X86] SimplifyMultipleUseDemandedBitsForTargetNode - use getTargetShuffleInputs with KnownUndef/Zero results.
llvm-svn: 374725
2019-10-13 17:03:11 +00:00
Simon Pilgrim e4c58db8bc [X86] getTargetShuffleInputs - add KnownUndef/Zero output support
Adjust SimplifyDemandedVectorEltsForTargetNode to use the known elts masks instead of recomputing it locally.

llvm-svn: 374724
2019-10-13 17:03:02 +00:00
Craig Topper d50cb9ac8c [X86] Add a one use check on the setcc to the min/max canonicalization code in combineSelect.
This seems to improve std::midpoint code where we have a min and
a max with the same condition. If we split the setcc we can end
up with two compares if the one of the operands is a constant.
Since we aggressively canonicalize compares with constants.
For non-constants it can interfere with our ability to share
control flow if we need to expand cmovs into control flow.

I'm also not sure I understand this min/max canonicalization code.
The motivating case talks about comparing with 0. But we don't
check for 0 explicitly.

Removes one instruction from the codegen for PR43658.

llvm-svn: 374706
2019-10-13 06:48:05 +00:00
Craig Topper bf57aa2b25 [X86] Enable v4i32->v4i16 and v8i16->v8i8 saturating truncates to use pack instructions with avx512.
llvm-svn: 374705
2019-10-13 05:47:47 +00:00
Simon Pilgrim 6716512670 [X86] scaleShuffleMask - use size_t Scale to avoid overflow warnings
llvm-svn: 374674
2019-10-12 18:33:47 +00:00
Simon Pilgrim 66417a9f03 Replace for-loop of SmallVector::push_back with SmallVector::append. NFCI.
llvm-svn: 374669
2019-10-12 16:37:02 +00:00
Simon Pilgrim 37041c7d22 Fix cppcheck shadow variable name warnings. NFCI.
llvm-svn: 374668
2019-10-12 16:36:52 +00:00
Simon Pilgrim 6446079add [X86] Use any_of/all_of patterns in shuffle mask pattern recognisers. NFCI.
llvm-svn: 374667
2019-10-12 16:36:44 +00:00
Simon Pilgrim 9f0885d38d [X86][SSE] Avoid unnecessary PMOVZX in v4i8 sum reduction
This should go away once D66004 has landed and we can simplify shuffle chains using demanded elts.

llvm-svn: 374658
2019-10-12 15:19:13 +00:00
Craig Topper 9bd542dcd5 [X86] Use pack instructions for packus/ssat truncate patterns when 256-bit is the largest legal vector and the result type is at least 256 bits.
Since the input type is larger than 256-bits we'll need to some
concatenating to reassemble the results. The pack instructions
ability to concatenate while packing make this a shorter/faster
sequence.

llvm-svn: 374643
2019-10-12 07:59:29 +00:00
Craig Topper 3472feb94c [X86] Fold a VTRUNCS/VTRUNCUS+store into a saturating truncating store.
We already did this for VTRUNCUS with a specific combination of
types. This extends this to VTRUNCS and handles any types where
a truncating store is legal.

llvm-svn: 374615
2019-10-12 00:01:08 +00:00
Simon Pilgrim af6c15f679 [X86][SSE] Add support for v4i8 add reduction
llvm-svn: 374579
2019-10-11 17:54:15 +00:00
Simon Pilgrim 6434eac860 [X86] isFNEG - add recursion depth limit
Now that its used by isNegatibleForFree we should try to avoid costly deep recursion

llvm-svn: 374534
2019-10-11 11:34:18 +00:00
Craig Topper ccc85ac855 [X86] Add a DAG combine to turn v16i16->v16i8 VTRUNCUS+store into a saturating truncating store.
llvm-svn: 374509
2019-10-11 04:16:49 +00:00
Craig Topper b560fd6c52 [X86] Improve the AVX512 bailout in combineTruncateWithSat to allow pack instructions in more situations.
If we don't have VLX we won't end up selecting a saturating
truncate for 256-bit or smaller vectors so we should just use
the pack lowering.

llvm-svn: 374487
2019-10-11 00:38:51 +00:00
Craig Topper 4ee7f8365f [X86] Guard against leaving a dangling node in combineTruncateWithSat.
When handling the packus pattern for i32->i8 we do a two step
process using a packss to i16 followed by a packus to i8. If the
final i8 step is a type with less than 64-bits the packus step
will return SDValue(), but the i32->i16 step might have succeeded.
This leaves the nodes from the middle step dangling.

Guard against this by pre-checking that the number of elements is
at least 8 before doing the middle step.

With that check in place this should mean the only other
case the middle step itself can fail is when SSE2 is disabled. So
add an early SSE2 check then just assert that neither the middle
or final step ever fail.

llvm-svn: 374460
2019-10-10 21:46:52 +00:00
Craig Topper 0e561437c5 [X86] Use packusdw+vpmovuswb to implement v16i32->V16i8 that clamps signed inputs to be between 0 and 255 when zmm registers are disabled on SKX.
If we've disable zmm registers, the v16i32 will need to be split. This split will propagate through min/max the truncate. This creates two sequences that need to be concatenated back to v16i8. We can instead use packusdw to do part of the clamping, truncating, and concatenating all at once. Then we can use a vpmovuswb to finish off the clamp.

Differential Revision: https://reviews.llvm.org/D68763

llvm-svn: 374431
2019-10-10 19:40:44 +00:00
Simon Pilgrim 6a38474f77 [X86] combineFMA - Convert to use isNegatibleForFree/GetNegatedExpression.
Split off from D67557.

llvm-svn: 374356
2019-10-10 14:14:12 +00:00
Simon Pilgrim f096443a98 [X86] combineFMADDSUB - Convert to use isNegatibleForFree/GetNegatedExpression.
Split off from D67557, fixes the compile time regression mentioned in rL372756

llvm-svn: 374351
2019-10-10 13:46:44 +00:00
Simon Pilgrim 08c2f530ec [DAG][X86] Add isNegatibleForFree/GetNegatedExpression override placeholders. NFCI.
Continuing to undo the rL372756 reversion.

Differential Revision: https://reviews.llvm.org/D67557

llvm-svn: 374345
2019-10-10 13:29:35 +00:00
Philip Reames 931120846e Conservatively add volatility and atomic checks in a few places
As background, starting in D66309, I'm working on support unordered atomics analogous to volatile flags on normal LoadSDNode/StoreSDNodes for X86.

As part of that, I spent some time going through usages of LoadSDNode and StoreSDNode looking for cases where we might have missed a volatility check or need an atomic check. I couldn't find any cases that clearly miscompile - i.e. no test cases - but a couple of pieces in code loop suspicious though I can't figure out how to exercise them.

This patch adds defensive checks and asserts in the places my manual audit found. If anyone has any ideas on how to either a) disprove any of the checks, or b) hit the bug they might be fixing, I welcome suggestions.

Differential Revision: https://reviews.llvm.org/D68419

llvm-svn: 374261
2019-10-09 23:43:33 +00:00
Craig Topper be7f81ece9 [X86] Shrink zero extends of gather indices from type less than i32 to types larger than i32.
Gather instructions can use i32 or i64 elements for indices. If
the index is zero extended from a type smaller than i32 to i64, we
can shrink the extend to just extend to i32.

llvm-svn: 373982
2019-10-07 23:03:12 +00:00
Reid Kleckner f9b67b810e [X86] Add new calling convention that guarantees tail call optimization
When the target option GuaranteedTailCallOpt is specified, calls with
the fastcc calling convention will be transformed into tail calls if
they are in tail position. This diff adds a new calling convention,
tailcc, currently supported only on X86, which behaves the same way as
fastcc, except that the GuaranteedTailCallOpt flag does not need to
enabled in order to enable tail call optimization.

Patch by Dwight Guth <dwight.guth@runtimeverification.com>!

Reviewed By: lebedev.ri, paquette, rnk

Differential Revision: https://reviews.llvm.org/D67855

llvm-svn: 373976
2019-10-07 22:28:58 +00:00
Simon Pilgrim 9c2e123043 [X86][SSE] getTargetShuffleInputs - move VT.isSimple/isVector checks inside. NFCI.
Stop all the callers from having to check the value type before calling getTargetShuffleInputs.

llvm-svn: 373915
2019-10-07 16:15:20 +00:00
Simon Pilgrim b4ba3cbda0 [X86][AVX] Access a scalar float/double as a free extract from a broadcast load (PR43217)
If a fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element.

This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted.

Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper.

Fixes PR43217

Differential Revision: https://reviews.llvm.org/D68544

llvm-svn: 373871
2019-10-06 21:11:45 +00:00
Simon Pilgrim d84cd7caa8 Fix signed/unsigned warning. NFCI
llvm-svn: 373870
2019-10-06 19:54:20 +00:00
Simon Pilgrim 739c9f0b79 [X86][SSE] Remove resolveTargetShuffleInputs and use getTargetShuffleInputs directly.
Move the resolveTargetShuffleInputsAndMask call to after the shuffle mask combine before the undef/zero constant fold instead.

llvm-svn: 373868
2019-10-06 19:07:00 +00:00
Simon Pilgrim 42010dc810 [X86][SSE] Don't merge known undef/zero elements into target shuffle masks.
Replaces setTargetShuffleZeroElements with getTargetShuffleAndZeroables which reports the Zeroable elements but doesn't merge them into the decoded target shuffle mask (the merging has been moved up into getTargetShuffleInputs until we can get rid of it entirely).

This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs but instead we cache them in a parallel Zeroable mask.

llvm-svn: 373867
2019-10-06 19:06:45 +00:00
Craig Topper 570ae49d03 [X86] Add custom type legalization for v16i64->v16i8 truncate and v8i64->v8i8 truncate when v8i64 isn't legal
Summary:
The default legalization for v16i64->v16i8 tries to create a multiple stage truncate concatenating after each stage and truncating again. But avx512 implements truncates with multiple uops. So it should be better to truncate all the way to the desired element size and then concatenate the pieces using unpckl instructions. This minimizes the number of 2 uop truncates. The unpcks are all single uop instructions.

I tried to handle this by just custom splitting the v16i64->v16i8 shuffle. And hoped that the DAG combiner would leave the two halves in the state needed to make D68374 do the job for each half. This worked for the first half, but the second half got messed up. So I've implemented custom handling for v8i64->v8i8 when v8i64 needs to be split to produce the VTRUNCs directly.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68428

llvm-svn: 373864
2019-10-06 18:43:08 +00:00
Simon Pilgrim 5c876303ec [X86][SSE] resolveTargetShuffleInputs - call getTargetShuffleInputs instead of using setTargetShuffleZeroElements directly. NFCI.
llvm-svn: 373855
2019-10-06 15:42:25 +00:00
Simon Pilgrim 2dee7e5561 [X86][AVX] combineExtractSubvector - merge duplicate variables. NFCI.
llvm-svn: 373849
2019-10-06 13:25:10 +00:00
Simon Pilgrim 032dd9b086 [X86][SSE] matchVectorShuffleAsBlend - use Zeroable element mask directly.
We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly.

This allows us to remove createTargetShuffleMask.

This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle masks isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask.

llvm-svn: 373846
2019-10-06 12:38:38 +00:00
David Zarzycki 7653ff398d [X86] Enable AVX512BW for memcmp()
llvm-svn: 373845
2019-10-06 10:25:52 +00:00
Simon Pilgrim 8815be04ec [X86][AVX] Push sign extensions of comparison bool results through bitops (PR42025)
As discussed on PR42025, with more complex boolean math we can end up with many truncations/extensions of the comparison results through each bitop.

This patch handles the cases introduced in combineBitcastvxi1 by pushing the sign extension through the AND/OR/XOR ops so its just the original SETCC ops that gets extended.

Differential Revision: https://reviews.llvm.org/D68226

llvm-svn: 373834
2019-10-05 20:49:34 +00:00
Simon Pilgrim 9ecacb0d54 [X86] lowerShuffleAsLanePermuteAndRepeatedMask - variable renames. NFCI.
Rename some variables to match lowerShuffleAsRepeatedMaskAndLanePermute - prep work toward adding some equivalent sublane functionality.

llvm-svn: 373832
2019-10-05 16:08:30 +00:00
Craig Topper 074fa390d2 [X86] Add DAG combine to form saturating VTRUNCUS/VTRUNCS from VTRUNC
We already do this for ISD::TRUNCATE, but we can do the same for X86ISD::VTRUNC

Differential Revision: https://reviews.llvm.org/D68432

llvm-svn: 373765
2019-10-04 17:53:18 +00:00
Craig Topper 185ee6ec7c [X86] Add v32i8 shuffle lowering strategy to recognize two v4i64 vectors truncated to v4i8 and concatenated into the lower 8 bytes with undef/zero upper bytes.
This patch recognizes the shuffle pattern we get from a
v8i64->v8i8 truncate when v8i64 isn't a legal type.

With VLX we can use two VTRUNCs, unpckldq, and a insert_subvector.

Diffrential Revision: https://reviews.llvm.org/D68374

llvm-svn: 373645
2019-10-03 18:34:42 +00:00
Simon Pilgrim eb8d85e5db [X86] matchShuffleWithSHUFPD - use Zeroable element mask directly. NFCI.
We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly.

This only leaves one user of createTargetShuffleMask which we can hopefully get rid of in a similar manner.

This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle masks isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask.

llvm-svn: 373641
2019-10-03 18:13:50 +00:00
Craig Topper eb420aa379 [X86] Add DAG combine to turn (bitcast (vbroadcast_load)) into just a vbroadcast_load if the scalar size is the same.
This improves broadcast load folding of i64 elements on 32-bit
targets where i64 isn't legal.

Previously we had to represent these as vXf64 vbroadcast_loads and
a bitcast to vXi64. But we didn't have any isel patterns
looking for that.

This also allows us to remove or simplify some isel patterns that
were looking for bitcasted vbroadcast_loads.

llvm-svn: 373566
2019-10-03 05:30:02 +00:00
Craig Topper 74c7d6be28 [X86] Rewrite to the vXi1 subvector insertion code to not rely on the value of bits that might be undef
The previous code tried to do a trick where we would extract the subvector from the location we were inserting. Then xor that with the new value. Take the xored value and clear out the bits above the subvector size. Then shift that xored subvector to the insert location. And finally xor that with the original vector. Since the old subvector was used in both xors, this would leave just the new subvector at the inserted location. Since the surrounding bits had been zeroed no other bits of the original vector would be modified.

Unfortunately, if the old subvector came from undef we might aggressively propagate the undef. Then we end up with the XORs not cancelling because they aren't using the same value for the two uses of the old subvector. @bkramer gave me a case that demonstrated this, but we haven't reduced it enough to make it easily readable to see what's happening.

This patch uses a safer, but more costly approach. It isolate the bits above the insertion and bits below the insert point and ORs those together leaving 0 for the insertion location. Then widens the subvector with 0s in the upper bits, shifts it into position with 0s in the lower bits. Then we do another OR.

Differential Revision: https://reviews.llvm.org/D68311

llvm-svn: 373495
2019-10-02 17:47:09 +00:00
Craig Topper 8c19925f42 [X86] Add a DAG combine to shrink vXi64 gather/scatter indices that are constant with sufficient sign bits to fit in vXi32
The gather/scatter instructions can implicitly sign extend the indices. If we're operating on 32-bit data, an v16i64 index can force a v16i32 gather to be split in two since the index needs 2 registers. If we can shrink the index to the i32 we can avoid the split. It should always be safe to shrink the index regardless of the number of elements. We have gather/scatter instructions that can use v2i32 index stored in a v4i32 register with v2i64 data size.

I've limited this to before legalize types to avoid creating a v2i32 after type legalization. We could check for it, but we'd also need testing. I'm also only handling build_vectors with no bitcasts to be sure the truncate will constant fold.

Differential Revision: https://reviews.llvm.org/D68247

llvm-svn: 373408
2019-10-01 23:18:31 +00:00
Craig Topper 105e82edde [X86] Add a VBROADCAST_LOAD ISD opcode representing a scalar load broadcasted to a vector.
Summary:
This adds the ISD opcode and a DAG combine to create it. There are
probably some places where we can directly create it, but I'll
leave that for future work.

This updates all of the isel patterns to look for this new node.
I had to add a few additional isel patterns for aligned extloads
which we should probably fix with a DAG combine or something. This
does mean that the broadcast load folding for avx512 can no
longer match a broadcasted aligned extload.

There's still some work to do here for combining a broadcast of
a broadcast_load. We also need to improve extractelement or
demanded vector elements of a broadcast_load. I'll try to get
those done before I submit this patch.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68198

llvm-svn: 373349
2019-10-01 16:28:20 +00:00
Matt Arsenault f24ac13aaa TLI: Remove DAG argument from getRegisterByName
Replace with the MachineFunction. X86 is the only user, and only uses
it for the function. This removes one obstacle from using this in
GlobalISel. The other is the more tolerable EVT argument.

The X86 use of the function seems questionable to me. It checks hasFP,
before frame lowering.

llvm-svn: 373292
2019-10-01 01:44:39 +00:00
Craig Topper 3405237f77 [X86] Mask off upper bits of splat element in LowerBUILD_VECTORvXi1 when forming a SELECT.
The i1 scalar would have been type legalized to i8, but that
doesn't guarantee anything about the upper bits. If we're going
to use it as condition we need to make sure the upper bits are 0.

I've special cased ISD::SETCC conditions since that should
guarantee zero upper bits. We could go further and use
computeKnownBits, but we have no tests that would need that.

Fixes PR43507.

llvm-svn: 373246
2019-09-30 18:43:44 +00:00
Craig Topper 8216414fd1 [X86] Address post-commit review from code I accidentally commited in r373136.
See https://reviews.llvm.org/D68167

llvm-svn: 373245
2019-09-30 18:43:27 +00:00
Craig Topper 299ebacfe9 [X86] Add ANY_EXTEND to switch in ReplaceNodeResults, but just fall back to default handling.
ANY_EXTEND of v8i8 is marked Custom on AVX512 for handling extends
from v8i8. But the type legalization infrastructure will call
ReplaceNodeResults for v8i8 results. We should just defer it the
default handling instead of asserting in the default of the switch.

Fixes PR43509.

llvm-svn: 373234
2019-09-30 17:14:22 +00:00
Craig Topper 1b0ea0a12e [X86] Split v16i32/v8i64 bitreverse on avx512f targets without avx512bw to enable the use of vpshufb on the 256-bit halves.
llvm-svn: 373177
2019-09-30 03:14:38 +00:00
Fangrui Song 6c320b22cd [X86] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds after r373174
llvm-svn: 373175
2019-09-30 02:06:23 +00:00
Craig Topper 1069c01924 [X86] Remove -x86-experimental-vector-widening-legalization command line flag
This was added back to allow some performance regressions to be
investigated. The main perf issue was fixed shortly after adding
this back and no other major issues have been reported. So I
think its safe to remove this again.

llvm-svn: 373174
2019-09-29 23:32:37 +00:00
Craig Topper 0ac4aacea8 [X86] Enable canonicalizeBitSelect for AVX512 since we can use VPTERNLOG now.
llvm-svn: 373155
2019-09-29 01:24:22 +00:00
Craig Topper 82a707e941 [X86] Stop using UpdateNodeOperands in combineGatherScatter. Create new nodes like most other DAG combines.
Creating new nodes is what we usually do. Have to explicitly
check that we don't update to an existing node and having
to manually manage the worklist is unusual.

We can probably add a helper function to reduce the duplication
of having to check if we should create a gather or scatter, but
I wanted to just get the simple thing done.

llvm-svn: 373137
2019-09-28 01:08:46 +00:00
Craig Topper 22984ebd0e [X86] Split combineGatherScatter into a version for generic ISD nodes and another version for X86 specific nodes.
The majority of the code doesn't run on the X86 nodes today since
its gated by isBeforeLegalizeOps and we don't formm X86 nodes
until after that. Except for a couple special case in type
legalization. But I think we would probably break those if
some of the transforms fire on them.

I want to remove the hardcoded operand numbers and the unusual
use of UpdateNodeOperands. Being able to know which ISD opcodes
are present should help with that.

llvm-svn: 373136
2019-09-28 01:06:58 +00:00
Craig Topper 750bdda638 [X86] Call SimplifyDemandedBits in combineGatherScatter any time the mask element is wider than i1, not just when AVX512 is disabled.
The AVX2 intrinsics can still be used when AVX512 is enabled and
those go through this path. So we should simplify them.

llvm-svn: 373108
2019-09-27 18:23:55 +00:00
Guillaume Chatelet 18f805a7ea [Alignment][NFC] Remove unneeded llvm:: scoping on Align types
llvm-svn: 373081
2019-09-27 12:54:21 +00:00
Ilya Biryukov 60e5e0b667 Revert r372333: [DAG][X86] Convert isNegatibleForFree/GetNegatedExpression to a target hook (PR42863)
Reason: this caused severe compile time regressions in JAX.
See email thread  of original revision on llvm-commits for details:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190923/697042.html

llvm-svn: 372756
2019-09-24 13:48:02 +00:00
Craig Topper e3c2163ffe [X86] Use TargetConstant for condition code on X86ISD::SETCC/CMOV/BRCOND nodes.
This removes the need for ConvertToTarget opcodes in the isel table.
It's also consistent with the recent changes to use TargetConstant
for intrinsic nodes that always take immediates.

Differential Revision: https://reviews.llvm.org/D67902

llvm-svn: 372645
2019-09-23 19:48:20 +00:00
Sanjay Patel 31b9dfe23f [x86] fix assert with horizontal math + broadcast of vector (PR43402)
https://bugs.llvm.org/show_bug.cgi?id=43402

llvm-svn: 372606
2019-09-23 13:30:23 +00:00
Craig Topper 5e26064c40 [X86] Remove SETEQ/SETNE canonicalization code from LowerIntVSETCC_AVX512 to prevent an infinite loop.
The attached test case would previous infinite loop after
r365711.

I'm going to move this to X86ISelDAGToDAG.cpp to get the setcc
to match VPTEST in 32-bit mode in a follow up commit.

llvm-svn: 372543
2019-09-23 05:35:20 +00:00
David Zarzycki a7a515cb77 Prefer AVX512 memcpy when applicable
When AVX512 is available and the preferred vector width is 512-bits or
more, we should prefer AVX512 for memcpy().

https://bugs.llvm.org/show_bug.cgi?id=43240

https://reviews.llvm.org/D67874

llvm-svn: 372540
2019-09-23 05:00:59 +00:00
Craig Topper da4a4707d2 [X86] Convert to Constant arguments to MMX shift by i32 intrinsics to TargetConstant during lowering.
This allows us to use timm in the isel table which is more
consistent with other intrinsics that take an immediate now.

We can't declare the intrinsic as taking an ImmArg because we
need to match non-constants to the shift by MMX register
instruction which we do by mutating the intrinsic id during
lowering.

llvm-svn: 372537
2019-09-23 01:21:51 +00:00
Craig Topper a533e87792 [X86][SelectionDAGBuilder] Move the hack for handling MMX shift by i32 intrinsics into the X86 backend.
This intrinsics should be shift by immediate, but gcc allows any
i32 scalar and clang needs to match that. So we try to detect the
non-constant case and move the data from an integer register to an
MMX register.

Previously this was done by creating a v2i32 build_vector and
bitcast in SelectionDAGBuilder. This had to be done early since
v2i32 isn't a legal type. The bitcast+build_vector would be DAG
combined to X86ISD::MMX_MOVW2D which isel will turn into a
GPR->MMX MOVD.

This commit just moves the whole thing to lowering and emits
the X86ISD::MMX_MOVW2D directly to avoid the illegal type. The
test changes just seem to be due to nodes being linearized in a
different order.

llvm-svn: 372535
2019-09-23 01:05:33 +00:00
Sterling Augustine 4a58936716 Fix missed case of switching getConstant to getTargetConstant. Try 2.
Summary: This fixes a crasher introduced by r372338.

Reviewers: echristo, arsenm

Subscribers: wdng, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67850

llvm-svn: 372434
2019-09-20 22:26:55 +00:00
Nico Weber 03475adcf7 Revert r372366 "Use getTargetConstant for BLENDI, and add a test to catch it."
This reverts commit 52621307bc.

Tests have been failing all night with

    [0/2] ACTION //llvm/test:check-llvm(//llvm/utils/gn/build/toolchain:unix)
    -- Testing: 33647 tests, 64 threads --
    Testing: 0 .. 10..
    UNRESOLVED: LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll (6943 of 33647)
    ******************** TEST 'LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll' FAILED ********************
    Test has no run line!
    ********************

Since there were other concerns on https://reviews.llvm.org/D67785,
I'm just reverting for now.

llvm-svn: 372383
2019-09-20 12:05:29 +00:00
Craig Topper 621c93ec1f [X86] Convert tbm_bextri_u32/tbm_bextri_u64 intrinsics TargetConstant argument to a regular Constant during lowering.
We reuse an ISD opcode here that can be reached from BMI that
doesn't require it to be an immediate. Our isel patterns to match
the TBM immediate form require a Constant and not a TargetConstant.

We were accidentally getting the Constant due to a quirk of
combineBEXTR calling SimplifyDemandedBits. The call to
SimplifyDemandedBits ended up constant folding the TargetConstant
to a regular Constant. But we should probably instead be asserting
if SimplifyDemandedBits on a TargetConstant so we shouldn't rely
on this behavior.

llvm-svn: 372373
2019-09-20 07:00:22 +00:00
Sterling Augustine 52621307bc Use getTargetConstant for BLENDI, and add a test to catch it.
Summary: This fixes a crasher introduced by r372338.

Reviewers: echristo, arsenm

Subscribers: jvesely, wdng, nhaehnle, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67785

Tighten up the test case.

llvm-svn: 372366
2019-09-20 02:29:16 +00:00
Craig Topper 081cb7ef23 [X86] Remove the special isBuildVectorOfConstantSDNodes handling from LowerBUILD_VECTORvXi1.
The later code that generates a constant when there are
some non-const elements works basically the same and doesn't
require there to be any non-const elements.

llvm-svn: 372365
2019-09-20 01:49:46 +00:00
Matt Arsenault 3ecab8e455 Reapply r372285 "GlobalISel: Don't materialize immarg arguments to intrinsics"
This reverts r372314, reapplying r372285 and the commits which depend
on it (r372286-r372293, and r372296-r372297)

This was missing one switch to getTargetConstant in an untested case.

llvm-svn: 372338
2019-09-19 16:26:14 +00:00
Simon Pilgrim af6043557d [DAG][X86] Convert isNegatibleForFree/GetNegatedExpression to a target hook (PR42863)
This patch converts the DAGCombine isNegatibleForFree/GetNegatedExpression into overridable TLI hooks and includes a demonstration X86 implementation.

The intention is to let us extend existing FNEG combines to work more generally with negatible float ops, allowing it work with target specific combines and opcodes (e.g. X86's FMA variants).

Unlike the SimplifyDemandedBits, we can't just handle target nodes through a Target callback, we need to do this as an override to allow targets to handle generic opcodes as well. This does mean that the target implementations has to duplicate some checks (recursion depth etc.).

I've only begun to replace X86's FNEG handling here, handling FMADDSUB/FMSUBADD negation and some low impact codegen changes (some FMA negatation propagation). We can build on this in future patches.

Differential Revision: https://reviews.llvm.org/D67557

llvm-svn: 372333
2019-09-19 15:02:47 +00:00
Hans Wennborg 13bdae8541 Revert r372285 "GlobalISel: Don't materialize immarg arguments to intrinsics"
This broke the Chromium build, causing it to fail with e.g.

  fatal error: error in backend: Cannot select: t362: v4i32 = X86ISD::VSHLI t392, Constant:i8<15>

See llvm-commits thread of r372285 for details.

This also reverts r372286, r372287, r372288, r372289, r372290, r372291,
r372292, r372293, r372296, and r372297, which seemed to depend on the
main commit.

> Encode them directly as an imm argument to G_INTRINSIC*.
>
> Since now intrinsics can now define what parameters are required to be
> immediates, avoid using registers for them. Intrinsics could
> potentially want a constant that isn't a legal register type. Also,
> since G_CONSTANT is subject to CSE and legalization, transforms could
> potentially obscure the value (and create extra work for the
> selector). The register bank of a G_CONSTANT is also meaningful, so
> this could throw off future folding and legalization logic for AMDGPU.
>
> This will be much more convenient to work with than needing to call
> getConstantVRegVal and checking if it may have failed for every
> constant intrinsic parameter. AMDGPU has quite a lot of intrinsics wth
> immarg operands, many of which need inspection during lowering. Having
> to find the value in a register is going to add a lot of boilerplate
> and waste compile time.
>
> SelectionDAG has always provided TargetConstant for constants which
> should not be legalized or materialized in a register. The distinction
> between Constant and TargetConstant was somewhat fuzzy, and there was
> no automatic way to force usage of TargetConstant for certain
> intrinsic parameters. They were both ultimately ConstantSDNode, and it
> was inconsistently used. It was quite easy to mis-select an
> instruction requiring an immediate. For SelectionDAG, start emitting
> TargetConstant for these arguments, and using timm to match them.
>
> Most of the work here is to cleanup target handling of constants. Some
> targets process intrinsics through intermediate custom nodes, which
> need to preserve TargetConstant usage to match the intrinsic
> expectation. Pattern inputs now need to distinguish whether a constant
> is merely compatible with an operand or whether it is mandatory.
>
> The GlobalISelEmitter needs to treat timm as a special case of a leaf
> node, simlar to MachineBasicBlock operands. This should also enable
> handling of patterns for some G_* instructions with immediates, like
> G_FENCE or G_EXTRACT.
>
> This does include a workaround for a crash in GlobalISelEmitter when
> ARM tries to uses "imm" in an output with a "timm" pattern source.

llvm-svn: 372314
2019-09-19 12:33:07 +00:00
Craig Topper c2d25ed1b3 [X86] Prevent crash in LowerBUILD_VECTORvXi1 for v64i1 vectors on 32-bit targets when the vector is a mix of constants and non-constant.
We need to materialize the constants as two 32-bit values that
are casted to v32i1 and then concatenated.

llvm-svn: 372304
2019-09-19 06:50:39 +00:00
Craig Topper d103bb654f [X86] Change a SmallVector& argument to SmallVectorImpl&. NFC
Avoids repeating the size.

llvm-svn: 372302
2019-09-19 06:27:12 +00:00
Craig Topper eff4fd6999 [X86] Remove unused argument from a helper function. NFC
llvm-svn: 372301
2019-09-19 06:27:07 +00:00
Matt Arsenault d8399d12cd GlobalISel: Don't materialize immarg arguments to intrinsics
Encode them directly as an imm argument to G_INTRINSIC*.

Since now intrinsics can now define what parameters are required to be
immediates, avoid using registers for them. Intrinsics could
potentially want a constant that isn't a legal register type. Also,
since G_CONSTANT is subject to CSE and legalization, transforms could
potentially obscure the value (and create extra work for the
selector). The register bank of a G_CONSTANT is also meaningful, so
this could throw off future folding and legalization logic for AMDGPU.

This will be much more convenient to work with than needing to call
getConstantVRegVal and checking if it may have failed for every
constant intrinsic parameter. AMDGPU has quite a lot of intrinsics wth
immarg operands, many of which need inspection during lowering. Having
to find the value in a register is going to add a lot of boilerplate
and waste compile time.

SelectionDAG has always provided TargetConstant for constants which
should not be legalized or materialized in a register. The distinction
between Constant and TargetConstant was somewhat fuzzy, and there was
no automatic way to force usage of TargetConstant for certain
intrinsic parameters. They were both ultimately ConstantSDNode, and it
was inconsistently used. It was quite easy to mis-select an
instruction requiring an immediate. For SelectionDAG, start emitting
TargetConstant for these arguments, and using timm to match them.

Most of the work here is to cleanup target handling of constants. Some
targets process intrinsics through intermediate custom nodes, which
need to preserve TargetConstant usage to match the intrinsic
expectation. Pattern inputs now need to distinguish whether a constant
is merely compatible with an operand or whether it is mandatory.

The GlobalISelEmitter needs to treat timm as a special case of a leaf
node, simlar to MachineBasicBlock operands. This should also enable
handling of patterns for some G_* instructions with immediates, like
G_FENCE or G_EXTRACT.

This does include a workaround for a crash in GlobalISelEmitter when
ARM tries to uses "imm" in an output with a "timm" pattern source.

llvm-svn: 372285
2019-09-19 01:33:14 +00:00
Guillaume Chatelet 35b4b403b4 [Alignment][NFC] Use Align::None instead of 1
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: sdardis, nemanjai, hiraditya, kbarton, jrtc27, MaskRay, atanasyan, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67704

llvm-svn: 372230
2019-09-18 15:40:20 +00:00
Craig Topper 93e1f73b6b [X86] Break non-power of 2 vXi1 vectors into scalars for argument passing with avx512.
This generates worse code, but matches what is done for avx2 and
prevents crashes when more arguments are passed than we have
registers for.

llvm-svn: 372200
2019-09-18 06:06:11 +00:00
Craig Topper 4a07336a88 [X86] Prevent assertion when calling a function that returns double with -mno-sse2 on x86-64.
As seen in the most recent updates to PR10498

llvm-svn: 372197
2019-09-18 01:57:46 +00:00
Craig Topper c198ffd8c3 [X86] Use APInt::operator<<= and APInt::lshrInPlace. NFC
llvm-svn: 372159
2019-09-17 18:19:06 +00:00
Craig Topper f9a89b6788 [X86] Simplify b2b KSHIFTL+KSHIFTR using demanded elts.
llvm-svn: 372155
2019-09-17 18:02:56 +00:00
Craig Topper f1ba94ade0 [X86] Call SimplifyDemandedVectorElts on KSHIFTL/KSHIFTR nodes during DAG combine.
llvm-svn: 372154
2019-09-17 18:02:52 +00:00
Craig Topper b50894b9c3 [X86] Simplify some code in LowerBUILD_VECTORvXi1. NFCI
The case were Immediate is 0 and HasConstElts is true should never
happen since that would mean the constant elts were all zero. But
we check for all zero build vector earlier. So just use HasConstElts
and blindly take Immediate without checking if its 0.

Move the code that bitcasts and extract the immediate into the
the HasConstElts case since the other code just creates an undef
with the right type. No casting needed.

llvm-svn: 372153
2019-09-17 18:02:46 +00:00
Simon Pilgrim 0b10da7cc7 [X86] Use APInt::getLowBitsSet helper. NFCI.
Also avoids a static analyzer warning about out of range shifts.

llvm-svn: 372103
2019-09-17 10:51:30 +00:00
Graham Hunter 1a9195d817 [SVE][MVT] Fixed-length vector MVT ranges
* Reordered MVT simple types to group scalable vector types
    together.
  * New range functions in MachineValueType.h to only iterate over
    the fixed-length int/fp vector types.
  * Stopped backends which don't support scalable vector types from
    iterating over scalable types.

Reviewers: sdesmalen, greened

Reviewed By: greened

Differential Revision: https://reviews.llvm.org/D66339

llvm-svn: 372099
2019-09-17 10:19:23 +00:00
Craig Topper 95aea74494 [X86] Split oversized vXi1 vector arguments and return values into scalars on avx512 targets.
Previously we tried to split them into narrower v64i1 or v16i1
pieces that each got promoted to vXi8 and then passed in a zmm
or xmm register. But this crashes when you need to pass more
pieces than available registers reserved for argument passing.

The scalarizing done here generates much longer and slower code,
but is consistent with the behavior of avx2 and earlier targets
for these types.

Fixes PR43323.

llvm-svn: 372069
2019-09-17 04:41:14 +00:00
Simon Pilgrim 3df0daddfd [X86][AVX] matchShuffleWithSHUFPD - add support for zeroable operands
Determine if all of the uses of LHS/RHS operands can be replaced with a zero vector.

llvm-svn: 372013
2019-09-16 17:30:33 +00:00
Craig Topper 8e0f104916 [X86] Use incDecVectorConstant to simplify the min/max code in LowerVSETCC.
incDecVectorConstant is used for a similar reason in LowerVSETCCWithSUBUS
so we might as well share the code.

llvm-svn: 371861
2019-09-13 14:59:08 +00:00
Simon Pilgrim 930ebc15a6 [X86] negateFMAOpcode - extend to support FMADDSUB/FMSUBADD and output negation. NFCI.
Some prep work for PR42863, this change allows us to move all the FMA opcode mappings into the negateFMAOpcode helper.

For the FMADDSUB/FMSUBADD cases, we can only negate the accumulator - any other negations will result in an error.

llvm-svn: 371840
2019-09-13 11:22:40 +00:00
Craig Topper efe6724b9f [DAGCombiner][X86] Pass the CmpOpVT to reduceSelectOfFPConstantLoads so X86 can exclude fp128 compares.
The X86 decision assumes the compare will produce a result in an XMM
register, but that can't happen for an fp128 compare since those
go to a libcall the returns an i32. Pass the VT so X86 can check
the type.

llvm-svn: 371775
2019-09-12 21:30:18 +00:00
Simon Pilgrim d67661ee24 [X86] Move negateFMAOpcode helper earlier to help future patch. NFCI.
llvm-svn: 371770
2019-09-12 20:39:56 +00:00
Reid Kleckner ff45955fc8 [X86] Fix latent bugs in 32-bit CMPXCHG8B inserter
I found three issues:
1. the loop over E[ABCD]X copies run over BB start
2. the direct address of cmpxchg8b could be a frame index
3. the displacement of cmpxchg8b could be a global instead of an
   immediate

These were all introduced together in r287875, and should be fixed with
this change.

Issue reported by Zachary Turner.

llvm-svn: 371678
2019-09-11 21:56:17 +00:00
Craig Topper 08474ca091 [X86] Move x86_64 fp128 conversion to libcalls from type legalization to DAG legalization
fp128 is considered a legal type for a register, but has almost no legal operations so everything needs to be converted to a libcall. Previously this was implemented by tricking type legalization into softening the operations with various checks for "is legal in hardware register" to change the behavior to still use f128 as the resulting type instead of converting to i128.

This patch abandons this approach and instead moves the libcall conversions to LegalizeDAG. This is the approach taken by AArch64 where they also have a legal fp128 type, but no legal operations. I think this is more in spirit with how SelectionDAG's phases are supposed to work.

I had to make some hacks for STRICT_FP_ROUND because some of the strict FP handling checks if ISD::FP_ROUND is Legal for a given result type, but I had to make ISD::FP_ROUND Custom to allow making a libcall when the input is f128. For all other types the Custom handler just returns the original node. These hacks are incomplete and don't work for a strict truncate from f128, but I don't think it worked before either since LegalizeFloatTypes doesn't know about strict ops yet. I've also raised PR43209 against AArch64 which currently crashes on a strict ftrunc from f64->f32 because of FP_ROUND being marked Custom for the same reason there.

Differential Revision: https://reviews.llvm.org/D67128

llvm-svn: 371672
2019-09-11 21:30:09 +00:00
Philip Reames a9beacbac8 [X86] Updated target specific selection dag code to conservatively check for isAtomic in addition to isVolatile
See D66309 for context.

This is the first sweep of x86 target specific code to add isAtomic bailouts where appropriate. The intention here is to have the switch from AtomicSDNode to LoadSDNode/StoreSDNode be close to NFC; that is, I'm not looking to allow additional optimizations at this time.

Sorry for the lack of tests.  As discussed in the review, most of these are vector tests (for which atomicity is not well defined) and I couldn't figure out to exercise the anyextend cases which aren't vector specific.

Differential Revision: https://reviews.llvm.org/D66322

llvm-svn: 371547
2019-09-10 18:43:15 +00:00
Philip Reames 20aafa3156 Introduce infrastructure for an incremental port of SelectionDAG atomic load/store handling
This is the first patch in a large sequence. The eventual goal is to have unordered atomic loads and stores - and possibly ordered atomics as well - handled through the normal ISEL codepaths for loads and stores. Today, there handled w/instances of AtomicSDNodes. The result of which is that all transforms need to be duplicated to work for unordered atomics. The benefit of the current design is that it's harder to introduce a silent miscompile by adding an transform which forgets about atomicity.  See the thread on llvm-dev titled "FYI: proposed changes to atomic load/store in SelectionDAG" for further context.

Note that this patch is NFC unless the experimental flag is set.

The basic strategy I plan on taking is:

    introduce infrastructure and a flag for testing (this patch)
    Audit uses of isVolatile, and apply isAtomic conservatively*
    piecemeal conservative* update generic code and x86 backedge code in individual reviews w/tests for cases which didn't check volatile, but can be found with inspection
    flip the flag at the end (with minimal diffs)
    Work through todo list identified in (2) and (3) exposing performance ops

(*) The "conservative" bit here is aimed at minimizing the number of diffs involved in (4). Ideally, there'd be none. In practice, getting it down to something reviewable by a human is the actual goal. Note that there are (currently) no paths which produce LoadSDNode or StoreSDNode with atomic MMOs, so we don't need to worry about preserving any behaviour there.

We've taken a very similar strategy twice before with success - once at IR level, and once at the MI level (post ISEL). 

Differential Revision: https://reviews.llvm.org/D66309

llvm-svn: 371441
2019-09-09 19:23:22 +00:00
Craig Topper 5ebd0a6e88 [SelectionDAG] Remove ISD::FP_ROUND_INREG
I don't think anything in tree creates this node. So all of this
code appears to be dead.

Code coverage agrees
http://lab.llvm.org:8080/coverage/coverage-reports/llvm/coverage/Users/buildslave/jenkins/workspace/clang-stage2-coverage-R/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp.html

Differential Revision: https://reviews.llvm.org/D67312

llvm-svn: 371431
2019-09-09 17:54:44 +00:00
Craig Topper ce2cb0f09e [X86] Allow _MM_FROUND_CUR_DIRECTION and _MM_FROUND_NO_EXC to be used together on instructions that only support SAE and not embedded rounding.
Current for SAE instructions we only allow _MM_FROUND_CUR_DIRECTION(bit 2) or _MM_FROUND_NO_EXC(bit 3) to be used as the immediate passed to the inrinsics. But these instructions don't perform rounding so _MM_FROUND_CUR_DIRECTION is just sort of a default placeholder when you don't want to suppress exceptions. Using _MM_FROUND_NO_EXC by itself is really bit equivalent to (_MM_FROUND_NO_EXC | _MM_FROUND_TO_NEAREST_INT) since _MM_FROUND_TO_NEAREST_INT is 0. Since we aren't rounding on these instructions we should also accept (_MM_FROUND_CUR_DIRECTION | _MM_FROUND_NO_EXC) as equivalent to (_MM_FROUND_NO_EXC). icc allows this, but gcc does not.

Differential Revision: https://reviews.llvm.org/D67289

llvm-svn: 371430
2019-09-09 17:48:05 +00:00
Craig Topper 72624b0e59 [X86] Use xorps to create fp128 +0.0 constants.
This matches what we do for f32/f64. gcc also does this for fp128.

llvm-svn: 371357
2019-09-09 01:35:00 +00:00
Simon Pilgrim e0ea746215 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add faux shuffle support.
This patch decodes target and faux shuffles with getTargetShuffleInputs - a reduced version of resolveTargetShuffleInputs that doesn't resolve SM_SentinelZero cases, so we can correctly remove zero vectors if they aren't demanded.

llvm-svn: 371353
2019-09-08 21:38:33 +00:00
Craig Topper 77dd86ee4a [X86] Add a hack to combineVSelectWithAllOnesOrZeros to turn selects with two zero/undef vector inputs into an all zeroes vector.
If the two zero vectors have undefs in different places they
won't get combined by simplifySelect.

This fixes a regression from an earlier commit.

llvm-svn: 371351
2019-09-08 20:56:09 +00:00
Craig Topper 9c11901256 [X86] Remove call to getZeroVector from materializeVectorConstant. Add isel patterns for zero vectors with all types.
The change to avx512-vec-cmp.ll is a regression, but should be
easy to fix. It occurs because the getZeroVector call was
canonicalizing both sides to the same node, then SimplifySelect
was able to simplify it. But since only called getZeroVector
on some VTs this isn't a robust way to combine this.

The change to vector-shuffle-combining-ssse3.ll is more
instructions, but removes a constant pool load so its unclear
if its a regression or not.

llvm-svn: 371350
2019-09-08 20:56:05 +00:00
Craig Topper 97d41b8917 [X86] Use DAG.getConstant instead of getZeroVector in combinePMULDQ.
getZeroVector canonicalizes the type to vXi32, but that's a
legalization action. We should use the most correct type if
possible.

llvm-svn: 371345
2019-09-08 19:24:42 +00:00
Craig Topper 30837abd96 [X86] Teach materializeVectorConstant to not call getZeroVector/getOnesVector on the types we already have isel patterns for.
llvm-svn: 371343
2019-09-08 19:24:29 +00:00
Simon Pilgrim 178cd2cd3a [X86][SSE] Fix out of range shift introduced in D67070/rL371328
Use APInt to create the comparison mask instead.

llvm-svn: 371330
2019-09-08 12:44:22 +00:00
Simon Pilgrim 3262084384 [X86][SSE] Add support for <64 x i1> bool reduction
This generalizes the existing <32 x i1> pre-AVX2 split code to support reductions from <64 x i1> as well, we can probably generalize to any larger pow2 case in the future if the (unlikely) need ever arises.

We still need to tweak combineBitcastvxi1 to improve AVX512F codegen as its assumes vXi1 types should be handled on the mask registers even when they aren't legal.

Differential Revision: https://reviews.llvm.org/D67070

llvm-svn: 371328
2019-09-08 11:46:21 +00:00
Craig Topper 37dd59298f [X86] Make getZeroVector return floating point vectors in their native type on SSE2 and later.
isel used to require zero vectors to be canonicalized to a single
type to minimize the number of patterns needed to match. This is
 no longer required.

I plan to do this to integers too, but floating point was simpler
to start with. Integer has a complication where v32i16/v64i8 aren't
legal when the other 512-bit integer types are.

llvm-svn: 371325
2019-09-08 00:43:52 +00:00
Simon Pilgrim 08692e5dd1 [X86] Avoid uses of getZextValue(). NFCI.
Use getAPIntValue() directly - this is mainly a best practice style issue to help prevent fuzz tests blowing up when a i12345 (or whatever) is generated.

Use getConstantOperandVal/getConstantOperandAPInt wrappers where possible.

llvm-svn: 371315
2019-09-07 16:13:57 +00:00
Nikita Popov 314893cc4b [X86] Fix pshuflw formation from repeated shuffle mask (PR43230)
Fix for https://bugs.llvm.org/show_bug.cgi?id=43230.

When creating PSHUFLW from a repeated shuffle mask, we have to apply
the checks to the repeated mask, not the original one. For the test
case from PR43230 the inspected part of the original mask is all undef.

Differential Revision: https://reviews.llvm.org/D67314

llvm-svn: 371307
2019-09-07 12:13:44 +00:00
Simon Pilgrim d7d8bb937a Fix MSVC "32-bit shift implicitly converted to 64 bits" warnings. NFCI.
llvm-svn: 371302
2019-09-07 11:04:04 +00:00
Guillaume Chatelet ad1cea0dda [Alignment][NFC] Use Align with TargetLowering::setPrefFunctionAlignment
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: nemanjai, javed.absar, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, s.egerton, pzheng, ychen, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67267

llvm-svn: 371212
2019-09-06 15:03:49 +00:00
Guillaume Chatelet 9fcf066d0c [Alignment][NFC] Use Align with TargetLowering::setPrefLoopAlignment
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: nemanjai, hiraditya, kbarton, MaskRay, jsji, ychen, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67278

llvm-svn: 371210
2019-09-06 14:51:15 +00:00
Craig Topper 0fde412140 [X86] Enable BuildSDIVPow2 for i16.
We're able to use a 32-bit ADD and CMOV here and should work
well with our other i16->i32 promotion optimizations.

llvm-svn: 371107
2019-09-05 18:49:52 +00:00
Craig Topper b8d6ba3ca2 [X86] Override BuildSDIVPow2 for X86.
As noted in PR43197, we can use test+add+cmov+sra to implement
signed division by a power of 2.

This is based off the similar version in AArch64, but I've
adjusted it to use target independent nodes where AArch64 uses
target specific CMP and CSEL nodes. I've also blocked INT_MIN
as the transform isn't valid for that.

I've limited this to i32 and i64 on 64-bit targets for now and only
when CMOV is supported. i8 and i16 need further investigation to be
sure they get promoted to i32 well.

I adjusted a few tests to enable cmov to demonstrate the new
codegen. I also changed twoaddr-coalesce-3.ll to 32-bit mode
without cmov to avoid perturbing the scenario that is being
set up there.

Differential Revision: https://reviews.llvm.org/D67087

llvm-svn: 371104
2019-09-05 18:15:07 +00:00
Sanjay Patel 10412a69f9 [x86] fix horizontal math bug exposed by improved demanded elements analysis (PR43225)
https://bugs.llvm.org/show_bug.cgi?id=43225

llvm-svn: 371095
2019-09-05 17:28:17 +00:00
Craig Topper a5508163ad [X86] Fix stale comment. NFC
We aren't checking for a concat here. We're just always splitting
256-bit stores.

llvm-svn: 371092
2019-09-05 17:24:15 +00:00
Simon Pilgrim 29361c704d [X86][SSE] EltsFromConsecutiveLoads - ignore non-zero offset base loads (PR43227)
As discussed on D64551 and PR43227, we don't correctly handle cases where the base load has a non-zero byte offset.

Until we can properly handle this, we must bail from EltsFromConsecutiveLoads.

llvm-svn: 371078
2019-09-05 15:07:07 +00:00
Guillaume Chatelet aff45e4b23 [LLVM][Alignment] Make functions using log of alignment explicit
Summary:
This patch renames functions that takes or returns alignment as log2, this patch will help with the transition to llvm::Align.
The renaming makes it explicit that we deal with log(alignment) instead of a power of two alignment.
A few renames uncovered dubious assignments:

 - `MirParser`/`MirPrinter` was expecting powers of two but `MachineFunction` and `MachineBasicBlock` were using deal with log2(align). This patch fixes it and updates the documentation.
 - `MachineBlockPlacement` exposes two flags (`align-all-blocks` and `align-all-nofallthru-blocks`) supposedly interpreted as power of two alignments, internally these values are interpreted as log2(align). This patch updates the documentation,
 - `MachineFunctionexposes` exposes `align-all-functions` also interpreted as power of two alignment, internally this value is interpreted as log2(align). This patch updates the documentation,

Reviewers: lattner, thegameg, courbet

Subscribers: dschuff, arsenm, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, Jim, s.egerton, llvm-commits, courbet

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65945

llvm-svn: 371045
2019-09-05 10:00:22 +00:00
Reid Kleckner 3fa07dee94 Revert [Windows] Disable TrapUnreachable for Win64, add SEH_NoReturn
This reverts r370525 (git commit 0bb1630685)
Also reverts r370543 (git commit 185ddc08ee)

The approach I took only works for functions marked `noreturn`. In
general, a call that is not known to be noreturn may be followed by
unreachable for other reasons. For example, there could be multiple call
sites to a function that throws sometimes, and at some call sites, it is
known to always throw, so it is followed by unreachable. We need to
insert an `int3` in these cases to pacify the Windows unwinder.

I think this probably deserves its own standalone, Win64-only fixup pass
that runs after block placement. Implementing that will take some time,
so let's revert to TrapUnreachable in the mean time.

llvm-svn: 370829
2019-09-03 22:27:27 +00:00
Simon Pilgrim 99525bbe49 [X86] Merge 2 consecutive HasInt256 branches. NFCI.
llvm-svn: 370761
2019-09-03 14:39:06 +00:00
Craig Topper b915109043 [X86] Simplify the setOperationAction handling for fp_to_uint by improving the Custom handler a bit.
This merges the 32-bit and 64-bit mode code to just use Custom
for both i32 and i64. We already had most of the handling in
the custom handling due to the AVX512 having legal fp_to_uint.
Just needed to add the i32->i64 promotion handling. Refactor
the fp_to_uint code in the custom handler to simplify the
number of times we check things.

Tweak cost model tables to match the default handling we were
getting due to Expand before.

llvm-svn: 370700
2019-09-03 05:57:22 +00:00
Craig Topper 9dc8c448ed [X86] Don't use Expand for i32 fp_to_uint on SSE1/2 targets on 32-bit target.
Use Custom lowering instead. Fall back to default expansion only
when the scalar FP type belongs in an XMM register. This improves
lowering for i32 to fp80, and also i32 to double on SSE1 only.

llvm-svn: 370699
2019-09-03 05:57:18 +00:00
Craig Topper dcecc7ea46 [X86] Custom promote i32->f80 uint_to_fp on AVX512 64-bit targets.
Reuse the same code to promote all i32 uint_to_fp on 64-bit targets
to simplify the X86ISelLowering constructor.

llvm-svn: 370693
2019-09-03 02:51:10 +00:00
Craig Topper 45cd185109 [X86] Enable fp128 as a legal type with SSE1 rather than with MMX.
FP128 values are passed in xmm registers so should be asssociated
with an SSE feature rather than MMX which uses a different set
of registers.

llc enables sse1 and sse2 by default with x86_64. But does not
enable mmx. Clang enables all 3 features by default.

I've tried to add command lines to test with -sse
where possible, but any test that returns a value in an xmm
register fails with a fatal error with -sse since we have no
defined ABI for that scenario.

llvm-svn: 370682
2019-09-02 20:16:30 +00:00
Simon Pilgrim fb5661a884 [X86] getPMOVMSKB - add MVT::v64i8 handling and remove from combineBitcastvxi1. NFCI.
llvm-svn: 370670
2019-09-02 15:10:35 +00:00
Simon Pilgrim 05a3a92751 [X86] combineHorizontalPredicateResult - pull out repeated getTargetLoweringInfo() calls. NFCI.
llvm-svn: 370637
2019-09-02 10:42:48 +00:00
Simon Pilgrim 07de5292e5 [X86][AVX] Rename + cleanup lowerShuffleAsLanePermuteAndBlend. NFCI.
Rename to lowerShuffleAsLanePermuteAndShuffle to make it clear that not just blends are performed.

Cleanup the in-lane shuffle mask generation to make it more obvious what's going on.

Some prep work noticed while investigating the poor shuffle code mentioned in D66004.

llvm-svn: 370613
2019-09-01 16:04:28 +00:00
Simon Pilgrim 27cc2efaf2 Fix shadow variable warning. NFCI.
llvm-svn: 370610
2019-09-01 13:10:18 +00:00
Simon Pilgrim f8d1d00190 [X86] EltsFromConsecutiveLoads - Don't confuse elt count with vector element count (PR43170)
EltsFromConsecutiveLoads was assuming that the number of input elts was the same as the number of elements in the output vector type when creating a zeroing shuffle, causing an assert when subvectors were being combined instead of just scalars.

llvm-svn: 370592
2019-08-31 16:21:31 +00:00
Simon Pilgrim cffbec63d6 Fix shadow variable warning by making CondCodes names more explicit. NFCI.
llvm-svn: 370589
2019-08-31 15:19:59 +00:00
Simon Pilgrim ad020c0af1 Fix shadow variable warning. NFCI.
llvm-svn: 370585
2019-08-31 15:01:03 +00:00
Simon Pilgrim 2d89007f61 [X86ISelLowering] combineCMov - cleanup CMOV->LEA codegen. NFCI.
Only compute the diff once and we don't need the truncation code (assert the bitwidth is correct just to be safe).

llvm-svn: 370583
2019-08-31 14:18:26 +00:00
Simon Pilgrim 7238353da2 [X86ISelLowering] LowerSELECT - remove duplicate value type. NFCI.
VT of SELECT result and selection ops will be the same.

llvm-svn: 370581
2019-08-31 13:14:52 +00:00
Reid Kleckner 0bb1630685 [Windows] Disable TrapUnreachable for Win64, add SEH_NoReturn
Users have complained llvm.trap produce two ud2 instructions on Win64,
one for the trap, and one for unreachable. This change fixes that.

TrapUnreachable was added and enabled for Win64 in r206684 (April 2014)
to avoid poorly understood issues with the Windows unwinder.

There seem to be two major things in play:
- the unwinder
- C++ EH, _CxxFrameHandler3 & co

The unwinder disassembles forward from the return address to scan for
epilogues. Inserting a ud2 had the effect of stopping the unwinder, and
ensuring that it ran the EH personality function for the current frame.
However, it's not clear what the unwinder does when the return address
happens to be the last address of one function and the first address of
the next function.

The Visual C++ EH personality, _CxxFrameHandler3, needs to figure out
what the current EH state number is. It does this by consulting the
ip2state table, which maps from PC to state number. This seems to go
wrong when the return address is the last PC of the function or catch
funclet.

I'm not sure precisely which system is involved here, but in order to
address these real or hypothetical problems, I believe it is enough to
insert int3 after a call site if it would otherwise be the last
instruction in a function or funclet.  I was able to reproduce some
similar problems locally by arranging for a noreturn call to appear at
the end of a catch block immediately before an unrelated function, and I
confirmed that the problems go away when an extra trailing int3
instruction is added.

MSVC inserts int3 after every noreturn function call, but I believe it's
only necessary to do it if the call would be the last instruction. This
change inserts a pseudo instruction that expands to int3 if it is in the
last basic block of a function or funclet. I did what I could to run the
Microsoft compiler EH tests, and the ones I was able to run showed no
behavior difference before or after this change.

Differential Revision: https://reviews.llvm.org/D66980

llvm-svn: 370525
2019-08-30 20:46:39 +00:00
Craig Topper 18e8d02e8c [X86] Pass v32i16/v64i8 in zmm registers on KNL target.
gcc and icc pass these types in zmm registers in zmm registers.

This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to 256-bit
integer registers on AVX1 where we just split all the operations.

Fixes PR42957

Differential Revision: https://reviews.llvm.org/D66708

llvm-svn: 370495
2019-08-30 17:35:08 +00:00
Simon Pilgrim 3d705a1fa4 [X86][SSE] combinePMULDQ - pmuldq(x, 0) -> zero vector (PR43159)
ISD::isBuildVectorAllZeros permits undef elements to be present, which means we can't return it as a zero vector. PMULDQ/PMULUDQ is an extending multiply so a multiply by zero of the lower 32-bits should result in a zero 64-bit element.

llvm-svn: 370404
2019-08-29 20:22:08 +00:00
Roman Lebedev cc7495a355 [X86][CodeGen][NFC] Delay `combineIncDecVector()` from DAGCombine to X86DAGToDAGISel
Summary:
We were previously doing it in DAGCombine.
But we also want to do `sub %x, C` -> `add %x, (sub 0, C)` for vectors in DAGCombine.
So if we had `sub %x, -1`, we'll transform it to `add %x, 1`,
which `combineIncDecVector()` will immediately transform back into `sub %x, -1`,
and here we go again...

I've marked this as NFC since not a single test changes,
but since that 'changes' DAGCombine, probably this isn't fully NFC.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: craig.topper

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62327

llvm-svn: 370327
2019-08-29 10:50:09 +00:00
Craig Topper 1ec5c204b8 [X86] Add a DAG combine to combine INSERTPS and VBROADCAST of a scalar load. Remove corresponding isel patterns.
We had an isel pattern to perform this, but its better to
do it in DAG combine as a simplification. This also fixes the lack
of patterns for AVX512 targets.

llvm-svn: 370294
2019-08-29 05:48:48 +00:00
Craig Topper 1aadf6f39f [X86] Make inline assembly 'x' and 'v' constraints work for f128.
Including a type legalizer fix to make bitcast operand promotion
work correctly when getSoftenedFloat returns f128 instead of i128.

Fixes PR43157

llvm-svn: 370293
2019-08-29 05:13:56 +00:00
Hans Wennborg cff90f07cb [SelectionDAG] Don't generate libcalls for wide shifts on Windows (PR42711)
Neither libgcc or compiler-rt are usually used on Windows, so these
functions can't be called.

Differential revision: https://reviews.llvm.org/D66880

llvm-svn: 370204
2019-08-28 13:55:10 +00:00
Simon Pilgrim 8912e2af39 [X86][AVX] Add SimplifyDemandedVectorElts support for KSHIFTL/KSHIFTR
Differential Revision: https://reviews.llvm.org/D66527

llvm-svn: 370055
2019-08-27 13:13:17 +00:00
Craig Topper 6db7f492d9 [X86] Delay combineIncDecVector until after op legalization.
Probably better to keep add over sub in early DAG combines.

It might make sense to push this to lowering or delay it all
the way to isel. But this was the simplest change.

llvm-svn: 369981
2019-08-26 22:17:54 +00:00
Craig Topper 36d1588f01 [X86] Add a hack to combinePMULDQ to manually turn SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG inputs into an ANY_EXTEND_VECTOR_INREG style shuffle
ANY_EXTEND_VECTOR_INREG isn't currently marked Legal which prevents SimplifyDemandedBits from turning SIGN/ZERO_EXTEND_VECTOR_INREG into it after op legalization. And even if we did make it Legal, combineExtInVec doesn't do shuffle combining on the VECTOR_INREG nodes until AVX1.

This patch adds a quick hack to combinePMULDQ to directly emit a vector shuffle corresponding to an ANY_EXTEND_VECTOR_INREG operation. This avoids both of those issues without creating any other regressions on our tests. The xop-ifma.ll change here also showed up when I tried to resurrect D56306 and seemed to be the only improvement that patch creates now. This is a more direct way to get the benefit.

Differential Revision: https://reviews.llvm.org/D66436

llvm-svn: 369942
2019-08-26 18:23:26 +00:00
Craig Topper b8b90ac1c5 [X86][DAGCombiner] Teach narrowShuffle to use concat_vectors instead of inserting into undef
Summary:
Concat_vectors is more canonical during early DAG combine. For example, its what's used by SelectionDAGBuilder when converting IR shuffles into SelectionDAG shuffles when element counts between inputs and mask don't match. We also have combines in DAGCombiner than can pull concat_vectors through a shuffle. See partitionShuffleOfConcats. So it seems like concat_vectors is a better operation to use here. I had to teach DAGCombiner's SimplifyVBinOp to also handle concat_vectors with undef. I haven't checked yet if we can remove the INSERT_SUBVECTOR version in there or not.

I didn't want to mess with the other caller of getShuffleHalfVectors that's used during shuffle lowering where insert_subvector probably is what we want to produce so I've enabled this via a boolean passed to the function.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66504

llvm-svn: 369872
2019-08-25 17:59:49 +00:00
Craig Topper dd2cf78381 [X86] Add an assert to mark more code that needs to be removed when the vector widening legalization switch is removed again.
llvm-svn: 369837
2019-08-24 05:59:46 +00:00
Craig Topper bc173d4c51 [X86] Move a transform out of combineConcatVectorOps so we don't prematurely turn CONCAT_VECTORS into INSERT_SUBVECTORS.
CONCAT_VECTORS and INSERT_SUBVECTORS can both call combineConcatVectorOps,
but we shouldn't produce INSERT_SUBVECTORS from there. We should
keep CONCAT_VECTORS until vector legalization.

Noticed while looking at the madd_quad_reduction test from madd.ll

llvm-svn: 369802
2019-08-23 19:52:24 +00:00
Craig Topper e7211bb567 [SelectionDAG][X86] Enable iX SimplifyDemandedBits to vXi1 SimplifyDemandedVectorElts simplification. Add a hack to X86 to avoid a regression
Patch showing the effect of enabling bool vector oversimplification.

Non-VLX builds can simplify a kshift shuffle, but VLX builds simplify:

insert_subvector v8i zeroinitializer, v2i --> insert_subvector v8i undef, v2i

Preventing the removal of the AND to clear the upper bits of result

Differential Revision: https://reviews.llvm.org/D53022

llvm-svn: 369780
2019-08-23 17:14:58 +00:00
Simon Pilgrim c88408cf85 Use VT::getHalfNumVectorElementsVT helpers in a few places. NFCI.
llvm-svn: 369751
2019-08-23 12:37:09 +00:00
Craig Topper 4deb388bca [X86] Make combineLoopSADPattern use CONCAT_VECTORS instead of INSERT_SUBVECTORS for widening with zeros.
CONCAT_VECTORS is more canonical for the early DAG combine runs
until we start getting into the op legalization phases.

llvm-svn: 369734
2019-08-23 06:08:33 +00:00
Craig Topper bdceb9fb14 [X86] Improve lowering of v2i32 SAD handling in combineLoopSADPattern.
For v2i32 we only feed 2 i8 elements into the psadbw instructions
with 0s in the other 14 bytes. The resulting psadbw instruction
will produce zeros in bits [127:16] of the output. We need to take
the result and feed it to a v2i32 add where the first element
includes bits [15:0] of the sad result. The other element should
be zero.

Prior to this patch we were using a truncate to take 0 from
bits 95:64 of the psadbw. This results in a pshufd to move those
bits to 63:32. But since we also have zeroes in bits 63:32 of
the psadbw output, we should just take those bits.

The previous code probably worked better with promoting legalization,
but now we use widening legalization. I've preserved the old
behavior if -x86-experimental-vector-widening-legalization=false
until we get that option removed.

llvm-svn: 369733
2019-08-23 05:33:27 +00:00
Simon Pilgrim 6dd51c2f19 [MVT] Add MVT equivalent to EVT::getHalfNumVectorElementsVT() helper. NFCI.
Allows for some cleanup in a lot of SSE/AVX vector splitting code

llvm-svn: 369640
2019-08-22 11:14:30 +00:00
Craig Topper ba375263e8 [DAGCombiner][X86] Teach visitCONCAT_VECTORS to combine (concat_vectors (concat_vectors X, Y), undef)) -> (concat_vectors X, Y, undef, undef)
I also had to add a new combine to X86's combineExtractSubvector to prevent a regression.

This helps our vXi1 code see the full concat operation and allow it optimize undef to a zero if there is already a zero in the concat. This helped us use a movzx instead of an AND in some of the tests. In those tests, one concat comes from SelectionDAGBuilder and the second comes from type legalization of v4i1->i4 bitcasts which uses an additional concat. Though these changes weren't my original motivation.

I'm looking at making X86ISelLowering's narrowShuffle emit a concat_vectors instead of an insert_subvector since concat_vectors is more canonical during early DAG combine. This patch helps prevent a regression from my experiments with that.

Differential Revision: https://reviews.llvm.org/D66456

llvm-svn: 369459
2019-08-20 22:12:50 +00:00
Craig Topper 3a2b08e6c9 [X86] Add a DAG combine to transform (i8 (bitcast (v8i1 (extract_subvector (v16i1 X), 0)))) -> (i8 (trunc (i16 (bitcast (v16i1 X))))) on KNL target
Without AVX512DQ we don't have KMOVB so we can't really copy 8-bits of a k-register to a GPR. We have to copy 16 bits instead. We do this even if the DAG copy is from v8i1->v16i1. If we detect the (i8 (bitcast (v8i1 (extract_subvector (v16i1 X), 0)))) we should rewrite the types to match the copy we do support. By doing this, we can help known bits to propagate without losing the upper 8 bits of the input to the extract_subvector. This allows some zero extends to be removed since we have an isel pattern to use kmovw for (zero_extend (i16 (bitcast (v16i1 X))).

Differential Revision: https://reviews.llvm.org/D66489

llvm-svn: 369434
2019-08-20 20:20:04 +00:00
Craig Topper 22ac9f396f [X86] Use isNullConstant instead of getConstantOperandVal == 0. NFC
llvm-svn: 369410
2019-08-20 16:55:12 +00:00
Craig Topper 1ada137854 [X86] Add back the -x86-experimental-vector-widening-legalization comand line flag and all associated code, but leave it enabled by default
Google is reporting performance issues with the new default behavior
and have asked for a way to switch back to the old behavior while we
investigate and make fixes.

I've restored all of the code that had since been removed and added
additional checks of the command flag onto code paths that are
not otherwise guarded by a check of getTypeAction.

I've also modified the cost model tables to hopefully get us back
to the previous costs.

Hopefully we won't need to support this for very long since we
have no test coverage of the old behavior so we can very easily
break it.

llvm-svn: 369332
2019-08-20 06:58:00 +00:00
Craig Topper a0d92c7262 [X86] Teach lowerV4I32Shuffle to only use broadcasts if the mask has more than one undef element. Prioritize shifts over broadcast in lowerV8I16Shuffle.
The motivating case are the changes in vector-reduce-add.ll where
we were doing extra work in the scalar domain instead of shuffling.
There may be some one use check that needs to be looked into there,
but this patch sidesteps the issue by avoiding broadcasts that
aren't really broadcasting.

Differential Revision: https://reviews.llvm.org/D66071

llvm-svn: 369287
2019-08-19 18:15:50 +00:00
Craig Topper ebb7ddc633 [X86] Teach lower1BitShuffle to match right shifts with upper zero elements on types that don't natively support KSHIFT.
We can support these by widening to a supported type,
then shifting all the way to the left and then
back to the right to ensure that we shift in zeroes.

llvm-svn: 369232
2019-08-19 05:45:39 +00:00
Craig Topper e47437a6ef [X86] Fix the lower1BitShuffle code added in r369215 to correctly pass the widened vector to the KSHIFT node.
Not sure how to test this as we have tests that exercise this code,
but nothing failed for the types not matching. Since all the k-registers
use equivalent register classes everything just ends up working.

llvm-svn: 369228
2019-08-19 04:08:44 +00:00
Craig Topper 269c6b1c15 [X86] Teach lower1BitShuffle to match KSHIFTR that doesn't use Zeroable and only relies on undef.
This allows us to widen the type when the KSHIFTR instruction
doesn't exist for the type. If we need to shift in zeroes into
the upper elements we would need more work to guarantee zeroes
when widening.

llvm-svn: 369227
2019-08-19 04:08:40 +00:00
Craig Topper 2eb7951da3 [X86] Teach lower1BitShuffle to recognize padding a subvector with zeros with V2 as the source and V1 as the zero vector.
Shuffle canonicalization can swap the sources so the zero vector
might be V1 and the subvector that's being padded can be V2.

llvm-svn: 369226
2019-08-19 00:39:22 +00:00
Craig Topper 2ee46c7c4b [X86] Add a special case to LowerCONCAT_VECTORSvXi1 to handle concatenating zero vectors followed by one non-zero vector followed by undef vectors.
For such a case we should only need a KSHIFTL, but we were
previously generating a KSHIFTL followed by a KSHIFTR because
we mistakenly believed we need to zero the undef elements.

llvm-svn: 369224
2019-08-18 23:30:11 +00:00
Craig Topper 388b8dd94a [X86] Replace uses of getZeroVector for vXi1 vectors with DAG.getConstant.
vXi1 vectors don't need special handling.

llvm-svn: 369222
2019-08-18 23:30:03 +00:00
Craig Topper 9e074c06fe [X86] Improve lower1BitShuffle handling for KSHIFTL on narrow vectors.
We can insert the value into a larger legal type and shift that
by the desired amount.

llvm-svn: 369215
2019-08-18 18:52:46 +00:00
Simon Pilgrim 63b3c56fca Fix signed/unsigned comparison warning. NFCI.
llvm-svn: 369213
2019-08-18 17:26:30 +00:00
Simon Pilgrim fee2546f3f [X86] isTargetShuffleEquivalent - add BUILD_VECTOR matching
Add similar functionality to isShuffleEquivalent - if the mask elements don't match, try matching the BUILD_VECTOR scalars instead.

As target shuffles need to handle SM_Sentinel values, this can get a bit tricky, so commit just adds actual mask element index handling - full SM_SentinelZero support will be added when the need arises.

Also, enables support in matchVectorShuffleWithPACK

llvm-svn: 369212
2019-08-18 17:15:26 +00:00
Simon Pilgrim a66edd86e2 [X86] isTargetShuffleEquivalent - early out on illegal shuffle masks. NFCI.
Simplifies shuffle mask comparisons by just bailing out if the shuffle mask has any out of range values - will make an upcoming patch much simpler.

llvm-svn: 369211
2019-08-18 16:37:58 +00:00
Craig Topper 31f829f0cd [X86] Add a one use check to the combineStore code that handles v16i16->v16i8 truncate+store by extending to v16i32 and then emitting a v16i32->v16i8 truncstore.
This prevent us from emitting a separate truncate and a truncating
store instruction.

llvm-svn: 369200
2019-08-17 22:46:15 +00:00
Jordan Rupprecht d0797ece46 Revert [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (reapplied)
This reverts r368662 (git commit 1a8d790cf5)

The compile-time regression repro is in https://bugs.llvm.org/show_bug.cgi?id=43024

llvm-svn: 369167
2019-08-16 23:08:56 +00:00
Simon Pilgrim 63b78b678b [X86] resolveTargetShuffleInputs - add DemandedElts variant. NFCI.
Nothing calls this yet, everything still goes through the non (all) DemandedElts wrapper.

llvm-svn: 369136
2019-08-16 18:13:22 +00:00
Simon Pilgrim 8ff1b7de4d [X86] combineExtractWithShuffle - handle extract(truncate(x), 0)
Eventually we need to generalize combineExtractWithShuffle to handle all faux shuffles and handle truncate (and X86ISD::VTRUNC etc.) there, but we're not ready yet (still creates nodes on the fly, incomplete DemandedElts support, bad use of recursive Depth limit).

llvm-svn: 369134
2019-08-16 17:35:08 +00:00
Daniel Sanders 0c47611131 Apply llvm-prefer-register-over-unsigned from clang-tidy to LLVM
Summary:
This clang-tidy check is looking for unsigned integer variables whose initializer
starts with an implicit cast from llvm::Register and changes the type of the
variable to llvm::Register (dropping the llvm:: where possible).

Partial reverts in:
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
X86FixupLEAs.cpp - Some functions return unsigned and arguably should be MCRegister
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
HexagonBitSimplify.cpp - Function takes BitTracker::RegisterRef which appears to be unsigned&
MachineVerifier.cpp - Ambiguous operator==() given MCRegister and const Register
PPCFastISel.cpp - No Register::operator-=()
PeepholeOptimizer.cpp - TargetInstrInfo::optimizeLoadInstr() takes an unsigned&
MachineTraceMetrics.cpp - MachineTraceMetrics lacks a suitable constructor

Manual fixups in:
ARMFastISel.cpp - ARMEmitLoad() now takes a Register& instead of unsigned&
HexagonSplitDouble.cpp - Ternary operator was ambiguous between unsigned/Register
HexagonConstExtenders.cpp - Has a local class named Register, used llvm::Register instead of Register.
PPCFastISel.cpp - PPCEmitLoad() now takes a Register& instead of unsigned&

Depends on D65919

Reviewers: arsenm, bogner, craig.topper, RKSimon

Reviewed By: arsenm

Subscribers: RKSimon, craig.topper, lenary, aemerson, wuzish, jholewinski, MatzeB, qcolombet, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, wdng, nhaehnle, sbc100, jgravelle-google, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, javed.absar, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, tpr, PkmX, jocewei, jsji, Petar.Avramovic, asbirlea, Jim, s.egerton, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65962

llvm-svn: 369041
2019-08-15 19:22:08 +00:00
Craig Topper 2a372ba534 [X86] Add custom type legalization for bitcasting mmx to v2i32/v4i16/v8i8 to use movq2dq instead of going through memory.
llvm-svn: 369031
2019-08-15 18:23:37 +00:00
Sanjay Patel 57d459309d [SDAG][x86] check for relaxed math when matching an FP reduction
If the last step in an FP add reduction allows reassociation and doesn't care
about -0.0, then we are free to recognize that computation as a reduction
that may reorder the intermediate steps.

This is requested directly by PR42705:
https://bugs.llvm.org/show_bug.cgi?id=42705
and solves PR42947 (if horizontal math instructions are actually faster than
the alternative):
https://bugs.llvm.org/show_bug.cgi?id=42947

Differential Revision: https://reviews.llvm.org/D66236

llvm-svn: 368995
2019-08-15 12:43:15 +00:00
Craig Topper a57734ba4e [X86] Disable custom type legalization for v2i32/v4i16/v8i8->i64.
The default legalization can take care of this.

llvm-svn: 368967
2019-08-15 05:51:58 +00:00
Craig Topper 57286afe4e [X86] Disable custom type legalization for v2i32/v4i16/v8i8->f64 bitcast.
The generic legalization handles this in the same way so just use
that.

llvm-svn: 368966
2019-08-15 05:51:54 +00:00
Craig Topper ba39fcd8c6 [X86] Remove some unreachable code from LowerBITCAST.
llvm-svn: 368965
2019-08-15 05:51:50 +00:00
Craig Topper 14f7560020 [X86] Remove some dead code and combine some repeated code that's left.
If the width is 256 bits, then we must have AVX so the else here
was unnecessary. Once that's removed then the >= 256 bit code is
identical to the 128 bit code with a different VT so combine them.

llvm-svn: 368956
2019-08-15 04:07:43 +00:00
Craig Topper 3e44d96170 [X86] Use PSADBW for v8i8 addition reductions.
Improves the 8 byte case from PR42674.

Differential Revision: https://reviews.llvm.org/D66069

llvm-svn: 368864
2019-08-14 15:57:29 +00:00
Simon Pilgrim e7b350a5d1 [X86] XFormVExtractWithShuffleIntoLoad - handle shuffle mask scaling
If the target shuffle mask is from a wider type, attempt to scale the mask so that the extraction can attempt to peek through.

Fixes the regression mentioned in rL368662

Reapplying this as rL368308 had to be reverted as part of rL368660 to revert rL368276

llvm-svn: 368663
2019-08-13 11:11:42 +00:00
Simon Pilgrim 1a8d790cf5 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (reapplied)
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts. 

The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.

Reapplying this as rL368307 had to be reverted as part of rL368660 to revert rL368276

llvm-svn: 368662
2019-08-13 10:51:39 +00:00
Hans Wennborg 5390d25f2b Revert r368276 "[TargetLowering] SimplifyDemandedBits - call SimplifyMultipleUseDemandedBits for ISD::EXTRACT_VECTOR_ELT"
This introduced a false positive MemorySanitizer warning about use of
uninitialized memory in a vectorized crc function in Chromium. That suggests
maybe something is not right with this transformation. See
https://crbug.com/992853#c7 for a reproducer.

This also reverts the follow-up commits r368307 and r368308 which
depended on this.

> This patch attempts to peek through vectors based on the demanded bits/elt of a particular ISD::EXTRACT_VECTOR_ELT node, allowing us to avoid dependencies on ops that have no impact on the extract.
>
> In particular this helps remove some unnecessary scalar->vector->scalar patterns.
>
> The wasm shift patterns are annoying - @tlively has indicated that the wasm vector shift codegen are to be refactored in the near-term and isn't considered a major issue.
>
> Differential Revision: https://reviews.llvm.org/D65887

llvm-svn: 368660
2019-08-13 09:33:25 +00:00
Craig Topper e07e593782 [X86] Allow combineTruncateWithSat to use pack instructions for i16->i8 without AVX512BW.
We need AVX512BW to be able to truncate an i16 vector. If we don't
have that we have to extend i16->i32, then trunc, i32->i8. But we
won't be able to remove the min/max if we do that. At least not
without more special handling.

llvm-svn: 368623
2019-08-12 22:18:23 +00:00
Craig Topper 0761a38e8a [X86] Remove unreachable code from LowerTRUNCATE. NFC
All three 256->128 bit cases were already handled above.

Noticed while looking at the coverage report.

llvm-svn: 368609
2019-08-12 19:26:45 +00:00
Craig Topper a3605baaff [X86] Add a paranoia type check to the code that detects AVG patterns from truncating stores.
If we're after type legalize, we should make sure we won't create
a store with an illegal type when we separate the AVG pattern
from the truncating store.

I don't know of a way to fail for this today. Just noticed while
I was in the vicinity.

llvm-svn: 368608
2019-08-12 19:26:37 +00:00
Craig Topper 1b02909847 [X86] Simplify creation of saturating truncating stores.
We just need to check if the truncating store is legal
instead of going through isSATValidOnAVX512Subtarget.

llvm-svn: 368607
2019-08-12 19:26:30 +00:00
Craig Topper 3f4e9b156d [X86] Replace call to isTruncStoreLegalOrCustom with isTruncStoreLegal. NFC
We have no custom trunc stores on X86.

llvm-svn: 368606
2019-08-12 19:26:22 +00:00
Craig Topper 09d5d15339 [X86] Disable use of zmm registers for varargs musttail calls under prefer-vector-width=256 and min-legal-vector-width=256.
Under this config, the v16f32 type we try to use isn't to a register
class so the getRegClassFor call will fail.

llvm-svn: 368594
2019-08-12 17:43:26 +00:00
Simon Pilgrim 182249daee [X86][SSE] ComputeKnownBits - add basic PSADBW handling
llvm-svn: 368558
2019-08-12 12:19:19 +00:00
Craig Topper ce6a2cf966 [X86] Simplify some of the type checks in combineSubToSubus.
If we have SSE2 we can handle any i8/i16 type and let
type legalization deal with it.

llvm-svn: 368538
2019-08-11 17:36:49 +00:00
Craig Topper 637964bfd8 [X86] Don't use SplitOpsAndApply for ISD::USUBSAT.
Target independent type legalization and custom lowering
should be able to handle it.

llvm-svn: 368537
2019-08-11 17:36:45 +00:00
Craig Topper 9758e0e1bf [X86] Remove some more code from combineShuffle that is no longer needed with widening legalization.
llvm-svn: 368523
2019-08-11 02:17:18 +00:00
Craig Topper 0f74b82ef1 [X86] Remove some code from combineShuffle that seems largely unnecessary with widening legalization.
The test case that changed is probably better served through
allowing combineTruncatedArithmetic to create narrow vectors. It
also appears InstCombine would have simplified this test case
to remove the zext and trunc anyway.

llvm-svn: 368522
2019-08-11 02:08:38 +00:00
Simon Pilgrim ec128709f0 [X86][SSE] Lower shuffle as ANY_EXTEND_VECTOR_INREG
On SSE41+ targets we always lower vector shuffles to ZERO_EXTEND_VECTOR_INREG, even if we don't need the extended bits.

This patch relaxes this so that we lower to ANY_EXTEND_VECTOR_INREG if we can, meaning that shuffle combines have a better idea of what elements need to be kept zero. This helps the multiple reduction code as we can now combine away a lot more of the pack+extend codes.

Differential Revision: https://reviews.llvm.org/D65741

llvm-svn: 368515
2019-08-10 16:46:07 +00:00
Craig Topper 74c43a2277 [X86] Match the IR pattern form movmsk on SSE1 only targets where v4i32 isn't legal
Summary:
This patch adds a special DAG combine for SSE1 to recognize the IR pattern InstCombine gives us for movmsk. This only does the recognition for a few cases where its obvious the input won't be scalarized resulting in building a vector just do to the movmsk. I've made it separate from our existing matching for movmsk since that's called in multiple places and I didn't spend time to see if the other callers would make sense here. Plus the restrictions and additional checks would complicate that.

This fixes the case from PR42870. Buts its probably still broken the presence of logic ops feeding the movmsk pattern which would further hide the v4f32 type.

Reviewers: spatel, RKSimon, xbolva00

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65689

llvm-svn: 368506
2019-08-10 07:51:13 +00:00
Luo, Yuanke c6c86f4f81 [X86] Fix stack probe issue on windows32.
Summary:
On windows if the frame size exceed 4096 bytes, compiler need to
generate a call to _alloca_probe. X86CallFrameOptimization pass
changes the reserved stack size and cause of stack probe function
not be inserted. This patch fix the issue by detecting the call
frame size, if the size exceed 4096 bytes, drop X86CallFrameOptimization.

Reviewers: craig.topper, wxiao3, annita.zhang, rnk, RKSimon

Reviewed By: rnk

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65923

llvm-svn: 368503
2019-08-10 02:49:02 +00:00
Eric Christopher db2f17d362 Remove variable only used in an assert.
llvm-svn: 368486
2019-08-09 21:02:47 +00:00
Craig Topper 6cb05ca044 [X86] Remove custom handling for extloads from LowerLoad.
We don't appear to need this with widening legalization.

llvm-svn: 368479
2019-08-09 20:27:22 +00:00
Simon Pilgrim 60394f47b0 [X86][SSE] Swap X86ISD::BLENDV inputs with an inverted selection mask (PR42825)
As discussed on PR42825, if we are inverting the selection mask we can just swap the inputs and avoid the inversion.

Differential Revision: https://reviews.llvm.org/D65522

llvm-svn: 368438
2019-08-09 12:44:20 +00:00
Craig Topper 6179175551 [X86] Remove code that expands truncating stores from combineStore.
We shouldn't form trunc stores that need to be expanded now that
we are using widening legalization.

llvm-svn: 368400
2019-08-09 06:59:53 +00:00
Craig Topper 7e33f11ba7 [X86] Remove stale FIXME from combineMaskedStore. NFC
I believe PR34584 was tracking that FIXME, but its since been
closed and a test case was added.

llvm-svn: 368397
2019-08-09 05:55:41 +00:00
Craig Topper 8c5c09780d [X86] Remove DAG combine expansion of extending masked load and truncating masked store.
The only way to generate these was through promoting legalization
of narrow vectors, but we widen those types now. So we shouldn't
produce these nodes.

llvm-svn: 368396
2019-08-09 05:53:37 +00:00
Craig Topper 509c8774fa [X86] Remove handler for (U/S)(ADD/SUB)SAT from ReplaceNodeResults. Remove TypeWidenVector check from code that handles X86ISD::VPMADDWD and X86ISD::AVG.
More unneeded code since we now legalize narrow vectors by widening.

llvm-svn: 368395
2019-08-09 05:17:52 +00:00
Craig Topper 824961824f [X86] Remove ISD::SETCC handling from ReplaceNodeResults.
This is no longer needed since we widen v2i32 instead of promoting.

llvm-svn: 368394
2019-08-09 05:17:48 +00:00
Craig Topper ef5b435b00 [X86] Simplify ISD::LOAD handling in ReplaceNodeResults and ISD::STORE handling in LowerStore now that v2i32 is widened to v4i32.
llvm-svn: 368390
2019-08-09 03:09:43 +00:00
Craig Topper 0da681a2be [X86] Merge v2f32 and v2i32 gather/scatter handling in ReplaceNodeResults/LowerMSCATTER now that v2i32 is also widened like v2f32.
llvm-svn: 368389
2019-08-09 03:09:28 +00:00
Craig Topper 6f81db0f68 [X86] Now unreachable handling for f64->v2i32/v4i16/v8i8 bitcasts from ReplaceNodeResults.
We rely on the generic type legalizer for this now.

llvm-svn: 368388
2019-08-09 03:09:19 +00:00
Craig Topper d871f638d7 [X86] Simplify ReplaceNodeResults handling for FP_TO_SINT/UINT for vectors to only handle widening.
llvm-svn: 368387
2019-08-09 03:09:10 +00:00
Craig Topper 0bd44d59db [X86] Simplify ReplaceNodeResults handling for SIGN_EXTEND/ZERO_EXTEND/TRUNCATE for vectors to only handle widening.
llvm-svn: 368386
2019-08-09 03:08:54 +00:00
Craig Topper cdb9a8ebd8 [X86] Simplify ReplaceNodeResults handling for UDIV/UREM/SDIV/SREM for vectors to only handle widening.
llvm-svn: 368385
2019-08-09 03:08:45 +00:00
Craig Topper 35848345f0 [X86] Remove vector promotion handling from the ReplaceNodeResults ISD::MUL handling code.
We now widen illegal vector types so we don't need this anymore.

llvm-svn: 368384
2019-08-09 03:08:28 +00:00
Craig Topper c49d3e6c4d [X86] Improve codegen of v8i64->v8i16 and v16i32->v16i8 truncate with avx512vl, avx512bw, min-legal-vector-width<=256 and prefer-vector-width=256
Under this configuration we'll want to split the v8i64 or v16i32 into two vectors. The default legalization will try to truncate each of those 256-bit pieces one step to 128-bit, concatenate those, then truncate one more time from the new 256 to 128 bits.

With this patch we now truncate the two splits to 64-bits then concatenate those. We have to do this two different ways depending on whether have widening legalization enabled. Without widening legalization we have to manually construct X86ISD::VTRUNC to prevent the ISD::TRUNCATE with a narrow result being promoted to 128 bits with a larger element type than what we want followed by something like a pshufb to grab the lower half of each element to finish the job. With widening legalization we just get the right thing. When we switch to widening by default we can just delete the other code path.

Differential Revision: https://reviews.llvm.org/D65626

llvm-svn: 368349
2019-08-08 21:36:47 +00:00
Simon Pilgrim eb7a553db8 [X86] XFormVExtractWithShuffleIntoLoad - handle shuffle mask scaling
If the target shuffle mask is from a wider type, attempt to scale the mask so that the extraction can attempt to peek through.

Fixes the regression mentioned in rL368307

llvm-svn: 368308
2019-08-08 16:05:23 +00:00
Simon Pilgrim 67c246bbe6 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts. 

The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.

llvm-svn: 368307
2019-08-08 15:54:20 +00:00
Simon Pilgrim 59fabf9c60 [X86][SSE] matchBinaryPermuteShuffle - split INSERTPS combines
We need to prefer INSERTPS with zeros over SHUFPS, but fallback to INSERTPS if that fails.

llvm-svn: 368292
2019-08-08 13:23:53 +00:00
Craig Topper 724c6053ac [X86] Remove -x86-experimental-vector-widening-legalization command line option and all its uses.
This option is now defaulted to true and we don't want to support
turning it off so remove the option.

llvm-svn: 368258
2019-08-08 06:48:22 +00:00
Craig Topper 0aacc7da8b [X86] Add CMOV_FR32X and CMOV_FR64X to the isCMOVPseudo function.
llvm-svn: 368250
2019-08-08 04:40:59 +00:00
Amy Huang 0b870b969f Recommit "[MS] Emit S_HEAPALLOCSITE debug info in Selection DAG"
with a fix to clear the SDNode map when SelectionDAG is cleared.

llvm-svn: 368230
2019-08-07 22:49:40 +00:00
Craig Topper 7f7ef0208b [X86] Allow pack instructions to be used for 512->256 truncates when -mprefer-vector-width=256 is causing 512-bit vectors to be split
If we're splitting the 512-bit vector anyway and we have zero/sign bits, then we might as well use pack instructions to concat and truncate at once.

Differential Revision: https://reviews.llvm.org/D65904

llvm-svn: 368210
2019-08-07 21:16:10 +00:00
Craig Topper 8b5f2ab2a4 Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.

Original commit message:

This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 368183
2019-08-07 16:24:26 +00:00
Simon Pilgrim d52bc482a5 [X86] EltsFromConsecutiveLoads - early out for non-byte sized memory (PR42909)
Don't attempt to merge loads for types that aren't modulo 8-bits.

llvm-svn: 368165
2019-08-07 12:41:59 +00:00
Mitch Phillips bd0d97e1c4 Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default."
This reverts commit 3de33245d2.

This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368107
2019-08-06 23:00:43 +00:00
Craig Topper ecc1e5d476 [X86] Don't allow combineSIntToFP to create v2i32 vectors after type legalization.
If we're after type legalization we should only be trying to turn
v2i64 into v2i32. So bitcast to v4i32, shuffle the even elements
together. Then use X86ISD::CVTSI2P. The alternative is to leave
the v2i64 type alone and let it scalarized. Hopefully keeping
it packed is better.

Fixes PR42905.

llvm-svn: 368091
2019-08-06 21:43:15 +00:00
Simon Pilgrim cf62047d29 [X86][SSE] Call SimplifyMultipleUseDemandedBits on PACKSS/PACKUS arguments.
This mainly helps to replace unused arguments with UNDEF in the case where they have multiple users.

llvm-svn: 368026
2019-08-06 13:10:42 +00:00
Simon Pilgrim 01d267dc4f [X86] SimplifyMultipleUseDemandedBits - target shuffles might not be identity
If we don't demand any non-undef shuffle elements then the assert will fail as all shuffle inputs would still be flagged as 'identity' safe.

Exposed by an incoming patch.

llvm-svn: 368022
2019-08-06 12:41:29 +00:00
Simon Pilgrim c6735aecfa [X86][SSE] Enable min/max partial reduction
As mentioned on D65047 / rL366933 the plan is to enable partial reduction handling wherever possible.

llvm-svn: 368016
2019-08-06 11:00:34 +00:00
Cullen Rhodes ced419f4d7 [SelectionDAG] Extend base addressing modes supported by MGATHER/MSCATTER
Summary:
Before this patch MGATHER/MSCATTER is capable of representing all
common addressing modes, but only when illegal types are used.
This patch adds an IndexType property so more representations
are available when using legal types only.

Original modes:
 vector of bases
 base + vector of signed scaled offsets

New modes:
 base + vector of signed unscaled offsets
 base + vector of unsigned scaled offsets
 base + vector of unsigned unscaled offsets

The current behaviour of addressing modes for gather/scatter remains
unchanged.

Patch by Paul Walker.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D65636

llvm-svn: 368008
2019-08-06 09:46:13 +00:00
Craig Topper 3de33245d2 [X86] Enable -x86-experimental-vector-widening-legalization by default.
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 367901
2019-08-05 18:25:36 +00:00
Sanjay Patel eaf13044bd [DAGCombiner][x86] prevent infinite loop from truncate/extend transforms
The test case is based on the example from the post-commit thread for:
https://reviews.llvm.org/rGc9171bd0a955

This replaces the x86-specific simple-type check from:
rL367766
with a check in the DAGCombiner. Adding the check isn't
strictly necessary after the fix from:
rL367768
...but it seems likely that we're heading for trouble if
we are creating weird types in this transform.

I combined the earlier legality check into the initial
clause to simplify the code.

So we should only try the trunc/sext transform at the
earliest combine stage, but we limit the transform to
simple types anyway because the TLI hook is probably
too lax about what it considers a free truncate.

llvm-svn: 367834
2019-08-05 11:27:07 +00:00
Guillaume Chatelet c97a3d15d2 [LLVM][Alignment] Introduce Alignment Type
Summary:
This is patch is part of a serie to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet, jfb, jakehehrlich

Reviewed By: jfb

Subscribers: wuzish, jholewinski, arsenm, dschuff, nemanjai, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, s.egerton, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65514

llvm-svn: 367828
2019-08-05 11:02:05 +00:00
Craig Topper 635f5ff580 [X86] Fix a bad early out in combineExtInVec that prevented recursive shuffle combining from running with -x86-experimental-vector-widening-legalization.
llvm-svn: 367798
2019-08-05 03:48:31 +00:00
Craig Topper 5a4989e2ac [TargetLowering][X86] Teach SimplifyDemandedVectorElts to replace the base vector of INSERT_SUBVECTOR with undef if none of the elements are demanded even if the node has other users.
Summary:
The SimplifyDemandedVectorElts function can replace with undef
when no elements are demanded, but due to how it interacts with
TargetLoweringOpts, it can only do this when the node has
no other users.

Remove a now unneeded DAG combine from the X86 backend.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65713

llvm-svn: 367788
2019-08-04 17:30:41 +00:00
Simon Pilgrim 436fd52a71 [X86] lowerShuffleAsSpecificZeroOrAnyExtend - use undef PSHUFB mask indices for ANY_EXTEND shuffles
llvm-svn: 367784
2019-08-04 13:15:23 +00:00
Simon Pilgrim c5891eaa34 Fix signed/unsigned comparison warning. NFC.
llvm-svn: 367783
2019-08-04 12:48:19 +00:00
Simon Pilgrim e16901844d [X86] SimplifyMultipleUseDemandedBits - Add target shuffle support
llvm-svn: 367782
2019-08-04 12:24:40 +00:00
Craig Topper 0fff1e4f3d [X86] Consistently use MVT::i8 for the constant operand of BLENDI and INSERTPS nodes.
This is the type listed in the type constraint for isel. But since
we list a type there, it doesn't get checked during isel matching.

llvm-svn: 367775
2019-08-04 06:01:31 +00:00
Sanjay Patel c9171bd0a9 [x86] change free truncate hook to handle only simple types (PR42880)
This avoids the crash from:
https://bugs.llvm.org/show_bug.cgi?id=42880
...and I think it's a proper constraint for the TLI hook.

But that example raises questions about what happens to get us
into this situation (created i29 types) and what happens later
(why does legalization die on those types), so I'm not sure if
we will resolve the bug based on this change.

llvm-svn: 367766
2019-08-03 21:46:27 +00:00
Bill Wendling 41a2847a9a Emit diagnostic if an inline asm constraint requires an immediate
Summary:
An inline asm call can result in an immediate after inlining. Therefore emit a
diagnostic here if constraint requires an immediate but one isn't supplied.

Reviewers: joerg, mgorny, efriedma, rsmith

Reviewed By: joerg

Subscribers: asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, s.egerton, MaskRay, jyknight, dylanmckay, javed.absar, fedor.sergeev, jrtc27, Jim, krytarowski, eraman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60942

llvm-svn: 367750
2019-08-03 05:52:47 +00:00
Craig Topper 45ea25289d [X86] Use the pointer VT for the Scale node when lowering x86 gather/scatter intrinsics.
This is consistent with the target independent intrinsic handling.

Not sure this really matters since we just pull the constant out
using getZExtValue later.

llvm-svn: 367736
2019-08-02 23:18:16 +00:00
Daniel Sanders 2bea69bf65 Finish moving TargetRegisterInfo::isVirtualRegister() and friends to llvm::Register as started by r367614. NFC
llvm-svn: 367633
2019-08-01 23:27:28 +00:00
Craig Topper a9ed5436bd [X86] In decomposeMulByConstant, legalize the VT before querying whether the multiply is legal
If a type is larger than a legal type and needs to be split, we would previously allow the multiply to be decomposed even if the split multiply is legal. Since the shift + add/sub code would also need to be split, its not any better to decompose it.

This patch figures out what type the mul will eventually be legalized to and then uses that type for the query. I tried just returning false illegal types and letting them get handled after type legalization, but then we can't recognize and i64 constant splat on 32-bit targets since will be destroyed by type legalization. We could special case vectors of i64 to avoid that...

Differential Revision: https://reviews.llvm.org/D65533

llvm-svn: 367601
2019-08-01 18:49:07 +00:00
Simon Pilgrim 63d4114f72 [X86][SSE] Add PEXTR*(PINSR*(v, s, c), c) -> s combine.
We should probably extend this to cover bitcasts as well to help other cases in promote-vec3.ll.

llvm-svn: 367582
2019-08-01 16:38:39 +00:00
Simon Pilgrim 33f5f863b5 [X86][SSE] SimplifyMultipleUseDemandedBits - Add PEXTR/PINSR B+W handling
This adds SimplifyMultipleUseDemandedBitsForTargetNode X86 support and uses it to allow us to peek through vector insertions to avoid dependencies on entire insertion chains.

llvm-svn: 367570
2019-08-01 14:46:03 +00:00
Simon Pilgrim f99f9881e3 [X86] EltsFromConsecutiveLoads - don't attempt to merge volatile loads (PR42846)
llvm-svn: 367556
2019-08-01 13:13:18 +00:00
Amy Huang 153f20057c Revert "[MS] Emit S_HEAPALLOCSITE debug info in Selection DAG" and
and partial fix.
Causes windows buildbot errors.

This reverts commit 6e65c34523963094acd0d6c94a5f5c64b32fe6aa and
53da7ca943.

llvm-svn: 367496
2019-07-31 23:59:31 +00:00
Craig Topper b51dc64063 [X86] Add DAG combine to fold any_extend_vector_inreg+truncstore to an extractelement+store
We have custom code that ignores the normal promoting type legalization on less than 128-bit vector types like v4i8 to emit pavgb, paddusb, psubusb since we don't have the equivalent instruction on a larger element type like v4i32. If this operation appears before a store, we can be left with an any_extend_vector_inreg followed by a truncstore after type legalization. When truncstore isn't legal, this will normally be decomposed into shuffles and a non-truncating store. This will then combine away the any_extend_vector_inreg and shuffle leaving just the store. On avx512, truncstore is legal so we don't decompose it and we had no combines to fix it.

This patch adds a new DAG combine to detect this case and emit either an extract_store for 64-bit stoers or a extractelement+store for 32 and 16 bit stores. This makes the avx512 codegen match the avx2 codegen for these situations. I'm restricting to only when -x86-experimental-vector-widening-legalization is false. When we're widening we're not likely to create this any_extend_inreg+truncstore combination. This means we should be able to remove this code when we flip the default. I would like to flip the default soon, but I need to investigate some performance regressions its causing in our branch that I wasn't seeing on trunk.

Differential Revision: https://reviews.llvm.org/D65538

llvm-svn: 367488
2019-07-31 22:43:08 +00:00
Simon Pilgrim 0707f66ad0 [X86] Moved IsNOT helper earlier. NFCI.
Makes it available for more combines to use without adding declarations.

llvm-svn: 367436
2019-07-31 14:36:04 +00:00
Simon Pilgrim 24ad2b5e7d [X86][AVX] Ensure chained subvector insertions are the same size (PR42833)
Before combining insert_subvector(insert_subvector(vec, sub0, c0), sub1, c1) patterns, ensure that the subvectors are all the same type. On AVX512 targets especially we might have a mixture of 128/256 subvector insertions.

llvm-svn: 367429
2019-07-31 12:55:39 +00:00
Amy Huang 53da7ca943 [MS] Emit S_HEAPALLOCSITE debug info in SelectionDAG
Summary: This emits labels around heapallocsite calls in SelectionDAG.

Reviewers: rnk

Subscribers: MatzeB, aprantl, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61105

llvm-svn: 367374
2019-07-31 00:16:13 +00:00
Craig Topper 8b58371fae [X86] Fix mistake in comment. NFC
The code is matching sext not zext.

llvm-svn: 367357
2019-07-30 21:00:24 +00:00
Simon Pilgrim b989bc47c0 [X86] SimplifyDemandedVectorEltsForTargetNode should be calling resolveTargetShuffleInputs not getTargetShuffleMask
Add TODO comment.

llvm-svn: 367318
2019-07-30 15:06:09 +00:00
Simon Pilgrim e4d5423dcd [X86][AVX] SimplifyDemandedVectorElts - handle extraction from X86ISD::SUBV_BROADCAST source (PR42819)
PR42819 showed an issue that we couldn't handle the case where we demanded a 'sub-sub-vector' of the SUBV_BROADCAST 'sub-vector' source.

This patch recognizes these cases and extracts the sub-sub-vector instead of trying to broadcast to a type smaller than the 'sub-vector' source. 

llvm-svn: 367306
2019-07-30 11:35:13 +00:00
Craig Topper 479b45411e [X86] Fix typo in comment. We're looking at a right shift not a left shift. NFC
llvm-svn: 367251
2019-07-29 19:22:51 +00:00
Simon Pilgrim 962c03fac4 [X86] resolveTargetShuffleInputs - add depth to limit recursion.
Avoids slow downs from calls to ComputeNumSignBits/computeKnownBits going too deep.

llvm-svn: 367240
2019-07-29 17:17:58 +00:00
Simon Pilgrim 5ab948f823 [X86] combineX86ShufflesRecursively - start recursion at depth = 0. NFCI.
As discussed on rL367171, we have a problem where the depth recursion used in combineX86ShufflesRecursively was subtly different to computeKnownBits etc. - it starts at Depth=1 instead of Depth=0 like the others and has a different maximum recursion depth.

This NFC patch fixes the recursion depth to start at 0, so we can more easily reuse depth values in calls from combineX86ShufflesRecursively and its helper functions in computeKnownBits etc.

llvm-svn: 367232
2019-07-29 15:57:06 +00:00
Craig Topper eb1beabad9 [X86] Don't use PMADDWD for vector add reductions of multiplies if the mul inputs have an additional user.
The pmaddwd inserts a truncate, if that truncate would end up
creating additional instructions instead of making a zext
narrower, then we shouldn't do it.

I've restricted this to only sse4.1 targets since on prior
targets the zext will be done in stages. So the truncate will
probably not create additional instructions. Might need some
more investigation of mul shrinking and the other pmaddwd
transform to be sure this is the right decision.

There might be a slight regression on AVX1 targets due to add
splitting. Hard to say for sure. Maybe we need to look into
using the vector reduction flag to use 2 narrow loads and a
blend instead of extracting and inserting.

llvm-svn: 367198
2019-07-29 01:36:58 +00:00
Craig Topper 894916cac9 [X86] In combineLoopMAddPattern and combineLoopSADPattern, preserve the vector reduction flag on the final add. Handle unrolled loops by letting DAG combine revisit.
This reverts r340478 and r340631 and replaces them with a simpler
method of just letting DAG combine revisit the nodes to handle
the other operand.

llvm-svn: 367195
2019-07-28 18:45:42 +00:00
Simon Pilgrim 353a848473 [X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler (Reapplied)
Recommit rL367100 which was reverted at rL367141. Until PR42777 is fixed, we no longer get the benefits of peeking through bitcasts but it does still remove a GetDemandedBits user and gives us the equivalent combines.

llvm-svn: 367172
2019-07-27 13:30:29 +00:00
Vlad Tsyrklevich 485b8789de Revert "[X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler."
This reverts r367100, it appears to be causing test failures after
Nico's revert of r367091.

llvm-svn: 367141
2019-07-26 18:14:21 +00:00
Simon Pilgrim d93e8ece7b [X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler.
This removes a GetDemandedBits user and allows us to benefit from the DemandedElts propagated through SimplifyDemandedBits.

llvm-svn: 367100
2019-07-26 11:10:20 +00:00
Simon Pilgrim 447fe31964 [X86] concatSubVectors - remove unnecessary args. NFCI.
All these args can be cheaply recomputed and it makes it much easier to use the function as a quick helper.

llvm-svn: 367014
2019-07-25 13:05:46 +00:00
Roman Lebedev 017e272c3a [Codegen] (X & (C l>>/<< Y)) ==/!= 0 --> ((X <</l>> Y) & C) ==/!= 0 fold
Summary:
This was originally reported in D62818.
https://rise4fun.com/Alive/oPH

InstCombine does the opposite fold, in hope that `C l>>/<< Y` expression
will be hoisted out of a loop if `Y` is invariant and `X` is not.
But as it is seen from the diffs here, if it didn't get hoisted,
the produced assembly is almost universally worse.

Much like with my recent "hoist add/sub by/from const" patches,
we should get almost universal win if we hoist constant,
there is almost always an "and/test by imm" instruction,
but "shift of imm" not so much, so we may avoid having to
materialize the immediate, and thus need one less register.
And since we now shift not by constant, but by something else,
the live-range of that something else may reduce.

Special care needs to be applied not to disturb x86 `BT` / hexagon `tstbit`
instruction pattern. And to not get into endless combine loop.

Reviewers: RKSimon, efriedma, t.p.northover, craig.topper, spatel, arsenm

Reviewed By: spatel

Subscribers: hiraditya, MaskRay, wuzish, xbolva00, nikic, nemanjai, jvesely, wdng, nhaehnle, javed.absar, tpr, kristof.beyls, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62871

llvm-svn: 366955
2019-07-24 22:57:22 +00:00
Simon Pilgrim 7d318b2bb1 [DAGCombine] matchBinOpReduction - add partial reduction matching
This patch adds support for recognizing cases where a larger vector type is being used to reduce just the elements in the lower subvector:

e.g. <8 x i32> reduction pattern in a <16 x i32> vector:

<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u>
<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u>

matchBinOpReduction returns the lower extracted subvector in such cases, assuming isExtractSubvectorCheap accepts the extraction.

I've only enabled it for X86 reduction sums so far. I intend to enable it for the bitop/minmax cases in future patches, and eventually I think its worth turning it on all the time. This is mainly just a case of ensuring calls to matchBinOpReduction don't make assumptions on the vector width based on the original vector extraction.

Fixes the x86 partial reduction sum cases in PR33758 and PR42023.

Differential Revision: https://reviews.llvm.org/D65047

llvm-svn: 366933
2019-07-24 17:29:56 +00:00
Craig Topper 76bc3d6e07 [X86] In lowerVectorShuffle, instead of creating a new node to canonicalize the shuffle mask by commuting, just commute the mask and swap V1/V2.
LegalizeDAG tries to legal the DAG by legalizing nodes before
their operands.

If we create a new node, we end up legalizing it after its operands.
This prevents some of the optimizations that can be done when the
operand is a build_vector since the build_vector will have been
legalized to something else.

Differential Revision: https://reviews.llvm.org/D65132

llvm-svn: 366835
2019-07-23 18:46:15 +00:00
Craig Topper 510e6fadaa [X86] When using AND+PACKUS in lowerV16I8Shuffle, generate the build vector directly in v16i8 with the correct 0x00 or 0xFF elements rather than using another VT and bitcasting it.
The build_vector will become a constant pool load. By using the
desired type initially, it ensures we don't generate a bitcast
of the constant pool load which will need to be folded with
the load.

While experimenting with another patch, I noticed that when the
load type and the constant pool type don't match, then
SimplifyDemandedBits can't handle it. While we should probably
fix that, this was a simple way to fix the issue I saw.

llvm-svn: 366732
2019-07-22 19:58:49 +00:00
Simon Pilgrim b3d719e1cf [X86] EltsFromConsecutiveLoads - support common source loads (REAPPLIED)
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Fixed out of bounds load assert identified in rL366501

Differential Revision: https://reviews.llvm.org/D64551

llvm-svn: 366681
2019-07-22 12:44:10 +00:00
Simon Pilgrim 86fa3270ef [X86] SimplifyDemandedVectorEltsForTargetNode - Move SUBV_BROADCAST narrowing handling. NFCI.
Move the narrowing of SUBV_BROADCAST to where we handle all the other opcodes.

llvm-svn: 366660
2019-07-21 19:04:44 +00:00
Simon Pilgrim adec0f2252 [X86][SSE] Use PSADBW to improve vXi8 sum reduction (PR42674)
As detailed on PR42674, we can reduce a vXi8 down until we have the final <8 x i8>, and then use PSADBW with zero, to sum those values. We then extract the bottom i8, discarding any overflow from the upper bits of the i16 result.

llvm-svn: 366636
2019-07-20 15:20:11 +00:00
Reid Kleckner ba9c9e62cb Revert [X86] EltsFromConsecutiveLoads - support common source loads
This reverts r366441 (git commit 48104ef7c9)

This causes clang to fail to compile some file in Skia. Reduction soon.

llvm-svn: 366501
2019-07-18 21:26:41 +00:00
Simon Pilgrim 48104ef7c9 [X86] EltsFromConsecutiveLoads - support common source loads
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Differential Revision: https://reviews.llvm.org/D64551

llvm-svn: 366441
2019-07-18 14:33:25 +00:00
Craig Topper 8da0402210 [X86] Disable combineConcatVectors for vXi1 vectors.
I'm not convinced the code this calls is properly vetted for
vXi1 vectors. Experimental vector widening legalization testing
for D55251 is now hitting an assertion failure inside
EltsFromConsecutiveLoads. This is occurring from a v2i1 load
having a store size different than its VT size. Hopefully
this commit will keep such issues from happening.

llvm-svn: 366405
2019-07-18 06:18:06 +00:00
Craig Topper 61fff7a337 [X86] Make sure we mark 128/256 MLOAD as Legal with VLX when min-legal-vector-width=256 is in effect.
This started triggering an assertion after r364718 when we made
these Custom under AVX2.

llvm-svn: 366382
2019-07-17 22:26:00 +00:00