Commit Graph

7442 Commits

Author SHA1 Message Date
Simon Pilgrim cc65a7a5ea [X86] Improve i8 + 'slow' i16 funnel shift codegen
This is a preliminary patch before I deal with the xor+and issue raised in D77301.

We get much better code for i8/i16 funnel shifts by concatenating the operands together and performing the shift as a double-width type; this avoids repeated use of the shift amount and partial registers.

fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
fshr(x,y,z) -> (((zext(x) << bw) | zext(y)) >> (z & (bw-1))).

Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn
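
As a rough scalar model of the expansion (a hedged sketch with bw = 8; not the actual lowering code):

```cpp
#include <cstdint>

// fshl on i8 via an i16 'concat': shift once, keep the high byte.
// The shift amount is used once and no partial registers are needed.
uint8_t fshl8(uint8_t x, uint8_t y, uint8_t z) {
  uint16_t Concat = (uint16_t(x) << 8) | y; // (zext(x) << bw) | zext(y)
  return uint8_t((Concat << (z & 7)) >> 8); // high byte of the shifted pair
}

// fshr is the same concat shifted right, keeping the low byte.
uint8_t fshr8(uint8_t x, uint8_t y, uint8_t z) {
  uint16_t Concat = (uint16_t(x) << 8) | y;
  return uint8_t(Concat >> (z & 7));        // low byte of the shifted pair
}
```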

This doesn't do as well for i32 cases on x86_64 (the xor+and followup patch is much better) so I haven't bothered with that.

Cases with constant amounts are more dubious as well, so I haven't currently bothered with those - it's these kinds of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift.

Differential Revision: https://reviews.llvm.org/D80466
2020-05-24 08:08:53 +01:00
Craig Topper 7392820f98 [Align] Remove operations on MaybeAlign that asserted that it had a defined value.
If the caller needs to be responsible for making sure the MaybeAlign
has a value, then we should just make the caller convert it to an Align
with operator*.
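
A minimal sketch of the intended call-site pattern (the surrounding function is hypothetical):

```cpp
#include "llvm/Support/Alignment.h"
using namespace llvm;

void useAlignment(MaybeAlign MA) { // hypothetical caller
  // The caller's responsibility: check (or otherwise guarantee) the value...
  assert(MA && "alignment must be known at this point");
  // ...then convert explicitly with operator* to get a plain Align.
  Align A = *MA;
  (void)A; // hand off to whatever needs a defined alignment
}
```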

I explicitly deleted the relational comparison operators that
were being inherited from Optional. It's unclear what the meaning
of comparing two MaybeAligns, where one is defined and the other
isn't, should be. So make the caller responsible for defining the behavior.

I left the ==/!= operators from Optional, but that exposed a
weird quirk: ==/!= between Align and MaybeAlign previously required the
MaybeAlign to be defined. Now we use the operator== from
Optional that takes an Optional and the value.

Differential Revision: https://reviews.llvm.org/D80455
2020-05-22 21:54:28 -07:00
aartbik 645bba8d3d [llvm] [CodeGen] [X86] Fix issues with v4i1 instruction selection
Summary:
Fixes issue
https://bugs.llvm.org/show_bug.cgi?id=45995

Reviewers: mehdi_amini, nicolasvasilache, reidtatge, craig.topper, ftynse, bkramer

Reviewed By: craig.topper

Subscribers: RKSimon, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80231
2020-05-20 11:34:56 -07:00
Arthur Eubanks 8a88755610 Reland [X86] Codegen for preallocated
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.

In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.

This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.

The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.

The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer of the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.

Force any function containing a preallocated call to use the frame
pointer.

Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.

Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).

Aside from the tests added here, I checked that this codegen produces
correct code for something like

```
struct A {
        A();
        A(A&&);
        ~A();
};

void bar() {
        foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```

by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.

Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77689
2020-05-20 11:25:44 -07:00
Arthur Eubanks b8cbff51d3 Revert "[X86] Codegen for preallocated"
This reverts commit 810567dc69.

Some tests are unexpectedly passing
2020-05-20 10:04:55 -07:00
Arthur Eubanks 810567dc69 [X86] Codegen for preallocated
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.

In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.

This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.

The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.

The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer of the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.

Force any function containing a preallocated call to use the frame
pointer.

Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.

Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).

Aside from the tests added here, I checked that this codegen produces
correct code for something like

```
struct A {
        A();
        A(A&&);
        ~A();
};

void bar() {
        foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```

by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77689
2020-05-20 09:20:38 -07:00
QingShan Zhang 2b59e9f1bd [DAGCombine] Remove getNegatibleCost to avoid getting out of sync with getNegatedExpression
We have getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, while negating the expression the cost might change as we are changing the DAG,
and we then hit the assertion if we negated the wrong expression because the cost is no longer trustworthy.

This patch removes getNegatibleCost, to avoid it getting out of sync with getNegatedExpression,
and instead checks the cost while negating the expression. It also reduces the duplicated code between
getNegatibleCost and getNegatedExpression, and fixes the crash for the test in D76638.

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D77319
2020-05-20 02:12:16 +00:00
Benjamin Kramer 350dadaa8a Give helpers internal linkage. NFC. 2020-05-19 22:16:37 +02:00
Sanjay Patel 57c3fe76a3 [x86] favor vector constant load to avoid GPR to XMM transfer
This build vector lowering pattern came up in D79886.
I've tried to limit the improvement to cases where it looks
clearly better to load, but we could remove the 'TODO'
predicates already if we are willing to overlook some
corner cases.

Differential Revision: https://reviews.llvm.org/D80013
2020-05-17 11:56:26 -04:00
Simon Pilgrim 6f02633a4f [X86] Add getTargetConstantFromBasePtr helper. NFC.
Allows us to share code from LoadSDNode and MemIntrinsicSDNode constant pool loads.
2020-05-17 14:58:31 +01:00
Simon Pilgrim 9aca5b68ee [X86] getTargetConstantBitsFromNode - remove unnecessary X86ISD::VBROADCAST handling.
We create X86ISD::VBROADCAST_LOAD for constant pool folds now.
2020-05-17 14:58:30 +01:00
Simon Pilgrim 330b7491d5 [X86] Remove some duplicate ConstantSDNode casts. NFC.
Avoid repeated isa<> and cast<> by just performing a dyn_cast<ConstantSDNode>
2020-05-15 18:23:35 +01:00
Simon Pilgrim 9825d3daa8 [X86] Use getConstantOperandVal helper in a few places. NFC.
Avoid raw cast<ConstantSDNode> calls.
2020-05-15 17:31:27 +01:00
Simon Pilgrim 4580b0f5b6 [X86] getFauxShuffle - remove (unused) ISD::TRUNCATE shuffle decoding. 2020-05-15 17:31:26 +01:00
Ties Stuij 8c24f33158 [IR][BFloat] Add BFloat IR type
Summary:
The BFloat IR type is introduced to provide support for, initially, the BFloat16
datatype introduced with the Armv8.6 architecture (optional from Armv8.2
onwards). It has an 8-bit exponent and a 7-bit mantissa and behaves like an IEEE
754 floating point IR type.
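
As a rough illustration of that layout (not part of this patch), a bfloat16 value is the top half of an IEEE-754 float32, so a truncating conversion is just a 16-bit shift (real conversions typically round to nearest even rather than truncate):

```cpp
#include <cstdint>
#include <cstring>

uint16_t floatToBFloat16Trunc(float F) {
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return uint16_t(Bits >> 16); // sign (1) + exponent (8) + mantissa (7)
}
```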

This is part of a patch series upstreaming Armv8.6 features. Subsequent patches
will upstream intrinsics support and C-lang support for BFloat.

Reviewers: SjoerdMeijer, rjmccall, rsmith, liutianle, RKSimon, craig.topper, jfb, LukeGeeson, sdesmalen, deadalnix, ctetreau

Subscribers: hiraditya, llvm-commits, danielkiss, arphaman, kristof.beyls, dexonsmith

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D78190
2020-05-15 14:43:43 +01:00
Simon Pilgrim 1024e82469 X86ISelLowering.cpp - remove non-constant EXTRACT_SUBVECTOR/INSERT_SUBVECTOR handling. NFC.
Now that D79814 has landed, we can assume that subvector ops use constant, in-range indices.
2020-05-15 11:50:00 +01:00
Craig Topper 2fdeee9c82 [X86] Add support for forming vXi16 PMULH instructions from shifts.
We already form PMULH when the shift is truncated. But we can
also do it from just a shift by extending the result.
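
The per-element idiom being matched looks roughly like this scalar code (illustrative only):

```cpp
#include <cstdint>

// High 16 bits of a widening 16x16->32 signed multiply: one PMULHW lane.
int16_t mulhs16(int16_t a, int16_t b) {
  return int16_t((int32_t(a) * int32_t(b)) >> 16);
}
```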

Unfortunately, I get regressions if I try to replace the truncate
combine with this as we turn the truncate into a more complicated
sequence first. Then we are unable to combine that sequence with
the extend produced at the end of this combine.

Differential Revision: https://reviews.llvm.org/D79682
2020-05-14 10:58:00 -07:00
Sanjay Patel 26e742fd84 [x86][CGP] improve sinking of splatted vector shift amount operand
Expands on the enablement of the shouldSinkOperands() TLI hook in:
D79718

The last codegen/IR test diff shows what I suspected could happen - we were
sinking all splat shift operands into a loop. But that's not what we want in
general; we only want to sink the *shift amount* operand if it is a splat.

Differential Revision: https://reviews.llvm.org/D79827
2020-05-14 08:36:03 -04:00
Craig Topper 028bfdd891 [X86] Only allow f32, f64, or f80 to be used with 'f' inline assembly constraint.
Avoids a crash when using i128. Gives a better error than
'scalar-to-vector conversion failed' for other types.
2020-05-13 13:27:13 -07:00
Craig Topper 38e0ab2f3a [X86] Don't allow f80 to be used with the 'q', 'r', 'l', 'Q' or 'R' inline assembly constraints.
It was previously trying to use the 64-bit class, but 80 isn't
evenly divisible by 64 so it will trigger a crash.
2020-05-13 12:19:57 -07:00
Craig Topper 47985451ed [X86] Make the if statement structure for inline assembly constraints 'l', 'r', 'q', 'Q', and 'R' the same.
These did similar things but had slight differences. For example
'Q' didn't allow f64, but the others did.
2020-05-13 12:19:57 -07:00
Sanjay Patel f490ca76b0 [x86][CGP] enable target hook to sink funnel shift intrinsic's splatted shift amount
SDAG suffers when it can't see that a funnel operand is a splat value
(due to single-basic-block visibility), so invert the normal loop
hoisting rules to move a splat op closer to its use.

This would be part 1 of an enhancement similar to D63233.

This is needed to re-fix PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426
...because we got better at canonicalizing IR to funnel shift intrinsics.

The existing CGP code for shift opcodes is likely overstepping what it was
intended to do, so that will be fixed in a follow-up.

Differential Revision: https://reviews.llvm.org/D79718
2020-05-12 18:40:40 -04:00
Alexey Lapshin 1c44430e73 Fix buildbots #2 after aa1eb5152d. 2020-05-13 01:23:39 +03:00
Alexey Lapshin 293c6d3821 Fix buildbots after aa1eb5152d. 2020-05-13 01:11:01 +03:00
Alexey Lapshin aa1eb5152d [X86][ISelLowering] refactor Varargs handling in X86ISelLowering.cpp
Summary:
This patch refactors handling of VarArgs in
X86TargetLowering::LowerFormalArguments.
That refactoring was requested while reviewing
D69372. Code related to varargs handling is removed
from X86TargetLowering::LowerFormalArguments and
is divided into smaller routines.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D74794
2020-05-13 00:32:00 +03:00
Craig Topper 01636c1eea [X86] Remove the v16i8->v16i16 path for MULHS with AVX2.
We have a couple of main strategies for legalizing MULH.

-If the vXi16 type is legal, extend to do the full i16 multiply
and then shift and truncate the results.
-Use unpcks to split each 128 bit lane into high and low halves.

For signed we have an extra case to split a v32i8 to v16i8 and then
use the extending to v16i16 strategy.

This patch proposes to use the unpck strategy instead. Which is
what we already do for unsigned.

This seems to be 1 instruction shorter when the RHS is constant
like the idiv case. It's 1 instruction longer for the smulo case.
But we're trading cross lane shuffles for inlane shuffles and a
shift.

Differential Revision: https://reviews.llvm.org/D79652
2020-05-12 10:32:01 -07:00
Simon Pilgrim 0387df7f02 [X86] combineX86ShuffleChain - use narrowShuffleMaskElts scale == 1 builtin handling. NFC.
narrowShuffleMaskElts already has the fast-path for scale == 1, no need to reimplement it here.
2020-05-12 13:45:40 +01:00
Simon Pilgrim 45aa1b8853 [X86][AVX] Use X86ISD::VPERM2X128 for blend-with-zero if optimizing for size
Last part of PR22984 - avoid the zero-register dependency if optimizing for size
2020-05-12 13:03:50 +01:00
Sam McCall 728cf6d86b Revert "[DAGCombine] Remove getNegatibleCost to avoid getting out of sync with getNegatedExpression"
This reverts commit 3c44c441db.

Causes infloops on some inputs, see https://reviews.llvm.org/D77319 for repro
2020-05-11 16:44:01 +02:00
QingShan Zhang 3c44c441db [DAGCombine] Remove getNegatibleCost to avoid getting out of sync with getNegatedExpression
We have getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, while negating the expression the cost might change as we are changing the DAG,
and we then hit the assertion if we negated the wrong expression because the cost is no longer trustworthy.

This patch removes getNegatibleCost, to avoid it getting out of sync with getNegatedExpression,
and instead checks the cost while negating the expression. It also reduces the duplicated code between
getNegatibleCost and getNegatedExpression, and fixes the crash for the test in D76638.

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D77319
2020-05-11 02:41:10 +00:00
Fangrui Song f40fc7b8d6 [X86] Fix combineVectorCompareAndMaskUnaryOp regression after 0e8e731449 2020-05-10 19:03:38 -07:00
Simon Pilgrim 9237d88001 [X86] isVectorShiftByScalarCheap - don't limit fast XOP vector shifts to 128-bit vectors
XOP targets have fast per-element vector shifts and we're better off splitting to 128-bit shifts where necessary (which is what we already do in LowerShift).
2020-05-09 22:24:08 +01:00
Craig Topper 56bf0b58c2 [X86] Add an assert that v32i16/v64i8 splitting in LowerVSETCC should only occur when AVX512BW is disabled. NFC
With BWI we should only get a v32i1/v64i1 result type.
2020-05-09 11:24:16 -07:00
Simon Pilgrim 0e8e731449 [X86] Allow combineVectorCompareAndMaskUnaryOp to handle 'all-bits' general case
For the sint_to_fp(and(X,C)) -> and(X,sint_to_fp(C)) fold, allow combineVectorCompareAndMaskUnaryOp to match any X that ComputeNumSignBits says is all-bits, not just SETCC.

Noticed while investigating mask promotion issues in PR45808
2020-05-09 14:53:25 +01:00
Craig Topper bebdc62c3f [SelectionDAG] Remove ConstantPoolSDNode::getAlignment.
Use getAlign instead.

Differential Revision: https://reviews.llvm.org/D79459
2020-05-08 16:04:11 -07:00
Craig Topper d1119980e5 [SelectionDAG] Use Align/MaybeAlign for ConstantPoolSDNode.
This patch stores the alignment for ConstantPoolSDNode as an
Align and updates the getConstantPool interface to take a MaybeAlign.

Removing getAlignment() will be done as a follow up.

Differential Revision: https://reviews.llvm.org/D79436
2020-05-08 16:04:11 -07:00
Simon Pilgrim 5f9f37c42a [X86][AVX] Don't let X86ISD::BROADCAST peek through bitcasts to illegal types.
This was an existing bug exposed by the more aggressive X86ISD::BROADCAST generation by rG8817334ce3c7

Original test case thanks to @mstorsjo
2020-05-08 12:30:50 +01:00
Simon Pilgrim b8a725274c [X86][AVX] combineSignExtendInReg - promote mask arithmetic before v4i64 canonicalization
We rely on the combine

(sext_in_reg (v4i64 a/sext (v4i32 x)), v4i1) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x, ExtraVT)))

to avoid complex v4i64 ashr codegen, but doing so prevents v4i64 comparison mask promotion, so ensure we attempt to promote before canonicalizing the (hopefully now redundant sext_in_reg).

Helps with the poor codegen in PR45808.
2020-05-07 13:16:36 +01:00
Craig Topper 350645594e [X86] Enable combinePMULH to match multiplies with elements larger than i32.
We're truncating so the extra bits will be discarded.
2020-05-06 23:13:59 -07:00
Craig Topper 16c800b8b7 [X86] Remove support for Y0 constraint as an alias for Yz in inline assembly.
Neither gcc nor icc supports this. Split out from D79472. I want
to remove more, but it looks like icc does support some things
gcc doesn't, and I need to double check our internal test suites.
2020-05-06 14:58:53 -07:00
Craig Topper 9bb9ff0957 [X86] Remove incomplete support for 'Y' as an inline assembly constraint by itself.
Y is the start of several 2 letter constraints, but we also had
partial support to recognize it by itself. But it doesn't look
like it can get through clang as a single letter so the backend
support for this was effectively dead.
2020-05-06 14:23:04 -07:00
Simon Pilgrim 8817334ce3 [X86] getShuffleScalarElt - add CONCAT_VECTORS/INSERT_VECTOR_ELT support.
This helped fix some i686 vXi64 broadcast folds that were becoming v2Xi32 broadcasts because we didn't match the broadcast until after SimplifyDemandedBits worked out we only used the bottom 32-bits in PMUL(U)DQ and type legalization had split the original i64 load.

A couple of regressions occurred which required some fixups - adding concat_vectors(broadcast_load,broadcast_load) splat support and recognising (unnecessary) unary shuffles of already broadcasted vectors.

This came about as part of the work investigating vector load combining from shuffles for PR42550.
2020-05-06 18:13:33 +01:00
Simon Pilgrim 8c71c2291e [X86] getShuffleScalarElt - consistently use SDValue. NFC.
We never need to call this from anything but ISD::SHUFFLE_VECTOR or target shuffles so shouldn't need to address SDNode directly.
2020-05-06 18:13:33 +01:00
Simon Pilgrim f5f7fd990e [X86][SSE] combineX86ShuffleChain - remove unused shuffle(vzext_load(),undef) combine.
This should always be caught by the various VZEXT_MOVL handling in combineTargetShuffle and SimplifyDemandedVectorEltsForTargetNode.
2020-05-06 15:20:29 +01:00
Simon Pilgrim 8650b36935 [X86][SSE] Move VZEXT_MOVL removal into SimplifyDemandedVectorEltsForTargetNode
This patch replaces the VZEXT_MOVL removal from combineShuffle with a more general version based in SimplifyDemandedVectorEltsForTargetNode.

By using computeKnownBits we can always remove the VZEXT_MOVL if the upper elements of the source operand are known to be zero.

This requires us to add the conversion ops to computeKnownBitsForTargetNode as well.

Reviewed By: @craig.topper

Differential Revision: https://reviews.llvm.org/D79335
2020-05-06 14:05:07 +01:00
Simon Pilgrim 1c4f118d89 [X86][SSE] getShuffleScalarElt - minor NFC cleanup.
Use SelectionDAG::MaxRecursionDepth instead of (equal) hard coded constant.

clang-format
2020-05-06 14:05:07 +01:00
Craig Topper b55009df66 [X86] Add v32i16/v64i8 into the handling for 512-bit inline assembly constraints. 2020-05-05 21:41:31 -07:00
Craig Topper 0fac1c1912 [X86] Allow Yz inline assembly constraint to choose ymm0 or zmm0 when avx/avx512 are enabled and type is 256 or 512 bits
gcc supports selecting ymm0/zmm0 for the Yz constraint when used with 256 or 512 bit vector types.

Fixes PR45806

Differential Revision: https://reviews.llvm.org/D79448
2020-05-05 21:12:30 -07:00
Craig Topper a4286fc952 [X86] Fix usage of Align constructing MachineMemOperands.
Similar to D77687, but for the X86 specific code.

Differential Revision: https://reviews.llvm.org/D79381
2020-05-05 15:25:02 -07:00
Simon Pilgrim e53d4869a0 [X86][AVX] combineVectorSignBitsTruncation - avoid complex vXi64->vXi32 PACKSS truncations (PR45794)
Unless we're truncating an 'all-bits' result, using PACKSS for vXi64->vXi32 truncation causes problems with later combines as ComputeNumSignBits struggles to see through BITCASTs to smaller types. If we don't use PACKSS in these cases then we fall back to shuffles, which are usually just as good.
2020-05-05 11:57:25 +01:00
Simon Pilgrim 4b9d75c1ac [X86][SSE] Move some VZEXT_MOVL combines into combineTargetShuffle. NFC.
Minor cleanup of combineShuffle by moving some of the low hanging fruit (load + scalar_to_vector folds).
2020-05-04 15:13:44 +01:00
Craig Topper 243ffc0e65 [X86] Simplify some code in combineTruncatedArithmetic. NFC
We haven't promoted AND/OR/XOR to vXi64 types for a while, so
there's no reason to use isOperationLegalOrPromote. We can
just use isOperationLegal by merging with the ADD handling.
2020-05-03 23:53:10 -07:00
Craig Topper 8b53fdd3b6 [X86] Custom legalize v16i64->v16i8 truncate with avx512.
Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.

Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm. Then concat the two halves to use vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.

I had to add SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.

Differential Revision: https://reviews.llvm.org/D79231
2020-05-03 23:26:04 -07:00
Simon Pilgrim 7c203163c7 [X86] Use splitVector helper in truncateVectorWithPACK/splitVectorStore/combineHorizontalMinMaxResult/combineReductionToHorizontal. NFC.
All these locations were performing the same type splitting/extractSubVector calls as the splitVector helper.
2020-05-03 13:40:38 +01:00
Simon Pilgrim e8d9794a23 [X86] Don't limit splitVector helper to simple types.
It can handle EVT just as well (and so can the extractSubVector calls).
2020-05-03 12:27:37 +01:00
Simon Pilgrim 74e9952c8e [X86][SSE] splitAndLowerShuffle - use splitVector helper. NFC.
The splitVector helper uses extractSubVector which splits build vectors like we do here, so avoid reimplementing it.

splitVector could easily be extended to peek through bitcasts as well but I'd prefer to keep this commit NFC.
2020-05-03 11:26:51 +01:00
Simon Pilgrim 4d2b0ebd17 [X86] detectAVGPattern - use matchUnaryPredicate helper. NFC.
Use the ISD::matchUnaryPredicate helper to check for inrange constants.
2020-05-03 11:26:51 +01:00
Simon Pilgrim 8cbd8194c1 [X86] Improving folding of concat_vectors of subvectors from the same broadcast
Handle concat_vectors(extract_subvector(broadcast(x)), extract_subvector(broadcast(x))) -> broadcast(x)

To expose this we also need collectConcatOps to recognise the insert_subvector(x, extract_subvector(x, lo), hi) subvector splat pattern
2020-05-01 11:23:10 +01:00
Craig Topper ed7479b635 [X86] Update type actions for ISD::TRUNCATE with avx512f to be Legal when possible. NFCI
The Custom handler wasn't doing anything for these cases anyway.
2020-04-30 23:27:29 -07:00
Craig Topper 6a1ad76dab [X86] Don't return true from isTruncateFree for vectors
Also fix some cost tables for vXi1 types to match the cost entries for the types they will be promoted to.

Differential Revision: https://reviews.llvm.org/D79045
2020-04-30 16:43:35 -07:00
Simon Pilgrim bf468f4349 [X86][SSE] Canonicalize UNARYSHUFFLE(XOR(X,-1)) -> XOR(UNARYSHUFFLE(X),-1)
This pushes the NOT pattern up the DAG to help expose it for further combines (AND->ANDN in particular).

The PSHUFD/MOVDDUP 'splat' cases are the only ones I've seen in the wild so far, we can further generalize if/when we need to.
2020-04-30 19:18:51 +01:00
Simon Pilgrim 30211c4783 [X86] combineANDXORWithAllOnesIntoANDNP - add BROADCAST handling
Fold BROADCAST(NOT(Y)) -> BROADCAST(Y) as part of finding a NOT inversion.
2020-04-30 16:24:17 +01:00
Simon Pilgrim 96238486ed [DAGCombine] Move the remaining X86 funnel shift patterns to DAGCombine
X86 matches several 'shift+xor' funnel shift patterns:

  fold (or (srl (srl x1, 1), (xor y, 31)), (shl x0, y))  -> (fshl x0, x1, y)
  fold (or (shl (shl x0, 1), (xor y, 31)), (srl x1, y))  -> (fshr x0, x1, y)
  fold (or (shl (add x0, x0), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
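
A sketch of why the two-step shift in these patterns works (32-bit example; illustrative, not the combine code): for y in [0,31], 1 + (y ^ 31) == 32 - y, and each individual shift stays in range even when y == 0.

```cpp
#include <cstdint>

uint32_t fshl32(uint32_t x0, uint32_t x1, uint32_t y) {
  y &= 31;
  // (x1 >> 1) >> (y ^ 31) == x1 >> (32 - y), but still defined when y == 0.
  return (x0 << y) | ((x1 >> 1) >> (y ^ 31));
}
```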

These patterns are also what we end up with under the proposed expansion changes in D77301.

This patch moves these to DAGCombine's generic MatchFunnelPosNeg.

All existing X86 test cases still pass, and we just have a small codegen change in pr32282.ll.

Reviewed By: @spatel

Differential Revision: https://reviews.llvm.org/D78935
2020-04-30 12:57:17 +01:00
Craig Topper 9d4bcc3a60 [X86] Merge the last of the useBWIRegs() section into the useAVX512Regs() section of the X86TargetLowering constructor. NFC
This section is the remnant of how this code was structured before
we made v32i16/v64i8 legal types with avx512f when not restricting
to 256 bit vectors. Now that there are just a few items left,
merge them near similar things in the other section.
2020-04-29 14:40:04 -07:00
Craig Topper 0de7ddbfb0 [X86] Handle more cases in combineAddOrSubToADCOrSBB.
This adds support for

X + SETAE --> sbb X, -1
X - SETAE --> adc X, -1
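
A sketch of the flag algebra behind these folds (SETAE materializes the condition CF == 0, i.e. the value 1 - CF):

```cpp
#include <cstdint>

// sbb x, -1 computes x - (-1) - CF = x + 1 - CF = x + SETAE
uint32_t addSetae(uint32_t x, bool cf) { return x - uint32_t(-1) - cf; }

// adc x, -1 computes x + (-1) + CF = x - 1 + CF = x - SETAE
uint32_t subSetae(uint32_t x, bool cf) { return x + uint32_t(-1) + cf; }
```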

Fixes PR45700

Differential Revision: https://reviews.llvm.org/D78984
2020-04-28 10:39:39 -07:00
Craig Topper d42192c50f [X86][CostModel] Correct the costs for truncate to a mask register with avx512
I've modified isTruncateFree to get an accurate cost for types that need to be split. I'm planning to look into fixing it for all vectors, but need more cost cleanups first.

Differential Revision: https://reviews.llvm.org/D78973
2020-04-28 10:39:36 -07:00
Craig Topper a58b62b4a2 [IR] Replace all uses of CallBase::getCalledValue() with getCalledOperand().
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().

I also made a few cleanups in here. For example, removing uses
of getElementType on a pointer where we could just use getFunctionType
from the call.

Differential Revision: https://reviews.llvm.org/D78882
2020-04-27 22:17:03 -07:00
Simon Pilgrim d9e174dbf7 [X86][SSE] getFauxShuffle - account for PEXTW/PEXTB implicit zero-extension
The insert(truncate/extend(extract(vec0,c0)),vec1,c1) case in rGacbc5ede99 wasn't combining the 'mineltsize' with the src vector elt size which may be smaller due to implicit extension during extraction.

Reduced from test case provided by @mstorsjo
2020-04-27 12:46:50 +01:00
QingShan Zhang 2957fa0cd1 [NFC][DAGCombine] Add three helper functions and rename getNegatedExpression to negateExpression
This is an NFC patch for D77319. The idea is to hide getNegatibleCost inside getNegatedExpression()
by having it return null if the cost is expensive, and to add some helper functions for ease of use.
The old getNegatedExpression is also renamed to negateExpression to avoid the semantic conflict.

Reviewed By: RKSimon

Differential revision: https://reviews.llvm.org/D78291
2020-04-27 04:11:42 +00:00
Simon Pilgrim acbc5ede99 [X86][SSE] getFauxShuffle - support insert(truncate/extend(extract(vec0,c0)),vec1,c1) shuffle patterns at the byte level
Followup to the PR45604 fix at rGe71dd7c011a3 where we disabled most of these cases.

By creating the shuffle at the byte level we can handle any extension/truncation as long as we track how small the scalar got and assume that the upper bytes will need to be zero.
2020-04-26 15:31:01 +01:00
Craig Topper c1cb733db6 [X86] Improve lowering of v16i8->v16i1 truncate under prefer-vector-width=256. 2020-04-25 15:20:33 -07:00
Sanjay Patel 7f4ff782d4 [x86] use vector instructions to lower even more FP->int->FP casts
This is another enhancement to D77895/D78362
to avoid a round-trip from XMM->GPR->XMM.
This time we handle the case of starting/ending with different FP types
but always with signed i32 as the intermediate value.
I think this covers all of the faux vector optimization possibilities
for pre-AVX512.

There is at least 1 other transform mentioned in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c19
...where we fold an 'fpext' into a preceding 'sitofp'. I think we will
want to handle that earlier (DAGCombiner or instcombine) because that's
a target-independent optimization.

Differential Revision: https://reviews.llvm.org/D78758
2020-04-25 11:38:54 -04:00
Vedant Kumar 10ce1bc8d0 [MachineBasicBlock] Add helpers for skipping debug instructions [1/14]
Summary:
These helpers are exercised by follow-up commits in this patch series,
which is all about removing CodeGen differences with vs. without debug
info in the AArch64 backend.

Reviewers: fhahn, aprantl, jpaquette, paquette

Subscribers: kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D78260
2020-04-22 17:03:39 -07:00
aartbik 8387bee94d [llvm] [X86] Fixed type bug in vselect for AVX masked load
Summary:
Bugzilla issue 45563
https://bugs.llvm.org/show_bug.cgi?id=45563

Reviewers: nicolasvasilache, mehdi_amini, craig.topper

Reviewed By: craig.topper

Subscribers: RKSimon, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D78527
2020-04-21 11:11:35 -07:00
Simon Pilgrim c74acd8fc9 X86ISelLowering.cpp - clang-format to fix col80 limit. NFC. 2020-04-21 15:18:23 +01:00
Craig Topper d7e2d937bc [X86] Add X86ISD nodes for PDEP and PEXT.
This will allow use to add DAG combines for these instructions.
2020-04-19 16:14:13 -07:00
Sanjay Patel 720015e537 [x86] avoid build warning for enum mismatch; NFC
gcc may warn here because X86ISD::NodeType is specified as "unsigned",
but ISD::NodeType is a naked C enum (although passed as an "unsigned"
throughout SDAG).
2020-04-19 10:22:11 -04:00
Simon Pilgrim e71dd7c011 [X86][SSE] getFauxShuffle - don't combine shuffles with small truncated scalars (PR45604)
getFauxShuffle attempts to combine INSERT_VECTOR_ELT(TRUNCATE/EXTEND(EXTRACT_VECTOR_ELT(x))) patterns into a target shuffle chain.

PR45604 identified an issue where the scalar was truncated to a size smaller than the destination vector element and then zero extended back, which requires the upper bits to be zero'd which we don't currently do.

To avoid the bug I've added an early out in these truncation cases, a future commit should allow us to handle this by inserting the necessary SM_SentinelZero padding.
2020-04-19 13:35:22 +01:00
Sanjay Patel cceb630a07 [x86] use vector instructions to lower more FP->int->FP casts
This is an enhancement to D77895 to avoid another
round-trip from XMM->GPR->XMM. This time we handle
the case of starting/ending with an f64 and casting
to signed i32 as the intermediate value.

It's a bit more involved than I initially assumed
because we need to use target-specific opcodes to
represent the non-standard cast ops.

Differential Revision: https://reviews.llvm.org/D78362
2020-04-19 08:33:17 -04:00
Simon Pilgrim 9559557014 X86InstrFMA3Info.h - remove unnecessary includes. NFC.
There were a number of cpp files explicitly relying on X86InstrFMA3Info.h to include the X86.h header - so I've had to add it locally.
2020-04-19 12:17:56 +01:00
Sanjay Patel 818126ae97 [x86] rename variables for types for readability; NFC
This gets harder to follow if we allow changing types/sizes
between source, dest, and intermediate value.
2020-04-17 08:41:18 -04:00
Christopher Tetreault 85247c1e89 [SVE] Remove calls to getBitWidth from x86
Reviewers: efriedma, RKSimon, sdesmalen

Reviewed By: RKSimon

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77901
2020-04-15 15:48:48 -07:00
Craig Topper 8dfb9627b7 [X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead.
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.

This does change the ABI for vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.

Cost model is still a bit of a mess. In some places I tried to
match existing behavior. But really we need to account for
splitting and concatenating costs. Cost model for shuffles is
especially pessimistic.

Differential Revision: https://reviews.llvm.org/D76212
2020-04-15 12:17:18 -07:00
Craig Topper a916e81927 [X86] Various improvements to our vector splitting helpers for lowering. NFC
-Consistently name the functions as split*
-Add a helper for doing the two extractSubvector calls and determining the size of the split
-Use getSplitDestVTs to get the result type for the split node.
-Move the binary and unary helper to one place in the file near the extractSubvector functions. Left the VSETCC one near LowerVSETCC since that's its only caller.
-Remove the 256/512 wrappers that just had asserts. I don't think they provided a lot of value and now with the routines called split* the call sites are more obvious what they do.
-Make the unary routine support different source and dest types to support D76212.
-Add some weaker asserts into the helpers to make up for losing the very specific asserts from the 256/512 wrappers.

Differential Revision: https://reviews.llvm.org/D78176
2020-04-15 10:57:53 -07:00
Georgii Rymar 1647ff6e27 [ADT/STLExtras.h] - Add llvm::is_sorted wrapper and update callers.
It can be used to avoid passing the begin and end of a range.
This makes the code shorter and is consistent with other
wrappers we already have.

Differential revision: https://reviews.llvm.org/D78016
2020-04-14 14:11:02 +03:00
Craig Topper 113f37a1f9 [CallSite removal][TargetLowering] Replace ImmutableCallSite with CallBase
Differential Revision: https://reviews.llvm.org/D77995
2020-04-13 13:50:15 -07:00
Craig Topper 6dbf1a1229 [X86] Move X86ShuffleDecode.cpp/h into MCTargetDesc and remove X86Utils library. NFC
The shuffle decoding is used by X86ISelLowering and
MCTargetDesc/X86InstComments. The latter used to be in a
separate InstPrinter library. The Utils library existed to allow
InstPrinter and CodeGen to share the shuffle decoding. Since
X86InstComments now lives in the MCTargetDesc, which CodeGen
already depends on, we can sink the shuffle decoding there as well.

Differential Revision: https://reviews.llvm.org/D77980
2020-04-13 10:14:08 -07:00
Jay Foad bc78baec4c [X86] Improve combineVectorShiftImm
Summary:
Fold (shift (shift X, C2), C1) -> (shift X, (C1 + C2)) for logical as
well as arithmetic shifts. This is needed to prevent regressions from
an upcoming funnel shift expansion change.

While we're here, fold (VSRAI -1, C) -> -1 too.
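
A scalar instance of the logical-shift fold (illustrative; valid as long as the combined amount stays below the bit width):

```cpp
#include <cstdint>

uint32_t twoShifts(uint32_t x) {
  return (x >> 3) >> 2; // folds to x >> 5
}
// And (VSRAI -1, C) -> -1 holds because an arithmetic right shift of
// all-ones only ever shifts in more sign (one) bits.
```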

Reviewers: RKSimon, craig.topper

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77300
2020-04-13 15:54:55 +01:00
Simon Pilgrim 401cbe373b [X86][AVX] Attempt to scale masked shuffles to match the root type
Improve the chances of folding the writemask into the combined shuffle by scaling a wider shuffle mask to match the root's original type.

This creates a few minor issues with variable shuffles, preventing combines of shuffles because of the more limited support for binary shuffle types. In most cases we're probably better off combining the shuffles and losing the writemask fold, but this isn't always going to be true.
2020-04-13 14:57:25 +01:00
Simon Pilgrim fdd9ff9700 [X86][AVX] Create splitVectorIntBinary helper.
Removes duplicate code from split256IntArith/split512IntArith.
2020-04-13 13:09:38 +01:00
Sanjay Patel d04db4825a [x86] use vector instructions to lower FP->int->FP casts
As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM
by using the vector versions of the convert instructions.
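
The scalar pattern being improved looks like this (a hedged example; with this change the value can stay in an XMM register using the packed cvttpd2dq/cvtdq2pd forms instead of bouncing through a GPR):

```cpp
double truncToIntAndBack(double x) {
  return double(int(x)); // fptosi followed by sitofp
}
```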

Based on experimental results from recent Intel/AMD chips, we don't
need to worry about triggering denorm stalls while operating on
garbage data in the high lanes with convert instructions, so this is
expected to always be as good or better perf than the scalar
instruction equivalent. FP exceptions are also not a concern because
strict code should not be using the regular SDAG opcodes.

Differential Revision: https://reviews.llvm.org/D77895
2020-04-12 10:26:43 -04:00
Simon Pilgrim 40581a0a2b [X86] Use isAnyZero shuffle mask helper where possible. NFC. 2020-04-12 10:57:29 +01:00
Craig Topper d3465e0691 [X86] Enable shuffle combining for AVX512 unless the root is used by a vselect
A lot of vectorized code doesn't use masks so we shouldn't penalize them by not doing shuffle combining on avx512 targets.

I've added support for VALIGNQ/VALIGND and 512-bit SHUF128 to prevent some regressions. I also prevented recombining 256-bit SHUF128 to PERM2X128. We may not need to add 256-bit SHUF128 support, but I don't think I found any cases requiring that in my testing.

Differential Revision: https://reviews.llvm.org/D77928
2020-04-11 20:05:10 -07:00
Sanjay Patel 1318ddbc14 [VectorUtils] rename scaleShuffleMask to narrowShuffleMaskElts; NFC
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.

While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
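
For reference, the narrowing itself expands each wide mask element into 'scale' consecutive narrow elements (a simplified sketch, not the LLVM implementation):

```cpp
#include <vector>

std::vector<int> narrowMask(const std::vector<int> &Mask, int Scale) {
  std::vector<int> Out;
  for (int M : Mask)
    for (int I = 0; I != Scale; ++I)
      Out.push_back(M < 0 ? M : M * Scale + I); // keep sentinels (-1) as-is
  return Out;
}
// e.g. narrowMask({1, 0}, 2) yields {2, 3, 0, 1}.
```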
2020-04-11 10:05:49 -04:00
Craig Topper a6732069ee [CallSite removal][X86] Remove unneeded use of CallSite. NFC
We already have a CallInst, we can just get the calling convention from it.
2020-04-10 10:27:21 -07:00
Jay Foad c63aed890e [KnownBits] Move AND, OR and XOR logic into KnownBits
Summary:
There are at least three clients for KnownBits calculations:
ValueTracking, SelectionDAG and GlobalISel. To reduce duplication the
common logic should be moved out of these clients and into KnownBits
itself.

This patch does this for AND, OR and XOR calculations by implementing
and using appropriate operator overloads KnownBits::operator& etc.
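
The underlying bit algebra is standard; a simplified stand-in (llvm::KnownBits uses APInt Zero/One masks, this sketch uses uint32_t):

```cpp
#include <cstdint>

struct KB { uint32_t Zero, One; }; // known-0 and known-1 masks

KB knownAnd(KB A, KB B) { return {A.Zero | B.Zero, A.One & B.One}; }
KB knownOr (KB A, KB B) { return {A.Zero & B.Zero, A.One | B.One}; }
KB knownXor(KB A, KB B) {
  return {(A.Zero & B.Zero) | (A.One & B.One),   // bits known equal -> 0
          (A.Zero & B.One)  | (A.One & B.Zero)}; // bits known opposite -> 1
}
```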

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74060
2020-04-09 10:10:37 +01:00
Matt Arsenault 84aa58cbe2 CodeGen: Use Register in TargetLowering 2020-04-08 12:10:58 -04:00
Simon Pilgrim 66c18c729d [X86][SSE] Combine PTEST(AND(X,Y),AND(X,Y)) -> PTEST(X,Y) and ANDN equivalents
Tests derived from PR42035 examples
2020-04-08 12:42:22 +01:00
Craig Topper c41685b16f [SelectionDAG] Make getZeroExtendInReg take a vector VT if the operand VT is a vector.
This removes a call to getScalarType from a bunch of call sites.
It also makes the behavior consistent with SIGN_EXTEND_INREG.

Differential Revision: https://reviews.llvm.org/D77631
2020-04-07 11:34:08 -07:00
Simon Pilgrim e3b6059776 [X86][SSE] combineX86ShufflesConstants - early out for zeroable vectors (PR45443)
Shuffle combining can insert zero byte sized elements into the shuffle mask, which combineX86ShufflesConstants will attempt to fold without taking into account whether the byte-sized type is legal (e.g. AVX512F only targets).

If we have a full-zeroable vector then we should just return a zero version of the root type, otherwise if the type isn't valid we should bail.

Fixes PR45443
2020-04-07 14:45:29 +01:00
David Blaikie 5aead592f0 X86ISelLowering: Minor refactor to avoid redundant initialization while ensuring compiler warnings can hopefully still prove initialization
Based on post-commit review/discussion in fabe52a7412b
2020-04-06 14:25:52 -07:00
Simon Pilgrim 9bc5b1a489 [X86][SSE] combineVectorSignBitsTruncation - remove minimum vector length limitations
truncateVectorWithPACK has its own vector length controls, so we can rely on those directly. This helps some existing truncation-to-subvector tests, which were being combined later during shuffle lowering, at which point the sign/zero bit detection had become obscured, preventing lowerShuffleWithPACK from working as well as it could.
2020-04-06 12:45:23 +01:00
Simon Pilgrim a43e233606 Remove unused function 'isInRange'. NFCI. 2020-04-05 23:11:24 +01:00
Simon Pilgrim 4431a29c60 [X86][SSE] Combine unary shuffle(HORIZOP,HORIZOP) -> HORIZOP
We had previously limited the shuffle(HORIZOP,HORIZOP) combine to binary shuffles, but we can often merge unary shuffles just as well, folding in UNDEF/ZERO values into the 64-bit half lanes.

For the (P)HADD/HSUB cases this is limited to fast-horizontal cases but PACKSS/PACKUS combines under all cases.
2020-04-05 22:49:46 +01:00
Benjamin Kramer ff889df356 [X86] Roll some loops. NFCI. 2020-04-05 13:59:50 +02:00
Simon Pilgrim 3079e51858 [X86][SSE] Generalize shuffle(HORIZOP,HORIZOP) -> HORIZOP combine
Our existing combine allows merging the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.

This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
2020-04-05 12:09:19 +01:00
Simon Pilgrim a17de6b91c [X86][SSE] truncateVectorWithPACK - upper undef for 128->64 packing
If we're packing from 128-bits to 64-bits then we don't need the RHS argument. This helps with register allocation, especially as we avoid repeating a use of the input value.
2020-04-05 11:47:36 +01:00
Simon Pilgrim e5e719d885 [X86][SSE] lowerV8I16Shuffle - lower compaction shuffles using PACKUSDW(PBLENDW,PBLENDW) on SSE41+
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.

This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.

Noticed in codegen while dealing with PR31443.
2020-04-04 13:08:25 +01:00
Simon Pilgrim 34a497b765 [X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations
Extend lowerShuffleWithPACK/matchShuffleWithPACK/createPackShuffleMask to handle compaction style shuffle masks that can be lowered to chains of PACKSS/PACKUS if their inputs are suitably sign/zero extended.

This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask should recognise the PACKSS/PACKUS chains.
2020-04-03 18:26:10 +01:00
Scott Constable 5b519cf1fc [X86] Add Indirect Thunk Support to X86 to mitigate Load Value Injection (LVI)
This pass replaces each indirect call/jump with a direct call to a thunk that looks like:

lfence
jmpq *%r11

This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity).
The feature can be enabled via the clang CLI using -mlvi-cfi.

This is an alternate implementation to https://reviews.llvm.org/D75934, which merges the thunk insertion functionality with the existing X86 retpoline code.

Differential Revision: https://reviews.llvm.org/D76812
2020-04-03 00:34:39 -07:00
Scott Constable 71e8021d82 [X86][NFC] Generalize the naming of "Retpoline Thunks" and related code to "Indirect Thunks"
There are applications for indirect call/branch thunks other than retpoline for Spectre v2, e.g.,

https://software.intel.com/security-software-guidance/software-guidance/load-value-injection

Therefore it makes sense to refactor X86RetpolineThunks as a more general capability.

Differential Revision: https://reviews.llvm.org/D76810
2020-04-02 21:55:13 -07:00
Craig Topper 4fdb63bbf0 [X86] Enable combineExtSetcc for vectors larger than 256 bits when we've disabled 512 bit vectors.
The compares are going to be type legalized to 256 bits so we
might as well fold the extend.
2020-04-02 12:44:27 -07:00
Guillaume Chatelet fc63c4d8ce [Alignment][NFC] Remove remaining uses of MachineFrameInfo::setObjectAlignment
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77217
2020-04-01 14:38:05 +00:00
Simon Pilgrim eb8880562e [X86][SSE] combinePTESTCC - fold TESTZ(X,~Y) -> TESTC(Y,X) 2020-04-01 15:10:53 +01:00
Guillaume Chatelet 3a78f44daf [Alignment][NFC] Convert SelectionDAG::InferPtrAlignment to MaybeAlign
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77212
2020-04-01 13:22:11 +00:00
Simon Pilgrim 481413d394 [X86][SSE] matchShuffleWithPACK - generalize zero/signbits matching for any packed src type
First step toward making use of canLowerByDroppingEvenElements to match chains of PACKSS/PACKUS for compaction shuffles.

At the moment we still only match a single stage but the MatchPACK is now more general.
2020-04-01 14:10:32 +01:00
Simon Pilgrim 918ccb64b0 [X86][SSE] Handle basic inversion of PTEST/TESTP operands (PR38522)
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0
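
Working through the algebra for the 0'th-operand inversion (a sketch of the reasoning, not the combine code):

```cpp
// ZF of PTEST(~X, Y) = ((~X & Y) == 0)
// CF of PTEST( X, Y) = ((~X & Y) == 0)
// so a TESTZ check on an inverted first operand is exactly a TESTC
// check on the original operands, and the explicit NOT can be dropped.
```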

If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correctly handle the inversion implicitly.

Additionally, for "TESTZ" (ZF) cases, the allones case, PTEST(X,-1) can be simplified to PTEST(X,X).

We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.

Differential Revision: https://reviews.llvm.org/D76984
2020-04-01 11:33:28 +01:00
Bjorn Pettersson ef49895da8 [X86] Do not assume types are legal in getFauxShuffleMask
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77136
2020-04-01 11:40:18 +02:00
Guillaume Chatelet c7468c1696 [Alignment][NFC] Use Align in SelectionDAG::getMemIntrinsicNode
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: jholewinski, nemanjai, hiraditya, kbarton, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77149
2020-04-01 09:32:05 +00:00
Craig Topper f92563f907 [VectorUtils][X86] De-templatize scaleShuffleMask and 2 X86 shuffle mask helpers and move their implementation to cpp files
Summary: These were templated due to SelectionDAG using int masks for shuffles and IR using unsigned masks for shuffles. But now that D72467 has landed we have an int mask version of IRBuilder::CreateShuffleVector, so just use int instead of a template.

Reviewers: spatel, efriedma, RKSimon

Reviewed By: efriedma

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D77183
2020-04-01 00:46:48 -07:00
Simon Pilgrim f9f401dba1 [X86][AVX] Add additional 256/512-bit test cases for PACKSS/PACKUS shuffle patterns
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
2020-04-01 08:19:03 +01:00
Simon Pilgrim 7e0e5fa499 Revert rGefe59d6717dcdf7777acb9b7a734e1a520bdf22a "[X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations"
This might be causing an issue on the fuchsia-x86_64-linux buildbot - reverting to see what happens.
2020-03-31 15:47:30 +01:00
Simon Pilgrim 38619fa7da Fix enumeral mismatch warning. NFCI.
Don't mix llvm::ISD and llvm::X86ISD.
2020-03-31 15:38:02 +01:00
Simon Pilgrim efe59d6717 [X86][SSE] lowerShuffleWithPACK - extend to use chained PACKs for larger truncations
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.

This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
2020-03-31 14:48:48 +01:00
Simon Pilgrim 98357dee1c [X86] Combine concat(palignr,palignr) -> palignr(concat,concat)
combineX86ShufflesRecursively should handle this someday
2020-03-31 11:06:35 +01:00
Simon Pilgrim 7a4a98a9c4 [X86] Move canLowerByDroppingEvenElements earlier to be with matchShuffleWithPACK. NFCI.
Make sure its defined earlier so more shuffle lowering methods can use it.
2020-03-31 10:56:35 +01:00
Guillaume Chatelet bdf77209b9 [Alignment][NFC] Use Align version of getMachineMemOperand
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: jyknight, sdardis, nemanjai, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, jfb, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77059
2020-03-30 15:46:27 +00:00
Simon Pilgrim e95d04f4f1 [X86][AVX] lowerV4X128Shuffle - attempt to widen to 2x256 to simplify shuffles
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
2020-03-30 12:22:26 +01:00
Simon Pilgrim 9c8ec99c80 [X86][AVX] Combine 128/256-bit lane shuffles with zeroable upper subvectors to EXTRACT_SUBVECTOR (PR40720)
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128/SHUF128, and we can use the implicit zeroing of the uppers.
2020-03-29 19:51:38 +01:00
Simon Pilgrim 8206c50cde [X86] Add isAnyZero shuffle mask helper 2020-03-29 19:51:37 +01:00
Simon Pilgrim 7734e4b3a3 [X86][AVX] Combine 128-bit lane shuffles with a zeroable upper half to EXTRACT_SUBVECTOR (PR40720)
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.

I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
2020-03-29 16:41:59 +01:00
Simon Pilgrim da4c7db793 [X86] Rename matchShuffleAsByteRotate to matchShuffleAsElementRotate. NFC.
This was an inner helper function for the real matchShuffleAsByteRotate function, but it is more generic and is used directly for VALIGN lowering which doesn't work at the byte level.
2020-03-29 16:41:58 +01:00
Simon Pilgrim 10439f9e32 [X86][AVX] Add X86ISD::VALIGN target shuffle decode support
Allows us to combine VALIGN instructions with other shuffles - the combiner doesn't create VALIGN yet though.
2020-03-29 16:41:58 +01:00
Craig Topper 9f7d4150b9 [X86] Move combineLoopMAddPattern and combineLoopSADPattern to an IR pass before SelecitonDAG.
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.

This pass adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.

An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.

Differential Revision: https://reviews.llvm.org/D76649
2020-03-26 14:10:20 -07:00
Simon Pilgrim ad36491ebb [X86] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on all targets
Extends rG9d1721ce3926 to support AVX2+ targets.
2020-03-26 20:46:24 +00:00
Simon Pilgrim 39a52a19ed [X86] lowerV16I8Shuffle - create v8i16 mask for PACKUS(AND(),AND()) patterns.
We can improve computeKnownBits results by avoiding excess bitcasts.

For this pattern we were doing:

  (v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))

By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits, and allow shuffle combining to peek through later on.

This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
2020-03-26 19:59:57 +00:00
Simon Pilgrim 9d1721ce39 [X86][SSE] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on pre-AVX2 targets
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.

The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.

We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
2020-03-26 15:47:43 +00:00
Simon Pilgrim e30d29ebc1 [X86][SSE] getFauxShuffleMask - peek through TRUNCATE/AEXT/ZEXT for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT())
As long as we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
2020-03-26 11:57:45 +00:00
Simon Pilgrim c6e5531f9b [X86][AVX] Combine shuffles to TRUNCATE/VTRUNC patterns
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.

We could probably extend this further to handle non-VLX truncation cases.
2020-03-25 17:41:51 +00:00
Reid Kleckner 597718aae0 Re-land "Avoid emitting unreachable SP adjustments after `throw`"
This reverts commit 4e0fe038f4. Re-lands
65b21282c7.

After landing 5ff5ddd0ad to add int3 into
trailing unreachable blocks, we can now remove these extra stack
adjustments without confusing the Win64 unwinder. See
https://llvm.org/45064#c4 or X86AvoidTrailingCall.cpp for a full
explanation.

Fixes PR45064.
2020-03-24 12:04:43 -07:00
Simon Pilgrim 714402147d [X86][SSE1] Add support for logic+movmsk patterns (PR42870)
rL368506 handled the basic case, but we need to account for boolean logic patterns as well.
2020-03-24 14:28:40 +00:00
Guillaume Chatelet 3ba550a05a [Alignment][NFC] Use TFL::getStackAlign()
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: dylanmckay, sdardis, nemanjai, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D76551
2020-03-23 13:48:29 +01:00
Craig Topper e2cb121374 [X86] Remove maximum vector length limit from combineBasicSADPattern.
createPSADBW uses SplitOpsAndApply so it should be able to handle
any size.

Restrict the extract result type to i32 or i64 since that's what
we have coverage for today and probably matches what the
isSimple() check gave us before.
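
For context, a hedged sketch (my own example, not from the patch) of the sum-of-absolute-differences reduction this combine recognizes; over u8 inputs the inner loop maps to psadbw:

  #include <cstdint>
  #include <cstdlib>

  uint32_t sad_u8(const uint8_t *a, const uint8_t *b, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; ++i)
      sum += uint32_t(std::abs(int(a[i]) - int(b[i]))); // |a[i] - b[i]|
    return sum;
  }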

Differential Revision: https://reviews.llvm.org/D76560
2020-03-22 15:02:05 -07:00
Craig Topper b89ae50795 [X86] Remove maximum vector width restriction from combineLoopSADPattern.
SplitOpsAndApply will take care of any needed splitting correctly.
All that we need to check is that the vector element count is a
power of 2.

Differential Revision: https://reviews.llvm.org/D76558
2020-03-22 11:09:14 -07:00
Simon Pilgrim 25eb9056d7 [X86] getTargetShuffleAndZeroables - add insert_subvector(undef, sub, c) handling.
We often widen xmm/ymm vectors to ymm/zmm by insertion into an undef base vector. By letting getTargetShuffleAndZeroables track the undef elts we can help avoid a lot of unnecessary cross-lane shuffles.

Fixes PR44694
2020-03-21 19:11:42 +00:00
Simon Pilgrim 4ceade0428 [X86] Combine concat(shufps,shufps) -> shufps(concat,concat)
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.

Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
2020-03-21 12:44:10 +00:00
Simon Pilgrim f424d51c3e Revert rGe6a7e3b5e3e7 "[X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles."
This reverts commit e6a7e3b5e3.

Avoids register pressure regression reported at PR45263
2020-03-21 12:14:19 +00:00
Craig Topper 32fbea1548 [X86] Prevent (bitcast (broadcast_load)) combine from producing vXf16 broadcast instructions.
The combine tries to put the broadcast in either the integer or
fp domain to match the bitcast domain. But we can only do this
if the broadcast size is 32 or larger.
2020-03-20 09:15:07 -07:00
Nico Weber 4e0fe038f4 Revert "Avoid emitting unreachable SP adjustments after `throw`"
This reverts commit 65b21282c7.
Breaks sanitizer bots (https://reviews.llvm.org/D75712#1927668)
and causes https://crbug.com/1062021 (which may or may not
be a compiler bug, not clear yet).
2020-03-17 20:49:22 -04:00
Simon Pilgrim c9656a3b31 [DAGCombiner] matchRotateSub - handle shift amount truncation
Under certain circumstances we'll end up in the position where the negated shift amount will get truncated to the type specified by getScalarShiftAmountTy(), so we need to test for a truncated version of the shift amount as well.

This allows us to remove half of the remaining patterns tested for by X86ISelLowering's combineOrShiftToFunnelShift.
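
As a hedged illustration (my sketch, not from the patch) of the pattern involved, here is a rotate written as or(shl, srl(sub)) with x86-style masked shift amounts; the 32 - z term is the negated amount that may show up truncated:

  #include <cstdint>

  uint32_t rotl32(uint32_t x, uint32_t z) {
    // or(shl(x, z), srl(x, sub(32, z))); masking by 31 keeps both shifts
    // legal and makes the z == 0 case fold to x | x.
    return (x << (z & 31)) | (x >> ((32 - z) & 31));
  }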
2020-03-17 16:01:23 +00:00
Simon Pilgrim ebb181cf40 [X86] matchScalarReduction - add support for partial reductions
Add opt-in support for partial reduction cases by providing an optional partial mask to indicate which elements have been extracted for the scalar reduction.
2020-03-16 18:01:02 +00:00
Simon Pilgrim e43a085781 [X86] X86::isConstantSplat - enable partial undef bit handling by default.
We currently only ever use this for lowering constant uniform values (shift/rotate by immediate) so we can safely enable it by default (it treats the undef bits as zero when extracting constants).

This is necessary for an upcoming patch that will use SimplifyDemandedBits more aggressively on funnel shift amounts, and which causes regressions in vXi64 constant cases without it.
2020-03-16 12:56:24 +00:00
Simon Pilgrim ac4609cb1d [X86] LowerRotate - use X86::isConstantSplat to detect constant splat rotation amounts.
Avoid code duplication and matches what we do for the similar LowerFunnelShift and LowerScalarImmediateShift methods.
2020-03-16 12:56:23 +00:00
Simon Pilgrim ee862adf60 Fix signed/unsigned comparison warning. 2020-03-14 18:42:27 +00:00
Simon Pilgrim 0cb2f089c1 [X86] getFauxShuffleMask - pull out repeated byte sizes variables. NFC. 2020-03-14 17:36:17 +00:00
Simon Pilgrim f47f4c137b [X86] getFauxShuffleMask - merge insertelement paths
Merge the INSERT_VECTOR_ELT/SCALAR_TO_VECTOR and PINSRW/PINSRB shuffle mask paths - they both do the same thing (find source vector + handle implicit zero extension). The PINSRW/PINSRB path also handled in the insertion of zero case which needed to be added to the general case as well.
2020-03-14 13:11:03 +00:00
Craig Topper 755e00876c [X86] Remove isel patterns for X86VBroadcast+trunc+extload. Replace with DAG combines.
This is a little more complicated than I'd like it to be. We have
to manually match a trunc+srl+load pattern that generic DAG
combine won't do for us due to isTypeDesirableForOp.
2020-03-13 18:12:16 -07:00
Simon Pilgrim 05c0d34918 [X86][SSE] Prefer trunc(movd(x)) to pextrb(x,0)
If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB.

The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4).

This matches what we already do for PEXTRW.

Differential Revision: https://reviews.llvm.org/D76138
2020-03-13 18:43:04 +00:00
Simon Pilgrim 846c614f54 [X86] combineExtractWithShuffle - pull out repeated getSizeInBits() call. NFC. 2020-03-13 15:36:04 +00:00
Simon Pilgrim fe047fbccc [X86] LowerEXTRACT_VECTOR_ELT - pull out repeated getOperand() calls. NFC.
Also, cleanup LowerEXTRACT_VECTOR_ELT_SSE4 comments which had references to non-constant extraction indices.
2020-03-13 15:36:02 +00:00
Simon Pilgrim 4689eae820 [X86] combineOrShiftToFunnelShift - remove shift by immediate handling.
Now that D75114 has landed, DAGCombiner handles this case so the code is redundant.
2020-03-12 11:46:51 +00:00
Simon Pilgrim b3b4727a3e [X86] Replace (most) X86ISD::SHLD/SHRD usage with ISD::FSHL/FSHR generic opcodes (PR39467)
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly, we just need to account for the operand commutation for SHRD.

The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs funnel shift modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which match the generic implementation in all other respects.

Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this but it allows the DAG combines to continue to operate after legalization in a lot more cases.

The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough for the branch fall through code to mess up.
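
A hedged scalar model of the i16 mismatch described above (my own sketch): ISD::FSHL reduces the amount modulo the bit width, whereas 16-bit SHLD takes it modulo 32, so the two agree only for amounts below 16:

  #include <cstdint>

  uint16_t fshl16(uint16_t x, uint16_t y, unsigned z) {
    z &= 15; // funnel shift semantics: amount modulo the bit width
    return z ? uint16_t((x << z) | (y >> (16 - z))) : x;
  }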

Differential Revision: https://reviews.llvm.org/D75748
2020-03-11 11:17:49 +00:00
Simon Pilgrim c8ede5e485 [X86][SSE] getFauxShuffleMask - add support for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT) shuffle pattern
We already do this for PINSRB/PINSRW and SCALAR_TO_VECTOR.
2020-03-10 15:42:37 +00:00
Simon Pilgrim e6a7e3b5e3 [X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles.
This causes one minor test change but is mainly necessary for an upcoming patch.
2020-03-10 15:42:36 +00:00
Simon Pilgrim 18c19441d1 [X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.

At some point soon we're going to have to make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
2020-03-10 10:44:28 +00:00
Craig Topper ef4f939d38 [X86] Remove isel patterns for (X86VBroadcast (i16 (trunc (i32 (load))))). Replace with a DAG combine to form VBROADCAST_LOAD.
isTypeDesirableForOp prevents loads from being shrunk to i16 by DAG
combine. Because of this we can't just match the broadcast and a
scalar load. So look for broadcast+truncate+load and form a
vbroadcast_load during DAG combine. This replaces what was
previously done as an isel pattern and I think fixes it so we
won't change the size of a volatile load. But my main motivation
is just to clean up our isel patterns.
2020-03-10 00:07:07 -07:00
Simon Pilgrim 4b130b883d [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - reduce vector width of X86ISD::BLENDI
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.

This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is, I don't consider that a showstopper, just an annoyance (and there are plenty of other poor codegen cases in that file).
2020-03-09 18:29:28 +00:00
Craig Topper 3dcc0db15e [X86] Teach combineToExtendBoolVectorInReg to create opportunities for using broadcast load instructions.
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.

Previously we any_extended in the scalar domain before inserting.

This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.

This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.

Differential Revision: https://reviews.llvm.org/D75835
2020-03-09 11:26:12 -07:00
Djordje Todorovic c15c68abdc [CallSiteInfo] Enable the call site info only for -g + optimizations
Emit call site info only in the case of '-g' + 'O>0' level.

Differential Revision: https://reviews.llvm.org/D75175
2020-03-09 12:12:44 +01:00
Craig Topper 70e4fb8a53 [X86] Add DAG combine to turn (vzext_movl (vbroadcast_load)) -> vzext_load.
If we're zeroing the other elements then we don't need the broadcast.
2020-03-08 00:35:40 -08:00
Craig Topper d81d451442 [X86] Add DAG combine to replace vXi64 vzext_movl+scalar_to_vector with vYi32 vzext_movl+scalar_to_vector if the upper 32 bits of the scalar are zero.
We can just use a 32-bit copy and zero in the SSE domain when we
zero the upper bits.

Remove an isel pattern that becomes dead with this.
2020-03-07 16:14:26 -08:00
Craig Topper d41ea65ee8 [X86] Add DAG combines to enable removing of movddup/vbroadcast + simple_load isel patterns. 2020-03-07 15:22:02 -08:00
Craig Topper bc65b68661 [X86] Add a DAG combine to turn vbroadcast(vzload X) -> vbroadcast_load
Remove now unneeded isel patterns.
2020-03-07 15:22:02 -08:00
Craig Topper ec1d1f6ae7 [X86] Use MVT instead of EVT in a couple shuffle lowering functions. 2020-03-07 09:50:53 -08:00
Reid Kleckner 65b21282c7 Avoid emitting unreachable SP adjustments after `throw`
In 172eee9c, we tried to avoid these by modelling the callee as
internally resetting the stack pointer.

However, for the majority of functions with reserved stack frames, this
would lead LLVM to emit extra SP adjustments to undo the callee's
internal adjustment. This led us to fix the problem further down the
pipeline in eliminateCallFramePseudoInstr. In 5b79e603d3, I added
a heuristic to try to detect when the adjustment would be
unreachable.

This heuristic is imperfect, and when exception handling is involved, it
fails to fire. The new test is an example of this. Simply throwing an
exception with an active cleanup emits dead SP adjustments after the
throw. Not only are they dead, but if they were executed, they would be
incorrect, so they are confusing.
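
A hedged example (mine, not the test from the patch) of the scenario: the destructor is an active cleanup when the exception leaves, and any SP adjustment emitted after the noreturn throw path is dead:

  #include <cstdio>

  struct Cleanup { ~Cleanup() { std::puts("cleanup"); } };

  void raise() {
    Cleanup c; // active cleanup: must run while unwinding
    throw 42;  // everything after this point in the function is unreachable
  }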

This change essentially reverts 172eee9c and makes the 5b79e603d3
heuristic responsible for preventing unreachable stack adjustments. This
means we may emit unreachable stack adjustments for functions using EH
with unreserved call frames, but such functions are not very common these days. Back
in 2016, when this change was added, we were focused on 32-bit, which we
observed to have fewer reserved frames.

Fixes PR45064

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D75712
2020-03-06 13:33:45 -08:00
Craig Topper 4c7c87f245 [X86] Simplify the code at the end of lowerShuffleAsBroadcast.
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.

Now we handle the scalar case directly by broadcasting using the
scalar type as the element type. Then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.

For the vector case, we cast to VT or subvector of VT. And then
do the broadcast.

I think this all matches what we generated before, just in a more
readable way.
2020-03-04 20:45:02 -08:00
Craig Topper eadea7868f [X86] Convert vXi1 vectors to xmm/ymm/zmm types via getRegisterTypeForCallingConv rather than using CCPromoteToType in the td file
Previously we tried to promote these to xmm/ymm/zmm by promoting them
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a nonsensical scalar-to-vector conversion. This led
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.

Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved. Except when the calling convention is one
that passes the vXi1 type in a k register.

The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we go to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.

Differential Revision: https://reviews.llvm.org/D75154
2020-03-04 15:02:32 -08:00
Craig Topper 06de426426 [X86] Directly form VBROADCAST_LOAD in lowerShuffleAsBroadcast on AVX targets.
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.

I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
2020-03-04 09:11:57 -08:00
Craig Topper 9284abd004 [X86] Directly form VBROADCAST_LOAD for BUILD_VECTOR of splat loads in lowerBuildVectorAsBroadcast. 2020-03-03 22:27:34 -08:00
Craig Topper 3c4e635593 [X86] Always emit an integer vbroadcast_load from lowerBuildVectorAsBroadcast regardless of AVX vs AVX2
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.

Differential Revision: https://reviews.llvm.org/D75413
2020-03-03 10:39:11 -08:00
Craig Topper 56cd3bc209 [X86] Directly emit VBROADCAST_LOAD from constant pool in lowerBuildVectorAsBroadcast
Also add a DAG combine to combine different sized broadcasts from
constant pool to avoid a regression.

Differential Revision: https://reviews.llvm.org/D75412
2020-03-03 10:39:10 -08:00
Craig Topper 68aeaab888 [X86] Don't count the chain uses when forming broadcast loads in lowerBuildVectorAsBroadcast.
The build_vector needs to be the only user of the data, but the
chain will likely have another use. So we can't make sure the
build_vector is the only user of the node.
2020-03-03 08:41:31 -08:00
Craig Topper 2f4f8fcf64 [X86] Don't add DELETED_NODES to DAG combine worklist after calling SimplifyDemandedBits/SimplifyDemandedVectorElts.
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.

Fixes PR45067.
2020-03-01 00:06:32 -08:00
Craig Topper f2d45e5097 [X86] Canonicalize (bitcast (vbroadcast_load)) so that the cast and vbroadcast_load are both integer or fp.
Helps a little with some isel pattern matching. Especially on
32-bit targets where we sometimes use f64 loads.
2020-02-28 15:07:49 -08:00
Craig Topper b68eeff05c [X86] Cleanup a comment around bitcasting X86ISD::VBROADCAST_LOAD and add an assert to make sure memory VT size doesn't change. 2020-02-28 15:07:49 -08:00
Craig Topper c0d0e6b198 [X86] Recognize CVTPH2PS from STRICT_FP_EXTEND
This should avoid scalarizing the cvtph2ps intrinsics with D75162

Differential Revision: https://reviews.llvm.org/D75304
2020-02-28 10:19:57 -08:00
Simon Pilgrim f90cc633de Fix cppcheck definition/declaration arg mismatch warnings. NFCI. 2020-02-27 14:35:20 +00:00
Simon Pilgrim fe6bcfaf3b [X86] Use Subtarget.useSoftFloat() in X86TargetLowering constructor
Avoid use of X86TargetLowering::useSoftFloat() in the constructor as it's a virtual function
2020-02-27 14:35:20 +00:00
Simon Pilgrim e61e7f0794 Fix shadow variable warning. NFC. 2020-02-27 14:23:05 +00:00
Simon Pilgrim dc7ac563ac Fix shadow variable warnings. NFC. 2020-02-27 14:21:30 +00:00
Simon Pilgrim efe2f59ec4 [X86] LowerMSCATTER/MGATHER - reduce scope of MaskVT. NFCI.
Fixes cppcheck warning.
2020-02-27 14:20:44 +00:00
Simon Pilgrim fabe52a741 Fix uninitialized variable warning. NFC. 2020-02-27 14:20:43 +00:00
Simon Pilgrim 6bdd63dc28 [X86] createVariablePermute - handle case where recursive createVariablePermute call fails
Account for the case where a recursive createVariablePermute call with a wider vector type fails.

Original test case from @craig.topper (Craig Topper)
2020-02-27 13:52:31 +00:00
Craig Topper 82a21c1655 [X86] Add proper MachinePointerInfo to stack store created in LowerWin64_i128OP. 2020-02-26 16:55:24 -08:00
Craig Topper 870363a22d [X86] Explicitly pass Destination VT and debug location to BuildFILD. NFC
We'd already passed almost everything else. Might as well pass
these two things and stop passing Op.
2020-02-26 16:26:46 -08:00
Craig Topper 15e2831fcd [X86] Explicitly pass Pointer, MachinePointerInfo and Alignment to BuildFILD.
Previously this code was called in two ways: either a FrameIndexSDNode
or a load node was passed in the argument called StackSlot. Which case
we were in was determined by a dyn_cast to FrameIndexSDNode.

In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.

For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.

This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guess work and reverse engineering.

I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just clean up to make the code less surprising.
2020-02-26 16:26:26 -08:00
Craig Topper 77d9b7b2cd [X86] Query constant pool object alignment instead of hardcoding. 2020-02-26 14:45:39 -08:00
Craig Topper 9c1a707ba3 [X86] Use proper alignment for stack temporary and correct MachinePointerInfo for stack accesses in LowerUINT_TO_FP. 2020-02-26 14:45:38 -08:00
Craig Topper a8186935ae [X86] Use correct MachineMemOperand for stack load in LowerFLT_ROUNDS_ 2020-02-26 14:45:38 -08:00
Craig Topper 735d27dc40 [SelectionDAG][PowerPC][AArch64][X86][ARM] Add chain input and output to ISD::FLT_ROUNDS_
This node reads the rounding control, which means it needs to be ordered properly with operations that change the rounding control, so it must be chained to maintain order.
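
A hedged source-level illustration (my example) of why the ordering matters: the read below must not be scheduled across the write that changes the rounding control:

  #include <cfenv>

  int read_after_change() {
    std::fesetround(FE_UPWARD); // writes the rounding control
    return std::fegetround();   // must observe FE_UPWARD, hence the chain
  }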

This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've updated all in-tree targets to connect their chain through their lowering code.

Differential Revision: https://reviews.llvm.org/D75132
2020-02-25 16:58:23 -08:00
Craig Topper 9238dfb4d8 [X86] Remove mask output from X86 gather/scatter ISD opcodes.
Instead add it when we make the machine nodes during instruction
selections.

This makes this ISD node closer to ISD::MGATHER. Trying to see
if we can remove the X86-specific ones.
2020-02-24 23:56:28 -08:00
Simon Pilgrim daac8dba77 [X86] combineX86ShuffleChain - select X86ISD::FAND/ISD::AND based on MaskVT
Noticed by inspection: we shouldn't use FloatDomain directly; we've already bitcast both inputs to MaskVT, so select the opcode using that.
2020-02-24 18:24:44 +00:00
Simon Pilgrim 59d8d13c7b [X86] getTargetShuffleInputs - check that the source inputs are all the right size.
I'm hoping to begin improving shuffle combining across different vector sizes, but before that we must ensure that all existing getTargetShuffleInputs calls must bail if the inputs aren't the same size.
2020-02-24 16:26:10 +00:00
Craig Topper 7a7146cf72 [X86] When creating X86ISD::MGATHER nodes from AVX2 gather intrinsics, cast the mask to integer type.
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming it's an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
2020-02-23 23:00:41 -08:00
Craig Topper 5a70518660 [X86] Remove most X86 specific subclasses of MemSDNode. Just use a MemIntrinsicSDNode as we usually do.
Leave the gather/scatter subclasses, but make them inherit from
MemIntrinsicSDNode and delete their constructor and destructor.
This way we can still have the getIndex, getMask, etc. convenience
functions.
2020-02-23 15:13:32 -08:00
Craig Topper 15b6aa7448 [X86] Enable the use of movlps for i64 atomic load on 32-bit targets with sse1.
Still a little room for improvement by using movlps to store to
the stack temporary needed to move data out of the xmm register
after the load.
2020-02-23 15:11:38 -08:00
Craig Topper 2a10f8019d [X86] Use FIST for i64 atomic stores on 32-bit targets without SSE. 2020-02-23 15:11:38 -08:00
Craig Topper 84cd968f75 [X86] Add AddToWorklist(N) after calls to SimplifyDemandedBits/SimplifyDemandedVectorElts that are called on an operand of N.
If a simplification occurs, the operand will be added to the worklist.
But since the demanded mask was based on N, we need to make sure
we revisit N in case there are more simplifications to be done.
Returning SDValue(N, 0), as we do, only tells DAG combine that
something changed, but that won't make it add anything to the
worklist.

Found while playing around with using VEXTRACT_STORE in more cases.
But I guess this doesn't affect any of our existing tests.
2020-02-22 21:42:59 -08:00
Craig Topper bdb1729c83 [X86] Teach EltsFromConsecutiveLoads that it's ok to form a v4f32 VZEXT_LOAD with a 64 bit memory size on SSE1 targets.
We can use MOVLPS which will load 64 bits, but we need a v4f32
result type. We already have isel patterns for this.

The code here is a little hacky. We can probably improve it with
more isel patterns.
2020-02-22 18:50:52 -08:00
Craig Topper e7a184fc7c [X86] Use movlps for i64 atomic stores on 32-targets with sse1.
This is similar to using movd which we do for sse2 targets.

I've added a DAG combine for VEXTRACT_STORE to use SimplifyDemandedVectorElts
to clean up some artifacts from type legalization.
2020-02-22 18:22:47 -08:00
Craig Topper 228a2bc9b7 [X86] Teach combineCVTPH2PS to shrink v8i16 loads when the output type is v4f32. Remove extra isel patterns.
Similar to what do for other operations that use a subset of bits.
Allows us to remove a pattern that shrinks a load. Which was
incorrect if the load was volatile.
2020-02-21 18:11:07 -08:00
Nikita Popov c90ea87cfd [X86] Fix SDLoc initialization
Fixes -Wparentheses warning, in this case indicating a genuine
bug.
2020-02-21 18:26:05 +01:00
Craig Topper 97f11600e0 [X86] Don't bother avoiding illegal FCMOVs if we don't have the cmov subtarget feature.
We'll be forced to emit branches so we might as well use the most
direct condition.
2020-02-21 00:34:15 -08:00
Craig Topper 263bef2bbc [X86] Make combineCMov not create unsupported FCMOVs when f32/f64 are using X87.
This makes the behavior consistent with what's in LowerSELECT.
2020-02-21 00:34:15 -08:00
Craig Topper 4576606831 [X86] Remove unnecessary isNullConstant in LowerSelect. NFC
At this point in the code we know that Op1 or Op2 is
all ones. Y points to the other operand. In the case that
Op2 is zero, Op1 must be all ones and Y is Op2. The OR
ORs Y into Res. But if Y is 0 the OR will be folded away by
getNode so we don't need to check for it.
2020-02-20 21:41:13 -08:00
Craig Topper 78be618717 [X86] Add CMOV_VR64 pseudo instruction for MMX. Remove mmx handling from combineSelect.
The combineSelect code was casting to i64 without any check that
i64 was legal. This can break after type legalization.

It also required splitting the mmx register on 32-bit targets.
It's not clear that this makes sense. Instead switch to using
a cmov pseudo like we do for XMM/YMM/ZMM.
2020-02-20 20:30:56 -08:00
Craig Topper e5782377f3 [X86] Add CMOV_VK1 pseudo so we don't crash on v1i1 ISD::SELECT 2020-02-20 15:13:48 -08:00
Craig Topper 7e92769862 [X86] Expand vselect of v1i1 under avx512.
We already do this for v2i1, v4i1, etc.
2020-02-20 15:13:47 -08:00
Craig Topper b00ef8951b [X86] Custom legalize v1i1 UADDSAT/USUBSAT/SADDSAT/UADDSAT to match v2i1/v4i1/v8i1 etc. 2020-02-20 15:13:46 -08:00
Craig Topper d95a10a7f9 [X86] Custom legalize v1i1 add/sub/mul to xor/xor/and with avx512.
We already did this for v2i1, v4i1, v8i1, etc.
2020-02-20 15:13:44 -08:00
Craig Topper c7b54a196e Recommit "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT"
With the correct author this time
2020-02-20 12:28:54 -08:00
Craig Topper 1d8860f90b Revert 714265dabb "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT"
I accidentally messed up the author on the previous commit somehow.
2020-02-20 12:28:33 -08:00
Quentin Colombet 714265dabb [X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT
The type here isn't guaranteed to be a simple type.

Fixes PR44976
2020-02-20 12:25:37 -08:00
Sanjay Patel 064cd2ecdb [x86] allow peeking through an extract_subvector to find a splatted operand
The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)

There are other potential improvements visible like allowing scalarization or vector
narrowing.

Differential Revision: https://reviews.llvm.org/D74909
2020-02-20 13:59:59 -05:00
Craig Topper 0ed7a61543 [X86] Fix a -Wparentheses warning. NFC 2020-02-20 09:32:03 -08:00
Craig Topper 3543ac9ab5 [X86] Rewrite LowerBRCOND to remove dead code and handle ISD::SETCC and overflow ops directly.
There's a lot of old leftover code in LowerBRCOND. Especially
the detecting or AND or OR of X86ISD::SETCC nodes. Those were
needed before LegalizeDAG was changed to visit nodes before
their operands.

It also relied on reversing the output of LowerSETCC to find the
flag-producing node to use for the X86ISD::BRCOND node.

Rather than using LowerSETCC this patch uses emitFlagsForSetcc to
handle the integer ISD::SETCC case. This gives the flag producer
and the comparison code to use directly. I've removed the addTest
flag and just produce a X86ISD::BRCOND and return immediately.

Floating point ISD::SETCC case is just an X86ISD::FCMP with special
care for OEQ and UNE derived from the previous code. I've left
f128 out so it will emit a test. And LowerSETCC will be called
later to produce a libcall and X86ISD::SETCC. We have combines
that can merge the test and X86ISD::SETCC.

We need to handle two cases for overflow ops. Either they are used
directly or they have a seteq 0 or setne 1 to invert the overflow.
The old code did not handle the setne 1 case, but I think some
other combines were making up for it.

If we fail to find a condition, we'll wrap an AND with 1 on the
original condition and tell emitFlagsForSetcc to emit a compare
with 0. This will pick up the LowerAndToBT and/or the EmitTest case.
I kept the isTruncWithZeroHighBitsInput call, but we might be able
to fold that in to emitFlagsForSetcc.

Differential Revision: https://reviews.llvm.org/D74750
2020-02-20 08:50:18 -08:00
Craig Topper 12cc105f80 [X86] Add DAG combines to form CVTPH2PS/CVTPS2PH from vXf16->vXf32/vXf64 fp_extends and vXf32->vXf16 fp_round.
Only handle power of 2 element count for simplicity. Not sure what to do with vXf64->vXf16 fp_round to avoid double rounding.

Differential Revision: https://reviews.llvm.org/D74886
2020-02-20 08:26:17 -08:00
Djordje Todorovic 2f215cf36a Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGfaff707db82d.
A failure found on an ARM 2-stage buildbot.
The investigation is needed.
2020-02-20 14:41:39 +01:00
Craig Topper f559cecc3e [X86] Add DCI.isBeforeLegalize() check to the v64i1 constant splitting code in combineStore.
We only need to split after type legalization. If we're before that,
we can just use a wide store and type legalization will split it.

Add a v128i1 test to exercise it post type legalization.
2020-02-19 09:18:16 -08:00
Florian Hahn 216afd3301 [TargetLower] Update shouldFormOverflowOp check if math is used.
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.
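
For illustration, a hedged example (mine, with a hypothetical function name) of the profitable default case, where both the math result and the overflow bit are consumed; __builtin_add_overflow is the GCC/Clang builtin that becomes an overflow op:

  #include <cstdint>

  bool add_and_check(int32_t a, int32_t b, int32_t &out) {
    int32_t sum;
    bool ovf = __builtin_add_overflow(a, b, &sum);
    out = sum;  // math result used
    return ovf; // overflow result used
  }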

This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.

Reviewers: nikic, RKSimon, lebedev.ri, spatel

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D74722
2020-02-19 11:28:33 +01:00
Djordje Todorovic faff707db8 Reland "[DebugInfo] Enable the debug entry values feature by default"
Differential Revision: https://reviews.llvm.org/D73534
2020-02-19 11:12:26 +01:00
Craig Topper f69a29da5a [X86] Remove vXi1 select optimization from LowerSELECT. Move it to DAG combine. 2020-02-19 00:00:55 -08:00
Craig Topper 0dbc4658d8 [X86] Handle splats in LowerBUILD_VECTORvXi1 by directly emitting scalar selects instead of deferring that to LowerSELECT.
LowerSELECT will detect the constant inputs and convert to scalar
selects, but we can do it directly here.

I might remove some of the code from LowerSELECT and move it to
DAG combine so doing this explicitly will make us less dependent
on it happening in lowering.
2020-02-18 22:39:30 -08:00
Simon Pilgrim d6eef0614f [TargetLowering] Add SimplifyMultipleUseDemandedBits 'all elements' helper wrapper. NFC. 2020-02-18 19:53:50 +00:00
Craig Topper 89ab5c69c8 [X86] Add a helper function to pull some repeated code out of combineGatherScatter. NFC 2020-02-18 11:10:40 -08:00
Djordje Todorovic 2bf44d11cb Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGa82d3e8a6e67.
2020-02-18 16:38:11 +01:00
Djordje Todorovic a82d3e8a6e Reland "[DebugInfo] Enable the debug entry values feature by default"
This patch enables the debug entry values feature.

  - Remove the (CC1) experimental -femit-debug-entry-values option
  - Enable it for x86, arm and aarch64 targets
  - Resolve the test failures
  - Leave the llc experimental option for targets that do not
    support the CallSiteInfo yet

Differential Revision: https://reviews.llvm.org/D73534
2020-02-18 14:41:08 +01:00
Craig Topper e90dc7c48b [X86] Move avx512 code that forces zeros to the false side of vselects above a check for legal types.
This helps this transform occur earlier so we can fold the not
with setcc. If we delay it until after type legalization we might
have introduced instructions to widen the mask if the vselect was
widened. This can prevent the not from making it to the setcc.

We could of course add more DAG combines to handle that, but
moving this earlier is easier.
2020-02-17 22:24:21 -08:00
Craig Topper b0840934a7 [X86] Use isScalarFPTypeInSSEReg to simplify code in LowerSELECT. NFC 2020-02-17 19:43:57 -08:00
Craig Topper 3f4490d384 [X86] Add one use check to '0-x == y --> x+y == 0' in EmitCmp.
I failed to copy it when I moved this in
b62de210cf
2020-02-17 18:16:42 -08:00
Craig Topper 43e948c4b7 [X86] Change how the alignment for the stack object is created in LowerFLT_ROUNDS_.
We don't need FrameInfo's concept of the stack alignment. We just
need to tell it the desired alignment. Which in this case is 2.
2020-02-17 11:27:34 -08:00
Craig Topper b62de210cf [X86] Move '0-x == y --> x+y == 0' and similar combines to EmitCmp.
AArch64 handles this pattern in its lowering code by emitting
CMN. ARM handles it as an isel pattern.
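
A hedged sanity check (my sketch) of the identity behind the combine: in wrap-around arithmetic, 0 - x == y holds exactly when x + y == 0:

  #include <cassert>
  #include <cstdint>

  void check_identity(uint32_t x, uint32_t y) {
    assert(((0u - x) == y) == ((x + y) == 0u));
  }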
2020-02-17 11:27:34 -08:00
Nikita Popov 80397d2d12 [IRBuilder] Delete copy constructor
D73835 will make IRBuilder no longer trivially copyable. This patch
deletes the copy constructor in advance, to separate out the breakage.

Currently, the IRBuilder copy constructor is usually used by accident,
not by intention.  In rG7c362b25d7a9 I've fixed a number of cases where
functions accepted IRBuilder rather than IRBuilder &, thus performing
an unnecessary copy. In rG5f7b92b1b4d6 I've fixed cases where an
IRBuilder was copied, while an InsertPointGuard should have been used
instead.

The only non-trivial use of the copy constructor is the
getIRBForDbgInsertion() helper, for which I separated construction and
setting of the insertion point in this patch.

Differential Revision: https://reviews.llvm.org/D74693
2020-02-17 18:14:48 +01:00
Craig Topper 464729cf7c [X86] Remove unnecessary check for null SDValue. NFC 2020-02-16 20:25:24 -08:00
Craig Topper 272d35aef5 [X86] Separate floating point handling out of EmitCmp and emitFlagsForSetcc.
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.

This removes the need to pass Chain and IsSignaling all the way
down.
2020-02-16 10:51:05 -08:00
Craig Topper d26f11108b [X86] Split X86ISD::CMP into an integer and FP opcode. 2020-02-16 10:10:19 -08:00
Simon Pilgrim b85df2e185 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to PALIGNR 2020-02-16 16:13:26 +00:00
Simon Pilgrim c9c1c2b335 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to bit shifts 2020-02-16 16:13:25 +00:00
Sanjay Patel e48b536be6 [x86] form broadcast of scalar memop even with >1 use
The unseen logic diff occurs because MayFoldLoad() is defined like this:

static bool MayFoldLoad(SDValue Op) {
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}

The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.

The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.

Differential Revision: https://reviews.llvm.org/D74088
2020-02-16 10:32:56 -05:00
Simon Pilgrim 34a054ce71 [X86] combineX86ShuffleChain - add support for combining to X86ISD::ROTLI
Refactors matchShuffleAsBitRotate to allow use by both lowerShuffleAsBitRotate and matchUnaryPermuteShuffle.
2020-02-15 20:04:54 +00:00
Simon Pilgrim 2492075add [X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets
Without PSHUFB we are better off using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.

REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
2020-02-14 11:55:18 +00:00
Craig Topper c2e8a421ac [X86] Don't widen 128/256-bit strict compares with vXi1 result to 512-bits on KNL.
If we widen the compare we might trigger a spurious exception from
the garbage data.

We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.

I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.

Differential Revision: https://reviews.llvm.org/D74522
2020-02-13 13:26:40 -08:00
Amy Huang de1d90299b Revert "[X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets"
This reverts commit 11c16e7159 because it
causes a crash in chromium code. See
https://reviews.llvm.org/rG11c16e71598d51f15b4cfd0f719c4dabcc0bebf7.
2020-02-12 17:00:37 -08:00
Jay Foad 32aac25637 [KnownBits] Introduce anyext instead of passing a flag into zext
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.

I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.
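
A toy model of the distinction (an assumption-laden sketch, not the real llvm::KnownBits API): zext fills the new bits with known zeros, while anyext leaves them unknown:

  #include <cstdint>

  struct KB { uint64_t Zero, One; }; // bit set => bit known 0 / known 1

  KB zext8to16(KB k)   { return {k.Zero | 0xFF00u, k.One}; } // new bits known 0
  KB anyext8to16(KB k) { return {k.Zero, k.One}; }           // new bits unknown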

NFC.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74482
2020-02-12 19:06:53 +00:00
Simon Pilgrim ff307c8120 [X86] combineFneg - generalize FMA negations with isNegatibleForFree/getNegatedExpression
This has a really interesting side effect in that it improves some UMAX/UMIN reduction code which had redundant XOR(SHUFFLE(XOR(X,SIGNMASK)),SIGNMASK) patterns - getNegatibleCost recognises them as FNEG(SHUFFLE(FNEG(X))). We have a lot of FNEG patterns bitcasted to the integer domain for XOR signbit twiddling, which is similar to what we do to allow UMAX/UMIN to be lowered using SMAX/SMIN.

Differential Revision: https://reviews.llvm.org/D74231
2020-02-12 16:07:27 +00:00
Simon Pilgrim 9eb426c88c [TargetLowering] Add NegatibleCost enum for isNegatibleForFree return codes
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.

This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.

It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.

Differential Revision: https://reviews.llvm.org/D74221
2020-02-12 11:51:42 +00:00
Djordje Todorovic 97ed706a96 Revert "[DebugInfo] Enable the debug entry values feature by default"
This reverts commit rG9f6ff07f8a39.

Found a test failure on clang-with-thin-lto-ubuntu buildbot.
2020-02-12 11:59:04 +01:00
Djordje Todorovic 9f6ff07f8a [DebugInfo] Enable the debug entry values feature by default
This patch enables the debug entry values feature.

  - Remove the (CC1) experimental -femit-debug-entry-values option
  - Enable it for x86, arm and aarch64 targets
  - Resolve the test failures
  - Leave the llc experimental option for targets that do not
    support the CallSiteInfo yet

Differential Revision: https://reviews.llvm.org/D73534
2020-02-12 10:25:14 +01:00
Craig Topper 0daf9b8e41 [X86][LegalizeTypes] Add SoftPromoteHalf support STRICT_FP_EXTEND and STRICT_FP_ROUND
This adds a strict version of FP16_TO_FP and FP_TO_FP16 and uses
them to implement soft promotion for the half type. This is
enough to provide basic support for __fp16 with strictfp.

Add the necessary X86 support to use VCVTPS2PH/VCVTPH2PS when F16C
is enabled.
2020-02-11 22:30:04 -08:00
Craig Topper 846d0ac43e [X86] Don't disable code in combineHorizontalPredicateResult just because we have avx512
We aren't doing a good job of optimizing AVX512 outside of this code. So remove the bail out for AVX512 and replace with a FIXME. This at least gets us the AVX2 codegen.

Differential Revision: https://reviews.llvm.org/D74431
2020-02-11 14:36:29 -08:00
Simon Pilgrim fa620fc8e2 [X86] combineConcatVectorOps - reuse IsSplat and remove duplicate code. NFC. 2020-02-11 13:37:57 +00:00
Simon Pilgrim 11c16e7159 [X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets
Without PSHUFB we are better off using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
2020-02-11 12:21:03 +00:00
Craig Topper 798305d29b [X86] Custom lower ISD::FP16_TO_FP and ISD::FP_TO_FP16 on f16c targets instead of using isel patterns.
We need to use vector instructions for these operations. Previously
we handled this with isel patterns that used extra instructions
and copies to handle the the conversions.

Now we use custom lowering to emit the conversions. This allows
them to be pattern matched and optimized on their own. For
example we can now emit vpextrw to store the result if its going
directly to memory.

I've forced the upper elements of VCVTPH2PS to zero to keep some
code similar. Zeroes will be needed for strictfp. I've added a
DAG combine for (fp16_to_fp (fp_to_fp16 X)) to avoid extra
instructions in between, keeping the output closer to the previous codegen.

This is a step towards strictfp support for f16 conversions.
2020-02-10 22:01:48 -08:00
Simon Pilgrim f319074824 [X86] combineConcatVectorOps - combine X86ISD::PACKSS ops 2020-02-10 17:48:02 +00:00
Simon Pilgrim 74c0f98cf5 [X86] combineConcatVectorOps - combine X86ISD::VPERMI ops 2020-02-10 17:48:01 +00:00
Simon Pilgrim 2463b8c97d [X86] combineConcatVectorOps - combine VSHLI/VSRAI/VSRLI ops
Non-AVX512BW targets failed to concatenate 256-bit shifts back to 512-bits (split during 512-bit shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-10 16:59:09 +00:00
Simon Pilgrim 06617c4522 [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD::ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

REAPPLIED rGe82e17d4d4ca after reversion at rG39eade73a567 - fixed offset matching in matchShuffleAsBitRotate.
2020-02-10 16:16:56 +00:00
Simon Pilgrim 39eade73a5 Revert rGe82e17d4d4cac8b2df00094e80d5e1cb22795664 - [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD::ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
---
Internal shuffle tests indicate there's a bug somewhere that I haven't been able to track down yet.
2020-02-10 12:14:26 +00:00
Craig Topper 06ba969c9d [X86] Make (insert_vector_elt (v8i16 zerovec), i16 %x, 0) generate the same code as (v8i16 (build_vector %x, 0, 0, 0, 0, 0, 0, 0)).
Instead of using an insrw to element 0, use movzx and movd.

Same for v16i8.
2020-02-09 21:52:11 -08:00
Simon Pilgrim 29e646fe65 [X86] combineConcatVectorOps - combine VROTLI/VROTRI ops
Fix issue mentioned on rGe82e17d4d4ca - non-AVX512BW targets failed to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-09 21:50:10 +00:00
Simon Pilgrim e82e17d4d4 [X86] Add lowerShuffleAsBitRotate (PR44379)
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.

This patch lowers to uniform ISD::ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.

There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.

Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
2020-02-09 21:15:03 +00:00
Simon Pilgrim 29621b2534 [X86] Rename matchShuffleAsRotate - matchShuffleAsByteRotate. NFCI.
A matchShuffleAsBitRotate variant will be added soon and we need to make the difference more obvious.
2020-02-09 18:35:50 +00:00
Simon Pilgrim 3ec6de07e9 Fix signed/unsigned warning. 2020-02-09 13:35:03 +00:00
Simon Pilgrim 644d56b432 [X86] Recognise ROTLI/ROTRI rotations as faux shuffles
Allows us to combine rotations with shuffles.

One of many things necessary to fix PR44379 (lowering shuffles to rotations)
2020-02-09 12:25:49 +00:00
serge_sans_paille e67cbac812 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
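
A hedged example (mine, with a hypothetical function name) of a frame that needs probing: without protection a single SP bump of several pages can jump past the guard page, while with -fstack-clash-protection the prologue touches each page:

  void big_frame() {
    volatile char buf[64 * 1024]; // several pages of stack in one allocation
    buf[0] = 1;                   // keep the buffer alive
  }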

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically, the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This is a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-09 10:42:45 +01:00
serge-sans-paille 4546211600 Revert "Support -fstack-clash-protection for x86"
This reverts commit 0fd51a4554.

Failures:

http://lab.llvm.org:8011/builders/llvm-clang-win-x-armv7l/builds/4354
2020-02-09 10:06:31 +01:00
serge_sans_paille 0fd51a4554 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically, the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This is a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-09 09:35:42 +01:00
Craig Topper eeb63944e4 [LegalizeTypes][ARM][AArch64][PowerPC][RISCV][X86] Use BUILD_PAIR to return expanded integer results from ReplaceNodeResults instead of just returning two results.
Remove code from LegalizeTypes that allowed this to work.

We were already using BUILD_PAIR for this in some places so this
standardizes on a single way to do this.
2020-02-08 09:52:31 -08:00
serge-sans-paille 658495e6ec Revert "Support -fstack-clash-protection for x86"
This reverts commit e229017732.

Failures:

http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-debian/builds/2604
http://lab.llvm.org:8011/builders/llvm-clang-win-x-aarch64/builds/4308
2020-02-08 14:26:22 +01:00
serge_sans_paille e229017732 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically, the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This is a recommit of 39f50da2a3 with better option
handling and more portable testing

Differential Revision: https://reviews.llvm.org/D68720
2020-02-08 13:31:52 +01:00
Simon Pilgrim 7f5b3fa73c [X86][SSE] Add X86ISD::FRCP handling to isNegatibleForFree
Peek through X86ISD::FRCP nodes to see if there is a negatible input.
2020-02-08 10:56:27 +00:00
Simon Pilgrim 4229f12a22 [TargetLowering] Remove isDesirableToCombineBuildVectorToShuffleTruncate target hook. NFC.
This hasn't been used for years; its original implementation, D35700, had bugs that caused the reversion of most of the code, and since then x86 shuffle lowering/combining has handled most cases and can deal with the rest as well.
2020-02-08 08:55:51 +00:00
Nico Weber b03c3d8c62 Revert "Support -fstack-clash-protection for x86"
This reverts commit 4a1a0690ad.
Breaks tests on mac and win, see https://reviews.llvm.org/D68720
2020-02-07 14:49:38 -05:00
serge_sans_paille 4a1a0690ad Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically, the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This is a recommit of 39f50da2a3 with correct option
flags set.

Differential Revision: https://reviews.llvm.org/D68720
2020-02-07 19:54:39 +01:00
Sanjay Patel de6f7eb47e [x86] don't create an unused constant vector
Noticed while scanning through debug spew. Creating unused
nodes is inefficient and makes following the debug output harder.
2020-02-07 12:05:02 -05:00
Simon Pilgrim c96001035d [X86] isNegatibleForFree - allow pre-legalized FMA negation
As long as the FMA operation is legal (which we can proxy for the FMA3/FMA4 variants as well), we don't have to wait for the LegalOperations stage.
2020-02-07 17:04:17 +00:00
serge-sans-paille f6d98429fc Revert "Support -fstack-clash-protection for x86"
This reverts commit 39f50da2a3.

The -fstack-clash-protection is being passed to the linker too, which
is not intended.

Reverting and fixing that in a later commit.
2020-02-07 11:36:53 +01:00
Guillaume Chatelet f85d3408e6 [NFC] Introduce an API for MemOp
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73964
2020-02-07 11:32:27 +01:00
serge_sans_paille 39f50da2a3 Support -fstack-clash-protection for x86
Implement protection against the stack clash attack [0] through inline stack
probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].

This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

Differential Revision: https://reviews.llvm.org/D68720
2020-02-07 10:56:15 +01:00
Craig Topper 3f62028f2f [X86] Use SelectionDAG::getAllOnesConstant to simplify some code. NFC 2020-02-06 21:32:53 -08:00
Craig Topper ec9a94af4d [X86] Use MVT::i8 instead of MVT::i64 for shift amount in BuildSDIVPow2
X86 uses i8 for shift amounts. This code can fail on a 32-bit target
if it runs after type legalization.

This code was copied from AArch64 and modified for X86, but the
shift amount wasn't changed to the correct type for X86.

Fixes PR44812
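
For context, a hedged sketch (my example, not code from the patch) of the standard signed divide-by-power-of-two expansion this code builds; the shift amounts are the operands that must use i8 on X86:

  #include <cstdint>

  int32_t sdiv_by_8(int32_t x) {
    int32_t bias = (x >> 31) & 7; // 7 when x is negative, 0 otherwise
    return (x + bias) >> 3;       // rounds toward zero, matching x / 8
  }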
2020-02-06 13:32:13 -08:00
Craig Topper 4175d7e22e [X86] Custom isel floating point X86ISD::CMP on pre-CMOV targets. Eliminate ConvertCmpIfNecessary
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.

Previously this was done by calling ConvertCmpIfNecessary in
multiple places which would emit the extra code for the FNSTSW,
a shift, a truncate, and a SAHF instruction. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.

This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.

Differential Revision: https://reviews.llvm.org/D73863
2020-02-06 10:43:06 -08:00
Sanjay Patel 0a389c81cd [x86] use getSplatIndex() in lowerShuffleAsBroadcast()
The old code was doing an N^2 search for splat index.

Differential Revision: https://reviews.llvm.org/D74064
2020-02-05 14:55:02 -05:00
Craig Topper a3d489e87e [X86] Add a DAG combine for (i32 (sext (i8 (x86isd::setcc_carry)))) -> (i32 (x86isd::setcc_carry)) and remove isel patterns.
Same for any_extend though we don't have coverage for that.

The test changes are because isel didn't check for a single use of the
setcc_carry. So in isel we would end up with two differently
sized setcc_carry instructions. And since it clobbers
the flags, we would need to recreate the flags for the second
instruction.

This code handles additional uses by truncating the new wide
setcc_carry back to the original size for those uses.
2020-02-04 22:40:36 -08:00
Craig Topper 016f42e3dc [X86] Add custom lowering for lrint/llrint to either cvtss2si/cvtsd2si or fist.
lrint/llrint are defined as rounding using the current rounding
mode. Numbers that can't be converted raise FE_INVALID and an
implementation-defined value is returned. They may also write to
errno.

I believe this means we can use cvtss2si/cvtsd2si or fist to
convert as long as -fno-math-errno is passed on the command line.
Clang will leave them as libcalls if errno is enabled so they
won't become ISD::LRINT/LLRINT in SelectionDAG.

For 64-bit results on a 32-bit target we can't use cvtss2si/cvtsd2si
but we can use fist since it can write to a 64-bit memory location.
Though maybe we could consider using vcvtps2qq/vcvtpd2qq on avx512dq
targets?

gcc also does this optimization.

I think we might be able to do this with STRICT_LRINT/LLRINT as
well, but I've left that for future work.
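
An illustrative example (mine, not a test from the patch):

```cpp
#include <cmath>

// With -fno-math-errno this should become ISD::LRINT and then a single
// cvtsd2si (SSE2) or a fist store (x87); with errno enabled it stays a call.
long toLong(double d) {
  return std::lrint(d); // rounds using the current rounding mode
}
```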

Differential Revision: https://reviews.llvm.org/D73859
2020-02-04 16:15:40 -08:00
Reid Kleckner 2d89e0a098 [SEH] Remove CATCHPAD SDNode and X86::EH_RESTORE MachineInstr
The CATCHPAD node mostly existed to be selected into the EH_RESTORE
instruction, which sets the frame back up when 32-bit Windows exceptions
return to the parent function. However, creating this MachineInstr early
increases the risk that other passes will come along and insert
instructions that use the stack before ESP and EBP are restored. That
happened in PR44697.

Instead of representing these in the instruction stream early, delay it
until PEI. Mark the blocks where this needs to happen as EHPads, but not
funclet entry blocks. Passes after PEI have to be careful not to hoist
instructions that can use stack across frame setup instructions, so this
should be relatively reliable.

Fixes PR44697

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D73752
2020-02-04 15:13:12 -08:00
Craig Topper e195ff98f6 Recommit "[X86] Use X86ISD::SUB instead of X86ISD::CMP in some places."
This time with correct types for the data result from the SUB.

Original commit message:

Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.

This commit makes other places that create X86ISD::CMP have the
same behavior.
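
The CSE payoff is easiest to see at the source level; an illustrative example (not from the commit):

```cpp
// Both uses below want (a - b): the comparison needs its flags, the return
// value needs its result. Lowering the setcc as X86ISD::SUB lets the two
// share one node; the peephole pass turns flag-only subs back into cmps.
int subAndCompare(int a, int b, bool &lt) {
  lt = a < b;
  return a - b;
}
```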
2020-02-04 12:19:34 -08:00
Kadir Cetinkaya d2b6ac6ccd
Revert "[X86] Use X86ISD::SUB instead of X86ISD::CMP in some places."
This reverts commit 8413116bf1.

This seems to be causing crashes while compiling ncurses.
```
$ ./bin/llc bugpoint-reduced-simplified.ll
LLVM ERROR: Cannot emit physreg copy instruction
```

Here are the crashers: https://gist.github.com/kadircet/918f5bb97a2afe048cb875490edba46e

Executing with an llc compiled at 904d54de9b works fine.
2020-02-04 11:22:53 +01:00
Guillaume Chatelet b8144c0536 [NFC] Encapsulate MemOp logic
Summary:
This patch simply introduces functions instead of directly accessing the fields.
This helps introduce additional check logic. A second patch will add simplifying functions.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73945
2020-02-04 10:36:26 +01:00
Craig Topper cd14b4a62b [X86] Remove unneeded code that looks for (and (i8 (X86setcc_c))
I don't believe we use this construct anymore so I don't think
we need to look for it.
2020-02-03 23:18:11 -08:00
Craig Topper 4581d97416 [X86] Remove some uncovered and possibly broken code from combineZext.
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating too. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.

I believe this code is a leftover from when i1 was a legal type.
2020-02-03 22:59:39 -08:00
Craig Topper 8413116bf1 [X86] Use X86ISD::SUB instead of X86ISD::CMP in some places.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.

This commit makes other places that create X86ISD::CMP have the
same behavior.
2020-02-03 21:01:11 -08:00
Craig Topper c3a47221e0 [X86] Don't emit two X86ISD::COMI/UCOMI nodes when handling comi/ucomi intrinsics.
We were creating two with different operand orders, and then only
using one of them.

Instead just swap the operands when needed and create a single node.
2020-02-03 20:08:01 -08:00
Simon Pilgrim 3ece5a23bd [X86] getTargetShuffleMask - use getConstantOperandVal helper. NFCI. 2020-02-03 18:06:47 +00:00
Simon Pilgrim 8c0e715eb2 [X86] BEXTR SimplifyDemandedBitsForTargetNode - length == 0 -> result = 0 2020-02-03 16:50:03 +00:00
Guillaume Chatelet 333f2ad8b8 [Alignment][NFC] Use Align for getMemcpy/Memmove/Memset
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73885
2020-02-03 17:13:19 +01:00
Simon Pilgrim 8ead5df0b1 [X86] computeKnownBitsForTargetNode - add BEXTR support (PR39153)
Add a KnownBits::extractBits helper
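
A small sketch of how the new helper maps onto BEXTR's semantics (my example; only the extractBits helper itself is from the patch):

```cpp
#include "llvm/Support/KnownBits.h"
#include <cassert>
using namespace llvm;

// BEXTR(src, start, len) copies the bit slice [start, start+len) to the
// low bits of the result, so the slice's known bits carry over directly.
KnownBits knownBitsOfBEXTRField(const KnownBits &Src, unsigned Start,
                                unsigned Len) {
  assert(Start + Len <= Src.getBitWidth() && "slice out of range");
  return Src.extractBits(Len, Start);
}
```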
2020-02-03 15:43:59 +00:00
Simon Pilgrim a9ee3ffbc0 [X86] Move BEXTR DemandedBits handling inside SimplifyDemandedBitsForTargetNode
Some prep work for PR39153.
2020-02-03 15:16:40 +00:00
Craig Topper cf20fde1d1 [X86] Remove a couple unnecessary calls to ConvertCmpIfNecessary.
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
2020-02-02 21:36:51 -08:00
Craig Topper ee85415dbb [X86] Use MVT::f80 for the result type of the FLD used to convert from SSE register to X87 register in FP_TO_INTHelper. 2020-02-02 13:24:37 -08:00
Simon Pilgrim 5d86ac82a6 Fix a few spelling mistakes in comments. NFCI. 2020-02-02 18:27:43 +00:00
Simon Pilgrim 17e91b7dd2 [X86][SSE] combineBitcastvxi1 - add pre-AVX512 v64i1 handling 2020-02-02 18:00:09 +00:00
Guillaume Chatelet 3c89b75f23 [NFC] Introduce a type to model memory operation
Summary: This is a first step before changing the types to llvm::Align and introducing functions to ease client code.

Reviewers: courbet

Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73785
2020-01-31 17:29:01 +01:00
Craig Topper 90c31b0f42 [X86] Custom lower ISD::FROUND with SSE4.1 to avoid a libcall.
ISD::FROUND is defined to round to nearest with ties rounding
away from 0. This mode isn't supported in hardware on X86.

But as long as we aren't compiling with trapping math, we can
emulate this with trunc(X + copysign(nextafter(0.5, 0.0), X)).

We have to use nextafter to avoid some corner cases that adding
0.5 would have. For example, if X is nextafter(0.5, 0.0) it should
round to 0.0, but adding 0.5 would need one more mantissa bit
than can be stored, so it rounds to 1.0. Adding nextafter(0.5, 0.0)
instead just increases the exponent by 1 and leaves the mantissa
as all 1s. The result is nextafter(1.0, 0.0), which truncates to 0.0.

Technically this requires -fno-trapping-math, which isn't our default.
But if we care about exceptions we should be using constrained
intrinsics. Constrained intrinsics would use STRICT_FROUND which
won't go through this code.

Fixes PR42195.
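
A self-contained sketch of the emulation, assuming no trapping math as stated above:

```cpp
#include <cmath>

// Ties round away from zero, as ISD::FROUND requires. Magic is the largest
// double strictly below 0.5, which avoids the extra-mantissa-bit problem
// described above.
double roundAwayFromZero(double X) {
  const double Magic = std::nextafter(0.5, 0.0);
  return std::trunc(X + std::copysign(Magic, X));
}
```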

Differential Revision: https://reviews.llvm.org/D73607
2020-01-29 09:10:02 -08:00
Craig Topper e5edd641fd [X86] Use a shorter sequence to implement FLT_ROUNDS
This code needs to map from the FPCW 2-bit encoding for rounding mode to the 2-bit encoding defined for FLT_ROUNDS. The previous implementation did some clever swapping of bits and adding 1 modulo 4 to do the mapping.

This patch instead uses an 8-bit immediate as a lookup table of four 2-bit values. Then we use the 2-bit FPCW encoding to index the lookup table by using a right shift and an AND. This requires extracting the 2-bit value from FPCW and multiplying it by 2 to make it usable as a shift amount, but it still results in less code.
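
A standalone sketch of the trick; the 0x2D immediate is reconstructed here from the standard FPCW RC and FLT_ROUNDS encodings, not copied from the patch:

```cpp
#include <cstdint>

// FPCW RC:    0 = nearest, 1 = down, 2 = up, 3 = toward zero.
// FLT_ROUNDS: 0 = toward zero, 1 = nearest, 2 = up, 3 = down.
// Packing the four 2-bit answers gives 0b00'10'11'01 == 0x2D.
int fltRoundsFromFPCW(uint32_t fpcw) {
  uint32_t rc = (fpcw >> 10) & 3; // RC is bits 11:10 of the control word
  return (0x2D >> (rc * 2)) & 3;  // index the immediate as a 4-entry table
}
```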

Differential Revision: https://reviews.llvm.org/D73599
2020-01-29 08:56:33 -08:00
Craig Topper ca2abea29a [X86] Use SelectionDAG::getZExtOrTrunc to simplify some code. NFCI 2020-01-28 16:27:59 -08:00
Wang, Pengfei 3d1f0ce3b9 [X86] Add combination for fma and fneg on X86 under strict FP.
Summary: X86 has instructions that calculate fma and fneg at the same time, but we combine the fneg and fma only when the fneg is the source operand under strict FP.
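
An illustrative example (mine): folding the fneg into the multiply selects a single negated-FMA instruction:

```cpp
// -(a*b) + c is a single vfnmadd on FMA-capable x86 rather than a separate
// multiply, negate, and add; the patch extends this folding to strict FP.
float fnmadd(float a, float b, float c) {
  return __builtin_fmaf(-a, b, c);
}
```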

Reviewers: craig.topper, andrew.w.kaylor, uweigand, RKSimon, LiuChen3

Subscribers: LuoYuanke, llvm-commits, cfe-commits, jdoerfert, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72824
2020-01-28 20:09:56 +08:00
Simon Pilgrim 2d5e281b0f [X86][AVX] Add a more aggressive SimplifyMultipleUseDemandedBits to simplify masked store masks.
Fixes a poor codegen issue noticed in PR11210.
2020-01-27 16:44:25 +00:00
Simon Pilgrim fa19d67a2a [X86][AVX] Extend combineCommutableSHUFP to handle v8f32 and v16f32 commutable shufps patterns 2020-01-26 19:04:12 +00:00
Simon Pilgrim 1a81b296cd [X86][SSE] combineCommutableSHUFP - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Pull out combineTargetShuffle code added in rG3fd5d1c6e7db into a helper function and extend it to handle shufps(shufps(load(),x),y) and shufps(y,shufps(load(),x)) cases as well.
2020-01-26 14:36:23 +00:00
Craig Topper 3fdd435a4b [X86] Use a macro to convert X86ISD names to strings in getTargetNodeName.
Every case in the switch had a string version of themselves. Two
of them had a typo that used : instead of ::

By using a macro we can automate the string creation and avoid
the possibility of typos like this.

This is similar to what is done on the AMDGPU target.
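
A standalone sketch of the macro pattern, with the enum and names invented for illustration:

```cpp
// The stringizing operator '#' derives the string from the same token as
// the case label, so a ':' vs '::' typo can no longer slip in.
namespace X86ISD { enum NodeType { CMP, SETCC_CARRY }; }

const char *getTargetNodeName(unsigned Opcode) {
  switch (Opcode) {
#define NODE_NAME_CASE(NODE) \
  case X86ISD::NODE:         \
    return "X86ISD::" #NODE;
    NODE_NAME_CASE(CMP)
    NODE_NAME_CASE(SETCC_CARRY)
#undef NODE_NAME_CASE
  }
  return nullptr;
}
```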
2020-01-25 18:27:29 -08:00
Craig Topper 2c1decc040 [X86] Break the loop in LowerReturn into 2 loops. NFCI
I believe for STRICT_FP I need to use a STRICT_FP_EXTEND when extending to f80 for returning f32/f64 in 32-bit mode when SSE is enabled. The STRICT_FP_EXTEND node requires a Chain, and I need to get that node onto the chain before any CopyToRegs are emitted, because all the CopyToRegs are glued and chained together. So I can't put a STRICT_FP_EXTEND on the chain between the glued nodes without also gluing the STRICT_FP_EXTEND.

This patch moves all the extend creation to a first pass and then creates the CopyToRegs and fills out RetOps in a second pass.

Differential Revision: https://reviews.llvm.org/D72665
2020-01-24 14:44:38 -08:00
Simon Pilgrim 3fd5d1c6e7 [X86][SSE] combineTargetShuffle - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Moves lowerShuffleWithSHUFPS commutation code from rG30fcd29fe479 to catch cases during combine
2020-01-24 15:23:20 +00:00
Simon Pilgrim 30fcd29fe4 [X86][SSE] lowerShuffleWithSHUFPS - commute '2*V1+2*V2 elements' mask if it allows a loaded fold
As mentioned on D73023.
2020-01-24 12:04:10 +00:00
Guillaume Chatelet 805c157e8a [Alignment][NFC] Deprecate Align::None()
Summary:
This is a follow up on https://reviews.llvm.org/D71473#inline-647262.
There's a caveat here: `Align(1)` relies on the compiler's understanding of the `Log2_64` implementation to produce good code. One could use `Align()` as a replacement, but I believe it is less clear in that case that the alignment is one.
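
The migration itself is mechanical; a sketch assuming llvm/Support/Alignment.h:

```cpp
#include "llvm/Support/Alignment.h"
using namespace llvm;

// Before: Align A = Align::None();
// After: spell out the one-byte alignment. Align(1) goes through Log2_64(1),
// which constant-folds to 0, so there is no runtime cost.
Align oneByteAlignment() { return Align(1); }
```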

Reviewers: xbolva00, courbet, bollu

Subscribers: arsenm, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, Jim, kerbowa, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73099
2020-01-24 12:53:58 +01:00
Simon Pilgrim 0ec25a0316 [X86] LowerRotate - early out for vector rotates by zero 2020-01-23 17:48:09 +00:00
Guillaume Chatelet 279fa8e006 [Alignment][NFC] Deprecate untyped CreateAlignedLoad
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73260
2020-01-23 13:34:32 +01:00
Sanjay Patel 363d27c871 [x86] fold vperm2x128 to concat of 128-bit high half vectors
vperm (ins ?, X, C), (ins ?, Y, C), 0x31 --> concat X, Y

This is another shuffle problem seen with PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

We have this small crack in legalization/lowering/combining/demanded
that allows forming a vperm2f128 of high halves with AVX1 when we
could do better by peeking through the insert_subvector nodes.
AFAICT, it requires IR as shown in the diffs - much larger than legal
vectors - to avoid all of the usual folds.

Another option would prevent forming the 256-bit vperm in lowering.

Differential Revision: https://reviews.llvm.org/D73197
2020-01-22 15:35:50 -05:00
Simon Pilgrim 5340434c94 [X86][SSE] combineExtractWithShuffle - extract(bitcast(broadcast(x))) --> x
Removes some unnecessary gpr<-->fpu traffic
2020-01-22 18:02:58 +00:00
Simon Pilgrim a14aa7dabd [X86][SSE] combineExtractWithShuffle - extract(bitcast(scalar_to_vector(x))) --> x
Removes some unnecessary gpr<-->fpu traffic
2020-01-22 16:11:08 +00:00
Simon Pilgrim c784e5451b Use SelectionDAG::getShiftAmountConstant(). NFCI. 2020-01-22 13:52:43 +00:00
Simon Pilgrim 963f268186 [X86][SSE] combineExtractWithShuffle - pull out repeated extract index code. NFCI. 2020-01-22 12:08:58 +00:00
Simon Pilgrim b065902ed4 [X86] combineBT - use SimplifyDemandedBits instead of GetDemandedBits
Another step towards removing SelectionDAG::GetDemandedBits entirely
2020-01-21 14:24:46 +00:00
Simon Pilgrim eaa4548459 [X86][SSE] Add PACKSS SimplifyMultipleUseDemandedBits 'sign bit' handling.
Attempt to use SimplifyMultipleUseDemandedBits to simplify PACKSS if we're only after the sign bit.
2020-01-20 10:48:54 +00:00
Florian Hahn 0ee1db2d1d [X86] Try to avoid casts around logical vector ops recursively.
Currently PromoteMaskArithmetic only looks at a single operation to
skip casts. This means we miss cases where we combine multiple masks.

This patch updates PromoteMaskArithmetic to try to recursively promote
AND/XOR/OR nodes that terminate in truncates of the right size or
constant vectors.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D72524
2020-01-19 17:22:43 -08:00
Craig Topper 5fa2022ec0 [X86] Remove X86ISD::FILD_FLAG and stop gluing nodes together.
Summary:
I think whatever problem the gluing was fixing has long since been fixed. We don't have any of the restrictions on FP stack stuff that existed back when this was first added.

I had to change which type we use for FILD in BuildFILD when X86 was enabled because most of the isel patterns block f32/f64 instructions when SSE1/SSE2 are enabled. So I needed to use the f80 pattern, but this shouldn't have an effect on the generated code since there is only one FILD instruction anyway. We already use f80 explicitly in other places.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: andrew.w.kaylor, scanon, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72805
2020-01-18 23:44:05 -06:00
Simon Pilgrim 69bc450882 [X86] Rename lowerShuffleAsRotate -> lowerShuffleAsVALIGN
Since it can only ever create VALIGN nodes.
2020-01-18 11:29:14 +00:00
Michael Liao 6d0d86a64d [DAG] Add helper for creating constant vector index with correct type. NFC. 2020-01-18 01:23:36 -05:00
Sanjay Patel 43f60e614a [x86] try harder to form 256-bit unpck*
This is another part of a problem noted in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

The AVX2 code may use awkward 256-bit shuffles vs. the AVX code that gets split
into the expected 128-bit unpack instructions. We have to be selective in
matching the types where we try to do this though. Otherwise, we can end up
with more instructions (in the case of v8x32/v4x64).

Differential Revision: https://reviews.llvm.org/D72575
2020-01-17 10:42:39 -05:00
Craig Topper e445447921 [X86] When handling i64->f32 sint_to_fp on 32-bit targets only bitcast to f64 if sse2 is enabled.
The code is trying to copy the i64 value to an xmm register to
use a 64-bit store so that the 64-bit fild can benefit from
store forwarding.

But this trick only works if f64 is going to be stored in an
XMM register. If we only have SSE1, then only float can live in an XMM
register. So this trick just causes two i32 stores, an f64
load into the x87, an f64 store from the x87, and a 64-bit fild. So we end
up with an extra stack temporary and still don't get store forwarding.

We might be able to use v2f32 here instead, but I didn't check. I
just wanted the code to make sense.

Found by inspection as I continue to stare too hard at our
int_to_fp conversions.
2020-01-15 18:26:28 -08:00
Craig Topper be8f217b18 [X86] Don't call LowerUINT_TO_FP_i32 for i32->f80 on 32-bit targets with sse2.
We were performing an emulated i32->f64 in the SSE registers, then
storing that value to memory and doing an extload into the X87
domain.

After this patch we'll now just store the i32 to memory along
with an i32 0. Then do a 64-bit FILD to f80 completely in the X87
unit. This matches what we do without SSE.
2020-01-15 00:43:07 -08:00
Reid Kleckner 40cd26c700 [Win64] Handle FP arguments more gracefully under -mno-sse
Pass small FP values in GPRs or stack memory according to the normal
convention. This is what gcc -mno-sse does on Win64.

I adjusted the conditions under which we emit an error to check if the
argument or return value would be passed in an XMM register when SSE is
disabled. This has a side effect of no longer emitting an error for FP
arguments marked 'inreg' when targeting x86 with SSE disabled. Our
calling convention logic was already assigning it to FP0/FP1, and then
we emitted this error. That seems unnecessary, we can ignore 'inreg' and
compile it without SSE.

Reviewers: jyknight, aemerson

Differential Revision: https://reviews.llvm.org/D70465
2020-01-14 17:19:35 -08:00
Craig Topper 76291e1158 [X86] Drop an unneeded FIXME. NFC
The extload on X87 is free.
2020-01-14 17:05:46 -08:00
Craig Topper 57eb56b839 [X86] Swap the 0 and the fudge factor in the constant pool for the 32-bit mode i64->f32/f64/f80 uint_to_fp algorithm.
This allows us to generate better code for selecting the fixup
to load.

Previously when the sign was set we had to load offset 0. And
when it was clear we had to load offset 4. This required a testl,
setns, zero extend, and finally a mul by 4. By switching the offsets
we can just shift the sign bit into the lsb and multiply it by 4.
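
A sketch of the arithmetic the new layout enables (my reconstruction, not code from the patch):

```cpp
#include <cstdint>

// After the swap, the fudge-factor slot is sign ? 4 : 0, which is just the
// sign bit shifted to the lsb and scaled by 4; no testl/setns/zext chain.
uint32_t fudgeOffset(uint32_t hiWord) {
  return (hiWord >> 31) * 4;
}
```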
2020-01-14 17:05:23 -08:00
Craig Topper 98c54fb1fe [X86] Directly emit a BROADCAST_LOAD from constant pool in lowerUINT_TO_FP_vXi32 to avoid double loads seen in D71971
By directly emitting the constants as a constant pool load we seem to avoid the build_vector/extract_subvector combines that resulted in the duplicate loads we had before.

Differential Revision: https://reviews.llvm.org/D72307
2020-01-14 10:50:39 -08:00
Simon Pilgrim 66e39067ed [X86][AVX] Use lowerShuffleAsLanePermuteAndSHUFP to lower binary v4f64 shuffles.
Only perform this if we are shuffling lower and upper lane elements across the lanes (otherwise splitting to lower xmm shuffles would be better).

This is a regression if we shuffle build_vectors, due to getVectorShuffle canonicalizing 'blend of splat' build vectors; for now I've set this not to shuffle build_vector nodes at all to avoid this.
2020-01-12 12:29:41 +00:00
Simon Pilgrim b375f28b0e [X86][AVX] lowerShuffleAsLanePermuteAndSHUFP - only set the demanded elements of the lane mask.
Fixes a cyclic dependency issue with an upcoming patch where getVectorShuffle canonicalizes masks with splat build vector sources.
2020-01-12 09:41:40 +00:00
Craig Topper d692f0f6c8 [X86] Don't call LowerSETCC from LowerSELECT for STRICT_FSETCC/STRICT_FSETCCS nodes.
This causes the STRICT_FSETCC/STRICT_FSETCCS nodes to be lowered
early while lowering SELECT, but the output chain doesn't get
connected. Then we visit the node again when it is its turn
because we haven't replaced the use of the chain result. In the
case of the fp128 libcall lowering, after D72341 this will cause
the libcall to be emitted twice.
2020-01-11 20:43:00 -08:00
Craig Topper ed679804d5 [TargetLowering][X86] Connect the chain from STRICT_FSETCC in TargetLowering::expandFP_TO_UINT and X86TargetLowering::FP_TO_INTHelper. 2020-01-11 17:50:20 -08:00
Simon Pilgrim 24763734e7 [X86] Fix outdated comment
The generic saturated math opcodes are no longer widened inside X86TargetLowering
2020-01-11 14:37:18 +00:00