We looked through a truncate to get to the load. So we should be
deleting the truncate first.
There is a check that the node is really unused before deleting
so this didn't cause a functional issue.
This matches what we do for the full sized vector ops at the start of combineX86ShufflesRecursively, and helps getFauxShuffleMask extract more INSERT_SUBVECTOR patterns.
Try to prevent future node creation issues (as detailed in PR45974) by making the SelectionDAG reference const, so it can still be used for analysis, but not node creation.
As detailed on PR45974 and D79987, getFauxShuffleMask is creating nodes on the fly to create shuffles with inputs the same size as the result, causing problems for hasOneUse() checks in later simplification stages.
Currently only combineX86ShufflesRecursively benefits from these widened inputs so I've begun moving the functionality there, and out of getFauxShuffleMask. This allows us to remove the widening from VBROADCAST and *EXTEND* faux shuffle cases.
This just leaves the INSERT_SUBVECTOR case in getFauxShuffleMask still creating nodes, which will require more extensive refactoring.
We already had a DAG combine for (mmx (bitconvert (i64 (extractelement v2i64))))
to MOVDQ2Q.
Remove patterns for MMX_MOVQ2DQrr/MMX_MOVDQ2Qrr that use
scalar_to_vector/extractelement involving i64 scalar type with
v2i64 and x86mmx.
Let the codegen recognized the nomerge attribute and disable branch folding when the attribute is given
Differential Revision: https://reviews.llvm.org/D79537
Only 64-bit bits will be loaded, not the whole 128 bits. We can
just combine it to plain mmx load. This has the side effect of
enabling isel load folding for it.
This part of my desire to get rid of isel patterns that shrink loads.
If we are using PTEST to check 'allsign bits' vector elements we can use MOVMSK to extract the signbits directly and perform the comparison on the scalar value.
For vXi16 cases, as we don't have a MOVMSK for this type, we must mask each signbit out of a PMOVMSKB v2Xi8 result, which folds into the TEST comparison.
If this allows us to remove a vector op (via the SimplifyMultipleUseDemandedBits call) this is consistently faster than a PTEST (https://godbolt.org/z/ziJUst).
I'm investigating whether we ever get regressions without the SimplifyMultipleUseDemandedBits call, even if this means we don't remove a vector op, but that has exposed some other poor codegen issues that I'm still investigating and would have to wait for a later patch.
Suggested on PR42035 to avoid unnecessary ashr(x,bw-1)/pcmpgt(0,x) sign splat patterns feeding into ptest.
Differential Revision: https://reviews.llvm.org/D80563
There's more code for calling CombineTo and replacing the nodes
that I'd like to share, but its complicated by the getNode call
in the middle that needs to be specific to each opcode.
While there are also make sure we recursively delete the load
we're replacing. It eventually gets removed by a RemoveDeadNodes
call at the end of DAG combine, but we should be more eager about
it. We were inconsistently doing this in some places but not all.
Fix combineSubToSubus to handle the new DAG to avoid a regression.
There are still regressions in test14/test15/test16. Where it
looks like were trying to set up cases we could match to
umin+trunc+subus but the handling was never finished. The
regression here isn't unique to sub. Its a lost opportunity for
taking an AND with two truncated inputs and producing a larger
AND with a single truncate. The same thing could happen with
any other node we handle in combineTruncatedArithmetic since we
are moving the truncate up the DAG.
Differential Revision: https://reviews.llvm.org/D80483
- test both 32 and 64 bit version
- probe the tail in dynamic-alloca
- generate more concise code
Differential Revision: https://reviews.llvm.org/D79482
This replaces the build_vector lowering code that was just added in
D80013
and matches the pattern later from the x86-specific "vzext_movl".
That seems to result in the same or better improvements and gets rid
of the 'TODO' items from that patch.
AFAICT, we always shrink wider constant vectors to 128-bit on these
patterns, so we still get the implicit zero-extension to ymm/zmm
without wasting space on larger vector constants. There's a trade-off
there because that means we miss potential load-folding.
Similarly, we could load scalar constants here with implicit
zero-extension even to 128-bit. That saves constant space, but it
means we forego load-folding, and so it increases register pressure.
This seems like a good middle-ground between those 2 options.
Differential Revision: https://reviews.llvm.org/D80131
If we're extracting an upper subvector from a broadcast we're better off extracting the lowest subvector instead as it avoids an actual extract instruction and might help SimplifyDemandedVectorElts further simplify the code.
This initial version only peeks through cases where we just demand the sign bit of an ashr shift, but we could generalize this further depending on how many sign bits we already have.
The pr18014.ll case is a minor annoyance - we've failed to to move the psrad/paddd after the blendvps which would have avoided the extra move, but we have still increased the ILP.
On X86 (AVX1/AVX2), non-boolean masked loads only demand the sign bit of the mask, we already do the equivalent for masked stores.
Annoyingly I can't easily handle this inside TargetLowering::SimplifyDemandedBits as this is an x86 specific case for a generic node.
Differential Revision: https://reviews.llvm.org/D80478
This is a preliminary patch before I deal with the xor+and issue raised in D77301.
We get much better code for i8/i16 funnel shifts by concatenating the operands together and performing the shift as a double width type, it avoids repeated use of the shift amount and partial registers.
fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
fshr(x,y,z) -> (((zext(x) << bw) | zext(y)) >> (z & (bw-1))) >> bw.
Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn
This doesn't do as well for i32 cases on x86_64 (the xor+and followup patch is much better) so I haven't bothered with that.
Cases with constant amounts are more dubious as well so I haven't currently bothered with those - its these kind of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift.
Differential Revision: https://reviews.llvm.org/D80466
If the caller needs to reponsible for making sure the MaybeAlign
has a value, then we should just make the caller convert it to an Align
with operator*.
I explicitly deleted the relational comparison operators that
were being inherited from Optional. It's unclear what the meaning
of two MaybeAligns were one is defined and the other isn't
should be. So make the caller reponsible for defining the behavior.
I left the ==/!= operators from Optional. But now that exposed a
weird quirk that ==/!= between Align and MaybeAlign required the
MaybeAlign to be defined. But now we use the operator== from
Optional that takes an Optional and the Value.
Differential Revision: https://reviews.llvm.org/D80455
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer fo the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer fo the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
We have the getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, during negating the expression, the cost might change as we are changing the DAG,
and then, hit the assertion if we negated the wrong expression as the cost is not trustful anymore.
This patch is target to remove the getNegatibleCost to avoid the out of sync with getNegatedExpression,
and check the cost during negating the expression. It also reduce the duplicated code between
getNegatibleCost and getNegatedExpression. And fix the crash for the test in D76638
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D77319
This build vector lowering pattern came up in D79886.
I've tried to limit the improvement to cases where it looks
clearly better to load, but we could remove the 'TODO'
predicates already if we are willing to overlook some
corner cases.
Differential Revision: https://reviews.llvm.org/D80013
Summary:
The BFloat IR type is introduced to provide support for, initially, the BFloat16
datatype introduced with the Armv8.6 architecture (optional from Armv8.2
onwards). It has an 8-bit exponent and a 7-bit mantissa and behaves like an IEEE
754 floating point IR type.
This is part of a patch series upstreaming Armv8.6 features. Subsequent patches
will upstream intrinsics support and C-lang support for BFloat.
Reviewers: SjoerdMeijer, rjmccall, rsmith, liutianle, RKSimon, craig.topper, jfb, LukeGeeson, sdesmalen, deadalnix, ctetreau
Subscribers: hiraditya, llvm-commits, danielkiss, arphaman, kristof.beyls, dexonsmith
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78190
We already form PMULH when the shift is truncated. But we can
also do it from just a shift by extending the result.
Unfortunately, I get regressions if I try to replace the truncate
combine with this as we turn the truncate into a more complicated
sequence first. Then we are unable to combine that sequence with
the extend produced at the end of this combine.
Differential Revision: https://reviews.llvm.org/D79682
Expands on the enablement of the shouldSinkOperands() TLI hook in:
D79718
The last codegen/IR test diff shows what I suspected could happen - we were
sinking all splat shift operands into a loop. But that's not what we want in
general; we only want to sink the *shift amount* operand if it is a splat.
Differential Revision: https://reviews.llvm.org/D79827
SDAG suffers when it can't see that a funnel operand is a splat value
(due to single-basic-block visibility), so invert the normal loop
hoisting rules to move a splat op closer to its use.
This would be part 1 of an enhancement similar to D63233.
This is needed to re-fix PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426
...because we got better at canonicalizing IR to funnel shift intrinsics.
The existing CGP code for shift opcodes is likely overstepping what it was
intended to do, so that will be fixed in a follow-up.
Differential Revision: https://reviews.llvm.org/D79718
Summary:
This patch refactors handling of VarArgs in
X86TargetLowering::LowerFormalArguments.
That refactoring was requested while reviewing
D69372. Code related to varargs handling is removed
from X86TargetLowering::LowerFormalArguments and
is divided into smaller routines.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D74794
We have a couple main strategies for legalizing MULH.
-If the vXi16 type is legal, extend to do the full i16 multiply
and then shift and truncate the results.
-Use unpcks to split each 128 bit lane into high and low halves.a
For signed we have an extra case to split a v32i8 to v16i8 and then
use the extending to v16i16 strategy.
This patch proposes to use the unpck strategy instead. Which is
what we already do for unsigned.
This seems to be 1 instruction shorter when the RHS is constant
like the idiv case. It's 1 instruction longer for the smulo case.
But we're trading cross lane shuffles for inlane shuffles and a
shift.
Differential Revision: https://reviews.llvm.org/D79652
We have the getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, during negating the expression, the cost might change as we are changing the DAG,
and then, hit the assertion if we negated the wrong expression as the cost is not trustful anymore.
This patch is target to remove the getNegatibleCost to avoid the out of sync with getNegatedExpression,
and check the cost during negating the expression. It also reduce the duplicated code between
getNegatibleCost and getNegatedExpression. And fix the crash for the test in D76638
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D77319
XOP targets have fast per-element vector shifts and we're better off splitting to 128-bit shifts where necessary (which is what we already do in LowerShift).
For the sint_to_fp(and(X,C)) -> and(X,sint_to_fp(C)) fold, allow combineVectorCompareAndMaskUnaryOp to match any X that ComputeNumSignBits says is all-bits, not just SETCC.
Noticed while investigating mask promotion issues in PR45808
This patch stores the alignment for ConstantPoolSDNode as an
Align and updates the getConstantPool interface to take a MaybeAlign.
Removing getAlignment() will be done as a follow up.
Differential Revision: https://reviews.llvm.org/D79436
We rely on the combine
(sext_in_reg (v4i64 a/sext (v4i32 x)), v4i1) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x, ExtraVT)))
to avoid complex v4i64 ashr codegen, but doing so prevents v4i64 comparison mask promotion, so ensure we attempt to promote before canonicalizing the (hopefully now redundant sext_in_reg).
Helps with the poor codegen in PR45808.
Neither gcc or icc support this. Split out from D79472. I want
to remove more, but it looks like icc does support some things
gcc doesn't and I need to double check our internal test suites.
Y is the start of several 2 letter constraints, but we also had
partial support to recognize it by itself. But it doesn't look
like it can get through clang as a single letter so the backend
support for this was effectively dead.
This helped fix some i686 vXi64 broadcast folds that were becoming v2Xi32 broadcasts because we didn't match the broadcast until after SimplifyDemandedBits worked out we only used the bottom 32-bits in PMUL(U)DQ and type legalization had split the original i64 load.
A couple of regressions occurred which required some fixups - adding concat_vectors(broadcast_load,broadcast_load) splat support and recognising (unnecessary) unary shuffles of already broadcasted vectors.
This came about as part of the work investigating vector load combining from shuffles for PR42550.
This patch replaces the VZEXT_MOVL removal from combineShuffle with a more general version based in SimplifyDemandedVectorEltsForTargetNode.
By using computeKnownBits we can always remove the VZEXT_MOVL if the upper elements of the source operand are known to be zero.
This requires us to add the conversion ops to computeKnownBitsForTargetNode as well.
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D79335
gcc supports selecting ymm0/zmm0 for the Yz constraint when used with 256 or 512 bit vector types.
Fixes PR45806
Differential Revision: https://reviews.llvm.org/D79448
Unless we're truncating an 'all-bits' result, using PACKSS for vXi64->vXi32 truncation causes problems with later combines as ComputeNumSignBits struggles to see through BITCASTs to smaller types. If we don't use PACKSS in these cases then we fallback to shuffles which are usually just as good.
We haven't promoted AND/OR/XOR to vXi64 types for a while. So
there's no reason to use isOperationLegalOrPromote. So we can
just use isOperationLegal by merging with ADD handling.
Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.
Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm. Then concat the two halves to use vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.
I had to had SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.
Differential Revision: https://reviews.llvm.org/D79231
The splitVector helper uses extractSubVector which splits build vectors like we do here, so avoid reimplementing it.
splitVector could easily be extended to peek through bitcasts as well but I'd prefer to keep this commit NFC.
Handle concat_vectors(extract_subvector(broadcast(x)), extract_subvector(broadcast(x))) -> broadcast(x)
To expose this we also need collectConcatOps to recognise the insert_subvector(x, extract_subvector(x, lo), hi) subvector splat pattern
Also fix some cost tables for vXi1 types to match the costs entries for the types they will be promoted to.
Differential Revision: https://reviews.llvm.org/D79045
This pushes the NOT pattern up the DAG to help expose it for further combines (AND->ANDN in particular).
The PSHUFD/MOVDDUP 'splat' cases are the only ones I've seen in the wild so far, we can further generalize if/when we need to.
X86 matches several 'shift+xor' funnel shift patterns:
fold (or (srl (srl x1, 1), (xor y, 31)), (shl x0, y)) -> (fshl x0, x1, y)
fold (or (shl (shl x0, 1), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
fold (or (shl (add x0, x0), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
These patterns are also what we end up with the proposed expansion changes in D77301.
This patch moves these to DAGCombine's generic MatchFunnelPosNeg.
All existing X86 test cases still pass, and we just have a small codegen change in pr32282.ll.
Reviewed By: @spatel
Differential Revision: https://reviews.llvm.org/D78935
This section is the remnant of how this code was structured before
we made v32i16/v64i8 legal types with avx512f when not restricting
to 256 bit vectors. Now that there are just a few items left,
merge them near similar things in the other section.
I've modified isTruncateFree to get an accurate cost for types that need to be split. I'm planning to look into fixing it for all vectors, but need more cost cleanups first.
Differential Revision: https://reviews.llvm.org/D78973
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().
I also made a few cleanups in here. For example, to removes use
of getElementType on a pointer when we could just use getFunctionType
from the call.
Differential Revision: https://reviews.llvm.org/D78882
The insert(truncate/extend(extract(vec0,c0)),vec1,c1) case in rGacbc5ede99 wasn't combining the 'mineltsize' with the src vector elt size which may be smaller due to implicit extension during extraction.
Reduced from test case provided by @mstorsjo
This is a NFC patch for D77319. The idea is to hide the getNegatibleCost inside the getNegatedExpression()
to have it return null if the cost is expensive, and add some helper function for easy to use. And
rename the old getNegatedExpression to negateExpression to avoid the semantic conflict.
Reviewed By: RKSimon
Differential revision: https://reviews.llvm.org/D78291
Followup to the PR45604 fix at rGe71dd7c011a3 where we disabled most of these cases.
By creating the shuffle at the byte level we can handle any extension/truncation as long as we track how small the scalar got and assume that the upper bytes will need to be zero.
This is another enhancement to D77895/D78362
to avoid a round-trip from XMM->GPR->XMM.
This time we handle the case of starting/ending with different FP types
but always with signed i32 as the intermediate value.
I think this covers all of the faux vector optimization possibilities
for pre-AVX512.
There is at least 1 other transform mentioned in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c19
...where we fold an 'fpext' into a preceding 'sitofp'. I think we will
want to handle that earlier (DAGCombiner or instcombine) because that's
a target-independent optimization.
Differential Revision: https://reviews.llvm.org/D78758
Summary:
These helpers are exercised by follow-up commits in this patch series,
which is all about removing CodeGen differences with vs. without debug
info in the AArch64 backend.
Reviewers: fhahn, aprantl, jpaquette, paquette
Subscribers: kristof.beyls, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78260
gcc may warn here because X86ISD::NodeType is specified as "unsigned",
but ISD::NodeType is a naked C enum (although passed as an "unsigned"
throughout SDAG).
getFauxShuffle attempts to combine INSERT_VECTOR_ELT(TRUNCATE/EXTEND(EXTRACT_VECTOR_ELT(x))) patterns into a target shuffle chain.
PR45604 identified an issue where the scalar was truncated to a size smaller than the destination vector element and then zero extended back, which requires the upper bits to be zero'd which we don't currently do.
To avoid the bug I've added an early out in these truncation cases, a future commit should allow us to handle this by inserting the necessary SM_SentinelZero padding.
This is an enhancement to D77895 to avoid another
round-trip from XMM->GPR->XMM. This time we handle
the case of starting/ending with an f64 and casting
to signed i32 as the intermediate value.
It's a bit more involved than I initially assumed
because we need to use target-specific opcodes to
represent the non-standard cast ops.
Differential Revision: https://reviews.llvm.org/D78362
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.
This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.
Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.
Differential Revision: https://reviews.llvm.org/D76212
-Consistently name the functions as split*
-Add a helper for doing the two extractSubvector calls and determining the size of the split
-Use getSplitDestVTs to get the result type for the split node.
-Move the binary and unary helper to one place in the file near the extractSubvector functions. Left the VSETCC one near LowerVSETCC since that's its only caller.
-Remove the 256/512 wrappers that just had asserts. I don't think they provided a lot of value and now with the routines called split* the call sites are more obvious what they do.
-Make the unary routine support different source and dest types to support D76212.
-Add some weaker asserts into the helpers to make up for losing the very specific asserts from the 256/512 wrappers.
Differential Revision: https://reviews.llvm.org/D78176
It can be used to avoid passing the begin and end of a range.
This makes the code shorter and it is consistent with another
wrappers we already have.
Differential revision: https://reviews.llvm.org/D78016
The shuffle decoding is used by X86ISelLowering and
MCTargetDesc/X86InstComments. The latter used to be in a
separate InstPrinter library. The Utils library existed to allow
InstPrinter and CodeGen to share the shuffle decoding. Since
X86InstComments now lives in the MCTargetDesc, which CodeGen
already depends on, we can sink the shuffle decoding there as well.
Differential Revision: https://reviews.llvm.org/D77980
Summary:
Fold (shift (shift X, C2), C1) -> (shift X, (C1 + C2)) for logical as
well as arithmetic shifts. This is needed to prevent regressions from
an upcoming funnel shift expansion change.
While we're here, fold (VSRAI -1, C) -> -1 too.
Reviewers: RKSimon, craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77300
Improve the chances of folding the writemask into the combined shuffle by scaling a wider shuffle mask to match the root's original type.
This creates a few minor issues with variable shuffles, preventing combines of shuffles because of the more limited support binary shuffle types. In most cases we're probably better off combining the shuffles and losing the writemask fold, but this isn't always going to be true.
As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM
by using the vector versions of the convert instructions.
Based on experimental results from recent Intel/AMD chips, we don't
need to worry about triggering denorm stalls while operating on
garbage data in the high lanes with convert instructions, so this is
expected to always be as good or better perf than the scalar
instruction equivalent. FP exceptions are also not a concern because
strict code should not be using the regular SDAG opcodes.
Differential Revision: https://reviews.llvm.org/D77895
A lot of vectorized code doesn't use masks so we shouldn't penalize them by not doing shuffle combining on avx512 targets.
I've added support for VALIGNQ/VALIGND and 512-bit SHUF128 to prevent some regressions. I also prevented recombining 256-bit SHUF128 to PERM2X128. We may not need to add 256-bit SHUF128 support, but I don't think I found any cases requiring that in my testing.
Differential Revision: https://reviews.llvm.org/D77928
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.
While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
Summary:
There are at least three clients for KnownBits calculations:
ValueTracking, SelectionDAG and GlobalISel. To reduce duplication the
common logic should be moved out of these clients and into KnownBits
itself.
This patch does this for AND, OR and XOR calculations by implementing
and using appropriate operator overloads KnownBits::operator& etc.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74060
This removes a call to getScalarType from a bunch of call sites.
It also makes the behavior consistent with SIGN_EXTEND_INREG.
Differential Revision: https://reviews.llvm.org/D77631
Shuffle combining can insert zero byte sized elements into the shuffle mask, which combineX86ShufflesConstants will attempt to fold without taking into account whether the byte-sized type is legal (e.g. AVX512F only targets).
If we have a full-zeroable vector then we should just return a zero version of the root type, otherwise if the type isn't valid we should bail.
Fixes PR45443
truncateVectorWithPACK has its own vector length controls, so we can rely on those directly. This helps some existing truncation to subvector tests, which were being combined later during shuffle lowering at which point the sign/zero bit detection had become obscured preventing lowerShuffleWithPACK working as well as it could.
We had previously limited the shuffle(HORIZOP,HORIZOP) combine to binary shuffles, but we can often merge unary shuffles just as well, folding in UNDEF/ZERO values into the 64-bit half lanes.
For the (P)HADD/HSUB cases this is limited to fast-horizontal cases but PACKSS/PACKUS combines under all cases.
Our existing combine allows to merge the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.
This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
If we're packing from 128-bits to 64-bits then we don't need the RHS argument. This helps with register allocation, especially as we avoid repeating a use of the input value.
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.
This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.
Noticed in codegen while dealing with PR31443.
Extend lowerShuffleWithPACK/matchShuffleWithPACK/createPackShuffleMask to handle compaction style shuffle masks that can be lowered to chains of PACKSS/PACKUS if their inputs are suitably sign/zero extended.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask should recognise the PACKSS/PACKUS chains.
This pass replaces each indirect call/jump with a direct call to a thunk that looks like:
lfence
jmpq *%r11
This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity)
The feature can be added via clang CLI using -mlvi-cfi.
This is an alternate implementation to https://reviews.llvm.org/D75934 That merges the thunk insertion functionality with the existing X86 retpoline code.
Differential Revision: https://reviews.llvm.org/D76812
First step toward making use of canLowerByDroppingEvenElements to match chains of PACKSS/PACKUS for compaction shuffles.
At the moment we still only match a single stage but the MatchPACK is now more general.
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0
If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correct handle the inversion implicitly.
Additionally, for "TESTZ" (ZF) cases, the allones case, PTEST(X,-1) can be simplified to PTEST(X,X).
We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.
Differential Revision: https://reviews.llvm.org/D76984
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77136
Summary: These were templated due to SelectionDAG using int masks for shuffles and IR using unsigned masks for shuffles. But now that D72467 has landed we have an int mask version of IRBuilder::CreateShuffleVector. So just use int instead of a template
Reviewers: spatel, efriedma, RKSimon
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D77183
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: jyknight, sdardis, nemanjai, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, jfb, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77059
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.
I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
This was an inner helper function for the real matchShuffleAsByteRotate function, but it is more generic and is used directly for VALIGN lowering which doesn't work at the byte level.
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.
This pass adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.
An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.
Differential Revision: https://reviews.llvm.org/D76649
We can improve computeKnownBits results by avoiding excess bitcasts.
For this pattern we were doing:
(v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))
By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on.
This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.
The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.
We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
As long we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.
We could probably extend this further to handle non-VLX truncation cases.
This reverts commit 4e0fe038f4. Re-lands
65b21282c7.
After landing 5ff5ddd0ad to add int3 into
trailing unreachable blocks, we can now remove these extra stack
adjustments without confusing the Win64 unwinder. See
https://llvm.org/45064#c4 or X86AvoidTrailingCall.cpp for a full
explanation.
Fixes PR45064.
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: dylanmckay, sdardis, nemanjai, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76551
createPSADBW uses SplitsOpsAndApply so should be able to handle
any size.
Restrict the extract result type to i32 or i64 since that's what
we have coverage for today and probably matches what the
isSimple() check gave us before.
Differential Revision: https://reviews.llvm.org/D76560
SplitsOpsAndApply will take care of any needed splitting correctly.
All that we need to check is that the vector element count is a
power of 2.
Differential Revision: https://reviews.llvm.org/D76558
We often widen xmm/ymm vectors to ymm/zmm by insertion into an undef base vector. By letting getTargetShuffleAndZeroables track the undef elts we can help avoid a lot of unnecessary cross-lane shuffles.
Fixes PR44694
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.
Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
The combine tries to put the broadcast in either the integer or
fp domain to match the bitcast domain. But we can only do this
if the broadcast size is 32 or larger.
Under certain circumstances we'll end up in the position where the negated shift amount will get truncated to the type specified getScalarShiftAmountTy(), so we need to test for a truncated version of the shift amount as well.
This allows us to remove half of the remaining patterns tested for by X86ISelLowering's combineOrShiftToFunnelShift.
Add optional support for opt-in partial reduction cases by providing an optional partial mask to indicate which elements have been extracted for the scalar reduction.
We currently only ever use this for lowering constant uniform values (shift/rotate by immediate) so we can safely enable it by default (it treats the undef bits as zero when extracting constants).
This is necessary for an upcoming patch that will use SimplifyDemandedBits more aggressively on funnel shift amounts and causes regressions in vXi64 constant without it.
Merge the INSERT_VECTOR_ELT/SCALAR_TO_VECTOR and PINSRW/PINSRB shuffle mask paths - they both do the same thing (find source vector + handle implicit zero extension). The PINSRW/PINSRB path also handled in the insertion of zero case which needed to be added to the general case as well.
This is a little more complicated than I'd like it to be. We have
to manually match a trunc+srl+load pattern that generic DAG
combine won't do for us due to isTypeDesirableForOp.
If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB.
The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4).
This matches what we already do for PEXTRW.
Differential Revision: https://reviews.llvm.org/D76138
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly, we just need to account for the operand commutation for SHRD.
The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs funnel shift modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which matches the generic implementation in all other terms.
Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this but it allows the DAG combines to continue to operate after legalization in a lot more cases.
The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough for the branch fall through code to mess up.
Differential Revision: https://reviews.llvm.org/D75748
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.
At some point soon we're going to have make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
isTypeDesirableForOp prevents loads from being shrunk to i16 by DAG
combine. Because of this we can't just match the broadcast and a
scalar load. So look for broadcast+truncate+load and form a
vbroadcast_load during DAG combine. This replaces what was
previously done as an isel pattern and I think fixes it so we
won't change the size of a volatile load. But my main motivation
is just to clean up our isel patterns.
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.
This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is I don't consider that a showstopper, just an annoyance (and there's plenty of other poor codegen cases in that file).
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.
Previously we any_extended in the scalar domain before inserting.
This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.
This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.
Differential Revision: https://reviews.llvm.org/D75835
In 172eee9c, we tried to avoid these by modelling the callee as
internally resetting the stack pointer.
However, for the majority of functions with reserved stack frames, this
would lead LLVM to emit extra SP adjustments to undo the callee's
internal adjustment. This lead us to fix the problem further on down the
pipeline in eliminateCallFramePseudoInstr. In 5b79e603d3, I added
use a heuristic to try to detect when the adjustment would be
unreachable.
This heuristic is imperfect, and when exception handling is involved, it
fails to fire. The new test is an example of this. Simply throwing an
exception with an active cleanup emits dead SP adjustments after the
throw. Not only are they dead, but if they were executed, they would be
incorrect, so they are confusing.
This change essentially reverts 172eee9c and makes the 5b79e603d3
heuristic responsible for preventing unreachable stack adjustments. This
means we may emit unreachable stack adjustments for functions using EH
with unreserved call frames, but that is not very many these days. Back
in 2016 when this change was added, we were focused on 32-bit, which we
observed to have fewer reserved frames.
Fixes PR45064
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D75712
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.
Now we handle the scalar case directly by broadcasting using the
scalar type as the element type. Then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.
For the vector case, we cast to VT or subvector of VT. And then
do the broadcast.
I think this all matches what we generated before, just in a more
readable way.
Previously we tried to promote these to xmm/ymm/zmm by promoting
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a non-sensical scalar to vector. This lead
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.
Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved. Except when the calling convention is one
that passes the vXi1 type in a k register.
The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we go to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.
Differential Revision: https://reviews.llvm.org/D75154
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.
I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.
Differential Revision: https://reviews.llvm.org/D75413
Also add a DAG combine to combine different sized broadcasts from
constant pool to avoid a regression.
Differential Revision: https://reviews.llvm.org/D75412
The build_vector needs to be the only user of the data, but the
chain will likely have another use. So we can't make sure the
build_vector is the only user of the node.
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.
Fixes PR45067.
Previously this code was called into two ways, either a FrameIndexSDNode
was passed in StackSlot. Or a load node was passed in the argument
called StackSlot. This was determined by a dyn_cast to FrameIndexSDNode.
In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.
For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.
This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guess work and reverse engineering.
I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just clean up to make the code less surprising.
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.
This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've update all in-tree targets to connect their chain through their lowering code.
Differential Revision: https://reviews.llvm.org/D75132
Instead add it when we make the machine nodes during instruction
selections.
This makes this ISD node closer to ISD::MGATHER. Trying to see
if we remove the X86 specific ones.
I'm hoping to begin improving shuffle combining across different vector sizes, but before that we must ensure that all existing getTargetShuffleInputs calls must bail if the inputs aren't the same size.
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming its an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
Leave the gather/scatter subclasses, but make them inherit from
MemIntrinsicSDNode and delete their constructor and destructor.
This way we can still have the getIndex, getMask, etc. convenience
functions.
If a simplication occurs the operand will be added to the worklist.
But since the demanded mask was based on N, we need to make sure
we revisit N in case there are more simplifications to be done.
Returning SDValue(N, 0) as we do, only tells DAG combine that
something changed, but that won't make it add anything to the
worklist.
Found while playing around with using VEXTRACT_STORE in more cases.
But I guess this doesn't affect any of our existing tests.
We can use MOVLPS which will load 64 bits, but we need a v4f32
result type. We already have isel patterns for this.
The code here is a little hacky. We can probably improve it with
more isel patterns.
This is similar to using movd which we do for sse2 targets.
I've added a DAG combine for VEXTRACT_STORE to use SimplifyDemandedVectorElts
to clean up some artifacts from type legalization.
Similar to what do for other operations that use a subset of bits.
Allows us to remove a pattern that shrinks a load. Which was
incorrect if the load was volatile.
At this point in the code we know that Op1 or Op2 is
all ones. Y points to the other operand. In the case that
Op2 is zero, Op1 must be all ones and Y is Op2. The OR
ORs Y into Res. But if Y is 0 the OR will be folded away by
getNode so we don't need to check for it.
The combineSelect code was casting to i64 without any check that
i64 was legal. This can break after type legalization.
It also required splitting the mmx register on 32-bit targets.
It's not clear that this makes sense. Instead switch to using
a cmov pseudo like we do for XMM/YMM/ZMM.
The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)
There are other potential improvements visible like allowing scalarization or vector
narrowing.
Differential Revision: https://reviews.llvm.org/D74909
There's a lot of old leftover code in LowerBRCOND. Especially
the detecting or AND or OR of X86ISD::SETCC nodes. Those were
needed before LegalizeDAG was changed to visit nodes before
their operands.
It also relied on reversing the output of LowerSETCC to find the
flags producing node to use for the X86ISD::BRCOND node.
Rather than using LowerSETCC this patch uses emitFlagsForSetcc to
handle the integer ISD::SETCC case. This gives the flag producer
and the comparison code to use directly. I've removed the addTest
flag and just produce a X86ISD::BRCOND and return immediately.
Floating point ISD::SETCC case is just an X86ISD::FCMP with special
care for OEQ and UNE derived from the previous code. I've left
f128 out so it will emit a test. And LowerSETCC will be called
later to produce a libcall and X86ISD::SETCC. We have combines
that can merge the test and X86ISD::SETCC.
We need to handle two cases for overflow ops. Either they are used
directly or they have a seteq 0 or setne 1 to invert the overflow.
The old code did not handle the setne 1 case, but I think some
other combines were making up for it.
If we fail to find a condition, we'll wrap an AND with 1 on the
original condition and tell emitFlagsForSetcc to emit a compare
with 0. This will pickup the LowerAndToBT and or the EmitTest case.
I kept the isTruncWithZeroHighBitsInput call, but we might be able
to fold that in to emitFlagsForSetcc.
Differential Revision: https://reviews.llvm.org/D74750
Only handle power of 2 element count for simplicity. Not sure what to do with vXf64->vXf16 fp_round to avoid double rounding
Differential Revision: https://reviews.llvm.org/D74886
We only need to split after type legalization. If we're before
we can just use a wide store and type legalization will split it.
Add a v128i1 test to exercise it post type legalization.
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.
This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.
Reviewers: nikic, RKSimon, lebedev.ri, spatel
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D74722
LoweSELECT will detect the constant inputs and convert to scalar
selects, but we can do it directly here.
I might remove some of the code from LowerSELECT and move it to
DAG combine so doing this explicitly will make us less dependent
on it happening in lowering.
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
This helps this transform occur earlier so we can fold the not
with setcc. If we delay it until after type legalization we might
have introduced instructions to widen the mask if the vselect was
widened. This can prevent the not from making it to the setcc.
We could of course add more DAG combines to handle that, but
moving this earlier is easier.
D73835 will make IRBuilder no longer trivially copyable. This patch
deletes the copy constructor in advance, to separate out the breakage.
Currently, the IRBuilder copy constructor is usually used by accident,
not by intention. In rG7c362b25d7a9 I've fixed a number of cases where
functions accepted IRBuilder rather than IRBuilder &, thus performing
an unnecessary copy. In rG5f7b92b1b4d6 I've fixed cases where an
IRBuilder was copied, while an InsertPointGuard should have been used
instead.
The only non-trivial use of the copy constructor is the
getIRBForDbgInsertion() helper, for which I separated construction and
setting of the insertion point in this patch.
Differential Revision: https://reviews.llvm.org/D74693
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.
This removes the need to pass Chain and IsSignaling all the way
down.
The unseen logic diff occurs because MayFoldLoad() is defined like this:
static bool MayFoldLoad(SDValue Op) {
return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.
Differential Revision: https://reviews.llvm.org/D74088
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
If we widen the compare we might trigger a spurious exception from
the garbage data.
We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.
I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.
Differential Revision: https://reviews.llvm.org/D74522
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.
I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.
NFC.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74482
This has a really interesting side effect in that it improves some UMAX/UMIN reduction code which had redundant XOR(SHUFFLE(XOR(X,SIGNMASK)),SIGNMASK) patterns - the getNegatibleCost recognises it as FNEG(SHUFFLE(FNEG(X))).... We have a lot of FNEG patterns bitcasted to the integer domain for XOR signbit twiddling which is similar to what we do to allow UMAX/UMIN to be lowered using SMAX/SMIN.
Differential Revision: https://reviews.llvm.org/D74231
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.
This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.
It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.
Differential Revision: https://reviews.llvm.org/D74221
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
This adds a strict version of FP16_TO_FP and FP_TO_FP16 and uses
them to implement soft promotion for the half type. This is
enough to provide basic support for __fp16 with strictfp.
Add the necessary X86 support to use VCVTPS2PH/VCVTPH2PS when F16C
is enabled.
We aren't doing a good job of optimizing AVX512 outside of this code. So remove the bail out for AVX512 and replace with a FIXME. This at least gets us the AVX2 codegen.
Differential Revision: https://reviews.llvm.org/D74431
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
We need to use vector instructions for these operations. Previously
we handled this with isel patterns that used extra instructions
and copies to handle the the conversions.
Now we use custom lowering to emit the conversions. This allows
them to be pattern matched and optimized on their own. For
example we can now emit vpextrw to store the result if its going
directly to memory.
I've forced the upper elements to VCVTPHS2PS to zero to keep some
code similar. Zeroes will be needed for strictfp. I've added a
DAG combine for (fp16_to_fp (fp_to_fp16 X)) to avoid extra
instructions in between to be closer to the previous codegen.
This is a step towards strictfp support for f16 conversions.
Non-AVX512BW targets failed to concatenate 256-bit shifts back to 512-bits (split during 512-bit shuffle lowering as they don't have v32i16/v64i8 types).
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.
There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
REAPPLIED rGe82e17d4d4ca after reversion at rG39eade73a567 - fixed offset matching in matchShuffleAsBitRotate.
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.
There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
---
Internal shuffle tests indicate theres a bug somewhere that I haven't been able to track down yet.
Fix issue mentioned on rGe82e17d4d4ca - non-AVX512BW targets failed to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD:ROTL nodes - ROTR isn't supported by XOP and they are interchangeable for constant values anyway.
There might be cases where targets without ISD:ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.
Differential Revision: https://reviews.llvm.org/D68720
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling and more portable testing.
Differential Revision: https://reviews.llvm.org/D68720
Remove code from LegalizeTypes that allowed this to work.
We were already using BUILD_PAIR for this in some places so this
standardizes on a single way to do this.
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This a recommit of 39f50da2a3 with better option
handling and more portable testing
Differential Revision: https://reviews.llvm.org/D68720
This hasn't been used for years, its original implementation, D35700, had bugs that caused the reversion of most of the code, and since then x86 shuffle lowering/combining has handled most cases and can deal with the rest as well.
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This a recommit of 39f50da2a3 with correct option
flags set.
Differential Revision: https://reviews.llvm.org/D68720
This reverts commit 39f50da2a3.
The -fstack-clash-protection is being passed to the linker too, which
is not intended.
Reverting and fixing that in a later commit.
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.
Reviewers: courbet
Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73964
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
Differential Revision: https://reviews.llvm.org/D68720
X86 uses i8 for shift amounts. This code can fail on a 32-bit target
if it runs after type legalization.
This code was copied from AArch64 and modified for X86, but the
shift amount wasn't changed to the correct type for X86.
Fixes PR44812
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.
Previously this was done by calling ConvertCmpIfNecessary in
multiple places which would emit the extra code for the FNSTSW,
a shift, a truncate, and a SAHF instructions. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.
This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.
Differential Revision: https://reviews.llvm.org/D73863
Same for any_extend though we don't have coverage for that.
The test changes are because isel didn't check one use of the
setcc_carry. So in isel we would end up with two different
sized setcc_carry instructions. And since it clobbers
the flags we would need to recreate the flags for the second
instruction.
This code handles additional uses by truncating the new wide
setcc_carry back to the original size for those uses.
lrint/llrint are defined as rounding using the current rounding
mode. Numbers that can't be converted raise FE_INVALID and an
implementation defined value is returned. They may also write to
errno.
I believe this means we can use cvtss2si/cvtsd2si or fist to
convert as long as -fno-math-errno is passed on the command line.
Clang will leave them as libcalls if errno is enabled so they
won't become ISD::LRINT/LLRINT in SelectionDAG.
For 64-bit results on a 32-bit target we can't use cvtss2si/cvtsd2si
but we can use fist since it can write to a 64-bit memory location.
Though maybe we could consider using vcvtps2qq/vcvtpd2qq on avx512dq
targets?
gcc also does this optimization.
I think we might be able to do this with STRICT_LRINT/LLRINT as
well, but I've left that for future work.
Differential Revision: https://reviews.llvm.org/D73859
The CATCHPAD node mostly existed to be selected into the EH_RESTORE
instruction, which sets the frame back up when 32-bit Windows exceptions
return to the parent function. However, creating this MachineInstr early
increases the risk that other passes will come along and insert
instructions that use the stack before ESP and EBP are restored. That
happened in PR44697.
Instead of representing these in the instruction stream early, delay it
until PEI. Mark the blocks where this needs to happen as EHPads, but not
funclet entry blocks. Passes after PEI have to be careful not to hoist
instructions that can use stack across frame setup instructions, so this
should be relatively reliable.
Fixes PR44697
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D73752
This time with correct types for the data result from the SUB.
Original commit message:
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating too. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.
I believe this code is a leftover from when i1 was a legal type.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
We were creating two with different operand orders, and then only
using one of them.
Instead just swap the operands when needed and create a single node.
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73885
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
ISD::FROUND is defined to round to nearest with ties rounding
away from 0. This mode isn't supported in hardware on X86.
But as long as we aren't compiling with trapping math, we can
emulate this with floor(X + copysign(nextafter(0.5, 0.0), X)).
We have to use nextafter to avoid some corner cases that adding
0.5 would have. For example, if X is nextafter(0.5, 0.0) it should
round to 0.0, but adding 0.5 would need one extra bit of mantissa
than can be stored so it rounds to 1.0. Adding nextafter(0.5, 0.0)
instead will just increase the exponent by 1 and leave the mantissa
as all 1s. This would be nextafter(1.0, 0.0) which will floor to 0.0.
Techically this requires -fno-trapping-math which isn't our default.
But if we care about exceptions we should be using constrained
intrinsics. Constrained intrinsics would use STRICT_FROUND which
won't go through this code.
Fixes PR42195.
Differential Revision: https://reviews.llvm.org/D73607
This code needs to map from the FPCW 2-bit encoding for rounding mode to the 2-bit encoding defined for FLT_ROUNDS. The previous implementation did some clever swapping of bits and adding 1 modulo 4 to do the mapping.
This patch instead uses an 8-bit immediate as a lookup table of four 2-bit values. Then we use the 2-bit FPCW encoding to index the lookup table by using a right shift and an AND. This requires extracting the 2-bit value from FPCW and multipying it by 2 to make it usable as a shift amount. But still results in less code.
Differential Revision: https://reviews.llvm.org/D73599
Summary: X86 has instructions to calculate fma and fneg at the same time. But we combine the fneg and fma only when fneg is the source operand under strict FP.
Reviewers: craig.topper, andrew.w.kaylor, uweigand, RKSimon, LiuChen3
Subscribers: LuoYuanke, llvm-commits, cfe-commits, jdoerfert, hiraditya
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72824
Pull out combineTargetShuffle code added in rG3fd5d1c6e7db into a helper function and extend it to handle shufps(shufps(load(),x),y) and shufps(y,shufps(load(),x)) cases as well.
Every case in the switch had a string version of themselves. Two
of them had a typo that used : instead of ::
By using a macro we can automate the string creation and avoid
the possibility of typos like this.
This is similar to what is done on the AMDGPU target.
I believe for STRICT_FP I need to use a STRICT_FP_EXTEND for the extending to f80 for returning f32/f64 in 32-bit mode when SSE is enabled. The STRICT_FP_EXTEND node requires a Chain. I need to get that node onto the chain before any CopyToRegs are emitted. This is because all the CopyToRegs are glued and chained together. So I can't put a STRICT_FP_EXTEND on the chain between the glued nodes without also glueing the STRICT_ FP_EXTEND.
This patch moves all the extend creation to a first pass and then creates the copytoregs and fills out RetOps in a second pass.
Differential Revision: https://reviews.llvm.org/D72665
Summary:
This is a follow up on https://reviews.llvm.org/D71473#inline-647262.
There's a caveat here that `Align(1)` relies on the compiler understanding of `Log2_64` implementation to produce good code. One could use `Align()` as a replacement but I believe it is less clear that the alignment is one in that case.
Reviewers: xbolva00, courbet, bollu
Subscribers: arsenm, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, Jim, kerbowa, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73099
vperm (ins ?, X, C), (ins ?, Y, C), 0x31 --> concat X, Y
This is another shuffle problem seen with PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
We have this small crack in legalization/lowering/combining/demanded
that allows forming a vperm2f128 of high halves with AVX1 when we
could do better by peeking through the insert_subvector nodes.
AFAICT, it requires IR as shown in the diffs - much larger than legal
vectors - to avoid all of the usual folds.
Another option would prevent forming the 256-bit vperm in lowering.
Differential Revision: https://reviews.llvm.org/D73197
Currently PromoteMaskArithemtic only looks at a single operation to
skip casts. This means we miss cases where we combine multiple masks.
This patch updates PromoteMaskArithemtic to try to recursively promote
AND/XOR/AND nodes that terminate in truncates of the right size or
constant vectors.
Reviewers: craig.topper, RKSimon, spatel
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D72524
Summary:
I think whatever problem the gluing was fixing has long since been fixed. We don't have any of the restrictions on FP stack stuff that existed back when this was first added.
I had to change which type we use for FILD in BuildFILD when X86 was enabled because most of the isel patterns block f32/f64 instructions when SSE1/SSE2 are enabled. So I needed to use the f80 pattern, but this shouldn't have an effect the generated code since there is only one FILD instruction anyway. We already use f80 explicitly in other other places.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: andrew.w.kaylor, scanon, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72805
This is another part of a problem noted in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
The AVX2 code may use awkward 256-bit shuffles vs. the AVX code that gets split
into the expected 128-bit unpack instructions. We have to be selective in
matching the types where we try to do this though. Otherwise, we can end up
with more instructions (in the case of v8x32/v4x64).
Differential Revision: https://reviews.llvm.org/D72575
The code is trying to copy the i64 value to an xmm register to
use a 64-bit store so that the 64-bit fild can benefit from
store forwarding.
But this trick only works if f64 is going to be stored in an
XMM register. If we only have SSE1 then only float is in xmm
register. So this trick just causes 2 stores i32 stores, an f64
load into the x87, an f64 from x87, and a 64-bit fild. So we end
up with an extra stack temporary and still didn't get store forwarding.
We might be able to use v2f32 here instead, but I didn't check. I
just wanted the code to make sense.
Found by inspection as I continue to stare too hard at our
int_to_fp conversions.
We were performing an emulated i32->f64 in the SSE registers, then
storing that value to memory and doing a extload into the X87
domain.
After this patch we'll now just store the i32 to memory along
with an i32 0. Then do a 64-bit FILD to f80 completely in the X87
unit. This matches what we do without SSE.
Pass small FP values in GPRs or stack memory according the the normal
convention. This is what gcc -mno-sse does on Win64.
I adjusted the conditions under which we emit an error to check if the
argument or return value would be passed in an XMM register when SSE is
disabled. This has a side effect of no longer emitting an error for FP
arguments marked 'inreg' when targetting x86 with SSE disabled. Our
calling convention logic was already assigning it to FP0/FP1, and then
we emitted this error. That seems unnecessary, we can ignore 'inreg' and
compile it without SSE.
Reviewers: jyknight, aemerson
Differential Revision: https://reviews.llvm.org/D70465
This allows us to generate better code for selecting the fixup
to load.
Previously when the sign was set we had to load offset 0. And
when it was clear we had to load offset 4. This required a testl,
setns, zero extend, and finally a mul by 4. By switching the offsets
we can just shift the sign bit into the lsb and multiply it by 4.
By directly emitting the constants as a constant pool load we seem to avoid the build_vector/extract_subvector combines that resulted in the duplicate loads we had before.
Differential Revision: https://reviews.llvm.org/D72307
Only perform this if we are shuffling lower and upper lane elements across the lanes (otherwise splitting to lower xmm shuffles would be better).
This is a regression if we shuffle build_vectors due to getVectorShuffle canonicalizing 'blend of splat' build vectors, for now I've set this not to shuffle build_vector nodes at all to avoid this.
This causes the STRICT_FSETCC/STRICT_FSETCCS nodes to lowered
early while lowering SELECT, but the output chain doesn't get
connected. Then we visit the node again when it is its turn
because we haven't replaced the use of the chain result. In the
case of the fp128 libcall lowering, after D72341 this will cause
the libcall to be emitted twice.
Add initial support for lowering v4f64 shuffles to SHUFPD(VPERM2F128(V1, V2), VPERM2F128(V1, V2)), eventually this could be used for v8f32 (and maybe v8f64/v16f32) but I'm being conservative for the initial implementation as only v4f64 can always succeed.
This currently is only called from lowerShuffleAsLanePermuteAndShuffle so only gets used for unary shuffles, and we limit this to cases where we use upper elements as otherwise concating 2 xmm shuffles is probably the better case.
Helps with poor shuffles mentioned in D66004.
We only use lowerShuffleAsLanePermuteAndShuffle for unary shuffles at the moment, but we should consistently handle lane index calculations for multiple inputs in both the AVX1 and AVX2 paths.
Minor (almost NFC) tidyup as I'm hoping to use lowerShuffleAsLanePermuteAndShuffle for binary shuffles soon.
Only PPC seems to be using it, and only checks some simple cases and
doesn't distinguish between FP. Just switch to using LLT to simplify
use from GlobalISel.
For v4i64->v4f32 uint_to_fp on pre-avx targets where v4i64 isn't legal we create to v2i64->v2f32 uint_to_fp that need to be shuffled together. Our codegen for v2i64->v2f32 involves detecting if the number is larger than (2^31 - 1), if so we do a special divison by 2 so we can do a signed conversion which we need to scalarize, then do a multiply by 2 at the end if we divided earlier.
When v4i64 isn't legal we need to split the checking for a larger number and dividing by 2 into two v2i64 vectors. The scalar part can extract the 4 i64 values from those 4 splits. But we can reassemble the 4 scalar f32 results directly into a single v432 vector. Then we just need to combine the fixup indications from the 2 halves and we can do the final multiply by 2 fixup on all 4 values if needed at once using a single v4f32 blend and v4f32 fadd.
Differential Revision: https://reviews.llvm.org/D72368
Now that we generate decent code for (v2i64 (setlt zero, X)) on pre-sse4.2 targets I think we can use this now.
Differential Revision: https://reviews.llvm.org/D72354
Similar to D72302 but for the canonical form for the opposite case. I've changed foldVectorXorShiftIntoCmp to form a target independent setcc node instead of PCMPGT now and enabled its for v2i64 on pre-SSE4.2 targets. The setcc should eventually get lowered to PCMPGT or the new v2i64 sequence.
Differential Revision: https://reviews.llvm.org/D72318
Without sse4.2 a v2i64 setlt needs to expand into a pcmpgtd, pcmpeqd, 3 shuffles, and 2 logic ops. But if we're only interested in the sign bit of the i64 elements, we can just use one pcmpgtd and shuffle the odd elements to the even elements.
Differential Revision: https://reviews.llvm.org/D72302
Summary:
Based on Simon's D52965, but improved to handle strict fp and improve some of the shuffling.
Rather than use v2i1/v4i1 and let type legalization continue, just generate all the code with legal types and use an explicit shuffle.
I also added an explicit setcc to the v4i64 code to match the semantics of vselect which doesn't just use the sign bit. I'm also using a v4i64->v4i32 truncate instead of the shuffle in Simon's original code. With the setcc this will become a pack.
Future work can look into using X86ISD::BLENDV and a different shuffle that only moves the sign bit.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71956
Attempt to use combineLogicBlendIntoConditionalNegate for (select M, (sub 0, X), X) -> (sub (xor X, M), M)
We limit this to cases that can't easily replace the VSELECT with a shuffle (non-constant masks) or where a BLENDV is likely to occur (which tends to result in slower codegen).
This uses an alternative implementation of this conversion derived
from our v2i32->v2f32 handling. We can zero extend the v2i32 to
v2i64, or it with the bit representation of 2.0^52 which will give
us 2.0^52 plus the 32-bit integer since double's mantissa is 52 bits.
Then we just need to subtract 2.0^52 as a double and let the floating
point unit normalize the remaining bits into a valid double.
This is less instructions then our previous code, but does require
a port 5 shuffle for the zero extend or unpack.
Differential Revision: https://reviews.llvm.org/D71945
When the "disable-tail-calls" attribute was added, checks were added for
it in various backends. Now this code has proliferated, and it is
something the target is responsible for checking. Move that
responsibility back to the ISels (fast, global, and SD).
There's no major functionality change, except for targets that never
implemented this check.
This LLVM attribute was originally added in
d9699bc7bd (2015).
Reviewers: echristo, MaskRay
Differential Revision: https://reviews.llvm.org/D72118
Summary:
We previously disabled this under fast math due to aggressive
reassociation by the machine combiner. But I think we can work
around this by using a FSUB instead of FADD for the first
operation.
This matches the similar algorithm we do for uint_to_fp i64->f64
in TargetLowering::expandUINT_TO_FP. If reassociation hasn't
been a problem for that, hopefully its not a problem here.
Reviewers: RKSimon, spatel, scanon
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71968
Summary:
After bugfix the undef value case here, we used more operations to implement inserting vxi1 sub vector into vXi1 vector, I optimize it by use less operations.
The history information at https://reviews.llvm.org/D68311
Reviewers: craig.topper, LuoYuanke, yubing, annita.zhang, pengfei, LiuChen3, RKSimon
Reviewed By: craig.topper
Subscribers: hiraditya, llvm-commits
Patch by Xiang Zhang (xiangzhangllvm)
Differential Revision: https://reviews.llvm.org/D71917
If we just subtracted 1 and are checking if the result is -1. We can use the carry flag from the ADD instead of an explicit CMP. I'm using the same checks for the add users as EmitTest.
Fixes one case from PR44412
Differential Revision: https://reviews.llvm.org/D72019
If the return value is a struct of 2 doubles we need two return
registers.
If SSE2 is disabled we can't return in XMM registers like the ABI says.
After logging an error we attempt to recover by using FP0 instead
of an XMM register. But if the return needs two registers, we may have
already used FP0. So if the register we were supposed to copy to is
XMM1, copy to FP1 in the recovery instead.
This seems to fix the assertion/crash in PR44413.
These nodes should only ever be formed with an i8 TargetConstant
so we don't need to check for it to be a constant. It's also
always 8-bits so we don't need to use APInt compare functions.
We had a Custom operation action for v4i32 on SSE1. But since
v4i32 isn't legal until SSE2 this was not what was intended. The
code that get executed was intended for op legalization and
creates a bunch of v4i32 nodes that all end up scalarized.
Summary:
Previously we did this with isel patterns that used garbage in
the widened part of the source. But that's not valid for strictfp.
So now we custom widen and use zeroes for the widened elemens for
strictfp.
This replaces D71864.
Reviewers: RKSimon, spatel, andrew.w.kaylor, pengfei, LiuChen3
Reviewed By: pengfei
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71879
I believe the algorithm we use for non-strict is exception safe
for strict. The fsub won't generate any exceptions. After it we
will have an exact version of the i32 integer in a double. Then
we just round it to f32. That rounding will generate a precision
exception if it can't be represented exactly.
Previously we widened these through isel patterns, but that
didn't work for STRICT_ nodes. Those need to be padded with
zeroes in the upper bits which is harder to do in isel patterns.
Previously we were widening with isel patterns, but that wasn't
exception safe for strict FP. So now we widen to v4i32->v4f64
during type legalization. And then let op legalization further
widen to v8i32->v8f64.
The vec_int_to_fp.ll changes are caused by us no longer narrowing
extracts of strict_uint_to_fp to the v4i32->v2f64 instruction
without AVX512VL only to have isel rewiden it. Now we just keep
it wide throughout. So we don't have an opportunity to narrow
the load.
AVX512F added instruction for vector fp_to_uint conversions. With
AVX512VL we can use a specific instruction that does v2f64->v4i32 with
zeroes in the 2 extra elements. For non-strict nodes without AVX512VL
we relied on type legalization to turn it to v4f64->v4i32 which would
later be widened by op legalization to v8f64->v8i32. But type legalization
doesn't currently widen strict nodes since it doesn't know how to
safely and efficiently pad the extra elements. But for X86 we know
padding with zeroes is safe and efficient so do that ourselves.
In the worst case, this requires a 128-bit move instruction to
implicitly zero the upper bits. In the common case, we should
recognize the producing instruction already zeroed the upper bits.
On 32-bit targets we can't use the scalar instruction so we
insert the scalar into a vector and use packed conversions.
Previously we used either v4f32->v4i64 or v4f64->v4i64 to avoid
some complexity creating target specific ISD opcodes for
v4f32->v2i64. But this causes extra vzeroupper instructions and
possibly frequency throttling on Intel CPUs.
This patch changes this to create a 128-bit vector and uses a
target specific ISD opcode if needed.
Fix several several additional problems with the int <-> FP conversion
logic both in common code and in the X86 target. In particular:
- The STRICT_FP_TO_UINT expansion emits a floating-point compare. This
compare can raise exceptions and therefore needs to be a strict compare.
I've made it signaling (even though quiet would also be correct) as
signaling is the more usual default for an LT. This code exists both
in common code and in the X86 target.
- The STRICT_UINT_TO_FP expansion algorithm was incorrect for strict mode:
it emitted two STRICT_SINT_TO_FP nodes and then used a select to choose one
of the results. This can cause spurious exceptions by the STRICT_SINT_TO_FP
that ends up not chosen. I've fixed the algorithm to use only a single
STRICT_SINT_TO_FP instead.
- The !isStrictFPEnabled logic in DoInstructionSelection would sometimes do
the wrong thing because it calls getOperationAction using the result VT.
But for some opcodes, incuding [SU]INT_TO_FP, getOperationAction needs to
be called using the operand VT.
- Remove some (obsolete) code in X86DAGToDAGISel::Select that would mutate
STRICT_FP_TO_[SU]INT to non-strict versions unnecessarily.
Reviewed by: craig.topper
Differential Revision: https://reviews.llvm.org/D71840
This moves the X86 specific transform from rL364407
into DAGCombiner to generically handle 'little to big' cases
(for example: extract_subvector(v2i64 bitcast(v16i8))). This
allows us to remove both the x86 implementation and the aarch64
bitcast(extract_subvector(bitcast())) combine.
Earlier patches that dealt with regressions initially exposed
by this patch:
rG5e5e99c041e4
rG0b38af89e2c0
Patch by: @RKSimon (Simon Pilgrim)
Differential Revision: https://reviews.llvm.org/D63815
The setcc operands are copied into LHS and RHS variables at the top of the function. We also capture the condition code.
A later piece of code swaps the operands and changing the CC variable as part of a canonicalization to make some other checks simpler. But we might not make the transform we canonicalized for. So we continue on through the function where we can use the swapped LHS/RHS variables and access the original condition code operand instead of the modified CC variable. This leads to a setcc being created with the original condition code, but with swapped operands.
To mitigate this, this patch does a couple things. The LHS/RHS/CC variables are made const to keep them from being modified like this again. The transform that needs the swap now uses temporary copies of the variables. And the transform that used the original condition code operand has been altered to use the CC variable we cached originally. Either of these changes are enough to fix the issue, but doing both to make this code very safe.
I also considered rewriting the swap code in some way to check both permutations without explicitly swapping or needing temporary variables, but held off on that.
Differential Revision: https://reviews.llvm.org/D71736
The only thing its getting from the X86TargetLowering class is
the subtarget which we can easily pass. This function only has
one call site now since this might help the compiler inline it.
Explicitly return both the flag result and the chain result for
STRICT_FCMP nodes. This removes an assumption in the caller that
getValue(1) is the right way to get the chain.
EmitCmp will just immediately call EmitTest and discard the null
constant only to have EmitTest create it again if it doesn't fold.
So just skip all that and go directly to EmitTest.
of integers to floating point.
This includes some of Craig Topper's changes for promotion support from
D71130.
Differential Revision: https://reviews.llvm.org/D69275
Summary:
The use of a boolean isInteger flag (generally initialized using
VT.isInteger()) caused errors in our out-of-tree CHERI backend
(https://github.com/CTSRD-CHERI/llvm-project).
In our backend, pointers use a separate ValueType (iFATPTR) and therefore
.isInteger() returns false. This meant that getSetCCInverse() was using the
floating-point variant and generated incorrect code for us:
`(void *)0x12033091e < (void *)0xffffffffffffffff` would return false.
Committing this change will significantly reduce our merge conflicts
for each upstream merge.
Reviewers: spatel, bogner
Reviewed By: bogner
Subscribers: wuzish, arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70917
This is an alternate fix for the bug discussed in D70595.
This also includes minimal tests for other in-tree targets
to show the problem more generally.
We check the number of uses as a predicate for whether some
value is free to negate, but that use count can change as we
rewrite the expression in getNegatedExpression(). So something
that was marked free to negate during the cost evaluation
phase becomes not free to negate during the rewrite phase (or
the inverse - something that was not free becomes free).
This can lead to a crash/assert because we expect that
everything in an expression that is negatible to be handled
in the corresponding code within getNegatedExpression().
This patch skips the use check during the rewrite phase.
So we determine that some expression isNegatibleForFree
(identically to without this patch), but during the rewrite,
don't rely on use counts to decide how to create the optimal
expression.
Differential Revision: https://reviews.llvm.org/D70975
Summary: This is a follow up of D69281, it enables the X86 backend support for the FP comparision.
Reviewers: uweigand, kpn, craig.topper, RKSimon, cameron.mcinally, andrew.w.kaylor
Subscribers: hiraditya, llvm-commits, annita.zhang, LuoYuanke, LiuChen3
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70582
This reverts 3e1aee2ba7 in favor
of a different approach.
Scalarizing isn't great codegen, but making the type illegal was
interfering with k constraint in inline assembly.
The xor'ing behaviour is only used for msvc/crt environments, when we're targeting
macho the guard load code doesn't know about the xor in the epilog. Disable xor'ing
when targeting win32-macho to be consistent.
Differential Revision: https://reviews.llvm.org/D71095
D53794 introduced code to perform the FP_TO_UINT expansion via FP_TO_SINT in a way that would never expose floating-point exceptions in the intermediate steps. Unfortunately, I just noticed there is still a way this can happen. As discussed in D53794, the compiler now generates this sequence:
// Sel = Src < 0x8000000000000000
// Val = select Sel, Src, Src - 0x8000000000000000
// Ofs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val) ^ Ofs
The problem is with the Src - 0x8000000000000000 expression. As I mentioned in the original review, that expression can never overflow or underflow if the original value is in range for FP_TO_UINT. But I missed that we can get an Inexact exception in the case where Src is a very small positive value. (In this case the result of the sub is ignored, but that doesn't help.)
Instead, I'd suggest to use the following sequence:
// Sel = Src < 0x8000000000000000
// FltOfs = select Sel, 0, 0x8000000000000000
// IntOfs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val - FltOfs) ^ IntOfs
In the case where the value is already in range of FP_TO_SINT, we now simply compute Val - 0, which now definitely cannot trap (unless Val is a NaN in which case we'd want to trap anyway).
In the case where the value is not in range of FP_TO_SINT, but still in range of FP_TO_UINT, the sub can never be inexact, as Val is between 2^(n-1) and (2^n)-1, i.e. always has the 2^(n-1) bit set, and the sub is always simply clearing that bit.
There is a slight complication in the case where Val is a constant, so we know at compile time whether Sel is true or false. In that scenario, the old code would automatically optimize the sub away, while this no longer happens with the new code. Instead, I've added extra code to check for this case and then just fall back to FP_TO_SINT directly. (This seems to catch even slightly more cases.)
Original version of the patch by Ulrich Weigand. X86 changes added by Craig Topper
Differential Revision: https://reviews.llvm.org/D67105
Summary:
musttail calls should not require allocating extra stack for arguments.
Updates to arguments passed in memory should happen in place before the
epilogue.
This bug was mostly a missed optimization, unless inalloca was used and
store to push conversion fired.
If a reserved call frame was used for an inalloca musttail call, the
call setup and teardown instructions would be deleted, and SP
adjustments would be inserted in the prologue and epilogue. You can see
these are removed from several test cases in this change.
In the case where the stack frame was not reserved, i.e. call frame
optimization fires and turns argument stores into pushes, then the
imbalanced call frame setup instructions created for inalloca calls
become a problem. They remain in the instruction stream, resulting in a
call setup that allocates zero bytes (expected for inalloca), and a call
teardown that deallocates the inalloca pack. This deallocation was
unbalanced, leading to subsequent crashes.
Reviewers: hans
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71097
We shouldn't assume that the returned result can be used to get
the other result.
This is prep-work for strict FP where we will also need to pass
the chain result along in more cases.
I suspect this became unnecessary after r354161. Prior to that
we may have been going through the default expansion of FP_TO_UINT
on 64-bit targets and then ending up back in Custom X86 handling
to handle the FP_TO_SINT for it. Now we just Custom handle the
FP_TO_UINT directly. We already need to handle it for 32-bit mode
during type legalization so we wouldn't save any code by using
the default expansion on 64-bit.
Summary:
This follows a previous patch that changes the X86 datalayout to represent
mixed size pointers (32-bit sext, 32-bit zext, and 64-bit) with address spaces
(https://reviews.llvm.org/D64931)
This patch implements the address space cast lowering to the corresponding
sign extension, zero extension, or truncate instructions.
Related to https://bugs.llvm.org/show_bug.cgi?id=42359
Reviewers: rnk, craig.topper, RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69639
This is the following patch of D68854.
This patch adds basic operations of X87 instructions, including +, -, *, / , fp extensions and fp truncations.
Patch by Chen Liu(LiuChen3)
Differential Revision: https://reviews.llvm.org/D68857
MVE has a basic symmetry between it's normal loads/store operations and
the masked variants. This means that masked loads and stores can use
pre-inc and post-inc addressing modes, just like the standard loads and
stores already do.
To enable that, this patch adds all the relevant infrastructure for
treating masked loads/stores addressing modes in the same way as normal
loads/stores.
This involves:
- Adding an AddressingMode to MaskedLoadStoreSDNode, along with an extra
Offset operand that is added after the PtrBase.
- Extending the IndexedModeActions from 8bits to 16bits to store the
legality of masked operations as well as normal ones. This array is
fairly small, so doubling the size still won't make it very large.
Offset masked loads can then be controlled with
setIndexedMaskedLoadAction, similar to standard loads.
- The same methods that combine to indexed loads, such as
CombineToPostIndexedLoadStore, are adjusted to handle masked loads in
the same way.
- The ARM backend is then adjusted to make use of these indexed masked
loads/stores.
- The X86 backend is adjusted to hopefully be no functional changes.
Differential Revision: https://reviews.llvm.org/D70176
Returning SDValue() means we didn't handle it and the common
code should try to expand it. But its a target intrinsic so
expanding won't do anything and just leave the node alone. But
it will print confusing debug messages.
By returning Op we tell the common code that the node is legal
and shouldn't receive any further processing.
These need to emit a libcall like we do for the non-strict version.
32-bit mode needs to SoftenFloat support to be implemented for strict FP nodes.
Differential Revision: https://reviews.llvm.org/D70504
Add explicit setOperation actions for some to match their none
strict counterparts. This isn't required, but makes the code
self documenting that we didn't forget about strict fp. I've
used LibCall instead of Expand since that's more explicitly what
we want.
Only lrint/llrint/lround/llround are missing now.
The custom code just emits a libcall, but we can do the same
with generic code. The only difference is that the generic code
can form tail calls where the custom code couldn't. This is
responsible for the test changes.
This avoids needing to modify the Custom handling for strict fp.
The Custom handler doesn't do anything for these nodes anyway.
SelectionDAGISel won't mutate them if they are Legal or Custom.
X86 has custom code for mutating them due to missing isel patterns.
When the isel patterns are added Legal will be the right answer.
So go ahead a change it now since that's where we'll end up.
AL is only used for varargs on SysV platforms. Don't forward it on
Windows. This allows control flow guard to set up an extra hidden
parameter in RAX, as described in PR44049.
This also has the effect of freeing up RAX for use in virtual member
pointer thunks, which may also be a nice little code size improvement on
Win64.
Fixes PR44049
Reviewers: ajpaverd, efriedma, hans
Differential Revision: https://reviews.llvm.org/D70413
This is a first pass at Custom lowering for these operations. I also updated some of the vector code where it was obviously easy and straightforward. More work needed in follow up.
This enables these operations to be handled with X87 where special rounding control adjustments are needed to perform a truncate.
Still need to fix Promotion in the target independent code in LegalizeDAG.
llrint/llround split into separate test file because we can't make a strict libcall properly yet either and we need to do that when i64 isn't a legal type.
This does not include any isel support. So we still rely on the mutation in SelectionDAGIsel to remove the strict from this stuff later. Except for the X87 stuff which goes through custom nodes that already had chains.
Differential Revision: https://reviews.llvm.org/D70214
AMDGPU needs to know the FP mode for the function to answer this
correctly when this is removed from the subtarget.
AArch64 had to make this more complicated by using this from an IR
hook, so add an IR typed overload.
As detailed in PR43971/D70267, the use of XFormVExtractWithShuffleIntoLoad causes issues where we end up in infinite loops of extract(targetshuffle(vecload)) -> extract(shuffle(vecload)) -> extract(vecload) -> extract(targetshuffle(vecload)), there are just too many legalization checks at every stage that we can't guarantee that extract(shuffle(vecload)) -> scalarload can occur.
At the moment we see a number of minor regressions as we don't fold extract(shuffle(vecload)) -> scalarload before legal ops, these can be addressed in future patches and extension of X86ISelLowering's combineExtractWithShuffle.
* Implements scalable size queries for MVTs, split out from D53137.
* Contains a fix for FindMemType to avoid using scalable vector type
to contain non-scalable types.
* Explicit casts for several places where implicit integer sign
changes or promotion from 32 to 64 bits caused problems.
* CodeGenDAGPatterns will treat scalable and non-scalable vector types
as different.
Reviewers: greened, cameron.mcinally, sdesmalen, rovka
Reviewed By: rovka
Differential Revision: https://reviews.llvm.org/D66871
The Promote action doesn't apply until LegalizeDAG. By the time
we get there, we would have already softened all the FP operations
if useSoftFloat was true. So there wouldn't be any operation left
to Promote.
This is no longer needed after widening legalization as we
custom legalize v8i8 ourselves.
Added entries to the cost model, but bumped the cost slightly
to account for the truncate shuffle that wasn't costed before.
Otherwise just let the v64i8/v32i16 types be split to v32i8/v16i16.
In reality this shouldn't happen because it means we have a 512-bit
vector argument, but min-legal-vector-width says a value less than
512. But a 512-bit argument should have been factored into the
preferred vector width.
MVT::i1 should be removed by type legalization before we reach
any code that would act on the promote action.
Mainly to avoid replicating this for strict FP versions of these
operations.
If we're using soft floats, then these operations shoudl be
softened during type legalization. They'll never get to
LegalizeVectorOps or LegalizeDAG so they don't need to be
Expanded there.
We had some code for this for 32-bit ARM, but this doesn't really need
to be in target-specific code; generalize it.
(I think this started showing up recently because we added an
optimization that converts pow to powi.)
Differential Revision: https://reviews.llvm.org/D69013
The MMX intrinsics for shift by immediate take a 32-bit shift
amount but the hardware for shifting by immediate only encodes
8-bits. For the intrinsic we don't require the shift amount to
fit in 8-bits in the frontend because we don't check that its an
immediate in the frontend. If its is not an immediate we move it
to an MMX register and use the shift by register.
But if it is an immediate we'll use the shift by immediate
instruction. But we need to change the shift amount to 8-bits.
We were previously doing this accidentally by masking it in the
encoder. But this can make a large shift amount into a small
in bounds shift amount. Instead we should clamp larger shift
amounts to 255 so that the they don't become in bounds.
Fixes PR43922
PVS Studio noticed that we were asserting "VT.getVectorNumElements() == VT.getVectorNumElements()" instead of "VT.getVectorNumElements() == InVT.getVectorNumElements()".
When writing an email for a follow up proposal, I realized one of the diffs in the committed change was incorrect. Digging into it revealed that the fix is complicated enough to require some thought, so reverting in the meantime.
The problem is visible in this diff (from the revert):
; X64-SSE-LABEL: store_fp128:
; X64-SSE: # %bb.0:
-; X64-SSE-NEXT: movaps %xmm0, (%rdi)
+; X64-SSE-NEXT: subq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 32
+; X64-SSE-NEXT: movaps %xmm0, (%rsp)
+; X64-SSE-NEXT: movq (%rsp), %rsi
+; X64-SSE-NEXT: movq {{[0-9]+}}(%rsp), %rdx
+; X64-SSE-NEXT: callq __sync_lock_test_and_set_16
+; X64-SSE-NEXT: addq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 8
; X64-SSE-NEXT: retq
store atomic fp128 %v, fp128* %fptr unordered, align 16
ret void
The problem here is three fold:
1) x86-64 doesn't guarantee atomicity of anything larger than 8 bytes. Some platforms observably break this guarantee, others don't, but the codegen isn't considering this, so it's wrong on at least some platforms.
2) When I started to track down the problem, I discovered that DAGCombiner had stripped the atomicity off the store entirely. This comes down to idiomatic usage of DAG.getStore passing all MMO components separately as opposed to just passing the MMO.
3) On x86 (not -64), there are cases where 8 byte atomiciy is supported, but only for floating point operations. This would seem to imply that operation typing matters for correctness, and DAGCombine happily folds away bitcasts. I'm not 100% sure there's a problem here, but I'm not entirely sure there isn't either.
I plan on returning to each issue in turn; sorry for the churn here.
If we don't demand all elements, then attempt to combine to a simpler shuffle.
At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts (see D66004).
This reapplies rL368307 (reverted at rL369167) after the fix for the infinite loop reported at PR43024 was applied at rG3f087e38a2e7b87a5adaaac1c1b61e51220e7ff3
This stops infinite loops where KnownUndef elements are converted to Zeroable, resulting in KnownZero elements which are then simplified (via SimplifyDemandedElts etc.) back to KnownUndef elements........
Prep fix for PR43024 which will allow rL368307 to be re-applied.
This doesn't affect actual codegen, but is a minor refactor toward fixing PR43024 where we need to avoid excess changes (folding zeroables etc.) to the shuffle mask at Depth == 0.
Previously we marked zeroable elements in a way that prevented
the widening check from recognizing that it could widen. Now
we only mark them zeroable if V2 is an all zeros vector. This
matches what we do for widening elements in lowerVectorShuffle.
Fixes PR43866.
Teach combineVectorSizedSetCCEquality() to handle arbitrary memcmp
expansions but do not change any default policy for now.
This also fixes a bug in the memcmp expansion itself when large
displacements are needed.
https://reviews.llvm.org/D69507
Enable the new SelectionDAG representation for unordered loads and stores introduced in r371441 by default. As a reminder, the new lowering changes the representation of an unordered atomic load from an AtomicSDNode - which is essentially a black box which gets passed through without combines messing with it - to a LoadSDNode w/a atomic marker on the MMO. The later parallels the way we handle volatiles, and I've audited the code to ensure that every location which checks one checks the other.
This has been fairly heavily fuzzed, and I examined diffs in a reasonable large corpus of assembly by hand, so I'm reasonable sure this is correct for the common case. Late in the review for this, it was discovered that I hadn't correctly handled cases which could be legalized into CAS operations. This points out that there's a strong bias in the IR of the frontend I'm working with towards only legal atomics. If there are problems with this patch, the most likely area will be legalization.
Differential Revision: https://reviews.llvm.org/D69219
This catches some cases. There are probably ways to improve this.
I tried doing it as a combine on the setcc, but that broke
some cases involving flag reuse in place of test.
I renamed the isX86CCUnsigned to isX86CCSigned and flipped its
polarity to make it consistent with the similar functions for
ISD::SETCC. This avoids calling EQ/NE as being signed or unsigned.
Fixes PR43823.
Differential Revision: https://reviews.llvm.org/D69499
The legalization of v2i1->i2 or v4i1->i4 bitcasts followed by a setcc can create an and after the bitcast. If we're lucky enough that the input to the bitcast is a concat_vectors where the first operand is a setcc that can natively 0 all the upper bits of ak-register, then we should replace the other operands of the concat_vectors with zero in order to remove the AND.
With the AND removed we might be able to use a kortest on the result.
Differential Revision: https://reviews.llvm.org/D69205
PTEST and especially the MOVMSK instructions are slow on Knights Landing
or later. As a bonus, this patch increases instruction parallelism by
emitting:
KORTEST(PCMPNEQ(a, b), PCMPNEQ(c, d)) == 0
Instead of:
KORTEST(AND(PCMPEQ(a, b), PCMPEQ(c, d))) == ~0
https://reviews.llvm.org/D69157
tryToWidenViaDuplication lowers using the shuffle_v8i16(unpack_v16i8(shuffle_v8i16(x),shuffle_v8i16(x))) pattern, but the unpack only needs the even/odd 16i8 args if the original v16i8 shuffle mask references the even/odd elements - which isn't true for many extension style shuffles.
llvm-svn: 375342
We were always generating a single source HADDPD, but really we should only do this if shouldUseHorizontalOp says its a good idea.
Differential Revision: https://reviews.llvm.org/D69175
llvm-svn: 375341
Add generic DAG combine for extending masked loads.
Allow us to generate sext/zext masked loads which can access v4i8,
v8i8 and v4i16 memory to produce v4i32, v8i16 and v4i32 respectively.
Differential Revision: https://reviews.llvm.org/D68337
llvm-svn: 375085
Exposes an issue in getFauxShuffleMask where the OR(SHUFFLE,SHUFFLE) decode should always resolve zero/undef elements.
Part of the fix for PR43024 where ideally we shouldn't call resolveTargetShuffleAndZeroables for Depth == 0
llvm-svn: 374928
The only things VBROADCAST_LOAD uses is an address and a chain
node. It has no vector inputs.
So if its a user of the source of another broadcast that could
only mean one of two things. The other broadcast is broadcasting
the address of the broadcast_load. Or the source is a load and
the use we're seeing is the chain result from that load. Neither
of these cases make sense to combine here.
This issue was reported post-commit r373871. Test case has not
been reduced yet.
llvm-svn: 374862
This prevents isel from emitting a TEST instruction that
optimizeCompareInstr will need to remove later.
In some of the modified tests, the SUB gets duplicated due to
the flags being needed in two places and being clobbered in
between. optimizeCompareInstr was able to optimize away the TEST
that was using the result of one of them, but optimizeCompareInstr
doesn't know to turn SUB into CMP after removing the TEST. It
only knows how to turn SUB into CMP if the result was already
dead.
With this change the TEST never exists, so optimizeCompareInstr
doesn't have to remove it. Then it can just turn the SUB into
CMP immediately.
Fixes PR43649.
llvm-svn: 374755
We were already controlling whether the KnownZero elements were being written to the target mask, this extends it to the KnownUndef elements as well so we can prevent the target shuffle mask being manipulated at all.
llvm-svn: 374732
This enables use of the saturating truncate instructions when the
result type is less than 128 bits. It also enables the use of
saturating truncate instructions on KNL when the input is less
than 512 bits. We can do this by widening the input and then
extracting the result.
llvm-svn: 374731
This seems to improve std::midpoint code where we have a min and
a max with the same condition. If we split the setcc we can end
up with two compares if the one of the operands is a constant.
Since we aggressively canonicalize compares with constants.
For non-constants it can interfere with our ability to share
control flow if we need to expand cmovs into control flow.
I'm also not sure I understand this min/max canonicalization code.
The motivating case talks about comparing with 0. But we don't
check for 0 explicitly.
Removes one instruction from the codegen for PR43658.
llvm-svn: 374706
Since the input type is larger than 256-bits we'll need to some
concatenating to reassemble the results. The pack instructions
ability to concatenate while packing make this a shorter/faster
sequence.
llvm-svn: 374643
We already did this for VTRUNCUS with a specific combination of
types. This extends this to VTRUNCS and handles any types where
a truncating store is legal.
llvm-svn: 374615
If we don't have VLX we won't end up selecting a saturating
truncate for 256-bit or smaller vectors so we should just use
the pack lowering.
llvm-svn: 374487
When handling the packus pattern for i32->i8 we do a two step
process using a packss to i16 followed by a packus to i8. If the
final i8 step is a type with less than 64-bits the packus step
will return SDValue(), but the i32->i16 step might have succeeded.
This leaves the nodes from the middle step dangling.
Guard against this by pre-checking that the number of elements is
at least 8 before doing the middle step.
With that check in place this should mean the only other
case the middle step itself can fail is when SSE2 is disabled. So
add an early SSE2 check then just assert that neither the middle
or final step ever fail.
llvm-svn: 374460
If we've disable zmm registers, the v16i32 will need to be split. This split will propagate through min/max the truncate. This creates two sequences that need to be concatenated back to v16i8. We can instead use packusdw to do part of the clamping, truncating, and concatenating all at once. Then we can use a vpmovuswb to finish off the clamp.
Differential Revision: https://reviews.llvm.org/D68763
llvm-svn: 374431
As background, starting in D66309, I'm working on support unordered atomics analogous to volatile flags on normal LoadSDNode/StoreSDNodes for X86.
As part of that, I spent some time going through usages of LoadSDNode and StoreSDNode looking for cases where we might have missed a volatility check or need an atomic check. I couldn't find any cases that clearly miscompile - i.e. no test cases - but a couple of pieces in code loop suspicious though I can't figure out how to exercise them.
This patch adds defensive checks and asserts in the places my manual audit found. If anyone has any ideas on how to either a) disprove any of the checks, or b) hit the bug they might be fixing, I welcome suggestions.
Differential Revision: https://reviews.llvm.org/D68419
llvm-svn: 374261
Gather instructions can use i32 or i64 elements for indices. If
the index is zero extended from a type smaller than i32 to i64, we
can shrink the extend to just extend to i32.
llvm-svn: 373982
When the target option GuaranteedTailCallOpt is specified, calls with
the fastcc calling convention will be transformed into tail calls if
they are in tail position. This diff adds a new calling convention,
tailcc, currently supported only on X86, which behaves the same way as
fastcc, except that the GuaranteedTailCallOpt flag does not need to
enabled in order to enable tail call optimization.
Patch by Dwight Guth <dwight.guth@runtimeverification.com>!
Reviewed By: lebedev.ri, paquette, rnk
Differential Revision: https://reviews.llvm.org/D67855
llvm-svn: 373976
If a fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element.
This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted.
Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper.
Fixes PR43217
Differential Revision: https://reviews.llvm.org/D68544
llvm-svn: 373871
Replaces setTargetShuffleZeroElements with getTargetShuffleAndZeroables which reports the Zeroable elements but doesn't merge them into the decoded target shuffle mask (the merging has been moved up into getTargetShuffleInputs until we can get rid of it entirely).
This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs but instead we cache them in a parallel Zeroable mask.
llvm-svn: 373867
Summary:
The default legalization for v16i64->v16i8 tries to create a multiple stage truncate concatenating after each stage and truncating again. But avx512 implements truncates with multiple uops. So it should be better to truncate all the way to the desired element size and then concatenate the pieces using unpckl instructions. This minimizes the number of 2 uop truncates. The unpcks are all single uop instructions.
I tried to handle this by just custom splitting the v16i64->v16i8 shuffle. And hoped that the DAG combiner would leave the two halves in the state needed to make D68374 do the job for each half. This worked for the first half, but the second half got messed up. So I've implemented custom handling for v8i64->v8i8 when v8i64 needs to be split to produce the VTRUNCs directly.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68428
llvm-svn: 373864
We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly.
This allows us to remove createTargetShuffleMask.
This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle masks isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask.
llvm-svn: 373846
As discussed on PR42025, with more complex boolean math we can end up with many truncations/extensions of the comparison results through each bitop.
This patch handles the cases introduced in combineBitcastvxi1 by pushing the sign extension through the AND/OR/XOR ops so its just the original SETCC ops that gets extended.
Differential Revision: https://reviews.llvm.org/D68226
llvm-svn: 373834
Rename some variables to match lowerShuffleAsRepeatedMaskAndLanePermute - prep work toward adding some equivalent sublane functionality.
llvm-svn: 373832
We already do this for ISD::TRUNCATE, but we can do the same for X86ISD::VTRUNC
Differential Revision: https://reviews.llvm.org/D68432
llvm-svn: 373765
This patch recognizes the shuffle pattern we get from a
v8i64->v8i8 truncate when v8i64 isn't a legal type.
With VLX we can use two VTRUNCs, unpckldq, and a insert_subvector.
Diffrential Revision: https://reviews.llvm.org/D68374
llvm-svn: 373645
We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly.
This only leaves one user of createTargetShuffleMask which we can hopefully get rid of in a similar manner.
This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle masks isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask.
llvm-svn: 373641
This improves broadcast load folding of i64 elements on 32-bit
targets where i64 isn't legal.
Previously we had to represent these as vXf64 vbroadcast_loads and
a bitcast to vXi64. But we didn't have any isel patterns
looking for that.
This also allows us to remove or simplify some isel patterns that
were looking for bitcasted vbroadcast_loads.
llvm-svn: 373566
The previous code tried to do a trick where we would extract the subvector from the location we were inserting. Then xor that with the new value. Take the xored value and clear out the bits above the subvector size. Then shift that xored subvector to the insert location. And finally xor that with the original vector. Since the old subvector was used in both xors, this would leave just the new subvector at the inserted location. Since the surrounding bits had been zeroed no other bits of the original vector would be modified.
Unfortunately, if the old subvector came from undef we might aggressively propagate the undef. Then we end up with the XORs not cancelling because they aren't using the same value for the two uses of the old subvector. @bkramer gave me a case that demonstrated this, but we haven't reduced it enough to make it easily readable to see what's happening.
This patch uses a safer, but more costly approach. It isolate the bits above the insertion and bits below the insert point and ORs those together leaving 0 for the insertion location. Then widens the subvector with 0s in the upper bits, shifts it into position with 0s in the lower bits. Then we do another OR.
Differential Revision: https://reviews.llvm.org/D68311
llvm-svn: 373495
The gather/scatter instructions can implicitly sign extend the indices. If we're operating on 32-bit data, an v16i64 index can force a v16i32 gather to be split in two since the index needs 2 registers. If we can shrink the index to the i32 we can avoid the split. It should always be safe to shrink the index regardless of the number of elements. We have gather/scatter instructions that can use v2i32 index stored in a v4i32 register with v2i64 data size.
I've limited this to before legalize types to avoid creating a v2i32 after type legalization. We could check for it, but we'd also need testing. I'm also only handling build_vectors with no bitcasts to be sure the truncate will constant fold.
Differential Revision: https://reviews.llvm.org/D68247
llvm-svn: 373408
Summary:
This adds the ISD opcode and a DAG combine to create it. There are
probably some places where we can directly create it, but I'll
leave that for future work.
This updates all of the isel patterns to look for this new node.
I had to add a few additional isel patterns for aligned extloads
which we should probably fix with a DAG combine or something. This
does mean that the broadcast load folding for avx512 can no
longer match a broadcasted aligned extload.
There's still some work to do here for combining a broadcast of
a broadcast_load. We also need to improve extractelement or
demanded vector elements of a broadcast_load. I'll try to get
those done before I submit this patch.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68198
llvm-svn: 373349
Replace with the MachineFunction. X86 is the only user, and only uses
it for the function. This removes one obstacle from using this in
GlobalISel. The other is the more tolerable EVT argument.
The X86 use of the function seems questionable to me. It checks hasFP,
before frame lowering.
llvm-svn: 373292
The i1 scalar would have been type legalized to i8, but that
doesn't guarantee anything about the upper bits. If we're going
to use it as condition we need to make sure the upper bits are 0.
I've special cased ISD::SETCC conditions since that should
guarantee zero upper bits. We could go further and use
computeKnownBits, but we have no tests that would need that.
Fixes PR43507.
llvm-svn: 373246
ANY_EXTEND of v8i8 is marked Custom on AVX512 for handling extends
from v8i8. But the type legalization infrastructure will call
ReplaceNodeResults for v8i8 results. We should just defer it the
default handling instead of asserting in the default of the switch.
Fixes PR43509.
llvm-svn: 373234
This was added back to allow some performance regressions to be
investigated. The main perf issue was fixed shortly after adding
this back and no other major issues have been reported. So I
think its safe to remove this again.
llvm-svn: 373174
Creating new nodes is what we usually do. Have to explicitly
check that we don't update to an existing node and having
to manually manage the worklist is unusual.
We can probably add a helper function to reduce the duplication
of having to check if we should create a gather or scatter, but
I wanted to just get the simple thing done.
llvm-svn: 373137
The majority of the code doesn't run on the X86 nodes today since
its gated by isBeforeLegalizeOps and we don't formm X86 nodes
until after that. Except for a couple special case in type
legalization. But I think we would probably break those if
some of the transforms fire on them.
I want to remove the hardcoded operand numbers and the unusual
use of UpdateNodeOperands. Being able to know which ISD opcodes
are present should help with that.
llvm-svn: 373136
This removes the need for ConvertToTarget opcodes in the isel table.
It's also consistent with the recent changes to use TargetConstant
for intrinsic nodes that always take immediates.
Differential Revision: https://reviews.llvm.org/D67902
llvm-svn: 372645
The attached test case would previous infinite loop after
r365711.
I'm going to move this to X86ISelDAGToDAG.cpp to get the setcc
to match VPTEST in 32-bit mode in a follow up commit.
llvm-svn: 372543
This allows us to use timm in the isel table which is more
consistent with other intrinsics that take an immediate now.
We can't declare the intrinsic as taking an ImmArg because we
need to match non-constants to the shift by MMX register
instruction which we do by mutating the intrinsic id during
lowering.
llvm-svn: 372537
This intrinsics should be shift by immediate, but gcc allows any
i32 scalar and clang needs to match that. So we try to detect the
non-constant case and move the data from an integer register to an
MMX register.
Previously this was done by creating a v2i32 build_vector and
bitcast in SelectionDAGBuilder. This had to be done early since
v2i32 isn't a legal type. The bitcast+build_vector would be DAG
combined to X86ISD::MMX_MOVW2D which isel will turn into a
GPR->MMX MOVD.
This commit just moves the whole thing to lowering and emits
the X86ISD::MMX_MOVW2D directly to avoid the illegal type. The
test changes just seem to be due to nodes being linearized in a
different order.
llvm-svn: 372535
This reverts commit 52621307bc.
Tests have been failing all night with
[0/2] ACTION //llvm/test:check-llvm(//llvm/utils/gn/build/toolchain:unix)
-- Testing: 33647 tests, 64 threads --
Testing: 0 .. 10..
UNRESOLVED: LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll (6943 of 33647)
******************** TEST 'LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll' FAILED ********************
Test has no run line!
********************
Since there were other concerns on https://reviews.llvm.org/D67785,
I'm just reverting for now.
llvm-svn: 372383
We reuse an ISD opcode here that can be reached from BMI that
doesn't require it to be an immediate. Our isel patterns to match
the TBM immediate form require a Constant and not a TargetConstant.
We were accidentally getting the Constant due to a quirk of
combineBEXTR calling SimplifyDemandedBits. The call to
SimplifyDemandedBits ended up constant folding the TargetConstant
to a regular Constant. But we should probably instead be asserting
if SimplifyDemandedBits on a TargetConstant so we shouldn't rely
on this behavior.
llvm-svn: 372373
Summary: This fixes a crasher introduced by r372338.
Reviewers: echristo, arsenm
Subscribers: jvesely, wdng, nhaehnle, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67785
Tighten up the test case.
llvm-svn: 372366
The later code that generates a constant when there are
some non-const elements works basically the same and doesn't
require there to be any non-const elements.
llvm-svn: 372365
This reverts r372314, reapplying r372285 and the commits which depend
on it (r372286-r372293, and r372296-r372297)
This was missing one switch to getTargetConstant in an untested case.
llvm-svn: 372338