hasOneUse is not cheap on nodes with chain results that might have
many uses. By checking the opcode first, we can avoid a costly walk
of the use list on nodes we aren't interested in.
Found by investigating calls to hasNUsesOfValue from the example
provided in D123857.
As far as I know getNode will never return a null SDValue.
I'm guessing this was modeled after the FoldConstantArithmetic
call earlier.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123550
Accord the discussion in D122281, we missing an ISD::AND combine for MLOAD
because it relies on BuildVectorSDNode is fails for scalable vectors.
This patch is intend to handle that, so we can circle back the type MVT::nxv2i32
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122703
As raised on PR52267, XOR(X,MIN_SIGNED_VALUE) can be treated as ADD(X,MIN_SIGNED_VALUE), so let these cases use the 'AddLike' folds, similar to how we perform no-common-bits OR(X,Y) cases.
define i8 @src(i8 %x) {
%r = xor i8 %x, 128
ret i8 %r
}
=>
define i8 @tgt(i8 %x) {
%r = add i8 %x, 128
ret i8 %r
}
Transformation seems to be correct!
https://alive2.llvm.org/ce/z/qV46E2
Differential Revision: https://reviews.llvm.org/D122754
When shifting by a byte-multiple:
bswap (shl X, C) --> lshr (bswap X), C
bswap (lshr X, C) --> shl (bswap X), C
This is the backend version of D122010 and an alternative
suggested in D120648.
There's an extra check to make sure the shift amount is
valid that was not in the rough draft.
I'm not sure if there is a larger motivating case for RISCV (bug report?),
but the ARM diffs show a benefit from having a late version of the
transform (because we do not combine the loads in IR).
Differential Revision: https://reviews.llvm.org/D122655
Modified DAGCombiner to pass the shift the bittest input and the shift amount
to hasBitTest. This matches the other call to hasBitTest in TargetLowering.h
This is an alternative to D122454.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122458
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported to extensions
from legal types.
Both of normal and masked loads test cases added to guard compile crash.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
Add shl/srl/sra to the list of ops that we canonicalize with a select to expose an identity merge
Differential Revision: https://reviews.llvm.org/D122070
RISCV strong prefers i32 values be sign extended to i64. This combine
was always zero extending the constant using APInt methods.
This adjusts the code so that it calls getNode using ISD::ANY_EXTEND instead.
getNode will call TLI.isSExtCheaperThanZExt to decide how to handle
the constant.
Tests were copied from D121598 where I noticed that we were creating
constants that were hard to materialize.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121650
We do not have general reassociation here (and probably
do not need it), but I noticed these were missing in
patches/tests motivated by D111530, so we can at
least handle the simplest patterns.
The VE test diff looks correct, but we miss that
pattern in IR currently:
https://alive2.llvm.org/ce/z/u66_PM
This is another fold generalized from D111530.
We can find a common source for a rotate operation hidden inside an 'or':
https://alive2.llvm.org/ce/z/9pV8hn
Deciding when this is profitable vs. a funnel-shift is tricky, but this
does not show any regressions: if a target has a rotate but it does not
have a funnel-shift, then try to form the rotate here. That is why we
don't have x86 test diffs for the scalar tests that are duplicated from
AArch64 ( 74a65e3834 ) - shld/shrd are available. That also makes it
difficult to show vector diffs - the only case where I found a diff was
on x86 AVX512 or XOP with i64 elements.
There's an additional check for a legal type to avoid a problem seen
with x86-32 where we form a 64-bit rotate but then it gets split
inefficiently. We might avoid that by adding more rotate folds, but
I didn't check to see what is missing on that path.
This gets most of the motivating patterns for AArch64 / ARM that are in
D111530.
We still need a couple of enhancements to setcc pattern matching with
rotate/funnel-shift to get the rest.
Differential Revision: https://reviews.llvm.org/D120933
When inserting undef into buildvectors created from shuffles of
buildvectors, we convert elements to the largest needed type. This had
the effect of converting undef into 0, which isn't needed as the
buildvector implicitly truncates and trunc(zext(undef)) == undef.
Differential Revision: https://reviews.llvm.org/D121002
This extends acb96ffd14 to 'and' and 'xor' opcodes.
Copying from that message:
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
When triggered during operation legalisation the affected combine
generates a splat_vector that when custom lowered for SVE fixed
length code generation, results in the original precombine sequence
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is requirted to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
If the types aren't legal, the expansions may get type legalized in a
different way preventing code sharing. If the type is legal, we will
share some instructions between the two expansions, but we will need an
extra register.
Since we don't appear to fold (neg (sub A, B)) if the sub has an
additional user, I think it makes sense not to expand NABS.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120513
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
This is a partial implementation of a transform suggested in D111530
(only handles 'or' bitwise logic as a first step - need to stamp out more
tests for other opcodes).
Several of the same tests added for D111530 are altered here (but not
fully optimized). I'm not sure yet if this would help/hinder that patch,
but this should be an improvement for all tests added with ecf606cb43
since it removes a shift operation in those examples.
Differential Revision: https://reviews.llvm.org/D120516
If the shl is at least half the bitwidth (i.e. the lower half of the bswap source is zero), then we can reduce the shift and perform the bswap at half the bitwidth and just zero extend.
Based off PR51391 + PR53867
Differential Revision: https://reviews.llvm.org/D120192
This is the SDAG translation of D120253 :
https://alive2.llvm.org/ce/z/qHpmNn
The SDAG nodes can have different operand types than the result value.
We can see an example of that with AArch64 - the funnel shift amount
is an i64 rather than i32.
We may need to make that match even more flexible to handle
post-legalization nodes, but I have not stepped into that yet.
Differential Revision: https://reviews.llvm.org/D120264
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
In combineCarryDiamond() use getAsCarry() to find more candidates for being a carry flag.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118362
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP
There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).
The other diffs are presumed neutral.
Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).
This helps prevent regressions from a potential fold for
issue #53829.
The current ABD combine doesn't quite work for SVE because only a
single scalable vector per scalar integer type is legal (e.g. for
i32, <vscale x 4 x i32> is the only legal scalable vector type).
This patch extends the combine to also trigger for the cases when
operand extension must be retained.
Differential Revision: https://reviews.llvm.org/D115739
This moves the matching of AVGFloor and AVGCeil into a place where
demand bit are available, so that it can detect more cases for more
folds. It changes the transform to start from a shift, not from a
truncate. We match the pattern shr(add(ext(A), ext(B)), 1), transforming
to ext(hadd(A, B)).
For signed values, because only the bottom bits are demanded llvm will
transform the above to use a lshr too, as opposed to ashr. In order to
correctly detect the hadd we need to know the demanded bits to turn it
back. Depending on whether the shift is signed (ashr) or logical (lshr),
and the extensions are signed or unsigned we can create different nodes.
If the shift is signed:
Needs >= 2 sign bits. https://alive2.llvm.org/ce/z/h4gQAW generating signed rhadd.
Needs >= 2 zero bits. https://alive2.llvm.org/ce/z/B64DUA generating unsigned rhadd.
If the shift is unsigned:
Needs >= 1 zero bits. https://alive2.llvm.org/ce/z/ByD8sj generating unsigned rhadd.
Needs 1 demanded bit zero and >= 2 sign bits https://alive2.llvm.org/ce/z/hvPGxX and
https://alive2.llvm.org/ce/z/32P5n1 generating signed rhadd.
Differential Revision: https://reviews.llvm.org/D119072
This adds very basic combines for AVG nodes, mostly for constant folding
and handling degenerate (zero) cases. The code performs mostly the same
transforms as visitMULHS, adjusted for AVG nodes.
Constant folding extends to a higher bitwidth and drops the lowest bit.
For undef nodes, `avg undef, x` is transformed to x. There is also a
transform for `avgfloor x, 0` transforming to `shr x, 1`.
Differential Revision: https://reviews.llvm.org/D119559
This enables fshl to be matched earlier on X86
%6 = lshr i32 %3, 1
%7 = select i1 %4, i32 -2147483648, i32 0
%8 = or i32 %6, %7
X86 uses i8 for shift amounts. SelectionDAGBuilder creates the
ISD::SRL with an i8 shift type. DAGCombiner turns the select into
an ISD::SHL. Prior to this patch it would use i32 for the shift
amount. fshl matching failed because the shift amounts have different
types. LegalizeDAG fixes the ISD::SHL shift amount to i8. This
allowed fshl matching to succeed.
With this patch, the ISD::SHL will be created with an i8 shift
amount. This allows the fshl to match immediately.
No test case beause we still end up with a fshl either way.
I have not found a way to expose a difference for this patch in a test
because it only triggers for a one-use load, but this is the code that
was adapted into D118376 and caused miscompiles. The new code pattern
is the same as what we do in narrowExtractedVectorLoad() (reduces load
width for a subvector extract).
This removes seemingly unnecessary manual worklist management and fixes
the chain updating via "SelectionDAG::makeEquivalentMemoryOrdering()".
Differential Revision: https://reviews.llvm.org/D119549
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for a AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate
lshr==ashr, as the top bits are not demanded. An AVGCEIL also includes
an extra rounding, so includes an extra add of 1.
Differential Revision: https://reviews.llvm.org/D106237
We're hitting a pathological compile-time case, profiled to be in
DagCombiner::visitTokenFactor and many inserts into a SmallPtrSet.
It looks like one of the paths around findBetterNeighborChains is not
capped and leads to this.
This patch resolves the issue. Looking for feedback if this solution
looks reasonable.
Differential Revision: https://reviews.llvm.org/D118877
The test diffs are identical to D119111.
This only affects x86 currently because no other target
has an override for the TLI hook that controls this transform.
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.
This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.
I've updated a number of cases that were either performing seperate size/position calculations or had created their own local wrapper versions of these.
Differential Revision: https://reviews.llvm.org/D119019
rv64izbb has a RORW/ROLW instructions that operate on the lower
32-bits of a 64-bit value and sign extend bit 31 of the result.
DAGCombiner won't match rotate idioms because the i32 type isn't Legal
on riscv64.
This patch teaches DAGCombiner to allow it if the type is going to
be promoted and the target has Custom type legalization for ISD::ROTL
or ISD::ROTR. I've restricted this to scalar types. It doesn't appear
any in tree targets other than riscv64 have custom type legalization
for rotates.
If this patch isn't acceptable, I guess I can match SRLW, SLLW, and OR
after type legalization, but I'd like to avoid that if possible.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119062
When the shift amount is known and a known sign bit analysis of
the shiftee indicates that no saturation will occur, then we can
replace SSHLSAT/USHLSAT by SHL.
Differential Revision: https://reviews.llvm.org/D118765
In the aftermath of D116895 a problem was found in the analysis of
dependencies between store merge candidates in
checkMergeStoreCandidatesForDependencies, that is needed to avoid
the cycles are introduced in the DAG.
In the past it has been enough (or assumed to be enough) to start
scanning from non-chain operands when analysing the store merge
candidates for dependencies, assuming that the analysis of chain
dependencies performed when finding the candidates would cover
up for potential dependencies that exist involving the chain operands.
It was however discovered that one could end up with scenarios such
as descibed in the aarch64-checkMergeStoreCandidatesForDependencies.ll
test case, when the dependency between two stores is given by a mix
of chain operand dependencies and non-chain operand dependencies.
The fix in this patch make sure that we also account for chain operand
dependencies when doing the more elaborate analysis in
checkMergeStoreCandidatesForDependencies, no longer relying on that
the earlier check involving chain operands is enough.
Differential Revision: https://reviews.llvm.org/D118943
Do "simplifyShift" and "FoldConstantArithmetic" folds for the SSHLSAT
and USHLSAT DAG nodes.
This includes folds such as:
(shlsat undef/poison, x) -> 0
(shlsat x, undef/poison) -> undef
(shlsat x, too_large_shamt) -> undef
(shlsat 0, x) -> 0
(shlsat x, 0) -> x
(shlsat c1, c2) -> c3
Differential Revision: https://reviews.llvm.org/D118603
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
Differential Revision: https://reviews.llvm.org/D117210
If we have a vector FP division with a splatted divisor, use
getVectorMinNumElements when scaling the num of uses by splat factor.
For AArch64 the combine kicks in for the <vscale x 4 x float> case since it's
above the fdiv threshold (3) when scaling num uses by splat factor, but the
codegen is worse (splat + vector fdiv + vector fmul) than the <vscale x 2 x
double> case (splat + vector fdiv).
If the combine could be converted into a scalar FP division by
scalarizeBinOpOfSplats it may be cheaper, but it looks like this is predicated
on the isExtractVecEltCheap TLI function which is implemented for x86 but not
AArch64. Perhaps for now combineRepeatedFPDivisors should only scale num uses
by splat if the division can be converted into scalar op.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118343
Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
This relies on further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
Current change enables the patterns which explicitly select the not (xor_one_use) to appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D116270
This is the unsigned variant of D111976, where we convert a clamped
fptoui to a fptoui.sat. Because we are unsigned, the condition this time
is only UMIN of UINT_MAX. Similarly to D111976 it handles ISD::UMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D114964
An sra is basically sign-extending a narrower value. Fold away the
shift by doing a sextload of a narrower value, when it is legal to
reduce the load width accordingly.
Differential Revision: https://reviews.llvm.org/D116930
If the bitreverse gets expanded, it will introduce a new bswap. By
putting a bswap before the bitreverse, we can ensure it gets cancelled
out when this happens.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118012
In code review for D117104 two slightly weird checks were found
in DAGCombiner::reduceLoadWidth. They were typically checking
if BitsA was a mulitple of BitsB by looking at (BitsA & (BitsB - 1)),
but such a comparison actually only make sense if BitsB is a power
of two.
The checks were related to the code that attempted to shrink a load
based on the fact that the loaded value would be right shifted.
Afaict the legality of the value types is checked later (typically in
isLegalNarrowLdSt), so the existing checks were both overly
conservative as well as being wrong whenever ExtVTBits wasn't a
power of two. The latter was a situation triggered by a number of
lit tests so we could not just assert on ExtVTBIts being a power of
two).
When attempting to simply remove the checks I found some problems,
that seems to have been guarded by the checks (maybe just out of
luck). A typical example would be a pattern like this:
t1 = load i96* ptr
t2 = srl t1, 64
t3 = truncate t2 to i64
When DAGCombine is visiting the truncate reduceLoadWidth is called
attempting to narrow the load to 64 bits (ExtVT := MVT::i64). Then
the SRL is detected and we set ShAmt to 64.
In the past we've bailed out due to i96 not being a multiple of 64.
If we simply remove that check then we would end up replacing the
load with a new load that would read 64 bits but with a base pointer
adjusted by 64 bits. So we would read 32 bits the wasn't accessed by
the original load.
This patch will instead utilize the fact that the logical left shift
can be folded away by using a zextload. Thus, the pattern above will
now be combined into
t3 = load i32* ptr+offset, zext to i64
Another case is shown in the X86/shift-folding.ll test case:
t1 = load i32* ptr
t2 = srl i32 t1, 8
t3 = truncate t2 to i16
In the past we bailed out due to the shift count (8) not being a
multiple of 16. Now the narrowing kicks in and we get
t3 = load i16* ptr+offset
Differential Revision: https://reviews.llvm.org/D117406
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
This extends the code in SearchForAndLoads to be able to look through
ANY_EXTEND nodes, which can be created from mismatching IR types where
the AND node we begin from only demands the low parts of the register.
That turns zext and sext into any_extends as only the low bits are
demanded. To be able to look through ANY_EXTEND nodes we need to handle
mismatching types in a few places, potentially truncating the mask to
the size of the final load.
Recommitted with a more conservative check for the type of the extend.
Differential Revision: https://reviews.llvm.org/D117457
This caused builds to fail with
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:5638:
bool (anonymous namespace)::DAGCombiner::BackwardsPropagateMask(llvm::SDNode *):
Assertion `NewLoad && "Shouldn't be masking the load if it can't be narrowed"' failed.
See the code review for a link to a reproducer.
> This extends the code in SearchForAndLoads to be able to look through
> ANY_EXTEND nodes, which can be created from mismatching IR types where
> the AND node we begin from only demands the low parts of the register.
> That turns zext and sext into any_extends as only the low bits are
> demanded. To be able to look through ANY_EXTEND nodes we need to handle
> mismatching types in a few places, potentially truncating the mask to
> the size of the final load.
>
> Differential Revision: https://reviews.llvm.org/D117457
This reverts commit 578008789f.
This extends the code in SearchForAndLoads to be able to look through
ANY_EXTEND nodes, which can be created from mismatching IR types where
the AND node we begin from only demands the low parts of the register.
That turns zext and sext into any_extends as only the low bits are
demanded. To be able to look through ANY_EXTEND nodes we need to handle
mismatching types in a few places, potentially truncating the mask to
the size of the final load.
Differential Revision: https://reviews.llvm.org/D117457
Update code comments in DAGCombiner::ReduceLoadWidth and refactor
the handling of SRL a bit. The refactoring is done with the intent
of adding support for folding away SRA by using SEXTLOAD in a
follow-up patch.
The function is also renamed as DAGCombiner::reduceLoadWidth.
Differential Revision: https://reviews.llvm.org/D117104
This commit fixes a missed opportunity in merging consecutive stores.
The code that searches for stores skipped the case of stores that
directly connect to the root. The comment above the implementation lists
this case but the code did not handle it. I found this pattern when
looking into the shared_ptr destructor. GCC generates the right
sequence. Here is a small repo:
int foo(int* buff) {
buff[0] = 0;
int x = buff[1];
buff[1] = 0;
return x;
}
Differential Revision: https://reviews.llvm.org/D116895
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
When the source has a series of assignments, users reasonably want to
have the debugger step through each one individually. Turn off the combine
for adjacent stores so we get this behavior at -O0.
Similar to D7181.
Reviewed By: spatel, xgupta
Differential Revision: https://reviews.llvm.org/D115808
When the source has a series of assignments, users reasonably want to
have the debugger step through each one individually. Turn off the combine
for adjacent stores so we get this behavior at -O0.
Similar to D7181.
Differential Revision: https://reviews.llvm.org/D115808
Merge the node combines into a common DAGCombiner::visitFMinMax (like we do for IMINMAX).
Move the constant folding into SelectionDAG::foldConstantFPMath.
This allows us to fold the vecreduce-propagate-sd-flags.ll test as it reduces constants - so I've refactored it to take variables instead.
Differential Revision: https://reviews.llvm.org/D115952
We were using a function attribute to indicate a non-standard FP mode,
but now we can use intrinsics for that job as shown in the new tests.
Presumably the x86 asm could be improved for that IR with intrinsics,
but I have not worked out exactly how to do that. Note that the
transform to FTRUNC still requires a hacky check for "nsz" (because
FMF are not applied to FP casts).
This is a cleanup based on the clang change in D115804 / 8c7f2a4f87 .
This is effectively a revert of 5a90285bd9 + D46237 .
Differential Revision: https://reviews.llvm.org/D115885
Replace custom constant scalar/splat folding with FoldConstantArithmetic call and canonicalize commutative constant ops to the RHS before the SimplifyVBinOp call
This extends the custom lowering for truncating stores on
fixed length vectors in SVE to support masked truncating stores.
It also adds a DAG combine for truncates followed by masked
stores.
Reviewed By: peterwaller-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108115
SimplifyVBinOp still has a FoldConstantArithmetic call, which now it isn't vector specific we should be able to remove (once fp binops are tidied up); but we can at least clean up the integer opcodes to perform the basic constant/undef handling in common code first.
After D113888 / 32b6c17b29 the MMO size of a masked loads/store is
unknown. When we are converting back to a standard load/store because
the mask is known all ones, we can refine that to the correct size from
the size of the vector being loaded/stored.
Differential Revision: https://reviews.llvm.org/D114582
As an extension to D111976, this converts clamp fptosi, clamped between
0 and (2^n)-1 to a fptoui.sat. This can greatly help on targets with
conversions that naturally saturate, such as Arm.
X86 disables the transform as some of the test cases increases in size.
A fptoui.sat necessitates a fp clamp without native support, so there is
little use in converting if the instruction is just going to be
expanded.
Differential Revision: https://reviews.llvm.org/D112428
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULUW if the source operands are SEXT/ZEXT instructions for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
Differential Revision: https://reviews.llvm.org/D113371
The REM DAG combine uses the visitDivLike functions to try and get an
optimized DIV node to provide better codegen, however in some cases this
visitDivLike call ends up in the BuildSDIVPow2 target hook, which in
turn sometimes will return the same node passed in to indicate not to
change it. The REM DAG combine does not anticipate this and creates a
cycle in the DAG because of it.
Fix this by ensuring any such optimized div node returned is distinct
from the node being combined.
Differential Revision: https://reviews.llvm.org/D114716
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
This allows the generic DAG combine to fold fp_extend/fp_trunc into
loads/stores which we can then lower into a integer extending
load/truncating store plus an FP_EXTEND/FP_ROUND.
The nuance here is that fixed-type FP_EXTEND/FP_ROUND require unpacked
types hence lowering them introduces an unpack/zip. By allowing these
nodes to be combined with loads/store we make it much easier to have
this unpack/zip combined into the load/store by our custom lowering.
Differential Revision: https://reviews.llvm.org/D114580
Patch to fix some of the regressions in D77804.
By folding to rotate/funnel-shift by constant amounts for illegal types, we prevent SimplifyDemandedBits from destroying the patterns prematurely, allowing us to use the rotate/funnel-shift legalization that was added in D112443.
Differential Revision: https://reviews.llvm.org/D113192
It's possible that the mask is already a NOT. At least if InstCombine
hasn't canonicalized the input. In that case we will form an ANDN with
X instead of with Y. So we don't need to worry about Y being a constant.
We might need to check that X isn't a constant instead, but we don't
have a test case for that yet.
This fixes a size regression found when trying to enable this combine
for RISCV in D113937.
Differential Revision: https://reviews.llvm.org/D113948
This was noted as a follow-up to D113212 / D113426:
4fc1fc40057e30404c3b11522cfcadhttps://alive2.llvm.org/ce/z/e4o96b
The canonicalization rules for these IR patterns are complicated,
and we were not matching the expected forms in 2 out of the 3
cases. We can make codegen more robust by matching the swapped
forms (and that will also work if these patterns are created late).
(Cond0 s> -1) ? N1 : 0 --> ~(Cond0 s>> BW-1) & N1
https://alive2.llvm.org/ce/z/mGCBrd
This was suggested as a potential enhancement in D113212 (also 7e30404c3b ).
There's an improvement for AArch that could be generalized ( X > -1 --> X >= 0 ).
For x86, we have a counter-acting fold for most cases that turns the shift+not
back into a setcc, so that needs a work-around to get more cases to use "pandn":
D113603
Note that this pattern (and a previous one) are not currently canonical forms
in IR:
https://alive2.llvm.org/ce/z/e4o96b
Adding swapped variants is left as a TODO item here, but is planned as
a near-term follow-up patch.
Differential Revision: https://reviews.llvm.org/D113426
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage, its not exhaustive, just the more obvious cases where the intention is obvious.
Differential Revision: https://reviews.llvm.org/D113396
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.
This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.
Differential Revision: https://reviews.llvm.org/D113351
Currently FoldConstantArithmetic only handles binops, so replacing other uses of FoldConstantVectorArithmetic (in particular for SETCC nodes), still require more work.
This is the 'or' sibling for the fold added with:
D113212
https://alive2.llvm.org/ce/z/tgnp7K
Note that neither of these transforms is poison-safe,
but it does not seem to matter at this level. We have
had the scalar version of D113212 for a long time, so
this is just making optimizer behavior consistent.
We do not have the scalar version of *this* fold,
however, so that is another follow-up.
Another minor step towards merging FoldConstantVectorArithmetic into FoldConstantArithmetic.
We don't use SDNodeFlags in any constant folding inside DAG, so passing the Flags argument is a waste of time - an alternative would be to wire up FoldConstantArithmetic to take SDNodeFlags just-in-case we someday start using it, but we don't have any way to test it and I'd prefer to avoid dead code.
Differential Revision: https://reviews.llvm.org/D113276
(X s< 0) ? Y : 0 --> (X s>> BW-1) & Y
We canonicalize to the icmp+select form in IR, and we already have this fold
for scalar select in SDAG, so I think it's an oversight that we don't have
the fold for vectors. It seems neutral for AArch64 and saves some instructions
on x86.
Whether we should also have the sibling folds for the inverse condition or
all-ones true value may depend on target-specific factors such as whether
there's an "and-not" instruction.
Differential Revision: https://reviews.llvm.org/D113212
Fold (srl (mul (zext i32:$a to i64), i64:c), 32) -> (mulhu $a, $b),
if c can truncate to i32 without loss.
Reviewed By: frasercrmck, craig.topper, RKSimon
Differential Revision: https://reviews.llvm.org/D108129
Rely on the hasOperation() instead - as commented on D77804, the mid-term intention is to recognise rotate/funnel-by-constant pre-legalization to help avoid SimplifyDemandedBits regressions.
(i8 X ^ 128) & (i8 X s>> 7) --> usubsat X, 128
As suggested in D112085, we can substitute 'xor' with 'add'
in this pattern, and it is logically equivalent:
https://alive2.llvm.org/ce/z/eJtWWC
We canonicalize to 'xor' in IR, but SDAG does not do that
(and it probably should not - https://llvm.org/PR52267 ), so
it is possible to see either pattern in codegen. Note that
'sub' is a another potential pattern, but that is
canonicalized to 'add' in DAGCombiner, so we don't need to
worry about that variation.
Differential Revision: https://reviews.llvm.org/D112377
EXTRACT_SUBVECTOR indices are always constant, we don't need to check for ConstantSDNode, we should just use getConstantOperandVal which will assert for the constant.
Instead of returning a bool to indicate success and a separate
SDValue, return the SDValue and have the callers check if it is
null.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112331
(i8 X ^ 128) & (i8 X s>> 7) --> usubsat X, 128
I haven't found a generalization of this identity:
https://alive2.llvm.org/ce/z/_sriEQ
Note: I was actually looking at the first form of the pattern in that link,
but that's part of a long chain of potential missed transforms in codegen
and IR....that I hope ends here!
The predicates for when this is profitable are a bit tricky. This version of
the patch excludes multi-use but includes custom lowering (as opposed to
legal only).
On x86 for example, we have custom lowering for some vector types, and that
uses umax and sub. So to enable that fold, we need add use checks to avoid
regressions. Even with legal-only lowering, we could see code with extra
reg move instructions for extra uses, so that constraint would have to be
eased very carefully to avoid penalties.
Differential Revision: https://reviews.llvm.org/D112085
With unoptimized code, we may see lots of stores and spend too much time in mergeTruncStores.
Fixes PR51827.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D111596
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
This is NFC-intended for the callers. Posting in case there are
other potential users that I missed.
I would also use this from VectorCombine in a patch for:
https://llvm.org/PR52178 ( D111901 )
Differential Revision: https://reviews.llvm.org/D111891
We were previously silently generating incorrect code when extracting a
fixed-width vector from a scalable vector. This is worse than crashing,
since the user will have no indication that this is currently unsupported
behaviour. I have fixed the code to only perform DAG combines when safe
to do so, i.e. the input and output vectors are both fixed-width or
both scalable.
Test added here:
CodeGen/AArch64/sve-extract-scalable-vector.ll
Differential revision: https://reviews.llvm.org/D110624
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
One of the cases identified in PR45116 - we don't need to limit extracted loads to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
I've also cleaned up the alignment calculation code - if we have a constant extraction index then the alignment can be based on an offset from the original vector load alignment, but for non-constant indices we should assume the worst (single element alignment only).
Differential Revision: https://reviews.llvm.org/D110486
This patch adds a generic DAGCombine for vector-predicated (VP) nodes.
Those for which we can determine that no vector element is active can be
replaced by either undef or, for reductions, the start value.
This is tested rather trivially at the IR level, where it's possible
that we want to teach instcombine to perform this optimization.
However, we can also see the zero-evl case arise during SelectionDAG
legalization, when wide VP operations can be split into two and the
upper operation emerges as trivially false.
It's possible that we could perform this optimization "proactively"
(both on legal vectors and before splitting) and reduce the width of an
operation and insert it into a larger undef vector:
```
v8i32 vp_add x, y, mask, 4
->
v8i32 insert_subvector (v8i32 undef), (v4i32 vp_add xsub, ysub, mask, 4), i32 0
```
This is somewhat analogous to similar vector narrow/widening
optimizations, but it's unclear at this point whether that's beneficial
to do this for VP ops for any/all targets.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D109148
One of the cases identified in PR45116 - we don't need to limit store narrowing to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory access by checking allowsMisalignedMemoryAccesses as a fallback.
The fmul is a canonicalizing operation, and fneg is not so this would
break denormals that need flushing and also would not quiet signaling
nans. Fold to fsub instead, which is also canonicalizing.
This extends the custom lowering for extending loads on
fixed length vectors in SVE to support masked extending loads.
The existing tests for correct behaviour of masked extending loads
exhibit bad code generation due to the legalistaion of i1 vectors.
They have been left as-is and new tests have been added that do not
exhibit this behaviour.
Differential Revision: https://reviews.llvm.org/D108200
Soft deprecrate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`. This achieves two things:
1) This starts standardizing predicates across the LLVM codebase,
following (in this case) ConstantInt. The word "Value" doesn't
convey anything of merit, and is missing in some of the other things.
2) Calling an integer "null" doesn't make any sense. The original sin
here is mine and I've regretted it for years. This moves us to calling
it "zero" instead, which is correct!
APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go. As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.
Included in this patch are changes to a bunch of the codebase, but there are
more. We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.
Differential Revision: https://reviews.llvm.org/D109483
Given a select_cc producing a constant and a invertion of the constant
for a comparison more than zero, we can produce an xor with ashr
instead, which produces smaller code. The ashr either sets all bits or
clear all bits depending on if the value is negative. This is then xor'd
with the constant to optionally negate the value.
https://alive2.llvm.org/ce/z/DTFaBZ
This includes a OneUseCheck on the Cmp, which seems to make thinks a
little worse and will be removed in a followup.
Differential Revision: https://reviews.llvm.org/D109149
The check for whether a rotate is possible occurs before the
memory legality checks for the integer type. So it's possible we
decide we can use a rotate, but then fail the legality checks. If
that happens we should not fall back to a vector type. This triggers
an assertion in the rotate handling when it finds a vector type
instead of an integer type.
In theory we could use a shufflevector in place of the rotate, but
right now I'd just like to fix the crash.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108839
Without this change only the preferred fusion opcode is tested
when attempting to combine FMA operations.
If both FMA and FMAD are available then FMA ops formed prior to
legalization will not be merged post legalization as FMAD becomes
the preferred fusion opcode.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108619
This is another bug exposed by https://llvm.org/PR51612
(and the one that triggered the initial assertion) in the report.
That example was suppressed with:
985b48f183
...but these would still crash because we created nodes
like UADDO without the expected 2 output values.
There are 2 bugs here:
1. We were not checking uses of operand 2 (the false value of the select).
2. We were not checking for multiple uses of nodes that produce >1 result.
Correcting those is enough to avoid the crash in the reduced test based on:
https://llvm.org/PR51612
The additional use check on operand 0 (the condition value of the select)
should not strictly be necessary because we are only replacing one use
with another (whether it makes performance sense to do the transform with
that pattern is not clear). But as noted in the TODO, changing that
uncovers another bug.
Note: there's at least one more bug here - we aren't propagating EVTs
correctly, but I plan to fix that in another patch.
For ISD::EXTRACT_SUBVECTOR, its second operand must be a constant
multiple of the known-minimum vector length of the result type.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D107795
One of the cases identified in PR45116 - we don't need to limit load combines to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
One of the cases identified in PR45116 - we don't need to limit load combines (in this case for fp->int load/store copies) to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
Differential Revision: https://reviews.llvm.org/D108318
One of the cases identified in PR45116 - we don't need to limit load combines (in this case for ISD::BUILD_PAIR) to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
This helps in particular for 32-bit X86 cases loading 64-bit size data, reducing codegen diffs vs x86_64.
Differential Revision: https://reviews.llvm.org/D108307
Follow-up to D107068, attempt to fold nested concat_vectors/undefs, as long as both the vector and inner subvector types are legal.
This exposed the same issue in ARM's MVE LowerCONCAT_VECTORS_i1 (raised as PR51365) and AArch64's performConcatVectorsCombine which both assumed concat_vectors only took 2 subvector operands.
Differential Revision: https://reviews.llvm.org/D107597
visitEXTRACT_SUBVECTOR can sometimes create illegal BITCASTs when
removing "redundant" INSERT_SUBVECTOR operations. This patch adds
an extra check to ensure such combines only occur after operation
legalisation if any resulting BITBAST is itself legal.
Differential Revision: https://reviews.llvm.org/D108086
We don't have real demanded bits support for MULHU, but we can
still use the known bits based constant folding support at the end
of SimplifyDemandedBits to simplify a MULHU. This helps with cases
where we know the LHS and RHS have enough leading zeros so that
the high multiply result is always 0.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D106471
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.
This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.
This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.
Fixes PR50053
Differential Revision: https://reviews.llvm.org/D107068
This allows special constants like to 0 to be recognized. It's also
expected by isel patterns if a target had a mulh with immediate instructions.
The commuting done by tablegen won't commute patterns with immediates since it
expects DAGCombine to have done it.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D107486
We had some similar hasOneUse/isNON_EXTLoad early-outs spread out over different parts of the method - we should pull them all together.
Noticed while triaging PR45116
This transform was added with D58874, but there were no tests for overflow ops.
We need to change this one way or another because it can crash as shown in:
https://llvm.org/PR51238
Note that if there are no uses of an overflow op's bool overflow result, we
reduce it to a regular math op, so we continue to fold that case either way.
If we have uses of both the math and the overflow bool, then we are likely
not saving anything by creating an independent sub instruction as seen in
the test diffs here.
This patch makes the behavior in SDAG consistent with what we do in
instcombine AFAICT.
Differential Revision: https://reviews.llvm.org/D106983
This patch adds a peephole optimization `SETCC(FREEZE(x),const)` => `FREEZE(SETCC(x,const))`
if the SETCC is only used by BRCOND.
Combined with `BRCOND(FREEZE(X)) => BRCOND(X)`, this leads to a nice improvement in the generated assembly when x is a masked loaded value.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D105344
This patch builds on top of D106575 in which scalable-vector splats were
supported in `ISD::matchBinaryPredicate`. It teaches the DAGCombiner how
to perform a variety of the pre-existing saturating add/sub combines on
scalable-vector types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106652
This patch extends support for (scalable-vector) splats in the
DAGCombiner via the `ISD::matchBinaryPredicate` function, which enable a
variety of simple combines of constants.
Users of this function may now have to distinguish between
`BUILD_VECTOR` and `SPLAT_VECTOR` vector operands. The way of dealing
with this in-tree follows the approach added for
`ISD::matchUnaryPredicate` implemented in D94501.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106575
I've setup the basic framework for the isGuaranteedNotToBeUndefOrPoison call and updated DAGCombiner::visitFREEZE to use it, further Opcodes can be handled when we have test coverage.
I'm not aware of any vector test freeze coverage so the DemandedElts (and the Depth) args are not being used yet - but they are in place.
SelectionDAG::isGuaranteedNotToBePoison wrappers have also been added.
Differential Revision: https://reviews.llvm.org/D106668
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
Differential Revision: https://reviews.llvm.org/D104471
Reland of 31859f896.
This change implements new DAG notes GLOBAL_GET/GLOBAL_SET, and
lowering methods for load and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D104797
The existing rule about the operand type is strange. Instead, just say
the operand is a TargetConstant with the right width. (Legalization
ignores TargetConstants, so it doesn't matter if that width is legal.)
Highlights:
1. I had to substantially rewrite the AArch64 isel patterns to expect a
TargetConstant. Nothing too exotic, but maybe a little hairy. Maybe
worth considering a target-specific node with some dagcombines instead
of this complicated nest of isel patterns.
2. Our behavior on RV32 for vectors of i64 has changed slightly. In
particular, we correctly preserve the width of the arithmetic through
legalization. This changes the DAG a bit. Maybe room for
improvement here.
3. I explicitly defined the behavior around overflow. This is necessary
to make the DAGCombine transforms legal, and I don't think it causes any
practical issues.
Differential Revision: https://reviews.llvm.org/D105673
I'm going to extend the functionality started in D106058 so move the folds into their own method to reduce the amount of code in DAGCombiner::visitSELECT
Similar to the folds performed in InstCombinerImpl::foldSelectOpOp, this attempts to push a select further up to help merge a pair of binops.
I'm primarily interested in select(cond,add(x,y),add(x,z)) folds to help expose pointer math (see https://bugs.llvm.org/show_bug.cgi?id=51069 etc.) but I've tried to use the more generic isBinOp().
Differential Revision: https://reviews.llvm.org/D106058
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
Differential Revision: https://reviews.llvm.org/D104471
We already have reassociation code for Adds and Ors separately in DAG
combiner, this adds it for the combination of the two where Ors act like
Adds. It reassociates (add (or (x, c), y) -> (add (add (x, y), c)) where
we know that the Ors operands have no common bits set, and the Or has
one use.
Differential Revision: https://reviews.llvm.org/D104765
Reland of 31859f896.
This change implements new DAG notes GLOBAL_GET/GLOBAL_SET, and
lowering methods for load and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.
Differential Revision: https://reviews.llvm.org/D104797
This ports the AArch64 SABD and USBD over to DAG Combine, where they can be
used by more backends (notably MVE in a follow-up patch). The matching code
has changed very little, just to handle legal operations and types
differently. It selects from (ABS (SUB (EXTEND a), (EXTEND b))), producing
a ubds/abdu which is zexted to the original type.
Differential Revision: https://reviews.llvm.org/D91937
This add as a fold of sub(0, splat(sub(0, x))) -> splat(x). This can
come up in the lowering of right shifts under AArch64, where we generate
a shift left of a negated number.
Differential Revision: https://reviews.llvm.org/D103755
The is from discussion in https://reviews.llvm.org/D104247#inline-993387
The contract and reassoc flags shouldn't imply each other .
All the aggressive fsub fusion reassociate operations,
we should guard them with reassoc flag check.
Reviewed By: mcberg2017
Differential Revision: https://reviews.llvm.org/D104723
According to IR LangRef, the FMF flag:
contract
Allow floating-point contraction (e.g. fusing a multiply followed by an
addition into a fused multiply-and-add).
reassoc
Allow reassociation transformations for floating-point instructions.
This may dramatically change results in floating-point.
My understanding is that these two flags shouldn't imply each other,
as we might have a SDNode that can be reassociated with others, but
not contractble.
eg: We may want following fmul/fad/fsub to freely reassoc, but don't
want fma being generated here.
%F = fmul reassoc double %A, %B ; <double> [#uses=1]
%G = fmul reassoc double %C, %D ; <double> [#uses=1]
%H = fadd reassoc double %F, %G ; <double> [#uses=1]
%I = fsub reassoc double %H, %E ; <double> [#uses=1]
Before https://reviews.llvm.org/D45710, `reassoc` flag actually
did not imply isContratable either.
The current implementation also only check the flag in fadd node,
ignoring fmul node, this patch update that as well.
Reviewed By: spatel, qiucf
Differential Revision: https://reviews.llvm.org/D104247
6e5628354e regressed the Windows build as
the return type no longer matched in both branches for the return value
type deduction. This uses a bit more compiler magic to deal with that.
The sorting, obviously, must be stable, else we will have random assembly fluctuations.
Apparently there was no test coverage that would benefit from that,
so i've added one test.
The sorting consists of two parts - just sort the input vectors,
and recompute the shuffle mask -> input vector mapping.
I don't believe we need to do anything else.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D104187
When reducing vector builds to shuffles it possible that
the DAG combiner may try to extract invalid subvectors.
This happens as the existing code assumes vectors will be power
of 2 sizes, which is already untrue, but becomes more noticable
with v6 and v7 types.
Specifically the existing code assumes that half PowerOf2Ceil of
a given vector index will fit twice into a given vector.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D103880
This change implements new DAG notes GLOBAL_GET/GLOBAL_SET, and
lowering methods for load and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D95425
As shown in:
https://llvm.org/PR50623
...and the similar tests here, we were not accounting for
store merging of different sizes that do not cover the
entire range of the wide value to be stored.
This is the easy fix: just make sure that all of the
original stores are the same size, so when we calculate
the wide width, it's a simple N * M check.
This still allows all of the motivating optimizations from:
D86420 / 54a5dd485c
D87112 / 7a06b166b1
We could enhance this code to track individual bytes and
allow merging multiple sizes.
shuffle(concat(x,undef),concat(y,undef)) -> concat(shuffle(x,y),shuffle(x,y))
If the original shuffle references any of the upper (undef) subvector elements, ensure the split shuffle masks uses undef instead of an out-of-bounds value.
Fixes PR50609
This extends 434c8e013a and ede3982792 to handle signed
predicates by sign-extending the setcc operands.
This is not shown directly in https://llvm.org/PR50055 ,
but the pattern is visible by changing the unsigned convert
to signed in the source code.
This is a follow-up to D103280 that eases the use restrictions,
so we can handle the motivating case from:
https://llvm.org/PR50055
The loop code is adapted from similar use checks in
ExtendUsesToFormExtLoad() and SliceUpLoad(). I did not see an
easier way to filter out non-chain uses of load values.
Differential Revision: https://reviews.llvm.org/D103462
I accidentaly pushed a draft of D103280 that was discussed
during the review, but it was not supposed to be the final
version.
Rather than revert and recommit, I'm updating the existing
code. This way we have a record of the codegen diff that
would result if we decide to remove this predicate in the
future.
sext (vsetcc X, Y) --> vsetcc (zext X), (zext Y) --
(when the zexts are free and a bunch of other conditions)
We have a couple of similar folds to this already for vector selects,
but this pattern slips through because it is only a setcc.
The tests are based on the motivating case from:
https://llvm.org/PR50055
...but we need extra logic to get that example, so I've left that as
a TODO for now.
Differential Revision: https://reviews.llvm.org/D103280
extractelement is poison if the index is out-of-bounds, so just
scalarizing the load may introduce an out-of-bounds load, which is UB.
To avoid introducing new UB, we can mask the index so it only contains
valid indices.
Fixes PR50382.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D103077
DAGCombine's `mergeStoresOfConstantsOrVecElts` optimization is told
whether it's to use vector types and also whether it's to issue a
truncating store. However, the truncating store code path assumes a
scalar integer `ConstantSDNode`, and when using vector types it creates
either a `BUILD_VECTOR` or `CONCAT_VECTORS` to store: neither of which
is a constant.
The `riscv64` target is able to expose a crash here because it switches
on both code paths at the same time. The `f32` is stored as `i32` which
must be promoted to `i64`, necessitating a truncating store.
It also decides later that it prefers a vector store of `v2f32`.
While vector truncating stores are legal, this combine is not able to
emit them. We also don't have a test case. This patch adds an assert to
catch this case more gracefully, and updates one of the caller functions
to the function to turn off the use of truncating stores when preferring
vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103173
The select-of-constants transform was asserting that its constant vector
inputs did not implicitly truncate their input without that as an
explicit precondition to the function. This patch relaxes that assertion
into an early return to skip the optimization.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102393
Fixes a bug in the DAG combiner that eliminates the stores because it missed
to inspect the address space of the pointers.
%v = load %ptr_as1
// no chain side effect
store %v, %ptr_as2
As well as
store %v, %ptr_as1
store %v, %ptr_as2
Fixes a test for above in X86.
Differential Revision: https://reviews.llvm.org/D102096
Expanding a fixed length operation involves wrapping the operation in an
insert/extract subvector pair, as such, when this is done to bitcast we
end up with an extract_subvector of a bitcast. DAGCombine tries to
convert this into a bitcast of an extract_subvector which restores the
initial fixed length bitcast, causing an infinite loop of legalization.
As part of this patch, we must make sure the above DAGCombine does not
trigger after legalization if the created bitcast would not be legal.
Differential Revision: https://reviews.llvm.org/D101990
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
It is proper to relax non-negative limitation of step_vector.
Also this patch adds more combines for step_vector:
(sub X, step_vector(C)) -> (add X, step_vector(-C))
Differential Revision: https://reviews.llvm.org/D100812
This patch adds incrementally-better support for SPLAT_VECTOR in a
handful of vector combines by changing a few more
isBuildVectorAllOnes/isBuildVectorAllZeros to the equivalent
isConstantSplatVectorAllOnes/Zeros calls.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D100851
This patch changes ISD::isBuildVectorAllZeros to
ISD::isConstantSplatVectorAllZeros which handles zero sclar vector.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D100813
When trying to clamp a constant index into a scalable vector we can
test if the index is less than the minimum number of elements in the
vector. If so, we can simply return the index because we know it is
guaranteed to fit inside the vector.
Differential Revision: https://reviews.llvm.org/D100639
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98027
If the inner shuffle already contains undef elements, then accept them in the merged shuffle as well.
This helps some X86 HADD/SUB patterns where slow targets were ending up with HADD/SUB because the (un)merged shuffles were stuck either side of the ADD/SUB - meaning we ended up with a total cost much higher than the "2*shuffle+add" that a slow target usually expands a HADD/SUB to.
This patch adds a new isIntOrFPConstant helper function to check if a
SDValue is a integer of FP constant. This pattern is used in various
places.
There also are places that incorrectly just check for integer constants,
e.g. D99384, so hopefully this helper will help people avoid that issue.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D99428
Don't bother calling ComputeNumSignBits if N00Bits < ExtVTBits. No
matter what answer we get back this will be true:
(N00Bits - DAG.ComputeNumSignBits(N00, DemandedSrcElts)) < ExtVTBits)
So we might as well save the computation. This makes the code more
consistent with the similar (sext_in_reg (sext x)) handling above.
As commented by @craig.topper on rG1ba5c550d418, we can't guarantee that we'll be extending zero bits, just sign bit. So, revert to the old code for zero_extend_vector_inreg cases.
Followup to D96345, handle unary shuffles of binops (as well as binary shuffles) if we can merge the shuffle with inner operand shuffles.
Differential Revision: https://reviews.llvm.org/D98646
Extend this to support ComputeNumSignBits of the (used) source vector elements so that we can handle more than just the case where we're sext_in_reg from the source element signbit.
Noticed while investigating the poor codegen in D98587.
A 1-bit smulo overflows is both inputs are -1 since the result
should be +1 which can't be represented in a signed 1 bit value.
We can detect this with an AND and a setcc. The multiply result
can also use the same AND.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97634
This prepares codegen for a change that will remove the identical
folds from IR because they are not poison-safe. See
D93065 / D97360
for details.
We already generically support scalar types, and there are various
target-specific transforms that overlap the vector folds. For example,
x86 recognizes the and patterns, but not or. We can end up with 1
extra instruction there, but I think that is still preferred over the
blendv alternative that loads a constant vector.
If this is not optimal, then it should be fixed with a later transform
(this change is not expected to result in any regressions because
InstCombine currently does the same thing).
Removing custom code and supporting undefs in constant-pattern-matching
can be follow-up changes.
Differential Revision: https://reviews.llvm.org/D97730
Peeking through AND is only valid if the input to both shifts is
the same. If the inputs are different, then the original pattern
ORs the two values when the masked shift amount is 0. This is ok
if the values are the same since the OR would be a NOP which is
why its ok for rotate.
Fixes PR49365 and reverts PR34641
Differential Revision: https://reviews.llvm.org/D97637
Even if the first computeKnownBits call doesn't have any zero
bits it is possible the other operand has bitwidth-1 leading zero.
In that case overflow is still impossible. So always call computeKnownBits
for both operands.
Using ComputeNumSignBits or computeKnownBits we might be able
to determine that overflow is impossible.
This especially helps after type legalization if the type was
promoted from a type with half the bits or more. Type legalization
conservatively creates a promoted smulo/umulo and an overflow
check for the promoted bits. The overflow from the promoted
smulo/umulo is ORed with the result of the promoted bits
overflow check. Proving that the promoted smulo/umulo can never
overflow will leave us with just the promoted bits overflow check.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97160
This patch handles usubsat patterns hidden through zext/trunc and uses the getTruncatedUSUBSAT helper to determine if the USUBSAT can be correctly performed in the truncated form:
zext(x) >= y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))
zext(x) > y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))
Based on original examples:
void foo(unsigned short *p, int max, int n) {
int i;
unsigned m;
for (i = 0; i < n; i++) {
m = *--p;
*p = (unsigned short)(m >= max ? m-max : 0);
}
}
Differential Revision: https://reviews.llvm.org/D25987
If extload is legal, following transform
(zext (select c, load1, load2)) -> (select c, zextload1, zextload2)
can save one ext instruction.
Differential Revision: https://reviews.llvm.org/D95086
Adjust generateFMAsInMachineCombiner to return false if SVE is present
in order to combine fmul+fadd into fma. Also add new pseudo instructions
so as to select the most appropriate of FMLA/FMAD depending on register
allocation.
Depends on D96599
Differential Revision: https://reviews.llvm.org/D96424
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.
Differential Revision: https://reviews.llvm.org/D96345
This reverts commit 5dfba562dd.
That commit causes an assertion failure with the following repro:
typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
b k, l, m, n, o[6], p, q;
m = d[5];
b r = m;
b s = f(r, 8);
q = s;
l = d[1];
p = l;
t(q);
n = c(m, l);
o[1] = c(s, f(p, 8));
k = __builtin_shufflevector(n, o[1], 0, 2);
e = __builtin_ia32_psrlwi128(k, j);
}
./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Differential Revision: https://reviews.llvm.org/D96345
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.
This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.
We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.
The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.
Differential Revision: https://reviews.llvm.org/D96413
Avoid doing the following combine for vector types:
```
copysign(x, fp_extend(y)) -> copysign(x, y)
copysign(x, fp_round(y)) -> copysign(x, y)
```
That combine seemed to impede the selection of vector instruction and cause
a mess in some circumstances.
Differential Revision: https://reviews.llvm.org/D96037
As of commit 284f2bffc9, the DAG Combiner gets rid of the masking of the
input to this node if the mask only keeps the bottom 16 bits. This is because
the underlying library function does not use the high order bits. However, on
PowerPC's ELFv2 ABI, it is the caller that is responsible for clearing the bits
from the register. Therefore, the library implementation of __gnu_h2f_ieee will
return an incorrect result if the bits aren't cleared.
This combine is desired for ARM (and possibly other targets) so this patch adds
a query to Target Lowering to check if this zeroing needs to be kept.
Fixes: https://bugs.llvm.org/show_bug.cgi?id=49092
Differential revision: https://reviews.llvm.org/D96283
Make sure scalable property is preserved by using getVectorElementCount().
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D95967