As suggested in 02f8519502, this uses the
isAnyConstantBuildVector method in lieu of separate
isBuildVectorOfConstantSDNodes calls. It should
otherwise be an NFC.
This prevents an infinite loop from D123801, where code trying to reduce
the total number of bitcasts, but also handling constants, could create
the opposite transform. Prevent the transform in these cases to let the
bitcast of a constant transform naturally.
Fixes #55345
This is part of an ongoing effort toward making DAGCombine process the nodes in topological order.
This is able to discover a couple of new optimizations, but also causes a couple of regressions. I nevertheless chose to submit this patch for review to start the discussion with people working on the backend, so we can find a good way forward.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124743
Add helper functions to query the signed and scaled properties
of ISD::IndexType along with functions to change them.
Remove setIndexType from MaskedGatherSDNode because it only has
one usage and typically should only be changed alongside its
index operand.
Minimise the direct use of the enum values to lay the groundwork
for more refactoring.
Differential Revision: https://reviews.llvm.org/D123347
Something is going wrong with the big-endian PowerPC bot. It is hard to
tell what is wrong from here, but attempt to fix it by disabling the
combineShuffleOfBitcast combine for big-endian targets.
If the mask is made up of elements that form a mask in the higher type,
we can convert the shuffle(bitcast(x)) into a shuffle of the wider
bitcast type, simplifying the instruction sequence. A v4i32 2,3,0,1
shuffle, for example, can be treated as a 1,0 v2i64 shuffle. This helps
clean up some of the AArch64 concat load combines, along with helping
simplify a number of other tests.
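An illustrative IR sketch of the idea (little-endian, not an exact test from the patch):
define <4 x i32> @src(<2 x i64> %x) {
  %b = bitcast <2 x i64> %x to <4 x i32>
  %s = shufflevector <4 x i32> %b, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x i32> %s
}
=>
define <4 x i32> @tgt(<2 x i64> %x) {
  %s = shufflevector <2 x i64> %x, <2 x i64> undef, <2 x i32> <i32 1, i32 0>
  %b = bitcast <2 x i64> %s to <4 x i32>
  ret <4 x i32> %b
}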
The PowerPC combine for v16i8 splat vector loads needed some fixes to
keep it working for v16i8 vectors. This improves the handling of v2i64
shuffles to match too, hopefully improving them in general.
Differential Revision: https://reviews.llvm.org/D123801
This extends the (X & ~Y) | Y to X | Y fold to also work if ~Y is
a truncated not (when taking into account the mask X). This is
done by exporting the infrastructure added in D124856 and reusing
it here.
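For reference, the underlying identity looks like this in IR (minimal sketch; the patch extends the matching to a not that only covers the bits kept by the mask on X):
define i32 @src(i32 %x, i32 %y) {
  %not.y = xor i32 %y, -1
  %and = and i32 %x, %not.y
  %r = or i32 %and, %y
  ret i32 %r
}
=>
define i32 @tgt(i32 %x, i32 %y) {
  %r = or i32 %x, %y
  ret i32 %r
}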
I've retained the old value of AllowUndefs=false, though probably
this can be switched to true with extra test coverage.
Differential Revision: https://reviews.llvm.org/D124930
Don't assume the rotation amounts have been correctly normalized - do it as part of the constant folding.
Also, the normalization should be performed with UREM not SREM.
When looking for memory uses,
reassociationCanBreakAddressingModePattern should check uses of
the outer ADD rather than the inner ADD. We want to know if the
two ops we're reassociating are used by a load/store.
In practice, the existing check usually works because CodeGenPrepare
will make one of the load/stores have an offset of 0 relative to the
split GEP. That will make the inner add have a memory use.
To test this, I've manually split the GEPs so there is no 0 offset
store.
This issue was recently discussed in the original review D60294.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124644
We try to match these forms as a disguised rotate by constant:
(shl (X | Y), C1) | (srl X, C2) --> (rotl X, C1) | (shl Y, C1)
(shl X, C1) | (srl (X | Y), C2) --> (rotl X, C1) | (srl Y, C2)
We may have also looked through an AND to find the shift. If we
did, we need to apply a mask to the result.
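For illustration, the first form corresponds to IR like this (shift amounts chosen so that C1 + C2 equals the bit width):
define i32 @src(i32 %x, i32 %y) {
  %o = or i32 %x, %y
  %shl = shl i32 %o, 7
  %srl = lshr i32 %x, 25
  %r = or i32 %shl, %srl
  ret i32 %r
}
; matched as (rotl %x, 7) | (shl %y, 7)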
I'll add an AArch64 test and pre-commit it and the RISC-V test
tomorrow.
Fixes PR55201.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124711
When looking through extends of gather/scatter indices it's safe
to convert a known positive signed index to unsigned, but unsigned
indices must remain unsigned.
Depends On D123318
Differential Revision: https://reviews.llvm.org/D123326
refineUniformBase and selectGatherScatterAddrMode both attempt the
transformation:
base(0) + index(A+splat(B)) => base(B) + index(A)
However, this is only safe when index is not implicitly scaled.
Differential Revision: https://reviews.llvm.org/D123222
This is a very specific fold to fix an upstream poor codegen issue.
InstCombine has the much more flexible pushFreezeToPreventPoisonFromPropagating but I don't think we're quite there with DAG/TLI handling for canCreateUndefOrPoison/isGuaranteedNotToBeUndefOrPoison value tracking yet.
Fixes #54911
Differential Revision: https://reviews.llvm.org/D124185
We can process long shuffles (working across several actual
vector registers) in the best way if we take the actual register
representation into account. We can build a more correct representation
of register shuffles and improve the number of recognised buildvector
sequences.
Also, the same function can be used to improve the cost model for
shuffles in future patches.
Part of D100486
Differential Revision: https://reviews.llvm.org/D115653
1. Expanding X%C to the equivalent of X-X/C*C is not always the fastest path if there is no SDIV pair present, so first check whether the target has a faster lowering for SREM alone.
2. Add an AArch64 fast path for the SREM-only power-of-2 case.
Fixes https://github.com/llvm/llvm-project/issues/54649
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122968
hasOneUse is not cheap on nodes with chain results that might have
many uses. By checking the opcode first, we can avoid a costly walk
of the use list on nodes we aren't interested in.
Found by investigating calls to hasNUsesOfValue from the example
provided in D123857.
As far as I know getNode will never return a null SDValue.
I'm guessing this was modeled after the FoldConstantArithmetic
call earlier.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123550
According to the discussion in D122281, we are missing an ISD::AND combine for MLOAD
because it relies on BuildVectorSDNode, which fails for scalable vectors.
This patch is intended to handle that, so we can circle back to the MVT::nxv2i32 type.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122703
As raised on PR52267, XOR(X,MIN_SIGNED_VALUE) can be treated as ADD(X,MIN_SIGNED_VALUE), so let these cases use the 'AddLike' folds, similar to how we perform no-common-bits OR(X,Y) cases.
define i8 @src(i8 %x) {
%r = xor i8 %x, 128
ret i8 %r
}
=>
define i8 @tgt(i8 %x) {
%r = add i8 %x, 128
ret i8 %r
}
Transformation seems to be correct!
https://alive2.llvm.org/ce/z/qV46E2
Differential Revision: https://reviews.llvm.org/D122754
When shifting by a byte-multiple:
bswap (shl X, C) --> lshr (bswap X), C
bswap (lshr X, C) --> shl (bswap X), C
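For example, the first form with a one-byte shift corresponds to this IR (minimal sketch):
define i32 @src(i32 %x) {
  %shl = shl i32 %x, 8
  %b = call i32 @llvm.bswap.i32(i32 %shl)
  ret i32 %b
}
=>
define i32 @tgt(i32 %x) {
  %b = call i32 @llvm.bswap.i32(i32 %x)
  %r = lshr i32 %b, 8
  ret i32 %r
}
declare i32 @llvm.bswap.i32(i32)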
This is the backend version of D122010 and an alternative
suggested in D120648.
There's an extra check to make sure the shift amount is
valid that was not in the rough draft.
I'm not sure if there is a larger motivating case for RISCV (bug report?),
but the ARM diffs show a benefit from having a late version of the
transform (because we do not combine the loads in IR).
Differential Revision: https://reviews.llvm.org/D122655
Modified DAGCombiner to pass the bit-test input and the shift amount
to hasBitTest. This matches the other call to hasBitTest in TargetLowering.h.
This is an alternative to D122454.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122458
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported for
extensions from legal types.
Test cases for both normal and masked loads have been added to guard
against compile crashes.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
Add shl/srl/sra to the list of ops that we canonicalize with a select to expose an identity merge
Differential Revision: https://reviews.llvm.org/D122070
RISCV strongly prefers i32 values be sign extended to i64. This combine
was always zero extending the constant using APInt methods.
This adjusts the code so that it calls getNode using ISD::ANY_EXTEND instead.
getNode will call TLI.isSExtCheaperThanZExt to decide how to handle
the constant.
Tests were copied from D121598 where I noticed that we were creating
constants that were hard to materialize.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121650
We do not have general reassociation here (and probably
do not need it), but I noticed these were missing in
patches/tests motivated by D111530, so we can at
least handle the simplest patterns.
The VE test diff looks correct, but we miss that
pattern in IR currently:
https://alive2.llvm.org/ce/z/u66_PM
This is another fold generalized from D111530.
We can find a common source for a rotate operation hidden inside an 'or':
https://alive2.llvm.org/ce/z/9pV8hn
Deciding when this is profitable vs. a funnel-shift is tricky, but this
does not show any regressions: if a target has a rotate but it does not
have a funnel-shift, then try to form the rotate here. That is why we
don't have x86 test diffs for the scalar tests that are duplicated from
AArch64 ( 74a65e3834 ) - shld/shrd are available. That also makes it
difficult to show vector diffs - the only case where I found a diff was
on x86 AVX512 or XOP with i64 elements.
There's an additional check for a legal type to avoid a problem seen
with x86-32 where we form a 64-bit rotate but then it gets split
inefficiently. We might avoid that by adding more rotate folds, but
I didn't check to see what is missing on that path.
This gets most of the motivating patterns for AArch64 / ARM that are in
D111530.
We still need a couple of enhancements to setcc pattern matching with
rotate/funnel-shift to get the rest.
Differential Revision: https://reviews.llvm.org/D120933
When inserting undef into buildvectors created from shuffles of
buildvectors, we convert elements to the largest needed type. This had
the effect of converting undef into 0, which isn't needed as the
buildvector implicitly truncates and trunc(zext(undef)) == undef.
Differential Revision: https://reviews.llvm.org/D121002
This extends acb96ffd14 to 'and' and 'xor' opcodes.
Copying from that message:
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
When triggered during operation legalisation the affected combine
generates a splat_vector that when custom lowered for SVE fixed
length code generation, results in the original precombine sequence
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is required to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
If the types aren't legal, the expansions may get type legalized in a
different way preventing code sharing. If the type is legal, we will
share some instructions between the two expansions, but we will need an
extra register.
Since we don't appear to fold (neg (sub A, B)) if the sub has an
additional user, I think it makes sense not to expand NABS.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120513
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
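For the 'or' case, the reassociation corresponds to this IR-level identity (minimal sketch):
define i32 @src(i32 %x0, i32 %x1, i32 %y, i32 %z) {
  %sh0 = shl i32 %x0, %y
  %logic0 = or i32 %sh0, %z
  %sh1 = shl i32 %x1, %y
  %r = or i32 %logic0, %sh1
  ret i32 %r
}
=>
define i32 @tgt(i32 %x0, i32 %x1, i32 %y, i32 %z) {
  %logic = or i32 %x0, %x1
  %sh = shl i32 %logic, %y
  %r = or i32 %sh, %z
  ret i32 %r
}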
This is a partial implementation of a transform suggested in D111530
(only handles 'or' bitwise logic as a first step - need to stamp out more
tests for other opcodes).
Several of the same tests added for D111530 are altered here (but not
fully optimized). I'm not sure yet if this would help/hinder that patch,
but this should be an improvement for all tests added with ecf606cb43
since it removes a shift operation in those examples.
Differential Revision: https://reviews.llvm.org/D120516
If the shl is at least half the bitwidth (i.e. the lower half of the bswap source is zero), then we can reduce the shift and perform the bswap at half the bitwidth and just zero extend.
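A minimal IR sketch of the simplest case, where the shift is exactly half the bit width:
define i64 @src(i64 %x) {
  %shl = shl i64 %x, 32
  %b = call i64 @llvm.bswap.i64(i64 %shl)
  ret i64 %b
}
=>
define i64 @tgt(i64 %x) {
  %t = trunc i64 %x to i32
  %b = call i32 @llvm.bswap.i32(i32 %t)
  %z = zext i32 %b to i64
  ret i64 %z
}
declare i64 @llvm.bswap.i64(i64)
declare i32 @llvm.bswap.i32(i32)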
Based off PR51391 + PR53867
Differential Revision: https://reviews.llvm.org/D120192
This is the SDAG translation of D120253 :
https://alive2.llvm.org/ce/z/qHpmNn
The SDAG nodes can have different operand types than the result value.
We can see an example of that with AArch64 - the funnel shift amount
is an i64 rather than i32.
We may need to make that match even more flexible to handle
post-legalization nodes, but I have not stepped into that yet.
Differential Revision: https://reviews.llvm.org/D120264
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
In combineCarryDiamond() use getAsCarry() to find more candidates for being a carry flag.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118362
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP
There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).
The other diffs are presumed neutral.
Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).
This helps prevent regressions from a potential fold for
issue #53829.
The current ABD combine doesn't quite work for SVE because only a
single scalable vector per scalar integer type is legal (e.g. for
i32, <vscale x 4 x i32> is the only legal scalable vector type).
This patch extends the combine to also trigger for the cases when
operand extension must be retained.
Differential Revision: https://reviews.llvm.org/D115739
This moves the matching of AVGFloor and AVGCeil into a place where
demanded bits are available, so that it can detect more cases for more
folds. It changes the transform to start from a shift, not from a
truncate. We match the pattern shr(add(ext(A), ext(B)), 1), transforming
to ext(hadd(A, B)).
For signed values, because only the bottom bits are demanded, llvm will
transform the above to use a lshr too, as opposed to ashr. In order to
correctly detect the hadd we need to know the demanded bits to turn it
back. Depending on whether the shift is signed (ashr) or logical (lshr),
and whether the extensions are signed or unsigned, we can create different nodes.
If the shift is signed:
Needs >= 2 sign bits. https://alive2.llvm.org/ce/z/h4gQAW generating signed rhadd.
Needs >= 2 zero bits. https://alive2.llvm.org/ce/z/B64DUA generating unsigned rhadd.
If the shift is unsigned:
Needs >= 1 zero bits. https://alive2.llvm.org/ce/z/ByD8sj generating unsigned rhadd.
Needs 1 demanded bit zero and >= 2 sign bits https://alive2.llvm.org/ce/z/hvPGxX and
https://alive2.llvm.org/ce/z/32P5n1 generating signed rhadd.
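For example, an unsigned rounding halving add (unsigned rhadd) can be recognised from a pattern like this (illustrative sketch):
%xe = zext i8 %x to i32
%ye = zext i8 %y to i32
%a = add i32 %xe, %ye
%a1 = add i32 %a, 1
%s = lshr i32 %a1, 1
%t = trunc i32 %s to i8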
Differential Revision: https://reviews.llvm.org/D119072
This adds very basic combines for AVG nodes, mostly for constant folding
and handling degenerate (zero) cases. The code performs mostly the same
transforms as visitMULHS, adjusted for AVG nodes.
Constant folding extends to a higher bitwidth and drops the lowest bit.
For undef nodes, `avg undef, x` is transformed to x. There is also a
transform for `avgfloor x, 0` transforming to `shr x, 1`.
Differential Revision: https://reviews.llvm.org/D119559
This enables fshl to be matched earlier on X86
%6 = lshr i32 %3, 1
%7 = select i1 %4, i32 -2147483648, i32 0
%8 = or i32 %6, %7
X86 uses i8 for shift amounts. SelectionDAGBuilder creates the
ISD::SRL with an i8 shift type. DAGCombiner turns the select into
an ISD::SHL. Prior to this patch it would use i32 for the shift
amount. fshl matching failed because the shift amounts have different
types. LegalizeDAG fixes the ISD::SHL shift amount to i8. This
allowed fshl matching to succeed.
With this patch, the ISD::SHL will be created with an i8 shift
amount. This allows the fshl to match immediately.
No test case because we still end up with a fshl either way.
I have not found a way to expose a difference for this patch in a test
because it only triggers for a one-use load, but this is the code that
was adapted into D118376 and caused miscompiles. The new code pattern
is the same as what we do in narrowExtractedVectorLoad() (reduces load
width for a subvector extract).
This removes seemingly unnecessary manual worklist management and fixes
the chain updating via "SelectionDAG::makeEquivalentMemoryOrdering()".
Differential Revision: https://reviews.llvm.org/D119549
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for a AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate,
lshr==ashr, as the top bits are not demanded. An AVGCEIL also includes
extra rounding, so it includes an extra add of 1.
Differential Revision: https://reviews.llvm.org/D106237
We're hitting a pathological compile-time case, profiled to be in
DAGCombiner::visitTokenFactor and many inserts into a SmallPtrSet.
It looks like one of the paths around findBetterNeighborChains is not
capped and leads to this.
This patch resolves the issue. Looking for feedback if this solution
looks reasonable.
Differential Revision: https://reviews.llvm.org/D118877
The test diffs are identical to D119111.
This only affects x86 currently because no other target
has an override for the TLI hook that controls this transform.
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.
This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.
I've updated a number of cases that were either performing separate size/position calculations or had created their own local wrapper versions of these.
Differential Revision: https://reviews.llvm.org/D119019
rv64izbb has RORW/ROLW instructions that operate on the lower
32-bits of a 64-bit value and sign extend bit 31 of the result.
DAGCombiner won't match rotate idioms because the i32 type isn't Legal
on riscv64.
This patch teaches DAGCombiner to allow it if the type is going to
be promoted and the target has Custom type legalization for ISD::ROTL
or ISD::ROTR. I've restricted this to scalar types. It doesn't appear
any in tree targets other than riscv64 have custom type legalization
for rotates.
If this patch isn't acceptable, I guess I can match SRLW, SLLW, and OR
after type legalization, but I'd like to avoid that if possible.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119062
When the shift amount is known and a known sign bit analysis of
the shiftee indicates that no saturation will occur, then we can
replace SSHLSAT/USHLSAT by SHL.
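A minimal IR sketch where the sign-bit analysis proves no saturation can occur (the i16 source is just an example):
define i32 @src(i16 %x) {
  %e = sext i16 %x to i32
  %r = call i32 @llvm.sshl.sat.i32(i32 %e, i32 8)
  ret i32 %r
}
; %e has at least 17 sign bits, so shifting left by 8 cannot saturate
; and the sshl.sat can be replaced by shl i32 %e, 8
declare i32 @llvm.sshl.sat.i32(i32, i32)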
Differential Revision: https://reviews.llvm.org/D118765
In the aftermath of D116895 a problem was found in the analysis of
dependencies between store merge candidates in
checkMergeStoreCandidatesForDependencies, which is needed to avoid
introducing cycles in the DAG.
In the past it has been enough (or assumed to be enough) to start
scanning from non-chain operands when analysing the store merge
candidates for dependencies, assuming that the analysis of chain
dependencies performed when finding the candidates would cover
up for potential dependencies that exist involving the chain operands.
It was however discovered that one could end up with scenarios such
as described in the aarch64-checkMergeStoreCandidatesForDependencies.ll
test case, when the dependency between two stores is given by a mix
of chain operand dependencies and non-chain operand dependencies.
The fix in this patch makes sure that we also account for chain operand
dependencies when doing the more elaborate analysis in
checkMergeStoreCandidatesForDependencies, no longer relying on that
the earlier check involving chain operands is enough.
Differential Revision: https://reviews.llvm.org/D118943
Do "simplifyShift" and "FoldConstantArithmetic" folds for the SSHLSAT
and USHLSAT DAG nodes.
This includes folds such as:
(shlsat undef/poison, x) -> 0
(shlsat x, undef/poison) -> undef
(shlsat x, too_large_shamt) -> undef
(shlsat 0, x) -> 0
(shlsat x, 0) -> x
(shlsat c1, c2) -> c3
Differential Revision: https://reviews.llvm.org/D118603
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
Differential Revision: https://reviews.llvm.org/D117210
If we have a vector FP division with a splatted divisor, use
getVectorMinNumElements when scaling the num of uses by splat factor.
For AArch64 the combine kicks in for the <vscale x 4 x float> case since it's
above the fdiv threshold (3) when scaling num uses by splat factor, but the
codegen is worse (splat + vector fdiv + vector fmul) than the <vscale x 2 x
double> case (splat + vector fdiv).
If the combine could be converted into a scalar FP division by
scalarizeBinOpOfSplats it may be cheaper, but it looks like this is predicated
on the isExtractVecEltCheap TLI function which is implemented for x86 but not
AArch64. Perhaps for now combineRepeatedFPDivisors should only scale num uses
by splat if the division can be converted into scalar op.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118343
Currently the not (xor_one_use) pattern is always selected to S_XNOR irrespective of the node divergence.
This relies on a further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 (V_XOR_B32)
on those targets which have no V_XNOR.
This change enables patterns which explicitly select the not (xor_one_use) to the appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D116270
This is the unsigned variant of D111976, where we convert a clamped
fptoui to a fptoui.sat. Because we are unsigned, the condition this time
is only UMIN of UINT_MAX. Similarly to D111976 it handles ISD::UMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D114964
An sra is basically sign-extending a narrower value. Fold away the
shift by doing a sextload of a narrower value, when it is legal to
reduce the load width accordingly.
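An IR-level illustration of the equivalence (little-endian, assuming the narrower load is legal):
define i32 @src(ptr %p) {
  %v = load i32, ptr %p
  %r = ashr i32 %v, 16
  ret i32 %r
}
=>
define i32 @tgt(ptr %p) {
  %q = getelementptr i8, ptr %p, i64 2
  %n = load i16, ptr %q
  %r = sext i16 %n to i32
  ret i32 %r
}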
Differential Revision: https://reviews.llvm.org/D116930
If the bitreverse gets expanded, it will introduce a new bswap. By
putting a bswap before the bitreverse, we can ensure it gets cancelled
out when this happens.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118012
In code review for D117104 two slightly weird checks were found
in DAGCombiner::reduceLoadWidth. They were typically checking
if BitsA was a multiple of BitsB by looking at (BitsA & (BitsB - 1)),
but such a comparison actually only make sense if BitsB is a power
of two.
The checks were related to the code that attempted to shrink a load
based on the fact that the loaded value would be right shifted.
Afaict the legality of the value types is checked later (typically in
isLegalNarrowLdSt), so the existing checks were both overly
conservative and wrong whenever ExtVTBits wasn't a power of two. The
latter situation was triggered by a number of lit tests, so we could
not simply assert on ExtVTBits being a power of two.
When attempting to simply remove the checks I found some problems,
that seems to have been guarded by the checks (maybe just out of
luck). A typical example would be a pattern like this:
t1 = load i96* ptr
t2 = srl t1, 64
t3 = truncate t2 to i64
When DAGCombine is visiting the truncate reduceLoadWidth is called
attempting to narrow the load to 64 bits (ExtVT := MVT::i64). Then
the SRL is detected and we set ShAmt to 64.
In the past we've bailed out due to i96 not being a multiple of 64.
If we simply remove that check then we would end up replacing the
load with a new load that would read 64 bits but with a base pointer
adjusted by 64 bits. So we would read 32 bits that weren't accessed by
the original load.
This patch instead utilizes the fact that the zero bits shifted in by
the logical right shift can be modelled by using a zextload. Thus, the
pattern above will now be combined into
t3 = load i32* ptr+offset, zext to i64
Another case is shown in the X86/shift-folding.ll test case:
t1 = load i32* ptr
t2 = srl i32 t1, 8
t3 = truncate t2 to i16
In the past we bailed out due to the shift count (8) not being a
multiple of 16. Now the narrowing kicks in and we get
t3 = load i16* ptr+offset
Differential Revision: https://reviews.llvm.org/D117406
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
This extends the code in SearchForAndLoads to be able to look through
ANY_EXTEND nodes, which can be created from mismatching IR types where
the AND node we begin from only demands the low parts of the register.
That turns zext and sext into any_extends as only the low bits are
demanded. To be able to look through ANY_EXTEND nodes we need to handle
mismatching types in a few places, potentially truncating the mask to
the size of the final load.
Recommitted with a more conservative check for the type of the extend.
Differential Revision: https://reviews.llvm.org/D117457
This caused builds to fail with
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:5638:
bool (anonymous namespace)::DAGCombiner::BackwardsPropagateMask(llvm::SDNode *):
Assertion `NewLoad && "Shouldn't be masking the load if it can't be narrowed"' failed.
See the code review for a link to a reproducer.
> This extends the code in SearchForAndLoads to be able to look through
> ANY_EXTEND nodes, which can be created from mismatching IR types where
> the AND node we begin from only demands the low parts of the register.
> That turns zext and sext into any_extends as only the low bits are
> demanded. To be able to look through ANY_EXTEND nodes we need to handle
> mismatching types in a few places, potentially truncating the mask to
> the size of the final load.
>
> Differential Revision: https://reviews.llvm.org/D117457
This reverts commit 578008789f.
This extends the code in SearchForAndLoads to be able to look through
ANY_EXTEND nodes, which can be created from mismatching IR types where
the AND node we begin from only demands the low parts of the register.
That turns zext and sext into any_extends as only the low bits are
demanded. To be able to look through ANY_EXTEND nodes we need to handle
mismatching types in a few places, potentially truncating the mask to
the size of the final load.
Differential Revision: https://reviews.llvm.org/D117457
Update code comments in DAGCombiner::ReduceLoadWidth and refactor
the handling of SRL a bit. The refactoring is done with the intent
of adding support for folding away SRA by using SEXTLOAD in a
follow-up patch.
The function is also renamed as DAGCombiner::reduceLoadWidth.
Differential Revision: https://reviews.llvm.org/D117104
This commit fixes a missed opportunity in merging consecutive stores.
The code that searches for stores skipped the case of stores that
directly connect to the root. The comment above the implementation lists
this case but the code did not handle it. I found this pattern when
looking into the shared_ptr destructor. GCC generates the right
sequence. Here is a small repro:
int foo(int* buff) {
buff[0] = 0;
int x = buff[1];
buff[1] = 0;
return x;
}
Differential Revision: https://reviews.llvm.org/D116895
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
When the source has a series of assignments, users reasonably want to
have the debugger step through each one individually. Turn off the combine
for adjacent stores so we get this behavior at -O0.
Similar to D7181.
Reviewed By: spatel, xgupta
Differential Revision: https://reviews.llvm.org/D115808
When the source has a series of assignments, users reasonably want to
have the debugger step through each one individually. Turn off the combine
for adjacent stores so we get this behavior at -O0.
Similar to D7181.
Differential Revision: https://reviews.llvm.org/D115808
Merge the node combines into a common DAGCombiner::visitFMinMax (like we do for IMINMAX).
Move the constant folding into SelectionDAG::foldConstantFPMath.
This allows us to fold the vecreduce-propagate-sd-flags.ll test as it reduces constants - so I've refactored it to take variables instead.
Differential Revision: https://reviews.llvm.org/D115952
We were using a function attribute to indicate a non-standard FP mode,
but now we can use intrinsics for that job as shown in the new tests.
Presumably the x86 asm could be improved for that IR with intrinsics,
but I have not worked out exactly how to do that. Note that the
transform to FTRUNC still requires a hacky check for "nsz" (because
FMF are not applied to FP casts).
This is a cleanup based on the clang change in D115804 / 8c7f2a4f87 .
This is effectively a revert of 5a90285bd9 + D46237 .
Differential Revision: https://reviews.llvm.org/D115885
Replace custom constant scalar/splat folding with FoldConstantArithmetic call and canonicalize commutative constant ops to the RHS before the SimplifyVBinOp call
This extends the custom lowering for truncating stores on
fixed length vectors in SVE to support masked truncating stores.
It also adds a DAG combine for truncates followed by masked
stores.
Reviewed By: peterwaller-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108115
SimplifyVBinOp still has a FoldConstantArithmetic call, which now it isn't vector specific we should be able to remove (once fp binops are tidied up); but we can at least clean up the integer opcodes to perform the basic constant/undef handling in common code first.
After D113888 / 32b6c17b29 the MMO size of a masked load/store is
unknown. When we are converting back to a standard load/store because
the mask is known all ones, we can refine that to the correct size from
the size of the vector being loaded/stored.
Differential Revision: https://reviews.llvm.org/D114582
As an extension to D111976, this converts a fptosi that is clamped between
0 and (2^n)-1 into a fptoui.sat. This can greatly help on targets with
conversions that naturally saturate, such as Arm.
X86 disables the transform as some of the test cases increase in size.
A fptoui.sat necessitates a fp clamp without native support, so there is
little use in converting if the instruction is just going to be
expanded.
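A minimal IR sketch of the clamp pattern and its saturating form (the i8 saturation width is just an example):
define i32 @src(float %f) {
  %c = fptosi float %f to i32
  %lo = call i32 @llvm.smax.i32(i32 %c, i32 0)
  %hi = call i32 @llvm.smin.i32(i32 %lo, i32 255)
  ret i32 %hi
}
=>
define i32 @tgt(float %f) {
  %s = call i8 @llvm.fptoui.sat.i8.f32(float %f)
  %z = zext i8 %s to i32
  ret i32 %z
}
declare i32 @llvm.smax.i32(i32, i32)
declare i32 @llvm.smin.i32(i32, i32)
declare i8 @llvm.fptoui.sat.i8.f32(float)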
Differential Revision: https://reviews.llvm.org/D112428
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULUW if the source operands are SEXT/ZEXT instructions for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
Differential Revision: https://reviews.llvm.org/D113371
The REM DAG combine uses the visitDivLike functions to try and get an
optimized DIV node to provide better codegen, however in some cases this
visitDivLike call ends up in the BuildSDIVPow2 target hook, which in
turn sometimes will return the same node passed in to indicate not to
change it. The REM DAG combine does not anticipate this and creates a
cycle in the DAG because of it.
Fix this by ensuring any such optimized div node returned is distinct
from the node being combined.
Differential Revision: https://reviews.llvm.org/D114716
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
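A minimal IR sketch of the signed case (clamping to the i8 range purely as an example):
define i32 @src(float %f) {
  %c = fptosi float %f to i32
  %lo = call i32 @llvm.smax.i32(i32 %c, i32 -128)
  %hi = call i32 @llvm.smin.i32(i32 %lo, i32 127)
  ret i32 %hi
}
=>
define i32 @tgt(float %f) {
  %s = call i8 @llvm.fptosi.sat.i8.f32(float %f)
  %e = sext i8 %s to i32
  ret i32 %e
}
declare i32 @llvm.smax.i32(i32, i32)
declare i32 @llvm.smin.i32(i32, i32)
declare i8 @llvm.fptosi.sat.i8.f32(float)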
Differential Revision: https://reviews.llvm.org/D111976
This allows the generic DAG combine to fold fp_extend/fp_trunc into
loads/stores, which we can then lower into an integer extending
load / truncating store plus an FP_EXTEND/FP_ROUND.
The nuance here is that fixed-type FP_EXTEND/FP_ROUND require unpacked
types hence lowering them introduces an unpack/zip. By allowing these
nodes to be combined with loads/store we make it much easier to have
this unpack/zip combined into the load/store by our custom lowering.
Differential Revision: https://reviews.llvm.org/D114580
Patch to fix some of the regressions in D77804.
By folding to rotate/funnel-shift by constant amounts for illegal types, we prevent SimplifyDemandedBits from destroying the patterns prematurely, allowing us to use the rotate/funnel-shift legalization that was added in D112443.
Differential Revision: https://reviews.llvm.org/D113192
It's possible that the mask is already a NOT. At least if InstCombine
hasn't canonicalized the input. In that case we will form an ANDN with
X instead of with Y. So we don't need to worry about Y being a constant.
We might need to check that X isn't a constant instead, but we don't
have a test case for that yet.
This fixes a size regression found when trying to enable this combine
for RISCV in D113937.
Differential Revision: https://reviews.llvm.org/D113948
This was noted as a follow-up to D113212 / D113426:
4fc1fc4005
7e30404c3b
11522cfcad
https://alive2.llvm.org/ce/z/e4o96b
The canonicalization rules for these IR patterns are complicated,
and we were not matching the expected forms in 2 out of the 3
cases. We can make codegen more robust by matching the swapped
forms (and that will also work if these patterns are created late).
(Cond0 s> -1) ? N1 : 0 --> ~(Cond0 s>> BW-1) & N1
https://alive2.llvm.org/ce/z/mGCBrd
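An illustrative IR sketch of the fold for a vector case:
define <4 x i32> @src(<4 x i32> %x, <4 x i32> %y) {
  %c = icmp sgt <4 x i32> %x, <i32 -1, i32 -1, i32 -1, i32 -1>
  %r = select <4 x i1> %c, <4 x i32> %y, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}
=>
define <4 x i32> @tgt(<4 x i32> %x, <4 x i32> %y) {
  %sh = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
  %not = xor <4 x i32> %sh, <i32 -1, i32 -1, i32 -1, i32 -1>
  %r = and <4 x i32> %not, %y
  ret <4 x i32> %r
}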
This was suggested as a potential enhancement in D113212 (also 7e30404c3b ).
There's an improvement for AArch64 that could be generalized ( X > -1 --> X >= 0 ).
For x86, we have a counter-acting fold for most cases that turns the shift+not
back into a setcc, so that needs a work-around to get more cases to use "pandn":
D113603
Note that this pattern (and a previous one) are not currently canonical forms
in IR:
https://alive2.llvm.org/ce/z/e4o96b
Adding swapped variants is left as a TODO item here, but is planned as
a near-term follow-up patch.
Differential Revision: https://reviews.llvm.org/D113426
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage; it's not exhaustive, just the more obvious cases where the intention is clear.
Differential Revision: https://reviews.llvm.org/D113396
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.
This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.
Differential Revision: https://reviews.llvm.org/D113351
Currently FoldConstantArithmetic only handles binops, so replacing other uses of FoldConstantVectorArithmetic (in particular for SETCC nodes), still require more work.
This is the 'or' sibling for the fold added with:
D113212
https://alive2.llvm.org/ce/z/tgnp7K
Note that neither of these transforms is poison-safe,
but it does not seem to matter at this level. We have
had the scalar version of D113212 for a long time, so
this is just making optimizer behavior consistent.
We do not have the scalar version of *this* fold,
however, so that is another follow-up.
Another minor step towards merging FoldConstantVectorArithmetic into FoldConstantArithmetic.
We don't use SDNodeFlags in any constant folding inside DAG, so passing the Flags argument is a waste of time - an alternative would be to wire up FoldConstantArithmetic to take SDNodeFlags just-in-case we someday start using it, but we don't have any way to test it and I'd prefer to avoid dead code.
Differential Revision: https://reviews.llvm.org/D113276
(X s< 0) ? Y : 0 --> (X s>> BW-1) & Y
We canonicalize to the icmp+select form in IR, and we already have this fold
for scalar select in SDAG, so I think it's an oversight that we don't have
the fold for vectors. It seems neutral for AArch64 and saves some instructions
on x86.
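An illustrative IR sketch for a vector case:
define <4 x i32> @src(<4 x i32> %x, <4 x i32> %y) {
  %c = icmp slt <4 x i32> %x, zeroinitializer
  %r = select <4 x i1> %c, <4 x i32> %y, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}
=>
define <4 x i32> @tgt(<4 x i32> %x, <4 x i32> %y) {
  %sh = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
  %r = and <4 x i32> %sh, %y
  ret <4 x i32> %r
}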
Whether we should also have the sibling folds for the inverse condition or
all-ones true value may depend on target-specific factors such as whether
there's an "and-not" instruction.
Differential Revision: https://reviews.llvm.org/D113212
Fold (srl (mul (zext i32:$a to i64), i64:c), 32) -> (mulhu $a, $b),
if c can truncate to i32 without loss.
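A minimal IR sketch (the constant is just an example that fits in i32):
define i64 @src(i32 %a) {
  %z = zext i32 %a to i64
  %m = mul i64 %z, 123456
  %h = lshr i64 %m, 32
  ret i64 %h
}
; the result is the high half of a 32x32->64 unsigned multiply,
; so it can be selected as (mulhu %a, 123456) on targets with mulhu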
Reviewed By: frasercrmck, craig.topper, RKSimon
Differential Revision: https://reviews.llvm.org/D108129
Rely on the hasOperation() instead - as commented on D77804, the mid-term intention is to recognise rotate/funnel-by-constant pre-legalization to help avoid SimplifyDemandedBits regressions.
(i8 X ^ 128) & (i8 X s>> 7) --> usubsat X, 128
As suggested in D112085, we can substitute 'xor' with 'add'
in this pattern, and it is logically equivalent:
https://alive2.llvm.org/ce/z/eJtWWC
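A minimal scalar IR sketch of the 'add' form:
define i8 @src(i8 %x) {
  %a = add i8 %x, 128
  %sh = ashr i8 %x, 7
  %r = and i8 %a, %sh
  ret i8 %r
}
=>
define i8 @tgt(i8 %x) {
  %r = call i8 @llvm.usub.sat.i8(i8 %x, i8 128)
  ret i8 %r
}
declare i8 @llvm.usub.sat.i8(i8, i8)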
We canonicalize to 'xor' in IR, but SDAG does not do that
(and it probably should not - https://llvm.org/PR52267 ), so
it is possible to see either pattern in codegen. Note that
'sub' is another potential pattern, but that is
canonicalized to 'add' in DAGCombiner, so we don't need to
worry about that variation.
Differential Revision: https://reviews.llvm.org/D112377
EXTRACT_SUBVECTOR indices are always constant, we don't need to check for ConstantSDNode, we should just use getConstantOperandVal which will assert for the constant.
Instead of returning a bool to indicate success and a separate
SDValue, return the SDValue and have the callers check if it is
null.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112331
(i8 X ^ 128) & (i8 X s>> 7) --> usubsat X, 128
I haven't found a generalization of this identity:
https://alive2.llvm.org/ce/z/_sriEQ
Note: I was actually looking at the first form of the pattern in that link,
but that's part of a long chain of potential missed transforms in codegen
and IR....that I hope ends here!
The predicates for when this is profitable are a bit tricky. This version of
the patch excludes multi-use but includes custom lowering (as opposed to
legal only).
On x86 for example, we have custom lowering for some vector types, and that
uses umax and sub. So to enable that fold, we need to add use checks to avoid
regressions. Even with legal-only lowering, we could see code with extra
reg move instructions for extra uses, so that constraint would have to be
eased very carefully to avoid penalties.
Differential Revision: https://reviews.llvm.org/D112085
With unoptimized code, we may see lots of stores and spend too much time in mergeTruncStores.
Fixes PR51827.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D111596
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
This is NFC-intended for the callers. Posting in case there are
other potential users that I missed.
I would also use this from VectorCombine in a patch for:
https://llvm.org/PR52178 ( D111901 )
Differential Revision: https://reviews.llvm.org/D111891