Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported to extensions
from legal types.
Both of normal and masked loads test cases added to guard compile crash.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
Add shl/srl/sra to the list of ops that we canonicalize with a select to expose an identity merge
Differential Revision: https://reviews.llvm.org/D122070
RISCV strong prefers i32 values be sign extended to i64. This combine
was always zero extending the constant using APInt methods.
This adjusts the code so that it calls getNode using ISD::ANY_EXTEND instead.
getNode will call TLI.isSExtCheaperThanZExt to decide how to handle
the constant.
Tests were copied from D121598 where I noticed that we were creating
constants that were hard to materialize.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121650
We do not have general reassociation here (and probably
do not need it), but I noticed these were missing in
patches/tests motivated by D111530, so we can at
least handle the simplest patterns.
The VE test diff looks correct, but we miss that
pattern in IR currently:
https://alive2.llvm.org/ce/z/u66_PM
This is another fold generalized from D111530.
We can find a common source for a rotate operation hidden inside an 'or':
https://alive2.llvm.org/ce/z/9pV8hn
Deciding when this is profitable vs. a funnel-shift is tricky, but this
does not show any regressions: if a target has a rotate but it does not
have a funnel-shift, then try to form the rotate here. That is why we
don't have x86 test diffs for the scalar tests that are duplicated from
AArch64 ( 74a65e3834 ) - shld/shrd are available. That also makes it
difficult to show vector diffs - the only case where I found a diff was
on x86 AVX512 or XOP with i64 elements.
There's an additional check for a legal type to avoid a problem seen
with x86-32 where we form a 64-bit rotate but then it gets split
inefficiently. We might avoid that by adding more rotate folds, but
I didn't check to see what is missing on that path.
This gets most of the motivating patterns for AArch64 / ARM that are in
D111530.
We still need a couple of enhancements to setcc pattern matching with
rotate/funnel-shift to get the rest.
Differential Revision: https://reviews.llvm.org/D120933
When inserting undef into buildvectors created from shuffles of
buildvectors, we convert elements to the largest needed type. This had
the effect of converting undef into 0, which isn't needed as the
buildvector implicitly truncates and trunc(zext(undef)) == undef.
Differential Revision: https://reviews.llvm.org/D121002
This extends acb96ffd14 to 'and' and 'xor' opcodes.
Copying from that message:
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
When triggered during operation legalisation the affected combine
generates a splat_vector that when custom lowered for SVE fixed
length code generation, results in the original precombine sequence
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is requirted to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
If the types aren't legal, the expansions may get type legalized in a
different way preventing code sharing. If the type is legal, we will
share some instructions between the two expansions, but we will need an
extra register.
Since we don't appear to fold (neg (sub A, B)) if the sub has an
additional user, I think it makes sense not to expand NABS.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120513
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
This is a partial implementation of a transform suggested in D111530
(only handles 'or' bitwise logic as a first step - need to stamp out more
tests for other opcodes).
Several of the same tests added for D111530 are altered here (but not
fully optimized). I'm not sure yet if this would help/hinder that patch,
but this should be an improvement for all tests added with ecf606cb43
since it removes a shift operation in those examples.
Differential Revision: https://reviews.llvm.org/D120516
If the shl is at least half the bitwidth (i.e. the lower half of the bswap source is zero), then we can reduce the shift and perform the bswap at half the bitwidth and just zero extend.
Based off PR51391 + PR53867
Differential Revision: https://reviews.llvm.org/D120192
This is the SDAG translation of D120253 :
https://alive2.llvm.org/ce/z/qHpmNn
The SDAG nodes can have different operand types than the result value.
We can see an example of that with AArch64 - the funnel shift amount
is an i64 rather than i32.
We may need to make that match even more flexible to handle
post-legalization nodes, but I have not stepped into that yet.
Differential Revision: https://reviews.llvm.org/D120264
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
In combineCarryDiamond() use getAsCarry() to find more candidates for being a carry flag.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118362
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP
There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).
The other diffs are presumed neutral.
Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).
This helps prevent regressions from a potential fold for
issue #53829.
The current ABD combine doesn't quite work for SVE because only a
single scalable vector per scalar integer type is legal (e.g. for
i32, <vscale x 4 x i32> is the only legal scalable vector type).
This patch extends the combine to also trigger for the cases when
operand extension must be retained.
Differential Revision: https://reviews.llvm.org/D115739
This moves the matching of AVGFloor and AVGCeil into a place where
demand bit are available, so that it can detect more cases for more
folds. It changes the transform to start from a shift, not from a
truncate. We match the pattern shr(add(ext(A), ext(B)), 1), transforming
to ext(hadd(A, B)).
For signed values, because only the bottom bits are demanded llvm will
transform the above to use a lshr too, as opposed to ashr. In order to
correctly detect the hadd we need to know the demanded bits to turn it
back. Depending on whether the shift is signed (ashr) or logical (lshr),
and the extensions are signed or unsigned we can create different nodes.
If the shift is signed:
Needs >= 2 sign bits. https://alive2.llvm.org/ce/z/h4gQAW generating signed rhadd.
Needs >= 2 zero bits. https://alive2.llvm.org/ce/z/B64DUA generating unsigned rhadd.
If the shift is unsigned:
Needs >= 1 zero bits. https://alive2.llvm.org/ce/z/ByD8sj generating unsigned rhadd.
Needs 1 demanded bit zero and >= 2 sign bits https://alive2.llvm.org/ce/z/hvPGxX and
https://alive2.llvm.org/ce/z/32P5n1 generating signed rhadd.
Differential Revision: https://reviews.llvm.org/D119072
This adds very basic combines for AVG nodes, mostly for constant folding
and handling degenerate (zero) cases. The code performs mostly the same
transforms as visitMULHS, adjusted for AVG nodes.
Constant folding extends to a higher bitwidth and drops the lowest bit.
For undef nodes, `avg undef, x` is transformed to x. There is also a
transform for `avgfloor x, 0` transforming to `shr x, 1`.
Differential Revision: https://reviews.llvm.org/D119559
This enables fshl to be matched earlier on X86
%6 = lshr i32 %3, 1
%7 = select i1 %4, i32 -2147483648, i32 0
%8 = or i32 %6, %7
X86 uses i8 for shift amounts. SelectionDAGBuilder creates the
ISD::SRL with an i8 shift type. DAGCombiner turns the select into
an ISD::SHL. Prior to this patch it would use i32 for the shift
amount. fshl matching failed because the shift amounts have different
types. LegalizeDAG fixes the ISD::SHL shift amount to i8. This
allowed fshl matching to succeed.
With this patch, the ISD::SHL will be created with an i8 shift
amount. This allows the fshl to match immediately.
No test case beause we still end up with a fshl either way.
I have not found a way to expose a difference for this patch in a test
because it only triggers for a one-use load, but this is the code that
was adapted into D118376 and caused miscompiles. The new code pattern
is the same as what we do in narrowExtractedVectorLoad() (reduces load
width for a subvector extract).
This removes seemingly unnecessary manual worklist management and fixes
the chain updating via "SelectionDAG::makeEquivalentMemoryOrdering()".
Differential Revision: https://reviews.llvm.org/D119549
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for a AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate
lshr==ashr, as the top bits are not demanded. An AVGCEIL also includes
an extra rounding, so includes an extra add of 1.
Differential Revision: https://reviews.llvm.org/D106237
We're hitting a pathological compile-time case, profiled to be in
DagCombiner::visitTokenFactor and many inserts into a SmallPtrSet.
It looks like one of the paths around findBetterNeighborChains is not
capped and leads to this.
This patch resolves the issue. Looking for feedback if this solution
looks reasonable.
Differential Revision: https://reviews.llvm.org/D118877
The test diffs are identical to D119111.
This only affects x86 currently because no other target
has an override for the TLI hook that controls this transform.
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.
This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.
I've updated a number of cases that were either performing seperate size/position calculations or had created their own local wrapper versions of these.
Differential Revision: https://reviews.llvm.org/D119019
rv64izbb has a RORW/ROLW instructions that operate on the lower
32-bits of a 64-bit value and sign extend bit 31 of the result.
DAGCombiner won't match rotate idioms because the i32 type isn't Legal
on riscv64.
This patch teaches DAGCombiner to allow it if the type is going to
be promoted and the target has Custom type legalization for ISD::ROTL
or ISD::ROTR. I've restricted this to scalar types. It doesn't appear
any in tree targets other than riscv64 have custom type legalization
for rotates.
If this patch isn't acceptable, I guess I can match SRLW, SLLW, and OR
after type legalization, but I'd like to avoid that if possible.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119062
When the shift amount is known and a known sign bit analysis of
the shiftee indicates that no saturation will occur, then we can
replace SSHLSAT/USHLSAT by SHL.
Differential Revision: https://reviews.llvm.org/D118765
In the aftermath of D116895 a problem was found in the analysis of
dependencies between store merge candidates in
checkMergeStoreCandidatesForDependencies, that is needed to avoid
the cycles are introduced in the DAG.
In the past it has been enough (or assumed to be enough) to start
scanning from non-chain operands when analysing the store merge
candidates for dependencies, assuming that the analysis of chain
dependencies performed when finding the candidates would cover
up for potential dependencies that exist involving the chain operands.
It was however discovered that one could end up with scenarios such
as descibed in the aarch64-checkMergeStoreCandidatesForDependencies.ll
test case, when the dependency between two stores is given by a mix
of chain operand dependencies and non-chain operand dependencies.
The fix in this patch make sure that we also account for chain operand
dependencies when doing the more elaborate analysis in
checkMergeStoreCandidatesForDependencies, no longer relying on that
the earlier check involving chain operands is enough.
Differential Revision: https://reviews.llvm.org/D118943
Do "simplifyShift" and "FoldConstantArithmetic" folds for the SSHLSAT
and USHLSAT DAG nodes.
This includes folds such as:
(shlsat undef/poison, x) -> 0
(shlsat x, undef/poison) -> undef
(shlsat x, too_large_shamt) -> undef
(shlsat 0, x) -> 0
(shlsat x, 0) -> x
(shlsat c1, c2) -> c3
Differential Revision: https://reviews.llvm.org/D118603
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
Differential Revision: https://reviews.llvm.org/D117210
If we have a vector FP division with a splatted divisor, use
getVectorMinNumElements when scaling the num of uses by splat factor.
For AArch64 the combine kicks in for the <vscale x 4 x float> case since it's
above the fdiv threshold (3) when scaling num uses by splat factor, but the
codegen is worse (splat + vector fdiv + vector fmul) than the <vscale x 2 x
double> case (splat + vector fdiv).
If the combine could be converted into a scalar FP division by
scalarizeBinOpOfSplats it may be cheaper, but it looks like this is predicated
on the isExtractVecEltCheap TLI function which is implemented for x86 but not
AArch64. Perhaps for now combineRepeatedFPDivisors should only scale num uses
by splat if the division can be converted into scalar op.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118343
Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
This relies on further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
Current change enables the patterns which explicitly select the not (xor_one_use) to appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D116270
This is the unsigned variant of D111976, where we convert a clamped
fptoui to a fptoui.sat. Because we are unsigned, the condition this time
is only UMIN of UINT_MAX. Similarly to D111976 it handles ISD::UMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D114964
An sra is basically sign-extending a narrower value. Fold away the
shift by doing a sextload of a narrower value, when it is legal to
reduce the load width accordingly.
Differential Revision: https://reviews.llvm.org/D116930