We already have reassociation code for Adds and Ors separately in DAG
combiner, this adds it for the combination of the two where Ors act like
Adds. It reassociates (add (or x, c), y) -> (add (add x, y), c), where
we know that the Or's operands have no common bits set and the Or has
one use.
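A minimal IR-level illustration of the kind of pattern this targets (names and constants are illustrative, not taken from the patch):
```
define i32 @or_as_add_reassoc(i32 %a, i32 %y) {
  %x = shl i32 %a, 4     ; low 4 bits of %x are known zero
  %o = or i32 %x, 7      ; no bits in common with 7 and only one use, so it acts like an add
  %r = add i32 %o, %y    ; candidate: (add (or x, 7), y) -> (add (add x, y), 7)
  ret i32 %r
}
```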
Differential Revision: https://reviews.llvm.org/D104765
Reland of 31859f896.
This change implements new DAG nodes GLOBAL_GET/GLOBAL_SET, and
lowering methods for loads and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.
Differential Revision: https://reviews.llvm.org/D104797
This ports the AArch64 SABD and UABD over to DAG Combine, where they can be
used by more backends (notably MVE in a follow-up patch). The matching code
has changed very little, just to handle legal operations and types
differently. It selects from (ABS (SUB (EXTEND a), (EXTEND b))), producing
an abds/abdu which is zexted to the original type.
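A hedged IR sketch of the kind of source that lowers to the matched (ABS (SUB (EXTEND a), (EXTEND b))) DAG (types and names are illustrative):
```
define <8 x i16> @uabd_like_pattern(<8 x i8> %a, <8 x i8> %b) {
  %ea = zext <8 x i8> %a to <8 x i16>
  %eb = zext <8 x i8> %b to <8 x i16>
  %d  = sub <8 x i16> %ea, %eb
  %r  = call <8 x i16> @llvm.abs.v8i16(<8 x i16> %d, i1 true)  ; abs(sub(zext, zext))
  ret <8 x i16> %r
}
declare <8 x i16> @llvm.abs.v8i16(<8 x i16>, i1)
```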
Differential Revision: https://reviews.llvm.org/D91937
This adds a fold of sub(0, splat(sub(0, x))) -> splat(x). This can
come up in the lowering of right shifts under AArch64, where we generate
a shift left of a negated number.
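A small IR example of the pattern, assuming a fixed-width v4i32 splat (illustrative only):
```
define <4 x i32> @neg_of_splatted_neg(i32 %x) {
  %neg   = sub i32 0, %x
  %ins   = insertelement <4 x i32> poison, i32 %neg, i64 0
  %splat = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  %r     = sub <4 x i32> zeroinitializer, %splat   ; folds to splat(%x)
  ret <4 x i32> %r
}
```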
Differential Revision: https://reviews.llvm.org/D103755
This is from the discussion in https://reviews.llvm.org/D104247#inline-993387
The contract and reassoc flags shouldn't imply each other.
All of the aggressive fsub fusion reassociation operations
should be guarded with a reassoc flag check.
Reviewed By: mcberg2017
Differential Revision: https://reviews.llvm.org/D104723
According to IR LangRef, the FMF flag:
contract
Allow floating-point contraction (e.g. fusing a multiply followed by an
addition into a fused multiply-and-add).
reassoc
Allow reassociation transformations for floating-point instructions.
This may dramatically change results in floating-point.
My understanding is that these two flags shouldn't imply each other,
as we might have an SDNode that can be reassociated with others, but
not contracted.
e.g.: We may want the following fmul/fadd/fsub to freely reassociate, but don't
want an fma to be generated here.
%F = fmul reassoc double %A, %B ; <double> [#uses=1]
%G = fmul reassoc double %C, %D ; <double> [#uses=1]
%H = fadd reassoc double %F, %G ; <double> [#uses=1]
%I = fsub reassoc double %H, %E ; <double> [#uses=1]
Before https://reviews.llvm.org/D45710, the `reassoc` flag actually
did not imply isContractable either.
The current implementation also only checks the flag on the fadd node,
ignoring the fmul node; this patch updates that as well.
Reviewed By: spatel, qiucf
Differential Revision: https://reviews.llvm.org/D104247
6e5628354e regressed the Windows build as
the return type no longer matched in both branches for the return value
type deduction. This uses a bit more compiler magic to deal with that.
The sorting, obviously, must be stable, else we will have random assembly fluctuations.
Apparently there was no test coverage that would benefit from that,
so I've added one test.
The sorting consists of two parts - just sort the input vectors,
and recompute the shuffle mask -> input vector mapping.
I don't believe we need to do anything else.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D104187
When reducing vector builds to shuffles it is possible that
the DAG combiner may try to extract invalid subvectors.
This happens as the existing code assumes vectors will be power
of 2 sizes, which is already untrue, but becomes more noticeable
with v6 and v7 types.
Specifically the existing code assumes that half PowerOf2Ceil of
a given vector index will fit twice into a given vector.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D103880
This change implements new DAG nodes GLOBAL_GET/GLOBAL_SET, and
lowering methods for loads and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D95425
As shown in:
https://llvm.org/PR50623
...and the similar tests here, we were not accounting for
store merging of different sizes that do not cover the
entire range of the wide value to be stored.
This is the easy fix: just make sure that all of the
original stores are the same size, so when we calculate
the wide width, it's a simple N * M check.
This still allows all of the motivating optimizations from:
D86420 / 54a5dd485c
D87112 / 7a06b166b1
We could enhance this code to track individual bytes and
allow merging multiple sizes.
shuffle(concat(x,undef),concat(y,undef)) -> concat(shuffle(x,y),shuffle(x,y))
If the original shuffle references any of the upper (undef) subvector elements, ensure the split shuffle masks uses undef instead of an out-of-bounds value.
Fixes PR50609
This extends 434c8e013a and ede3982792 to handle signed
predicates by sign-extending the setcc operands.
This is not shown directly in https://llvm.org/PR50055 ,
but the pattern is visible by changing the unsigned convert
to signed in the source code.
This is a follow-up to D103280 that eases the use restrictions,
so we can handle the motivating case from:
https://llvm.org/PR50055
The loop code is adapted from similar use checks in
ExtendUsesToFormExtLoad() and SliceUpLoad(). I did not see an
easier way to filter out non-chain uses of load values.
Differential Revision: https://reviews.llvm.org/D103462
I accidentally pushed a draft of D103280 that was discussed
during the review, but it was not supposed to be the final
version.
Rather than revert and recommit, I'm updating the existing
code. This way we have a record of the codegen diff that
would result if we decide to remove this predicate in the
future.
sext (vsetcc X, Y) --> vsetcc (zext X), (zext Y) --
(when the zexts are free and a bunch of other conditions)
We have a couple of similar folds to this already for vector selects,
but this pattern slips through because it is only a setcc.
The tests are based on the motivating case from:
https://llvm.org/PR50055
...but we need extra logic to get that example, so I've left that as
a TODO for now.
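A minimal IR example of the kind of pattern that now gets the fold (types are illustrative; the zexts must be free for the target):
```
define <4 x i32> @sext_of_vector_setcc(<4 x i16> %x, <4 x i16> %y) {
  %c = icmp ult <4 x i16> %x, %y
  %s = sext <4 x i1> %c to <4 x i32>   ; --> setcc on zero-extended operands
  ret <4 x i32> %s
}
```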
Differential Revision: https://reviews.llvm.org/D103280
extractelement is poison if the index is out-of-bounds, so just
scalarizing the load may introduce an out-of-bounds load, which is UB.
To avoid introducing new UB, we can mask the index so it only contains
valid indices.
Fixes PR50382.
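A hedged IR sketch of the situation (names are illustrative): scalarizing the load below to a single element load at %idx would be UB when %idx is out of bounds, so the index gets masked first.
```
define i32 @extract_with_variable_index(ptr %p, i32 %idx) {
  %v = load <4 x i32>, ptr %p
  %e = extractelement <4 x i32> %v, i32 %idx   ; poison (not UB) if %idx >= 4
  ret i32 %e
}
```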
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D103077
DAGCombine's `mergeStoresOfConstantsOrVecElts` optimization is told
whether it's to use vector types and also whether it's to issue a
truncating store. However, the truncating store code path assumes a
scalar integer `ConstantSDNode`, and when using vector types it creates
either a `BUILD_VECTOR` or `CONCAT_VECTORS` to store: neither of which
is a constant.
The `riscv64` target is able to expose a crash here because it switches
on both code paths at the same time. The `f32` is stored as `i32` which
must be promoted to `i64`, necessitating a truncating store.
It also decides later that it prefers a vector store of `v2f32`.
While vector truncating stores are legal, this combine is not able to
emit them. We also don't have a test case. This patch adds an assert to
catch this case more gracefully, and updates one of the function's
callers to turn off the use of truncating stores when preferring
vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103173
The select-of-constants transform was asserting that its constant vector
inputs did not implicitly truncate their input without that as an
explicit precondition to the function. This patch relaxes that assertion
into an early return to skip the optimization.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102393
Fixes a bug in the DAG combiner that eliminated the stores because it
failed to inspect the address space of the pointers.
%v = load %ptr_as1
// no chain side effect
store %v, %ptr_as2
As well as
store %v, %ptr_as1
store %v, %ptr_as2
Fixes a test for the above in X86.
Differential Revision: https://reviews.llvm.org/D102096
Expanding a fixed length operation involves wrapping the operation in an
insert/extract subvector pair, as such, when this is done to bitcast we
end up with an extract_subvector of a bitcast. DAGCombine tries to
convert this into a bitcast of an extract_subvector which restores the
initial fixed length bitcast, causing an infinite loop of legalization.
As part of this patch, we must make sure the above DAGCombine does not
trigger after legalization if the created bitcast would not be legal.
Differential Revision: https://reviews.llvm.org/D101990
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited this to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
It is proper to relax the non-negative limitation of step_vector.
Also this patch adds more combines for step_vector:
(sub X, step_vector(C)) -> (add X, step_vector(-C))
Differential Revision: https://reviews.llvm.org/D100812
This patch adds incrementally-better support for SPLAT_VECTOR in a
handful of vector combines by changing a few more
isBuildVectorAllOnes/isBuildVectorAllZeros to the equivalent
isConstantSplatVectorAllOnes/Zeros calls.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D100851
This patch changes ISD::isBuildVectorAllZeros to
ISD::isConstantSplatVectorAllZeros, which handles zero scalar splat vectors.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D100813
When trying to clamp a constant index into a scalable vector we can
test if the index is less than the minimum number of elements in the
vector. If so, we can simply return the index because we know it is
guaranteed to fit inside the vector.
Differential Revision: https://reviews.llvm.org/D100639
The main reason is preparation for transforming AliasResult into a class
that contains an offset for the PartialAlias case.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98027
If the inner shuffle already contains undef elements, then accept them in the merged shuffle as well.
This helps some X86 HADD/SUB patterns where slow targets were ending up with HADD/SUB because the (un)merged shuffles were stuck either side of the ADD/SUB - meaning we ended up with a total cost much higher than the "2*shuffle+add" that a slow target usually expands a HADD/SUB to.
This patch adds a new isIntOrFPConstant helper function to check if a
SDValue is an integer or FP constant. This pattern is used in various
places.
There also are places that incorrectly just check for integer constants,
e.g. D99384, so hopefully this helper will help people avoid that issue.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D99428
Don't bother calling ComputeNumSignBits if N00Bits < ExtVTBits. No
matter what answer we get back this will be true:
(N00Bits - DAG.ComputeNumSignBits(N00, DemandedSrcElts)) < ExtVTBits
So we might as well save the computation. This makes the code more
consistent with the similar (sext_in_reg (sext x)) handling above.
As commented by @craig.topper on rG1ba5c550d418, we can't guarantee that we'll be extending zero bits, just sign bit. So, revert to the old code for zero_extend_vector_inreg cases.
Followup to D96345, handle unary shuffles of binops (as well as binary shuffles) if we can merge the shuffle with inner operand shuffles.
Differential Revision: https://reviews.llvm.org/D98646
Extend this to support ComputeNumSignBits of the (used) source vector elements so that we can handle more than just the case where we're sext_in_reg from the source element signbit.
Noticed while investigating the poor codegen in D98587.
A 1-bit smulo overflows if both inputs are -1, since the result
should be +1, which can't be represented in a signed 1-bit value.
We can detect this with an AND and a setcc. The multiply result
can also use the same AND.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97634
This prepares codegen for a change that will remove the identical
folds from IR because they are not poison-safe. See
D93065 / D97360
for details.
We already generically support scalar types, and there are various
target-specific transforms that overlap the vector folds. For example,
x86 recognizes the and patterns, but not or. We can end up with 1
extra instruction there, but I think that is still preferred over the
blendv alternative that loads a constant vector.
If this is not optimal, then it should be fixed with a later transform
(this change is not expected to result in any regressions because
InstCombine currently does the same thing).
Removing custom code and supporting undefs in constant-pattern-matching
can be follow-up changes.
Differential Revision: https://reviews.llvm.org/D97730
Peeking through AND is only valid if the input to both shifts is
the same. If the inputs are different, then the original pattern
ORs the two values when the masked shift amount is 0. This is ok
if the values are the same since the OR would be a NOP which is
why it's OK for rotate.
Fixes PR49365 and reverts PR34641
Differential Revision: https://reviews.llvm.org/D97637
Even if the first computeKnownBits call doesn't have any zero
bits it is possible the other operand has bitwidth-1 leading zero.
In that case overflow is still impossible. So always call computeKnownBits
for both operands.
Using ComputeNumSignBits or computeKnownBits we might be able
to determine that overflow is impossible.
This especially helps after type legalization if the type was
promoted from a type with half the bits or more. Type legalization
conservatively creates a promoted smulo/umulo and an overflow
check for the promoted bits. The overflow from the promoted
smulo/umulo is ORed with the result of the promoted bits
overflow check. Proving that the promoted smulo/umulo can never
overflow will leave us with just the promoted bits overflow check.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97160
This patch handles usubsat patterns hidden through zext/trunc and uses the getTruncatedUSUBSAT helper to determine if the USUBSAT can be correctly performed in the truncated form:
zext(x) >= y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))
zext(x) > y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))
Based on original examples:
void foo(unsigned short *p, int max, int n) {
  int i;
  unsigned m;
  for (i = 0; i < n; i++) {
    m = *--p;
    *p = (unsigned short)(m >= max ? m - max : 0);
  }
}
Differential Revision: https://reviews.llvm.org/D25987
If extload is legal, the following transform
(zext (select c, load1, load2)) -> (select c, zextload1, zextload2)
can save one ext instruction.
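A minimal IR example of the source pattern (illustrative types):
```
define i32 @zext_of_selected_loads(i1 %c, ptr %p, ptr %q) {
  %a = load i16, ptr %p
  %b = load i16, ptr %q
  %s = select i1 %c, i16 %a, i16 %b
  %z = zext i16 %s to i32   ; -> select %c, (zextload %p), (zextload %q)
  ret i32 %z
}
```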
Differential Revision: https://reviews.llvm.org/D95086
Adjust generateFMAsInMachineCombiner to return false if SVE is present
in order to combine fmul+fadd into fma. Also add new pseudo instructions
so as to select the most appropriate of FMLA/FMAD depending on register
allocation.
Depends on D96599
Differential Revision: https://reviews.llvm.org/D96424
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.
Differential Revision: https://reviews.llvm.org/D96345
This reverts commit 5dfba562dd.
That commit causes an assertion failure with the following repro:
typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
  b k, l, m, n, o[6], p, q;
  m = d[5];
  b r = m;
  b s = f(r, 8);
  q = s;
  l = d[1];
  p = l;
  t(q);
  n = c(m, l);
  o[1] = c(s, f(p, 8));
  k = __builtin_shufflevector(n, o[1], 0, 2);
  e = __builtin_ia32_psrlwi128(k, j);
}
./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Differential Revision: https://reviews.llvm.org/D96345
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.
This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.
We can handle the trunc(sub(..)) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.
The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.
Differential Revision: https://reviews.llvm.org/D96413
Avoid doing the following combine for vector types:
```
copysign(x, fp_extend(y)) -> copysign(x, y)
copysign(x, fp_round(y)) -> copysign(x, y)
```
That combine seemed to impede the selection of vector instructions and cause
a mess in some circumstances.
Differential Revision: https://reviews.llvm.org/D96037
As of commit 284f2bffc9, the DAG Combiner gets rid of the masking of the
input to this node if the mask only keeps the bottom 16 bits. This is because
the underlying library function does not use the high order bits. However, on
PowerPC's ELFv2 ABI, it is the caller that is responsible for clearing the bits
from the register. Therefore, the library implementation of __gnu_h2f_ieee will
return an incorrect result if the bits aren't cleared.
This combine is desired for ARM (and possibly other targets) so this patch adds
a query to Target Lowering to check if this zeroing needs to be kept.
Fixes: https://bugs.llvm.org/show_bug.cgi?id=49092
Differential revision: https://reviews.llvm.org/D96283
Make sure scalable property is preserved by using getVectorElementCount().
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D95967
If sext_inreg is supported, we will turn this into sext_inreg. That
will then remove it if there are enough sign bits. But if sext_inreg
isn't supported, we can still remove the shift pair based on sign
bits, too.
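A small IR example of the kind of shift pair this refers to (constants are illustrative):
```
define i32 @redundant_shift_pair(i32 %x) {
  %s = ashr i32 %x, 24   ; at least 25 sign bits
  %l = shl i32 %s, 16
  %r = ashr i32 %l, 16   ; the shl/ashr pair is redundant given the sign bits
  ret i32 %r
}
```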
Split from D95890.
This patch adds support for scalable-vector splats in DAGCombiner's
`isConstantOrConstantVector` and `ISD::matchUnaryPredicate` functions,
which enable the SelectionDAG div/rem-by-constant optimizations for
scalable vector types.
It also fixes up one case where the UDIV optimization was generating a
SETCC without first consulting the target for its preferred SETCC result
type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D94501
For now, we only correct the result for sqrt if the iteration count is > 0.
This doesn't make sense, as the two are not strictly related.
Reviewed By: dmgreen, spatel, RKSimon
Differential Revision: https://reviews.llvm.org/D94480
Add DemandedElts support inside the TRUNCATE analysis.
REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d
Differential Revision: https://reviews.llvm.org/D56387
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.
> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387
This reverts commit cad4275d69.
Use the KnownBits icmp comparisons to determine when an ISD::UMIN/UMAX op is unnecessary, should either op be known to be ULT/ULE or UGT/UGE relative to the other.
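A minimal IR illustration (constants are made up): the umin below is unnecessary because the first operand is known to be no larger than 15.
```
define i32 @umin_is_unnecessary(i32 %x) {
  %a = and i32 %x, 15                            ; KnownBits: %a <= 15
  %m = call i32 @llvm.umin.i32(i32 %a, i32 255)  ; always %a
  ret i32 %m
}
declare i32 @llvm.umin.i32(i32, i32)
```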
Differential Revision: https://reviews.llvm.org/D94532
MergeInnerShuffle currently attempts to merge shuffle(shuffle(x,y),z) patterns into a single shuffle, using 1 or 2 of the x,y,z ops.
However if we already match 2 ops we might be able to handle the third op if its also a shuffle that references one of the previous ops, allowing us to handle some cases like:
shuffle(shuffle(x,y),shuffle(x,y))
shuffle(shuffle(shuffle(x,z),y),z)
shuffle(shuffle(x,shuffle(x,y)),z)
etc.
This isn't an exhaustive match and is dependent on the order the candidate ops are encountered - if one of the matched ops was a shuffle that was peek-able we don't go back and try to split that, I haven't found much need for that amount of analysis yet.
This is a preliminary patch that will allow us to later improve x86 HADD/HSUB matching - but needs to be reviewed separately as its in generic code and affects existing Thumb2 tests.
Differential Revision: https://reviews.llvm.org/D94671
I'm hoping to reuse MergeInnerShuffle in some other folds - so ensure the candidate ops/mask are reset at the start of each run.
Also, move the second op matching before bailing to make it simpler to try to match other things afterward.
This patch resolves the suboptimal codegen described in http://llvm.org/pr47873 .
When CodeGenPrepare lowers select into a conditional branch, a freeze instruction is inserted.
It is then translated to `BRCOND(FREEZE(SETCC))` in SelDag.
The `FREEZE` in the middle of `SETCC` and `BRCOND` was causing a suboptimal code generation however.
This patch adds `BRCOND(FREEZE(cond))` -> `BRCOND(cond)` fold to DAGCombiner to remove the `FREEZE`.
To make this optimization sound, `BRCOND(UNDEF)` simply should nondeterministically jump to the branch or not, rather than raising UB.
It wasn't clear what happens when the condition was undef according to the comments in ISDOpcodes.h, however.
I updated the comments of `BRCOND` to make it explicit (as well as `BR_CC`, which is also a conditional branch instruction).
Note that it diverges from the semantics of `br` instruction in IR, which is explicitly UB.
Since the UB semantics was necessary to explain optimizations that use branching conditions, and SelDag doesn't seem to have such optimization, I think this divergence is okay.
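A hedged IR sketch of the shape produced by CodeGenPrepare (names are illustrative), which becomes BRCOND(FREEZE(SETCC)) in SelDag:
```
define i32 @branch_on_frozen_cond(i32 %x, i32 %a, i32 %b) {
entry:
  %c = icmp eq i32 %x, 0
  %f = freeze i1 %c
  br i1 %f, label %then, label %else
then:
  ret i32 %a
else:
  ret i32 %b
}
```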
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D92015
We are checking the unsafe-fp-math for sqrt but not for fpow, which is inconsistent.
As the direction is to remove this global option, we need to remove the unsafe-fp-math
check for sqrt and update the test with afn fast-math flags.
Reviewed By: Spatel
Differential Revision: https://reviews.llvm.org/D93891
This looks to have been done to save some duplicated code under
two different if statements, but it ends up being harmful to D94073.
This speculative constant creation can occur for a scalable vector type
with i64 element size when i64 scalars aren't legal. The code tries
and fails to find a vector type with i32 elements that it can use.
So only create the node when we know it will be used.
This patch disables the FSUB(-0,X)->FNEG(X) DAG combine when we're flushing subnormals. It requires updating the existing AMDGPU tests to use the fneg IR instruction, in place of the old fsub(-0,X) canonical form, since AMDGPU is the only backend currently checking the DenormalMode flags.
Note that this will require follow-up optimizations to make sure the FSUB(-0,X) form is handled appropriately
Differential Revision: https://reviews.llvm.org/D93243
Fixes a bug introduced by D91589.
When folding `(sext (not i1 x)) -> (add (zext i1 x), -1)`, we try to replace the not first when possible. If we replace the not in-visit, then the now invalidated node will be returned, and subsequently we will return an invalid sext. In cases where the not is replaced in-visit we can simply return SDValue, as the not in the current sext should have already been replaced.
Thanks @jgorbe, for finding the below reproducer.
The following reduced test case crashes clang when built with `clang -O1 -frounding-math`:
```
template <class> class a {
int b() { return c == 0.0 ? 0 : -1; }
int c;
};
template class a<long>;
```
A debug build of clang produces this "assertion failed" error:
```
clang: /home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:264: void {anonymous}::DAGCombiner::AddToWorklist(llvm::
SDNode*): Assertion `N->getOpcode() != ISD::DELETED_NODE && "Deleted Node added to Worklist"' failed.
```
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93274
Clean up a TODO, to support folding a shift of a constant by a
select of constants, on targets with different shift operand sizes.
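A minimal IR example of the pattern (constants are illustrative):
```
define i32 @shl_constant_by_select(i1 %c) {
  %amt = select i1 %c, i32 2, i32 3
  %r = shl i32 16, %amt   ; can fold to: select %c, i32 64, i32 128
  ret i32 %r
}
```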
Reviewed By: RKSimon, lebedev.ri
Differential Revision: https://reviews.llvm.org/D90349
This patch adds the following DAGCombines, which apply if isVectorLoadExtDesirable() returns true:
- fold (and (masked_gather x)) -> (zext_masked_gather x)
- fold (sext_inreg (masked_gather x)) -> (sext_masked_gather x)
LowerMGATHER has also been updated to fetch the LoadExtType associated with the
gather and also use this value to determine the correct masked gather opcode to use.
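A hedged, fixed-length IR sketch of a source that can end up as one of these patterns (the types, names, and opaque-pointer intrinsic mangling shown are illustrative assumptions):
```
define <4 x i32> @gather_then_extend(<4 x ptr> %ptrs, <4 x i1> %mask) {
  %g = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> %ptrs, i32 2, <4 x i1> %mask, <4 x i16> zeroinitializer)
  %z = zext <4 x i16> %g to <4 x i32>   ; candidate for zext_masked_gather
  ret <4 x i32> %z
}
declare <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr>, i32, <4 x i1>, <4 x i16>)
```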
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D92230
Adds the ExtensionType flag, which reflects the LoadExtType of a MaskedGatherSDNode.
Also updated SelectionDAGDumper::print_details so that details of the gather
load (is signed, is scaled & extension type) are printed.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D91084
The LLVM intrinsics llvm.maxnum and llvm.minnum are overloaded intrinsics that can be used on any
floating-point or vector of floating-point type.
This patch extends the current infrastructure to support scalable vector types.
This patch also fixes a warning about incorrect use of EVT::getVectorNumElements()
for scalable types when DAGCombiner tries to split a scalable vector.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D92607
The refineIndexType & refineUniformBase functions added by D90942 can also be used to
improve CodeGen of masked gathers.
These changes were split out from D91092
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D92319
In the previous code, when refineIndexType(...) is called and Index is undef, Index.getOperand(0) will trigger an assertion failure.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D92548
Move fold of (sext (not i1 x)) -> (add (zext i1 x), -1) from X86 to DAGCombiner to improve codegen on other targets.
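A minimal IR example of the fold (for both values of %x the two forms agree):
```
define i32 @sext_of_not_bool(i1 %x) {
  %n = xor i1 %x, true
  %s = sext i1 %n to i32   ; -> add (zext i1 %x), -1
  ret i32 %s
}
```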
Differential Revision: https://reviews.llvm.org/D91589
1. Removed #include "...AliasAnalysis.h" in other headers and modules.
2. Cleaned up includes in AliasAnalysis.h.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D92489
Move the X86 VSELECT->UADDSAT fold to DAGCombiner - there's nothing target specific about these folds.
The SSE42 test diffs are relatively benign - it's avoiding an extra constant load in exchange for an extra xor operation - there are extra register moves, which is annoying as all those operations should commute away.
Differential Revision: https://reviews.llvm.org/D91876
For now, we will hardcode the result as 0.0 if the input is denormal or 0. That will
have an impact on precision. As the added fsqrt belongs to the cold path of the
cmp+branch, it won't impact performance for normal inputs on PowerPC, but improves
the precision if the input is denormal.
Reviewed By: Spatel
Differential Revision: https://reviews.llvm.org/D80974
PowerPC has instructions ftsqrt/xstsqrtdp etc. to do the input test for software square root.
LLVM now tests it with the smallest normalized value using abs + setcc. We should add a hook for
targets that have test instructions.
Reviewed By: Spatel, Chen Zheng, Qiu Chao Fang
Differential Revision: https://reviews.llvm.org/D80706
If the scatter store is able to perform the sign/zero extend of
its index, this is folded into the instruction with refineIndexType().
Additionally, refineUniformBase() will return the base pointer and index
from an add + splat_vector.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90942
Fold
VT = (and (sign_extend NarrowVT to VT) #bitmask)
into
VT = (zero_extend NarrowVT)
With this combine, the test replaces a sign extended load + an
unsigned extension with a zero extended load to render one of the
operands of the last multiplication.
BEFORE | AFTER
f_i16_i32: | f_i16_i32:
.fnstart | .fnstart
ldrsh r0, [r0] | ldrh r1, [r1]
ldrsh r1, [r1] | ldrsh r0, [r0]
smulbb r0, r1, r0 | smulbb r0, r0, r1
uxth r1, r1 | mul r0, r0, r1
mul r0, r0, r1 | bx lr
bx lr |
Reviewed By: resistor
Differential Revision: https://reviews.llvm.org/D90605
Summary:
For vector element types which are not byte-sized, we would generate
incorrect scalar offsets and produce incorrect codegen.
This optimization could potentially be supported in the future, e.g. by
loading in bytes, then shifting and masking out the remaining bits of
the vector element. However, without an upstream target to test against
it's best to avoid the bad codegen in the simplest possible way.
Related to this bug:
https://bugs.llvm.org/show_bug.cgi?id=27600
Reviewed by: foad
Differential Revision: https://reviews.llvm.org/D78568
This patch uses the existing LowerFixedLengthReductionToSVE function to also lower
scalable vector reductions. A separate function has been added to lower VECREDUCE_AND
& VECREDUCE_OR operations with predicate types using ptest.
Lowering scalable floating-point reductions will be addressed in a follow up patch,
for now these will hit the assertion added to expandVecReduce() in TargetLowering.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D89382
The modified code in visitSTORE was missing a scalable vector check, and still
using the now deprecated implicit cast of TypeSize to uint64_t through the
overloaded operator. This patch fixes these issues.
This brings the logic in line with the comment on the context line immediately
above the added precondition.
Add a test in sve-redundant-store.ll that the warning is not triggered.
Differential Revision: https://reviews.llvm.org/D89701
The modified code in visitSTORE was missing a scalable vector check, and still
using the now deprecated implicit cast of TypeSize to uint64_t through the
overloaded operator. This patch fixes these issues.
This brings the logic in line with the comment on the context line immediately
above the added precondition.
Add a test in Redundantstores.ll that the warning is not triggered.
From LangRef, FMF contract should not enable reassociating to form
arbitrary contractions. So it should not help rearrange nodes like
(fma (fmul x, c1), c2, y) into (fma x, c1*c2, y).
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D89527
MULH is often expanded on targets.
This patch removes the isMulhCheaperThanMulShift hook and uses
isOperationLegalOrCustom instead.
Differential Revision: https://reviews.llvm.org/D80485
In certain places in llvm/lib/CodeGen we were relying upon the TypeSize
comparison operators when in fact the code was only ever expecting
either scalar values or fixed width vectors. This patch changes a few
functions that were always expecting to work on scalar or fixed width
types:
1. DAGCombiner::mergeTruncStores - deals with scalar integers only.
2. DAGCombiner::ReduceLoadWidth - not valid for vectors.
3. DAGCombiner::createBuildVecShuffle - should only be used for
fixed width vectors.
4. SelectionDAGLegalize::ExpandFCOPYSIGN and
SelectionDAGLegalize::getSignAsIntValue - only work on scalars.
Differential Revision: https://reviews.llvm.org/D88562
In certain places in llvm/lib/CodeGen we were relying upon the TypeSize
comparison operators when in fact the code was only ever expecting
either scalar values or fixed width vectors. I've changed some of these
places to use the equivalent scalar operator.
Differential Revision: https://reviews.llvm.org/D88482
In certain places in the code we can never end up in a situation where
we're mixing fixed width and scalable vector types. For example,
we can't have truncations and extends that change the lane count. Also,
in other places such as GenWidenVectorStores and GenWidenVectorLoads we
know from the behaviour of FindMemType that we can never choose a vector
type with a different scalable property.
In various places I have used EVT::bitsXY functions instead of
TypeSize::isKnownXY, where it probably makes sense to keep an assert
that scalable properties match.
Differential Revision: https://reviews.llvm.org/D88654
This passes existing X86 test but I'm not sure if it handles all type
legalization cases it needs to.
Alternative to D89200
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D89222
I have introduced a new template PolySize class, where the template
parameter determines the type of quantity, i.e. for an element
count this is just an unsigned value. The ElementCount class is
now just a simple derivation of PolySize<unsigned>, whereas TypeSize
is more complicated because it still needs to contain the uint64_t
cast operator, since there are still many places in the code that
rely upon this implicit cast. As such the class also still needs
some of its own operators.
I've tried to minimise the amount of code in the base PolySize
class, which led to a couple of changes:
1. In some places we were relying on '==' operator comparisons
between ElementCounts and the scalar value 1. I didn't put this
operator in the new PolySize class, and thought it was actually
clearer to use the isScalar() function instead.
2. I removed the isByteSized function and replaced it with calls
to isKnownMultipleOf(8).
I've also renamed NextPowerOf2 to be coefficientNextPowerOf2 so
that it's more consistent with coefficientDivideBy.
Differential Revision: https://reviews.llvm.org/D88409
Summary: This patch is derived from D87384.
In this patch we expand the existing decomposition of mul-by-constant to be more general by implementing 2 patterns:
```
mul x, (2^N + 2^M) --> (add (shl x, N), (shl x, M))
mul x, (2^N - 2^M) --> (sub (shl x, N), (shl x, M))
```
The conversion will be triggered if the multiplier is a big constant that the target can't use a single multiplication instruction to handle. This is controlled by the hook `decomposeMulByConstant`.
Moreover, the conversion benefits from an ILP improvement since the instructions are independent. A case with a sequence like the following also benefits since a shift instruction is saved.
```
*res1 = a * 0x8800;
*res2 = a * 0x8080;
```
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D88201
In DAGCombiner::ForwardStoreValueToDirectLoad I have fixed up some
implicit casts from TypeSize -> uint64_t and replaced calls to
getVectorNumElements() with getVectorElementCount(). There are some
simple cases of forwarding that we can definitely support for
scalable vectors, i.e. when the store and load are both scalable
vectors and have the same size. I have added tests for the new
code paths here:
CodeGen/AArch64/sve-forward-st-to-ld.ll
Differential Revision: https://reviews.llvm.org/D87098
getNode handling for ISD::SETCC calls FoldSETCC which can canonicalize
FP constants to the RHS. When this happens we should create the node
with the FMF that was requested. By using FlagInserter we can ensure
any calls to getNode/getSetcc during canonicalization will also get the flags.
Differential Revision: https://reviews.llvm.org/D88063
In the motivating case from https://llvm.org/PR47517
we create a node that does not get constant folded
before getNegatedExpression is attempted from some
other node, and we crash.
By moving the fold into SelectionDAG::simplifyFPBinop(),
we get the constant fold sooner and avoid the problem.
After some recent upstream discussion we decided that it was best
to avoid having the / operator for both ElementCount and TypeSize,
since this could give the impression that these classes can be used
in the same way as basic integer types. However, division
for scalable types is a bit odd because we are only dividing the
minimum quantity by a value, as opposed to something like:
(MinSize * Vscale) / SomeValue
This is why when performing division it's important the caller
first establishes whether the operation makes sense, perhaps by
calling isKnownMultipleOf() prior to division. The caller must now
explicitly call divideCoefficientBy() on the class to perform the
operation.
Differential Revision: https://reviews.llvm.org/D87700
If we're multiplying all elements of a vector by '0' or '1' then we can more efficiently perform this as a clearing mask (that is likely to further simplify to a shuffle blend).
This was noticed when reviewing D87502, but seems to help idiv/irem by constant cases even more, as '0'/'1' values are often used for 'passthrough' cases.
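A minimal IR example (the constant vector is illustrative):
```
define <4 x i32> @mul_by_zero_one_vector(<4 x i32> %x) {
  ; multiplying by a 0/1 constant vector can become a clearing mask / blend
  %r = mul <4 x i32> %x, <i32 1, i32 0, i32 0, i32 1>
  ret <4 x i32> %r
}
```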
Differential Revision: https://reviews.llvm.org/D88225
This is like FastMathFlagGuard in IR. Since we use the SDAG instance to get
values, it lives with SelectionDAG. By creating a FlagInserter in the current
scope, all values created by getNode will get the flags if no Flags
argument provided.
In this patch, I applied it to floating point operations folding part in
DAG combiner, and removed Flags passing to getNode to show its effect.
Other places in DAG combiner and other helper methods similar to getNode
also need this. They can be done in follow-up patches.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D87361
An existing function Type::getScalarSizeInBits returns a uint64_t
instead of a TypeSize class because the caller is requesting a
scalar size, which cannot be scalable. This patch makes other
similar functions requesting a scalar size consistent with that,
thereby eliminating more than 1000 implicit TypeSize -> uint64_t
casts.
Differential revision: https://reviews.llvm.org/D87889
If we have an all ones mask, we can just use a regular load. InstCombine already gets this in IR. But the all ones mask can appear after type legalization.
Only avx512 test cases are affected because X86 backend already looks for element 0 and the last element being 1. It replaces this with an unmasked load and blend. The all ones mask is a special case of that where the blend will be removed. That transform is only enabled on avx2 targets. I believe that's because a non-zero passthru on avx2 already requires a separate blend so its more profitable to handle mixed constant masks.
This patch adds a dedicated all ones handling to the target independent DAG combiner. I've skipped extending, expanding, and index loads for now. X86 doesn't use index so I don't know much about it. Extending made me nervous because I wasn't sure I could trust the memory VT had the right element count due to some weirdness in vector splitting. For expanding I wasn't sure if we needed different undef handling.
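A hedged IR-level illustration of the equivalent situation (the combine itself fires on the DAG node, typically after type legalization; the intrinsic mangling shown assumes opaque pointers):
```
define <4 x i32> @masked_load_all_ones(ptr %p, <4 x i32> %passthru) {
  ; with an all ones mask the passthru is irrelevant and this is just a load
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> %passthru)
  ret <4 x i32> %v
}
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)
```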
Differential Revision: https://reviews.llvm.org/D87788
The versions that take 'unsigned' will be removed in the future.
I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87592
Similar to D87415, this folds the various float min/max opcodes
with a constant INF or -INF operand, or FLT_MAX / -FLT_MAX operand
if the ninf flag is set. Some of the folds are only possible under
nnan.
The fminnum(X, INF) with nnan and fmaxnum(X, -INF) with nnan cases
are needed to improve the VECREDUCE_FMIN/FMAX lowerings on X86,
the rest is here for the sake of completeness.
Differential Revision: https://reviews.llvm.org/D87571
Previously, we formed ISD::PARITY by looking for (and (ctpop X), 1)
but the AND might be separated from the ctpop. For example if the
parity result is multiplied by 2, we'll pull the AND through the
shift.
So to handle more cases, move to SimplifyDemandedBits where we
can handle more cases that result in only the LSB of the CTPOP
being used.
DAG combiner folds (fma a 1.0 b) into (fadd a b) but the flags aren't
propagated into the new fadd. This patch fixes that.
Some code in visitFMA is redundant, and support for vector constants
is missing; a follow-up patch is needed to clean this up.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D87037
Clang emits (and (ctpop X), 1) for __builtin_parity. If ctpop
isn't natively supported by the target, this leads to poor codegen
due to the expansion of ctpop being more complex than what is needed
for parity.
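A minimal IR example of the pattern clang emits (illustrative):
```
define i32 @parity(i32 %x) {
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %par = and i32 %pop, 1   ; becomes ISD::PARITY
  ret i32 %par
}
declare i32 @llvm.ctpop.i32(i32)
```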
This adds a DAG combine to convert the pattern to ISD::PARITY
before operation legalization. Type legalization is updated
to handled Expanding and Promoting this operation. If after type
legalization, CTPOP is supported for this type, LegalizeDAG will
turn it back into CTPOP+AND. Otherwise LegalizeDAG will emit a
series of shifts and xors followed by an AND with 1.
I've avoided vectors in this patch to avoid more legalization
complexity for this patch.
X86 previously had a custom DAG combiner for this. This is now
moved to Custom lowering for the new opcode. There is a minor
regression in vector-reduce-xor-bool.ll, but a follow up patch
can easily fix that.
Fixes PR47433
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87209
fminnum(X, NaN) is X, fminimum(X, NaN) is NaN. This mirrors the
behavior of existing InstSimplify folds.
This is expected to improve the reduction lowerings in D87391,
which use NaN as a neutral element.
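A minimal IR example of the minnum case (illustrative):
```
define float @fminnum_with_nan(float %x) {
  ; minnum(X, NaN) folds to X
  %r = call float @llvm.minnum.f32(float %x, float 0x7FF8000000000000)
  ret float %r
}
declare float @llvm.minnum.f32(float, float)
```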
Differential Revision: https://reviews.llvm.org/D87415
During the main DAGCombine loop, whenever a node gets replaced, the new
node and all its users are pushed onto the worklist. Omit this if the
new node is the EntryToken (e.g. if a store managed to get optimized
out), because re-visiting the EntryToken and its users will not uncover
any additional opportunities, but there may be a large number of such
users, potentially causing compile time explosion.
This compile time explosion showed up in particular when building the
SingleSource/UnitTests/matrix-types-spec.cpp test-suite case on any
platform without SIMD vector support.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D86963
This removes the after the fact FMF handling from D46854 in favor of passing fast math flags to getNode. This should be a superset of D87130.
This required adding a SDNodeFlags to SelectionDAG::getSetCC.
Now we manage to constant fold some cases involving undef during the
initial getNode that we don't catch in later DAG combines.
Differential Revision: https://reviews.llvm.org/D87200
This is a follow-up suggested in D86420 - if we have a pair of stores
in inverted order for the target endian, we can rotate the source
bits into place.
The "be_i64_to_i16_order" test shows a limitation of the current
function (which might be avoided if we integrate this function with
the other cases in mergeConsecutiveStores). In the earlier
"be_i64_to_i16" test, we skip the first 2 stores because we do not
match the full set as consecutive or rotate-able, but then we reach
the last 2 stores and see that they are an inverted pair of 16-bit
stores. The "be_i64_to_i16_order" test alters the program order of
the stores, so we miss matching the sub-pattern.
Differential Revision: https://reviews.llvm.org/D87112
When lowering fixed length vector operations for SVE the subvector
operations are used extensively to marshall data between scalable
and fixed-length vectors. This means that sequences like:
extract_subvec(binop(insert_subvec(a), insert_subvec(b)))
are very common. DAGCombine only checks if the resulting binop is
legal or can be custom lowered when undoing such sequences. When
it's custom lowering that is introducing them the result is an
infinite legalise->combine->legalise loop.
This patch extends the isOperationLegalOr... functions to include
a "LegalOnly" parameter to restrict the check to legal operations
only. Although isOperationLegal could be used it's common for
the affected code paths to be visited pre and post legalisation,
so the extra parameter keeps the code tidy.
Differential Revision: https://reviews.llvm.org/D86450
fabs and fneg share a common transformation:
(fneg (bitconvert x)) -> (bitconvert (xor x sign))
(fabs (bitconvert x)) -> (bitconvert (and x ~sign))
This patch separate the code into a single method.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86862
I tried to fix this in:
rG716e35a0cf53
...but that patch depends on the order that we encounter the
magic "x/sqrt(x)" expression in the combiner's worklist.
This patch should improve that by waiting until we walk the
user list to decide if there's a use to skip.
The AArch64 test reveals another (existing) ordering problem
though - we may try to create an estimate for plain sqrt(x)
before we see that it is part of a 1/sqrt(x) expression.
In general, we probably want to try the multi-use reciprocal
transform before sqrt transforms, but x/sqrt(x) is a special-case
because that will always reduce to plain sqrt(x) or an estimate.
The AArch64 tests show that the transform is limited by TLI
hook to patterns where there are 3 or more uses of the divisor.
So this change can result in an extra division compared to
what we had, but that's the intended behavior based on the
current setting of that hook.
Currently, `v:t = zext(setcc x,y,cc)` will be transformed to `select x, y, 1:t, 0:t, cc`. It misses some opportunities if x's type size is less than `t`'s size. This patch enhances the above transformation.
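A minimal IR example of such a case (types are illustrative): the compare is on i32 but the zext result is i64.
```
define i64 @zext_of_narrower_setcc(i32 %x, i32 %y) {
  %c = icmp ult i32 %x, %y
  %z = zext i1 %c to i64
  ret i64 %z
}
```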
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86687
This patch changes ElementCount so that the Min and Scalable
members are now private and can only be accessed via the get
functions getKnownMinValue() and isScalable(). In addition I've
added some other member functions for more commonly used operations.
Hopefully this makes the class more useful and will reduce the
need for calling getKnownMinValue().
Differential Revision: https://reviews.llvm.org/D86065
This is the first of a set of DAGCombiner changes enabling strictfp
optimizations. I want to test the waters with this to make sure changes
like these are acceptable for the strictfp case- this particular change
should preserve exception ordering and result precision perfectly, and
many other possible changes appear to be able to as well.
Copied from regular fadd combines but modified to preserve ordering via
the chain, this change allows strict_fadd x, (fneg y) to become
strict_fsub x, y and strict_fadd (fneg x), y to become strict_fsub y, x.
Differential Revision: https://reviews.llvm.org/D85548
We have a gap in our store merging capabilities for shift+truncate
patterns as discussed in:
https://llvm.org/PR46662
I generalized the code/comments for this function in earlier commits,
so we only need ease the type restriction and adjust the address/endian
checking to make this work.
AArch64 lets us switch endian to make sure that patterns are matched
either way.
Differential Revision: https://reviews.llvm.org/D86420
With FMF ("nsz" and "reassoc"), fold X/Sqrt(X) to Sqrt(X).
This is done after targets have the chance to produce a
reciprocal sqrt estimate sequence because that expansion
is probably more efficient than an expansion of a
non-reciprocal sqrt. That is also why we deferred doing
this transform in IR (D85709).
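A minimal IR example of the pattern with the required flags (illustrative):
```
define float @div_x_by_sqrt_x(float %x) {
  %s = call reassoc nsz float @llvm.sqrt.f32(float %x)
  %r = fdiv reassoc nsz float %x, %s   ; folds to sqrt(%x)
  ret float %r
}
declare float @llvm.sqrt.f32(float)
```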
Differential Revision: https://reviews.llvm.org/D86403
The pattern matching does not account for truncating stores,
so it is unlikely to work at later stages. So we are likely
wasting compile-time with no hope of improvement by running
this later.
This should be NFC in terms of output because the endian
check further down would bail out too, but we are wasting
time by waiting to that point to give up. If we generalize
that function to deal with more than i8 types, we should
not have to deal with the degenerate case.
The "isa" checks were less constrained because they allow
target constants, but the later matching code would bail
out on those anyway, so this should be slightly more
efficient.
This patch changes SplitVecOp_EXTRACT_VECTOR_ELT to work correctly
for scalable vectors and also fixes a bug in DAGCombiner where
the scalable property is dropped in visitTRUNCATE when attempting
to fold an extract + a truncate.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85754
In narrowExtractedVectorLoad there is an optimisation that tries to
combine extract_subvector with a narrowing vector load. At the moment
this produces warnings due to the incorrect calls to
getVectorNumElements() for scalable vector types. I've got this
working for scalable vectors too when the extract subvector index
is a multiple of the minimum number of elements. I have added a
new variant of the function:
MachineFunction::getMachineMemOperand
that copies an existing MachineMemOperand, but replaces the pointer
info with a null version since we cannot currently represent scaled
offsets.
I've added a new test for this particular case in:
CodeGen/AArch64/sve-extract-subvector.ll
Differential Revision: https://reviews.llvm.org/D83950
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85220
Follow-up to D82716 / rGea71ba11ab11
We do not have the fabs removal fold in IR yet for the case
where the sqrt operand is repeated, so that's another potential
improvement.
We currently don't do anything to fold any_extend vector loads as no target has such an instruction.
Instead I've added support for folding to a zextload; SimplifyDemandedBits does a good job of adjusting the zext(truncate()) stages as required later on.
We still need the custom scalar extload handling instead of using the tryToFoldExtOfLoad helper as it has different legality tests - we can probably tweak that to reduce most of the code duplication.
Fixes the regression I mentioned in rG99a971cadff7
Differential Revision: https://reviews.llvm.org/D85129
I have added tests to:
CodeGen/AArch64/sve-intrinsics-int-arith.ll
for doing simple integer add operations on tuple types. Since these
tests introduced new warnings due to incorrect use of
getVectorNumElements() I have also fixed up these warnings in the
same patch. These fixes are:
1. In narrowExtractedVectorBinOp I have changed the code to bail out
early for scalable vector types, since we've not yet hit a case that
proves the optimisations are profitable for scalable vectors.
2. In DAGTypeLegalizer::WidenVecRes_CONCAT_VECTORS I have replaced
calls to getVectorNumElements with getVectorMinNumElements in cases
that work with scalable vectors. For the other cases I have added
asserts that the vector is not scalable because we should not be
using shuffle vectors and build vectors in such cases.
Differential revision: https://reviews.llvm.org/D84016
Summary:
In parallelizeChainedStores, a TokenFactor was created with a size greater than 3000.
We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on
such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose
to give up simplification out of consideration for compile time.
Reviewers: @spatel, @arsenm
Differential Revision: https://reviews.llvm.org/D84204
This isn't a natively supported operation, so convert it to a
mask+compare.
In addition to the operation itself, fix up some surrounding stuff to
make the testcase work: we need concat_vectors on i1 vectors, we need
legalization of i1 vector truncates, and we need to fix up all the
relevant uses of getVectorNumElements().
Differential Revision: https://reviews.llvm.org/D83811
The existing code already considered this case. Unfortunately a typo in
the condition prevents it from triggering. Also the existing code, had
it run, forgot to do the folding.
This fixes PR42876.
Differential Revision: https://reviews.llvm.org/D65802
In DAGCombiner::TransformFPLoadStorePair we were dropping the scalable
property of TypeSize when trying to create an integer type of equivalent
size. In fact, this optimisation makes no sense for scalable types
since we don't know the size at compile time. I have changed the code
to bail out when encountering scalable type sizes.
I've added a test to
llvm/test/CodeGen/AArch64/sve-fp.ll
that exercises this code path. The test already emits an error if it
encounters warnings due to implicit TypeSize->uint64_t conversions.
Differential Revision: https://reviews.llvm.org/D83572
We have this generic transform in IR (instcombine),
but as shown in PR41098:
http://bugs.llvm.org/PR41098
...the pattern may emerge in codegen too.
x86 has a potential refinement/reversal opportunity here,
but that should come later or needs a target hook to
avoid the transform. Converting to bswap is the more
specific form, so we should use it if it is available.
This carves out an exception for a pair of consecutive loads that are
reversed from the consecutive order of a pair of stores. All of the
existing profitability/legality checks for the memops remain between
the 2 altered hunks of code.
This should give us the same x86 base-case asm that gcc gets in
PR41098 and PR44895:
http://bugs.llvm.org/PR41098
http://bugs.llvm.org/PR44895
I think we are missing a potential subsequent conversion to use "movbe"
if the target supports that. That might be similar to what AArch64
would use to get "rev16".
Differential Revision: https://reviews.llvm.org/D83567
This carves out an exception for a pair of consecutive loads that are
reversed from the consecutive order of a pair of stores. All of the
existing profitability/legality checks for the memops remain between
the 2 altered hunks of code.
This should give us the same x86 base-case asm that gcc gets in
PR41098 and PR44895:
http://bugs.llvm.org/PR41098
http://bugs.llvm.org/PR44895
I think we are missing a potential subsequent conversion to use "movbe"
if the target supports that. That might be similar to what AArch64
would use to get "rev16".
Differential Revision:
fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E)
This is only allowed when "reassoc" is present on the fadd.
As discussed in D80801, this transform goes beyond
what is allowed by "contract" FMF (-ffp-contract=fast).
That is because we are fusing the trailing add of 'E' with a
multiply, but without "reassoc", the code mandates that the
products A*B and C*D are added together before adding in 'E'.
I've added this example to the LangRef to try to clarify the
meaning of "contract". If that seems reasonable, we should
probably do something similar for the clang docs because
there does not appear to be any formal spec for the behavior
of -ffp-contract=fast.
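A minimal IR example of the fused shape, with reassoc on the fadd as required (illustrative):
```
define double @fma_of_fma_and_fmul(double %a, double %b, double %c, double %d, double %e) {
  %m = fmul reassoc double %c, %d
  %f = call reassoc double @llvm.fma.f64(double %a, double %b, double %m)
  %r = fadd reassoc double %f, %e   ; --> fma a, b, (fma c, d, e)
  ret double %r
}
declare double @llvm.fma.f64(double, double, double)
```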
Differential Revision: https://reviews.llvm.org/D82499
This removes existing code duplication and allows us to
assert that we are handling the expected cases.
We have a list of outstanding bugs that could benefit by
handling truncated source values, so that's a possible
addition going forward.
Summary:
The following combine currently breaks in the DAGCombiner:
```
extract_vector_elt (concat_vectors v4i16:a, v4i16:b), x
-> extract_vector_elt a, x
```
This happens because after we have combined these nodes we have inserted nodes
that use individual instances of the vector element type (i16 in the above
example). However, this isn't a legal type on all backends, and when the
combining pass calls the legalizer it breaks, as it expects types to already
be legal. The type legalizer has
already been run, and running it again would make a mess of the nodes.
In the example code at least, the generated code is still efficient after the change.
Reviewers: miyuki, arsenm, dmgreen, lebedev.ri
Reviewed By: miyuki, lebedev.ri
Subscribers: lebedev.ri, wdng, hiraditya, steven.zhang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D83231
Summary:
When splitting a load of a scalable type, the new address is
calculated in SplitVecRes_LOAD using a vscale and an add instruction.
This patch also adds a DAG combiner fold to visitADD for vscale:
- Fold (add (vscale(C0)), (vscale(C1))) to (vscale(C0 + C1))
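The fold is just distributivity over the runtime vscale multiple; a tiny standalone sketch, where vscale() is a hypothetical stand-in for the runtime value:
```
#include <cassert>
#include <cstdint>
// Hypothetical stand-in for the runtime vscale value.
static uint64_t vscale() { return 4; }
int main() {
  const uint64_t C0 = 2, C1 = 3;
  // (add (vscale(C0)), (vscale(C1))) == (vscale(C0 + C1))
  assert(C0 * vscale() + C1 * vscale() == (C0 + C1) * vscale());
  return 0;
}
```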
Reviewers: sdesmalen, efriedma, david-arm
Reviewed By: david-arm
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82792
X / (fabs(A) * sqrt(Z)) --> X / sqrt(A*A*Z) --> X * rsqrt(A*A*Z)
In the motivating case from PR46406:
https://bugs.llvm.org/show_bug.cgi?id=46406
...this is restoring the sequence that was originally in the source code.
We extracted a term from within the sqrt because we do not know in
instcombine whether a target will expand a sqrt call.
Note: we could say that the transform in IR should be restricted, but
that would not solve the problem if the source was originally in the
pattern shown here.
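For reference, a small standalone check of the underlying identity (exact in real arithmetic; under fast-math the rounding differences are accepted), with arbitrary example values:
```
// X / (fabs(A) * sqrt(Z)) == X / sqrt(A*A*Z) == X * rsqrt(A*A*Z)
#include <cmath>
#include <cstdio>
int main() {
  double X = 3.0, A = -2.0, Z = 5.0;
  double original  = X / (std::fabs(A) * std::sqrt(Z));
  double rewritten = X * (1.0 / std::sqrt(A * A * Z));
  std::printf("%.17g\n%.17g\n", original, rewritten);
  return 0;
}
```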
This is a gray area for fast-math-flag requirements. I think we should at
least check fast-math-flags on the fdiv and fmul because I view this
transform as 2 pieces: reassociate the fmul operands and form reciprocal
from the fdiv (as with the existing transform). We could argue that the
sqrt also needs FMF, but that was not required before, so we should change
that in a follow-up patch if that seems better.
We don't currently have a way to check that the target will produce a sqrt
or recip estimate without actually creating nodes (the APIs are SDValue
getSqrtEstimate() and SDValue getRecipEstimate()), so we clean up
speculatively created nodes if we are not able to create an estimate.
The x86 test with doubles verifies that we are not changing a test with
no estimate sequence.
Differential Revision: https://reviews.llvm.org/D82716
We need to ensure that the sign bits of the result all match
so we can't fold to undef.
Similar to PR46585.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D83163
zext_vector_inreg needs to produce 0s in the extended bits and
sext_vector_inreg needs to produce upper bits that are all the
same. So we should fold them to a 0 vector instead of undef.
Fixes PR46585.
There was a rogue 'assert' in AArch64ISelLowering for the tuple.get intrinsics,
that shouldn't really have been there (I suspect this was a remnant from when
we expected the wider vector always to have come from a vector CONCAT).
When I tried to create a more minimal reproducer, I found a bug in
DAGCombiner where it drops the scalable flag when trying to fold:
extract_subv (bitcast X), Index --> bitcast (extract_subv X, Index')
This patch fixes both issues.
Reviewers: david-arm, efriedma, spatel
Reviewed By: efriedma
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82910
In visitSCALAR_TO_VECTOR we try to optimise cases such as:
scalar_to_vector (extract_vector_elt %x)
into vector shuffles of %x. However, it led to numerous warnings
when %x is a scalable vector type, so for now I've changed the
code to only perform the combination on fixed length vectors.
Although we probably could change the code to work with scalable
vectors in certain cases, without a proper profit analysis it
doesn't seem worth it at the moment.
This change fixes up one of the warnings in:
llvm/test/CodeGen/AArch64/sve-merging-stores.ll
I've also added a simplified version of the same test to:
llvm/test/CodeGen/AArch64/sve-fp.ll
which already has checks for no warnings.
Differential Revision: https://reviews.llvm.org/D82872
As per documentation of `hasPairLoad`:
"`RequiredAlignment` gives the minimal alignment constraints that must be met to be able to select this paired load."
In this sense, `0` is strictly equivalent to `1`. We make this obvious by using `Align` instead of unsigned.
There is only one implementor of this interface.
Differential Revision: https://reviews.llvm.org/D82958
It's perfectly valid to do certain DAG combines where we extract
subvectors from a concat vector when we have scalable vector types.
However, we can do this in a way that avoids generating compiler
warnings by replacing calls to getVectorNumElements() with
getVectorMinNumElements(). Due to the way subvector extracts are
designed to work with scalable vector types, this is OK.
This eliminates some warnings from existing tests in this file:
llvm/test/CodeGen/AArch64/sve-intrinsics-loads.ll
Differential Revision: https://reviews.llvm.org/D82655
When trying to reduce a BUILD_VECTOR to a SHUFFLE_VECTOR it's
important that we carefully check the vector types that led to
that BUILD_VECTOR. In the test I have attached to this commit
there is a case where the results of two SVE faddv instructions
are being stored to consecutive memory locations. With my fix,
as part of merging those stores we discover that each BUILD_VECTOR
element came from an extract of an SVE vector element and
therefore bail out.
Differential Revision: https://reviews.llvm.org/D82564
reduceBuildVecExtToExtBuildVec was breaking a splat(zext(x)) pattern into buildvector(x, 0, x, 0, ..) resulting in much more complex insert+shuffle codegen.
We already go to some lengths to avoid this in SimplifyDemandedVectorElts etc. when we encounter splat buildvectors.
It should be OK to fold all splat(aext(x)) patterns - we might need to tighten this if we find a case where we mustn't introduce a buildvector(x, undef, x, undef, ..) but I can't find one.
Fixes PR46461.
Summary:
- AssertAlign node records the guaranteed alignment on its source node,
where these alignments are retrieved from alignment attributes in LLVM
IR. These tracked alignments could help DAG combining and lowering
generating efficient code.
- In this patch, the basic support of AssertAlign node is added. So far,
we only generate AssertAlign nodes on return values from intrinsic
calls.
- Addressing selection in AMDGPU is revised accordingly to capture the
new (base + offset) patterns.
Reviewers: arsenm, bogner
Subscribers: jvesely, wdng, nhaehnle, tpr, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81711
For now I have changed SimplifyDemandedBits and its various callers
to assume we know nothing for scalable vectors and to ignore the
demanded bits completely. I have also done something similar for
SimplifyDemandedVectorElts. These changes fix up lots of warnings
due to calls to EVT::getVectorNumElements() for types with scalable
vectors. These functions are all used for optimisations, rather than
functional requirements. In future we can revisit this code if
there is a need to improve code quality for SVE.
Differential Revision: https://reviews.llvm.org/D80537
The current implementation of division estimation isn't correct for some
cases like 1.0/0.0 (the result is nan, not the expected inf).
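A minimal sketch of the failure mode (this is not the actual DAG expansion; the estimate and one Newton-Raphson refinement step are written out by hand):
```
// Refining a reciprocal estimate e of 1/d uses e * (2 - d*e).
// With d == 0.0 the estimate is +inf, so d*e is 0*inf == NaN and the
// "refined" result is NaN instead of the expected +inf.
#include <cstdio>
int main() {
  double d = 0.0;
  double e = 1.0 / d;                  // stand-in for the hardware estimate: +inf
  double refined = e * (2.0 - d * e);  // NaN
  std::printf("%f\n", refined);        // prints: nan
  return 0;
}
```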
And this change exposes a potential infinite loop: we use
isConstOrConstSplatFP in combineRepeatedFPDivisors to look up if the
divisor is some constant. But it doesn't work after legalization on some
platforms. This patch restricts the method to act before LegalDAG.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D80542
If fmul and fadd are separated by an fma, we can fold them together
to save an instruction:
fadd (fma A, B, (fmul C, D)), N1 --> fma(A, B, fma(C, D, N1))
The fold implemented here is actually a specialization - we should
be able to peek through >1 fma to find this pattern. That's another
patch if we want to try that enhancement though.
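The scalar shape of the fold, written with std::fma purely for illustration (not the DAG code itself):
```
#include <cmath>
// fadd (fma A, B, (fmul C, D)), N1
double before(double A, double B, double C, double D, double N1) {
  return std::fma(A, B, C * D) + N1;
}
// fma A, B, (fma C, D, N1)
double after(double A, double B, double C, double D, double N1) {
  return std::fma(A, B, std::fma(C, D, N1));
}
```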
This transform was guarded by the TLI hook enableAggressiveFMAFusion(),
so it was done for some in-tree targets like PowerPC, but not AArch64
or x86. The hook is protecting against forming a potentially more
expensive computation when fma takes longer to execute than a single
fadd. That hook may be needed for other transforms, but in this case,
we are replacing fmul+fadd with fma, and the fma should never take
longer than the 2 individual instructions.
'contract' FMF is all we need to allow this transform. That flag
corresponds to -ffp-contract=fast in Clang, so we are allowed to form
fma ops freely across expressions.
Differential Revision: https://reviews.llvm.org/D80801
Summary:
Note to downstream target maintainers: this might silently change the semantics of your code if you override `TargetLowering::allowsMemoryAccess` without marking it override.
This patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81379
D79003/rG9fa58d1bf2f8 exposed an issue with scalarizeBinOpOfSplats: we were extracting from the splatted vector result instead of the source. The splat index is only valid for the source vector, not the result, which may contain undefs, including at the splat index.
This reverts commit 21dadd774f.
In at least PromoteIntBinOps, they wanted to know about users of *all* values
produced by the node, not just the integer being promoted. For example,
not replacing chain users when the operation is a load breaks the ordering
of the DAG.
This patch implements a target independent DAG combine to produce multiply-high
instructions from shifts. This DAG combine will combine shifts for any type as
long as the MULH on the narrow type is legal.
For now, it is enabled on PowerPC as PowerPC is the only target that has an
implementation of the isMulhCheaperThanMulShift TLI hook introduced in
D78271.
Moreover, this DAG combine focuses on catching the pattern:
(shift (mul (ext <narrow_type>:$a to <wide_type>), (ext <narrow_type>:$b to <wide_type>)), <narrow_width>)
to produce mulhs when we have a sign-extend, and mulhu when we have
a zero-extend.
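A hypothetical C++ source shape of that pattern (the function names are made up): widen, multiply, then shift right by the narrow width so only the high half remains.
```
#include <cstdint>
// Signed: (sra (mul (sext a), (sext b)), 32) -> mulhs
int32_t mulhs32(int32_t a, int32_t b) {
  return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
// Unsigned: (srl (mul (zext a), (zext b)), 32) -> mulhu
uint32_t mulhu32(uint32_t a, uint32_t b) {
  return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
```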
The patch performs the following checks:
- Operation is a right shift arithmetic (sra) or logical (srl)
- Input to the shift is a multiply
- Both operands to the shift are sext/zext nodes
- The extends into the multiply are both the same
- The narrow type is half the width of the wide type
- The shift amount is the width of the narrow type
- The respective mulh operation is legal
Differential Revision: https://reviews.llvm.org/D78272
This code was repeated in two callers of CommitTargetLoweringOpt.
But CommitTargetLoweringOpt is also called from TargetLowering.
We should print a message for those calls too, so sink the
repeated code into CommitTargetLoweringOpt to catch those calls.
Add command line options to guard store width reduction optimizations
As discussed in the thread http://lists.llvm.org/pipermail/llvm-dev/2020-May/141838.html,
some bit field access widths can be reduced by ReduceLoadOpStoreWidth and some
can't. If two accesses are very close and the first access's width is reduced
while the second's is not, the wide load of the second access will be stalled
for a long time.
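A hypothetical bit-field example of the problem shape (not the reproducer from the thread): the first store can be narrowed to a byte while the neighbouring access still loads the full word, so the wide load stalls waiting on store forwarding.
```
struct S {
  unsigned long long a : 1;
  unsigned long long b : 63;
};
void update(S *s) {
  s->a = 1;   // candidate for ReduceLoadOpStoreWidth: may become a 1-byte store
  s->b += 1;  // wide read-modify-write that reloads the whole 8-byte word
}
```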
This patch adds command line options to guard ReduceLoadOpStoreWidth and
ShrinkLoadReplaceStoreWithStore, so users can use them to disable these
store width reduction optimizations.
Differential Revision: https://reviews.llvm.org/D80745
Currently combineInsertEltToShuffle turns insert_vector_elt into a
vector_shuffle, even if the inserted element is a vector with a single
element. In this case, it is unlikely that the additional shuffle
would be more efficient than an insert_vector_elt.
Additionally, this fixes an infinite cycle in DAGCombine, where
combineInsertEltToShuffle turns an insert_vector_elt into a shuffle,
which gets turned back into an insert_vector_elt/extract_vector_elt by
a custom AArch64 lowering (in visitVECTOR_SHUFFLE).
Such insert_vector_elt and extract_vector_elt combinations can be
lowered efficiently using mov on AArch64.
There are 2 test changes in arm64-neon-copy.ll: we now use one or two
mov instructions instead of a single zip1. The reason that we need a
second mov in ins1f2 is that we have to move the result to the result
register; that is not really related to the DAGCombine fold, I think.
But in any case, on most uarchs, mov should be cheaper than zip1. On a
Cortex-A75 for example, zip1 is twice as expensive as mov
(https://developer.arm.com/docs/101398/latest/arm-cortex-a75-software-optimization-guide-v20)
Reviewers: spatel, efriedma, dmgreen, RKSimon
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D80710
I have tried to ensure that SelectionDAG and DAGCombiner do
sensible things for scalable vectors, and added support for a
limited number of simple folds. Codegen support for the vector
extract patterns has also been added to the AArch64 backend.
New vector extract tests have been added here:
CodeGen/AArch64/sve-extract-element.ll
and I have also added new folds using inserts and extracts here:
CodeGen/AArch64/sve-insert-element.ll
Differential Revision: https://reviews.llvm.org/D80208
binop (splat X), (splat C) --> splat (binop X, C)
binop (splat C), (splat X) --> splat (binop C, X)
We do this in IR, and there's a similar fold for the case with 2
non-constant operands just above the code diff in this patch.
This was discussed in D79718, and the extra shuffle in the test
(llvm/test/CodeGen/X86/vector-fshl-128.ll::sink_splatvar) where it
was noticed disappears because demanded elements analysis is no
longer blocked. The large majority of the test diffs seem to be
benign code scheduling changes, but I do see another type of win:
moving the splat later allows binop narrowing in some cases.
Regressions were avoided on x86 and ARM with the INSERT_VECTOR_ELT
restriction.
Differential Revision: https://reviews.llvm.org/D79886
This patch introduces a TargetLowering query, isMulhCheaperThanMulShift.
Currently in DAG Combine, it will transform mulhs/mulhu into a
wider multiply and a shift if the wide multiply is legal.
This TLI function is implemented on 64-bit PowerPC, as it is more desirable to
have multiply-high over multiply + shift for words and doublewords. Having
multiply-high can also aid further transformations.
Differential Revision: https://reviews.llvm.org/D78271
Summary:
For some targets generic combines don't really do much and they
consume a disproportionate amount of time.
There's not really a mechanism in SDISel to tactically disable
combines, but we can have a switch to disable all of them and
let the targets just implement what they specifically need.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79112
This patch stores the alignment for ConstantPoolSDNode as an
Align and updates the getConstantPool interface to take a MaybeAlign.
Removing getAlignment() will be done as a follow up.
Differential Revision: https://reviews.llvm.org/D79436
Try to combine N short vector cast ops into 1 wide vector cast op:
concat (cast X), (cast Y)... -> cast (concat X, Y...)
This is part of solving PR45794:
https://bugs.llvm.org/show_bug.cgi?id=45794
As noted in the code comment, this is uglier than I was hoping because
the opcode determines whether we pass the source or destination type
to isOperationLegalOrCustom(). Also IIUC, there's no way to validate
what the other (dest or src) type is. Without the extra legality check
on that, there's an ARM regression test in:
test/CodeGen/ARM/isel-v8i32-crash.ll
...that will crash trying to lower an unsupported v8f32 to v8i16.
Differential Revision: https://reviews.llvm.org/D79360
rL368553 added SimplifyMultipleUseDemandedBits handling for ISD::TRUNCATE to SimplifyDemandedBits so we don't need to duplicate this (and it gets rid of another GetDemandedBits call which is slowly being replaced with SimplifyMultipleUseDemandedBits anyhow).
Also fix some cost tables for vXi1 types to match the costs entries for the types they will be promoted to.
Differential Revision: https://reviews.llvm.org/D79045
X86 matches several 'shift+xor' funnel shift patterns:
fold (or (srl (srl x1, 1), (xor y, 31)), (shl x0, y)) -> (fshl x0, x1, y)
fold (or (shl (shl x0, 1), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
fold (or (shl (add x0, x0), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
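For illustration, a hypothetical C++ source shape that lowers to the first of those patterns (the helper name is made up):
```
#include <cstdint>
// (x0 << y) | ((x1 >> 1) >> (y ^ 31)) == fshl(x0, x1, y) for y in [0, 31]
uint32_t fshl32(uint32_t x0, uint32_t x1, uint32_t y) {
  y &= 31;  // keep the shift amount in range; y ^ 31 == 31 - y
  return (x0 << y) | ((x1 >> 1) >> (y ^ 31));
}
```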
These patterns are also what we end up with after the proposed expansion changes in D77301.
This patch moves these to DAGCombine's generic MatchFunnelPosNeg.
All existing X86 test cases still pass, and we just have a small codegen change in pr32282.ll.
Reviewed By: @spatel
Differential Revision: https://reviews.llvm.org/D78935
Summary:
This patch tries to ensure that we do something sensible when
generating code for the ISD::INSERT_VECTOR_ELT DAG node when operating
on scalable vectors. Previously we always returned 'undef' when
inserting an element into an out-of-bounds lane index, whereas now
we only do this for fixed length vectors. For scalable vectors it
is assumed that the backend will do the right thing in the same way
that we have to deal with variable lane indices.
In this patch I have permitted a few basic combinations for scalable
vector types where it makes sense, but in general avoided most cases
for now as they currently require the use of BUILD_VECTOR nodes.
This patch includes tests for all scalable vector types when inserting
into lane 0, but I've only included one or two vector types for other
cases such as variable lane inserts.
Differential Revision: https://reviews.llvm.org/D78992
Call getNegatedExpression(Cost) and check the Cost to make the code more clear.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D78347
This is an NFC patch for D77319. The idea is to hide getNegatibleCost inside getNegatedExpression(),
have it return null if the cost is expensive, and add some helper functions for ease of use.
The old getNegatedExpression is renamed to negateExpression to avoid the semantic conflict.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D78291
Since 1725f28841, this should check
isFMADLegalForFAddFSub rather than the plain isOperationLegal.
This would assert in a subset of cases due to an oddity in how FMAD is
selected. We will allow FMA formation pre-legalize, but not FMAD even
in cases where it would be valid.
The current hook requires passing in the root fadd/fsub. However, in
this distributed case, it would be far more complicated to pass in
the relevant operand. AMDGPU doesn't get any value from the node,
only needs the type, and is the only implementor, so I'm not sure why
we have this complexity. Just rename and expand the assert to avoid
the more complicated checks spread through the distribution logic.
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.
While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32 bits, and improve the documentation comments.