D99674 stopped the folding of certain select operations into and/or, due
to incorrect folding in the presence of poison. D97360 added some costs
to attempt to account for the change, but only worked at the getUserCost
level, not the getCmpSelInstrCost that the vectorizer will use directly.
This adds similar logic into the vectorizer to handle these logical
and/or selects, treating them like and/or directly.
This fixes 60% performance regressions from code like the attached test
case.
Differential Revision: https://reviews.llvm.org/D99884
The scalarization overhead was set deliberately high for MVE, whilst the
codegen was new. It helps protect us against the negative ramifications
of mixing scalar and vector instructions. This decreases that,
especially for floating point where the cost of extracting/inserting
lane elements can be low. For integer the cost is still fairly high due
to the cross-register-bank copy, but is no longer n^2 in the length of
the vector.
In general, this will decrease the cost of scalarizing floats and long
integer vectors. i64 increase in cost, having a high cost before and
after this patch. For floats this allows up to start doing things like
vectorizing fdiv instructions, even if they are scalarized.
Differential Revision: https://reviews.llvm.org/D98245
This test shows a case where we can potentially scalarize the store in a
predicated loop, creating a lot of instructions that would be much
slower than scalar.
A v8i32 compare will produce a v8i1 predicate, but during codegen the
v8i32 will be split into two v4i32, potentially requiring two v4i1
predicates to be merged into a single v8i1. Because this merging of two
v4i1's into a v8i1 is very expensive, we need to make the cost of the
compare equally high.
This patch adds the cost of that to ARMTTIImpl::getCmpSelInstrCost.
Because we don't know whether the user of the predicate can be split,
and the cost model is mostly pre-instruction, we may be pessimistic but
that should only be for larger and legal types. This also adds min/max
detection to the costmodel where it can be detected, to keep those in
line with the cost of simple min/max instructions. Otherwise for the
most part, costs that were already expensive have become more expensive.
Differential Revision: https://reviews.llvm.org/D96692
The vector reduction intrinsics started life as experimental ops, so backend support
was lacking. As part of promoting them to 1st-class intrinsics, however, codegen
support was added/improved:
D58015
D90247
So I think it is safe to now remove this complication from IR.
Note that we still have an IR-level codegen expansion pass for these as discussed
in D95690. Removing that is another step in simplifying the logic. Also note that
x86 was already unconditionally forming reductions in IR, so there should be no
difference for x86.
I spot checked a couple of the tests here by running them through opt+llc and did
not see any asm diffs.
If we do find functional differences for other targets, it should be possible
to (at least temporarily) restore the shuffle IR with the ExpandReductions IR
pass.
Differential Revision: https://reviews.llvm.org/D96552
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and vector VF. This updates the costmodel to
handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor
again.
This patch makes SLP and LV emit operations with initial vectors set to poison constant instead of undef.
This is a part of efforts for using poison vector instead of undef to represent "doesn't care" vector.
The goal is to make nice shufflevector optimizations valid that is currently incorrect due to the tricky interaction between undef and poison (see https://bugs.llvm.org/show_bug.cgi?id=44185 ).
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94061
Creating in-loop reductions relies on IR references to map
IR values to VPValues after interleave group creation.
Make sure we re-add the updated member to the plan, so the look-ups
still work as expected
This fixes a crash reported after D90562.
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)
The order is swapped, but in terms of correctness it is still fine.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93923
This patch updates IRBuilder to create insertelement/shufflevector using poison as a placeholder.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93793
Temporarily revert commit 8b1c4e310c.
After 8b1c4e310c the compile-time for `MultiSource/Benchmarks/MiBench/consumer-lame`
dramatically increases with -O3 & LTO, causing issues for builders with
that configuration.
I filed PR48553 with a smallish reproducer that shows a 10-100x compile
time increase.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.
Original patch by Anna Welker
Differential Revision: https://reviews.llvm.org/D86452
If we have two unknown sizes and one GEP operand and one non-GEP
operand, then we currently simply return MayAlias. The comment says
we can't do anything useful ... but we can! We can still check that
the underlying objects are different (and do so for the GEP-GEP case).
To reduce the compile-time impact, this a) checks this early, before
doing the relatively expensive GEP decomposition that will not be
used and b) doesn't do the check if the other operand is a phi or
select. In that case, the phi/select will already recurse, so this
would just do two slightly different recursive walks that arrive at
the same roots.
Compile-time is still a bit of a mixed bag: https://llvm-compile-time-tracker.com/compare.php?from=624af932a808b363a888139beca49f57313d9a3b&to=845356e14adbe651a553ed11318ddb5e79a24bcd&stat=instructions
On average this is a small improvement, but sqlite with ThinLTO has
a 0.5% regression (lencod has a 1% improvement).
The BasicAA test case checks this by using two memsets with unknown
size. However, the more interesting case where this is useful is
the LoopVectorize test case, as analysis of accesses in loops tends
to always us unknown sizes.
Differential Revision: https://reviews.llvm.org/D92401
Use -0.0 instead of 0.0 as the start value. The previous use of 0.0
was fine for all existing uses of this function though, as it is
always generated with fast flags right now, and thus nsz.
This expands upon the inloop reductions added in e9761688e41cb9e976,
allowing them to be inserted into tail folded loops. Reductions are
generates with the form:
x = select(mask, vecop, zero)
v = vecreduce.add(x)
c = add chain, v
Where zero here is chosen as the identity value for add reductions. The
backend is then expected to fold the select and the vecreduce into a
single predicated instruction.
Most of the code is fairly straight forward, except for the creation of
blockmasks which need to ensure they are created in dominance order. The
order they are added is altered to be after any phis, keeping the
requirements for the underlying IR.
Differential Revision: https://reviews.llvm.org/D84451
We currently collect the ICmp and Add from an induction variable,
marking them as dead so that vplan values are not created for them. This
extends that to include any single use trunk from the ICmp, which allows
the Add to more readily be removed too.
This can help with costing vplan nodes, as the ICmp and Add are more
reliably removed and are not double-counted.
Differential Revision: https://reviews.llvm.org/D88873
We have been running tests/benchmarks downstream with tail-predication enabled
for some time now and this behaves as expected: we are not aware of any
correctness issues, and this performs better across the board than with
tail-predication disabled. Time to flip the switch!
Differential Revision: https://reviews.llvm.org/D88093
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.
For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.
Differential Revision: https://reviews.llvm.org/D75512
This implements 2 different vectorisation fallback strategies if tail-folding
fails: 1) don't vectorise at all, or 2) vectorise using a scalar epilogue. This
can be controlled with option -prefer-predicate-over-epilogue, that has been
changed to take a numeric value corresponding to the tail-folding preference
and preferred fallback.
Patch by: Pierre van Houtryve, Sjoerd Meijer.
Differential Revision: https://reviews.llvm.org/D79783
MVE Gather scatter codegeneration is looking a lot better than it used
to, but still has some issues. The instructions we currently model as 1
cycle per element, which is a bit low for some cases. Increasing the
cost by the MVECostFactor brings them in-line with our other instruction
costs. This will have the effect of only generating then when the extra
benefit is more likely to overcome some of the issues. Notably in
running out of registers and vectorizing loops that could otherwise be
SLP vectorized.
In the short-term whilst we look at other ways of dealing with those
more directly, we can increase the costs of gathers to make them more
likely to be beneficial when created.
Differential Revision: https://reviews.llvm.org/D86444
This adapts LV to the new semantics of get.active.lane.mask as discussed in
D86147, which means that the LV now emits intrinsic get.active.lane.mask with
the loop tripcount instead of the backedge-taken count as its second argument.
The motivation for this is described in D86147.
Differential Revision: https://reviews.llvm.org/D86304
If gather/scatters are enabled, ARMTargetTransformInfo now allows
tail predication for loops with a much wider range of strides, up
to anything that is loop invariant.
Differential Revision: https://reviews.llvm.org/D85410
As part of D84741, this adds a target hook for the
preferPredicatedReductionSelect option and makes use
of it under MVE, allowing us to tail predicate most
reduction loops.
Differential Revision: https://reviews.llvm.org/D85980
This patch uses the feature added in D79162 to fix the cost of a
sext/zext of a masked load, or a trunc for a masked store.
Previously, those were considered cheap or even free, but it's
not the case as we cannot split the load in the same way we would for
normal loads.
This updates the costs to better reflect reality, and adds a test for it
in test/Analysis/CostModel/ARM/cast.ll.
It also adds a vectorizer test that showcases the improvement: in some
cases, the vectorizer will now choose a smaller VF when
tail-predication is enabled, which results in better codegen. (Because
if it were to use a higher VF in those cases, the code we see above
would be generated, and the vmovs would block tail-predication later in
the process, resulting in very poor codegen overall)
Original Patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79163
It was getting difficult to see which test was in which file, so this
reorganises the test files so that now all filenames start with tail-folding-*
followed by a more descriptive name what that group of tests check.
This patch enables the LoopVectorizer to build a phi of pointer
type and provide the vector loads and stores with vector type
getelementptrs built from the pointer induction variable, which
produces much less instructions than the previous approach of
creating scalar getelementpointers and glue them together to a
vector.
Differential Revision: https://reviews.llvm.org/D81267
If a vector body has live-out values, it is probably a reduction, which needs a
final reduction step after the loop. MVE has a VADDV instruction to reduce
integer vectors, but doesn't have an equivalent one for float vectors. A
live-out value that is not recognised as reduction later in the optimisation
pipeline will result in the tail-predicated loop to be reverted to a
non-predicated loop and this is very expensive, i.e. it has a significant
performance impact, which is what we hope to avoid with fine tuning the ARM TTI
hook preferPredicateOverEpilogue implementation.
Differential Revision: https://reviews.llvm.org/D82953
This refactors option -disable-mve-tail-predication to take different arguments
so that we have 1 option to control tail-predication rather than several
different ones.
This is also a prep step for D82953, in which we want to reject reductions
unless that is requested with this option.
Differential Revision: https://reviews.llvm.org/D83133
This adjusts the MVE fp16 cost model, similar to how we already do for
integer casts. It uses the base cost of 1 per cvt for most fp extend /
truncates, but adjusts it for loads and stores where we know that a
extending load has been used to get the load into the correct lane, and
only an MVE VCVTB is then needed.
Differential Revision: https://reviews.llvm.org/D81813
This alters getMemoryOpCost to use the Base TargetTransformInfo version
that includes some additional checks for whether extending loads are
legal. This will generally have the effect of making <2 x ..> and some
<4 x ..> loads/stores more expensive, which in turn should help favour
larger vector factors.
Notably it alters the cost of a <4 x half>, which with the current
codegen will be expensive if it is not extended.
Differential Revision: https://reviews.llvm.org/D82456
This patch enables the LoopVectorizer to build a phi of pointer
type and provide the vector loads and stores with vector type
getelementptrs built from the pointer induction variable, which
produces much less instructions than the previous approach of
creating scalar getelementpointers and glue them together to a
vector.
Differential Revision: https://reviews.llvm.org/D81267