With the extra optimisations we have done, these should now be fine to
enable by default. Which is what this patch does.
Differential Revision: https://reviews.llvm.org/D70968
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
This adds some extra cost model tests for shifts, and does some minor
adjustments to some Neon code to make it clear as to what it applies to.
Both NFC.
Alas, using half the available vector registers in a single instruction
is just too much for the register allocator to handle. The mve-vldst4.ll
test here fails when these instructions are enabled at present. This
patch disables the generation of VLD4 and VST4 by adding a
mve-max-interleave-factor option, which we currently default to 2.
Differential Revision: https://reviews.llvm.org/D71109
Currently we fail to pick the right insertion point when
PreviousLastPart of a first-order-recurrence is a PHI node not in the
LoopVectorBody. This can happen when PreviousLastPart is produce in a
predicated block. In that case, we should pick the insertion point in
the BB the PHI is in.
Fixes PR44020.
Reviewers: hsaito, fhahn, Ayal, dorit
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D71071
Fix PR40816: avoid considering scalar-with-predication instructions as also
uniform-after-vectorization.
Instructions identified as "scalar with predication" will be "vectorized" using
a replicating region. If such instructions are also optimized as "uniform after
vectorization", namely when only the first of VF lanes is used, such a
replicating region becomes erroneous - only the first instance of the region can
and should be formed. Fix such cases by not considering such instructions as
"uniform after vectorization".
Differential Revision: https://reviews.llvm.org/D70298
rL341831 moved one-use check higher up, restricting a few folds
that produced a single instruction from two instructions to the case
where the inner instruction would go away.
Original commit message:
> InstCombine: move hasOneUse check to the top of foldICmpAddConstant
>
> There were two combines not covered by the check before now,
> neither of which actually differed from normal in the benefit analysis.
>
> The most recent seems to be because it was just added at the top of the
> function (naturally). The older is from way back in 2008 (r46687)
> when we just didn't put those checks in so routinely, and has been
> diligently maintained since.
From the commit message alone, there doesn't seem to be a
deeper motivation, deeper problem that was trying to solve,
other than 'fixing the wrong one-use check'.
As i have briefly discusses in IRC with Tim, the original motivation
can no longer be recovered, too much time has passed.
However i believe that the original fold was doing the right thing,
we should be performing such a transformation even if the inner `add`
will not go away - that will still unchain the comparison from `add`,
it will no longer need to wait for `add` to compute.
Doing so doesn't seem to break any particular idioms,
as least as far as i can see.
References https://bugs.llvm.org/show_bug.cgi?id=44100
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
This version contains 2 fixes for reported issues:
1. Make sure we do not try to sink terminator instructions.
2. Make sure we bail out, if we try to sink an instruction that needs to
stay in place for another recurrence.
Original message:
If the recurrence PHI node has a single user, we can sink any
instruction without side effects, given that all users are dominated by
the instruction computing the incoming value of the next iteration
('Previous'). We can sink instructions that may cause traps, because
that only causes the trap to occur later, but not on any new paths.
With the relaxed check, we also have to make sure that we do not have a
direct cycle (meaning PHI user == 'Previous), which indicates a
reduction relation, which potentially gets missed by
ReductionDescriptor.
As follow-ups, we can also sink stores, iff they do not alias with
other instructions we move them across and we could also support sinking
chains of instructions and multiple users of the PHI.
Fixes PR43398.
Reviewers: hsaito, dcaballe, Ayal, rengolin
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D69228
Follow-up of cb47b8783: don't query TTI->preferPredicateOverEpilogue when
option -prefer-predicate-over-epilog is set to false, i.e. when we prefer not
to predicate the loop.
Differential Revision: https://reviews.llvm.org/D70382
Now that we have the intrinsics, we can add VLD2/4 and VST2/4 lowering
for MVE. This works the same way as Neon, recognising the load/shuffles
combination and converting them into intrinsics in a pre-isel pass,
which just calls getMaxSupportedInterleaveFactor, lowerInterleavedLoad
and lowerInterleavedStore.
The main difference to Neon is that we do not have a VLD3 instruction.
Otherwise most of the code works very similarly, with just some minor
differences in the form of the intrinsics to work around. VLD3 is
disabled by making isLegalInterleavedAccessType return false for those
cases.
We may need some other future adjustments, such as VLD4 take up half the
available registers so should maybe cost more. This patch should get the
basics in though.
Differential Revision: https://reviews.llvm.org/D69392
This is a follow up of d90804d, to also flag fmcp instructions as instructions
that we do not support in tail-predicated vector loops.
Differential Revision: https://reviews.llvm.org/D70295
The vectoriser queries TTI->preferPredicateOverEpilogue to determine if
tail-folding is preferred for a loop, but it was not respecting loop hint
'predicate' that can disable this, which has now been added. This showed that
we were incorrectly initialising loop hint 'vectorize.predicate.enable' with 0
(i.e. FK_Disabled) but this should have been FK_Undefined, which has been
fixed.
Differential Revision: https://reviews.llvm.org/D70125
This implements TTI hook 'preferPredicateOverEpilogue' for MVE. This is a
first version and it operates on single block loops only. With this change, the
vectoriser will now determine if tail-folding scalar remainder loops is
possible/desired, which is the first step to generate MVE tail-predicated
vector loops.
This is disabled by default for now. I.e,, this is depends on option
-disable-mve-tail-predication, which is off by default.
I will follow up on this soon with a patch for the vectoriser to respect loop
hint 'vectorize.predicate.enable'. I.e., with this loop hint set to Disabled,
we don't want to tail-fold and we shouldn't query this TTI hook, which is
done in D70125.
Differential Revision: https://reviews.llvm.org/D69845
This recommits 11ed1c0239 (reverted in
9f08ce0d21 for failing an assert) with a fix:
tryToWidenMemory() now first checks if the widening decision is to interleave,
thus maintaining previous behavior where tryToInterleaveMemory() was called
first, giving priority to interleave decisions over widening/scalarization. This
commit adds the test case that exposed this bug as a LIT.
This recommits 100e797adb (reverted in
009e032634 for failing an assert). While the
root cause was independently reverted in eaff300401,
this commit includes a LIT to make sure IVDescriptor's SinkAfter logic does not
try to sink branch instructions.
It broke Chromium, causing "Instruction does not dominate all uses!" errors.
See https://bugs.chromium.org/p/chromium/issues/detail?id=1022297#c1 for a
reproducer.
> If the recurrence PHI node has a single user, we can sink any
> instruction without side effects, given that all users are dominated by
> the instruction computing the incoming value of the next iteration
> ('Previous'). We can sink instructions that may cause traps, because
> that only causes the trap to occur later, but not on any new paths.
>
> With the relaxed check, we also have to make sure that we do not have a
> direct cycle (meaning PHI user == 'Previous), which indicates a
> reduction relation, which potentially gets missed by
> ReductionDescriptor.
>
> As follow-ups, we can also sink stores, iff they do not alias with
> other instructions we move them across and we could also support sinking
> chains of instructions and multiple users of the PHI.
>
> Fixes PR43398.
>
> Reviewers: hsaito, dcaballe, Ayal, rengolin
>
> Reviewed By: Ayal
>
> Differential Revision: https://reviews.llvm.org/D69228
We have two ways to steer creating a predicated vector body over creating a
scalar epilogue. To force this, we have 1) a command line option and 2) a
pragma available. This adds a third: a target hook to TargetTransformInfo that
can be queried whether predication is preferred or not, which allows the
vectoriser to make the decision without forcing it.
While this change behaves as a non-functional change for now, it shows the
required TTI plumbing, usage of this new hook in the vectoriser, and the
beginning of an ARM MVE implementation. I will follow up on this with:
- a complete MVE implementation, see D69845.
- a patch to disable this, i.e. we should respect "vector_predicate(disable)"
and its corresponding loophint.
Differential Revision: https://reviews.llvm.org/D69040
If the recurrence PHI node has a single user, we can sink any
instruction without side effects, given that all users are dominated by
the instruction computing the incoming value of the next iteration
('Previous'). We can sink instructions that may cause traps, because
that only causes the trap to occur later, but not on any new paths.
With the relaxed check, we also have to make sure that we do not have a
direct cycle (meaning PHI user == 'Previous), which indicates a
reduction relation, which potentially gets missed by
ReductionDescriptor.
As follow-ups, we can also sink stores, iff they do not alias with
other instructions we move them across and we could also support sinking
chains of instructions and multiple users of the PHI.
Fixes PR43398.
Reviewers: hsaito, dcaballe, Ayal, rengolin
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D69228
Summary:
Getelementptr has vector type if any of its operands are vectors
(the scalar operands being implicitly broadcast to all vector elements).
Extractelement applied to a vector getelementptr can be folded by
applying the extractelement in turn to all of the vector operands.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69379
Currently we may do iterleaving by more than estimated trip count
coming from the profile or computed maximum trip count. The solution is to
use "best known" trip count instead of exact one in interleaving analysis.
Patch by Evgeniy Brevnov.
Differential Revision: https://reviews.llvm.org/D67948
Add generic DAG combine for extending masked loads.
Allow us to generate sext/zext masked loads which can access v4i8,
v8i8 and v4i16 memory to produce v4i32, v8i16 and v4i32 respectively.
Differential Revision: https://reviews.llvm.org/D68337
llvm-svn: 375085
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374634
When optimising for size and SCEV runtime checks need to be emitted to check
overflow behaviour, the loop vectorizer can run in this assert:
LoopVectorize.cpp:2699: void llvm::InnerLoopVectorizer::emitSCEVChecks(
llvm::Loop *, llvm::BasicBlock *): Assertion `!BB->getParent()->hasOptSize()
&& "Cannot SCEV check stride or overflow when opt
We should not generate predicates while optimising for size because
code will be generated for predicates such as these SCEV overflow runtime
checks.
This should fix PR43371.
Differential Revision: https://reviews.llvm.org/D68082
llvm-svn: 374166
Also Revert "[LoopVectorize] Fix non-debug builds after rL374017"
This reverts commit 9f41deccc0.
This reverts commit 18b6fe07bc.
The patch is breaking PowerPC internal build, checked with author, reverting
on behalf of him for now due to timezone.
llvm-svn: 374091
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374017
When vectorisation is forced with a pragma, we optimise for min size, and we
need to emit runtime memory checks, then allow this code growth and don't run
in an assert like we currently do.
This is the result of D65197 and D66803, and was a use-case not really
considered before. If this now happens, we emit an optimisation remark warning
about the code-size expansion, which can be avoided by not forcing
vectorisation or possibly source-code modifications.
Differential Revision: https://reviews.llvm.org/D67764
llvm-svn: 372694
Now that the vectorizer can do tail-folding (rL367592), and the ARM backend
understands MVE masked loads/stores (rL371932), it's time to add the MVE
tail-folding equivalent of the X86 tests that I added.
llvm-svn: 371996
Masked loads and store fit naturally with MVE, the instructions being easily
predicated. This adds lowering for the simple cases of masked loads and stores.
It does not yet deal with widening/narrowing or pre/post inc, and so is
currently behind an option.
The llvm masked load intrinsic will accept a "passthru" value, dictating the
values used for the zero masked lanes. In MVE the instructions write 0 to the
zero predicated lanes, so we need to match a passthru that isn't 0 (or undef)
with a select instruction to pull in the correct data after the load.
Differential Revision: https://reviews.llvm.org/D67186
llvm-svn: 371932
Implement a TODO from rL371452, and handle loop invariant addresses in predicated blocks. If we can prove that the load is safe to speculate into the header, then we can avoid using a masked.load in favour of a normal load.
This is mostly about vectorization robustness. In the common case, it's generally expected that LICM/LoadStorePromotion would have eliminated such loads entirely.
Differential Revision: https://reviews.llvm.org/D67372
llvm-svn: 371745
If we're vectorizing a load in a predicated block, check to see if the load can be speculated rather than predicated. This allows us to generate a normal vector load instead of a masked.load.
To do so, we must prove that all bytes accessed on any iteration of the original loop are dereferenceable, and that all loads (across all iterations) are properly aligned. This is equivelent to proving that hoisting the load into the loop header in the original scalar loop is safe.
Note: There are a couple of code motion todos in the code. My intention is to wait about a day - to be sure this sticks - and then perform the NFC motion without furthe review.
Differential Revision: https://reviews.llvm.org/D66688
llvm-svn: 371452
Summary:
Fold-tail currently supports reduction last-vector-value live-out's,
but has yet to support last-scalar-value live-outs, including
non-header phi's. As it relies on AllowedExit in order to detect
them and bail out we need to add the non-header PHI nodes to
AllowedExit, otherwise we end up with miscompiles.
Solves https://bugs.llvm.org/show_bug.cgi?id=43166
Reviewers: fhahn, Ayal
Reviewed By: fhahn, Ayal
Subscribers: anna, hiraditya, rkruppe, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67074
llvm-svn: 370721
Allow vectorizing loops that have reductions when tail is folded by masking.
A select is introduced in VPlan, choosing between the last value carried by the
loop-exit/live-out instruction of the reduction, and the penultimate value
carried by the reduction phi, according to the "i < n" mask of fold-tail.
This select replaces the last value as the live-out value of the loop.
Differential Revision: https://reviews.llvm.org/D66720
llvm-svn: 370173
We don't yet know how to generate these instructions for MVE. And in the case
of VLD3, we don't even have the instruction. For the moment don't tell the
vectoriser that we have VLD4, just to end up serialising the results.
Differential Revision: https://reviews.llvm.org/D66009
llvm-svn: 369101