Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden ehader dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest change below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer include llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k lines less to process is no that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118688
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
This removes another instance of recipe execution still relying on
the cost model.
Depends on D116554.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D116656
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
This removes the remaining dependence on LoopVectorizationCostModel from
buildScalarSteps and is required so it can be moved out of ILV.
It also improves allows us to remove a few unneeded instructions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116554
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.
This has the following benefits:
1. First step to avoid setting both vector and scalar values for the
same induction def.
2. Reducing complexity of widenIntOrFpInduction through making things
more explicit in VPlan
3. Only need to splat the vector IV for block in masks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116123
This reverts commit db49a78900. The reasoning in the patch applied to a downstream branch, and I got myself confused when trying to split apart pieces. Thankfully, the assert was simply weaker than the actual invariant currently upstream which is that ReadyInsts is not empty.
The fact we could have a block with a valid scheduling window, but nothing to schedule was surprising to me. After digging through the code, this can only happen if we don't find anything to directly vectorize. However, the reduction handling code relies on this mode, so we can't simply consider such trees unvectorizeable. The assert conveys both that this situation can happen, but also that it can *only* happen for an immediate gather.
Context: We built the bundle before deciding that vectorization of a bundle is possible. A side effect of bundle construction is manipulating the scheduling window, so a bundle which isn't vectorizable can cause the creation or expansion of a scheduling window.
No need to reorder the top nodes, if they are not stores or
insertelement instructions and each node should be analized only
once, when the bottom-to-top analysis is performed.
We still endup with extractelements for the top node scalars and
the final shuffle just adds an extra cost and currently
crashes the compiler for PHI nodes.
Differential Revision: https://reviews.llvm.org/D116760
This explicitly records whether a scalar IV is needed in the
VPWidenIntOrFpInductionRecipe, to remove a dependence on the cost-model
during its ::execute.
It will also be used in D116123 to determine if a vector phi will be
generated.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118167
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
Differential Revision: https://reviews.llvm.org/D117219
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:
- `fixReduction`: If fixReduction is being called during vectorisation of the
epilogue, the phi node it creates will need to additionally carry incoming
values from the middle block of the main loop.
- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
created by fixReduction are updated after the vec.epilog.iter.check block
is added. The phi is also moved to the preheader of the epilogue.
- `processLoop`: The start value of any VPReductionPHIRecipes are updated before
vectorising the epilogue loop. The getResumeInstr function added to the ILV
will return the resume instruction associated with the recurrence descriptor.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D116928
This patch updates createBlockInMask to always generate
VPWidenCanonicalIVRecipe and adds a transform to optimize it away later,
if it is not needed.
This is a step towards breaking up VPWidenIntOrFpInductionRecipe and
explicitly distinguishing between vector phis and scalarizing.
Split off from D116123.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D117140
This patch adds VPWidenIntOrFpInductionRecipe::isCanonical to check if
an induction recipe is canonical. The code is also updated to use it
instead of isCanonicalID.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D117551
After resetting the start value of the canonical IV, it might not be
canonical any more. Add an assertion to make sure it is only used by its
increment, to avoid potential mis-use. Suggested in D117140.
When SVE is enabled for AArch64 targets it makes more sense to use the
get.active.lane.mask intrinsic, because SVE has an exact 1-1 mapping
from the intrinsic to the 'whilelo' instruction for legal vector types.
This instruction neatly takes overflow into account as well. This patch
fixes an issue in VPInstruction::generateInstruction that assumed we are
only dealing with fixed-width vectors.
Differential Revision: https://reviews.llvm.org/D117109
After d4a8fc3a87 LV stopped adding metadata to disable runtime
unrolling to the vectorized epilogue loop. This was missed because
278aa65cc4 removed the relevant test coverage.
This patch fixes that by adding the relevant metadata after
vector loop generation.
This reverts the revert commit 073c27b5e5.
A reduced test case has been added in 5e4966cbae and the code has
been updated to handle the case where getInductionOpcode returns
BinaryOpsEnd. In this case, the original code was always using
Instruction::Add. Do the same in the patch.
Note this commit may slightly change the value naming, because it now
also assigns the 'induction' name in the floating point case.
Causes a crash with the following (creduce'd) test-case:
clang -O3 '--target=aarch64-grtev4-linux-gnu' -xc - -c -o /dev/null <<EOF
int *e;
int f;
int g() {
int h;
int *j = 0;
while (&f - j > 0) {
int k;
k = j;
if (e == j && *e)
k = 5;
h = k;
j++;
}
return h;
}
EOF
This reverts commit 7ce48be0fd.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.
It must be the last recipe in the exit block of the topmost vector loop
region.
This extracts parts from D113224 and was discussed in D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116479
This is required to query the legality more precisely in the LoopVectorizer.
This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.
Differential Revision: https://reviews.llvm.org/D115329
The code in VPWidenCanonicalIVRecipe::execute only worked for fixed-width
vectors due to the way we generate the values per lane. This patch changes
the code to use a combination of vector splats and step vectors to get
the same result. This then works for both fixed-width and scalable vectors.
Tests that exercise this code path for scalable vectors have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Differential Revision: https://reviews.llvm.org/D113180
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.
It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.
In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:
-prefer-predicate-over-epilogue=predicate-dont-vectorize
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue
For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.
Various tests have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.
Differential Revision: https://reviews.llvm.org/D113003
There is no need to sort inserted instructions by dominance, as the
deletion loop still requires RAUW with undef before deleting. Removing
instructions in reverse insertion order should still insure that the
number of uselist updates is kept to a minimum.
No need to include the order of the scalars beeing used as part of the
alternate vectorization into account when trying to reorder the whole
graph. Such elements better to reorder in the following phase because
the subtree still ends up in shuffle.
Part of D116688, fixes the regression in D116690.
Differential Revision: https://reviews.llvm.org/D116740
There is a bug in the reordering analysis stage. If the element with the
given hash is not added to the map but has the same number of APOs and
instructions with same parent, but different instruction opcode, it will
be initalized with default values and then the counter is increased by
1. But the lane is not updated and default to 0 instead of the actual
`Lane` value. It leads to the fact that the analysis is useless in
many cases and default to lane 0 instead of actual lane with the
minimum amount of APO operands.
Differential Revision: https://reviews.llvm.org/D116690