I have updated cheapToScalarize to also consider the case when
extracting lanes from a stepvector intrinsic. This required removing
the existing 'bool IsConstantExtractIndex' and passing in the actual
index as a Value instead. We do this because we need to know if the
index is <= known minimum number of elements returned by the stepvector
intrinsic. Effectively, when extracting lane X from a stepvector we
know the value returned is also X.
New tests added here:
Transforms/InstCombine/vscale_extractelement.ll
Differential Revision: https://reviews.llvm.org/D106358
In all of these, the value must be an instruction for us to succeed anyway,
so change it to maybe hopefully make further changes more straight-forward.
This patch allows folding stepvector + extract to the lane when the lane is
lower than the minimum size of the scalable vector. This fold is possible
because lane X of a stepvector is also X!
For instance, extracting element 3 of a <vscale x 4 x i64>stepvector is 3.
Differential Revision: https://reviews.llvm.org/D103153
This is a patch that replaces shufflevector and insertelement's placeholder value with poison.
Underlying motivation is to fix the semantics of shufflevector with undef mask to return poison instead
(D93818)
The consensus has been made in the late 2020 via mailing list as well as the thread in https://bugs.llvm.org/show_bug.cgi?id=44185 .
This patch is a simple syntactic change to the existing code, hence directly pushed as a commit.
We sometimes see code like this:
Case 1:
%gep = getelementptr i32, i32* %a, <2 x i64> %splat
%ext = extractelement <2 x i32*> %gep, i32 0
or this:
Case 2:
%gep = getelementptr i32, <4 x i32*> %a, i64 1
%ext = extractelement <4 x i32*> %gep, i32 0
where there is only one use of the GEP. In such cases it makes
sense to fold the two together such that we create a scalar GEP:
Case 1:
%ext = extractelement <2 x i64> %splat, i32 0
%gep = getelementptr i32, i32* %a, i64 %ext
Case 2:
%ext = extractelement <2 x i32*> %a, i32 0
%gep = getelementptr i32, i32* %ext, i64 1
This may create further folding opportunities as a result, i.e.
the extract of a splat vector can be completely eliminated. Also,
even for the general case where the vector operand is not a splat
it seems beneficial to create a scalar GEP and extract the scalar
element from the operand. Therefore, in this patch I've assumed
that a scalar GEP is always preferrable to a vector GEP and have
added code to unconditionally fold the extract + GEP.
I haven't added folds for the case when we have both a vector of
pointers and a vector of indices, since this would require
generating an additional extractelement operation.
Tests have been added here:
Transforms/InstCombine/gep-vector-indices.ll
Differential Revision: https://reviews.llvm.org/D101900
Some intrinsics wrapper code has the habit of ignoring the type of the
elements in vectors, thinking of vector registers as a "bag of bits". As
a consequence, some operations are shared between vectors of different
types are shared. For example, functions that rearrange elements in a
vector can be shared between vectors of int32 and float.
This can result in bitcasts in awkward places that prevent the backend
from recognizing some instructions. For AArch64 in particular, it
inhibits the selection of dup from a general purpose register (GPR), and
mov from GPR to a vector lane.
This patch adds a pattern in InstCombine to move the bitcasts past the
shufflevector if this is possible. Sometimes this even allows
InstCombine to remove the bitcast entirely, as in the included tests.
Alternatively this could be done with a few extra patterns in the
AArch64 backend, but InstCombine seems like a better place for this.
Differential Revision: https://reviews.llvm.org/D97397
This patch changes ElementCount so that the Min and Scalable
members are now private and can only be accessed via the get
functions getKnownMinValue() and isScalable(). In addition I've
added some other member functions for more commonly used operations.
Hopefully this makes the class more useful and will reduce the
need for calling getKnownMinValue().
Differential Revision: https://reviews.llvm.org/D86065
Now that we no longer require for this map to have stable iteration order,
we no longer need to pay for keeping the iteration order stable,
so switch from `SmallMapVector` to `SmallDenseMap`.
While it may seem like we can just "deduplicate" the case where
some basic block happens to be a predecessor more than once,
which happens for e.g. switches, that is not correct thing to do.
We must actually add a PHI operand for each predecessor.
This was initially reported to me by David Major
as a clang crash during gecko build for android.
While the original implementation added in D85787 / ae7f08812e
is not incorrect, it is known to be suboptimal.
In particular, it is not incorrect to use the basic block
in which the original `insertvalue` instruction is located
as the merge point, that is not necessarily optimal,
as `@test6` shows.
We should look at all the AggElts, and, if they are all defined
in the same basic block, then that is the basic block we should use.
On RawSpeed library, this catches +4% (+50) more cases.
On vanilla LLVM test-suits, this catches +12% (+92) more cases.
In a following patch, UseBB will be detected later,
so capturing it is potentially error-prone (capture by ref vs by val).
Also, parametrized UseBB will likely be needed
for multiple levels of PHI indirections later on anyways.
This is NFC at the moment, because right now we always insert the PHI
into the same basic block in which the original `insertvalue` instruction
is, but that will change.
Also, fixes addition of the suffix to the value names.
With gcc 6.3.0, I hit the following compilation bug.
../lib/Transforms/InstCombine/InstCombineVectorOps.cpp:937:2: error: extra ‘;’ [-Werror=pedantic]
};
^
cc1plus: all warnings being treated as errors
The error is introduced by Commit ae7f08812e ("[InstCombine]
Aggregate reconstruction simplification (PR47060)")
This pattern happens in clang C++ exception lowering code, on unwind branch.
We end up having a `landingpad` block after each `invoke`, where RAII
cleanup is performed, and the elements of an aggregate `{i8*, i32}`
holding exception info are `extractvalue`'d, and we then branch to common block
that takes extracted `i8*` and `i32` elements (via `phi` nodes),
form a new aggregate, and finally `resume`'s the exception.
The problem is that, if the cleanup block is effectively empty,
it shouldn't be there, there shouldn't be that `landingpad` and `resume`,
said `invoke` should be a `call`.
Indeed, we do that simplification in e.g. SimplifyCFG `SimplifyCFGOpt::simplifyResume()`.
But the thing is, all this extra `extractvalue` + `phi` + `insertvalue` cruft,
while it is pointless, does not look like "empty cleanup block".
So the `SimplifyCFGOpt::simplifyResume()` fails, and the exception is has
higher cost than it could have on unwind branch :S
This doesn't happen *that* often, but it will basically happen once per C++
function with complex CFG that called more than one other function
that isn't known to be `nounwind`.
I think, this is a missing fold in InstCombine, so i've implemented it.
I think, the algorithm/implementation is rather self-explanatory:
1. Find a chain of `insertvalue`'s that fully tell us the initializer of the aggregate.
2. For each element, try to find from which aggregate it was extracted.
If it was extracted from the aggregate with identical type,
from identical element index, great.
3. If all elements were found to have been extracted from the same aggregate,
then we can just use said original source aggregate directly,
instead of re-creating it.
4. If we fail to find said aggregate when looking only in the current block,
we need be PHI-aware - we might have different source aggregate when coming
from each predecessor.
I'm not sure if this already handles everything, and there are some FIXME's,
i'll deal with all that later in followups.
I'd be fine with going with post-commit review here code-wise,
but just in case there are thoughts, i'm posting this.
On RawSpeed, for example, this has the following effect:
```
| statistic name | baseline | proposed | Δ | % | abs(%) |
|---------------------------------------------------|---------:|---------:|------:|--------:|-------:|
| instcombine.NumAggregateReconstructionsSimplified | 0 | 1253 | 1253 | 0.00% | 0.00% |
| simplifycfg.NumInvokes | 948 | 1355 | 407 | 42.93% | 42.93% |
| instcount.NumInsertValueInst | 4382 | 3210 | -1172 | -26.75% | 26.75% |
| simplifycfg.NumSinkCommonCode | 574 | 458 | -116 | -20.21% | 20.21% |
| simplifycfg.NumSinkCommonInstrs | 1154 | 921 | -233 | -20.19% | 20.19% |
| instcount.NumExtractValueInst | 29017 | 26397 | -2620 | -9.03% | 9.03% |
| instcombine.NumDeadInst | 166618 | 174705 | 8087 | 4.85% | 4.85% |
| instcount.NumPHIInst | 51526 | 50678 | -848 | -1.65% | 1.65% |
| instcount.NumLandingPadInst | 20865 | 20609 | -256 | -1.23% | 1.23% |
| instcount.NumInvokeInst | 34023 | 33675 | -348 | -1.02% | 1.02% |
| simplifycfg.NumSimpl | 113634 | 114708 | 1074 | 0.95% | 0.95% |
| instcombine.NumSunkInst | 15030 | 14930 | -100 | -0.67% | 0.67% |
| instcount.TotalBlocks | 219544 | 219024 | -520 | -0.24% | 0.24% |
| instcombine.NumCombined | 644562 | 645805 | 1243 | 0.19% | 0.19% |
| instcount.TotalInsts | 2139506 | 2135377 | -4129 | -0.19% | 0.19% |
| instcount.NumBrInst | 156988 | 156821 | -167 | -0.11% | 0.11% |
| instcount.NumCallInst | 1206144 | 1207076 | 932 | 0.08% | 0.08% |
| instcount.NumResumeInst | 5193 | 5190 | -3 | -0.06% | 0.06% |
| asm-printer.EmittedInsts | 948580 | 948299 | -281 | -0.03% | 0.03% |
| instcount.TotalFuncs | 11509 | 11507 | -2 | -0.02% | 0.02% |
| inline.NumDeleted | 97595 | 97597 | 2 | 0.00% | 0.00% |
| inline.NumInlined | 210514 | 210522 | 8 | 0.00% | 0.00% |
```
So we manage to increase the amount of `invoke` -> `call` conversions in SimplifyCFG by almost a half,
and there is a very apparent decrease in instruction and basic block count.
On vanilla llvm-test-suite:
```
| statistic name | baseline | proposed | Δ | % | abs(%) |
|---------------------------------------------------|---------:|---------:|------:|--------:|-------:|
| instcombine.NumAggregateReconstructionsSimplified | 0 | 744 | 744 | 0.00% | 0.00% |
| instcount.NumInsertValueInst | 2705 | 2053 | -652 | -24.10% | 24.10% |
| simplifycfg.NumInvokes | 1212 | 1424 | 212 | 17.49% | 17.49% |
| instcount.NumExtractValueInst | 21681 | 20139 | -1542 | -7.11% | 7.11% |
| simplifycfg.NumSinkCommonInstrs | 14575 | 14361 | -214 | -1.47% | 1.47% |
| simplifycfg.NumSinkCommonCode | 6815 | 6743 | -72 | -1.06% | 1.06% |
| instcount.NumLandingPadInst | 14851 | 14712 | -139 | -0.94% | 0.94% |
| instcount.NumInvokeInst | 27510 | 27332 | -178 | -0.65% | 0.65% |
| instcombine.NumDeadInst | 1438173 | 1443371 | 5198 | 0.36% | 0.36% |
| instcount.NumResumeInst | 2880 | 2872 | -8 | -0.28% | 0.28% |
| instcombine.NumSunkInst | 55187 | 55076 | -111 | -0.20% | 0.20% |
| instcount.NumPHIInst | 321366 | 320916 | -450 | -0.14% | 0.14% |
| instcount.TotalBlocks | 886816 | 886493 | -323 | -0.04% | 0.04% |
| instcount.TotalInsts | 7663845 | 7661108 | -2737 | -0.04% | 0.04% |
| simplifycfg.NumSimpl | 886791 | 887171 | 380 | 0.04% | 0.04% |
| instcount.NumCallInst | 553552 | 553733 | 181 | 0.03% | 0.03% |
| instcombine.NumCombined | 3200512 | 3201202 | 690 | 0.02% | 0.02% |
| instcount.NumBrInst | 741794 | 741656 | -138 | -0.02% | 0.02% |
| simplifycfg.NumHoistCommonInstrs | 14443 | 14445 | 2 | 0.01% | 0.01% |
| asm-printer.EmittedInsts | 7978085 | 7977916 | -169 | 0.00% | 0.00% |
| inline.NumDeleted | 73188 | 73189 | 1 | 0.00% | 0.00% |
| inline.NumInlined | 291959 | 291968 | 9 | 0.00% | 0.00% |
```
Roughly similar effect, less instructions and blocks total.
See also: rGe492f0e03b01a5e4ec4b6333abb02d303c3e479e.
Compile-time wise, this appears to be roughly geomean-neutral:
http://llvm-compile-time-tracker.com/compare.php?from=39617aaed95ac00957979bc1525598c1be80e85e&to=b59866cf30420da8f8e3ca239ed3bec577b23387&stat=instructions
And this is a win size-wize in general:
http://llvm-compile-time-tracker.com/compare.php?from=39617aaed95ac00957979bc1525598c1be80e85e&to=b59866cf30420da8f8e3ca239ed3bec577b23387&stat=size-text
See https://bugs.llvm.org/show_bug.cgi?id=47060
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D85787
For a long time, the InstCombine pass handled target specific
intrinsics. Having target specific code in general passes was noted as
an area for improvement for a long time.
D81728 moves most target specific code out of the InstCombine pass.
Applying the target specific combinations in an extra pass would
probably result in inferior optimizations compared to the current
fixed-point iteration, therefore the InstCombine pass resorts to newly
introduced functions in the TargetTransformInfo when it encounters
unknown intrinsics.
The patch should not have any effect on generated code (under the
assumption that code never uses intrinsics from a foreign target).
This introduces three new functions:
TargetTransformInfo::instCombineIntrinsic
TargetTransformInfo::simplifyDemandedUseBitsIntrinsic
TargetTransformInfo::simplifyDemandedVectorEltsIntrinsic
A few target specific parts are left in the InstCombine folder, where
it makes sense to share code. The largest left-over part in
InstCombineCalls.cpp is the code shared between arm and aarch64.
This allows to move about 3000 lines out from InstCombine to the targets.
Differential Revision: https://reviews.llvm.org/D81728
We have a transform in the opposite direction only for the x86 MMX type,
Other types are not handled either way before this patch.
The motivating case from PR45748:
https://bugs.llvm.org/show_bug.cgi?id=45748
...is the last test diff. In that example, we are triggering an existing
bitcast transform, so we reduce the number of casts, and that should give
us the ideal x86 codegen.
Differential Revision: https://reviews.llvm.org/D79171
Summary:
This patch fix the following issues with visitExtractElementInst:
1. Restrict VectorUtils::findScalarElement to fixed-length vector.
For scalable type, the number of elements in shuffle mask is
unknown at compile-time.
2. Fix out-of-range calculation for fixed-length vector.
3. Skip scalable type when analysis rely on fixed number of elements.
4. Add unit tests to check functionality of extractelement for scalable type.
Reviewers: sdesmalen, efriedma, spatel, nikic
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78267
Summary:
This patch fixes the following issues in visitInsertElementInst:
1. Bail out for scalable type when analysis requires fixed size number of vector elements.
2. Use cast<FixedVectorType> to get vector number of elements. This ensure assertion
on scalable vector type.
3. For scalable type, avoid folding a chain of insertelement into splat:
insertelt(insertelt(insertelt(insertelt X, %k, 0), %k, 1), %k, 2) ...
->
shufflevector(insertelt(X, %k, 0), undef, zero)
The length of scalable vector is unknown at compile-time, therefore we don't know if
given insertelement sequence is valid for splat.
Reviewers: sdesmalen, efriedma, spatel, nikic
Reviewed By: sdesmalen, efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78895
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.
While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
Summary:
Remove usages of asserting vector getters in Type in preparation for the
VectorType refactor. The existence of these functions complicates the
refactor while adding little value.
Reviewers: sdesmalen, rriddle, efriedma
Reviewed By: sdesmalen
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77263
As discussed in D76983, that patch can turn a chain of insert/extract
with scalar trunc ops into bitcast+extract and existing instcombine
vector transforms end up creating a shuffle out of that (see the
PhaseOrdering test for an example). Currently, that process requires
at least this sequence: -instcombine -early-cse -instcombine.
Before D76983, the sequence of insert/extract would reach the SLP
vectorizer and become a vector trunc there.
Based on a small sampling of public targets/types, converting the
shuffle to a trunc is better for codegen in most cases (and a
regression of that form is the reason this was noticed). The trunc is
clearly better for IR-level analysis as well.
This means that we can induce "spontaneous vectorization" without
invoking any explicit vectorizer passes (at least a vector cast op
may be created out of scalar casts), but that seems to be the right
choice given that we started with a chain of insert/extract, and the
backend would expand back to that chain if a target does not support
the op.
Differential Revision: https://reviews.llvm.org/D77299
Instead, represent the mask as out-of-line data in the instruction. This
should be more efficient in the places that currently use
getShuffleVector(), and paves the way for further changes to add new
shuffles for scalable vectors.
This doesn't change the syntax in textual IR. And I don't currently plan
to change the bitcode encoding in this patch, although we'll probably
need to do something once we extend shufflevector for scalable types.
I expect that once this is finished, we can then replace the raw "mask"
with something more appropriate for scalable vectors. Not sure exactly
what this looks like at the moment, but there are a few different ways
we could handle it. Maybe we could try to describe specific shuffles.
Or maybe we could define it in terms of a function to convert a fixed-length
array into an appropriate scalable vector, using a "step", or something
like that.
Differential Revision: https://reviews.llvm.org/D72467
Summary: Rewrite the fsub-0.0 idiom to fneg and always emit fneg for fp
negation. This also extends the scalarization cost in instcombine for unary
operators to result in the same IR rewrites for fneg as for the idiom.
Reviewed By: cameron.mcinally
Differential Revision: https://reviews.llvm.org/D75467
This is a followup to D73803, which uses the replaceOperand()
helper in more places.
This should be NFC apart from changes to worklist order.
Differential Revision: https://reviews.llvm.org/D73919
As discussed on D73919, this replaces a few cases where we were
modifying multiple operands of instructions in-place with the
creation of a new instruction, which we generally prefer nowadays.
This tends to be more readable and less prone to worklist management
bugs.
Test changes are only superficial (instruction naming and order).