These intrinsics are now fundamental for SVE code generation and have been
present for a year and a half, hence move them out of the experimental
namespace.
Differential Revision: https://reviews.llvm.org/D127976
D127680 added some unnecessary funnel shift costs for AArch64 to "match
the legacy behaviour". The default costs are closer to the correct
values and line up better with the scalar/NEON costs. Remove the lines
again to clean up the code; they can be added back at a later date with
better values if needed.
This change removes an explicit scalable-vector bailout for fshl and fshr. The bailout was added in 60e4698b9a, when sinking an unconditional bailout for all intrinsics into selected cases. It's not clear whether the bailout was unneeded in the first place, or whether our cost-model infrastructure has simply matured in the meantime. Either way, the generic code appears to handle scalable vectors without issue.
Note that the RISC-V cost model changes here aren't particularly interesting. They probably match the current lowering better, but the main point is to have coverage of the BasicTTI path and simply show the lack of crashing.
AArch64 costing was changed to preserve the legacy behavior. There will most likely be an upcoming change to use the generic costs there too, but I didn't want to make that change myself, not being particularly familiar with the target.
Differential Revision: https://reviews.llvm.org/D127680
TTI::prefersVectorizedAddressing() indicates whether to try to vectorize the addresses that lead to loads.
For AArch64, only gather/scatter (supported by SVE) can deal with vectors of addresses.
This patch specializes the hook for AArch64 to return true only when SVE is enabled.
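A minimal sketch of the shape of that override (eliding the surrounding class,
and assuming the usual pattern of querying the subtarget; not necessarily the
exact patch):

  // Hypothetical sketch: only prefer vectorized addressing when SVE
  // (and therefore gather/scatter) is available.
  bool AArch64TTIImpl::prefersVectorizedAddressing() const {
    return ST->hasSVE();
  }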
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D124612
The patch extends AArch64TTIImpl::instCombineIntrinsic to simplify
llvm.aarch64.neon.f{min,max}nm(a, a) -> a.
This helps with simplifying code written using the ACLE, e.g.
see https://godbolt.org/z/jYxsoc89c
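As an ACLE-level illustration (a sketch of the kind of code that now folds,
not the patch itself):

  #include <arm_neon.h>

  // With both operands equal, the lane-wise fminnm is the identity,
  // so the whole call can be simplified to just 'a'.
  float32x4_t redundant_min(float32x4_t a) {
    return vminnmq_f32(a, a);
  }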
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D125234
This adds some extra costs for reverse shuffles under AArch64, filling
in the i16/f16/i8 gaps in the cost model.
Differential Revision: https://reviews.llvm.org/D124786
This builds on top of the target-independent cost model added in D124269
to add AArch64-specific costs for the fptoui_sat and fptosi_sat intrinsics.
For many common type pairs these will be legal instructions, as the
AArch64 conversion instructions saturate naturally. For unsupported pairs
of integer and floating-point types, an additional min/max clamp is needed.
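For reference, a scalar sketch of the saturating semantics being costed
(a hypothetical helper, not LLVM code): out-of-range inputs clamp to the
integer limits and NaN maps to 0, which is what fcvtzs/fcvtzu give for
legal type pairs.

  #include <cmath>
  #include <cstdint>
  #include <limits>

  // Rough model of llvm.fptosi.sat.i32.f32.
  int32_t fptosi_sat_i32(float X) {
    if (std::isnan(X))
      return 0;
    if (X <= (float)std::numeric_limits<int32_t>::min())
      return std::numeric_limits<int32_t>::min();
    if (X >= (float)std::numeric_limits<int32_t>::max())
      return std::numeric_limits<int32_t>::max();
    return (int32_t)X;
  }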
Differential Revision: https://reviews.llvm.org/D124357
Given a larger-than-legal shuffle mask, the final codegen will split
into multiple sub-vectors. This attempts to model that in
AArch64TTIImpl::getShuffleCost, splitting masks up according to the size
of the legalized vectors. If the sub-masks have at most 2 input sources
we can call getShuffleCost on them and sum the costs, to get a more
accurate final cost for the entire shuffle. The call to
improveShuffleKindFromMask helps to improve the shuffle kind for the
sub-mask cost call.
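A conceptual sketch of the splitting idea (hypothetical standalone code,
not the actual getShuffleCost implementation):

  #include <set>
  #include <vector>

  // For one legal-width chunk of a long shuffle mask, count how many
  // distinct legalized input sub-vectors it reads from. Chunks with at
  // most 2 sources can be costed with a single getShuffleCost query.
  int numSourcesInChunk(const std::vector<int> &Mask, int LegalWidth,
                        int Chunk) {
    std::set<int> Sources;
    for (int I = 0; I < LegalWidth; ++I) {
      int Idx = Mask[Chunk * LegalWidth + I];
      if (Idx >= 0)                        // -1 / undef lanes don't count
        Sources.insert(Idx / LegalWidth);  // which legalized sub-vector
    }
    return (int)Sources.size();
  }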
Differential Revision: https://reviews.llvm.org/D123414
Given a shuffle with 4 elements of size 16 or 32, we can use the costs
directly from the PerfectShuffle tables to get a slightly more accurate
cost for the resulting shuffle.
Differential Revision: https://reviews.llvm.org/D123409
Before this patch `Args` was used by SLP to pass a broadcast's arguments.
This patch changes that: `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
The original patch (https://reviews.llvm.org/D121354) targets x86 and adjusts
the lookahead score of splat loads, as they can be done by the `movddup`
instruction, which combines the load and the broadcast and is cheap to execute.
A similar issue shows up on AArch64. The `ld1r` instruction performs a broadcast
load and is cheap to execute.
This patch implements the TargetTransformInfo hooks for AArch64.
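An ACLE-level illustration of the pattern in question (a sketch):

  #include <arm_neon.h>

  // A broadcast of a loaded scalar is a single ld1r on AArch64, e.g.
  //   ld1r { v0.4s }, [x0]
  // so the SLP lookahead heuristic can treat such splat loads as cheap.
  float32x4_t splat_load(const float *P) {
    return vld1q_dup_f32(P);
  }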
Differential Revision: https://reviews.llvm.org/D123638
An insert subvector under AArch64 can often be done as a single lane mov
operation. For example a v4i8 inserted into a v16i8 is an s-reg mov, so
long as the index is a multiple of 4. This teaches the cost model that,
using code copied over from the X86 backend.
Some of the costs (v16i16_4_0) are still high because they get matched
as an SK_Select, not an SK_InsertSubvector. D120879 has some codegen
tests for inserting subvectors, which were added as
llvm/test/CodeGen/AArch64/insert-subvector.ll.
Differential Revision: https://reviews.llvm.org/D120880
The cost of a v2i64 multiply was special cased in D92208 as scalarized
into 4*extract + 2*insert + 2*mul. Scalarizing to/from GPR registers is
expensive though, and the cost wasn't high enough to prevent vectorizing
in places where it can be detrimental for performance. This increases
the cost of each copy to/from a GPR to 2, with the total cost increasing
to 14. So long as umull/smull are handled correctly (as in D123006) this
seems to lead to better vectorization factors and better performance.
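(Assuming each multiply itself still costs 1, that is 4 extracts * 2 +
2 inserts * 2 + 2 muls * 1 = 14.)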
Differential Revision: https://reviews.llvm.org/D123007
A vector mul(sext, sext) or mul(zext, zext) will be code generated as a
single smull or umull instruction. This most notably affects v2i64
multiplies, which are otherwise not legal and need to be expanded.
The one-use check has also been slightly changed, as it is already
checked from the use of isWideningInstruction in getCastInstrCost.
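An ACLE-level illustration (a sketch): the widening multiply below is a
single smull, so costing it as an expanded v2i64 multiply would be far
too pessimistic.

  #include <arm_neon.h>

  // Codegen: smull v0.2d, v0.2s, v1.2s
  int64x2_t widening_mul(int32x2_t A, int32x2_t B) {
    return vmull_s32(A, B);
  }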
Differential Revision: https://reviews.llvm.org/D123006
Currently the cost model underestimates the cost of certain
FP16 conversions.
This patch updates getCastInstrCost to return more accurate costs for
the cases improved in c2ed9fd054.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D113700
The cost of vector shifts was 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.
Differential Revision: https://reviews.llvm.org/D120773
In Clang we can attach TBAA metadata to the load/store intrinsics based
on the operation's element type.
This also contains changes to InstCombine, where the AArch64-specific
intrinsics are transformed into generic LLVM load/store operations,
to ensure that all metadata is transferred to the new instruction.
There will be some further work after this patch to also emit TBAA
metadata for SVE's gather/scatter and structure load/store intrinsics.
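An ACLE-level illustration of where the metadata comes from (a sketch;
the exact metadata emitted is whatever Clang's normal TBAA rules give
for the element type):

  #include <arm_sve.h>

  // Lowered to llvm.aarch64.sve.ld1; the element type (float) is what
  // the TBAA metadata can now describe, and InstCombine may later turn
  // the call into a generic load that keeps that metadata.
  svfloat32_t load_f32(svbool_t Pg, const float *P) {
    return svld1_f32(Pg, P);
  }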
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D119319
The current cost model overestimates the cost of vector compares &
selects for ordered floating-point compares. This patch fixes that by
extending the existing logic for integer predicates.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D118256
We have some bitcasts which we know will be simplified,
so their cost is zero.
Reviewed By: david-arm, sdesmalen
Differential Revision: https://reviews.llvm.org/D118019
Generated code resulted in redundant aarch64.sve.convert.to.svbool
calls for AArch64 binary operations. Narrow the more precise operands
instead of widening the less precise operands.
Differential Revision: https://reviews.llvm.org/D116730
A lot of NEON intrinsics work lane-wise, meaning that elements that are
not demanded in the output do not demand the corresponding input
elements. This teaches that to
AArch64TTIImpl::simplifyDemandedVectorEltsIntrinsic for some simple
single-input truncate intrinsics, which can help remove unnecessary
instructions.
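An ACLE-level illustration of the lane-wise property (a sketch, assuming
the saturating narrows are among the handled intrinsics):

  #include <arm_neon.h>

  // vqmovn_s32 maps to llvm.aarch64.neon.sqxtn; each output lane only
  // depends on the matching input lane, so extracting lane 0 of the
  // result demands only lane 0 of 'A'.
  int16_t only_lane0(int32x4_t A) {
    int16x4_t N = vqmovn_s32(A);
    return vget_lane_s16(N, 0);
  }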
Differential Revision: https://reviews.llvm.org/D117097
If we are inserting into or extracting from a scalable vector we do
not know the number of elements until runtime, so we can only let the
index wrap for fixed-length vectors.
Tests added here:
Analysis/CostModel/AArch64/sve-insert-extract.ll
Differential Revision: https://reviews.llvm.org/D117099
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
AArch64, which are usually lowered quite efficiently.
Differential Revision: https://reviews.llvm.org/D116734
This adds some AArch64 specific smul_with_overflow and umul_with_overflow
costs, overriding the default costs. The code generation for these mul
with overflow intrinsics is usually better than the default expansion on
AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various
types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.
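A sketch of the kind of source whose cost this models (the builtin maps
to llvm.smul.with.overflow):

  #include <cstdint>

  // On AArch64 the i32 form lowers to roughly smull + a compare of the
  // widened result, rather than the generic expansion.
  bool checked_mul(int32_t A, int32_t B, int32_t *Out) {
    return __builtin_mul_overflow(A, B, Out);
  }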
Differential Revision: https://reviews.llvm.org/D116732
This patch adds an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.
Differential Revision: https://reviews.llvm.org/D115143
This patch extends the "is all active predicate" check to cover
cases where the predicate is cast, but in a way that doesn't
change its "all active" status.
Differential Revision: https://reviews.llvm.org/D115047
We cannot swap the multiplicand and multiplier because the SVE intrinsics
are predicated. Imagine lanes in the vectors having the following values:
pg = 0
multiplicand = 1 (from dup)
multiplier = 2
The resulting value should be 1, but if we swap the multiplicand and multiplier it
becomes 2, which is incorrect.
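An ACLE-level sketch of the hazard (hypothetical code): with the merging
form, inactive lanes take their value from the first operand, so the
operands are not interchangeable.

  #include <arm_sve.h>

  svint32_t f(svbool_t Pg, svint32_t X) {
    svint32_t One = svdup_n_s32(1);
    // Inactive lanes keep the value 1 (from the first operand).
    // svmul_s32_m(Pg, X, One) would keep X's lanes instead, so the two
    // forms are not equivalent.
    return svmul_s32_m(Pg, One, X);
  }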
Differential Revision: https://reviews.llvm.org/D114577