This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.
This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.
Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.
Differential Revision: https://reviews.llvm.org/D76212
If we're inserting into v2i8/v4i8/v8i8/v2i16/v4i16 style sub-128bit vectors ensure we don't use the SK_PermuteTwoSrc cost of the legalized value type - this is a followup to rG12c629ec6c59 which added equivalent sub-128bit shuffle costs
This is similar to what I recently did for getArithmeticReductionCost.
I'm trying to account for the narrowing from 512->256->128 as we go.
I've also added a new helper method getMinMaxCost that tries to
handle the cases where we have native min/max instructions and
fall back to cmp+select when we don't.
Differential Revision: https://reviews.llvm.org/D76634
v2i8/v4i8/v8i8 + v2i16/v4i16 all show up in vectorizer code and by just using the legalized types (v16i8/v8i16) we're highly exaggerating the actual cost of the shuffle.
This patch attempts to more accurately model the reduction of
power of 2 vectors of types we natively support. This takes into
account the narrowing of vectors that occur as we go from 512
bits to 256 bits, to 128 bits. It also takes into account the use
of wider elements in the shuffles for the first 2 steps of a
reduction from 128 bits. And uses a v8i16 shift for the final step
of vXi8 reduction.
The default implementation uses the legalized type for the arithmetic
for all levels. And uses the single source permute cost of the
legalized type for all levels. This penalizes things like
lack of v16i8 pshufb on pre-sse3 targets and the splitting and
joining that needs to be done for integer types on AVX1. We never
need v16i8 shuffle for a reduction and we only need split AVX1 ops
when type the type wide and needs to be split. I think we're still
over costing splits and joins for AVX1, but we're closer now.
I've also removed all pairwise special casing because I don't
think we ever want to generate that on X86. I've also adjusted
the add handling to more accurately account for any type splitting
that occurs before we reach a legal type.
Differential Revision: https://reviews.llvm.org/D76478
Previously we multiplied the cost for the table entries by the number of splits needed. But that implies that each split goes through a reduction to scalar independently. I think what really happens is that the we AND/OR the split pieces until we're down to a single value with a legal type and then do special reduction sequence on that.
So to model that this patch takes the number of splits minus one multiplied by the cost of a AND/OR at the legal element count and adds that on top of the table lookup.
Differential Revision: https://reviews.llvm.org/D76400
Refines the gather/scatter cost model, but also changes the TTI
function getIntrinsicInstrCost to accept an additional parameter
which is needed for the gather/scatter cost evaluation.
This did require trivial changes in some non-ARM backends to
adopt the new parameter.
Extending gathers and truncating scatters are now priced cheaper.
Differential Revision: https://reviews.llvm.org/D75525
This tries to improve the accuracy of extract/insert element costs by accounting for subvector extraction/insertion for >128-bit vectors and the shuffling of elements to/from the 0'th index.
It also adds INSERTPS for f32 types and PINSR/PEXTR costs for integer types (at the moment we assume the same cost as MOVD/MOVQ - which isn't always true).
Differential Revision: https://reviews.llvm.org/D74976
D74976 will handle larger vector types, but since SLM doesn't support AVX+ then we will always be extracting from 128-bit vectors so don't need to scale the cost.
A cost query for a vector instruction should return a cost even without
target vector support, and not trigger an assert.
VectorCombine does this with an input containing source code vectors.
Review: Ulrich Weigand
We seem to be inheriting the cost from sse4.1. But if we have 256-bit registers we should be able to do this with just one extract to split the 16i16 and two v8i16->v8i32 operations so our cost should be 3 not 4.
Differential Revision: https://reviews.llvm.org/D73646
This is a very basic MVE gather/scatter cost model, based roughly on the
code that we will currently produce. It does not handle truncating
scatters or extending gathers correctly yet, as it is difficult to tell
that they are going to be correctly extended/truncated from the limited
information in the cost function.
This can be improved as we extend support for these in the future.
Based on code originally written by David Sherwood.
Differential Revision: https://reviews.llvm.org/D73021
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
This adds some extra cost model tests for shifts, and does some minor
adjustments to some Neon code to make it clear as to what it applies to.
Both NFC.
This is a follow-up to D70607 where we made any
extract element on SLM more costly than default. But that is
pessimistic for extract from element 0 because that corresponds
to x86 movd/movq instructions. These generally have >1 cycle
latency, but they are probably implemented as single uop
instructions.
Note that no vectorization tests are affected by this change.
Also, no targets besides SLM are affected because those are
falling through to the default cost of 1 anyway. But this will
become visible/important if we add more specializations via cost
tables.
Differential Revision: https://reviews.llvm.org/D71023
The Power 9 CPU has some features that are unlikely to be passed on to future
versions of the CPU. This patch separates this out so that future CPU does not
inherit them.
Differential Revision: https://reviews.llvm.org/D70466
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
This is no longer needed after widening legalization as we
custom legalize v8i8 ourselves.
Added entries to the cost model, but bumped the cost slightly
to account for the truncate shuffle that wasn't costed before.
ShuffleVectorInst::isExtractSubvectorMask, introduced in
[CostModel] Add SK_ExtractSubvector handling to getInstructionThroughput (PR39368)
erroneously thought that
%340 = shufflevector <4 x float> %339, <4 x float> undef, <3 x i32> <i32 2, i32 3, i32 undef>
is a subvector extract, even though it goes off the end of the parent
vector with the undef index. That then caused an assert in
BasicTTIImplBase::getExtractSubvectorOverhead.
This commit fixes that, by not considering the above a subvector
extract.
Differential Revision: https://reviews.llvm.org/D70005
Change-Id: I87b8b00b24bef19ffc9a1b82ef4eca3b8a246eaf