656 Commits

Author SHA1 Message Date
Simon Pilgrim 8c82d42e97 [TTI][X86] Pull out repeated getSizeInBits() calls. NFC. 2022-02-10 18:58:32 +00:00
Rosie Sumpter 552eb372cb [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function, 'forceScalarizeMaskedGather/Scatter',
to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.

Differential Revision: https://reviews.llvm.org/D115329
2022-01-12 13:34:12 +00:00
Simon Pilgrim 5eb47961c4 [CostModel][X86] Update ROTL/ROTR vXi8/vXi16 costs on AVX512BW targets
Refresh based off recent improvements to codegen and the helper script from D103695
2022-01-10 13:18:25 +00:00
Nikita Popov 7c3cf4c2c0 [Inline][X86] Avoid inlining if it would create ABI-incompatible calls (PR52660)
X86 allows inlining functions if the callee target features are a
subset of the caller target features. This ensures that we don't
inline something into a caller that does not support it.

However, this does not account for possible call ABI mismatches as
a result of inlining. If a call passing a vector argument was
originally in a -avx function, calling another -avx function, the
vector is passed in xmm. If we now inline it into a +avx function,
then it will be passed in ymm, even though the callee expects it in xmm.

Fix this by scanning over all calls in the function and checking
whether ABI incompatibility is possible. Calls that only pass scalar
types are excluded, as I believe those always use the same ABI
independent of target features.
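
The rule described above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ hook: the name `can_inline`, the string feature sets, and the "scalar"/"vector" labels are all hypothetical, and the shortcut for identical feature sets is an assumption rather than something this message states.

```python
def can_inline(caller_features, callee_features, callee_calls):
    """Hypothetical model of the inlining check described in the message.

    caller_features / callee_features: sets of target-feature names.
    callee_calls: one list per call inside the callee, holding the kind of
                  each argument/return type: "scalar" or "vector".
    """
    # Existing rule: callee features must be a subset of caller features.
    if not callee_features <= caller_features:
        return False
    # Assumption: identical feature sets cannot change the call ABI.
    if caller_features == callee_features:
        return True
    # New rule: with differing features, a call passing a non-scalar type
    # might switch ABI (e.g. xmm vs ymm), so refuse to inline.
    return all(kind == "scalar" for call in callee_calls for kind in call)
```

Under this model, a callee without AVX that makes a call passing a vector would not be inlined into a caller with AVX, while a callee whose calls only pass scalars would be.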

Fixes https://github.com/llvm/llvm-project/issues/52660.

Differential Revision: https://reviews.llvm.org/D116036
2021-12-27 09:36:21 +01:00
Nikita Popov f5ac23b5ae [ArgPromotion][TTI] Pass types to ABI compatibility hook
The areFunctionArgsABICompatible() hook currently accepts a list of
pointer arguments, though what we're actually interested in is the
ABI compatibility after these pointer arguments have been converted
into value arguments.

This means that a) the current API is incompatible with opaque
pointers (because it requires inspection of pointee types) and
b) it can only be used in the specific context of ArgPromotion.
I would like to reuse the API when inspecting calls during inlining.

This patch converts it into an areTypesABICompatible() hook, which
accepts a list of types. This makes the method more generally usable,
and compatible with opaque pointers from an API perspective (the
actual usage in ArgPromotion/Attributor is still incompatible,
I'll follow up on that in separate patches).

Differential Revision: https://reviews.llvm.org/D116031
2021-12-22 09:37:51 +01:00
Simon Pilgrim 30e38d6771 [CostModel][X86] Split MUL/SDIV+SREM/UDIV+UREM PowerOf2 handling. NFC.
This is an NFC cleanup to simplify some upcoming refactoring required to address the regressions in D111968.
2021-12-08 18:47:12 +00:00
Haohai Wen d2c093e79d [CostModel][X86] Add i64 mul cost for avx512 as 1cy
The i64 mul cost is 1cy for all CPUs that support AVX512. Currently
all X86 CPUs use the i64 mul cost from the X64 cost table, which is not
correct for CPUs that support AVX512 (skx, icx).

Reviewed By: pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D115016
2021-12-08 11:29:08 +08:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR; we perform scalar GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.
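
The arithmetic behind this complaint can be sketched as below. Only the two constants come from the message; the function names are hypothetical and the model ignores everything else the vectorizer adds to the cost.

```python
VECTOR_ADDRESS_COST = 10  # from the message: X86 cost of a vector address computation
SCALAR_ADDRESS_COST = 1   # from the message: cost of a scalar address computation

def scalarized_address_cost(vf, per_address_cost):
    """Total address-computation cost for `vf` scalarized memory accesses."""
    return vf * per_address_cost

def cost_before(vf):
    # Previously: the *vector* address cost was multiplied by the vector width.
    return scalarized_address_cost(vf, VECTOR_ADDRESS_COST)

def cost_after(vf):
    # Scalarized accesses use scalar GEPs, so the scalar cost is the one to scale.
    return scalarized_address_cost(vf, SCALAR_ADDRESS_COST)
```

For a vector width of 8 this model gives 80 before and 8 after, which illustrates why scalarized gathers/scatters looked prohibitively expensive.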

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Roman Lebedev 7e73c2a66a
[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle
The mask on the shuffle is for the output, not the input.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114697
2021-11-29 18:37:07 +03:00
Roman Lebedev cffe3a084f
[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()`
... to actually ask about i1-elt-wide mask, since that is what will probably be used on AVX512.
This unblocks D111460.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114316
2021-11-29 14:41:48 +03:00
Roman Lebedev cd8d219536
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ
I believe, this effectively completes `X86TTIImpl::getReplicationShuffleCost()`
for AVX512, other than the question of handling plain AVX512F,
where we end up with some really ugly "shuffles",
but then are there any CPUs that support AVX512, but not AVX512DQ/AVX512BW?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114315
2021-11-24 17:23:15 +03:00
Roman Lebedev 704d92607d
[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions
Apparently my methodology was suboptimal: not only did i miss all the +VL tuples,
i also missed some plain tuples. I believe this adds everything missing.
Indeed, these manual costmodels are just not okay long-term.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114334
2021-11-22 14:31:34 +03:00
Roman Lebedev 8d09dd61c3
[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions
Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW,
these either sign-extend the mask register into a vector,
or pack the mask from vector register.

Apparently, we didn't even have MCA tests for these,
added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05,
so i'm just guessing that their perf characteristics
are optimal.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114314
2021-11-22 14:31:34 +03:00
Roman Lebedev 049799c311
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI
If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles),
we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation.
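
Taken together with the related commits in this series (i16 with AVX512BW, i32 with AVX512DQ), the choice of promoted width for 1-bit mask elements might be modelled like this. The helper name, the feature strings, and the priority order between the BW and DQ cases are assumptions made for illustration.

```python
def promoted_mask_element_bits(features):
    """Hypothetical helper: width (in bits) that an i1 mask element is
    promoted to before the replication shuffle, or None if unmodelled."""
    if {"avx512bw", "avx512vbmi"} <= features:
        return 8    # k<->i8 casts from BW plus i8 shuffles from VBMI
    if "avx512bw" in features:
        return 16   # k<->i16 casts and i16 shuffles from BW
    if "avx512dq" in features:
        return 32   # the AVX512DQ case from the related commit
    return None     # plain AVX512F: not modelled in this sketch
```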

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114071
2021-11-19 15:58:10 +03:00
Roman Lebedev a751084bb4
[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1`
Note that there are many other missing costs, i'm *only* adding the ones that are queried
from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114070
2021-11-19 15:57:32 +03:00
Roman Lebedev a50fdd3fc9
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW
Here we get pretty lucky. AVX512F does not provide any instructions
to convert between a `k` vector mask and a vector,
but AVX512BW adds `{k}<->nX{i8,i16}` conversions,
and just as it happens, with AVX512BW we have a i16 shuffle.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113915
2021-11-19 15:55:41 +03:00
Roman Lebedev 2037ec725f
[X86][Costmodel] `*ext v64i1 to v32i16` can appear after legalization, cost is same as for `*ext v32i1 to v32i16`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113914
2021-11-17 12:02:50 +03:00
Roman Lebedev 23b194bf18
[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113913
2021-11-17 12:02:50 +03:00
Roman Lebedev 5c7255fe3a
[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI
Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type,
so this change does affect vectorization. In the end, it will ask about i1.

We should also try to promote to i16 if we have AVX512BW, i'll do that in a follow-up.
All costs here look good, i've added the missing truncation costs in preparatory patches.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113853
2021-11-15 19:04:02 +03:00
Roman Lebedev a468c39c90
[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8`
Some of the costs get larger here,
but i suppose that makes sense since we'd previously query
scalarization costs that may not be really representative of the reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev 9e57d9b09d
[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113851
2021-11-15 19:04:02 +03:00
Roman Lebedev 0116c708c6
[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113850
2021-11-15 19:04:02 +03:00
Roman Lebedev 4dd2f0446c
[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW
The basic idea is simple, if we don't have native shuffle for this element type,
then we must have native shuffle for wider element type,
so promote, replicate, demote.

I believe, asking `getCastInstrCost(Instruction::Trunc` is correct semantically,
case in point `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in
2 * XMM, that then will be packed into 1 * YMM,
and it should count the cost of said packing,
not just the truncations.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113609
2021-11-14 20:01:38 +03:00
Roman Lebedev e876698a5d
[NFC][TTI] `getReplicationShuffleCost()`: s/Replicated/Dst/
'Replicated' is a mouthful and somewhat ambiguous,
while 'destination' is pretty self-explanatory.
2021-11-14 20:01:38 +03:00
Roman Lebedev b283961012
[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16`
Same as D113842, but for i64

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113843
2021-11-14 18:41:38 +03:00
Roman Lebedev a5f2fdca99
[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16`
This was noticed in D113609, hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Roman Lebedev a70d74323e
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI
VBMI introduced VPERMB, so cost-model i8 replication shuffle using it.
Note that we can still model i8 replication shuffle without VBMI,
by promoting to i16/i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113479
2021-11-10 22:52:40 +03:00
Roman Lebedev c6e894b9b2
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW
BWI introduced VPERMW, so cost-model i16 replication shuffle using it.
Note that we can still model i16 replication shuffle without BWI,
by promoting to i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113478
2021-11-10 22:52:39 +03:00
Roman Lebedev 4101c7bf19
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F
This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`,
that take a single input vector and a single index vector,
and are cross-lane. So far i haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.
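
For illustration, assuming "replication" means repeating each source element `replication_factor` times in place, the shuffle mask and its effect can be sketched as below. The function names are hypothetical.

```python
def replication_mask(vf, replication_factor):
    """Index vector for replicating each of `vf` elements `replication_factor` times."""
    return [i // replication_factor for i in range(vf * replication_factor)]

def replicate(src, replication_factor):
    """Apply the replication mask to `src`, like a single-input permute would."""
    mask = replication_mask(len(src), replication_factor)
    return [src[index] for index in mask]
```

Every output element is picked from the one source vector through the index vector, which is the single-input, cross-lane permute shape the message refers to.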

This results in *shockingly* lesser costs :)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113350
2021-11-10 22:52:33 +03:00
Roman Lebedev d484cc152b
[TTI] Adjust `getReplicationShuffleCost()` interface
It is trivial to produce DemandedSrcElts given DemandedReplicatedElts,
so don't pass the former. Also, it isn't really useful so far
to have the overload taking the Mask, so just inline it.
2021-11-09 14:07:59 +03:00
Roman Lebedev f8efc5c0ac
[NFC][TTI] Add/extract `getReplicationShuffleCost()` method, deduplicate its implementations
Hiding it in `getInterleavedMemoryOpCost()` is problematic for a number of reasons,
including testability and reuse, let's do better.

In a followup `getUserCost()` will be taught to use it to estimate the mask costs,
which will allow for better cost model tests for it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113313
2021-11-06 16:45:15 +03:00
Roman Lebedev ad617183bb
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1
Even though AVX512's masked mem ops (unlike AVX1/2) have a mask
that is a `VF x i1`, replication of said masks happens after
promotion of it to `VF x i8`, so we should use `i8`, not `i1`,
when calculating the cost of mask replication.
2021-11-05 17:27:02 +03:00
Roman Lebedev df93c8a919
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask
I don't really buy that masked interleaved memory loads/stores are supported on X86.
There is zero costmodel test coverage, no actual cost modelling for the generation
of the mask repetition, and basically only two LV tests.
Additionally, i'm not very interested in AVX512.

I don't know if this really helps "soft" block over at
https://reviews.llvm.org/D111460#inline-1075467,
but i think it can't make things worse at least.

When we are being told that there is a masking, instead of
completely giving up and falling back to
fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`,
let's correctly query the cost of masked memory ops,
keep all the pretty shuffle cost modelling,
but scalarize the cost computation for the mask replication.

I think, not scalarizing the shuffles themselves
may adjust the computed costs a bit,
and maybe hopefully just enough to hide the "regressions"
at https://reviews.llvm.org/D111460#inline-1075467
I do mean hide, because the test coverage is non-existent.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112873
2021-11-03 18:14:35 +03:00
Simon Pilgrim 6fc50e531d [CostModel][X86] Remove old FIXME comments for AVX512F vector splitting
Similar to AVX1, the costs of splitting/merging 512-bit -> 256-bit vectors for arithmetic operations are typically hidden due to different used ports etc.
2021-11-01 11:11:11 +00:00
Simon Pilgrim 6102e5d56b [CostModel][X86] Remove old TODO comment
BMI (TZCNT) scalar handling was added at rGa2db388dce77c2f23f2009d7363a0b63bb54523c
2021-10-29 17:28:45 +01:00
Roman Lebedev 8fac9e95ad
[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does is naively scale the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may be an over- or underestimate.
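
The naive scaling can be sketched as follows. The function name is hypothetical, and rounding up is an assumption; the message does not say how fractional costs are rounded.

```python
def scaled_interleaving_cost(full_cost, live_members, factor):
    """Scale the fully-interleaved cost by live_members / factor, rounding up
    (the rounding direction is an assumption of this sketch)."""
    return (full_cost * live_members + factor - 1) // factor
```

For the stride-4 example above with a full cost of 6, this gives 5, 3, and 2 for three, two, and one live members. The measured examples list 4, 2, and 1, so under this rounding assumption the scaled cost stays on the pessimistic side for that case.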

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307
2021-10-22 16:33:58 +03:00
Simon Pilgrim 5b395bd633 [CostModel][X86] Add costs for multiply-by-pow2 constants
These are folded to left shifts in the backend.

We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436
2021-10-20 13:11:21 +01:00
Simon Pilgrim 9fc523d114 [X86] Remove X86ProcFamilyEnum::IntelSLM
Replace X86ProcFamilyEnum::IntelSLM enum with a TuningUseSLMArithCosts flag instead, matching what we already do for Goldmont.

This just leaves X86ProcFamilyEnum::IntelAtom to replace with general Tuning/Feature flags and we can finally get rid of the old X86ProcFamilyEnum enum.

Differential Revision: https://reviews.llvm.org/D112079
2021-10-20 11:58:39 +01:00
Simon Pilgrim f041338153 [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:46:10 +01:00
Simon Pilgrim c850d5c5c8 [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:15:14 +01:00
Roman Lebedev 91373bf12e
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick a cost of `40`.

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick a cost of `40`.
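
The way a single cost is picked from the per-vendor numbers can be sketched like this. Taking the worst case across the measured models is a reading of the message, not something it states as a rule, and the function name and dictionary keys are hypothetical.

```python
import math

def pick_cost(rthroughput_by_vendor):
    """Pick one table cost from per-vendor `Block RThroughput` values by
    taking the worst (largest) value and rounding it up to an integer."""
    return math.ceil(max(rthroughput_by_vendor.values()))
```

With the load numbers above, `pick_cost({"intel": 40.0, "ryzen": 16.0})` yields 40.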

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
2021-10-17 17:28:10 +03:00
Roman Lebedev 3274ce3a28
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So we could pick a cost of `32`.

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick a cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
2021-10-17 17:28:10 +03:00
Roman Lebedev 3a6a9f74d3
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So we could pick a cost of `68`.

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick a cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
2021-10-17 17:28:09 +03:00
Roman Lebedev 4b76a74b42
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick a cost of `32`.

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick a cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
2021-10-17 17:28:09 +03:00
Roman Lebedev 887acf6842
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
A few more tuples are being queried after D111546. It might be good to model them.
They all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So we could pick a cost of `212`.

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick a cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
2021-10-17 17:28:09 +03:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally can not deal with vectors of addresses,
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.
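
The specialized predicate amounts to something like the sketch below. It is a Python illustration only; the function name mirrors the hook, but the feature strings are illustrative labels and the real implementation queries the subtarget in C++.

```python
def prefers_vectorized_addressing(features):
    """Hypothetical model of the X86 specialization: prefer vectorized
    addressing only with AVX512, or with AVX2 plus fast gather."""
    return "avx512f" in features or {"avx2", "fast-gather"} <= features
```

Plain AVX2 without fast gather, and anything older, would answer `False` here, so the vectorizer would no longer try to vectorize addresses for those targets.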

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)
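
The definition above can be written out as a small sketch. The function name `deinterleave` is hypothetical; it only models which elements end up in which vector, not how the shuffles are lowered.

```python
def deinterleave(elements, stride):
    """Split stride*VF loaded elements into `stride` vectors of VF elements,
    where vector `member` holds elements lane*stride + member for lane in [0, VF)."""
    vf, remainder = divmod(len(elements), stride)
    if remainder:
        raise ValueError("element count must be a multiple of the stride")
    return [[elements[lane * stride + member] for lane in range(vf)]
            for member in range(stride)]
```

For stride 4 and VF 2, `deinterleave(list(range(8)), 4)` returns `[[0, 4], [1, 5], [2, 6], [3, 7]]`. A not-fully-interleaved group would use only some of these four vectors.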

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of the last ~70 patches, (wow)
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus,
it is usually almost an order of magnitude overestimation,
so i would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Simon Pilgrim 871f773986 [TTI][X86] Merge getInterleavedMemoryOpCostAVX2 into getInterleavedMemoryOpCost. NFC
This is an NFC refactor patch to merge the AVX2 interleaved cost handling back into the getInterleavedMemoryOpCost base method - while getInterleavedMemoryOpCostAVX512 uses instructions and patterns very specific to AVX512+, much of the cost analysis for AVX2 can be reused for all SSE targets.

This is the first step towards improving SSE and AVX1 costs that will reuse the relevant AVX2 costs by splitting some of the tables - for instance AVX1 has very similar costs for most vXi64/vXf64 interleave patterns and many sub-128bit vector costs are the same all the way down to SSE2 (or at least SSSE3).

Differential Revision: https://reviews.llvm.org/D111822
2021-10-14 18:46:25 +01:00