Commit Graph

497 Commits

Author SHA1 Message Date
Simon Pilgrim 49d3a367c0 [CostModel][X86] Improve AVX1/AVX2 truncation costs
Based off the worse case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:

AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8

Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.
2021-06-08 10:41:03 +01:00
Simon Pilgrim 432eff22ab [CostModel][X86] Add 512-bit bswap costs 2021-06-06 22:36:34 +01:00
Simon Pilgrim ae973380c5 [CostModel][X86] Improve AVX512 FDIV costs
Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model
2021-06-06 21:41:05 +01:00
Simon Pilgrim 90d25808c4 [CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets
Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets are cheaper than the #ops that the cost was previously based on.
2021-05-27 18:17:50 +01:00
Simon Pilgrim fe8d97cbe5 [CostModel][X86] AVX512 truncation ops are slower than cost models indicate.
The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions.

Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.
2021-05-27 16:07:42 +01:00
Roman Lebedev 8c86161a0b
[NFC][X86] clang-format X86TTIImpl::getInterleavedMemoryOpCostAVX2()
I plan to make changes to it, and undoing formatting each time is not going to be fun.
2021-05-26 12:27:47 +03:00
Simon Pilgrim def6269779 [CostModel][X86] Improve accuracy of 256-bit non-uniform vector shifts on AVX1
Determined from llvm-mca analysis, AVX1 capable targets have a higher throughput for VPBLENDVB and shuffle ops, making it cheaper to perform shift+shuffle/select shift patterns.
2021-05-25 17:31:45 +01:00
Simon Pilgrim c909ddddda [CostModel][X86] Improve accuracy of vXi64 vector non-uniform shift costs on AVX2+ targets
rG1ad4f887bd7692a9e63fb42586f0ece366f2fe01 incorrectly assumed that vXi64 non-uniform shifts were slow like vXi32 were - but llvm-mca (+Agner) both confirm that Haswell/Broadwell are full rate.
2021-05-25 15:58:23 +01:00
Simon Pilgrim 68ef68f8ac [CostModel][X86] Improve accuracy of vXi8/vXi16 vector non-uniform shift costs on AVX2/AVX512 targets
Determined from llvm-mca analysis, AVX2+ capable targets have a higher throughput for VPBLENDVB and VPMOVZX ops, making it cheaper to perform shift+select patterns for vXi8 shifts or extend/shift/truncate for vXi16 shifts. Similarly AVX512BW can perform vXi8 as extend/shift/truncate patterns.
2021-05-25 11:35:57 +01:00
Roman Lebedev c666208f63
[X86][Costmodel] getMaskedMemoryOpCost(): don't scalarize non-power-of-two vectors with legal element type
This follows in steps of similar `getMemoryOpCost()` changes, D100099/D100684.

Intel SDM, `VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores`:
```
Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to
referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no
faults will be detected if the mask bits are all zero.
```
I.e., if mask is all-zeros, any address is fine.

Masked load/store's prime use-case is e.g. tail masking the loop remainder,
where for the last iteration, only first some few elements of a vector exist.

So much similarly, i don't see why must we scalarize non-power-of-two vectors,
iff the element type is something we can masked- store/load.
We simply need to legalize it, widen the mask, and be done with it.
And we even already count the cost of widening the mask.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D102990
2021-05-24 20:09:54 +03:00
Simon Pilgrim dcaca7206e [CostModel][X86] Add missing SSE41 v2iX sext/zext costs
Also fix existing v4i8->v4i16 sext cost to match the equivalents
2021-05-24 15:53:43 +01:00
Simon Pilgrim 1ad4f887bd [CostModel][X86] Improve accuracy of vector non-uniform shift costs on XOP/AVX2 targets
By llvm-mca analysis, Haswell/Broadwell has a non-uniform vector shift recip-throughput cost of the AVX2 targets at 2 for both 128 and 256-bit vectors - XOP capable targets have better 128-bit vector shifts so improve the fallback in those cases.
2021-05-24 14:18:21 +01:00
Simon Pilgrim 243e588681 [CostModel][X86] Improve accuracy of vXi64 MUL costs on AVX2/AVX512 targets
By llvm-mca analysis, Haswell/Broadwell has the worst v4i64 recip-throughput cost of the AVX2 targets at 6 (vs the currently used cost of 8). Similarly SkylakeServer (our only AVX512 target model) implements PMULLQ with an average cost of 1.5 (rounded up to 2.0), and the PMULUDQ-sequence (without AVX512DQ) as a cost of 6.
2021-05-24 09:48:32 +01:00
Simon Pilgrim e4ec5cc8eb [CostModel][X86] Align v2i64 MUL costs on SSE42+ targets with worst case
Based on worst case of sandybridge (which seems to match nehalem for this SSE sequence) (vs btver2 + bdver2) llvm-mca analysis
2021-05-23 16:20:57 +01:00
Simon Pilgrim fc01b9bdf8 [CostModel][X86] Align v4i64 MUL costs on AVX1 targets with worst case
Based on worst case of sandybridge (vs btver2 + bdver2) llvm-mca analysis - which is a lot less than what we were predicting (I think based off total uop count).
2021-05-22 20:07:55 +01:00
Simon Pilgrim 6f9ac11e39 [CostModel][X86] Pull out X86/X64 scalar int arithmetric costs from SSE tables. NFCI.
These aren't dependent on any SSE level (and don't tend to get quicker either).
2021-05-22 16:13:49 +01:00
Simon Pilgrim 7a898477bb [CostModel][X86] vXi8 MUL is always promoted to vXi16 2021-05-22 11:56:49 +01:00
Simon Pilgrim 9bd0dc83b5 [CostModel][X86] Improve v8i32 MUL costs on AVX1 targets to account for slower btver2
BTVER2 has a 2 cycle throughput for v4i32 multiplies (same as SSE41 targets), which is only partially hidden by the subvector extracts/insert when splitting v8i32.
2021-05-22 11:13:07 +01:00
Roman Lebedev 8ed0864fd7
Reland [X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.

The AVX512 variant probably needs a similar change,
but there it is less obvious.

This was initially landed in 69ed93a435,
but was reverted in 6b95fd199d
because the patch it depends on was reverted.
2021-05-22 11:47:08 +03:00
Roman Lebedev 05a4e4a89c
Reland [X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.

Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.

This was initially landed in c02476f315,
but reverted in 5fddc3312b,
because the code made some very optimistic assumptions about invariants
that didn't hold in practice.

Reviewed By: RKSimon, ABataev

Differential Revision: https://reviews.llvm.org/D100684
2021-05-22 11:46:32 +03:00
Simon Pilgrim fe6c11c571 [CostModel][X86] Improve f64/v2f64/v4f64 FMUL costs on AVX1 targets to account for slower btver2
BTVER2 has a weaker f64 multiplier that other AVX1-era targets, so we need to bump the worst case cost slightly - llvm-mca reports the new vectorization in simplebb is beneficial on btver2, bdver2 and sandybridge AVX1 targets
2021-05-21 18:12:13 +01:00
Simon Pilgrim 2fca555866 [CostModel][X86] Improve fneg costs
These are always lowered as xor ops, so are always cheap
2021-05-21 17:23:45 +01:00
Simon Pilgrim 3ae7f7ae0a [CostModel][X86] Tweak fptoui v4f32->v4i32 + v8f32->v8i32 SSE/AVX costs
Adjust for worst case for atom/slm (SSE), btver2/sandybridge (AVX1) and haswell/znver* (AVX2)
2021-05-21 12:09:31 +01:00
Simon Pilgrim 4865ed3020 [CostModel][X86] Match SSE41 legalized conversion costs as well as SSE2 2021-05-21 11:42:22 +01:00
Simon Pilgrim eb6429d0fb [CostModel][X86] Add uitpfp v4f32->v4i32 + v8f32->v8i32 SSE/AVX costs
These were using (default) scalarized values.
2021-05-21 11:30:15 +01:00
Simon Pilgrim 62fca69a70 [CostModel][X86][AVX2] Improve 256-bit vector non-uniform shifts costs
Haswell, Excavator and early Ryzen all have slower 256-bit non-uniform vector shifts (confirmed on AMDSoG/Agner/instlatx64 and llvm models) - so bump the worst case costs accordingly.

Noticed while investigating PR50364
2021-05-20 12:16:16 +01:00
Simon Pilgrim 560b709abe [X86][AVX] Cleanup AVX2 vector integer truncation costs
Noticed while investigating PR50364, the truncation costs for v4i64->v4i16/v4i8 and v8i32->v8i8 were way too optimistic for a shuffle sequence that usually matches the AVX1 codegen (they matched AVX512 numbers which have actual truncation instructions!).
2021-05-18 13:07:29 +01:00
Roman Lebedev 5fddc3312b
Revert "[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again"
As reported in post-commit feedback, this has issues with e.g. <16 x i1>:
https://llvm.godbolt.org/z/jxPvdGEW4

This reverts commit c02476f315.
2021-05-14 00:03:36 +03:00
Roman Lebedev 6b95fd199d
Revert "[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()"
Depends on a commit that is about to be reverted.

This reverts commit 69ed93a435.
2021-05-14 00:03:36 +03:00
Roman Lebedev 97e04d41e6
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): canonicalize to integer type
This way we don't have to duplicate i32/f32 and i64/f64 entries,
which was already forgotten to be done for a few tuples.
2021-05-11 21:35:58 +03:00
Roman Lebedev 69ed93a435
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.

The AVX512 variant probably needs a similar change,
but there it is less obvious.
2021-05-11 16:28:00 +03:00
Roman Lebedev c02476f315
[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.

Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.

Reviewed By: RKSimon, ABataev

Differential Revision: https://reviews.llvm.org/D100684
2021-05-11 16:02:22 +03:00
Roman Lebedev b1c38207e9
[X86] Improve costmodel for scalar byte swaps
Currently we model i16 bswap as very high cost (`10`),
which doesn't seem right, with all other being at `1`.

Regardless of `MOVBE`, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of `1`:

Since i32/i64 already have cost of `1`,
`MOVBE` doesn't improve their costs any further.

BUT, `MOVBE` must have at least a single memory operand,
with other being a register. Which means, if we have
a bswap of load, iff load has a single use,
we'll fold bswap into load.

Likewise, if we have store of a bswap, iff bswap
has a single use, we'll fold bswap into store.

So i think we should treat such a bswap as free,
unless of course we know that for the particular CPU
they are performing badly.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D101924
2021-05-08 15:17:35 +03:00
Daniil Fukalov 3489c2d7b1 [TTI] NFC: Change getTypeLegalizationCost to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen, kparzysz

Differential Revision: https://reviews.llvm.org/D101533
2021-04-30 22:51:51 +03:00
Alexey Bataev 12c51f2358 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 12:48:00 -07:00
Alexey Bataev 6e859f3cd4 Revert "[COST] Improve shuffle kind detection if shuffle mask is provided."
This reverts commit 9239932221 to fix
a compiler crash on mask checks.
2021-04-29 12:40:33 -07:00
Alexey Bataev 9239932221 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 09:42:56 -07:00
Alexey Bataev f19e8f424f [COST][X86]Improve cost model for reverse shuffle v32i16/v64i8 in AVX512F.
Improved cost model for reverse shuffle on AVX512F for types
v32i16/v64i8.

Differential Revision: https://reviews.llvm.org/D100974
2021-04-27 11:14:21 -07:00
dfukalov e4c606acaf [TTI] NFC: Change getScalarizationOverhead and getOperandsScalarizationOverhead to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D101283
2021-04-27 08:51:48 +03:00
Simon Pilgrim 043bc88dba [CostModel][X86] Improve v2f32 fadd reduction cost
This was being reported as a similar cost to v4f32 when its a lot cheaper (just a shufps+addps).
2021-04-23 16:56:13 +01:00
Sander de Smalen f9a50f04ba [TTI] NFC: Change getIntImmCost[Inst|Intrin] to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100565
2021-04-23 16:06:36 +01:00
Sander de Smalen e0edfa052f [TTI] NFC: Change getAddressComputationCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100561
2021-04-23 16:06:35 +01:00
Roman Lebedev df9597cf5a
[X86][CostModel] X86TTIImpl::getShuffleCost(): subvector insertions are cheap
This is similar to the subvector extractions,
except that the 0'th subvector isn't free to insert,
because we generally don't know whether or not
the upper elements need to be preserved:
https://godbolt.org/z/rsxP5W4sW

This is needed to avoid regressions in D100684

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100698
2021-04-19 13:24:58 +03:00
Roman Lebedev b06c55a698
[X86][CostModel] Fix cost model for non-power-of-two vector load/stores
Sometimes LV has to produce really wide vectors,
and sometimes they end up being not powers of two.
As it can be seen from the diff, the cost computation
is currently completely non-sensical in those cases.

Instead of just scalarizing everything, split/factorize the wide vector
into a number of subvectors, each one having a power-of-two elements,
recurse to get the cost of op on this subvector. Also, check how we'd
legalize this subvector, and if the legalized type is scalar,
also account for the scalarization cost.

Note that for sub-vector loads, we might be able to do better,
when the vectors are properly aligned.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100099
2021-04-16 15:30:57 +03:00
Sander de Smalen 4f42d873c2 [TTI] NFC: Change getArithmeticInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100317
2021-04-14 17:20:36 +01:00
Sander de Smalen 1af35e77f4 [TTI] NFC: Change getVectorInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100315
2021-04-14 17:20:35 +01:00
Sander de Smalen 174e8f6c5e [TTI] NFC: Change getShuffleCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100314
2021-04-14 17:20:35 +01:00
Sander de Smalen 14b934f8a6 [TTI] NFC: Change getCFInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: samparker

Differential Revision: https://reviews.llvm.org/D100313
2021-04-14 17:20:34 +01:00
Sander de Smalen 03f47bdcb1 [TTI] NFC: Change get[Interleaved]MemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100205
2021-04-13 14:21:02 +01:00
Sander de Smalen d676b5749d [TTI] NFC: Change getMaskedMemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100204
2021-04-13 14:21:01 +01:00