Commit Graph

1215 Commits

David Green fa784f6382 [AArch64] Insert subvector costs
An insert subvector under AArch64 can often be done as a single lane mov
operation. For example a v4i8 inserted into a v16i8 is an s-reg mov, so
long as the index is a multiple of 4. This teaches the cost model about
this, using code copied over from the X86 backend.

Some of the costs (v16i16_4_0) are still high because they get matched
as an SK_Select, not an SK_InsertSubvector. D120879 has some codegen
tests for inserting subvectors, which were added as
llvm/test/CodeGen/AArch64/insert-subvector.ll.

Differential Revision: https://reviews.llvm.org/D120880
2022-04-07 19:27:41 +01:00
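A minimal LLVM IR sketch of the shuffle kind being costed here (illustrative only; the real tests live in llvm/test/CodeGen/AArch64/insert-subvector.ll):

```
; Insert the low 4 bytes of %b into lanes 0..3 of %a: an SK_InsertSubvector
; shuffle that AArch64 can lower to a single s-reg lane mov, since the
; insertion index is a multiple of 4.
define <16 x i8> @insert_subvec_4_0(<16 x i8> %a, <16 x i8> %b) {
  %s = shufflevector <16 x i8> %a, <16 x i8> %b,
       <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 4, i32 5, i32 6, i32 7,
                   i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ret <16 x i8> %s
}
```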
David Green 750bf3582a [AArch64] Increase cost of v2i64 multiplies
The cost of a v2i64 multiply was special cased in D92208 as scalarized
into 4*extract + 2*insert + 2*mul. Scalarizing to/from GPR registers is
expensive though, and the cost wasn't high enough to prevent vectorizing
in places where it can be detrimental for performance. This increases it
so that the cost of copying to/from GPRs is 2 each, with the total cost
increasing to 14. So long as umull/smull are handled correctly (as in
D123006) this seems to lead to better vectorization factors and better
performance.

Differential Revision: https://reviews.llvm.org/D123007
2022-04-04 17:42:20 +01:00
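Worked arithmetic, for reference: 4 extracts and 2 inserts at a GPR-copy cost of 2 each, plus 2 scalar muls at cost 1, gives 4*2 + 2*2 + 2*1 = 14. The scalarization being costed looks roughly like this (an illustrative sketch, not taken from the patch):

```
define <2 x i64> @mul_v2i64(<2 x i64> %a, <2 x i64> %b) {
  ; 4 extracts: lanes copied out of vector registers into GPRs (cost 2 each)
  %a0 = extractelement <2 x i64> %a, i32 0
  %a1 = extractelement <2 x i64> %a, i32 1
  %b0 = extractelement <2 x i64> %b, i32 0
  %b1 = extractelement <2 x i64> %b, i32 1
  ; 2 scalar multiplies (cost 1 each)
  %m0 = mul i64 %a0, %b0
  %m1 = mul i64 %a1, %b1
  ; 2 inserts: results copied from GPRs back into a vector (cost 2 each)
  %r0 = insertelement <2 x i64> poison, i64 %m0, i32 0
  %r1 = insertelement <2 x i64> %r0, i64 %m1, i32 1
  ret <2 x i64> %r1
}
```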
David Green 2abaa027d9 [AArch64] Teach the costmodel about widening muls
A vector mul(sext, sext) or mul(zext, zext) will be code generated as a
single smull or umull instruction. This most notably affects v2i64
multiplies, which are otherwise not legal and need to be expanded.

The oneuse check has also been slightly changed, as it is already
checked from the use of isWideningInstruction in getCastInstrCost.

Differential Revision: https://reviews.llvm.org/D123006
2022-04-04 12:45:04 +01:00
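A sketch of the widening-multiply pattern being recognized (function name illustrative):

```
; mul(sext, sext) from <2 x i32> to <2 x i64> is code generated as a single
; smull, so it should not be costed like a generic (expanded) v2i64 multiply.
define <2 x i64> @widening_mul(<2 x i32> %a, <2 x i32> %b) {
  %sa = sext <2 x i32> %a to <2 x i64>
  %sb = sext <2 x i32> %b to <2 x i64>
  %m = mul <2 x i64> %sa, %sb
  ret <2 x i64> %m
}
```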
David Green 2e2f38a1ac [AArch64] Add widening arithmetic cost tests. NFC 2022-04-04 12:19:45 +01:00
Simon Pilgrim d663166acb [CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets
Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput
2022-03-30 09:11:55 +01:00
Arthur Eubanks d051c566cd [test] Remove the last couple uses of -analyze in llvm/test 2022-03-23 11:31:12 -07:00
David Green c56dd20f69 [AArch64] Add extra insert subvector cost model tests. NFC 2022-03-22 12:20:19 +00:00
Yeting Kuo ecd7a0132a [RISCV] Add basic cost model for vector casting
To model the cost of vector casts, the patch costs most vector casts
the same as their scalar form, and costs the vector form of free scalar
casts as 1.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D121771
2022-03-22 14:17:08 +08:00
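For illustration (a sketch of the rule described above, not taken from the patch's tests):

```
; Costed like the scalar i32 -> i16 trunc; vector forms of casts that are
; free in scalar form get a cost of 1.
define <4 x i16> @vec_trunc(<4 x i32> %v) {
  %t = trunc <4 x i32> %v to <4 x i16>
  ret <4 x i16> %t
}
```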
Simon Pilgrim 5dde9c1286 [CostModel][X86] Reduce cost of extracting bool vector elements
For constant indices, these are now just a MOVMSK+TEST/BT
2022-03-18 19:02:47 +00:00
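A minimal sketch of the pattern (names illustrative):

```
; Extracting a bool vector element at a constant index: on X86 this now
; costs a MOVMSK into a GPR plus a TEST/BT of the selected bit.
define i1 @extract_bool(<16 x i1> %mask) {
  %b = extractelement <16 x i1> %mask, i32 3
  ret i1 %b
}
```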
Simon Pilgrim 4455c5cdea [CostModel][X86] Update RUN -passes=* to double quotes to appease update scripts on windows 2022-03-18 11:44:18 +00:00
Craig Topper bbd2ecf9f0 [RISCV] Add +experimental-zvfh extension to cover half types in vectors.
Currently we allow half types in vectors if the scalar Zfh extension
is enabled. This behavior is not in line with the vector spec. For f32
and f64 types, Zve32f, Zve64f, Zve64d, and V explicitly control
the availability of floating point types in vectors.

In order to make our compiler compliant, we either need to remove all support
for half in vectors or we need an extension to control it.

Draft spec here https://github.com/riscv/riscv-v-spec/pull/780

Reviewed By: kito-cheng

Differential Revision: https://reviews.llvm.org/D121345
2022-03-17 10:04:02 -07:00
David Sherwood e7b89c2fc3 Add BasicTTIImpl cost model for llvm.get.active.lane.mask intrinsic
The vectoriser sometimes generates predicated vector loops using
the llvm.get.active.lane.mask intrinsic so it's important that we
are able to calculate a valid cost for the call instruction. When
SVE is enabled we are able to use a single whilelo instruction
for some vector types - in such cases I've marked the cost as 1.
For all other cases I've set the cost according to how the intrinsic
will be expanded.

Tests added here:

  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Analysis/CostModel/ARM/active_lane_mask.ll
  Analysis/CostModel/RISCV/active_lane_mask.ll

Differential Revision: https://reviews.llvm.org/D121109
2022-03-14 09:35:05 +00:00
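A sketch of the intrinsic being costed, in its scalable form (with SVE this can lower to a single whilelo):

```
declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)

define <vscale x 4 x i1> @lane_mask(i64 %index, i64 %tc) {
  ; lane i of the result is active iff %index + i < %tc
  %m = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %tc)
  ret <vscale x 4 x i1> %m
}
```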
Yeting Kuo ae7c6647f3 [RISCV] Add basic code modeling for fixed length vector reduction.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D121447
2022-03-14 11:04:31 +08:00
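A sketch of the kind of fixed-length reduction being costed:

```
declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)

define i32 @reduce_add(<4 x i32> %v) {
  ; horizontal add of all four lanes into a scalar
  %s = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %v)
  ret i32 %s
}
```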
Florian Hahn aa590e5823
[AArch64] Improve costs for some conversions to fp16.
Currently the cost model under-estimates the cost of certain
FP16 conversions.

This patch updates getCastInstrCost to return more accurate costs for
the cases improved in c2ed9fd054.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D113700
2022-03-11 10:27:39 +00:00
Florian Hahn 697f55e368
[AArch64] Move fp16 cast tests.
Move FP16 tests to fp16cast function, as suggested in D113700.
2022-03-10 12:22:06 +00:00
Roman Lebedev 2f80ea7f4f
[NFC][LV] Use different braces in debug output
The analysis passes output the function name wrapped in `'` quotes,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D121105
2022-03-07 19:32:37 +03:00
David Green 43b638241a [AArch64] Use NPM for cost model tests. NFC
As per the other tests, this switches the run lines back to using the
NPM via
-passes='print<cost-model>' -cost-kind=throughput 2>&1 -disable-output
2022-03-07 08:57:50 +00:00
Alex Tsao 89f15fc687 [RISCV] Add cost modelling for masked memory op
The patch adds very basic cost model for masked memory op on scalable vector.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D117884
2022-03-03 20:47:58 +08:00
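A sketch of a masked memory op on a scalable vector (opaque-pointer syntax; illustrative only):

```
declare <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr, i32, <vscale x 4 x i1>, <vscale x 4 x i32>)

define <vscale x 4 x i32> @masked_load(ptr %p, <vscale x 4 x i1> %mask) {
  ; loads only the active lanes; inactive lanes take the zeroinitializer passthru
  %v = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr %p, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x i32> zeroinitializer)
  ret <vscale x 4 x i32> %v
}
```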
David Green 47f4cd9c3d [AArch64] Update costs for some fp16 converts
This updates the costs for FP16 converts, as some of them were pretty
high.

Differential Revision: https://reviews.llvm.org/D120771
2022-03-03 11:17:24 +00:00
David Green 65c0e45a37 [AArch64] Vector shifts cost 1
The costs of vector shifts were 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.

Differential Revision: https://reviews.llvm.org/D120773
2022-03-03 10:42:57 +00:00
David Green 97e0366d67 [AArch64] Add some fp16 conversion cost tests. NFC 2022-03-02 18:07:14 +00:00
Nikita Popov 98cfcae4e9 Revert "[RISCV] Add cost modelling for masked memory op"
This reverts commit 76f243b53b.

The newly added test fails.
2022-03-02 17:32:10 +01:00
Alex Tsao 76f243b53b [RISCV] Add cost modelling for masked memory op
The patch adds very basic cost model for masked memory op on scalable vector.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D117884
2022-03-02 22:48:41 +08:00
David Green 02de975259 [AArch64] Add some tests for the cost of extending an extract. NFC 2022-03-02 14:47:32 +00:00
David Green 62c2b070d5 [AArch64] Add simple arithmetic cost model test. NFC 2022-03-01 23:31:02 +00:00
David Green 2e7c35ea12 [AArch64] Cleanup and extend cast costs. NFC 2022-02-26 17:59:02 +00:00
David Green 5fe8307b70 [AArch64] Add scalar min/max costs. NFC
The vector costs were already added, this adds scalar variants to
complete the test coverage.
2022-02-25 17:11:24 +00:00
Pavel Kosov 37fa99eda0 [SchedModels][CortexA55] Add ASIMD integer instructions
Depends on D114642

Original review https://reviews.llvm.org/D112201

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D117003
2022-02-17 13:41:57 +03:00
Arthur Eubanks 15ba588d6d [test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>' 2022-02-09 15:42:16 -08:00
David Green b55d4c2ad8 Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`"
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00
Craig Topper 09629215c2 [RISCV] Add a really basic cost model for SK_Splice.
While testing scalable vectors I found that if we generate a
vector splice intrinsic and run the code through the loop unroller,
we'll crash due to an invalid cost.

This adds a basic cost based on the 2 slide instructions used by the
lowering in D119303.

We probably need to factor LMUL into this, but that's true for
arithmetic instructions too. So I've ignored it for the moment.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D119316
2022-02-09 11:43:31 -08:00
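A sketch of the (at the time experimental) splice intrinsic whose cost is now modeled as the two slide instructions:

```
declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, i32)

define <vscale x 4 x i32> @splice(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
  ; concatenate %a and %b and extract a full-width vector starting at lane 1;
  ; on RISC-V this lowers to a vslidedown + vslideup pair
  %s = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 1)
  ret <vscale x 4 x i32> %s
}
```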
Roman Lebedev 77a0da926c
[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into the cost model.
What it essentially does is prevent scalarized vectorization of masked memory operations:
```
  // TODO: Cost model for emulated masked load/store is completely
  // broken. This hack guides the cost model to use an artificially
  // high enough value to practically disable vectorization with such
  // operations, except where previously deployed legality hack allowed
  // using very low cost values. This is to avoid regressions coming simply
  // from moving "masked load/store" check from legality to cost model.
  // Masked Load/Gather emulation was previously never allowed.
  // Limited number of Masked Store/Scatter emulation was allowed.
```

While I don't really understand what specifically `is completely broken`
refers to, I believe that at least on X86 with AVX2-or-later
this is no longer true (or at least, I would like to know what is still broken).
So I would like to follow suit after D111460, and likewise disable that hack for AVX2+.

But since this was added for X86 specifically, let's just instead completely remove this hack.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114779
2022-02-07 16:08:31 +03:00
Florian Hahn 17ebd68ae6
[AArch64] Fix costs of float vector compare/selects pairs.
The current cost-model overestimates the cost of vector compares &
selects for ordered floating point compares. This patch fixes that by
extending the existing logic for integer predicates.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D118256
2022-01-31 10:18:29 +00:00
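A minimal sketch of the compare/select pair in question (an ordered FP predicate feeding a select):

```
define <4 x float> @fcmp_select(<4 x float> %a, <4 x float> %b) {
  ; on AArch64 this is roughly one vector compare plus one bitwise select
  %c = fcmp olt <4 x float> %a, %b
  %s = select <4 x i1> %c, <4 x float> %a, <4 x float> %b
  ret <4 x float> %s
}
```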
Florian Hahn cb3df1a299
[AArch64] Add vector compare/select tests with UNE predicate.
Precommit some additional tests for D118256.
2022-01-27 14:20:40 +00:00
Florian Hahn e6ebd2c72d
[AArch64] Add float vector compare/select cost-model tests. 2022-01-26 16:27:29 +00:00
Alban Bridonneau 2feddb37b4 Implement correct cost for SVE bitcasts
We have some bitcasts which we know will be simplified,
so their cost is zero.

Reviewed By: david-arm, sdesmalen

Differential Revision: https://reviews.llvm.org/D118019
2022-01-26 14:25:44 +00:00
Stanislav Mekhanoshin bb1fe36977 [AMDGPU] Make v8i16/v8f16 legal
Differential Revision: https://reviews.llvm.org/D117721
2022-01-24 11:51:08 -08:00
eopXD 3cf15af2da [RISCV] Remove experimental prefix from rvv-related extensions.
Extensions affected: +v, +zve*, +zvl*

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D117860
2022-01-22 20:18:40 -08:00
Shao-Ce SUN a0a76fee0c [RISCV] update zfh and zfhmin extension to v1.0
`zfh` and `zfhmin` have been ratified, with version 1.0.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D117098
2022-01-15 09:21:24 +08:00
David Sherwood ef1ca4d3e9 [AArch64] Fix incorrect use of MVT::getVectorNumElements in AArch64TTIImpl::getVectorInstrCost
If we are inserting into or extracting from a scalable vector we do
not know the number of elements at runtime, so we can only let the
index wrap for fixed-length vectors.

Tests added here:

  Analysis/CostModel/AArch64/sve-insert-extract.ll

Differential Revision: https://reviews.llvm.org/D117099
2022-01-13 09:27:14 +00:00
Andrew Litteken 4ff4e7ea30 [CostModel] Use cost of target trunc type when it is the only use of a non-register sized load
The code size cost model for most targets uses the legalization cost for the type of the pointer of a load. If this load is followed directly by a trunc instruction, and is the only use of the result of the load, only one instruction is generated in the target assembly language. This adds a check for this case, and uses the target type of the trunc instruction if so.

This did not show any changes in CTMark code size benchmarks.

Reviewers: paquette, samparker, dmgreen

Differential Revision: https://reviews.llvm.org/D109388
2022-01-12 18:03:50 -06:00
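A sketch of the case being costed (illustrative):

```
; The i32 load's only use is the trunc, so the backend emits one narrower
; load; the code-size cost now uses the trunc's target type instead.
define i8 @load_then_trunc(ptr %p) {
  %v = load i32, ptr %p
  %t = trunc i32 %v to i8
  ret i8 %t
}
```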
Simon Pilgrim 5eb47961c4 [CostModel][X86] Update ROTL/ROTR vXi8/vXi16 costs on AVX512BW targets
Refresh based off recent improvements to codegen and the helper script from D103695
2022-01-10 13:18:25 +00:00
David Green bc615e436c [AArch64] Update addo and subo costs
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.

Differential Revision: https://reviews.llvm.org/D116734
2022-01-07 16:20:23 +00:00
David Green c65270cf96 [AArch64] Add basic umulo and smulo costs
This adds some AArch64 specific smul_with_overflow and umul_with_overflow
costs, overriding the default costs. The code generation for these mul
with overflow intrinsics is usually better than the default expansion on
AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various
types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.

Differential Revision: https://reviews.llvm.org/D116732
2022-01-06 17:22:47 +00:00
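A sketch of the intrinsic shape being costed (the real codegen tests are in llvm/test/CodeGen/AArch64/arm64-xaluo.ll):

```
declare { i64, i1 } @llvm.smul.with.overflow.i64(i64, i64)

define i64 @smulo(i64 %a, i64 %b, ptr %overflowed) {
  %r = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %a, i64 %b)
  %val = extractvalue { i64, i1 } %r, 0
  %ov = extractvalue { i64, i1 } %r, 1
  store i1 %ov, ptr %overflowed
  ret i64 %val
}
```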
Nico Weber 085f078307 Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""
This reverts commit 859ebca744.
The change contained many unrelated changes and e.g. restored
unit test failures for the old lld port.
2022-01-05 13:10:25 -05:00
David Salinas 859ebca744 Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7.

That commit caused a performance degradation in the Quicksilver test QS:sGPU and a functional test failure in rocPRIM (rocprim.device_segmented_radix_sort).
Reverting until we have a better solution for s_cselect_b64 codegen cleanup

Change-Id: Ibf8e397df94001f248fba609f072088a46abae08

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D115960

Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
2022-01-05 17:57:32 +00:00
Jun Ma 80e56ad9ae [TTI] Return invalid cost for scalable vector in getShuffleCost
Differential Revision: https://reviews.llvm.org/D116362
2022-01-05 18:59:11 +08:00
Daniil Fukalov a2120f6b44 [NFC][AMDGPU][CostModel] Add tests for AMDGPU cost model, part 2. 2021-12-22 22:33:57 +03:00
Daniil Fukalov deaedab14a [NFC][AMDGPU][CostModel] Add tests for AMDGPU cost model. 2021-12-22 22:32:09 +03:00
Matthew Devereau e00f22c1b1 [AArch64][SVE] Teach cost model that masked loads/stores are cheap
Reduce the cost of VLS masked loads/stores to make the vectorizer emit them more frequently.
2021-12-17 15:04:45 +00:00
Alexandros Lamprineas 61bb8b5d40 [AArch64] Convert sra(X, elt_size(X)-1) to cmlt(X, 0)
CMLT has twice the execution throughput of SSHR on Arm out-of-order cores.

Differential Revision: https://reviews.llvm.org/D115457
2021-12-14 16:03:02 +00:00
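For reference, the equivalence being used (a sketch):

```
; ashr by elt_size-1 (here 31) yields -1 for negative lanes and 0 otherwise,
; i.e. exactly what cmlt(X, #0) computes.
define <4 x i32> @sign_splat(<4 x i32> %x) {
  %r = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
  ret <4 x i32> %r
}
```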
Daniil Fukalov e5c64b45be [CostModel][AMDGPU] Fix intrinsics costs estimations.
1. Fixed costs inconsistency for llvm.fma.vXf16 intrinsics.
2. Added tests for llvm.sadd.sat, llvm.ssub.sat, llvm.uadd.sat, llvm.usub.sat
   intrinsics since they have special processing in the cost model.
3. Minor intrinsics' cost tests update and refinement.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D115385
2021-12-13 17:17:34 +03:00
David Sherwood 8b0448ce5d [AArch64][Analysis] Add on overhead costs for SVE gathers and scatters
This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.

Differential Revision: https://reviews.llvm.org/D115143
2021-12-09 16:02:59 +00:00
Haohai Wen d2c093e79d [CostModel][X86] Add i64 mul cost for avx512 as 1cy
i64 mul cost is 1cy for all CPUs that support AVX512. Currently
all X86 CPUs use the i64 mul cost from the X64 cost table, which is not
true for CPUs that support AVX512 (skx, icx).

Reviewed By: pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D115016
2021-12-08 11:29:08 +08:00
Cullen Rhodes 698584f89b [IR] Remove unbounded as possible value for vscale_range minimum
The default for min is changed to 1. The behaviour of -mvscale-{min,max}
in Clang is also changed such that 16 is the max vscale when targeting
SVE and no max is specified.

Reviewed By: sdesmalen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D113294
2021-12-07 09:52:21 +00:00
David Green 255ad73424 [ARM] Make MVE v2i1 predicates legal
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the
two halves. This was never treated as a legal type in llvm in the past
as there are not many 64bit instructions and no 64bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatter and long multiplies and VCTP64
instructions.

This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in a way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way that expensive
things remain expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
where it comes up in practice.

The intrinsics currently remain using the v4i1 they previously did to
emulate a v2i1. This will be changed in a followup patch but this one
was already large enough.

Differential Revision: https://reviews.llvm.org/D114449
2021-12-03 14:05:41 +00:00
Daniil Fukalov ab05ab59a7 [CostModel][AMDGPU] Fix instruction cost estimation for vector types.
1. Fixed vector instruction cost estimation inconsistency - removed the different
   logic for "not simple types", since it biases costs for these types.
2. Fixed the legalization penalty for vectors too big for the target: changed from
   overwriting the default legalization cost estimate to adding a penalty.
3. Fixed a few typos in tests.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114893
2021-12-03 03:08:08 +03:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector address,
and then multiply it by the vector width. This doesn't make any sense:
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR; we perform scalar GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that the cost of vector address computation is `10`, as compared to `1` for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Roman Lebedev 7e73c2a66a
[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle
The mask on the shuffle is for the output, not the input.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114697
2021-11-29 18:37:07 +03:00
Roman Lebedev 5e96553608
[NFC][X86][LV][Costmodel] Add most basic test for masked interleaved load 2021-11-29 16:46:19 +03:00
Roman Lebedev cffe3a084f
[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()`
... to actually ask about the i1-elt-wide mask, since that is what will probably be used on AVX512.
This unblocks D111460.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114316
2021-11-29 14:41:48 +03:00
Zarko Todorovski 7f7dac7126 [NFC][llvm] Inclusive language: reword uses of sanity test and check
Part of continuing work to use more inclusive language. Reworded uses
of sanity check and sanity test in llvm/test/
2021-11-25 07:21:42 -05:00
Roman Lebedev cd8d219536
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ
I believe this effectively completes `X86TTIImpl::getReplicationShuffleCost()`
for AVX512, other than the question of handling plain AVX512F,
where we end up with some really ugly "shuffles" -
but then, are there any CPUs that support AVX512 but not AVX512DQ/AVX512BW?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114315
2021-11-24 17:23:15 +03:00
Roman Lebedev 704d92607d
[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions
Apparently my methodology was suboptimal: not only did I miss all the +VL tuples,
I also missed some plain tuples. I believe this adds everything missing.
Indeed, these manual cost models are just not okay long-term.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114334
2021-11-22 14:31:34 +03:00
Roman Lebedev 8d09dd61c3
[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions
Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW,
these either sign-extend the mask register into a vector,
or pack the mask from a vector register.

Apparently, we didn't even have MCA tests for these,
added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05,
so i'm just guessing that their perf characteristics
are optimal.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114314
2021-11-22 14:31:34 +03:00
Roman Lebedev df70cf5e14
[NFC][X86][Costmodel] Actually test +prefer-256-bit in replication-shuffle-related tests :(
While -prefer-256-bit indeed becomes complete with D114314,
the real-world (the one with +prefer-256-bit) coverage is lacking.

Hilarious.
2021-11-21 01:25:49 +03:00
Roman Lebedev da47a63e03
[NFC][X86][Costmodel] Add AVX512DQ runlines to trunc.ll/extend.ll 2021-11-20 13:55:13 +03:00
Roman Lebedev 049799c311
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI
If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles),
we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114071
2021-11-19 15:58:10 +03:00
Roman Lebedev a751084bb4
[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1`
Note that there are many other missing costs, i'm *only* adding the ones that are queried
from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114070
2021-11-19 15:57:32 +03:00
Roman Lebedev a50fdd3fc9
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW
Here we get pretty lucky. AVX512F does not provide any instructions
to convert between a `k` vector mask and a vector,
but AVX512BW adds `{k}<->nX{i8,i16}` conversions,
and just as it happens, with AVX512BW we have a i16 shuffle.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113915
2021-11-19 15:55:41 +03:00
Roman Lebedev 496ccb543e
[NFC][X86][Costmodel] Improve test coverage for i32->i64 vector *ext 2021-11-17 12:02:50 +03:00
Roman Lebedev 2037ec725f
[X86][Costmodel] `*ext v64i1 to v32i16` can appear after legalization, cost is same as for `*ext v32i1 to v32i16`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113914
2021-11-17 12:02:50 +03:00
Roman Lebedev 23b194bf18
[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113913
2021-11-17 12:02:50 +03:00
David Green 309f1e4ac8 [ARM] Add datalayout to costmodel tests. NFC
This adds a sensible datalayout to the ARM cost model tests, to prevent
the costs reported being incorrect for the size of pointers.
2021-11-16 09:49:42 +00:00
Roman Lebedev 7114c60e8f
[NFC][X86][Costmodel] Improve test coverage for {i8,i16,i32,i64}->i1 vector trunc 2021-11-15 20:46:48 +03:00
Roman Lebedev 949103dc36
[NFC][X86][Costmodel] Improve test coverage for i1->{i8,i16,i32,i64} vector *ext 2021-11-15 20:46:48 +03:00
Roman Lebedev bc35d5fe2f
[NFC][X86][Costmodel] Add i1 replication shuffle costmodel test coverage 2021-11-15 20:02:52 +03:00
Roman Lebedev 5c7255fe3a
[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI
Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about the i8 elt type,
so this change does affect vectorization. In the end, it will ask about i1.

We should also try to promote to i16 if we have AVX512BW; I'll do that in a follow-up.
All costs here look good; I've added the missing truncation costs in preparatory patches.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113853
2021-11-15 19:04:02 +03:00
Roman Lebedev a468c39c90
[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8`
Some of the costs get larger here, but I suppose that makes sense, since
we'd previously query scalarization costs that may not be really
representative of reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev 9e57d9b09d
[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113851
2021-11-15 19:04:02 +03:00
Roman Lebedev 0116c708c6
[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113850
2021-11-15 19:04:02 +03:00
Roman Lebedev f86b57e37c
[NFC][X86][Costmodel] Improve test coverage for {i16,i32,i64}->i8 vector trunc 2021-11-14 20:25:40 +03:00
Roman Lebedev f0da329f93
[NFC][X86][Costmodel] Improve test coverage for i8->{i16,i32,i64} vector *ext 2021-11-14 20:25:33 +03:00
Roman Lebedev 4dd2f0446c
[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW
The basic idea is simple: if we don't have a native shuffle for this element type,
then we must have a native shuffle for a wider element type,
so promote, replicate, demote.

I believe asking `getCastInstrCost(Instruction::Trunc)` is correct semantically;
case in point, `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in
2 * XMM, which then will be packed into 1 * YMM,
and it should count the cost of said packing,
not just the truncations.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113609
2021-11-14 20:01:38 +03:00
Roman Lebedev b283961012
[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16`
Same as D113842, but for i64

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113843
2021-11-14 18:41:38 +03:00
Roman Lebedev a5f2fdca99
[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16`
This was noticed in D113609, hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Roman Lebedev 17a3df87ff
[NFC][X86][Costmodel] Improve test coverage for {i32,i64}->i16 vector *ext
See https://reviews.llvm.org/D113609 - some of these costs seem wrong.
2021-11-14 16:07:30 +03:00
Roman Lebedev fd24446ba5
[NFC][X86][Costmodel] Improve test coverage for i16->{i32,i64} vector *ext 2021-11-14 16:06:45 +03:00
Florian Hahn 8d2a1994c8
[AArch64] Add some fp16 cast cost-model tests.
This adds initial tests for cost-modeling {u,s}itofp for fp16 vectors.
At the moment, they are under-estimated in a couple of cases.
2021-11-11 18:21:44 +00:00
Roman Lebedev a70d74323e
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI
VBMI introduced VPERMB, so cost-model the i8 replication shuffle using it.
Note that we can still model the i8 replication shuffle without VBMI,
by promoting to i16/i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113479
2021-11-10 22:52:40 +03:00
Roman Lebedev c6e894b9b2
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW
BWI introduced VPERMW, so cost-model the i16 replication shuffle using it.
Note that we can still model the i16 replication shuffle without BWI,
by promoting to i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113478
2021-11-10 22:52:39 +03:00
Roman Lebedev 4101c7bf19
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F
This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`,
that take a single input vector and a single index vector,
and are cross-lane. So far i haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.

This results in *shockingly* lesser costs :)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113350
2021-11-10 22:52:33 +03:00
Roman Lebedev 3bdf738d1b
[NFC][X86][Costmodel] Add i16 replication shuffle costmodel test coverage 2021-11-09 14:19:44 +03:00
Roman Lebedev 23566f18c6
[NFC][X86][Costmodel] Add tests for i32/i64 replication shuffles
While this isn't what we eventually need (i8 or i1),
approaching from this end is more straightforward.
2021-11-06 17:14:56 +03:00
Roman Lebedev a30ec4778a
[TTI][CostModel] `getUserCost()`: recognize replication shuffles and query their cost
This finally creates proper test coverage for replication shuffles,
which are used by LV for conditional loads, and will allow adding a
proper cost model at least for AVX512.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113324
2021-11-06 16:45:15 +03:00
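A sketch of a replication shuffle (replication factor 2), the pattern LV emits for conditional loads:

```
; Each lane of %v is repeated twice: <a,b,c,d> -> <a,a,b,b,c,c,d,d>
define <8 x i32> @replicate_x2(<4 x i32> %v) {
  %r = shufflevector <4 x i32> %v, <4 x i32> poison,
       <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>
  ret <8 x i32> %r
}
```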
Roman Lebedev 04fa7cbf55
[NFC][CostModel] Add exhaustive test coverage for replication shuffles
This coverage has been brought to you by https://godbolt.org/z/nfc3cY1za
2021-11-06 00:53:28 +03:00
Roman Lebedev ad617183bb
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1
Even though AVX512's masked mem ops (unlike AVX1/2) have a mask
that is a `VF x i1`, replication of said masks happens after
promotion of it to `VF x i8`, so we should use `i8`, not `i1`,
when calculating the cost of mask replication.
2021-11-05 17:27:02 +03:00
Roman Lebedev 34b903d8b0
[NFC] Add forgotten `REQUIRES: asserts` into the new costmodel test 2021-11-03 19:40:23 +03:00
Roman Lebedev c65e2ac405
[NFC] Rewrite runlines in interleaved-store-accesses-with-gaps.ll once again
https://lab.llvm.org/buildbot/#/builders/98/builds/8198 is still failing,
and i really don't understand how runlines in this test differ
from the ones in other nearby tests...
2021-11-03 19:15:33 +03:00
Roman Lebedev df93c8a919
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask
I don't really buy that masked interleaved memory loads/stores are supported on X86.
There is zero costmodel test coverage, no actual cost modelling for the generation
of the mask repetition, and basically only two LV tests.
Additionally, i'm not very interested in AVX512.

I don't know if this really helps the "soft" block over at
https://reviews.llvm.org/D111460#inline-1075467,
but I think it can't make things worse at least.

When we are being told that there is a masking, instead of
completely giving up and falling back to
fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`,
let's correctly query the cost of masked memory ops,
keep all the pretty shuffle cost modelling,
but scalarize the cost computation for the mask replication.

I think, not scalarizing the shuffles themselves
may adjust the computed costs a bit,
and maybe hopefully just enough to hide the "regressions"
at https://reviews.llvm.org/D111460#inline-1075467
I do mean hide, because the test coverage is non-existent.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112873
2021-11-03 18:14:35 +03:00
Roman Lebedev f3d1ddfe71
[NFC] Use single-dash-prefixed options in newly-added test
https://lab.llvm.org/buildbot/#/builders/98/builds/8195 complains,
and this is the only guess i have.
2021-11-03 18:12:40 +03:00
Roman Lebedev a4b64f7727
[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used
As it can be seen in `InnerLoopVectorizer::vectorizeInterleaveGroup()`,
in some cases (reported by `UseMaskForGaps`), the gaps in the interleaved load/store group
will be masked away by another constant mask, so there is no need to
account for the cost of replication of the mask for these.

Differential Revision: https://reviews.llvm.org/D112877
2021-11-03 17:33:28 +03:00
Roman Lebedev c6b3da1d66
[NFC][X86] Duplicate LV test into a costmodel test
Copied from llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
As discussed in D111460 / D112877 / D112873 we have basically no test coverage
for this part of cost model.
2021-11-03 17:25:18 +03:00
Roman Lebedev db848fbf67
[NFC][LV][X86] Improve test coverage for masked mem ops 2021-10-27 13:36:04 +03:00
David Green 74b2a4edcc [AArch64] Add a costmodel test for overflowing arithmatic. NFC 2021-10-26 10:35:12 +01:00
Roman Lebedev 8fac9e95ad
[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for a not-fully-interleaved group should be no greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does is naively scale the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may over- or under-estimate.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307
2021-10-22 16:33:58 +03:00
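A concrete sketch of that definition for stride 2, VF 4 (one wide load plus one shuffle per member; when a member is not demanded, its shuffle is dead and its cost can be discounted):

```
define void @load_stride2_vf4(ptr %p, ptr %out0, ptr %out1) {
  ; load N*VF = 8 elements ...
  %wide = load <8 x i32>, ptr %p
  ; ... and shuffle them into the stride-offset-0 and stride-offset-1 members
  %m0 = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %m1 = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  store <4 x i32> %m0, ptr %out0
  store <4 x i32> %m1, ptr %out1
  ret void
}
```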
David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Simon Pilgrim 5b395bd633 [CostModel][X86] Add costs for multiply-by-pow2 constants
These are folded to left shifts in the backend.

We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436
2021-10-20 13:11:21 +01:00
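A sketch of the fold being matched:

```
; multiply by a power of two folds to a left shift in the backend,
; e.g. x * 8 becomes x << 3, so it should be costed like a shift
define <4 x i32> @mul_by_8(<4 x i32> %x) {
  %r = mul <4 x i32> %x, <i32 8, i32 8, i32 8, i32 8>
  ret <4 x i32> %r
}
```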
David Green 862e8d7e55 [AArch64] Improve div and rem costmodel tests. NFC
Copied from the X86 tests, these give better test coverage than the
existing tests.
2021-10-20 09:58:35 +01:00
Simon Pilgrim f041338153 [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:46:10 +01:00
Simon Pilgrim c850d5c5c8 [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:15:14 +01:00
Simon Pilgrim dc3382dc2c [CostModel][X86] Add mul by positive/negative power-of-2 constants tests
We have backend optimizations for these, but currently the costmodel doesn't match them
2021-10-17 20:34:17 +01:00
Simon Pilgrim dbf5dc8930 [CostModel][X86] Add div/rem by negative power-of-2 constants
We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them
2021-10-17 18:51:15 +01:00
Roman Lebedev 91373bf12e
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So could pick cost of `40`

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick cost of `40`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
2021-10-17 17:28:10 +03:00
Roman Lebedev 3274ce3a28
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
2021-10-17 17:28:10 +03:00
Roman Lebedev 3a6a9f74d3
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So could pick cost of `68`

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
2021-10-17 17:28:09 +03:00
Roman Lebedev 4b76a74b42
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
2021-10-17 17:28:09 +03:00
Roman Lebedev 887acf6842
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So could pick cost of `212`

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
2021-10-17 17:28:09 +03:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally cannot deal with vectors of addresses;
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for a not-fully-interleaved group should be no greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of the last ~70 patches (wow),
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus;
it is usually almost an order of magnitude overestimation,
so I would claim that we should at least use the hardcoded costs
of fully-interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then I'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Simon Pilgrim 77dcdc2f50 [CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32
Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))
2021-10-14 12:17:40 +01:00
Roman Lebedev cb41efb5f4
[NFC][Costmodel][X86] Fix broken `CHECK-NOT`'s in interleave costmodel tests 2021-10-13 22:44:57 +03:00
Roman Lebedev 18eef13dad
[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()`
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of a sequence of `insertelement`/`extractelement`.
  If we are operating not on XMM (but on YMM/ZMM),
  this wildly overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  a scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
David Green adec922361 [AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.

The idea is that in-order cpu's require the most help in instruction
scheduling, whereas out-of-order cpus can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.

On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.

When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.

A lot of existing tests have been updated. This is a summary of the important
differences:
 - Most changes are the same instructions in a different order.
 - Sometimes this leads to very minor inefficiencies, such as requiring
   an extra mov to move variables into r0/v0 for the return value of a test
   function.
 - misched-fusion.ll was no longer fusing the pairs of instructions it
   should, as per D110561. I've changed the schedule used in the test
   for now.
 - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
   the different latencies. This seems fine to me.
 - Some SVE tests do not always remove movprfx where they did before due
   to different register allocation giving different destructive forms.
 - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
   produce two LDR where they previously produced an LDP due to
   store-pair-suppress kicking in.
 - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
 - Some tests such as arm64-neon-mul-div.ll and
   ragreedy-local-interval-cost.ll have more, less or just different
   spilling.
 - In aarch64_generated_funcs.ll.generated.expected one part of the
   function is no longer outlined. Interestingly, if I switch this to use
   any other schedule even less is outlined.

Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.

Differential Revision: https://reviews.llvm.org/D110830
2021-10-09 15:58:31 +01:00
Simon Pilgrim b6426d5211 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT/UGT for generic abs/min/max cost expansion
Split off ABS cost handling from MIN/MAX and use explicit predicates for each

Our generic expansion of ABS doesn't use NEG+CMP+SELECT any more (it's now ASHR+ADD+XOR), so this needs to be updated.
2021-10-08 12:41:58 +01:00
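For reference, the current generic ABS expansion (scalar sketch):

```
; abs(x) via ASHR+ADD+XOR: sign = x >> 31 (arithmetic), then
; (x + sign) ^ sign, which negates x exactly when x is negative
define i32 @abs_expanded(i32 %x) {
  %sign = ashr i32 %x, 31
  %t = add i32 %x, %sign
  %r = xor i32 %t, %sign
  ret i32 %r
}
```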
Simon Pilgrim 716883736b [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT for generic sadd/ssub sat cost expansion
The comparison always checks for negative values, so we know the icmp predicate will be ICMP_SGT
2021-10-07 15:42:45 +01:00
Simon Pilgrim 2ced9a42be [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion
Match the predicate used in TargetLowering::expandMULO to detect overflow
2021-10-06 19:11:33 +01:00
Simon Pilgrim 7bd097fd1e [CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion
Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detect signed values.

The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply.

More closely matches the default expansion from TargetLowering::expandMULO
2021-10-06 19:11:32 +01:00
Simon Pilgrim 81b5da8c97 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion
Match the predicates used in TargetLowering::expandUADDSUBO
2021-10-06 19:11:32 +01:00
Simon Pilgrim 3dda247e18 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_EQ for generic funnel shift cost expansion
The comparison always checks for a zero value, so we know the icmp predicate will be ICMP_EQ
2021-10-06 16:39:16 +01:00
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Roman Lebedev f92961d238
[NFC] Fixup newly-added costmodel tests to actually test what they should 2021-10-05 21:35:47 +03:00
Roman Lebedev 200edc152b
[NFC][X86][LV] Add basic costmodel test coverage for not-fully-interleaved i32 loads
The coverage could explode combinatorially here,
so I'm adding only the most basic cases,
and hoping it's enough, though more can be added if needed.
2021-10-05 19:39:50 +03:00
Roman Lebedev 3f9b235482
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0`
So could pick cost of `36`

For store we have:
https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `30`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111094
2021-10-05 16:58:58 +03:00
Roman Lebedev e2784c5d8c
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0`
So could pick cost of `18`.

For store we have:
https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `15`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111093
2021-10-05 16:58:58 +03:00
Roman Lebedev 3960693048
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0`
So could pick cost of `6`.

For store we have:
https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111092
2021-10-05 16:58:58 +03:00
Roman Lebedev 79d6d12d95
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs
This one required quite a bit of assembly surgery, but I think it's in the right ballpark.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So could pick cost of `64`.

For store we have:
https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5`
So we could pick cost of `66`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111091
2021-10-05 16:58:58 +03:00
Roman Lebedev 2996a2b50f
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0`
So we could pick a cost of `31`.

For store we have:
https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8`
So we could pick a cost of `33`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111089
2021-10-05 16:58:57 +03:00
Roman Lebedev d51532d8aa
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8`
So we could pick a cost of `15`.

For store we have:
https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0`
So we could pick a cost of `12`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111087
2021-10-05 16:58:57 +03:00
Roman Lebedev 764fd5f463
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3`
So we could pick a cost of `6`.

For store we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0`
So we could pick a cost of `9`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111083
2021-10-05 16:58:57 +03:00
Roman Lebedev c800119c46
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick a cost of `20`.

For store we have:
https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick a cost of `20`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111076
2021-10-05 16:58:57 +03:00
Roman Lebedev 000ce0bfd5
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick a cost of `8`.

For store we have:
https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick a cost of `8`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111075
2021-10-05 16:58:57 +03:00
Roman Lebedev dcc2b0d933
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So we could pick a cost of `6`.

For store we have:
https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick a cost of `6`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111073
2021-10-05 16:58:57 +03:00
Roman Lebedev 7d91037fd2
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=16 interleaving costs
This one required quite a bit of assembly surgery, but the trend continues, so I think this is right.

The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/EKWdj8cKT - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick a cost of `32`.

For store we have:
https://godbolt.org/z/zj4bb9P75 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So we could pick a cost of `32`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111064
2021-10-05 16:58:57 +03:00
Roman Lebedev 4aee1e5b93
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/a6rxMG6ec - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=12.0`
So we could pick a cost of `16`.

For store we have:
https://godbolt.org/z/ced1bdqc9 - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0`
So we could pick a cost of `16`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111063
2021-10-05 16:58:57 +03:00
Roman Lebedev 3c2e22b795
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/avq1oz98W - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0`
So we could pick a cost of `8`.

For store we have:
https://godbolt.org/z/89PGMc1qs - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick a cost of `6`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111061
2021-10-05 16:58:57 +03:00
Roman Lebedev b6234c1edf
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=2 interleaving costs
Finally, we are getting to the heavy-hitter stuff!

The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, and zen1-3.

For load we have:
https://godbolt.org/z/7crGWoar6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So we could pick a cost of `4`.

For store we have:
https://godbolt.org/z/T8aq3MszM - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.0`
So we could pick a cost of `5`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111060
2021-10-05 16:58:56 +03:00
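The stride=4 i32/f32 series above (D111060, D111061, D111063, D111064) fills in a similar table. Below is a self-contained sketch of how such a table might be consulted, with a miss falling back to a generic estimate; all names here are assumptions for illustration, not the actual LLVM API:

```cpp
#include <cstdio>
#include <cstring>

// Sketch only: a per-(factor, leg-type) cost table in the spirit of the
// stride=4 i32/f32 load entries derived above.
struct Entry { unsigned Factor; const char *LegTy; unsigned Cost; };

static const Entry AVX2InterleavedLoadTbl[] = {
    {4, "v2i32", 4},   // D111060: stride=4 VF=2
    {4, "v4i32", 8},   // D111061: stride=4 VF=4
    {4, "v8i32", 16},  // D111063: stride=4 VF=8
    {4, "v16i32", 32}, // D111064: stride=4 VF=16
};

static unsigned lookupCost(unsigned Factor, const char *LegTy,
                           unsigned FallbackCost) {
  for (const Entry &E : AVX2InterleavedLoadTbl)
    if (E.Factor == Factor && std::strcmp(E.LegTy, LegTy) == 0)
      return E.Cost;
  return FallbackCost; // no entry: fall back to a generic, usually higher, estimate
}

int main() {
  std::printf("stride=4 VF=8 i32 load cost: %u\n",
              lookupCost(4, "v8i32", /*FallbackCost=*/100)); // prints 16
  return 0;
}
```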
Roman Lebedev dee4d699b2
[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=6 2021-10-04 20:57:35 +03:00