Commit Graph

522 Commits

Author SHA1 Message Date
Simon Pilgrim 62fca69a70 [CostModel][X86][AVX2] Improve 256-bit vector non-uniform shifts costs
Haswell, Excavator and early Ryzen all have slower 256-bit non-uniform vector shifts (confirmed on AMDSoG/Agner/instlatx64 and llvm models) - so bump the worst case costs accordingly.

Noticed while investigating PR50364
2021-05-20 12:16:16 +01:00
Simon Pilgrim 560b709abe [X86][AVX] Cleanup AVX2 vector integer truncation costs
Noticed while investigating PR50364, the truncation costs for v4i64->v4i16/v4i8 and v8i32->v8i8 were way too optimistic for a shuffle sequence that usually matches the AVX1 codegen (they matched AVX512 numbers which have actual truncation instructions!).
2021-05-18 13:07:29 +01:00
Roman Lebedev 5fddc3312b
Revert "[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again"
As reported in post-commit feedback, this has issues with e.g. <16 x i1>:
https://llvm.godbolt.org/z/jxPvdGEW4

This reverts commit c02476f315.
2021-05-14 00:03:36 +03:00
Roman Lebedev 6b95fd199d
Revert "[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()"
Depends on a commit that is about to be reverted.

This reverts commit 69ed93a435.
2021-05-14 00:03:36 +03:00
Roman Lebedev 97e04d41e6
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): canonicalize to integer type
This way we don't have to duplicate i32/f32 and i64/f64 entries,
which was already forgotten to be done for a few tuples.
2021-05-11 21:35:58 +03:00
Roman Lebedev 69ed93a435
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.

The AVX512 variant probably needs a similar change,
but there it is less obvious.
2021-05-11 16:28:00 +03:00
Roman Lebedev c02476f315
[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.

Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.

Reviewed By: RKSimon, ABataev

Differential Revision: https://reviews.llvm.org/D100684
2021-05-11 16:02:22 +03:00
Roman Lebedev b1c38207e9
[X86] Improve costmodel for scalar byte swaps
Currently we model i16 bswap as very high cost (`10`),
which doesn't seem right, with all other being at `1`.

Regardless of `MOVBE`, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of `1`:

Since i32/i64 already have cost of `1`,
`MOVBE` doesn't improve their costs any further.

BUT, `MOVBE` must have at least a single memory operand,
with other being a register. Which means, if we have
a bswap of load, iff load has a single use,
we'll fold bswap into load.

Likewise, if we have store of a bswap, iff bswap
has a single use, we'll fold bswap into store.

So i think we should treat such a bswap as free,
unless of course we know that for the particular CPU
they are performing badly.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D101924
2021-05-08 15:17:35 +03:00
Daniil Fukalov 3489c2d7b1 [TTI] NFC: Change getTypeLegalizationCost to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen, kparzysz

Differential Revision: https://reviews.llvm.org/D101533
2021-04-30 22:51:51 +03:00
Alexey Bataev 12c51f2358 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 12:48:00 -07:00
Alexey Bataev 6e859f3cd4 Revert "[COST] Improve shuffle kind detection if shuffle mask is provided."
This reverts commit 9239932221 to fix
a compiler crash on mask checks.
2021-04-29 12:40:33 -07:00
Alexey Bataev 9239932221 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 09:42:56 -07:00
Alexey Bataev f19e8f424f [COST][X86]Improve cost model for reverse shuffle v32i16/v64i8 in AVX512F.
Improved cost model for reverse shuffle on AVX512F for types
v32i16/v64i8.

Differential Revision: https://reviews.llvm.org/D100974
2021-04-27 11:14:21 -07:00
dfukalov e4c606acaf [TTI] NFC: Change getScalarizationOverhead and getOperandsScalarizationOverhead to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D101283
2021-04-27 08:51:48 +03:00
Simon Pilgrim 043bc88dba [CostModel][X86] Improve v2f32 fadd reduction cost
This was being reported as a similar cost to v4f32 when its a lot cheaper (just a shufps+addps).
2021-04-23 16:56:13 +01:00
Sander de Smalen f9a50f04ba [TTI] NFC: Change getIntImmCost[Inst|Intrin] to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100565
2021-04-23 16:06:36 +01:00
Sander de Smalen e0edfa052f [TTI] NFC: Change getAddressComputationCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100561
2021-04-23 16:06:35 +01:00
Roman Lebedev df9597cf5a
[X86][CostModel] X86TTIImpl::getShuffleCost(): subvector insertions are cheap
This is similar to the subvector extractions,
except that the 0'th subvector isn't free to insert,
because we generally don't know whether or not
the upper elements need to be preserved:
https://godbolt.org/z/rsxP5W4sW

This is needed to avoid regressions in D100684

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100698
2021-04-19 13:24:58 +03:00
Roman Lebedev b06c55a698
[X86][CostModel] Fix cost model for non-power-of-two vector load/stores
Sometimes LV has to produce really wide vectors,
and sometimes they end up being not powers of two.
As it can be seen from the diff, the cost computation
is currently completely non-sensical in those cases.

Instead of just scalarizing everything, split/factorize the wide vector
into a number of subvectors, each one having a power-of-two elements,
recurse to get the cost of op on this subvector. Also, check how we'd
legalize this subvector, and if the legalized type is scalar,
also account for the scalarization cost.

Note that for sub-vector loads, we might be able to do better,
when the vectors are properly aligned.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100099
2021-04-16 15:30:57 +03:00
Sander de Smalen 4f42d873c2 [TTI] NFC: Change getArithmeticInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100317
2021-04-14 17:20:36 +01:00
Sander de Smalen 1af35e77f4 [TTI] NFC: Change getVectorInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100315
2021-04-14 17:20:35 +01:00
Sander de Smalen 174e8f6c5e [TTI] NFC: Change getShuffleCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100314
2021-04-14 17:20:35 +01:00
Sander de Smalen 14b934f8a6 [TTI] NFC: Change getCFInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: samparker

Differential Revision: https://reviews.llvm.org/D100313
2021-04-14 17:20:34 +01:00
Sander de Smalen 03f47bdcb1 [TTI] NFC: Change get[Interleaved]MemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100205
2021-04-13 14:21:02 +01:00
Sander de Smalen d676b5749d [TTI] NFC: Change getMaskedMemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100204
2021-04-13 14:21:01 +01:00
Sander de Smalen db134e2428 [TTI] NFC: Change getCmpSelInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100203
2021-04-13 14:21:01 +01:00
Sander de Smalen 2285dfb73f [TTI] NFC: Change getMinMaxReductionCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100202
2021-04-13 14:21:00 +01:00
Sander de Smalen bd86824d98 [TTI] NFC: Change getArithmeticReductionCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

This patch is practically NFC, with the exception of an AArch64 SVE related
cost-model change, where we can now return an Invalid cost instead of some
bogus number.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100201
2021-04-13 14:20:59 +01:00
Sander de Smalen fd1f8a5462 [TTI] NFC: Change getGatherScatterOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100200
2021-04-13 14:20:59 +01:00
Sander de Smalen 92d8421f49 [TTI] NFC: Change getCastInstrCost and getExtractWithExtendCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100199
2021-04-13 14:20:58 +01:00
dfukalov 8f4b7e94a2 [AMDGPU][CostModel] Refine cost model for control-flow instructions.
Added cost estimation for switch instruction, updated costs of branches, fixed
phi cost.
Had to increase `-amdgpu-unroll-threshold-if` default value since conditional
branch cost (size) was corrected to higher value.
Test renamed to "control-flow.ll".

Removed redundant code in `X86TTIImpl::getCFInstrCost()` and
`PPCTTIImpl::getCFInstrCost()`.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96805
2021-04-10 09:20:24 +03:00
Simon Pilgrim 201877d572 [CostModel][X86] Improve accuracy of vXi8 multiply reduction costs
After rG47321c311bdbe0145b9bf45d822185c37b19fa50 we promote vXi8 reductions to vXi16 to create a much faster PMULLW mul reduction, followed by a (free) truncation. This avoids the high cost of repeated vXi8 multiplications (which extend+multiply+truncate to/from vXi16 types....).

Fixes the missing vXi8 mul reduction vectorization in PR42674 (Comment #20) 'mul16' test case.
2021-04-06 11:53:22 +01:00
Sander de Smalen 2f6f249a49 NFC: Change getIntrinsicInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97468

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97469
2021-03-31 14:04:41 +01:00
Sander de Smalen 2f56e1c6b1 NFC: Change getTypeBasedIntrinsicCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97466

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97468
2021-03-31 14:04:41 +01:00
Sander de Smalen 55d18b3cc2 [TTI] Return a TypeSize from getRegisterBitWidth.
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D98874
2021-03-24 14:45:13 +00:00
David Green e2935dcfc4 [TTI] Add a Mask to getShuffleCost
This adds an Mask ArrayRef to getShuffleCost, so that if an exact mask
can be provided a more accurate cost can be provided by the backend.
For example VREV costs could be returned by the ARM backend. This should
be an NFC until then, laying the groundwork for that to be added.

Differential Revision: https://reviews.llvm.org/D98206
2021-03-17 17:46:26 +00:00
Simon Pilgrim dc74d7ed1f [X86] getMemoryOpCost - use dyn_cast_or_null<StoreInst>. NFCI.
Use instead of the isa_and_nonnull<StoreInst> and use the StoreInst::getPointerOperand wrapper instead of a hardcoded Instruction::getOperand.

Looks cleaner and avoids a spurious clang static analyzer null dereference warning.
2021-01-05 13:23:09 +00:00
Simon Pilgrim c86c024e10 [X86] Fix static analyzer warnings. NFCI.
Replace '|' with '||' in condition, and fix case of SignedMode variable.
2020-12-07 18:23:55 +00:00
Simon Pilgrim db900995ed [CostModel][X86] getGatherScatterOpCost - use default implementation for alt costkinds
Noticed while looking at D92701 - we only really handle TCK_RecipThroughput gather/scatter costs - for now drop back to the default implementation for non-legal gathers/scatters.
2020-12-06 14:08:26 +00:00
Sanjay Patel 136f98e523 [x86] adjust cost model values for minnum/maxnum with fast-math-flags
Without FMF, we lower these intrinsics into something like this:

vmaxsd	%xmm0, %xmm1, %xmm2
vcmpunordsd	%xmm0, %xmm0, %xmm0
vblendvpd	%xmm0, %xmm1, %xmm2, %xmm0

But if we can ignore NANs, the single min/max instruction is enough
because there is no need to fix up the x86 logic that corresponds to
X > Y ? X : Y.

We probably want to make other adjustments for FP intrinsics with FMF
to account for specialized codegen (for example, FSQRT).

Differential Revision: https://reviews.llvm.org/D92337
2020-12-01 10:45:53 -05:00
Simon Pilgrim 385a27d6cd [CostModel][X86] Refresh ISD::ABS costs
Update costs now that D92095 and D92102 have tweaked the SSE2 implementation

The SSE42 BLENDVPD cost can actually be used on SSE41 as we don't attempt to generate PCMPGT anymore

Add scalar i16/i32/i64 costs as we can do this cheaply with CMOV
2020-11-25 18:40:19 +00:00
Craig Topper f0b0bab34d [X86] Use GF2P8AFFINEQB to implement vector bitreverse.
We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.

We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW.

I think we might be able to use this for scalars too by moving into a vector and back. But I'll save that for a follow up as its a little more involved.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D91515
2020-11-17 23:49:06 -08:00
Sanjay Patel 9af561ec99 [x86] update cost table comments for maxnum; NFC
Follow-up suggested in D90613.
2020-11-03 08:09:59 -05:00
Sanjay Patel 35fa3c474f [x86] add AVX2 cost model entries for maxnum of 256-bit vectors
As noticed in D90554 ,
the AVX2 costs for 256-bit vectors did not include FMAXNUM entries,
so we fell back to AVX1 which assumes those ops will be split into
128-bit halves or something close to that.

Differential Revision: https://reviews.llvm.org/D90613
2020-11-02 12:20:17 -05:00
Florian Hahn b3b993a7ad Reland "[TTI] Add VecPred argument to getCmpSelInstrCost."
This reverts the revert commit 408c4408fa.

This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.

Original message:

On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.

This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.

This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.

I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
2020-11-02 15:39:29 +00:00
Florian Hahn 408c4408fa Revert "[TTI] Add VecPred argument to getCmpSelInstrCost."
This reverts commit 73f01e3df5.

This appears to break
http://lab.llvm.org:8011/#/builders/85/builds/383.
2020-10-30 21:26:14 +00:00
Sanjay Patel 251dd7c0f9 [x86] add cost overrides for mul with overflow
I'm assuming the standard size integer instructions for this end up as something like:
mulq %rsi
seto %al

And the 'mul' generally has reciprocal throughput of 1 on typical implementations
(higher latency, but that's not handled here).
The default costs may end up much higher than that, and that's what we see in the test diffs.

Vector types are left as a 'TODO'.

Differential Revision: https://reviews.llvm.org/D90431
2020-10-30 12:38:16 -04:00
Florian Hahn 73f01e3df5 [TTI] Add VecPred argument to getCmpSelInstrCost.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.

This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.

This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.

I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.

Reviewed By: dmgreen, RKSimon

Differential Revision: https://reviews.llvm.org/D90070
2020-10-30 13:49:08 +00:00
Sanjay Patel 7c395f31a6 [CostModel][x86] remove cost-kind predicate for intrinsic costs
We model cost as number of instructions / uops, so it does not
make sense to treat size/blended costs any differently than
throughput.
2020-10-28 14:33:37 -04:00
Bing1 Yu 2c08f1b4b6 [CostModel][X86] teach TTI calculate cost of chain of vector inserts/extracts more precisely and correctly:In each 128-lane, if there is at least one index is demanded and not all indices are demanded...
In each 128-lane, if there is at least one index is demanded and not all
indices are demanded and this 128-lane is not the first 128-lane of the
legalized-vector, then this 128-lane needs a extracti128;
If in each 128-lane, there is at least one index is demanded, this 128-lane
needs a inserti128.

The following cases will help you build a better understanding:
Assume we insert several elements into a v8i32 vector in avx2,
Case#1: inserting into 1th index needs vpinsrd + inserti128
Case#2: inserting into 5th index needs extracti128 + vpinsrd +
inserti128
Case#3: inserting into 4,5,6,7 index needs 4*vpinsrd + inserti128.

Reviewed By: pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D89767
2020-10-27 11:21:13 +08:00
Simon Pilgrim 913d7a110e [X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16.
This is my first LLVM patch, so please tell me if there are any process issues.

The main observation for this patch is that we can lower UMIN/UMAX with v8i16 by using unsigned saturated subtractions in a clever way. Previously this operation was lowered by turning the signbit of both inputs and the output which turns the unsigned minimum/maximum into a signed one.

We could use this trick in reverse for lowering SMIN/SMAX with v16i8 instead. In terms of latency/throughput this is the needs one large move instruction. It's just that the sign bit turning has an increased chance of being optimized further. This is particularly apparent in the "reduce" test cases. However due to the slight regression in the single use case, this patch no longer proposes this.

Unfortunately this argument also applies in reverse to the new lowering of UMIN/UMAX with v8i16 which regresses the "horizontal-reduce-umax", "horizontal-reduce-umin", "vector-reduce-umin" and "vector-reduce-umax" test cases a bit with this patch. Maybe some extra casework would be possible to avoid this. However independent of that I believe that the benefits in the common case of just 1 to 3 chained min/max instructions outweighs the downsides in that specific case.

Patch By: @TomHender (Tom Hender) ActuallyaDeviloper

Differential Revision: https://reviews.llvm.org/D87236
2020-10-11 11:21:23 +01:00
Bing1 Yu ec24e50553 [CostModel][X86] add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)
add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D87884
2020-09-23 10:29:10 +08:00
Meera Nakrani a3d0dce260 [ARM][TTI] Prevents constants in a min(max) or max(min) pattern from being hoisted when in a loop
Changes TTI function getIntImmCostInst to take an additional Instruction parameter,
which enables us to be able to check it is part of a min(max())/max(min()) pattern that will match SSAT.
We can then mark the constant used as free to prevent it being hoisted so SSAT can still be generated.
Required minor changes in some non-ARM backends to allow for the optional parameter to be included.

Differential Revision: https://reviews.llvm.org/D87457
2020-09-22 11:54:10 +00:00
Craig Topper 41f4cd60d5 [X86] Don't scalarize gather/scatters with non-power of 2 element counts. Widen instead.
We can pad the mask with zeros in order to widen. We already do
this for power 2 types that are smaller than a legal type.
2020-09-15 23:22:53 -07:00
Simon Pilgrim de25ebaac6 [CostModel][X86] Add vXi32 division by uniform constant costs (PR47476)
Other types can be handled in future patches but their uniform / non-uniform costs are more similar and don't appear to cause many vectorization issues.
2020-09-10 12:17:54 +01:00
Simon Pilgrim d1abca187d [CostModel][X86] Add SSE costs for SMAX/SMIN/UMAX/UMIN intrinsics 2020-07-29 15:55:43 +01:00
Simon Pilgrim 0a0f28254a [CostModel][X86] Add SSE costs for ABS intrinsics 2020-07-29 14:33:59 +01:00
David Green 60280e9818 [Analysis] TTI: Add CastContextHint for getCastInstrCost
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types.  Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization.  Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.

For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.

To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.

Original patch by Pierre van Houtryve

Differential Revision: https://reviews.llvm.org/D79162
2020-07-29 13:32:53 +01:00
Craig Topper 1a75d88b3e [X86] Move getGatherOverhead/getScatterOverhead into X86TargetTransformInfo.
These cost methods don't make much sense in X86Subtarget. Make
them methods in X86's TTI and move the feature checks from the
X86Subtarget constructor into these methods.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84594
2020-07-26 10:38:42 -07:00
Christopher Tetreault 0da1e7ebf9 [SVE] Remove calls to VectorType::getNumElements from X86
Reviewers: efriedma, RKSimon, craig.topper, fpetrogalli, c-rhodes

Reviewed By: RKSimon

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82508
2020-06-29 11:10:35 -07:00
Guillaume Chatelet b66e33a689 [Alignment][NFC] Migrate TTI::getGatherScatterOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82577
2020-06-26 11:08:27 +00:00
Guillaume Chatelet fdc7c7fb87 [Alignment][NFC] Migrate TTI::getInterleavedMemoryOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82573
2020-06-26 11:00:53 +00:00
Guillaume Chatelet 7e1f79c3de [Alignment][NFC] Migrate TTI::getMaskedMemoryOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82569
2020-06-26 10:14:16 +00:00
Guillaume Chatelet 324cda2073 [Alignment][NFC] Conform X86, ARM and AArch64 TargetTransformInfo backends to the public API
The main interface has been migrated to Align already but a few backends where broadening the type from Align to MaybeAlign.
This patch makes sure all implementations conform to the public API.

Differential Revision: https://reviews.llvm.org/D82465
2020-06-25 13:23:13 +00:00
dfukalov 7ddee0922f [NFCI][CostModel] Add const to Value*.
Summary:
Get back `const` partially lost in one of recent changes.
Additionally specify explicit qualifiers in few places.

Reviewers: samparker

Reviewed By: samparker

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82383
2020-06-24 23:16:08 +03:00
Sam Parker 2596da3174 [CostModel] getCFInstrCost in getUserCost.
Have BasicTTI call the base implementation so that both agree on the
default behaviour, which the default being a cost of '1'. This has
required an X86 specific implementation as it seems to be very
reliant on those instructions being free. Changes are also made to
AMDGPU so that their implementations distinguish between cost kinds,
so that the unrolling isn't affected. PowerPC also has its own
implementation to prevent changes to the reg-usage vectorizer test.

The cost model test changes now reflect that ret instructions are not
generally free.

Differential Revision: https://reviews.llvm.org/D79164
2020-06-15 09:28:46 +01:00
Christopher Tetreault 9044027e45 [SVE] Eliminate calls to default-false VectorType::get() from X86
Reviewers: efriedma, craig.topper, RKSimon, samparker, kmclaughlin, david-arm

Reviewed By: david-arm

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81520
2020-06-10 09:56:00 -07:00
Sam Parker fa8bff0cd1 [CostModel] Unify getArithmeticInstrCost
Add the remaining arithmetic opcodes into the generic implementation
of getUserCost and then call this from getInstructionThroughput. Most
of the backends have been modified to return the base implementation
for cost kinds other RecipThroughput. The outlier here is AMDGPU
which already uses getArithmeticInstrCost for all the cost kinds.
This change means that most of the opcodes can be removed from that
backends implementation of getUserCost.

Differential Revision: https://reviews.llvm.org/D80992
2020-06-10 09:08:45 +01:00
Sam Parker 37289615c0 [NFCI][CostModel] Unify getCmpSelInstrCost
Add cases for icmp, fcmp and select into the switch statement of the
generic getUserCost implementation with getInstructionThroughput then
calling into it. The BasicTTI and backend implementations have be set
to return a default value (1) when a cost other than throughput is
being queried.

Differential Revision: https://reviews.llvm.org/D80550
2020-06-09 07:41:22 +01:00
Sam Parker 5b5e78ad2b [CostModel] Follow-up to buildbot fix
Adding type checks into the other backends that call
getTypeLegalizationCost.

Differential Revision: https://reviews.llvm.org/D80984
2020-06-08 15:26:25 +01:00
Sam Parker 9303546b42 [CostModel] Unify getMemoryOpCost
Use getMemoryOpCost from the generic implementation of getUserCost
and have getInstructionThroughput return the result of that for loads
and stores.

This also means that the X86 implementation of getUserCost can be
removed with the functionality folded into its getMemoryOpCost.

Differential Revision: https://reviews.llvm.org/D80984
2020-06-05 10:13:38 +01:00
Christopher Tetreault 5a99ec10f5 [SVE] Eliminate calls to default-false VectorType::get() from X86
Reviewers: efriedma, sdesmalen, c-rhodes, craig.topper

Reviewed By: craig.topper

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80331
2020-05-29 16:16:07 -07:00
Sam Parker 8aaabadece [CostModel] Unify getCastInstrCost
Add the remaining cast instruction opcodes to the base implementation
of getUserCost and directly return the result. This allows
getInstructionThroughput to return getUserCost for the casts. This
has required changes to PPC and SystemZ because they implement
getUserCost and/or getCastInstrCost with adjustments for vector
operations. Adjusts have also been made in the remaining backends
that implement the method so that they still produce a cost of zero
or one for cost kinds other than throughput.

Differential Revision: https://reviews.llvm.org/D79848
2020-05-26 11:29:57 +01:00
Sam Parker 871556a494 [CostModel] Unify Intrinsic Costs.
Recommitting most of the remaining changes from
259eb619ff, but excluding the call to
getUserCost from getInstructionThroughput. Though there's still no
test changes, I doubt that this is an NFC...

With the two getIntrinsicInstrCosts folded into one, now fold in the
scalar/code-size orientated getIntrinsicCost. The remaining scalar
intrinsics were memcpy, cttz and ctlz which now have special handling
in the BasicTTI implementation.

This had required a change in the AMDGPU backend for fabs as it
should always be 'free'. I've also changed the X86 backend to return
the BaseT implementation when the CostKind isn't RecipThroughput.

Differential Revision: https://reviews.llvm.org/D80012
2020-05-26 09:48:26 +01:00
Sam Parker 259eb619ff Revert "[CostModel] Unify Intrinsic Costs."
This reverts commit de71def3f5.

This is causing some very large changes, so I'm first going to break
this patch down and re-commit in parts.
2020-05-21 12:50:24 +01:00
Sam Parker de71def3f5 [CostModel] Unify Intrinsic Costs.
With the two getIntrinsicInstrCosts folded into one, now fold in the
scalar/code-size orientated getIntrinsicCost. This involved sinking
cost of the TTIImpl into the base implementation, as it performs no
target checks. The opcodes remaining were memcpy, cttz and ctlz which
now have special handling in the BasicTTI implementation.
getInstructionThroughput can now directly return the result of
getUserCost.

This had required a change in the AMDGPU backend for fabs and its
always 'free'. I've also changed the X86 backend to return '1' for
any intrinsic when the CostKind isn't RecipThroughput.

Though this intended to be a non-functional change, there are many
paths being combined here so I would be very surprised if this didn't
have an effect.

Differential Revision: https://reviews.llvm.org/D80012
2020-05-21 07:38:25 +01:00
Sam Parker 8cc911fa5b [NFCI][CostModel] Refactor getIntrinsicInstrCost
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.

Differential Revision: https://reviews.llvm.org/D79941
2020-05-20 11:59:08 +01:00
Simon Pilgrim 4e3c005554 [TTI] getScalarizationOverhead - use explicit VectorType operand
getScalarizationOverhead is only ever called with vectors (and we already had a load of cast<VectorType> calls immediately inside the functions).

Followup to D78357

Reviewed By: @samparker

Differential Revision: https://reviews.llvm.org/D79341
2020-05-05 16:59:23 +01:00
Sam Parker 40574fefe9 [NFC][CostModel] Add TargetCostKind to relevant APIs
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.

RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html

Differential Revision: https://reviews.llvm.org/D79002
2020-05-05 10:35:54 +01:00
Craig Topper b938168aef [X86] Lower the cost of v4i64->v4i32 truncate with avx512.
We use the vpmovqd instruction which is a single uop. So
the cost should be 1.
2020-05-01 11:09:37 -07:00
Craig Topper 6a1ad76dab [X86] Don't return true from isTruncateFree for vectors
Also fix some cost tables for vXi1 types to match the costs entries for the types they will be promoted to.

Differential Revision: https://reviews.llvm.org/D79045
2020-04-30 16:43:35 -07:00
Craig Topper ff66919020 [X86][CostModel] Bump the cost of vpermw/vpermt2b/vperm2w
vpermw is 2 uops. vpermt2b/vpermt2w are two shuffle uops and a port 015 uop. Weirdly vpermb is a single uop.

This patch bumps the cost to 2 for these operations. Maybe should go to 3 for the vpermt2*, but I've started conservative.

I've also removed a few entries that were now the same as earlier subtargets or that I didn't think we really did. Like I don't think we extend v32i8 to v32i16, shuffle, and then truncate.

Differential Revision: https://reviews.llvm.org/D79148
2020-04-30 11:32:25 -07:00
Craig Topper cff6686532 [X86] Lower the cost of v4i64->v4i32 and v8i64->v8i32 truncate with AVX
We generate much better code these days than we used to. And we use the same sequence for AVX1 and AVX2 for these

For v4i64->v4i32 we generate:
vextractf128    xmm1, ymm0, 1
vshufps xmm0, xmm0, xmm1, 136   # xmm0 = xmm0[0,2],xmm1[0,2]

And for v8i64->v8i32 we generate:
vperm2f128      ymm2, ymm0, ymm1, 49 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128     ymm0, ymm0, xmm1, 1
vshufps ymm0, ymm0, ymm2, 136   # ymm0 = ymm0[0,2],ymm2[0,2],ymm0[4,6],ymm2[4,6]

Differential Revision: https://reviews.llvm.org/D79109
2020-04-29 13:21:44 -07:00
Simon Pilgrim 090cae8491 [TTI] Add DemandedElts to getScalarizationOverhead
The improvements to the x86 vector insert/extract element costs in D74976 resulted in the estimated costs for vector initialization and scalarization increasing higher than should be expected. This is particularly noticeable on pre-SSE4 targets where the available of legal INSERT_VECTOR_ELT ops is more limited.

This patch does 2 things:
1 - it implements X86TTIImpl::getScalarizationOverhead to more accurately represent the typical costs of a ISD::BUILD_VECTOR pattern.
2 - it adds a DemandedElts mask to getScalarizationOverhead to permit the SLP's BoUpSLP::getGatherCost to be rewritten to use it directly instead of accumulating raw vector insertion costs.

This fixes PR45418 where a v4i8 (zext'd to v4i32) was no longer vectorizing.

A future patch should extend X86TTIImpl::getScalarizationOverhead to tweak the EXTRACT_VECTOR_ELT scalarization costs as well.

Reviewed By: @craig.topper

Differential Revision: https://reviews.llvm.org/D78216
2020-04-29 12:00:38 +01:00
Craig Topper 59b9e6fe76 [X86] Update costs for truncates from less than 128-bit vectors to vXi1 on pre-avx512 targets
vXi1 types are legalized by promoting, but the narrow vectors
are legalized by widening. This results in some truncates turning
into any_extends.
2020-04-28 11:35:41 -07:00
Craig Topper d42192c50f [X86][CostModel] Correct the costs for truncate to a mask register with avx512
I've modified isTruncateFree to get an accurate cost for types that need to be split. I'm planning to look into fixing it for all vectors, but need more cost cleanups first.

Differential Revision: https://reviews.llvm.org/D78973
2020-04-28 10:39:36 -07:00
Sam Parker e9c9329aa4 [TTI] Add TargetCostKind argument to getUserCost
There are several different types of cost that TTI tries to provide
explicit information for: throughput, latency, code size along with
a vague 'intersection of code-size cost and execution cost'.

The vectorizer is a keen user of RecipThroughput and there's at least
'getInstructionThroughput' and 'getArithmeticInstrCost' designed to
help with this cost. The latency cost has a single use and a single
implementation. The intersection cost appears to cover most of the
rest of the API.

getUserCost is explicitly called from within TTI when the user has
been explicit in wanting the code size (also only one use) as well
as a few passes which are concerned with a mixture of size and/or
a relative cost. In many cases these costs are closely related, such
as when multiple instructions are required, but one evident diverging
cost in this function is for div/rem.

This patch adds an argument so that the cost required is explicit,
so that we can make the important distinction when necessary.

Differential Revision: https://reviews.llvm.org/D78635
2020-04-28 08:57:45 +01:00
Craig Topper 37ec709233 [X86][CostModel] Update truncate costs for some narrow vector cases to match their wider version.
This updates v4i16->v4i8 with sse2 to match v8i16->v8i8.
Update v2i16->v2i8 and v4i16->v4i8 with sse 4.1 to match v8i16->v8i8.
2020-04-27 13:47:48 -07:00
Craig Topper bdbbed115f [X86][CostModel] Update costs for vector truncate with avx512f/avx512bw.
All avx512 truncate instructions except vXi64->vXi32 are 2 uops
on port 5. So raise their costs to 2. Except when we have an
earlier faster sequence like pshufb for 128 bit input vectors.

Add a lower cost of 3 v16i16->v16i8 with avx512f where we can
extend to v16i32 then truncate. And a cost of 2 for avx512bw with
and without avx512vl. There we can use vpmovwb with either a ymm
or zmm input. Both of these beat masking, splitting, and using
packuswb which is our avx/avx2 codegen.
2020-04-27 12:00:24 -07:00
Craig Topper 5eff75d86a [X86][CostModel] Improve costs for fp_to_uint/fp_to_sint for vXi8/vXi16/v2i32 results.
Differential Revision: https://reviews.llvm.org/D78893
2020-04-27 10:35:15 -07:00
Craig Topper fc02d9f3c6 [X86] Add cost table entry for v2i32->v2f64 fp_to_uint with avx512.
We're currently getting this from the default implementation. But
I don't like how the cost model came to this answer and I might
be making some changes there.
2020-04-26 19:59:01 -07:00
Craig Topper b9de62c2b6 [X86] Fix the cost of v16i1->v16i16 sext/zext on avx targets.
Previously we were hitting the scalarization case in the default
implementation.
2020-04-25 23:16:20 -07:00
Craig Topper 19cb26f517 [X86][CostModel] Improve costs for vXi1 sign_extend/zero_extend with avx512.
With avx512 vXi1 is legal and uses k-registers with many custom cases
for extending.
2020-04-25 23:16:20 -07:00
Craig Topper 7664a0d282 [X86] Improve accuracy of cost for v16i64->v16i8 truncate with avx512.
The 2 vpmovqds are only 1 uop each.
2020-04-24 19:13:55 -07:00
Craig Topper e4a9190ad7 [X86][ArgumentPromotion] Allow Argument Promotion if caller and callee disagree on 512-bit vectors support if the arguments are scalar.
If one of caller/callee has disabled ZMM registers due to
prefer-vector-width=256, we were previously
disabling argument promotion as the ABI might be incompatible since
one side will split 512-bit vectors in this case.

But if we can see that the types are all scalar this shouldn't be
a problem.

This patch assumes that pointer element type reflects the type that
the argument will be promoted to.

Differential Revision: https://reviews.llvm.org/D78770
2020-04-24 15:47:02 -07:00
Sam Parker e3056ae9a0 [NFC][TTI] Explicit use of VectorType
The API for shuffles and reductions uses generic Type parameters,
instead of VectorType, and so assertions and casts are used a lot.
This patch makes those types explicit, which means that the clients
can't be lazy, but results in less ambiguity, and that can only be a
good thing.

Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45562

Differential Revision: https://reviews.llvm.org/D78357
2020-04-20 09:16:52 +01:00
Christopher Tetreault dd24fb388b Clean up usages of asserting vector getters in Type
Summary:
Remove usages of asserting vector getters in Type in preparation for the
VectorType refactor. The existence of these functions complicates the
refactor while adding little value.

Reviewers: craig.topper, sdesmalen, efriedma, RKSimon

Reviewed By: efriedma

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77264
2020-04-17 10:49:16 -07:00
Craig Topper 8dfb9627b7 [X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead.
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.

This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.

Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.

Differential Revision: https://reviews.llvm.org/D76212
2020-04-15 12:17:18 -07:00
Simon Pilgrim 426f37584e [TTI][X86] Add X86TTIImpl::getScalarizationOverhead implementation.
This is a currently just a wrapper to the base type, I'll be adding ISD::BUILD_VECTOR costs in a future patch.
2020-04-14 12:58:19 +01:00
Craig Topper 2f60fbce6c [X86] Use a more realisitic cost for truncate v16i64->v16i8 with avx512f.
Still not great and we could probably codegen this better, but
11 was clearly ridiculous.
2020-04-13 21:09:43 -07:00