llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	7ff2a9f250	[CostModel][X86] Add CodeSize handling for fadd/fsub/fmul/fsqrt ops Eventually this will be part of the cost table lookup	2022-08-21 17:42:11 +01:00
Simon Pilgrim	3c4391b4bb	Revert rG15de7aaae52ef4be9f9ff3b130804e5b5ccd29f4 "[CostModel][X86] Add CodeSize/SizeLatency handling for fadd/fsub/fmul/fsqrt ops" This is unintentionally affecting some backend tests	2022-08-21 16:51:45 +01:00
Simon Pilgrim	15de7aaae5	[CostModel][X86] Add CodeSize/SizeLatency handling for fadd/fsub/fmul/fsqrt ops Eventually this will be part of the cost table lookup	2022-08-21 16:39:57 +01:00
Simon Pilgrim	5263155d5b	[CostModel] Add CostKind argument to getShuffleCost Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future. Differential Revision: https://reviews.llvm.org/D132287	2022-08-21 10:54:51 +01:00
Alexey Bataev	0e7ed32c71	[SLP]Cost for a constant buildvector. In many cases constant buildvector results in a vector load from a constant/data pool. Need to consider this cost too. Differential Revision: https://reviews.llvm.org/D126885	2022-08-19 08:02:42 -07:00
Alexey Bataev	d53e245951	[COST][NFC]Introduce OperandValueKind in getMemoryOpCost, NFC. Added OperandValueKind OpdInfo parameter to getMemoryOpCost functions to better estimate cost with immediate values. Part of D126885.	2022-08-19 07:33:00 -07:00
Simon Pilgrim	b864cad7b4	[CostModel][X86] Adjust SLM select costs to match poor throughput of pblendvb/blendvpd/blendvps	2022-08-18 17:05:38 +01:00
Simon Pilgrim	55b1a147f2	[CostModel][X86] getArithmeticInstrCost - use MUL/DIV/REM expansions for all cost kinds The costs tables still assume throughput, but the general expansion patterns should be good for any cost kind	2022-08-18 14:18:54 +01:00
Simon Pilgrim	fdec50182d	[CostModel] Replace getUserCost with getInstructionCost * Replace getUserCost with getInstructionCost, covering all cost kinds. * Remove getInstructionLatency, it's not implemented by any backends, and we should fold the functionality into getUserCost (now getInstructionCost) to make it easier for targets to handle the cost kinds with their existing cost callbacks. Original Patch by @samparker (Sam Parker) Differential Revision: https://reviews.llvm.org/D79483	2022-08-18 11:55:23 +01:00
Daniil Fukalov	7ed3d81333	[NFCI] Move cost estimation from TargetLowering to TargetTransformInfo. TargetLowering had the last two InstructionCost-related members, `getTypeLegalizationCost()` and `getScalingFactorCost()`, but all other costs are processed in TTI. E.g. it is not convenient to use other TTI members in these two functions when they are overridden in a target. Minor refactoring: `getTypeLegalizationCost()` no longer needs a DataLayout parameter - it was always passed from TTI. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D117723	2022-08-18 00:38:55 +03:00
Fangrui Song	de9d80c1c5	[llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC With C++17 there is no Clang pedantic warning or MSVC C5051.	2022-08-08 11:24:15 -07:00
Kazu Hirata	a2d4501718	[llvm] Fix comment typos (NFC)	2022-08-07 00:16:14 -07:00
Phoebe Wang	f187948162	[X86][FP16] Enable vector support for FP16 emulation This is follow up of D107082, which enable vector support according to psABI. Reviewed By: skan Differential Revision: https://reviews.llvm.org/D127982	2022-07-16 09:38:58 +08:00
Vasileios Porpodas	7a9ad25769	Recommit "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6d6268dcbf`. Review: https://reviews.llvm.org/D125712	2022-06-21 18:35:29 -07:00
Vasileios Porpodas	6d6268dcbf	Revert "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6f88acf410`.	2022-06-21 17:07:21 -07:00
Vasileios Porpodas	6f88acf410	[SLP][X86] Improve reordering to consider alternate instruction bundles During the reordering transformation we should try to avoid reordering bundles like fadd,fsub because this may block them being matched into a single vector instruction in x86. We do this by checking if a TreeEntry is such a pattern and adding it to the list of TreeEntries with orders that need to be considered. Differential Revision: https://reviews.llvm.org/D125712	2022-06-21 16:44:48 -07:00
Simon Pilgrim	cf2072bcad	[X86] X86TargetTransformInfo.cpp - use InstructionCost type to accumulate instructions costs	2022-06-15 12:21:01 +01:00
eopXD	6a84579243	[LSR][TTI][PowerPC][SystemZ][X86] Add const-ness to TTI::isLSRCostLess. NFC Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D126350	2022-05-27 15:22:23 -07:00
Simon Pilgrim	6c80267d0f	[CostModel][X86] getScalarizationOverhead - improve extraction costs for > 128-bit vectors We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs. This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for extracting the upper 128-bit subvector extraction before it can handle the element. For scalarization costs we only need to extract each demanded subvector once. Differential Revision: https://reviews.llvm.org/D125527	2022-05-24 15:18:08 +01:00
Jay Foad	6bec3e9303	[APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf Most clients only used these methods because they wanted to be able to extend or truncate to the same bit width (which is a no-op). Now that the standard zext, sext and trunc allow this, there is no reason to use the OrSelf versions. The OrSelf versions additionally have the strange behaviour of allowing extending to a smaller width, or truncating to a larger width, which are also treated as no-ops. A small amount of client code relied on this (ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and needed rewriting. Differential Revision: https://reviews.llvm.org/D125557	2022-05-19 11:23:13 +01:00
Simon Pilgrim	3d107ce2b2	[CostModel][X86] Relax fcmp costs on SSE41 targets or later Only pre-SSE41 targets double-pump the fp comparison ops	2022-05-06 13:29:40 +01:00
Simon Pilgrim	cbfa857346	[CostModel][X86] Adjust 128-bit select costs to account for slow BLENDV op Based off the script from D103695 - Jaguar, Bulldozer, Silvermont (et al) and Haswell all have slow BLENDV ops, so adjust the worst case cost values	2022-05-06 13:07:34 +01:00
Simon Pilgrim	d21bf51494	[CostModel][X86] Adjust pre-SSE41 fp scalar select costs to account for vector ops Based off the script from D103695, we now mainly use BLENDV or OR(AND,ANDN) to select scalar float/double ops	2022-05-06 11:41:55 +01:00
Simon Pilgrim	f0e8c1d6d9	[CostModel][X86] Adjust 256-bit select costs to account for slow BLENDV op Based off the script from D103695, on AVX1, Jaguar/Bulldozer both have low throughput for ymm select patterns (BLENDV + OR(AND,ANDN)), and even on AVX2 Haswell still struggles with BLENDV ops	2022-05-06 11:27:37 +01:00
Simon Pilgrim	86bb7df6e6	[CostModel][X86] getScalarizationOverhead - handle vXi1 extracts with MOVMSK (pre-AVX512) We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op. The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases. The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.	2022-05-02 09:58:39 +01:00
Simon Pilgrim	d5198cf92f	[CostModel][X86] Check for 'null op' truncations If the legalized src/dst types are the same, assume the "truncation" is free. This fixes some edge cases such as mul lo/hi ops and bool vectors which will get legalized back to legal vector widths	2022-05-01 12:03:40 +01:00
Simon Pilgrim	c2964746e3	[CostModel][X86] Reduce cost of vector selects on SSE2/AVX1 targets Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput	2022-05-01 09:32:14 +01:00
Alexey Bataev	371412e065	[COST]Fix crash for non-power-2 vector shuffle mask. Need to normalize the mask to avoid possible crashes during attempts to estimate the cost of very long shuffles with a non-power-2 number of elements in masks.	2022-04-29 07:28:07 -07:00
Alexey Bataev	75e1cf4a6a	[COST]Improve cost model for shuffles in SLP. Introduced masks where they are not added and improved target dependent cost models to avoid returning of the incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-28 10:04:41 -07:00
Alexey Bataev	9861ca0c23	Revert "[COST]Improve cost model for shuffles in SLP." This reverts commit `29a470e380` to fix a crash reported in https://reviews.llvm.org/D100486#3479989.	2022-04-28 08:11:56 -07:00
Alexey Bataev	29a470e380	[COST]Improve cost model for shuffles in SLP. Introduced masks where they are not added and improved target dependent cost models to avoid returning of the incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-27 10:56:26 -07:00
Vasileios Porpodas	fa8a9fea47	Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`" This reverts commit `6a9bbd9f20`. Code review: https://reviews.llvm.org/D124202	2022-04-26 14:02:40 -07:00
Vasileios Porpodas	6a9bbd9f20	Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`" This reverts commit `55ce296d6f`.	2022-04-26 11:25:26 -07:00
Vasileios Porpodas	55ce296d6f	[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost` Before this patch `Args` was used to pass a broadcast's arguments by SLP. This patch changes this. `Args` is now used for passing the operands of the shuffle. Differential Revision: https://reviews.llvm.org/D124202	2022-04-26 11:11:29 -07:00
Vasileios Porpodas	889588ee97	[SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`. Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`. This helps reduce the diff of a future SLP patch for AArch64.	2022-04-21 10:19:00 -07:00
Simon Pilgrim	d663166acb	[CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput	2022-03-30 09:11:55 +01:00
Vasileios Porpodas	39aa202aff	Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash. Original review: https://reviews.llvm.org/D121354 This reverts commit `e6ead19b77`.	2022-03-23 18:32:17 -07:00
Arthur Eubanks	e6ead19b77	Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash." This reverts commit `27bd8f9492`. Causes crashes, see comments in D121973	2022-03-23 10:57:45 -07:00
Vasileios Porpodas	27bd8f9492	Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash. Original review: https://reviews.llvm.org/D121354 This reverts commit `f7d7d2a08d`.	2022-03-22 16:41:55 -07:00
Arthur Eubanks	f7d7d2a08d	Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads."" This reverts commit `79613185d3`. Causes crashes, see comments in https://reviews.llvm.org/D121973.	2022-03-22 13:33:49 -07:00
Vasileios Porpodas	79613185d3	Recommit "[SLP] Fix lookahead operand reordering for splat loads." Original review: https://reviews.llvm.org/D121354 The original commit `9136145eb0` broke the build on several targets. Differential Revision: https://reviews.llvm.org/D121973	2022-03-21 15:57:32 -07:00
Simon Pilgrim	5dde9c1286	[CostModel][X86] Reduce cost of extracting bool vector elements For constant indices, these are now just a MOVMSK+TEST/BT	2022-03-18 19:02:47 +00:00
Vasileios Porpodas	9136145eb0	Revert "[SLP] Fix lookahead operand reordering for splat loads." due to build failures This reverts commit `5efa78985b`.	2022-03-17 18:22:04 -07:00
Vasileios Porpodas	5efa78985b	[SLP] Fix lookahead operand reordering for splat loads. Splat loads are inexpensive in X86. For a 2-lane vector we need just one instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads to worse code. This patch adds a new score dedicated for splat loads. Please note that a splat is usually three IR instructions: - It is usually a load and 2 inserts: %ld = load double, double* %gep %ins1 = insertelement <2 x double> poison, double %ld, i32 0 %ins2 = insertelement <2 x double> %ins1, double %ld, i32 1 - But it can also be a load, an insert and a shuffle: %ld = load double, double* %gep %ins = insertelement <2 x double> poison, double %ld, i32 0 %shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer Because of this some of the lit tests contain more IR instructions. Differential Revision: https://reviews.llvm.org/D121354	2022-03-17 18:05:54 -07:00
Simon Pilgrim	8c82d42e97	[TTI][X86] Pull out repeated getSizeInBits() calls. NFC.	2022-02-10 18:58:32 +00:00
Rosie Sumpter	552eb372cb	[LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter This is required to query the legality more precisely in the LoopVectorizer. This adds another TTI function named 'forceScalarizeMaskedGather/Scatter' function to work around the hack introduced for MVE, where isLegalMaskedGather/Scatter would return an answer by second-guessing where the function was called from, based on the Type passed in (vector vs scalar). The new interface makes this explicit. It is also used by X86 to check for vector widths where gather/scatters aren't profitable (or don't exist) for certain subtargets. Differential Revision: https://reviews.llvm.org/D115329	2022-01-12 13:34:12 +00:00
Simon Pilgrim	5eb47961c4	[CostModel][X86] Update ROTL/ROTR vXi8/vXi16 costs on AVX512BW targets Refresh based off recent improvements to codegen and the helper script from D103695	2022-01-10 13:18:25 +00:00
Nikita Popov	7c3cf4c2c0	[Inline][X86] Avoid inlining if it would create ABI-incompatible calls (PR52660) X86 allows inlining functions if the callee target features are a subset of the caller target features. This ensures that we don't inline something into a caller that does not support it. However, this does not account for possible call ABI mismatches as a result of inlining. If a call passing a vector argument was originally in a -avx function, calling another -avx function, the vector is passed in xmm. If we now inline it into a +avx function, then it will be passed in ymm, even though the callee expects it in xmm. Fix this by scanning over all calls in the function and checking whether ABI incompatibility is possible. Calls that only pass scalar types are excluded, as I believe those always use the same ABI independent of target features. Fixes https://github.com/llvm/llvm-project/issues/52660. Differential Revision: https://reviews.llvm.org/D116036	2021-12-27 09:36:21 +01:00
Nikita Popov	f5ac23b5ae	[ArgPromotion][TTI] Pass types to ABI compatibility hook The areFunctionArgsABICompatible() hook currently accepts a list of pointer arguments, though what we're actually interested in is the ABI compatibility after these pointer arguments have been converted into value arguments. This means that a) the current API is incompatible with opaque pointers (because it requires inspection of pointee types) and b) it can only be used in the specific context of ArgPromotion. I would like to reuse the API when inspecting calls during inlining. This patch converts it into an areTypesABICompatible() hook, which accepts a list of types. This makes the method more generally usable, and compatible with opaque pointers from an API perspective (the actual usage in ArgPromotion/Attributor is still incompatible, I'll follow up on that in separate patches). Differential Revision: https://reviews.llvm.org/D116031	2021-12-22 09:37:51 +01:00
Simon Pilgrim	30e38d6771	[CostModel][X86] Split MUL/SDIV+SREM/UDIV+UREM PowerOf2 handling. NFC. This is a NFC cleanup to simplify some upcoming refactoring required to address the regressions in D111968.	2021-12-08 18:47:12 +00:00
Haohai Wen	d2c093e79d	[CostModel][X86] Add i64 mul cost for avx512 as 1cy i64 mul cost is 1cy for all cpu that support avx512. Currently all X86 cpu uses i64 mul cost in X64 cost table which is not true for cpu that support avx512 (skx, icx). Reviewed By: pengfei, RKSimon Differential Revision: https://reviews.llvm.org/D115016	2021-12-08 11:29:08 +08:00
Roman Lebedev	8cd782487f	[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()` We ask `TTI.getAddressComputationCost()` about the cost of computing vector address, and then multiply it by the vector width. This doesn't make any sense, it implies that we'd do a vector GEP and then scalarize the vector of pointers, but there is no such thing in the vectorized IR, we perform scalar GEP's. This is especially bad on X86, and was effectively prohibiting any scalarized vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()` says that cost of vector address computation is `10` as compared to `1` for scalar. The computed costs are similar to the ones with D111222+D111220, but we end up without masked memory intrinsics that we'd then have to expand later on, without much luck. (D111363) Differential Revision: https://reviews.llvm.org/D111460	2021-11-30 10:47:56 +03:00
Roman Lebedev	7e73c2a66a	[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle The mask on the shuffle is for the output, not the input. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114697	2021-11-29 18:37:07 +03:00
Roman Lebedev	cffe3a084f	[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()` ... to actually ask about i1-elt-wide mask, since that is what will probably be used on AVX512. This unblocks D111460. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114316	2021-11-29 14:41:48 +03:00
Roman Lebedev	cd8d219536	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ I believe, this effectively completes `X86TTIImpl::getReplicationShuffleCost()` for AVX512, other than the question of handling plain AVX512F, where we end up with some really ugly "shuffles", but then is there any CPU's that support AVX512, but not AVX512DQ/AVX512BW? Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114315	2021-11-24 17:23:15 +03:00
Roman Lebedev	704d92607d	[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions Apparently my methodology was suboptimal, and not only did i miss all the +VL tuples, i also missed some plain tuples. I believe, this adds everything missing. Indeed, these manual costmodels are just not okay long-term. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114334	2021-11-22 14:31:34 +03:00
Roman Lebedev	8d09dd61c3	[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW, these either sign-extend the mask register into a vector, or pack the mask from a vector register. Apparently, we didn't even have MCA tests for these, added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05, so i'm just guessing that their perf characteristics are optimal. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114314	2021-11-22 14:31:34 +03:00
Roman Lebedev	049799c311	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles), we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114071	2021-11-19 15:58:10 +03:00
Roman Lebedev	a751084bb4	[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1` Note that there are many other missing costs, i'm only adding the ones that are queried from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114070	2021-11-19 15:57:32 +03:00
Roman Lebedev	a50fdd3fc9	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW Here we get pretty lucky. AVX512F does not provide any instructions to convert between a `k` vector mask and a vector, but AVX512BW adds `{k}<->nX{i8,i16}`conversions, and just as it happens, with AVX512BW we have a i16 shuffle. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113915	2021-11-19 15:55:41 +03:00
Roman Lebedev	2037ec725f	[X86][Costmodel] `ext v64i1 to v32i16` can appear after legalization, cost is same as for `ext v32i1 to v32i16` Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113914	2021-11-17 12:02:50 +03:00
Roman Lebedev	23b194bf18	[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1` Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113913	2021-11-17 12:02:50 +03:00
Roman Lebedev	5c7255fe3a	[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type, so this change does affect vectorization. In the end, it will ask about i1. We should also try to promote to i16 if we have AVX512BW, i'll do that in a follow-up. All costs here look good, i've added the missing truncation costs in preparatory patches. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113853	2021-11-15 19:04:02 +03:00
Roman Lebedev	a468c39c90	[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8` Some of the costs get larger here, but i suppose that makes sense since we'd previously query scalarization costs that may not be really representative of the reality. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113852	2021-11-15 19:04:02 +03:00
Roman Lebedev	9e57d9b09d	[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8` While this one is trivial and identical to the previous patch, there is a weird cost change in a follow-up patch that i'm not sure about. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113851	2021-11-15 19:04:02 +03:00
Roman Lebedev	0116c708c6	[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8` While this one is trivial and identical to the previous patch, there is a weird cost change in a follow-up patch that i'm not sure about. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113850	2021-11-15 19:04:02 +03:00
Roman Lebedev	4dd2f0446c	[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW The basic idea is simple, if we don't have native shuffle for this element type, then we must have native shuffle for wider element type, so promote, replicate, demote. I believe, asking `getCastInstrCost(Instruction::Trunc` is correct semantically, case in point `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in 2 * XMM, that then will be packed into 1 * YMM, and it should count the cost of said packing, not just the truncations. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113609	2021-11-14 20:01:38 +03:00
Roman Lebedev	e876698a5d	[NFC][TTI] `getReplicationShuffleCost()`: s/Replicated/Dst/ 'Replicated' is a mouthful and somewhat ambiguous, while 'destination' is pretty self-explanatory.	2021-11-14 20:01:38 +03:00
Roman Lebedev	b283961012	[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16` Same as D113842, but for i64 Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113843	2021-11-14 18:41:38 +03:00
Roman Lebedev	a5f2fdca99	[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16` This was noticed in D113609, hopefully it unblocks that patch. There are likely other similar problems. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113842	2021-11-14 18:41:37 +03:00
Roman Lebedev	a70d74323e	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI VBMI introduced VPERMB, so cost-model i8 replication shuffle using it. Note that we can still model i8 replication shuffle without VBMI, by promoting to i16/i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113479	2021-11-10 22:52:40 +03:00
Roman Lebedev	c6e894b9b2	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW BWI introduced VPERMW, so cost-model i16 replication shuffle using it. Note that we can still model i16 replication shuffle without BWI, by promoting to i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113478	2021-11-10 22:52:39 +03:00
Roman Lebedev	4101c7bf19	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`, that take a single input vector and a single index vector, and are cross-lane. So far i haven't seen evidence that replication ever results in demanding more than a single input vector per output vector. This results in shockingly lesser costs :) Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113350	2021-11-10 22:52:33 +03:00
Roman Lebedev	d484cc152b	[TTI] Adjust `getReplicationShuffleCost()` interface It is trivial to produce DemandedSrcElts given DemandedReplicatedElts, so don't pass the former. Also, it isn't really useful so far to have the overload taking the Mask, so just inline it.	2021-11-09 14:07:59 +03:00
Roman Lebedev	f8efc5c0ac	[NFC][TTI] Add/extract `getReplicationShuffleCost()` method, deduplicate its implementations Hiding it in `getInterleavedMemoryOpCost()` is problematic for a number of reasons, including testability and reuse, let's do better. In a followup `getUserCost()` will be taught to use it to estimate the mask costs, which will allow for better cost model tests for it. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113313	2021-11-06 16:45:15 +03:00
Roman Lebedev	ad617183bb	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1 Even though AVX512's masked mem ops (unlike AVX1/2) have a mask that is a `VF x i1`, replication of said masks happens after promotion of it to `VF x i8`, so we should use `i8`, not `i1`, when calculating the cost of mask replication.	2021-11-05 17:27:02 +03:00
Roman Lebedev	df93c8a919	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask I don't really buy that masked interleaved memory loads/stores are supported on X86. There is zero costmodel test coverage, no actual cost modelling for the generation of the mask repetition, and basically only two LV tests. Additionally, i'm not very interested in AVX512. I don't know if this really helps "soft" block over at https://reviews.llvm.org/D111460#inline-1075467, but i think it can't make things worse at least. When we are being told that there is a masking, instead of completely giving up and falling back to fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`, let's correctly query the cost of masked memory ops, keep all the pretty shuffle cost modelling, but scalarize the cost computation for the mask replication. I think, not scalarizing the shuffles themselves may adjust the computed costs a bit, and maybe hopefully just enough to hide the "regressions" at https://reviews.llvm.org/D111460#inline-1075467 I do mean hide, because the test coverage is non-existent. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112873	2021-11-03 18:14:35 +03:00
Simon Pilgrim	6fc50e531d	[CostModel][X86] Remove old FIXME comments for AVX512F vector splitting Similar to AVX1, the cost of splitting/merging 512-bit -> 256-bit vectors for arithmetic operations is typically hidden due to different used ports etc.	2021-11-01 11:11:11 +00:00
Simon Pilgrim	6102e5d56b	[CostModel][X86] Remove old TODO comment BMI (TZCNT) scalar handling was added at rGa2db388dce77c2f23f2009d7363a0b63bb54523c	2021-10-29 17:28:45 +01:00
Roman Lebedev	8fac9e95ad	[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members By definition, interleaving load of stride N means: load N*VF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)*stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, a not fully interleaved load is when not all of these vectors are demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for a not-fully-interleaved group should be no greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) Right now, for such not-fully-interleaved loads we just use the costs for fully-interleaved loads. But at least in general, that is obviously overly pessimistic, because in general*, not all the shuffles needed to perform the full interleaving will end up being live. So what this does is naively scale the interleaving cost by the fraction of the live members. I believe this should still result in the right ballpark cost estimate, although it may be an over/under-estimate. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112307	2021-10-22 16:33:58 +03:00
Simon Pilgrim	5b395bd633	[CostModel][X86] Add costs for multiply-by-pow2 constants These are folded to left shifts in the backend. We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436	2021-10-20 13:11:21 +01:00
Simon Pilgrim	9fc523d114	[X86] Remove X86ProcFamilyEnum::IntelSLM Replace X86ProcFamilyEnum::IntelSLM enum with a TuningUseSLMArithCosts flag instead, matching what we already do for Goldmont. This just leaves X86ProcFamilyEnum::IntelAtom to replace with general Tuning/Feature flags and we can finally get rid of the old X86ProcFamilyEnum enum. Differential Revision: https://reviews.llvm.org/D112079	2021-10-20 11:58:39 +01:00
Simon Pilgrim	f041338153	[X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:46:10 +01:00
Simon Pilgrim	c850d5c5c8	[X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:15:14 +01:00
Roman Lebedev	91373bf12e	[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So could pick cost of `40` For store we have: https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So we could pick cost of `40`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111945	2021-10-17 17:28:10 +03:00
Roman Lebedev	3274ce3a28	[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So could pick cost of `32` For store we have: https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111944	2021-10-17 17:28:10 +03:00
Roman Lebedev	3a6a9f74d3	[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0` So could pick cost of `68` For store we have: https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111943	2021-10-17 17:28:09 +03:00
Roman Lebedev	4b76a74b42	[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32` For store we have: https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `48`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111942	2021-10-17 17:28:09 +03:00
Roman Lebedev	887acf6842	[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0` So could pick cost of `212` For store we have: https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0` So we could pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111940	2021-10-17 17:28:09 +03:00
Simon Pilgrim	85b87179f4	[TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938	2021-10-16 17:28:07 +01:00
Simon Pilgrim	6ec644e215	[TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64. It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win. Fixes PR47437 Differential Revision: https://reviews.llvm.org/D111938	2021-10-16 16:21:45 +01:00
Roman Lebedev	d137f1288e	[X86][LV] X86 does not prefer vectorized addressing And another attempt to start untangling this ball of threads around gather. There's `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`, which tells LV to try to vectorize the addresses that lead to loads, but X86 generally can not deal with vectors of addresses, the only instructions that support that are GATHER/SCATTER, but even those aren't available until AVX2, and aren't really usable until AVX512. This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111546	2021-10-16 12:32:18 +03:00
Roman Lebedev	3d7bf6625a	[X86][Costmodel] Improve cost modelling for not-fully-interleaved load While i've modelled most of the relevant tuples for AVX2, that only covered fully-interleaved groups. By definition, interleaving load of stride N means: load N*VF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)*stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) As we have established over the course of last ~70 patches, (wow) `BaseT::getInterleavedMemoryOpCost()` is absolutely bogus, it is usually almost an order of magnitude overestimation, so i would claim that we should at least use the hardcoded costs of fully interleaved load groups. We could go further and adjust them e.g. by the number of demanded indices, but then i'm somewhat fearful of underestimating the cost. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111174	2021-10-14 23:14:36 +03:00
Simon Pilgrim	871f773986	[TTI][X86] Merge getInterleavedMemoryOpCostAVX2 into getInterleavedMemoryOpCost. NFC This is an NFC refactor patch to merge the AVX2 interleaved cost handling back into the getInterleavedMemoryOpCost base method - while getInterleavedMemoryOpCostAVX512 uses instruction and patterns very specific to AVX512+, much of the costs analysis for AVX2 can be reused for all SSE targets. This is the first step towards improving SSE and AVX1 costs that will reuse the relevant AVX2 costs by splitting some of the tables - for instance AVX1 has very similar costs for most vXi64/vXf64 interleave patterns and many sub-128bit vector costs are the same all the way down to SSE2 (or at least SSSE3). Differential Revision: https://reviews.llvm.org/D111822	2021-10-14 18:46:25 +01:00
Simon Pilgrim	fcbec7e668	[TTI][X86] Swap getInterleavedMemoryOpCostAVX2/getInterleavedMemoryOpCostAVX512 implementations. NFC. I have some upcoming refactoring for SSE/AVX1 interleaving cost support, and the diff is a lot nicer if the (unaltered) AVX512 implementation isn't stuck between getInterleavedMemoryOpCost and getInterleavedMemoryOpCostAVX2	2021-10-14 18:10:03 +01:00
Simon Pilgrim	77dcdc2f50	[CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32 Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))	2021-10-14 12:17:40 +01:00
Roman Lebedev	18eef13dad	[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()` `X86TTIImpl::getGSScalarCost()` has (at least) two issues: * it naively computes the cost of sequence of `insertelement`/`extractelement`. If we are operating not on the XMM (but YMM/ZMM), this widely overestimates the cost of subvector insertions/extractions. * Gather/scatter takes a vector of pointers, and scalarization results in us performing scalar memory operation for each of these pointers, but we never account for the cost of extracting these pointers out of the vector of pointers. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111222	2021-10-13 22:35:39 +03:00
Simon Pilgrim	0776924a17	[CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost. These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.	2021-10-06 10:14:56 +01:00
Roman Lebedev	3f9b235482	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0` So could pick cost of `36` For store we have: https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0` So we could pick cost of `30`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111094	2021-10-05 16:58:58 +03:00
Roman Lebedev	e2784c5d8c	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0` So could pick cost of `18`. For store we have: https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0` So we could pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111093	2021-10-05 16:58:58 +03:00

750 Commits