llvm-project

Commit Graph

Author	SHA1	Message	Date
David Green	fa784f6382	[AArch64] Insert subvector costs An insert subvector under aarch64 can often be done as a single lane mov operation. For example a v4i8 inserted into a v16i8 is a s-reg mov, so long as the index is a multiple of 4. This teaches the cost model that, using code copied over from the X86 backend. Some of the costs (v16i16_4_0) are still high because they get matched as a SK_Select, not an SK_InsertSubvector. D120879 has some codegen tests for inserting subvectors, which I were added as llvm/test/CodeGen/AArch64/insert-subvector.ll. Differential Revision: https://reviews.llvm.org/D120880	2022-04-07 19:27:41 +01:00
David Green	750bf3582a	[AArch64] Increase cost of v2i64 multiplies The cost of a v2i64 multiply was special cased in D92208 as scalarized into 4extract + 2insert + 2*mul. Scalarizing to/from gpr registers are expensive though, and the cost wasn't high enough to prevent vectorizing in places where it can be detrimental for performance. This increases it so that the costs of copying to/from GPRs is increased to 2 each, with the total cost increasing to 14. So long as umull/smull are handled correctly (as in D123006) this seems to lead to better vectorization factors and better performance. Differential Revision: https://reviews.llvm.org/D123007	2022-04-04 17:42:20 +01:00
David Green	2abaa027d9	[AArch64] Teach the costmodel about widening muls A vector mul(sext, sext) or mul(zext, zext) will be code generated as a single smull or umull instruction. This most notably effects v2i64 multiplies, which are otherwise not legal and need to be expanded. The oneuse check has also been slightly changed, as it is already checked from the use of isWideningInstruction in getCastInstrCost. Differential Revision: https://reviews.llvm.org/D123006	2022-04-04 12:45:04 +01:00
David Green	2e2f38a1ac	[AArch64] Add widening arithmetic cost tests. NFC	2022-04-04 12:19:45 +01:00
Simon Pilgrim	d663166acb	[CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput	2022-03-30 09:11:55 +01:00
Arthur Eubanks	d051c566cd	[test] Remove the last couple uses of -analyze in llvm/test	2022-03-23 11:31:12 -07:00
David Green	c56dd20f69	[AArch64] Add extra insert subvector cost model tests. NFC	2022-03-22 12:20:19 +00:00
Yeting Kuo	ecd7a0132a	[RISCV] Add basic cost model for vector casting To perform the cost model of vector casting, the patch consider most vector casts as their scalar form and consider those vector form of free scalr castings as 1. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D121771	2022-03-22 14:17:08 +08:00
Simon Pilgrim	5dde9c1286	[CostModel][X86] Reduce cost of extracting bool vector elements For constant indices, these are now just a MOVMSK+TEST/BT	2022-03-18 19:02:47 +00:00
Simon Pilgrim	4455c5cdea	[CostModel][X86] Update RUN -passes=* to double quotes to appease update scripts on windows	2022-03-18 11:44:18 +00:00
Craig Topper	bbd2ecf9f0	[RISCV] Add +experimental-zvfh extension to cover half types in vectors. Currently we allow half types in vectors if the scalar Zfh extension is enabled. This behavior is not inline with the vector spec. For f32 and f64 types, the Zve32f, Zve64f, Zve64d, and V explicitly control the availablity of floating point types in vectors. In order to make our compiler compliant, we either need to remove all support for half in vectors or we need an extension to control it. Draft spec here https://github.com/riscv/riscv-v-spec/pull/780 Reviewed By: kito-cheng Differential Revision: https://reviews.llvm.org/D121345	2022-03-17 10:04:02 -07:00
David Sherwood	e7b89c2fc3	Add BasicTTIImpl cost model for llvm.get.active.lane.mask intrinsic The vectoriser sometimes generates predicated vector loops using the llvm.get.active.lane.mask intrinsic so it's important that we are able to calculate a valid cost for the call instruction. When SVE is enabled we are able to use a single whilelo instruction for some vector types - in such cases I've marked the cost as 1. For all other cases I've set the cost according to how the intrinsic will be expanded. Tests added here: Analysis/CostModel/AArch64/sve-intrinsics.ll Analysis/CostModel/ARM/active_lane_mask.ll Analysis/CostModel/RISCV/active_lane_mask.ll Differential Revision: https://reviews.llvm.org/D121109	2022-03-14 09:35:05 +00:00
Yeting Kuo	ae7c6647f3	[RISCV] Add basic code modeling for fixed length vector reduction. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D121447	2022-03-14 11:04:31 +08:00
Florian Hahn	aa590e5823	[AArch64] Improve costs for some conversions to fp16. Currently the cost model under-estimates the cost of certain FP16 conversions. This patch updates getCastInstrCost to return more accurate costs for the cases improved in `c2ed9fd054`. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D113700	2022-03-11 10:27:39 +00:00
Florian Hahn	697f55e368	[AArch64] Move fp16 cast tests. Move FP16 tests to fp16cast function, as suggested in D113700.	2022-03-10 12:22:06 +00:00
Roman Lebedev	2f80ea7f4f	[NFC][LV] Use different braces in debug output The analysis passes output function name encapsulated in `'` braces, but LV uses `"`. Harmonizing this may help in creating an update script for the LV costmodel test checks. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D121105	2022-03-07 19:32:37 +03:00
David Green	43b638241a	[AArch64] Use NPM for cost model tests. NFC As per the other tests, this switches the run lines back to using the NPM via -passes='print<cost-model>' -cost-kind=throughput 2>&1 -disable-output	2022-03-07 08:57:50 +00:00
Alex Tsao	89f15fc687	[RISCV] Add cost modelling for masked memory op The patch adds very basic cost model for masked memory op on scalable vector. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D117884	2022-03-03 20:47:58 +08:00
David Green	47f4cd9c3d	[AArch64] Update costs for some fp16 converts This updates the costs for FP16 converts, as some of them were pretty high. Differential Revision: https://reviews.llvm.org/D120771	2022-03-03 11:17:24 +00:00
David Green	65c0e45a37	[AArch64] Vector shifts cost 1 The costs of vector shifts was 2 as opposed to 1, as the nodes are marked custom. Fix this like the others and mark the nodes as cheap. Differential Revision: https://reviews.llvm.org/D120773	2022-03-03 10:42:57 +00:00
David Green	97e0366d67	[AArch64] Add some fp16 conversion cost tests. NFC	2022-03-02 18:07:14 +00:00
Nikita Popov	98cfcae4e9	Revert "[RISCV] Add cost modelling for masked memory op" This reverts commit `76f243b53b`. The newly added test fails.	2022-03-02 17:32:10 +01:00
Alex Tsao	76f243b53b	[RISCV] Add cost modelling for masked memory op The patch adds very basic cost model for masked memory op on scalable vector. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D117884	2022-03-02 22:48:41 +08:00
David Green	02de975259	[AArch64] Add some tests for the cost of extending an extract. NFC	2022-03-02 14:47:32 +00:00
David Green	62c2b070d5	[AArch64] Add simple arithmetic cost model test. NFC	2022-03-01 23:31:02 +00:00
David Green	2e7c35ea12	[AArch64] Cleanup and extend cast costs. NFC	2022-02-26 17:59:02 +00:00
David Green	5fe8307b70	[AArch64] Add scalar min/max costs. NFC The vector costs were already added, this adds scalar variants to complete the test coverage.	2022-02-25 17:11:24 +00:00
Pavel Kosov	37fa99eda0	[SchedModels][CortexA55] Add ASIMD integer instructions Depends on D114642 Original review https://reviews.llvm.org/D112201 OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D117003	2022-02-17 13:41:57 +03:00
Arthur Eubanks	15ba588d6d	[test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>'	2022-02-09 15:42:16 -08:00
David Green	b55d4c2ad8	Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`" This reverts commit `77a0da926c` as we've received multiple reports of this significantly impacting performance, in ways that don't seem to just be target specific cost models going wrong. I would offer some reproducers, but the test changes here seem to be full of them! Reverting for now and hopefully we can remove the "hack" more carefully as we go.	2022-02-09 20:02:54 +00:00
Craig Topper	09629215c2	[RISCV] Add a really basic cost model for SK_Splice. While testing scalable vectors I found that if we generate a vector splice intrinsic and run the code through the loop unroller, we'll crash due to an invalid cost. This adds a basic cost based on the 2 slide instructions used by the lowering in D119303. We probably need to factor LMUL into this, but that's true for arithmetic instructions too. So I've ignored for the moment. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D119316	2022-02-09 11:43:31 -08:00
Roman Lebedev	77a0da926c	[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()` D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model. What it essentially does is prevents scalarized vectorization of masked memory operations: ``` // TODO: Cost model for emulated masked load/store is completely // broken. This hack guides the cost model to use an artificially // high enough value to practically disable vectorization with such // operations, except where previously deployed legality hack allowed // using very low cost values. This is to avoid regressions coming simply // from moving "masked load/store" check from legality to cost model. // Masked Load/Gather emulation was previously never allowed. // Limited number of Masked Store/Scatter emulation was allowed. ``` While i don't really understand about what specifically `is completely broken` was talking about, i believe that at least on X86 with AVX2-or-later, this is no longer true. (or at least, i would like to know what is still broken). So i would like to follow suit after D111460, and like wise disable that hack for AVX2+. But since this was added for X86 specifically, let's just instead completely remove this hack. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114779	2022-02-07 16:08:31 +03:00
Florian Hahn	17ebd68ae6	[AArch64] Fix costs of float vector compare/selects pairs. The current cost-model overestimates the cost of vector compares & selects for ordered floating point compares. This patch fixes that by extending the existing logic for integer predicates. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D118256	2022-01-31 10:18:29 +00:00
Florian Hahn	cb3df1a299	[AArch64] Add vector compare/select tests with UNE predicate. Precommit some additional tests for D118256.	2022-01-27 14:20:40 +00:00
Florian Hahn	e6ebd2c72d	[AArch64] Add float vector compare/select cost-model tests.	2022-01-26 16:27:29 +00:00
Alban Bridonneau	2feddb37b4	Implement correct cost for SVE bitcasts We have some bitcasts which we know will be simplified, so their cost is zero. Reviewed By: david-arm, sdesmalen Differential Revision: https://reviews.llvm.org/D118019	2022-01-26 14:25:44 +00:00
Stanislav Mekhanoshin	bb1fe36977	[AMDGPU] Make v8i16/v8f16 legal Differential Revision: https://reviews.llvm.org/D117721	2022-01-24 11:51:08 -08:00
eopXD	3cf15af2da	[RISCV] Remove experimental prefix from rvv-related extensions. Extensions affected: +v, +zve, +zvl Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D117860	2022-01-22 20:18:40 -08:00
Shao-Ce SUN	a0a76fee0c	[RISCV] update zfh and zfhmin extention to v1.0 `zfh` and `zfhmin` have been ratified, with version 1.0. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D117098	2022-01-15 09:21:24 +08:00
David Sherwood	ef1ca4d3e9	[AArch64] Fix incorrect use of MVT::getVectorNumElements in AArch64TTIImpl::getVectorInstrCost If we are inserting into or extracting from a scalable vector we do not know the number of elements at runtime, so we can only let the index wrap for fixed-length vectors. Tests added here: Analysis/CostModel/AArch64/sve-insert-extract.ll Differential Revision: https://reviews.llvm.org/D117099	2022-01-13 09:27:14 +00:00
Andrew Litteken	4ff4e7ea30	[CostModel] Use cost of target trunc type when only it is the only use of a non-register sized load The code size cost model for most targets uses the legalization cost for the type of the pointer of a load. If this load is followed directly by a trunc instruction, and is the only use of the result of the load, only one instruction is generated in the target assembly language. This adds a check for this case, and uses the target type of the trunc instruction if so. This did not show any changes in CTMark code size benchmarks. Reviewers: paquette, samparker, dmgreen Differential Revision: https://reviews.llvm.org/D109388	2022-01-12 18:03:50 -06:00
Simon Pilgrim	5eb47961c4	[CostModel][X86] Update ROTL/ROTR vXi8/vXi16 costs on AVX512BW targets Refresh based off recent improvements to codegen and the helper script from D103695	2022-01-10 13:18:25 +00:00
David Green	bc615e436c	[AArch64] Update addo and subo costs Similar to D116732, this adds basic scalar sadd_with_overflow, uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for aarch64, which are usually quite efficiently lowered. Differential Revision: https://reviews.llvm.org/D116734	2022-01-07 16:20:23 +00:00
David Green	c65270cf96	[AArch64] Add basic umulo and smulo costs This adds some AArch64 specific smul_with_overflow and umul_with_overflow costs, overriding the default costs. The code generation for these mul with overflow intrinsics is usually better than the default expansion on AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll. Differential Revision: https://reviews.llvm.org/D116732	2022-01-06 17:22:47 +00:00
Nico Weber	085f078307	Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."" This reverts commit `859ebca744`. The change contained many unrelated changes and e.g. restored unit test failes for the old lld port.	2022-01-05 13:10:25 -05:00
David Salinas	859ebca744	Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`." This reverts commit `640beb38e7`. That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup Change-Id: Ibf8e397df94001f248fba609f072088a46abae08 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D115960 Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105	2022-01-05 17:57:32 +00:00
Jun Ma	80e56ad9ae	[TTI] Return invalid cost for scalable vector in getShuffleCost Differential Revision: https://reviews.llvm.org/D116362	2022-01-05 18:59:11 +08:00
Daniil Fukalov	a2120f6b44	[NFC][AMDGPU][CostModel] Add tests for AMDGPU cost model, part 2.	2021-12-22 22:33:57 +03:00
Daniil Fukalov	deaedab14a	[NFC][AMDGPU][CostModel] Add tests for AMDGPU cost model.	2021-12-22 22:32:09 +03:00
Matthew Devereau	e00f22c1b1	[AArch64][SVE] Teach cost model that masked loads/stores are cheap Reduce the cost of VLS masked loads/stores to make the vectorizor emit them more frequently.	2021-12-17 15:04:45 +00:00
Alexandros Lamprineas	61bb8b5d40	[AArch64] Convert sra(X, elt_size(X)-1) to cmlt(X, 0) CMLT has twice the execution throughput of SSHR on Arm out-of-order cores. Differential Revision: https://reviews.llvm.org/D115457	2021-12-14 16:03:02 +00:00
Daniil Fukalov	e5c64b45be	[CostModel][AMDGPU] Fix intrinsics costs estimations. 1. Fixed costs inconsistency for llvm.fma.vXf16 instinsiscs. 2. Added tests for llvm.sadd.sat, llvm.ssub.sat, llvm.uadd.sat, llvm.usub.sat intrisics since they have special processing in cost model. 3. Minor intrisics' costs tests updat and refinement. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D115385	2021-12-13 17:17:34 +03:00
David Sherwood	8b0448ce5d	[AArch64][Analysis] Add on overhead costs for SVE gathers and scatters This patch adds on an overhead cost for gathers and scatters, which is a rough estimate based on performance investigations I have performed on SVE hardware for various micro-benchmarks. Differential Revision: https://reviews.llvm.org/D115143	2021-12-09 16:02:59 +00:00
Haohai Wen	d2c093e79d	[CostModel][X86] Add i64 mul cost for avx512 as 1cy i64 mul cost is 1cy for all cpu that support avx512. Currently all X86 cpu uses i64 mul cost in X64 cost table which is not true for cpu that support avx512 (skx, icx). Reviewed By: pengfei, RKSimon Differential Revision: https://reviews.llvm.org/D115016	2021-12-08 11:29:08 +08:00
Cullen Rhodes	698584f89b	[IR] Remove unbounded as possible value for vscale_range minimum The default for min is changed to 1. The behaviour of -mvscale-{min,max} in Clang is also changed such that 16 is the max vscale when targeting SVE and no max is specified. Reviewed By: sdesmalen, paulwalker-arm Differential Revision: https://reviews.llvm.org/D113294	2021-12-07 09:52:21 +00:00
David Green	255ad73424	[ARM] Make MVE v2i1 predicates legal MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the two halves. This was never treated as a legal type in llvm in the past as there are not many 64bit instructions and no 64bit compares. There are a few instructions that could use it though, notably a VSELECT (as it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for similar reasons, some gathers/scatter and long multiplies and VCTP64 instructions. This patch goes through and makes v2i1 a legal type, handling all the cases that fall out of that. It also makes VSELECT legal for v2i64 as a side benefit. A lot of the codegen changes as a result - usually in way that is a little better or a little worse, but still expensive. Costs can change a little too in the process, again in a way that expensive things remain expensive. A lot of the tests that changed are mainly to ensure correctness - the code can hopefully be improved in the future where it comes up in practice. The intrinsics currently remain using the v4i1 they previously did to emulate a v2i1. This will be changed in a followup patch but this one was already large enough. Differential Revision: https://reviews.llvm.org/D114449	2021-12-03 14:05:41 +00:00
Daniil Fukalov	ab05ab59a7	[CostModel][AMDGPU] Fix instructions costs estimation for vector types. 1. Fixed vector instructions costs estimations incosistency - removed different logic for "not simple types" since it biases costs for these types. 2. Fixed legalization penalty for vectors too big for the target: changed from overwrite default legalization cost value estimation to added penalty. 3. Fixed few typos in tests. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114893	2021-12-03 03:08:08 +03:00
Roman Lebedev	8cd782487f	[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()` We ask `TTI.getAddressComputationCost()` about the cost of computing vector address, and then multiply it by the vector width. This doesn't make any sense, it implies that we'd do a vector GEP and then scalarize the vector of pointers, but there is no such thing in the vectorized IR, we perform scalar GEP's. This is especially bad on X86, and was effectively prohibiting any scalarized vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()` says that cost of vector address computation is `10` as compared to `1` for scalar. The computed costs are similar to the ones with D111222+D111220, but we end up without masked memory intrinsics that we'd then have to expand later on, without much luck. (D111363) Differential Revision: https://reviews.llvm.org/D111460	2021-11-30 10:47:56 +03:00
Roman Lebedev	7e73c2a66a	[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle The mask on the shuffle is for the output, not the input. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114697	2021-11-29 18:37:07 +03:00
Roman Lebedev	5e96553608	[NFC][X86][LV][Costmodel] Add most basic test for masked interleaved load	2021-11-29 16:46:19 +03:00
Roman Lebedev	cffe3a084f	[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()` ... to actually ask about i1-elt-wide mask, since that is what will probably be used on AVX512. This unblocks D111460. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114316	2021-11-29 14:41:48 +03:00
Zarko Todorovski	7f7dac7126	[NFC][llvm] Inclusive language: reword uses of sanity test and check Part of continuing work to use more inclusive language. Reworded uses of sanity check and sanity test in llvm/test/	2021-11-25 07:21:42 -05:00
Roman Lebedev	cd8d219536	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ I believe, this effectively completes `X86TTIImpl::getReplicationShuffleCost()` for AVX512, other than the question of handling plain AVX512F, where we end up with some really ugly "shuffles", but then is there any CPU's that support AVX512, but not AVX512DQ/AVX512BW? Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114315	2021-11-24 17:23:15 +03:00
Roman Lebedev	704d92607d	[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions Apparently my methodology was suboptimal, and not only did miss all the +VL tuples, i also missed some plain tuples. I believe, this adds everything missing. Indeed, these manual costmodels are just not okay long-term. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114334	2021-11-22 14:31:34 +03:00
Roman Lebedev	8d09dd61c3	[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW, these either sign-extent the mask register into a vector, or pack the mask from vector register. Apparently, we didn't even have MCA tests for these, added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05, so i'm just guessing that their perf characteristics are optimal. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114314	2021-11-22 14:31:34 +03:00
Roman Lebedev	df70cf5e14	[NFC][X86][Costmodel] Actually test +prefer-256-bit in replication-shuffle-related tests :( While -prefer-256-bit indeed becomes complete with D114314, the real-world (the one with +prefer-256-bit) coverage is lacking. Hilarious.	2021-11-21 01:25:49 +03:00
Roman Lebedev	da47a63e03	[NFC][X86][Costmodel] Add AVX512DQ runlines to trunc.ll/extend.ll	2021-11-20 13:55:13 +03:00
Roman Lebedev	049799c311	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles), we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114071	2021-11-19 15:58:10 +03:00
Roman Lebedev	a751084bb4	[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1` Note that there are many other missing costs, i'm only adding the ones that are queried from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D114070	2021-11-19 15:57:32 +03:00
Roman Lebedev	a50fdd3fc9	[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW Here we get pretty lucky. AVX512F does not provide any instructions to convert between a `k` vector mask and a vector, but AVX512BW adds `{k}<->nX{i8,i16}`conversions, and just as it happens, with AVX512BW we have a i16 shuffle. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113915	2021-11-19 15:55:41 +03:00
Roman Lebedev	496ccb543e	[NFC][X86][Costmodel] Improve test coverage for i32->i64 vector *ext	2021-11-17 12:02:50 +03:00
Roman Lebedev	2037ec725f	[X86][Costmodel] `ext v64i1 to v32i16` can appear after legalization, cost is same as for `ext v32i1 to v32i16` Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113914	2021-11-17 12:02:50 +03:00
Roman Lebedev	23b194bf18	[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1` Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113913	2021-11-17 12:02:50 +03:00
David Green	309f1e4ac8	[ARM] Add datalayout to costmodel tests. NFC This adds a sensible datalayout to the ARM cost model tests, to prevent the costs reported being incorrect for the size of pointers.	2021-11-16 09:49:42 +00:00
Roman Lebedev	7114c60e8f	[NFC][X86][Costmodel] Improve test coverage for {i8,i16,i32,i64}->i1 vector trunc	2021-11-15 20:46:48 +03:00
Roman Lebedev	949103dc36	[NFC][X86][Costmodel] Improve test coverage for i1->{i8,i16,i32,i64} vector *ext	2021-11-15 20:46:48 +03:00
Roman Lebedev	bc35d5fe2f	[NFC][X86][Costmodel] Add i1 replication shuffle costmodel test coverage	2021-11-15 20:02:52 +03:00
Roman Lebedev	5c7255fe3a	[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type, so this change does affect vectorization. In the end, it will ask about i1. We should also try to promote to i16 if we have AVX512BW, i'll do that in a follow-up. All costs here look good, i've added the missing truncation costs in preparatory patches. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113853	2021-11-15 19:04:02 +03:00
Roman Lebedev	a468c39c90	[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8` Some of the costs get larger here, but i suppose that makes sense since we'd previously query scalarization costs that may not be really representative of the reality. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113852	2021-11-15 19:04:02 +03:00
Roman Lebedev	9e57d9b09d	[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8` While this one is trivial and identical to the previous patch, there is a weird cost change in a follow-up patch that i'm not sure about. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113851	2021-11-15 19:04:02 +03:00
Roman Lebedev	0116c708c6	[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8` While this one is trivial and identical to the previous patch, there is a weird cost change in a follow-up patch that i'm not sure about. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113850	2021-11-15 19:04:02 +03:00
Roman Lebedev	f86b57e37c	[NFC][X86][Costmodel] Improve test coverage for {i16,i32,i64}->i8 vector trunc	2021-11-14 20:25:40 +03:00
Roman Lebedev	f0da329f93	[NFC][X86][Costmodel] Improve test coverage for i8->{i16,i32,i64} vector *ext	2021-11-14 20:25:33 +03:00
Roman Lebedev	4dd2f0446c	[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW The basic idea is simple, if we don't have native shuffle for this element type, then we must have native shuffle for wider element type, so promote, replicate, demote. I believe, asking `getCastInstrCost(Instruction::Trunc` is correct semantically, case in point `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in 2 * XMM, that then will be packed into 1 * YMM, and it should count the cost of said packing, not just the truncations. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113609	2021-11-14 20:01:38 +03:00
Roman Lebedev	b283961012	[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16` Same as D113842, but for i64 Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113843	2021-11-14 18:41:38 +03:00
Roman Lebedev	a5f2fdca99	[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16` This was noticed in D113609, hopefully it unblocks that patch. There are likely other similar problems. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113842	2021-11-14 18:41:37 +03:00
Roman Lebedev	17a3df87ff	[NFC][X86][Costmodel] Improve test coverage for {i32,i64}->i16 vector *ext See https://reviews.llvm.org/D113609 - some of these costs seem wrong.	2021-11-14 16:07:30 +03:00
Roman Lebedev	fd24446ba5	[NFC][X86][Costmodel] Improve test coverage for i16->{i32,i64} vector *ext	2021-11-14 16:06:45 +03:00
Florian Hahn	8d2a1994c8	[AArch64] Add some fp16 cast cost-model tests. This adds initial tests for cost-modeling {u,s}itofp for fp16 vectors. At the moment, they are under-estimated in a couple of cases.	2021-11-11 18:21:44 +00:00
Roman Lebedev	a70d74323e	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI VBMI introduced VPERMB, so cost-model i8 replication shuffle using it. Note that we can still model i8 replication shufflle without VBMI, by promoting to i16/i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113479	2021-11-10 22:52:40 +03:00
Roman Lebedev	c6e894b9b2	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW BWI introduced VPERMW, so cost-model i16 replication shuffle using it. Note that we can still model i16 replication shufflle without BWI, by promoting to i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113478	2021-11-10 22:52:39 +03:00
Roman Lebedev	4101c7bf19	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`, that take a single input vector and a single index vector, and are cross-lane. So far i haven't seen evidence that replication ever results in demanding more than a single input vector per output vector. This results in shockingly lesser costs :) Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113350	2021-11-10 22:52:33 +03:00
Roman Lebedev	3bdf738d1b	[NFC][X86][Costmodel] Add i16 replication shuffle costmodel test coverage	2021-11-09 14:19:44 +03:00
Roman Lebedev	23566f18c6	[NFC][X86][Costmodel] Add tests for i32/i64 replication shuffles While this isn't what we eventually need (i8 or i1), approaching from this end is more straight-forward.	2021-11-06 17:14:56 +03:00
Roman Lebedev	a30ec4778a	[TTI][CostModel] `getUserCost()`: recognize replication shuffles and query their cost This finally creates proper test coverage for replication shuffles, that are used by LV for conditional loads, and will allow to add proper costmodel at least for AVX512. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113324	2021-11-06 16:45:15 +03:00
Roman Lebedev	04fa7cbf55	[NFC][CostModel] Add exhaustive test coverage for replication shuffles This coverage has been brought to you by https://godbolt.org/z/nfc3cY1za	2021-11-06 00:53:28 +03:00
Roman Lebedev	ad617183bb	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1 Even though AVX512's masked mem ops (unlike AVX1/2) have a mask that is a `VF x i1`, replication of said masks happens after promotion of it to `VF x i8`, so we should use `i8`, not `i1`, when calculating the cost of mask replication.	2021-11-05 17:27:02 +03:00
Roman Lebedev	34b903d8b0	[NFC] Add forgotten `REQUIRES: asserts` into the new costmodel test	2021-11-03 19:40:23 +03:00
Roman Lebedev	c65e2ac405	[NFC] Rewrite runlines in interleaved-store-accesses-with-gaps.ll once again https://lab.llvm.org/buildbot/#/builders/98/builds/8198 is still failing, and i really don't understand how runlines in this test differ from the ones in other nearby tests...	2021-11-03 19:15:33 +03:00
Roman Lebedev	df93c8a919	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask I don't really buy that masked interleaved memory loads/stores are supported on X86. There is zero costmodel test coverage, no actual cost modelling for the generation of the mask repetition, and basically only two LV tests. Additionally, i'm not very interested in AVX512. I don't know if this really helps "soft" block over at https://reviews.llvm.org/D111460#inline-1075467, but i think it can't make things worse at least. When we are being told that there is a masking, instead of completely giving up and falling back to fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`, let's correctly query the cost of masked memory ops, keep all the pretty shuffle cost modelling, but scalarize the cost computation for the mask replication. I think, not scalarizing the shuffles themselves may adjust the computed costs a bit, and maybe hopefully just enough to hide the "regressions" at https://reviews.llvm.org/D111460#inline-1075467 I do mean hide, because the test coverage is non-existent. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112873	2021-11-03 18:14:35 +03:00
Roman Lebedev	f3d1ddfe71	[NFC] Use single-dash-prefixed options in newly-added test https://lab.llvm.org/buildbot/#/builders/98/builds/8195 complains, and this is the only guess i have.	2021-11-03 18:12:40 +03:00
Roman Lebedev	a4b64f7727	[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used As it can be seen in `InnerLoopVectorizer::vectorizeInterleaveGroup()`, in some cases (reported by `UseMaskForGaps`), the gaps in the interleaved load/store group will be masked away by another constant mask, so there is no need to account for the cost of replication of the mask for these. Differential Revision: https://reviews.llvm.org/D112877	2021-11-03 17:33:28 +03:00
Roman Lebedev	c6b3da1d66	[NFC][X86] Duplicate LV test into a costmodel test Copied from llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll As discussed in D111460 / D112877 / D112873 we have basically no test coverage for this part of cost model.	2021-11-03 17:25:18 +03:00
Roman Lebedev	db848fbf67	[NFC][LV][X86] Improve test coverage for masked mem ops	2021-10-27 13:36:04 +03:00
David Green	74b2a4edcc	[AArch64] Add a costmodel test for overflowing arithmatic. NFC	2021-10-26 10:35:12 +01:00
Roman Lebedev	8fac9e95ad	[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members By definition, interleaving load of stride N means: load NVF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)stride + 0`, and 1'th vector containing elements `[0, VF)stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) Right now, for such not-fully-interleaved loads we just use the costs for fully-interleaved loads. But at least in general, that is obviously overly pessimistic, because in general*, not all the shuffles needed to perform the full interleaving will end up being live. So what this does, is naively scales the interleaving cost by the fraction of the live members. I believe this should still result in the right ballpark cost estimate, although it may be over/under -estimate. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112307	2021-10-22 16:33:58 +03:00
David Sherwood	9448cdc900	[SVE][Analysis] Tune the cost model according to the tune-cpu attribute This patch introduces a new function: AArch64Subtarget::getVScaleForTuning that returns a value for vscale that can be used for tuning the cost model when using scalable vectors. The VScaleForTuning option in AArch64Subtarget is initialised according to the following rules: 1. If the user has specified the CPU to tune for we use that, else 2. If the target CPU was specified we use that, else 3. The tuning is set to "generic". For CPUs of type "generic" I have assumed that vscale=2. New tests added here: Analysis/CostModel/AArch64/sve-gather.ll Analysis/CostModel/AArch64/sve-scatter.ll Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll Differential Revision: https://reviews.llvm.org/D110259	2021-10-21 09:33:50 +01:00
Simon Pilgrim	5b395bd633	[CostModel][X86] Add costs for multiply-by-pow2 constants These are folded to left shifts in the backend. We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436	2021-10-20 13:11:21 +01:00
David Green	862e8d7e55	[AArch64] Improve div and rem costmodel tests. NFC Copied from the X86 tests, these give a better test coveraged than the existing tests.	2021-10-20 09:58:35 +01:00
Simon Pilgrim	f041338153	[X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:46:10 +01:00
Simon Pilgrim	c850d5c5c8	[X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:15:14 +01:00
Simon Pilgrim	dc3382dc2c	[CostModel][X86] Add mul by positive/negative power-of-2 constants tests We have backend optimizations for these, but currently the costmodel doesn't match them	2021-10-17 20:34:17 +01:00
Simon Pilgrim	dbf5dc8930	[CostModel][X86] Add div/rem by negative power-of-2 constants We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them	2021-10-17 18:51:15 +01:00
Roman Lebedev	91373bf12e	[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So could pick cost of `40` For store we have: https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So we could pick cost of `40`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111945	2021-10-17 17:28:10 +03:00
Roman Lebedev	3274ce3a28	[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So could pick cost of `32` For store we have: https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111944	2021-10-17 17:28:10 +03:00
Roman Lebedev	3a6a9f74d3	[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0` So could pick cost of `68` For store we have: https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111943	2021-10-17 17:28:09 +03:00
Roman Lebedev	4b76a74b42	[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32` For store we have: https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `48`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111942	2021-10-17 17:28:09 +03:00
Roman Lebedev	887acf6842	[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0` So could pick cost of `212` For store we have: https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0` So we could pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111940	2021-10-17 17:28:09 +03:00
Simon Pilgrim	85b87179f4	[TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938	2021-10-16 17:28:07 +01:00
Simon Pilgrim	6ec644e215	[TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64. It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win. Fixes PR47437 Differential Revision: https://reviews.llvm.org/D111938	2021-10-16 16:21:45 +01:00
Roman Lebedev	d137f1288e	[X86][LV] X86 does not prefer vectorized addressing And another attempt to start untangling this ball of threads around gather. There's `TTI::prefersVectorizedAddressing()`hoop, which confusingly defaults to `true`, which tells LV to try to vectorize the addresses that lead to loads, but X86 generally can not deal with vectors of addresses, the only instructions that support that are GATHER/SCATTER, but even those aren't available until AVX2, and aren't really usable until AVX512. This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111546	2021-10-16 12:32:18 +03:00
Roman Lebedev	3d7bf6625a	[X86][Costmodel] Improve cost modelling for not-fully-interleaved load While i've modelled most of the relevant tuples for AVX2, that only covered fully-interleaved groups. By definition, interleaving load of stride N means: load NVF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) As we have established over the course of last ~70 patches, (wow) `BaseT::getInterleavedMemoryOpCos()` is absolutely bogus, it is usually almost an order of magnitude overestimation, so i would claim that we should at least use the hardcoded costs of fully interleaved load groups. We could go further and adjust them e.g. by the number of demanded indices, but then i'm somewhat fearful of underestimating the cost. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111174	2021-10-14 23:14:36 +03:00
Simon Pilgrim	77dcdc2f50	[CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32 Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))	2021-10-14 12:17:40 +01:00
Roman Lebedev	cb41efb5f4	[NFC][Costmodel][X86] Fix broken `CHECK-NOT`'s in interleave costmodel tests	2021-10-13 22:44:57 +03:00
Roman Lebedev	18eef13dad	[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()` `X86TTIImpl::getGSScalarCost()` has (at least) two issues: * it naively computes the cost of sequence of `insertelement`/`extractelement`. If we are operating not on the XMM (but YMM/ZMM), this widely overestimates the cost of subvector insertions/extractions. * Gather/scatter takes a vector of pointers, and scalarization results in us performing scalar memory operation for each of these pointers, but we never account for the cost of extracting these pointers out of the vector of pointers. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111222	2021-10-13 22:35:39 +03:00
David Green	adec922361	[AArch64] Make -mcpu=generic schedule for an in-order core We would like to start pushing -mcpu=generic towards enabling the set of features that improves performance for some CPUs, without hurting any others. A blend of the performance options hopefully beneficial to all CPUs. The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change from `eecb353d0e` which made -mcpu=generic perform in-order scheduling using the cortex-a8 schedule model. The idea is that in-order cpu's require the most help in instruction scheduling, whereas out-of-order cpus can for the most part out-of-order schedule around different codegen. Our benchmarking suggests that hypothesis holds. When running on an in-order core this improved performance by 3.8% geomean on a set of DSP workloads, 2% geomean on some other embedded benchmark and between 1% and 1.8% on a set of singlecore and multicore workloads, all running on a Cortex-A55 cluster. On an out-of-order cpu the results are a lot more noisy but show flat performance or an improvement. On the set of DSP and embedded benchmarks, run on a Cortex-A78 there was a very noisy 1% speed improvement. Using the most detailed results I could find, SPEC2006 runs on a Neoverse N1 show a small increase in instruction count (+0.127%), but a decrease in cycle counts (-0.155%, on average). The instruction count is very low noise, the cycle count is more noisy with a 0.15% decrease not being significant. SPEC2k17 shows a small decrease (-0.2%) in instruction count leading to a -0.296% decrease in cycle count. These results are within noise margins but tend to show a small improvement in general. When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so should not be affected by this change when running from clang. This also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does, only changing the schedule used. A lot of existing tests have updated. This is a summary of the important differences: - Most changes are the same instructions in a different order. - Sometimes this leads to very minor inefficiencies, such as requiring an extra mov to move variables into r0/v0 for the return value of a test function. - misched-fusion.ll was no longer fusing the pairs of instructions it should, as per D110561. I've changed the schedule used in the test for now. - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to the different latencies. This seems fine to me. - Some SVE tests do not always remove movprfx where they did before due to different register allocation giving different destructive forms. - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll produce two LDR where they previously produced an LDP due to store-pair-suppress kicking in. - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LPD. - Some tests such as arm64-neon-mul-div.ll and ragreedy-local-interval-cost.ll have more, less or just different spilling. - In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly if I switch this to use any other scheduled even less is outlined. Some of these are expected to happen, such as differences in outlining or register spilling. There will be places where these result in worse codegen, places where they are better, with the SPEC instruction counts suggesting it is not a decrease overall, on average. Differential Revision: https://reviews.llvm.org/D110830	2021-10-09 15:58:31 +01:00
Simon Pilgrim	b6426d5211	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT/UGT for generic abs/min/max cost expansion Split off ABS cost handling from MIN/MAX and use explicit predicates for each Our generic expansion of ABS doesn't use NEG+CMP+SELECT any more (its now ASHR+ADD+XOR) so this needs to be updated.	2021-10-08 12:41:58 +01:00
Simon Pilgrim	716883736b	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT for generic sadd/ssub sat cost expansion The comparison always checks for negative values so know the icmp predicate will be ICMP_SGT	2021-10-07 15:42:45 +01:00
Simon Pilgrim	2ced9a42be	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion Match the predicate used in TargetLowering::expandMULO to detect overflow	2021-10-06 19:11:33 +01:00
Simon Pilgrim	7bd097fd1e	[CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detected signed values. The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply. More closely matches the default expansion from TargetLowering::expandMULO	2021-10-06 19:11:32 +01:00
Simon Pilgrim	81b5da8c97	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion Match the predicates used in TargetLowering::expandUADDSUBO	2021-10-06 19:11:32 +01:00
Simon Pilgrim	3dda247e18	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_EQ for generic funnel shift cost expansion The comparison always checks for zero value so know the icmp predicate will be ICMP_EQ	2021-10-06 16:39:16 +01:00
Simon Pilgrim	0776924a17	[CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost. These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.	2021-10-06 10:14:56 +01:00
Roman Lebedev	f92961d238	[NFC] Fixup newly-added costmodel tests to actually test what they should	2021-10-05 21:35:47 +03:00
Roman Lebedev	200edc152b	[NFC][X86][LV] Add basic costmodel test coverage for not-fully-interleaved i32 loads The coverage could have cumulative explosion here, so i'm adding only the most basic cases, and hoping it's enough, though more can be added if needed.	2021-10-05 19:39:50 +03:00
Roman Lebedev	3f9b235482	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0` So could pick cost of `36` For store we have: https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0` So we could pick cost of `30`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111094	2021-10-05 16:58:58 +03:00
Roman Lebedev	e2784c5d8c	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0` So could pick cost of `18`. For store we have: https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0` So we could pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111093	2021-10-05 16:58:58 +03:00
Roman Lebedev	3960693048	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111092	2021-10-05 16:58:58 +03:00
Roman Lebedev	79d6d12d95	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs This one required quite a bit of an assembly surgery, but i think it's in the right ballpark.. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So could pick cost of `64`. For store we have: https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5` So we could pick cost of `66`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111091	2021-10-05 16:58:58 +03:00
Roman Lebedev	2996a2b50f	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0` So could pick cost of `31`. For store we have: https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8` So we could pick cost of `33`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111089	2021-10-05 16:58:57 +03:00
Roman Lebedev	d51532d8aa	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8` So could pick cost of `15`. For store we have: https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0` So we could pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111087	2021-10-05 16:58:57 +03:00
Roman Lebedev	764fd5f463	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3` So could pick cost of `6`. For store we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0` So we could pick cost of `9`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111083	2021-10-05 16:58:57 +03:00
Roman Lebedev	c800119c46	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So could pick cost of `20`. For store we have: https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So we could pick cost of `20`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111076	2021-10-05 16:58:57 +03:00
Roman Lebedev	000ce0bfd5	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111075	2021-10-05 16:58:57 +03:00
Roman Lebedev	dcc2b0d933	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111073	2021-10-05 16:58:57 +03:00
Roman Lebedev	7d91037fd2	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=16 interleaving costs This one required quite a bit of assembly surgery, but the trend continues, so i think this is right. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/EKWdj8cKT - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32`. For store we have: https://godbolt.org/z/zj4bb9P75 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111064	2021-10-05 16:58:57 +03:00
Roman Lebedev	4aee1e5b93	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/a6rxMG6ec - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=12.0` So could pick cost of `16`. For store we have: https://godbolt.org/z/ced1bdqc9 - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0` So we could pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111063	2021-10-05 16:58:57 +03:00
Roman Lebedev	3c2e22b795	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/avq1oz98W - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/89PGMc1qs - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111061	2021-10-05 16:58:57 +03:00
Roman Lebedev	b6234c1edf	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=2 interleaving costs Finally, we are getting to the heavy-hitter stuff! The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/7crGWoar6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So could pick cost of `4`. For store we have: https://godbolt.org/z/T8aq3MszM - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.0` So we could pick cost of `5`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111060	2021-10-05 16:58:56 +03:00
Roman Lebedev	dee4d699b2	[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=6	2021-10-04 20:57:35 +03:00

1 2 3 4 5 ...

1215 Commits