llvm-project

Commit Graph

Author	SHA1	Message	Date
Alexey Bataev	75e1cf4a6a	[COST]Improve cost model for shuffles in SLP. Introduced masks where they were not added and improved target-dependent cost models to avoid returning incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-28 10:04:41 -07:00
Alexey Bataev	9861ca0c23	Revert "[COST]Improve cost model for shuffles in SLP." This reverts commit `29a470e380` to fix a crash reported in https://reviews.llvm.org/D100486#3479989.	2022-04-28 08:11:56 -07:00
Alexey Bataev	29a470e380	[COST]Improve cost model for shuffles in SLP. Introduced masks where they were not added and improved target-dependent cost models to avoid returning incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-27 10:56:26 -07:00
Simon Pilgrim	4455c5cdea	[CostModel][X86] Update RUN -passes=* to double quotes to appease update scripts on windows	2022-03-18 11:44:18 +00:00
Arthur Eubanks	15ba588d6d	[test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>'	2022-02-09 15:42:16 -08:00
David Green	38c9a4068d	[TTI] Remove IsPairwiseForm from getArithmeticReductionCost This patch removes the IsPairwiseForm flag from the Reduction Cost TTI hooks, along with some accompanying code for pattern matching reductions from trees starting at extract elements. IsPairWise is now assumed to be false, which was the predominant way that the value was used from both the Loop and SLP vectorizers. Since the adjustments such as D93860, the SLP vectorizer has not relied upon this distinction between pairwise and non-pairwise reductions. This also removes some code that was detecting reduction trees starting from extract elements inside the costmodel. This case was double-counting costs though, adding the individual costs on the individual instruction _and_ the total cost of the reduction. Removing it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to not double count. The cost of reduction intrinsics is still tested through the various tests in llvm/test/Analysis/CostModel/X86/reduce-xyz.ll. Differential Revision: https://reviews.llvm.org/D105484	2021-07-09 11:51:16 +01:00
Craig Topper	f4c67dfa92	[X86] More accurately model the cost of horizontal reductions. This patch attempts to more accurately model the reduction of power of 2 vectors of types we natively support. This takes into account the narrowing of vectors that occurs as we go from 512 bits to 256 bits, to 128 bits. It also takes into account the use of wider elements in the shuffles for the first 2 steps of a reduction from 128 bits. And uses a v8i16 shift for the final step of vXi8 reduction. The default implementation uses the legalized type for the arithmetic for all levels. And uses the single source permute cost of the legalized type for all levels. This penalizes things like lack of v16i8 pshufb on pre-sse3 targets and the splitting and joining that needs to be done for integer types on AVX1. We never need v16i8 shuffle for a reduction and we only need split AVX1 ops when the type is wide and needs to be split. I think we're still over costing splits and joins for AVX1, but we're closer now. I've also removed all pairwise special casing because I don't think we ever want to generate that on X86. I've also adjusted the add handling to more accurately account for any type splitting that occurs before we reach a legal type. Differential Revision: https://reviews.llvm.org/D76478	2020-03-22 14:20:15 -07:00
Simon Pilgrim	a091f70610	[CostModel][X86] Improve add vXi64 + fadd vXf64 reduction tests for SLM As noted on D59710 we weren't handling the high costs of these operations on SLM.	2019-11-06 17:55:38 +00:00
Simon Pilgrim	1b986b41ac	[CostModel][X86] Add add/fadd reduction tests for SLM	2019-11-06 17:04:22 +00:00
Simon Pilgrim	1b59a16c0b	[CostModel][X86] Improve sum reduction costs. I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2. I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674. llvm-svn: 374655	2019-10-12 13:21:50 +00:00
Craig Topper	c6bfb05762	[CostModel][X86] Don't count 2 shuffles on the last level of a pairwise arithmetic or min/max reduction This is split from D55452 with the correct patch this time. Pairwise reductions require two shuffles on every level but the last. On the last level the two shuffles are <1, u, u, u...> and <0, u, u, u...>, but <0, u, u, u...> will be dropped by InstCombine/DAGCombine as being an identity shuffle. Differential Revision: https://reviews.llvm.org/D55615 llvm-svn: 349072	2018-12-13 19:08:10 +00:00
Craig Topper	d1498ed8df	[CostModel][X86] Fix overcounting arithmetic cost in illegal types in getArithmeticReductionCost/getMinMaxReductionCost We were overcounting the number of arithmetic operations needed at each level before we reach a legal type. We were using the full vector type for that level, but we are going to split the input vector at that level in half. So the effective arithmetic operation cost at that level is half the width. So for example on 8i32 on an sse target. We were calculating the cost of an 8i32 op which is likely 2 for basic integer. Then after the loop we count 2 more v4i32 ops. For a total arith cost of 4. But if you look at the assembly there would only be 3 arithmetic ops. There are still more bugs in this code that I'm going to work on next. The non pairwise code shouldn't count extract subvectors in the loop. There are no extracts, the types are split in registers. For pairwise we need to use 2 two src permute shuffles. Differential Revision: https://reviews.llvm.org/D55397 llvm-svn: 348621	2018-12-07 18:20:56 +00:00
Simon Pilgrim	102854f4d4	[TTI] Reduction costs only need to include a single extract element cost (REAPPLIED) We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type. For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed. Fixes PR37731 Rebased and reapplied after being reverted in rL347541 due to PR39774 - which was fixed by D54955/rL347759 and D55017/rL347997 Differential Revision: https://reviews.llvm.org/D54585 llvm-svn: 348076	2018-12-01 14:18:31 +00:00
Fedor Sergeev	8cd9d1b5ce	Revert "[TTI] Reduction costs only need to include a single extract element cost" This reverts commit r346970. It was causing PR39774, a crash in slp-vectorizer on a rather simple loop with just a bunch of 'and's in the body. llvm-svn: 347541	2018-11-26 10:17:27 +00:00
Simon Pilgrim	924f193419	[TTI] Reduction costs only need to include a single extract element cost We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type. For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed. Fixes PR37731 Differential Revision: https://reviews.llvm.org/D54585 llvm-svn: 346970	2018-11-15 17:42:53 +00:00
Simon Pilgrim	fc8f1d7da7	[CostModel][X86] SK_ExtractSubvector is free if the subvector is at the start of the source vector llvm-svn: 346538	2018-11-09 19:04:27 +00:00
Simon Pilgrim	44a9a71d2a	[TTI] Fix uses of SK_ExtractSubvector shuffle costs (PR39368) Correct costing of SK_ExtractSubvector requires the SubTy argument to indicate the type/size of the extracted subvector. Unlike the rest of the shuffle kinds this means that the main Ty argument represents the source vector type not the destination! I've done my best to fix a number of vectorizer uses: SLP - the reduction epilogue costs should be using a SK_PermuteSingleSrc shuffle as these all occur at the hardware vector width - we're not extracting (illegal) subvector types. This is causing the cost model diffs as SK_ExtractSubvector costs are poorly handled and tend to just return 1 at the moment. LV - I'm not clear on what the SK_ExtractSubvector should represent for recurrences - I've used a <1 x ?> subvector extraction as that seems to match the VF delta. Differential Revision: https://reviews.llvm.org/D53573 llvm-svn: 345617	2018-10-30 18:10:02 +00:00
Simon Pilgrim	2a9cde026c	[X86][AVX] Reduce v4f64/v4i64 shuffle costs (PR37882) These were being over cautious for costs for one/two op general shuffles - VSHUFPD doesn't have to replicate the same shuffle in both lanes like VSHUFPS does. llvm-svn: 335216	2018-06-21 11:37:13 +00:00
Simon Pilgrim	0783921987	[CostModel] Treat Identity shuffle masks as zero cost As discussed on D47985, identity shuffle masks should probably be free. I've limited this to the case where the input and output types all match - but we could probably accept all cases. Differential Revision: https://reviews.llvm.org/D47986 llvm-svn: 334506	2018-06-12 14:47:13 +00:00
Simon Pilgrim	8a8ff4f6d4	[CostModel][X86] Regenerate vector reduction cost tests with update_analyze_test_checks.py NOTE: We're only really interested in the extractelement cost (which represents the entire reduction). llvm-svn: 329504	2018-04-07 14:20:10 +00:00
Sanjay Patel	e6143904b9	revert r325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280) There are too many perf regressions resulting from this, so we need to investigate (and add tests for) targets like ARM and AArch64 before trying to reinstate. llvm-svn: 325658	2018-02-21 01:42:52 +00:00
Sanjay Patel	3e8a76abfd	[TTI CostModel] change default cost of FP ops to 1 (PR36280) This change was mentioned at least as far back as: https://bugs.llvm.org/show_bug.cgi?id=26837#c26 ...and I found a real program that is harmed by this: Himeno running on AMD Jaguar gets 6% slower with SLP vectorization: https://bugs.llvm.org/show_bug.cgi?id=36280 ...but the change here appears to solve that bug only accidentally. The div/rem costs for x86 look very wrong in some cases, but that's already true, so we can fix those in follow-up patches. There's also evidence that more cost model changes are needed to solve SLP problems as shown in D42981, but that's an independent problem (though the solution may be adjusted after this change is made). Differential Revision: https://reviews.llvm.org/D43079 llvm-svn: 325515	2018-02-19 16:11:44 +00:00
Alexey Bataev	62af7252f1	[SLP] Fixed cost model for horizontal reduction. Currently, when the cost of scalar operations is evaluated, the vector type is used for scalar operations. Patch fixes this issue and fixes evaluation of the vector operations cost. Several tests showed that the vector cost model is too optimistic. It allowed vectorization of 8 or fewer add/fadd operations, though scalar code is faster. Actually, only for 16 or more operations vector code provides better performance. Differential Revision: https://reviews.llvm.org/D26277 llvm-svn: 288398	2016-12-01 18:42:42 +00:00
Alexey Bataev	fc617690ab	[SLP] Additional tests with the cost of vector operations. llvm-svn: 288377	2016-12-01 17:26:54 +00:00
Alexey Bataev	e59a8351d0	Revert "[SLP] Additional tests with the cost of vector operations." This reverts commit a61718435fc4118c82f8aa6133fd81f803789c1e. llvm-svn: 288371	2016-12-01 16:45:04 +00:00
Alexey Bataev	2ff768475d	[SLP] Additional tests with the cost of vector operations. llvm-svn: 288369	2016-12-01 16:11:48 +00:00
Cong Hou	94620278a4	Don't punish vectorized arithmetic instruction whose type will be split to multiple registers Currently in LLVM's cost model, a vectorized arithmetic instruction will have high cost if its type is split into multiple registers. However, this punishment is too heavy and unnecessary. The overhead of the split should not be on arithmetic instructions but on the instructions that implement the split. Note that during vectorization we have calculated the register pressure, and we only choose a proper interleaving factor (and also vectorization factor) so that we don't use more registers than the maximum number. Here is a very simple example: if a vadd has the cost 1, and if we double VF so that we need two registers to perform it, then its cost will become 4 with the current implementation, which will prevent us from using a larger VF. Differential revision: http://reviews.llvm.org/D15159 llvm-svn: 254671	2015-12-04 00:36:58 +00:00
Yi Jiang	5c343de8d3	X86 horizontal vector reduction cost model llvm-svn: 191021	2013-09-19 17:48:48 +00:00
Arnold Schwaighofer	cae8735a54	Costmodel: Add support for horizontal vector reductions Upcoming SLP vectorization improvements will want to be able to estimate costs of horizontal reductions. Add infrastructure to support this. We model reductions as a series of (shufflevector,add) tuples ultimately followed by an extractelement. For example, for an add-reduction of <4 x float> we could generate the following sequence:
    (v0, v1, v2, v3)
      \   \  /  /
        \  \  /
          +  +
    (v0+v2, v1+v3, undef, undef)
       \      /
    ((v0+v2) + (v1+v3), undef, undef, undef)
    %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
    %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
    %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
    %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
    %r = extractelement <4 x float> %bin.rdx8, i32 0
This commit adds a cost model interface "getReductionCost(Opcode, Ty, Pairwise)" that will allow clients to ask for the cost of such a reduction (as backends might generate more efficient code than the cost of the individual instructions summed up). This interface is exercised by the CostModel analysis pass which looks for reduction patterns like the one above - starting at extractelements - and if it sees a matching sequence will call the cost model interface. We will also support a second form of pairwise reduction that is well supported on common architectures (haddps, vpadd, faddp).
    (v0, v1,   v2, v3)
     \   /    \  /
    (v0+v1, v2+v3, undef, undef)
       \     /
    ((v0+v1)+(v2+v3), undef, undef, undef)
    %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
    %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
    %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
    %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
    %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
    %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
    %r = extractelement <4 x float> %bin.rdx.1, i32 0
llvm-svn: 190876	2013-09-17 18:06:50 +00:00
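
The reduction shapes described in cae8735a54 and c6bfb05762 can be illustrated with a small operation count. The Python sketch below is not LLVM code: the function name `reduction_op_counts` and the flag `drop_identity_shuffle` are hypothetical, and it only counts the IR operations in the expanded pattern for a power-of-two vector. Real costs come from the target's TTI hooks and depend on type legalization.

```python
def reduction_op_counts(num_elts, pairwise=False, drop_identity_shuffle=True):
    """Count shuffles, arithmetic ops and extracts in an expanded reduction.

    Hypothetical helper for illustration only; it mirrors the patterns
    quoted in the commit messages, not any LLVM API.
    """
    if num_elts < 2 or num_elts & (num_elts - 1):
        raise ValueError("num_elts must be a power of two and at least 2")

    # Each level halves the number of live elements: log2(num_elts) levels.
    levels = num_elts.bit_length() - 1

    if pairwise:
        # Pairwise form: two shuffles feed each arithmetic op.
        shuffles = 2 * levels
        if drop_identity_shuffle:
            # On the last level the <0, u, u, ...> shuffle is an identity
            # shuffle and is expected to be dropped (per c6bfb05762).
            shuffles -= 1
    else:
        # Non-pairwise form: one shuffle and one arithmetic op per level.
        shuffles = levels

    # A single extractelement of lane 0 finishes the reduction.
    return {"shuffle": shuffles, "arith": levels, "extract": 1}


if __name__ == "__main__":
    print(reduction_op_counts(4))
    print(reduction_op_counts(4, pairwise=True, drop_identity_shuffle=False))
    print(reduction_op_counts(4, pairwise=True))
```

For a <4 x float> add-reduction this gives two shuffles, two fadds and one extract in the non-pairwise form, matching the IR quoted in cae8735a54. The pairwise form has four shuffles as written, or three once the final identity shuffle is discounted.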

29 Commits