llvm-project

Commit Graph

Author	SHA1	Message	Date
Florian Hahn	69c1cbe20f	[SCEV] Add test case where applying zext info pessimizes BTC. Add an additional test case for D113578.	2021-11-12 12:19:35 +00:00
Florian Hahn	5dfe60d171	[SCEV] Add tests where guards limit both %n and (zext %n). Suggested in D113577.	2021-11-12 10:31:35 +00:00
Florian Hahn	8d2a1994c8	[AArch64] Add some fp16 cast cost-model tests. This adds initial tests for cost-modeling {u,s}itofp for fp16 vectors. At the moment, they are under-estimated in a couple of cases.	2021-11-11 18:21:44 +00:00
Roman Lebedev	a70d74323e	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI VBMI introduced VPERMB, so cost-model i8 replication shuffle using it. Note that we can still model i8 replication shufflle without VBMI, by promoting to i16/i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113479	2021-11-10 22:52:40 +03:00
Roman Lebedev	c6e894b9b2	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW BWI introduced VPERMW, so cost-model i16 replication shuffle using it. Note that we can still model i16 replication shufflle without BWI, by promoting to i32. That will be done in follow-ups. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113478	2021-11-10 22:52:39 +03:00
Roman Lebedev	4101c7bf19	[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`, that take a single input vector and a single index vector, and are cross-lane. So far i haven't seen evidence that replication ever results in demanding more than a single input vector per output vector. This results in shockingly lesser costs :) Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113350	2021-11-10 22:52:33 +03:00
Florian Hahn	fcf2ae9923	[SCEV] Add tests that require rewriting zexts when applying guards. Precommit tests inspired by PR40961 and PR52464.	2021-11-10 16:58:27 +00:00
Roman Lebedev	3bdf738d1b	[NFC][X86][Costmodel] Add i16 replication shuffle costmodel test coverage	2021-11-09 14:19:44 +03:00
Nikita Popov	a8c318b50e	[BasicAA] Use index size instead of pointer size When accumulating the GEP offset in BasicAA, we should use the pointer index size rather than the pointer size. Differential Revision: https://reviews.llvm.org/D112370	2021-11-07 18:56:11 +01:00
Roman Lebedev	23566f18c6	[NFC][X86][Costmodel] Add tests for i32/i64 replication shuffles While this isn't what we eventually need (i8 or i1), approaching from this end is more straight-forward.	2021-11-06 17:14:56 +03:00
Roman Lebedev	a30ec4778a	[TTI][CostModel] `getUserCost()`: recognize replication shuffles and query their cost This finally creates proper test coverage for replication shuffles, that are used by LV for conditional loads, and will allow to add proper costmodel at least for AVX512. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D113324	2021-11-06 16:45:15 +03:00
Philip Reames	d24a0e8857	[SCEV] Use constant range of RHS to prove NUW on narrow IV in trip count logic The basic idea here is that given a zero extended narrow IV, we can prove the inner IV to be NUW if we can prove there's a value the inner IV must take before overflow which must exit the loop. Differential Revision: https://reviews.llvm.org/D109457	2021-11-05 15:36:47 -07:00
Roman Lebedev	04fa7cbf55	[NFC][CostModel] Add exhaustive test coverage for replication shuffles This coverage has been brought to you by https://godbolt.org/z/nfc3cY1za	2021-11-06 00:53:28 +03:00
Roman Lebedev	ad617183bb	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1 Even though AVX512's masked mem ops (unlike AVX1/2) have a mask that is a `VF x i1`, replication of said masks happens after promotion of it to `VF x i8`, so we should use `i8`, not `i1`, when calculating the cost of mask replication.	2021-11-05 17:27:02 +03:00
Roman Lebedev	34b903d8b0	[NFC] Add forgotten `REQUIRES: asserts` into the new costmodel test	2021-11-03 19:40:23 +03:00
Roman Lebedev	c65e2ac405	[NFC] Rewrite runlines in interleaved-store-accesses-with-gaps.ll once again https://lab.llvm.org/buildbot/#/builders/98/builds/8198 is still failing, and i really don't understand how runlines in this test differ from the ones in other nearby tests...	2021-11-03 19:15:33 +03:00
Roman Lebedev	df93c8a919	[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask I don't really buy that masked interleaved memory loads/stores are supported on X86. There is zero costmodel test coverage, no actual cost modelling for the generation of the mask repetition, and basically only two LV tests. Additionally, i'm not very interested in AVX512. I don't know if this really helps "soft" block over at https://reviews.llvm.org/D111460#inline-1075467, but i think it can't make things worse at least. When we are being told that there is a masking, instead of completely giving up and falling back to fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`, let's correctly query the cost of masked memory ops, keep all the pretty shuffle cost modelling, but scalarize the cost computation for the mask replication. I think, not scalarizing the shuffles themselves may adjust the computed costs a bit, and maybe hopefully just enough to hide the "regressions" at https://reviews.llvm.org/D111460#inline-1075467 I do mean hide, because the test coverage is non-existent. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112873	2021-11-03 18:14:35 +03:00
Roman Lebedev	f3d1ddfe71	[NFC] Use single-dash-prefixed options in newly-added test https://lab.llvm.org/buildbot/#/builders/98/builds/8195 complains, and this is the only guess i have.	2021-11-03 18:12:40 +03:00
Roman Lebedev	a4b64f7727	[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used As it can be seen in `InnerLoopVectorizer::vectorizeInterleaveGroup()`, in some cases (reported by `UseMaskForGaps`), the gaps in the interleaved load/store group will be masked away by another constant mask, so there is no need to account for the cost of replication of the mask for these. Differential Revision: https://reviews.llvm.org/D112877	2021-11-03 17:33:28 +03:00
Roman Lebedev	c6b3da1d66	[NFC][X86] Duplicate LV test into a costmodel test Copied from llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll As discussed in D111460 / D112877 / D112873 we have basically no test coverage for this part of cost model.	2021-11-03 17:25:18 +03:00
Nikita Popov	51e9f33603	[BasicAA] Use saturating multiply on range if nsw If we know that the var * scale multiplication is nsw, we can use a saturating multiplication on the range (as a good approximation of an nsw multiply). This recovers some cases where the fix from D112611 is unnecessarily strict. (This can be further strengthened by using a saturating add, but we currently don't track all the necessary information for that.) This exposes an issue in our NSW tracking for multiplies. The code was assuming that (X +nsw Y) nsw Z results in (X nsw Z) +nsw (Y nsw Z) -- however, it is possible that the distributed multiplications overflow, even if the non-distributed one does not. We should discard the nsw flag if the the offset is non-zero. If we just have (X nsw Y) nsw Z then concluding X nsw (Y *nsw Z) is fine. Differential Revision: https://reviews.llvm.org/D112848	2021-11-02 20:27:39 +01:00
Arthur Eubanks	029f1a5344	[LazyCallGraph] Skip blockaddresses blockaddresses do not participate in the call graph since the only instructions that use them must all return to someplace within the current function. And passes cannot retrieve a function address from a blockaddress. This was suggested by efriedma in D58260. Fixes PR50881. Reviewed By: nickdesaulniers Differential Revision: https://reviews.llvm.org/D112178	2021-11-01 13:10:24 -07:00
Nikita Popov	7cf7378a9d	[BasicAA] Don't treat non-inbounds GEP as nsw The scale multiplication is only guaranteed to be nsw if the GEP is inbounds (or the multiplication is trivial). Previously we were only considering explicit muls in GEP indices.	2021-10-29 22:30:44 +02:00
Nikita Popov	4dd540d9c8	[BasicAA] Add missing inbounds to tests (NFC) Add missing inbounds to tests that are not correct without it due to possibility of offset overflow. inbounds: https://alive2.llvm.org/ce/z/LC8G9_ w/o inbounds: https://alive2.llvm.org/ce/z/ErrJVW	2021-10-29 19:05:39 +02:00
Nikita Popov	36b22f7845	[BasicAA] Add range test with nsw (NFC)	2021-10-29 18:00:25 +02:00
Nikita Popov	fbc0c308d5	[BasicAA] Handle known bits as ranges BasicAA currently tries to determine that the offset is positive by checking whether all variable indices are positive based on known bits, multiplied by a positive scale. However, this is incorrect if the scale multiplication might overflow. In the modified test case the original value is positive, but may be negative after a left shift. Fix this by converting known bits into a constant range and reusing the range-based logic, which handles overflow correctly. Differential Revision: https://reviews.llvm.org/D112611	2021-10-27 14:41:31 +02:00
Nikita Popov	9bc7e543b4	[BasicAA] Make range check more precise Make the range check more precise by calculating the range of potentially accessed bytes for both accesses and checking whether their intersection is empty. In that case there can be no overlap between the accesses and the result is NoAlias. This is more powerful than the previous approach, because it can deal with sign-wrapped ranges. In the test case the original range is [-1, INT_MAX] but becomes [0, INT_MIN] after applying the offset. This is a wrapping range, so getSignedMin/getSignedMax will treat it as a full range. However, the range excludes the elements [INT_MIN+1, -1], which is enough to prove NoAlias with an access at offset -1. Differential Revision: https://reviews.llvm.org/D112486	2021-10-27 12:40:58 +02:00
Roman Lebedev	db848fbf67	[NFC][LV][X86] Improve test coverage for masked mem ops	2021-10-27 13:36:04 +03:00
David Green	74b2a4edcc	[AArch64] Add a costmodel test for overflowing arithmatic. NFC	2021-10-26 10:35:12 +01:00
Nikita Popov	721569cc36	[BasicAA] Add test for benign range overflow (NFC)	2021-10-25 22:09:39 +02:00
Nikita Popov	7e97347409	[BasicAA] Add test for incorrect non-negative logic (NFC)	2021-10-25 18:02:41 +02:00
Nikita Popov	0d20ebf686	[BasicAA] Use ranges for more than one index D109746 made BasicAA use range information to determine the minimum/maximum GEP offset. However, it was limited to the case of a single variable index. This patch extends support to multiple indices by adding all the ranges together. Differential Revision: https://reviews.llvm.org/D112378	2021-10-25 15:30:50 +02:00
Nikita Popov	2ae67c9684	[BasicAA] Add range test with multiple indices (NFC)	2021-10-24 16:13:03 +02:00
Nikita Popov	61cfdf636d	[BasicAA] Model implicit trunc of GEP indices GEP indices larger than the GEP index size are implicitly truncated to the index size. BasicAA currently doesn't model this, resulting in incorrect alias analysis results. Fix this by explicitly modelling truncation in CastedValue in the same way we do zext and sext. Additionally we need to disable a number of optimizations for truncated values, in particular "non-zero" and "non-equal" may no longer hold after truncation. I believe the constant offset heuristic is also not necessarily correct for truncated values, but wasn't able to come up with a test for that one. A possible followup here would be to use the new mechanism to model explicit trunc as well (which should be much more common, as it is the canonical form). This is straightforward, but omitted here to separate the correctness fix from the analysis improvement. (Side note: While I say "index size" above, BasicAA currently uses the pointer size instead. Something for another day...) Differential Revision: https://reviews.llvm.org/D110977	2021-10-22 23:47:02 +02:00
Roman Lebedev	8fac9e95ad	[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members By definition, interleaving load of stride N means: load NVF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)stride + 0`, and 1'th vector containing elements `[0, VF)stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) Right now, for such not-fully-interleaved loads we just use the costs for fully-interleaved loads. But at least in general, that is obviously overly pessimistic, because in general*, not all the shuffles needed to perform the full interleaving will end up being live. So what this does, is naively scales the interleaving cost by the fraction of the live members. I believe this should still result in the right ballpark cost estimate, although it may be over/under -estimate. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112307	2021-10-22 16:33:58 +03:00
David Sherwood	9448cdc900	[SVE][Analysis] Tune the cost model according to the tune-cpu attribute This patch introduces a new function: AArch64Subtarget::getVScaleForTuning that returns a value for vscale that can be used for tuning the cost model when using scalable vectors. The VScaleForTuning option in AArch64Subtarget is initialised according to the following rules: 1. If the user has specified the CPU to tune for we use that, else 2. If the target CPU was specified we use that, else 3. The tuning is set to "generic". For CPUs of type "generic" I have assumed that vscale=2. New tests added here: Analysis/CostModel/AArch64/sve-gather.ll Analysis/CostModel/AArch64/sve-scatter.ll Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll Differential Revision: https://reviews.llvm.org/D110259	2021-10-21 09:33:50 +01:00
Simon Pilgrim	5b395bd633	[CostModel][X86] Add costs for multiply-by-pow2 constants These are folded to left shifts in the backend. We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436	2021-10-20 13:11:21 +01:00
David Green	862e8d7e55	[AArch64] Improve div and rem costmodel tests. NFC Copied from the X86 tests, these give a better test coveraged than the existing tests.	2021-10-20 09:58:35 +01:00
Bjorn Pettersson	08619006a0	[SCEV] Avoid compile time explosion in ScalarEvolution::isImpliedCond As seen in PR51869 the ScalarEvolution::isImpliedCond function might end up spending lots of time when doing the isKnownPredicate checks. Calling isKnownPredicate for example result in isKnownViaInduction being called, which might result in isLoopBackedgeGuardedByCond being called, and then we might get one or more new calls to isImpliedCond. Even if the scenario described here isn't an infinite loop, using some random generated C programs as input indicates that those isKnownPredicate checks quite often returns true. On the other hand, the third condition that needs to be fulfilled in order to "prove implications via truncation", i.e. the isImpliedCondBalancedTypes check, is rarely fulfilled. I also made some similar experiments to look at how often we would get the same result when using isKnownViaNonRecursiveReasoning instead of isKnownPredicate. So far I haven't seen a single case when codegen is negatively impacted by using isKnownViaNonRecursiveReasoning. On the other hand, it seems like we get rid of the compile time explosion seen in PR51869 that way. Hence this patch. Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D112080	2021-10-19 21:37:57 +02:00
Kirill Stoimenov	62627c7217	[Sanitizers] Replaced getMaxPointerSizeInBits with getPointerSizeInBits, which was causing failures for 32bit x86. Reviewed By: vitalybuka Differential Revision: https://reviews.llvm.org/D111829	2021-10-18 09:31:14 -07:00
Simon Pilgrim	f041338153	[X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:46:10 +01:00
Simon Pilgrim	c850d5c5c8	[X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:15:14 +01:00
Simon Pilgrim	dc3382dc2c	[CostModel][X86] Add mul by positive/negative power-of-2 constants tests We have backend optimizations for these, but currently the costmodel doesn't match them	2021-10-17 20:34:17 +01:00
Simon Pilgrim	dbf5dc8930	[CostModel][X86] Add div/rem by negative power-of-2 constants We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them	2021-10-17 18:51:15 +01:00
Roman Lebedev	91373bf12e	[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So could pick cost of `40` For store we have: https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So we could pick cost of `40`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111945	2021-10-17 17:28:10 +03:00
Roman Lebedev	3274ce3a28	[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So could pick cost of `32` For store we have: https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111944	2021-10-17 17:28:10 +03:00
Roman Lebedev	3a6a9f74d3	[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0` So could pick cost of `68` For store we have: https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111943	2021-10-17 17:28:09 +03:00
Roman Lebedev	4b76a74b42	[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32` For store we have: https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `48`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111942	2021-10-17 17:28:09 +03:00
Roman Lebedev	887acf6842	[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0` So could pick cost of `212` For store we have: https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0` So we could pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111940	2021-10-17 17:28:09 +03:00
Simon Pilgrim	85b87179f4	[TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938	2021-10-16 17:28:07 +01:00

1 2 3 4 5 ...

3113 Commits