llvm-project

Commit Graph

Author	SHA1	Message	Date
Philip Reames	a048e2fa1d	[tests] fix an accidental target dependence added in `99ac8868`	2020-12-15 11:07:30 -08:00
Philip Reames	99ac8868cf	[tests][LV] precommit tests for D93317	2020-12-15 10:53:34 -08:00
Florian Hahn	0e0295fd61	[LV] Pass explicit vector width to not require a X86 target.	2020-12-15 12:52:22 +00:00
Florian Hahn	8a7e770638	[LV] Add reduction test, which exposed a crash in a pending patch.	2020-12-15 09:42:00 +00:00
David Green	ab97c9bdb7	[LV] Fix scalar cost for tail predicated loops When it comes to the scalar cost of any predicated block, the loop vectorizer by default regards this predication as a sign that it is looking at an if-conversion and divides the scalar cost of the block by 2, assuming it would only be executed half the time. This however makes no sense if the predication has been introduced to tail predicate the loop. Original patch by Anna Welker Differential Revision: https://reviews.llvm.org/D86452	2020-12-12 14:21:40 +00:00
David Green	f6e885ad2a	[ARM] Test for showing scalar vector costs. NFC	2020-12-12 11:43:14 +00:00
Florian Hahn	0519722930	[LV] Precommit test for PR48429.	2020-12-11 19:56:48 +00:00
Nikita Popov	8b1c4e310c	[BasicAA] Handle two unknown sizes for GEPs If we have two unknown sizes and one GEP operand and one non-GEP operand, then we currently simply return MayAlias. The comment says we can't do anything useful ... but we can! We can still check that the underlying objects are different (and do so for the GEP-GEP case). To reduce the compile-time impact, this a) checks this early, before doing the relatively expensive GEP decomposition that will not be used and b) doesn't do the check if the other operand is a phi or select. In that case, the phi/select will already recurse, so this would just do two slightly different recursive walks that arrive at the same roots. Compile-time is still a bit of a mixed bag: https://llvm-compile-time-tracker.com/compare.php?from=624af932a808b363a888139beca49f57313d9a3b&to=845356e14adbe651a553ed11318ddb5e79a24bcd&stat=instructions On average this is a small improvement, but sqlite with ThinLTO has a 0.5% regression (lencod has a 1% improvement). The BasicAA test case checks this by using two memsets with unknown size. However, the more interesting case where this is useful is the LoopVectorize test case, as analysis of accesses in loops tends to always use unknown sizes. Differential Revision: https://reviews.llvm.org/D92401	2020-12-11 18:45:53 +01:00
Sander de Smalen	d568cff696	[LoopVectorizer][SVE] Vectorize a simple loop with a scalable VF. * Steps are scaled by `vscale`, a runtime value. * Changes to circumvent the cost-model for now (temporary) so that the cost-model can be implemented separately. This can vectorize the following loop [1]: void loop(int N, double *a, double *b) { #pragma clang loop vectorize_width(4, scalable) for (int i = 0; i < N; i++) { a[i] = b[i] + 1.0; } } [1] This source-level example is based on the pragma proposed separately in D89031. This patch only implements the LLVM part. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D91077	2020-12-09 11:25:21 +00:00
Bardia Mahjour	4c70b6ee45	[LV] Make optimal-epilog-vectorization-profitability.ll more robust Add a CHECK to properly limit the scope of CHECK-NOTs	2020-12-08 12:35:08 -05:00
Arthur Eubanks	dc93a8d1e2	[test] Fix Transforms/LoopVectorize under NPM The -enable-new-pm=1 translation caused loop-vectorize to run on all functions, then instcombine, rather than all passes on one function then the next. This caused the output of -debug-only and -print-after to be interleaved in an unexpected way.	2020-12-07 21:48:21 -08:00
Bardia Mahjour	4db9b78c81	[LV] Epilogue Vectorization with Optimal Control Flow - Default Enablement This patch enables epilogue vectorization by default per reviewer requests. Differential Revision: https://reviews.llvm.org/D89566	2020-12-07 14:29:36 -05:00
Jinsong Ji	b49b8f096c	[PowerPC][Clang] Remove QPX support Clean up QPX code in clang missed in https://reviews.llvm.org/D83915 Reviewed By: #powerpc, steven.zhang Differential Revision: https://reviews.llvm.org/D92329	2020-12-07 10:15:39 -05:00
Alexey Bataev	e7fc561843	[TEST]Autogenerate test checks, NFC.	2020-12-04 11:01:58 -08:00
Philip Reames	0129cd5035	Use deref facts derived from minimum object size of allocations This change should be fairly straight forward. If we've reached a call, check to see if we can tell the result is dereferenceable from information about the minimum object size returned by the call. To control compile time impact, I'm only adding the call for base facts in the routine. getObjectSize can also do recursive reasoning, and we don't want that general capability here. As a follow up patch (without separate review), I will plumb through the missing TLI parameter. That will have the effect of extending this to known libcalls - malloc, new, and the like - whereas currently this only covers calls with the explicit allocsize attribute. Differential Revision: https://reviews.llvm.org/D90341	2020-12-03 15:01:14 -08:00
Philip Reames	0c866a3d6a	[LoopVec] Support non-instructions as argument to uniform mem ops The initial step of the uniform-after-vectorization (lane-0 demanded only) analysis was very awkwardly written. It would revisit use list of each pointer operand of a widened load/store. As a result, it was in the worst case O(N^2) where N was the number of instructions in a loop, and had restricted operand Value types to reduce the size of use lists. This patch replaces the original algorithm with one which is at most O(2N) in the number of instructions in the loop. (The key observation is that each use of a potentially interesting pointer is visited at most twice, once on first scan, once in the use list of its operand. Only instructions within the loop have their uses scanned.) In the process, we remove a restriction which required the operand of the uniform mem op to itself be an instruction. This allows detection of uniform mem ops involving global addresses. Differential Revision: https://reviews.llvm.org/D92056	2020-12-03 14:51:44 -08:00
Simon Pilgrim	a8034fc1ad	[LoopVectorize] Fix optimal-epilog-vectorization-limitations.ll test on non-debug build bots Add "REQUIRES: asserts" as the test uses the "--debug-only" switch Should fix the clang-with-thin-lto-ubuntu buildbot failure	2020-12-02 18:00:42 +00:00
Bardia Mahjour	a7e2c26939	[LV] Epilogue Vectorization with Optimal Control Flow (Recommit) This is yet another attempt at providing support for epilogue vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none and reviews D30247 and D88819. Similar to D88819, this patch achieves epilogue vectorization by executing a single vplan twice: once on the main loop and a second time on the epilogue loop (using a different VF). However it's able to handle more loops, and generates more optimal control flow for cases where the trip count is too small to execute any code in vector form. Reviewed By: SjoerdMeijer Differential Revision: https://reviews.llvm.org/D89566	2020-12-02 10:09:56 -05:00
David Sherwood	71bd59f0cb	[SVE] Add support for scalable vectors with vectorize.scalable.enable loop attribute In this patch I have added support for a new loop hint called vectorize.scalable.enable that says whether we should enable scalable vectorization or not. If a user wants to instruct the compiler to vectorize a loop with scalable vectors they can now do this as follows: br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !2 ... !2 = !{!2, !3, !4} !3 = !{!"llvm.loop.vectorize.width", i32 8} !4 = !{!"llvm.loop.vectorize.scalable.enable", i1 true} Setting the hint to false simply reverts the behaviour back to the default, using fixed width vectors. Differential Revision: https://reviews.llvm.org/D88962	2020-12-02 13:23:43 +00:00
Bardia Mahjour	c94af03f7f	Revert "[LV] Epilogue Vectorization with Optimal Control Flow" This reverts commit `9c5504adce`. Reverting to investigate build failure in http://lab.llvm.org:8011/#/builders/98/builds/1461/steps/9	2020-12-01 12:50:36 -05:00
Bardia Mahjour	9c5504adce	[LV] Epilogue Vectorization with Optimal Control Flow This is yet another attempt at providing support for epilogue vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none and reviews D30247 and D88819. Similar to D88819, this patch achieves epilogue vectorization by executing a single vplan twice: once on the main loop and a second time on the epilogue loop (using a different VF). However it's able to handle more loops, and generates more optimal control flow for cases where the trip count is too small to execute any code in vector form. Reviewed By: SjoerdMeijer Differential Revision: https://reviews.llvm.org/D89566	2020-12-01 12:04:29 -05:00
Cullen Rhodes	cba4accda0	[LV] Clamp VF hint when unsafe In the following loop the dependence distance is 2 and can only be vectorized if the vector length is no larger than this. void foo(int *a, int *b, int N) { #pragma clang loop vectorize(enable) vectorize_width(4) for (int i=0; i<N; ++i) { a[i + 2] = a[i] + b[i]; } } However, when specifying a VF of 4 via a loop hint this loop is vectorized. According to [1][2], loop hints are ignored if the optimization is not safe to apply. This patch introduces a check to bail out of vectorization if the user specified VF is greater than the maximum feasible VF, unless explicitly forced with '-force-vector-width=X'. [1] https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave [2] https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations Reviewed By: sdesmalen, fhahn, Meinersbur Differential Revision: https://reviews.llvm.org/D90687	2020-12-01 11:30:34 +00:00
Sjoerd Meijer	f44ba25135	ExtractValue instruction costs Instruction ExtractValue wasn't handled in LoopVectorizationCostModel::getInstructionCost(). As a result, it was modeled as a mul which is not really accurate. Since it is free (most of the times), this now gets a cost of 0 using getInstructionCost. This is a follow-up of D92208, that required changing this regression test. In a follow up I will look at InsertValue which also isn't handled yet. Differential Revision: https://reviews.llvm.org/D92317	2020-12-01 10:42:23 +00:00
Florian Hahn	fe83adb05a	[VPlan] Use VPUser to manage VPPredInstPHIRecipe operand (NFC). VPPredInstPHIRecipe is one of the recipes that was missed during the initial conversion. This patch adjusts the recipe to also manage its operand using VPUser.	2020-11-30 13:09:58 +00:00
Sjoerd Meijer	5110ff0817	[AArch64][CostModel] Fix cost for mul <2 x i64> This was modeled to have a cost of 1, but since we do not have a MUL.2d this is scalarized into vector inserts/extracts and scalar muls. Motivating precommitted test is test/Transforms/SLPVectorizer/AArch64/mul.ll, which we don't want to SLP vectorize. Test Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll unfortunately needed changing, but the reason is documented in LoopVectorize.cpp:6855: // The cost of executing VF copies of the scalar instruction. This opcode // is unknown. Assume that it is the same as 'mul'. which I will address next as a follow up of this. Differential Revision: https://reviews.llvm.org/D92208	2020-11-30 11:36:55 +00:00
Florian Hahn	4bc9b909d7	[VPlan] Use VPValue and VPUser ops to print VPReplicateRecipe.	2020-11-29 18:28:27 +00:00
David Green	d939ba4c68	[ARM] MVE qabs vectorization test. NFC	2020-11-27 12:21:11 +00:00
David Green	e0c479cd0e	[VPlan] Switch VPWidenRecipe to be a VPValue Similar to other patches, this makes VPWidenRecipe a VPValue. Because of the way it interacts with the reduction code it also slightly alters the way that VPValues are registered, removing the up front NeedDef and using getOrAddVPValue to create them on-demand if needed instead. Differential Revision: https://reviews.llvm.org/D88447	2020-11-25 08:25:06 +00:00
David Green	00a6601136	[VPlan] Turn VPReductionRecipe into a VPValue This converts the VPReductionRecipe into a VPValue, like other VPRecipe's in preparation for traversing def-use chains. It also makes it a VPUser, now storing the used VPValues as operands. It doesn't yet change how the VPReductionRecipes are created. It will need to call replaceAllUsesWith from the original recipe they replace, but that is not done yet as VPWidenRecipe need to be created first. Differential Revision: https://reviews.llvm.org/D88382	2020-11-25 08:25:05 +00:00
Ayal Zaks	32d9a386bf	[LV] Keep Primary Induction alive when folding tail by masking Fix PR47390. The primary induction should be considered alive when folding tail by masking, because it will be used by said masking; even when it may otherwise appear useless: feeding only its own 'bump', which is correctly considered dead, and as the 'bump' of another induction variable, which may wrongfully want to consider its bump = the primary induction, dead. Differential Revision: https://reviews.llvm.org/D92017	2020-11-24 15:12:54 +02:00
Yichao Yu	4bc88a0e9a	Enable support for floating-point division reductions Similar to fsub, fdiv can also be vectorized using fmul. Also http://llvm.org/viewvc/llvm-project?view=revision&revision=215200 Differential Revision: https://reviews.llvm.org/D34078 Co-authored-by: Jameson Nash <jameson@juliacomputing.com>	2020-11-23 20:00:58 -05:00
Philip Reames	d6239b3ea6	[test] pre-commit test for D91451	2020-11-23 15:36:08 -08:00
Philip Reames	b06a2ad94f	[LoopVectorizer] Lower uniform loads as a single load (instead of relying on CSE) A uniform load is one which loads from a uniform address across all lanes. As currently implemented, we cost model such loads as if we did a single scalar load + a broadcast, but the actual lowering replicates the load once per lane. This change tweaks the lowering to use the REPLICATE strategy by marking such loads (and the computation leading to their memory operand) as uniform after vectorization. This is a useful change in itself, but its real purpose is to pave the way for a following change which will generalize our uniformity logic. In review discussion, there was an issue raised with coupling cost modeling with the lowering strategy for uniform inputs. The discussion on that item remains unsettled and is pending larger architectural discussion. We decided to move forward with this patch as is, and revise as warranted once the bigger picture design questions are settled. Differential Revision: https://reviews.llvm.org/D91398	2020-11-23 15:32:17 -08:00
Sanjay Patel	e32bd35120	[CostModel] mostly remove cost-kind predicate for intrinsics in basic TTI implementation This is re-applying a combination of `f7eac51b9b` and `8ec7ea3ddc` as one patch to avoid regressions now that we have better testing in place. Those were reverted with `32dd5870ee` because of crashing in experimental intrinsics. That bug should be fixed with `7ae346434`. Paraphrased original commit messages: This is the last step in removing cost-kind as a consideration in the basic class model for intrinsics. See D89461 for the start of that. Subsequent commits dealt with each of the special-case intrinsics that had customization here in the basic class. This should remove a barrier to retrying D87188 (canonicalization to the abs intrinsic). The ARM and x86 cost diffs seen here may be wrong because the target-specific overrides have their own bugs, but we hope this is less wrong - if something has a significant throughput cost, then it should have a significant size / blended cost too by default. The only behavioral diff in current regression tests is shown in the x86 scatter-gather test (which is misplaced or broken because it runs the entire -O3 pipeline) - we unrolled less, and we assume that is an improvement. Exception: in general, we want the size cost for a scalar call to be cheap even if the other costs are expensive - we expect it to just be a branch with some optional stack manipulation. It is likely that we will want to carve out some exceptions/overrides to this rule as follow-up patches for calls that have some general and/or target-specific difference to the expected lowering. This was noticed as a regression in unrolling, so we have a test for that now along with a couple of direct cost model tests. If the assumed scalarization costs for the oversized vector calls are not realistic, that would be another follow-up refinement of the cost models. Differential Revision: https://reviews.llvm.org/D90554	2020-11-20 11:21:10 -05:00
Eric Christopher	32dd5870ee	Temporarily Revert "[CostModel] remove cost-kind predicate for intrinsics in basic TTI implementation" as it's causing crashes in the optimizer. A reduced testcase has been posted as a follow-up. This reverts commit `f7eac51b9b`. Temporarily Revert "[CostModel] make default size cost for libcalls small (again)" as it depends upon the primary revert. This reverts commit `8ec7ea3ddc`. Temporarily Revert "[CostModel] add tests for math library calls; NFC" as it depends upon the primary revert. This reverts commit `df09f82599`. Temporarily Revert "[LoopUnroll] add test for full unroll that is sensitive to cost-model; NFC" as it depends upon the primary revert. This reverts commit `618d555e8d`.	2020-11-19 22:10:23 -08:00
Sanjay Patel	4e68bc0999	Revert "[InstCombine] add multi-use demanded bits fold for add with low-bit mask" This reverts commit `e56103d250`. There is a stage2 msan failure blamed on this commit: http://lab.llvm.org:8011/#/builders/74/builds/888/steps/9/logs/stdio	2020-11-16 14:48:09 -05:00
Sanjay Patel	e56103d250	[InstCombine] add multi-use demanded bits fold for add with low-bit mask I noticed an add example like the one from D91343, so here's a similar patch. The logic is based on existing code for the single-use demanded bits fold. But I only matched a constant instead of using compute known bits on the operands because that was the motivating pattern that I noticed. I think this will allow removing a special-case (but incomplete) dedicated fold within visitAnd(), but I need to untangle the existing code to be sure. https://rise4fun.com/Alive/V6fP Name: add with low mask Pre: (C1 & (-1 u>> countLeadingZeros(C2))) == 0 %a = add i8 %x, C1 %r = and i8 %a, C2 => %r = and i8 %x, C2 Differential Revision: https://reviews.llvm.org/D91415	2020-11-15 15:09:49 -05:00
Florian Hahn	0c119ba8a8	[VPlan] Use VPValue def for VPWidenGEPRecipe. This patch turns VPWidenGEPRecipe into a VPValue and uses it during VPlan construction and codegeneration instead of the plain IR reference where possible. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D84683	2020-11-15 15:12:47 +00:00
Florian Hahn	a70b511e78	Recommit "[VPlan] Use VPValue def for VPWidenSelectRecipe." This reverts the revert commit `c8d73d939f`. It includes a fix for cases where we missed inserting VPValues for some selects, which should fix PR48142.	2020-11-14 20:00:25 +00:00
Philip Reames	d4e81cd9dd	[Tests][LoopVect] Exercise basic uniform memory operand logic	2020-11-12 20:34:31 -08:00
Sanjay Patel	9e0c35655b	[LoopVectorize] regenerate test checks; NFC	2020-11-12 17:15:46 -05:00
Florian Hahn	c8d73d939f	Revert "[VPlan] Use VPValue def for VPWidenSelectRecipe." This reverts commit `a8e50f1c6e`. This reportedly breaks building the Linux kernel. https://bugs.llvm.org/show_bug.cgi?id=48142	2020-11-10 22:50:46 +00:00
Florian Hahn	a8e50f1c6e	[VPlan] Use VPValue def for VPWidenSelectRecipe. This patch turns VPWidenSelectRecipe into a VPValue and uses it during VPlan construction and codegeneration instead of the plain IR reference where possible. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D84682	2020-11-10 19:39:37 +00:00
Sanjay Patel	f7eac51b9b	[CostModel] remove cost-kind predicate for intrinsics in basic TTI implementation This is the last step in removing cost-kind as a consideration in the basic class model for intrinsics. See D89461 for the start of that. Subsequent commits dealt with each of the special-case intrinsics that had customization here in the basic class. This should remove a barrier to retrying D87188 (canonicalization to the abs intrinsic). The ARM and x86 cost diffs seen here may be wrong because the target-specific overrides have their own bugs, but we hope this is less wrong - if something has a significant throughput cost, then it should have a significant size / blended cost too by default. The only behavioral diff in current regression tests is shown in the x86 scatter-gather test (which is misplaced or broken because it runs the entire -O3 pipeline) - we unrolled less, and we assume that is an improvement. Differential Revision: https://reviews.llvm.org/D90554	2020-11-10 08:19:31 -05:00
Joe Ellis	462dd4f803	[SVE][AArch64] Improve specificity of vectorization legality TypeSize test The test was using -O2, where -loop-vectorize will suffice. Reviewed By: fpetrogalli Differential Revision: https://reviews.llvm.org/D90685	2020-11-10 10:55:25 +00:00
Florian Hahn	f0d76275cb	[VPlan] Print result value for loads in VPWidenMemoryInst (NFC). For loads, print the result value.	2020-11-09 14:01:29 +00:00
Florian Hahn	fec64de261	[VPlan] Use VPValue def for VPWidenCall. This patch turns VPWidenCall into a VPValue and uses it during VPlan construction and codegeneration instead of the plain IR reference where possible. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D84681	2020-11-09 13:29:41 +00:00
Simon Pilgrim	28fc173819	[LoopVectorize] Remove unused check-prefixes	2020-11-09 12:18:20 +00:00
Simon Pilgrim	8a34e30d33	[LoopVectorize][AMDGPU] Regenerate packed-math test checks	2020-11-09 12:18:20 +00:00
Simon Moll	d3b33a7810	[VE][TTI] don't advertise vregs/vops Claim to not have any vector support to dissuade SLP, LV and friends from generating SIMD IR for the VE target. We will take this back once vector isel is stable. Reviewed By: kaz7, fhahn Differential Revision: https://reviews.llvm.org/D90462	2020-11-06 11:12:10 +01:00
Simon Pilgrim	f03be9df37	[LV][X86] Regenerate gather_scatter tests. NFCI. Reduce diff in D90554	2020-11-02 11:57:37 +00:00
Arthur Eubanks	5c31b8b94f	Revert "Use uint64_t for branch weights instead of uint32_t" This reverts commit `10f2a0d662`. More uint64_t overflows.	2020-10-31 00:25:32 -07:00
Arthur Eubanks	10f2a0d662	Use uint64_t for branch weights instead of uint32_t CallInst::updateProfWeight() creates branch_weights with i64 instead of i32. To be more consistent everywhere and remove lots of casts from uint64_t to uint32_t, use i64 for branch_weights. Reviewed By: davidxl Differential Revision: https://reviews.llvm.org/D88609	2020-10-30 10:03:46 -07:00
Nikita Popov	20b386aae0	[LoopUtils] Fix neutral value for vector.reduce.fadd Use -0.0 instead of 0.0 as the start value. The previous use of 0.0 was fine for all existing uses of this function though, as it is always generated with fast flags right now, and thus nsz.	2020-10-29 21:45:13 +01:00
Philip Reames	4e4abd16a7	[Deref] Use maximum trip count instead of exact trip count When trying to prove that a memory access touches only dereferenceable memory across all iterations of a loop, use the maximum exit count rather than an exact one. In many cases we can't prove exact exit counts whereas we can prove an upper bound. The test included is for a single exit loop with a min(C,V) exit count, but the true motivation is support for multiple exits loops. It's just really hard to write a test case for multiple exits because the vectorizer (the primary user of this API), bails far before this. For multiple exits, this allows a mix of analyzeable and unanalyzable exits when only analyzeable exits are needed to prove deref.	2020-10-28 14:33:30 -07:00
Nico Weber	2a4e704c92	Revert "Use uint64_t for branch weights instead of uint32_t" This reverts commit `e5766f25c6`. Makes clang assert when building Chromium, see https://crbug.com/1142813 for a repro.	2020-10-27 09:26:21 -04:00
Florian Hahn	f067bc3c0a	[LoopRotation] Allow loop header duplication if vectorization is forced. -Oz normally does not allow loop header duplication so this loop wouldn't be vectorized. However the vectorization pragma should override this and allow for loop rotation. rdar://problem/49281061 Original patch by Adam Nemet. Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D59832	2020-10-27 09:28:01 +00:00
Arthur Eubanks	e5766f25c6	Use uint64_t for branch weights instead of uint32_t CallInst::updateProfWeight() creates branch_weights with i64 instead of i32. To be more consistent everywhere and remove lots of casts from uint64_t to uint32_t, use i64 for branch_weights. Reviewed By: davidxl Differential Revision: https://reviews.llvm.org/D88609	2020-10-26 20:24:04 -07:00
Joe Ellis	467e5cf40f	[SVE][AArch64] Fix TypeSize warning in loop vectorization legality The warning would fire when calling isDereferenceableAndAlignedInLoop with a scalable load. Calling isDereferenceableAndAlignedInLoop with a scalable load would result in the use of the now deprecated implicit cast of TypeSize to uint64_t through the overloaded operator. This patch fixes this issue by: - no longer considering vector loads as candidates in canVectorizeWithIfConvert. This doesn't make sense in the context of identifying scalar loads to vectorize. - making use of getFixedSize inside isDereferenceableAndAlignedInLoop -- this removes the dependency on the deprecated interface, and will trigger an assertion error if the function is ever called with a scalable type. Reviewed By: sdesmalen Differential Revision: https://reviews.llvm.org/D89798	2020-10-26 17:40:04 +00:00
Florian Hahn	1747aae9fc	[LV] Add cost-model test for AArch64 select costs. Currently, the cost of some compare/select patterns is overestimated on AArch64.	2020-10-26 13:43:31 +00:00
Nikita Popov	0dda633317	[SCEV] Strength nowrap flags after constant folding We should first try to constant fold the add expression and only strengthen nowrap flags afterwards. This allows us to determine stronger flags if e.g. only two operands are left after constant folding (and thus "guaranteed no wrap region" code applies) or the resulting operands are non-negative and thus nsw->nuw strengthening applies.	2020-10-25 18:00:22 +01:00
Venkataramanan Kumar	57cdc52c4d	Initial support for vectorization using Libmvec (GLIBC vector math library) Differential Revision: https://reviews.llvm.org/D88154	2020-10-22 16:01:39 -04:00
Arthur Eubanks	c76968d8b6	[test][NPM] Fix already-vectorized.ll under NPM The NPM runs SpeculateAroundPHIs which breaks critical edges, causing a branch we check for to not directly jump back to the same block.	2020-10-19 13:11:13 -07:00
Arthur Eubanks	fce64578bc	[NPM][test] Fix some LoopVectorize tests under NPM	2020-10-19 12:05:37 -07:00
Arthur Eubanks	65e5006962	[NPM][opt] Run -O# after other passes in legacy PM compatibility mode Generally tests run -O# before other passes, not after.	2020-10-19 11:48:44 -07:00
Evgeniy Brevnov	d0c95808e5	[LV] Unroll factor is expected to be > 0 LV fails with assertion checking that UF > 0. We already set UF to 1 if it is 0 except the case when IC > MaxInterleaveCount. The fix is to set UF to 1 for that case as well. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D87679	2020-10-14 16:48:17 +07:00
David Green	be6e8e50f4	[LV] Tail folded inloop reductions. This expands upon the inloop reductions added in e9761688e41cb9e976, allowing them to be inserted into tail folded loops. Reductions are generated with the form: x = select(mask, vecop, zero) v = vecreduce.add(x) c = add chain, v Where zero here is chosen as the identity value for add reductions. The backend is then expected to fold the select and the vecreduce into a single predicated instruction. Most of the code is fairly straight forward, except for the creation of blockmasks which need to ensure they are created in dominance order. The order they are added is altered to be after any phis, keeping the requirements for the underlying IR. Differential Revision: https://reviews.llvm.org/D84451	2020-10-11 16:58:34 +01:00
David Green	8f2cacae67	[LV] Extra predicated inloop reduction tests. NFC	2020-10-11 15:06:21 +01:00
David Green	6d8eea61b1	[AArch64][LV] Move vectorizer test to Transforms/LoopVectorize/AArch64. NFC	2020-10-10 10:15:43 +01:00
David Green	498f89d188	[LV] Collect dead induction truncates We currently collect the ICmp and Add from an induction variable, marking them as dead so that vplan values are not created for them. This extends that to include any single use trunc from the ICmp, which allows the Add to more readily be removed too. This can help with costing vplan nodes, as the ICmp and Add are more reliably removed and are not double-counted. Differential Revision: https://reviews.llvm.org/D88873	2020-10-08 08:28:58 +01:00
Florian Hahn	a73166a452	[LAA] Use DL to get element size for bound computation. Currently LAA uses getScalarSizeInBits to compute the size of an element when computing the end bound of an access. This does not work as expected for pointers to pointers, because getScalarSizeInBits will return 0 for pointer types. By using DataLayout to get the size of the element we can also correctly handle pointer element types. Note the changes to the existing test, which seems to also use the wrong offset for the end. Fixes PR47751. Reviewed By: anemet Differential Revision: https://reviews.llvm.org/D88953	2020-10-07 18:57:07 +01:00
Amara Emerson	322d0afd87	[llvm][mlir] Promote the experimental reduction intrinsics to be first class intrinsics. This change renames the intrinsics to not have "experimental" in the name. The autoupgrader will handle legacy intrinsics. Relevant ML thread: http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html Differential Revision: https://reviews.llvm.org/D88787	2020-10-07 10:36:44 -07:00
Florian Hahn	20cfd5fa33	[LAA] Add test for PR47751, which currently uses wrong bounds.	2020-10-07 11:22:22 +01:00
Mauri Mustonen	cef0de5eb5	[VPlan] Add vplan native path vectorization test case for inner loop reduction Regarding this bug I posted earlier: https://bugs.llvm.org/show_bug.cgi?id=47035 After reading through LLVM source code and getting familiar with VPlan I was able to vectorize the code by enabling VPlan native path. After talking with @fhahn he suggested that I contribute this as a test case. So here it is. I tried to follow the available guides on how to do this as best I could. I modified IR code by hand to have clearer variable names instead of numbers. One thing I'd like input on: are the current CHECK lines sufficient to verify that the inner loop has been vectorized properly? Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D87564	2020-10-06 10:11:58 +01:00
Wenlei He	89e8a8b223	Revert SVML support for sqrt As was brought up in D87169 by @craig.topper we shouldn't map llvm.sqrt to svml since there is a faster native instruction. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_p&expand=5824,5823,5356,5823,5825,5365,5356 Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D88620	2020-10-05 08:13:11 -07:00
David Green	ff86acbb79	[LV] Regenerate test. NFC This just reruns the update script to add the new [[LOOP0:!llvm.loop !.*]] checks to remove them from other diffs.	2020-10-05 13:46:15 +01:00
Florian Hahn	ef72591de9	[LV] Add another test case with unsinkable first-order recurrences.	2020-10-03 20:41:41 +01:00
Sjoerd Meijer	1696dd27fb	[ARM][MVE] Enable tail-predication by default We have been running tests/benchmarks downstream with tail-predication enabled for some time now and this behaves as expected: we are not aware of any correctness issues, and this performs better across the board than with tail-predication disabled. Time to flip the switch! Differential Revision: https://reviews.llvm.org/D88093	2020-09-28 14:01:23 +01:00
Florian Hahn	d4ddf63fc4	[SCEV] Use loop guard info when computing the max BE taken count in howFarToZero. For some expressions, we can use information from loop guards when we are looking for a maximum. This patch applies information from loop guards to the expression used to compute the maximum backedge taken count in howFarToZero. It currently replaces an unknown expression X with UMin(X, Y), if the loop is guarded by X ult Y. This patch is minimal in what conditions it applies, and there are a few TODOs to generalize. This partly addresses PR40961. We will also need an update to LV to address it completely. Reviewed By: reames Differential Revision: https://reviews.llvm.org/D67178	2020-09-24 11:06:55 +01:00
Sam Parker	97a476eb56	[NFC][ARM] Tail fold test changes Run update script on one test and add another.	2020-09-17 13:09:10 +01:00
Wenlei He	056534dc2b	SVML support for log10, sqrt Although LLVM supports vectorization of loops containing log10/sqrt, it did not support using the SVML implementations of them. This adds support so that when clang is invoked with -fveclib=SVML, the appropriate SVML library log10/sqrt implementation is invoked. Follow up on: https://reviews.llvm.org/D77114 Tests: Added unit tests to svml-calls.ll, svml-calls-finite.ll. Can be run with llvm-lint. Created a simple c++ file that tests log10/sqrt, and used clang+ to build it, and output final assembly. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D87169	2020-09-15 17:29:44 -07:00
David Green	08baa97923	[ARM] Enable tail predication for reduction tests. NFC	2020-09-14 14:26:10 +01:00
David Green	74760bb00f	[LV][ARM] Add preferInloopReduction target hook. This allows the backend to tell the vectorizer to produce inloop reductions through a TTI hook. For the moment on ARM under MVE this means allowing integer add reductions of the correct size. In the future this can include integer min/max too, under -Os. Differential Revision: https://reviews.llvm.org/D75512	2020-09-12 17:47:04 +01:00
Wenlei He	d1be928d23	SVML support for log2 Although LLVM supports vectorization of loops containing log2, it did not support using the SVML implementation of it. This adds support so that when clang is invoked with -fveclib=SVML, the appropriate SVML library log2 implementation is invoked. Follow up on: https://reviews.llvm.org/D77114 Tests: Added unit tests to svml-calls.ll, svml-calls-finite.ll. Can be run with llvm-lint. Created a simple c++ file that tests log2, and used clang+ to build it, and output final assembly. Reviewed By: wenlei, craig.topper Differential Revision: https://reviews.llvm.org/D86730	2020-09-03 11:52:29 -07:00
Aaron Liu	d7e16ca28f	[LV] Interleave to expose ILP for small loops with scalar reductions. Interleave small loops that contain reductions, which breaks dependencies and exposes ILP. This gives very significant performance improvements for some benchmarks, because small loops can sit in very hot functions in real applications. Differential Revision: https://reviews.llvm.org/D81416	2020-09-01 19:47:32 +00:00
Roman Lebedev	c23aefd7c3	[NFC][InstCombine] visitPHINode(): cleanup PHI CSE instruction replacement As @nikic is pointing out in https://reviews.llvm.org/rGbf21ce7b908e#inline-4647 this must be sufficient otherwise `EliminateDuplicatePHINodes()` would have hit issues with it already.	2020-08-31 22:29:39 +03:00
Florian Hahn	eb35ebb3a2	[LV] Update CFG before adding runtime checks. addRuntimeChecks uses SCEVExpander, which relies on the DT/LoopInfo to be up-to-date. Changing the CFG afterwards may invalidate some inserted instructions, especially LCSSA phis. Reorder the code to first update the CFG and then create the runtime checks. This should not have any impact on the generated code, as we adjust the CFG and generate runtime checks together. Fixes PR47343.	2020-08-30 18:21:44 +01:00
Roman Lebedev	bf21ce7b90	[InstCombine] Take 3: Perform trivial PHI CSE The original take 1 was `6102310d81`, which taught InstSimplify to do that, which seemed better at time, since we got EarlyCSE support for free. However, it was proven that we can not do that there, the simplified-to PHI would not be reachable from the original PHI, and that is not something InstSimplify is allowed to do, as noted in the commit `ed90f15efb` that reverted it: > It appears to cause compilation non-determinism and caused stage3 mismatches. Then there was take 2 `3e69871ab5`, which was InstCombine-specific, but it again showed stage2-stage3 differences, and reverted in `bdaa3f86a0`. This is quite alarming. Here, let's try to change how we find existing PHI candidate: due to the worklist order, and the way PHI nodes are inserted (it may be inserted as the first one, or maybe not), let's look at all PHI nodes in the block. Effects on vanilla llvm test-suite + RawSpeed: ``` \| statistic name \| baseline \| proposed \| Δ \| % \| \\|%\\| \| \|----------------------------------------------------\|-----------\|-----------\|-------:\|---------:\|---------:\| \| asm-printer.EmittedInsts \| 7942329 \| 7942457 \| 128 \| 0.00% \| 0.00% \| \| assembler.ObjectBytes \| 254295632 \| 254312480 \| 16848 \| 0.01% \| 0.01% \| \| correlated-value-propagation.NumPhis \| 18412 \| 18347 \| -65 \| -0.35% \| 0.35% \| \| early-cse.NumCSE \| 2183283 \| 2183267 \| -16 \| 0.00% \| 0.00% \| \| early-cse.NumSimplify \| 550105 \| 541842 \| -8263 \| -1.50% \| 1.50% \| \| instcombine.NumAggregateReconstructionsSimplified \| 73 \| 4506 \| 4433 \| 6072.60% \| 6072.60% \| \| instcombine.NumCombined \| 3640311 \| 3644419 \| 4108 \| 0.11% \| 0.11% \| \| instcombine.NumDeadInst \| 1778204 \| 1783205 \| 5001 \| 0.28% \| 0.28% \| \| instcombine.NumPHICSEs \| 0 \| 22490 \| 22490 \| 0.00% \| 0.00% \| \| instcombine.NumWorklistIterations \| 2023272 \| 2024400 \| 1128 \| 0.06% \| 0.06% \| \| instcount.NumCallInst \| 1758395 \| 
1758802 \| 407 \| 0.02% \| 0.02% \| \| instcount.NumInvokeInst \| 59478 \| 59502 \| 24 \| 0.04% \| 0.04% \| \| instcount.NumPHIInst \| 330557 \| 330545 \| -12 \| 0.00% \| 0.00% \| \| instcount.TotalBlocks \| 1077138 \| 1077220 \| 82 \| 0.01% \| 0.01% \| \| instcount.TotalFuncs \| 101442 \| 101441 \| -1 \| 0.00% \| 0.00% \| \| instcount.TotalInsts \| 8831946 \| 8832606 \| 660 \| 0.01% \| 0.01% \| \| simplifycfg.NumHoistCommonCode \| 24186 \| 24187 \| 1 \| 0.00% \| 0.00% \| \| simplifycfg.NumInvokes \| 4300 \| 4410 \| 110 \| 2.56% \| 2.56% \| \| simplifycfg.NumSimpl \| 1019813 \| 999767 \| -20046 \| -1.97% \| 1.97% \| ``` So it fires 22490 times, which is less than the ~24k that take 1 did, but more than take 2 did (22228 times). It allows foldAggregateConstructionIntoAggregateReuse() to actually work after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG would have done this PHI CSE, of all places. Additionally, it allows some more `invoke`->`call` folds to happen (+110, +2.56%). All in all, as expected, this catches fewer things overall, but all the motivating cases are still caught, so all good.	2020-08-29 18:21:24 +03:00
Roman Lebedev	bdaa3f86a0	Revert "[InstCombine] Take 2: Perform trivial PHI CSE" While the original variant with doing this in InstSimplify (rightfully) caused questions and ultimately was detected to be a culprit of stage2-stage3 mismatch, it was expected that InstCombine-based implementation would be fine. But apparently it's not, as http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24095/steps/compare-compilers/logs/stdio suggests. Which suggests that somewhere in InstCombine there is a loop over nondeterministically sorted container, which causes different worklist ordering. This reverts commit `3e69871ab5`.	2020-08-29 16:05:02 +03:00
Roman Lebedev	3e69871ab5	[InstCombine] Take 2: Perform trivial PHI CSE The original take was `6102310d81`, which taught InstSimplify to do that, which seemed better at time, since we got EarlyCSE support for free. However, it was proven that we can not do that there, the simplified-to PHI would not be reachable from the original PHI, and that is not something InstSimplify is allowed to do, as noted in the commit `ed90f15efb` that reverted it : > It appears to cause compilation non-determinism and caused stage3 mismatches. However InstCombine already does many different optimizations, so it should be a safe place to do it here. Note that we still can't just compare incoming values ranges, because there is no guarantee that these PHI's we'd simplify to were already re-visited and sorted. However coming up with a test is problematic. Effects on vanilla llvm test-suite + RawSpeed: ``` \| statistic name \| baseline \| proposed \| Δ \| % \| \|%\| \| \|----------------------------------------------------\|-----------\|-----------\|-------:\|---------:\|---------:\| \| instcombine.NumPHICSEs \| 0 \| 22228 \| 22228 \| 0.00% \| 0.00% \| \| asm-printer.EmittedInsts \| 7942329 \| 7942456 \| 127 \| 0.00% \| 0.00% \| \| assembler.ObjectBytes \| 254295632 \| 254313792 \| 18160 \| 0.01% \| 0.01% \| \| early-cse.NumCSE \| 2183283 \| 2183272 \| -11 \| 0.00% \| 0.00% \| \| early-cse.NumSimplify \| 550105 \| 541842 \| -8263 \| -1.50% \| 1.50% \| \| instcombine.NumAggregateReconstructionsSimplified \| 73 \| 4506 \| 4433 \| 6072.60% \| 6072.60% \| \| instcombine.NumCombined \| 3640311 \| 3666911 \| 26600 \| 0.73% \| 0.73% \| \| instcombine.NumDeadInst \| 1778204 \| 1783318 \| 5114 \| 0.29% \| 0.29% \| \| instcount.NumCallInst \| 1758395 \| 1758804 \| 409 \| 0.02% \| 0.02% \| \| instcount.NumInvokeInst \| 59478 \| 59502 \| 24 \| 0.04% \| 0.04% \| \| instcount.NumPHIInst \| 330557 \| 330549 \| -8 \| 0.00% \| 0.00% \| \| instcount.TotalBlocks \| 1077138 \| 1077221 \| 83 \| 0.01% \| 0.01% 
\| \| instcount.TotalFuncs \| 101442 \| 101441 \| -1 \| 0.00% \| 0.00% \| \| instcount.TotalInsts \| 8831946 \| 8832611 \| 665 \| 0.01% \| 0.01% \| \| simplifycfg.NumInvokes \| 4300 \| 4410 \| 110 \| 2.56% \| 2.56% \| \| simplifycfg.NumSimpl \| 1019813 \| 999740 \| -20073 \| -1.97% \| 1.97% \| ``` So it fires ~22k times, which is less than the ~24k that take 1 did. It allows foldAggregateConstructionIntoAggregateReuse() to actually work after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG would have done this PHI CSE, of all places. Additionally, it allows some more `invoke`->`call` folds to happen (+110, +2.56%). All in all, as expected, this catches fewer things overall, but all the motivating cases are still caught, so all good.	2020-08-29 13:13:06 +03:00
Owen Anderson	ed90f15efb	Revert "[InstSimplify][EarlyCSE] Try to CSE PHI nodes in the same basic block" This reverts commit `6102310d81`. It appears to cause compilation non-determinism and caused stage3 mismatches.	2020-08-28 23:43:42 +00:00
Anna Welker	064981f0ce	[ARM][MVE] Enable MVE gathers and scatters by default Enable MVE gather/scatters by default, which requires some minor adaptations in some tests. Differential revision: https://reviews.llvm.org/D86776	2020-08-28 19:05:29 +01:00
Roman Lebedev	6102310d81	[InstSimplify][EarlyCSE] Try to CSE PHI nodes in the same basic block Apparently, we don't do this, neither in EarlyCSE, nor in InstSimplify, nor in (old) GVN, but do in NewGVN and SimplifyCFG of all places.. While i could teach EarlyCSE how to hash PHI nodes, we can't really do much (anything?) even if we find two identical PHI nodes in different basic blocks, same-BB case is the interesting one, and if we teach InstSimplify about it (which is what i wanted originally, https://reviews.llvm.org/D86530), we get EarlyCSE support for free. So i would think this is pretty uncontroversial. On vanilla llvm test-suite + RawSpeed, this has the following effects: ``` \| statistic name \| baseline \| proposed \| Δ \| % \| \\|%\\| \| \|----------------------------------------------------\|-----------\|-----------\|-------:\|---------:\|---------:\| \| instsimplify.NumPHICSE \| 0 \| 23779 \| 23779 \| 0.00% \| 0.00% \| \| asm-printer.EmittedInsts \| 7942328 \| 7942392 \| 64 \| 0.00% \| 0.00% \| \| assembler.ObjectBytes \| 273069192 \| 273084704 \| 15512 \| 0.01% \| 0.01% \| \| correlated-value-propagation.NumPhis \| 18412 \| 18539 \| 127 \| 0.69% \| 0.69% \| \| early-cse.NumCSE \| 2183283 \| 2183227 \| -56 \| 0.00% \| 0.00% \| \| early-cse.NumSimplify \| 550105 \| 542090 \| -8015 \| -1.46% \| 1.46% \| \| instcombine.NumAggregateReconstructionsSimplified \| 73 \| 4506 \| 4433 \| 6072.60% \| 6072.60% \| \| instcombine.NumCombined \| 3640264 \| 3664769 \| 24505 \| 0.67% \| 0.67% \| \| instcombine.NumDeadInst \| 1778193 \| 1783183 \| 4990 \| 0.28% \| 0.28% \| \| instcount.NumCallInst \| 1758401 \| 1758799 \| 398 \| 0.02% \| 0.02% \| \| instcount.NumInvokeInst \| 59478 \| 59502 \| 24 \| 0.04% \| 0.04% \| \| instcount.NumPHIInst \| 330557 \| 330533 \| -24 \| -0.01% \| 0.01% \| \| instcount.TotalInsts \| 8831952 \| 8832286 \| 334 \| 0.00% \| 0.00% \| \| simplifycfg.NumInvokes \| 4300 \| 4410 \| 110 \| 2.56% \| 2.56% \| \| simplifycfg.NumSimpl \| 1019808 \| 
999607 \| -20201 \| -1.98% \| 1.98% \| ``` I.e. it fires ~24k times, causes +110 (+2.56%) more `invoke` -> `call` transforms, and counter-intuitively results in more instructions total. That being said, the PHI count doesn't decrease that much, and looking at some examples, it seems at least some of them were previously getting PHI CSE'd in SimplifyCFG of all places.. I'm adjusting `Instruction::isIdenticalToWhenDefined()` at the same time. As a comment in `InstCombinerImpl::visitPHINode()` already stated, there are no guarantees on the ordering of the operands of a PHI node, so if we just naively compare them, we may false-negatively say that the nodes are not equal when the only difference is operand order, which is especially important since the fold is in InstSimplify, so we can't rely on InstCombine sorting them beforehand. Fixing this for the general case is costly (geomean +0.02%), and does not appear to catch anything in test-suite, but for the same-BB case, it's trivial, so let's fix at least that. As per http://llvm-compile-time-tracker.com/compare.php?from=04879086b44348cad600a0a1ccbe1f7776cc3cf9&to=82bdedb888b945df1e9f130dd3ac4dd3c96e2925&stat=instructions this appears to cause geomean +0.03% compile time increase (regression), but geomean -0.01%..-0.04% code size decrease (improvement).	2020-08-27 18:47:04 +03:00
Sjoerd Meijer	bda8fbe2d2	[LV] Fallback strategies if tail-folding fails This implements 2 different vectorisation fallback strategies if tail-folding fails: 1) don't vectorise at all, or 2) vectorise using a scalar epilogue. This can be controlled with option -prefer-predicate-over-epilogue, that has been changed to take a numeric value corresponding to the tail-folding preference and preferred fallback. Patch by: Pierre van Houtryve, Sjoerd Meijer. Differential Revision: https://reviews.llvm.org/D79783	2020-08-26 16:55:25 +01:00
David Green	677c1590c0	[ARM] Increase MVE gather/scatter cost by MVECostFactor. MVE gather/scatter code generation is looking a lot better than it used to, but still has some issues. We currently model the instructions as 1 cycle per element, which is a bit low for some cases. Increasing the cost by the MVECostFactor brings them in line with our other instruction costs. This will have the effect of only generating them when the extra benefit is more likely to overcome some of the issues, notably running out of registers and vectorizing loops that could otherwise be SLP vectorized. In the short term, whilst we look at other ways of dealing with those more directly, we can increase the costs of gathers to make them more likely to be beneficial when created. Differential Revision: https://reviews.llvm.org/D86444	2020-08-26 13:03:46 +01:00
Arthur Eubanks	df5576a852	[test] Add -inject-tli-mapping to -loop-vectorize -vector-library tests The legacy LoopVectorize has a dependency on InjectTLIMappingsLegacy. That cannot be expressed in the new PM since they are both normal passes. Explicitly add -inject-tli-mappings as a pass. Follow-up to https://reviews.llvm.org/D86492. Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D86561	2020-08-25 11:55:11 -07:00
Sjoerd Meijer	ae366479e8	[LV] get.active.lane.mask consuming tripcount instead of backedge-taken count This adapts LV to the new semantics of get.active.lane.mask as discussed in D86147, which means that the LV now emits intrinsic get.active.lane.mask with the loop tripcount instead of the backedge-taken count as its second argument. The motivation for this is described in D86147. Differential Revision: https://reviews.llvm.org/D86304	2020-08-25 13:49:19 +01:00
Anna Welker	8048068c3e	[ARM][MVE] Allow tail predication for strides !=1 with gather/scatters If gather/scatters are enabled, ARMTargetTransformInfo now allows tail predication for loops with a much wider range of strides, up to anything that is loop invariant. Differential Revision: https://reviews.llvm.org/D85410	2020-08-24 13:54:47 +01:00
David Green	2b69efded0	[ARM][LV] Add a preferPredicatedReductionSelect target hook As part of D84741, this adds a target hook for the preferPredicatedReductionSelect option and makes use of it under MVE, allowing us to tail predicate most reduction loops. Differential Revision: https://reviews.llvm.org/D85980	2020-08-21 08:48:12 +01:00
David Green	816097e4e5	[LV] Allow tail folded reduction selects to remain in the loop The normal scheme for tail folding reductions is to use: loop: p = phi(0, a); mask = ...; x = masked_load(..., mask); a = add(x, p); s = select(mask, a, p). This means we need to keep the registers p and a alive out of the loop, plus the mask. On a target with predicated operations we can instead generate the phi as p = phi(0, s). This keeps the select in the loop, and we can fold select(m, add(a, b), c) to something like a vaddt c, a, b using the m predicate. This in turn allows us to tail predicate the entire loop. Differential Revision: https://reviews.llvm.org/D84741	2020-08-20 14:31:14 +01:00
Hiroshi Yamauchi	ab401a8c8a	[PGO][PGSO][LV] Fix loop not vectorized issue under profile guided size opts. D81345 appears to accidentally disable vectorization when it is explicitly enabled. As PGSO isn't currently accessible from LoopAccessInfo, revert back to vectorization with versioning-for-unit-stride for PGSO. Differential Revision: https://reviews.llvm.org/D85784	2020-08-19 12:13:34 -07:00
David Green	b8088ada05	[LV] Predicated reduction tests. NFC	2020-08-18 16:02:21 +01:00
David Green	745bf6cf44	[LoopVectorizer] Inloop vector reductions Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128bit vectors, sign extend the inputs to i32, multiply them together, and sum the result into a 32bit general purpose register. So taking 16 i8's as inputs, they can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32bit registers together treated as a 64bit integer (even though MVE does not have a plain 64bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths, and to vectorize things we previously could not. In order to do that we need a way to represent that the reduction operation, specified with an llvm.experimental.vector.reduce when vectorizing for Arm, occurs inside the loop, not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe representing the reductions that the original chain of reduction operations is replaced by. Cost modelling is currently just done through a prefersInloopReduction TTI hook (which follows in a later patch). Differential Revision: https://reviews.llvm.org/D75069	2020-08-06 10:10:50 +01:00
Jordan Rupprecht	3c39db0c44	Revert "[LoopVectorizer] Inloop vector reductions" This reverts commit `e9761688e4`. It breaks the build: ``` ~/src/llvm-project/llvm/lib/Analysis/IVDescriptors.cpp:868:10: error: no viable conversion from returned value of type 'SmallVector<[...], 8>' to function return type 'SmallVector<[...], 4>' return ReductionOperations; ```	2020-08-05 10:24:15 -07:00
David Green	e9761688e4	[LoopVectorizer] Inloop vector reductions Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128bit vectors, sign extend the inputs to i32, multiply them together, and sum the result into a 32bit general purpose register. So taking 16 i8's as inputs, they can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32bit registers together treated as a 64bit integer (even though MVE does not have a plain 64bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths, and to vectorize things we previously could not. In order to do that we need a way to represent that the reduction operation, specified with an llvm.experimental.vector.reduce when vectorizing for Arm, occurs inside the loop, not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe representing the reductions that the original chain of reduction operations is replaced by. Cost modelling is currently just done through a prefersInloopReduction TTI hook (which follows in a later patch). Differential Revision: https://reviews.llvm.org/D75069	2020-08-05 18:14:05 +01:00
Florian Hahn	98db27711d	[LV] Do not check widening decision for instrs outside of loop. No widening decisions will be computed for instructions outside the loop. Do not try to get a widening decision. The load/store will be just a scalar load, so treating it as normal should be fine I think. Fixes PR46950. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D85087	2020-08-03 10:09:24 +01:00
Arthur Eubanks	b36c39260e	[NewPM] Don't print 'Invalidating all non-preserved analyses' If an analysis is actually invalidated, there's already a log statement for that: 'Invalidating analysis: FooAnalysis'. Otherwise the statement is not very useful. Reviewed By: asbirlea, ychen Differential Revision: https://reviews.llvm.org/D84981	2020-07-30 19:40:29 -07:00
David Green	1da0c47fa2	[LoopVectorizer] Don't create unused block masks for reductions. NFC This removes some unneeded block masks when we don't have any reductions. It should not have any effect on codegen as the values created are dead anyway. Differential Revision: https://reviews.llvm.org/D81415	2020-07-30 14:28:08 +01:00
Craig Topper	3efc978bae	[LV] Add abs/smin/smax/umin/umax intrinsics to isTriviallyVectorizable This patch adds support for vectorizing these intrinsics. Differential Revision: https://reviews.llvm.org/D84796	2020-07-29 10:23:07 -07:00
David Green	9ddb28964c	[ARM] Tune getCastInstrCost for extending masked loads and truncating masked stores This patch uses the feature added in D79162 to fix the cost of a sext/zext of a masked load, or a trunc for a masked store. Previously, those were considered cheap or even free, but it's not the case as we cannot split the load in the same way we would for normal loads. This updates the costs to better reflect reality, and adds a test for it in test/Analysis/CostModel/ARM/cast.ll. It also adds a vectorizer test that showcases the improvement: in some cases, the vectorizer will now choose a smaller VF when tail-predication is enabled, which results in better codegen. (Because if it were to use a higher VF in those cases, the code we see above would be generated, and the vmovs would block tail-predication later in the process, resulting in very poor codegen overall) Original Patch by Pierre van Houtryve Differential Revision: https://reviews.llvm.org/D79163	2020-07-29 13:41:34 +01:00
Jinsong Ji	d28f86723f	Re-land "[PowerPC] Remove QPX/A2Q BGQ/BGP CNK support" This reverts commit `bf544fa1c3`. Fixed the typo in PPCInstrInfo.cpp.	2020-07-28 14:00:11 +00:00
Jinsong Ji	bf544fa1c3	Revert "[PowerPC] Remove QPX/A2Q BGQ/BGP CNK support" This reverts commit `adffce7153`. This is breaking test-suite, revert while investigating.	2020-07-27 21:07:00 +00:00
Jinsong Ji	adffce7153	[PowerPC] Remove QPX/A2Q BGQ/BGP CNK support Per RFC http://lists.llvm.org/pipermail/llvm-dev/2020-April/141295.html no one is making use of QPX/A2Q/BGQ/BGP CNK anymore. This patch removes the support for QPX/A2Q in llvm, BGQ/BGP in clang, and CNK in openmp/polly. Reviewed By: hfinkel Differential Revision: https://reviews.llvm.org/D83915	2020-07-27 19:24:39 +00:00
Arthur Eubanks	9bb6ce78be	Rename scoped-noalias -> scoped-noalias-aa Summary: To match NewPM name. Also the new name is clearer and more consistent. Subscribers: jvesely, nhaehnle, hiraditya, asbirlea, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D84542	2020-07-24 12:14:27 -07:00
Hiroshi Yamauchi	7bedae7dee	[PGO][PGSO] Add profile guided size optimization to loop vectorization legality.	2020-07-21 11:16:36 -07:00
David Green	2f4c3e8097	[LV] Add additional InLoop reduction tests. NFC	2020-07-18 12:14:23 +01:00
Arthur Eubanks	0dfa4a83fa	Revert "[PGO][PGSO] Add profile guided size optimization to loop vectorization legality." This reverts commit `30c382a7c6`. See https://crbug.com/1106813.	2020-07-17 16:47:41 -07:00
Sjoerd Meijer	7ebc6bed84	[ARM][MVE] Reorg of the LV tail-folding tests It was getting difficult to see which test was in which file, so this reorganises the test files so that all filenames now start with tail-folding-* followed by a more descriptive name for what that group of tests checks.	2020-07-17 15:54:15 +01:00
Anna Welker	23c9534515	[LV] Enable the LoopVectorizer to create pointer inductions This patch enables the LoopVectorizer to build a phi of pointer type and provide the vector loads and stores with vector type getelementptrs built from the pointer induction variable, which produces far fewer instructions than the previous approach of creating scalar getelementptrs and gluing them together into a vector. Differential Revision: https://reviews.llvm.org/D81267	2020-07-17 13:35:07 +01:00
Hiroshi Yamauchi	30c382a7c6	[PGO][PGSO] Add profile guided size optimization to loop vectorization legality. Differential Revision: https://reviews.llvm.org/D83329	2020-07-15 11:49:36 -07:00
Sjoerd Meijer	959eaa50d6	[ARM][MVE] Only tail-fold integer add reductions If a vector body has live-out values, it is probably a reduction, which needs a final reduction step after the loop. MVE has a VADDV instruction to reduce integer vectors, but doesn't have an equivalent one for float vectors. A live-out value that is not recognised as a reduction later in the optimisation pipeline will result in the tail-predicated loop being reverted to a non-predicated loop. This is very expensive, i.e. it has a significant performance impact, which is what we hope to avoid by fine tuning the ARM TTI hook preferPredicateOverEpilogue implementation. Differential Revision: https://reviews.llvm.org/D82953	2020-07-14 10:15:07 +01:00
Sjoerd Meijer	595270ae39	[ARM][MVE] Refactor option -disable-mve-tail-predication This refactors option -disable-mve-tail-predication to take different arguments so that we have 1 option to control tail-predication rather than several different ones. This is also a prep step for D82953, in which we want to reject reductions unless that is requested with this option. Differential Revision: https://reviews.llvm.org/D83133	2020-07-13 13:40:33 +01:00
Ayal Zaks	82a5157ff1	[LV] Fixing versioning-for-unit-stride of loops with small trip count This patch fixes D81345 and PR46652. If a loop with a small trip count is compiled w/o -Os/-Oz, Loop Access Analysis still generates runtime checks for unit strides that will version the loop. In such cases, the loop vectorizer should either re-run the analysis or bail out of vectorizing the loop, as done prior to D81345. The latter is applied for now as the former requires refactoring. Differential Revision: https://reviews.llvm.org/D83470	2020-07-12 19:51:47 +03:00
Florian Hahn	264ab1e2c8	[LV] Pick vector loop body as insert point for SCEV expansion. Currently the DomTree is not kept up to date for additional blocks generated in the vector loop, for example when vectorizing with predication. SCEVExpander relies on dominance checks when looking for existing instructions to re-use and in some cases that can lead to the expander picking instructions that do not actually dominate their insert point (e.g. as in PR46525). Unfortunately keeping the DT up-to-date is a bit tricky, because the CFG is only patched up after generating code for a block. For now, we can just use the vector loop header, as this ensures the inserted instructions dominate all uses in the vector loop. There should be no noticeable impact on the generated code, as other passes should sink those instructions, if profitable. Fixes PR46525. Reviewers: Ayal, gilr, mkazantsev, dmgreen Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D83288	2020-07-10 10:37:12 +01:00
Ayal Zaks	7bf299c8d8	[LV] Vectorize without versioning-for-unit-stride under -Os/-Oz If a loop is in a function marked OptSize, Loop Access Analysis should refrain from generating runtime checks for unit strides that will version the loop. If a loop is in a function marked OptSize and its vectorization is enabled, it should be vectorized w/o any versioning. Fixes PR46228. Differential Revision: https://reviews.llvm.org/D81345	2020-07-07 15:04:21 +03:00
Jordan Rupprecht	10c82eecbc	Revert "[LV] Enable the LoopVectorizer to create pointer inductions" This reverts commit `a8fe12065e`. It causes a crash when building gzip. Will post the detailed reduced test case to D81267.	2020-07-06 17:50:38 -07:00
David Green	146dad0077	[ARM] MVE FP16 cost adjustments This adjusts the MVE fp16 cost model, similar to how we already do for integer casts. It uses the base cost of 1 per cvt for most fp extends / truncates, but adjusts it for loads and stores where we know that an extending load has been used to get the load into the correct lane, and only an MVE VCVTB is then needed. Differential Revision: https://reviews.llvm.org/D81813	2020-07-06 15:57:51 +01:00
David Green	55227f85d0	[ARM] Use BaseT::getMemoryOpCost for getMemoryOpCost This alters getMemoryOpCost to use the Base TargetTransformInfo version that includes some additional checks for whether extending loads are legal. This will generally have the effect of making <2 x ..> and some <4 x ..> loads/stores more expensive, which in turn should help favour larger vector factors. Notably it alters the cost of a <4 x half>, which with the current codegen will be expensive if it is not extended. Differential Revision: https://reviews.llvm.org/D82456	2020-07-06 10:58:40 +01:00
Anna Welker	a8fe12065e	[LV] Enable the LoopVectorizer to create pointer inductions This patch enables the LoopVectorizer to build a phi of pointer type and provide the vector loads and stores with vector type getelementptrs built from the pointer induction variable, which produces far fewer instructions than the previous approach of creating scalar getelementptrs and gluing them together into a vector. Differential Revision: https://reviews.llvm.org/D81267	2020-07-02 11:39:28 +01:00
Florian Hahn	1ccc49924a	[AArch64] Add getCFInstrCost, treat branches as free for throughput. D79164/2596da31740f changed getCFInstrCost to return 1 by default. AArch64 did not have its own implementation, hence the throughput cost of CFI instructions is overestimated. On most cores, most branches should be predicted and essentially free throughput wise. This recovers a 9% performance regression on a SPEC2006 benchmark on AArch64 with -O3 LTO & PGO. This patch effectively restores pre `2596da3174` behavior for AArch64 and undoes the AArch64 test changes of the patch. Reviewers: samparker, dmgreen, anemet Reviewed By: samparker Differential Revision: https://reviews.llvm.org/D82755	2020-06-30 20:34:04 +01:00
Fangrui Song	4cd19a6e15	[BasicAA] Rename -disable-basicaa to -disable-basic-aa to be consistent with the canonical name "basic-aa"	2020-06-26 20:55:44 -07:00
Fangrui Song	f31811f2dc	[BasicAA] Rename deprecated -basicaa to -basic-aa Follow-up to D82607. Revert an accidental change (empty.ll) of D82683.	2020-06-26 20:41:37 -07:00
Sjoerd Meijer	e345d547a0	Recommit "[LV] Emit @llvm.get.active.lane.mask for tail-folded loops" Fixed ARM regression test. Please see the original commit message rG47650451738c for details.	2020-06-17 13:12:15 +01:00
Sjoerd Meijer	d4e183f686	Revert "[LV] Emit @llvm.get.active.mask for tail-folded loops" This reverts commit `4765045173` while I investigate the build bot failures.	2020-06-17 10:09:54 +01:00
Sjoerd Meijer	4765045173	[LV] Emit @llvm.get.active.mask for tail-folded loops This emits new IR intrinsic @llvm.get.active.mask for tail-folded vectorised loops if the intrinsic is supported by the backend, which is checked by querying TargetTransform hook emitGetActiveLaneMask. This intrinsic creates a mask representing active and inactive vector lanes, which is used by the masked load/store instructions that are created for tail-folded loops. The semantics of @llvm.get.active.mask are described here in LangRef: https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics This intrinsic is also used to provide a hint to the backend. That is, the second argument of the intrinsic represents the back-edge taken count of the loop. For MVE, for example, we use that to set up tail-predication, which is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop, which we can now reconstruct using @llvm.get.active.mask. Differential Revision: https://reviews.llvm.org/D79100	2020-06-17 09:53:58 +01:00
Sam Parker	2596da3174	[CostModel] getCFInstrCost in getUserCost. Have BasicTTI call the base implementation so that both agree on the default behaviour, with the default being a cost of '1'. This has required an X86 specific implementation as it seems to be very reliant on those instructions being free. Changes are also made to AMDGPU so that their implementations distinguish between cost kinds, so that the unrolling isn't affected. PowerPC also has its own implementation to prevent changes to the reg-usage vectorizer test. The cost model test changes now reflect that ret instructions are not generally free. Differential Revision: https://reviews.llvm.org/D79164	2020-06-15 09:28:46 +01:00
Florian Hahn	6176f04436	[LAA] Do not set CanDoRT to false for AS that do not need RT checks. Alternative approach to D80570. canCheckPtrAtRT already contains checks to figure out for which alias sets runtime checks are needed. But it currently sets CanDoRT to false for alias sets for which we cannot do RT checks but also do not need any. If we know that we do not need RT checks based on the number of reads/writes in the alias set, we can skip processing the AS. This patch also adds an assertion to ensure that DepCands does not contain more than one write from the alias set. Reviewers: Ayal, anemet, hfinkel, dmgreen Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D80622	2020-06-14 20:55:59 +01:00
David Green	46529978bf	[ARM] Always use reductions intrinsics under MVE Similar to a recent change to the X86 backend, this changes things so that we always produce reduction intrinsics for all reduction types, not just the legal ones. This gives a better chance in the backend to custom lower them to something more suitable for MVE. Especially for something like fadd, the in-order reduction produced during DAG lowering is already better than the shuffles produced in the midend, and we can do even better with a bit of custom lowering. Differential Revision: https://reviews.llvm.org/D81398	2020-06-12 19:21:17 +01:00
Florian Hahn	3a846d4d92	[VPlan] Reject loops without computable backedge taken counts getOrCreateTripCount is used to generate code for the outer loop, but it requires a computable backedge taken count. Check that in the VPlan native path. Reviewers: Ayal, gilr, rengolin, sguggill Reviewed By: sguggill Differential Revision: https://reviews.llvm.org/D81088	2020-06-12 10:31:18 +01:00
David Green	a4cf68e743	[ARM] MVE vectorizer reduction tests for each reduction type. NFC	2020-06-10 11:18:24 +01:00
Anh Tuyen Tran	e7c5412b37	[NFC][LV][TEST]: extend pr45679-fold-tail-by-masking.ll with -force-vector-width=1 -force-vector-interleave=4 Summary: Add -force-vector-width=1 -force-vector-interleave=4 to pr45679-fold-tail-by-masking.ll Author: anhtuyen (Anh Tuyen Tran) Reviewers: Ayal (Ayal Zaks) Reviewed By: Ayal (Ayal Zaks) Subscribers: rkruppe (Hanna Kruppe), llvm-commits, LLVM Tag: LLVM Differential Revision: https://reviews.llvm.org/D80446	2020-06-09 18:30:56 +00:00
Sanjay Patel	e50059f6b6	[x86] form reduction intrinsics from vectorizers instead of raw IR Motivating examples are seen in the PhaseOrdering tests based on: https://bugs.llvm.org/show_bug.cgi?id=43953#c2 - if we have intrinsics there, some pass can fold them. The intrinsics are still named "experimental" at this point, but if there is no fallout from this patch, that will be a good indicator that it is safe to finalize them. Differential Revision: https://reviews.llvm.org/D80867	2020-06-05 12:38:49 -04:00
Florian Hahn	b446ec56a2	[LV] Make sure the MaxVF is a power-of-2 by rounding down. LV currently only supports power of 2 vectorization factors, which has been made explicit with the assertion added in `840450549c`. However, if the widest type is not a power-of-2 the computed MaxVF won't be a power-of-2 either. This patch updates computeFeasibleMaxVF to ensure the returned value is a power-of-2 by rounding down to the nearest power-of-2. Fixes PR46139. Reviewers: Ayal, gilr, rengolin Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D80870	2020-06-02 10:40:49 +01:00
Sanjay Patel	db653ff6b7	[LoopVectorize] auto-generate complete test checks; NFC	2020-05-29 13:14:08 -04:00
Sanjay Patel	f78eecbb93	[LoopVectorize] regenerate test checks; NFC Align attributes are now visible.	2020-05-29 13:02:45 -04:00
Sanjay Patel	5e94273227	[LoopVectorize] auto-generate complete checks; NFC	2020-05-29 13:01:35 -04:00
Sanjay Patel	9d1f95bf9f	[LoopVectorize] regenerate test checks; NFC Align attributes are now visible.	2020-05-29 13:01:35 -04:00
Sanjay Patel	0b21c6706a	[LoopVectorize] auto-generate complete test checks; NFC	2020-05-29 13:01:35 -04:00
Sanjay Patel	1a2bffaf8b	[InstCombine] reassociate sub+add to increase adds and throughput The -reassociate pass tends to transform this kind of pattern into something that is worse for vectorization and codegen. See PR43953: https://bugs.llvm.org/show_bug.cgi?id=43953 Follows up the FP version of the same transform: rGa0ce2338a083	2020-05-26 14:49:17 -04:00
Sanjay Patel	f5cfcc4b06	[LoopVectorize] regenerate full test checks; NFC	2020-05-26 14:49:17 -04:00

1210 Commits