llvm-project

Commit Graph

Author	SHA1	Message	Date
Sanjay Patel	bfb9b8e075	[Passes] add a tail-call-elim pass near the end of the opt pipeline We call tail-call-elim near the beginning of the pipeline, but that is too early to annotate calls that get added later. In the motivating case from issue #47852, the missing 'tail' on memset leads to sub-optimal codegen. I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time. "tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ): https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright Note that there was an effort to split the tail call functionality into 2 passes - that could help reduce compile-time if we find that this change costs more in compile-time than expected based on the preliminary testing: D60031 Differential Revision: https://reviews.llvm.org/D130374	2022-07-25 15:25:47 -04:00
Nuno Lopes	a30e77b6f6	fix tests for commit `9df0b254d2`	2022-07-23 22:32:30 +01:00
Nuno Lopes	9df0b254d2	[NFC] Switch a few uses of undef to poison as placeholders for unreachable code	2022-07-23 21:50:11 +01:00
Philip Reames	bd75350180	[LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst This code confuses LV's "Uniform" and LVL/LAI's "Uniform". Despite the common name, these are different. * LV's notion means that only the first lane of each unrolled part is required. That is, lanes within a single unroll factor are considered uniform. This allows e.g. widenable memory ops to be considered uses of uniform computations. * LVL and LAI's notion refers to all lanes across all unrollings. IsUniformMem is in turn defined in terms of LAI's notion. Thus a UniformMemOp is a memory operation with a loop invariant address. This means the same address is accessed in every iteration. The tweaked piece of code was trying to match a uniform mem op (i.e. fully loop invariant address), but instead checked for LV's notion of uniformity. In theory, this meant with UF > 1, we could speculate a load which wasn't safe to execute. This ends up being mostly silent in current code as it is nearly impossible to create the case where this difference is visible. The closest I've come is the test case from 54cb87, but even then, the incorrect result is only visible in the vplan debug output; before this change we sink the unsafely speculated load back into the user's predicate blocks before emitting IR. Both before and after IR are correct so the differences aren't "interesting". The other test changes are uninteresting. They're cases where LV's uniform analysis is slightly weaker than SCEV isLoopInvariant.	2022-07-21 15:44:34 -07:00
Philip Reames	54cb87964d	[LV] Add a load focused version of the r45679 test This is a reproducer for a bug in predicated instruction handling. The final result code is correct, but the reasoning by which we get there isn't.	2022-07-21 15:33:42 -07:00
Philip Reames	83993d666b	[LV][SVE] Autogen a test for ease of update	2022-07-21 13:12:53 -07:00
Philip Reames	27945f9282	[RISCV][LV] Split coverage of uniform load with outside use Turns out this has a large effect on tail folding, so split out a single test to cover that case and remove it from the others.	2022-07-21 12:07:26 -07:00
Philip Reames	bb5dc2918f	[RISCV][LV] Add tail folding coverage of uniform load store cases	2022-07-21 11:15:36 -07:00
Philip Reames	56a25ed208	[RISCV][LV] Add a test for uniform store of a loop varying value	2022-07-21 11:15:36 -07:00
Philip Reames	0ae46693f0	[RISCV][LV] Split out and expand tests for uniform loads and stores	2022-07-21 10:42:18 -07:00
David Sherwood	f15b6b2907	[AArch64] Add target hook for preferPredicateOverEpilogue This patch adds the AArch64 hook for preferPredicateOverEpilogue, which currently returns true if SVE is enabled and one of the following conditions (non-exhaustive) is met: 1. The "sve-tail-folding" option is set to "all", or 2. The "sve-tail-folding" option is set to "all+noreductions" and the loop does not contain reductions, 3. The "sve-tail-folding" option is set to "all+norecurrences" and the loop has no first-order recurrences. Currently the default option is "disabled", but this will be changed in a later patch. I've added new tests to show the options behave as expected here: Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll Differential Revision: https://reviews.llvm.org/D129560	2022-07-21 17:20:06 +01:00
David Sherwood	ceb6c23b70	[NFC][LoopVectorize] Explicitly disable tail-folding on some SVE tests This patch is in preparation for enabling vectorisation with tail-folding by default for SVE targets. Once we do that many existing tests will break that depend upon having normal unpredicated vector loops. For all such tests I have added the flag: -prefer-predicate-over-epilogue=scalar-epilogue Differential Revision: https://reviews.llvm.org/D129137	2022-07-21 15:23:00 +01:00
Philip Reames	f934b9b073	[LV] Refresh a couple of autogen tests for naming change These appear to just be changes in temporary identifiers; a bit surprising we have so many.	2022-07-20 14:47:52 -07:00
Philip Reames	1a73ef75fa	[LV] Autogen a test for ease of update	2022-07-20 08:19:38 -07:00
Philip Reames	be25f52fec	[LV] Autogen several tests for ease of update in upcoming change	2022-07-20 07:17:51 -07:00
Philip Reames	523a526a02	[LV] Fix miscompile due to srem/sdiv speculation safety condition An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block. Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose. Differential Revision: https://reviews.llvm.org/D130106	2022-07-20 05:35:23 -07:00
David Sherwood	79660d339e	[LoopVectorize][AArch64] Add TTI hook preferPredicatedReductionSelect By default if SVE is enabled we want the select instruction used for reductions to be inside the loop, rather than outside. This makes it possible for the backend to fold the select into the operation to produce a single predicated add, fadd, etc. Differential Revision: https://reviews.llvm.org/D129763	2022-07-20 09:33:29 +01:00
Philip Reames	f1243fa193	[LV] Autogen a partially autogened test for ease of update	2022-07-19 14:18:53 -07:00
Philip Reames	8353403f08	[LV] Add test for generic predicated sdiv	2022-07-19 12:33:36 -07:00
Philip Reames	2247fe856a	[LV] Add test coverage for a bug in srem handling	2022-07-19 11:29:17 -07:00
Philip Reames	b7d3ba4bdb	[LV] Add test coverage for scalable div/rem patterns	2022-07-19 11:02:14 -07:00
David Sherwood	34f81cfa3d	[LoopVectorize][NFC] Split reductions out from sve-tail-folding into new file In sve-tail-folding-reductions.ll I've also added an extra RUN line to test normal reductions, i.e. not in-loop. This patch is a pre-commit in preparation for a follow-on patch that changes how reduction selects are generated in the vector loop. Differential Revision: https://reviews.llvm.org/D129761	2022-07-18 13:56:39 +01:00
David Sherwood	1e77b0c871	[AArch64][NFC] Simplify loop vectoriser tail-folding tests I've simplified all of the SVE vectoriser tail-folding tests to only care about testing the flag: -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue In practice we always want to fall back on unpredicated vector loops if tail-folding is not possible. Differential Revision: https://reviews.llvm.org/D129843	2022-07-18 13:37:29 +01:00
Graham Hunter	db8fcb2c25	[LAA] Add recursive IR walker for forked pointers This builds on the previous forked pointers patch, which only accepted a single select as the pointer to check. A recursive function to walk through IR has been added, which searches for either a loop-invariant or addrec SCEV. This will only handle a single fork at present, so selects of selects or a GEP with a select for both the base and offset will be rejected. There is also a recursion limit with a cli option to change it. Reviewed By: fhahn, david-arm Differential Revision: https://reviews.llvm.org/D108699	2022-07-18 12:06:17 +01:00
Florian Hahn	105032f549	[LV] Use PHI recipe instead of PredRecipe for subsequent uses. At the moment, the VPPredInstPHIRecipe is not used in subsequent uses of the predicate recipe. This incorrectly models the def-use chains, as all later uses should use the phi recipe. Fix that by delaying recording of the recipe. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D129436	2022-07-18 09:35:34 +01:00
Florian Hahn	6813b41d57	[LV] Avoid creating new run-time VF expression for each runtime check. At the moment, the cost of runtime checks for scalable vectors is overestimated due to creating separate vscale * VF expressions for each check. Instead re-use the first expression.	2022-07-16 17:24:07 +01:00
Florian Hahn	aa00fb02c9	[LV] Use umax(VF * UF, MinProfTC) for scalable vectors. For scalable vectors, it is not sufficient to only check MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because this property may not hold for larger values of vscale. In those cases, compute umax(VF * UF, MinProfTC) instead. This should fix https://lab.llvm.org/buildbot/#/builders/197/builds/2262	2022-07-15 10:23:14 -07:00
Florian Hahn	4c85a01758	[LV] Add scalable vector test showing incorrect min-trip count check. The test shows a case where the minimum trip count check incorrectly only checks the minimum profitable trip count computed due to runtime checks. This is incorrect for scalable VFs, because the VF * UF may exceed the minimum profitable trip count for vscale > 1. This is the likely reason for https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.	2022-07-15 10:02:55 -07:00
Mel Chen	bd404fbcc8	[LV][NFC] Fix the condition for printing debug messages Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D128523	2022-07-15 01:47:33 -07:00
Mel Chen	ae15e6a952	[LV] Pre-commit test case for D128523, NFC	2022-07-15 01:22:06 -07:00
David Sherwood	307ace7f20	[LoopVectorize] Ensure the VPReductionRecipe is placed after all its inputs When vectorising ordered reductions we call a function LoopVectorizationPlanner::adjustRecipesForReductions to replace the existing VPWidenRecipe for the fadd instruction with a new VPReductionRecipe. We attempt to insert the new recipe in the same place, but this is wrong because createBlockInMask may have generated new recipes that VPReductionRecipe now depends upon. I have changed the insertion code to append the recipe to the VPBasicBlock instead. Added a new RUN with tail-folding enabled to the existing test: Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll Differential Revision: https://reviews.llvm.org/D129550	2022-07-13 09:29:25 +01:00
Nikita Popov	a5ee62a141	[IndVars] Call replaceLoopPHINodesWithPreheaderValues() for already constant exits Currently we only call replaceLoopPHINodesWithPreheaderValues() if optimizeLoopExits() replaces the exit with an unconditional exit. However, it is very common that this already happens as part of eliminateIVComparison(), in which case we're leaving behind the dead header phi. Tweak the early bailout for already-constant exits to also call replaceLoopPHINodesWithPreheaderValues(). Differential Revision: https://reviews.llvm.org/D129214	2022-07-13 09:43:21 +02:00
David Sherwood	6b694d600a	[LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF When calculating the cost of Instruction::Br in getInstructionCost we query PredicatedBBsAfterVectorization to see if there is a scalar predicated block. However, this meant that the decisions being made for a given fixed-width VF were affecting the cost for a scalable VF. As a result we were returning InstructionCost::Invalid pointlessly for a scalable VF that should have a low cost. I encountered this for some loops when enabling tail-folding for scalable VFs. Test added here: Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll Differential Revision: https://reviews.llvm.org/D128272	2022-07-12 14:53:20 +01:00
Nikita Popov	4bb7b6fae3	[IR] Remove support for float binop constant expressions As part of https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179, this removes support for the floating-point binop constant expressions fadd, fsub, fmul, fdiv and frem. As part of this change, the C APIs LLVMConstFAdd, LLVMConstFSub, LLVMConstFMul, LLVMConstFDiv and LLVMConstFRem are removed. The LLVMBuild APIs should be used instead. Differential Revision: https://reviews.llvm.org/D129478	2022-07-12 09:40:49 +02:00
David Sherwood	03fee6712a	[LoopVectorize] Add option to use active lane mask for loop control flow Currently, for vectorised loops that use the get.active.lane.mask intrinsic we only use the mask for predicated vector operations, such as masked loads and stores, etc. The loop itself is still controlled by comparing the canonical induction variable with the trip count. However, for some targets this is inefficient when it's cheap to use the mask itself to control the loop. This patch adds support for using the active lane mask for control flow by: 1. Generating the active lane mask for the next iteration of the vector loop, rather than the current one. If there are still any remaining iterations then at least the first bit of the mask will be set. 2. Extract the first bit of this mask and use this bit for the conditional branch. I did this by creating a new VPActiveLaneMaskPHIRecipe that sets up the initial PHI values in the vector loop pre-header. I've also made use of the new BranchOnCond VPInstruction for the final instruction in the loop region. Differential Revision: https://reviews.llvm.org/D125301	2022-07-11 13:46:55 +01:00
Philip Reames	b12930e133	[RISCV] Switch to using get.active.lane.mask when tail folding The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change. The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice. Differential Revision: https://reviews.llvm.org/D129221	2022-07-08 10:24:59 -07:00
Florian Hahn	12f9c7b270	[LV] Update RISCV test missed by `bc19b7c3cc`.	2022-07-07 08:51:15 -07:00
Florian Hahn	bc19b7c3cc	[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE. Now that removeDeadRecipes can remove most dead recipes across a whole VPlan, there is no need to first collect some dead instructions. Instead removeDeadRecipes can simply clean them up. Depends D127580. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128408	2022-07-07 08:40:27 -07:00
David Green	438ffdb821	[ARM] Switch the costs of mve1beat and mve4beat These three subtarget features are meant to control whether MVE instructions take 1 vs 2 vs 4 architectural beats. The mve1beat feature is described as "Model MVE instructions as a 1 beat per tick architecture", meaning an MVE instruction will execute over 4 cycles. mve4beat is the opposite where the entire 4 beats of the MVE instruction execute in a single cycle. The costs for the two were backwards though, not matching the cycle counts like they should. This patch switches the costs on the two to bring them in-line with expectations. Differential Revision: https://reviews.llvm.org/D129141	2022-07-07 16:10:00 +01:00
Florian Hahn	17d48c3169	[VPlan] Move remove dead recipes before merging regions. This can enable additional region merging, while not losing opportunities as region merging does not produce dead recipes. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128831	2022-07-06 20:38:38 -07:00
Florian Hahn	6826047381	[LV] Remove redundant checks from recurrence test. The removed CHECK configurations are tested as well below, modulo the dce/instcombine runs. This makes them redundant, and removing them removes a substantial amount of unneeded checks.	2022-07-06 15:31:57 -07:00
Philip Reames	b9513a70e1	[RISCV] Autogen a vectorizer test for ease of update	2022-07-06 09:35:02 -07:00
Florian Hahn	e4613d8e1b	[LV] Remove unnecessary memory checks from recurrence test The tests are focused on code-gen for first-order recurrences. There are plenty of tests specifically for runtime check generation. Using noalias to avoid runtime checks slightly simplifies the test output and ensures the checks focus on the relevant bits.	2022-07-06 09:08:29 -07:00
Philip Reames	0f49d9e8d0	[RISCV] Add test coverage for vectorizer tailfolding As can be seen in the check lines, we have a lot of work to do.	2022-07-06 09:06:20 -07:00
Nikita Popov	11950efe06	[ConstExpr] Remove div/rem constant expressions D128820 stopped creating div/rem constant expressions by default; this patch removes support for them entirely. The getUDiv(), getExactUDiv(), getSDiv(), getExactSDiv(), getURem() and getSRem() on ConstantExpr are removed, and ConstantExpr::get() now only accepts binary operators for which ConstantExpr::isSupportedBinOp() returns true. Uses of these methods may be replaced either by corresponding IRBuilder methods, or ConstantFoldBinaryOpOperands (if a constant result is required). On the C API side, LLVMConstUDiv, LLVMConstExactUDiv, LLVMConstSDiv, LLVMConstExactSDiv, LLVMConstURem and LLVMConstSRem are removed and corresponding LLVMBuild methods should be used. Importantly, this also means that constant expressions can no longer trap! This patch still keeps the canTrap() method to minimize diff -- I plan to drop it in a separate NFC patch. Differential Revision: https://reviews.llvm.org/D129148	2022-07-06 10:11:34 +02:00
Florian Hahn	644a965c1e	[LV] Vectorize cases with larger number of RT checks, execute only if profitable. This patch replaces the tight hard cut-off for the number of runtime checks with a more accurate cost-driven approach. The new approach allows vectorization with a larger number of runtime checks in general, but only executes the vector loop (and runtime checks) if considered profitable at runtime. Profitable here means that the cost-model indicates that the runtime check cost + vector loop cost < scalar loop cost. To do that, LV computes the minimum trip count for which runtime check cost + vector-loop-cost < scalar loop cost. Note that there is still a hard cut-off to avoid excessive compile-time/code-size increases, but it is much larger than the original limit. The performance impact on standard test-suites like SPEC2006/SPEC2017/MultiSource is mostly neutral, but the new approach can give substantial gains in cases where we failed to vectorize before due to the over-aggressive cut-offs. On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%) and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system. 
```
CFP2006/447.dealII/447.dealII         -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk               -9.2%
Rodinia/srad/srad                    -36.1%
```
`size` regressions on AArch64 with -O3 are
```
MultiSource/Applications/hbd/hbd                   90256.00    106768.00  18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000       240676.00    257268.00   6.9%
MultiSourc...enchmarks/mafft/pairlocalalign       472603.00    489131.00   3.5%
External/S...2017rate/525.x264_r/525.x264_r       613831.00    630343.00   2.7%
External/S...NT2006/464.h264ref/464.h264ref       818920.00    835448.00   2.0%
External/S...te/538.imagick_r/538.imagick_r      1994730.00   2027754.00   1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4      1236471.00   1253015.00   1.3%
MultiSource/Applications/oggenc/oggenc           2108147.00   2124675.00   0.8%
External/S.../CFP2006/447.dealII/447.dealII      4742999.00   4759559.00   0.3%
External/S...rate/510.parest_r/510.parest_r     14206377.00  14239433.00   0.2%
```
Reviewed By: lebedev.ri, ebrevnov, dmgreen Differential Revision: https://reviews.llvm.org/D109368	2022-07-04 15:11:39 +01:00
Florian Hahn	0dddf04cab	[LV] Don't optimize exit cond during epilogue vectorization. At the moment, the same VPlan can be used for code generation of both the main vector and epilogue vector loop. This can lead to wrong results, if the plan is optimized based on the VF of the main vector loop and then re-used for the epilogue loop. One example where this is problematic is if the scalar loops need to execute at least one iteration, e.g. due to interleave groups. To prevent mis-compiles in the short-term, disable optimizing exit conditions for VPlans when using epilogue vectorization. The proper fix is to avoid re-using the same plan for both loops, which will require support for cloning plans first. Fixes #56319.	2022-07-01 13:48:38 +01:00
Florian Hahn	afd9f422e4	[LV] Update test for #56319 to use interleave group. The original test was over-reduced. It requires an interleave group, so the last vector iteration of the epilogue vector loop doesn't execute.	2022-07-01 11:12:00 +01:00
Florian Hahn	8704cfc744	[LV] Add test case for #56319. Test case for PR56319.	2022-07-01 10:09:24 +01:00
Sanjay Patel	cc88445a91	[InstCombine] canonicalize 'icmp (trunc X), C' to 'icmp (X & Mask), C' I looked at canonicalizing in the other direction, but that causes many potential regressions and infinite loops because we already (possibly wrongly) canonicalize "trunc X to i1" into an and+icmp. This has a data layout restriction to avoid creating illegal mask instructions, but we could remove that if we can show that the backend can undo this when needed. The motivating example from issue #56119 is modeled by the PhaseOrdering test.	2022-06-30 15:51:39 -04:00
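
Commit 523a526a02 above notes that srem and sdiv have two cases which can cause undefined behavior, not just one. The message does not spell them out; the usual pair is a zero divisor and the overflow of the minimum signed value divided by -1. The following is a minimal C++ sketch of that condition. The helper names isSafeSignedDivRem and checkedSRem are hypothetical and exist only for illustration; this is not the LLVM utility the commit refers to.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not an LLVM API): returns true when a signed
// division or remainder of n by d is well defined for 64-bit operands.
bool isSafeSignedDivRem(std::int64_t n, std::int64_t d) {
    // Case 1: division by zero.
    if (d == 0)
        return false;
    // Case 2: INT64_MIN / -1 overflows, so sdiv and srem are both undefined.
    if (n == std::numeric_limits<std::int64_t>::min() && d == -1)
        return false;
    return true;
}

// Hypothetical wrapper: computes n % d only when it is safe to do so,
// mirroring the idea that the operation may only be speculated once both
// undefined cases have been ruled out.
std::optional<std::int64_t> checkedSRem(std::int64_t n, std::int64_t d) {
    if (!isSafeSignedDivRem(n, d))
        return std::nullopt;
    return n % d;
}

// Small self-check showing why a constant divisor of -1 is not enough to
// prove safety: the zero check passes, but the overflow case remains.
void selfCheck() {
    assert(isSafeSignedDivRem(7, -1));
    assert(!isSafeSignedDivRem(std::numeric_limits<std::int64_t>::min(), -1));
    assert(!isSafeSignedDivRem(7, 0));
}
```

A check that only rejects a zero divisor would accept the srem i64 %v, -1 pattern from the commit message, which is why the second condition matters.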

1772 Commits