llvm-project

Commit Graph

Author	SHA1	Message	Date
Philip Reames	b5c7213647	[LV] Use early return to simplify code structure	2022-07-22 12:15:14 -07:00
Benjamin Kramer	5a445395e4	[LV] Remove unused variable. NFC.	2022-07-22 17:43:58 +02:00
Philip Reames	d7bf81fd51	[LV] Rework widening cost of uniform memory ops for clarity [nfc] Reorganize the code to make it clear what is and isn't handled, and why. Restructure bailout to remove (false and confusing) dependence on CM_Scalarize; just return invalid cost and propagate, that's what it is for.	2022-07-22 08:35:45 -07:00
Philip Reames	bd75350180	[LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst This code confuses LV's "Uniform" and LVL/LAI's "Uniform". Despite the common name, these are different. * LV's notion means that only the first lane of each unrolled part is required. That is, lanes within a single unroll factor are considered uniform. This allows e.g. widenable memory ops to be considered uses of uniform computations. * LVL and LAI's notion refers to all lanes across all unrollings. IsUniformMem is in turn defined in terms of LAI's notion. Thus a UniformMemOp is a memory operation with a loop invariant address. This means the same address is accessed in every iteration. The tweaked piece of code was trying to match a uniform mem op (i.e. fully loop invariant address), but instead checked for LV's notion of uniformity. In theory, this meant with UF > 1, we could speculate a load which wasn't safe to execute. This ends up being mostly silent in current code as it is nearly impossible to create the case where this difference is visible. The closest I've come is the test case from 54cb87, but even then, the incorrect result is only visible in the vplan debug output; before this change we sink the unsafely speculated load back into the user's predicate blocks before emitting IR. Both before and after IR are correct so the differences aren't "interesting". The other test changes are uninteresting. They're cases where LV's uniform analysis is slightly weaker than SCEV isLoopInvariant.	2022-07-21 15:44:34 -07:00
David Sherwood	f15b6b2907	[AArch64] Add target hook for preferPredicateOverEpilogue This patch adds the AArch64 hook for preferPredicateOverEpilogue, which currently returns true if SVE is enabled and one of the following conditions (non-exhaustive) is met: 1. The "sve-tail-folding" option is set to "all", or 2. The "sve-tail-folding" option is set to "all+noreductions" and the loop does not contain reductions, 3. The "sve-tail-folding" option is set to "all+norecurrences" and the loop has no first-order recurrences. Currently the default option is "disabled", but this will be changed in a later patch. I've added new tests to show the options behave as expected here: Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll Differential Revision: https://reviews.llvm.org/D129560	2022-07-21 17:20:06 +01:00
Philip Reames	523a526a02	[LV] Fix miscompile due to srem/sdiv speculation safety condition An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block. Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose. Differential Revision: https://reviews.llvm.org/D130106	2022-07-20 05:35:23 -07:00
Florian Hahn	5124b21648	[VPlan] Initial def-use verification. This patch introduces some initial def-use verification. This catches cases like the one fixed by D129436. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D129717	2022-07-20 11:06:32 +01:00
William Schmidt	bccc9aa81c	Don't vectorize PHIs in catchswitch blocks We currently assert in vectorizeTree(TreeEntry*) when processing a PHI bundle in a block containing a catchswitch. We attempt to set the IRBuilder insertion point following the catchswitch, which is invalid. This is done so that ShuffleBuilder.finalize() knows where to insert a shuffle if one is needed. To avoid this occurring, watch out for catchswitch blocks during buildTree_rec() processing, and avoid adding PHIs in such blocks to the vectorizable tree. It is unlikely that constraining vectorization over an exception path will cause a noticeable performance loss, so this seems preferable to trying to anticipate when a shuffle will and will not be required.	2022-07-19 06:10:17 -07:00
Florian Hahn	a75760a269	[LV] Remove unnecessary cast in widenCallInstruction. (NFC)	2022-07-19 11:23:24 +01:00
Florian Hahn	30e53b8c03	[LV] Sink module variable and use State to set it in widenCall. (NFC) Limits the lifetime of the variable and makes it independent of CallInst.	2022-07-18 19:41:48 +01:00
Florian Hahn	105032f549	[LV] Use PHI recipe instead of PredRecipe for subsequent uses. At the moment, the VPPredInstPHIRecipe is not used in subsequent uses of the predicate recipe. This incorrectly models the def-use chains, as all later uses should use the phi recipe. Fix that by delaying recording of the recipe. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D129436	2022-07-18 09:35:34 +01:00
Kazu Hirata	7094ab4ee7	[llvm] Modernize bool literals (NFC) Identified with modernize-use-bool-literals.	2022-07-17 18:08:51 -07:00
Florian Hahn	cc0ee17951	[LV] Move VPPredInstPHIRecipe::execute to VPlanRecipes.cpp (NFC)	2022-07-17 11:34:23 +01:00
Florian Hahn	6813b41d57	[LV] Avoid creating new run-time VF expression for each runtime checks. At the moment, the cost of runtime checks for scalable vectors is overestimated due to creating separate vscale * VF expressions for each check. Instead re-use the first expression.	2022-07-16 17:24:07 +01:00
David Green	4b7913c357	[VectorCombine] Only consider shuffle uses with the same type. The backend getShuffleCosts do not currently handle shuffles that change size very well. Limit the shuffles we collect to the same type to make sure they do not cause issues as reported in D128732.	2022-07-16 13:23:39 +01:00
Florian Hahn	aa00fb02c9	[LV] Use umax(VF * UF, MinProfTC) for scalable vectors. For scalable vectors, it is not sufficient to only check MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because this property may not hold for larger values of vscale. In those cases, compute umax(VF * UF, MinProfTC) instead. This should fix https://lab.llvm.org/buildbot/#/builders/197/builds/2262	2022-07-15 10:23:14 -07:00
Mel Chen	bd404fbcc8	[LV][NFC] Fix the condition for printing debug messages Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D128523	2022-07-15 01:47:33 -07:00
Kazu Hirata	611ffcf4e4	[llvm] Use value instead of getValue (NFC)	2022-07-13 23:11:56 -07:00
Florian Hahn	ee37ae91b6	[VPlan] Move VPBB verification to separate function (NFC).	2022-07-13 18:53:40 -07:00
Florian Hahn	6f7347b888	[LV] Use PredRecipe directly instead of getOrAddVPValue (NFC). There is no need to look up the VPValue for Instr, PredRecipe can be used directly.	2022-07-13 17:01:42 -07:00
Florian Hahn	225e3ec622	[LV] Move VPBranchOnMaskRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-13 14:39:59 -07:00
David Sherwood	307ace7f20	[LoopVectorize] Ensure the VPReductionRecipe is placed after all its inputs When vectorising ordered reductions we call a function LoopVectorizationPlanner::adjustRecipesForReductions to replace the existing VPWidenRecipe for the fadd instruction with a new VPReductionRecipe. We attempt to insert the new recipe in the same place, but this is wrong because createBlockInMask may have generated new recipes that VPReductionRecipe now depends upon. I have changed the insertion code to append the recipe to the VPBasicBlock instead. Added a new RUN with tail-folding enabled to the existing test: Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll Differential Revision: https://reviews.llvm.org/D129550	2022-07-13 09:29:25 +01:00
David Sherwood	6b694d600a	[LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF When calculating the cost of Instruction::Br in getInstructionCost we query PredicatedBBsAfterVectorization to see if there is a scalar predicated block. However, this meant that the decisions being made for a given fixed-width VF were affecting the cost for a scalable VF. As a result we were returning InstructionCost::Invalid pointlessly for a scalable VF that should have a low cost. I encountered this for some loops when enabling tail-folding for scalable VFs. Test added here: Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll Differential Revision: https://reviews.llvm.org/D128272	2022-07-12 14:53:20 +01:00
Florian Hahn	5d135041c5	[LV] Move VPBlendRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-11 16:01:07 -07:00
David Sherwood	03fee6712a	[LoopVectorize] Add option to use active lane mask for loop control flow Currently, for vectorised loops that use the get.active.lane.mask intrinsic we only use the mask for predicated vector operations, such as masked loads and stores, etc. The loop itself is still controlled by comparing the canonical induction variable with the trip count. However, for some targets this is inefficient when it's cheap to use the mask itself to control the loop. This patch adds support for using the active lane mask for control flow by: 1. Generating the active lane mask for the next iteration of the vector loop, rather than the current one. If there are still any remaining iterations then at least the first bit of the mask will be set. 2. Extract the first bit of this mask and use this bit for the conditional branch. I did this by creating a new VPActiveLaneMaskPHIRecipe that sets up the initial PHI values in the vector loop pre-header. I've also made use of the new BranchOnCond VPInstruction for the final instruction in the loop region. Differential Revision: https://reviews.llvm.org/D125301	2022-07-11 13:46:55 +01:00
David Sherwood	02d6950d84	[LoopVectorize][NFC] Add optional Name parameter to VPInstruction This patch is a simple piece of refactoring that now permits users to create VPInstructions and specify the name of the value being generated. This is useful for creating more readable/meaningful names in IR. Differential Revision: https://reviews.llvm.org/D128982	2022-07-11 09:23:24 +01:00
Florian Hahn	6a4bc452f8	[LV] Move VPWidenGEPRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-10 17:10:17 -07:00
Florian Hahn	13ae213469	[LV] Move VPWidenRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-09 18:46:57 -07:00
Florian Hahn	0c27b38849	[VPlan] Move VPWidenSelectRecipe::execute to VPlanRecipes.cpp (NFC). Depends on D127968. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127970	2022-07-08 09:35:23 -07:00
Craig Topper	0266773464	[SLP] Add missing space to optimization remark. Reviewed By: vporpo Differential Revision: https://reviews.llvm.org/D129330	2022-07-07 23:29:11 -07:00
Florian Hahn	bc19b7c3cc	[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE. Now that removeDeadRecipes can remove most dead recipes across a whole VPlan, there is no need to first collect some dead instructions. Instead removeDeadRecipes can simply clean them up. Depends D127580. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128408	2022-07-07 08:40:27 -07:00
Sander de Smalen	519d7876cb	[VectorCombine] Avoid creating shuffle for extract-extract pattern on scalable vector. This addresses https://github.com/llvm/llvm-project/issues/56377 Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D129136	2022-07-07 08:37:04 +00:00
Florian Hahn	17d48c3169	[VPlan] Move remove dead recipes before merging regions. This can enable additional region merging, while not losing opportunities as region merging does not produce dead recipes. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128831	2022-07-06 20:38:38 -07:00
Nikita Popov	1ed8b29302	[LoopVectorizationLegality] Drop unused variable (NFC)	2022-07-06 10:43:39 +02:00
Nikita Popov	8ee913d83b	[IR] Remove Constant::canTrap() (NFC) As integer div/rem constant expressions are no longer supported, constants can no longer trap and are always safe to speculate. Remove the Constant::canTrap() method and its usages.	2022-07-06 10:36:47 +02:00
David Green	5493f8fc59	[VectorCombine] Improve shuffle select shuffle-of-shuffles This is an extension to the code added in D123911 which added vector combine folding of shuffle-select patterns, attempting to reduce the total amount of shuffling required in patterns like: %x = shuffle %i1, %i2 %y = shuffle %i1, %i2 %a = binop %x, %y %b = binop %x, %y shuffle %a, %b, selectmask This patch extends the handling of shuffles that are dependent on one another, which can arise from the SLP vectorizer, as-in: %x = shuffle %i1, %i2 %y = shuffle %x The input shuffles can also be emitted, in which case they are treated like identity shuffles. This patch also attempts to calculate a better ordering of input shuffles, which can help getting lower cost input shuffles, pushing complex shuffles further down the tree. This is a recommit with some additional checks for supported forms and out-of-bounds mask elements, with some extra tests. Differential Revision: https://reviews.llvm.org/D128732	2022-07-05 17:16:18 +01:00
Florian Hahn	ebb78a95ce	[LV] Remove stray dbgs() call after `774fc63490`.	2022-07-05 12:58:18 +01:00
Florian Hahn	774fc63490	[LV] Consider minimum vscale assumption for RT check cost. For scalable VFs, the minimum assumed vscale needs to be included in the cost-computation, otherwise a smaller VF may be used for RT check cost computation than was used for earlier cost computations. Fixes a RISCV test failing with UBSan due to both scalar and vector loops having the same cost.	2022-07-05 09:41:58 +01:00
Nikita Popov	b69c75d53f	Revert "[VectorCombine] Improve shuffle select shuffle-of-shuffles" This reverts commit `19a1e20b8a`. Clang crashes while linking bullet from llvm-test-suite in ReleaseLTO-g cmake configuration.	2022-07-05 09:31:20 +02:00
Florian Hahn	2a82c15f63	[LV] Consider runtime checks profitable if scalar cost is zero. This fixes an UBSan failure after `644a965c1e`. When using user-provided VFs/ICs (via the force-vector-width / force-vector-interleave options) the scalar cost is zero, which would cause divide-by-zero. When forcing vectorization using the options, the cost of the runtime checks should not block vectorization.	2022-07-04 21:37:16 +01:00
Florian Hahn	9eb6572786	[LV] Add back CantReorderMemOps remark. Add back remark unintentionally dropped by `644a965c1e`. I will add a LV test separately, so we do not have to rely on a Clang test to catch this.	2022-07-04 17:23:47 +01:00
Peter Waller	c146af3f46	[LoopVectorize][NFC] Reinstate TTICapture workaround for gcc-6 Fixes #56374.	2022-07-04 14:14:15 +00:00
Florian Hahn	644a965c1e	[LV] Vectorize cases with larger number of RT checks, execute only if profitable. This patch replaces the tight hard cut-off for the number of runtime checks with a more accurate cost-driven approach. The new approach allows vectorization with a larger number of runtime checks in general, but only executes the vector loop (and runtime checks) if considered profitable at runtime. Profitable here means that the cost-model indicates that the runtime check cost + vector loop cost < scalar loop cost. To do that, LV computes the minimum trip count for which runtime check cost + vector-loop-cost < scalar loop cost. Note that there is still a hard cut-off to avoid excessive compile-time/code-size increases, but it is much larger than the original limit. The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource is mostly neutral, but the new approach can give substantial gains in cases where we failed to vectorize before due to the over-aggressive cut-offs. On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%) and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.
```
CFP2006/447.dealII/447.dealII          -1.9%
CINT2017rate/525.x264_r/525.x264_r     -2.2%
ASC_Sequoia/IRSmk/IRSmk                -9.2%
Rodinia/srad/srad                     -36.1%
```
`size` regressions on AArch64 with -O3 are
```
MultiSource/Applications/hbd/hbd                   90256.00   106768.00  18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000       240676.00   257268.00   6.9%
MultiSourc...enchmarks/mafft/pairlocalalign       472603.00   489131.00   3.5%
External/S...2017rate/525.x264_r/525.x264_r       613831.00   630343.00   2.7%
External/S...NT2006/464.h264ref/464.h264ref       818920.00   835448.00   2.0%
External/S...te/538.imagick_r/538.imagick_r      1994730.00  2027754.00   1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4      1236471.00  1253015.00   1.3%
MultiSource/Applications/oggenc/oggenc           2108147.00  2124675.00   0.8%
External/S.../CFP2006/447.dealII/447.dealII      4742999.00  4759559.00   0.3%
External/S...rate/510.parest_r/510.parest_r     14206377.00 14239433.00   0.2%
```
Reviewed By: lebedev.ri, ebrevnov, dmgreen Differential Revision: https://reviews.llvm.org/D109368	2022-07-04 15:11:39 +01:00
David Green	2de05afc19	[SLP] Peek into loads when hitting the RecursionMaxDepth This patch slightly extends the limit on the RecursionMaxDepth inside the SLP vectorizer. It does it only when it hits a load (or zext/sext of a load), which allows it to peek through in the places where it will be the most valuable, without ballooning out the O(..) by any 2^n factors. Differential Revision: https://reviews.llvm.org/D122148	2022-07-04 14:22:50 +01:00
David Green	19a1e20b8a	[VectorCombine] Improve shuffle select shuffle-of-shuffles This is an extension to the code added in D123911 which added vector combine folding of shuffle-select patterns, attempting to reduce the total amount of shuffling required in patterns like: %x = shuffle %i1, %i2 %y = shuffle %i1, %i2 %a = binop %x, %y %b = binop %x, %y shuffle %a, %b, selectmask This patch extends the handling of shuffles that are dependent on one another, which can arise from the SLP vectorizer, as-in: %x = shuffle %i1, %i2 %y = shuffle %x The input shuffles can also be emitted, in which case they are treated like identity shuffles. This patch also attempts to calculate a better ordering of input shuffles, which can help getting lower cost input shuffles, pushing complex shuffles further down the tree. Differential Revision: https://reviews.llvm.org/D128732	2022-07-04 13:38:43 +01:00
Florian Hahn	b4694229aa	[LV] Simplify setDebugLocFromInst by using early exit (NFC). Suggested as separate improvement in D128657.	2022-07-04 09:25:26 +01:00
Florian Hahn	b0da3c6fa4	[VPlan] Move setDebugLocFromInst to VPTransformState (NFC). The moved helpers are only used for codegen. It will allow moving the remaining ::execute implementations out of LoopVectorize.cpp. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128657	2022-07-02 15:18:17 +01:00
Florian Hahn	0dddf04cab	[LV] Don't optimize exit cond during epilogue vectorization. At the moment, the same VPlan can be used for code generation of both the main vector and epilogue vector loop. This can lead to wrong results, if the plan is optimized based on the VF of the main vector loop and then re-used for the epilogue loop. One example where this is problematic is if the scalar loops need to execute at least one iteration, e.g. due to interleave groups. To prevent mis-compiles in the short-term, disable optimizing exit conditions for VPlans when using epilogue vectorization. The proper fix is to avoid re-using the same plan for both loops, which will require support for cloning plans first. Fixes #56319.	2022-07-01 13:48:38 +01:00
Florian Hahn	583abd0e36	[VPlan] Move addMetadata to VPTransformState (NFC). The moved helpers are only used for codegen. It will allow moving the remaining ::execute implementations out of LoopVectorize.cpp. Depends on D127966. Depends on D127965. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127968	2022-07-01 12:03:25 +01:00
Alexey Bataev	4be3fc35aa	[SLP][NFC]Cleanup up operands of the removed insertelements, NFC. Replace all operands of the insertelement instruction, replaced by shuffles, by poisons to avoid false-positive reports about incorrect function.	2022-06-30 17:51:43 -07:00
Florian Hahn	68884dde70	[LV] Move LoopVersioning creation to LVP::execute. At the moment LoopVersioning is only created for inner-loop vectorization. This patch moves it to LVP::execute, which means it will also be added for epilogue vectorization. As a consequence, the proper noalias metadata is now also added to epilogue vector loops. LVer will be moved to VPTransformState as follow-up. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127966	2022-06-30 12:14:32 +01:00
Florian Hahn	24b5f8e0d0	[VPlan] Make sure optimizeInductions removes wide ind from scalar plan. In some cases, there may be widened users of inductions even though the plan includes the scalar VF. In those cases, make sure we still replace the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF. Alternatively the patch could also split the range if needed. This fixes a crash exposed by D123720. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128755	2022-06-30 09:11:48 +01:00
Nikita Popov	bdba8278d9	[VectorCombine] Avoid ConstantExpr::get() (NFC) Use IRBuilder APIs instead, which will still constant fold.	2022-06-29 17:17:52 +02:00
Alexey Bataev	bf4dcbd2df	[SLP]Fix PR56251: Do not remove the reordering from the root node, being used as an operand. If the root order itself does not require reordering, we can just remove its reorder mask safely (e.g., if the root node is a vector of phis). But if this node is used as an operand in the graph, we cannot delete the reordering, need to keep it. Otherwise the graph nodes are not synchronized with the operands. It may cause an extra gather instruction(s) or a compiler crash. Also, need to be very careful when selecting the gather nodes for reordering since there might be several gather nodes with the same scalars and we can try to reorder just the same node many times instead of different nodes. Differential Revision: https://reviews.llvm.org/D128680	2022-06-28 13:42:05 -07:00
Florian Hahn	03975b7f0e	[VPlan] Move recipe implementations to separate file (NFC). This patch moves the code for recipe implementations to a separate file. The benefits are: * Keep VPlan.cpp smaller => faster compile-time during parallel builds. * Keep code for logical units together As a follow-up I am also planning on moving all ::execute implementations from LoopVectorize.cpp over to the new file, which should help to reduce the size of the file a bit. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127965	2022-06-28 10:34:30 +01:00
Guillaume Chatelet	3c126d5fe4	[Alignment] Replace commonAlignment with std::min `commonAlignment` is a shortcut to pick the smallest of two `Align` objects. As-is it doesn't bring much value compared to `std::min`. Differential Revision: https://reviews.llvm.org/D128345	2022-06-28 07:15:02 +00:00
Philip Reames	20dd3297b1	[LV] Allow scalable vectorization with vscale = 1 This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling. (I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.) This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable. This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code. In practice, this patch unblocks scalable vectorization for ELEN types on RISCV. Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1. Differential Revision: https://reviews.llvm.org/D128542	2022-06-27 13:38:57 -07:00
Kazu Hirata	a7938c74f1	[llvm] Don't use Optional::hasValue (NFC) This patch replaces Optional::hasValue with the implicit cast to bool in conditionals only.	2022-06-25 21:42:52 -07:00
Kazu Hirata	3b7c3a654c	Revert "Don't use Optional::hasValue (NFC)" This reverts commit `aa8feeefd3`.	2022-06-25 11:56:50 -07:00
Kazu Hirata	aa8feeefd3	Don't use Optional::hasValue (NFC)	2022-06-25 11:55:57 -07:00
Alexey Bataev	2faacf61a5	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-24 09:28:01 -07:00
Florian Hahn	cb69ba4faa	[LV] Create RT checks once VF/IC are selected, track scalar cost. This patch updates LV to generate runtime checks after the VF & IC are selected. It allows deciding whether to vectorize with runtime checks or not based on their cost compared to the vector loop. It also updates VectorizationFactor to include the scalar cost. Reviewed By: lebedev.ri, dmgreen Differential Revision: https://reviews.llvm.org/D75981	2022-06-24 17:42:11 +02:00
Florian Hahn	b18141a8f2	[VPlan] Set VFs included in plan before last set of VPTransforms (NFC). This allows VPlanTransforms to query the VFs included in the plan in the future.	2022-06-24 10:16:56 +02:00
Philip Reames	46ea4b5ea1	[LV] Avoid a crash when costing a uniform store which doesn't correspond to a legal scatter If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it. (Well, we could, but we haven't implemented that case.) This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.	2022-06-23 12:41:09 -07:00
Alexey Bataev	3b6edef15d	[SLP]Fix a crash when reorder masked gather nodes with reused scalars. If the masked gather nodes must be reordered, we can just reorder scalars, just like for gather nodes. But if the node contains reused scalars, it must be handled same way as a regular vectorizable node, since need to reorder reused mask, not the scalars directly. Differential Revision: https://reviews.llvm.org/D128360	2022-06-23 11:32:30 -07:00
Florian Hahn	569d84fe99	[VPlan] Remove dead recipes across whole plan. This extends removeDeadRecipe to remove recipes across the whole plan. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127580	2022-06-23 13:36:02 +02:00
Fangrui Song	1ffd2d99c2	Revert D115462 "[SLP]Improve shuffles cost estimation where possible." This reverts commit `cac60940b7`. Caused -Os -fsanitize=memory -march=haswell miscompile to pytorch/cpuinfo. See my latest comment (may update) on D115462.	2022-06-22 23:16:31 -07:00
Fangrui Song	a411bc11d6	Revert "[SLP]Fix a crash when insert subvector is out of range." This reverts commit `f1ee2738b3`. Revert due to the revert of a dependent commit `[SLP]Improve shuffles cost estimation where possible.`	2022-06-22 23:16:25 -07:00
Guillaume Chatelet	57ffff6db0	Revert "[NFC] Remove dead code" This reverts commit `8ba2cbff70`.	2022-06-22 14:55:47 +00:00
Guillaume Chatelet	8ba2cbff70	[NFC] Remove dead code	2022-06-22 13:33:58 +00:00
Serguei Katkov	8f891b7c39	[LoopVectorize] Uninitialized phi node leads to a crash in SSAUpdater. createInductionResumeValues creates a phi node placeholder without filling incoming values. Then it generates the incoming values. It includes triggering of SCEV expander which may invoke SSAUpdater. SSAUpdater has an optimization to detect number of predecessors basing on incoming values if there is phi node. In case phi node is not filled with incoming values - the number of predecessors is detected as 0 and this leads to segmentation fault. In other words SSAUpdater expects that phi is in good shape while LoopVectorizer breaks this requirement. The fix is just prepare all incoming values first and then build a phi node. Reviewed By: fhahn Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D128033	2022-06-22 10:49:27 +07:00
Vasileios Porpodas	7a9ad25769	Recommit "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6d6268dcbf`. Review: https://reviews.llvm.org/D125712	2022-06-21 18:35:29 -07:00
Vasileios Porpodas	6d6268dcbf	Revert "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6f88acf410`.	2022-06-21 17:07:21 -07:00
Vasileios Porpodas	6f88acf410	[SLP][X86] Improve reordering to consider alternate instruction bundles During the reordering transformation we should try to avoid reordering bundles like fadd,fsub because this may block them being matched into a single vector instruction in x86. We do this by checking if a TreeEntry is such a pattern and adding it to the list of TreeEntries with orders that need to be considered. Differential Revision: https://reviews.llvm.org/D125712	2022-06-21 16:44:48 -07:00
Florian Hahn	88ce403c6a	[LV] Add new block to place recurrence splice, if needed. In some cases, a recurrence splice instruction needs to be inserted between two regions, for example if the regions get re-arranged during sinking. Fixes #56146.	2022-06-21 21:54:37 +02:00
Alexey Bataev	d4ee43153d	[SLP][NFC]Fix a warning in a comparison, NFC. Fixed signedness warning.	2022-06-21 10:19:47 -07:00
Alexey Bataev	f1ee2738b3	[SLP]Fix a crash when insert subvector is out of range. If the OffsetBeg + InsertVecSz is greater than VecSz, need to estimate the cost as shuffle of 2 vector, not as insert of subvector. Otherwise, the inserted subvector is out of range and compiler may crash. Differential Revision: https://reviews.llvm.org/D128071	2022-06-21 07:16:35 -07:00
Kazu Hirata	7a47ee51a1	[llvm] Don't use Optional::getValue (NFC)	2022-06-20 22:45:45 -07:00
Kazu Hirata	e0e687a615	[llvm] Don't use Optional::hasValue (NFC)	2022-06-20 10:38:12 -07:00
Kazu Hirata	129b531c9c	[llvm] Use value_or instead of getValueOr (NFC)	2022-06-18 23:07:11 -07:00
Kazu Hirata	c399b3a608	[Vectorize] Use llvm::is_contained (NFC)	2022-06-18 15:49:15 -07:00
Malhar Jajoo	6bb40552f2	[LoopVectorize] Add support for invariant stores of ordered reductions Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D126772	2022-06-17 14:56:21 +01:00
Alexey Bataev	76782a65ee	[SLP]Use original vector if need to shuffle truncated root. If the root scalar is mapped to the smallest bit width, the vector is truncated and the types between original buildvector and extracted value mismatched. For extract, we emit sext/zext instructions, for shuffles we can reuse original vector instead of the truncated one. Differential Revision: https://reviews.llvm.org/D127974	2022-06-16 10:41:18 -07:00
Florian Hahn	949c13649c	[LV] Remove widenPHIInstruction dependence on underlying instr (NFC). Instead of using the underlying instruction and VF to get the type, use the type of the incoming value. This removes an unnecessary dependence on the underlying instruction and enables using the recipe without an underlying instruction.	2022-06-16 16:03:01 +02:00
Alexey Bataev	7236d49fd5	[SLP]Extend vectorization for scatter vectorize nodes. Currently scatter vectorize nodes can be emitted only for GEPs with constant indices. But we can also emit such nodes for GEPs with the same ptr and non-constant vectorizable/gathered indices, if profitable. Patch adds support for such nodes and tries to improve handling of GEPs with non-const indices for such nodes. Metric: SLP.NumVectorInstructions Program SLP.NumVectorInstructions results results0 diff test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5243.00 5240.00 -0.1% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5243.00 5240.00 -0.1% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00 27507.00 -0.2% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 5395.00 5380.00 -0.3% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5389.00 5374.00 -0.3% test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 961.00 958.00 -0.3% test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 961.00 958.00 -0.3% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 5664.00 5643.00 -0.4% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00 13127.00 -0.6% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 212.00 207.00 -2.4% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 890.00 850.00 -4.5% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 1695.00 1581.00 -6.7% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2338.00 2140.00 -8.5% test-suite :: SingleSource/UnitTests/matrix-types-spec.test 63.00 55.00 -12.7% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 468.00 356.00 -23.9% Geomean difference -0.3% All numbers show increased number of generated vector instructions. 
Diff: SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but need an extra analysis with LTO (with LTO the compiler generates masked_gather, while before regular loads were emitted because of extra data available at LTO time). SingleSource/UnitTests/matrix-types-spec - more vector code. MultiSource/Applications/JM/lencod/lencod - same. External/SPEC/CINT2006/464.h264ref/464.h264ref - same. MultiSource/Benchmarks/7zip/7zip-benchmark - same. External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes. External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code. External/SPEC/CFP2006/447.dealII/447.dealII - same External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same External/SPEC/CFP2017rate/511.povray_r/511.povray - same External/SPEC/CFP2006/453.povray/453.povray - same External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same Differential Revision: https://reviews.llvm.org/D127219	2022-06-16 06:05:48 -07:00
Florian Hahn	5ff5b460d9	[LV] Remove unneeded CustomBuilder arg from setDebugLocFromInst (NFC). The only user that passed in a custom builder was passing in VPTransformState::Builder, which is the same as ILV::Builder.	2022-06-15 18:48:02 +01:00
Alexey Bataev	c60c13f7eb	[SLP] Improve reordering in presence of constant only nodes. We can skip the analysis of the constant nodes, their order should not affect the ordering of the trees/subtrees. Differential Revision: https://reviews.llvm.org/D127775	2022-06-15 06:17:34 -07:00
Florian Hahn	9129e7bb54	[LV] Replace OrigPHIsToFix in native with VPlan traversal. (NFC) OrigPHIsToFix is only used in the native path. Collecting phis can be replaced by iterating over the plan. This also removes another unnecessary use of a late getVPValue. This also reduces the coupling between ILV and the VPlan utilities.	2022-06-13 22:20:58 +01:00
Hubert Tong	5efb380c26	[NFC] Undo AIX build compiler workaround Removes the workaround from https://reviews.llvm.org/D98509#2732628 for an AIX build compiler issue. The AIX build compiler product that caused the issue has since been fixed. Also, the AIX build compiler has been changed to one based on LLVM.	2022-06-13 17:00:33 -04:00
Florian Hahn	763f2bdba5	[VPlan] Remove dead OrigLoop argument from removeDeadRecipes (NFC). The use of the argument was removed a while ago. Remove the dead argument.	2022-06-11 23:36:47 +01:00
Florian Hahn	85983ca42e	[VPlan] Replace remaining use of needsScalarIV. All information is already available in VPlan. Note that there are some test changes, because we now can correctly look through instructions like truncates to analyze the actual users. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123541	2022-06-09 12:05:37 +01:00
Florian Hahn	cedfd7a2e5	Recommit "[VPlan] Remove uneeded needsVectorIV check." This reverts commit `266ea446ab`. The reasons for the revert have been addressed by cleaning up condition handling in VPlan and properly marking VPBranchOnMaskRecipe as using scalars. The test case for the revert from D123720 has been added in `3d663308a5`.	2022-06-08 14:06:45 +01:00
Florian Hahn	b0c9a71be0	[VPlan] Handle VPInst without underlying instr in VPInterleavedAccess. This violation is hidden while `cast` is missing an isa assertion after D123901.	2022-06-07 21:00:49 +01:00
David Sherwood	997ecb0036	[LoopVectorize] Add FastMathFlags to the select used for reductions with tail-folding Based on reviewer comments on https://reviews.llvm.org/D126692 I've added FastMathFlags to the select instruction used when tail-folding with reductions. These flags can then be used by InstCombine to decide upon the most optimal floating point identity value for fadd/fsub. Doing so unlocks further optimisations, such as folding selects into masked loads. Differential Revision: https://reviews.llvm.org/D126778	2022-06-07 10:21:31 +01:00
Florian Hahn	eaf48dd9b0	[VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF. Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF. This is an alternative to D121899 which simplifies the VPlan directly instead of doing so late in code-gen. The potential benefit of doing this in VPlan is that this may help cost-modeling in the future. The reason this is done in prepareToExecute at the moment is that a single plan may be used for multiple VFs/UFs. There are further simplifications that can be applied as follow ups: 1. Replace inductions with constants 2. Replace vector region with regular block. Fixes #55354. Depends on D126679. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126680	2022-06-06 09:38:53 +01:00
Florian Hahn	416a5080d8	[VPlan] Update vector latch terminator edge to exit block after execution. Instead of setting the successor to the exit using CFG.ExitBB, set it to nullptr initially. The successor to the exit block is later set either through createEmptyBasicBlock or after VPlan execution (because at the moment, no block is created by VPlan for the exit block, the existing one is reused). This also enables BranchOnCond to be used as terminator for the exiting block of the topmost vector region. Depends on D126618. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126679	2022-06-04 21:22:32 +01:00
Alexey Bataev	cac60940b7	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-03 08:06:22 -07:00
Benjamin Kramer	a8d2a381a2	[VPlan] Silence another unused variable warning in release builds	2022-06-03 14:07:56 +02:00
Benjamin Kramer	6b7c186390	[VPlan] Inline variable into assertion. NFC. Avoids a warning in release builds llvm/lib/Transforms/Vectorize/VPlanHCFGBuilder.cpp:311:14: warning: unused variable 'BrCond' [-Wunused-variable] Value *BrCond = Br->getCondition();	2022-06-03 13:59:48 +02:00
Florian Hahn	a5bb4a3b4d	[VPlan] Replace CondBit with BranchOnCond VPInstruction. This patch removes CondBit and Predicate from VPBasicBlock. To do so, the patch introduces a new branch-on-cond VPInstruction opcode to model a branch on a condition explicitly. This addresses a long-standing TODO/FIXME that blocks shouldn't be users of VPValues. Those extra users can cause issues for VPValue-based analyses that don't expect blocks. Addressing this fixme should allow us to re-introduce `266ea446ab`. The generic branch opcode can also be used in follow-up patches. Depends on D123005. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126618	2022-06-03 11:48:31 +01:00
Fangrui Song	df0f30dc36	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit `9980c99718`. Caused assertion failures: https://reviews.llvm.org/D115462#3555350	2022-06-03 00:30:34 -07:00
Alexey Bataev	9980c99718	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-02 11:18:14 -07:00
Florian Hahn	4f1c86e3d5	[VPlan] Remove dead VPlan-native special case from BranchOnCount (NFC). After `05776122b6` this special case doesn't exist any longer.	2022-06-02 12:07:54 +01:00
Alexey Bataev	73020b4540	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit `fd5a6ce9dc` to fix a crash detected by a buildbot https://lab.llvm.org/buildbot/#/builders/179/builds/3805/steps/11/logs/stdio.	2022-06-01 15:44:51 -07:00
Florian Hahn	08482830eb	[LV] Update var name to Exiting, in line with terminology (NFC) Recently the terminology used has been changed from Exit->Exiting in line with common LLVM loop terminology. Update a remaining use of the old terminology.	2022-06-01 22:13:29 +01:00
Alexey Bataev	fd5a6ce9dc	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-01 11:01:37 -07:00
Alexey Bataev	fe4949942d	[SLP]Fix PR55796: insert point for extractelements from different basic blocks. Extractelement instructions may come from different basic blocks, need to take it into account when looking for a last instruction in the bundle to prevent compiler crash. Differential Revision: https://reviews.llvm.org/D126777	2022-06-01 09:44:53 -07:00
Florian Hahn	05776122b6	[VPlan] Use region for each loop in native path. This patch updates the VPlan native path to use VPRegionBlocks for all loops in a loop nest. Up to now, only the outermost loop used a region. This is a step towards unifying both paths and keep things consistent between them. It also prepares various code-gen parts for modeling the pre-header in the inner loop vectorizer (D121624). Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123005	2022-06-01 10:41:05 +01:00
Florian Hahn	d157019482	[VPlan] Remove unused native utilities incompatible with nested regions. The implementations of VPlanDominatorTree, VPlanLoopInfo and VPlanPredicator are all incompatible with modeling loops in VPlans as region without explicit back-edges. Those pieces are not actively used and only exercised by a few gtest unit tests. They are at the moment blocking progress towards unifying the native and inner-loop vectorizer paths in D121624 and D123005. I think we should not block forward progress on unused pieces of code, so this patch removes the utilities for now. The plan is to re-introduce them as needed in a way that is compatible with the unified VPlan scheme used in both the inner loop vectorizer and the native path. Reviewed By: sguggill Differential Revision: https://reviews.llvm.org/D123017	2022-06-01 09:32:59 +01:00
Mel Chen	b0fc765350	[NFC] Change LoopVectorizationCostModel::useOrderedReductions() to be a const function. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D126200	2022-05-31 05:39:13 -07:00
Florian Hahn	6abce17fc2	[VPlan] Use Exiting-block instead of Exit-block terminology (NFC). In LLVM's common loop terminology, an exit block is a block outside a loop with a predecessor inside the loop. An exiting block is a block inside the loop which branches to an exit block outside the loop. This patch updates a few places where VPlan was using ExitBlock for a block exiting a region. Those instances have been updated to use ExitingBlock. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126173	2022-05-28 21:16:05 +01:00
Alexey Bataev	7b809c30b9	[SLP]Improve compile time, NFC. Patch improves compile time. For function calls that cannot be vectorized, create a unique group for each such call instead of a subgroup. This prevents them from being grouped into subgroups and avoids attempts to vectorize them. Also, look through cast operands to check their groups/subgroups. Reduces number of vectorization attempts. No changes in the statistics for SPEC2017/2006/llvm-test-suite. Differential Revision: https://reviews.llvm.org/D126476	2022-05-26 08:40:59 -07:00
Alexey Bataev	120d52b0ef	[SLP]Fix PR55653: emit undefs where required, not poison. Need to handle a corner case correctly, if all elements are Undefs/Poisons, need to emit actual values, not just poisons. Differential Revision: https://reviews.llvm.org/D126298	2022-05-26 08:38:50 -07:00
Simon Pilgrim	14258d6fb5	[SLP] Move canVectorizeLoads implementation to simplify the diff in D105986. NFC.	2022-05-26 15:23:58 +01:00
Alexey Bataev	9139d484d4	[SLP]Fix crash on reordering of ScatterVectorize nodes. ScatterVectorize nodes should be handled the same way as gathers in the reorderBottomToTop function, since we can simply reorder the loads in this node. Such nodes therefore need to be included in the list of gathered nodes to fix the compiler crash. Differential Revision: https://reviews.llvm.org/D126378	2022-05-26 06:25:58 -07:00
Florian Hahn	390c0ac28d	[LV] Fix indentation in tryToCreateWidenRecipe (NFC).	2022-05-26 08:53:34 +01:00
Alexey Bataev	3bf5c2c8ec	[SLP]Do not try to generate ScatterVectorize if it will be scalarized. SLP should build ScatterVectorize nodes only if they actually end up with masked gather rather than with scalarization. In the second scenario better to build a gather node. Differential Revision: https://reviews.llvm.org/D126379	2022-05-25 14:25:07 -07:00
Alexey Bataev	10f41a2147	[SLP]Fix PR55688: Miscompile due to incorrect nuw/nsw handling. Need to use all ReductionOps when propagating flags for the reduction ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw flags. Differential Revision: https://reviews.llvm.org/D126371	2022-05-25 13:59:06 -07:00
David Sherwood	87936c7b13	[LoopVectorize] Fix assertion failure in fixReduction when tail-folding When compiling the attached new test in scalable-reductions-tf.ll we were hitting this assertion in fixReduction: Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select" The loop contains a reduction and an intermediate store of the reduction value. When vectorising with tail-folding the contents of 'U' in the assertion above happened to be a scatter_store. It turns out that we were still creating a widen recipe for the invariant store, despite knowing that we can actually sink it. The simplest fix is to change buildVPlanWithVPRecipes so that we look for invariant stores before attempting to widen them. Differential Revision: https://reviews.llvm.org/D126295	2022-05-25 11:46:32 +01:00
Florian Hahn	c6e45ea074	[VPlan] Exit earlier when trying to widen with scalar VFs. This simplifies the code a bit, suggested in D124718. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D125029	2022-05-25 11:05:23 +01:00
Florian Hahn	1ba42dd04b	[VPlan] Use MapVector for LiveOuts for deterministic iteration. During code-gen, we iterate over the LiveOuts and the differences in iteration order can cause slightly different outputs.	2022-05-25 09:30:02 +01:00
Vasileios Porpodas	9df0568b07	[SLP] Fix crash caused by reorderBottomToTop(). The crash is caused by incorrect order set by reorderBottomToTop(), which happens when it is reordering a TreeEntry which has a user that has already been reordered earlier. Please see the detailed description in the lit test. Differential Revision: https://reviews.llvm.org/D126099	2022-05-24 12:24:19 -07:00
Alexey Bataev	f9c806ae5c	[SLP][NFC]Make isFirstInsertElement a strict weak ordering comparator. To be used correctly in a sort-like function, the isFirstInsertElement function must follow the strict weak ordering rule, i.e. isFirstInsertElement(IE1, IE1) should return false.	2022-05-24 06:02:42 -07:00
Alexey Bataev	319a722f6f	[SLP][NFC]Improve compile time, NFC. Builds the UserIgnore list only once as a SmallDenseSet without rebuilding it between the runs, iterates over gathers instead of the list of reduction ops, and does some checks in buildTree_rec only if the corresponding containers are not empty.	2022-05-23 12:15:27 -07:00
Benjamin Kramer	2f2ca30d0a	Fix an unused variable warning in no-asserts build mode	2022-05-23 19:53:40 +02:00
Jingu Kang	bb82f74612	Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"" This reverts commit `42ebfa8269`. The commit from https://reviews.llvm.org/D125918 has fixed the stage 2 build failure. Differential Revision: https://reviews.llvm.org/D118979	2022-05-23 16:15:45 +01:00
Alexey Bataev	2ac5ebedea	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it reduced the number of re-vectorization attempts (where we could try to vectorize buildvector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-23 07:06:45 -07:00
Peter Waller	ade47bdc31	[LV] Improve register pressure estimate at high VFs Previously, `getRegUsageForType` was implemented using `getTypeLegalizationCost`. `getRegUsageForType` is used by the loop vectorizer to estimate the register pressure caused by using a vector type. However, `getTypeLegalizationCost` currently only appears to understand splitting and not scalarization, so significantly underestimates the register requirements. Instead, use `getNumRegisters`, which understands when scalarization can occur (via computeRegisterProperties). This was discovered while investigating D118979 (Set maximum VF with shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the loop vectorizer previously ends up costing a v128i1 as 2 v64i* registers where it actually occupies 128 i32 registers. I'm sending this patch early for comment, I'm still doing some sanity checking with LNT. I note that getRegisterClassForType appears to return VectorRC even though the types in question (large vNi1 types) end up occupying scalar registers. That might be worth fixing too. Differential Revision: https://reviews.llvm.org/D125918	2022-05-23 07:57:45 +00:00
Florian Hahn	145fe57106	[LV] Use exiting block instead of latch in addUsersInExitBlock. The latch may not be the exiting block. Use the exiting block instead when looking up the incoming value of the LCSSA phi node. This fixes a crash with early-exit loops.	2022-05-22 18:27:41 +01:00
Florian Hahn	97590baead	[LV] Widen ptr-inductions with scalar uses for scalable VFs. Current codegen only supports scalarization of pointer inductions for scalable VFs if they are uniform. After `3bebec659` we now may enter the scalarization code path in VPWidenPointerInductionRecipe::execute for scalable vectors. Fall back to widening for scalable vectors if necessary. This should fix a build failure when bootstrapping LLVM with SVE, e.g. https://lab.llvm.org/buildbot/#/builders/176/builds/1723	2022-05-22 16:24:13 +01:00
Florian Hahn	aeb19817d6	Revert "[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly." This reverts commit `fc9c59c355`. The patch triggers an assertion when building SPEC on X86. Reduced reproducer shared at D107966. Also reverts follow-up commit `11a09af76d`.	2022-05-21 21:00:01 +01:00
Florian Hahn	3bebec6592	[VPlan] Model first exit values using VPLiveOut. This patch introduces a new VPLiveOut subclass of VPUser to model exit values explicitly. The initial version handles exit values that are neither part of induction or reduction chains nor first order recurrence phis. Fixes #51366, #54867, #55167, #55459 Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123537	2022-05-21 16:01:38 +01:00
Dmitri Gribenko	11a09af76d	Fix an unused variable warning in no-asserts build mode	2022-05-20 17:11:58 +02:00
Alexey Bataev	fc9c59c355	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it reduced the number of re-vectorization attempts (where we could try to vectorize buildvector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-20 05:58:09 -07:00
Alexey Bataev	4e271fc495	[SLP][NFC]Use SmallPtrSet to avoid n*m complexity, NFC.	2022-05-20 05:56:43 -07:00
Florian Hahn	cd61d4bd2f	[LV] Do not LoopSimplify/LCSSA after generating main vector loop. At the moment LV runs LoopSimplify and reconstructs LCSSA form after generating the main vector loop and before generating the epilogue vector loop. In practice, this adds a new exit block for the scalar loop because the middle block now also branches to the original exit block of the scalar loop. It also requires adding a new LCSSA phi in the newly created exit block. This complicates things when modeling exit values in VPlan, because we would need to update the VPlan for the epilogue loop to update the newly created LCSSA phi node. But none of that should be necessary, as all analysis requiring loop-simplify form is already done at this point and LCSSA form of the original loop is not broken. Reviewed By: bmahjour Differential Revision: https://reviews.llvm.org/D125810	2022-05-20 09:58:40 +01:00
Florian Hahn	c90235f0ef	[LV] Drop wrap flags for reductions using VP def-use chain. Update clearReductionWrapFlags to use the VPlan def-use chain from the reduction phi recipe to drop reduction wrap flags. This addresses an existing FIXME and fixes a crash when instructions in the reduction chain are not used and have been removed before VPlan code generation. Fixes #55540.	2022-05-19 20:36:46 +01:00
Tiehu Zhang	3ed9f603fd	[LoopVectorize] Don't interleave when the number of runtime checks exceeds the threshold The runtime check threshold should also restrict interleave count. Otherwise, too many runtime checks will be generated for some cases. Reviewed By: fhahn, dmgreen Differential Revision: https://reviews.llvm.org/D122126	2022-05-19 23:29:00 +08:00
Florian Hahn	df56fb44f5	[VPlan] Update VPWidenMemoryInstruction to not inherit from VPValue. VPWidenMemoryInstruction also models stores which may not produce a value. This can trip over analyses. Improve the modeling by only adding VPValues for VPWidenMemoryInstructionRecipes modeling loads.	2022-05-19 16:24:58 +01:00
Jay Foad	6bec3e9303	[APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf Most clients only used these methods because they wanted to be able to extend or truncate to the same bit width (which is a no-op). Now that the standard zext, sext and trunc allow this, there is no reason to use the OrSelf versions. The OrSelf versions additionally have the strange behaviour of allowing extending to a smaller width, or truncating to a larger width, which are also treated as no-ops. A small amount of client code relied on this (ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and needed rewriting. Differential Revision: https://reviews.llvm.org/D125557	2022-05-19 11:23:13 +01:00
lizhijin	90ea81fcb2	[LV] Widen freeze instead of scalarizing it This patch changes the strategy for vectorizing the freeze instruction, from replicating multiple times to widening according to the selected VF. Fixes #54992 Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D125016	2022-05-19 12:28:01 +08:00
Alexey Bataev	7d8060bc19	[SLP]Improve reductions vectorization. The pattern matching and vectorization for reductions was not very effective. Some of the possible reduction values were marked as external arguments, SLP could not find some reduction patterns because of too early attempt to vectorize pair of binops arguments, the cost of consts reductions was not correct. Patch addresses these issues and improves the analysis/cost estimation and vectorization of the reductions. The most significant changes in SLP.NumVectorInstructions: Metric: SLP.NumVectorInstructions [140/14396] Program results results0 diff test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 920.00 3548.00 285.7% test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 66.00 122.00 84.8% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 100.00 128.00 28.0% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 592.00 687.00 16.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 402.00 426.00 6.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1665.00 1745.00 4.8% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 135.00 139.00 3.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 388.00 397.00 2.3% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 895.00 914.00 2.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 240.00 244.00 1.7% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 240.00 244.00 1.7% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 820.00 832.00 1.5% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 820.00 832.00 1.5% test-suite :: 
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00 0.7% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 8125.00 8183.00 0.7% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1330.00 1338.00 0.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1330.00 1338.00 0.6% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9832.00 9880.00 0.5% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5267.00 5291.00 0.5% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 4018.00 4024.00 0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 4018.00 4024.00 0.1% test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 426.00 424.00 -0.5% test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 426.00 424.00 -0.5% test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 201.00 192.00 -4.5% test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 201.00 192.00 -4.5% 644.nab_s and 544.nab_r - reduced number of shuffles but increased number of useful vectorized instructions. 641.leela_s and 541.leela_r - the function `@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore but its body gets vectorized successfully. Before, the function was inlined twice and vectorized just after inlining, currently it is not required. The vector code looks pretty similar, just like as it was before. Differential Revision: https://reviews.llvm.org/D111574	2022-05-18 13:22:18 -07:00
Florian Hahn	fcfb86483b	[LV] set Header earlier, use variable instead of repeated access (NFC).	2022-05-18 09:29:59 +01:00
Florian Hahn	5b00d13c00	[LV] Fetch vector loop region once and remember it (NFC). This avoids an unnecessary lookup and makes the code slightly more compact.	2022-05-17 15:57:23 +01:00
Alexey Bataev	b0f0313feb	[SLP]Add an extra check for select minmax reduction to avoid crash. Need to check if the reduction is still (not)cmp-select pattern min/max reduction to avoid compiler crash during building list of reduction operations. cmp-sel pattern provides 2 reduction operations, while intrinsics - just one.	2022-05-17 06:05:52 -07:00
Florian Hahn	c1a9d14982	[VPlan] Move usesScalars/onlyFirstLaneUsed to VPUser. Those helpers model properties of a user and they should also be available to non-recipe users. This will be used in D123537 for a new exit value user. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D124936	2022-05-17 11:20:06 +01:00
Alexey Bataev	152072801e	[SLP]Check if the root of the buildvector has one use only. The root of the buildvector can have only one use, otherwise it can be treated only as a final element of the previous buildvector sequence.	2022-05-16 07:30:36 -07:00
Florian Hahn	b7315ffc3c	[LAA,LV] Add initial support for pointer-diff memory checks. This patch adds initial support for a pointer diff based runtime check scheme for vectorization. This scheme requires fewer computations and checks than the existing full overlap checking, if it is applicable. The main idea is to only check if source and sink of a dependency are far enough apart so the accesses won't overlap in the vector loop. To do so, it is sufficient to compute the difference and compare it to `VF * UF * AccessSize`. It is sufficient to check `(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards dependence in the vector loop with the given VF and UF. If Src >=u Sink, there is no dependence preventing vectorization, hence the overflow should not matter and using the ULT should be sufficient. Note that the initial version is restricted in multiple ways: 1. Pointers must only either be read or written, by a single instruction (this allows re-constructing source/sink for dependences with the available information) 2. Source and sink pointers must be add-recs, with matching steps 3. The step must be a constant. 4. abs(step) == AccessSize. Most of those restrictions can be relaxed in the future. See https://github.com/llvm/llvm-project/issues/53590. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D119078	2022-05-16 15:27:22 +01:00
David Sherwood	befc952045	[LoopVectorize] Permit tail-folding for low trip counts using scalable vectors When the loop vectoriser encounters a known low trip count it tries to create a single predicated loop in order to get the benefit of vectorisation and eliminate the scalar tail. However, until now the vectoriser prevented the use of scalable vectors in this case due to concerns in the past about stability. I believe that tail-folded loops using scalable vectors are now sufficiently well tested that we can enable this. For the same reason I've also enabled it when optimising for code size too. Tests added here: Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll Transforms/LoopVectorize/RISCV/low-trip-count.ll Differential Revision: https://reviews.llvm.org/D121595	2022-05-16 09:14:24 +01:00
Florian Hahn	8b7c3d2179	[LV] Set SCEVCheckCond to nullptr whenever it was used. Under some circumstances, SCEVExpander will insert new instructions when expanding a predicate, but the final result of the expansion can be a false constant. In those cases, the expanded instructions may later be used by other expansions, e.g. the trip count. This may trigger an assertion during SCEVExpander cleanup. To avoid this, always mark the result as used. Fixes #55100.	2022-05-15 21:52:07 +01:00
Craig Topper	b3097eb6cd	[SLP] Fix misspelling of 'analyzed'. NFC	2022-05-15 10:30:24 -07:00
Florian Hahn	39552964e1	[VPlan] Improve printing of VPReplicateRecipe with calls. Suggested as part of D124718.	2022-05-15 15:51:26 +01:00
Alexey Bataev	8b8281f354	[SLP]Do not vectorize non-profitable alternate nodes. If alternate node has only 2 instructions and the tree is already big enough, better to skip the vectorization of such nodes, they are not very profitable (the resulting code contains 3 instructions instead of original 2 scalars). SLP can try to vectorize the buildvector sequence in the next attempt, if it is profitable. Metric: SLP.NumVectorInstructions Program SLP.NumVectorInstructions results results0 diff test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 72.00 73.00 1.4% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 1186.00 1198.00 1.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 241.00 242.00 0.4% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2131.00 2139.00 0.4% test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 6377.00 6384.00 0.1% test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 6377.00 6384.00 0.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00 12658.00 0.1% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00 26147.00 -0.1% test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test 99.00 86.00 -13.1% Gains: 526.blender_r - more vectorized trees. enc-3des - same. Others: 510.parest_r - no changes. miniFE - same 623.xalancbmk_s - some (non-profitable) parts of the trees are not vectorized. 523.xalancbmk_r - same lencod - same timberwolfmc - same miniAMR - same Differential Revision: https://reviews.llvm.org/D125571	2022-05-13 14:28:54 -07:00
Alexey Bataev	85f6b15ee5	[SLP]Do not look for buildvector sequence, if the index is reused. If the insert index was used already or is not constant, we should stop looking for a unique buildvector sequence; it must be split into 2 different buildvectors.	2022-05-13 13:56:02 -07:00
David Sherwood	92c645b5c1	[LoopVectorize] Add overflow checks when tail-folding with scalable vectors In InnerLoopVectorizer::getOrCreateVectorTripCount there is an assert that the known minimum value for the VF is a power of 2 when tail-folding is enabled. However, for scalable vectors the value of vscale may not be a power of 2, which means we have to worry about the possibility of overflow. I have solved this problem by adding preheader checks that prevent us from entering the vector body if the canonical IV would overflow, i.e. if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ... Differential Revision: https://reviews.llvm.org/D125235	2022-05-13 14:09:43 +01:00
Nikita Popov	ed1cb01baf	[IRBuilder] Add IsInBounds parameter to CreateGEP() We commonly want to create either an inbounds or non-inbounds GEP based on a boolean value, e.g. when preserving inbounds from existing GEPs. Directly accept such a boolean in the API, rather than requiring a ternary between CreateGEP and CreateInBoundsGEP. This change is not entirely NFC, because we now preserve an inbounds flag in a constant expression edge-case in InstCombine.	2022-05-13 14:30:55 +02:00
Vasileios Porpodas	0950d4060c	Recommit "[SLP] Make reordering aware of external vectorizable scalar stores." This reverts commit `c2a7904aba`. Original code review: https://reviews.llvm.org/D125111	2022-05-11 16:47:29 -07:00
Arthur Eubanks	c2a7904aba	Revert "[SLP] Make reordering aware of external vectorizable scalar stores." This reverts commit `71bcead98b`. Causes crashes, see comments in D125111.	2022-05-11 15:28:00 -07:00
Alexey Bataev	f5d45d70a5	[SLP]Further improvement of the cost model for scalars used in buildvectors. Further improvement of the cost model for the scalars used in buildvectors sequences. The main functionality is outlined into a separate function. The cost is calculated in the following way: 1. If the Base vector is not undef vector, resizing the very first mask to have common VF and perform action for 2 input vectors (including non-undef Base). Other shuffle masks are combined with the result of the first stage and processed as a shuffle of 2 elements. 2. If the Base is undef vector and has only 1 shuffle mask, perform the action only for 1 vector with the given mask, if it is not the identity mask. 3. If > 2 masks are used, perform a series of shuffle actions for 2 vectors, combining the masks properly between the steps. The original implementation misses the very first analysis for the Base vector, so the cost might be too optimistic in some cases. But it improves the cost for the insertelements which are part of the current SLP graph. Part of D107966. Differential Revision: https://reviews.llvm.org/D115750	2022-05-11 06:08:55 -07:00
Florian Hahn	635b752211	[VPlan] VPInterleaveRecipe only uses first lane if op not stored. With opaque pointers, both the stored value and the address can be the same. Only consider the recipe using the first lane only if the address is not stored. Fixes #55375.	2022-05-11 11:24:56 +01:00
Vasileios Porpodas	71bcead98b	[SLP] Make reordering aware of external vectorizable scalar stores. The current reordering scheme only checks the ordering of in-tree operands. There are some cases, however, where we need to adjust the ordering based on the ordering of a future SLP-tree whose instructions are not part of the current tree, but are external users. This patch is a simple implementation of this. We keep track of scalar stores that are users of TreeEntries and if they look profitable to vectorize, then we keep track of their ordering. During the reordering step we take this new index order into account. This can remove some shuffles in cases like in the lit test. Differential Revision: https://reviews.llvm.org/D125111	2022-05-10 15:25:35 -07:00
Alexey Bataev	4212ef8a0e	Revert "[SLP]Further improvement of the cost model for scalars used in buildvectors." This reverts commit `99f31acfce` and several others to fix detected crashes, reported in https://reviews.llvm.org/D115750	2022-05-09 13:46:06 -07:00
Alexey Bataev	cce80bd8b7	[SLP]Adjust assertion check for scalars in several insertelements. If the same scalar is inserted several times into the same buildvector, the mask index can be used already. In this case need to check, that this scalar is already part of the vectorized buildvector.	2022-05-09 13:07:59 -07:00
Florian Hahn	266ea446ab	Revert "Recommit "[VPlan] Remove uneeded needsVectorIV check."" This reverts commit `8b48223447`. This triggers an assertion on a test case mentioned in D123720. Revert while I investigate.	2022-05-09 20:33:14 +01:00
Alexey Bataev	9dc4ced204	[SLP]Try partial store vectorization if supported by target. We can try to vectorize number of stores less than MinVecRegSize / scalar_value_size, if it is allowed by target. Gives an extra opportunity for the vectorization. Fixes PR54985. Differential Revision: https://reviews.llvm.org/D124284	2022-05-09 09:48:15 -07:00
Alexey Bataev	9c3a75eabf	[SLP]Fix a crash when preparing a mask for external scalars. Need to use actual index instead of the tree entry position, since the insert index may be different than 0. It means that we vectorized part of the buildvector starting from a non-initial insertelement instruction for some reason.	2022-05-09 07:59:34 -07:00
David Green	6f9e1ea0ef	[VectorCombine] Attempt to fold select shuffles from reductions Given a commutative reduction leading from a shuffle, the order of the lanes on the shuffle is not important for the result. This means we can reorder the shuffle to something simpler, which we try shuffling the first vector lanes first. This was D123494. The new shuffle may not be profitable though, and if it is not we can try the folding of select shuffles from D123911. This, with some adjustment as the output lane ordering is now unimportant, can allow the final shuffle to simplify given the inputs to the patterns from D123911. Whereas each transformation on its own is not profitable, the combination is. We can only support a single shuffle when called from reductions, but we are able to sort the ReconstructMask, potentially allowing it to simplify to an identity or concat mask. Differential Revision: https://reviews.llvm.org/D125086	2022-05-08 10:32:41 +01:00
David Green	802e15c576	[SLP] Cluster ordering for loads Given a load without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded in fewer loads, meaning less overall shuffling (for example loading v4i8 clusters of a v16i8 as a single f32 load, as opposed to multiple independent byte loads and inserts). Differential Revision: https://reviews.llvm.org/D122145	2022-05-07 14:38:11 +01:00
David Green	100cb9a2ba	[VectorCombine] Fold shuffle select pattern This patch adds a combine to attempt to reduce the costs of certain select-shuffle patterns. The form of code it attempts to detect is: %x = shuffle ... %y = shuffle ... %a = binop %x, %y %b = binop %x, %y shuffle %a, %b, selectmask A classic select-mask will pick items from each lane of a or b. These do not always have a great lowering on many architectures. This patch attempts to pack a and b into the lower elements, creating a differently ordered shuffle for reconstructing the original which may be better than the select mask. This can be better for performance, especially if fewer elements of a and b need to be computed and the input shuffles are cheaper. Because select-masks are just one form of shuffle, we generalize to any mask. So long as the backend has decent costmodel for the shuffles, this can generally improve things when they come up. For more basic cost models the folds do not appear to be profitable, not getting past the cost checks. Differential Revision: https://reviews.llvm.org/D123911	2022-05-06 08:13:18 +01:00
Florian Hahn	f9f7aa30f8	[VPlan] Remove dead code to create VPWidenPHIRecipes (NFCI). After introducing VPWidenPointerInductionRecipe, VPWidenPHIRecipes should not be created at this point. Turn check into an assert.	2022-05-05 19:29:02 +01:00
Alexey Bataev	99f31acfce	[SLP]Further improvement of the cost model for scalars used in buildvectors. Further improvement of the cost model for the scalars used in buildvectors sequences. The main functionality is outlined into a separate function. The cost is calculated in the following way: 1. If the Base vector is not undef vector, resizing the very first mask to have common VF and perform action for 2 input vectors (including non-undef Base). Other shuffle masks are combined with the result of the first stage and processed as a shuffle of 2 elements. 2. If the Base is undef vector and has only 1 shuffle mask, perform the action only for 1 vector with the given mask, if it is not the identity mask. 3. If > 2 masks are used, perform a series of shuffle actions for 2 vectors, combining the masks properly between the steps. The original implementation misses the very first analysis for the Base vector, so the cost might be too optimistic in some cases. But it improves the cost for the insertelements which are part of the current SLP graph. Part of D107966. Differential Revision: https://reviews.llvm.org/D115750	2022-05-05 06:04:25 -07:00
Florian Hahn	8b48223447	Recommit "[VPlan] Remove uneeded needsVectorIV check." This reverts commit `f4e1eaa375`. The patch was originally reverted because it uncovered an issue that has now been fixed in `0ef8ca6d88`.	2022-05-04 10:53:42 +01:00
serge-sans-paille	7030654296	[iwyu] Handle regressions in libLLVM header include Running iwyu-diff on LLVM codebase since `fa5a4e1b95` detected a few regressions, fixing them. Differential Revision: https://reviews.llvm.org/D124847	2022-05-04 08:32:38 +02:00
Igor Kirillov	4e5e042d9a	[LoopVectorize] Support reductions that store intermediary result Adds ability to vectorize loops containing a store to a loop-invariant address as part of a reduction that isn't converted to SSA form due to lack of aliasing info. Runtime checks are generated to ensure the store does not alias any other accesses in the loop. Ordered fadd reductions are not yet supported. Differential Revision: https://reviews.llvm.org/D110235	2022-05-03 10:12:30 +01:00
David Green	6f81903e89	[LV][SLP] Mark fptosi_sat as vectorizable This adds fptosi_sat and fptoui_sat to the list of trivially vectorizable functions, mainly so that the loop vectorizer can vectorize the instruction. Marking them as trivially vectorizable also allows them to be SLP vectorized, and Scalarized. The signature of a fptosi_sat requires two type overrides (@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first operand of the intrinsic as an overloaded (but not scalar) operand. Differential Revision: https://reviews.llvm.org/D124358	2022-05-03 09:32:34 +01:00
Alexey Bataev	e74a73782f	[SLP][NFC]Minor code changes for better readability, NFC.	2022-05-02 12:58:25 -07:00
Alexey Bataev	7ea03f0b4e	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with too many/too few uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), too many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which will highly likely be vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-05-02 12:03:58 -07:00
Florian Hahn	0ef8ca6d88	[VPlan] Do not create VPWidenCall recipes for scalar vector factors. 'Widen' recipes are only used when actual vector values are generated. Fix tryToWidenCall to not create VPWidenCallRecipes for scalar vector factors. This was exposed by D123720, because the widened recipes are considered vector users. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D124718	2022-05-02 19:40:33 +01:00
Simon Pilgrim	34f97a3709	[VectorCombine] Merge isa<>/cast<> into dyn_cast<>. NFC. We want to handle the assert in VectorCombine so avoid the repeated isa/cast code.	2022-05-01 20:09:10 +01:00
Simon Pilgrim	09761ce295	[SLPVectorizer] Remove weird unicode character from comment. NFCI. Whatever it was, Visual Assist really didn't like it....	2022-05-01 16:37:21 +01:00
Alexey Bataev	484fcb9888	[SLP][NFC]Fix a comment.	2022-04-29 09:27:13 -07:00
Anna Thomas	205246cb64	[CompileTime] [Passes] Avoid computing unnecessary analyses. NFC Similar to `c515b2f39e`, if there are no loops in the function as seen through LI, we should avoid computing the remaining expensive analyses (such as SCEV, BPI). Reordered the analyses requests and early return if there are no loops. The logic of avoiding expensive analyses is applied to LoopVectorizer, LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate on loops. This is an NFC with compile time improvement. Differential Revision: https://reviews.llvm.org/D124529	2022-04-29 10:00:06 -04:00
Ricky Zhou	24a133e16f	[LV] Rename CountRoundDown to VectorTripCount (NFC) The name CountRoundDown is potentially misleading, as the number of iterations can be rounded up when folding the tail. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D119681	2022-04-29 13:50:00 +01:00
Florian Hahn	e66127e69b	[VPlan] Simplify & adjust code as suggested in D123005. Improve code as suggested in D123005. Applied separately, because the comments were made on a diff that has not been rebased to current main.	2022-04-29 13:34:54 +01:00
David Green	7047c47918	[VecCombine] Fix sort comparator logic in foldShuffleFromReductions I think this sort comparator was overly complex, and the windows expensive check bot agreed, failing as it was not giving a strict weak ordering. Change it to use the comparison of the mask values as unsigned integers. This should sort the undef elements to the end whilst keeping X<Y otherwise.	2022-04-29 09:30:02 +01:00
Florian Hahn	f4e1eaa375	Revert "[VPlan] Remove uneeded needsVectorIV check." This reverts commit `43842b887e` while I investigate a buildbot failure. It also reverts the follow-up commit `2883de0514`.	2022-04-28 20:16:21 +01:00
David Green	ded8187e35	[VectorCombine] Try to reduce shuffle cost for commutative reduction operands Given a shuffle feeding a commutative reduction, the lane ordering of the shuffle will not alter the result. This is also true if there are a number of operations between the reduction and the shuffle, providing they only operate lane-wise. This patch searches for cases like that in Vector Combine, allowing us to check the cost of the shuffle vs an in-order identity shuffle and replace the order if possible. This only handles a single shuffle at the moment to keep things simple, and is able to ignore splats that produce results where every result is the same. This is a more powerful version of a combine that already happens in instcombine, capable of optimizing more cases by looking through more instructions and being able to cost the shuffle. Differential Revision: https://reviews.llvm.org/D123494	2022-04-28 19:46:12 +01:00
Alexey Bataev	75e1cf4a6a	[COST]Improve cost model for shuffles in SLP. Introduced masks where they are not added and improved target dependent cost models to avoid returning of the incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-28 10:04:41 -07:00
Florian Hahn	2883de0514	[VPlan] Fix comment formatting from `43842b887e`.	2022-04-28 16:31:48 +01:00
Florian Hahn	43842b887e	[VPlan] Remove uneeded needsVectorIV check. Remove one of the last remaining uses of ::needsVectorIV, preparing for its removal. Now that usesScalars is available and based on the information explicit in VPlan, there is no need to use the pre-computed needsVectorIV. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123720	2022-04-28 16:27:34 +01:00
Alexey Bataev	9861ca0c23	Revert "[COST]Improve cost model for shuffles in SLP." This reverts commit `29a470e380` to fix a crash reported in https://reviews.llvm.org/D100486#3479989.	2022-04-28 08:11:56 -07:00
Alexey Bataev	29a470e380	[COST]Improve cost model for shuffles in SLP. Introduced masks where they are not added and improved target dependent cost models to avoid returning of the incorrect cost results after adding masks. Differential Revision: https://reviews.llvm.org/D100486	2022-04-27 10:56:26 -07:00
Shilei Tian	a6b355dd31	[SLP] Fix a typo that causes redundant assertion and potential segment fault Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D124497	2022-04-27 10:07:59 -04:00
Vasileios Porpodas	fa8a9fea47	Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`" This reverts commit `6a9bbd9f20`. Code review: https://reviews.llvm.org/D124202	2022-04-26 14:02:40 -07:00
Vasileios Porpodas	6a9bbd9f20	Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`" This reverts commit `55ce296d6f`.	2022-04-26 11:25:26 -07:00
Vasileios Porpodas	55ce296d6f	[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost` Before this patch `Args` was used to pass a broadcast's arguments by SLP. This patch changes this. `Args` is now used for passing the operands of the shuffle. Differential Revision: https://reviews.llvm.org/D124202	2022-04-26 11:11:29 -07:00
Valery N Dmitriev	88b9e46fb5	[SLP] Steer for the best chance in tryToVectorize() when rooting with binary ops. tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer, specifically for binary and comparison operations. Order of making probes for various scalar pairs was defined by its implementation: the instruction operands, then climb over one operand if the instruction is its sole user and then perform same actions for another operand if previous attempts failed. Problem with this approach is that among these options we can have more than a single vectorizable tree candidate and it is not necessarily the one that is encountered first. Trying to build vectorizable tree for each possible combination for just evaluation is expensive. But we already have lookahead heuristics mechanism which we use for finding best pick among operands of commutative instructions. It calculates cumulative score for candidates in two consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among several combinations. We only try the one that looks most promising for vectorization. Additional benefit is that we reduce total number of vectorization trees built for probes because we skip those looking non-profitable early. Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo) Differential Revision: https://reviews.llvm.org/D124309	2022-04-25 12:25:33 -07:00
David Green	9727c77d58	[NFC] Rename Instrinsic to Intrinsic	2022-04-25 18:13:23 +01:00
Valery N Dmitriev	edf7bed87b	[SLP][NFC] Outline lookahead heuristics into a separate helper class. Minor refactoring to reduce size of functional change D124309: look-ahead scoring routines pulled out of VLOperands and formed new LookAheadHeuristics helper class. Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo) Differential Revision: https://reviews.llvm.org/D124313	2022-04-22 18:59:08 -07:00
Vasileios Porpodas	889588ee97	[SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`. Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`. This helps reduce the diff of a future SLP patch for AArch64.	2022-04-21 10:19:00 -07:00
Florian Hahn	bea69b232f	[VPlan] Initial modeling of middle block in VPlan. This patch extends the scope of VPlan to also include the exit (aka middle) block. For now, the exit block remains empty, but handling of exit values will subsequently be moved to VPlan, by adding recipes to model exit values in the exit block. As a first step, this will allow fixing #51366. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123457	2022-04-20 19:34:41 +01:00
Vasileios Porpodas	8d4b5e0833	[NFC][SLP] Improved description of getShallowScore() and getScoreAtLevelRec() Differential Revision: https://reviews.llvm.org/D124027	2022-04-19 12:15:36 -07:00
Florian Hahn	4026b718b8	[VPlan] Remove unused SCEV forward declaration (NFC).	2022-04-19 17:16:17 +02:00
Alexey Bataev	883571928c	Revert "[SLP]Improve reductions analysis and emission, part 1." This reverts commit `0e1f4d4d3c` to fix a crash reported in PR54976	2022-04-19 06:17:03 -07:00
Florian Hahn	a65f2730d2	[VPlan] Expand induction step in VPlan pre-header. This patch moves SCEV expansion of steps used by VPWidenIntOrFpInductionRecipes to the pre-header using VPExpandSCEVRecipe. This ensures that those steps are expanded while the CFG is in a valid state. Previously, SCEV expansion may happen during vector body code-generation, during which the CFG may be invalid, causing issues with SCEV expansion. Depends on D122095. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D122096	2022-04-19 13:06:39 +02:00
Vasileios Porpodas	b1333f03d9	Recommit "[SLP] Support internal users of splat loads" Code review: https://reviews.llvm.org/D121940 This reverts commit `359dbb0d3d`.	2022-04-18 15:58:01 -07:00
Vasileios Porpodas	359dbb0d3d	Revert "[SLP] Support internal users of splat loads" This reverts commit `f8e1337115`.	2022-04-18 12:12:34 -07:00
Vasileios Porpodas	f8e1337115	[SLP] Support internal users of splat loads Until now we would only accept a broadcast load pattern if it is only used by a single vector of instructions. This patch relaxes this, and allows for the broadcast to have more than one user vector, as long as all of its uses are internal to the SLP graph and vectorized. Differential Revision: https://reviews.llvm.org/D121940	2022-04-18 11:59:44 -07:00
Florian Hahn	73f5d7d0d6	[VPlan] Handle equal address and store ops in onlyFirstLaneDemanded. With opaque pointers, the stored value and address can be the same. Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded incorrectly considers stores with matching store and pointer operands as only demanding the first lane, causing a crash.	2022-04-15 22:53:33 +02:00
Johannes Doerfert	1fb415fee9	[AMDGPU][FIX] Proper load-store-vectorizer result with opaque pointers The original code relied on the fact that we needed a bitcast instruction (for non constant base objects). With opaque pointers there might not be a bitcast. Always check if reordering is required instead. Fixes: https://github.com/llvm/llvm-project/issues/54896 Differential Revision: https://reviews.llvm.org/D123694	2022-04-15 13:42:46 -05:00
Florian Hahn	2c14cdf831	[VPlan] Turn external defs in Value -> VPValue mapping. This addresses an existing TODO by keeping a mapping of external IR Value * definitions wrapped in VPValues for use in a VPlan. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123700	2022-04-14 12:03:09 +02:00
serge-sans-paille	fa5a4e1b95	[iwyu] Handle regressions in libLLVM header include Running iwyu-diff on LLVM codebase since `a96638e50e` detected a few regressions, fixing them.	2022-04-13 20:53:19 +02:00
Alexey Bataev	0e1f4d4d3c	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with too many/too few uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), too many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which will highly likely be vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-04-12 17:46:11 -07:00
Muhammad Omair Javaid	42ebfa8269	Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth" This reverts commit `64b6192e81`. This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage: https://lab.llvm.org/buildbot/#/builders/176/builds/1515 llvm-tblgen crashes after applying this patch.	2022-04-13 04:53:07 +05:00
Florian Hahn	5f1eb74850	[VPlan] Place VPExpandSCEVRecipe in pre-header. After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be placed there. This ensures SCEV expansion happens before modifying the CFG during VPlan execution, when CFG is incomplete. Depends on D121624. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D122095	2022-04-10 10:26:20 +02:00
Florian Hahn	256c6b0ba1	[VPlan] Model pre-header explicitly. This patch extends the scope of VPlan to also model the pre-header. The pre-header can be used to place recipes that should be code-gen'd outside the loop, like SCEV expansion. Depends on D121623. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121624	2022-04-09 14:19:47 +02:00
Florian Hahn	467dbcd9f1	[LV] Set debug loc after setting insert point. This fixes the code to actually use the location of the instruction, if available. Previously, SetInsertPoint would overwrite the insert point set from the instruction.	2022-04-08 20:34:40 +02:00
Florian Hahn	29fe998eaa	[VPlan] Preserve debug location when creating branch. Update createEmptyBasicBlock to preserve the debug location of the previous terminator.	2022-04-08 17:22:53 +02:00
Florian Hahn	4388c979da	[VPlan] Use vector.body as header name in VPlan native path. This brings the VPlan block naming in line with the naming of the generated basic blocks.	2022-04-07 10:31:12 +02:00
Jingu Kang	64b6192e81	[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth Set the maximum VF of AArch64 with 128 / the size of smallest type in loop. Differential Revision: https://reviews.llvm.org/D118979	2022-04-05 13:16:52 +01:00
Florian Hahn	1ff022e21b	[LV] Add vector.body block to parent loop during skeleton creation. When creating induction resume values, SCEV queries may rely on LoopInfo. Make sure vector.body gets added to the loop of the pre-header during skeleton construction. %vector.body will be moved to the vector preheader during VPlan execution. Fixes #54745.	2022-04-05 11:54:17 +01:00
Florian Hahn	1817c526e1	[VPlan] Update VPInterleavedAccessInfo to use getVectorLoopRegion. Update VPInterleavedAccessInfo to use the generic getVectorLoopRegion helper instead of relying on the entry block being the top-most vector loop region.	2022-04-04 10:26:39 +01:00
Florian Hahn	8cd1892725	[VPlan] Remember previous loop and reset vector loop. At the moment this is NFC, but will be needed once nested loops are also modeled as regions. Preparation for D123005.	2022-04-04 09:27:15 +01:00
Philip Reames	88de27e3fd	[LV] Handle non-integral types when considering interleave widening legality In general, anywhere we might need to insert a blind bitcast, we need to make sure the types are losslessly convertible. This fixes pr54634.	2022-04-03 20:16:20 -07:00
Florian Hahn	95b2aa511e	[VPlan] Set VPlan header block name to vector.body. This brings the VPlan block naming in line with the naming of the generated basic blocks.	2022-04-02 19:34:32 +01:00
Florian Hahn	f8101e4d68	Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts commit `14e3650f01`. The issues causing the revert were fixed independently in `a08c90a402` and `14e5f9785c`.	2022-04-01 16:53:39 +01:00
Florian Hahn	14e5f9785c	[LV] Add SCEV workaround from `80e8025` to epilogue vector code path. This was exposed by `14e3650f`. The recommit of `14e3650f` will hit the problematic code path requiring the workaround. test case that crashes without the workaround.	2022-04-01 15:14:47 +01:00
Florian Hahn	a08c90a402	[LV] Re-use TripCount from EPI.TripCount. During skeleton construction for the epilogue vector loop, generic helpers use getOrCreateTripCount, which will re-expand the trip count computation. Instead, re-use the TripCount created during main loop vectorization.	2022-04-01 13:47:34 +01:00
Florian Hahn	14e3650f01	Revert "Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"" This reverts commit `8378a71b6c`. It looks like this patch uncovered another issue, e.g. see https://lab.llvm.org/buildbot/#/builders/168/builds/5518	2022-03-31 19:00:48 +01:00
Florian Hahn	8378a71b6c	Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts the revert commit `2760cdc9c6`. This version pulls in the code to create the vector loop object in VPlan from D121624. This is needed because otherwise existing LoopInfo verification will fail, as a loop block doesn't have in-loop successors now that we do not replace the branch. Now that we do not add new loops during skeleton construction, there's also no need to verify LI there.	2022-03-31 14:48:32 +01:00
Florian Hahn	2760cdc9c6	Revert "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts commit `32bc83d11e`. This is causing bots with expensive-checks to fail. Revert while I investigate.	2022-03-31 12:32:50 +01:00
Florian Hahn	32bc83d11e	[LV] Remove unneeded createHeaderBranch.(NFCI) The only remaining use was to get the exit block of the loop. Instead of relying on the loop, use the successor of VectorHeaderBB (LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB Depends on D121621. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121623	2022-03-31 11:48:52 +01:00
Florian Hahn	2c494f0941	[VPlan] Remove unneeded Loop variable (NFC). Suggested in D121623. The remaining uses of L can be replaced, reducing the need for the variable.	2022-03-31 10:34:28 +01:00
David Green	b65267ca7b	[LV] Invalidate widening decisions after maximizing vector bandwidth When MaximizeVectorBandwidth is enabled, we can end up (via calls to collectUniformsAndScalars/setCostBasedWideningDecision through calculateRegisterUsage) making widening decisions before we have decided whether to fold the tail by masking. These decisions will be wrong if we later decide to fold the tail, for example when the trip count is very low. It will use incorrect costs for loads that should get masked, using standard memory operation costs instead. This still at the moment uses the EmulatedMaskMemRefHack costs (a bit unfortunately), but the old costs without this change were 1, leading to too optimistic vectorization. This slightly changes the way that the MaximizeVectorBandwidth option works to make it easier to test, always honouring the option if it is set. Differential Revision: https://reviews.llvm.org/D120215	2022-03-31 09:19:31 +01:00
Florian Hahn	e4543af4e6	[VPlan] Track current vector loop in VPTransformState (NFC). Instead of looking up the vector loop using the header, keep track of the current vector loop in VPTransformState. This removes the requirement for the vector header block being part of the loop up front. A follow-up patch will move the code to generate the Loop object for the vector loop to VPRegionBlock. Depends on D121619. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121621	2022-03-30 22:16:40 +01:00
Florian Hahn	e8673f2f20	[LV] Do not create separate latch block in VPlan::execute. Now that all dependencies on creating the latch block up-front have been removed, there is no need to create it early. Depends on D121618. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121619	2022-03-30 17:31:38 +01:00
Florian Hahn	8a4077fac0	[LV] Pass LoopHeaderBB directly to updateDominatorTree. (NFC) At the call site, we already know what the vector header block is. Pass it directly.	2022-03-30 13:11:20 +01:00
Florian Hahn	ecb4171dcb	[LV] Handle zero cost loops in selectInterleaveCount. In some cases, as in the added test case, we can reach selectInterleaveCount with loops that actually have a cost of 0. Unfortunately a loop cost of 0 is also used to communicate that the cost has not been computed yet. To resolve the crash, bail out if the cost remains zero after computing it. This seems like the best option, as there are multiple code paths that return a cost of 0 to force a computation in selectInterleaveCount. Computing the cost at multiple places up front there would unnecessarily complicate the logic. Fixes #54413.	2022-03-29 22:52:43 +01:00
Florian Hahn	d1d3563278	[LV] Move code to place pointer induction increment to VPlan post-processing. This patch moves the code to set the correct incoming block for the backedge value to VPlan::execute. When generating the phi node, the backedge value is temporarily added using the pre-header as incoming block. The invalid phi node will be fixed up during VPlan::execute after main VPlan code generation. At the same time, the backedge value is also moved to the latch. This change removes the requirement to create the latch block up-front for VPWidenInductionPHIRecipe::execute, which in turn will enable modeling the pre-header in VPlan. Depends on D121617. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121618	2022-03-29 20:27:59 +01:00
Philip Reames	7d6e8f2a96	[slp] Delete dead scalar instructions feeding vectorized instructions If we vectorize, e.g., a store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well. This is purely for test output quality and readability. It should have no effect in any sane pipeline. Differential Revision: https://reviews.llvm.org/D122493	2022-03-28 20:10:13 -07:00
Florian Hahn	e7bf2ea934	[LV] Move code to place induction increment to VPlan post-processing. This patch moves the code to set the correct incoming block for the backedge value to VPlan::execute. When generating the phi node, the backedge value is temporarily added using the pre-header as incoming block. The invalid phi node will be fixed up during VPlan::execute after main VPlan code generation. At the same time, the backedge value is also moved to the latch. This change removes the requirement to create the latch block up-front for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable modeling the pre-header in VPlan. As an alternative, the increment could be modeled as separate recipe, but that would require more work and a bit of redundant code, as we need to create the step-vector during VPWidenIntOrFpInductionRecipe::execute anyways, to create the values for different parts. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121617	2022-03-28 16:20:02 +01:00
Philip Reames	f80aaa675f	[SLP] Simplify eraseInstruction [NFC] This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.	2022-03-25 12:01:52 -07:00
Philip Reames	48cc9287f5	Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3) The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change. Original commit message follows: SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users. This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example: Before this patch: 704357 SLP - Number of calcDeps actions 699021 SLP - Number of schedule calls 5598 SLP - Number of ReSchedule actions 59 SLP - Number of ReScheduleOnFail actions 10084 SLP - Number of schedule resets 8523 SLP - Number of vector instructions generated After this patch: 102895 SLP - Number of calcDeps actions 161916 SLP - Number of schedule calls 5637 SLP - Number of ReSchedule actions 55 SLP - Number of ReScheduleOnFail actions 10083 SLP - Number of schedule resets 8403 SLP - Number of vector instructions generated I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore. The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass. For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch. Differential Revision: https://reviews.llvm.org/D118538	2022-03-25 10:39:23 -07:00
Philip Reames	ec858f0201	[SLP] Optimize stacksave dependence handling [NFC] After writing the commit message for 4b1bace28, I realized that the mentioned optimization was rather straightforward. We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.	2022-03-25 10:04:10 -07:00
Philip Reames	a16308c282	[SLP] Explicit track required stacksave/alloca dependency (try 3) This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algorithm. The prior attempt had failed to account for this case: %a = alloca i8 stacksave stackrestore store i8 0, i8* %a If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well defined program, and introduced a use-after-free bug. There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent reordering it above the restore. Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.	2022-03-25 10:04:10 -07:00
Florian Hahn	e47d220230	[LV] Use getVectorLoopRegion to retrieve header. (NFC) Update all places that currently assume the entry block to the plan is also the vector loop header to use getVectorLoopRegion instead. getVectorLoopRegion will keep doing the right thing when the pre-header is modeled explicitly (and becomes the new entry block in the plan).	2022-03-25 16:57:12 +00:00
Philip Reames	d9756fa723	[slp] Factor out a lambda to avoid duplicating code a third time in upcoming patch [nfc]	2022-03-25 09:02:39 -07:00
Fraser Cormack	2e44b7872b	[VectorCombine] Insert addrspacecast when crossing address space boundaries We cannot bitcast pointers across different address spaces. This was previously fixed in D89577 but then in D93229 an enhancement was added which peeks further through the pointer operand, opening up the possibility that address-space violations could be introduced. Instead of bailing as the previous fix did, simply insert an addrspacecast instruction. Reviewed By: lebedev.ri Differential Revision: https://reviews.llvm.org/D121787	2022-03-24 19:08:08 +00:00
Simon Pilgrim	597aefa89c	Fix unused variable warning by embedding inside assertion	2022-03-24 17:41:24 +00:00
Florian Hahn	46432a0088	[VPlan] Add VPWidenPointerInductionRecipe. This patch moves pointer induction handling from VPWidenPHIRecipe to its own recipe. In the process, it adds all information required to generate code for pointer inductions without relying on Legal to access the list of induction phis. Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121615	2022-03-24 14:58:45 +00:00

3480 Commits