llvm-project

Commit Graph

Author	SHA1	Message	Date
Florian Hahn	ee37ae91b6	[VPlan] Move VPBB verification to separate function (NFC).	2022-07-13 18:53:40 -07:00
Florian Hahn	6f7347b888	[LV] Use PredRecipe directly instead of getOrAddVPValue (NFC). There is no need to look up the VPValue for Instr, PredRecipe can be used directly.	2022-07-13 17:01:42 -07:00
Florian Hahn	225e3ec622	[LV] Move VPBranchOnMaskRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-13 14:39:59 -07:00
David Sherwood	307ace7f20	[LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs When vectorising ordered reductions we call a function LoopVectorizationPlanner::adjustRecipesForReductions to replace the existing VPWidenRecipe for the fadd instruction with a new VPReductionRecipe. We attempt to insert the new recipe in the same place, but this is wrong because createBlockInMask may have generated new recipes that VPReductionRecipe now depends upon. I have changed the insertion code to append the recipe to the VPBasicBlock instead. Added a new RUN with tail-folding enabled to the existing test: Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll Differential Revision: https://reviews.llvm.org/D129550	2022-07-13 09:29:25 +01:00
David Sherwood	6b694d600a	[LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF When calculating the cost of Instruction::Br in getInstructionCost we query PredicatedBBsAfterVectorization to see if there is a scalar predicated block. However, this meant that the decisions being made for a given fixed-width VF were affecting the cost for a scalable VF. As a result we were returning InstructionCost::Invalid pointlessly for a scalable VF that should have a low cost. I encountered this for some loops when enabling tail-folding for scalable VFs. Test added here: Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll Differential Revision: https://reviews.llvm.org/D128272	2022-07-12 14:53:20 +01:00
Florian Hahn	5d135041c5	[LV] Move VPBlendRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-11 16:01:07 -07:00
David Sherwood	03fee6712a	[LoopVectorize] Add option to use active lane mask for loop control flow Currently, for vectorised loops that use the get.active.lane.mask intrinsic we only use the mask for predicated vector operations, such as masked loads and stores, etc. The loop itself is still controlled by comparing the canonical induction variable with the trip count. However, for some targets this is inefficient when it's cheap to use the mask itself to control the loop. This patch adds support for using the active lane mask for control flow by: 1. Generating the active lane mask for the next iteration of the vector loop, rather than the current one. If there are still any remaining iterations then at least the first bit of the mask will be set. 2. Extract the first bit of this mask and use this bit for the conditional branch. I did this by creating a new VPActiveLaneMaskPHIRecipe that sets up the initial PHI values in the vector loop pre-header. I've also made use of the new BranchOnCond VPInstruction for the final instruction in the loop region. Differential Revision: https://reviews.llvm.org/D125301	2022-07-11 13:46:55 +01:00
David Sherwood	02d6950d84	[LoopVectorize][NFC] Add optional Name parameter to VPInstruction This patch is a simple piece of refactoring that now permits users to create VPInstructions and specify the name of the value being generated. This is useful for creating more readable/meaningful names in IR. Differential Revision: https://reviews.llvm.org/D128982	2022-07-11 09:23:24 +01:00
Florian Hahn	6a4bc452f8	[LV] Move VPWidenGEPRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-10 17:10:17 -07:00
Florian Hahn	13ae213469	[LV] Move VPWidenRecipe::execute to VPlanRecipes.cpp (NFC).	2022-07-09 18:46:57 -07:00
Florian Hahn	0c27b38849	[VPlan] Move VPWidenSelectRecipe::execute to VPlanRecipes.cpp (NFC). Depends on D127968. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127970	2022-07-08 09:35:23 -07:00
Craig Topper	0266773464	[SLP] Add missing space to optimization remark. Reviewed By: vporpo Differential Revision: https://reviews.llvm.org/D129330	2022-07-07 23:29:11 -07:00
Florian Hahn	bc19b7c3cc	[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE. Now that removeDeadRecipes can remove most dead recipes across a whole VPlan, there is no need to first collect some dead instructions. Instead removeDeadRecipes can simply clean them up. Depends D127580. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128408	2022-07-07 08:40:27 -07:00
Sander de Smalen	519d7876cb	[VectorCombine] Avoid creating shuffle for extract-extract pattern on scalable vector. This addresses https://github.com/llvm/llvm-project/issues/56377 Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D129136	2022-07-07 08:37:04 +00:00
Florian Hahn	17d48c3169	[VPlan] Move remove dead recipes before merging regions. This can enable additional region merging, while not losing opportunities as region merging does not produce dead recipes. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128831	2022-07-06 20:38:38 -07:00
Nikita Popov	1ed8b29302	[LoopVectorizationLegality] Drop unused variable (NFC)	2022-07-06 10:43:39 +02:00
Nikita Popov	8ee913d83b	[IR] Remove Constant::canTrap() (NFC) As integer div/rem constant expressions are no longer supported, constants can no longer trap and are always safe to speculate. Remove the Constant::canTrap() method and its usages.	2022-07-06 10:36:47 +02:00
David Green	5493f8fc59	[VectorCombine] Improve shuffle select shuffle-of-shuffles This in an extension to the code added in D123911 which added vector combine folding of shuffle-select patterns, attempting to reduce the total amount of shuffling required in patterns like: %x = shuffle %i1, %i2 %y = shuffle %i1, %i2 %a = binop %x, %y %b = binop %x, %y shuffle %a, %b, selectmask This patch extends the handing of shuffles that are dependent on one another, which can arise from the SLP vectorizer, as-in: %x = shuffle %i1, %i2 %y = shuffle %x The input shuffles can also be emitted, in which case they are treated like identity shuffles. This patch also attempts to calculate a better ordering of input shuffles, which can help getting lower cost input shuffles, pushing complex shuffles further down the tree. This is a recommit with some additional checks for supported forms and out-of-bounds mask elements, with some extra tests. Differential Revision: https://reviews.llvm.org/D128732	2022-07-05 17:16:18 +01:00
Florian Hahn	ebb78a95ce	[LV] Remove stray dbgs() call after `774fc63490`.	2022-07-05 12:58:18 +01:00
Florian Hahn	774fc63490	[LV] Consider minimum vscale assmuption for RT check cost. For scalable VFs, the minimum assumed vscale needs to be included in the cost-computation, otherwise a smaller VF may be used for RT check cost computation than was used for earlier cost computations. Fixes a RISCV test failing with UBSan due to both scalar and vector loops having the same cost.	2022-07-05 09:41:58 +01:00
Nikita Popov	b69c75d53f	Revert "[VectorCombine] Improve shuffle select shuffle-of-shuffles" This reverts commit `19a1e20b8a`. Clang crashes while linking bullet from llvm-test-suite in ReleaseLTO-g cmake configuration.	2022-07-05 09:31:20 +02:00
Florian Hahn	2a82c15f63	[LV] Consider runtime checks profitable if scalar cost is zero. This fixes an UBSan failure after `644a965c1e`. When using user-provided VFs/ICs (via the force-vector-width / force-vector-interleave options) the scalar cost is zero, which would cause divide-by-zero. When forcing vectorization using the options, the cost of the runtime checks should not block vectorization.	2022-07-04 21:37:16 +01:00
Florian Hahn	9eb6572786	[LV] Add back CantReorderMemOps remark. Add back remark unintentionally dropped by `644a965c1e`. I will add a LV test separately, so we do not have to rely on a Clang test to catch this.	2022-07-04 17:23:47 +01:00
Peter Waller	c146af3f46	[LoopVectorize][NFC] Reinstate TTICapture workaround for gcc-6 Fixes #56374.	2022-07-04 14:14:15 +00:00
Florian Hahn	644a965c1e	[LV] Vectorize cases with larger number of RT checks, execute only if profitable. This patch replaces the tight hard cut-off for the number of runtime checks with a more accurate cost-driven approach. The new approach allows vectorization with a larger number of runtime checks in general, but only executes the vector loop (and runtime checks) if considered profitable at runtime. Profitable here means that the cost-model indicates that the runtime check cost + vector loop cost < scalar loop cost. To do that, LV computes the minimum trip count for which runtime check cost + vector-loop-cost < scalar loop cost. Note that there is still a hard cut-off to avoid excessive compile-time/code-size increases, but it is much larger than the original limit. The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource is mostly neutral, but the new approach can give substantial gains in cases where we failed to vectorize before due to the over-aggressive cut-offs. On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%) and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system. ``` CFP2006/447.dealII/447.dealII -1.9% CINT2017rate/525.x264_r/525.x264_r -2.2% ASC_Sequoia/IRSmk/IRSmk -9.2% Rodinia/srad/srad -36.1% ``` `size` regressions on AArch64 with -O3 are ``` MultiSource/Applications/hbd/hbd 90256.00 106768.00 18.3% MultiSourc...ks/ASCI_Purple/SMG2000/smg2000 240676.00 257268.00 6.9% MultiSourc...enchmarks/mafft/pairlocalalign 472603.00 489131.00 3.5% External/S...2017rate/525.x264_r/525.x264_r 613831.00 630343.00 2.7% External/S...NT2006/464.h264ref/464.h264ref 818920.00 835448.00 2.0% External/S...te/538.imagick_r/538.imagick_r 1994730.00 2027754.00 1.7% MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 1236471.00 1253015.00 1.3% MultiSource/Applications/oggenc/oggenc 2108147.00 2124675.00 0.8% External/S.../CFP2006/447.dealII/447.dealII 4742999.00 4759559.00 0.3% External/S...rate/510.parest_r/510.parest_r 14206377.00 14239433.00 0.2% ``` Reviewed By: lebedev.ri, ebrevnov, dmgreen Differential Revision: https://reviews.llvm.org/D109368	2022-07-04 15:11:39 +01:00
David Green	2de05afc19	[SLP] Peek into loads when hitting the RecursionMaxDepth This patch slightly extends the limit on the RecursionMaxDepth inside the SLP vectorizer. It does it only when it hits a load (or zext/sext of a load), which allows it to peek through in the places where it will be the most valuable, without ballooning out the O(..) by any 2^n factors. Differential Revision: https://reviews.llvm.org/D122148	2022-07-04 14:22:50 +01:00
David Green	19a1e20b8a	[VectorCombine] Improve shuffle select shuffle-of-shuffles This in an extension to the code added in D123911 which added vector combine folding of shuffle-select patterns, attempting to reduce the total amount of shuffling required in patterns like: %x = shuffle %i1, %i2 %y = shuffle %i1, %i2 %a = binop %x, %y %b = binop %x, %y shuffle %a, %b, selectmask This patch extends the handing of shuffles that are dependent on one another, which can arise from the SLP vectorizer, as-in: %x = shuffle %i1, %i2 %y = shuffle %x The input shuffles can also be emitted, in which case they are treated like identity shuffles. This patch also attempts to calculate a better ordering of input shuffles, which can help getting lower cost input shuffles, pushing complex shuffles further down the tree. Differential Revision: https://reviews.llvm.org/D128732	2022-07-04 13:38:43 +01:00
Florian Hahn	b4694229aa	[LV] Simplify setDebugLocFromInst by using early exit (NFC). Suggested as separate improvement in D128657.	2022-07-04 09:25:26 +01:00
Florian Hahn	b0da3c6fa4	[VPlan] Move setDebugLocFromInst to VPTransformState (NFC). The moved helpers are only used for codegen. It will allow moving the remaining ::execute implementations out of LoopVectorize.cpp. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128657	2022-07-02 15:18:17 +01:00
Florian Hahn	0dddf04cab	[LV] Don't optimize exit cond during epilogue vectorization. At the moment, the same VPlan can be used code generation of both the main vector and epilogue vector loop. This can lead to wrong results, if the plan is optimized based on the VF of the main vector loop and then re-used for the epilogue loop. One example where this is problematic is if the scalar loops need to execute at least one iteration, e.g. due to interleave groups. To prevent mis-compiles in the short-term, disable optimizing exit conditions for VPlans when using epilogue vectorization. The proper fix is to avoid re-using the same plan for both loops, which will require support for cloning plans first. Fixes #56319.	2022-07-01 13:48:38 +01:00
Florian Hahn	583abd0e36	[VPlan] Move addMetadata to VPTransformState (NFC). The moved helpers are only used for codegen. It will allow moving the remaining ::execute implementations out of LoopVectorize.cpp. Depends on D127966. Depends on D127965. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127968	2022-07-01 12:03:25 +01:00
Alexey Bataev	4be3fc35aa	[SLP][NFC]Cleanup up operands of the removed insertelements, NFC. Replace all operands of the insertelement instruction, replaced by shuffles, by poisons to avoid false-positive reports about incorrect function.	2022-06-30 17:51:43 -07:00
Florian Hahn	68884dde70	[LV] Move LoopVersioning creation to LVP::execute. At the moment LoopVersioning is only created for inner-loop vectorization. This patch moves it to LVP::execute, which means it will also be added for epilogue vectorization. As a consequence, the proper noalias metadata is now also added to epilogue vector loops. LVer will be moved to VPTransformState as follow-up. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127966	2022-06-30 12:14:32 +01:00
Florian Hahn	24b5f8e0d0	[VPlan] Make sure optimizeInductions removes wide ind from scalar plan. In some cases, there may be widened users of inductions even though the plan includes the scalar VF. In those cases, make sure we still replace the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF. Alternatively the patch could also split the range if needed. This fixes a crash exposed by D123720. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D128755	2022-06-30 09:11:48 +01:00
Nikita Popov	bdba8278d9	[VectorCombine] Avoid ConstantExpr::get() (NFC) Use IRBuilder APIs instead, which will still constant fold.	2022-06-29 17:17:52 +02:00
Alexey Bataev	bf4dcbd2df	[SLP]Fix PR56251: Do not remove the reordering from the root node, being used as an operand. If the root order itself does not require reordering, we can just remove its reorder mask safely (e.g., if the root node is a vector of phis). But if this node is used as an operand in the graph, we cannot delete the reordering, need to keep it. Otherwise the graph nodes are not synchronized with the operands. It may cause an extra gather instruction(s) or a compiler crash. Also, need to be very careful when selecting the gather nodes for reordering since there might several gather nodes with the same scalars and we can try to reorder just the same node many times instead of different nodes. Differential Revision: https://reviews.llvm.org/D128680	2022-06-28 13:42:05 -07:00
Florian Hahn	03975b7f0e	[VPlan] Move recipe implementations to separate file (NFC). This patch moves the code for recipe implementations to a separate file. The benefits are: * Keep VPlan.cpp smaller => faster compile-time during parallel builds. * Keep code for logical units together As a follow-up I am also planning on moving all ::execute implemetnations from LoopVectorize.cpp over to the new file, which should help to reduce the size of the file a bit. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127965	2022-06-28 10:34:30 +01:00
Guillaume Chatelet	3c126d5fe4	[Alignment] Replace commonAlignment with std::min `commonAlignment` is a shortcut to pick the smallest of two `Align` objects. As-is it doesn't bring much value compared to `std::min`. Differential Revision: https://reviews.llvm.org/D128345	2022-06-28 07:15:02 +00:00
Philip Reames	20dd3297b1	[LV] Allow scalable vectorization with vscale = 1 This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling. (I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.) This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable. This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code. In practice, this patch unblocks scalable vectorization for ELEN types on RISCV. Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1. Differential Revision: https://reviews.llvm.org/D128542	2022-06-27 13:38:57 -07:00
Kazu Hirata	a7938c74f1	[llvm] Don't use Optional::hasValue (NFC) This patch replaces Optional::hasValue with the implicit cast to bool in conditionals only.	2022-06-25 21:42:52 -07:00
Kazu Hirata	3b7c3a654c	Revert "Don't use Optional::hasValue (NFC)" This reverts commit `aa8feeefd3`.	2022-06-25 11:56:50 -07:00
Kazu Hirata	aa8feeefd3	Don't use Optional::hasValue (NFC)	2022-06-25 11:55:57 -07:00
Alexey Bataev	2faacf61a5	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-24 09:28:01 -07:00
Florian Hahn	cb69ba4faa	[LV] Create RT checks once VF/IC are selected, track scalar cost. This patch updates LV to generate runtime after the VF & IC are selected. It allows deciding whether to vectorize with runtime checks or not based on their cost compared to the vector loop. It also updates VectorizationFactor to include the scalar cost. Reviewed By: lebedev.ri, dmgreen Differential Revision: https://reviews.llvm.org/D75981	2022-06-24 17:42:11 +02:00
Florian Hahn	b18141a8f2	[VPlan] Set VFs included in plan before last set of VPTransforms (NFC). This allows VPlanTransforms to query the VFs included in the plan in the future.	2022-06-24 10:16:56 +02:00
Philip Reames	46ea4b5ea1	[LV] Avoid a crash when costing a uniform store which doesn't correspond to a legal scatter If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it. (Well, we could, but we haven't implemented that case.) This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.	2022-06-23 12:41:09 -07:00
Alexey Bataev	3b6edef15d	[SLP]Fix a crash when reorder masked gather nodes with reused scalars. If the masked gather nodes must be reordered, we can just reorder scalars, just like for gather nodes. But if the node contains reused scalars, it must be handled same way as a regular vectorizable node, since need to reorder reused mask, not the scalars directly. Differential Revision: https://reviews.llvm.org/D128360	2022-06-23 11:32:30 -07:00
Florian Hahn	569d84fe99	[VPlan] Remove dead recipes across whole plan. This extends removeDeadRecipe to remove recipes across the whole plan. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D127580	2022-06-23 13:36:02 +02:00
Fangrui Song	1ffd2d99c2	Revert D115462 "[SLP]Improve shuffles cost estimation where possible." This reverts commit `cac60940b7`. Caused -Os -fsanitize=memory -march=haswell miscompile to pytorch/cpuinfo. See my latest comment (may update) on D115462.	2022-06-22 23:16:31 -07:00
Fangrui Song	a411bc11d6	Revert "[SLP]Fix a crash when insert subvector is out of range." This reverts commit `f1ee2738b3`. Revert due to the revert of a dependent commit `[SLP]Improve shuffles cost estimation where possible.`	2022-06-22 23:16:25 -07:00
Guillaume Chatelet	57ffff6db0	Revert "[NFC] Remove dead code" This reverts commit `8ba2cbff70`.	2022-06-22 14:55:47 +00:00
Guillaume Chatelet	8ba2cbff70	[NFC] Remove dead code	2022-06-22 13:33:58 +00:00
Serguei Katkov	8f891b7c39	[LoopVectorize] Uninitialized phi node leads to a crash in SSAUpdater. createInductionResumeValues creates a phi node placeholder without filling incoming values. Then it generates the incoming values. It includes triggering of SCEV expander which may invoke SSAUpdater. SSAUpdater has an optimization to detect number of predecessors basing on incoming values if there is phi node. In case phi node is not filled with incoming values - the number of predecessors is detected as 0 and this leads to segmentation fault. In other words SSAUpdater expects that phi is in good shape while LoopVectorizer breaks this requirement. The fix is just prepare all incoming values first and then build a phi node. Reviewed By: fhahn Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D128033	2022-06-22 10:49:27 +07:00
Vasileios Porpodas	7a9ad25769	Recommit "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6d6268dcbf`. Review: https://reviews.llvm.org/D125712	2022-06-21 18:35:29 -07:00
Vasileios Porpodas	6d6268dcbf	Revert "[SLP][X86] Improve reordering to consider alternate instruction bundles" This reverts commit `6f88acf410`.	2022-06-21 17:07:21 -07:00
Vasileios Porpodas	6f88acf410	[SLP][X86] Improve reordering to consider alternate instruction bundles During the reordering transformation we should try to avoid reordering bundles like fadd,fsub because this may block them being matched into a single vector instruction in x86. We do this by checking if a TreeEntry is such a pattern and adding it to the list of TreeEntries with orders that need to be considered. Differential Revision: https://reviews.llvm.org/D125712	2022-06-21 16:44:48 -07:00
Florian Hahn	88ce403c6a	[LV] Add new block to place recurrence splice, if needed. In some cases, a recurrence splice instructions needs to be inserted between to regions, for example if the regions get re-arranged during sinking. Fixes #56146.	2022-06-21 21:54:37 +02:00
Alexey Bataev	d4ee43153d	[SLP][NFC]Fix a warning in a comparison, NFC. Fixed signedness warning.	2022-06-21 10:19:47 -07:00
Alexey Bataev	f1ee2738b3	[SLP]Fix a crash when insert subvector is out of range. If the OffsetBeg + InsertVecSz is greater than VecSz, need to estimate the cost as shuffle of 2 vector, not as insert of subvector. Otherwise, the inserted subvector is out of range and compiler may crash. Differential Revision: https://reviews.llvm.org/D128071	2022-06-21 07:16:35 -07:00
Kazu Hirata	7a47ee51a1	[llvm] Don't use Optional::getValue (NFC)	2022-06-20 22:45:45 -07:00
Kazu Hirata	e0e687a615	[llvm] Don't use Optional::hasValue (NFC)	2022-06-20 10:38:12 -07:00
Kazu Hirata	129b531c9c	[llvm] Use value_or instead of getValueOr (NFC)	2022-06-18 23:07:11 -07:00
Kazu Hirata	c399b3a608	[Vectorize] Use llvm::is_contained (NFC)	2022-06-18 15:49:15 -07:00
Malhar Jajoo	6bb40552f2	[LoopVectorize] Add support for invariant stores of ordered reductions Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D126772	2022-06-17 14:56:21 +01:00
Alexey Bataev	76782a65ee	[SLP]Use original vector if need to shuffle truncated root. If the root scalar is mapped to to the smallest bit width, the vector is truncated and the types between original buildvector and extracted value mismatched. For extract, we emit sext/zext instructions, for shuffles we can reuse oringal vector instead of the truncated one. Differential Revision: https://reviews.llvm.org/D127974	2022-06-16 10:41:18 -07:00
Florian Hahn	949c13649c	[LV] Remove widenPHIInstruction dependence on underlying instr (NFC). Instead of using the underlying instruction and VF to get the type, use the type of the incoming value. This removes an unnecessary dependence on the underlying instruction and enables using the recipe without an underlying instruction.	2022-06-16 16:03:01 +02:00
Alexey Bataev	7236d49fd5	[SLP]Extend vectorization for scatter vectorize nodes. Currently scatter vectorize nodes can be emitted only for GEPs with constant indices. But we can also emit such nodes for GEPs with the same ptr and non-constant vectorizable/gathered indices, if profitable. Patch adds support for such nodes and tries to improve handling of GEPs with non-const indeces for such nodes. Metric: SLP.NumVectorInstructions Program SLP.NumVectorInstructions results results0 diff test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5243.00 5240.00 -0.1% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5243.00 5240.00 -0.1% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00 27507.00 -0.2% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 5395.00 5380.00 -0.3% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5389.00 5374.00 -0.3% test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 961.00 958.00 -0.3% test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 961.00 958.00 -0.3% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 5664.00 5643.00 -0.4% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00 13127.00 -0.6% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 212.00 207.00 -2.4% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 890.00 850.00 -4.5% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 1695.00 1581.00 -6.7% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2338.00 2140.00 -8.5% test-suite :: SingleSource/UnitTests/matrix-types-spec.test 63.00 55.00 -12.7% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 468.00 356.00 -23.9% Geomean difference -0.3% All numbers show increased number of generated vector instructions. Diff: SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but need an extra analysis with LTO (with LTO compiler generates masked_gather, while before regular loads were emitted because of extra data, availbale at LTO time). SingleSource/UnitTests/matrix-types-spec - more vector code. MultiSource/Applications/JM/lencod/lencod - same. External/SPEC/CINT2006/464.h264ref/464.h264ref - same. MultiSource/Benchmarks/7zip/7zip-benchmark - same. External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes. External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code. External/SPEC/CFP2006/447.dealII/447.dealII - same External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same External/SPEC/CFP2017rate/511.povray_r/511.povray - same External/SPEC/CFP2006/453.povray/453.povray - same External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same Differential Revision: https://reviews.llvm.org/D127219	2022-06-16 06:05:48 -07:00
Florian Hahn	5ff5b460d9	[LV] Remove unneeded CustomBuilder arg from setDebugLocFromInst (NFC). The only user that passed in a custom builder was passing in VPTransformState::Builder, which is the same as ILV::Builder.	2022-06-15 18:48:02 +01:00
Alexey Bataev	c60c13f7eb	[SLP] Improve reordering in presence of constant only nodes. We can skip the analysis of the constant nodes, their order should not affect the ordering of the trees/subtrees. Differential Revision: https://reviews.llvm.org/D127775	2022-06-15 06:17:34 -07:00
Florian Hahn	9129e7bb54	[LV] Replace OrigPHIsToFix in native with VPlan traversal. (NFC) OrigPHIsToFix is only used in the native path. Collecting phis can be replaced by iterating over the plan. This also removes another unnecessary use of a late getVPValue. This also reduces the coupling between ILV and the VPlan utilities.	2022-06-13 22:20:58 +01:00
Hubert Tong	5efb380c26	[NFC] Undo AIX build compiler workaround Removes the workaround from https://reviews.llvm.org/D98509#2732628 for an AIX build compiler issue. The AIX build compiler product that caused the issue has since been fixed. Also, the AIX build compiler has been changed to one based on LLVM.	2022-06-13 17:00:33 -04:00
Florian Hahn	763f2bdba5	[VPlan] Remove dead OrigLoop argument from removeDeadRecipes (NFC). The use of the argument has been remove a while ago. Remove the dead argument.	2022-06-11 23:36:47 +01:00
Florian Hahn	85983ca42e	[VPlan] Replace remaining use of needsScalarIV. All information is already available in VPlan. Note that there are some test changes, because we now can correctly look through instructions like truncates to analyze the actual users. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123541	2022-06-09 12:05:37 +01:00
Florian Hahn	cedfd7a2e5	Recommit "[VPlan] Remove uneeded needsVectorIV check." This reverts commit `266ea446ab`. The reasons for the revert have been addressed by cleaning up condition handling in VPlan and properly marking VPBranchOnMaskRecipe as using scalars. The test case for the revert from D123720 has been added in `3d663308a5`.	2022-06-08 14:06:45 +01:00
Florian Hahn	b0c9a71be0	[VPlan] Handle VPInst without underlying instr in VPInterleavedAccess. This violation is hidden while `cast` is missing an isa assertion after D123901.	2022-06-07 21:00:49 +01:00
David Sherwood	997ecb0036	[LoopVectorize] Add FastMathFlags to the select used for reductions with tail-folding Based on reviewer comments on https://reviews.llvm.org/D126692 I've added FastMathFlags to the select instruction used when tail-folding with reductions. These flags can then be used by InstCombine to decide upon the most optimal floating point identity value for fadd/fsub. Doing so unlocks further optimisations, such as folding selects into masked loads. Differential Revision: https://reviews.llvm.org/D126778	2022-06-07 10:21:31 +01:00
Florian Hahn	eaf48dd9b0	[VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF. Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF. This is an alternative to D121899 which simplifies the VPlan directly instead of doing so late in code-gen. The potential benefit of doing this in VPlan is that this may help cost-modeling in the future. The reason this is done in prepareToExecute at the moment is that a single plan may be used for multiple VFs/UFs. There are further simplifications that can be applied as follow ups: 1. Replace inductions with constants 2. Replace vector region with regular block. Fixes #55354. Depends on D126679. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126680	2022-06-06 09:38:53 +01:00
Florian Hahn	416a5080d8	[VPlan] Update vector latch terminator edge to exit block after execution. Instead of setting the successor to the exit using CFG.ExitBB, set it to nullptr initially. The successor to the exit block is later set either through createEmptyBasicBlock or after VPlan execution (because at the moment, no block is created by VPlan for the exit block, the existing one is reused). This also enables BranchOnCond to be used as terminator for the exiting block of the topmost vector region. Depends on D126618. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126679	2022-06-04 21:22:32 +01:00
Alexey Bataev	cac60940b7	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-03 08:06:22 -07:00
Benjamin Kramer	a8d2a381a2	[VPlan] Silence another unused variable warning in release builds	2022-06-03 14:07:56 +02:00
Benjamin Kramer	6b7c186390	[VPlan] Inline variable into assertion. NFC. Avoids a warning in release builds llvm/lib/Transforms/Vectorize/VPlanHCFGBuilder.cpp:311:14: warning: unused variable 'BrCond' [-Wunused-variable] Value *BrCond = Br->getCondition();	2022-06-03 13:59:48 +02:00
Florian Hahn	a5bb4a3b4d	[VPlan] Replace CondBit with BranchOnCond VPInstruction. This patch removes CondBit and Predicate from VPBasicBlock. To do so, the patch introduces a new branch-on-cond VPInstruction opcode to model a branch on a condition explicitly. This addresses a long-standing TODO/FIXME that blocks shouldn't be users of VPValues. Those extra users can cause issues for VPValue-based analyses that don't expect blocks. Addressing this fixme should allow us to re-introduce `266ea446ab`. The generic branch opcode can also be used in follow-up patches. Depends on D123005. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126618	2022-06-03 11:48:31 +01:00
Fangrui Song	df0f30dc36	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit `9980c99718`. Caused assertion failures: https://reviews.llvm.org/D115462#3555350	2022-06-03 00:30:34 -07:00
Alexey Bataev	9980c99718	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-02 11:18:14 -07:00
Florian Hahn	4f1c86e3d5	[VPlan] Remove dead VPlan-native special case from BranchOnCount (NFC). After `05776122b6` this special case doesn't exist any longer.	2022-06-02 12:07:54 +01:00
Alexey Bataev	73020b4540	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit `fd5a6ce9dc` to fix a crash detected by a buildbot https://lab.llvm.org/buildbot/#/builders/179/builds/3805/steps/11/logs/stdio.	2022-06-01 15:44:51 -07:00
Florian Hahn	08482830eb	[LV] Update var name to Exiting, in line with terminology (NFC) Recently the terminology used has been changed from Exit->Exiting in line with common LLVM loop terminology. Update a remaining use of the old terminology.	2022-06-01 22:13:29 +01:00
Alexey Bataev	fd5a6ce9dc	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-01 11:01:37 -07:00
Alexey Bataev	fe4949942d	[SLP]Fix PR55796: insert point for extractelements from different basic blocks. Extractelement instructions may come from different basic blocks, need to take it into account when looking for a last instruction in the bundle to prevent compiler crash. Differential Revision: https://reviews.llvm.org/D126777	2022-06-01 09:44:53 -07:00
Florian Hahn	05776122b6	[VPlan] Use region for each loop in native path. This patch updates the VPlan native path to use VPRegionBlocks for all loops in a loop nest. Up to now, only the outermost loop used a region. This is a step towards unifying both paths and keep things consistent between them. It also prepares various code-gen parts for modeling the pre-header in the inner loop vectorizer (D121624). Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123005	2022-06-01 10:41:05 +01:00
Florian Hahn	d157019482	[VPlan] Remove unused native utilities incompatible with nested regions. The implementations of VPlanDominatorTree, VPlanLoopInfo and VPlanPredicator are all incompatible with modeling loops in VPlans as region without explicit back-edges. Those pieces are not actively used and only exercised by a few gtest unit tests. They are at the moment blocking progress towards unifying the native and inner-loop vectorizer paths in D121624 and D123005. I think we should not block forward progress on unused pieces of code, so this patch removes the utilities for now. The plan is to re-introduce them as needed in a way that is compatible with the unified VPlan scheme used in both the inner loop vectorizer and the native path. Reviewed By: sguggill Differential Revision: https://reviews.llvm.org/D123017	2022-06-01 09:32:59 +01:00
Mel Chen	b0fc765350	[NFC] Change LoopVectorizationCostModel::useOrderedReductions() to be a const function. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D126200	2022-05-31 05:39:13 -07:00
Florian Hahn	6abce17fc2	[VPlan] Use Exiting-block instead of Exit-block terminology (NFC). In LLVM's common loop terminology, an exit block is a block outside a loop with a predecessor inside the loop. An exiting block is a block inside the loop which branches to an exit block outside the loop. This patch updates a few places where VPlan was using ExitBlock for a block exiting a region. Those instances have been updated to use ExitingBlock. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126173	2022-05-28 21:16:05 +01:00
Alexey Bataev	7b809c30b9	[SLP]Improve compile time, NFC. Patch improves compile time. For function calls, which cannot be vectorized, create a unique group for each such a call instead of subgroup. It prevents them from being grouped by a subgroups and attempts for their vectorization. Also, looks through casts operand to try to check their groups/subgroups. Reduces number of vectorization attempts. No changes in the statistics for SPEC2017/2006/llvm-test-suite. Differential Revision: https://reviews.llvm.org/D126476	2022-05-26 08:40:59 -07:00
Alexey Bataev	120d52b0ef	[SLP]Fix PR55653: emit undefs where required, not poison. Need to handle a corner case correctly, if all elements are Undefs/Poisons, need to emit actual values, not just poisons. Differential Revision: https://reviews.llvm.org/D126298	2022-05-26 08:38:50 -07:00
Simon Pilgrim	14258d6fb5	[SLP] Move canVectorizeLoads implementation to simplify the diff in D105986. NFC.	2022-05-26 15:23:58 +01:00
Alexey Bataev	9139d484d4	[SLP]Fix crash on reordering of ScatterVectorize nodes. ScatterVectorize nodes should be handled same way as gathers in reorderBottomToTop function, since we can simple reorder the loads in this node. Because of that need to include such nodes to the list of gathered nodes to fix compiler crash. Differential Revision: https://reviews.llvm.org/D126378	2022-05-26 06:25:58 -07:00
Florian Hahn	390c0ac28d	[LV] Fix indentation in tryToCreateWidenRecipe (NFC).	2022-05-26 08:53:34 +01:00
Alexey Bataev	3bf5c2c8ec	[SLP]Do not try to generate ScatterVectorize if it will be scalarized. SLP should build ScatterVectorize nodes only if they actually end up with masked gather rather than with scalarization. In the second scenario better to build a gather node. Differential Revision: https://reviews.llvm.org/D126379	2022-05-25 14:25:07 -07:00
Alexey Bataev	10f41a2147	[SLP]Fix PR55688: Miscompile due to incorrect nuw/nsw handling. Need to use all ReductionOps when propagating flags for the reduction ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw flags. Differential Revision: https://reviews.llvm.org/D126371	2022-05-25 13:59:06 -07:00
David Sherwood	87936c7b13	[LoopVectorize] Fix assertion failure in fixReduction when tail-folding When compiling the attached new test in scalable-reductions-tf.ll we were hitting this assertion in fixReduction: Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select" The loop contains a reduction and an intermediate store of the reduction value. When vectorising with tail-folding the contains of 'U' in the assertion above happened to be a scatter_store. It turns out that we were still creating a widen recipe for the invariant store, despite knowing that we can actually sink it. The simplest fix is to change buildVPlanWithVPRecipes so that we look for invariant stores before attempting to widen it. Differential Revision: https://reviews.llvm.org/D126295	2022-05-25 11:46:32 +01:00
Florian Hahn	c6e45ea074	[VPlan] Exit earlier when trying to widen with scalar VFs. This simplifies the code a bit, suggested in D124718. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D125029	2022-05-25 11:05:23 +01:00
Florian Hahn	1ba42dd04b	[VPlan] Use MapVector for LiveOuts for deterministic iteration. During code-gen, we iterate over the LiveOuts and the differences in iteration order can cause slightly different outputs.	2022-05-25 09:30:02 +01:00
Vasileios Porpodas	9df0568b07	[SLP] Fix crash caused by reorderBottomToTop(). The crash is caused by incorrect order set by reorderBottomToTop(), which happens when it is reordering a TreeEntry which has a user that has already been reordered earlier. Please see the detailed description in the lit test. Differential Revision: https://reviews.llvm.org/D126099	2022-05-24 12:24:19 -07:00
Alexey Bataev	f9c806ae5c	[SLP][NFC]Make isFirstInsertElement a weak strict ordering comparator. To be used correctly in a sort-like function, isFirstInsertElement function must follow weak strict ordering rule, i.e. isFirstInsertElement(IE1, IE1) should return false.	2022-05-24 06:02:42 -07:00
Alexey Bataev	319a722f6f	[SLP][NFC]Improve compile time, NFC. Builds UserIgnore list only once as a SmallDenseSet without rebuilding it between the runs, iterate over gathers instead list of reduction ops, do some checks in the buildTree_rec only if the corresponding containers are not empty.	2022-05-23 12:15:27 -07:00
Benjamin Kramer	2f2ca30d0a	Fix an unused variable warning in no-asserts build mode	2022-05-23 19:53:40 +02:00
Jingu Kang	bb82f74612	Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"" This reverts commit `42ebfa8269`. The commmit from https://reviews.llvm.org/D125918 has fixed the stage 2 build failure. Differential Revision: https://reviews.llvm.org/D118979	2022-05-23 16:15:45 +01:00
Alexey Bataev	2ac5ebedea	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it allowed to reduce number of re-vectorization attempts (where we could try to vectorize buildector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-23 07:06:45 -07:00
Peter Waller	ade47bdc31	[LV] Improve register pressure estimate at high VFs Previously, `getRegUsageForType` was implemented using `getTypeLegalizationCost`. `getRegUsageForType` is used by the loop vectorizer to estimate the register pressure caused by using a vector type. However, `getTypeLegalizationCost` currently only appears to understand splitting and not scalarization, so significantly underestimates the register requirements. Instead, use `getNumRegisters`, which understands when scalarization can occur (via computeRegisterProperties). This was discovered while investigating D118979 (Set maximum VF with shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the loop vectorizer previously ends up costing an v128i1 as 2 v64i* registers where it actually occupies 128 i32 registers. I'm sending this patch early for comment, I'm still doing some sanity checking with LNT. I note that getRegisterClassForType appears to return VectorRC even though the type in question (large vNi1 types) end up occupying scalar registers. That might be worth fixing too. Differential Revision: https://reviews.llvm.org/D125918	2022-05-23 07:57:45 +00:00
Florian Hahn	145fe57106	[LV] Use exiting block instead of latch in addUsersInExitBlock. The latch may not be the exiting block. Use the exiting block instead when looking up the incoming value of the LCSSA phi node. This fixes a crash with early-exit loops.	2022-05-22 18:27:41 +01:00
Florian Hahn	97590baead	[LV] Widen ptr-inductions with scalar uses for scalable VFs. Current codegen only supports scalarization of pointer inductions for scalable VFs if they are uniform. After `3bebec659` we now may enter the scalarization code path in VPWidenPointerInductionRecipe::execute for scalable vectors. Fall back to widening for scalable vectors if necessary. This should fix a build failure when bootstrapping LLVM with SVE, e.g. https://lab.llvm.org/buildbot/#/builders/176/builds/1723	2022-05-22 16:24:13 +01:00
Florian Hahn	aeb19817d6	Revert "[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly." This reverts commit `fc9c59c355`. The patch triggers an assertion when building SPEC on X86. Reduced reproducer shared at D107966. Also reverts follow-up commit `11a09af76d`.	2022-05-21 21:00:01 +01:00
Florian Hahn	3bebec6592	[VPlan] Model first exit values using VPLiveOut. This patch introduces a new VPLiveOut subclass of VPUser to model exit values explicitly. The initial version handles exit values that are neither part of induction or reduction chains nor first order recurrence phis. Fixes #51366, #54867, #55167, #55459 Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123537	2022-05-21 16:01:38 +01:00
Dmitri Gribenko	11a09af76d	Fix an unused variable warning in no-asserts build mode	2022-05-20 17:11:58 +02:00
Alexey Bataev	fc9c59c355	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it allowed to reduce number of re-vectorization attempts (where we could try to vectorize buildector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-20 05:58:09 -07:00
Alexey Bataev	4e271fc495	[SLP][NFC]Use SmallPtrSet to avoid n*m complexity, NFC.	2022-05-20 05:56:43 -07:00
Florian Hahn	cd61d4bd2f	[LV] Do not LoopSimplify/LCSSA after generating main vector loop. At the moment LV runs LoopSimplify and reconstructs LCSSA form after generating the main vector loop and before generating the epilogue vector loop. In practice, this adds a new exit block for the scalar loop because the middle block now also branches to the original exit block of the scalar loop. It also requires adding a new LCSSA phi in the newly created exit block. This complicates things when modeling exit values in VPlan, because we would need to update the VPlan for the epilogue loop to update the newly created LCSSA phi node. But none of that should be necessary, as all analysis requiring loop-simplify form is already done at this point and LCSSA form of the original loop is not broken. Reviewed By: bmahjour Differential Revision: https://reviews.llvm.org/D125810	2022-05-20 09:58:40 +01:00
Florian Hahn	c90235f0ef	[LV] Drop wrap flags for reductions using VP def-use chain. Update clearReductionWrapFlags to use the VPlan def-use chain from the reduction phi recipe to drop reduction wrap flags. This addresses an existing FIXME and fixes a crash when instructions in the reduction chain are not used and have been removed before VPlan codegeneration. Fixes #55540.	2022-05-19 20:36:46 +01:00
Tiehu Zhang	3ed9f603fd	[LoopVectorize] Don't interleave when the number of runtime checks exceeds the threshold The runtime check threshold should also restrict interleave count. Otherwise, too many runtime checks will be generated for some cases. Reviewed By: fhahn, dmgreen Differential Revision: https://reviews.llvm.org/D122126	2022-05-19 23:29:00 +08:00
Florian Hahn	df56fb44f5	[VPlan] Update VPWidenMemoryInstruction to not inherit from VPValue. VPWidenMemoryInstruction also models stores which may not produce a value. This can trip over analyses. Improve the modeling by only adding VPValues for VPWidenMemoryInstructionRecipes modeling loads.	2022-05-19 16:24:58 +01:00
Jay Foad	6bec3e9303	[APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf Most clients only used these methods because they wanted to be able to extend or truncate to the same bit width (which is a no-op). Now that the standard zext, sext and trunc allow this, there is no reason to use the OrSelf versions. The OrSelf versions additionally have the strange behaviour of allowing extending to a smaller width, or truncating to a larger width, which are also treated as no-ops. A small amount of client code relied on this (ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and needed rewriting. Differential Revision: https://reviews.llvm.org/D125557	2022-05-19 11:23:13 +01:00
lizhijin	90ea81fcb2	[LV] Widen freeze instead of scalarizing it This patch changes the strategy for vectorizing freeze instrucion, from replicating multiple times to widening according to selected VF. Fixes #54992 Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D125016	2022-05-19 12:28:01 +08:00
Alexey Bataev	7d8060bc19	[SLP]Improve reductions vectorization. The pattern matching and vectgorization for reductions was not very effective. Some of of the possible reduction values were marked as external arguments, SLP could not find some reduction patterns because of too early attempt to vectorize pair of binops arguments, the cost of consts reductions was not correct. Patch addresses these issues and improves the analysis/cost estimation and vectorization of the reductions. The most significant changes in SLP.NumVectorInstructions: Metric: SLP.NumVectorInstructions [140/14396] Program results results0 diff test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 920.00 3548.00 285.7% test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 66.00 122.00 84.8% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 100.00 128.00 28.0% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 592.00 687.00 16.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 402.00 426.00 6.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1665.00 1745.00 4.8% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 135.00 139.00 3.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 388.00 397.00 2.3% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 895.00 914.00 2.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 240.00 244.00 1.7% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 240.00 244.00 1.7% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 820.00 832.00 1.5% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 820.00 832.00 1.5% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00 0.7% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 8125.00 8183.00 0.7% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1330.00 1338.00 0.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1330.00 1338.00 0.6% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9832.00 9880.00 0.5% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5267.00 5291.00 0.5% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 4018.00 4024.00 0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 4018.00 4024.00 0.1% test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 426.00 424.00 -0.5% test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 426.00 424.00 -0.5% test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 201.00 192.00 -4.5% test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 201.00 192.00 -4.5% 644.nab_s and 544.nab_r - reduced number of shuffles but increased number of useful vectorized instructions. 641.leela_s and 541.leela_r - the function `@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore but its body gets vectorized successfully. Before, the function was inlined twice and vectorized just after inlining, currently it is not required. The vector code looks pretty similar, just like as it was before. Differential Revision: https://reviews.llvm.org/D111574	2022-05-18 13:22:18 -07:00
Florian Hahn	fcfb86483b	[LV] set Header earlier, use variable instead of repeated access (NFC).	2022-05-18 09:29:59 +01:00
Florian Hahn	5b00d13c00	[LV] Fetch vector loop region once and remember it (NFC). This avoids an unnecessary lookup and makes the code slightly more compact.	2022-05-17 15:57:23 +01:00
Alexey Bataev	b0f0313feb	[SLP]Add an extra check for select minmax reduction to avoid crash. Need to check if the reduction is still (not)cmp-select pattern min/max reduction to avoid compiler crash during building list of reduction operations. cmp-sel pattern provides 2 reduction operations, while intrinsics - just one.	2022-05-17 06:05:52 -07:00
Florian Hahn	c1a9d14982	[VPlan] Move usesScalars/onlyFirstLaneUsed to VPUser. Those helpers model properties of a user and they should also be available to non-recipe users. This will be used in D123537 for a new exit value user. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D124936	2022-05-17 11:20:06 +01:00
Alexey Bataev	152072801e	[SLP]Check if the root of the buildvector has one use only. The root of the buildvector can have only one use, otherwise it can be treated only as a final element of the previous buildvector sequence.	2022-05-16 07:30:36 -07:00
Florian Hahn	b7315ffc3c	[LAA,LV] Add initial support for pointer-diff memory checks. This patch adds initial support for a pointer diff based runtime check scheme for vectorization. This scheme requires fewer computations and checks than the existing full overlap checking, if it is applicable. The main idea is to only check if source and sink of a dependency are far enough apart so the accesses won't overlap in the vector loop. To do so, it is sufficient to compute the difference and compare it to the `VF * UF * AccessSize`. It is sufficient to check `(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards dependence in the vector loop with the given VF and UF. If Src >=u Sink, there is not dependence preventing vectorization, hence the overflow should not matter and using the ULT should be sufficient. Note that the initial version is restricted in multiple ways: 1. Pointers must only either be read or written, by a single instruction (this allows re-constructing source/sink for dependences with the available information) 2. Source and sink pointers must be add-recs, with matching steps 3. The step must be a constant. 3. abs(step) == AccessSize. Most of those restrictions can be relaxed in the future. See https://github.com/llvm/llvm-project/issues/53590. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D119078	2022-05-16 15:27:22 +01:00
David Sherwood	befc952045	[LoopVectorize] Permit tail-folding for low trip counts using scalable vectors When the loop vectoriser encounters a known low trip count it tries to create a single predicated loop in order to get the benefit of vectorisation and eliminate the scalar tail. However, until now the vectoriser prevented the use of scalable vectors in this case due to concerns in the past about stability. I believe that tail-folded loops using scalable vectors are now sufficiently well tested that we can enable this. For the same reason I've also enabled it when optimising for code size too. Tests added here: Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll Transforms/LoopVectorize/RISCV/low-trip-count.ll Differential Revision: https://reviews.llvm.org/D121595	2022-05-16 09:14:24 +01:00
Florian Hahn	8b7c3d2179	[LV] Set SCEVCheckCond to nullptr whenever it was used. Under some circumstances, SCEVExpander will insert new instructions when expanding a predicate, but the final result of the expansion can be a false constant. In those cases, the expanded instructions may later be used by other expansions, e.g. the trip count. This may trigger an assertion during SCEVExpander cleanup. To avoid this, always mark the result as used. Fixes #55100.	2022-05-15 21:52:07 +01:00
Craig Topper	b3097eb6cd	[SLP] Fix misspelling of 'analyzed'. NFC	2022-05-15 10:30:24 -07:00
Florian Hahn	39552964e1	[VPlan] Improve printing of VPReplicateRecipe with calls. Suggested as part of D124718.	2022-05-15 15:51:26 +01:00
Alexey Bataev	8b8281f354	[SLP]Do not vectorize non-profitable alternate nodes. If alternate node has only 2 instructions and the tree is already big enough, better to skip the vectorization of such nodes, they are not very profitable (the resulting code cotains 3 instructions instead of original 2 scalars). SLP can try to vectorize the buildvector sequence in the next attempt, if it is profitable. Metric: SLP.NumVectorInstructions Program SLP.NumVectorInstructions results results0 diff test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 72.00 73.00 1.4% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 1186.00 1198.00 1.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 241.00 242.00 0.4% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2131.00 2139.00 0.4% test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 6377.00 6384.00 0.1% test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 6377.00 6384.00 0.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00 12658.00 0.1% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00 26147.00 -0.1% test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test 99.00 86.00 -13.1% Gains: 526.blender_r - more vectorized trees. enc-3des - same. Others: 510.parest_r - no changes. miniFE - same 623.xalancbmk_s - some (non-profitable) parts of the trees are not vectorized. 523.xalancbmk_r - same lencod - same timberwolfmc - same miniAMR - same Differential Revision: https://reviews.llvm.org/D125571	2022-05-13 14:28:54 -07:00
Alexey Bataev	85f6b15ee5	[SLP]Do not look for buildvector sequence, if the index is reused. If the insert indes was used already or is not constant, we should stop looking for unique buildvector sequence, it mustbe splitted to 2 different buildvectors.	2022-05-13 13:56:02 -07:00
David Sherwood	92c645b5c1	[LoopVectorize] Add overflow checks when tail-folding with scalable vectors In InnerLoopVectorizer::getOrCreateVectorTripCount there is an assert that the known minimum value for the VF is a power of 2 when tail-folding is enabled. However, for scalable vectors the value of vscale may not be a power of 2, which means we have to worry about the possibility of overflow. I have solved this problem by adding preheader checks that prevent us from entering the vector body if the canonical IV would overflow, i.e. if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ... Differential Revision: https://reviews.llvm.org/D125235	2022-05-13 14:09:43 +01:00
Nikita Popov	ed1cb01baf	[IRBuilder] Add IsInBounds parameter to CreateGEP() We commonly want to create either an inbounds or non-inbounds GEP based on a boolean value, e.g. when preserving inbounds from existing GEPs. Directly accept such a boolean in the API, rather than requiring a ternary between CreateGEP and CreateInBoundsGEP. This change is not entirely NFC, because we now preserve an inbounds flag in a constant expression edge-case in InstCombine.	2022-05-13 14:30:55 +02:00
Vasileios Porpodas	0950d4060c	Recommit "[SLP] Make reordering aware of external vectorizable scalar stores." This reverts commit `c2a7904aba`. Original code review: https://reviews.llvm.org/D125111	2022-05-11 16:47:29 -07:00
Arthur Eubanks	c2a7904aba	Revert "[SLP] Make reordering aware of external vectorizable scalar stores." This reverts commit `71bcead98b`. Causes crashes, see comments in D125111.	2022-05-11 15:28:00 -07:00
Alexey Bataev	f5d45d70a5	[SLP]Further improvement of the cost model for scalars used in buildvectors. Further improvement of the cost model for the scalars used in buildvectors sequences. The main functionality is outlined into a separate function. The cost is calculated in the following way: 1. If the Base vector is not undef vector, resizing the very first mask to have common VF and perform action for 2 input vectors (including non-undef Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements. 2. If the Base is undef vector and have only 1 shuffle mask, perform the action only for 1 vector with the given mask, if it is not the identity mask. 3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors, combing the masks properly between the steps. The original implementation misses the very first analysis for the Base vector, so the cost might too optimistic in some cases. But it improves the cost for the insertelements which are part of the current SLP graph. Part of D107966. Differential Revision: https://reviews.llvm.org/D115750	2022-05-11 06:08:55 -07:00
Florian Hahn	635b752211	[VPlan] VPInterleaveRecipe only uses first lane if op not stored. With opaque pointers, both the stored value and the address can be the same. Only consider the recipe using the first lane only if the address is not stored. Fixes #55375.	2022-05-11 11:24:56 +01:00
Vasileios Porpodas	71bcead98b	[SLP] Make reordering aware of external vectorizable scalar stores. The current reordering scheme only checks the ordering of in-tree operands. There are some cases, however, where we need to adjust the ordering based on the ordering of a future SLP-tree who's instructions are not part of the current tree, but are external users. This patch is a simple implementation of this. We keep track of scalar stores that are users of TreeEntries and if they look profitable to vectorize, then we keep track of their ordering. During the reordering step we take this new index order into account. This can remove some shuffles in cases like in the lit test. Differential Revision: https://reviews.llvm.org/D125111	2022-05-10 15:25:35 -07:00
Alexey Bataev	4212ef8a0e	Revert "[SLP]Further improvement of the cost model for scalars used in buildvectors." This reverts commit `99f31acfce` and several others to fix detected crashes, reported in https://reviews.llvm.org/D115750	2022-05-09 13:46:06 -07:00
Alexey Bataev	cce80bd8b7	[SLP]Adjust assertion check for scalars in several insertelements. If the same scalar is inserted several times into the same buildvector, the mask index can be used already. In this case need to check, that this scalar is already part of the vectorized buildvector.	2022-05-09 13:07:59 -07:00
Florian Hahn	266ea446ab	Revert "Recommit "[VPlan] Remove uneeded needsVectorIV check."" This reverts commit `8b48223447`. This triggers an assertion on a test case mentioned in D123720. Revert while I investigate.	2022-05-09 20:33:14 +01:00
Alexey Bataev	9dc4ced204	[SLP]Try partial store vectorization if supported by target. We can try to vectorize number of stores less than MinVecRegSize / scalar_value_size, if it is allowed by target. Gives an extra opportunity for the vectorization. Fixes PR54985. Differential Revision: https://reviews.llvm.org/D124284	2022-05-09 09:48:15 -07:00
Alexey Bataev	9c3a75eabf	[SLP]Fix a crash when preparing a mask for external scalars. Need to use actual index instead of the tree entry position, since the insert index may be different than 0. It mean, that we vectorized part of the buildvector starting from not initial insertelement instruction beause of some reason.	2022-05-09 07:59:34 -07:00
David Green	6f9e1ea0ef	[VectorCombine] Attempt to fold select shuffles from reductions Given a commutative reduction leading from a shuffle, the order of the lanes on the shuffle are not important for the result. This means we can reorder the shuffle to something simpler, which we try shuffling the first vector lanes first. This was D123494. The new shuffle may not be profitable though, and if it is not we can try the folding of select shuffles from D123911. This, with some adjustment as the output lane ordering is now unimportant, can allow the final shuffle to simplify given the inputs to the patterns from D123911. Where as each transformation on their own are not profitable, the combination is. We can only support a single shuffle when called from reductions, but we are able to sort the ReconstructMask, potentially allowing it to simplify to an identity or concat mask. Differential Revision: https://reviews.llvm.org/D125086	2022-05-08 10:32:41 +01:00
David Green	802e15c576	[SLP] Cluster ordering for loads Given a load without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded in fewer loads, meaning less overall shuffling (for example loading v4i8 clusters of a v16i8 as a single f32 loads, as opposed to multiple independent bytes loads and inserts). Differential Revision: https://reviews.llvm.org/D122145	2022-05-07 14:38:11 +01:00

1 2 3 4 5 ...

3362 Commits