Revert rGef89409a59f3b79ae143b33b7d8e6ee6285aa42f "Fix 'unused-lambda-capture' gcc warning. NFCI."
Revert rG926ccfef032d206dcbcdf74ca1e3a9ebf4d1be45 "[SLP] ScalarizationOverheadBuilder - demand all elements for scalarization if the extraction index is unknown / out of bounds"
Revert the ScalarizationOverheadBuilder sequence from D134605 - when
accumulating extraction costs by type (instead of by specific value),
we do not distinguish whether the extractions come from the same
source or not, and we always count the cost just once. This needs
addressing before we can use getScalarizationOverhead properly.
Instead of accumulating all extraction costs separately and then
adjusting for repeated subvector extractions, this patch collects all
the extractions and then converts them into calls to
getScalarizationOverhead to improve the accuracy of the costs.
I'm not entirely satisfied with the getExtractWithExtendCost handling yet - this still just adds all the getExtractWithExtendCost costs together - it really needs to be replaced with a "getScalarizationOverheadWithExtend", but that will require further refactoring first.
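A minimal sketch of the approach, assuming the LLVM 15-era TTI
signature (the ExtractedLanes collection is hypothetical, not code
from the patch):

  // Gather the demanded lanes of one source vector, then make a single
  // getScalarizationOverhead call instead of costing each lane extract
  // individually.
  APInt DemandedElts = APInt::getZero(NumElts);
  for (unsigned Idx : ExtractedLanes) // hypothetical per-source lanes
    DemandedElts.setBit(Idx);
  InstructionCost ExtractCost = TTI.getScalarizationOverhead(
      VecTy, DemandedElts, /*Insert=*/false, /*Extract=*/true);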
This replaces my initial attempt in D124769.
Differential Revision: https://reviews.llvm.org/D134605
The mul by constant cost models handle power-of-2 constants, but not
negated-power-of-2 constants, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum
and wires it up for basic mul cost analysis and SLP handling.
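For illustration, the kind of scalar code this affects:

  // x * -8 is lowered by the backends to a shift plus a negate, e.g.
  // neg(shl x, 3), so it should not be costed as a full multiply.
  long long f(long long x) { return x * -8; }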
Fixes #50778
Differential Revision: https://reviews.llvm.org/D111968
This is the first patch in a series intended to remove the flag
-enable-new-pm=0 from lit tests. This is part of a bigger effort to
completely remove legacy-pass-manager code in favor of the now-default
new pass manager.
In this patch the flag has been removed only from tests where no
significant change was required because the checks were already
duplicated for both pass managers.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D134150
Added a mask and analysis of the buildvector sequence to the
isUndefVector function; this improves codegen and cost estimation.
Metric: SLP.NumVectorInstructions

Program                                                                    results   results0  diff
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  27362.00  27360.00  -0.0%

Metric: size..text

Program                                                                    results    results0   diff
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test        805299.00  806035.00  0.1%
526.blender_r - some extra code is vectorized.
508.namd_r - some extra code is optimized out.
Differential Revision: https://reviews.llvm.org/D133891
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (and recent fixes to the bdver2 + alderlake models)
Adding full CostKinds costs affects some other tests, as they make
assumptions about SizeLatency costs, so those need addressing first.
These are the worst-case generic vector shift costs, where nothing is
known about the shift amounts. In particular, this should stop us
using the default SizeLatency cost of 1 for the many pre-AVX2 vector
shifts that can expand to 20+ uops during lowering, even for 128-bit
vectors, resulting in some horrible inline/unroll decisions.
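For reference, a sketch of a fully-variable vector shift of the kind
being costed here (using clang vector extensions):

  // No per-lane variable shift exists before AVX2, so this expands to
  // a long scalarized/blended sequence during lowering.
  typedef int v4si __attribute__((vector_size(16)));
  v4si shl(v4si a, v4si amt) { return a << amt; }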
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Adding the pre-header to CSEBlocks ensures instructions are CSE'd even
after hoisting.
This was originally discovered by @atrick a while ago.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D133649
If the reused scalars are clustered, i.e. each part of the reused mask
contains all elements of the original scalars exactly once (e.g. for
two original scalars, the mask <1,0,0,1> consists of the clusters
<1,0> and <0,1>), we can reorder those clusters to improve the whole
ordering of the clustered vectors.
Differential Revision: https://reviews.llvm.org/D133524
We need to either follow the original order of the operands for bool
logical ops, or emit a freeze instruction to avoid poison propagation.
Differential Revision: https://reviews.llvm.org/D126877
The pointer operands of a ScatterVectorize node may contain
non-instruction values, which were not checked for "already being
vectorized". Need to check that such pointers are already vectorized
and gather them instead of trying to build a vectorize node, to avoid
a compiler crash.
Differential Revision: https://reviews.llvm.org/D132949
This patch changes the order of searching for reductions vs. other
vectorization possibilities. The idea is that failing to match a
reduction is not harmful to later attempts to find vectorizable
operations on vector build sequences, but doing it in the opposite
order has a good chance of ruining the opportunity to match a
reduction later. We also don't want to try vectorizing binary
operations too early, as 2-way vectorization may effectively prohibit
wider ones, leading to less effective code.
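For illustration, a hypothetical reduction we now try to match before
anything else:

  // If pairs of the inner adds are 2-way vectorized first, the 4-wide
  // horizontal reduction below can no longer be matched.
  int hadd4(const int *a) { return a[0] + a[1] + a[2] + a[3]; }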
Differential Revision: https://reviews.llvm.org/D132590
When estimating the cost of the in-tree vectorized scalars in
buildvector sequences, we need to take into account the vectorized
insertelement instruction. The top of the buildvector sequence is the
topmost vectorized insertelement instruction, because it will have
more than one use after vectorization.
For the affected test case this improves throughput from 21 to 16
(per llvm-mca).
Differential Revision: https://reviews.llvm.org/D132740
The SLP vectorizer tries to find reductions starting from the
operands of instructions with no users/void returns/etc. But such
operands can be postponable instructions, like Cmp, InsertElement or
InsertValue. Such operands must still be postponed; the vectorizer
should not try to vectorize them immediately.
Differential Revision: https://reviews.llvm.org/D131965
In many cases a constant buildvector results in a vector load from
the constant/data pool. Need to consider this cost too.
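A sketch of the situation, using clang vector extensions
(hypothetical example):

  typedef int v4si __attribute__((vector_size(16)));
  // The constant operand becomes a <4 x i32> buildvector, which
  // typically materializes as a load from the constant pool - a cost
  // of its own.
  v4si add_consts(v4si x) { return x + (v4si){1, 2, 3, 4}; }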
Differential Revision: https://reviews.llvm.org/D126885
Currently, we try to vectorize values feeding into stores only if the
slp-vectorize-hor-store option is provided. We can safely enable
vectorization of the value operand of a single store in a basic block
if that value is used only in the store.
This should enable extra vectorization without increasing compile
time significantly.
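A hypothetical example of the pattern now vectorized by default:

  // The stored value is used only by the store, so the tree feeding
  // it is a safe seed: the two adds can become one <2 x double> add.
  void f(double *p, double a, double b, double c, double d) {
    p[0] = (a + b) * (c + d);
  }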
Fixes https://github.com/llvm/llvm-project/issues/51320
Differential Revision: https://reviews.llvm.org/D131894
We currently assert in vectorizeTree(TreeEntry*) when processing a PHI
bundle in a block containing a catchswitch. We attempt to set the
IRBuilder insertion point following the catchswitch, which is invalid.
This is done so that ShuffleBuilder.finalize() knows where to insert
a shuffle if one is needed.
To avoid this occurring, watch out for catchswitch blocks during
buildTree_rec() processing, and avoid adding PHIs in such blocks to
the vectorizable tree. It is unlikely that constraining vectorization
over an exception path will cause a noticeable performance loss, so
this seems preferable to trying to anticipate when a shuffle will and
will not be required.
Building on the work in D124284, this patch tags v4i8 and v2i16
vector loads as custom, enabling SLP to try to vectorize these types
ending in a partial store (using the SSE MOVD instruction) - we
already do something similar for 64-bit vector types.
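For illustration, the kind of pattern this enables (hypothetical
example):

  // Four adjacent i8 ops and stores can become a v4i8 op plus a
  // single 32-bit store, lowered with the SSE MOVD instruction.
  void f(unsigned char *dst, const unsigned char *src) {
    dst[0] = src[0] + 1;
    dst[1] = src[1] + 1;
    dst[2] = src[2] + 1;
    dst[3] = src[3] + 1;
  }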
Differential Revision: https://reviews.llvm.org/D127604
If the root order itself does not require reordering, we can safely
remove its reorder mask (e.g., if the root node is a vector of phis).
But if this node is used as an operand in the graph, we cannot delete
the reordering and need to keep it. Otherwise the graph nodes are not
synchronized with the operands, which may cause extra gather
instruction(s) or a compiler crash.
Also, we need to be very careful when selecting the gather nodes for
reordering, since there might be several gather nodes with the same
scalars and we may end up reordering the same node many times instead
of different nodes.
Differential Revision: https://reviews.llvm.org/D128680
Improved/fixed cost modeling for shuffles by providing masks, and
improved the cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
The patch was reverted in 4c5f10a due to buildbot failures and is now
being reapplied with updated AArch64 and RISCV tests.
This patch adds handling for the llvm.powi.* intrinsics in
BasicTTIImplBase::getIntrinsicInstrCost() and improves vectorization.
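For illustration, a pattern this lets SLP handle (hypothetical
example):

  // Two powi calls with the same constant exponent can now be costed
  // sensibly and vectorized into one llvm.powi call on <2 x double>.
  void f(double *p) {
    p[0] = __builtin_powi(p[0], 3);
    p[1] = __builtin_powi(p[1], 3);
  }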
Closes #53887.
Differential Revision: https://reviews.llvm.org/D128172
If masked gather nodes must be reordered, we can just reorder the
scalars, as for gather nodes. But if the node contains reused scalars,
it must be handled the same way as a regular vectorizable node, since
we need to reorder the reuse mask, not the scalars directly.
Differential Revision: https://reviews.llvm.org/D128360
This reverts commit cac60940b7.
Caused a miscompile of pytorch/cpuinfo with -Os -fsanitize=memory
-march=haswell.
See my latest comment (may update) on D115462.
During the reordering transformation we should try to avoid reordering
bundles like fadd,fsub because this may block them being matched into
a single vector instruction on x86.
We do this by checking whether a TreeEntry is such a pattern and
adding it to the list of TreeEntries whose orders need to be
considered.
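For illustration, the kind of bundle we want to keep in its original
order (hypothetical example):

  // Alternating sub/add lanes match x86's addsub pattern (e.g.
  // vaddsubps); reordering the lanes would destroy it.
  void f(float *p, const float *q, const float *r) {
    p[0] = q[0] - r[0];
    p[1] = q[1] + r[1];
    p[2] = q[2] - r[2];
    p[3] = q[3] + r[3];
  }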
Differential Revision: https://reviews.llvm.org/D125712