Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
If masked gather nodes must be reordered, we can just reorder the
scalars, just like for gather nodes. But if the node contains reused
scalars, it must be handled the same way as a regular vectorizable node,
since the reuse mask, not the scalars themselves, needs to be reordered.
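
The distinction above can be illustrated with a small standalone sketch.
Node, applyOrder and reorderNode are hypothetical names used only for
illustration, and the convention that element I moves to lane Order[I]
is an assumption, not necessarily the one used by the vectorizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: moves element I of In to position Order[I].
template <typename T>
std::vector<T> applyOrder(const std::vector<T> &In,
                          const std::vector<unsigned> &Order) {
  assert(In.size() == Order.size() && "order must cover every element");
  std::vector<T> Out(In.size());
  for (std::size_t I = 0; I < In.size(); ++I) {
    assert(Order[I] < In.size() && "order index out of range");
    Out[Order[I]] = In[I];
  }
  return Out;
}

// Hypothetical stand-in for a tree node; ints stand in for IR values.
struct Node {
  std::vector<int> Scalars;        // unique scalars of the node
  std::vector<unsigned> ReuseMask; // lane -> index into Scalars, empty if
                                   // no scalar is reused
};

// Without reuses the scalars themselves are permuted; with reuses only
// the reuse mask is permuted and the scalars stay where they are.
void reorderNode(Node &N, const std::vector<unsigned> &Order) {
  if (N.ReuseMask.empty())
    N.Scalars = applyOrder(N.Scalars, Order);
  else
    N.ReuseMask = applyOrder(N.ReuseMask, Order);
}
```

In this sketch the order has as many entries as the node has lanes, so
for a node with reuses it is sized to the reuse mask rather than to the
list of unique scalars.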
Differential Revision: https://reviews.llvm.org/D128360
This reverts commit cac60940b7.
Caused a miscompile of pytorch/cpuinfo with -Os -fsanitize=memory -march=haswell.
See my latest comment (may update) on D115462.
During the reordering transformation we should try to avoid reordering bundles
like fadd,fsub because this may block them from being matched into a single
vector instruction in x86.
We do this by checking if a TreeEntry is such a pattern and adding it to the
list of TreeEntries with orders that need to be considered.
Differential Revision: https://reviews.llvm.org/D125712
If OffsetBeg + InsertVecSz is greater than VecSz, we need to estimate
the cost as a shuffle of 2 vectors, not as an insert of a subvector. Otherwise,
the inserted subvector is out of range and the compiler may crash.
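
A minimal sketch of the range check described above. InsertCostKind and
classifyInsert are hypothetical names; only OffsetBeg, InsertVecSz and
VecSz come from the description.

```cpp
#include <cassert>

// Hypothetical classification of how the insert should be costed.
enum class InsertCostKind { InsertSubvector, TwoSourceShuffle };

// OffsetBeg:   first lane written by the inserted elements.
// InsertVecSz: number of lanes in the inserted subvector.
// VecSz:       number of lanes in the destination vector.
InsertCostKind classifyInsert(unsigned OffsetBeg, unsigned InsertVecSz,
                              unsigned VecSz) {
  assert(InsertVecSz > 0 && VecSz > 0 && "vectors must be non-empty");
  // The subvector would run past the end of the destination, so it
  // cannot be costed as a subvector insert.
  if (OffsetBeg + InsertVecSz > VecSz)
    return InsertCostKind::TwoSourceShuffle;
  return InsertCostKind::InsertSubvector;
}
```

For example, inserting 4 lanes at offset 6 into an 8-lane vector exceeds
the range and is classified as a two-source shuffle in this sketch.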
Differential Revision: https://reviews.llvm.org/D128071
If the root scalar is mapped to the smallest bit width, the vector is
truncated and the types of the original buildvector and the extracted value
mismatch. For extracts, we emit sext/zext instructions; for shuffles we
can reuse the original vector instead of the truncated one.
Differential Revision: https://reviews.llvm.org/D127974
Currently scatter vectorize nodes can be emitted only for GEPs with
constant indices. But we can also emit such nodes for GEPs with the same
ptr and non-constant vectorizable/gathered indices, if profitable. The patch
adds support for such nodes and tries to improve handling of GEPs with
non-constant indices for such nodes.
Metric: SLP.NumVectorInstructions
Program results results0 diff
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00 27507.00 -0.2%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 5395.00 5380.00 -0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5389.00 5374.00 -0.3%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 5664.00 5643.00 -0.4%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00 13127.00 -0.6%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 212.00 207.00 -2.4%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 890.00 850.00 -4.5%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 1695.00 1581.00 -6.7%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2338.00 2140.00 -8.5%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 63.00 55.00 -12.7%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 468.00 356.00 -23.9%
Geomean difference -0.3%
All numbers show increased number of generated vector instructions.
Diff:
SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but
needs extra analysis with LTO (with LTO the compiler generates
masked_gather, while before regular loads were emitted because of extra
data available at LTO time).
SingleSource/UnitTests/matrix-types-spec - more vector code.
MultiSource/Applications/JM/lencod/lencod - same.
External/SPEC/CINT2006/464.h264ref/464.h264ref - same.
MultiSource/Benchmarks/7zip/7zip-benchmark - same.
External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes.
External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code.
External/SPEC/CFP2006/447.dealII/447.dealII - same
External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same
External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same
External/SPEC/CFP2017rate/511.povray_r/511.povray - same
External/SPEC/CFP2006/453.povray/453.povray - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same
Differential Revision: https://reviews.llvm.org/D127219
We can skip the analysis of the constant nodes, their order should not
affect the ordering of the trees/subtrees.
Differential Revision: https://reviews.llvm.org/D127775
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
Extractelement instructions may come from different basic blocks; this
needs to be taken into account when looking for the last instruction in
the bundle to prevent a compiler crash.
Differential Revision: https://reviews.llvm.org/D126777
The patch improves compile time. For function calls that cannot be
vectorized, create a unique group for each such call instead of
a subgroup. This prevents them from being grouped into subgroups and
avoids attempts to vectorize them.
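
A rough sketch of the grouping idea, assuming calls are keyed by callee
name. CallSite and assignGroups are hypothetical names and do not
reflect the actual key computation in the vectorizer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of a call; Vectorizable says whether a vector
// form of the callee exists.
struct CallSite {
  std::string Callee;
  bool Vectorizable;
};

// Returns one group id per call. Vectorizable calls to the same callee
// share a group; every non-vectorizable call gets a group of its own, so
// no two such calls are ever tried together.
std::vector<std::size_t> assignGroups(const std::vector<CallSite> &Calls) {
  std::map<std::string, std::size_t> SharedGroups;
  std::vector<std::size_t> Groups;
  std::size_t NextId = 0;
  for (const CallSite &C : Calls) {
    assert(!C.Callee.empty() && "callee name must be known");
    if (!C.Vectorizable) {
      Groups.push_back(NextId++); // unique group, never shared
      continue;
    }
    auto It = SharedGroups.find(C.Callee);
    if (It == SharedGroups.end())
      It = SharedGroups.emplace(C.Callee, NextId++).first;
    Groups.push_back(It->second);
  }
  return Groups;
}
```

Because groups with a single member give the vectorizer nothing to pair
up, the unique ids are what removes the wasted attempts in this sketch.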
Also, looks through cast operands to try to check their
groups/subgroups.
Reduces the number of vectorization attempts. No changes in the statistics
for SPEC2017/2006/llvm-test-suite.
Differential Revision: https://reviews.llvm.org/D126476
Need to handle a corner case correctly: if all elements are Undefs/Poisons,
we need to emit actual values, not just poisons.
Differential Revision: https://reviews.llvm.org/D126298
ScatterVectorize nodes should be handled the same way as gathers in the
reorderBottomToTop function, since we can simply reorder the loads in
this node. Because of that, such nodes need to be included in the list of
gathered nodes to fix a compiler crash.
Differential Revision: https://reviews.llvm.org/D126378
SLP should build ScatterVectorize nodes only if they actually end up
as masked gathers rather than being scalarized. In the second
scenario it is better to build a gather node.
Differential Revision: https://reviews.llvm.org/D126379
We need to use all ReductionOps when propagating flags for the reduction
ops, otherwise the transformation is not correct. In addition, the nuw/nsw
flags need to be dropped.
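
A simplified sketch of such flag propagation. OpFlags and
propagateReductionFlags are hypothetical names, and keeping a flag only
when every reduction op carries it is an assumed reading of "use all
ReductionOps", not something the description spells out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical subset of instruction flags.
struct OpFlags {
  bool NoUnsignedWrap; // nuw
  bool NoSignedWrap;   // nsw
  bool NoNaNs;         // a fast-math flag, used here as an example
};

// Computes the flags for the vectorized reduction from all scalar
// reduction ops, not just the first one.
OpFlags propagateReductionFlags(const std::vector<OpFlags> &ReductionOps) {
  assert(!ReductionOps.empty() && "expected at least one reduction op");
  OpFlags Result = ReductionOps.front();
  // Assumption: a flag survives only if every op has it.
  for (const OpFlags &F : ReductionOps)
    Result.NoNaNs = Result.NoNaNs && F.NoNaNs;
  // The wrap flags are dropped unconditionally, as the description asks.
  Result.NoUnsignedWrap = false;
  Result.NoSignedWrap = false;
  return Result;
}
```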
Differential Revision: https://reviews.llvm.org/D126371
The crash is caused by incorrect order set by reorderBottomToTop(), which
happens when it is reordering a TreeEntry which has a user that has already been
reordered earlier. Please see the detailed description in the lit test.
Differential Revision: https://reviews.llvm.org/D126099
To be used correctly in a sort-like function, the isFirstInsertElement
function must follow the strict weak ordering rule, i.e.
isFirstInsertElement(IE1, IE1) should return false.
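
The requirement can be shown with a reduced comparator. InsertRecord,
isFirstInsert and sortInserts are hypothetical stand-ins that compare
only a lane index; the real isFirstInsertElement works on insertelement
instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an insertelement instruction.
struct InsertRecord {
  unsigned Index; // lane written by the insert
};

// Strict comparison: returns false for equal elements, and in particular
// for an element compared with itself, as std::sort requires.
bool isFirstInsert(const InsertRecord &A, const InsertRecord &B) {
  return A.Index < B.Index;
}

std::vector<InsertRecord> sortInserts(std::vector<InsertRecord> Inserts) {
  std::sort(Inserts.begin(), Inserts.end(), isFirstInsert);
  assert(std::is_sorted(Inserts.begin(), Inserts.end(), isFirstInsert) &&
         "comparator must produce a sorted sequence");
  return Inserts;
}
```

A comparator written with <= instead of < would return true for
isFirstInsert(X, X), which violates the requirement and makes the
behavior of std::sort undefined.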
Builds the UserIgnore list only once as a SmallDenseSet without rebuilding
it between the runs, iterates over gathers instead of the list of reduction
ops, and does some checks in buildTree_rec only if the corresponding
containers are not empty.
The SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are inputs for insertelement instructions forming a buildvector,
and instead of an extractelement/insertelement pair we can estimate the
cost of shuffle(s) and generate a series of shuffles, which can be
further optimized.
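
As a rough illustration, a chain of extractelement/insertelement pairs
over one source vector can be summarized by a single shuffle mask.
buildShuffleMask and PoisonLane are hypothetical names, and the sketch
assumes every extract reads from the same vectorized value.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Marks result lanes that no insert writes (hypothetical encoding).
constexpr int PoisonLane = -1;

// Each pair is (lane extracted from the vectorized value, lane of the
// buildvector the extracted scalar is inserted into).
std::vector<int> buildShuffleMask(
    const std::vector<std::pair<unsigned, unsigned>> &ExtractInsertPairs,
    unsigned ResultLanes) {
  std::vector<int> Mask(ResultLanes, PoisonLane);
  for (const auto &P : ExtractInsertPairs) {
    assert(P.second < ResultLanes && "insert index out of range");
    // Result lane P.second takes its value from source lane P.first.
    Mask[P.second] = static_cast<int>(P.first);
  }
  return Mask;
}
```

With such a mask, the pairs can be costed and emitted as one shuffle
instead of one extract and one insert per scalar.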
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it reduced the
number of re-vectorization attempts (where we could try to vectorize
buildvector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
This reverts commit fc9c59c355.
The patch triggers an assertion when building SPEC on X86. Reduced
reproducer shared at D107966.
Also reverts follow-up commit 11a09af76d.
The SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are inputs for insertelement instructions forming a buildvector,
and instead of an extractelement/insertelement pair we can estimate the
cost of shuffle(s) and generate a series of shuffles, which can be
further optimized.
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it reduced the
number of re-vectorization attempts (where we could try to vectorize
buildvector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
The pattern matching and vectorization for reductions was not very
effective. Some of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of a too early attempt to vectorize pairs of binop arguments, and the cost
of reductions of constants was not correct. The patch addresses these
issues and improves the analysis/cost estimation and vectorization of the
reductions.
The most significant changes in SLP.NumVectorInstructions:
Metric: SLP.NumVectorInstructions
Program results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 920.00 3548.00 285.7%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 66.00 122.00 84.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 100.00 128.00 28.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 592.00 687.00 16.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 402.00 426.00 6.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1665.00 1745.00 4.8%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 135.00 139.00 3.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 388.00 397.00 2.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 895.00 914.00 2.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 240.00 244.00 1.7%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 240.00 244.00 1.7%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00 0.7%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 8125.00 8183.00 0.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9832.00 9880.00 0.5%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5267.00 5291.00 0.5%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 201.00 192.00 -4.5%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 201.00 192.00 -4.5%
644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.
641.leela_s and 541.leela_r - the function
`@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar to what it was before.
Differential Revision: https://reviews.llvm.org/D111574
Need to check if the reduction is still a (non-)cmp-select pattern min/max
reduction to avoid a compiler crash while building the list of reduction
operations. The cmp-select pattern provides 2 reduction operations, while
intrinsics provide just one.
If an alternate node has only 2 instructions and the tree is already big
enough, it is better to skip the vectorization of such nodes, as they are
not very profitable (the resulting code contains 3 instructions instead of
the original 2 scalars). SLP can try to vectorize the buildvector sequence
in the next attempt, if it is profitable.
Metric: SLP.NumVectorInstructions
Program results results0 diff
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 72.00 73.00 1.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 1186.00 1198.00 1.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 241.00 242.00 0.4%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2131.00 2139.00 0.4%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 6377.00 6384.00 0.1%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 6377.00 6384.00 0.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00 12658.00 0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00 26147.00 -0.1%
test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test 99.00 86.00 -13.1%
Gains:
526.blender_r - more vectorized trees.
enc-3des - same.
Others:
510.parest_r - no changes.
miniFE - same
623.xalancbmk_s - some (non-profitable) parts of the trees are not
vectorized.
523.xalancbmk_r - same
lencod - same
timberwolfmc - same
miniAMR - same
Differential Revision: https://reviews.llvm.org/D125571
If the insert index was already used or is not constant, we should stop
looking for a unique buildvector sequence; it must be split into
2 different buildvectors.
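
The splitting rule can be sketched as follows. splitBuildVectors and
NonConstantIndex are hypothetical names, and leaving an insert with a
non-constant index out of every sequence is an assumption of this
sketch.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical encoding for an insert whose index is not a constant.
constexpr int NonConstantIndex = -1;

// Takes the insert indices in chain order and splits them into
// sequences in which every index is constant and used at most once.
std::vector<std::vector<int>>
splitBuildVectors(const std::vector<int> &InsertIndices) {
  std::vector<std::vector<int>> Sequences;
  std::vector<int> Current;
  std::set<int> Used;
  for (int Idx : InsertIndices) {
    assert(Idx >= NonConstantIndex && "unexpected index encoding");
    bool MustSplit = Idx == NonConstantIndex || Used.count(Idx) != 0;
    if (MustSplit && !Current.empty()) {
      // Close the current buildvector and start a new one.
      Sequences.push_back(Current);
      Current.clear();
      Used.clear();
    }
    if (Idx == NonConstantIndex)
      continue; // assumption: not part of any buildvector sequence
    Current.push_back(Idx);
    Used.insert(Idx);
  }
  if (!Current.empty())
    Sequences.push_back(Current);
  return Sequences;
}
```

For the indices 0, 1, 2, 1, 3 the second use of index 1 closes the first
sequence, giving the two buildvectors {0, 1, 2} and {1, 3}.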
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not an undef vector, resize the very first mask to
have a common VF and perform the action for 2 input vectors (including the
non-undef Base). Other shuffle masks are combined with the result of the
first stage and processed as a shuffle of 2 elements.
2. If the Base is an undef vector and there is only 1 shuffle mask, perform
the action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform a series of shuffle actions for 2 vectors,
combining the masks properly between the steps.
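
Two building blocks of the steps above, sketched under assumptions: the
masks already share a common VF, no lane is written by both inputs, and
lanes of the second input are numbered from VF upwards in the combined
mask. isIdentityMask, combineMasks and PoisonLane are hypothetical
names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marks lanes that are not taken from the input (hypothetical encoding).
constexpr int PoisonLane = -1;

// A mask is an identity if every defined lane I selects element I, so
// the shuffle for step 2 can be skipped.
bool isIdentityMask(const std::vector<int> &Mask) {
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != PoisonLane && Mask[I] != static_cast<int>(I))
      return false;
  return true;
}

// Merges the masks of two inputs into one mask for a shuffle of 2
// vectors; lanes taken from the second input are offset by VF.
std::vector<int> combineMasks(const std::vector<int> &First,
                              const std::vector<int> &Second) {
  assert(First.size() == Second.size() && "masks must share a common VF");
  const int VF = static_cast<int>(First.size());
  std::vector<int> Combined(First.size(), PoisonLane);
  for (std::size_t I = 0; I < First.size(); ++I) {
    assert((First[I] == PoisonLane || Second[I] == PoisonLane) &&
           "a lane can come from only one input");
    if (First[I] != PoisonLane)
      Combined[I] = First[I];
    else if (Second[I] != PoisonLane)
      Combined[I] = Second[I] + VF;
  }
  return Combined;
}
```

For steps 1 and 3, the result of each two-vector shuffle would become
the first input of the next one, with the next mask combined in the same
way.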
The original implementation misses the very first analysis for the Base
vector, so the cost might be too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750