Commit Graph

976 Commits

Author SHA1 Message Date
Alexey Bataev ba74bb3a22 [SLP]Fix reused extracts cost.
If the extractelement instruction is used multiple times in the
different tree entries (either vectorized, or gathered), need to
compensate the scalar cost of such instructions. They are completely
removed if all users are part of the tree but we need to compensate the
cost only once for each instruction.

Differential Revision: https://reviews.llvm.org/D114958
2021-12-02 10:52:00 -08:00
Alexey Bataev 8ceccbd321 [SLP]Outline and fix code for finding common insertelement vectors.
Need to outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. It also
improves the process by adding some extra checks for early exit and
fixes a bug where it always finds the match because of erroneous compare
of the same values.

Differential Revision: https://reviews.llvm.org/D114909
2021-12-02 09:18:25 -08:00
Alexey Bataev 92fbd76af5 [SLP]Improve registering and merging of compatible shuffles.
If several shuffle instructions are emitted, some of them might
same/compatible (less defined) with the previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the
vectorized code.

Differential Revision: https://reviews.llvm.org/D114087
2021-12-02 08:48:29 -08:00
Alexey Bataev afc9e7517a [SLP]Improve cost model for the shuffled extracts.
Improved the calculation of the shuffled extracts, where possible. Need
to calculate the cost for the extracted scalars if some users are not
insertelements + improved the total estimation of the shuffled scalars
used in insertelements build vectors.

Differential Revision: https://reviews.llvm.org/D113782
2021-12-01 08:10:57 -08:00
Alexey Bataev cc30fbf242 [SLP]Introduce isUndefVector function to check for undef vectors.
Undefined vector might be not only the UndefValue, but also it can be
a constant vector with undef ot poison elements, need to check for this
kind of undef too.

Differential Revision: https://reviews.llvm.org/D114873
2021-12-01 07:46:10 -08:00
Alexey Bataev ddce6e0561 [SLP]Improve vectorization of cmp instructions sequences.
Final attempt to vectorize bundles of comptatible cmp instructions after
all other instructions processing.

Metric: SLP.NumVectorInstructions

Program                                                                             results results0 diff
        test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    1.00    5.00  400.0%
                              test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    8.00   11.00   37.5%
                    test-suite :: MultiSource/Benchmarks/Olden/voronoi/voronoi.test   20.00   26.00   30.0%
                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1344.00 1648.00   22.6%
               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1344.00 1648.00   22.6%
                              test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test  102.00  124.00   21.6%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  118.00  133.00   12.7%
          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3233.00 3554.00    9.9%
           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3233.00 3554.00    9.9%
                        test-suite :: MultiSource/Benchmarks/Olden/power/power.test   64.00   70.00    9.4%
           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 7879.00 8604.00    9.2%
           test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test   50.00   54.00    8.0%
                        test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   27.00   29.00    7.4%
             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 8345.00 8955.00    7.3%
     test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  694.00  738.00    6.3%
                        test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  361.00  382.00    5.8%
                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  409.00  430.00    5.1%
     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  140.00  147.00    5.0%
      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  140.00  147.00    5.0%
             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4013.00 4206.00    4.8%
                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  966.00 1011.00    4.7%
                           test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   65.00   68.00    4.6%
                            test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4219.00 4381.00    3.8%
                    test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1911.00 1973.00    3.2%
      test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   62.00   64.00    3.2%
     test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   62.00   64.00    3.2%
                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  852.00  877.00    2.9%
                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  852.00  877.00    2.9%
                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1624.00 1668.00    2.7%
                         test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   39.00   40.00    2.6%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  613.00  624.00    1.8%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  378.00  383.00    1.3%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  293.00  295.00    0.7%
            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  297.00  299.00    0.7%
      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5522.00 5534.00    0.2%
     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5522.00 5534.00    0.2%

Differential Revision: https://reviews.llvm.org/D114799
2021-12-01 07:26:29 -08:00
Alexey Bataev dce6c434ea [SLP]Improve isFixedVectorShuffle and its use.
Extended support for undefined source vector/extract indices/non-fixed
vector types, also no need to check for the parent of the extractelement
instructions with the constant indicies.

Differential Revision: https://reviews.llvm.org/D114121
2021-11-30 10:10:20 -08:00
Alexey Bataev fc57cfad3c [SLP][NFC]Move static function to make it visible in member function,
NFC.
2021-11-30 09:38:46 -08:00
Alexey Bataev fc0aacf324 [SLP]Improve analysis/emission of vector operands for alternate nodes.
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-26 06:38:02 -08:00
Alexey Bataev 4675a1654c Revert "[SLP]Improve analysis/emission of vector operands for alternate nodes."
This reverts commit 496254cf80 to fix
compiler crashes reported in D114101#3152982.
2021-11-25 05:19:49 -08:00
Alexey Bataev 496254cf80 [SLP]Improve analysis/emission of vector operands for alternate nodes.
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-24 12:55:24 -08:00
Rosie Sumpter c2441b6b89 [LoopVectorize] Add vector reduction support for fmuladd intrinsic
Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Differential Revision: https://reviews.llvm.org/D111555
2021-11-24 08:50:04 +00:00
Alexey Bataev d1fdf867b1 [SLP][NFC]Introduce TreeEntry::getVectorFactor member function, NFC.
Added TreeEntry::getVectorFactor to get the final vectotization factor
to simplify the code.

Differential Revision: https://reviews.llvm.org/D114190
2021-11-19 06:32:19 -08:00
Alexey Bataev 900cc1a226 [SLP]Improve cost of the gather nodes.
No need to count the final shuffle cost for the constants, gathering of
the constants is just a constant vector + extra inserts, if required.

Differential Revision: https://reviews.llvm.org/D113770
2021-11-16 06:25:07 -08:00
Alexey Bataev cdf8a53c1d [SLP]Fix windows build, NFC.
Need to put `IndexIdx` var to the list of captures.
2021-11-16 06:09:51 -08:00
Alexey Bataev aa9bbb64be [SLP]Adjust GEP indices types when trying to build entries.
Need to adjust the types of GEPs indices when building the tree
entries/operands. Otherwise some of the nodes might differ and
vectorizer is unable to correctly find them and count their cost.

Differential Revision: https://reviews.llvm.org/D113792
2021-11-16 05:44:33 -08:00
Alexey Bataev 224e46d355 [SLP][DOT][NFCI]Output all scalars for the splats, not only the first one. 2021-11-15 10:54:26 -08:00
Alexey Bataev 036207d5f2 [SLP]Improve splat detection.
A bunch of scalars can be treated as a splat not only if all elements
are the same but also if some of them are undefvalues.

Differential Revision: https://reviews.llvm.org/D113774
2021-11-15 07:50:34 -08:00
Alexey Bataev b85152f8b1 [SLP][NFC]Use `isa_and_nonnull` and fix comment, NFC. 2021-11-15 06:49:33 -08:00
Alexey Bataev 6fb5bed7d1 [SLP]Do not create unused gather nodes for scalar arguments of vector intrinsics.
If the vector intrinsic has scalar argument, we currently still create
a tree entry for this argument. This entry is not used, just consumes
resources and increases the cost of the tree.

Differential Revision: https://reviews.llvm.org/D113806
2021-11-15 06:11:19 -08:00
Alexey Bataev 352c46e707 [SLP]Improve vectorization of split loads.
Need to fix ther cost estimation for split loads, since we look at the
subregs already, no need to permute them, need just to estimate
subregister insert, if it is smaller than the real register. Also, using
split loads, it might be profitable already to vectorize smaller trees
with gathering of the loads.

Differential Revision: https://reviews.llvm.org/D107188
2021-11-12 06:13:22 -08:00
Simon Pilgrim f057756a1a [SLP] Fix Wdocumentation warning - remove \returns from void function. NFC. 2021-11-07 15:08:39 +00:00
Kazu Hirata 1b108ab975 [Transforms] Use make_early_inc_range (NFC) 2021-11-02 18:13:23 -07:00
Alexey Bataev 07ef9f513f [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should ve taken into account for better graph reordering analysis/
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-28 05:45:09 -07:00
Alexey Bataev f06e332982 Revert "[SLP]Improve/fix reordering of the gathered graph nodes."
This reverts commit 64d1617d18 to fix test
non-stability.
2021-10-27 11:16:58 -07:00
Alexey Bataev 64d1617d18 [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should ve taken into account for better graph reordering analysis/
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-27 08:49:13 -07:00
Alexey Bataev 9b12975cbf Revert "[SLP]Improve/fix reordering of the gathered graph nodes."
This reverts commit f719b794bc to fix
instability in tests.
2021-10-27 07:31:36 -07:00
Alexey Bataev f719b794bc [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should ve taken into account for better graph reordering analysis/
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-27 06:08:40 -07:00
Alexey Bataev cb4feae7bd [SLP]Fix logical and/or reductions.
Need to emit select(cmp) instructions for poison-safe forms of select
ops. Currently alive reports that `Target is more poisonous than source`
for operations we generating for such instructions.

https://alive2.llvm.org/ce/z/FiNiAA

Differential Revision: https://reviews.llvm.org/D112562
2021-10-27 04:25:20 -07:00
Alexey Bataev ce14d1b690 [SLP]Do not reorder reduction nodes.
The final reduction nodes should not be reordered, the order does not
matter for reductions. Also, it might be profitable to vectorize smaller
reduction trees, reduction cost may compensate small tree cost.

Part of D111574

Differential Revision: https://reviews.llvm.org/D112467
2021-10-26 07:41:24 -07:00
Alexey Bataev eb9b75dd4d [SLP]Change the order of the reduction/binops args pair vectorization attempts.
Need to change the order of the reduction/binops args pair vectorization
attempts. Need to try to find the reduction at first and postpone
vectorization of binops args. This may help to find more reduction
patterns and vectorize them.
Part of D111574.

Differential Revision: https://reviews.llvm.org/D112224
2021-10-25 06:27:14 -07:00
Alexey Bataev 3ea7877c8b [SLP]Unify vectorization of PHI and store nodes with improved tiny tree vectorization.
Vectorization of PHIs and stores very similar, it might be beneficial to
try to revectorize stores (like PHIs) if the total number of stores with
the same/alternate opcode is less than the vector size but number of
stores with the same type is larger than the vector size.

Differential Revision: https://reviews.llvm.org/D109831
2021-10-21 06:25:32 -07:00
Alexey Bataev b9cfa016da [SLP]Fix emission of the shrink shuffles.
Need to follow the order of the reused scalars from the
ReuseShuffleIndices mask rather than rely on the natural order.

Differential Revision: https://reviews.llvm.org/D111898
2021-10-18 13:13:12 -07:00
Alexey Bataev 414abff1fe [SLP]Fix PR52090: clang crashes: Assertion `Index < Length && "Invalid index!"' failed.
Need to check that either Idx is UndefMaskElem and value is UndefValue
or Idx is valid and value is the same as the scalar value in the node.

Differential Revision: https://reviews.llvm.org/D111802
2021-10-14 14:26:29 -07:00
Simon Pilgrim 0dcd2b40e6 [TTI] Remove default condition type and predicate arguments from getCmpSelInstrCost
We need to be better at exposing the comparison predicate to getCmpSelInstrCost calls as some targets (e.g. X86 SSE) have very different costs for different comparisons (PR48337), and we can't always rely on the optional Instruction argument.

This initial commit requires explicit condition type and predicate arguments. The next step will be to review a lot of the existing getCmpSelInstrCost calls which have used BAD_ICMP_PREDICATE even when the predicate is known.

Differential Revision: https://reviews.llvm.org/D111024
2021-10-06 15:40:35 +01:00
Alexey Bataev bebe702dbe [SLP]Detect reused scalars in all possible gathers for better vectorization cost.
Some initially gathered nodes missed the check for the reused scalars,
which leads to high gather cost. Such nodes still can be represented as
m gathers + shuffle instead of n gathers, where m < n.

Differential Revision: https://reviews.llvm.org/D111153
2021-10-05 09:43:03 -07:00
Kazu Hirata 4f0225f6d2 [Transforms] Migrate from getNumArgOperands to arg_size (NFC)
Note that getNumArgOperands is considered a legacy name.  See
llvm/include/llvm/IR/InstrTypes.h for details.
2021-10-01 09:57:40 -07:00
Kerry McLaughlin c1d46d3461 [SLPVectorizer] Fix crash in isShuffle with scalable vectors
D104809 changed `buildTree_rec` to check for extract element instructions
with scalable types. However, if the extract is extended or truncated,
these changes do not apply and we assert later on in isShuffle(), which
attempts to cast the type of the extract to FixedVectorType.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D110640
2021-10-01 10:56:44 +01:00
Alexey Bataev f701505c45 [SLP]Improve vectorization of phi nodes by trying wider vectors.
Try to improve vectorization of the PHI nodes by trying to vectorize
similar instructions at the size of the widest possible vectors, then
aggregating with compatible type PHIs and trying to vectoriza again and
only if this failed, try smaller sizes of the vector factors for
compatible PHI nodes. This restores performance of several benchmarks
after tuning of the fp/int conversion instructions costs.

Differential Revision: https://reviews.llvm.org/D108740
2021-09-28 07:20:36 -07:00
Alexey Bataev 8bacfb9bed [SLP]No need to schedule/check parent for extract{element/value} instruction.
The instruction extractelement/extractvalue are not required to
be scheduled since they only depend on the source vector/aggregate (with
constant indices), smae applies to the parent basic block checks.
Improves compile time and saves scheduling budget.

Differential Revision: https://reviews.llvm.org/D108703
2021-09-28 06:13:55 -07:00
Jameson Nash e27a6db529 Bad SLPVectorization shufflevector replacement, resulting in write to wrong memory location
We see that it might otherwise do:

  %10 = getelementptr {}**, <2 x {}***> %9, <2 x i32> <i32 10, i32 4>
  %11 = bitcast <2 x {}***> %10 to <2 x i64*>
...
  %27 = extractelement <2 x i64*> %11, i32 0
  %28 = bitcast i64* %27 to <2 x i64>*
  store <2 x i64> %22, <2 x i64>* %28, align 4, !tbaa !2

Which is an out-of-bounds store (the extractelement got offset 10
instead of offset 4 as intended). With the fix, we correctly generate
extractelement for i32 1 and generate correct code.

Differential Revision: https://reviews.llvm.org/D106613
2021-09-27 14:06:13 -04:00
Simon Pilgrim 8a44281f47 [SLP] getReductionCost - use explicit TTI::TCK_RecipThroughput CostKind. NFCI.
Avoid relying on the default cost kinds in TTI calls (we already do this in other places in SLP) - noticed while trying to see how much work it'd be to extend D110242 and remove all remaining uses of default CostKind arguments.
2021-09-22 16:52:22 +01:00
Alexey Bataev bc69dd62c0 [SLP]Improve graph reordering.
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.

Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.

The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.

Differential Revision: https://reviews.llvm.org/D105020
2021-09-20 08:42:19 -07:00
Chris Lattner 735f46715d [APInt] Normalize naming on keep constructors / predicate methods.
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`.  This achieves two things:

1) This starts standardizing predicates across the LLVM codebase,
   following (in this case) ConstantInt.  The word "Value" doesn't
   convey anything of merit, and is missing in some of the other things.

2) Calling an integer "null" doesn't make any sense.  The original sin
   here is mine and I've regretted it for years.  This moves us to calling
   it "zero" instead, which is correct!

APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go.  As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.

Included in this patch are changes to a bunch of the codebase, but there are
more.  We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.

Differential Revision: https://reviews.llvm.org/D109483
2021-09-09 09:50:24 -07:00
Kazu Hirata 5648f7170e [Analysis, Target, Transforms] Construct SmallVector with iterator ranges (NFC) 2021-09-07 09:19:33 -07:00
Nikita Popov 48ebe427c9 [SLPVectorizer] Make aliasing check more precise
SLPVectorizer currently uses AA::isNoAlias() to determine whether
two locations alias. This does not work if one of the instructions
is a call. Instead, we should check getModRefInfo(), which
determines whether an arbitrary instruction modifies or references
a given location.

Among other things, this prevents @llvm.experimental.noalias.scope.decl()
and other inaccessiblmemonly intrinsics from interfering with SLP
vectorization.

Differential Revision: https://reviews.llvm.org/D109012
2021-08-31 22:35:30 +02:00
Anton Afanasyev 077d4cb3ab Revert "[SLP]No need to schedule/check parent for extract{element/value} instruction."
Revert since introduced issure reported here:
https://lists.llvm.org/pipermail/llvm-dev/2021-August/152411.html
Discussed starting from here: https://reviews.llvm.org/D108703#2974289

This reverts commit a36bc873a2.
2021-08-31 15:29:06 +03:00
Mikhail Goncharov 5097b6e352 Revert "[SLP]Improve graph reordering."
This reverts commit 84cbd71c95.

This commit breaks one of the internal tests. As agreed with Alexey I
will provide the reproducer later.
2021-08-30 19:16:44 +02:00
Alexey Bataev 84cbd71c95 [SLP]Improve graph reordering.
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.

Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.

The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.

Differential Revision: https://reviews.llvm.org/D105020
2021-08-26 12:31:18 -07:00
Alexey Bataev b00f73d8bf Revert "[SLP]Improve graph reordering."
This reverts commit a28234e37a to
investigate a compiler crash caused by the commit.
2021-08-26 09:19:40 -07:00