Commit Graph

Chuanqi Xu 2ae5011827 [Coroutines] Handle CallBrInst in SalvageDebugInfo
Reviewed by: StephenTozer

Differential Revision: https://reviews.llvm.org/D115139
2021-12-06 22:55:05 +08:00
Sander de Smalen 3d549dddf7 [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
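
For illustration, a minimal sketch of the idea (the surrounding variables SI, VectorTy, CondTy, CostKind, and TTI are assumed, not quoted from the patch):

    // If the select condition is a compare, forward its predicate to the
    // cost model instead of BAD_ICMP_PREDICATE.
    CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
    if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition()))
      Pred = Cmp->getPredicate();
    InstructionCost Cost = TTI.getCmpSelInstrCost(
        Instruction::Select, VectorTy, CondTy, Pred, CostKind);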

I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.

Reviewed By: dmgreen, fhahn

Differential Revision: https://reviews.llvm.org/D114646
2021-12-06 11:41:27 +00:00
Florian Hahn 7c3c352d82
[VPlan] Separate ctors for VPWidenIntOrFpInduction. (NFC)
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
2021-12-05 12:15:18 +00:00
Matt Arsenault a25111c9e2 Attributor: Fix typo in function name 2021-12-04 11:25:22 -05:00
Philip Reames 1a25d0bfbb [LICM] Remove profile driven restriction on hoisting
This reverts change 2c391a5/D87551.  As noted in the llvm-dev thread "LICM as canonical form" sent earlier today, introducing this was a major design change made without sufficient cause.

A profile-driven LICM is not an unreasonable design; it simply is not what we have.  Switching to such a model requires a lot more work than just this patch, and broad agreement that it is the right direction for the optimizer as a whole.

Worth noting is that all the tests included in the reverted change are probably handled if we allow running unconstrained LICM and later run LoopSink.  As such, we have no public examples which motivate a profit-based hoisting approach.
2021-12-03 17:19:25 -08:00
Choongwoo Han 181c4ba467 [CFG] Handle calls with funclet bundle
When the Control Flow Guard check is inserted, the funclet operand bundle was not checked. As a result, incorrect code was generated when the target function call carries a "funclet" bundle.
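
An illustrative IR-level sketch of carrying the bundle over to the emitted check call (variable names are assumed; the actual fix lives in the guard-check insertion code):

    // Copy the operand bundles (including "funclet", if present) from the
    // guarded call onto the newly created Control Flow Guard check call.
    SmallVector<OperandBundleDef, 1> Bundles;
    CB->getOperandBundlesAsDefs(Bundles);
    CallInst *GuardCheck =
        CallInst::Create(GuardFnTy, GuardFn, {Target}, Bundles, "", CB);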

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D114914
2021-12-03 10:51:10 -08:00
Philip Reames 7b54de5fef [funcattrs] Fix a bug in recently introduced writeonly argument inference
This fixes a bug in 740057d.  There are two ways to describe the issue:
* One caller hasn't yet proven nocapture on the argument.  Given that, the inference routine is responsible for bailing out on a potential capture.
* Even if we know the argument is nocapture, the access inference needs to traverse the exact set of users that capture tracking would (or exit conservatively).  Even if capture tracking can prove a store is non-capturing (e.g. to a local alloca which doesn't escape), we still need to track the copy of the pointer to see if it's later reloaded and accessed again.

Note that all the test changes except the newly added ones appear to be false negatives.  That is, cases where we could prove writeonly, but the current code isn't strong enough.  That's why I didn't spot this originally.
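
A plain C++ illustration of the second bullet (a hypothetical example, not from the patch):

    void f(int *p) {
      int *stack_slot;          // local that never escapes
      int **slot = &stack_slot;
      *slot = p;    // capture tracking can prove this store is non-capturing...
      int *q = *slot;  // ...but the copy is reloaded
      *q = 1;          // ...and written through, so p is written after all
    }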
2021-12-03 08:57:15 -08:00
Florian Hahn ead3979a92
[MemoryLocation] Move DSE intrinsic handling to MemoryLocation. (NFC)
Suggested in D114872.
2021-12-03 16:00:39 +00:00
Anna Thomas 72750f0012 [TrivialDeadness] Introduce API separating two different usages
The earlier usage of wouldInstructionBeTriviallyDead is based on the
assumption that the use_count of the instruction being checked is
zero. This patch separates the API into two different forms:

1. The strictly conservative one where the instruction is trivially dead iff the uses are dead.
2. The slightly relaxed form, where an instruction is dead along paths where it is not used.

The second form can be used in identifying instructions that are valid
to sink down to uses (D109917).
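
A hedged sketch of how a caller might distinguish the two forms (the name of the relaxed entry point is assumed from D109917; trySinkToUses is a hypothetical helper):

    // Strict form: only delete I outright when it is trivially dead.
    if (wouldInstructionBeTriviallyDead(&I, &TLI))
      I.eraseFromParent();
    // Relaxed form (assumed name): I may be sunk to its uses, since it is
    // dead along every path on which it is not used.
    else if (wouldInstructionBeTriviallyDeadOnUnusedPaths(&I, &TLI))
      trySinkToUses(I);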

Reviewed-By: reames
Differential Revision: https://reviews.llvm.org/D114647
2021-12-03 10:09:52 -05:00
Florian Hahn f078536f46
[MemoryLocation] Move DSE's logic to new MemLoc::getForDest helper (NFC).
DSE has some extra logic to determine the write location of library
calls like str*cpy and str*cat. This patch moves the logic to a new
MemoryLocation:getForDest variant, which takes a call and TLI.

This patch should be NFC, because no other places take advantage of the
new helper yet.
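
A hedged sketch of a potential call site, assuming the new helper returns an Optional<MemoryLocation> (processWriteLocation is a hypothetical consumer):

    // Ask MemoryLocation for the write target of a libcall like str*cpy.
    if (auto *CB = dyn_cast<CallBase>(I))
      if (Optional<MemoryLocation> DestLoc = MemoryLocation::getForDest(CB, TLI))
        processWriteLocation(*DestLoc);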

Suggested by @reames post-commit 7eec832def.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D114872
2021-12-03 09:12:01 +00:00
Chuanqi Xu 84980761a7 [Coroutines] Handle InvokeInst in SalvageDebugInfo
Since a coroutine is split into pieces, the compiler moves the
dbg.declare intrinsic to after the point where the Storage is created,
to make sure the corresponding debug instruction is still available
after splitting. However, this is problematic if the storage instruction
is an InvokeInst, which is a terminator: we cannot move an instruction
to after an InvokeInst. This patch instead moves the dbg.declare
intrinsic into the normal destination of the InvokeInst. That should be
correct, since the Storage would be invalid on the exception path.
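
A minimal sketch of the fix (variable names assumed; DDI is the dbg.declare intrinsic):

    // If the dbg.declare's Storage is defined by an invoke, we cannot insert
    // after it; place the intrinsic at the start of the normal destination.
    if (auto *II = dyn_cast<InvokeInst>(Storage)) {
      BasicBlock *NormalDest = II->getNormalDest();
      DDI->moveBefore(&*NormalDest->getFirstInsertionPt());
    }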
2021-12-03 13:46:54 +08:00
Hongtao Yu 4e24ca1cdc [CSSPGO] Turn on Profi by default
As titled.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D115011
2021-12-02 17:56:38 -08:00
Philip Reames 740057d185 [funcattrs] Infer writeonly argument attribute
This change extends the current logic for inferring readonly and readnone argument attributes to also infer writeonly.

This change is deliberately minimal; there's a couple of areas for follow up.
* I left out all call handling and thus any benefit from the SCC walk. When examining the test changes, I realized the existing code is imprecise, and am going to fix that in its own revision before adding in the writeonly handling. (Mostly because updating the tests is hard when I, the human, can't figure out whether the result is correct.)
* I left out handling for storing a value (as opposed to storing to a pointer). This should benefit readonly/readnone as well, and applies to a bunch of other instructions. Seemed worth having as a separate review.

Differential Revision: https://reviews.llvm.org/D114963
2021-12-02 13:04:09 -08:00
spupyrev 93a2c2919f profi - a flow-based profile inference algorithm: Part III (out of 3)
This is a continuation of D109860 and D109903.

An important challenge for profile inference is caused by the fact that the
sample profile is collected on a fully optimized binary, while the block and
edge frequencies are consumed on an early stage of the compilation that operates
with a non-optimized IR. As a result, some of the basic blocks may not have
associated sample counts, and it is up to the algorithm to deduce missing
frequencies. The problem is illustrated in the figure, where three basic
blocks are not present in the optimized binary and hence receive no samples
during profiling.

We found that it is beneficial to treat all such blocks equally. Otherwise the
compiler may decide that some blocks are “cold” and apply undesirable
optimizations (e.g., hot-cold splitting), regressing the performance. Therefore,
we want to distribute the counts evenly among the blocks with missing samples.
This is achieved by a post-processing step that identifies "dangling" subgraphs
consisting of basic blocks with no sampled counts; once the subgraphs are
found, we rebalance the flow so that every branch probability is 50:50 within
the subgraphs.
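
A toy sketch of the rebalancing idea (not the profi implementation; the names and data layout are invented for illustration):

    #include <vector>

    struct Block {
      double Flow = 0;
      std::vector<int> Succs;
    };

    // Push each dangling block's inflow out evenly, so a block with two
    // successors ends up with 50:50 branch probabilities.
    void rebalanceEvenly(std::vector<Block> &BBs,
                         const std::vector<int> &Dangling) {
      for (int Id : Dangling) {
        Block &B = BBs[Id];
        if (B.Succs.empty())
          continue;
        double Share = B.Flow / B.Succs.size();
        for (int S : B.Succs)
          BBs[S].Flow += Share;
      }
    }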

Our experiments indicate up to 1% performance win using the optimization on
some binaries and a significant improvement in the quality of profile counts
(when compared to ground-truth instrumentation-based counts).

{F19093045}

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D109980
2021-12-02 12:01:30 -08:00
spupyrev 98dd2f9ed3 profi - a flow-based profile inference algorithm: Part II (out of 3)
This is a continuation of D109860.

Traditional flow-based algorithms cannot guarantee that the resulting edge
frequencies correspond to a *connected* flow in the control-flow graph. For
example, for an instance in the attached figure, a flow-based (or any other)
inference algorithm may produce an output in which the hot loop is disconnected
from the entry block (refer to the rightmost graph in the figure). Furthermore,
creating a connected minimum-cost maximum flow is a computationally NP-hard
problem. Hence, we apply post-processing adjustments to the computed flow
by connecting all isolated flow components ("islands").

This feature helps to keep all blocks with sample counts connected and results
in significant performance wins for some binaries.
{F19077343}

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D109903
2021-12-02 11:04:21 -08:00
Alexey Bataev ba74bb3a22 [SLP]Fix reused extracts cost.
If an extractelement instruction is used multiple times in different
tree entries (either vectorized or gathered), we need to compensate for
the scalar cost of such instructions. They are completely removed if all
users are part of the tree, but we need to compensate the cost only once
for each instruction.

Differential Revision: https://reviews.llvm.org/D114958
2021-12-02 10:52:00 -08:00
Kazu Hirata 22d82949b0 [llvm] Fix "unused variable" warnings 2021-12-02 09:20:17 -08:00
Alexey Bataev 8ceccbd321 [SLP]Outline and fix code for finding common insertelement vectors.
Outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. This also
improves the process by adding some extra checks for early exit, and
fixes a bug where a match was always found because the same values were
erroneously compared.

Differential Revision: https://reviews.llvm.org/D114909
2021-12-02 09:18:25 -08:00
Alexey Bataev 92fbd76af5 [SLP]Improve registering and merging of compatible shuffles.
If several shuffle instructions are emitted, some of them might be the
same as, or compatible with (less defined than), previously emitted
ones. Such shuffles can be removed safely, improving the total cost of
the vectorized code.

Differential Revision: https://reviews.llvm.org/D114087
2021-12-02 08:48:29 -08:00
Djordje Todorovic 2cdc6f2ca6 Reland "[LICM] Hoist LOAD without sinking the STORE"
When doing load/store promotion within LICM, if we
cannot prove that it is safe to sink the store, we won't
hoist the load, even though we can prove the load could
be dereferenced and moved outside the loop. This patch
implements the load promotion by moving the load into the
loop preheader and inserting a proper PHI in the loop. The
store is kept as is in the loop. By doing this, we avoid
loading from the memory location in each iteration.

Please consider this small example:

loop {
  var = *ptr;
  if (var) break;
  *ptr= var + 1;
}
After this patch, it will be:

var0 = *ptr;
loop {
  var1 = phi (var0, var2);
  if (var1) break;
  var2 = var1 + 1;
  *ptr = var2;
}
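
A hedged sketch of how the preheader load and loop PHI could be wired up with the LLVM API (variable names assumed):

    // Hoist the load into the preheader and thread its value through a PHI.
    LoadInst *PreLoad = new LoadInst(LoadTy, Ptr, "promoted.pre",
                                     Preheader->getTerminator());
    PHINode *PN = PHINode::Create(LoadTy, /*NumReservedValues=*/2, "promoted",
                                  &*Header->begin());
    PN->addIncoming(PreLoad, Preheader);
    PN->addIncoming(StoredValue, Latch); // value stored back on the backedge
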
This addresses some problems from [0].

[0] https://bugs.llvm.org/show_bug.cgi?id=51193

Differential revision: https://reviews.llvm.org/D113289
2021-12-02 03:53:50 -08:00
Florian Hahn 2de5f39e54
[BuildLibCalls] Add support for memset_pattern{4,8}.
Add support for memset_pattern{4,8} similar to the existing
memset_pattern16 handling.

Reviewed By: ab

Differential Revision: https://reviews.llvm.org/D114883
2021-12-02 11:04:25 +00:00
Nikita Popov a0ff26e08c [GlobalOpt] Fix assertion failure during instruction deletion
This fixes the assertion failure reported in https://reviews.llvm.org/D114889#3166417,
by making RecursivelyDeleteTriviallyDeadInstructionsPermissive()
more permissive. As the function accepts a WeakTrackingVH, even if
originally only Instructions were inserted, we may end up with
different Value types after a RAUW operation. As such, we should
not assume that the vector only contains instructions.

Notably this matches the behavior of the
RecursivelyDeleteTriviallyDeadInstructions() function variant which
accepts a single value rather than a vector.
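
A minimal usage sketch (surrounding context assumed):

    // Entries start out as Instructions, but RAUW during deletion can turn
    // a WeakTrackingVH into some other Value; the permissive variant must
    // tolerate that instead of asserting.
    SmallVector<WeakTrackingVH, 8> DeadInsts;
    DeadInsts.push_back(I);
    RecursivelyDeleteTriviallyDeadInstructionsPermissive(DeadInsts, &TLI);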
2021-12-02 11:58:39 +01:00
Florian Hahn 0496edad49
[BuildLibCalls] Add additional attrs to memcpy_chk.
`memcpy_chk` can be treated like `memcpy`, with the exception that it
may not return (if it aborts the program).

See D114793 for a similar patch for `memset_chk`.

Reviewed By: xbolva00

Differential Revision: https://reviews.llvm.org/D114863
2021-12-02 09:50:14 +00:00
Daniel Sanders 54e21df973 [unroll] Fix a functional change in an NFC patch
5c77aa2b91 [unroll] Use early return in shouldFullUnroll [nfc]
wasn't quite NFC, since !(x <= y) is x > y rather than x >= y.

Credit to Justin Bogner for spotting the bug

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D114894
2021-12-01 17:28:12 -08:00
Arthur Eubanks 512534bc16 [Cloning] Clone metadata on function declarations
Previously we missed cloning metadata on function declarations because
we don't call CloneFunctionInto() on declarations in CloneModule().

Reviewed By: dexonsmith

Differential Revision: https://reviews.llvm.org/D113812
2021-12-01 15:40:05 -08:00
spupyrev 7cc2493daa profi - a flow-based profile inference algorithm: Part I (out of 3)
The benefits of sampling-based PGO crucially depend on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome inaccuracies in a profile after it is collected.

Profi is an extended and significantly re-engineered classic MCMF (min-cost
max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing
missing and inaccurate profiling using a minimum cost circulation algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Three important challenges solved by profi:
- "fixing" errors in profiles caused by sampling;
- converting basic block counts to edge frequencies (branch probabilities);
- dealing with "dangling" blocks having no samples in the profile.

The main implementation (and required docs) are in SampleProfileInference.cpp.
The worst-time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However, careful engineering and extensive evaluation show that
the running time is (slightly) super-linear. In particular, instances with
1000 blocks are solved within 0.1 seconds.

The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves the performance (with a few outliers) but extra work in
the compiler might be needed to re-tune existing optimization passes relying on
profile counts.

UPD Dec 1st 2021:
- synced the declaration and definition of the option `SampleProfileUseProfi` to use type `cl::opt<bool>`;
- added `inline` for `SampleProfileInference<BT>::findUnlikelyJumps` and `SampleProfileInference<BT>::isExit` to avoid linking problems on Windows.

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D109860
2021-12-01 15:30:38 -08:00
Nikita Popov 8d1759c404 [GlobalOpt] Simplify CleanupConstantGlobalUsers()
This bases the CleanupConstantGlobalUsers() implementation around
the ConstantFoldLoadFromConst() API. The general approach is that
we discover all users while looking through casts, and then
constant fold loads and drop stores and memintrinsics.

This avoids special cases and limitations in the previous
implementation, which is also incompatible with opaque pointers.
The result is a bit more powerful than before, because we now use
more general load folding logic which can for example look through
pointer bitcasts between different sizes. This is where the test
changes come from, as we now fold more loads and can thus remove
more globals.
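
A hedged sketch of the folding step (the discovery logic is omitted; Init, U, Offset, and DL are assumed from the surrounding code):

    // Fold a load of a constant global's initializer at a known offset.
    if (auto *LI = dyn_cast<LoadInst>(U))
      if (Constant *Folded =
              ConstantFoldLoadFromConst(Init, LI->getType(), Offset, DL)) {
        LI->replaceAllUsesWith(Folded);
        LI->eraseFromParent();
      }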

Differential Revision: https://reviews.llvm.org/D114889
2021-12-01 21:06:25 +01:00
Ellis Hoag 9e647806f3 [InstrProf][NFC] Refactor ProfileDataMap usage
Instead of using `DenseMap::find()` and `DenseMap::insert()`, use
`DenseMap::operator[]` to get a reference to the profile data and update
the reference. This simplifies the changes in D114565.
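
A before/after sketch (the struct and field names are assumed from InstrProfiling, not quoted from the patch):

    // Before: find() followed by insert() on a miss.
    // After: operator[] default-constructs the entry on a miss and returns
    // a reference that can be updated in place.
    PerFunctionProfileData &PD = ProfileDataMap[NamePtr];
    PD.RegionCounters = CounterPtr;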

Reviewed By: kyulee

Differential Revision: https://reviews.llvm.org/D114828
2021-12-01 11:47:14 -08:00
Joseph Huber 058c312a44 [OpenMP][FIX] SPMDzation guarding needs to account for all reaching kernels
If two reaching kernels disagree on the execution mode, we cannot guard
a function right now. Ensure we do not try, as we would otherwise cause
a deadlock.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D114866
2021-12-01 11:44:32 -05:00
Alexey Bataev afc9e7517a [SLP]Improve cost model for the shuffled extracts.
Improved the cost calculation for shuffled extracts, where possible. We
need to calculate the cost of the extracted scalars if some users are
not insertelements, and improved the total estimation of the shuffled
scalars used in insertelement build vectors.

Differential Revision: https://reviews.llvm.org/D113782
2021-12-01 08:10:57 -08:00
Alexey Bataev cc30fbf242 [SLP]Introduce isUndefVector function to check for undef vectors.
An undefined vector might be not only an UndefValue, but also a constant
vector with undef or poison elements; we need to check for this kind of
undef too.
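
A hedged sketch of such a check (an illustration of the idea, not the committed implementation):

    // Treat both UndefValue (which covers poison) and constant vectors
    // whose elements are all undef/poison as "undef vectors".
    static bool isUndefVectorSketch(const Value *V) {
      if (isa<UndefValue>(V))
        return true;
      auto *C = dyn_cast<Constant>(V);
      if (!C || !isa<FixedVectorType>(C->getType()))
        return false;
      unsigned NumElts = cast<FixedVectorType>(C->getType())->getNumElements();
      for (unsigned I = 0; I != NumElts; ++I) {
        Constant *Elt = C->getAggregateElement(I);
        if (!Elt || !isa<UndefValue>(Elt))
          return false;
      }
      return true;
    }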

Differential Revision: https://reviews.llvm.org/D114873
2021-12-01 07:46:10 -08:00
Alexey Bataev ddce6e0561 [SLP]Improve vectorization of cmp instructions sequences.
A final attempt to vectorize bundles of compatible cmp instructions after
all other instructions have been processed.

Metric: SLP.NumVectorInstructions

Program                                                                             results results0 diff
        test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    1.00    5.00  400.0%
                              test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    8.00   11.00   37.5%
                    test-suite :: MultiSource/Benchmarks/Olden/voronoi/voronoi.test   20.00   26.00   30.0%
                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1344.00 1648.00   22.6%
               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1344.00 1648.00   22.6%
                              test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test  102.00  124.00   21.6%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  118.00  133.00   12.7%
          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3233.00 3554.00    9.9%
           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3233.00 3554.00    9.9%
                        test-suite :: MultiSource/Benchmarks/Olden/power/power.test   64.00   70.00    9.4%
           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 7879.00 8604.00    9.2%
           test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test   50.00   54.00    8.0%
                        test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   27.00   29.00    7.4%
             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 8345.00 8955.00    7.3%
     test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  694.00  738.00    6.3%
                        test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  361.00  382.00    5.8%
                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  409.00  430.00    5.1%
     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  140.00  147.00    5.0%
      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  140.00  147.00    5.0%
             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4013.00 4206.00    4.8%
                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  966.00 1011.00    4.7%
                           test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   65.00   68.00    4.6%
                            test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4219.00 4381.00    3.8%
                    test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1911.00 1973.00    3.2%
      test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   62.00   64.00    3.2%
     test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   62.00   64.00    3.2%
                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  852.00  877.00    2.9%
                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  852.00  877.00    2.9%
                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1624.00 1668.00    2.7%
                         test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   39.00   40.00    2.6%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  613.00  624.00    1.8%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  378.00  383.00    1.3%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  293.00  295.00    0.7%
            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  297.00  299.00    0.7%
      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5522.00 5534.00    0.2%
     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5522.00 5534.00    0.2%

Differential Revision: https://reviews.llvm.org/D114799
2021-12-01 07:26:29 -08:00
Florian Hahn e44298a8f8
[LV] Move code from vectorizeMemoryInstruction to recipe's execute().
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.

Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114324
2021-12-01 14:56:51 +00:00
Djordje Todorovic 72f9f066df Revert "[LICM] Hoist LOAD without sinking the STORE"
This reverts commit ecb9d8e4e3.

I'll reland this as soon as the failing tests are fixed/updated.
2021-12-01 04:39:26 -08:00
Djordje Todorovic ecb9d8e4e3 [LICM] Hoist LOAD without sinking the STORE
When doing load/store promotion within LICM, if we
cannot prove that it is safe to sink the store, we won't
hoist the load, even though we can prove the load could
be dereferenced and moved outside the loop. This patch
implements the load promotion by moving the load into the
loop preheader and inserting a proper PHI in the loop. The
store is kept as is in the loop. By doing this, we avoid
loading from the memory location in each iteration.

Please consider this small example:

loop {
  var = *ptr;
  if (var) break;
  *ptr= var + 1;
}
After this patch, it will be:

var0 = *ptr;
loop {
  var1 = phi (var0, var2);
  if (var1) break;
  var2 = var1 + 1;
  *ptr = var2;
}
This addresses some problems from [0].

[0] https://bugs.llvm.org/show_bug.cgi?id=51193

Differential revision: https://reviews.llvm.org/D113289
2021-12-01 04:27:50 -08:00
Nikita Popov 1e1a8be21f [LICM] Support opaque pointers in scalar promotion
Make sure that all pointers have the same load/store access type,
rather than comparing pointer element types.
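
A minimal sketch of the check (surrounding context and names assumed):

    // Compare the promoted access type instead of pointer element types,
    // which do not exist with opaque pointers.
    Type *AccessTy = getLoadStoreType(&UI); // load result / stored value type
    if (AccessTy != PromotedAccessTy)
      return false; // mixed access types: give up on promotion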
2021-12-01 12:56:24 +01:00
Florian Hahn 6a5e29d13f
[BuildLibCalls] Add argmemonly, writeonly, nounwind to memset_chk.
The memset_chk library function should match memset's attributes with
respect to memory effects (argmemonly, writeonly). It also does not
raise exceptions. It may not return, in case it aborts the program.
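
A sketch in the style of the attribute inference in BuildLibCalls.cpp (the exact helper names, in particular setOnlyWritesMemory, are assumed):

    case LibFunc_memset_chk:
      // memset_chk only touches memory through its pointer argument, only
      // writes it, and does not unwind; it may abort, so no willreturn.
      Changed |= setOnlyAccessesArgMemory(F);
      Changed |= setOnlyWritesMemory(F); // assumed helper
      Changed |= setDoesNotThrow(F);
      break;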

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D114793
2021-12-01 10:09:52 +00:00
Nikita Popov 84b978da3b [LoopUnrollRuntime] Remove unnecessary pointer BECount check (NFC)
BECounts are guaranteed to be integers nowadays.
2021-12-01 10:32:37 +01:00
Florian Hahn 7de410440d
[DSE] Allow DSE to optimize MemorySSA by default.
This allows for better optimization of 'stores-of-existing-values' and
possibly helps passes further down the pipeline.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D113712
2021-12-01 08:29:23 +00:00
Markus Lavin ce22b7f17b [NPM] Fix LoopNestPasses in -print-pipeline-passes
Fix printing of LoopNestPasses when using the opt pipeline printer
option -print-pipeline-passes.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D114771
2021-12-01 07:57:17 +01:00
Srividya Karumuri 9e3e1aad31 [InstCombine] Allow fake vector insert folding to bit-logic only if the insert element is integer type
The commit below was causing an assertion when the insert element type
is not an integer type, such as half. This is because the transformation
creates a zext before doing the bitwise OR, and zext is supported only
for integer types:
80ab06c599

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D114734
2021-11-30 13:54:52 -08:00
Alexey Bataev dce6c434ea [SLP]Improve isFixedVectorShuffle and its use.
Extended support for undefined source vectors, undefined extract
indices, and non-fixed vector types; also, there is no need to check the
parent of extractelement instructions with constant indices.

Differential Revision: https://reviews.llvm.org/D114121
2021-11-30 10:10:20 -08:00
Alexey Bataev fc57cfad3c [SLP][NFC] Move a static function to make it visible in a member function. NFC.
2021-11-30 09:38:46 -08:00
Hongtao Yu bf317f6698 [CSSPGO] Sorting nodes in a cycle of profiled call graph.
For nodes that are in a cycle of a profiled call graph, the order the underlying scc_iter computes purely depends on how those nodes are reached from outside the SCC and inside the SCC, based on the Tarjan algorithm. This does not honor profile edge hotness, and thus does not guarantee that hot callsites are inlined prior to cold callsites. To mitigate that, I'm adding an extra sorter on top of scc_iter to sort SCC functions in the order of callsite hotness, instead of changing the internals of scc_iter.

Sorting on callsite hotness could be optimally based on detecting cycles on a directed call graph, i.e., removing the coldest edge until a cycle is broken. However, detecting cycles isn't cheap. I'm using an MST-based approach which is faster and appears to deliver some performance wins.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D114204
2021-11-30 09:01:08 -08:00
Philip Reames c41b318423 [LV] Remove unneeded cast to Operator [NFC] 2021-11-30 08:45:13 -08:00
Florian Hahn c9ad356266
[DSE] Use optimized access if available for redundant store elimination.
Using the optimized access enables additional optimizations in cases
where the defining access is a non-aliasing store.

Alternatively we could also walk upwards and skip non-aliasing defs
here, but my experiments so far showed that this will noticeably
increase compile-time for little extra gain compared to just using the
optimized access.
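
A hedged sketch of preferring the optimized access (names assumed; SI is the store being examined):

    // Use the store's optimized access, if available, as the upper def
    // instead of the immediate defining access.
    MemoryDef *Def = cast<MemoryDef>(MSSA.getMemoryAccess(SI));
    MemoryAccess *UpperDef =
        Def->isOptimized() ? Def->getOptimized() : Def->getDefiningAccess();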

Improvements of dse.NumRedundantStores on MultiSource/CINT2006/CFP2006
on X86 with -O3:

     test-suite...-typeset/consumer-typeset.test     1.00                  76.00              7500.0%
     test-suite.../Benchmarks/Bullet/bullet.test     3.00                  12.00              300.0%
     test-suite...006/453.povray/453.povray.test     3.00                   6.00              100.0%
     test-suite...telecomm-gsm/telecomm-gsm.test     1.00                   2.00              100.0%
     test-suite...ediabench/gsm/toast/toast.test     1.00                   2.00              100.0%
     test-suite...marks/7zip/7zip-benchmark.test     1.00                   2.00              100.0%
     test-suite...ications/JM/lencod/lencod.test     7.00                  10.00              42.9%
     test-suite...6/464.h264ref/464.h264ref.test     6.00                   8.00              33.3%
     test-suite...ications/JM/ldecod/ldecod.test     6.00                   7.00              16.7%
     test-suite...006/447.dealII/447.dealII.test    33.00                  33.00               0.0%
     test-suite...6/471.omnetpp/471.omnetpp.test    NaN                     1.00               nan%
     test-suite...006/450.soplex/450.soplex.test    NaN                     2.00               nan%
     test-suite.../CINT2006/403.gcc/403.gcc.test    NaN                     7.00               nan%
     test-suite...lications/ClamAV/clamscan.test    NaN                     1.00               nan%
     test-suite...CI_Purple/SMG2000/smg2000.test    NaN                     3.00               nan%

Follow-up to D111727.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D112315
2021-11-30 15:40:14 +00:00
Joseph Huber 7986a5f23e [OpenMP] Add RTL function to externalization RAII
This patch adds the `__kmpc_get_warp_size` OpenMP RTL function to the
externalization RAII struct. This was getting optimized out and then
being replaced with an undefined value once added back in, causing bugs
for complex reductions.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D114802
2021-11-30 10:19:06 -05:00
Florian Hahn dab776dd0f
[LV] Move code from widenSelectInstruction to VPWidenSelectRecipe. (NFC)
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.

Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114323
2021-11-30 10:32:44 +00:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector
address, and then multiply it by the vector width. This doesn't make any
sense: it implies that we'd do a vector GEP and then scalarize the vector of
pointers, but there is no such thing in the vectorized IR; we perform scalar
GEPs.
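
A hedged paraphrase of the problematic costing (names assumed from the LoopVectorize cost-model context, not quoted from the patch):

    // Old behavior: cost one *vector* address computation, then scale by VF,
    // even though scalarized gathers/scatters emit plain scalar GEPs.
    InstructionCost AddrCost =
        TTI.getAddressComputationCost(VectorPtrTy, &SE, PtrSCEV);
    Cost += AddrCost * VF.getFixedValue(); // pessimistic, especially on X86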

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that the cost of a vector address computation is `10`, as compared to `1`
for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Fabian Wolff 45ecfed6c6 [CVP] Remove ashr of -1 or 0
Fixes PR#52190. There is already a check for converting ashr instructions with non-negative left-hand sides into lshr; this patch adds an optimization to remove ashr altogether if the left-hand side is known to be in the range [-1, 1).
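
A hedged sketch of the fold (LVI-based; SDI and the surrounding names are assumed):

    // If the shift's LHS is known to be -1 or 0 (range [-1, 1)), an ashr by
    // any amount is the identity, so replace the ashr with its LHS.
    ConstantRange LRange = LVI->getConstantRange(SDI->getOperand(0), SDI);
    unsigned BW = LRange.getBitWidth();
    ConstantRange NegOneOrZero(APInt(BW, -1, /*isSigned=*/true),
                               APInt(BW, 1));
    if (NegOneOrZero.contains(LRange)) {
      SDI->replaceAllUsesWith(SDI->getOperand(0));
      SDI->eraseFromParent();
    }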

Differential Revision: https://reviews.llvm.org/D113835
2021-11-29 15:32:09 -08:00