Commit Graph

612 Commits

Author SHA1 Message Date
Arthur Eubanks f3a928e233 [opt] Don't translate legacy -analysis flag to require<analysis>
Tests relying on this should explicitly use -passes='require<analysis>,foo'.
2022-10-07 14:54:34 -07:00
Arthur Eubanks d3d8465446 [opt] Stop treating alias analysis specially when translating legacy opt syntax
I've attempted to keep AA tests as close to their original intent as possible.
2022-10-07 11:50:43 -07:00
Philip Reames 32dc1151e2 [VPlan] Only generate single instr for unpredicated stores of varying value to invariant address
This extends the previously added uniform store case to handle stores of loop varying values to a loop invariant address. Note that the placement of this code only allows unpredicated stores; this is important for correctness. (That is "IsPredicated" is always false at this point in the function.)

This patch does not include scalable types. The diff felt "large enough" as it were; I'll handle that in a separate patch. (It requires some changes to cost modeling.)

Differential Revision: https://reviews.llvm.org/D133580
2022-09-22 08:53:46 -07:00
Simon Pilgrim e030be64d8 [CostModel][X86] Add partial CostKinds handling for funnelshifts/rotates
This mainly just adds costs for the targets where we have actual funnelshift/rotate instructions (VBMI2/XOP etc.) - the cases where we expand still need addressing, although for many the default shift+or expansion, especially for uniform cases, isn't that bad.

This was achieved with the 'cost-tables vs llvm-mca' script D103695
2022-09-22 11:24:11 +01:00
Simon Pilgrim b2cd8118d0 [CostModel][X86] Add CostKinds handling for smax/smin/umax/umin instructions
This was achieved with the 'cost-tables vs llvm-mca' script D103695
2022-09-22 10:19:23 +01:00
Philip Reames 8c46881a53 [TTI] Recognize fp constants in getOperandInfo
We were recognizing vectors of floats, but not scalars.  That's a tad odd.
2022-09-21 14:34:34 -07:00
Simon Pilgrim 71162ad957 [LoopVectorize] Fix test name - the test is for fshl not cttz intrinsic costs 2022-09-21 15:24:43 +01:00
Simon Pilgrim 09cb9fdef9 [InstCombine] Fold ult(add(x,-1),c) -> ule(x,c) iff x != 0 (PR57635)
Alive2: https://alive2.llvm.org/ce/z/sZ6wwS

As detailed on Issue #57635 and #37628 - for unsigned comparisons, we can compare prior to a decrement iff the value is known never to be zero.

Differential Revision: https://reviews.llvm.org/D134172
2022-09-20 16:44:41 +01:00
Vitaly Buka bbef90ace4 [IRBuilder] Use PoisonValue in CreateMasked*
Followup to 72b776168c

Reviewed By: nlopes

Differential Revision: https://reviews.llvm.org/D133967
2022-09-19 11:01:41 -07:00
Florian Hahn 582f8ef19f
[LV] Keep track of cost-based ScalarAfterVec in VPWidenPointerInd.
Epilogue vectorization uses isScalarAfterVectorization to check if
widened versions for inductions need to be generated and bails out in
those cases.

At the moment, there are scenarios where isScalarAfterVectorization
returns true but VPWidenPointerInduction::onlyScalarsGenerated would
return false, causing widening.

This can lead to widened phis with incorrect start values being created
in the epilogue vector body.

This patch addresses the issue by storing the cost-model decision in
VPWidenPointerInductionRecipe and restoring the behavior before 151c144.
This effectively reverts 151c144, but the long-term fix is to properly
support widened inductions during epilogue vectorization

Fixes #57712.
2022-09-19 18:14:35 +01:00
Sebastian Peryt 99c9b37d11 [NFC][1/n] Remove -enable-new-pm=0 flags from lit tests
This is the first patch in a series intended for removing flag
-enable-new-pm=0 from lit tests. This is part of a bigger
effort of completely removing legacy code related to legacy
pass manager in favor of currently default new pass manager.

In this patch flag has been removed only from tests where no significant
change has been required because checks has been duplicated for
both PMs.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D134150
2022-09-19 09:57:37 -07:00
Simon Pilgrim 7e626d7a89 [LoopVectorize][X86] Use quotes around the pass list to appease DOS cmd evaluation
DOS can't handle -passes='default<O3>' correctly
2022-09-19 10:24:37 +01:00
Sanjay Patel d6498abc24 [InstCombine] remove multi-use add demanded constant fold
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.
2022-09-18 14:23:43 -04:00
Vitaly Buka ed188b39ab [test] Regenerate few tests 2022-09-15 12:36:32 -07:00
Simon Pilgrim 0ec028fe10 [CostModel][X86] Add CostKinds handling for vector shift by uniform/constuniform ops
Vector shift by const uniform is the cheapest shift instruction we have, non-const uniform have a marginally higher cost - some targets 'splat' the amount internally to use the shift-per-element instruction, others see a higher cost for the explicit zeroing of the upper bits for the (64-bit) shift amount.

This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
2022-09-15 14:05:30 +01:00
Simon Pilgrim 8ae9cf550b [LoopVectorize][X86] Add uniform shift costs checks for VF=1/2/4 2022-09-13 13:46:52 +01:00
Philip Reames 4e295cb1ce [LV] Autogen a test for ease of update 2022-09-09 08:16:22 -07:00
Philip Reames edb26268ce [VPlan] Only generate single instr for stores uniform across all parts.
Extend the approach taken by D133019 to store instructions.

Differential Revision: https://reviews.llvm.org/D133497
2022-09-09 07:15:12 -07:00
Florian Hahn 422cf99161
[VPlan] Only generate single instr for loads uniform across all parts.
VPReplicateRecipe::isUniform actually means uniform-per-parts, hence a
scalar instruction is generated per-part.

This is a potential alternative D132892. For now the current patch only
catches cases where the address is trivially invariant (defined outside
VPlan), while D132892 catches any address that is considered invariant
by SCEV AFAICT.

It should be possible to hoist fully invariant recipes feeding loads out
of the vector loop region as well, but in practice LICM should do that
already.

This version of the patch artificially limits this to loads to make it
easier to compare, but this restriction should be easily liftable.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D133019
2022-09-08 14:27:58 +01:00
Philip Reames 4c10646367 [LV] Refresh autogen tests to reflect naming changes [nfc]
Purely so that these can be easily autogened without spurious diffs
2022-08-29 14:16:54 -07:00
David Green e29f9f7572 [AArch64][X86] Add some fixed-order-recurrence tests to check the costmodel of fixed order recurrences. NFC 2022-08-24 08:18:01 +01:00
Jay Foad 2754ff883d [InstCombine] Try not to demand low order bits for Add
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
  overflow and affect higher bit positions

This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.

Differential Revision: https://reviews.llvm.org/D130075
2022-08-22 20:03:53 +01:00
Florian Hahn a34428f07d
[LV] Use variables instead of hard-coded metadata IDs in tests. 2022-08-16 12:21:49 +01:00
Philip Reames 0b47615fcf [LV] Recognize store of invariant value to invariant address as uniform
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.

For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)

An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.

Differential Revision: https://reviews.llvm.org/D130364
2022-08-02 08:09:49 -07:00
Florian Hahn 16e0620d6d
[VPlan] Mark VPPredInstPHIRecipe as not having side-effects.
Now that all uses of VPPredInstPHIRecipes are properly modeled, they can
be treated as not having side-effects, enabling removal.
2022-07-27 19:29:26 +01:00
Sanjay Patel bfb9b8e075 [Passes] add a tail-call-elim pass near the end of the opt pipeline
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.

I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031

Differential Revision: https://reviews.llvm.org/D130374
2022-07-25 15:25:47 -04:00
Nuno Lopes 9df0b254d2 [NFC] Switch a few uses of undef to poison as placeholders for unreachable code 2022-07-23 21:50:11 +01:00
Philip Reames f934b9b073 [LV] Refresh a couple of autogen tests for naming change
These appear to just be changes in temporary identifiers; bit suprising we have so many.
2022-07-20 14:47:52 -07:00
Philip Reames be25f52fec [LV] Autogen several tests for ease of update in upcoming change 2022-07-20 07:17:51 -07:00
Nikita Popov a5ee62a141 [IndVars] Call replaceLoopPHINodesWithPreheaderValues() for already constant exits
Currently we only call replaceLoopPHINodesWithPreheaderValues() if
optimizeLoopExits() replaces the exit with an unconditional exit.
However, it is very common that this already happens as part of
eliminateIVComparison(), in which case we're leaving behind the
dead header phi.

Tweak the early bailout for already-constant exits to also call
replaceLoopPHINodesWithPreheaderValues().

Differential Revision: https://reviews.llvm.org/D129214
2022-07-13 09:43:21 +02:00
Florian Hahn bc19b7c3cc
[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE.
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.

Depends D127580.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128408
2022-07-07 08:40:27 -07:00
Nikita Popov 11950efe06 [ConstExpr] Remove div/rem constant expressions
D128820 stopped creating div/rem constant expressions by default;
this patch removes support for them entirely.

The getUDiv(), getExactUDiv(), getSDiv(), getExactSDiv(), getURem()
and getSRem() on ConstantExpr are removed, and ConstantExpr::get()
now only accepts binary operators for which
ConstantExpr::isSupportedBinOp() returns true. Uses of these methods
may be replaced either by corresponding IRBuilder methods, or
ConstantFoldBinaryOpOperands (if a constant result is required).

On the C API side, LLVMConstUDiv, LLVMConstExactUDiv, LLVMConstSDiv,
LLVMConstExactSDiv, LLVMConstURem and LLVMConstSRem are removed and
corresponding LLVMBuild methods should be used.

Importantly, this also means that constant expressions can no longer
trap! This patch still keeps the canTrap() method to minimize diff --
I plan to drop it in a separate NFC patch.

Differential Revision: https://reviews.llvm.org/D129148
2022-07-06 10:11:34 +02:00
Florian Hahn 644a965c1e
[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368
2022-07-04 15:11:39 +01:00
Florian Hahn 0dddf04cab
[LV] Don't optimize exit cond during epilogue vectorization.
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.

One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.

To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.

Fixes #56319.
2022-07-01 13:48:38 +01:00
Florian Hahn afd9f422e4
[LV] Update test for #56319 to use interleave group.
The original test was over-reduced. It requires an interleave group, so
the last vector iteration of the epilogue vector loop doesn't execute.
2022-07-01 11:12:00 +01:00
Florian Hahn 8704cfc744
[LV] Add test case for #56319.
Test case for PR56319.
2022-07-01 10:09:24 +01:00
Sanjay Patel cc88445a91 [InstCombine] canonicalize 'icmp (trunc X), C' to 'icmp (X & Mask), C'
I looked at canonicalizing in the other direction, but that causes
many potential regressions and infinite loops because we already
(possibly wrongly) canonicalize "trunc X to i1" into an and+icmp.

This has a data layout restriction to avoid creating illegal
mask instructions, but we could remove that if we can show
that the backend can undo this when needed.

The motivating example from issue #56119 is modeled by the
PhaseOrdering test.
2022-06-30 15:51:39 -04:00
Florian Hahn 68884dde70
[LV] Move LoopVersioning creation to LVP::execute.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.

LVer will be moved to VPTransformState as follow-up.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127966
2022-06-30 12:14:32 +01:00
Florian Hahn 85983ca42e
[VPlan] Replace remaining use of needsScalarIV.
All information is already available in VPlan. Note that there are some
test changes, because we now can correctly look through instructions
like truncates to analyze the actual users.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123541
2022-06-09 12:05:37 +01:00
Florian Hahn eaf48dd9b0
[VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF.
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.

This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.

The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.

There are further simplifications that can be applied as follow ups:

1. Replace inductions with constants
2. Replace vector region with regular block.

Fixes #55354.

Depends on D126679.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126680
2022-06-06 09:38:53 +01:00
Nikita Popov 36cbdaa163 [InstCombine] Fix inbounds preservation when swapping GEPs (PR44206)
When reassociating GEPs, we can only keep inbounds if both original
GEPs were inbounds, and their offsets have the same sign. For the
sake of simplicity, I only handle the case where both offsets are
non-negative here.

It would probably be fine to just not preserve inbounds at all here,
but as I don't see a compile-time impact for adding the
isKnownNonNegative() calls I went with this more conservative
approach.

Fixes https://github.com/llvm/llvm-project/issues/44206.

Differential Revision: https://reviews.llvm.org/D126687
2022-05-31 15:45:02 +02:00
Ivan Kosarev ad1d60c3be [FileCheck] Catch missspelled directives.
Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D125604
2022-05-26 11:37:19 +01:00
Peter Waller ade47bdc31 [LV] Improve register pressure estimate at high VFs
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`.  `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type.  However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.

Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).

This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.

I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT.  I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.

Differential Revision: https://reviews.llvm.org/D125918
2022-05-23 07:57:45 +00:00
Florian Hahn 3bebec6592
[VPlan] Model first exit values using VPLiveOut.
This patch introduces a new VPLiveOut subclass of VPUser  to model
 exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.

Fixes #51366, #54867, #55167, #55459

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123537
2022-05-21 16:01:38 +01:00
Florian Hahn cd61d4bd2f
[LV] Do not LoopSimplify/LCSSA after generating main vector loop.
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.

In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.

This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.

But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.

Reviewed By: bmahjour

Differential Revision: https://reviews.llvm.org/D125810
2022-05-20 09:58:40 +01:00
Florian Hahn d92cec4c96
[LV] Regenerate check lines for some tests.
Make sure the auto-generated check lines are up-to-date for some files,
to reduce the test diff in upcoming changes
2022-05-17 17:45:01 +01:00
Florian Hahn b7315ffc3c
[LAA,LV] Add initial support for pointer-diff memory checks.
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.

The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.

Note that the initial version is restricted in multiple ways:

1. Pointers must only either be read or written, by a single
   instruction (this allows re-constructing source/sink for
   dependences with the available information)
 2. Source and sink pointers must be add-recs, with matching steps
 3. The step must be a constant.
 3. abs(step) == AccessSize.

Most of those restrictions can be relaxed in the future.

See https://github.com/llvm/llvm-project/issues/53590.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D119078
2022-05-16 15:27:22 +01:00
Florian Hahn 38189438b6
[LV] Add crashing test from #55096. 2022-05-12 22:40:28 +01:00
Florian Hahn 635b752211
[VPlan] VPInterleaveRecipe only uses first lane if op not stored.
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.

Fixes #55375.
2022-05-11 11:24:56 +01:00
Florian Hahn e79c1962b9
[LV] Add opaque pointer test for #55375. 2022-05-11 11:24:52 +01:00