Commit Graph

2292 Commits

Author SHA1 Message Date
Florian Hahn 73f01e3df5 [TTI] Add VecPred argument to getCmpSelInstrCost.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.

This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.

This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.

I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.

Reviewed By: dmgreen, RKSimon

Differential Revision: https://reviews.llvm.org/D90070
2020-10-30 13:49:08 +00:00
Florian Hahn 1922570489 [SLP] Consider alternatives for cost of select instructions.
Some architectures do not have general vector select instructions (e.g.
AArch64). But some cmp/select patterns can be vectorized using other
instructions/intrinsics.

One example is using min/max instructions for certain patterns.

This patch updates the cost calculations for selects in the SLP
vectorizer to consider using min/max intrinsics.

This patch does not change SLP vectorizer's codegen itself to actually
generate those intrinsics, but relies on the backends to lower the
vector cmps & selects. This keeps things simple on the SLP side and
works well in practice for AArch64.

This exposes additional SLP vectorization opportunities in some
benchmarks on AArch64 (-O3 -flto).

Metric: SLP.NumVectorInstructions

Program                                        base    slp     diff
 test-suite...ications/JM/ldecod/ldecod.test   502.00  697.00  38.8%
 test-suite...ications/JM/lencod/lencod.test   1023.00 1414.00 38.2%
 test-suite...-typeset/consumer-typeset.test    56.00   65.00  16.1%
 test-suite...6/464.h264ref/464.h264ref.test   804.00  822.00   2.2%
 test-suite...006/453.povray/453.povray.test   3335.00 3357.00  0.7%
 test-suite...CFP2000/177.mesa/177.mesa.test   2110.00 2121.00  0.5%
 test-suite...:: External/Povray/povray.test   2378.00 2382.00  0.2%

Reviewed By: RKSimon, samparker

Differential Revision: https://reviews.llvm.org/D89969
2020-10-29 20:39:50 +00:00
Nicolai Hähnle e025d09b21 Revert multiple patches based on "Introduce CfgTraits abstraction"
These logically belong together since it's a base commit plus
followup fixes to less common build configurations.

The patches are:

Revert "CfgInterface: rename interface() to getInterface()"

This reverts commit a74fc48158.

Revert "Wrap CfgTraitsFor in namespace llvm to please GCC 5"

This reverts commit f2a06875b6.

Revert "Try to make GCC5 happy about the CfgTraits thing"

This reverts commit 03a5f7ce12.

Revert "Introduce CfgTraits abstraction"

This reverts commit c0cdd22c72.
2020-10-27 20:33:30 +01:00
Joe Ellis 467e5cf40f [SVE][AArch64] Fix TypeSize warning in loop vectorization legality
The warning would fire when calling isDereferenceableAndAlignedInLoop
with a scalable load. Calling isDereferenceableAndAlignedInLoop with a
scalable load would result in the use of the now deprecated implicit
cast of TypeSize to uint64_t through the overloaded operator.

This patch fixes this issue by:

- no longer considering vector loads as candidates in
  canVectorizeWithIfConvert. This doesn't make sense in the context of
  identifying scalar loads to vectorize.

- making use of getFixedSize inside isDereferenceableAndAlignedInLoop --
  this removes the dependency on the deprecated interface, and will
  trigger an assertion error if the function is ever called with a
  scalable type.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D89798
2020-10-26 17:40:04 +00:00
Nicolai Hähnle c0cdd22c72 Introduce CfgTraits abstraction
The CfgTraits abstraction simplfies writing algorithms that are
generic over the type of CFG, and enables writing such algorithms
as regular non-template code that operates on opaque references
to CFG blocks and values.

Implementations of CfgTraits provide operations on the concrete
CFG types, e.g. `IrCfgTraits::BlockRef` is `BasicBlock *`.

CfgInterface is an abstract base class which provides operations
on opaque types CfgBlockRef and CfgValueRef. Those opaque types
encapsulate a `void *`, but the meaning depends on the concrete
CFG type. For example, MachineCfgTraits -- for use with MachineIR
in SSA form -- encodes a Register inside CfgValueRef. Converting
between concrete references and opaque/generic ones is done by
CfgTraits::{fromGeneric,toGeneric}. Convenience methods
CfgTraits::{un}wrap{Iterator,Range} are available as well.

Writing algorithms in terms of CfgInterface adds some overhead
(virtual method calls, plus in same cases it removes the
opportunity to inline iterators), but can be much more convenient
since generic algorithms can be written as non-templates.

This patch adds implementations of CfgTraits for all CFGs on
which dominator trees are calculated, so that the dominator
tree can be ported to this machinery. Only IrCfgTraits (LLVM IR)
and MachineCfgTraits (Machine IR in SSA form) are complete, the
other implementations are limited to the absolute minimum
required to make the upcoming dominator tree changes work.

v5:
- fix MachineCfgTraits::blockdef_iterator and allow it to iterate over
  the instructions in a bundle
- use MachineBasicBlock::printName

v6:
- implement predecessors/successors for all CfgTraits implementations
- fix error in unwrapRange
- rename toGeneric/fromGeneric into wrapRef/unwrapRef to have naming
  that is consistent with {wrap,unwrap}{Iterator,Range}
- use getVRegDef instead of getUniqueVRegDef

v7:
- std::forward fix in wrapping_iterator
- fix typos

v8:
- cleanup operators on CfgOpaqueType
- address other review comments

Change-Id: Ia75f4f268fded33fca11218a7d578c9aec1f3f4d

Differential Revision: https://reviews.llvm.org/D83088
2020-10-20 13:50:52 +02:00
Artem Belevich c36c0fabd1 [VectorCombine] Avoid crossing address space boundaries.
We can not bitcast pointers across different address spaces, and VectorCombine
should be careful when it attempts to find the original source of the loaded
data.

Differential Revision: https://reviews.llvm.org/D89577
2020-10-16 13:19:31 -07:00
Florian Hahn 89c0124273 [LoopVersion] Unify SCEVChecks and alias check handling (NFC).
This is an initial cleanup of the way LoopVersioning interacts with LAA.

Currently LoopVersioning has 2 ways of initializing things:

1. Passing LAI and passing UseLAIChecks = true
2. Passing UseLAIChecks = false, followed by calling setSCEVChecks and
   setAliasChecks.

Both ways of initializing lead to the same result and the duplication
seems more complicated than necessary.

This patch removes the UseLAIChecks flag from the constructor and the
setSCEVChecks & setAliasChecks helpers and move initialization
exclusively to the constructor.

This simplifies things, by providing a single way to initialize
LoopVersioning and reducing duplication.

Reviewed By: Meinersbur, lebedev.ri

Differential Revision: https://reviews.llvm.org/D84406
2020-10-15 22:02:17 +01:00
David Green 13ec3dd66f [LV] Add a getRecurrenceBinOp and make use of it. NFC 2020-10-15 18:21:41 +01:00
Florian Hahn 93f6c6b79c Recommit "[VPlan] Use VPValue def for VPMemoryInstructionRecipe."
This reverts the revert commit 710aceb645
and includes a fix for a memsan failure.

Original message:

    This patch turns VPMemoryInstructionRecipe into a VPValue and uses it
    during VPlan construction and codegeneration instead of the plain IR
    reference where possible.
2020-10-14 17:41:23 +01:00
Evgeniy Brevnov d0c95808e5 [LV] Unroll factor is expected to be > 0
LV fails with assertion checking that UF > 0. We already set UF to 1 if it is 0 except the case when IC > MaxInterleaveCount. The fix is to set UF to 1 for that case as well.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D87679
2020-10-14 16:48:17 +07:00
Vitaly Buka 710aceb645 Revert "[VPlan] Use VPValue def for VPMemoryInstructionRecipe."
It introduced a memory leak.

This reverts commit 525b085a65.
2020-10-13 03:14:08 -07:00
Florian Hahn 525b085a65 [VPlan] Use VPValue def for VPMemoryInstructionRecipe.
This patch turns VPMemoryInstructionRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D84680
2020-10-12 18:02:33 +01:00
Florian Hahn ea058d289c [VPlan] Use operands for printing of VPWidenMemoryInstructionRecipe.
Now that operands of the recipe are managed through VPUser, we can
simplify the printing by just using the operands.
2020-10-12 16:51:54 +01:00
David Sherwood c5ba0d33cc [SVE] Make ElementCount and TypeSize use a new PolySize class
I have introduced a new template PolySize class, where the template
parameter determines the type of quantity, i.e. for an element
count this is just an unsigned value. The ElementCount class is
now just a simple derivation of PolySize<unsigned>, whereas TypeSize
is more complicated because it still needs to contain the uint64_t
cast operator, since there are still many places in the code that
rely upon this implicit cast. As such the class also still needs
some of it's own operators.

I've tried to minimise the amount of code in the base PolySize
class, which led to a couple of changes:

1. In some places we were relying on '==' operator comparisons
between ElementCounts and the scalar value 1. I didn't put this
operator in the new PolySize class, and thought it was actually
clearer to use the isScalar() function instead.
2. I removed the isByteSized function and replaced it with calls
to isKnownMultipleOf(8).

I've also renamed NextPowerOf2 to be coefficientNextPowerOf2 so
that it's more consistent with coefficientDivideBy.

Differential Revision: https://reviews.llvm.org/D88409
2020-10-12 08:23:38 +01:00
David Green be6e8e50f4 [LV] Tail folded inloop reductions.
This expands upon the inloop reductions added in e9761688e41cb9e976,
allowing them to be inserted into tail folded loops. Reductions are
generates with the form:

  x = select(mask, vecop, zero)
  v = vecreduce.add(x)
  c = add chain, v

Where zero here is chosen as the identity value for add reductions. The
backend is then expected to fold the select and the vecreduce into a
single predicated instruction.

Most of the code is fairly straight forward, except for the creation of
blockmasks which need to ensure they are created in dominance order. The
order they are added is altered to be after any phis, keeping the
requirements for the underlying IR.

Differential Revision: https://reviews.llvm.org/D84451
2020-10-11 16:58:34 +01:00
Simon Pilgrim 0716805c02 [SLP] optimizeGatherSequence - assert every Instruction in the worklist is non-null.
Fixes clang static analyzer warning.
2020-10-08 20:02:18 +01:00
David Green 498f89d188 [LV] Collect dead induction truncates
We currently collect the ICmp and Add from an induction variable,
marking them as dead so that vplan values are not created for them. This
extends that to include any single use trunk from the ICmp, which allows
the Add to more readily be removed too.

This can help with costing vplan nodes, as the ICmp and Add are more
reliably removed and are not double-counted.

Differential Revision: https://reviews.llvm.org/D88873
2020-10-08 08:28:58 +01:00
Florian Hahn 348d85a6c7 [VPlan] Clean up uses/operands on VPBB deletion.
Update the code responsible for deleting VPBBs and recipes to properly
update users and release operands.

This is another preparation for D84680 & following patches towards
enabling modeling def-use chains in VPlan.
2020-10-05 14:43:52 +01:00
Florian Hahn 357bbaab66 [VPlan] Add VPRecipeBase::toVPUser helper (NFC).
This adds a helper to convert a VPRecipeBase pointer to a VPUser, for
recipes that inherit from VPUser. Once VPRecipeBase directly inherits
from VPUser this helper can be removed.
2020-10-04 19:43:27 +01:00
Florian Hahn f5fe7abe8a [VPlan] Account for removed users in replaceAllUsesWith.
Make sure we do not iterate using an invalid iterator.

Another small fix/step towards traversing the def-use chains in VPlan.
2020-10-04 18:18:58 +01:00
Florian Hahn 82dcd383c4 [VPlan] Properly update users when updating operands.
When updating operands of a VPUser, we also have to adjust the list of
users for the new and old VPValues. This is required once we start
transitioning recipes to become VPValues.
2020-10-03 20:54:58 +01:00
Florian Hahn 0867a9e85a [VPlan] Use isa<> instead of directly checking VPRecipeID (NFC).
getVPRecipeID is intended to be only used in `classof` helpers. Instead
of checking it directly, use isa<> with the correct recipe type.
2020-10-02 17:47:35 +01:00
Florian Hahn d856365470 [VPlan] Change recipes to inherit from VPUser instead of a member var.
Now that VPUser is not inheriting from VPValue, we can take the next
step and turn the recipes that already manage their operands via VPUser
into VPUsers directly. This is another small step towards traversing
def-use chains in VPlan.

This is NFC with respect to the generated code, but makes the interface
more powerful.
2020-09-30 14:39:00 +01:00
Sanjay Patel 0a349d5827 [SLP] clean up - use 'const' and ArrayRef constructor; NFC
Follow-on tidying suggested in the post-commit review of 6a23668.
2020-09-24 15:31:07 -04:00
Craig Topper 03f22b08e2 [SLP] Remove LHS and RHS from OperationData.
These were only really used for 2 things. One was to check if the operand matches the phi if it exists. The other was for the createOp method to build the reduction.

For the first case we still have the operation we just need to know how to index its operands. So I've modified getLHS/getRHS to just use the opcode/kind to know how to find the right operands on an instruction that is now passed in.

For the other case we had to create an OperationData object to set the LHS/RHS values and copy the opcode/kind from another object. We would then just call createOp on that temporary object. Instead I've made LHS/RHS arguments to createOp and removed all these temporary objects.

Differential Revision: https://reviews.llvm.org/D88193
2020-09-24 10:57:11 -07:00
Craig Topper 7a3c643c35 [SLP] Make HorizontalReduction::getOperationData take an Instruction* instead of a Value*. NFCI
All of the callers already have an Instruction *. Many of them
from a dyn_cast.

Also update the OperationData constructor to use a Instruction&
to remove a dyn_cast and make it clear that the pointer is non-null.

Differential Revision: https://reviews.llvm.org/D88132
2020-09-23 10:51:03 -07:00
Simon Pilgrim 474dc33d07 Add missing namespace closure comment. NFCI.
Fixes clang-tidy llvm-namespace-comment warning.
2020-09-23 16:19:25 +01:00
Florian Hahn 31923f6b36 [VPlan] Disconnect VPValue and VPUser.
This refactors VPuser to not inherit from VPValue to facilitate
introducing operations that introduce multiple VPValues (e.g.
VPInterleaveRecipe).

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D84679
2020-09-23 14:44:31 +01:00
Alexey Bataev d6ac649ccd [SLP]Fix coding style, NFC. 2020-09-22 17:44:29 -04:00
Stefanos Baziotis 89c1e35f3c [LoopInfo] empty() -> isInnermost(), add isOutermost()
Differential Revision: https://reviews.llvm.org/D82895
2020-09-22 23:28:51 +03:00
Florian Hahn c671e34bf2 [VPlan] Add dump() helper to VPValue & VPRecipeBase.
This provides a convenient way to print VPValues and recipes in a
debugger. In particular it saves the user from instantiating
VPSlotTracker to print recipes or values.
2020-09-22 15:55:16 +01:00
Sanjay Patel 0c3bfbe4bc [SLP] reduce code duplication for checking parent block; NFC 2020-09-22 09:21:20 -04:00
Sanjay Patel bbd49a0266 [SLP] move misplaced code comments; NFC 2020-09-22 09:21:20 -04:00
Sanjay Patel 062276c691 [SLP] clean up code in gather(); NFC
1. Use range for-loop to avoid repeatedly accessing end index.
2. Better variable names.
2020-09-22 09:21:20 -04:00
Simon Pilgrim d682a36ef9 [SLP] Merge null and dyn_cast<> checks into dyn_cast_or_null<>. NFCI. 2020-09-22 14:01:47 +01:00
Sanjay Patel 7451bf0b0b [SLP] use std::distance/find to reduce code; NFC
We were already using this code pattern right after
the loop, so this makes it consistent.
2020-09-21 16:22:55 -04:00
Sanjay Patel be93505986 [LoopVectorize] use unary shuffle creator to reduce code duplication; NFC 2020-09-21 15:34:24 -04:00
Sanjay Patel a44238cb44 [SLP] use unary shuffle creator to reduce code duplication; NFC 2020-09-21 13:54:06 -04:00
Sanjay Patel 1e6b240d7d [IRBuilder][VectorCombine] make and use a convenience function for unary shuffle; NFC
This reduces code duplication for common construct.
Follow-ups can use this in SLP, LoopVectorizer, and other passes.
2020-09-21 13:47:01 -04:00
Simon Pilgrim 005f826a05 [SLP] Use for-range loops across ValueLists. NFCI.
Also rename some existing loops that used a 'j' iterator to consistently use 'V'.
2020-09-21 18:24:23 +01:00
Sanjay Patel 46075e0b78 [SLP] simplify interface for gather(); NFC
The implementation of gather() should be reduced too,
but this change by itself makes things a little clearer:
we don't try to gather to a different type or
number-of-values than whatever is passed in as the value
list itself.
2020-09-21 12:57:28 -04:00
Simon Pilgrim 3ddecfd220 SLPVectorizer.cpp - fix include ordering. NFCI. 2020-09-21 17:17:11 +01:00
Alexey Bataev 3ff07fcd54 [SLP] Allow reordering of vectorization trees with reused instructions.
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D45263
2020-09-21 10:51:03 -04:00
Fangrui Song 6913812abc Fix some clang-tidy bugprone-argument-comment issues 2020-09-19 20:41:25 -07:00
Eric Christopher ecfd8161bf Temporarily Revert "[SLP] Allow reordering of vectorization trees with reused instructions."
as it's infinite looping on occasion.

This reverts commit 455ca0ebb6.
2020-09-18 12:50:04 -07:00
Alexey Bataev 455ca0ebb6 [SLP] Allow reordering of vectorization trees with reused instructions.
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D45263
2020-09-18 09:34:59 -04:00
Sanjay Patel 48a23bccf3 [VectorCombine] limit load+insert transform to one-use
As discussed in:
https://llvm.org/PR47558
...there are several potential fixes/follow-ups visible
in the test case, but this is the quickest and safest
fix of the perf regression.
2020-09-17 14:29:15 -04:00
Sanjay Patel ddd9575d15 [VectorCombine] rearrange bailouts for load insert for efficiency; NFC 2020-09-17 13:50:37 -04:00
Sanjay Patel 03783f19dc [SLP] sort candidates to increase chance of optimal compare reduction
This is one (small) part of improving PR41312:
https://llvm.org/PR41312

As shown there and in the smaller tests here, if we have some member of the
reduction values that does not match the others, we want to push it to the
end (bring the matching members forward and together).

In the regression tests, we have 5 candidates for the 4 slots of the reduction.
If the one "wrong" compare is grouped with the others, it prevents forming the
ideal v4i1 compare reduction.

Differential Revision: https://reviews.llvm.org/D87772
2020-09-17 08:49:27 -04:00
Sanjay Patel 24238f09ed [SLP] fix formatting; NFC
Also move variable declarations closer to usage and add code comments.
2020-09-16 08:50:27 -04:00
Sanjay Patel 6a23668e78 [SLP] remove uses of 'auto' that obscure functionality; NFC 2020-09-16 08:26:21 -04:00
Sanjay Patel 0cee1bf5d1 [SLP] remove redundant size check; NFC
We bail out on small array size anyway.
2020-09-16 08:11:19 -04:00
Sanjay Patel bbad998bab [SLP] move loop index variable declaration to its use; NFC 2020-09-16 07:59:31 -04:00
Sanjay Patel 158989184e [SLP] change poorly named variable; NFC
'V' shadows a function argument.
2020-09-16 07:59:31 -04:00
Wenlei He 2ea4c2c598 [BFI] Make BFI information available through loop passes inside LoopStandardAnalysisResults
~~D65060 uncovered that trying to use BFI in loop passes can lead to non-deterministic behavior when blocks are re-used while retaining old BFI data.~~

~~To make sure BFI is preserved through loop passes a Value Handle (VH) callback is registered on blocks themselves. When a block is freed it now also wipes out the accompanying BFI entry such that stale BFI data can no longer persist resolving the determinism issue. ~~

~~An optimistic approach would be to incrementally update BFI information throughout the loop passes rather than only invalidating them on removed blocks. The issues with that are:~~
~~1. It is not clear how BFI information should be incrementally updated: If a block is duplicated does its BFI information come with? How about if it's split/modified/moved around? ~~
~~2. Assuming we can address these problems the implementation here will be a massive undertaking. ~~

~~There's a known need of BFI in LICM analysis which requires correct but not incrementally updated BFI data. A follow-up change can register BFI in all loop passes so this preserved but potentially lossy data is available to any loop pass that wants it.~~

See: D75341 for an identical implementation of preserving BFI via VH callbacks. The previous statements do still apply but this change no longer has to be in this diff because it's already upstream 😄 .

This diff also moves BFI to be a part of LoopStandardAnalysisResults since the previous method using getCachedResults now (correctly!) statically asserts (D72893) that this data isn't static through the loop passes.

Testing
Ninja check

Reviewed By: asbirlea, nikic

Differential Revision: https://reviews.llvm.org/D86156
2020-09-15 16:16:24 -07:00
Huihui Zhang 3b7f5166bd [SLPVectorizer][SVE] Skip scalable-vector instructions before vectorizeSimpleInstructions.
For scalable type, the aggregated size is unknown at compile-time.
Skip instructions with scalable type to ensure the list of instructions
for vectorizeSimpleInstructions does not contains any scalable-vector instructions.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D87550
2020-09-15 13:10:15 -07:00
Fangrui Song 4452cc4086 [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan
Similar to the tsan suppression in
`Utils/VNCoercion.cpp:getLoadLoadClobberFullWidthSize` (rL175034; load widening used by GVN),
the D81766 optimization should be suppressed under tsan due to potential
spurious data race reports:

  struct A {
    int i;
    const short s; // the load cannot be vectorized because
    int modify;    // it overlaps with bytes being concurrently modified
    long pad1, pad2;
  };
  // __tsan_read16 does not know that some bytes are undef and accessing is safe

Similarly, under asan, users can mark memory regions with
`__asan_poison_memory_region`. A widened load can lead to a spurious
use-after-poison error. hwasan/memtag should be similarly suppressed.

`mustSuppressSpeculation` suppresses asan/hwasan/tsan but not memtag, so
we need to exclude memtag in `vectorizeLoadInsert`.

Note, memtag suppression can be relaxed if the load is aligned to the
its granule (usually 16), but that is out of scope of this patch.

Reviewed By: spatel, vitalybuka

Differential Revision: https://reviews.llvm.org/D87538
2020-09-15 09:47:21 -07:00
Simon Pilgrim 2b42d53e5e SLPVectorizer.h - remove unnecessary AliasAnalysis.h include. NFCI.
Forward declare AAResults instead of the (old) AliasAnalysis type.

Remove includes from SLPVectorizer.cpp that are already included in SLPVectorizer.h.
2020-09-15 16:24:05 +01:00
David Green 74760bb00f [LV][ARM] Add preferInloopReduction target hook.
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.

For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.

Differential Revision: https://reviews.llvm.org/D75512
2020-09-12 17:47:04 +01:00
Sanjay Patel 40f12ef621 [SLP] further limit bailout for load combine candidate (PR47450)
The test example based on PR47450 shows that we can
match non-byte-sized shifts, but those won't ever be
bswap opportunities. This isn't a full fix (we'd still
match if the shifts were by 8-bits for example), but
this should be enough until there's evidence that we
need to do more (this is a borderline case for
vectorization in the first place).
2020-09-11 11:56:11 -04:00
Craig Topper c195ae2f00 [SLPVectorizer][X86][AMDGPU] Remove fcmp+select to fmin/fmax reduction support.
Previously we could match fcmp+select to a reduction if the fcmp had
the nonans fast math flag. But if the select had the nonans fast
math flag, InstCombine would turn it into a fminnum/fmaxnum intrinsic
before SLP gets to it. Seems fairly likely that if one of the
fcmp+select pair have the fast math flag, they both would.

My plan is to start vectorizing the fmaxnum/fminnum version soon,
but I wanted to get this code out as it had some of the strangest
fast math flag behaviors.
2020-09-10 11:49:19 -07:00
Simon Pilgrim 5ea9e655ef VPlan.h - remove unnecessary forward declarations. NFCI.
Already defined in includes.
2020-09-07 18:35:06 +01:00
Huihui Zhang b4f04d7135 [VectorCombine][SVE] Do not fold bitcast shuffle for scalable type.
First, shuffle cost for scalable type is not known for scalable type;
Second, we cannot reason if the narrowed shuffle mask for scalable type
is a splat or not.

E.g., Bitcast splat vector from type <vscale x 4 x i32> to <vscale x 8 x i16>
will involve narrowing shuffle mask <vscale x 4 x i32> zeroinitializer to
<vscale x 8 x i32> with element sequence of <0, 1, 0, 1, ...>, which cannot be
reasoned if it's a valid splat or not.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D86995
2020-09-02 15:02:16 -07:00
Sanjay Patel 8fb055932c [VectorCombine] allow vector loads with mismatched insert type
This is an enhancement to D81766 to allow loading the minimum target
vector type into an IR vector with a different number of elements.

In one of the motivating tests from PR16739, SLP creates <2 x float>
load ops mixed with <4 x float> insert ops, so we want to handle that
pattern in addition to potential oversized vectors created by the
vectorizers.

For now, we are assuming the insert/extract subvector with undef is
free because there is no exact corresponding TTI modeling for that.

Differential Revision: https://reviews.llvm.org/D86160
2020-09-02 08:11:36 -04:00
Aaron Liu d7e16ca28f [LV] Interleave to expose ILP for small loops with scalar reductions.
Interleave for small loops that have reductions inside,
which breaks dependencies and expose.

This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.

Differential Revision: https://reviews.llvm.org/D81416
2020-09-01 19:47:32 +00:00
Florian Hahn eb35ebb3a2 [LV] Update CFG before adding runtime checks.
addRuntimeChecks uses SCEVExpander, which relies on the DT/LoopInfo to
be up-to-date. Changing the CFG afterwards may invalidate some inserted
instructions, especially LCSSA phis.

Reorder the code to first update the CFG and then create the runtime
checks. This should not have any impact on the generated code, as we
adjust the CFG and generate runtime checks together.

Fixes PR47343.
2020-08-30 18:21:44 +01:00
Sanjay Patel af4581e8ab [SLP] make commutative check apply only to binops; NFC
As discussed in D86798, it's not clear if the caller code
works with a more liberal definition of "commutative" that
includes intrinsics like min/max. This makes the binop
restriction (current functionality is unchanged) explicit
until the code is audited/tested.
2020-08-30 10:55:44 -04:00
David Green 543c5425f1 [LV] Add some const to RecurrenceDescriptor. NFC 2020-08-30 12:27:51 +01:00
Florian Hahn 5067f4b626 [LV] Check opt-for-size before expanding runtime checks.
Move bail out when optimizing for size before runtime check generation.
In that case, we do not use the result of the expansion, the expanded
instruction will be dead and cleaned up later.

By doing the check before expanding the runtime-checks, we can save a
bit of unnecessary work.
2020-08-29 20:35:14 +01:00
David Sherwood f4257c5832 [SVE] Make ElementCount members private
This patch changes ElementCount so that the Min and Scalable
members are now private and can only be accessed via the get
functions getKnownMinValue() and isScalable(). In addition I've
added some other member functions for more commonly used operations.
Hopefully this makes the class more useful and will reduce the
need for calling getKnownMinValue().

Differential Revision: https://reviews.llvm.org/D86065
2020-08-28 14:43:53 +01:00
Christopher Tetreault 5e63083435 [SVE] Remove calls to VectorType::getNumElements from Transforms/Vectorize
Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D82056
2020-08-27 12:02:20 -07:00
Sjoerd Meijer bda8fbe2d2 [LV] Fallback strategies if tail-folding fails
This implements 2 different vectorisation fallback strategies if tail-folding
fails: 1) don't vectorise at all, or 2) vectorise using a scalar epilogue. This
can be controlled with option -prefer-predicate-over-epilogue, that has been
changed to take a numeric value corresponding to the tail-folding preference
and preferred fallback.

Patch by: Pierre van Houtryve, Sjoerd Meijer.

Differential Revision: https://reviews.llvm.org/D79783
2020-08-26 16:55:25 +01:00
Sjoerd Meijer ae366479e8 [LV] get.active.lane.mask consuming tripcount instead of backedge-taken count
This adapts LV to the new semantics of get.active.lane.mask as discussed in
D86147, which means that the LV now emits intrinsic get.active.lane.mask with
the loop tripcount instead of the backedge-taken count as its second argument.
The motivation for this is described in D86147.

Differential Revision: https://reviews.llvm.org/D86304
2020-08-25 13:49:19 +01:00
Francesco Petrogalli 5a34b3ab95 [llvm][LV] Replace `unsigned VF` with `ElementCount VF` [NFCI]
Changes:

* Change `ToVectorTy` to deal directly with `ElementCount` instances.
* `VF == 1` replaced with `VF.isScalar()`.
* `VF > 1` and `VF >=2` replaced with `VF.isVector()`.
* `VF <=1` is replaced with `VF.isZero() || VF.isScalar()`.
* Replaced the uses of `llvm::SmallSet<ElementCount, ...>` with
   `llvm::SmallSetVector<ElementCount, ...>`. This avoids the need of an
   ordering function for the `ElementCount` class.
* Bits and pieces around printing the `ElementCount` to string streams.

To guarantee that this change is a NFC, `VF.Min` and asserts are used
in the following places:

1. When it doesn't make sense to deal with the scalable property, for
example:
   a. When computing unrolling factors.
   b. When shuffle masks are built for fixed width vector types
In this cases, an
assert(!VF.Scalable && "<mgs>") has been added to make sure we don't
enter coepaths that don't make sense for scalable vectors.
2. When there is a conscious decision to use `FixedVectorType`. These
uses of `FixedVectorType` will likely be removed in favour of
`VectorType` once the vectorizer is generic enough to deal with both
fixed vector types and scalable vector types.
3. When dealing with building constants out of the value of VF, for
example when computing the vectorization `step`, or building vectors
of indices. These operation _make sense_ for scalable vectors too,
but changing the code in these places to be generic and make it work
for scalable vectors is to be submitted in a separate patch, as it is
a functional change.
4. When building the potential VFs in VPlan. Making the VPlan generic
enough to handle scalable vectorization factors is a functional change
that needs a separate patch. See for example `void
LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned
MaxVF)`.
5. The class `IntrinsicCostAttribute`: this class still uses `unsigned
VF` as updating the field to use `ElementCount` woudl require changes
that could result in changing the behavior of the compiler. Will be done
in a separate patch.
7. When dealing with user input for forcing the vectorization
factor. In this case, adding support for scalable vectorization is a
functional change that migh require changes at command line.

Note that in some places the idiom

```
unsigned VF = ...
auto VTy = FixedVectorType::get(ScalarTy, VF)
```

has been replaced with

```
ElementCount VF = ...
assert(!VF.Scalable && ...);
auto VTy = VectorType::get(ScalarTy, VF)
```

The assertion guarantees that the new code is (at least in debug mode)
functionally equivalent to the old version. Notice that this change had been
possible because none of the methods that are specific to `FixedVectorType`
were used after the instantiation of `VTy`.

Reviewed By: rengolin, ctetreau

Differential Revision: https://reviews.llvm.org/D85794
2020-08-24 13:54:03 +00:00
Francesco Petrogalli bad7d6b373 Revert "[llvm][LV] Replace `unsigned VF` with `ElementCount VF` [NFCI]"
Reverting because the commit message doesn't reflect the one agreed on
phabricator at https://reviews.llvm.org/D85794.

This reverts commit c8d2b065b9.
2020-08-24 13:50:55 +00:00
Francesco Petrogalli c8d2b065b9 [llvm][LV] Replace `unsigned VF` with `ElementCount VF` [NFCI]
Changes:

* Change `ToVectorTy` to deal directly with `ElementCount` instances.
* `VF == 1` replaced with `VF.isScalar()`.
* `VF > 1` and `VF >=2` replaced with `VF.isVector()`.
* `VF <=1` is replaced with `VF.isZero() || VF.isScalar()`.
* Add `<` operator to `ElementCount` to be able to use
`llvm::SmallSetVector<ElementCount, ...>`.
* Bits and pieces around printing the ElementCount to string streams.
* Added a static method to `ElementCount` to represent a scalar.

To guarantee that this change is a NFC, `VF.Min` and asserts are used
in the following places:

1. When it doesn't make sense to deal with the scalable property, for
example:
   a. When computing unrolling factors.
   b. When shuffle masks are built for fixed width vector types
In this cases, an
assert(!VF.Scalable && "<mgs>") has been added to make sure we don't
enter coepaths that don't make sense for scalable vectors.
2. When there is a conscious decision to use `FixedVectorType`. These
uses of `FixedVectorType` will likely be removed in favour of
`VectorType` once the vectorizer is generic enough to deal with both
fixed vector types and scalable vector types.
3. When dealing with building constants out of the value of VF, for
example when computing the vectorization `step`, or building vectors
of indices. These operation _make sense_ for scalable vectors too,
but changing the code in these places to be generic and make it work
for scalable vectors is to be submitted in a separate patch, as it is
a functional change.
4. When building the potential VFs in VPlan. Making the VPlan generic
enough to handle scalable vectorization factors is a functional change
that needs a separate patch. See for example `void
LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned
MaxVF)`.
5. The class `IntrinsicCostAttribute`: this class still uses `unsigned
VF` as updating the field to use `ElementCount` woudl require changes
that could result in changing the behavior of the compiler. Will be done
in a separate patch.
7. When dealing with user input for forcing the vectorization
factor. In this case, adding support for scalable vectorization is a
functional change that migh require changes at command line.

Differential Revision: https://reviews.llvm.org/D85794
2020-08-24 13:39:42 +00:00
David Green 2b69efded0 [ARM][LV] Add a preferPredicatedReductionSelect target hook
As part of D84741, this adds a target hook for the
preferPredicatedReductionSelect option and makes use
of it under MVE, allowing us to tail predicate most
reduction loops.

Differential Revision: https://reviews.llvm.org/D85980
2020-08-21 08:48:12 +01:00
David Green 816097e4e5 [LV] Allow tail folded reduction selects to remain in the loop
The normal scheme for tail folding reductions is to use:

loop:
  p = phi(0, a)
  mask = ...
  x = masked_load(..., mask)
  a = add(x, p)
s = select(mask, a, p)

This means we need to keep the register p and a alive out of the loop, plus
the mask. On a target with predicated operations we can instead generate
the phi as p = phi(0, s). This ensures the select in the loop and we can
fold select(m, add(a, b), c) to something like a vaddt c, a, b using the
m predicate. This in turn allows us to tail predicate the entire loop.

Differential Revision: https://reviews.llvm.org/D84741
2020-08-20 14:31:14 +01:00
Hiroshi Yamauchi ab401a8c8a [PGO][PGSO][LV] Fix loop not vectorized issue under profile guided size opts.
D81345 appears to accidentally disables vectorization when explicitly
enabled. As PGSO isn't currently accessible from LoopAccessInfo, revert back to
the vectorization with versioning-for-unit-stride for PGSO.

Differential Revision: https://reviews.llvm.org/D85784
2020-08-19 12:13:34 -07:00
Mehdi Amini a407ec9b6d Revert "Revert "[NFC][llvm] Make the contructors of `ElementCount` private.""
Was reverted because MLIR/Flang builds were broken, these APIs have been
fixed in the meantime.
2020-08-19 17:26:36 +00:00
Mehdi Amini 4fc56d70aa Revert "[NFC][llvm] Make the contructors of `ElementCount` private."
This reverts commit 264afb9e6a.
(and dependent 6b742cc48 and fc53bd610f)

MLIR/Flang are broken.
2020-08-19 17:21:37 +00:00
Francesco Petrogalli 264afb9e6a [NFC][llvm] Make the contructors of `ElementCount` private.
Differential Revision: https://reviews.llvm.org/D86120
2020-08-19 16:26:44 +00:00
Bjorn Pettersson 11446b02c7 [VectorCombine] Fix for non-zero addrspace when creating vector load from scalar load
This is a fixup to commit 43bdac2906, to make sure the
address space from the original load pointer is retained in the
vector pointer.

Resolves problem with
  Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
due to address space mismatch.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D85912
2020-08-13 18:25:32 +02:00
Sanjay Patel cc892fd9f4 [VectorCombine] early exit if target has no vector registers
Based on post-commit discussion in:
D81766

Other vectorization passes (SLP and Loop) use this TTI API similarly.
2020-08-12 09:22:31 -04:00
Sanjay Patel b0b95dab1c [VectorCombine] add safety check for 0-width register
Based on post-commit discussion in D81766, Hexagon sets this to "0".
I'll see if I can come up with a test, but making the obvious
code fix first to unblock that target.
2020-08-11 20:30:02 -04:00
Dinar Temirbulatov b1600d8b89 [NFC] Guard the cost report block of debug outputs with NDEBUG and
switch to SmallString, this is part of D57779.
2020-08-11 16:34:47 +02:00
Florian Hahn 0b774acf11 [SLP] Make sure instructions are ordered when computing spill cost.
The entries in VectorizableTree are not necessarily ordered by their
position in basic blocks. Collect them and order them by dominance so
later instructions are guaranteed to be visited first. For instructions
in different basic blocks, we only scan to the beginning of the block,
so their order does not matter, as long as all instructions in a basic
block are grouped together. Using dominance ensures a deterministic order.

The modified test case contains an example where we compute a wrong
spill cost (2) without this patch, even though there is no call between
any instruction in the bundle.

This seems to have limited practical impact, .e.g on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto on MultiSource,SPEC2000,SPEC2006
there are no binary changes.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D82444
2020-08-11 11:18:12 +02:00
Sanjay Patel 43bdac2906 [VectorCombine] try to create vector loads from scalar loads
This patch was adjusted to match the most basic pattern that starts with an insertelement
(so there's no extract created here). Hopefully, that removes any concern about
interfering with other passes. Ie, the transform should almost always be profitable.

We could make an argument that this could be part of canonicalization, but we
conservatively try not to create vector ops from scalar ops in passes like instcombine.

If the transform is not profitable, the backend should be able to re-scalarize the load.

Differential Revision: https://reviews.llvm.org/D81766
2020-08-09 09:05:06 -04:00
Anton Afanasyev a7478fab6c [SLP] Fix order of `insertelement`/`insertvalue` seed operands
Summary:
This patch takes the indices operands of `insertelement`/`insertvalue`
into account while generation of seed elements for `findBuildAggregate()`.
This function has kept the original order of `insert`s before.
Also this patch optimizes `findBuildAggregate()` preventing it from
redundant temporary vector allocations and its multiple reversing.

Fixes llvm.org/pr44067

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83779
2020-08-06 22:09:24 +03:00
David Green 745bf6cf44 [LoopVectorizer] Inloop vector reductions
Arm MVE has multiple instructions such as VMLAVA.s8, which (in this
case) can take two 128bit vectors, sign extend the inputs to i32,
multiplying them together and sum the result into a 32bit general
purpose register. So taking 16 i8's as inputs, they can multiply and
accumulate the result into a single i32 without any rounding/truncating
along the way. There are also reduction instructions for plain integer
add and min/max, and operations that sum into a pair of 32bit registers
together treated as a 64bit integer (even though MVE does not have a
plain 64bit addition instruction). So giving the vectorizer the ability
to use these instructions both enables us to vectorize at higher
bitwidths, and to vectorize things we previously could not.

In order to do that we need a way to represent that the reduction
operation, specified with a llvm.experimental.vector.reduce when
vectorizing for Arm, occurs inside the loop not after it like most
reductions. This patch attempts to do that, teaching the vectorizer
about in-loop reductions. It does this through a vplan recipe
representing the reductions that the original chain of reduction
operations is replaced by. Cost modelling is currently just done through
a prefersInloopReduction TTI hook (which follows in a later patch).

Differential Revision: https://reviews.llvm.org/D75069
2020-08-06 10:10:50 +01:00
Jordan Rupprecht 3c39db0c44 Revert "[LoopVectorizer] Inloop vector reductions"
This reverts commit e9761688e4. It breaks the build:

```
~/src/llvm-project/llvm/lib/Analysis/IVDescriptors.cpp:868:10: error: no viable conversion from returned value of type 'SmallVector<[...], 8>' to function return type 'SmallVector<[...], 4>'
  return ReductionOperations;
```
2020-08-05 10:24:15 -07:00
David Green e9761688e4 [LoopVectorizer] Inloop vector reductions
Arm MVE has multiple instructions such as VMLAVA.s8, which (in this
case) can take two 128bit vectors, sign extend the inputs to i32,
multiplying them together and sum the result into a 32bit general
purpose register. So taking 16 i8's as inputs, they can multiply and
accumulate the result into a single i32 without any rounding/truncating
along the way. There are also reduction instructions for plain integer
add and min/max, and operations that sum into a pair of 32bit registers
together treated as a 64bit integer (even though MVE does not have a
plain 64bit addition instruction). So giving the vectorizer the ability
to use these instructions both enables us to vectorize at higher
bitwidths, and to vectorize things we previously could not.

In order to do that we need a way to represent that the reduction
operation, specified with a llvm.experimental.vector.reduce when
vectorizing for Arm, occurs inside the loop not after it like most
reductions. This patch attempts to do that, teaching the vectorizer
about in-loop reductions. It does this through a vplan recipe
representing the reductions that the original chain of reduction
operations is replaced by. Cost modelling is currently just done through
a prefersInloopReduction TTI hook (which follows in a later patch).

Differential Revision: https://reviews.llvm.org/D75069
2020-08-05 18:14:05 +01:00
Bardia Mahjour 3c0f347002 [NFC][LV] Vectorized Loop Skeleton Refactoring
This patch tries to improve readability and maintenance
of createVectorizedLoopSkeleton by reorganizing some lines,
updating some of the comments and breaking it up into
smaller logical units.

Reviewed By: pjeeva01

Differential Revision: https://reviews.llvm.org/D83824
2020-08-04 14:50:57 -04:00
Florian Hahn 98db27711d [LV] Do not check widening decision for instrs outside of loop.
No widening decisions will be computed for instructions outside the
loop. Do not try to get a widening decision. The load/store will be just
a scalar load, so treating at as normal should be fine I think.

Fixes PR46950.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D85087
2020-08-03 10:09:24 +01:00
Vitaly Buka b0eb40ca39 [NFC] Remove unused GetUnderlyingObject paramenter
Depends on D84617.

Differential Revision: https://reviews.llvm.org/D84621
2020-07-31 02:10:03 -07:00
Vitaly Buka 89051ebace [NFC] GetUnderlyingObject -> getUnderlyingObject
I am going to touch them in the next patch anyway
2020-07-30 21:08:24 -07:00
David Green 1da0c47fa2 [LoopVectorizer] Don't create unused block masks for reductions. NFC
This removes some unneeded block masks when we don't have any
reductions. It should not have any effect on codegen as the values
created are dead anyway.

Differential Revision: https://reviews.llvm.org/D81415
2020-07-30 14:28:08 +01:00
Simon Pilgrim cc529285fd VectorUtils.h - reduce unnecessary includes. NFC.
Replace TargetLibraryInfo.h include with forward declaration and fix implicit dependencies.

Reduce SmallSet.h include to SmallVector.h include.
2020-07-30 12:27:49 +01:00
David Sherwood 9ad7c980bb [SVE] Don't consider scalable vector types in SLPVectorizerPass::vectorizeChainsInBlock
In vectorizeChainsInBlock we try to collect chains of PHI nodes
that have the same element type, but the code is relying upon
the implicit conversion from TypeSize -> uint64_t. For now, I have
modified the code to ignore PHI nodes with scalable types.

Differential Revision: https://reviews.llvm.org/D83542
2020-07-29 16:29:19 +01:00
David Green 60280e9818 [Analysis] TTI: Add CastContextHint for getCastInstrCost
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types.  Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization.  Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.

For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.

To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.

Original patch by Pierre van Houtryve

Differential Revision: https://reviews.llvm.org/D79162
2020-07-29 13:32:53 +01:00
Kazu Hirata 902cbcd59e Use llvm::is_contained where appropriate (NFC)
Summary:
This patch replaces std::find with llvm::is_contained where
appropriate.

Reviewers: efriedma, nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, rogfer01, kerbowa, llvm-commits, vkmr

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84489
2020-07-27 10:20:44 -07:00
Hiroshi Yamauchi 7bedae7dee [PGO][PGSO] Add profile guided size optimization to loop vectorization legality. 2020-07-21 11:16:36 -07:00
Arthur Eubanks 0dfa4a83fa Revert "[PGO][PGSO] Add profile guided size optimization to loop vectorization legality."
This reverts commit 30c382a7c6.

See https://crbug.com/1106813.
2020-07-17 16:47:41 -07:00
Stanislav Mekhanoshin efb5040262 Fixed warning about signed/unsigned comparison
I've got the report clang11 issues signed/unsigned mismatch
warning here. For some reason only clang11 seems to issue
this warning.

Differential Revision: https://reviews.llvm.org/D83916
2020-07-17 11:03:42 -07:00
Anna Welker 23c9534515 [LV] Enable the LoopVectorizer to create pointer inductions
This patch enables the LoopVectorizer to build a phi of pointer
type and provide the vector loads and stores with vector type
getelementptrs built from the pointer induction variable, which
produces much less instructions than the previous approach of
creating scalar getelementpointers and glue them together to a
vector.

Differential Revision: https://reviews.llvm.org/D81267
2020-07-17 13:35:07 +01:00
Hiroshi Yamauchi 30c382a7c6 [PGO][PGSO] Add profile guided size optimization to loop vectorization legality.
Differential Revision: https://reviews.llvm.org/D83329
2020-07-15 11:49:36 -07:00
Sanne Wouda 13fec93a77 [NFC] rename to reflect F is not necessarily an Intrinsic 2020-07-13 15:28:46 +01:00
Sanne Wouda 7b84045565 [SLPVectorizer] handle vectorizeable library functions
Teaches the SLPVectorizer to use vectorized library functions for
non-intrinsic calls.

This already worked for intrinsics that have vectorized library
functions, thanks to D75878, but schedules with library functions with a
vector variant were being rejected early.

-   assume that there are no load/store dependencies between lib
    functions with a vector variant; this would otherwise prevent the
    bundle from becoming "ready"

-   check during legalization that the vector variant can be used

-   fix-up where we previously assumed that a call would be an intrinsic

Differential Revision: https://reviews.llvm.org/D82550
2020-07-13 15:28:46 +01:00
Ayal Zaks 82a5157ff1 [LV] Fixing versioning-for-unit-stide of loops with small trip count
This patch fixes D81345 and PR46652.

If a loop with a small trip count is compiled w/o -Os/-Oz, Loop Access Analysis
still generates runtime checks for unit strides that will version the loop.

In such cases, the loop vectorizer should either re-run the analysis or bail-out
from vectorizing the loop, as done prior to D81345. The latter is applied for
now as the former requires refactoring.

Differential Revision: https://reviews.llvm.org/D83470
2020-07-12 19:51:47 +03:00
Florian Hahn 264ab1e2c8 [LV] Pick vector loop body as insert point for SCEV expansion.
Currently the DomTree is not kept up to date for additional blocks
generated in the vector loop, for example when vectorizing with
predication. SCEVExpander relies on dominance checks when looking for
existing instructions to re-use and in some cases that can lead to the
expander picking instructions that do not actually dominate their insert
point (e.g. as in PR46525).

Unfortunately keeping the DT up-to-date is a bit tricky, because the CFG
is only patched up after generating code for a block. For now, we can
just use the vector loop header, as this ensures the inserted
instructions dominate all uses in the vector loop. There should be no
noticeable impact on the generated code, as other passes should sink
those instructions, if profitable.

Fixes PR46525.

Reviewers: Ayal, gilr, mkazantsev, dmgreen

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D83288
2020-07-10 10:37:12 +01:00
Benjamin Kramer b44470547e Make helpers static. NFC. 2020-07-09 13:48:56 +02:00
Nicolai Hähnle 3fa989d4fd DomTree: remove explicit use of DomTreeNodeBase::iterator
Summary:
Almost all uses of these iterators, including implicit ones, really
only need the const variant (as it should be). The only exception is
in NewGVN, which changes the order of dominator tree child nodes.

Change-Id: I4b5bd71e32d71b0c67b03d4927d93fe9413726d4

Reviewers: arsenm, RKSimon, mehdi_amini, courbet, rriddle, aartbik

Subscribers: wdng, Prazek, hiraditya, kuhar, rogfer01, rriddle, jpienaar, shauheen, antiagainst, nicolasvasilache, arpith-jacob, mgester, lucyrfox, aartbik, liufengdb, stephenneuendorffer, Joonsoo, grosul1, vkmr, Kayjukh, jurahul, msifontes, cfe-commits, llvm-commits

Tags: #clang, #mlir, #llvm

Differential Revision: https://reviews.llvm.org/D83087
2020-07-08 18:18:49 +02:00
Stanislav Mekhanoshin 64030099c3 SLP: honor requested max vector size merging PHIs
At the moment this place does not check maximum size set
by TTI and just creates a maximum possible vectors.

Differential Revision: https://reviews.llvm.org/D82227
2020-07-08 08:06:15 -07:00
Florian Hahn 04b85e2bcb Revert "[SLP] Make sure instructions are ordered when computing spill cost."
This seems to break http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-win/builds/24371

This reverts commit eb46137daa.
2020-07-07 23:15:01 +01:00
Ayal Zaks 7bf299c8d8 [LV] Vectorize without versioning-for-unit-stride under -Os/-Oz
If a loop is in a function marked OptSize, Loop Access Analysis should refrain
from generating runtime checks for unit strides that will version the loop.

If a loop is in a function marked OptSize and its vectorization is enabled, it
should be vectorized w/o any versioning.

Fixes PR46228.

Differential Revision: https://reviews.llvm.org/D81345
2020-07-07 15:04:21 +03:00
Jordan Rupprecht 10c82eecbc Revert "[LV] Enable the LoopVectorizer to create pointer inductions"
This reverts commit a8fe12065e.

It causes a crash when building gzip. Will post the detailed reduced test case to D81267.
2020-07-06 17:50:38 -07:00
Florian Hahn cff5739157 [LV] Pass dbgs() to verifyFunction call.
This is done in other places of the pass already and improves the output
on verification failure.
2020-07-06 15:09:20 +01:00
Florian Hahn eb46137daa [SLP] Make sure instructions are ordered when computing spill cost.
The entries in VectorizableTree are not necessarily ordered by their
position in basic blocks. Collect them and order them by dominance so
later instructions are guaranteed to be visited first. For instructions
in different basic blocks, we only scan to the beginning of the block,
so their order does not matter, as long as all instructions in a basic
block are grouped together. Using dominance ensures a deterministic order.

The modified test case contains an example where we compute a wrong
spill cost (2) without this patch, even though there is no call between
any instruction in the bundle.

This seems to have limited practical impact, .e.g on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto on MultiSource,SPEC2000,SPEC2006
there are no binary changes.

Reviewers: craig.topper, RKSimon, xbolva00, ABataev, spatel

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D82444
2020-07-03 17:30:17 +01:00
Anna Welker a8fe12065e [LV] Enable the LoopVectorizer to create pointer inductions
This patch enables the LoopVectorizer to build a phi of pointer
type and provide the vector loads and stores with vector type
getelementptrs built from the pointer induction variable, which
produces much less instructions than the previous approach of
creating scalar getelementpointers and glue them together to a
vector.

Differential Revision: https://reviews.llvm.org/D81267
2020-07-02 11:39:28 +01:00
Sanjay Patel b6315aee5b [VectorCombine] try to form vector compare and binop to eliminate scalar ops
binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
-->
vcmp = cmp Pred X, VecC
ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0

This is a larger pattern than the existing extractelement folds because we can't
reasonably vectorize the sub-patterns with constants based on cost model calcs
(it doesn't usually make sense to replace a single extracted scalar op with
constant operand with a vector op).

I salvaged as much of the existing logic as I could, but there might be better
ways to share and reduce code.

The motivating case from PR43745:
https://bugs.llvm.org/show_bug.cgi?id=43745
...is the special case of a 2-way reduction. We tried to get SLP to handle that
particular pattern in D59710, but that caused crashing and regressions.
This patch is more general, but hopefully safer.

The v2f64 test with SSE2 surprised me - the cost model accounting looks like this:
OldCost = 0 (free extract of f64 at index 0) + 1 (extract of f64 at index 1) + 2 (scalar fcmps) + 1 (and of bools) = 4
NewCost = 2 (vector fcmp) + 1 (shuffle) + 1 (vector 'and') + 1 (extract of bool) = 5

Differential Revision: https://reviews.llvm.org/D82474
2020-06-29 10:38:52 -04:00
Sanjay Patel 3b95d8346d [VectorCombine] refactor - make helper function for extract to shuffle logic; NFC
Preliminary for D82474
2020-06-29 09:55:34 -04:00
Florian Hahn c0cdba727a [VPlan] Add & use VPValue for VPWidenGEPRecipe operands (NFC).
This patch adds VPValue version of the GEP's operands to
VPWidenGEPRecipe and uses them during code-generation.

Reviewers: Ayal, gilr, rengolin

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D80220
2020-06-26 20:59:17 +01:00
Guillaume Chatelet 1507fc1506 [Alignment][NFC] Migrate TTI::isLegalToVectorize{Load,Store}Chain to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82653
2020-06-26 14:14:27 +00:00
Guillaume Chatelet b66e33a689 [Alignment][NFC] Migrate TTI::getGatherScatterOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82577
2020-06-26 11:08:27 +00:00
Guillaume Chatelet fdc7c7fb87 [Alignment][NFC] Migrate TTI::getInterleavedMemoryOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82573
2020-06-26 11:00:53 +00:00
Guillaume Chatelet 7e1f79c3de [Alignment][NFC] Migrate TTI::getMaskedMemoryOpCost to Align
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82569
2020-06-26 10:14:16 +00:00
Simon Pilgrim 1b10c618e9 LoopVectorize.h - reduce AliasAnalysis.h include to forward declaration. NFC.
Replace legacy AliasAnalysis typedef with AAResults where necessary.
2020-06-26 10:49:00 +01:00
dfukalov 7ddee0922f [NFCI][CostModel] Add const to Value*.
Summary:
Get back `const` partially lost in one of recent changes.
Additionally specify explicit qualifiers in few places.

Reviewers: samparker

Reviewed By: samparker

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82383
2020-06-24 23:16:08 +03:00
Florian Hahn 35bb9bfbb0 [SLP] Limit GEP lists based on width of index computation.
D68667 introduced a tighter limit to the number of GEPs to simplify
together. The limit was based on the vector element size of the pointer,
but the pointers themselves are not actually put in vectors.

IIUC we try to vectorize the index computations here, so we should base
the limit on the vector element size of the computation of the index.

This restores the test regression on AArch64 and also restores the
vectorization for a important pattern in SPEC2006/464.h264ref on
AArch64 (@test_i16_extend). We get a large benefit from doing a single
load up front and then processing the index computations in vectors.

Note that we could probably even further improve the AArch64 codegen, if
we would do zexts to i32 instead of i64 for the sub operands and then do
a single vector sext on the result of the subtractions. AArch64 provides
dedicated vector instructions to do so. Sketch of proof in Alive:
https://alive2.llvm.org/ce/z/A4xYAB

Reviewers: craig.topper, RKSimon, xbolva00, ABataev, spatel

Reviewed By: ABataev, spatel

Differential Revision: https://reviews.llvm.org/D82418
2020-06-24 19:56:53 +01:00
Sanjay Patel a0f967418f [VectorCombine] give invalid index value a name; NFC 2020-06-24 11:10:36 -04:00
Sanjay Patel 54143e2bd5 [VectorCombine] do not use magic number for undef mask element; NFC 2020-06-22 20:47:09 -04:00
Sanjay Patel 9934cc544c [VectorCombine] make helper function for shift-shuffle; NFC
This will probably be useful for other extract patterns.
2020-06-22 12:23:52 -04:00
Sanjay Patel 98c2f4eea5 [VectorCombine] add helper to replace uses and rename
The tests are regenerated to show a path that missed renaming,
but there should be no functional difference from this patch.
2020-06-22 09:58:49 -04:00
Sanjay Patel de65b356dc [VectorCombine] add/use pass-level IRBuilder
This saves creating/destroying a builder every time we
perform some transform.

The tests show instruction ordering diffs resulting from
always inserting at the root instruction now, but those
should be benign.
2020-06-22 09:01:29 -04:00
Sanjay Patel cce625f73d [VectorCombine] improve IR debugging by providing/salvaging value names
The tests are regenerated to show the diffs, but there should be no
functional change from this patch.
2020-06-22 08:35:47 -04:00
Sanjay Patel 6bdd531af5 [VectorCombine] create class for pass to hold analyses, etc; NFC
This doesn't change anything currently, but it would make sense
to create a class-level IRBuilder instead of recreating that
everywhere. As we expand to more optimizations, we will probably
also want to hold things like the DataLayout or other constant
refs in here too.
2020-06-21 16:07:33 -04:00
Sanjay Patel 741e20f3d6 [VectorCombine] fix assert for type of compare operand
As shown in the post-commit comment for D81661 - we need to
loosen the type assertion to allow scalarization of a compare
for vectors of pointers.
2020-06-20 15:20:17 -04:00
Sanjay Patel 216a37bb46 [VectorCombine] refactor extract-extract logic; NFCI 2020-06-19 14:52:27 -04:00
Sanjay Patel 6d864097a2 [VectorCombine] fix crash while transforming constants
This is a variation of the proposal in D82049 with an extra test.
2020-06-19 12:30:32 -04:00
Sanjay Patel 46a285ad9e [IRBuilder] add/use wrapper to create a generic compare based on predicate type; NFC
The predicate can always be used to distinguish between icmp and fcmp,
so we don't need to keep repeating this check in the callers.
2020-06-18 15:47:06 -04:00
Simon Pilgrim a5f1f9c9b8 ScalarEvolution.h - reduce LoopInfo.h include to forward declarations. NFC.
Move ScalarEvolution::forgetLoopDispositions implementation to ScalarEvolution.cpp to remove the dependency.

Add implicit header dependency to source files where necessary.
2020-06-17 15:48:23 +01:00
Sjoerd Meijer c1034d044a Follow up of rGe345d547a0d5, and attempt to pacify buildbot:
"error: 'get' is deprecated: The base class version of get with the scalable
argument defaulted to false is deprecated."

Changed VectorType::get() -> FixedVectorType::get().
2020-06-17 13:24:09 +01:00
Sjoerd Meijer e345d547a0 Recommit "[LV] Emit @llvm.get.active.lane.mask for tail-folded loops"
Fixed ARM regression test.

Please see the original commit message rG47650451738c for details.
2020-06-17 13:12:15 +01:00
Sjoerd Meijer d4e183f686 Revert "[LV] Emit @llvm.get.active.mask for tail-folded loops"
This reverts commit 4765045173
while I investigate the build bot failures.
2020-06-17 10:09:54 +01:00
Sjoerd Meijer 4765045173 [LV] Emit @llvm.get.active.mask for tail-folded loops
This emits new IR intrinsic @llvm.get.active.mask for tail-folded vectorised
loops if the intrinsic is supported by the backend, which is checked by
querying TargetTransform hook emitGetActiveLaneMask.

This intrinsic creates a mask representing active and inactive vector lanes,
which is used by the masked load/store instructions that are created for
tail-folded loops. The semantics of @llvm.get.active.mask are described here in
LangRef:

https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics

This intrinsic is also used to provide a hint to the backend. That is, the
second argument of the intrinsic represents the back-edge taken count of the
loop. For MVE, for example, we use that to set up tail-predication, which is a
new form of predication in MVE for vector loops that implicitely predicates the
last vector loop iteration by implicitely setting active/inactive lanes, i.e.
the tail loop is predicated. In order to set up a tail-predicated vector loop,
we need to know the number of data elements processed by the vector loop, which
corresponds the the tripcount of the scalar loop, which we can now reconstruct
using @llvm.get.active.mask.

Differential Revision: https://reviews.llvm.org/D79100
2020-06-17 09:53:58 +01:00
Christopher Tetreault ff628f5f5e [SVE] Eliminate calls to default-false VectorType::get() from Vectorize
Reviewers: efriedma, fhahn, spatel, sdesmalen, kmclaughlin

Reviewed By: efriedma

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81521
2020-06-16 12:50:13 -07:00
Sanjay Patel ed67f5e7ab [VectorCombine] scalarize compares with insertelement operand(s)
Generalize scalarization (recently enhanced with D80885)
to allow compares as well as binops.
Similar to binops, we are avoiding scalarization of a loaded
value because that could avoid a register transfer in codegen.
This requires 1 extra predicate that I am aware of: we do not
want to scalarize the condition value of a vector select. That
might also invert a transform that we do in instcombine that
prefers a vector condition operand for a vector select.

I think this is the final step in solving PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463

Differential Revision: https://reviews.llvm.org/D81661
2020-06-16 13:48:10 -04:00
Sam Parker 2596da3174 [CostModel] getCFInstrCost in getUserCost.
Have BasicTTI call the base implementation so that both agree on the
default behaviour, which the default being a cost of '1'. This has
required an X86 specific implementation as it seems to be very
reliant on those instructions being free. Changes are also made to
AMDGPU so that their implementations distinguish between cost kinds,
so that the unrolling isn't affected. PowerPC also has its own
implementation to prevent changes to the reg-usage vectorizer test.

The cost model test changes now reflect that ret instructions are not
generally free.

Differential Revision: https://reviews.llvm.org/D79164
2020-06-15 09:28:46 +01:00
Roman Lebedev 7aeb41b3c8
[NFCI] VectorCombine: add statistic for bitcast(shuf()) -> shuf(bitcast()) xform 2020-06-12 23:10:53 +03:00
Florian Hahn 3a846d4d92 [VPlan] Reject loops without computable backedge taken counts
getOrCreateTripCount is used to generate code for the outer loop, but it
requires a computable backedge taken counts. Check that in the VPlan
native path.

Reviewers: Ayal, gilr, rengolin, sguggill

Reviewed By: sguggill

Differential Revision: https://reviews.llvm.org/D81088
2020-06-12 10:31:18 +01:00
Sanjay Patel 039ff29ef6 [VectorCombine] remove unused parameters; NFC 2020-06-11 19:15:03 -04:00
Simon Pilgrim 5dc4e7c2b9 [VectorCombine] scalarizeBinop - support an all-constant src vector operand
scalarizeBinop currently folds

  vec_bo((inselt VecC0, V0, Index), (inselt VecC1, V1, Index))
  ->
  inselt(vec_bo(VecC0, VecC1), scl_bo(V0,V1), Index)

This patch extends this to account for cases where one of the vec_bo operands is already all-constant and performs similar cost checks to determine if the scalar binop with a constant still makes sense:

  vec_bo((inselt VecC0, V0, Index), VecC1)
  ->
  inselt(vec_bo(VecC0, VecC1), scl_bo(V0,extractelt(V1,Index)), Index)

Fixes PR42174

Differential Revision: https://reviews.llvm.org/D80885
2020-06-09 19:02:05 +01:00
Benjamin Kramer 3badd17b69 SmallPtrSet::find -> SmallPtrSet::count
The latter is more readable and more efficient. While there clean up
some double lookups. NFCI.
2020-06-07 22:38:08 +02:00
Simon Pilgrim 5006e551d3 LoopAnalysisManager.h - reduce includes to forward declarations. NFC.
Move implicit include dependencies down to header/source files.
2020-06-06 14:06:46 +01:00
Florian Hahn 211596c94e [VPlan] Support extracting lanes for defs managed in VPTransformState.
Currently extracting a lane for a VPValue def is not supported, if it is
managed directly by VPTransformState (e.g. because it is created by a
VPInstruction or an external VPValue def).

For now, simply extract the requested lane. In the future, we should
also cache the extracted scalar values, similar to LV.

Reviewers: Ayal, rengolin, gilr, SjoerdMeijer

Reviewed By: SjoerdMeijer

Differential Revision: https://reviews.llvm.org/D80787
2020-06-03 12:14:16 +01:00
Florian Hahn b446ec56a2 [LV] Make sure the MaxVF is a power-of-2 by rounding down.
LV currently only supports power of 2 vectorization factors, which has
been made explicit with the assertion added in
840450549c.

However, if the widest type is not a power-of-2 the computed MaxVF won't
be a power-of-2 either. This patch updates computeFeasibleMaxVF to
ensure the returned value is a power-of-2 by rounding down to the
nearest power-of-2.

Fixes PR46139.

Reviewers: Ayal, gilr, rengolin

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D80870
2020-06-02 10:40:49 +01:00
Valery N Dmitriev a45688a72c [SLP] Apply external to vectorizable tree users cost adjustment for
relevant aggregate build instructions only (UserCost).
Users are detected with findBuildAggregate routine and the trick is
that following SLP vectorization may end up vectorizing entire list
with smaller chunks. Cost adjustment then is applied for individual
chunks and these adjustments obviously have to be smaller than the
entire aggregate build cost.

Differential Revision: https://reviews.llvm.org/D80773
2020-05-29 15:37:41 -07:00
Christopher Tetreault d2befc6633 [SVE] Eliminate calls to default-false VectorType::get() from Vectorize
Reviewers: efriedma, c-rhodes, david-arm, fhahn

Reviewed By: david-arm

Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80339
2020-05-29 11:31:24 -07:00
Florian Hahn 9b507b2127 [LAA] We only need pointer checks if there are non-zero checks (NFC).
If it turns out that we can do runtime checks, but there are no
runtime-checks to generate, set RtCheck.Need to false.

This can happen if we can prove statically that the pointers passed in
to canCheckPtrAtRT do not alias. This should not change any results, but
allows us to skip some work and assert that runtime checks are
generated, if LAA indicates that runtime checks are required.

Reviewers: anemet, Ayal

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D79969

Note: This is a recommit of 259abfc7cb,
with some suggested renaming.
2020-05-27 12:47:36 +01:00
Florian Hahn 2d0389821e Revert "[LAA] We only need pointer checks if there are non-zero checks (NFC)."
This reverts commit 259abfc7cb.

Reverting this, as I missed a case where we return without setting
RtCheck.Need.
2020-05-27 12:39:45 +01:00
Florian Hahn 259abfc7cb [LAA] We only need pointer checks if there are non-zero checks (NFC).
If it turns out that we can do runtime checks, but there are no
runtime-checks to generate, set RtCheck.Need to false.

This can happen if we can prove statically that the pointers passed in
to canCheckPtrAtRT do not alias. This should not change any results, but
allows us to skip some work and assert that runtime checks are
generated, if LAA indicates that runtime checks are required.

Reviewers: anemet, Ayal

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D79969
2020-05-27 12:37:20 +01:00
Simon Pilgrim 35963f6d85 VPlanValue.h - reduce unnecessary includes to forward declarations. NFC. 2020-05-27 11:26:14 +01:00
Ayal Zaks 840450549c [LV] Clamp MaxVF to power of 2.
If a loop has a constant trip count known to be a multiple of MaxVF (times user
UF), LV infers that no tail will be generated for any chosen VF. This relies on
the chosen VF's being powers of 2 bound by MaxVF, and assumes MaxVF is a power
of 2. Make sure the latter holds, in particular when MaxVF is set by a memory
dependence distance which may not be a power of 2.

Differential Revision: https://reviews.llvm.org/D80491
2020-05-25 11:24:33 +03:00
Florian Hahn 0deab8a54f [LV] Either get invariant condition OR vector condition.
Currently we unconditionally get the first lane of the condition
operand, even if we later use the full vector condition. This can result
in some unnecessary instructions being generated.

Suggested as follow-up in D80219.
2020-05-24 17:16:42 +01:00
Sanjay Patel 7eed772a27 [PatternMatch] abbreviate vector inst matchers; NFC
Readability is not reduced with these opcodes/match lines,
so reduce odds of awkward wrapping from 80-col limit.
2020-05-24 09:19:47 -04:00
Florian Hahn 15224408f0 [VPlan] Use VPUser for VPWidenSelectRecipe operands (NFC).
VPWidenSelectRecipe already contains a VPUser, but it is not used. This
patch updates the code related to VPWidenSelectRecipe to use VPUser for
its operands.

Reviewers: Ayal, gilr, rengolin

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D80219
2020-05-24 13:58:08 +01:00
Sanjay Patel 024098ae53 [VectorCombine] set preserve alias analysis
As noted in D80236, moving the pass in the pipeline exposed this
shortcoming. Extra work to recalculate the alias results showed
up as a compile-time slowdown.
2020-05-22 16:25:16 -04:00
Anh Tuyen Tran 13bf6039c9 Title: [LV] Handle Fold-Tail of loops with vectorizarion factor equal to 1
Summary:
When handling loops whose VF is 1, fold-tail vectorization sets the
backedge taken count of the original loop with a vector of a single
element. This causes type-mismatch during instruction generartion.

The purpose of this patch is toto address the case of VF==1.

Reviewer: Ayal (Ayal Zaks), bmahjour (Bardia Mahjour), fhahn (Florian Hahn), gilr (Gil Rapaport), rengolin (Renato Golin)

Reviewed By: Ayal (Ayal Zaks), bmahjour (Bardia Mahjour), fhahn (Florian Hahn)

Subscribers: Ayal (Ayal Zaks), rkruppe (Hanna Kruppe), bmahjour (Bardia Mahjour), rogfer01 (Roger Ferrer Ibanez), vkmr (Vineet Kumar), bollu (Siddharth Bhat), hiraditya (Aditya Kumar), llvm-commits (Mailing List llvm-commits)

Tag: LLVM

Differential Revision: https://reviews.llvm.org/D79976
2020-05-22 13:30:56 +00:00
Sanjay Patel 21f7cf4057 [SLP] fix verification check for valid IR
This is a fix for PR45965 - https://bugs.llvm.org/show_bug.cgi?id=45965 -
which was left out of D80106 because of a test failure.

SLP does its own mini-CSE after potentially creating redundant instructions,
so we need to wait for that to complete before running the verifier.
Otherwise, we will see a test failure for
test/Transforms/SLPVectorizer/X86/crash_vectorizeTree.ll (not changed here)
because a phi temporarily has identical but different incoming values for
the same incoming block.

A related, but independent, test that would have been altered here was
fixed with:
rG880df55

The test was escaping verification in SLP without this change because we
were not running verifyFunction() unless SLP actually changed the IR.

Differential Revision: https://reviews.llvm.org/D80401
2020-05-22 09:15:27 -04:00
Dinar Temirbulatov df3b95bc0a [SLP][NFC] PR45269 getVectorElementSize() is slow
The algorithm inside getVectorElementSize() is almost O(x^2) complexity and
when, for example, we compile MultiSource/Applications/ClamAV/shared_sha256.c
with 1k instructions inside sha256_transform() function that resulted in almost
~800k iterations. The following change improves the algorithm with the map to
a liner complexity.

Differential Revision: https://reviews.llvm.org/D80241
2020-05-21 17:26:50 +02:00
Sam Parker 8cc911fa5b [NFCI][CostModel] Refactor getIntrinsicInstrCost
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.

Differential Revision: https://reviews.llvm.org/D79941
2020-05-20 11:59:08 +01:00
Florian Hahn bcbd26bfe6 [SCEV] Move ScalarEvolutionExpander.cpp to Transforms/Utils (NFC).
SCEVExpander modifies the underlying function so it is more suitable in
Transforms/Utils, rather than Analysis. This allows using other
transform utils in SCEVExpander.

This patch was originally committed as b8a3c34eee, but broke the
modules build, as LoopAccessAnalysis was using the Expander.

The code-gen part of LAA was moved to lib/Transforms recently, so this
patch can be landed again.

Reviewers: sanjoy.google, efriedma, reames

Reviewed By: sanjoy.google

Differential Revision: https://reviews.llvm.org/D71537
2020-05-20 10:53:40 +01:00
Florian Hahn 7cefd1b4cd [LV] Remove duplicated return stmt (NFC). 2020-05-19 17:20:50 +01:00
Florian Hahn cff9399f6b [VPlan] Fix comment for User in VPWidenSelectRecipe (NFC).
The comment was referring the arguments of the call, but the recipe
widens a select.
2020-05-19 15:31:39 +01:00
Florian Hahn f828d75b46 [VPlan] Add & use VPValue operands for VPReplicateRecipe (NFC).
This patch adds VPValue version of the instruction operands to
VPReplicateRecipe and uses them during code-generation.

Reviewers: Ayal, gilr, rengolin

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D80114
2020-05-19 15:12:17 +01:00
Florian Hahn 66ad107452 [VPlan] Remove unique_ptr from VPBranchOnRecipeMask (NFC).
We can remove a dynamic memory allocation, by checking the number of
operands: no operands = all true, 1 operand = mask.

Reviewers: Ayal, gilr, rengolin

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D80110
2020-05-19 15:01:37 +01:00
Eli Friedman 27b4e6931d [NFC] Replace MaybeAlign with Align in TargetTransformInfo. 2020-05-18 19:25:49 -07:00
Ayal Zaks 682e739638 [LV] Fix FoldTail under user VF and UF
LV considers an internally computed MaxVF to decide if a constant trip-count is
a multiple of any subsequently chosen VF, and conclude that no scalar remainder
iterations (tail) will be left for Fold Tail to handle. If an external VF is
provided via -force-vector-width, it must be considered instead of the internal
MaxVF.
If an external UF is provided via -force-vector-interleave, it too must be
considered in addition to MaxVF or user VF.

Fixes PR45679.

Differential Revision: https://reviews.llvm.org/D80085
2020-05-19 01:32:25 +03:00
Craig Topper c9f63297e2 Fix several places that were calling verifyFunction or verifyModule without checking the return value.
verifyFunction/verifyModule don't assert or error internally. They
also don't print anything if you don't pass a raw_ostream to them.
So the caller needs to check the result and ideally pass a stream
to get the messages. Otherwise they're just really expensive no-ops.

I've filed PR45965 for another instance in SLPVectorizer
that causes a lit test failure.

Differential Revision: https://reviews.llvm.org/D80106
2020-05-18 13:28:46 -07:00
Volkan Keles 63081dc6f6 LoadStoreVectorizer: Match nested adds to prove vectorization is safe
If both OpA and OpB is an add with NSW/NUW and with the same LHS operand,
we can guarantee that the transformation is safe if we can prove that OpA
won't overflow when IdxDiff added to the RHS of OpA.

Review: https://reviews.llvm.org/D79817
2020-05-18 12:13:01 -07:00
Nikita Popov 52e98f620c [Alignment] Remove unnecessary getValueOrABITypeAlignment calls (NFC)
Now that load/store alignment is required, we no longer need most
of them. Also switch the getLoadStoreAlignment() helper to return
Align instead of MaybeAlign.
2020-05-17 22:19:15 +02:00
Sanjay Patel 81e9ede3a2 [VectorCombine] forward walk through instructions to improve chaining of transforms
This is split off from D79799 - where I was proposing to fully iterate
over a function until there are no more transforms. I suspect we are
still going to want to do something like that eventually.

But we can achieve the same gains much more efficiently on the current
set of regression tests just by reversing the order that we visit the
instructions.

This may also reduce the motivation for D79078, but we are still not
getting the optimal pattern for a reduction.
2020-05-16 13:08:01 -04:00
Florian Hahn 4c8285c750 [VPlan] Move emission of \\l\"+\n to dumpBasicBlock (NFC).
The patch standardizes printing of VPRecipes a bit, by hoisting out the
common emission of \\l\"+\n. It simplifies the code and is also a first
step towards untangling printing from DOT format output, with the goal
of making the DOT output optional and to provide a more concise debug
output if DOT output is disabled.

Reviewers: gilr, Ayal, rengolin

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D78883
2020-05-14 13:07:59 +01:00
Alina Sbirlea bd541b217f [NewPassManager] Add assertions when getting statefull cached analysis.
Summary:
Analyses that are statefull should not be retrieved through a proxy from
an outer IR unit, as these analyses are only invalidated at the end of
the inner IR unit manager.
This patch disallows getting the outer manager and provides an API to
get a cached analysis through the proxy. If the analysis is not
stateless, the call to getCachedResult will assert.

Reviewers: chandlerc

Subscribers: mehdi_amini, eraman, hiraditya, zzheng, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72893
2020-05-13 12:38:38 -07:00
Sjoerd Meijer 9529597cf4 Recommit #2: "[LV] Induction Variable does not remain scalar under tail-folding."
This was reverted because of a miscompilation. At closer inspection, the
problem was actually visible in a changed llvm regression test too. This
one-line follow up fix/recommit will splat the IV, which is what we are trying
to avoid if unnecessary in general, if tail-folding is requested even if all
users are scalar instructions after vectorisation. Because with tail-folding,
the splat IV will be used by the predicate of the masked loads/stores
instructions. The previous version omitted this, which caused the
miscompilation. The original commit message was:

If tail-folding of the scalar remainder loop is applied, the primary induction
variable is splat to a vector and used by the masked load/store vector
instructions, thus the IV does not remain scalar. Because we now mark
that the IV does not remain scalar for these cases, we don't emit the vector IV
if it is not used. Thus, the vectoriser produces less dead code.

Thanks to Ayal Zaks for the direction how to fix this.
2020-05-13 13:50:09 +01:00
Sanjay Patel 5f730b645d [VectorCombine] account for extra uses in scalarization cost
Follow-up to D79452.
Mimics the extra use cost formula for the inverse transform with extracts.
2020-05-11 15:20:57 -04:00
Florian Hahn 8528186b9b [LAA] Move runtime-check generation to Transforms/Utils/loopUtils (NFC)
Currently LAA's uses of ScalarEvolutionExpander blocks moving the
expander from Analysis to Transforms. Conceptually the expander does not
fit into Analysis (it is only used for code generation) and
runtime-check generation also seems to be better suited as a
transformation utility.

Reviewers: Ayal, anemet

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D78460
2020-05-10 17:39:26 +01:00
Florian Hahn 96c63f544f Recommit "[LAA] Remove one addRuntimeChecks function (NFC)."
The failing assertion has been fixed and the problematic test case has
been added.

This reverts the revert commit fc44617f28.
2020-05-10 15:19:57 +01:00
Florian Hahn fc44617f28 Revert "[LAA] Remove one addRuntimeChecks function (NFC)."
This reverts commit c28114c8ff.

This causes some bots to fail:

http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-android/builds/30596/steps/build%20android%2Faarch64/logs/stdio
2020-05-10 13:28:00 +01:00
Florian Hahn c28114c8ff [LAA] Remove one addRuntimeChecks function (NFC).
In order to reduce the API surface area (preparation for D78460), remove
a addRuntimeChecks() function and do the additional check in the single
caller.

Reviewers: Ayal, anemet

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D79679
2020-05-10 12:48:55 +01:00
Sanjay Patel 0d2a0b44c8 [VectorCombine] scalarize binop of inserted elements into vector constants
As with the extractelement patterns that are currently in vector-combine,
there are going to be several possible variations on this theme. This
should be the clearest, simplest example.

Scalarization is the right direction for target-independent canonicalization,
and InstCombine has some of those folds already, but it doesn't do this.
I proposed a similar transform in D50992. Here in vector-combine, we can
check the cost model to be sure it's profitable, so there should be less risk.

Differential Revision: https://reviews.llvm.org/D79452
2020-05-08 16:31:12 -04:00
Benjamin Kramer f936457f80 Revert "Recommit "[LV] Induction Variable does not remain scalar under tail-folding.""
This reverts commit ae45b4dbe7. It
causes miscompilations, test case on the mailing list.
2020-05-08 14:49:10 +02:00
Sanjay Patel 02051c7f3a [SLP] add another bailout for load-combine patterns (2nd try)
The original patch (rG86dfbc676ebe) exposed an existing bug:
we could wrongly cast a constant expression to BinaryOperator
because the pattern matching allows that. This adds a check
for that case, and there's a reduced test case to verify no
crashing.

Original commit message:

This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.

The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.

That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:

  vpmovzxwd     (%rsi), %xmm0
  vmovdqu       %xmm0, (%rdi)

rather than:

  movzwl        (%rsi), %eax
  movl  %eax, (%rdi)
  ...

In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:

  movzbl (%rdi), %eax
  vmovd %eax, %xmm0
  movzbl 1(%rdi), %eax
  vmovd %eax, %xmm1
  movzbl 2(%rdi), %eax
  vpinsrb $4, 4(%rdi), %xmm0, %xmm0
  vpinsrb $8, 8(%rdi), %xmm0, %xmm0
  vpinsrb $12, 12(%rdi), %xmm0, %xmm0
  vmovd %eax, %xmm2
  movzbl 3(%rdi), %eax
  vpinsrb $1, 5(%rdi), %xmm1, %xmm1
  vpinsrb $2, 9(%rdi), %xmm1, %xmm1
  vpinsrb $3, 13(%rdi), %xmm1, %xmm1
  vpslld $24, %xmm0, %xmm0
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpslld $16, %xmm1, %xmm1
  vpor %xmm0, %xmm1, %xmm0
  vpinsrb $1, 6(%rdi), %xmm2, %xmm1
  vmovd %eax, %xmm2
  vpinsrb $2, 10(%rdi), %xmm1, %xmm1
  vpinsrb $3, 14(%rdi), %xmm1, %xmm1
  vpinsrb $1, 7(%rdi), %xmm2, %xmm2
  vpinsrb $2, 11(%rdi), %xmm2, %xmm2
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpinsrb $3, 15(%rdi), %xmm2, %xmm2
  vpslld $8, %xmm1, %xmm1
  vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
  vpor %xmm2, %xmm1, %xmm1
  vpor %xmm1, %xmm0, %xmm0
  vmovdqu %xmm0, (%rsi)

  movl  (%rdi), %eax
  movl  4(%rdi), %ecx
  movl  8(%rdi), %edx
  movbel        %eax, (%rsi)
  movbel        %ecx, 4(%rsi)
  movl  12(%rdi), %ecx
  movbel        %edx, 8(%rsi)
  movbel        %ecx, 12(%rsi)

Differential Revision: https://reviews.llvm.org/D78997
2020-05-07 15:04:37 -04:00
Hans Wennborg c54c6ee1a7 Revert "[SLP] add another bailout for load-combine patterns"
It caused asserts building Chromium, see discussion on
https://reviews.llvm.org/D78997

This reverts commit 86dfbc676e.
2020-05-07 16:31:52 +02:00
Sjoerd Meijer 3bbc71d6c9 [LV] Fix typo in variable name. NFC. 2020-05-07 13:53:44 +01:00
Sjoerd Meijer ae45b4dbe7 Recommit "[LV] Induction Variable does not remain scalar under tail-folding."
With 3 llvm regr tests fixed/updated that I had missed.
2020-05-07 11:52:20 +01:00
Sjoerd Meijer 20d67ffeae Revert "[LV] Induction Variable does not remain scalar under tail-folding."
This reverts commit 617aa64c84.

while I investigate buildbot failures.
2020-05-07 09:29:56 +01:00
Sjoerd Meijer 617aa64c84 [LV] Induction Variable does not remain scalar under tail-folding.
If tail-folding of the scalar remainder loop is applied, the primary induction
variable is splat to a vector and used by the masked load/store vector
instructions, thus the IV does not remain scalar. Because we now mark
that the IV does not remain scalar for these cases, we don't emit the vector IV
if it is not used. Thus, the vectoriser produces less dead code.

Thanks to Ayal Zaks for the direction how to fix this.

Differential Revision: https://reviews.llvm.org/D78911
2020-05-07 09:15:23 +01:00
Sanjay Patel 86dfbc676e [SLP] add another bailout for load-combine patterns
This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.

The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.

That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:

  vpmovzxwd	(%rsi), %xmm0
  vmovdqu	%xmm0, (%rdi)

rather than:

  movzwl	(%rsi), %eax
  movl	%eax, (%rdi)
  ...

In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:

  movzbl (%rdi), %eax
  vmovd %eax, %xmm0
  movzbl 1(%rdi), %eax
  vmovd %eax, %xmm1
  movzbl 2(%rdi), %eax
  vpinsrb $4, 4(%rdi), %xmm0, %xmm0
  vpinsrb $8, 8(%rdi), %xmm0, %xmm0
  vpinsrb $12, 12(%rdi), %xmm0, %xmm0
  vmovd %eax, %xmm2
  movzbl 3(%rdi), %eax
  vpinsrb $1, 5(%rdi), %xmm1, %xmm1
  vpinsrb $2, 9(%rdi), %xmm1, %xmm1
  vpinsrb $3, 13(%rdi), %xmm1, %xmm1
  vpslld $24, %xmm0, %xmm0
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpslld $16, %xmm1, %xmm1
  vpor %xmm0, %xmm1, %xmm0
  vpinsrb $1, 6(%rdi), %xmm2, %xmm1
  vmovd %eax, %xmm2
  vpinsrb $2, 10(%rdi), %xmm1, %xmm1
  vpinsrb $3, 14(%rdi), %xmm1, %xmm1
  vpinsrb $1, 7(%rdi), %xmm2, %xmm2
  vpinsrb $2, 11(%rdi), %xmm2, %xmm2
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpinsrb $3, 15(%rdi), %xmm2, %xmm2
  vpslld $8, %xmm1, %xmm1
  vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
  vpor %xmm2, %xmm1, %xmm1
  vpor %xmm1, %xmm0, %xmm0
  vmovdqu %xmm0, (%rsi)

  movl	(%rdi), %eax
  movl	4(%rdi), %ecx
  movl	8(%rdi), %edx
  movbel	%eax, (%rsi)
  movbel	%ecx, 4(%rsi)
  movl	12(%rdi), %ecx
  movbel	%edx, 8(%rsi)
  movbel	%ecx, 12(%rsi)

Differential Revision: https://reviews.llvm.org/D78997
2020-05-05 12:44:38 -04:00
Simon Pilgrim 4e3c005554 [TTI] getScalarizationOverhead - use explicit VectorType operand
getScalarizationOverhead is only ever called with vectors (and we already had a load of cast<VectorType> calls immediately inside the functions).

Followup to D78357

Reviewed By: @samparker

Differential Revision: https://reviews.llvm.org/D79341
2020-05-05 16:59:23 +01:00