Commit Graph

1402 Commits

Author SHA1 Message Date
Simon Pilgrim f114ef3731 [CostModel][X86] Add generic costs for vXi32 MUL -> v2Xi16 PMADDDW folds
Based off the improved fold in D108522

This should eventually allow us to replace the SLM only cost patterns with generic versions.
2021-09-05 16:08:11 +01:00
Nikita Popov 02f74eadbe [IVDescriptors] Make pointer inductions compatible with opaque pointers
Store the used element type in the InductionDescriptor. For typed
pointers, it remains the pointer element type. For opaque pointers,
we always use an i8 element type, such that the step is a simple
offset.

A previous version of this patch instead tried to guess the element
type from an induction GEP, but this is not reliable, as the GEP
may be hidden (see @both in iv_outside_user.ll).

Differential Revision: https://reviews.llvm.org/D104795
2021-09-01 21:02:05 +02:00
Simon Pilgrim 10c982e0b3 Revert rG1c9bec727ab5c53fa060560dc8d346a911142170 : [InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated

As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)

Differential Revision: https://reviews.llvm.org/D106450
2021-08-23 21:09:26 +01:00
Florian Hahn d024a01511
Recommit "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts the revert ab9296f13b.

The issue causing the revert should be fixed in 9baed023b4.
2021-08-23 11:25:27 +01:00
Florian Hahn 9baed023b4
[LV] Adjust reduction recipes before recurrence handling.
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
2021-08-22 11:02:33 +01:00
Florian Hahn ab9296f13b
Revert "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts commit f4122398e7 to
investigate a crash exposed by it.

The patch breaks building the code below with `clang -O2 --target=aarch64-linux`

     int a;
     double b, c;
     void d() {
       for (; a; a++) {
         b += c;
         c = a;
       }
     }
2021-08-20 21:24:28 +01:00
David Sherwood f4122398e7 [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:

  -force-ordered-reductions=true|false

I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:

  Transforms/LoopVectorize/AArch64/strict-fadd.ll
  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D106653
2021-08-19 09:29:40 +01:00
David Sherwood 219d4518fc [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
For tight loops like this:

  float r = 0;
  for (int i = 0; i < n; i++) {
    r += a[i];
  }

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.

Differential Revision: https://reviews.llvm.org/D108292
2021-08-18 17:01:56 +01:00
Dylan Fleming ef198cd99e [SVE] Remove usage of getMaxVScale for AArch64, in favour of IR Attribute
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D106277
2021-08-17 14:42:47 +01:00
Nikita Popov 570c9beb8e [MemorySSA] Remove unnecessary MSSA dependencies
LoopLoadElimination, LoopVersioning and LoopVectorize currently
fetch MemorySSA when construction LoopAccessAnalysis. However,
LoopAccessAnalysis does not actually use MemorySSA and we can pass
nullptr instead.

This saves one MemorySSA calculation in the default pipeline, and
thus improves compile-time.

Differential Revision: https://reviews.llvm.org/D108074
2021-08-16 20:40:55 +02:00
Paul Walker f7a831daa6 [LoopVectorize] Don't emit remarks about lack of scalable vectors unless they're specifically requested.
Previously we emitted a "does not support scalable vectors"
remark for all targets whenever vectorisation is attempted. This
pollutes the output for architectures that don't support scalable
vectors and is likely confusing to the user.

Instead this patch introduces a debug message that reports when
scalable vectorisation is allowed by the target and only issues
the previous remark when scalable vectorisation is specifically
requested, for example:

  #pragma clang loop vectorize_width(2, scalable)

Differential Revision: https://reviews.llvm.org/D108028
2021-08-15 12:15:52 +01:00
David Sherwood 3ce8c31eb8 [NFC] Add extra RUN line to strict reduction tests
I have added RUN lines to both:

  Transforms/LoopVectorize/AArch64/strict-fadd.ll
  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

to show the default behaviour is to not vectorise when the following
flag is unset:

  -force-ordered-reductions
2021-08-10 14:48:38 +01:00
David Sherwood 8439415333 [IR] Let ConstantVector::getSplat use poison instead of undef
This patch updates ConstantVector::getSplat to use poison instead
of undef when using insertelement/shufflevector to splat.

This follows on from D93793.

Differential Revision: https://reviews.llvm.org/D107751
2021-08-10 08:27:43 +01:00
Dorit Nuzman 67278b8a90 [LV] Support Interleaved Store Group With Gaps
Teach LV to use masked-store to support interleave-store-group with
gaps (instead of scatters/scalarization).

The symmetric case of using masked-load to support
interleaved-load-group with gaps was introduced a while ago, by
https://reviews.llvm.org/D53668; This patch completes the store-scenario
leftover from D53668, and solves PR50566.

Reviewed by: Ayal Zaks

Differential Revision: https://reviews.llvm.org/D104750
2021-08-08 10:32:02 +03:00
David Sherwood 3fd96e1b2e [LoopVectorize] Improve vectorisation of some intrinsics by treating them as uniform
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:

  1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
  are always uniform by definition.
  2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
  loop invariant input operands then these are also uniform too.

Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:

  1. If the 'assume' intrinsic's operand is not loop invariant, then we
  are free to treat this as uniform anyway since it's only a performance
  hint. We will get the benefit for the first lane.
  2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
  variant then for scalable vectors we assume these still ultimately come
  from the broadcast of an alloca. We do not support scalable vectorisation
  of loops containing alloca instructions, hence the alloca itself would
  be invariant. If the pointer does not come from an alloca then the
  intrinsic itself has no effect.

I have updated the assume test for fixed width, since we now treat it
as uniform:

  Transforms/LoopVectorize/assume.ll

I've also added new scalable vectorisation tests for other intriniscs:

  Transforms/LoopVectorize/scalable-assume.ll
  Transforms/LoopVectorize/scalable-lifetime.ll
  Transforms/LoopVectorize/scalable-noalias-scope-decl.ll

Differential Revision: https://reviews.llvm.org/D107284
2021-08-06 10:13:15 +01:00
David Sherwood 43a5c750d1 Revert "[LoopVectorize] Add support for replication of more intrinsics with scalable vectors"
This reverts commit 95800da914.
2021-08-06 09:48:16 +01:00
Sander de Smalen 3e47f009ff [LV] Consider ExtractValue as uniform.
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D107286
2021-08-05 16:20:50 +01:00
David Sherwood 95800da914 [LoopVectorize] Add support for replication of more intrinsics with scalable vectors
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:

  1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
  are always uniform by definition.
  2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
  loop invariant input operands then these are also uniform too.

Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:

  1. If the 'assume' intrinsic's operand is not loop invariant, then we
  are free to treat this as uniform anyway since it's only a performance
  hint. We will get the benefit for the first lane.
  2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
  variant then for scalable vectors we assume these still ultimately come
  from the broadcast of an alloca. We do not support scalable vectorisation
  of loops containing alloca instructions, hence the alloca itself would
  be invariant. If the pointer does not come from an alloca then the
  intrinsic itself has no effect.

I have updated the assume test for fixed width, since we now treat it
as uniform:

  Transforms/LoopVectorize/assume.ll

I've also added new scalable vectorisation tests for other intriniscs:

  Transforms/LoopVectorize/scalable-assume.ll
  Transforms/LoopVectorize/scalable-lifetime.ll
  Transforms/LoopVectorize/scalable-noalias-scope-decl.ll

Differential Revision: https://reviews.llvm.org/D107284
2021-08-05 15:17:27 +01:00
David Sherwood 1172a8a763 [NFC] Clean up tests in test/Transforms/LoopVectorize/assume.ll
The tests previously had lots of unnecessary CHECK lines, where
all we really need to check is the presence (or absence) of the
assume intrinsic and the correct input operands.

Differential Revision: https://reviews.llvm.org/D107157
2021-08-05 14:58:40 +01:00
Sander de Smalen 8d08a84745 [LV] Remove a change that was added in D106164.
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D106878
2021-08-05 14:44:53 +01:00
David Sherwood 21bf8172db [NFC] Remove redundant test in Transforms/LoopVectorize/lifetime.ll
The two tests (@testloopvariant and @testbitcast) are actually
identical as in both loops the bitcast gets widened, forcing the
lifetime marker to be replicated using each lane of the input
vector.

Differential Revision: https://reviews.llvm.org/D107150
2021-08-05 14:39:08 +01:00
David Sherwood 0156f91f3b [NFC] Rename enable-strict-reductions to force-ordered-reductions
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.

Differential Revision: https://reviews.llvm.org/D107264
2021-08-03 09:33:01 +01:00
Florian Hahn bb725c9803
[VPlan] Use defined and ops VPValues to print VPInterleaveRecipe.
This patch updates VPInterleaveRecipe::print to print the actual defined
VPValues for load groups and the store VPValue operands for store
groups.

The IR references may become outdated while transforming the VPlan and
the defined and stored VPValues always are up-to-date.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D107223
2021-08-02 18:36:36 +01:00
Florian Hahn c726b627ad
[VPlan] Add interleave group printing test. 2021-07-31 16:11:01 +01:00
Kerry McLaughlin 9d35594993 Reland "[LV] Use lookThroughAnd with logical reductions"
If a reduction Phi has a single user which `AND`s the Phi with a type mask,
`lookThroughAnd` will return the user of the Phi and the narrower type represented
by the mask. Currently this is only used for arithmetic reductions, whereas loops
containing logical reductions will create a reduction intrinsic using the widened
type, for example:

  for.body:
    %phi = phi i32 [ %and, %for.body ], [ 255, %entry ]
    %mask = and i32 %phi, 255
    %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
    %load = load i8, i8* %gep
    %ext = zext i8 %load to i32
    %and = and i32 %mask, %ext
    ...

^ this will generate an and reduction intrinsic such as the following:
    call i32 @llvm.vector.reduce.and.v8i32(<8 x i32>...)

The same example for an add instruction would create an intrinsic of type i8:
    call i8 @llvm.vector.reduce.add.v8i8(<8 x i8>...)

This patch changes AddReductionVar to call lookThroughAnd for other integer
reductions, allowing loops similar to the example above with reductions such
as and, or & xor to vectorize.

Reviewed By: david-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D105632
2021-07-30 18:04:09 +01:00
David Green 41cedb1c9a [LV][ARM] Tighten up MLA reduction costing
This makes a couple of changes to the costing of MLA reduction patterns,
to more accurately cost various patterns that can come up from
vectorization.

 - The Arm implementation of getExtendedAddReductionCost is altered to
   only provide costs for legal or smaller types. Larger than legal types
   need to be split, which currently does not work very well, especially
   for predicated reductions where the predicate may be legal but needs to
   be split. Currently we limit it to legal or smaller input types.
 - The getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))
   is a pattern that can come up, and can be treated the same as
   reduce(mul(ext, ext)) providing the extension types match.
 - And it has been adjusted to not count the ext in reduce(mul(ext, ext))
   as part of a reduce(mul) pattern.

Together these changes help to more accurately cost the mla reductions
in cases such as where the extend types don't match or the extend
opcodes are different, picking better vector factors that don't result
in expanded reductions.

Differential Revision: https://reviews.llvm.org/D106166
2021-07-28 12:50:58 +01:00
David Green 037b7715dd [ARM] Extra MVE reduction vectorizer tests. NFC 2021-07-28 10:55:06 +01:00
James Y Knight 3d272eea08 Fix test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll.
It was writing to the source directory (which may not be writeable),
rather than using %t.

Fixes: a5dd6c6cf9 ("[LoopVectorize] Don't interleave scalar ordered reductions for inner loops")
2021-07-27 18:32:29 -04:00
David Sherwood a5dd6c6cf9 [LoopVectorize] Don't interleave scalar ordered reductions for inner loops
Consider the following loop:

  void foo(float *dst, float *src, int N) {
    for (int i = 0; i < N; i++) {
      dst[i] = 0.0;
      for (int j = 0; j < N; j++) {
        dst[i] += src[(i * N) + j];
      }
    }
  }

When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.

Test added here:

  Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll

Differential Revision: https://reviews.llvm.org/D106646
2021-07-27 17:41:01 +01:00
Anna Thomas c78b9542e3 Update reduction test. Remove standalone test file
Based on post commit review comments at 68ffed12b.
2021-07-27 12:35:04 -04:00
Anna Thomas 68ffed12b7 [IVDescriptors] Fix bug in checkOrderedReduction
The Exit instruction passed in for checking if it's an ordered reduction need not be
an FPAdd operation. We need to bail out at that point instead of
assuming it is an FPAdd (and hence has two operands). See added testcase.
It crashes without the patch because the Exit instruction is a phi with
exactly one operand.
This latent bug was exposed by 95346ba which added support for
multi-exit loops for vectorization.

Reviewed-By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D106843
2021-07-27 09:31:44 -04:00
Sander de Smalen d7dd12aee3 [LV] Disable Scalable VFs when tail folding is enabled b/c of low tripcount.
The loop vectorizer may decide to use tail folding when the trip-count
is low. When that happens, scalable VFs are no longer a candidate,
since tail folding/predication is not yet supported for scalable vectors.

This can be re-enabled in a future patch.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D106657
2021-07-27 11:37:21 +01:00
Masoud Ataei 45951ad323 [PowerPC] Add pwr7 and pwr10 support to IBM MASSV pass on AIX
Before MASSV only supported P8 and P9 on AIX ans Linux . This patch proposes
MASSV to add support of P7 and P10 only on AIX too.

Differential: https://reviews.llvm.org/D106678
2021-07-26 23:21:38 +00:00
Sander de Smalen 13ccb09725 [LV] Don't let ForceTargetInstructionCost override Invalid cost.
Invalid costs can be used to avoid vectorization with a given VF, which is
used for scalable vectors to avoid things that the code-generator cannot
handle. If we override the cost using the -force-target-instruction-cost
option of the LV, we would override this mechanism, rendering the flag useless.

This change ensures the cost is only overriden when the original cost that
was calculated is valid. That allows the flag to be used in combination
with the -scalable-vectorization option.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106677
2021-07-26 20:27:49 +01:00
Sander de Smalen e745277012 [AArch64] NFC: Make some AArch64-SVE LoopVectorize tests generic.
This change moves most of `sve-inductions.ll` to non-AArch64 specific
LV tests using the `-target-supports-scalable-vectors` flag, because they're
not explicitly AArch64-specific. One test builds on AArch64-specific
knowledge regarding masked loads/stores, and remains in sve-inductions.ll.
2021-07-26 20:27:48 +01:00
Sander de Smalen b9051ba848 [LV] Remove assert that VF cannot be scalable in setCostBasedWideningDecision.
Scalarization for scalable vectors is not (yet) supported, so the
LV discards a VF when scalarization is chosen as the widening
decision. It should therefore not assert that the VF is not scalable
when it computes the decision to scalarize.

The code can get here when both the interleave-cost, gather/scatter cost
and scalarization-cost are all illegal. This may e.g. happen for SVE
when the VF=1, to avoid generating `<vscale x 1 x eltty>` types that
the code-generator cannot yet handle.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106656
2021-07-26 17:11:45 +01:00
Sander de Smalen 981e9dce54 [LV] Don't assume isScalarAfterVectorization if one of the uses needs widening.
This fixes an issue that was found in D105199, where a GEP instruction
is used both as the address of a store, as well as the value of a store.
For the former, the value is scalar after vectorization, but the latter
(as value) requires widening.

Other code in that function seems to prevent similar cases from happening,
but it seems this case was missed.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106164
2021-07-26 16:01:55 +01:00
Florian Hahn 7a1e73f0b9
Recommit "[VPlan] Add recipe for first-order rec phis, make splicing explicit."
This reverts the revert commit b1777b04dc.

The patch originally got reverted due to a crash:
https://bugs.chromium.org/p/chromium/issues/detail?id=1232798#c2

The underlying issue was that we were not using the stored values from
the modified memory recipes, but the out-of-date values directly from
the IR (accessed via the VPlan). This should be fixed in d995d6376. A
reduced version of the reproducer has been added in 93664503be.
2021-07-26 15:50:30 +01:00
Florian Hahn 93664503be
[LV] Add test to store a first-order rec via interleave group.
This is a reduced version of the reproducer from
https://bugs.chromium.org/p/chromium/issues/detail?id=1232798#c2
2021-07-26 15:20:04 +01:00
Dylan Fleming 20b0fa91c9 [SVE] Add support for folding for select + masked loads
Add folds to instcombine to support the removal of select instruction when the masked_load is guaranteed to zero the same lanes, i.e. select(mask, mload(,,mask,0), 0) -> mload(,,mask,0).

Patch originally authored by @paulwalker-arm

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106376
2021-07-26 11:58:41 +01:00
David Sherwood b2a5f0029f Fix test failures caused by 0aff1798b5 2021-07-26 11:40:26 +01:00
David Sherwood 0aff1798b5 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
Nico Weber b1777b04dc Revert "[VPlan] Add recipe for first-order rec phis, make splicing explicit."
Makes clang crash: https://reviews.llvm.org/D105008#2903350
This reverts commit d2a73fb44e.

Also revert a minor formatting follow-up:
This reverts commit 82834a6732.
2021-07-25 17:39:28 -04:00
Caroline Concatto 5a4de84d55 [LoopVectorize] Fix crash for predicated instruction with scalable VF
This patch avoids computing discounts for predicated instructions  when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.

This loop now does not use VF scalable

```
for (long long i = 0; i < n; i++)
    if (cond[i])
      a[i] /= b[i];
```

Differential Revision: https://reviews.llvm.org/D101916
2021-07-22 12:48:27 +01:00
Simon Pilgrim 1c9bec727a [InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)

Differential Revision: https://reviews.llvm.org/D106450
2021-07-22 10:58:51 +01:00
Kerry McLaughlin be753b207f Revert "[LV] Use lookThroughAnd with logical reductions"
Reverting patch due to buildbot failures.

This reverts commit e22a599672.
2021-07-21 15:16:00 +01:00
Simon Pilgrim ca9b60f9de [LoopVectorize] Regenerate sve-vector-reverse.ll test checks 2021-07-21 15:14:04 +01:00
Kerry McLaughlin e22a599672 [LV] Use lookThroughAnd with logical reductions
If a reduction Phi has a single user which `AND`s the Phi with a type mask,
`lookThroughAnd` will return the user of the Phi and the narrower type represented
by the mask. Currently this is only used for arithmetic reductions, whereas loops
containing logical reductions will create a reduction intrinsic using the widened
type, for example:

  for.body:
    %phi = phi i32 [ %and, %for.body ], [ 255, %entry ]
    %mask = and i32 %phi, 255
    %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
    %load = load i8, i8* %gep
    %ext = zext i8 %load to i32
    %and = and i32 %mask, %ext
    ...

^ this will generate an and reduction intrinsic such as the following:
    call i32 @llvm.vector.reduce.and.v8i32(<8 x i32>...)

The same example for an add instruction would create an intrinsic of type i8:
    call i8 @llvm.vector.reduce.add.v8i8(<8 x i8>...)

This patch changes AddReductionVar to call lookThroughAnd for other integer
reductions, allowing loops similar to the example above with reductions such
as and, or & xor to vectorize.

Reviewed By: david-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D105632
2021-07-21 09:56:00 +01:00
Florian Hahn d2a73fb44e
[VPlan] Add recipe for first-order rec phis, make splicing explicit.
This patch adds a VPFirstOrderRecurrencePHIRecipe, to further untangle
VPWidenPHIRecipe into distinct recipes for distinct use cases/lowering.
See D104989 for a new recipe for reduction phis.

This patch also introduces a new `FirstOrderRecurrenceSplice`
VPInstruction opcode, which is used to make the forming of the vector
recurrence value explicit in VPlan. This more accurately models def-uses
in VPlan and also simplifies code-generation. Now, the vector recurrence
values are created at the right place during VPlan-codegeneration,
rather than during post-VPlan fixups.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D105008
2021-07-20 16:14:17 +02:00
Mindong Chen e908e063d1 [LoopUtils] Fix incorrect RT check bounds of loop-invariant mem accesses
This fixes the lower and upper bound calculation of a
RuntimeCheckingPtrGroup when it has more than one loop
invariant pointers. Resolves PR50686.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D104148
2021-07-19 19:38:24 +08:00