Commit Graph

1031 Commits

Author SHA1 Message Date
Florian Hahn 6ca0985aa5 [InterleavedAccessAnalysis] Fix integer overflow in insertMember.
Without checking for integer overflow, invalid members can be added
 e.g. if the calculated key overflows, becomes positive and the largest key.

This fixes
      https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=7560
      https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13128
      https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13229

Reviewers: Ayal, anna, hsaito, efriedma

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D55538

llvm-svn: 355613
2019-03-07 17:50:16 +00:00
Nikita Popov af2b0bef43 [ValueTracking] More accurate unsigned sub overflow detection
Second part of D58593.

Compute precise overflow conditions based on all known bits, rather
than just the sign bits. Unsigned a - b overflows iff a < b, and we
can determine whether this always/never happens based on the minimal
and maximal values achievable for a and b subject to the known bits
constraint.

llvm-svn: 355109
2019-02-28 18:04:20 +00:00
Michael Kruse 77a614a6e1 Refactor setAlreadyUnrolled() and setAlreadyVectorized().
Loop::setAlreadyUnrolled() and
LoopVectorizeHints::setLoopAlreadyUnrolled() both add loop metadata that
stops the same loop from being transformed multiple times. This patch
merges both implementations.

In doing so we fix 3 potential issues:

 * setLoopAlreadyUnrolled() kept the llvm.loop.vectorize/interleave.*
   metadata even though it will not be used anymore. This already caused
   problems such as http://llvm.org/PR40546. Change the behavior to the
   one of setAlreadyUnrolled which deletes this loop metadata.

 * setAlreadyUnrolled() used to create a new LoopID by calling
   MDNode::get with nullptr as the first operand, then replacing it by
   the returned references using replaceOperandWith. It is possible
   that MDNode::get would instead return an existing node (due to
   de-duplication) that then gets modified. To avoid, use a fresh
   TempMDNode that does not get uniqued with anything else before
   replacing it with replaceOperandWith.

 * LoopVectorizeHints::matchesHintMetadataName() only compares the
   suffix of the attribute to set the new value for. That is, when
   called with "enable", would erase attributes such as
   "llvm.loop.unroll.enable", "llvm.loop.vectorize.enable" and
   "llvm.loop.distribute.enable" instead of the one to replace.
   Fortunately, function was only called with "isvectorized".

Differential Revision: https://reviews.llvm.org/D57566

llvm-svn: 353738
2019-02-11 19:45:44 +00:00
Florian Hahn ba5acbc4fe [LV] Prevent interleaving if computeMaxVF returned None.
As discussed in D57382, interleaving should be avoided if computeMaxVF
returns None, same as we currently do for vectorization.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=6477

Reviewers: Ayal, dcaballe, hsaito, mkuper, rengolin

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D57837

llvm-svn: 353461
2019-02-07 20:49:10 +00:00
Alina Sbirlea f9027e554a Check bool attribute value in getOptionalBoolLoopAttribute.
Summary:
Check the bool value of the attribute in getOptionalBoolLoopAttribute
not just its existance.
Eliminates the warning noise generated when vectorization is explicitly disabled.

Reviewers: Meinersbur, hfinkel, dmgreen

Subscribers: jlebar, sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D57260

llvm-svn: 352555
2019-01-29 22:33:20 +00:00
Johannes Doerfert 00102c7d95 [ValueTracking] Look through casts when determining non-nullness
Bitcast and certain Ptr2Int/Int2Ptr instructions will not alter the
value of their operand and can therefore be looked through when we
determine non-nullness.

Differential Revision: https://reviews.llvm.org/D54956

llvm-svn: 352293
2019-01-26 23:40:35 +00:00
Simon Pilgrim c934d3a01b [CostModel][X86] Add explicit vector select costs
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.

Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).

The increase pre-SSE41 selection costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.

llvm-svn: 351685
2019-01-20 13:55:01 +00:00
Sanjay Patel 7d65fe5cd5 [LoopVectorizer] give more advice in remark about failure to vectorize call
Something like this is requested by:
https://bugs.llvm.org/show_bug.cgi?id=40265
...and it seems like a common enough case that we should acknowledge it.

Differential Revision: https://reviews.llvm.org/D56551

llvm-svn: 351010
2019-01-12 15:27:15 +00:00
Florian Hahn ef307b8c26 [LAA] Avoid generating RT checks for known deps preventing vectorization.
If we found unsafe dependences other than 'unknown', we already know at
compile time that they are unsafe and the runtime checks should always
fail. So we can avoid generating them in those cases.

This should have no negative impact on performance as the runtime checks
that would be created previously should always fail. As a sanity check,
I measured the test-suite, spec2k and spec2k6 and there were no regressions.

Reviewers: Ayal, anemet, hsaito

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D55798

llvm-svn: 349794
2018-12-20 18:49:09 +00:00
Michael Kruse 978ba61536 Introduce llvm.loop.parallel_accesses and llvm.access.group metadata.
The current llvm.mem.parallel_loop_access metadata has a problem in that
it uses LoopIDs. LoopID unfortunately is not loop identifier. It is
neither unique (there's even a regression test assigning the some LoopID
to multiple loops; can otherwise happen if passes such as LoopVersioning
make copies of entire loops) nor persistent (every time a property is
removed/added from a LoopID's MDNode, it will also receive a new LoopID;
this happens e.g. when calling Loop::setLoopAlreadyUnrolled()).
Since most loop transformation passes change the loop attributes (even
if it just to mark that a loop should not be processed again as
llvm.loop.isvectorized does, for the versioned and unversioned loop),
the parallel access information is lost for any subsequent pass.

This patch unlinks LoopIDs and parallel accesses.
llvm.mem.parallel_loop_access metadata on instruction is replaced by
llvm.access.group metadata. llvm.access.group points to a distinct
MDNode with no operands (avoiding the problem to ever need to add/remove
operands), called "access group". Alternatively, it can point to a list
of access groups. The LoopID then has an attribute
llvm.loop.parallel_accesses with all the access groups that are parallel
(no dependencies carries by this loop).

This intentionally avoid any kind of "ID". Loops that are clones/have
their attributes modifies retain the llvm.loop.parallel_accesses
attribute. Access instructions that a cloned point to the same access
group. It is not necessary for each access to have it's own "ID" MDNode,
but those memory access instructions with the same behavior can be
grouped together.

The behavior of llvm.mem.parallel_loop_access is not changed by this
patch, but should be considered deprecated.

Differential Revision: https://reviews.llvm.org/D52116

llvm-svn: 349725
2018-12-20 04:58:07 +00:00
Florian Hahn 485f2826ba [LAA] Introduce enum for vectorization safety status (NFC).
This patch adds a VectorizationSafetyStatus enum, which will be extended
in a follow up patch to distinguish between 'safe with runtime checks'
and 'known unsafe' dependences.

Reviewers: anemet, anna, Ayal, hsaito

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D54892

llvm-svn: 349556
2018-12-18 22:25:11 +00:00
Sanjay Patel 608d128c42 [LoopVectorize] auto-generate complete checks; NFC
The first test claims to show that the vectorizer will
generate a vector load/loop, but then this file runs
other passes which might scalarize that op. I'm removing 
instcombine from the RUN line here to break that dependency.
Also, I'm generating full checks to make it clear exactly 
what the vectorizer has done.

llvm-svn: 349554
2018-12-18 22:23:04 +00:00
Michael Kruse 7244852557 [Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes.
When multiple loop transformation are defined in a loop's metadata, their order of execution is defined by the order of their respective passes in the pass pipeline. For instance, e.g.

    #pragma clang loop unroll_and_jam(enable)
    #pragma clang loop distribute(enable)

is the same as

    #pragma clang loop distribute(enable)
    #pragma clang loop unroll_and_jam(enable)

and will try to loop-distribute before Unroll-And-Jam because the LoopDistribute pass is scheduled after UnrollAndJam pass. UnrollAndJamPass only supports one inner loop, i.e. it will necessarily fail after loop distribution. It is not possible to specify another execution order. Also,t the order of passes in the pipeline is subject to change between versions of LLVM, optimization options and which pass manager is used.

This patch adds 'followup' attributes to various loop transformation passes. These attributes define which attributes the resulting loop of a transformation should have. For instance,

    !0 = !{!0, !1, !2}
    !1 = !{!"llvm.loop.unroll_and_jam.enable"}
    !2 = !{!"llvm.loop.unroll_and_jam.followup_inner", !3}
    !3 = !{!"llvm.loop.distribute.enable"}

defines a loop ID (!0) to be unrolled-and-jammed (!1) and then the attribute !3 to be added to the jammed inner loop, which contains the instruction to distribute the inner loop.

Currently, in both pass managers, pass execution is in a fixed order and UnrollAndJamPass will not execute again after LoopDistribute. We hope to fix this in the future by allowing pass managers to run passes until a fixpoint is reached, use Polly to perform these transformations, or add a loop transformation pass which takes the order issue into account.

For mandatory/forced transformations (e.g. by having been declared by #pragma omp simd), the user must be notified when a transformation could not be performed. It is not possible that the responsible pass emits such a warning because the transformation might be 'hidden' in a followup attribute when it is executed, or it is not present in the pipeline at all. For this reason, this patche introduces a WarnMissedTransformations pass, to warn about orphaned transformations.

Since this changes the user-visible diagnostic message when a transformation is applied, two test cases in the clang repository need to be updated.

To ensure that no other transformation is executed before the intended one, the attribute `llvm.loop.disable_nonforced` can be added which should disable transformation heuristics before the intended transformation is applied. E.g. it would be surprising if a loop is distributed before a #pragma unroll_and_jam is applied.

With more supported code transformations (loop fusion, interchange, stripmining, offloading, etc.), transformations can be used as building blocks for more complex transformations (e.g. stripmining+stripmining+interchange -> tiling).

Reviewed By: hfinkel, dmgreen

Differential Revision: https://reviews.llvm.org/D49281
Differential Revision: https://reviews.llvm.org/D55288

llvm-svn: 348944
2018-12-12 17:32:52 +00:00
Nikita Popov 110cf05203 Reapply "[DemandedBits][BDCE] Support vectors of integers"
DemandedBits and BDCE currently only support scalar integers. This
patch extends them to also handle vector integer operations. In this
case bits are not tracked for individual vector elements, instead a
bit is demanded if it is demanded for any of the elements. This matches
the behavior of computeKnownBits in ValueTracking and
SimplifyDemandedBits in InstCombine.

Unlike the previous iteration of this patch, getDemandedBits() can now
again be called on arbirary (sized) instructions, even if they don't
have integer or vector of integer type. (For vector types the size of the
returned mask will now be the scalar size in bits though.)

The added LoopVectorize test case shows a case which triggered an
assertion failure with the previous attempt, because getDemandedBits()
was called on a pointer-typed instruction.

Differential Revision: https://reviews.llvm.org/D55297

llvm-svn: 348602
2018-12-07 15:38:13 +00:00
David L. Jones 5ff7b8a04a Revert r347934 "[SCEV] Guard movement of insertion point for loop-invariants"
This change caused SEGVs in instcombine. (The r347934 change seems to me to be a
precipitating cause, not a root cause. Details are on the llvm-commits thread
for r347934.)

llvm-svn: 348426
2018-12-05 23:13:50 +00:00
Craig Topper 88270231f8 [X86][LoopVectorize] Replace -mcpu=skylake-avx512 with -mattr=avx512f in some tests that failed when experimenting with defaulting to -mprefer-vector-width=256 for skylake-avx512.
llvm-svn: 348063
2018-12-01 01:38:44 +00:00
Renato Golin 135e72e1b9 Add a new reduction pattern match
Adding a new reduction pattern match for vectorizing code similar
to TSVC s3111:

for (int i = 0; i < N; i++)
  if (a[i] > b)
    sum += a[i];

This patch adds support for fadd, fsub and fmull, as well as multiple
branches and different (but compatible) instructions (ex. add+sub) in
different branches.

The difference from the previous patch(https://reviews.llvm.org/D49168)
is as follows:
 - Added check of fast-math property of fp-instruction to the
   previous patch
 - Fix/add some pattern for if-reduction.ll


Differential Revision: https://reviews.llvm.org/D54464

Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>
     and Masakazu Ueno <masakazu.ueno@linaro.org>

llvm-svn: 347989
2018-11-30 13:40:10 +00:00
Warren Ristow 72d1f3a285 [SCEV] Guard movement of insertion point for loop-invariants
r320789 suppressed moving the insertion point of SCEV expressions with
dev/rem operations to the loop header in non-loop-invariant situations.
This, and similar, hoisting is also unsafe in the loop-invariant case,
since there may be a guard against a zero denominator. This is an
adjustment to the fix of r320789 to suppress the movement even in the
loop-invariant case.

This fixes PR30806.

Differential Revision: https://reviews.llvm.org/D54713

llvm-svn: 347934
2018-11-30 00:02:54 +00:00
Martin Storsjo bfd1d27585 Revert "[LICM] Enable control flow hoisting by default" and "[LICM] Reapply r347190 "Make LICM able to hoist phis" with fix"
This reverts commits r347776 and r347778.

The first one, r347776, caused significant compile time regressions
for certain input files, see PR39836 for details.

llvm-svn: 347867
2018-11-29 14:39:39 +00:00
John Brawn 4557ffeb63 [LICM] Enable control flow hoisting by default
Differential Revision: https://reviews.llvm.org/D54949

llvm-svn: 347778
2018-11-28 17:23:03 +00:00
Joel Jones 7459398a43 Revert unapproved commit
llvm-svn: 347511
2018-11-24 07:26:55 +00:00
Joel Jones 5f533c5fe1 [AArch64] Enable libm vectorized functions via SLEEF
This changeset is modeled after Intel's submission for SVML. It enables
trigonometry functions vectorization via SLEEF: http://sleef.org/.

 * A new vectorization library enum is added to TargetLibraryInfo.h: SLEEF.
 * A new option is added to TargetLibraryInfoImpl - ClVectorLibrary: SLEEF.
 * A comprehensive test case is included in this changeset.
 * In a separate changeset (for clang), a new vectorization library argument is
   added to -fveclib: -fveclib=SLEEF.

Trigonometry functions that are vectorized by sleef:

acos
asin
atan
atanh
cos
cosh
exp
exp2
exp10
lgamma
log10
log2
log
sin
sinh
sqrt
tan
tanh
tgamma

Patch by Stefan Teleman
Differential Revision: https://reviews.llvm.org/D53927

llvm-svn: 347510
2018-11-24 06:41:39 +00:00
Benjamin Kramer 2cad359c91 Revert "[LICM] Make LICM able to hoist phis"
This reverts commit r347190.

llvm-svn: 347225
2018-11-19 16:51:57 +00:00
Anna Thomas 5e9215f02b [LV] Avoid vectorizing unsafe dependencies in uniform address
Summary:
Currently, when vectorizing stores to uniform addresses, the only
instance we prevent vectorization is if there are multiple stores to the
same uniform address causing an unsafe dependency.
This patch teaches LAA to avoid vectorizing loops that have an unsafe
cross-iteration dependency between a load and a store to the same uniform address.

Fixes PR39653.

Reviewers: Ayal, efriedma

Subscribers: rkruppe, llvm-commits

Differential Revision: https://reviews.llvm.org/D54538

llvm-svn: 347220
2018-11-19 15:39:59 +00:00
John Brawn 12c046fba0 [LICM] Make LICM able to hoist phis
The general approach taken is to make note of loop invariant branches, then when
we see something conditional on that branch, such as a phi, we create a copy of
the branch and (empty versions of) its successors and hoist using that.

This has no impact by itself that I've been able to see, as LICM typically
doesn't see such phis as they will have been converted into selects by the time
LICM is run, but once we start doing phi-to-select conversion later it will be
important.

Differential Revision: https://reviews.llvm.org/D52827

llvm-svn: 347190
2018-11-19 11:31:24 +00:00
Simon Pilgrim 47d38198eb [CostModel] Add more realistic SK_InsertSubvector generic costs.
Instead of defaulting to a cost = 1, expand to element extract/insert like we do for other shuffles.

llvm-svn: 346662
2018-11-12 15:20:24 +00:00
Sanjay Patel 1456fd7614 [VectorUtils] add funnel-shifts to the list of vectorizable intrinsics
This just identifies the intrinsics as candidates for vectorization.
It does not mean we will attempt to vectorize under normal conditions
(the test file is forcing vectorization). 

The cost model must be fixed to show that the transform is profitable 
in general.

Allowing vectorization with these intrinsics is required to avoid
potential regressions from canonicalizing to the intrinsics from
generic IR:
https://bugs.llvm.org/show_bug.cgi?id=37417

llvm-svn: 346661
2018-11-12 15:20:14 +00:00
Sanjay Patel 75120dcb06 [LoopVectorize] add tests for funnel shifts; NFC
llvm-svn: 346658
2018-11-12 14:52:01 +00:00
Jonas Paulsson 79f2441eee [SystemZ] Rework getInterleavedMemoryOpCost()
Model this function more closely after the BasicTTIImpl version, with
separate handling of loads and stores. For loads, the set of actually loaded
vectors is checked.

This makes it more readable and just slightly more accurate generally.

Review: Ulrich Weigand
https://reviews.llvm.org/D53071

llvm-svn: 345998
2018-11-02 17:15:36 +00:00
Ayal Zaks 45a3ca7be7 [LV] Avoid vectorizing loops under opt for size that involve SCEV checks
Fix PR39417, PR39497

The loop vectorizer may generate runtime SCEV checks for overflow and stride==1
cases, leading to execution of original scalar loop. The latter is forbidden
when optimizing for size. An assert introduced in r344743 triggered the above
PR's showing it does happen. This patch fixes this behavior by preventing
vectorization in such cases.

Differential Revision: https://reviews.llvm.org/D53612

llvm-svn: 345959
2018-11-02 09:16:12 +00:00
Dorit Nuzman 34da6dd696 [LV] Support vectorization of interleave-groups that require an epilog under
optsize using masked wide loads 

Under Opt for Size, the vectorizer does not vectorize interleave-groups that
have gaps at the end of the group (such as a loop that reads only the even
elements: a[2*i]) because that implies that we'll require a scalar epilogue
(which is not allowed under Opt for Size). This patch extends the support for
masked-interleave-groups (introduced by D53011 for conditional accesses) to
also cover the case of gaps in a group of loads; Targets that enable the
masked-interleave-group feature don't have to invalidate interleave-groups of
loads with gaps; they could now use masked wide-loads and shuffles (if that's
what the cost model selects).

Reviewers: Ayal, hsaito, dcaballe, fhahn

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D53668

llvm-svn: 345705
2018-10-31 09:57:56 +00:00
Jonas Paulsson 1f067c94dc [LoopVectorizer] Fix for cost values of memory accesses.
This commit is a combination of two patches:

* "Fix in getScalarizationOverhead()"

   If target returns false in TTI.prefersVectorizedAddressing(), it means the
   address registers will not need to be extracted. Therefore, there should
   be no operands scalarization overhead for a load instruction.

* "Don't pass the instruction pointer from getMemInstScalarizationCost."

   Since VF is always > 1, this is a cost query for an instruction in the
   vectorized loop and it should not be evaluated within the scalar
   context of the instruction.

Review: Ulrich Weigand, Hal Finkel
https://reviews.llvm.org/D52351
https://reviews.llvm.org/D52417

llvm-svn: 345603
2018-10-30 14:34:15 +00:00
Renato Golin 53bd4f4832 Revert r344172: [LV] Add a new reduction pattern match
This patch has caused fast-math issues in the reduction pattern.

Will re-work and land again.

llvm-svn: 345465
2018-10-27 22:13:43 +00:00
Simon Pilgrim 53e8e145e9 [CostModel][X86] Add realistic vXi64 uitofp vXf64 costs
Match codegen improvements from D53649/rL345256

llvm-svn: 345263
2018-10-25 13:06:20 +00:00
Dorit Nuzman 5114390e48 [LV] Don't have fold-tail under optsize invalidate interleave-groups when
masked-interleaving is enabled

Enable interleave-groups under fold-tail scenario for Opt for size compilation;
D50480 added support for vectorizing loops of arbitrary trip-count without a
remiander, which in turn makes everything in the loop conditional, including
interleave-groups if any. It therefore invalidated all interleave-groups
because we didn't have support for vectorizing predicated interleaved-groups
at the time. In the meantime, D53011 introduced this support, so we don't
have to invalidate interleave-groups when masked-interleaved support is enabled.

Reviewers: Ayal, hsaito, dcaballe, fhahn

Reviewed By: hsaito

Differential Revision: https://reviews.llvm.org/D53559

llvm-svn: 345115
2018-10-24 07:11:38 +00:00
Sanjay Patel 747feb28e4 [InstCombine] use 'match' to handle vectors and simplify code
This is another step towards completely removing the fake 
binop queries for not/neg/fneg.

llvm-svn: 345036
2018-10-23 15:05:12 +00:00
Dorit Nuzman 3ec99fe21b [IAI,LV] Avoid creating a scalar epilogue due to gaps in interleave-groups when
optimizing for size

LV is careful to respect -Os and not to create a scalar epilog in all cases
(runtime tests, trip-counts that require a remainder loop) except for peeling
due to gaps in interleave-groups. This patch fixes that; -Os will now have us
invalidate such interleave-groups and vectorize without an epilog.

The patch also removes a related FIXME comment that is now obsolete, and was
also inaccurate:
"FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a smaller
MaxVF that does not require a scalar epilog."
(requiresScalarEpilog() has nothing to do with VF).

Reviewers: Ayal, hsaito, dcaballe, fhahn

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D53420

llvm-svn: 344883
2018-10-22 06:17:09 +00:00
Thomas Lively 8a91cf1cc5 [LoopVectorize] Loop vectorization for minimum and maximum
Summary: Depends on D52766.

Reviewers: aheejin, dschuff

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52767

llvm-svn: 344816
2018-10-19 21:11:43 +00:00
Ayal Zaks b0b5312e67 [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt for size
When optimizing for size, a loop is vectorized only if the resulting vector loop
completely replaces the original scalar loop. This holds if no runtime guards
are needed, if the original trip-count TC does not overflow, and if TC is a
known constant that is a multiple of the VF. The last two TC-related conditions
can be overcome by
1. rounding the trip-count of the vector loop up from TC to a multiple of VF;
2. masking the vector body under a newly introduced "if (i <= TC-1)" condition.

The patch allows loops with arbitrary trip counts to be vectorized under -Os,
subject to the existing cost model considerations. It also applies to loops with
small trip counts (under -O2) which are currently handled as if under -Os.

The patch does not handle loops with reductions, live-outs, or w/o a primary
induction variable, and disallows interleave groups.

(Third, final and main part of -)
Differential Revision: https://reviews.llvm.org/D50480

llvm-svn: 344743
2018-10-18 15:03:15 +00:00
Anna Thomas 6f732bfb79 [LV] Teach vectorizer about variant value store into uniform address
Summary:
Teach vectorizer about vectorizing variant value stores to uniform
address. Similar to rL343028, we do not allow vectorization if we have
multiple stores to the same uniform address.

Cost model already has the change for considering the extract
instruction cost for a variant value store. See added test cases for how
vectorization is done.
The patch also contains changes to the ORE messages.

Reviewers: Ayal, mkuper, anemet, hsaito

Subscribers: rkruppe, llvm-commits

Differential Revision: https://reviews.llvm.org/D52656

llvm-svn: 344613
2018-10-16 15:46:26 +00:00
Ayal Zaks 1a8713046d [LV] Add test checks when vectorizing loops under opt for size; NFC
Landing this as a separate part of https://reviews.llvm.org/D50480, recording
current behavior more accurately, to clarify subsequent diff ([LV] Vectorizing
loops of arbitrary trip count without remainder under opt for size).

llvm-svn: 344606
2018-10-16 14:25:02 +00:00
Dorit Nuzman 38bbf81ade recommit 344472 after fixing build failure on ARM and PPC.
llvm-svn: 344475
2018-10-14 08:50:06 +00:00
Dorit Nuzman 5118c68cde revert 344472 due to failures.
llvm-svn: 344473
2018-10-14 07:21:20 +00:00
Dorit Nuzman 8174368955 [IAI,LV] Add support for vectorizing predicated strided accesses using masked
interleave-group

The vectorizer currently does not attempt to create interleave-groups that
contain predicated loads/stores; predicated strided accesses can currently be
vectorized only using masked gather/scatter or scalarization. This patch makes
predicated loads/stores candidates for forming interleave-groups during the
Loop-Vectorizer's analysis, and adds the proper support for masked-interleave-
groups to the Loop-Vectorizer's planning and transformation stages. The patch
also extends the TTI API to allow querying the cost of masked interleave groups
(which each target can control); Targets that support masked vector loads/
stores may choose to enable this feature and allow vectorizing predicated
strided loads/stores using masked wide loads/stores and shuffles.

Reviewers: Ayal, hsaito, dcaballe, fhahn, javed.absar

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D53011

llvm-svn: 344472
2018-10-14 07:06:16 +00:00
Sanjay Patel 05aadf885d [InstCombine] reverse 'trunc X to <N x i1>' canonicalization; 2nd try
Re-trying r344082 because it unintentionally included extra diffs.

Original commit message:
icmp ne (and X, 1), 0 --> trunc X to N x i1

Ideally, we'd do the same for scalars, but there will likely be
regressions unless we add more trunc folds as we're doing here
for vectors.

The motivating vector case is from PR37549:
https://bugs.llvm.org/show_bug.cgi?id=37549

define <4 x float> @bitwise_select(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {

  %c = fcmp ole <4 x float> %x, %y
  %s = sext <4 x i1> %c to <4 x i32>
  %s1 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
  %s2 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 2, i32 2, i32 3, i32 3>
  %cond = or <4 x i32> %s1, %s2
  %condtr = trunc <4 x i32> %cond to <4 x i1>
  %r = select <4 x i1> %condtr, <4 x float> %z, <4 x float> %w
  ret <4 x float> %r

}

Here's a sampling of the vector codegen for that case using
mask+icmp (current behavior) vs. trunc (with this patch):

AVX before:

vcmpleps        %xmm1, %xmm0, %xmm0
vpermilps       $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps       $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps   %xmm0, %xmm1, %xmm0
vandps  LCPI0_0(%rip), %xmm0, %xmm0
vxorps  %xmm1, %xmm1, %xmm1
vpcmpeqd        %xmm1, %xmm0, %xmm0
vblendvps       %xmm0, %xmm3, %xmm2, %xmm0

AVX after:

vcmpleps        %xmm1, %xmm0, %xmm0
vpermilps       $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps       $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps   %xmm0, %xmm1, %xmm0
vblendvps       %xmm0, %xmm2, %xmm3, %xmm0

AVX512f before:

vcmpleps        %xmm1, %xmm0, %xmm0
vpermilps       $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps       $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps   %xmm0, %xmm1, %xmm0
vpbroadcastd    LCPI0_0(%rip), %xmm1 ## xmm1 = [1,1,1,1]
vptestnmd       %zmm1, %zmm0, %k1
vblendmps       %zmm3, %zmm2, %zmm0 {%k1}

AVX512f after:

vcmpleps        %xmm1, %xmm0, %xmm0
vpermilps       $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps       $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps   %xmm0, %xmm1, %xmm0
vpslld  $31, %xmm0, %xmm0
vptestmd        %zmm0, %zmm0, %k1
vblendmps       %zmm2, %zmm3, %zmm0 {%k1}

AArch64 before:

fcmge   v0.4s, v1.4s, v0.4s
zip1    v1.4s, v0.4s, v0.4s
zip2    v0.4s, v0.4s, v0.4s
orr     v0.16b, v1.16b, v0.16b
movi    v1.4s, #1
and     v0.16b, v0.16b, v1.16b
cmeq    v0.4s, v0.4s, #0
bsl     v0.16b, v3.16b, v2.16b

AArch64 after:

fcmge   v0.4s, v1.4s, v0.4s
zip1    v1.4s, v0.4s, v0.4s
zip2    v0.4s, v0.4s, v0.4s
orr     v0.16b, v1.16b, v0.16b
bsl     v0.16b, v2.16b, v3.16b

PowerPC-le before:

xvcmpgesp 34, 35, 34
vspltisw 0, 1
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxlxor 35, 35, 35
xxland 34, 0, 32
vcmpequw 2, 2, 3
xxsel 34, 36, 37, 34

PowerPC-le after:

xvcmpgesp 34, 35, 34
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxsel 34, 37, 36, 0

Differential Revision: https://reviews.llvm.org/D52747

llvm-svn: 344181
2018-10-10 20:47:46 +00:00
Sanjay Patel 58fc00d0bc revert r344082: [InstCombine] reverse 'trunc X to <N x i1>' canonicalization
This commit accidentally included the diffs from D53057.

llvm-svn: 344178
2018-10-10 20:39:39 +00:00
Renato Golin cb19c8e3aa [LV] Add a new reduction pattern match
Adding a new reduction pattern match for vectorizing code similar to TSVC s3111:

for (int i = 0; i < N; i++)
  if (a[i] > b)
    sum += a[i];

This patch adds support for fadd, fsub and fmull, as well as multiple
branches and different (but compatible) instructions (ex. add+sub) in
different branches.

I have forwarded to trunk, added fsub and fmul functionality and
additional tests, but the credit goes to Takahiro, who did most of the
actual work.

Differential Revision: https://reviews.llvm.org/D49168

Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>.

llvm-svn: 344172
2018-10-10 18:49:49 +00:00
Justin Bogner 90fde0e06f [LV] Move test for r343954 into x86 subdirectory
This test uses an x86 triple, so it needs to be in the x86 specific
test directory.

llvm-svn: 344087
2018-10-09 22:40:04 +00:00
Sanjay Patel e9ca7ea3e5 [InstCombine] reverse 'trunc X to <N x i1>' canonicalization
icmp ne (and X, 1), 0 --> trunc X to N x i1

Ideally, we'd do the same for scalars, but there will likely be 
regressions unless we add more trunc folds as we're doing here 
for vectors.

The motivating vector case is from PR37549:
https://bugs.llvm.org/show_bug.cgi?id=37549

define <4 x float> @bitwise_select(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
  %c = fcmp ole <4 x float> %x, %y
  %s = sext <4 x i1> %c to <4 x i32>
  %s1 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
  %s2 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 2, i32 2, i32 3, i32 3>
  %cond = or <4 x i32> %s1, %s2
  %condtr = trunc <4 x i32> %cond to <4 x i1>
  %r = select <4 x i1> %condtr, <4 x float> %z, <4 x float> %w
  ret <4 x float> %r
}

Here's a sampling of the vector codegen for that case using 
mask+icmp (current behavior) vs. trunc (with this patch):

AVX before:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vandps	LCPI0_0(%rip), %xmm0, %xmm0
vxorps	%xmm1, %xmm1, %xmm1
vpcmpeqd	%xmm1, %xmm0, %xmm0
vblendvps	%xmm0, %xmm3, %xmm2, %xmm0

AVX after:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vblendvps	%xmm0, %xmm2, %xmm3, %xmm0

AVX512f before:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vpbroadcastd	LCPI0_0(%rip), %xmm1 ## xmm1 = [1,1,1,1]
vptestnmd	%zmm1, %zmm0, %k1
vblendmps	%zmm3, %zmm2, %zmm0 {%k1}

AVX512f after:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vpslld	$31, %xmm0, %xmm0
vptestmd	%zmm0, %zmm0, %k1
vblendmps	%zmm2, %zmm3, %zmm0 {%k1}

AArch64 before:

fcmge	v0.4s, v1.4s, v0.4s
zip1	v1.4s, v0.4s, v0.4s
zip2	v0.4s, v0.4s, v0.4s
orr	v0.16b, v1.16b, v0.16b
movi	v1.4s, #1
and	v0.16b, v0.16b, v1.16b
cmeq	v0.4s, v0.4s, #0
bsl	v0.16b, v3.16b, v2.16b

AArch64 after:

fcmge	v0.4s, v1.4s, v0.4s
zip1	v1.4s, v0.4s, v0.4s
zip2	v0.4s, v0.4s, v0.4s
orr	v0.16b, v1.16b, v0.16b
bsl	v0.16b, v2.16b, v3.16b

PowerPC-le before:

xvcmpgesp 34, 35, 34
vspltisw 0, 1
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxlxor 35, 35, 35
xxland 34, 0, 32
vcmpequw 2, 2, 3
xxsel 34, 36, 37, 34

PowerPC-le after:

xvcmpgesp 34, 35, 34
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxsel 34, 37, 36, 0

Differential Revision: https://reviews.llvm.org/D52747

llvm-svn: 344082
2018-10-09 21:26:01 +00:00
Max Kazantsev b07369651e [LV] Do not create SCEVs on broken IR in emitTransformedIndex. PR39160
At the point when we perform `emitTransformedIndex`, we have a broken IR (in
particular, we have Phis for which not every incoming value is properly set). On
such IR, it is illegal to create SCEV expressions, because their internal
simplification process may try to prove some predicates and break when it
stumbles across some broken IR.

The only purpose of using SCEV in this particular place is attempt to simplify
the generated code slightly. It seems that the result isn't worth it, because
some trivial cases (like addition of zero and multiplication by 1) can be
handled separately if needed, but more generally InstCombine is able to achieve
the goals we want to achieve by using SCEV.

This patch fixes a functional crash described in PR39160, and as side-effect it
also generates a bit smarter code in some simple cases. It also may cause some
optimality loss (i.e. we will now generate `mul` by power of `2` instead of
shift etc), but there is nothing what InstCombine could not handle later. In
case of dire need, we can support more trivial cases just in place.

Note that this patch only fixes one particular case of the general problem that
LV misuses SCEV, attempting to create SCEVs or prove predicates on invalid IR.
The general solution, however, seems complex enough.

Differential Revision: https://reviews.llvm.org/D52881
Reviewed By: fhahn, hsaito

llvm-svn: 343954
2018-10-08 05:46:29 +00:00
Dorit Nuzman 72f6e29980 [IAI,LV] Avoid creating interleave-groups for predicated accesse
This patch fixes PR39099.

When strided loads are predicated, each of them will form an interleaved-group
(with gaps). However, subsequent stages of vectorization (planning and
transformation) assume that if a load is part of an Interleave-Group it is not
predicated, resulting in wrong code - unmasked wide loads are created.

The Interleaving Analysis does take care not to have conditional interleave
groups of size > 1, but until we extend the planning and transformation stages
to support masked-interleave-groups we should also avoid having them for
size == 1.

Reviewers: Ayal, hsaito, dcaballe, fhahn

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D52682

llvm-svn: 343931
2018-10-07 06:57:25 +00:00
Anna Thomas b1e3d45318 [LV][LAA] Vectorize loop invariant values stored into loop invariant address
Summary:
We are overly conservative in loop vectorizer with respect to stores to loop
invariant addresses.
More details in https://bugs.llvm.org/show_bug.cgi?id=38546
This is the first part of the fix where we start with vectorizing loop invariant
values to loop invariant addresses.

This also includes changes to ORE for stores to invariant address.

Reviewers: anemet, Ayal, mkuper, mssimpso

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D50665

llvm-svn: 343028
2018-09-25 20:57:20 +00:00
Warren Ristow 4f27730eaf [Loop Vectorizer] Abandon vectorization when no integer IV found
Support for vectorizing loops with secondary floating-point induction
variables was added in r276554.  A primary integer IV is still required
for vectorization to be done.  If an FP IV was found, but no integer IV
was found at all (primary or secondary), the attempt to vectorize still
went forward, causing a compiler-crash.  This change abandons that
attempt when no integer IV is found.  (Vectorizing FP-only cases like
this, rather than bailing out, is discussed as possible future work
in D52327.)

See PR38800 for more information.

Differential Revision: https://reviews.llvm.org/D52327

llvm-svn: 342786
2018-09-21 23:03:50 +00:00
Hideki Saito ea7f3035a0 [VPlan] Implement initial vector code generation support for simple outer loops.
Summary:
[VPlan] Implement vector code generation support for simple outer loops.

Context: Patch Series #1 for outer loop vectorization support in LV  using VPlan. (RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).
                                                          
This patch introduces vector code generation support for simple outer loops that are currently supported in the VPlanNativePath. Changes here essentially do the following:

  - force vector code generation using explicit vectorize_width

  - add conservative early returns in cost model and other places for VPlanNativePath

  - add code for setting up outer loop inductions 

  - support for widening non-induction PHIs that can result from inner loops and uniform conditional branches

  - support for generating uniform inner branches

We plan to add a handful C outer loop executable tests once the initial code generation support is committed. This patch is expected to be NFC for the inner loop vectorizer path. Since we are moving in the direction of supporting outer loop vectorization in LV, it may also be time to rename classes such as InnerLoopVectorizer. 

Reviewers: fhahn, rengolin, hsaito, dcaballe, mkuper, hfinkel, Ayal

Reviewed By: fhahn, hsaito

Subscribers: dmgreen, bollu, tschuett, rkruppe, rogfer01, llvm-commits

Differential Revision: https://reviews.llvm.org/D50820

llvm-svn: 342197
2018-09-14 00:36:00 +00:00
Florian Hahn 1086ce2397 [LV] Move InterleaveGroup and InterleavedAccessInfo to VectorUtils.h (NFC)
Move the 2 classes out of LoopVectorize.cpp to make it easier to re-use
them for VPlan outside LoopVectorize.cpp

Reviewers: Ayal, mssimpso, rengolin, dcaballe, mkuper, hsaito, hfinkel, xbolva00

Reviewed By: rengolin, xbolva00

Differential Revision: https://reviews.llvm.org/D49488

llvm-svn: 342027
2018-09-12 08:01:57 +00:00
Tim Northover 12c1f7675f InstCombine: move hasOneUse check to the top of foldICmpAddConstant
There were two combines not covered by the check before now, neither of which
actually differed from normal in the benefit analysis.

The most recent seems to be because it was just added at the top of the
function (naturally). The older is from way back in 2008 (r46687) when we just
didn't put those checks in so routinely, and has been diligently maintained
since.

llvm-svn: 341831
2018-09-10 14:26:44 +00:00
Anna Thomas 110df11a1a [LV] Fix code gen for conditionally executed loads and stores
Fix a latent bug in loop vectorizer which generates incorrect code for
memory accesses that are executed conditionally. As pointed in review,
this bug definitely affects uniform loads and may affect conditional
stores that should have turned into scatters as well).

The code gen for conditionally executed uniform loads on architectures
that support masked gather instructions is broken.

Without this patch, we were unconditionally executing the *conditional*
load in the vectorized version.

This patch does the following:
1. Uniform conditional loads on architectures with gather support will
   have correct code generated. In particular, the cost model
   (setCostBasedWideningDecision) is fixed.
2. For the recipes which are handled after the widening decision is set,
   we use the isScalarWithPredication(I, VF) form which is added in the
   patch.

3. Fix the vectorization cost model for scalarization
   (getMemInstScalarizationCost): implement and use isPredicatedInst to
   identify *all* predicated instructions, not just scalar+predicated. So,
   now the cost for scalarization will be increased for maskedloads/stores
   and gather/scatter operations. In short, we should be choosing the
   gather/scatter in place of scalarization on archs where it is
   profitable.
4. We needed to weaken the assert in useEmulatedMaskMemRefHack.

Reviewers: Ayal, hsaito, mkuper

Differential Revision: https://reviews.llvm.org/D51313

llvm-svn: 341673
2018-09-07 15:53:48 +00:00
Anna Thomas dbacea188b [LV] First order recurrence phis should not be treated as uniform
This is fix for PR38786.
First order recurrence phis were incorrectly treated as uniform,
which caused them to be vectorized as uniform instructions.

Patch by Ayal Zaks and Orivej Desh!

Reviewed by: Anna

Differential Revision: https://reviews.llvm.org/D51639

llvm-svn: 341416
2018-09-04 22:12:23 +00:00
Nicola Zaghen 9588ad9611 [InstCombine] Fold icmp ugt/ult (add nuw X, C2), C --> icmp ugt/ult X, (C - C2)
Support for sgt/slt was added in rL294898, this adds the same cases also for unsigned compares.

This is the Alive proof: https://rise4fun.com/Alive/nyY

Differential Revision: https://reviews.llvm.org/D50972

llvm-svn: 341353
2018-09-04 10:29:48 +00:00
Roman Tereshin 02320eee6b Revert "[SCEV][NFC] Check NoWrap flags before lexicographical comparison of SCEVs"
This reverts r319889.

Unfortunately, wrapping flags are not a part of SCEV's identity (they
do not participate in computing a hash value or in equality
comparisons) and in fact they could be assigned after the fact w/o
rebuilding a SCEV.

Grep for const_cast's to see quite a few of examples, apparently all
for AddRec's at the moment.

So, if 2 expressions get built in 2 slightly different ways: one with
flags set in the beginning, the other with the flags attached later
on, we may end up with 2 expressions which are exactly the same but
have their operands swapped in one of the commutative N-ary
expressions, and at least one of them will have "sorted by complexity"
invariant broken.

2 identical SCEV's won't compare equal by pointer comparison as they
are supposed to.

A real-world reproducer is added as a regression test: the issue
described causes 2 identical SCEV expressions to have different order
of operands and therefore compare not equal, which in its turn
prevents LoadStoreVectorizer from vectorizing a pair of consecutive
loads.

On a larger example (the source of the test attached, which is a
bugpoint) I have seen even weirder behavior: adding a constant to an
existing SCEV changes the order of the existing terms, for instance,
getAddExpr(1, ((A * B) + (C * D))) returns (1 + (C * D) + (A * B)).

Differential Revision: https://reviews.llvm.org/D40645

llvm-svn: 340777
2018-08-27 21:41:37 +00:00
Vedant Kumar a459b9f757 Avoid dbg.value use-before-def in a few tests (NFC)
This is preparation for landing a use-before-def verifier for debug
intrinsics (D46100).

As a drive-by, remove `tail` from debug intrinsic calls because it
doesn't mean anything in that context.

llvm-svn: 340366
2018-08-21 23:42:08 +00:00
Anna Thomas 1d78503f6a NFC: update the test comments in LV test about early exit loops
llvm-svn: 340337
2018-08-21 21:12:02 +00:00
Anna Thomas b02b0ad8c7 [LV] Vectorize loops where non-phi instructions used outside loop
Summary:
Follow up change to rL339703, where we now vectorize loops with non-phi
instructions used outside the loop. Note that the cyclic dependency
identification occurs when identifying reduction/induction vars.

We also need to identify that we do not allow users where the PSCEV information
within and outside the loop are different. This was the fix added in rL307837
for PR33706.

Reviewers: Ayal, mkuper, fhahn

Subscribers: javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D50778

llvm-svn: 340278
2018-08-21 14:40:27 +00:00
Anna Thomas 2d33ce7701 NFC: Add loop vectorizer tests showing various control flow within loop that skip iterations
llvm-svn: 340275
2018-08-21 13:02:09 +00:00
Michael Berg ed89d069f4 add a missed case for binary op FMF propagation under select folds
llvm-svn: 339938
2018-08-16 20:59:45 +00:00
Anna Thomas 60a1e4dddc [LV] Teach about non header phis that have uses outside the loop
Summary:
This patch teaches the loop vectorizer to vectorize loops with non
header phis that have have outside uses.  This is because the iteration
dependence distance for these phis can be widened upto VF (similar to
how we do for induction/reduction) if they do not have a cyclic
dependence with header phis. When identifying reduction/induction/first
order recurrence header phis, we already identify if there are any cyclic
dependencies that prevents vectorization.

The vectorizer is taught to extract the last element from the vectorized
phi and update the scalar loop exit block phi to contain this extracted
element from the vector loop.

This patch can be extended to vectorize loops where instructions other
than phis have outside uses.

Reviewers: Ayal, mkuper, mssimpso, efriedma

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D50579

llvm-svn: 339703
2018-08-14 18:22:19 +00:00
Anna Thomas cce7c24af1 NFC: Add a test to LV showing that reduction is not possible when reduction var is reset in the loop
Added a test case to reduction showing where it's illegal to identify
vectorize a loop.
Resetting the reduction var during loop iterations disallows us from
widening the dependency cycle to VF, thereby making it illegal to
vectorize the loop.

llvm-svn: 339605
2018-08-13 19:55:25 +00:00
Anastasis Grammenos f6e143e67f Revert "[LV][DebugInfo] Set DL to the middle block Icmp instruction"
This reverts commit r338106.

llvm-svn: 338109
2018-07-27 08:22:54 +00:00
Anastasis Grammenos 03948d0e0f [LV][DebugInfo] Set DL to the middle block Icmp instruction
Reviewers: hsaito

Differential Revision: https://reviews.llvm.org/D49746

llvm-svn: 338106
2018-07-27 07:12:44 +00:00
Hideki Saito ef380b0fc5 [LV] Fix for PR38110, LV encountered llvm_unreachable()
Summary: truncateToMinimalBitWidths() doesn't handle all Instructions and the worst case is compiler crash via llvm_unreachable(). Fix is to add a case to handle PHINode and changed the worst case to NO-OP (from compiler crash).

Reviewers: sbaranga, mssimpso, hsaito

Reviewed By: hsaito

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49461

llvm-svn: 337861
2018-07-24 22:30:31 +00:00
Anastasis Grammenos 612bf7cac5 [DebugInfo][LoopVectorize] Preserve DL in induction PHI and Add
Differential Revision: https://reviews.llvm.org/D48968

llvm-svn: 336667
2018-07-10 13:29:50 +00:00
Max Kazantsev 20da7e467a Revert "[InstCombine] Delay foldICmpUsingKnownBits until simple transforms are done"
llvm-svn: 336410
2018-07-06 04:04:13 +00:00
Gabor Buella da4a966e1c NFC - Various typo fixes in tests
llvm-svn: 336268
2018-07-04 13:28:39 +00:00
Anastasis Grammenos 204726b345 [DebugInfo][LoopVectorize] Preserve DL in generated phi instruction
When creating `phi` instructions to resume at the scalar part of the loop,
copy the DebugLoc from the original phi over to the new one.

Differential Revision: https://reviews.llvm.org/D48769

llvm-svn: 336256
2018-07-04 10:16:55 +00:00
Max Kazantsev 3097b76e8c [InstCombine] Delay foldICmpUsingKnownBits until simple transforms are done
This patch changes order of transform in InstCombineCompares to avoid
performing transforms based on ranges which produce complex bit arithmetics
before more simple things (like folding with constants) are done. See PR37636
for the motivating example.

Differential Revision: https://reviews.llvm.org/D48584
Reviewed By: spatel, lebedev.ri

llvm-svn: 336172
2018-07-03 06:23:57 +00:00
Adhemerval Zanella cadcfed7aa [AArch64] Add custom lowering for v4i8 trunc store
This patch adds a custom trunc store lowering for v4i8 vector types.
Since there is not v.4b register, the v4i8 is promoted to v4i16 (v.4h)
and default action for v4i8 is to extract each element and issue 4
byte stores.

A better strategy would be to extended the promoted v4i16 to v8i16
(with undef elements) and extract and store the word lane which
represents the v4i8 subvectores. The construction:

  define void @foo(<4 x i16> %x, i8* nocapture %p) {
    %0 = trunc <4 x i16> %x to <4 x i8>
    %1 = bitcast i8* %p to <4 x i8>*
    store <4 x i8> %0, <4 x i8>* %1, align 4, !tbaa !2
    ret void
  }

Can be optimized from:

  umov    w8, v0.h[3]
  umov    w9, v0.h[2]
  umov    w10, v0.h[1]
  umov    w11, v0.h[0]
  strb    w8, [x0, #3]
  strb    w9, [x0, #2]
  strb    w10, [x0, #1]
  strb    w11, [x0]
  ret

To:

  xtn     v0.8b, v0.8h
  str     s0, [x0]
  ret

The patch also adjust the memory cost for autovectorization, so the C
code:

  void foo (const int *src, int width, unsigned char *dst)
  {
    for (int i = 0; i < width; i++)
       *dst++ = *src++;
  }

can be vectorized to:

  .LBB0_4:                                // %vector.body
                                          // =>This Inner Loop Header: Depth=1
        ldr     q0, [x0], #16
        subs    x12, x12, #4            // =4
        xtn     v0.4h, v0.4s
        xtn     v0.8b, v0.8h
        st1     { v0.s }[0], [x2], #4
        b.ne    .LBB0_4

Instead of byte operations.

llvm-svn: 335735
2018-06-27 13:58:46 +00:00
Florian Hahn b10b141a79 Revert r335513: [SCEVExp] Advance found insertion point
llvm-svn: 335522
2018-06-25 20:55:26 +00:00
Florian Hahn 0b3ed5742a Force vector width for scev-expander-debug.ll test
llvm-svn: 335520
2018-06-25 20:40:50 +00:00
Florian Hahn 5947c17fd4 [SCEVExp] Advance found insertion point until we find a non-dbg instruction.
This avoids creating unnecessary casts if the IP used to be a dbg info
intrinsic. Fixes PR37727.

Reviewers: vsk, aprantl, sanjoy, efriedma

Reviewed By: vsk, efriedma

Differential Revision: https://reviews.llvm.org/D47874

llvm-svn: 335513
2018-06-25 19:17:29 +00:00
Sanjay Patel 447e8ece4d [LoopVectorize] regenerate full checks; NFC
llvm-svn: 335257
2018-06-21 16:54:32 +00:00
Diego Caballero 72aed5e5dc Move redundant-vf2-cost.ll test to X86 directory
redundant-vf2-cost.ll is X86 specific. Moved from
test/Transforms/LoopVectorize/redundant-vf2-cost.ll to
test/Transforms/LoopVectorize/X86/redundant-vf2-cost.ll

llvm-svn: 334854
2018-06-15 18:46:03 +00:00
Diego Caballero 68795245cf [LV] Prevent LV to run cost model twice for VF=2
This is a minor fix for LV cost model, where the cost for VF=2 was
computed twice when the vectorization of the loop was forced without
specifying a VF.

Reviewers: xusx595, hsaito, fhahn, mkuper

Reviewed By: hsaito, xusx595

Differential Revision: https://reviews.llvm.org/D48048

llvm-svn: 334840
2018-06-15 16:21:35 +00:00
Roman Shirokiy 9ba0aa2da0 [LV] Fix PR36983. For a given recurrence, fix all phis in exit block
There could be more than one PHIs in exit block using same loop recurrence.
Don't assume there is only one and fix each user.

Differential Revision: https://reviews.llvm.org/D47788

llvm-svn: 334271
2018-06-08 08:21:20 +00:00
Sanjay Patel 4b1205b40f [TargetLibraryInfo] add mappings from LLVM sin/cos intrinsics to SVML calls
These weren't included in D19544 - probably just an oversight.
D40044 made it more likely that we'll have LLVM math intrinsics rather 
than libcalls, so this bug was more easily exposed.
As the tests/code show, we already have the complete mappings for pow/exp/log.

I don't have any experience with SVML, so I don't know if anything else is 
missing. It's also not clear to me that we should be doing this transform in 
IR rather than DAG/isel, but that's a separate issue.

Differential Revision: https://reviews.llvm.org/D47610

llvm-svn: 334211
2018-06-07 18:21:24 +00:00
Karl-Johan Karlsson 6d52e5c3e4 [ConstantFold] Disallow folding vector geps into bitcasts
Summary:
Getelementptr returns a vector of pointers, instead of a single address,
when one or more of its arguments is a vector. In such case it is not
possible to simplify the expression by inserting a bitcast of operand(0)
into the destination type, as it will create a bitcast between different
sizes.

Reviewers: majnemer, mkuper, mssimpso, spatel

Reviewed By: spatel

Subscribers: lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D46379

llvm-svn: 333783
2018-06-01 19:34:35 +00:00
Sanjay Patel affe450db7 [LoopVectorize, x86] add tests to show missing SVML transforms; NFC
llvm-svn: 333707
2018-05-31 22:31:02 +00:00
Sanjay Patel 2ae1ab30e3 [LoopVectorize, x86] regenerate checks; NFC
I removed the 'fast' flag from the calls because that's not required.

llvm-svn: 333695
2018-05-31 21:30:36 +00:00
Diego Caballero 168d04d544 [VPlan] Reland r332654 and silence unused func warning
r332654 was reverted due to an unused function warning in
release build. This commit includes the same code with the
warning silenced.

Differential Revision: https://reviews.llvm.org/D44338

llvm-svn: 332860
2018-05-21 18:14:23 +00:00
Amara Emerson 08099c7edd Delete a test that was missed in the revert r332747.
r332747 originally reverted r332654 which added this test.

llvm-svn: 332755
2018-05-18 19:21:40 +00:00
Alexander Ivchenko 5c54742da4 [X86][CET] Changing -fcf-protection behavior to comply with gcc (LLVM part)
This patch aims to match the changes introduced in gcc by
https://gcc.gnu.org/ml/gcc-cvs/2018-04/msg00534.html. The
IBT feature definition is removed, with the IBT instructions
being freely available on all X86 targets. The shadow stack
instructions are also being made freely available, and the
use of all these CET instructions is controlled by the module
flags derived from the -fcf-protection clang option. The hasSHSTK
option remains since clang uses it to determine availability of
shadow stack instruction intrinsics, but it is no longer directly used.

Comes with a clang patch (D46881).

Patch by mike.dvoretsky

Differential Revision: https://reviews.llvm.org/D46882

llvm-svn: 332705
2018-05-18 11:58:25 +00:00
Diego Caballero f58ad3129c [LV][VPlan] Build plain CFG with simple VPInstructions for outer loops.
Patch #3 from VPlan Outer Loop Vectorization Patch Series #1
(RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).

Expected to be NFC for the current inner loop vectorization path. It
introduces the basic algorithm to build the VPlan plain CFG (single-level
CFG, no hierarchical CFG (H-CFG), yet) in the VPlan-native vectorization
path using VPInstructions. It includes:
  - VPlanHCFGBuilder: Main class to build the VPlan H-CFG (plain CFG without nested regions, for now).
  - VPlanVerifier: Main class with utilities to check the consistency of a H-CFG.
  - VPlanBlockUtils: Main class with utilities to manipulate VPBlockBases in VPlan.

Reviewers: rengolin, fhahn, mkuper, mssimpso, a.elovikov, hfinkel, aprantl.

Differential Revision: https://reviews.llvm.org/D44338

llvm-svn: 332654
2018-05-17 19:24:47 +00:00
Karl-Johan Karlsson b27dd33c58 [LV] Add lit testcase for bitcast problem. NFC
llvm-svn: 331878
2018-05-09 13:34:57 +00:00
Shiva Chen 2c864551df [DebugInfo] Add DILabel metadata and intrinsic llvm.dbg.label.
In order to set breakpoints on labels and list source code around
labels, we need collect debug information for labels, i.e., label
name, the function label belong, line number in the file, and the
address label located. In order to keep these information in LLVM
IR and to allow backend to generate debug information correctly.
We create a new kind of metadata for labels, DILabel. The format
of DILabel is

!DILabel(scope: !1, name: "foo", file: !2, line: 3)

We hope to keep debug information as much as possible even the
code is optimized. So, we create a new kind of intrinsic for label
metadata to avoid the metadata is eliminated with basic block.
The intrinsic will keep existing if we keep it from optimized out.
The format of the intrinsic is

llvm.dbg.label(metadata !1)

It has only one argument, that is the DILabel metadata. The
intrinsic will follow the label immediately. Backend could get the
label metadata through the intrinsic's parameter.

We also create DIBuilder API for labels to be used by Frontend.
Frontend could use createLabel() to allocate DILabel objects, and use
insertLabel() to insert llvm.dbg.label intrinsic in LLVM IR.

Differential Revision: https://reviews.llvm.org/D45024

Patch by Hsiangkai Wang.

llvm-svn: 331841
2018-05-09 02:40:45 +00:00
Hideki Saito d722d61402 [LV] Fix for PR37248, Broadcast codegen incorrectly assumed vector loop body is single basic block
Summary:
Broadcast code generation emitted instructions in pre-header, while the instruction they are dependent on in the vector loop body.
This resulted in an IL verification error ---- value used before defined.


Reviewers: rengolin, fhahn, hfinkel

Reviewed By: rengolin, fhahn

Subscribers: dcaballe, Ka-Ka, llvm-commits

Differential Revision: https://reviews.llvm.org/D46302

llvm-svn: 331799
2018-05-08 18:57:34 +00:00
Daniel Neilson 1091ca4640 [LV] Move test/Transforms/LoopVectorize/pr23997.ll
Summary:
This fixes a build break with r331269.

test/Transforms/LoopVectorize/pr23997.ll

should be in:

test/Transforms/LoopVectorize/X86/pr23997.ll

llvm-svn: 331281
2018-05-01 16:40:45 +00:00
Daniel Neilson 9e4bbe801a [LV] Preserve inbounds on created GEPs
Summary:
This is a fix for PR23997.

The loop vectorizer is not preserving the inbounds property of GEPs that it creates.
This is inhibiting some optimizations. This patch preserves the inbounds property in
the case where a load/store is being fed by an inbounds GEP.

Reviewers: mkuper, javed.absar, hsaito

Reviewed By: hsaito

Subscribers: dcaballe, hsaito, llvm-commits

Differential Revision: https://reviews.llvm.org/D46191

llvm-svn: 331269
2018-05-01 15:35:08 +00:00
Diego Caballero 60f2776b2f [LV][VPlan] Detect outer loops for explicit vectorization.
Patch #2 from VPlan Outer Loop Vectorization Patch Series #1
(RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).

This patch introduces the basic infrastructure to detect, legality check
and process outer loops annotated with hints for explicit vectorization.
All these changes are protected under the feature flag
-enable-vplan-native-path. This should make this patch NFC for the existing
inner loop vectorizer.

Reviewers: hfinkel, mkuper, rengolin, fhahn, aemerson, mssimpso.

Differential Revision: https://reviews.llvm.org/D42447

llvm-svn: 330739
2018-04-24 17:04:17 +00:00
Krzysztof Parzyszek dfed941eec [LV] Introduce TTI::getMinimumVF
The function getMinimumVF(ElemWidth) will return the minimum VF for
a vector with elements of size ElemWidth bits. This value will only
apply to targets for which TTI::shouldMaximizeVectorBandwidth returns
true. The value of 0 indicates that there is no minimum VF.

Differential Revision: https://reviews.llvm.org/D45271

llvm-svn: 330062
2018-04-13 20:16:32 +00:00
Sebastian Pop d870aea03e [InstCombine] reassociate loop invariant GEP chains to enable LICM
This change brings performance of zlib up by 10%. The example below is from a
hot loop in longest_match() from zlib.

do.body:
  %cur_match.addr.0 = phi i32 [ %cur_match, %entry ], [ %2, %do.cond ]
  %idx.ext = zext i32 %cur_match.addr.0 to i64
  %add.ptr = getelementptr inbounds i8, i8* %win, i64 %idx.ext
  %add.ptr2 = getelementptr inbounds i8, i8* %add.ptr, i64 %idx.ext1
  %add.ptr3 = getelementptr inbounds i8, i8* %add.ptr2, i64 -1

In this example %idx.ext1 is a loop invariant. It will be moved above the use of
loop induction variable %idx.ext such that it can be hoisted out of the loop by
LICM. The operands that have dependences carried by the loop will be sinked down
in the GEP chain. This patch will produce the following output:

do.body:
  %cur_match.addr.0 = phi i32 [ %cur_match, %entry ], [ %2, %do.cond ]
  %idx.ext = zext i32 %cur_match.addr.0 to i64
  %add.ptr = getelementptr inbounds i8, i8* %win, i64 %idx.ext1
  %add.ptr2 = getelementptr inbounds i8, i8* %add.ptr, i64 -1
  %add.ptr3 = getelementptr inbounds i8, i8* %add.ptr2, i64 %idx.ext

llvm-svn: 328539
2018-03-26 16:19:31 +00:00
Evgeny Stupachenko 579507a53a Revert r325687 (workaround for PR36032).
Summary:
Revert r325687 workaround for PR36032 since
 a fix was committed in r326154.

Reviewers: sbaranga

Differential Revision: http://reviews.llvm.org/D44768

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 328257
2018-03-22 22:04:39 +00:00
Andrei Elovikov 8b8253fdc7 [LV] Let recordVectorLoopValueForInductionCast to check if IV was created from the cast.
Summary:
It turned out to be error-prone to expect the callers to handle that - better to
leave the decision to this routine and make the required data to be explicitly
passed to the function.

This handles the case that was missed in the r322473 and fixes the assert
mentioned in PR36524.

Reviewers: dorit, mssimpso, Ayal, dcaballe

Reviewed By: dcaballe

Subscribers: Ka-Ka, hiraditya, dneilson, hsaito, llvm-commits

Differential Revision: https://reviews.llvm.org/D43812

llvm-svn: 327960
2018-03-20 09:04:39 +00:00
Renato Golin bc94b98c44 [LV] Adding test for r327109
llvm-svn: 327155
2018-03-09 18:02:36 +00:00
Max Kazantsev f8d2969abb [SCEV] Smart range calculation for SCEVUnknown Phis
The range of SCEVUnknown Phi which merges values `X1, X2, ..., XN`
can be evaluated as `U(Range(X1), Range(X2), ..., Range(XN))`.

Reviewed By: sanjoy
Differential Revision: https://reviews.llvm.org/D43810

llvm-svn: 326418
2018-03-01 06:56:48 +00:00
Sanjay Patel 8529dd5ee1 [ARM] add loop vectorizer test based on 482.sphinx3 from SPEC2006; NFC
This is a slight reduction of one of the benchmarks
that suffered with D43079. Cost model changes should
not cause this test to remain scalarized.

llvm-svn: 326221
2018-02-27 18:33:24 +00:00
Adam Nemet b424cd5d61 Make test agnostic to cost model
This was causing bot failures on greendragon

llvm-svn: 326169
2018-02-27 05:41:16 +00:00
Evgeny Stupachenko f1c058d99b Fix r326154 buildbots test fail
Summary:

Add specific mtriples to tests added in r326154.

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 326158
2018-02-27 01:33:11 +00:00
Evgeny Stupachenko a732611ac8 Fix PR36032, PR35432
Summary:

The change fix an assert fail at ScalarEvolutionExpander.cpp:
  assert(ExitCount != SE.getCouldNotCompute() && "Invalid loop count");

Reviewers: sbaranga

Differential Revision: http://reviews.llvm.org/D42604

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 326154
2018-02-27 00:17:31 +00:00
Renato Golin 9d1b2acaaa [LV] Move isLegalMasked* functions from Legality to CostModel
All SIMD architectures can emulate masked load/store/gather/scatter
through element-wise condition check, scalar load/store, and
insert/extract. Therefore, bailing out of vectorization as legality
failure, when they return false, is incorrect. We should proceed to cost
model and determine profitability.

This patch is to address the vectorizer's architectural limitation
described above. As such, I tried to keep the cost model and
vectorize/don't-vectorize behavior nearly unchanged. Cost model tuning
should be done separately.

Please see
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120164.html for
RFC and the discussions.

Closes D43208.

Patch by: Hideki Saito <hideki.saito@intel.com>

llvm-svn: 326079
2018-02-26 11:06:36 +00:00
Alexey Bataev 650f639d33 [LV] Fix test checks, NFC
llvm-svn: 325699
2018-02-21 16:48:23 +00:00
Silviu Baranga 10ad93c6bf [SCEV] Temporarily disable loop versioning for the purpose
of turning SCEVUnknowns of PHIs into AddRecExprs.

This feature is now hidden behind the -scev-version-unknown flag.

Fixes PR36032 and PR35432.

llvm-svn: 325687
2018-02-21 15:20:32 +00:00
Sanjay Patel e6143904b9 revert r325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280)
There are too many perf regressions resulting from this, so we need to 
investigate (and add tests for) targets like ARM and AArch64 before 
trying to reinstate.

llvm-svn: 325658
2018-02-21 01:42:52 +00:00
Alexey Bataev 42bcec7d38 [LV] Fix test checks, NFC.
llvm-svn: 325617
2018-02-20 19:49:25 +00:00
Sanjay Patel 3e8a76abfd [TTI CostModel] change default cost of FP ops to 1 (PR36280)
This change was mentioned at least as far back as:
https://bugs.llvm.org/show_bug.cgi?id=26837#c26
...and I found a real program that is harmed by this: 
Himeno running on AMD Jaguar gets 6% slower with SLP vectorization:
https://bugs.llvm.org/show_bug.cgi?id=36280
...but the change here appears to solve that bug only accidentally.

The div/rem costs for x86 look very wrong in some cases, but that's already true, 
so we can fix those in follow-up patches. There's also evidence that more cost model
changes are needed to solve SLP problems as shown in D42981, but that's an independent 
problem (though the solution may be adjusted after this change is made).

Differential Revision: https://reviews.llvm.org/D43079

llvm-svn: 325515
2018-02-19 16:11:44 +00:00
Mircea Trofin 73b96d6dcf [LV] Fix analyzeInterleaving when -pass-remarks enabled
Summary:
If -pass-remarks=loop-vectorize, atomic ops will be seen by
analyzeInterleaving(), even though canVectorizeMemory() == false. This
is because we are requesting extra analysis instead of bailing out.

In such a case, we end up with a Group in both Load- and StoreGroups,
and then we'll try to access freed memory when traversing LoadGroups after having had released the Group when  iterating over StoreGroups.

The fix is to include mayWriteToMemory() when validating that two
instructions are the same kind of memory operation.

Reviewers: mssimpso, davidxl

Reviewed By: davidxl

Subscribers: hsaito, fhahn, llvm-commits

Differential Revision: https://reviews.llvm.org/D43064

llvm-svn: 324786
2018-02-10 00:07:45 +00:00
Sanjay Patel 42b8c23cc6 [LoopVectorize] auto-generate complete checks; NFC
llvm-svn: 324611
2018-02-08 15:13:47 +00:00
Mircea Trofin 06ac8cfbd1 Verify profile data confirms large loop trip counts.
Summary:
Loops with inequality comparers, such as:

   // unsigned bound
   for (unsigned i = 1; i < bound; ++i) {...}

have getSmallConstantMaxTripCount report a large maximum static
trip count - in this case, 0xffff fffe. However, profiling info
may show that the trip count is much smaller, and thus
counter-recommend vectorization.

This change:
- flips loop-vectorize-with-block-frequency on by default.
- validates profiled loop frequency data supports vectorization,
  when static info appears to not counter-recommend it. Absence
  of profile data means we rely on static data, just as we've
  done so far.

Reviewers: twoh, mkuper, davidxl, tejohnson, Ayal

Reviewed By: davidxl

Subscribers: bkramer, llvm-commits

Differential Revision: https://reviews.llvm.org/D42946

llvm-svn: 324543
2018-02-07 23:29:52 +00:00
Max Kazantsev f7667483c1 [NFC] Add tests for PR35743
llvm-svn: 324209
2018-02-05 08:09:49 +00:00
Chad Rosier a097bc69df [LV] Use Demanded Bits and ValueTracking for reduction type-shrinking
The type-shrinking logic in reduction detection, although narrow in scope, is
also rather ad-hoc, which has led to bugs (e.g., PR35734). This patch modifies
the approach to rely on the demanded bits and value tracking analyses, if
available. We currently perform type-shrinking separately for reductions and
other instructions in the loop. Long-term, we should probably think about
computing minimal bit widths in a more complete way for the loops we want to
vectorize.

PR35734
Differential Revision: https://reviews.llvm.org/D42309

llvm-svn: 324195
2018-02-04 15:42:24 +00:00
Craig Topper 0d797a34d8 [X86] Add support for passing 'prefer-vector-width' function attribute into X86Subtarget and exposing via X86's getRegisterWidth TTI interface.
This will cause the vectorizers to do some limiting of the vector widths they create. This is not a strict limit. There are reasons I know of that the loop vectorizer will generate larger vectors for.

I've written this in such a way that the interface will only return a properly supported width(0/128/256/512) even if the attribute says something funny like 384 or 10.

This has been split from D41895 with the remainder in a follow up commit.

llvm-svn: 323015
2018-01-20 00:26:08 +00:00
Craig Topper 83b0a98902 [X86] Use vmovdqu64/vmovdqa64 for unmasked integer vector stores for consistency with loads.
Previously we used 64 for vXi64 stores and 32 for everything else. This change uses 64 for everything just like do for loads.

llvm-svn: 322820
2018-01-18 07:44:09 +00:00
Andrei Elovikov 7457aa0bce [LV] Don't call recordVectorLoopValueForInductionCast for newly-created IV from a trunc.
Summary:
This method is supposed to be called for IVs that have casts in their use-def
chains that are completely ignored after vectorization under PSE. However, for
truncates of such IVs the same InductionDescriptor is used during
creation/widening of both original IV based on PHINode and new IV based on
TruncInst.

This leads to unintended second call to recordVectorLoopValueForInductionCast
with a VectorLoopVal set to the newly created IV for a trunc and causes an
assert due to attempt to store new information for already existing entry in the
map. This is wrong and should not be done.

Fixes PR35773.

Reviewers: dorit, Ayal, mssimpso

Reviewed By: dorit

Subscribers: RKSimon, dim, dcaballe, hsaito, llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D41913

llvm-svn: 322473
2018-01-15 10:56:07 +00:00
Florian Hahn 467abe3e4f [LV] Remove unnecessary DoExtraAnalysis guard (silent bug)
canVectorize is only checking if the loop has a normalized pre-header if DoExtraAnalysis is true.
This doesn't make sense to me because reporting analysis information shouldn't alter legality
checks. This is probably the result of a last minute minor change before committing (?).

Patch by Diego Caballero.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D40973

llvm-svn: 321172
2017-12-20 13:28:38 +00:00
Hal Finkel 92ea8acbcd Move Transforms/LoopVectorize/consecutive-ptr-cg-bug.ll into the X86 subdirectory
This test depends on X86's TTI; move into the X86 subdirectory.

llvm-svn: 320914
2017-12-16 05:10:20 +00:00
Hal Finkel 5444f40965 [LV] Extend InstWidening with CM_Widen_Recursive
Changes to the original scalar loop during LV code gen cause the return value
of Legal->isConsecutivePtr() to be inconsistent with the return value during
legal/cost phases (further analysis and information of the bug is in D39346).
This patch is an alternative fix to PR34965 following the CM_Widen approach
proposed by Ayal and Gil in D39346. It extends InstWidening enum with
CM_Widen_Reverse to properly record the widening decision for consecutive
reverse memory accesses and, consequently, get rid of the
Legal->isConsetuviePtr() call in LV code gen. I think this is a simpler/cleaner
solution to PR34965 than the one in D39346.

Fixes PR34965.

Patch by Diego Caballero, thanks!

Differential Revision: https://reviews.llvm.org/D40742

llvm-svn: 320913
2017-12-16 02:55:24 +00:00
Dorit Nuzman 4750c785b3 [LV] Support efficient vectorization of an induction with redundant casts
D30041 extended SCEVPredicateRewriter to improve handling of Phi nodes whose
update chain involves casts; PSCEV can now build an AddRecurrence for some
forms of such phi nodes, under the proper runtime overflow test. This means
that we can identify such phi nodes as an induction, and the loop-vectorizer
can now vectorize such inductions, however inefficiently. The vectorizer
doesn't know that it can ignore the casts, and so it vectorizes them.

This patch records the casts in the InductionDescriptor, so that they could
be marked to be ignored for cost calculation (we use VecValuesToIgnore for
that) and ignored for vectorization/widening/scalarization (i.e. treated as
TriviallyDead).

In addition to marking all these casts to be ignored, we also need to make
sure that each cast is mapped to the right vector value in the vector loop body
(be it a widened, vectorized, or scalarized induction). So whenever an
induction phi is mapped to a vector value (during vectorization/widening/
scalarization), we also map the respective cast instruction (if exists) to that
vector value. (If the phi-update sequence of an induction involves more than one
cast, then the above mapping to vector value is relevant only for the last cast
of the sequence as we allow only the "last cast" to be used outside the
induction update chain itself).

This is the last step in addressing PR30654.

llvm-svn: 320672
2017-12-14 07:56:31 +00:00
Dorit Nuzman 927b31600e [LV] Ignore the cost of values that will not appear in the vectorized loop
VecValuesToIgnore holds values that will not appear in the vectorized loop.
We should therefore ignore their cost when VF > 1.

Differential Revision: https://reviews.llvm.org/D40883

llvm-svn: 320463
2017-12-12 08:57:43 +00:00
Dorit Nuzman 5809e70540 [SCEV] Fix wrong Equal predicate created in getAddRecForPhiWithCasts
CreateAddRecFromPHIWithCastsImpl() adds an IncrementNUSW overflow predicate
which allows the PSCEV rewriter to rewrite this scev expression:
 (zext i8 {0, + , (trunc i32 step to i8)} to i32)
into
 {0, +, (sext i8 (trunc i32 step to i8) to i32)}

But then it adds the wrong Equal predicate:
 %step == (zext i8 (trunc i32 %step to i8) to i32).
instead of:
 %step == (sext i8 (trunc i32 %step to i8) to i32)

This is fixed here.

Differential Revision: https://reviews.llvm.org/D40641

llvm-svn: 320298
2017-12-10 11:13:35 +00:00
Adam Nemet a502ee73c4 [LV] Interleaved access vectorization: fix computing new alias info
As a new access is generated spanning across multiple fields, we need to
propagate alias info from all the fields to form the most generic alias info.

rdar://35602528

Differential Revision: https://reviews.llvm.org/D40617

llvm-svn: 319979
2017-12-06 22:42:24 +00:00
Gil Rapaport 8b9d1f3c5b [LV] Model masking in VPlan, introducing VPInstructions
This patch adds a new abstraction layer to VPlan and leverages it to model the planned
instructions that manipulate masks (AND, OR, NOT), introduced during predication.

The new VPValue and VPUser classes model how data flows into, through and out
of a VPlan, forming the vertices of a planned Def-Use graph. The new
VPInstruction class is a generic single-instruction Recipe that models a
planned instruction along with its opcode, operands and users. See
VectorizationPlan.rst for more details.

Differential Revision: https://reviews.llvm.org/D38676

llvm-svn: 318645
2017-11-20 12:01:47 +00:00
Gil Rapaport 848581cadb [LV] Introduce VPBlendRecipe, VPWidenMemoryInstructionRecipe
This patch is part of D38676.

The patch introduces two new Recipes to handle instructions whose vectorization
involves masking. These Recipes take VPlan-level masks in D38676, but still rely
on ILV's existing createEdgeMask(), createBlockInMask() in this patch.

VPBlendRecipe handles intra-loop phi nodes, which are vectorized as a sequence
of SELECTs. Its execute() code is refactored out of ILV::widenPHIInstruction(),
which now handles only loop-header phi nodes.

VPWidenMemoryInstructionRecipe handles load/store which are to be widened
(but are not part of an Interleave Group). In this patch it simply calls
ILV::vectorizeMemoryInstruction on execute().

Differential Revision: https://reviews.llvm.org/D39068

llvm-svn: 318149
2017-11-14 12:09:30 +00:00
Dan Gohman 2c74fe977d Add an @llvm.sideeffect intrinsic
This patch implements Chandler's idea [0] for supporting languages that
require support for infinite loops with side effects, such as Rust, providing
part of a solution to bug 965 [1].

Specifically, it adds an `llvm.sideeffect()` intrinsic, which has no actual
effect, but which appears to optimization passes to have obscure side effects,
such that they don't optimize away loops containing it. It also teaches
several optimization passes to ignore this intrinsic, so that it doesn't
significantly impact optimization in most cases.

As discussed on llvm-dev [2], this patch is the first of two major parts.
The second part, to change LLVM's semantics to have defined behavior
on infinite loops by default, with a function attribute for opting into
potential-undefined-behavior, will be implemented and posted for review in
a separate patch.

[0] http://lists.llvm.org/pipermail/llvm-dev/2015-July/088103.html
[1] https://bugs.llvm.org/show_bug.cgi?id=965
[2] http://lists.llvm.org/pipermail/llvm-dev/2017-October/118632.html

Differential Revision: https://reviews.llvm.org/D38336

llvm-svn: 317729
2017-11-08 21:59:51 +00:00
Dorit Nuzman eb13dd3eac [LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a
single-iteration loop

This fixes PR34681. Avoid adding the "Stride == 1" predicate when we know that
Stride >= Trip-Count. Such a predicate will effectively optimize a single
or zero iteration loop, as Trip-Count <= Stride == 1.

Differential Revision: https://reviews.llvm.org/D38785

llvm-svn: 317438
2017-11-05 16:53:15 +00:00
Sanjay Patel b049173157 [SimplifyCFG] use pass options and remove the latesimplifycfg pass
This is no-functional-change-intended.

This is repackaging the functionality of D30333 (defer switch-to-lookup-tables) and 
D35411 (defer folding unconditional branches) with pass parameters rather than a named
"latesimplifycfg" pass. Now that we have individual options to control the functionality,
we could decouple when these fire (but that's an independent patch if desired). 

The next planned step would be to add another option bit to disable the sinking transform
mentioned in D38566. This should also make it clear that the new pass manager needs to
be updated to limit simplifycfg in the same way as the old pass manager.

Differential Revision: https://reviews.llvm.org/D38631

llvm-svn: 316835
2017-10-28 18:43:07 +00:00
Dehao Chen ed2d5402cb Do not add discriminator encoding for debug intrinsics.
Summary: There are certain requirements for debug location of debug intrinsics, e.g. the scope of the DILocalVariable should be the same as the scope of its debug location. As a result, we should not add discriminator encoding for debug intrinsics.

Reviewers: dblaikie, aprantl

Reviewed By: aprantl

Subscribers: JDevlieghere, aprantl, bjope, sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D39343

llvm-svn: 316703
2017-10-26 21:20:52 +00:00
Ayal Zaks c9e0f886e5 [LV] Fix PR34743 - handle casts that sink after interleaved loads
When ignoring a load that participates in an interleaved group, make sure to
move a cast that needs to sink after it.

Testcase derived from reproducer of PR34743.

Differential Revision: https://reviews.llvm.org/D38338

llvm-svn: 314986
2017-10-05 15:45:14 +00:00
Ayal Zaks fc3f7a4f0c [LV] Fix PR34711 - widen instruction ranges when sinking casts
Instead of trying to keep LastWidenRecipe updated after creating each recipe,
have tryToWiden() retrieve the last recipe of the current VPBasicBlock and check
if it's a VPWidenRecipe when attempting to extend its range. This ensures that
such extensions, optimized to maintain the original instruction order, do so
only when the instructions are to maintain their relative order. The latter does
not always hold, e.g., when a cast needs to sink to unravel first order
recurrence (r306884).

Testcase derived from reproducer of PR34711.

Differential Revision: https://reviews.llvm.org/D38339

llvm-svn: 314981
2017-10-05 12:41:49 +00:00
Matthew Simpson f4bb480b62 [LV] Use correct insertion point when type shrinking reductions
When type shrinking reductions, we should insert the truncations and extends at
the end of the loop latch block. Previously, these instructions were inserted
at the end of the loop header block. The difference is only a problem for loops
with predicated instructions (e.g., conditional stores and instructions that
may divide by zero). For these instructions, we create new basic blocks inside
the vectorized loop, which cause the loop header and latch to no longer be the
same block. This should fix PR34687.

Reference: https://bugs.llvm.org/show_bug.cgi?id=34687
llvm-svn: 314542
2017-09-29 18:07:39 +00:00
Sanjay Patel ecb175608f [InstCombine] remove extract-of-select vector transform (2nd try)
The 1st attempt at this:
https://reviews.llvm.org/rL314117
was reverted at:
https://reviews.llvm.org/rL314118

because of bot fails for clang tests that were checking optimized IR. That should be fixed with:
https://reviews.llvm.org/rL314144
...so try again. 

Original commit message:

The transform to convert an extract-of-a-select-of-vectors was added at:
https://reviews.llvm.org/rL194013

And a question about the validity of this transform was raised in the review:
https://reviews.llvm.org/D1539:
...but not answered AFAICT>

Most of the motivating cases in that patch are now handled by other combines. These are the tests that were added with
the original commit, but they are not regressing even after we remove the transform in this patch.

The diffs we see after removing this transform cause us to avoid increasing the instruction count, so we don't want to do
those transforms as canonicalizations.

The motivation for not turning a vector-select-of-vectors into a scalar operation is shown in PR33301:
https://bugs.llvm.org/show_bug.cgi?id=33301
...in those cases, we'll get vector ops with this patch rather than the vector/scalar mix that we currently see.

Differential Revision: https://reviews.llvm.org/D38006

llvm-svn: 314147
2017-09-25 20:30:53 +00:00
Sanjay Patel aa7f750bec revert r314117 because there are bogus clang tests that depend on the optimizer
llvm-svn: 314118
2017-09-25 17:00:04 +00:00
Sanjay Patel 9639897d77 [InstCombine] remove extract-of-select vector transform
The transform to convert an extract-of-a-select-of-vectors was added at:
rL194013

And a question about the validity of this transform was raised in the review:
https://reviews.llvm.org/D1539:
...but not answered AFAICT>

Most of the motivating cases in that patch are now handled by other combines. These are the tests that were added with
the original commit, but they are not regressing even after we remove the transform in this patch.

The diffs we see after removing this transform cause us to avoid increasing the instruction count, so we don't want to do
those transforms as canonicalizations.

The motivation for not turning a vector-select-of-vectors into a scalar operation is shown in PR33301:
https://bugs.llvm.org/show_bug.cgi?id=33301
...in those cases, we'll get vector ops with this patch rather than the vector/scalar mix that we currently see.

Differential Revision: https://reviews.llvm.org/D38006

llvm-svn: 314117
2017-09-25 16:41:34 +00:00
Alon Kom 682cfc1d4c [LV] Fix maximum legal VF calculation
This patch fixes pr34283, which exposed that the computation of
maximum legal width for vectorization was wrong, because it relied
on MaxInterleaveFactor to obtain the maximum stride used in the loop,
however not all strided accesses in the loop have an interleave-group
associated with them.
Instead of recording the maximum stride in the loop, which can be over
conservative (e.g. if the access with the maximum stride is not involved
in the dependence limitation), this patch tracks the actual maximum legal
width imposed by accesses that are involved in dependencies.

Differential Revision: https://reviews.llvm.org/D37507

llvm-svn: 313237
2017-09-14 07:40:02 +00:00
Anna Thomas 19529f75b9 [LV] Avoid computing the register usage for default VF. NFC
These are changes to reduce redundant computations when calculating a
feasible vectorization factor:
1. early return when target has no vector registers
2. don't compute register usage for the default VF.

Suggested during review for D37702.

llvm-svn: 313176
2017-09-13 19:35:45 +00:00
Ayal Zaks e2a8c0758f [LV] Fix PR34523 - avoid generating redundant selects
When converting a PHI into a series of 'select' instructions to combine the
incoming values together according their edge masks, initialize the first
value to the incoming value In0 of the first predecessor, instead of
generating a redundant assignment 'select(Cond[0], In0, In0)'. The latter
fails when the Cond[0] mask is null, representing a full mask, which can
happen only when there's a single incoming value.

No functional changes intended nor expected other than surviving null Cond[0]'s.

This fix follows D35725, which introduced using null to represent full masks.

Differential Revision: https://reviews.llvm.org/D37619

llvm-svn: 313119
2017-09-13 06:28:37 +00:00
Anna Thomas 9f1be02fa3 [LV] Clamp the VF to the trip count
Summary:
When the MaxVectorSize > ConstantTripCount, we should just clamp the
vectorization factor to be the ConstantTripCount.
This vectorizes loops where the TinyTripCountThreshold >= TripCount < MaxVF.

Earlier we were finding the maximum vector width, which could be greater than
the trip count itself. The Loop vectorizer does all the work for generating a
vectorizable loop, but in the end we would always choose the scalar loop (since
the VF > trip count). This allows us to choose the VF keeping in mind the trip
count if available.

This is a fix on top of rL312472.

Reviewers: Ayal, zvi, hfinkel, dneilson

Reviewed by: Ayal

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37702

llvm-svn: 313046
2017-09-12 16:32:45 +00:00
Zvi Rackover 9a087a357a LoopVectorize: MaxVF should not be larger than the loop trip count
Summary:
Improve how MaxVF is computed while taking into account that MaxVF should not be larger than the loop's trip count.

Other than saving on compile-time by pruning the possible MaxVF candidates, this patch fixes pr34438 which exposed the following flow:
1. Short trip count identified -> Don't bail out, set OptForSize:=True to avoid tail-loop and runtime checks.
2. Compute MaxVF returned 16 on a target supporting AVX512.
3. OptForSize -> choose VF:=MaxVF.
4. Bail out because TripCount = 8, VF = 16, TripCount % VF !=0 means we need a tail loop.

With this patch step 2. will choose MaxVF=8 based on TripCount.

Reviewers: Ayal, dorit, mkuper, hfinkel

Reviewed By: hfinkel

Subscribers: hfinkel, llvm-commits

Differential Revision: https://reviews.llvm.org/D37425

llvm-svn: 312472
2017-09-04 08:35:13 +00:00
Adrian Prantl 05782218ab Canonicalize the representation of empty an expression in DIGlobalVariableExpression
This change simplifies code that has to deal with
DIGlobalVariableExpression and mirrors how we treat DIExpressions in
debug info intrinsics. Before this change there were two ways of
representing empty expressions on globals, a nullptr and an empty
!DIExpression().

If someone needs to upgrade out-of-tree testcases:
  perl -pi -e 's/(!DIGlobalVariableExpression\(var: ![0-9]*)\)/\1, expr: !DIExpression())/g' <MYTEST.ll>
will catch 95%.

llvm-svn: 312144
2017-08-30 18:06:51 +00:00
Craig Topper 516e39cd38 [InstCombine] Teach select01 helper of foldSelectIntoOp to handle vector splats
We were handling some vectors in foldSelectIntoOp, but not if the operand of the bin op was any kind of vector constant. This patch fixes it to treat vector splats the same as scalars.

Differential Revision: https://reviews.llvm.org/D37232

llvm-svn: 311940
2017-08-28 22:00:27 +00:00
Ayal Zaks 1f58dda4e4 [LV] Fix PR34248 - recommit D32871 after revert r311304
Original commit r311077 of D32871 was reverted in r311304 due to failures
reported in PR34248.

This recommit fixes PR34248 by restricting the packing of predicated scalars
into vectors only when vectorizing, avoiding doing so when unrolling w/o
vectorizing. Added a test derived from the reproducer of PR34248.

llvm-svn: 311849
2017-08-27 12:55:46 +00:00
Chandler Carruth bd6dc14230 Revert r311077: [LV] Using VPlan ...
This causes LLVM to assert fail on PPC64 and crash / infloop in other
cases. Filed http://llvm.org/PR34248 with reproducer attached.

llvm-svn: 311304
2017-08-20 23:17:11 +00:00
Elena Demikhovsky f58f838495 Changed basic cost of store operation on X86
Store operation takes 2 UOps on X86 processors. The exact cost calculation affects several optimization passes including loop unroling.
This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks.

Differential Revision: https://reviews.llvm.org/D35888

llvm-svn: 311285
2017-08-20 12:34:29 +00:00
Aditya Kumar a525fffd07 [Loop Vectorize] Added a separate metadata
Added a separate metadata to indicate when the loop
has already been vectorized instead of setting width and count to 1.

Patch written by Divya Shanmughan and Aditya Kumar

Differential Revision: https://reviews.llvm.org/D36220

llvm-svn: 311281
2017-08-20 10:32:41 +00:00
Sam Elliott b0c9753691 Keep Optimization Remark Yaml in NewPM
Summary:
The New Pass Manager infrastructure was forgetting to keep around the optimization remark yaml file that the compiler might have been producing. This meant setting the option to '-' for stdout worked, but setting it to a filename didn't give file output (presumably it was deleted because compilation didn't explicitly keep it). This change just ensures that the file is kept if compilation succeeds.

So far I have updated one of the optimization remark output tests to add a version with the new pass manager. It is my intention for this patch to also include changes to all tests that use `-opt-remark-output=` but I wanted to get the code patch ready for review while I was making all those changes.

Fixes https://bugs.llvm.org/show_bug.cgi?id=33951

Reviewers: anemet, chandlerc

Reviewed By: anemet, chandlerc

Subscribers: javed.absar, chandlerc, fhahn, llvm-commits

Differential Revision: https://reviews.llvm.org/D36906

llvm-svn: 311271
2017-08-20 01:30:45 +00:00
Ayal Zaks 6627883369 [LV] Using VPlan to model the vectorized code and drive its transformation
VPlan is an ongoing effort to refactor and extend the Loop Vectorizer. This
patch introduces the VPlan model into LV and uses it to represent the vectorized
code and drive the generation of vectorized IR.

In this patch VPlan models the vectorized loop body: the vectorized control-flow
is represented using VPlan's Hierarchical CFG, with predication refactored from
being a post-vectorization-step into a vectorization planning step modeling
if-then VPRegionBlocks, and generating code inline with non-predicated code. The
vectorized code within each VPBasicBlock is represented as a sequence of
Recipes, each responsible for modelling and generating a sequence of IR
instructions. To keep the size of this commit manageable the Recipes in this
patch are coarse-grained and capture large chunks of LV's code-generation logic.
The constructed VPlans are dumped in dot format under -debug.

This commit retains current vectorizer output, except for minor instruction
reorderings; see associated modifications to lit tests.

For further details on the VPlan model see docs/Proposals/VectorizationPlan.rst
and its references.

Authors: Gil Rapaport and Ayal Zaks

Differential Revision: https://reviews.llvm.org/D32871

llvm-svn: 311077
2017-08-17 09:29:59 +00:00
Ayal Zaks 25e2800e20 [LV] Minor savings to Sink casts to unravel first order recurrence
Two minor savings: avoid copying the SinkAfter map and avoid moving a cast if it
is not needed.

Differential Revision: https://reviews.llvm.org/D36408

llvm-svn: 310910
2017-08-15 08:32:59 +00:00
Anna Thomas 9b6e12f3dc [LoopVectorize] Fix assertion failure in Fcmp vectorization
Summary:
When vectorizing fcmps we can trip on incorrect cast assertion when setting the
FastMathFlags after generating the vectorized FCmp.
This can happen if the FCmp can be folded to true or false directly. The fix
here is to set the FastMathFlag using the FastMathFlagBuilder *before* creating
the FCmp Instruction. This is what's done by other optimizations such as
InstCombine.
Added a test case which trips on cast assertion without this patch.

Reviewers: Ayal, mssimpso, mkuper, gilr

Reviewed by: Ayal, mssimpso

Subscribers: llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D36244

llvm-svn: 310389
2017-08-08 18:07:44 +00:00
Matt Arsenault e6de494b74 LV: Don't insert runtime ptr checks on divergent targets
llvm-svn: 309890
2017-08-02 21:43:08 +00:00
Ayal Zaks e841b214b1 [LV] Avoid redundant operations manipulating masks
The Loop Vectorizer generates redundant operations when manipulating masks:
AND with true, OR with false, compare equal to true. Instead of relying on
a subsequent pass to clean them up, this patch avoids generating them.

Use null (no-mask) to represent all-one full masks, instead of a constant
all-one vector, following the convention of masked gathers and scatters.

Preparing for a follow-up VPlan patch in which these mask manipulating
operations are modeled using recipes.

Differential Revision: https://reviews.llvm.org/D35725

llvm-svn: 309558
2017-07-31 13:21:42 +00:00
Adrian Prantl abe04759a6 Remove the obsolete offset parameter from @llvm.dbg.value
There is no situation where this rarely-used argument cannot be
substituted with a DIExpression and removing it allows us to simplify
the DWARF backend. Note that this patch does not yet remove any of
the newly dead code.

rdar://problem/33580047
Differential Revision: https://reviews.llvm.org/D35951

llvm-svn: 309426
2017-07-28 20:21:02 +00:00
Michael Zolotukhin cd2255ea6a [tests] Cleanup vect.omp.persistence.ll test.
llvm-svn: 308964
2017-07-25 10:35:16 +00:00
Balaram Makam b05a55787a [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.
Summary:
When simplifying unconditional branches from empty blocks, we pre-test if the
BB belongs to a set of loop headers and keep the block to prevent passes from
destroying canonical loop structure. However, the current algorithm fails if
the destination of the branch is a loop header. Especially when such a loop's
latch block is folded into loop header it results in additional backedges and
LoopSimplify turns it into a nested loop which prevent later optimizations
from being applied (e.g., loop  unrolling and loop interleaving).

This patch augments the existing algorithm by further checking if the
destination of the branch belongs to a set of loop headers and defer
eliminating it if yes to LateSimplifyCFG.

Fixes PR33605: https://bugs.llvm.org/show_bug.cgi?id=33605

Reviewers: efriedma, mcrosier, pacxx, hsung, davidxl

Reviewed By: efriedma

Subscribers: ashutosh.nema, gberry, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D35411

llvm-svn: 308422
2017-07-19 08:53:34 +00:00
Ayal Zaks 8c452d76ed [LV] Test once if vector trip count is zero, instead of twice
Generate a single test to decide if there are enough iterations to jump to the
vectorized loop, or else go to the scalar remainder loop. This test compares the
Scalar Trip Count: if STC < VF * UF go to the scalar loop. If
requiresScalarEpilogue() holds, at-least one iteration must remain scalar; the
rest can be used to form vector iterations. So in this case the test checks
instead if (STC - 1) < VF * UF by comparing STC <= VF * UF, and going to the
scalar loop if so. Otherwise the vector loop is entered for at-least one vector
iteration.

This test covers the case where incrementing the backedge-taken count will
overflow leading to an incorrect trip count of zero. In this (rare) case we will
also avoid the vector loop and jump to the scalar loop.

This patch simplifies the existing tests and effectively removes the basic-block
originally named "min.iters.checked", leaving the single test in block
"vector.ph".

Original observation and initial patch by Evgeny Stupachenko.

Differential Revision: https://reviews.llvm.org/D34150

llvm-svn: 308421
2017-07-19 05:16:39 +00:00
Dorit Nuzman ca4fd18ddc PSCEV] Create AddRec for Phis in cases of possible integer overflow,
using runtime checks

Extend the SCEVPredicateRewriter to work a bit harder when it encounters an
UnknownSCEV for a Phi node; Try to build an AddRecurrence also for Phi nodes
whose update chain involves casts that can be ignored under the proper runtime
overflow test. This is one step towards addressing PR30654.

Differential revision: http://reviews.llvm.org/D30041

llvm-svn: 308299
2017-07-18 11:57:08 +00:00
Michael Kuperstein fdb46b2fb4 [LV] Don't allow outside uses of IVs if the SCEV is predicated on loop conditions.
This fixes PR33706.
Differential Revision: https://reviews.llvm.org/D35227

llvm-svn: 307837
2017-07-12 19:53:55 +00:00
Sanjay Patel 15689aeae9 [LoopVectorize] partly revert r307475
Bots are failing because of the additional checks.

llvm-svn: 307476
2017-07-08 16:34:46 +00:00
Sanjay Patel 28f53ef44d [LoopVectorize] auto-generate complete checks; NFC
I'm looking at a cmp transform in InstCombine that would affect these tests,
but it's hard to know if it makes things better or worse without seeing the
full IR. OTOH, maybe these tests shouldn't be running a bunch of transform
passes in the first place?

llvm-svn: 307475
2017-07-08 16:10:42 +00:00
NAKAMURA Takumi 91a00b974d llvm/test/Transforms/LoopVectorize/X86/slm-no-vectorize.ll: -debug is available in +Asserts.
llvm-svn: 306979
2017-07-02 14:25:27 +00:00
Mohammed Agabaria eb09a810e6 [X86][CM] update add\sub costs of vectors of 64 in X86\SLM arch
this patch updates the cost of addq\subq (add\subtract of vectors of 64bits)
based on the performance numbers of SLM arch.

Differential Revision: https://reviews.llvm.org/D33983

llvm-svn: 306974
2017-07-02 12:16:15 +00:00
Teresa Johnson c12306c0ad Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."
This still breaks PPC tests we have. I'll forward reproduction
instructions to dehao.

llvm-svn: 306936
2017-07-01 03:24:09 +00:00
Teresa Johnson eb4fba9d61 re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306935
2017-07-01 03:24:08 +00:00
Teresa Johnson de56903bde revert r306336 for breaking ppc test.
llvm-svn: 306934
2017-07-01 03:24:07 +00:00
Teresa Johnson 1fbaffeba1 Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306933
2017-07-01 03:24:06 +00:00
Ayal Zaks 2ff59d4350 [LV] Sink casts to unravel first order recurrence
Check if a single cast is preventing handling a first-order-recurrence Phi,
because the scheduling constraints it imposes on the first-order-recurrence
shuffle are infeasible; but they can be made feasible by moving the cast
downwards. Record such casts and move them when vectorizing the loop.

Differential Revision: https://reviews.llvm.org/D33058

llvm-svn: 306884
2017-06-30 21:05:06 +00:00
Ayal Zaks 8d26f0a602 [LV] Optimize for size when vectorizing loops with tiny trip count
It may be detrimental to vectorize loops with very small trip count, as various
costs of the vectorized loop body as well as enclosing overheads including
runtime tests and scalar iterations may outweigh the gains of vectorizing. The
current cost model measures the cost of the vectorized loop body only, expecting
it will amortize other costs, and loops with known or expected very small trip
counts are not vectorized at all. This patch allows loops with very small trip
counts to be vectorized, but under OptForSize constraints, which ensure the cost
of the loop body is dominant, having no runtime guards nor scalar iterations.

Patch inspired by D32451.

Differential Revision: https://reviews.llvm.org/D34373

llvm-svn: 306803
2017-06-30 08:02:35 +00:00
Daniel Jasper 5ce1ce742e Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."
This still breaks PPC tests we have. I'll forward reproduction
instructions to dehao.

llvm-svn: 306792
2017-06-30 06:32:21 +00:00
Ayal Zaks d9bc43ef2a [LV] Fix PR33613 - retain order of insertelement per part
r306381 caused PR33613, by reversing the order in which insertelements were
generated per unroll part. This patch fixes PR33613 by retraining this order,
placing each set of insertelements per part immediately after the last scalar
being packed for this part. Includes a test case derived from PR33613.

Reference: https://bugs.llvm.org/show_bug.cgi?id=33613
Differential Revision: https://reviews.llvm.org/D34760

llvm-svn: 306575
2017-06-28 17:59:33 +00:00
Dehao Chen 920d022519 re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306473
2017-06-27 22:05:58 +00:00
Sanjay Patel 7227276d41 [InstCombine] canonicalize icmp predicate feeding select
This canonicalization was suggested in D33172 as a way to make InstCombine behavior more uniform. 
We have this transform for icmp+br, so unless there's some reason that icmp+select should be 
treated differently, we should do the same thing here.

The benefit comes from increasing the chances of creating identical instructions. This is shown in
the tests in logical-select.ll (PR32791). InstCombine doesn't fold those directly, but EarlyCSE 
can simplify the identical cmps, and then InstCombine can fold the selects together.

The possible regression for the tests in select.ll raises questions about poison/undef:
http://lists.llvm.org/pipermail/llvm-dev/2017-May/113261.html

...but that transform is just as likely to be triggered by this canonicalization as it is to be 
missed, so we're just pointing out a commutation deficiency in the pattern matching:
https://reviews.llvm.org/rL228409

Differential Revision: https://reviews.llvm.org/D34242

llvm-svn: 306435
2017-06-27 17:53:22 +00:00
Dehao Chen 8b7effb344 revert r306336 for breaking ppc test.
llvm-svn: 306344
2017-06-26 23:05:35 +00:00
Dehao Chen 79655792cc Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306336
2017-06-26 21:41:09 +00:00
Diana Picus b512e91515 Revert "Enable vectorizer-maximize-bandwidth by default."
This reverts commit r305960 because it broke self-hosting on AArch64.

llvm-svn: 305990
2017-06-22 10:00:28 +00:00
Dehao Chen 014db29b89 Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 305960
2017-06-21 22:01:32 +00:00
Matt Arsenault 67cd347e93 AMDGPU: Allow vectorization of packed types
llvm-svn: 305844
2017-06-20 20:38:06 +00:00
Taewook Oh 9083547ae3 Improve profile-guided heuristics to use estimated trip count.
Summary:
Existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if the loop executes frequently, if it has a small trip count per
each invocation, vectorization is not beneficial. On the other hand,
even if the loop invocation frequency is much smaller than the function
invocation frequency, if the trip count is high it is still beneficial
to vectorize the loop.

This patch uses estimated trip count computed from the profile metadata
as a primary metric to determine coldness of the loop. If the estimated
trip count cannot be computed, it falls back to the original heuristics.

Reviewers: Ayal, mssimpso, mkuper, danielcdh, wmi, tejohnson

Reviewed By: tejohnson

Subscribers: tejohnson, mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D32451

llvm-svn: 305729
2017-06-19 18:48:58 +00:00
George Burgess IV a20352e13e [LoopVectorize] Don't preserve nsw/nuw flags on shrunken ops.
If we're shrinking a binary operation, it may be the case that the new
operations wraps where the old didn't. If this happens, the behavior
should be well-defined. So, we can't always carry wrapping flags with us
when we shrink operations.

If we do, we get incorrect optimizations in cases like:

void foo(const unsigned char *from, unsigned char *to, int n) {
  for (int i = 0; i < n; i++)
    to[i] = from[i] - 128;
}

which gets optimized to:

void foo(const unsigned char *from, unsigned char *to, int n) {
  for (int i = 0; i < n; i++)
    to[i] = from[i] | 128;
}

Because:
- InstCombine turned `sub i32 %from.i, 128` into
  `add nuw nsw i32 %from.i, 128`.
- LoopVectorize vectorized the add to be `add nuw nsw <16 x i8>` with a
  vector full of `i8 128`s
- InstCombine took advantage of the fact that the newly-shrunken add
  "couldn't wrap", and changed the `add` to an `or`.

InstCombine seems happy to figure out whether we can add nuw/nsw on its
own, so I just decided to drop the flags. There are already a number of
places in LoopVectorize where we rely on InstCombine to clean up.

llvm-svn: 305053
2017-06-09 03:56:15 +00:00
Matthew Simpson 646475a9bc [LV] Reapply r303763 with fix for PR33193
r303763 caused build failures in some out-of-tree tests due to an assertion in
TTI. The original patch updated cost estimates for induction variable update
instructions marked for scalarization. However, it didn't consider that the
incoming value of an induction variable phi node could be a cast instruction.
This caused queries for cast instruction costs with a mix of vector and scalar
types. This patch includes a fix for cast instructions and the test case from
PR33193.

The fix was suggested by Jonas Paulsson <paulsson@linux.vnet.ibm.com>.

Reference: https://bugs.llvm.org/show_bug.cgi?id=33193
Original Differential Revision: https://reviews.llvm.org/D33457

llvm-svn: 304235
2017-05-30 19:55:57 +00:00
Joerg Sonnenberger 9375a25342 Revert r303763, results in asserts i.e. while building Ruby.
llvm-svn: 304179
2017-05-29 22:52:17 +00:00
Matthew Simpson d6f179cad6 [LV] Update type in cost model for scalarization
For non-uniform instructions marked for scalarization, we should update
`VectorTy` when computing instruction costs to reflect the scalar type. In
addition to determining instruction costs, this type is also used to signal
that all instructions in the loop will be scalarized. This currently affects
memory instructions and non-pointer induction variables and their updates. (We
also mark GEPs scalar after vectorization, but their cost is computed together
with memory instructions.) For scalarized induction updates, this patch also
scales the scalar cost by the vectorization factor, corresponding to each
induction step.

llvm-svn: 303763
2017-05-24 15:26:15 +00:00
Jonas Paulsson 8624b7e1ce [LoopVectorizer] Let target prefer scalar addressing computations.
The loop vectorizer usually vectorizes any instruction it can and then
extracts the elements for a scalarized use. On SystemZ, all elements
containing addresses must be extracted into address registers (GRs). Since
this extraction is not free, it is better to have the address in a suitable
register to begin with. By forcing address arithmetic instructions and loads
of addresses to be scalar after vectorization, two benefits result:

* No need to extract the register
* LSR optimizations trigger (LSR isn't handling vector addresses currently)

Benchmarking show improvements on SystemZ with this new behaviour.

Any other target could try this by returning false in the new hook
prefersVectorizedAddressing().

Review: Renato Golin, Elena Demikhovsky, Ulrich Weigand
https://reviews.llvm.org/D32422

llvm-svn: 303744
2017-05-24 13:42:56 +00:00
Ayal Zaks 589e1d9610 [LV] Report multiple reasons for not vectorizing under allowExtraAnalysis
The default behavior of -Rpass-analysis=loop-vectorizer is to report only the
first reason encountered for not vectorizing, if one is found, at which time the
vectorizer aborts its handling of the loop. This patch allows multiple reasons
for not vectorizing to be identified and reported, at the potential expense of
additional compile-time, under allowExtraAnalysis which can currently be turned
on by Clang's -fsave-optimization-record and opt's -pass-remarks-missed.

Removed from LoopVectorizationLegality::canVectorize() the redundant checking
and reporting if we CantComputeNumberOfIterations, as LAI::canAnalyzeLoop() also
does that. This redundancy is caught by a lit test once multiple reasons are
reported.

Patch initially developed by Dror Barak.

Differential Revision: https://reviews.llvm.org/D33396

llvm-svn: 303613
2017-05-23 07:08:02 +00:00
Amara Emerson c9916d7e97 Re-commit r302678, fixing PR33053.
The issue was that the AArch64 TTI hook allowed unpacked integer cmp reductions
which didn't have a lowering.

llvm-svn: 303211
2017-05-16 21:29:22 +00:00
Matthew Simpson af60af1ed5 Revert 303174, 303176, and 303178
These commits are breaking the bots. Reverting to investigate.

llvm-svn: 303182
2017-05-16 15:50:30 +00:00
Matthew Simpson 62a7fab6b9 Make test target-specific
llvm-svn: 303178
2017-05-16 15:33:22 +00:00
Matthew Simpson c3c92cf2c7 Fix test case to unbreak bots
llvm-svn: 303176
2017-05-16 15:20:27 +00:00
Matthew Simpson b7b5d55c38 [LV] Avoid potentential division by zero when selecting IC
llvm-svn: 303174
2017-05-16 14:43:55 +00:00
Hans Wennborg bd6e9e77a7 Revert r302678 "[AArch64] Enable use of reduction intrinsics."
This caused PR33053.

Original commit message:

> The new experimental reduction intrinsics can now be used, so I'm enabling this
> for AArch64. We will need this for SVE anyway, so it makes sense to do this for
> NEON reductions as well.
>
> The existing code to match shufflevector patterns are replaced with a direct
> lowering of the reductions to AArch64-specific nodes. Tests updated with the
> new, simpler, representation.
>
> Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 303115
2017-05-15 20:59:32 +00:00
Simon Pilgrim 7d62e4b455 [LoopOptimizer][Fix]PR32859, PR24738
The Loop vectorizer pass introduced undef value while it is fixing output of LCSSA form.
Here it is:

before: %e.0.ph = phi i32 [ 0, %for.inc.2.i ]
after: %e.0.ph = phi i32 [ 0, %for.inc.2.i ], [ undef, %middle.block ]

and after this change we have:

%e.0.ph = phi i32 [ 0, %for.inc.2.i ]
%e.0.ph = phi i32 [ 0, %for.inc.2.i ], [ 0, %middle.block ]

Committed on behalf of @dtemirbulatov

Differential Revision: https://reviews.llvm.org/D33055

llvm-svn: 302988
2017-05-13 13:25:57 +00:00
Andrew Kaylor b01e94ee8d [TLI] Add mapping for various '__<func>_finite' forms of the math routines to SVML routines
Patch by Chris Chrulski

Differential Revision: https://reviews.llvm.org/D31789

llvm-svn: 302957
2017-05-12 22:11:26 +00:00
Amara Emerson 816542ceb3 [AArch64] Enable use of reduction intrinsics.
The new experimental reduction intrinsics can now be used, so I'm enabling this
for AArch64. We will need this for SVE anyway, so it makes sense to do this for
NEON reductions as well.

The existing code to match shufflevector patterns are replaced with a direct
lowering of the reductions to AArch64-specific nodes. Tests updated with the
new, simpler, representation.

Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 302678
2017-05-10 15:15:38 +00:00
Anna Thomas 0691483435 [LV] Fix insertion point for shuffle vectors in first order recurrence
Summary:
In first order recurrence vectorization, when the previous value is a phi node, we need to
set the insertion point to the first non-phi node.
We can have the previous value being a phi node, due to the generation of new
IVs as part of trunc optimization [1].

[1] https://reviews.llvm.org/rL294967

Reviewers: mssimpso, mkuper

Subscribers: mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D32969

llvm-svn: 302532
2017-05-09 14:29:33 +00:00
Elad Cohen ef5798acf5 Support arbitrary address space pointers in masked gather/scatter intrinsics.
Fixes PR31789 - When loop-vectorize tries to use these intrinsics for a
non-default address space pointer we fail with a "Calling a function with a
bad singature!" assertion. This patch solves this by adding the 'vector of
pointers' argument as an overloaded type which will determine the address
space.

Differential revision: https://reviews.llvm.org/D31490

llvm-svn: 302018
2017-05-03 12:28:54 +00:00
Matthew Simpson 9eed0bee3d [LV] Handle external uses of floating-point induction variables
Reference: https://bugs.llvm.org/show_bug.cgi?id=32758
Differential Revision: https://reviews.llvm.org/D32445

llvm-svn: 301428
2017-04-26 16:23:02 +00:00
Gil Rapaport 5c875c3d6f [LV] Make LIT test insensitive to basic block numbering
This patch is part of D28975's breakdown.

induction.ll encodes the specific (and rather arbitrary) numbers given to
predicated basic blocks by the unique naming mechanism, which makes it
sensitive to changes in LV's instruction generation order. This patch replaces
those specific numbers with a numeric pattern.

Differential Revision: https://reviews.llvm.org/D32404

llvm-svn: 301345
2017-04-25 18:14:24 +00:00
Matthew Simpson e2037d24f9 [LV] Model if-converted phi node costs
Phi nodes in non-header blocks are converted to select instructions after
if-conversion. This patch updates the cost model to account for the selects.

Differential Revision: https://reviews.llvm.org/D31906

llvm-svn: 300980
2017-04-21 14:14:54 +00:00
Anna Thomas dcdb325fee [LV] Fix the vector code generation for first order recurrence
Summary:
In first order recurrences where phi's are used outside the loop,
we should generate an additional vector.extract of the second last element from
the vectorized phi update.
This is because we require the phi itself (which is the value at the second last
iteration of the vector loop) and not the phi's update within the loop.
Also fix the code gen when we just unroll, but don't vectorize.
Fixes PR32396.

Reviewers: mssimpso, mkuper, anemet

Subscribers: llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D31979

llvm-svn: 300238
2017-04-13 18:59:25 +00:00
Renato Golin af3bc2089e [SystemZ] Fix more target specific tests
llvm-svn: 300081
2017-04-12 18:03:09 +00:00
Sanjay Patel 33439f982b [InstCombine] morph an existing instruction instead of creating a new one
One potential way to make InstCombine (very slightly?) faster is to recycle instructions 
when possible instead of creating new ones. It's not explicitly stated AFAIK, but we don't
consider this an "InstSimplify". We could, however, make a new layer to house transforms 
like this if that makes InstCombine more manageable (just throwing out an idea; not sure 
how much opportunity is actually here).

Differential Revision: https://reviews.llvm.org/D31863

llvm-svn: 300067
2017-04-12 15:11:33 +00:00
Jonas Paulsson 592dbea779 [LoopVectorizer] Improve handling of branches during cost estimation.
The cost for a branch after vectorization is very different depending on if
the vectorizer will if-convert the block (branch is eliminated), or if
scalarized and predicated blocks will be produced (branch duplicated before
each block). There is also the case of remaining scalar branches, such as the
back-edge branch.

This patch handles these cases differently with TTI based cost estimates.

Review: Matthew Simpson
https://reviews.llvm.org/D31175

llvm-svn: 300058
2017-04-12 13:13:15 +00:00
Jonas Paulsson da74ed42da [LoopVectorizer, TTI] New method supportsEfficientVectorElementLoadStore()
Since SystemZ supports vector element load/store instructions, there is no
need for extracts/inserts if a vector load/store gets scalarized.

This patch lets Target specify that it supports such instructions by means of
a new TTI hook that defaults to false.

The use for this is in the LoopVectorizer getScalarizationOverhead() method,
which will with this patch produce a smaller sum for a vector load/store on
SystemZ.

New test: test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll

Review: Adam Nemet
https://reviews.llvm.org/D30680

llvm-svn: 300056
2017-04-12 12:41:37 +00:00
Jonas Paulsson fccc7d66c3 [SystemZ] TargetTransformInfo cost functions implemented.
getArithmeticInstrCost(), getShuffleCost(), getCastInstrCost(),
getCmpSelInstrCost(), getVectorInstrCost(), getMemoryOpCost(),
getInterleavedMemoryOpCost() implemented.

Interleaved access vectorization enabled.

BasicTTIImpl::getCastInstrCost() improved to check for legal extending loads,
in which case the cost of the z/sext instruction becomes 0.

Review: Ulrich Weigand, Renato Golin.
https://reviews.llvm.org/D29631

llvm-svn: 300052
2017-04-12 11:49:08 +00:00
Anna Thomas 00dc1b74b7 [LV] Avoid vectorizing first order recurrence when phi uses are outside loop
In the vectorization of first order recurrence, we vectorize such
that the last element in the vector will be the one extracted to pass into the
scalar remainder loop. However, this is not true when there is a phi (other
than the primary induction variable) is used outside the loop.
In such a case, we need the value from the second last iteration (i.e.
the phi value), not the last iteration (which would be the phi update).
I've added a test case for this. Also see PR32396.

A follow up patch would generate the correct code gen for such cases,
and turn this vectorization on.

Differential Revision: https://reviews.llvm.org/D31910

Reviewers: mssimpso
llvm-svn: 299985
2017-04-11 21:02:00 +00:00
Anna Thomas 98cbb067ce [LV] Move first order recurrence test to common folder. NFC
llvm-svn: 299969
2017-04-11 18:31:42 +00:00
Matt Arsenault f10061ec70 Add address space mangling to lifetime intrinsics
In preparation for allowing allocas to have non-0 addrspace.

llvm-svn: 299876
2017-04-10 20:18:21 +00:00
Matthew Simpson 11fe2e9f2b Reapply r298620: [LV] Vectorize GEPs
This patch reapplies r298620. The original patch was reverted because of two
issues. First, the patch exposed a bug in InstCombine that caused the Chromium
builds to fail (PR32414). This issue was fixed in r299017. Second, the patch
introduced a bug in the vectorizer's scalars analysis that caused test suite
builds to fail on SystemZ. The scalars analysis was too aggressive and marked a
memory instruction scalar, even though it was going to be vectorized. This
issue has been fixed in the current patch and several new test cases for the
scalars analysis have been added.

llvm-svn: 299770
2017-04-07 14:15:34 +00:00
Matthew Simpson 1a4d5c9860 [LV] Make test case more robust
This test case depends on the loop being vectorized without forcing the
vectorization factor. If the profitability ever changes in the future (due to
cost model improvements), the test may no longer work as intended. Instead of
checking the resulting IR, we should just check the instruction costs. The
costs will be computed regardless if vectorization is profitable.

llvm-svn: 299545
2017-04-05 14:34:13 +00:00
Matthew Simpson b8ff4a4a70 [LV] Transform truncations of non-primary induction variables
The vectorizer tries to replace truncations of induction variables with new
induction variables having the smaller type. After r295063, this optimization
was applied to all integer induction variables, including non-primary ones.
When optimizing the truncation of a non-primary induction variable, we still
need to transform the new induction so that it has the correct start value.
This should fix PR32419.

Reference: https://bugs.llvm.org/show_bug.cgi?id=32419
llvm-svn: 298882
2017-03-27 20:07:38 +00:00
Ivan Krasin c2124e185c Revert r298620: [LV] Vectorize GEPs
Reason: breaks linking Chromium with LLD + ThinLTO (a pass crashes)
LLVM bug: https://bugs.llvm.org//show_bug.cgi?id=32413

Original change description:

[LV] Vectorize GEPs

This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.

With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.

The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.

Original Differential Revision: https://reviews.llvm.org/D30710

llvm-svn: 298735
2017-03-24 20:49:43 +00:00
Gil Rapaport 638d4538cd [LV] Add regression test for r297610
The new test asserts that scalarized memory operations get memcheck metadata
added even if the loop is only unrolled.

Differential Revision: https://reviews.llvm.org/D30972

llvm-svn: 298641
2017-03-23 20:02:23 +00:00
Matthew Simpson 4e7b71bc86 [LV] Vectorize GEPs
This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.

With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.

The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.

Differential Revision: https://reviews.llvm.org/D30710

llvm-svn: 298620
2017-03-23 16:29:58 +00:00
Matthew Simpson 1fb4064531 [LV] Delete unneeded scalar GEP creation code
The code for generating scalar base pointers in vectorizeMemoryInstruction is
not needed. We currently scalarize all GEPs and maintain the scalarized values
in VectorLoopValueMap. The GEP cloning in this unneeded code is the same as
that in scalarizeInstruction. The test cases that changed as a result of this
patch changed because we were able to reuse the scalarized GEP that we
previously generated instead of cloning a new one.

Differential Revision: https://reviews.llvm.org/D30587

llvm-svn: 298615
2017-03-23 16:07:21 +00:00
Matt Arsenault 3dbeefa978 AMDGPU: Mark all unspecified CC functions in tests as amdgpu_kernel
Currently the default C calling convention functions are treated
the same as compute kernels. Make this explicit so the default
calling convention can be changed to a non-kernel.

Converted with perl -pi -e 's/define void/define amdgpu_kernel void/'
on the relevant test directories (and undoing in one place that actually
wanted a non-kernel).

llvm-svn: 298444
2017-03-21 21:39:51 +00:00
Jonas Paulsson a48ea231c0 [TargetTransformInfo] getIntrinsicInstrCost() scalarization estimation improved
getIntrinsicInstrCost() used to only compute scalarization cost based on types.
This patch improves this so that the actual arguments are checked when they are
available, in order to handle only unique non-constant operands.

Tests updates:

Analysis/CostModel/X86/arith-fp.ll
Transforms/LoopVectorize/AArch64/interleaved_cost.ll
Transforms/LoopVectorize/ARM/interleaved_cost.ll

The improvement in getOperandsScalarizationOverhead() to differentiate on
constants made it necessary to update the interleaved_cost.ll tests even
though they do not relate to intrinsics.

Review: Hal Finkel
https://reviews.llvm.org/D29540

llvm-svn: 297705
2017-03-14 06:35:36 +00:00
Changpeng Fang 1be9b9f816 AMDGPU/SI: Disable unrolling in the loop vectorizer if the loop is not vectorized.
Reviewers:
  arsenm

Differential Revision:
  http://reviews.llvm.org/D30719

llvm-svn: 297328
2017-03-09 00:07:00 +00:00
Matthew Simpson 3388de1349 [LV] Select legal insert point when fixing first-order recurrences
Because IRBuilder performs constant-folding, it's not guaranteed that an
instruction in the original loop map to an instruction in the vector loop. It
could map to a constant vector instead. The handling of first-order recurrences
was incorrectly making this assumption when setting the IRBuilder's insert
point.

llvm-svn: 297302
2017-03-08 18:18:20 +00:00
Matthew Simpson 8966848d17 [LV] Make the test case for PR30183 less fragile
This patch also renames the PR number the test points to. The previous
reference was PR29559, but that bug was somehow deleted and recreated under
PR30183.

llvm-svn: 297295
2017-03-08 17:03:38 +00:00
Matthew Simpson 903dd5aa9b [LV] Add missing check labels to tests and reformat
llvm-svn: 297294
2017-03-08 16:55:34 +00:00
Matthew Simpson c86b2134c7 [LV] Consider users that are memory accesses in uniforms expansion step
When expanding the set of uniform instructions beyond the seed instructions
(e.g., consecutive pointers), we mark a new instruction uniform if all its
loop-varying users are uniform. We should also allow users that are consecutive
or interleaved memory accesses. This fixes cases where we have an instruction
that is used as the pointer operand of a consecutive access but also used by a
non-memory instruction that later becomes uniform as part of the expansion.

llvm-svn: 297179
2017-03-07 18:47:30 +00:00
Matthew Simpson aee9771ae2 [ARM/AArch64] Update costs for interleaved accesses with wide types
After r296750, we're able to match interleaved accesses having types wider than
128 bits. This patch updates the associated TTI costs.

Differential Revision: https://reviews.llvm.org/D29675

llvm-svn: 296751
2017-03-02 15:15:35 +00:00
Matthew Simpson 455c2ee394 [LV] Considier non-consecutive but vectorizable accesses for VF selection
When computing the smallest and largest types for selecting the maximum
vectorization factor, we currently ignore loads and stores of pointer types if
the memory access is non-consecutive. We do this because such accesses must be
scalarized regardless of vectorization factor, and thus shouldn't be considered
when determining the factor. This patch makes this check less aggressive by
also considering non-consecutive accesses that may be vectorized, such as
interleaved accesses. Because we don't know at the time of the check if an
accesses will certainly be vectorized (this is a cost model decision given a
particular VF), we consider all accesses that can potentially be vectorized.

Differential Revision: https://reviews.llvm.org/D30305

llvm-svn: 296747
2017-03-02 13:55:05 +00:00
Adam Nemet 15032a0455 [LV] These remark should have been missed remarks
The practice in LV is that we emit analysis remarks and then finally report
either a missed or applied remark on the final decision whether vectorization
is taking place.  On this code path, we were closing with an analysis remark.

llvm-svn: 296578
2017-03-01 04:31:15 +00:00
Craig Topper fe25988c68 [AVX-512] Fix the execution domain for AVX-512 integer broadcasts.
llvm-svn: 296290
2017-02-26 06:45:51 +00:00
Matthew Simpson bdc9c78880 [LV] Merge floating-point and integer induction widening code
This patch merges the existing floating-point induction variable widening code
into the integer induction variable widening code, creating a single set of
functions for both kinds of inductions. The primary motivation for doing this
is to enable vector phi node creation for floating-point induction variables.

Differential Revision: https://reviews.llvm.org/D30211

llvm-svn: 296145
2017-02-24 18:20:12 +00:00
Matthew Simpson 835b246d7f [LV] Update floating-point induction test checks (NFC)
llvm-svn: 295885
2017-02-22 21:56:02 +00:00
Matthew Simpson 5ef66ef248 [LV] Add scalar floating-point induction test (NFC)
llvm-svn: 295862
2017-02-22 19:09:38 +00:00
Karl-Johan Karlsson 6eaed7aceb [LoopVectorize] Added address space check when analysing interleaved accesses
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.

This is fixing pr31900.

Reviewers: HaoLiu, mssimpso, mkuper

Reviewed By: mssimpso, mkuper

Subscribers: llvm-commits, efriedma, mzolotukhin

Differential Revision: https://reviews.llvm.org/D29717

This reverts r295042 (re-applies r295038) with an additional fix for the
buildbot problem.

llvm-svn: 295858
2017-02-22 18:37:36 +00:00
Dehao Chen 7d230325ef Increases full-unroll threshold.
Summary:
The default threshold for fully unroll is too conservative. This patch doubles the full-unroll threshold

This change will affect the following speccpu2006 benchmarks (performance numbers were collected from Intel Sandybridge):

Performance:

403	0.11%
433	0.51%
445	0.48%
447	3.50%
453	1.49%
464	0.75%

Code size:

403	0.56%
433	0.96%
445	2.16%
447	2.96%
453	0.94%
464	8.02%

The compiler time overhead is similar with code size.

Reviewers: davidxl, mkuper, mzolotukhin, hfinkel, chandlerc

Reviewed By: hfinkel, chandlerc

Subscribers: mehdi_amini, zzheng, efriedma, haicheng, hfinkel, llvm-commits

Differential Revision: https://reviews.llvm.org/D28368

llvm-svn: 295538
2017-02-18 03:46:51 +00:00
Matthew Simpson f68e183f91 [LV] Remove constant restriction for vector phi creation
We previously only created a vector phi node for an induction variable if its
step had a constant integer type. However, the step actually only needs to be
loop-invariant. We only handle inductions having loop-invariant steps, so this
patch should enable vector phi node creation for all integer induction
variables that will be vectorized.

Differential Revision: https://reviews.llvm.org/D29956

llvm-svn: 295456
2017-02-17 16:09:07 +00:00
Matthew Simpson f09d13e5cc Reapply "[LV] Extend trunc optimization to all IVs with constant integer steps"
This reapplies commit r294967 with a fix for the execution time regressions
caught by the clang-cmake-aarch64-quick bot. We now extend the truncate
optimization to non-primary induction variables only if the truncate isn't
already free.

Differential Revision: https://reviews.llvm.org/D29847

llvm-svn: 295063
2017-02-14 16:28:32 +00:00
Karl-Johan Karlsson ec21b769ec Revert "[LoopVectorize] Added address space check when analysing interleaved accesses"
This reverts r295038. The buildbot clang-with-thin-lto-ubuntu failed.
I'm reverting to investigate.

llvm-svn: 295042
2017-02-14 10:06:16 +00:00
Karl-Johan Karlsson 2ec409cca2 [LoopVectorize] Added address space check when analysing interleaved accesses
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.

This is fixing pr31900.

Reviewers: HaoLiu, mssimpso, mkuper

Reviewed By: mssimpso, mkuper

Subscribers: llvm-commits, efriedma, mzolotukhin

Differential Revision: https://reviews.llvm.org/D29717

llvm-svn: 295038
2017-02-14 08:14:06 +00:00
Matthew Simpson 659f92e2aa Revert "[LV] Extend trunc optimization to all IVs with constant integer steps"
This reverts commit r294967. This patch caused execution time slowdowns in a
few LLVM test-suite tests, as reported by the clang-cmake-aarch64-quick bot.
I'm reverting to investigate.

llvm-svn: 294973
2017-02-13 18:02:35 +00:00
Matthew Simpson 7b7f40297f [LV] Extend trunc optimization to all IVs with constant integer steps
This patch extends the optimization of truncations whose operand is an
induction variable with a constant integer step. Previously we were only
applying this optimization to the primary induction variable. However, the cost
model assumes the optimization is applied to the truncation of all integer
induction variables (even regardless of step type). The transformation is now
applied to the other induction variables, and I've updated the cost model to
ensure it is better in sync with the transformation we actually perform.

Differential Revision: https://reviews.llvm.org/D29847

llvm-svn: 294967
2017-02-13 16:48:00 +00:00
Dorit Nuzman eac89d736c [LV/LoopAccess] Check statically if an unknown dependence distance can be
proven larger than the loop-count

This fixes PR31098: Try to resolve statically data-dependences whose
compile-time-unknown distance can be proven larger than the loop-count, 
instead of resorting to runtime dependence checking (which are not always 
possible).

For vectorization it is sufficient to prove that the dependence distance 
is >= VF; But in some cases we can prune unknown dependence distances early,
and even before selecting the VF, and without a runtime test, by comparing 
the distance against the loop iteration count. Since the vectorized code 
will be executed only if LoopCount >= VF, proving distance >= LoopCount 
also guarantees that distance >= VF. This check is also equivalent to the 
Strong SIV Test.

Reviewers: mkuper, anemet, sanjoy

Differential Revision: https://reviews.llvm.org/D28044

llvm-svn: 294892
2017-02-12 09:32:53 +00:00
Ahmed Bougacha 8425f453ef [ARM] Make f16 interleaved accesses expensive.
There are no vldN/vstN f16 variants, even with +fullfp16.
We could use the i16 variants, but, in practice, even with +fullfp16,
the f16 sequence leading to the i16 shuffle usually gets scalarized.
We'd need to improve our support for f16 codegen before getting there.

Teach the cost model to consider f16 interleaved operations as
expensive.  Otherwise, we are all but guaranteed to end up with
a large block of scalarized vector code.

llvm-svn: 294819
2017-02-11 01:53:04 +00:00
Dehao Chen fb02f7140a Encode duplication factor from loop vectorization and loop unrolling to discriminator.
Summary:
This patch starts the implementation as discuss in the following RFC: http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html

When optimization duplicates code that will scale down the execution count of a basic block, we will record the duplication factor as part of discriminator so that the offline process tool can find the duplication factor and collect the accurate execution frequency of the corresponding source code. Two important optimization that fall into this category is loop vectorization and loop unroll. This patch records the duplication factor for these 2 optimizations.

The recording will be guarded by a flag encode-duplication-in-discriminators, which is off by default.

Reviewers: probinson, aprantl, davidxl, hfinkel, echristo

Reviewed By: hfinkel

Subscribers: mehdi_amini, anemet, mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D26420

llvm-svn: 294782
2017-02-10 21:09:07 +00:00
Matthew Simpson df124a7569 [LV] Remove type restriction for vector phi creation
We previously only created a vector phi node for an induction variable if its
type matched the type of the canonical induction variable.

Differential Revision: https://reviews.llvm.org/D29776

llvm-svn: 294755
2017-02-10 16:15:26 +00:00
Elena Demikhovsky 5267edd3e3 [Loop Vectorizer] Cost-based decision for vectorization form of memory instruction.
Making the cost model selecting between Interleave, GatherScatter or Scalar vectorization form of memory instruction.
The right decision should be done for non-consecutive memory access instrcuctions that may have more than one vectorization solution.

This patch includes the following changes:
- Cost Model calculates the cost of Load/Store vector form and choose the better option between Widening, Interleave, GatherScactter and Scalarization. Cost Model keeps the widening decision.
- Arrays of Uniform and Scalar values are moved from Legality to Cost Model.
- Cost Model collects Uniforms and Scalars per VF. The collection is based on CM decision map of Loadis/Stores vectorization form.
- Vectorization of memory instruction is performed according to the CM decision.

Differential Revision: https://reviews.llvm.org/D27919

llvm-svn: 294503
2017-02-08 19:25:23 +00:00
Craig Topper e0ac7f3beb [X86] Remove PCOMMIT instruction support since Intel has deprecated this instruction with no plans to release products with it.
Intel's documentation for the deprecation https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

llvm-svn: 294405
2017-02-08 05:45:39 +00:00
Matthew Simpson 3877f397cd [LV] Add new ARM/AArch64 interleaved access cost model tests (NFC)
llvm-svn: 294342
2017-02-07 19:34:24 +00:00
Matthew Simpson 1cd02f13a5 [LV] Simplify ARM/AArch64 interleaved access cost model tests (NFC)
This patch removes unneeded instructions from the existing ARM/AArch64
interleaved access cost model tests. I'll be adding a similar set of tests in a
follow-on patch to increase coverage.

llvm-svn: 294336
2017-02-07 19:17:44 +00:00
Adam Nemet 0bf1b863b9 [LV] Also port failure remarks to new OptimizationRemarkEmitter API
llvm-svn: 293866
2017-02-02 05:41:51 +00:00
Chandler Carruth 6f4ed077d0 [LV] Fix an issue where forming LCSSA in the place that we did would
change the set of uniform instructions in the loop causing an assert
failure.

The problem is that the legalization checking also builds data
structures mapping various facts about the loop body. The immediate
cause was the set of uniform instructions. If these then change when
LCSSA is formed, the data structures would already have been built and
become stale. The included test case triggered an assert in loop
vectorize that was reduced out of the new PM's pipeline.

The solution is to form LCSSA early enough that no information is cached
across the changes made. The only really obvious position is outside of
the main logic to vectorize the loop. This also has the advantage of
removing one case where forming LCSSA could mutate the loop but we
wouldn't track that as a "Changed" state.

If it is significantly advantageous to do some legalization checking
prior to this, we can do a more careful positioning but it seemed best
to just back off to a safe position first.

llvm-svn: 293168
2017-01-26 10:41:09 +00:00
Mohammed Agabaria 20caee95e1 [X86] enable memory interleaving for X86\SLM arch.
Differential Revision: https://reviews.llvm.org/D28547

llvm-svn: 293040
2017-01-25 09:14:48 +00:00
Michael Kuperstein 230867e583 [LV] Run loop-simplify and LCSSA explicitly instead of "requiring" them
This changes the vectorizer to explicitly use the loopsimplify and lcssa utils,
instead of "requiring" the transformations as if they were analyses.

This is not NFC, since it changes the LCSSA behavior - we no longer run LCSSA
for all loops, but rather only for the loops we expect to modify.

Differential Revision: https://reviews.llvm.org/D28868

llvm-svn: 292456
2017-01-19 00:42:28 +00:00
Michael Kuperstein 7cefb409b0 [LV] Allow reductions that have several uses outside the loop
We currently check whether a reduction has a single outside user. We don't
really need to require that - we just need to make sure a single value is
used externally. The number of external users of that value shouldn't actually
matter.

Differential Revision: https://reviews.llvm.org/D28830

llvm-svn: 292424
2017-01-18 19:02:52 +00:00
Matthew Simpson e2c9ad9483 [LV] Add requires asserts to test case
llvm-svn: 292280
2017-01-17 22:21:33 +00:00
Matthew Simpson 3fbdaa5906 [LV] Mark non-consecutive-like pointers non-uniform
If a memory instruction will be vectorized, but it's pointer operand is
non-consecutive-like, the instruction is a gather or scatter operation. Its
pointer operand will be non-uniform. This should fix PR31671.

Reference: https://llvm.org/bugs/show_bug.cgi?id=31671
Differential Revision: https://reviews.llvm.org/D28819

llvm-svn: 292254
2017-01-17 20:51:39 +00:00
Mohammed Agabaria 81d0f17055 [X86] fixing failed test in commit: r291657
Missing Requires asserts.

llvm-svn: 291659
2017-01-11 09:03:11 +00:00
Mohammed Agabaria 2c96c43388 [X86] updating TTI costs for arithmetic instructions on X86\SLM arch.
updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.

special optimization case which replaces pmulld with pmullw\pmulhw\pshuf seq. 
In case if the real operands bitwidth <= 16.

Differential Revision: https://reviews.llvm.org/D28104 

llvm-svn: 291657
2017-01-11 08:23:37 +00:00
Michael Kuperstein ee31cbe35f [LV] Don't panic when encountering the IV of an outer loop.
Bail out instead of asserting when we encounter this situation,
which can actually happen.

The reason the test uses the new PM is that the "bad" phi, incidentally, gets
cleaned up by LoopSimplify. But LICM can create this kind of phi and preserve
loop simplify form, so the cleanup has no chance to run.

This fixes PR31190.
We may want to solve this in a less conservative manner, since this phi is
actually uniform within the inner loop (or we may want LICM to output a cleaner
promotion to begin with).

Differential Revision: https://reviews.llvm.org/D28490

llvm-svn: 291589
2017-01-10 19:32:30 +00:00
Matthew Simpson cf796478e9 [LV] Fix-up external IV users after updating dominator tree
This patch delays the fix-up step for external induction variable users until
after the dominator tree has been properly updated. This should fix PR30742.
The SCEVExpander in InductionDescriptor::transform can generate code in the
wrong location if the dominator tree is not up-to-date. We should work towards
keeping the dominator tree up-to-date throughout the transformation.

Reference: https://llvm.org/bugs/show_bug.cgi?id=30742
Differential Revision: https://reviews.llvm.org/D28168

llvm-svn: 291462
2017-01-09 19:05:29 +00:00
Mohammed Agabaria 23599ba794 Currently isLikelyComplexAddressComputation tries to figure out if the given stride seems to be 'complex' and need some extra cost for address computation handling.
This code seems to be target dependent which may not be the same for all targets.
Passed the decision whether the given stride is complex or not to the target by sending stride information via SCEV to getAddressComputationCost instead of 'IsComplex'.

Specifically at X86 targets we dont see any significant address computation cost in case of the strided access in general.

Differential Revision: https://reviews.llvm.org/D27518

llvm-svn: 291106
2017-01-05 14:03:41 +00:00
Adrian Prantl 1eadba1c8c Renumber testcase metadata nodes after r290153.
This patch renumbers the metadata nodes in debug info testcases after
https://reviews.llvm.org/D26769. This is a separate patch because it
causes so much churn. This was implemented with a python script that
pipes the testcases through llvm-as - | llvm-dis - and then goes
through the original and new output side-by side to insert all
comments at a close-enough location.

Differential Revision: https://reviews.llvm.org/D27765

llvm-svn: 290292
2016-12-22 00:45:21 +00:00
Adrian Prantl bceaaa9643 [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

This reapplies r289902 with additional testcase upgrades and a change
to the Bitcode record for DIGlobalVariable, that makes upgrading the
old format unambiguous also for variables without DIExpressions.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 290153
2016-12-20 02:09:43 +00:00
Florian Hahn 2e03213f90 [LoopVersioning] Require loop-simplify form for loop versioning.
Summary:
Requiring loop-simplify form for loop versioning ensures that the
runtime check block always dominates the exit block.
    
This patch closes #30958 (https://llvm.org/bugs/show_bug.cgi?id=30958).

Reviewers: silviu.baranga, hfinkel, anemet, ashutosh.nema

Subscribers: ashutosh.nema, mzolotukhin, efriedma, hfinkel, llvm-commits

Differential Revision: https://reviews.llvm.org/D27469

llvm-svn: 290116
2016-12-19 17:13:37 +00:00
Adrian Prantl 73ec065604 Revert "[IR] Remove the DIExpression field from DIGlobalVariable."
This reverts commit 289920 (again).
I forgot to implement a Bitcode upgrade for the case where a DIGlobalVariable
has not DIExpression. Unfortunately it is not possible to safely upgrade
these variables without adding a flag to the bitcode record indicating which
version they are.
My plan of record is to roll the planned follow-up patch that adds a
unit: field to DIGlobalVariable into this patch before recomitting.
This way we only need one Bitcode upgrade for both changes (with a
version flag in the bitcode record to safely distinguish the record
formats).

Sorry for the churn!

llvm-svn: 289982
2016-12-16 19:39:01 +00:00
Matthew Simpson a4964f291a Reapply "[LV] Enable vectorization of loops with conditional stores by default"
This patch reapplies r289863. The original patch was reverted because it
exposed a bug causing the loop vectorizer to crash in the Python runtime on
PPC. The underlying issue was fixed with r289958.

llvm-svn: 289975
2016-12-16 19:12:02 +00:00
Matthew Simpson 099af810de [LV] Don't attempt to type-shrink scalarized instructions
After r288909, instructions feeding predicated instructions may be scalarized
if profitable. Since these instructions will remain scalar, we shouldn't
attempt to type-shrink them. We should only truncate vector types to their
minimal bit widths. This bug was exposed by enabling the vectorization of loops
containing conditional stores by default.

llvm-svn: 289958
2016-12-16 16:52:35 +00:00
Chandler Carruth 48b4e614d8 Revert r289863: [LV] Enable vectorization of loops with conditional
stores by default

This uncovers a crasher in the loop vectorizer on PPC when building the
Python runtime. I'll send the testcase to the review thread for the
original commit.

llvm-svn: 289934
2016-12-16 11:31:39 +00:00
Adrian Prantl 74a835cda0 [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

This reapplies r289902 with additional testcase upgrades.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 289920
2016-12-16 04:25:54 +00:00
Adrian Prantl 03c6d31a3b Revert "[IR] Remove the DIExpression field from DIGlobalVariable."
This reverts commit 289902 while investigating bot berakage.

llvm-svn: 289906
2016-12-16 01:00:30 +00:00
Adrian Prantl ce13935776 [IR] Remove the DIExpression field from DIGlobalVariable.
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.

Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:

(1) The DIGlobalVariable should describe the source level variable,
    not how to get to its location.

(2) It makes it unsafe/hard to update the expressions when we call
    replaceExpression on the DIGLobalVariable.

(3) It makes it impossible to represent a global variable that is in
    more than one location (e.g., a variable with multiple
    DW_OP_LLVM_fragment-s).  We also moved away from attaching the
    DIExpression to DILocalVariable for the same reasons.

<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769

llvm-svn: 289902
2016-12-16 00:36:43 +00:00
Matthew Simpson 6a98bcfe33 [LV] Enable vectorization of loops with conditional stores by default
This patch sets the default value of the "-enable-cond-stores-vec" command line
option to "true".

Differential Revision: https://reviews.llvm.org/D27814

llvm-svn: 289863
2016-12-15 20:11:05 +00:00
Michael Kuperstein 3d23d4a234 [LV] Don't vectorize when we have a small static bound on trip count
We currently check if the exact trip count is known and is smaller than the
"tiny loop" bound. We should be checking the maximum bound on the trip count
instead.

Differential Revision: https://reviews.llvm.org/D27690

llvm-svn: 289583
2016-12-13 20:38:18 +00:00
Sanjoy Das 3336f681e3 [Verifier] Add verification for TBAA metadata
Summary:
This change adds some verification in the IR verifier around struct path
TBAA metadata.

Other than some basic sanity checks (e.g. we get constant integers where
we expect constant integers), this checks:

 - That by the time an struct access tuple `(base-type, offset)` is
   "reduced" to a scalar base type, the offset is `0`.  For instance, in
   C++ you can't start from, say `("struct-a", 16)`, and end up with
   `("int", 4)` -- by the time the base type is `"int"`, the offset
   better be zero.  In particular, a variant of this invariant is needed
   for `llvm::getMostGenericTBAA` to be correct.

 - That there are no cycles in a struct path.

 - That struct type nodes have their offsets listed in an ascending
   order.

 - That when generating the struct access path, you eventually reach the
   access type listed in the tbaa tag node.

Reviewers: dexonsmith, chandlerc, reames, mehdi_amini, manmanren

Subscribers: mcrosier, llvm-commits

Differential Revision: https://reviews.llvm.org/D26438

llvm-svn: 289402
2016-12-11 20:07:15 +00:00
Matthew Simpson 364da7e527 [LV] Scalarize operands of predicated instructions
This patch attempts to scalarize the operand expressions of predicated
instructions if they were conditionally executed in the original loop. After
scalarization, the expressions will be sunk inside the blocks created for the
predicated instructions. The transformation essentially performs
un-if-conversion on the operands.

The cost model has been updated to determine if scalarization is profitable. It
compares the cost of a vectorized instruction, assuming it will be
if-converted, to the cost of the scalarized instruction, assuming that the
instructions corresponding to each vector lane will be sunk inside a predicated
block, possibly avoiding execution. If it's more profitable to scalarize the
entire expression tree feeding the predicated instruction, the expression will
be scalarized; otherwise, it will be vectorized. We only consider the cost of
the entire expression to accurately estimate the cost of the required
insertelement and extractelement instructions.

Differential Revision: https://reviews.llvm.org/D26083

llvm-svn: 288909
2016-12-07 15:03:32 +00:00
Guozhi Wei 835de1f3ab [ppc] Correctly compute the cost of loading 32/64 bit memory into VSR
VSX has instructions lxsiwax/lxsdx that can load 32/64 bit value into VSX register cheaply. That patch makes it known to memory cost model, so the vectorization of the test case in pr30990 is beneficial.

Differential Revision: https://reviews.llvm.org/D26713

llvm-svn: 288560
2016-12-03 00:41:43 +00:00
Robert Lougher b0905209dd [LoopVectorizer] When estimating reg usage, unused insts may "end" another use
The register usage algorithm incorrectly treats instructions whose value is
not used within the loop (e.g. those that do not produce a value).

The algorithm first calculates the usages within the loop.  It iterates over
the instructions in order, and records at which instruction index each use
ends (in fact, they're actually recorded against the next index, as this is
when we want to delete them from the open intervals).

The algorithm then iterates over the instructions again, adding each
instruction in turn to a list of open intervals.  Instructions are then
removed from the list of open intervals when they occur in the list of uses
ended at the current index.

The problem is, instructions which are not used in the loop are skipped.
However, although they aren't used, the last use of a value may have been
recorded against that instruction index.  In this case, the use is not deleted
from the open intervals, which may then bump up the estimated register usage.

This patch fixes the issue by simply moving the "is used" check after the loop
which erases the uses at the current index.

Differential Revision: https://reviews.llvm.org/D26554

llvm-svn: 286969
2016-11-15 14:27:33 +00:00
Simon Pilgrim 27fed8e5d6 [X86][AVX] Fixed v16i16/v32i8 ADD/SUB costs on AVX1 subtargets
Add explicit v16i16/v32i8 ADD/SUB costs, matching the costs of v4i64/v8i32 - they were missing for some reason.

This has side effects on the LV max bandwidth tests (AVX1 now prefers 128-bit vectors vs AVX2 which still prefers 256-bit)

llvm-svn: 286832
2016-11-14 14:45:16 +00:00
Adam Nemet 9bfbf8bbdf [LV] Stop saying "use -Rpass-analysis=loop-vectorize"
This is PR28376.

Unfortunately given the current structure of optimization diagnostics we
lack the capability to tell whether the user has
passed -Rpass-analysis=loop-vectorize since this is local to the
front-end (BackendConsumer::OptimizationRemarkHandler).

So rather than printing this even if the user has already
passed -Rpass-analysis, this patch just punts and stops recommending
this option.  I don't think that getting this right is worth the
complexity.

Differential Revision: https://reviews.llvm.org/D26563

llvm-svn: 286662
2016-11-11 22:51:46 +00:00
Dehao Chen 06e079a530 Update vectorization debug info unittest.
Summary:
The change will test the change in r286159.
The idea behind the change: Make the dbg location different between loop header and preheader/exit. Originally, dbg location 21 exists in 3 BBs: preheader, header, critical edge (exit). Update the debug location of inside the loop header from !21 to !22 so that it will reflect the correct location.

Reviewers: probinson

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D26428

llvm-svn: 286403
2016-11-09 22:25:19 +00:00
Dorit Nuzman bf2c15b5dc Second attempt at r285517.
llvm-svn: 285568
2016-10-31 13:17:31 +00:00
Dorit Nuzman 06903d16af Revert r285517 due to build failures.
llvm-svn: 285518
2016-10-30 14:34:57 +00:00
Dorit Nuzman 3c1c658f24 [LoopVectorize] Make interleaved-accesses analysis less conservative about
possible pointer-wrap-around concerns, in some cases.

Before this patch, collectConstStridedAccesses (part of interleaved-accesses
analysis) called getPtrStride with [Assume=false, ShouldCheckWrap=true] when
examining all candidate pointers. This is too conservative. Instead, this
patch makes collectConstStridedAccesses use an optimistic approach, calling
getPtrStride with [Assume=true, ShouldCheckWrap=false], and then, once the
candidate interleave groups have been formed, revisits the pointer-wrapping
analysis but only where it matters: namely, in groups that have gaps, and where
the gaps are not at the very end of the group (in which case the loop is
peeled). This second time getPtrStride is called with [Assume=false,
ShouldCheckWrap=true], but this could further be improved to using Assume=true,
once we also add the logic to track that we are not going to meet the scev
runtime checks threshold.

Differential Revision: https://reviews.llvm.org/D25276

llvm-svn: 285517
2016-10-30 12:23:26 +00:00
Matthew Simpson 9b6755362b [LV] Correct misleading comments in test (NFC)
llvm-svn: 285402
2016-10-28 14:27:45 +00:00
Matthew Simpson c62266d680 [LV] Sink scalar operands of predicated instructions
When we predicate an instruction (div, rem, store) we place the instruction in
its own basic block within the vectorized loop. If a predicated instruction has
scalar operands, it's possible to recursively sink these scalar expressions
into the predicated block so that they might avoid execution. This patch sinks
as much scalar computation as possible into predicated blocks. We previously
were able to sink such operands only if they were extractelement instructions.

Differential Revision: https://reviews.llvm.org/D25632

llvm-svn: 285097
2016-10-25 18:59:45 +00:00
Michael Kuperstein b2443ed62b [X86] Enable interleaved memory access by default
This lets the loop vectorizer generate interleaved memory accesses on x86.

Differential Revision: https://reviews.llvm.org/D25350

llvm-svn: 284779
2016-10-20 21:04:31 +00:00
Matthew Simpson 41fa838f07 [LV] Avoid emitting trivially dead instructions
Some instructions from the original loop, when vectorized, can become trivially
dead. This happens because of the way we structure the new loop. For example,
we create new induction variables and induction variable "steps" in the new
loop. Thus, when we go to vectorize the original induction variable update, it
may no longer be needed due to the instructions we've already created. This
patch prevents us from creating these redundant instructions. This reduces code
size before simplification and allows greater flexibility in code generation
since we have fewer unnecessary instruction uses.

Differential Revision: https://reviews.llvm.org/D25631

llvm-svn: 284631
2016-10-19 19:22:02 +00:00
Matthew Simpson 1d4b163fc0 [LV] Account for predicated stores in instruction costs
This patch ensures that we scale the estimated cost of predicated stores by
block probability. This is a follow-on patch for r284123.

llvm-svn: 284126
2016-10-13 14:54:31 +00:00
Matthew Simpson 6cdb5a6f96 [LV] Avoid rounding errors for predicated instruction costs
This patch modifies the cost calculation of predicated instructions (div and
rem) to avoid the accumulation of rounding errors due to multiple truncating
integer divisions. The calculation for predicated stores will be addressed in a
follow-on patch since we currently don't scale the cost of predicated stores by
block probability.

Differential Revision: https://reviews.llvm.org/D25333

llvm-svn: 284123
2016-10-13 14:19:48 +00:00
Matthew Simpson a371c14ffe [LV] Don't mark multi-use branch conditions uniform
Previously, we marked the branch conditions of latch blocks uniform after
vectorization if they were instructions contained in the loop. However, if a
condition instruction has users other than the branch, it may not remain
uniform. This patch ensures the conditions we mark uniform are only used by the
branch. This should fix PR30627.

Reference: https://llvm.org/bugs/show_bug.cgi?id=30627
llvm-svn: 283563
2016-10-07 15:20:13 +00:00
Michael Kuperstein 5185b7dde3 [LV] Remove triples from target-independent vectorizer tests. NFC.
Vectorizer tests in the target-independent directory should not have a target
triple. If a test really needs to query a specific backend, it belongs in the
right target subdirectory (which "REQUIRES" the right backend). Otherwise, it
should not specify a triple.

llvm-svn: 283512
2016-10-06 23:57:25 +00:00
Matthew Simpson 7808833e28 [LV] Build all scalar steps for non-uniform induction variables
When building the steps for scalar induction variables, we previously attempted
to determine if all the scalar users of the induction variable were uniform. If
they were, we would only emit the step corresponding to vector lane zero. This
optimization was too aggressive. We generally don't know the entire set of
induction variable users that will be scalar. We have
isScalarAfterVectorization, but this is only a conservative estimate of the
instructions that will be scalarized. Thus, an induction variable may have
scalar users that aren't already known to be scalar. To avoid emitting unused
steps, we can only check that the induction variable is uniform. This should
fix PR30542.

Reference: https://llvm.org/bugs/show_bug.cgi?id=30542
llvm-svn: 282863
2016-09-30 15:13:52 +00:00
Matthew Simpson b764aba2ab [LV] Scalarize instructions marked scalar after vectorization
This patch ensures that we actually scalarize instructions marked scalar after
vectorization. Previously, such instructions may have been vectorized instead.

Differential Revision: https://reviews.llvm.org/D23889

llvm-svn: 282418
2016-09-26 17:08:37 +00:00
Matthew Simpson 15869f86d8 [LV] Don't emit unused scalars for uniform instructions
If we identify an instruction as uniform after vectorization, we know that we
should only use the value corresponding to the first vector lane of each unroll
iteration. However, when scalarizing such instructions, we still produce values
for the other vector lanes. This patch prevents us from generating the unused
scalars.

Differential Revision: https://reviews.llvm.org/D24275

llvm-svn: 282087
2016-09-21 16:50:24 +00:00
Adam Nemet e3cef93727 [LV] When reporting about a specific instruction without debug location use loop's
This can occur for example if some optimization drops the debug location.

llvm-svn: 282048
2016-09-21 03:14:20 +00:00
Elena Demikhovsky 5f8cc0c346 [Loop Vectorizer] Consecutive memory access - fixed and simplified
Amended consecutive memory access detection in Loop Vectorizer.
Load/Store were not handled properly without preceding GEP instruction.

Differential Revision: https://reviews.llvm.org/D20789

llvm-svn: 281853
2016-09-18 13:56:08 +00:00
Matthew Simpson b25e87fca5 [LV] Process pointer IVs with PHINodes in collectLoopUniforms
This patch moves the processing of pointer induction variables in
collectLoopUniforms from the consecutive pointer phase of the analysis to the
phi node phase. Previously, if a pointer induction variable was used by both a
scalarized non-memory instruction as well as a vectorized memory instruction,
we would incorrectly identify the pointer as uniform. Pointer induction
variables should be treated the same as other phi nodes. That is, they are
uniform if all users of the induction variable and induction variable update
are uniform.

Differential Revision: https://reviews.llvm.org/D24511

llvm-svn: 281485
2016-09-14 14:47:40 +00:00
Peter Collingbourne d4135bbc30 DebugInfo: New metadata representation for global variables.
This patch reverses the edge from DIGlobalVariable to GlobalVariable.
This will allow us to more easily preserve debug info metadata when
manipulating global variables.

Fixes PR30362. A program for upgrading test cases is attached to that
bug.

Differential Revision: http://reviews.llvm.org/D20147

llvm-svn: 281284
2016-09-13 01:12:59 +00:00
Matthew Simpson bfe5e1817b [LV] Ensure proper handling of multi-use case when collecting uniforms
The test case included in r280979 wasn't checking what it was supposed to be
checking for the predicated store case. Fixing the test revealed that the
multi-use case (when a pointer is used by both vectorized and scalarized memory
accesses) wasn't being handled properly. We can't skip over
non-consecutive-like pointers since they may have looked consecutive-like with
a different memory access.

llvm-svn: 280992
2016-09-08 21:38:26 +00:00
Matthew Simpson 408a3abcfe [LV] Don't mark pointers used by scalarized memory accesses uniform
Previously, all consecutive pointers were marked uniform after vectorization.
However, if a consecutive pointer is used by a memory access that is eventually
scalarized, the pointer won't remain uniform after all. An example is
predicated stores. Even though a predicated store may be consecutive, it will
still be scalarized, making it's pointer operand non-uniform.

This patch updates the logic in collectLoopUniforms to consider the cases where
a memory access may be scalarized. If a memory access may be scalarized, its
pointer operand is not marked uniform. The determination of whether a given
memory instruction will be scalarized or not has been moved into a common
function that is used by the vectorizer, cost model, and legality analysis.

Differential Revision: https://reviews.llvm.org/D24271

llvm-svn: 280979
2016-09-08 19:11:07 +00:00