Commit Graph

777 Commits

Author SHA1 Message Date
Adam Nemet b424cd5d61 Make test agnostic to cost model
This was causing bot failures on greendragon

llvm-svn: 326169
2018-02-27 05:41:16 +00:00
Evgeny Stupachenko f1c058d99b Fix r326154 buildbots test fail
Summary:

Add specific mtriples to tests added in r326154.

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 326158
2018-02-27 01:33:11 +00:00
Evgeny Stupachenko a732611ac8 Fix PR36032, PR35432
Summary:

The change fix an assert fail at ScalarEvolutionExpander.cpp:
  assert(ExitCount != SE.getCouldNotCompute() && "Invalid loop count");

Reviewers: sbaranga

Differential Revision: http://reviews.llvm.org/D42604

From: Evgeny Stupachenko <evstupac@gmail.com>
                         <evgeny.v.stupachenko@intel.com>
llvm-svn: 326154
2018-02-27 00:17:31 +00:00
Renato Golin 9d1b2acaaa [LV] Move isLegalMasked* functions from Legality to CostModel
All SIMD architectures can emulate masked load/store/gather/scatter
through element-wise condition check, scalar load/store, and
insert/extract. Therefore, bailing out of vectorization as legality
failure, when they return false, is incorrect. We should proceed to cost
model and determine profitability.

This patch is to address the vectorizer's architectural limitation
described above. As such, I tried to keep the cost model and
vectorize/don't-vectorize behavior nearly unchanged. Cost model tuning
should be done separately.

Please see
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120164.html for
RFC and the discussions.

Closes D43208.

Patch by: Hideki Saito <hideki.saito@intel.com>

llvm-svn: 326079
2018-02-26 11:06:36 +00:00
Alexey Bataev 650f639d33 [LV] Fix test checks, NFC
llvm-svn: 325699
2018-02-21 16:48:23 +00:00
Silviu Baranga 10ad93c6bf [SCEV] Temporarily disable loop versioning for the purpose
of turning SCEVUnknowns of PHIs into AddRecExprs.

This feature is now hidden behind the -scev-version-unknown flag.

Fixes PR36032 and PR35432.

llvm-svn: 325687
2018-02-21 15:20:32 +00:00
Sanjay Patel e6143904b9 revert r325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280)
There are too many perf regressions resulting from this, so we need to 
investigate (and add tests for) targets like ARM and AArch64 before 
trying to reinstate.

llvm-svn: 325658
2018-02-21 01:42:52 +00:00
Alexey Bataev 42bcec7d38 [LV] Fix test checks, NFC.
llvm-svn: 325617
2018-02-20 19:49:25 +00:00
Sanjay Patel 3e8a76abfd [TTI CostModel] change default cost of FP ops to 1 (PR36280)
This change was mentioned at least as far back as:
https://bugs.llvm.org/show_bug.cgi?id=26837#c26
...and I found a real program that is harmed by this: 
Himeno running on AMD Jaguar gets 6% slower with SLP vectorization:
https://bugs.llvm.org/show_bug.cgi?id=36280
...but the change here appears to solve that bug only accidentally.

The div/rem costs for x86 look very wrong in some cases, but that's already true, 
so we can fix those in follow-up patches. There's also evidence that more cost model
changes are needed to solve SLP problems as shown in D42981, but that's an independent 
problem (though the solution may be adjusted after this change is made).

Differential Revision: https://reviews.llvm.org/D43079

llvm-svn: 325515
2018-02-19 16:11:44 +00:00
Mircea Trofin 73b96d6dcf [LV] Fix analyzeInterleaving when -pass-remarks enabled
Summary:
If -pass-remarks=loop-vectorize, atomic ops will be seen by
analyzeInterleaving(), even though canVectorizeMemory() == false. This
is because we are requesting extra analysis instead of bailing out.

In such a case, we end up with a Group in both Load- and StoreGroups,
and then we'll try to access freed memory when traversing LoadGroups after having had released the Group when  iterating over StoreGroups.

The fix is to include mayWriteToMemory() when validating that two
instructions are the same kind of memory operation.

Reviewers: mssimpso, davidxl

Reviewed By: davidxl

Subscribers: hsaito, fhahn, llvm-commits

Differential Revision: https://reviews.llvm.org/D43064

llvm-svn: 324786
2018-02-10 00:07:45 +00:00
Sanjay Patel 42b8c23cc6 [LoopVectorize] auto-generate complete checks; NFC
llvm-svn: 324611
2018-02-08 15:13:47 +00:00
Mircea Trofin 06ac8cfbd1 Verify profile data confirms large loop trip counts.
Summary:
Loops with inequality comparers, such as:

   // unsigned bound
   for (unsigned i = 1; i < bound; ++i) {...}

have getSmallConstantMaxTripCount report a large maximum static
trip count - in this case, 0xffff fffe. However, profiling info
may show that the trip count is much smaller, and thus
counter-recommend vectorization.

This change:
- flips loop-vectorize-with-block-frequency on by default.
- validates profiled loop frequency data supports vectorization,
  when static info appears to not counter-recommend it. Absence
  of profile data means we rely on static data, just as we've
  done so far.

Reviewers: twoh, mkuper, davidxl, tejohnson, Ayal

Reviewed By: davidxl

Subscribers: bkramer, llvm-commits

Differential Revision: https://reviews.llvm.org/D42946

llvm-svn: 324543
2018-02-07 23:29:52 +00:00
Max Kazantsev f7667483c1 [NFC] Add tests for PR35743
llvm-svn: 324209
2018-02-05 08:09:49 +00:00
Chad Rosier a097bc69df [LV] Use Demanded Bits and ValueTracking for reduction type-shrinking
The type-shrinking logic in reduction detection, although narrow in scope, is
also rather ad-hoc, which has led to bugs (e.g., PR35734). This patch modifies
the approach to rely on the demanded bits and value tracking analyses, if
available. We currently perform type-shrinking separately for reductions and
other instructions in the loop. Long-term, we should probably think about
computing minimal bit widths in a more complete way for the loops we want to
vectorize.

PR35734
Differential Revision: https://reviews.llvm.org/D42309

llvm-svn: 324195
2018-02-04 15:42:24 +00:00
Craig Topper 0d797a34d8 [X86] Add support for passing 'prefer-vector-width' function attribute into X86Subtarget and exposing via X86's getRegisterWidth TTI interface.
This will cause the vectorizers to do some limiting of the vector widths they create. This is not a strict limit. There are reasons I know of that the loop vectorizer will generate larger vectors for.

I've written this in such a way that the interface will only return a properly supported width(0/128/256/512) even if the attribute says something funny like 384 or 10.

This has been split from D41895 with the remainder in a follow up commit.

llvm-svn: 323015
2018-01-20 00:26:08 +00:00
Craig Topper 83b0a98902 [X86] Use vmovdqu64/vmovdqa64 for unmasked integer vector stores for consistency with loads.
Previously we used 64 for vXi64 stores and 32 for everything else. This change uses 64 for everything just like do for loads.

llvm-svn: 322820
2018-01-18 07:44:09 +00:00
Andrei Elovikov 7457aa0bce [LV] Don't call recordVectorLoopValueForInductionCast for newly-created IV from a trunc.
Summary:
This method is supposed to be called for IVs that have casts in their use-def
chains that are completely ignored after vectorization under PSE. However, for
truncates of such IVs the same InductionDescriptor is used during
creation/widening of both original IV based on PHINode and new IV based on
TruncInst.

This leads to unintended second call to recordVectorLoopValueForInductionCast
with a VectorLoopVal set to the newly created IV for a trunc and causes an
assert due to attempt to store new information for already existing entry in the
map. This is wrong and should not be done.

Fixes PR35773.

Reviewers: dorit, Ayal, mssimpso

Reviewed By: dorit

Subscribers: RKSimon, dim, dcaballe, hsaito, llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D41913

llvm-svn: 322473
2018-01-15 10:56:07 +00:00
Florian Hahn 467abe3e4f [LV] Remove unnecessary DoExtraAnalysis guard (silent bug)
canVectorize is only checking if the loop has a normalized pre-header if DoExtraAnalysis is true.
This doesn't make sense to me because reporting analysis information shouldn't alter legality
checks. This is probably the result of a last minute minor change before committing (?).

Patch by Diego Caballero.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D40973

llvm-svn: 321172
2017-12-20 13:28:38 +00:00
Hal Finkel 92ea8acbcd Move Transforms/LoopVectorize/consecutive-ptr-cg-bug.ll into the X86 subdirectory
This test depends on X86's TTI; move into the X86 subdirectory.

llvm-svn: 320914
2017-12-16 05:10:20 +00:00
Hal Finkel 5444f40965 [LV] Extend InstWidening with CM_Widen_Recursive
Changes to the original scalar loop during LV code gen cause the return value
of Legal->isConsecutivePtr() to be inconsistent with the return value during
legal/cost phases (further analysis and information of the bug is in D39346).
This patch is an alternative fix to PR34965 following the CM_Widen approach
proposed by Ayal and Gil in D39346. It extends InstWidening enum with
CM_Widen_Reverse to properly record the widening decision for consecutive
reverse memory accesses and, consequently, get rid of the
Legal->isConsetuviePtr() call in LV code gen. I think this is a simpler/cleaner
solution to PR34965 than the one in D39346.

Fixes PR34965.

Patch by Diego Caballero, thanks!

Differential Revision: https://reviews.llvm.org/D40742

llvm-svn: 320913
2017-12-16 02:55:24 +00:00
Dorit Nuzman 4750c785b3 [LV] Support efficient vectorization of an induction with redundant casts
D30041 extended SCEVPredicateRewriter to improve handling of Phi nodes whose
update chain involves casts; PSCEV can now build an AddRecurrence for some
forms of such phi nodes, under the proper runtime overflow test. This means
that we can identify such phi nodes as an induction, and the loop-vectorizer
can now vectorize such inductions, however inefficiently. The vectorizer
doesn't know that it can ignore the casts, and so it vectorizes them.

This patch records the casts in the InductionDescriptor, so that they could
be marked to be ignored for cost calculation (we use VecValuesToIgnore for
that) and ignored for vectorization/widening/scalarization (i.e. treated as
TriviallyDead).

In addition to marking all these casts to be ignored, we also need to make
sure that each cast is mapped to the right vector value in the vector loop body
(be it a widened, vectorized, or scalarized induction). So whenever an
induction phi is mapped to a vector value (during vectorization/widening/
scalarization), we also map the respective cast instruction (if exists) to that
vector value. (If the phi-update sequence of an induction involves more than one
cast, then the above mapping to vector value is relevant only for the last cast
of the sequence as we allow only the "last cast" to be used outside the
induction update chain itself).

This is the last step in addressing PR30654.

llvm-svn: 320672
2017-12-14 07:56:31 +00:00
Dorit Nuzman 927b31600e [LV] Ignore the cost of values that will not appear in the vectorized loop
VecValuesToIgnore holds values that will not appear in the vectorized loop.
We should therefore ignore their cost when VF > 1.

Differential Revision: https://reviews.llvm.org/D40883

llvm-svn: 320463
2017-12-12 08:57:43 +00:00
Dorit Nuzman 5809e70540 [SCEV] Fix wrong Equal predicate created in getAddRecForPhiWithCasts
CreateAddRecFromPHIWithCastsImpl() adds an IncrementNUSW overflow predicate
which allows the PSCEV rewriter to rewrite this scev expression:
 (zext i8 {0, + , (trunc i32 step to i8)} to i32)
into
 {0, +, (sext i8 (trunc i32 step to i8) to i32)}

But then it adds the wrong Equal predicate:
 %step == (zext i8 (trunc i32 %step to i8) to i32).
instead of:
 %step == (sext i8 (trunc i32 %step to i8) to i32)

This is fixed here.

Differential Revision: https://reviews.llvm.org/D40641

llvm-svn: 320298
2017-12-10 11:13:35 +00:00
Adam Nemet a502ee73c4 [LV] Interleaved access vectorization: fix computing new alias info
As a new access is generated spanning across multiple fields, we need to
propagate alias info from all the fields to form the most generic alias info.

rdar://35602528

Differential Revision: https://reviews.llvm.org/D40617

llvm-svn: 319979
2017-12-06 22:42:24 +00:00
Gil Rapaport 8b9d1f3c5b [LV] Model masking in VPlan, introducing VPInstructions
This patch adds a new abstraction layer to VPlan and leverages it to model the planned
instructions that manipulate masks (AND, OR, NOT), introduced during predication.

The new VPValue and VPUser classes model how data flows into, through and out
of a VPlan, forming the vertices of a planned Def-Use graph. The new
VPInstruction class is a generic single-instruction Recipe that models a
planned instruction along with its opcode, operands and users. See
VectorizationPlan.rst for more details.

Differential Revision: https://reviews.llvm.org/D38676

llvm-svn: 318645
2017-11-20 12:01:47 +00:00
Gil Rapaport 848581cadb [LV] Introduce VPBlendRecipe, VPWidenMemoryInstructionRecipe
This patch is part of D38676.

The patch introduces two new Recipes to handle instructions whose vectorization
involves masking. These Recipes take VPlan-level masks in D38676, but still rely
on ILV's existing createEdgeMask(), createBlockInMask() in this patch.

VPBlendRecipe handles intra-loop phi nodes, which are vectorized as a sequence
of SELECTs. Its execute() code is refactored out of ILV::widenPHIInstruction(),
which now handles only loop-header phi nodes.

VPWidenMemoryInstructionRecipe handles load/store which are to be widened
(but are not part of an Interleave Group). In this patch it simply calls
ILV::vectorizeMemoryInstruction on execute().

Differential Revision: https://reviews.llvm.org/D39068

llvm-svn: 318149
2017-11-14 12:09:30 +00:00
Dan Gohman 2c74fe977d Add an @llvm.sideeffect intrinsic
This patch implements Chandler's idea [0] for supporting languages that
require support for infinite loops with side effects, such as Rust, providing
part of a solution to bug 965 [1].

Specifically, it adds an `llvm.sideeffect()` intrinsic, which has no actual
effect, but which appears to optimization passes to have obscure side effects,
such that they don't optimize away loops containing it. It also teaches
several optimization passes to ignore this intrinsic, so that it doesn't
significantly impact optimization in most cases.

As discussed on llvm-dev [2], this patch is the first of two major parts.
The second part, to change LLVM's semantics to have defined behavior
on infinite loops by default, with a function attribute for opting into
potential-undefined-behavior, will be implemented and posted for review in
a separate patch.

[0] http://lists.llvm.org/pipermail/llvm-dev/2015-July/088103.html
[1] https://bugs.llvm.org/show_bug.cgi?id=965
[2] http://lists.llvm.org/pipermail/llvm-dev/2017-October/118632.html

Differential Revision: https://reviews.llvm.org/D38336

llvm-svn: 317729
2017-11-08 21:59:51 +00:00
Dorit Nuzman eb13dd3eac [LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a
single-iteration loop

This fixes PR34681. Avoid adding the "Stride == 1" predicate when we know that
Stride >= Trip-Count. Such a predicate will effectively optimize a single
or zero iteration loop, as Trip-Count <= Stride == 1.

Differential Revision: https://reviews.llvm.org/D38785

llvm-svn: 317438
2017-11-05 16:53:15 +00:00
Sanjay Patel b049173157 [SimplifyCFG] use pass options and remove the latesimplifycfg pass
This is no-functional-change-intended.

This is repackaging the functionality of D30333 (defer switch-to-lookup-tables) and 
D35411 (defer folding unconditional branches) with pass parameters rather than a named
"latesimplifycfg" pass. Now that we have individual options to control the functionality,
we could decouple when these fire (but that's an independent patch if desired). 

The next planned step would be to add another option bit to disable the sinking transform
mentioned in D38566. This should also make it clear that the new pass manager needs to
be updated to limit simplifycfg in the same way as the old pass manager.

Differential Revision: https://reviews.llvm.org/D38631

llvm-svn: 316835
2017-10-28 18:43:07 +00:00
Dehao Chen ed2d5402cb Do not add discriminator encoding for debug intrinsics.
Summary: There are certain requirements for debug location of debug intrinsics, e.g. the scope of the DILocalVariable should be the same as the scope of its debug location. As a result, we should not add discriminator encoding for debug intrinsics.

Reviewers: dblaikie, aprantl

Reviewed By: aprantl

Subscribers: JDevlieghere, aprantl, bjope, sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D39343

llvm-svn: 316703
2017-10-26 21:20:52 +00:00
Ayal Zaks c9e0f886e5 [LV] Fix PR34743 - handle casts that sink after interleaved loads
When ignoring a load that participates in an interleaved group, make sure to
move a cast that needs to sink after it.

Testcase derived from reproducer of PR34743.

Differential Revision: https://reviews.llvm.org/D38338

llvm-svn: 314986
2017-10-05 15:45:14 +00:00
Ayal Zaks fc3f7a4f0c [LV] Fix PR34711 - widen instruction ranges when sinking casts
Instead of trying to keep LastWidenRecipe updated after creating each recipe,
have tryToWiden() retrieve the last recipe of the current VPBasicBlock and check
if it's a VPWidenRecipe when attempting to extend its range. This ensures that
such extensions, optimized to maintain the original instruction order, do so
only when the instructions are to maintain their relative order. The latter does
not always hold, e.g., when a cast needs to sink to unravel first order
recurrence (r306884).

Testcase derived from reproducer of PR34711.

Differential Revision: https://reviews.llvm.org/D38339

llvm-svn: 314981
2017-10-05 12:41:49 +00:00
Matthew Simpson f4bb480b62 [LV] Use correct insertion point when type shrinking reductions
When type shrinking reductions, we should insert the truncations and extends at
the end of the loop latch block. Previously, these instructions were inserted
at the end of the loop header block. The difference is only a problem for loops
with predicated instructions (e.g., conditional stores and instructions that
may divide by zero). For these instructions, we create new basic blocks inside
the vectorized loop, which cause the loop header and latch to no longer be the
same block. This should fix PR34687.

Reference: https://bugs.llvm.org/show_bug.cgi?id=34687
llvm-svn: 314542
2017-09-29 18:07:39 +00:00
Sanjay Patel ecb175608f [InstCombine] remove extract-of-select vector transform (2nd try)
The 1st attempt at this:
https://reviews.llvm.org/rL314117
was reverted at:
https://reviews.llvm.org/rL314118

because of bot fails for clang tests that were checking optimized IR. That should be fixed with:
https://reviews.llvm.org/rL314144
...so try again. 

Original commit message:

The transform to convert an extract-of-a-select-of-vectors was added at:
https://reviews.llvm.org/rL194013

And a question about the validity of this transform was raised in the review:
https://reviews.llvm.org/D1539:
...but not answered AFAICT>

Most of the motivating cases in that patch are now handled by other combines. These are the tests that were added with
the original commit, but they are not regressing even after we remove the transform in this patch.

The diffs we see after removing this transform cause us to avoid increasing the instruction count, so we don't want to do
those transforms as canonicalizations.

The motivation for not turning a vector-select-of-vectors into a scalar operation is shown in PR33301:
https://bugs.llvm.org/show_bug.cgi?id=33301
...in those cases, we'll get vector ops with this patch rather than the vector/scalar mix that we currently see.

Differential Revision: https://reviews.llvm.org/D38006

llvm-svn: 314147
2017-09-25 20:30:53 +00:00
Sanjay Patel aa7f750bec revert r314117 because there are bogus clang tests that depend on the optimizer
llvm-svn: 314118
2017-09-25 17:00:04 +00:00
Sanjay Patel 9639897d77 [InstCombine] remove extract-of-select vector transform
The transform to convert an extract-of-a-select-of-vectors was added at:
rL194013

And a question about the validity of this transform was raised in the review:
https://reviews.llvm.org/D1539:
...but not answered AFAICT>

Most of the motivating cases in that patch are now handled by other combines. These are the tests that were added with
the original commit, but they are not regressing even after we remove the transform in this patch.

The diffs we see after removing this transform cause us to avoid increasing the instruction count, so we don't want to do
those transforms as canonicalizations.

The motivation for not turning a vector-select-of-vectors into a scalar operation is shown in PR33301:
https://bugs.llvm.org/show_bug.cgi?id=33301
...in those cases, we'll get vector ops with this patch rather than the vector/scalar mix that we currently see.

Differential Revision: https://reviews.llvm.org/D38006

llvm-svn: 314117
2017-09-25 16:41:34 +00:00
Alon Kom 682cfc1d4c [LV] Fix maximum legal VF calculation
This patch fixes pr34283, which exposed that the computation of
maximum legal width for vectorization was wrong, because it relied
on MaxInterleaveFactor to obtain the maximum stride used in the loop,
however not all strided accesses in the loop have an interleave-group
associated with them.
Instead of recording the maximum stride in the loop, which can be over
conservative (e.g. if the access with the maximum stride is not involved
in the dependence limitation), this patch tracks the actual maximum legal
width imposed by accesses that are involved in dependencies.

Differential Revision: https://reviews.llvm.org/D37507

llvm-svn: 313237
2017-09-14 07:40:02 +00:00
Anna Thomas 19529f75b9 [LV] Avoid computing the register usage for default VF. NFC
These are changes to reduce redundant computations when calculating a
feasible vectorization factor:
1. early return when target has no vector registers
2. don't compute register usage for the default VF.

Suggested during review for D37702.

llvm-svn: 313176
2017-09-13 19:35:45 +00:00
Ayal Zaks e2a8c0758f [LV] Fix PR34523 - avoid generating redundant selects
When converting a PHI into a series of 'select' instructions to combine the
incoming values together according their edge masks, initialize the first
value to the incoming value In0 of the first predecessor, instead of
generating a redundant assignment 'select(Cond[0], In0, In0)'. The latter
fails when the Cond[0] mask is null, representing a full mask, which can
happen only when there's a single incoming value.

No functional changes intended nor expected other than surviving null Cond[0]'s.

This fix follows D35725, which introduced using null to represent full masks.

Differential Revision: https://reviews.llvm.org/D37619

llvm-svn: 313119
2017-09-13 06:28:37 +00:00
Anna Thomas 9f1be02fa3 [LV] Clamp the VF to the trip count
Summary:
When the MaxVectorSize > ConstantTripCount, we should just clamp the
vectorization factor to be the ConstantTripCount.
This vectorizes loops where the TinyTripCountThreshold >= TripCount < MaxVF.

Earlier we were finding the maximum vector width, which could be greater than
the trip count itself. The Loop vectorizer does all the work for generating a
vectorizable loop, but in the end we would always choose the scalar loop (since
the VF > trip count). This allows us to choose the VF keeping in mind the trip
count if available.

This is a fix on top of rL312472.

Reviewers: Ayal, zvi, hfinkel, dneilson

Reviewed by: Ayal

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37702

llvm-svn: 313046
2017-09-12 16:32:45 +00:00
Zvi Rackover 9a087a357a LoopVectorize: MaxVF should not be larger than the loop trip count
Summary:
Improve how MaxVF is computed while taking into account that MaxVF should not be larger than the loop's trip count.

Other than saving on compile-time by pruning the possible MaxVF candidates, this patch fixes pr34438 which exposed the following flow:
1. Short trip count identified -> Don't bail out, set OptForSize:=True to avoid tail-loop and runtime checks.
2. Compute MaxVF returned 16 on a target supporting AVX512.
3. OptForSize -> choose VF:=MaxVF.
4. Bail out because TripCount = 8, VF = 16, TripCount % VF !=0 means we need a tail loop.

With this patch step 2. will choose MaxVF=8 based on TripCount.

Reviewers: Ayal, dorit, mkuper, hfinkel

Reviewed By: hfinkel

Subscribers: hfinkel, llvm-commits

Differential Revision: https://reviews.llvm.org/D37425

llvm-svn: 312472
2017-09-04 08:35:13 +00:00
Adrian Prantl 05782218ab Canonicalize the representation of empty an expression in DIGlobalVariableExpression
This change simplifies code that has to deal with
DIGlobalVariableExpression and mirrors how we treat DIExpressions in
debug info intrinsics. Before this change there were two ways of
representing empty expressions on globals, a nullptr and an empty
!DIExpression().

If someone needs to upgrade out-of-tree testcases:
  perl -pi -e 's/(!DIGlobalVariableExpression\(var: ![0-9]*)\)/\1, expr: !DIExpression())/g' <MYTEST.ll>
will catch 95%.

llvm-svn: 312144
2017-08-30 18:06:51 +00:00
Craig Topper 516e39cd38 [InstCombine] Teach select01 helper of foldSelectIntoOp to handle vector splats
We were handling some vectors in foldSelectIntoOp, but not if the operand of the bin op was any kind of vector constant. This patch fixes it to treat vector splats the same as scalars.

Differential Revision: https://reviews.llvm.org/D37232

llvm-svn: 311940
2017-08-28 22:00:27 +00:00
Ayal Zaks 1f58dda4e4 [LV] Fix PR34248 - recommit D32871 after revert r311304
Original commit r311077 of D32871 was reverted in r311304 due to failures
reported in PR34248.

This recommit fixes PR34248 by restricting the packing of predicated scalars
into vectors only when vectorizing, avoiding doing so when unrolling w/o
vectorizing. Added a test derived from the reproducer of PR34248.

llvm-svn: 311849
2017-08-27 12:55:46 +00:00
Chandler Carruth bd6dc14230 Revert r311077: [LV] Using VPlan ...
This causes LLVM to assert fail on PPC64 and crash / infloop in other
cases. Filed http://llvm.org/PR34248 with reproducer attached.

llvm-svn: 311304
2017-08-20 23:17:11 +00:00
Elena Demikhovsky f58f838495 Changed basic cost of store operation on X86
Store operation takes 2 UOps on X86 processors. The exact cost calculation affects several optimization passes including loop unroling.
This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks.

Differential Revision: https://reviews.llvm.org/D35888

llvm-svn: 311285
2017-08-20 12:34:29 +00:00
Aditya Kumar a525fffd07 [Loop Vectorize] Added a separate metadata
Added a separate metadata to indicate when the loop
has already been vectorized instead of setting width and count to 1.

Patch written by Divya Shanmughan and Aditya Kumar

Differential Revision: https://reviews.llvm.org/D36220

llvm-svn: 311281
2017-08-20 10:32:41 +00:00
Sam Elliott b0c9753691 Keep Optimization Remark Yaml in NewPM
Summary:
The New Pass Manager infrastructure was forgetting to keep around the optimization remark yaml file that the compiler might have been producing. This meant setting the option to '-' for stdout worked, but setting it to a filename didn't give file output (presumably it was deleted because compilation didn't explicitly keep it). This change just ensures that the file is kept if compilation succeeds.

So far I have updated one of the optimization remark output tests to add a version with the new pass manager. It is my intention for this patch to also include changes to all tests that use `-opt-remark-output=` but I wanted to get the code patch ready for review while I was making all those changes.

Fixes https://bugs.llvm.org/show_bug.cgi?id=33951

Reviewers: anemet, chandlerc

Reviewed By: anemet, chandlerc

Subscribers: javed.absar, chandlerc, fhahn, llvm-commits

Differential Revision: https://reviews.llvm.org/D36906

llvm-svn: 311271
2017-08-20 01:30:45 +00:00
Ayal Zaks 6627883369 [LV] Using VPlan to model the vectorized code and drive its transformation
VPlan is an ongoing effort to refactor and extend the Loop Vectorizer. This
patch introduces the VPlan model into LV and uses it to represent the vectorized
code and drive the generation of vectorized IR.

In this patch VPlan models the vectorized loop body: the vectorized control-flow
is represented using VPlan's Hierarchical CFG, with predication refactored from
being a post-vectorization-step into a vectorization planning step modeling
if-then VPRegionBlocks, and generating code inline with non-predicated code. The
vectorized code within each VPBasicBlock is represented as a sequence of
Recipes, each responsible for modelling and generating a sequence of IR
instructions. To keep the size of this commit manageable the Recipes in this
patch are coarse-grained and capture large chunks of LV's code-generation logic.
The constructed VPlans are dumped in dot format under -debug.

This commit retains current vectorizer output, except for minor instruction
reorderings; see associated modifications to lit tests.

For further details on the VPlan model see docs/Proposals/VectorizationPlan.rst
and its references.

Authors: Gil Rapaport and Ayal Zaks

Differential Revision: https://reviews.llvm.org/D32871

llvm-svn: 311077
2017-08-17 09:29:59 +00:00
Ayal Zaks 25e2800e20 [LV] Minor savings to Sink casts to unravel first order recurrence
Two minor savings: avoid copying the SinkAfter map and avoid moving a cast if it
is not needed.

Differential Revision: https://reviews.llvm.org/D36408

llvm-svn: 310910
2017-08-15 08:32:59 +00:00
Anna Thomas 9b6e12f3dc [LoopVectorize] Fix assertion failure in Fcmp vectorization
Summary:
When vectorizing fcmps we can trip on incorrect cast assertion when setting the
FastMathFlags after generating the vectorized FCmp.
This can happen if the FCmp can be folded to true or false directly. The fix
here is to set the FastMathFlag using the FastMathFlagBuilder *before* creating
the FCmp Instruction. This is what's done by other optimizations such as
InstCombine.
Added a test case which trips on cast assertion without this patch.

Reviewers: Ayal, mssimpso, mkuper, gilr

Reviewed by: Ayal, mssimpso

Subscribers: llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D36244

llvm-svn: 310389
2017-08-08 18:07:44 +00:00
Matt Arsenault e6de494b74 LV: Don't insert runtime ptr checks on divergent targets
llvm-svn: 309890
2017-08-02 21:43:08 +00:00
Ayal Zaks e841b214b1 [LV] Avoid redundant operations manipulating masks
The Loop Vectorizer generates redundant operations when manipulating masks:
AND with true, OR with false, compare equal to true. Instead of relying on
a subsequent pass to clean them up, this patch avoids generating them.

Use null (no-mask) to represent all-one full masks, instead of a constant
all-one vector, following the convention of masked gathers and scatters.

Preparing for a follow-up VPlan patch in which these mask manipulating
operations are modeled using recipes.

Differential Revision: https://reviews.llvm.org/D35725

llvm-svn: 309558
2017-07-31 13:21:42 +00:00
Adrian Prantl abe04759a6 Remove the obsolete offset parameter from @llvm.dbg.value
There is no situation where this rarely-used argument cannot be
substituted with a DIExpression and removing it allows us to simplify
the DWARF backend. Note that this patch does not yet remove any of
the newly dead code.

rdar://problem/33580047
Differential Revision: https://reviews.llvm.org/D35951

llvm-svn: 309426
2017-07-28 20:21:02 +00:00
Michael Zolotukhin cd2255ea6a [tests] Cleanup vect.omp.persistence.ll test.
llvm-svn: 308964
2017-07-25 10:35:16 +00:00
Balaram Makam b05a55787a [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.
Summary:
When simplifying unconditional branches from empty blocks, we pre-test if the
BB belongs to a set of loop headers and keep the block to prevent passes from
destroying canonical loop structure. However, the current algorithm fails if
the destination of the branch is a loop header. Especially when such a loop's
latch block is folded into loop header it results in additional backedges and
LoopSimplify turns it into a nested loop which prevent later optimizations
from being applied (e.g., loop  unrolling and loop interleaving).

This patch augments the existing algorithm by further checking if the
destination of the branch belongs to a set of loop headers and defer
eliminating it if yes to LateSimplifyCFG.

Fixes PR33605: https://bugs.llvm.org/show_bug.cgi?id=33605

Reviewers: efriedma, mcrosier, pacxx, hsung, davidxl

Reviewed By: efriedma

Subscribers: ashutosh.nema, gberry, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D35411

llvm-svn: 308422
2017-07-19 08:53:34 +00:00
Ayal Zaks 8c452d76ed [LV] Test once if vector trip count is zero, instead of twice
Generate a single test to decide if there are enough iterations to jump to the
vectorized loop, or else go to the scalar remainder loop. This test compares the
Scalar Trip Count: if STC < VF * UF go to the scalar loop. If
requiresScalarEpilogue() holds, at-least one iteration must remain scalar; the
rest can be used to form vector iterations. So in this case the test checks
instead if (STC - 1) < VF * UF by comparing STC <= VF * UF, and going to the
scalar loop if so. Otherwise the vector loop is entered for at-least one vector
iteration.

This test covers the case where incrementing the backedge-taken count will
overflow leading to an incorrect trip count of zero. In this (rare) case we will
also avoid the vector loop and jump to the scalar loop.

This patch simplifies the existing tests and effectively removes the basic-block
originally named "min.iters.checked", leaving the single test in block
"vector.ph".

Original observation and initial patch by Evgeny Stupachenko.

Differential Revision: https://reviews.llvm.org/D34150

llvm-svn: 308421
2017-07-19 05:16:39 +00:00
Dorit Nuzman ca4fd18ddc PSCEV] Create AddRec for Phis in cases of possible integer overflow,
using runtime checks

Extend the SCEVPredicateRewriter to work a bit harder when it encounters an
UnknownSCEV for a Phi node; Try to build an AddRecurrence also for Phi nodes
whose update chain involves casts that can be ignored under the proper runtime
overflow test. This is one step towards addressing PR30654.

Differential revision: http://reviews.llvm.org/D30041

llvm-svn: 308299
2017-07-18 11:57:08 +00:00
Michael Kuperstein fdb46b2fb4 [LV] Don't allow outside uses of IVs if the SCEV is predicated on loop conditions.
This fixes PR33706.
Differential Revision: https://reviews.llvm.org/D35227

llvm-svn: 307837
2017-07-12 19:53:55 +00:00
Sanjay Patel 15689aeae9 [LoopVectorize] partly revert r307475
Bots are failing because of the additional checks.

llvm-svn: 307476
2017-07-08 16:34:46 +00:00
Sanjay Patel 28f53ef44d [LoopVectorize] auto-generate complete checks; NFC
I'm looking at a cmp transform in InstCombine that would affect these tests,
but it's hard to know if it makes things better or worse without seeing the
full IR. OTOH, maybe these tests shouldn't be running a bunch of transform
passes in the first place?

llvm-svn: 307475
2017-07-08 16:10:42 +00:00
NAKAMURA Takumi 91a00b974d llvm/test/Transforms/LoopVectorize/X86/slm-no-vectorize.ll: -debug is available in +Asserts.
llvm-svn: 306979
2017-07-02 14:25:27 +00:00
Mohammed Agabaria eb09a810e6 [X86][CM] update add\sub costs of vectors of 64 in X86\SLM arch
this patch updates the cost of addq\subq (add\subtract of vectors of 64bits)
based on the performance numbers of SLM arch.

Differential Revision: https://reviews.llvm.org/D33983

llvm-svn: 306974
2017-07-02 12:16:15 +00:00
Teresa Johnson c12306c0ad Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."
This still breaks PPC tests we have. I'll forward reproduction
instructions to dehao.

llvm-svn: 306936
2017-07-01 03:24:09 +00:00
Teresa Johnson eb4fba9d61 re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306935
2017-07-01 03:24:08 +00:00
Teresa Johnson de56903bde revert r306336 for breaking ppc test.
llvm-svn: 306934
2017-07-01 03:24:07 +00:00
Teresa Johnson 1fbaffeba1 Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306933
2017-07-01 03:24:06 +00:00
Ayal Zaks 2ff59d4350 [LV] Sink casts to unravel first order recurrence
Check if a single cast is preventing handling a first-order-recurrence Phi,
because the scheduling constraints it imposes on the first-order-recurrence
shuffle are infeasible; but they can be made feasible by moving the cast
downwards. Record such casts and move them when vectorizing the loop.

Differential Revision: https://reviews.llvm.org/D33058

llvm-svn: 306884
2017-06-30 21:05:06 +00:00
Ayal Zaks 8d26f0a602 [LV] Optimize for size when vectorizing loops with tiny trip count
It may be detrimental to vectorize loops with very small trip count, as various
costs of the vectorized loop body as well as enclosing overheads including
runtime tests and scalar iterations may outweigh the gains of vectorizing. The
current cost model measures the cost of the vectorized loop body only, expecting
it will amortize other costs, and loops with known or expected very small trip
counts are not vectorized at all. This patch allows loops with very small trip
counts to be vectorized, but under OptForSize constraints, which ensure the cost
of the loop body is dominant, having no runtime guards nor scalar iterations.

Patch inspired by D32451.

Differential Revision: https://reviews.llvm.org/D34373

llvm-svn: 306803
2017-06-30 08:02:35 +00:00
Daniel Jasper 5ce1ce742e Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."
This still breaks PPC tests we have. I'll forward reproduction
instructions to dehao.

llvm-svn: 306792
2017-06-30 06:32:21 +00:00
Ayal Zaks d9bc43ef2a [LV] Fix PR33613 - retain order of insertelement per part
r306381 caused PR33613, by reversing the order in which insertelements were
generated per unroll part. This patch fixes PR33613 by retraining this order,
placing each set of insertelements per part immediately after the last scalar
being packed for this part. Includes a test case derived from PR33613.

Reference: https://bugs.llvm.org/show_bug.cgi?id=33613
Differential Revision: https://reviews.llvm.org/D34760

llvm-svn: 306575
2017-06-28 17:59:33 +00:00
Dehao Chen 920d022519 re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306473
2017-06-27 22:05:58 +00:00
Sanjay Patel 7227276d41 [InstCombine] canonicalize icmp predicate feeding select
This canonicalization was suggested in D33172 as a way to make InstCombine behavior more uniform. 
We have this transform for icmp+br, so unless there's some reason that icmp+select should be 
treated differently, we should do the same thing here.

The benefit comes from increasing the chances of creating identical instructions. This is shown in
the tests in logical-select.ll (PR32791). InstCombine doesn't fold those directly, but EarlyCSE 
can simplify the identical cmps, and then InstCombine can fold the selects together.

The possible regression for the tests in select.ll raises questions about poison/undef:
http://lists.llvm.org/pipermail/llvm-dev/2017-May/113261.html

...but that transform is just as likely to be triggered by this canonicalization as it is to be 
missed, so we're just pointing out a commutation deficiency in the pattern matching:
https://reviews.llvm.org/rL228409

Differential Revision: https://reviews.llvm.org/D34242

llvm-svn: 306435
2017-06-27 17:53:22 +00:00
Dehao Chen 8b7effb344 revert r306336 for breaking ppc test.
llvm-svn: 306344
2017-06-26 23:05:35 +00:00
Dehao Chen 79655792cc Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 306336
2017-06-26 21:41:09 +00:00
Diana Picus b512e91515 Revert "Enable vectorizer-maximize-bandwidth by default."
This reverts commit r305960 because it broke self-hosting on AArch64.

llvm-svn: 305990
2017-06-22 10:00:28 +00:00
Dehao Chen 014db29b89 Enable vectorizer-maximize-bandwidth by default.
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:

spec/2006/fp/C++/444.namd                 26.84  -0.31%
spec/2006/fp/C++/447.dealII               46.19  +0.89%
spec/2006/fp/C++/450.soplex               42.92  -0.44%
spec/2006/fp/C++/453.povray               38.57  -2.25%
spec/2006/fp/C/433.milc                   24.54  -0.76%
spec/2006/fp/C/470.lbm                    41.08  +0.26%
spec/2006/fp/C/482.sphinx3                47.58  -0.99%
spec/2006/int/C++/471.omnetpp             22.06  +1.87%
spec/2006/int/C++/473.astar               22.65  -0.12%
spec/2006/int/C++/483.xalancbmk           33.69  +4.97%
spec/2006/int/C/400.perlbench             33.43  +1.70%
spec/2006/int/C/401.bzip2                 23.02  -0.19%
spec/2006/int/C/403.gcc                   32.57  -0.43%
spec/2006/int/C/429.mcf                   40.35  +0.27%
spec/2006/int/C/445.gobmk                 26.96  +0.06%
spec/2006/int/C/456.hmmer                  24.4  +0.19%
spec/2006/int/C/458.sjeng                 27.91  -0.08%
spec/2006/int/C/462.libquantum            57.47  -0.20%
spec/2006/int/C/464.h264ref               46.52  +1.35%

geometric mean                                   +0.29%

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.

Reviewers: hfinkel, mkuper, davidxl, chandlerc

Reviewed By: chandlerc

Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D33341

llvm-svn: 305960
2017-06-21 22:01:32 +00:00
Matt Arsenault 67cd347e93 AMDGPU: Allow vectorization of packed types
llvm-svn: 305844
2017-06-20 20:38:06 +00:00
Taewook Oh 9083547ae3 Improve profile-guided heuristics to use estimated trip count.
Summary:
Existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if the loop executes frequently, if it has a small trip count per
each invocation, vectorization is not beneficial. On the other hand,
even if the loop invocation frequency is much smaller than the function
invocation frequency, if the trip count is high it is still beneficial
to vectorize the loop.

This patch uses estimated trip count computed from the profile metadata
as a primary metric to determine coldness of the loop. If the estimated
trip count cannot be computed, it falls back to the original heuristics.

Reviewers: Ayal, mssimpso, mkuper, danielcdh, wmi, tejohnson

Reviewed By: tejohnson

Subscribers: tejohnson, mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D32451

llvm-svn: 305729
2017-06-19 18:48:58 +00:00
George Burgess IV a20352e13e [LoopVectorize] Don't preserve nsw/nuw flags on shrunken ops.
If we're shrinking a binary operation, it may be the case that the new
operations wraps where the old didn't. If this happens, the behavior
should be well-defined. So, we can't always carry wrapping flags with us
when we shrink operations.

If we do, we get incorrect optimizations in cases like:

void foo(const unsigned char *from, unsigned char *to, int n) {
  for (int i = 0; i < n; i++)
    to[i] = from[i] - 128;
}

which gets optimized to:

void foo(const unsigned char *from, unsigned char *to, int n) {
  for (int i = 0; i < n; i++)
    to[i] = from[i] | 128;
}

Because:
- InstCombine turned `sub i32 %from.i, 128` into
  `add nuw nsw i32 %from.i, 128`.
- LoopVectorize vectorized the add to be `add nuw nsw <16 x i8>` with a
  vector full of `i8 128`s
- InstCombine took advantage of the fact that the newly-shrunken add
  "couldn't wrap", and changed the `add` to an `or`.

InstCombine seems happy to figure out whether we can add nuw/nsw on its
own, so I just decided to drop the flags. There are already a number of
places in LoopVectorize where we rely on InstCombine to clean up.

llvm-svn: 305053
2017-06-09 03:56:15 +00:00
Matthew Simpson 646475a9bc [LV] Reapply r303763 with fix for PR33193
r303763 caused build failures in some out-of-tree tests due to an assertion in
TTI. The original patch updated cost estimates for induction variable update
instructions marked for scalarization. However, it didn't consider that the
incoming value of an induction variable phi node could be a cast instruction.
This caused queries for cast instruction costs with a mix of vector and scalar
types. This patch includes a fix for cast instructions and the test case from
PR33193.

The fix was suggested by Jonas Paulsson <paulsson@linux.vnet.ibm.com>.

Reference: https://bugs.llvm.org/show_bug.cgi?id=33193
Original Differential Revision: https://reviews.llvm.org/D33457

llvm-svn: 304235
2017-05-30 19:55:57 +00:00
Joerg Sonnenberger 9375a25342 Revert r303763, results in asserts i.e. while building Ruby.
llvm-svn: 304179
2017-05-29 22:52:17 +00:00
Matthew Simpson d6f179cad6 [LV] Update type in cost model for scalarization
For non-uniform instructions marked for scalarization, we should update
`VectorTy` when computing instruction costs to reflect the scalar type. In
addition to determining instruction costs, this type is also used to signal
that all instructions in the loop will be scalarized. This currently affects
memory instructions and non-pointer induction variables and their updates. (We
also mark GEPs scalar after vectorization, but their cost is computed together
with memory instructions.) For scalarized induction updates, this patch also
scales the scalar cost by the vectorization factor, corresponding to each
induction step.

llvm-svn: 303763
2017-05-24 15:26:15 +00:00
Jonas Paulsson 8624b7e1ce [LoopVectorizer] Let target prefer scalar addressing computations.
The loop vectorizer usually vectorizes any instruction it can and then
extracts the elements for a scalarized use. On SystemZ, all elements
containing addresses must be extracted into address registers (GRs). Since
this extraction is not free, it is better to have the address in a suitable
register to begin with. By forcing address arithmetic instructions and loads
of addresses to be scalar after vectorization, two benefits result:

* No need to extract the register
* LSR optimizations trigger (LSR isn't handling vector addresses currently)

Benchmarking show improvements on SystemZ with this new behaviour.

Any other target could try this by returning false in the new hook
prefersVectorizedAddressing().

Review: Renato Golin, Elena Demikhovsky, Ulrich Weigand
https://reviews.llvm.org/D32422

llvm-svn: 303744
2017-05-24 13:42:56 +00:00
Ayal Zaks 589e1d9610 [LV] Report multiple reasons for not vectorizing under allowExtraAnalysis
The default behavior of -Rpass-analysis=loop-vectorizer is to report only the
first reason encountered for not vectorizing, if one is found, at which time the
vectorizer aborts its handling of the loop. This patch allows multiple reasons
for not vectorizing to be identified and reported, at the potential expense of
additional compile-time, under allowExtraAnalysis which can currently be turned
on by Clang's -fsave-optimization-record and opt's -pass-remarks-missed.

Removed from LoopVectorizationLegality::canVectorize() the redundant checking
and reporting if we CantComputeNumberOfIterations, as LAI::canAnalyzeLoop() also
does that. This redundancy is caught by a lit test once multiple reasons are
reported.

Patch initially developed by Dror Barak.

Differential Revision: https://reviews.llvm.org/D33396

llvm-svn: 303613
2017-05-23 07:08:02 +00:00
Amara Emerson c9916d7e97 Re-commit r302678, fixing PR33053.
The issue was that the AArch64 TTI hook allowed unpacked integer cmp reductions
which didn't have a lowering.

llvm-svn: 303211
2017-05-16 21:29:22 +00:00
Matthew Simpson af60af1ed5 Revert 303174, 303176, and 303178
These commits are breaking the bots. Reverting to investigate.

llvm-svn: 303182
2017-05-16 15:50:30 +00:00
Matthew Simpson 62a7fab6b9 Make test target-specific
llvm-svn: 303178
2017-05-16 15:33:22 +00:00
Matthew Simpson c3c92cf2c7 Fix test case to unbreak bots
llvm-svn: 303176
2017-05-16 15:20:27 +00:00
Matthew Simpson b7b5d55c38 [LV] Avoid potentential division by zero when selecting IC
llvm-svn: 303174
2017-05-16 14:43:55 +00:00
Hans Wennborg bd6e9e77a7 Revert r302678 "[AArch64] Enable use of reduction intrinsics."
This caused PR33053.

Original commit message:

> The new experimental reduction intrinsics can now be used, so I'm enabling this
> for AArch64. We will need this for SVE anyway, so it makes sense to do this for
> NEON reductions as well.
>
> The existing code to match shufflevector patterns are replaced with a direct
> lowering of the reductions to AArch64-specific nodes. Tests updated with the
> new, simpler, representation.
>
> Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 303115
2017-05-15 20:59:32 +00:00
Simon Pilgrim 7d62e4b455 [LoopOptimizer][Fix]PR32859, PR24738
The Loop vectorizer pass introduced undef value while it is fixing output of LCSSA form.
Here it is:

before: %e.0.ph = phi i32 [ 0, %for.inc.2.i ]
after: %e.0.ph = phi i32 [ 0, %for.inc.2.i ], [ undef, %middle.block ]

and after this change we have:

%e.0.ph = phi i32 [ 0, %for.inc.2.i ]
%e.0.ph = phi i32 [ 0, %for.inc.2.i ], [ 0, %middle.block ]

Committed on behalf of @dtemirbulatov

Differential Revision: https://reviews.llvm.org/D33055

llvm-svn: 302988
2017-05-13 13:25:57 +00:00
Andrew Kaylor b01e94ee8d [TLI] Add mapping for various '__<func>_finite' forms of the math routines to SVML routines
Patch by Chris Chrulski

Differential Revision: https://reviews.llvm.org/D31789

llvm-svn: 302957
2017-05-12 22:11:26 +00:00
Amara Emerson 816542ceb3 [AArch64] Enable use of reduction intrinsics.
The new experimental reduction intrinsics can now be used, so I'm enabling this
for AArch64. We will need this for SVE anyway, so it makes sense to do this for
NEON reductions as well.

The existing code to match shufflevector patterns are replaced with a direct
lowering of the reductions to AArch64-specific nodes. Tests updated with the
new, simpler, representation.

Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 302678
2017-05-10 15:15:38 +00:00
Anna Thomas 0691483435 [LV] Fix insertion point for shuffle vectors in first order recurrence
Summary:
In first order recurrence vectorization, when the previous value is a phi node, we need to
set the insertion point to the first non-phi node.
We can have the previous value being a phi node, due to the generation of new
IVs as part of trunc optimization [1].

[1] https://reviews.llvm.org/rL294967

Reviewers: mssimpso, mkuper

Subscribers: mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D32969

llvm-svn: 302532
2017-05-09 14:29:33 +00:00
Elad Cohen ef5798acf5 Support arbitrary address space pointers in masked gather/scatter intrinsics.
Fixes PR31789 - When loop-vectorize tries to use these intrinsics for a
non-default address space pointer we fail with a "Calling a function with a
bad singature!" assertion. This patch solves this by adding the 'vector of
pointers' argument as an overloaded type which will determine the address
space.

Differential revision: https://reviews.llvm.org/D31490

llvm-svn: 302018
2017-05-03 12:28:54 +00:00
Matthew Simpson 9eed0bee3d [LV] Handle external uses of floating-point induction variables
Reference: https://bugs.llvm.org/show_bug.cgi?id=32758
Differential Revision: https://reviews.llvm.org/D32445

llvm-svn: 301428
2017-04-26 16:23:02 +00:00
Gil Rapaport 5c875c3d6f [LV] Make LIT test insensitive to basic block numbering
This patch is part of D28975's breakdown.

induction.ll encodes the specific (and rather arbitrary) numbers given to
predicated basic blocks by the unique naming mechanism, which makes it
sensitive to changes in LV's instruction generation order. This patch replaces
those specific numbers with a numeric pattern.

Differential Revision: https://reviews.llvm.org/D32404

llvm-svn: 301345
2017-04-25 18:14:24 +00:00
Matthew Simpson e2037d24f9 [LV] Model if-converted phi node costs
Phi nodes in non-header blocks are converted to select instructions after
if-conversion. This patch updates the cost model to account for the selects.

Differential Revision: https://reviews.llvm.org/D31906

llvm-svn: 300980
2017-04-21 14:14:54 +00:00
Anna Thomas dcdb325fee [LV] Fix the vector code generation for first order recurrence
Summary:
In first order recurrences where phi's are used outside the loop,
we should generate an additional vector.extract of the second last element from
the vectorized phi update.
This is because we require the phi itself (which is the value at the second last
iteration of the vector loop) and not the phi's update within the loop.
Also fix the code gen when we just unroll, but don't vectorize.
Fixes PR32396.

Reviewers: mssimpso, mkuper, anemet

Subscribers: llvm-commits, mzolotukhin

Differential Revision: https://reviews.llvm.org/D31979

llvm-svn: 300238
2017-04-13 18:59:25 +00:00
Renato Golin af3bc2089e [SystemZ] Fix more target specific tests
llvm-svn: 300081
2017-04-12 18:03:09 +00:00
Sanjay Patel 33439f982b [InstCombine] morph an existing instruction instead of creating a new one
One potential way to make InstCombine (very slightly?) faster is to recycle instructions 
when possible instead of creating new ones. It's not explicitly stated AFAIK, but we don't
consider this an "InstSimplify". We could, however, make a new layer to house transforms 
like this if that makes InstCombine more manageable (just throwing out an idea; not sure 
how much opportunity is actually here).

Differential Revision: https://reviews.llvm.org/D31863

llvm-svn: 300067
2017-04-12 15:11:33 +00:00
Jonas Paulsson 592dbea779 [LoopVectorizer] Improve handling of branches during cost estimation.
The cost for a branch after vectorization is very different depending on if
the vectorizer will if-convert the block (branch is eliminated), or if
scalarized and predicated blocks will be produced (branch duplicated before
each block). There is also the case of remaining scalar branches, such as the
back-edge branch.

This patch handles these cases differently with TTI based cost estimates.

Review: Matthew Simpson
https://reviews.llvm.org/D31175

llvm-svn: 300058
2017-04-12 13:13:15 +00:00
Jonas Paulsson da74ed42da [LoopVectorizer, TTI] New method supportsEfficientVectorElementLoadStore()
Since SystemZ supports vector element load/store instructions, there is no
need for extracts/inserts if a vector load/store gets scalarized.

This patch lets Target specify that it supports such instructions by means of
a new TTI hook that defaults to false.

The use for this is in the LoopVectorizer getScalarizationOverhead() method,
which will with this patch produce a smaller sum for a vector load/store on
SystemZ.

New test: test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll

Review: Adam Nemet
https://reviews.llvm.org/D30680

llvm-svn: 300056
2017-04-12 12:41:37 +00:00
Jonas Paulsson fccc7d66c3 [SystemZ] TargetTransformInfo cost functions implemented.
getArithmeticInstrCost(), getShuffleCost(), getCastInstrCost(),
getCmpSelInstrCost(), getVectorInstrCost(), getMemoryOpCost(),
getInterleavedMemoryOpCost() implemented.

Interleaved access vectorization enabled.

BasicTTIImpl::getCastInstrCost() improved to check for legal extending loads,
in which case the cost of the z/sext instruction becomes 0.

Review: Ulrich Weigand, Renato Golin.
https://reviews.llvm.org/D29631

llvm-svn: 300052
2017-04-12 11:49:08 +00:00
Anna Thomas 00dc1b74b7 [LV] Avoid vectorizing first order recurrence when phi uses are outside loop
In the vectorization of first order recurrence, we vectorize such
that the last element in the vector will be the one extracted to pass into the
scalar remainder loop. However, this is not true when there is a phi (other
than the primary induction variable) is used outside the loop.
In such a case, we need the value from the second last iteration (i.e.
the phi value), not the last iteration (which would be the phi update).
I've added a test case for this. Also see PR32396.

A follow up patch would generate the correct code gen for such cases,
and turn this vectorization on.

Differential Revision: https://reviews.llvm.org/D31910

Reviewers: mssimpso
llvm-svn: 299985
2017-04-11 21:02:00 +00:00
Anna Thomas 98cbb067ce [LV] Move first order recurrence test to common folder. NFC
llvm-svn: 299969
2017-04-11 18:31:42 +00:00
Matt Arsenault f10061ec70 Add address space mangling to lifetime intrinsics
In preparation for allowing allocas to have non-0 addrspace.

llvm-svn: 299876
2017-04-10 20:18:21 +00:00
Matthew Simpson 11fe2e9f2b Reapply r298620: [LV] Vectorize GEPs
This patch reapplies r298620. The original patch was reverted because of two
issues. First, the patch exposed a bug in InstCombine that caused the Chromium
builds to fail (PR32414). This issue was fixed in r299017. Second, the patch
introduced a bug in the vectorizer's scalars analysis that caused test suite
builds to fail on SystemZ. The scalars analysis was too aggressive and marked a
memory instruction scalar, even though it was going to be vectorized. This
issue has been fixed in the current patch and several new test cases for the
scalars analysis have been added.

llvm-svn: 299770
2017-04-07 14:15:34 +00:00
Matthew Simpson 1a4d5c9860 [LV] Make test case more robust
This test case depends on the loop being vectorized without forcing the
vectorization factor. If the profitability ever changes in the future (due to
cost model improvements), the test may no longer work as intended. Instead of
checking the resulting IR, we should just check the instruction costs. The
costs will be computed regardless if vectorization is profitable.

llvm-svn: 299545
2017-04-05 14:34:13 +00:00
Matthew Simpson b8ff4a4a70 [LV] Transform truncations of non-primary induction variables
The vectorizer tries to replace truncations of induction variables with new
induction variables having the smaller type. After r295063, this optimization
was applied to all integer induction variables, including non-primary ones.
When optimizing the truncation of a non-primary induction variable, we still
need to transform the new induction so that it has the correct start value.
This should fix PR32419.

Reference: https://bugs.llvm.org/show_bug.cgi?id=32419
llvm-svn: 298882
2017-03-27 20:07:38 +00:00
Ivan Krasin c2124e185c Revert r298620: [LV] Vectorize GEPs
Reason: breaks linking Chromium with LLD + ThinLTO (a pass crashes)
LLVM bug: https://bugs.llvm.org//show_bug.cgi?id=32413

Original change description:

[LV] Vectorize GEPs

This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.

With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.

The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.

Original Differential Revision: https://reviews.llvm.org/D30710

llvm-svn: 298735
2017-03-24 20:49:43 +00:00
Gil Rapaport 638d4538cd [LV] Add regression test for r297610
The new test asserts that scalarized memory operations get memcheck metadata
added even if the loop is only unrolled.

Differential Revision: https://reviews.llvm.org/D30972

llvm-svn: 298641
2017-03-23 20:02:23 +00:00
Matthew Simpson 4e7b71bc86 [LV] Vectorize GEPs
This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.

With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.

The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.

Differential Revision: https://reviews.llvm.org/D30710

llvm-svn: 298620
2017-03-23 16:29:58 +00:00
Matthew Simpson 1fb4064531 [LV] Delete unneeded scalar GEP creation code
The code for generating scalar base pointers in vectorizeMemoryInstruction is
not needed. We currently scalarize all GEPs and maintain the scalarized values
in VectorLoopValueMap. The GEP cloning in this unneeded code is the same as
that in scalarizeInstruction. The test cases that changed as a result of this
patch changed because we were able to reuse the scalarized GEP that we
previously generated instead of cloning a new one.

Differential Revision: https://reviews.llvm.org/D30587

llvm-svn: 298615
2017-03-23 16:07:21 +00:00
Matt Arsenault 3dbeefa978 AMDGPU: Mark all unspecified CC functions in tests as amdgpu_kernel
Currently the default C calling convention functions are treated
the same as compute kernels. Make this explicit so the default
calling convention can be changed to a non-kernel.

Converted with perl -pi -e 's/define void/define amdgpu_kernel void/'
on the relevant test directories (and undoing in one place that actually
wanted a non-kernel).

llvm-svn: 298444
2017-03-21 21:39:51 +00:00
Jonas Paulsson a48ea231c0 [TargetTransformInfo] getIntrinsicInstrCost() scalarization estimation improved
getIntrinsicInstrCost() used to only compute scalarization cost based on types.
This patch improves this so that the actual arguments are checked when they are
available, in order to handle only unique non-constant operands.

Tests updates:

Analysis/CostModel/X86/arith-fp.ll
Transforms/LoopVectorize/AArch64/interleaved_cost.ll
Transforms/LoopVectorize/ARM/interleaved_cost.ll

The improvement in getOperandsScalarizationOverhead() to differentiate on
constants made it necessary to update the interleaved_cost.ll tests even
though they do not relate to intrinsics.

Review: Hal Finkel
https://reviews.llvm.org/D29540

llvm-svn: 297705
2017-03-14 06:35:36 +00:00
Changpeng Fang 1be9b9f816 AMDGPU/SI: Disable unrolling in the loop vectorizer if the loop is not vectorized.
Reviewers:
  arsenm

Differential Revision:
  http://reviews.llvm.org/D30719

llvm-svn: 297328
2017-03-09 00:07:00 +00:00
Matthew Simpson 3388de1349 [LV] Select legal insert point when fixing first-order recurrences
Because IRBuilder performs constant-folding, it's not guaranteed that an
instruction in the original loop map to an instruction in the vector loop. It
could map to a constant vector instead. The handling of first-order recurrences
was incorrectly making this assumption when setting the IRBuilder's insert
point.

llvm-svn: 297302
2017-03-08 18:18:20 +00:00
Matthew Simpson 8966848d17 [LV] Make the test case for PR30183 less fragile
This patch also renames the PR number the test points to. The previous
reference was PR29559, but that bug was somehow deleted and recreated under
PR30183.

llvm-svn: 297295
2017-03-08 17:03:38 +00:00
Matthew Simpson 903dd5aa9b [LV] Add missing check labels to tests and reformat
llvm-svn: 297294
2017-03-08 16:55:34 +00:00
Matthew Simpson c86b2134c7 [LV] Consider users that are memory accesses in uniforms expansion step
When expanding the set of uniform instructions beyond the seed instructions
(e.g., consecutive pointers), we mark a new instruction uniform if all its
loop-varying users are uniform. We should also allow users that are consecutive
or interleaved memory accesses. This fixes cases where we have an instruction
that is used as the pointer operand of a consecutive access but also used by a
non-memory instruction that later becomes uniform as part of the expansion.

llvm-svn: 297179
2017-03-07 18:47:30 +00:00
Matthew Simpson aee9771ae2 [ARM/AArch64] Update costs for interleaved accesses with wide types
After r296750, we're able to match interleaved accesses having types wider than
128 bits. This patch updates the associated TTI costs.

Differential Revision: https://reviews.llvm.org/D29675

llvm-svn: 296751
2017-03-02 15:15:35 +00:00
Matthew Simpson 455c2ee394 [LV] Considier non-consecutive but vectorizable accesses for VF selection
When computing the smallest and largest types for selecting the maximum
vectorization factor, we currently ignore loads and stores of pointer types if
the memory access is non-consecutive. We do this because such accesses must be
scalarized regardless of vectorization factor, and thus shouldn't be considered
when determining the factor. This patch makes this check less aggressive by
also considering non-consecutive accesses that may be vectorized, such as
interleaved accesses. Because we don't know at the time of the check if an
accesses will certainly be vectorized (this is a cost model decision given a
particular VF), we consider all accesses that can potentially be vectorized.

Differential Revision: https://reviews.llvm.org/D30305

llvm-svn: 296747
2017-03-02 13:55:05 +00:00
Adam Nemet 15032a0455 [LV] These remark should have been missed remarks
The practice in LV is that we emit analysis remarks and then finally report
either a missed or applied remark on the final decision whether vectorization
is taking place.  On this code path, we were closing with an analysis remark.

llvm-svn: 296578
2017-03-01 04:31:15 +00:00
Craig Topper fe25988c68 [AVX-512] Fix the execution domain for AVX-512 integer broadcasts.
llvm-svn: 296290
2017-02-26 06:45:51 +00:00
Matthew Simpson bdc9c78880 [LV] Merge floating-point and integer induction widening code
This patch merges the existing floating-point induction variable widening code
into the integer induction variable widening code, creating a single set of
functions for both kinds of inductions. The primary motivation for doing this
is to enable vector phi node creation for floating-point induction variables.

Differential Revision: https://reviews.llvm.org/D30211

llvm-svn: 296145
2017-02-24 18:20:12 +00:00
Matthew Simpson 835b246d7f [LV] Update floating-point induction test checks (NFC)
llvm-svn: 295885
2017-02-22 21:56:02 +00:00
Matthew Simpson 5ef66ef248 [LV] Add scalar floating-point induction test (NFC)
llvm-svn: 295862
2017-02-22 19:09:38 +00:00
Karl-Johan Karlsson 6eaed7aceb [LoopVectorize] Added address space check when analysing interleaved accesses
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.

This is fixing pr31900.

Reviewers: HaoLiu, mssimpso, mkuper

Reviewed By: mssimpso, mkuper

Subscribers: llvm-commits, efriedma, mzolotukhin

Differential Revision: https://reviews.llvm.org/D29717

This reverts r295042 (re-applies r295038) with an additional fix for the
buildbot problem.

llvm-svn: 295858
2017-02-22 18:37:36 +00:00
Dehao Chen 7d230325ef Increases full-unroll threshold.
Summary:
The default threshold for fully unroll is too conservative. This patch doubles the full-unroll threshold

This change will affect the following speccpu2006 benchmarks (performance numbers were collected from Intel Sandybridge):

Performance:

403	0.11%
433	0.51%
445	0.48%
447	3.50%
453	1.49%
464	0.75%

Code size:

403	0.56%
433	0.96%
445	2.16%
447	2.96%
453	0.94%
464	8.02%

The compiler time overhead is similar with code size.

Reviewers: davidxl, mkuper, mzolotukhin, hfinkel, chandlerc

Reviewed By: hfinkel, chandlerc

Subscribers: mehdi_amini, zzheng, efriedma, haicheng, hfinkel, llvm-commits

Differential Revision: https://reviews.llvm.org/D28368

llvm-svn: 295538
2017-02-18 03:46:51 +00:00
Matthew Simpson f68e183f91 [LV] Remove constant restriction for vector phi creation
We previously only created a vector phi node for an induction variable if its
step had a constant integer type. However, the step actually only needs to be
loop-invariant. We only handle inductions having loop-invariant steps, so this
patch should enable vector phi node creation for all integer induction
variables that will be vectorized.

Differential Revision: https://reviews.llvm.org/D29956

llvm-svn: 295456
2017-02-17 16:09:07 +00:00
Matthew Simpson f09d13e5cc Reapply "[LV] Extend trunc optimization to all IVs with constant integer steps"
This reapplies commit r294967 with a fix for the execution time regressions
caught by the clang-cmake-aarch64-quick bot. We now extend the truncate
optimization to non-primary induction variables only if the truncate isn't
already free.

Differential Revision: https://reviews.llvm.org/D29847

llvm-svn: 295063
2017-02-14 16:28:32 +00:00
Karl-Johan Karlsson ec21b769ec Revert "[LoopVectorize] Added address space check when analysing interleaved accesses"
This reverts r295038. The buildbot clang-with-thin-lto-ubuntu failed.
I'm reverting to investigate.

llvm-svn: 295042
2017-02-14 10:06:16 +00:00
Karl-Johan Karlsson 2ec409cca2 [LoopVectorize] Added address space check when analysing interleaved accesses
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.

This is fixing pr31900.

Reviewers: HaoLiu, mssimpso, mkuper

Reviewed By: mssimpso, mkuper

Subscribers: llvm-commits, efriedma, mzolotukhin

Differential Revision: https://reviews.llvm.org/D29717

llvm-svn: 295038
2017-02-14 08:14:06 +00:00
Matthew Simpson 659f92e2aa Revert "[LV] Extend trunc optimization to all IVs with constant integer steps"
This reverts commit r294967. This patch caused execution time slowdowns in a
few LLVM test-suite tests, as reported by the clang-cmake-aarch64-quick bot.
I'm reverting to investigate.

llvm-svn: 294973
2017-02-13 18:02:35 +00:00
Matthew Simpson 7b7f40297f [LV] Extend trunc optimization to all IVs with constant integer steps
This patch extends the optimization of truncations whose operand is an
induction variable with a constant integer step. Previously we were only
applying this optimization to the primary induction variable. However, the cost
model assumes the optimization is applied to the truncation of all integer
induction variables (even regardless of step type). The transformation is now
applied to the other induction variables, and I've updated the cost model to
ensure it is better in sync with the transformation we actually perform.

Differential Revision: https://reviews.llvm.org/D29847

llvm-svn: 294967
2017-02-13 16:48:00 +00:00
Dorit Nuzman eac89d736c [LV/LoopAccess] Check statically if an unknown dependence distance can be
proven larger than the loop-count

This fixes PR31098: Try to resolve statically data-dependences whose
compile-time-unknown distance can be proven larger than the loop-count, 
instead of resorting to runtime dependence checking (which are not always 
possible).

For vectorization it is sufficient to prove that the dependence distance 
is >= VF; But in some cases we can prune unknown dependence distances early,
and even before selecting the VF, and without a runtime test, by comparing 
the distance against the loop iteration count. Since the vectorized code 
will be executed only if LoopCount >= VF, proving distance >= LoopCount 
also guarantees that distance >= VF. This check is also equivalent to the 
Strong SIV Test.

Reviewers: mkuper, anemet, sanjoy

Differential Revision: https://reviews.llvm.org/D28044

llvm-svn: 294892
2017-02-12 09:32:53 +00:00
Ahmed Bougacha 8425f453ef [ARM] Make f16 interleaved accesses expensive.
There are no vldN/vstN f16 variants, even with +fullfp16.
We could use the i16 variants, but, in practice, even with +fullfp16,
the f16 sequence leading to the i16 shuffle usually gets scalarized.
We'd need to improve our support for f16 codegen before getting there.

Teach the cost model to consider f16 interleaved operations as
expensive.  Otherwise, we are all but guaranteed to end up with
a large block of scalarized vector code.

llvm-svn: 294819
2017-02-11 01:53:04 +00:00
Dehao Chen fb02f7140a Encode duplication factor from loop vectorization and loop unrolling to discriminator.
Summary:
This patch starts the implementation as discuss in the following RFC: http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html

When optimization duplicates code that will scale down the execution count of a basic block, we will record the duplication factor as part of discriminator so that the offline process tool can find the duplication factor and collect the accurate execution frequency of the corresponding source code. Two important optimization that fall into this category is loop vectorization and loop unroll. This patch records the duplication factor for these 2 optimizations.

The recording will be guarded by a flag encode-duplication-in-discriminators, which is off by default.

Reviewers: probinson, aprantl, davidxl, hfinkel, echristo

Reviewed By: hfinkel

Subscribers: mehdi_amini, anemet, mzolotukhin, llvm-commits

Differential Revision: https://reviews.llvm.org/D26420

llvm-svn: 294782
2017-02-10 21:09:07 +00:00
Matthew Simpson df124a7569 [LV] Remove type restriction for vector phi creation
We previously only created a vector phi node for an induction variable if its
type matched the type of the canonical induction variable.

Differential Revision: https://reviews.llvm.org/D29776

llvm-svn: 294755
2017-02-10 16:15:26 +00:00
Elena Demikhovsky 5267edd3e3 [Loop Vectorizer] Cost-based decision for vectorization form of memory instruction.
Making the cost model selecting between Interleave, GatherScatter or Scalar vectorization form of memory instruction.
The right decision should be done for non-consecutive memory access instrcuctions that may have more than one vectorization solution.

This patch includes the following changes:
- Cost Model calculates the cost of Load/Store vector form and choose the better option between Widening, Interleave, GatherScactter and Scalarization. Cost Model keeps the widening decision.
- Arrays of Uniform and Scalar values are moved from Legality to Cost Model.
- Cost Model collects Uniforms and Scalars per VF. The collection is based on CM decision map of Loadis/Stores vectorization form.
- Vectorization of memory instruction is performed according to the CM decision.

Differential Revision: https://reviews.llvm.org/D27919

llvm-svn: 294503
2017-02-08 19:25:23 +00:00
Craig Topper e0ac7f3beb [X86] Remove PCOMMIT instruction support since Intel has deprecated this instruction with no plans to release products with it.
Intel's documentation for the deprecation https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

llvm-svn: 294405
2017-02-08 05:45:39 +00:00
Matthew Simpson 3877f397cd [LV] Add new ARM/AArch64 interleaved access cost model tests (NFC)
llvm-svn: 294342
2017-02-07 19:34:24 +00:00
Matthew Simpson 1cd02f13a5 [LV] Simplify ARM/AArch64 interleaved access cost model tests (NFC)
This patch removes unneeded instructions from the existing ARM/AArch64
interleaved access cost model tests. I'll be adding a similar set of tests in a
follow-on patch to increase coverage.

llvm-svn: 294336
2017-02-07 19:17:44 +00:00
Adam Nemet 0bf1b863b9 [LV] Also port failure remarks to new OptimizationRemarkEmitter API
llvm-svn: 293866
2017-02-02 05:41:51 +00:00
Chandler Carruth 6f4ed077d0 [LV] Fix an issue where forming LCSSA in the place that we did would
change the set of uniform instructions in the loop causing an assert
failure.

The problem is that the legalization checking also builds data
structures mapping various facts about the loop body. The immediate
cause was the set of uniform instructions. If these then change when
LCSSA is formed, the data structures would already have been built and
become stale. The included test case triggered an assert in loop
vectorize that was reduced out of the new PM's pipeline.

The solution is to form LCSSA early enough that no information is cached
across the changes made. The only really obvious position is outside of
the main logic to vectorize the loop. This also has the advantage of
removing one case where forming LCSSA could mutate the loop but we
wouldn't track that as a "Changed" state.

If it is significantly advantageous to do some legalization checking
prior to this, we can do a more careful positioning but it seemed best
to just back off to a safe position first.

llvm-svn: 293168
2017-01-26 10:41:09 +00:00
Mohammed Agabaria 20caee95e1 [X86] enable memory interleaving for X86\SLM arch.
Differential Revision: https://reviews.llvm.org/D28547

llvm-svn: 293040
2017-01-25 09:14:48 +00:00
Michael Kuperstein 230867e583 [LV] Run loop-simplify and LCSSA explicitly instead of "requiring" them
This changes the vectorizer to explicitly use the loopsimplify and lcssa utils,
instead of "requiring" the transformations as if they were analyses.

This is not NFC, since it changes the LCSSA behavior - we no longer run LCSSA
for all loops, but rather only for the loops we expect to modify.

Differential Revision: https://reviews.llvm.org/D28868

llvm-svn: 292456
2017-01-19 00:42:28 +00:00
Michael Kuperstein 7cefb409b0 [LV] Allow reductions that have several uses outside the loop
We currently check whether a reduction has a single outside user. We don't
really need to require that - we just need to make sure a single value is
used externally. The number of external users of that value shouldn't actually
matter.

Differential Revision: https://reviews.llvm.org/D28830

llvm-svn: 292424
2017-01-18 19:02:52 +00:00