Address calculation for gather/scather in vectorized code can incur a
significant cost making vectorization unbeneficial. Add infrastructure to add
cost.
Tests and cost model for targets will be in follow-up commits.
radar://14351991
llvm-svn: 186187
Before we could vectorize PHINodes scanning successors was a good way of finding candidates. Now we can vectorize the phinodes which is simpler.
llvm-svn: 186139
We can vectorize them because in the case where we wrap in the address space the
unvectorized code would have had to access a pointer value of zero which is
undefined behavior in address space zero according to the LLVM IR semantics.
(Thank you Duncan, for pointing this out to me).
Fixes PR16592.
llvm-svn: 186088
Commit 185883 fixes a bug in the IRBuilder that should fix the ASan bot. AssertingVH can help in exposing some RAUW problems.
Thanks Ben and Alexey!
llvm-svn: 185886
This is a complete re-write if the bottom-up vectorization class.
Before this commit we scanned the instruction tree 3 times. First in search of merge points for the trees. Second, for estimating the cost. And finally for vectorization.
There was a lot of code duplication and adding the DCE exposed bugs. The new design is simpler and DCE was a part of the design.
In this implementation we build the tree once. After that we estimate the cost by scanning the different entries in the constructed tree (in any order). The vectorization phase also works on the built tree.
llvm-svn: 185774
Math functions are mark as readonly because they read the floating point
rounding mode. Because we don't vectorize loops that would contain function
calls that set the rounding mode it is safe to ignore this memory read.
llvm-svn: 185299
To support this we have to insert 'extractelement' instructions to pick the right lane.
We had this functionality before but I removed it when we moved to the multi-block design because it was too complicated.
llvm-svn: 185230
In this code we keep track of pointers that we are allowed to read from, if they are accessed by non-predicated blocks.
We use this list to allow vectorization of conditional loads in predicated blocks because we know that these addresses don't segfault.
llvm-svn: 185214
I used the class to safely reset the state of the builder's debug location. I
think I have caught all places where we need to set the debug location to a new
one. Therefore, we can replace the class by a function that just sets the debug
location.
llvm-svn: 185165
When we store values for reversed induction stores we must not store the
reversed value in the vectorized value map. Another instruction might use this
value.
This fixes 3 test cases of PR16455.
llvm-svn: 185051
This should hopefully have fixed the stage2/stage3 miscompare on the dragonegg
testers.
"LoopVectorize: Use the dependence test utility class
We now no longer need alias analysis - the cases that alias analysis would
handle are now handled as accesses with a large dependence distance.
We can now vectorize loops with simple constant dependence distances.
for (i = 8; i < 256; ++i) {
a[i] = a[i+4] * a[i+8];
}
for (i = 8; i < 256; ++i) {
a[i] = a[i-4] * a[i-8];
}
We would be able to vectorize about 200 more loops (in many cases the cost model
instructs us no to) in the test suite now. Results on x86-64 are a wash.
I have seen one degradation in ammp. Interestingly, the function in which we
now vectorize a loop is never executed so we probably see some instruction
cache effects. There is a 2% improvement in h264ref. There is one or the other
TSCV loop kernel that speeds up.
radar://13681598"
llvm-svn: 184724
We now no longer need alias analysis - the cases that alias analysis would
handle are now handled as accesses with a large dependence distance.
We can now vectorize loops with simple constant dependence distances.
for (i = 8; i < 256; ++i) {
a[i] = a[i+4] * a[i+8];
}
for (i = 8; i < 256; ++i) {
a[i] = a[i-4] * a[i-8];
}
We would be able to vectorize about 200 more loops (in many cases the cost model
instructs us no to) in the test suite now. Results on x86-64 are a wash.
I have seen one degradation in ammp. Interestingly, the function in which we
now vectorize a loop is never executed so we probably see some instruction
cache effects. There is a 2% improvement in h264ref. There is one or the other
TSCV loop kernel that speeds up.
radar://13681598
llvm-svn: 184685
This class checks dependences by subtracting two Scalar Evolution access
functions allowing us to catch very simple linear dependences.
The checker assumes source order in determining whether vectorization is safe.
We currently don't reorder accesses.
Positive true dependencies need to be a multiple of VF otherwise we impede
store-load forwarding.
llvm-svn: 184684
Sets of dependent accesses are built by unioning sets based on underlying
objects. This class will be used by the upcoming dependence checker.
llvm-svn: 184683
Untill now we detected the vectorizable tree and evaluated the cost of the
entire tree. With this patch we can decide to trim-out branches of the tree
that are not profitable to vectorizer.
Also, increase the max depth from 6 to 12. In the worse possible case where all
of the code is made of diamond-shaped graph this can bring the cost to 2**10,
but diamonds are not very common.
llvm-svn: 184681
Rewrote the SLP-vectorization as a whole-function vectorization pass. It is now able to vectorize chains across multiple basic blocks.
It still does not vectorize PHIs, but this should be easy to do now that we scan the entire function.
I removed the support for extracting values from trees.
We are now able to vectorize more programs, but there are some serious regressions in many workloads (such as flops-6 and mandel-2).
llvm-svn: 184647
We collect gather sequences when we vectorize basic blocks. Gather sequences are excellent
hints for vectorization of other basic blocks.
llvm-svn: 184444
The type <3 x i8> is a common in graphics and we want to be able to vectorize it.
This changes accelerates bullet by 12% and 471_omnetpp by 5%.
llvm-svn: 184317
Use ScalarEvolution's getBackedgeTakenCount API instead of getExitCount since
that is really what we want to know. Using the more specific getExitCount was
safe because we made sure that there is only one exiting block.
No functionality change.
llvm-svn: 183047
We check that instructions in the loop don't have outside users (except if
they are reduction values). Unfortunately, we skipped this check for
if-convertable PHIs.
Fixes PR16184.
llvm-svn: 183035
- llvm.loop.parallel metadata has been renamed to llvm.loop to be more generic
by making the root of additional loop metadata.
- Loop::isAnnotatedParallel now looks for llvm.loop and associated
llvm.mem.parallel_loop_access
- document llvm.loop and update llvm.mem.parallel_loop_access
- add support for llvm.vectorizer.width and llvm.vectorizer.unroll
- document llvm.vectorizer.* metadata
- add utility class LoopVectorizerHints for getting/setting loop metadata
- use llvm.vectorizer.width=1 to indicate already vectorized instead of
already_vectorized
- update existing tests that used llvm.loop.parallel and
llvm.vectorizer.already_vectorized
Reviewed by: Nadav Rotem
llvm-svn: 182802
We are not working on a DAG and I ran into a number of problems when I enabled the vectorizations of 'diamond-trees' (trees that share leafs).
* Imroved the numbering API.
* Changed the placement of new instructions to the last root.
* Fixed a bug with external tree users with non-zero lane.
* Fixed a bug in the placement of in-tree users.
llvm-svn: 182508
The Value pointers we store in the induction variable list can be RAUW'ed by a
call to SCEVExpander::expandCodeFor, use a TrackingVH instead. Do the same thing
in some other places where we store pointers that could potentially be RAUW'ed.
Fixes PR16073.
llvm-svn: 182485
We only want to check this once, not for every conditional block in the loop.
No functionality change (except that we don't perform a check redudantly
anymore).
llvm-svn: 181942
InstCombine can be uncooperative to vectorization and sink loads into
conditional blocks. This prevents vectorization.
Undo this optimization if there are unconditional memory accesses to the same
addresses in the loop.
radar://13815763
llvm-svn: 181860
We used to give up if we saw two integer inductions. After this patch, we base
further induction variables on the chosen one like we do in the reverse
induction and pointer induction case.
Fixes PR15720.
radar://13851975
llvm-svn: 181746
The external user does not have to be in lane #0. We have to save the lane for each scalar so that we know which vector lane to extract.
llvm-svn: 181674
Use the widest induction type encountered for the cannonical induction variable.
We used to turn the following loop into an empty loop because we used i8 as
induction variable type and truncated 1024 to 0 as trip count.
int a[1024];
void fail() {
int reverse_induction = 1023;
unsigned char forward_induction = 0;
while ((reverse_induction) >= 0) {
forward_induction++;
a[reverse_induction] = forward_induction;
--reverse_induction;
}
}
radar://13862901
llvm-svn: 181667
A computable loop exit count does not imply the presence of an induction
variable. Scalar evolution can return a value for an infinite loop.
Fixes PR15926.
llvm-svn: 181495
The two nested loops were confusing and also conservative in identifying
reduction variables. This patch replaces them by a worklist based approach.
llvm-svn: 181369
We were passing an i32 to ConstantInt::get where an i64 was needed and we must
also pass the sign if we pass negatives numbers. The start index passed to
getConsecutiveVector must also be signed.
Should fix PR15882.
llvm-svn: 181286
Add support for min/max reductions when "no-nans-float-math" is enabled. This
allows us to assume we have ordered floating point math and treat ordered and
unordered predicates equally.
radar://13723044
llvm-svn: 181144
By supporting the vectorization of PHINodes with more than two incoming values we can increase the complexity of nested if statements.
We can now vectorize this loop:
int foo(int *A, int *B, int n) {
for (int i=0; i < n; i++) {
int x = 9;
if (A[i] > B[i]) {
if (A[i] > 19) {
x = 3;
} else if (B[i] < 4 ) {
x = 4;
} else {
x = 5;
}
}
A[i] = x;
}
}
llvm-svn: 181037
the things, and renames it to CBindingWrapping.h. I also moved
CBindingWrapping.h into Support/.
This new file just contains the macros for defining different wrap/unwrap
methods.
The calls to those macros, as well as any custom wrap/unwrap definitions
(like for array of Values for example), are put into corresponding C++
headers.
Doing this required some #include surgery, since some .cpp files relied
on the fact that including Wrap.h implicitly caused the inclusion of a
bunch of other things.
This also now means that the C++ headers will include their corresponding
C API headers; for example Value.h must include llvm-c/Core.h. I think
this is harmless, since the C API headers contain just external function
declarations and some C types, so I don't believe there should be any
nasty dependency issues here.
llvm-svn: 180881
This patch disables memory-instruction vectorization for types that need padding
bytes, e.g., x86_fp80 has 10 bytes store size with 6 bytes padding in darwin on
x86_64. Because the load/store vectorization is performed by the bit casting to
a packed vector, which has incompatible memory layout due to the lack of padding
bytes, the present vectorizer produces inconsistent result for memory
instructions of those types.
This patch checks an equality of the AllocSize of a scalar type and allocated
size for each vector element, to ensure that there is no padding bytes and the
array can be read/written using vector operations.
Patch by Daisuke Takahashi!
Fixes PR15758.
llvm-svn: 180196
even if erroneously annotated with the parallel loop metadata.
Fixes Bug 15794:
"Loop Vectorizer: Crashes with the use of llvm.loop.parallel metadata"
llvm-svn: 180081