Commit Graph

3262 Commits

Author SHA1 Message Date
Chandler Carruth 11e7f6b50a [x86] Remove a SimpleTy usage. No need for it here, we already have the
MVT.

llvm-svn: 230622
2015-02-26 10:37:01 +00:00
Chandler Carruth d283cb6203 [x86] Make the vector shuffle helpers order the SDLoc and MVT arguments.
This ordering matches that of DAG.getNode.

llvm-svn: 230617
2015-02-26 08:19:24 +00:00
Eric Christopher 5f54195e4a Remove a FIXME.
Explanation: This function is in TargetLowering because it uses
RegClassForVT which would need to be moved to TargetRegisterInfo
and would necessitate moving isTypeLegal over as well - a massive
change that would just require TargetLowering to have a TargetRegisterInfo
class member that it would use.

llvm-svn: 230585
2015-02-26 00:00:35 +00:00
Eric Christopher 23a3a7c871 Remove an argument-less call to getSubtargetImpl from TargetLoweringBase.
This required plumbing a TargetRegisterInfo through computeRegisterProperties
and into findRepresentativeClass which uses it for register class
iteration. This required passing a subtarget into a few target specific
initializations of TargetLowering.

llvm-svn: 230583
2015-02-26 00:00:24 +00:00
Sanjay Patel a709f3a5ae simplify control flow; NFC
llvm-svn: 230342
2015-02-24 16:26:02 +00:00
Sanjay Patel 2898548598 fix typo in comment; NFC
llvm-svn: 230338
2015-02-24 16:11:05 +00:00
Elena Demikhovsky 52e81bc499 AVX-512: recommitted 229837 + bugfix + test
llvm-svn: 230223
2015-02-23 15:12:31 +00:00
Benjamin Kramer 71bfe3d1a4 X86: Remove custom lowering of SIGN_EXTEND_INREG
This was just replicating logic from the legalizer. Covered by existing
tests.

llvm-svn: 230136
2015-02-21 14:31:29 +00:00
Tim Northover 3b6b7ca2bc CodeGen: convert CCState interface to using ArrayRefs
Everyone except R600 was manually passing the length of a static array
at each callsite, calculated in a variety of interesting ways. Far
easier to let ArrayRef handle that.

There should be no functional change, but out of tree targets may have
to tweak their calls as with these examples.

llvm-svn: 230118
2015-02-21 02:11:17 +00:00
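
To make the shape of the change concrete, here is a standalone sketch
using C++20 std::span as a stand-in for llvm::ArrayRef (the function and
variable names are hypothetical, not the real CCState API):

    #include <cstdio>
    #include <span>

    // Old style: every call site passes a pointer plus a hand-computed length.
    static void analyzeOld(const int *Cases, unsigned NumCases) {
      for (unsigned I = 0; I != NumCases; ++I)
        std::printf("%d\n", Cases[I]);
    }

    // New style: the view carries its own length, so call sites cannot get
    // the element count wrong.
    static void analyzeNew(std::span<const int> Cases) {
      for (int C : Cases)
        std::printf("%d\n", C);
    }

    int main() {
      static const int RetCCs[] = {1, 2, 3};
      analyzeOld(RetCCs, sizeof(RetCCs) / sizeof(RetCCs[0]));
      analyzeNew(RetCCs); // the array converts implicitly; length is inferred
    }
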
Simon Pilgrim b7875837c7 LowerScalarImmediateShift - Merged v16i8 and v32i8 shift lowering. NFC.
llvm-svn: 230074
2015-02-20 22:13:03 +00:00
Sanjay Patel 1b53f74c4c canonicalize a v2f64 blendi of 2 registers
This canonicalization step saves us 3 pattern matching possibilities * 4 math ops
for scalar FP math that uses xmm regs. The backend can re-commute the operands
post-instruction-selection if that makes register allocation better.

The tests in llvm/test/CodeGen/X86/sse-scalar-fp-arith.ll cover this scenario already,
so there are no new tests with this patch.

Differential Revision: http://reviews.llvm.org/D7777

llvm-svn: 230024
2015-02-20 16:55:27 +00:00
Chandler Carruth 0c5f059865 [x86] Switching the shuffle equivalence test to a variadic template was
the wrong answer. We also got initializer lists which are *way* cleaner
for this kind of thing. Let's use those and make this a normal, boring
function accepting ArrayRef.

llvm-svn: 230004
2015-02-20 10:47:28 +00:00
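
For illustration, a toy ArrayRef-alike that shows why initializer lists
make such call sites boring (the types and helper here are hypothetical;
llvm::ArrayRef provides the real std::initializer_list constructor, and
negative shuffle-mask entries conventionally mean undef):

    #include <cassert>
    #include <cstddef>
    #include <initializer_list>
    #include <vector>

    // Non-owning view constructible from a braced list, so callers can
    // write the expected mask inline.
    struct IntRef {
      const int *Data;
      std::size_t Size;
      IntRef(std::initializer_list<int> L) : Data(L.begin()), Size(L.size()) {}
      IntRef(const std::vector<int> &V) : Data(V.data()), Size(V.size()) {}
    };

    // An undef (negative) mask entry matches anything.
    static bool isShuffleEquivalent(IntRef Mask, IntRef Expected) {
      if (Mask.Size != Expected.Size)
        return false;
      for (std::size_t I = 0; I != Mask.Size; ++I)
        if (Mask.Data[I] >= 0 && Mask.Data[I] != Expected.Data[I])
          return false;
      return true;
    }

    int main() {
      std::vector<int> Mask = {0, -1, 1, 1};
      assert(isShuffleEquivalent(Mask, {0, 0, 1, 1})); // normal and boring
    }
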
Nick Lewycky a2bda08806 Fix build in release mode, -Wunused-variable on this lambda function used only in an assert.
llvm-svn: 229977
2015-02-20 07:16:17 +00:00
David Blaikie 9a3644c472 Fix -Wunused-variable warning in non-asserts build, and optimize a little bit while I'm here.
llvm-svn: 229970
2015-02-20 06:28:38 +00:00
Chandler Carruth 4041f2217b [x86] Remove the old vector shuffle lowering code and its flag.
The new shuffle lowering has been the default for some time. I've
enabled the new legality testing by default with no really blocking
regressions. I've fuzz tested this very heavily (many millions of fuzz
test cases have passed at this point). And this cleans up a ton of code.
=]

Thanks again to the many folks that helped with this transition. There
was a lot of work by others that went into the new shuffle lowering to
make it really excellent.

In case you aren't using a diff algorithm that can handle this:
  X86ISelLowering.cpp: 22 insertions(+), 2940 deletions(-)

llvm-svn: 229964
2015-02-20 04:25:04 +00:00
Chandler Carruth eb206aa1ea [x86] Now that the new vector shuffle legality is enabled and everything
is going well, remove the flag and the code for the old legality tests.

This is the first step toward removing the entire old vector shuffle
lowering. *Much* more code to delete coming up next.

llvm-svn: 229963
2015-02-20 03:59:35 +00:00
Chandler Carruth d2b14b296c [x86] Make the new vector shuffle legality test on by default, which
reflects the fact that the x86 backend can in fact lower any shuffle you
want it to with reasonably high code quality.

My recent work on the new vector shuffle has made this regress *very*
little. The diff in the test cases makes me very, very happy.

llvm-svn: 229958
2015-02-20 03:05:47 +00:00
Eric Christopher 0d94fa98e5 Revert "AVX-512: Full implementation for VRNDSCALESS/SD instructions and intrinsics."
The instructions were being generated on architectures that don't support avx512.

This reverts commit r229837.

llvm-svn: 229942
2015-02-20 00:45:28 +00:00
Benjamin Kramer ea68a944a1 Demote vectors to arrays. No functionality change.
llvm-svn: 229861
2015-02-19 15:26:17 +00:00
Chandler Carruth 5d1a84b7b8 [x86] Delete still more piles of complex code now that we have a good
systematic lowering of v8i16.

This required a slight strategy shift to prefer unpack lowerings in more
places. While this isn't a cut-and-dry win in every case, it is in the
overwhelming majority. There are only a few places where the old
lowering would probably be a touch faster, and then only by a small
margin.

In some cases, this is yet another significant improvement.

llvm-svn: 229859
2015-02-19 15:21:57 +00:00
Chandler Carruth 0b39536390 [x86] Teach the unpack lowering how to lower with an initial unpack in
addition to lowering to trees rooted in an unpack.

This saves shuffles and/or registers in various ways, and lets us
handle another class of v4i32 shuffles pre-SSE4.1 without domain
crosses, etc.

llvm-svn: 229856
2015-02-19 15:06:13 +00:00
Chandler Carruth 352eba1c29 [x86] Dramatically improve v8i16 shuffle lowering by not using its
terribly complex partial blend logic.

This code path was one of the more complex and bug prone when it first
went in, and it hasn't fared much better. Ultimately, with the simpler
basis for unpack lowering and support for bit-math blending, this is
completely obsolete. In the worst case without this we generate
different but equivalent instructions. However, in many cases we
generate much better code. This is especially true when blends or pshufb
is available.

This does expose one (minor) weakness of the unpack lowering that I'll
try to address.

In case you were wondering, this is actually a big part of what I've
been trying to pull off in the recent string of commits.

llvm-svn: 229853
2015-02-19 14:08:24 +00:00
Chandler Carruth 2c0390ca4b [x86] Remove the final fallback in the v8i16 lowering that isn't really
needed, and significantly improve the SSSE3 path.

This makes the new strategy much more clear. If we can blend, we just go
with that. If we can't blend, we try to permute into an unpack so
that we handle cases where the unpack doing the blend also simplifies
the shuffle. If that fails and we've got SSSE3, we now call into
factored-out pshufb lowering code so that we leverage the fact that
pshufb can set up a blend for us while shuffling. This generates great
code, especially because we *know* we don't have a fast blend at this
point. Finally, we fall back on decomposing into permutes and blends
because we do at least have a bit-math-based blend if we need to use
that.

This pretty significantly improves some of the v8i16 code paths. We
never need to form pshufb for the single-input shuffles because we have
effective target-specific combines to form it there, but we were missing
its effectiveness in the blends.

llvm-svn: 229851
2015-02-19 13:56:49 +00:00
Chandler Carruth f0f0d27391 [x86] Simplify the pre-SSSE3 v16i8 lowering significantly by decomposing
them into permutes and a blend with the generic decomposition logic.

This works really well in almost every case and lets the code only
manage the expansion of a single input into two v8i16 vectors to perform
the actual shuffle. The blend-based merging is often much nicer than the
pack-based merging that this replaces. The only place where it isn't is
when we end up blending between two packs when we could do a single pack. To
handle that case, just teach the v2i64 lowering to handle these blends
by digging out the operands.

With this we're down to only really random permutations that cause an
explosion of instructions.

llvm-svn: 229849
2015-02-19 13:15:12 +00:00
Chandler Carruth 8817e5e01b [x86] Remove the insanely over-aggressive unpack lowering strategy for
v16i8 shuffles, and replace it with new facilities.

This uses precise patterns to match exact unpacks, and the new
generalized unpack lowering only when we detect a case where we will
have to shuffle both inputs anyway and they terminate in exactly
a blend.

This fixes all of the blend horrors that I uncovered by always lowering
blends through the vector shuffle lowering. It also removes *sooooo*
much of the crazy instruction sequences required for v16i8 lowering
previously. Much cleaner now.

The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when
it would be marginally nicer to use pshufb+pshufb+por. However, the
difference there is *tiny*. In many cases it's a win because we re-use
the pshufb mask. In others, we get to avoid the pshufb entirely. I've
left a FIXME, but I'm dubious we can really do better than this. I'm
actually pretty happy with this lowering now.

For SSE2 this exposes some horrors that were really already there. Those
will have to be fixed by changing a different path through the v16i8
lowering.

llvm-svn: 229846
2015-02-19 12:10:37 +00:00
Chandler Carruth 38dea42ddf [x86] The SELECT x86 DAG combine also does legalization. It used to rely
on things not being marked as either custom or legal, but we now do
custom lowering of more VSELECT nodes. To cope with this, manually
replicate the legality tests here. These have to stay in sync with the
set of tests used in the custom lowering of VSELECT.

Ideally, we wouldn't do any of this combine-based-legalization when we
have an actual custom legalization step for VSELECT, but I'm not going
to be able to rewrite all of that today.

I don't have a test case for this currently, but it was found when
compiling a number of the test-suite benchmarks. I'll try to reduce
a test case and add it.

This should at least fix the test-suite fallout on build bots.

llvm-svn: 229844
2015-02-19 11:43:37 +00:00
Elena Demikhovsky 69e8b45b13 AVX-512: Full implementation for VRNDSCALESS/SD instructions and intrinsics.
llvm-svn: 229837
2015-02-19 10:48:04 +00:00
Chandler Carruth bcb6c5f62d [x86] Add support for bit-wise blending and use it in the v8 and v16
lowering paths. I'm going to be leveraging this to simplify a lot of the
overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes.

Sadly, this isn't profitable on v4i32 and v2i64. There, the float and
double blending instructions for pre-SSE4.1 are actually pretty good,
and we can't beat them with bit math. And once SSE4.1 comes around we
have direct blending support and this ceases to be relevant.

Also, some of the test cases look odd because the domain fixer
canonicalizes these to floating point domain. That's OK, it'll use the
integer domain when it matters and some day I may be able to update
enough of LLVM to canonicalize the other way.

This restores almost all of the regressions from teaching x86's vselect
lowering to always use vector shuffle lowering for blends. The remaining
problems are because the v16 lowering path is still doing crazy things.
I'll be re-arranging that strategy in more detail in subsequent commits
to finish recovering the performance here.

llvm-svn: 229836
2015-02-19 10:46:52 +00:00
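
A minimal scalar sketch of the bit-math blend described above, with one
32-bit integer standing in for a vector register and the mask values
chosen arbitrarily:

    #include <cassert>
    #include <cstdint>

    int main() {
      // With an all-ones/all-zeros per-lane mask, a blend is just
      // (A & M) | (B & ~M): three cheap bitwise ops, no blend instruction.
      uint32_t A = 0x11223344, B = 0xAABBCCDD;
      uint32_t M = 0xFF00FF00; // "lanes" where we take A
      uint32_t Blend = (A & M) | (B & ~M);
      assert(Blend == 0x11BB33DD);
    }
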
Chandler Carruth b89464a9b6 [x86,sdag] Two interrelated changes to the x86 and sdag code.
First, don't combine bit masking into vector shuffles (even ones the
target can handle) once operation legalization has taken place. Custom
legalization of vector shuffles may exist for these patterns (making the
predicate return true) but that custom legalization may in some cases
produce the exact bit math this matches. We only really want to handle
this prior to operation legalization.

However, the x86 backend, in a fit of awesome, relied on this. What it
would do is mark VSELECTs as expand, which would turn them into
arithmetic, which this would then match back into vector shuffles, which
we would then lower properly. Amazing.

Instead, the second change is to teach the x86 backend to directly form
vector shuffles from VSELECT nodes with constant conditions, and to mark
all of the vector types we support lowering blends as shuffles as custom
VSELECT lowering. We still mark the forms which actually support
variable blends as *legal* so that the custom lowering is bypassed, and
the legal lowering can even be used by the vector shuffle legalization
(yes, I know, this is confusing, but that's how the patterns are
written).

This makes the VSELECT lowering much more sensible, and in fact should
fix a bunch of bugs with it. However, as you'll see in the test cases,
right now what it does is point out the *hilarious* deficiency of the
new vector shuffle lowering when it comes to blends. Fortunately, my
very next patch fixes that. I can't submit it yet, because that patch,
somewhat obviously, forms the exact and/or pattern that the DAG combine
is matching here! Without this patch, teaching the vector shuffle
lowering to produce the right code infloops in the DAG combiner. With
this patch alone, we produce terrible code but at least lower through
the right paths. With both patches, all the regressions here should be
fixed, and a bunch of the improvements (like using 2 shufps with no
memory loads instead of 2 andps with memory loads and an orps) will
stay. Win!

There is one other change worth noting here. We had hilariously wrong
vectorization cost estimates for vselect because we fell through to the
code path that assumed all "expand" vector operations are scalarized.
However, the "expand" lowering of VSELECT is vector bit math, most
definitely not scalarized. So now we go back to the correct if horribly
naive cost of "1" for "not scalarized". If anyone wants to add actual
modeling of shuffle costs, that would be cool, but this seems an
improvement on its own. Note the removal of 16 and 32 "costs" for doing
a blend. Even in SSE2 we can blend in fewer than 16 instructions. ;] Of
course, we don't right now because of OMG bad code, but I'm going to fix
that. Next patch. I promise.

llvm-svn: 229835
2015-02-19 10:36:19 +00:00
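
The key equivalence this commit exploits is that a VSELECT with a
constant condition is exactly a shuffle. A scalar-loop sketch over four
hypothetical lanes:

    #include <array>
    #include <cassert>

    int main() {
      std::array<int, 4> A = {10, 11, 12, 13}, B = {20, 21, 22, 23};
      std::array<bool, 4> Cond = {true, false, true, true}; // constant

      // vselect: lane-wise pick from A or B.
      std::array<int, 4> Select;
      for (int I = 0; I != 4; ++I)
        Select[I] = Cond[I] ? A[I] : B[I];

      // Equivalent shuffle: lane I of A is index I, lane I of B is 4 + I.
      std::array<int, 4> Mask = {0, 5, 2, 3};
      std::array<int, 4> Shuffle;
      for (int I = 0; I != 4; ++I)
        Shuffle[I] = Mask[I] < 4 ? A[Mask[I]] : B[Mask[I] - 4];

      assert(Select == Shuffle);
    }
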
Benjamin Kramer 6ca8992018 X86: Use bitset to manage a bag of bits. NFC.
Doesn't matter in terms of memory usage or perf here, but it's a neat
simplification.

llvm-svn: 229672
2015-02-18 14:10:44 +00:00
Chandler Carruth bbb377c3a1 [x86] Tighten the assertions to document that canonicalization has
actually removed all but a *very* small number of choices for v2i64.
Also remove dead code handling cases that simply cannot arise.

llvm-svn: 229670
2015-02-18 11:46:29 +00:00
Chandler Carruth 811f0ee8c1 [x86] Switch an if which is trivially true to an assert. NFC
llvm-svn: 229669
2015-02-18 11:46:27 +00:00
Chandler Carruth 8f3e585b17 [x86] Remove some more 'bit' nomenclature from the generic shift
lowering.

llvm-svn: 229668
2015-02-18 11:46:23 +00:00
Chandler Carruth 672a98ea28 [x86] Fold together the two shift lowering strategies. They were doing
quite literally the same work, we just need to special case the >64-bit
element shift code emission to emit the byte shift instructions and
offsets. This also makes reasoning about each of the vector lowering
strategies easier as we don't have to remember to use both forms.

llvm-svn: 229662
2015-02-18 10:40:38 +00:00
Chandler Carruth 48cc6c623a [x86] Refactor the bit shift code the same as I just did the byte shift
code.

While this didn't have the miscompile (it used MatchLeft consistently)
it missed some cases where it could use right shifts. I've added a test
case Craig Topper came up with to exercise the right shift matching.

This code is really identical between the two. I'm going to merge them
next so that we don't keep two copies of all of this logic.

llvm-svn: 229655
2015-02-18 09:19:58 +00:00
Elena Demikhovsky 714f23bcdb AVX-512: Added support for FP instructions with embedded rounding mode.
By Asaf Badouh <asaf.badouh@intel.com>

llvm-svn: 229645
2015-02-18 07:59:20 +00:00
Chandler Carruth 55553f5299 [x86] Rewrite the byte shift detection to not use boolean variables to
track state.

I didn't like this in the code review because the pattern tends to be
error prone, but I didn't see a clear way to rewrite it. Turns out that
there were bugs here, I found them when fuzz testing our shuffle
lowering for correctness on x86.

The core of the problem is that we need to consistently test all our
preconditions for the same directionality of shift and the same input
vector. Instead, formulate this as two predicates (one doesn't depend on
the input in any way), pass things like the directionality and input
vector as inputs, and loop over the alternatives.

This fixes a pattern of very rare miscompiles coming out of this code.
Turned up roughly 4 out of every 1 million v8 shuffles in my fuzz
testing. The new code has survived over half a million test runs with no
failures yet. I've also fuzzed every other function in the lowering code with
over 3.5 million test cases and not discovered any other miscompiles.

llvm-svn: 229642
2015-02-18 07:13:48 +00:00
Simon Pilgrim 1d89a02abb [X86][SSE] Generalised unpckl/unpckh shuffle matching
Added commuted unpckl/unpckh shuffle matching patterns, as many cases containing undefined lanes fail to commute by themselves.

Differential Revision: http://reviews.llvm.org/D7564

llvm-svn: 229571
2015-02-17 22:24:32 +00:00
Benjamin Kramer 6cd780ff21 Prefer SmallVector::append/insert over push_back loops.
Same functionality, but hoists the vector growth out of the loop.

llvm-svn: 229500
2015-02-17 15:29:18 +00:00
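
The shape of the cleanup, sketched with std::vector (SmallVector::append
and insert are the analogous LLVM calls):

    #include <vector>

    int main() {
      std::vector<int> Dst;
      std::vector<int> Src = {1, 2, 3, 4};

      // Before: a growth check on every single iteration.
      for (int V : Src)
        Dst.push_back(V);

      // After: the container sees the full count up front and grows once.
      Dst.insert(Dst.end(), Src.begin(), Src.end());
      return Dst.size() == 8 ? 0 : 1;
    }
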
Andrea Di Biagio eb97f92489 [X86] Silence -Wsign-compare warnings.
GCC 4.8 reported two new warnings due to comparisons
between signed and unsigned integer expressions. The new warnings were
accidentally introduced by revision 229480.
Added explicit casts to silence the warnings. No functional change intended.

llvm-svn: 229488
2015-02-17 11:20:11 +00:00
Michael Kuperstein ff5acaf50c [X86] Combine vector anyext + and into a vector zext
Vector zext tends to get legalized into a vector anyext, represented as a vector shuffle with an undef vector plus a bitcast, which then gets ANDed with a mask that zeroes the undef elements.
Combine this into an explicit shuffle with a zero vector instead. This allows shuffle lowering to match it as a zext, instead of matching it as an anyext and emitting an explicit AND.
This combine only covers a subset of the cases, but it's a start.

Differential Revision: http://reviews.llvm.org/D7666

llvm-svn: 229480
2015-02-17 08:22:51 +00:00
Chandler Carruth 55db07016e [x86] Teach the unpack lowering to try wider element unpacks.
This allows it to match still more places where previously we would have
to fall back on floating point shuffles or other more complex lowering
strategies.

I'm hoping to replace some of the hand-rolled unpack matching with this
routine as it gets more and more clever.

llvm-svn: 229463
2015-02-17 02:12:24 +00:00
Cameron McInally c5764cbe4e [AVX512] Make 512b vector floating point rounds legal on AVX512.
llvm-svn: 229445
2015-02-16 22:15:42 +00:00
Craig Topper 49df44e2e2 [X86] Remove the multiply by 8 that goes into the shift constant for X86ISD::VSHLDQ and X86ISD::VSRLDQ. This simplifies the pattern matching in isel and allows these nodes to become the patterns embedded in the instruction.
llvm-svn: 229431
2015-02-16 20:52:07 +00:00
Chandler Carruth 1e57e2deb8 [x86] Add a generic unpack-targeted lowering technique. This can be used
to generically lower blends and is particularly nice because it is
available from SSE2 onward. This removes a lot of the remaining domain
crossing blends in SSE2 code.

I'm hoping to replace some of the "interleaved" lowering hacks with
something closer to this which should be more principled. First, this
needs to learn how to detect and use other interleavings besides that of
the natural type provided. That will be a follow-up patch though.

llvm-svn: 229378
2015-02-16 12:28:18 +00:00
Chandler Carruth c802085b3a [x86] Add initial basic support for forming blends of v16i8 vectors.
This blend instruction is ... really lame. The register usage is insane.
As a consequence this is probably only *barely* better than 2 pshufbs
followed by a por, and that mostly because it only has to read from
a single memory location.

However, this doesn't fix as much as I kind of expected, so more to go.
Pretty sure that the ordering and delegation of v16i8 is just really,
really bad.

llvm-svn: 229373
2015-02-16 10:58:23 +00:00
Chandler Carruth e63bbd97a7 [x86] Switch my usage of VariadicFunction to a "normal" variadic
template now that we can use them.

This is, of course, horribly ugly because of the required recursive
formulation. Suggestions for making it less ugly welcome.

llvm-svn: 229367
2015-02-16 09:59:48 +00:00
Craig Topper 7e8dcef094 [X86] Add support for lowering shuffles to 256-bit PALIGNR instruction.
llvm-svn: 229359
2015-02-16 06:29:06 +00:00
Chandler Carruth 87e580a659 [x86] Teach the 128-bit vector shuffle lowering routines to take
advantage of the existence of a reasonable blend instruction.

The 256-bit vector shuffle lowering has leveraged the general technique
of decomposed shuffles and blends for quite some time, but this never
made it back into the 128-bit code, and there are a large number of
patterns where this is substantially better. For example, this removes
almost all domain crossing in vector shuffles that involve some blend
and some permutation with SSE4.1 and later. See the massive reduction
in 'shufps' for integer test cases in this commit.

This isn't perfect yet for a few reasons:

1) The v8i16 shuffle lowering continues to plague me. We don't always
   form an unpack-based blend when that would be better. But the wins
   pretty drastically outstrip the losses here.
2) The v16i8 shuffle lowering is just a disaster here. I never went and
   implemented blend support here for some terrible reason. I'll do
   that next probably. I've not updated it for now.

More variations on this technique are coming as well -- we don't
shuffle-into-unpack or shuffle-into-palignr, both of which would also be
profitable.

Note that some test cases grow significantly in the number of
instructions, but I expect the result to actually be faster. We use
pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are
very likely to pipeline well (two ports on most modern Intel chips) and
the blend is a *very* fast instruction. The domain switch penalty will
essentially always be more than a blend instruction, which is the only
increase in tree height.

llvm-svn: 229350
2015-02-16 01:52:02 +00:00
Simon Pilgrim 2a7bedb73e Coding style fixes to recent patches. NFC.
llvm-svn: 229312
2015-02-15 14:19:29 +00:00
Simon Pilgrim 00bd79d794 [X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2
This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets.

It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions.

Differential Revision: http://reviews.llvm.org/D7596

llvm-svn: 229311
2015-02-15 13:19:52 +00:00
Chandler Carruth bf0fb06e0d [x86] Teach the decomposed shuffle/blend lowering to use an early blend
when that will allow it to lower with a single permute instead of
multiple permutes.

It tries to detect when it will only have to do a single permute in
either case to maximize folding of loads and such.

This cuts a *lot* of the avx2 shuffle permute counts in half. =]

llvm-svn: 229309
2015-02-15 12:42:15 +00:00
Chandler Carruth 75d9a97569 [x86] Teach the shuffle mask equivalence test to look through build
vectors and detect equivalent inputs.

This lets the code match unpck-style instructions when only one of the
inputs are lined up but the other input is a splat and so which lanes we
pull from doesn't matter. Today, this only happens by accident. I have
a patch that normalizes how we shuffle splats, and with
that patch this will be necessary for a lot of the mask equivalence
tests to work.

I don't really know how to write a test case for this specific change
until the other change lands though.

llvm-svn: 229307
2015-02-15 12:07:55 +00:00
Chandler Carruth 4fe214b1f2 [x86] Tweak the ordering of unpack matching vs. element insertion, and
don't try to do element insertion for non-zero-index floating point
vectors.

We don't have any useful patterns or lowering for element insertion into
high elements of a floating point vector, and the generic shuffle
lowering will end up being better -- namely it will fall back to unpck.
But we should try to handle other forms of element insertion before
matching unpck patterns.

While this doesn't matter much right now, I'm working on a patch that
makes unpck matching much more powerful, and that patch will break
without this re-ordering.

llvm-svn: 229306
2015-02-15 12:01:14 +00:00
Chandler Carruth 56e0ceda0d [x86] Stop shuffling zero vectors. =]
I was somewhat surprised this pattern really came up, but it does. It
seems better to just directly handle it than try to special case every
place where we end up forming a shuffle that devolves to a shuffle of
a zero vector.

llvm-svn: 229301
2015-02-15 10:34:52 +00:00
Chandler Carruth 3d272daaed [x86] Use a more helpful parenthesizing of these comparisons. Silences
a -Wparentheses complaint from GCC.

llvm-svn: 229300
2015-02-15 10:15:20 +00:00
Chandler Carruth 62558c1d4d [x86] When splitting 256-bit vectors into 128-bit vectors, don't extract
subvectors from buildvectors. That doesn't really make any sense and it
breaks all of the down-stream matching of buildvectors to cleverly lower
shuffles.

With this, we now get the shift-based lowering of 256-bit vector
shuffles with AVX1 when we split them into 128-bit vectors. We also do
much better on the zero-extension patterns, although there remains quite
a bit of room for improvement here.

llvm-svn: 229299
2015-02-15 10:12:02 +00:00
Chandler Carruth a6f8a3661c [x86] Make computing the zeroable elements slightly more powerful, at
least in theory.

I don't actually have a test case that benefits from this, but
theoretically, it could come up, and I don't want to try to think about
whether this is the culprit or something else is, so I'd rather just
make this code powerful. =/ Makes me sad that I can't really test it
though.

llvm-svn: 229298
2015-02-15 09:33:36 +00:00
Chandler Carruth 0ddfe0c7c5 [x86] Add a slight variation on some of the other generic shuffle
lowerings -- one which decomposes into an initial blend followed by
a permute.

Particularly on newer chips, blends are handled independently of
shuffles and so this is much less bottlenecked on the single port that
floating point shuffles are executed with on Intel.

I'll be adding this lowering to a bunch of other code paths in
subsequent commits to handle still more places where we can effectively
leverage blends when they're available in the ISA.

llvm-svn: 229292
2015-02-15 08:26:30 +00:00
Duncan P. N. Exon Smith 5975a703e6 X86: Canonicalize access to function attributes, NFC
Canonicalize access to function attributes to use the simpler API.

getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
  => getFnAttribute(Kind)

getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
  => hasFnAttribute(Kind)

llvm-svn: 229214
2015-02-14 01:59:52 +00:00
Sanjay Patel 34da52a894 fix typos; NFC
llvm-svn: 229155
2015-02-13 21:07:22 +00:00
Craig Topper 007a713ebf Fix a typo in a comment. NFC
llvm-svn: 229071
2015-02-13 06:07:29 +00:00
David Majnemer a12fcb790f X86: Don't crash if we can't decode the pshufb mask
Constant pool entries are uniqued by their contents regardless of their
type.  This means that a pshufb can have a shuffle mask which isn't a
simple array of bytes.

The code path which attempts to decode the mask didn't check for
failure, causing PR22559.

llvm-svn: 228979
2015-02-12 23:26:26 +00:00
Benjamin Kramer 5f6a907288 MathExtras: Bring Count(Trailing|Leading)Ones and CountPopulation in line with countTrailingZeros
Update all callers.

llvm-svn: 228930
2015-02-12 15:35:40 +00:00
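
A sketch of the semantics using the C++20 <bit> equivalents rather than
the LLVM helpers themselves: counting trailing ones is counting trailing
zeros of the complement.

    #include <bit>
    #include <cassert>
    #include <cstdint>

    int main() {
      uint8_t V = 0b00000111;
      assert(std::countr_one(V) == 3);
      assert(std::countr_one(V) == std::countr_zero(uint8_t(~V)));
      assert(std::popcount(V) == 3); // CountPopulation analogue
    }
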
Elena Demikhovsky d2cb3c8876 AVX-512: Fixed the "test" operation for i1 type
Using KORTESTW to compare an i1 value with zero was wrong, since the instruction tests 16 bits.
KORTESTW may be used together with KSHIFTL+KSHIFTR, which clear the 15 upper bits.
I removed the (X86cmp i1, 0) pattern; instead we zero-extend i1 to i8 and then use TESTB.

There are some cases where i1 is in the mask register and the upper bits are already zeroed.
Then KORTESTW is the better solution, but that is left as a future optimization.
Meanwhile, I'm fixing the correctness issue.

llvm-svn: 228916
2015-02-12 08:40:34 +00:00
David Majnemer 13d0b11d7b X86: Make @llvm.frameaddress work correctly with Windows unwind codes
Simply loading or storing the frame pointer is not sufficient for
Windows targets.  Instead, create a synthetic frame object that we will
lower later.  References to this synthetic object will be replaced with
the correct reference to the frame address.

llvm-svn: 228748
2015-02-10 21:22:05 +00:00
Sanjay Patel 3510bc7162 fix typos; NFC
llvm-svn: 228529
2015-02-08 18:54:22 +00:00
Sanjay Patel 3d982214b0 use local variables; NFC
llvm-svn: 228452
2015-02-06 22:43:52 +00:00
Ahmed Bougacha e892d13d90 [CodeGen] Add hook/combine to form vector extloads, enabled on X86.
The combine that forms extloads used to be disabled on vector types,
because "None of the supported targets knows how to perform load and
sign extend on vectors in one instruction."

That's not entirely true, since at least SSE4.1 X86 knows how to do
those sextloads/zextloads (with PMOVS/ZX).
But there are several aspects to getting this right.
First, vector extloads are controlled by a profitability callback.
For instance, on ARM, several instructions have folded extload forms,
so it's not always beneficial to create an extload node (and trying to
match extloads is a whole 'nother can of worms).

The interesting optimization enables folding of s/zextloads to illegal
(splittable) vector types, expanding them into smaller legal extloads.

It's not ideal (it introduces some legalization-like behavior in the
combine) but it's better than the obvious alternative: form illegal
extloads, and later try to split them up.  If you do that, you might
generate extloads that can't be split up, but have a valid ext+load
expansion.  At vector-op legalization time, it's too late to generate
this kind of code, so you end up forced to scalarize. It's better to
just avoid creating egregiously illegal nodes.

This optimization is enabled unconditionally on X86.

Note that the splitting combine is happy with "custom" extloads. As
is, this bypasses the actual custom lowering, and just unrolls the
extload. But from what I've seen, this is still much better than the
current custom lowering, which does some kind of unrolling at the end
anyway (see for instance load_sext_4i8_to_4i64 on SSE2, and the added
FIXME).

Also note that the existing combine that forms extloads is now also
enabled on legal vectors.  This doesn't have a big effect on X86
(because sext+load is usually combined to sext_inreg+aextload).
On ARM it fires on some rare occasions; that's for a separate commit.

Differential Revision: http://reviews.llvm.org/D6904

llvm-svn: 228325
2015-02-05 18:31:02 +00:00
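
For concreteness, the SSE4.1 sextload mentioned above, sketched with
intrinsics (the helper name is hypothetical; with the combine in place,
the load plus sign-extension can select as a single pmovsxbd):

    #include <cstring>
    #include <immintrin.h>

    // Load 4 x i8 and sign-extend to 4 x i32.
    static __m128i loadSext4i8To4i32(const signed char *P) {
      int Raw;
      std::memcpy(&Raw, P, 4);                          // the 32-bit load
      return _mm_cvtepi8_epi32(_mm_cvtsi32_si128(Raw)); // the sign extend
    }

    int main() {
      const signed char Data[4] = {-1, 2, -3, 4};
      __m128i V = loadSext4i8To4i32(Data);
      return _mm_extract_epi32(V, 0) == -1 ? 0 : 1; // requires SSE4.1
    }
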
Andrew Trick 7fc4583eda X86 ABI fix for return values > 24 bytes.
The return value's address must be returned in %rax.
i.e. the callee needs to copy the sret argument (%rdi)
into the return value (%rax).

This probably won't manifest as a bug when the caller is LLVM-compiled
code. But it is an ABI guarantee and tools expect it.

llvm-svn: 228321
2015-02-05 18:09:05 +00:00
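
The source-level shape of the case, as a sketch (the struct is sized to
force an sret return):

    // A 32-byte aggregate is returned through a hidden sret pointer passed
    // in %rdi; the ABI additionally requires the callee to copy that
    // pointer into %rax before returning, which is what this commit fixes.
    struct Big { long A, B, C, D; };

    Big makeBig() { return Big{1, 2, 3, 4}; }

    int main() { return makeBig().A == 1 ? 0 : 1; }
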
Sanjay Patel 67667bcdc4 move fold comments to the corresponding fold; NFC
llvm-svn: 228317
2015-02-05 17:33:59 +00:00
Bruno Cardoso Lopes ab9ae87623 [X86][MMX] Handle i32->mmx conversion using movd
Implement a BITCAST dag combine to transform i32->mmx conversion patterns
into an X86-specific node (MMX_MOVW2D) and guarantee that moves between
i32 and x86mmx are better handled, i.e., don't use a store-load pair to do
the conversion.

llvm-svn: 228293
2015-02-05 13:23:07 +00:00
Larisse Voufo bc9f12e7bc Disable enumeral mismatch warning when compiling llvm with gcc.
Tested with gcc 4.9.2.
Compiling with -Werror was producing:
.../llvm/lib/Target/X86/X86ISelLowering.cpp: In function 'llvm::SDValue lowerVectorShuffleAsBitMask(llvm::SDLoc, llvm::MVT, llvm::SDValue, llvm::SDValue, llvm::ArrayRef<int>, llvm::SelectionDAG&)':
.../llvm/lib/Target/X86/X86ISelLowering.cpp:7771:40: error: enumeral mismatch in conditional expression: 'llvm::X86ISD::NodeType' vs 'llvm::ISD::NodeType' [-Werror=enum-compare]
   V = DAG.getNode(VT.isFloatingPoint() ? X86ISD::FAND : ISD::AND, DL, VT, V,
                                        ^

llvm-svn: 228271
2015-02-05 04:54:51 +00:00
Chandler Carruth 024cf8efd7 [x86] Start to introduce bit-masking based blend lowering.
This is the simplest form of bit-math based blending which only fires
when we are blending with zero and is relatively profitable. I've only
enabled this path on very specific lowering strategies. I'm planning to
widen its applicability in subsequent patches, but so far you'll notice
that even though we get fewer shufps instructions, we *still* do the bit
math in the FP execution port. I'm looking into why this is still
happening.

llvm-svn: 228124
2015-02-04 09:06:05 +00:00
Chandler Carruth 68fcb38328 [x86] Fix signed vs. unsigned comparison.
llvm-svn: 228055
2015-02-03 22:43:30 +00:00
Simon Pilgrim 9eecb14bd9 Fixed unused variable warning.
llvm-svn: 228054
2015-02-03 22:39:28 +00:00
Simon Pilgrim 46cd4f7400 [X86][SSE] psrl(w/d/q) and psll(w/d/q) bit shifts for SSE2
Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699.

Differential Revision: http://reviews.llvm.org/D6649

llvm-svn: 228047
2015-02-03 21:58:29 +00:00
Simon Pilgrim c4e5f1e192 Fixed signed/unsigned comparison warning.
llvm-svn: 228027
2015-02-03 20:54:01 +00:00
Simon Pilgrim 03c379a0fa Fixed unused variable warning.
llvm-svn: 228025
2015-02-03 20:38:52 +00:00
Simon Pilgrim d9885856e6 [X86][SSE] Added general integer shuffle matching for MOVQ instruction
This patch adds general shuffle pattern matching for the MOVQ zero-extend instruction (copy lower 64bits, zero upper) for all 128-bit integer vectors, it is added as a fallback test in lowerVectorShuffleAsZeroOrAnyExtend.

llvm-svn: 228022
2015-02-03 20:09:18 +00:00
Simon Pilgrim 6544f815b3 [X86][AVX2] Enabled shuffle matching for the AVX2 zero extension (128bit -> 256bit) vpmovzx* instructions.
Differential Revision: http://reviews.llvm.org/D7251

llvm-svn: 228014
2015-02-03 19:34:09 +00:00
Sanjay Patel b7d5628784 Merge consecutive 16-byte loads into one 32-byte load (PR22329)
This patch detects consecutive vector loads using the existing 
EltsFromConsecutiveLoads() logic. This fixes:
http://llvm.org/bugs/show_bug.cgi?id=22329

This patch effectively reverts the tablegen additions of D6492 / 
http://reviews.llvm.org/rL224344 ...which in hindsight were a horrible hack.

The test cases that were added with that patch are simply modified to load
from varying offsets of a base pointer. These loads did not match the existing
tablegen patterns.

A happy side effect of doing this optimization earlier is that we can now fold
the load into a math op where possible; this is shown in some of the updated
checks in the test file.

Differential Revision: http://reviews.llvm.org/D7303

llvm-svn: 228006
2015-02-03 18:54:00 +00:00
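
The merged pattern, sketched with AVX intrinsics (the helper is
hypothetical; the point is that two adjacent 16-byte loads feeding one
256-bit value can become a single 32-byte load):

    #include <immintrin.h>

    static __m256 loadPair(const float *P) {
      __m128 Lo = _mm_loadu_ps(P);     // bytes [0, 16)
      __m128 Hi = _mm_loadu_ps(P + 4); // bytes [16, 32)
      // After the combine this is one unaligned 32-byte load of P.
      return _mm256_insertf128_ps(_mm256_castps128_ps256(Lo), Hi, 1);
    }

    int main() {
      alignas(32) float Data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
      return _mm256_cvtss_f32(loadPair(Data)) == 0.0f ? 0 : 1;
    }
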
Bruno Cardoso Lopes 077774b820 [X86][MMX] Improve transfer from mmx to i32
Improve EXTRACT_VECTOR_ELT DAG combine to catch conversion patterns
between x86mmx and i32 with more layers of indirection.

Before:
  movq2dq %mm0, %xmm0
  movd %xmm0, %eax
After:
  movd %mm0, %eax

llvm-svn: 227969
2015-02-03 14:46:49 +00:00
Eric Christopher 05b819718c Reuse a bunch of cached subtargets and remove getSubtarget calls
without a Function argument.

llvm-svn: 227814
2015-02-02 17:38:43 +00:00
Craig Topper 2844ca7319 [X86] Add a few target specific nodes to 'getTargetNodeName'
llvm-svn: 227720
2015-02-01 10:00:37 +00:00
Simon Pilgrim 9c76b47469 [X86][SSE] Shuffle mask decode support for zero extend, scalar float/double moves and integer load instructions
This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd).

Also adds shuffle mask decodes for integer loads (movd/movq).

Differential Revision: http://reviews.llvm.org/D7228

llvm-svn: 227688
2015-01-31 14:09:36 +00:00
Eric Christopher 295596a0a7 Remove the last vestiges of resetOperationActions.
llvm-svn: 227648
2015-01-31 00:21:17 +00:00
Reid Kleckner a580b6ec67 Win64: Put a REX_W prefix on all TAILJMP* instructions
MSDN's x64 software conventions page says that this is one of the fixed
list of legal epilogues:
https://msdn.microsoft.com/en-us/library/tawsa7cb.aspx

Presumably this is how the unwinder distinguishes epilogue jumps from
in-function control flow.

Also normalize the way we place "## TAILCALL" comments on such jumps.

llvm-svn: 227611
2015-01-30 21:03:31 +00:00
Sanjay Patel e4ca47875f tidy up; NFC
llvm-svn: 227582
2015-01-30 16:58:58 +00:00
Reid Kleckner e83ab8c2de x86: Remove unused variables not caught by MSVC =P
llvm-svn: 227520
2015-01-30 00:05:39 +00:00
Reid Kleckner ca9b9feb2c x86: Fix large model calls to __chkstk for dynamic allocas
In the large code model, we now put __chkstk in %r11 before calling it.

Refactor the code so that we only do this once. Simplify things by using
__chkstk_ms instead of __chkstk on cygming. We already use that symbol
in the prolog emission, and it simplifies our logic.

Second half of PR18582.

llvm-svn: 227519
2015-01-29 23:58:04 +00:00
Sanjay Patel 3a907eaccd Change SmallVector param to the more general ArrayRef; NFCI
llvm-svn: 227514
2015-01-29 23:35:04 +00:00
Reid Kleckner f0abdae34e x86: Remove the W64ALLOCA pseudo
This is just an alias for CALL64pcrel32, and we can just use that opcode
with explicit defs in the MI.

No functionality change.

llvm-svn: 227508
2015-01-29 23:09:37 +00:00
Simon Pilgrim 80bd3c9e5f Spelling fixes. NFC.
llvm-svn: 227376
2015-01-28 22:03:52 +00:00
Sanjay Patel 4058dd9f3f invert check for less indentation; use local vars to reduce duplication; NFC
llvm-svn: 227355
2015-01-28 19:44:21 +00:00
Sanjay Patel 9bb601856e use SDValue methods directly instead of getNode()->* ; NFCI
llvm-svn: 227334
2015-01-28 18:01:31 +00:00
Michael Kuperstein 951995821a [X86] Reduce some 32-bit imuls into lea + shl
Reduce integer multiplication by a constant of the form k*2^c, where k is in {3,5,9}, into a lea + shl. Previously this was only done for imulq on 64-bit platforms, but it makes sense for imull and 32-bit as well.

Differential Revision: http://reviews.llvm.org/D7196

llvm-svn: 227308
2015-01-28 14:08:22 +00:00
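
A sketch of the arithmetic for one such constant; the comments name the
instructions the decomposition maps to:

    #include <cassert>
    #include <cstdint>

    // x * 40 = (x * 5) * 8: the *5 is lea-representable (base + index*4)
    // and the *8 is a shift, avoiding the higher-latency imul.
    static uint32_t mul40(uint32_t X) {
      uint32_t Times5 = X + X * 4; // leal (%reg,%reg,4), %out
      return Times5 << 3;          // shll $3, %out
    }

    int main() { assert(mul40(7) == 280); }
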
Michael Kuperstein f387611ac2 [x32] Enable sibcall optimization on x32.
This includes two things:
1) Fix TCRETURNdi and TCRETURN64di patterns to check the right thing (LP64 as opposed to target bitness).
2) Allow LEA64_32 in MatchingStackOffset.

llvm-svn: 227307
2015-01-28 13:38:48 +00:00
Elena Demikhovsky 7b0dd39db6 AVX-512: Added FMA intrinsics with rounding mode
By Asaf Badouh and Elena Demikhovsky

Added special nodes for rounding: FMADD_RND, FMSUB_RND, etc.
These will prevent merging between nodes with rounding and other standard nodes.

llvm-svn: 227303
2015-01-28 10:21:27 +00:00
Alexey Samsonov 533948088e Revert "[x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector"
This reverts commits r226953 and r226974.

llvm-svn: 227248
2015-01-27 21:34:11 +00:00
Elena Demikhovsky 1a603b3f13 AVX-512: Changes in operations on masks registers for KNL and SKX
- Added KSHIFTB/D/Q for skx
- Added KORTESTB/D/Q for skx
- Fixed store operation for v8i1 type for KNL
- Store sizes of v8i1, v4i1 and v2i1 were changed to 8 bits

llvm-svn: 227043
2015-01-25 12:47:15 +00:00
Bruno Cardoso Lopes ddcc2e31a7 [x86] Fix a comment
llvm-svn: 226974
2015-01-24 00:22:04 +00:00
Bruno Cardoso Lopes 56567f9135 [x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector
Handle the poor codegen for i64/x86mmx->v2i64 (%mm -> %xmm) moves. Instead of
using a stack store/load pair to do the job, use scalar_to_vector directly, which
in the MMX case can use movq2dq. This was the behavior prior to
improvements for vector legalization of extloads in r213897.

This commit fixes the regression and as a side effect also removes some
unnecessary shuffles.

In the new attached testcase, we go from:

pshufw  $-18, (%rdi), %mm0
movq    %mm0, -8(%rsp)
movq    -8(%rsp), %xmm0
pshufd  $-44, %xmm0, %xmm0
movd    %xmm0, %eax
...

To:

pshufw  $-18, (%rdi), %mm0
movq2dq %mm0, %xmm0
movd    %xmm0, %eax
...

Differential Revision: http://reviews.llvm.org/D7126
rdar://problem/19413324

llvm-svn: 226953
2015-01-23 22:44:16 +00:00
Eric Christopher a1c6e0c8ce Remove some local variables in place of just querying for them
in the couple of asserts.

llvm-svn: 226917
2015-01-23 17:22:44 +00:00
Alexander Potapenko a007905e4e Mark |TLI| variables used to suppress -Wunused-variable warnings.
(These vars are only used in assertions)

llvm-svn: 226815
2015-01-22 13:03:33 +00:00
Elena Demikhovsky 150d9f3187 Fixed a bug in type legalizer for masked load/store intrinsics.
The problem occurs when, after vectorization, we have the type
<2 x i32>. This type is promoted to <2 x i64> and then requires
additional effort for expanding loads and truncating stores.
I added EXPAND / TRUNCATE attributes to the masked load/store
SDNodes. The code now contains additional shuffles.
I've prepared changes in the cost estimation for masked memory
operations; they will be submitted separately.

llvm-svn: 226808
2015-01-22 12:07:59 +00:00
Simon Pilgrim b16b09b154 [X86][SSE] Added support for SSE3 lane duplication shuffle instructions
This patch adds shuffle matching for the SSE3 MOVDDUP, MOVSLDUP and MOVSHDUP instructions. The big benefit is that they keep many single-source shuffles from needing (pre-AVX) dual-source instructions such as SHUFPD/SHUFPS, which cause extra moves and prevent load folds.

Adding these instructions uncovered an issue in XFormVExtractWithShuffleIntoLoad, which crashed on single-operand shuffle instructions (now fixed). It also involved fixing getTargetShuffleMask to correctly identify these instructions as unary shuffles.

Also adds a missing tablegen pattern for MOVDDUP.

Differential Revision: http://reviews.llvm.org/D7042

llvm-svn: 226716
2015-01-21 22:44:35 +00:00
Ahmed Bougacha 8f09e9f7c5 [X86] Declare SSE4.1/AVX2 vector extloads covered by PMOV[SZ]X legal.
Now that we can fully specify extload legality, we can declare them
legal for the PMOVSX/PMOVZX instructions.  This for instance enables
a DAGCombine to fire on code such as
  (and (<zextload-equivalent> ...), <redundant mask>)
to turn it into:
  (zextload ...)
as seen in the testcase changes.

There is one regression, in widen_load-2.ll: we're no longer able
to do store-to-load forwarding with illegal extload memory types.
This will be addressed separately.

Differential Revision: http://reviews.llvm.org/D6533

llvm-svn: 226676
2015-01-21 17:07:06 +00:00
Andrea Di Biagio ae47bc6ab9 [X86][DAG] Disable target specific combine on INSERTPS dag nodes at -O0.
This patch disables target specific combine on X86ISD::INSERTPS dag nodes
if optlevel is CodeGenOpt::None.

The backend currently implements a target specific combine rule that converts
a vector load used by an INSERTPS dag node into a scalar load plus a
scalar_to_vector. This allows ISel to select a single INSERTPSrm instead of
two instructions (i.e. a vector load plus INSERTPSrr).

However, the existing target combine rule on INSERTPS nodes only works under
the assumption that ISel will always be able to match an INSERTPSrm. This is
not true in general at -O0, since the backend only allows folding a load into
the memory operand of an instruction if the optimization level is not
CodeGenOpt::None.

In the example below:

//
__m128 test(__m128 a, __m128 *b) {
  __m128 c = _mm_insert_ps(a, *b, 1 << 6);
  return c;
}
//

Before this patch, at -O0, the backend would have canonicalized the load to 'b'
into a scalar load plus scalar_to_vector. Later on, ISel would have selected an
INSERTPSrr leaving the insertps mask in an inconsistent state:

  movss 4(%rdi), %xmm1
  insertps  $64, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1,2,3].

With this patch, the backend avoids folding the vector load into the operand of
the INSERTPS. The new codegen at -O0 is:

  movaps (%rdi), %xmm1
  insertps  $64, %xmm1, %xmm0 # %xmm1[1],xmm0[1,2,3].

llvm-svn: 226277
2015-01-16 14:55:26 +00:00
Adam Nemet e5dbcb7fd0 [AVX512] Unpack support in new shuffle lowering
This now handles both 32 and 64-bit element sizes.

In this version, the tests are in vector-shuffle-512-v8.ll, canonicalized by
Chandler's update_llc_test_checks.py.

Part of <rdar://problem/17688758>

llvm-svn: 225838
2015-01-13 22:20:18 +00:00
Simon Pilgrim d88ab87064 [X86][SSE] Minor regression fix for r225551
r225551 vector byte shuffle optimization caused an assertion as fully zeroable vectors can be produced under certain circumstances. This fix drops the assert and returns a zero vector where the assert would have failed.

llvm-svn: 225718
2015-01-12 22:38:08 +00:00
Ahmed Bougacha 291833b959 [X86] Also create+widen FMIN/FMAX nodes for v2f32.
This happens in the HINT benchmark, where the SLP-vectorizer created
v2f32 fcmp/select code.  The "correct" solution would have been to
teach the vectorizer cost model that v2f32 isn't legal (because really,
it isn't), but if we can vectorize we might as well do so.

We legalize these v2f32 FMIN/FMAX nodes by widening to v4f32 later on.
v3f32 were already widened to v4f32 by the generic unroll-and-build-vector
legalization.

rdar://15763436
Differential Revision: http://reviews.llvm.org/D6557

llvm-svn: 225691
2015-01-12 20:31:30 +00:00
David Majnemer 14141f941a Revert most of r225597
We can't rely on a DataLayout enlightened constant folder.

llvm-svn: 225599
2015-01-11 07:29:51 +00:00
David Majnemer 292d0c796b X86: Properly decode shuffle masks when the constant pool type is weird
It's possible for the constant pool entry for the shuffle mask to come
from a completely different operation.  This occurs when Constants have
the same bit pattern but have different types.

Make DecodePSHUFBMask tolerant of types which, after a bitcast, are
appropriately sized vector types.

This fixes PR22188.

llvm-svn: 225597
2015-01-11 05:08:57 +00:00
Saleem Abdulrasool 9cf2679d3b X86: teach X86TargetLowering about L,M,O constraints
Teach the ISelLowering for X86 about the L,M,O target specific constraints.
Although, for the moment, clang performs constraint validation and prevents
passing along inline asm which may have immediate constant constraints violated,
the backend should be able to cope with the invalid inline asm a bit better.

llvm-svn: 225596
2015-01-11 04:39:24 +00:00
Simon Pilgrim 94a4cc027a [X86][SSE] Improved (v)insertps shuffle matching
In the current code we only attempt to match against insertps if we have exactly one element from the second input vector, irrespective of how much of the shuffle result is zeroable.

This patch checks to see if there is a single non-zeroable element from either input that requires insertion. It also supports matching of cases where only one of the inputs need to be referenced.

We also split insertps shuffle matching off into a new lowerVectorShuffleAsInsertPS function.

Differential Revision: http://reviews.llvm.org/D6879

llvm-svn: 225589
2015-01-10 19:45:33 +00:00
Simon Pilgrim ec1f2c2cab [X86][SSE] Avoid vector byte shuffles with zero by using pshufb to create zeros
pshufb can shuffle in zero bytes as well as bytes from a source vector - we can use this to avoid having to shuffle 2 vectors and OR the result when the used inputs from a vector are all zeroable.

Differential Revision: http://reviews.llvm.org/D6878

llvm-svn: 225551
2015-01-09 22:03:19 +00:00
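
A scalar emulation of the pshufb semantics the patch leans on: a mask
byte with its high bit set writes zero, so one pshufb can shuffle and
zero lanes at once instead of shuffling two vectors and ORing:

    #include <array>
    #include <cassert>
    #include <cstdint>

    static std::array<uint8_t, 16> pshufb(const std::array<uint8_t, 16> &Src,
                                          const std::array<uint8_t, 16> &Mask) {
      std::array<uint8_t, 16> Out{};
      for (int I = 0; I != 16; ++I)
        Out[I] = (Mask[I] & 0x80) ? 0 : Src[Mask[I] & 0x0F];
      return Out;
    }

    int main() {
      std::array<uint8_t, 16> Src{};
      for (int I = 0; I != 16; ++I)
        Src[I] = 100 + I;
      std::array<uint8_t, 16> Mask{0x80, 1, 0x80, 3}; // rest default to 0
      auto Out = pshufb(Src, Mask);
      assert(Out[0] == 0 && Out[1] == 101 && Out[2] == 0 && Out[3] == 103);
    }
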
Chandler Carruth 685b1803ab [x86] Add a flag to control the vector shuffle legality predicates that
complements the new vector shuffle lowering code path. This flag,
naturally, is *off* because we've not tested or evaluated the results of
this at all. However, the flag will make it much easier to evaluate
whether we can be this aggressive and whether there are missing vector
shuffle lowering optimizations.

llvm-svn: 225491
2015-01-09 01:24:36 +00:00
Ahmed Bougacha d716121888 [X86] Reflow comment. NFC.
llvm-svn: 225455
2015-01-08 17:49:48 +00:00
Michael Kuperstein 46f7d525c3 [X86] Don't try to generate direct calls to TLS globals
The call lowering assumes that if the callee is a global, we want to emit a direct call.
This is correct for regular globals, but not for TLS ones.

Differential Revision: http://reviews.llvm.org/D6862

llvm-svn: 225438
2015-01-08 11:50:58 +00:00
Ahmed Bougacha 2b6917b020 [SelectionDAG] Allow targets to specify legality of extloads' result
type (in addition to the memory type).

The *LoadExt* legalization handling used to only have one type, the
memory type.  This forced users to assume that as long as the extload
for the memory type was declared legal, and the result type was legal,
the whole extload was legal.

However, this isn't always the case.  For instance, on X86, with AVX,
this is legal:
    v4i32 load, zext from v4i8
but this isn't:
    v4i64 load, zext from v4i8
Whereas v4i64 is (arguably) legal, even without AVX2.

Note that the same thing was done a while ago for truncstores (r46140),
but I assume no one needed it yet for extloads, so here we go.

Calls to getLoadExtAction were changed to add the value type, found
manually in the surrounding code.

Calls to setLoadExtAction were mechanically changed, by wrapping the
call in a loop, to match previous behavior.  The loop iterates over
the MVT subrange corresponding to the memory type (FP vectors, etc...).
I also pulled neighboring setTruncStoreActions into some of the loops;
those shouldn't make a difference, as the additional types are illegal.
(e.g., i128->i1 truncstores on PPC.)

No functional change intended.

Differential Revision: http://reviews.llvm.org/D6532

llvm-svn: 225421
2015-01-08 00:51:32 +00:00
Ahmed Bougacha 67dd2d25a3 [CodeGen] Use MVT iterator_ranges in legality loops. NFC intended.
A few loops do trickier things than just iterating on an MVT subset,
so I'll leave them be for now.
Follow-up of r225387.

llvm-svn: 225392
2015-01-07 21:27:10 +00:00
Ahmed Bougacha b994d0c0c5 [X86] Fix 512->256 typo in comments. NFC.
llvm-svn: 225367
2015-01-07 19:38:50 +00:00
Ahmed Bougacha aa2d290997 [X86] Teach FCOPYSIGN lowering to recognize constant magnitudes.
For code like:
    float foo(float x) { return copysign(1.0, x); }
We used to generate:
    andps  <-0.000000e+00,0,0,0>, %xmm0
    movss  <1.000000e+00>, %xmm1
    andps  <nan>, %xmm1
    orps   %xmm0, %xmm1
Basically doing an abs(1.0f) in the two middle instructions.

We now generate:
    andps  <-0.000000e+00,0,0,0>, %xmm0
    orps   <1.000000e+00,0,0,0>, %xmm0

Builds on cleanups r223415, r223542.
rdar://19049548
Differential Revision: http://reviews.llvm.org/D6555

llvm-svn: 225357
2015-01-07 17:33:03 +00:00
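
The bit-level trick, sketched on a scalar float (C++20 std::bit_cast;
the function name is hypothetical):

    #include <bit>
    #include <cassert>
    #include <cstdint>

    // copysign(1.0f, x): keep only the sign bit of x and OR in the bits of
    // 1.0f. Because the magnitude is a constant, there is no need to mask
    // its sign away at run time; that folds into the constant.
    static float copysign1(float X) {
      uint32_t SignBit = std::bit_cast<uint32_t>(X) & 0x80000000u;
      uint32_t One = std::bit_cast<uint32_t>(1.0f); // sign bit already clear
      return std::bit_cast<float>(SignBit | One);
    }

    int main() {
      assert(copysign1(-3.5f) == -1.0f);
      assert(copysign1(2.0f) == 1.0f);
    }
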
David Majnemer 29c52f7449 X86: Don't make illegal GOTTPOFF relocations
"ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF
relocation target a movq or addq instruction.

Prohibit the truncation of such loads to movl or addl.

This fixes PR22083.

Differential Revision: http://reviews.llvm.org/D6839

llvm-svn: 225250
2015-01-06 07:12:52 +00:00
Craig Topper 49758aab94 [X86] Make isel select the shorter form of jump instructions instead of the long form.
The assembler backend will relax to the long form if necessary. This removes a swap from long form to short form in the MCInstLowering code. Selecting the long form used to be required by the old JIT.

llvm-svn: 225242
2015-01-06 04:23:53 +00:00
Simon Pilgrim 4c55af6850 [X86][SSE] lowerVectorShuffleAsByteShift tidyup
Removed the local isSequential predicate and used the standard helper isSequentialOrUndefInRange instead.

llvm-svn: 225216
2015-01-05 22:08:48 +00:00
Simon Pilgrim 71b96b35e1 [X86][SSE] Fixed description for isSequentialOrUndefInRange. NFC.
llvm-svn: 225202
2015-01-05 21:09:48 +00:00
Andrea Di Biagio 6477847ef4 Improved comments. No functional change intended.
llvm-svn: 225080
2015-01-02 10:47:46 +00:00
Andrea Di Biagio 22ee3f63b9 [CodeGenPrepare] Teach when it is profitable to speculate calls to @llvm.cttz/ctlz.
If the control flow is modelling an if-statement where the only instruction in
the 'then' basic block (excluding the terminator) is a call to cttz/ctlz,
CodeGenPrepare can try to speculate the cttz/ctlz call and simplify the control
flow graph.

Example:
\code
entry:
  %cmp = icmp eq i64 %val, 0
  br i1 %cmp, label %end.bb, label %then.bb

then.bb:
  %c = tail call i64 @llvm.cttz.i64(i64 %val, i1 true)
  br label %end.bb

end.bb:
  %cond = phi i64 [ %c, %then.bb ], [ 64, %entry]
\endcode

In this example, basic block %then.bb is taken if value %val is not zero.
Also, the phi node in %end.bb would propagate the size in bits of %val
only if %val is equal to zero.

With this patch, CodeGenPrepare will try to hoist the call to cttz from %then.bb
into basic block %entry only if cttz is cheap to speculate for the target.

Added two new hooks in TargetLowering.h to let targets customize the behavior
(i.e. decide whether it is cheap or not to speculate calls to cttz/ctlz). The
two new methods are 'isCheapToSpeculateCtlz' and 'isCheapToSpeculateCttz'.
By default, both methods return 'false'.
On X86, method 'isCheapToSpeculateCtlz' returns true only if the target has
LZCNT. Method 'isCheapToSpeculateCttz' only returns true if the target has BMI.

Differential Revision: http://reviews.llvm.org/D6728

llvm-svn: 224899
2014-12-28 11:07:35 +00:00
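
The source-level shape that produces the IR above, as a sketch (using
the GCC/Clang builtin, which lowers to @llvm.cttz.i64, and assuming an
LP64 target where unsigned long is 64 bits):

    #include <cassert>

    // The cttz call sits behind the zero check; if cttz is cheap to
    // speculate, CodeGenPrepare hoists it and the branch goes away.
    static long countTrailing(unsigned long V) {
      return V == 0 ? 64 : __builtin_ctzl(V);
    }

    int main() {
      assert(countTrailing(8) == 3);
      assert(countTrailing(0) == 64);
    }
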
Aaron Ballman 4eb5c2e089 Fixing another -Wunused-variable warning, this time in release builds without asserts. NFC.
llvm-svn: 224889
2014-12-27 19:17:53 +00:00
Aaron Ballman b66d54c549 Removing a variable that is set but never used, to silence a -Wunused-but-set-variable warning; NFC.
llvm-svn: 224888
2014-12-27 19:01:19 +00:00
Elena Demikhovsky fcea06acb5 AVX-512: Added FMA instructions, intrinsics an tests for KNL and SKX targets
by Asaf Badouh

http://reviews.llvm.org/D6456

llvm-svn: 224764
2014-12-23 10:30:39 +00:00
Elena Demikhovsky 3121449f0b AVX-512: BLENDM - fixed encoding of the broadcast version
Added more intrinsics and encoding tests.

llvm-svn: 224760
2014-12-23 09:36:28 +00:00
Jim Grosbach 1bd0f3530e X86: Don't over-align combined loads.
When combining consecutive loads+inserts into a single vector load,
we should keep the alignment of the base load. Doing otherwise can, and does,
lead to using overly aligned instructions. In the included test case, for
example, using a 32-byte vmovaps on a 16-byte aligned value. Oops.

rdar://19190968

llvm-svn: 224746
2014-12-23 00:35:23 +00:00
Reid Kleckner ce0093344f Make musttail more robust for vector types on x86
Previously I tried to plug musttail into the existing vararg lowering
code. That turned out to be a mistake, because non-vararg calls use
significantly different register lowering, even on x86. For example, AVX
vectors are usually passed in registers to normal functions and memory
to vararg functions.  Now musttail uses a completely separate lowering.

Hopefully this can be used as the basis for non-x86 perfect forwarding.

Reviewers: majnemer

Differential Revision: http://reviews.llvm.org/D6156

llvm-svn: 224745
2014-12-22 23:58:37 +00:00
Bruno Cardoso Lopes 811c173523 [x86] Add vector @llvm.ctpop intrinsic custom lowering
Currently, when ctpop is supported for scalar types, the expansion of
@llvm.ctpop.vXiY uses vector element extractions, insertions and individual
calls to @llvm.ctpop.iY. When it is not, the scalar calls are expanded with
bit-math operations.

Local haswell measurements show that we can improve vector @llvm.ctpop.vXiY
expansion in some cases by using a vector parallel bit-twiddling
approach, based on:

v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
v = v + (v >> 8);
v = v + (v >> 16);
v = v & 0x0000003F;
(from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)

When scalar ctpop isn't supported, the approach above performs better for
v2i64, v4i32, v4i64 and v8i32 (see numbers below). And even when scalar ctpop
is supported, this approach performs ~2x better for v8i32.

Here, x86_64 implies -march=corei7-avx without ctpop and x86_64h includes ctpop
support with -march=core-avx2.

== [x86_64h - new]
v8i32: 0.661685
v4i32: 0.514678
v4i64: 0.652009
v2i64: 0.324289
== [x86_64h - old]
v8i32: 1.29578
v4i32: 0.528807
v4i64: 0.65981
v2i64: 0.330707

== [x86_64 - new]
v8i32: 1.003
v4i32: 0.656273
v4i64: 1.11711
v2i64: 0.754064
== [x86_64 - old]
v8i32: 2.34886
v4i32: 1.72053
v4i64: 1.41086
v2i64: 1.0244

More work for other vector types will come next.

llvm-svn: 224725
2014-12-22 19:45:43 +00:00
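As a scalar reference for the bit-twiddling sequence above, here is a
standalone C++ version (an editor's illustration, not code from the patch;
the vector lowering applies the same steps lane-wise with SSE/AVX operations):

  #include <cstdint>

  uint32_t popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555);                 // 2-bit partial counts
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);  // 4-bit partial counts
    v = (v + (v >> 4)) & 0x0F0F0F0F;                 // 8-bit partial counts
    v = v + (v >> 8);                                // fold byte counts
    v = v + (v >> 16);
    return v & 0x3F;                                 // 0..32 fits in 6 bits
  }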
Elena Demikhovsky 949b0d46bf AVX-512: Added all forms of BLENDM instructions,
intrinsics, encoding tests for AVX-512F and skx instructions.

llvm-svn: 224707
2014-12-22 13:52:48 +00:00
Elena Demikhovsky fb73ca516b Masked load and store codegen - fixed 128-bit vectors
The codegen failed on 128-bit types on AVX2.
Added patterns in the td files, plus tests.

llvm-svn: 224647
2014-12-19 23:27:57 +00:00
Robert Khasanov 79fb7292d7 [AVX512] Enable FP arithmetic lowering for AVX512VL subsets.
Added RegOp2MemOpTable4 to transform 4th operand from register to memory in merge-masked versions of instructions. 
Added lowering tests.

llvm-svn: 224516
2014-12-18 12:28:22 +00:00
Michael Kuperstein 047b1a0400 [DAGCombine] Slightly improve lowering of BUILD_VECTOR into a shuffle.
This handles the case of a BUILD_VECTOR being constructed out of elements extracted from a vector twice the size of the result vector. Previously this was always scalarized. Now, we try to construct a shuffle node that feeds on extract_subvectors.

This fixes PR15872 and provides a partial fix for PR21711.

Differential Revision: http://reviews.llvm.org/D6678

llvm-svn: 224429
2014-12-17 12:32:17 +00:00
Quentin Colombet fc2201e922 [CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so the optimization
does not kick in for such cases.

Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.

This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.

Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.


** Context **

Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.


** Motivating Example **

Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
  %ld = load i8* %addr1
  %zextld = zext i8 %ld to i32
  %ld2 = load i32* %addr2
  %add = add nsw i32 %ld2, %zextld
  %sextadd = sext i32 %add to i64
  %zexta = zext i8 %a to i32
  %addza = add nsw i32 %zexta, %zextld
  %sextaddza = sext i32 %addza to i64
  %addb = add nsw i32 %b, %zextld
  %sextaddb = sext i32 %addb to i64
  call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
  ret void
}

As it is, this IR generates the following assembly on x86_64:
[...]
  movzbl  (%rdi), %eax   # zero-extended load
  movl  (%rsi), %esi     # plain load
  addl  %eax, %esi       # 32-bit add
  movslq  %esi, %rdi     # sign extend the result of add
  movzbl  %dl, %edx      # zero extend the first argument
  addl  %eax, %edx       # 32-bit add
  movslq  %edx, %rsi     # sign extend the result of add
  addl  %eax, %ecx       # 32-bit add
  movslq  %ecx, %rdx     # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.

Now, by promoting the additions to form more extended loads we would generate:
[...]
  movzbl  (%rdi), %eax   # zero-extended load
  movslq  (%rsi), %rdi   # sign-extended load
  addq  %rax, %rdi       # 64-bit add
  movzbl  %dl, %esi      # zero extend the first argument
  addq  %rax, %rsi       # 64-bit add
  movslq  %ecx, %rdx     # sign extend the second argument
  addq  %rax, %rdx       # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.

This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.

Note: The throughput numbers are similar on Sandy Bridge and Haswell.


** Proposed Solution **

To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))

The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.


** Performance **

Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar:  ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%

The results are consistent for both O3 and Os.

<rdar://problem/18310086>

llvm-svn: 224402
2014-12-17 01:36:17 +00:00
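The IR in the motivating example corresponds to source along these lines (a
hypothetical C++ reduction, for illustration only): 32-bit arithmetic on
loaded values whose result feeds a 64-bit use, which is exactly the
sext(add nsw i32 ...) chain the promotion folds into an extended load:

  #include <cstdint>

  int64_t index(const uint8_t *p, const int32_t *q) {
    int32_t ld = *p;       // becomes a zero-extended i8 load
    int32_t t  = *q + ld;  // 32-bit add of the loaded values
    return t;              // sign extension of the add result to i64
  }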
Reid Kleckner 04b69f89aa Revert "[CodeGenPrepare] Move sign/zero extensions near loads using type promotion."
This reverts commit r224351. It causes assertion failures when building
ICU.

llvm-svn: 224397
2014-12-17 00:29:23 +00:00
Quentin Colombet d5e57b731f [CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.

Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.


** Context **

Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.


** Motivating Example **

Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
  %ld = load i8* %addr1
  %zextld = zext i8 %ld to i32
  %ld2 = load i32* %addr2
  %add = add nsw i32 %ld2, %zextld
  %sextadd = sext i32 %add to i64
  %zexta = zext i8 %a to i32
  %addza = add nsw i32 %zexta, %zextld
  %sextaddza = sext i32 %addza to i64
  %addb = add nsw i32 %b, %zextld
  %sextaddb = sext i32 %addb to i64
  call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
  ret void
}

As it is, this IR generates the following assembly on x86_64:
[...]
  movzbl  (%rdi), %eax   # zero-extended load
  movl  (%rsi), %esi     # plain load
  addl  %eax, %esi       # 32-bit add
  movslq  %esi, %rdi     # sign extend the result of add
  movzbl  %dl, %edx      # zero extend the first argument
  addl  %eax, %edx       # 32-bit add
  movslq  %edx, %rsi     # sign extend the result of add
  addl  %eax, %ecx       # 32-bit add
  movslq  %ecx, %rdx     # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.

Now, by promoting the additions to form more extended loads we would generate:
[...]
  movzbl  (%rdi), %eax   # zero-extended load
  movslq  (%rsi), %rdi   # sign-extended load
  addq  %rax, %rdi       # 64-bit add
  movzbl  %dl, %esi      # zero extend the first argument
  addq  %rax, %rsi       # 64-bit add
  movslq  %ecx, %rdx     # sign extend the second argument
  addq  %rax, %rdx       # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.

This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.

Note: The throughput numbers are similar on Sandy Bridge and Haswell.


** Proposed Solution **

To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))

The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.


** Performance **

Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar:  ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%

The results are consistent for both O3 and Os.

<rdar://problem/18310086>

llvm-svn: 224351
2014-12-16 19:09:03 +00:00
Robert Khasanov d04cd2fbfe [AVX512] Enable integer arithmetic lowering for AVX512BW/VL subsets.
Added lowering tests.

llvm-svn: 224349
2014-12-16 18:24:07 +00:00
Elena Demikhovsky 72860c341e AVX-512: Added EXPAND instructions and intrinsics.
llvm-svn: 224241
2014-12-15 10:03:52 +00:00
Robert Khasanov 37c3ad6c20 [AVX512] Enabling bit logic lowering
Added lowering tests.

llvm-svn: 224132
2014-12-12 17:02:18 +00:00
Robert Khasanov e82a3630b7 [AVX512] Enabling MIN/MAX lowering.
Added lowering tests.

llvm-svn: 224127
2014-12-12 15:10:43 +00:00
Sanjay Patel 757942a38f remove function names from comments; NFC
llvm-svn: 224080
2014-12-11 23:38:43 +00:00
Sanjay Patel c694ac5519 return without temporary; NFC
llvm-svn: 224076
2014-12-11 23:30:36 +00:00
Elena Demikhovsky 908dbf48c8 AVX-512: Added all forms of COMPRESS instruction
+ intrinsics + tests

llvm-svn: 224019
2014-12-11 15:02:24 +00:00
Elena Demikhovsky fc081457f1 AVX-512: Fixed a bug in lowering setcc for MVT::i1 type
llvm-svn: 224008
2014-12-11 10:21:12 +00:00
Michael Kuperstein 0104ff6529 [X86] Make a code path in EltsFromConsecutiveLoads work only on vectors it expects
EltsFromConsecutiveLoads was apparently only ever called for 128-bit vectors, and assumed this implicitly. r223518 started calling it for AVX-sized vectors, causing the code path that had this assumption to crash.
This adds a check to make this path fire only for 128-bit vectors.

Differential Revision: http://reviews.llvm.org/D6579

llvm-svn: 223922
2014-12-10 08:46:12 +00:00
Elena Demikhovsky fa4a6c18f7 AVX-512: Added some comments to ERI scalar intrinsics.
No functional change.

llvm-svn: 223761
2014-12-09 07:06:32 +00:00
Andrea Di Biagio 64bc246f3f [X86] Improved lowering of packed v8i16 vector shifts by non-constant count.
Before this patch, the backend sub-optimally expanded the non-constant shift
count of a v8i16 shift into a sequence of two 'movd' plus 'movzwl'.

With this patch the backend checks if the target has SSE4.1. If so, it lets
the shuffle legalizer deal with the expansion of the shift amount.

Example:
;;
define <8 x i16> @test(<8 x i16> %A, <8 x i16> %B) {
  %shamt = shufflevector <8 x i16> %B, <8 x i16> undef, <8 x i32> zeroinitializer
  %shl = shl <8 x i16> %A, %shamt
  ret <8 x i16> %shl
}
;;

Before (with -mattr=+avx):
  vmovd  %xmm1, %eax
  movzwl  %ax, %eax
  vmovd  %eax, %xmm1
  vpsllw  %xmm1, %xmm0, %xmm0
  retq

Now:
  vpxor  %xmm2, %xmm2, %xmm2
  vpblendw  $1, %xmm1, %xmm2, %xmm1
  vpsllw  %xmm1, %xmm0, %xmm0
  retq

llvm-svn: 223660
2014-12-08 14:36:51 +00:00
Elena Demikhovsky 68e04b8613 X86 intrinsics moved from X86ISelLowering.cpp to X86IntrinsicsInfo.h
X86ISelLowering.cpp has a long switch for intrinsics. I moved a part of
this long switch to the new intrinsics table in X86IntrinsicsInfo.h.
No functional changes, just code and compile time optimization.

llvm-svn: 223641
2014-12-08 09:03:08 +00:00
Ahmed Bougacha 89bc485085 [X86] Cleanup FCOPYSIGN lowering. NFC intended.
llvm-svn: 223542
2014-12-05 23:11:36 +00:00
Sanjay Patel 4bf9b7685c Optimize merging of scalar loads for 32-byte vectors [X86, AVX]
Fix the poor codegen seen in PR21710 ( http://llvm.org/bugs/show_bug.cgi?id=21710 ).
Before we crack 32-byte build vectors into smaller chunks (and then subsequently
glue them back together), we should look for the easy case where we can just load
all elements in a single op.

An example of the codegen change is:

From:

vmovss  16(%rdi), %xmm1
vmovups (%rdi), %xmm0
vinsertps       $16, 20(%rdi), %xmm1, %xmm1
vinsertps       $32, 24(%rdi), %xmm1, %xmm1
vinsertps       $48, 28(%rdi), %xmm1, %xmm1
vinsertf128     $1, %xmm1, %ymm0, %ymm0
retq

To:

vmovups (%rdi), %ymm0
retq

Differential Revision: http://reviews.llvm.org/D6536

llvm-svn: 223518
2014-12-05 21:28:14 +00:00
Jan Wen Voung f547861ba0 Use 32-bit ebp for NaCl64 in a limited case: llvm.frameaddress.
Summary:
Follow up to [x32] "Use ebp/esp as frame and stack pointer":
http://reviews.llvm.org/D4617

In that earlier patch, NaCl64 was made to always use rbp.
That's needed for most cases because rbp should hold a full
64-bit address within the NaCl sandbox so that load/stores
off of rbp don't require sandbox adjustment (zeroing the top
32-bits, then filling those by adding r15).

However, llvm.frameaddress returns a pointer and pointers
are 32-bit for NaCl64. In this case, use ebp instead, which
will make the register copy type check. A similar mechanism
may be needed for llvm.eh.return, but is not added in this change.

Test Plan: test/CodeGen/X86/frameaddr.ll

Reviewers: dschuff, nadav

Subscribers: jfb, llvm-commits

Differential Revision: http://reviews.llvm.org/D6514

llvm-svn: 223510
2014-12-05 20:55:53 +00:00
Andrea Di Biagio 3e425c8d19 [X86] Improved lowering of packed vector shifts to vpsllq/vpsrlq.
SSE2/AVX non-constant packed shift instructions only use the lower 64-bit of
the shift count. 

This patch teaches function 'getTargetVShiftNode' how to deal with shifts
where the shift count node is of type MVT::i64.

Before this patch, function 'getTargetVShiftNode' only knew how to deal with
shift count nodes of type MVT::i32. This forced the backend to wrongly
truncate the shift count to MVT::i32, and then zero-extend it back to MVT::i64.

llvm-svn: 223505
2014-12-05 20:02:22 +00:00
Andrea Di Biagio 2876a67312 [X86] Avoid introducing extra shuffles when lowering packed vector shifts.
When lowering a vector shift node, the backend checks if the shift count is a
shuffle with a splat mask. If so, then it introduces an extra dag node to
extract the splat value from the shuffle. The splat value is then used
to generate a shift count of a target specific shift.

However, if we know that the shift count is a splat shuffle, we can use the
splat index 'I' to extract the I-th element from the first shuffle operand.
The advantage is that the splat shuffle may become dead since we no longer
use it.

Example:

;;
define <4 x i32> @example(<4 x i32> %a, <4 x i32> %b) {
  %c = shufflevector <4 x i32> %b, <4 x i32> undef, <4 x i32> zeroinitializer
  %shl = shl <4 x i32> %a, %c
  ret <4 x i32> %shl
}
;;

Before this patch, llc generated the following code (-mattr=+avx):
  vpshufd $0, %xmm1, %xmm1   # xmm1 = xmm1[0,0,0,0]
  vpxor  %xmm2, %xmm2, %xmm2
  vpblendw $3, %xmm1, %xmm2, %xmm1 # xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7]
  vpslld %xmm1, %xmm0, %xmm0
  retq

With this patch, the redundant splat operation is removed from the code.
  vpxor  %xmm2, %xmm2, %xmm2
  vpblendw $3, %xmm1, %xmm2, %xmm1 # xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7]
  vpslld %xmm1, %xmm0, %xmm0
  retq

llvm-svn: 223461
2014-12-05 12:13:30 +00:00
Eric Christopher 2189515132 Rename the x86 isTargetMacho to isTargetMachO for uniformity.
llvm-svn: 223421
2014-12-05 00:22:38 +00:00
Eric Christopher 66322e822c Both of these subtargets have functions that check whether or
not the target is mach-o. Use them.

llvm-svn: 223420
2014-12-05 00:22:35 +00:00
Ahmed Bougacha 24ebb93da1 [X86] Delete dead code in fcopysign lowering. NFC.
r32900 introduced custom lowering for fcopysign, with two checks to
change the magnitude value's type if it's larger/smaller than the sign
value's type.  r32932 replaced that code for the smaller case.
r43205 did the same for the larger case, but left the old code, now dead.

llvm-svn: 223415
2014-12-04 23:52:15 +00:00
Bruno Cardoso Lopes fd52b95530 [x86] Fix isOffsetSuitableForCodeModel kernel code model offset
Offset == 0 is a valid offset for kernel code model according to the
x86_64 System V ABI. Found by inspection, no testcase.

llvm-svn: 223383
2014-12-04 20:36:06 +00:00
Michael Kuperstein 0492bd2b9e [X86] Improve a dag-combine that handles a vector extract -> zext sequence.
The current DAG combine turns a sequence of extracts from <4 x i32> followed by zexts into a store followed by scalar loads.
According to measurements by Martin Krastev (see PR 21269) for x86-64, a sequence of an extract, movs and shifts gives better performance. However, for 32-bit x86, the previous sequence still seems better.

Differential Revision: http://reviews.llvm.org/D6501

llvm-svn: 223360
2014-12-04 13:49:51 +00:00
Andrea Di Biagio 61fac30180 [X86] Simplify code. NFC.
Replaced some logic that checked if a build_vector node is doing a splat of a
non-undef value with a call to method BuildVectorSDNode::getSplatValue().
No functional change intended.

llvm-svn: 223354
2014-12-04 11:21:44 +00:00
Elena Demikhovsky f1de34b84d Masked Load / Store Intrinsics - the CodeGen part.
I'm recommiting the codegen part of the patch.
The vectorizer part will be send to review again.

Masked Vector Load and Store Intrinsics.
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)

Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.

http://reviews.llvm.org/D6191

llvm-svn: 223348
2014-12-04 09:40:44 +00:00
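A typical loop that the vectorizer would turn into these intrinsics looks
like the following (an editor's sketch of the pattern, not a test from the
patch); the compare becomes the vector mask, and the guarded accesses become
llvm.masked.load / llvm.masked.store:

  void add_one_if(int n, const int *trigger, const float *in, float *out) {
    for (int i = 0; i < n; ++i)
      if (trigger[i] > 0)       // vectorized compare produces the mask
        out[i] = in[i] + 1.0f;  // executes only in lanes where the mask is on
  }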
Michael Liao 5bf9578ce4 [X86] Clean up whitespace as well as minor coding style
llvm-svn: 223339
2014-12-04 05:20:33 +00:00
Michael Liao d8faa61b20 [X86] Restore X86 base pointer after call to llvm.eh.sjlj.setjmp

- This patch fixes the bug described in
  http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-May/062343.html

The fix allocates an extra slot just below the GPRs and stores the base pointer
there. This is done only for functions containing llvm.eh.sjlj.setjmp that also
need a base pointer. Because code containing llvm.eh.sjlj.setjmp saves all of
the callee-save GPRs in the prologue, the offset to the extra slot can be
computed before prologue generation runs.

Impact at run-time on affected functions is:

  - One extra store in the prologue; the store saves the base pointer.
  - One extra load after a llvm.eh.sjlj.setjmp. The load restores the base pointer.

Because the extra slot is just above a gap between frame-pointer-relative and
base-pointer-relative chunks of memory, there is no impact on other offset
calculations other than ensuring there is room for the extra slot.

http://reviews.llvm.org/D6388

Patch by Arch Robison <arch.robison@intel.com>

llvm-svn: 223329
2014-12-04 00:56:38 +00:00
Sanjay Patel 23b7ce2725 fix typos, grammar, formatting; NFC
llvm-svn: 223276
2014-12-03 22:28:05 +00:00
Sanjay Patel 57f033a024 fix typo in comment
llvm-svn: 223127
2014-12-02 17:25:27 +00:00
Philip Reames 0365f1a376 [Statepoints 2/4] Statepoint infrastructure for garbage collection: MI & x86-64 Backend
This is the second patch in a small series.  This patch contains the MachineInstruction and x86-64 backend pieces required to lower Statepoints.  It does not include the code to actually generate the STATEPOINT machine instruction and as a result, the entire patch is currently dead code.  I will be submitting the SelectionDAG parts within the next 24-48 hours.  Since those pieces are by far the most complicated, I wanted to minimize the size of that patch.  That patch will include the tests which exercise the functionality in this patch.  The entire series can be seen as one combined whole in http://reviews.llvm.org/D5683.

The STATEPOINT pseudo node is generated after all gc values are explicitly spilled to stack slots.  The purpose of this node is to wrap an actual call instruction while recording the spill locations of the meta arguments used for garbage collection and other purposes.  The STATEPOINT is modeled as modifying all of those locations to prevent backend optimizations from forwarding the value from before the STATEPOINT to after the STATEPOINT.  (Doing so would break relocation semantics for collectors which wish to relocate roots.)

The implementation of STATEPOINT is closely modeled on PATCHPOINT.  Eventually, much of the code in this patch will be removed.  The long term plan is to merge the functionality provided by statepoints and patchpoints.  Merging their implementations in the backend is likely to be a good starting point.

Reviewed by: atrick, ributzka

llvm-svn: 223085
2014-12-01 22:52:56 +00:00
Duncan P. N. Exon Smith 9bc81fbe92 Revert "Masked Vector Load and Store Intrinsics."
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot.  I'll respond to the commit on the
list with a reproduction of one of the failures.

Conflicts:
	lib/Target/X86/X86TargetTransformInfo.cpp

llvm-svn: 222936
2014-11-28 21:29:14 +00:00
Elena Demikhovsky 905a5a606f AVX-512: Scalar ERI intrinsics
including SAE mode and memory operand.
Added an AVX512_maskable_scalar template that should cover all scalar instructions in the future.

The main difference between AVX512_maskable_scalar<> and AVX512_maskable<> is using X86select instead of vselect.
I need it because I can't create a vselect node with an MVT::i1 mask for a scalar instruction.

http://reviews.llvm.org/D6378

llvm-svn: 222820
2014-11-26 10:46:49 +00:00
Simon Pilgrim 371417db34 [X86][SSE] Improvements to byte shift shuffle matching
Since the (v)pslldq / (v)psrldq instructions resolve to a single input argument, it is useful to match them much earlier than we currently do - this prevents more complicated shuffles (notably insertion into a zero vector) from matching before them.

Differential Revision: http://reviews.llvm.org/D6409

llvm-svn: 222796
2014-11-25 22:34:59 +00:00
Cameron McInally 9b7c15a364 [AVX512] Add 512b integer shift by variable intrinsics and patterns.
llvm-svn: 222786
2014-11-25 20:41:51 +00:00
Andrea Di Biagio 23e2cfa834 [X86] Improved target specific combine on VSELECT dag nodes.
This patch teaches function 'transformVSELECTtoBlendVECTOR_SHUFFLE' how to
convert VSELECT dag nodes to shuffles on targets that do not have SSE4.1.
On pre-SSE4.1 targets, we can still perform blend operations using movss/movsd.

Also, removed a target specific combine that performed a premature lowering of
VSELECT nodes to target specific MOVSS/MOVSD nodes.

llvm-svn: 222647
2014-11-24 12:23:15 +00:00
Michael Kuperstein 9ef90647b9 [X86] Fixes bug in build_vector v4x32 lowering
r222375 made some improvements to build_vector lowering of v4i32 and v4f32 into an insertps, but it missed a case where:

1. A single extracted element is used twice.
2. The lower of the two non-zero indexes should be preserved, and the higher should be used for the dest mask.

This caused a crash, since the source value for the insertps ends up uninitialized.

Differential Revision: http://reviews.llvm.org/D6377

llvm-svn: 222635
2014-11-23 13:09:06 +00:00
Elena Demikhovsky 9e5089a938 Masked Vector Load and Store Intrinsics.
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)

Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.

http://reviews.llvm.org/D6191

llvm-svn: 222632
2014-11-23 08:07:43 +00:00
Chandler Carruth 5852b1f4c2 [x86] Teach the vector shuffle yet another step of canonicalization.
No functionality changed yet, but this will prevent subsequent patches
from having to handle permutations of various interleaved shuffle
patterns.

llvm-svn: 222614
2014-11-22 09:18:53 +00:00
Sanjay Patel 501890e909 Add a feature flag for slow 32-byte unaligned memory accesses [x86].
This patch adds a feature flag to avoid unaligned 32-byte load/store AVX codegen
for Sandy Bridge and Ivy Bridge. There is no functionality change intended for 
those chips. Previously, the absence of AVX2 was being used as a proxy to detect
this feature. But that hindered codegen for AVX-enabled AMD chips such as btver2
that do not have the 32-byte unaligned access slowdown.

Performance measurements are included in PR21541 ( http://llvm.org/bugs/show_bug.cgi?id=21541 ).

Differential Revision: http://reviews.llvm.org/D6355

llvm-svn: 222544
2014-11-21 17:40:04 +00:00
Chandler Carruth ce5a26b0e7 [x86] Restructure the checking patterns for v16 and v32 avx2 vector
shuffle lowering to allow much better blend matching.

Specifically, with the new structure the code seems clearer to me and we
correctly can hit the cases where merging two 128-bit lanes is a clear
win and can be shuffled cheaply afterward.

llvm-svn: 222539
2014-11-21 14:53:03 +00:00
Chandler Carruth 6c4d1ea8c4 [x86] Make the previous logic significantly less conservative and get
a bunch more improvements.

Non-lane-crossing is fine; the key is that lane merging only makes sense
for single-input shuffles. Not sure why I got so turned around here. The
code all works, I was just using the wrong model for it.

This only updates v4 and v8 lowering. The v16 and v32 lowering requires
restructuring the entire check sequence.

llvm-svn: 222537
2014-11-21 14:33:24 +00:00
Chandler Carruth d2b19bc867 [x86] Teach the x86 vector shuffle lowering to detect mergable 128-bit
lanes.

By special casing these we can often either reduce the total number of
shuffles significantly or reduce the number of (high latency on Haswell)
AVX2 shuffles that potentially cross 128-bit lanes. Even when these
don't actually cross lanes, they have much higher latency to support
that. Doing two of them and a blend is worse than doing a single insert
across the 128-bit lanes to blend and then doing a single interleaved
shuffle.

While this seems like a narrow case, it kept cropping up on me and the
difference is *huge* as you can see in many of the test cases. I first
hit this trying to perfectly fix the interleaving shuffle patterns used
by Halide for AVX2.

llvm-svn: 222533
2014-11-21 13:56:05 +00:00
Alexey Volkov fd1731d876 [X86] For Silvermont CPU use 16-bit division instead of 64-bit for small positive numbers
Differential Revision: http://reviews.llvm.org/D5938

llvm-svn: 222521
2014-11-21 11:19:34 +00:00
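A scalar model of the idea (editor's sketch; the actual patch works on the
DAG during lowering): guard a narrow divide with a cheap range check, since
16-bit DIV has far lower latency on Silvermont than 64-bit DIV:

  #include <cstdint>

  uint64_t fast_div(uint64_t a, uint64_t b) {
    if (((a | b) >> 16) == 0)              // both values fit in 16 bits
      return (uint16_t)a / (uint16_t)b;    // cheap 16-bit divide
    return a / b;                          // general 64-bit divide
  }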
Quentin Colombet a7439d4483 [X86] Do not custom lower UINT_TO_FP when the target type does not
match the custom lowering.

<rdar://problem/19026326>

llvm-svn: 222489
2014-11-21 00:47:19 +00:00
Reid Kleckner 343c395f11 Fix more instances of -Wsentinel on Windows with s/NULL/nullptr/
Follow up to r221940, where I must not have caught em all. NFC

llvm-svn: 222481
2014-11-20 23:51:47 +00:00
Saleem Abdulrasool 2f3b3f3182 X86: use the correct alloca symbol for Windows Itanium
Windows Itanium targets the MSVCRT, and the stack probe symbol is provided by
MSVCRT.  This corrects the emission of stack probes on i686-windows-itanium.

llvm-svn: 222439
2014-11-20 18:01:26 +00:00
Andrea Di Biagio 1b657bfcc8 [X86] Improved lowering of v4x32 build_vector dag nodes.
This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes
that are known to have at least two non-zero elements.

With this patch, a build_vector that performs a blend with zero is 
converted into a shuffle. This is done to let the shuffle legalizer expand
the dag node in an optimal way. For example, if we know that a build_vector
performs a blend with zero, we can try to lower it as a movq/blend instead of
always selecting an insertps.

This patch also improves the logic that lowers a build_vector into an insertps
with zero masking. See for example the extra test cases added to test sse41.ll.

Differential Revision: http://reviews.llvm.org/D6311

llvm-svn: 222375
2014-11-19 19:34:29 +00:00
Simon Pilgrim 3ac3b251a9 [X86][SSE] pslldq/psrldq byte shifts/rotation for SSE2
This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - for pre-SSSE3 it is only enabled on i8 and i16 vector types, where it is a more definite performance gain.

I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr.

Differential Revision: http://reviews.llvm.org/D5699

llvm-svn: 222340
2014-11-19 10:06:49 +00:00
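A scalar model of the byte-shift primitive being matched (editor's
illustration): PSLLDQ shifts the whole 128-bit register up by n bytes and
fills the vacated bytes with zeros, which is why, unlike PALIGNR, no zero
register needs to be materialized:

  #include <cstdint>
  #include <cstring>

  // out[i] = (i >= n) ? in[i - n] : 0, for a 16-byte vector, 0 <= n <= 15
  void pslldq_model(uint8_t out[16], const uint8_t in[16], int n) {
    std::memset(out, 0, 16);           // implicit zero fill
    std::memcpy(out + n, in, 16 - n);  // surviving bytes shift up by n
  }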
Simon Pilgrim 6d675f4e35 [X86][SSE] Improve legal SHUFP and PSHUFD shuffle matching
Updated X86TargetLowering::isShuffleMaskLegal to match SHUFP masks with commuted inputs and PSHUFD masks that reference the second input.

As part of this I've refactored isPSHUFDMask to work in a more general manner and allow it to match against either the first or second input vector.

Differential Revision: http://reviews.llvm.org/D6287

llvm-svn: 222087
2014-11-15 21:13:05 +00:00
Tim Northover d3be12a6c7 X86: use getConstant rather than getTargetConstant behind BUILD_VECTOR.
getTargetConstant should only be used when you can guarantee the instruction
selected will be able to cope with the raw value. BUILD_VECTOR is rather too
generic for this so we should use getConstant instead. In that case, an
instruction can still consume the constant, but if it doesn't it'll be
materialised through its own round of ISel.

Should fix PR21352.

llvm-svn: 221961
2014-11-14 01:30:14 +00:00
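For future reference, a sketch of the distinction (editor's illustration;
the SelectionDAG signatures are approximate for this era of the API, so
treat the exact calls as an assumption):

  // Prefer getConstant: the element stays an ordinary constant node that
  // ISel can materialise on its own if no consuming instruction folds it.
  SDValue Elt = DAG.getConstant(Imm, MVT::i32);     // not getTargetConstant
  SmallVector<SDValue, 4> Ops(4, Elt);
  SDValue Vec = DAG.getNode(ISD::BUILD_VECTOR, dl, MVT::v4i32, Ops);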
Aditya Nandakumar 3053155652 We can get the TLOF from the TargetMachine - so the constructor no longer requires a TargetLoweringObjectFile to be passed.
llvm-svn: 221926
2014-11-13 21:29:21 +00:00
Elena Demikhovsky d5e95b57e0 AVX-512: SINT_TO_FP cost model and some bugfixes
Checked some corner cases, for example translation
of <8 x i1> to <8 x double>

llvm-svn: 221883
2014-11-13 11:46:16 +00:00
Aditya Nandakumar a27193297f This patch changes the ownership of TLOF from TargetLoweringBase to TargetMachine so that different subtargets can share the TLOF effectively
llvm-svn: 221878
2014-11-13 09:26:31 +00:00
Chandler Carruth fee91883f4 [x86] Teach the vector shuffle lowering to make a more nuanced decision
between splitting a vector into 128-bit lanes and recombining them vs.
decomposing things into single-input shuffles and a final blend.

This handles a large number of cases in AVX1 where the cross-lane
shuffles would be much more expensive to represent even though we end up
with a fast blend at the root. Instead, we can do a better job of
shuffling in a single lane and then inserting it into the other lanes.

This fixes the remaining bits of Halide's regression captured in PR21281
for AVX1. However, the bug persists in AVX2 because I've made this
change reasonably conservative. The cases where it makes sense in AVX2
to split into 128-bit lanes are much more rare because we can often do
full permutations across all elements of the 256-bit vector. However,
the particular test case in PR21281 is an example of one of the rare
cases where it is *always* better to work in a single 128-bit lane. I'm
going to try to teach the logic to detect and form the good code even in
AVX2 next, but it will need to use a separate heuristic.

Finally, there is one pesky regression here where we previously would
craftily use vpermilps in AVX1 to shuffle both high and low halves at
the same time. We no longer pull that off, and not for any really good
reason. Ultimately, I think this is just another missing nuance to the
selection heuristic that I'll try to add in afterward, but this change
already seems strictly worth doing considering the magnitude of the
improvements in common matrix math shuffle patterns.

As always, please let me know if this causes a surprising regression for
you.

llvm-svn: 221861
2014-11-13 04:06:10 +00:00
Chandler Carruth 253dd39a9a [x86] Don't form overly fragmented blends when splitting and
re-combining shuffles because nothing was available in the wider vector
type.

The key observation (which I've put in the comments for future
maintainers) is that at this point, no further combining is really
possible. And so even though these shuffles trivially could be combined,
we need to actually do that combining as we produce the shuffles, this late
in the lowering.

This fixes another (huge) part of the Halide vector shuffle regressions.
As it happens, this was already well covered by the tests, but I hadn't
noticed how bad some of these got. The specific patterns that turn
directly into unpckl/h patterns were occurring *many* times in common
vector processing code.

There are still more problems here sadly, but trying to incrementally
tease them apart and it looks like this is the core of the problem in
the splitting logic.

There is some chance of regression here, you can see it in the test
changes. Specifically, where we stop forming pshufb in some cases, it is
possible that pshufb was in fact faster. Intel "says" that pshufb is
slower than the instruction sequences replacing it.

llvm-svn: 221852
2014-11-13 02:42:08 +00:00
Sanjay Patel f6f7d5d1dd Expose the number of Newton-Raphson iterations applied to the hardware's reciprocal estimate as a parameter (x86).
This is a follow-on to r221706 and r221731 and discussed in more detail in PR21385.

This patch also loosens the testcase checking for btver2. We know that the "1.0" will be loaded, but
we can't tell exactly when, so replace the CHECK-NEXT specifiers with plain CHECKs. The CHECK-NEXT
sequence relied on a quirk of post-RA-scheduling that may change independently of anything in these tests.

llvm-svn: 221819
2014-11-12 21:39:01 +00:00
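For background, the refinement step referred to here is the standard
Newton-Raphson iteration for a reciprocal, x1 = x0 * (2 - d * x0), where x0
is the hardware estimate of 1/d (general formula, not code from the patch):

  // One refinement step; each iteration roughly doubles the number of
  // correct bits in the estimate.
  float refine_recip(float d, float x0) {
    return x0 * (2.0f - d * x0);
  }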
Cameron McInally 73a6bca32b [AVX512] Add integer shift by immediate intrinsics.
llvm-svn: 221811
2014-11-12 19:58:54 +00:00