Commit Graph

2776 Commits

Author SHA1 Message Date
Chandler Carruth a8311b3681 [x86] Begin stubbing out the AVX support in the new vector shuffle
lowering scheme.

Currently, this just directly bails to the fallback path of splitting
the 256-bit vector into two 128-bit vectors, operating there, and then
joining the results back together. While the results are far from
perfect, they are *shockingly* good for what we're doing here. I'll be
layering the rest of the functionality on top of this piece by piece and
updating tests as I go.

Note that 256-bit vectors in this mode are still somewhat WIP. While
I think the code paths that I'm adding here are clean and good-to-go,
there are still a lot of 128-bit assumptions that I'll need to stomp out
as I march through the functional spread here.

llvm-svn: 215637
2014-08-14 12:13:59 +00:00
Quentin Colombet 57fb040beb [X86] Fix the value of the low mask for the lowering of MUL_LOHI for v4i32.
Found by code inspection.

llvm-svn: 215604
2014-08-13 23:49:24 +00:00
Aaron Ballman 1013b6b60c Silence a -Wparentheses warning with these asserts. NFC.
llvm-svn: 215537
2014-08-13 10:49:07 +00:00
Elena Demikhovsky 51bbd011c3 AVX-512: Fixed a bug in shufflevector lowering.
PALIGNR instruction does not exist in AVX-512F set.
Added a test.

llvm-svn: 215526
2014-08-13 07:58:43 +00:00
Chandler Carruth b7eda21bb0 [x86] Rewrite a core part of the new vector shuffle lowering to handle
one pesky test case correctly.

This test case caused the old code to infloop, oscillating between solving
the low-half and the high-half. The 'side balancing' part of
single-input v8 shuffle lowering didn't handle the one pattern which can
cause it to oscillate. Fortunately the fuzz testing found this case.
Unfortunately it was *terrible* to handle. I'm really sorry for the
amount and density of the code here, I'd love suggestions on how to
simplify it. I feel like there *must* be a simpler form here, but after
a lot of days I've not found it. This is the only one I've found that
even works. I've added the one pesky test case along with some nice
comments explaining the core problem that we have to solve here.

So far this has survived approximately 32k test cases. More strenuous
fuzzing commencing.

llvm-svn: 215519
2014-08-13 01:25:45 +00:00
Adam Nemet cee9d0a460 [AVX512] Handle valign masking intrinsic via C++ lowering
I think that this will scale better in most cases than adding a Pat<> for each
mapping from the intrinsic DAG to the instruction (i.e. rri, rrik, rrikz).  We
can just lower to the SDNode and have the resulting DAG be matched by the DAG
patterns.

Alternatively (long term), we could keep the Pat<>s but generate them via the
new AVX512_masking multiclass.  The difficulty is that in order to formulate
that we would have to concatenate DAGs.  Currently this is only supported if
the operators of the input DAGs are identical.

llvm-svn: 215473
2014-08-12 21:13:12 +00:00
Sanjay Patel 8687f320e0 fixed typos
llvm-svn: 215451
2014-08-12 16:00:06 +00:00
Hans Wennborg 21b20cb11a Increase the size of these SmallVectors in X86ISelLowering.cpp.
In a Clang bootstrap, their sizes were always 12, 16 and 16, respectively.

llvm-svn: 215336
2014-08-11 02:21:22 +00:00
Sanjay Patel b48d1cfa53 fixed typos
llvm-svn: 215299
2014-08-09 22:23:02 +00:00
Patrik Hagglund b0e86ec814 [pr19635] Revert most of r170537, and add new testcase.
Patch provided by Andrey Kuharev.

Sorry, r170537 was obviously wrong.

llvm-svn: 215190
2014-08-08 08:21:19 +00:00
Alexander Kornienko 7151ad7762 Insert parens to avoid a warning:
suggest parentheses around arithmetic in operand of '^' [-Wparentheses]
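
A hypothetical illustration (invented operands) of what the warning is about:
'+' binds tighter than '^', so mixed expressions parse in a way that is easy
to misread, and GCC asks for explicit grouping.

  #include <cassert>

  void checkMask(unsigned Index, unsigned Offset, unsigned Expected) {
    // Index ^ Offset + 1 would parse as Index ^ (Offset + 1) and trip
    // -Wparentheses; spelling out the grouping silences the warning.
    assert(((Index ^ Offset) + 1) == Expected);
  }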

llvm-svn: 215101
2014-08-07 12:09:34 +00:00
Chandler Carruth 4e8fcbd3fd [x86] Fix another miscompile found through fuzz testing the new vector
shuffle lowering.

This is closely related to the previous one. Here we failed to use the
source offset when swapping in the other case -- where we end up
swapping the *final* shuffle. The cause of this bug is a bit different:
I simply wasn't thinking about the fact that this mask is actually
a slice of a wide mask and thus has numbers that need SourceOffset
applied. Simple fix. Would be even more simple with an algorithm-y thing
to use here, but correctness first. =]

llvm-svn: 215095
2014-08-07 10:37:35 +00:00
Chandler Carruth e206385e99 [x86] Fix another miscompile in the new vector shuffle lowering found
via the fuzz tester.

Here I missed an offset when round-tripping a value through a shuffle
mask. I got it right 2 lines below. See a problem? I do. ;] I'll
probably be adding a little "swap" algorithm which accepts a range and
two values and swaps those values where they occur in the range. Don't
really have a name for it, let me know if you do.
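
Something like the following sketch is presumably what's meant (invented name
and signature; std::replace-like in spirit, but exchanging two values):

  // Swap every occurrence of A and B within the range, leaving all other
  // elements untouched.
  template <typename RangeT, typename ValueT>
  void swapValuesInRange(RangeT &&Range, ValueT A, ValueT B) {
    for (auto &V : Range) {
      if (V == A)
        V = B;
      else if (V == B)
        V = A;
    }
  }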

llvm-svn: 215094
2014-08-07 10:14:27 +00:00
Chandler Carruth 78494364d1 [x86] Fix another miscompile in the new vector shuffle lowering found
through the new fuzzer.

This one is great: bad operator precedence led the modulus to happen at
the wrong point. None of the asserts fired because there were usually
the right values past the end of the 4-element region we were looking
at. Probably could have gotten a crash here with ASan + fuzzing, but the
correctness tests pinpointed this really nicely.
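
The shape of the bug, illustrative only (invented names): '%' binds tighter
than '+'.

  int Bad  = Mask[Base + i % 4];   // parses as Mask[Base + (i % 4)]
  int Good = Mask[(Base + i) % 4]; // wraps the whole index into the region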

llvm-svn: 215092
2014-08-07 09:45:02 +00:00
Pavel Chupin f55eb450e5 [x32] Use ebp/esp as frame and stack pointer
Summary:
Since pointers are 32-bit on x32, we can use ebp and esp as the frame and stack
pointers. Some operations, like PUSH/POP and CFI_INSTRUCTION, still
require a 64-bit register, so we use the 64-bit MachineFramePtr where required.

X86_64 NaCl uses 64-bit frame/stack pointers; however, it's been found that
both isTarget64BitLP64 and isTarget64BitILP32 are true for NaCl. Addressing
this issue here as well by making isTarget64BitLP64 false.

Also mark hasReservedSpillSlot unreachable on X86. See inlined comments.

Test Plan: Add one new simple test and upgrade 2 existing tests with an x32 target case.

Reviewers: nadav, dschuff

Subscribers: llvm-commits, zinovy.nis

Differential Revision: http://reviews.llvm.org/D4617

llvm-svn: 215091
2014-08-07 09:41:19 +00:00
Chandler Carruth 27046758de [x86] Fix a miscompile in the new shuffle lowering found through the new
fuzz testing.

The function which tested for adjacency did what it said on the tin, but
when I called it, I wanted it to do something more thorough: I wanted to
know if the *pairs* of shuffle elements were adjacent and started at
0 mod 2. In one place I had the decency to try to test for this, but in
the other it was completely skipped, miscompiling this test case. Fix
this by making the helper actually do what I wanted it to do everywhere
I called it (and removing the now redundant code in one place).
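
A sketch of the strengthened predicate (simplified: undef lanes, which the
real helper must tolerate, are ignored here):

  // True when every even/odd pair of mask elements is adjacent and starts
  // at an even source index, so the mask can be rewritten with half as
  // many lanes of twice the width.
  static bool canWidenShuffleElements(ArrayRef<int> Mask) {
    for (int i = 0, Size = Mask.size(); i < Size; i += 2)
      if (Mask[i] % 2 != 0 || Mask[i] + 1 != Mask[i + 1])
        return false;
    return true;
  }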

I *really* dislike the name "canWidenShuffleElements" for this
predicate. If anyone can come up with a better name, please let me know.
The other name I thought about was "canWidenShuffleMask" but is it
really widening the mask to reduce the number of lanes shuffled? I don't
know. Naming things is hard.

llvm-svn: 215089
2014-08-07 08:11:31 +00:00
Eric Christopher b5217507c7 Remove the target machine from CCState. Previously it was only used
to get the subtarget and that's accessible from the MachineFunction
now. This helps clear the way for smaller changes where getting
a subtarget will require passing in a MachineFunction/Function as
well.

llvm-svn: 214988
2014-08-06 18:45:26 +00:00
Chandler Carruth c3927cd8c9 [x86] Fix two independent miscompiles in the process of getting the same
test case to actually generate correct code.

The primary miscompile fixed here is that we weren't correctly handling
in-place elements in one half of a single-input v8i16 shuffle when
moving a dword of elements from that half to the other half. Some times,
we would clobber the in-place elements in forming the dword to move
across halves.

The fix to this involves forcibly marking the in-place inputs even when
there is no need to gather them into a dword, and to much more carefully
re-arrange the elements when grouping them into a dword to move across
halves. With these two changes we would generate correct shuffles for
the test case, but found another miscompile. There are also some random
perturbations of the generated shuffle pattern in SSE2. It looks like
a wash; more instructions in some cases, fewer in others.

The second miscompile would corrupt the results into nonsense. This is
a buggy pattern in one of the added DAG combines. Mapping elements
through a PSHUFD when pairing redundant half-shuffles is *much* harder
than this code makes it out to be -- it requires reasoning about *all*
of where the input is used in the PSHUFD, not just one part of where it
is used. Plus, we can't combine a half shuffle *into* a PSHUFD but the
code didn't guard against it. I think this was just a bad idea and I've
just removed that aspect of the combine. No tests regress as
a consequence so seems OK.

llvm-svn: 214954
2014-08-06 10:16:36 +00:00
Chandler Carruth 8f23ba26d2 [x86] Switch to a formulation of a for loop that is much more obviously
not corrupting the mask by mutating it more times than intended. No
functionality changed (the results were non-overlapping so the old
version "worked" but was non-obvious).

llvm-svn: 214953
2014-08-06 10:16:33 +00:00
JF Bastien ac8b66b32c Fix typos in comments and doc
Committing http://reviews.llvm.org/D4798 for Robin Morisset (morisset@google.com)

llvm-svn: 214934
2014-08-05 23:27:34 +00:00
Chandler Carruth a746239be3 [x86] Fix a crasher due to shuffles which cancel each other out and add
a test case.

We also miscompile this test case which is showing a serious flaw in the
single-input v8i16 shuffle code. I've left the specific instruction
checks FIXME-ed out until I can address the bug in the single-input
code, but I wanted to separate out a significant functionality change to
produce correct code from a very simple and targeted crasher fix.

The miscompile problem stems from keeping track of inputs by value
rather than by index. As a consequence of doing this, we can't reliably
update those inputs because they might swap and we can't detect this
without copying the mask.

The blend code now uses indices for the input lists and this seems
strictly better. It also should make it easier to sort things and do
other cleanups. I think the time has come to simplify The Great Lambda
here.

llvm-svn: 214914
2014-08-05 18:45:49 +00:00
Adam Nemet c04f3f9f73 [X86] Improve comments for r214888
A rebase somehow ate my comments. This restores them.

llvm-svn: 214903
2014-08-05 17:58:49 +00:00
Adam Nemet 164b07fbfe [X86] Add lowering to VALIGN
This was previously part of lowering to PALIGNR with some special-casing to
make interlane shifting work.  Since AVX512F has interlane alignr (valignd/q)
and AVX512BW has vpalignr we need to support both of these *at the same time*,
e.g. for SKX.

This patch breaks out the common code and then adds support to check both of
these lowering options from LowerVECTOR_SHUFFLE.

I also added some FIXMEs where I think the AVX512BW and AVX512VL additions
should probably go.

llvm-svn: 214888
2014-08-05 17:22:59 +00:00
Adam Nemet 2f10cc699d [X86] Separate DAG node for valign and palignr
They have different semantics (valign is interlane while palignr is intralane)
and palignr is still needed even in the AVX512 context.  According to the
latest spec AVX512BW provides these.

llvm-svn: 214887
2014-08-05 17:22:55 +00:00
Chandler Carruth 183771bd8e [x86] Reformat some code I moved around in a prior commit but left
poorly formatted. Sorry about that.

llvm-svn: 214853
2014-08-05 10:35:30 +00:00
Chandler Carruth 947cef191d [x86] Fix a crash and wrong-code bug in the new vector lowering, both
found by a single test reduced out of a failure on llvm-stress.

The start of the problem (and the crash) came when we tried to use the
first unused slot we found in the move-to half of the move-mask as the
target for two bad-half inputs. With luck this will be the first of
a pair of slots into which we can place the bad-half inputs, but it isn't
actually guaranteed. This really isn't surprising; not sure what I was
thinking. The correct way to find the two unused slots is to look for
one of the *used* slots. We know it isn't that pair, and we can use some
modular arithmetic to find the other pair by masking off the odd bit and
adding 2 modulo 4. With this, we reliably find a viable pair of slots
for the bad-half inputs.
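
In code, that arithmetic is a one-liner (variable names invented):

  // Clear the low bit to find the start of the pair holding the used slot,
  // then step by two (mod 4) to reach the start of the other, unused pair.
  int OtherPairStart = ((UsedSlot & ~1) + 2) % 4;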

Sadly, that wasn't enough. We also had a wrong code bug that surfaced
when I reduced the test case for this where we would use the same slot
twice for the two bad inputs. This is because both of the bad inputs
could be in odd slots originally and thus the mod-2 mapping would
actually be the same. The whole point of the weird indexing into the
pair of empty slots was to try to leverage when the end result needed
the two bad-half inputs to be paired in a dword and pre-pair them in the
correct orientation. This is less important with the powerful combining
we're now doing, and also easier and more reliable to achieve by noting
that we add the bad-half inputs in order. Thus, if they are in a dword
pair, the low part of that will be the first input in the sequence.
Always putting that in the low element will just do the right thing in
addition to computing the correct result.

Test case added. =]

llvm-svn: 214849
2014-08-05 08:19:21 +00:00
Eric Christopher fc6de428c8 Have MachineFunction cache a pointer to the subtarget to make lookups
shorter/easier and have the DAG use that to do the same lookup. This
can be used in the future for TargetMachine based caching lookups from
the MachineFunction easily.

Update the MIPS subtarget switching machinery to update this pointer
at the same time it runs.

llvm-svn: 214838
2014-08-05 02:39:49 +00:00
Eric Christopher d913448b38 Remove the TargetMachine forwards for TargetSubtargetInfo based
information and update all callers. No functional change.

llvm-svn: 214781
2014-08-04 21:25:23 +00:00
Chandler Carruth 0e2ddb2790 [x86] Just unilaterally prefer SSSE3-style PSHUFB lowerings over clever
use of PACKUS. It's cleaner that way.

I looked at implementing clever combine-based folding of PACKUS chains
into PSHUFB but it is quite hard and doesn't seem likely to be worth it.
The most annoying part would be detecting that the correct masking had
been done to use PACKUS-style instructions as a blend operation rather
than as the saturating operation its name indicates. We generate
really nice code for what few test cases I've come up with that aren't
completely contrived for this by just directly preferring PSHUFB and so
let's go with that strategy for now. =]

llvm-svn: 214707
2014-08-04 10:17:35 +00:00
Chandler Carruth 06e6f1cae2 [x86] Implement more aggressive use of PACKUS chains for lowering common
patterns of v16i8 shuffles.

This implements one of the more important FIXMEs for the SSE2 support in
the new shuffle lowering. We now generate the optimal shuffle sequence
for truncate-derived shuffles which show up essentially everywhere.

Unfortunately, this exposes a weakness in other parts of the shuffle
logic -- we can no longer form PSHUFB here. I'll add the necessary
support for that and other things in a subsequent commit.

llvm-svn: 214702
2014-08-04 09:40:02 +00:00
Chandler Carruth 37a18821cd [x86] Handle single input shuffles in the SSSE3 case more intelligently.
I spent some time looking into a better or more principled way to handle
this. For example, by detecting arbitrary "unneeded" ORs... But really,
there wasn't any point. We just shouldn't build blatantly wrong code this
late in the pipeline and then add more stages and logic later on
to fix it. Avoiding this is just too simple.

llvm-svn: 214680
2014-08-04 01:14:24 +00:00
Chandler Carruth 16c13cad35 [x86] Remove the FIXME that was implemented in r214628. Managed to
forget to update the comment here... =/

llvm-svn: 214630
2014-08-02 11:34:23 +00:00
Chandler Carruth 4c57955fe3 [x86] Largely complete the use of PSHUFB in the new vector shuffle
lowering with a small addition to it and adding PSHUFB combining.

There is one obvious place in the new vector shuffle lowering where we
should form PSHUFBs directly: when without them we will unpack a vector
of i8s across two different registers and do a potentially 4-way blend
as i16s only to re-pack them into i8s afterward. This is the crazy
expensive fallback path for i8 shuffles and we can just directly use
pshufb here as it will always be cheaper (the unpack and pack are
two instructions so even a single shuffle between them hits our
three instruction limit for forming PSHUFB).

However, this doesn't generate very good code in many cases, and it
leaves a bunch of common patterns not using PSHUFB. So this patch also
adds support for extracting a shuffle mask from PSHUFB in the X86
lowering code, and uses it to handle PSHUFBs in the recursive shuffle
combining. This allows us to combine through them, combine multiple ones
together, and generally produce sufficiently high quality code.

Extracting the PSHUFB mask is annoyingly complex because it could be
either pre-legalization or post-legalization. At least this doesn't have
to deal with re-materialized constants. =] I've added decode routines to
handle the different patterns that show up at this level and we dispatch
through them as appropriate.
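
The byte-level rule the decoders recover is simple once the constant is in
hand; a sketch (not the exact routines added here):

  // Each PSHUFB control byte either zeroes its result byte (high bit set)
  // or selects one of the 16 source bytes via its low four bits.
  static void decodePSHUFBMask(ArrayRef<uint8_t> Control,
                               SmallVectorImpl<int> &Mask) {
    for (uint8_t B : Control)
      Mask.push_back(B & 0x80 ? -1 /* zeroed lane */ : (B & 0x0F));
  }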

The two primary test cases are updated. For the v16 test case there is
still a lot of room for improvement. Since I was going through it
systematically I left behind a bunch of FIXME lines that I'm hoping to
turn into ALL lines by the end of this.

llvm-svn: 214628
2014-08-02 10:39:15 +00:00
Chandler Carruth 5219d4eff6 [x86] Fix a few typos in my comments spotted in passing.
llvm-svn: 214626
2014-08-02 10:29:34 +00:00
Chandler Carruth 34f9a987e9 [x86] Teach the target shuffle mask extraction to recognize unary forms
of normally binary shuffle instructions like PUNPCKL and MOVLHPS.

This detects cases where a single register is used for both operands
making the shuffle behave in a unary way. We detect this and adjust the
mask to use the unary form which allows the existing DAG combine for
shuffle instructions to actually work at all.

As a consequence, this uncovered a number of obvious bugs in the
existing DAG combine which are fixed. It also now canonicalizes several
shuffles even with the existing lowering. These typically are trying to
match the shuffle to the domain of the input where before we only really
modeled them with the floating point variants. All of the cases which
change to an integer shuffle here have something in the integer domain, so
there are no more or fewer domain crosses here AFAICT. Technically, it
might be better to go from a GPR directly to the floating point domain,
but detecting floating point *outputs* despite integer inputs is a lot
more code and seems unlikely to be worthwhile in practice. If folks are
seeing domain-crossing regressions here though, let me know and I can
hack something up to fix it.

Also as a consequence, a bunch of missed opportunities to form pshufb
now can be formed. Notably, splats of i8s now form pshufb.
Interestingly, this improves the existing splat lowering too. We go from
3 instructions to 1. Yes, we may tie up a register, but it seems very
likely to be worth it, especially if splatting the 0th byte (the
common case) as then we can use a zeroed register as the mask.

llvm-svn: 214625
2014-08-02 10:27:38 +00:00
Akira Hatanaka 3516669a50 [X86] Simplify X87 stackifier pass.
Stop using ST registers for function returns and inline-asm instructions and use
FP registers instead. This allows removing a large amount of code in the
stackifier pass that was needed to track register liveness and handle copies
between ST and FP registers and function calls returning floating point values.

It also fixes a bug which manifests when an ST register defined by an
inline-asm instruction was live across another inline-asm instruction, as shown
in the following sequence of machine instructions:

1. INLINEASM <es:frndint> $0:[regdef], %ST0<imp-def,tied5>
2. INLINEASM <es:fldcw $0>
3. %FP0<def> = COPY %ST0

<rdar://problem/16952634>

llvm-svn: 214580
2014-08-01 22:19:41 +00:00
Louis Gerbarg 67474e3755 Make sure no loads resulting from load->switch DAGCombine are marked invariant
Currently when DAGCombine converts loads feeding a switch into a switch of
addresses feeding a load, the new load inherits the isInvariant flag of the left
side. This is incorrect since invariant loads can be reordered in cases where it
is illegal to reorder normal loads.

This patch adds an isInvariant parameter to getExtLoad() and updates all call
sites to pass in the data if they have it or false if they don't. It also
changes the DAGCombine to use that data to make the right decision when
creating the new load.

llvm-svn: 214449
2014-07-31 21:45:05 +00:00
Matt Arsenault 6f2a526101 Add alignment value to allowsUnalignedMemoryAccess
Rename to allowsMisalignedMemoryAccess.

On R600, 8- and 16-byte accesses are mostly OK with 4-byte alignment,
and don't need to be split into multiple accesses. Vector loads with
an alignment of the element type are not uncommon in OpenCL code.

llvm-svn: 214055
2014-07-27 17:46:40 +00:00
Chandler Carruth 64a7c828cb [x86] Sink a variable only used by asserts into the asserts. Should fix
some -Werror bots, sorry for the noise.

llvm-svn: 214043
2014-07-27 01:45:49 +00:00
Chandler Carruth 80c5bfd843 [x86] Add a much more powerful framework for combining x86 shuffle
instructions in the legalized DAG, and leverage it to combine long
sequences of instructions to PSHUFB.

Eventually, the other x86-instruction-specific shuffle combines will
probably all be driven out of this routine. But the real motivation is
to detect after we have fully legalized and optimized a shuffle to the
minimal number of x86 instructions whether it is profitable to replace
the chain with a fully generic PSHUFB instruction even though doing so
requires either a load from a constant pool or tying up a register with
the mask.

While the Intel manuals claim it should be used when it replaces 5 or
more instructions (!!!!) my experience is that it is actually very fast
on modern chips, and so I've gone with a much more aggressive model of
replacing any sequence of 3 or more instructions.

I've also taught it to do some basic canonicalization to special-purpose
instructions which have smaller encodings than their generic
counterparts.

There are still quite a few FIXMEs here, and I've not yet implemented
support for lowering blends with PSHUFB (where its power really shines
due to being able to zero out lanes), but this starts implementing real
PSHUFB support even when using the new, fancy shuffle lowering. =]

llvm-svn: 214042
2014-07-27 01:15:58 +00:00
Chandler Carruth 5896698e2e [x86] Fix PR20355 (for real). There are many layers to this bug.
The tale starts with r212808 which attempted to fix inversion of the low
and high bits when lowering MUL_LOHI. Sadly, that commit did not include
any positive test cases, and just removed some operations from a test
case where the actual logic being changed isn't fully visible from the
test.

What this commit did was two things. First, it reversed the low and high
results in the formation of the MERGE_VALUES node for the multiple
results. This is entirely correct.

Second, it changed the shuffles for extracting the low and high
components from the i64 results of the multiplies to extract them
assuming a big-endian-style encoding of the multiply results. This
second change is wrong. There is no big-endian encoding in x86, the
results of the multiplies are normal v2i64s: when cast to v4i32, the low
i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3.
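
Concretely (illustrative, little endian): viewing one v2i64 product as v4i32,
the correct extraction indices are simply

  int LoMask[] = {0, 2}; // the low i32 of each i64 product
  int HiMask[] = {1, 3}; // the high i32 of each i64 product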

However, the first change wasn't enough to actually fix the bug, which
is (I assume) why the second change was also made. There was another bug
in the MERGE_VALUES formation: we weren't using a VTList, and so were
getting a single result node! When grabbing the *second* result from the
node, we got... well... it could be anything. I think this *appeared* to
invert things, but had to be causing other problems as well.

Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should
have been fine, right? NOOOPE! Because the core bug was never addressed,
the test in vector-idiv failed when I fixed the MERGE_VALUES node.
Because there are essentially no docs for this node, I had to guess at
how to fix it and tried swapping the operands, restoring the order of
the original code before r212808. While this "fixed" the test case (in
that we produced the right instructions) we were still extracting the
wrong elements of the i64s, and thus PR20355 was still broken.

This commit essentially reverts the big-endian-style extraction part of
r212808 and goes back to the original masks which were correct. Now that
the MERGE_VALUES node formation is also correct, everything works. I've
also included a more detailed test from PR20355 to make sure this stays
fixed.

llvm-svn: 214011
2014-07-26 03:46:57 +00:00
Chandler Carruth f6406ac5d6 [x86] Revert r214007: Fix PR20355 ...
The clever way to implement signed multiplication with unsigned *is
already implemented* and tested and working correctly. The bug is
somewhere else. Re-investigating.

This will teach me to not scroll far enough to read the code that did
what I thought needed to be done.

llvm-svn: 214009
2014-07-26 02:14:54 +00:00
Chandler Carruth 1bf4d19172 [x86] Fix PR20355 (and dups) by not using unsigned multiplication when
signed multiplication is requested. While there is not a difference in
the *low* half of the result, the *high* half (used specifically to
implement the signed division by these constants) certainly is used. The
test case I've nuked was actively asserting wrong code.

There is a delightful solution to doing signed multiplication even when
we don't have it that Richard Smith has crafted, but I'll add the
machinery back and implement that in a follow-up patch. This at least
restores correctness.
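
For reference, the classic derivation of the signed high half from an
unsigned high multiply (Hacker's Delight, section 8-3), in plain C++; whether
the follow-up patch takes exactly this shape is an assumption:

  #include <cstdint>

  // High 32 bits of the full unsigned 64-bit product.
  static uint32_t mulhu(uint32_t U, uint32_t V) {
    return (uint32_t)(((uint64_t)U * V) >> 32);
  }

  // Signed high half from the unsigned one: U >> 31 is 0 or -1, so each
  // term subtracts the other operand exactly when this operand is negative.
  static int32_t mulhs(int32_t U, int32_t V) {
    uint32_t Hi = mulhu((uint32_t)U, (uint32_t)V);
    Hi -= (uint32_t)((U >> 31) & V);
    Hi -= (uint32_t)((V >> 31) & U);
    return (int32_t)Hi;
  }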

llvm-svn: 214007
2014-07-26 01:52:13 +00:00
Akira Hatanaka e5b6e0d231 [stack protector] Fix a potential security bug in stack protector where the
address of the stack guard was being spilled to the stack.

Previously the address of the stack guard would get spilled to the stack if it
was impossible to keep it in a register. This patch introduces a new target
independent node and pseudo instruction which gets expanded post-RA to a
sequence of instructions that load the stack guard value. The register allocator
can now just remat the value when it can't keep it in a register.

<rdar://problem/12475629>

llvm-svn: 213967
2014-07-25 19:31:34 +00:00
Chandler Carruth 3de980d2ff [SDAG] Enable the new assert for out-of-range result numbers in
SDValues, fixing the two bugs left in the regression suite.

The key for both of these was the use of a single value type rather than
a VTList which caused an unintentionally single-result merge-value node.
Fix this by getting the appropriate VTList in place.
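
The shape of the fix, sketched with hypothetical operands:

  // A single EVT creates a single-result node; building with a VTList
  // gives the node one result per merged value, so getValue(1) is in range.
  static SDValue mergeLoHi(SelectionDAG &DAG, SDLoc DL, SDValue Lo,
                           SDValue Hi) {
    SDVTList VTs = DAG.getVTList(Lo.getValueType(), Hi.getValueType());
    return DAG.getNode(ISD::MERGE_VALUES, DL, VTs, Lo, Hi);
  }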

Doing this exposed that the comments in x86's code about how MUL_LOHI
operands are handled are wrong. The bug with the use of out-of-range
result numbers was hiding the bug about the order of operands here (as
best I can tell). There are more places where the code appears to get
this backwards still...

llvm-svn: 213931
2014-07-25 09:19:23 +00:00
Chandler Carruth 80b869461e [x86] Make vector legalization of extloads work more like the "normal"
vector operation legalization with support for custom target lowering
and fallback to expand when it fails, and use this to implement sext and
anyext load lowering for x86 in a more principled way.

Previously, the x86 backend relied on a target DAG combine to "combine
away" sextload and extload nodes prior to legalization, or would expand
them during legalization with terrible code. This is particularly
problematic because the DAG combine relies on running over non-canonical
DAG nodes at just the right time to match several common and important
patterns. It used a combine rather than lowering because we didn't have
good lowering support, and to expose some tricks being employed to more
combine phases.

With this change it becomes a proper lowering operation, the backend
marks that it can lower these nodes, and I've added support for handling
the canonical forms that don't have direct legal representations such as
sextload of a v4i8 -> v4i64 on AVX1. With this change, our test cases
for this behavior continue to pass even after the DAG combiner begins
running more systematically over every node.

There is some noise caused by this in the test suite where we actually
use vector extends instead of subregister extraction. This doesn't
really seem like the right thing to do, but is unlikely to be a critical
regression. We do regress in one case where, by lowering to the
target-specific patterns early, we were able to combine away extraneous
legal math nodes. However, this regression is completely addressed by
switching to a widening based legalization which is what I'm working
toward anyways, so I've just switched the test to that mode.

Differential Revision: http://reviews.llvm.org/D4654

llvm-svn: 213897
2014-07-24 22:09:56 +00:00
Reid Kleckner 9a412d13c1 Replace an assertion with a fatal error
Frontends are responsible for putting inalloca on parameters that would
be passed in memory and not registers.

llvm-svn: 213891
2014-07-24 19:53:33 +00:00
Saleem Abdulrasool 34610e33ae X86: silence sign comparison warning
GCC 4.8 detected a signed compare [-Wsign-compare].  Add a cast for the
destination index.  Add an assert to catch a potential overflow, however unlikely
it may be.

llvm-svn: 213878
2014-07-24 17:12:06 +00:00
Filipe Cabecinhas 933cccf3fa Fixed PR20411 - bug in getINSERTPS()
When we had a vector_shuffle where we had an input from each vector, we
could miscompile it because we were assuming the input from V2 wouldn't
be moved from where it was on the vector.

Added a test case.

llvm-svn: 213826
2014-07-24 01:28:21 +00:00
Jim Grosbach 724e438c62 [X86,AArch64] Extend vcmp w/ unary op combine to work w/ more constants.
The transform to constant fold unary operations with an AND across a
vector comparison applies when the constant is not a splat of a scalar
as well.

llvm-svn: 213800
2014-07-23 20:41:43 +00:00
Jim Grosbach 8f6f0858ec X86: restrict combine to when type sizes are safe.
The folding of unary operations through a vector compare and mask operation
is only safe if the unary operation result is of the same size as its input.
For example, it's not safe for [su]itofp from v4i32 to v4f64.

llvm-svn: 213799
2014-07-23 20:41:38 +00:00
Robert Khasanov 74acbb7767 [SKX] Enabling mask instructions: encoding, lowering
KMOVB, KMOVW, KMOVD, KMOVQ, KNOTB, KNOTW, KNOTD, KNOTQ

Reviewed by Elena Demikhovsky <elena.demikhovsky@intel.com>

llvm-svn: 213757
2014-07-23 14:49:42 +00:00
Andrea Di Biagio 842355e900 Revert r211771. It was: "[X86] Improve the selection of SSE3/AVX addsub instructions".
This change fully reverts r211771.
That revision added a canonicalization rule which has the potential to cause a
combine-cycle in the target-independent canonicalizing DAG combine.

The plan is to move the logic that forms target specific addsub nodes as part of
the lowering of shuffles.

llvm-svn: 213736
2014-07-23 11:20:24 +00:00
Andrea Di Biagio 4d8bd41600 [DAG] Refactor some logic. No functional change.
This patch removes function 'CommuteVectorShuffle' from X86ISelLowering.cpp
and moves its logic into SelectionDAG.cpp as method 'getCommutedVectorShuffles'.
This refactoring is in preparation for an upcoming change to the DAGCombiner.

llvm-svn: 213503
2014-07-21 07:28:51 +00:00
Tim Northover 871de902af X86: support fpext/fptrunc operations to and from 16-bit floats.
llvm-svn: 213374
2014-07-18 13:01:25 +00:00
Jim Grosbach b6535c32f5 X86: Constant fold converting vector setcc results to float.
Since the result of a SETCC for X86 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
  UNARYOP(AND(VECTOR_CMP(x,y), constant))
      --> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).

This implements the transform where UNARYOP is [su]int_to_fp.

For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
  %cmp = fcmp oeq <4 x float> %val, %test
  %ext = zext <4 x i1> %cmp to <4 x i32>
  %result = sitofp <4 x i32> %ext to <4 x float>
  ret <4 x float> %result
}

Before this change, the SSE code is generated as:
LCPI0_0:
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .section  __TEXT,__text,regular,pure_instructions
  .globl  _foo
  .align  4, 0x90
_foo:                                   ## @foo
  cmpeqps %xmm1, %xmm0
  andps LCPI0_0(%rip), %xmm0
  cvtdq2ps  %xmm0, %xmm0
  retq

After, the code is improved to:
LCPI0_0:
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .section  __TEXT,__text,regular,pure_instructions
  .globl  _foo
  .align  4, 0x90
_foo:                                   ## @foo
  cmpeqps %xmm1, %xmm0
  andps LCPI0_0(%rip), %xmm0
  retq

The cvtdq2ps has been constant folded away and the floating point 1.0f
vector lanes are materialized directly via the ModRM operand of andps.

llvm-svn: 213342
2014-07-18 00:40:56 +00:00
Tim Northover 84ce0a642e CodeGen: generate single libcall for fptrunc -> f16 operations.
Previously we asserted on this code. Currently compiler-rt doesn't
actually implement any of these new libcalls, but external help is
pretty much the only viable option for LLVM.

I've followed the much more generic "__truncST2" naming, as opposed to
the odd name for f32 -> f16 truncation. This can obviously be changed
later, or overridden by any targets that need to.

llvm-svn: 213252
2014-07-17 11:12:12 +00:00
Tim Northover 2131044814 X86: support double extension of f16 type.
x86 has no native ability to extend an f16 to f64, but the same result
is obtained if we expand it into two separate extensions: f16 -> f32
-> f64.
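
In DAG terms the expansion is just two nodes; a sketch with a hypothetical
helper (node naming follows current trees, which is an assumption here):

  static SDValue extendF16ToF64(SelectionDAG &DAG, SDLoc DL, SDValue Half) {
    // f16 -> f32 has instruction or libcall support, and f32 -> f64 is an
    // exact extension, so the two-step result matches a direct extend.
    SDValue F32 = DAG.getNode(ISD::FP16_TO_FP, DL, MVT::f32, Half);
    return DAG.getNode(ISD::FP_EXTEND, DL, MVT::f64, F32);
  }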

Unfortunately the same is not true for truncate, so that still results
in a compilation failure.

llvm-svn: 213251
2014-07-17 11:04:04 +00:00
Tim Northover fd7e424935 CodeGen: extend f16 conversions to permit types > float.
This makes the two intrinsics @llvm.convert.from.f16 and
@llvm.convert.to.f16 accept types other than simple "float". This is
only strictly needed for the truncate operation, since otherwise
double rounding occurs and there's no way to represent the strict IEEE
conversion. However, for symmetry we allow larger types in the extend
too.

During legalization, we can expand an "fp16_to_double" operation into
two extends for convenience, but abort when the truncate isn't legal. A new
libcall is probably needed here.

Even after this commit, various target tweaks are needed to actually use the
extended intrinsics. I've put these into separate commits for clarity, so there
are no actual tests of f64 conversion here.

llvm-svn: 213248
2014-07-17 10:51:23 +00:00
Tim Northover 7f3e11e7c0 CodeGen: don't form illegal EXTLOAD operations.
It turns out that in most cases (the main exception being i1-related
types) once these operations are formed we cannot separate them and
the targets end up having to deal with them whether they want to or
not.

This is not a good situation, and a more reasonable default can be
formed by acknowledging this and having targets leave them as Legal.
Only x86 seems to be affected (other targets don't even try marking
the operation Expand).

Mostly there's no visible change here yet, but it will be useful to
have truly expanded EXTLOADS for MVT::f16 softening support.

llvm-svn: 213162
2014-07-16 15:37:24 +00:00
Andrea Di Biagio a03624d8ab [X86] Add a check for 'isMOVHLPSMask' within method 'isShuffleMaskLegal'.
Before this change, method 'isShuffleMaskLegal' didn't know that shuffles
implementing a 'movhlps' operation were perfectly legal for SSE targets.

This patch adds the missing check for 'isMOVHLPSMask' inside method
'isShuffleMaskLegal' to fix the problem.

The reason why it is important to do this is because the DAGCombiner
conservatively avoids combining a pair of shuffles if the resulting shuffle
node has an illegal mask. Before this patch, shuffles with a MOVHLPS mask were
wrongly considered not to be legal. This was the root cause of some poor-code
generation bugs.

llvm-svn: 213137
2014-07-16 11:29:39 +00:00
David Majnemer d4d9944416 Fix typo in comment
No functionality changed.

llvm-svn: 213052
2014-07-15 07:11:32 +00:00
Quentin Colombet 0f179c4d8a [X86] Fix the inversion of low and high bits for the lowering of MUL_LOHI.
Also add a few comments.

<rdar://problem/17581756>

llvm-svn: 212808
2014-07-11 12:08:23 +00:00
Chandler Carruth df8d0caab7 [x86] Add another combine that is particularly useful for the new vector
shuffle lowering: match shuffle patterns equivalent to an unpcklwd or
unpckhwd instruction.

This allows us to use generic lowering code for v8i16 shuffles and match
the unpack pattern late.

llvm-svn: 212705
2014-07-10 11:09:29 +00:00
Chandler Carruth 853fa0ac8d [x86] Expand the target DAG combining for PSHUFD nodes to be able to
combine into half-shuffles through unpack instructions that expand the
half to a whole vector without messing with the dword lanes.

This fixes some redundant instructions in splat-like lowerings for
v16i8, which are now getting to be *really* nice.

llvm-svn: 212695
2014-07-10 09:57:36 +00:00
Chandler Carruth a34a8e230d [x86] Tweak the v16i8 single input special case lowering for shuffles
that splat i8s into i16s.

Previously, we would try much too hard to arrange a sequence of i8s in
one half of the input such that we could unpack them into i16s and
shuffle those into place. This isn't always going to be a cheaper i8
shuffle than our other strategies. The case where it is always going to
be cheaper is when we can arrange all the necessary inputs into one half
using just i16 shuffles. It happens that viewing the problem this way
also makes it much easier to produce an efficient set of shuffles to
move the inputs into one half and then unpack them.

With this, our splat code gets one step closer to being not terrible
with the new experimental lowering strategy. It also exposes two
combines missing which I will add next.

llvm-svn: 212692
2014-07-10 09:16:40 +00:00
Chandler Carruth 7d2ffb5492 [x86] Initial improvements to the new shuffle lowering for v16i8
shuffles specifically for cases where a small subset of the elements in
the input vector are actually used.

This is specifically targeted at improving the shuffles generated for
trunc operations, but also helps out splat-like operations.

There is still some really low-hanging fruit here that I want to address
but this is a huge step in the right direction.

llvm-svn: 212680
2014-07-10 04:34:06 +00:00
Chandler Carruth b3840a55ae [x86] Refactor some of the new code for lowering v16i8 shuffles to
remove duplication and make it easier to select different strategies.

No functionality changed.

llvm-svn: 212674
2014-07-10 02:24:26 +00:00
Chandler Carruth d3561f6fec [SDAG] Make the new zext-vector-inreg node default to expand so targets
don't need to set it manually.

This is based on feedback from Tom who pointed out that if every target
needs to handle this we need to reach out to those maintainers. In fact,
it doesn't make sense to duplicate everything when anything other than
expand seems unlikely at this stage.

llvm-svn: 212661
2014-07-09 22:53:04 +00:00
Benjamin Kramer d6f1733add X86: When lowering v8i32 himuls use the correct shuffle masks for AVX2.
Turns out my trick of using the same masks for SSE4.1 and AVX2 didn't work out
as we have to blend two vectors. While there, remove unnecessary cross-lane moves
from the shuffles so the backend can lower it to palignr instead of vperm.

Fixes PR20118, a miscompilation of vector sdiv by constant on AVX2.

llvm-svn: 212611
2014-07-09 11:12:39 +00:00
Chandler Carruth afe4b2507e [x86] Add a ZERO_EXTEND_VECTOR_INREG DAG node and use it when widening
vector types to be legal and a ZERO_EXTEND node is encountered.

When we use widening to legalize vector types, extend nodes are a real
challenge. Either the input or output is likely to be legal, but in many
cases not both. As a consequence, we don't really have any way to
represent this situation and the prior code in the widening legalization
framework would just scalarize the extend operation completely.

This patch introduces a new DAG node to represent doing a zero extend of
a vector "in register". The core of the idea is to allow legal but
different vector types in the input and output. The output vector must
have fewer lanes but wider elements. The operation is defined to zero
extend the low elements of the input to the size of the output elements,
and drop all of the high elements which don't have a corresponding lane
in the output vector.

It also includes generic expansion of this node in terms of blending
a zero vector into the high elements of the vector and bitcasting
across. This in turn yields extremely nice code for x86 SSE2 when we use
the new widening legalization logic in conjunction with the new shuffle
lowering logic.
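
For example, a v4i32 -> v2i64 in-register zero extend expands roughly as
follows (a sketch; the helper name and exact signatures are assumptions):

  static SDValue expandZExtVectorInReg(SelectionDAG &DAG, SDLoc DL,
                                       SDValue In /* v4i32 */) {
    // Interleave the low input elements with zeros, <x0, 0, x1, 0>, then
    // reinterpret the bits as v2i64: on little-endian x86 each i64 lane is
    // a zero-extended x_i.
    SDValue Zero = DAG.getConstant(0, DL, MVT::v4i32);
    int Mask[] = {0, 4, 1, 4};
    SDValue Shuf = DAG.getVectorShuffle(MVT::v4i32, DL, In, Zero, Mask);
    return DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, Shuf);
  }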

There is still more to do here. We need to support sign extension, any
extension, and potentially int-to-float conversions. My current plan is
to continue using similar synthetic nodes to model each of these
transitions with generic lowering code for each one.

However, with this patch LLVM already reaches performance parity with
GCC for the core C loops of the x264 code (assuming you disable the
hand-written assembly versions) when compiling for SSE2 and SSE3
architectures and enabling the new widening and lowering logic for
vectors.

Differential Revision: http://reviews.llvm.org/D4405

llvm-svn: 212610
2014-07-09 10:58:18 +00:00
Chandler Carruth ef5dcf571e [x86] Initialize a pointer to null to fix a bug in r212602.
This should restore GCC hosts (which happen to put the bad stuff into
the pointer) and MSan, etc.

llvm-svn: 212606
2014-07-09 10:36:42 +00:00
Chandler Carruth 2ebc942683 [x86] Re-apply a variant of the x86 side of r212324 now that the rest
has settled without incident, removing the x86-specific and overly
strict 'isVectorSplat' routine in favor of generic and more powerful
splat detection.

The primary motivation and result of this is that the x86 backend can
now see through splats which contain undef elements. This is essential
if we are using a widening form of legalization and I've updated a test
case to also run in that mode as before this change the generated code
for the test case was completely scalarized.

This version of the patch much more carefully handles the undef lanes.
- We aren't overly conservative about them in the shift lowering
  (where we will never use the splat itself).
- One place where the splat would have been re-used by the existing code
  now explicitly constructs a new constant splat that will be safe.
- The broadcast lowering is much more reasonable with undefs by doing
  a correct check of whether the splat is the only user of a loaded
  value, checking that the splat actually crosses multiple lanes before
  using a broadcast, and handling broadcasts of non-constant splats.

As a consequence of the last bullet, the weird usage of vpshufd instead
of vbroadcast is gone, and we actually can lower an AVX splat with
vbroadcastss where before we emitted a really strange pattern of
a vector load and a manual splat across the vector.

llvm-svn: 212602
2014-07-09 10:06:58 +00:00
Chandler Carruth 142e966261 [x86,SDAG] Sink the logic for folding shuffles of splats more
aggressively from the x86 shuffle lowering to the generic SDAG vector
shuffle formation code.

This code already tried to fold away shuffles of splats! It just had
lots of bugs and couldn't handle the case my new x86 shuffle lowering
needed.

First, it failed to correctly compute whether N2 was undef because it
pre-computed this, then did transformations which could *make* N2 undef,
then failed to ever re-consider the precomputed state.

Second, it didn't look through bitcasts at all, even in the safe cases
where they are just element-type bitcasts with no change to the number
of elements.

Third, it didn't handle all-zero bit casts nicely the way my code in the
x86 side of things did, which is essential to getting good zext-shuffle
lowerings.

But all of these are generic. I just ported the code down to this layer
and fixed the surrounding bugs. Tests exercising this in the x86 backend
still pass and some silly code in widen_cast-6.ll gets better. I updated
that test to be a bit more precise but it's still pretty unclear what
the value of the test is in this day and age.

llvm-svn: 212517
2014-07-08 08:45:38 +00:00
Andrea Di Biagio 2620b877b6 [x86] Fix assertion failure caused by a wrong combine of PSHUFD nodes with different types.
When combining a sequence of two PSHUFD dag nodes into a single PSHUFD,
make sure that we assign the correct type to the resulting PSHUFD.
X86ISD::PSHUFD dag nodes can be either MVT::v4i32 or MVT::v4f32.

Before this change, an assertion failure was triggered in method
'DAGCombinerInfo::CombineTo' when trying to combine the shuffles from the test
below into a single PSHUFD.

define <4 x float> @test1(<4 x float> %V) {
  %1 = shufflevector <4 x float> %V, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
  %2 = shufflevector <4 x float> %1, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
  ret <4 x float> %2
}

llvm-svn: 212498
2014-07-07 23:25:23 +00:00
Chandler Carruth beeacac0b3 [x86] Revert r212324 which was too aggressive w.r.t. allowing undef
lanes in vector splats.

The core problem here is that undef lanes can't *unilaterally* be
considered to contribute to splats. Their handling needs to be more
cautious. There is also a reported failure of the nightly testers
(thanks Tobias!) that may well stem from the same core issue. I'm going
to fix this theoretical issue, factor the APIs a bit better, and then
verify that I don't see anything bad with Tobias's reduction from the
test suite before recommitting.

Original commit message for r212324:
  [x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for
  any constant, constant FP, or undef splat and to tolerate any undef
  lanes in a splat, then replace all uses of isSplatVector in X86's
  lowering with it.

  This fixes issues where undef lanes in an otherwise splat vector would
  prevent the splat logic from firing. It is a touch more awkward to use
  this interface, but it is much more accurate. Suggestions for better
  interface structuring welcome.

  With this fix, the code generated with the widening legalization
  strategy for widen_cast-4.ll is *dramatically* improved as the special
  lowering strategies for a v16i8 SRA kick in even though the high lanes
  are undef.

  We also get a slightly different choice for broadcasting an aligned
  memory location, and use vpshufd instead of vbroadcastss. This looks
  like a minor win for pipelining and domain crossing, but a minor loss
for the number of micro-ops. I suspect it's a wash, but folks can
  easily tweak the lowering if they want.

llvm-svn: 212475
2014-07-07 19:03:32 +00:00
Chandler Carruth 0dcb366268 [x86] Teach the new vector shuffle lowering code to handle what is
essentially a DAG combine that never gets a chance to run.

We might typically expect DAG combining to remove shuffles-of-splats and
other similar patterns, but we don't get a chance to run the DAG
combiner when we recursively form sub-shuffles during the lowering of
a shuffle. So instead hand-roll a really important combine directly into
the lowering code to detect shuffles-of-splats, especially shuffles of
an all-zero splat which needn't even have the same element width, etc.

This lets the new vector shuffle lowering handle shuffles which
implement things like zero-extension really nicely. This will become
even more important when I wire the legalization of zero-extension to
vector shuffles with the new widening legalization strategy.

llvm-svn: 212444
2014-07-07 09:06:58 +00:00
Chandler Carruth 5d79bb5d32 [x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for
any constant, constant FP, or undef splat and to tolerate any undef
lanes in a splat, then replace all uses of isSplatVector in X86's
lowering with it.

This fixes issues where undef lanes in an otherwise splat vector would
prevent the splat logic from firing. It is a touch more awkward to use
this interface, but it is much more accurate. Suggestions for better
interface structuring welcome.

With this fix, the code generated with the widening legalization
strategy for widen_cast-4.ll is *dramatically* improved as the special
lowering strategies for a v16i8 SRA kick in even though the high lanes
are undef.

We also get a slightly different choice for broadcasting an aligned
memory location, and use vpshufd instead of vbroadcastss. This looks
like a minor win for pipelining and domain crossing, but a minor loss
for the number of micro-ops. I suspect it's a wash, but folks can easily
tweak the lowering if they want.

llvm-svn: 212324
2014-07-04 08:11:49 +00:00
Chandler Carruth 19cff8205e [x86] Clarify that this lowering only applies to vectors and is only
used when we have SSE2.

llvm-svn: 212300
2014-07-03 22:57:44 +00:00
Andrea Di Biagio a37a2fc81f [X86] Add ISel patterns to select 'f32_to_f16' and 'f16_to_f32' dag nodes.
This patch adds tablegen patterns to select F16C float-to-half-float
conversion instructions from 'f32_to_f16' and 'f16_to_f32' dag nodes.

If the target doesn't have F16C, then 'f32_to_f16' and 'f16_to_f32'
are expanded into library calls.

llvm-svn: 212293
2014-07-03 21:51:06 +00:00
Chandler Carruth 739b6ada99 [x86] Fix crashes in lowering bitcast instructions with the widening
mode.

This also runs the test in that mode which would reproduce the crash.
What I love is that *every single FIXME* in the test is addressed by
switching to widening.

llvm-svn: 212254
2014-07-03 03:43:47 +00:00
Chandler Carruth 49a8b10d82 [x86] Based on a long conversation between myself, Jim Grosbach, Hal
Finkel, Eric Christopher, and a bunch of other people I'm probably
forgetting (sorry), add an option to the x86 backend to widen vectors
during type legalization rather than promote them.

This still would promote vNi1 vectors to get the masks right, but would
widen other vectors. A lot of experiments are piling up right now
showing that widening should probably be the default legalization
strategy outside of vNi1 cases, but it is very hard to test the
ramifications of that and fix bugs in widening-based legalization
without an option that enables it. I'll be checking in tests shortly
that use this option to exercise cases where widening doesn't work well
and hopefully we'll be able to switch fully to this soon.

llvm-svn: 212249
2014-07-03 02:11:29 +00:00
Benjamin Kramer e739cf3eb5 X86: When combining shuffles just remove shuffles that are completely redundant.
CombineTo doesn't allow replacing a node with itself so this would crash if the
combined shuffle is the same as the input shuffle.

llvm-svn: 212181
2014-07-02 15:09:44 +00:00
Juergen Ributzka 3bd03c7099 [DAG] Pass the argument list to the CallLoweringInfo via move semantics. NFCI.
The argument list vector is never used after it has been passed to the
CallLoweringInfo and moving it to the CallLoweringInfo is cleaner and
pretty much as cheap as keeping a pointer to it.

llvm-svn: 212135
2014-07-01 22:01:54 +00:00
Tim Northover df58625e3c X86: delegate expanding atomic libcalls to generic code.
On targets without cmpxchg16b or cmpxchg8b, the borderline atomic
operations were slipping through the gaps.

X86AtomicExpand.cpp was delegating to ISelLowering. Generic
ISelLowering was delegating to X86ISelLowering and X86ISelLowering was
asserting. The correct behaviour is to expand to a libcall, preferably
in generic ISelLowering.

This can be achieved by X86ISelLowering deciding it doesn't want the
faff after all.

llvm-svn: 212134
2014-07-01 21:44:59 +00:00
Tim Northover 277066ab43 X86: expand atomics in IR instead of as MachineInstrs.
The logic for expanding atomics that aren't natively supported in
terms of cmpxchg loops is much simpler to express at the IR level. It
also allows the normal optimisations and CodeGen improvements to help
out with atomics, instead of using a limited set of possible
instructions.
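
In C++-atomics terms the generated loop has the familiar compare-exchange
retry shape (an analogy to the IR the pass emits, not its literal output):

  #include <atomic>

  // Expand an unsupported atomic add into a cmpxchg retry loop.
  int atomicAddViaCmpXchg(std::atomic<int> &Addr, int Val) {
    int Old = Addr.load();
    // On failure, compare_exchange_weak reloads Old with the current value,
    // so the loop retries until no other thread intervened.
    while (!Addr.compare_exchange_weak(Old, Old + Val))
      ;
    return Old; // atomicrmw yields the value from before the operation
  }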

rdar://problem/13496295

llvm-svn: 212119
2014-07-01 18:53:31 +00:00
Andrea Di Biagio 53b6830069 [X86] Add support for builtin to read performance monitoring counters.
This patch adds support for a new builtin instruction called
__builtin_ia32_rdpmc.

Builtin '__builtin_ia32_rdpmc' is defined as a 'GCC builtin'; on X86, it can
be used to read performance monitoring counters. It takes as input the index
of the performance counter to read, and returns the value of the specified
performance counter as a 64-bit number.

Calls to this new builtin will map to instruction RDPMC.
The index passed to the builtin call is moved to register %ECX. The result
of the builtin call is the value of the specified performance counter (RDPMC
would return that quantity in registers RDX:RAX).
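
Hypothetical usage from C++ (assumes the OS has programmed counter 0 and
permits user-mode RDPMC):

  #include <cstdio>

  int main() {
    // The 0 is moved into %ecx, RDPMC executes, and %edx:%eax are combined
    // into the returned 64-bit value.
    unsigned long long Count = __builtin_ia32_rdpmc(0);
    std::printf("counter 0 = %llu\n", Count);
    return 0;
  }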

This patch:
 - Adds builtin int_x86_rdpmc as a GCCBuiltin;
 - Adds a new x86 DAG node called 'RDPMC_DAG';
 - Teaches how to lower this new builtin;
 - Adds an ISel pattern to select instruction RDPMC;
 - Fixes the definition of instruction RDPMC adding %RAX and %RDX as
   implicit definitions, and adding %ECX as implicit use;
 - Adds a LLVM test to verify that the new builtin is correctly selected.

llvm-svn: 212049
2014-06-30 17:14:21 +00:00
Chandler Carruth bd0717d7cc [x86] Fix a bug in the v8i16 shuffling exposed by the new splat-like
lowering for v16i8.

ASan and some bots caught this bug with existing test cases. Fixing it
even fixed a miscompile with one of the test cases. I'm still a bit
suspicious of this test case as I've not taken a proper amount of time
to think about it, but the fix here is strict goodness.

llvm-svn: 211976
2014-06-28 05:46:28 +00:00
Chandler Carruth 887c2c3482 [x86] Add handling for splat-like widenings of v16i8 shuffles.
These show up really frequently, not the least with actual splats. =] We
lowered these quite badly before. The new code path tries to widen i8
shuffles to i16 shuffles in a splat-like way. There are still some
inefficiencies in our i16 splat logic though, so we aren't really done
here.

Also, for certain patterns (bit of a gather-and-splat) we still
generate pretty silly code, and I've left a fixme for addressing it.
However, I'm not actually worried about this code pattern as much. The
old shuffle lowering generates a 29 instruction monstrosity for it that
should execute much more slowly.

llvm-svn: 211974
2014-06-28 05:16:40 +00:00
Chandler Carruth a94ef908d9 [x86] Fix another bug hit when bootstrapping with the new shuffle
lowering.

For maximum irony, I had already discovered this bug, diagnosed it, and
left FIXMEs about it in the test cases. =[ I just failed to go back over
those until after i had reduced a bootstrap miscompile down to a single
TU, stared at the assembly for an hour, and figured out the bug. Again.

Oh well.

llvm-svn: 211955
2014-06-27 20:07:40 +00:00
Chandler Carruth dd6470a9dd [x86] Fix a miscompile in the new shuffle lowering uncovered by
a bootstrap.

I managed to mis-remember how PACKUS worked on x86, and was using undef
for the high bytes instead of zero. The fix is fairly obvious.

llvm-svn: 211922
2014-06-27 18:25:23 +00:00
Alexander Kornienko b673b4b187 Clean up unused variable warning in release build.
llvm-svn: 211902
2014-06-27 15:30:55 +00:00
Chandler Carruth ed4a0bc734 [x86] Clean up some unused variables, especially in release builds.
llvm-svn: 211894
2014-06-27 12:04:18 +00:00
Chandler Carruth 688001f042 [x86] Teach the target combine step to aggressively fold pshufd instructions.
Summary:
This allows it to fold pshufd instructions across intervening
half-shuffles and other noise. This pattern actually shows up in the
generic lowering tests, but I've also added direct tests using
intrinsics to make sure that the specific desired functionality is
working even if the lowering stuff changes in the future.

Differential Revision: http://reviews.llvm.org/D4292

llvm-svn: 211892
2014-06-27 11:40:13 +00:00
Chandler Carruth 0d6d1f2b17 [x86] Teach the target-specific combining how to aggressively fold
half-shuffles, even looking through intervening instructions in a chain.

Summary:
This doesn't happen to show up with any test cases I've found for the current
shuffle lowering, but previous attempts would benefit from this and it seems
generally useful. I've tested it directly using intrinsics, which also shows
that it will work with hand vectorized code as well.

Note that even though pshufd isn't directly used in these tests, it gets
exercised because we combine some of the half shuffles into a pshufd
first, and then merge them.

Differential Revision: http://reviews.llvm.org/D4291

llvm-svn: 211890
2014-06-27 11:34:40 +00:00
Chandler Carruth 97ebc2362c [x86] Teach the X86 backend to DAG-combine SSE2 shuffles that are
trivially redundant.

This fixes several cases in the new vector shuffle lowering algorithm
which would generate redundant shuffle instructions for the sake of
simplicity.

I'm also deleting a testcase which was somewhat ridiculous. It was
checking for a bug in 2007 about incorrectly transforming shuffles by
looking for the string "-86" in the output of a pretty substantial
function. This test case doesn't seem to have any value at this point.

Differential Revision: http://reviews.llvm.org/D4240

llvm-svn: 211889
2014-06-27 11:27:52 +00:00
Chandler Carruth 83860cfcfa [x86] Begin a significant overhaul of how vector lowering is done in the
x86 backend.

This sketches out a new code path for vector lowering, hidden behind an
off-by-default flag while it is under development. The fundamental idea
behind the new code path is to aggressively break down the problem space
in ways that ease selecting the odd set of instructions available on
x86, and carefully avoid scalarizing code even when forced to use older
ISAs. Notably, this starts off restricting itself to SSE2 and implements
the complete vector shuffle and blend space for 128-bit vectors in SSE2
without scalarizing. The plan is to layer on top of this ISA extensions
where we can bail out of the complex SSE2 lowering and opt for
a cheaper, specialized instruction (or set of instructions). It also
needs to be generalized to AVX and AVX512 vector widths.

Currently, this does a decent but not perfect job for SSE2. There are
some specific shortcomings that I plan to address:
- We need a peephole combine to fold together shuffles where possible.
  There are cases where a previous shuffle could be modified slightly to
  arrange for elements to be in the correct position and a later shuffle
  eliminated. Doing this eagerly added quite a bit of complexity, and
  so my plan is to combine away these redundancies afterward.
- There are a lot more clever ways to use unpck and pack that need to be
  added. This is essential for real world shuffles as it turns out...

Once SSE2 is polished a bit I should be able to get interesting numbers
on performance improvements on benchmarks conducive to vectorization.
All of this will be off by default until it is functionally equivalent,
of course.

Differential Revision: http://reviews.llvm.org/D4225

llvm-svn: 211888
2014-06-27 11:23:44 +00:00
Andrea Di Biagio 1ee38843ac Silence a warning due to a comparison between signed and unsigned.
No functional change intended.

llvm-svn: 211782
2014-06-26 13:41:10 +00:00
Andrea Di Biagio 7fb85256bc [X86] Improve the selection of SSE3/AVX addsub instructions.
This patch teaches the backend how to canonicalize a shuffle vectors
according to the rule:

 - (shuffle (FADD A, B), (FSUB A, B), Mask) ->
       (shuffle (FSUB A, -B), (FADD A, -B), Mask)

Where 'Mask' is:
  <0,5,2,7>            ;; for v4f32 and v4f64 shuffles.
  <0,3>                ;; for v2f64 shuffles.
  <0,9,2,11,4,13,6,15> ;; for v8f32 shuffles.

In general, ISel only knows how to pattern-match a canonical
'fadd + fsub + blendi' dag node sequence into an ADDSUB instruction.

This new rule allows us to convert a non-canonical dag sequence into a
canonical one that will be matched by a single ADDSUB at ISel stage.

The idea of converting a non-canonical ADDSUB into a canonical one by
swapping the first two operands of the shuffle, and then negating the
second operand of the FADD and FSUB, was originally proposed by Hal Finkel.

llvm-svn: 211771
2014-06-26 10:45:21 +00:00
Andrea Di Biagio 07cdffc324 [X86] Always prefer to lower a VECTOR_SHUFFLE into a BLENDI instead of SHUFP (or VPERM2X128).
This patch teaches method 'LowerVECTOR_SHUFFLE' to give higher precedence to
the check for 'isBlendMask'; the idea is that, when possible, we should firstly
check if a shuffle performs a blend, and in case, try to lower it into a BLENDI
instead of selecting a SHUFP or (worse) a VPERM2X128.

In general:
 - AVX VBLENDPS/D always have better latency and throughput than VPERM2F128;
 - BLENDPS/D instructions tend to always have better 'reciprocal throughput'
   than the equivalent SHUFPS/D;
 - Both BLENDPS/D and SHUFPS/D are often decoded into the same number of
   m-ops; however, a m-op obtained from a BLENDPS/D can be scheduled to more
   than one execution port.

This patch:
 - Moves the check for 'isBlendMask' immediately before the check for
   'isSHUFPMask' within method 'LowerVECTOR_SHUFFLE';
 - Updates existing tests for sse/avx shuffle/blend instructions to verify
   that we select (v)blendps/d when possible (instead of (v)shufps/d or
   vperm2f128).

llvm-svn: 211720
2014-06-25 17:41:58 +00:00