Commit Graph

2993 Commits

Author SHA1 Message Date
Chandler Carruth 5219d4eff6 [x86] Fix a few typos in my comments spotted in passing.
llvm-svn: 214626
2014-08-02 10:29:34 +00:00
Chandler Carruth 34f9a987e9 [x86] Teach the target shuffle mask extraction to recognize unary forms
of normally binary shuffle instructions like PUNPCKL and MOVLHPS.

This detects cases where a single register is used for both operands
making the shuffle behave in a unary way. We detect this and adjust the
mask to use the unary form which allows the existing DAG combine for
shuffle instructions to actually work at all.

As a consequence, this uncovered a number of obvious bugs in the
existing DAG combine which are fixed. It also now canonicalizes several
shuffles even with the existing lowering. These typically are trying to
match the shuffle to the domain of the input where before we only really
modeled them with the floating point variants. All of the cases which
change to an integer shuffle here have something in the integer domain, so
there are no more or fewer domain crosses here AFAICT. Technically, it
might be better to go from a GPR directly to the floating point domain,
but detecting floating point *outputs* despite integer inputs is a lot
more code and seems unlikely to be worthwhile in practice. If folks are
seeing domain-crossing regressions here though, let me know and I can
hack something up to fix it.

Also as a consequence, a bunch of missed opportunities to form pshufb
now can be formed. Notably, splats of i8s now form pshufb.
Interestingly, this improves the existing splat lowering too. We go from
3 instructions to 1. Yes, we may tie up a register, but it seems very
likely to be worth it, especially if splatting the 0th byte (the
common case) as then we can use a zeroed register as the mask.

llvm-svn: 214625
2014-08-02 10:27:38 +00:00
Akira Hatanaka 3516669a50 [X86] Simplify X87 stackifier pass.
Stop using ST registers for function returns and inline-asm instructions and use
FP registers instead. This allows removing a large amount of code in the
stackifier pass that was needed to track register liveness and handle copies
between ST and FP registers and function calls returning floating point values.

It also fixes a bug which manifests when an ST register defined by an
inline-asm instruction was live across another inline-asm instruction, as shown
in the following sequence of machine instructions:

1. INLINEASM <es:frndint> $0:[regdef], %ST0<imp-def,tied5>
2. INLINEASM <es:fldcw $0>
3. %FP0<def> = COPY %ST0

<rdar://problem/16952634>

llvm-svn: 214580
2014-08-01 22:19:41 +00:00
Louis Gerbarg 67474e3755 Make sure no loads resulting from load->switch DAGCombine are marked invariant
Currently when DAGCombine converts loads feeding a switch into a switch of
addresses feeding a load the new load inherits the isInvariant flag of the left
side. This is incorrect since invariant loads can be reordered in cases where it
is illegal to reoarder normal loads.

This patch adds an isInvariant parameter to getExtLoad() and updates all call
sites to pass in the data if they have it or false if they don't. It also
changes the DAGCombine to use that data to make the right decision when
creating the new load.

llvm-svn: 214449
2014-07-31 21:45:05 +00:00
Matt Arsenault 6f2a526101 Add alignment value to allowsUnalignedMemoryAccess
Rename to allowsMisalignedMemoryAccess.

On R600, 8 and 16 byte accesses are mostly OK with 4-byte alignment,
and don't need to be split into multiple accesses. Vector loads with
an alignment of the element type are not uncommon in OpenCL code.

llvm-svn: 214055
2014-07-27 17:46:40 +00:00
Chandler Carruth 64a7c828cb [x86] Sink a variable only used by asserts into the asserts. Should fix
some -Werror bots, sorry for the noise.

llvm-svn: 214043
2014-07-27 01:45:49 +00:00
Chandler Carruth 80c5bfd843 [x86] Add a much more powerful framework for combining x86 shuffle
instructions in the legalized DAG, and leverage it to combine long
sequences of instructions to PSHUFB.

Eventually, the other x86-instruction-specific shuffle combines will
probably all be driven out of this routine. But the real motivation is
to detect after we have fully legalized and optimized a shuffle to the
minimal number of x86 instructions whether it is profitable to replace
the chain with a fully generic PSHUFB instruction even though doing so
requires either a load from a constant pool or tying up a register with
the mask.

While the Intel manuals claim it should be used when it replaces 5 or
more instructions (!!!!) my experience is that it is actually very fast
on modern chips, and so I've gon with a much more aggressive model of
replacing any sequence of 3 or more instructions.

I've also taught it to do some basic canonicalization to special-purpose
instructions which have smaller encodings than their generic
counterparts.

There are still quite a few FIXMEs here, and I've not yet implemented
support for lowering blends with PSHUFB (where its power really shines
due to being able to zero out lanes), but this starts implementing real
PSHUFB support even when using the new, fancy shuffle lowering. =]

llvm-svn: 214042
2014-07-27 01:15:58 +00:00
Chandler Carruth 5896698e2e [x86] Fix PR20355 (for real). There are many layers to this bug.
The tale starts with r212808 which attempted to fix inversion of the low
and high bits when lowering MUL_LOHI. Sadly, that commit did not include
any positive test cases, and just removed some operations from a test
case where the actual logic being changed isn't fully visible from the
test.

What this commit did was two things. First, it reversed the low and high
results in the formation of the MERGE_VALUES node for the multiple
results. This is entirely correct.

Second it changed the shuffles for extracting the low and high
components from the i64 results of the multiplies to extract them
assuming a big-endian-style encoding of the multiply results. This
second change is wrong. There is no big-endian encoding in x86, the
results of the multiplies are normal v2i64s: when cast to v4i32, the low
i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3.

However, the first change wasn't enough to actually fix the bug, which
is (I assume) why the second change was also made. There was another bug
in the MERGE_VALUES formation: we weren't using a VTList, and so were
getting a single result node! When grabbing the *second* result from the
node, we got... well.. colud be anything. I think this *appeared* to
invert things, but had to be causing other problems as well.

Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should
have been fine, right? NOOOPE! Because the core bug was never addressed,
the test in vector-idiv failed when I fixed the MERGE_VALUES node.
Because there are essentially no docs for this node, I had to guess at
how to fix it and tried swapping the operands, restoring the order of
the original code before r212808. While this "fixed" the test case (in
that we produced the write instructions) we were still extracting the
wrong elements of the i64s, and thus PR20355 was still broken.

This commit essentially reverts the big-endian-style extraction part of
r212808 and goes back to the original masks which were correct. Now that
the MERGE_VALUES node formation is also correct, everything works. I've
also included a more detailed test from PR20355 to make sure this stays
fixed.

llvm-svn: 214011
2014-07-26 03:46:57 +00:00
Chandler Carruth f6406ac5d6 [x86] Revert r214007: Fix PR20355 ...
The clever way to implement signed multiplication with unsigned *is
already implemented* and tested and working correctly. The bug is
somewhere else. Re-investigating.

This will teach me to not scroll far enough to read the code that did
what I thought needed to be done.

llvm-svn: 214009
2014-07-26 02:14:54 +00:00
Chandler Carruth 1bf4d19172 [x86] Fix PR20355 (and dups) by not using unsigned multiplication when
signed multiplication is requested. While there is not a difference in
the *low* half of the result, the *high* half (used specifically to
implement the signed division by these constants) certainly is used. The
test case I've nuked was actively asserting wrong code.

There is a delightful solution to doing signed multiplication even when
we don't have it that Richard Smith has crafted, but I'll add the
machinery back and implement that in a follow-up patch. This at least
restores correctness.

llvm-svn: 214007
2014-07-26 01:52:13 +00:00
Akira Hatanaka e5b6e0d231 [stack protector] Fix a potential security bug in stack protector where the
address of the stack guard was being spilled to the stack.

Previously the address of the stack guard would get spilled to the stack if it
was impossible to keep it in a register. This patch introduces a new target
independent node and pseudo instruction which gets expanded post-RA to a
sequence of instructions that load the stack guard value. Register allocator
can now just remat the value when it can't keep it in a register. 

<rdar://problem/12475629>

llvm-svn: 213967
2014-07-25 19:31:34 +00:00
Chandler Carruth 3de980d2ff [SDAG] Enable the new assert for out-of-range result numbers in
SDValues, fixing the two bugs left in the regression suite.

The key for both of these was the use a single value type rather than
a VTList which caused an unintentionally single-result merge-value node.
Fix this by getting the appropriate VTList in place.

Doing this exposed that the comments in x86's code abouth how MUL_LOHI
operands are handle is wrong. The bug with the use of out-of-range
result numbers was hiding the bug about the order of operands here (as
best i can tell). There are more places where the code appears to get
this backwards still...

llvm-svn: 213931
2014-07-25 09:19:23 +00:00
Chandler Carruth 80b869461e [x86] Make vector legalization of extloads work more like the "normal"
vector operation legalization with support for custom target lowering
and fallback to expand when it fails, and use this to implement sext and
anyext load lowering for x86 in a more principled way.

Previously, the x86 backend relied on a target DAG combine to "combine
away" sextload and extload nodes prior to legalization, or would expand
them during legalization with terrible code. This is particularly
problematic because the DAG combine relies on running over non-canonical
DAG nodes at just the right time to match several common and important
patterns. It used a combine rather than lowering because we didn't have
good lowering support, and to expose some tricks being employed to more
combine phases.

With this change it becomes a proper lowering operation, the backend
marks that it can lower these nodes, and I've added support for handling
the canonical forms that don't have direct legal representations such as
sextload of a v4i8 -> v4i64 on AVX1. With this change, our test cases
for this behavior continue to pass even after the DAG combiner beigns
running more systematically over every node.

There is some noise caused by this in the test suite where we actually
use vector extends instead of subregister extraction. This doesn't
really seem like the right thing to do, but is unlikely to be a critical
regression. We do regress in one case where by lowering to the
target-specific patterns early we were able to combine away extraneous
legal math nodes. However, this regression is completely addressed by
switching to a widening based legalization which is what I'm working
toward anyways, so I've just switched the test to that mode.

Differential Revision: http://reviews.llvm.org/D4654

llvm-svn: 213897
2014-07-24 22:09:56 +00:00
Reid Kleckner 9a412d13c1 Replace an assertion with a fatal error
Frontends are responsible for putting inalloca on parameters that would
be passed in memory and not registers.

llvm-svn: 213891
2014-07-24 19:53:33 +00:00
Saleem Abdulrasool 34610e33ae X86: silence sign comparison warning
GCC 4.8 detected a signed compare [-Wsign-compare].  Add a cast for the
destination index.  Add an assert to catch a potential overflow however unlikely
it may be.

llvm-svn: 213878
2014-07-24 17:12:06 +00:00
Filipe Cabecinhas 933cccf3fa Fixed PR20411 - bug in getINSERTPS()
When we had a vector_shuffle where we had an input from each vector, we
could miscompile it because we were assuming the input from V2 wouldn't
be moved from where it was on the vector.

Added a test case.

llvm-svn: 213826
2014-07-24 01:28:21 +00:00
Jim Grosbach 724e438c62 [X86,AArch64] Extend vcmp w/ unary op combine to work w/ more constants.
The transform to constant fold unary operations with an AND across a
vector comparison applies when the constant is not a splat of a scalar
as well.

llvm-svn: 213800
2014-07-23 20:41:43 +00:00
Jim Grosbach 8f6f0858ec X86: restrict combine to when type sizes are safe.
The folding of unary operations through a vector compare and mask operation
is only safe if the unary operation result is of the same size as its input.
For example, it's not safe for [su]itofp from v4i32 to v4f64.

llvm-svn: 213799
2014-07-23 20:41:38 +00:00
Robert Khasanov 74acbb7767 [SKX] Enabling mask instructions: encoding, lowering
KMOVB, KMOVW, KMOVD, KMOVQ, KNOTB, KNOTW, KNOTD, KNOTQ

Reviewed by Elena Demikhovsky <elena.demikhovsky@intel.com>

llvm-svn: 213757
2014-07-23 14:49:42 +00:00
Andrea Di Biagio 842355e900 Revert r211771. It was: "[X86] Improve the selection of SSE3/AVX addsub instructions".
This chang fully reverts r211771.
That revision added a canonicalization rule which has the potential to causes a
combine-cycle in the target-independent canonicalizing DAG combine.

The plan is to move the logic that forms target specific addsub nodes as part of
the lowering of shuffles.

llvm-svn: 213736
2014-07-23 11:20:24 +00:00
Andrea Di Biagio 4d8bd41600 [DAG] Refactor some logic. No functional change.
This patch removes function 'CommuteVectorShuffle' from X86ISelLowering.cpp
and moves its logic into SelectionDAG.cpp as method 'getCommutedVectorShuffles'.
This refactoring is in preperation of an upcoming change to the DAGCombiner.

llvm-svn: 213503
2014-07-21 07:28:51 +00:00
Tim Northover 871de902af X86: support fpext/fptrunc operations to and from 16-bit floats.
llvm-svn: 213374
2014-07-18 13:01:25 +00:00
Jim Grosbach b6535c32f5 X86: Constant fold converting vector setcc results to float.
Since the result of a SETCC for X86 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
  UNARYOP(AND(VECTOR_CMP(x,y), constant))
      --> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).

This implements the transform where UNARYOP is [su]int_to_fp.

For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
  %cmp = fcmp oeq <4 x float> %val, %test
  %ext = zext <4 x i1> %cmp to <4 x i32>
  %result = sitofp <4 x i32> %ext to <4 x float>
  ret <4 x float> %result
}

Before this change, the SSE code is generated as:
LCPI0_0:
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .long 1                       ## 0x1
  .section  __TEXT,__text,regular,pure_instructions
  .globl  _foo
  .align  4, 0x90
_foo:                                   ## @foo
  cmpeqps %xmm1, %xmm0
  andps LCPI0_0(%rip), %xmm0
  cvtdq2ps  %xmm0, %xmm0
  retq

After, the code is improved to:
LCPI0_0:
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .long 1065353216              ## float 1.000000e+00
  .section  __TEXT,__text,regular,pure_instructions
  .globl  _foo
  .align  4, 0x90
_foo:                                   ## @foo
  cmpeqps %xmm1, %xmm0
  andps LCPI0_0(%rip), %xmm0
  retq

The cvtdq2ps has been constant folded away and the floating point 1.0f
vector lanes are materialized directly via the ModRM operand of andps.

llvm-svn: 213342
2014-07-18 00:40:56 +00:00
Tim Northover 84ce0a642e CodeGen: generate single libcall for fptrunc -> f16 operations.
Previously we asserted on this code. Currently compiler-rt doesn't
actually implement any of these new libcalls, but external help is
pretty much the only viable option for LLVM.

I've followed the much more generic "__truncST2" naming, as opposed to
the odd name for f32 -> f16 truncation. This can obviously be changed
later, or overridden by any targets that need to.

llvm-svn: 213252
2014-07-17 11:12:12 +00:00
Tim Northover 2131044814 X86: support double extension of f16 type.
x86 has no native ability to extend an f16 to f64, but the same result
is obtained if we expand it into two separate extensions: f16 -> f32
-> f64.

Unfortunately the same is not true for truncate, so that still results
in a compilation failure.

llvm-svn: 213251
2014-07-17 11:04:04 +00:00
Tim Northover fd7e424935 CodeGen: extend f16 conversions to permit types > float.
This makes the two intrinsics @llvm.convert.from.f16 and
@llvm.convert.to.f16 accept types other than simple "float". This is
only strictly needed for the truncate operation, since otherwise
double rounding occurs and there's no way to represent the strict IEEE
conversion. However, for symmetry we allow larger types in the extend
too.

During legalization, we can expand an "fp16_to_double" operation into
two extends for convenience, but abort when the truncate isn't legal. A new
libcall is probably needed here.

Even after this commit, various target tweaks are needed to actually use the
extended intrinsics. I've put these into separate commits for clarity, so there
are no actual tests of f64 conversion here.

llvm-svn: 213248
2014-07-17 10:51:23 +00:00
Tim Northover 7f3e11e7c0 CodeGen: don't form illegail EXTLOAD operations.
It turns out that in most cases (the main exception being i1-related
types) once these operations are formed we cannot separate them and
the targets end up having to deal with them whether they want to or
not.

This is not a good situation, and a more reasonable default can be
formed by ackowledging this and having targets leave them as Legal.
Only x86 seems to be affected (other targets don't even try marking
the operation Expand).

Mostly there's no visible change here yet, but it will be useful to
have truly expanded EXTLOADS for MVT::f16 softening support.

llvm-svn: 213162
2014-07-16 15:37:24 +00:00
Andrea Di Biagio a03624d8ab [X86] Add a check for 'isMOVHLPSMask' within method 'isShuffleMaskLegal'.
Before this change, method 'isShuffleMaskLegal' didn't know that shuffles
implementing a 'movhlps' operation were perfectly legal for SSE targets.

This patch adds the missing check for 'isMOVHLPSMask' inside method
'isShuffleMaskLegal' to fix the problem.

The reason why it is important to do this is because the DAGCombiner
conservatively avoids combining a pair of shuffles if the resulting shuffle
node has an illegal mask. Before this patch, shuffles with a MOVHLPS mask were
wrongly considered not to be legal. This was the root cause of some poor-code
generation bugs.

llvm-svn: 213137
2014-07-16 11:29:39 +00:00
David Majnemer d4d9944416 Fix typo in comment
No functionality changed.

llvm-svn: 213052
2014-07-15 07:11:32 +00:00
Quentin Colombet 0f179c4d8a [X86] Fix the inversion of low and high bits for the lowering of MUL_LOHI.
Also add a few comments.

<rdar://problem/17581756>

llvm-svn: 212808
2014-07-11 12:08:23 +00:00
Chandler Carruth df8d0caab7 [x86] Add another combine that is particularly useful for the new vector
shuffle lowering: match shuffle patterns equivalent to an unpcklwd or
unpckhwd instruction.

This allows us to use generic lowering code for v8i16 shuffles and match
the unpack pattern late.

llvm-svn: 212705
2014-07-10 11:09:29 +00:00
Chandler Carruth 853fa0ac8d [x86] Expand the target DAG combining for PSHUFD nodes to be able to
combine into half-shuffles through unpack instructions that expand the
half to a whole vector without messing with the dword lanes.

This fixes some redundant instructions in splat-like lowerings for
v16i8, which are now getting to be *really* nice.

llvm-svn: 212695
2014-07-10 09:57:36 +00:00
Chandler Carruth a34a8e230d [x86] Tweak the v16i8 single input special case lowering for shuffles
that splat i8s into i16s.

Previously, we would try much too hard to arrange a sequence of i8s in
one half of the input such that we could unpack them into i16s and
shuffle those into place. This isn't always going to be a cheaper i8
shuffle than our other strategies. The case where it is always going to
be cheaper is when we can arrange all the necessary inputs into one half
using just i16 shuffles. It happens that viewing the problem this way
also makes it much easier to produce an efficient set of shuffles to
move the inputs into one half and then unpack them.

With this, our splat code gets one step closer to being not terrible
with the new experimental lowering strategy. It also exposes two
combines missing which I will add next.

llvm-svn: 212692
2014-07-10 09:16:40 +00:00
Chandler Carruth 7d2ffb5492 [x86] Initial improvements to the new shuffle lowering for v16i8
shuffles specifically for cases where a small subset of the elements in
the input vector are actually used.

This is specifically targetted at improving the shuffles generated for
trunc operations, but also helps out splat-like operations.

There is still some really low-hanging fruit here that I want to address
but this is a huge step in the right direction.

llvm-svn: 212680
2014-07-10 04:34:06 +00:00
Chandler Carruth b3840a55ae [x86] Refactor some of the new code for lowering v16i8 shuffles to
remove duplication and make it easier to select different strategies.

No functionality changed.

llvm-svn: 212674
2014-07-10 02:24:26 +00:00
Chandler Carruth d3561f6fec [SDAG] Make the new zext-vector-inreg node default to expand so targets
don't need to set it manually.

This is based on feedback from Tom who pointed out that if every target
needs to handle this we need to reach out to those maintainers. In fact,
it doesn't make sense to duplicate everything when anything other than
expand seems unlikely at this stage.

llvm-svn: 212661
2014-07-09 22:53:04 +00:00
Benjamin Kramer d6f1733add X86: When lowering v8i32 himuls use the correct shuffle masks for AVX2.
Turns out my trick of using the same masks for SSE4.1 and AVX2 didn't work out
as we have to blend two vectors. While there remove unecessary cross-lane moves
from the shuffles so the backend can lower it to palignr instead of vperm.

Fixes PR20118, a miscompilation of vector sdiv by constant on AVX2.

llvm-svn: 212611
2014-07-09 11:12:39 +00:00
Chandler Carruth afe4b2507e [x86] Add a ZERO_EXTEND_VECTOR_INREG DAG node and use it when widening
vector types to be legal and a ZERO_EXTEND node is encountered.

When we use widening to legalize vector types, extend nodes are a real
challenge. Either the input or output is likely to be legal, but in many
cases not both. As a consequence, we don't really have any way to
represent this situation and the prior code in the widening legalization
framework would just scalarize the extend operation completely.

This patch introduces a new DAG node to represent doing a zero extend of
a vector "in register". The core of the idea is to allow legal but
different vector types in the input and output. The output vector must
have fewer lanes but wider elements. The operation is defined to zero
extend the low elements of the input to the size of the output elements,
and drop all of the high elements which don't have a corresponding lane
in the output vector.

It also includes generic expansion of this node in terms of blending
a zero vector into the high elements of the vector and bitcasting
across. This in turn yields extremely nice code for x86 SSE2 when we use
the new widening legalization logic in conjunction with the new shuffle
lowering logic.

There is still more to do here. We need to support sign extension, any
extension, and potentially int-to-float conversions. My current plan is
to continue using similar synthetic nodes to model each of these
transitions with generic lowering code for each one.

However, with this patch LLVM already reaches performance parity with
GCC for the core C loops of the x264 code (assuming you disable the
hand-written assembly versions) when compiling for SSE2 and SSE3
architectures and enabling the new widening and lowering logic for
vectors.

Differential Revision: http://reviews.llvm.org/D4405

llvm-svn: 212610
2014-07-09 10:58:18 +00:00
Chandler Carruth ef5dcf571e [x86] Initialize a pointer to null to fix a bug in r212602.
This should restore GCC hosts (which happen to put the bad stuff into
the pointer) and MSan, etc.

llvm-svn: 212606
2014-07-09 10:36:42 +00:00
Chandler Carruth 2ebc942683 [x86] Re-apply a variant of the x86 side of r212324 now that the rest
has settled without incident, removing the x86-specific and overly
strict 'isVectorSplat' routine in favor of generic and more powerful
splat detection.

The primary motivation and result of this is that the x86 backend can
now see through splats which contain undef elements. This is essential
if we are using a widening form of legalization and I've updated a test
case to also run in that mode as before this change the generated code
for the test case was completely scalarized.

This version of the patch much more carefully handles the undef lanes.
- We aren't overly conservative about them in the shift lowering
  (where we will never use the splat itself).
- One place where the splat would have been re-used by the existing code
  now explicitly constructs a new constant splat that will be safe.
- The broadcast lowering is much more reasonable with undefs by doing
  a correct check of whether the splat is the only user of a loaded
  value, checking that the splat actually crosses multiple lanes before
  using a broadcast, and handling broadcasts of non-constant splats.

As a consequence of the last bullet, the weird usage of vpshufd instead
of vbroadcast is gone, and we actually can lower an AVX splat with
vbroadcastss where before we emitted a really strange pattern of
a vector load and a manual splat across the vector.

llvm-svn: 212602
2014-07-09 10:06:58 +00:00
Chandler Carruth 142e966261 [x86,SDAG] Sink the logic for folding shuffles of splats more
aggressively from the x86 shuffle lowering to the generic SDAG vector
shuffle formation code.

This code already tried to fold away shuffles of splats! It just had
lots of bugs and couldn't handle the case my new x86 shuffle lowering
needed.

First, it failed to correctly compute whether N2 was undef because it
pre-computed this, then did transformations which could *make* N2 undef,
then failed to ever re-consider the precomputed state.

Second, it didn't look through bitcasts at all, even in the safe cases
where they are just element-type bitcasts with no change to the number
of elements.

Third, it didn't handle all-zero bit casts nicely the way my code in the
x86 side of things did, which is essential to getting good zext-shuffle
lowerings.

But all of these are generic. I just ported the code down to this layer
and fixed the surrounding bugs. Tests exercising this in the x86 backend
still pass and some silly code in widen_cast-6.ll gets better. I updated
that test to be a bit more precise but it's still pretty unclear what
the value of the test is in this day and age.

llvm-svn: 212517
2014-07-08 08:45:38 +00:00
Andrea Di Biagio 2620b877b6 [x86] Fix assertion failure caused by a wrong combine of PSHUFD nodes with different types.
When combining a sequence of two PSHUFD dag nodes into a single PSHUFD,
make sure that we assign the correct type to the resulting PSHUFD.
X86ISD::PSHUFD dag nodes can be either MVT::v4i32 or MVT::v4f32.

Before this change, an assertion failure was triggered in method
'DAGCombinerInfo::CombineTo' when trying to combine the shuffles from the test
below into a single PSHUFD.

define <4 x float> @test1(<4 x float> %V) {
  %1 = shufflevector <4 x float> %V, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
  %2 = shufflevector <4 x float> %1, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
  ret <4 x float> %2
}

llvm-svn: 212498
2014-07-07 23:25:23 +00:00
Chandler Carruth beeacac0b3 [x86] Revert r212324 which was too aggressive w.r.t. allowing undef
lanes in vector splats.

The core problem here is that undef lanes can't *unilaterally* be
considered to contribute to splats. Their handling needs to be more
cautious. There is also a reported failure of the nightly testers
(thanks Tobias!) that may well stem from the same core issue. I'm going
to fix this theoretical issue, factor the APIs a bit better, and then
verify that I don't see anything bad with Tobias's reduction from the
test suite before recommitting.

Original commit message for r212324:
  [x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for
  any constant, constant FP, or undef splat and to tolerate any undef
  lanes in a splat, then replace all uses of isSplatVector in X86's
  lowering with it.

  This fixes issues where undef lanes in an otherwise splat vector would
  prevent the splat logic from firing. It is a touch more awkward to use
  this interface, but it is much more accurate. Suggestions for better
  interface structuring welcome.

  With this fix, the code generated with the widening legalization
  strategy for widen_cast-4.ll is *dramatically* improved as the special
  lowering strategies for a v16i8 SRA kick in even though the high lanes
  are undef.

  We also get a slightly different choice for broadcasting an aligned
  memory location, and use vpshufd instead of vbroadcastss. This looks
  like a minor win for pipelining and domain crossing, but a minor loss
  for the number of micro-ops. I suspect its a wash, but folks can
  easily tweak the lowering if they want.

llvm-svn: 212475
2014-07-07 19:03:32 +00:00
Chandler Carruth 0dcb366268 [x86] Teach the new vector shuffle lowering code to handle what is
essentially a DAG combine that never gets a chance to run.

We might typically expect DAG combining to remove shuffles-of-splats and
other similar patterns, but we don't get a chance to run the DAG
combiner when we recursively form sub-shuffles during the lowering of
a shuffle. So instead hand-roll a really important combine directly into
the lowering code to detect shuffles-of-splats, especially shuffles of
an all-zero splat which needn't even have the same element width, etc.

This lets the new vector shuffle lowering handle shuffles which
implement things like zero-extension really nicely. This will become
even more important when I wire the legalization of zero-extension to
vector shuffles with the new widening legalization strategy.

llvm-svn: 212444
2014-07-07 09:06:58 +00:00
Chandler Carruth 5d79bb5d32 [x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for
any constant, constant FP, or undef splat and to tolerate any undef
lanes in a splat, then replace all uses of isSplatVector in X86's
lowering with it.

This fixes issues where undef lanes in an otherwise splat vector would
prevent the splat logic from firing. It is a touch more awkward to use
this interface, but it is much more accurate. Suggestions for better
interface structuring welcome.

With this fix, the code generated with the widening legalization
strategy for widen_cast-4.ll is *dramatically* improved as the special
lowering strategies for a v16i8 SRA kick in even though the high lanes
are undef.

We also get a slightly different choice for broadcasting an aligned
memory location, and use vpshufd instead of vbroadcastss. This looks
like a minor win for pipelining and domain crossing, but a minor loss
for the number of micro-ops. I suspect its a wash, but folks can easily
tweak the lowering if they want.

llvm-svn: 212324
2014-07-04 08:11:49 +00:00
Chandler Carruth 19cff8205e [x86] Clarify that this lowering only applies to vectors and is only
used when we have SSE2.

llvm-svn: 212300
2014-07-03 22:57:44 +00:00
Andrea Di Biagio a37a2fc81f [X86] Add ISel patterns to select 'f32_to_f16' and 'f16_to_f32' dag nodes.
This patch adds tablegen patterns to select F16C float-to-half-float
conversion instructions from 'f32_to_f16' and 'f16_to_f32' dag nodes.

If the target doesn't have F16C, then 'f32_to_f16' and 'f16_to_f32'
are expanded into library calls.

llvm-svn: 212293
2014-07-03 21:51:06 +00:00
Chandler Carruth 739b6ada99 [x86] Fix crashes in lowering bitcast instructions with the widening
mode.

This also runs the test in that mode which would reproduce the crash.
What I love is that *every single FIXME* in the test is addressed by
switching to widening.

llvm-svn: 212254
2014-07-03 03:43:47 +00:00
Chandler Carruth 49a8b10d82 [x86] Based on a long conversation between myself, Jim Grosbach, Hal
Finkel, Eric Christopher, and a bunch of other people I'm probably
forgetting (sorry), add an option to the x86 backend to widen vectors
during type legalization rather than promote them.

This still would promote vNi1 vectors to get the masks right, but would
widen other vectors. A lot of experiments are piling up right now
showing that widening should probably be the default legalization
strategy outside of vNi1 cases, but it is very hard to test the
rammifications of that and fix bugs in widening-based legalization
without an option that enables it. I'll be checking in tests shortly
that use this option to exercise cases where widening doesn't work well
and hopefully we'll be able to switch fully to this soon.

llvm-svn: 212249
2014-07-03 02:11:29 +00:00
Benjamin Kramer e739cf3eb5 X86: When combining shuffles just remove shuffles that are completely redundant.
CombineTo doesn't allow replacing a node with itself so this would crash if the
combined shuffle is the same as the input shuffle.

llvm-svn: 212181
2014-07-02 15:09:44 +00:00
Juergen Ributzka 3bd03c7099 [DAG] Pass the argument list to the CallLoweringInfo via move semantics. NFCI.
The argument list vector is never used after it has been passed to the
CallLoweringInfo and moving it to the CallLoweringInfo is cleaner and
pretty much as cheap as keeping a pointer to it.

llvm-svn: 212135
2014-07-01 22:01:54 +00:00
Tim Northover df58625e3c X86: delegate expanding atomic libcalls to generic code.
On targets without cmpxchg16b or cmpxchg8b, the borderline atomic
operations were slipping through the gaps.

X86AtomicExpand.cpp was delegating to ISelLowering. Generic
ISelLowering was delegating to X86ISelLowering and X86ISelLowering was
asserting. The correct behaviour is to expand to a libcall, preferably
in generic ISelLowering.

This can be achieved by X86ISelLowering deciding it doesn't want the
faff after all.

llvm-svn: 212134
2014-07-01 21:44:59 +00:00
Tim Northover 277066ab43 X86: expand atomics in IR instead of as MachineInstrs.
The logic for expanding atomics that aren't natively supported in
terms of cmpxchg loops is much simpler to express at the IR level. It
also allows the normal optimisations and CodeGen improvements to help
out with atomics, instead of using a limited set of possible
instructions..

rdar://problem/13496295

llvm-svn: 212119
2014-07-01 18:53:31 +00:00
Andrea Di Biagio 53b6830069 [X86] Add support for builtin to read performance monitoring counters.
This patch adds support for a new builtin instruction called
__builtin_ia32_rdpmc.

Builtin '__builtin_ia32_rdpmc' is defined as a 'GCC builtin'; on X86, it can
be used to read performance monitoring counters. It takes as input the index
of the performance counter to read, and returns the value of the specified
performance counter as a 64-bit number.

Calls to this new builtin will map to instruction RDPMC.
The index in input to the builtin call is moved to register %ECX. The result
of the builtin call is the value of the specified performance counter (RDPMC
would return that quantity in registers RDX:RAX).

This patch:
 - Adds builtin int_x86_rdpmc as a GCCBuiltin;
 - Adds a new x86 DAG node called 'RDPMC_DAG';
 - Teaches how to lower this new builtin;
 - Adds an ISel pattern to select instruction RDPMC;
 - Fixes the definition of instruction RDPMC adding %RAX and %RDX as
   implicit definitions, and adding %ECX as implicit use;
 - Adds a LLVM test to verify that the new builtin is correctly selected.

llvm-svn: 212049
2014-06-30 17:14:21 +00:00
Chandler Carruth bd0717d7cc [x86] Fix a bug in the v8i16 shuffling exposed by the new splat-like
lowering for v16i8.

ASan and some bots caught this bug with existing test cases. Fixing it
even fixed a miscompile with one of the test cases. I'm still a bit
suspicious of this test case as I've not taken a proper amount of time
to think about it, but the fix here is strict goodness.

llvm-svn: 211976
2014-06-28 05:46:28 +00:00
Chandler Carruth 887c2c3482 [x86] Add handling for splat-like widenings of v16i8 shuffles.
These show up really frequently, not the least with actual splats. =] We
lowered these quite badly before. The new code path tries to widen i8
shuffles to i16 shuffles in a splat-like way. There are still some
inefficiencies in our i16 splat logic though, so we aren't really done
here.

Also, for certain patterns (bit of a gather-and-splat) we still
generate pretty silly code, and I've left a fixme for addressing it.
However, I'm not actually worried about this code pattern as much. The
old shuffle lowering generates a 29 instruction monstrosity for it that
should execute much more slowly.

llvm-svn: 211974
2014-06-28 05:16:40 +00:00
Chandler Carruth a94ef908d9 [x86] Fix another bug hit when bootstrapping with the new shuffle
lowering.

For maximum irony, I had already discovered this bug, diagnosed it, and
left FIXMEs about it in the test cases. =[ I just failed to go back over
those until after i had reduced a bootstrap miscompile down to a single
TU, stared at the assembly for an hour, and figured out the bug. Again.

Oh well.

llvm-svn: 211955
2014-06-27 20:07:40 +00:00
Chandler Carruth dd6470a9dd [x86] Fix a miscompile in the new shuffle lowering uncovered by
a bootstrap.

I managed to mis-remember how PACKUS worked on x86, and was using undef
for the high bytes instead of zero. The fix is fairly obvious.

llvm-svn: 211922
2014-06-27 18:25:23 +00:00
Alexander Kornienko b673b4b187 Clean up unused variable warning in release build.
llvm-svn: 211902
2014-06-27 15:30:55 +00:00
Chandler Carruth ed4a0bc734 [x86] Clean up some unused variables, especially in release builds.
llvm-svn: 211894
2014-06-27 12:04:18 +00:00
Chandler Carruth 688001f042 [x86] Teach the target combine step to aggressively fold pshufd insturcions.
Summary:
This allows it to fold pshufd instructions across intervening
half-shuffles and other noise. This pattern actually shows up in the
generic lowering tests, but I've also added direct tests using
intrinsics to make sure that the specific desired functionality is
working even if the lowering stuff changes in the future.

Differential Revision: http://reviews.llvm.org/D4292

llvm-svn: 211892
2014-06-27 11:40:13 +00:00
Chandler Carruth 0d6d1f2b17 [x86] Teach the target-specific combining how to aggressively fold
half-shuffles, even looking through intervening instructions in a chain.

Summary:
This doesn't happen to show up with any test cases I've found for the current
shuffle lowering, but previous attempts would benefit from this and it seems
generally useful. I've tested it directly using intrinsics, which also shows
that it will work with hand vectorized code as well.

Note that even though pshufd isn't directly used in these tests, it gets
exercised because we combine some of the half shuffles into a pshufd
first, and then merge them.

Differential Revision: http://reviews.llvm.org/D4291

llvm-svn: 211890
2014-06-27 11:34:40 +00:00
Chandler Carruth 97ebc2362c [x86] Teach the X86 backend to DAG-combine SSE2 shuffles that are
trivially redundant.

This fixes several cases in the new vector shuffle lowering algorithm
which would generate redundant shuffle instructions for the sake of
simplicity.

I'm also deleting a testcase which was somewhat ridiculous. It was
checking for a bug in 2007 about incorrectly transforming shuffles by
looking for the string "-86" in the output of a pretty substantial
function. This test case doesn't seem to have any value at this point.

Differential Revision: http://reviews.llvm.org/D4240

llvm-svn: 211889
2014-06-27 11:27:52 +00:00
Chandler Carruth 83860cfcfa [x86] Begin a significant overhaul of how vector lowering is done in the
x86 backend.

This sketches out a new code path for vector lowering, hidden behind an
off-by-default flag while it is under development. The fundamental idea
behind the new code path is to aggressively break down the problem space
in ways that ease selecting the odd set of instructions available on
x86, and carefully avoid scalarizing code even when forced to use older
ISAs. Notably, this starts off restricting itself to SSE2 and implements
the complete vector shuffle and blend space for 128-bit vectors in SSE2
without scalarizing. The plan is to layer on top of this ISA extensions
where we can bail out of the complex SSE2 lowering and opt for
a cheaper, specialized instruction (or set of instructions). It also
needs to be generalized to AVX and AVX512 vector widths.

Currently, this does a decent but not perfect job for SSE2. There are
some specific shortcomings that I plan to address:
- We need a peephole combine to fold together shuffles where possible.
  There are cases where a previous shuffle could be modified slightly to
  arrange for elements to be in the correct position and a later shuffle
  eliminated. Doing this eagerly added quite a bit of complexity, and
  so my plan is to combine away these redundancies afterward.
- There are a lot more clever ways to use unpck and pack that need to be
  added. This is essential for real world shuffles as it turns out...

Once SSE2 is polished a bit I should be able to get interesting numbers
on performance improvements on benchmarks conducive to vectorization.
All of this will be off by default until it is functionally equivalent
of course.

Differential Revision: http://reviews.llvm.org/D4225

llvm-svn: 211888
2014-06-27 11:23:44 +00:00
Andrea Di Biagio 1ee38843ac Silence a warning due to a comparison between signed and unsigned.
No functional change intended.

llvm-svn: 211782
2014-06-26 13:41:10 +00:00
Andrea Di Biagio 7fb85256bc [X86] Improve the selection of SSE3/AVX addsub instructions.
This patch teaches the backend how to canonicalize a shuffle vectors
according to the rule:

 - (shuffle (FADD A, B), (FSUB A, B), Mask) ->
       (shuffle (FSUB A, -B), (FADD A, -B), Mask)

Where 'Mask' is:
  <0,5,2,7>            ;; for v4f32 and v4f64 shuffles.
  <0,3>                ;; for v2f64 shuffles.
  <0,9,2,11,4,13,6,15> ;; for v8f32 shuffles.

In general, ISel only knows how to pattern-match a canonical
'fadd + fsub + blendi' dag node sequence into an ADDSUB instruction.

This new rule allows to convert a non-canonical dag sequence into a
canonical one that will be matched by a single ADDSUB at ISel stage.

The idea of converting a non-canonical ADDSUB into a canonical one by
swapping the first two operands of the shuffle, and then negating the
second operand of the FADD and FSUB, was originally proposed by Hal Finkel.

llvm-svn: 211771
2014-06-26 10:45:21 +00:00
Andrea Di Biagio 07cdffc324 [X86] Always prefer to lower a VECTOR_SHUFFLE into a BLENDI instead of SHUFP (or VPERM2X128).
This patch teaches method 'LowerVECTOR_SHUFFLE' to give higher precedence to
the check for 'isBlendMask'; the idea is that, when possible, we should firstly
check if a shuffle performs a blend, and in case, try to lower it into a BLENDI
instead of selecting a SHUFP or (worse) a VPERM2X128.

In general:
 - AVX VBLENDPS/D always have better latency and throughput than VPERM2F128;
 - BLENDPS/D instructions tend to always have better 'reciprocal throughput'
   than the equivalent SHUFPS/D;
 - Both BLENDPS/D and SHUFPS/D are often decoded into the same number of
   m-ops; however, a m-op obtained from a BLENDPS/D can be scheduled to more
   than one execution port.

This patch:
 - Moves the check for 'isBlendMask' immediately before the check for
   'isSHUFPMask' within method 'LowerVECTOR_SHUFFLE';
 - Updates existing tests for sse/avx shuffle/blend instructions to verify
   that we select (v)blendps/d when possible (instead of (v)shufps/d or
   vperm2f128).

llvm-svn: 211720
2014-06-25 17:41:58 +00:00
Chandler Carruth e5724d7532 [x86] Add intrinsics for the pshufd, pshuflw, and pshufhw instructions.
llvm-svn: 211694
2014-06-25 13:12:54 +00:00
NAKAMURA Takumi 1db5995d14 Re-apply r211399, "Generate native unwind info on Win64" with a fix to ignore SEH pseudo ops in X86 JIT emitter.
--
This patch enables LLVM to emit Win64-native unwind info rather than
DWARF CFI.  It handles all corner cases (I hope), including stack
realignment.

Because the unwind info is not flexible enough to describe stack frames
with a gap of unknown size in the middle, such as the one caused by
stack realignment, I modified register spilling code to place all spills
into the fixed frame slots, so that they can be accessed relative to the
frame pointer.

Patch by Vadim Chugunov!

Reviewed By: rnk

Differential Revision: http://reviews.llvm.org/D4081

llvm-svn: 211691
2014-06-25 12:41:52 +00:00
Andrea Di Biagio 6d9b9e125d [X86] Add target combine rule to select ADDSUB instructions from a build_vector
This patch teaches the backend how to combine a build_vector that implements
an 'addsub' between packed float vectors into a sequence of vector add
and vector sub followed by a VSELECT.

The new VSELECT is expected to be lowered into a BLENDI.
At ISel stage, the sequence 'vector add + vector sub + BLENDI' is
pattern-matched against ISel patterns added at r211427 to select
'addsub' instructions.
Added three more ISel patterns for ADDSUB.

Added test sse3-avx-addsub-2.ll to verify that we correctly emit 'addsub'
instructions.

llvm-svn: 211679
2014-06-25 10:02:21 +00:00
Robert Khasanov 21c836823f vpblend intrinsics combines as shifts intrinsics due to absence return stmt between them
Fix PR20088

Differential Revision: http://reviews.llvm.org/D4277

llvm-svn: 211617
2014-06-24 18:08:04 +00:00
NAKAMURA Takumi d77cefe633 Revert r211399, "Generate native unwind info on Win64"
It broke Legacy JIT Tests on x86_64-{mingw32|msvc}, aka Windows x64.

llvm-svn: 211480
2014-06-22 22:00:56 +00:00
Filipe Cabecinhas 1af2dfd274 Fix PR20087 by using the source index when changing the vector load
llvm-svn: 211472
2014-06-22 17:21:37 +00:00
Reid Kleckner 4a01230db4 Generate native unwind info on Win64
This patch enables LLVM to emit Win64-native unwind info rather than
DWARF CFI.  It handles all corner cases (I hope), including stack
realignment.

Because the unwind info is not flexible enough to describe stack frames
with a gap of unknown size in the middle, such as the one caused by
stack realignment, I modified register spilling code to place all spills
into the fixed frame slots, so that they can be accessed relative to the
frame pointer.

Patch by Vadim Chugunov!

Reviewed By: rnk

Differential Revision: http://reviews.llvm.org/D4081

llvm-svn: 211399
2014-06-20 20:35:47 +00:00
Chandler Carruth 8366cebeb5 [x86] Make the x86 PACKSSWB, PACKSSDW, PACKUSWB, and PACKUSDW
instructions available as synthetic SDNodes PACKSS and PACKUS that will
select to the correct instruction variants based on the return type.
This allows us to use these rather important instructions when lowering
vector shuffles.

Also moves the relevant instruction definitions to be split out from
the fully generic multiclasses to allow them to match these new SDNodes
in the same way that the UNPCK instructions do.

No functionality should actually be changed here.

llvm-svn: 211332
2014-06-20 01:05:28 +00:00
Andrea Di Biagio 54b0949af9 [X86] Teach how to combine horizontal binop even in the presence of undefs.
Before this change, the backend was unable to fold a build_vector dag
node with UNDEF operands into a single horizontal add/sub.

This patch teaches how to combine a build_vector with UNDEF operands into a
horizontal add/sub when possible. The algorithm conservatively avoids to combine
a build_vector with only a single non-UNDEF operand.

Added test haddsub-undef.ll to verify that we correctly fold horizontal binop
even in the presence of UNDEFs.

llvm-svn: 211265
2014-06-19 10:29:41 +00:00
Cameron McInally 0d0489cea6 Hook up vector int_ctlz for AVX512.
llvm-svn: 211024
2014-06-16 14:12:28 +00:00
Tim Northover 51472bc600 X86: lower ATOMIC_CMP_SWAP_WITH_SUCCESS directly
Lowering this new node allows us to fold the almost universal
comparison for success before it's even formed. Instead we can create
a copy from EFLAGS and an X86ISD::SETCC operation since all "cmpxchg"
instructions set the zero-flag to the correct value.

rdar://problem/13201607

llvm-svn: 210923
2014-06-13 17:29:39 +00:00
Tim Northover 420a216817 IR: add "cmpxchg weak" variant to support permitted failure.
This commit adds a weak variant of the cmpxchg operation, as described
in C++11. A cmpxchg instruction with this modifier is permitted to
fail to store, even if the comparison indicated it should.

As a result, cmpxchg instructions must return a flag indicating
success in addition to their original iN value loaded. Thus, for
uniformity *all* cmpxchg instructions now return "{ iN, i1 }". The
second flag is 1 when the store succeeded.

At the DAG level, a new ATOMIC_CMP_SWAP_WITH_SUCCESS node has been
added as the natural representation for the new cmpxchg instructions.
It is a strong cmpxchg.

By default this gets Expanded to the existing ATOMIC_CMP_SWAP during
Legalization, so existing backends should see no change in behaviour.
If they wish to deal with the enhanced node instead, they can call
setOperationAction on it. Beware: as a node with 2 results, it cannot
be selected from TableGen.

Currently, no use is made of the extra information provided in this
patch. Test updates are almost entirely adapting the input IR to the
new scheme.

Summary for out of tree users:
------------------------------

+ Legacy Bitcode files are upgraded during read.
+ Legacy assembly IR files will be invalid.
+ Front-ends must adapt to different type for "cmpxchg".
+ Backends should be unaffected by default.

llvm-svn: 210903
2014-06-13 14:24:07 +00:00
Andrea Di Biagio 2dd3b3b674 [X86] Teach how to dump the name of target node RDTSCP_DAG.
When I originally added node RDTSCP_DAG (r207127) I forgot to add
a string name for it in method 'getTargetNodeName'.

No functional change intended.

llvm-svn: 210769
2014-06-12 11:37:24 +00:00
Andrea Di Biagio 972ff97f8c [X86] Teach how to combine AVX and AVX2 horizontal binop on packed 256-bit vectors.
This patch adds target combine rules to match:
 - [AVX] Horizontal add/sub of packed single/double precision floating point
   values from 256-bit vectors;
 - [AVX2] Horizontal add/sub of packed integer values from 256-bit vectors.

llvm-svn: 210761
2014-06-12 10:53:48 +00:00
Tim Northover 4dc9eaa6ba X86: add stringy name for X86ISD::LCMPXCHG16_DAG
I don't know what "target specific node #383" is, and I don't want to
have to.

llvm-svn: 210663
2014-06-11 17:04:08 +00:00
Andrea Di Biagio c7af75f9a7 [X86] Refactor the logic to select horizontal adds/subs to a helper function.
This patch moves part of the logic implemented by the target specific
combine rules added at r210477 to a separate helper function.
This should make easier to add more rules for matching AVX/AVX2 horizontal
adds/subs.

This patch also fixes a problem caused by a wrong check performed on indices
of extract_vector_elt dag nodes in input to the scalar adds/subs.

New tests have been added to verify that we correctly check indices of
extract_vector_elt dag nodes when selecting a horizontal operation.

llvm-svn: 210644
2014-06-11 07:57:50 +00:00
Eric Christopher 68d7559e97 Use the TargetMachine on the DAG or the MachineFunction instead
of using the cached TargetMachine.

llvm-svn: 210589
2014-06-10 21:25:13 +00:00
Eric Christopher 19b1d73e88 Add a FIXME.
llvm-svn: 210559
2014-06-10 18:31:18 +00:00
Andrea Di Biagio fa508af0fe [X86] Improved target combine rules for selecting horizontal add/sub.
This patch slightly changes the algorithm introduced at revision 210477
to fix a problem where the algorithm was producing incorrect code for 
the VEX.256 encoded versions of horizontal add/sub.

For these cases, we now try to split the two 256-bit vectors into
128-bit chunks before emitting horizontal add/sub dag nodes.

Added a new test case into haddsub-2.ll.

llvm-svn: 210545
2014-06-10 16:42:57 +00:00
Tom Stellard 3787b12255 SelectionDAG: Don't use MVT::Other to determine legality of ISD::SELECT_CC
The SelectionDAG bad a special case for ISD::SELECT_CC, where it would
allow targets to specify:

setOperationAction(ISD::SELECT_CC, MVT::Other, Expand);

to indicate that they wanted to expand ISD::SELECT_CC for all types.
This wasn't applied correctly everywhere, and it makes writing new
DAG patterns with ISD::SELECT_CC difficult.

llvm-svn: 210541
2014-06-10 16:01:29 +00:00
Tim Northover 7b9f86da5d Revert "X86: elide comparisons after cmpxchg instructions."
This reverts commit r210523. It was committed prematurely without waiting for
review.

llvm-svn: 210524
2014-06-10 10:50:11 +00:00
Tim Northover 84ad29ca1f X86: elide comparisons after cmpxchg instructions.
The C++ and C semantics of the compare_and_swap operations actually
require us to return a boolean "success" value. In LLVM terms this
means a second comparison of the output of "cmpxchg" against the input
desired value.

However, x86's "cmpxchg" instruction sets all flags for the comparison
formed, so we can skip any secondary comparison. (N.b. this isn't true
for cmpxchg8b/16b, which only set ZF).

rdar://problem/13201607

llvm-svn: 210523
2014-06-10 10:49:07 +00:00
Eric Christopher a08f30bd40 Move all of the x86 subtarget initialized variables down into the x86 subtarget
from the x86 target machine. Should be no functional change.

llvm-svn: 210479
2014-06-09 17:08:19 +00:00
Andrea Di Biagio f99dd64f0a [X86] Add target combine rules for horizontal add/sub.
This patch adds new target specific combine rules to identify horizontal
add/sub idioms from BUILD_VECTOR dag nodes.

This patch also teaches the DAGCombiner how to canonicalize sequences of
insert_vector_elt dag nodes according to the following rule:

  (insert_vector_elt (insert_vector_elt A, I0), I1) ->
    (insert_vecto_elt (insert_vector_elt A, I1), I0)

This new canonicalization rule only triggers if the inner insert_vector
dag node has exactly one use; also, both indices must be known constants,
and I1 < I0.
This last rule made it possible to write a simpler algorithm to identify
horizontal add/sub patterns because now we don't have to worry about the
ordering of insert_vector_elt dag nodes.

llvm-svn: 210477
2014-06-09 16:54:41 +00:00
Andrea Di Biagio dfbdc71ea1 [X86] Avoid emitting unnecessary test instructions.
This patch teaches the backend how to check for the 'NoSignedWrap' flag on
binary operations to improve the emission of 'test' instructions.

If the result of a binary operation is known not to overflow we know that
resetting the Overflow flag is unnecessary and so we can avoid emitting
the test instruction.

Patch by Marcello Maggioni.

llvm-svn: 210468
2014-06-09 12:34:50 +00:00
Alexey Volkov 5260dba323 [X86] Use ADD/SUB instead of INC/DEC for Silvermont
According to Intel Software Optimization Manual 
on Silvermont INC or DEC instructions require 
an additional uop to merge the flags.
As a result, a branch instruction depending 
on an INC or a DEC instruction incurs a 1 cycle penalty.

Differential Revision: http://reviews.llvm.org/D3990

llvm-svn: 210466
2014-06-09 11:40:41 +00:00
Craig Topper 66f09ad041 [C++11] Use 'nullptr'.
llvm-svn: 210442
2014-06-08 22:29:17 +00:00
Benjamin Kramer d0700b2919 X86: Don't turn shifts into ands if there's another use that may not check for equality.
Fixes PR19964.

llvm-svn: 210371
2014-06-06 21:08:55 +00:00
Filipe Cabecinhas 5181255696 Fixed a bug in lowering shuffle_vectors to insertps
Summary:
We were being too strict and not accounting for undefs.
Added a test case and fixed another one where we improved codegen.

Reviewers: grosbach, nadav, delena

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D4039

llvm-svn: 210361
2014-06-06 18:07:06 +00:00
Andrea Di Biagio 4760813831 [X86] Fix checked arithmetic for i8 on X86.
When lowering a ISD::BRCOND into a test+branch, make sure that we
always use the correct condition code to emit the test operation.

This fixes PR19858: "i8 checked mul is wrong on x86".

Patch by Keno Fisher!

llvm-svn: 210032
2014-06-02 16:00:27 +00:00
Eric Christopher 8995833a34 Have the TLOF creation take a Triple rather than needing a subtarget.
llvm-svn: 209937
2014-05-31 00:07:32 +00:00
Andrea Di Biagio 446a527905 [X86] Add two combine rules to simplify dag nodes introduced during type legalization when promoting nodes with illegal vector type.
This patch teaches the backend how to simplify/canonicalize dag node
sequences normally introduced by the backend when promoting certain dag nodes
with illegal vector type.

This patch adds two new combine rules:
1) fold (shuffle (bitcast (BINOP A, B)), Undef, <Mask>) ->
        (shuffle (BINOP (bitcast A), (bitcast B)), Undef, <Mask>)

2) fold (BINOP (shuffle (A, Undef, <Mask>)), (shuffle (B, Undef, <Mask>))) ->
        (shuffle (BINOP A, B), Undef, <Mask>).

Both rules are only triggered on the type-legalized DAG.
In particular, rule 1. is a target specific combine rule that attempts
to sink a bitconvert into the operands of a binary operation.
Rule 2. is a target independet rule that attempts to move a shuffle
immediately after a binary operation.

llvm-svn: 209930
2014-05-30 23:17:53 +00:00
Filipe Cabecinhas d3aebaf875 Separate the check for blend shuffle_vector masks
Summary:
Separate the check for blend shuffle_vector masks into isBlendMask.
This function will also be used to check if a vector shuffle is legal. No
change in functionality was intended, but we ended up improving codegen on
two tests, which were being (more) optimized only if the resulting shuffle
was legal.

Reviewers: nadav, delena, andreadb

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3964

llvm-svn: 209923
2014-05-30 21:31:21 +00:00
Rafael Espindola a31f3e50dc Delete dead code.
GV is never used past this point. This was probably a copy and paste error.

llvm-svn: 209518
2014-05-23 15:07:51 +00:00
Andrea Di Biagio c8dd1ad85b [X86] Improve the lowering of BITCAST from MVT::f64 to MVT::v4i16/MVT::v8i8.
This patch teaches the x86 backend how to efficiently lower ISD::BITCAST dag
nodes from MVT::f64 to MVT::v4i16 (and vice versa), and from MVT::f64 to
MVT::v8i8 (and vice versa).

This patch extends the logic from revision 208107 to also handle MVT::v4i16
and MVT::v8i8. Also, this patch correctly propagates Undef values when
performing the widening of a vector (example: when widening from v2i32 to
v4i32, the upper 64bits of the resulting vector are 'undef').

llvm-svn: 209451
2014-05-22 16:21:39 +00:00
Quentin Colombet b4d53f1afa [X86] Fix a bug in the lowering of BLENDI introduced in r209043.
ISD::VSELECT mask uses 1 to identify the first argument and 0 to identify the
second argument.
On the other hand, BLENDI uses 0 to identify the first argument and 1 to
identify the second argument.
Fix the generation of the blend mask to account for this difference.

The bug did not show up with r209043, because we were not checking for the
actual arguments of the blend instruction!
This commit also fixes the test cases.

Note: The same mask works for the BLENDr variant because the arguments are
swapped during instruction selection (see the BLENDXXrr patterns).

<rdar://problem/16975435>

llvm-svn: 209324
2014-05-21 22:00:39 +00:00
Simon Atanasyan e7fa2314af Add parentheses to suppress the gcc warning '-Wparentheses'.
No functional changes.

llvm-svn: 209203
2014-05-20 10:23:04 +00:00
Filipe Cabecinhas dc92102766 Added more insertps optimizations
Summary:
When inserting an element that's coming from a vector load or a broadcast
of a vector (or scalar) load, combine the load into the insertps
instruction.
Added PerformINSERTPSCombine for the case where we need to fix the load
(load of a vector + insertps with a non-zero CountS).
Added patterns for the broadcasts.

Also added tests for SSE4.1, AVX, and AVX2.

Reviewers: delena, nadav, craig.topper

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3581

llvm-svn: 209156
2014-05-19 19:45:57 +00:00
Benjamin Kramer f3ad23551d SDAG: Legalize vector BSWAP into a shuffle if the shuffle is legal but the bswap not.
- On ARM/ARM64 we get a vrev because the shuffle matching code is really smart. We still unroll anything that's not v4i32 though.
- On X86 we get a pshufb with SSSE3. Required more cleverness in isShuffleMaskLegal.
- On PPC we get a vperm for v8i16 and v4i32. v2i64 is unrolled.

llvm-svn: 209123
2014-05-19 13:12:38 +00:00
Saleem Abdulrasool f3a5a5c546 Target: remove old constructors for CallLoweringInfo
This is mostly a mechanical change changing all the call sites to the newer
chained-function construction pattern.  This removes the horrible 15-parameter
constructor for the CallLoweringInfo in favour of setting properties of the call
via chained functions.  No functional change beyond the removal of the old
constructors are intended.

llvm-svn: 209082
2014-05-17 21:50:17 +00:00
Chandler Carruth c85473143c [x86] Fix a bad predicate I spotted by inspection -- pshufhw and pshuflw
were added in SSE2, no SSSE3. Found this while auditing all uses of
SSSE3 in the X86 target. I don't actually expect this to make
a significant difference on anything and I don't have any detailed test
cases but I updated the existing test cases that already covered some of
this code path.

llvm-svn: 209056
2014-05-17 03:29:20 +00:00
Filipe Cabecinhas 89654da069 Implemented special cases for PerformVSELECTCombine.
vselects with constant masks, after legalization, will get turned into
specialized shuffle_vectors so they can be matched to blend+imm
instructions.

Fixed some tests.

llvm-svn: 209044
2014-05-16 22:47:54 +00:00
Filipe Cabecinhas e15551832c Lower vselects into X86ISD::BLENDI when appropriate.
LowerVSELECT will, if possible, generate a X86ISD::BLENDI DAG node if the
condition is constant and we can emit that instruction, given the
subtarget.

This is not enough for all cases. An additional SELECTCombine optimization
will be committed.

Fixed tests that were expecting variable blends but where a blend+imm can
be generated.
Added test where we can't emit blend+immediate.
Added avx2 blend+imm tests.

llvm-svn: 209043
2014-05-16 22:47:49 +00:00
Filipe Cabecinhas 17254aaead Implemented LowerVSELECT to custom lower some instructions.
No functionality change intended. The types that previously were set to
lower as Expand or Legal are doing the same thing with this lowering
function.

llvm-svn: 209042
2014-05-16 22:47:43 +00:00
Rafael Espindola e0098928c9 Delete getAliasedGlobal.
llvm-svn: 209040
2014-05-16 22:37:03 +00:00
Andrea Di Biagio d621120533 [X86] Teach the backend how to fold SSE4.1/AVX/AVX2 blend intrinsics.
Added target specific combine rules to fold blend intrinsics according
to the following rules:
 1) fold(blend A, A, Mask) -> A;
 2) fold(blend A, B, <allZeros>) -> A;
 3) fold(blend A, B, <allOnes>) -> B.

Added two new tests to verify that the new folding rules work for all
the optimized blend intrinsics.

llvm-svn: 208895
2014-05-15 15:18:15 +00:00
Alp Toker beaca19c7c Fix typos
llvm-svn: 208839
2014-05-15 01:52:21 +00:00
Jay Foad a0653a3e6c Rename ComputeMaskedBits to computeKnownBits. "Masked" has been
inappropriate since it lost its Mask parameter in r154011.

llvm-svn: 208811
2014-05-14 21:14:37 +00:00
Reid Kleckner 7a59e0845f Try to fix an SDAG dependence issue with sret
r208453 added support for having sret on the second parameter.  In that
change, the code for copying sret into a virtual register was hoisted
into the loop that lowers formal parameters.  This caused a "Wrong
topological sorting" assertion failure during scheduling when a
parameter is passed in memory.  This change undoes that by creating a
second loop that deals with sret.

I'm worried that this fix is incomplete.  I don't fully understand the
dependence issues.  However, with this change we produce the same DAGs
we used to produce, so if they are broken, they are just as broken as
they have always been.

llvm-svn: 208637
2014-05-12 22:01:27 +00:00
Aaron Ballman 29fd7b9b20 Silencing an MSVC warning about not all control paths returning a value (even though the switch is fully covered). No functional change.
llvm-svn: 208565
2014-05-12 14:22:58 +00:00
Benjamin Kramer 3b36b72a9c X86: Make sure that we have SSE4.1 before we generate insertps nodes.
PR19721.

llvm-svn: 208552
2014-05-12 13:12:08 +00:00
NAKAMURA Takumi 6c383021c9 X86ISelLowering.cpp:LowerINTRINSIC_W_CHAIN(): Prune impossible "default:" [-Wcovered-switch-default]
llvm-svn: 208533
2014-05-12 10:16:46 +00:00
Elena Demikhovsky 4f591c0d45 Fixed compilation issue
llvm-svn: 208524
2014-05-12 07:45:41 +00:00
Elena Demikhovsky 8e8fde8e93 AVX-512: changes in intrinsics
1) Changed gather and scatter intrinsics. Now they are aligned with GCC built-ins. There is no more non-masked form. Masked intrinsic receives -1 if all lanes are executed.
2) I changed the function that works with intrinsics inside X86ISelLowering.cpp. I put all intrinsics in one table. I did it for INTRINSICS_W_CHAIN and plan to put all intrinsics from WO_CHAIN set to the same table in order to avoid the long-long "switch". (I wanted to use static map initialization that allowed by C++11 but I wasn't able to compile it on VS2012).
3) I added gather/scatter prefetch intrinsics.
4) I fixed MRMm encoding for masked instructions.

llvm-svn: 208522
2014-05-12 07:18:51 +00:00
Hal Finkel f0e086a0bc Pass the value type to TLI::getRegisterByName
We must validate the value type in TLI::getRegisterByName, because if we
don't and the wrong type was used with the IR intrinsic, then we'll assert
(because we won't be able to find a valid register class with which to
construct the requested copy operation). For PPC64, additionally, the type
information is necessary to decide between the 64-bit register and the 32-bit
subregister.

No functionality change.

llvm-svn: 208508
2014-05-11 19:29:07 +00:00
Filipe Cabecinhas 0e3d1cb5d6 Fixed a bug when lowering build_vector (PR19694)
When lowering build_vector to an insertps, we would still lower it, even
if the source vectors weren't v4x32. This would break on avx if the source
was a v8x32. We now check the type of the source vectors.

llvm-svn: 208487
2014-05-11 08:12:56 +00:00
Reid Kleckner 7941856445 Allow sret on the second parameter as well as the first
MSVC always places the implicit sret parameter after the implicit this
parameter of instance methods.  We used to handle this for
x86_thiscallcc by allocating the sret parameter on the stack and leaving
the this pointer in ecx, but that doesn't handle alternative calling
conventions like cdecl, stdcall, fastcall, or the win64 convention.

Instead, change the verifier to allow sret on the second parameter.

This also requires changing the Mips and X86 backends to return the
argument with the sret parameter, instead of assuming that the sret
parameter comes first.

The Sparc backend also returns sret parameters in a register, but I
wasn't able to update it to handle secondary sret parameters.  It
currently calls report_fatal_error if you feed it an sret in the second
parameter.

Reviewers: rafael.espindola, majnemer

Differential Revision: http://reviews.llvm.org/D3617

llvm-svn: 208453
2014-05-09 22:32:13 +00:00
Andrea Di Biagio 932ff1ddd7 Fix 80 col violation.
No functional change intended.

llvm-svn: 208405
2014-05-09 11:08:23 +00:00
Filipe Cabecinhas e4b482b3ed Optimize shufflevector that copies an i64/f64 and zeros the rest.
Summary:
Also ran clang-format on the function. The code added is the last else
if block.

Reviewers: nadav, craig.topper, delena

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3518

llvm-svn: 208372
2014-05-08 23:16:08 +00:00
Andrea Di Biagio e85ba4df52 [X86] Add target specific combine rules to fold SSE2/AVX2 packed arithmetic shift intrinsics.
This patch teaches the backend how to combine packed SSE2/AVX2 arithmetic shift
intrinsics.

The rules are:
 - Always fold a packed arithmetic shift by zero to its first operand;
 - Convert a packed arithmetic shift intrinsic dag node into a ISD::SRA only if
   the shift count is known to be smaller than the vector element size.

This patch also teaches to function 'getTargetVShiftByConstNode' how fold
target specific vector shifts by zero.

Added two new tests to verify that the DAGCombiner is able to fold
sequences of SSE2/AVX2 packed arithmetic shift calls.

llvm-svn: 208342
2014-05-08 17:44:04 +00:00
Filipe Cabecinhas 095d9d573a Lower certain build_vectors to insertps instructions
Summary:
Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.
Also optimize when we move one element in a vector to a different place
in that vector while zeroing out some of the other elements.

Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.

Added some tests for different cases where this optimization triggers.

Reviewers: nadav, delena, craig.topper

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3521

llvm-svn: 208271
2014-05-08 00:25:16 +00:00
Andrea Di Biagio c14ccc9184 [X86] Improve the lowering of BITCAST dag nodes from type f64 to type v2i32 (and vice versa).
Before this patch, the backend always emitted a store+load sequence to
bitconvert from f64 to i64 the input operand of a ISD::BITCAST dag node that
performed a bitconvert from type MVT::f64 to type MVT::v2i32. The resulting
i64 node was then used to build a v2i32 vector.

With this patch, the backend now produces a cheaper SCALAR_TO_VECTOR from
MVT::f64 to MVT::v2f64. That SCALAR_TO_VECTOR is then followed by a "free"
bitcast to type MVT::v4i32. The elements of the resulting
v4i32 are then extracted to build a v2i32 vector (which is illegal and
therefore promoted to MVT::v2i64).

This is in general cheaper than emitting a stack store+load sequence
to bitconvert the operand from type f64 to type i64.

llvm-svn: 208107
2014-05-06 17:09:03 +00:00
Renato Golin c7aea40ec6 Implememting named register intrinsics
This patch implements the infrastructure to use named register constructs in
programs that need access to specific registers (bare metal, kernels, etc).

So far, only the stack pointer is supported as a technology preview, but as it
is, the intrinsic can already support all non-allocatable registers from any
architecture.

llvm-svn: 208104
2014-05-06 16:51:25 +00:00
Reid Kleckner 4a406d32e9 Fix i128 div/mod on mingw64
The Win64 docs are very clear that anything larger than 8 bytes is
passed by reference, and GCC MinGW64 honors that for __modti3 and
friends.

Patch by Jameson Nash!

llvm-svn: 208029
2014-05-06 01:20:42 +00:00
Filipe Cabecinhas fe59062b75 Revert "Optimize shufflevector that copies an i64/f64 and zeros the rest."
This reverts commit 207992. I misread the phab number on the LGTM.

llvm-svn: 207993
2014-05-05 19:40:36 +00:00
Filipe Cabecinhas 263d98c19f Optimize shufflevector that copies an i64/f64 and zeros the rest.
Summary:
Also ran clang-format on the function. The code added is the last else
if block.

Reviewers: nadav, craig.topper

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3518

llvm-svn: 207992
2014-05-05 19:36:28 +00:00
Craig Topper 2d2aa0ca1f Use makeArrayRef insted of calling ArrayRef<T> constructor directly. I introduced most of these recently.
llvm-svn: 207616
2014-04-30 07:17:30 +00:00
Reid Kleckner fb69308568 Implement X86 code generation for musttail
Currently, musttail codegen is relying on sibcall optimization, and
reporting a fatal error if fails.  Sibcall optimization fails when stack
arguments need to be modified, which is insufficient for musttail.

The logic for moving arguments in memory safely is already implemented
for GuaranteedTailCallOpt.  This change merely arranges for musttail
calls to use it.

No functional change for GuaranteedTailCallOpt.

Reviewers: espindola

Differential Revision: http://reviews.llvm.org/D3493

llvm-svn: 207598
2014-04-29 23:55:41 +00:00
Elena Demikhovsky 299cf511c4 AVX-512: optimized a shuffle pattern to VINSERTI64x4.
Added intrinsics for VPERMT2PS/PD/D/Q instructions.

llvm-svn: 207513
2014-04-29 09:09:15 +00:00
Quentin Colombet 50efe87e5b [X86] Add more details in the comments of X86TargetLowering::getScalingFactorCost.
llvm-svn: 207432
2014-04-28 18:39:57 +00:00
Craig Topper e73658ddbb [C++] Use 'nullptr'.
llvm-svn: 207394
2014-04-28 04:05:08 +00:00
Craig Topper dd5e16dd34 Convert one last signature of getNode to take an ArrayRef of SDUse.
llvm-svn: 207376
2014-04-27 19:21:06 +00:00
Craig Topper 64941d9786 Convert SelectionDAG::getMergeValues to use ArrayRef.
llvm-svn: 207374
2014-04-27 19:20:57 +00:00
Benjamin Kramer 3693e77cb4 X86: If SSE4.1 is missing lower SMUL_LOHI of v4i32 to pmuludq and fix up the high parts.
This is more expensive than pmuldq but still cheaper than scalarizing the whole thing.

llvm-svn: 207370
2014-04-27 18:47:41 +00:00
Craig Topper 206fcd450a Convert getMemIntrinsicNode to take ArrayRef of SDValue instead of pointer and size.
llvm-svn: 207329
2014-04-26 19:29:41 +00:00
Craig Topper 48d114bed1 Convert SelectionDAG::getNode methods to use ArrayRef<SDValue>.
llvm-svn: 207327
2014-04-26 18:35:24 +00:00
Benjamin Kramer c2ad8f3ef1 Print X86ISD::PMULDQ nodes properly in debug output.
llvm-svn: 207322
2014-04-26 16:26:41 +00:00
Benjamin Kramer 6d2dff61f9 X86: Lower SMUL_LOHI of v4i32 to pmuldq when SSE4.1 is available.
llvm-svn: 207318
2014-04-26 14:12:19 +00:00
Benjamin Kramer c9827ab103 X86: Add patterns for MULHU/MULHS of v8i16 and v16i16.
This gets us pretty code for divs of i16 vectors. Turn the existing
intrinsics into the corresponding nodes.

llvm-svn: 207317
2014-04-26 13:01:03 +00:00
Benjamin Kramer ad0168702a Rip out X86-specific vector SDIV lowering, make the corresponding DAGCombiner transform work on vectors.
llvm-svn: 207316
2014-04-26 13:00:53 +00:00
Benjamin Kramer 29139d5cb5 X86: Custom lower v4i32 UMUL_LOHI into 2 pmuludqs.
Test will follow soon.

llvm-svn: 207314
2014-04-26 12:06:11 +00:00
Quentin Colombet ea18933d97 [X86] Implement TargetLowering::getScalingFactorCost hook.
Scaling factors are not free on X86 because every "complex" addressing mode
breaks the related instruction into 2 allocations instead of 1.

<rdar://problem/16730541>

llvm-svn: 207301
2014-04-26 01:11:26 +00:00
Filipe Cabecinhas 363b570d2a Optimization for certain shufflevector by using insertps.
Summary:
If we're doing a v4f32/v4i32 shuffle on x86 with SSE4.1, we can lower
certain shufflevectors to an insertps instruction:
When most of the shufflevector result's elements come from one vector (and
keep their index), and one element comes from another vector or a memory
operand.

Added tests for insertps optimizations on shufflevector.
Added support and tests for v4i32 vector optimization.

Reviewers: nadav

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D3475

llvm-svn: 207291
2014-04-25 23:51:17 +00:00
Craig Topper 062a2baef0 [C++] Use 'nullptr'. Target edition.
llvm-svn: 207197
2014-04-25 05:30:21 +00:00
Benjamin Kramer 76f753e9a9 X86: Don't transform shifts into ands when the sign bit is tested.
Should unbreak MultiSource/Benchmarks/mediabench/g721/g721encode/encode.

llvm-svn: 207145
2014-04-24 20:51:37 +00:00
Reid Kleckner 5772b77789 Add 'musttail' marker to call instructions
This is similar to the 'tail' marker, except that it guarantees that
tail call optimization will occur.  It also comes with convervative IR
verification rules that ensure that tail call optimization is possible.

Reviewers: nicholas

Differential Revision: http://llvm-reviews.chandlerc.com/D3240

llvm-svn: 207143
2014-04-24 20:14:34 +00:00
Andrea Di Biagio d1ab866868 [X86] Add support for Read Time Stamp Counter x86 builtin intrinsics.
This patch:
- Adds two new X86 builtin intrinsics ('int_x86_rdtsc' and
   'int_x86_rdtscp') as GCCBuiltin intrinsics;
- Teaches the backend how to lower the two new builtins;
- Introduces a common function to lower READCYCLECOUNTER dag nodes
  and the two new rdtsc/rdtscp intrinsics;
- Improves (and extends) the existing x86 test 'rdtsc.ll'; now test 'rdtsc.ll'
  correctly verifies that both READCYCLECOUNTER and the two new intrinsics
  work fine for both 64bit and 32bit Subtargets.

llvm-svn: 207127
2014-04-24 17:18:27 +00:00
Benjamin Kramer f4575db2fd X86: Emit test instead of constant shift + compare if the shift result is unused.
This allows us to compile
  return (mask & 0x8 ? a : b);
into
  testb $8, %dil
  cmovnel %edx, %esi
instead of
  andl  $8, %edi
  shrl  $3, %edi
  cmovnel %edx, %esi

which we formed previously because dag combiner canonicalizes setcc of and into shift.

llvm-svn: 207088
2014-04-24 08:15:31 +00:00
Elena Demikhovsky acc5c9e83e AVX-512: store and truncstore for i1 values
llvm-svn: 206897
2014-04-22 14:13:10 +00:00
Lang Hames 3067ab2344 [X86] Use tablegen instead of DAG combines to match BZHI instructions, as
suggested by Ben Kramer in review of r206738.

Thanks again Ben!

llvm-svn: 206879
2014-04-22 10:41:56 +00:00
Lang Hames f6f42cac3f [X86] Don't use BZHI for short masks (>=32 bits). Thanks to Ben Kramer for the
review.

llvm-svn: 206869
2014-04-22 07:40:34 +00:00
Chandler Carruth 84e68b2994 [Modules] Fix potential ODR violations by sinking the DEBUG_TYPE
definition below all of the header #include lines, lib/Target/...
edition.

llvm-svn: 206842
2014-04-22 02:41:26 +00:00
Lang Hames 5aa6ee80b6 [X86] ISEL (and X, <constant mask>) to BZHI when BMI2 is available.
Generating BZHI in the variable mask case, i.e. (and X, (sub (shl 1, N), 1)),
was already supported, but we were missing the constant-mask case. This patch
fixes that.

<rdar://problem/15480077>

llvm-svn: 206738
2014-04-21 08:18:53 +00:00
Adam Nemet ee7a3e38c9 [X86] Improve buildFromShuffleMostly for AVX
For a 256-bit BUILD_VECTOR consisting mostly of shuffles of 256-bit vectors,
both the BUILD_VECTOR and its operands may need to be legalized in multiple
steps.  Consider:

(v8f32 (BUILD_VECTOR (extract_vector_elt (v8f32 %vreg0,) Constant<1>),
                     (extract_vector_elt %vreg0, Constant<2>),
                     (extract_vector_elt %vreg0, Constant<3>),
                     (extract_vector_elt %vreg0, Constant<4>),
                     (extract_vector_elt %vreg0, Constant<5>),
                     (extract_vector_elt %vreg0, Constant<6>),
                     (extract_vector_elt %vreg0, Constant<7>),
                     %vreg1))

a. We can't build a 256-bit vector efficiently so, we need to split it into
two 128-bit vecs and combine them with VINSERTX128.

b. Operands like (extract_vector_elt (v8f32 %vreg0), Constant<7>) needs to be
split into a VEXTRACTX128 and a further extract_vector_elt from the
resulting 128-bit vector.

c. The extract_vector_elt from b. is lowered into a shuffle to the first
element and a movss.

Depending on the order in which we legalize the BUILD_VECTOR and its
operands[1], buildFromShuffleMostly may be faced with:

(v4f32 (BUILD_VECTOR (extract_vector_elt
                      (vector_shuffle<1,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
                      Constant<0>),
                     (extract_vector_elt
                      (vector_shuffle<2,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
                      Constant<0>),
                     (extract_vector_elt
                      (vector_shuffle<3,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
                      Constant<0>),
                     %vreg1))

In order to figure out the underlying vector and their identity we need to see
through the shuffles.

[1] Note that the order in which operations and their operands are legalized is
only guaranteed in the first iteration of LegalizeDAG.

Fixes <rdar://problem/16296956>

llvm-svn: 206634
2014-04-18 19:44:16 +00:00
Andrea Di Biagio aac2eac4c2 [X86] Improve the lowering of packed shifts by constant build_vector.
This patch teaches the backend how to efficiently lower logical and
arithmetic packed shifts on both SSE and AVX/AVX2 machines.

When possible, instead of scalarizing a vector shift, the backend should try
to expand the shift into a sequence of two packed shifts by immedate count
followed by a MOVSS/MOVSD.

Example
  (v4i32 (srl A, (build_vector < X, Y, Y, Y>)))

Can be rewritten as:
  (v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>)))

[with X and Y ConstantInt]

The advantage is that the two new shifts from the example would be lowered into
X86ISD::VSRLI nodes. This is always cheaper than scalarizing the vector into
four scalar shifts plus four pairs of vector insert/extract.

llvm-svn: 206316
2014-04-15 19:30:48 +00:00
Nick Lewycky aad475b324 Break PseudoSourceValue out of the Value hierarchy. It is now the root of its own tree containing FixedStackPseudoSourceValue (which you can use isa/dyn_cast on) and MipsCallEntry (which you can't). Anything that needs to use either a PseudoSourceValue* and Value* is strongly encouraged to use a MachinePointerInfo instead.
llvm-svn: 206255
2014-04-15 07:22:52 +00:00
David Blaikie 9027abae53 Change argument order and add explanatory comment to r206130
Changes requested in code review by Eric Christopher of r206130.

llvm-svn: 206219
2014-04-14 22:23:06 +00:00
David Blaikie 269e0fb2e4 Fix instruction debug info location during legalization
I found this from a particular GDB test suite case of inlining
(something similar is provided as a test case) but came across a few
other related cases (other callers of the same functions, and one other
instance of the same coding mistake in a separate function).

I'm not sure what the best way to test this is (let alone to cover the
other cases I discovered), so hopefully this sufficies - open to ideas.

llvm-svn: 206130
2014-04-13 06:39:55 +00:00
Reid Kleckner 9c6582129a Move the segmented stack switch to a function attribute
This removes the -segmented-stacks command line flag in favor of a
per-function "split-stack" attribute.

Patch by Luqman Aden and Alex Crichton!

llvm-svn: 205997
2014-04-10 22:58:43 +00:00
Elena Demikhovsky cf0b9bafc3 AVX-512: insert element to mask vector; store i1 data
Implemented INSERT_VECTOR_ELT operation for v16i1 and v8i1 vectors;
Implemented "store" for i1 type

llvm-svn: 205850
2014-04-09 12:37:50 +00:00
Elena Demikhovsky 3dcfbdfa54 AVX-512: Added fp_to_uint and uint_to_fp patterns.
llvm-svn: 205754
2014-04-08 07:24:02 +00:00
Matt Arsenault cf6f688a40 Add DAG parameter to ComputeNumSignBitsForTargetNode
This way, you can check the number of sign bits in the
operands. The depth parameter it already has is pretty useless
without this.

llvm-svn: 205649
2014-04-04 20:13:13 +00:00
Craig Topper 840beec2d0 Make consistent use of MCPhysReg instead of uint16_t throughout the tree.
llvm-svn: 205610
2014-04-04 05:16:06 +00:00
Yaron Keren 2895496852 Added isTargetWindowsMSVC(), renamed isTargetMingw() to isTargetWindowsGNU()
and isTargetCygwin() to isTargetWindowsCygwin() to be consistent with the
four Windows environments in Triple.h.

Suggestion by Saleem Abdulrasool!

llvm-svn: 205393
2014-04-02 04:27:51 +00:00
Yaron Keren 48d68d439a If isKnownWindowsMSVCEnvironment then getOS == Triple::Win32 and
Environment == Triple::MSVC so it will never be MinGW or Cygwin.

llvm-svn: 205349
2014-04-01 18:52:55 +00:00
Yaron Keren 136fe7db46 isTargetWindows() renamed to isTargetKnownWindowsMSVC()
to reflect its current functionality.

Based on Takumi NAKAMURA suggestion.

llvm-svn: 205338
2014-04-01 18:15:34 +00:00
Aaron Ballman 8bf5a548ea Attempting to fix r205124, which had failed asserts when built with MSVC.
Suggestion from Yaron Keren.

llvm-svn: 205313
2014-04-01 13:56:35 +00:00
Alexey Volkov 1328b28dc6 [x86] Do not convert to cmp32 for Atom arch by Sergey Okunev
Differential Revision: http://llvm-reviews.chandlerc.com/D2824

llvm-svn: 205288
2014-04-01 08:13:07 +00:00
Rafael Espindola 24a669d225 Prevent alias from pointing to weak aliases.
This adds back r204781.

Original message:

Aliases are just another name for a position in a file. As such, the
regular symbol resolutions are not applied. For example, given

define void @my_func() {
  ret void
}
@my_alias = alias weak void ()* @my_func
@my_alias2 = alias void ()* @my_alias

We produce without this patch:

        .weak   my_alias
my_alias = my_func
        .globl  my_alias2
my_alias2 = my_alias

That is, in the resulting ELF file my_alias, my_func and my_alias are
just 3 names pointing to offset 0 of .text. That is *not* the
semantics of IR linking. For example, linking in a

@my_alias = alias void ()* @other_func

would require the strong my_alias to override the weak one and
my_alias2 would end up pointing to other_func.

There is no way to represent that with aliases being just another
name, so the best solution seems to be to just disallow it, converting
a miscompile into an error.

llvm-svn: 204934
2014-03-27 15:26:56 +00:00
Rafael Espindola 65481d7b97 Revert "Prevent alias from pointing to weak aliases."
This reverts commit r204781.

I will follow up to with msan folks to see what is what they
were trying to do with aliases to weak aliases.

llvm-svn: 204784
2014-03-26 06:14:40 +00:00
Rafael Espindola 3b712a84a9 Prevent alias from pointing to weak aliases.
Aliases are just another name for a position in a file. As such, the
regular symbol resolutions are not applied. For example, given

define void @my_func() {
  ret void
}
@my_alias = alias weak void ()* @my_func
@my_alias2 = alias void ()* @my_alias

We produce without this patch:

        .weak   my_alias
my_alias = my_func
        .globl  my_alias2
my_alias2 = my_alias

That is, in the resulting ELF file my_alias, my_func and my_alias are
just 3 names pointing to offset 0 of .text. That is *not* the
semantics of IR linking. For example, linking in a

@my_alias = alias void ()* @other_func

would require the strong my_alias to override the weak one and
my_alias2 would end up pointing to other_func.

There is no way to represent that with aliases being just another
name, so the best solution seems to be to just disallow it, converting
a miscompile into an error.

llvm-svn: 204781
2014-03-26 04:48:47 +00:00
Adam Nemet 4beef4c90d [X86] Generate VPSHUFB for in-place v16i16 shuffles
This used to resort to splitting the 256-bit operation into two 128-bit
shuffles and then recombining the results.

Fixes <rdar://problem/16167303>

llvm-svn: 204735
2014-03-25 17:47:06 +00:00
Adam Nemet ac6d6383a3 [X86] Factor out new helper getPSHUFB
I found three implementations of this.  This splits it out into a new function
and uses it from the three places.

My plan is to add a fourth use when lowering a vector_shuffle:v16i16.

Compared the assembly output of test/CodeGen/X86 before and after.

The only change is due to how the first PSHUFB was generated in
LowerVECTOR_SHUFFLEv8i16.  If the shuffle mask specified undef (i.e. -1), the
old implementation would write -1 * 2 and -1 * 2 + 1 (254 and 255) in the
control mask.  Now we write 0x80.  These are of course interchangeable since
bit 7 decides if a constant zero is written in the result byte.  The other
instances of this code use 0x80 consistently.

Related to <rdar://problem/16167303>

llvm-svn: 204734
2014-03-25 17:47:03 +00:00
Adam Nemet b47372f555 [X86] Fix non-determinism in LowerVectorAllZeroTest
This can be observed with the old testcase of CodeGen/X86/pr12312.ll:

47c47
<       vorps   %ymm0, %ymm1, %ymm0
---
>       vorps   %ymm1, %ymm0, %ymm0
97c97
<       vorps   %ymm1, %ymm0, %ymm0
---
>       vorps   %ymm0, %ymm1, %ymm0

The vector VecIns is populated with all the values from VecInMap. This is done
while iterating VecInMap.  VecInMap uses a hash of pointer values so the
resulting order can vary depending on the memory layout.

The fix is to populate the vector VecIns earlier as VecInMap is populated.
This is done in DAG traversal order.

Fixes <rdar://problem/16398806>

llvm-svn: 204623
2014-03-24 16:52:08 +00:00
Craig Topper c6d4efa1e5 Prune includes in X86 target.
llvm-svn: 204216
2014-03-19 06:53:25 +00:00
Adam Nemet 8a130a5f86 [X86] Fix unused variable warning with NDEBUG from r204058
llvm-svn: 204063
2014-03-17 17:32:53 +00:00
Adam Nemet 24381f1cb7 [VectorLegalizer/X86] Don't unvectorize fp_to_uint for v8f32->v8i16
Rather than LegalizeAction::Expand, this needs LegalizeAction::Promote to get
promoted to fp_to_sint v8f32->v8i32.  This is a legal operation on AVX.

For that to work properly, we also need to teach the legalizer about the
specific promotion required here.  The default vector promotion uses
bitcasting to a vector type of the same total size.  We want to promote the
vector element type, effectively widening the operation and then truncating
the result.  This is analogous to the current logic of how int_to_fp is
promoted.

The change also factors out some code from the int_to_fp promotion code to
ValueType::widenIntegerVectorElementType.  This is now shared between
int_to_fp and fp_to_int.

There is no longer need for the custom lowering of fp_to_sint f32->v8i16 in
X86.  It can now go through the new target-independent fp_to_*int promotion
logic.

I also checked that no other target uses Promote for these ops yet, so there
shouldn't be any unexpected change in behavior.

Fixes <rdar://problem/16202247>

llvm-svn: 204058
2014-03-17 17:06:14 +00:00
Arnaud A. de Grandmaison 75c9e6dedf Remove some dead assignements found by scan-build
llvm-svn: 204013
2014-03-15 22:13:15 +00:00
Owen Anderson 16c6bf49b7 Phase 2 of the great MachineRegisterInfo cleanup. This time, we're changing
operator* on the by-operand iterators to return a MachineOperand& rather than
a MachineInstr&.  At this point they almost behave like normal iterators!

Again, this requires making some existing loops more verbose, but should pave
the way for the big range-based for-loop cleanups in the future.

llvm-svn: 203865
2014-03-13 23:12:04 +00:00
Hans Wennborg 6c37f8b985 X86: Don't generate 64-bit movd after cmpneqsd in 32-bit mode (PR19059)
This fixes the bug where we would bitcast the 64-bit floating point result
of cmpneqsd to a 64-bit integer even on 32-bit targets.

Differential Revision: http://llvm-reviews.chandlerc.com/D3009

llvm-svn: 203581
2014-03-11 15:49:24 +00:00
Tim Northover e94a518a22 IR: add a second ordering operand to cmpxhg for failure
The syntax for "cmpxchg" should now look something like:

	cmpxchg i32* %addr, i32 42, i32 3 acquire monotonic

where the second ordering argument gives the required semantics in the case
that no exchange takes place. It should be no stronger than the first ordering
constraint and cannot be either "release" or "acq_rel" (since no store will
have taken place).

rdar://problem/15996804

llvm-svn: 203559
2014-03-11 10:48:52 +00:00
Jim Grosbach c94d993adf X86: Enable ISel of 16-bit MOVBE instructions.
When the MOVBE instructions are available, use them for 16-bit endian
swapping as well as for 32 and 64 bit.

The patterns were already present on the instructions, but weren't being
matched because the operation was unconditionally marked to 'Expand.'
Change that to be conditional on whether the MOVBE instructions are
available. Use 'rolw' to implement the in-register version (32 and 64
bit have the dedicated 'bswap' instruction for that).

Patch by Louis Gerbarg <lgg@apple.com>.

rdar://15479984

llvm-svn: 203524
2014-03-11 00:44:14 +00:00
Cameron McInally 791ae9927c Lower AVX v4i64->v4i32 truncate to one shuffle.
llvm-svn: 202996
2014-03-05 19:41:16 +00:00
Chandler Carruth 219b89b987 [Modules] Move CallSite into the IR library where it belogs. It is
abstracting between a CallInst and an InvokeInst, both of which are IR
concepts.

llvm-svn: 202816
2014-03-04 11:01:28 +00:00
Benjamin Kramer b6d0bd48bd [C++11] Replace llvm::next and llvm::prior with std::next and std::prev.
Remove the old functions.

llvm-svn: 202636
2014-03-02 12:27:27 +00:00
Elena Demikhovsky 9737e3886b AVX-512: Fixed extract_vector_elt for v8i1 vector
llvm-svn: 202624
2014-03-02 09:19:44 +00:00
Quentin Colombet 85c9e16291 Lower unsigned vsetcc to psubus in certain cases
The current approach to lower a vsetult is to flip the sign bit of the
operands, swap the operands and then use a (signed) pcmpgt.  psubus (unsigned
saturating subtract) can be used to emulate a vsetult more efficiently:

+    case ISD::SETULT: {
+      // If the comparison is against a constant we can turn this into a
+      // setule.  With psubus, setule does not require a swap.  This is
+      // beneficial because the constant in the register is no longer
+      // destructed as the destination so it can be hoisted out of a loop.

I also enable lowering via psubus in a few other cases where it's clearly
beneficial: setule and setuge if minu/maxu cannot be used.
    
rdar://problem/14338765

Patch by Adam Nemet <anemet@apple.com>.

llvm-svn: 202301
2014-02-26 21:39:12 +00:00
Tim Northover aeb8e06d4c X86 CodeGenPrep: sink shufflevectors before shifts
On x86, shifting a vector by a scalar is significantly cheaper than shifting a
vector by another fully general vector. Unfortunately, because SelectionDAG
operates on just one basic block at a time, the shufflevector instruction that
reveals whether the right-hand side of a shift *is* really a scalar is often
not visible to CodeGen when it's needed.

This adds another handler to CodeGenPrepare, to sink any useful shufflevector
instructions down to the basic block where they're used, predicated on a target
hook (since on other architectures, doing so will often just introduce extra
real work).

rdar://problem/16063505

llvm-svn: 201655
2014-02-19 10:02:43 +00:00
Tim Northover f06df5866f X86: use vpsllvd (& friends) for 16-bit shifts on Haswell
llvm-svn: 201558
2014-02-18 11:15:32 +00:00
Elena Demikhovsky 1fad075974 AVX-512: simpyfied BUILD_VECTOR for masks; fixed cmp/test sequence
llvm-svn: 201487
2014-02-16 11:34:23 +00:00
Andrea Di Biagio 386d566395 [X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.
Instead of expanding a packed shift into a sequence of scalar shifts,
the backend now tries (when possible) to convert the vector shift into a
vector multiply.

Before this change, a shift of a MVT::v8i16 vector by a
build_vector of constants was always scalarized into a long sequence of "vector
extracts + scalar shifts + vector insert".
With this change, if there is SSE2 support, we emit a single vector multiply.

This change also affects SSE4.1, AVX, AVX2 shifts:
 - A shift of a MVT::v4i32 vector by a build_vector of non uniform constants
is now lowered when possible into a single SSE4.1 vector multiply.
 - Packed v16i16 shift left by constant build_vector are now expanded when
possible into a single AVX2 vpmullw.
This change also improves the lowering of AVX512f vector shifts.

Added test CodeGen/X86/vec_shift6.ll with some code examples that are affected
by this change.

llvm-svn: 201271
2014-02-12 23:42:28 +00:00
Elena Demikhovsky 1f32c313f1 AVX: fixed a bug in LowerVECTOR_SHUFFLE
llvm-svn: 201140
2014-02-11 10:21:53 +00:00
Elena Demikhovsky 2aafc22ed9 AVX-512: Optimized BUILD_VECTOR pattern;
fixed encoding of VEXTRACTPS instruction.

llvm-svn: 201134
2014-02-11 07:25:59 +00:00
Elena Demikhovsky 9f423d6f25 AVX-512: Fixed extract_vector_elt for v16i1 and v8i1 vectors.
llvm-svn: 201066
2014-02-10 07:02:39 +00:00
Tim Northover 546b57b011 X86: deduplicate V[SZ]EXT_MOVL and V[SZ]EXT nodes
I believe VZEXT_MOVL means "zero all vector elements except the first" (and
should have identical input & output types) whereas VZEXT means "zero extend
each element of a vector (discarding higher elements if necessary)".

For example:
    (v4i32 (vzext (v16i8 ...)))

should zero extend the low 4 bytes of the incoming vector to 32-bits,
discarding higher bytes.

However, somewhere in the past, these two concepts had become confused, even
leading to a nonsensical VSEXT_MOVL.

This re-merges the nodes where appropriate (all VSEXT_MOVL -> VSEXT, VZEXT_MOVL
-> VZEXT when it's an actual extension).

rdar://problem/15981990

llvm-svn: 200918
2014-02-06 09:54:51 +00:00
Matt Arsenault 25793a3f22 Add address space argument to allowsUnalignedMemoryAccess.
On R600, some address spaces have more strict alignment
requirements than others.

llvm-svn: 200887
2014-02-05 23:15:53 +00:00
Elena Demikhovsky 0b79be8ab2 AVX-512: optimized icmp -> sext -> icmp pattern
llvm-svn: 200849
2014-02-05 16:17:36 +00:00
Craig Topper 7ee163842f Move matching for x86 BMI BLSI/BLSMSK/BLSR instructions to isel patterns instead of DAG combine. This weakens the ability to fold loads with them because we aren't able to match patterns that load the same thing twice. But maybe we should fix that if we care. The peephole optimizer will be able to fold some loads in its absense.
llvm-svn: 200824
2014-02-05 07:09:40 +00:00
Reid Kleckner f5b76518c9 Implement inalloca codegen for x86 with the new inalloca design
Calls with inalloca are lowered by skipping all stores for arguments
passed in memory and the initial stack adjustment to allocate argument
memory.

Now the frontend is responsible for the memory layout, and the backend
doesn't have to do any work.  As a result these changes are pretty
minimal.

Reviewers: echristo

Differential Revision: http://llvm-reviews.chandlerc.com/D2637

llvm-svn: 200596
2014-01-31 23:50:57 +00:00
Reid Kleckner c5d9e159eb x86: Rename NumBytesForCalleeToPush to ...Pop for accuracy
If we have a callee cleanup convention, the callee is going to pop the
arguments off the stack, not push them on.

llvm-svn: 200566
2014-01-31 19:07:18 +00:00
Andrea Di Biagio 2ea61f17ad [X86] Add extra rules for combining vselect dag nodes into movsd.
This improves the fix committed at revision 199683 adding the
following new target specific combine rules:

1) fold (v4i32: vselect <0,0,-1,-1>, A, B) ->
        (v4i32 (bitcast (movsd (v2i64 (bitcast A)), (v2i64 (bitcast B))) ))

2) fold (v4f32: vselect <0,0,-1,-1>, A, B) ->
        (v4f32 (bitcast (movsd (v2f64 (bitcast A)), (v2f64 (bitcast B))) ))

3) fold (v4i32: vselect <-1,-1,0,0>, A, B) ->
        (v4i32 (bitcast (movsd (v2i64 (bitcast B)), (v2i64 (bitcast A))) ))

4) fold (v4f32: vselect <-1,-1,0,0>, A, B) ->
        (v4f32 (bitcast (movsd (v2i64 (bitcast B)), (v2i64 (bitcast A))) ))

llvm-svn: 200324
2014-01-28 18:14:21 +00:00
Juergen Ributzka 659ce00d60 [TLI] Add a new hook to TargetLowering to query the target if a load of a constant should be converted to simply the constant itself.
Before this patch we used getIntImmCost from TargetTransformInfo to determine if
a load of a constant should be converted to just a constant, but the threshold
for this was set to an arbitrary value. This value works well for the two
targets (X86 and ARM) that implement this target-hook, but it isn't
target-independent at all.

Now targets have the possibility to decide directly if this optimization should
be performed. The default value is set to false to preserve the current
behavior. The target hook has been moved to TargetLowering, which removed the
last use and need of TargetTransformInfo in SelectionDAG.

llvm-svn: 200271
2014-01-28 01:20:14 +00:00
Juergen Ributzka e758ddcd16 [X86] Prevent the creation of redundant ops for sadd and ssub with overflow.
This commit teaches the X86 backend to create the same X86 instructions when it
lowers an sadd/ssub with overflow intrinsic and a conditional branch that uses
that overflow result. This allows SelectionDAG to recognize and remove one of
the redundant operations.

This fixes <rdar://problem/15874016> and <rdar://problem/15661073>.

Reviewed by Nadav

llvm-svn: 199976
2014-01-24 06:47:57 +00:00
Lang Hames b1ce33379a Add a few missing cases from r199933. Testcase coming shortly.
llvm-svn: 199938
2014-01-23 21:27:27 +00:00
Lang Hames 23de211c5d Replace vfmaddxx213 instructions with their 231-type equivalents in accumulator
loops. Writing back to the accumulator (231-type) allows the coalescer to
eliminate an extra copy.

llvm-svn: 199933
2014-01-23 20:23:36 +00:00
Elena Demikhovsky a5d38a39a0 AVX-512: added VPERM2D VPERM2Q VPERM2PS VPERM2PD instructions,
they give better sequences than VPERMI

llvm-svn: 199893
2014-01-23 14:27:26 +00:00
Andrea Di Biagio 450d1661be [X86] Teach how to combine a vselect into a movss/movsd
Add target specific rules for combining vselect dag nodes into movss/movsd
when possible.

If the vector type of the vselect dag node in input is either MVT::v4i13 or
MVT::v4f32, then try to fold according to rules:

  1) fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B)
  2) fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)

If the vector type of the vselect dag node in input is either MVT::v2i64 or
MVT::v2f64 (and we have SSE2), then try to fold according to rules:

  3) fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
  4) fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)

llvm-svn: 199683
2014-01-20 19:35:22 +00:00
Michael Gottesman 8347c34e67 Move the retrieval of VT after all of the early exits from PerformOrCombine that do not use VT. NFC.
llvm-svn: 199612
2014-01-19 21:06:00 +00:00
Elena Demikhovsky d1487261a0 AVX-512: fixed a compare pattern
llvm-svn: 199366
2014-01-16 08:45:54 +00:00
David Majnemer dee105772c WinCOFF: Transform IR expressions featuring __ImageBase into image relative relocations
MSVC on x64 requires that we create image relative symbol
references to refer to RTTI data. Seeing as how there is no way to
explicitly make reference to a given relocation type in LLVM IR, pattern
match expressions of the form &foo - &__ImageBase.

Differential Revision: http://llvm-reviews.chandlerc.com/D2523

llvm-svn: 199312
2014-01-15 09:16:42 +00:00
Elena Demikhovsky 79b75d9048 Fixed identation.
llvm-svn: 199301
2014-01-15 07:18:11 +00:00
Lang Hames 06234ec147 Add FPExt option to CCValAssign::LocInfo. When generating calling-convention
promotion code, Tablegen will now select FPExt for floating point promotions
(previously it had returned AExt, which is not valid for floating point types).

Any out-of-tree targets that were relying on AExt being returned for FP
promotions will need to update their code check for FPExt instead.

llvm-svn: 199252
2014-01-14 19:56:36 +00:00
Nico Rieck 7157bb765e Decouple dllexport/dllimport from linkage
Representing dllexport/dllimport as distinct linkage types prevents using
these attributes on templates and inline functions.

Instead of introducing further mixed linkage types to include linkonce and
weak ODR, the old import/export linkage types are replaced with a new
separate visibility-like specifier:

  define available_externally dllimport void @f() {}
  @Var = dllexport global i32 1, align 4

Linkage for dllexported globals and functions is now equal to their linkage
without dllexport. Imported globals and functions must be either
declarations with external linkage, or definitions with
AvailableExternallyLinkage.

llvm-svn: 199218
2014-01-14 15:22:47 +00:00
Elena Demikhovsky 767fc967b4 AVX-512: optimized scalar compare patterns
removed AVX512SI format, since it is similar to AVX512BI.

llvm-svn: 199217
2014-01-14 15:10:08 +00:00
Andrea Di Biagio 5448a3c771 [X86] Fix assertion failure caused by a wrong folding of vector shifts by immediate count.
This fixes a regression intruced by r198113.

Revision r198113 introduced an algorithm that tries to fold a vector shift
by immediate count into a build_vector if the input vector is a known vector
of constants.

However the algorithm only worked under the assumption that the input vector
type and the shift type are exactly the same.

This patch disables the folding of vector shift by immediate count if the
input vector type and the shift value type are not the same.

llvm-svn: 199213
2014-01-14 13:17:12 +00:00
Nico Rieck 9d2e0df049 Revert "Decouple dllexport/dllimport from linkage"
Revert this for now until I fix an issue in Clang with it.

This reverts commit r199204.

llvm-svn: 199207
2014-01-14 12:38:32 +00:00
Nico Rieck e43aaf7967 Decouple dllexport/dllimport from linkage
Representing dllexport/dllimport as distinct linkage types prevents using
these attributes on templates and inline functions.

Instead of introducing further mixed linkage types to include linkonce and
weak ODR, the old import/export linkage types are replaced with a new
separate visibility-like specifier:

  define available_externally dllimport void @f() {}
  @Var = dllexport global i32 1, align 4

Linkage for dllexported globals and functions is now equal to their linkage
without dllexport. Imported globals and functions must be either
declarations with external linkage, or definitions with
AvailableExternallyLinkage.

llvm-svn: 199204
2014-01-14 11:55:03 +00:00
Elena Demikhovsky 172a27c750 AVX-512: Added more intrinsics for pmin/pmax, pabs, blend, pmuldq.
llvm-svn: 198745
2014-01-08 10:54:22 +00:00
Bill Wendling 13199b17f8 Remove unnecessary #includes.
llvm-svn: 198585
2014-01-06 06:00:00 +00:00
Bill Wendling 908bf814e7 Refactor function that checks that __builtin_returnaddress's argument is constant.
This moves the check up into the parent class so that all targets can use it
without having to copy (and keep in sync) the same error message.

llvm-svn: 198579
2014-01-06 00:43:20 +00:00
Elena Demikhovsky f404e054a1 AVX-512: changed property name from "neverHasSideEffects=1" to "hasSideEffects=0", added this property to VMOVSS/VMOVSD;
Optimized a truncate pattern.

llvm-svn: 198562
2014-01-05 14:21:07 +00:00
Elena Demikhovsky 52e4a0e109 AVX-512: Added more intrinsics for convert and min/max.
Removed vzeroupper from AVX-512 mode - our optimization gude does not recommend to insert vzeroupper at all.

llvm-svn: 198557
2014-01-05 10:46:09 +00:00
Bill Wendling df7dd28dc8 Emit an error message if the value passed to __builtin_returnaddress isn't a constant
__builtin_returnaddress requires that the value passed into is be a constant.
However, at -O0 even a constant expression may not be converted to a constant.
Emit an error message intead of crashing.

llvm-svn: 198531
2014-01-05 01:47:20 +00:00
Craig Topper a448bd868f Make more of the x86 lowering helper functions static.
llvm-svn: 198146
2013-12-29 01:48:38 +00:00
Craig Topper 059e8e0da1 Switch from EVT to MVT in more of the x86 instruction lowering code.
llvm-svn: 198144
2013-12-29 01:10:06 +00:00
Craig Topper bf096926c9 Use getSimpleValueType in a few spots where the type should be simple.
llvm-svn: 198117
2013-12-28 18:35:48 +00:00
Craig Topper e829fe42af Minor indentation fix to match other switch statements. Change llvm_unreachable text to match similar places.
llvm-svn: 198116
2013-12-28 17:37:32 +00:00
Andrea Di Biagio eaceba0ed0 [X86] Teach the backend how to fold target specific dag node for packed
vector shift by immedate count (VSHLI/VSRLI/VSRAI) into a build_vector when
the vector in input to the shift is a build_vector of all constants or UNDEFs.

Target specific nodes for packed shifts by immediate count are in
general introduced by function 'getTargetVShiftByConstNode' (in
X86ISelLowering.cpp) when lowering shift operations, SSE/AVX immediate
shift intrinsics and (only in very few cases) SIGN_EXTEND_INREG dag
nodes.

This patch adds extra rules for simplifying vector shifts inside
function 'getTargetVShiftByConstNode'.

Added file test/CodeGen/X86/vec_shift5.ll to verify that packed
shifts by immediate are correctly folded into a build_vector when the
input vector to the shift dag node is a vector of constants or undefs.

llvm-svn: 198113
2013-12-28 11:11:52 +00:00
Elena Demikhovsky b64d7e8586 AVX-512: Result type of scalar SETCC is MVT::i1 for AVX-512.
llvm-svn: 198008
2013-12-25 10:06:40 +00:00
Elena Demikhovsky 64c9548d66 AVX-512: fixed some patterns for MVT::i1
llvm-svn: 197981
2013-12-24 14:24:07 +00:00
Elena Demikhovsky fe24a30e38 AVX512: SETCC returns i1 for AVX-512 and i8 for all others
llvm-svn: 197876
2013-12-22 10:13:18 +00:00
Duncan P. N. Exon Smith ab5dbebc11 Assert that the last operand is actually EFLAGS
This is another follow-up to r197503, after a post-commit review by
Andy.

<rdar://problem/15627766>

llvm-svn: 197520
2013-12-17 20:28:21 +00:00
Duncan P. N. Exon Smith 512601d77f Revert "Revert "Mark vastart_save_xmm_regs as changing EFLAGS""
This reverts commit r197481, recommiting r197469 with an extra fix.

The vastart_save_xmm_regs pseudo-instruction expands to a test and a
branch, so it modifies EFLAGS.  Mark it so, or else the scheduler might
place it in the middle of another test+branch.

This fixes a bug exposed by r192750, which changed the initial scheduler
to source-order as part of enabling the MI Scheduler for X86.

This re-commit changes the VASTART_SAVE_XMM_REGS custom inserter not to
try to save %flags, and adds a test that catches the bad behavior of
r197469.

<rdar://problem/15627766>

llvm-svn: 197503
2013-12-17 15:54:45 +00:00
Stepan Dyatkovskiy 7f7c2710e0 Fix for PR18045:
http://llvm.org/bugs/show_bug.cgi?id=18045

Short issue description:
For X86 machines with sse < sse4.1 we got failures for some
particular load/store vector sequences:

$ clang-trunk -m32 -O2 test-case.c
fatal error: error in backend: Cannot select: 0x4200920: v4i32,ch = load 0x41d6ab0, 0x4205850,
      0x41dcb10<LD16[getelementptr inbounds ([4 x i32]* @e, i32 0, i32 0)](align=4)> [ORD=82]
      [ID=58]
  0x4205850: i32 = X86ISD::Wrapper 0x41d5490 [ORD=26] [ID=43]
    0x41d5490: i32 = TargetGlobalAddress<[4 x i32]* @e> 0 [ORD=26] [ID=23]
  0x41dcb10: i32 = undef [ID=2]

The reason is that EltsFromConsecutiveLoads could emit such load instruction
both before and after legalize stage. Though this instruction is not legal for
machines with SSSE3 and lower.

The fix: In EltsFromConsecutiveLoads, if we have passed legalize stage, we
check whether nodes it emits are legal. 

P.S.: If you get failure in time from 12:00 and till 22:00 (UTC-8),
perhaps I'll slow with response, so you better reject this commit. Thanks!

llvm-svn: 197492
2013-12-17 12:07:33 +00:00
Elena Demikhovsky c5f6726a24 AVX-512: Added implementation of CONCAT_VECTORS for v8i1 vectors (by Alexey Bader).
Added implementation of "truncate" from integer type (i64/i32/i16/i8) to i1.

llvm-svn: 197482
2013-12-17 08:33:15 +00:00
Elena Demikhovsky 47fc44e52e AVX-512: Added legal type MVT::i1 and VK1 register for it.
Added scalar compare VCMPSS, VCMPSD.
Implemented LowerSELECT for scalar FP operations.
I replaced FSETCCss, FSETCCsd with one node type FSETCCs.
Node extract_vector_elt(v16i1/v8i1, idx) returns an element of type i1.

llvm-svn: 197384
2013-12-16 13:52:35 +00:00
Benjamin Kramer e723bb10b0 X86: When lowering shl_parts, don't emit shift amounts larger than the bit width.
While it's safe for the X86-specific shift nodes, dag combining will
kill generic nodes. Insert an AND to make it safe, isel will nuke it
as x86's shift instructions have an implicit AND.

Fixes PR16108, which contains a contraption to hit this case in between
constant folders.

llvm-svn: 197228
2013-12-13 13:40:24 +00:00
Rafael Espindola 32cb5ac904 Switch to the new MingW ABI.
GCC 4.7 changed the MingW ABI. On the LLVM side it means that sret functions
don't pop the stack.

llvm-svn: 197163
2013-12-12 16:06:58 +00:00
Tim Northover 9653eb5759 Make Triple's isOSBinFormatXXX functions partition triple-space.
Most users would be surprised if "isCOFF" and "isMachO" were simultaneously
true, unless they'd put the compiler in a box with a gun attached to a photon
detector.

This makes sure precisely one of the three formats is true for any triple and
simplifies some target logic based on that.

llvm-svn: 196934
2013-12-10 16:57:43 +00:00
Elena Demikhovsky e382c3fdcd AVX-512: changed intrinsics for mask operations
llvm-svn: 196918
2013-12-10 13:53:10 +00:00
Cameron McInally c5f420e129 Suppress '(x < y) ? a : 0 -> (x < y) & a' transform on X86 architectures with dedicated mask registers.
Patch by Aleksey Bader.

llvm-svn: 196386
2013-12-04 14:52:33 +00:00
Lang Hames 39609996d9 Refactor a lot of patchpoint/stackmap related code to simplify and make it
target independent.

Most of the x86 specific stackmap/patchpoint handling was necessitated by the
use of the native address-mode format for frame index operands. PEI has now
been modified to treat stackmap/patchpoint similarly to DEBUG_INFO, allowing
us to use a simple, platform independent register/offset pair for frame
indexes on stackmap/patchpoints.

Notes:
  - Folding is now platform independent and automatically supported.
  - Emiting patchpoints with direct memory references now just involves calling
    the TargetLoweringBase::emitPatchPoint utility method from the target's
    XXXTargetLowering::EmitInstrWithCustomInserter method. (See
    X86TargetLowering for an example).
  - No more ugly platform-specific operand parsers.

This patch shouldn't change the generated output for X86. 

llvm-svn: 195944
2013-11-29 03:07:54 +00:00
Michael Liao d617a3015d Fix PR18054
- Fix bug in (vsext (vzext x)) -> (vsext x) in SIGN_EXTEND_IN_REG
  lowering where we need to check whether x is a vector type (in-reg
  type) of i8, i16 or i32; otherwise, that optimization is not valid.

llvm-svn: 195779
2013-11-26 20:31:31 +00:00
Andrew Trick 391dbadb51 StackMap: Implement support for DirectMemRefOp.
A Direct stack map location records the address of frame index. This
address is itself the value that the runtime requested. This differs
from IndirectMemRefOp locations, which refer to a stack locations from
which the requested values must be loaded. Direct locations can
directly communicate the address if an alloca, while IndirectMemRefOp
handle register spills.

For example:

entry:
  %a = alloca i64...
  llvm.experimental.stackmap(i32 <ID>, i32 <shadowBytes>, i64* %a)

Since both the alloca and stackmap intrinsic are in the entry block,
and the intrinsic takes the address of the alloca, the runtime can
assume that LLVM will not substitute alloca with any intervening
value. This must be verified by the runtime by checking that the stack
map's location is a Direct location type. The runtime can then
determine the alloca's relative location on the stack immediately after
compilation, or at any time thereafter. This differs from Register and
Indirect locations, because the runtime can only read the values in
those locations when execution reaches the instruction address of the
stack map.

llvm-svn: 195712
2013-11-26 02:03:25 +00:00
Andrew Trick d3ab37cfeb whitespace
llvm-svn: 195711
2013-11-26 02:03:20 +00:00
Jim Grosbach 860934a924 X86: Perform integer comparisons at i32 or larger.
Utilizing the 8 and 16 bit comparison instructions, even when an input can
be folded into the comparison instruction itself, is typically not worth it.
There are too many partial register stalls as a result, leading to significant
slowdowns. By always performing comparisons on at least 32-bit
registers, performance of the calculation chain leading to the
comparison improves. Continue to use the smaller comparisons when
minimizing size, as that allows better folding of loads into the
comparison instructions.

rdar://15386341

llvm-svn: 195496
2013-11-22 19:57:47 +00:00
Michael Liao 02160d580b Fix PR18014
- When simplifying the mask generation for BLEND, check whether that mask is
  also consumed by other non-BLEND insns. If true, skip that simplification.

llvm-svn: 195476
2013-11-22 17:56:57 +00:00
Rafael Espindola 5a8e985ad3 Don't produce tail calls when the caller is x86_thiscallcc.
The callee will not pop the stack for us.

llvm-svn: 195467
2013-11-22 15:18:28 +00:00
Kostya Serebryany 4007009815 Revert r195318 as it causes miscompilation (PR18029)
llvm-svn: 195439
2013-11-22 10:30:39 +00:00
Ekaterina Romanova d5fa55470c SHLD/SHRD are VectorPath (microcode) instructions known to have poor latency on certain architectures. While generating SHLD/SHRD instructions is acceptable when optimizing for size, optimizing for speed on these platforms should be implemented using alternative sequences of instructions composed of add, adc, shr, shl, or and lea which are directPath instructions. These alternative instructions not only have a lower latency but they also increase the decode bandwidth by allowing simultaneous decoding of a third directPath instruction.
AMD's processors family K7, K8, K10, K12, K15 and K16 are known to have SHLD/SHRD instructions with very poor latency. Optimization guides for these processors recommend using an alternative sequence of instructions. For these AMD's processors, I disabled folding (or (x << c) | (y >> (64 - c))) when we are not optimizing for size.

It might be beneficial to disable this folding for some of the Intel's processors. However, since I couldn't find specific recommendations regarding using SHLD/SHRD instructions on Intel's processors, I haven't disabled this peephole for Intel.

llvm-svn: 195383
2013-11-21 23:21:26 +00:00
Bill Wendling 07787f8747 The basic problem is that some mainstream programs cannot deal with the way
clang optimizes tail calls, as in this example:

int foo(void);
int bar(void) {
 return foo();
}

where the call is transformed to:

  calll .L0$pb
.L0$pb:
  popl  %eax
.Ltmp0:
  addl  $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax
  movl  foo@GOT(%eax), %eax
  popl  %ebp
  jmpl  *%eax                   # TAILCALL

However, the GOT references must all be resolved at dlopen() time, and so this
approach cannot be used with lazy dynamic linking (e.g. using RTLD_LAZY), which
usually populates the PLT with stubs that perform the actual resolving.

This patch changes X86TargetLowering::LowerCall() to skip tail call
optimization, if the called function is a global or external symbol.

Patch by Dimitry Andric!

PR15086

llvm-svn: 195318
2013-11-21 07:04:30 +00:00
NAKAMURA Takumi 3dedf827f8 X86ISelLowering.cpp: Mark a variable VT as LLVM_ATTRIBUTE_UNUSED. [-Wunused-variable]
llvm-svn: 195238
2013-11-20 10:55:22 +00:00
NAKAMURA Takumi f2392ebb94 Whitespace.
llvm-svn: 195237
2013-11-20 10:55:15 +00:00
Elena Demikhovsky a5967af97d Fixed compilation error.
llvm-svn: 195230
2013-11-20 09:23:22 +00:00
Elena Demikhovsky e1f9bf054f AVX-512: Concat 4 128-bit vectors in one 512-bit vector.
llvm-svn: 195229
2013-11-20 09:10:40 +00:00
Cameron McInally ad41f1f693 Add AVX512 unmasked FMA intrinsics and support.
llvm-svn: 194824
2013-11-15 17:01:14 +00:00
Matt Arsenault b03bd4d96b Add addrspacecast instruction.
Patch by Michele Scandale!

llvm-svn: 194760
2013-11-15 01:34:59 +00:00
Elena Demikhovsky 0a74b7da35 AVX-512: Handled extractelement from mask vector;
Added VMOSHDUP/VMOVSLDUP shuffle instructions.

llvm-svn: 194691
2013-11-14 11:29:27 +00:00
Juergen Ributzka 34c652d34d SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too.
This patch reapplies r193676 with an additional fix for the Hexagon backend. The
SystemZ backend has already been fixed by r194148.

The Type Legalizer recognizes that VSELECT needs to be split, because the type
is to wide for the given target. The same does not always apply to SETCC,
because less space is required to encode the result of a comparison. As a result
VSELECT is split and SETCC is unrolled into scalar comparisons.

This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG
Combiner. If a matching pattern is found, then the result mask of SETCC is
promoted to the expected vector mask type for the given target. Now the type
legalizer will split both VSELECT and SETCC.

This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX
pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>.

Reviewed by Nadav

llvm-svn: 194542
2013-11-13 01:57:54 +00:00
Juergen Ributzka 87ed906b2e [Stackmap] Materialize the jump address within the patchpoint noop slide.
This patch moves the jump address materialization inside the noop slide. This
enables patching of the materialization itself or its complete removal. This
patch also adds the ability to define scratch registers that can be used safely
by the code called from the patchpoint intrinsic. At least one scratch register
is required, because that one is used for the materialization of the jump
address. This patch depends on D2009.

Differential Revision: http://llvm-reviews.chandlerc.com/D2074

Reviewed by Andy

llvm-svn: 194306
2013-11-09 01:51:33 +00:00
Juergen Ributzka 9969d3e6e8 [Stackmap] Add AnyReg calling convention support for patchpoint intrinsic.
The idea of the AnyReg Calling Convention is to provide the call arguments in
registers, but not to force them to be placed in a paticular order into a
specified set of registers. Instead it is up tp the register allocator to assign
any register as it sees fit. The same applies to the return value (if
applicable).

Differential Revision: http://llvm-reviews.chandlerc.com/D2009

Reviewed by Andy

llvm-svn: 194293
2013-11-08 23:28:16 +00:00
Eric Christopher 542c8d934d Check for both styles of clobbers, those produced by dragonegg and
those produced by clang for the inline asm bswap conversion.

Modified from a patch by Chris Smowton.

llvm-svn: 194016
2013-11-04 21:41:21 +00:00
Elena Demikhovsky 496656900e AVX-512: Implemented CMOV for 512-bit vectors
llvm-svn: 193747
2013-10-31 13:15:32 +00:00
Juergen Ributzka 3bd686d493 Revert "SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too."
Now Hexagon and SystemZ are not happy with it :-(

llvm-svn: 193677
2013-10-30 06:36:19 +00:00
Juergen Ributzka 6ad05d6b95 SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too.
The Type Legalizer recognizes that VSELECT needs to be split, because the type
is to wide for the given target. The same does not always apply to SETCC,
because less space is required to encode the result of a comparison. As a result
VSELECT is split and SETCC is unrolled into scalar comparisons.

This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG
Combiner. If a matching pattern is found, then the result mask of SETCC is
promoted to the expected vector mask type for the given target. This mask has
usually the same size as the VSELECT return type (except for Intel KNL). Now the
type legalizer will split both VSELECT and SETCC.

This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX
pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>.

Reviewed by Nadav

llvm-svn: 193676
2013-10-30 05:48:18 +00:00
Elena Demikhovsky 199c823555 AVX-512: PMIN/PMAX intrinsics and patterns
Patch by Cameron McInally <cameron.mcinally@nyu.edu>

llvm-svn: 193497
2013-10-27 08:18:37 +00:00
Nadav Rotem d369d4bdf9 Optimize concat_vectors(X, undef) -> scalar_to_vector(X).
This optimization is not SSE specific so I am moving it to DAGco.
The new scalar_to_vector dag node exposed a missing pattern in the AArch64 target that I needed to add.

llvm-svn: 193393
2013-10-25 06:41:18 +00:00
Yaron Keren 79bb266346 (this is a corrected patch)
Calling _chkstk is required on ELF as well as COFF on Windows. Without 
_chkstk, functions requiring large stack crash in initialization code.

Previous code tested for COFF format but not Mach-O and this patch modifies 
the code to test for Windows OS (both Windows target and MingW target) 
but not Mach-O object format: Looks like macho environment was used to 
build some EFI code.
 
Credits to Andrew MacPherson.

llvm-svn: 193289
2013-10-23 23:37:01 +00:00
Rafael Espindola bca3ab0905 Revert "Calling _chkstk is required on ELF as well as COFF on Windows. Without _chkstk functions requiring large stack crash in initialization code. Previous code tested for COFF format but not Mach-O and this patch modifies the code to test for Windows."
This reverts commit r193263.

It is causing CodeGen/X86/mingw-alloca.ll to fail.

llvm-svn: 193275
2013-10-23 21:45:09 +00:00
Benjamin Kramer 0ccab2d66c X86: Custom lower sext v16i8 to v16i16, and the corresponding truncate.
Also update the cost model.

llvm-svn: 193270
2013-10-23 21:06:07 +00:00
Yaron Keren 03ac82edf5 Calling _chkstk is required on ELF as well as COFF on Windows.
Without _chkstk functions requiring large stack crash in 
initialization code. Previous code tested for COFF format but 
not Mach-O and this patch modifies the code to test for Windows.

Credits to Andrew MacPherson.

llvm-svn: 193263
2013-10-23 19:40:07 +00:00
Benjamin Kramer da8446b833 X86: Custom lower zext v16i8 to v16i16.
On sandy bridge (PR17654) we now get
	vpxor	%xmm1, %xmm1, %xmm1
	vpunpckhbw	%xmm1, %xmm0, %xmm2
	vpunpcklbw	%xmm1, %xmm0, %xmm0
	vinsertf128	$1, %xmm2, %ymm0, %ymm0

On haswell it's a simple
	vpmovzxbw	%xmm0, %ymm0

There is a maze of duplicated and dead transforms and patterns in this
area. Remove the dead custom lowering of zext v8i16 to v8i32, that's
already handled by LowerAVXExtend.

llvm-svn: 193262
2013-10-23 19:19:04 +00:00
Jim Grosbach 005e5b55a7 X86: Make concat_vectors combine a bit more conservative.
Per Nadav's review comments for r192866.

llvm-svn: 193252
2013-10-23 17:37:40 +00:00
Lang Hames 2783993fca X86 vector element shift-by-immediate instructions take i8 immediates. Make
the instruction defenitions and ISEL reflect this.

Prior to this patch these instructions took an i32i8imm, and the high bits were
dropped during encoding. This led to incorrect behavior for shifts by
immediates higher than 255. This patch fixes that issue by detecting large
immediate shifts and returning constant zero (for logical shifts) or capping
the shift amount at an encodable value (for arithmetic shifts).

Fixes <rdar://problem/14968098>

llvm-svn: 193096
2013-10-21 17:51:24 +00:00
Elena Demikhovsky 665c90e184 AVX-512: MUL operation lowering for v8i64
llvm-svn: 193083
2013-10-21 13:27:34 +00:00
Jim Grosbach c044c65470 x86: Move bitcasts outside concat_vector.
Consider the following:

typedef unsigned short ushort4U __attribute__((ext_vector_type(4),
aligned(2)));
typedef unsigned short ushort4 __attribute__((ext_vector_type(4)));
typedef unsigned short ushort8 __attribute__((ext_vector_type(8)));
typedef int int4 __attribute__((ext_vector_type(4)));

int4 __bbase_cvt_int(ushort4 v) {
  ushort8 a;
  a.lo = v;
  return _mm_cvtepu16_epi32(a);
}

This generates the, not unreasonable, IR:
define <4 x i32> @foo0(double %v.coerce) nounwind ssp {
  %tmp = bitcast double %v.coerce to <4 x i16>
  %tmp1 = shufflevector <4 x i16> %tmp, <4 x i16> undef, <8 x i32> <i32
  %0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %tmp2 = tail call <4 x i32> @llvm.x86.sse41.pmovzxwd(<8 x i16> %tmp1)
  ret <4 x i32> %tmp2
}

The problem is when type legalization gets hold of the v4i16. It
legalizes that by spilling to the stack, then doing a zero-extending
load. Things go even more silly from there, ending up with something
like:
_foo0:
  movsd %xmm0, -8(%rsp)       <== Spill to the stack.
  movq  -8(%rsp), %xmm0       <== Reload it right back out.
  pmovzxwd  %xmm0, %xmm1      <== Here's what we actually asked for.
  pblendw $1, %xmm1, %xmm0    <== We don't need this at all
  pmovzxwd  %xmm0, %xmm0      <== We already did this
  ret

The v8i8 to v8i16 zext intrinsic gives even worse results, with two
table lookups via pshufb instructions(!!).

To avoid all that, we can move the bitcasting until after we've formed
the wider (legal) vector type. Then our normal codegen flows along
nicely and we get the expected:
_foo0:
  pmovzxwd  %xmm0, %xmm0
  ret

rdar://15245794

llvm-svn: 192866
2013-10-17 02:58:06 +00:00
Michael Liao ad71659def Fix PR17546
- Type of index used in extract_vector_elt or insert_vector_elt supposes
  to be TLI.getVectorIdxTy() which is pointer type on most targets. It'd
  better to truncate (or zero-extend in case it's changed later) it to
  mask element type to guarantee they are matching instead of asserting
  that.

llvm-svn: 192722
2013-10-15 17:51:58 +00:00
Michael Liao 8ba068211d Fix PR16807
- Lower signed division by constant powers-of-2 to target-independent
  DAG operators instead of target-dependent ones to support them better
  on targets where vector types are legal but shift operators on that
  types are illegal. E.g., on AVX, PSRAW is only available on <8 x i16>
  though <16 x i16> is a legal type.

llvm-svn: 192721
2013-10-15 17:51:02 +00:00
Eric Christopher 755711e510 Reformat this routine slightly.
llvm-svn: 192630
2013-10-14 21:52:23 +00:00
Elena Demikhovsky 82a46ebe0a Fixed a bug in dynamic allocation memory on stack.
The alignment of allocated space was wrong, see Bugzila 17345.

Done by Zvi Rackover <zvi.rackover@intel.com>.

llvm-svn: 192573
2013-10-14 07:26:51 +00:00
Benjamin Kramer 7b5e159450 X86: Fix type check. Just because an integer type is illegal doesn't mean it's i64.
Fixes PR17495, where an i24 triggered this code. It's intended to
optimize i64 loads on 32 bit x86.

llvm-svn: 192123
2013-10-07 19:11:35 +00:00
Elena Demikhovsky 2e408aefe0 AVX-512: added scalar convert instructions and intrinsics.
Fixed load folding in VPERM2I instruction.

llvm-svn: 192063
2013-10-06 13:11:09 +00:00
Elena Demikhovsky 462a2d235b AVX-512: fixed shuffle lowering
in case of BLEND and added VSHUFPS patterns.

llvm-svn: 192055
2013-10-06 06:11:18 +00:00
Craig Topper b01cd1aa74 Add patterns for selecting TBM instructions from logical operations. Patch from Yunzhong Gao.
llvm-svn: 191871
2013-10-03 04:16:45 +00:00
Rafael Espindola 44fee4e0eb Remove several unused variables.
Patch by Alp Toker.

llvm-svn: 191757
2013-10-01 13:32:03 +00:00
Robert Wilhelm f0cfb83bb4 Fix spelling intruction -> instruction.
llvm-svn: 191610
2013-09-28 11:46:15 +00:00
Juergen Ributzka f043a65327 Revert "SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too."
This reverts commit r191130.

llvm-svn: 191138
2013-09-21 15:09:46 +00:00
Juergen Ributzka c2551eb4ff Fix the buildbot
llvm-svn: 191133
2013-09-21 05:15:01 +00:00
Juergen Ributzka ab930591c7 [X86] Emulate AVX 256bit MIN/MAX support by splitting the vector.
In AVX 256bit vectors are valid vectors and therefore the Type Legalizer doesn't
split the VSELECT and SETCC nodes. AVX only supports MIN/MAX on 128bit vectors
and this fix enables vector splitting for this special case in the X86 DAG
Combiner.

This fix is related to PR16695, PR17002, and <rdar://problem/14594431>.

llvm-svn: 191131
2013-09-21 04:55:22 +00:00
Juergen Ributzka e9a80fc912 SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too.
The Type Legalizer recognizes that VSELECT needs to be split, because the type
is to wide for the given target. The same does not always apply to SETCC,
because less space is required to encode the result of a comparison. As a result
VSELECT is split and SETCC is unrolled into scalar comparisons.

This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG
Combiner. If a matching pattern is found, then the result mask of SETCC is
promoted to the expected vector mask for the given target. This mask has usually
te same size as the VSELECT return type (except for Intel KNL). Now the type
legalizer will split both VSELECT and SETCC.

This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX
pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>.

llvm-svn: 191130
2013-09-21 04:55:18 +00:00
Elena Demikhovsky 8952974e29 AVX-512: implemented extractelement with variable index.
Added parsing of mask register and "zeroing" semantic, like {%k1} {z}.

llvm-svn: 190595
2013-09-12 08:55:00 +00:00
Juergen Ributzka 53d0b492f5 [X86] Perform VSELECT DAG combines also before DAG type legalization.
If the DAG already has only legal types, then the second round of DAG combines
is skipped. In this case VSELECT+SETCC patterns that match a more efficient
instruction (e.g. min/max) are never recognized.

This fix allows VSELECT+SETCC combines if the types are already legal before DAG
type legalization.

Reviewer: Nadav
llvm-svn: 190105
2013-09-05 23:02:56 +00:00
Craig Topper b25f0f5538 Create BEXTR instructions for (and ((sra or srl) x, imm), (2**size - 1)). Fixes PR17028.
llvm-svn: 189742
2013-09-02 07:53:17 +00:00
Elena Demikhovsky 4def4b088f AVX-512: Added GATHER and SCATTER instructions.
llvm-svn: 189729
2013-09-01 14:24:41 +00:00
Craig Topper f78c19c3bb Fixup BZHI selection to remove an unneeded zero extension.
llvm-svn: 189656
2013-08-30 07:16:16 +00:00
Craig Topper 0bccad2d43 Teach X86 backend to create BMI2 BZHI instructions from (and X, (add (shl 1, Y), -1)). Fixes PR17038.
llvm-svn: 189653
2013-08-30 06:52:21 +00:00
Elena Demikhovsky 980c6b08b1 AVX-512: added extend and truncate instructions.
llvm-svn: 189580
2013-08-29 11:56:53 +00:00
Elena Demikhovsky 12f24673e0 AVX-512: Added FMA instructions.
llvm-svn: 189326
2013-08-27 08:39:25 +00:00
Elena Demikhovsky 0a2b6290f1 AVX-512: Added shuffle instructions -
VPSHUFD, VPERMILPS, VMOVDDUP, VMOVLHPS, VMOVHLPS, VSHUFPS, VALIGN
 single and double forms.

llvm-svn: 189215
2013-08-26 12:45:35 +00:00
Elena Demikhovsky f8f478b19d AVX-512: added UNPACK instructions and tests for all-zero/all-ones vectors
llvm-svn: 189189
2013-08-25 12:54:30 +00:00
Elena Demikhovsky 33d447a2d6 AVX-512: Added SHIFT instructions.
llvm-svn: 188899
2013-08-21 09:36:02 +00:00
Elena Demikhovsky 1490c5eb5b AVX-512: added arithmetic and logical operations.
ADD, SUB, MUL integer and FP types. OR, AND, XOR.
Added embeded broadcast form for these instructions.

llvm-svn: 188673
2013-08-19 13:26:14 +00:00
Elena Demikhovsky 3ce8dbbac2 AVX-512: Added VMOVD, VMOVQ, VMOVSS, VMOVSD instructions.
llvm-svn: 188637
2013-08-18 13:08:57 +00:00
Craig Topper e6861c9ce5 Make more of the lowering helpers static. Also use MVT instead of EVT in a couple places.
llvm-svn: 188629
2013-08-18 08:53:01 +00:00
Craig Topper 8dbc7e9d35 Revert r188449 as it turns out we're just missing the instructions that need the v16i32/v16f32 matching.
llvm-svn: 188454
2013-08-15 08:38:25 +00:00
Craig Topper 2ffd06528d Don't let isPermImmMask handle v16i32 since VPERMI doesn't match on that type. Remove 128-bit vector handling from isPermImmMask too, it's covered by isPSHUFDMask.
llvm-svn: 188449
2013-08-15 07:30:51 +00:00
Craig Topper 6f4dd2dacf Use MVT in place of EVT in more X86 operation lowering functions.
llvm-svn: 188445
2013-08-15 05:33:45 +00:00
Craig Topper 5671010cbb Replace getValueType().getSimpleVT() with getSimpleValueType(). Also remove one weird cast from MVT->EVT just to call getSimpleVT().
llvm-svn: 188441
2013-08-15 02:33:50 +00:00
Craig Topper d03748cf5e Make more helper methods into static functions.
llvm-svn: 188366
2013-08-14 07:53:41 +00:00
Craig Topper 7b7b159574 Remove tab characters.
llvm-svn: 188365
2013-08-14 07:35:18 +00:00
Craig Topper d905fded68 Make some helper methods static.
llvm-svn: 188364
2013-08-14 07:34:43 +00:00
Craig Topper 60769e050d Use MVT in more lowering code.
llvm-svn: 188363
2013-08-14 07:04:42 +00:00
Craig Topper 52b00359b1 Replace EVT with MVT in isVectorShift. Keeps compiler from generating unneeded checks and handling for extended types.
llvm-svn: 188362
2013-08-14 06:21:10 +00:00
Craig Topper 67476d7485 Replace EVT with MVT in many of the shuffle lowering functions. Keeps compiler from generating unneeded checks and handling for extended types.
llvm-svn: 188361
2013-08-14 05:58:39 +00:00
Evgeniy Stepanov 7dee697faa Fix compiler warnings.
../lib/Target/X86/X86ISelLowering.cpp:9715:7: error: unused variable 'OpVT' [-Werror,-Wunused-variable]
  EVT OpVT = Op0.getValueType();
      ^
../lib/Target/X86/X86ISelLowering.cpp:9763:14: error: unused variable 'NumElems' [-Werror,-Wunused-variable]
    unsigned NumElems = VT.getVectorNumElements();

llvm-svn: 188269
2013-08-13 14:04:20 +00:00
Elena Demikhovsky 60b1f289f2 AVX-512: Added CMP and BLEND instructions.
Lowering for SETCC.

llvm-svn: 188265
2013-08-13 13:24:07 +00:00
Elena Demikhovsky 5fed3b95db AVX-512: Added more tests for BROADCAST
llvm-svn: 188148
2013-08-11 12:29:16 +00:00
Elena Demikhovsky cf5b1458e6 AVX-512: Added VPERM* instructons and MOV* zmm-to-zmm instructions.
Added a test for shuffles using VPERM.

llvm-svn: 188147
2013-08-11 07:55:09 +00:00
Elena Demikhovsky 45c54ad8dc AVX-512 set: Added BROADCAST instructions
with lowering logic and a test.

llvm-svn: 187884
2013-08-07 12:34:55 +00:00
Craig Topper c5b0ad27ab Simplify code. No functional change intended.
llvm-svn: 187870
2013-08-07 08:16:07 +00:00
Tim Northover a4415854db Refactor isInTailCallPosition handling
This change came about primarily because of two issues in the existing code.
Niether of:

define i64 @test1(i64 %val) {
  %in = trunc i64 %val to i32
  tail call i32 @ret32(i32 returned %in)
  ret i64 %val
}

define i64 @test2(i64 %val) {
  tail call i32 @ret32(i32 returned undef)
  ret i32 42
}

should be tail calls, and the function sameNoopInput is responsible. The main
problem is that it is completely symmetric in the "tail call" and "ret" value,
but in reality different things are allowed on each side.

For these cases:
1. Any truncation should lead to a larger value being generated by "tail call"
   than needed by "ret".
2. Undef should only be allowed as a source for ret, not as a result of the
   call.

Along the way I noticed that a mismatch between what this function treats as a
valid truncation and what the backends see can lead to invalid calls as well
(see x86-32 test case).

This patch refactors the code so that instead of being based primarily on
values which it recurses into when necessary, it starts by inspecting the type
and considers each fundamental slot that the backend will see in turn. For
example, given a pathological function that returned {{}, {{}, i32, {}}, i32}
we would consider each "real" i32 in turn, and ask if it passes through
unchanged. This is much closer to what the backend sees as a result of
ComputeValueVTs.

Aside from the bug fixes, this eliminates the recursion that's going on and, I
believe, makes the bulk of the code significantly easier to understand. The
trade-off is the nasty iterators needed to find the real types inside a
returned value.

llvm-svn: 187787
2013-08-06 09:12:35 +00:00
Craig Topper cf969eadaf Simplify vector lane handling math a bit. No functional change intended.
llvm-svn: 187783
2013-08-06 07:23:12 +00:00
Craig Topper 7418ff460c Simplify math a little bit.
llvm-svn: 187781
2013-08-06 06:54:25 +00:00
Craig Topper 9bc00b65b6 Replace EVT with MVT in isHorizontalBinOp as it is only called with legal types.
llvm-svn: 187779
2013-08-06 06:05:05 +00:00
Craig Topper 47d7c5c8fe Simplify code slightly. No functional change.
llvm-svn: 187771
2013-08-06 04:12:40 +00:00
Aaron Ballman 5b4634576e Silencing an MSVC11 type conversion warning.
llvm-svn: 187727
2013-08-05 13:47:03 +00:00
Elena Demikhovsky 40864b690b AVX-512 set: added mask operations, lowering BUILD_VECTOR for i1 vector types.
Added intrinsics and tests.

llvm-svn: 187717
2013-08-05 08:52:21 +00:00
Benjamin Kramer 5bc180c14f X86: Turn fp selects into mask operations.
double test(double a, double b, double c, double d) { return a<b ? c : d; }

before:
_test:
	ucomisd	%xmm0, %xmm1
	ja	LBB0_2
	movaps	%xmm3, %xmm2
LBB0_2:
	movaps	%xmm2, %xmm0

after:
_test:
	cmpltsd	%xmm1, %xmm0
	andpd	%xmm0, %xmm2
	andnpd	%xmm3, %xmm0
	orpd	%xmm2, %xmm0

Small speedup on Benchmarks/SmallPT

llvm-svn: 187706
2013-08-04 12:05:16 +00:00
Tim Northover ecc018c7b7 X86: correct tail return address calculation
Due to the weird and wondeful usual arithmetic conversions, some
calculations involving negative values were getting performed in
uint32_t and then promoted to int64_t, which is really not a good
idea.

Patch by Katsuhiro Ueno.

llvm-svn: 187703
2013-08-04 09:35:57 +00:00
Elena Demikhovsky b1266b5447 EVEX and compressed displacement encoding for AVX512
llvm-svn: 187576
2013-08-01 13:34:06 +00:00
Elena Demikhovsky b0a75431ad Fixed assertion in Extract128BitVector()
llvm-svn: 187493
2013-07-31 12:03:08 +00:00
Elena Demikhovsky 67b05fc0b3 Added INSERT and EXTRACT intructions from AVX-512 ISA.
All insertf*/extractf* functions replaced with insert/extract since we have insertf and inserti forms.
Added lowering for INSERT_VECTOR_ELT / EXTRACT_VECTOR_ELT for 512-bit vectors.
Added lowering for EXTRACT/INSERT subvector for 512-bit vectors.
Added a test.

llvm-svn: 187491
2013-07-31 11:35:14 +00:00
Nico Rieck 06d17c80cc Proper va_arg/va_copy lowering on win64
Win64 uses CharPtrBuiltinVaList instead of X86_64ABIBuiltinVaList like
other 64-bit targets.

llvm-svn: 187355
2013-07-29 13:07:06 +00:00
Justin Holewinski d3f2035a3c Add a target legalize hook for SplitVectorOperand (again)
CustomLowerNode was not being called during SplitVectorOperand,
meaning custom legalization could not be used by targets.

This also adds a test case for NVPTX that depends on this custom
legalization.

Differential Revision: http://llvm-reviews.chandlerc.com/D1195

Attempt to fix the buildbots by making the X86 test I just added platform independent

llvm-svn: 187202
2013-07-26 13:28:29 +00:00
Rafael Espindola 1d812728cc Revert "Add a target legalize hook for SplitVectorOperand"
This reverts commit 187198. It broke the bots.

The soft float test probably needs a -triple because of name differences.
On the hard float test I am getting a "roundss $1, %xmm0, %xmm0", instead of
"vroundss $1, %xmm0, %xmm0, %xmm0".

llvm-svn: 187201
2013-07-26 13:18:16 +00:00
Justin Holewinski f848a24e50 Add a target legalize hook for SplitVectorOperand
CustomLowerNode was not being called during SplitVectorOperand,
meaning custom legalization could not be used by targets.

This also adds a test case for NVPTX that depends on this custom
legalization.

Differential Revision: http://llvm-reviews.chandlerc.com/D1195

llvm-svn: 187198
2013-07-26 12:46:39 +00:00
Elena Demikhovsky 8cfb43f73b I'm starting to commit KNL backend. I'll push patches one-by-one. This patch includes support for the extended register set XMM16-31, YMM16-31, ZMM0-31.
The full ISA you can see here: http://software.intel.com/en-us/intel-isa-extensions

llvm-svn: 187030
2013-07-24 11:02:47 +00:00
Juergen Ributzka 3d527d80b8 [X86] Use min/max to optimze unsigend vector comparison on X86
Use PMIN/PMAX for UGE/ULE vector comparions to reduce the number of required
instructions. This trick also works for UGT/ULT, but there is no advantage in
doing so. It wouldn't reduce the number of instructions and it would actually
reduce performance.

Reviewer: Ben

radar:5972691

llvm-svn: 186432
2013-07-16 18:20:45 +00:00
Craig Topper 202fbc2c9b Add 'static' keyword to some const arrays for consistency.
llvm-svn: 186308
2013-07-15 06:54:12 +00:00
Craig Topper b94011fd28 Use SmallVectorImpl& instead of SmallVector to avoid repeating small vector size.
llvm-svn: 186274
2013-07-14 04:42:23 +00:00
Stephen Lin fda967fdea X86: fold SSE2/AVX2 logical shift by immediate amount into zero vector when possible
Patch by Andrea Di Biagio

llvm-svn: 186165
2013-07-12 15:31:36 +00:00
Charles Davis e8f297ca94 Target/X86: Add explicit Win64 and System V/x86-64 calling conventions.
Summary:
This patch adds explicit calling convention types for the Win64 and
System V/x86-64 ABIs. This allows code to override the default, and use
the Win64 convention on a target that wants to use SysV (and
vice-versa). This is needed to implement the `ms_abi` and `sysv_abi` GNU
attributes.

Reviewers:

CC:

llvm-svn: 186144
2013-07-12 06:02:35 +00:00
Stephen Lin 73de7bf5de AArch64/PowerPC/SystemZ/X86: This patch fixes the interface, usage, and all
in-tree implementations of TargetLoweringBase::isFMAFasterThanMulAndAdd in
order to resolve the following issues with fmuladd (i.e. optional FMA)
intrinsics:

1. On X86(-64) targets, ISD::FMA nodes are formed when lowering fmuladd
intrinsics even if the subtarget does not support FMA instructions, leading
to laughably bad code generation in some situations.

2. On AArch64 targets, ISD::FMA nodes are formed for operations on fp128,
resulting in a call to a software fp128 FMA implementation.

3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics on types
like v2f32, v8f32, v4f64, etc., even though they promote, split, scalarize,
etc. to types that support hardware FMAs.

The function has also been slightly renamed for consistency and to force a
merge/build conflict for any out-of-tree target implementing it. To resolve,
see comments and fixed in-tree examples.

llvm-svn: 185956
2013-07-09 18:16:56 +00:00