llvm-project

Commit Graph

Author	SHA1	Message	Date
Eric Christopher	05b819718c	Reuse a bunch of cached subtargets and remove getSubtarget calls without a Function argument. llvm-svn: 227814	2015-02-02 17:38:43 +00:00
Craig Topper	2844ca7319	[X86] Add a few target specific nodes to 'getTargetNodeName' llvm-svn: 227720	2015-02-01 10:00:37 +00:00
Simon Pilgrim	9c76b47469	[X86][SSE] Shuffle mask decode support for zero extend, scalar float/double moves and integer load instructions This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd). Also adds shuffle mask decodes for integer loads (movd/movq). Differential Revision: http://reviews.llvm.org/D7228 llvm-svn: 227688	2015-01-31 14:09:36 +00:00
Eric Christopher	295596a0a7	Remove the last vestiges of resetOperationActions. llvm-svn: 227648	2015-01-31 00:21:17 +00:00
Reid Kleckner	a580b6ec67	Win64: Put a REX_W prefix on all TAILJMP* instructions MSDN's x64 software conventions page says that this is one of the fixed list of legal epilogues: https://msdn.microsoft.com/en-us/library/tawsa7cb.aspx Presumably this is how the unwinder distinguishes epilogue jumps from in-function control flow. Also normalize the way we place "## TAILCALL" comments on such jumps. llvm-svn: 227611	2015-01-30 21:03:31 +00:00
Sanjay Patel	e4ca47875f	tidy up; NFC llvm-svn: 227582	2015-01-30 16:58:58 +00:00
Reid Kleckner	e83ab8c2de	x86: Remove unused variables not caught by MSVC =P llvm-svn: 227520	2015-01-30 00:05:39 +00:00
Reid Kleckner	ca9b9feb2c	x86: Fix large model calls to __chkstk for dynamic allocas In the large code model, we now put __chkstk in %r11 before calling it. Refactor the code so that we only do this once. Simplify things by using __chkstk_ms instead of __chkstk on cygming. We already use that symbol in the prolog emission, and it simplifies our logic. Second half of PR18582. llvm-svn: 227519	2015-01-29 23:58:04 +00:00
Sanjay Patel	3a907eaccd	Change SmallVector param to the more general ArrayRef; NFCI llvm-svn: 227514	2015-01-29 23:35:04 +00:00
Reid Kleckner	f0abdae34e	x86: Remove the W64ALLOCA pseudo This is just an alias for CALL64pcrel32, and we can just use that opcode with explicit defs in the MI. No functionality change. llvm-svn: 227508	2015-01-29 23:09:37 +00:00
Simon Pilgrim	80bd3c9e5f	Spelling fixes. NFC. llvm-svn: 227376	2015-01-28 22:03:52 +00:00
Sanjay Patel	4058dd9f3f	invert check for less indentation; use local vars to reduce duplication; NFC llvm-svn: 227355	2015-01-28 19:44:21 +00:00
Sanjay Patel	9bb601856e	use SDValue methods directly instead of getNode()->* ; NFCI llvm-svn: 227334	2015-01-28 18:01:31 +00:00
Michael Kuperstein	951995821a	[X86] Reduce some 32-bit imuls into lea + shl Reduce integer multiplication by a constant of the form k*2^c, where k is in {3,5,9} into a lea + shl. Previously it was only done for imulq on 64-bit platforms, but it makes sense for imull and 32-bit as well. Differential Revision: http://reviews.llvm.org/D7196 llvm-svn: 227308	2015-01-28 14:08:22 +00:00
Michael Kuperstein	f387611ac2	[x32] Enable sibcall optimization on x32. This includes two things: 1) Fix TCRETURNdi and TCRETURN64di patterns to check the right thing (LP64 as opposed to target bitness). 2) Allow LEA64_32 in MatchingStackOffset. llvm-svn: 227307	2015-01-28 13:38:48 +00:00
Elena Demikhovsky	7b0dd39db6	AVX-512: Added FMA intrinsics with rounding mode By Asaf Badouh and Elena Demikhovsky Added special nodes for rounding: FMADD_RND, FMSUB_RND.. It will prevent merge between nodes with rounding and other standard nodes. llvm-svn: 227303	2015-01-28 10:21:27 +00:00
Alexey Samsonov	533948088e	Revert "[x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector" This reverts commits r226953 and r226974. llvm-svn: 227248	2015-01-27 21:34:11 +00:00
Elena Demikhovsky	1a603b3f13	AVX-512: Changes in operations on masks registers for KNL and SKX - Added KSHIFTB/D/Q for skx - Added KORTESTB/D/Q for skx - Fixed store operation for v8i1 type for KNL - Store size of v8i1, v4i1 and v2i1 are changed to 8 bits llvm-svn: 227043	2015-01-25 12:47:15 +00:00
Bruno Cardoso Lopes	ddcc2e31a7	[x86] Fix a comment llvm-svn: 226974	2015-01-24 00:22:04 +00:00
Bruno Cardoso Lopes	56567f9135	[x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector Handle the poor codegen for i64/x86xmm->v2i64 (%mm -> %xmm) moves. Instead of using stack store/load pair to do the job, use scalar_to_vector directly, which in the MMX case can use movq2dq. This was the current behavior prior to improvements for vector legalization of extloads in r213897. This commit fixes the regression and as a side-effect also remove some unnecessary shuffles. In the new attached testcase, we go from: pshufw $-18, (%rdi), %mm0 movq %mm0, -8(%rsp) movq -8(%rsp), %xmm0 pshufd $-44, %xmm0, %xmm0 movd %xmm0, %eax ... To: pshufw $-18, (%rdi), %mm0 movq2dq %mm0, %xmm0 movd %xmm0, %eax ... Differential Revision: http://reviews.llvm.org/D7126 rdar://problem/19413324 llvm-svn: 226953	2015-01-23 22:44:16 +00:00
Eric Christopher	a1c6e0c8ce	Remove some local variables in place of just querying for them in the couple of asserts. llvm-svn: 226917	2015-01-23 17:22:44 +00:00
Alexander Potapenko	a007905e4e	Mark \|TLI\| variables used to suppress -Wunused-variable warnings. (These vars are only used in assertions) llvm-svn: 226815	2015-01-22 13:03:33 +00:00
Elena Demikhovsky	150d9f3187	Fixed a bug in type legalizer for masked load/store intrinsics. The problem occurs when after vectorization we have type <2 x i32>. This type is promoted to <2 x i64> and then requires additional efforts for expanding loads and truncating stores. I added EXPAND / TRUNCATE attributes to the masked load/store SDNodes. The code now contains additional shuffles. I've prepared changes in the cost estimation for masked memory operations, it will be submitted separately. llvm-svn: 226808	2015-01-22 12:07:59 +00:00
Simon Pilgrim	b16b09b154	[X86][SSE] Added support for SSE3 lane duplication shuffle instructions This patch adds shuffle matching for the SSE3 MOVDDUP, MOVSLDUP and MOVSHDUP instructions. The big use of these being that they avoid many single source shuffles from needing to use (pre-AVX) dual source instructions such as SHUFPD/SHUFPS: causing extra moves and preventing load folds. Adding these instructions uncovered an issue in XFormVExtractWithShuffleIntoLoad which crashed on single operand shuffle instructions (now fixed). It also involved fixing getTargetShuffleMask to correctly identify theses instructions as unary shuffles. Also adds a missing tablegen pattern for MOVDDUP. Differential Revision: http://reviews.llvm.org/D7042 llvm-svn: 226716	2015-01-21 22:44:35 +00:00
Ahmed Bougacha	8f09e9f7c5	[X86] Declare SSE4.1/AVX2 vector extloads covered by PMOV[SZ]X legal. Now that we can fully specify extload legality, we can declare them legal for the PMOVSX/PMOVZX instructions. This for instance enables a DAGCombine to fire on code such as (and (<zextload-equivalent> ...), <redundant mask>) to turn it into: (zextload ...) as seen in the testcase changes. There is one regression, in widen_load-2.ll: we're no longer able to do store-to-load forwarding with illegal extload memory types. This will be addressed separately. Differential Revision: http://reviews.llvm.org/D6533 llvm-svn: 226676	2015-01-21 17:07:06 +00:00
Andrea Di Biagio	ae47bc6ab9	[X86][DAG] Disable target specific combine on INSERTPS dag nodes at -O0. This patch disables target specific combine on X86ISD::INSERTPS dag nodes if optlevel is CodeGenOpt::None. The backend currently implements a target specific combine rule that converts a vector load used by an INSERTPS dag node into a scalar load plus a scalar_to_vector. This allows ISel to select a single INSERTPSrm instead of two instructions (i.e. a vector load plus INSERTPSrr). However, the existing target combine rule on INSERTPS nodes only works under the assumption that ISel will always be able to match an INSERTPSrm. This is not true in general at -O0, since the backend only allows folding a load into the memory operand of an instruction if the optimization level is not CodeGenOpt::None. In the example below: // __m128 test(__m128 a, __m128 b) { __m128 c = _mm_insert_ps(a, b, 1 << 6); return c; } // Before this patch, at -O0, the backend would have canonicalized the load to 'b' into a scalar load plus scalar_to_vector. Later on, ISel would have selected an INSERTPSrr leaving the insertps mask in an inconsistent state: movss 4(%rdi), %xmm1 insertps $64, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1,2,3]. With this patch, the backend avoids folding the vector load into the operand of the INSERTPS. The new codegen at -O0 is: movaps (%rdi), %xmm1 insertps $64, %xmm1, %xmm0 # %xmm1[1],xmm0[1,2,3]. llvm-svn: 226277	2015-01-16 14:55:26 +00:00
Adam Nemet	e5dbcb7fd0	[AVX512] Unpack support in new shuffle lowering This now handles both 32 and 64-bit element sizes. In this version, the test are in vector-shuffle-512-v8.ll, canonicalized by Chandler's update_llc_test_checks.py. Part of <rdar://problem/17688758> llvm-svn: 225838	2015-01-13 22:20:18 +00:00
Simon Pilgrim	d88ab87064	[X86][SSE] Minor regression fix for r225551 r225551 vector byte shuffle optimization caused an assertion as fully zeroable vectors can be produced under certain circumstances. This fix drops the assert and returns a zero vector where the assert would have failed. llvm-svn: 225718	2015-01-12 22:38:08 +00:00
Ahmed Bougacha	291833b959	[X86] Also create+widen FMIN/FMAX nodes for v2f32. This happens in the HINT benchmark, where the SLP-vectorizer created v2f32 fcmp/select code. The "correct" solution would have been to teach the vectorizer cost model that v2f32 isn't legal (because really, it isn't), but if we can vectorize we might as well do so. We legalize these v2f32 FMIN/FMAX nodes by widening to v4f32 later on. v3f32 were already widened to v4f32 by the generic unroll-and-build-vector legalization. rdar://15763436 Differential Revision: http://reviews.llvm.org/D6557 llvm-svn: 225691	2015-01-12 20:31:30 +00:00
David Majnemer	14141f941a	Revert most of r225597 We can't rely on a DataLayout enlightened constant folder. llvm-svn: 225599	2015-01-11 07:29:51 +00:00
David Majnemer	292d0c796b	X86: Properly decode shuffle masks when the constant pool type is weird It's possible for the constant pool entry for the shuffle mask to come from a completely different operation. This occurs when Constants have the same bit pattern but have different types. Make DecodePSHUFBMask tolerant of types which, after a bitcast, are appropriately sized vector types. This fixes PR22188. llvm-svn: 225597	2015-01-11 05:08:57 +00:00
Saleem Abdulrasool	9cf2679d3b	X86: teach X86TargetLowering about L,M,O constraints Teach the ISelLowering for X86 about the L,M,O target specific constraints. Although, for the moment, clang performs constraint validation and prevents passing along inline asm which may have immediate constant constraints violated, the backend should be able to cope with the invalid inline asm a bit better. llvm-svn: 225596	2015-01-11 04:39:24 +00:00
Simon Pilgrim	94a4cc027a	[X86][SSE] Improved (v)insertps shuffle matching In the current code we only attempt to match against insertps if we have exactly one element from the second input vector, irrespective of how much of the shuffle result is zeroable. This patch checks to see if there is a single non-zeroable element from either input that requires insertion. It also supports matching of cases where only one of the inputs need to be referenced. We also split insertps shuffle matching off into a new lowerVectorShuffleAsInsertPS function. Differential Revision: http://reviews.llvm.org/D6879 llvm-svn: 225589	2015-01-10 19:45:33 +00:00
Simon Pilgrim	ec1f2c2cab	[X86][SSE] Avoid vector byte shuffles with zero by using pshufb to create zeros pshufb can shuffle in zero bytes as well as bytes from a source vector - we can use this to avoid having to shuffle 2 vectors and ORing the result when the used inputs from a vector are all zeroable. Differential Revision: http://reviews.llvm.org/D6878 llvm-svn: 225551	2015-01-09 22:03:19 +00:00
Chandler Carruth	685b1803ab	[x86] Add a flag to control the vector shuffle legality predicates that complements the new vector shuffle lowering code path. This flag, naturally, is off because we've not tested or evaluated the results of this at all. However, the flag will make it much easier to evaluate whether we can be this aggressive and whether there are missing vector shuffle lowering optimizations. llvm-svn: 225491	2015-01-09 01:24:36 +00:00
Ahmed Bougacha	d716121888	[X86] Reflow comment. NFC. llvm-svn: 225455	2015-01-08 17:49:48 +00:00
Michael Kuperstein	46f7d525c3	[X86] Don't try to generate direct calls to TLS globals The call lowering assumes that if the callee is a global, we want to emit a direct call. This is correct for regular globals, but not for TLS ones. Differential Revision: http://reviews.llvm.org/D6862 llvm-svn: 225438	2015-01-08 11:50:58 +00:00
Ahmed Bougacha	2b6917b020	[SelectionDAG] Allow targets to specify legality of extloads' result type (in addition to the memory type). The LoadExt legalization handling used to only have one type, the memory type. This forced users to assume that as long as the extload for the memory type was declared legal, and the result type was legal, the whole extload was legal. However, this isn't always the case. For instance, on X86, with AVX, this is legal: v4i32 load, zext from v4i8 but this isn't: v4i64 load, zext from v4i8 Whereas v4i64 is (arguably) legal, even without AVX2. Note that the same thing was done a while ago for truncstores (r46140), but I assume no one needed it yet for extloads, so here we go. Calls to getLoadExtAction were changed to add the value type, found manually in the surrounding code. Calls to setLoadExtAction were mechanically changed, by wrapping the call in a loop, to match previous behavior. The loop iterates over the MVT subrange corresponding to the memory type (FP vectors, etc...). I also pulled neighboring setTruncStoreActions into some of the loops; those shouldn't make a difference, as the additional types are illegal. (e.g., i128->i1 truncstores on PPC.) No functional change intended. Differential Revision: http://reviews.llvm.org/D6532 llvm-svn: 225421	2015-01-08 00:51:32 +00:00
Ahmed Bougacha	67dd2d25a3	[CodeGen] Use MVT iterator_ranges in legality loops. NFC intended. A few loops do trickier things than just iterating on an MVT subset, so I'll leave them be for now. Follow-up of r225387. llvm-svn: 225392	2015-01-07 21:27:10 +00:00
Ahmed Bougacha	b994d0c0c5	[X86] Fix 512->256 typo in comments. NFC. llvm-svn: 225367	2015-01-07 19:38:50 +00:00
Ahmed Bougacha	aa2d290997	[X86] Teach FCOPYSIGN lowering to recognize constant magnitudes. For code like: float foo(float x) { return copysign(1.0, x); } We used to generate: andps <-0.000000e+00,0,0,0>, %xmm0 movss <1.000000e+00>, %xmm1 andps <nan>, %xmm1 orps %xmm0, %xmm1 Basically doing an abs(1.0f) in the two middle instructions. We now generate: andps <-0.000000e+00,0,0,0>, %xmm0 orps <1.000000e+00,0,0,0>, %xmm0 Builds on cleanups r223415, r223542. rdar://19049548 Differential Revision: http://reviews.llvm.org/D6555 llvm-svn: 225357	2015-01-07 17:33:03 +00:00
David Majnemer	29c52f7449	X86: Don't make illegal GOTTPOFF relocations "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF relocation target a movq or addq instruction. Prohibit the truncation of such loads to movl or addl. This fixes PR22083. Differential Revision: http://reviews.llvm.org/D6839 llvm-svn: 225250	2015-01-06 07:12:52 +00:00
Craig Topper	49758aab94	[X86] Make isel select the shorter form of jump instructions instead of the long form. The assembler backend will relax to the long form if necessary. This removes a swap from long form to short form in the MCInstLowering code. Selecting the long form used to be required by the old JIT. llvm-svn: 225242	2015-01-06 04:23:53 +00:00
Simon Pilgrim	4c55af6850	[X86][SSE] lowerVectorShuffleAsByteShift tidyup Removed local isSequential predicate and use standard helper isSequentialOrUndefInRange instead. llvm-svn: 225216	2015-01-05 22:08:48 +00:00
Simon Pilgrim	71b96b35e1	[X86][SSE] Fixed description for isSequentialOrUndefInRange. NFC. llvm-svn: 225202	2015-01-05 21:09:48 +00:00
Andrea Di Biagio	6477847ef4	Improved comments. No functional change intended. llvm-svn: 225080	2015-01-02 10:47:46 +00:00
Andrea Di Biagio	22ee3f63b9	[CodeGenPrepare] Teach when it is profitable to speculate calls to @llvm.cttz/ctlz. If the control flow is modelling an if-statement where the only instruction in the 'then' basic block (excluding the terminator) is a call to cttz/ctlz, CodeGenPrepare can try to speculate the cttz/ctlz call and simplify the control flow graph. Example: \code entry: %cmp = icmp eq i64 %val, 0 br i1 %cmp, label %end.bb, label %then.bb then.bb: %c = tail call i64 @llvm.cttz.i64(i64 %val, i1 true) br label %end.bb end.bb: %cond = phi i64 [ %c, %then.bb ], [ 64, %entry] \code In this example, basic block %then.bb is taken if value %val is not zero. Also, the phi node in %end.bb would propagate the size-of in bits of %val only if %val is equal to zero. With this patch, CodeGenPrepare will try to hoist the call to cttz from %then.bb into basic block %entry only if cttz is cheap to speculate for the target. Added two new hooks in TargetLowering.h to let targets customize the behavior (i.e. decide whether it is cheap or not to speculate calls to cttz/ctlz). The two new methods are 'isCheapToSpeculateCtlz' and 'isCheapToSpeculateCttz'. By default, both methods return 'false'. On X86, method 'isCheapToSpeculateCtlz' returns true only if the target has LZCNT. Method 'isCheapToSpeculateCttz' only returns true if the target has BMI. Differential Revision: http://reviews.llvm.org/D6728 llvm-svn: 224899	2014-12-28 11:07:35 +00:00
Aaron Ballman	4eb5c2e089	Fixing another -Wunused-variable warning, this time in release builds without asserts. NFC. llvm-svn: 224889	2014-12-27 19:17:53 +00:00
Aaron Ballman	b66d54c549	Removing a variable that is set but never used, to silence a -Wunused-but-set-variable warning; NFC. llvm-svn: 224888	2014-12-27 19:01:19 +00:00
Elena Demikhovsky	fcea06acb5	AVX-512: Added FMA instructions, intrinsics an tests for KNL and SKX targets by Asaf Badouh http://reviews.llvm.org/D6456 llvm-svn: 224764	2014-12-23 10:30:39 +00:00
Elena Demikhovsky	3121449f0b	AVX-512: BLENDM - fixed encoding of the broadcast version Added more intrinsics and encoding tests. llvm-svn: 224760	2014-12-23 09:36:28 +00:00
Jim Grosbach	1bd0f3530e	X86: Don't over-align combined loads. When combining consecutive loads+inserts into a single vector load, we should keep the alignment of the base load. Doing otherwise can, and does, lead to using overly aligned instructions. In the included test case, for example, using a 32-byte vmovaps on a 16-byte aligned value. Oops. rdar://19190968 llvm-svn: 224746	2014-12-23 00:35:23 +00:00
Reid Kleckner	ce0093344f	Make musttail more robust for vector types on x86 Previously I tried to plug musttail into the existing vararg lowering code. That turned out to be a mistake, because non-vararg calls use significantly different register lowering, even on x86. For example, AVX vectors are usually passed in registers to normal functions and memory to vararg functions. Now musttail uses a completely separate lowering. Hopefully this can be used as the basis for non-x86 perfect forwarding. Reviewers: majnemer Differential Revision: http://reviews.llvm.org/D6156 llvm-svn: 224745	2014-12-22 23:58:37 +00:00
Bruno Cardoso Lopes	811c173523	[x86] Add vector @llvm.ctpop intrinsic custom lowering Currently, when ctpop is supported for scalar types, the expansion of @llvm.ctpop.vXiY uses vector element extractions, insertions and individual calls to @llvm.ctpop.iY. When not, expansion with bit-math operations is used for the scalar calls. Local haswell measurements show that we can improve vector @llvm.ctpop.vXiY expansion in some cases by using a using a vector parallel bit twiddling approach, based on: v = v - ((v >> 1) & 0x55555555); v = (v & 0x33333333) + ((v >> 2) & 0x33333333); v = ((v + (v >> 4) & 0xF0F0F0F) v = v + (v >> 8) v = v + (v >> 16) v = v & 0x0000003F (from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) When scalar ctpop isn't supported, the approach above performs better for v2i64, v4i32, v4i64 and v8i32 (see numbers below). And even when scalar ctpop is supported, this approach performs ~2x better for v8i32. Here, x86_64 implies -march=corei7-avx without ctpop and x86_64h includes ctpop support with -march=core-avx2. == [x86_64h - new] v8i32: 0.661685 v4i32: 0.514678 v4i64: 0.652009 v2i64: 0.324289 == [x86_64h - old] v8i32: 1.29578 v4i32: 0.528807 v4i64: 0.65981 v2i64: 0.330707 == [x86_64 - new] v8i32: 1.003 v4i32: 0.656273 v4i64: 1.11711 v2i64: 0.754064 == [x86_64 - old] v8i32: 2.34886 v4i32: 1.72053 v4i64: 1.41086 v2i64: 1.0244 More work for other vector types will come next. llvm-svn: 224725	2014-12-22 19:45:43 +00:00
Elena Demikhovsky	949b0d46bf	AVX-512: Added all forms of BLENDM instructions, intrinsics, encoding tests for AVX-512F and skx instructions. llvm-svn: 224707	2014-12-22 13:52:48 +00:00
Elena Demikhovsky	fb73ca516b	Masked load and store codegen - fixed 128-bit vectors The codegen failed on 128-bit types on AVX2. I added patterns and in td files and tests. llvm-svn: 224647	2014-12-19 23:27:57 +00:00
Robert Khasanov	79fb7292d7	[AVX512] Enable FP arithmetic lowering for AVX512VL subsets. Added RegOp2MemOpTable4 to transform 4th operand from register to memory in merge-masked versions of instructions. Added lowering tests. llvm-svn: 224516	2014-12-18 12:28:22 +00:00
Michael Kuperstein	047b1a0400	[DAGCombine] Slightly improve lowering of BUILD_VECTOR into a shuffle. This handles the case of a BUILD_VECTOR being constructed out of elements extracted from a vector twice the size of the result vector. Previously this was always scalarized. Now, we try to construct a shuffle node that feeds on extract_subvectors. This fixes PR15872 and provides a partial fix for PR21711. Differential Revision: http://reviews.llvm.org/D6678 llvm-svn: 224429	2014-12-17 12:32:17 +00:00
Quentin Colombet	fc2201e922	[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure: The type promotion helper does not support vector type, so when make such it does not kick in in such cases. Original commit message: [CodeGenPrepare] Move sign/zero extensions near loads using type promotion. This patch extends the optimization in CodeGenPrepare that moves a sign/zero extension near a load when the target can combine them. The optimization may promote any operations between the extension and the load to make that possible. Although this optimization may be beneficial for all targets, in particular AArch64, this is enabled for X86 only as I have not benchmarked it for other targets yet. Context Most targets feature extended loads, i.e., loads that perform a zero or sign extension for free. In that context it is interesting to expose such pattern in CodeGenPrepare so that the instruction selection pass can form such loads. Sometimes, this pattern is blocked because of instructions between the load and the extension. When those instructions are promotable to the extended type, we can expose this pattern. Motivating Example Let us consider an example: define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) { %ld = load i8* %addr1 %zextld = zext i8 %ld to i32 %ld2 = load i32* %addr2 %add = add nsw i32 %ld2, %zextld %sextadd = sext i32 %add to i64 %zexta = zext i8 %a to i32 %addza = add nsw i32 %zexta, %zextld %sextaddza = sext i32 %addza to i64 %addb = add nsw i32 %b, %zextld %sextaddb = sext i32 %addb to i64 call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb) ret void } As it is, this IR generates the following assembly on x86_64: [...] movzbl (%rdi), %eax # zero-extended load movl (%rsi), %es # plain load addl %eax, %esi # 32-bit add movslq %esi, %rdi # sign extend the result of add movzbl %dl, %edx # zero extend the first argument addl %eax, %edx # 32-bit add movslq %edx, %rsi # sign extend the result of add addl %eax, %ecx # 32-bit add movslq %ecx, %rdx # sign extend the result of add [...] The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA. Now, by promoting the additions to form more extended loads we would generate: [...] movzbl (%rdi), %eax # zero-extended load movslq (%rsi), %rdi # sign-extended load addq %rax, %rdi # 64-bit add movzbl %dl, %esi # zero extend the first argument addq %rax, %rsi # 64-bit add movslq %ecx, %rdx # sign extend the second argument addq %rax, %rdx # 64-bit add [...] The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA. This kind of sequences happen a lot on code using 32-bit indexes on 64-bit architectures. Note: The throughput numbers are similar on Sandy Bridge and Haswell. Proposed Solution To avoid the penalty of all these sign/zero extensions, we merge them in the loads at the beginning of the chain of computation by promoting all the chain of computation on the extended type. The promotion is done if and only if we do not introduce new extensions, i.e., if we do not degrade the code quality. To achieve this, we extend the existing “move ext to load” optimization with the promotion mechanism introduced to match larger patterns for addressing mode (r200947). The idea of this extension is to perform the following transformation: ext(promotableInst1(...(promotableInstN(load)))) => promotedInst1(...(promotedInstN(ext(load)))) The promotion mechanism in that optimization is enabled by a new TargetLowering switch, which is off by default. In other words, by default, the optimization performs the “move ext to load” optimization as it was before this patch. Performance Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10. Tested Optimization Levels: O3/Os Tests: llvm-testsuite + externals. Results: - No regression beside noise. - Improvements: CINT2006/473.astar: ~2% Benchmarks/PAQ8p: ~2% Misc/perlin: ~3% The results are consistent for both O3 and Os. <rdar://problem/18310086> llvm-svn: 224402	2014-12-17 01:36:17 +00:00
Reid Kleckner	04b69f89aa	Revert "[CodeGenPrepare] Move sign/zero extensions near loads using type promotion." This reverts commit r224351. It causes assertion failures when building ICU. llvm-svn: 224397	2014-12-17 00:29:23 +00:00
Quentin Colombet	d5e57b731f	[CodeGenPrepare] Move sign/zero extensions near loads using type promotion. This patch extends the optimization in CodeGenPrepare that moves a sign/zero extension near a load when the target can combine them. The optimization may promote any operations between the extension and the load to make that possible. Although this optimization may be beneficial for all targets, in particular AArch64, this is enabled for X86 only as I have not benchmarked it for other targets yet. Context Most targets feature extended loads, i.e., loads that perform a zero or sign extension for free. In that context it is interesting to expose such pattern in CodeGenPrepare so that the instruction selection pass can form such loads. Sometimes, this pattern is blocked because of instructions between the load and the extension. When those instructions are promotable to the extended type, we can expose this pattern. Motivating Example Let us consider an example: define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) { %ld = load i8* %addr1 %zextld = zext i8 %ld to i32 %ld2 = load i32* %addr2 %add = add nsw i32 %ld2, %zextld %sextadd = sext i32 %add to i64 %zexta = zext i8 %a to i32 %addza = add nsw i32 %zexta, %zextld %sextaddza = sext i32 %addza to i64 %addb = add nsw i32 %b, %zextld %sextaddb = sext i32 %addb to i64 call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb) ret void } As it is, this IR generates the following assembly on x86_64: [...] movzbl (%rdi), %eax # zero-extended load movl (%rsi), %es # plain load addl %eax, %esi # 32-bit add movslq %esi, %rdi # sign extend the result of add movzbl %dl, %edx # zero extend the first argument addl %eax, %edx # 32-bit add movslq %edx, %rsi # sign extend the result of add addl %eax, %ecx # 32-bit add movslq %ecx, %rdx # sign extend the result of add [...] The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA. Now, by promoting the additions to form more extended loads we would generate: [...] movzbl (%rdi), %eax # zero-extended load movslq (%rsi), %rdi # sign-extended load addq %rax, %rdi # 64-bit add movzbl %dl, %esi # zero extend the first argument addq %rax, %rsi # 64-bit add movslq %ecx, %rdx # sign extend the second argument addq %rax, %rdx # 64-bit add [...] The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA. This kind of sequences happen a lot on code using 32-bit indexes on 64-bit architectures. Note: The throughput numbers are similar on Sandy Bridge and Haswell. Proposed Solution To avoid the penalty of all these sign/zero extensions, we merge them in the loads at the beginning of the chain of computation by promoting all the chain of computation on the extended type. The promotion is done if and only if we do not introduce new extensions, i.e., if we do not degrade the code quality. To achieve this, we extend the existing “move ext to load” optimization with the promotion mechanism introduced to match larger patterns for addressing mode (r200947). The idea of this extension is to perform the following transformation: ext(promotableInst1(...(promotableInstN(load)))) => promotedInst1(...(promotedInstN(ext(load)))) The promotion mechanism in that optimization is enabled by a new TargetLowering switch, which is off by default. In other words, by default, the optimization performs the “move ext to load” optimization as it was before this patch. Performance Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10. Tested Optimization Levels: O3/Os Tests: llvm-testsuite + externals. Results: - No regression beside noise. - Improvements: CINT2006/473.astar: ~2% Benchmarks/PAQ8p: ~2% Misc/perlin: ~3% The results are consistent for both O3 and Os. <rdar://problem/18310086> llvm-svn: 224351	2014-12-16 19:09:03 +00:00
Robert Khasanov	d04cd2fbfe	[AVX512] Enable integer arithmetic lowering for AVX512BW/VL subsets. Added lowering tests. llvm-svn: 224349	2014-12-16 18:24:07 +00:00
Elena Demikhovsky	72860c341e	AVX-512: Added EXPAND instructions and intrinsics. llvm-svn: 224241	2014-12-15 10:03:52 +00:00
Robert Khasanov	37c3ad6c20	[AVX512] Enabling bit logic lowering Added lowering tests. llvm-svn: 224132	2014-12-12 17:02:18 +00:00
Robert Khasanov	e82a3630b7	[AVX512] Enabling MIN/MAX lowering. Added lowering tests. llvm-svn: 224127	2014-12-12 15:10:43 +00:00
Sanjay Patel	757942a38f	remove function names from comments; NFC llvm-svn: 224080	2014-12-11 23:38:43 +00:00
Sanjay Patel	c694ac5519	return without temporary; NFC llvm-svn: 224076	2014-12-11 23:30:36 +00:00
Elena Demikhovsky	908dbf48c8	AVX-512: Added all forms of COMPRESS instruction + intrinsics + tests llvm-svn: 224019	2014-12-11 15:02:24 +00:00
Elena Demikhovsky	fc081457f1	AVX-512: Fixed a bug in lowering setcc for MVT::i1 type llvm-svn: 224008	2014-12-11 10:21:12 +00:00
Michael Kuperstein	0104ff6529	[X86] Make a code path in EltsFromConsecutiveLoads work only on vectors it expects EltsFromConsecutiveLoads was apparently only ever called for 128-bit vectors, and assumed this implicitly. r223518 started calling it for AVX-sized vectors, causing the code path that had this assumption to crash. This adds a check to make this path fire only for 128-bit vectors. Differential Revision: http://reviews.llvm.org/D6579 llvm-svn: 223922	2014-12-10 08:46:12 +00:00
Elena Demikhovsky	fa4a6c18f7	AVX-512: Added some comments to ERI scalar intrinsics. No functional change. llvm-svn: 223761	2014-12-09 07:06:32 +00:00
Andrea Di Biagio	64bc246f3f	[X86] Improved lowering of packed v8i16 vector shifts by non-constant count. Before this patch, the backend sub-optimally expanded the non-constant shift count of a v8i16 shift into a sequence of two 'movd' plus 'movzwl'. With this patch the backend checks if the target features sse4.1. If so, then it lets the shuffle legalizer deal with the expansion of the shift amount. Example: ;; define <8 x i16> @test(<8 x i16> %A, <8 x i16> %B) { %shamt = shufflevector <8 x i16> %B, <8 x i16> undef, <8 x i32> zeroinitializer %shl = shl <8 x i16> %A, %shamt ret <8 x i16> %shl } ;; Before (with -mattr=+avx): vmovd %xmm1, %eax movzwl %ax, %eax vmovd %eax, %xmm1 vpsllw %xmm1, %xmm0, %xmm0 retq Now: vpxor %xmm2, %xmm2, %xmm2 vpblendw $1, %xmm1, %xmm2, %xmm1 vpsllw %xmm1, %xmm0, %xmm0 retq llvm-svn: 223660	2014-12-08 14:36:51 +00:00
Elena Demikhovsky	68e04b8613	X86 intrinsics moved form X86ISelLowering.cpp to X86IntrinsicsInfo.h X86ISelLowering.cpp has a long switch for intrinsics. I moved a part of this long switch to the new intrinsics table in X86IntrinsicsInfo.h. No functional changes, just code and compile time optimization. llvm-svn: 223641	2014-12-08 09:03:08 +00:00
Ahmed Bougacha	89bc485085	[X86] Cleanup FCOPYSIGN lowering. NFC intended. llvm-svn: 223542	2014-12-05 23:11:36 +00:00
Sanjay Patel	4bf9b7685c	Optimize merging of scalar loads for 32-byte vectors [X86, AVX] Fix the poor codegen seen in PR21710 ( http://llvm.org/bugs/show_bug.cgi?id=21710 ). Before we crack 32-byte build vectors into smaller chunks (and then subsequently glue them back together), we should look for the easy case where we can just load all elements in a single op. An example of the codegen change is: From: vmovss 16(%rdi), %xmm1 vmovups (%rdi), %xmm0 vinsertps $16, 20(%rdi), %xmm1, %xmm1 vinsertps $32, 24(%rdi), %xmm1, %xmm1 vinsertps $48, 28(%rdi), %xmm1, %xmm1 vinsertf128 $1, %xmm1, %ymm0, %ymm0 retq To: vmovups (%rdi), %ymm0 retq Differential Revision: http://reviews.llvm.org/D6536 llvm-svn: 223518	2014-12-05 21:28:14 +00:00
Jan Wen Voung	f547861ba0	Use 32-bit ebp for NaCl64 in a limited case: llvm.frameaddress. Summary: Follow up to [x32] "Use ebp/esp as frame and stack pointer": http://reviews.llvm.org/D4617 In that earlier patch, NaCl64 was made to always use rbp. That's needed for most cases because rbp should hold a full 64-bit address within the NaCl sandbox so that load/stores off of rbp don't require sandbox adjustment (zeroing the top 32-bits, then filling those by adding r15). However, llvm.frameaddress returns a pointer and pointers are 32-bit for NaCl64. In this case, use ebp instead, which will make the register copy type check. A similar mechanism may be needed for llvm.eh.return, but is not added in this change. Test Plan: test/CodeGen/X86/frameaddr.ll Reviewers: dschuff, nadav Subscribers: jfb, llvm-commits Differential Revision: http://reviews.llvm.org/D6514 llvm-svn: 223510	2014-12-05 20:55:53 +00:00
Andrea Di Biagio	3e425c8d19	[X86] Improved lowering of packed vector shifts to vpsllq/vpsrlq. SSE2/AVX non-constant packed shift instructions only use the lower 64-bit of the shift count. This patch teaches function 'getTargetVShiftNode' how to deal with shifts where the shift count node is of type MVT::i64. Before this patch, function 'getTargetVShiftNode' only knew how to deal with shift count nodes of type MVT::i32. This forced the backend to wrongly truncate the shift count to MVT::i32, and then zero-extend it back to MVT::i64. llvm-svn: 223505	2014-12-05 20:02:22 +00:00
Andrea Di Biagio	2876a67312	[X86] Avoid introducing extra shuffles when lowering packed vector shifts. When lowering a vector shift node, the backend checks if the shift count is a shuffle with a splat mask. If so, then it introduces an extra dag node to extract the splat value from the shuffle. The splat value is then used to generate a shift count of a target specific shift. However, if we know that the shift count is a splat shuffle, we can use the splat index 'I' to extract the I-th element from the first shuffle operand. The advantage is that the splat shuffle may become dead since we no longer use it. Example: ;; define <4 x i32> @example(<4 x i32> %a, <4 x i32> %b) { %c = shufflevector <4 x i32> %b, <4 x i32> undef, <4 x i32> zeroinitializer %shl = shl <4 x i32> %a, %c ret <4 x i32> %shl } ;; Before this patch, llc generated the following code (-mattr=+avx): vpshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0] vpxor %xmm2, %xmm2 vpblendw $3, %xmm1, %xmm2, %xmm1 # xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7] vpslld %xmm1, %xmm0, %xmm0 retq With this patch, the redundant splat operation is removed from the code. vpxor %xmm2, %xmm2 vpblendw $3, %xmm1, %xmm2, %xmm1 # xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7] vpslld %xmm1, %xmm0, %xmm0 retq llvm-svn: 223461	2014-12-05 12:13:30 +00:00
Eric Christopher	2189515132	Rename the x86 isTargetMacho to isTargetMachO for uniformity. llvm-svn: 223421	2014-12-05 00:22:38 +00:00
Eric Christopher	66322e822c	Both of these subtargets have functions that check whether or not the target is mach-o. Use them. llvm-svn: 223420	2014-12-05 00:22:35 +00:00
Ahmed Bougacha	24ebb93da1	[X86] Delete dead code in fcopysign lowering. NFC. r32900 introduced custom lowering for fcopysign, with two checks to change the magnitude value's type if it's larger/smaller than the sign value's type. r32932 replaced that code for the smaller case. r43205 did the same for the larger case, but left the old code, now dead. llvm-svn: 223415	2014-12-04 23:52:15 +00:00
Bruno Cardoso Lopes	fd52b95530	[x86] Fix isOffsetSuitableForCodeModel kernel code model offset Offset == 0 is a valid offset for kernel code model according to the x86_64 System V ABI. Found by inspection, no testcase. llvm-svn: 223383	2014-12-04 20:36:06 +00:00
Michael Kuperstein	0492bd2b9e	[X86] Improve a dag-combine that handles a vector extract -> zext sequence. The current DAG combine turns a sequence of extracts from <4 x i32> followed by zexts into a store followed by scalar loads. According to measurements by Martin Krastev (see PR 21269) for x86-64, a sequence of an extract, movs and shifts gives better performance. However, for 32-bit x86, the previous sequence still seems better. Differential Revision: http://reviews.llvm.org/D6501 llvm-svn: 223360	2014-12-04 13:49:51 +00:00
Andrea Di Biagio	61fac30180	[X86] Simplify code. NFC. Replaced some logic that checked if a build_vector node is doing a splat of a non-undef value with a call to method BuildVectorSDNode::getSplatValue(). No functional change intended. llvm-svn: 223354	2014-12-04 11:21:44 +00:00
Elena Demikhovsky	f1de34b84d	Masked Load / Store Intrinsics - the CodeGen part. I'm recommiting the codegen part of the patch. The vectorizer part will be send to review again. Masked Vector Load and Store Intrinsics. Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores. Added SDNodes for masked operations and lowering patterns for X86 code generator. Examples: <16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align /, <16 x i1> %mask) declare void @llvm.masked.store.v8f64(i8 %addr, <8 x double> %value, i32 4, <8 x i1> %mask) Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch. http://reviews.llvm.org/D6191 llvm-svn: 223348	2014-12-04 09:40:44 +00:00
Michael Liao	5bf9578ce4	[X86] Clean up whitespace as well as minor coding style llvm-svn: 223339	2014-12-04 05:20:33 +00:00
Michael Liao	d8faa61b20	[X86] Restore X86 base pointer after call to llvm.eh.sjlj.setjmp Commit on - This patch fixes the bug described in http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-May/062343.html The fix allocates an extra slot just below the GPRs and stores the base pointer there. This is done only for functions containing llvm.eh.sjlj.setjmp that also need a base pointer. Because code containing llvm.eh.sjlj.setjmp saves all of the callee-save GPRs in the prologue, the offset to the extra slot can be computed before prologue generation runs. Impact at run-time on affected functions is:: - One extra store in the prologue, The store saves the base pointer. - One extra load after a llvm.eh.sjlj.setjmp. The load restores the base pointer. Because the extra slot is just above a gap between frame-pointer-relative and base-pointer-relative chunks of memory, there is no impact on other offset calculations other than ensuring there is room for the extra slot. http://reviews.llvm.org/D6388 Patch by Arch Robison <arch.robison@intel.com> llvm-svn: 223329	2014-12-04 00:56:38 +00:00
Sanjay Patel	23b7ce2725	fix typos, grammar, formatting; NFC llvm-svn: 223276	2014-12-03 22:28:05 +00:00
Sanjay Patel	57f033a024	fix typo in comment llvm-svn: 223127	2014-12-02 17:25:27 +00:00
Philip Reames	0365f1a376	[Statepoints 2/4] Statepoint infrastructure for garbage collection: MI & x86-64 Backend This is the second patch in a small series. This patch contains the MachineInstruction and x86-64 backend pieces required to lower Statepoints. It does not include the code to actually generate the STATEPOINT machine instruction and as a result, the entire patch is currently dead code. I will be submitting the SelectionDAG parts within the next 24-48 hours. Since those pieces are by far the most complicated, I wanted to minimize the size of that patch. That patch will include the tests which exercise the functionality in this patch. The entire series can be seen as one combined whole in http://reviews.llvm.org/D5683. The STATEPOINT psuedo node is generated after all gc values are explicitly spilled to stack slots. The purpose of this node is to wrap an actual call instruction while recording the spill locations of the meta arguments used for garbage collection and other purposes. The STATEPOINT is modeled as modifing all of those locations to prevent backend optimizations from forwarding the value from before the STATEPOINT to after the STATEPOINT. (Doing so would break relocation semantics for collectors which wish to relocate roots.) The implementation of STATEPOINT is closely modeled on PATCHPOINT. Eventually, much of the code in this patch will be removed. The long term plan is to merge the functionality provided by statepoints and patchpoints. Merging their implementations in the backend is likely to be a good starting point. Reviewed by: atrick, ributzka llvm-svn: 223085	2014-12-01 22:52:56 +00:00
Duncan P. N. Exon Smith	9bc81fbe92	Revert "Masked Vector Load and Store Intrinsics." This reverts commit r222632 (and follow-up r222636), which caused a host of LNT failures on an internal bot. I'll respond to the commit on the list with a reproduction of one of the failures. Conflicts: lib/Target/X86/X86TargetTransformInfo.cpp llvm-svn: 222936	2014-11-28 21:29:14 +00:00
Elena Demikhovsky	905a5a606f	AVX-512: Scalar ERI intrinsics including SAE mode and memory operand. Added AVX512_maskable_scalar template, that should cover all scalar instructions in the future. The main difference between AVX512_maskable_scalar<> and AVX512_maskable<> is using X86select instead of vselect. I need it, because I can't create vselect node for MVT::i1 mask for scalar instruction. http://reviews.llvm.org/D6378 llvm-svn: 222820	2014-11-26 10:46:49 +00:00
Simon Pilgrim	371417db34	[X86][SSE] Improvements to byte shift shuffle matching Since (v)pslldq / (v)psrldq instructions resolve to a single input argument it is useful to match it much earlier than we currently do - this prevents more complicated shuffles (notably insertion into a zero vector) matching before it. Differential Revision: http://reviews.llvm.org/D6409 llvm-svn: 222796	2014-11-25 22:34:59 +00:00
Cameron McInally	9b7c15a364	[AVX512] Add 512b integer shift by variable intrinsics and patterns. llvm-svn: 222786	2014-11-25 20:41:51 +00:00
Andrea Di Biagio	23e2cfa834	[X86] Improved target specific combine on VSELECT dag nodes. This patch teaches function 'transformVSELECTtoBlendVECTOR_SHUFFLE' how to convert VSELECT dag nodes to shuffles on targets that do not have SSE4.1. On pre-SSE4.1 targets, we can still perform blend operations using movss/movsd. Also, removed a target specific combine that performed a premature lowering of VSELECT nodes to target specific MOVSS/MOVSD nodes. llvm-svn: 222647	2014-11-24 12:23:15 +00:00
Michael Kuperstein	9ef90647b9	[X86] Fixes bug in build_vector v4x32 lowering r222375 made some improvements to build_vector lowering of v4x32 and v4xf32 into an insertps, but it missed a case where: 1. A single extracted element is used twice. 2. The lower of the two non-zero indexes should be preserved, and the higher should be used for the dest mask. This caused a crash, since the source value for the insertps ends-up uninitialized. Differential Revision: http://reviews.llvm.org/D6377 llvm-svn: 222635	2014-11-23 13:09:06 +00:00
Elena Demikhovsky	9e5089a938	Masked Vector Load and Store Intrinsics. Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores. Added SDNodes for masked operations and lowering patterns for X86 code generator. Examples: <16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align /, <16 x i1> %mask) declare void @llvm.masked.store.v8f64(i8 %addr, <8 x double> %value, i32 4, <8 x i1> %mask) Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch. http://reviews.llvm.org/D6191 llvm-svn: 222632	2014-11-23 08:07:43 +00:00
Chandler Carruth	5852b1f4c2	[x86] Teach the vector shuffle yet another step of canonicalization. No functionality changed yet, but this will prevent subsequent patches from having to handle permutations of various interleaved shuffle patterns. llvm-svn: 222614	2014-11-22 09:18:53 +00:00
Sanjay Patel	501890e909	Add a feature flag for slow 32-byte unaligned memory accesses [x86]. This patch adds a feature flag to avoid unaligned 32-byte load/store AVX codegen for Sandy Bridge and Ivy Bridge. There is no functionality change intended for those chips. Previously, the absence of AVX2 was being used as a proxy to detect this feature. But that hindered codegen for AVX-enabled AMD chips such as btver2 that do not have the 32-byte unaligned access slowdown. Performance measurements are included in PR21541 ( http://llvm.org/bugs/show_bug.cgi?id=21541 ). Differential Revision: http://reviews.llvm.org/D6355 llvm-svn: 222544	2014-11-21 17:40:04 +00:00
Chandler Carruth	ce5a26b0e7	[x86] Restructure the checking patterns for v16 and v32 avx2 vector shuffle lowering to allow much better blend matching. Specifically, with the new structure the code seems clearer to me and we correctly can hit the cases where merging two 128-bit lanes is a clear win and can be shuffled cheaply afterward. llvm-svn: 222539	2014-11-21 14:53:03 +00:00
Chandler Carruth	6c4d1ea8c4	[x86] Make the previous logic significantly less conservative and get a bunch more improvements. Non-lane-crossing is fine, the key is that lane merging only makes sense for single-input shuffles. Not sure why I got so turned around here. The code all works, I was just using the wrong model for it. This only updates v4 and v8 lowering. The v16 and v32 lowering requires restructuring the entire check sequence. llvm-svn: 222537	2014-11-21 14:33:24 +00:00
Chandler Carruth	d2b19bc867	[x86] Teach the x86 vector shuffle lowering to detect mergable 128-bit lanes. By special casing these we can often either reduce the total number of shuffles significantly or reduce the number of (high latency on Haswell) AVX2 shuffles that potentially cross 128-bit lanes. Even when these don't actually cross lanes, they have much higher latency to support that. Doing two of them and a blend is worse than doing a single insert across the 128-bit lanes to blend and then doing a single interleaved shuffle. While this seems like a narrow case, it kept cropping up on me and the difference is huge as you can see in many of the test cases. I first hit this trying to perfectly fix the interleaving shuffle patterns used by Halide for AVX2. llvm-svn: 222533	2014-11-21 13:56:05 +00:00
Alexey Volkov	fd1731d876	[X86] For Silvermont CPU use 16-bit division instead of 64-bit for small positive numbers Differential Revision: http://reviews.llvm.org/D5938 llvm-svn: 222521	2014-11-21 11:19:34 +00:00
Quentin Colombet	a7439d4483	[X86] Do not custom lower UINT_TO_FP when the target type does not match the custom lowering. <rdar://problem/19026326> llvm-svn: 222489	2014-11-21 00:47:19 +00:00
Reid Kleckner	343c395f11	Fix more instances of -Wsentinel on Windows with s/NULL/nullptr/ Follow up to r221940, where I must not have caught em all. NFC llvm-svn: 222481	2014-11-20 23:51:47 +00:00
Saleem Abdulrasool	2f3b3f3182	X86: use the correct alloca symbol for Windows Itanium Windows itanium targets the MSVCRT, and the stack probe symbol is provided by MSVCRT. This corrects the emission of stack probes on i686-windows-itanium. llvm-svn: 222439	2014-11-20 18:01:26 +00:00
Andrea Di Biagio	1b657bfcc8	[X86] Improved lowering of v4x32 build_vector dag nodes. This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes that are known to have at least two non-zero elements. With this patch, a build_vector that performs a blend with zero is converted into a shuffle. This is done to let the shuffle legalizer expand the dag node in a optimal way. For example, if we know that a build_vector performs a blend with zero, we can try to lower it as a movq/blend instead of always selecting an insertps. This patch also improves the logic that lowers a build_vector into a insertps with zero masking. See for example the extra test cases added to test sse41.ll. Differential Revision: http://reviews.llvm.org/D6311 llvm-svn: 222375	2014-11-19 19:34:29 +00:00
Simon Pilgrim	3ac3b251a9	[X86][SSE] pslldq/psrldq byte shifts/rotation for SSE2 This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - pre-SSSE3 is only enabled on i8 and i16 vector targets where it is a more definite performance gain. I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr. Differential Revision: http://reviews.llvm.org/D5699 llvm-svn: 222340	2014-11-19 10:06:49 +00:00
Simon Pilgrim	6d675f4e35	[X86][SSE] Improve legal SHUFP and PSHUFD shuffle matching Updated X86TargetLowering::isShuffleMaskLegal to match SHUFP masks with commuted inputs and PSHUFD masks that reference the second input. As part of this I've refactored isPSHUFDMask to work in a more general manner and allow it to match against either the first or second input vector. Differential Revision: http://reviews.llvm.org/D6287 llvm-svn: 222087	2014-11-15 21:13:05 +00:00
Tim Northover	d3be12a6c7	X86: use getConstant rather than getTargetConstant behind BUILD_VECTOR. getTargetConstant should only be used when you can guarantee the instruction selected will be able to cope with the raw value. BUILD_VECTOR is rather too generic for this so we should use getConstant instead. In that case, an instruction can still consume the constant, but if it doesn't it'll be materialised through its own round of ISel. Should fix PR21352. llvm-svn: 221961	2014-11-14 01:30:14 +00:00
Aditya Nandakumar	3053155652	We can get the TLOF from the TargetMachine - so constructor no longer requires TargetLoweringObjectFile to be passed. llvm-svn: 221926	2014-11-13 21:29:21 +00:00
Elena Demikhovsky	d5e95b57e0	AVX-512: SINT_TO_FP cost model and some bugfixes Checked some corner cases, for example translation of <8 x i1> to <8 x double> llvm-svn: 221883	2014-11-13 11:46:16 +00:00
Aditya Nandakumar	a27193297f	This patch changes the ownership of TLOF from TargetLoweringBase to TargetMachine so that different subtargets could share the TLOF effectively llvm-svn: 221878	2014-11-13 09:26:31 +00:00
Chandler Carruth	fee91883f4	[x86] Teach the vector shuffle lowering to make a more nuanced decision between splitting a vector into 128-bit lanes and recombining them vs. decomposing things into single-input shuffles and a final blend. This handles a large number of cases in AVX1 where the cross-lane shuffles would be much more expensive to represent even though we end up with a fast blend at the root. Instead, we can do a better job of shuffling in a single lane and then inserting it into the other lanes. This fixes the remaining bits of Halide's regression captured in PR21281 for AVX1. However, the bug persists in AVX2 because I've made this change reasonably conservative. The cases where it makes sense in AVX2 to split into 128-bit lanes are much more rare because we can often do full permutations across all elements of the 256-bit vector. However, the particular test case in PR21281 is an example of one of the rare cases where it is always better to work in a single 128-bit lane. I'm going to try to teach the logic to detect and form the good code even in AVX2 next, but it will need to use a separate heuristic. Finally, there is one pesky regression here where we previously would craftily use vpermilps in AVX1 to shuffle both high and low halves at the same time. We no longer pull that off, and not for any really good reason. Ultimately, I think this is just another missing nuance to the selection heuristic that I'll try to add in afterward, but this change already seems strictly worth doing considering the magnitude of the improvements in common matrix math shuffle patterns. As always, please let me know if this causes a surprising regression for you. llvm-svn: 221861	2014-11-13 04:06:10 +00:00
Chandler Carruth	253dd39a9a	[x86] Don't form overly fragmented blends when splitting and re-combining shuffles because nothing was available in the wider vector type. The key observation (which I've put in the comments for future maintainers) is that at this point, no further combining is really possible. And so even though these shuffles trivially could be combined, we need to actually do that as we produce them when producing them this late in the lowering. This fixes another (huge) part of the Halide vector shuffle regressions. As it happens, this was already well covered by the tests, but I hadn't noticed how bad some of these got. The specific patterns that turn directly into unpckl/h patterns were occurring many times in common vector processing code. There are still more problems here sadly, but trying to incrementally tease them apart and it looks like this is the core of the problem in the splitting logic. There is some chance of regression here, you can see it in the test changes. Specifically, where we stop forming pshufb in some cases, it is possible that pshufb was in fact faster. Intel "says" that pshufb is slower than the instruction sequences replacing it. llvm-svn: 221852	2014-11-13 02:42:08 +00:00
Sanjay Patel	f6f7d5d1dd	Expose the number of Newton-Raphson iterations applied to the hardware's reciprocal estimate as a parameter (x86). This is a follow-on to r221706 and r221731 and discussed in more detail in PR21385. This patch also loosens the testcase checking for btver2. We know that the "1.0" will be loaded, but we can't tell exactly when, so replace the CHECK-NEXT specifiers with plain CHECKs. The CHECK-NEXT sequence relied on a quirk of post-RA-scheduling that may change independently of anything in these tests. llvm-svn: 221819	2014-11-12 21:39:01 +00:00
Cameron McInally	73a6bca32b	[AVX512] Add integer shift by immediate intrinsics. llvm-svn: 221811	2014-11-12 19:58:54 +00:00
Chandler Carruth	0c922fcec5	[x86] Start improving the matching of unpck instructions based on test cases from Halide folks. This initial step was extracted from a prototype change by Clay Wood to try and address regressions found with Halide and the new vector shuffle lowering. llvm-svn: 221779	2014-11-12 10:05:18 +00:00
Elena Demikhovsky	be8808dc3f	AVX-512: Intrinsics for ERI 3 instructions: vrcp28, vrsqrt28, vexp2, only vector forms. Intrinsics include SAE (Suppres All Exceptions) parameter. http://reviews.llvm.org/D6214 llvm-svn: 221774	2014-11-12 07:31:03 +00:00
Sanjay Patel	e2e589288f	Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385). This is a first step for generating SSE rcp instructions for reciprocal calcs when fast-math allows it. This is very similar to the rsqrt optimization enabled in D5658 ( http://reviews.llvm.org/rL220570 ). For now, be conservative and only enable this for AMD btver2 where performance improves significantly both in terms of latency and throughput. We may never enable this codegen for Intel Core* chips because the divider circuits are just too fast. On SandyBridge, divss can be as fast as 10 cycles versus the 21 cycle critical path for the rcp + mul + sub + mul + add estimate. Follow-on patches may allow configuration of the number of Newton-Raphson refinement steps, add AVX512 support, and enable the optimization for more chips. More background here: http://llvm.org/bugs/show_bug.cgi?id=21385 Differential Revision: http://reviews.llvm.org/D6175 llvm-svn: 221706	2014-11-11 20:51:00 +00:00
Dario Domizioli	e904e85faf	[X86][ELF] Fix PR20243 - leaf frame pointer bug with TLS access The ISel lowering for global TLS access in PIC mode was creating a pseudo instruction that is later expanded to a call, but the code was not setting the hasCalls flag in the MachineFrameInfo alongside the adjustsStack flag. This caused some functions to be mistakenly recognized as leaf functions, and this in turn affected the decision to eliminate the frame pointer. With the fix, hasCalls is properly set and the leaf frame pointer is correctly preserved. llvm-svn: 221695	2014-11-11 18:44:49 +00:00
Andrea Di Biagio	5fa2e15453	[X86] Add missing check for 'isINSERTPSMask' in method 'isShuffleMaskLegal'. This helps the DAGCombiner to identify more opportunities to fold shuffles. llvm-svn: 221684	2014-11-11 11:20:31 +00:00
Quentin Colombet	360460ba64	[X86] Custom lower UINT_TO_FP from v4f32 to v4i32, and for v8f32 to v8i32 if AVX2 is available. According to IACA, the new lowering has a throughput of 8 cycles instead of 13 with the previous one. Althought this lowering kicks in some SPECs benchmarks, the performance improvement was within the noise. Correctness testing has been done for the whole range of uint32_t with the following program: uint4 v = (uint4) {0,1,2,3}; uint32_t i; //Check correctness over entire range for uint4 -> float4 conversion for( i = 0; i < 1U << (32-2); i++ ) { float4 t = test(v); float4 c = correct(v); if( 0xf != _mm_movemask_ps( t == c )) { printf( "Error @ %vx: %vf vs. %vf\n", v, c, t); return -1; } v += 4; } Where "correct" is the old lowering and "test" the new one. The patch adds a test case for the two custom lowering instruction. It also modifies the vector cost model, which is why cast.ll and uitofp.ll are modified. 2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7 instructions instead of 4 (3 more constant loads). rdar://problem/18153096> llvm-svn: 221657	2014-11-11 02:23:47 +00:00
Ahmed Bougacha	b5367eeea3	[X86] Add VFMADDSUB cases for the 213->231 custom inserter. Also add tests for vfmadd/vfmsub. llvm-svn: 221488	2014-11-06 22:04:15 +00:00
Ahmed Bougacha	9152361d73	[X86] Add missing FMA3 VFMADDSUB in the emitter. Also reuse the fma4 intrinsic test to cover fma3 instructions too. llvm-svn: 221487	2014-11-06 21:58:11 +00:00
Quentin Colombet	dbe33e7aa4	[X86] Lower VSELECT into SHRUNKBLEND when we shrink the bits used into the condition to match a blend. This prevents optimizations that work on VSELECT to perform invalid transformations. Indeed, the optimized condition does not match the vector boolean content that is expected and bad things may happen. This patch yields the exact same code on the whole test-suite + specs (-O3 and -O3 -march=core-avx2), it improves one test case (vector-blend.ll) and fixes a bug reduced in vselect-avx.ll. <rdar://problem/18819506> llvm-svn: 221429	2014-11-06 02:25:03 +00:00
Andrea Di Biagio	ce46b97b48	[X86] Teach method 'isVectorClearMaskLegal' how to check for legal blend masks. This patch improves the folding of vector AND nodes into blend operations for targets that feature SSE4.1. A vector AND node where one of the operands is a constant build_vector with elements that are either zero or all-ones can be converted into a blend. This allows for example to simplify the following code: define <4 x i32> @test(<4 x i32> %A, <4 x i32> %B) { %1 = and <4 x i32> %A, <i32 0, i32 0, i32 0, i32 -1> %2 = and <4 x i32> %B, <i32 -1, i32 -1, i32 -1, i32 0> %3 = or <4 x i32> %1, %2 ret <4 x i32> %3 } Before this patch llc (-mcpu=corei7) generated: andps LCPI1_0(%rip), %xmm0, %xmm0 andps LCPI1_1(%rip), %xmm1, %xmm1 orps %xmm1, %xmm0, %xmm0 retq With this patch we generate a single 'vpblendw'. llvm-svn: 221343	2014-11-05 13:04:14 +00:00
Ahmed Bougacha	ec0f3d755f	[X86] Add debug print name for X86ISD::[US]MUL8. NFC-ish. The opcodes were added in r220516, but I forgot to add the print names. llvm-svn: 221185	2014-11-03 21:25:18 +00:00
Ahmed Bougacha	12eb558bd9	[X86] 8bit divrem: Improve codegen for AH register extraction. For 8-bit divrems where the remainder is used, we used to generate: divb %sil shrw $8, %ax movzbl %al, %eax That was to avoid an H-reg access, which is problematic mainly because it isn't possible in REX-prefixed instructions. This patch optimizes that to: divb %sil movzbl %ah, %eax To do that, we explicitly extend AH, and extract the L-subreg in the resulting register. The extension is done using the NOREX variants of MOVZX. To support signed operations, MOVSX_NOREX is also added. Further, this introduces a new SDNode type, [us]divrem_ext_hreg, which is then lowered to a sequence containing a single zext (rather than 2). Differential Revision: http://reviews.llvm.org/D6064 llvm-svn: 221176	2014-11-03 20:26:35 +00:00
Adrian Prantl	a0852d2be3	Revert "Temporarily revert r220777 to sort out build bot breakage." This reverts commit r221028. Later commits depend on this and reverting just this one causes even more bots to fail. llvm-svn: 221041	2014-11-01 03:19:45 +00:00
NAKAMURA Takumi	2ab9801509	Revert r220779, "[AVX512] Removed special case for cmp instructions in getVectorMaskingNode. Now cmp intrinsics lower as other intrinsics through VSELECT, and then VSELECT tranforms to AND in PerformSELECTCombine." Since r221028 (reverting r220777), this caused failures. llvm-svn: 221040	2014-11-01 01:36:14 +00:00
Adrian Prantl	cd4872399a	Temporarily revert r220777 to sort out build bot breakage. "[x86] Simplify vector selection if condition value type matches vselect value type and true value is all ones or false value is all zeros." llvm-svn: 221028	2014-11-01 00:26:59 +00:00
Robert Khasanov	784c3f0d6e	[AVX512] Removed special case for cmp instructions in getVectorMaskingNode. Now cmp intrinsics lower as other intrinsics through VSELECT, and then VSELECT tranforms to AND in PerformSELECTCombine. No functional change. llvm-svn: 220779	2014-10-28 16:17:14 +00:00
Robert Khasanov	4441c4d31b	[x86] Simplify vector selection if condition value type matches vselect value type and true value is all ones or false value is all zeros. This transformation worked if selector is produced by SETCC, however SETCC is needed only if we consider to swap operands. So I replaced SETCC check for this case. Added tests for vselect of <X x i1> values. llvm-svn: 220777	2014-10-28 15:59:40 +00:00
Robert Khasanov	dd09a8f320	[AVX512] Bring back vector-shuffle lowering support through broadcasts Ffter commit at rev219046 512-bit broadcasts lowering become non-optimal. Most of tests on broadcasting and embedded broadcasting were changed and they doesn’t produce efficient code. Example below is from commit changes (it’s the first test from test/CodeGen/X86/avx512-vbroadcast.ll): define <16 x i32> @_inreg16xi32(i32 %a) { ; CHECK-LABEL: _inreg16xi32: ; CHECK: ## BB#0: -; CHECK-NEXT: vpbroadcastd %edi, %zmm0 +; CHECK-NEXT: vmovd %edi, %xmm0 +; CHECK-NEXT: vpbroadcastd %xmm0, %ymm0 +; CHECK-NEXT: vinserti64x4 $1, %ymm0, %zmm0, %zmm0 ; CHECK-NEXT: retq %b = insertelement <16 x i32> undef, i32 %a, i32 0 %c = shufflevector <16 x i32> %b, <16 x i32> undef, <16 x i32> zeroinitializer ret <16 x i32> %c } Here, 256-bit broadcast was generated instead of 512-bit one. In this patch 1) I added vector-shuffle lowering through broadcasts 2) Removed asserts and branches likes because this is incorrect - assert(Subtarget->hasDQI() && "We can only lower v8i64 with AVX-512-DQI"); 3) Fixed lowering tests llvm-svn: 220774	2014-10-28 12:28:51 +00:00
Simon Pilgrim	fd080af0c5	[X86][SSE] Bitcast assertion in XFormVExtractWithShuffleIntoLoad Minor patch to fix an issue in XFormVExtractWithShuffleIntoLoad where a load is unary shuffled, then bitcast (to a type with the same number of elements) before extracting an element. An undef was created for the second shuffle operand using the original (post-bitcasted) vector type instead of the pre-bitcasted type like the rest of the shuffle node - this was then causing an assertion on the different types later on inside SelectionDAG::getVectorShuffle. Differential Revision: http://reviews.llvm.org/D5917 llvm-svn: 220592	2014-10-24 21:04:41 +00:00
Sanjay Patel	f924e11967	Allow AVX vrsqrtps generation. This is a follow-on to r220570 that allows a 256-bit (v8f32) version of vrsqrtps to be generated. llvm-svn: 220579	2014-10-24 17:59:18 +00:00
Sanjay Patel	957efc23bb	Use rsqrt (X86) to speed up reciprocal square root calcs This is a first step for generating SSE rsqrt instructions for reciprocal square root calcs when fast-math is allowed. For now, be conservative and only enable this for AMD btver2 where performance improves significantly - for example, 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c (if we convert the data type to single-precision float). This patch adds a two constant version of the Newton-Raphson refinement algorithm to DAGCombiner that can be selected by any target via a parameter returned by getRsqrtEstimate().. See PR20900 for more details: http://llvm.org/bugs/show_bug.cgi?id=20900 Differential Revision: http://reviews.llvm.org/D5658 llvm-svn: 220570	2014-10-24 17:02:16 +00:00
Ahmed Bougacha	5175bcf43a	[X86] Improve mul w/ overflow codegen, to MUL8+SETO. Currently, @llvm.smul.with.overflow.i8 expands to 9 instructions, where 3 are really needed. This adds X86ISD::UMUL8/SMUL8 SD nodes, and custom lowers them to MUL8/IMUL8 + SETO. i8 is a special case because there is no two/three operand variants of (I)MUL8, so the first operand and return value need to go in AL/AX. Also, we can't write patterns for these instructions: TableGen refuses patterns where output operands don't match SDNode results. In this case, instructions where the output operand is an implicitly defined register. A related special case (and FIXME) exists for MUL8 (X86InstrArith.td): // FIXME: Used for 8-bit mul, ignore result upper 8 bits. // This probably ought to be moved to a def : Pat<> if the // syntax can be accepted. [(set AL, (mul AL, GR8:$src)), (implicit EFLAGS)] Ideally, these go away with UMUL8, but we still need to improve TableGen support of implicit operands in patterns. Before this change: movsbl %sil, %eax movsbl %dil, %ecx imull %eax, %ecx movb %cl, %al sarb $7, %al movzbl %al, %eax movzbl %ch, %esi cmpl %eax, %esi setne %al After: movb %dil, %al imulb %sil seto %al Also, remove a made-redundant testcase for PR19858, and enable more FastISel ALU-overflow tests for SelectionDAG too. Differential Revision: http://reviews.llvm.org/D5809 llvm-svn: 220516	2014-10-23 21:55:31 +00:00
Matt Arsenault	7c93690be0	Add minnum / maxnum codegen llvm-svn: 220342	2014-10-21 23:01:01 +00:00
Quentin Colombet	06355199f1	[X86] Fix a bug in the lowering of the mask of VSELECT. X86 code to lower VSELECT messed a bit with the bits set in the mask of VSELECT when it knows it can be lowered into BLEND. Indeed, only the high bits need to be set for those and it optimizes those accordingly. However, when the mask is a compile time constant, the lowering will be handled by the generic optimizer and those modifications will generate bad code in the generic optimizer. This patch fixes that by preventing the optimization if the VSELECT will be handled by the generic optimizer. <rdar://problem/18675020> llvm-svn: 220242	2014-10-20 23:13:30 +00:00
Filipe Cabecinhas	9d7bd78ffa	Fix a broadcast related regression on the vector shuffle lowering. Summary: Test by Robert Lougher! Reviewers: chandlerc Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5745 llvm-svn: 219617	2014-10-13 16:16:16 +00:00
Simon Pilgrim	3c1e1e9498	Test commit access (email fix) Indentation tidyup. llvm-svn: 219577	2014-10-11 20:28:56 +00:00
Simon Pilgrim	d89591e0a1	Test commit access Fix comment typo + spelling. llvm-svn: 219572	2014-10-11 14:23:36 +00:00
Robert Khasanov	b51bb22611	[AVX512] Added intrinsics for 128-, 256- and 512-bit versions of VPCMP/VPCMPU{BWDQ} Added CMP_MASK_CC intrinsic type. Added tests for intrinsics. Patch by Sergey Lisitsyn <sergey.lisitsyn@intel.com> llvm-svn: 219316	2014-10-08 15:49:26 +00:00
Elena Demikhovsky	44bf0637d5	AVX-512-SKX: Added instruction VPMOVM2B/W/D/Q. This instruction allows to broadacst mask vector to data vector. llvm-svn: 219083	2014-10-05 14:11:08 +00:00
Chandler Carruth	acecdc0211	[x86] Fix PR21139, one of the last remaining regressions found in the new vector shuffle lowering. This is loosely based on a patch by Marius Wachtler to the PR (thanks!). I refactored it a bi to use std::count_if and a mutable array ref but the core idea was exactly right. I also added some direct testing of this case. I believe PR21137 is now the only remaining regression. llvm-svn: 219081	2014-10-05 12:07:34 +00:00
Chandler Carruth	9f4d9fa54e	[x86] Teach the new vector shuffle lowering how to lower 128-bit shuffles using AVX and AVX2 instructions. This fixes PR21138, one of the few remaining regressions impacting benchmarks from the new vector shuffle lowering. You may note that it "regresses" many of the vperm2x128 test cases -- these were actually "improved" by the naive lowering that the new shuffle lowering previously did. This regression gave me fits. I had this patch ready-to-go about an hour after flipping the switch but wasn't sure how to have the best of both worlds here and thought the correct solution might be a completely different approach to lowering these vector shuffles. I'm now convinced this is the correct lowering and the missed optimizations shown in vperm2x128 are actually due to missing target-independent DAG combines. I've even written most of the needed DAG combine and will submit it shortly, but this part is ready and should help some real-world benchmarks out. llvm-svn: 219079	2014-10-05 11:41:36 +00:00
Chandler Carruth	99627bfbff	[x86] Enable the new vector shuffle lowering by default. Update the entire regression test suite for the new shuffles. Remove most of the old testing which was devoted to the old shuffle lowering path and is no longer relevant really. Also remove a few other random tests that only really exercised shuffles and only incidently or without any interesting aspects to them. Benchmarking that I have done shows a few small regressions with this on LNT, zero measurable regressions on real, large applications, and for several benchmarks where the loop vectorizer fires in the hot path it shows 5% to 40% improvements for SSE2 and SSE3 code running on Sandy Bridge machines. Running on AMD machines shows even more dramatic improvements. When using newer ISA vector extensions the gains are much more modest, but the code is still better on the whole. There are a few regressions being tracked (PR21137, PR21138, PR21139) but by and large this is expected to be a win for x86 generated code performance. It is also more correct than the code it replaces. I have fuzz tested this extensively with ISA extensions up through AVX2 and found no crashes or miscompiles (yet...). The old lowering had a few miscompiles and crashers after a somewhat smaller amount of fuzz testing. There is one significant area where the new code path lags behind and that is in AVX-512 support. However, there was extremely little support for that already and so this isn't a significant step backwards and the new framework will probably make it easier to implement lowering that uses the full power of AVX-512's table-based shuffle+blend (IMO). Many thanks to Quentin, Andrea, Robert, and others for benchmarking assistance. Thanks to Adam and others for help with AVX-512. Thanks to Hal, Eric, and many others for answering my incessant questions about how the backend actually works. =] I will leave the old code path in the tree until the 3 PRs above are at least resolved to folks' satisfaction. Then I will rip it (and 1000s of lines of code) out. =] I don't expect this flag to stay around for very long. It may not survive next week. llvm-svn: 219046	2014-10-04 03:52:55 +00:00
Chandler Carruth	200e87c0c5	[x86] Fix a bug in the VZEXT DAG combine that I just made more powerful. It turns out this combine was always somewhat flawed -- there are cases where nested VZEXT nodes can't be combined: if their types have a mismatch that can be observed in the result. While none of these show up in currently, once I switch to the new vector shuffle lowering a few test cases actually form such nested VZEXT nodes. I've not come up with any IR pattern that I can sensible write to exercise this, but it will be covered by tests once I flip the switch. llvm-svn: 219044	2014-10-04 02:51:03 +00:00
Chandler Carruth	7e26a67ffa	[x86] Sink a generic combine of VZEXT nodes from the lowering to VZEXT nodes to the DAG combining of them. This will allow the combine to fire on both old vector shuffle lowering and the new vector shuffle lowering and generally seems like a cleaner design. I've trimmed down the code a bit and tried to make it and the surrounding combine fairly clean while moving it around. llvm-svn: 219042	2014-10-04 01:05:48 +00:00
Chandler Carruth	f3e880697a	[x86] Add a really preposterous number of patterns for matching all of the various ways in which blends can be used to do vector element insertion for lowering with the scalar math instruction forms that effectively re-blend with the high elements after performing the operation. This then allows me to bail on the element insertion lowering path when we have SSE4.1 and are going to be doing a normal blend, which in turn restores the last of the blends lost from the new vector shuffle lowering when I got it to prioritize insertion in other cases (for example when we don't have a blend instruction). Without the patterns, using blends here would have regressed sse-scalar-fp-arith.ll completely with the new vector shuffle lowering. For completeness, I've added RUN-lines with the new lowering here. This is somewhat superfluous as I'm about to flip the default, but hey, it shows that this actually significantly changed behavior. The patterns I've added are just ridiculously repetative. Suggestions on making them better very much welcome. In particular, handling the commuted form of the v2f64 patterns is somewhat obnoxious. llvm-svn: 219033	2014-10-03 22:43:17 +00:00
Chandler Carruth	1964078936	[x86] Teach the new vector shuffle lowering to aggressively form MOVSS and MOVSD nodes for single element vector inserts. This is particularly important because a number of patterns in the backend detect these patterns and leverage them to simplify things. It also fixes quite a few of the insertion bad code examples. However, it regresses a specific area: when available, blendps and blendpd are dramatically faster than movss and movsd respectively. But it doesn't really work to form the blend logic first because the blends aren't as crazy efficient when the data is coming from memory anyways, and thus will have a movss or movsd regardless. Also, doing that would block a bunch of the patterns that this is designed to hit. So my plan is to go into the patterns for lowering MOVSS and MOVSD and lower them via blends when available. However that's a pretty invasive restructuring so it will need to be a follow-up patch. I have already gone into the patterns to lower MOVSS and MOVSD from memory using MOVLPD, etc. Without that, several of the test cases I already have regress. llvm-svn: 218985	2014-10-03 13:11:13 +00:00
Chandler Carruth	4bf341de3c	[x86] Refactor the element insertion logic in the new vector shuffle lowering to handle the potential mirroring of 2-element vectors (because we can't reliably sort them one way) in the caller rather than in the insertion logic. This will simplify things considerably as more ways to fail to match the insertion are added because now we have a nice try and retry point. llvm-svn: 218980	2014-10-03 12:01:55 +00:00
Chandler Carruth	971a560cb8	[x86] Significantly improve the ability of the new vector shuffle lowering to match VZEXT_MOVL patterns. I hadn't realized that these had sufficient pattern smarts in the backend to lower zext-ing from the low element of a vector without it being a scalar_to_vector node. They do, and this is how to match a bunch of patterns for movq, movss, etc. There is a weird propensity to end up using pshufd to place the element afterward even though it means domain crossing (or rather, to use xorps+movss to zext the element rather than movq) but that's an orthogonal problem with VZEXT_MOVL that someone should probably look at. llvm-svn: 218977	2014-10-03 11:25:58 +00:00
Chandler Carruth	e91b316266	[x86] Unbreak SSE1 with the new vector shuffle lowering. We can't widen element types to form illegal vector types. I've added a special SSE1 test case here that makes sure we don't break this going forward. llvm-svn: 218974	2014-10-03 10:11:39 +00:00
Chandler Carruth	75e182b414	[x86] Teach the new vector shuffle lowering to widen floating point elements as well as integer elements in order to form simpler shuffle patterns. This is the primary reason why we were failing to match some of the 2-and-2 floating point shuffles such as PR21140. Even after fixing this we need to support some extra patterns in the backend in order to match the resulting X86ISD::UNPCKL nodes into the correct instructions. This commit should fix PR21140 and includes more comprehensive testing of insertion patterns in v4 shuffles. Not all of the added tests are beautiful. For example, we don't have clever instructions to insert-via-load in the integer domain. There are also some places where we aren't sufficiently cunning with our use of movq and movd, but that's future work. llvm-svn: 218911	2014-10-02 21:37:14 +00:00
Chandler Carruth	8a16802d46	[x86] Improve and correct how the new vector shuffle lowering was matching and lowering 64-bit insertions. The first problem was that we weren't looking through bitcasts to discover that we could lower as insertions. Once fixed, we in turn weren't looking through bitcasts to discover that we could fold a load into the lowering. Once fixed, we weren't forming a SCALAR_TO_VECTOR node around the inserted element and instead were passing a scalar to a DAG node that expected a vector. It turns out there are some patterns that will "lower" this into the correct asm, but the rest of the X86 backend is very unhappy with such antics. This should fix a few more edge case regressions I've spotted going through the regression test suite to enable the new vector shuffle lowering. llvm-svn: 218839	2014-10-01 23:14:28 +00:00
Sanjay Patel	9ebfbb969d	Lower FNEG ( FABS (x) ) -> FNABS (x) [X86 codegen] PR20578 Negative FABS of either a scalar or vector should be handled the same way on x86 with SSE/AVX: a single OR instruction of the FP operand with a constant to light up the sign bit(s). http://llvm.org/bugs/show_bug.cgi?id=20578 Differential Revision: http://reviews.llvm.org/D5201 llvm-svn: 218822	2014-10-01 21:20:06 +00:00
Eric Christopher	12f4a78581	constify TargetMachine parameter for X86TargetLowering. llvm-svn: 218804	2014-10-01 20:38:22 +00:00
Sanjay Patel	0e4a83e89c	Don't repeat function/variable name in comment. NFC. llvm-svn: 218791	2014-10-01 19:39:32 +00:00
Chandler Carruth	6c02c031b8	[x86] Fix a few more tiny patterns with the new vector shuffle lowering that keep cropping up in the regression test suite. This also addresses one of the issues raised on the mailing list with failing to form 'movsd' in as many cases as we realistically should. There will be corresponding patches forthcoming for v4f32 at least. This was a lot of fuss for a relatively small gain, but all the fuss was on my end trying different ways of holding the pieces of the x86 fragment patterns just right. Now that it works, the code is reasonably simple. In the new test cases I'm adding here, v2i64 sticks out as just plain horrible. I've not come up with any great ideas here other than that it would be nice to recognize when we're going to take a domain crossing hit and cross earlier to get the decent instructions. At least with AVX it is slightly less silly.... llvm-svn: 218756	2014-10-01 11:14:02 +00:00
Chandler Carruth	048486109b	[x86] Delete some extraneous logic from the new vector shuffle lowering. Nothing was relying on this and there are potentially some edge cases that it would not be correct under. Removing it seems better than trying to "fix" it as nothing was relying on it. llvm-svn: 218755	2014-10-01 11:13:57 +00:00
Nick Lewycky	5f75f4ddb9	Fix typo in comment from r218733 llvm-svn: 218739	2014-10-01 03:37:34 +00:00
Chandler Carruth	26cb9b8d2d	[x86] Teach the new vector shuffle lowering to be even more aggressive in exposing the scalar value to the broadcast DAG fragment so that we can catch even reloads and fold them into the broadcast. This is somewhat magical I'm afraid but seems to work. It is also what the old lowering did, and I've switched an old test to run both lowerings demonstrating that we get the same result. Unlike the old code, I'm not lowering f32 or f64 scalars through this path when we only have AVX1. The target patterns include pretty heinous code to re-cast those as shuffles when the scalar happens to not be spilled because AVX1 provides no broadcast mechanism from registers what-so-ever. This is terribly brittle. I'd much rather go through our generic lowering code to get this. If needed, we can add a peephole to get even more opportunities to broadcast-from-spill-slots that are exposed post-RA, but my suspicion is this just doesn't matter that much. llvm-svn: 218734	2014-10-01 03:19:43 +00:00
Chandler Carruth	846baf2ca1	[x86] Hoist the zext-lowering up in the v4i32 lowering routine -- it is the same speed as pshufd but we can fold loads into the pmovzx instructions. This fixes some regressions that came up in the regression test suite for the new vector shuffle lowering. llvm-svn: 218733	2014-10-01 02:25:54 +00:00
Chandler Carruth	b9d3fa1e65	[x86] Teach the new vector shuffle lowering about VBROADCAST and VPBROADCAST. This has the somewhat expected pervasive impact. I don't know why I forgot about this. Everything seems good with lots of significant improvements in the tests. llvm-svn: 218724	2014-10-01 00:41:21 +00:00
Robert Khasanov	5aa4445bde	[AVX512] Added intrinsics for 128- and 256-bit versions of VCMPEQ{BWDQ} Fixed lowering of this intrinsics in case when mask is v2i1 and v4i1. Now cmp intrinsics lower in the following way: (i8 (int_x86_avx512_mask_pcmpeq_q_128 (v2i64 %a), (v2i64 %b), (i8 %mask))) -> (i8 (bitcast (v8i1 (insert_subvector undef, (v2i1 (and (PCMPEQM %a, %b), (extract_subvector (v8i1 (bitcast %mask)), 0))), 0)))) llvm-svn: 218669	2014-09-30 11:41:54 +00:00
Robert Khasanov	a27c8e0fd9	[AVX512] Enabled intrinsics for VPCMPEQD and VPCMPEQQ. Added CMP_MASK intrinsic type llvm-svn: 218667	2014-09-30 11:19:50 +00:00
Chandler Carruth	aaf8e03d92	[x86] Revert r218588, r218589, and r218600. These patches were pursuing a flawed direction and causing miscompiles. Read on for details. Fundamentally, the premise of this patch series was to map VECTOR_SHUFFLE DAG nodes into VSELECT DAG nodes for all blends because we are going to have to lower to VSELECT nodes for some blends to trigger the instruction selection patterns of variable blend instructions. This doesn't actually work out so well. In order to match performance with the existing VECTOR_SHUFFLE lowering code, we would need to re-slice the blend in order to fit it into either the integer or floating point blends available on the ISA. When coming from VECTOR_SHUFFLE (or other vNi1 style VSELECT sources) this works well because the X86 backend ensures that these types of operands to VSELECT get sign extended into '-1' and '0' for true and false, allowing us to re-slice the bits in whatever granularity without changing semantics. However, if the VSELECT condition comes from some other source, for example code lowering vector comparisons, it will likely only have the required bit set -- the high bit. We can't blindly slice up this style of VSELECT. Reid found some code using Halide that triggers this and I'm hopeful to eventually get a test case, but I don't need it to understand why this is A Bad Idea. There is another aspect that makes this approach flawed. When in VECTOR_SHUFFLE form, we have very distilled information that represents the constant blend mask. Converting back to a VSELECT form actually can lose this information, and so I think now that it is better to treat this as VECTOR_SHUFFLE until the very last moment and only use VSELECT nodes for instruction selection purposes. My plan is to: 1) Clean up and formalize the target pre-legalization DAG combine that converts a VSELECT with a constant condition operand into a VECTOR_SHUFFLE. 2) Remove any fancy lowering from VSELECT during legalization relying entirely on the DAG combine to catch cases where we can match to an immediate-controlled blend instruction. One additional step that I'm not planning on but would be interested in others' opinions on: we could add an X86ISD::VSELECT or X86ISD::BLENDV which encodes a fully legalized VSELECT node. Then it would be easy to write isel patterns only in terms of this to ensure VECTOR_SHUFFLE legalization only ever forms the fully legalized construct and we can't cycle between it and VSELECT combining. llvm-svn: 218658	2014-09-30 02:52:28 +00:00
Chandler Carruth	6cbf43167b	[x86] Make the new vector shuffle lowering lower blends as VSELECT nodes, and rely exclusively on its logic. This removes a ton of duplication from the blend lowering and centralizes it in one place. One downside is that it requires a bunch of hacks to make this work with the current legalization framework. We have to manually speculate one aspect of legalizing VSELECT nodes to get everything to work nicely because the existing legalization framework isn't actually bottom-up. The other grossness is that we somewhat duplicate the analysis of constant blends. I'm on the fence here. If reviewers thing this would look better with VSELECT when it has constant operands dumping over tho VECTOR_SHUFFLE, we could go that way. But it would be a substantial change because currently all of the actual blend instructions are matched via patterns in the TD files based around VSELECT nodes (despite them not being perfect fits for that). Suggestions welcome, but at least this removes the rampant duplication in the backend. llvm-svn: 218600	2014-09-29 09:57:07 +00:00
Chandler Carruth	b1cc7a8542	[x86] Delete a bunch of really bad and totally unnecessary code in the X86 target-specific DAG combining that tried to convert VSELECT nodes into VECTOR_SHUFFLE nodes that it "knew" would lower into immediate-controlled blend nodes. Turns out, we have perfectly good lowering of all these VSELECT nodes, and indeed that lowering already knows how to handle lowering through BLENDI to immediate-controlled blend nodes. The code just wasn't getting used much because this thing forced the world to go through the vector shuffle lowering. Yuck. This also exposes that I was too aggressive in avoiding domain crossing in v218588 with that lowering -- when the other option is to expand into two 128-bit vectors, it is worth domain crossing. Restore that behavior now that we have nice tests covering it. The test updates here fall into two camps. One is where previously we ended up with an unsigned encoding of the blend operand and now we get a signed encoding. In most of those places there were elaborate comments explaining exactly what these operands really mean. Rather than that, just switch these tests to use the nicely decoded comments that make it obvious that the final shuffle matches. The other updates are just removing pointless domain crossing by blending integers with PBLENDW rather than BLENDPS. llvm-svn: 218589	2014-09-29 02:01:20 +00:00
Chandler Carruth	d639c7a829	[x86] Refactor all of the VSELECT-as-blend lowering code to avoid domain crossing and generally work more like the blend emission code in the new vector shuffle lowering. My goal is to have the new vector shuffle lowering just produce VSELECT nodes that are either matched here to BLENDI or are legal and matched in the .td files to specific blend instructions. That seems much cleaner as there are other ways to produce a VSELECT anyways. =] No observable functionality changed yet, mostly because this code appears to be near-dead. The behavior of this lowering routine did change though. This code being mostly dead and untestable will change with my next commit which will also point some new tests at it. llvm-svn: 218588	2014-09-29 01:32:54 +00:00
Chandler Carruth	2f9e56e527	[x86] Improve naming and comments for VSELECT lowering. No functionality changed. llvm-svn: 218586	2014-09-29 00:51:58 +00:00
Chandler Carruth	c7129276cd	[x86] Add the dispatch skeleton to the new vector shuffle lowering for AVX-512. There is no interesting logic yet. Everything ends up eventually delegating to the generic code to split the vector and shuffle the halves. Interestingly, that logic does a significantly better job of lowering all of these types than the generic vector expansion code does. Mostly, it lets most of the cases fall back to nice AVX2 code rather than all the way back to SSE code paths. Step 2 of basic AVX-512 support in the new vector shuffle lowering. Next up will be to incrementally add direct support for the basic instruction set to each type (adding tests first). llvm-svn: 218585	2014-09-29 00:37:27 +00:00
Chandler Carruth	32a3ebda14	[x86] Make the split-and-lower routine fully generic by relaxing the assertion, making the name generic, and improving the documentation. Step 1 in adding very primitive support for AVX-512. No functionality changed yet. llvm-svn: 218584	2014-09-29 00:21:49 +00:00
Chandler Carruth	24e3b69cbd	[x86] Teach the new vector shuffle lowering to fall back on AVX-512 vectors. Someone will need to build the AVX512 lowering, which should follow AVX1 and AVX2 very closely for AVX512F and AVX512BW resp. I've added a dummy test which is a port of the v8f32 and v8i32 tests from AVX and AVX2 to v8f64 and v8i64 tests for AVX512F and AVX512BW. Hopefully this is enough information for someone to implement proper lowering here. If not, I'll be happy to help, but right now the AVX-512 support isn't a priority for me. llvm-svn: 218583	2014-09-28 23:53:10 +00:00
Chandler Carruth	abe742e8fb	[x86] Fix the new vector shuffle lowering's use of VSELECT for AVX2 lowerings. This was hopelessly broken. First, the x86 backend wants '-1' to be the element value representing true in a boolean vector, and second the operand order for VSELECT is backwards from the actual x86 instructions. To make matters worse, the backend is just using '-1' as the true value to get the high bit to be set. It doesn't actually symbolically map the '-1' to anything. But on x86 this isn't quite how it works: there only the high bit is relevant. As a consequence weird non-'-1' values like 0x80 actually "work" once you flip the operands to be backwards. Anyways, thanks to Hal for helping me sort out what these should be. llvm-svn: 218582	2014-09-28 23:23:55 +00:00
Chandler Carruth	6578f9208b	[x86] Fix a really silly bug that I introduced fixing another bug in the new vector shuffle target DAG combines -- it helps to actually test for the value you want rather than just using an integer in a boolean context. Have I mentioned that I loathe implicit conversions recently? :: sigh :: llvm-svn: 218576	2014-09-28 06:11:04 +00:00
Chandler Carruth	b10c6b8e9e	[x86] Fix yet another bug in the new vector shuffle lowering's handling of widening masks. We can't widen a zeroing mask unless both elements that would be merged are either zeroed or undef. This is the only way to widen a mask if it has a zeroed element. Also clean up the code here by ordering the checks in a more logical way and by using the symoblic values for undef and zero. I'm actually torn on using the symbolic values because the existing code is littered with the assumption that -1 is undef, and moreover that entries '< 0' are the special entries. While that works with the values given to these constants, using the symbolic constants actually makes it a bit more opaque why this is the case. llvm-svn: 218575	2014-09-28 03:30:25 +00:00
Chandler Carruth	f4b9e6b9d9	[x86] Fix yet another issue with widening vector shuffle elements. I spotted this by inspection when debugging something else, so I have no test case what-so-ever, and am not even sure it is possible to realistically trigger the bug. But this is what was intended here. llvm-svn: 218565	2014-09-27 08:40:33 +00:00
Chandler Carruth	4d03be1717	[x86] Fix terrible bugs everywhere in the new vector shuffle lowering and in the target shuffle combining when trying to widen vector elements. Previously only one of these was correct, and we didn't correctly propagate zeroing target shuffle masks (which have a different sentinel value from undef in non- target shuffle masks now). This isn't just a missed optimization, this caused us to drop zeroing shuffles on the floor and miscompile code. The added test case is one example of that. There are other fixes to the test suite as a consequence of this as well as restoring the undef elements in some of the masks that were lost when I brought sanity to the actual value of the undef and zero sentinels. I've also just cleaned up some of the PSHUFD and PSHUFLW and PSHUFHW combining code, but that code really needs to go. It was a nice initial attempt, but it isn't very principled and the recursive shuffle combiner is much more powerful. llvm-svn: 218562	2014-09-27 04:42:44 +00:00
Chandler Carruth	f572f3b2c0	[x86] Fix a moderately terrifying bug in the new 128-bit shuffle logic that managed to elude all of my fuzz testing historically. =/ Something changed to allow this code path to actually be exercised and it was doing bad things. It is especially heavily exercised by the patterns that emerge when doing AVX shuffles that end up lowered through the 128-bit code path. llvm-svn: 218540	2014-09-26 20:41:45 +00:00
Chandler Carruth	acd1906446	[x86] The mnemonic is SHUFPS not SHUPFS. =[ I'm very bad at spelling sadly. llvm-svn: 218524	2014-09-26 17:27:40 +00:00
Chandler Carruth	0c9ee10d01	[x86] In the new vector shuffle lowering, when trying to do another layer of tie-breaking sorting, it really helps to check that you're in a tie first. =] Otherwise the whole thing cycles infinitely. Test case added, another one found through fuzz testing. llvm-svn: 218523	2014-09-26 17:24:26 +00:00
Chandler Carruth	5afd4c2603	[x86] Fix a large collection of bugs that crept in as I fleshed out the AVX support. New test cases included. Note that none of the existing test cases covered these buggy code paths. =/ Also, it is clear from this that SHUFPS and SHUFPD are the most bug prone shuffle instructions in x86. =[ These were all detected by fuzz-testing. (I <3 fuzz testing.) llvm-svn: 218522	2014-09-26 17:11:02 +00:00
Robin Morisset	810739d174	Lower idempotent RMWs to fence+load Summary: I originally tried doing this specifically for X86 in the backend in D5091, but it was rather brittle and generally running too late to be general. Furthermore, other targets may want to implement similar optimizations. So I reimplemented it at the IR-level, fitting it into AtomicExpandPass as it interacts with that pass (which could not be cleanly done before at the backend level). This optimization relies on a new target hook, which is only used by X86 for now, as the correctness of the optimization on other targets remains an open question. If it is found correct on other targets, it should be trivial to enable for them. Details of the optimization are discussed in D5091. Test Plan: make check-all + a new test Reviewers: jfb Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5422 llvm-svn: 218455	2014-09-25 17:27:43 +00:00
Chandler Carruth	0a6e961efd	[x86] Teach the new vector shuffle lowering to use AVX2 instructions for v4f64 and v8f32 shuffles when they are lane-crossing. We have fully general lane-crossing permutation functions in AVX2 that make this easy. Part of this also changes exactly when and how these vectors are split up when we don't have AVX2. This isn't always a win but it usually is a win, so on the balance I think its better. The primary regressions are all things that just need to be fixed anyways such as modeling when a blend can be completely accomplished via VINSERTF128, etc. Also, this highlights one of the few remaining big features: we do a really poor job of inserting elements into AVX registers efficiently. This completes almost all of the big tricks I have in mind for AVX2. The only things left that I plan to add: 1) element insertion smarts 2) palignr and other fairly specialized lowerings when they happen to apply llvm-svn: 218449	2014-09-25 11:03:55 +00:00
Chandler Carruth	e91d68c475	[x86] Teach the new vector shuffle lowering a fancier way to lower 256-bit vectors with lane-crossing. Rather than immediately decomposing to 128-bit vectors, try flipping the 256-bit vector lanes, shuffling them and blending them together. This reduces our worst case shuffle by a pretty significant margin across the board. llvm-svn: 218446	2014-09-25 10:21:15 +00:00
Chandler Carruth	02387122e0	[x86] Fix an oversight in the v8i32 path of the new vector shuffle lowering where it only used the mask of the low 128-bit lane rather than the entire mask. This allows the new lowering to correctly match the unpack patterns for v8i32 vectors. For reference, the reason that we check for the the entire mask rather than checking the repeated mask is because the repeated masks don't abide by all of the invariants of normal masks. As a consequence, it is safer to use the full mask with functions like the generic equivalence test. llvm-svn: 218442	2014-09-25 04:10:27 +00:00
Chandler Carruth	8140158cb5	[x86] Rearrange the code for v16i16 lowering a bit for clarity and to reduce the amount of checking we do here. The first realization is that only non-crossing cases between 128-bit lanes are handled by almost the entire function. It makes more sense to handle the crossing cases first. THe second is that until we actually are going to generate fancy shared lowering strategies that use the repeated semantics of the v8i16 lowering, we should waste time checking for repeated masks. It is simplest to directly test for the entire unpck masks anyways, so we gained nothing from this. This also matches the structure of v32i8 more closely. No functionality changed here. llvm-svn: 218441	2014-09-25 04:03:22 +00:00
Chandler Carruth	d8f528adb8	[x86] Implement AVX2 support for v32i8 in the new vector shuffle lowering. This completes the basic AVX2 feature support, but there are still some improvements I'd like to do to really get the last mile of performance here. llvm-svn: 218440	2014-09-25 02:52:12 +00:00
Chandler Carruth	d355369dbb	[x86] Remove the defunct X86ISD::BLENDV entry -- we use vector selects for this now. Should prevent folks from running afoul of this and not knowing why their code won't instruction select the way I just did... llvm-svn: 218436	2014-09-25 01:16:01 +00:00
Chandler Carruth	a577bc26b6	[x86] Fix the v16i16 blend logic I added in the prior commit and add the missing test cases for it. Unsurprisingly, without test cases, there were bugs here. Surprisingly, this bug wasn't caught at compile time. Yep, there is an X86ISD::BLENDV. It isn't wired to anything. Oops. I'll fix than next. llvm-svn: 218434	2014-09-25 01:13:38 +00:00
Chandler Carruth	98443d89b9	[x86] Implement v16i16 support with AVX2 in the new vector shuffle lowering. This also implements the fancy blend lowering for v16i16 using AVX2 and teaches the X86 backend to print shuffle masks for 256-bit PSHUFB and PBLENDW instructions. It also makes the mask decoding correct for PBLENDW instructions. The yaks, they are legion. Tests are updated accordingly. There are some missing tests for the VBLENDVB lowering, but I'll add those in a follow-up as this commit has accumulated enough cruft already. llvm-svn: 218430	2014-09-25 00:24:19 +00:00
Chandler Carruth	edcba62b4a	[x86] Factor out the logic to generically decombose a vector shuffle into unblended shuffles and a blend. This is the consistent fallback for the lowering paths that have fast blend operations available, and its getting quite repetitive. No functionality changed. llvm-svn: 218399	2014-09-24 18:20:09 +00:00
Chandler Carruth	9bd10e7492	[x86] Teach the new vector shuffle lowering to lower v8i32 shuffles with the native AVX2 instructions. Note that the test case is really frustrating here because VPERMD requires the mask to be in the register input and we don't produce a comment looking through that to the constant pool. I'm going to attempt to improve this in a subsequent commit, but not sure if I will succeed. llvm-svn: 218347	2014-09-24 01:24:44 +00:00
Chandler Carruth	fd11815a7d	[x86] Fix a really terrible bug in the repeated 128-bin-lane shuffle detection. It was incorrectly handling undef lanes by actually treating an undef lane in the first 128-bit lane as a numeric shuffle value. Fortunately, this almost always DTRT and disabled detecting repeated patterns. But not always. =/ This patch introduces a much more principled approach and fixes the miscompiles I spotted by inspection previously. llvm-svn: 218346	2014-09-24 01:03:57 +00:00
Chandler Carruth	df2e421845	[x86] Teach the new vector shuffle lowering to lower v4i64 vector shuffles using the AVX2 instructions. This is the first step of cutting in real AVX2 support. Note that I have spotted at least one bug in the test cases already, but I suspect it was already present and just is getting surfaced. Will investigate next. llvm-svn: 218338	2014-09-23 22:39:02 +00:00
Chandler Carruth	9a94bd6fa4	[x86] Teach the rest of the 'target shuffle' machinery about blends and add VPBLENDD to the InstPrinter's comment generation so we get nice comments everywhere. Now that we have the nice comments, I can see the bug introduced by a silly typo in the commit that enabled VPBLENDD, and have fixed it. Yay tests that are easy to inspect. llvm-svn: 218335	2014-09-23 22:14:14 +00:00
Robin Morisset	6dbbbc28b0	[X86] Make wide loads be managed by AtomicExpand Summary: AtomicExpand already had logic for expanding wide loads and stores on LL/SC architectures, and for expanding wide stores on CmpXchg architectures, but not for wide loads on CmpXchg architectures. This patch fills this hole, and makes use of this new feature in the X86 backend. Only one functionnal change: we now lose the SynchScope attribute. It is regrettable, but I have another patch that I will submit soon that will solve this for all of AtomicExpand (it seemed better to split it apart as it is a different concern). Test Plan: make check-all (lots of tests for this functionality already exist) Reviewers: jfb Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5404 llvm-svn: 218332	2014-09-23 20:59:25 +00:00
Chandler Carruth	adcfec995c	[x86] Teach the new shuffle lowering's blend functionality to use AVX2's VPBLENDD where appropriate even on 128-bit vectors. According to Agner's tables, this instruction is significantly higher throughput (can execute on any port) on Haswell chips so we should aggressively try to form it when available. Sadly, this loses our delightful shuffle comments. I'll add those back for VPBLENDD next. llvm-svn: 218322	2014-09-23 18:16:12 +00:00
Chandler Carruth	40592d2dec	[x86] Teach the vector comment parsing and printing to correctly handle undef in the shuffle mask. This shows up when we're printing comments during lowering and we still have an IR-level constant hanging around that models undef. A nice consequence of this is much prettier test cases where the undef lanes actually show up as undef rather than as a particular set of values. This also allows us to print shuffle comments in cases that use undef such as the recently added variable VPERMILPS lowering. Now those test cases have nice shuffle comments attached with their details. The shuffle lowering for PSHUFB has been augmented to use undef, and the shuffle combining has been augmented to comprehend it. llvm-svn: 218301	2014-09-23 11:15:19 +00:00
Chandler Carruth	6d5916a2d7	[x86] Teach the AVX1 path of the new vector shuffle lowering one more trick that I missed. VPERMILPS has a non-immediate memory operand mode that allows it to do asymetric shuffles in the two 128-bit lanes. Use this rather than two shuffles and a blend. However, it turns out the variable shuffle path to VPERMILPS (and VPERMILPD, although that one offers no functional differenc from the immediate operand other than variability) wasn't even plumbed through codegen. Do such plumbing so that we can reasonably emit a variable-masked VPERMILP instruction. Also plumb basic comment parsing and printing through so that the tests are reasonable. There are still a few tests which don't show the shuffle pattern. These are tests with undef lanes. I'll teach the shuffle decoding and printing to handle undef mask entries in a follow-up. I've looked at the masks and they seem reasonable. llvm-svn: 218300	2014-09-23 10:08:29 +00:00
Chandler Carruth	ed5dfff865	[x86] Rename X86ISD::VPERMILP to X86ISD::VPERMILPI (and the same for the td pattern). Currently we only model the immediate operand variation of VPERMILPS and VPERMILPD, we should make that clear in the pseudos used. Will be adding support for the variable mask variant in my next commit. llvm-svn: 218282	2014-09-22 22:29:42 +00:00
Kaelyn Takata	cecdff6512	Fix a "typo" from my previous commit. llvm-svn: 218281	2014-09-22 22:17:59 +00:00
Kaelyn Takata	ba0a1e0520	Silence unused variable warnings in the new stub functions that occur when assertions are disabled. llvm-svn: 218280	2014-09-22 22:14:13 +00:00
Chandler Carruth	252debeb0b	[x86] Stub out the integer lowering of 256-bit vectors with AVX2 support. No interesting functionality yet, but this will let me implement one vector type at a time. llvm-svn: 218277	2014-09-22 21:45:57 +00:00
Sanjay Patel	7939d7229d	Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2). We generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors. This patch should preserve all existing behavior with regular optimization levels, but also use splats whenever possible when optimizing for size on any CPU with AVX or AVX2. The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data. Differential Revision: http://reviews.llvm.org/D5347 llvm-svn: 218263	2014-09-22 18:54:01 +00:00
Pavel Chupin	be9f12102f	[x32] Fix segmented stacks support Summary: Update segmented-stacks*.ll tests with x32 target case and make corresponding changes to make them pass. Test Plan: tests updated with x32 target Reviewers: nadav, rafael, dschuff Subscribers: llvm-commits, zinovy.nis Differential Revision: http://reviews.llvm.org/D5245 llvm-svn: 218247	2014-09-22 13:11:35 +00:00
Chandler Carruth	12bbf7d922	[x86] Back out a bad choice about lowering v4i64 and pave the way for a more sane approach to AVX2 support. Fundamentally, there is no useful way to lower integer vectors in AVX. None. We always end up with a VINSERTF128 in the end, so we might as well eagerly switch to the floating point domain and do everything there. This cleans up lots of weird and unlikely to be correct differences between integer and floating point shuffles when we only have AVX1. The other nice consequence is that by doing things this way we will make it much easier to write the integer lowering routines as we won't need to duplicate the logic to check for AVX vs. AVX2 in each one -- if we actually try to lower a 256-bit vector as an integer vector, we have AVX2 and can rely on it. I think this will make the code much simpler and more comprehensible. Currently, I've disabled all support for AVX2 so that we always fall back to AVX. This keeps everything working rather than asserting. That will go away with the subsequent series of patches that provide a baseline AVX2 implementation. Please note, I'm going to implement AVX2 without access to hardware. That means I cannot correctness test this path. I will be relying on those with access to AVX2 hardware to do correctness testing and fix bugs here, but as a courtesy I'm trying to sketch out the framework for the new-style vector shuffle lowering in the context of the AVX2 ISA. llvm-svn: 218228	2014-09-22 00:32:15 +00:00
Chandler Carruth	5d45962b2c	[x86] Teach the new vector shuffle lowering how to cleverly lower single input v8f32 shuffles which are not 128-bit lane crossing but have different shuffle patterns in the low and high lanes. This removes most of the extract/insert traffic that was unnecessary and is particularly good at lowering cases where only one of the two lanes is shuffled at all. I've also added a collection of test cases with undef lanes because this lowering is somewhat more sensitive to undef lanes than others. llvm-svn: 218226	2014-09-21 23:46:13 +00:00
Chandler Carruth	215037e35d	[x86] With the stronger canonicalization of shuffles added in r218216, the new vector shuffle lowering no longer needs to check both symmetric forms of UNPCK patterns for v4f64. llvm-svn: 218217	2014-09-21 13:37:51 +00:00
Chandler Carruth	b3125c7522	[x86] Teach the new vector shuffle lowering to re-use the SHUFPS lowering when it can use a symmetric SHUFPS across both 128-bit lanes. This required making the SHUFPS lowering tolerant of other vector types, and adjusting our canonicalization to canonicalize harder. This is the last of the clever uses of symmetry I've thought of for v8f32. The rest of the tricks I'm aware of here are to work around assymetry in the mask. llvm-svn: 218216	2014-09-21 13:35:14 +00:00
Chandler Carruth	02f3554971	[x86] Refactor the logic to form SHUFPS instruction patterns to lower a generic vector shuffle mask into a helper that isn't specific to the other things that influence which choice is made or the specific types used with the instruction. No functionality changed. llvm-svn: 218215	2014-09-21 13:03:00 +00:00
Chandler Carruth	33eda72802	[x86] Teach the new vector shuffle lowering the basics about insertion of a single element into a zero vector for v4f64 and v4i64 in AVX. Ironically, there is less to see here because xor+blend is so crazy fast that we can't really beat that to zero the high 128-bit lane. llvm-svn: 218214	2014-09-21 12:49:46 +00:00
Chandler Carruth	43f5974ea0	[x86] Teach the new vector shuffle lowering how to lower to UNPCKLPS and UNPCKHPS with AVX vectors by recognizing those patterns when they are repeated for both 128-bit lanes. With this, we now generate the exact same (really nice) code for Quentin's avx_test_case.ll which was the most significant regression reported for the new shuffle lowering. In fact, I'm out of specific test cases for AVX lowering, the rest were AVX2 I think. However, there are a bunch of pretty obvious remaining things to improve with AVX... llvm-svn: 218213	2014-09-21 12:20:44 +00:00
Chandler Carruth	88404c4f9b	[x86] Begin teaching the new vector shuffle lowering among the most important bits of cleverness: to detect and lower repeated shuffle patterns between the two 128-bit lanes with a single instruction. This patch just teaches it how to lower single-input shuffles that fit this model using VPERMILPS. =] There is more that needs to happen here. llvm-svn: 218211	2014-09-21 12:01:19 +00:00
Chandler Carruth	3dccabaf35	[x86] Explicitly lower to a blend early if it is trivial to do so for v8f32 shuffles in the new vector shuffle lowering code. This is very cheap to do and makes it much more clear that anything more expensive but overlapping with this lowering should be selected afterward (for example using AVX2's VPERMPS). However, no functionality changed here as without this code we would fall through to create no-op shuffles of each input and a blend. =] llvm-svn: 218209	2014-09-21 11:40:39 +00:00
Chandler Carruth	e81bfbada9	[x86] Teach the new vector shuffle lowering of v4f64 to prefer a direct VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the reciprocal throughput on Ivy Bridge CPUs much like it does everywhere for 128-bits. There isn't a downside, so just eagerly use this instruction when it suffices. llvm-svn: 218208	2014-09-21 11:17:55 +00:00
Chandler Carruth	8d0a1b209b	[x86] Switch the blend implementation to use a MVT switch rather than awkward conditions. The readability improvement of this will be even more important as I generalize it to handle more types. No functionality changed. llvm-svn: 218205	2014-09-21 10:36:12 +00:00
Chandler Carruth	f098cee2e3	[x86] Remove some essentially lying comments from the v4f64 path of the new vector shuffle lowering. llvm-svn: 218204	2014-09-21 10:27:14 +00:00
Chandler Carruth	a746d776eb	[x86] Fix a helper to reflect that what we actually care about is 128-bit lane crossings, not 'half' crossings. This came up in code review ages ago, but I hadn't really addresesd it. Also added some documentation for the helper. No functionality changed. llvm-svn: 218203	2014-09-21 09:35:25 +00:00
Chandler Carruth	293327ddcd	[x86] Teach the new vector shuffle lowering the first step toward more actual support for complex AVX shuffling tricks. We can do independent blends of the low and high 128-bit lanes of an avx vector, so shuffle the inputs into place and then do the blend at 256 bits. This will in many cases remove one blend instruction. The next step is to permute the low and high halves in-place rather than extracting them and re-inserting them. llvm-svn: 218202	2014-09-21 09:35:22 +00:00
Chandler Carruth	a454812ac8	[x86] Teach the new vector shuffle lowering to use VPERMILPD for single-input shuffles with doubles. This allows them to fold memory operands into the shuffle, etc. This is just the analog to the v4f32 case in my prior commit. llvm-svn: 218193	2014-09-20 22:09:27 +00:00
Chandler Carruth	6f80abac4e	[x86] Teach the new vector shuffle lowering to use the AVX VPERMILPS instruction for single-vector floating point shuffles. This in turn allows the shuffles to fold a load into the instruction which is one of the common regressions hit with the new shuffle lowering. llvm-svn: 218190	2014-09-20 20:52:07 +00:00
Chandler Carruth	8c4cccd4aa	[x86] Teach the v4f32 path of the new shuffle lowering to handle the tricky case of single-element insertion into the zero lane of a zero vector. We can't just use the same pattern here as we do in every other vector type because the general insertion logic can handle insertion into the non-zero lane of the vector. However, in SSE4.1 with v4f32 vectors we have INSERTPS that is a much better choice than the generic one for such lowerings. But INSERTPS can do lots of other lowerings as well so factoring its logic into the general insertion logic doesn't work very well. We also can't just extract the core common part of the general insertion logic that is faster (forming VZEXT_MOVL synthetic nodes that lower to MOVSS when they can) because VZEXT_MOVL is often faster than a blend while INSERTPS is slower! So instead we do a restrictive condition on attempting to use the generic insertion logic to narrow it to those cases where VZEXT_MOVL won't need a shuffle afterward and thus will do better than INSERTPS. Then we try blending. Then we go back to INSERTPS. This still doesn't generate perfect code for some silly reasons that can be fixed by tweaking the td files for lowering VZEXT_MOVL to use XORPS+BLENDPS when available rather than XORPS+MOVSS when the input ends up in a register rather than a load from memory -- BLENDPSrr has twice the reciprocal throughput of MOVSSrr. Don't you love this ISA? llvm-svn: 218177	2014-09-20 04:15:22 +00:00
Chandler Carruth	87dcf09367	[x86] Refactor the code for emitting INSERTPS to reuse the zeroable mask analysis used elsewhere. This removes the last duplicate of this logic. Also simplify the code here quite a bit. No functionality changed. llvm-svn: 218176	2014-09-20 03:57:01 +00:00
Chandler Carruth	00389f3ed9	[x86] Generalize the single-element insertion lowering to work with floating point types and use it for both v2f64 and v2i64 single-element insertion lowering. This fixes the last non-AVX performance regression test case I've gotten of for the new vector shuffle lowering. There is obvious analogous lowering for v4f32 that I'll add in a follow-up patch (because with INSERTPS, v4f32 requires special treatment). After that, its AVX stuff. llvm-svn: 218175	2014-09-20 03:32:25 +00:00
Chandler Carruth	dba8444c2a	[x86] Replace some duplicated logic reasoning about whether particular vector lanes can be modeled as zero with a call to the new function that computes a bit-vector representing that information. No functionality changed here, but will allow doing more clever things with the zero-test. llvm-svn: 218174	2014-09-20 02:44:21 +00:00
Chandler Carruth	a6b7178b9d	[x86] Hoist a function up to the rest of the non-type-specific lowering helpers, and re-flow the logic to use early exit and be a bit more readable. No functionality changed. llvm-svn: 218155	2014-09-19 21:52:10 +00:00
Chandler Carruth	f85c6dfa45	[x86] Hoist the actual lowering logic into a helper function to separate it from the shuffle pattern matching logic. Also cleaned up variable names, comments, etc. No functionality changed. llvm-svn: 218152	2014-09-19 21:20:08 +00:00
Chandler Carruth	0fc0c22fa9	[x86] Fully generalize the zext lowering in the new vector shuffle lowering to support both anyext and zext and to custom lower for many different microarchitectures. Using this allows us to get exactly the right code for zext and anyext shuffles in all the vector sizes. For v16i8, the improvement is huge. The new SSE2 test case added I refused to add before this because it was sooooo muny instructions. llvm-svn: 218143	2014-09-19 20:00:32 +00:00
Chandler Carruth	8a6536d4b2	[x86] Recognize that we can use duplication to widen v16i8 shuffles due to undef lanes as well as defined widenable lanes. This dramatically improves the lowering we use for undef-shuffles in a zext-ish pattern for SSE2. llvm-svn: 218115	2014-09-19 09:45:21 +00:00
Chandler Carruth	2e275142cd	[x86] Teach the new vector shuffle lowering to also use pmovzx for v4i32 shuffles that are zext-ing. Not a lot to see here; the undef lane variant is better handled with pshufd, but this improves the actual zext pattern. llvm-svn: 218112	2014-09-19 08:37:44 +00:00
Chandler Carruth	398ba9a018	[x86] Add a dedicated lowering path for zext-compatible vector shuffles to the new vector shuffle lowering code. This allows us to emit PMOVZX variants consistently for patterns where it is a viable lowering. This instruction is both fast and allows us to fold loads into it. This only hooks the new lowering up for i16 and i8 element widths, mostly so I could manage the change to the tests. I'll add the i32 one next, although it is significantly less interesting. One thing to note is that we already had some tests for these patterns but those tests had far less horrible instructions. The problem is that those tests weren't checking the strict start and end of the instruction sequence. =[ As a consequence something changed in the lowering making us generate TERRIBLE code for these patterns in SSE2 through SSSE3. I've consolidated all of the tests and spelled out the madness that we currently emit for these shuffles. I'm going to try to figure out what has gone wrong here. llvm-svn: 218102	2014-09-19 06:07:49 +00:00
Chandler Carruth	9057fcaf82	[x86] Use PALIGNR for v4i32 and v2i64 blends when appropriate. There is no purpose in using it for single-input shuffles as pshufd is just as fast and doesn't tie the two operands. This removes a substantial amount of wrong-domain blend operations in SSSE3 mode. It also completes the usage of PALIGNR for integer shuffles and addresses one of the test cases Quentin hit with the new vector shuffle lowering. There is still the question of whether and when to use this for floating point shuffles. It is faster than shufps or shufpd but in the integer domain. I don't yet really have a good heuristic here for when to use this instruction for floating point vectors. llvm-svn: 218038	2014-09-18 09:00:25 +00:00
Chandler Carruth	867930aadf	[x86] Initial step of teaching the new vector shuffle lowering about PALIGNR. This just adds it to the v8i16 and v16i8 lowering steps where it is completely unmatched. It also introduces the logic for detecting rotation shuffle masks even in the presence of single input or blend masks and arbitrarily undef lanes. I've added fairly comprehensive tests for the matching logic in v8i16 because the tests at that size are much easier to write and manage. I've not checked the SSE2 code generated for these tests because the code is horrible. It is absolute madness. Testing it will just make the test brittle without giving any interesting improvements in the correctness confidence. llvm-svn: 218013	2014-09-18 04:11:29 +00:00
Pavel Chupin	37b65d81dd	[x32] Fix function indirect calls Summary: Zero-extend register to 64-bit for callq/jmpq. Test Plan: 3 tests added Reviewers: nadav, dschuff Subscribers: llvm-commits, zinovy.nis Differential Revision: http://reviews.llvm.org/D5355 llvm-svn: 217942	2014-09-17 07:09:23 +00:00
Robin Morisset	25c8e318e4	[X86] Use the generic AtomicExpandPass instead of X86AtomicExpandPass This required a new hook called hasLoadLinkedStoreConditional to know whether to expand atomics to LL/SC (ARM, AArch64, in a future patch Power) or to CmpXchg (X86). Apart from that, the new code in AtomicExpandPass is mostly moved from X86AtomicExpandPass. The main result of this patch is to get rid of that pass, which had lots of code duplicated with AtomicExpandPass. llvm-svn: 217928	2014-09-17 00:06:58 +00:00
Chandler Carruth	429c29d187	[x86] Remove a FIXME that doesn't make any sense. Only the lanes feeding the blend that is matched by this are "used" in any sense, and so any build_vector or other nodes feeding these will already drop other lanes. llvm-svn: 217855	2014-09-16 02:16:42 +00:00
Chandler Carruth	b1c024a2de	[x86] Cleanup an unused variable by actually using it in the non-asserts place where it was needed. llvm-svn: 217854	2014-09-16 02:14:51 +00:00
Chandler Carruth	74acb46d26	[x86] Remove the last vestiges of the BLENDI-based ADDSUB pattern matching. This design just fundamentally didn't work because ADDSUB is available prior to any legal lowerings of BLENDI nodes. Instead, we have a dedicated ADDSUB synthetic ISD node which is pattern matched trivially into the instructions. These nodes are then recognized by both the existing and a trivial new lowering combine in the backend. Removing these patterns required adding 2 missing shuffle masks to the DAG combine, without which tests would have failed. Added the masks and a helpful assert as well to catch if anything ever goes wrong here. llvm-svn: 217851	2014-09-16 00:39:08 +00:00
Chandler Carruth	f845e89425	[x86] As a follow-up to r217819, don't check for VSELECT legality now that we don't use VSELECT and directly emit an addsub synthetic node. Also remove a stale comment referencing VSELECT. The test case is updated to use 'core2' which only has SSE3, not SSE4.1, and it still passes. Previously it would not because we lacked sufficient blend support to legalize the VSELECT. llvm-svn: 217849	2014-09-16 00:24:42 +00:00
Chandler Carruth	de5f2b356b	[x86] Add the beginnings of a proper DAG combine to match ADDSUBPS and ADDSUBPD nodes out of blends of adds and subs. This allows us to actually form these instructions with SSE3 rather than only forming them when we had both SSE3 for the ADDSUB instructions and SSE4.1 for the blend instructions. ;] Kind-of important. I've adjusted the CPU requirements on one of the tests to demonstrate this kicking in nicely for an SSE3 cpu configuration. llvm-svn: 217848	2014-09-16 00:15:20 +00:00
Chandler Carruth	204ad4c613	[x86] Start fixing our emission of ADDSUBPS and ADDSUBPD instructions by introducing a synthetic X86 ISD node representing this generic operation. The relevant patterns for mapping these nodes into the concrete instructions are also added, and a gnarly bit of C++ code in the target-specific DAG combiner is replaced with simple code emitting this primitive. The next step is to generically combine blends of adds and subs into this node so that we can drop the reliance on an SSE4.1 ISD node (BLENDI) when matching an SSE3 feature (ADDSUB). llvm-svn: 217819	2014-09-15 20:09:47 +00:00
Chandler Carruth	707a2e098d	[x86] Begin emitting PBLENDW instructions for integer blend operations when SSE4.1 is available. This removes a ton of domain crossing from blend code paths that were ending up in the floating point code path. This is just the tip of the iceberg though. The real switch is for integer blend lowering to more actively rely on this instruction being available so we don't hit shufps at all any longer. =] That will come in a follow-up patch. Another place where we need better support is for using PBLENDVB when doing so avoids the need to have two complementary PSHUFB masks. llvm-svn: 217767	2014-09-15 12:40:54 +00:00
Chandler Carruth	12d4a70cbd	[x86] Teach the x86 DAG combiner to form UNPCKLPS and UNPCKHPS instructions from the relevant shuffle patterns. This is the last tweak I'm aware of to generate essentially perfect v4f32 and v2f64 shuffles with the new vector shuffle lowering up through SSE4.1. I'm sure I've missed some and it'd be nice to check since v4f32 is amenable to exhaustive exploration, but this is all of the tricks I'm aware of. With AVX there is a new trick to use the VPERMILPS instruction, that's coming up in a subsequent patch. llvm-svn: 217761	2014-09-15 11:26:25 +00:00
Chandler Carruth	41a25dd7ef	[x86] Teach the x86 DAG combiner to form MOVSLDUP and MOVSHDUP instructions when it finds an appropriate pattern. These are lovely instructions, and its a shame to not use them. =] They are fast, and can hand loads folded into their operands, etc. I've also plumbed the comment shuffle decoding through the various layers so that the test cases are printed nicely. llvm-svn: 217758	2014-09-15 11:15:23 +00:00
Chandler Carruth	35e3b545d6	[x86] Undo a flawed transform I added to form UNPCK instructions when AVX is available, and generally tidy up things surrounding UNPCK formation. Originally, I was thinking that the only advantage of PSHUFD over UNPCK instruction variants was its free copy, and otherwise we should use the shorter encoding UNPCK instructions. This isn't right though, there is a larger advantage of being able to fold a load into the operand of a PSHUFD. For UNPCK, the operand must be in a register so it can be the second input. This removes the UNPCK formation in the target-specific DAG combine for v4i32 shuffles. It also lifts the v8 and v16 cases out of the AVX-specific check as they are potentially replacing multiple instructions with a single instruction and so should always be valuable. The floating point checks are simplified accordingly. This also adjusts the formation of PSHUFD instructions to attempt to match the shuffle mask to one which would fit an UNPCK instruction variant. This was originally motivated to allow it to match the UNPCK instructions in the combiner, but clearly won't now. Eventually, we should add a MachineCombiner pass that can form UNPCK instructions post-RA when the operand is known to be in a register and thus there is no loss. llvm-svn: 217755	2014-09-15 10:35:41 +00:00
Chandler Carruth	44e64b5267	[x86] Teach the new vector shuffle lowering to use 'punpcklwd' and 'punpckhwd' instructions when suitable rather than falling back to the generic algorithm. While we could canonicalize to these patterns late in the process, that wouldn't help when the freedom to use them is only visible during initial lowering when undef lanes are well understood. This, it turns out, is very important for matching the shuffle patterns that are used to lower sign extension. Fixes a small but relevant regression in gcc-loops with the new lowering. When I changed this I noticed that several 'pshufd' lowerings became unpck variants. This is bad because it removes the ability to freely copy in the same instruction. I've adjusted the widening test to handle undef lanes correctly and now those will correctly continue to use 'pshufd' to lower. However, this caused a bunch of churn in the test cases. No functional change, just churn. Both of these changes are part of addressing a general weakness in the new lowering -- it doesn't sufficiently leverage undef lanes. I've at least a couple of patches that will help there at least in an academic sense. llvm-svn: 217752	2014-09-15 09:02:37 +00:00
Chandler Carruth	0a98790b32	[x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD. These are super simple. They even take precedence over crazy instructions like INSERTPS because they have very high throughput on modern x86 chips. I still have to teach the integer shuffle variants about this to avoid so many domain crossings. However, due to the particular instructions available, that's a touch more complex and so a separate patch. Also, the backend doesn't seem to realize it can commute blend instructions by negating the mask. That would help remove a number of copies here. Suggestions on how to do this welcome, it's an area I'm less familiar with. llvm-svn: 217744	2014-09-14 23:43:33 +00:00
Chandler Carruth	47ebd24e24	[x86] Teach the vector combiner that picks a canonical shuffle from to support transforming the forms from the new vector shuffle lowering to use 'movddup' when appropriate. A bunch of the cases where we actually form 'movddup' don't actually show up in the test results because something even later than DAG legalization maps them back to 'unpcklpd'. If this shows back up as a performance problem, I'll probably chase it down, but it is at least an encoded size loss. =/ To make this work, also always do this canonicalizing step for floating point vectors where the baseline shuffle instructions don't provide any free copies of their inputs. This also causes us to canonicalize unpck[hl]pd into mov{hl,lh}ps (resp.) which is a nice encoding space win. There is one test which is "regressed" by this: extractelement-load. There, the test case where the optimization it is testing fails, the exact instruction pattern which results is slightly different. This should probably be fixed by having the appropriate extract formed earlier in the DAG, but that would defeat the purpose of the test.... If this test case is critically important for anyone, please let me know and I'll try to work on it. The prior behavior was actually contrary to the comment in the test case and seems likely to have been an accident. llvm-svn: 217738	2014-09-14 22:41:37 +00:00
Adam Nemet	053c4e825c	[AVX512] Fix miscompile for unpack r189189 implemented AVX512 unpack by essentially performing a 256-bit unpack between the low and the high 256 bits of src1 into the low part of the destination and another unpack of the low and high 256 bits of src2 into the high part of the destination. I don't think that's how unpack works. AVX512 unpack simply has more 128-bit lanes but other than it works the same way as AVX. So in each 128-bit lane, we're always interleaving certain parts of both operands rather different parts of one of the operands. E.g. for this: __v16sf a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }; __v16sf b = { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }; __v16sf c = __builtin_shufflevector(a, b, 0, 8, 1, 9, 4, 12, 5, 13, 16, 24, 17, 25, 20, 28, 21, 29); we generated punpcklps (notice how the elements of a and b are not interleaved in the shuffle). In turn, c was set to this: 0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29 Obviously this should have just returned the mask vector of the shuffle vector. I mostly reverted this change and made sure the original AVX code worked for 512-bit vectors as well. Also updated the tests because they matched the logic from the code. llvm-svn: 217602	2014-09-11 16:51:10 +00:00
Bob Wilson	b3482af341	Set trunc store action to Expand for all X86 targets. When compiling without SSE2, isTruncStoreLegal(F64, F32) would return Legal, whereas with SSE2 it would return Expand. And since the Target doesn't seem to actually handle a truncstore for double -> float, it would just output a store of a full double in the space for a float hence overwriting other bits on the stack. Patch by Luqman Aden! llvm-svn: 217410	2014-09-09 01:13:36 +00:00
Chandler Carruth	0a8151e69a	[x86] Revert my over-eager commit in r217332. I hadn't actually run all the tests yet and these combines have somewhat surprisingly far reaching effects. llvm-svn: 217333	2014-09-07 12:37:11 +00:00
Chandler Carruth	8405e8fff9	[x86] Tweak the rules surrounding 0,0 and 1,1 v2f64 shuffles and add support for MOVDDUP which is really important for matrix multiply style operations that do lots of non-vector-aligned load and splats. The original motivation was to add support for MOVDDUP as the lack of it regresses matmul_f64_4x4 by 5% or so. However, all of the rules here were somewhat suspicious. First, we should always be using the floating point domain shuffles, regardless of how many copies we have to make as a movapd is crazy faster than the domain switching cost on some chips. (Mostly because movapd is crazy cheap.) Because SHUFPD can't do the copy-for-free trick of the PSHUF instructions, there is no need to avoid canonicalizing on UNPCK variants, so do that canonicalizing. This also ensures we have the chance to form MOVDDUP. =] Second, we assume SSE2 support when doing any vector lowering, and given that we should just use UNPCKLPD and UNPCKHPD as they can operate on registers or memory. If vectors get spilled or come from memory at all this is going to allow the load to be folded into the operation. If we want to optimize for encoding size (the only difference, and only a 2 byte difference) it should be done much later, likely after RA. llvm-svn: 217332	2014-09-07 12:02:14 +00:00
Chandler Carruth	21d27ee95b	[x86] Fix an embarressing bug in the INSERTPS formation code. The mask computation was totally wrong, but somehow it didn't really show up with llc. I've added an assert that triggers on multiple existing test cases and updated one of them to show the correct value. There appear to still be more bugs lurking around insertps's mask. =/ However, note that this only really impacts the new vector shuffle lowering. llvm-svn: 217289	2014-09-05 23:19:45 +00:00
Chandler Carruth	19cbf0e2c4	[x86] Factor out the zero vector insertion logic in the new vector shuffle lowering for integer vectors and share it from v4i32, v8i16, and v16i8 code paths. Ironically, the SSE2 v16i8 code for this is now better than the SSSE3! =] Will have to fix the SSSE3 code next to just using a single pshufb. llvm-svn: 217240	2014-09-05 10:36:31 +00:00
Chandler Carruth	2e5134f8f4	[x86] Teach the new v4i32 shuffle lowering some more tricks to recognize vzext patterns and insert-element patterns that for SSE4 have dedicated instructions. With this we can enable the experimental mode in a regression test that happens to cover some of the past set of issues. You can see that the new logic does significantly better here on the floating point cases. A follow-up to this change and the previous ones will hoist the logic into helpers so it can be shared across element type sizes as in this particular case it generalizes cleanly. llvm-svn: 217136	2014-09-04 09:26:30 +00:00
Elena Demikhovsky	228ab3d7b3	X86 Intrinsics table - changed to a static table sorted by intrinsic id. Used binary search over the tables. llvm-svn: 217131	2014-09-04 06:34:34 +00:00
Chandler Carruth	fc0db222b5	[x86] Teach the new vector shuffle lowering about the zero masking abilities of INSERTPS which are really powerful and come up in very important contexts such as forming diagonal matrices, etc. With this I ended up being able to remove the somewhat weird helper I added for INSERTPS because we can collapse the entire state to a no-op mask. Added a bunch of tests for inserting into a zero-ish vector. llvm-svn: 217117	2014-09-04 01:13:48 +00:00
Chandler Carruth	dad5400397	[x86] Teach the new vector shuffle lowering about the simplest of 'insertps' patterns. This replaces two shuffles with a single insertps in very common cases. My next patch will extend this to leverage the zeroing capabilities of insertps which will allow it to be used in a much wider set of cases. llvm-svn: 217100	2014-09-03 22:48:34 +00:00
Sanjay Patel	3f7a24e400	Refactor LowerFABS and LowerFNEG into one function (x86) (NFC) We duplicate ~30 lines of code to lower FABS and FNEG for x86, so this patch combines them into one function. No functional change intended, so no additional test cases. Test-suite behavior is unchanged. Differential Revision: http://reviews.llvm.org/D5064 llvm-svn: 216942	2014-09-02 20:24:47 +00:00
Reid Kleckner	0b2bccc3cd	CodeGen: Handle va_start in the entry block Also fix a small copy-paste bug in X86ISelLowering where Chain should have been used in place of DAG.getEntryToken(). Fixes PR20828. llvm-svn: 216929	2014-09-02 18:42:44 +00:00
Sanjay Patel	601492a3e3	Use an integer constant for FABS / FNEG (x86). This change will ease refactoring LowerFABS() and LowerFNEG() since they have a lot of overlap. Remove the creation of a floating point constant from an integer because it's going to be used for a bitwise integer op anyway. No change to codegen expected, but the verbose comment string for asm output may change from float values to hex (integer), depending on whether the constant already exists or not. Differential Revision: http://reviews.llvm.org/D5052 llvm-svn: 216889	2014-09-01 19:01:47 +00:00
Reid Kleckner	d70ab41a4f	Speculative build fix for const, gcc, and ArrayRef overloads llvm-svn: 216793	2014-08-29 22:12:08 +00:00
Reid Kleckner	dccd0cbec3	Add a const and munge some comments llvm-svn: 216781	2014-08-29 21:42:21 +00:00
Reid Kleckner	16e5541211	musttail: Forward regparms of variadic functions on x86_64 Summary: If a variadic function body contains a musttail call, then we copy all of the remaining register parameters into virtual registers in the function prologue. We track the virtual registers through the function body, and add them as additional registers to pass to the call. Because this is all done in virtual registers, the register allocator usually gives us good code. If the function does a call, however, it will have to spill and reload all argument registers (ew). Forwarding regparms on x86_32 is not implemented because most compilers don't support varargs in 32-bit with regparms. Reviewers: majnemer Subscribers: aemerson, llvm-commits Differential Revision: http://reviews.llvm.org/D5060 llvm-svn: 216780	2014-08-29 21:42:08 +00:00
Reid Kleckner	329d4a2b29	Verifier: Don't reject varargs callee cleanup functions We've rejected these kinds of functions since r28405 in 2006 because it's impossible to lower the return of a callee cleanup varargs function. However there are lots of legal ways to leave such a function without returning, such as aborting. Today we can leave a function with a musttail call to another function with the correct prototype, and everything works out. I'm removing the verifier check declaring that a normal return from such a function is UB. Reviewed By: nlewycky Differential Revision: http://reviews.llvm.org/D5059 llvm-svn: 216779	2014-08-29 21:25:28 +00:00
Robert Khasanov	a651a62340	[SKX] Enable lowering of integer CMP operations. Added new types to Legalizer. Fixed getSetCCResultType function Added lowering tests. Reviewed by Elena Demikhovsky. llvm-svn: 216717	2014-08-29 08:46:04 +00:00
Sanjay Patel	81ecbb0737	Fix a logic bug in x86 vector codegen: sext (zext (x) ) != sext (x) (PR20472). Remove a block of code from LowerSIGN_EXTEND_INREG() that was added with: http://llvm.org/viewvc/llvm-project?view=revision&revision=177421 And caused: http://llvm.org/bugs/show_bug.cgi?id=20472 (more analysis here) http://llvm.org/bugs/show_bug.cgi?id=18054 The testcases confirm that we (1) don't remove a zext op that is necessary and (2) generate a pmovz instead of punpck if SSE4.1 is available. Although pmovz is 1 byte longer, it allows folding of the load, and so saves 3 bytes overall. Differential Revision: http://reviews.llvm.org/D4909 llvm-svn: 216679	2014-08-28 18:59:22 +00:00
Chandler Carruth	c01ce6bc01	[x86] Fix whitespace and formatting around this function with clang-format, no functionality changed. llvm-svn: 216646	2014-08-28 04:00:24 +00:00
Chandler Carruth	cb07a4adf3	[x86] Hoist conditions from every single if in this routine to a single early exit. And factor the subsequent cast<> from all but one block into a single variable. No functionality changed. llvm-svn: 216645	2014-08-28 03:57:13 +00:00
Chandler Carruth	974aa336b1	[x86] Inline an SSE4 helper function for INSERT_VECTOR_ELT lowering, no functionality changed. Separating this into two functions wasn't helping. There was a decent amount of boilerplate duplicated, and some subsequent refactorings here will pull even more common code out. llvm-svn: 216644	2014-08-28 03:52:45 +00:00
Sanjay Patel	1d23bac843	typo in comment llvm-svn: 216609	2014-08-27 20:27:05 +00:00
Chandler Carruth	a5a8a9adc8	[x86] Fix a regression introduced with r213897 for 32-bit targets where we stopped efficiently lowering sextload using the SSE41 instructions for that operation. This is a consequence of a bad predicate I used thinking of the memory access needs. The code actually handles the cases where the predicate doesn't apply, and handles them much better. =] Simple fix and a test case added. Fixes PR20767. llvm-svn: 216538	2014-08-27 11:39:47 +00:00
Chandler Carruth	74ec9e19ee	[SDAG] Re-instate r215611 with a fix to a pesky X86 DAG combine. This combine is essentially combining target-specific nodes back into target independent nodes that it "knows" will be combined yet again by a target independent DAG combine into a different set of target-independent nodes that are legal (not custom though!) and thus "ok". This seems... deeply flawed. The crux of the problem is that we don't combine un-legalized shuffles that are introduced by legalizing other operations, and thus we don't see a very profitable combine opportunity. So the backend just forces the input to that combine to re-appear. However, for this to work, the conditions detected to re-form the unlegalized nodes must be exactly right. Previously, failing this would have caused poor code (if you're lucky) or a crasher when we failed to select instructions. After r215611 we would fall back into the legalizer. In some cases, this just "fixed" the crasher by produces bad code. But in the test case added it caused the legalizer and the dag combiner to iterate forever. The fix is to make the alignment checking in the x86 side of things match the alignment checking in the generic DAG combine exactly. This isn't really a satisfying or principled fix, but it at least make the code work as intended. It also highlights that it would be nice to detect the availability of under aligned loads for a given type rather than bailing on this optimization. I've left a FIXME to document this. Original commit message for r215611 which covers the rest of the chang: [SDAG] Fix a case where we would iteratively legalize a node during combining by replacing it with something else but not re-process the node afterward to remove it. In a truly remarkable stroke of bad luck, this would (in the test case attached) end up getting some other node combined into it without ever getting re-processed. By adding it back on to the worklist, in addition to deleting the dead nodes more quickly we also ensure that if it stops being dead for any reason it makes it back through the legalizer. Without this, the test case will end up failing during instruction selection due to an and node with a type we don't have an instruction pattern for. It took many million runs of the shuffle fuzz tester to find this. llvm-svn: 216537	2014-08-27 11:22:16 +00:00
Chandler Carruth	70f81a98ca	[x86] Fix a bug in r216319 where I was missing a 'break'. This actually was caught by existing tests but those tests were disabled with an XFAIL because of PR20736. While working on fixing that, I noticed the test failure, and tracked it down to this. We even have a really nice Clang warning that would have caught this but it isn't enabled in LLVM! =[ I may look at enabling it. llvm-svn: 216391	2014-08-25 18:06:11 +00:00
Elena Demikhovsky	22e735d725	X86 intrinsics table - simplifies intrinsics lowering. The tables are initialized when X86TargetLowering object is created. llvm-svn: 216345	2014-08-24 09:19:56 +00:00
Chandler Carruth	a15258b4e6	[x86] Start fixing a really subtle and terrible form of miscompile in these DAG combines. The DAG auto-CSE thing is truly terrible. Due to it, when RAUW-ing a node with its operand, you can cause its uses to CSE to itself, which then causes their uses to become your uses which causes them to be picked up by the RAUW. For nodes that are determined to be "no-ops", this is "fine". But if the RAUW is one of several steps to enact a transformation, this causes the DAG to really silently eat an discard nodes that you would never expect. It took days for me to actually pinpoint a test case triggering this and a really frustrating amount of time to even comprehend the bug because I never even thought about the ability of RAUW to iteratively consume nodes due to CSE-ing them into itself. To fix this, we have to build up a brand-new chain of operations any time we are combining across (potentially) intervening nodes. But once the logic is added to do this, another issue surfaces: CombineTo eagerly deletes the one node combined, but no others. This is... really frustrating. If deleting it makes its operands become dead, those operand nodes often won't go onto the worklist in the order you would want -- they're already on it and not near the top. That means things higher on the worklist will get combined prior to these dead nodes being GCed out of the worklist, and if the chain is long, the immediate users won't be enough to re-detect where the root of the chain is that became single-use again after deleting the dead nodes. The better way to do this is to never immediately delete nodes, and instead to just enqueue them so we can recursively delete them. The combined-from node is typically not on the worklist anyways by virtue of having been popped off.... But that in turn breaks other tests that require CombineTo to delete unused nodes. :: sigh :: Fortunately, there is a better way. This whole routine should have been returning the replacement rather than using CombineTo which is quite hacky. Switch to that, and all the pieces fall together. I suspect the same kind of miscompile is possible in the half-shuffle folding code, and potentially the recursive folding code. I'll be switching those over to a pattern more like this one for safety's sake even though I don't immediately have any test cases for them. Note that the only way I got a test case for this instance was with heavily DAG combined 256-bit shuffle sequences generated by my fuzzer. ;] llvm-svn: 216319	2014-08-23 10:25:15 +00:00
Reid Kleckner	2d9bb65b3d	ARM / x86_64 varargs: Don't save regparms in prologue without va_start There's no need to do this if the user doesn't call va_start. In the future, we're going to have thunks that forward these register parameters with musttail calls, and they won't need these spills for handling va_start. Most of the test suite changes are adding va_start calls to existing tests to keep things working. llvm-svn: 216294	2014-08-22 21:59:26 +00:00
Duncan P. N. Exon Smith	c667974b65	Revert "X86: Align the stack on word boundaries in LowerFormalArguments()" This (mostly) reverts commit r216119. Somewhere during the review Reid committed r214980 which fixed this another way, and I neglected to check that the testcase still failed before committing. I've left test/CodeGen/X86/aligned-variadic.ll around in case it adds extra coverage. llvm-svn: 216246	2014-08-21 23:36:08 +00:00
Benjamin Kramer	b791ef21d2	X86: Turn redundant if into an assertion. While there remove noop casts. llvm-svn: 216168	2014-08-21 10:31:37 +00:00
Robert Khasanov	46409eae8e	[x86] Added _addcarry_ and _subborrow_ intrinsics llvm-svn: 216164	2014-08-21 09:43:43 +00:00
Robert Khasanov	7c5a843646	[x86] Broadwell: ADOX/ADCX. Added _addcarryx_u{32\|64} intrinsics to LLVM. llvm-svn: 216162	2014-08-21 09:27:00 +00:00
Sanjay Patel	bba72c7c1e	Don't prevent a vselect of constants from becoming a single load (PR20648). Fix for PR20648 - http://llvm.org/bugs/show_bug.cgi?id=20648 This patch checks the operands of a vselect to see if all values are constants. If yes, bail out of any further attempts to create a blend or shuffle because SelectionDAGLegalize knows how to turn this kind of vselect into a single load. This already happens for machines without SSE4.1, so the added checks just send more targets down that path. Differential Revision: http://reviews.llvm.org/D4934 llvm-svn: 216121	2014-08-20 20:34:56 +00:00
Duncan P. N. Exon Smith	b18263531d	X86: Align the stack on word boundaries in LowerFormalArguments() The goal of the patch is to implement section 3.2.3 of the AMD64 ABI correctly. The controlling sentence is, "The size of each argument gets rounded up to eightbytes. Therefore the stack will always be eightbyte aligned." The equivalent sentence in the i386 ABI page 37 says, "At all times, the stack pointer should point to a word-aligned area." For both architectures, the stack pointer is not being rounded up to the nearest eightbyte or word between the last normal argument and the first variadic argument. Patch by Thomas Jablin! llvm-svn: 216119	2014-08-20 19:40:59 +00:00
Keno Fischer	d750723d29	Do not insert a tail call when returning multiple values on X86 Summary: This fixes http://llvm.org/bugs/show_bug.cgi?id=19530. The problem is that X86ISelLowering erroneously thought the third call was eligible for tail call elimination. It would have been if it's return value was actually the one returned by the calling function, but here that is not the case and additional values are being returned. Test Plan: Test case from the original bug report is included. Reviewers: rafael Reviewed By: rafael Subscribers: rafael, llvm-commits Differential Revision: http://reviews.llvm.org/D4968 llvm-svn: 216117	2014-08-20 19:00:37 +00:00
Elena Demikhovsky	34d2d76d25	AVX-512: Fixed a bug in emitting compare for MVT:i1 type. Added a test. llvm-svn: 215889	2014-08-18 11:59:06 +00:00
Elena Demikhovsky	67d9500444	Reverted last commit llvm-svn: 215828	2014-08-17 09:39:48 +00:00
Elena Demikhovsky	2bb991a0c5	Added a table for intrinsics on X86. It should remove dosens of lines in handling instrinsics (in a huge switch) and give an easy way to add new intrinsics. I did not completed to move al intrnsics to the table, I'll do this in the upcomming commits. llvm-svn: 215826	2014-08-17 09:00:20 +00:00
Chandler Carruth	5a37dd6ff4	[x86] Fix an indentation goof in a prior commit. Should have re-run clang-format. llvm-svn: 215824	2014-08-17 00:40:34 +00:00
Chandler Carruth	e020f117ce	[x86] Teach lots of the new vector shuffle lowering to use UNPCK instructions for blend operations at 128 bits. This was a serious hole in our prior blend lowering. llvm-svn: 215819	2014-08-16 09:42:15 +00:00
Reid Kleckner	a6b86bef4d	Fix the build with MSVC 2013 after new shuffle code MSVC gives this awesome diagnostic: ..\lib\Target\X86\X86ISelLowering.cpp(7085) : error C2971: 'llvm::VariadicFunction1' : template parameter 'Func' : 'isShuffleEquivalentImpl' : a local variable cannot be used as a non-type argument ..\include\llvm/ADT/VariadicFunction.h(153) : see declaration of 'llvm::VariadicFunction1' ..\lib\Target\X86\X86ISelLowering.cpp(7061) : see declaration of 'isShuffleEquivalentImpl' Using an anonymous namespace makes the problem go away. llvm-svn: 215744	2014-08-15 18:03:58 +00:00
Chandler Carruth	03f456abbe	[x86] Teach the new AVX v4f64 shuffle lowering to use UNPCK instructions where applicable for blending. llvm-svn: 215737	2014-08-15 17:42:00 +00:00
Chandler Carruth	f88612581e	[x86] Add the initial skeleton of type-based dispatch for AVX vectors in the new shuffle lowering and an implementation for v4 shuffles. This allows us to handle non-half-crossing shuffles directly for v4 shuffles, both integer and floating point. This currently misses places where we could perform the blend via UNPCK instructions, but otherwise generates equally good or better code for the test cases included to the existing vector shuffle lowering. There are a few cases that are entertainingly better. ;] llvm-svn: 215702	2014-08-15 11:01:40 +00:00
Chandler Carruth	6649f53207	[x86] Remove the duplicated code for testing whether we can widen the elements of a shuffle mask and simplify how it works. No functionality changed now that the bug that was here has been fixed. llvm-svn: 215696	2014-08-15 07:41:57 +00:00
Chandler Carruth	17fd848bfa	[x86] Fix the very broken formation of vpunpck instructions in the target-specific shuffl DAG combines. We were recognizing the paired shuffles backwards. This code needs to be replaced anyways as we have the same functionality elsewhere, but I'll do the refactoring in a follow-up, this is the minimal fix to the behavior. In addition to fixing miscompiles with the new vector shuffle lowering, it also causes the canonicalization to kick in much better, selecting the smaller encoding variants in lots of places in the new AVX path. This still isn't quite ideal as we don't need both the shufpd and the punpck instructions, but that'll get fixed in a follow-up patch. llvm-svn: 215690	2014-08-15 03:54:49 +00:00
Chandler Carruth	372c143c2f	[x86] Fix PR20540 where the x86 shuffle DAG combiner had completely broken logic for merging shuffle masks in the face of SM_SentinelZero mask operands. While these are '-1' they don't mean 'undef' the way '-1' means in the pre-legalized shuffle masks. Instead, they mean that the shuffle operation is forcibly zeroing that lane. Reflect this and explicitly handle it in a bunch of places. In one place the effect is equivalent but much more clear. In the rest it was really weirdly broken. Also, rewrite the entire merging thing to be a more directy operation with a single loop and just doing math to map the indices through the various masks. Also add a bunch of asserts to try to make in extremely clear what the different masks can possibly look like. Finally, add some comments to clarify that we're merging shuffle masks up here rather than down as we do everywhere else, and thus the logic is quite confusing. Thanks to several different people for sending test cases, and for Robert Khasanov for an initial attempt at fixing. llvm-svn: 215687	2014-08-15 02:43:18 +00:00
Adam Nemet	6fa675754a	[AVX512] Switch FMA intrinsics to the masking version This does the renaming and updates the lowering logic. Part of <rdar://problem/17688758> llvm-svn: 215664	2014-08-14 17:13:30 +00:00
Adam Nemet	9b4f08c729	[X86] Break out logic to map FMA Intrinsic number to Opcode No functional change. Will be used to lower AVX512 masking FMA intrinsics. llvm-svn: 215663	2014-08-14 17:13:27 +00:00
Adam Nemet	6f31063dcf	[AVX512] Break out the logic to lower masking intrinsics No functional change. This will be used by the FMA intrinsic lowering as well and hopefully many more. llvm-svn: 215661	2014-08-14 17:13:24 +00:00
Chandler Carruth	a8311b3681	[x86] Begin stubbing out the AVX support in the new vector shuffle lowering scheme. Currently, this just directly bails to the fallback path of splitting the 256-bit vector into two 128-bit vectors, operating there, and then joining the results back together. While the results are far from perfect, they are shockingly good for what we're doing here. I'll be layering the rest of the functionality on top of this piece by piece and updating tests as I go. Note that 256-bit vectors in this mode are still somewhat WIP. While I think the code paths that I'm adding here are clean and good-to-go, there are still a lot of 128-bit assumptions that I'll need to stomp out as I march through the functional spread here. llvm-svn: 215637	2014-08-14 12:13:59 +00:00
Quentin Colombet	57fb040beb	[X86] Fix the value of the low mask for the lowering of MUL_LOHI for v4i32. Found by code inspection. llvm-svn: 215604	2014-08-13 23:49:24 +00:00
Aaron Ballman	1013b6b60c	Silence a -Wparenthesis warning with these asserts. NFC. llvm-svn: 215537	2014-08-13 10:49:07 +00:00
Elena Demikhovsky	51bbd011c3	AVX-512: Fixed a bug in shufflevector lowering. PALIGNR instruction does not exist in AVX-512F set. Added a test. llvm-svn: 215526	2014-08-13 07:58:43 +00:00
Chandler Carruth	b7eda21bb0	[x86] Rewrite a core part of the new vector shuffle lowering to handle one pesky test case correctly. This test case caused the old code to infloop occilating between solving the low-half and the high-half. The 'side balancing' part of single-input v8 shuffle lowering didn't handle the one pattern which can cause it to occilate. Fortunately the fuzz testing found this case. Unfortuately it was terrible to handle. I'm really sorry for the amount and density of the code here, I'd love suggestions on how to simplify it. I feel like there must be a simpler form here, but after a lot of days I've not found it. This is the only one I've found that even works. I've added the one pesky test case along with some nice comments explaining the core problem that we have to solve here. So far this has survived approximately 32k test cases. More strenuous fuzzing commencing. llvm-svn: 215519	2014-08-13 01:25:45 +00:00
Adam Nemet	cee9d0a460	[AVX512] Handle valign masking intrinsic via C++ lowering I think that this will scale better in most cases than adding a Pat<> for each mapping from the intrinsic DAG to the intruction (i.e. rri, rrik, rrikz). We can just lower to the SDNode and have the resulting DAG be matches by the DAG patterns. Alternatively (long term), we could keep the Pat<>s but generate them via the new AVX512_masking multiclass. The difficulty is that in order to formulate that we would have to concatenate DAGs. Currently this is only supported if the operators of the input DAGs are identical. llvm-svn: 215473	2014-08-12 21:13:12 +00:00
Sanjay Patel	8687f320e0	fixed typos llvm-svn: 215451	2014-08-12 16:00:06 +00:00
Hans Wennborg	21b20cb11a	Increase the size of these SmallVectors in X86ISelLowering.cpp. In a Clang bootstrap, their sizes were always 12, 16 and 16, respectively. llvm-svn: 215336	2014-08-11 02:21:22 +00:00
Sanjay Patel	b48d1cfa53	fixed typos llvm-svn: 215299	2014-08-09 22:23:02 +00:00
Patrik Hagglund	b0e86ec814	[pr19635] Revert most of r170537, and add new testcase. Patch provided by Andrey Kuharev. Sorry, r170537 was obviously wrong. llvm-svn: 215190	2014-08-08 08:21:19 +00:00
Alexander Kornienko	7151ad7762	Insert parens to avoid a warning: suggest parentheses around arithmetic in operand of '^' [-Wparentheses] llvm-svn: 215101	2014-08-07 12:09:34 +00:00
Chandler Carruth	4e8fcbd3fd	[x86] Fix another miscompile found through fuzz testing the new vector shuffle lowering. This is closely related to the previous one. Here we failed to use the source offset when swapping in the other case -- where we end up swapping the final shuffle. The cause of this bug is a bit different: I simply wasn't thinking about the fact that this mask is actually a slice of a wide mask and thus has numbers that need SourceOffset applied. Simple fix. Would be even more simple with an algorithm-y thing to use here, but correctness first. =] llvm-svn: 215095	2014-08-07 10:37:35 +00:00
Chandler Carruth	e206385e99	[x86] Fix another miscompile in the new vector shuffle lowering found via the fuzz tester. Here I missed an offset when round-tripping a value through a shuffle mask. I got it right 2 lines below. See a problem? I do. ;] I'll probably be adding a little "swap" algorithm which accepts a range and two values and swaps those values where they occur in the range. Don't really have a name for it, let me know if you do. llvm-svn: 215094	2014-08-07 10:14:27 +00:00
Chandler Carruth	78494364d1	[x86] Fix another miscompile in the new vector shuffle lowering found through the new fuzzer. This one is great: bad operator precedence led the modulus to happen at the wrong point. All the asserts didn't fire because there were usually the right values past the end of the 4 element region we were looking at. Probably could have gotten a crash here with ASan + fuzzing, but the correctness tests pinpointed this really nicely. llvm-svn: 215092	2014-08-07 09:45:02 +00:00
Pavel Chupin	f55eb450e5	[x32] Use ebp/esp as frame and stack pointer Summary: Since pointers are 32-bit on x32 we can use ebp and esp as frame and stack pointer. Some operations like PUSH/POP and CFI_INSTRUCTION still require 64-bit register, so using 64-bit MachineFramePtr where required. X86_64 NaCl uses 64-bit frame/stack pointers, however it's been found that both isTarget64BitLP64 and isTarget64BitILP32 are true for NaCl. Addressing this issue here as well by making isTarget64BitLP64 false. Also mark hasReservedSpillSlot unreachable on X86. See inlined comments. Test Plan: Add one new simple test and upgrade 2 existing with x32 target case. Reviewers: nadav, dschuff Subscribers: llvm-commits, zinovy.nis Differential Revision: http://reviews.llvm.org/D4617 llvm-svn: 215091	2014-08-07 09:41:19 +00:00
Chandler Carruth	27046758de	[x86] Fix a miscompile in the new shuffle lowering found through the new fuzz testing. The function which tested for adjacency did what it said on the tin, but when I called it, I wanted it to do something more thorough: I wanted to know if the pairs of shuffle elements were adjacent and started at 0 mod 2. In one place I had the decency to try to test for this, but in the other it was completely skipped, miscompiling this test case. Fix this by making the helper actually do what I wanted it to do everywhere I called it (and removing the now redundant code in one place). I really dislike the name "canWidenShuffleElements" for this predicate. If anyone can come up with a better name, please let me know. The other name I thought about was "canWidenShuffleMask" but is it really widening the mask to reduce the number of lanes shuffled? I don't know. Naming things is hard. llvm-svn: 215089	2014-08-07 08:11:31 +00:00
Eric Christopher	b5217507c7	Remove the target machine from CCState. Previously it was only used to get the subtarget and that's accessible from the MachineFunction now. This helps clear the way for smaller changes where we getting a subtarget will require passing in a MachineFunction/Function as well. llvm-svn: 214988	2014-08-06 18:45:26 +00:00
Chandler Carruth	c3927cd8c9	[x86] Fix two independent miscompiles in the process of getting the same test case to actually generate correct code. The primary miscompile fixed here is that we weren't correctly handling in-place elements in one half of a single-input v8i16 shuffle when moving a dword of elements from that half to the other half. Some times, we would clobber the in-place elements in forming the dword to move across halves. The fix to this involves forcibly marking the in-place inputs even when there is no need to gather them into a dword, and to much more carefully re-arrange the elements when grouping them into a dword to move across halves. With these two changes we would generate correct shuffles for the test case, but found another miscompile. There are also some random perturbations of the generated shuffle pattern in SSE2. It looks like a wash; more instructions in some cases fewer in others. The second miscompile would corrupt the results into nonsense. This is a buggy pattern in one of the added DAG combines. Mapping elements through a PSHUFD when pairing redundant half-shuffles is much harder than this code makes it out to be -- it requires reasoning about all of where the input is used in the PSHUFD, not just one part of where it is used. Plus, we can't combine a half shuffle into a PSHUFD but the code didn't guard against it. I think this was just a bad idea and I've just removed that aspect of the combine. No tests regress as a consequence so seems OK. llvm-svn: 214954	2014-08-06 10:16:36 +00:00
Chandler Carruth	8f23ba26d2	[x86] Switch to a formulation of a for loop that is much more obviously not corrupting the mask by mutating it more times than intended. No functionality changed (the results were non-overlapping so the old version "worked" but was non-obvious). llvm-svn: 214953	2014-08-06 10:16:33 +00:00
JF Bastien	ac8b66b32c	Fix typos in comments and doc Committing http://reviews.llvm.org/D4798 for Robin Morisset (morisset@google.com) llvm-svn: 214934	2014-08-05 23:27:34 +00:00
Chandler Carruth	a746239be3	[x86] Fix a crasher due to shuffles which cancel each other out and add a test case. We also miscompile this test case which is showing a serious flaw in the single-input v8i16 shuffle code. I've left the specific instruction checks FIXME-ed out until I can address the bug in the single-input code, but I wanted to separate out a significant functionality change to produce correct code from a very simple and targeted crasher fix. The miscompile problem stems from keeping track of inputs by value rather than by index. As a consequence of doing this, we can't reliably update those inputs because they might swap and we can't detect this without copying the mask. The blend code now uses indices for the input lists and this seems strictly better. It also should make it easier to sort things and do other cleanups. I think the time has come to simplify The Great Lambda here. llvm-svn: 214914	2014-08-05 18:45:49 +00:00
Adam Nemet	c04f3f9f73	[X86] Improve comments for r214888 A rebase somehow ate my comments. This restores them. llvm-svn: 214903	2014-08-05 17:58:49 +00:00
Adam Nemet	164b07fbfe	[X86] Add lowering to VALIGN This was currently part of lowering to PALIGNR with some special-casing to make interlane shifting work. Since AVX512F has interlane alignr (valignd/q) and AVX512BW has vpalignr we need to support both of these at the same time, e.g. for SKX. This patch breaks out the common code and then add support to check both of these lowering options from LowerVECTOR_SHUFFLE. I also added some FIXMEs where I think the AVX512BW and AVX512VL additions should probably go. llvm-svn: 214888	2014-08-05 17:22:59 +00:00
Adam Nemet	2f10cc699d	[X86] Separate DAG node for valign and palignr They have different semantics (valign is interlane while palingr is intralane) and palingr is still needed even in the AVX512 context. According to the latest spec AVX512BW provides these. llvm-svn: 214887	2014-08-05 17:22:55 +00:00
Chandler Carruth	183771bd8e	[x86] Reformat some code I moved around in a prior commit but left poorly formatted. Sorry about that. llvm-svn: 214853	2014-08-05 10:35:30 +00:00
Chandler Carruth	947cef191d	[x86] Fix a crash and wrong-code bug in the new vector lowering all found by a single test reduced out of a failure on llvm-stress. The start of the problem (and the crash) came when we tried to use a find of a non-used slot in the move-to half of the move-mask as the target for two bad-half inputs. While if lucky this will be the first of a pair of slots which we can place the bad-half inputs into, it isn't actually guaranteed. This really isn't surprising, not sure what I was thinking. The correct way to find the two unused slots is to look for one of the used slots. We know it isn't that pair, and we can use some modular arithmetic to find the other pair by masking off the odd bit and adding 2 modulo 4. With this, we reliably found a viable pair of slots for the bad-half inputs. Sadly, that wasn't enough. We also had a wrong code bug that surfaced when I reduced the test case for this where we would use the same slot twice for the two bad inputs. This is because both of the bad inputs could be in odd slots originally and thus the mod-2 mapping would actually be the same. The whole point of the weird indexing into the pair of empty slots was to try to leverage when the end result needed the two bad-half inputs to be paired in a dword and pre-pair them in the correct orrientation. This is less important with the powerful combining we're now doing, and also easier and more reliable to achieve be noting that we add the bad-half inputs in order. Thus, if they are in a dword pair, the low part of that will be the first input in the sequence. Always putting that in the low element will just do the right thing in addition to computing the correct result. Test case added. =] llvm-svn: 214849	2014-08-05 08:19:21 +00:00
Eric Christopher	fc6de428c8	Have MachineFunction cache a pointer to the subtarget to make lookups shorter/easier and have the DAG use that to do the same lookup. This can be used in the future for TargetMachine based caching lookups from the MachineFunction easily. Update the MIPS subtarget switching machinery to update this pointer at the same time it runs. llvm-svn: 214838	2014-08-05 02:39:49 +00:00
Eric Christopher	d913448b38	Remove the TargetMachine forwards for TargetSubtargetInfo based information and update all callers. No functional change. llvm-svn: 214781	2014-08-04 21:25:23 +00:00
Chandler Carruth	0e2ddb2790	[x86] Just unilaterally prefer SSSE3-style PSHUFB lowerings over clever use of PACKUS. It's cleaner that way. I looked at implementing clever combine-based folding of PACKUS chains into PSHUFB but it is quite hard and doesn't seem likely to be worth it. The most annoying part would be detecting that the correct masking had been done to use PACKUS-style instructions as a blend operation rather than there being any saturating as is indicated by its name. We generate really nice code for what few test cases I've come up with that aren't completely contrived for this by just directly prefering PSHUFB and so let's go with that strategy for now. =] llvm-svn: 214707	2014-08-04 10:17:35 +00:00
Chandler Carruth	06e6f1cae2	[x86] Implement more aggressive use of PACKUS chains for lowering common patterns of v16i8 shuffles. This implements one of the more important FIXMEs for the SSE2 support in the new shuffle lowering. We now generate the optimal shuffle sequence for truncate-derived shuffles which show up essentially everywhere. Unfortunately, this exposes a weakness in other parts of the shuffle logic -- we can no longer form PSHUFB here. I'll add the necessary support for that and other things in a subsequent commit. llvm-svn: 214702	2014-08-04 09:40:02 +00:00
Chandler Carruth	37a18821cd	[x86] Handle single input shuffles in the SSSE3 case more intelligently. I spent some time looking into a better or more principled way to handle this. For example, by detecting arbitrary "unneeded" ORs... But really, there wasn't any point. We just shouldn't build blatantly wrong code so late in the pipeline rather than adding more stages and logic later on to fix it. Avoiding this is just too simple. llvm-svn: 214680	2014-08-04 01:14:24 +00:00
Chandler Carruth	16c13cad35	[x86] Remove the FIXME that was implemented in r214628. Managed to forget to update the comment here... =/ llvm-svn: 214630	2014-08-02 11:34:23 +00:00
Chandler Carruth	4c57955fe3	[x86] Largely complete the use of PSHUFB in the new vector shuffle lowering with a small addition to it and adding PSHUFB combining. There is one obvious place in the new vector shuffle lowering where we should form PSHUFBs directly: when without them we will unpack a vector of i8s across two different registers and do a potentially 4-way blend as i16s only to re-pack them into i8s afterward. This is the crazy expensive fallback path for i8 shuffles and we can just directly use pshufb here as it will always be cheaper (the unpack and pack are two instructions so even a single shuffle between them hits our three instruction limit for forming PSHUFB). However, this doesn't generate very good code in many cases, and it leaves a bunch of common patterns not using PSHUFB. So this patch also adds support for extracting a shuffle mask from PSHUFB in the X86 lowering code, and uses it to handle PSHUFBs in the recursive shuffle combining. This allows us to combine through them, combine multiple ones together, and generally produce sufficiently high quality code. Extracting the PSHUFB mask is annoyingly complex because it could be either pre-legalization or post-legalization. At least this doesn't have to deal with re-materialized constants. =] I've added decode routines to handle the different patterns that show up at this level and we dispatch through them as appropriate. The two primary test cases are updated. For the v16 test case there is still a lot of room for improvement. Since I was going through it systematically I left behind a bunch of FIXME lines that I'm hoping to turn into ALL lines by the end of this. llvm-svn: 214628	2014-08-02 10:39:15 +00:00
Chandler Carruth	5219d4eff6	[x86] Fix a few typos in my comments spotted in passing. llvm-svn: 214626	2014-08-02 10:29:34 +00:00
Chandler Carruth	34f9a987e9	[x86] Teach the target shuffle mask extraction to recognize unary forms of normally binary shuffle instructions like PUNPCKL and MOVLHPS. This detects cases where a single register is used for both operands making the shuffle behave in a unary way. We detect this and adjust the mask to use the unary form which allows the existing DAG combine for shuffle instructions to actually work at all. As a consequence, this uncovered a number of obvious bugs in the existing DAG combine which are fixed. It also now canonicalizes several shuffles even with the existing lowering. These typically are trying to match the shuffle to the domain of the input where before we only really modeled them with the floating point variants. All of the cases which change to an integer shuffle here have something in the integer domain, so there are no more or fewer domain crosses here AFAICT. Technically, it might be better to go from a GPR directly to the floating point domain, but detecting floating point outputs despite integer inputs is a lot more code and seems unlikely to be worthwhile in practice. If folks are seeing domain-crossing regressions here though, let me know and I can hack something up to fix it. Also as a consequence, a bunch of missed opportunities to form pshufb now can be formed. Notably, splats of i8s now form pshufb. Interestingly, this improves the existing splat lowering too. We go from 3 instructions to 1. Yes, we may tie up a register, but it seems very likely to be worth it, especially if splatting the 0th byte (the common case) as then we can use a zeroed register as the mask. llvm-svn: 214625	2014-08-02 10:27:38 +00:00
Akira Hatanaka	3516669a50	[X86] Simplify X87 stackifier pass. Stop using ST registers for function returns and inline-asm instructions and use FP registers instead. This allows removing a large amount of code in the stackifier pass that was needed to track register liveness and handle copies between ST and FP registers and function calls returning floating point values. It also fixes a bug which manifests when an ST register defined by an inline-asm instruction was live across another inline-asm instruction, as shown in the following sequence of machine instructions: 1. INLINEASM <es:frndint> $0:[regdef], %ST0<imp-def,tied5> 2. INLINEASM <es:fldcw $0> 3. %FP0<def> = COPY %ST0 <rdar://problem/16952634> llvm-svn: 214580	2014-08-01 22:19:41 +00:00
Louis Gerbarg	67474e3755	Make sure no loads resulting from load->switch DAGCombine are marked invariant Currently when DAGCombine converts loads feeding a switch into a switch of addresses feeding a load the new load inherits the isInvariant flag of the left side. This is incorrect since invariant loads can be reordered in cases where it is illegal to reoarder normal loads. This patch adds an isInvariant parameter to getExtLoad() and updates all call sites to pass in the data if they have it or false if they don't. It also changes the DAGCombine to use that data to make the right decision when creating the new load. llvm-svn: 214449	2014-07-31 21:45:05 +00:00
Matt Arsenault	6f2a526101	Add alignment value to allowsUnalignedMemoryAccess Rename to allowsMisalignedMemoryAccess. On R600, 8 and 16 byte accesses are mostly OK with 4-byte alignment, and don't need to be split into multiple accesses. Vector loads with an alignment of the element type are not uncommon in OpenCL code. llvm-svn: 214055	2014-07-27 17:46:40 +00:00
Chandler Carruth	64a7c828cb	[x86] Sink a variable only used by asserts into the asserts. Should fix some -Werror bots, sorry for the noise. llvm-svn: 214043	2014-07-27 01:45:49 +00:00
Chandler Carruth	80c5bfd843	[x86] Add a much more powerful framework for combining x86 shuffle instructions in the legalized DAG, and leverage it to combine long sequences of instructions to PSHUFB. Eventually, the other x86-instruction-specific shuffle combines will probably all be driven out of this routine. But the real motivation is to detect after we have fully legalized and optimized a shuffle to the minimal number of x86 instructions whether it is profitable to replace the chain with a fully generic PSHUFB instruction even though doing so requires either a load from a constant pool or tying up a register with the mask. While the Intel manuals claim it should be used when it replaces 5 or more instructions (!!!!) my experience is that it is actually very fast on modern chips, and so I've gon with a much more aggressive model of replacing any sequence of 3 or more instructions. I've also taught it to do some basic canonicalization to special-purpose instructions which have smaller encodings than their generic counterparts. There are still quite a few FIXMEs here, and I've not yet implemented support for lowering blends with PSHUFB (where its power really shines due to being able to zero out lanes), but this starts implementing real PSHUFB support even when using the new, fancy shuffle lowering. =] llvm-svn: 214042	2014-07-27 01:15:58 +00:00
Chandler Carruth	5896698e2e	[x86] Fix PR20355 (for real). There are many layers to this bug. The tale starts with r212808 which attempted to fix inversion of the low and high bits when lowering MUL_LOHI. Sadly, that commit did not include any positive test cases, and just removed some operations from a test case where the actual logic being changed isn't fully visible from the test. What this commit did was two things. First, it reversed the low and high results in the formation of the MERGE_VALUES node for the multiple results. This is entirely correct. Second it changed the shuffles for extracting the low and high components from the i64 results of the multiplies to extract them assuming a big-endian-style encoding of the multiply results. This second change is wrong. There is no big-endian encoding in x86, the results of the multiplies are normal v2i64s: when cast to v4i32, the low i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3. However, the first change wasn't enough to actually fix the bug, which is (I assume) why the second change was also made. There was another bug in the MERGE_VALUES formation: we weren't using a VTList, and so were getting a single result node! When grabbing the second result from the node, we got... well.. colud be anything. I think this appeared to invert things, but had to be causing other problems as well. Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should have been fine, right? NOOOPE! Because the core bug was never addressed, the test in vector-idiv failed when I fixed the MERGE_VALUES node. Because there are essentially no docs for this node, I had to guess at how to fix it and tried swapping the operands, restoring the order of the original code before r212808. While this "fixed" the test case (in that we produced the write instructions) we were still extracting the wrong elements of the i64s, and thus PR20355 was still broken. This commit essentially reverts the big-endian-style extraction part of r212808 and goes back to the original masks which were correct. Now that the MERGE_VALUES node formation is also correct, everything works. I've also included a more detailed test from PR20355 to make sure this stays fixed. llvm-svn: 214011	2014-07-26 03:46:57 +00:00
Chandler Carruth	f6406ac5d6	[x86] Revert r214007: Fix PR20355 ... The clever way to implement signed multiplication with unsigned is already implemented and tested and working correctly. The bug is somewhere else. Re-investigating. This will teach me to not scroll far enough to read the code that did what I thought needed to be done. llvm-svn: 214009	2014-07-26 02:14:54 +00:00
Chandler Carruth	1bf4d19172	[x86] Fix PR20355 (and dups) by not using unsigned multiplication when signed multiplication is requested. While there is not a difference in the low half of the result, the high half (used specifically to implement the signed division by these constants) certainly is used. The test case I've nuked was actively asserting wrong code. There is a delightful solution to doing signed multiplication even when we don't have it that Richard Smith has crafted, but I'll add the machinery back and implement that in a follow-up patch. This at least restores correctness. llvm-svn: 214007	2014-07-26 01:52:13 +00:00
Akira Hatanaka	e5b6e0d231	[stack protector] Fix a potential security bug in stack protector where the address of the stack guard was being spilled to the stack. Previously the address of the stack guard would get spilled to the stack if it was impossible to keep it in a register. This patch introduces a new target independent node and pseudo instruction which gets expanded post-RA to a sequence of instructions that load the stack guard value. Register allocator can now just remat the value when it can't keep it in a register. <rdar://problem/12475629> llvm-svn: 213967	2014-07-25 19:31:34 +00:00
Chandler Carruth	3de980d2ff	[SDAG] Enable the new assert for out-of-range result numbers in SDValues, fixing the two bugs left in the regression suite. The key for both of these was the use a single value type rather than a VTList which caused an unintentionally single-result merge-value node. Fix this by getting the appropriate VTList in place. Doing this exposed that the comments in x86's code abouth how MUL_LOHI operands are handle is wrong. The bug with the use of out-of-range result numbers was hiding the bug about the order of operands here (as best i can tell). There are more places where the code appears to get this backwards still... llvm-svn: 213931	2014-07-25 09:19:23 +00:00
Chandler Carruth	80b869461e	[x86] Make vector legalization of extloads work more like the "normal" vector operation legalization with support for custom target lowering and fallback to expand when it fails, and use this to implement sext and anyext load lowering for x86 in a more principled way. Previously, the x86 backend relied on a target DAG combine to "combine away" sextload and extload nodes prior to legalization, or would expand them during legalization with terrible code. This is particularly problematic because the DAG combine relies on running over non-canonical DAG nodes at just the right time to match several common and important patterns. It used a combine rather than lowering because we didn't have good lowering support, and to expose some tricks being employed to more combine phases. With this change it becomes a proper lowering operation, the backend marks that it can lower these nodes, and I've added support for handling the canonical forms that don't have direct legal representations such as sextload of a v4i8 -> v4i64 on AVX1. With this change, our test cases for this behavior continue to pass even after the DAG combiner beigns running more systematically over every node. There is some noise caused by this in the test suite where we actually use vector extends instead of subregister extraction. This doesn't really seem like the right thing to do, but is unlikely to be a critical regression. We do regress in one case where by lowering to the target-specific patterns early we were able to combine away extraneous legal math nodes. However, this regression is completely addressed by switching to a widening based legalization which is what I'm working toward anyways, so I've just switched the test to that mode. Differential Revision: http://reviews.llvm.org/D4654 llvm-svn: 213897	2014-07-24 22:09:56 +00:00
Reid Kleckner	9a412d13c1	Replace an assertion with a fatal error Frontends are responsible for putting inalloca on parameters that would be passed in memory and not registers. llvm-svn: 213891	2014-07-24 19:53:33 +00:00
Saleem Abdulrasool	34610e33ae	X86: silence sign comparison warning GCC 4.8 detected a signed compare [-Wsign-compare]. Add a cast for the destination index. Add an assert to catch a potential overflow however unlikely it may be. llvm-svn: 213878	2014-07-24 17:12:06 +00:00
Filipe Cabecinhas	933cccf3fa	Fixed PR20411 - bug in getINSERTPS() When we had a vector_shuffle where we had an input from each vector, we could miscompile it because we were assuming the input from V2 wouldn't be moved from where it was on the vector. Added a test case. llvm-svn: 213826	2014-07-24 01:28:21 +00:00
Jim Grosbach	724e438c62	[X86,AArch64] Extend vcmp w/ unary op combine to work w/ more constants. The transform to constant fold unary operations with an AND across a vector comparison applies when the constant is not a splat of a scalar as well. llvm-svn: 213800	2014-07-23 20:41:43 +00:00
Jim Grosbach	8f6f0858ec	X86: restrict combine to when type sizes are safe. The folding of unary operations through a vector compare and mask operation is only safe if the unary operation result is of the same size as its input. For example, it's not safe for [su]itofp from v4i32 to v4f64. llvm-svn: 213799	2014-07-23 20:41:38 +00:00
Robert Khasanov	74acbb7767	[SKX] Enabling mask instructions: encoding, lowering KMOVB, KMOVW, KMOVD, KMOVQ, KNOTB, KNOTW, KNOTD, KNOTQ Reviewed by Elena Demikhovsky <elena.demikhovsky@intel.com> llvm-svn: 213757	2014-07-23 14:49:42 +00:00
Andrea Di Biagio	842355e900	Revert r211771. It was: "[X86] Improve the selection of SSE3/AVX addsub instructions". This chang fully reverts r211771. That revision added a canonicalization rule which has the potential to causes a combine-cycle in the target-independent canonicalizing DAG combine. The plan is to move the logic that forms target specific addsub nodes as part of the lowering of shuffles. llvm-svn: 213736	2014-07-23 11:20:24 +00:00
Andrea Di Biagio	4d8bd41600	[DAG] Refactor some logic. No functional change. This patch removes function 'CommuteVectorShuffle' from X86ISelLowering.cpp and moves its logic into SelectionDAG.cpp as method 'getCommutedVectorShuffles'. This refactoring is in preperation of an upcoming change to the DAGCombiner. llvm-svn: 213503	2014-07-21 07:28:51 +00:00
Tim Northover	871de902af	X86: support fpext/fptrunc operations to and from 16-bit floats. llvm-svn: 213374	2014-07-18 13:01:25 +00:00
Jim Grosbach	b6535c32f5	X86: Constant fold converting vector setcc results to float. Since the result of a SETCC for X86 is 0 or -1 in each lane, we can move unary operations, in this case [su]int_to_fp through the mask operation and constant fold the operation away. Generally speaking: UNARYOP(AND(VECTOR_CMP(x,y), constant)) --> AND(VECTOR_CMP(x,y), constant2) where constant2 is UNARYOP(constant). This implements the transform where UNARYOP is [su]int_to_fp. For example, consider the simple function: define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind { %cmp = fcmp oeq <4 x float> %val, %test %ext = zext <4 x i1> %cmp to <4 x i32> %result = sitofp <4 x i32> %ext to <4 x float> ret <4 x float> %result } Before this change, the SSE code is generated as: LCPI0_0: .long 1 ## 0x1 .long 1 ## 0x1 .long 1 ## 0x1 .long 1 ## 0x1 .section __TEXT,__text,regular,pure_instructions .globl _foo .align 4, 0x90 _foo: ## @foo cmpeqps %xmm1, %xmm0 andps LCPI0_0(%rip), %xmm0 cvtdq2ps %xmm0, %xmm0 retq After, the code is improved to: LCPI0_0: .long 1065353216 ## float 1.000000e+00 .long 1065353216 ## float 1.000000e+00 .long 1065353216 ## float 1.000000e+00 .long 1065353216 ## float 1.000000e+00 .section __TEXT,__text,regular,pure_instructions .globl _foo .align 4, 0x90 _foo: ## @foo cmpeqps %xmm1, %xmm0 andps LCPI0_0(%rip), %xmm0 retq The cvtdq2ps has been constant folded away and the floating point 1.0f vector lanes are materialized directly via the ModRM operand of andps. llvm-svn: 213342	2014-07-18 00:40:56 +00:00
Tim Northover	84ce0a642e	CodeGen: generate single libcall for fptrunc -> f16 operations. Previously we asserted on this code. Currently compiler-rt doesn't actually implement any of these new libcalls, but external help is pretty much the only viable option for LLVM. I've followed the much more generic "__truncST2" naming, as opposed to the odd name for f32 -> f16 truncation. This can obviously be changed later, or overridden by any targets that need to. llvm-svn: 213252	2014-07-17 11:12:12 +00:00
Tim Northover	2131044814	X86: support double extension of f16 type. x86 has no native ability to extend an f16 to f64, but the same result is obtained if we expand it into two separate extensions: f16 -> f32 -> f64. Unfortunately the same is not true for truncate, so that still results in a compilation failure. llvm-svn: 213251	2014-07-17 11:04:04 +00:00
Tim Northover	fd7e424935	CodeGen: extend f16 conversions to permit types > float. This makes the two intrinsics @llvm.convert.from.f16 and @llvm.convert.to.f16 accept types other than simple "float". This is only strictly needed for the truncate operation, since otherwise double rounding occurs and there's no way to represent the strict IEEE conversion. However, for symmetry we allow larger types in the extend too. During legalization, we can expand an "fp16_to_double" operation into two extends for convenience, but abort when the truncate isn't legal. A new libcall is probably needed here. Even after this commit, various target tweaks are needed to actually use the extended intrinsics. I've put these into separate commits for clarity, so there are no actual tests of f64 conversion here. llvm-svn: 213248	2014-07-17 10:51:23 +00:00
Tim Northover	7f3e11e7c0	CodeGen: don't form illegail EXTLOAD operations. It turns out that in most cases (the main exception being i1-related types) once these operations are formed we cannot separate them and the targets end up having to deal with them whether they want to or not. This is not a good situation, and a more reasonable default can be formed by ackowledging this and having targets leave them as Legal. Only x86 seems to be affected (other targets don't even try marking the operation Expand). Mostly there's no visible change here yet, but it will be useful to have truly expanded EXTLOADS for MVT::f16 softening support. llvm-svn: 213162	2014-07-16 15:37:24 +00:00
Andrea Di Biagio	a03624d8ab	[X86] Add a check for 'isMOVHLPSMask' within method 'isShuffleMaskLegal'. Before this change, method 'isShuffleMaskLegal' didn't know that shuffles implementing a 'movhlps' operation were perfectly legal for SSE targets. This patch adds the missing check for 'isMOVHLPSMask' inside method 'isShuffleMaskLegal' to fix the problem. The reason why it is important to do this is because the DAGCombiner conservatively avoids combining a pair of shuffles if the resulting shuffle node has an illegal mask. Before this patch, shuffles with a MOVHLPS mask were wrongly considered not to be legal. This was the root cause of some poor-code generation bugs. llvm-svn: 213137	2014-07-16 11:29:39 +00:00
David Majnemer	d4d9944416	Fix typo in comment No functionality changed. llvm-svn: 213052	2014-07-15 07:11:32 +00:00
Quentin Colombet	0f179c4d8a	[X86] Fix the inversion of low and high bits for the lowering of MUL_LOHI. Also add a few comments. <rdar://problem/17581756> llvm-svn: 212808	2014-07-11 12:08:23 +00:00
Chandler Carruth	df8d0caab7	[x86] Add another combine that is particularly useful for the new vector shuffle lowering: match shuffle patterns equivalent to an unpcklwd or unpckhwd instruction. This allows us to use generic lowering code for v8i16 shuffles and match the unpack pattern late. llvm-svn: 212705	2014-07-10 11:09:29 +00:00
Chandler Carruth	853fa0ac8d	[x86] Expand the target DAG combining for PSHUFD nodes to be able to combine into half-shuffles through unpack instructions that expand the half to a whole vector without messing with the dword lanes. This fixes some redundant instructions in splat-like lowerings for v16i8, which are now getting to be really nice. llvm-svn: 212695	2014-07-10 09:57:36 +00:00
Chandler Carruth	a34a8e230d	[x86] Tweak the v16i8 single input special case lowering for shuffles that splat i8s into i16s. Previously, we would try much too hard to arrange a sequence of i8s in one half of the input such that we could unpack them into i16s and shuffle those into place. This isn't always going to be a cheaper i8 shuffle than our other strategies. The case where it is always going to be cheaper is when we can arrange all the necessary inputs into one half using just i16 shuffles. It happens that viewing the problem this way also makes it much easier to produce an efficient set of shuffles to move the inputs into one half and then unpack them. With this, our splat code gets one step closer to being not terrible with the new experimental lowering strategy. It also exposes two combines missing which I will add next. llvm-svn: 212692	2014-07-10 09:16:40 +00:00
Chandler Carruth	7d2ffb5492	[x86] Initial improvements to the new shuffle lowering for v16i8 shuffles specifically for cases where a small subset of the elements in the input vector are actually used. This is specifically targetted at improving the shuffles generated for trunc operations, but also helps out splat-like operations. There is still some really low-hanging fruit here that I want to address but this is a huge step in the right direction. llvm-svn: 212680	2014-07-10 04:34:06 +00:00
Chandler Carruth	b3840a55ae	[x86] Refactor some of the new code for lowering v16i8 shuffles to remove duplication and make it easier to select different strategies. No functionality changed. llvm-svn: 212674	2014-07-10 02:24:26 +00:00
Chandler Carruth	d3561f6fec	[SDAG] Make the new zext-vector-inreg node default to expand so targets don't need to set it manually. This is based on feedback from Tom who pointed out that if every target needs to handle this we need to reach out to those maintainers. In fact, it doesn't make sense to duplicate everything when anything other than expand seems unlikely at this stage. llvm-svn: 212661	2014-07-09 22:53:04 +00:00
Benjamin Kramer	d6f1733add	X86: When lowering v8i32 himuls use the correct shuffle masks for AVX2. Turns out my trick of using the same masks for SSE4.1 and AVX2 didn't work out as we have to blend two vectors. While there remove unecessary cross-lane moves from the shuffles so the backend can lower it to palignr instead of vperm. Fixes PR20118, a miscompilation of vector sdiv by constant on AVX2. llvm-svn: 212611	2014-07-09 11:12:39 +00:00
Chandler Carruth	afe4b2507e	[x86] Add a ZERO_EXTEND_VECTOR_INREG DAG node and use it when widening vector types to be legal and a ZERO_EXTEND node is encountered. When we use widening to legalize vector types, extend nodes are a real challenge. Either the input or output is likely to be legal, but in many cases not both. As a consequence, we don't really have any way to represent this situation and the prior code in the widening legalization framework would just scalarize the extend operation completely. This patch introduces a new DAG node to represent doing a zero extend of a vector "in register". The core of the idea is to allow legal but different vector types in the input and output. The output vector must have fewer lanes but wider elements. The operation is defined to zero extend the low elements of the input to the size of the output elements, and drop all of the high elements which don't have a corresponding lane in the output vector. It also includes generic expansion of this node in terms of blending a zero vector into the high elements of the vector and bitcasting across. This in turn yields extremely nice code for x86 SSE2 when we use the new widening legalization logic in conjunction with the new shuffle lowering logic. There is still more to do here. We need to support sign extension, any extension, and potentially int-to-float conversions. My current plan is to continue using similar synthetic nodes to model each of these transitions with generic lowering code for each one. However, with this patch LLVM already reaches performance parity with GCC for the core C loops of the x264 code (assuming you disable the hand-written assembly versions) when compiling for SSE2 and SSE3 architectures and enabling the new widening and lowering logic for vectors. Differential Revision: http://reviews.llvm.org/D4405 llvm-svn: 212610	2014-07-09 10:58:18 +00:00
Chandler Carruth	ef5dcf571e	[x86] Initialize a pointer to null to fix a bug in r212602. This should restore GCC hosts (which happen to put the bad stuff into the pointer) and MSan, etc. llvm-svn: 212606	2014-07-09 10:36:42 +00:00
Chandler Carruth	2ebc942683	[x86] Re-apply a variant of the x86 side of r212324 now that the rest has settled without incident, removing the x86-specific and overly strict 'isVectorSplat' routine in favor of generic and more powerful splat detection. The primary motivation and result of this is that the x86 backend can now see through splats which contain undef elements. This is essential if we are using a widening form of legalization and I've updated a test case to also run in that mode as before this change the generated code for the test case was completely scalarized. This version of the patch much more carefully handles the undef lanes. - We aren't overly conservative about them in the shift lowering (where we will never use the splat itself). - One place where the splat would have been re-used by the existing code now explicitly constructs a new constant splat that will be safe. - The broadcast lowering is much more reasonable with undefs by doing a correct check of whether the splat is the only user of a loaded value, checking that the splat actually crosses multiple lanes before using a broadcast, and handling broadcasts of non-constant splats. As a consequence of the last bullet, the weird usage of vpshufd instead of vbroadcast is gone, and we actually can lower an AVX splat with vbroadcastss where before we emitted a really strange pattern of a vector load and a manual splat across the vector. llvm-svn: 212602	2014-07-09 10:06:58 +00:00
Chandler Carruth	142e966261	[x86,SDAG] Sink the logic for folding shuffles of splats more aggressively from the x86 shuffle lowering to the generic SDAG vector shuffle formation code. This code already tried to fold away shuffles of splats! It just had lots of bugs and couldn't handle the case my new x86 shuffle lowering needed. First, it failed to correctly compute whether N2 was undef because it pre-computed this, then did transformations which could make N2 undef, then failed to ever re-consider the precomputed state. Second, it didn't look through bitcasts at all, even in the safe cases where they are just element-type bitcasts with no change to the number of elements. Third, it didn't handle all-zero bit casts nicely the way my code in the x86 side of things did, which is essential to getting good zext-shuffle lowerings. But all of these are generic. I just ported the code down to this layer and fixed the surrounding bugs. Tests exercising this in the x86 backend still pass and some silly code in widen_cast-6.ll gets better. I updated that test to be a bit more precise but it's still pretty unclear what the value of the test is in this day and age. llvm-svn: 212517	2014-07-08 08:45:38 +00:00
Andrea Di Biagio	2620b877b6	[x86] Fix assertion failure caused by a wrong combine of PSHUFD nodes with different types. When combining a sequence of two PSHUFD dag nodes into a single PSHUFD, make sure that we assign the correct type to the resulting PSHUFD. X86ISD::PSHUFD dag nodes can be either MVT::v4i32 or MVT::v4f32. Before this change, an assertion failure was triggered in method 'DAGCombinerInfo::CombineTo' when trying to combine the shuffles from the test below into a single PSHUFD. define <4 x float> @test1(<4 x float> %V) { %1 = shufflevector <4 x float> %V, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1> %2 = shufflevector <4 x float> %1, <4 x float> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1> ret <4 x float> %2 } llvm-svn: 212498	2014-07-07 23:25:23 +00:00
Chandler Carruth	beeacac0b3	[x86] Revert r212324 which was too aggressive w.r.t. allowing undef lanes in vector splats. The core problem here is that undef lanes can't unilaterally be considered to contribute to splats. Their handling needs to be more cautious. There is also a reported failure of the nightly testers (thanks Tobias!) that may well stem from the same core issue. I'm going to fix this theoretical issue, factor the APIs a bit better, and then verify that I don't see anything bad with Tobias's reduction from the test suite before recommitting. Original commit message for r212324: [x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for any constant, constant FP, or undef splat and to tolerate any undef lanes in a splat, then replace all uses of isSplatVector in X86's lowering with it. This fixes issues where undef lanes in an otherwise splat vector would prevent the splat logic from firing. It is a touch more awkward to use this interface, but it is much more accurate. Suggestions for better interface structuring welcome. With this fix, the code generated with the widening legalization strategy for widen_cast-4.ll is dramatically improved as the special lowering strategies for a v16i8 SRA kick in even though the high lanes are undef. We also get a slightly different choice for broadcasting an aligned memory location, and use vpshufd instead of vbroadcastss. This looks like a minor win for pipelining and domain crossing, but a minor loss for the number of micro-ops. I suspect its a wash, but folks can easily tweak the lowering if they want. llvm-svn: 212475	2014-07-07 19:03:32 +00:00
Chandler Carruth	0dcb366268	[x86] Teach the new vector shuffle lowering code to handle what is essentially a DAG combine that never gets a chance to run. We might typically expect DAG combining to remove shuffles-of-splats and other similar patterns, but we don't get a chance to run the DAG combiner when we recursively form sub-shuffles during the lowering of a shuffle. So instead hand-roll a really important combine directly into the lowering code to detect shuffles-of-splats, especially shuffles of an all-zero splat which needn't even have the same element width, etc. This lets the new vector shuffle lowering handle shuffles which implement things like zero-extension really nicely. This will become even more important when I wire the legalization of zero-extension to vector shuffles with the new widening legalization strategy. llvm-svn: 212444	2014-07-07 09:06:58 +00:00
Chandler Carruth	5d79bb5d32	[x86] Generalize BuildVectorSDNode::getConstantSplatValue to work for any constant, constant FP, or undef splat and to tolerate any undef lanes in a splat, then replace all uses of isSplatVector in X86's lowering with it. This fixes issues where undef lanes in an otherwise splat vector would prevent the splat logic from firing. It is a touch more awkward to use this interface, but it is much more accurate. Suggestions for better interface structuring welcome. With this fix, the code generated with the widening legalization strategy for widen_cast-4.ll is dramatically improved as the special lowering strategies for a v16i8 SRA kick in even though the high lanes are undef. We also get a slightly different choice for broadcasting an aligned memory location, and use vpshufd instead of vbroadcastss. This looks like a minor win for pipelining and domain crossing, but a minor loss for the number of micro-ops. I suspect its a wash, but folks can easily tweak the lowering if they want. llvm-svn: 212324	2014-07-04 08:11:49 +00:00
Chandler Carruth	19cff8205e	[x86] Clarify that this lowering only applies to vectors and is only used when we have SSE2. llvm-svn: 212300	2014-07-03 22:57:44 +00:00
Andrea Di Biagio	a37a2fc81f	[X86] Add ISel patterns to select 'f32_to_f16' and 'f16_to_f32' dag nodes. This patch adds tablegen patterns to select F16C float-to-half-float conversion instructions from 'f32_to_f16' and 'f16_to_f32' dag nodes. If the target doesn't have F16C, then 'f32_to_f16' and 'f16_to_f32' are expanded into library calls. llvm-svn: 212293	2014-07-03 21:51:06 +00:00
Chandler Carruth	739b6ada99	[x86] Fix crashes in lowering bitcast instructions with the widening mode. This also runs the test in that mode which would reproduce the crash. What I love is that every single FIXME in the test is addressed by switching to widening. llvm-svn: 212254	2014-07-03 03:43:47 +00:00
Chandler Carruth	49a8b10d82	[x86] Based on a long conversation between myself, Jim Grosbach, Hal Finkel, Eric Christopher, and a bunch of other people I'm probably forgetting (sorry), add an option to the x86 backend to widen vectors during type legalization rather than promote them. This still would promote vNi1 vectors to get the masks right, but would widen other vectors. A lot of experiments are piling up right now showing that widening should probably be the default legalization strategy outside of vNi1 cases, but it is very hard to test the rammifications of that and fix bugs in widening-based legalization without an option that enables it. I'll be checking in tests shortly that use this option to exercise cases where widening doesn't work well and hopefully we'll be able to switch fully to this soon. llvm-svn: 212249	2014-07-03 02:11:29 +00:00
Benjamin Kramer	e739cf3eb5	X86: When combining shuffles just remove shuffles that are completely redundant. CombineTo doesn't allow replacing a node with itself so this would crash if the combined shuffle is the same as the input shuffle. llvm-svn: 212181	2014-07-02 15:09:44 +00:00
Juergen Ributzka	3bd03c7099	[DAG] Pass the argument list to the CallLoweringInfo via move semantics. NFCI. The argument list vector is never used after it has been passed to the CallLoweringInfo and moving it to the CallLoweringInfo is cleaner and pretty much as cheap as keeping a pointer to it. llvm-svn: 212135	2014-07-01 22:01:54 +00:00
Tim Northover	df58625e3c	X86: delegate expanding atomic libcalls to generic code. On targets without cmpxchg16b or cmpxchg8b, the borderline atomic operations were slipping through the gaps. X86AtomicExpand.cpp was delegating to ISelLowering. Generic ISelLowering was delegating to X86ISelLowering and X86ISelLowering was asserting. The correct behaviour is to expand to a libcall, preferably in generic ISelLowering. This can be achieved by X86ISelLowering deciding it doesn't want the faff after all. llvm-svn: 212134	2014-07-01 21:44:59 +00:00
Tim Northover	277066ab43	X86: expand atomics in IR instead of as MachineInstrs. The logic for expanding atomics that aren't natively supported in terms of cmpxchg loops is much simpler to express at the IR level. It also allows the normal optimisations and CodeGen improvements to help out with atomics, instead of using a limited set of possible instructions.. rdar://problem/13496295 llvm-svn: 212119	2014-07-01 18:53:31 +00:00
Andrea Di Biagio	53b6830069	[X86] Add support for builtin to read performance monitoring counters. This patch adds support for a new builtin instruction called __builtin_ia32_rdpmc. Builtin '__builtin_ia32_rdpmc' is defined as a 'GCC builtin'; on X86, it can be used to read performance monitoring counters. It takes as input the index of the performance counter to read, and returns the value of the specified performance counter as a 64-bit number. Calls to this new builtin will map to instruction RDPMC. The index in input to the builtin call is moved to register %ECX. The result of the builtin call is the value of the specified performance counter (RDPMC would return that quantity in registers RDX:RAX). This patch: - Adds builtin int_x86_rdpmc as a GCCBuiltin; - Adds a new x86 DAG node called 'RDPMC_DAG'; - Teaches how to lower this new builtin; - Adds an ISel pattern to select instruction RDPMC; - Fixes the definition of instruction RDPMC adding %RAX and %RDX as implicit definitions, and adding %ECX as implicit use; - Adds a LLVM test to verify that the new builtin is correctly selected. llvm-svn: 212049	2014-06-30 17:14:21 +00:00
Chandler Carruth	bd0717d7cc	[x86] Fix a bug in the v8i16 shuffling exposed by the new splat-like lowering for v16i8. ASan and some bots caught this bug with existing test cases. Fixing it even fixed a miscompile with one of the test cases. I'm still a bit suspicious of this test case as I've not taken a proper amount of time to think about it, but the fix here is strict goodness. llvm-svn: 211976	2014-06-28 05:46:28 +00:00
Chandler Carruth	887c2c3482	[x86] Add handling for splat-like widenings of v16i8 shuffles. These show up really frequently, not the least with actual splats. =] We lowered these quite badly before. The new code path tries to widen i8 shuffles to i16 shuffles in a splat-like way. There are still some inefficiencies in our i16 splat logic though, so we aren't really done here. Also, for certain patterns (bit of a gather-and-splat) we still generate pretty silly code, and I've left a fixme for addressing it. However, I'm not actually worried about this code pattern as much. The old shuffle lowering generates a 29 instruction monstrosity for it that should execute much more slowly. llvm-svn: 211974	2014-06-28 05:16:40 +00:00
Chandler Carruth	a94ef908d9	[x86] Fix another bug hit when bootstrapping with the new shuffle lowering. For maximum irony, I had already discovered this bug, diagnosed it, and left FIXMEs about it in the test cases. =[ I just failed to go back over those until after i had reduced a bootstrap miscompile down to a single TU, stared at the assembly for an hour, and figured out the bug. Again. Oh well. llvm-svn: 211955	2014-06-27 20:07:40 +00:00
Chandler Carruth	dd6470a9dd	[x86] Fix a miscompile in the new shuffle lowering uncovered by a bootstrap. I managed to mis-remember how PACKUS worked on x86, and was using undef for the high bytes instead of zero. The fix is fairly obvious. llvm-svn: 211922	2014-06-27 18:25:23 +00:00
Alexander Kornienko	b673b4b187	Clean up unused variable warning in release build. llvm-svn: 211902	2014-06-27 15:30:55 +00:00
Chandler Carruth	ed4a0bc734	[x86] Clean up some unused variables, especially in release builds. llvm-svn: 211894	2014-06-27 12:04:18 +00:00
Chandler Carruth	688001f042	[x86] Teach the target combine step to aggressively fold pshufd insturcions. Summary: This allows it to fold pshufd instructions across intervening half-shuffles and other noise. This pattern actually shows up in the generic lowering tests, but I've also added direct tests using intrinsics to make sure that the specific desired functionality is working even if the lowering stuff changes in the future. Differential Revision: http://reviews.llvm.org/D4292 llvm-svn: 211892	2014-06-27 11:40:13 +00:00
Chandler Carruth	0d6d1f2b17	[x86] Teach the target-specific combining how to aggressively fold half-shuffles, even looking through intervening instructions in a chain. Summary: This doesn't happen to show up with any test cases I've found for the current shuffle lowering, but previous attempts would benefit from this and it seems generally useful. I've tested it directly using intrinsics, which also shows that it will work with hand vectorized code as well. Note that even though pshufd isn't directly used in these tests, it gets exercised because we combine some of the half shuffles into a pshufd first, and then merge them. Differential Revision: http://reviews.llvm.org/D4291 llvm-svn: 211890	2014-06-27 11:34:40 +00:00
Chandler Carruth	97ebc2362c	[x86] Teach the X86 backend to DAG-combine SSE2 shuffles that are trivially redundant. This fixes several cases in the new vector shuffle lowering algorithm which would generate redundant shuffle instructions for the sake of simplicity. I'm also deleting a testcase which was somewhat ridiculous. It was checking for a bug in 2007 about incorrectly transforming shuffles by looking for the string "-86" in the output of a pretty substantial function. This test case doesn't seem to have any value at this point. Differential Revision: http://reviews.llvm.org/D4240 llvm-svn: 211889	2014-06-27 11:27:52 +00:00
Chandler Carruth	83860cfcfa	[x86] Begin a significant overhaul of how vector lowering is done in the x86 backend. This sketches out a new code path for vector lowering, hidden behind an off-by-default flag while it is under development. The fundamental idea behind the new code path is to aggressively break down the problem space in ways that ease selecting the odd set of instructions available on x86, and carefully avoid scalarizing code even when forced to use older ISAs. Notably, this starts off restricting itself to SSE2 and implements the complete vector shuffle and blend space for 128-bit vectors in SSE2 without scalarizing. The plan is to layer on top of this ISA extensions where we can bail out of the complex SSE2 lowering and opt for a cheaper, specialized instruction (or set of instructions). It also needs to be generalized to AVX and AVX512 vector widths. Currently, this does a decent but not perfect job for SSE2. There are some specific shortcomings that I plan to address: - We need a peephole combine to fold together shuffles where possible. There are cases where a previous shuffle could be modified slightly to arrange for elements to be in the correct position and a later shuffle eliminated. Doing this eagerly added quite a bit of complexity, and so my plan is to combine away these redundancies afterward. - There are a lot more clever ways to use unpck and pack that need to be added. This is essential for real world shuffles as it turns out... Once SSE2 is polished a bit I should be able to get interesting numbers on performance improvements on benchmarks conducive to vectorization. All of this will be off by default until it is functionally equivalent of course. Differential Revision: http://reviews.llvm.org/D4225 llvm-svn: 211888	2014-06-27 11:23:44 +00:00
Andrea Di Biagio	1ee38843ac	Silence a warning due to a comparison between signed and unsigned. No functional change intended. llvm-svn: 211782	2014-06-26 13:41:10 +00:00
Andrea Di Biagio	7fb85256bc	[X86] Improve the selection of SSE3/AVX addsub instructions. This patch teaches the backend how to canonicalize a shuffle vectors according to the rule: - (shuffle (FADD A, B), (FSUB A, B), Mask) -> (shuffle (FSUB A, -B), (FADD A, -B), Mask) Where 'Mask' is: <0,5,2,7> ;; for v4f32 and v4f64 shuffles. <0,3> ;; for v2f64 shuffles. <0,9,2,11,4,13,6,15> ;; for v8f32 shuffles. In general, ISel only knows how to pattern-match a canonical 'fadd + fsub + blendi' dag node sequence into an ADDSUB instruction. This new rule allows to convert a non-canonical dag sequence into a canonical one that will be matched by a single ADDSUB at ISel stage. The idea of converting a non-canonical ADDSUB into a canonical one by swapping the first two operands of the shuffle, and then negating the second operand of the FADD and FSUB, was originally proposed by Hal Finkel. llvm-svn: 211771	2014-06-26 10:45:21 +00:00
Andrea Di Biagio	07cdffc324	[X86] Always prefer to lower a VECTOR_SHUFFLE into a BLENDI instead of SHUFP (or VPERM2X128). This patch teaches method 'LowerVECTOR_SHUFFLE' to give higher precedence to the check for 'isBlendMask'; the idea is that, when possible, we should firstly check if a shuffle performs a blend, and in case, try to lower it into a BLENDI instead of selecting a SHUFP or (worse) a VPERM2X128. In general: - AVX VBLENDPS/D always have better latency and throughput than VPERM2F128; - BLENDPS/D instructions tend to always have better 'reciprocal throughput' than the equivalent SHUFPS/D; - Both BLENDPS/D and SHUFPS/D are often decoded into the same number of m-ops; however, a m-op obtained from a BLENDPS/D can be scheduled to more than one execution port. This patch: - Moves the check for 'isBlendMask' immediately before the check for 'isSHUFPMask' within method 'LowerVECTOR_SHUFFLE'; - Updates existing tests for sse/avx shuffle/blend instructions to verify that we select (v)blendps/d when possible (instead of (v)shufps/d or vperm2f128). llvm-svn: 211720	2014-06-25 17:41:58 +00:00
Chandler Carruth	e5724d7532	[x86] Add intrinsics for the pshufd, pshuflw, and pshufhw instructions. llvm-svn: 211694	2014-06-25 13:12:54 +00:00
NAKAMURA Takumi	1db5995d14	Re-apply r211399, "Generate native unwind info on Win64" with a fix to ignore SEH pseudo ops in X86 JIT emitter. -- This patch enables LLVM to emit Win64-native unwind info rather than DWARF CFI. It handles all corner cases (I hope), including stack realignment. Because the unwind info is not flexible enough to describe stack frames with a gap of unknown size in the middle, such as the one caused by stack realignment, I modified register spilling code to place all spills into the fixed frame slots, so that they can be accessed relative to the frame pointer. Patch by Vadim Chugunov! Reviewed By: rnk Differential Revision: http://reviews.llvm.org/D4081 llvm-svn: 211691	2014-06-25 12:41:52 +00:00
Andrea Di Biagio	6d9b9e125d	[X86] Add target combine rule to select ADDSUB instructions from a build_vector This patch teaches the backend how to combine a build_vector that implements an 'addsub' between packed float vectors into a sequence of vector add and vector sub followed by a VSELECT. The new VSELECT is expected to be lowered into a BLENDI. At ISel stage, the sequence 'vector add + vector sub + BLENDI' is pattern-matched against ISel patterns added at r211427 to select 'addsub' instructions. Added three more ISel patterns for ADDSUB. Added test sse3-avx-addsub-2.ll to verify that we correctly emit 'addsub' instructions. llvm-svn: 211679	2014-06-25 10:02:21 +00:00
Robert Khasanov	21c836823f	vpblend intrinsics combines as shifts intrinsics due to absence return stmt between them Fix PR20088 Differential Revision: http://reviews.llvm.org/D4277 llvm-svn: 211617	2014-06-24 18:08:04 +00:00
NAKAMURA Takumi	d77cefe633	Revert r211399, "Generate native unwind info on Win64" It broke Legacy JIT Tests on x86_64-{mingw32\|msvc}, aka Windows x64. llvm-svn: 211480	2014-06-22 22:00:56 +00:00
Filipe Cabecinhas	1af2dfd274	Fix PR20087 by using the source index when changing the vector load llvm-svn: 211472	2014-06-22 17:21:37 +00:00
Reid Kleckner	4a01230db4	Generate native unwind info on Win64 This patch enables LLVM to emit Win64-native unwind info rather than DWARF CFI. It handles all corner cases (I hope), including stack realignment. Because the unwind info is not flexible enough to describe stack frames with a gap of unknown size in the middle, such as the one caused by stack realignment, I modified register spilling code to place all spills into the fixed frame slots, so that they can be accessed relative to the frame pointer. Patch by Vadim Chugunov! Reviewed By: rnk Differential Revision: http://reviews.llvm.org/D4081 llvm-svn: 211399	2014-06-20 20:35:47 +00:00
Chandler Carruth	8366cebeb5	[x86] Make the x86 PACKSSWB, PACKSSDW, PACKUSWB, and PACKUSDW instructions available as synthetic SDNodes PACKSS and PACKUS that will select to the correct instruction variants based on the return type. This allows us to use these rather important instructions when lowering vector shuffles. Also moves the relevant instruction definitions to be split out from the fully generic multiclasses to allow them to match these new SDNodes in the same way that the UNPCK instructions do. No functionality should actually be changed here. llvm-svn: 211332	2014-06-20 01:05:28 +00:00
Andrea Di Biagio	54b0949af9	[X86] Teach how to combine horizontal binop even in the presence of undefs. Before this change, the backend was unable to fold a build_vector dag node with UNDEF operands into a single horizontal add/sub. This patch teaches how to combine a build_vector with UNDEF operands into a horizontal add/sub when possible. The algorithm conservatively avoids to combine a build_vector with only a single non-UNDEF operand. Added test haddsub-undef.ll to verify that we correctly fold horizontal binop even in the presence of UNDEFs. llvm-svn: 211265	2014-06-19 10:29:41 +00:00
Cameron McInally	0d0489cea6	Hook up vector int_ctlz for AVX512. llvm-svn: 211024	2014-06-16 14:12:28 +00:00
Tim Northover	51472bc600	X86: lower ATOMIC_CMP_SWAP_WITH_SUCCESS directly Lowering this new node allows us to fold the almost universal comparison for success before it's even formed. Instead we can create a copy from EFLAGS and an X86ISD::SETCC operation since all "cmpxchg" instructions set the zero-flag to the correct value. rdar://problem/13201607 llvm-svn: 210923	2014-06-13 17:29:39 +00:00
Tim Northover	420a216817	IR: add "cmpxchg weak" variant to support permitted failure. This commit adds a weak variant of the cmpxchg operation, as described in C++11. A cmpxchg instruction with this modifier is permitted to fail to store, even if the comparison indicated it should. As a result, cmpxchg instructions must return a flag indicating success in addition to their original iN value loaded. Thus, for uniformity all cmpxchg instructions now return "{ iN, i1 }". The second flag is 1 when the store succeeded. At the DAG level, a new ATOMIC_CMP_SWAP_WITH_SUCCESS node has been added as the natural representation for the new cmpxchg instructions. It is a strong cmpxchg. By default this gets Expanded to the existing ATOMIC_CMP_SWAP during Legalization, so existing backends should see no change in behaviour. If they wish to deal with the enhanced node instead, they can call setOperationAction on it. Beware: as a node with 2 results, it cannot be selected from TableGen. Currently, no use is made of the extra information provided in this patch. Test updates are almost entirely adapting the input IR to the new scheme. Summary for out of tree users: ------------------------------ + Legacy Bitcode files are upgraded during read. + Legacy assembly IR files will be invalid. + Front-ends must adapt to different type for "cmpxchg". + Backends should be unaffected by default. llvm-svn: 210903	2014-06-13 14:24:07 +00:00
Andrea Di Biagio	2dd3b3b674	[X86] Teach how to dump the name of target node RDTSCP_DAG. When I originally added node RDTSCP_DAG (r207127) I forgot to add a string name for it in method 'getTargetNodeName'. No functional change intended. llvm-svn: 210769	2014-06-12 11:37:24 +00:00
Andrea Di Biagio	972ff97f8c	[X86] Teach how to combine AVX and AVX2 horizontal binop on packed 256-bit vectors. This patch adds target combine rules to match: - [AVX] Horizontal add/sub of packed single/double precision floating point values from 256-bit vectors; - [AVX2] Horizontal add/sub of packed integer values from 256-bit vectors. llvm-svn: 210761	2014-06-12 10:53:48 +00:00
Tim Northover	4dc9eaa6ba	X86: add stringy name for X86ISD::LCMPXCHG16_DAG I don't know what "target specific node #383" is, and I don't want to have to. llvm-svn: 210663	2014-06-11 17:04:08 +00:00
Andrea Di Biagio	c7af75f9a7	[X86] Refactor the logic to select horizontal adds/subs to a helper function. This patch moves part of the logic implemented by the target specific combine rules added at r210477 to a separate helper function. This should make easier to add more rules for matching AVX/AVX2 horizontal adds/subs. This patch also fixes a problem caused by a wrong check performed on indices of extract_vector_elt dag nodes in input to the scalar adds/subs. New tests have been added to verify that we correctly check indices of extract_vector_elt dag nodes when selecting a horizontal operation. llvm-svn: 210644	2014-06-11 07:57:50 +00:00
Eric Christopher	68d7559e97	Use the TargetMachine on the DAG or the MachineFunction instead of using the cached TargetMachine. llvm-svn: 210589	2014-06-10 21:25:13 +00:00
Eric Christopher	19b1d73e88	Add a FIXME. llvm-svn: 210559	2014-06-10 18:31:18 +00:00
Andrea Di Biagio	fa508af0fe	[X86] Improved target combine rules for selecting horizontal add/sub. This patch slightly changes the algorithm introduced at revision 210477 to fix a problem where the algorithm was producing incorrect code for the VEX.256 encoded versions of horizontal add/sub. For these cases, we now try to split the two 256-bit vectors into 128-bit chunks before emitting horizontal add/sub dag nodes. Added a new test case into haddsub-2.ll. llvm-svn: 210545	2014-06-10 16:42:57 +00:00
Tom Stellard	3787b12255	SelectionDAG: Don't use MVT::Other to determine legality of ISD::SELECT_CC The SelectionDAG bad a special case for ISD::SELECT_CC, where it would allow targets to specify: setOperationAction(ISD::SELECT_CC, MVT::Other, Expand); to indicate that they wanted to expand ISD::SELECT_CC for all types. This wasn't applied correctly everywhere, and it makes writing new DAG patterns with ISD::SELECT_CC difficult. llvm-svn: 210541	2014-06-10 16:01:29 +00:00
Tim Northover	7b9f86da5d	Revert "X86: elide comparisons after cmpxchg instructions." This reverts commit r210523. It was committed prematurely without waiting for review. llvm-svn: 210524	2014-06-10 10:50:11 +00:00
Tim Northover	84ad29ca1f	X86: elide comparisons after cmpxchg instructions. The C++ and C semantics of the compare_and_swap operations actually require us to return a boolean "success" value. In LLVM terms this means a second comparison of the output of "cmpxchg" against the input desired value. However, x86's "cmpxchg" instruction sets all flags for the comparison formed, so we can skip any secondary comparison. (N.b. this isn't true for cmpxchg8b/16b, which only set ZF). rdar://problem/13201607 llvm-svn: 210523	2014-06-10 10:49:07 +00:00
Eric Christopher	a08f30bd40	Move all of the x86 subtarget initialized variables down into the x86 subtarget from the x86 target machine. Should be no functional change. llvm-svn: 210479	2014-06-09 17:08:19 +00:00
Andrea Di Biagio	f99dd64f0a	[X86] Add target combine rules for horizontal add/sub. This patch adds new target specific combine rules to identify horizontal add/sub idioms from BUILD_VECTOR dag nodes. This patch also teaches the DAGCombiner how to canonicalize sequences of insert_vector_elt dag nodes according to the following rule: (insert_vector_elt (insert_vector_elt A, I0), I1) -> (insert_vecto_elt (insert_vector_elt A, I1), I0) This new canonicalization rule only triggers if the inner insert_vector dag node has exactly one use; also, both indices must be known constants, and I1 < I0. This last rule made it possible to write a simpler algorithm to identify horizontal add/sub patterns because now we don't have to worry about the ordering of insert_vector_elt dag nodes. llvm-svn: 210477	2014-06-09 16:54:41 +00:00
Andrea Di Biagio	dfbdc71ea1	[X86] Avoid emitting unnecessary test instructions. This patch teaches the backend how to check for the 'NoSignedWrap' flag on binary operations to improve the emission of 'test' instructions. If the result of a binary operation is known not to overflow we know that resetting the Overflow flag is unnecessary and so we can avoid emitting the test instruction. Patch by Marcello Maggioni. llvm-svn: 210468	2014-06-09 12:34:50 +00:00
Alexey Volkov	5260dba323	[X86] Use ADD/SUB instead of INC/DEC for Silvermont According to Intel Software Optimization Manual on Silvermont INC or DEC instructions require an additional uop to merge the flags. As a result, a branch instruction depending on an INC or a DEC instruction incurs a 1 cycle penalty. Differential Revision: http://reviews.llvm.org/D3990 llvm-svn: 210466	2014-06-09 11:40:41 +00:00
Craig Topper	66f09ad041	[C++11] Use 'nullptr'. llvm-svn: 210442	2014-06-08 22:29:17 +00:00
Benjamin Kramer	d0700b2919	X86: Don't turn shifts into ands if there's another use that may not check for equality. Fixes PR19964. llvm-svn: 210371	2014-06-06 21:08:55 +00:00
Filipe Cabecinhas	5181255696	Fixed a bug in lowering shuffle_vectors to insertps Summary: We were being too strict and not accounting for undefs. Added a test case and fixed another one where we improved codegen. Reviewers: grosbach, nadav, delena Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D4039 llvm-svn: 210361	2014-06-06 18:07:06 +00:00
Andrea Di Biagio	4760813831	[X86] Fix checked arithmetic for i8 on X86. When lowering a ISD::BRCOND into a test+branch, make sure that we always use the correct condition code to emit the test operation. This fixes PR19858: "i8 checked mul is wrong on x86". Patch by Keno Fisher! llvm-svn: 210032	2014-06-02 16:00:27 +00:00
Eric Christopher	8995833a34	Have the TLOF creation take a Triple rather than needing a subtarget. llvm-svn: 209937	2014-05-31 00:07:32 +00:00
Andrea Di Biagio	446a527905	[X86] Add two combine rules to simplify dag nodes introduced during type legalization when promoting nodes with illegal vector type. This patch teaches the backend how to simplify/canonicalize dag node sequences normally introduced by the backend when promoting certain dag nodes with illegal vector type. This patch adds two new combine rules: 1) fold (shuffle (bitcast (BINOP A, B)), Undef, <Mask>) -> (shuffle (BINOP (bitcast A), (bitcast B)), Undef, <Mask>) 2) fold (BINOP (shuffle (A, Undef, <Mask>)), (shuffle (B, Undef, <Mask>))) -> (shuffle (BINOP A, B), Undef, <Mask>). Both rules are only triggered on the type-legalized DAG. In particular, rule 1. is a target specific combine rule that attempts to sink a bitconvert into the operands of a binary operation. Rule 2. is a target independet rule that attempts to move a shuffle immediately after a binary operation. llvm-svn: 209930	2014-05-30 23:17:53 +00:00
Filipe Cabecinhas	d3aebaf875	Separate the check for blend shuffle_vector masks Summary: Separate the check for blend shuffle_vector masks into isBlendMask. This function will also be used to check if a vector shuffle is legal. No change in functionality was intended, but we ended up improving codegen on two tests, which were being (more) optimized only if the resulting shuffle was legal. Reviewers: nadav, delena, andreadb Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3964 llvm-svn: 209923	2014-05-30 21:31:21 +00:00
Rafael Espindola	a31f3e50dc	Delete dead code. GV is never used past this point. This was probably a copy and paste error. llvm-svn: 209518	2014-05-23 15:07:51 +00:00
Andrea Di Biagio	c8dd1ad85b	[X86] Improve the lowering of BITCAST from MVT::f64 to MVT::v4i16/MVT::v8i8. This patch teaches the x86 backend how to efficiently lower ISD::BITCAST dag nodes from MVT::f64 to MVT::v4i16 (and vice versa), and from MVT::f64 to MVT::v8i8 (and vice versa). This patch extends the logic from revision 208107 to also handle MVT::v4i16 and MVT::v8i8. Also, this patch correctly propagates Undef values when performing the widening of a vector (example: when widening from v2i32 to v4i32, the upper 64bits of the resulting vector are 'undef'). llvm-svn: 209451	2014-05-22 16:21:39 +00:00
Quentin Colombet	b4d53f1afa	[X86] Fix a bug in the lowering of BLENDI introduced in r209043. ISD::VSELECT mask uses 1 to identify the first argument and 0 to identify the second argument. On the other hand, BLENDI uses 0 to identify the first argument and 1 to identify the second argument. Fix the generation of the blend mask to account for this difference. The bug did not show up with r209043, because we were not checking for the actual arguments of the blend instruction! This commit also fixes the test cases. Note: The same mask works for the BLENDr variant because the arguments are swapped during instruction selection (see the BLENDXXrr patterns). <rdar://problem/16975435> llvm-svn: 209324	2014-05-21 22:00:39 +00:00
Simon Atanasyan	e7fa2314af	Add parentheses to suppress the gcc warning '-Wparentheses'. No functional changes. llvm-svn: 209203	2014-05-20 10:23:04 +00:00
Filipe Cabecinhas	dc92102766	Added more insertps optimizations Summary: When inserting an element that's coming from a vector load or a broadcast of a vector (or scalar) load, combine the load into the insertps instruction. Added PerformINSERTPSCombine for the case where we need to fix the load (load of a vector + insertps with a non-zero CountS). Added patterns for the broadcasts. Also added tests for SSE4.1, AVX, and AVX2. Reviewers: delena, nadav, craig.topper Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3581 llvm-svn: 209156	2014-05-19 19:45:57 +00:00
Benjamin Kramer	f3ad23551d	SDAG: Legalize vector BSWAP into a shuffle if the shuffle is legal but the bswap not. - On ARM/ARM64 we get a vrev because the shuffle matching code is really smart. We still unroll anything that's not v4i32 though. - On X86 we get a pshufb with SSSE3. Required more cleverness in isShuffleMaskLegal. - On PPC we get a vperm for v8i16 and v4i32. v2i64 is unrolled. llvm-svn: 209123	2014-05-19 13:12:38 +00:00
Saleem Abdulrasool	f3a5a5c546	Target: remove old constructors for CallLoweringInfo This is mostly a mechanical change changing all the call sites to the newer chained-function construction pattern. This removes the horrible 15-parameter constructor for the CallLoweringInfo in favour of setting properties of the call via chained functions. No functional change beyond the removal of the old constructors are intended. llvm-svn: 209082	2014-05-17 21:50:17 +00:00
Chandler Carruth	c85473143c	[x86] Fix a bad predicate I spotted by inspection -- pshufhw and pshuflw were added in SSE2, no SSSE3. Found this while auditing all uses of SSSE3 in the X86 target. I don't actually expect this to make a significant difference on anything and I don't have any detailed test cases but I updated the existing test cases that already covered some of this code path. llvm-svn: 209056	2014-05-17 03:29:20 +00:00
Filipe Cabecinhas	89654da069	Implemented special cases for PerformVSELECTCombine. vselects with constant masks, after legalization, will get turned into specialized shuffle_vectors so they can be matched to blend+imm instructions. Fixed some tests. llvm-svn: 209044	2014-05-16 22:47:54 +00:00
Filipe Cabecinhas	e15551832c	Lower vselects into X86ISD::BLENDI when appropriate. LowerVSELECT will, if possible, generate a X86ISD::BLENDI DAG node if the condition is constant and we can emit that instruction, given the subtarget. This is not enough for all cases. An additional SELECTCombine optimization will be committed. Fixed tests that were expecting variable blends but where a blend+imm can be generated. Added test where we can't emit blend+immediate. Added avx2 blend+imm tests. llvm-svn: 209043	2014-05-16 22:47:49 +00:00
Filipe Cabecinhas	17254aaead	Implemented LowerVSELECT to custom lower some instructions. No functionality change intended. The types that previously were set to lower as Expand or Legal are doing the same thing with this lowering function. llvm-svn: 209042	2014-05-16 22:47:43 +00:00
Rafael Espindola	e0098928c9	Delete getAliasedGlobal. llvm-svn: 209040	2014-05-16 22:37:03 +00:00
Andrea Di Biagio	d621120533	[X86] Teach the backend how to fold SSE4.1/AVX/AVX2 blend intrinsics. Added target specific combine rules to fold blend intrinsics according to the following rules: 1) fold(blend A, A, Mask) -> A; 2) fold(blend A, B, <allZeros>) -> A; 3) fold(blend A, B, <allOnes>) -> B. Added two new tests to verify that the new folding rules work for all the optimized blend intrinsics. llvm-svn: 208895	2014-05-15 15:18:15 +00:00
Alp Toker	beaca19c7c	Fix typos llvm-svn: 208839	2014-05-15 01:52:21 +00:00
Jay Foad	a0653a3e6c	Rename ComputeMaskedBits to computeKnownBits. "Masked" has been inappropriate since it lost its Mask parameter in r154011. llvm-svn: 208811	2014-05-14 21:14:37 +00:00
Reid Kleckner	7a59e0845f	Try to fix an SDAG dependence issue with sret r208453 added support for having sret on the second parameter. In that change, the code for copying sret into a virtual register was hoisted into the loop that lowers formal parameters. This caused a "Wrong topological sorting" assertion failure during scheduling when a parameter is passed in memory. This change undoes that by creating a second loop that deals with sret. I'm worried that this fix is incomplete. I don't fully understand the dependence issues. However, with this change we produce the same DAGs we used to produce, so if they are broken, they are just as broken as they have always been. llvm-svn: 208637	2014-05-12 22:01:27 +00:00
Aaron Ballman	29fd7b9b20	Silencing an MSVC warning about not all control paths returning a value (even though the switch is fully covered). No functional change. llvm-svn: 208565	2014-05-12 14:22:58 +00:00
Benjamin Kramer	3b36b72a9c	X86: Make sure that we have SSE4.1 before we generate insertps nodes. PR19721. llvm-svn: 208552	2014-05-12 13:12:08 +00:00
NAKAMURA Takumi	6c383021c9	X86ISelLowering.cpp:LowerINTRINSIC_W_CHAIN(): Prune impossible "default:" [-Wcovered-switch-default] llvm-svn: 208533	2014-05-12 10:16:46 +00:00
Elena Demikhovsky	4f591c0d45	Fixed compilation issue llvm-svn: 208524	2014-05-12 07:45:41 +00:00
Elena Demikhovsky	8e8fde8e93	AVX-512: changes in intrinsics 1) Changed gather and scatter intrinsics. Now they are aligned with GCC built-ins. There is no more non-masked form. Masked intrinsic receives -1 if all lanes are executed. 2) I changed the function that works with intrinsics inside X86ISelLowering.cpp. I put all intrinsics in one table. I did it for INTRINSICS_W_CHAIN and plan to put all intrinsics from WO_CHAIN set to the same table in order to avoid the long-long "switch". (I wanted to use static map initialization that allowed by C++11 but I wasn't able to compile it on VS2012). 3) I added gather/scatter prefetch intrinsics. 4) I fixed MRMm encoding for masked instructions. llvm-svn: 208522	2014-05-12 07:18:51 +00:00
Hal Finkel	f0e086a0bc	Pass the value type to TLI::getRegisterByName We must validate the value type in TLI::getRegisterByName, because if we don't and the wrong type was used with the IR intrinsic, then we'll assert (because we won't be able to find a valid register class with which to construct the requested copy operation). For PPC64, additionally, the type information is necessary to decide between the 64-bit register and the 32-bit subregister. No functionality change. llvm-svn: 208508	2014-05-11 19:29:07 +00:00
Filipe Cabecinhas	0e3d1cb5d6	Fixed a bug when lowering build_vector (PR19694) When lowering build_vector to an insertps, we would still lower it, even if the source vectors weren't v4x32. This would break on avx if the source was a v8x32. We now check the type of the source vectors. llvm-svn: 208487	2014-05-11 08:12:56 +00:00
Reid Kleckner	7941856445	Allow sret on the second parameter as well as the first MSVC always places the implicit sret parameter after the implicit this parameter of instance methods. We used to handle this for x86_thiscallcc by allocating the sret parameter on the stack and leaving the this pointer in ecx, but that doesn't handle alternative calling conventions like cdecl, stdcall, fastcall, or the win64 convention. Instead, change the verifier to allow sret on the second parameter. This also requires changing the Mips and X86 backends to return the argument with the sret parameter, instead of assuming that the sret parameter comes first. The Sparc backend also returns sret parameters in a register, but I wasn't able to update it to handle secondary sret parameters. It currently calls report_fatal_error if you feed it an sret in the second parameter. Reviewers: rafael.espindola, majnemer Differential Revision: http://reviews.llvm.org/D3617 llvm-svn: 208453	2014-05-09 22:32:13 +00:00
Andrea Di Biagio	932ff1ddd7	Fix 80 col violation. No functional change intended. llvm-svn: 208405	2014-05-09 11:08:23 +00:00
Filipe Cabecinhas	e4b482b3ed	Optimize shufflevector that copies an i64/f64 and zeros the rest. Summary: Also ran clang-format on the function. The code added is the last else if block. Reviewers: nadav, craig.topper, delena Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3518 llvm-svn: 208372	2014-05-08 23:16:08 +00:00
Andrea Di Biagio	e85ba4df52	[X86] Add target specific combine rules to fold SSE2/AVX2 packed arithmetic shift intrinsics. This patch teaches the backend how to combine packed SSE2/AVX2 arithmetic shift intrinsics. The rules are: - Always fold a packed arithmetic shift by zero to its first operand; - Convert a packed arithmetic shift intrinsic dag node into a ISD::SRA only if the shift count is known to be smaller than the vector element size. This patch also teaches to function 'getTargetVShiftByConstNode' how fold target specific vector shifts by zero. Added two new tests to verify that the DAGCombiner is able to fold sequences of SSE2/AVX2 packed arithmetic shift calls. llvm-svn: 208342	2014-05-08 17:44:04 +00:00
Filipe Cabecinhas	095d9d573a	Lower certain build_vectors to insertps instructions Summary: Vectors built with zeros and elements in the same order as another (source) vector are optimized to be built using a single insertps instruction. Also optimize when we move one element in a vector to a different place in that vector while zeroing out some of the other elements. Further optimizations are possible, described in TODO comments. I will be implementing at least some of them in the near future. Added some tests for different cases where this optimization triggers. Reviewers: nadav, delena, craig.topper Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3521 llvm-svn: 208271	2014-05-08 00:25:16 +00:00
Andrea Di Biagio	c14ccc9184	[X86] Improve the lowering of BITCAST dag nodes from type f64 to type v2i32 (and vice versa). Before this patch, the backend always emitted a store+load sequence to bitconvert from f64 to i64 the input operand of a ISD::BITCAST dag node that performed a bitconvert from type MVT::f64 to type MVT::v2i32. The resulting i64 node was then used to build a v2i32 vector. With this patch, the backend now produces a cheaper SCALAR_TO_VECTOR from MVT::f64 to MVT::v2f64. That SCALAR_TO_VECTOR is then followed by a "free" bitcast to type MVT::v4i32. The elements of the resulting v4i32 are then extracted to build a v2i32 vector (which is illegal and therefore promoted to MVT::v2i64). This is in general cheaper than emitting a stack store+load sequence to bitconvert the operand from type f64 to type i64. llvm-svn: 208107	2014-05-06 17:09:03 +00:00
Renato Golin	c7aea40ec6	Implememting named register intrinsics This patch implements the infrastructure to use named register constructs in programs that need access to specific registers (bare metal, kernels, etc). So far, only the stack pointer is supported as a technology preview, but as it is, the intrinsic can already support all non-allocatable registers from any architecture. llvm-svn: 208104	2014-05-06 16:51:25 +00:00
Reid Kleckner	4a406d32e9	Fix i128 div/mod on mingw64 The Win64 docs are very clear that anything larger than 8 bytes is passed by reference, and GCC MinGW64 honors that for __modti3 and friends. Patch by Jameson Nash! llvm-svn: 208029	2014-05-06 01:20:42 +00:00
Filipe Cabecinhas	fe59062b75	Revert "Optimize shufflevector that copies an i64/f64 and zeros the rest." This reverts commit 207992. I misread the phab number on the LGTM. llvm-svn: 207993	2014-05-05 19:40:36 +00:00
Filipe Cabecinhas	263d98c19f	Optimize shufflevector that copies an i64/f64 and zeros the rest. Summary: Also ran clang-format on the function. The code added is the last else if block. Reviewers: nadav, craig.topper Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3518 llvm-svn: 207992	2014-05-05 19:36:28 +00:00
Craig Topper	2d2aa0ca1f	Use makeArrayRef insted of calling ArrayRef<T> constructor directly. I introduced most of these recently. llvm-svn: 207616	2014-04-30 07:17:30 +00:00
Reid Kleckner	fb69308568	Implement X86 code generation for musttail Currently, musttail codegen is relying on sibcall optimization, and reporting a fatal error if fails. Sibcall optimization fails when stack arguments need to be modified, which is insufficient for musttail. The logic for moving arguments in memory safely is already implemented for GuaranteedTailCallOpt. This change merely arranges for musttail calls to use it. No functional change for GuaranteedTailCallOpt. Reviewers: espindola Differential Revision: http://reviews.llvm.org/D3493 llvm-svn: 207598	2014-04-29 23:55:41 +00:00
Elena Demikhovsky	299cf511c4	AVX-512: optimized a shuffle pattern to VINSERTI64x4. Added intrinsics for VPERMT2PS/PD/D/Q instructions. llvm-svn: 207513	2014-04-29 09:09:15 +00:00
Quentin Colombet	50efe87e5b	[X86] Add more details in the comments of X86TargetLowering::getScalingFactorCost. llvm-svn: 207432	2014-04-28 18:39:57 +00:00
Craig Topper	e73658ddbb	[C++] Use 'nullptr'. llvm-svn: 207394	2014-04-28 04:05:08 +00:00
Craig Topper	dd5e16dd34	Convert one last signature of getNode to take an ArrayRef of SDUse. llvm-svn: 207376	2014-04-27 19:21:06 +00:00
Craig Topper	64941d9786	Convert SelectionDAG::getMergeValues to use ArrayRef. llvm-svn: 207374	2014-04-27 19:20:57 +00:00
Benjamin Kramer	3693e77cb4	X86: If SSE4.1 is missing lower SMUL_LOHI of v4i32 to pmuludq and fix up the high parts. This is more expensive than pmuldq but still cheaper than scalarizing the whole thing. llvm-svn: 207370	2014-04-27 18:47:41 +00:00
Craig Topper	206fcd450a	Convert getMemIntrinsicNode to take ArrayRef of SDValue instead of pointer and size. llvm-svn: 207329	2014-04-26 19:29:41 +00:00
Craig Topper	48d114bed1	Convert SelectionDAG::getNode methods to use ArrayRef<SDValue>. llvm-svn: 207327	2014-04-26 18:35:24 +00:00
Benjamin Kramer	c2ad8f3ef1	Print X86ISD::PMULDQ nodes properly in debug output. llvm-svn: 207322	2014-04-26 16:26:41 +00:00
Benjamin Kramer	6d2dff61f9	X86: Lower SMUL_LOHI of v4i32 to pmuldq when SSE4.1 is available. llvm-svn: 207318	2014-04-26 14:12:19 +00:00
Benjamin Kramer	c9827ab103	X86: Add patterns for MULHU/MULHS of v8i16 and v16i16. This gets us pretty code for divs of i16 vectors. Turn the existing intrinsics into the corresponding nodes. llvm-svn: 207317	2014-04-26 13:01:03 +00:00
Benjamin Kramer	ad0168702a	Rip out X86-specific vector SDIV lowering, make the corresponding DAGCombiner transform work on vectors. llvm-svn: 207316	2014-04-26 13:00:53 +00:00
Benjamin Kramer	29139d5cb5	X86: Custom lower v4i32 UMUL_LOHI into 2 pmuludqs. Test will follow soon. llvm-svn: 207314	2014-04-26 12:06:11 +00:00
Quentin Colombet	ea18933d97	[X86] Implement TargetLowering::getScalingFactorCost hook. Scaling factors are not free on X86 because every "complex" addressing mode breaks the related instruction into 2 allocations instead of 1. <rdar://problem/16730541> llvm-svn: 207301	2014-04-26 01:11:26 +00:00
Filipe Cabecinhas	363b570d2a	Optimization for certain shufflevector by using insertps. Summary: If we're doing a v4f32/v4i32 shuffle on x86 with SSE4.1, we can lower certain shufflevectors to an insertps instruction: When most of the shufflevector result's elements come from one vector (and keep their index), and one element comes from another vector or a memory operand. Added tests for insertps optimizations on shufflevector. Added support and tests for v4i32 vector optimization. Reviewers: nadav Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3475 llvm-svn: 207291	2014-04-25 23:51:17 +00:00
Craig Topper	062a2baef0	[C++] Use 'nullptr'. Target edition. llvm-svn: 207197	2014-04-25 05:30:21 +00:00
Benjamin Kramer	76f753e9a9	X86: Don't transform shifts into ands when the sign bit is tested. Should unbreak MultiSource/Benchmarks/mediabench/g721/g721encode/encode. llvm-svn: 207145	2014-04-24 20:51:37 +00:00
Reid Kleckner	5772b77789	Add 'musttail' marker to call instructions This is similar to the 'tail' marker, except that it guarantees that tail call optimization will occur. It also comes with convervative IR verification rules that ensure that tail call optimization is possible. Reviewers: nicholas Differential Revision: http://llvm-reviews.chandlerc.com/D3240 llvm-svn: 207143	2014-04-24 20:14:34 +00:00
Andrea Di Biagio	d1ab866868	[X86] Add support for Read Time Stamp Counter x86 builtin intrinsics. This patch: - Adds two new X86 builtin intrinsics ('int_x86_rdtsc' and 'int_x86_rdtscp') as GCCBuiltin intrinsics; - Teaches the backend how to lower the two new builtins; - Introduces a common function to lower READCYCLECOUNTER dag nodes and the two new rdtsc/rdtscp intrinsics; - Improves (and extends) the existing x86 test 'rdtsc.ll'; now test 'rdtsc.ll' correctly verifies that both READCYCLECOUNTER and the two new intrinsics work fine for both 64bit and 32bit Subtargets. llvm-svn: 207127	2014-04-24 17:18:27 +00:00
Benjamin Kramer	f4575db2fd	X86: Emit test instead of constant shift + compare if the shift result is unused. This allows us to compile return (mask & 0x8 ? a : b); into testb $8, %dil cmovnel %edx, %esi instead of andl $8, %edi shrl $3, %edi cmovnel %edx, %esi which we formed previously because dag combiner canonicalizes setcc of and into shift. llvm-svn: 207088	2014-04-24 08:15:31 +00:00
Elena Demikhovsky	acc5c9e83e	AVX-512: store and truncstore for i1 values llvm-svn: 206897	2014-04-22 14:13:10 +00:00
Lang Hames	3067ab2344	[X86] Use tablegen instead of DAG combines to match BZHI instructions, as suggested by Ben Kramer in review of r206738. Thanks again Ben! llvm-svn: 206879	2014-04-22 10:41:56 +00:00
Lang Hames	f6f42cac3f	[X86] Don't use BZHI for short masks (>=32 bits). Thanks to Ben Kramer for the review. llvm-svn: 206869	2014-04-22 07:40:34 +00:00
Chandler Carruth	84e68b2994	[Modules] Fix potential ODR violations by sinking the DEBUG_TYPE definition below all of the header #include lines, lib/Target/... edition. llvm-svn: 206842	2014-04-22 02:41:26 +00:00
Lang Hames	5aa6ee80b6	[X86] ISEL (and X, <constant mask>) to BZHI when BMI2 is available. Generating BZHI in the variable mask case, i.e. (and X, (sub (shl 1, N), 1)), was already supported, but we were missing the constant-mask case. This patch fixes that. <rdar://problem/15480077> llvm-svn: 206738	2014-04-21 08:18:53 +00:00
Adam Nemet	ee7a3e38c9	[X86] Improve buildFromShuffleMostly for AVX For a 256-bit BUILD_VECTOR consisting mostly of shuffles of 256-bit vectors, both the BUILD_VECTOR and its operands may need to be legalized in multiple steps. Consider: (v8f32 (BUILD_VECTOR (extract_vector_elt (v8f32 %vreg0,) Constant<1>), (extract_vector_elt %vreg0, Constant<2>), (extract_vector_elt %vreg0, Constant<3>), (extract_vector_elt %vreg0, Constant<4>), (extract_vector_elt %vreg0, Constant<5>), (extract_vector_elt %vreg0, Constant<6>), (extract_vector_elt %vreg0, Constant<7>), %vreg1)) a. We can't build a 256-bit vector efficiently so, we need to split it into two 128-bit vecs and combine them with VINSERTX128. b. Operands like (extract_vector_elt (v8f32 %vreg0), Constant<7>) needs to be split into a VEXTRACTX128 and a further extract_vector_elt from the resulting 128-bit vector. c. The extract_vector_elt from b. is lowered into a shuffle to the first element and a movss. Depending on the order in which we legalize the BUILD_VECTOR and its operands[1], buildFromShuffleMostly may be faced with: (v4f32 (BUILD_VECTOR (extract_vector_elt (vector_shuffle<1,u,u,u> (extract_subvector %vreg0, Constant<4>), undef), Constant<0>), (extract_vector_elt (vector_shuffle<2,u,u,u> (extract_subvector %vreg0, Constant<4>), undef), Constant<0>), (extract_vector_elt (vector_shuffle<3,u,u,u> (extract_subvector %vreg0, Constant<4>), undef), Constant<0>), %vreg1)) In order to figure out the underlying vector and their identity we need to see through the shuffles. [1] Note that the order in which operations and their operands are legalized is only guaranteed in the first iteration of LegalizeDAG. Fixes <rdar://problem/16296956> llvm-svn: 206634	2014-04-18 19:44:16 +00:00
Andrea Di Biagio	aac2eac4c2	[X86] Improve the lowering of packed shifts by constant build_vector. This patch teaches the backend how to efficiently lower logical and arithmetic packed shifts on both SSE and AVX/AVX2 machines. When possible, instead of scalarizing a vector shift, the backend should try to expand the shift into a sequence of two packed shifts by immedate count followed by a MOVSS/MOVSD. Example (v4i32 (srl A, (build_vector < X, Y, Y, Y>))) Can be rewritten as: (v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>))) [with X and Y ConstantInt] The advantage is that the two new shifts from the example would be lowered into X86ISD::VSRLI nodes. This is always cheaper than scalarizing the vector into four scalar shifts plus four pairs of vector insert/extract. llvm-svn: 206316	2014-04-15 19:30:48 +00:00
Nick Lewycky	aad475b324	Break PseudoSourceValue out of the Value hierarchy. It is now the root of its own tree containing FixedStackPseudoSourceValue (which you can use isa/dyn_cast on) and MipsCallEntry (which you can't). Anything that needs to use either a PseudoSourceValue* and Value* is strongly encouraged to use a MachinePointerInfo instead. llvm-svn: 206255	2014-04-15 07:22:52 +00:00
David Blaikie	9027abae53	Change argument order and add explanatory comment to r206130 Changes requested in code review by Eric Christopher of r206130. llvm-svn: 206219	2014-04-14 22:23:06 +00:00
David Blaikie	269e0fb2e4	Fix instruction debug info location during legalization I found this from a particular GDB test suite case of inlining (something similar is provided as a test case) but came across a few other related cases (other callers of the same functions, and one other instance of the same coding mistake in a separate function). I'm not sure what the best way to test this is (let alone to cover the other cases I discovered), so hopefully this sufficies - open to ideas. llvm-svn: 206130	2014-04-13 06:39:55 +00:00
Reid Kleckner	9c6582129a	Move the segmented stack switch to a function attribute This removes the -segmented-stacks command line flag in favor of a per-function "split-stack" attribute. Patch by Luqman Aden and Alex Crichton! llvm-svn: 205997	2014-04-10 22:58:43 +00:00
Elena Demikhovsky	cf0b9bafc3	AVX-512: insert element to mask vector; store i1 data Implemented INSERT_VECTOR_ELT operation for v16i1 and v8i1 vectors; Implemented "store" for i1 type llvm-svn: 205850	2014-04-09 12:37:50 +00:00
Elena Demikhovsky	3dcfbdfa54	AVX-512: Added fp_to_uint and uint_to_fp patterns. llvm-svn: 205754	2014-04-08 07:24:02 +00:00
Matt Arsenault	cf6f688a40	Add DAG parameter to ComputeNumSignBitsForTargetNode This way, you can check the number of sign bits in the operands. The depth parameter it already has is pretty useless without this. llvm-svn: 205649	2014-04-04 20:13:13 +00:00
Craig Topper	840beec2d0	Make consistent use of MCPhysReg instead of uint16_t throughout the tree. llvm-svn: 205610	2014-04-04 05:16:06 +00:00
Yaron Keren	2895496852	Added isTargetWindowsMSVC(), renamed isTargetMingw() to isTargetWindowsGNU() and isTargetCygwin() to isTargetWindowsCygwin() to be consistent with the four Windows environments in Triple.h. Suggestion by Saleem Abdulrasool! llvm-svn: 205393	2014-04-02 04:27:51 +00:00
Yaron Keren	48d68d439a	If isKnownWindowsMSVCEnvironment then getOS == Triple::Win32 and Environment == Triple::MSVC so it will never be MinGW or Cygwin. llvm-svn: 205349	2014-04-01 18:52:55 +00:00
Yaron Keren	136fe7db46	isTargetWindows() renamed to isTargetKnownWindowsMSVC() to reflect its current functionality. Based on Takumi NAKAMURA suggestion. llvm-svn: 205338	2014-04-01 18:15:34 +00:00
Aaron Ballman	8bf5a548ea	Attempting to fix r205124, which had failed asserts when built with MSVC. Suggestion from Yaron Keren. llvm-svn: 205313	2014-04-01 13:56:35 +00:00
Alexey Volkov	1328b28dc6	[x86] Do not convert to cmp32 for Atom arch by Sergey Okunev Differential Revision: http://llvm-reviews.chandlerc.com/D2824 llvm-svn: 205288	2014-04-01 08:13:07 +00:00
Rafael Espindola	24a669d225	Prevent alias from pointing to weak aliases. This adds back r204781. Original message: Aliases are just another name for a position in a file. As such, the regular symbol resolutions are not applied. For example, given define void @my_func() { ret void } @my_alias = alias weak void ()* @my_func @my_alias2 = alias void ()* @my_alias We produce without this patch: .weak my_alias my_alias = my_func .globl my_alias2 my_alias2 = my_alias That is, in the resulting ELF file my_alias, my_func and my_alias are just 3 names pointing to offset 0 of .text. That is not the semantics of IR linking. For example, linking in a @my_alias = alias void ()* @other_func would require the strong my_alias to override the weak one and my_alias2 would end up pointing to other_func. There is no way to represent that with aliases being just another name, so the best solution seems to be to just disallow it, converting a miscompile into an error. llvm-svn: 204934	2014-03-27 15:26:56 +00:00
Rafael Espindola	65481d7b97	Revert "Prevent alias from pointing to weak aliases." This reverts commit r204781. I will follow up to with msan folks to see what is what they were trying to do with aliases to weak aliases. llvm-svn: 204784	2014-03-26 06:14:40 +00:00
Rafael Espindola	3b712a84a9	Prevent alias from pointing to weak aliases. Aliases are just another name for a position in a file. As such, the regular symbol resolutions are not applied. For example, given define void @my_func() { ret void } @my_alias = alias weak void ()* @my_func @my_alias2 = alias void ()* @my_alias We produce without this patch: .weak my_alias my_alias = my_func .globl my_alias2 my_alias2 = my_alias That is, in the resulting ELF file my_alias, my_func and my_alias are just 3 names pointing to offset 0 of .text. That is not the semantics of IR linking. For example, linking in a @my_alias = alias void ()* @other_func would require the strong my_alias to override the weak one and my_alias2 would end up pointing to other_func. There is no way to represent that with aliases being just another name, so the best solution seems to be to just disallow it, converting a miscompile into an error. llvm-svn: 204781	2014-03-26 04:48:47 +00:00
Adam Nemet	4beef4c90d	[X86] Generate VPSHUFB for in-place v16i16 shuffles This used to resort to splitting the 256-bit operation into two 128-bit shuffles and then recombining the results. Fixes <rdar://problem/16167303> llvm-svn: 204735	2014-03-25 17:47:06 +00:00
Adam Nemet	ac6d6383a3	[X86] Factor out new helper getPSHUFB I found three implementations of this. This splits it out into a new function and uses it from the three places. My plan is to add a fourth use when lowering a vector_shuffle:v16i16. Compared the assembly output of test/CodeGen/X86 before and after. The only change is due to how the first PSHUFB was generated in LowerVECTOR_SHUFFLEv8i16. If the shuffle mask specified undef (i.e. -1), the old implementation would write -1 * 2 and -1 * 2 + 1 (254 and 255) in the control mask. Now we write 0x80. These are of course interchangeable since bit 7 decides if a constant zero is written in the result byte. The other instances of this code use 0x80 consistently. Related to <rdar://problem/16167303> llvm-svn: 204734	2014-03-25 17:47:03 +00:00
Adam Nemet	b47372f555	[X86] Fix non-determinism in LowerVectorAllZeroTest This can be observed with the old testcase of CodeGen/X86/pr12312.ll: 47c47 < vorps %ymm0, %ymm1, %ymm0 --- > vorps %ymm1, %ymm0, %ymm0 97c97 < vorps %ymm1, %ymm0, %ymm0 --- > vorps %ymm0, %ymm1, %ymm0 The vector VecIns is populated with all the values from VecInMap. This is done while iterating VecInMap. VecInMap uses a hash of pointer values so the resulting order can vary depending on the memory layout. The fix is to populate the vector VecIns earlier as VecInMap is populated. This is done in DAG traversal order. Fixes <rdar://problem/16398806> llvm-svn: 204623	2014-03-24 16:52:08 +00:00
Craig Topper	c6d4efa1e5	Prune includes in X86 target. llvm-svn: 204216	2014-03-19 06:53:25 +00:00
Adam Nemet	8a130a5f86	[X86] Fix unused variable warning with NDEBUG from r204058 llvm-svn: 204063	2014-03-17 17:32:53 +00:00
Adam Nemet	24381f1cb7	[VectorLegalizer/X86] Don't unvectorize fp_to_uint for v8f32->v8i16 Rather than LegalizeAction::Expand, this needs LegalizeAction::Promote to get promoted to fp_to_sint v8f32->v8i32. This is a legal operation on AVX. For that to work properly, we also need to teach the legalizer about the specific promotion required here. The default vector promotion uses bitcasting to a vector type of the same total size. We want to promote the vector element type, effectively widening the operation and then truncating the result. This is analogous to the current logic of how int_to_fp is promoted. The change also factors out some code from the int_to_fp promotion code to ValueType::widenIntegerVectorElementType. This is now shared between int_to_fp and fp_to_int. There is no longer need for the custom lowering of fp_to_sint f32->v8i16 in X86. It can now go through the new target-independent fp_to_*int promotion logic. I also checked that no other target uses Promote for these ops yet, so there shouldn't be any unexpected change in behavior. Fixes <rdar://problem/16202247> llvm-svn: 204058	2014-03-17 17:06:14 +00:00
Arnaud A. de Grandmaison	75c9e6dedf	Remove some dead assignements found by scan-build llvm-svn: 204013	2014-03-15 22:13:15 +00:00
Owen Anderson	16c6bf49b7	Phase 2 of the great MachineRegisterInfo cleanup. This time, we're changing operator* on the by-operand iterators to return a MachineOperand& rather than a MachineInstr&. At this point they almost behave like normal iterators! Again, this requires making some existing loops more verbose, but should pave the way for the big range-based for-loop cleanups in the future. llvm-svn: 203865	2014-03-13 23:12:04 +00:00
Hans Wennborg	6c37f8b985	X86: Don't generate 64-bit movd after cmpneqsd in 32-bit mode (PR19059) This fixes the bug where we would bitcast the 64-bit floating point result of cmpneqsd to a 64-bit integer even on 32-bit targets. Differential Revision: http://llvm-reviews.chandlerc.com/D3009 llvm-svn: 203581	2014-03-11 15:49:24 +00:00
Tim Northover	e94a518a22	IR: add a second ordering operand to cmpxhg for failure The syntax for "cmpxchg" should now look something like: cmpxchg i32* %addr, i32 42, i32 3 acquire monotonic where the second ordering argument gives the required semantics in the case that no exchange takes place. It should be no stronger than the first ordering constraint and cannot be either "release" or "acq_rel" (since no store will have taken place). rdar://problem/15996804 llvm-svn: 203559	2014-03-11 10:48:52 +00:00
Jim Grosbach	c94d993adf	X86: Enable ISel of 16-bit MOVBE instructions. When the MOVBE instructions are available, use them for 16-bit endian swapping as well as for 32 and 64 bit. The patterns were already present on the instructions, but weren't being matched because the operation was unconditionally marked to 'Expand.' Change that to be conditional on whether the MOVBE instructions are available. Use 'rolw' to implement the in-register version (32 and 64 bit have the dedicated 'bswap' instruction for that). Patch by Louis Gerbarg <lgg@apple.com>. rdar://15479984 llvm-svn: 203524	2014-03-11 00:44:14 +00:00
Cameron McInally	791ae9927c	Lower AVX v4i64->v4i32 truncate to one shuffle. llvm-svn: 202996	2014-03-05 19:41:16 +00:00
Chandler Carruth	219b89b987	[Modules] Move CallSite into the IR library where it belogs. It is abstracting between a CallInst and an InvokeInst, both of which are IR concepts. llvm-svn: 202816	2014-03-04 11:01:28 +00:00
Benjamin Kramer	b6d0bd48bd	[C++11] Replace llvm::next and llvm::prior with std::next and std::prev. Remove the old functions. llvm-svn: 202636	2014-03-02 12:27:27 +00:00
Elena Demikhovsky	9737e3886b	AVX-512: Fixed extract_vector_elt for v8i1 vector llvm-svn: 202624	2014-03-02 09:19:44 +00:00
Quentin Colombet	85c9e16291	Lower unsigned vsetcc to psubus in certain cases The current approach to lower a vsetult is to flip the sign bit of the operands, swap the operands and then use a (signed) pcmpgt. psubus (unsigned saturating subtract) can be used to emulate a vsetult more efficiently: + case ISD::SETULT: { + // If the comparison is against a constant we can turn this into a + // setule. With psubus, setule does not require a swap. This is + // beneficial because the constant in the register is no longer + // destructed as the destination so it can be hoisted out of a loop. I also enable lowering via psubus in a few other cases where it's clearly beneficial: setule and setuge if minu/maxu cannot be used. rdar://problem/14338765 Patch by Adam Nemet <anemet@apple.com>. llvm-svn: 202301	2014-02-26 21:39:12 +00:00
Tim Northover	aeb8e06d4c	X86 CodeGenPrep: sink shufflevectors before shifts On x86, shifting a vector by a scalar is significantly cheaper than shifting a vector by another fully general vector. Unfortunately, because SelectionDAG operates on just one basic block at a time, the shufflevector instruction that reveals whether the right-hand side of a shift is really a scalar is often not visible to CodeGen when it's needed. This adds another handler to CodeGenPrepare, to sink any useful shufflevector instructions down to the basic block where they're used, predicated on a target hook (since on other architectures, doing so will often just introduce extra real work). rdar://problem/16063505 llvm-svn: 201655	2014-02-19 10:02:43 +00:00
Tim Northover	f06df5866f	X86: use vpsllvd (& friends) for 16-bit shifts on Haswell llvm-svn: 201558	2014-02-18 11:15:32 +00:00
Elena Demikhovsky	1fad075974	AVX-512: simpyfied BUILD_VECTOR for masks; fixed cmp/test sequence llvm-svn: 201487	2014-02-16 11:34:23 +00:00
Andrea Di Biagio	386d566395	[X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it. Instead of expanding a packed shift into a sequence of scalar shifts, the backend now tries (when possible) to convert the vector shift into a vector multiply. Before this change, a shift of a MVT::v8i16 vector by a build_vector of constants was always scalarized into a long sequence of "vector extracts + scalar shifts + vector insert". With this change, if there is SSE2 support, we emit a single vector multiply. This change also affects SSE4.1, AVX, AVX2 shifts: - A shift of a MVT::v4i32 vector by a build_vector of non uniform constants is now lowered when possible into a single SSE4.1 vector multiply. - Packed v16i16 shift left by constant build_vector are now expanded when possible into a single AVX2 vpmullw. This change also improves the lowering of AVX512f vector shifts. Added test CodeGen/X86/vec_shift6.ll with some code examples that are affected by this change. llvm-svn: 201271	2014-02-12 23:42:28 +00:00
Elena Demikhovsky	1f32c313f1	AVX: fixed a bug in LowerVECTOR_SHUFFLE llvm-svn: 201140	2014-02-11 10:21:53 +00:00
Elena Demikhovsky	2aafc22ed9	AVX-512: Optimized BUILD_VECTOR pattern; fixed encoding of VEXTRACTPS instruction. llvm-svn: 201134	2014-02-11 07:25:59 +00:00
Elena Demikhovsky	9f423d6f25	AVX-512: Fixed extract_vector_elt for v16i1 and v8i1 vectors. llvm-svn: 201066	2014-02-10 07:02:39 +00:00
Tim Northover	546b57b011	X86: deduplicate V[SZ]EXT_MOVL and V[SZ]EXT nodes I believe VZEXT_MOVL means "zero all vector elements except the first" (and should have identical input & output types) whereas VZEXT means "zero extend each element of a vector (discarding higher elements if necessary)". For example: (v4i32 (vzext (v16i8 ...))) should zero extend the low 4 bytes of the incoming vector to 32-bits, discarding higher bytes. However, somewhere in the past, these two concepts had become confused, even leading to a nonsensical VSEXT_MOVL. This re-merges the nodes where appropriate (all VSEXT_MOVL -> VSEXT, VZEXT_MOVL -> VZEXT when it's an actual extension). rdar://problem/15981990 llvm-svn: 200918	2014-02-06 09:54:51 +00:00
Matt Arsenault	25793a3f22	Add address space argument to allowsUnalignedMemoryAccess. On R600, some address spaces have more strict alignment requirements than others. llvm-svn: 200887	2014-02-05 23:15:53 +00:00
Elena Demikhovsky	0b79be8ab2	AVX-512: optimized icmp -> sext -> icmp pattern llvm-svn: 200849	2014-02-05 16:17:36 +00:00
Craig Topper	7ee163842f	Move matching for x86 BMI BLSI/BLSMSK/BLSR instructions to isel patterns instead of DAG combine. This weakens the ability to fold loads with them because we aren't able to match patterns that load the same thing twice. But maybe we should fix that if we care. The peephole optimizer will be able to fold some loads in its absense. llvm-svn: 200824	2014-02-05 07:09:40 +00:00
Reid Kleckner	f5b76518c9	Implement inalloca codegen for x86 with the new inalloca design Calls with inalloca are lowered by skipping all stores for arguments passed in memory and the initial stack adjustment to allocate argument memory. Now the frontend is responsible for the memory layout, and the backend doesn't have to do any work. As a result these changes are pretty minimal. Reviewers: echristo Differential Revision: http://llvm-reviews.chandlerc.com/D2637 llvm-svn: 200596	2014-01-31 23:50:57 +00:00
Reid Kleckner	c5d9e159eb	x86: Rename NumBytesForCalleeToPush to ...Pop for accuracy If we have a callee cleanup convention, the callee is going to pop the arguments off the stack, not push them on. llvm-svn: 200566	2014-01-31 19:07:18 +00:00
Andrea Di Biagio	2ea61f17ad	[X86] Add extra rules for combining vselect dag nodes into movsd. This improves the fix committed at revision 199683 adding the following new target specific combine rules: 1) fold (v4i32: vselect <0,0,-1,-1>, A, B) -> (v4i32 (bitcast (movsd (v2i64 (bitcast A)), (v2i64 (bitcast B))) )) 2) fold (v4f32: vselect <0,0,-1,-1>, A, B) -> (v4f32 (bitcast (movsd (v2f64 (bitcast A)), (v2f64 (bitcast B))) )) 3) fold (v4i32: vselect <-1,-1,0,0>, A, B) -> (v4i32 (bitcast (movsd (v2i64 (bitcast B)), (v2i64 (bitcast A))) )) 4) fold (v4f32: vselect <-1,-1,0,0>, A, B) -> (v4f32 (bitcast (movsd (v2i64 (bitcast B)), (v2i64 (bitcast A))) )) llvm-svn: 200324	2014-01-28 18:14:21 +00:00
Juergen Ributzka	659ce00d60	[TLI] Add a new hook to TargetLowering to query the target if a load of a constant should be converted to simply the constant itself. Before this patch we used getIntImmCost from TargetTransformInfo to determine if a load of a constant should be converted to just a constant, but the threshold for this was set to an arbitrary value. This value works well for the two targets (X86 and ARM) that implement this target-hook, but it isn't target-independent at all. Now targets have the possibility to decide directly if this optimization should be performed. The default value is set to false to preserve the current behavior. The target hook has been moved to TargetLowering, which removed the last use and need of TargetTransformInfo in SelectionDAG. llvm-svn: 200271	2014-01-28 01:20:14 +00:00
Juergen Ributzka	e758ddcd16	[X86] Prevent the creation of redundant ops for sadd and ssub with overflow. This commit teaches the X86 backend to create the same X86 instructions when it lowers an sadd/ssub with overflow intrinsic and a conditional branch that uses that overflow result. This allows SelectionDAG to recognize and remove one of the redundant operations. This fixes <rdar://problem/15874016> and <rdar://problem/15661073>. Reviewed by Nadav llvm-svn: 199976	2014-01-24 06:47:57 +00:00
Lang Hames	b1ce33379a	Add a few missing cases from r199933. Testcase coming shortly. llvm-svn: 199938	2014-01-23 21:27:27 +00:00
Lang Hames	23de211c5d	Replace vfmaddxx213 instructions with their 231-type equivalents in accumulator loops. Writing back to the accumulator (231-type) allows the coalescer to eliminate an extra copy. llvm-svn: 199933	2014-01-23 20:23:36 +00:00
Elena Demikhovsky	a5d38a39a0	AVX-512: added VPERM2D VPERM2Q VPERM2PS VPERM2PD instructions, they give better sequences than VPERMI llvm-svn: 199893	2014-01-23 14:27:26 +00:00
Andrea Di Biagio	450d1661be	[X86] Teach how to combine a vselect into a movss/movsd Add target specific rules for combining vselect dag nodes into movss/movsd when possible. If the vector type of the vselect dag node in input is either MVT::v4i13 or MVT::v4f32, then try to fold according to rules: 1) fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B) 2) fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A) If the vector type of the vselect dag node in input is either MVT::v2i64 or MVT::v2f64 (and we have SSE2), then try to fold according to rules: 3) fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B) 4) fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A) llvm-svn: 199683	2014-01-20 19:35:22 +00:00
Michael Gottesman	8347c34e67	Move the retrieval of VT after all of the early exits from PerformOrCombine that do not use VT. NFC. llvm-svn: 199612	2014-01-19 21:06:00 +00:00
Elena Demikhovsky	d1487261a0	AVX-512: fixed a compare pattern llvm-svn: 199366	2014-01-16 08:45:54 +00:00
David Majnemer	dee105772c	WinCOFF: Transform IR expressions featuring __ImageBase into image relative relocations MSVC on x64 requires that we create image relative symbol references to refer to RTTI data. Seeing as how there is no way to explicitly make reference to a given relocation type in LLVM IR, pattern match expressions of the form &foo - &__ImageBase. Differential Revision: http://llvm-reviews.chandlerc.com/D2523 llvm-svn: 199312	2014-01-15 09:16:42 +00:00
Elena Demikhovsky	79b75d9048	Fixed identation. llvm-svn: 199301	2014-01-15 07:18:11 +00:00
Lang Hames	06234ec147	Add FPExt option to CCValAssign::LocInfo. When generating calling-convention promotion code, Tablegen will now select FPExt for floating point promotions (previously it had returned AExt, which is not valid for floating point types). Any out-of-tree targets that were relying on AExt being returned for FP promotions will need to update their code check for FPExt instead. llvm-svn: 199252	2014-01-14 19:56:36 +00:00
Nico Rieck	7157bb765e	Decouple dllexport/dllimport from linkage Representing dllexport/dllimport as distinct linkage types prevents using these attributes on templates and inline functions. Instead of introducing further mixed linkage types to include linkonce and weak ODR, the old import/export linkage types are replaced with a new separate visibility-like specifier: define available_externally dllimport void @f() {} @Var = dllexport global i32 1, align 4 Linkage for dllexported globals and functions is now equal to their linkage without dllexport. Imported globals and functions must be either declarations with external linkage, or definitions with AvailableExternallyLinkage. llvm-svn: 199218	2014-01-14 15:22:47 +00:00
Elena Demikhovsky	767fc967b4	AVX-512: optimized scalar compare patterns removed AVX512SI format, since it is similar to AVX512BI. llvm-svn: 199217	2014-01-14 15:10:08 +00:00
Andrea Di Biagio	5448a3c771	[X86] Fix assertion failure caused by a wrong folding of vector shifts by immediate count. This fixes a regression intruced by r198113. Revision r198113 introduced an algorithm that tries to fold a vector shift by immediate count into a build_vector if the input vector is a known vector of constants. However the algorithm only worked under the assumption that the input vector type and the shift type are exactly the same. This patch disables the folding of vector shift by immediate count if the input vector type and the shift value type are not the same. llvm-svn: 199213	2014-01-14 13:17:12 +00:00
Nico Rieck	9d2e0df049	Revert "Decouple dllexport/dllimport from linkage" Revert this for now until I fix an issue in Clang with it. This reverts commit r199204. llvm-svn: 199207	2014-01-14 12:38:32 +00:00
Nico Rieck	e43aaf7967	Decouple dllexport/dllimport from linkage Representing dllexport/dllimport as distinct linkage types prevents using these attributes on templates and inline functions. Instead of introducing further mixed linkage types to include linkonce and weak ODR, the old import/export linkage types are replaced with a new separate visibility-like specifier: define available_externally dllimport void @f() {} @Var = dllexport global i32 1, align 4 Linkage for dllexported globals and functions is now equal to their linkage without dllexport. Imported globals and functions must be either declarations with external linkage, or definitions with AvailableExternallyLinkage. llvm-svn: 199204	2014-01-14 11:55:03 +00:00
Elena Demikhovsky	172a27c750	AVX-512: Added more intrinsics for pmin/pmax, pabs, blend, pmuldq. llvm-svn: 198745	2014-01-08 10:54:22 +00:00
Bill Wendling	13199b17f8	Remove unnecessary #includes. llvm-svn: 198585	2014-01-06 06:00:00 +00:00
Bill Wendling	908bf814e7	Refactor function that checks that __builtin_returnaddress's argument is constant. This moves the check up into the parent class so that all targets can use it without having to copy (and keep in sync) the same error message. llvm-svn: 198579	2014-01-06 00:43:20 +00:00
Elena Demikhovsky	f404e054a1	AVX-512: changed property name from "neverHasSideEffects=1" to "hasSideEffects=0", added this property to VMOVSS/VMOVSD; Optimized a truncate pattern. llvm-svn: 198562	2014-01-05 14:21:07 +00:00
Elena Demikhovsky	52e4a0e109	AVX-512: Added more intrinsics for convert and min/max. Removed vzeroupper from AVX-512 mode - our optimization gude does not recommend to insert vzeroupper at all. llvm-svn: 198557	2014-01-05 10:46:09 +00:00
Bill Wendling	df7dd28dc8	Emit an error message if the value passed to __builtin_returnaddress isn't a constant __builtin_returnaddress requires that the value passed into is be a constant. However, at -O0 even a constant expression may not be converted to a constant. Emit an error message intead of crashing. llvm-svn: 198531	2014-01-05 01:47:20 +00:00
Craig Topper	a448bd868f	Make more of the x86 lowering helper functions static. llvm-svn: 198146	2013-12-29 01:48:38 +00:00
Craig Topper	059e8e0da1	Switch from EVT to MVT in more of the x86 instruction lowering code. llvm-svn: 198144	2013-12-29 01:10:06 +00:00
Craig Topper	bf096926c9	Use getSimpleValueType in a few spots where the type should be simple. llvm-svn: 198117	2013-12-28 18:35:48 +00:00
Craig Topper	e829fe42af	Minor indentation fix to match other switch statements. Change llvm_unreachable text to match similar places. llvm-svn: 198116	2013-12-28 17:37:32 +00:00
Andrea Di Biagio	eaceba0ed0	[X86] Teach the backend how to fold target specific dag node for packed vector shift by immedate count (VSHLI/VSRLI/VSRAI) into a build_vector when the vector in input to the shift is a build_vector of all constants or UNDEFs. Target specific nodes for packed shifts by immediate count are in general introduced by function 'getTargetVShiftByConstNode' (in X86ISelLowering.cpp) when lowering shift operations, SSE/AVX immediate shift intrinsics and (only in very few cases) SIGN_EXTEND_INREG dag nodes. This patch adds extra rules for simplifying vector shifts inside function 'getTargetVShiftByConstNode'. Added file test/CodeGen/X86/vec_shift5.ll to verify that packed shifts by immediate are correctly folded into a build_vector when the input vector to the shift dag node is a vector of constants or undefs. llvm-svn: 198113	2013-12-28 11:11:52 +00:00
Elena Demikhovsky	b64d7e8586	AVX-512: Result type of scalar SETCC is MVT::i1 for AVX-512. llvm-svn: 198008	2013-12-25 10:06:40 +00:00
Elena Demikhovsky	64c9548d66	AVX-512: fixed some patterns for MVT::i1 llvm-svn: 197981	2013-12-24 14:24:07 +00:00
Elena Demikhovsky	fe24a30e38	AVX512: SETCC returns i1 for AVX-512 and i8 for all others llvm-svn: 197876	2013-12-22 10:13:18 +00:00
Duncan P. N. Exon Smith	ab5dbebc11	Assert that the last operand is actually EFLAGS This is another follow-up to r197503, after a post-commit review by Andy. <rdar://problem/15627766> llvm-svn: 197520	2013-12-17 20:28:21 +00:00
Duncan P. N. Exon Smith	512601d77f	Revert "Revert "Mark vastart_save_xmm_regs as changing EFLAGS"" This reverts commit r197481, recommiting r197469 with an extra fix. The vastart_save_xmm_regs pseudo-instruction expands to a test and a branch, so it modifies EFLAGS. Mark it so, or else the scheduler might place it in the middle of another test+branch. This fixes a bug exposed by r192750, which changed the initial scheduler to source-order as part of enabling the MI Scheduler for X86. This re-commit changes the VASTART_SAVE_XMM_REGS custom inserter not to try to save %flags, and adds a test that catches the bad behavior of r197469. <rdar://problem/15627766> llvm-svn: 197503	2013-12-17 15:54:45 +00:00
Stepan Dyatkovskiy	7f7c2710e0	Fix for PR18045: http://llvm.org/bugs/show_bug.cgi?id=18045 Short issue description: For X86 machines with sse < sse4.1 we got failures for some particular load/store vector sequences: $ clang-trunk -m32 -O2 test-case.c fatal error: error in backend: Cannot select: 0x4200920: v4i32,ch = load 0x41d6ab0, 0x4205850, 0x41dcb10<LD16[getelementptr inbounds ([4 x i32]* @e, i32 0, i32 0)](align=4)> [ORD=82] [ID=58] 0x4205850: i32 = X86ISD::Wrapper 0x41d5490 [ORD=26] [ID=43] 0x41d5490: i32 = TargetGlobalAddress<[4 x i32]* @e> 0 [ORD=26] [ID=23] 0x41dcb10: i32 = undef [ID=2] The reason is that EltsFromConsecutiveLoads could emit such load instruction both before and after legalize stage. Though this instruction is not legal for machines with SSSE3 and lower. The fix: In EltsFromConsecutiveLoads, if we have passed legalize stage, we check whether nodes it emits are legal. P.S.: If you get failure in time from 12:00 and till 22:00 (UTC-8), perhaps I'll slow with response, so you better reject this commit. Thanks! llvm-svn: 197492	2013-12-17 12:07:33 +00:00
Elena Demikhovsky	c5f6726a24	AVX-512: Added implementation of CONCAT_VECTORS for v8i1 vectors (by Alexey Bader). Added implementation of "truncate" from integer type (i64/i32/i16/i8) to i1. llvm-svn: 197482	2013-12-17 08:33:15 +00:00
Elena Demikhovsky	47fc44e52e	AVX-512: Added legal type MVT::i1 and VK1 register for it. Added scalar compare VCMPSS, VCMPSD. Implemented LowerSELECT for scalar FP operations. I replaced FSETCCss, FSETCCsd with one node type FSETCCs. Node extract_vector_elt(v16i1/v8i1, idx) returns an element of type i1. llvm-svn: 197384	2013-12-16 13:52:35 +00:00
Benjamin Kramer	e723bb10b0	X86: When lowering shl_parts, don't emit shift amounts larger than the bit width. While it's safe for the X86-specific shift nodes, dag combining will kill generic nodes. Insert an AND to make it safe, isel will nuke it as x86's shift instructions have an implicit AND. Fixes PR16108, which contains a contraption to hit this case in between constant folders. llvm-svn: 197228	2013-12-13 13:40:24 +00:00
Rafael Espindola	32cb5ac904	Switch to the new MingW ABI. GCC 4.7 changed the MingW ABI. On the LLVM side it means that sret functions don't pop the stack. llvm-svn: 197163	2013-12-12 16:06:58 +00:00
Tim Northover	9653eb5759	Make Triple's isOSBinFormatXXX functions partition triple-space. Most users would be surprised if "isCOFF" and "isMachO" were simultaneously true, unless they'd put the compiler in a box with a gun attached to a photon detector. This makes sure precisely one of the three formats is true for any triple and simplifies some target logic based on that. llvm-svn: 196934	2013-12-10 16:57:43 +00:00
Elena Demikhovsky	e382c3fdcd	AVX-512: changed intrinsics for mask operations llvm-svn: 196918	2013-12-10 13:53:10 +00:00
Cameron McInally	c5f420e129	Suppress '(x < y) ? a : 0 -> (x < y) & a' transform on X86 architectures with dedicated mask registers. Patch by Aleksey Bader. llvm-svn: 196386	2013-12-04 14:52:33 +00:00
Lang Hames	39609996d9	Refactor a lot of patchpoint/stackmap related code to simplify and make it target independent. Most of the x86 specific stackmap/patchpoint handling was necessitated by the use of the native address-mode format for frame index operands. PEI has now been modified to treat stackmap/patchpoint similarly to DEBUG_INFO, allowing us to use a simple, platform independent register/offset pair for frame indexes on stackmap/patchpoints. Notes: - Folding is now platform independent and automatically supported. - Emiting patchpoints with direct memory references now just involves calling the TargetLoweringBase::emitPatchPoint utility method from the target's XXXTargetLowering::EmitInstrWithCustomInserter method. (See X86TargetLowering for an example). - No more ugly platform-specific operand parsers. This patch shouldn't change the generated output for X86. llvm-svn: 195944	2013-11-29 03:07:54 +00:00
Michael Liao	d617a3015d	Fix PR18054 - Fix bug in (vsext (vzext x)) -> (vsext x) in SIGN_EXTEND_IN_REG lowering where we need to check whether x is a vector type (in-reg type) of i8, i16 or i32; otherwise, that optimization is not valid. llvm-svn: 195779	2013-11-26 20:31:31 +00:00
Andrew Trick	391dbadb51	StackMap: Implement support for DirectMemRefOp. A Direct stack map location records the address of frame index. This address is itself the value that the runtime requested. This differs from IndirectMemRefOp locations, which refer to a stack locations from which the requested values must be loaded. Direct locations can directly communicate the address if an alloca, while IndirectMemRefOp handle register spills. For example: entry: %a = alloca i64... llvm.experimental.stackmap(i32 <ID>, i32 <shadowBytes>, i64* %a) Since both the alloca and stackmap intrinsic are in the entry block, and the intrinsic takes the address of the alloca, the runtime can assume that LLVM will not substitute alloca with any intervening value. This must be verified by the runtime by checking that the stack map's location is a Direct location type. The runtime can then determine the alloca's relative location on the stack immediately after compilation, or at any time thereafter. This differs from Register and Indirect locations, because the runtime can only read the values in those locations when execution reaches the instruction address of the stack map. llvm-svn: 195712	2013-11-26 02:03:25 +00:00
Andrew Trick	d3ab37cfeb	whitespace llvm-svn: 195711	2013-11-26 02:03:20 +00:00
Jim Grosbach	860934a924	X86: Perform integer comparisons at i32 or larger. Utilizing the 8 and 16 bit comparison instructions, even when an input can be folded into the comparison instruction itself, is typically not worth it. There are too many partial register stalls as a result, leading to significant slowdowns. By always performing comparisons on at least 32-bit registers, performance of the calculation chain leading to the comparison improves. Continue to use the smaller comparisons when minimizing size, as that allows better folding of loads into the comparison instructions. rdar://15386341 llvm-svn: 195496	2013-11-22 19:57:47 +00:00
Michael Liao	02160d580b	Fix PR18014 - When simplifying the mask generation for BLEND, check whether that mask is also consumed by other non-BLEND insns. If true, skip that simplification. llvm-svn: 195476	2013-11-22 17:56:57 +00:00
Rafael Espindola	5a8e985ad3	Don't produce tail calls when the caller is x86_thiscallcc. The callee will not pop the stack for us. llvm-svn: 195467	2013-11-22 15:18:28 +00:00
Kostya Serebryany	4007009815	Revert r195318 as it causes miscompilation (PR18029) llvm-svn: 195439	2013-11-22 10:30:39 +00:00
Ekaterina Romanova	d5fa55470c	SHLD/SHRD are VectorPath (microcode) instructions known to have poor latency on certain architectures. While generating SHLD/SHRD instructions is acceptable when optimizing for size, optimizing for speed on these platforms should be implemented using alternative sequences of instructions composed of add, adc, shr, shl, or and lea which are directPath instructions. These alternative instructions not only have a lower latency but they also increase the decode bandwidth by allowing simultaneous decoding of a third directPath instruction. AMD's processors family K7, K8, K10, K12, K15 and K16 are known to have SHLD/SHRD instructions with very poor latency. Optimization guides for these processors recommend using an alternative sequence of instructions. For these AMD's processors, I disabled folding (or (x << c) \| (y >> (64 - c))) when we are not optimizing for size. It might be beneficial to disable this folding for some of the Intel's processors. However, since I couldn't find specific recommendations regarding using SHLD/SHRD instructions on Intel's processors, I haven't disabled this peephole for Intel. llvm-svn: 195383	2013-11-21 23:21:26 +00:00
Bill Wendling	07787f8747	The basic problem is that some mainstream programs cannot deal with the way clang optimizes tail calls, as in this example: int foo(void); int bar(void) { return foo(); } where the call is transformed to: calll .L0$pb .L0$pb: popl %eax .Ltmp0: addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax movl foo@GOT(%eax), %eax popl %ebp jmpl *%eax # TAILCALL However, the GOT references must all be resolved at dlopen() time, and so this approach cannot be used with lazy dynamic linking (e.g. using RTLD_LAZY), which usually populates the PLT with stubs that perform the actual resolving. This patch changes X86TargetLowering::LowerCall() to skip tail call optimization, if the called function is a global or external symbol. Patch by Dimitry Andric! PR15086 llvm-svn: 195318	2013-11-21 07:04:30 +00:00
NAKAMURA Takumi	3dedf827f8	X86ISelLowering.cpp: Mark a variable VT as LLVM_ATTRIBUTE_UNUSED. [-Wunused-variable] llvm-svn: 195238	2013-11-20 10:55:22 +00:00
NAKAMURA Takumi	f2392ebb94	Whitespace. llvm-svn: 195237	2013-11-20 10:55:15 +00:00
Elena Demikhovsky	a5967af97d	Fixed compilation error. llvm-svn: 195230	2013-11-20 09:23:22 +00:00
Elena Demikhovsky	e1f9bf054f	AVX-512: Concat 4 128-bit vectors in one 512-bit vector. llvm-svn: 195229	2013-11-20 09:10:40 +00:00
Cameron McInally	ad41f1f693	Add AVX512 unmasked FMA intrinsics and support. llvm-svn: 194824	2013-11-15 17:01:14 +00:00
Matt Arsenault	b03bd4d96b	Add addrspacecast instruction. Patch by Michele Scandale! llvm-svn: 194760	2013-11-15 01:34:59 +00:00
Elena Demikhovsky	0a74b7da35	AVX-512: Handled extractelement from mask vector; Added VMOSHDUP/VMOVSLDUP shuffle instructions. llvm-svn: 194691	2013-11-14 11:29:27 +00:00
Juergen Ributzka	34c652d34d	SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too. This patch reapplies r193676 with an additional fix for the Hexagon backend. The SystemZ backend has already been fixed by r194148. The Type Legalizer recognizes that VSELECT needs to be split, because the type is to wide for the given target. The same does not always apply to SETCC, because less space is required to encode the result of a comparison. As a result VSELECT is split and SETCC is unrolled into scalar comparisons. This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG Combiner. If a matching pattern is found, then the result mask of SETCC is promoted to the expected vector mask type for the given target. Now the type legalizer will split both VSELECT and SETCC. This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>. Reviewed by Nadav llvm-svn: 194542	2013-11-13 01:57:54 +00:00
Juergen Ributzka	87ed906b2e	[Stackmap] Materialize the jump address within the patchpoint noop slide. This patch moves the jump address materialization inside the noop slide. This enables patching of the materialization itself or its complete removal. This patch also adds the ability to define scratch registers that can be used safely by the code called from the patchpoint intrinsic. At least one scratch register is required, because that one is used for the materialization of the jump address. This patch depends on D2009. Differential Revision: http://llvm-reviews.chandlerc.com/D2074 Reviewed by Andy llvm-svn: 194306	2013-11-09 01:51:33 +00:00
Juergen Ributzka	9969d3e6e8	[Stackmap] Add AnyReg calling convention support for patchpoint intrinsic. The idea of the AnyReg Calling Convention is to provide the call arguments in registers, but not to force them to be placed in a paticular order into a specified set of registers. Instead it is up tp the register allocator to assign any register as it sees fit. The same applies to the return value (if applicable). Differential Revision: http://llvm-reviews.chandlerc.com/D2009 Reviewed by Andy llvm-svn: 194293	2013-11-08 23:28:16 +00:00
Eric Christopher	542c8d934d	Check for both styles of clobbers, those produced by dragonegg and those produced by clang for the inline asm bswap conversion. Modified from a patch by Chris Smowton. llvm-svn: 194016	2013-11-04 21:41:21 +00:00
Elena Demikhovsky	496656900e	AVX-512: Implemented CMOV for 512-bit vectors llvm-svn: 193747	2013-10-31 13:15:32 +00:00
Juergen Ributzka	3bd686d493	Revert "SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too." Now Hexagon and SystemZ are not happy with it :-( llvm-svn: 193677	2013-10-30 06:36:19 +00:00
Juergen Ributzka	6ad05d6b95	SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too. The Type Legalizer recognizes that VSELECT needs to be split, because the type is to wide for the given target. The same does not always apply to SETCC, because less space is required to encode the result of a comparison. As a result VSELECT is split and SETCC is unrolled into scalar comparisons. This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG Combiner. If a matching pattern is found, then the result mask of SETCC is promoted to the expected vector mask type for the given target. This mask has usually the same size as the VSELECT return type (except for Intel KNL). Now the type legalizer will split both VSELECT and SETCC. This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>. Reviewed by Nadav llvm-svn: 193676	2013-10-30 05:48:18 +00:00
Elena Demikhovsky	199c823555	AVX-512: PMIN/PMAX intrinsics and patterns Patch by Cameron McInally <cameron.mcinally@nyu.edu> llvm-svn: 193497	2013-10-27 08:18:37 +00:00
Nadav Rotem	d369d4bdf9	Optimize concat_vectors(X, undef) -> scalar_to_vector(X). This optimization is not SSE specific so I am moving it to DAGco. The new scalar_to_vector dag node exposed a missing pattern in the AArch64 target that I needed to add. llvm-svn: 193393	2013-10-25 06:41:18 +00:00
Yaron Keren	79bb266346	(this is a corrected patch) Calling _chkstk is required on ELF as well as COFF on Windows. Without _chkstk, functions requiring large stack crash in initialization code. Previous code tested for COFF format but not Mach-O and this patch modifies the code to test for Windows OS (both Windows target and MingW target) but not Mach-O object format: Looks like macho environment was used to build some EFI code. Credits to Andrew MacPherson. llvm-svn: 193289	2013-10-23 23:37:01 +00:00
Rafael Espindola	bca3ab0905	Revert "Calling _chkstk is required on ELF as well as COFF on Windows. Without _chkstk functions requiring large stack crash in initialization code. Previous code tested for COFF format but not Mach-O and this patch modifies the code to test for Windows." This reverts commit r193263. It is causing CodeGen/X86/mingw-alloca.ll to fail. llvm-svn: 193275	2013-10-23 21:45:09 +00:00
Benjamin Kramer	0ccab2d66c	X86: Custom lower sext v16i8 to v16i16, and the corresponding truncate. Also update the cost model. llvm-svn: 193270	2013-10-23 21:06:07 +00:00
Yaron Keren	03ac82edf5	Calling _chkstk is required on ELF as well as COFF on Windows. Without _chkstk functions requiring large stack crash in initialization code. Previous code tested for COFF format but not Mach-O and this patch modifies the code to test for Windows. Credits to Andrew MacPherson. llvm-svn: 193263	2013-10-23 19:40:07 +00:00
Benjamin Kramer	da8446b833	X86: Custom lower zext v16i8 to v16i16. On sandy bridge (PR17654) we now get vpxor %xmm1, %xmm1, %xmm1 vpunpckhbw %xmm1, %xmm0, %xmm2 vpunpcklbw %xmm1, %xmm0, %xmm0 vinsertf128 $1, %xmm2, %ymm0, %ymm0 On haswell it's a simple vpmovzxbw %xmm0, %ymm0 There is a maze of duplicated and dead transforms and patterns in this area. Remove the dead custom lowering of zext v8i16 to v8i32, that's already handled by LowerAVXExtend. llvm-svn: 193262	2013-10-23 19:19:04 +00:00
Jim Grosbach	005e5b55a7	X86: Make concat_vectors combine a bit more conservative. Per Nadav's review comments for r192866. llvm-svn: 193252	2013-10-23 17:37:40 +00:00
Lang Hames	2783993fca	X86 vector element shift-by-immediate instructions take i8 immediates. Make the instruction defenitions and ISEL reflect this. Prior to this patch these instructions took an i32i8imm, and the high bits were dropped during encoding. This led to incorrect behavior for shifts by immediates higher than 255. This patch fixes that issue by detecting large immediate shifts and returning constant zero (for logical shifts) or capping the shift amount at an encodable value (for arithmetic shifts). Fixes <rdar://problem/14968098> llvm-svn: 193096	2013-10-21 17:51:24 +00:00
Elena Demikhovsky	665c90e184	AVX-512: MUL operation lowering for v8i64 llvm-svn: 193083	2013-10-21 13:27:34 +00:00
Jim Grosbach	c044c65470	x86: Move bitcasts outside concat_vector. Consider the following: typedef unsigned short ushort4U __attribute__((ext_vector_type(4), aligned(2))); typedef unsigned short ushort4 __attribute__((ext_vector_type(4))); typedef unsigned short ushort8 __attribute__((ext_vector_type(8))); typedef int int4 __attribute__((ext_vector_type(4))); int4 __bbase_cvt_int(ushort4 v) { ushort8 a; a.lo = v; return _mm_cvtepu16_epi32(a); } This generates the, not unreasonable, IR: define <4 x i32> @foo0(double %v.coerce) nounwind ssp { %tmp = bitcast double %v.coerce to <4 x i16> %tmp1 = shufflevector <4 x i16> %tmp, <4 x i16> undef, <8 x i32> <i32 %0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> %tmp2 = tail call <4 x i32> @llvm.x86.sse41.pmovzxwd(<8 x i16> %tmp1) ret <4 x i32> %tmp2 } The problem is when type legalization gets hold of the v4i16. It legalizes that by spilling to the stack, then doing a zero-extending load. Things go even more silly from there, ending up with something like: _foo0: movsd %xmm0, -8(%rsp) <== Spill to the stack. movq -8(%rsp), %xmm0 <== Reload it right back out. pmovzxwd %xmm0, %xmm1 <== Here's what we actually asked for. pblendw $1, %xmm1, %xmm0 <== We don't need this at all pmovzxwd %xmm0, %xmm0 <== We already did this ret The v8i8 to v8i16 zext intrinsic gives even worse results, with two table lookups via pshufb instructions(!!). To avoid all that, we can move the bitcasting until after we've formed the wider (legal) vector type. Then our normal codegen flows along nicely and we get the expected: _foo0: pmovzxwd %xmm0, %xmm0 ret rdar://15245794 llvm-svn: 192866	2013-10-17 02:58:06 +00:00
Michael Liao	ad71659def	Fix PR17546 - Type of index used in extract_vector_elt or insert_vector_elt supposes to be TLI.getVectorIdxTy() which is pointer type on most targets. It'd better to truncate (or zero-extend in case it's changed later) it to mask element type to guarantee they are matching instead of asserting that. llvm-svn: 192722	2013-10-15 17:51:58 +00:00
Michael Liao	8ba068211d	Fix PR16807 - Lower signed division by constant powers-of-2 to target-independent DAG operators instead of target-dependent ones to support them better on targets where vector types are legal but shift operators on that types are illegal. E.g., on AVX, PSRAW is only available on <8 x i16> though <16 x i16> is a legal type. llvm-svn: 192721	2013-10-15 17:51:02 +00:00
Eric Christopher	755711e510	Reformat this routine slightly. llvm-svn: 192630	2013-10-14 21:52:23 +00:00
Elena Demikhovsky	82a46ebe0a	Fixed a bug in dynamic allocation memory on stack. The alignment of allocated space was wrong, see Bugzila 17345. Done by Zvi Rackover <zvi.rackover@intel.com>. llvm-svn: 192573	2013-10-14 07:26:51 +00:00
Benjamin Kramer	7b5e159450	X86: Fix type check. Just because an integer type is illegal doesn't mean it's i64. Fixes PR17495, where an i24 triggered this code. It's intended to optimize i64 loads on 32 bit x86. llvm-svn: 192123	2013-10-07 19:11:35 +00:00
Elena Demikhovsky	2e408aefe0	AVX-512: added scalar convert instructions and intrinsics. Fixed load folding in VPERM2I instruction. llvm-svn: 192063	2013-10-06 13:11:09 +00:00
Elena Demikhovsky	462a2d235b	AVX-512: fixed shuffle lowering in case of BLEND and added VSHUFPS patterns. llvm-svn: 192055	2013-10-06 06:11:18 +00:00
Craig Topper	b01cd1aa74	Add patterns for selecting TBM instructions from logical operations. Patch from Yunzhong Gao. llvm-svn: 191871	2013-10-03 04:16:45 +00:00
Rafael Espindola	44fee4e0eb	Remove several unused variables. Patch by Alp Toker. llvm-svn: 191757	2013-10-01 13:32:03 +00:00
Robert Wilhelm	f0cfb83bb4	Fix spelling intruction -> instruction. llvm-svn: 191610	2013-09-28 11:46:15 +00:00
Juergen Ributzka	f043a65327	Revert "SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too." This reverts commit r191130. llvm-svn: 191138	2013-09-21 15:09:46 +00:00
Juergen Ributzka	c2551eb4ff	Fix the buildbot llvm-svn: 191133	2013-09-21 05:15:01 +00:00
Juergen Ributzka	ab930591c7	[X86] Emulate AVX 256bit MIN/MAX support by splitting the vector. In AVX 256bit vectors are valid vectors and therefore the Type Legalizer doesn't split the VSELECT and SETCC nodes. AVX only supports MIN/MAX on 128bit vectors and this fix enables vector splitting for this special case in the X86 DAG Combiner. This fix is related to PR16695, PR17002, and <rdar://problem/14594431>. llvm-svn: 191131	2013-09-21 04:55:22 +00:00
Juergen Ributzka	e9a80fc912	SelectionDAG: Teach the legalizer to split SETCC if VSELECT needs splitting too. The Type Legalizer recognizes that VSELECT needs to be split, because the type is to wide for the given target. The same does not always apply to SETCC, because less space is required to encode the result of a comparison. As a result VSELECT is split and SETCC is unrolled into scalar comparisons. This commit fixes the issue by checking for VSELECT-SETCC patterns in the DAG Combiner. If a matching pattern is found, then the result mask of SETCC is promoted to the expected vector mask for the given target. This mask has usually te same size as the VSELECT return type (except for Intel KNL). Now the type legalizer will split both VSELECT and SETCC. This allows the following X86 DAG Combine code to sucessfully detect the MIN/MAX pattern. This fixes PR16695, PR17002, and <rdar://problem/14594431>. llvm-svn: 191130	2013-09-21 04:55:18 +00:00
Elena Demikhovsky	8952974e29	AVX-512: implemented extractelement with variable index. Added parsing of mask register and "zeroing" semantic, like {%k1} {z}. llvm-svn: 190595	2013-09-12 08:55:00 +00:00
Juergen Ributzka	53d0b492f5	[X86] Perform VSELECT DAG combines also before DAG type legalization. If the DAG already has only legal types, then the second round of DAG combines is skipped. In this case VSELECT+SETCC patterns that match a more efficient instruction (e.g. min/max) are never recognized. This fix allows VSELECT+SETCC combines if the types are already legal before DAG type legalization. Reviewer: Nadav llvm-svn: 190105	2013-09-05 23:02:56 +00:00
Craig Topper	b25f0f5538	Create BEXTR instructions for (and ((sra or srl) x, imm), (2**size - 1)). Fixes PR17028. llvm-svn: 189742	2013-09-02 07:53:17 +00:00
Elena Demikhovsky	4def4b088f	AVX-512: Added GATHER and SCATTER instructions. llvm-svn: 189729	2013-09-01 14:24:41 +00:00
Craig Topper	f78c19c3bb	Fixup BZHI selection to remove an unneeded zero extension. llvm-svn: 189656	2013-08-30 07:16:16 +00:00
Craig Topper	0bccad2d43	Teach X86 backend to create BMI2 BZHI instructions from (and X, (add (shl 1, Y), -1)). Fixes PR17038. llvm-svn: 189653	2013-08-30 06:52:21 +00:00
Elena Demikhovsky	980c6b08b1	AVX-512: added extend and truncate instructions. llvm-svn: 189580	2013-08-29 11:56:53 +00:00
Elena Demikhovsky	12f24673e0	AVX-512: Added FMA instructions. llvm-svn: 189326	2013-08-27 08:39:25 +00:00
Elena Demikhovsky	0a2b6290f1	AVX-512: Added shuffle instructions - VPSHUFD, VPERMILPS, VMOVDDUP, VMOVLHPS, VMOVHLPS, VSHUFPS, VALIGN single and double forms. llvm-svn: 189215	2013-08-26 12:45:35 +00:00
Elena Demikhovsky	f8f478b19d	AVX-512: added UNPACK instructions and tests for all-zero/all-ones vectors llvm-svn: 189189	2013-08-25 12:54:30 +00:00
Elena Demikhovsky	33d447a2d6	AVX-512: Added SHIFT instructions. llvm-svn: 188899	2013-08-21 09:36:02 +00:00
Elena Demikhovsky	1490c5eb5b	AVX-512: added arithmetic and logical operations. ADD, SUB, MUL integer and FP types. OR, AND, XOR. Added embeded broadcast form for these instructions. llvm-svn: 188673	2013-08-19 13:26:14 +00:00
Elena Demikhovsky	3ce8dbbac2	AVX-512: Added VMOVD, VMOVQ, VMOVSS, VMOVSD instructions. llvm-svn: 188637	2013-08-18 13:08:57 +00:00
Craig Topper	e6861c9ce5	Make more of the lowering helpers static. Also use MVT instead of EVT in a couple places. llvm-svn: 188629	2013-08-18 08:53:01 +00:00
Craig Topper	8dbc7e9d35	Revert r188449 as it turns out we're just missing the instructions that need the v16i32/v16f32 matching. llvm-svn: 188454	2013-08-15 08:38:25 +00:00
Craig Topper	2ffd06528d	Don't let isPermImmMask handle v16i32 since VPERMI doesn't match on that type. Remove 128-bit vector handling from isPermImmMask too, it's covered by isPSHUFDMask. llvm-svn: 188449	2013-08-15 07:30:51 +00:00
Craig Topper	6f4dd2dacf	Use MVT in place of EVT in more X86 operation lowering functions. llvm-svn: 188445	2013-08-15 05:33:45 +00:00
Craig Topper	5671010cbb	Replace getValueType().getSimpleVT() with getSimpleValueType(). Also remove one weird cast from MVT->EVT just to call getSimpleVT(). llvm-svn: 188441	2013-08-15 02:33:50 +00:00
Craig Topper	d03748cf5e	Make more helper methods into static functions. llvm-svn: 188366	2013-08-14 07:53:41 +00:00
Craig Topper	7b7b159574	Remove tab characters. llvm-svn: 188365	2013-08-14 07:35:18 +00:00
Craig Topper	d905fded68	Make some helper methods static. llvm-svn: 188364	2013-08-14 07:34:43 +00:00
Craig Topper	60769e050d	Use MVT in more lowering code. llvm-svn: 188363	2013-08-14 07:04:42 +00:00
Craig Topper	52b00359b1	Replace EVT with MVT in isVectorShift. Keeps compiler from generating unneeded checks and handling for extended types. llvm-svn: 188362	2013-08-14 06:21:10 +00:00
Craig Topper	67476d7485	Replace EVT with MVT in many of the shuffle lowering functions. Keeps compiler from generating unneeded checks and handling for extended types. llvm-svn: 188361	2013-08-14 05:58:39 +00:00
Evgeniy Stepanov	7dee697faa	Fix compiler warnings. ../lib/Target/X86/X86ISelLowering.cpp:9715:7: error: unused variable 'OpVT' [-Werror,-Wunused-variable] EVT OpVT = Op0.getValueType(); ^ ../lib/Target/X86/X86ISelLowering.cpp:9763:14: error: unused variable 'NumElems' [-Werror,-Wunused-variable] unsigned NumElems = VT.getVectorNumElements(); llvm-svn: 188269	2013-08-13 14:04:20 +00:00
Elena Demikhovsky	60b1f289f2	AVX-512: Added CMP and BLEND instructions. Lowering for SETCC. llvm-svn: 188265	2013-08-13 13:24:07 +00:00
Elena Demikhovsky	5fed3b95db	AVX-512: Added more tests for BROADCAST llvm-svn: 188148	2013-08-11 12:29:16 +00:00
Elena Demikhovsky	cf5b1458e6	AVX-512: Added VPERM* instructons and MOV* zmm-to-zmm instructions. Added a test for shuffles using VPERM. llvm-svn: 188147	2013-08-11 07:55:09 +00:00
Elena Demikhovsky	45c54ad8dc	AVX-512 set: Added BROADCAST instructions with lowering logic and a test. llvm-svn: 187884	2013-08-07 12:34:55 +00:00
Craig Topper	c5b0ad27ab	Simplify code. No functional change intended. llvm-svn: 187870	2013-08-07 08:16:07 +00:00
Tim Northover	a4415854db	Refactor isInTailCallPosition handling This change came about primarily because of two issues in the existing code. Niether of: define i64 @test1(i64 %val) { %in = trunc i64 %val to i32 tail call i32 @ret32(i32 returned %in) ret i64 %val } define i64 @test2(i64 %val) { tail call i32 @ret32(i32 returned undef) ret i32 42 } should be tail calls, and the function sameNoopInput is responsible. The main problem is that it is completely symmetric in the "tail call" and "ret" value, but in reality different things are allowed on each side. For these cases: 1. Any truncation should lead to a larger value being generated by "tail call" than needed by "ret". 2. Undef should only be allowed as a source for ret, not as a result of the call. Along the way I noticed that a mismatch between what this function treats as a valid truncation and what the backends see can lead to invalid calls as well (see x86-32 test case). This patch refactors the code so that instead of being based primarily on values which it recurses into when necessary, it starts by inspecting the type and considers each fundamental slot that the backend will see in turn. For example, given a pathological function that returned {{}, {{}, i32, {}}, i32} we would consider each "real" i32 in turn, and ask if it passes through unchanged. This is much closer to what the backend sees as a result of ComputeValueVTs. Aside from the bug fixes, this eliminates the recursion that's going on and, I believe, makes the bulk of the code significantly easier to understand. The trade-off is the nasty iterators needed to find the real types inside a returned value. llvm-svn: 187787	2013-08-06 09:12:35 +00:00
Craig Topper	cf969eadaf	Simplify vector lane handling math a bit. No functional change intended. llvm-svn: 187783	2013-08-06 07:23:12 +00:00
Craig Topper	7418ff460c	Simplify math a little bit. llvm-svn: 187781	2013-08-06 06:54:25 +00:00
Craig Topper	9bc00b65b6	Replace EVT with MVT in isHorizontalBinOp as it is only called with legal types. llvm-svn: 187779	2013-08-06 06:05:05 +00:00
Craig Topper	47d7c5c8fe	Simplify code slightly. No functional change. llvm-svn: 187771	2013-08-06 04:12:40 +00:00
Aaron Ballman	5b4634576e	Silencing an MSVC11 type conversion warning. llvm-svn: 187727	2013-08-05 13:47:03 +00:00
Elena Demikhovsky	40864b690b	AVX-512 set: added mask operations, lowering BUILD_VECTOR for i1 vector types. Added intrinsics and tests. llvm-svn: 187717	2013-08-05 08:52:21 +00:00
Benjamin Kramer	5bc180c14f	X86: Turn fp selects into mask operations. double test(double a, double b, double c, double d) { return a<b ? c : d; } before: _test: ucomisd %xmm0, %xmm1 ja LBB0_2 movaps %xmm3, %xmm2 LBB0_2: movaps %xmm2, %xmm0 after: _test: cmpltsd %xmm1, %xmm0 andpd %xmm0, %xmm2 andnpd %xmm3, %xmm0 orpd %xmm2, %xmm0 Small speedup on Benchmarks/SmallPT llvm-svn: 187706	2013-08-04 12:05:16 +00:00
Tim Northover	ecc018c7b7	X86: correct tail return address calculation Due to the weird and wondeful usual arithmetic conversions, some calculations involving negative values were getting performed in uint32_t and then promoted to int64_t, which is really not a good idea. Patch by Katsuhiro Ueno. llvm-svn: 187703	2013-08-04 09:35:57 +00:00
Elena Demikhovsky	b1266b5447	EVEX and compressed displacement encoding for AVX512 llvm-svn: 187576	2013-08-01 13:34:06 +00:00
Elena Demikhovsky	b0a75431ad	Fixed assertion in Extract128BitVector() llvm-svn: 187493	2013-07-31 12:03:08 +00:00
Elena Demikhovsky	67b05fc0b3	Added INSERT and EXTRACT intructions from AVX-512 ISA. All insertf/extractf functions replaced with insert/extract since we have insertf and inserti forms. Added lowering for INSERT_VECTOR_ELT / EXTRACT_VECTOR_ELT for 512-bit vectors. Added lowering for EXTRACT/INSERT subvector for 512-bit vectors. Added a test. llvm-svn: 187491	2013-07-31 11:35:14 +00:00
Nico Rieck	06d17c80cc	Proper va_arg/va_copy lowering on win64 Win64 uses CharPtrBuiltinVaList instead of X86_64ABIBuiltinVaList like other 64-bit targets. llvm-svn: 187355	2013-07-29 13:07:06 +00:00
Justin Holewinski	d3f2035a3c	Add a target legalize hook for SplitVectorOperand (again) CustomLowerNode was not being called during SplitVectorOperand, meaning custom legalization could not be used by targets. This also adds a test case for NVPTX that depends on this custom legalization. Differential Revision: http://llvm-reviews.chandlerc.com/D1195 Attempt to fix the buildbots by making the X86 test I just added platform independent llvm-svn: 187202	2013-07-26 13:28:29 +00:00
Rafael Espindola	1d812728cc	Revert "Add a target legalize hook for SplitVectorOperand" This reverts commit 187198. It broke the bots. The soft float test probably needs a -triple because of name differences. On the hard float test I am getting a "roundss $1, %xmm0, %xmm0", instead of "vroundss $1, %xmm0, %xmm0, %xmm0". llvm-svn: 187201	2013-07-26 13:18:16 +00:00
Justin Holewinski	f848a24e50	Add a target legalize hook for SplitVectorOperand CustomLowerNode was not being called during SplitVectorOperand, meaning custom legalization could not be used by targets. This also adds a test case for NVPTX that depends on this custom legalization. Differential Revision: http://llvm-reviews.chandlerc.com/D1195 llvm-svn: 187198	2013-07-26 12:46:39 +00:00
Elena Demikhovsky	8cfb43f73b	I'm starting to commit KNL backend. I'll push patches one-by-one. This patch includes support for the extended register set XMM16-31, YMM16-31, ZMM0-31. The full ISA you can see here: http://software.intel.com/en-us/intel-isa-extensions llvm-svn: 187030	2013-07-24 11:02:47 +00:00
Juergen Ributzka	3d527d80b8	[X86] Use min/max to optimze unsigend vector comparison on X86 Use PMIN/PMAX for UGE/ULE vector comparions to reduce the number of required instructions. This trick also works for UGT/ULT, but there is no advantage in doing so. It wouldn't reduce the number of instructions and it would actually reduce performance. Reviewer: Ben radar:5972691 llvm-svn: 186432	2013-07-16 18:20:45 +00:00
Craig Topper	202fbc2c9b	Add 'static' keyword to some const arrays for consistency. llvm-svn: 186308	2013-07-15 06:54:12 +00:00
Craig Topper	b94011fd28	Use SmallVectorImpl& instead of SmallVector to avoid repeating small vector size. llvm-svn: 186274	2013-07-14 04:42:23 +00:00
Stephen Lin	fda967fdea	X86: fold SSE2/AVX2 logical shift by immediate amount into zero vector when possible Patch by Andrea Di Biagio llvm-svn: 186165	2013-07-12 15:31:36 +00:00
Charles Davis	e8f297ca94	Target/X86: Add explicit Win64 and System V/x86-64 calling conventions. Summary: This patch adds explicit calling convention types for the Win64 and System V/x86-64 ABIs. This allows code to override the default, and use the Win64 convention on a target that wants to use SysV (and vice-versa). This is needed to implement the `ms_abi` and `sysv_abi` GNU attributes. Reviewers: CC: llvm-svn: 186144	2013-07-12 06:02:35 +00:00
Stephen Lin	73de7bf5de	AArch64/PowerPC/SystemZ/X86: This patch fixes the interface, usage, and all in-tree implementations of TargetLoweringBase::isFMAFasterThanMulAndAdd in order to resolve the following issues with fmuladd (i.e. optional FMA) intrinsics: 1. On X86(-64) targets, ISD::FMA nodes are formed when lowering fmuladd intrinsics even if the subtarget does not support FMA instructions, leading to laughably bad code generation in some situations. 2. On AArch64 targets, ISD::FMA nodes are formed for operations on fp128, resulting in a call to a software fp128 FMA implementation. 3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics on types like v2f32, v8f32, v4f64, etc., even though they promote, split, scalarize, etc. to types that support hardware FMAs. The function has also been slightly renamed for consistency and to force a merge/build conflict for any out-of-tree target implementing it. To resolve, see comments and fixed in-tree examples. llvm-svn: 185956	2013-07-09 18:16:56 +00:00
Nico Rieck	51969be724	Reuse %rax after calling __chkstk on win64 Reapply this as I reverted the wrong commit. llvm-svn: 185807	2013-07-08 11:20:11 +00:00
Nico Rieck	4801303ce1	Revert "Proper va_arg/va_copy lowering on win64" This reverts commit 2b52880592a525cfe04d8f9008a35da8c2ea94c3. Needs review. llvm-svn: 185806	2013-07-08 11:19:44 +00:00
Nico Rieck	43b51056d6	Revert "Reuse %rax after calling __chkstk on win64" This reverts commit 01f8d579f7672872324208ac5bc4ac311e81b22e. llvm-svn: 185781	2013-07-08 01:30:57 +00:00
Nico Rieck	7adf6111a8	Reuse %rax after calling __chkstk on win64 llvm-svn: 185778	2013-07-07 16:48:39 +00:00
Nico Rieck	99ef2890c0	Proper va_arg/va_copy lowering on win64 llvm-svn: 185763	2013-07-06 18:08:19 +00:00
Jakob Stoklund Olesen	db429d9483	Remove the EXCEPTIONADDR, EHSELECTION, and LSDAADDR ISD opcodes. These exception-related opcodes are not used any longer. llvm-svn: 185625	2013-07-04 13:54:20 +00:00
Jakob Stoklund Olesen	a1f5b901a5	Revert r185595-185596 which broke buildbots. Revert "Simplify landing pad lowering." Revert "Remove the EXCEPTIONADDR, EHSELECTION, and LSDAADDR ISD opcodes." llvm-svn: 185600	2013-07-04 00:26:30 +00:00
Jakob Stoklund Olesen	f33ec531fa	Remove the EXCEPTIONADDR, EHSELECTION, and LSDAADDR ISD opcodes. These exception-related opcodes are not used any longer. llvm-svn: 185596	2013-07-03 23:56:31 +00:00
Craig Topper	31ee5866de	Use SmallVectorImpl::iterator/const_iterator instead of SmallVector to avoid specifying the vector size. llvm-svn: 185540	2013-07-03 15:07:05 +00:00
Elena Demikhovsky	6769c50d9e	Optimized integer vector multiplication operation by replacing it with shift/xor/sub when it is possible. Fixed a bug in SDIV, where the const operand is not a splat constant vector. llvm-svn: 184931	2013-06-26 10:55:03 +00:00
Chad Rosier	295bd43adb	The getRegForInlineAsmConstraint function should only accept MVT value types. llvm-svn: 184642	2013-06-22 18:37:38 +00:00
Bill Wendling	8f26840c5a	Don't cache the instruction and register info from the TargetMachine, because the internals of TargetMachine could change. No functionality change intended. llvm-svn: 183571	2013-06-07 21:00:34 +00:00
Andrew Trick	ad6d08ac6f	Order CALLSEQ_START and CALLSEQ_END nodes. Fixes PR16146: gdb.base__call-ar-st.exp fails after pre-RA-sched=source fixes. Patch by Xiaoyi Guo! This also fixes an unsupported dbg.value test case. Codegen was previously incorrect but the test was passing by luck. llvm-svn: 182885	2013-05-29 22:03:55 +00:00
Andrew Trick	ef9de2a739	Track IR ordering of SelectionDAG nodes 2/4. Change SelectionDAG::getXXXNode() interfaces as well as call sites of these functions to pass in SDLoc instead of DebugLoc. llvm-svn: 182703	2013-05-25 02:42:55 +00:00

... 12 13 14 15 16 ...

3679 Commits