llvm-project

Commit Graph

Author	SHA1	Message	Date
Elena Demikhovsky	a480ef5494	AVX-512: added calling conventions for i1 vectors. Fixed bug: https://llvm.org/bugs/show_bug.cgi?id=20724 llvm-svn: 235889	2015-04-27 15:11:19 +00:00
Elena Demikhovsky	d1084c5b3f	AVX-512: Extend/Truncate operations for SKX, SETCC for bit-vectors llvm-svn: 235875	2015-04-27 12:57:59 +00:00
Simon Pilgrim	4f683c264a	[X86][SSE] Add v16i8/v32i8 multiplication support Patch to allow int8 vectors to be multiplied on the SSE unit instead of being scalarized. The patch sign extends the i8 lanes to i16, uses the SSE2 pmullw multiplication instruction, then packs the lower byte from each result. Differential Revision: http://reviews.llvm.org/D9115 llvm-svn: 235837	2015-04-27 07:55:46 +00:00
Sanjay Patel	fe1365ac50	[x86] allow 64-bit extracted vector element integer stores on a 32-bit system With SSE2, we can generate a 'movq' or other 64-bit store op on a 32-bit system even though 64-bit integers are not legal types. So instead of producing this: pshufd $229, %xmm0, %xmm1 ## xmm1 = xmm0[1,1,2,3] movd %xmm0, (%eax) movd %xmm1, 4(%eax) We can do: movq %xmm0, (%eax) This is a fix for the problem noted in D7296. Differential Revision: http://reviews.llvm.org/D9134 llvm-svn: 235460	2015-04-22 00:24:30 +00:00
Matthias Braun	9e9e8b3230	X86: Match for X86ISD nodes in LowerBUILD_VECTOR instead of BUILD_VECTORCombine There doesn't seem to be a reason to perform this target ISD node matching in an DAGCombine, moving it to lowering fixes PR23296. Differential Revision: http://reviews.llvm.org/D9137 llvm-svn: 235394	2015-04-21 17:21:36 +00:00
Elena Demikhovsky	50b88ddb87	AVX-512: Added logical and arithmetic instructions for SKX by Asaf Badouh (asaf.badouh@intel.com) llvm-svn: 235375	2015-04-21 10:27:40 +00:00
Matthias Braun	b6b5aaad98	X86: Do not select X86 custom vector nodes if operand types don't match X86ISD::ADDSUB, X86ISD::(F)HADD, X86ISD::(F)HSUB should not be selected if the operand types do not match the result type because vector type legalization cannot deal with this for custom nodes. Testcase X86ISD::ADDSUB is attached. I could not create a testcase for the FHADD/FHSUB cases because of: https://llvm.org/bugs/show_bug.cgi?id=23296 Differential Revision: http://reviews.llvm.org/D9120 llvm-svn: 235367	2015-04-21 01:13:41 +00:00
Simon Pilgrim	749953eebb	[X86][SSE] Fix for getScalarValueForVectorElement to detect scalar sources requiring truncation. The fix ensures that scalar sources inserted into a vector are the correct bit size. Integer scalar sources from BUILD_VECTOR and SCALAR_TO_VECTOR nodes may require truncation that this function doesn't currently support. llvm-svn: 235281	2015-04-19 22:16:49 +00:00
Sanjay Patel	7024b8121a	[x86] Implement combineRepeatedFPDivisors Set the transform bar at 2 divisions because the fastest current x86 FP divider circuit is in SandyBridge / Haswell at 10 cycle latency (best case) relative to a 5 cycle multiplier. So that's the worst case for this transform (no latency win), but multiplies are obviously pipelined while divisions are not, so there's still a big throughput win which we would expect to show up in typical FP code. These are the sequences I'm comparing: divss %xmm2, %xmm0 mulss %xmm1, %xmm0 divss %xmm2, %xmm0 Becomes: movss LCPI0_0(%rip), %xmm3 ## xmm3 = mem[0],zero,zero,zero divss %xmm2, %xmm3 mulss %xmm3, %xmm0 mulss %xmm1, %xmm0 mulss %xmm3, %xmm0 [Ignore for the moment that we don't optimize the chain of 3 multiplies into 2 independent fmuls followed by 1 dependent fmul...this is the DAG version of: https://llvm.org/bugs/show_bug.cgi?id=21768 ...if we fix that, then the transform becomes even more profitable on all targets.] Differential Revision: http://reviews.llvm.org/D8941 llvm-svn: 235012	2015-04-15 15:22:55 +00:00
Krzysztof Parzyszek	a46c36b8f4	Allow memory intrinsics to be tail calls llvm-svn: 234764	2015-04-13 17:16:45 +00:00
Benjamin Kramer	619c4e57ba	Reduce dyn_cast<> to isa<> or cast<> where possible. No functional change intended. llvm-svn: 234586	2015-04-10 11:24:51 +00:00
Simon Pilgrim	49ba9b8e2f	[X86][SSE] Use (V)PINSRB for direct byte insertion in 16i8 buildvector on SSE4.1 targets This patch allows SSE4.1 targets to use (V)PINSRB to create 16i8 vectors by inserting i8 scalars directly into a XMM register instead of merging pairs of i8 scalars into a i16 and using the SSE2 PINSRW instruction. This allows folding of byte loads and reduces scalar register usage as well. Differential Revision: http://reviews.llvm.org/D8839 llvm-svn: 234193	2015-04-06 18:39:00 +00:00
David Majnemer	3337064a47	[WinEH] Sink UnwindHelp completely out of IR We don't need to represent UnwindHelp in IR. Instead, we can use the knowledge that we are emitting the parent function to decide if we should create the UnwindHelp stack object. llvm-svn: 234061	2015-04-03 22:32:26 +00:00
Sanjay Patel	eca590ffb3	[AVX] Improve insertion of i8 or i16 into low element of 256-bit zero vector Without this patch, we split the 256-bit vector into halves and produced something like: movzwl (%rdi), %eax vmovd %eax, %xmm0 vxorps %xmm1, %xmm1, %xmm1 vblendps $15, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7] Now, we eliminate the xor and blend because those zeros are free with the vmovd: movzwl (%rdi), %eax vmovd %eax, %xmm0 This should be the final fix needed to resolve PR22685: https://llvm.org/bugs/show_bug.cgi?id=22685 llvm-svn: 233941	2015-04-02 20:21:52 +00:00
David Majnemer	a225a19dd0	[WinEH] Generate .xdata for catch handlers This lets us catch exceptions in simple cases. N.B. Things that do not work include (but are not limited to): - Throwing from within a catch handler. - Catching an object with a named catch parameter. - 'CatchHigh' is fictitious, we aren't sure of its purpose. - We aren't entirely efficient with regards to the number of EH states that we generate. - IP-to-State tables are sensitive to the order of emission. llvm-svn: 233767	2015-03-31 22:35:44 +00:00
Sanjay Patel	2ae9943881	[X86, AVX] try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector sub-types I suggested this change in D7898 (http://llvm.org/viewvc/llvm-project?view=revision&revision=231354) It improves the v4i64 case although not optimally. This AVX codegen: vmovq {{.#+}} xmm0 = mem[0],zero vxorpd %ymm1, %ymm1, %ymm1 vblendpd {{.#+}} ymm0 = ymm0[0],ymm1[1,2,3] Becomes: vmovsd {{.*#+}} xmm0 = mem[0],zero Unfortunately, this doesn't completely solve PR22685. There are still at least 2 problems under here: We're not handling v32i8 / v16i16. We're not getting the FP / int domains right for instruction selection. But since this patch alone appears to do no harm, reduces code duplication, and helps v4i64, I'm submitting this patch ahead of fixing the above. Differential Revision: http://reviews.llvm.org/D8341 llvm-svn: 233704	2015-03-31 16:32:11 +00:00
Sanjay Patel	bbe1756a8c	more space; NFC llvm-svn: 233554	2015-03-30 15:31:32 +00:00
Sanjay Patel	4fa4a886d7	comment cleanup; NFC llvm-svn: 233293	2015-03-26 17:18:17 +00:00
Sanjay Patel	cdf1e2e363	Use SDValue bool checks; NFC intended llvm-svn: 233289	2015-03-26 16:55:43 +00:00
Sanjay Patel	2f8f019daf	[X86, AVX] improve insertion into zero element of 256-bit vector This patch allows AVX blend instructions to handle insertion into the low element of a 256-bit vector for the appropriate data types. For f32, instead of: vblendps $1, %xmm1, %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[1,2,3] vblendps $15, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7] we get: vblendps $1, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0],ymm0[1,2,3,4,5,6,7] For f64, instead of: vmovsd %xmm1, %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[1] vblendpd $3, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0,1],ymm0[2,3] we get: vblendpd $1, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0],ymm0[1,2,3] For the hardware-neglected integer data types, I left a TODO comment in the code and added regression tests for a follow-on patch. Differential Revision: http://reviews.llvm.org/D8609 llvm-svn: 233199	2015-03-25 17:36:01 +00:00
Sanjay Patel	99d246d7d7	[X86, AVX] recognize shufflevector with zero input as a vperm2 (PR22984) vperm2x128 instructions have the special ability (aka free hardware capability) to shuffle zero values into a vector. This patch recognizes that type of shuffle and generates the appropriate control byte. https://llvm.org/bugs/show_bug.cgi?id=22984 Differential Revision: http://reviews.llvm.org/D8563 llvm-svn: 233100	2015-03-24 19:19:07 +00:00
Sanjay Patel	c88f724fed	[X86] Prefer blendps over insertps codegen for one special case With this patch, for this one exact case, we'll generate: blendps %xmm0, %xmm1, $1 instead of: insertps %xmm0, %xmm1, $0 If there's a memory operand available for load folding and we're optimizing for size, we'll still generate the insertps. The detailed performance data motivation for this may be found in D7866; in summary, blendps has 2-3x throughput vs. insertps on widely used chips. Differential Revision: http://reviews.llvm.org/D8332 llvm-svn: 232850	2015-03-20 21:19:52 +00:00
Simon Pilgrim	180cad2e57	Stripped trailing whitespace. NFC. llvm-svn: 232822	2015-03-20 16:08:17 +00:00
Sanjay Patel	803fb7c85c	move insert, extract, concat helper functions closer to related helper functions; NFCI llvm-svn: 232781	2015-03-19 23:04:25 +00:00
Sanjay Patel	d5c2d287f9	[X86, AVX] use blends instead of insert128 with index 0 Another case of x86-specific shuffle strength reduction: avoid generating insert*128 instructions with index 0 because they are slower than their non-lane-changing blend equivalents. Shuffle lowering already catches most of these cases, but the zero vector case and some other paths such as in the modified test in vector-shuffle-256-v32.ll were getting through. Differential Revision: http://reviews.llvm.org/D8366 llvm-svn: 232773	2015-03-19 22:29:40 +00:00
Simon Pilgrim	5ec5c9cafe	[X86][SSE] Avoid scalarization of v2i64 vector shifts (REAPPLIED) Fixed broken tests. Differential Revision: http://reviews.llvm.org/D8416 llvm-svn: 232682	2015-03-18 22:18:51 +00:00
Eric Christopher	050f590a0c	Revert "[X86][SSE] Avoid scalarization of v2i64 vector shifts" as it appears to have broken tests/bots. This reverts commit r232660. llvm-svn: 232670	2015-03-18 21:01:00 +00:00
Simon Pilgrim	5c837edc2a	[X86][SSE] Avoid scalarization of v2i64 vector shifts Currently v2i64 vectors shifts (non-equal shift amounts) are scalarized, costing 4 x extract, 2 x x86-shifts and 2 x insert instructions - and it gets even more awkward on 32-bit targets. This patch separately shifts the vector by both shift amounts and then shuffles the partial results back together, costing 2 x shuffles and 2 x sse-shifts instructions (+ 2 movs on pre-AVX hardware). Note - this patch only improves the SHL / LSHR logical shifts as only these are supported in SSE hardware. Differential Revision: http://reviews.llvm.org/D8416 llvm-svn: 232660	2015-03-18 19:35:31 +00:00
Sanjay Patel	c2bdf37712	fix comments to match code; NFC llvm-svn: 232385	2015-03-16 15:38:48 +00:00
Gabor Horvath	fee043439c	[llvm] Replacing asserts with static_asserts where appropriate Summary: This patch consists of the suggestions of clang-tidy/misc-static-assert check. Reviewers: alexfh Reviewed By: alexfh Subscribers: xazax.hun, llvm-commits Differential Revision: http://reviews.llvm.org/D8343 llvm-svn: 232366	2015-03-16 09:53:42 +00:00
Simon Pilgrim	e2c6a1b96d	Use SDValue bool check to tidyup some possible combines. NFC. llvm-svn: 232331	2015-03-15 19:47:42 +00:00
Simon Pilgrim	cc4d4d4fef	Use SDValue bool check to tidyup some possible combines. NFC. llvm-svn: 232325	2015-03-15 17:21:35 +00:00
Andrea Di Biagio	510feca1b8	[X86][AVX] Fix wrong lowering of v4x64 shuffles into concat_vector plus extract_subvector nodes. This patch fixes a bug in the shuffle lowering logic implemented by function 'lowerV2X128VectorShuffle'. The are few cases where function 'lowerV2X128VectorShuffle' wrongly expands a shuffle of two v4X64 vectors into a CONCAT_VECTORS of two EXTRACT_SUBVECTOR nodes. The problematic expansion only occurs when the shuffle mask M has an 'undef' element at position 2, and M is equivalent to mask <0,1,4,5>. In that case, the algorithm propagates the wrong vector to one of the two new EXTRACT_SUBVECTOR nodes. Example: ;; define <4 x double> @test(<4 x double> %A, <4 x double> %B) { entry: %0 = shufflevector <4 x double> %A, <4 x double> %B, <4 x i32><i32 undef, i32 1, i32 undef, i32 5> ret <4 x double> %0 } ;; Before this patch, llc (-mattr=+avx) generated: vinsertf128 $1, %xmm0, %ymm0, %ymm0 With this patch, llc correctly generates: vinsertf128 $1, %xmm1, %ymm0, %ymm0 Added test lower-vec-shuffle-bug.ll Differential Revision: http://reviews.llvm.org/D8259 llvm-svn: 232179	2015-03-13 17:29:49 +00:00
Quentin Colombet	f59b2d034c	[X86] Fix a regression introduced by r223641. The permps and permd instructions have their operands swapped compared to the intrinsic definition. Therefore, they do not fall into the INTR_TYPE_2OP category. I did not create a new category for those two, as they are the only one AFAICT in that case. <rdar://problem/20108262> llvm-svn: 232085	2015-03-12 19:34:12 +00:00
Andrea Di Biagio	de2fb00a16	[X86] Fix wrong target specific combine on SETCC nodes. Part of the folding logic implemented by function 'PerformISDSETCCCombine' only worked under the assumption that the condition code in input could have been either SETNE or SETEQ. Unfortunately that assumption was incorrect, and in some cases the algorithm ended up incorrectly folding SETCC nodes. The incorrect folding only affected SETCC dag nodes where: - one of the operands was a build_vector of all zeroes; - the other operand was a SIGN_EXTEND from a vector of MVT:i1 elements; - the condition code was neither SETNE nor SETEQ. Example: (setcc (v4i32 (sign_extend v4i1:%A)), (v4i32 VectorOfAllZeroes), setge) Before this patch, the entire dag node sequence from the example was incorrectly folded to node %A. With this patch, the dag node sequence is folded to a (xor %A, (v4i1 VectorOfAllOnes)). Added test setcc-combine.ll. Thanks to Greg Bedwell for spotting this issue. llvm-svn: 232046	2015-03-12 15:16:58 +00:00
Eric Christopher	9deb75d176	Have getCallPreservedMask and getThisCallPreservedMask take a MachineFunction argument so that we can grab subtarget specific features off of it. llvm-svn: 231979	2015-03-11 22:42:13 +00:00
Andrea Di Biagio	6c7d70469c	[X86][AVX] Fix wrong lowering of VPERM2X128 nodes There were cases where the backend computed a wrong permute mask for a VPERM2X128 node. Example: \code define <8 x float> @foo(<8 x float> %a, <8 x float> %b) { %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 undef, i32 undef, i32 6, i32 7, i32 undef, i32 undef, i32 6, i32 7> ret <8 x float> %shuffle } \code end Before this patch, llc (with -mattr=+avx) emitted the following vperm2f128: vperm2f128 $0, %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[0,1,0,1] With this patch, llc emits a vperm2f128 with a correct permute mask: vperm2f128 $17, %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[2,3,2,3] Differential Revision: http://reviews.llvm.org/D8119 llvm-svn: 231601	2015-03-08 16:28:47 +00:00
Simon Pilgrim	8c58c066b7	[DAGCombiner] Add a shuffle mask commutation helper function. NFCI. We have an increasing number of cases where we are creating commuted shuffle masks - all implementing nearly the same code. This patch adds a static helper function - ShuffleVectorSDNode::commuteMask() and replaces a number of cases to use it. Differential Revision: http://reviews.llvm.org/D8139 llvm-svn: 231581	2015-03-07 22:33:11 +00:00
Benjamin Kramer	4e0c7928e2	X86: Roll repetitive code into a loop. NFC. llvm-svn: 231565	2015-03-07 15:06:16 +00:00
Eric Christopher	25dbdeb4d1	Typo. llvm-svn: 231547	2015-03-07 01:39:09 +00:00
Sanjay Patel	302404b277	[AVX] Lower / fast-isel scalar FP selects into VBLENDV instructions (PR22483) This patch reduces code size for all AVX targets and increases speed for some chips. SSE 4.1 introduced the useless (see code comments) 2-register form of BLENDV and only in the packed float/double flavors. AVX subsequently made the instruction useful by adding a 4-register operand form. So we just need to paper over the lack of scalar forms of this instruction, complicate the code to choose float or double forms, and use blendv on scalars since all FP is in xmm registers anyway. This gives us an approximately 50% speed up for a blendv microbenchmark sequence on SandyBridge and Haswell: blendv : 29.73 cycles/iter logic : 43.15 cycles/iter No new test cases with this patch because: 1. fast-isel-select-sse.ll tests the positive side for regular X86 lowering and fast-isel 2. sse-minmax.ll and fp-select-cmp-and.ll confirm that we're not firing for scalar selects without AVX 3. fp-select-cmp-and.ll and logical-load-fold.ll confirm that we're not firing for scalar selects with constants. http://llvm.org/bugs/show_bug.cgi?id=22483 Differential Revision: http://reviews.llvm.org/D8063 llvm-svn: 231408	2015-03-05 21:46:54 +00:00
Elena Demikhovsky	de05f10de2	AVX-512, SKX: Enabled masked_load/store operations for this target. Added lowering for ISD::CONCAT_VECTORS and ISD::INSERT_SUBVECTOR for i1 vectors, it is needed to pass all masked_memop.ll tests for SKX. llvm-svn: 231371	2015-03-05 15:11:35 +00:00
Craig Topper	0ee8470a43	[X86] Use vmovss to handle inserting an element into index 0 of a v8f32 vector of zeros. llvm-svn: 231354	2015-03-05 06:38:42 +00:00
JF Bastien	f14889ee34	Mutate TargetLowering::shouldExpandAtomicRMWInIR to specifically dictate how AtomicRMWInsts are expanded. Summary: In PNaCl, most atomic instructions have their own @llvm.nacl.atomic.* function, each one, with a few exceptions, represents a consistent behaviour across all NaCl-supported targets. Unfortunately, the atomic RMW operations nand, [u]min, and [u]max aren't directly represented by any such @llvm.nacl.atomic.* function. This patch refines shouldExpandAtomicRMWInIR in TargetLowering so that a future `Le32TargetLowering` class can selectively inform the caller how the target desires the atomic RMW instruction to be expanded (ie via load-linked/store-conditional for ARM/AArch64, via cmpxchg for X86/others?, or not at all for Mips) if at all. This does not represent a behavioural change and as such no tests were added. Patch by: Richard Diamond. Reviewers: jfb Reviewed By: jfb Subscribers: jfb, aemerson, t.p.northover, llvm-commits Differential Revision: http://reviews.llvm.org/D7713 llvm-svn: 231250	2015-03-04 15:47:57 +00:00
Sanjay Patel	948602bd17	use bool operator shortcut; NFC llvm-svn: 231123	2015-03-03 20:41:27 +00:00
Ahmed Bougacha	afbd6887c4	[X86] Special-case 2x CMOV when custom-inserting. This lets us avoid a few copies that are otherwise hard to get rid of. The way this is done is, the custom-inserter looks at the following instruction for another CMOV, and replaces both at the same time. A previous version used a new CMOV2 opcode, but the custom inserter is expected to be able to return a different basic block anyway, which means it's OK - though far from ideal - to alter that block's contents. Explicitly document that, in case it ever makes a difference. Alternatives welcome! Follow-up to r231045. rdar://19767934 Closes http://reviews.llvm.org/D8019 llvm-svn: 231046	2015-03-03 01:21:16 +00:00
Ahmed Bougacha	066d0b8e64	[X86] Combine (cmov (and/or (setcc) (setcc))) into (cmov (cmov)). Fold and/or of setcc's to double CMOV: (CMOV F, T, ((cc1 \| cc2) != 0)) -> (CMOV (CMOV F, T, cc1), T, cc2) (CMOV F, T, ((cc1 & cc2) != 0)) -> (CMOV (CMOV T, F, !cc1), F, !cc2) When we can't use the CMOV instruction, it might increase branch mispredicts. When we can, or when there is no mispredict, this improves throughput and reduces register pressure. These can't be catched by generic combines, because the pattern can appear when legalizing some instructions (such as fcmp une). rdar://19767934 http://reviews.llvm.org/D7634 llvm-svn: 231045	2015-03-03 01:09:14 +00:00
Benjamin Kramer	42cc33e816	X86: Replace variadic function with init list. NFC. llvm-svn: 230911	2015-03-01 21:47:40 +00:00
Benjamin Kramer	030133c5db	ArrayRef: Remove the equals helper with many arguments. With initializer lists there is a really neat idiomatic way to write this, 'ArrayRef.equals({1, 2, 3, 4, 5})'. Remove the equal method which always had a hard limit on the number of arguments. I considered rewriting it with variadic templates but that's not really a good fit for a function with homogeneous arguments. 'ArrayRef == {1, 2, 3, 4, 5}' would've been even more awesome, but C++11 doesn't allow init lists with binary operators. llvm-svn: 230907	2015-03-01 21:05:05 +00:00
Craig Topper	782d620657	[X86] Remove the blendpd/blendps/pblendw/pblendd intrinsics. They can represented by shuffle_vector instructions. llvm-svn: 230860	2015-02-28 19:33:17 +00:00
Benjamin Kramer	5fbfe2ffdc	Convert push_back loops into append calls. No functionality change intended. llvm-svn: 230849	2015-02-28 13:20:15 +00:00
Chandler Carruth	9ad2ffac23	[x86] Run most of the rest of the shuffle combining over non-128-bit vectors. This lets us fix the rest of the v16 lowering problems when pshufb is clearly better. We might still be able to improve some of the lowerings by enabling the other combine-based rewriting to fire for non-128-bit vectors, but this at least should remove any regressions from using the fancy v16i16 lowering strategy. llvm-svn: 230753	2015-02-27 12:13:14 +00:00
Chandler Carruth	66b705bc64	[x86] Teach a bunch of the x86-specific shuffle combining to work with 256-bit vectors as well as 128-bit vectors. Fixes some of the redundant shuffles for v16i16. llvm-svn: 230752	2015-02-27 11:45:13 +00:00
Chandler Carruth	97f3260f57	[x86] Make the v8i16 clever single-input shuffle lowering usable for repeated 128-bit lane shuffles of wider vector types and use it to lower 256-bit v16i16 vector shuffles where applicable. This should let us perfectly lowering the pattern of pshuflw and pshufhw even for AVX2 256-bit patterns. I've not added AVX-512 support, but it should be trivial for someone working on that to wire up. Note that currently this generates bad, long shuffle chains because we don't combine 256-bit target shuffles. The subsequent patches will fix that. llvm-svn: 230751	2015-02-27 11:33:46 +00:00
Chandler Carruth	ddc4d085cc	[x86] Make the single-input v8i16 lowering directly recurse rather than going back through the entire vector shuffle lowering. This is an important step to being able to re-use this logic. llvm-svn: 230743	2015-02-27 09:11:38 +00:00
Eric Christopher	11e4df73c8	getRegForInlineAsmConstraint wants to use TargetRegisterInfo for a lookup, pass that in rather than use a naked call to getSubtargetImpl. This involved passing down and around either a TargetMachine or TargetRegisterInfo. Update all callers/definitions around the targets and SelectionDAG. llvm-svn: 230699	2015-02-26 22:38:43 +00:00
Chandler Carruth	653773d004	[x86] Fix PR22706 where we would incorrectly try lower a v32i8 dynamic blend as legal. We made the same mistake in two different places. Whenever we are custom lowering a v32i8 blend we need to check whether we are custom lowering it only for constant conditions that can be shuffled, or whether we actually have AVX2 and full dynamic blending support on bytes. Both are fixed, with comments added to make it clear what is going on and a new test case. llvm-svn: 230695	2015-02-26 22:15:34 +00:00
Chandler Carruth	7bd840d058	[x86] Restructure the comments and the conditions for handling dynamic blends. This makes it much more clear what is going on. The case we're handling is that of dynamic conditions, and we're bailing when the nature of the vector types and subtarget preclude lowering the dynamic condition vselect as an actual blend. No functionality changed here, but this will make a subsequent bug-fix to this code much more clear. llvm-svn: 230690	2015-02-26 21:29:06 +00:00
Chandler Carruth	efc6819041	[x86] Re-order the combines of select in the X86 backend. This doesn't change functionality, but makes it more clear that the dynamic case and the shuffle case don't overlap in any interesting way. llvm-svn: 230689	2015-02-26 21:21:36 +00:00
Chandler Carruth	0757f14c69	[x86] Add an assert to catch if we ever try to blend a v32i8 without AVX2. llvm-svn: 230688	2015-02-26 21:18:20 +00:00
Reid Kleckner	e81017248c	Don't sibcall between SysV and Win64 convention functions The shadow stack space expectations won't match. Fixes PR22709. llvm-svn: 230667	2015-02-26 19:43:20 +00:00
Chandler Carruth	8e0a3ea52c	[x86] Sink the single-input v8i16 lowering code that is actually formulaic into the top v8i16 lowering routine. This makes the generalized lowering a completely general and single path lowering which will allow generalizing it in turn for multiple 128-bit lanes. llvm-svn: 230623	2015-02-26 11:00:40 +00:00
Chandler Carruth	11e7f6b50a	[x86] Remove a SimpleTy usage. No need for it here, we already have the MVT. llvm-svn: 230622	2015-02-26 10:37:01 +00:00
Chandler Carruth	d283cb6203	[x86] Make the vector shuffle helpers order the SDLoc and MVT arguments. This ordering matches that of DAG.getNode. llvm-svn: 230617	2015-02-26 08:19:24 +00:00
Eric Christopher	5f54195e4a	Remove a FIXME. Explanation: This function is in TargetLowering because it uses RegClassForVT which would need to be moved to TargetRegisterInfo and would necessitate moving isTypeLegal over as well - a massive change that would just require TargetLowering having a TargetRegisterInfo class member that it would use. llvm-svn: 230585	2015-02-26 00:00:35 +00:00
Eric Christopher	23a3a7c871	Remove an argument-less call to getSubtargetImpl from TargetLoweringBase. This required plumbing a TargetRegisterInfo through computeRegisterProperties and into findRepresentativeClass which uses it for register class iteration. This required passing a subtarget into a few target specific initializations of TargetLowering. llvm-svn: 230583	2015-02-26 00:00:24 +00:00
Sanjay Patel	a709f3a5ae	simplify control flow; NFC llvm-svn: 230342	2015-02-24 16:26:02 +00:00
Sanjay Patel	2898548598	fix typo in comment; NFC llvm-svn: 230338	2015-02-24 16:11:05 +00:00
Elena Demikhovsky	52e81bc499	AVX-512: recommitted 229837 + bugfix + test llvm-svn: 230223	2015-02-23 15:12:31 +00:00
Benjamin Kramer	71bfe3d1a4	X86: Remove custom lowering of SIGN_EXTEND_INREG This was just replicating logic from the legalizer. Covered by existing tests. llvm-svn: 230136	2015-02-21 14:31:29 +00:00
Tim Northover	3b6b7ca2bc	CodeGen: convert CCState interface to using ArrayRefs Everyone except R600 was manually passing the length of a static array at each callsite, calculated in a variety of interesting ways. Far easier to let ArrayRef handle that. There should be no functional change, but out of tree targets may have to tweak their calls as with these examples. llvm-svn: 230118	2015-02-21 02:11:17 +00:00
Simon Pilgrim	b7875837c7	LowerScalarImmediateShift - Merged v16i8 and v32i8 shift lowering. NFC. llvm-svn: 230074	2015-02-20 22:13:03 +00:00
Sanjay Patel	1b53f74c4c	canonicalize a v2f64 blendi of 2 registers This canonicalization step saves us 3 pattern matching possibilities * 4 math ops for scalar FP math that uses xmm regs. The backend can re-commute the operands post-instruction-selection if that makes register allocation better. The tests in llvm/test/CodeGen/X86/sse-scalar-fp-arith.ll cover this scenario already, so there are no new tests with this patch. Differential Revision: http://reviews.llvm.org/D7777 llvm-svn: 230024	2015-02-20 16:55:27 +00:00
Chandler Carruth	0c5f059865	[x86] Switching the shuffle equivalence test to a variadic template was the wrong answer. We also got initializer lists which are way cleaner for this kind of thing. Let's use those and make this a normal, boring functionn accepting ArrayRef. llvm-svn: 230004	2015-02-20 10:47:28 +00:00
Nick Lewycky	a2bda08806	Fix build in release mode, -Wunused-variable on this lambda function used only in an assert. llvm-svn: 229977	2015-02-20 07:16:17 +00:00
David Blaikie	9a3644c472	Fix -Wunused-variable warning in non-asserts build, and optimize a little bit while I'm here. llvm-svn: 229970	2015-02-20 06:28:38 +00:00
Chandler Carruth	4041f2217b	[x86] Remove the old vector shuffle lowering code and its flag. The new shuffle lowering has been the default for some time. I've enabled the new legality testing by default with no really blocking regressions. I've fuzz tested this very heavily (many millions of fuzz test cases have passed at this point). And this cleans up a ton of code. =] Thanks again to the many folks that helped with this transition. There was a lot of work by others that went into the new shuffle lowering to make it really excellent. In case you aren't using a diff algorithm that can handle this: X86ISelLowering.cpp: 22 insertions(+), 2940 deletions(-) llvm-svn: 229964	2015-02-20 04:25:04 +00:00
Chandler Carruth	eb206aa1ea	[x86] Now that the new vector shuffle legality is enabled and everything is going well, remove the flag and the code for the old legality tests. This is the first step toward removing the entire old vector shuffle lowering. Much more code to delete coming up next. llvm-svn: 229963	2015-02-20 03:59:35 +00:00
Chandler Carruth	d2b14b296c	[x86] Make the new vector shuffle legality test on by default, which reflects the fact that the x86 backend can in fact lower any shuffle you want it to with reasonably high code quality. My recent work on the new vector shuffle has made this regress very little. The diff in the test cases makes me very, very happy. llvm-svn: 229958	2015-02-20 03:05:47 +00:00
Eric Christopher	0d94fa98e5	Revert "AVX-512: Full implementation for VRNDSCALESS/SD instructions and intrinsics." The instructions were being generated on architectures that don't support avx512. This reverts commit r229837. llvm-svn: 229942	2015-02-20 00:45:28 +00:00
Benjamin Kramer	ea68a944a1	Demote vectors to arrays. No functionality change. llvm-svn: 229861	2015-02-19 15:26:17 +00:00
Chandler Carruth	5d1a84b7b8	[x86] Delete still more piles of complex code now that we have a good systematic lowering of v8i16. This required a slight strategy shift to prefer unpack lowerings in more places. While this isn't a cut-and-dry win in every case, it is in the overwhelming majority. There are only a few places where the old lowering would probably be a touch faster, and then only by a small margin. In some cases, this is yet another significant improvement. llvm-svn: 229859	2015-02-19 15:21:57 +00:00
Chandler Carruth	0b39536390	[x86] Teach the unpack lowering how to lower with an initial unpack in addition to lowering to trees rooted in an unpack. This saves shuffles and or registers in many various ways, lets us handle another class of v4i32 shuffles pre SSE4.1 without domain crosses, etc. llvm-svn: 229856	2015-02-19 15:06:13 +00:00
Chandler Carruth	352eba1c29	[x86] Dramatically improve v8i16 shuffle lowering by not using its terribly complex partial blend logic. This code path was one of the more complex and bug prone when it first went in and it hasn't faired much better. Ultimately, with the simpler basis for unpack lowering and support bit-math blending, this is completely obsolete. In the worst case without this we generate different but equivalent instructions. However, in many cases we generate much better code. This is especially true when blends or pshufb is available. This does expose one (minor) weakness of the unpack lowering that I'll try to address. In case you were wondering, this is actually a big part of what I've been trying to pull off in the recent string of commits. llvm-svn: 229853	2015-02-19 14:08:24 +00:00
Chandler Carruth	2c0390ca4b	[x86] Remove the final fallback in the v8i16 lowering that isn't really needed, and significantly improve the SSSE3 path. This makes the new strategy much more clear. If we can blend, we just go with that. If we can't blend, we try to permute into an unpack so that we handle cases where the unpack doing the blend also simplifies the shuffle. If that fails and we've got SSSE3, we now call into factored-out pshufb lowering code so that we leverage the fact that pshufb can set up a blend for us while shuffling. This generates great code, especially because we know we don't have a fast blend at this point. Finally, we fall back on decomposing into permutes and blends because we do at least have a bit-math-based blend if we need to use that. This pretty significantly improves some of the v8i16 code paths. We never need to form pshufb for the single-input shuffles because we have effective target-specific combines to form it there, but we were missing its effectiveness in the blends. llvm-svn: 229851	2015-02-19 13:56:49 +00:00
Chandler Carruth	f0f0d27391	[x86] Simplify the pre-SSSE3 v16i8 lowering significantly by decomposing them into permutes and a blend with the generic decomposition logic. This works really well in almost every case and lets the code only manage the expansion of a single input into two v8i16 vectors to perform the actual shuffle. The blend-based merging is often much nicer than the pack based merging that this replaces. The only place where it isn't we end up blending between two packs when we could do a single pack. To handle that case, just teach the v2i64 lowering to handle these blends by digging out the operands. With this we're down to only really random permutations that cause an explosion of instructions. llvm-svn: 229849	2015-02-19 13:15:12 +00:00
Chandler Carruth	8817e5e01b	[x86] Remove the insanely over-aggressive unpack lowering strategy for v16i8 shuffles, and replace it with new facilities. This uses precise patterns to match exact unpacks, and the new generalized unpack lowering only when we detect a case where we will have to shuffle both inputs anyways and they terminate in exactly a blend. This fixes all of the blend horrors that I uncovered by always lowering blends through the vector shuffle lowering. It also removes sooooo much of the crazy instruction sequences required for v16i8 lowering previously. Much cleaner now. The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when it would be marginally nicer to use pshufb+pshufb+por. However, the difference there is tiny. In many cases its a win because we re-use the pshufb mask. In others, we get to avoid the pshufb entirely. I've left a FIXME, but I'm dubious we can really do better than this. I'm actually pretty happy with this lowering now. For SSE2 this exposes some horrors that were really already there. Those will have to fixed by changing a different path through the v16i8 lowering. llvm-svn: 229846	2015-02-19 12:10:37 +00:00
Chandler Carruth	38dea42ddf	[x86] The SELECT x86 DAG combine also does legalization. It used to rely on things not being marked as either custom or legal, but we now do custom lowering of more VSELECT nodes. To cope with this, manually replicate the legality tests here. These have to stay in sync with the set of tests used in the custom lowering of VSELECT. Ideally, we wouldn't do any of this combine-based-legalization when we have an actual custom legalization step for VSELECT, but I'm not going to be able to rewrite all of that today. I don't have a test case for this currently, but it was found when compiling a number of the test-suite benchmarks. I'll try to reduce a test case and add it. This should at least fix the test-suite fallout on build bots. llvm-svn: 229844	2015-02-19 11:43:37 +00:00
Elena Demikhovsky	69e8b45b13	AVX-512: Full implementation for VRNDSCALESS/SD instructions and intrinsics. llvm-svn: 229837	2015-02-19 10:48:04 +00:00
Chandler Carruth	bcb6c5f62d	[x86] Add support for bit-wise blending and use it in the v8 and v16 lowering paths. I'm going to be leveraging this to simplify a lot of the overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes. Sadly, this isn't profitable on v4i32 and v2i64. There, the float and double blending instructions for pre-SSE4.1 are actually pretty good, and we can't beat them with bit math. And once SSE4.1 comes around we have direct blending support and this ceases to be relevant. Also, some of the test cases look odd because the domain fixer canonicalizes these to floating point domain. That's OK, it'll use the integer domain when it matters and some day I may be able to update enough of LLVM to canonicalize the other way. This restores almost all of the regressions from teaching x86's vselect lowering to always use vector shuffle lowering for blends. The remaining problems are because the v16 lowering path is still doing crazy things. I'll be re-arranging that strategy in more detail in subsequent commits to finish recovering the performance here. llvm-svn: 229836	2015-02-19 10:46:52 +00:00
Chandler Carruth	b89464a9b6	[x86,sdag] Two interrelated changes to the x86 and sdag code. First, don't combine bit masking into vector shuffles (even ones the target can handle) once operation legalization has taken place. Custom legalization of vector shuffles may exist for these patterns (making the predicate return true) but that custom legalization may in some cases produce the exact bit math this matches. We only really want to handle this prior to operation legalization. However, the x86 backend, in a fit of awesome, relied on this. What it would do is mark VSELECTs as expand, which would turn them into arithmetic, which this would then match back into vector shuffles, which we would then lower properly. Amazing. Instead, the second change is to teach the x86 backend to directly form vector shuffles from VSELECT nodes with constant conditions, and to mark all of the vector types we support lowering blends as shuffles as custom VSELECT lowering. We still mark the forms which actually support variable blends as legal so that the custom lowering is bypassed, and the legal lowering can even be used by the vector shuffle legalization (yes, i know, this is confusing. but that's how the patterns are written). This makes the VSELECT lowering much more sensible, and in fact should fix a bunch of bugs with it. However, as you'll see in the test cases, right now what it does is point out the hilarious deficiency of the new vector shuffle lowering when it comes to blends. Fortunately, my very next patch fixes that. I can't submit it yet, because that patch, somewhat obviously, forms the exact and/or pattern that the DAG combine is matching here! Without this patch, teaching the vector shuffle lowering to produce the right code infloops in the DAG combiner. With this patch alone, we produce terrible code but at least lower through the right paths. With both patches, all the regressions here should be fixed, and a bunch of the improvements (like using 2 shufps with no memory loads instead of 2 andps with memory loads and an orps) will stay. Win! There is one other change worth noting here. We had hilariously wrong vectorization cost estimates for vselect because we fell through to the code path that assumed all "expand" vector operations are scalarized. However, the "expand" lowering of VSELECT is vector bit math, most definitely not scalarized. So now we go back to the correct if horribly naive cost of "1" for "not scalarized". If anyone wants to add actual modeling of shuffle costs, that would be cool, but this seems an improvement on its own. Note the removal of 16 and 32 "costs" for doing a blend. Even in SSE2 we can blend in fewer than 16 instructions. ;] Of course, we don't right now because of OMG bad code, but I'm going to fix that. Next patch. I promise. llvm-svn: 229835	2015-02-19 10:36:19 +00:00
Benjamin Kramer	6ca8992018	X86: Use bitset to manage a bag of bits. NFC. Doesn't matter in terms of memory usage or perf here, but it's a neat simplification. llvm-svn: 229672	2015-02-18 14:10:44 +00:00
Chandler Carruth	bbb377c3a1	[x86] Tighten the assertions to document that canonicalization has actually removed all but a very small number of choices for v2i64. Also remove dead code handling cases that simply cannot arise. llvm-svn: 229670	2015-02-18 11:46:29 +00:00
Chandler Carruth	811f0ee8c1	[x86] Switch an if which is trivially true to an assert. NFC llvm-svn: 229669	2015-02-18 11:46:27 +00:00
Chandler Carruth	8f3e585b17	[x86] Remove some more 'bit' nomenclature from the generic shift lowering. llvm-svn: 229668	2015-02-18 11:46:23 +00:00
Chandler Carruth	672a98ea28	[x86] Fold together the two shift lowering strategies. They were doing quite literally the same work, we just need to special case the >64-bit element shift code emission to emit the byte shift instructions and offsets. This also makes reasoning about each of the vector lowering strategies easier as we don't have to remember to use both forms. llvm-svn: 229662	2015-02-18 10:40:38 +00:00
Chandler Carruth	48cc6c623a	[x86] Refactor the bit shift code the same as I just did the byte shift code. While this didn't have the miscompile (it used MatchLeft consistently) it missed some cases where it could use right shifts. I've added a test case Craig Topper came up with to exercise the right shift matching. This code is really identical between the two. I'm going to merge them next so that we don't keep two copies of all of this logic. llvm-svn: 229655	2015-02-18 09:19:58 +00:00
Elena Demikhovsky	714f23bcdb	AVX-512: Added support for FP instructions with embedded rounding mode. By Asaf Badouh <asaf.badouh@intel.com> llvm-svn: 229645	2015-02-18 07:59:20 +00:00
Chandler Carruth	55553f5299	[x86] Rewrite the byte shift detection to not use boolean variables to track state. I didn't like this in the code review because the pattern tends to be error prone, but I didn't see a clear way to rewrite it. Turns out that there were bugs here, I found them when fuzz testing our shuffle lowering for correctness on x86. The core of the problem is that we need to consistently test all our preconditions for the same directionality of shift and the same input vector. Instead, formulate this as two predicates (one doesn't depend on the input in any way), pass things like the directionality and input vector as inputs, and loop over the alternatives. This fixes a pattern of very rare miscompiles coming out of this code. Turned up roughly 4 out of every 1 million v8 shuffles in my fuzz testing. The new code is over half a million test runs with no failures yet. I've also fuzzed every other function in the lowering code with over 3.5 million test cases and not discovered any other miscompiles. llvm-svn: 229642	2015-02-18 07:13:48 +00:00
Simon Pilgrim	1d89a02abb	[X86][SSE] Generalised unpckl/unpckh shuffle matching Added commuted unpckl/unpckh shuffle matching patterns as many cases containing undefined lanes fail to commute by themselves. Differential Revision: http://reviews.llvm.org/D7564 llvm-svn: 229571	2015-02-17 22:24:32 +00:00
Benjamin Kramer	6cd780ff21	Prefer SmallVector::append/insert over push_back loops. Same functionality, but hoists the vector growth out of the loop. llvm-svn: 229500	2015-02-17 15:29:18 +00:00
Andrea Di Biagio	eb97f92489	[X86] Silence -Wsign-compare warnings. GCC 4.8 reported two new warnings due to comparisons between signed and unsigned integer expressions. The new warnings were accidentally introduced by revision 229480. Added explicit casts to silence the warnings. No functional change intended. llvm-svn: 229488	2015-02-17 11:20:11 +00:00
Michael Kuperstein	ff5acaf50c	[X86] Combine vector anyext + and into a vector zext Vector zext tends to get legalized into a vector anyext, represented as a vector shuffle with an undef vector + a bitcast, that gets ANDed with a mask that zeroes the undef elements. Combine this into an explicit shuffle with a zero vector instead. This allows shuffle lowering to match it as a zext, instead of matching it as an anyext and emitting an explicit AND. This combine only covers a subset of the cases, but it's a start. Differential Revision: http://reviews.llvm.org/D7666 llvm-svn: 229480	2015-02-17 08:22:51 +00:00
Chandler Carruth	55db07016e	[x86] Teach the unpack lowering to try wider element unpacks. This allows it to match still more places where previously we would have to fall back on floating point shuffles or other more complex lowering strategies. I'm hoping to replace some of the hand-rolled unpack matching with this routine is it gets more and more clever. llvm-svn: 229463	2015-02-17 02:12:24 +00:00
Cameron McInally	c5764cbe4e	[AVX512] Make 512b vector floating point rounds legal on AVX512. llvm-svn: 229445	2015-02-16 22:15:42 +00:00
Craig Topper	49df44e2e2	[X86] Remove the multiply by 8 that goes into the shift constant for X86ISD::VSHLDQ and X86ISD::VSRLDQ. This simplifies the pattern matching in isel and allows these nodes to become the patterns embedded in the instruction. llvm-svn: 229431	2015-02-16 20:52:07 +00:00
Chandler Carruth	1e57e2deb8	[x86] Add a generic unpack-targeted lowering technique. This can be used to generically lower blends and is particularly nice because it is available frome SSE2 onward. This removes a lot of the remaining domain crossing blends in SSE2 code. I'm hoping to replace some of the "interleaved" lowering hacks with something closer to this which should be more principled. First, this needs to learn how to detect and use other interleavings besides that of the natural type provided. That will be a follow-up patch though. llvm-svn: 229378	2015-02-16 12:28:18 +00:00
Chandler Carruth	c802085b3a	[x86] Add initial basic support for forming blends of v16i8 vectors. This blend instruction is ... really lame. The register usage is insane. As a consequence this is probably only barely better than 2 pshufbs followed by a por, and that mostly because it only has to read from a single memory location. However, this doesn't fix as much as I kind of expected, so more to go. Pretty sure that the ordering and delegation of v16i8 is just really, really bad. llvm-svn: 229373	2015-02-16 10:58:23 +00:00
Chandler Carruth	e63bbd97a7	[x86] Switch my usage of VariadicFunction to a "normal" variadic template now that we can use them. This is, of course, horribly ugly because of the required recursive formulation. Suggestions for making it less ugly welcome. llvm-svn: 229367	2015-02-16 09:59:48 +00:00
Craig Topper	7e8dcef094	[X86] Add support for lowering shuffles to 256-bit PALIGNR instruction. llvm-svn: 229359	2015-02-16 06:29:06 +00:00
Chandler Carruth	87e580a659	[x86] Teach the 128-bit vector shuffle lowering routines to take advantage of the existence of a reasonable blend instruction. The 256-bit vector shuffle lowering has leveraged the general technique of decomposed shuffles and blends for quite some time, but this never made it back into the 128-bit code, and there are a large number of patterns where this is substantially better. For example, this removes almost all domain crossing in vector shuffles that involve some blend and some permutation with SSE4.1 and later. See the massive reduction in 'shufps' for integer test cases in this commit. This isn't perfect yet for a few reasons: 1) The v8i16 shuffle lowering continues to plague me. We don't always form an unpack-based blend when that would be better. But the wins pretty drastically outstrip the losses here. 2) The v16i8 shuffle lowering is just a disaster here. I never went and implemented blend support here for some terrible reason. I'll do that next probably. I've not updated it for now. More variations on this technique are coming as well -- we don't shuffle-into-unpack or shuffle-into-palignr, both of which would also be profitable. Note that some test cases grow significantly in the number of instructions, but I expect to actually be faster. We use pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are very likely to pipeline well (two ports on most modern intel chips) and the blend is a very fast instruction. The domain switch penalty will essentially always be more than a blend instruction, which is the only increase in tree height. llvm-svn: 229350	2015-02-16 01:52:02 +00:00
Simon Pilgrim	2a7bedb73e	Coding style fixes to recent patches. NFC. llvm-svn: 229312	2015-02-15 14:19:29 +00:00
Simon Pilgrim	00bd79d794	[X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2 This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets. It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions. Differential Revision: http://reviews.llvm.org/D7596 llvm-svn: 229311	2015-02-15 13:19:52 +00:00
Chandler Carruth	bf0fb06e0d	[x86] Teach the decomposed shuffle/blend lowering to use an early blend when that will allow it to lower with a single permute instead of multiple permutes. It tries to detect when it will only have to do a single permute in either case to maximize folding of loads and such. This cuts a lot of the avx2 shuffle permute counts in half. =] llvm-svn: 229309	2015-02-15 12:42:15 +00:00
Chandler Carruth	75d9a97569	[x86] Teach the shuffle mask equivalence test to look through build vectors and detect equivalent inputs. This lets the code match unpck-style instructions when only one of the inputs are lined up but the other input is a splat and so which lanes we pull from doesn't matter. Today, this doesn't really happen, but just by accident. I have a patch that normalizes how we shuffle splats, and with that patch this will be necessary for a lot of the mask equivalence tests to work. I don't really know how to write a test case for this specific change until the other change lands though. llvm-svn: 229307	2015-02-15 12:07:55 +00:00
Chandler Carruth	4fe214b1f2	[x86] Tweak the ordering of unpack matching vs. element insertion, and don't try to do element insertion for non-zero-index floating point vectors. We don't have any useful patterns or lowering for element insertion into high elements of a floating point vector, and the generic shuffle lowering will end up being better -- namely it will fall back to unpck. But we should try to handle other forms of element insertion before matching unpck patterns. While this doesn't matter much right now, I'm working on a patch that makes unpck matching much more powerful, and that patch will break without this re-ordering. llvm-svn: 229306	2015-02-15 12:01:14 +00:00
Chandler Carruth	56e0ceda0d	[x86] Stop shuffling zero vectors. =] I was somewhat surprised this pattern really came up, but it does. It seems better to just directly handle it than try to special case every place where we end up forming a shuffle that devolves to a shuffle of a zero vector. llvm-svn: 229301	2015-02-15 10:34:52 +00:00
Chandler Carruth	3d272daaed	[x86] Use a more helpful parenthesizing of these comparisons. Silences a -Wparentheses complaint from GCC. llvm-svn: 229300	2015-02-15 10:15:20 +00:00
Chandler Carruth	62558c1d4d	[x86] When splitting 256-bit vectors into 128-bit vectors, don't extract subvectors from buildvectors. That doesn't really make any sense and it breaks all of the down-stream matching of buildvectors to cleverly lower shuffles. With this, we now get the shift-based lowering of 256-bit vector shuffles with AVX1 when we split them into 128-bit vectors. We also do much better on the zero-extension patterns, although there remains quite a bit of room for improvement here. llvm-svn: 229299	2015-02-15 10:12:02 +00:00
Chandler Carruth	a6f8a3661c	[x86] Make computing the zeroable elements slightly more powerful, at least in theory. I don't actually have a test case that benefits from this, but theoretically, it could come up, and I don't want to try to think about whether this is the culprit or something else is, so I'd rather just make this code powerful. =/ Makes me sad that I can't really test it though. llvm-svn: 229298	2015-02-15 09:33:36 +00:00
Chandler Carruth	0ddfe0c7c5	[x86] Add a slight variation on some of the other generic shuffle lowerings -- one which decomposes into an initial blend followed by a permute. Particularly on newer chips, blends are handled independently of shuffles and so this is much less bottlenecked on the single port that floating point shuffles are executed with on Intel. I'll be adding this lowering to a bunch of other code paths in subsequent commits to handle still more places where we can effectively leverage blends when they're available in the ISA. llvm-svn: 229292	2015-02-15 08:26:30 +00:00
Duncan P. N. Exon Smith	5975a703e6	X86: Canonicalize access to function attributes, NFC Canonicalize access to function attributes to use the simpler API. getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind) => getFnAttribute(Kind) getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind) => hasFnAttribute(Kind) llvm-svn: 229214	2015-02-14 01:59:52 +00:00
Sanjay Patel	34da52a894	fix typos; NFC llvm-svn: 229155	2015-02-13 21:07:22 +00:00
Craig Topper	007a713ebf	Fix a typo in a comment. NFC llvm-svn: 229071	2015-02-13 06:07:29 +00:00
David Majnemer	a12fcb790f	X86: Don't crash if we can't decode the pshufb mask Constant pool entries are uniqued by their contents regardless of their type. This means that a pshufb can have a shuffle mask which isn't a simple array of bytes. The code path which attempts to decode the mask didn't check for failure, causing PR22559. llvm-svn: 228979	2015-02-12 23:26:26 +00:00
Benjamin Kramer	5f6a907288	MathExtras: Bring Count(Trailing\|Leading)Ones and CountPopulation in line with countTrailingZeros Update all callers. llvm-svn: 228930	2015-02-12 15:35:40 +00:00
Elena Demikhovsky	d2cb3c8876	AVX-512: Fixed the "test" operation for i1 type Using KORTESTW for comparison i1 value with zero was wrong since the instruction tests 16 bits. KORTESTW may be used with KSHIFTL+KSHIFTR that clean the 15 upper bits. I removed (X86cmp i1, 0) pattern and zero-extend i1 to i8 and then use TESTB. There are some cases where i1 is in the mask register and the upper bits are already zeroed. Then KORTESTW is the better solution, but it is subject for optimization. Meanwhile, I'm fixing the correctness issue. llvm-svn: 228916	2015-02-12 08:40:34 +00:00
David Majnemer	13d0b11d7b	X86: Make @llvm.frameaddress work correctly with Windows unwind codes Simply loading or storing the frame pointer is not sufficient for Windows targets. Instead, create a synthetic frame object that we will lower later. References to this synthetic object will be replaced with the correct reference to the frame address. llvm-svn: 228748	2015-02-10 21:22:05 +00:00
Sanjay Patel	3510bc7162	fix typos; NFC llvm-svn: 228529	2015-02-08 18:54:22 +00:00
Sanjay Patel	3d982214b0	use local variables; NFC llvm-svn: 228452	2015-02-06 22:43:52 +00:00
Ahmed Bougacha	e892d13d90	[CodeGen] Add hook/combine to form vector extloads, enabled on X86. The combine that forms extloads used to be disabled on vector types, because "None of the supported targets knows how to perform load and sign extend on vectors in one instruction." That's not entirely true, since at least SSE4.1 X86 knows how to do those sextloads/zextloads (with PMOVS/ZX). But there are several aspects to getting this right. First, vector extloads are controlled by a profitability callback. For instance, on ARM, several instructions have folded extload forms, so it's not always beneficial to create an extload node (and trying to match extloads is a whole 'nother can of worms). The interesting optimization enables folding of s/zextloads to illegal (splittable) vector types, expanding them into smaller legal extloads. It's not ideal (it introduces some legalization-like behavior in the combine) but it's better than the obvious alternative: form illegal extloads, and later try to split them up. If you do that, you might generate extloads that can't be split up, but have a valid ext+load expansion. At vector-op legalization time, it's too late to generate this kind of code, so you end up forced to scalarize. It's better to just avoid creating egregiously illegal nodes. This optimization is enabled unconditionally on X86. Note that the splitting combine is happy with "custom" extloads. As is, this bypasses the actual custom lowering, and just unrolls the extload. But from what I've seen, this is still much better than the current custom lowering, which does some kind of unrolling at the end anyway (see for instance load_sext_4i8_to_4i64 on SSE2, and the added FIXME). Also note that the existing combine that forms extloads is now also enabled on legal vectors. This doesn't have a big effect on X86 (because sext+load is usually combined to sext_inreg+aextload). On ARM it fires on some rare occasions; that's for a separate commit. Differential Revision: http://reviews.llvm.org/D6904 llvm-svn: 228325	2015-02-05 18:31:02 +00:00
Andrew Trick	7fc4583eda	X86 ABI fix for return values > 24 bytes. The return value's address must be returned in %rax. i.e. the callee needs to copy the sret argument (%rdi) into the return value (%rax). This probably won't manifest as a bug when the caller is LLVM-compiled code. But it is an ABI guarantee and tools expect it. llvm-svn: 228321	2015-02-05 18:09:05 +00:00
Sanjay Patel	67667bcdc4	move fold comments to the corresponding fold; NFC llvm-svn: 228317	2015-02-05 17:33:59 +00:00
Bruno Cardoso Lopes	ab9ae87623	[X86][MMX] Handle i32->mmx conversion using movd Implement a BITCAST dag combine to transform i32->mmx conversion patterns into a X86 specific node (MMX_MOVW2D) and guarantee that moves between i32 and x86mmx are better handled, i.e., don't use store-load to do the conversion.. llvm-svn: 228293	2015-02-05 13:23:07 +00:00
Larisse Voufo	bc9f12e7bc	Disable enumeral mismatch warning when compiling llvm with gcc. Tested with gcc 4.9.2. Compiling with -Werror was producing: .../llvm/lib/Target/X86/X86ISelLowering.cpp: In function 'llvm::SDValue lowerVectorShuffleAsBitMask(llvm::SDLoc, llvm::MVT, llvm::SDValue, llvm::SDValue, llvm::ArrayRef<int>, llvm::SelectionDAG&)': .../llvm/lib/Target/X86/X86ISelLowering.cpp:7771:40: error: enumeral mismatch in conditional expression: 'llvm::X86ISD::NodeType' vs 'llvm::ISD::NodeType' [-Werror=enum-compare] V = DAG.getNode(VT.isFloatingPoint() ? X86ISD::FAND : ISD::AND, DL, VT, V, ^ llvm-svn: 228271	2015-02-05 04:54:51 +00:00
Chandler Carruth	024cf8efd7	[x86] Start to introduce bit-masking based blend lowering. This is the simplest form of bit-math based blending which only fires when we are blending with zero and is relatively profitable. I've only enabled this path on very specific lowering strategies. I'm planning to widen its applicability in subsequent patches, but so far you'll notice that even though we get fewer shufps instructions, we still do the bit math in the FP execution port. I'm looking into why this is still happening. llvm-svn: 228124	2015-02-04 09:06:05 +00:00
Chandler Carruth	68fcb38328	[x86] Fix signed vs. unsigned comparison. llvm-svn: 228055	2015-02-03 22:43:30 +00:00
Simon Pilgrim	9eecb14bd9	Fixed unused variable warning. llvm-svn: 228054	2015-02-03 22:39:28 +00:00
Simon Pilgrim	46cd4f7400	[X86][SSE] psrl(w/d/q) and psll(w/d/q) bit shifts for SSE2 Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699. Differential Revision: http://reviews.llvm.org/D6649 llvm-svn: 228047	2015-02-03 21:58:29 +00:00
Simon Pilgrim	c4e5f1e192	Fixed signed/unsigned comparison warning. llvm-svn: 228027	2015-02-03 20:54:01 +00:00
Simon Pilgrim	03c379a0fa	Fixed unused variable warning. llvm-svn: 228025	2015-02-03 20:38:52 +00:00
Simon Pilgrim	d9885856e6	[X86][SSE] Added general integer shuffle matching for MOVQ instruction This patch adds general shuffle pattern matching for the MOVQ zero-extend instruction (copy lower 64bits, zero upper) for all 128-bit integer vectors, it is added as a fallback test in lowerVectorShuffleAsZeroOrAnyExtend. llvm-svn: 228022	2015-02-03 20:09:18 +00:00
Simon Pilgrim	6544f815b3	[X86][AVX2] Enabled shuffle matching for the AVX2 zero extension (128bit -> 256bit) vpmovzx* instructions. Differential Revision: http://reviews.llvm.org/D7251 llvm-svn: 228014	2015-02-03 19:34:09 +00:00
Sanjay Patel	b7d5628784	Merge consecutive 16-byte loads into one 32-byte load (PR22329) This patch detects consecutive vector loads using the existing EltsFromConsecutiveLoads() logic. This fixes: http://llvm.org/bugs/show_bug.cgi?id=22329 This patch effectively reverts the tablegen additions of D6492 / http://reviews.llvm.org/rL224344 ...which in hindsight were a horrible hack. The test cases that were added with that patch are simply modified to load from varying offsets of a base pointer. These loads did not match the existing tablegen patterns. A happy side effect of doing this optimization earlier is that we can now fold the load into a math op where possible; this is shown in some of the updated checks in the test file. Differential Revision: http://reviews.llvm.org/D7303 llvm-svn: 228006	2015-02-03 18:54:00 +00:00
Bruno Cardoso Lopes	077774b820	[X86][MMX] Improve transfer from mmx to i32 Improve EXTRACT_VECTOR_ELT DAG combine to catch conversion patterns between x86mmx and i32 with more layers of indirection. Before: movq2dq %mm0, %xmm0 movd %xmm0, %eax After: movd %mm0, %eax llvm-svn: 227969	2015-02-03 14:46:49 +00:00
Eric Christopher	05b819718c	Reuse a bunch of cached subtargets and remove getSubtarget calls without a Function argument. llvm-svn: 227814	2015-02-02 17:38:43 +00:00
Craig Topper	2844ca7319	[X86] Add a few target specific nodes to 'getTargetNodeName' llvm-svn: 227720	2015-02-01 10:00:37 +00:00
Simon Pilgrim	9c76b47469	[X86][SSE] Shuffle mask decode support for zero extend, scalar float/double moves and integer load instructions This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd). Also adds shuffle mask decodes for integer loads (movd/movq). Differential Revision: http://reviews.llvm.org/D7228 llvm-svn: 227688	2015-01-31 14:09:36 +00:00
Eric Christopher	295596a0a7	Remove the last vestiges of resetOperationActions. llvm-svn: 227648	2015-01-31 00:21:17 +00:00
Reid Kleckner	a580b6ec67	Win64: Put a REX_W prefix on all TAILJMP* instructions MSDN's x64 software conventions page says that this is one of the fixed list of legal epilogues: https://msdn.microsoft.com/en-us/library/tawsa7cb.aspx Presumably this is how the unwinder distinguishes epilogue jumps from in-function control flow. Also normalize the way we place "## TAILCALL" comments on such jumps. llvm-svn: 227611	2015-01-30 21:03:31 +00:00
Sanjay Patel	e4ca47875f	tidy up; NFC llvm-svn: 227582	2015-01-30 16:58:58 +00:00
Reid Kleckner	e83ab8c2de	x86: Remove unused variables not caught by MSVC =P llvm-svn: 227520	2015-01-30 00:05:39 +00:00
Reid Kleckner	ca9b9feb2c	x86: Fix large model calls to __chkstk for dynamic allocas In the large code model, we now put __chkstk in %r11 before calling it. Refactor the code so that we only do this once. Simplify things by using __chkstk_ms instead of __chkstk on cygming. We already use that symbol in the prolog emission, and it simplifies our logic. Second half of PR18582. llvm-svn: 227519	2015-01-29 23:58:04 +00:00
Sanjay Patel	3a907eaccd	Change SmallVector param to the more general ArrayRef; NFCI llvm-svn: 227514	2015-01-29 23:35:04 +00:00
Reid Kleckner	f0abdae34e	x86: Remove the W64ALLOCA pseudo This is just an alias for CALL64pcrel32, and we can just use that opcode with explicit defs in the MI. No functionality change. llvm-svn: 227508	2015-01-29 23:09:37 +00:00
Simon Pilgrim	80bd3c9e5f	Spelling fixes. NFC. llvm-svn: 227376	2015-01-28 22:03:52 +00:00
Sanjay Patel	4058dd9f3f	invert check for less indentation; use local vars to reduce duplication; NFC llvm-svn: 227355	2015-01-28 19:44:21 +00:00
Sanjay Patel	9bb601856e	use SDValue methods directly instead of getNode()->* ; NFCI llvm-svn: 227334	2015-01-28 18:01:31 +00:00
Michael Kuperstein	951995821a	[X86] Reduce some 32-bit imuls into lea + shl Reduce integer multiplication by a constant of the form k*2^c, where k is in {3,5,9} into a lea + shl. Previously it was only done for imulq on 64-bit platforms, but it makes sense for imull and 32-bit as well. Differential Revision: http://reviews.llvm.org/D7196 llvm-svn: 227308	2015-01-28 14:08:22 +00:00
Michael Kuperstein	f387611ac2	[x32] Enable sibcall optimization on x32. This includes two things: 1) Fix TCRETURNdi and TCRETURN64di patterns to check the right thing (LP64 as opposed to target bitness). 2) Allow LEA64_32 in MatchingStackOffset. llvm-svn: 227307	2015-01-28 13:38:48 +00:00
Elena Demikhovsky	7b0dd39db6	AVX-512: Added FMA intrinsics with rounding mode By Asaf Badouh and Elena Demikhovsky Added special nodes for rounding: FMADD_RND, FMSUB_RND.. It will prevent merge between nodes with rounding and other standard nodes. llvm-svn: 227303	2015-01-28 10:21:27 +00:00
Alexey Samsonov	533948088e	Revert "[x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector" This reverts commits r226953 and r226974. llvm-svn: 227248	2015-01-27 21:34:11 +00:00
Elena Demikhovsky	1a603b3f13	AVX-512: Changes in operations on masks registers for KNL and SKX - Added KSHIFTB/D/Q for skx - Added KORTESTB/D/Q for skx - Fixed store operation for v8i1 type for KNL - Store size of v8i1, v4i1 and v2i1 are changed to 8 bits llvm-svn: 227043	2015-01-25 12:47:15 +00:00
Bruno Cardoso Lopes	ddcc2e31a7	[x86] Fix a comment llvm-svn: 226974	2015-01-24 00:22:04 +00:00
Bruno Cardoso Lopes	56567f9135	[x86] Combine x86mmx/i64 to v2i64 conversion to use scalar_to_vector Handle the poor codegen for i64/x86xmm->v2i64 (%mm -> %xmm) moves. Instead of using stack store/load pair to do the job, use scalar_to_vector directly, which in the MMX case can use movq2dq. This was the current behavior prior to improvements for vector legalization of extloads in r213897. This commit fixes the regression and as a side-effect also remove some unnecessary shuffles. In the new attached testcase, we go from: pshufw $-18, (%rdi), %mm0 movq %mm0, -8(%rsp) movq -8(%rsp), %xmm0 pshufd $-44, %xmm0, %xmm0 movd %xmm0, %eax ... To: pshufw $-18, (%rdi), %mm0 movq2dq %mm0, %xmm0 movd %xmm0, %eax ... Differential Revision: http://reviews.llvm.org/D7126 rdar://problem/19413324 llvm-svn: 226953	2015-01-23 22:44:16 +00:00
Eric Christopher	a1c6e0c8ce	Remove some local variables in place of just querying for them in the couple of asserts. llvm-svn: 226917	2015-01-23 17:22:44 +00:00
Alexander Potapenko	a007905e4e	Mark \|TLI\| variables used to suppress -Wunused-variable warnings. (These vars are only used in assertions) llvm-svn: 226815	2015-01-22 13:03:33 +00:00
Elena Demikhovsky	150d9f3187	Fixed a bug in type legalizer for masked load/store intrinsics. The problem occurs when after vectorization we have type <2 x i32>. This type is promoted to <2 x i64> and then requires additional efforts for expanding loads and truncating stores. I added EXPAND / TRUNCATE attributes to the masked load/store SDNodes. The code now contains additional shuffles. I've prepared changes in the cost estimation for masked memory operations, it will be submitted separately. llvm-svn: 226808	2015-01-22 12:07:59 +00:00
Simon Pilgrim	b16b09b154	[X86][SSE] Added support for SSE3 lane duplication shuffle instructions This patch adds shuffle matching for the SSE3 MOVDDUP, MOVSLDUP and MOVSHDUP instructions. The big use of these being that they avoid many single source shuffles from needing to use (pre-AVX) dual source instructions such as SHUFPD/SHUFPS: causing extra moves and preventing load folds. Adding these instructions uncovered an issue in XFormVExtractWithShuffleIntoLoad which crashed on single operand shuffle instructions (now fixed). It also involved fixing getTargetShuffleMask to correctly identify theses instructions as unary shuffles. Also adds a missing tablegen pattern for MOVDDUP. Differential Revision: http://reviews.llvm.org/D7042 llvm-svn: 226716	2015-01-21 22:44:35 +00:00
Ahmed Bougacha	8f09e9f7c5	[X86] Declare SSE4.1/AVX2 vector extloads covered by PMOV[SZ]X legal. Now that we can fully specify extload legality, we can declare them legal for the PMOVSX/PMOVZX instructions. This for instance enables a DAGCombine to fire on code such as (and (<zextload-equivalent> ...), <redundant mask>) to turn it into: (zextload ...) as seen in the testcase changes. There is one regression, in widen_load-2.ll: we're no longer able to do store-to-load forwarding with illegal extload memory types. This will be addressed separately. Differential Revision: http://reviews.llvm.org/D6533 llvm-svn: 226676	2015-01-21 17:07:06 +00:00
Andrea Di Biagio	ae47bc6ab9	[X86][DAG] Disable target specific combine on INSERTPS dag nodes at -O0. This patch disables target specific combine on X86ISD::INSERTPS dag nodes if optlevel is CodeGenOpt::None. The backend currently implements a target specific combine rule that converts a vector load used by an INSERTPS dag node into a scalar load plus a scalar_to_vector. This allows ISel to select a single INSERTPSrm instead of two instructions (i.e. a vector load plus INSERTPSrr). However, the existing target combine rule on INSERTPS nodes only works under the assumption that ISel will always be able to match an INSERTPSrm. This is not true in general at -O0, since the backend only allows folding a load into the memory operand of an instruction if the optimization level is not CodeGenOpt::None. In the example below: // __m128 test(__m128 a, __m128 b) { __m128 c = _mm_insert_ps(a, b, 1 << 6); return c; } // Before this patch, at -O0, the backend would have canonicalized the load to 'b' into a scalar load plus scalar_to_vector. Later on, ISel would have selected an INSERTPSrr leaving the insertps mask in an inconsistent state: movss 4(%rdi), %xmm1 insertps $64, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1,2,3]. With this patch, the backend avoids folding the vector load into the operand of the INSERTPS. The new codegen at -O0 is: movaps (%rdi), %xmm1 insertps $64, %xmm1, %xmm0 # %xmm1[1],xmm0[1,2,3]. llvm-svn: 226277	2015-01-16 14:55:26 +00:00
Adam Nemet	e5dbcb7fd0	[AVX512] Unpack support in new shuffle lowering This now handles both 32 and 64-bit element sizes. In this version, the test are in vector-shuffle-512-v8.ll, canonicalized by Chandler's update_llc_test_checks.py. Part of <rdar://problem/17688758> llvm-svn: 225838	2015-01-13 22:20:18 +00:00
Simon Pilgrim	d88ab87064	[X86][SSE] Minor regression fix for r225551 r225551 vector byte shuffle optimization caused an assertion as fully zeroable vectors can be produced under certain circumstances. This fix drops the assert and returns a zero vector where the assert would have failed. llvm-svn: 225718	2015-01-12 22:38:08 +00:00
Ahmed Bougacha	291833b959	[X86] Also create+widen FMIN/FMAX nodes for v2f32. This happens in the HINT benchmark, where the SLP-vectorizer created v2f32 fcmp/select code. The "correct" solution would have been to teach the vectorizer cost model that v2f32 isn't legal (because really, it isn't), but if we can vectorize we might as well do so. We legalize these v2f32 FMIN/FMAX nodes by widening to v4f32 later on. v3f32 were already widened to v4f32 by the generic unroll-and-build-vector legalization. rdar://15763436 Differential Revision: http://reviews.llvm.org/D6557 llvm-svn: 225691	2015-01-12 20:31:30 +00:00
David Majnemer	14141f941a	Revert most of r225597 We can't rely on a DataLayout enlightened constant folder. llvm-svn: 225599	2015-01-11 07:29:51 +00:00
David Majnemer	292d0c796b	X86: Properly decode shuffle masks when the constant pool type is weird It's possible for the constant pool entry for the shuffle mask to come from a completely different operation. This occurs when Constants have the same bit pattern but have different types. Make DecodePSHUFBMask tolerant of types which, after a bitcast, are appropriately sized vector types. This fixes PR22188. llvm-svn: 225597	2015-01-11 05:08:57 +00:00
Saleem Abdulrasool	9cf2679d3b	X86: teach X86TargetLowering about L,M,O constraints Teach the ISelLowering for X86 about the L,M,O target specific constraints. Although, for the moment, clang performs constraint validation and prevents passing along inline asm which may have immediate constant constraints violated, the backend should be able to cope with the invalid inline asm a bit better. llvm-svn: 225596	2015-01-11 04:39:24 +00:00
Simon Pilgrim	94a4cc027a	[X86][SSE] Improved (v)insertps shuffle matching In the current code we only attempt to match against insertps if we have exactly one element from the second input vector, irrespective of how much of the shuffle result is zeroable. This patch checks to see if there is a single non-zeroable element from either input that requires insertion. It also supports matching of cases where only one of the inputs need to be referenced. We also split insertps shuffle matching off into a new lowerVectorShuffleAsInsertPS function. Differential Revision: http://reviews.llvm.org/D6879 llvm-svn: 225589	2015-01-10 19:45:33 +00:00
Simon Pilgrim	ec1f2c2cab	[X86][SSE] Avoid vector byte shuffles with zero by using pshufb to create zeros pshufb can shuffle in zero bytes as well as bytes from a source vector - we can use this to avoid having to shuffle 2 vectors and ORing the result when the used inputs from a vector are all zeroable. Differential Revision: http://reviews.llvm.org/D6878 llvm-svn: 225551	2015-01-09 22:03:19 +00:00
Chandler Carruth	685b1803ab	[x86] Add a flag to control the vector shuffle legality predicates that complements the new vector shuffle lowering code path. This flag, naturally, is off because we've not tested or evaluated the results of this at all. However, the flag will make it much easier to evaluate whether we can be this aggressive and whether there are missing vector shuffle lowering optimizations. llvm-svn: 225491	2015-01-09 01:24:36 +00:00
Ahmed Bougacha	d716121888	[X86] Reflow comment. NFC. llvm-svn: 225455	2015-01-08 17:49:48 +00:00
Michael Kuperstein	46f7d525c3	[X86] Don't try to generate direct calls to TLS globals The call lowering assumes that if the callee is a global, we want to emit a direct call. This is correct for regular globals, but not for TLS ones. Differential Revision: http://reviews.llvm.org/D6862 llvm-svn: 225438	2015-01-08 11:50:58 +00:00
Ahmed Bougacha	2b6917b020	[SelectionDAG] Allow targets to specify legality of extloads' result type (in addition to the memory type). The LoadExt legalization handling used to only have one type, the memory type. This forced users to assume that as long as the extload for the memory type was declared legal, and the result type was legal, the whole extload was legal. However, this isn't always the case. For instance, on X86, with AVX, this is legal: v4i32 load, zext from v4i8 but this isn't: v4i64 load, zext from v4i8 Whereas v4i64 is (arguably) legal, even without AVX2. Note that the same thing was done a while ago for truncstores (r46140), but I assume no one needed it yet for extloads, so here we go. Calls to getLoadExtAction were changed to add the value type, found manually in the surrounding code. Calls to setLoadExtAction were mechanically changed, by wrapping the call in a loop, to match previous behavior. The loop iterates over the MVT subrange corresponding to the memory type (FP vectors, etc...). I also pulled neighboring setTruncStoreActions into some of the loops; those shouldn't make a difference, as the additional types are illegal. (e.g., i128->i1 truncstores on PPC.) No functional change intended. Differential Revision: http://reviews.llvm.org/D6532 llvm-svn: 225421	2015-01-08 00:51:32 +00:00
Ahmed Bougacha	67dd2d25a3	[CodeGen] Use MVT iterator_ranges in legality loops. NFC intended. A few loops do trickier things than just iterating on an MVT subset, so I'll leave them be for now. Follow-up of r225387. llvm-svn: 225392	2015-01-07 21:27:10 +00:00
Ahmed Bougacha	b994d0c0c5	[X86] Fix 512->256 typo in comments. NFC. llvm-svn: 225367	2015-01-07 19:38:50 +00:00
Ahmed Bougacha	aa2d290997	[X86] Teach FCOPYSIGN lowering to recognize constant magnitudes. For code like: float foo(float x) { return copysign(1.0, x); } We used to generate: andps <-0.000000e+00,0,0,0>, %xmm0 movss <1.000000e+00>, %xmm1 andps <nan>, %xmm1 orps %xmm0, %xmm1 Basically doing an abs(1.0f) in the two middle instructions. We now generate: andps <-0.000000e+00,0,0,0>, %xmm0 orps <1.000000e+00,0,0,0>, %xmm0 Builds on cleanups r223415, r223542. rdar://19049548 Differential Revision: http://reviews.llvm.org/D6555 llvm-svn: 225357	2015-01-07 17:33:03 +00:00
David Majnemer	29c52f7449	X86: Don't make illegal GOTTPOFF relocations "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF relocation target a movq or addq instruction. Prohibit the truncation of such loads to movl or addl. This fixes PR22083. Differential Revision: http://reviews.llvm.org/D6839 llvm-svn: 225250	2015-01-06 07:12:52 +00:00
Craig Topper	49758aab94	[X86] Make isel select the shorter form of jump instructions instead of the long form. The assembler backend will relax to the long form if necessary. This removes a swap from long form to short form in the MCInstLowering code. Selecting the long form used to be required by the old JIT. llvm-svn: 225242	2015-01-06 04:23:53 +00:00
Simon Pilgrim	4c55af6850	[X86][SSE] lowerVectorShuffleAsByteShift tidyup Removed local isSequential predicate and use standard helper isSequentialOrUndefInRange instead. llvm-svn: 225216	2015-01-05 22:08:48 +00:00
Simon Pilgrim	71b96b35e1	[X86][SSE] Fixed description for isSequentialOrUndefInRange. NFC. llvm-svn: 225202	2015-01-05 21:09:48 +00:00
Andrea Di Biagio	6477847ef4	Improved comments. No functional change intended. llvm-svn: 225080	2015-01-02 10:47:46 +00:00
Andrea Di Biagio	22ee3f63b9	[CodeGenPrepare] Teach when it is profitable to speculate calls to @llvm.cttz/ctlz. If the control flow is modelling an if-statement where the only instruction in the 'then' basic block (excluding the terminator) is a call to cttz/ctlz, CodeGenPrepare can try to speculate the cttz/ctlz call and simplify the control flow graph. Example: \code entry: %cmp = icmp eq i64 %val, 0 br i1 %cmp, label %end.bb, label %then.bb then.bb: %c = tail call i64 @llvm.cttz.i64(i64 %val, i1 true) br label %end.bb end.bb: %cond = phi i64 [ %c, %then.bb ], [ 64, %entry] \code In this example, basic block %then.bb is taken if value %val is not zero. Also, the phi node in %end.bb would propagate the size-of in bits of %val only if %val is equal to zero. With this patch, CodeGenPrepare will try to hoist the call to cttz from %then.bb into basic block %entry only if cttz is cheap to speculate for the target. Added two new hooks in TargetLowering.h to let targets customize the behavior (i.e. decide whether it is cheap or not to speculate calls to cttz/ctlz). The two new methods are 'isCheapToSpeculateCtlz' and 'isCheapToSpeculateCttz'. By default, both methods return 'false'. On X86, method 'isCheapToSpeculateCtlz' returns true only if the target has LZCNT. Method 'isCheapToSpeculateCttz' only returns true if the target has BMI. Differential Revision: http://reviews.llvm.org/D6728 llvm-svn: 224899	2014-12-28 11:07:35 +00:00
Aaron Ballman	4eb5c2e089	Fixing another -Wunused-variable warning, this time in release builds without asserts. NFC. llvm-svn: 224889	2014-12-27 19:17:53 +00:00
Aaron Ballman	b66d54c549	Removing a variable that is set but never used, to silence a -Wunused-but-set-variable warning; NFC. llvm-svn: 224888	2014-12-27 19:01:19 +00:00
Elena Demikhovsky	fcea06acb5	AVX-512: Added FMA instructions, intrinsics an tests for KNL and SKX targets by Asaf Badouh http://reviews.llvm.org/D6456 llvm-svn: 224764	2014-12-23 10:30:39 +00:00
Elena Demikhovsky	3121449f0b	AVX-512: BLENDM - fixed encoding of the broadcast version Added more intrinsics and encoding tests. llvm-svn: 224760	2014-12-23 09:36:28 +00:00
Jim Grosbach	1bd0f3530e	X86: Don't over-align combined loads. When combining consecutive loads+inserts into a single vector load, we should keep the alignment of the base load. Doing otherwise can, and does, lead to using overly aligned instructions. In the included test case, for example, using a 32-byte vmovaps on a 16-byte aligned value. Oops. rdar://19190968 llvm-svn: 224746	2014-12-23 00:35:23 +00:00
Reid Kleckner	ce0093344f	Make musttail more robust for vector types on x86 Previously I tried to plug musttail into the existing vararg lowering code. That turned out to be a mistake, because non-vararg calls use significantly different register lowering, even on x86. For example, AVX vectors are usually passed in registers to normal functions and memory to vararg functions. Now musttail uses a completely separate lowering. Hopefully this can be used as the basis for non-x86 perfect forwarding. Reviewers: majnemer Differential Revision: http://reviews.llvm.org/D6156 llvm-svn: 224745	2014-12-22 23:58:37 +00:00
Bruno Cardoso Lopes	811c173523	[x86] Add vector @llvm.ctpop intrinsic custom lowering Currently, when ctpop is supported for scalar types, the expansion of @llvm.ctpop.vXiY uses vector element extractions, insertions and individual calls to @llvm.ctpop.iY. When not, expansion with bit-math operations is used for the scalar calls. Local haswell measurements show that we can improve vector @llvm.ctpop.vXiY expansion in some cases by using a using a vector parallel bit twiddling approach, based on: v = v - ((v >> 1) & 0x55555555); v = (v & 0x33333333) + ((v >> 2) & 0x33333333); v = ((v + (v >> 4) & 0xF0F0F0F) v = v + (v >> 8) v = v + (v >> 16) v = v & 0x0000003F (from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) When scalar ctpop isn't supported, the approach above performs better for v2i64, v4i32, v4i64 and v8i32 (see numbers below). And even when scalar ctpop is supported, this approach performs ~2x better for v8i32. Here, x86_64 implies -march=corei7-avx without ctpop and x86_64h includes ctpop support with -march=core-avx2. == [x86_64h - new] v8i32: 0.661685 v4i32: 0.514678 v4i64: 0.652009 v2i64: 0.324289 == [x86_64h - old] v8i32: 1.29578 v4i32: 0.528807 v4i64: 0.65981 v2i64: 0.330707 == [x86_64 - new] v8i32: 1.003 v4i32: 0.656273 v4i64: 1.11711 v2i64: 0.754064 == [x86_64 - old] v8i32: 2.34886 v4i32: 1.72053 v4i64: 1.41086 v2i64: 1.0244 More work for other vector types will come next. llvm-svn: 224725	2014-12-22 19:45:43 +00:00
Elena Demikhovsky	949b0d46bf	AVX-512: Added all forms of BLENDM instructions, intrinsics, encoding tests for AVX-512F and skx instructions. llvm-svn: 224707	2014-12-22 13:52:48 +00:00

... 2 3 4 5 6 ...

3324 Commits