llvm-project

Commit Graph

Author	SHA1	Message	Date
Gabor Buella	31fa8025ba	[X86] WaitPKG instructions Three new instructions: umonitor - Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a writeback memory caching type. umwait - A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events. tpause - Directs the processor to enter an implementation-dependent optimized state until the TSC reaches the value in EDX:EAX. Also modifying the description of the mfence instruction, as the rep prefix (0xF3) was allowed before, which would conflict with umonitor during disassembly. Before: $ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble .text mfence After: $ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble .text umonitor %rax Reviewers: craig.topper, zvi Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D45253 llvm-svn: 330462	2018-04-20 18:42:47 +00:00
Alexander Ivchenko	e8fed1546e	Lowering x86 adds/addus/subs/subus intrinsics (llvm part) This is the patch that lowers x86 intrinsics to native IR in order to enable optimizations. The patch also includes folding of previously missing saturation patterns so that IR emits the same machine instructions as the intrinsics. Patch by tkrupa Differential Revision: https://reviews.llvm.org/D44785 llvm-svn: 330322	2018-04-19 12:13:30 +00:00
Keith Wyss	3d86823f3d	[XRay] Typed event logging intrinsic Summary: Add an LLVM intrinsic for type discriminated event logging with XRay. Similar to the existing intrinsic for custom events, but also accepts a type tag argument to allow plugins to be aware of different types and semantically interpret logged events they know about without choking on those they don't. Relies on a symbol defined in compiler-rt patch D43668. I may wait to submit before I can see demo everything working together including a still to come clang patch. Reviewers: dberris, pelikan, eizan, rSerge, timshen Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D45633 llvm-svn: 330219	2018-04-17 21:30:29 +00:00
Hiroshi Inoue	ae17900997	[NFC] fix trivial typos in document and comments "not not" -> "not" etc llvm-svn: 330083	2018-04-14 08:59:00 +00:00
Craig Topper	254ed028a4	[X86] Remove the pmuldq/pmuludq intrinsics and replace with native IR. This completes the work started in r329604 and r329605 when we changed clang to no longer use the intrinsics. We lost some InstCombine SimplifyDemandedBit optimizations through this change as we aren't able to fold 'and', bitcast, shuffle very well. llvm-svn: 329990	2018-04-13 06:07:18 +00:00
Sriraman Tallam	d693093a65	GOTPCREL references must always use RIP. With -fno-plt, global value references can use GOTPCREL and RIP must be used. Differential Revision: https://reviews.llvm.org/D45460 llvm-svn: 329765	2018-04-10 22:50:05 +00:00
Chandler Carruth	0ca3bd0729	[x86] Model the direction flag (DF) separately from the rest of EFLAGS. This cleans up a number of operations that only claimed to use EFLAGS due to using DF. But no instructions which we think of as setting EFLAGS actually modify DF (other than things like popf) and so this needlessly creates uses of EFLAGS that aren't really there. In fact, DF is so restrictive it is pretty easy to model. Only STD, CLD, and the whole-flags writes (WRFLAGS and POPF) need to model this. I've also somewhat cleaned up some of the flag management instruction definitions to be in the correct .td file. Adding this extra register also uncovered a failure to use the correct datatype to hold X86 registers, and I've corrected that as necessary here. Differential Revision: https://reviews.llvm.org/D45154 llvm-svn: 329673	2018-04-10 06:40:51 +00:00
Chandler Carruth	19618fc639	[x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues. The key idea is to lower COPY nodes populating EFLAGS by scanning the uses of EFLAGS and introducing dedicated code to preserve the necessary state in a GPR. In the vast majority of cases, these uses are cmovCC and jCC instructions. For such cases, we can very easily save and restore the necessary information by simply inserting a setCC into a GPR where the original flags are live, and then testing that GPR directly to feed the cmov or conditional branch. However, things are a bit more tricky if arithmetic is using the flags. This patch handles the vast majority of cases that seem to come up in practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of partially preserved EFLAGS as LLVM doesn't currently model that at all. There are a large number of operations that technically observe EFLAGS currently but shouldn't in this case -- they typically are using DF. Currently, they will not be handled by this approach. However, I have never seen this issue come up in practice. It is already pretty rare to have these patterns come up in practical code with LLVM. I had to resort to writing MIR tests to cover most of the logic in this pass already. I suspect even with its current amount of coverage of arithmetic users of EFLAGS it will be a significant improvement over the current use of pushf/popf. It will also produce substantially faster code in most of the common patterns. This patch also removes all of the old lowering for EFLAGS copies, and the hack that forced us to use a frame pointer when EFLAGS copies were found anywhere in a function so that the dynamic stack adjustment wasn't a problem. None of this is needed as we now lower all of these copies directly in MI and without requiring stack adjustments. Lots of thanks to Reid who came up with several aspects of this approach, and Craig who helped me work out a couple of things tripping me up while working on this. Differential Revision: https://reviews.llvm.org/D45146 llvm-svn: 329657	2018-04-10 01:41:17 +00:00
Craig Topper	47b2f9d836	[X86] Don't use Lower512IntUnary to split bitcasts with v32i16/v64i8 types on targets without AVX512BW. LowerIntUnary as its name says has an assert for integer types. But for the bitcast case one side might be an FP type. Rather than making sure the function really works for fp types and renaming it. Just do really basic splitting directly. The LowerIntUnary has the advantage that it can peek through BUILD_VECTOR because every other call is during Lowering. But these calls are during legalization and will be followed by a DAG combine round. Revert some change to LowerVectorIntUnary that were originally made just to make these two calls work even in pure integer cases. This was found purely by compiling the avx512f-builtins.c test from clang so I've copied over the offending function from that. llvm-svn: 329616	2018-04-09 20:37:14 +00:00
Mandeep Singh Grang	68a151a13c	[X86] Change std::sort to llvm::sort in response to r327219 Summary: r327219 added wrappers to std::sort which randomly shuffle the container before sorting. This will help in uncovering non-determinism caused due to undefined sorting order of objects having the same key. To make use of that infrastructure we need to invoke llvm::sort instead of std::sort. Note: This patch is one of a series of patches to replace all std::sort to llvm::sort. Refer the comments section in D44363 for a list of all the required patches. Reviewers: chandlerc, craig.topper, RKSimon Reviewed By: chandlerc, craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44874 llvm-svn: 329534	2018-04-08 16:42:52 +00:00
Craig Topper	ef37aebc96	[X86] Combine vXi64 multiplies to MULDQ/MULUDQ during DAG combine instead of lowering. Previously we used a custom lowering for this because of the AVX1 splitting requirement. But we can do the split during DAG combine if we check the types and subtarget llvm-svn: 329510	2018-04-07 19:09:52 +00:00
Benjamin Kramer	1fc0da4849	Make helpers static. NFC. llvm-svn: 329170	2018-04-04 11:45:11 +00:00
Craig Topper	3064c15dc3	[X86] Remove some code that was only needed when i1 was a legal type. NFC llvm-svn: 329146	2018-04-04 04:38:54 +00:00
Craig Topper	9b8cd5fe55	[X86] Don't check for folding into a store when deciding if we can promote an i16 mul. There's no RMW mul operation. llvm-svn: 328931	2018-04-01 06:29:32 +00:00
Craig Topper	db6caabccc	[X86] Check if the load and store are to the same pointer before preventing i16 RMW shifts and subtracts from being promoted. llvm-svn: 328930	2018-04-01 06:29:28 +00:00
Craig Topper	ae2de57db0	[X86] Allow i16 subtracts to be promoted if the load is on the LHS and it's not being stored. llvm-svn: 328928	2018-04-01 06:29:25 +00:00
Craig Topper	9bc0d881a3	[X86] Remove unneeded temporary variable. NFC This Promote flag was always set to true except in the default case. But in the default case we don't need to set PVT and can just return false. llvm-svn: 328926	2018-04-01 06:29:21 +00:00
Simon Pilgrim	8c8ebd7945	Fix trailing whitespace. NFCI. llvm-svn: 328917	2018-03-31 09:14:14 +00:00
Simon Pilgrim	71c5f3fffd	[X86][SSE] Don't bother re-adding combined target shuffles to the work list We are re-adding all the bitcasts, constant masks and target shuffles to the work list for no apparent gain. Found while investigating adding SimplifyDemandedVectorElts to target shuffles. Differential Revision: https://reviews.llvm.org/D44942 llvm-svn: 328771	2018-03-29 11:18:41 +00:00
Reid Kleckner	41fb2dba9c	[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32 Summary: Re-lands r328386 and r328443, reverting r328482. Incorporates fixes from @mstorsjo in D44876 (thanks!) so that small parameters in i8 and i16 do not end up in the SysV register parameters (EDI, ESI, etc). I added tests for how we receive small parameters, since that is the important part. It's always safe to store more bytes than will be read, but the assumptions you make when loading them are what really matter. I also tested this by self-hosting clang and it passed tests on win64. Reviewers: mstorsjo, hans Subscribers: hiraditya, mstorsjo, llvm-commits Differential Revision: https://reviews.llvm.org/D44900 llvm-svn: 328570	2018-03-26 18:49:48 +00:00
Hans Wennborg	311b63f13b	Revert r328386 "[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32" This broke Chromium (see crbug.com/825748). It looks like mstorsjo's follow-up patch at D44876 fixes this, but let's revert back to green for now until that's ready to land. (Also reverts r328443.) > Both GCC and MSVC only look at the low byte of a boolean when it is > passed. llvm-svn: 328482	2018-03-26 10:07:51 +00:00
Simon Pilgrim	854ac7490d	[X86] Add missing full stop to comment. NFCI. llvm-svn: 328456	2018-03-25 18:49:48 +00:00
Craig Topper	2c0a62ab9a	[X86] Add a DAG combine to simplify PMULDQ/PMULUDQ nodes These nodes only use the lower 32 bits of their inputs so we can use SimplifyDemandedBits to simplify them. Differential Revision: https://reviews.llvm.org/D44375 llvm-svn: 328405	2018-03-24 01:52:01 +00:00
Reid Kleckner	e27b410661	[X86] Fix Windows `i1 zeroext` conventions to use i8 instead of i32 Both GCC and MSVC only look at the low byte of a boolean when it is passed. llvm-svn: 328386	2018-03-23 23:38:53 +00:00
Martin Storsjo	07589fc496	[X86] Don't use the MSVC stack protector names on mingw Mingw uses the same stack protector functions as GCC provides on other platforms as well. Patch by Valentin Churavy! Differential Revision: https://reviews.llvm.org/D27296 llvm-svn: 328039	2018-03-20 20:37:51 +00:00
Craig Topper	ab6076514d	[X86] Simplify the AVX512 code in LowerTruncate a little. We don't need to create an ISD::TRUNCATE node to return, we started with one and can return it. Also remove the call to getExtendInVec, the result is just going to be a getNode of that value passed in. llvm-svn: 327914	2018-03-19 21:58:02 +00:00
Craig Topper	3b967466d5	[X86] Replace a couple calls to getExtendInVec with getNode and the appropriate target independent EXTEND_VECTOR_INREG opcode. llvm-svn: 327899	2018-03-19 20:20:22 +00:00
Craig Topper	259eaa6e7c	[X86] Remove sse41 specific code from lowering v16i8 multiply With the SRAs removed from the SSE2 code in D44267, there doesn't appear to be any advantage to the sse41 code. The punpcklbw instruction and pmovsx seem to have the same latency and throughput on most CPUs. And the SSE41 code requires moving the upper 64-bits into the lower 64-bit before the sign extend can be done. The punpckhbw in sse2 code can do better than that. llvm-svn: 327869	2018-03-19 17:31:41 +00:00
Oren Ben Simhon	fdd72fd522	[X86] Added support for nocf_check attribute for indirect Branch Tracking X86 Supports Indirect Branch Tracking (IBT) as part of Control-Flow Enforcement Technology (CET). IBT instruments ENDBR instructions used to specify valid targets of indirect call / jmp. The `nocf_check` attribute has two roles in the context of X86 IBT technology: 1. Appertains to a function - do not add ENDBR instruction at the beginning of the function. 2. Appertains to a function pointer - do not track the target function of this pointer by adding nocf_check prefix to the indirect-call instruction. This patch implements `nocf_check` context for Indirect Branch Tracking. It also auto generates `nocf_check` prefixes before indirect branches to jump tables that are guarded by range checks. Differential Revision: https://reviews.llvm.org/D41879 llvm-svn: 327767	2018-03-17 13:29:46 +00:00
Craig Topper	f0815e01d8	[X86] Merge ADDSUB/SUBADD detection into single methods that can detect either and indicate what they found. Previously, we called the same functions twice with a bool flag determining whether we should look for ADDSUB or SUBADD. It would be more efficient to run the code once and detect either pattern with a flag to tell which type it found. Differential Revision: https://reviews.llvm.org/D44540 llvm-svn: 327730	2018-03-16 18:25:59 +00:00
Craig Topper	1b8cf49704	[SelectionDAG][ARM][X86] Teach PromoteIntRes_SETCC to do a better job picking the result type for the setcc. Previously if getSetccResultType returned an illegal type we just fell back to using the default promoted type. This appears to have been to handle the case where for vectors getSetccResultType returns the input type, but the input type itself isn't legal and will need to be promoted. Without the legality check we would never reach a legal type. But just picking the promoted type to be the setcc type can create strange setccs where the result type is 128 bits and the operand type is 256 bits. If for example the result type was promoted to v8i16 from v8i1, but the input type was promoted from v8i23 to v8i32. We currently handle this with custom lowering code in X86. This legality check also caused us to reject the getSetccResultType when the input type needed to be widened or split. Even though that result wouldn't have caused legalization to get stuck. This patch tries to fix this by detecting the getSetccResultType needs to be promoted. If its input type also needs to be promoted we'll try to ask for a new setcc result type based on its eventual promoted value. Otherwise we fall back to default type to promote to. For any other illegal values we might get back from the initial call to getSetccResultType we just keep and allow it to be re-legalized later via splitting or widening or scalarizing. llvm-svn: 327683	2018-03-15 23:04:11 +00:00
Craig Topper	c3983c34cd	[X86] Make sure we use FSUB instruction as the reference for operand order in isAddSubOrSubAdd when recognizing subadd The FADD part of the addsub/subadd pattern can have its operands commuted, but when checking for fsubadd we were using the fadd as reference and commuting the fsub node. llvm-svn: 327660	2018-03-15 20:30:54 +00:00
Craig Topper	5a0251fe67	[X86] Simplify the type legality checking for (FM)ADDSUB/SUBADD matching. NFCI Rather than enumerating all specific types, for the DAG combine we can just use TLI::isTypeLegal and an SSE3 check. For the BUILD_VECTOR version we already know the type is legal so we just need to check SSE3. llvm-svn: 327649	2018-03-15 17:38:59 +00:00
Craig Topper	627e001fad	[X86] Fix 80 column violations. llvm-svn: 327648	2018-03-15 17:38:55 +00:00
Craig Topper	26a3a80c87	[X86] Add support for matching FMSUBADD from build_vector. llvm-svn: 327604	2018-03-15 06:14:55 +00:00
Craig Topper	a5e712f402	[X86] Remove old TODO. We have coverage for this now. Coverage was added in r320950. llvm-svn: 327603	2018-03-15 06:14:53 +00:00
Craig Topper	b9526e9fdb	[X86] Use MVT in a couple places where we know the type is legal. llvm-svn: 327602	2018-03-15 06:14:51 +00:00
Craig Topper	b36cb20ef9	[X86] Teach X86TargetLowering::targetShrinkDemandedConstant to set non-demanded bits if it helps create an AND mask that can be matched as a zero extend. I had to modify the bswap recognition to allow unshrunk masks to make this work. Fixes PR36689. Differential Revision: https://reviews.llvm.org/D44442 llvm-svn: 327530	2018-03-14 16:55:15 +00:00
Matt Arsenault	41e5ac4fa4	TargetMachine: Add address space to getPointerSize llvm-svn: 327467	2018-03-14 00:36:23 +00:00
Craig Topper	ec4881ad53	[X86] Simplify the LowerAVXCONCAT_VECTORS code a little by creating a single path for insert_subvector handling. We now only create recursive concats if we have more than two non-zero values. This keeps our subvector broadcast DAG combine functioning. llvm-svn: 327457	2018-03-13 22:36:07 +00:00
Craig Topper	cc060e921b	[X86] Rewrite LowerAVXCONCAT_VECTORS similar to how we handle vXi1 concats. This is better able to detect undef and zero pieces in the concat. Or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs. This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best. We probably want to merge this with the vXi1 concat code since they are very similar. llvm-svn: 327454	2018-03-13 22:05:25 +00:00
Craig Topper	7e711a6822	[X86] Remove SplitBinaryOpsAndApply and use SplitOpsAndApply by adding curly braces around the ops. Summary: Unless you were intentionally avoiding this syntax? I saw you mentioned makeArrayRef in your commit that added SplitOpsAndApply. Reviewers: RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44403 llvm-svn: 327418	2018-03-13 16:23:27 +00:00
Simon Pilgrim	93bd7187f4	[X86][SSE41] createVariablePermute v2X64 - PCMPEQQ can test for index 0/1 and select between them. llvm-svn: 327385	2018-03-13 12:22:58 +00:00
Craig Topper	acaba3b402	[X86] Remove use of MVT class from the ShuffleDecode library. MVT belongs to the CodeGen layer, but ShuffleDecode is used by the X86 InstPrinter which is part of the MC layer. This only worked because MVT is completely implemented in a header file with no other library dependencies. Differential Revision: https://reviews.llvm.org/D44353 llvm-svn: 327292	2018-03-12 16:43:11 +00:00
Simon Pilgrim	6618e2a09c	[X86][SSE] createVariablePermute - PSHUFB requires SSSE3 not just SSE3 llvm-svn: 327259	2018-03-12 12:30:04 +00:00
Craig Topper	7cc1b1fc84	[X86] Don't compute known bits twice for the same SDValue in LowerMUL. We called MaskedValueIsZero with two different masks, but underneath that calls computeKnownBits before applying the mask. This means we compute the same known bits twice due to the two calls. Instead just call computeKnownBits directly and apply the two masks ourselves. llvm-svn: 327251	2018-03-12 05:35:02 +00:00
Simon Pilgrim	d09cc9c62c	[X86][MMX] Support MMX build vectors to avoid SSE usage (PR29222) 64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type. This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics. We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover. Differential Revision: https://reviews.llvm.org/D43618 llvm-svn: 327247	2018-03-11 19:22:13 +00:00
Simon Pilgrim	30f74c14ff	[X86][AVX] createVariablePermute - scale v16i16 variable permutes to use v32i8 codegen XOP was already doing this, and now AVX performs v32i8 variable permutes as well. llvm-svn: 327245	2018-03-11 17:23:54 +00:00
Simon Pilgrim	b306501796	[X86][AVX] createVariablePermute - widen permutes for cases where the source vector is wider than the destination type llvm-svn: 327244	2018-03-11 17:00:46 +00:00
Simon Pilgrim	9a5d0c7540	[X86][AVX] createVariablePermute - use PSHUFB+PCMPGT+SELECT for v32i8 variable permutes Same as the VPERMILPS/VPERMILPD approach for v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. The select between the lo/hi permute results based on the index size. llvm-svn: 327242	2018-03-11 16:28:11 +00:00
Simon Pilgrim	d2fbd87ce8	Fix for buildbots which didn't like makeArrayRef with initializer lists. llvm-svn: 327241	2018-03-11 14:31:55 +00:00
Simon Pilgrim	e60afdf9eb	[X86][SSE] Generalized SplitBinaryOpsAndApply to SplitOpsAndApply to support any number of ops. I've kept SplitBinaryOpsAndApply as a wrapper to avoid a lot of makeArrayRef code. llvm-svn: 327240	2018-03-11 14:04:53 +00:00
Simon Pilgrim	f9cc80d218	[X86][AVX] createVariablePermute - use 2xVPERMIL+PCMPGT+SELECT for v8i32/v8f32 and v4i64/v4f64 variable permutes As VPERMILPS/VPERMILPD only selects elements based on the bits[1:0]/bit[1] then we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for lo/hi. For v4i64/v4f64 this avoids some rather nasty v4i64 multiplies on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends. llvm-svn: 327239	2018-03-11 11:52:26 +00:00
Simon Pilgrim	2565bd421e	[X86][AVX512] createVariablePermute - Non-VLX targets can widen v4i64/v4f64 variable permutes to v8i64/v8f64 Permutes in the upper elements will be undefined, but they will be discarded anyway. llvm-svn: 327238	2018-03-11 11:19:19 +00:00
Simon Pilgrim	64b899f0f3	[x86][SSE] Add widenSubVector helper. NFCI. Helper function to insert a subvector into the bottom elements of a larger zero/undef vector with the same scalar type. I've converted a couple of INSERT_SUBVECTOR calls to use it, there are plenty more although in some cases I was worried it might make the code more ambiguous. llvm-svn: 327236	2018-03-11 10:50:48 +00:00
Simon Pilgrim	de7f3f0f91	[X86][XOP] createVariablePermute - use VPERMIL2 for v8i32/v4i64 variable permutes llvm-svn: 327222	2018-03-10 19:49:59 +00:00
Simon Pilgrim	ff1248f82f	[X86][XOP] createVariablePermute - use VPPERM for v16i16 variable permutes llvm-svn: 327218	2018-03-10 18:33:29 +00:00
Simon Pilgrim	d9dc114e2f	[X86][SSE] createVariablePermute - create index scaling helper. NFCI. This will help in some future changes for custom lowering. llvm-svn: 327217	2018-03-10 18:12:35 +00:00
Simon Pilgrim	8224241f75	[X86][XOP] createVariablePermute - use VPPERM for v32i8 variable permutes llvm-svn: 327213	2018-03-10 16:51:45 +00:00
Simon Pilgrim	2cd489feb2	[X86][AVX] createVariablePermute - fix v2i64/v2f64 VPERMILPD index creation. The input indices vector will put the index in bit0, but VPERMILPD actually selects off bit1 - so we need to scale accordingly. llvm-svn: 327159	2018-03-09 18:37:56 +00:00
Simon Pilgrim	230d38b559	[X86][SSE] createVariablePermute - move source vector canonicalization to top of function. NFCI. This is to make it easier to return early from the switch statement with custom lowering. llvm-svn: 327157	2018-03-09 18:08:08 +00:00
Simon Pilgrim	033a4167d2	Tidyup comment that was destroyed by clang-format. NFCI. llvm-svn: 327141	2018-03-09 15:50:09 +00:00
Simon Pilgrim	322c521ed7	[X86][SSE] createVariablePermute - move index vector canonicalization to top of function. NFCI. This is to make it easier to return early from the switch statement with custom lowering. llvm-svn: 327140	2018-03-09 15:48:56 +00:00
Craig Topper	784f1bbf5e	[X86] Remove SRAs from v16i8 multiply lowering on sse2 targets Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did a v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The same was done for all the odd bytes. The results are then packed together with packuswb. Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does. Differential Revision: https://reviews.llvm.org/D44267 llvm-svn: 327093	2018-03-09 01:22:31 +00:00
Simon Pilgrim	c286680032	[X86][AVX] Pull out variable permute creation from LowerBUILD_VECTORAsVariablePermute. NFCI. This will make it easier to handle more complex cases than basic scaling or index masks. llvm-svn: 327054	2018-03-08 20:07:06 +00:00
Craig Topper	a406796f5f	[X86] Change X86::PMULDQ/PMULUDQ opcodes to take vXi64 type as input instead of vXi32. This instruction can be thought of as reading either the even elements of a vXi32 input or the lower half of each element of a vXi64 input. We currently use the vXi32 interpretation, but vXi64 matches better with its broadcast behavior in EVEX. I'm looking at moving MULDQ/MULUDQ creation to a DAG combine so we can do it when AVX512DQ is enabled without having to go through Custom lowering. But in some of the test cases we failed to use a broadcast load due to the size difference. This should help with that. I'm also wondering if we can model these instructions in native IR and remove the intrinsics and I think using a vXi64 type will work better with that. llvm-svn: 326991	2018-03-08 08:02:52 +00:00
Simon Pilgrim	68594ee24a	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - reorder permute types. NFCI. Reorder into 128/256/512 bit vector size groupings. NFCI commit before some new features. llvm-svn: 326963	2018-03-07 23:56:42 +00:00
Craig Topper	80ec0c3106	[X86] Remove unused function argument. NFC llvm-svn: 326939	2018-03-07 19:45:45 +00:00
Craig Topper	c3c15dd640	[X86] Make the MUL->VPMADDWD work before op legalization on AVX1 targets. Simplify feature checks by using isTypeLegal. The v8i32 conversion on AVX1 targets was only working after LowerMUL splits 256-bit vectors. While I was there I've also made it so we don't have to check for AVX2 and BWI directly and instead just ask if the type is legal. Differential Revision: https://reviews.llvm.org/D44190 llvm-svn: 326917	2018-03-07 17:53:18 +00:00
Craig Topper	80d3bb3b4b	[TargetLowering] Rename DAGCombinerInfo::isAfterLegalizeVectorOps to DAGCombiner::isAfterLegalizeDAG since that's what it checks. NFC The code checks Level == AfterLegalizeDAG which is the fourth and last of the possible DAG combine stages that we have. There is a Level called AfterLegalVectorOps, but that's the third DAG combine and it doesn't always run. A function called isAfterLegalVectorOps should imply it returns true in either of the DAG combines that runs after the legalize vector ops stage, but that's not what this function does. llvm-svn: 326832	2018-03-06 19:44:52 +00:00
Craig Topper	274e08dd81	[X86] Reject registers that require a REX prefix in inline asm constraints in 32-bit mode We don't currently reject r8-r15 or xmm8-32 or bpl/spl/sil/dil in 32-bit mode. Differential Revision: https://reviews.llvm.org/D44031 llvm-svn: 326826	2018-03-06 18:56:33 +00:00
Craig Topper	f546b2c06f	[X86] Replace usages of X86Subtarget::hasFp256 with hasAVX. Remove hasFP256. Almost none of these usages were FP specific. And we had no clear guidelines on when to use hasAVX vs hasFP256. I might also remove hasInt256 too since it's an alias for hasAVX2. llvm-svn: 326682	2018-03-05 00:13:35 +00:00
Craig Topper	f2aae62228	[X86] Add a DAG combine to turn stores of vXi1 constants into scalar stores. llvm-svn: 326679	2018-03-04 19:33:15 +00:00
Craig Topper	12c35e1940	[X86] Fix unused variable in release builds. llvm-svn: 326672	2018-03-04 02:14:16 +00:00
Craig Topper	a476026f70	[X86] Combine (store (v1i1 (scalar_to_vector (i8 X)))) -> (store (i8 X)). llvm-svn: 326670	2018-03-04 01:48:02 +00:00
Craig Topper	be31585be8	[X86] Lower v1i1/v2i1/v4i1/v8i1 load/stores to i8 load/store during op legalization if AVX512DQ is not supported. We were previously doing this with isel patterns. Moving it to op legalization gives us chance to see the required bitcast earlier. And it lets us remove some isel patterns. llvm-svn: 326669	2018-03-04 01:48:00 +00:00
Craig Topper	d4b6601662	[X86] Remove 'else' after return. NFC llvm-svn: 326642	2018-03-03 05:18:21 +00:00
Craig Topper	6b1419b547	[X86] Reject xmm16-31 in inline asm constraints when AVX512 is disabled Fixes PR36532 Differential Revision: https://reviews.llvm.org/D43960 llvm-svn: 326596	2018-03-02 18:19:40 +00:00
Simon Pilgrim	90fd0622b6	[X86][MMX] Improve handling of 64-bit MMX constants 64-bit MMX constant generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type. This patch bitcasts the constant to a double value to allow correct loading directly to the MMX register. I've added MMX constant asm comment support to improve testing, it's better to always print the double values as hex constants as MMX is mainly an integer unit (and even with 3DNow! it's just floats). Differential Revision: https://reviews.llvm.org/D43616 llvm-svn: 326497	2018-03-01 22:22:31 +00:00
Craig Topper	ccfa5257a6	[X86] Make sure we don't combine (fneg (fma X, Y, Z)) to a target specific node when there are no FMA instructions. This would cause a 'cannot select' error at isel when we should have emitted a lib call and an xor. Fixes PR36553. llvm-svn: 326393	2018-03-01 00:08:38 +00:00
Craig Topper	e31b9d1e5f	[X86] Lower extract_element from k-registers by bitcasting from v16i1 to i16 and extending/truncating. This is equivalent to what isel was doing anyway but by canonicalizing earlier we can remove some patterns. llvm-svn: 326375	2018-02-28 22:23:55 +00:00
Simon Pilgrim	72b86586b0	[X86][AVX512] Improve support for signed saturation truncation stores Matches what we already manage for unsigned saturation truncation stores Differential Revision: https://reviews.llvm.org/D43629 llvm-svn: 326372	2018-02-28 21:42:19 +00:00
Chih-Hung Hsieh	9f9e4681ac	[TLS] use emulated TLS if the target supports only this mode Emulated TLS is enabled by llc flag -emulated-tls, which is passed by clang driver. When llc is called explicitly or from other drivers like LTO, missing -emulated-tls flag would generate wrong TLS code for targets that supports only this mode. Now use useEmulatedTLS() instead of Options.EmulatedTLS to decide whether emulated TLS code should be generated. Unit tests are modified to run with and without the -emulated-tls flag. Differential Revision: https://reviews.llvm.org/D42999 llvm-svn: 326341	2018-02-28 17:48:55 +00:00
Craig Topper	48d5ed265c	[X86] Don't use EXTRACT_ELEMENT from v1i1 with i8/i32 result type when we need to guarantee zeroes in the upper bits of return. An extract_element where the result type is larger than the scalar element type is semantically an any_extend of from the scalar element type to the result type. If we expect zeroes in the upper bits of the i8/i32 we need to mae sure those zeroes are explicit in the DAG. For these cases the best way to accomplish this is use an insert_subvector to pad zeroes to the upper bits of the v1i1 first. We extend to either v16i1(for i32) or v8i1(for i8). Then bitcast that to a scalar and finish with a zero_extend up to i32 if necessary. We can't extend past v16i1 because that's the largest mask size on KNL. But isel is smarter enough to know that a zext of a bitcast from v16i1 to i16 can use a KMOVW instruction. The insert_subvectors will be dropped during isel because we can determine that the producing instruction already zeroed the upper bits of the k-register. llvm-svn: 326308	2018-02-28 08:14:28 +00:00
Craig Topper	ac799b05d4	[X86] Change the masked FPCLASS implementation to use AND instead of OR to combine the mask results. While the description for the instruction does mention OR, it's talking about how the individual classification test results are ORed together. The incoming mask is used as a zeroing write mask. If the bit is 1 the classification is written to the output. If the bit is 0 the output is 0. This is equivalent to an AND. Here is pseudocode from the intrinsics guide FOR j := 0 to 1 i := j*64 IF k1[j] k[j] := CheckFPClass_FP64(a[i+63:i], imm8[7:0]) ELSE k[j] := 0 FI ENDFOR k[MAX:2] := 0 llvm-svn: 326306	2018-02-28 06:19:55 +00:00
Simon Pilgrim	ba43ec8702	[X86][AVX] combineLoopMAddPattern - support 256-bit cases on AVX1 via SplitBinaryOpsAndApply llvm-svn: 326189	2018-02-27 12:20:37 +00:00
Craig Topper	264707bae4	[X86] Simplify if condition. NFC SSE2 implies SSE1 and we already covered f32 in the SSE1 check so we don't need to check f32 in the SSE2 check. llvm-svn: 326170	2018-02-27 06:00:38 +00:00
Craig Topper	fcaa0323ec	[X86] Replace an impossible if condition with an assert. llvm-svn: 326167	2018-02-27 03:50:00 +00:00
Craig Topper	e5d39e42b9	[X86] Add constant folding to combineMOVMSK. There are still some shortcomings in our ability to combine binops of constants with different sizes separated by an extend. I'll try to look at that next. llvm-svn: 326128	2018-02-26 21:17:33 +00:00
Craig Topper	5e0ceb8865	[X86] Add a custom legalization for (i16 (bitcast v16i1)) and (i32 (bitcast v32i1)) without AVX512 to prevent scalarization Summary: We have an early DAG combine to turn these patterns into MOVMSK, but that combine doesn't work if the vXi1 type has more elements than the widest legal vXi8 type. Type legalization will eventually split it down to v16i1 or v32i1 and then the bitcast gets legalized to a truncstore and a scalar load. The truncstore will get lowered to a series of extracts and bit math. This patch adds a custom legalization to use a sign extend and MOVMSK instead. This prevents the eventual scalarization. Reviewers: spatel, RKSimon, zvi Reviewed By: RKSimon Subscribers: mgorny, llvm-commits Differential Revision: https://reviews.llvm.org/D43593 llvm-svn: 326119	2018-02-26 20:32:27 +00:00
Simon Pilgrim	db0ed7d724	[X86][AVX] createPSADBW - support 256-bit cases on AVX1 via SplitBinaryOpsAndApply llvm-svn: 326104	2018-02-26 18:17:25 +00:00
Craig Topper	5c980eba47	[X86] Don't use getZExtValue when we have no idea how large the input elements are. llvm-svn: 326066	2018-02-26 04:43:24 +00:00
Craig Topper	2286058f46	[X86] Use SelectionDAG::SplitVectorOperand to simplify some code. NFC llvm-svn: 326065	2018-02-26 02:16:34 +00:00
Craig Topper	2bf8e3e0e1	[X86] Simplify the ReplaceNodeResults code for X86ISD::AVG. This code seemed to try to widen to 128, 256, or 512 bit vectors, but we only create X86ISD::AVG with a power of 2 number of elements. This means the only nodes that need to be legalized are less than 128-bits and need to be widened up to 128 bits. llvm-svn: 326064	2018-02-26 02:16:33 +00:00
Craig Topper	79d189f597	[X86] Remove VT.isSimple() check from detectAVGPattern. Which types are considered 'simple' is a function of the requirements of all targets that LLVM supports. That shouldn't directly affect what types we are able to handle. The remainder of this code checks that the number of elements is a power of 2 and takes care of splitting down to a legal size. llvm-svn: 326063	2018-02-26 02:16:31 +00:00
Simon Pilgrim	a4fb569483	[X86][SSE] combineSubToSubus - support v8i64 handling from SSSE3 Our UMIN/UMAX, vector truncation and shuffle combining is good enough to efficiently handle v8i64 with the number of leading zeros that are necessary for PSUBUS. llvm-svn: 326034	2018-02-24 14:06:39 +00:00
Simon Pilgrim	8ad91261e8	[X86][SSE] combineSubToSubus - support v8i32 handling from SSSE3 (not SSE41) Now that UMIN etc are Legal/Custom for SSE2+, we can efficiently match SUBUS v8i32 cases from SSSE3 which can perform efficient truncation with PSHUFB. llvm-svn: 326033	2018-02-24 13:39:13 +00:00
Simon Pilgrim	744f008a75	[X86][SSE] combineSubToSubus - begun generalizing to work with any type sizes with SplitBinaryOpsAndApply llvm-svn: 326030	2018-02-24 12:44:12 +00:00
Simon Pilgrim	51ce2ed367	Fix spelling in comment. NFCI. llvm-svn: 326029	2018-02-24 12:27:02 +00:00
Craig Topper	161c805da4	[X86] Use SelectionDAG::getNot instead of implementing manually. NFC llvm-svn: 326020	2018-02-24 03:15:54 +00:00
Sriraman Tallam	609f8c013c	Intrinsics calls should avoid the PLT when "RtLibUseGOT" metadata is present. Differential Revision: https://reviews.llvm.org/D42216 llvm-svn: 325962	2018-02-23 21:32:06 +00:00
Simon Pilgrim	69b8fa8391	Fixed unused variable warning. NFCI. llvm-svn: 325950	2018-02-23 20:16:18 +00:00
Craig Topper	61d6ddbf0a	[X86] Add DAG combine to remove (and X, 1) from in front of a v1i1 scalar to vector. These can be created by type legalization promoting the inputs to select to match scalar boolean contents. We were trying to pattern match them away during isel, but it's better to just remove them from the DAG. I've cleaned up some patterns to not check for this 'and' anymore. But I suspect this has also opened up opportunities for pattern removal. llvm-svn: 325949	2018-02-23 20:13:42 +00:00
Simon Pilgrim	425965be0f	[X86][SSE] Generalize x > C-1 ? x+-C : 0 --> subus x, C combine for non-uniform constants llvm-svn: 325944	2018-02-23 19:58:44 +00:00
Craig Topper	11704dcc72	[X86] Custom split v32i16/v64i8 bitcasts when AVX512F is available, but BWI is not. The test changes you can see are related to the changes in ReplaceNodeResults. Though shuffle-vs-trunc-512.ll does have a test that exercises the code in LowerBITCAST. Looks like the test output didn't change because DAG combining is able to clean up the resulting type legalization. Adding the custom hook just makes type legalization work less hard. Differential Revision: https://reviews.llvm.org/D43447 llvm-svn: 325933	2018-02-23 18:43:36 +00:00
Hans Wennborg	89c35fc44d	Support for the mno-stack-arg-probe flag Adds support for this flag. There is also another piece for clang (separate review). More info: https://bugs.llvm.org/show_bug.cgi?id=36221 By Ruslan Nikolaev! Differential Revision: https://reviews.llvm.org/D43107 llvm-svn: 325900	2018-02-23 13:46:25 +00:00
Craig Topper	0dcc88a500	[X86] Turn setne X, signedmax into setgt signedmax, X in LowerVSETCC to avoid an invert We won't be able to fold the constant pool load, but it's still better than materializing ones and xoring for the invert if we used PCMPEQ. This will fix another regression from D42948. llvm-svn: 325845	2018-02-23 00:21:39 +00:00
Craig Topper	d2fab30827	[X86] Turn setne X, signedmin into setgt X, signedmin in LowerVSETCC to avoid an invert This will fix one of the regressions from D42948. Differential Revision: https://reviews.llvm.org/D43531 llvm-svn: 325840	2018-02-22 23:46:28 +00:00
Craig Topper	1aed540ea2	[X86] Make the subus special case in LowerVSETCC self contained Previously this code overrode the flags and opcode used by the later code in LowerVSETCC. This makes the code difficult to read and follow. This patch moves all the SUBUS code into its own function and makes it responsible for creating its own SDNodes on success. Differential Revision: https://reviews.llvm.org/D43530 llvm-svn: 325827	2018-02-22 20:24:18 +00:00
Simon Pilgrim	55b7e01116	[X86][MMX] Generalize MMX_MOVD64rr combines to accept v4i16/v8i8 build vectors as well as v2i32 Also handle both cases where the lower 32-bits of the MMX is undef or zero extended. llvm-svn: 325736	2018-02-21 23:07:30 +00:00
Simon Pilgrim	82d33b7c44	[X86] LowerBITCAST - pull out repeated calls to getOperand(0). NFCI. llvm-svn: 325695	2018-02-21 16:35:40 +00:00
Craig Topper	df0c22fcd3	[X86] Correct SHRUNKBLEND creation to work correctly when there are multiple uses of the condition. SimplifyDemandedBits forces the demanded mask to all 1s if the node has multiple uses, unless the AssumeSingleUse flag is set. So previously we were only really likely to simplify something if the condition had a single use. And on the off chance we did simplify with multiple uses the demanded mask being used was all ones so there was no reason to create a shrunkblend. This patch now checks that the condition is only used by selects first, and then sets the AssumeSingleUse flag for the simplification. Then we convert the selects to shrunkblend, and finally replace the condition. Differential Revision: https://reviews.llvm.org/D43446 llvm-svn: 325604	2018-02-20 17:58:17 +00:00
Craig Topper	010ae8dcbb	[X86] Promote 16-bit cmovs to 32-bits This allows us to avoid an opsize prefix. And forcing some move immediates to i32 avoids a length changing prefix on those instructions. This mostly replaces the existing combine we had for zext/sext+cmov of constants. I left in a case for sign extending a 32 bit cmov of constants to 64 bits. Differential Revision: https://reviews.llvm.org/D43327 llvm-svn: 325601	2018-02-20 17:41:00 +00:00
Craig Topper	b195ed8ce3	[X86] Use vpmovq2m/vpmovd2m for truncate to vXi1 when possible. Previously we used vptestmd, but the scheduling data for SKX says vpmovq2m/vpmovd2m is lower latency. We already used vpmovb2m/vpmovw2m for byte/word truncates. So this is more consistent anyway. llvm-svn: 325534	2018-02-19 22:07:31 +00:00
Craig Topper	e60f1472f1	[X86] Stop swapping the operands of AVX512 setge. We swapped the operands and used setle, but I don't see any reason to do that. I think this is a holdover from SSE where we swap and then invert to use pcmpgt. But with AVX512 we don't want an invert so we won't use pcmpgt. So there's no need to swap. llvm-svn: 325527	2018-02-19 19:23:35 +00:00
Craig Topper	9471a7c898	[X86] Reduce the number of isel pattern variations needed for VPTESTM/VPTESTNM matching. Canonicalize EQ/NE PCMPM to have build vector all zeros on the RHS so we don't have to pattern match it in both locations. This significantly reduces the number of isel patterns needed since we also had to multiply it out with loads being in either operand of the 'and' input node and in the 'and' masking node. This removes over 24000 bytes from the isel table. llvm-svn: 325526	2018-02-19 19:23:31 +00:00
Simon Pilgrim	c302a581a0	[X86][SSE] combineTruncateWithSat - use truncateVectorWithPACK down to 64-bit subvectors Add support for chaining PACKSS/PACKUS down to 64-bit vectors by using only a single 128-bit input. llvm-svn: 325494	2018-02-19 13:29:20 +00:00
Craig Topper	9cf812e1ed	[X86] Correct a typo I made in combineToExtendCMOV recently. We're accidentally checking that the same node is a constant twice instead of checking the other node. This isn't a functional problem since we didn't do anything below that explicitly requires constants. It just means we may have introduced a sign_extend or zero_extend that won't fold out. llvm-svn: 325469	2018-02-18 20:41:25 +00:00
Craig Topper	0bcdd399e7	[X86] Turn selects with constant condition into vector shuffles during DAG combine Summary: Currently we convert to shuffles during lowering. This moves it to DAG combine so hopefully we can get it done before type legalization has to extend the condition. I believe in some cases we're creating SHRUNKBLENDs that end up with constant conditions because we see the extend on the condition and think it's a dynamic select before DAG combine gets a chance to constant fold the extend. We could add combines to turn SHRUNKBLENDs with constant condition back to vselect. But it seemed like it might be better to just send them to shuffles as early as possible so they never get a chance to become SHRUNKBLENDs. This is the reason some tests went from blends controlled by a constant pool load to just a move. Some of the constant pool entries changed because the sign_extend introduced by type legalization turned undef elements in the select condition into 0s, while the select->shuffle used -1 in the shuffle mask. So now the shuffle lowering can do what it wants with them. I'll remove the lowering code as a follow up. We might be able to simplify some of the pre-checks for SHRUNKBLEND as the FIXME there says. Reviewers: spatel, RKSimon, efriedma, zvi, andreadb Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D43367 llvm-svn: 325417	2018-02-17 00:30:30 +00:00
Craig Topper	27b9ac2372	[X86] In lowerVSELECTtoVectorShuffle, don't map undef select condition to undef in shuffle mask. Undef in select condition means we should pick the element from one side or the other. An undef in a shuffle mask means pick any element from either source or worse. I suspect by the time we get here most of the undefs in a constant vector have been removed by other things, but doing this for safety. llvm-svn: 325394	2018-02-16 21:36:29 +00:00
Craig Topper	de565fc73e	[X86] Only reorder srl/and on last DAG combiner run This seems to interfere with a target independent brcond combine that looks for the (srl (and X, C1), C2) pattern to enable TEST instructions. Once we flip, that combine doesn't fire and we end up exposing it to the X86 specific BT combine which causes us to emit a BT instruction. BT has lower throughput than TEST. We could try to make the brcond combine aware of the alternate pattern, but since the flip was just a code size reduction and not likely to enable other combines, it seemed easier to just delay it until after lowering. Differential Revision: https://reviews.llvm.org/D43201 llvm-svn: 325371	2018-02-16 18:51:09 +00:00
Craig Topper	79bd39db80	[X86] Remove call to ShrinkDemandedConstant from the SHRUNKBLEND creation code. We only run this code if we know the condition isn't a constant vector. ShrinkDemandedConstant isn't going to find anything different. llvm-svn: 325368	2018-02-16 18:34:46 +00:00
Simon Pilgrim	4e2f757dc1	[X86][SSE] Allow float domain crossing if we are merging 2 or more shuffles and the root started as a float domain shuffle llvm-svn: 325349	2018-02-16 14:57:25 +00:00
Craig Topper	2e4b838c06	[X86] Allow CMOVs of constants to be sign extended from i32. Sign extending i32 constants only requires a REX prefix as does widening the CMOV. This is cheaper than the explicit sign extend op. llvm-svn: 325318	2018-02-16 07:16:15 +00:00
Craig Topper	5d9e301042	[X86] Don't zero_extend cmov up to i64, stop at i32. Zero extend from i32 to i64 is free. So extend from i16 to i32, and then use a free zero extend to finish. llvm-svn: 325317	2018-02-16 06:52:43 +00:00
Craig Topper	f3f35efe5c	[X86] Enable BT to be used in place of TEST for single bit checks under optsize We already do this for 64-bit when it won't fit into a 64-bit AND/TEST's immediate field. This adds an additional qualifier to do it for any single bit constant larger than 8-bits under optsize Differential Revision: https://reviews.llvm.org/D43346 llvm-svn: 325290	2018-02-15 20:27:30 +00:00
Simon Pilgrim	17bb6f0755	[X86][SSE] combineTruncateWithSat - use truncateVectorWithPACK to chain PACKUS vXi32-vXi8 saturated truncation We can use PACKSS/PACKUS to saturate each stage of the chain: PACKSSDW down to [-32768,32767] and then PACKUSWB to [0,255]. llvm-svn: 325243	2018-02-15 14:37:59 +00:00
Simon Pilgrim	908f833e57	[X86][SSE] combineTruncateWithSat - use truncateVectorWithPACK to chain PACKSS vXi32-vXi8 saturated truncation We can use PACKSS to saturate each stage of the chain: PACKSSDW down to [-32768,32767] and then PACKSSWB to [-128,127]. PACKUS is a little trickier and will be handled in a separate patch. llvm-svn: 325235	2018-02-15 13:33:15 +00:00
Simon Pilgrim	2ec8373633	[X86][SSE] truncateVectorWithPACK - Use src type instead of dst to select between PACKSSDW/PACKSSWB Try to keep PACKSSDW/PACKSSWB as wide as possible. This helps ComputeNumSignBits as it can only peek through bitcasts to wider types. Pre-AVX2 codegen was already doing this as it could peek through bitcasts/subvectors more easily than AVX2 could through shuffles. This shouldn't affect existing results as calls to truncateVectorWithPACK ensure we have enough sign bits to pack to the same value, but it should make it possible to use truncateVectorWithPACK chains to perform saturation in combineTruncateWithSat with a future patch. llvm-svn: 325149	2018-02-14 18:23:58 +00:00
Simon Pilgrim	ded6e7a263	Fix GCC -Wlogical-op-parentheses warning. NFCI. llvm-svn: 325129	2018-02-14 15:07:36 +00:00
Simon Pilgrim	86d15bff68	[X86][SSE] Relax type legality for combineTruncateWithSat PACKSS/PACKUS truncation While the AVX512 VTRUNCS/VTRUNCUS instructions require legal types, truncateVectorWithPACK handles cases with multiples of legal types through splitting/concatenation. So we just need to ensure that the src/dst scalar types are correct and leave truncateVectorWithPACK to handle the rest of it. llvm-svn: 325127	2018-02-14 14:14:29 +00:00
Reid Kleckner	91e11a83fc	[X86] Use EDI for retpoline when no scratch regs are left Summary: Instead of solving the hard problem of how to pass the callee to the indirect jump thunk without a register, just use a CSR. At a call boundary, there's nothing stopping us from using a CSR to hold the callee as long as we save and restore it in the prologue. Also, add tests for this mregparm=3 case. I wrote execution tests for __llvm_retpoline_push, but they never got committed as lit tests, either because I never rewrote them or because they got lost in merge conflicts. Reviewers: chandlerc, dwmw2 Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D43214 llvm-svn: 325049	2018-02-13 20:47:49 +00:00
Craig Topper	036789a7e8	[X86] Add combine to shrink 64-bit ands when one input is an any_extend and the other input guarantees upper 32 bits are 0. Summary: This gets the shift case from PR35792. Reviewers: spatel, RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D43222 llvm-svn: 325018	2018-02-13 16:25:25 +00:00
Craig Topper	5ce6db93c1	[X86] Use getTypeAction in most places that were checking ExperimentalVectorWideningLegalization. This will allow more flexibility in what types we legalize via widening or not. This should help with a couple lines in D41062. llvm-svn: 324980	2018-02-13 01:49:58 +00:00
Craig Topper	88939fefe8	[X86] Simplify X86DAGToDAGISel::matchBEXTRFromAnd by creating an X86ISD::BEXTR node and calling Select. Add isel patterns to recognize this node. This removes a bunch of special case code for selecting the immediate and folding loads. llvm-svn: 324939	2018-02-12 21:18:11 +00:00
Craig Topper	3ce035acf3	[X86] Add KADD X86ISD opcode instead of reusing ISD::ADD. ISD::ADD implies individual vector element addition with no carries between elements. But for a vXi1 type that would be the same as XOR. And we already turn ISD::ADD into ISD::XOR for all vXi1 types during lowering. So the ISD::ADD pattern would never be able to match anyway. KADD is different, it adds the elements but also propagates a carry between them. This is just a way of doing an add in a k-register without bitcasting to the scalar domain. There's still no way to match the pattern, but at least it's not obviously wrong. llvm-svn: 324861	2018-02-12 01:33:38 +00:00
Craig Topper	363e099446	[X86] Remove MASK_BINOP intrinsic type. NFC llvm-svn: 324858	2018-02-11 22:32:30 +00:00
Craig Topper	38d61c38a2	[X86] Remove dead code from getMaskNode that looked for a i64 mask with a maskVT that wasn't v64i1. NFC llvm-svn: 324857	2018-02-11 22:32:29 +00:00
Craig Topper	a7ac028a6b	[X86] Remove LowerBoolVSETCC_AVX512, we get this with a target independent DAG combine now. NFC llvm-svn: 324856	2018-02-11 22:32:27 +00:00
Simon Pilgrim	0d8c4bfc2a	[X86][SSE] Use SplitBinaryOpsAndApply to recognise PSUBUS patterns before they're split on AVX1 This needs to be generalised further to support AVX512BW cases but I want to add non-uniform constants first. llvm-svn: 324844	2018-02-11 17:29:42 +00:00
Craig Topper	ca5a340171	[X86] Use min/max for vector ult/ugt compares if it avoids a sign flip. Summary: Currently we only use min/max to help with ule/uge compares because it removes an invert of the result that would otherwise be needed. But we can also use it for ult/ugt compares if it will prevent the need for a sign bit flip needed to use pcmpgt at the cost of requiring an invert after the compare. I also refactored the code so that the max/min code is self contained and does its own return instead of setting up a flag to manipulate the rest of the function's behavior. Most of the test cases look ok with this. I did notice that we added instructions when one of the operands being sign flipped is a constant vector that we were able to constant fold the flip into. I also noticed that sometimes the SSE min/max clobbers a register that is needed after the compare. This resulted in an extra move being inserted before the min/max to preserve the register. We could try to detect this and switch from min to max and change the compare operands to use the operand that gets reused in the compare. Reviewers: spatel, RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42935 llvm-svn: 324842	2018-02-11 17:11:40 +00:00
Simon Pilgrim	c2544c572a	[X86][SSE] Moved SplitBinaryOpsAndApply earlier so more methods can use it. NFCI. llvm-svn: 324841	2018-02-11 17:01:43 +00:00
Simon Pilgrim	0be5567a89	[X86][SSE] Enable SMIN/SMAX/UMIN/UMAX custom lowering for all legal types This allows us to recognise more saturation patterns and also simplify some MINMAX codegen that was failing to combine CMPGE comparisons to a legal CMPGT. Differential Revision: https://reviews.llvm.org/D43014 llvm-svn: 324837	2018-02-11 10:52:37 +00:00
Craig Topper	24d3b28d93	[X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits and 512 bits aren't required This patch adds a new function attribute "required-vector-width" that can be set by the frontend to indicate the maximum vector width present in the original source code. The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc. The backend will then use this information to determine if it's safe to make 512-bit vectors illegal when the preference is for 256-bit vectors. For code that has no vectors in it originally and only gets vectors through the loop and slp vectorizers this allows us to generate code largely similar to our AVX2 only output while still enabling AVX512 features like mask registers and gather/scatter. The loop vectorizer doesn't always obey TTI and will create oversized vectors with the expectation the backend will legalize it. In order to avoid changing the vectorizer and potentially harming our AVX2 codegen this patch tries to make the legalizer behavior similar. This is restricted to CPUs that support AVX512F and AVX512VL so that we have good fallback options to use 128 and 256-bit vectors and still get masking. I've qualified every place I could find in X86ISelLowering.cpp and added test cases for many of them with 2 different values for the attribute to see the codegen differences. We still need to do frontend work for the attribute and teach the inliner how to merge it, etc. But this gets the codegen layer ready for it. Differential Revision: https://reviews.llvm.org/D42724 llvm-svn: 324834	2018-02-11 08:06:27 +00:00
Craig Topper	a4bf9b8d51	[X86] Remove setOperationAction lines for promoting vXi1 SINT_TO_FP/UINT_TO_FP. We promote these via a DAG combine now before lowering gets the chance. Also remove the v2i1 custom handling since it will no longer be triggered. llvm-svn: 324833	2018-02-11 07:44:33 +00:00
Craig Topper	ba5ad55965	[X86] Remove some redundant qualifications from the setOperationAction blocks. NFC These were added as part of the refactoring for prefer vector width. At the time I thought the hasAVX512 here would be replaced with "allow 512 bit vectors" so that it would read "allow 512 bit vectors OR VLX". But now the plan is to only give the option of disabling 512 bit vectors when VLX is enabled. So we don't need this qualification at all llvm-svn: 324831	2018-02-11 03:07:19 +00:00
Craig Topper	4dccffc84a	[X86] Change signatures of avx512 packed fp compare intrinsics to return a vXi1 mask type to be closer to an fcmp. Summary: This patch changes the signature of the avx512 packed fp compare intrinsics to return a vXi1 vector and no longer take a mask as input. The casts to scalar type will now need to be explicit in the IR. The masking node will now be an explicit 'and' in the IR. This makes the intrinsic look much more similar to an fcmp instruction that we wish we could use for these but can't. We already use icmp instructions for integer compares. Previously the lowering step of isel would turn the intrinsic into an X86 specific ISD node and emit the masking nodes as well as some bitcasts. This means DAG combines can't see the vXi1 type until somewhat late, making it more difficult to combine out gpr<->mask transition sequences. By exposing the vXi1 type explicitly in the IR and initial SelectionDAG we give earlier DAG combines and even InstCombine the chance to see it and optimize it. This should make any issues with gpr<->mask sequences the same between integer and fp, meaning we only have to fix them once. Reviewers: spatel, delena, RKSimon, zvi Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D43137 llvm-svn: 324827	2018-02-10 23:33:55 +00:00
Craig Topper	9121eb575e	[X86] Custom legalize (v2i32 (setcc (v2f32))) so that we don't end up with a (v4i1 (setcc (v4f32))) Under VLX, getSetCCResultType returns v2i1/v4i1 for v2f32/v4f32 so default type legalization will end up changing the setcc result type back to vXi1 if it had been extended. The resulting extend gets messed up further by type legalization and is difficult to recombine back to (v4i32 (setcc (v4f32))) after legalization. I went ahead and enabled this for SSE2 and later since it's always the result we want and this helps type legalization get there in fewer steps. llvm-svn: 324822	2018-02-10 19:12:58 +00:00
Craig Topper	28d3a73c81	[X86] Extend inputs with elements smaller than i32 to sint_to_fp/uint_to_fp before type legalization. This prevents extends of masks being introduced during lowering where it becomes difficult to combine them out. There are a few oddities in here. We sometimes concatenate two k-registers produced by two compares, sign_extend the combined pair, then extract two halves. This worked better previously because the sign_extend wasn't created until after the fp_to_sint was split which led to a split sign_extend being created. We probably also need to custom type legalize (v2i32 (sext v2i1)) via widening. llvm-svn: 324820	2018-02-10 17:58:58 +00:00
Craig Topper	b8d7b1620b	[X86] Custom legalize (v2i1 (fp_to_uint/fp_to_sint v2f64)) without AVX512VL. Strangely the code was already present, just the setOperationAction wasn't being called without VLX. llvm-svn: 324806	2018-02-10 08:39:31 +00:00
Craig Topper	c3aab4bbe1	[X86] Legalize zero extends from vXi1 to vXi16/vXi32/vXi64 using a sign extend and a shift. This avoids a constant pool load to create 1. The int->float are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization. llvm-svn: 324805	2018-02-10 08:06:52 +00:00
Craig Topper	d34af6f636	[X86] Teach combineExtSetcc to handle ZERO_EXTEND by widening the setcc and then masking. A later DAG combine will convert to a shift. This helps to avoid a constant pool load needed to zero extend from the mask. llvm-svn: 324804	2018-02-10 08:06:49 +00:00
Craig Topper	fa6113b3d7	[X86] Teach combineInsertSubvector how to combine some k-register insert_subvectors and extract_subvector sequences to remove extra zeroing. llvm-svn: 324791	2018-02-10 01:00:41 +00:00
Craig Topper	99db883d55	[X86] Teach lower1BitVectorShuffle to recognize shuffles that are just filling upper elements with zero. Replace with insert_subvector. There's still some extra kshifts in one of the modified test cases here, but hopefully that's only a DAG combine away. llvm-svn: 324782	2018-02-09 23:32:27 +00:00
Craig Topper	ca5841b4e4	[X86] Simplify some code in lowerV4X128VectorShuffle and lowerV2X128VectorShuffle Previously we extracted two subvectors and concatenated them. But the concatenate will be lowered to two insert subvectors. Then DAG combine will merge one of the inserts and one of the extracts back into the original vector. We might as well just directly use one extract and one insert. llvm-svn: 324710	2018-02-09 05:54:36 +00:00
Craig Topper	28166a877d	[X86] Teach shuffle lowering to recognize 128/256 bit insertions into a zero vector. This regresses a couple cases in the shuffle combining test. But those cases use intrinsics that InstCombine knows how to turn into a generic shuffle earlier. This should give opportunities to fold this earlier in InstCombine or DAG combine. llvm-svn: 324709	2018-02-09 05:54:34 +00:00
Craig Topper	9e030c9e00	[X86] Improve combineCastedMaskArithmetic to fold (bitcast (vXi1 (and/or/xor X, C)))->(vXi1 (and/or/xor (bitcast X), (bitcast C))) where C is a constant build_vector. Most vXi1 constant build vectors have to be implemented in the scalar domain anyway so we'll probably end up with a cast there later. But by then it's too late to do the combine to get rid of it. llvm-svn: 324662	2018-02-08 22:26:39 +00:00
Craig Topper	1b5b4ccb77	[X86] Add DAG combine to constant fold a bitcast of a vXi1 constant build_vector into a scalar integer. llvm-svn: 324661	2018-02-08 22:26:36 +00:00
Craig Topper	dccf72b583	[X86] Remove kortest intrinsics and replace with native IR. llvm-svn: 324646	2018-02-08 20:16:06 +00:00
Clement Courbet	1b8c08b633	[X86] Fix compilation of r324580. @ctopper Can you check that the fix is correct ? llvm-svn: 324586	2018-02-08 09:41:50 +00:00
Craig Topper	8d0c8c9be1	[X86] Support folding in a k-register OR when creating KORTEST from scalar compare of a bitcast from vXi1. This should allow us to remove the kortest intrinsic from IR and use compare+bitcast+or in IR instead. llvm-svn: 324580	2018-02-08 08:29:43 +00:00
Craig Topper	93505707b6	[X86] Allow KORTEST instruction to be used for testing if a mask is all ones The KTEST instruction sets the C flag if the result of anding both operands together is all 1s. We can use this to lower (icmp eq/ne (bitcast (vXi1 X), -1) Differential Revision: https://reviews.llvm.org/D42772 llvm-svn: 324577	2018-02-08 07:54:16 +00:00
Craig Topper	f5465f98d2	[X86] Don't emit KTEST instructions unless only the Z flag is being used Summary: KTEST has weird flag behavior. The Z flag is set for all bits in the AND of the k-registers being 0, and the C flag is set for all bits being 1. All other flags are cleared. We currently emit this instruction in EmitTEST and don't check the condition code. This can lead to strange things like using the S flag after a KTEST for a signed compare. The domain reassignment pass can also transform TEST instructions into KTEST and is not protected against the flag usage either. For now I've disabled this part of the domain reassignment pass. I tried to comment out the checks in the mir test so that we could recover them later, but I couldn't figure out how to get that to work. This patch moves the KTEST handling into LowerSETCC and now creates a ktest+x86setcc. I've chosen this approach because I'd like to add support for the C flag for all ones in a followup patch. To do that requires that I can rewrite the condition code going in the x86setcc to be different than the original SETCC condition code. This fixes PR36182. I'll file a PR to fix domain reassignment once this goes in. Should this be merged to 6.0? Reviewers: spatel, guyblank, RKSimon, zvi Reviewed By: guyblank Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42770 llvm-svn: 324576	2018-02-08 07:45:55 +00:00
Craig Topper	37765ff326	[X86] Prune some unreachable 'return SDValue()' paths from LowerSIGN_EXTEND/LowerZERO_EXTEND/LowerANY_EXTEND. We were doing a lot of whitelisting of what we handle in these routines, but setOperationAction constrains what we can get here. So just add some asserts and prune the unreachable paths. llvm-svn: 324538	2018-02-07 22:45:38 +00:00
Craig Topper	1db5ebc016	[X86] Remove dead code from EmitTest that looked for an i1 type which should have already been type legalized away. NFC llvm-svn: 324536	2018-02-07 22:19:26 +00:00
Simon Pilgrim	b4e789e8f6	[X86][AVX] Add PACKSSDW/PACKUSDW support for truncation of clamped values SSE and shorter vector sizes will have to wait until we can add support for general SMIN/SMAX matching. llvm-svn: 324485	2018-02-07 15:48:44 +00:00
Chandler Carruth	282ae1632a	[x86/retpoline] Make the external thunk names exactly match the names that happened to end up in GCC. This is really unfortunate, as the names don't have much rhyme or reason to them. Originally in the discussions it seemed fine to rely on aliases to map different names to whatever external thunk code developers wished to use but there are practical problems with that in the kernel it turns out. And since we're discovering these practical problems late and since GCC has already shipped a release with one set of names, we are forced, yet again, to blindly match what is there. Somewhat rushing this patch out for the Linux kernel folks to test and so we can get it patched into our releases. Differential Revision: https://reviews.llvm.org/D42998 llvm-svn: 324449	2018-02-07 06:16:24 +00:00
Craig Topper	58ecffd857	[DAGCombiner][AMDGPU][X86] Turn cttz/ctlz into cttz_zero_undef/ctlz_zero_undef if we can prove the input is never zero X86 currently has a late DAG combine after cttz/ctlz are turned into BSR+BSF+CMOV to detect this and remove the CMOV. But we should be able to do this much earlier and avoid creating the cmov altogether. For the changed AMDGPU test case it appears that previously the i8 cttz was type legalized to i16 which introduced an OR with 256 in order to limit the result to 8 on the widened type. At this point the result is known to never be zero, but nothing checked that. Then operation legalization is told to promote all i16 cttz to i32. This introduces an extend and a truncate and another OR with 65536 to limit the result to 16. With the DAG combiner change we are able to prevent the creation of the second OR since the opcode will have been changed to cttz_zero_undef after the first OR. I think the lack of the OR caused the instruction to change to v_ffbl_b32_sdwa Differential Revision: https://reviews.llvm.org/D42985 llvm-svn: 324427	2018-02-06 23:54:37 +00:00
Simon Pilgrim	ae00a71f55	[X86][SSE] Add PACKUS support for truncation of clamped values Followup to D42544 that matches PACKUSWB cases for non-AVX512, SSE and PACKUSDW cases will have to wait until we can add support for general SMIN/SMAX matching. llvm-svn: 324347	2018-02-06 14:07:46 +00:00
Simon Pilgrim	90a237bf83	[X86][SSE] Add PACKSS support for truncation of clamped values Followup to D42544 that matches PACKSSWB cases for non-AVX512, SSE and PACKSSDW cases will have to wait until we can add support for general SMIN/SMAX matching. llvm-svn: 324339	2018-02-06 12:16:10 +00:00
Craig Topper	9c6c7c5e9b	[X86] Relax restrictions on what setcc condition codes can be folded with a sext when AVX512 is enabled. We now allow all signed comparisons and not equal. The complement that needs to be added for this is no worse than the extend. And the vector output forms of pcmpeq/pcmpgt have better latency than the k-register version on SKX. llvm-svn: 324294	2018-02-05 23:57:01 +00:00
Craig Topper	5a2bd99a9e	[X86] Add isel patterns for selecting masked SUBV_BROADCAST with bitcasts. Remove combineBitcastForMaskedOp. Add test cases for the merge masked versions to make sure we have all those covered. llvm-svn: 324210	2018-02-05 08:37:37 +00:00
Craig Topper	6ff5eb5dd5	[X86] Remove unused lambda. NFC llvm-svn: 324206	2018-02-05 06:56:33 +00:00
Craig Topper	25ceba7f30	[X86] Remove X86ISD::SHUF128 from combineBitcastForMaskedOp. Use isel patterns instead. We always created X86ISD::SHUF128 with a 64-bit element type so we can use isel patterns to detect a bitconvert to 32-bit to handle masking. The test changes are because we also match the bitconvert even if there is no masking. This leads to unnecessary isel pattern, but it requires more multiclass hackery in tablegen to get rid of it. llvm-svn: 324205	2018-02-05 06:00:23 +00:00
Craig Topper	8d511a65af	[X86] Add DAG combine to turn (bitcast (and/or/xor (bitcast X), Y)) -> (and/or/xor X, (bitcast Y)) when casting between GPRs and mask operations. This reduces the number of transitions between k-registers and GPRs, reducing the number of instructions. There's still some room for improvement to remove more transitions, but this is a good start. llvm-svn: 324184	2018-02-04 01:43:48 +00:00
Craig Topper	17d99f1df4	[X86] Remove unused function argument. NFC llvm-svn: 324183	2018-02-04 01:43:44 +00:00
Craig Topper	071ad9c6e0	[X86] Remove and autoupgrade kand/kandn/kor/kxor/kxnor/knot intrinsics. Clang already stopped using these a couple months ago. The test cases aren't great as there is nothing forcing the operations to stay in k-registers so some of them moved back to scalar ops due to the bitcasts being moved around. llvm-svn: 324177	2018-02-03 20:18:25 +00:00
Craig Topper	fae8788cfa	[X86] Prefer to create a ISD::SETCC over X86ISD::PCMPEQ in combineVectorSizedSetCCEquality. This is running pre-legalize, we should try to use target independent nodes. This will give the best opportunity for target independent optimizations. llvm-svn: 324147	2018-02-02 21:59:46 +00:00
Craig Topper	10aa254ecd	[X86] Pass SDLoc by const reference in a few more places in X86ISelLowering.cpp. NFC llvm-svn: 324135	2018-02-02 20:32:00 +00:00
Craig Topper	76c5ce5184	[X86] Legalize (v64i1 (bitcast (i64 X))) on 32-bit targets by extracting 32-bit halves from i32, bitcasting each to v32i1, and concatenating. This prevents the scalarization that would otherwise occur. llvm-svn: 324057	2018-02-02 05:59:33 +00:00
Craig Topper	5570e03b21	[X86] Legalize (i64 (bitcast (v64i1 X))) on 32-bit targets by extracting to v32i1 and bitcasting to i32. This saves a trip through memory and seems to open up other combining opportunities. llvm-svn: 324056	2018-02-02 05:59:31 +00:00
Craig Topper	2d67d1e2a8	[X86] Separate the call to LowerVectorAllZeroTest from EmitTest. NFCI Every instruction that has the word TEST in its name seems to have been buried into EmitTest. But that code is largely concerned with trying to reuse the flags from instructions that update flags in a pretty normal way. PTEST/TESTP/KTEST do not update flags in a normal way. They only update Z and C and the C flag update is non-standard. Rather than try to bend EmitTest's already complex logic to accommodate this, just move the call up to LowerSETCC and replicate the few pre-checks that are needed. While there add a FIXME for using the C flag for checking for all 1s which we definitely couldn't do from EmitTEST. llvm-svn: 324029	2018-02-01 23:21:20 +00:00
Simon Pilgrim	1a8cefc328	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - add support for scaling index vectors This allows us to use PSHUFB for v8i16/v4i32 and VPERMD/PERMPS for v4i64/v4f64 variable shuffles. Differential Revision: https://reviews.llvm.org/D42487 llvm-svn: 323987	2018-02-01 18:10:30 +00:00
Craig Topper	a8a24232ee	[X86] Remove custom lowering vXi1 extending loads and truncating stores. Summary: Now that v2i1/v4i1 are legal without VLX. And v32i1 is legalized by splitting rather than widening. And isVectorLoadExtDesirable returns false for vXi1. It appears this handling is dead because the operations simply don't exist. Reviewers: RKSimon, zvi, guyblank, delena, spatel Reviewed By: delena Subscribers: llvm-commits, rengolin Differential Revision: https://reviews.llvm.org/D42781 llvm-svn: 323983	2018-02-01 17:08:41 +00:00
Craig Topper	7e910a9e85	[X86] Turn X86ISD::AND nodes that have no flag users back into ISD::AND just before isel to enable test instruction matching Summary: EmitTest sometimes creates X86ISD::AND specifically to hide the AND from DAG combine. But this prevents isel patterns that look for (cmp (and X, Y), 0) from being able to see it. So we end up with an AND and a TEST. The TEST gets removed by compare instruction optimization during the peephole pass. This patch attempts to fix this by converting X86ISD::AND with no flag users back into ISD::AND during the DAG preprocessing just before isel. In order to do this correctly I had to make the X86ISD::AND node created by EmitTest in this case really have a flag output. Which arguably it should have had anyway so that the number of operands would be consistent for the opcode in all cases. Then I had to modify the ReplaceAllUsesWith to understand that we might be looking at an instruction with 2 outputs. Though in this case there are no uses to replace since we just created the node, but that's what the code did before so I just made it keep working. Reviewers: spatel, RKSimon, niravd, deadalnix Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42764 llvm-svn: 323982	2018-02-01 17:08:39 +00:00
Dean Michael Berris	cdca0730be	[XRay][compiler-rt+llvm] Update XRay register stashing semantics Summary: This change expands the amount of registers stashed by the entry and `__xray_CustomEvent` trampolines. We've found that since the `__xray_CustomEvent` trampoline calls can show up in situations where the scratch registers are being used, and since we don't typically want to affect the code-gen around the disabled `__xray_customevent(...)` intrinsic calls, that we need to save and restore the state of even the scratch registers in the handling of these custom events. Reviewers: pcc, pelikan, dblaikie, eizan, kpw, echristo, chandlerc Reviewed By: echristo Subscribers: chandlerc, echristo, hiraditya, davide, dblaikie, llvm-commits Differential Revision: https://reviews.llvm.org/D40894 llvm-svn: 323940	2018-02-01 02:21:54 +00:00
Craig Topper	e44faf53c7	[X86] Make the type checks in detectAVX512USatPattern more robust This code currently uses isSimple and getSizeInBits in an attempt to prune types. But isSimple will return true for any type that any target supports natively. I don't think that's a good way to prune types. I also don't think the dest element type checks are very robust since we didn't do an isSimple check on the dest type. This patch adds a check for the input type being legal to the one caller that didn't already check that. Then we explicitly check the element types for the destination are i8, i16, or i32 Differential Revision: https://reviews.llvm.org/D42706 llvm-svn: 323924	2018-01-31 22:26:31 +00:00
Craig Topper	d759f476e8	[X86] Remove redundant check for hasAVX512 before calling hasBWI. NFC hasBWI implies hasAVX512. llvm-svn: 323823	2018-01-30 21:53:35 +00:00
Simon Pilgrim	073f089c6e	[X86][XOP] Update isVectorShiftByScalarCheap with cases covered by XOP Similar to D42437, XOP supports variable shift for v16i8/v8i16/v4i32/v2i64 types. Differential Revision: https://reviews.llvm.org/D42526 llvm-svn: 323797	2018-01-30 18:10:21 +00:00
Craig Topper	eb13ebdb99	[X86] Don't create SHRUNKBLEND when the condition is used by the true or false operand of the vselect. Fixes PR34592. Differential Revision: https://reviews.llvm.org/D42628 llvm-svn: 323672	2018-01-29 17:56:57 +00:00
Hiroshi Inoue	c8e9245816	[NFC] fix trivial typos in comments and documents "to to" -> "to" llvm-svn: 323628	2018-01-29 05:17:03 +00:00
Craig Topper	3913a4dd56	[X86] Fix a crash that can occur in combineExtractVectorElt due to not checking the width of a ConstantSDNode before calling getConstantOperandVal. llvm-svn: 323614	2018-01-28 07:29:35 +00:00
Craig Topper	15d69739e2	[X86] Remove VPTESTM/VPTESTNM ISD opcodes. Use isel patterns matching cmpm eq/ne with immallzeros. llvm-svn: 323612	2018-01-28 00:56:30 +00:00
Craig Topper	247016a735	[X86] Use vptestm/vptestnm for comparisons with zero to avoid creating a zero vector. We can use the same input for both operands to get a free compare with zero. We already use this trick in a couple places where we explicitly create PTESTM with the same input twice. This generalizes it. I'm hoping to remove the ISD opcodes and move this to isel patterns like we do for scalar cmp/test. llvm-svn: 323605	2018-01-27 20:19:09 +00:00
Craig Topper	513d3fa674	[X86] Remove X86ISD::PCMPGTM/PCMPEQM and instead just use X86ISD::PCMPM and pattern match the immediate value during isel. Legalization is still biased to turn LT compares in to GT by swapping operands to avoid needing extra isel patterns to commute. I'm hoping to remove TESTM/TESTNM next and this should simplify that by making EQ/NE more similar. llvm-svn: 323604	2018-01-27 20:19:02 +00:00
Simon Pilgrim	fe3fac805a	[X86][SSE] Simplify demanded elements from BROADCAST shuffle source. If broadcasting from another shuffle, attempt to simplify it. We can probably generalize this a lot more (embedding in combineX86ShufflesRecursively), but BROADCAST is one of the more troublesome as it accepts inputs of different sizes to the result. llvm-svn: 323602	2018-01-27 19:48:13 +00:00
Benjamin Kramer	a03d3198ee	[X86] Unbreak the build. X86ISelLowering.cpp:34130:5: error: return type 'llvm::SDValue' must match previous return type 'const llvm::SDValue' when lambda expression has unspecified explicit return type llvm-svn: 323557	2018-01-26 20:16:43 +00:00
Craig Topper	d4795b700d	[X86] Allow any_extend to be combined with setcc on VLX targets. For VLX targets getSetccResultType returns vXi1 which prevents the target independent DAG combine from doing this transform itself. llvm-svn: 323555	2018-01-26 20:02:52 +00:00
Simon Pilgrim	8e9becbd81	[X86][AVX512] Add combining support for X86ISD::VTRUNCS Similar to the existing support for X86ISD::VTRUNCUS. Differential Revision: https://reviews.llvm.org/D42544 llvm-svn: 323553	2018-01-26 20:01:12 +00:00
Sanjay Patel	b8ae262bd3	[x86] fix typo in comment; NFC llvm-svn: 323545	2018-01-26 18:44:32 +00:00
Simon Pilgrim	1b14bdc0b8	[X86][AVX] LowerBUILD_VECTORAsVariablePermute - add support for VPERMILPV to v4i32/v4f32 Extension to D42431, adding support for v4i32/v4f32 as well as v2i64/v2f64 now that D42308 has landed llvm-svn: 323542	2018-01-26 17:19:59 +00:00
Simon Pilgrim	76ede609f6	[X86][SSE] Don't coalesce v4i32 extracts We currently coalesce v4i32 extracts from all 4 elements to 2 v2i64 extracts + shifts/sign-extends. This seems to have been added back in the days when we tended to spill vectors and reload scalars, or ended up with repeated shuffles moving everything down to 0'th index. I don't think either of these are likely these days as we have better EXTRACT_VECTOR_ELT and VECTOR_SHUFFLE handling, and the existing code tends to make it very difficult for various vector and load combines. Differential Revision: https://reviews.llvm.org/D42308 llvm-svn: 323541	2018-01-26 17:11:34 +00:00
Simon Pilgrim	d567c27c84	[X86][SSE] Drop PMADDWD in lowerMul As mentioned in D42258, we don't need this any more llvm-svn: 323540	2018-01-26 16:57:36 +00:00
Simon Pilgrim	445d7c0e5c	[X86] Cleanup SDLoc arguments as mentioned on D42544 llvm-svn: 323526	2018-01-26 14:00:01 +00:00
Craig Topper	882f0d7955	[X86] Remove dead code from LowerBUILD_VECTOR that tried to handle i64 element type in 32-bit mode. Type legalization would prevent any i64 operands to the build_vector from existing before we get here. The coverage bots show this code as uncovered. llvm-svn: 323506	2018-01-26 07:30:44 +00:00
Craig Topper	77c5077585	[X86] Remove code from combineBitcastvxi1 that was needed to support the previous native IR for kunpck intrinsics. The original autoupgrade for kunpck intrinsics used a bitcasted scalar shift, or, and. This combine would turn this into a concat_vectors. Now the kunpck intrinsics are autoupgraded to a vector shuffle that will become a concat_vectors. llvm-svn: 323504	2018-01-26 07:15:21 +00:00
Craig Topper	95e8c9143e	[X86] Remove unused intrinsic type handling. NFC llvm-svn: 323503	2018-01-26 07:15:20 +00:00
Craig Topper	ccb35dfda6	[X86] Simplify condition in VSETCC. NFC This listed all legal 128-bit integer types individually, but since we already know we have a legal type and it's an integer, we can just check is128BitVector. llvm-svn: 323502	2018-01-26 07:15:18 +00:00
Craig Topper	faa56f7b08	[X86] Remove LowerVSETCC code for handling vXi1 setcc with vXi8/vXi16 input type. NFC These kinds of setccs are promoted by a DAG combine before they ever get to legalization. llvm-svn: 323501	2018-01-26 07:15:17 +00:00
Craig Topper	ad8ce0b800	[X86] Remove some dead code from LowerVSETCC. NFC This code was added in r321967, but ultimately I fixed the issue in the legalizer and this code was no longer required. llvm-svn: 323500	2018-01-26 07:15:16 +00:00
Simon Pilgrim	09c56b799f	[X86] Apply clang-format to detectUSatPattern. NFCI. Cleanup from D42544 llvm-svn: 323439	2018-01-25 16:38:56 +00:00
Simon Pilgrim	9f551ad604	[X86][SSE] Aggressively use PMADDWD for v4i32 multiplies with 17 or more leading zeros As discussed in D41484, PMADDWD for 'zero extended' vXi32 is nearly always a better option than PMULLD: On SNB it will result in code that isn't any faster, but not any slower so we may as well keep it. On KNL it only has half the throughput, so I've disabled it on there - ideally there'd be a better way than this. Differential Revision: https://reviews.llvm.org/D42258 llvm-svn: 323367	2018-01-24 19:20:02 +00:00
Simon Pilgrim	f26df47831	[X86][SSE] Avoid calls to combineX86ShufflesRecursively that can't combine to target shuffles (PR32037) Don't bother making recursive calls to combineX86ShufflesRecursively if we have more shuffle source operands than will be combined together with the remaining recursive depth. See https://bugs.llvm.org/show_bug.cgi?id=32037#c26 and https://bugs.llvm.org/show_bug.cgi?id=32037#c27 for the reduction in compile times from this patch. Differential Revision: https://reviews.llvm.org/D42378 llvm-svn: 323320	2018-01-24 11:41:09 +00:00
Craig Topper	0321ebc054	[X86] Use ISD::SIGN_EXTEND instead of X86ISD::VSEXT for mask to xmm/ymm/zmm conversion There are a couple tricky things with this patch. I had to add an override of isVectorLoadExtDesirable to stop DAG combine from combining sign_extend with loads after legalization since we legalize sextload using a load+sign_extend. Overriding this hook actually prevents a lot of sextloads from being created in the first place. I also had to add isel patterns because DAG combine blindly combines sign_extend+truncate to a smaller sign_extend which defeats what legalization was trying to do. Differential Revision: https://reviews.llvm.org/D42407 llvm-svn: 323301	2018-01-24 04:51:17 +00:00
Zvi Rackover	b5447b1e7c	X86: Update isVectorShiftByScalarCheap with cases covered by AVX512BW Summary: AVX512BW adds support for variable shift amount for 16-bit element vectors. Reviewers: craig.topper, RKSimon, spatel Reviewed By: RKSimon Subscribers: rengolin, tschuett, llvm-commits Differential Revision: https://reviews.llvm.org/D42437 llvm-svn: 323292	2018-01-24 01:36:40 +00:00
Simon Pilgrim	2cc74ed2be	[X86][AVX] LowerBUILD_VECTORAsVariablePermute - add support for VPERMILPV to v2i64/v2f64 Minor refactor to make it possible for LowerBUILD_VECTORAsVariablePermute to be used with a wider variety of shuffle ops and types. I'd have liked to add v4i32/v4f32 support as well but we don't see v4i32 index extractions at the moment (which is why I created D42308). After this I intend to begin adding scaling support for PSHUFB (v8i16, v4i32, v2i64) and VPERMPS (v4f64, v4i64). Differential Revision: https://reviews.llvm.org/D42431 llvm-svn: 323260	2018-01-23 21:33:24 +00:00
Simon Pilgrim	6ff241fc99	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - extract subvector from oversized index vectors llvm-svn: 323223	2018-01-23 17:02:15 +00:00
Craig Topper	c58c2b5c9b	[X86] Rewrite vXi1 element insertion by using a vXi1 scalar_to_vector and inserting into a vXi1 vector. The existing code was already doing something very similar to subvector insertion so this allows us to remove the nearly duplicate code. This patch is a little larger than it should be due to differences between the DQI handling between the two today. llvm-svn: 323212	2018-01-23 15:56:36 +00:00
Simon Pilgrim	0c9f77a9f9	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the source vector is not larger than the destination We might be able to support this in the future with VPERMV3, OR(PSHUFB, PSHUFB) etc. llvm-svn: 323210	2018-01-23 15:51:03 +00:00
Simon Pilgrim	9b4a097f94	Use EVT::changeVectorElementTypeToInteger() to convert index type to integer llvm-svn: 323207	2018-01-23 15:30:07 +00:00
Simon Pilgrim	e2905c8a0c	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the index vector has the correct number of elements llvm-svn: 323206	2018-01-23 15:13:37 +00:00
Craig Topper	76adcc86cd	[X86] Legalize v32i1 without BWI via splitting to v16i1 rather than the default of promoting to v32i8. Summary: For the most part it's better to keep v32i1 as a mask type of a narrower width than trying to promote it to a ymm register. I had to add some overrides to the methods that get the types for the calling convention so that we still use v32i8 for argument/return purposes. There are still some regressions in here. I definitely saw some around shuffles. I think we probably should move vXi1 shuffle from lowering to a DAG combine where I think the extend and truncate we have to emit would be better combined. I think we also need a DAG combine to remove trunc from (extract_vector_elt (trunc)) Overall this removes something like 13000 CHECK lines from lit tests. Reviewers: zvi, RKSimon, delena, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42031 llvm-svn: 323201	2018-01-23 14:25:39 +00:00
Simon Pilgrim	8ea1a0c690	[X86][SSE] LowerBUILD_VECTORAsVariablePermute - fix PSHUFB source/index operand ordering As detailed in rL317463, PSHUFB (like most variable shuffle instructions) uses Op[0] for the source vector and Op[1] for the shuffle index vector, VPERMV works in reverse which is probably where the confusion comes from. Differential Revision: https://reviews.llvm.org/D42380 llvm-svn: 323190	2018-01-23 11:39:06 +00:00
Craig Topper	c92edd994e	[X86] Don't reorder (srl (and X, C1), C2) if (and X, C1) can be matched as a movzx Summary: If we can match as a zero extend there's no need to flip the order to get an encoding benefit, as movzx is 3 bytes with independent source/dest registers. The shortest 'and' we could make is also 3 bytes unless we get lucky in the register allocator and it's on AL/AX/EAX which have a 2 byte encoding. This patch was more impressive before r322957 went in. It removed some of the same Ands that got deleted by that patch. Reviewers: spatel, RKSimon Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42313 llvm-svn: 323175	2018-01-23 05:45:52 +00:00
Craig Topper	26a701f24f	[X86] Various vXi1 insertion improvements. Add missing patterns for inserting v1i1 into a zero vector. Use insert_subvector to zero upper bits before inserting an element into a vXi1 vector. Replace kshift based isel pattern with insert_subvector based pattern now that code that caused the pattern has been fixed to emit insert_subvector. llvm-svn: 323173	2018-01-23 05:36:53 +00:00
Chandler Carruth	c58f2166ab	Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre.. Summary: First, we need to explain the core of the vulnerability. Note that this is a very incomplete description, please see the Project Zero blog post for details: https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html The basis for branch target injection is to direct speculative execution of the processor to some "gadget" of executable code by poisoning the prediction of indirect branches with the address of that gadget. The gadget in turn contains an operation that provides a side channel for reading data. Most commonly, this will look like a load of secret data followed by a branch on the loaded value and then a load of some predictable cache line. The attacker then uses timing of the processors cache to determine which direction the branch took in the speculative execution, and in turn what one bit of the loaded value was. Due to the nature of these timing side channels and the branch predictor on Intel processors, this allows an attacker to leak data only accessible to a privileged domain (like the kernel) back into an unprivileged domain. The goal is simple: avoid generating code which contains an indirect branch that could have its prediction poisoned by an attacker. In many cases, the compiler can simply use directed conditional branches and a small search tree. LLVM already has support for lowering switches in this way and the first step of this patch is to disable jump-table lowering of switches and introduce a pass to rewrite explicit indirectbr sequences into a switch over integers. However, there is no fully general alternative to indirect calls. We introduce a new construct we call a "retpoline" to implement indirect calls in a non-speculatable way. 
It can be thought of loosely as a trampoline for indirect calls which uses the RET instruction on x86. Further, we arrange for a specific call->ret sequence which ensures the processor predicts the return to go to a controlled, known location. The retpoline then "smashes" the return address pushed onto the stack by the call with the desired target of the original indirect call. The result is a predicted return to the next instruction after a call (which can be used to trap speculative execution within an infinite loop) and an actual indirect branch to an arbitrary address. On 64-bit x86 ABIs, this is especially easily done in the compiler by using a guaranteed scratch register to pass the target into this device. For 32-bit ABIs there isn't a guaranteed scratch register and so several different retpoline variants are introduced to use a scratch register if one is available in the calling convention and to otherwise use direct stack push/pop sequences to pass the target address. This "retpoline" mitigation is fully described in the following blog post: https://support.google.com/faqs/answer/7625886 We also support a target feature that disables emission of the retpoline thunk by the compiler to allow for custom thunks if users want them. These are particularly useful in environments like kernels that routinely do hot-patching on boot and want to hot-patch their thunk to different code sequences. They can write this custom thunk and use `-mretpoline-external-thunk` in addition to `-mretpoline`. In this case, on x86-64 the thunk names must be: ``` __llvm_external_retpoline_r11 ``` or on 32-bit: ``` __llvm_external_retpoline_eax __llvm_external_retpoline_ecx __llvm_external_retpoline_edx __llvm_external_retpoline_push ``` And the target of the retpoline is passed in the named register, or in the case of the `push` suffix on the top of the stack via a `pushl` instruction. There is one other important source of indirect branches in x86 ELF binaries: the PLT. 
These patches also include support for LLD to generate PLT entries that perform a retpoline-style indirection. The only other indirect branches remaining that we are aware of are from precompiled runtimes (such as crt0.o and similar). The ones we have found are not really attackable, and so we have not focused on them here, but eventually these runtimes should also be replicated for retpoline-ed configurations for completeness. For kernels or other freestanding or fully static executables, the compiler switch `-mretpoline` is sufficient to fully mitigate this particular attack. For dynamic executables, you must compile all libraries with `-mretpoline` and additionally link the dynamic executable and all shared libraries with LLD and pass `-z retpolineplt` (or use similar functionality from some other linker). We strongly recommend also using `-z now` as non-lazy binding allows the retpoline-mitigated PLT to be substantially smaller. When manually applying similar transformations to `-mretpoline` to the Linux kernel we observed very small performance hits to applications running typical workloads, and relatively minor hits (approximately 2%) even for extremely syscall-heavy applications. This is largely due to the small number of indirect branches that occur in performance sensitive paths of the kernel. When using these patches on statically linked applications, especially C++ applications, you should expect to see a much more dramatic performance hit. For microbenchmarks that are switch, indirect-, or virtual-call heavy we have seen overheads ranging from 10% to 50%. However, real-world workloads exhibit substantially lower performance impact. Notably, techniques such as PGO and ThinLTO dramatically reduce the impact of hot indirect calls (by speculatively promoting them to direct calls) and allow optimized search trees to be used to lower switches. 
If you need to deploy these techniques in C++ applications, we strongly recommend that you ensure all hot call targets are statically linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well tuned servers using all of these techniques saw 5% - 10% overhead from the use of retpoline. We will add detailed documentation covering these components in subsequent patches, but wanted to make the core functionality available as soon as possible. Happy for more code review, but we'd really like to get these patches landed and backported ASAP for obvious reasons. We're planning to backport this to both 6.0 and 5.0 release streams and get a 5.0 release with just this cherry picked ASAP for distros and vendors. This patch is the work of a number of people over the past month: Eric, Reid, Rui, and myself. I'm mailing it out as a single commit due to the time sensitive nature of landing this and the need to backport it. Huge thanks to everyone who helped out here, and everyone at Intel who helped out in discussions about how to craft this. Also, credit goes to Paul Turner (at Google, but not an LLVM contributor) for much of the underlying retpoline design. Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D41723 llvm-svn: 323155	2018-01-22 22:05:25 +00:00
Simon Pilgrim	17682a86da	[X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding (Reapplied) Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering. Reapplied after rL322279 was reverted at rL322335 due to PR35918, underlying issue was fixed at rL322644. llvm-svn: 323104	2018-01-22 12:05:17 +00:00
Marina Yatsina	6fc2aaae8d	Separate ExecutionDepsFix into 4 parts: 1. ReachingDefsAnalysis - Allows to identify for each instruction what is the “closest” reaching def of a certain register. Used by BreakFalseDeps (for clearance calculation) and ExecutionDomainFix (for arbitrating conflicting domains). 2. ExecutionDomainFix - Changes the variant of the instructions in order to minimize domain crossings. 3. BreakFalseDeps - Breaks false dependencies. 4. LoopTraversal - Creates a traversal order of the basic blocks that is optimal for loops (introduced in revision L293571). Both ExecutionDomainFix and ReachingDefsAnalysis use this to determine the order they will traverse the basic blocks. This also included the following changes to ExecutionDepsFix original logic: 1. BreakFalseDeps and ReachingDefsAnalysis logic no longer restricted by a register class. 2. ReachingDefsAnalysis tracks liveness of reg units instead of reg indices into a given reg class. Additional changes in affected files: 1. X86 and ARM targets now inherit from ExecutionDomainFix instead of ExecutionDepsFix. BreakFalseDeps also was added to the passes they activate. 2. Comments and references to ExecutionDepsFix replaced with ExecutionDomainFix and BreakFalseDeps, as appropriate. Additional refactoring changes will follow. This commit is (almost) NFC. The only functional change is that now BreakFalseDeps will break dependency for all register classes. Since no additional instructions were added to the list of instructions that have false dependencies, there is no actual change yet. In a future commit several instructions (and tests) will be added. This is the first of multiple patches that fix bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869 Most of the patches are intended at refactoring the existent code. 
Additional relevant reviews: https://reviews.llvm.org/D40331 https://reviews.llvm.org/D40332 https://reviews.llvm.org/D40333 https://reviews.llvm.org/D40334 Differential Revision: https://reviews.llvm.org/D40330 Change-Id: Icaeb75e014eff96a8f721377783f9a3e6c679275 llvm-svn: 323087	2018-01-22 10:05:23 +00:00
Craig Topper	7fddf2bfef	[X86] Add an override of targetShrinkDemandedConstant to limit the damage that shrinkdemandedbits can do to zext_in_reg operations Summary: This patch adds an implementation of targetShrinkDemandedConstant that tries to keep shrinkdemandedbits from removing bits that would otherwise have been recognized as a movzx. We still need a follow-up patch to stop moving ands across srl if the and could be represented as a movzx before the shift but not after. I think this should help with some of the cases that D42088 ended up removing during isel. Reviewers: spatel, RKSimon Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D42265 llvm-svn: 323048	2018-01-20 18:50:09 +00:00
Simon Pilgrim	89540d9665	[X86][SSE] Check for out of bounds PEXTR/PINSR indices during faux shuffle combining. llvm-svn: 323045	2018-01-20 17:16:01 +00:00
Craig Topper	08bd14803c	[X86] Teach X86 codegen to use vector width preference to avoid promoting to 512-bit types when VLX is enabled and the preference is for a smaller size. This change applies to places where we would turn 128/256-bit code into 512-bit in order to get a wider element type through sext/zext. Any 512-bit types that already existed in the IR/DAG will be left that way. The width preference has no effect on codegen behavior when the target does not have AVX512 enabled. So AVX/AVX2 codegen cannot be limited via this mechanism yet. If the preference is lower than 256 we may still use a 256 bit type to do the operation. Constraining to 128 bits makes it much more difficult to support some operations. For many of these cases we need to change element width while keeping element count constant which is easiest done by switching between 256 and 128 bit. The preference is only obeyed when AVX512 and VLX are available. This means the preference is not obeyed for KNL, but is obeyed for SKX, Cannonlake, and Icelake. For KNL, the only way to do masked operation is on 512-bit registers so we would have to completely disable masking to obey the preference. We would also lose support for gather, scatter, ctlz, vXi64 multiplies, etc. This may change in the future, but this simplifies the initial implementation. Differential Revision: https://reviews.llvm.org/D41895 llvm-svn: 323016	2018-01-20 00:26:12 +00:00
Craig Topper	b70ca5060f	[X86] Teach LowerBUILD_VECTOR to recognize pair-wise splats of 32-bit elements and use a 64-bit broadcast If we are splatting pairs of 32-bit elements, we can use a 64-bit broadcast to get the job done. We could probably do this with other sizes too, for example four 16-bit elements. Or we could broadcast pairs of 16-bit elements using a 32-bit element broadcast. But I've left that as a future improvement. I've also restricted this to AVX2 only because we can only broadcast loads under AVX. Differential Revision: https://reviews.llvm.org/D42086 llvm-svn: 322730	2018-01-17 18:58:22 +00:00
Craig Topper	279ace187a	[X86] When legalizing (v64i1 select i8, v64i1, v64i1) make sure not to introduce bitcasts to i64 in 32-bit mode We legalize selects of masks with scalar conditions using a bitcast to an integer type. But if we are in 32-bit mode we can't convert v64i1 to i64. So instead split the v64i1 to v32i1 and concat it back together. Each half will then be legalized by bitcasting to i32 which is fine. The test case is a little indirect. If we have the v64i1 select in IR it will get legalized by legalize vector ops which has a run of type legalization after it. That type legalization run is able to fix this i64 bitcast. So in order to avoid that we need a build_vector of a splat which legalize vector ops will ignore. Legalize DAG will then turn that into a select via LowerBUILD_VECTORvXi1. And the select will get legalized. In this case there is no type legalizer run to cleanup the bitcast. This fixes pr35972. llvm-svn: 322724	2018-01-17 18:46:01 +00:00
Benjamin Kramer	8d073a2c2d	[X86] Don't mutate shuffle arguments after early-out for AVX512 The match* functions have the annoying behavior of modifying their inputs. Save and restore the inputs, just in case the early out for AVX512 is hit. This is still not great and it's only a matter of time before this kind of bug happens again, but I couldn't come up with a better pattern without rewriting significant chunks of this code. Fixes PR35977. llvm-svn: 322644	2018-01-17 13:01:06 +00:00
Benjamin Kramer	05dc3527de	[X86] Constify DebugLoc parameters. No functionality change. llvm-svn: 322643	2018-01-17 13:00:58 +00:00
Craig Topper	77ba1e7c08	[X86] In LowerBUILD_VECTOR, rename ExtVT to EltVT so it makes sense. llvm-svn: 322616	2018-01-17 03:58:21 +00:00
Simon Pilgrim	3e0aafbfcc	[X86][MMX] Accept UNDEF upper bits for MOVD GR32->MMX llvm-svn: 322574	2018-01-16 17:01:31 +00:00
Simon Pilgrim	85e6139633	[X86][MMX] Improve MMX constant generation Extend the MMX zero code to take any constant with zero'd upper 32-bits llvm-svn: 322553	2018-01-16 14:21:28 +00:00
Simon Pilgrim	85bd9141ca	[X86][MMX] Add support for MMX zero vector creation As mentioned on PR35869, (and came up recently on D41517) we don't create a MMX zero register via the PXOR but instead perform a spill to stack from a XMM zero register. This patch adds support for direct MMX zero vector creation and should make it easier to add better constant vector creation in the future as well. Differential Revision: https://reviews.llvm.org/D41908 llvm-svn: 322525	2018-01-15 22:32:40 +00:00
Craig Topper	1393ccf949	[X86] Use MVT::getVectorVT instead of EVT::getVectorVT when splitting 256/512 bit build_vectors. NFC We must be creating a legal type here which means it can be an MVT. llvm-svn: 322512	2018-01-15 20:33:53 +00:00
Craig Topper	aacc622564	[X86] Generalize some code in LowerBUILD_VECTOR. NFC llvm-svn: 322511	2018-01-15 20:33:52 +00:00
Craig Topper	4f7fadd029	[X86] Remove unnecessary if statement from LowerBUILD_VECTOR. NFCI We were checking for 128, 256, or 512 bit vectors, but those are the only types that can get here. llvm-svn: 322510	2018-01-15 20:33:50 +00:00
Simon Pilgrim	9904fe77a0	[X86][SSE] Support combining MOVLHPS undef inputs llvm-svn: 322459	2018-01-14 18:50:34 +00:00
Craig Topper	b2868233b7	[X86] Use ISD::TRUNCATE instead of X86ISD::VTRUNC when input and output types have the same number of elements. llvm-svn: 322455	2018-01-14 08:11:36 +00:00
Craig Topper	57d58051bb	[X86] Add X86ISD::VTRUNC to computeKnownBitsForTargetNode. We have to take special care to avoid the cases where the result of the truncate would be padded with zero elements. Ideally we'd just use ISD::TRUNCATE for these cases instead. llvm-svn: 322454	2018-01-14 08:11:33 +00:00
Craig Topper	e9fc0cd920	[X86] Improve legalization of vXi16/vXi8 selects. Extend vXi1 conditions of vXi8/vXi16 selects even before type legalization gets a chance to split wide vectors. Previously we would only extend 128 and 256 bit vectors. But if we start with a 512 bit vector or wider that needs to be split we wouldn't extend until after the split had taken place. By extending early we improve the results of type legalization. Don't widen condition of 128/256 bit vXi16/vXi8 selects when we have BWI but not VLX. We can still use a mask register by widening the select to 512-bits instead. This is similar to what we do for compares already. llvm-svn: 322450	2018-01-14 02:05:51 +00:00
Zvi Rackover	652f9a1896	X86: Add pattern matching for PMADDWD In addition to the existing match as part of a loop-reduction, add a straightforward pattern match for DAG-contained patterns. Reviewers: RKSimon, craig.topper Subscribers: llvm-commits Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D41811 llvm-svn: 322446	2018-01-13 17:42:19 +00:00
Craig Topper	6f109f8c6c	[X86] Add DAG combine to promote vXi1 result of a vXi8/vXi16 setcc when we have AVX512 but not BWI. This avoids having the result type stick around until lowering where we have to extend the setcc and insert a truncate. If we get the types converted early we can do more to optimize it. llvm-svn: 322432	2018-01-13 06:24:46 +00:00
David L. Jones	8c87213c26	Revert r322279 due to Skylake miscompile. Summary: This revision causes Skylake (and apparently, only Skylake) codegen to fail in certain cases. Details: https://bugs.llvm.org/show_bug.cgi?id=35918 Subscribers: sanjoy, llvm-commits Differential Revision: https://reviews.llvm.org/D41972 llvm-svn: 322335	2018-01-12 00:17:38 +00:00
Craig Topper	2aac3ee5bc	[X86] Legalize 128/256 gathers/scatters on KNL by using widening rather than sign extending the index. We can just widen the vectors with undef and zero extend the mask. llvm-svn: 322308	2018-01-11 19:38:30 +00:00

5570 Commits