llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	6a6da233b9	[X86] Make LowerOperationWrapper more robust. Remove now unnecessary ReplaceAllUsesWith from LowerMSCATTER. Previously LowerOperationWrapper took the number of results from the original node and counted that many results from the new node. This was intended to drop chain operands from FP_TO_SINT lowering that uses X87 with memory operations to stack temporaries. The final load had an extra chain output that needs to be ignored. Unfortunately, it didn't work with scatter which has 2 result operands, the mask output which is discarded and a chain output. The chain output is the one that is needed but it comes second and it would be dropped by the previous logic here. To workaround this we were doing a ReplaceAllUses in the lowering code so that the generic legalization code wouldn't see any uses to replace since it had been given the wrong result/type. After this change we take the LowerOperation result directly if the original node has one result. This allows us to directly return the chain from scatter or the load data from the FP_TO_SINT case. When the original node has multiple results we'll ensure the returned node has the same number and copy them over. For cases where the original node has multiple results and the new code for some reason has even more results, MERGE_VALUES can be used to pass only the needed results. llvm-svn: 357887	2019-04-08 07:39:17 +00:00
Simon Pilgrim	07adb6abda	[X86][SSE] SimplifyDemandedBitsForTargetNode - Add initial PACKSS support In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands. This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved. Differential Revision: https://reviews.llvm.org/D60333 llvm-svn: 357859	2019-04-07 10:40:01 +00:00
Simon Pilgrim	d0a53d4914	[X86] combineBitcastvxi1 - provide dst VT and src SDValue directly. NFCI. Prep work to make it easier to reuse the BITCAST->MOVSMK combine in other cases. llvm-svn: 357847	2019-04-06 18:54:17 +00:00
Simon Pilgrim	af1cbdd3ba	Fix spelling mistake. NFCI. llvm-svn: 357843	2019-04-06 15:38:34 +00:00
Francis Visoiu Mistrih	9d9d1b6b2b	[X86] Enable tail calls for CallingConv::Swift It's currently only enabled on AArch64 (enabled in r281376). llvm-svn: 357809	2019-04-05 20:18:25 +00:00
Craig Topper	80aa2290fb	[X86] Merge the different Jcc instructions for each condition code into single instructions that store the condition code as an operand. Summary: This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes. Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser. Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon Reviewed By: RKSimon Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60228 llvm-svn: 357802	2019-04-05 19:28:09 +00:00
Piotr Sobczak	0376ac1d94	[SelectionDAG] Compute known bits of CopyFromReg Summary: Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if the virtual reg used has one def only. This can be particularly useful when calling isBaseWithConstantOffset() with the ISD::CopyFromReg argument, as more optimizations may get enabled in the result. Also add a missing truncation on X86, found by testing of this patch. Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa Reviewers: bogner, craig.topper, RKSimon Reviewed By: RKSimon Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D59535 llvm-svn: 357745	2019-04-05 07:44:09 +00:00
Craig Topper	94f1772b1e	[X86] Promote i16 SRA instructions to i32 We already promote SRL and SHL to i32. This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions. I think there might be some DAG combine improvement opportunities in the test changes here. Differential Revision: https://reviews.llvm.org/D60278 llvm-svn: 357743	2019-04-05 06:32:50 +00:00
Evandro Menezes	85bd3978ae	[IR] Refactor attribute methods in Function class (NFC) Rename the functions that query the optimization kind attributes. Differential revision: https://reviews.llvm.org/D60287 llvm-svn: 357731	2019-04-04 22:40:06 +00:00
James Y Knight	a040174418	Revert [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs It unnecessarily breaks previously-working code which used varargs, but didn't pass any float/double arguments (such as EDK2). Also revert the fixup on top of that: Revert [X86] Fix a test from r357317 This reverts r357317 (git commit `d413f41de6`) This reverts r357380 (git commit `7af32444b9`) llvm-svn: 357718	2019-04-04 19:05:48 +00:00
Sanjay Patel	17648b848e	[x86] eliminate unnecessary broadcast of horizontal op This is another pattern that comes up if we more aggressively scalarize FP ops. llvm-svn: 357703	2019-04-04 14:46:13 +00:00
Craig Topper	051bd16faf	[X86] Remove CustomInserters for RDPKRU/WRPKRU. Use some custom lowering and new ISD opcodes instead. These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers. This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes that take the extra zeroes as inputs. The zeros will then go through isel on their own to select the MOV32r0 pseudo. Then we just need to mention the physical registers directly in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary copies to/from physical registers. llvm-svn: 357659	2019-04-04 00:28:49 +00:00
Craig Topper	52cac4b79f	[X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom instruction selection instead. This custom inserter existed so we could do a weird thing where we pretended that the instructions support a full address mode instead of taking a pointer in EAX/RAX. I think was largely so we could be pointer size agnostic in the isel pattern. To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV instruction. With this change we now just do custom selection during isel instead and just assign the incoming address of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take care of selecting an LEA or other operation to do any address computation needed in this basic block. I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit use is properly sized based on the pointer. Without this we get comments in the assembly output about killing eax and defing rax or vice versa depending on whether we define the instruction to use EAX/RAX. llvm-svn: 357652	2019-04-03 23:28:30 +00:00
Sanjay Patel	c9a012e4ea	[x86] fold shuffles of h-ops that have an undef operand If an operand is undef, we can assume it's the same as the other operand. llvm-svn: 357644	2019-04-03 22:40:35 +00:00
Sanjay Patel	61b5e3c6a9	[x86] eliminate movddup of horizontal op This pattern would show up as a regression if we more aggressively convert vector FP ops to scalar ops. There's still a missed optimization for the v4f64 legal case (AVX) because we create that h-op with an undef operand. We should probably just duplicate the operands for that pattern to avoid trouble. llvm-svn: 357642	2019-04-03 22:15:29 +00:00
Krzysztof Parzyszek	4841643a1d	[X86] Extend boolean arguments to inline-asm according to getBooleanType Differential Revision: https://reviews.llvm.org/D60208 llvm-svn: 357615	2019-04-03 17:43:14 +00:00
Simon Pilgrim	15919ad306	[X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 reduction on AVX1 Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result. llvm-svn: 357611	2019-04-03 17:28:34 +00:00
Simon Pilgrim	9e28dddf55	[X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction on AVX1 Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets. llvm-svn: 357608	2019-04-03 17:17:13 +00:00
Craig Topper	2e1bf89e3a	[X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics. This is the appropriate opcode for only having a chain output. Though I'm not sure it matters much. llvm-svn: 357375	2019-04-01 05:26:12 +00:00
Sanjay Patel	e1bc360fc6	[x86] allow movmsk with 2-element reductions One motivation for making this change is that the lack of using movmsk is likely a main source of perf difference between clang and gcc on the C-Ray benchmark as shown here: https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5 ...but this change alone isn't enough to solve that problem. The 'all-of' examples show what is likely the worst case trade-off: we end up with an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of' examples look clearly better using movmsk because we've traded 2 vector instructions for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'. If we examine the llvm-mca output for these cases, it appears that even though the 'all-of' movmsk variant looks worse on paper, it would perform better on both Haswell and Jaguar. $ llvm-mca -mcpu=haswell no_movmsk.s -timeline Iterations: 100 Instructions: 400 Total Cycles: 504 Total uOps: 400 Dispatch Width: 4 uOps Per Cycle: 0.79 IPC: 0.79 Block RThroughput: 1.0 $ llvm-mca -mcpu=haswell movmsk.s -timeline Iterations: 100 Instructions: 600 Total Cycles: 358 Total uOps: 600 Dispatch Width: 4 uOps Per Cycle: 1.68 IPC: 1.68 Block RThroughput: 1.5 $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline Iterations: 100 Instructions: 400 Total Cycles: 407 Total uOps: 400 Dispatch Width: 2 uOps Per Cycle: 0.98 IPC: 0.98 Block RThroughput: 2.0 $ llvm-mca -mcpu=btver2 movmsk.s -timeline Iterations: 100 Instructions: 600 Total Cycles: 311 Total uOps: 600 Dispatch Width: 2 uOps Per Cycle: 1.93 IPC: 1.93 Block RThroughput: 3.0 Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if that's true, then we're also almost certainly making the wrong transform already for reductions with >2 elements, so that should be fixed independently. Differential Revision: https://reviews.llvm.org/D59997 llvm-svn: 357367	2019-03-31 15:11:34 +00:00
Simon Pilgrim	10c9032c02	[X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316) Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits. llvm-svn: 357351	2019-03-30 17:12:29 +00:00
Simon Pilgrim	3293455595	[X86][SSE] detectAVGPattern - begin generalizing ADD matches Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316 llvm-svn: 357349	2019-03-30 15:31:53 +00:00
Amara Emerson	d413f41de6	[X86] When using Win64 ABI, exit with error if SSE is disabled for varargs We need XMM registers to handle varargs with the Win64 ABI. Before we would silently generate bad code resulting in an assertion failure elsewhere in the backend. llvm-svn: 357317	2019-03-29 21:30:51 +00:00
Simon Pilgrim	aeaf7fcdde	[X86] Add X86TargetLowering::isCommutativeBinOp override. We currently just have test coverage for PMULUDQ - will add more in the future. llvm-svn: 357244	2019-03-29 11:25:58 +00:00
Sanjay Patel	5bbf6f0bd8	[x86] avoid cmov in movmsk reduction This is probably the least important of our movmsk problems, but I'm starting at the bottom to reduce distractions. We were creating a select_cc which bypasses the select and bitmask codegen optimizations that we have now. If we produce a compare+negate instead, we allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov. There's no partial register update danger in these sequences because we always produce the zero-register xor ahead of the 'set' if needed. There seems to be a missing fold for sext of a bool bit here: negl %ecx movslq %ecx, %rax ...but that's an independent transform. Differential Revision: https://reviews.llvm.org/D59818 llvm-svn: 357172	2019-03-28 14:16:13 +00:00
Sanjay Patel	1df0bb6264	[x86] improve AVX lowering of vector zext If we know the 2 halves of an oversized zext-in-reg are the same, don't create those halves independently. I tried several different approaches to fold this, but it's difficult to get right during legalization. In the default path, we are creating a generic shuffle that looks like an unpack high, but it can get transformed into a different mask (a blend), so it's not straightforward to match that. If we try to fold after it actually becomes an X86ISD::UNPCKH node, we can't be sure what the operand node is - it might be a generic shuffle, or it could be some x86-specific op. From the test output, we should be doing something like this for SSE4.1 as well, but I'd rather leave that as a follow-up since it involves changing lowering actions. Differential Revision: https://reviews.llvm.org/D59777 llvm-svn: 357129	2019-03-27 22:42:11 +00:00
Sanjay Patel	704817912a	[x86] look through bitcast operand of MOVMSK This is not exactly NFC because it should make further combines of MOVMSK easier to match, but there should be no outward differences because we have isel patterns in place specifically to allow this. See: // Also support integer VTs to avoid a int->fp bitcast in the DAG. llvm-svn: 357128	2019-03-27 22:24:03 +00:00
Simon Pilgrim	ccb71b2985	Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685) Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure. This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain. ........ Causes PR41249 llvm-svn: 357057	2019-03-27 10:25:02 +00:00
Simon Pilgrim	87d4ab8b92	[X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685) Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure. This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain. llvm-svn: 356864	2019-03-24 19:06:35 +00:00
Simon Pilgrim	a71c0ed471	[X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685) Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets). Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers. llvm-svn: 356858	2019-03-24 16:30:35 +00:00
Sanjay Patel	7d676dfd86	[x86] improve the default expansion of uaddsat/usubsat This is yet another step towards solving PR14613: https://bugs.llvm.org/show_bug.cgi?id=14613 uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y usubsat X, Y --> (X >u Y) ? X - Y : 0 We can't count on a sane vector ISA, so override the default (umin/umax) expansion of unsigned add/sub saturate in cases where we do not have umin/umax. Differential Revision: https://reviews.llvm.org/D59006 llvm-svn: 356855	2019-03-24 13:55:54 +00:00
Sanjay Patel	2e92846d36	[x86] reduce code duplication; NFC llvm-svn: 356836	2019-03-23 15:00:52 +00:00
Craig Topper	ce1ed55a4a	[X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences. This mitigates some of PR41151 for the i64 on i686 case when we have SSE2. Differential Revision: https://reviews.llvm.org/D59662 llvm-svn: 356808	2019-03-22 20:47:02 +00:00
Craig Topper	1ffd8e8114	[X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enable We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead. I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer. So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag? I've left some FIXMEs for improvements we can do without SSE2. Differential Revision: https://reviews.llvm.org/D59679 llvm-svn: 356807	2019-03-22 20:46:56 +00:00
Simon Pilgrim	564392d752	[X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203) llvm-svn: 356784	2019-03-22 17:23:55 +00:00
Craig Topper	b3bad3dce3	[X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI For the future day when the pointer's don't have element types, we shoudl just use the type of the load result instead. llvm-svn: 356721	2019-03-21 21:37:18 +00:00
Simon Pilgrim	c2e4405475	[X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers We don't use X86ISD::ANDNP for mask registers. Test case from @craig.topper (Craig Topper) llvm-svn: 356696	2019-03-21 18:32:38 +00:00
Craig Topper	8d46403b8e	[X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set. CMPXCHG8B was introduced on i586/pentium generation. If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG. Differential Revision: https://reviews.llvm.org/D59576 llvm-svn: 356631	2019-03-20 23:35:49 +00:00
Craig Topper	0367553304	[X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend. This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before falling back to move immedate, GPR to k-register, and masked op. I had to make some changes to support v8i64 when i64 is not a legal type. And to support floating point types. This trades a load for the move immediate and GPR move which is higher latency. But its probably better for register pressure not having to hop through other register classes. The load+and should play better with LICM and rematerialization I think. Differential Revision: https://reviews.llvm.org/D59479 llvm-svn: 356618	2019-03-20 21:30:20 +00:00
Simon Pilgrim	2acca37a2d	[X86] Use getConstantOperandAPInt to detect out-of-range shifts. llvm-svn: 356549	2019-03-20 11:41:52 +00:00
Andrea Di Biagio	624f5deff4	[X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI This patch removes the following dag node opcodes from namespace X86ISD: RDTSC_DAG, RDTSCP_DAG, RDPMC_DAG The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The only differences are: RDTSC/RDTSCP don't implicitly read ECX. RDTSCP also implicitly writes ECX. I moved the common expansion logic into a helper function with the goal to get rid of code repetition. That helper is now used for the expansion of RDTSC/RDTSCP/RDPMC/XGETBV intrinsics. No functional change intended. Differential Revision: https://reviews.llvm.org/D59547 llvm-svn: 356546	2019-03-20 11:21:15 +00:00
Simon Pilgrim	e744f513c4	[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed. llvm-svn: 356483	2019-03-19 17:23:25 +00:00
Jordan Rupprecht	f74d45a775	[NFC] Fix unused variable in release builds This was introduced in rL356468. llvm-svn: 356477	2019-03-19 16:52:40 +00:00
Simon Pilgrim	a56f2822d0	[SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect These changes are related to PR37743 and include: SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node. Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner. Add promoting the integer ABS node in the LegalizeIntegerType. Expand-based legalization of integer result for the ABS nodes. Expand-based legalization of ABS vector operations. Add some integer abs testcases for different typesizes for Thumb arch Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to: tmp = (SRA, Hi, 31) Lo = (UADDO tmp, Lo) Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1)) Lo = (XOR tmp, Lo) The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern: (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))). Change integer abs testcases for codegen with the ABS node support for AArch64. Indicate that the ABS is legal for the i64 type when the NEON is supported. Change the integer abs testcases to show changing of codegen. Add combine and legalization of ABS nodes for Thumb arch. Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition. For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743 Patch by: @ikulagin (Ivan Kulagin) Differential Revision: https://reviews.llvm.org/D49837 llvm-svn: 356468	2019-03-19 16:24:55 +00:00
Adhemerval Zanella	664c1ef528	[TargetLowering] Add code size information on isFPImmLegal. NFC This allows better code size for aarch64 floating point materialization in a future patch. Reviewers: evandro Differential Revision: https://reviews.llvm.org/D58690 llvm-svn: 356389	2019-03-18 18:40:07 +00:00
Simon Pilgrim	f2c53b5d6c	[X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes. Replaces existing i1-only fold. llvm-svn: 356325	2019-03-16 15:02:00 +00:00
Simon Pilgrim	0f472e1d01	[X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW Improved constant folding for PEXTRB/PEXTRW will be added in a future commit llvm-svn: 356324	2019-03-16 14:29:50 +00:00
Roman Lebedev	9f37790608	[X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's Summary: As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780 If we have `sext (trunc (cmov C0, C1) to i8)`, we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))` Reviewers: craig.topper, andreadb, RKSimon Reviewed By: craig.topper Subscribers: llvm-commits, andreadb Tags: #llvm Differential Revision: https://reviews.llvm.org/D59412 llvm-svn: 356301	2019-03-15 21:18:05 +00:00
Roman Lebedev	b6e376ddfa	[X86] Promote i8 CMOV's (PR40965) Summary: @mclow.lists brought up this issue up in IRC, it came up during implementation of libc++ `std::midpoint()` implementation (D59099) https://godbolt.org/z/oLrHBP Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`. This differential proposes to always promote i8 CMOV. There are several concerns here: * Is this actually more performant, or is it just the ASM that looks cuter? * Does this result in partial register stalls? * What about branch predictor? # Indeed, performance should be the main point here. Let's look at a simple microbenchmark: {F8412076} ``` #include "benchmark/benchmark.h" #include <algorithm> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> // Future preliminary libc++ code, from Marshall Clow. namespace std { template <class _Tp> __inline _Tp midpoint(_Tp __a, _Tp __b) noexcept { using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type; int __sign = 1; _Up __m = __a; _Up __M = __b; if (__a > __b) { __sign = -1; __m = __b; __M = __a; } return __a + __sign * _Tp(_Up(__M - __m) >> 1); } } // namespace std template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct RandRand { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { return std::make_pair(getVectorOfRandomNumbers<T>(count), getVectorOfRandomNumbers<T>(count)); } }; struct ZeroRand { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { return std::make_pair(std::vector<T>(count, T(0)), getVectorOfRandomNumbers<T>(count)); } }; template <class T, class Gen> void BM_StdMidpoint(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { for (size_t i = 0; i < Length; i++) { const auto calculated = std::midpoint(a[i], b[i]); benchmark::DoNotOptimize(calculated); } } state.SetComplexityN(Length); state.counters["midpoints"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["midpoints/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = 2 * 1024 * 1024; // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /total bufs/ 2 / /maximal elt size/ sizeof(T) / /safety margin/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } // Both of the values are random. // The comparison is unpredictable. BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand) ->Apply(CustomArguments<int32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand) ->Apply(CustomArguments<int64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand) ->Apply(CustomArguments<int16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand) ->Apply(CustomArguments<int8_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand) ->Apply(CustomArguments<uint8_t>); // One value is always zero, and another is bigger or equal than zero. // The comparison is predictable. BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand) ->Apply(CustomArguments<uint8_t>); ``` ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm 2019-03-06 21:53:31 Running ./llvm-cmov-bench-OLD Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.78, 1.81, 1.36 ---------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters<...> ---------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 300398 ns 300404 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint32_t, RandRand>/131072 300433 ns 300433 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<int64_t, RandRand>/65536 169857 ns 169858 ns 4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s BM_StdMidpoint<int64_t, RandRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint64_t, RandRand>/65536 169770 ns 169771 ns 4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s BM_StdMidpoint<uint64_t, RandRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<int16_t, RandRand>/262144 591169 ns 591179 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s BM_StdMidpoint<int16_t, RandRand>_BigO 2.25 N 2.25 N BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint16_t, RandRand>/262144 591264 ns 591274 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s BM_StdMidpoint<uint16_t, RandRand>_BigO 2.25 N 2.25 N BM_StdMidpoint<uint16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<int8_t, RandRand>/524288 2983669 ns 2983689 ns 235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s BM_StdMidpoint<int8_t, RandRand>_BigO 5.69 N 5.69 N BM_StdMidpoint<int8_t, RandRand>_RMS 0 % 0 % <...> BM_StdMidpoint<uint8_t, RandRand>/524288 2668398 ns 2668419 ns 262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s BM_StdMidpoint<uint8_t, RandRand>_BigO 5.09 N 5.09 N BM_StdMidpoint<uint8_t, RandRand>_RMS 0 % 0 % <...> BM_StdMidpoint<uint32_t, ZeroRand>/131072 300887 ns 300887 ns 2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, ZeroRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 169634 ns 169634 ns 4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 592252 ns 592255 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 987295 ns 987309 ns 711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s BM_StdMidpoint<uint8_t, ZeroRand>_BigO 1.88 N 1.88 N BM_StdMidpoint<uint8_t, ZeroRand>_RMS 1 % 1 % RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW 2019-03-06 21:56:58 Running ./llvm-cmov-bench-NEW Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.17, 1.46, 1.30 ---------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters<...> ---------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 300878 ns 300880 ns 2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint32_t, RandRand>/131072 300231 ns 300226 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<int64_t, RandRand>/65536 170819 ns 170777 ns 4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s BM_StdMidpoint<int64_t, RandRand>_BigO 2.60 N 2.60 N BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint64_t, RandRand>/65536 171705 ns 171708 ns 4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s BM_StdMidpoint<uint64_t, RandRand>_BigO 2.62 N 2.62 N BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<int16_t, RandRand>/262144 592510 ns 592516 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s BM_StdMidpoint<int16_t, RandRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint16_t, RandRand>/262144 614823 ns 614823 ns 1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s BM_StdMidpoint<uint16_t, RandRand>_BigO 2.33 N 2.33 N BM_StdMidpoint<uint16_t, RandRand>_RMS 4 % 4 % <...> BM_StdMidpoint<int8_t, RandRand>/524288 1073181 ns 1073201 ns 650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s BM_StdMidpoint<int8_t, RandRand>_BigO 2.05 N 2.05 N BM_StdMidpoint<int8_t, RandRand>_RMS 1 % 1 % BM_StdMidpoint<uint8_t, RandRand>/524288 1071010 ns 1071020 ns 653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s BM_StdMidpoint<uint8_t, RandRand>_BigO 2.05 N 2.05 N BM_StdMidpoint<uint8_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint32_t, ZeroRand>/131072 300413 ns 300416 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, ZeroRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 169667 ns 169669 ns 4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 591396 ns 591404 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 1069421 ns 1069413 ns 655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s BM_StdMidpoint<uint8_t, ZeroRand>_BigO 2.04 N 2.04 N BM_StdMidpoint<uint8_t, ZeroRand>_RMS 0 % 0 % Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 +0.0016 +0.0016 300398 300878 300404 300880 <...> BM_StdMidpoint<uint32_t, RandRand>/131072 -0.0007 -0.0007 300433 300231 300433 300226 <...> BM_StdMidpoint<int64_t, RandRand>/65536 +0.0057 +0.0054 169857 170819 169858 170777 <...> BM_StdMidpoint<uint64_t, RandRand>/65536 +0.0114 +0.0114 169770 171705 169771 171708 <...> BM_StdMidpoint<int16_t, RandRand>/262144 +0.0023 +0.0023 591169 592510 591179 592516 <...> BM_StdMidpoint<uint16_t, RandRand>/262144 +0.0398 +0.0398 591264 614823 591274 614823 <...> BM_StdMidpoint<int8_t, RandRand>/524288 -0.6403 -0.6403 2983669 1073181 2983689 1073201 <...> BM_StdMidpoint<uint8_t, RandRand>/524288 -0.5986 -0.5986 2668398 1071010 2668419 1071020 <...> BM_StdMidpoint<uint32_t, ZeroRand>/131072 -0.0016 -0.0016 300887 300413 300887 300416 <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 +0.0002 +0.0002 169634 169667 169634 169669 <...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 -0.0014 -0.0014 592252 591396 592255 591404 <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 +0.0832 +0.0832 987295 1069421 987309 1069413 ``` What can we tell from the benchmark? * `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance. * All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case. That is because there we are computing mid point between zero and some random number, thus if the branch predictor is in use, it is in optimal situation. * Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%. # What about branch predictor? * `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`, which may mean that well-predicted branch is better than `cmov`. * Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`, `cmov` is up to +10% worse than well-predicted branch. * However, i do not believe this is a concern. If the branch is well predicted, then the PGO will also say that it is well predicted, and LLVM will happily expand cmov back into branch: https://godbolt.org/z/P5ufig # What about partial register stalls? I'm not really able to answer that. What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch) in ~50% of cases you will have to pay branch misprediction penalty. ``` $ grep -i MispredictPenalty X86Sched*.td X86SchedBroadwell.td: let MispredictPenalty = 16; X86SchedHaswell.td: let MispredictPenalty = 16; X86SchedSandyBridge.td: let MispredictPenalty = 16; X86SchedSkylakeClient.td: let MispredictPenalty = 14; X86SchedSkylakeServer.td: let MispredictPenalty = 14; X86ScheduleBdVer2.td: let MispredictPenalty = 20; // Minimum branch misdirection penalty. X86ScheduleBtVer2.td: let MispredictPenalty = 14; // Minimum branch misdirection penalty X86ScheduleSLM.td: let MispredictPenalty = 10; X86ScheduleZnver1.td: let MispredictPenalty = 17; ``` .. which it can be as small as 10 cycles and as large as 20 cycles. Partial register stalls do not seem to be an issue for AMD CPU's. For intel CPU's, they should be around ~5 cycles? Is that actually an issue here? I'm not sure. In short, i'd say this is an improvement, at least on this microbenchmark. Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 \| PR40965 ]]. Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic Reviewed By: craig.topper, andreadb Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists Tags: #llvm, #libc Differential Revision: https://reviews.llvm.org/D59035 llvm-svn: 356300	2019-03-15 21:17:53 +00:00
Craig Topper	af856db961	[X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table. The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen. Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable. Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table. llvm-svn: 356294	2019-03-15 19:59:35 +00:00
Simon Pilgrim	d33e62c826	[X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x))) Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits. llvm-svn: 356292	2019-03-15 19:14:28 +00:00
Simon Pilgrim	65165d54bb	[X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW llvm-svn: 356270	2019-03-15 16:16:49 +00:00
Simon Pilgrim	0ad17402a9	[X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm. Prep work for PR40203 llvm-svn: 356249	2019-03-15 11:05:42 +00:00
Sanjay Patel	5d1df114e8	[x86] prevent infinite looping from vselect commutation (PR41066) This is an immediate fix for: https://bugs.llvm.org/show_bug.cgi?id=41066 ...but as noted there and the code comments, we should do better by stubbing this out sooner. llvm-svn: 356158	2019-03-14 15:32:34 +00:00
Craig Topper	84abec2855	[X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b() The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well. We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass. I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue. Differential Revision: https://reviews.llvm.org/D59308 llvm-svn: 356078	2019-03-13 18:48:50 +00:00
Simon Pilgrim	bef4fe056d	[X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode llvm-svn: 356067	2019-03-13 17:00:18 +00:00
Simon Pilgrim	d9aa879b67	[X86][AVX] Add combineConcatVectors support to improve subvector handling Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization. This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time. The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit. llvm-svn: 356064	2019-03-13 16:37:30 +00:00
Sanjay Patel	0a251e4076	[x86] limit extractelement of setcc to pre-legalization A fuzzer found the crasher: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700 The bug was introduced recently here: rL355741 This is the quick fix. If we need to do this transform later, then we'd have to extend/truncate the vector setcc element type to the scalar setcc type (i8). llvm-svn: 356053	2019-03-13 14:49:52 +00:00
Simon Pilgrim	0c1e5aacd3	Fix signed/unsigned mismatch warning. NFCI. llvm-svn: 356046	2019-03-13 13:14:14 +00:00
Simon Pilgrim	7abbd70300	[X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false. This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on. Differential Revision: https://reviews.llvm.org/D58888 llvm-svn: 356043	2019-03-13 12:20:39 +00:00
Sanjay Patel	737c27a9cd	[x86] scalarize extractelement 0 of FP vselect llvm-svn: 355955	2019-03-12 19:20:45 +00:00
Craig Topper	f19d6a4073	[X86] Add SCALAR_SINT_TO_FP/SCALAR_UINT_TO_FP ISD opcodes without rounding mode. After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those. llvm-svn: 355807	2019-03-11 04:37:01 +00:00
Craig Topper	ecbc141dbf	[X86] Split SCALEF(S) ISD opcodes into a version without rounding mode. llvm-svn: 355806	2019-03-11 04:36:59 +00:00
Craig Topper	a0b5338834	[X86] Split RCP28/RSQRT/GETEXP/EXP2 ISD opcodes into SAE and current direction nodes. Remove rounding mode operand. llvm-svn: 355805	2019-03-11 04:36:57 +00:00
Craig Topper	ba7d654526	[X86] Rename _RND versions of RANGE/REDUCE/GETMANT/RDNSCALE ISD opcodes to _SAE. Remove SAE operand. No need to explicitly store it and match it during isel. llvm-svn: 355804	2019-03-11 04:36:55 +00:00
Craig Topper	244ffcdf0d	[X86] Rename X86ISD::CVTPH2PS_RND to CVTPH2PS_SAE. Remove SAE operand. llvm-svn: 355803	2019-03-11 04:36:53 +00:00
Craig Topper	6059b1737e	[X86] Rename the CVTT*_RND ISD nodes to _SAE and remove the SAE operand. Split VFPROUNDS_RND/VFPEXT(S)_RND into versions without rounding operand. For VFPEXT(S) we only need current rounding mode and an SAE version. Neither need extra operand. llvm-svn: 355802	2019-03-11 04:36:51 +00:00
Craig Topper	4c544ca993	[X86] Rename X86ISD::CMPM_RND and X86ISD::FSETCCM_RND to _SAE instead of _RND. Remove rounding operand. The operand could only be the SAE encoding so no need to include it. llvm-svn: 355801	2019-03-11 04:36:49 +00:00
Craig Topper	704303a2a1	[X86] Split the VFIXUPIMM/VFIXUPIMMS nodes into a current rounding mode and SAE ISD opcode. Remove matching of FROUND_CURRENT and FROUND_NO_EXC for these nodes from isel table. llvm-svn: 355800	2019-03-11 04:36:47 +00:00
Craig Topper	b7e6bfe579	[X86] Begin removing matching of FROUND_CURRENT and FROUND_NO_EXC from isel tables. Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC. This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS. llvm-svn: 355799	2019-03-11 04:36:44 +00:00
Sanjay Patel	26e06e859e	[x86] add x86-specific opcodes to extractelement scalarization list llvm-svn: 355792	2019-03-10 18:56:21 +00:00
Craig Topper	66c9690ad6	[X86] Remove unused variable. NFC llvm-svn: 355790	2019-03-10 17:36:41 +00:00
Craig Topper	93e15dfacc	[X86] Make lowering of intrinsics with rounding mode stricter so that only valid rounding modes are lowered. Update tests accordingly Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates. With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure. llvm-svn: 355789	2019-03-10 17:20:45 +00:00
Craig Topper	0dc8c52d4e	[X86] Remove dead code from the handler for INTR_TYPE_SCALAR_MASK_RM. The code in here handles nodes with 6 or 7 operands. But only the 6 operand case is ever used these days. llvm-svn: 355788	2019-03-10 17:20:42 +00:00
Sanjay Patel	f84083b4db	[x86] scalarize extract element 0 of FP cmp An extension of D58282 noted in PR39665: https://bugs.llvm.org/show_bug.cgi?id=39665 This doesn't answer the request to use movmsk, but that's an independent problem. We need this and probably still need scalarization of FP selects because we can't do that as a target-independent transform (although it seems likely that targets besides x86 should have this transform). llvm-svn: 355741	2019-03-08 21:54:41 +00:00
Sanjay Patel	b22f438df3	[x86] prevent infinite looping from inverse shuffle transforms llvm-svn: 355713	2019-03-08 19:20:28 +00:00
Craig Topper	3acc4236b8	[X86] Enable combineFMinNumFMaxNum for 512 bit vectors when AVX512 is enabled. Simplified by just checking if the vector type is legal rather than listing all combinations of types and features. Fixes PR40984. llvm-svn: 355582	2019-03-07 06:30:19 +00:00
Simon Pilgrim	9d6347cfc1	[DAGCombine] Improve select (not Cond), N1, N2 -> select Cond, N2, N1 fold Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well. Requested by @spatel on D59006 llvm-svn: 355533	2019-03-06 18:52:52 +00:00
Simon Pilgrim	468bb2e601	[X86][SSE] VSELECT(XOR(Cond,-1), LHS, RHS) --> VSELECT(Cond, RHS, LHS) As noticed on D58965 DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point. Differential Revision: https://reviews.llvm.org/D58974 llvm-svn: 355494	2019-03-06 10:54:43 +00:00
Jeremy Morse	09d8ea5282	[X86] Avoid codegen changes when DBG_VALUE appears between lowered selects X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo instructions without accounting for debug intrinsics. This leads to different codegen with and without option -g, if a DBG_VALUE instruction lands in the middle of several lowered selects. Work around this by skipping over debug instructions when looking for CMOV sequences, and sinking those debug insts into the EmitLoweredSelect sunk block. This might slightly shift where variables appear in the instruction sequence, but won't re-order assignments. Differential Revision: https://reviews.llvm.org/D58672 llvm-svn: 355307	2019-03-04 10:56:02 +00:00
Simon Pilgrim	e48be5d698	Remove unused variable. NFCI. llvm-svn: 355289	2019-03-03 14:23:07 +00:00
Simon Pilgrim	d8e91a54c0	[X86] getShuffleScalarElt - peek through insert/extract subvector nodes. llvm-svn: 355288	2019-03-03 14:11:05 +00:00
Simon Pilgrim	11149ea433	[X86] Pull out combineToConsecutiveLoads helper. NFCI. llvm-svn: 355287	2019-03-03 13:53:27 +00:00
Amaury Sechet	f24abf6511	[X86] Improve use of SHLD/SHRD Summary: This extends the variety of pattern that can generate a SHLD instead of using two shifts. This fixes a regression that would be introduced by D57367 or D33587 Reviewers: RKSimon, craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D57389 llvm-svn: 355260	2019-03-02 02:44:16 +00:00
Sanjay Patel	7fc6ef7dd7	[x86] scalarize extract element 0 of FP math This is another step towards ensuring that we produce the optimal code for reductions, but there are other potential benefits as seen in the tests diffs: 1. Memory loads may get scalarized resulting in more efficient code. 2. Memory stores may get scalarized resulting in more efficient code. 3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch. 4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch. The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions. Differential Revision: https://reviews.llvm.org/D58282 llvm-svn: 355130	2019-02-28 19:47:04 +00:00
Craig Topper	38427c47b9	[X86] Don't peek through bitcasts before checking ISD::isBuildVectorOfConstantSDNodes in combineTruncatedArithmetic We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop. Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding. Differential Revision: https://reviews.llvm.org/D58705 llvm-svn: 355116	2019-02-28 18:49:29 +00:00
Bjorn Pettersson	d30f308a9f	Add support for computing "zext of value" in KnownBits. NFCI Summary: The description of KnownBits::zext() and KnownBits::zextOrTrunc() has confusingly been telling that the operation is equivalent to zero extending the value we're tracking. That has not been true, instead the user has been forced to explicitly set the extended bits as known zero afterwards. This patch adds a second argument to KnownBits::zext() and KnownBits::zextOrTrunc() to control if the extended bits should be considered as known zero or as unknown. Reviewers: craig.topper, RKSimon Reviewed By: RKSimon Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58650 llvm-svn: 355099	2019-02-28 15:45:29 +00:00
Simon Pilgrim	134bc19079	[X86][AVX] Remove superfluous insert_subvector(zero, bitcast(x)) -> bitcast(insert_subvector(zero, x)) fold This is caught by other existing bitcast folds. llvm-svn: 355084	2019-02-28 11:39:52 +00:00
Simon Pilgrim	87aeff8bbb	[X86][AVX] Fold vf64 concat_vectors(movddup(x),movddup(x)) -> broadcast(x) llvm-svn: 355078	2019-02-28 10:53:58 +00:00
Simon Pilgrim	1001a6ab03	[X86][AVX] Pull out some INSERT_SUBVECTOR combines into a combineConcatVectorOps helper. NFCI A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node. I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered. Differential Revision: https://reviews.llvm.org/D58583 llvm-svn: 355015	2019-02-27 18:46:32 +00:00
Simon Pilgrim	71bb6850cf	[X86][AVX] Only combine loads to broadcasts for legal types Thanks to @echristo for spotting this. llvm-svn: 354961	2019-02-27 11:17:25 +00:00
Reid Kleckner	2f055f026a	[X86] Fix bug in x86_intrcc with arg copy elision Summary: Use a custom calling convention handler for interrupts instead of fixing up the locations in LowerMemArgument. This way, the offsets are correct when constructed and we don't need to account for them in as many places. Depends on D56883 Replaces D56275 Reviewers: craig.topper, phil-opp Subscribers: hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D56944 llvm-svn: 354837	2019-02-26 02:11:25 +00:00
Simon Pilgrim	c61f1e8e6c	[X86] Merge ISD::ADD/SUB nodes into X86ISD::ADD/SUB equivalents (PR40483) Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results. Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB. Differential Revision: https://reviews.llvm.org/D58597 llvm-svn: 354771	2019-02-25 11:19:37 +00:00
Simon Pilgrim	cfaf663a35	[X86] Combine zext(packus(x),packus(y)) -> concat(x,y) (PR39637) Its proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step. llvm-svn: 354757	2019-02-24 19:57:52 +00:00
Craig Topper	be3348573e	[LegalizeTypes][AArch64][X86] Make type legalization of vector (S/U)ADD/SUB/MULO follow getSetCCResultType for the overflow bits. Make UnrollVectorOverflowOp properly convert from scalar boolean contents to vector boolean contents Summary: When promoting the over flow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it. We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put a the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used. Reviewers: spatel, RKSimon, nikic Reviewed By: nikic Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58567 llvm-svn: 354753	2019-02-24 19:23:36 +00:00
Simon Pilgrim	4f4f9abdfa	[X86][AVX] Rename lowerShuffleByMerging128BitLanes to lowerShuffleAsLanePermuteAndRepeatedMask. NFC. Name better matches the other similar 'lane permute' and 'repeated mask' functions we have. llvm-svn: 354749	2019-02-24 17:30:06 +00:00
Craig Topper	be9eeb5526	Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" And its follow ups r354511, r354640. A follow patch will fix the issue that caused it to be reverted. llvm-svn: 354737	2019-02-23 21:41:42 +00:00
Craig Topper	ccc860cb81	Recommit r354647 and r354648 "[LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract" r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit." These were reverted in r354713 as their context depended on other patches that were reverted for a bug. llvm-svn: 354734	2019-02-23 19:51:32 +00:00
Simon Pilgrim	f383a47b7d	[X86][AVX] combineInsertSubvector - remove concat_vectors(load(x),load(x)) --> sub_vbroadcast(x) D58053/rL354340 added this to EltsFromConsecutiveLoads directly llvm-svn: 354732	2019-02-23 18:53:03 +00:00
Simon Pilgrim	e08f177ea2	[X86][AVX] concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) --> broadcast(x) For AVX1, limit this to i32/f32/i64/f64 loading cases only. llvm-svn: 354730	2019-02-23 18:34:05 +00:00
Simon Pilgrim	31793733a0	[X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD. llvm-svn: 354729	2019-02-23 17:10:47 +00:00
Reid Kleckner	e3876637cf	Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" r354363 caused https://crbug.com/934963#c1, which has a plain C reduced test case. I also had to revert some dependent changes: - r354648 - r354647 - r354640 - r354511 llvm-svn: 354713	2019-02-23 01:19:42 +00:00
Craig Topper	a9697f24cf	[X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalize to 128-bits. If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register. Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization. llvm-svn: 354709	2019-02-23 00:35:02 +00:00
Sanjay Patel	a9e289174a	[x86] allow narrowing of vector UINT_TO_FP As discussed in: D56864 D58197 Always use the narrow (128-bit) instruction when possible. We already had the signed int version of this transform. llvm-svn: 354675	2019-02-22 15:47:45 +00:00
Sanjay Patel	1baf7896cc	[x86] simplify code in combineExtractSubvector; NFC Only the 1st fold is attempted pre-legalization, but it requires legal (simple) types too, so we don't need an EVT in any of the code. llvm-svn: 354674	2019-02-22 15:28:22 +00:00
Craig Topper	3a391fc0e8	[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit. Type legalization is causing two nodes to be created here, but we can use a single node to extend from v8i16 to v2i64. llvm-svn: 354648	2019-02-22 01:49:53 +00:00
Sanjay Patel	234a5e8ea4	[x86] vectorize more cast ops in lowering to avoid register file transfers This is a follow-up to D56864. If we're extracting from a non-zero index before casting to FP, then shuffle the vector and optionally narrow the vector before doing the cast: cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0 This might be enough to close PR39974: https://bugs.llvm.org/show_bug.cgi?id=39974 Differential Revision: https://reviews.llvm.org/D58197 llvm-svn: 354619	2019-02-21 20:40:39 +00:00
Nirav Dave	dce91c1edb	[X86] Fix copy-paste error in @ccz flag. @ccz operand should be equivalent to @cce. llvm-svn: 354588	2019-02-21 15:28:31 +00:00
Simon Pilgrim	e6b338cbef	[X86][SSE] combineX86ShufflesRecursively - moved to generic op input index lookup. NFCI. We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement. llvm-svn: 354575	2019-02-21 12:24:49 +00:00
Nikita Popov	c3b496de7a	[SDAG] Support vector UMULO/SMULO Second part of https://bugs.llvm.org/show_bug.cgi?id=40442. This adds an extra UnrollVectorOverflowOp() method to SDAG, because the general UnrollOverflowOp() method can't deal with multiple results. Additionally we need to expand UMULO/SMULO during vector op legalization, as it may result in unrolling, which may need additional type legalization. Differential Revision: https://reviews.llvm.org/D57997 llvm-svn: 354513	2019-02-20 20:41:44 +00:00
Simon Pilgrim	dca47c659c	[X86][SSE] combineX86ShufflesRecursively - begin generalizing the number of shuffle inputs. NFCI. We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit. llvm-svn: 354489	2019-02-20 17:58:29 +00:00
Simon Pilgrim	0b3b9424ca	[X86][SSE] Generalize X86ISD::BLENDI support to more value types D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required. With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors. This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts. I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case). The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV. Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361 Differential Revision: https://reviews.llvm.org/D57888 llvm-svn: 354363	2019-02-19 18:05:42 +00:00
Simon Pilgrim	d6add74915	Cast from SDValue directly instead of superfluous getNode(). NFCI. llvm-svn: 354343	2019-02-19 16:20:09 +00:00
Simon Pilgrim	952abcefe4	[X86][AVX] EltsFromConsecutiveLoads - Add BROADCAST lowering support This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads. It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load. Differential Revision: https://reviews.llvm.org/D58053 llvm-svn: 354340	2019-02-19 15:57:09 +00:00
Sanjay Patel	d8b4efcb6b	[CGP] form usub with overflow from sub+icmp The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487: https://bugs.llvm.org/show_bug.cgi?id=31754 https://bugs.llvm.org/show_bug.cgi?id=40487 ..and those are shown in the IR test file and x86 codegen file. Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use. This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo formation by default. Only x86 overrides that setting for now although other targets will likely benefit by forming usbuo too. Differential Revision: https://reviews.llvm.org/D57789 llvm-svn: 354298	2019-02-18 23:33:05 +00:00
Sanjay Patel	fff628274d	[x86] split more v8f32/v8i32 shuffles in lowering Similar to D57867 - this is a small patch with lots of test diffs. With half-vector-width narrowing potential, using an extract + 128-bit vshufps is a win because it replaces a 256-bit shuffle with a 128-bit shufle. This seems like it should be a win even for targets with 'fast-variable-shuffle', but we are intentionally deferring that to an independent change to make sure that is true. Differential Revision: https://reviews.llvm.org/D58181 llvm-svn: 354279	2019-02-18 16:46:12 +00:00
Craig Topper	ce3c5ac6a6	[X86] In FP_TO_INTHelper, when moving data from SSE register to X87 register file via the stack, use the same stack slot we use for the integer conversion. No need for a separate stack slot. The lifetimes don't overlap. Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot. llvm-svn: 354234	2019-02-17 19:23:49 +00:00
Craig Topper	db5aa955cb	[X86] When type legalizing the result of a i64 fp_to_uint on 32-bit targets. Generate all of the ops as i64 and let them be legalized. No need to manually split everything. We can let the type legalizer work for us. The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86SISD::CMP. This dropped the use count of the setcc, but wasn't what was intended. llvm-svn: 354197	2019-02-16 08:25:42 +00:00
Craig Topper	db2f084aa9	[X86] Don't set exception mask bits when modifying FPCW to change rounding mode for fp->int conversion When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption. This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block) Differential Revision: https://reviews.llvm.org/D57788 llvm-svn: 354178	2019-02-15 21:59:33 +00:00
Nirav Dave	7875841121	[X86] Fix LowerAsmOutputForConstraint. Summary: Update Flag when generating cc output. Fixes PR40737. Reviewers: rnk, nickdesaulniers, craig.topper, spatel Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58283 llvm-svn: 354163	2019-02-15 20:01:55 +00:00
Craig Topper	9c6a9276da	[X86] Move all the SSE legality checks out of FP_TO_INTHelper and up to LowerFP_TO_INT. NFCI These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there. Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner. llvm-svn: 354161	2019-02-15 19:21:39 +00:00
Simon Pilgrim	6ce08672fb	[X86][AVX] lowerShuffleAsLanePermuteAndPermute - fully populate the lane shuffle mask (PR40730) As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index. This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128. This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but its the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch. Differential Revision: https://reviews.llvm.org/D58237 llvm-svn: 354117	2019-02-15 11:39:21 +00:00
Nirav Dave	5ffdc43dc9	[X86] cleanup inline asm register generation. NFCI. llvm-svn: 354042	2019-02-14 18:06:21 +00:00
Craig Topper	1d158dd930	[X86] Make (f80 (sint_to_fp (i16))) use fistps/fisttps instead of fistpl/fisttpl when SSE is enabled. When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion. Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type. llvm-svn: 354003	2019-02-14 01:41:43 +00:00
Craig Topper	9b61f48e4b	[X86] Use default expansion for (i64 fp_to_uint f80) when avx512 is enabled on 64-bit targets to match what happens without avx512. In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this. llvm-svn: 353921	2019-02-13 07:42:34 +00:00
Craig Topper	3099e442a6	[X86] Refactor the FP_TO_INTHelper interface. NFCI -Pull the final stack load creation from the two callers into the helper. -Return a single SDValue instead of a std::pair. -Remove the Replace flag which isn't really needed. llvm-svn: 353920	2019-02-13 07:42:31 +00:00
Simon Pilgrim	5338f41ced	[X86][AVX] Enable shuffle combining support for zero_extend A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars. llvm-svn: 353860	2019-02-12 17:22:35 +00:00
Craig Topper	7670ede434	[X86] Collapse FP_TO_INT16_IN_MEM/FP_TO_INT32_IN_MEM/FP_TO_INT64_IN_MEM into a single opcode using memory VT to distinquish. NFC llvm-svn: 353798	2019-02-12 06:14:18 +00:00
Craig Topper	d7303ecd0b	[X86] Remove the value type operand from the floating point load/store MemIntrinsicSDNodes. Use the MemoryVT instead. NFCI We already have the memory VT, we can just match from that during isel. llvm-svn: 353797	2019-02-12 06:14:16 +00:00
Craig Topper	75eb0af874	[X86] Correct the memory operand for the FLD emitted in FP_TO_INTHelper for 32-bit SSE targets. We were using DstTy, but that represents the integer type we are converting to which is i64 in this case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers. If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte access. The store we used to get from SSE to the stack is using the corect size. While there, consistenly use TheVT in place of Op.getOperand(0).getValueType() throughout the function. llvm-svn: 353745	2019-02-11 20:38:10 +00:00
Sam McCall	e825ba9165	Revert "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" This reverts commit r353610. It causes a miscompile visible in macro expansion in a bootstrapped clang. http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190211/626590.html llvm-svn: 353699	2019-02-11 14:05:36 +00:00
Simon Pilgrim	f6e6c369c0	[X86] EltsFromConsecutiveLoads - replace SmallBitVector with APInt (NFC). Minor refactor to simplify some incoming patches to improve broadcast loads. llvm-svn: 353655	2019-02-10 22:45:48 +00:00
Sanjay Patel	833550fc74	[x86] narrow 256-bit horizontal ops via demanded elements 256-bit horizontal math ops are an x86 monstrosity (and thankfully have not been extended to 512-bit AFAIK). The two 128-bit halves operate on separate halves of the inputs. So if we don't demand anything in the upper half of the result, we can extract the low halves of the inputs, do the math, and then insert that result into a 256-bit output. All of the extract/insert is free (ymm<-->xmm), so we're left with a narrower (cheaper) version of the original op. In the affected tests based on: https://bugs.llvm.org/show_bug.cgi?id=33758 https://bugs.llvm.org/show_bug.cgi?id=38971 ...we see that the h-op narrowing can result in further narrowing of other math via existing generic transforms. I originally drafted this patch as an exact pattern match starting from extract_vector_elt, but I thought we might see diffs starting from extract_subvector too, so I changed it to a more general demanded elements solution. There are no extra existing regression test improvements from that switch though, so we could go back. Differential Revision: https://reviews.llvm.org/D57841 llvm-svn: 353641	2019-02-10 15:22:06 +00:00
Simon Pilgrim	6bf7b30b10	[X86] CombineOr - fold to generic funnel shifts As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead. There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked. llvm-svn: 353626	2019-02-09 20:34:59 +00:00
Simon Pilgrim	690a2889d8	[X86][SSE] Generalize X86ISD::BLENDI support to more value types D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required. With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors. This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts. I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case). The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV. Differential Revision: https://reviews.llvm.org/D57888 llvm-svn: 353610	2019-02-09 13:13:59 +00:00
Sanjay Patel	e9cc26a56a	[x86] fix formatting; NFC (test commit #2 migrating to git) llvm-svn: 353533	2019-02-08 16:48:40 +00:00
Craig Topper	738180cc7f	Fix the lowering issue of intrinsics llvm.localaddress on X86 Patch by Yuanke Luo Reviewers: craig.topper, annita.zhang, smaslov, rnk, wxiao3 Reviewed By: rnk Subscribers: efriedma, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D57501 llvm-svn: 353492	2019-02-08 01:14:12 +00:00
Sanjay Patel	81f859d169	[x86] fix formatting; NFC llvm-svn: 353477	2019-02-07 22:36:55 +00:00
Simon Pilgrim	fe3ac70b18	[DAGCombiner] (add (umax X, C), -C) --> (usubsat X, C) (PR40111) Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111. Differential Revision: https://reviews.llvm.org/D57754 llvm-svn: 353457	2019-02-07 20:14:43 +00:00
Sanjay Patel	a5c4a5e958	[x86] split more 256/512-bit shuffles in lowering This is intentionally a small step because it's hard to know exactly where we might introduce a conflicting transform with the code that tries to form wider shuffles. But I think this is safe - if we have a wide shuffle with 2 operands, then we should do better with an extract + narrow shuffle. Differential Revision: https://reviews.llvm.org/D57867 llvm-svn: 353427	2019-02-07 17:10:49 +00:00
Nirav Dave	84e5bf0c95	[X86] Simplify casing. NFC. llvm-svn: 353417	2019-02-07 15:43:40 +00:00
Nirav Dave	c6bfa103a5	[X86][DAG] Avoid creating dangling bitcast. combineExtractWithShuffle may leave a dangling bitcast which may prevent further optimization in later passes. Avoid constructing it unless it is used. llvm-svn: 353333	2019-02-06 19:45:47 +00:00
Nirav Dave	e5c37958f9	[InlineAsm][X86] Add backend support for X86 flag output parameters. Allow custom handling of inline assembly output parameters and add X86 flag parameter support. llvm-svn: 353307	2019-02-06 15:26:29 +00:00
Sanjay Patel	e84fbb67a1	[x86] vectorize cast ops in lowering to avoid register file transfers The proposal in D56796 may cross the line because we're trying to avoid vectorization transforms in generic DAG combining. So this is an alternate, later, x86-specific translation of that patch. There are several potential follow-ups to enhance this: 1. Allow extraction from non-zero element index. 2. Peek through extends of smaller width integers. 3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P Differential Revision: https://reviews.llvm.org/D56864 llvm-svn: 353302	2019-02-06 14:59:39 +00:00
Simon Pilgrim	b0afc69435	[X86][SSE] Disable ZERO_EXTEND shuffle combining rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly. llvm-svn: 353198	2019-02-05 19:15:48 +00:00
Simon Pilgrim	822d2e35e7	[X86][AVX] Attempt to combine shuffles to subvector broadcast load llvm-svn: 353189	2019-02-05 17:02:49 +00:00
Simon Pilgrim	62af24cc93	[X86][SSE] Add SimplifyDemandedVectorElts support for X86ISD::BLENDV llvm-svn: 353165	2019-02-05 12:27:29 +00:00
Simon Pilgrim	9e595e3663	[X86][AVX] Attempt to share broadcasts of different widths (PR39454) If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free). Differential Revision: https://reviews.llvm.org/D57663 llvm-svn: 353154	2019-02-05 10:58:43 +00:00
Craig Topper	f86eb00f12	[X86] Connect the default fpsr and dirflag clobbers in inline assembly to the registers we have defined for them. Summary: We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly. This could have problems for proper dependency tracking in the machine schedulers though I don't have a test case that shows that. Reviewers: rnk Reviewed By: rnk Subscribers: eraman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D57641 llvm-svn: 353141	2019-02-05 06:13:06 +00:00
Simon Pilgrim	6e5350a367	[X86][SSE] SimplifyDemandedBitsForTargetNode - PCMPGT(0,X) sign mask For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly. Differential Revision: https://reviews.llvm.org/D57667 llvm-svn: 353051	2019-02-04 15:43:36 +00:00
Simon Pilgrim	9899967464	Use auto for dyn_cast case to save a line. NFCI. llvm-svn: 353041	2019-02-04 12:32:39 +00:00
Simon Pilgrim	1fce5a8b75	[X86][AVX] Support shuffle combining for VBROADCAST with smaller vector sources getTargetShuffleMask can only do this safely if we're extracting the lowest subvector from a vector of the same result type. llvm-svn: 352999	2019-02-03 16:51:33 +00:00
Simon Pilgrim	18b73a655b	[X86][AVX] Support shuffle combining for VPMOVZX with smaller vector sources llvm-svn: 352997	2019-02-03 16:10:18 +00:00
Simon Pilgrim	a2a3e5b811	[X86][AVX] More aggressively simplify BROADCAST source operand Aim to use scalar source or lowest 128-bit vector directly. We're still missing some VZMOVL_LOAD combines. llvm-svn: 352994	2019-02-03 14:39:41 +00:00
Craig Topper	950ca192f6	[X86] Lower ISD::UADDO to use the Z flag instead of C flag when the RHS is a constant 1 to encourage INC formation. Summary: Add an additional combine to combineCarryThroughADD to reverse it back to the C flag to avoid regressions. I believe this catches the cases that D57547 got. Reviewers: RKSimon, spatel Reviewed By: spatel Subscribers: javed.absar, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D57637 llvm-svn: 352984	2019-02-03 07:25:06 +00:00
Simon Pilgrim	dbf302c9f1	[X86][AVX] Enable INSERT_SUBVECTOR(SRC0, SHUFFLE(SRC1)) shuffle combining Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles. The is exposes a couple of minor issues that will be fixed shortly: Missed broadcast folds - we have a mixture of vzext_load lengths that need cleaning up combine-sdiv.ll - AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts). llvm-svn: 352963	2019-02-02 18:08:04 +00:00
Simon Pilgrim	bd42f97946	[SDAG] Add SDNode/SDValue getConstantOperandAPInt helper. NFCI. We already have the getConstantOperandVal helper which returns a uint64_t, but along comes the fuzzer and inserts a i128 -1 constant or something and the whole thing asserts....... I've updated a few obvious cases, and tried to make use of the const reference where possible, but there's more to do. A number of existing oss-fuzz tickets should be fixed if we start using APInt and perform value clamping where necessary. llvm-svn: 352961	2019-02-02 17:35:06 +00:00
James Y Knight	14359ef1b6	[opaque pointer types] Pass value type to LoadInst creation. This cleans up all LoadInst creation in LLVM to explicitly pass the value type rather than deriving it from the pointer's element-type. Differential Revision: https://reviews.llvm.org/D57172 llvm-svn: 352911	2019-02-01 20:44:24 +00:00
James Y Knight	7976eb5838	[opaque pointer types] Pass function types to CallInst creation. This cleans up all CallInst creation in LLVM to explicitly pass a function type rather than deriving it from the pointer's element-type. Differential Revision: https://reviews.llvm.org/D57170 llvm-svn: 352909	2019-02-01 20:43:25 +00:00
Simon Pilgrim	85184017e9	[X86][SSE] Use PSLLDQ/PSRLDQ to mask out zeroable ends of a shuffle As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section. For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts). Differential Revision: https://reviews.llvm.org/D56784 llvm-svn: 352883	2019-02-01 16:02:12 +00:00
Simon Pilgrim	1a529f58f9	[X86][AVX] Combine INSERT_SUBVECTOR(SRC0, BITCAST(SHUFFLE(EXTRACT_SUBVECTOR(SRC1))) Enable peeking through one use bitcasts to the subvector shuffle. This still depends on the subvector being the same scalar-size but D57514 has already helped with the more tricky patterns llvm-svn: 352879	2019-02-01 15:31:01 +00:00
James Y Knight	13680223b9	[opaque pointer types] Add a FunctionCallee wrapper type, and use it. Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc doesn't choke on it, hopefully. Original Message: The FunctionCallee type is effectively a {FunctionType,Value} pair, and is a useful convenience to enable code to continue passing the result of getOrInsertFunction() through to EmitCall, even once pointer types lose their pointee-type. Then: - update the CallInst/InvokeInst instruction creation functions to take a Callee, - modify getOrInsertFunction to return FunctionCallee, and - update all callers appropriately. One area of particular note is the change to the sanitizer code. Previously, they had been casting the result of `getOrInsertFunction` to a `Function*` via `checkSanitizerInterfaceFunction`, and storing that. That would report an error if someone had already inserted a function declaraction with a mismatching signature. However, in general, LLVM allows for such mismatches, as `getOrInsertFunction` will automatically insert a bitcast if needed. As part of this cleanup, cause the sanitizer code to do the same. (It will call its functions using the expected signature, however they may have been declared.) Finally, in a small number of locations, callers of `getOrInsertFunction` actually were expecting/requiring that a brand new function was being created. In such cases, I've switched them to Function::Create instead. Differential Revision: https://reviews.llvm.org/D57315 llvm-svn: 352827	2019-02-01 02:28:03 +00:00
James Y Knight	fadf25068e	Revert "[opaque pointer types] Add a FunctionCallee wrapper type, and use it." This reverts commit `f47d6b38c7` (r352791). Seems to run into compilation failures with GCC (but not clang, where I tested it). Reverting while I investigate. llvm-svn: 352800	2019-01-31 21:51:58 +00:00
James Y Knight	f47d6b38c7	[opaque pointer types] Add a FunctionCallee wrapper type, and use it. The FunctionCallee type is effectively a {FunctionType,Value} pair, and is a useful convenience to enable code to continue passing the result of getOrInsertFunction() through to EmitCall, even once pointer types lose their pointee-type. Then: - update the CallInst/InvokeInst instruction creation functions to take a Callee, - modify getOrInsertFunction to return FunctionCallee, and - update all callers appropriately. One area of particular note is the change to the sanitizer code. Previously, they had been casting the result of `getOrInsertFunction` to a `Function*` via `checkSanitizerInterfaceFunction`, and storing that. That would report an error if someone had already inserted a function declaraction with a mismatching signature. However, in general, LLVM allows for such mismatches, as `getOrInsertFunction` will automatically insert a bitcast if needed. As part of this cleanup, cause the sanitizer code to do the same. (It will call its functions using the expected signature, however they may have been declared.) Finally, in a small number of locations, callers of `getOrInsertFunction` actually were expecting/requiring that a brand new function was being created. In such cases, I've switched them to Function::Create instead. Differential Revision: https://reviews.llvm.org/D57315 llvm-svn: 352791	2019-01-31 20:35:56 +00:00
Simon Pilgrim	00cefe1158	Trim trailing whitespace. NFCI. llvm-svn: 352775	2019-01-31 17:49:25 +00:00
Simon Pilgrim	eb6aef6db3	[X86][AVX] Fold concat(broadcast(x),broadcast(x)) -> broadcast(x) Differential Revision: https://reviews.llvm.org/D57514 llvm-svn: 352774	2019-01-31 17:48:35 +00:00
Simon Pilgrim	d04a2d2d5e	[X86][AVX] insert_subvector(bitcast(v), bitcast(s), c1) -> bitcast(insert_subvector(v,s,c2)) Similar to what we already do in DAGCombiner, but this version also handles bitcasts from types with different scalar sizes, which x86 is better at handling. Differential Revision: https://reviews.llvm.org/D57514 llvm-svn: 352773	2019-01-31 17:38:10 +00:00
Simon Pilgrim	63f3383ece	[X86][AVX] Fold broadcast(bitcast(src)) -> bitcast(broadcast(src)) llvm-svn: 352751	2019-01-31 14:04:07 +00:00
Simon Pilgrim	a001008a09	[X86] combineExtractWithShuffle - more aggressively peek through bitcasts Fixes regression introduced by rL352743 llvm-svn: 352745	2019-01-31 11:55:30 +00:00
Simon Pilgrim	b96a2c7fed	[X86][AVX] Enable AVX1 broadcasts in shuffle combining Enables 32/64-bit scalar load broadcasts on AVX1 targets The extractelement-load.ll regression will be fixed shortly in a followup commit. llvm-svn: 352743	2019-01-31 11:41:10 +00:00
Simon Pilgrim	51c2efc104	[X86][AVX] Fold vt1 concat_vectors(vt2 undef, vt2 broadcast(x)) --> vt1 broadcast(x) If we're not inserting the broadcast into the lowest subvector then we can avoid the insertion by just performing a larger broadcast. Avoids a regression when we enable AVX1 broadcasts in shuffle combining llvm-svn: 352742	2019-01-31 11:15:05 +00:00
Craig Topper	8bdc203d4b	[X86] Remove handling of ISD::INTRINSIC_WO_CHAIN in ReplaceNodeResults. I believe this was there to handle avx512bw intrinsics that returned i64 type in 32-bit mode. But all those intrinsics have since been changed to v64i1 results or replaced with generic IR. llvm-svn: 352698	2019-01-31 00:04:46 +00:00
Simon Pilgrim	317fad5921	[X86][AVX] Prefer to combine shuffle to broadcasts whenever possible This is the first step towards improving broadcast support on AVX1 targets. llvm-svn: 352634	2019-01-30 16:19:19 +00:00
Mikael Holmen	b792627ce9	Fix compiler warning when using clang 3.6.0 Without the fix we get the following (with -Werror): ../lib/Target/X86/X86ISelLowering.cpp:14181:58: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] SmallVector<std::array<int, 2>, 2> LaneSrcs(NumLanes, {-1, -1}); ^~~~~~ { } 1 error generated. llvm-svn: 352455	2019-01-29 06:51:28 +00:00
Craig Topper	390ac61b93	Recommit r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer" This did not cause the buildbot failure it was previously reverted for. Original commit message: I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits. This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inre On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all. llvm-svn: 352433	2019-01-28 21:38:47 +00:00
Simon Pilgrim	2c17512456	[X86][AVX] Remove lowerShuffleByMerging128BitLanes 2-lane restriction First step towards adding support for 64-bit unary "sublane" handling (a bit like lowerShuffleAsRepeatedMaskAndLanePermute). This allows us to add lowerV64I8Shuffle handling. llvm-svn: 352389	2019-01-28 17:02:35 +00:00
Sanjay Patel	94cca60b82	[x86] allow more shuffle splitting to avoid vpermps (PR40434) This is tricky to make optimal: sometimes we're better off using a single wider op, but other times it makes more sense to combine a narrow ops to achieve the same result. This solves the case from: https://bugs.llvm.org/show_bug.cgi?id=40434 There's potentially a similar change for vectors with 64-bit elements, but it needs adjustments similar to rL352333 to avoid creating infinite loops. llvm-svn: 352380	2019-01-28 15:51:34 +00:00
Craig Topper	453150bc18	[X86] Add new variadic avx512 compress/expand intrinsics that use vXi1 types for the mask argument. Remove and autoupgrade the old intrinsics llvm-svn: 352343	2019-01-28 07:03:03 +00:00
Sanjay Patel	ebe6b43aec	[x86] add restriction for lowering to vpermps This transform was added with rL351346, and we had an escape for shufps, but we also want one for unpckps vs. vpermps because vpermps doesn't take an immediate shuffle index operand. llvm-svn: 352333	2019-01-27 21:53:33 +00:00
Simon Pilgrim	670a6971f8	[X86][SSE] Add UNDEF handling to combineSelect ISD::USUBSAT matching (PR40083) llvm-svn: 352330	2019-01-27 21:01:23 +00:00
Simon Pilgrim	f10b6623cc	[X86][SSE] Permit UNDEFs in combineAddToSUBUS matching (PR40083) llvm-svn: 352328	2019-01-27 20:36:37 +00:00
Sanjay Patel	5f1fdaa192	[x86] refactor logic in lowerShuffleWithUndefHalf Although this is longer code, this is no-functional-change-intended. The goal is to untangle the conditions under which we bail out, so that's easier to adjust. llvm-svn: 352320	2019-01-27 18:12:03 +00:00
Simon Pilgrim	a914fa4dd8	[X86] combineAddOrSubToADCOrSBB/combineCarryThroughADD - use oneuse for entire SDNode Fix issue noted in D57281 that only tested the one use for the SDValue (the result flag), not the entire SUB. I've added the getNode() to make it clearer what is intended than just the -> redirection. llvm-svn: 352291	2019-01-26 21:29:16 +00:00
Simon Pilgrim	37a8e65a60	[X86] combineCarryThroughADD - add support for X86::COND_A commutations (PR24545) As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to a SBB instruction. Differential Revision: https://reviews.llvm.org/D57281 llvm-svn: 352289	2019-01-26 20:23:04 +00:00
Simon Pilgrim	b7a15acd38	[X86] Fold X86ISD::SBB(ISD::SUB(X,Y),0) -> X86ISD::SBB(X,Y) (PR25858) We often generate X86ISD::SBB(X, 0) for carry flag arithmetic. I had tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find anything yet. Differential Revision: https://reviews.llvm.org/D57169 llvm-svn: 352288	2019-01-26 20:13:44 +00:00
Simon Pilgrim	6162fba57c	[X86][SSE] Generalized unsigned compares to support nonsplat constant vectors (PR39859) llvm-svn: 352283	2019-01-26 16:40:03 +00:00
Sanjay Patel	a03c63b77f	[x86] add helper for creating a half-width shuffle; NFC This reduces a bit of duplication between the combining and lowering places that use it, but the primary motivation is to make it easier to rearrange the lowering logic and solve PR40434: https://bugs.llvm.org/show_bug.cgi?id=40434 llvm-svn: 352280	2019-01-26 16:20:22 +00:00
Craig Topper	58e6b37e62	Revert r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer" This might be breaking an lldb windows buildbot. llvm-svn: 352268	2019-01-26 02:44:58 +00:00
Craig Topper	7a8e74775c	[X86] Add DAG combine to merge vzext_movl with the various fp<->int conversion operations that only write the lower 64-bits of an xmm register and zero the rest. Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D56971 llvm-svn: 352260	2019-01-26 01:17:09 +00:00
Craig Topper	b1d3457c03	[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer Summary: I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits. This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization. On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D57186 llvm-svn: 352255	2019-01-26 00:26:37 +00:00
Craig Topper	4cf28bad5b	[X86] Combine masked store and truncate into masked truncating stores. We also need to combine to masked truncating with saturation stores, but I'm leaving that for a future patch. This does regress some tests that used truncate wtih saturation followed by a masked store. Those now use a truncating store and use min/max to saturate. Differential Revision: https://reviews.llvm.org/D57218 llvm-svn: 352230	2019-01-25 18:37:36 +00:00
Sanjay Patel	0020f8bb23	[x86] simplify logic in lowerShuffleWithUndefHalf(); NFCI This seems unnecessarily complicated because we gave names to opposite polarity bools and have code comments that don't really line up with the logic. Step 1: remove UndefUpper and assert that it is the opposite of UndefLower after the initial early exit. llvm-svn: 352217	2019-01-25 17:00:41 +00:00
Simon Pilgrim	f56298f4b9	[X86] Simplify X86ISD::ADD/SUB if we don't use the result flag Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag. This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further. Noticed in some of the test cases in PR31754 Differential Revision: https://reviews.llvm.org/D57234 llvm-svn: 352210	2019-01-25 15:58:28 +00:00
Sanjay Patel	21aa6ddc14	[x86] narrow a shuffle that doesn't use or set any high elements This isn't the final fix for our reduction/horizontal codegen, but it takes care of a lot of the problems. After we narrow the shuffle, existing combines for insert/extract and binops kick in, and we end up with cheaper 128-bit ops. The avg and mul reduction tests show an existing shuffle lowering hole for AVX2/AVX512. I think in its most minimal form this is: https://bugs.llvm.org/show_bug.cgi?id=40434 ...but we might need multiple fixes to get it right. Differential Revision: https://reviews.llvm.org/D57156 llvm-svn: 352209	2019-01-25 15:37:42 +00:00
Sanjay Patel	4c304b2923	[x86] move half-size shuffle mask creation to helper; NFC As noted in D57156, we want to check at least part of this pattern earlier (in combining), so this will allow the code to be shared instead of duplicated. llvm-svn: 352127	2019-01-24 23:12:36 +00:00
Sanjay Patel	e524639d72	[x86] rename VectorShuffle -> Shuffle; NFC This wasn't consistent within the file, so made it harder to search. Standardize on the shorter name to save some typing. llvm-svn: 352077	2019-01-24 18:52:12 +00:00
Sanjay Patel	e5a0bcf7b8	[x86] add low/high undef half shuffle mask helpers; NFC This is the most common usage for isUndefInRange, so make the code slightly less duplicated and more readable. llvm-svn: 352063	2019-01-24 17:05:02 +00:00
Matt Arsenault	a5840c3c39	Codegen support for atomicrmw fadd/fsub llvm-svn: 351851	2019-01-22 18:36:06 +00:00
Simon Pilgrim	933673d878	[X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y)) For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations. This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns. Differential Revision: https://reviews.llvm.org/D55935 llvm-svn: 351819	2019-01-22 13:44:49 +00:00
Craig Topper	bcbdf61078	[X86] Use X86ISD::VFPROUND instead of ISD::FP_ROUND for 256 and 512 bit cvtpd2ps intrinsics. Summary: Use X86ISD::VFPROUND in the instruction isel patterns. Add new patterns for ISD::FP_ROUND to maintain support for fptrunc in IR. In the process I found a couple duplicate isel patterns which I also deleted in this patch. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D56991 llvm-svn: 351762	2019-01-21 20:14:09 +00:00
Craig Topper	c2087d8f3f	[X86] Change avx512 COMPRESS and EXPAND lowering to use a single masked node instead of expand/compress+select. Summary: For compress, a select node doesn't semantically reflect the behavior of the instruction. The mask would have holes in it, but the resulting write is to contiguous elements at the bottom of the vector. Furthermore, as far as the compressing and expanding is concerned the behavior is depended on the mask. You can't just have an expand/compress node that only reads the input vector. That node would have no meaning by itself. This all only works because we pattern match the compress/expand+select back to the instruction. But conceivably an optimization of the select could break the pattern and leave something meaningless. This patch modifies the expand and compress node to take the mask and passthru as additional inputs and gets rid of the select all together. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D57002 llvm-svn: 351761	2019-01-21 20:02:28 +00:00
Craig Topper	4aa74fff1f	[X86] Add masked MCVTSI2P/MCVTUI2P ISD opcodes to model the cvtqq2ps cvtuqq2ps nodes that produce less than 128-bits of results. These nodes zero the upper half of the result and can't be represented with vselect. llvm-svn: 351666	2019-01-19 21:26:20 +00:00
Chandler Carruth	2946cd7010	Update the file headers across all of the LLVM projects in the monorepo to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636	2019-01-19 08:50:56 +00:00
Reid Kleckner	38f9900aa5	[X86] Deduplicate static calling convention helpers for code size, NFC Summary: Right now we include ${TGT}GenCallingConv.inc once per each instruction selection method implemented by ${TGT}: - ${TGT}ISelLowering.cpp - ${TGT}CallLowering.cpp - ${TGT}FastISel.cpp Instead, add a mechanism to tablegen for marking a particular convention as "External", which causes tablegen to emit into the ::llvm namespace, instead of as a static helper. This allows us to provide a header to forward declare it, so we can simply call the function from all the places it is referenced. Typically the calling convention analyzer is called indirectly, so it doesn't benefit from inlining. This saves a bit of final binary size, but mostly just saves object file size: before after diff artifact 12852K 12492K -360K X86ISelLowering.cpp.obj 4640K 4280K -360K X86FastISel.cpp.obj 1704K 2092K +388K X86CallingConv.cpp.obj 52448K 52336K -112K llc.exe I didn't collect before numbers for X86CallLowering.cpp.obj, which is for GlobalISel, but we should save 360K there as well. This patch applies the strategy to the X86 backend, but there is no reason it couldn't be applied to the other backends that implement multiple ISel strategies, like AArch64. Reviewers: craig.topper, hfinkel, efriedma Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D56883 llvm-svn: 351616	2019-01-19 00:33:02 +00:00
Craig Topper	08d3d32ead	[X86] Lower avx512f scatter intrinsics to X86MaskedScatterSDNode instead of going directly to MachineSDNode. This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field. llvm-svn: 351583	2019-01-18 20:14:46 +00:00
Craig Topper	b9d4461f9f	[X86] Lower avx2/avx512f gather intrinsics to X86MaskedGatherSDNode instead of going directly to MachineSDNode.: This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field. Differential Revision: https://reviews.llvm.org/D56827 llvm-svn: 351570	2019-01-18 18:22:26 +00:00
Sanjay Patel	b6c91a1a59	[x86] simplify code for SDValue.getOperand(); NFC llvm-svn: 351557	2019-01-18 15:55:21 +00:00
Craig Topper	59abdf5f3f	[X86] Add X86ISD::VSHLV and X86ISD::VSRLV nodes for psllv and psrlv Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits. This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts. We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change. Patch by zhutianyang and Craig Topper Differential Revision: https://reviews.llvm.org/D56695 llvm-svn: 351381	2019-01-16 21:46:32 +00:00
Craig Topper	5ea3120718	[X86] Use X86ISD::BLENDV for blendv intrinsics. Replace vselect with blendv just before isel table lookup. Remove vselect isel patterns. This cleans up the duplication we have with both intrinsic isel patterns and vselect isel patterns. This should also allow the intrinsics to get SimplifyDemandedBits support for the condition. I've switched the canonical pattern in isel to use the X86ISD::BLENDV node instead of VSELECT. Since it always seemed weird to move from BLENDV with its relaxed rules on condition bits to VSELECT which has strict rules about all bits of the condition element being the same. Its more correct to go from VSELECT to BLENDV. Differential Revision: https://reviews.llvm.org/D56771 llvm-svn: 351380	2019-01-16 21:46:28 +00:00
Craig Topper	e5b7cc8aa0	[X86] Add a one use check to the setcc inversion code in combineVSelectWithAllOnesOrZeros If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc. Differential Revision: https://reviews.llvm.org/D56765 llvm-svn: 351378	2019-01-16 21:29:29 +00:00
Simon Pilgrim	5a2bbe267a	[X86] getFauxShuffleMask - bail for non-byte aligned shuffle types Remove the existing assertion and just return false for unexpected shuffle value types (<X x i1> mainly....). Found while updating combineX86ShufflesRecursively to run within SimplifyDemandedVectorElts/SimplifyDemandedBits. llvm-svn: 351365	2019-01-16 18:15:31 +00:00
Simon Pilgrim	524daea429	[X86] Add combineX86ShufflesRecursively helper. NFCI. combineX86ShufflesRecursively is pretty cumbersome with a lot of arguments that only matter later in recursion. This commit adds a wrapper version that only takes the initial root Op to simplify calls that don't need to worry about these. An early, cleanup step towards merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts/SimplifyDemandedBits. llvm-svn: 351352	2019-01-16 16:01:42 +00:00
Sanjay Patel	0dbecd05ed	[x86] lower shuffle of extracts to AVX2 vperm instructions I was trying to prevent shuffle regressions while matching more horizontal ops and ended up here: shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0 The affected tests were added for: https://bugs.llvm.org/show_bug.cgi?id=34380 This patch won't change the examples in the bug report itself, but we should be able to extend this to catch more types. Differential Revision: https://reviews.llvm.org/D56756 llvm-svn: 351346	2019-01-16 14:15:18 +00:00
Mandeep Singh Grang	436735c3fe	[EH] Rename llvm.x86.seh.recoverfp intrinsic to llvm.eh.recoverfp Summary: Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc. Refer D53541 for the context. Clang counterpart D56748. Reviewers: rnk, efriedma Reviewed By: rnk, efriedma Subscribers: javed.absar, kristof.beyls, llvm-commits Differential Revision: https://reviews.llvm.org/D56747 llvm-svn: 351281	2019-01-16 00:37:13 +00:00
Craig Topper	0e420e6a62	[X86] Rename SHRUNKBLEND ISD node to BLENDV. That's really what it is. If we didn't use intrinsics for BLENDVPS/BLENDVPD/PBLENDVB all the way to isel, this is the node we would use. llvm-svn: 351278	2019-01-16 00:20:30 +00:00
Craig Topper	34ac509ac8	[X86] Add avx512 scatter intrinsics that use a vXi1 mask instead of a scalar integer. We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways. llvm-svn: 351275	2019-01-15 23:36:25 +00:00
Craig Topper	82015b633b	[X86] Add versions of the avx512 gather intrinsics that take the mask as a vXi1 vector instead of a scalar In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen. I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics. I'll work on scatter and gatherpf/scatterpf next. Differential Revision: https://reviews.llvm.org/D56527 llvm-svn: 351234	2019-01-15 20:12:33 +00:00
Nirav Dave	dbc41bac0c	[X86] Fix register class for assembly constraints to ST(7). NFCI. Modify getRegForInlineAsmConstraint to return special singleton register class when a constraint references ST(7) not RFP80 for which ST(7) is not a member. llvm-svn: 351206	2019-01-15 17:09:14 +00:00
Simon Pilgrim	b8f08c8d7b	[X86] Bailout of lowerVectorShuffleAsPermuteAndUnpack for shuffle-with-zero (PR40306) If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements. We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets. llvm-svn: 351203	2019-01-15 16:56:55 +00:00
Benjamin Kramer	2fc8ede082	[X86] Fix unused variable warning in Release builds. NFC. llvm-svn: 351136	2019-01-14 23:29:54 +00:00
Craig Topper	9906f77f82	[X86] Silence a -Wparentheses warning on gcc. NFC llvm-svn: 351111	2019-01-14 19:44:02 +00:00
Simon Pilgrim	bfe2ee453a	[X86][SSSE3] Bailout of lowerVectorShuffleAsPermuteAndUnpack for shuffle-with-zero (PR40306) If we have PSHUFB and we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements. llvm-svn: 351103	2019-01-14 19:07:26 +00:00
Sanjay Patel	b23ff7a0e2	[x86] lower extracted add/sub to horizontal vector math add (extractelt (X, 0), extractelt (X, 1)) --> extractelt (hadd X, X), 0 This is the integer sibling to D56011. There's an additional restriction to only to do this transform in the case where we don't have extra extracts from the source vector. Without that, we can fail to match larger horizontal patterns that are more beneficial than this minimal case. An improvement to the more general h-op lowering may allow us to remove the restriction here in a follow-up. llvm-svn: 351093	2019-01-14 18:44:02 +00:00
Craig Topper	c8cd85588b	[X86] Remove unused intrinsic handlers. NFC llvm-svn: 351032	2019-01-14 01:56:59 +00:00
Craig Topper	075fcc1151	[X86] Remove FPCLASS intrinsic handler. Use INTR_TYPE_2OP instead. NFC llvm-svn: 351031	2019-01-14 01:44:09 +00:00
Craig Topper	3f3b8ef442	[X86] Remove mask parameter from vpshufbitqmb intrinsics. Change result to a vXi1 vector. The input mask can be represented with an AND in IR. Fixes PR40258 llvm-svn: 351028	2019-01-14 00:03:50 +00:00
Craig Topper	31156bbdb9	[X86] Add more ISD nodes to handle masked versions of VCVT(T)PD2DQZ128/VCVT(T)PD2UDQZ128 which only produce 2 result elements and zeroes the upper elements. We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this. Fixes another case from PR34877 llvm-svn: 351018	2019-01-13 02:59:59 +00:00
Craig Topper	4561edbec0	[X86] Add X86ISD::VMFPROUND to handle the masked case of VCVTPD2PSZ128 which only produces 2 result elements and zeroes the upper elements. We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this. Fixes another case from PR34877. llvm-svn: 351017	2019-01-13 02:59:57 +00:00
Simon Pilgrim	a0069ba0db	[X86] More aggressive shuffle mask widening in combineExtractWithShuffle Use demanded extract index to set most of the shuffle mask to undef, making it easier to widen and peek through. llvm-svn: 351013	2019-01-12 16:38:56 +00:00
Simon Pilgrim	a21e2bd682	[X86] Improve vXi64 ISD::ABS codegen with SSE41+ Make use of vblendvpd to select on the signbit Differential Revision: https://reviews.llvm.org/D56544 llvm-svn: 350999	2019-01-12 10:28:12 +00:00
Simon Pilgrim	ca0de0363b	[X86][AARCH64] Improve ISD::ABS support This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types. Differential Revision: https://reviews.llvm.org/D56544 llvm-svn: 350998	2019-01-12 09:59:32 +00:00
Craig Topper	90fe6edcba	[X86] Remove X86ISD::SELECT as its no longer used by any of our intrinsic lowering. llvm-svn: 350995	2019-01-12 08:15:54 +00:00
Craig Topper	33b2cf50e3	[X86] Add ISD node for masked version of CVTPS2PH. The 128-bit input produces 64-bits of output and fills the upper 64-bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do. This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there. Fixes another instruction for PR34877. llvm-svn: 350994	2019-01-12 08:05:12 +00:00
Craig Topper	a69d903204	[X86] Remove unnecessary code from getMaskNode. We no longer need to extend mask scalars before bitcasting them to vXi1. This was only needed for the truncate intrinsics. And was really a bug in our lowering of them. llvm-svn: 350991	2019-01-12 06:13:44 +00:00
Craig Topper	bf61525e8c	[X86] When lowering v1i1/v2i1/v4i1/v8i1 load/store with avx512f, but not avx512dq, use v16i1 as the intermediate mask type instead of v8i1. We still use i8 for the load/store type. So we need to convert to/from i16 to around the mask type. By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned. llvm-svn: 350989	2019-01-12 02:22:10 +00:00
Craig Topper	abe6ef8d09	[X86] Add ISD nodes for masked truncate so we can properly represent when the output has more elements than the input due to needing to be 128 bits. We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask. This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that. This fixes some of PR34877 llvm-svn: 350985	2019-01-12 00:55:27 +00:00
Sanjay Patel	40cd4b77e9	[x86] allow insert/extract when matching horizontal ops Previously, we limited this transform to cases where the extraction into the build vector happens from vectors of the same type as the build vector, but that's not required. There's a slight potential regression seen in the AVX512 result for phadd -- we're using the 256-bit flavor of the instruction now even though the 128-bit subset is sufficient. The same problem could already be seen in the AVX2 result. Follow-up patches will attempt to narrow that back down. llvm-svn: 350928	2019-01-11 14:27:59 +00:00
Craig Topper	b97885cc2e	[X86] Change vXi1 extract_vector_elt lowering to be legal if the index is 0. Add DAG combine to turn scalar_to_vector+extract_vector_elt into extract_subvector. We were lowering the last step extract_vector_elt to a bitcast+truncate. Change it to use an extract_vector_elt of index 0 instead. Add isel patterns to do the equivalent of what the bitcast would have done. Plus an isel pattern for an any_extend+extract to prevent some regressions. Finally add a DAG combine to turn v1i1 scalar_to_vector+extract_vector_elt of 0 into an extract_subvector. This fixes some of the regressions from D350800. llvm-svn: 350918	2019-01-11 05:44:56 +00:00
Craig Topper	844f989608	[X86] Call SimplifyDemandedBits on conditions of X86ISD::SHRUNKBLEND This extends to combineVSelectToShrunkBlend to be able to resimplify SHRUNKBLENDS that have already been created. This should help some of the regressions from D56387 Differential Revision: https://reviews.llvm.org/D56421 llvm-svn: 350875	2019-01-10 19:05:34 +00:00
Craig Topper	350e6e9d7c	[X86] Simplify the BRCOND handling for FCMP_UNE. Despite what the comment says, FCMP_UNE would be an OR not an AND. In the lowering code the first branch created still goes to the original destination. The second branch was exchanged to go to where the subsequent unconditional branch went. This is different than what we do for FCMP_OEQ where both branches that we create go to the original unconditional branch. As far as I can tell, I think this means we don't need to exchange the branch target with the unconditional branch for FCMP_UNE at all. Differential Revision: https://reviews.llvm.org/D56309 llvm-svn: 350873	2019-01-10 19:02:14 +00:00
Sanjay Patel	87ae1460f7	[x86] fix remaining miscompile bug in horizontal binop matching (PR40243) When we use the partial-matching function on a 128-bit chunk, we must account for the possibility that we've matched undef halves of the original source vectors, so the outputs may need to be reset. This should allow closing PR40243: https://bugs.llvm.org/show_bug.cgi?id=40243 llvm-svn: 350830	2019-01-10 15:27:23 +00:00
Sanjay Patel	ed5cfc6792	[x86] fix horizontal binop matching for 256-bit vectors (PR40243) This is a partial fix for: https://bugs.llvm.org/show_bug.cgi?id=40243 ...as seen in the integer test, we still need to correct the result when using the existing (old) horizontal op matching function because it does not model the way x86 256-bit horizontal ops return results (each 128-bit half is its own horizontal-op). A potential follow-up change for that is discussed in the bug report - see also D56490. This generally duplicates a lot of the existing matching code, but we can't just remove that without introducing regressions, so the existing code is renamed and used less often. Follow-ups may try to reduce that overlap. Differential Revision: https://reviews.llvm.org/D56450 llvm-svn: 350826	2019-01-10 15:04:52 +00:00
Craig Topper	c38c9c120f	[X86] After turning VSELECT into SHRUNKBLEND, make we push the VSELECT into the worklist so it can be deleted. Found while trying to figure out why my second version of D56421 worked better than the first version. We weren't deleting the vselect in a timely fashion and that caused SimplfyDemandedBit to see an additional user. The new version doesn't have this problem so this fix isn't needed there, but seemed like the right thing to do. llvm-svn: 350781	2019-01-10 00:14:27 +00:00
Simon Pilgrim	5a7132ff0f	[X86] Enable combining shuffles to PACKSS/PACKUS for 256/512-bit vectors llvm-svn: 350716	2019-01-09 13:23:28 +00:00
Craig Topper	2fa8e2d8a8	[X86] Correct the MaskVT for avx512 gather/scatter intrinsics to use the min of the number of index and data elements. When the result type is v2i64/v2f64 and the index element size is i32, the index vector has two unused elements making the type v4i32. The mask VT should match the number of memory accesses that will be made. This is consistent with the isel patterns used for the target independent gather/scatter intrinsic. llvm-svn: 350687	2019-01-09 04:21:12 +00:00
Craig Topper	6ffeeb705f	[X86] Add support for matching vector funnel shift to AVX512VBMI2 instructions. Summary: AVX512VBMI2 supports a funnel shift by immediate and a funnel shift by a variable vector. Reviewers: spatel, RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D56361 llvm-svn: 350498	2019-01-06 18:10:18 +00:00
Craig Topper	d0ba531a0c	[X86] Use two pmovmskbs in combineBitcastvxi1 for (i64 (bitcast (v64i1 (truncate (v64i8)))) on KNL. llvm-svn: 350481	2019-01-05 22:42:58 +00:00
Craig Topper	46f8b4a11e	[X86] Allow combinevxi1Bitcast to use pmovmskb on avx512 targets if the input is a truncate from v16i8/v32i8. This is especially helpful on targets without avx512bw since we don't have a good way to convert from v16i8/v32i8 to v16i1/v32i1 for the truncate anyway. If we're just going to convert it to a GPR we might as well use pmovmskb to accomplish both. llvm-svn: 350480	2019-01-05 21:40:07 +00:00
Craig Topper	3f48dbf72e	[X86] Allow LowerTRUNCATE to use PACKUS/PACKSS for v16i16->v16i8 truncate when -mprefer-vector-width-256 is in effect and BWI is not available. llvm-svn: 350473	2019-01-05 18:48:11 +00:00
Nikita Popov	c35b4a37ba	[X86] Fix warning; NFC llvm-svn: 350437	2019-01-04 21:41:35 +00:00

... 3 4 5 6 7 ...

6328 Commits