Commit Graph

6113 Commits

Krzysztof Parzyszek 4841643a1d [X86] Extend boolean arguments to inline-asm according to getBooleanType
Differential Revision: https://reviews.llvm.org/D60208

llvm-svn: 357615
2019-04-03 17:43:14 +00:00
Simon Pilgrim 15919ad306 [X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 reduction on AVX1
Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result.
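For reference, a minimal intrinsics sketch of the 'any-of' form of this split (illustrative only, not code from the patch; the 'all-of' form would AND the halves and compare the mask against 0xFFFF):

```
#include <immintrin.h>

// AVX1 'any-of' over a 32 x i8 predicate: OR the two 128-bit halves first,
// then issue a single 128-bit PMOVMSKB on the combined result.
static inline bool anyOfV32i8(__m256i pred) {
  __m128i lo = _mm256_castsi256_si128(pred);       // low 128 bits
  __m128i hi = _mm256_extractf128_si256(pred, 1);  // high 128 bits
  return _mm_movemask_epi8(_mm_or_si128(lo, hi)) != 0;
}
```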

llvm-svn: 357611
2019-04-03 17:28:34 +00:00
Simon Pilgrim 9e28dddf55 [X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction on AVX1
Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets.
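As a rough illustration (an assumed equivalent written with intrinsics, not the helper's actual code), splitting a v32i8 MOVMSK on AVX1 amounts to:

```
#include <immintrin.h>

// MOVMSK each 128-bit half and stitch the two 16-bit masks into a 32-bit mask.
static inline unsigned movmskV32i8(__m256i v) {
  unsigned lo = (unsigned)_mm_movemask_epi8(_mm256_castsi256_si128(v));
  unsigned hi = (unsigned)_mm_movemask_epi8(_mm256_extractf128_si256(v, 1));
  return lo | (hi << 16);
}
```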

llvm-svn: 357608
2019-04-03 17:17:13 +00:00
Craig Topper 2e1bf89e3a [X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics.
This is the appropriate opcode for only having a chain output. Though I'm not
sure it matters much.

llvm-svn: 357375
2019-04-01 05:26:12 +00:00
Sanjay Patel e1bc360fc6 [x86] allow movmsk with 2-element reductions
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.
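For concreteness, here is a 2-element 'any-of' compare reduction of the shape being discussed, sketched with intrinsics (a hypothetical example, not one of the patch's tests):

```
#include <immintrin.h>

// Two doubles compared lane-wise; MOVMSKPD reduces both results with a single
// scalar test instead of extracting and combining within the vector domain.
static inline bool anyEqual(__m128d a, __m128d b) {
  return _mm_movemask_pd(_mm_cmpeq_pd(a, b)) != 0;
}
```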

If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0

  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5

  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400

  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0

  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600

  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.

Differential Revision: https://reviews.llvm.org/D59997

llvm-svn: 357367
2019-03-31 15:11:34 +00:00
Simon Pilgrim 10c9032c02 [X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316)
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.
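For context, the scalar form of the average that detectAVGPattern recognizes (a sketch of the semantics, not the matcher code):

```
// PAVGB computes (a + b + 1) >> 1 in a widened type. When a and b share no set
// bits, one of the ADDs in the expanded pattern can legally be rewritten as an
// OR, which is the 'add like' shape PR41316 needed the matcher to accept.
static inline unsigned char avgU8(unsigned char a, unsigned char b) {
  return (unsigned char)(((unsigned)a + (unsigned)b + 1) >> 1);
}
```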

llvm-svn: 357351
2019-03-30 17:12:29 +00:00
Simon Pilgrim 3293455595 [X86][SSE] detectAVGPattern - begin generalizing ADD matches
Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316

llvm-svn: 357349
2019-03-30 15:31:53 +00:00
Amara Emerson d413f41de6 [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.

llvm-svn: 357317
2019-03-29 21:30:51 +00:00
Simon Pilgrim aeaf7fcdde [X86] Add X86TargetLowering::isCommutativeBinOp override.
We currently just have test coverage for PMULUDQ - will add more in the future.

llvm-svn: 357244
2019-03-29 11:25:58 +00:00
Sanjay Patel 5bbf6f0bd8 [x86] avoid cmov in movmsk reduction
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.

We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.
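A hypothetical 'all-of' movmsk reduction of the kind involved here, written with intrinsics (illustrative only): the final compare produces a boolean that is sign-extended to 0 / -1, and a compare+negate sequence avoids materializing it with a cmov.

```
#include <immintrin.h>

// Returns -1 if every byte of a equals the corresponding byte of b, else 0.
static inline int allBytesEqual(__m128i a, __m128i b) {
  return -(_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF);  // 0 or -1
}
```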

There seems to be a missing fold for sext of a bool bit here:

negl %ecx
movslq %ecx, %rax

...but that's an independent transform.

Differential Revision: https://reviews.llvm.org/D59818

llvm-svn: 357172
2019-03-28 14:16:13 +00:00
Sanjay Patel 1df0bb6264 [x86] improve AVX lowering of vector zext
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.

I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.

From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.

Differential Revision: https://reviews.llvm.org/D59777

llvm-svn: 357129
2019-03-27 22:42:11 +00:00
Sanjay Patel 704817912a [x86] look through bitcast operand of MOVMSK
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
  // Also support integer VTs to avoid a int->fp bitcast in the DAG.

llvm-svn: 357128
2019-03-27 22:24:03 +00:00
Simon Pilgrim ccb71b2985 Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249

llvm-svn: 357057
2019-03-27 10:25:02 +00:00
Simon Pilgrim 87d4ab8b92 [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
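An intrinsics-level sketch of the two equivalent forms being traded off, for zero-extending the upper four i16 lanes of a v8i16 to i32 (names are illustrative, not from the patch):

```
#include <immintrin.h>

// PMOVZX(PSHUFD(V)): move the high 64 bits down, then zero-extend - two
// shuffle-port instructions on Intel.
static inline __m128i zextHiU16_pshufd(__m128i v) {
  return _mm_cvtepu16_epi32(_mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2)));
}

// UNPCKH(V, 0): interleave the high lanes with zero - one shuffle, at the
// cost of materializing the zero register (pxor) and an extra register move.
static inline __m128i zextHiU16_unpckh(__m128i v) {
  return _mm_unpackhi_epi16(v, _mm_setzero_si128());
}
```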

llvm-svn: 356864
2019-03-24 19:06:35 +00:00
Simon Pilgrim a71c0ed471 [X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).

Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.

llvm-svn: 356858
2019-03-24 16:30:35 +00:00
Sanjay Patel 7d676dfd86 [x86] improve the default expansion of uaddsat/usubsat
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0

We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
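A scalar reference for these expansions (a sketch of the semantics only, not the DAG code):

```
#include <cstdint>

// uaddsat: overflow is detected by the sum wrapping below an operand.
static inline uint32_t uaddsat(uint32_t x, uint32_t y) {
  uint32_t sum = x + y;
  return x > sum ? UINT32_MAX : sum;  // (X >u (X + Y)) ? -1 : X + Y
}

// usubsat: clamp the difference at zero.
static inline uint32_t usubsat(uint32_t x, uint32_t y) {
  return x > y ? x - y : 0;           // (X >u Y) ? X - Y : 0
}
```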

Differential Revision: https://reviews.llvm.org/D59006

llvm-svn: 356855
2019-03-24 13:55:54 +00:00
Sanjay Patel 2e92846d36 [x86] reduce code duplication; NFC
llvm-svn: 356836
2019-03-23 15:00:52 +00:00
Craig Topper ce1ed55a4a [X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32-bit halves together. If we have xmm registers we can use those to implement the operation instead. This results in fewer instructions than doing two separate 32-bit popcnt sequences.

This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
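For reference, the per-32-bit-half arithmetic/logic expansion referred to above, written as scalar C++ (the standard bit-counting sequence; the patch replaces it with an equivalent sequence carried out in xmm registers):

```
#include <cstdint>

static inline uint32_t popcount32(uint32_t v) {
  v = v - ((v >> 1) & 0x55555555u);                  // 2-bit partial sums
  v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  // 4-bit partial sums
  v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  // 8-bit partial sums
  return (v * 0x01010101u) >> 24;                    // add up the byte sums
}

static inline uint32_t popcount64(uint64_t v) {
  return popcount32((uint32_t)v) + popcount32((uint32_t)(v >> 32));
}
```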

Differential Revision: https://reviews.llvm.org/D59662

llvm-svn: 356808
2019-03-22 20:47:02 +00:00
Craig Topper 1ffd8e8114 [X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enabled
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.

I tried to just use an f64 atomic load and add isel patterns to MOVSD (which the domain fixing pass can turn into MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.

So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?

I've left some FIXMEs for improvements we can do without SSE2.
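The source-level shape affected by this change, for illustration (not a test from the patch):

```
#include <atomic>
#include <cstdint>

// On a 32-bit x86 target with SSE2, this 64-bit atomic load can now be lowered
// to a plain (V)MOVQ through an XMM register instead of a lock cmpxchg8b.
int64_t loadCounter(const std::atomic<int64_t> &counter) {
  return counter.load(std::memory_order_acquire);
}
```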

Differential Revision: https://reviews.llvm.org/D59679

llvm-svn: 356807
2019-03-22 20:46:56 +00:00
Simon Pilgrim 564392d752 [X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203)
llvm-svn: 356784
2019-03-22 17:23:55 +00:00
Craig Topper b3bad3dce3 [X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI
For the future day when pointers don't have element types, we should just use the type of the load result instead.

llvm-svn: 356721
2019-03-21 21:37:18 +00:00
Simon Pilgrim c2e4405475 [X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers
We don't use X86ISD::ANDNP for mask registers.

Test case from @craig.topper (Craig Topper)

llvm-svn: 356696
2019-03-21 18:32:38 +00:00
Craig Topper 8d46403b8e [X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set.
CMPXCHG8B was introduced with the i586/Pentium generation.

If it's not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. It's unclear if we should be using a different limit for other configs. The default is 1024, and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.
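A standalone model of the policy described here (a hedged sketch with assumed values, not the actual TargetLowering code):

```
// Pick the maximum lock-free atomic width from the available cmpxchg flavours;
// anything wider is expanded by AtomicExpandPass into __atomic_* libcalls.
static unsigned maxAtomicSizeInBits(bool hasCmpxchg8b, bool hasCmpxchg16b) {
  if (hasCmpxchg16b) return 128;  // 64-bit mode with cx16
  if (hasCmpxchg8b)  return 64;   // i586/Pentium and later
  return 32;                      // i386/i486: only 32-bit operations inline
}
```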

Differential Revision: https://reviews.llvm.org/D59576

llvm-svn: 356631
2019-03-20 23:35:49 +00:00
Craig Topper 0367553304 [X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend.
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immediate, GPR to k-register, and masked op.

I had to make some changes to support v8i64 when i64 is not a legal type, and to
support floating point types.

This trades a load for the move immediate and GPR move, which is higher latency.
But it's probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization, I think.

Differential Revision: https://reviews.llvm.org/D59479

llvm-svn: 356618
2019-03-20 21:30:20 +00:00
Simon Pilgrim 2acca37a2d [X86] Use getConstantOperandAPInt to detect out-of-range shifts.
llvm-svn: 356549
2019-03-20 11:41:52 +00:00
Andrea Di Biagio 624f5deff4 [X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI
This patch removes the following dag node opcodes from namespace X86ISD:

RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG

The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:

    RDTSC/RDTSCP don't implicitly read ECX.
    RDTSCP also implicitly writes ECX.

I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.

No functional change intended.

Differential Revision: https://reviews.llvm.org/D59547

llvm-svn: 356546
2019-03-20 11:21:15 +00:00
Simon Pilgrim e744f513c4 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts
If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed.

llvm-svn: 356483
2019-03-19 17:23:25 +00:00
Jordan Rupprecht f74d45a775 [NFC] Fix unused variable in release builds
This was introduced in rL356468.

llvm-svn: 356477
2019-03-19 16:52:40 +00:00
Simon Pilgrim a56f2822d0 [SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect
These changes are related to PR37743 and include:

    SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.

    Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.

    Add promoting the integer ABS node in the LegalizeIntegerType.

    Expand-based legalization of integer result for the ABS nodes.

    Expand-based legalization of ABS vector operations.

    Add some integer abs testcases for different typesizes for Thumb arch

    Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
        tmp = (SRA, Hi, 31)
        Lo = (UADDO tmp, Lo)
        Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
        Lo = (XOR tmp, Lo)

    The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
        (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).

    Change integer abs testcases for codegen with the ABS node support for AArch64.
        Indicate that the ABS is legal for the i64 type when the NEON is supported.
        Change the integer abs testcases to show changing of codegen.

    Add combine and legalization of ABS nodes for Thumb arch.

    Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
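A scalar sketch of the zext-abs-diff shape named in the detectZextAbsDiff item above (illustrative, not code from the patch):

```
#include <cstdlib>

// Absolute difference of two zero-extended bytes - the building block the X86
// SAD recognizer looks for: (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
static inline unsigned absDiffU8(unsigned char a, unsigned char b) {
  int d = (int)a - (int)b;        // (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))
  return (unsigned)std::abs(d);   // ABS
}
```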

For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743

Patch by: @ikulagin (Ivan Kulagin)

Differential Revision: https://reviews.llvm.org/D49837

llvm-svn: 356468
2019-03-19 16:24:55 +00:00
Adhemerval Zanella 664c1ef528 [TargetLowering] Add code size information on isFPImmLegal. NFC
This allows better code size for aarch64 floating point materialization
in a future patch.

Reviewers: evandro

Differential Revision: https://reviews.llvm.org/D58690

llvm-svn: 356389
2019-03-18 18:40:07 +00:00
Simon Pilgrim f2c53b5d6c [X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.
Replaces existing i1-only fold.

llvm-svn: 356325
2019-03-16 15:02:00 +00:00
Simon Pilgrim 0f472e1d01 [X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

llvm-svn: 356324
2019-03-16 14:29:50 +00:00
Roman Lebedev 9f37790608 [X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`
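One hypothetical C++ shape that gives rise to this DAG pattern (illustrative only): a select between two wider values, truncated to i8 and then sign-extended.

```
#include <cstdint>

int32_t pick(bool cond, int32_t c0, int32_t c1) {
  int8_t t = static_cast<int8_t>(cond ? c0 : c1);  // trunc (cmov C0, C1) to i8
  return t;                                        // sext back to i32
}
```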

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

llvm-svn: 356301
2019-03-15 21:18:05 +00:00
Roman Lebedev b6e376ddfa [X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are performant, even the 8-bit case.
  That is because there we are computing mid point between zero and some random number,
  thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that a well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`;
  `cmov` is up to +10% worse than a well-predicted branch.
* However, I do not believe this is a concern. If the branch is well predicted, then PGO
  will also say that it is well predicted, and LLVM will happily expand the cmov back into a branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What I can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll get a branch),
in ~50% of cases you will have to pay the branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPUs.
For Intel CPUs, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, I'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

llvm-svn: 356300
2019-03-15 21:17:53 +00:00
Craig Topper af856db961 [X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table.
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.

Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.

Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.

llvm-svn: 356294
2019-03-15 19:59:35 +00:00
Simon Pilgrim d33e62c826 [X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x)))
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.

llvm-svn: 356292
2019-03-15 19:14:28 +00:00
Simon Pilgrim 65165d54bb [X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW
llvm-svn: 356270
2019-03-15 16:16:49 +00:00
Simon Pilgrim 0ad17402a9 [X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm.
Prep work for PR40203

llvm-svn: 356249
2019-03-15 11:05:42 +00:00
Sanjay Patel 5d1df114e8 [x86] prevent infinite looping from vselect commutation (PR41066)
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.

llvm-svn: 356158
2019-03-14 15:32:34 +00:00
Craig Topper 84abec2855 [X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b()
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.

We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.

I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.
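A standalone model of the corrected predicate (member names are assumed; the real check lives in X86Subtarget):

```
// The cx16 feature bit alone is not sufficient: CMPXCHG16B only exists in
// 64-bit mode, so the predicate must require both.
struct SubtargetModel {
  bool HasCmpxchg16b = false;
  bool Is64Bit = false;
  bool hasCmpxchg16b() const { return HasCmpxchg16b && Is64Bit; }
};
```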

Differential Revision: https://reviews.llvm.org/D59308

llvm-svn: 356078
2019-03-13 18:48:50 +00:00
Simon Pilgrim bef4fe056d [X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 356067
2019-03-13 17:00:18 +00:00
Simon Pilgrim d9aa879b67 [X86][AVX] Add combineConcatVectors support to improve subvector handling
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.

This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.

The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.

llvm-svn: 356064
2019-03-13 16:37:30 +00:00
Sanjay Patel 0a251e4076 [x86] limit extractelement of setcc to pre-legalization
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700

The bug was introduced recently here:
rL355741

This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8). 

llvm-svn: 356053
2019-03-13 14:49:52 +00:00
Simon Pilgrim 0c1e5aacd3 Fix signed/unsigned mismatch warning. NFCI.
llvm-svn: 356046
2019-03-13 13:14:14 +00:00
Simon Pilgrim 7abbd70300 [X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.

This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) with just keeping track of the bit offset, which can be converted back to the broadcast index later on.

Differential Revision: https://reviews.llvm.org/D58888

llvm-svn: 356043
2019-03-13 12:20:39 +00:00
Sanjay Patel 737c27a9cd [x86] scalarize extractelement 0 of FP vselect
llvm-svn: 355955
2019-03-12 19:20:45 +00:00
Craig Topper f19d6a4073 [X86] Add SCALAR_SINT_TO_FP/SCALAR_UINT_TO_FP ISD opcodes without rounding mode.
After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those.

llvm-svn: 355807
2019-03-11 04:37:01 +00:00
Craig Topper ecbc141dbf [X86] Split SCALEF(S) ISD opcodes into a version without rounding mode.
llvm-svn: 355806
2019-03-11 04:36:59 +00:00
Craig Topper a0b5338834 [X86] Split RCP28/RSQRT/GETEXP/EXP2 ISD opcodes into SAE and current direction nodes. Remove rounding mode operand.
llvm-svn: 355805
2019-03-11 04:36:57 +00:00
Craig Topper ba7d654526 [X86] Rename _RND versions of RANGE/REDUCE/GETMANT/RDNSCALE ISD opcodes to _SAE. Remove SAE operand.
No need to explicitly store it and match it during isel.

llvm-svn: 355804
2019-03-11 04:36:55 +00:00