llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	a71c0ed471	[X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685) Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets). Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers. llvm-svn: 356858	2019-03-24 16:30:35 +00:00
Sanjay Patel	7d676dfd86	[x86] improve the default expansion of uaddsat/usubsat This is yet another step towards solving PR14613: https://bugs.llvm.org/show_bug.cgi?id=14613 uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y usubsat X, Y --> (X >u Y) ? X - Y : 0 We can't count on a sane vector ISA, so override the default (umin/umax) expansion of unsigned add/sub saturate in cases where we do not have umin/umax. Differential Revision: https://reviews.llvm.org/D59006 llvm-svn: 356855	2019-03-24 13:55:54 +00:00
Sanjay Patel	2e92846d36	[x86] reduce code duplication; NFC llvm-svn: 356836	2019-03-23 15:00:52 +00:00
Craig Topper	ce1ed55a4a	[X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32-bit halves together. If we have xmm registers we can use those to implement the operation instead. This results in fewer instructions than doing two separate 32-bit popcnt sequences. This mitigates some of PR41151 for the i64 on i686 case when we have SSE2. Differential Revision: https://reviews.llvm.org/D59662 llvm-svn: 356808	2019-03-22 20:47:02 +00:00
Craig Topper	1ffd8e8114	[X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enabled We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead. I tried to just use an f64 atomic load and add isel patterns to MOVSD (which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer. So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag? I've left some FIXMEs for improvements we can do without SSE2. Differential Revision: https://reviews.llvm.org/D59679 llvm-svn: 356807	2019-03-22 20:46:56 +00:00
Simon Pilgrim	564392d752	[X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203) llvm-svn: 356784	2019-03-22 17:23:55 +00:00
Craig Topper	b3bad3dce3	[X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI For the future day when pointers don't have element types, we should just use the type of the load result instead. llvm-svn: 356721	2019-03-21 21:37:18 +00:00
Simon Pilgrim	c2e4405475	[X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers We don't use X86ISD::ANDNP for mask registers. Test case from @craig.topper (Craig Topper) llvm-svn: 356696	2019-03-21 18:32:38 +00:00
Craig Topper	8d46403b8e	[X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set. CMPXCHG8B was introduced in the i586/Pentium generation. If it's not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG. Differential Revision: https://reviews.llvm.org/D59576 llvm-svn: 356631	2019-03-20 23:35:49 +00:00
Craig Topper	0367553304	[X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend. This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before falling back to move immediate, GPR to k-register, and masked op. I had to make some changes to support v8i64 when i64 is not a legal type, and to support floating point types. This trades a load for the move immediate and GPR move, which is higher latency. But it's probably better for register pressure not having to hop through other register classes. The load+and should play better with LICM and rematerialization, I think. Differential Revision: https://reviews.llvm.org/D59479 llvm-svn: 356618	2019-03-20 21:30:20 +00:00
Simon Pilgrim	2acca37a2d	[X86] Use getConstantOperandAPInt to detect out-of-range shifts. llvm-svn: 356549	2019-03-20 11:41:52 +00:00
Andrea Di Biagio	624f5deff4	[X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI This patch removes the following dag node opcodes from namespace X86ISD: RDTSC_DAG, RDTSCP_DAG, RDPMC_DAG The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The only differences are: RDTSC/RDTSCP don't implicitly read ECX. RDTSCP also implicitly writes ECX. I moved the common expansion logic into a helper function with the goal to get rid of code repetition. That helper is now used for the expansion of RDTSC/RDTSCP/RDPMC/XGETBV intrinsics. No functional change intended. Differential Revision: https://reviews.llvm.org/D59547 llvm-svn: 356546	2019-03-20 11:21:15 +00:00
Simon Pilgrim	e744f513c4	[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed. llvm-svn: 356483	2019-03-19 17:23:25 +00:00
Jordan Rupprecht	f74d45a775	[NFC] Fix unused variable in release builds This was introduced in rL356468. llvm-svn: 356477	2019-03-19 16:52:40 +00:00
Simon Pilgrim	a56f2822d0	[SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect These changes are related to PR37743 and include: SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build an ABS node. Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner. Add promotion of the integer ABS node in the LegalizeIntegerType. Expand-based legalization of the integer result for ABS nodes. Expand-based legalization of ABS vector operations. Add some integer abs testcases for different type sizes for the Thumb arch. Add custom ABS expansion and change the SAD pattern recognizer for the X86 arch. The i64 result of the ABS is expanded to: tmp = (SRA, Hi, 31); Lo = (UADDO tmp, Lo); Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1)); Lo = (XOR tmp, Lo). The "detectZextAbsDiff" function is changed to recognize the pattern with the ABS node. Given an ABS node, detect the following pattern: (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))). Change integer abs testcases for codegen with the ABS node support for AArch64. Indicate that ABS is legal for the i64 type when NEON is supported. Change the integer abs testcases to show the change in codegen. Add combine and legalization of ABS nodes for the Thumb arch. Extend 'matchSelectPattern' to recognize the ABS patterns with an ICMP_SGE condition. For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743 Patch by: @ikulagin (Ivan Kulagin) Differential Revision: https://reviews.llvm.org/D49837 llvm-svn: 356468	2019-03-19 16:24:55 +00:00
Adhemerval Zanella	664c1ef528	[TargetLowering] Add code size information on isFPImmLegal. NFC This allows better code size for aarch64 floating point materialization in a future patch. Reviewers: evandro Differential Revision: https://reviews.llvm.org/D58690 llvm-svn: 356389	2019-03-18 18:40:07 +00:00
Simon Pilgrim	f2c53b5d6c	[X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes. Replaces existing i1-only fold. llvm-svn: 356325	2019-03-16 15:02:00 +00:00
Simon Pilgrim	0f472e1d01	[X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW Improved constant folding for PEXTRB/PEXTRW will be added in a future commit llvm-svn: 356324	2019-03-16 14:29:50 +00:00
Roman Lebedev	9f37790608	[X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's Summary: As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780 If we have `sext (trunc (cmov C0, C1) to i8)`, we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))` Reviewers: craig.topper, andreadb, RKSimon Reviewed By: craig.topper Subscribers: llvm-commits, andreadb Tags: #llvm Differential Revision: https://reviews.llvm.org/D59412 llvm-svn: 356301	2019-03-15 21:18:05 +00:00
Roman Lebedev	b6e376ddfa	[X86] Promote i8 CMOV's (PR40965) Summary: @mclow.lists brought this issue up in IRC; it came up during the implementation of libc++ `std::midpoint()` (D59099) https://godbolt.org/z/oLrHBP Currently the LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`. This differential proposes to always promote i8 CMOV. There are several concerns here: * Is this actually more performant, or is it just the ASM that looks cuter? * Does this result in partial register stalls? * What about the branch predictor? # Indeed, performance should be the main point here. Let's look at a simple microbenchmark: {F8412076} ``` #include "benchmark/benchmark.h" #include <algorithm> #include <cassert> #include <cmath> #include <cstdint> #include <iterator> #include <limits> #include <random> #include <type_traits> #include <utility> #include <vector> // Future preliminary libc++ code, from Marshall Clow. namespace std { template <class _Tp> __inline _Tp midpoint(_Tp __a, _Tp __b) noexcept { using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type; int __sign = 1; _Up __m = __a; _Up __M = __b; if (__a > __b) { __sign = -1; __m = __b; __M = __a; } return __a + __sign * _Tp(_Up(__M - __m) >> 1); } } // namespace std template <typename T> std::vector<T> getVectorOfRandomNumbers(size_t count) { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); std::vector<T> v; v.reserve(count); std::generate_n(std::back_inserter(v), count, [&dis, &gen]() { return dis(gen); }); assert(v.size() == count); return v; } struct RandRand { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { return std::make_pair(getVectorOfRandomNumbers<T>(count), getVectorOfRandomNumbers<T>(count)); } }; struct ZeroRand { template <typename T> static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) { return std::make_pair(std::vector<T>(count, 
T(0)), getVectorOfRandomNumbers<T>(count)); } }; template <class T, class Gen> void BM_StdMidpoint(benchmark::State& state) { const size_t Length = state.range(0); const std::pair<std::vector<T>, std::vector<T>> Data = Gen::template Gen<T>(Length); const std::vector<T>& a = Data.first; const std::vector<T>& b = Data.second; assert(a.size() == Length && b.size() == a.size()); benchmark::ClobberMemory(); benchmark::DoNotOptimize(a); benchmark::DoNotOptimize(a.data()); benchmark::DoNotOptimize(b); benchmark::DoNotOptimize(b.data()); for (auto _ : state) { for (size_t i = 0; i < Length; i++) { const auto calculated = std::midpoint(a[i], b[i]); benchmark::DoNotOptimize(calculated); } } state.SetComplexityN(Length); state.counters["midpoints"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant); state.counters["midpoints/sec"] = benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate); const size_t BytesRead = 2 * sizeof(T) * Length; state.counters["bytes_read/iteration"] = benchmark::Counter(BytesRead, benchmark::Counter::kDefaults, benchmark::Counter::OneK::kIs1024); state.counters["bytes_read/sec"] = benchmark::Counter( BytesRead, benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024); } template <typename T> static void CustomArguments(benchmark::internal::Benchmark* b) { const size_t L2SizeBytes = 2 * 1024 * 1024; // What is the largest range we can check to always fit within given L2 cache? const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 / /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2; b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN); } // Both of the values are random. // The comparison is unpredictable. 
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand) ->Apply(CustomArguments<int32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand) ->Apply(CustomArguments<int64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand) ->Apply(CustomArguments<int16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand) ->Apply(CustomArguments<int8_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand) ->Apply(CustomArguments<uint8_t>); // One value is always zero, and another is bigger or equal than zero. // The comparison is predictable. BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand) ->Apply(CustomArguments<uint32_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand) ->Apply(CustomArguments<uint64_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand) ->Apply(CustomArguments<uint16_t>); BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand) ->Apply(CustomArguments<uint8_t>); ``` ``` $ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm 2019-03-06 21:53:31 Running ./llvm-cmov-bench-OLD Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.78, 1.81, 1.36 ---------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters<...> ---------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 300398 ns 300404 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s 
BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint32_t, RandRand>/131072 300433 ns 300433 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<int64_t, RandRand>/65536 169857 ns 169858 ns 4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s BM_StdMidpoint<int64_t, RandRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint64_t, RandRand>/65536 169770 ns 169771 ns 4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s BM_StdMidpoint<uint64_t, RandRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<int16_t, RandRand>/262144 591169 ns 591179 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s BM_StdMidpoint<int16_t, RandRand>_BigO 2.25 N 2.25 N BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint16_t, RandRand>/262144 591264 ns 591274 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s BM_StdMidpoint<uint16_t, RandRand>_BigO 2.25 N 2.25 N BM_StdMidpoint<uint16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<int8_t, RandRand>/524288 2983669 ns 2983689 ns 235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s BM_StdMidpoint<int8_t, RandRand>_BigO 5.69 N 5.69 N BM_StdMidpoint<int8_t, RandRand>_RMS 0 % 0 % <...> BM_StdMidpoint<uint8_t, RandRand>/524288 2668398 ns 2668419 ns 262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s BM_StdMidpoint<uint8_t, RandRand>_BigO 5.09 N 5.09 N BM_StdMidpoint<uint8_t, RandRand>_RMS 0 % 0 % <...> 
BM_StdMidpoint<uint32_t, ZeroRand>/131072 300887 ns 300887 ns 2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, ZeroRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 169634 ns 169634 ns 4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 592252 ns 592255 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 987295 ns 987309 ns 711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s BM_StdMidpoint<uint8_t, ZeroRand>_BigO 1.88 N 1.88 N BM_StdMidpoint<uint8_t, ZeroRand>_RMS 1 % 1 % RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW 2019-03-06 21:56:58 Running ./llvm-cmov-bench-NEW Run on (8 X 4000 MHz CPU s) CPU Caches: L1 Data 16K (x8) L1 Instruction 64K (x4) L2 Unified 2048K (x4) L3 Unified 8192K (x1) Load Average: 1.17, 1.46, 1.30 ---------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters<...> ---------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 300878 ns 300880 ns 2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint32_t, RandRand>/131072 300231 ns 300226 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s 
midpoints=305.398M midpoints/sec=436.578M/s BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 % <...> BM_StdMidpoint<int64_t, RandRand>/65536 170819 ns 170777 ns 4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s BM_StdMidpoint<int64_t, RandRand>_BigO 2.60 N 2.60 N BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint64_t, RandRand>/65536 171705 ns 171708 ns 4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s BM_StdMidpoint<uint64_t, RandRand>_BigO 2.62 N 2.62 N BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 % <...> BM_StdMidpoint<int16_t, RandRand>/262144 592510 ns 592516 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s BM_StdMidpoint<int16_t, RandRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint16_t, RandRand>/262144 614823 ns 614823 ns 1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s BM_StdMidpoint<uint16_t, RandRand>_BigO 2.33 N 2.33 N BM_StdMidpoint<uint16_t, RandRand>_RMS 4 % 4 % <...> BM_StdMidpoint<int8_t, RandRand>/524288 1073181 ns 1073201 ns 650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s BM_StdMidpoint<int8_t, RandRand>_BigO 2.05 N 2.05 N BM_StdMidpoint<int8_t, RandRand>_RMS 1 % 1 % BM_StdMidpoint<uint8_t, RandRand>/524288 1071010 ns 1071020 ns 653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s BM_StdMidpoint<uint8_t, RandRand>_BigO 2.05 N 2.05 N BM_StdMidpoint<uint8_t, RandRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint32_t, ZeroRand>/131072 300413 ns 300416 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N BM_StdMidpoint<uint32_t, 
ZeroRand>_RMS 2 % 2 % <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 169667 ns 169669 ns 4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 % <...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 591396 ns 591404 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 % <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 1069421 ns 1069413 ns 655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s BM_StdMidpoint<uint8_t, ZeroRand>_BigO 2.04 N 2.04 N BM_StdMidpoint<uint8_t, ZeroRand>_RMS 0 % 0 % Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------- <...> BM_StdMidpoint<int32_t, RandRand>/131072 +0.0016 +0.0016 300398 300878 300404 300880 <...> BM_StdMidpoint<uint32_t, RandRand>/131072 -0.0007 -0.0007 300433 300231 300433 300226 <...> BM_StdMidpoint<int64_t, RandRand>/65536 +0.0057 +0.0054 169857 170819 169858 170777 <...> BM_StdMidpoint<uint64_t, RandRand>/65536 +0.0114 +0.0114 169770 171705 169771 171708 <...> BM_StdMidpoint<int16_t, RandRand>/262144 +0.0023 +0.0023 591169 592510 591179 592516 <...> BM_StdMidpoint<uint16_t, RandRand>/262144 +0.0398 +0.0398 591264 614823 591274 614823 <...> BM_StdMidpoint<int8_t, RandRand>/524288 -0.6403 -0.6403 2983669 1073181 2983689 1073201 <...> BM_StdMidpoint<uint8_t, RandRand>/524288 -0.5986 -0.5986 2668398 1071010 2668419 1071020 <...> BM_StdMidpoint<uint32_t, ZeroRand>/131072 -0.0016 -0.0016 300887 300413 300887 300416 <...> BM_StdMidpoint<uint64_t, ZeroRand>/65536 +0.0002 +0.0002 169634 169667 169634 169669 
<...> BM_StdMidpoint<uint16_t, ZeroRand>/262144 -0.0014 -0.0014 592252 591396 592255 591404 <...> BM_StdMidpoint<uint8_t, ZeroRand>/524288 +0.0832 +0.0832 987295 1069421 987309 1069413 ``` What can we tell from the benchmark? * `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance. * All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are performant, even the 8-bit case. That is because there we are computing the midpoint between zero and some random number, thus if the branch predictor is in use, it is in an optimal situation. * Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%. # What about the branch predictor? * `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`, which may mean that a well-predicted branch is better than `cmov`. * Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`; `cmov` is up to +10% worse than a well-predicted branch. * However, I do not believe this is a concern. If the branch is well predicted, then the PGO will also say that it is well predicted, and LLVM will happily expand cmov back into a branch: https://godbolt.org/z/P5ufig # What about partial register stalls? I'm not really able to answer that. What I can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have a branch), in ~50% of cases you will have to pay the branch misprediction penalty. ``` $ grep -i MispredictPenalty X86Sched*.td X86SchedBroadwell.td: let MispredictPenalty = 16; X86SchedHaswell.td: let MispredictPenalty = 16; X86SchedSandyBridge.td: let MispredictPenalty = 16; X86SchedSkylakeClient.td: let MispredictPenalty = 14; X86SchedSkylakeServer.td: let MispredictPenalty = 14; X86ScheduleBdVer2.td: let MispredictPenalty = 20; // Minimum branch misdirection penalty. 
X86ScheduleBtVer2.td: let MispredictPenalty = 14; // Minimum branch misdirection penalty X86ScheduleSLM.td: let MispredictPenalty = 10; X86ScheduleZnver1.td: let MispredictPenalty = 17; ``` ... which can be as small as 10 cycles and as large as 20 cycles. Partial register stalls do not seem to be an issue for AMD CPUs. For Intel CPUs, they should be around ~5 cycles? Is that actually an issue here? I'm not sure. In short, I'd say this is an improvement, at least on this microbenchmark. Fixes PR40965 (https://bugs.llvm.org/show_bug.cgi?id=40965). Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic Reviewed By: craig.topper, andreadb Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists Tags: #llvm, #libc Differential Revision: https://reviews.llvm.org/D59035 llvm-svn: 356300	2019-03-15 21:17:53 +00:00
Craig Topper	af856db961	[X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table. The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen. Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable. Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table. llvm-svn: 356294	2019-03-15 19:59:35 +00:00
Simon Pilgrim	d33e62c826	[X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x))) Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits. llvm-svn: 356292	2019-03-15 19:14:28 +00:00
Simon Pilgrim	65165d54bb	[X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW llvm-svn: 356270	2019-03-15 16:16:49 +00:00
Simon Pilgrim	0ad17402a9	[X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm. Prep work for PR40203 llvm-svn: 356249	2019-03-15 11:05:42 +00:00
Sanjay Patel	5d1df114e8	[x86] prevent infinite looping from vselect commutation (PR41066) This is an immediate fix for: https://bugs.llvm.org/show_bug.cgi?id=41066 ...but as noted there and in the code comments, we should do better by stubbing this out sooner. llvm-svn: 356158	2019-03-14 15:32:34 +00:00
Craig Topper	84abec2855	[X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b() The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well. We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass. I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue. Differential Revision: https://reviews.llvm.org/D59308 llvm-svn: 356078	2019-03-13 18:48:50 +00:00
Simon Pilgrim	bef4fe056d	[X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode llvm-svn: 356067	2019-03-13 17:00:18 +00:00
Simon Pilgrim	d9aa879b67	[X86][AVX] Add combineConcatVectors support to improve subvector handling Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization. This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time. The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit. llvm-svn: 356064	2019-03-13 16:37:30 +00:00
Sanjay Patel	0a251e4076	[x86] limit extractelement of setcc to pre-legalization A fuzzer found the crasher: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700 The bug was introduced recently here: rL355741 This is the quick fix. If we need to do this transform later, then we'd have to extend/truncate the vector setcc element type to the scalar setcc type (i8). llvm-svn: 356053	2019-03-13 14:49:52 +00:00
Simon Pilgrim	0c1e5aacd3	Fix signed/unsigned mismatch warning. NFCI. llvm-svn: 356046	2019-03-13 13:14:14 +00:00
Simon Pilgrim	7abbd70300	[X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false. This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on. Differential Revision: https://reviews.llvm.org/D58888 llvm-svn: 356043	2019-03-13 12:20:39 +00:00
Sanjay Patel	737c27a9cd	[x86] scalarize extractelement 0 of FP vselect llvm-svn: 355955	2019-03-12 19:20:45 +00:00
Craig Topper	f19d6a4073	[X86] Add SCALAR_SINT_TO_FP/SCALAR_UINT_TO_FP ISD opcodes without rounding mode. After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those. llvm-svn: 355807	2019-03-11 04:37:01 +00:00
Craig Topper	ecbc141dbf	[X86] Split SCALEF(S) ISD opcodes into a version without rounding mode. llvm-svn: 355806	2019-03-11 04:36:59 +00:00
Craig Topper	a0b5338834	[X86] Split RCP28/RSQRT/GETEXP/EXP2 ISD opcodes into SAE and current direction nodes. Remove rounding mode operand. llvm-svn: 355805	2019-03-11 04:36:57 +00:00
Craig Topper	ba7d654526	[X86] Rename _RND versions of RANGE/REDUCE/GETMANT/RNDSCALE ISD opcodes to _SAE. Remove SAE operand. No need to explicitly store it and match it during isel. llvm-svn: 355804	2019-03-11 04:36:55 +00:00
Craig Topper	244ffcdf0d	[X86] Rename X86ISD::CVTPH2PS_RND to CVTPH2PS_SAE. Remove SAE operand. llvm-svn: 355803	2019-03-11 04:36:53 +00:00
Craig Topper	6059b1737e	[X86] Rename the CVTT*_RND ISD nodes to _SAE and remove the SAE operand. Split VFPROUNDS_RND/VFPEXT(S)_RND into versions without rounding operand. For VFPEXT(S) we only need current rounding mode and an SAE version. Neither needs an extra operand. llvm-svn: 355802	2019-03-11 04:36:51 +00:00
Craig Topper	4c544ca993	[X86] Rename X86ISD::CMPM_RND and X86ISD::FSETCCM_RND to _SAE instead of _RND. Remove rounding operand. The operand could only be the SAE encoding so no need to include it. llvm-svn: 355801	2019-03-11 04:36:49 +00:00
Craig Topper	704303a2a1	[X86] Split the VFIXUPIMM/VFIXUPIMMS nodes into a current rounding mode and SAE ISD opcode. Remove matching of FROUND_CURRENT and FROUND_NO_EXC for these nodes from isel table. llvm-svn: 355800	2019-03-11 04:36:47 +00:00
Craig Topper	b7e6bfe579	[X86] Begin removing matching of FROUND_CURRENT and FROUND_NO_EXC from isel tables. Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC. This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS. llvm-svn: 355799	2019-03-11 04:36:44 +00:00
Sanjay Patel	26e06e859e	[x86] add x86-specific opcodes to extractelement scalarization list llvm-svn: 355792	2019-03-10 18:56:21 +00:00
Craig Topper	66c9690ad6	[X86] Remove unused variable. NFC llvm-svn: 355790	2019-03-10 17:36:41 +00:00
Craig Topper	93e15dfacc	[X86] Make lowering of intrinsics with rounding mode stricter so that only valid rounding modes are lowered. Update tests accordingly Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates. With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure. llvm-svn: 355789	2019-03-10 17:20:45 +00:00
Craig Topper	0dc8c52d4e	[X86] Remove dead code from the handler for INTR_TYPE_SCALAR_MASK_RM. The code in here handles nodes with 6 or 7 operands. But only the 6 operand case is ever used these days. llvm-svn: 355788	2019-03-10 17:20:42 +00:00
Sanjay Patel	f84083b4db	[x86] scalarize extract element 0 of FP cmp An extension of D58282 noted in PR39665: https://bugs.llvm.org/show_bug.cgi?id=39665 This doesn't answer the request to use movmsk, but that's an independent problem. We need this and probably still need scalarization of FP selects because we can't do that as a target-independent transform (although it seems likely that targets besides x86 should have this transform). llvm-svn: 355741	2019-03-08 21:54:41 +00:00
Sanjay Patel	b22f438df3	[x86] prevent infinite looping from inverse shuffle transforms llvm-svn: 355713	2019-03-08 19:20:28 +00:00
Craig Topper	3acc4236b8	[X86] Enable combineFMinNumFMaxNum for 512 bit vectors when AVX512 is enabled. Simplified by just checking if the vector type is legal rather than listing all combinations of types and features. Fixes PR40984. llvm-svn: 355582	2019-03-07 06:30:19 +00:00
Simon Pilgrim	9d6347cfc1	[DAGCombine] Improve select (not Cond), N1, N2 -> select Cond, N2, N1 fold Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well. Requested by @spatel on D59006 llvm-svn: 355533	2019-03-06 18:52:52 +00:00
Simon Pilgrim	468bb2e601	[X86][SSE] VSELECT(XOR(Cond,-1), LHS, RHS) --> VSELECT(Cond, RHS, LHS) As noticed on D58965 DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point. Differential Revision: https://reviews.llvm.org/D58974 llvm-svn: 355494	2019-03-06 10:54:43 +00:00
Jeremy Morse	09d8ea5282	[X86] Avoid codegen changes when DBG_VALUE appears between lowered selects X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo instructions without accounting for debug intrinsics. This leads to different codegen with and without option -g, if a DBG_VALUE instruction lands in the middle of several lowered selects. Work around this by skipping over debug instructions when looking for CMOV sequences, and sinking those debug insts into the EmitLoweredSelect sunk block. This might slightly shift where variables appear in the instruction sequence, but won't re-order assignments. Differential Revision: https://reviews.llvm.org/D58672 llvm-svn: 355307	2019-03-04 10:56:02 +00:00
Simon Pilgrim	e48be5d698	Remove unused variable. NFCI. llvm-svn: 355289	2019-03-03 14:23:07 +00:00
Simon Pilgrim	d8e91a54c0	[X86] getShuffleScalarElt - peek through insert/extract subvector nodes. llvm-svn: 355288	2019-03-03 14:11:05 +00:00
Simon Pilgrim	11149ea433	[X86] Pull out combineToConsecutiveLoads helper. NFCI. llvm-svn: 355287	2019-03-03 13:53:27 +00:00
Amaury Sechet	f24abf6511	[X86] Improve use of SHLD/SHRD Summary: This extends the variety of patterns that can generate a SHLD instead of using two shifts. This fixes a regression that would be introduced by D57367 or D33587 Reviewers: RKSimon, craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D57389 llvm-svn: 355260	2019-03-02 02:44:16 +00:00
Sanjay Patel	7fc6ef7dd7	[x86] scalarize extract element 0 of FP math This is another step towards ensuring that we produce the optimal code for reductions, but there are other potential benefits as seen in the tests diffs: 1. Memory loads may get scalarized resulting in more efficient code. 2. Memory stores may get scalarized resulting in more efficient code. 3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch. 4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch. The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions. Differential Revision: https://reviews.llvm.org/D58282 llvm-svn: 355130	2019-02-28 19:47:04 +00:00
Craig Topper	38427c47b9	[X86] Don't peek through bitcasts before checking ISD::isBuildVectorOfConstantSDNodes in combineTruncatedArithmetic We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop. Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding. Differential Revision: https://reviews.llvm.org/D58705 llvm-svn: 355116	2019-02-28 18:49:29 +00:00
Bjorn Pettersson	d30f308a9f	Add support for computing "zext of value" in KnownBits. NFCI Summary: The description of KnownBits::zext() and KnownBits::zextOrTrunc() has confusingly been telling that the operation is equivalent to zero extending the value we're tracking. That has not been true, instead the user has been forced to explicitly set the extended bits as known zero afterwards. This patch adds a second argument to KnownBits::zext() and KnownBits::zextOrTrunc() to control if the extended bits should be considered as known zero or as unknown. Reviewers: craig.topper, RKSimon Reviewed By: RKSimon Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58650 llvm-svn: 355099	2019-02-28 15:45:29 +00:00
Simon Pilgrim	134bc19079	[X86][AVX] Remove superfluous insert_subvector(zero, bitcast(x)) -> bitcast(insert_subvector(zero, x)) fold This is caught by other existing bitcast folds. llvm-svn: 355084	2019-02-28 11:39:52 +00:00
Simon Pilgrim	87aeff8bbb	[X86][AVX] Fold vf64 concat_vectors(movddup(x),movddup(x)) -> broadcast(x) llvm-svn: 355078	2019-02-28 10:53:58 +00:00
Simon Pilgrim	1001a6ab03	[X86][AVX] Pull out some INSERT_SUBVECTOR combines into a combineConcatVectorOps helper. NFCI A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node. I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered. Differential Revision: https://reviews.llvm.org/D58583 llvm-svn: 355015	2019-02-27 18:46:32 +00:00
Simon Pilgrim	71bb6850cf	[X86][AVX] Only combine loads to broadcasts for legal types Thanks to @echristo for spotting this. llvm-svn: 354961	2019-02-27 11:17:25 +00:00
Reid Kleckner	2f055f026a	[X86] Fix bug in x86_intrcc with arg copy elision Summary: Use a custom calling convention handler for interrupts instead of fixing up the locations in LowerMemArgument. This way, the offsets are correct when constructed and we don't need to account for them in as many places. Depends on D56883 Replaces D56275 Reviewers: craig.topper, phil-opp Subscribers: hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D56944 llvm-svn: 354837	2019-02-26 02:11:25 +00:00
Simon Pilgrim	c61f1e8e6c	[X86] Merge ISD::ADD/SUB nodes into X86ISD::ADD/SUB equivalents (PR40483) Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results. Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB. Differential Revision: https://reviews.llvm.org/D58597 llvm-svn: 354771	2019-02-25 11:19:37 +00:00
Simon Pilgrim	cfaf663a35	[X86] Combine zext(packus(x),packus(y)) -> concat(x,y) (PR39637) It's proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step. llvm-svn: 354757	2019-02-24 19:57:52 +00:00
Craig Topper	be3348573e	[LegalizeTypes][AArch64][X86] Make type legalization of vector (S/U)ADD/SUB/MULO follow getSetCCResultType for the overflow bits. Make UnrollVectorOverflowOp properly convert from scalar boolean contents to vector boolean contents Summary: When promoting the overflow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it. We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used. Reviewers: spatel, RKSimon, nikic Reviewed By: nikic Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58567 llvm-svn: 354753	2019-02-24 19:23:36 +00:00
Simon Pilgrim	4f4f9abdfa	[X86][AVX] Rename lowerShuffleByMerging128BitLanes to lowerShuffleAsLanePermuteAndRepeatedMask. NFC. Name better matches the other similar 'lane permute' and 'repeated mask' functions we have. llvm-svn: 354749	2019-02-24 17:30:06 +00:00
Craig Topper	be9eeb5526	Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" And its follow ups r354511, r354640. A follow-up patch will fix the issue that caused it to be reverted. llvm-svn: 354737	2019-02-23 21:41:42 +00:00
Craig Topper	ccc860cb81	Recommit r354647 and r354648 "[LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract" r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit." These were reverted in r354713 as their context depended on other patches that were reverted for a bug. llvm-svn: 354734	2019-02-23 19:51:32 +00:00
Simon Pilgrim	f383a47b7d	[X86][AVX] combineInsertSubvector - remove concat_vectors(load(x),load(x)) --> sub_vbroadcast(x) D58053/rL354340 added this to EltsFromConsecutiveLoads directly llvm-svn: 354732	2019-02-23 18:53:03 +00:00
Simon Pilgrim	e08f177ea2	[X86][AVX] concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) --> broadcast(x) For AVX1, limit this to i32/f32/i64/f64 loading cases only. llvm-svn: 354730	2019-02-23 18:34:05 +00:00
Simon Pilgrim	31793733a0	[X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 it's just a single VPERMPD), followed by a BLENDPD. llvm-svn: 354729	2019-02-23 17:10:47 +00:00
Reid Kleckner	e3876637cf	Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" r354363 caused https://crbug.com/934963#c1, which has a plain C reduced test case. I also had to revert some dependent changes: - r354648 - r354647 - r354640 - r354511 llvm-svn: 354713	2019-02-23 01:19:42 +00:00
Craig Topper	a9697f24cf	[X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalizer to 128 bits. If the input type will be promoted to 128 bits, it's better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register. Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization. llvm-svn: 354709	2019-02-23 00:35:02 +00:00
Sanjay Patel	a9e289174a	[x86] allow narrowing of vector UINT_TO_FP As discussed in: D56864 D58197 Always use the narrow (128-bit) instruction when possible. We already had the signed int version of this transform. llvm-svn: 354675	2019-02-22 15:47:45 +00:00
Sanjay Patel	1baf7896cc	[x86] simplify code in combineExtractSubvector; NFC Only the 1st fold is attempted pre-legalization, but it requires legal (simple) types too, so we don't need an EVT in any of the code. llvm-svn: 354674	2019-02-22 15:28:22 +00:00
Craig Topper	3a391fc0e8	[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit. Type legalization is causing two nodes to be created here, but we can use a single node to extend from v8i16 to v2i64. llvm-svn: 354648	2019-02-22 01:49:53 +00:00
Sanjay Patel	234a5e8ea4	[x86] vectorize more cast ops in lowering to avoid register file transfers This is a follow-up to D56864. If we're extracting from a non-zero index before casting to FP, then shuffle the vector and optionally narrow the vector before doing the cast: cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0 This might be enough to close PR39974: https://bugs.llvm.org/show_bug.cgi?id=39974 Differential Revision: https://reviews.llvm.org/D58197 llvm-svn: 354619	2019-02-21 20:40:39 +00:00
Nirav Dave	dce91c1edb	[X86] Fix copy-paste error in @ccz flag. @ccz operand should be equivalent to @cce. llvm-svn: 354588	2019-02-21 15:28:31 +00:00
Simon Pilgrim	e6b338cbef	[X86][SSE] combineX86ShufflesRecursively - moved to generic op input index lookup. NFCI. We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement. llvm-svn: 354575	2019-02-21 12:24:49 +00:00
Nikita Popov	c3b496de7a	[SDAG] Support vector UMULO/SMULO Second part of https://bugs.llvm.org/show_bug.cgi?id=40442. This adds an extra UnrollVectorOverflowOp() method to SDAG, because the general UnrollOverflowOp() method can't deal with multiple results. Additionally we need to expand UMULO/SMULO during vector op legalization, as it may result in unrolling, which may need additional type legalization. Differential Revision: https://reviews.llvm.org/D57997 llvm-svn: 354513	2019-02-20 20:41:44 +00:00
Simon Pilgrim	dca47c659c	[X86][SSE] combineX86ShufflesRecursively - begin generalizing the number of shuffle inputs. NFCI. We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit. llvm-svn: 354489	2019-02-20 17:58:29 +00:00
Simon Pilgrim	0b3b9424ca	[X86][SSE] Generalize X86ISD::BLENDI support to more value types D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required. With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors. This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts. I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case). The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV. Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361 Differential Revision: https://reviews.llvm.org/D57888 llvm-svn: 354363	2019-02-19 18:05:42 +00:00
Simon Pilgrim	d6add74915	Cast from SDValue directly instead of superfluous getNode(). NFCI. llvm-svn: 354343	2019-02-19 16:20:09 +00:00
Simon Pilgrim	952abcefe4	[X86][AVX] EltsFromConsecutiveLoads - Add BROADCAST lowering support This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads. It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load. Differential Revision: https://reviews.llvm.org/D58053 llvm-svn: 354340	2019-02-19 15:57:09 +00:00
Sanjay Patel	d8b4efcb6b	[CGP] form usub with overflow from sub+icmp The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487: https://bugs.llvm.org/show_bug.cgi?id=31754 https://bugs.llvm.org/show_bug.cgi?id=40487 ..and those are shown in the IR test file and x86 codegen file. Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use. This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo formation by default. Only x86 overrides that setting for now although other targets will likely benefit by forming usubo too. Differential Revision: https://reviews.llvm.org/D57789 llvm-svn: 354298	2019-02-18 23:33:05 +00:00
Sanjay Patel	fff628274d	[x86] split more v8f32/v8i32 shuffles in lowering Similar to D57867 - this is a small patch with lots of test diffs. With half-vector-width narrowing potential, using an extract + 128-bit vshufps is a win because it replaces a 256-bit shuffle with a 128-bit shuffle. This seems like it should be a win even for targets with 'fast-variable-shuffle', but we are intentionally deferring that to an independent change to make sure that is true. Differential Revision: https://reviews.llvm.org/D58181 llvm-svn: 354279	2019-02-18 16:46:12 +00:00
Craig Topper	ce3c5ac6a6	[X86] In FP_TO_INTHelper, when moving data from SSE register to X87 register file via the stack, use the same stack slot we use for the integer conversion. No need for a separate stack slot. The lifetimes don't overlap. Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot. llvm-svn: 354234	2019-02-17 19:23:49 +00:00
Craig Topper	db5aa955cb	[X86] When type legalizing the result of an i64 fp_to_uint on 32-bit targets, generate all of the ops as i64 and let them be legalized. No need to manually split everything. We can let the type legalizer work for us. The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86ISD::CMP. This dropped the use count of the setcc, but wasn't what was intended. llvm-svn: 354197	2019-02-16 08:25:42 +00:00
Craig Topper	db2f084aa9	[X86] Don't set exception mask bits when modifying FPCW to change rounding mode for fp->int conversion When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption. This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block) Differential Revision: https://reviews.llvm.org/D57788 llvm-svn: 354178	2019-02-15 21:59:33 +00:00
Nirav Dave	7875841121	[X86] Fix LowerAsmOutputForConstraint. Summary: Update Flag when generating cc output. Fixes PR40737. Reviewers: rnk, nickdesaulniers, craig.topper, spatel Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58283 llvm-svn: 354163	2019-02-15 20:01:55 +00:00
Craig Topper	9c6a9276da	[X86] Move all the SSE legality checks out of FP_TO_INTHelper and up to LowerFP_TO_INT. NFCI These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there. Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner. llvm-svn: 354161	2019-02-15 19:21:39 +00:00
Simon Pilgrim	6ce08672fb	[X86][AVX] lowerShuffleAsLanePermuteAndPermute - fully populate the lane shuffle mask (PR40730) As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index. This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128. This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but it's the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch. Differential Revision: https://reviews.llvm.org/D58237 llvm-svn: 354117	2019-02-15 11:39:21 +00:00
Nirav Dave	5ffdc43dc9	[X86] cleanup inline asm register generation. NFCI. llvm-svn: 354042	2019-02-14 18:06:21 +00:00
Craig Topper	1d158dd930	[X86] Make (f80 (sint_to_fp (i16))) use fistps/fisttps instead of fistpl/fisttpl when SSE is enabled. When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion. Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type. llvm-svn: 354003	2019-02-14 01:41:43 +00:00
Craig Topper	9b61f48e4b	[X86] Use default expansion for (i64 fp_to_uint f80) when avx512 is enabled on 64-bit targets to match what happens without avx512. In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this. llvm-svn: 353921	2019-02-13 07:42:34 +00:00
Craig Topper	3099e442a6	[X86] Refactor the FP_TO_INTHelper interface. NFCI -Pull the final stack load creation from the two callers into the helper. -Return a single SDValue instead of a std::pair. -Remove the Replace flag which isn't really needed. llvm-svn: 353920	2019-02-13 07:42:31 +00:00
Simon Pilgrim	5338f41ced	[X86][AVX] Enable shuffle combining support for zero_extend A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars. llvm-svn: 353860	2019-02-12 17:22:35 +00:00
Craig Topper	7670ede434	[X86] Collapse FP_TO_INT16_IN_MEM/FP_TO_INT32_IN_MEM/FP_TO_INT64_IN_MEM into a single opcode using memory VT to distinguish. NFC llvm-svn: 353798	2019-02-12 06:14:18 +00:00
Craig Topper	d7303ecd0b	[X86] Remove the value type operand from the floating point load/store MemIntrinsicSDNodes. Use the MemoryVT instead. NFCI We already have the memory VT, we can just match from that during isel. llvm-svn: 353797	2019-02-12 06:14:16 +00:00
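
The FPCW change described in commit db2f084aa9 above can be illustrated with plain bit arithmetic. This is a minimal sketch, not the LLVM lowering code: the function names, the constant names, and the sample control-word value 0x0340 are hypothetical and chosen only for illustration. The values 0xc7f and 0xc00 come from the commit message, and the bit positions (exception masks in bits 0-5, rounding control in bits 10-11) follow the standard x87 control word layout.

```python
# Sketch of the two ways of deriving a temporary x87 control word (FPCW)
# for an fp->int conversion, as described in commit db2f084aa9.
# All names here are hypothetical; only 0xc7f and 0xc00 come from the commit message.

EXCEPTION_MASKS = 0x003F   # bits 0-5: x87 exception mask bits
ROUNDING_CONTROL = 0x0C00  # bits 10-11: 0b11 selects round-toward-zero


def fpcw_forced(old_fpcw: int) -> int:
    """Old behaviour: ignore the saved control word and load the constant 0xc7f."""
    return 0x0C7F


def fpcw_or_rounding(old_fpcw: int) -> int:
    """New behaviour: OR 0xc00 into the saved value so only rounding control changes."""
    return old_fpcw | ROUNDING_CONTROL


if __name__ == "__main__":
    # Hypothetical non-default environment: all exception mask bits cleared.
    custom = 0x0340
    print(hex(fpcw_forced(custom)))       # mask bits are forced to 1
    print(hex(fpcw_or_rounding(custom)))  # mask bits stay as the caller set them
```

With a non-default environment such as the hypothetical 0x0340, the forced constant turns every exception mask bit back on, while the OR form leaves them cleared and changes only the rounding-control field. That is the assumption the commit message says should no longer be made.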


6149 Commits