llvm-project

Author	SHA1	Message	Date
Simon Pilgrim	8ae7e41bea	Fix spelling in comments. NFCI. llvm-svn: 307288	2017-07-06 18:17:07 +00:00
Simon Pilgrim	713600747e	[X86][SSE4A] Add support for shuffle combining to INSERTQI. llvm-svn: 307268	2017-07-06 15:34:17 +00:00
Simon Pilgrim	7b79fbd4ea	[X86][SSE] combineX86ShuffleChain - merge duplicate creations of integer mask types llvm-svn: 307257	2017-07-06 13:09:19 +00:00
Simon Pilgrim	77ad6d9bb2	[X86][SSE] combineX86ShuffleChain - merge duplicate 'Zeroable' element masks llvm-svn: 307255	2017-07-06 12:40:10 +00:00
Simon Pilgrim	cc0f785dca	[X86][SSE4A] Add support for shuffle combining to EXTRQ. llvm-svn: 307254	2017-07-06 12:22:58 +00:00
Simon Pilgrim	1dd0bd1949	[X86][SSE4A] Split EXTRQ/INSERTQ shuffle matching from lowering. NFCI. First step toward supporting shuffle combining to EXTRQ/INSERTQ. llvm-svn: 307250	2017-07-06 11:06:54 +00:00
Simon Pilgrim	ac3e7f3f57	[X86][SSE4A] Add support for combining from non-v16i8 EXTRQI/INSERTQI shuffles With the improved shuffle decoding we can now combine EXTRQI/INSERTQI shuffles from non-v16i8 vector types llvm-svn: 307099	2017-07-04 18:11:02 +00:00
Simon Pilgrim	9f0a0bd20b	[X86][SSE4A] Generalized EXTRQI/INSERTQI shuffle decodes The existing decodes only worked for v16i8 vectors, this adds support for any 128-bit vector llvm-svn: 307095	2017-07-04 16:53:12 +00:00
Simon Pilgrim	fa6e675267	[X86][SSE4A] Add support for combining from EXTRQI/INSERTQI shuffles llvm-svn: 307048	2017-07-03 20:58:16 +00:00
Simon Pilgrim	a9655ffb42	[X86][AVX512VPOPCNTDQ] Improve support for v16i8/v8i16/v16i16/ CTPOP Zero extend to v16i32/v8i64, use VPOPCNTDQ instructions and truncate back. llvm-svn: 306990	2017-07-02 19:32:37 +00:00
Simon Pilgrim	8971b2904e	[X86][SSE] Attempt to combine 64-bit and 32-bit shuffles to unary shuffles before bit shifts We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads plus the destination register is destructive llvm-svn: 306978	2017-07-02 14:16:25 +00:00
Simon Pilgrim	4cb5613c38	[X86][SSE] Attempt to combine 64-bit and 16-bit shuffles to unary shuffles before bit shifts We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads plus the destination register is destructive The 32-bit shuffles are a bit tricky and will be dealt with in a later patch llvm-svn: 306977	2017-07-02 13:19:10 +00:00
Simon Pilgrim	724990ab64	[X86][SSE] Pulled common variables to top of matchUnaryPermuteVectorShuffle. NFCI. llvm-svn: 306847	2017-06-30 18:00:14 +00:00
Ayman Musa	721d97f7b8	Recommitting rL305465 after fixing bug in TableGen in rL306251 & rL306371 [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes. The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class. When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction. Differential Revision: https://reviews.llvm.org/D33188 llvm-svn: 306402	2017-06-27 12:08:37 +00:00
Galina Kistanova	06a0e0e6a9	Fixed the warning introduced by r306289 to make ubuntu-gcc7.1-werror bot green. llvm-svn: 306369	2017-06-27 06:58:57 +00:00
Sanjay Patel	15748d239e	[x86] transform vector inc/dec to use -1 constant (PR33483) Convert vector increment or decrement to sub/add with an all-ones constant: add X, <1, 1...> --> sub X, <-1, -1...> sub X, <1, 1...> --> add X, <-1, -1...> The all-ones vector constant can be materialized using a pcmpeq instruction that is commonly recognized as an idiom (has no register dependency), so that's better than loading a splat 1 constant. AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better way to produce 512 one-bits. The general advantages of this lowering are: 1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables, so in theory, this could be better for perf, but... 2. That seems unlikely to affect any OOO implementation, and I can't measure any real perf difference from this transform on Haswell or Jaguar, but... 3. It doesn't look like it from the diffs, but this is an overall size win because we eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting a scalar load (which might itself be a bug), then we're replacing a scalar constant load + broadcast with a single cheap op, so that should always be smaller/better too. 4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1 and psub x, -1, so we should use that form for +1 too because we can. If there's some reason to favor a constant load on some CPU, let's make the reverse transform for all of these cases (either here in the DAG or in a later machine pass). This should fix: https://bugs.llvm.org/show_bug.cgi?id=33483 Differential Revision: https://reviews.llvm.org/D34336 llvm-svn: 306289	2017-06-26 14:19:26 +00:00
Sanjay Patel	3de6bad65f	[x86] fix value types for SBB transform (PR33560) I'm not sure yet why this wouldn't fail in the simple case, but clearly I used the wrong value type with: https://reviews.llvm.org/rL306040 ...and the bug manifests with: https://bugs.llvm.org/show_bug.cgi?id=33560 llvm-svn: 306139	2017-06-23 18:42:15 +00:00
Simon Pilgrim	6e85e92b6c	Remove trailing whitespace. NFCI. llvm-svn: 306121	2017-06-23 16:35:32 +00:00
Sanjay Patel	359ae44fb4	[x86] add/sub (X==0) --> sbb(cmp X, 1) This is very similar to the transform in: https://reviews.llvm.org/rL306040 ...but in this case, we use cmp X, 1 to set the carry bit as needed. Again, we can show that all of these are logically equivalent (although InstCombine currently canonicalizes to a form not seen here), and if we believe IACA, then this is the smallest/fastest code. Eg, with SNB: \| Num Of \| Ports pressure in cycles \| \| \| Uops \| 0 - DV \| 1 \| 2 - D \| 3 - D \| 4 \| 5 \| \| --------------------------------------------------------------------- \| 1 \| 1.0 \| \| \| \| \| \| \| cmp edi, 0x1 \| 2 \| \| 1.0 \| \| \| \| 1.0 \| CP \| sbb eax, eax The larger motivation is to clean up all select-of-constants combining/lowering because we're missing some common cases. llvm-svn: 306072	2017-06-22 23:47:15 +00:00
Sanjay Patel	41a34e4111	[x86] add/sub (X==0) --> sbb(neg X) Our handling of select-of-constants is lumpy in IR (https://reviews.llvm.org/D24480), lumpy in DAGCombiner, and lumpy in X86ISelLowering. That's why we only had the 'sbb' codegen in 1 out of the 4 tests. This is a step towards smoothing that out. First, show that all of these IR forms are equivalent: http://rise4fun.com/Alive/mx Second, show that the 'sbb' version is faster/smaller. IACA output for SandyBridge (later Intel and AMD chips are similar based on Agner's tables): This is the "obvious" x86 codegen (what gcc appears to produce currently): \| Num Of \| Ports pressure in cycles \| \| \| Uops \| 0 - DV \| 1 \| 2 - D \| 3 - D \| 4 \| 5 \| \| --------------------------------------------------------------------- \| 1* \| \| \| \| \| \| \| \| xor eax, eax \| 1 \| 1.0 \| \| \| \| \| \| CP \| test edi, edi \| 1 \| \| \| \| \| \| 1.0 \| CP \| setnz al \| 1 \| \| 1.0 \| \| \| \| \| CP \| neg eax This is the adc version: \| 1* \| \| \| \| \| \| \| \| xor eax, eax \| 1 \| 1.0 \| \| \| \| \| \| CP \| cmp edi, 0x1 \| 2 \| \| 1.0 \| \| \| \| 1.0 \| CP \| adc eax, 0xffffffff And this is sbb: \| 1 \| 1.0 \| \| \| \| \| \| \| neg edi \| 2 \| \| 1.0 \| \| \| \| 1.0 \| CP \| sbb eax, eax If IACA is trustworthy, then sbb became a single uop in Broadwell, so this will be clearly better than the alternatives going forward. llvm-svn: 306040	2017-06-22 18:11:19 +00:00
whitequark	cebe8241ca	[X86] Add support for "probe-stack" attribute This commit adds prologue code emission for stack probe function calls. Reviewed By: majnemer Differential Revision: https://reviews.llvm.org/D34387 llvm-svn: 306010	2017-06-22 15:42:53 +00:00
Elena Demikhovsky	2dac0b4d58	AVX-512: Lowering Masked Gather intrinsic - fixed a bug Masked gather for vector length 2 is lowered incorrectly for element type i32. The type <2 x i32> was automatically extended to <2 x i64> and we generated VPGATHERQQ instead of VPGATHERQD. The type <2 x float> is extended to <4 x float>, so there is no bug for this type, but the sequence may be more optimal. In this patch I'm fixing the <2 x i32> bug and optimizing the <2 x float> sequence for GATHERs only. The same fix should be done for Scatters as well. Differential Revision: https://reviews.llvm.org/D34343 llvm-svn: 305987	2017-06-22 06:47:41 +00:00
Sanjay Patel	cec6a500a8	[x86] fix formatting; NFC llvm-svn: 305914	2017-06-21 14:27:11 +00:00
Sanjay Patel	0656629b87	[x86] enable CGP memcmp() expansion for 2/4/8 byte sizes There are a couple of potential improvements as seen in the IR and asm: 1. We're unnecessarily extending to a larger type to compare values. 2. The codegen for (select cond, 1, -1) could avoid a cmov. (or we could change the order of the compares, so we have a select with 0 operand) llvm-svn: 305802	2017-06-20 15:58:30 +00:00
Simon Pilgrim	4822b5b649	[X86][SSE] Relax 0/-1 vector element insertion to work for any vector with >=16bit elements Shuffle lowering/combining now does a good job for 256/512-bit vectors - we don't need to prevent this llvm-svn: 305801	2017-06-20 15:19:02 +00:00
Simon Pilgrim	b233c0a5d2	[X86][SSE] Dropped old INSERT_VECTOR_ELT lowering TODO Target shuffle combining now supports the matching of INSERT_VECTOR_ELT/PINSRW/PINSRB for merging multiple insertions into shuffles/bitmasks. llvm-svn: 305788	2017-06-20 10:33:34 +00:00
Simon Pilgrim	07cfc80186	Remove trailing whitespace. NFCI. llvm-svn: 305476	2017-06-15 16:20:27 +00:00
Simon Pilgrim	4d432b2c6b	[X86][AVX2] Fix issue in lowerV8I16GeneralSingleInputVectorShuffle that was assuming v8i16 vectors We can use this with v16i16/v32i16 as well. Found during fuzz testing. llvm-svn: 305472	2017-06-15 14:52:30 +00:00
Simon Pilgrim	b98cb3808c	Revert r305465: [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). This is causing windows buildbot failures llvm-svn: 305470	2017-06-15 14:39:34 +00:00
Ayman Musa	56912cda71	[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes. The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class. When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction. Differential Revision: https://reviews.llvm.org/D33188 llvm-svn: 305465	2017-06-15 13:02:37 +00:00
Sanjay Patel	83cb007940	[x86] avoid unnecessary shuffle mask math in combineX86ShufflesRecursively() This is a follow-up to https://reviews.llvm.org/D34174 / https://reviews.llvm.org/rL305398. We mentioned replacing the multiplies with shifts, but the real win seems to be in bypassing the extra ops in the common case when the RootRatio and OpRatio are one. This gives us another 1-2% overall win for the test in PR32037: https://bugs.llvm.org/show_bug.cgi?id=32037 llvm-svn: 305414	2017-06-14 20:37:11 +00:00
Sanjay Patel	ce0b99563a	[x86] replace div/rem with shift/mask for better shuffle combining perf We know that shuffle masks are power-of-2 sizes, but there's no way (?) for LLVM to know that, so hack combineX86ShufflesRecursively() to be much faster by replacing div/rem with shift/mask. This makes the motivating compile-time bug in PR32037 ( https://bugs.llvm.org/show_bug.cgi?id=32037 ) about 9% faster overall. Differential Revision: https://reviews.llvm.org/D34174 llvm-svn: 305398	2017-06-14 17:00:57 +00:00
Simon Pilgrim	9ff06a0c7e	Strip UTF8 BOM that got added in rL305091 Seems my recent move to VS2017 has resulted in a few text editor issues..... llvm-svn: 305285	2017-06-13 10:17:57 +00:00
Simon Pilgrim	2b3b717768	[X86][SSE] Refactor getTargetConstantBitsFromNode to avoid large APInts (PR32037) Much of PR32037's compile time regression is due to getTargetConstantBitsFromNode always creating large (>64bit) APInts during the bitcasting from the source data to the destination bitwidth. This commit avoids this bitcast stage if the data is already the correct bitwidth. llvm-svn: 305284	2017-06-13 10:13:48 +00:00
Sanjay Patel	d4765a38b4	[DAG] add helper to bind memop chains; NFCI This step is just intended to reduce code duplication rather than change any functionality. A follow-up would be to replace PPCTargetLowering::spliceIntoChain() usage with this new helper. Differential Revision: https://reviews.llvm.org/D33649 llvm-svn: 305192	2017-06-12 14:41:48 +00:00
Sanjay Patel	dcbfbb11d9	[x86] use vperm2f128 rather than vinsertf128 when there's a chance to fold a 32-byte load I was looking closer at the x86 test diffs in D33866, and the first change seems like it shouldn't happen in the first place. So this patch will resolve that. Using Agner's tables and AMD docs, vperm2f128 and vinsertf128 have identical timing for any given CPU model, so we should be able to interchange those without affecting perf. But as we can see in some of the diffs here, using vperm2f128 allows load folding, so we should take that opportunity to reduce code size and register pressure. A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128 was introduced with AVX1, we should be selecting it in all of the same situations that we would with AVX2. If there's some reason that an AVX1 CPU would not want to use this instruction, that should be fixed up in a later pass. Differential Revision: https://reviews.llvm.org/D33938 llvm-svn: 305171	2017-06-11 21:18:58 +00:00
Simon Pilgrim	3d37b1a277	[X86][SSE] Add support for PACKSS nodes to faux shuffle extraction If the inputs won't saturate during packing then we can treat the PACKSS as a truncation shuffle llvm-svn: 305091	2017-06-09 17:29:52 +00:00
Andrew V. Tischenko	e0531025f8	This patch closes PR28513: an optimization of multiplication by different constants. The initial patch was rejected: I fixed the issue and re-applied it. llvm-svn: 304972	2017-06-08 10:20:13 +00:00
Sanjay Patel	6e8e7cc70e	[x86] avoid flipping sign bits for vector icmp by using known bits If we know that both operands of an unsigned integer vector comparison are non-negative, then it's safe to directly use a signed-compare-greater-than instruction (the only non-equality integer vector compare predicate provided by SSE/AVX). We're intentionally not changing the condition code to signed in order to preserve the existing transforms that use min/max/psubus below here. This should solve PR33276: https://bugs.llvm.org/show_bug.cgi?id=33276 Differential Revision: https://reviews.llvm.org/D33862 llvm-svn: 304909	2017-06-07 13:46:34 +00:00
Simon Pilgrim	58f5be2771	[X86][SSE] Fix an issue with PEXTRW/PEXTRB indices during shuffle combining We were checking that the index was in range of the destination vector type, not the (larger) source vector type llvm-svn: 304894	2017-06-07 10:30:35 +00:00
Simon Pilgrim	b2ef948628	[X86][AVX1] Split 256-bit vector non-temporal loads to keep it non-temporal (PR32744) Differential Revision: https://reviews.llvm.org/D33728 llvm-svn: 304718	2017-06-05 16:02:01 +00:00
Simon Pilgrim	46dd55f1e1	[X86][SSE] Change BUILD_VECTOR interleaving ordering to improve coalescing/combine opportunities We currently generate BUILD_VECTOR as a tree of UNPCKL shuffles of the same type: e.g. for v4f32: Step 1: unpcklps 0, 2 ==> X: <?, ?, 2, 0> : unpcklps 1, 3 ==> Y: <?, ?, 3, 1> Step 2: unpcklps X, Y ==> <3, 2, 1, 0> The issue is that, because we are not placing sequential vector elements together early enough, we fail to recognise many combinable patterns - consecutive scalar loads, extractions etc. Instead, this patch unpacks progressively larger sequential vector elements together: e.g. for v4f32: Step 1: unpcklps 0, 1 ==> X: <?, ?, 1, 0> : unpcklps 2, 3 ==> Y: <?, ?, 3, 2> Step 2: unpcklpd X, Y ==> <3, 2, 1, 0> This does mean that we are creating UNPCKL shuffles of different value types, but the relevant combines that benefit from this are quite capable of handling the additional BITCASTs that are now included in the shuffle tree. Differential Revision: https://reviews.llvm.org/D33864 llvm-svn: 304688	2017-06-04 20:12:04 +00:00
Simon Pilgrim	f93debb40c	[X86][SSE] Add SCALAR_TO_VECTOR(PEXTRW/PEXTRB) support to faux shuffle combining Generalized existing SCALAR_TO_VECTOR(EXTRACT_VECTOR_ELT) code to support AssertZext + PEXTRW/PEXTRB cases as well. llvm-svn: 304659	2017-06-03 11:12:57 +00:00
Sanjay Patel	e737cf8500	[x86] simplify code for vector icmp pred transforms; NFCI Organizing by transform is smaller and easier to read than a squashed switch with fall-throughs. llvm-svn: 304611	2017-06-02 23:21:53 +00:00
Ahmed Bougacha	018a68f9e4	[X86] Correctly broadcast NaN-like integers as float on AVX. Since r288804, we try to lower build_vectors on AVX using broadcasts of float/double. However, when we broadcast integer values that happen to have a NaN float bitpattern, we lose the NaN payload, thereby changing the integer value being broadcast. This is caused by ConstantFP::get, to which we pass the splat i32 as a float (by bitcasting it using bitsToFloat). ConstantFP::get takes a double parameter, so we end up lossily converting a single-precision NaN to double-precision. Instead, avoid any kinds of conversions by directly building an APFloat from the splatted APInt. Note that this also fixes another piece of code (broadcast of subvectors), that currently isn't susceptible to the same problem. Also note that we could really just use APInt and ConstantInt throughout: the constant pool type doesn't matter much. Still, for consistency, use the appropriate type. llvm-svn: 304590	2017-06-02 20:02:59 +00:00
Sanjay Patel	469014ada4	[x86] fix formatting; NFCI llvm-svn: 304576	2017-06-02 18:14:31 +00:00
David Blaikie	b6b42e018a	Tidy up a bit of r304516, use SmallVector::assign rather than for loop This might give a few better opportunities to optimize these to memcpy rather than loops - also a few minor cleanups (StringRef-izing, templating (to avoid std::function indirection), etc). The SmallVector::assign(iter, iter) could be improved with the use of SFINAE, but the (iter, iter) ctor and append(iter, iter) need it too and don't have it - so, work around it for now rather than bothering with the added complexity. (also, as noted in the added FIXME, these assign ops could potentially be optimized better at least for non-trivially-copyable types) llvm-svn: 304566	2017-06-02 17:24:26 +00:00
Amaury Sechet	2adb7bdbca	Remove ADDC, ADDE, SUBC, SUBE and SETCCE support from the X86 backend, use the CARRY ops instead. Summary: As per title. This cleans up some technical debt. Depends on D33374 Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D33390 llvm-svn: 304435	2017-06-01 16:33:08 +00:00
Zvi Rackover	7693733e80	[X86] Match bitcast of vxi1 to pmovmsk Summary: Add an early combine to match patterns such as: (i16 bitcast (v16i1 x)) -> (i16 movmsk (v16i8 sext (v16i1 x))) This combine needs to happen early enough before type-legalization scalarizes the result of the setcc. Reviewers: igorb, craig.topper, RKSimon Subscribers: delena, llvm-commits Differential Revision: https://reviews.llvm.org/D33311 llvm-svn: 304406	2017-06-01 11:27:57 +00:00
Amaury Sechet	251ea8a4f8	Do not legalize large setcc with setcce, introduce setcccarry and do it with usubo/setcccarry. Summary: This is a continuation of the work started in D29872. Passing the carry down as a value rather than as a glue allows for further optimizations. Introducing setcccarry makes the use of addc/subc unnecessary and we can start the removal process. This patch only introduces the optimization strictly required to get the same level of optimization as was available before, nothing more. Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D33374 llvm-svn: 304404	2017-06-01 11:14:17 +00:00
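
Several of the commit messages above rest on small arithmetic identities: the vector inc/dec transform (r306289), the div/rem to shift/mask replacement for power-of-2 shuffle mask sizes (r305398), and the unsigned-to-signed vector compare for known non-negative operands (r304909). The following is a minimal sketch that checks those identities on plain 32-bit integers. It is not taken from the LLVM sources; every function name here is hypothetical, and a single 32-bit lane is assumed to stand in for a vector element.

```python
# Illustrative only: hypothetical helpers, not LLVM code.
# One 32-bit integer models one vector lane.

MASK32 = 0xFFFFFFFF   # all-ones lane, i.e. the constant -1
SIGN32 = 0x80000000   # sign bit of a 32-bit lane


def add_one(x):
    """add X, 1 with 32-bit wraparound."""
    return (x + 1) & MASK32


def sub_all_ones(x):
    """sub X, -1 with 32-bit wraparound (-1 is the all-ones constant)."""
    return (x - MASK32) & MASK32


def sub_one(x):
    """sub X, 1 with 32-bit wraparound."""
    return (x - 1) & MASK32


def add_all_ones(x):
    """add X, -1 with 32-bit wraparound."""
    return (x + MASK32) & MASK32


def div_rem_pow2(index, size):
    """Compute (index // size, index % size) using shift and mask.

    Only valid when size is a power of two, as the commit message
    states shuffle mask sizes are.
    """
    if size <= 0 or size & (size - 1) != 0:
        raise ValueError("size must be a power of two")
    shift = size.bit_length() - 1
    return index >> shift, index & (size - 1)


def to_signed32(x):
    """Reinterpret a 32-bit unsigned value as a signed one."""
    return x - (1 << 32) if x & SIGN32 else x


def unsigned_gt_via_signed(a, b):
    """Unsigned a > b evaluated with a signed compare.

    Only valid when both sign bits are known to be zero.
    """
    if (a | b) & SIGN32:
        raise ValueError("both operands must have a clear sign bit")
    return to_signed32(a) > to_signed32(b)
```

The two guarded helpers raise on inputs outside their preconditions, which mirrors the fact that each transform is only described as safe under the stated condition (power-of-2 size, known non-negative operands).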

4691 Commits