llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	fb2539b9d8	[X86][SSE] Add X86ISD::AVG to isCommutativeBinOp to support folding shuffles through the binop	2021-10-13 10:54:37 +01:00
Simon Pilgrim	cceceb7242	[X86][SSE] Add tests showing missing shuffle(avg(shuffle(),shuffle())) -> avg(shuffle(),shuffle()) fold X86ISD::AVG needs to be added to isCommutativeBinOp to use these folds.	2021-10-13 10:35:24 +01:00
Roman Lebedev	958da6598f	[X86] `detectAVGPattern()`: don't require zext in the with-constant case	2021-10-12 20:24:17 +03:00
Roman Lebedev	0902451abe	[NFC][X86] Add another test case for PR52131	2021-10-12 20:19:35 +03:00
Roman Lebedev	5f4f5da634	[X86] `detectAVGPattern()`: support basic case of PAVG chaining (PR52131) As noted in https://github.com/halide/Halide/pull/6302, we hilariously fail to match PAVG if we even as much as look at it the wrong way. In this particular case, the problem stems from the fact that `PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s, and InstCombine really loves to get rid of both of these, for example replace them with a bit mask. So we may not have said `zext`. Instead of checking for that + type match, i think we should rely on the actual active type, as per the knownbits. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111571	2021-10-12 19:51:43 +03:00
Guozhi Wei	6599961c17	[TwoAddressInstructionPass] Improve the SrcRegMap and DstRegMap computation This patch contains following enhancements to SrcRegMap and DstRegMap: 1 In findOnlyInterestingUse not only check if the Reg is two address usage, but also check after commutation can it be two address usage. 2 If a physical register is clobbered, remove SrcRegMap entries that are mapped to it. 3 In processTiedPairs, when create a new COPY instruction, add a SrcRegMap entry only when the COPY instruction is coalescable. (The COPY src is killed) With these enhancements isProfitableToCommute can do better commute decision, and finally more register copies are removed. Differential Revision: https://reviews.llvm.org/D108731	2021-10-11 15:28:31 -07:00
Roman Lebedev	7af6a44077	[NFC][X86][Codegen] Add semi-negative PAVG chain test (PR52131)	2021-10-11 22:13:38 +03:00
Roman Lebedev	76495ea317	[NFC][X86][Codegen] Add basic PAVG chain test (PR52131)	2021-10-11 21:42:32 +03:00
Roman Lebedev	f5753125f0	[Codegen][TLI][X86] SimplifyMultipleUseDemandedBits(): 0'th vec subreg widening is free, try to perform it earlier I believe, the profitability reasoning here is correct "sub"reg is already located within the 0'th subreg of wider reg, so if we have suvector insertion at index 0 into undef, then it's always free do to. After this, D109065 finally avoids the regression in D108382. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D109074	2021-09-02 00:54:05 +03:00
Roman Lebedev	0aef747b84	[NFC][X86][Codegen] Megacommit: mass-regenerate all check lines that were already autogenerated The motivation is that the update script has at least two deviations (`<...>@GOT`/`<...>@PLT`/ and not hiding pointer arithmetics) from what pretty much all the checklines were generated with, and most of the tests are still not updated, so each time one of the non-up-to-date tests is updated to see the effect of the code change, there is a lot of noise. Instead of having to deal with that each time, let's just deal with everything at once. This has been done via: ``` cd llvm-project/llvm/test/CodeGen/X86 grep -rl "; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py" \| xargs -L1 <...>/llvm-project/llvm/utils/update_llc_test_checks.py --llc-binary <...>/llvm-project/build/bin/llc ``` Not all tests were regenerated, however.	2021-06-11 23:57:02 +03:00
Simon Pilgrim	d6b68d1344	[X86][SSE] combineExtractWithShuffle - support zero-extending to allow extracting from narrow shuffle masks If the shuffle mask can't be widened to match the original extracted element width, see if the upper bits are zeroable - which allows us to extract+zero-extend the smaller extraction.	2021-01-29 14:22:10 +00:00
Simon Pilgrim	ce06475da9	[X86][AVX] IsElementEquivalent - add matchShuffleWithUNPCK + VBROADCAST/VBROADCAST_LOAD handling Specify LHS/RHS operands in matchShuffleWithUNPCK's calls to isTargetShuffleEquivalent, and handle VBROADCAST/VBROADCAST_LOAD matching in IsElementEquivalent	2021-01-18 15:55:00 +00:00
Simon Pilgrim	fc446935d7	[X86] detectAVGPattern - accept non-pow2 vectors by padding. Drop the pow2 vector limitation for AVG generation by padding the vector to the next pow2, creating the PAVG nodes and then extracting the final subvector. Fixes some poor codegen that has been annoying me for years.....	2020-09-15 10:07:03 +01:00
Simon Pilgrim	21d02dc595	[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support This patch uses partial DemandedElts masks to further simplify target shuffle chains and finally starts making target shuffle combining part of SimplifyDemandedBits/SimplifyDemandedVectorElts. We already manage this for Depth == 0 cases, where combineX86ShuffleChain would early-out if the shuffle combined to the same op, but the patch generalizes this by manipulating the depth handling of combineX86ShufflesRecursively - calling with a new Depth = 0 and reducing the maximum shuffle combine depth accordingly. Differential Revision: https://reviews.llvm.org/D66004	2020-09-02 09:24:46 +01:00
Simon Pilgrim	0c005be6eb	[X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast. getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements. I'm still investigating what we should do more generally to avoid these undemanded elements, but broadcast cases was a simpler win.	2020-07-29 11:32:44 +01:00
Simon Pilgrim	17eafe0841	[X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0, (elements 0,1 of the v4i32) which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls. By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.	2020-07-26 16:04:22 +01:00
Simon Pilgrim	77133cc1e2	[X86][AVX] Attempt to fold PACK(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(PACK(X,Y)). Truncations lowered as shuffles of multiple (concatenated) vectors often leave us with lane-crossing shuffles that feed a PACKSS/PACKUS, if both shuffles are fed from the same 2 vector sources, then we can PACK the sources directly and shuffle the result instead. This is currently limited to whole i128 lanes in a 256-bit vector, but we can extend this if the need arises (but I'm not seeing many examples in real world code).	2020-07-10 09:33:27 +01:00
Simon Pilgrim	e3c0be596c	[DAG] SimplifyDemandedVectorElts - add EXTRACT_SUBVECTOR SimplifyMultipleUseDemandedBits handling	2020-05-01 13:48:07 +01:00
Craig Topper	446a3be8f1	[X86] Add PACK instructions to hasUndefRegUpdate so the BreakFalseDeps pass will reassign an undef second source to match the first source We generate PACK instructions with an undef second source when we are truncating from a 128-bit vector to something narrower and we don't care about the upper bits of the vector register. The register allocation process will always assign untied undef uses to xmm0. This creates a false dependency on xmm0. By adding these instructions to hasUndefRegUpdate, we can get the BreakFalseDeps pass to reassign the source to match the other input. Normally this interface is used for instructions that might need an xor inserted to break the dependency. But the pass also has a heuristic that tries to use the same register as other sources. That should always be possible for these instructions so we'll never trigger the xor dependency break. Differential Revision: https://reviews.llvm.org/D79032	2020-04-28 15:11:32 -07:00
Craig Topper	8dfb9627b7	[X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead. This moves v32i16/v64i8 to a model consistent with how we treat integer types with avx1. This does change the ABI for types vXi16/vXi8 vectors larger than 512 bits to pass in multiple zmms instead of multiple ymms. We'd already hacked some code to make v64i8/v32i16 pass in zmm. Cost model is still a bit of a mess. In some place I tried to match existing behavior. But really we need to account for splitting and concating costs. Cost model for shuffles is especially pessimistic. Differential Revision: https://reviews.llvm.org/D76212	2020-04-15 12:17:18 -07:00
Simon Pilgrim	39a52a19ed	[X86] lowerV16I8Shuffle - create v8i16 mask for PACKUS(AND(),AND()) patterns. We can improve computeKnownBits results by avoiding excess bitcasts. For this pattern we were doing: (v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK)))) By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on. This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.	2020-03-26 19:59:57 +00:00
Simon Pilgrim	0105e9cd92	[X86][SSE] Add some additional irregular AVG tests Finally resurrecting D56506 and want to improve test coverage.	2020-03-22 14:28:31 +00:00
Simon Pilgrim	05c0d34918	[X86][SSE] Prefer trunc(movd(x)) to pextrb(x,0) If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB. The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4). This matches what we already do for PEXTRW. Differential Revision: https://reviews.llvm.org/D76138	2020-03-13 18:43:04 +00:00
Simon Pilgrim	d8f9416fdc	[DAG] MatchRotate - Add funnel shift by immediate support This patch reuses the existing MatchRotate ROTL/ROTR rotation pattern code to also recognize the more general FSHL/FSHR funnel shift patterns when we have constant shift amounts. Differential Revision: https://reviews.llvm.org/D75114	2020-03-11 18:55:18 +00:00
Simon Pilgrim	18c19441d1	[X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128 For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern. At some point soon we're going to have make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.	2020-03-10 10:44:28 +00:00
Craig Topper	3c4e635593	[X86] Always emit an integer vbroadcast_load from lowerBuildVectorAsBroadcast regardless of AVX vs AVX2 If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1. Differential Revision: https://reviews.llvm.org/D75413	2020-03-03 10:39:11 -08:00
Simon Pilgrim	e2c2eb0a55	[X86] Fix NSW/NUW typo in avg test (PR44973) The not_avg_v16i8_wide_constants test shouldn't assume NSW/NUW for the addition of -1 - copy + paste typo from other avg tests	2020-02-20 19:22:37 +00:00
Sanjay Patel	e48b536be6	[x86] form broadcast of scalar memop even with >1 use The unseen logic diff occurs because MayFoldLoad() is defined like this: static bool MayFoldLoad(SDValue Op) { return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode()); } The test diffs here all seem ok to me on screen/paper, but it's hard to know if that will lead to universally better perf for all targets. For example, if a target implements broadcast from mem as multiple uops, we would have to weigh the potential reduction of instructions and register pressure vs. possible increase in number of uops. I don't know if we can make a truly informed decision on this at compile-time. The motivating case that I'm looking at in PR42024: https://bugs.llvm.org/show_bug.cgi?id=42024 ...resembles the diff in extract-concat.ll, but we're not going to change the larger example there without at least 1 other fix. Differential Revision: https://reviews.llvm.org/D74088	2020-02-16 10:32:56 -05:00
Simon Pilgrim	8fbc7fd567	[DAG] SimplifyMultipleUseDemandedBits - peek through unused ISD::INSERT_SUBVECTOR subvectors If we don't demand any elements of the inserted subvector then just skip it.	2020-01-31 18:57:22 +00:00
Sanjay Patel	cb5612e2df	[DAGCombiner] reduce extract subvector of concat If we are extracting a chunk of a vector that's a fraction of an operand of the concatenated vector operand, we can extract directly from one of those original operands. This is another suggestion from PR42024: https://bugs.llvm.org/show_bug.cgi?id=42024#c2 But I'm not sure yet if it will make any difference on those patterns. It seems to help a few existing AVX512 tests though. Differential Revision: https://reviews.llvm.org/D72361	2020-01-09 09:38:12 -05:00
Craig Topper	18e8d02e8c	[X86] Pass v32i16/v64i8 in zmm registers on KNL target. gcc and icc pass these types in zmm registers in zmm registers. This patch implements a quick hack to override the register type before calling convention handling to one that is legal. Longer term we might want to do something similar to 256-bit integer registers on AVX1 where we just split all the operations. Fixes PR42957 Differential Revision: https://reviews.llvm.org/D66708 llvm-svn: 370495	2019-08-30 17:35:08 +00:00
Craig Topper	a0d92c7262	[X86] Teach lowerV4I32Shuffle to only use broadcasts if the mask has more than one undef element. Prioritize shifts over broadcast in lowerV8I16Shuffle. The motivating case are the changes in vector-reduce-add.ll where we were doing extra work in the scalar domain instead of shuffling. There may be some one use check that needs to be looked into there, but this patch sidesteps the issue by avoiding broadcasts that aren't really broadcasting. Differential Revision: https://reviews.llvm.org/D66071 llvm-svn: 369287	2019-08-19 18:15:50 +00:00
Craig Topper	8b5f2ab2a4	Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default." The assert that caused this to be reverted should be fixed now. Original commit message: This patch changes our defualt legalization behavior for 16, 32, and 64 bit vectors with i8/i16/i32/i64 scalar types from promotion to widening. For example, v8i8 will now be widened to v16i8 instead of promoted to v8i16. This keeps the elements widths the same and pads with undef elements. We believe this is a better legalization strategy. But it carries some issues due to the fragmented vector ISA. For example, i8 shifts and multiplies get widened and then later have to be promoted/split into vXi16 vectors. This has the potential to cause regressions so we wanted to get it in early in the 10.0 cycle so we have plenty of time to address them. Next steps will be to merge tests that explicitly test the command line option. And then we can remove the option and its associated code. llvm-svn: 368183	2019-08-07 16:24:26 +00:00
Mitch Phillips	bd0d97e1c4	Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default." This reverts commit `3de33245d2`. This commit broke the MSan buildbots. See https://reviews.llvm.org/rL367901 for more information. llvm-svn: 368107	2019-08-06 23:00:43 +00:00
Craig Topper	3de33245d2	[X86] Enable -x86-experimental-vector-widening-legalization by default. This patch changes our defualt legalization behavior for 16, 32, and 64 bit vectors with i8/i16/i32/i64 scalar types from promotion to widening. For example, v8i8 will now be widened to v16i8 instead of promoted to v8i16. This keeps the elements widths the same and pads with undef elements. We believe this is a better legalization strategy. But it carries some issues due to the fragmented vector ISA. For example, i8 shifts and multiplies get widened and then later have to be promoted/split into vXi16 vectors. This has the potential to cause regressions so we wanted to get it in early in the 10.0 cycle so we have plenty of time to address them. Next steps will be to merge tests that explicitly test the command line option. And then we can remove the option and its associated code. llvm-svn: 367901	2019-08-05 18:25:36 +00:00
Craig Topper	510e6fadaa	[X86] When using AND+PACKUS in lowerV16I8Shuffle, generate the build vector directly in v16i8 with the correct 0x00 or 0xFF elements rather than using another VT and bitcasting it. The build_vector will become a constant pool load. By using the desired type initially, it ensures we don't generate a bitcast of the constant pool load which will need to be folded with the load. While experimenting with another patch, I noticed that when the load type and the constant pool type don't match, then SimplifyDemandedBits can't handle it. While we should probably fix that, this was a simple way to fix the issue I saw. llvm-svn: 366732	2019-07-22 19:58:49 +00:00
Simon Pilgrim	c0711af7f9	[X86][AVX] combineExtractSubvector - 'little to big' extract_subvector(bitcast()) support Ideally this needs to be a generic combine in DAGCombiner::visitEXTRACT_SUBVECTOR but there's some nasty regressions in aarch64 due to neon shuffles not handling bitcasts at all..... llvm-svn: 364407	2019-06-26 11:21:09 +00:00
Sanjay Patel	606eb2367f	[x86] split 256-bit store of concatenated vectors This shows up as a side issue to the main problem for the AVX target example from PR37428: https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3 But as we can see in the pile of existing test diffs, it's actually a widespread problem that affects any AVX or later target. Apart from a couple of oddballs, I think these are all improvements for the reasons stated in the code comment: we do not want to enable YMM unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit stores anyway. We could say that MergeConsecutiveStores() is going overboard on some of these examples, but that won't solve the problem completely. But that is a reason I'm proposing this as a lowering rather than a combine: we will infinite loop fighting the merge code if we try this earlier. Differential Revision: https://reviews.llvm.org/D62498 llvm-svn: 362524	2019-06-04 16:40:04 +00:00
Sanjay Patel	f7980e727f	Revert "[x86] split 256-bit store of concatenated vectors" This reverts commit `d5a8637072`. Most likely suspect for this bot failure: http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/9684 llvm-svn: 361850	2019-05-28 17:37:58 +00:00
Sanjay Patel	d5a8637072	[x86] split 256-bit store of concatenated vectors This shows up as a side issue to the main problem for the AVX target example from PR37428: https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3 But as we can see in the pile of existing test diffs, it's actually a widespread problem that affects any AVX or later target. Apart from a couple of oddballs, I think these are all improvements for the reasons stated in the code comment: we do not want to enable YMM unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit stores anyway. We could say that MergeConsecutiveStores() is going overboard on some of these examples, but that won't solve the problem completely. But that is the reason I'm proposing this as a lowering rather than a combine: we will infinite loop fighting the merge code if we try this earlier. Differential Revision: https://reviews.llvm.org/D62498 llvm-svn: 361822	2019-05-28 13:54:17 +00:00
Simon Pilgrim	b0f51266b8	[X86][AVX] Fold concat(packus(),packus()) -> packus(concat(),concat()) (PR34773) Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining. llvm-svn: 360134	2019-05-07 11:17:39 +00:00
Craig Topper	3649c20884	[X86] Use INSERT_SUBREG rather than SUBREG_TO_REG when creating LEA64_32 during isel. SUBREG_TO_REG is supposed to be used to assert that we know the upper bits are zero. But that isn't the case here. We've done no analysis of the inputs. llvm-svn: 357673	2019-04-04 05:00:18 +00:00
Simon Pilgrim	10c9032c02	[X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316) Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits. llvm-svn: 357351	2019-03-30 17:12:29 +00:00
Simon Pilgrim	cfdf09ba7d	[X86][SSE] Add PAVG test case from PR41316 llvm-svn: 357346	2019-03-30 13:53:11 +00:00
Simon Pilgrim	a71c0ed471	[X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685) Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets). Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers. llvm-svn: 356858	2019-03-24 16:30:35 +00:00
Simon Pilgrim	65165d54bb	[X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW llvm-svn: 356270	2019-03-15 16:16:49 +00:00
Sanjay Patel	21aa6ddc14	[x86] narrow a shuffle that doesn't use or set any high elements This isn't the final fix for our reduction/horizontal codegen, but it takes care of a lot of the problems. After we narrow the shuffle, existing combines for insert/extract and binops kick in, and we end up with cheaper 128-bit ops. The avg and mul reduction tests show an existing shuffle lowering hole for AVX2/AVX512. I think in its most minimal form this is: https://bugs.llvm.org/show_bug.cgi?id=40434 ...but we might need multiple fixes to get it right. Differential Revision: https://reviews.llvm.org/D57156 llvm-svn: 352209	2019-01-25 15:37:42 +00:00
Craig Topper	0ec17884de	[LegalizeVectorTypes] Don't use SplitVecOp_TruncateHelper if we're heading towards scalarizing the type. This code takes a truncate, fp_to_int, or int_to_fp with a legal result type and an input type that needs to be split and enlarges the elements in the result type before doing the split. Then inserts a follow up truncate or fp_round after concatenating the two halves back together. But if the input type of the original op is being split on its way to ultimately being scalarized we're just going to end up building a vector from scalars and then truncating or rounding it in the vector register. Seems kind of silly to enlarge the result element type of the operation only to end up with scalar code and then building a vector with large elements only to make the elements smaller again in the vector register. Seems better to just try to get away producing smaller result types in the scalarized code. The X86 test case that changes is a pretty contrived test case that exists because of a bug we used to have in our AVG matching code. I think the code is better now, but its not realistic anyway. llvm-svn: 347482	2018-11-23 02:32:13 +00:00
Craig Topper	b239763384	[LegalizeVectorTypes] Have SplitVecOp_TruncateHelper fall back to SplitVecOp_UnaryOp if splitting the output type would be a legal type. SplitVecOp_TruncateHelper tries to introduce a multilevel truncate to avoid scalarization. But if splitting the result type would still be a legal type we don't need to do that. The comment block at the top of the function implied that this was already implemented. I looked back through the history and it doesn't look to have ever been checked. llvm-svn: 347479	2018-11-22 22:56:52 +00:00
Craig Topper	bc8148f7b0	[X86] Lower v16i16->v8i16 truncate using an 'and' with 255, an extract_subvector, and a packuswb instruction. Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D54671 llvm-svn: 347171	2018-11-18 17:59:28 +00:00

1 2 3

115 Commits