llvm-project

Commit Graph

Author	SHA1	Message	Date
David Green	f9aa8623fe	[ARM] Add more MVE intrinsics to sink splats to This adds a few more unpredicated intrinsics to sink splats to, in order to create more qr instruction variants. Notably this includes saddsat/uaddsat but also some of the unpredicated mve intrinsics. Differential Revision: https://reviews.llvm.org/D110333	2021-09-30 14:41:23 +01:00
Jay Foad	156d7d2df7	[LiveIntervals] Remove unused subreg ranges in repairIntervalsInRange If the old instructions mentioned a subreg that the new instructions do not, remove the subrange for that subreg. For example, in TwoAddressInstructionPass::eliminateRegSequence, if a use operand in the REG_SEQUENCE has the undef flag then we don't generate a copy for it so after the elimination there should be no live interval at all for the corresponding subreg of the def. This is a small step towards switching TwoAddressInstructionPass over from LiveVariables to LiveIntervals. Currently this path is only tested if you explicitly enable -early-live-intervals. Differential Revision: https://reviews.llvm.org/D110542	2021-09-30 09:15:10 +01:00
David Green	fdd8c10959	[ARM] Delay reverting WLS in arm-block-placement As we have to split blocks, we may be left in an invalid loop state after a WLS is reverted to a DLS. Instead remember the WLS that could not be fixed and revert them after finishing processing all other loops. Differential Revision: https://reviews.llvm.org/D110567	2021-09-28 15:38:29 +01:00
David Green	2c53215e99	[ARM] Skip debug info in recomputeVPTBlockMask The ARMLowOverheadLoops pass recalculates VPT block masks when it converts VCMP's inside VPT blocks into VPT's. The function to do so doesn't seem to handle debug info though, leading to invalid block creation or asserts at compile time. Make sure the function skips any debug info between the MVE instructions it inspects. Differential Revision: https://reviews.llvm.org/D110564	2021-09-28 14:58:13 +01:00
Jay Foad	20c0280733	[LiveIntervals] Repair subreg ranges in processTiedPairs In TwoAddressInstructionPass::processTiedPairs, update subranges of the live interval for RegB as well as the main range. This is a small step towards switching TwoAddressInstructionPass over from LiveVariables to LiveIntervals. Currently this path is only tested if you explicitly enable -early-live-intervals. Differential Revision: https://reviews.llvm.org/D110526	2021-09-28 08:10:16 +01:00
David Green	bb2d23dcd4	[ARM] Improve detection of fallthough when aligning blocks We align non-fallthrough branches under Cortex-M at O3 to lead to fewer instruction fetches. This improves that for the block after a LE or LETP. These blocks will still have terminating branches until the LowOverheadLoops pass is run (as they are not handled by analyzeBranch, the branch is not removed until later), so canFallThrough will return false. These extra branches will eventually be removed, leaving a fallthrough, so treat them as such and don't add unnecessary alignments. Differential Revision: https://reviews.llvm.org/D107810	2021-09-27 11:21:21 +01:00
David Green	883758ed48	[ARM] Fix Arm block placement creating branches after jump tables. Given: - A jump table - Which jumps to the next block - The next block ends in a WLS - Where the WLS conditionally jumps to block earlier in the program. The Arm block placement pass would attempt to move the block containing the WLS earlier, as the WLS instruction can only branch forward. In doing so it would add a branch from the jumptable block to the WLS block, thinking it previously fell-through. This in itself would be fine, if a little inefficient, but the constant island pass expects all instructions after a jump-table branch to have been removed by analyzeBranch. So it gets confused and can assign the same labels to multiple jump table blocks. I've changed the condition to the same as used in analyzeBranch.	2021-09-25 11:32:25 +01:00
David Green	a5211bf365	[ARM] Addition jump table plus while loop block placement pass test. Also regenerated the file, whilst here.	2021-09-24 19:30:49 +01:00
Stanislav Mekhanoshin	08d7eec06e	Revert "Allow rematerialization of virtual reg uses" Reverted due to two distcint performance regression reports. This reverts commit `92c1fd19ab`.	2021-09-24 10:26:11 -07:00
Jay Foad	e4e95f14f1	[LiveIntervals] Repair live intervals that gain subranges In repairIntervalsInRange, if the new instructions refer to subregs but the old instructions did not, make sure any existing live interval for the superreg is updated to have subranges. Also skip repairing any range that we have recalculated from scratch, partly for efficiency but also to avoids some cases that repairOldRegInRange can't handle. The existing test/CodeGen/AMDGPU/twoaddr-regsequence.mir provides some test coverage for this change: when TwoAddressInstructionPass converts REG_SEQUENCE into subreg copies, the live intervals will now get subranges and MachineVerifier will verify that the subranges are correct. Unfortunately MachineVerifier does not complain if the subranges are not present, so the test also passed before this patch. This patch also fixes ~800 of the ~1500 failures in the whole CodeGen lit test suite when -early-live-intervals is forced on. Differential Revision: https://reviews.llvm.org/D110328	2021-09-24 11:58:08 +01:00
David Green	e2050f94b6	[ARM] Extra tests for unpredicated qr MVE intrinsics.	2021-09-23 18:07:08 +01:00
David Green	02cd8a6b91	[ARM] Allow smaller VMOVL in tail predicated loops This allows VMOVL in tail predicated loops so long as the the vector size the VMOVL is extending into is less than or equal to the size of the VCTP in the tail predicated loop. These cases represent a sign-extend-inreg (or zero-extend-inreg), which needn't block tail predication as in https://godbolt.org/z/hdTsEbx8Y. For this a vecsize has been added to the TSFlag bits of MVE instructions, which stores the size of the elements that the MVE instruction operates on. In the case of multiple size (such as a MVE_VMOVLs8bh that extends from i8 to i16, the largest size was be chosen). The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 = i64, which often (but not always) comes from the instruction encoding directly. A unit test was added, and although only a subset of the vecsizes are currently used, the rest should be useful for other cases. Differential Revision: https://reviews.llvm.org/D109706	2021-09-22 12:07:52 +01:00
David Green	636fc0ef86	[ARM] Add additional tests for VMOVL in tail predicated loops.	2021-09-22 09:33:36 +01:00
David Green	3f90df22f1	[ARM] MVE reverse shuffles. The vectorizer can sometimes make reverse shuffles from indices that count down. In MVE, we don't have a 128bit rev instruction, but we can select this to a VREV64 with some lane movs to swap the two halfs. Ideally this would use VMOVD's, but only gets as far as VMOVS's at the moment. Differential Revision: https://reviews.llvm.org/D69510	2021-09-20 13:48:01 +01:00
David Green	cb5e3f7959	[ARM] Prevent large integer VQDMULH pattern crashes Put a limit on the size of constant integers we test when looking for VQDMULH, to prevent it from crashing from values more than 64bits.	2021-09-18 18:47:02 +01:00
Matt Arsenault	4a36e96c3f	RegAllocGreedy: Account for reserved registers in num regs heuristic This simple heuristic uses the estimated live range length combined with the number of registers in the class to switch which heuristic to use. This was taking the raw number of registers in the class, even though not all of them may be available. AMDGPU heavily relies on dynamically reserved numbers of registers based on user attributes to satisfy occupancy constraints, so the raw number is highly misleading. There are still a few problems here. In the original testcase that made me notice this, the live range size is incorrect after the scheduler rearranges instructions, since the instructions don't have the original InstrDist offsets. Additionally, I think it would be more appropriate to use the number of disjointly allocatable registers in the class. For the AMDGPU register tuples, there are a large number of registers in each tuple class, but only a small fraction can actually be allocated at the same time since they all overlap with each other. It seems we do not have a query that corresponds to the number of independently allocatable registers. Relatedly, I'm still debugging some allocation failures where overlapping tuples seem to not be handled correctly. The test changes are mostly noise. There are a handful of x86 tests that look like regressions with an additional spill, and a handful that now avoid a spill. The worst looking regression is likely test/Thumb2/mve-vld4.ll which introduces a few additional spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll shows a massive improvement by completely eliminating a large number of spills inside a loop.	2021-09-14 21:00:29 -04:00
Nikita Popov	90ec6dff86	[OpaquePtr] Forbid mixing typed and opaque pointers Currently, opaque pointers are supported in two forms: The -force-opaque-pointers mode, where all pointers are opaque and typed pointers do not exist. And as a simple ptr type that can coexist with typed pointers. This patch removes support for the mixed mode. You either get typed pointers, or you get opaque pointers, but not both. In the (current) default mode, using ptr is forbidden. In -opaque-pointers mode, all pointers are opaque. The motivation here is that the mixed mode introduces additional issues that don't exist in fully opaque mode. D105155 is an example of a design problem. Looking at D109259, it would probably need additional work to support mixed mode (e.g. to generate GEPs for typed base but opaque result). Mixed mode will also end up inserting many casts between i8* and ptr, which would require significant additional work to consistently avoid. I don't think the mixed mode is particularly valuable, as it doesn't align with our end goal. The only thing I've found it to be moderately useful for is adding some opaque pointer tests in between typed pointer tests, but I think we can live without that. Differential Revision: https://reviews.llvm.org/D109290	2021-09-10 15:18:23 +02:00
Roman Lebedev	909cba9699	[SimplifyCFG] performBranchToCommonDestFolding(): require block-closed SSA form for bonus instructions (PR51125) I can't seem to wrap my head around the proper fix here, we should be fine without this requirement, iff we can form this form, but the naive attempt (https://reviews.llvm.org/D106317) has failed. So just to unblock the release, put up a restriction. Fixes https://bugs.llvm.org/show_bug.cgi?id=51125	2021-09-09 12:28:09 +03:00
David Green	adfd12e6d1	[ARM] Add patterns for store(fptosisat(..)) As an extension to D107866, this adds store(fptosisat(..)) patterns, similar to the existing fptosi patterns, to prevent unnecessarily moving into gpr regs where we can use fp stores directly. Differential Revision: https://reviews.llvm.org/D108378	2021-09-03 19:22:11 +01:00
David Green	f37e132263	[ARM] Add VFP lowering for fptosi.sat This extends D107865 to the VFP insructions, lowering llvm.fptosi.sat and llvm.fptoui.sat to VCVT instructions that inherently perform the saturate. Differential Revision: https://reviews.llvm.org/D107866	2021-09-03 18:11:08 +01:00
David Green	9cb8f4d1ad	[ARM] Add a tail-predication loop predicate register The semantics of tail predication loops means that the value of LR as an instruction is executed determines the predicate. In other words: mov r3, #3 DLSTP lr, r3 // Start tail predication, lr==3 VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0. mov lr, #1 VADD.s32 q0, q1, q2 // Only first lane is updated. This means that the value of lr cannot be spilled and re-used in tail predication regions without potentially altering the behaviour of the program. More lanes than required could be stored, for example, and in the case of a gather those lanes might not have been setup, leading to alignment exceptions. This patch adds a new lr predicate operand to MVE instructions in order to keep a reference to the lr that they use as a tail predicate. It will usually hold the zeroreg meaning not predicated, being set to the LR phi value in the MVETPAndVPTOptimisationsPass. This will prevent it from being spilled anywhere that it needs to be used. A lot of tests needed updating. Differential Revision: https://reviews.llvm.org/D107638	2021-09-02 13:42:58 +01:00
David Green	49476a4d66	[ARM] Add MVE lowering for fptosi.sat This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intinsics, selecting a VCVT instruction which under MVE will inherently perform the saturate. Differential Revision: https://reviews.llvm.org/D107865	2021-09-01 22:38:47 +01:00
David Green	22c384129e	[ARM] Add missing validForTailPredication for VMINNM/VMAXNM Apparently this was missing, preventing the generation of tail predication loops containing VMINNM, VMAXNM, VMINNMA and VMAXNMA.	2021-08-31 18:19:03 +01:00
David Green	198259becb	[ARM] Test for VMINNM/VMAXNM in tail predicated loops.	2021-08-31 18:19:03 +01:00
David Green	bd0959354f	[ARM] Add Extra FpToIntSat tests. This adds extra MVE vector fptosi.sat and fptoui.sat tests, along with adding or adjusting the existing scalar tests to cover more architectures and instruction combinations.	2021-08-25 20:10:18 +01:00
Stanislav Mekhanoshin	92c1fd19ab	Allow rematerialization of virtual reg uses Currently isReallyTriviallyReMaterializableGeneric() implementation prevents rematerialization on any virtual register use on the grounds that is not a trivial rematerialization and that we do not want to extend liveranges. It appears that LRE logic does not attempt to extend a liverange of a source register for rematerialization so that is not an issue. That is checked in the LiveRangeEdit::allUsesAvailableAt(). The only non-trivial aspect of it is accounting for tied-defs which normally represent a read-modify-write operation and not rematerializable. The test for a tied-def situation already exists in the /CodeGen/AMDGPU/remat-vop.mir, test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve. The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets where I more or less understand the asm it seems to reduce spilling (as expected) or be neutral. However, it needs a review by all targets' specialists. Differential Revision: https://reviews.llvm.org/D106408	2021-08-24 11:09:02 -07:00
David Green	605489d593	[ARM] Fix VQDMULH fold for scalar smin Add a variant of mve-vqdmulh tests that uses min/max intrinsics directly, including a scalar test that shows it misbehaving for min intrinsics and a fix for the combine to prevent it from misbehaving.	2021-08-21 16:33:18 +01:00
David Green	d10f23a25d	[ISel] Expand saddsat and ssubsat via asr and xor This changes the lowering of saddsat and ssubsat so that instead of using: r,o = saddo x, y c = setcc r < 0 s = c ? INTMAX : INTMIN ret o ? s : r into using asr and xor to materialize the INTMAX/INTMIN constants: r,o = saddo x, y s = ashr r, BW-1 x = xor s, INTMIN ret o ? x : r https://alive2.llvm.org/ce/z/TYufgD This seems to reduce the instruction count in most testcases across most architectures. X86 has some custom lowering added to compensate for cases where it can increase instruction count. Differential Revision: https://reviews.llvm.org/D105853	2021-08-19 16:08:07 +01:00
David Green	765a421276	[ARM] Add MVE min/max intrinsic tests. NFC	2021-08-19 14:33:34 +01:00
Petr Hosek	2d4470ab89	Revert "Allow rematerialization of virtual reg uses" This reverts commit `877572cc19` which introduced PR51516.	2021-08-18 00:12:41 -07:00
David Green	52e0cf9d61	[ARM] Enable subreg liveness This enables subreg liveness in the arm backend when MVE is present, which allows the register allocator to detect when subregister are alive/dead, compared to only acting on full registers. This can helps produce better code on MVE with the way MQPR registers are made up of SPR registers, but is especially helpful for MQQPR and MQQQQPR registers, where there are very few "registers" available and being able to split them up into subregs can help produce much better code. Differential Revision: https://reviews.llvm.org/D107642	2021-08-17 14:10:33 +01:00
David Green	9236dea255	[ARM] Create MQQPR and MQQQQPR register classes Similar to the MQPR register class as the MVE equivalent to QPR, this adds MQQPR and MQQQQPR register classes for the MVE equivalents of QQPR and QQQQPR registers. The MVE MQPR seemed have worked out quite well, and adding MQQPR and MQQQQPR allows us to a little more accurately specify the number of registers, calculating register pressure limits a little better. Differential Revision: https://reviews.llvm.org/D107463	2021-08-16 22:58:12 +01:00
Stanislav Mekhanoshin	877572cc19	Allow rematerialization of virtual reg uses Currently isReallyTriviallyReMaterializableGeneric() implementation prevents rematerialization on any virtual register use on the grounds that is not a trivial rematerialization and that we do not want to extend liveranges. It appears that LRE logic does not attempt to extend a liverange of a source register for rematerialization so that is not an issue. That is checked in the LiveRangeEdit::allUsesAvailableAt(). The only non-trivial aspect of it is accounting for tied-defs which normally represent a read-modify-write operation and not rematerializable. The test for a tied-def situation already exists in the /CodeGen/AMDGPU/remat-vop.mir, test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve. The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets where I more or less understand the asm it seems to reduce spilling (as expected) or be neutral. However, it needs a review by all targets' specialists. Differential Revision: https://reviews.llvm.org/D106408	2021-08-16 12:42:42 -07:00
David Green	ae9a346ef8	[ARM] Fix DAG combine loop in reduction distribution Given a constant operand, the MVE and DAGCombine combines could fight, each redistributing in the opposite order. Add a guard to the MVE vecreduce distribution to prevent that.	2021-08-12 16:37:39 +01:00
Stanislav Mekhanoshin	c46cb72fea	[Thumb2] generate checks in ldr-str-imm12.ll. NFC. That seems this test does not check what was stated in the comment anymore. Just switch to generated checks. Differential Revision: https://reviews.llvm.org/D107590	2021-08-05 14:58:02 -07:00
David Green	eeddcba525	[RDA] Attempt to make RDA subreg aware This attempts to make more of RDA aware of potentially overlapping subregisters. Some of this was already in place, with it iterating through MCRegUnitIterators. This also replaces calls to LiveRegs.contains(..) with !LiveRegs.available(..), and updates the isValidRegUseOf and isValidRegDefOf to search subregs. Differential Revision: https://reviews.llvm.org/D107351	2021-08-04 14:21:32 +01:00
David Green	ff9958b70e	[ARM] Test showing incorrect codegen when subreg liveness is enabled. NFC	2021-08-04 14:21:32 +01:00
David Green	2829391840	[ARM] Revert WLSTP to DLSTP if the target block is out of range If the block target for a WLSTP instruction is known to be out of range, and cannot be fixed by the ARMBlockPlacementPass, we can relax it to a DLSTP (and cmp/branch) to still allow the creation of tail predicated loops. That is what this patch does, adding extra revert code to the fallback path of ARMBlockPlacementPass. Due to the code produced when reverting, this creates a DLSTP between a Bcc and a Br. As a DLS isn't necessarily a terminator we need to split the block to move the DLS/Br into. Differential Revision: https://reviews.llvm.org/D104709	2021-08-02 10:59:52 +01:00
David Green	85455192e1	[ARM] Add trackLiveness to block-placement.mir. NFC Also move the test to mve-wls-block-placement.mir, to better fit what it tests.	2021-08-02 09:03:22 +01:00
Eli Friedman	bdd55b2f18	Fix the default alignment of i1 vectors. Currently, the default alignment is much larger than the actual size of the vector in memory. Fix this to use a sane default. For SVE, temporarily remove lowering of load/store operations for predicates with less than 16 elements. The layout the backend was assuming for SVE predicates with less than 16 elements doesn't agree with the frontend. More work probably needs to be done here. This change is, strictly speaking, not backwards-compatible at the bitcode level. But probably nobody is actually depending on that; i1 vectors in memory are rare, and the code that does use them probably ends up forcing the alignment to something sane anyway. If we think this is a concern, I can restrict this to scalable vectors for now (where it's actually causing issues for me at the moment). Differential Revision: https://reviews.llvm.org/D88994	2021-07-31 14:09:59 -07:00
David Green	15a1d7e839	[ARM] Switch order of creating VADDV and VMLAV. It can be beneficial to attempt to try the larger VMLAV patterns before VADDV, in case both may match the same code.	2021-07-31 16:28:52 +01:00
David Green	69cdadddec	[ARM] Distribute reductions based on ascending load offset This distributes reductions based on the relative offset of loads, if one is found from their operands. Given chains of reductions this will then sort them in ascending load order, which in turn can help simple prefetches latch on to increasing strides more easily. Differential Revision: https://reviews.llvm.org/D106569	2021-07-30 19:50:07 +01:00
David Green	532d05b714	[ARM] Attempt to distribute reductions This adds a combine for adds of reductions, distributing them so that they occur sequentially to enable better use of accumulating VADDVA instructions. It combines: add(X, add(vecreduce(Y), vecreduce(Z))) -> add(add(X, vecreduce(Y)), vecreduce(Z)) and add(add(A, reduce(B)), add(C, reduce(D))) -> add(add(add(A, C), reduce(B)), reduce(D)) These together distribute the add's so that more reductions can be selected to VADDVA. Differential Revision: https://reviews.llvm.org/D106532	2021-07-30 14:48:31 +01:00
David Green	4b56306762	[ARM] Turn vecreduce_add(add(x, y)) into vecreduce(x) + vecreduce(y) Under MVE we can use VADDV/VADDVA's to perform integer add reductions, so it can be beneficial to use more reductions than summing subvectors and reducing once. Especially for VMLAV/VMLAVA the mul can be incorporated into the reduction, producing less instructions. Some of the test cases currently get larger due to extra integer adds, but will be improved in a followup patch. Differential Revision: https://reviews.llvm.org/D106531	2021-07-30 10:10:41 +01:00
David Green	ee32cc386c	[ARM] MVE SLP'd reduction tests. NFC These are generated from SLP vectorization, for example https://godbolt.org/z/ebxdPh1Kz. These backend tests show cases that we can produce better code for.	2021-07-30 09:43:29 +01:00
David Green	54c91c0c74	[ARM] Implement isLoad/StoreFromStackSlot for MVE stack stores accesses This implements the isLoadFromStackSlot and isStoreToStackSlot for MVE MVE_VSTRWU32 and MVE_VLDRWU32 functions. They behave the same as many other loads/stores, expecting a FI in Op1 and zero offset in Op2. At the same time this alters VLDR_P0_off and VSTR_P0_off to use the same code too, as they too should be returning VPR in Op0, take a FI in Op1 and zero offset in Op2. Differential Revision: https://reviews.llvm.org/D106797	2021-07-27 09:11:58 +01:00
Johannes Doerfert	25a3130d89	[Local] Do not introduce a new `llvm.trap` before `unreachable` This is the second attempt to remove the `llvm.trap` insertion after https://reviews.llvm.org/rGe14e7bc4b889dfaffb7180d176a03311df2d4ae6 reverted the first one. It is not clear what the exact issue was back then and it might already be gone by now, it has been >5 years after all. Replaces D106299. Differential Revision: https://reviews.llvm.org/D106308	2021-07-26 23:33:36 -05:00
David Green	d0c7d4d8a0	[ARM] Fixup vst4 test. NFC	2021-07-26 20:56:22 +01:00
David Green	010f8e3057	[ARM] Ensure correct regclass in distributing postinc The register class required for some MVE loads/stores is more constrained than the register we use when creating postinc. Make sure we constrain the register class to keep the code correct.	2021-07-26 14:26:38 +01:00
David Green	eb1e95dbdf	[ARM] Extend more reductions during lowering This relaxes the VMLAV and VADDV reduction recognition code to handle smaller than legal types, extending them as needed. That was already handled for some reductions, this extends it to more types in a more generic way. If a smaller than legal value is found it is extended to the legal type as needed. Differential Revision: https://reviews.llvm.org/D106051	2021-07-19 08:58:03 +01:00

1 2 3 4 5 ...

1469 Commits