llvm-project

Commit Graph

Author	SHA1	Message	Date
Juan Manuel MARTINEZ CAAMAÑO	e438f2d821	[DAGCombine] Do not fold SRA/SRL of MUL into MULH when MUL's LSB are used, and MUL_LOHI is available Folding a sra(mul) / srl(mul) into a mulh introduces an extra multiplication to compute the high half of the multiplication, while it is more profitable to compute the high and low halves with a single mul_lohi. Differential Revision: https://reviews.llvm.org/D133768	2022-09-16 15:48:36 +00:00
Jeffrey Byrnes	f90cf68003	[NFC] Fix tests in commit `20cf170e68`	2022-09-15 15:37:58 -07:00
Alexander Timofeev	fbdea5a2e9	[AMDGPU] Always select s_cselect_b32 for uniform 'select' SDNode This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D133593	2022-09-15 22:03:56 +02:00
Craig Topper	ace05124f5	[IntegerDivision][AMDGPU] Use CreateLogicalOr to block poison propagation. There are two ctlz intrinsics here with the zero_is_poison flag set. There are also two comparisons that check if either of the inputs the ctlzs are zero. We need to use a logical or to block the poison from the ctlz if either of the inputs is zero. Reviewed By: arsenm, aqjune Differential Revision: https://reviews.llvm.org/D130680	2022-09-15 09:38:02 -07:00
Jay Foad	3822a01e0b	[AMDGPU] Add GFX11 ds_bvh_stack_rtn_b32 instruction Differential Revision: https://reviews.llvm.org/D133928	2022-09-15 16:46:14 +01:00
Matt Arsenault	69153d6c0a	AMDGPU: Use GlobalPriority for largest register tuples Only do this for 16 and 32 register tuples, although we might want to extend to 8 tuples. It's incredibly expensive to spill these, and doing so majorly interferes with the ability to allocate anything else in the function. The lit tests show mostly sizeable improvements with a handful of tiny regressions with large vectors.	2022-09-15 11:45:02 -04:00
Ivan Kosarev	693f816288	[AMDGPU][SILoadStoreOptimizer] Merge SGPR_IMM scalar buffer loads. Reviewed By: foad, rampitec Differential Revision: https://reviews.llvm.org/D133787	2022-09-15 13:48:51 +01:00
Piotr Sobczak	abd927e5a8	[AMDGPU] Check for num elts in SelectVOP3PMods The rest of the code section assumes there are exactly two elements in the vector (Lo, Hi), so add the check before entering the section. Differential Revision: https://reviews.llvm.org/D133852	2022-09-14 20:00:19 +02:00
Jon Chesterfield	cdb9738963	[amdgpu] Expand all ConstantExpr users of LDS variables in instructions Bug noted in D112717 can be sidestepped with this change. Expanding all ConstantExpr involved with LDS up front makes the variable specialisation simpler. Excludes ConstantExpr that don't access LDS to avoid disturbing codegen elsewhere. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D133422	2022-09-14 07:55:46 +01:00
Matt Arsenault	d00b0aab31	RegisterCoalescer: Fix verifier error when merging copy of undef There's no real read of the register, so the copy introduced a new live value. Make sure we introduce a replacement implicit_def instead of just erasing the copy. Found from llvm-reduce since it tries to set undef on everything.	2022-09-13 18:40:28 -04:00
Jay Foad	2e8863b6a1	[AMDGPU] Don't shrink VOP3 instructions pre-RA on GFX10+ In GFX10, there is no advantage to shrinking these instructions pre-RA, so this just saves a bit of work. In GFX11 there is an advantage to not shrinking them pre-RA, because the register classes for 16-bit operands are less restrictive in the VOP3 form than in the shrunk form. This patch is a prerequisite for actually setting up those register classes correctly for 16-bit vs non-16-bit operands. Differential Revision: https://reviews.llvm.org/D133769	2022-09-13 20:26:08 +01:00
Jay Foad	3743f9afeb	[AMDGPU] Add GFX11 globalisel test coverage for fptosi/fptoui	2022-09-13 10:51:02 +01:00
Jay Foad	210e6a993d	[GlobalISel] Simplify extended add/sub to add/sub with carry Simplify extended add/sub (with carry-in and carry-out) to add/sub with carry (with carry-out only) if carry-in is known to be zero. Differential Revision: https://reviews.llvm.org/D133702	2022-09-12 17:05:44 +01:00
Joe Nash	8604904e68	[AMDGPU] Separate check lines for some GFX11 16-bit codegen tests NFC. Pre-commits test changes to have a separate CHECK line where GFX11 behavior will diverge from previous subtargets in a future patch.	2022-09-12 09:38:34 -04:00
Joe Nash	f912d0d6e1	[AMDGPU] Change test check name. NFC Change the check name from GFX10 to GFX10Plus to reflect its actual usage	2022-09-12 09:28:55 -04:00
Joe Nash	aecd479198	[AMDGPU] Autogenerate test with regclass numbers NFC. This test contains a good amount of verbose compiler output as well as numbers which depend on the number of registers defined and are difficult to update. So autogenerate the test.	2022-09-12 09:21:03 -04:00
Matt Arsenault	da59fe7f15	AMDGPU: Fix test failure Forgot to commit regenerated test	2022-09-12 09:33:22 -04:00
Matt Arsenault	e30271169f	RegAllocGreedy: Try local instruction splitting with subranges This was only trying this to relax register class constraints, but this can also help if there are subranges involved. This solves a compilation failure for AMDGPU when there is high pressure created by large register tuples. If one virtual register is using most of the available budget, we need to be able to evict subranges. This solves the immediate failure, but this solution leaves a lot to be desired. In the relevant testcases, we have 32-element tuples but most of the uses are operations on 1 element subranges of it. What we're now getting is a spill and restore of the full 1024 bits and an extract of the used 32-bits. It would be far better if we introduced a copy to a new virtual register with a smaller register class and used narrower spills. Furthermore, we could probably do a better job if the allocator were to introduce new subranges where none previously existed in the highest pressure scenarios. The block and region splits should also try to split specific subranges out. The mve-vst3.ll test changes look like noise to me, but instruction count increased by one. mve-vst4.ll looks like a solid improvement with several 16-byte spills eliminated. splitkit-copy-live-lanes.mir also shows a solid reduction in total spill count. This could use more tests but it's pretty tiring to come up with cases that fail on this.	2022-09-12 09:03:55 -04:00
Matt Arsenault	bb70b5d406	CodeGen: Set MODereferenceable from isDereferenceableAndAlignedPointer Previously this was assuming pointsToConstantMemory implies dereferenceable.	2022-09-12 08:38:35 -04:00
Matt Arsenault	b5041527c7	DeadMachineInstructionElim: Switch to using LiveRegUnits Theoretically improves compile time for targets with many overlapping registers	2022-09-12 07:55:14 -04:00
Jay Foad	8901f7cebc	[AMDGPU] Fix crash legalizing G_EXTRACT_VECTOR_ELT with negative index Fixes https://github.com/llvm/llvm-project/issues/57408 Differential Revision: https://reviews.llvm.org/D132938	2022-09-09 15:53:34 +01:00
Thomas Symalla	72730c3f0e	[NFC][AMDGPU] Pre-commit test for D132837.	2022-09-09 14:09:02 +02:00
Jay Foad	afa0ed33df	[AMDGPU] Fix shrinking of F16 FMA on newer subtargets D125803 introduced shrinking of F16 FMA to FMAAK/FMAMK in SIShrinkInstructions (useful on GFX10+ where VOP3 instructions may have a literal operand) but failed to handle the V_FMA_F16_gfx9_e64 form of the opcode which is used on GFX9+. Differential Revision: https://reviews.llvm.org/D133489	2022-09-08 16:41:04 +01:00
Eric Wang	d8a2d3f7d4	[NFC][Regalloc] Introduce the RegAllocPriorityAdvisorAnalysis This patch introduces the priority analysis and the priority advisor, the default implementation, and the scaffolding for introducing the other implementations of the advisor. Reviewed By: mtrofin Differential Revision: https://reviews.llvm.org/D132835	2022-09-08 07:50:03 -07:00
Ivan Kosarev	57c943d581	[AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions. Reviewed By: nhaehnle Differential Revision: https://reviews.llvm.org/D124671	2022-09-08 15:21:30 +01:00
Jay Foad	5b652f77e0	[AMDGPU] Add basic tests for emitting v_fma_f16 and friends	2022-09-08 14:40:36 +01:00
Justin Bogner	a81c7dbf0d	[AMDGPU] Drop _oneuse checks from med3 patterns We use _oneuse checks to make sure combines won't accidentally increase code size, but this prevents the optimization in cases where we happen to want to clamp multiple values to the same range. It's safe to drop these checks for two reasons: 1. The pattern of max/min operations for med3 is complicated enough it's unlikely to come up by accident, so this will still only fire when appropriate to do so 2. Even if every intermediate is used and we don't save a single operation, we still won't end up with more operations since the med3 replaces the final max/min. In pathological cases we could potentially end up with a larger encoding size or possibly slightly increased vgpr pressure, but the risk of that is low, especially considering the upside. Differential Revision: https://reviews.llvm.org/D132621	2022-09-07 16:31:49 -07:00
Stanislav Mekhanoshin	fb28bf3fb4	[AMDGPU] Fix liveness verifier error in hazard recognizer After D133067 we are inserting swaps to use a new physical register. I have noticed verifier errors about undefined physical register uses if we are tracking liveness post RA. We have no access to LIS at this point, so mark new register uses as undef to calm down the verifier. Liveness should not matter at this point anyway. Note the description of the RegState::Undef: "Value of the register doesn't matter." I.e. it does not say it is strictly undefined. In fact that is what we really need: this value does not matter. I also had to modify the test a bit since with tracking enabled it does not pass verification even before the recognizer. Differential Revision: https://reviews.llvm.org/D133459	2022-09-07 16:30:36 -07:00
Stanislav Mekhanoshin	95d497ff2a	[AMDGPU] W/a hazard if 64 bit shift amount is a highest allocated VGPR In this case gfx90a uses v0 instead of the correct register. Swap the value temporarily with a lower register and then swap it back. Unfortunately hazard recognizer works after wait count insertion, so we cannot simply reuse an arbitrary register, hence w/a also includes a full waitcount. This can be avoided if we run it from expandPostRAPseudo, but that is a complete misplacement. Differential Revision: https://reviews.llvm.org/D133067	2022-09-07 14:23:49 -07:00
Jon Chesterfield	23f6c8d635	[amdgpu] Always, instead of mostly, remove unused LDS symbols Currently LDS variables are removed by the lower module pass if they have a use which is caught by the replace with struct control flow. This makes tests brittle to changes to that control flow which induces noise when trying to improve lowering. Some tests already check that variables are removed, while others checked that they are not removed. LDS variables are not (currently) externally accessible, and if that changes the machinery which makes them externally accessible will look like a use. This change therefore breaks no applications. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D133028	2022-09-07 18:28:16 +01:00
Jay Foad	01c53d7d80	[AMDGPU] Add an operand folding test case from D114232	2022-09-07 11:16:40 +01:00
Jay Foad	9b8b1ac436	[AMDGPU] Add a comment for a missing fold	2022-09-07 09:57:18 +01:00
Matthias Gehre	6cc52d594f	Fix AMDGPU test failures due to "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64"	2022-09-06 16:18:14 +01:00
Daniil Fukalov	51d33afcbe	[RegisterCoalescer] Fix crash on early clobbered subreg operands. The issue occurred when two subregs of the same reg are used in the same instruction (e.g. inline asm): one "def early-clobber" and the other just "def". The register coalescer ran into bad recursion if the early clobbered subreg is second in the following sequence of COPYs. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D127136	2022-09-06 08:42:37 +03:00
Ivan Kosarev	5db8d6fd2b	[AMDGPU][CodeGen] Support (base | offset) SMEM loads. Prevents generation of unnecessary s_or_b32 instructions. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D132552	2022-09-05 14:22:06 +01:00
Ivan Kosarev	1f550d86b2	[AMDGPU][CodeGen] Pre-commit a test on (base | offset) SMEM loads for D132552. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D133021	2022-09-05 13:12:43 +01:00
Ivan Kosarev	f33645301e	[AMDGPU][CodeGen] Support (soffset + offset) s_buffer_load's. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130263	2022-09-05 12:53:05 +01:00
Simon Pilgrim	e438ce5694	[AMDGPU] Add -verify-machineinstrs to attr-amdgpu-flat-work-group-size* tests These were affected by D131825 (and reported on Issue #57149) - adding the verification will help ensure that we don't hit this again on builds with EXPENSIVE_CHECKS enabled	2022-09-03 13:47:41 +01:00
Daniil Fukalov	b4e1b0e00d	[LiveIntervals] Split live intervals on any dead def Each dead def of the same virtual register is required to be split into multiple virtual registers with separate live intervals to avoid MachineVerifier error. Partially fixes https://github.com/llvm/llvm-project/issues/56050 and https://github.com/llvm/llvm-project/issues/56051 Reviewed By: qcolombet Differential Revision: https://reviews.llvm.org/D130477	2022-09-02 20:00:22 +03:00
Nuno Lopes	858fe8664e	Expand Div/Rem: consider the case where the dividend is zero So we can't use ctlz in poison-producing mode	2022-09-01 17:04:26 +01:00
Stanislav Mekhanoshin	fd1f8c85f2	[AMDGPU] Limit TID / wavefrontsize uniformness to 1D kernels If a kernel has uneven dimensions, the value of workitem-id-x divided by the wavefrontsize can be non-uniform. For example, dimensions (65, 2) will have workitems with addresses (64, 0) and (0, 1) packed into the same wave, which gives 1 and 0 after the division by 64 respectively. Unfortunately, this limits the optimization to OpenCL only and only if the reqd_work_group_size attribute is set. This patch limits it to 1D kernels, although it should be possible to perform this optimization if the size of the X dimension is a power of 2; we just do not currently have the infrastructure to query it. Note that the presence of the amdgpu-no-workitem-id-y attribute does not help, as it only hints at the lack of the workitem-id-y query, not the absence of the actual 2nd dimension, and therefore affects just the SGPR allocation. Differential Revision: https://reviews.llvm.org/D132879	2022-08-30 12:22:08 -07:00
Justin Bogner	f9433161f5	[AMDGPU] Precommit two tests showing missed combines to v_med3	2022-08-30 11:56:09 -07:00
Jon Chesterfield	9b0b912e15	[amdgpu][nfc] Add test case showing false aliasing in LDS lowering	2022-08-30 15:33:57 +01:00
Thomas Symalla	d26dd37149	[NFC][AMDGPU] Pre-commit tests for D132837. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D132930	2022-08-30 13:55:36 +02:00
Stanislav Mekhanoshin	813ae2871d	[AMDGPU] Detect uniformness of TID / wavefrontsize A value of 'workitemid / wavefrontsize' or 'workitemid & (wavefrontsize - 1)' is wave uniform. Differential Revision: https://reviews.llvm.org/D132511	2022-08-26 23:26:08 -07:00
Pierre van Houtryve	59cf9dd923	[AMDGPU][GISel] Enable Selection of ADD3 for G_PTR_ADD Allows things like `(G_PTR_ADD (G_PTR_ADD a, b), c)` to be simplified into a single ADD3 instruction instead of two adds. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D131254	2022-08-24 14:44:19 +00:00
Raghav	79d2529c10	AMDGPU/MetaData: Restrict address space key to only be emitted for "global_buffer" and "dynamic_shared_pointer" This matches .address_space docs at https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 Differential Revision: https://reviews.llvm.org/D132145	2022-08-23 14:01:01 -04:00
Luo, Yuanke	5159be3c9b	(Reland) [fastalloc] Support allocating specific register class in fastalloc This reverts commit `853bb192c4`.	2022-08-20 13:25:34 +08:00
Austin Kerbow	b0f4678b90	[AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy Adds a builtin that serves as an optimization hint to apply specific optimized DAG mutations during scheduling. This also disables any other mutations or clustering that may interfere with the desired pipeline. The first optimization strategy that is added here is designed to improve the performance of small gemm kernels on gfx90a. Reviewed By: jrbyrnes Differential Revision: https://reviews.llvm.org/D132079	2022-08-19 15:38:36 -07:00
jeff	20cf170e68	[InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics Certain address space dependent optimizations, like SeparateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done. This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in non-flat address space, then isel may infer a different intrinsic. For example, if the result is a flat_atomic using global address space, then it will be lowered to the corresponding global_atomic intrinsic. Change-Id: Ifcd981709dc2ea94d4acbcb84efe7176593ec8c7	2022-08-19 11:37:20 -07:00
Luo, Yuanke	28733d86cf	[amdgpu] Change the RA to basic Specifying `-regalloc=fast` is not reliable. With fast register allocation, `LIS = getAnalysisIfAvailable<LiveIntervals>();` gets nullptr in the "si-lower-sgpr-spills" pass, so the slot index is not created in the pass for newly inserted instructions. When verifying the machine instructions, it fails on checking the slot index. While greedy-ra is time consuming, basic-ra can be used to reduce compile time for this test case. Differential Revision: https://reviews.llvm.org/D131931	2022-08-18 08:16:19 +08:00
Jeffrey Byrnes	1c8d7ea973	[AMDGPU] Implement pipeline solver for non-trivial pipelines Requested SchedGroup pipelines may be non-trivial to satisfy. A minimal example is if the requested pipeline is {2 VMEM, 2 VALU, 2 VMEM} and the original order of SUnits is {VMEM, VALU, VMEM, VALU, VMEM}. Because of existing dependencies, the choice of which SchedGroup the middle VMEM goes into impacts how closely we are able to match the requested pipeline. It seems minimizing the degree of misfit (as measured by the number of edges we can't add) w.r.t the choice we make when mapping an instruction -> SchedGroup is an NP problem. This patch implements the PipelineSolver class which produces a solution for the defined problem for the sched_group_barrier mutation. The solver has both an exponential time exact algorithm and a greedy algorithm. The patch includes some controls which allow the user to select the greedy/exact algorithm. Differential Revision: https://reviews.llvm.org/D130797	2022-08-17 16:21:59 -07:00
Nicolas Miller	ccfabfbb1f	Fix subrange liveness checking at rematerialization This patch fixes an issue where an instruction reading a whole register would be moved during register allocation into a spot where one of the subregisters was dead. The code to check whether an instruction can be rematerialized at a given point or not was already checking for subranges to ensure that subregisters are live, but only when the instruction being moved was using a subregister; this patch changes that so the subranges are checked even when the moved instruction uses the full register. This patch also adds a case to the original test for the subrange checking that triggers the issue described above. The original subrange checking code was introduced in this revision: https://reviews.llvm.org/D115278 And I've encountered this issue on AMDGPUs while working with DPC++: https://github.com/intel/llvm/issues/6209 Essentially the greedy register allocator attempts to move the following instruction: ``` %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec ``` From `@3440` into the body of a loop `@16312`, but `%3078` has the following live ranges: ``` %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r ``` So at `@16312e`, `%3078.sub1` is alive but `%3078.sub0` is dead, so this instruction being moved there leads to invalid memory accesses as `%3078.sub0` ends up being trashed and the result of this instruction is used as part of an address calculation for a load. On the original ticket this issue showed up on gfx906 and gfx90a but not on gfx908; this turned out to be because on gfx908 instead of moving the shift instruction into the loop, its value is spilled into an ACC register, gfx906 doesn't have ACC registers and for gfx90a ACC registers are used like regular vector registers and so aren't used for spilling. With this patch the original application from the DPC++ ticket works properly on gfx906, and the result of the shift instruction is correctly spilled instead of moving the instruction in the loop. Original Author: npmiller Reviewed by: rampitec Submitted by: rampitec Differential Revision: https://reviews.llvm.org/D131884	2022-08-16 10:50:09 -07:00
Luo, Yuanke	853bb192c4	Revert "(Reland) [fastalloc] Support allocating specific register class in fastalloc" This reverts commit `30f9e6ebd3`.	2022-08-15 20:33:15 +08:00
Luo, Yuanke	30f9e6ebd3	(Reland) [fastalloc] Support allocating specific register class in fastalloc Reland commit `719658d078` The base RA supports infrastructure that only allows a specific register class to be allocated in an RA pass. Since greedy RA and basic RA derive from base RA, they both allow allocating a specific register class. Fast RA doesn't support allocating registers for a specific register class. This patch enables ShouldAllocateClass in fast RA, so that it can support allocating registers for a specific register class. Differential Revision: https://reviews.llvm.org/D131825	2022-08-13 13:57:34 +08:00
Daniil Fukalov	5eeef48ed1	[NFC] Split test to reduce time to run. The `RUN:` line with `--debug` used just two functions. Moved this test out to a different file. Reviewed By: vangthao Differential Revision: https://reviews.llvm.org/D130920	2022-08-12 10:40:31 +03:00
Joe Nash	b92161f927	[AMDGPU] Autogenerate spill-vector-superclass. NFC This test is already a subset of the autogenerated test lines, so truly auto-generate it to make it easier to update.	2022-08-11 10:34:10 -04:00
David Stuttard	1d1cc05539	AMDGPU: mbcnt allow for non-zero src1 for known-bits Src1 for mbcnt can be a non-zero literal or register. Take this into account when calculating known bits. Differential Revision: https://reviews.llvm.org/D131478	2022-08-11 13:23:43 +01:00
Martin Sebor	0dcfe7aa35	[InstCombine] Tighten up known library function signature tests (PR #56463 ) Replace a switch statement used to validate arguments to known library functions with a more consistent table-driven approach and tighten it up.	2022-08-10 14:15:46 -06:00
Evgenii Stepanov	8ea1cf3111	Revert "[AMDGPU] SIFixSGPRCopies refactoring" Breaks ASan tests. This reverts commit `3f8ae7efa8`.	2022-08-10 11:32:46 -07:00
Venkata Ramanaiah Nalamothu	486594119d	[AMDGPU] Fix prologue/epilogue markers in .debug_line table for trivial functions All the prologue instructions should have unknown source location co-ordinates while the epilogue instructions should have the source location of the last non-debug instruction after which the epilogue instructions are inserted. This ensures the prologue/epilogue markers are generated correctly in the line table. Changes are brought in from the downstream CFI patches. Reviewed By: scott.linder Differential Revision: https://reviews.llvm.org/D131485	2022-08-10 23:00:19 +05:30
alex-t	3f8ae7efa8	[AMDGPU] SIFixSGPRCopies refactoring This change finalizes the series of patches aiming to replace the old strategy of VGPR to SGPR copies lowering. Following https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367, code parts that are no longer used were removed. The pass main loop is no longer used for the MIR changes but collects information for further analysis. Actual MIR lowering happens later, according to the analysis result, in a set of separate functions. Another important change concerns the order of lowering: VGPR to SGPR copies lowering is done first to have priority over the rest of the MIR changes. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D131246	2022-08-10 00:51:57 +02:00
Yaxun (Sam) Liu	e780648a15	[AMDGPU] Unify unreachable intrinsics si-annotate-control-flow does a depth first traversal of the BBs of a function to insert amdgcn if intrinsics for conditional branches so that isel can generate correct instructions later. si-annotate-control-flow checks whether the successor BB for the 'else' branch of a conditional branch has been visited. If it has been visited, si-annotate-control-flow assumes the conditional branch has been handled and will not try to insert an if intrinsic for it. This assumption is not correct when the IR contains multiple unreachable BBs. Then 'if' intrinsics are not inserted and incorrect ISA is generated. This patch fixes the issue by letting amdgpu-unify-divergent-exit-nodes unify unreachables even if they are uniformly reached. In this way the IR will not contain multiple exits, and the structurizer is able to structurize the IR containing one unified exit. Reviewed by: Ruiling Song, Matt Arsenault Differential Revision: https://reviews.llvm.org/D131181 Fixes: SWDEV-343244	2022-08-09 10:23:32 -04:00
Leon Clark	6a275cd53c	Transform illegal intrinsics to V_ILLEGAL Related tasks: - SWDEV-240194 - SWDEV-309417 - SWDEV-334876 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D123693	2022-08-06 08:59:00 +01:00
Austin Kerbow	b568cb1064	[AMDGPU] Pre-commit tests for D130797	2022-08-04 22:52:54 -07:00
jeff	e0b16aaaf9	[AMDGPU] Precommit test case for D130729 Change-Id: I23396655880b5732ad991d47e51900408f5bc5ee	2022-08-03 15:22:16 -07:00
Austin Kerbow	3dfa562643	[AMDGPU] Add CL option for max-ilp scheduler. When compiling for multiple targets the scheduler that is selected via the -misched option is applied globally. This patch adds a target CL option instead. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D131022	2022-08-02 16:52:14 -07:00
Austin Kerbow	d7100b398b	[AMDGPU] Add GCNMaxILPSchedStrategy Creates a new scheduling strategy that attempts to maximize ILP for a single wave. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130869	2022-08-02 13:21:24 -07:00
Alexander Timofeev	a321d95b59	[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs In `2e29b0138c` we introduced a specific solving algorithm that analyzes the VGPR to SGPR copies use chains and either lowers the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms. At the same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs in case they produce SGPR but have VGPR input operands. In case the REG_SEQUENCE and PHIs are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert the copy to v_readfirstlane_b32, further lowering them to VALU leads to several kinds of issues. First, we have v_readfirstlane_b32 which is completely useless because most parts of its use chain were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF because of the weird mixing of SALU and VALU instructions. This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs, we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common VGPR to SGPR lowering algorithm. This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130367	2022-08-02 18:37:57 +02:00
Jay Foad	e301e071ba	[AMDGPU] Remove IR SpeculativeExecution pass from codegen pipeline This pass seems to have very little effect because all it does is hoist some instructions, but it is followed later in the codegen pipeline by the IR CodeSinking pass which does the opposite. Differential Revision: https://reviews.llvm.org/D130258	2022-08-02 17:35:20 +01:00
Jay Foad	c24d68fff1	[AMDGPU] Take advantage of VOP3 literals in convertToThreeAddress This improves a corner case where v_fmac can be converted to v_fma on GFX10+ even if it has a literal operand. Differential Revision: https://reviews.llvm.org/D130992	2022-08-02 17:27:11 +01:00
Vang Thao	7fc52d7c8b	[AMDGPU] Fix DGEMM hazard for GFX90a For VALU write and memory (VM, L/DS, FLAT) instructions, SQ would insert wait-states to avoid data hazard. However when there is a DGEMM instruction in-between them, SQ incorrectly disables the wait-states thus the data hazard needs to be handled with this workaround. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130677	2022-08-01 11:56:22 -07:00
Piotr Sobczak	f29a19b0b8	[AMDGPU] Extend cases for ReadM0MovRelInterpHazard Extend hazard recognizer of ReadM0MovRelInterpHazard with DS_READ_ADDTID and DS_WRITE_ADDTID, as they also require a manually inserted S_NOP after SALU writing m0. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130783	2022-08-01 17:59:33 +02:00
Carl Ritson	4c4db81630	[AMDGPU] Extend SILoadStoreOptimizer to s_load instructions Apply merging to s_load as is done for s_buffer_load. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130742	2022-07-30 11:38:39 +09:00
Austin Kerbow	2c82a126d7	[AMDGPU] Omit unnecessary waitcnt before barriers It is not necessary to wait for all outstanding memory operations before barriers on hardware that can back off of the barrier in the event of an exception when traps are enabled. Add a new subtarget feature which tracks which HW has this ability. Reviewed By: #amdgpu, rampitec Differential Revision: https://reviews.llvm.org/D130722	2022-07-29 11:12:36 -07:00
Amaury Séchet	226086230c	[DAG] Use recursivelyDeleteUnusedNodes in CommitTargetLoweringOpt. It simplifies the logic and removes the need for manual bookkeeping. Differential Revision: https://reviews.llvm.org/D130445	2022-07-29 13:49:03 +00:00
Jay Foad	3cfa9b1431	[AMDGPU] user-sgpr-init16-bug does not apply to gfx1103 Differential Revision: https://reviews.llvm.org/D130347	2022-07-29 14:21:13 +01:00
Matt Arsenault	ef906f287e	AMDGPU: Fix assertion when printing unreachable functions Since `814a0abcce`, this would break if we had a function in the module that becomes dead in any codegen IR pass. The function wasn't deleted since it was initially used in dead code, but is detached from the call graph and doesn't appear in the PO traversal. Do a second walk over the module to populate the resources of any functions which weren't already processed.	2022-07-29 08:57:43 -04:00
Matt Arsenault	a4834ad068	RegisterCoalescer: Shrink main range after shrinking subranges If the subregister uses were dead, this would leave the main range segment pointing to a deleted instruction. Not sure if this should try to avoid shrinking if we know we don't have dead components.	2022-07-29 08:57:28 -04:00
Alexander Timofeev	d7ae1a9097	Revert "[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs" This reverts commit `76d9ae924c` because it causes several VK CTS tests to fail	2022-07-29 14:19:07 +02:00
Changpeng Fang	2b731b30a7	AMDGPU: Take care of "tied" operand when removeOperand Summary: Flat scratch load of D16 type by default has tied vdst_in operand (with vdst). This should be taken care of at the time of "removeOperand" in eliminateFrameIndex. Otherwise we will hit an assert saying "Cannot move tied operands". This patch unties vdst_in before the move, and reties it with vdst afterwards. Reviewers: arsenm, foad Differential Revision: https://reviews.llvm.org/D130537	2022-07-28 17:30:49 -07:00
Anshil Gandhi	5c38056431	[AMDGPU][Scheduler] Avoid initializing Register pressure tracker when tracking is disabled When register pressure tracking is disabled, the scheduler attempts to load pressures at SReg_32 and VGPR_32. This causes an index out of bounds error. This patch fixes this issue by disabling the initialization of RPTracker when not needed. NFC Reviewed By: rampitec, kerbowa, arsenm Differential Revision: https://reviews.llvm.org/D129322	2022-07-28 15:39:28 -06:00
Austin Kerbow	0f93a45b11	[AMDGPU] Add isMeta flag to SCHED_GROUP_BARRIER	2022-07-28 11:04:33 -07:00
Austin Kerbow	f5b21680d1	[AMDGPU] Add amdgcn_sched_group_barrier builtin This builtin allows the creation of custom scheduling pipelines on a per-region basis. Like the sched_barrier builtin this is intended to be used either for testing, in situations where the default scheduler heuristics cannot be improved, or in critical kernels where users are trying to get performance that is close to handwritten assembly. Obviously using these builtins will require extra work from the kernel writer to maintain the desired behavior. The builtin can be used to create groups of instructions called "scheduling groups" where ordering between the groups is enforced by the scheduler. __builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter is a mask that determines the types of instructions that you would like to synchronize around and add to a scheduling group. These instructions will be selected from the bottom up starting from the sched_group_barrier's location during instruction scheduling. The second parameter is the number of matching instructions that will be associated with this sched_group_barrier. The third parameter is an identifier which is used to describe what other sched_group_barriers should be synchronized with. Note that multiple sched_group_barriers must be added in order for them to be useful since they only synchronize with other sched_group_barriers. Only "scheduling groups" with a matching third parameter will have any enforced ordering between them. As an example, the code below tries to create a pipeline of 1 VMEM_READ instruction followed by 1 VALU instruction followed by 5 MFMA instructions... 
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 3 VALU
__builtin_amdgcn_sched_group_barrier(2, 3, 0)
// 2 VMEM_WRITE
__builtin_amdgcn_sched_group_barrier(64, 2, 0)
Reviewed By: jrbyrnes Differential Revision: https://reviews.llvm.org/D128158	2022-07-28 10:43:14 -07:00
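The three parameters described in the commit above (mask, count, sync ID) can be modelled as plain data to make the example pipeline easier to read. The following is only a sketch, not LLVM code: `SchedGroup`, `maskName`, `describe`, `examplePipeline` and `synchronized` are hypothetical names, and the mask table covers only the four values that the commit message itself names (2, 8, 32, 64).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one sched_group_barrier call: (mask, size, sync ID).
struct SchedGroup {
  unsigned mask;   // which instruction type the group collects
  unsigned size;   // how many matching instructions belong to the group
  unsigned syncID; // groups only order against groups with the same ID
};

// Only the mask values that appear in the commit message are named here;
// any other value is reported as UNKNOWN rather than guessed.
inline std::string maskName(unsigned mask) {
  switch (mask) {
  case 2:  return "VALU";
  case 8:  return "MFMA";
  case 32: return "VMEM_READ";
  case 64: return "VMEM_WRITE";
  default: return "UNKNOWN";
  }
}

// Render a group the way the commit message comments do, e.g. "5 MFMA".
inline std::string describe(const SchedGroup &g) {
  return std::to_string(g.size) + " " + maskName(g.mask);
}

// The six calls from the commit message, in source order.
inline std::vector<SchedGroup> examplePipeline() {
  return {{32, 1, 0}, {2, 1, 0}, {8, 5, 0}, {32, 1, 0}, {2, 3, 0}, {64, 2, 0}};
}

// Per the commit message, ordering is enforced only between groups whose
// third parameter matches.
inline bool synchronized(const SchedGroup &a, const SchedGroup &b) {
  return a.syncID == b.syncID;
}
```

Because every call in the example uses sync ID 0, all six groups take part in the same ordering. A group with a different ID would be ordered independently of them.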
Simon Pilgrim	69d5a038b9	[DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts. This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits. There are a few cases where we end up with extra register moves which I think we can accept in exchange for the increased ILP. Differential Revision: https://reviews.llvm.org/D77804	2022-07-28 14:10:44 +01:00
Alexander Timofeev	76d9ae924c	[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs In `2e29b0138c` we introduced a specific solving algorithm that analyzes the VGPR to SGPR copies use chains and either lowers the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms. At the same time, we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs in case they produce SGPR but have VGPR input operands. In case the REG_SEQUENCE and PHIs are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert the copy to v_readfirstlane_b32, further lowering them to VALU leads to several kinds of issues. First, we have v_readfirstlane_b32 which is completely useless because most parts of its use chain were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF because of the weird mixing of SALU and VALU instructions. This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs, we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common VGPR to SGPR lowering algorithm. This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130367	2022-07-28 14:30:29 +02:00
Austin Kerbow	ba0d079c7a	[AMDGPU] Aggressively schedule to reduce RP in occupancy limited regions By not clustering loads and adjusting heuristics to more aggressively reduce register pressure we may be able to increase occupancy for the function if it was dropped in a first pass scheduling. Similarly, try to reduce spilling if register usage exceeds lower bound occupancy. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130329	2022-07-27 22:34:37 -07:00
Carl Ritson	dbda30e294	[AMDGPU][SIFoldOperands] Clear kills when folding COPY Clear all kill flags on source register when folding a COPY. This is necessary because the kills may now be out of order with the uses. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130622	2022-07-28 11:57:55 +09:00
Stanislav Mekhanoshin	68901fdbeb	[AMDGPU] Consider S_SETPRIO a scheduling boundary The instruction is used to modify wave priority with the intent to affect VALU execution and currently we can reschedule VALU around it since that VALU does not have side effects. Differential Revision: https://reviews.llvm.org/D130654	2022-07-27 11:50:23 -07:00
Stanislav Mekhanoshin	190f3fbba8	[AMDGPU] Precommit s_setprio scheduling test. NFC.	2022-07-27 11:26:15 -07:00
Simon Pilgrim	529bd4f352	[DAG] SimplifyDemandedBits - don't early-out for multiple use values SimplifyDemandedBits currently early-outs for multi-use values beyond the root node (just returning the knownbits), which is missing a number of optimizations as there are plenty of cases where we can still simplify when initially demanding all elements/bits. @lenary has confirmed that the test cases in aea-erratum-fix.ll need refactoring and the current increase in codegen is not a major concern. Differential Revision: https://reviews.llvm.org/D129765	2022-07-27 10:54:06 +01:00
Jon Chesterfield	923b90bddb	[amdgpu][nfc] Separate LDS struct creation from RAUW	2022-07-26 20:59:17 +01:00
Matt Arsenault	62531518f9	RegAllocGreedy: Add a command line flag for reverseLocalAssignment Introduce a flag like for some of the other target heuristic controls to help with experimentation.	2022-07-25 15:47:15 -04:00
Matt Arsenault	cb0c71e8b1	AMDGPU: Adjust register allocation priority values down Set the priorities consistently to number of registers in the tuple - 1. Previously we started at 1, and also tried to give SGPR higher values than VGPRs. There's no point in assigning SGPRs higher values now that those are allocated in a separate regalloc run. This avoids overflowing the 5 bits used for the class priority in the allocation heuristic for 32 element tuples. This avoids some cases where smaller registers unexpectedly get prioritized over larger.	2022-07-25 15:47:15 -04:00
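The overflow mentioned in the commit above comes down to simple arithmetic: a 5-bit field holds values 0 through 31, so a 32-register tuple fits only if its priority is 32 - 1. The sketch below checks this at compile time. It is not LLVM code: all names are hypothetical, and `oldTuplePriority` is a simplified stand-in that assumes the earlier numbering mapped an N-register tuple to N (the commit only says it "started at 1").

```cpp
#include <cassert>

// Field width stated in the commit message.
constexpr unsigned kPriorityBits = 5;
// Largest value a 5-bit field can hold: 2^5 - 1 = 31.
constexpr unsigned kMaxPriority = (1u << kPriorityBits) - 1;

// Scheme described in the commit: priority = registers in the tuple - 1.
constexpr unsigned tuplePriority(unsigned numRegs) { return numRegs - 1; }

// ASSUMPTION: simplified model of the previous scheme, where numbering
// started at 1, so an N-register tuple gets priority N.
constexpr unsigned oldTuplePriority(unsigned numRegs) { return numRegs; }

// True when the priority can be stored in the 5-bit field without overflow.
constexpr bool fitsInField(unsigned priority) {
  return priority <= kMaxPriority;
}

static_assert(kMaxPriority == 31, "5 bits hold values 0..31");
static_assert(fitsInField(tuplePriority(32)), "32 - 1 = 31 fits");
static_assert(!fitsInField(oldTuplePriority(32)), "32 needs a sixth bit");
```

Under this model, larger tuples still get strictly larger priorities, which matches the commit's goal of not letting smaller registers be prioritized over larger ones.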
Sanjay Patel	bfb9b8e075	[Passes] add a tail-call-elim pass near the end of the opt pipeline We call tail-call-elim near the beginning of the pipeline, but that is too early to annotate calls that get added later. In the motivating case from issue #47852, the missing 'tail' on memset leads to sub-optimal codegen. I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time. "tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ): https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright Note that there was an effort to split the tail call functionality into 2 passes - that could help reduce compile-time if we find that this change costs more in compile-time than expected based on the preliminary testing: D60031 Differential Revision: https://reviews.llvm.org/D130374	2022-07-25 15:25:47 -04:00
David Stuttard	b14d7bf750	AMDGPU: Turn off force init 16 input SGPRS for pal Pal uses a different mechanism for user sgprs. Differential Revision: https://reviews.llvm.org/D129566	2022-07-25 10:52:46 +01:00
Matt Arsenault	40abb28f61	RegAllocGreedy: Fix subranges when rematerializing dead subreg defs This would create a new interval missing the subrange and hit this verifier error: * Bad machine code: Live interval for subreg operand has no subranges * - function: test_remat_subreg_def - basic block: %bb.0 (0xa568758) [0B;128B) - instruction: 32B dead undef %4.sub0:vreg_64 = V_MOV_B32_e32 2, implicit $exec	2022-07-24 11:51:59 -04:00
Matt Arsenault	12144a12da	AMDGPU: Fix broken test checks	2022-07-24 11:25:49 -04:00
Nuno Lopes	a30e77b6f6	fix tests for commit `9df0b254d2`	2022-07-23 22:32:30 +01:00
Jay Foad	798fa7e9d6	[AMDGPU] Add a test where regClassPriorityTrumpsGlobalness uses more vgprs	2022-07-22 12:08:47 +01:00

5780 Commits