llvm-project

Commit Graph

Author	SHA1	Message	Date
Carl Ritson	a35013bec6	[AMDGPU][GFX11] Mitigate VALU mask write hazard VALU use of an SGPR (pair) as mask followed by SALU write to the same SGPR can cause incorrect execution of subsequent SALU reads of the SGPR. Reviewed By: foad, rampitec Differential Revision: https://reviews.llvm.org/D134151	2022-10-01 16:21:24 +09:00
Jeffrey Byrnes	5a61340eb4	[AMDGPU] Fix tests in `f6a2e6afed`	2022-09-30 12:53:08 -07:00
Jeffrey Byrnes	f6a2e6afed	[AMDGPU] Precommit test case for D133584	2022-09-30 12:43:23 -07:00
Arthur Eubanks	e23aee7175	[test] Update some legacy PM tests	2022-09-30 11:31:02 -07:00
Joe Nash	50dfd3e9e4	[AMDGPU] Add test for FMAC_e64 dpp combine. NFC.	2022-09-30 12:48:12 -04:00
Pierre van Houtryve	d8258508d4	[AMDGPU][GISel] Update `isCanonicalized` Recognize more opcodes in the function. Fixes some regressions introduced in D134857 for fdiv.f16 too. Depends on D134857 Reviewed By: arsenm, foad Differential Revision: https://reviews.llvm.org/D134862	2022-09-30 14:13:35 +00:00
Pierre van Houtryve	7388520d1c	[GISel] Add more cases to isKnownNeverNaN Make it even with the DAG implementation as of D134854 Reviewed By: arsenm, foad Differential Revision: https://reviews.llvm.org/D134857	2022-09-30 14:10:56 +00:00
Pierre van Houtryve	653beae5a1	[AMDGPU][GISel] Add Identity BUILD_VECTOR Combines Folds-away BUILD_VECTOR-related noops in the post-legalizer combiner. Depends on D134433 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134953	2022-09-30 14:07:13 +00:00
Pierre van Houtryve	9a67a6b72a	[AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal. Also removes RegBankInfo's scalarization of small BUILD_VECTORs, replacing it with InstructionSelector logic instead. This allows for V2S16 BUILD_VECTOR instructions to survive all the way to ISel so we can select FMA/MAD_MIX instructions in D134354. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134433	2022-09-30 14:04:53 +00:00
Thomas Symalla	a41dde2c62	[AMDGPU] Add use check in v_fma combine. In D132837, an existing v_fma combine was extended to regard nested fma instructions. Originally, the inner FMA was checked for being used only once. In its current state, this check is missing, which causes some regressions. In this patch, this check was added. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D134856	2022-09-29 12:25:03 +02:00
Thomas Symalla	a41764810f	[NFC][AMDGPU] Pre-commit FMA test.	2022-09-29 09:54:06 +02:00
Pierre van Houtryve	682c7c77f5	[AMDGPU] Update `mad-mix*` CodeGen tests - Use `fneg %a` instead of `fsub -0.0, %a` - This is for D134354 as we don't currently support folding `fsub -0.0, %a` into `fneg` on GISel. Also, `fneg` is the canonical way to do the negation. - Switch to `update_llc_test_checks`-generated tests. - Better test coverage - Easier to update - Easier to see changes in future diffs - Remove unnecessary CL arguments in RUN lines Motive for the patch: Preparation for D134354 - we would like to put GISel tests in this file as well. Fixing the lack of `fneg` and switching to generated testing makes it much easier. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134793	2022-09-29 07:11:34 +00:00
Abinav Puthan Purayil	3759398b4b	[AMDGPU] Report minimum scratch size in code object v5 and later by default This change sets -amdgpu-assume-{external-call-stack-size \| dynamic-stack-object-size} options to zero by default for code object v5 and later. The runtime is expected to adjust the scratch size if the amdhsa_uses_dynamic_stack bit in the kernel descriptor is set. Differential Revision: https://reviews.llvm.org/D128346	2022-09-29 09:52:45 +05:30
Jessica Paquette	1eb49bbab6	[GlobalISel][CallLowering] Use hasRetAttr for return flags on CallBases Given something like this: ``` declare signext i16 @signext_callee() define i32 @caller() { %res = call i16 @signext_callee() ... } ``` CallLowering would miss that signext_callee's return value is sign extended, because it isn't on the call. Use hasRetAttr on the CallBase to allow us to catch this. (This now inserts G_ASSERT_SEXT/G_ASSERT_ZEXT like in the original review.) Differential Revision: https://reviews.llvm.org/D86228	2022-09-28 19:38:24 -07:00
Jay Foad	2c12a04bba	[ISel] Fix DAG divergence after new FMA combine D132837 introduced a new DAG combine that used MorphNodeTo to morph an FMUL into an FMA. It turns out that MorphNodeTo does not properly update the divergence bit for users of the morphed node, causing an assertion failure on the new test case: llc: SelectionDAG.cpp:10486: void llvm::SelectionDAG::VerifyDAGDivergence(): Assertion `calculateDivergence(N) == N->isDivergent() && "Divergence bit inconsistency detected"' failed. Fixing MorphNodeTo to propagate the divergence bit is tricky because of the way it is used to select machine instructions, so use getNode and ReplaceAllUsesOfValueWith instead. Differential Revision: https://reviews.llvm.org/D134810	2022-09-28 19:41:51 +01:00
Baptiste	b556726ccc	[AMDGPU] Avoid flushing the vmcnt counter in loop preheaders if not necessary One of the conditions to flush the vmcnt counter in loop preheaders is: The loop contains a use of a vgpr that is defined out of the loop. The code currently checks if a waitcnt is needed by looking at the score of that vgpr in the score brackets. This is not enough and may cause the generation of an unnecessary vmcnt flush. This patch fixes that case. Differential Revision: https://reviews.llvm.org/D130313	2022-09-28 13:05:50 -04:00
Jon Chesterfield	35f2584ef9	[amdgpu] Error, instead of miscompile, anonymous kernels using lds The association between kernel and struct is done by symbol name. This doesn't work robustly for anonymous kernels as shown by the modified test case. An alternative association between function and struct can be constructed if necessary, probably though metadata, but on the basis that we currently miscompile anonymous kernels and that they are difficult to construct from application code and difficult to call from the runtime, this patch makes it a fatal error for now. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134741	2022-09-28 16:30:04 +01:00
Sameer Sahasrabuddhe	3f078b308b	[AAPointerInfo] OffsetInfo: Unassigned is distinct from Unknown A User like the PHINode may be visited multiple times for the same pointer along different def-use edges. The uninitialized state of OffsetInfo at the first visit needs to be distinct from the Unknown value that may be assigned after processing the PHINode. Without that, a PHINode with all inputs Unknown is never followed to its uses. This results in incorrect optimization because some interfering accessess are missed. Differential Revision: https://reviews.llvm.org/D134704	2022-09-28 20:31:36 +05:30
Matt Arsenault	7a84624079	AMDGPU: Make various vector undefs legal Surprisingly these were getting legalized to something zero initialized. This fixes an infinite loop when combining some vector types. Also fixes zero initializing some undef values. SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking for the legality of the output undefs they are replacing unused operations with. This resulted in turning vectors into undefs that were later re-legalized back into zero vectors.	2022-09-28 10:48:52 -04:00
Jon Chesterfield	80ba432821	[amdgpu][nfc] Allocate kernel-specific LDS struct deterministically A kernel may have an associated struct for laying out LDS variables. This patch puts that instance, if present, at a deterministic address by allocating it at the same time as the module scope instance. This is relatively likely to be where the instance was allocated anyway (~NFC) but will allow later patches to calculate where a given field can be found, which means a function which is only reachable from a single kernel will be able to access a LDS variable with zero overhead. That will be particularly helpful for applications that instantiate a function template containing LDS variables once per kernel. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D127052	2022-09-28 14:55:16 +01:00
Carl Ritson	266b5dbc5d	[AMDGPU] Add MIMG NSA threshold configuration attribute Make MIMG NSA minimum addresses threshold an attribute that can be set on a function or configured via command line. This enables frontend tuning which allows increased NSA usage where beneficial. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D134780	2022-09-28 20:03:18 +09:00
Changpeng Fang	dee4bc4a4e	AMDGPU: Handle new address pattern in LowerKernelAttributes introduced by opaque pointers Summary: With opaque pointer support, the "ptr" type is introduced and thus BitCast is not necessary in some cases. This work takes care of this change, and recognizes the new address patterns to do appropriate optimizations. Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D134596	2022-09-26 09:31:52 -07:00
jeff	e6c29c0338	[AMDGPU] Precommit switching test to generated checks for D134463 Change-Id: I6ed91aacf6856ef535a54284d48e937db33be1a3	2022-09-26 08:13:32 -07:00
Ruiling Song	a5676a3a7e	StructurizeCFG: Set Undef for non-predecessors in setPhiValues() During structurization process, we may place non-predecessor blocks between the predecessors of a block in the structurized CFG. Take the typical while-break case as an example: ``` /---A(v=...) \| / \ ^ B C \| \ /\| \---L \| \ / E (r = phi (v:C)...) ``` After structurization, the CFG would be look like: ``` /---A \| \|\ \| \| C \| \|/ \| F1 ^ \|\ \| \| B \| \|/ \| F2 \| \|\ \| \| L \ \|/ \--F3 \| E ``` We can see that block B is placed between the predecessors(C/L) of E. During phi reconstruction, to achieve the same sematics as before, we are reconstructing the PHIs as: F1: v1 = phi (v:C), (undef:A) F3: r = phi (v1:F2), ... But this is also saying that `v1` would be live through B, which is not quite necessary. The idea in the change is to say the incoming value from B is Undef for the PHI in E. With this change, the reconstructed PHI would be: F1: v1 = phi (v:C), (undef:A) F2: v2 = phi (v1:F1), (undef:B) F3: r = phi (v2:F2), ... Reviewed by: sameerds Differential Revision: https://reviews.llvm.org/D132450	2022-09-26 09:54:47 +08:00
Ruiling Song	40e9284f3c	StructurizeCFG: prefer reduced number of live values The instruction simplification will try to simplify the affected phis. In some cases, this might extend the liveness of values. For example: BB0: \| \ \| BB1 \| / BB2:phi (BB0, v), (BB1, undef) The phi in BB2 will be simplified to v as v dominates BB2, but this is increasing the number of active values in BB1. By setting CanUseUndef to false, we will not simplify the phi in this way, this would help register pressure. This is mandatory for the later change to help reducing VGPR pressure for AMDGPU. Reviewed by: foad, sameerds Differential Revision: https://reviews.llvm.org/D132449	2022-09-26 09:54:47 +08:00
Ruiling Song	66325d9ba1	AMDGPU: Add a test to show how later optimization works Differential Revision: https://reviews.llvm.org/D132448	2022-09-26 09:54:47 +08:00
Ruiling Song	cf14c7caac	AMDGPU: Add a pass to rewrite certain undef in PHI For the pattern of IR (%if terminates with a divergent branch.), divergence analysis will report %phi as uniform to help optimal code generation. ``` %if \| \ \| %then \| / %endif: %phi = phi [ %uniform, %if ], [ %undef, %then ] ``` In the backend, %phi and %uniform will be assigned a scalar register. But the %undef from %then will make the scalar register dead in %then. This will likely cause the register being over-written in %then. To fix the issue, we will rewrite %undef as %uniform. For details, please refer the comment in AMDGPURewriteUndefForPHI.cpp. Currently there is no test changes shown, but this is mandatory for later changes. Reviewed by: sameerds Differential Revision: https://reviews.llvm.org/D133840	2022-09-26 09:54:47 +08:00
Petar Avramovic	dcc756d03e	[AMDGPU] Pattern for flat atomic fadd f64 intrinsic with local addr Fix regression from clang opencl test in builtins-fp-atomics-gfx90a.cl test_flat_add_local_f64 caused by D130579 Revert `a3becb333d`. Differential Revision: https://reviews.llvm.org/D134568	2022-09-25 13:25:41 +02:00
jeff	33ab74ac46	[AMDGPU] Precommit switching test to generated checks for D134463 Change-Id: I0d90f86ab759347a2f20448d28cc09ddaea3a4d4	2022-09-23 15:12:53 -07:00
Jay Foad	ddfa0f62d8	[AMDGPU] Add GFX11 feature for subtargets with more VGPRs The full complement of physical VGPRs for GFX11 is 50% more than GFX10. Some subtargets have this, others stay the same as GFX10. This affects occupancy calculations. Differential Revision: https://reviews.llvm.org/D134522	2022-09-23 20:18:23 +01:00
jeff	5787d44462	[AMDGPU] Precommit switching test to generated checks for D134463 Change-Id: Iaa8f6263178cfa8405d9a0298b2695b32a42fdf3	2022-09-23 10:15:47 -07:00
Petar Avramovic	6db7921b65	AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd Remove manual selection for atomic fadd from global-isel. Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD which corresponds to llvm-ir's atomicrmw fadd instruction. global and flat atomic fadd patterns changes: Split rtn/no-rtn patterns Add missing patterns or fix predicates Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors). Patterns now check addrspace of pointer, added patterns for flat intrinsic. with global addrspace pointer that selects into global atomic instruction. buffer atomic fadd patterns changes: Rdit patterns to import into global-isel. Remove gfx6/gfx7 _addr64 and _offset patterns. Remove patterns that can't be reached (same pattern but different feature). Differential Revision: https://reviews.llvm.org/D130579	2022-09-23 17:52:10 +02:00
Petar Avramovic	5cee9047d5	AMDGPU: Improve atomicrmw fadd selection Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11 as for gfx90a. Add missing globalisel legalizer support for flat atomicrmw fadd f32 on gfx940 and gfx11. Isel support for gfx11 will be added in D130579. Differential Revision: https://reviews.llvm.org/D131560	2022-09-23 17:52:10 +02:00
Petar Avramovic	48968c47b0	AMDGPU: Add detailed buffer, global and flat atomic fadd tests Precommit for D130579 that will remove manual selection and use patterns from td files. Tests are grouped based on target features. All patterns have rtn and no-rtn versions. buffer atomics patterns are selected based on the intrinsic used (raw or struct) and the offset operand (imm or vgpr): _offset raw with imm offset _offen raw with vgpr offset (or large imm offset) _idxen struct with imm offset _bothen struct with vgpr offset (or large imm offset) global and flat atomics are selected via intrinsic or the atomicrmw fadd. atomicrmw tests have amdgpu-unsafe-fp-atomics=true and non-system scope since they get expanded otherwise. atomicrmw fadd does not support vector type, test float and double. global atomics patterns are selected based on address type via (global or flat) intrinsic or atomicrmw fadd with global address(addrspace(1)). 'no suffix' vgpr addrspace(1) address _saddr sgpr addrspace(1)* address flat atomics patterns are selected via (flat)intrinsic or atomicrmw fadd with flat address (* - address space 0). Differential Revision: https://reviews.llvm.org/D131561	2022-09-23 17:52:10 +02:00
Petar Avramovic	e03d36d4ae	[AMDGPU] Add FeatureFlatAtomicFaddF32Inst Feature used by targets that have flat_atomic_add_f32 instruction (gfx940 and gfx11). Remove isGFX940GFX11Plus. Add hasFlatAtomicFaddF32Inst Subtarget check for codegen. Differential Revision: https://reviews.llvm.org/D134532	2022-09-23 17:52:10 +02:00
Matt Arsenault	94ebd7d9ff	MachineVerifier: Verify REG_SEQUENCE Somehow there was no verification of this, other than an ad-hoc assertion in TwoAddressInstructions.	2022-09-22 09:51:15 -04:00
Thomas Symalla	1f4d3c681c	[NFC][AMDGPU] Add new v_bfi Codegen test. Pre-commit a test for an upcoming change.	2022-09-21 19:02:11 +02:00
Matt Arsenault	10207fc5ae	AMDGPU: Move test to correct location This is not a MIR printer/parser test, so it belongs with the ordinary codegen tests.	2022-09-21 11:30:32 -04:00
Sanjay Patel	0f32a5dea0	[InstCombine] don't canonicalize shl+sub to mul+add This stops Negator from transforming: `C1 - shl X, C2 --> mul X, (1<<C2) + C1` ...in the general case. There does not seem to be any analysis benefit to using mul in IR, and there's definitely downside in codegen (particularly when the multiply has to be expanded). If `C1` is 0, then there's a stronger argument that the single mul is a better canonicalization than negate-of-shl, but we may want to remove that too. This was noted as a potential conflict for D133667. Differential Revision: https://reviews.llvm.org/D134310	2022-09-21 08:39:07 -04:00
Thomas Symalla	c98a46fee6	[ISel] Enable generating more fma instructions. This patch changes a FADD / FMUL => FMA ISel pattern implemented in D80801 so that it peeks through more than one FMA. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D132837	2022-09-21 12:03:11 +02:00
Jay Foad	f09e3ad88a	[AMDGPU] Update checks in mad_u64_u32.ll. NFC.	2022-09-21 10:42:55 +01:00
Thomas Symalla	053df6ccaf	[NFC][AMDGPU] Pre-commit test for D132837. Pre-commit an additional fmac test for D132837.	2022-09-21 10:16:29 +02:00
Changpeng Fang	3ae4c3589e	AMDGPU: Implicit kernel arguments related optimization when uniform-workgroup-size=true Summary: Under code object version 5, ockl_get_local_size returns the value computed by the expression: workgroup_id < hidden_block_count ? hidden_group_size : hidden_remainder For functions with the attribute uniform-work-group-size=true. we can evaluate workgroup_id < hidden_block_count as true, and thus hidden_group_size is returned for ockl_get_local_size. With uniform-workgroup-size=true, this work also set all remainders to zero, and if there is reqd_work_group_size, we also set work-group-size to the required value from the metadata. Reviewers: arsenm and bcahoon Differential Revision: https://reviews.llvm.org/D131276	2022-09-20 17:25:52 -07:00
Anshil Gandhi	a0c53524a5	[AMDGPU] Fix size of SOPK instructions to 4 bytes Instructions in SOPK format may not have 32-bit literal constants following the instruction. Differential Revision: https://reviews.llvm.org/D133972	2022-09-20 14:27:09 -06:00
Jay Foad	f19cc793d2	[AMDGPU] Disable fp atomic to s_denorm_mode hazard for GFX11 This hazard only exists on GFX10. Differential Revision: https://reviews.llvm.org/D134276	2022-09-20 17:40:49 +01:00
Joe Nash	b982ba2a6e	[AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C Due to the encoding changes in GFX11, we had a hack in place that disables the use of VGPRs above 128. This patch removes the need for that hack. We introduce a new register class VGPR_32_Lo128 which is used for 16-bit operands of VOP1, VOP2, and VOPC instructions. This register class only has the low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1, VOP2, and VOPC instructions are correctly limited to use the first 128 VGPRs, while the other instructions can freely use all 256. We introduce new pseduo-instructions used on GFX11 which have the suffix t16 (True 16) to use the VGPR_32_Lo128 register class. Reviewed By: foad, rampitec, #amdgpu Differential Revision: https://reviews.llvm.org/D133723	2022-09-20 09:56:28 -04:00
Alexander Timofeev	2e8817b90a	[AMDGPU] SIFixSGPRCopies reworking to use one pass over the MIR for analysis and lowering. This change finalizes the series of patches aiming to replace the old strategy of VGPR to SGPR copy lowering. # Following the https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367 code parts that are no longer used were removed. # The first pass over the MachineFunctoin collects all the necessary information. # Lowering is done in 3 phases: - VGPR to SGPR copies analysis lowering - REG_SEQUENCE, PHIs, and SGPR to VGPR copies lowering - SCC copies lowering is done in a separate pass over the Machine Function Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D131246	2022-09-19 23:31:45 +02:00
jeff	1bb293f658	[AMDGPU] [DAGCombiner] Precommit test for D133584 Change-Id: I488ac9b23718f8d0b28db034c4cc455ae736e785	2022-09-19 11:37:35 -07:00
Stanislav Mekhanoshin	167826ee12	[AMDGPU] Fix runline for windows in sdag-print-divergence.ll. NFC.	2022-09-16 12:15:08 -07:00
Stanislav Mekhanoshin	a0c8f5fefa	[SDAG] Print divergence in SDNode::dump If target does not support divergence the field is set to false and not printed. Differential Revision: https://reviews.llvm.org/D133984	2022-09-16 11:43:34 -07:00

1 2 3 4 5 ...

5780 Commits