Commit Graph

5780 Commits

Author SHA1 Message Date
Carl Ritson a35013bec6 [AMDGPU][GFX11] Mitigate VALU mask write hazard
VALU use of an SGPR (pair) as mask followed by SALU write to the
same SGPR can cause incorrect execution of subsequent SALU reads
of the SGPR.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D134151
2022-10-01 16:21:24 +09:00
Jeffrey Byrnes 5a61340eb4 [AMDGPU] Fix tests in f6a2e6afed 2022-09-30 12:53:08 -07:00
Jeffrey Byrnes f6a2e6afed [AMDGPU] Precommit test case for D133584 2022-09-30 12:43:23 -07:00
Arthur Eubanks e23aee7175 [test] Update some legacy PM tests 2022-09-30 11:31:02 -07:00
Joe Nash 50dfd3e9e4 [AMDGPU] Add test for FMAC_e64 dpp combine. NFC. 2022-09-30 12:48:12 -04:00
Pierre van Houtryve d8258508d4 [AMDGPU][GISel] Update `isCanonicalized`
Recognize more opcodes in the function.
Fixes some regressions introduced in D134857 for fdiv.f16 too.

Depends on D134857

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D134862
2022-09-30 14:13:35 +00:00
Pierre van Houtryve 7388520d1c [GISel] Add more cases to isKnownNeverNaN
Bring it on par with the DAG implementation as of D134854.

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D134857
2022-09-30 14:10:56 +00:00
Pierre van Houtryve 653beae5a1 [AMDGPU][GISel] Add Identity BUILD_VECTOR Combines
Folds away BUILD_VECTOR-related no-ops in the post-legalizer combiner.

Depends on D134433

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134953
2022-09-30 14:07:13 +00:00
Pierre van Houtryve 9a67a6b72a [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.

This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134433
2022-09-30 14:04:53 +00:00
Thomas Symalla a41dde2c62 [AMDGPU] Add use check in v_fma combine.
In D132837, an existing v_fma combine was extended to handle nested
fma instructions. Originally, the inner FMA was checked for being used
only once. In its current state, this check is missing, which causes
some regressions.

This patch restores the check.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D134856
2022-09-29 12:25:03 +02:00
Thomas Symalla a41764810f [NFC][AMDGPU] Pre-commit FMA test. 2022-09-29 09:54:06 +02:00
Pierre van Houtryve 682c7c77f5 [AMDGPU] Update `mad-mix*` CodeGen tests
- Use `fneg %a` instead of `fsub -0.0, %a`
  - This is for D134354 as we don't currently support folding `fsub -0.0, %a` into `fneg` on GISel.
    Also, `fneg` is the canonical way to do the negation.
- Switch to `update_llc_test_checks`-generated tests.
  - Better test coverage
  - Easier to update
  - Easier to see changes in future diffs
- Remove unnecessary CL arguments in RUN lines

Motive for the patch: Preparation for D134354 - we would like to
put GISel tests in this file as well. Fixing the lack of `fneg` and
switching to generated testing makes it much easier.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134793
2022-09-29 07:11:34 +00:00
Abinav Puthan Purayil 3759398b4b [AMDGPU] Report minimum scratch size in code object v5 and later by default
This change sets
-amdgpu-assume-{external-call-stack-size | dynamic-stack-object-size}
options to zero by default for code object v5 and later. The runtime is
expected to adjust the scratch size if the amdhsa_uses_dynamic_stack bit
in the kernel descriptor is set.

Differential Revision: https://reviews.llvm.org/D128346
2022-09-29 09:52:45 +05:30
Jessica Paquette 1eb49bbab6 [GlobalISel][CallLowering] Use hasRetAttr for return flags on CallBases
Given something like this:

```
declare signext i16 @signext_callee()
define i32 @caller() {
  %res = call i16 @signext_callee()
  ...
}
```

CallLowering would miss that signext_callee's return value is sign extended,
because it isn't on the call.

Use hasRetAttr on the CallBase to allow us to catch this.

(This now inserts G_ASSERT_SEXT/G_ASSERT_ZEXT like in the original review.)
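
For illustration only, a rough Python model of the lookup described above. The `Function` and `Call` classes and the `has_ret_attr` method are hypothetical stand-ins, not LLVM's API; the sketch assumes the combined query consults the call site first and then the callee declaration.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """Hypothetical stand-in for a declared function and its return attributes."""
    name: str
    ret_attrs: set = field(default_factory=set)


@dataclass
class Call:
    """Hypothetical stand-in for a call instruction."""
    callee: Function
    ret_attrs: set = field(default_factory=set)  # attributes written on the call itself

    def has_ret_attr(self, attr: str) -> bool:
        # Check the call site first, then fall back to the callee declaration.
        return attr in self.ret_attrs or attr in self.callee.ret_attrs


callee = Function("signext_callee", {"signext"})
call = Call(callee)

# Looking only at the call site misses the sign extension ...
site_only = "signext" in call.ret_attrs
# ... while the combined query sees it on the declaration.
combined = call.has_ret_attr("signext")
```

Here `site_only` is false and `combined` is true, which mirrors the missed case in the IR snippet above.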

Differential Revision: https://reviews.llvm.org/D86228
2022-09-28 19:38:24 -07:00
Jay Foad 2c12a04bba [ISel] Fix DAG divergence after new FMA combine
D132837 introduced a new DAG combine that used MorphNodeTo to morph an
FMUL into an FMA. It turns out that MorphNodeTo does not properly update
the divergence bit for users of the morphed node, causing an assertion
failure on the new test case:

llc: SelectionDAG.cpp:10486: void llvm::SelectionDAG::VerifyDAGDivergence(): Assertion `calculateDivergence(N) == N->isDivergent() && "Divergence bit inconsistency detected"' failed.

Fixing MorphNodeTo to propagate the divergence bit is tricky because of
the way it is used to select machine instructions, so use getNode and
ReplaceAllUsesOfValueWith instead.

Differential Revision: https://reviews.llvm.org/D134810
2022-09-28 19:41:51 +01:00
Baptiste b556726ccc [AMDGPU] Avoid flushing the vmcnt counter in loop preheaders if not necessary
One of the conditions to flush the vmcnt counter in loop preheaders is: The loop
contains a use of a vgpr that is defined out of the loop. The code currently
checks if a waitcnt is needed by looking at the score of that vgpr in the score
brackets. This is not enough and may cause the generation of an unnecessary
vmcnt flush. This patch fixes that case.

Differential Revision: https://reviews.llvm.org/D130313
2022-09-28 13:05:50 -04:00
Jon Chesterfield 35f2584ef9 [amdgpu] Error, instead of miscompile, anonymous kernels using lds
The association between kernel and struct is done by symbol name.
This doesn't work robustly for anonymous kernels as shown by the modified
test case.

An alternative association between function and struct can be constructed
if necessary, probably through metadata, but on the basis that we currently
miscompile anonymous kernels and that they are difficult to construct from
application code and difficult to call from the runtime, this patch makes
it a fatal error for now.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134741
2022-09-28 16:30:04 +01:00
Sameer Sahasrabuddhe 3f078b308b [AAPointerInfo] OffsetInfo: Unassigned is distinct from Unknown
A User like the PHINode may be visited multiple times for the same pointer along
different def-use edges. The uninitialized state of OffsetInfo at the first
visit needs to be distinct from the Unknown value that may be assigned after
processing the PHINode. Without that, a PHINode with all inputs Unknown is never
followed to its uses. This results in incorrect optimization because some
interfering accesses are missed.
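
A simplified Python model of why the two states must differ (illustrative only; the names `UNASSIGNED`, `UNKNOWN` and `should_follow_uses` are hypothetical and do not come from the Attributor code). It assumes uses are followed only when a visit changes the stored state.

```python
UNASSIGNED = "unassigned"  # the user has never been visited
UNKNOWN = "unknown"        # the user was visited, but its offset is not known


def should_follow_uses(stored, incoming):
    """Follow the uses only when the visit changes the stored state."""
    return stored != incoming


# If the initial state is the same value as Unknown, a PHI whose inputs are
# all Unknown looks unchanged on its first visit and is never followed.
conflated_follows = should_follow_uses(UNKNOWN, UNKNOWN)

# With a distinct initial state, the first visit always registers as a change.
distinct_follows = should_follow_uses(UNASSIGNED, UNKNOWN)
```

In this model `conflated_follows` is false and `distinct_follows` is true.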

Differential Revision: https://reviews.llvm.org/D134704
2022-09-28 20:31:36 +05:30
Matt Arsenault 7a84624079 AMDGPU: Make various vector undefs legal
Surprisingly these were getting legalized to something
zero initialized.

This fixes an infinite loop when combining some vector types.
Also fixes zero initializing some undef values.

SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking
for the legality of the output undefs they are replacing unused
operations with. This resulted in turning vectors into undefs
that were later re-legalized back into zero vectors.
2022-09-28 10:48:52 -04:00
Jon Chesterfield 80ba432821 [amdgpu][nfc] Allocate kernel-specific LDS struct deterministically
A kernel may have an associated struct for laying out LDS variables.
This patch puts that instance, if present, at a deterministic address by
allocating it at the same time as the module scope instance.

This is relatively likely to be where the instance was allocated anyway (~NFC)
but will allow later patches to calculate where a given field can be found,
which means a function which is only reachable from a single kernel will be
able to access a LDS variable with zero overhead. That will be particularly
helpful for applications that instantiate a function template containing LDS
variables once per kernel.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D127052
2022-09-28 14:55:16 +01:00
Carl Ritson 266b5dbc5d [AMDGPU] Add MIMG NSA threshold configuration attribute
Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D134780
2022-09-28 20:03:18 +09:00
Changpeng Fang dee4bc4a4e AMDGPU: Handle new address pattern in LowerKernelAttributes introduced by opaque pointers
Summary:
  With opaque pointer support, the "ptr" type is introduced and thus BitCast is not necessary in some cases.
This work takes care of this change, and recognizes the new address patterns to do appropriate optimizations.

Reviewers:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D134596
2022-09-26 09:31:52 -07:00
jeff e6c29c0338 [AMDGPU] Precommit switching test to generated checks for D134463
Change-Id: I6ed91aacf6856ef535a54284d48e937db33be1a3
2022-09-26 08:13:32 -07:00
Ruiling Song a5676a3a7e StructurizeCFG: Set Undef for non-predecessors in setPhiValues()
During structurization process, we may place non-predecessor blocks
between the predecessors of a block in the structurized CFG. Take
the typical while-break case as an example:
```
 /---A(v=...)
 |  / \
 ^ B   C
 |  \ /|
 \---L |
     \ /
      E (r = phi (v:C)...)
```
After structurization, the CFG would look like:
```
 /---A
 |   |\
 |   | C
 |   |/
 |   F1
 ^   |\
 |   | B
 |   |/
 |   F2
 |   |\
 |   | L
 \   |/
  \--F3
     |
     E
```
We can see that block B is placed between the predecessors (C/L) of E.
During phi reconstruction, to achieve the same semantics as before, we
are reconstructing the PHIs as:
  F1: v1 = phi (v:C), (undef:A)
  F3: r = phi (v1:F2), ...
But this also says that `v1` is live through B, which is
unnecessary. The idea in this change is to say the incoming value
from B is Undef for the PHI in E. With this change, the reconstructed
PHIs would be:
  F1: v1 = phi (v:C), (undef:A)
  F2: v2 = phi (v1:F1), (undef:B)
  F3: r  = phi (v2:F2), ...

Reviewed by: sameerds

Differential Revision: https://reviews.llvm.org/D132450
2022-09-26 09:54:47 +08:00
Ruiling Song 40e9284f3c StructurizeCFG: prefer reduced number of live values
The instruction simplification will try to simplify the affected phis.
In some cases, this might extend the liveness of values. For example:

  BB0:
   | \
   | BB1
   | /
  BB2:phi (BB0, v), (BB1, undef)

The phi in BB2 will be simplified to v as v dominates BB2, but this
increases the number of live values in BB1. By setting CanUseUndef
to false, we will not simplify the phi in this way, which helps
register pressure. This is mandatory for the later change that helps
reduce VGPR pressure for AMDGPU.
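
A minimal Python sketch of the effect of the flag, for illustration only. `simplify_phi` and `UNDEF` are hypothetical names, and the dominance check that the real simplification performs is deliberately omitted.

```python
UNDEF = object()  # marker for an undef incoming value


def simplify_phi(incoming, can_use_undef):
    """incoming: list of (block, value) pairs.

    Return the single value the phi folds to, or None if it is kept.
    """
    defined = {value for _, value in incoming if value is not UNDEF}
    has_undef = any(value is UNDEF for _, value in incoming)
    if len(defined) != 1:
        return None  # zero or several distinct values: nothing to fold
    if has_undef and not can_use_undef:
        return None  # refuse to treat undef as "whatever the other value is"
    return next(iter(defined))


phi = [("BB0", "v"), ("BB1", UNDEF)]
folded = simplify_phi(phi, can_use_undef=True)   # folds to "v", so v is live in BB1
kept = simplify_phi(phi, can_use_undef=False)    # phi is kept, v is not live in BB1
```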

Reviewed by: foad, sameerds

Differential Revision: https://reviews.llvm.org/D132449
2022-09-26 09:54:47 +08:00
Ruiling Song 66325d9ba1 AMDGPU: Add a test to show how later optimization works
Differential Revision: https://reviews.llvm.org/D132448
2022-09-26 09:54:47 +08:00
Ruiling Song cf14c7caac AMDGPU: Add a pass to rewrite certain undef in PHI
For the following IR pattern (where %if terminates with a divergent branch),
divergence analysis will report %phi as uniform to help optimal code
generation.
```
  %if
  | \
  | %then
  | /
  %endif: %phi = phi [ %uniform, %if ], [ %undef, %then ]
```
In the backend, %phi and %uniform will be assigned a scalar register.
But the %undef from %then will make the scalar register dead in %then.
This will likely cause the register to be overwritten in %then. To fix
the issue, we will rewrite %undef as %uniform. For details, please refer to
the comment in AMDGPURewriteUndefForPHI.cpp. Currently there are no test
changes shown, but this is mandatory for later changes.
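
As a rough illustration of the rewrite (not the pass itself), the Python sketch below replaces undef incoming values with the single defined value of a phi. The function name is hypothetical, and the additional conditions the real pass checks (see the comment in AMDGPURewriteUndefForPHI.cpp) are not modelled.

```python
UNDEF = "undef"


def rewrite_undef_for_phi(incoming):
    """incoming: dict mapping predecessor block -> incoming value.

    If the phi has exactly one distinct defined value, return a copy where
    every undef is replaced by that value; otherwise return an unchanged copy.
    """
    defined = {value for value in incoming.values() if value != UNDEF}
    if len(defined) != 1:
        return dict(incoming)
    (unique,) = defined
    return {block: unique if value == UNDEF else value
            for block, value in incoming.items()}


before = {"%if": "%uniform", "%then": UNDEF}
after = rewrite_undef_for_phi(before)
```

After the rewrite both edges carry `%uniform`, so the scalar register stays live through `%then` in this model.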

Reviewed by: sameerds

Differential Revision: https://reviews.llvm.org/D133840
2022-09-26 09:54:47 +08:00
Petar Avramovic dcc756d03e [AMDGPU] Pattern for flat atomic fadd f64 intrinsic with local addr
Fix regression from clang opencl test in builtins-fp-atomics-gfx90a.cl
test_flat_add_local_f64 caused by D130579
Revert a3becb333d.

Differential Revision: https://reviews.llvm.org/D134568
2022-09-25 13:25:41 +02:00
jeff 33ab74ac46 [AMDGPU] Precommit switching test to generated checks for D134463
Change-Id: I0d90f86ab759347a2f20448d28cc09ddaea3a4d4
2022-09-23 15:12:53 -07:00
Jay Foad ddfa0f62d8 [AMDGPU] Add GFX11 feature for subtargets with more VGPRs
The full complement of physical VGPRs for GFX11 is 50% more than GFX10.
Some subtargets have this, others stay the same as GFX10. This affects
occupancy calculations.
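
A back-of-the-envelope Python sketch of why the register file size matters for occupancy. All numbers below are placeholders chosen only so that the larger file is 50% bigger; the function name is hypothetical, and allocation granularity and other limits are ignored.

```python
def max_waves_by_vgprs(total_vgprs, vgprs_per_wave, wave_limit):
    """Waves that fit in the register file, capped by a hardware wave limit."""
    if vgprs_per_wave <= 0:
        raise ValueError("vgprs_per_wave must be positive")
    return min(wave_limit, total_vgprs // vgprs_per_wave)


# Placeholder values, not taken from the patch.
BASE_VGPRS = 1024
LARGE_VGPRS = BASE_VGPRS * 3 // 2  # 50% more
WAVE_LIMIT = 16

base_occupancy = max_waves_by_vgprs(BASE_VGPRS, 96, WAVE_LIMIT)
large_occupancy = max_waves_by_vgprs(LARGE_VGPRS, 96, WAVE_LIMIT)
```

With these placeholder values a kernel using 96 VGPRs per wave fits 10 waves in the smaller file and 16 in the larger one.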

Differential Revision: https://reviews.llvm.org/D134522
2022-09-23 20:18:23 +01:00
jeff 5787d44462 [AMDGPU] Precommit switching test to generated checks for D134463
Change-Id: Iaa8f6263178cfa8405d9a0298b2695b32a42fdf3
2022-09-23 10:15:47 -07:00
Petar Avramovic 6db7921b65 AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd
Remove manual selection for atomic fadd from global-isel.
Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD
which corresponds to llvm-ir's atomicrmw fadd instruction.

global and flat atomic fadd patterns changes:
Split rtn/no-rtn patterns
Add missing patterns or fix predicates
Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors).
Patterns now check addrspace of pointer; added patterns for flat intrinsic
with global addrspace pointer that selects into global atomic instruction.

buffer atomic fadd patterns changes:
Edit patterns to import into global-isel.
Remove gfx6/gfx7 _addr64 and _offset patterns.
Remove patterns that can't be reached (same pattern but different feature).

Differential Revision: https://reviews.llvm.org/D130579
2022-09-23 17:52:10 +02:00
Petar Avramovic 5cee9047d5 AMDGPU: Improve atomicrmw fadd selection
Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11
as for gfx90a. Add missing globalisel legalizer support for flat
atomicrmw fadd f32 on gfx940 and gfx11.
Isel support for gfx11 will be added in D130579.

Differential Revision: https://reviews.llvm.org/D131560
2022-09-23 17:52:10 +02:00
Petar Avramovic 48968c47b0 AMDGPU: Add detailed buffer, global and flat atomic fadd tests
Precommit for D130579 that will remove manual selection and use
patterns from td files. Tests are grouped based on target features.

All patterns have rtn and no-rtn versions.

buffer atomics patterns are selected based on the intrinsic used
(raw or struct) and the offset operand (imm or vgpr):
_offset raw with imm offset
_offen raw with vgpr offset (or large imm offset)
_idxen struct with imm offset
_bothen struct with vgpr offset (or large imm offset)
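
The four-way mapping above can be written as a small Python helper (illustrative only; the function and parameter names are hypothetical and this is not how the TableGen patterns are expressed):

```python
def buffer_atomic_suffix(intrinsic_kind, needs_vgpr_offset):
    """Pick the buffer instruction suffix from the table above.

    intrinsic_kind: "raw" or "struct".
    needs_vgpr_offset: True for a vgpr offset (or a large imm offset),
                       False for a plain imm offset.
    """
    if intrinsic_kind == "raw":
        return "_offen" if needs_vgpr_offset else "_offset"
    if intrinsic_kind == "struct":
        return "_bothen" if needs_vgpr_offset else "_idxen"
    raise ValueError(f"unknown intrinsic kind: {intrinsic_kind!r}")


suffixes = {
    (kind, vgpr): buffer_atomic_suffix(kind, vgpr)
    for kind in ("raw", "struct")
    for vgpr in (False, True)
}
```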

global and flat atomics are selected via intrinsic or the atomicrmw fadd.
atomicrmw tests have amdgpu-unsafe-fp-atomics=true and non-system scope
since they get expanded otherwise. atomicrmw fadd does not support vector
type, test float and double.

global atomics patterns are selected based on address type via (global or
flat) intrinsic or atomicrmw fadd with global address(addrspace(1)*).
'no suffix' vgpr addrspace(1)* address
_saddr sgpr addrspace(1)* address

flat atomics patterns are selected via (flat)intrinsic or atomicrmw fadd
with flat address (* - address space 0).

Differential Revision: https://reviews.llvm.org/D131561
2022-09-23 17:52:10 +02:00
Petar Avramovic e03d36d4ae [AMDGPU] Add FeatureFlatAtomicFaddF32Inst
Feature used by targets that have flat_atomic_add_f32 instruction
(gfx940 and gfx11). Remove isGFX940GFX11Plus.
Add hasFlatAtomicFaddF32Inst Subtarget check for codegen.

Differential Revision: https://reviews.llvm.org/D134532
2022-09-23 17:52:10 +02:00
Matt Arsenault 94ebd7d9ff MachineVerifier: Verify REG_SEQUENCE
Somehow there was no verification of this, other than an ad-hoc
assertion in TwoAddressInstructions.
2022-09-22 09:51:15 -04:00
Thomas Symalla 1f4d3c681c [NFC][AMDGPU] Add new v_bfi Codegen test.
Pre-commit a test for an upcoming change.
2022-09-21 19:02:11 +02:00
Matt Arsenault 10207fc5ae AMDGPU: Move test to correct location
This is not a MIR printer/parser test, so it belongs with the ordinary
codegen tests.
2022-09-21 11:30:32 -04:00
Sanjay Patel 0f32a5dea0 [InstCombine] don't canonicalize shl+sub to mul+add
This stops Negator from transforming:
`C1 - shl X, C2 --> mul X, -(1<<C2) + C1`
...in the general case. There does not seem to be any analysis
benefit to using mul in IR, and there's definitely downside in
codegen (particularly when the multiply has to be expanded).

If `C1` is 0, then there's a stronger argument that the single
mul is a better canonicalization than negate-of-shl, but we may
want to remove that too.
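
The rewrite relies on the identity `C1 - (X << C2) == X * -(1 << C2) + C1` in wrapping arithmetic. A small Python check of that identity, assuming 32-bit wraparound (the helper names are made up for this sketch):

```python
def sub_of_shl(x, c1, c2, bits=32):
    """C1 - (X << C2), wrapped to `bits` bits."""
    mask = (1 << bits) - 1
    return (c1 - (x << c2)) & mask


def mul_then_add(x, c1, c2, bits=32):
    """X * -(1 << C2) + C1, wrapped to `bits` bits."""
    mask = (1 << bits) - 1
    return (x * -(1 << c2) + c1) & mask


samples = [(0, 0, 0), (5, 100, 3), (7, 0, 4), (0xFFFFFFFF, 1, 31), (12345, 999, 16)]
identity_holds = all(
    sub_of_shl(x, c1, c2) == mul_then_add(x, c1, c2) for x, c1, c2 in samples
)
```

The `C1 == 0` case in the sample list is the negate-of-shl form mentioned above.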

This was noted as a potential conflict for D133667.

Differential Revision: https://reviews.llvm.org/D134310
2022-09-21 08:39:07 -04:00
Thomas Symalla c98a46fee6 [ISel] Enable generating more fma instructions.
This patch changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D132837
2022-09-21 12:03:11 +02:00
Jay Foad f09e3ad88a [AMDGPU] Update checks in mad_u64_u32.ll. NFC. 2022-09-21 10:42:55 +01:00
Thomas Symalla 053df6ccaf [NFC][AMDGPU] Pre-commit test for D132837.
Pre-commit an additional fmac test for D132837.
2022-09-21 10:16:29 +02:00
Changpeng Fang 3ae4c3589e AMDGPU: Implicit kernel arguments related optimization when uniform-workgroup-size=true
Summary:
   Under code object version 5, ockl_get_local_size returns the value computed by the expression:
workgroup_id < hidden_block_count ? hidden_group_size : hidden_remainder
For functions with the attribute uniform-work-group-size=true, we can evaluate workgroup_id < hidden_block_count
as true, and thus hidden_group_size is returned for ockl_get_local_size.
  With uniform-work-group-size=true, this work also sets all remainders to zero, and if there
is reqd_work_group_size, we also set work-group-size to the required value from the metadata.
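
The expression and its folded form can be sketched in Python as follows. This is an illustration, not the device library source; the function name and the `uniform_work_group_size` parameter are hypothetical.

```python
def local_size(workgroup_id, hidden_block_count, hidden_group_size,
               hidden_remainder, uniform_work_group_size=False):
    """Model of the value returned for one dimension."""
    if uniform_work_group_size:
        # The comparison is assumed to be true, so the select folds away.
        return hidden_group_size
    if workgroup_id < hidden_block_count:
        return hidden_group_size
    return hidden_remainder


full_group = local_size(3, 4, 64, 16)      # a full workgroup
partial_group = local_size(4, 4, 64, 16)   # the trailing partial workgroup
folded = local_size(3, 4, 64, 0, uniform_work_group_size=True)
```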

Reviewers:
  arsenm and bcahoon

Differential Revision:
  https://reviews.llvm.org/D131276
2022-09-20 17:25:52 -07:00
Anshil Gandhi a0c53524a5 [AMDGPU] Fix size of SOPK instructions to 4 bytes
Instructions in SOPK format may not have 32-bit
literal constants following the instruction.

Differential Revision: https://reviews.llvm.org/D133972
2022-09-20 14:27:09 -06:00
Jay Foad f19cc793d2 [AMDGPU] Disable fp atomic to s_denorm_mode hazard for GFX11
This hazard only exists on GFX10.

Differential Revision: https://reviews.llvm.org/D134276
2022-09-20 17:40:49 +01:00
Joe Nash b982ba2a6e [AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C
Due to the encoding changes in GFX11, we had a hack in place that
disables the use of VGPRs above 128. This patch removes the need for
that hack.

We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
operands of VOP1, VOP2, and VOPC instructions. This register class only has the
low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
VOP2, and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.

We introduce new pseudo-instructions used on GFX11 which have the suffix
t16 (True 16) to use the VGPR_32_Lo128 register class.

Reviewed By: foad, rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D133723
2022-09-20 09:56:28 -04:00
Alexander Timofeev 2e8817b90a [AMDGPU] SIFixSGPRCopies reworking to use one pass over the MIR for analysis and lowering.
This change finalizes the series of patches aiming to replace the old strategy of VGPR to SGPR copy lowering.

  # Following https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367, code parts that are no longer used were removed.
  # The first pass over the MachineFunction collects all the necessary information.
  # Lowering is done in 3 phases:
     - VGPR to SGPR copies analysis and lowering
     - REG_SEQUENCE, PHIs, and SGPR to VGPR copies lowering
     - SCC copies lowering is done in a separate pass over the MachineFunction

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D131246
2022-09-19 23:31:45 +02:00
jeff 1bb293f658 [AMDGPU] [DAGCombiner] Precommit test for D133584
Change-Id: I488ac9b23718f8d0b28db034c4cc455ae736e785
2022-09-19 11:37:35 -07:00
Stanislav Mekhanoshin 167826ee12 [AMDGPU] Fix runline for windows in sdag-print-divergence.ll. NFC. 2022-09-16 12:15:08 -07:00
Stanislav Mekhanoshin a0c8f5fefa [SDAG] Print divergence in SDNode::dump
If the target does not support divergence, the field is set to false
and not printed.

Differential Revision: https://reviews.llvm.org/D133984
2022-09-16 11:43:34 -07:00