Fix regression from clang opencl test in builtins-fp-atomics-gfx90a.cl
test_flat_add_local_f64 caused by D130579
Revert a3becb333d.
Differential Revision: https://reviews.llvm.org/D134568
This patch removes the predicate for return atomic ops and uses
AddedComplexity to distinguish its selection from its no return variant.
This will produce better matchers that doesn't unnecessarily check for
the negated predicate if the initial predicate failed. Also, it
simplifies the enabling of no return atomic ops selection in GlobalISel.
Differential Revision: https://reviews.llvm.org/D128241
MC layer support for ds instructions
Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>
Patch 14/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126463
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D126468
The builtin predicate handling has a strange behavior where the code
assumes that a PatFrag is a stack of PatFrags, and each level adds at
most one predicate. I don't think this particularly makes sense,
especially without a diagnostic to ensure you aren't trying to set
multiple at once.
This wasn't followed for address spaces and alignment, which could
potentially fall through to report no builtin predicate was
added. Just switch these to follow the existing convention for now.
This is to avoid relying on the post-isel hook.
This change also enable the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.
Differential Revision: https://reviews.llvm.org/D123956
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.
Differential Revision: https://reviews.llvm.org/D104912
This is just a reminder that the operands are swapped compared with all
the other CmpXChg instructions.
Differential Revision: https://reviews.llvm.org/D119421
SelectionDAG relies on MachineInstr's HasPostISelHook for selecting the
no-return atomic ops. GlobalISel, at the moment, doesn't handle
HasPostISelHook.
This change adds the selection for no-return ds_* atomic ops in tblgen
so that it can work with both GlobalISel and SelectionDAG. I couldn't
add the predicates for GlobalISel in this change since there's a
restriction in GlobalISelEmitter that disallows selecting generic
atomics ops that return with instructions that doesn't return.
We can't remove the HasPostISelHook code that selects the no return
atomic ops in SelectionDAG yet since we still need to cover selections
in FLATInstructions.td, BUFInstructions.td.
Differential Revision: https://reviews.llvm.org/D115881
Add patterns for i8/i16 local atomic load/store.
Added tests for new patterns.
Copied atomic_[store/load]_local.ll to GlobalISel directory.
Differential Revision: https://reviews.llvm.org/D111869
This does not affect codegen, which only tests these flags on Pseudo
instructions, but might help llvm-mca which has to work with Real
instructions. In particular setting LGKM_CNT on DS instructions helps
with the problem identified in D104149.
Differential Revision: https://reviews.llvm.org/D104293
Part of the code related to ds_read/ds_write ISel is refactored, and the
corresponding comment is re-written for better readability, which would help
while implementing any future ds_read/ds_write ISel related modifications.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100300
Coyp SchedRW from pseudos to real instructions so that llvm-mca has
access to it. This is NFC for normal compiler codegen, which schedules
pseudos not real instructions.
Add an llvm-mca test for some high latency double-precision instructions
as a smoke test.
Differential Revision: https://reviews.llvm.org/D99187
We are using AtomicNoRet map in multiple places to determine
if an instruction atomic, rtn or nortn atomic. This method
does not work always since we have some instructions which
only has rtn or nortn version.
One such instruction is ds_wrxchg_rtn_b32 which does not have
nortn version. This has caused changes in memory legalizer
tests.
Differential Revision: https://reviews.llvm.org/D96639
Both ds_read_b128 and ds_read2_b64 are valid for 128bit 16-byte aligned
loads but the one that will be selected is determined either by the order in
tablegen or by the AddedComplexity attribute. Currently ds_read_b128 has
priority.
While ds_read2_b64 has lower alignment requirements, we cannot always
restrict ds_read_b128 to 16-byte alignment because of unaligned-access-mode
option. This was causing ds_read_b128 to be selected for 8-byte aligned
loads regardles of chosen access mode.
To resolve this we use two patterns for selecting ds_read_b128. One
requires alignment of 16-byte and the other requires
unaligned-access-mode option.
Same goes for ds_write2_b64 and ds_write_b128.
Differential Revision: https://reviews.llvm.org/D92767
By setting up the AsmStrings correctly we can remove some special cases
from AMDGPUInstPrinter::printOffset.
Differential Revision: https://reviews.llvm.org/D90307
Fix local ds_read/write_b96/b128 so they can be selected if the alignment
allows. Otherwise, either pick appropriate ds_read2/write2 instructions or break
them down.
Differential Revision: https://reviews.llvm.org/D81638
ISD::ATOMIC_STORE arbitrarily has the operands in the opposite order
from regular ISD::STORE, which always introduced an annoying
duplication of patterns to handle both cases. Since in GlobalISel
there's just the one G_STORE, we need to swap the operands to
correctly emit the type check for the pointer operand.
Some work started in 20aafa3156 to
migrate SelectionDAG to use ISD::STORE for atomics, but that work
seems to have stalled. Since this is the pretty much the last
operation which matters which isn't supported for AMDGPU, use this
compatibility hack to unblock declaring it functionally complete.
Not sure what's going on with the pending_phis AArch64 test. It seems
it didn't always use atomics, and I'm not sure what it was originally
testing matters anymore.
The annoying behavior where the output is different due to the
legality check struck again, plus the subtarget predicate wasn't
really correctly set for DS FP atomics.
Some of the FP min/max instructions seem to be in the gfx6/gfx7
manuals, but IIRC this might have been one of the cases where the
manual got ahead of the actual hardware support, but I've left these
as-is for now since the assembler tests seem to expect them.
It uses VGPR_32.RegTypes which includes 16 bit types. As a
result DS_WRITE_B32 may be generated for "store i16" which
is a bug. The only reason we do not hit it now is relative
patterns complexity and sorting. Should DS_WRITE_B16 pattern
complexity become higher and the bug appears.
Differential Revision: https://reviews.llvm.org/D74868
The current implementation assumes there is an instruction associated
with the transform, but this is not the case for
timm/TargetConstant/immarg values. These transforms should directly
operate on a specific MachineOperand in the source
instruction. TableGen would assert if you attempted to define an
equivalent GISDNodeXFormEquiv using timm when it failed to find the
instruction matcher.
Specially recognize SDNodeXForms on timm, and pass the operand index
to the render function.
Ideally this would be a separate render function type that looks like
void renderFoo(MachineInstrBuilder, const MachineOperand&), but this
proved to be somewhat mechanically painful. Add an optional operand
index which will only be passed if the transform should only look at
the one source operand.
Theoretically it would also be possible to only ever pass the
MachineOperand, and the existing renderers would check the parent. I
think that would be somewhat ugly for the standard usage which may
want to inspect other operands, and I also think MachineOperand should
eventually not carry a pointer to the parent instruction.
Use it in one sample pattern. This isn't a great example, since the
transform exists to satisfy DAG type constraints. This could also be
avoided by just changing the MachineInstr's arbitrary choice of
operand type from i16 to i32. Other patterns have nontrivial uses, but
this serves as the simplest example.
One flaw this still has is if you try to use an SDNodeXForm defined
for imm, but the source pattern uses timm, you still see the "Failed
to lookup instruction" assert. However, there is now a way to avoid
it.
We are duplicating predicates if several parts of the combined
predicate list contain the same condition. Added code to deduplicate
the list.
We have AssemblerPredicates and AssemblerPredicate in the
PredicateControl, but we never use AssemblerPredicates with an
actual list, so this one is dropped.
This addresses the first part of the llvm bug 43886:
https://bugs.llvm.org/show_bug.cgi?id=43886
Differential Revision: https://reviews.llvm.org/D69815
This reverts r372314, reapplying r372285 and the commits which depend
on it (r372286-r372293, and r372296-r372297)
This was missing one switch to getTargetConstant in an untested case.
llvm-svn: 372338
This broke the Chromium build, causing it to fail with e.g.
fatal error: error in backend: Cannot select: t362: v4i32 = X86ISD::VSHLI t392, Constant:i8<15>
See llvm-commits thread of r372285 for details.
This also reverts r372286, r372287, r372288, r372289, r372290, r372291,
r372292, r372293, r372296, and r372297, which seemed to depend on the
main commit.
> Encode them directly as an imm argument to G_INTRINSIC*.
>
> Since now intrinsics can now define what parameters are required to be
> immediates, avoid using registers for them. Intrinsics could
> potentially want a constant that isn't a legal register type. Also,
> since G_CONSTANT is subject to CSE and legalization, transforms could
> potentially obscure the value (and create extra work for the
> selector). The register bank of a G_CONSTANT is also meaningful, so
> this could throw off future folding and legalization logic for AMDGPU.
>
> This will be much more convenient to work with than needing to call
> getConstantVRegVal and checking if it may have failed for every
> constant intrinsic parameter. AMDGPU has quite a lot of intrinsics wth
> immarg operands, many of which need inspection during lowering. Having
> to find the value in a register is going to add a lot of boilerplate
> and waste compile time.
>
> SelectionDAG has always provided TargetConstant for constants which
> should not be legalized or materialized in a register. The distinction
> between Constant and TargetConstant was somewhat fuzzy, and there was
> no automatic way to force usage of TargetConstant for certain
> intrinsic parameters. They were both ultimately ConstantSDNode, and it
> was inconsistently used. It was quite easy to mis-select an
> instruction requiring an immediate. For SelectionDAG, start emitting
> TargetConstant for these arguments, and using timm to match them.
>
> Most of the work here is to cleanup target handling of constants. Some
> targets process intrinsics through intermediate custom nodes, which
> need to preserve TargetConstant usage to match the intrinsic
> expectation. Pattern inputs now need to distinguish whether a constant
> is merely compatible with an operand or whether it is mandatory.
>
> The GlobalISelEmitter needs to treat timm as a special case of a leaf
> node, simlar to MachineBasicBlock operands. This should also enable
> handling of patterns for some G_* instructions with immediates, like
> G_FENCE or G_EXTRACT.
>
> This does include a workaround for a crash in GlobalISelEmitter when
> ARM tries to uses "imm" in an output with a "timm" pattern source.
llvm-svn: 372314