This commit adds a new IR level pass to the AMDGPU backend to perform
atomic optimizations. It works by:
- Running through a function and finding atomicrmw add/sub or uses of
the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform,
record the value to be optimized.
- Run through the atomic operations we can optimize and, depending on
whether the value is uniform/divergent use wavefront wide operations
(DPP in the divergent case) to calculate the total amount to be
atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic
operation, reducing the total number of atomic operations in flight.
- Lastly we recombine the result from the single lane to each lane of
the wavefront, and calculate our individual lanes offset into the
final result.
Differential Revision: https://reviews.llvm.org/D51969
llvm-svn: 343973
The IRBuilder CreateIntrinsic method wouldn't allow you to specify the
types that you wanted the intrinsic to be mangled with. To fix this
I've:
- Added an ArrayRef<Type *> member to both CreateIntrinsic overloads.
- Used that array to pass into the Intrinsic::getDeclaration call.
- Added a CreateUnaryIntrinsic to replace the most common use of
CreateIntrinsic where the type was auto-deduced from operand 0.
- Added a bunch more unit tests to test Create*Intrinsic calls that
weren't being tested (including the FMF flag that wasn't checked).
This was suggested as part of the AMDGPU specific atomic optimizer
review (https://reviews.llvm.org/D51969).
Differential Revision: https://reviews.llvm.org/D52087
llvm-svn: 343962
Summary:
Merge the SMRD patterns for CI into the same multiclass as the
patterns for other sub-targets.
This removes some duplicate code and will make it easier for some
future GlobalISel changes I would like to do.
Reviewers: arsenm
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52557
llvm-svn: 343909
Finally all targets are enabling multiple regalloc hints, so the hook to
disable this can now be removed.
NFC.
Review: Simon Pilgrim
https://reviews.llvm.org/D52316
llvm-svn: 343851
The isAmdCodeObjectV2 is a misleading name which actually checks whether the os
is amdhsa or mesa.
Also add a test to make sure we do not generate old kernel header for code
object v3.
Differential Revision: https://reviews.llvm.org/D52897
llvm-svn: 343813
Summary:
The new buffer/tbuffer intrinsics handle an out-of-range immediate
offset by moving/adding offset&-4096 to a vgpr, leaving an in-range
immediate offset, with a chance of the move/add being CSEd for similar
loads/stores.
However it turns out that a negative offset in a vgpr is illegal, even
if adding the immediate offset makes it legal again.
Therefore, this commit disables the offset&-4096 thing if the offset is
negative.
Differential Revision: https://reviews.llvm.org/D52683
Change-Id: Ie02f0a74f240a138dc2a29d17cfbd9e350e4ed13
llvm-svn: 343672
Currently it returns incorrect operand size for a target independet
node such as COPY if operand is a register with subreg. Instead of
correct subreg size it returns a size of the whole superreg.
Differential Revision: https://reviews.llvm.org/D52736
llvm-svn: 343508
Summary: This change enables VOP3 shifts to be explicitly selected
dependent on the divergence.
Differential Revision: https://reviews.llvm.org/D52559
Reviewers: rampitec
llvm-svn: 343455
Summary:
This is essentially NFC, because the complex pattern used for these patterns
will fail on non-CI, but this makes the pattern consistent with other CI
smrd patterns. It is also a performance improvement, because the pattern
will now fail earlier on non-CI.
Reviewers: arsenm, nhaehnle
Reviewed By: arsenm
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52469
llvm-svn: 343125
Summary:
We generate s_xor to lower add of i1s in general cases, and s_not to
lower add with a one-bit imm of -1 (true).
Reviewers:
rampitec
Differential Revision:
https://reviews.llvm.org/D52518
llvm-svn: 343030
[AMDGPU] lower-switch in preISel as a workaround for legacy DA
Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.
As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.
llvm-svn: 342956
If the alignment is at least 4, this should report true.
Something still seems off with how < 4-byte types are
handled here though.
Fixing this seems to change how some combines get
to where they get, but somehow isn't changing the net
result.
llvm-svn: 342879
Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.
As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.
Reviewers: arsenm, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits, simoll
Differential Revision: https://reviews.llvm.org/D52221
llvm-svn: 342722
Summary: This change is the first part of the AMDGPU target description
change. The aim of it is the effective splitting the vector and scalar
flows at the selection stage. Selection uses predicate functions based
on the framework implemented earlier - https://reviews.llvm.org/D35267
Differential revision: https://reviews.llvm.org/D52019
Reviewers: rampitec
llvm-svn: 342719
Summary:
This is required for GPUs with 16 bit instructions where f16 is a
legal register type and hence int_to_fp i1 to f16 is not lowered
by legalizing.
Reviewers: arsenm, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52018
Change-Id: Ie4c0fd6ced7cf10ad612023c6879724d9ded5851
llvm-svn: 342558
- Instead of having both `SUnit::dump(ScheduleDAG*)` and
`ScheduleDAG::dumpNode(ScheduleDAG*)`, just keep the latter around.
- Add `ScheduleDAG::dump()` and avoid code duplication in several
places. Implement it for different ScheduleDAG variants.
- Add `ScheduleDAG::dumpNodeName()` in favor of the `SUnit::print()`
functions. They were only ever used for debug dumping and putting the
function into ScheduleDAG is consistent with the `dumpNode()` change.
llvm-svn: 342520
Summary:
GFX9 and above support sin/cos instructions with a greater range and thus don't
require a fract instruction prior to invocation.
Added a subtarget feature to reflect this and added code to take advantage of
expanded range on GFX9+
Also updated the tests to check correct behaviour
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D51933
Change-Id: I1c1f1d3726a5ae32116646ca5cfa1ab4ef69e5b0
llvm-svn: 342222
Summary:
I accidentally left this behind in D50306, and it causes a build warning
when I build with gcc7.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52022
Change-Id: I30f7a47047e9d9d841f652da66d2fea19e74842c
llvm-svn: 342189
If an argument was passed on the stack, this
was using the default alignment.
I'm not sure there's an observable change from this. This
was observable due to bugs in expansion of unaligned
loads and stores, but since that is fixed I don't think
this matters much.
llvm-svn: 342133
Move isa version determination into TargetParser.
Also switch away from target features to CPU string when
determining isa version. This fixes an issue when we
output wrong isa version in the object code when features
of a particular CPU are altered (i.e. gfx902 w/o xnack
used to result in gfx900).
llvm-svn: 342069
into TargetParser.
Also switch away from target features to CPU string when
determining isa version. This fixes an issue when we
output wrong isa version in the object code when features
of a particular CPU are altered (i.e. gfx902 w/o xnack
used to result in gfx900).
Differential Revision: https://reviews.llvm.org/D51890
llvm-svn: 341982
This already worked if only one register piece was used,
but didn't if a type was split into multiple, unequal
sized pieces.
Fixes not splitting 3i16/v3f16 into two registers for
AMDGPU.
This will also allow fixing the ABI for 16-bit vectors
in a future commit so that it's the same for all subtargets.
llvm-svn: 341801
Summary:
This fixes a bug where a large number of implicit def instructions can fill the GCNHazardRecognizer lookahead buffer causing required NOPs to not be inserted.
Reviewers: nhaehnle, arsenm
Reviewed By: arsenm
Subscribers: sheredom, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D51726
Change-Id: Ie75338f94de704ee5816b05afd0c922c6748a95b
llvm-svn: 341798
Emit a waterfall loop in the general case for a potentially-divergent Rsrc
operand. When practical, avoid this by using Addr64 instructions.
Differential Revision: https://reviews.llvm.org/D50982
llvm-svn: 341413
The intention is to enable the extract_vector_elt load combine,
and doing this for other operations interferes with more
useful optimizations on vectors.
Handle any type of load since in principle we should do the
same combine for the various load intrinsics.
llvm-svn: 341219
Summary:
This is patch 1 of the new DivergenceAnalysis (https://reviews.llvm.org/D50433).
The purpose of this patch is to free up the name DivergenceAnalysis for the new generic
implementation. The generic implementation class will be shared by specialized
divergence analysis classes.
Patch by: Simon Moll
Reviewed By: nhaehnle
Subscribers: jvesely, jholewinski, arsenm, nhaehnle, mgorny, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D50434
Change-Id: Ie8146b11be2c50d5312f30e11c7a3036a15b48cb
llvm-svn: 341071
Summary:
Add some optional code to validate getInstSizeInBytes for emitted
instructions. This flushed out some issues which are fixed by this
patch:
- Streamline getInstSizeInBytes
- Properly define the VI readlane/writelane instruction as VOP3
- Fix the inline constant determination. Specifically, this change
fixes an issue where a 32-bit value of 0xffffffff was recorded
as unsigned. This is equal to -1 when restricting to a 32-bit
comparison, and an inline constant can be used.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D50629
Change-Id: Id87c3b7975839da0de8156a124b0ce98c5fb47f2
llvm-svn: 340903
This can leave behind the uses with the defs removed.
Since this should only really happen in tests, it's not worth the
effort of trying to handle this.
llvm-svn: 340866
The original motivating example uses a 64-bit add, so the carry
is used. Insert a copy from VCC. This may allow shrinking of
the used carry instruction. At worst, we are replacing a
mov to materialize the constant with a copy of vcc.
llvm-svn: 340862
This needs to be done in the SSA fold operands
pass to be effective, so there is a bit of overlap
with SIShrinkInstructions but I don't think this
is practically avoidable.
llvm-svn: 340859
Summary:
Patch by Marek Olsak and David Stuttard, both of AMD.
This adds a new amdgcn intrinsic supporting s.buffer.load, in particular
multiple dword variants. These are convenient to use from some front-end
implementations.
Also modified the existing llvm.SI.load.const intrinsic to common up the
underlying implementation.
This modification also requires that we can lower to non-uniform loads correctly
by splitting larger dword variants into sizes supported by the non-uniform
versions of the load.
V2: Addressed minor review comments.
V3: i1 glc is now i32 cachepolicy for consistency with buffer and
tbuffer intrinsics, plus fixed formatting issue.
V4: Added glc test.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D51098
Change-Id: I83a6e00681158bb243591a94a51c7baa445f169b
llvm-svn: 340684
32-bit constant address space is declared as 6, so the
maximum number of address spaces is 6, not 5.
Fixes "LLVM ERROR: Pointer address space out of range".
v5: rename MAX_COMMON_ADDRESS to MAX_AMDGPU_ADDRESS
v4: - fix compilation issues
- fix out of bounds access
v3: use static_assert()
v2: add a very simple test for 32-bit addr space
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106630
llvm-svn: 340417
Constant and global may alias, also one rules table wasn't
ordered correctly.
Pinpointed by Matt.
v2: add a test with swapped parameters
llvm-svn: 340416
This was hackily adding in the 4-bytes reserved for the callee's
emergency stack slot. Treat it like a normal stack allocation
so we get the correct alignment padding behavior. This fixes
an inconsistency between the caller and callee.
llvm-svn: 340396
In general we can't assume flat loads are uniform, and cases where we can prove
they are should be handled through infer-address-spaces.
Differential Revision: https://reviews.llvm.org/D50991
llvm-svn: 340343
Summary:
Previously the new llvm.amdgcn.raw/struct.buffer.load/store intrinsics
only allowed float types for the data to be loaded or stored, which
sometimes meant the frontend needed to generate a bitcast. In this, the
new intrinsics copied the old buffer intrinsics.
This commit extends the new intrinsics to allow int types as well.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D50315
Change-Id: I8202af2d036455553681dcbb3d7d32ae273f8f85
llvm-svn: 340270
Summary:
This commit adds new intrinsics
llvm.amdgcn.raw.buffer.load
llvm.amdgcn.raw.buffer.load.format
llvm.amdgcn.raw.buffer.load.format.d16
llvm.amdgcn.struct.buffer.load
llvm.amdgcn.struct.buffer.load.format
llvm.amdgcn.struct.buffer.load.format.d16
llvm.amdgcn.raw.buffer.store
llvm.amdgcn.raw.buffer.store.format
llvm.amdgcn.raw.buffer.store.format.d16
llvm.amdgcn.struct.buffer.store
llvm.amdgcn.struct.buffer.store.format
llvm.amdgcn.struct.buffer.store.format.d16
llvm.amdgcn.raw.buffer.atomic.*
llvm.amdgcn.struct.buffer.atomic.*
with the following changes from the llvm.amdgcn.buffer.*
intrinsics:
* there are separate raw and struct versions: raw does not have an
index arg and sets idxen=0 in the instruction, and struct always sets
idxen=1 in the instruction even if the index is 0, to allow for the
fact that gfx9 does bounds checking differently depending on whether
idxen is set;
* there is a combined cachepolicy arg (glc+slc)
* there are now only two offset args: one for the offset that is
included in bounds checking and swizzling, to be split between the
instruction's voffset and immoffset fields, and one for the offset
that is excluded from bounds checking and swizzling, to go into the
instruction's soffset field.
The AMDISD::BUFFER_* SD nodes always have an index operand, all three
offset operands, combined cachepolicy operand, and an extra idxen
operand.
The obsolescent llvm.amdgcn.buffer.* intrinsics continue to work.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D50306
Change-Id: If897ea7dc34fcbf4d5496e98cc99a934f62fc205
llvm-svn: 340269
Summary:
This commit adds new intrinsics
llvm.amdgcn.raw.tbuffer.load
llvm.amdgcn.struct.tbuffer.load
llvm.amdgcn.raw.tbuffer.store
llvm.amdgcn.struct.tbuffer.store
with the following changes from the llvm.amdgcn.tbuffer.* intrinsics:
* there are separate raw and struct versions: raw does not have an index
arg and sets idxen=0 in the instruction, and struct always sets
idxen=1 in the instruction even if the index is 0, to allow for the
fact that gfx9 does bounds checking differently depending on whether
idxen is set;
* there is a combined format arg (dfmt+nfmt)
* there is a combined cachepolicy arg (glc+slc)
* there are now only two offset args: one for the offset that is
included in bounds checking and swizzling, to be split between the
instruction's voffset and immoffset fields, and one for the offset
that is excluded from bounds checking and swizzling, to go into the
instruction's soffset field.
The AMDISD::TBUFFER_* SD nodes always have an index operand, all three
offset operands, combined format operand, combined cachepolicy operand,
and an extra idxen operand.
The tbuffer pseudo- and real instructions now also have a combined
format operand.
The obsolescent llvm.amdgcn.tbuffer.* and llvm.SI.tbuffer.store
intrinsics continue to work.
V2: Separate raw and struct intrinsics.
V3: Moved extract_glc and extract_slc defs to a more sensible place.
V4: Rebased on D49995.
V5: Only two separate offset args instead of three.
V6: Pseudo- and real instructions have joint format operand.
V7: Restored optionality of dfmt and nfmt in assembler.
V8: Addressed minor review comments.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D49026
Change-Id: If22ad77e349fac3a5d2f72dda53c010377d470d4
llvm-svn: 340268
getTargetCustom() requires values for "Kind" in the constructor
that are not in the PSVKind enum. Passing a value that is not inside
an enum as an argument to a constructor of the type of the enum is
UB. Changing to the underlying type of the enum would solve the UB
Differential Revision: https://reviews.llvm.org/D50909
llvm-svn: 340200
32-bit constant address space is declared as 6, so the
maximum number of address spaces is 6, not 5.
Fixes "LLVM ERROR: Pointer address space out of range".
v3: use static_assert()
v2: add a very simple test for 32-bit addr space
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106630
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
llvm-svn: 340171
a generically extensible collection of extra info attached to
a `MachineInstr`.
The primary change here is cleaning up the APIs used for setting and
manipulating the `MachineMemOperand` pointer arrays so chat we can
change how they are allocated.
Then we introduce an extra info object that using the trailing object
pattern to attach some number of MMOs but also other extra info. The
design of this is specifically so that this extra info has a fixed
necessary cost (the header tracking what extra info is included) and
everything else can be tail allocated. This pattern works especially
well with a `BumpPtrAllocator` which we use here.
I've also added the basic scaffolding for putting interesting pointers
into this, namely pre- and post-instruction symbols. These aren't used
anywhere yet, they're just there to ensure I've actually gotten the data
structure types correct. I'll flesh out support for these in
a subsequent patch (MIR dumping, parsing, the works).
Finally, I've included an optimization where we store any single pointer
inline in the `MachineInstr` to avoid the allocation overhead. This is
expected to be the overwhelmingly most common case and so should avoid
any memory usage growth due to slightly less clever / dense allocation
when dealing with >1 MMO. This did require several ergonomic
improvements to the `PointerSumType` to reasonably support the various
usage models.
This also has a side effect of freeing up 8 bits within the
`MachineInstr` which could be repurposed for something else.
The suggested direction here came largely from Hal Finkel. I hope it was
worth it. ;] It does hopefully clear a path for subsequent extensions
w/o nearly as much leg work. Lots of thanks to Reid and Justin for
careful reviews and ideas about how to do all of this.
Differential Revision: https://reviews.llvm.org/D50701
llvm-svn: 339940
This will allow the library to just use __builtin_expf directly
without expanding this itself. Note f64 still won't work because
there is no exp instruction for it.
llvm-svn: 339902
Handle fmul, fsub and preserve flags.
Also really test minnum/maxnum reductions.
The existing tests were only checking from
minnum/maxnum matched from a fast math compare
and select which is not the same.
llvm-svn: 339820
`MachineMemOperand` pointers attached to `MachineSDNodes` and instead
have the `SelectionDAG` fully manage the memory for this array.
Prior to this change, the memory management was deeply confusing here --
The way the MI was built relied on the `SelectionDAG` allocating memory
for these arrays of pointers using the `MachineFunction`'s allocator so
that the raw pointer to the array could be blindly copied into an
eventual `MachineInstr`. This creates a hard coupling between how
`MachineInstr`s allocate their array of `MachineMemOperand` pointers and
how the `MachineSDNode` does.
This change is motivated in large part by a change I am making to how
`MachineFunction` allocates these pointers, but it seems like a layering
improvement as well.
This would run the risk of increasing allocations overall, but I've
implemented an optimization that should avoid that by storing a single
`MachineMemOperand` pointer directly instead of allocating anything.
This is expected to be a net win because the vast majority of uses of
these only need a single pointer.
As a side-effect, this makes the API for updating a `MachineSDNode` and
a `MachineInstr` reasonably different which seems nice to avoid
unexpected coupling of these two layers. We can map between them, but we
shouldn't be *surprised* at where that occurs. =]
Differential Revision: https://reviews.llvm.org/D50680
llvm-svn: 339740
I'm not sure the exact nsz flag combination that
is OK. I think as long as it's on either, this is OK.
For now just check it on the omod multiply.
llvm-svn: 339513
If one of the elements is undef, use the canonicalized constant
from the other element instead of 0.
Splat vectors are more useful for other optimizations, such
as matching vector clamps. This was breaking on clamps
of half3 from the undef 4th component.
llvm-svn: 339512
Fixup test to check for GCN prefix
These patterns always zero extend the result even though it might need sign extension.
This has been broken since the addition of i16 support.
It has popped up in mad_sat(char) test since min(max()) combination is turned into v_med3, resulting in the following (incorrect) sequence:
v_mad_i16 v2, v10, v9, v11
v_med3_i32 v2, v2, v8, v7
Fixes mad_sat(char) piglit on VI.
Differential Revision: https://reviews.llvm.org/D49836
llvm-svn: 339190
This is necessary to add a VI specific builtin,
__builtin_amdgcn_s_dcache_wb. We already have an
overly specific feature for one of these builtins,
for s_memrealtime. I'm not sure whether it's better
to add more of those, or to get rid of that and merge
it with vi-insts.
Alternatively, maybe this logically goes with scalar-stores?
llvm-svn: 339104
Everything should quiet, and I think everything should
flush.
I assume the min3/med3/max3 follow the same rules
as regular min/max for flushing, which should at
least be conservatively correct.
There are still more operations that need to
be handled.
llvm-svn: 339065
Not sure why this was checking for denormals for f16.
My interpretation of the IEEE standard is conversions
should produce a canonical result, and the ISA manual
says denormals are created when appropriate.
llvm-svn: 339064
If denormals are enabled, denormals are canonical.
Also fix a few other issues. minnum/maxnum are supposed
to canonicalize. Temporarily improve workaround for the
instruction behavior change in gfx9.
Handle selects and fcopysign.
The tests were also largely broken, since they were
checking for a flush used on some targets after the
store of the result.
llvm-svn: 339061
Summary:
This patch improves Inliner to provide causes/reasons for negative inline decisions.
1. It adds one new message field to InlineCost to report causes for Always and Never instances. All Never and Always instantiations must provide a simple message.
2. Several functions that used to return the inlining results as boolean are changed to return InlineResult which carries the cause for negative decision.
3. Changed remark priniting and debug output messages to provide the additional messages and related inline cost.
4. Adjusted tests for changed printing.
Patch by: yrouban (Yevgeny Rouban)
Reviewers: craig.topper, sammccall, sgraenitz, NutshellySima, shchenz, chandlerc, apilipenko, javed.absar, tejohnson, dblaikie, sanjoy, eraman, xbolva00
Reviewed By: tejohnson, xbolva00
Subscribers: xbolva00, llvm-commits, arsenm, mehdi_amini, eraman, haicheng, steven_wu, dexonsmith
Differential Revision: https://reviews.llvm.org/D49412
llvm-svn: 338969
Add a parameter for testing specifically for
sNaNs - at least one instruction pattern on AMDGPU
needs to check specifically for this.
Also handle more cases, and add a target hook
for custom nodes, similar to the hooks for known
bits.
llvm-svn: 338910
Summary:
By not reconstructing the operand list of the SDNode, this change makes
it easier to add the forthcoming new tbuffer and buffer intrinsics.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D49995
Change-Id: I0cb79ef0801532645d7dd954a6d7355139db7b38
llvm-svn: 338784
Summary:
I encountered some problems with SIFixWWMLiveness when WWM is in a loop:
1. It sometimes gave invalid MIR where there is some control flow path
to the new implicit use of a register on EXIT_WWM that does not pass
through any def.
2. There were lots of false positives of registers that needed to have
an implicit use added to EXIT_WWM.
3. Adding an implicit use to EXIT_WWM (and adding an implicit def just
before the WWM code, which I tried in order to fix (1)) caused lots
of the values to be spilled and reloaded unnecessarily.
This commit is a rework of SIFixWWMLiveness, with the following changes:
1. Instead of considering any register with a def that can reach the WWM
code and a def that can be reached from the WWM code, it now
considers three specific cases that need to be handled.
2. A register that needs liveness over WWM to be synthesized now has it
done by adding itself as an implicit use to defs other than the
dominant one.
Also added the following fixmes:
FIXME: We should detect whether a register in one of the above
categories is already live at the WWM code before deciding to add the
implicit uses to synthesize its liveness.
FIXME: I believe this whole scheme may be flawed due to the possibility
of the register allocator doing live interval splitting.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D46756
Change-Id: Ie7fba0ede0378849181df3f1a9a7a39ed1a94a94
llvm-svn: 338783
Summary:
This fixes a problem where a load from global+idx generated incorrect
code on <=gfx7 when the index is divergent.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D47383
Change-Id: Ib4d177d6254b1dd3f8ec0203fdddec94bd8bc5ed
llvm-svn: 338779
Mutate the node type during selection when it
doesn't matter. This avoids an intermediate bitcast
node on targets with legal i16/f16.
Also fixes missing output modifiers on v_cvt_pkrtz_f32_f16,
which I assume are OK.
llvm-svn: 338619
Summary:
Add _L to _LZ image intrinsic table mapping to table gen.
In ISelLowering check if image intrinsic has lod and if it's equal
to zero, if so remove lod and change opcode to equivalent mapped _LZ.
Change-Id: Ie24cd7e788e2195d846c7bd256151178cbb9ec71
Subscribers: arsenm, mehdi_amini, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, steven_wu, dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D49483
llvm-svn: 338523
Summary:
This patch improves Inliner to provide causes/reasons for negative inline decisions.
1. It adds one new message field to InlineCost to report causes for Always and Never instances. All Never and Always instantiations must provide a simple message.
2. Several functions that used to return the inlining results as boolean are changed to return InlineResult which carries the cause for negative decision.
3. Changed remark priniting and debug output messages to provide the additional messages and related inline cost.
4. Adjusted tests for changed printing.
Patch by: yrouban (Yevgeny Rouban)
Reviewers: craig.topper, sammccall, sgraenitz, NutshellySima, shchenz, chandlerc, apilipenko, javed.absar, tejohnson, dblaikie, sanjoy, eraman, xbolva00
Reviewed By: tejohnson, xbolva00
Subscribers: xbolva00, llvm-commits, arsenm, mehdi_amini, eraman, haicheng, steven_wu, dexonsmith
Differential Revision: https://reviews.llvm.org/D49412
llvm-svn: 338494
When lowering calling conventions, prefer to decompose vectors
into the constitute register types. This avoids artifical constraints
to satisfy a wide super-register.
This improves code quality because now optimizations don't need to
deal with the super-register constraint. For example the immediate
folding code doesn't deal with 4 component reg_sequences, so by
breaking the register down earlier the existing immediate folding
code is able to work.
This also avoids the need for the shader input processing code
to manually split vector types.
llvm-svn: 338416
Summary:
This patch improves Inliner to provide causes/reasons for negative inline decisions.
1. It adds one new message field to InlineCost to report causes for Always and Never instances. All Never and Always instantiations must provide a simple message.
2. Several functions that used to return the inlining results as boolean are changed to return InlineResult which carries the cause for negative decision.
3. Changed remark priniting and debug output messages to provide the additional messages and related inline cost.
4. Adjusted tests for changed printing.
Patch by: yrouban (Yevgeny Rouban)
Reviewers: craig.topper, sammccall, sgraenitz, NutshellySima, shchenz, chandlerc, apilipenko, javed.absar, tejohnson, dblaikie, sanjoy, eraman, xbolva00
Reviewed By: tejohnson, xbolva00
Subscribers: xbolva00, llvm-commits, arsenm, mehdi_amini, eraman, haicheng, steven_wu, dexonsmith
Differential Revision: https://reviews.llvm.org/D49412
llvm-svn: 338387
We could choose a free 0 for this, but this
matches the behavior for fmul undef, 1.0. Also,
the NaN use is more useful for folding use operations
although if it's not eliminated it is more expensive
in terms of code size.
llvm-svn: 338376
Summary:
These instructions interact with hardware blocks outside the shader core,
and they can have "scalar" side effects even when EXEC = 0. We don't
want these scalar side effects to occur when all lanes want to skip
these instructions, so always add the execz skip branch instruction
for basic blocks that contain them.
Also ensure that we skip scalar stores / atomics, though we don't
code-gen those yet.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48431
Change-Id: Ieaeb58352e2789ffd64745603c14970c60819d44
llvm-svn: 338235
SelectionDAGBuilder widens v3i32/v3f32 arguments to
to v4i32/v4f32 which consume an additional register.
In addition to wasting argument space, this produces extra
instructions since now it appears the 4th vector component has
a meaningful value to most combines.
llvm-svn: 338197
R600 can't handle immediates for BFE, these will be eliminated later.
Fixes powr/pow regressions n r600 since r334817
Differential Revision: https://reviews.llvm.org/D49641
llvm-svn: 338127
Scale the offset of VGPR spills by the wave size when it cannot fit in the
12-bit offset immediate field and so is added to the soffset SGPR. This
accounts for hardware swizzling of scratch memory.
Differential Revision: https://reviews.llvm.org/D49448
llvm-svn: 338060
Summary:
We were marking G_EXTRACT operations unsupported if the output type
was larger than the input type. I don't see how this could ever actually
happen, so I dropped the constraint. Doing this makes it possible to
reuse the same legality code for G_INSERT.
Reviewers: arsenm
Reviewed By: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D49600
llvm-svn: 337794
Re-apply "[AMDGPU][Waitcnt] fix "comparison of integers of different signs" build error""
( fe0a456510131f268e388c4a18a92f575c0db183 ), which was inadvertantly reverted via
2b2ee080f0164485562593b1b87291a48cea4a9a .
llvm-svn: 337156
Memory legalizer, waitcnt, and shrink passes can perturb the instructions,
which means that the post-RA hazard recognizer pass should run after them.
Otherwise, one of those passes may invalidate the work done by the hazard
recognizer. Note that this has adverse side-effect that any consecutive
S_NOP 0's, emitted by the hazard recognizer, will not be shrunk into a
single S_NOP <N>. This should be addressed in a follow-on patch.
Differential Revision: https://reviews.llvm.org/D49288
llvm-svn: 337154
This reverts commit r337021.
WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x1415cd65 in void write_signed<long>(llvm::raw_ostream&, long, unsigned long, llvm::IntegerStyle) /code/llvm-project/llvm/lib/Support/NativeFormatting.cpp:95:7
#1 0x1415c900 in llvm::write_integer(llvm::raw_ostream&, long, unsigned long, llvm::IntegerStyle) /code/llvm-project/llvm/lib/Support/NativeFormatting.cpp:121:3
#2 0x1472357f in llvm::raw_ostream::operator<<(long) /code/llvm-project/llvm/lib/Support/raw_ostream.cpp:117:3
#3 0x13bb9d4 in llvm::raw_ostream::operator<<(int) /code/llvm-project/llvm/include/llvm/Support/raw_ostream.h:210:18
#4 0x3c2bc18 in void printField<unsigned int, &(amd_kernel_code_s::amd_kernel_code_version_major)>(llvm::StringRef, amd_kernel_code_s const&, llvm::raw_ostream&) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:78:23
#5 0x3c250ba in llvm::printAmdKernelCodeField(amd_kernel_code_s const&, int, llvm::raw_ostream&) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:104:5
#6 0x3c27ca3 in llvm::dumpAmdKernelCode(amd_kernel_code_s const*, llvm::raw_ostream&, char const*) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:113:5
#7 0x3a46e6c in llvm::AMDGPUTargetAsmStreamer::EmitAMDKernelCodeT(amd_kernel_code_s const&) /code/llvm-project/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp:161:3
#8 0xd371e4 in llvm::AMDGPUAsmPrinter::EmitFunctionBodyStart() /code/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp:204:26
[...]
Uninitialized value was created by an allocation of 'KernelCode' in the stack frame of function '_ZN4llvm16AMDGPUAsmPrinter21EmitFunctionBodyStartEv'
#0 0xd36650 in llvm::AMDGPUAsmPrinter::EmitFunctionBodyStart() /code/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp:192
llvm-svn: 337079
This needs to refer to arguments by their original argument
index, not the argument split index which depends on what
the type splitting decides to do.
Also avoid increment PSInputNum for each split piece.
llvm-svn: 337022
This was completely broken if there was ever a struct argument, as
this information is thrown away during the argument analysis.
The offsets as passed in to LowerFormalArguments are not useful,
as they partially depend on the legalized result register type,
and they don't consider the alignment in the first place.
Ignore the Ins array, and instead figure out from the raw IR type
what we need to do. This seems to fix the padding computation
if the DAG lowering is forced (and stops breaking arguments
following padded arguments if the arguments were only partially
lowered in the IR)
llvm-svn: 337021
SITargetLowering queries SIInstrInfo in its constructor, so SIInstrInfo
must be initialized first. This fixes msan buildbot failures and was
introduced by r336851.
llvm-svn: 336861
Summary:
This is a follow-up to r335942.
- Merge SISubtarget into AMDGPUSubtarget and rename to GCNSubtarget
- Rename AMDGPUCommonSubtarget to AMDGPUSubtarget
- Merge R600Subtarget::Generation and GCNSubtarget::Generation into
AMDGPUSubtarget::Generation.
Reviewers: arsenm, jvesely
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D49037
llvm-svn: 336851
Build error on Android; reported by and fix provided by (thanks) by Mauro Rossi <issor.oruam@gmail.com>
Fixes the following building error:
external/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:1903:61:
error: comparison of integers of different signs:
'typename iterator_traits<__wrap_iter<MachineBasicBlock **> >::difference_type'
(aka 'int') and 'unsigned int' [-Werror,-Wsign-compare]
BlockWaitcntProcessedSet.end(), &MBB) < Count)) {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~
1 error generated.
Differential Revision: https://reviews.llvm.org/D49089
llvm-svn: 336588
These won't work for the forseeable future. These aren't allowed
from OpenCL, but IPO optimizations can make them appear.
Also directly set the attributes on functions, regardless
of the linkage rather than cloning functions like before.
llvm-svn: 336587
Avoid using allocateKernArg / AssignFn. We do not want any
of the type splitting properties of normal calling convention
lowering.
For now at least this exists alongside the IR argument lowering
pass. This is necessary to handle struct padding correctly while
some arguments are still skipped by the IR argument lowering
pass.
llvm-svn: 336373
Wait states are not properly being inserted after buffer_store for v_interp instructions.
Add VALU to V_INTERP instructions so that the GCNHazardRecognizer can
check and insert the appropriate wait states when needed.
Differential Revision: https://reviews.llvm.org/D48772
Change-Id: Id540c9b074fc69b5c1de6b182276aa089c74aa64
llvm-svn: 336339
Summary:
This patch introduce new intrinsic -
strip.invariant.group that was described in the
RFC: Devirtualization v2
Reviewers: rsmith, hfinkel, nlopes, sanjoy, amharc, kuhar
Subscribers: arsenm, nhaehnle, JDevlieghere, hiraditya, xbolva00, llvm-commits
Differential Revision: https://reviews.llvm.org/D47103
Co-authored-by: Krzysztof Pszeniczny <krzysztof.pszeniczny@gmail.com>
llvm-svn: 336073
Summary:
We could split sizes that are not power of two into smaller sized
G_IMPLICIT_DEF instructions, but this ends up generating
G_MERGE_VALUES instructions which we then have to handle in the instruction
selector. Since G_IMPLICIT_DEF is really a no-op it's easier just to
keep everything that can fit into a register legal.
Reviewers: arsenm
Reviewed By: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48777
llvm-svn: 336041
This was introducing unnecessary padding after the explicit
arguments, depending on the alignment of the total struct type.
Also has the side effect of avoiding creating an extra GEP for
the offset from the base kernel argument to the explicit kernel
argument offset.
llvm-svn: 335999
This allows to hoist code portion to compute reciprocal of loop
invariant denominator in integer division after codegen prepare
expansion.
Differential Revision: https://reviews.llvm.org/D48604
llvm-svn: 335988
Summary:
We now have two sets of generated TableGen files, one for R600 and one
for GCN, so each sub-target now has its own tables of instructions,
registers, ISel patterns, etc. This should help reduce compile time
since each sub-target now only has to consider information that
is specific to itself. This will also help prevent the R600
sub-target from slowing down new features for GCN, like disassembler
support, GlobalISel, etc.
Reviewers: arsenm, nhaehnle, jvesely
Reviewed By: arsenm
Subscribers: MatzeB, kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D46365
llvm-svn: 335942
This allows hoisting of a common code, for instance if denominator
is loop invariant. Current change is expansion only, adding licm to
the target pass list going to be a separate patch. Given this patch
changes to codegen are minor as the expansion is similar to that on
DAG. DAG expansion still must remain for R600.
Differential Revision: https://reviews.llvm.org/D48586
llvm-svn: 335868
We have too many mechanisms for tracking the various offsets
used for kernel arguments, so remove one. There's still a lot of
confusion with these because there are two different "implicit"
argument areas located at the beginning and end of the kernarg
segment.
Additionally, the offset was determined based on the memory
size of the split element types. This would break in a future
commit where v3i32 is decomposed into separate i32 pieces.
llvm-svn: 335830
In principle nothing should stop these from working, but
work is necessary to create an ABI for dealing with the stack
related registers.
llvm-svn: 335829
Just fix the crash for now by not doing the optimization since
figuring out how to properly convert the bits for an arbitrary
struct is a pain.
Also fix a crash when there is only an empty struct argument.
llvm-svn: 335827
If a source of rcp instruction is a result of any conversion from
an integer convert it into rcp_iflag instruction. No FP exception
can ever happen except division by zero if a single precision rcp
argument is a representation of an integral number.
Differential Revision: https://reviews.llvm.org/D48569
llvm-svn: 335742
This replaces most argument uses with loads, but for
now not all.
The code in SelectionDAG for calling convention lowering
is actively harmful for amdgpu_kernel. It attempts to
split the argument types into register legal types, which
results in low quality code for arbitary types. Since
all kernel arguments are passed in memory, we just want the
raw types.
I've tried a couple of methods of mitigating this in SelectionDAG,
but it's easier to just bypass this problem alltogether. It's
possible to hack around the problem in the initial lowering,
but the real problem is the DAG then expects to be able to use
CopyToReg/CopyFromReg for uses of the arguments outside the block.
Exposing the argument loads in the IR also has the advantage
that the LoadStoreVectorizer can merge them.
I'm not sure the best approach to dealing with the IR
argument list is. The patch as-is just leaves the IR arguments
in place, so all the existing code will still compute the same
kernarg size and pointlessly lowers the arguments.
Arguably the frontend should emit kernels with an empty argument
list in the first place. Alternatively a dummy array could be
inserted as a single argument just to reserve space.
This does have some disadvantages. Local pointer kernel arguments can
no longer have AssertZext placed on them as the equivalent !range
metadata is not valid on pointer typed loads. This is mostly bad
for SI which needs to know about the known bits in order to use the
DS instruction offset, so in this case this is not done.
More importantly, this skips noalias arguments since this pass
does not yet convert this to the equivalent !alias.scope and !noalias
metadata. Producing this metadata correctly seems to be tricky,
although this logically is the same as inlining into a function which
doesn't exist. Additionally, exposing these loads to the vectorizer
may result in degraded aliasing information if a pointer load is
merged with another argument load.
I'm also not entirely sure this is preserving the current clover
ABI, although I would greatly prefer if it would stop widening
arguments and match the HSA ABI. As-is I think it is extending
< 4-byte arguments to 4-bytes but doesn't align them to 4-bytes.
llvm-svn: 335650
Note a normal select test is not currently possible because this
relies on input registers tracked in SIMachineFunctionInfo which
are not currently serializable in MIR, but this does work end-to-end
from the IR.
llvm-svn: 335490
This should avoid relying on the pointee type
to get the alignment, particularly since pointee
types are supposed to be removed at some point.
Also fixes not getting the alignment for unsized types.
llvm-svn: 335478
Implements PR34259
Intrinsics.h is a very popular header. Most LLVM TUs care about things
like dbg_value, but they don't care how they are implemented. After I
split these out, IntrinsicImpl.inc is 1.7 MB, so this saves each LLVM TU
from scanning 1.7 MB of source that gets pre-processed away.
It also means we can modify intrinsic properties without triggering a
full rebuild, but that's probably less of a win.
I think the next best thing to do would be to split out the target
intrinsics into their own header. Very, very few TUs care about
target-specific intrinsics. It's very hard to split up the target
independent intrinsics like llvm.expect, assume, and dbg.value, though.
llvm-svn: 335407
Not sure why the 32/64 split is needed in the atomic_load
store hierarchies. The regular PatFrags do this, but we don't
do it for the existing handling for global.
llvm-svn: 335325
Summary:
We can select all instructions that are marked as legal in a full piglit run,
so now is a good time to make the TableGen'd instruction selector default
for all opcodes. This is NFC for a full piglit run, which is why there are
no tests.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48198
llvm-svn: 335319
Update AMDGPU assembler syntax behind the code-object-v3 feature:
* Replace/rename most AMDGPU assembler directives/symbols and document them.
* Provide more diagnostics (e.g. values out of range, missing values, repeated
values).
* Provide path for backwards compatibility, even with underlying descriptor
changes.
Differential Revision: https://reviews.llvm.org/D47736
llvm-svn: 335281
BlockWaitcntProcessedSet was not being cleared between calls, so it was
producing incorrect counts in cases where MBB addresses happened to coincide
across multiple calls.
Differential Revision: https://reviews.llvm.org/D48391
llvm-svn: 335268
and everything that comes with it from implementation
and v3 header files.
Leave definition in v2 header files for backwards
compatibility.
Differential Revision: https://reviews.llvm.org/D48191
llvm-svn: 335267
Summary:
For sample and gather ops, we can accurately determine the set of
vaddr-size instruction variants that are required. This reduces
the size of instruction tables by ~5%.
The number of machine instruction opcodes is reduced from 10002
to 9476.
Change-Id: Ie7fc65d3657b762c7816017fe70b2e9bec644a8a
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D48168
llvm-svn: 335232
Summary:
This also removes the need for atomic pseudo instructions, since
we select the correct encoding directly in SITargetLowering::lowerImage
for dimension-aware image intrinsics.
Mesa uses dimension-aware image intrinsics since
commit a9a7993441.
Change-Id: I7473d20009476a4ed6d919cae4e6dca9ff42e77a
Reviewers: arsenm, rampitec, mareko, tpr, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48167
llvm-svn: 335231
Summary:
Having TableGen patterns for image intrinsics is hitting limitations:
for D16 we already have to manually pre-lower the packing of data
values, and we will have to do the same for A16 eventually.
Since there is already some custom C++ code anyway, it is arguably easier
to just do everything in C++, now that we can use the beefed-up generic
tables backend of TableGen to provide all the required metadata and map
intrinsics to corresponding opcodes. With this approach, all image
intrinsic lowering happens in SITargetLowering::lowerImage. That code is
dense due to all the cases that it handles, but it should still be easier
to follow than what we had before, by virtue of it all being done in a
single location, and by virtue of not relying on the TableGen pattern
magic that very few people really understand.
This means that we will have MachineSDNodes with MIMG instructions
during DAG combining, but that seems alright: previously we had
intrinsic nodes instead, but those are similarly opaque to the generic
CodeGen infrastructure, and the final pattern matching just did a 1:1
translation to machine instructions anyway. If anything, the fact that
we now merge the address words into a vector before DAG combine should
be an advantage.
Change-Id: I417f26bd88f54ce9781c1668acc01f3f99774de6
Reviewers: arsenm, rampitec, rtaylor, tstellar
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48017
llvm-svn: 335228
Summary:
This allows us to access rich information about MIMG opcodes from C++ code.
Simplifying the mapping between equivalent opcodes of different data size
becomes quite natural.
This also flattens the MIMG-related class and multiclass hierarchy a little,
and collapses together some of the scaffolding for sample and gather4 opcodes.
Change-Id: I1a2549fdc1e881ff100e5393d2d87e73729a0ccd
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48016
llvm-svn: 335227
Summary:
This will allows us to provide rich metadata about the instructions
in tables that are accessible by custom C++ code.
Change-Id: Id9305a26304ab6a6cceb6c65c8cd49141cc0101d
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48011
llvm-svn: 335224
Summary:
Kill instructions sometimes do use SCC in unusual circumstances, when
v_cmpx cannot be used due to the operands that are involved.
Additionally, even if SCC was never defined by the expansion, kill pseudos
could previously occur between an s_cmp and an s_cbranch_scc, which breaks
the SCC liveness tracking when the pseudo is expanded to split the basic
block. While it would be possible to explicitly mark the SCC as live-in for
the successor basic block, it's simpler to just mark the pseudo as using SCC,
so that such a sequence is never emitted by instruction selection in the
first place.
A similar issue affects indirect source/dest pseudos in principle, although
I haven't been able to come up with a test case where it actually matters
(this affects instruction selection, so a MIR test can't be used).
Fixes: dEQP-GLES3.functional.shaders.discard.dynamic_loop_always
Change-Id: Ica8d82ecff1a763b892a1112cf1b06c948863a4f
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D47761
llvm-svn: 335223
Summary:
This allows us to reduce the number of different machine instruction
opcodes, which reduces the table sizes and helps flatten the TableGen
multiclass hierarchies.
We can do this because for each hardware MIMG opcode, we have a full set
of IMAGE_xxx_Vn_Vm machine instructions for all required sizes of vdata
and vaddr registers. Instead of having separate D16 machine instructions,
a packed D16 instructions loading e.g. 4 components can simply use the
same V2 opcode variant that non-D16 instructions use.
We still require a TSFlag for D16 buffer instructions, because the
D16-ness of buffer instructions is part of the opcode. Renaming the flag
should help avoid future confusion.
The one non-obvious code change is that for gather4 instructions, the
disassembler can no longer automatically decide whether to use a V2 or
a V4 variant. The existing logic which choose the correct variant for
other MIMG instruction is extended to cover gather4 as well.
As a bonus, some of the assembler error messages are now more helpful
(e.g., complaining about a wrong data size instead of a non-existing
instruction).
While we're at it, delete a whole bunch of dead legacy TableGen code.
Change-Id: I89b02c2841c06f95e662541433e597f5d4553978
Reviewers: arsenm, rampitec, kzhuravl, artem.tamazov, dp, rtaylor
Subscribers: wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D47434
llvm-svn: 335222
This is the common case in the BE when we serialize condition and then
rematerialize it. Use either original or inverted condition.
Differential Revision: https://reviews.llvm.org/D48246
llvm-svn: 334882
Try to access pieces 4 bytes at a time. This helps
various hasOneUse extract_vector_elt combines, such
as load width reductions.
Avoids test regressions in a future commit.
llvm-svn: 334836
Summary: The same pattern as D48010, but this one is IR-canonical as of D47428.
Reviewers: nhaehnle, bogner, tstellar, arsenm
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #amdgpu
Differential Revision: https://reviews.llvm.org/D48012
llvm-svn: 334817
Summary:
As a followup for D48007.
Since we already handle `x << (bitwidth - y) >> (bitwidth - y)` pattern,
which does not have ub for both the edge cases (`y == 0`, `y == bitwidth`),
i think also handling a pattern that is ub for `y == bitwidth` should be fine.
Reviewers: nhaehnle, bogner, tstellar, arsenm
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #amdgpu
Differential Revision: https://reviews.llvm.org/D48010
llvm-svn: 334816
Summary:
D47980 will canonicalize the `x << (32 - y) >> (32 - y)`,
which is the pattern the AMDGPU expects to `x & (-1 >> (32 - y))`,
which is not recognized by AMDGPU.
Thus, it needs to be recognized, too.
Reviewers: nhaehnle, bogner, tstellar, arsenm
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #amdgpu
Differential Revision: https://reviews.llvm.org/D48007
llvm-svn: 334815
Currently the handle type is a global pointer which holds 8 bytes.
We need a larger type which hold 16 bytes, therefore change it
to [i64 x 2].
Differential Revision: https://reviews.llvm.org/D48094
llvm-svn: 334625
Summary:
The code that handles ISD:Register and ISD::CopyFromReg assumes
the target is amdgcn, so this is broken on r600. We don't
need this analysis on r600 anyway so we can safely move
it to SITargetLowering.
Reviewers: alex-t, arsenm, nhaehnle
Reviewed By: arsenm
Subscribers: msearles, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D46298
llvm-svn: 334607
- Do not emit following assembler directives:
- .hsa_code_object_version
- .hsa_code_object_isa
- .amd_amdgpu_isa
- .amd_amdgpu_hsa_metadata
- .amd_amdgpu_pal_metadata
- Do not emit .note entries
- Cleanup and bring in sync kernel descriptor header file
- Emit kernel descriptor into .rodata with appropriate relocations and
alignments
llvm-svn: 334519
The use iterator, used within findMaskOperands(), can return anything which is
not a def. isUse() requires a register, so check isReg() before calling isUse().
Differential Revision: https://reviews.llvm.org/D48047
llvm-svn: 334459
Rational: if there is indirect access that is usually an issue
because load is not ready by the use. However, if use is inside a
loop and load is outside that is potentially an issue for a first
iteration only.
Differential Revision: https://reviews.llvm.org/D47740
llvm-svn: 334420
AMDGPU inline assembler support i16, half and i128 typed variables in constraints, but they were reported as error.
Needed to fix https://github.com/RadeonOpenCompute/ROCm/issues/341,
e.g. to be able to load with global_load_dwordx4 to a 128bit integer variable
Differential Revision: https://reviews.llvm.org/D44920
llvm-svn: 334301
- Make code easier to maintain.
- Avoid generating waitcnts for VMEM if the address sppace does not involve VMEM.
- Add support to generate waitcnts for LDS and GDS memory.
Differential Revision: https://reviews.llvm.org/D47504
llvm-svn: 334241
This has two main components. First, widen
widen short constant loads in DAG when they have
the correct alignment. This is already done a bit in
AMDGPUCodeGenPrepare, since that has access to
DivergenceAnalysis. This can't help kernarg loads
created in the DAG. Start to use DAG divergence analysis
to help this case.
The second part is to avoid kernel argument lowering
breaking the alignment of short vector elements because
calling convention lowering wants to split everything
into legal register types.
When loading a split type, load the nearest 4-byte aligned
segment and shift to get the desired bits. This extra
load of the earlier argument piece ends up merging,
and the bit extract hopefully folds out.
There are a number of improvements and regressions with
this, but I think as-is this is a better compromise between
several of the worst parts of SelectionDAG.
Particularly when i16 is legal, this produces worse code
for i8 and i16 element vector kernel arguments. This is
partially due to the very weak load merging the DAG does.
It only looks for fairly specific combines between pairs
of loads which no longer appear. In particular this
causes v4i16 loads to be split into 2 components when
previously the two halves were merged.
Worse, because of the newly introduced shifts, there
is a lot more unnecessary vector packing and unpacking code
emitted. At least some of this is due to reporting
false for isTypeDesirableForOp for i16 as a workaround for
the lack of divergence information in the DAG. The cases
where this happens it doesn't actually matter, but the
relevant code in SimplifyDemandedBits doens't have the context
to know to ignore this.
The use of the scalar cache is probably more important
than the mess of mostly scalar instructions doing this packing
and unpacking. Future work can fix this, possibly by making better
use of the new DAG divergence information for controlling promotion
decisions, or adding another version of shift + trunc + shift
combines that doesn't only know about the used types.
llvm-svn: 334180
When denormals are supported we are producing a full division for
1.0f / x. That still can be replaced by the faster version:
bool c = fabs(x) > 0x1.0p+96f;
float s = c ? 0x1.0p-32f : 1.0f;
x *= s;
return s * v_rcp_f32(x)
in case if requested accuracy is 2.5ulp or less. The same version
is used if denormals are not supported for non 1.0 numerators, where
just v_rcp_f32 is then used for 1.0 numerator.
The optimization of 1/x is extended to the case -1/x, which is the
same except for the resulting sign bit.
OpenCL conformance passed with both enabled and disabled denorms.
Differential Revision: https://reviews.llvm.org/D47805
llvm-svn: 334142