There were various hacks used to try to avoid making s1 SGPR vs. s1
VCC ambiguous after constraining the register before we had a strategy
to deal with this. This also attempted to handle undef operands, which
are now illegal gMIR.
We are using undef on the indirect move source subreg and then
using implicit super-reg. This creates a problem in RA when
Greedy decides to split the register. It reassigns the implicit
super-reg but does not bother to change undef source because
it is really does not matter. The fix is to stop lying to RA and
drop undef flag.
This has also hit a problem in SIFoldOperands as it can fold
immediate into an indirect move since there is no undef flag
anymore. That results in multiple test failures, so added the
check for this case.
Differential Revision: https://reviews.llvm.org/D84899
The hardware has created a real mess in the naming for add/sub, which
have been renamed basically every generation. Switch the carry out
pseudos to have the gfx9/gfx10 names. We were using the original SI/CI
v_add_i32/v_sub_i32 names. Later targets reintroduced these names as
carryless instructions with a saturating clamp bit, which we do not
define. Do this rename so we can unambiguously add these missing
instructions.
The carry-in versions should also be renamed, but at least those had a
consistent _u32 name to begin with. The 16-bit instructions were also
renamed, but aren't ambiguous.
This does regress assembler error message quality in some cases. In
mismatched wave32/wave64 situations, this will switch from
"unsupported instruction" to "invalid operand", with the error
pointing at the wrong position. I couldn't quite follow how the
assembler selects these, but the previous behavior seemed accidental
to me. It looked like there was a partial attempt to handle this which
was never completed (i.e. there is an AMDGPUOperand::isBoolReg but it
isn't used for anything).
Select into corresponding V_CMP instruction based on CmpInst predicate,
stored as immediate, in last operand.
Differential Revision: https://reviews.llvm.org/D82652
I'm guessing this was a holdover from when 0 was an invalid stack
pointer, but surprised nobody has discovered this before.
Also don't allow offset folding for -1 pointers, since it looks weird
to partially fold this.
I consider this to be a hack, since we probably should not mark any
16-bit extract as legal, and require all extracts to be done on
multiples of 32. There are quite a few more battles to fight in the
legalizer for sub-dword vectors, so just select this for now so we can
pass OpenCL conformance without crashing.
Also fix the same assert for G_INSERTs. Unlike G_EXTRACT there's not a
trivial way to select this so just fail on it.
Confusingly, these were unrelated and had different semantics. The
G_PTR_MASK instruction predates the llvm.ptrmask intrinsic, but has a
different format. G_PTR_MASK only allows clearing the low bits of a
pointer, and only a constant number of bits. The ptrmask intrinsic
allows an arbitrary mask. Replace G_PTR_MASK to match the intrinsic.
Only selects the cases that look like the old instruction. More work
is needed to select the general case. Also new legalization code is
still needed to deal with the case where the incoming mask size does
not match the pointer size, which has a specified behavior in the
langref.
This is currently missing most of the hard parts to lower correctly,
so disable it for now. This fixes at least one OpenCL conformance test
and allows it to pass with fallback. Hide this behind an option for
now.
This does for G_EXTRACT_VECTOR_ELT what 588bd7be36 did for G_TRUNC.
Ideally types without a corresponding register class wouldn't reach
here, but we're currently missing some (in particular a 192-bit class
is missing).
Summary:
This fixes a few issues related to SMRD offsets. On gfx9 and gfx10 we have a
signed byte offset immediate, however we can overflow into a negative since we
treat it as unsigned.
Also, the SMRD SOFFSET sgpr is an unsigned offset on all subtargets. We
sometimes tried to use negative values here.
Third, S_BUFFER instructions should never use a signed offset immediate.
Differential Revision: https://reviews.llvm.org/D77082
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to removes the scratch wave
offset register from the calling convention ABI.
As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.
Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.
Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75138
We weren't considering the packed case correctly, and this was passing
through to the selector. The selector only checked the size, so this
would incorrectly compile to a single 32-bit scalar add.
As usual, the LegalizerHelper is somewhat awkward to use from
applyMappingImpl. I think this is the first place we've needed
multi-step legalization here though.
G_SHUFFLE_VECTOR is legal since it theoretically may help match op_sel
for VOP3P instructions. Expand it in some other way in case it doesn't
fold into the use instructions.
We should try the generated matchers before the manual selection. This
means the patterns are now handling the common cases, but the manual
selection code is not yet dead. It's still handling the non-s32/s64
cases (like v2s16 and v2s32). Currently tablegen doesn't have a nice
way to have a single pattern that covers multiple types.
We have patterns for s_pack* selection, but they assume the inputs are
a build_vector with 16-bit inputs, not a truncating build
vector. Since there's still outstanding work for how to handle
mismatched result and source element vector operations, and since I'm
trying a different packed vector strategy than SelectionDAG, just
manually select this for now.
This looked through copies to find the source modifiers, which may
have been SGPR->VGPR copies added to avoid potential constant bus
violations. Re-insert a copy to a VGPR if this happens.
Vector indexing with a constant index should be folded out in the
legalizer, but this was accidentally falling through. This would
produce the indexing operation with $noreg. Handle this case as a
dynamic index just in case a bug like this happens again in the
future.
I believe this also fixes bugs with CI 32-bit handling, which was
incorrectly skipping offsets that look like signed 32-bit values. Also
validate the offsets are dword aligned before folding.
This should be no problem to support with a pattern, but it turns out
there are just too many yaks to shave. The main problem is in the DAG
emitter, which I have no desire to sink effort into fixing.
If we had a bit to disable patterns in the DAG importer, fixing the
GlobalISelEmitter is more manageable.
Trivial type predicates should be moved into the tablegen pattern
itself, and not checked inside complex patterns. This eliminates a
redundant complex pattern, and fixes select source modifiers for
GlobalISel.
I have further patches which fully handle select in tablegen and
remove all of the C++ selection, although it requires the ugliness to
support the entire range of legal register types.
Use intermediate instructions, unlike with buffer stores. This is
necessary because of the need to have an internal way to distinguish
between signed and unsigned extloads. This introduces some duplication
and near duplication with the buffer store selection path. The store
handling should maybe be moved into legalization to match and
eliminate the duplication.
The pattern is also mishandled by the generated matcher, so workaround
this as in the DAG path.
The existing DAG tests aren't particularly targeted to just this one
intrinsic. These also end up differing in scheduling from SGPR->VGPR
operand constraint copies.
Handle dynamic vector extracts that use an index that's an add of a
constant offset into moving the base subregister of the indexing
operation.
Force the add into the loop in regbankselect, which will be recognized
when selected.
This does produce slightly different code. Now a unique IMPLICIT_DEF
is emitted for each of the implicit_def operands, rather than reusing
the same one.
The current implementation assumes there is an instruction associated
with the transform, but this is not the case for
timm/TargetConstant/immarg values. These transforms should directly
operate on a specific MachineOperand in the source
instruction. TableGen would assert if you attempted to define an
equivalent GISDNodeXFormEquiv using timm when it failed to find the
instruction matcher.
Specially recognize SDNodeXForms on timm, and pass the operand index
to the render function.
Ideally this would be a separate render function type that looks like
void renderFoo(MachineInstrBuilder, const MachineOperand&), but this
proved to be somewhat mechanically painful. Add an optional operand
index which will only be passed if the transform should only look at
the one source operand.
Theoretically it would also be possible to only ever pass the
MachineOperand, and the existing renderers would check the parent. I
think that would be somewhat ugly for the standard usage which may
want to inspect other operands, and I also think MachineOperand should
eventually not carry a pointer to the parent instruction.
Use it in one sample pattern. This isn't a great example, since the
transform exists to satisfy DAG type constraints. This could also be
avoided by just changing the MachineInstr's arbitrary choice of
operand type from i16 to i32. Other patterns have nontrivial uses, but
this serves as the simplest example.
One flaw this still has is if you try to use an SDNodeXForm defined
for imm, but the source pattern uses timm, you still see the "Failed
to lookup instruction" assert. However, there is now a way to avoid
it.
This doesn't enable any new imports yet, but moves the fmed patterns
from failing on this to hitting the "complex suboperand referenced
more than once" limitation in tablegen.
This solves selection failures with generated selection patterns,
which would fail due to inferring the SGPR reg bank for virtual
registers with a set register class instead of VCC bank. Use
instruction selection would constrain the virtual register to a
specific class, so when the def was selected later the bank no longer
was set to VCC.
Remove the SCC reg bank. SCC isn't directly addressable, so it
requires copying from SCC to an allocatable 32-bit register during
selection, so these might as well be treated as 32-bit SGPR values.
Now any scalar boolean value that will produce an outupt in SCC should
be widened during RegBankSelect to s32. Any s1 value should be a
vector boolean during selection. This makes the vcc register bank
unambiguous with a normal SGPR during selection.
Summary of how this should now work:
- G_TRUNC is always a no-op, and never should use a vcc bank result.
- SALU boolean operations should be promoted to s32 in RegBankSelect
apply mapping
- An s1 value means vcc bank at selection. The exception is for
legalization artifacts that use s1, which are never VCC. All other
contexts should infer the VCC register classes for s1 typed
registers. The LLT for the register is now needed to infer the
correct register class. Extensions with vcc sources should be
legalized to a select of constants during RegBankSelect.
- Copy from non-vcc to vcc ensures high bits of the input value are
cleared during selection.
- SALU boolean inputs should ensure the inputs are 0/1. This includes
select, conditional branches, and carry-ins.
There are a few somewhat dirty details. One is that G_TRUNC/G_*EXT
selection ignores the usual register-bank from register class
functions, and can't handle truncates with VCC result banks. I think
this is OK, since the artifacts are specially treated anyway. This
does require some care to avoid producing cases with vcc. There will
also be no 100% reliable way to verify this rule is followed in
selection in case of register classes, and violations manifests
themselves as invalid copy instructions much later.
Standard phi handling also only considers the bank of the result
register, and doesn't insert copies to make the source banks
match. This doesn't work for vcc, so we have to manually correct phi
inputs in this case. We should add a verifier check to make sure there
are no phis with mixed vcc and non-vcc register bank inputs.
There's also some duplication with the LegalizerHelper, and some code
which should live in the helper. I don't see a good way to share
special knowledge about what types to use for intermediate operations
depending on the bank for example. Using the helper to replace
extensions with selects also seems somewhat awkward to me.
Another issue is there are some contexts calling
getRegBankFromRegClass that apparently don't have the LLT type for the
register, but I haven't yet run into a real issue from this.
This also introduces new unnecessary instructions in most cases, since
we don't yet try to optimize out the zext when the source is known to
come from a compare.
This assumed a 32-bit extract size, which would produce invalid copies
with 64-bit extracts. Handle the easy case. Ideally we would have a
way to get the proper subreg index for any 32-bit offset, but there
should probably be a tablegenerated way of getting the subreg index
for any size and offset.