This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133593
There are two ctlz intrinsics here with the zero_is_poison flag
set. There are also two comparisons that check if either of the
inputs the ctlzs are zero. We need to use a logical or to block
the poison from the ctlz if either of the inputs is zero.
Reviewed By: arsenm, aqjune
Differential Revision: https://reviews.llvm.org/D130680
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
Since the divergence-driven instruction selection has been enabled for AMDGPU,
all the uniform instructions are expected to be selected to SALU form, except those not having one.
VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm
that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
and overhead introduced by the data transfer between the VGPRs and SGPRs.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D128252
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.
Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.
This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122933
Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate
redundant S_AND operations against EXEC for V_CMP results in VCC.
These occur after after register allocation when VCC has been
selected as the comparison destination.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D120202
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
Differential Revision: https://reviews.llvm.org/D117219
This can be viewed as recognizing that multiply-by-power-of-2 doesn't
have a carry into the top bit of an M-bit * N-bit number.
Enhancing canonicalization of mul -> select might also handle some of
these if we were ok with increasing instruction count with casts in
some cases.
This doesn't help https://llvm.org/PR49055 , but it's a simpler
pattern that we miss.
Note: "-sccp" already gets these examples using a constant
range analysis.
Differential Revision: https://reviews.llvm.org/D114962
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary
results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
The compiler was generating symbols in the final code object for local
branch target labels. This bloats the code object, slows down the loader,
and is only used to simplify disassembly.
Use '--symbolize-operands' with llvm-objdump to improve readability of the
branch target operands in disassembly.
Fixes: SWDEV-312223
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D114273
Introduce V_MOV_B32_indirect_read for indexed vgpr reads
(and rename the old V_MOV_B32_indirect to
V_MOV_B32_indirect_write) so they can be unambiguously
distinguished from regular V_MOV_B32_e32. Previously they
were distinguished by looking for extra implicit operands
but this is fragile because regular moves sometimes have
extra implicit operands too:
- either by accident, when instructions end up with
duplicate implicit operands (see e.g. D100939)
- or by design, when SIInstrInfo::copyPhysReg breaks a
multi-dword copy into individual subreg mov instructions
and adds implicit operands for the super-register.
The effect of this is that SIInstrInfo::isFoldableCopy can
be simplified and identifies more foldable copies. The test
diffs show that more immediate 0 values have been folded as
inline operands.
SIInstrInfo::isReallyTriviallyReMaterializable could
probably be simplified too but that is not part of this
patch.
Differential Revision: https://reviews.llvm.org/D114230
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.
Differential Revision: https://reviews.llvm.org/D113679
MachineLoop::isLoopInvariant() returns false for all VALU
because of the exec use. Check TII::isIgnorableUse() to
allow hoisting.
That unfortunately results in higher register consumption
since MachineLICM does not adequately estimate pressure.
Therefor I think it shall only be enabled after D107677 even
though it does not depend on it.
Differential Revision: https://reviews.llvm.org/D107859
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.
Differential Revision: https://reviews.llvm.org/D111646
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.
There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.
The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.
This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.
We don't have real demanded bits support for MULHU, but we can
still use the known bits based constant folding support at the end
of SimplifyDemandedBits to simplify a MULHU. This helps with cases
where we know the LHS and RHS have enough leading zeros so that
the high multiply result is always 0.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D106471
This allows special constants like to 0 to be recognized. It's also
expected by isel patterns if a target had a mulh with immediate instructions.
The commuting done by tablegen won't commute patterns with immediates since it
expects DAGCombine to have done it.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D107486
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.
This gives 10-20% uplift in a set of huge apps heavily using double
precession math.
Fixes: SWDEV-292645
Differential Revision: https://reviews.llvm.org/D104874
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D104622
The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to use VCC instead. When that happens, the pass must
also change instructions that define SCC to define VCC.
The pass was not changing the SCC definition when an ADDC is
converted due to a input that is a VGPR to SGPR copy. But, the
initial ADD insruction, which define SCC, is not converted.
This causes a compilation failure due to a use of an undefined
physical register.
This patch adds code that inserts the SCC definition in the
MoveToVALU worklist when a SCC use is converted to a VCC use.
Differential Revision: https://reviews.llvm.org/D102111
Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops.
Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement.
Differential Revision: https://reviews.llvm.org/D98857
For an instruction in the basic block BB, SinkingPass enumerates basic blocks
dominated by BB and BB's successors. For each enumerated basic block,
SinkingPass uses `AllUsesDominatedByBlock` to check whether the basic
block dominates all of the instruction's users. This is inefficient.
Use the nearest common dominator of all users to avoid enumerating the
candidate. The nearest common dominator may be in a parent loop which is
not beneficial. In that case, find the ancestors in the dominator tree.
In the case that the instruction has no user, with this change we will
not perform unnecessary move. This causes some amdgpu test changes.
A stage-2 x86-64 clang is a byte identical with this change.
Fix 64-bit copy to SCC by restricting the pattern resulting
in such a copy to subtargets supporting 64-bit scalar compare,
and mapping the copy to S_CMP_LG_U64.
Before introducing the S_CSELECT pattern with explicit SCC
(0045786f14), there was no need
for handling 64-bit copy to SCC ($scc = COPY sreg_64).
The proposed handling to read only the low bits was however
based on a false premise that it is only one bit that matters,
while in fact the copy source might be a vector of booleans and
all bits need to be considered.
The practical problem of mapping the 64-bit copy to SCC is that
the natural instruction to use (S_CMP_LG_U64) is not available
on old hardware. Fix it by restricting the problematic pattern
to subtargets supporting the instruction (hasScalarCompareEq64).
Differential Revision: https://reviews.llvm.org/D85207
tryLatency compares two sched candidates. For the top zone it prefers
the one with lesser depth, but only if that depth is greater than the
total latency of the instructions we've already scheduled -- otherwise
its latency would be hidden and there would be no stall.
Unfortunately it only tests the depth of one of the candidates. This can
lead to situations where the TopDepthReduce heuristic does not kick in,
but a lower priority heuristic chooses the other candidate, whose depth
*is* greater than the already scheduled latency, which causes a stall.
The fix is to apply the heuristic if the depth of *either* candidate is
greater than the already scheduled latency.
All this also applies to the BotHeightReduce heuristic in the bottom
zone.
Differential Revision: https://reviews.llvm.org/D72392
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Re-commit D81925 with a bugfix D82370.
Differential Revision: https://reviews.llvm.org/D81925
Differential Revision: https://reviews.llvm.org/D82370
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81925
Summary: This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
Summary:
Currently we custom select add/sub with carry out to scalar form relying on later replacing them to vector form if necessary.
This change enables custom selection code to take the divergence of adde/addc SDNodes into account and select the appropriate form in one step.
Reviewers: arsenm, vpykhtin, rampitec
Reviewed By: arsenm, vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa
Differential Revision: https://reviews.llvm.org/D76371
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
We probably want this, and I've meant to turn this on for a long
time. SC actually emits a special case to early-out for a 1
denominator, which perhaps should also be considered.
Summary:
This fixes some tests that did not specify -mcpu. Doing that disables
all subtarget features, which gives behavior that (a) does not
necessarily correspond to any actual target, and (b) can change as we
add new subtarget features.
Also added gfx1010 to memtime test.
Differential Revision: https://reviews.llvm.org/D74594
Change-Id: I8c0fe4fa03e9a93ef8bb722cd42d22e064526309