Adds legalizer, register bank select, and instruction
select support for G_SBFX and G_UBFX. These opcodes generate
scalar or vector ALU bitfield extract instructions for
AMDGPU. The instructions allow both constant or register
values for the offset and width operands.
The 32-bit scalar version is expanded to a sequence that
combines the offset and width into a single register.
There are no 64-bit vgpr bitfield extract instructions, so the
operations are expanded to a sequence of instructions that
implement the operation. If the width is a constant,
then the 32-bit bitfield extract instructions are used.
Moved the AArch64 specific code for creating G_SBFX to
CombinerHelper.cpp so that it can be used by other targets.
Only bitfield extracts with constant offset and width values
are handled currently.
Differential Revision: https://reviews.llvm.org/D100149
This also adds new interfaces for the fixed- and scalable case:
* LLT::fixed_vector
* LLT::scalable_vector
The strategy for migrating to the new interfaces was as follows:
* If the new LLT is a (modified) clone of another LLT, taking the
same number of elements, then use LLT::vector(OtherTy.getElementCount())
or if the number of elements is halfed/doubled, it uses .divideCoefficientBy(2)
or operator*. That is because there is no reason to specifically restrict
the types to 'fixed_vector'.
* If the algorithm works on the number of elements (as unsigned), then
just use fixed_vector. This will need to be fixed up in the future when
modifying the algorithm to also work for scalable vectors, and will need
then need additional tests to confirm the behaviour works the same for
scalable vectors.
* If the test used the '/*Scalable=*/true` flag of LLT::vector, then
this is replaced by LLT::scalable_vector.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D104451
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving. This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.
We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.
Differential Revision: https://reviews.llvm.org/D103338
G_INSERT legalization is incomplete and doesn't work very
well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES
padding with undef values (although this can get pretty large).
For the case of load/store narrowing, this is still performing the
load/stores in irregularly sized pieces. It might be cleaner to split
this down into equal sized pieces, and rely on load/store merging to
optimize it.
When narrowing G_ADD and G_SUB, handle types that aren't a multiple of
the type we're narrowing to. This allows us to handle types like s96
on 64 bit targets.
Note that the test here has a couple of dead instructions because of
the way the setup legalizes. I wasn't able to come up with a way to
write this test that avoids that easily.
Differential Revision: https://reviews.llvm.org/D97811
When narrowing G_INSERT, handle types that aren't a multiple of the
type we're narrowing to. This comes up if we're narrowing something
like an s96 to fit in 64 bit registers and also for non-byte multiple
packed types if they come up.
This implementation handles these cases by extending the extra bits to
the narrow size and truncating the result back to the destination
size.
Differential Revision: https://reviews.llvm.org/D97791
Also changes the fewerElements helper to use the lookthrough constant helper
instead of m_ICst, since m_ICst doesn't look through extends.
Differential Revision: https://reviews.llvm.org/D103227
Adjusting the load register type is a widenScalar type action, not a
lowering. lowerLoad should be reserved for operations that change the
memory access size, such as unaligned load decomposition. With this
trying to adjust the register type, it was hard to avoid infinite
loops in the legalizer. Adds a bandaid to avoid regressing a few
AArch64 tests, but I'm not sure what the exact condition is and
there's probably a cleaner way to do this.
For AMDGPU this regresses handling of some cases for unaligned loads,
but the way this is currently working is a pretty ugly hack.
Thhis is a port from the DAG legalization. We're still missing some of the
canonicalizations of shuffles but it's a start.
Differential Revision: https://reviews.llvm.org/D102828
There were a bunch of lost debug location remarks that show up when legalizing
tail calls on AArch64.
This would happen because we drop the return in the block where we emit the
tail call. So, we end up dropping the debug location, which makes the
LostDebugLocObserver report a missing debug location.
Although it's *true* that we lose these debug locations, this isn't
a particularly useful remark. We expect to drop these debug locations when
emitting tail calls. Suppressing remarks in this case is preferable, since the
amount of noise could hide actual debug location related bugs.
To do this, I just plumbed the LostDebugLocObserver through the relevant
LegalizerHelper functions. This is the only case I can think of where we need
the LostDebugLocObserver in the LegalizerHelper. So, rather than storing it
in the LegalizerHelper proper and mucking around with the constructors, I
figured it'd be cleanest to take the simplest path for now.
This clears up ~20 noisy lost debug location remarks on CTMark in AArch64 at
-Os.
Differential Revision: https://reviews.llvm.org/D103128
The function `reduceOperationWidth` helps to legalize a vector
operation either by narrowing its type or by scalarizing the
operation itself. It currently supports instructions with one result.
This patch, in addition allows the same for instructions with two
results (for instance, G_SDIVREM).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D100725
This patch adds 3 methods, one for power-of-2 vectors which use tree
reductions using vector ops, before a final reduction op. For non-pow-2
types it generates multiple narrow reductions and combines the values with
scalar ops.
Differential Revision: https://reviews.llvm.org/D97163
Darwin platforms for both AArch64 and X86 can provide optimized `bzero()`
routines. In this case, it may be preferable to use `bzero` in place of a
memset of 0.
This adds a G_BZERO generic opcode, similar to G_MEMSET et al. This opcode can
be generated by platforms which may want to use bzero.
To emit the G_BZERO, this adds a pre-legalize combine for AArch64. The
conditions for this are largely a port of the bzero case in
`AArch64SelectionDAGInfo::EmitTargetCodeForMemset`.
The only difference in comparison to the SelectionDAG code is that, when
compiling for minsize, this will fire for all memsets of 0. The original code
notes that it's not beneficial to do this for small memsets; however, using
bzero here will save a mov from wzr. For minsize, I think that it's preferable
to prioritise omitting the mov.
This also fixes a bug in the libcall legalization code which would delete
instructions which could not be legalized. It also adds a check to make sure
that we actually get a libcall name.
Code size improvements (Darwin):
- CTMark -Os: -0.0% geomean (-0.1% on pairlocalalign)
- CTMark -Oz: -0.2% geomean (-0.5% on bullet)
Differential Revision: https://reviews.llvm.org/D99358
It is good to have a combined `divrem` instruction when the
`div` and `rem` are computed from identical input operands.
Some targets can lower them through a single expansion that
computes both division and remainder. It effectively reduces
the number of instructions than individually expanding them.
Reviewed By: arsenm, paquette
Differential Revision: https://reviews.llvm.org/D96013
The current narrowing code for G_PHI can only handle the case
where the size is a multiple of the narrow size. If this is not
the case, fall back to SDAG instead of asserting.
Original patch by shepmaster.
Differential Revision: https://reviews.llvm.org/D92446
The implementation for vectors is broken and doesn't seem to be used by
anything. Explicitly remove support for them, they can be added again
later when they're properly implemented.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D95699
These two cases have identical implementations other than an
unreachable part of `G_ADD` that checks if the scalar we're narrowing
is a vector. Combining them to avoid unnecessary divergence.
Remove the call to setFlags in favour of creating the instruction with
the correct flags in the first place, so we don't have to explicitly
notify the observer.
Differential Revision: https://reviews.llvm.org/D95681
These are widened to a wider UADDE/USUBE, with the overflow value
unused, and with the same synthesis of a new overflow value as for the
O operations.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D95326
Implement widening for G_SADDO and G_SSUBO.
Add legalize-add/sub tests for narrow overflowing add/sub on AArch64.
Differential Revision: https://reviews.llvm.org/D95034
The widenScalar implementation for signed and unsigned overflowing
operations were very similar: both are checked by truncating the result
and then re-sign/zero-extending it and checking that it matches the
computed operation.
Using a truncate + zero-extend for the unsigned case instead of manually
producing the AND instruction like before leads to an extra copy
instruction during legalization, but this should be harmless.
Differential Revision: https://reviews.llvm.org/D95035
Implement widening for G_SADDO and G_SSUBO. Previously it was only
implemented for G_UADDO and G_USUBO. Also add legalize-add/sub tests for
narrow overflowing add/sub on AArch64.
Differential Revision: https://reviews.llvm.org/D95034
This fixes double printing of insertion debug messages in the
legalizer.
Try to cleanup usage of observers. Currently the use of observers is
pretty hard to follow and it's not clear what is responsible for
them. Observers are referenced in 3 places:
1. In the MachineFunction
2. In the MachineIRBuilder
3. In the LegalizerHelper
The observers in the MachineFunction and MachineIRBuilder are both
called only on insertions, and are redundant with each other. The
source of the double printing was the same observer was added to both
the MachineFunction, and the MachineIRBuilder. One of these references
needs to be removed. Arguably observers in general should be fully
removed from one or the other, but it may be useful to have a local
observer in the MachineIRBuilder that is not added to the function's
observers. Alternatively, the wrapper observer could manage a local
observer in one place.
The LegalizerHelper only ever calls the observer on changing/changed
instructions, and never insertions. Logically these are two different
types of observers, for changes and for insertions.
Additionally, some places used the GISelObserverWrapper when they only
needed a single observer they could use directly.
Setting the observer in the LegalizerHelper constructor is not
flexible enough if the LegalizerHelper is constructed anywhere outside
the one used by the legalizer. AMDGPU calls the LegalizerHelper in
RegBankSelect, and needs to use a local observer to apply the regbank
to newly created instructions. Currently it accomplishes this by
constructing a local MachineIRBuilder. I'm trying to move the
MachineIRBuilder to be owned/maintained by the RegBankSelect pass
itself, but the locally constructed LegalizerHelper would reset the
observer.
Mips also has a special case use of the LegalizationArtifactCombiner
in applyMappingImpl; I think we do need to run the artifact combiner
during RegBankSelect, but in a more consistent way outside of
applyMappingImpl.
The lowering of vector selects needs to first splat the scalar mask into a vector
first.
This was causing a crash when building oggenc in the test suite.
Differential Revision: https://reviews.llvm.org/D91655
During lowering of G_UMULO and friends, the previous code moved the builder's
insertion point to be after the legalizing instruction. When that happened, if
there happened to be a "G_CONSTANT i32 0" immediately after, the CSEMIRBuilder
would try to find that constant during the buildConstant(zero) call, and since
it dominates itself would return the iterator unchanged, even though the def
of the constant was *after* the current insertion point. This resulted in the
compare being generated *before* the constant which it was using.
There's no need to modify the insertion point before building the mul-hi or
constant. Delaying moving the insert point ensures those are built/CSEd before
the G_ICMP is built.
Fixes PR47679
Differential Revision: https://reviews.llvm.org/D88514
Fix creation of illegal unmerge when widen was requested to a type which
is not a multiple of the destination type. E.g. when trying to widen
an s48 unmerge to s64 the existing code would create an illegal unmerge
from s64 to s48.
Instead, create further unmerges to a GCD type, then use this to remerge
these intermediate results to the actual destinations.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D88422