This was a workaround for not supporting indirect calls when
instcombine didn't eliminate constant expression casts of the callee
at -O0. Indirect calls are supposed to work now, so drop the hack.
Code using indirect calls is broken without this, and there isn't
really much value in supporting the old attempt to vary the argument
placement based on uses. This resulted in more argument shuffling code
anyway.
Also have the option stop implying all inputs need to be passed. This
will no rely on the amdgpu-no-* attributes to avoid passing
unnecessary values.
- CUDA cannot associate memory space with pointer types. Even though Clang could add extra attributes to specify the address space explicitly on a pointer type, it breaks the portability between Clang and NVCC.
- This change proposes to assume the address space from a pointer from the assumption built upon target-specific address space predicates, such as `__isGlobal` from CUDA. E.g.,
```
foo(float *p) {
__builtin_assume(__isGlobal(p));
// From there, we could assume p is a global pointer instead of a
// generic one.
}
```
This makes the code portable without introducing the implementation-specific features.
Note that NVCC starts to support __builtin_assume from version 11.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112041
Run post-RA SIShrinkInstructions just before post-RA scheduling, instead
of afterwards. After the fixes in D112305 and D112317 this seems to make
no difference, but it paves the way for scheduler tweaks that are
sensitive to the e32 vs e64 encoding of VALU instructions.
Differential Revision: https://reviews.llvm.org/D112341
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.
Differential Revision: https://reviews.llvm.org/D111646
The new pass walks kernel's pointer arguments, then loads from them.
If a loaded value is a pointer and loaded pointer is unmodified in
the kernel before the load, then promote loaded pointer to global.
Then recursively continue.
Differential Revision: https://reviews.llvm.org/D111464
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
pressure, which in some cases gives it more freedom to reorder
instructions.
Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
increased by about 1%.
- The number of runs of two or more load instructions increased by
about 4%.
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.
This allows us to ensure that Support doesn't have includes from MC/*.
Differential Revision: https://reviews.llvm.org/D111454
This was introduced in D32628 but it does not seem to be required any
more. At least it does not show any problems in check-llvm in an
LLVM_ENABLE_EXPENSIVE_CHECKS build.
Differential Revision: https://reviews.llvm.org/D110692
ASan device library functions (those starts with the prefix __asan_)
are at the moment undergoing through undesired optimizations due to
internalization. Hence, in order to avoid such undesired optimizations
on ASan device library functions, do not internalize them in the first
place.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D110468
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
Previously we assumed all callable functions did not need any
implicitly passed inputs, and added attributes to functions to
indicate when they were necessary. Requiring attributes for
correctness is pretty ugly, and it makes supporting indirect and
external calls more complicated.
This inverts the direction of the attributes, so an undecorated
function is assumed to need all implicit imputs. This enables
AMDGPUAttributor by default to mark when functions are proven to not
need a given input. This strips the equivalent functionality from the
legacy AMDGPUAnnotateKernelFeatures pass.
However, AMDGPUAnnotateKernelFeatures is not fully removed at this
point although it should be in the future. It is still necessary for
the two hacky amdgpu-calls and amdgpu-stack-objects attributes, which
would be better served by a trivial analysis on the IR during
selection. Additionally, AMDGPUAnnotateKernelFeatures still
redundantly handles the uniform-work-group-size attribute to be
removed in a future commit.
At this point when not using -amdgpu-fixed-function-abi, we are still
modifying the ABI based on these newly negated attributes. In the
future, this option will be removed and the locations for implicit
inputs will always be fixed. We will then use the new attributes to
avoid passing the values when unnecessary.
1. Splitted out some parts of R600 target to separate modules/headers.
2. Reduced some include lists in headers.
3. Found and fixed issue with override `GCNTargetMachine::getSubtargetImpl()`
and `R600TargetMachine::getSubtargetImpl()` had different return value type
than base class.
4. Minor forward declarations cleanup.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108596
This patch introduces a new code object metadata field, ".kind"
which is used to add support for init and fini kernels.
HSAStreamer will use function attributes, "device-init" and
"device-fini" to distinguish between init and fini kernels from
the regular kernels and will emit metadata with ".kind" set to
"init" and "fini" respectively.
To reduce the number of init and fini kernels, the ctors and
dtors present in the llvm's global.ctors and global.dtors lists
are called from a single init and fini kernel respectively.
Reviewed by: yaxunl
Differential Revision: https://reviews.llvm.org/D105682
This patch introduces a new code object metadata field, ".kind"
which is used to add support for init and fini kernels.
HSAStreamer will use function attributes, "device-init" and
"device-fini" to distinguish between init and fini kernels from
the regular kernels and will emit metadata with ".kind" set to
"init" and "fini" respectively.
To reduce the number of init and fini kernels, the ctors and
dtors present in the llvm's global.ctors and global.dtors lists
are called from a single init and fini kernel respectively.
Reviewed by: yaxunl
Differential Revision: https://reviews.llvm.org/D105682
Pulled out the OptimizationLevel class from PassBuilder in order to be able to access it from within the PassManager and avoid include conflicts.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D107025
This patch introduces a pass that uses the Attributor to deduce AMDGPU specific attributes.
Reviewed By: jdoerfert, arsenm
Differential Revision: https://reviews.llvm.org/D104997
This is SCC pass, moving it to the end of SCC PM saves one
Function PM. This needs the analysis to take into account
memory access width since it is now places after the
load/store optimizer (D105651).
Differential Revision: https://reviews.llvm.org/D105652
First, collect the register usage in each function, then apply the
maximum register usage of all functions to functions with indirect
calls.
This is more accurate than guessing the maximum register usage without
looking at the actual usage.
As before, assume that indirect calls will hit a function in the
current module.
Differential Revision: https://reviews.llvm.org/D105839
AMDGPU normally spills SGPRs to VGPRs. Previously, since all register
classes are handled at the same time, this was problematic. We don't
know ahead of time how many registers will be needed to be reserved to
handle the spilling. If no VGPRs were left for spilling, we would have
to try to spill to memory. If the spilled SGPRs were required for exec
mask manipulation, it is highly problematic because the lanes active
at the point of spill are not necessarily the same as at the restore
point.
Avoid this problem by fully allocating SGPRs in a separate regalloc
run from VGPRs. This way we know the exact number of VGPRs needed, and
can reserve them for a second run. This fixes the most serious
issues, but it is still possible using inline asm to make all VGPRs
unavailable. Start erroring in the case where we ever would require
memory for an SGPR spill.
This is implemented by giving each regalloc pass a callback which
reports if a register class should be handled or not. A few passes
need some small changes to deal with leftover virtual registers.
In the AMDGPU implementation, a new pass is introduced to take the
place of PrologEpilogInserter for SGPR spills emitted during the first
run.
One disadvantage of this is currently StackSlotColoring is no longer
used for SGPR spills. It would need to be run again, which will
require more work.
Error if the standard -regalloc option is used. Introduce new separate
-sgpr-regalloc and -vgpr-regalloc flags, so the two runs can be
controlled individually. PBQB is not currently supported, so this also
prevents using the unhandled allocator.
- In [D98783](https://reviews.llvm.org/D98783), an extra GlobalDCE pass
is inserted before the internalization pass to ensure a global
variable without users could be internalized even if there are dead
users. Instead of inserting a dedicated optimization pass, the
dead user checking, i.e. 'use_empty()', should be preceeded with
constant dead user removal to ensure an accurate result.
Differential Revision: https://reviews.llvm.org/D105590
We have several checks for both cl::opt and OptLevel over our
pass config, although these checks do not properly work if
default value of a cl::opt will be false. Create a helper to
use instead and properly handle it. NFC for now.
Differential Revision: https://reviews.llvm.org/D105517
There are cases where infer address spaces pass cannot yet
infer an address space in the opt pipeline and then in the
llc pipeline it runs too late for atomic expand pass to
benefit from a specific address space.
Move atomic expand pass past the infer address spaces.
Fixes: SWDEV-293410
Differential Revision: https://reviews.llvm.org/D105511
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.
This gives 10-20% uplift in a set of huge apps heavily using double
precession math.
Fixes: SWDEV-292645
Differential Revision: https://reviews.llvm.org/D104874
Most tests passed with an extra argument to explicitly enable the pass.
One does not, deleted it as part of this change. I can't see why the codegen
would be different between default on and default off but switched on. It
can be retrieved from the project history.
This would be a revert, but git revert was not clean. Disabling the pass
and leaving it in tree is less likely to cause breakage elsewhere than
patching up the git revert conflicts on unfamiliar code. It'll be landed
without review, as @hsmhsm is believed unavailable at present.
Differential Revision: https://reviews.llvm.org/D104962
This pass aims to optimize VGPR live-range in a typical divergent if-else
control flow. For example:
def(a)
if(cond)
use(a)
... // A
else
use(a)
As AMDGPU access vgpr with respect to active-mask, we can mark `a` as
dead in region A. For details, please refer to the comments in
implementation file.
The pass is enabled by default, the frontend can disable it through
"-amdgpu-opt-vgpr-liverange=false".
Differential Revision: https://reviews.llvm.org/D102212
The main motivation behind pointer replacement of LDS use within non-kernel
functions is - to *avoid* subsequent LDS lowering pass from directly packing
LDS (assume large LDS) into a struct type which would otherwise cause allocating
huge memory for struct instance within every kernel.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103225
This patch disables the SIFormMemoryClauses pass at -O1. This pass has a
significant impact on compilation time, so we only want it to be enabled
starting from -O2.
Differential Revision: https://reviews.llvm.org/D101939
Moving code sinking pass before structurizer creates more sinking
opportunities.
The extra flow edges introduced by the structurizer can have adverse
effects on sinking, because the sinking pass prefers moving instructions
to blocks with unique predecessors and the structurizer destroys that
property in some cases.
A notable example is moving high-latency image instructions across kills.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D101115
Printing pass manager invocations is fairly verbose and not super
useful.
This allows us to remove DebugLogging from pass managers and PassBuilder
since all logging (aside from analysis managers) goes through
instrumentation now.
This has the downside of never being able to print the top level pass
manager via instrumentation, but that seems like a minor downside.
Reviewed By: ychen
Differential Revision: https://reviews.llvm.org/D101797
Serialize ScavengeFI from SIMachineFunctionInfo into yaml.
ScavengeFI is not used outside of the PrologEpilogInserter,
so this shouldn't change anything.
Differential Revision: https://reviews.llvm.org/D101367
This patch disables some of the passes at -O1. These passes have a significant
impact on compilation time, so we only want them to be enabled starting from -O2.
Differential Revision: https://reviews.llvm.org/D101414
the compilation time and there is no case for which we see any improvement in
performance. This patch removes this pass and its associated test cases from
the tree.
Differential Revision: https://reviews.llvm.org/D101313
Change-Id: I0599169a7609c19a887f8d847a71e664030cc141
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.
Reapply after fixing dependent change D100188.
Differential Revision: https://reviews.llvm.org/D100189
The internalization pass only internalizes global variables
with no users. If the global variable has some dead user,
the internalization pass will not internalize it.
To be able to internalize global variables with dead
users, a global dce pass is needed before the
internalization pass.
This patch adds that.
Reviewed by: Artem Belevich, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D98783
Now since LDS uses within non-kernel functions are being handled in the
pass - LowerModuleLDS, we *NO* need to *forcefully* inline non-kernel
functions just because they use LDS. Do forceful inlining only when the
pass - LowerModuleLDS is not enabled. It is enabled by default.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D100481