Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.
This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
We are relying on atrificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time since it
allows to schedule a completely independent instruction in the
middle of the cluster.
Removed the DAG mutation and added pass to bundle already
clustered instructions. These bundles are unpacked before the
memory legalizer because it does not work with bundles but also
because it allows to insert waitcounts in the middle of a store
cluster.
Removing artificial edges also allows a more relaxed scheduling.
Differential Revision: https://reviews.llvm.org/D72737
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a sideeffect and the skip branch is
really necessary.
This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.
Differential revision: https://reviews.llvm.org/D68092
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a sideeffect and the skip branch is
really necessary.
This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.
Differential revision: https://reviews.llvm.org/D68092
Summary:
For builds with LLVM_BUILD_LLVM_DYLIB=ON and BUILD_SHARED_LIBS=OFF
this change makes all symbols in the target specific libraries hidden
by default.
A new macro called LLVM_EXTERNAL_VISIBILITY has been added to mark symbols in these
libraries public, which is mainly needed for the definitions of the
LLVMInitialize* functions.
This patch reduces the number of public symbols in libLLVM.so by about
25%. This should improve load times for the dynamic library and also
make abi checker tools, like abidiff require less memory when analyzing
libLLVM.so
One side-effect of this change is that for builds with
LLVM_BUILD_LLVM_DYLIB=ON and LLVM_LINK_LLVM_DYLIB=ON some unittests that
access symbols that are no longer public will need to be statically linked.
Before and after public symbol counts (using gcc 8.2.1, ld.bfd 2.31.1):
nm before/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
36221
nm after/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
26278
Reviewers: chandlerc, beanz, mgorny, rnk, hans
Reviewed By: rnk, hans
Subscribers: merge_guards_bot, luismarques, smeenai, ldionne, lenary, s.egerton, pzheng, sameer.abuasal, MaskRay, wuzish, echristo, Jim, hiraditya, michaelplatings, chapuni, jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, mgrang, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, kristina, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D54439
Summary:
- `dead-mi-elimination` assumes MIR in the SSA form and cannot be
arranged after phi elimination or DeSSA. It's enhanced to handle the
dead register definition by skipping use check on it. Once a register
def is `dead`, all its uses, if any, should be `undef`.
- Re-arrange the DIE in RA phase for AMDGPU by placing it directly after
`detect-dead-lanes`.
- Many relevant tests are refined due to different register assignment.
Reviewers: rampitec, qcolombet, sunfish
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72709
This file lists every pass in LLVM, and is included by Pass.h, which is
very popular. Every time we add, remove, or rename a pass in LLVM, it
caused lots of recompilation.
I found this fact by looking at this table, which is sorted by the
number of times a file was changed over the last 100,000 git commits
multiplied by the number of object files that depend on it in the
current checkout:
recompiles touches affected_files header
342380 95 3604 llvm/include/llvm/ADT/STLExtras.h
314730 234 1345 llvm/include/llvm/InitializePasses.h
307036 118 2602 llvm/include/llvm/ADT/APInt.h
213049 59 3611 llvm/include/llvm/Support/MathExtras.h
170422 47 3626 llvm/include/llvm/Support/Compiler.h
162225 45 3605 llvm/include/llvm/ADT/Optional.h
158319 63 2513 llvm/include/llvm/ADT/Triple.h
140322 39 3598 llvm/include/llvm/ADT/StringRef.h
137647 59 2333 llvm/include/llvm/Support/Error.h
131619 73 1803 llvm/include/llvm/Support/FileSystem.h
Before this change, touching InitializePasses.h would cause 1345 files
to recompile. After this change, touching it only causes 550 compiles in
an incremental rebuild.
Reviewers: bkramer, asbirlea, bollu, jdoerfert
Differential Revision: https://reviews.llvm.org/D70211
The default FP mode should really be a property of a specific
function, and not a subtarget. Introduce the necessary fields to the
SIMachineFunctionInfo to help move towards this goal.
Summary:
This has been superseded by "[AMDGPU]: PHI Elimination hooks added for custom COPY insertion."
This reverts the code changes from commit 53f967f2bd
but keeps the test case.
Reviewers: hliao, arsenm, tpr, dstuttard
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68769
llvm-svn: 374347
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
llvm-svn: 374284
The mul24 matching could interfere with SLSR and the other addressing
mode related passes. This probably is not the optimal placement, but
is an intermediate step. This should probably be moved after all the
generic IR passes, particularly LSR. Moving this after LSR seems to
help in some cases, and hurts others.
As-is in this patch, in idiv-licm, it saves 1-2 instructions inside
some of the loop bodies, but increases the number in others. Moving
this later helps these loops. In the new lsr tests in
mul24-pass-ordering, the intrinsic prevents introducing more
instructions in the loop preheader, so moving this later ends up
hurting them. This shouldn't be any worse than before the intrinsics
were introduced in r366094, and LSR should probably be smarter. I
think it's because it doesn't know the and inside the loop will be
folded away.
llvm-svn: 369991
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances across the monorepo.
llvm-svn: 369013
This pass is a port of the according pass from the HSAIL compiler.
It parses printf calls and setup runtime printf buffer.
After that it copies printf arguments to the buffer and fills in
module metadata for runtime.
Differential Revision: https://reviews.llvm.org/D24035
llvm-svn: 368592
Summary:
- As LCSSA is turned on just before isel, it may create PHI of the flow,
which is consumed by pseudo structurized CFG instructions. When that
PHIs are eliminated in O0, COPY may be placed wrongly as the these
pseudo structurized CFG instructions are considering prologue of MBB.
- Run extra `unreachable-mbb-elimination` at the end of isel to clean up
PHIs.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64353
llvm-svn: 367023
Add a string attribute instead of directly setting
MachineFunctionInfo. This avoids trying to get the analysis in the
MachineFunctionInfo in a way that doesn't work with the new pass
manager.
This will also avoid re-visiting the call graph for every single
function.
llvm-svn: 365241
This is split out from my patches to split register allocation into a
separate SGPR and VGPR phase, and has some parts that aren't yet used
(like maintaining LiveIntervals).
This simplifies making the frame pointer register callee saved. As it
is now, the code to determine callee saves needs to predict all the
possible SGPR spills and how many callee saved VGPRs are needed. By
handling this before PrologEpilogInserter, it's possible to just check
the spill objects that already exist.
Change-Id: I29e6df4034afcf949e06f8ef44206acb94696f04
llvm-svn: 365095
This allows targets to make more decisions about reserved registers
after isel. For example, now it should be certain there are calls or
stack objects in the frame or not, which could have been introduced by
legalization.
Patch by Matthias Braun
llvm-svn: 363757
AMDGPUPropagateAttributes will not work on function bitcatsts,
so move AMDGPUFixFunctionBitcasts before it.
Differential Revision: https://reviews.llvm.org/D63455
llvm-svn: 363614
The pass works in two modes:
Mode 1: Just set attributes starting from kernels. This can work at
the very beginning of opt and llc pipeline, but cannot clone functions
because it must be a function pass.
Mode 2: Actually clone functions for new attributes. This can only work
after all function passes in the opt pipeline because it has to be a
module pass.
Differential Revision: https://reviews.llvm.org/D63208
llvm-svn: 363586
This reverts r362990 (git commit 374571301d)
This was causing linker warnings on Darwin:
ld: warning: direct access in function 'llvm::initializeEvexToVexInstPassPass(llvm::PassRegistry&)'
from file '../../lib/libLLVMX86CodeGen.a(X86EvexToVex.cpp.o)' to global weak symbol
'void std::__1::__call_once_proxy<std::__1::tuple<void* (&)(llvm::PassRegistry&),
std::__1::reference_wrapper<llvm::PassRegistry>&&> >(void*)' from file '../../lib/libLLVMCore.a(Verifier.cpp.o)'
means the weak symbol cannot be overridden at runtime. This was likely caused by different translation
units being compiled with different visibility settings.
llvm-svn: 363028
Summary:
For builds with LLVM_BUILD_LLVM_DYLIB=ON and BUILD_SHARED_LIBS=OFF
this change makes all symbols in the target specific libraries hidden
by default.
A new macro called LLVM_EXTERNAL_VISIBILITY has been added to mark symbols in these
libraries public, which is mainly needed for the definitions of the
LLVMInitialize* functions.
This patch reduces the number of public symbols in libLLVM.so by about
25%. This should improve load times for the dynamic library and also
make abi checker tools, like abidiff require less memory when analyzing
libLLVM.so
One side-effect of this change is that for builds with
LLVM_BUILD_LLVM_DYLIB=ON and LLVM_LINK_LLVM_DYLIB=ON some unittests that
access symbols that are no longer public will need to be statically linked.
Before and after public symbol counts (using gcc 8.2.1, ld.bfd 2.31.1):
nm before/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
36221
nm after/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
26278
Reviewers: chandlerc, beanz, mgorny, rnk, hans
Reviewed By: rnk, hans
Subscribers: Jim, hiraditya, michaelplatings, chapuni, jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, mgrang, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, kristina, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D54439
llvm-svn: 362990
Move the declarations of getThe<Name>Target() functions into a new header in
TargetInfo and make users of these functions include this new header.
This fixes a layering problem.
llvm-svn: 360713
Because CodeGen can't depend on GlobalISel, we need a way to encapsulate the CSE
configs that can be passed between TargetPassConfig and the targets' custom
pass configs. This CSEConfigBase allows targets to create custom CSE configs
which is then used by the GISel passes for the CSEMIRBuilder.
This support will be used in a follow up commit to allow constant-only CSE for
-O0 compiles in D60580.
llvm-svn: 358368
Add support for min/max flavor selects in computeConstantRange(),
which allows us to fold comparisons of a min/max against a constant
in InstSimplify. This fixes an infinite InstCombine loop, with the
test case taken from D59378.
Relative to the previous iteration, this contains some adjustments for
AMDGPU med3 tests: The AMDGPU target runs InstSimplify prior to codegen,
which ends up constant folding some existing med3 tests after this
change. To preserve these tests a hidden -amdgpu-scalar-ir-passes option
is added, which allows disabling scalar IR passes (that use InstSimplify)
for testing purposes.
Differential Revision: https://reviews.llvm.org/D59506
llvm-svn: 357870
Detect dead lanes can create some dead defs. Then RenameIndependentSubregs
will break a REG_SEQUENCE which may use these dead defs. At this point
a dead instruction can be removed but we do not run a DCE anymore.
MachineDCE was only running before live variable analysis. The patch
adds a mean to preserve LiveIntervals and SlotIndexes in case it works
past this.
Differential Revision: https://reviews.llvm.org/D59626
llvm-svn: 357805
This change incorporates an effort by Connor Abbot to change how we deal
with WWM operations potentially trashing valid values in inactive lanes.
Previously, the SIFixWWMLiveness pass would work out which registers
were being trashed within WWM regions, and ensure that the register
allocator did not have any values it was depending on resident in those
registers if the WWM section would trash them. This worked perfectly
well, but would cause sometimes severe register pressure when the WWM
section resided before divergent control flow (or at least that is where
I mostly observed it).
This fix instead runs through the WWM sections and pre allocates some
registers for WWM. It then reserves these registers so that the register
allocator cannot use them. This results in a significant register
saving on some WWM shaders I'm working with (130 -> 104 VGPRs, with just
this change!).
Differential Revision: https://reviews.llvm.org/D59295
llvm-svn: 357400
Also includes one example of how this transform is unsound. This isn't
verifying the copies are used in the control flow intrinisic patterns.
Also add option to disable exec mask opt pass. Since this pass is
unsound, it may be useful to turn it off until it is fixed.
llvm-svn: 357091
This will allow targets more flexibility to replace the
register allocator core passes. In a future commit,
AMDGPU will run the core register assignment passes
twice, and will also want to disallow using the
standard -regalloc option.
llvm-svn: 356506
Add an experimental buffer fat pointer address space that is currently
unhandled in the backend. This commit reserves address space 7 as a
non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer
descriptor + 32-bit offset) that is heavily used in graphics workloads
using the AMDGPU backend.
Differential Revision: https://reviews.llvm.org/D58957
llvm-svn: 356373
There are a few different issues, mostly stemming from using
generation based checks for anything instead of subtarget
features. Stop adding flat-address-space as a feature for HSA, as it
should only be a device property. This was incorrectly allowing flat
instructions to select for SI.
Increase the default generation for HSA to avoid the encoding error
when emitting objects. This has some other side effects from various
checks which probably should be separate subtarget features (in the
cost model and for dealing with the DS offset folding issue).
Partial fix for bug 41070. It should probably be an error to try using
amdhsa without flat support.
llvm-svn: 356347
This has been a very painful missing feature that has made producing
reduced testcases difficult. In particular the various registers
determined for stack access during function lowering were necessary to
avoid undefined register errors in a large percentage of
cases. Implement a subset of the important fields that need to be
preserved for AMDGPU.
Most of the changes are to support targets parsing register fields and
properly reporting errors. The biggest sort-of bug remaining is for
fields that can be initialized from the IR section will be overwritten
by a default initialized machineFunctionInfo section. Another
remaining bug is the machineFunctionInfo section is still printed even
if empty.
llvm-svn: 356215
A previous patch for "uniform-work-group-size" attribute was found to break
some RADV and possibly radeon SI tests and had to be retracted.
This patch fixes that.
Differential Revision: http://reviews.llvm.org/D58993
llvm-svn: 355574
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
Updated the annotate-kernel-features pass to support the propagation of uniform-work-group attribute from the kernel to the called functions. Once this pass is run, all kernels, even the ones which initially did not have the attribute, will be able to indicate weather or not they have uniform work group size depending on the value of the attribute.
Differential Revision: https://reviews.llvm.org/D50200
llvm-svn: 348971
A new pass to manage the Mode register.
Currently this just manages the floating point double precision
rounding requirements, but is intended to be easily extended to
encompass all Mode register settings.
The immediate motivation comes from the requirement to use the
round-to-zero rounding mode for the 16 bit interpolation
instructions, where the rounding mode setting is shared between
16 and 64 bit operations.
llvm-svn: 348754
Adds fatal errors for any target that does not support the Tiny or Kernel
codemodels by rejigging the getEffectiveCodeModel calls.
Differential Revision: https://reviews.llvm.org/D50141
llvm-svn: 348585
Introduces DPP pseudo instructions and the pass that combines DPP mov with subsequent uses.
Differential revision: https://reviews.llvm.org/D53762
llvm-svn: 347993
Also revert fix r347876
One of the buildbots was reporting a failure in some relevant tests that I can't
repro or explain at present, so reverting until I can isolate.
llvm-svn: 347911
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
llvm-svn: 347871
Add a pass to fixup various vector ISel issues.
Currently we handle converting GLOBAL_{LOAD|STORE}_*
and GLOBAL_Atomic_* instructions into their _SADDR variants.
This involves feeding the sreg into the saddr field of the new instruction.
llvm-svn: 347008
This allows testing AMDGPU alias analysis like any
other alias analysis pass. This fixes the existing
test pointlessly running opt -O3 when it really
just wants to run the one analysis.
Before there was no way to test this using -aa-eval
with opt, since the default constructed pass
is run. The wrapper subclass allows the
default constructor to pass the necessary callback.
llvm-svn: 346353
Summary:
Instead of writing boolean values temporarily into 32-bit VGPRs
if they are involved in PHIs or are observed from outside a loop,
we use bitwise masking operations to combine lane masks in a way
that is consistent with wave control flow.
Move SIFixSGPRCopies to before this pass, since that pass
incorrectly attempts to move SGPR phis to VGPRs.
This should recover most of the code quality that was lost with
the bug fix in "AMDGPU: Remove PHI loop condition optimization".
There are still some relevant cases where code quality could be
improved, in particular:
- We often introduce redundant masks with EXEC. Ideally, we'd
have a generic computeKnownBits-like analysis to determine
whether masks are already masked by EXEC, so we can avoid this
masking both here and when lowering uniform control flow.
- The criterion we use to determine whether a def is observed
from outside a loop is conservative: it doesn't check whether
(loop) branch conditions are uniform.
Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b
Reviewers: arsenm, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, mgorny, yaxunl, dstuttard, t-tye, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D53496
llvm-svn: 345719
AMDGPU currently only supports direct calls, but at lower optimisation levels it
fails to lower statically direct calls which appear indirect due to a bitcast.
Add a pass to visit all CallSites and use CallPromotionUtils to "devirtualize"
calls.
Differential Revision: https://reviews.llvm.org/D52741
llvm-svn: 345382
This commit adds a new IR level pass to the AMDGPU backend to perform
atomic optimizations. It works by:
- Running through a function and finding atomicrmw add/sub or uses of
the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform,
record the value to be optimized.
- Run through the atomic operations we can optimize and, depending on
whether the value is uniform/divergent use wavefront wide operations
(DPP in the divergent case) to calculate the total amount to be
atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic
operation, reducing the total number of atomic operations in flight.
- Lastly we recombine the result from the single lane to each lane of
the wavefront, and calculate our individual lanes offset into the
final result.
Differential Revision: https://reviews.llvm.org/D51969
llvm-svn: 343973
[AMDGPU] lower-switch in preISel as a workaround for legacy DA
Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.
As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.
llvm-svn: 342956
Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.
As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.
Reviewers: arsenm, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits, simoll
Differential Revision: https://reviews.llvm.org/D52221
llvm-svn: 342722
Memory legalizer, waitcnt, and shrink passes can perturb the instructions,
which means that the post-RA hazard recognizer pass should run after them.
Otherwise, one of those passes may invalidate the work done by the hazard
recognizer. Note that this has adverse side-effect that any consecutive
S_NOP 0's, emitted by the hazard recognizer, will not be shrunk into a
single S_NOP <N>. This should be addressed in a follow-on patch.
Differential Revision: https://reviews.llvm.org/D49288
llvm-svn: 337154
Summary:
This is a follow-up to r335942.
- Merge SISubtarget into AMDGPUSubtarget and rename to GCNSubtarget
- Rename AMDGPUCommonSubtarget to AMDGPUSubtarget
- Merge R600Subtarget::Generation and GCNSubtarget::Generation into
AMDGPUSubtarget::Generation.
Reviewers: arsenm, jvesely
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D49037
llvm-svn: 336851
These won't work for the forseeable future. These aren't allowed
from OpenCL, but IPO optimizations can make them appear.
Also directly set the attributes on functions, regardless
of the linkage rather than cloning functions like before.
llvm-svn: 336587
This allows to hoist code portion to compute reciprocal of loop
invariant denominator in integer division after codegen prepare
expansion.
Differential Revision: https://reviews.llvm.org/D48604
llvm-svn: 335988
This replaces most argument uses with loads, but for
now not all.
The code in SelectionDAG for calling convention lowering
is actively harmful for amdgpu_kernel. It attempts to
split the argument types into register legal types, which
results in low quality code for arbitary types. Since
all kernel arguments are passed in memory, we just want the
raw types.
I've tried a couple of methods of mitigating this in SelectionDAG,
but it's easier to just bypass this problem alltogether. It's
possible to hack around the problem in the initial lowering,
but the real problem is the DAG then expects to be able to use
CopyToReg/CopyFromReg for uses of the arguments outside the block.
Exposing the argument loads in the IR also has the advantage
that the LoadStoreVectorizer can merge them.
I'm not sure the best approach to dealing with the IR
argument list is. The patch as-is just leaves the IR arguments
in place, so all the existing code will still compute the same
kernarg size and pointlessly lowers the arguments.
Arguably the frontend should emit kernels with an empty argument
list in the first place. Alternatively a dummy array could be
inserted as a single argument just to reserve space.
This does have some disadvantages. Local pointer kernel arguments can
no longer have AssertZext placed on them as the equivalent !range
metadata is not valid on pointer typed loads. This is mostly bad
for SI which needs to know about the known bits in order to use the
DS instruction offset, so in this case this is not done.
More importantly, this skips noalias arguments since this pass
does not yet convert this to the equivalent !alias.scope and !noalias
metadata. Producing this metadata correctly seems to be tricky,
although this logically is the same as inlining into a function which
doesn't exist. Additionally, exposing these loads to the vectorizer
may result in degraded aliasing information if a pointer load is
merged with another argument load.
I'm also not entirely sure this is preserving the current clover
ABI, although I would greatly prefer if it would stop widening
arguments and match the HSA ABI. As-is I think it is extending
< 4-byte arguments to 4-bytes but doesn't align them to 4-bytes.
llvm-svn: 335650
Memory clauses are formed into bundles in presence of xnack.
Their source operands are marked as early-clobber.
This allows to allocate distinct source and destination registers
within a clause and prevent breaking the clause with s_nop in the
hazard recognizer.
Clauses are undone before post-RA scheduler to allow some rescheduling,
which will not break the clause since artificial edges are created in
the dag to keep memory operations together. Yet this allows a better
ILP in some cases.
Differential Revision: https://reviews.llvm.org/D47511
llvm-svn: 333691
With the removal of the old waitcnt pass, the '-enable-si-insert-waitcnts' option is obsolete. Remove it.
Differential Revision: https://reviews.llvm.org/D47378
llvm-svn: 333303
Eliminate loads from the dispatch packet when they will have
a known value.
Also pattern match the code used by the library to handle partial
workgroup dispatches, which isn't necessary if reqd_work_group_size
is used.
llvm-svn: 332771
This pass is
a) broken.
b) r600 specific.
Fixing (a) is a bit more non-trivial, but fixing (b)
is easy. Move this pass to being R600 only for now.
This pass does pass all the unit tests, however clang
no longer generates code that looks like the unit test
input, so fixing the pass requires fixing the tests and
the pass as one, and checking it works with clang still.
Patch by Dave Airlie
llvm-svn: 332196
Remove the old waitcnt pass ( si-insert-waits ), which is no longer maintained
and getting crufty
Differential Revision: https://reviews.llvm.org/D46448
llvm-svn: 331641
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
Summary:
This fixes AMDGPU GlobalISel test failures when enabling the AMDGPU
target without any other targets that use GlobalISel.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D45353
llvm-svn: 329588
Note: This is a candidate for LLVM 6.0, because it was planned to be
in that release but was delayed due to a long review period.
Merge conflict in release_60 - resolution:
Add "-p6:32:32" into the second (non-amdgiz) string.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D41651
llvm-svn: 324487
1. Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
2. The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the prev instr. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whomever, e.g., the memory legalizer, inserted it knows what they were doing.
3. Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.
Differential Revision: https://reviews.llvm.org/D42854
llvm-svn: 324440
This avoids playing games with pseudo pass IDs and avoids using an
unreliable MRI::isSSA() check to determine whether register allocation
has happened.
Note that this renames:
- MachineLICMID -> EarlyMachineLICM
- PostRAMachineLICMID -> MachineLICMID
to be consistent with the EarlyTailDuplicate/TailDuplicate naming.
llvm-svn: 322927
Re-land r321234. It had to be reverted because it broke the shared
library build. The shared library build broke because there was a
missing LLVMBuild dependency from lib/Passes (which calls
TargetMachine::getTargetIRAnalysis) to lib/Target. As far as I can
tell, this problem was always there but was somehow masked
before (perhaps because TargetMachine::getTargetIRAnalysis was a
virtual function).
Original commit message:
This makes the TargetMachine interface a bit simpler. We still need
the std::function in TargetIRAnalysis to avoid having to add a
dependency from Analysis to Target.
See discussion:
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119749.html
I avoided adding all of the backend owners to this review since the
change is simple, but let me know if you feel differently about this.
Reviewers: echristo, MatzeB, hfinkel
Reviewed By: hfinkel
Subscribers: jholewinski, jfb, arsenm, dschuff, mcrosier, sdardis, nemanjai, nhaehnle, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, llvm-commits
Differential Revision: https://reviews.llvm.org/D41464
llvm-svn: 321375
Summary:
This makes the TargetMachine interface a bit simpler. We still need
the std::function in TargetIRAnalysis to avoid having to add a
dependency from Analysis to Target.
See discussion:
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119749.html
I avoided adding all of the backend owners to this review since the
change is simple, but let me know if you feel differently about this.
Reviewers: echristo, MatzeB, hfinkel
Reviewed By: hfinkel
Subscribers: jholewinski, jfb, arsenm, dschuff, mcrosier, sdardis, nemanjai, nhaehnle, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, llvm-commits
Differential Revision: https://reviews.llvm.org/D41464
llvm-svn: 321234
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
Reverting to investigate layering effects of MCJIT not linking
libCodeGen but using TargetMachine::getNameWithPrefix() breaking the
lldb bots.
This reverts commit r315633.
llvm-svn: 315637
Merge LLVMTargetMachine into TargetMachine.
- There is no in-tree target anymore that just implements TargetMachine
but not LLVMTargetMachine.
- It should still be possible to stub out all the various functions in
case a target does not want to use lib/CodeGen
- This simplifies the code and avoids methods ending up in the wrong
interface.
Differential Revision: https://reviews.llvm.org/D38489
llvm-svn: 315633
This patch adds a post-linking pass which replaces the function pointer of enqueued
block kernel with a global variable (runtime handle) and adds
runtime-handle attribute to the enqueued block kernel.
In LLVM CodeGen the runtime-handle metadata will be translated to
RuntimeHandle metadata in code object. Runtime allocates a global buffer
for each kernel with RuntimeHandel metadata and saves the kernel address
required for the AQL packet into the buffer. __enqueue_kernel function
in device library knows that the invoke function pointer in the block
literal is actually runtime handle and loads the kernel address from it
and puts it into AQL packet for dispatching.
This cannot be done in FE since FE cannot create a unique global variable
with external linkage across LLVM modules. The global variable with internal
linkage does not work since optimization passes will try to replace loads
of the global variable with its initialization value.
Differential Revision: https://reviews.llvm.org/D38610
llvm-svn: 315352
We have a single library build without relaxation options.
When inlined library functions remove fast math attributes
from the functions they are integrated into.
This patch sets relaxation attributes on the functions after
linking provided corresponding relaxation options are given.
Math instructions inside the inlined functions remain to have
no fast flags, but inlining does not prevent fast math
transformations of a surrounding caller code anymore.
Differential Revision: https://reviews.llvm.org/D38325
llvm-svn: 314568
The pass does simplifications of well known AMD library calls.
If given -amdgpu-prelink option it works in a pre-link mode which
allows to reference new library functions which will be linked in
later.
In addition it also used to process traditional AMD option
-fuse-native which allows to replace some of the functions with
their fast native implementations from the library.
The necessary glue to pass the prelink option and translate
-fuse-native is to be added to the driver.
Differential Revision: https://reviews.llvm.org/D36436
llvm-svn: 310731
Summary: This refactoring is required in order to split the R600 and GCN tablegen files.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D36286
llvm-svn: 310336
This hasn't done anything in a long time. This was
running after the the control flow pseudos were expanded,
so this would never find them. The control flow pseudo
expansion was moved to solve the problem this pass was
supposed to solve in the first place, except handling
it earlier also fixes it for fast regalloc which doesn't
use LiveIntervals.
Noticed by checking LCOV reports.
llvm-svn: 310274
Try to avoid mutually exclusive features. Don't use
a real default GPU, and use a fake "generic". The goal
is to make it easier to see which set of features are
incompatible between feature strings.
Most of the test changes are due to random scheduling changes
from not having a default fullspeed model.
llvm-svn: 310258
Summary:
Whole Wavefront Wode (WWM) is similar to WQM, except that all of the
lanes are always enabled, regardless of control flow. This is required
for implementing wavefront reductions in non-uniform control flow, where
we need to use the inactive lanes to propagate intermediate results, so
they need to be enabled. We need to propagate WWM to uses (unless
they're explicitly marked as exact) so that they also propagate
intermediate results correctly. We do the analysis and exec mask munging
during the WQM pass, since there are interactions with WQM for things
that require both WQM and WWM. For simplicity, WWM is entirely
block-local -- blocks are never WWM on entry or exit of a block, and WWM
is not propagated to the block level. This means that computations
involving WWM cannot involve control flow, but we only ever plan to use
WWM for a few limited purposes (none of which involve control flow)
anyways.
Shaders can ask for WWM using the @llvm.amdgcn.wwm intrinsic. There
isn't yet a way to turn WWM off -- that will be added in a future
change.
Finally, it turns out that turning on inactive lanes causes a number of
problems with register allocation. While the best long-term solution
seems like teaching LLVM's register allocator about predication, for now
we need to add some hacks to prevent ourselves from getting into trouble
due to constraints that aren't currently expressed in LLVM. For the gory
details, see the comments at the top of SIFixWWMLiveness.cpp.
Reviewers: arsenm, nhaehnle, tpr
Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D35524
llvm-svn: 310087
With this change, the GlobalISel library gets always built. In
particular, this is not possible to opt GlobalISel out of the build
using the LLVM_BUILD_GLOBAL_ISEL variable any more.
llvm-svn: 309990
IMHO it is an antipattern to have a enum value that is Default.
At any given piece of code it is not clear if we have to handle
Default or if has already been mapped to a concrete value. In this
case in particular, only the target can do the mapping and it is nice
to make sure it is always done.
This deletes the two default enum values of CodeModel and uses an
explicit Optional<CodeModel> when it is possible that it is
unspecified.
llvm-svn: 309911
Add a pass to remove redundant S_OR_B64 instructions enabling lanes in
the exec. If two SI_END_CF (lowered as S_OR_B64) come together without any
vector instructions between them we can only keep outer SI_END_CF, given
that CFG is structured and exec bits of the outer end statement are always
not less than exec bit of the inner one.
This needs to be done before the RA to eliminate saved exec bits registers
but after register coalescer to have no vector registers copies in between
of different end cf statements.
Differential Revision: https://reviews.llvm.org/D35967
llvm-svn: 309762
Includes a hack to fix the type selected for
the GlobalAddress of the function, which will be
fixed by changing the default datalayout to use
generic pointers for 0.
llvm-svn: 309732
It is better to return arguments directly in registers
if we are making a call rather than introducing expensive
stack usage. In one of sample compile from one of
Blender's many kernel variants, this fires on about
~20 different functions. Future improvements may be to
recognize simple cases where the pointer is indexing a small
array. This also fails when the store to the out argument
is in a separate block from the return, which happens in
a few of the Blender functions. This should also probably
be using MemorySSA which might help with that.
I'm not sure this is correct as a FunctionPass, but
MemoryDependenceAnalysis seems to not work with
a ModulePass.
I'm also not sure where it should run.I think it should
run before DeadArgumentElimination, so maybe either
EP_CGSCCOptimizerLate or EP_ScalarOptimizerLate.
llvm-svn: 309416
It adds it for the target after inlining but before SROA where
we can get most out of it.
Differential Revision: https://reviews.llvm.org/D34366
llvm-svn: 305759
I did this a long time ago with a janky python script, but now
clang-format has built-in support for this. I fed clang-format every
line with a #include and let it re-sort things according to the precise
LLVM rules for include ordering baked into clang-format these days.
I've reverted a number of files where the results of sorting includes
isn't healthy. Either places where we have legacy code relying on
particular include ordering (where possible, I'll fix these separately)
or where we have particular formatting around #include lines that
I didn't want to disturb in this patch.
This patch is *entirely* mechanical. If you get merge conflicts or
anything, just ignore the changes in this patch and run clang-format
over your #include lines in the files.
Sorry for any noise here, but it is important to keep these things
stable. I was seeing an increasing number of patches with irrelevant
re-ordering of #include lines because clang-format was used. This patch
at least isolates that churn, makes it easy to skip when resolving
conflicts, and gets us to a clean baseline (again).
llvm-svn: 304787
Remove dependency of SDWA pass on SIShrinkInstructions.
The goal is to move SDWA even higher in the stack to avoid second run
of MachineLICM, MachineCSE and SIFoldOperands.
Also added handling to preserve original src modifiers.
Differential Revision: https://reviews.llvm.org/D33860
llvm-svn: 304665
-enable-si-insert-waitcnts=1 becomes the default
-enable-si-insert-waitcnts=0 to use old pass
Differential Revision: https://reviews.llvm.org/D33730
llvm-svn: 304551
TargetPassConfig is not useful for targets that do not use the CodeGen
library, so we may just as well store a pointer to an
LLVMTargetMachine instead of just to a TargetMachine.
While at it, also change the constructor to take a reference instead of a
pointer as the TM must not be nullptr.
llvm-svn: 304247
An encoding does not allow to use SDWA in an instruction with
scalar operands, either literals or SGPRs. That is however possible
to copy these operands into a VGPR first.
Several copies of the value are produced if multiple SDWA conversions
were done. To cleanup MachineLICM (to hoist copies out of loops),
MachineCSE (to remove duplicate copies) and SIFoldOperands (to replace
SGPR to VGPR copy with immediate copy right to the VGPR) runs are added
after the SDWA pass.
Differential Revision: https://reviews.llvm.org/D33583
llvm-svn: 304219
This provides a new way to access the TargetMachine through
TargetPassConfig, as a dependency.
The patterns replaced here are:
* Passes handling a null TargetMachine call
`getAnalysisIfAvailable<TargetPassConfig>`.
* Passes not handling a null TargetMachine
`addRequired<TargetPassConfig>` and call
`getAnalysis<TargetPassConfig>`.
* MachineFunctionPasses now use MF.getTarget().
* Remove all the TargetMachine constructors.
* Remove INITIALIZE_TM_PASS.
This fixes a crash when running `llc -start-before prologepilog`.
PEI needs StackProtector, which gets constructed without a TargetMachine
by the pass manager. The StackProtector pass doesn't handle the case
where there is no TargetMachine, so it segfaults.
Related to PR30324.
Differential Revision: https://reviews.llvm.org/D33222
llvm-svn: 303360
If workgroup size is known inform llvm about range returned by local
id and local size queries.
Differential Revision: https://reviews.llvm.org/D31804
llvm-svn: 300102
Summary:
Difference beetween PreRegAlloc() and MachineSSAOptimization() are that the former is run despite of -O0 optimization level. In my undestanding SiShrinkInstructions and SDWAPeephole shouldn't run when optimizations are disabled.
With this change order of passes will not change.
Reviewers: arsenm, vpykhtin, rampitec
Subscribers: qcolombet, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye
Differential Revision: https://reviews.llvm.org/D31705
llvm-svn: 299757
Our final address space mapping is to let constant address space to be 4 to match nvptx.
However for now we will make it 2 to avoid unnecessary work in FE/BE/devlib
about intrinsics returning constant pointers.
Differential Revision: https://reviews.llvm.org/D31770
llvm-svn: 299690
If set to false it does not remove global aliases. With this parameter
set to false it should be safe to run the pass before link.
Differential Revision: https://reviews.llvm.org/D31489
llvm-svn: 299108
Previously it was covered by the internalization. It turns out we cannot
run internalizer in FE, it break separate compilation tests. Thus early
inliner gets its own option.
Differential Revision: https://reviews.llvm.org/D31429
llvm-svn: 298935
As we introduced target triple environment amdgiz and amdgizcl, the address
space values are no longer enums. We have to decide the value by target triple.
The basic idea is to use struct AMDGPUAS to represent address space values.
For address space values which are not depend on target triple, use static
const members, so that they don't occupy extra memory space and is equivalent
to a compile time constant.
Since the struct is lightweight and cheap, it can be created on the fly at
the point of usage. Or it can be added as member to a pass and created at
the beginning of the run* function.
Differential Revision: https://reviews.llvm.org/D31284
llvm-svn: 298846
Switch data layout by target triple environment amdgiz and amdgizcl indicating using of an address space mapping in which generic address space is 0.
amdgiz is for non-OpenCL environment where generic address space is 0.
amdgizcl is for OpenCL environment where generic address space is 0.
Differential Revision: https://reviews.llvm.org/D31211
llvm-svn: 298758
StructurizeCFG can't handle cases with multiple
returns creating regions with multiple exits.
Create a copy of UnifyFunctionExitNodes that only
unifies exit nodes that skips exit nodes
with uniform branch sources.
llvm-svn: 298729
Summary:
First iteration of SDWA peephole.
This pass tries to combine several instruction into one SDWA instruction. E.g. it converts:
'''
V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
V_ADD_I32_e32 %vreg2, %vreg0, %vreg3
V_LSHLREV_B32_e32 %vreg4, 16, %vreg2
'''
Into:
'''
V_ADD_I32_sdwa %vreg4, %vreg1, %vreg3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
'''
Pass structure:
1. Iterate over machine instruction in basic block and try to apply "SDWA patterns" to each of them. SDWA patterns match machine instruction into either source or destination SDWA operand. E.g. ''' V_LSHRREV_B32_e32 %vreg0, 16, %vreg1''' is matched to source SDWA operand '''%vreg1 src_sel:WORD_1'''.
2. Iterate over found SDWA operands and find instruction that could be potentially coverted into SDWA. E.g. for source SDWA operand potential instruction are all instruction in this basic block that uses '''%vreg0'''
3. Iterate over all potential instructions and check if they can be converted into SDWA.
4. Convert instructions to SDWA.
This review contains basic implementation of SDWA peephole pass. This pass requires additional testing fot both correctness and performance (no performance testing done).
There are several ways this pass can be improved:
1. Make this pass work on whole function not only basic block. As I can see this can be done right now without changes to pass.
2. Introduce more SDWA patterns
3. Introduce mnemonics to limit when SDWA patterns should apply
Reviewers: vpykhtin, alex-t, arsenm, rampitec
Subscribers: wdng, nhaehnle, mgorny
Differential Revision: https://reviews.llvm.org/D30038
llvm-svn: 298365
This is direct port of HSAILAliasAnalysis pass, just cleaned for
style and renamed.
Differential Revision: https://reviews.llvm.org/D31103
llvm-svn: 298172
Loop unswitching can be extremely harmful for a SIMT target. In case
if hoisted condition is not uniform a SIMT machine will execute both
clones of a loop sequentially. Therefor LoopUnswitch checks if the
condition is non-divergent.
Since DivergenceAnalysis adds an expensive PostDominatorTree analysis
not needed for non-SIMT targets a new option is added to avoid unneded
analysis initialization. The method getAnalysisUsage is called when
TargetTransformInfo is not yet available and we cannot use it here.
For that reason a new field DivergentTarget is added to PassManagerBuilder
to control the behavior and set this field from a target.
Differential Revision: https://reviews.llvm.org/D30796
llvm-svn: 298104
We can mark functions to always inline early in the opt. Since we do not have
call support this early inlining creates opportunities for inter-procedural
optimizations which would not occur otherwise.
Differential Revision: https://reviews.llvm.org/D31016
llvm-svn: 297958
This patch reverts region's scheduling to the original untouched state
in case if we have have decreased occupancy.
In addition it switches to use TargetRegisterInfo occupancy callback
for pressure limits instead of gradually increasing limits which were
just passed by. We are going to stay with the best schedule so we do
not need to tolerate worsened scheduling anymore.
Differential Revision: https://reviews.llvm.org/D29971
llvm-svn: 295206
Since we have no call support and late linking we can produce code
only for used symbols. This saves compilation time, size of the final
executable, and size of any intermediate dumps.
Run Internalize pass early in the opt pipeline followed by global
DCE pass. To enable it RT can pass -amdgpu-internalize-symbols option.
Differential Revision: https://reviews.llvm.org/D29214
llvm-svn: 293549
With the adjustPassManager interface that is now possible to use
custom early module passes.
Differential Revision: https://reviews.llvm.org/D29189
llvm-svn: 293300
This change introduces adjustPassManager target callback giving a
target an opportunity to tweak PassManagerBuilder before pass
managers are populated.
This generalizes and replaces addEarlyAsPossiblePasses target
callback. In particular that can be used to add custom passes to
extension points other than EP_EarlyAsPossible.
Differential Revision: https://reviews.llvm.org/D28336
llvm-svn: 293189
Leave early ifcvt disabled for now since there are some
shader-db regressions.
This causes some immediate improvements, but could be better.
The cost checking that the pass does is based on critical path
length for out of order CPUs which we do not want so it skips out
on many cases we want.
llvm-svn: 293016
Regalloc creates COPY instructions which do not formally use VALU.
That results in v_mov instructions displaced after exec mask modification.
One pass which do it is SIOptimizeExecMasking, but potentially it can be
done by other passes too.
This patch adds a pass immediately after regalloc to add implicit exec
use operand to all VGPR copy instructions.
Differential Revision: https://reviews.llvm.org/D28874
llvm-svn: 292956
Multiple metadata values for records such as opencl.ocl.version, llvm.ident
and similar are created after linking several modules. For some of them, notably
opencl.ocl.version, this creates semantic problem because we cannot tell which
version of OpenCL the composite module conforms.
Moreover, such repetitions of identical values often create a huge list of
unneeded metadata, which grows bitcode size both in memory and stored on disk.
It can go up to several Mb when linked against our OpenCL library. Lastly, such
long lists obscure reading of dumped IR.
The pass unifies metadata after linking.
Differential Revision: https://reviews.llvm.org/D25381
llvm-svn: 289092
Summary:
LC can currently select scalar load for uniform memory access
basing on readonly memory address space only. This restriction
originated from the fact that in HW prior to VI vector and scalar caches
are not coherent. With MemoryDependenceAnalysis we can check that the
memory location corresponding to the memory operand of the LOAD is not
clobbered along the all paths from the function entry.
Reviewers: rampitec, tstellarAMD, arsenm
Subscribers: wdng, arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D26917
llvm-svn: 289076
The structured CFG is just an aid to inserting exec
mask modification instructions, once that is done
we don't really need it anymore. We also
do not analyze blocks with terminators that
modify exec, so this should only be impacting
true branches.
llvm-svn: 288744
This makes the createGenericSchedLive() function that constructs the
default scheduler available for the public API. This should help when
you want to get a scheduler and the default list of DAG mutations.
This also shrinks the list of default DAG mutations:
{Load|Store}ClusterDAGMutation and MacroFusionDAGMutation are no longer
added by default. Targets can easily add them if they need them. It also
makes it easier for targets to add alternative/custom macrofusion or
clustering mutations while staying with the default
createGenericSchedLive(). It also saves the callback back and forth in
TargetInstrInfo::enableClusterLoads()/enableClusterStores().
Differential Revision: https://reviews.llvm.org/D26986
llvm-svn: 288057
Fixes to allow spilling all registers at the end of the block
work with exec modifications. Don't emit s_and_saveexec_b64 for
if lowering, and instead emit copies. Mark control flow mask
instructions as terminators to get correct spill code placement
with fast regalloc, and then have a separate optimization pass
form the saveexec.
This should work if SGPRs are spilled to VGPRs, but
will likely fail in the case that an SGPR spills to memory
and no workitem takes a divergent branch.
llvm-svn: 282667
Summary:
The SILoadStoreOptimizer can now look ahead more then one instruction when
looking for instructions to merge, which greatly improves the number of
loads/stores that we are able to merge.
Moving the pass before scheduling avoids increasing register pressure after
the scheduler, so that the scheduler's register pressure estimates will be
more accurate. It also gives more consistent results, since it is no longer
affected by minor scheduling changes.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D23814
llvm-svn: 279991
Do most of the lowering in a pre-RA pass. Keep the skip jump
insertion late, plus a few other things that require more
work to move out.
One concern I have is now there may be COPY instructions
which do not have the necessary implicit exec uses
if they will be lowered to v_mov_b32.
This has a positive effect on SGPR usage in shader-db.
llvm-svn: 279464
minimal and boring form than the old pass manager's version.
This pass does the very minimal amount of work necessary to inline
functions declared as always-inline. It doesn't support a wide array of
things that the legacy pass manager did support, but is alse ... about
20 lines of code. So it has that going for it. Notably things this
doesn't support:
- Array alloca merging
- To support the above, bottom-up inlining with careful history
tracking and call graph updates
- DCE of the functions that become dead after this inlining.
- Inlining through call instructions with the always_inline attribute.
Instead, it focuses on inlining functions with that attribute.
The first I've omitted because I'm hoping to just turn it off for the
primary pass manager. If that doesn't pan out, I can add it here but it
will be reasonably expensive to do so.
The second should really be handled by running global-dce after the
inliner. I don't want to re-implement the non-trivial logic necessary to
do comdat-correct DCE of functions. This means the -O0 pipeline will
have to be at least 'always-inline,global-dce', but that seems
reasonable to me. If others are seriously worried about this I'd like to
hear about it and understand why. Again, this is all solveable by
factoring that logic into a utility and calling it here, but I'd like to
wait to do that until there is a clear reason why the existing
pass-based factoring won't work.
The final point is a serious one. I can fairly easily add support for
this, but it seems both costly and a confusing construct for the use
case of the always inliner running at -O0. This attribute can of course
still impact the normal inliner easily (although I find that
a questionable re-use of the same attribute). I've started a discussion
to sort out what semantics we want here and based on that can figure out
if it makes sense ta have this complexity at O0 or not.
One other advantage of this design is that it should be quite a bit
faster due to checking for whether the function is a viable candidate
for inlining exactly once per function instead of doing it for each call
site.
Anyways, hopefully a reasonable starting point for this pass.
Differential Revision: https://reviews.llvm.org/D23299
llvm-svn: 278896