* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through a
simplified store-merging search and chain alias analysis which only
check for parallel stores through the chain subgraph. This is cleaner,
as it separates the handling of non-interfering loads/stores from the
store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through loads and
TokenFactors to all stores visited.
This improves the quality of the output SelectionDAG and the output
CodeGen (save perhaps for some ARM cases where we correctly construct
wider loads, but then promote them to float operations, which
requires more expensive constant generation).
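For illustration, a minimal source-level sketch (not taken from the patch; the struct and function are hypothetical) of the kind of consecutive-store pattern the merging search targets:
'''
// Four adjacent one-byte stores through a single base pointer.  With the
// simplified candidate search these are recognized as parallel stores on
// the chain subgraph and can be merged into one 32-bit store.
struct Pixel { unsigned char r, g, b, a; };

void clearPixel(Pixel *P) {
  P->r = 0;
  P->g = 0;
  P->b = 0;
  P->a = 0;
}
'''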
Some minor peephole optimizations were added to deal with the improved SubDAG shapes (listed below).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyFromReg
nodes as these are captured by data dependence
6. Forward load-store values through TokenFactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 293184
Leave early ifcvt disabled for now since there are some
shader-db regressions.
This causes some immediate improvements, but could be better.
The cost checking that the pass does is based on critical-path
length for out-of-order CPUs, which is not what we want, so it
skips many cases we do want.
llvm-svn: 293016
Summary:
This lets you select which sort of spilling you want, either s[0:1] or 64-bit loads from s[0:1].
Patch By: Dave Airlie
Reviewers: nhaehnle, arsenm, tstellarAMD
Reviewed By: arsenm
Subscribers: mareko, llvm-commits, kzhuravl, wdng, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D25428
llvm-svn: 293000
The sequence like this:
  v_cmpx_le_f32_e32 vcc, 0, v0
  s_branch BB0_30
  s_cbranch_execnz BB0_30
; BB#29:
  exp null off, off, off, off done vm
  s_endpgm
BB0_30:
; %endif110
is likely wrong. The s_branch instruction will unconditionally jump
to BB0_30 and the skip block (exp done + endpgm) inserted for
performing the kill instruction will never be executed. This results
in a GPU hang with Star Ruler 2.
The s_branch instruction is added during the "Control Flow Optimizer"
pass which seems to re-organize the basic blocks, and we assume
that SI_KILL_TERMINATOR is always the last instruction inside a
basic block. Thus, after inserting a skip block we just go to the
next BB without looking at the subsequent instructions after the
kill, and the s_branch op is never removed.
Instead, we should remove the unconditional out branches and let
the conditional branch skip the two instructions when the exec mask
is non-zero.
This patch fixes the GPU hang and doesn't introduce any regressions
with "make check".
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99019
Patch by Samuel Pitoiset <samuel.pitoiset@gmail.com>
llvm-svn: 292985
This switches to the workaround that HSA defaults to
for the mesa path.
This should be applied to the 4.0 branch.
Patch by Vedran Miletić <vedran@miletic.net>
llvm-svn: 292982
Regalloc creates COPY instructions which do not formally use the VALU.
That results in v_mov instructions being displaced after exec mask modification.
One pass which does this is SIOptimizeExecMasking, but potentially it can be
done by other passes too.
This patch adds a pass immediately after regalloc to add implicit exec
use operand to all VGPR copy instructions.
Differential Revision: https://reviews.llvm.org/D28874
llvm-svn: 292956
The inline spiller can decide to move a spill as early as possible in the basic block.
It will skip phis and labels, but we also need to make sure it skips instructions
in the basic block prologue which restore the exec mask.
Added an isPositionLike callback in TargetInstrInfo to detect instructions which
shall be skipped in addition to the common phis, labels, etc.
Differential Revision: https://reviews.llvm.org/D27997
llvm-svn: 292554
This instruction is missing from LiveIntervals.
I'm not aware of any problems because of this though.
Differential Revision: https://reviews.llvm.org/D28879
llvm-svn: 292521
For -(x + y) -> (-x) + (-y), if x == -y, this would
change the result from -0.0 to 0.0. Since the fma/fmad
combine is an extension of this problem it also
applies there.
fmul should be fine, and I don't think any of the unary
operators or conversions should be a problem either.
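A standalone numeric illustration of the sign-of-zero hazard (plain C++, not the combine itself):
'''
// With x == -y the un-negated sum is +0.0, so distributing the negation
// flips the sign of the zero result.
float x = 1.0f, y = -1.0f;
float a = -(x + y);     // -0.0f
float b = (-x) + (-y);  // +0.0f: differs from 'a' only in the sign of zero
'''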
llvm-svn: 292473
Limit the register coalescer by not allowing it to artificially increase
the size of registers beyond a dword. Such super-registers are in fact
register sequences and not distinct HW registers.
With more super-regs we would need to allocate adjacent registers
and constrain regalloc more than needed. Moreover, our super-registers
are overlapping. For instance we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4 etc., which complicates register
allocation even more, resulting in excessive spilling.
Differential Revision: https://reviews.llvm.org/D28782
llvm-svn: 292413
Rename from addOperand to just add, to match the other method that has been
added to MachineInstrBuilder for adding more than just 1 operand.
See https://reviews.llvm.org/D28057 for the whole discussion.
Differential Revision: https://reviews.llvm.org/D28556
llvm-svn: 291891
This produces worse code when i16 is legal, mostly
due to combines getting confused by conversions inserted
for uniform 16-bit operations.
llvm-svn: 291717
Updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.
Added a special optimization case which replaces pmulld with a pmullw/pmulhw/pshuf
sequence when the real operand bitwidth is <= 16.
Differential Revision: https://reviews.llvm.org/D28104
llvm-svn: 291657
For i16 zeroext arguments when i16 was a legal type, the
known bits information from the truncate was lost. Insert
a zeroext so the known bits optimizations work with the 32-bit
loads.
Fixes code quality regressions vs. SI in min.ll test.
llvm-svn: 291461
Among other things, this allows using the predefined .option.machine_version_major
/minor/stepping symbols in the directive.
The relevant test was expanded as well (the file was also renamed for clarity).
Differential Revision: https://reviews.llvm.org/D28140
llvm-svn: 290710
The feature allows for conditional assembly, filling the entries
of .amd_kernel_code_t etc.
Symbols are defined with value 0 at the beginning of each kernel scope.
After each register usage, the respective symbol is set to:
value = max( value, ( register index + 1 ) )
Thus, at the end of scope the value represents a count of used registers.
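A hedged C++ model of that update rule (illustrative only; the real implementation lives in the assembler and tracks per-kernel symbols):
'''
#include <algorithm>

unsigned MaxUsedRegs = 0;  // the symbol starts at 0 in each kernel scope

void noteRegisterUse(unsigned RegIndex) {
  // value = max( value, ( register index + 1 ) )
  MaxUsedRegs = std::max(MaxUsedRegs, RegIndex + 1);
}
// After uses of, say, v0, v3 and v7, MaxUsedRegs == 8: the count of used VGPRs.
'''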
Kernel scopes begin at the .amdgpu_hsa_kernel directive and end at the
next .amdgpu_hsa_kernel (or EOF, whichever comes first). There is also a
dummy scope that extends from the beginning of the source file to the
first .amdgpu_hsa_kernel.
Test added.
Differential Revision: https://reviews.llvm.org/D27859
llvm-svn: 290608
Canonicalize a select with a constant to the false side. This
enables more instruction shrinking opportunities since an
inline immediate can be used for the false side of v_cndmask_b32_e32.
This seems to usually be better but causes some code size regressions
in some tests.
llvm-svn: 290372
Summary: This is needed for later SDWA support in CodeGen.
Reviewers: vpykhtin, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27412
llvm-svn: 290338
Summary: Real instructions should copy constraints from their corresponding pseudo instructions. This allows the auto-generated disassembler to correctly process tied operands.
Reviewers: nhaustov, vpykhtin, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27847
llvm-svn: 290336
When the instruction is processed for the first time, it may be
deleted, resulting in crashes. While the new test adds the same
user to the worklist twice, this particular case doesn't crash,
but I'm not sure why.
llvm-svn: 290191
Summary:
Without a MachineMemOperand, the scheduler was assuming MIMG instructions
were ordered memory references, so no loads or stores could be reordered
across them.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27536
llvm-svn: 290179
Summary:
Added pair of directives .hsa_code_object_metadata/.end_hsa_code_object_metadata.
Between them the user can put a YAML string that will be put directly into the generated note. E.g.:
'''
.hsa_code_object_metadata
{
amd.MDVersion: [ 2, 0 ]
}
.end_hsa_code_object_metadata
'''
Based on D25046
Reviewers: vpykhtin, nhaustov, yaxunl, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, mgorny, tony-tye
Differential Revision: https://reviews.llvm.org/D27619
llvm-svn: 290097
Specifically avoid implicit conversions from/to integral types to
avoid potential errors when changing the underlying type. For example,
a typical initialization of a "full" mask was "LaneMask = ~0u", which
would result in a value of 0x00000000FFFFFFFF if the type was extended
to uint64_t.
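A minimal standalone sketch of the pitfall (plain C++, not the LaneBitmask class itself):
'''
#include <cstdint>

// With a 32-bit mask type, ~0u was a correct "all lanes" initializer.
// If the underlying type is widened, the same expression silently
// produces a half-set mask:
uint64_t Widened = ~0u;           // 0x00000000FFFFFFFF, not all-ones
uint64_t AllOnes = ~uint64_t(0);  // what an explicit "all lanes" constructor makes unambiguous
'''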
Differential Revision: https://reviews.llvm.org/D27454
llvm-svn: 289820
Since SGPRs should spill to VGPRs, they should be allocated first.
I don't think this is sufficient for SGPRs to always spill to
VGPRs though.
llvm-svn: 289671
Retrying after removing load-store factoring through
token factors in favor of improved token factor operand pruning.
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store-merging search which only checks for parallel stores
through the chain subgraph. This is cleaner, as it separates the handling of
non-interfering loads/stores from the store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through loads and
TokenFactors to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations.
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.ll -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do not do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289659
At least the plugin used by the LibreOffice build
(<https://wiki.documentfoundation.org/Development/Clang_plugins>) indirectly
uses those members (through inline functions in LLVM/Clang include files in turn
using them), but they are not exported by utils/extract_symbols.py on Windows,
and accessing data across DLL/EXE boundaries on Windows is generally
problematic.
Differential Revision: https://reviews.llvm.org/D26671
llvm-svn: 289647
Summary:
While the result is constant across a single primitive, each pixel
shader wave can have pixels from multiple primitives.
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D27572
llvm-svn: 289447
Since 32-bit instructions with 32-bit input immediate behavior
are used to materialize 16-bit constants in 32-bit registers
for 16-bit instructions, determining the legality based
on the size is incorrect. Change operands to have the size
specified in the type.
Also adds a workaround for a disassembler bug that
produces an immediate MCOperand for an operand that
is supposed to be OPERAND_REGISTER.
The assembler appears to accept out of bounds immediates and
truncates them, but this seems to be an issue for 32-bit
already.
llvm-svn: 289306
Summary:
There is no point in setting SGPRS=104, because VI allocates SGPRs
in multiples of 16, so 104 -> 112. That enables us to use all 102 SGPRs
for general purposes.
Reviewers: tstellarAMD
Subscribers: qcolombet, arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27149
llvm-svn: 289260
Retrying after fixing overly aggressive load-store forwarding optimization.
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store-merging search which only checks for parallel stores
through the chain subgraph. This is cleaner, as it separates the handling of
non-interfering loads/stores from the store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through loads and
TokenFactors to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations.
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.ll -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do not do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289221
These were selecting directly to the VOP2 form instead
of VOP3 like the i32 instructions. Fixes regressions in
future commits where an immediate isn't folded because it was
initially used for the second operand.
Because uniform 16-bit operations are promoted to i32, it's
difficult to get a simple testcase where this matters. Fold
failures in SIFoldOperands here tend to be hidden by commute
and fold in SIShrinkInstructions.
llvm-svn: 289189
Not having this legal led to combine failures, resulting
in dumb things like bitcasts of constants not being folded
away.
The only reason I'm leaving the v_mov_b32 hack that f32
already uses is to avoid madak formation test regressions.
PeepholeOptimizer has an ordering issue where the immediate
fold attempt is into the sgpr->vgpr copy instead of the actual
use. Running it twice avoids that problem.
llvm-svn: 289096
The correct commutable opcode was set to itself, so this
was simply swapping the operands to commute instead of also
changing the opcode to v_subrev_u16.
llvm-svn: 289093
Multiple metadata values for records such as opencl.ocl.version, llvm.ident
and similar are created after linking several modules. For some of them, notably
opencl.ocl.version, this creates a semantic problem because we cannot tell which
version of OpenCL the composite module conforms to.
Moreover, such repetitions of identical values often create a huge list of
unneeded metadata, which grows the bitcode size both in memory and on disk.
It can go up to several MB when linked against our OpenCL library. Lastly, such
long lists obscure reading of the dumped IR.
The pass unifies metadata after linking.
Differential Revision: https://reviews.llvm.org/D25381
llvm-svn: 289092
Summary:
LC can currently select a scalar load for uniform memory access
based on the readonly memory address space only. This restriction
originated from the fact that in HW prior to VI the vector and scalar caches
are not coherent. With MemoryDependenceAnalysis we can check that the
memory location corresponding to the memory operand of the LOAD is not
clobbered along all paths from the function entry.
Reviewers: rampitec, tstellarAMD, arsenm
Subscribers: wdng, arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D26917
llvm-svn: 289076
Summary:
Without the fix to isFrameOffsetLegal to consider the instruction's
immediate offset, the new test case hits the corresponding assertion in
resolveFrameIndex, because the LocalStackSlotAllocation pass re-uses a
different base register.
With only the fix to isFrameOffsetLegal, code quality reduces in a bunch of
places because frame base registers are added where they're not needed.
This is addressed by properly implementing needsFrameBaseReg, which also
helps to avoid unnecessary zero frame indices in a bunch of other places.
Fixes piglit glsl-1.50/execution/variable-indexing/gs-output-array-vec4-index-wr.shader_test
Reviewers: arsenm, tstellarAMD
Subscribers: qcolombet, kzhuravl, wdng, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D27344
llvm-svn: 289048
Summary:
If we write an immediate to a VGPR and then copy the VGPR to an
SGPR, we can replace the copy with a S_MOV_B32 sgpr, imm, rather than
moving the copy to the SALU.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D27272
llvm-svn: 288849
The structured CFG is just an aid to inserting exec
mask modification instructions; once that is done
we don't really need it anymore. We also
do not analyze blocks with terminators that
modify exec, so this should only be impacting
true branches.
llvm-svn: 288744
Structure the definitions a bit more like the other classes.
The main change here is to split EXP with the done bit set
to a separate opcode, so we can set mayLoad = 1 so that it won't
be reordered before the other exp stores, since this has the special
constraint that if the done bit is set then this should be the last
exp in the shader.
Previously all exp instructions were inferred to have unmodeled
side effects.
llvm-svn: 288695
lib/Target/AMDGPU/SIRegisterInfo.cpp: In member function 'void llvm::SIRegisterInfo::spillSGPR(llvm::MachineBasicBlock::iterator, int, llvm::RegScavenger*) const':
lib/Target/AMDGPU/SIRegisterInfo.cpp:572:30: warning: variable 'SubRC' set but not used [-Wunused-but-set-variable]
const TargetRegisterClass *SubRC = nullptr;
^
lib/Target/AMDGPU/SIRegisterInfo.cpp: In member function 'void llvm::SIRegisterInfo::restoreSGPR(llvm::MachineBasicBlock::iterator, int, llvm::RegScavenger*) const':
lib/Target/AMDGPU/SIRegisterInfo.cpp:723:30: warning: variable 'SubRC' set but not used [-Wunused-but-set-variable]
const TargetRegisterClass *SubRC = nullptr;
^
The variable was assigned to, but never used. The functions called did not
mutate state. Simplify the logic and remove the variable. Identified by gcc
5.4.0.
llvm-svn: 288601
Since the spill is for the whole wave, these
don't have the swizzling problems that vector stores do
and a single 4-byte allocation is enough to spill a 64 element
register. This should reduce the number of spill instructions and
put all the spills for a register in the same cacheline.
This should save allocated private size, but for now it doesn't.
The extra slots are allocated for each component, but never used
because the frame layout is essentially finalized before frame
indices are replaced. For always using the scalar store path,
this should probably be moved into processFunctionBeforeFrameFinalized.
llvm-svn: 288445
This is not in the list of valid inputs for the encoding.
When spilling, copies from exec can be folded directly
into the spill instruction which results in broken
stores.
This only fixes the operand constraints, more codegen
work is required to avoid emitting the invalid
spills.
This sort of breaks the dbg.value test. Because the
register class of the s_load_dwordx2 changes, there
is a copy to SReg_64, and the copy is the operand
of dbg_value. The copy is later dead, and removed
from the dbg_value.
llvm-svn: 288191
Use vaddr/vdst for the same purposes.
This also fixes a bug in SIInsertWaits for the
operand check. The stored value operand is currently called
data0 in the single offset case, not data.
llvm-svn: 288188
It isn't generally safe to fold the frame index
directly into the operand since it will possibly
not be an inline immediate after it is expanded.
This surprisingly seems to produce better code, since
the FI doesn't prevent folding other immediate operands.
llvm-svn: 288185
Change the logic for when to fold immediates to
consider the destination operand rather than the
source of the materializing mov instruction.
No change yet, but this will allow for correctly handling
i16/f16 operands. Since 32-bit moves are used to materialize
constants for these, the same bitvalue will not be in the
register.
llvm-svn: 288184
This makes the createGenericSchedLive() function that constructs the
default scheduler available for the public API. This should help when
you want to get a scheduler and the default list of DAG mutations.
This also shrinks the list of default DAG mutations:
{Load|Store}ClusterDAGMutation and MacroFusionDAGMutation are no longer
added by default. Targets can easily add them if they need them. It also
makes it easier for targets to add alternative/custom macrofusion or
clustering mutations while staying with the default
createGenericSchedLive(). It also saves the callback back and forth in
TargetInstrInfo::enableClusterLoads()/enableClusterStores().
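A hedged sketch of how a target can keep the previous behavior with the now-public hook (hypothetical target name; exact signatures may differ slightly between tree versions):
'''
#include "llvm/CodeGen/MachineScheduler.h"
using namespace llvm;

// Build the default generic live scheduler and opt back in to the
// clustering mutations that are no longer added by default.
static ScheduleDAGInstrs *createMyTargetMachineScheduler(MachineSchedContext *C) {
  ScheduleDAGMILive *DAG = createGenericSchedLive(C);
  DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
  DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
  return DAG;
}
'''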
Differential Revision: https://reviews.llvm.org/D26986
llvm-svn: 288057
Codegen prepare sinks comparisons close to a user if we have only one register
for conditions. For AMDGPU we have many SGPRs capable of holding vector conditions.
Changed the BE to report that we have many condition registers. That way the IR LICM pass
will hoist an invariant comparison out of a loop and codegen prepare will not
sink it.
With that done a condition is calculated in one block and used in another.
The current behavior is to store a workitem's condition in a VGPR using v_cndmask_b32
and then restore it with yet another v_cmp instruction from that v_cndmask's
result. To mitigate the issue, propagation of the source SGPR pair in place of the v_cmp
is implemented. An additional side effect of this is that we may consume fewer VGPRs
at the cost of more SGPRs in case holding of multiple conditions is needed, and
that is a clear win in most cases.
Differential Revision: https://reviews.llvm.org/D26114
llvm-svn: 288053
The scavenger was not passed if requiresFrameIndexScavenging was
enabled. I need to be able to test for the availability of an
unallocatable register here, so I can't create a virtual register for
it.
It might be better to just always use the scavenger and stop
creating virtual registers.
llvm-svn: 287843
m0 may need to be written for spill code, so
we don't want general code uses relying on the
value stored in it.
This introduces a few code quality regressions where copies
from m0 are not coalesced into copies of a copy of m0.
llvm-svn: 287841
The size and offset were wrong. The size of the object was
being used for the size of the access, when here it is really
being split into 4-byte accesses. The underlying object size
is set in the MachinePointerInfo, which also didn't have the
offset set.
llvm-svn: 287806
Summary:
* ARM is omitted from this patch because this check appears to expose bugs in this target.
* Mips is omitted from this patch because this check either detects bugs or deliberate
emission of instructions that don't satisfy their predicates. One deliberate
use is the SYNC instruction where the version with an operand is correctly
defined as requiring MIPS32 while the version without an operand is defined
as an alias of 'SYNC 0' and requires MIPS2.
* X86 is omitted from this patch because it doesn't use the tablegen-erated
MCCodeEmitter infrastructure.
Patches for ARM and Mips will follow.
Depends on D25617
Reviewers: tstellarAMD, jmolloy
Subscribers: wdng, jmolloy, aemerson, rengolin, arsenm, jyknight, nemanjai, nhaehnle, tstellarAMD, llvm-commits
Differential Revision: https://reviews.llvm.org/D25618
llvm-svn: 287439
Summary:
The 32-bit instructions don't zero the high 16-bits like the 16-bit
instructions do.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D26828
llvm-svn: 287342
Summary:
The addr64-based legalization is incorrect for MUBUF instructions with idxen
set as well as for BUFFER_LOAD/STORE_FORMAT_* instructions. This affects
e.g. shaders that access buffer textures.
Since we never actually need the addr64-legalization in shaders, this patch
takes the easy route and keys off the calling convention. If this ever
affects (non-OpenGL) compute, the type of legalization needs to be chosen
based on some TSFlag.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98664
Reviewers: arsenm, tstellarAMD
Subscribers: kzhuravl, wdng, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D26747
llvm-svn: 287339
Summary:
1. Don't try to copy values to and from the same register class.
2. Replace copies of registers holding immediate values with v_mov/s_mov
instructions.
The main purpose of this change is to make MachineSink do a better job of
determining when it is beneficial to split a critical edge, since the pass
assumes that copies will become move instructions.
This prevents a regression in uniform-cfg.ll if we enable critical edge
splitting for AMDGPU.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D23408
llvm-svn: 287131
wbinvl.* are vector instructions that do not use vector registers.
v2: check only M?BUF instructions
Differential Revision: https://reviews.llvm.org/D26633
llvm-svn: 287056
The wave barrier represents a discardable barrier. Its main purpose is to
carry the convergent attribute, thus preventing illegal CFG optimizations. All lanes
in a wave come to a convergence point simultaneously with SIMT, thus no special
instruction is needed in the ISA. The barrier is discarded during code generation.
Differential Revision: https://reviews.llvm.org/D26585
llvm-svn: 287007
Summary: This is needed to be able to use this flags in InstrMappings.
Reviewers: tstellarAMD, vpykhtin
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D26666
llvm-svn: 286960
Summary:
Extend image intrinsics to support data types of V1F32 and V2F32.
TODO: we should define a mapping table to change the opcode for a data type of V2F32 when just one channel is active,
even though such a case should be very rare.
Reviewers: tstellarAMD
Differential Revision: http://reviews.llvm.org/D26472
llvm-svn: 286860
This avoids the nasty problems caused by using
memory instructions that read the exec mask while
spilling / restoring registers used for control flow
masking, but only for VI when these were added.
This always uses the scalar stores when enabled currently,
but it may be better to still try to spill to a VGPR
and use this on the fallback memory path.
The cache also needs to be flushed before wave termination
if a scalar store is used.
llvm-svn: 286766
Summary:
This pass was assuming that when a PHI instruction defined a register
used by another PHI instruction that the defining instruction would
be legalized before the using instruction.
This assumption was causing the pass to not legalize some PHI nodes
within divergent flow-control.
This fixes a bug that was uncovered by r285762.
Reviewers: nhaehnle, arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D26303
llvm-svn: 286676
The generic infrastructure to compute the Newton series for reciprocal and
reciprocal square root was conceived to allow a target to compute the series
itself. However, the original code did not properly consider this condition
if returned by a target. This patch addresses the issues to allow a target
to compute the series on its own.
Differential revision: https://reviews.llvm.org/D22975
llvm-svn: 286523
Currently runtime metadata is emitted as an ELF section with name .AMDGPU.runtime_metadata.
However there is a standard way to convey vendor specific information about how to run an ELF binary, which is called vendor-specific note element (http://www.netbsd.org/docs/kernel/elf-notes.html).
This patch lets the AMDGPU backend emit runtime metadata as a note element in the .note section.
Differential Revision: https://reviews.llvm.org/D25781
llvm-svn: 286502
Codegen prepare sinks comparisons close to a user if we have only one register
for conditions. For AMDGPU we have many SGPRs capable of holding vector conditions.
Changed the BE to report that we have many condition registers. That way the IR LICM pass
will hoist an invariant comparison out of a loop and codegen prepare will not
sink it.
With that done a condition is calculated in one block and used in another.
The current behavior is to store a workitem's condition in a VGPR using v_cndmask
and then restore it with yet another v_cmp instruction from that v_cndmask's
result. To mitigate the issue, a forward propagation of the 64-bit v_cmp result
to a user is implemented. An additional side effect of this is that we may
consume fewer VGPRs at the cost of more SGPRs in case holding of multiple
conditions is needed, and that is a clear win in most cases.
llvm-svn: 286171
If the branch was on a read-undef of vcc, passes that used
analyzeBranch to invert the branch condition wouldn't preserve
the undef flag resulting in a verifier error.
Fixes verifier failures in a future commit.
Also fix verifier error when inserting copy for vccz
corruption bug.
llvm-svn: 286133
This file is unused as of r285939, but we need to keep it around
for bots that don't do full rebuilds. We should be able to delete this
again in a few days.
llvm-svn: 285948
The change explores the fact that LDS reads may be reordered even if they access
the same location.
Prior to the change, the algorithm immediately stopped as soon as any memory
access was encountered between loads that are expected to be merged
together, although a Read-After-Read conflict cannot affect execution
correctness.
Improves the performance of hcBLAS CGEMM manually loop-unrolled kernels by 44%.
Improvement is also expected for any massive sequence of reads from LDS.
Differential Revision: https://reviews.llvm.org/D25944
llvm-svn: 285919
Summary:
The post-RA scheduler occasionally uses additional implicit operands when
the vector implicit operand as a whole is killed, but some subregisters
are still live because they are directly referenced later. Unfortunately,
this seems incredibly subtle to reproduce.
Fixes piglit spec/glsl-110/execution/variable-indexing/vs-temp-array-mat2-index-wr.shader_test
and others.
Reviewers: arsenm, tstellarAMD
Subscribers: kzhuravl, wdng, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D25656
llvm-svn: 285835
This is the conservatively correct way because it's easy to
move or replace a scalar immediate. This was incorrect in the case
when the register class wasn't known from the static instruction
definition, but still needed to be an SGPR. The main example of this
is inline asm that has an SGPR constraint.
Also start verifying the register classes of inlineasm operands.
llvm-svn: 285762
Instructions with a 32-bit base encoding with an optional
32-bit literal encoded after them report their size as 4
for the disassembler. Consider these when computing the
MachineInstr size. This fixes problems caused by size estimate
inconsistency in BranchRelaxation.
llvm-svn: 285743
This will prevent following regression when enabling i16 support (D18049):
test/CodeGen/AMDGPU/ctlz.ll
test/CodeGen/AMDGPU/ctlz_zero_undef.ll
Differential Revision: https://reviews.llvm.org/D25802
llvm-svn: 285716
As it stands, the OperandMatchResultTy is only included in the generated
header if there is custom operand parsing. However, almost all backends
make use of MatchOperand_Success and friends from OperandMatchResultTy for
e.g. parseRegister. This is a pain when starting an AsmParser for a new
backend that doesn't yet have custom operand parsing. Move the enum to
MCTargetAsmParser.h.
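A hedged sketch of the kind of use this enables for a new backend with no custom operand parsing (the helper and its register syntax are hypothetical):
'''
#include "llvm/ADT/StringRef.h"
#include "llvm/MC/MCParser/MCTargetAsmParser.h"
using namespace llvm;

// A parseRegister-style helper can return the generic tri-state result
// even though no tablegen-erated custom operand parsing exists yet.
static OperandMatchResultTy parseMyRegister(StringRef Name) {
  if (Name.empty())
    return MatchOperand_NoMatch;
  if (!Name.startswith("r"))       // hypothetical register prefix
    return MatchOperand_ParseFail;
  return MatchOperand_Success;
}
'''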
This patch is a prerequisite for D23563
Differential Revision: https://reviews.llvm.org/D23496
llvm-svn: 285705
I wanted to implement this as a target independent expansion, however when
targets say they want to expand FP_TO_FP16 what they actually want is
the unsafe math expansion when possible and expansion to a libcall in all
other cases.
The only way to make this work as a target-independent expansion would be to add
logic to the target's TargetLowering construction to mark these nodes as Expand when
LegalizeDAG can use the unsafe expansion and mark them as LibCall when it
cannot. I think this would be possible, but I think it would be too fragile
and complex as it would require targets to keep their expansion logic up
to date with the code in LegalizeDAG.
Reviewers: bogner, ab, t.p.northover, arsenm
Subscribers: wdng, llvm-commits, nhaehnle
Differential Revision: https://reviews.llvm.org/D25999
llvm-svn: 285704
Fixes Bug 30808.
Note that passing subtarget information to predicates seems too complicated, so a gfx8-specific def smrd_offset_20 was introduced.
The old gfx6/7-specific def was renamed to smrd_offset_8 for clarity.
Lit tests updated.
Differential Revision: https://reviews.llvm.org/D26085
llvm-svn: 285590
Summary:
Flat instructions can return out of order, so we always need to wait
for all the outstanding flat operations.
Reviewers: tony-tye, arsenm
Subscribers: kzhuravl, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D25998
llvm-svn: 285479
Also add glc bit to the scalar loads since they exist on VI
and change the caching behavior.
This currently has an assembler bug where the glc bit is incorrectly
accepted on SI/CI which do not have it.
llvm-svn: 285463
While trying to add the glc bit to SMEM instructions on VI
with the new refactoring I ran into some kind of shadowing
problem for the glc operand when using the pseudoinstruction
as a multiclass parameter.
Everywhere that currently uses it defines the operand to have the same
name as its type, i.e. glc:$glc which works. For some reason now it
conflicts, and it ends up evaluating to the wrong thing. For the
real encoding classes,
let Inst{16} = !if(ps.has_glc, glc, ?); was not being evaluated
and still visible in the Inst initializer in the expanded td file.
In other cases I got a different error about an illegal operand
where this was using { 0 } initializer from the bits<1> glc initializer
instead of evaluating it as false in the if.
For consistency all of the operand types should probably
be capitalized to avoid conflicting with the variable names
unless somebody has a better idea of how to fix this.
llvm-svn: 285462
It's possible to have a use of the private resource descriptor or
scratch wave offset registers even though there are no allocated
stack objects. This would result in continuing to use the maximum
number reserved registers. This could go over the number of SGPRs
available on VI, or violate the SGPR limit requested by
the function attributes.
llvm-svn: 285435
Summary:
When finding a match for a merge and collecting the instructions that must
be moved, keep in mind that the instruction we merge might actually use one
of the defs that are being moved.
Fixes piglit spec/arb_enhanced_layouts/execution/component-layout/vs-tcs-load-output[-indirect].
The fact that the ds_read in the test case is not eliminated suggests that
there might be another problem related to alias analysis, but that's a
separate problem: this pass should still work correctly even when earlier
optimization passes missed something or were disabled.
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D25829
llvm-svn: 285273
Add missing ISA versions 7.0.2/8.0.4/8.1.0 to the backend.
Refactor processor definition to use ISA version features.
Fixed ISA version for stoney.
Based on Laurent Morichetti's patch.
Differential Revision: https://reviews.llvm.org/D25919
llvm-svn: 285210
These functions are about classifying a global which will actually be
emitted, so it does not make sense for them to take a GlobalValue which may
for example be an alias.
Change the Mach-O object writer and the Hexagon, Lanai and MIPS backends to
look through aliases before using TargetLoweringObjectFile interfaces. These
are functional changes but all appear to be bug fixes.
Differential Revision: https://reviews.llvm.org/D25917
llvm-svn: 285006
Summary:
The v_movreld machine instruction is used with three operands that are
in a sense tied to each other (the explicit VGPR_32 def and the implicit
VGPR_NN def and use). There is no way to express that using the currently
available operand bits, and indeed there are cases where the Two Address
instructions pass does the wrong thing.
This patch introduces a new set of pseudo instructions that are identical
in intended semantics as v_movreld, but they only have two tied operands.
Having to add a new set of pseudo instructions is admittedly annoying, but
it's a fairly straightforward and solid approach. The only alternative I
see is to try to teach the Two Address instructions pass about Three Address
instructions, and I'm afraid that's trickier and is going to end up more
fragile.
Note that v_movrels does not suffer from this problem, and so this patch
does not touch it.
This fixes several GL45-CTS.shaders.indexing.* tests.
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D25633
llvm-svn: 284980
This will prevent following regression when enabling i16 support (D18049):
test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
Differential Revision: https://reviews.llvm.org/D25805
llvm-svn: 284891
This is a retry of r284495 which was reverted at r284513 due to use-after-scope bugs
caused by faulty usage of StringRef.
This version also renames a pair of functions:
getRecipEstimateDivEnabled()
getRecipEstimateSqrtEnabled()
as suggested by Eric Christopher.
original commit msg:
[Target] remove TargetRecip class; move reciprocal estimate isel functionality to TargetLowering
This is a follow-up to https://reviews.llvm.org/D24816 - where we changed reciprocal estimates to be function attributes
rather than TargetOptions.
This patch is intended to be a structural, but not functional change. By moving all of the
TargetRecip functionality into TargetLowering, we can remove all of the reciprocal estimate
state, shield the callers from the string format implementation, and simplify/localize the
logic needed for a target to enable this.
If a function has a "reciprocal-estimates" attribute, those settings may override the target's
default reciprocal preferences for whatever operation and data type we're trying to optimize.
If there's no attribute string or specific setting for the op/type pair, just use the target
default settings.
As noted earlier, a better solution would be to move the reciprocal estimate settings to IR
instructions and SDNodes rather than function attributes, but that's a multi-step job that
requires infrastructure improvements. I intend to work on that, but it's not clear how long
it will take to get all the pieces in place.
Differential Revision: https://reviews.llvm.org/D25440
llvm-svn: 284746
All of these existed because MSVC 2013 was unable to synthesize default
move ctors. We recently dropped support for it so all that error-prone
boilerplate can go.
No functionality change intended.
llvm-svn: 284721
This is a follow-up to D24816 - where we changed reciprocal estimates to be function attributes
rather than TargetOptions.
This patch is intended to be a structural, but not functional change. By moving all of the
TargetRecip functionality into TargetLowering, we can remove all of the reciprocal estimate
state, shield the callers from the string format implementation, and simplify/localize the
logic needed for a target to enable this.
If a function has a "reciprocal-estimates" attribute, those settings may override the target's
default reciprocal preferences for whatever operation and data type we're trying to optimize.
If there's no attribute string or specific setting for the op/type pair, just use the target
default settings.
As noted earlier, a better solution would be to move the reciprocal estimate settings to IR
instructions and SDNodes rather than function attributes, but that's a multi-step job that
requires infrastructure improvements. I intend to work on that, but it's not clear how long
it will take to get all the pieces in place.
Differential Revision: https://reviews.llvm.org/D25440
llvm-svn: 284495
Summary:
If we are loading an i16 value from a 32-bit memory location, then
we need to be able to truncate the loaded value to i16.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D25198
llvm-svn: 284397
Summary:
We are using this helper for our 24-bit arithmetic combines, so we are now able to eliminate multi-use operations that mask the high-bits of 24-bit inputs (e.g. and x, 0xffffff)
Reviewers: arsenm, nhaehnle
Subscribers: tony-tye, arsenm, kzhuravl, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D24672
llvm-svn: 284267
Summary:
This will be used for 64-bit MULHU, which is in turn used for the 64-bit
divide-by-constant optimization (see D24822).
Reviewers: arsenm, tstellarAMD
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D25289
llvm-svn: 284224
Retrying after upstream changes.
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store-merging search which only checks for parallel stores
through the chain subgraph. This is cleaner, as it separates the handling of
non-interfering loads/stores from the store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through loads and
TokenFactors to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations.
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.ll -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do not do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll -
This test appears to work but no longer exhibits the spill behavior.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 284151
Because everything live is spilled at the end of a
block by fast regalloc, assume this will happen and
avoid the copies of the resource descriptor.
llvm-svn: 284119
This is the most basic handling of the indirect access
pseudos using GPR indexing mode. This currently only enables
the mode for a single v_mov_b32 and then disables it.
This is much more complicated to use than the movrel instructions,
so a new optimization pass is probably needed to fold the access
into the uses and keep the mode enabled for them.
llvm-svn: 284031
- Refactor bit packing/unpacking
- Calculate bit mask given bit shift and bit width
- Introduce function for decoding bits of waitcnt
- Introduce function for encoding bits of waitcnt
- Introduce function for getting waitcnt mask (instead of using bare numbers)
- Introduce function for getting max waitcnt(s) (instead of using bare numbers)
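A hedged, generic sketch of the packing helpers described above (the names are illustrative, not the actual AMDGPUBaseInfo API):
'''
// Mask for a field given its bit shift and bit width (assumes Width < 32).
unsigned getBitMask(unsigned Shift, unsigned Width) {
  return ((1u << Width) - 1) << Shift;
}

// Encode: insert Value into the Shift/Width field of Waitcnt.
unsigned packBits(unsigned Waitcnt, unsigned Value, unsigned Shift, unsigned Width) {
  unsigned Mask = getBitMask(Shift, Width);
  return (Waitcnt & ~Mask) | ((Value << Shift) & Mask);
}

// Decode: extract the Shift/Width field of Waitcnt.
unsigned unpackBits(unsigned Waitcnt, unsigned Shift, unsigned Width) {
  return (Waitcnt & getBitMask(Shift, Width)) >> Shift;
}
'''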
Differential Revision: https://reviews.llvm.org/D25298
llvm-svn: 283919
Summary:
There was a bug with sequences like
s_mov_b64 s[0:1], exec
s_and_b64 s[2:3]<def>, s[0:1], s[2:3]<kill>
...
s_mov_b64_term exec, s[2:3]
because s[2:3] was defined and used in the same instruction, ending up with
SaveExecInst inside OtherUseInsts.
Note that the test case also exposes an unrelated bug.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98028
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D25306
llvm-svn: 283528
When constant folding an operation to a copy or an immediate
mov, the implicit uses/defs of the old instruction were left behind,
e.g. replacing v_or_b32 left the implicit exec use on the new copy.
llvm-svn: 283471
These ones need to have the size on the pseudo instruction set for
getInstSizeInBytes to work correctly. These also have a statically
known size.
llvm-svn: 283437
The register scavenging code does not support multiple definitions of
the same vreg.
Differential Revision: https://reviews.llvm.org/D25220
llvm-svn: 283369
Summary: Added 6 new target hooks for the vectorizer in order to filter types, handle size constraints and decide how to split chains.
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, mzolotukhin, wdng, llvm-commits, nhaehnle
Differential Revision: https://reviews.llvm.org/D24727
llvm-svn: 283099
For some reason there are both of these available, except
for scalar 64-bit compares which only have u64. I'm not sure
why there are both (I'm guessing it's for the one-bit inputs we
don't use), but for consistency always use the
unsigned one.
llvm-svn: 282832
Fixes to allow spilling all registers at the end of the block
work with exec modifications. Don't emit s_and_saveexec_b64 for
if lowering, and instead emit copies. Mark control flow mask
instructions as terminators to get correct spill code placement
with fast regalloc, and then have a separate optimization pass
form the saveexec.
This should work if SGPRs are spilled to VGPRs, but
will likely fail in the case that an SGPR spills to memory
and no workitem takes a divergent branch.
llvm-svn: 282667
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store-merging search which only checks for parallel stores
through the chain subgraph. This is cleaner, as it separates the handling of
non-interfering loads/stores from the store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through loads and
TokenFactors to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations.
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.ll -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do not do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll -
This test appears to work but no longer exhibits the spill
behavior.
Reviewers: arsenm, hfinkel, tstellarAMD, nhaehnle, jyknight
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 282600
The only implementation that exists immediately looks it up anyway, and the
information is needed to handle various parameter attributes (stored on the
function itself).
llvm-svn: 282068
Summary:
In case the s_branch instruction target is itself, the backend should emit offset -1, but instead it emits 0.
'''
label:
s_branch label // should emit [0xff,0xff,0x82,0xbf]
'''
Tom, Matt: why are we adjusting fixup values in the applyFixup() method instead of in processFixup()? processFixup() calls adjustFixupValue() but does nothing with its result.
Reviewers: vpykhtin, artem.tamazov, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl
Differential Revision: https://reviews.llvm.org/D24671
llvm-svn: 281896
We were trying to avoid using a FrameIndex operand in non-pointer
operands in a convoluted way, and would break because of
using TargetFrameIndex. The TargetFrameIndex should only be used
in the case where it makes sense to fold it as part of the addressing
mode, otherwise it requires materialization like a normal constant.
This wasn't working reliably and failed in the added testcase, hitting
the assert when processing the frame index.
The TargetFrameIndex was coming from trying to produce an AssertZext
limiting the maximum stack size. I'm not sure this was correct to begin
with, because it is apparently possible to have a single workitem
dispatch that requires all 4G of private memory.
llvm-svn: 281824
This reduces the number of copies and reg_sequences
when using fp constant vectors. This significantly
reduces the code size in local-stack-alloc-bug.ll
llvm-svn: 281822
Summary: i8, i16, and f16 values are not extended to 32-bit in the HSA kernel ABI.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D24621
llvm-svn: 281789
These clean up some unnecessary 'or' instructions in
cases with complex loops.
In the original testcase where I noticed this, the same
or with exec was repeated 5 or 6 times in a row. With
this change only one is emitted, or sometimes a copy.
llvm-svn: 281786
Summary:
The main challenge in lowering kernel arguments for AMDGPU is determining the
memory type of the argument. The generic calling convention code assumes
that only legal register types can be stored in memory, but this is not the
case for AMDGPU.
This consolidates all the logic AMDGPU uses for deducing memory types into a single
function. This will make it much easier to support different ABIs in the future.
Reviewers: arsenm
Subscribers: arsenm, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D24614
llvm-svn: 281781
Summary:
mesa3d will use the same kernel calling convention as amdhsa, but it will
handle everything else like the default 'unknown' OS type.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D22783
llvm-svn: 281779
This addresses a TODO to handle operations besides 'and'. This
also starts eliminating no-op operations with a constant that
can emerge later.
llvm-svn: 281488
Summary: This removes disabled instructions from match tables so we will not match them at all.
Reviewers: tstellarAMD, vpykhtin, artem.tamazov
Subscribers: wdng, nhaehnle, arsenm
Differential Revision: https://reviews.llvm.org/D24452
llvm-svn: 281216
Now that MachineBasicBlock::reverse_instr_iterator knows when it's at
the end (since r281168 and r281170), implement
MachineBasicBlock::reverse_iterator directly on top of an
ilist::reverse_iterator by adding an IsReverse template parameter to
MachineInstrBundleIterator. This replaces another hard-to-reason-about
use of std::reverse_iterator on list iterators, matching the changes for
ilist::reverse_iterator from r280032 (see the "out of scope" section at
the end of that commit message). MachineBasicBlock::reverse_iterator
now has a handle to the current node and has obvious invalidation
semantics.
r280032 has a more detailed explanation of how list-style reverse
iterators (invalidated when the pointed-at node is deleted) are
different from vector-style reverse iterators like std::reverse_iterator
(invalidated on every operation). A great motivating example is this
commit's changes to lib/CodeGen/DeadMachineInstructionElim.cpp.
Note: If your out-of-tree backend deletes instructions while iterating
on a MachineBasicBlock::reverse_iterator or converts between
MachineBasicBlock::iterator and MachineBasicBlock::reverse_iterator,
you'll need to update your code in similar ways to r280032. The
following table might help:
[Old]                            ==>  [New]
delete &*RI, RE = end()          ==>  delete &*RI++
RI->erase(), RE = end()          ==>  RI++->erase()
reverse_iterator(I)              ==>  std::prev(I).getReverse()
reverse_iterator(I)              ==>  ++I.getReverse()
--reverse_iterator(I)            ==>  I.getReverse()
reverse_iterator(std::next(I))   ==>  I.getReverse()
RI.base()                        ==>  std::prev(RI).getReverse()
RI.base()                        ==>  ++RI.getReverse()
--RI.base()                      ==>  RI.getReverse()
std::next(RI).base()             ==>  RI.getReverse()
(For more details, have a look at r280032.)
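As a rough illustration of the new-style pattern in the table, a delete-while-iterating loop ends up looking roughly like this (a sketch only; MBB and shouldDelete() are assumptions, not part of this change):
'''
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"

// Delete instructions while walking a block in reverse. With the list-style
// reverse_iterator, advancing past the instruction before erasing it is
// enough; there is no need to re-fetch rend() after each erase.
static void deleteMatching(llvm::MachineBasicBlock &MBB,
                           bool (*shouldDelete)(const llvm::MachineInstr &)) {
  for (auto RI = MBB.rbegin(), RE = MBB.rend(); RI != RE;) {
    llvm::MachineInstr &MI = *RI++;   // the "delete &*RI++" row above
    if (shouldDelete(MI))
      MI.eraseFromParent();
  }
}
'''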
llvm-svn: 281172
Summary:
An IR load can be invariant, dereferenceable, neither, or both. But
currently, MI's notion of invariance is IR-invariant &&
IR-dereferenceable.
This patch splits up the notions of invariance and dereferenceability at
the MI level. It's NFC, so it adds some probably-unnecessary
"is-dereferenceable" checks, which we can remove later if desired.
Reviewers: chandlerc, tstellarAMD
Subscribers: jholewinski, arsenm, nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D23371
llvm-svn: 281151
If the literal is being folded into src0, it doesn't matter
if it's an SGPR because it's being replaced with the literal.
Also fixes initially selecting 32-bit versions of some instructions
which also confused commuting.
llvm-svn: 281117
Summary:
Previously the assembler parsed all literals as either 32-bit integers or 32-bit floating-point values. Because of this we couldn't support f64 literals.
E.g. in instruction "v_fract_f64 v[0:1], 0.5", literal 0.5 was encoded as 32-bit literal 0x3f000000, which is incorrect and will be interpreted as 3.0517578125E-5 instead of 0.5. Correct encoding is inline constant 240 (optimal) or 32-bit literal 0x3FE00000 at least.
With this change, all literals are always parsed as 64-bit values, either integer or floating-point. We then convert each parsed literal to the correct form based on how the literal was written (floating-point or binary) and on the operand types the instruction expects (is this an f32/64 or b32/64 instruction).
Here are the rules for how we convert literals (a rough pseudocode sketch follows the list):
- We parsed fp literal:
- Instruction expects 64-bit operand:
- If parsed literal is inlinable (e.g. v_fract_f64_e32 v[0:1], 0.5)
- then we do nothing with this literal
- Else if literal is not-inlinable but the instruction requires it to be inlined (e.g. this is the e64 encoding, v_fract_f64_e64 v[0:1], 1.5)
- report error
- Else literal is not-inlinable but we can encode it as additional 32-bit literal constant
- If instruction expect fp operand type (f64)
- Check if low 32 bits of literal are zeroes (e.g. v_fract_f64 v[0:1], 1.5)
- If so then do nothing
- Else (e.g. v_fract_f64 v[0:1], 3.1415)
- report warning that low 32 bits will be set to zeroes and precision will be lost
- set low 32 bits of literal to zeroes
- Instruction expects integer operand type (e.g. s_mov_b64_e32 s[0:1], 1.5)
- report error as it is unclear how to encode this literal
- Instruction expects 32-bit operand:
- Convert the parsed 64-bit fp literal to 32-bit fp. Allow loss of precision but not overflow or underflow
- Is this literal inlinable and are we required to inline literal (e.g. v_trunc_f32_e64 v0, 0.5)
- do nothing
- Else report error
- Do nothing. We can encode any other 32-bit fp literal (e.g. v_trunc_f32 v0, 10000000.0)
- Parsed binary literal:
- Is this literal inlinable (e.g. v_trunc_f32_e32 v0, 35)
- do nothing
- Else, are we required to inline this literal (e.g. v_trunc_f32_e64 v0, 35)
- report error
- Else, literal is not-inlinable and we are not required to inline it
- Are the high 32 bits of the literal zeroes, or the same as the sign bit of the low 32 bits
- do nothing (e.g. v_trunc_f32 v0, 0xdeadbeef)
- Else
- report error (e.g. v_trunc_f32 v0, 0x123456789abcdef0)
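A very rough pseudocode sketch of the fp-literal branch of these rules (illustrative only; the names and the simplified structure are not the actual AMDGPUAsmParser code):
'''
#include <cstdint>

enum class LitAction { Keep, Error, WarnTruncateLow32 };

// Hypothetical: decide what to do with a 64-bit fp literal for an instruction
// that expects a 64-bit fp operand. IsInlinable and MustInline stand in for
// the inline-constant check and the e64-encoding restriction described above.
static LitAction classifyFP64Literal(uint64_t Bits, bool IsInlinable,
                                     bool MustInline) {
  if (IsInlinable)                  // e.g. v_fract_f64_e32 v[0:1], 0.5
    return LitAction::Keep;
  if (MustInline)                   // e.g. v_fract_f64_e64 v[0:1], 1.5
    return LitAction::Error;
  if ((Bits & 0xffffffffu) == 0)    // e.g. v_fract_f64 v[0:1], 1.5
    return LitAction::Keep;         // encode the high half as a 32-bit literal
  return LitAction::WarnTruncateLow32; // e.g. v_fract_f64 v[0:1], 3.1415
}
'''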
For this change it is required that we know operand types of instruction (are they f32/64 or b32/64). I added several new register operands (they extend previous register operands) and set operand types to corresponding types:
'''
enum OperandType {
OPERAND_REG_IMM32_INT,
OPERAND_REG_IMM32_FP,
OPERAND_REG_INLINE_C_INT,
OPERAND_REG_INLINE_C_FP,
}
'''
This is not working yet:
- Several tests are failing
- Problems with predicate methods for inline immediates
- LLVM generated assembler parts try to select e64 encoding before e32.
More changes are required for several AsmOperands.
Reviewers: vpykhtin, tstellarAMD
Subscribers: arsenm, kzhuravl, artem.tamazov
Differential Revision: https://reviews.llvm.org/D22922
llvm-svn: 281050
Summary:
Also removed duplicate code from AMDGPUTargetAsmStreamer.
This change only changes how amd_kernel_code_t is parsed and printed. No variable names are changed.
Reviewers: vpykhtin, tstellarAMD
Subscribers: arsenm, wdng, nhaehnle
Differential Revision: https://reviews.llvm.org/D24296
llvm-svn: 281028
OpenCL kernels have hidden kernel arguments for the global offset and the printf buffer. For consistency, these hidden arguments should be included in the runtime metadata. Also updated the kernel argument kind metadata.
Differential Revision: https://reviews.llvm.org/D23424
llvm-svn: 280829
- Implemented amdgpu-flat-work-group-size attribute
- Implemented amdgpu-num-active-waves-per-eu attribute
- Implemented amdgpu-num-sgpr attribute
- Implemented amdgpu-num-vgpr attribute
- Dynamic LDS constraints are in a separate patch
Patch by Tom Stellard and Konstantin Zhuravlyov
Differential Revision: https://reviews.llvm.org/D21562
llvm-svn: 280747
Summary:
I put this code here, because I want to re-use it in a few other places.
This supersedes some of the immediate folding code we have in SIFoldOperands.
I think the peephole optimizer is probably a better place for folding
immediates into copies, since it does some register coalescing at the same time.
This will also make it easier to transition SIFoldOperands into a smarter pass,
where it looks at all uses of an instruction at once to determine the optimal way to
fold operands. Right now, the pass just considers one operand at a time.
Reviewers: arsenm
Subscribers: wdng, nhaehnle, arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23402
llvm-svn: 280744
I'm not sure if this should be considered a bug in
copyImplicitOps or not, but implicit operands that are part
of the static instruction definition should not be copied.
llvm-svn: 280594
Summary:
This contains two changes that reduce the time spent in WQM, with the
intention of reducing bandwidth required by VMEM loads:
1. Sampling instructions by themselves don't need to run in WQM, only their
coordinate inputs need it (unless of course there is a dependent sampling
instruction). The initial scanInstructions step is modified accordingly.
2. When switching back from WQM to Exact, switch back as soon as possible.
This affects the logic in processBlock.
This should always be a win or at best neutral.
There are also some cleanups (e.g. remove unused ExecExports) and some new
debugging output.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: http://reviews.llvm.org/D22092
llvm-svn: 280590
Summary:
This fixes a rare bug in polygon stippling with non-monolithic pixel shaders.
The underlying problem is as follows: the prolog part contains the polygon
stippling sequence, i.e. a kill. The main part then enables WQM based on the
_reduced_ exec mask, effectively undoing most of the polygon stippling.
Since we cannot know whether polygon stippling will be used, the main part
of a non-monolithic shader must always return to exact mode to fix this
problem.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23131
llvm-svn: 280589
readlane/writelane do not support using m0 as the output/input.
Constrain the register class of spill vregs to try to avoid this,
but also handle spilling of the physreg when necessary by inserting
an additional copy to a normal SGPR.
llvm-svn: 280584
Add runtime metadata for the pointee alignment of pointer-type kernel arguments. The key is KeyArgPointeeAlign and the value is a 32-bit unsigned integer.
Differential Revision: https://reviews.llvm.org/D24145
llvm-svn: 280399
Summary:
Created a new td file MIMGInstructions.td which contains all definitions
of MIMG related instructions.
Reviewed by:
kzhuravl, vpykhtin
Differential Revision:
http://reviews.llvm.org/D24106
llvm-svn: 280385
Summary: This fixes some OpenCV tests that were broken by libclc commit r276443.
Reviewers: arsenm, jvesely
Subscribers: arsenm, wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D24051
llvm-svn: 280274
Summary:
Simply replace uses of aliases to functions with the aliasee.
This came up when bitcode linking to the builtin library left
calls to aliases unresolved.
Also made minor improvements to existing test.
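The idea, as a minimal sketch (a hypothetical helper, not the committed pass; direct function aliasees are assumed for brevity):
'''
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Casting.h"

// Replace each use of a function alias with the aliased function itself, so
// later bitcode linking resolves the calls directly.
static void replaceFunctionAliases(llvm::Module &M) {
  llvm::SmallVector<llvm::GlobalAlias *, 8> Worklist;
  for (llvm::GlobalAlias &GA : M.aliases())
    Worklist.push_back(&GA);
  for (llvm::GlobalAlias *GA : Worklist) {
    if (auto *F = llvm::dyn_cast<llvm::Function>(GA->getAliasee())) {
      GA->replaceAllUsesWith(F);
      GA->eraseFromParent();
    }
  }
}
'''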
Reviewers: tstellarAMD, alex-t, vpykhtin
Subscribers: arsenm, wdng, rampitec
Differential Revision: https://reviews.llvm.org/D24023
llvm-svn: 280221
Move SDLoc initialization to a common place.
Fall back to the AMDGPU version in one place.
Differential Revision: https://reviews.llvm.org/D23900
llvm-svn: 280030
Summary:
The SILoadStoreOptimizer can now look ahead more than one instruction when
looking for instructions to merge, which greatly improves the number of
loads/stores that we are able to merge.
Moving the pass before scheduling avoids increasing register pressure after
the scheduler, so that the scheduler's register pressure estimates will be
more accurate. It also gives more consistent results, since it is no longer
affected by minor scheduling changes.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D23814
llvm-svn: 279991
Summary:
For shrinking SOPK instructions, we were creating a hint to tell the
register allocator to use the register allocated for src0 for the dst
operand as well. However, this seems to not work sometimes depending
on the order in which virtual registers are assigned physical registers.
To fix this, I've added a second allocation hint which does the reverse,
asks that the register allocated for dst is used for src0.
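Roughly, the two hints look like this (a sketch, assuming virtual registers Dst and Src0 from a shrinkable SOPK candidate; not the committed code):
'''
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

// Hypothetical helper: hint in both directions so the allocator can honor the
// preference regardless of which virtual register it assigns first.
static void addTwoWayCopyHints(llvm::MachineFunction &MF, unsigned Dst,
                               unsigned Src0) {
  llvm::MachineRegisterInfo &MRI = MF.getRegInfo();
  MRI.setRegAllocationHint(Dst, /*Type=*/0, Src0);  // old hint: Dst wants Src0's reg
  MRI.setRegAllocationHint(Src0, /*Type=*/0, Dst);  // new hint: Src0 wants Dst's reg
}
'''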
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23862
llvm-svn: 279968
Summary:
The SILoadStoreOptimizer will need to use AliasAnalysis here in order to
move it before scheduling.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23813
llvm-svn: 279963
There's only one use of this for the convenience
of a pattern. I think v_mov_b64_pseudo should also be
moved, but SIFoldOperands does currently make use of it.
llvm-svn: 279901
It isn't used for anything, and is also misleading since
it could be spilled at the end of the block, so it can't be relied
on. There ends up being a verifier error about using an undefined
register since the spill kills the register.
llvm-svn: 279899
Summary:
If the scheduler clusters the loads, then the offsets will be sorted,
but it is possible for the scheduler to schedule loads together
without explicitly clustering them, which would give us non-sorted
offsets.
Also, we will want to do this if we move the load/store optimizer before
the scheduler.
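The sorting itself is straightforward; a sketch of the idea (a hypothetical structure, not the SILoadStoreOptimizer code):
'''
#include <algorithm>
#include <cstdint>
#include <vector>

struct MemOpCandidate {
  int64_t Offset;   // immediate offset of the load/store
  unsigned Index;   // stand-in for a pointer to the MachineInstr
};

// Sort candidate memory operations by offset so that adjacent but unclustered
// loads are still seen as mergeable pairs.
static void sortByOffset(std::vector<MemOpCandidate> &Ops) {
  std::sort(Ops.begin(), Ops.end(),
            [](const MemOpCandidate &A, const MemOpCandidate &B) {
              return A.Offset < B.Offset;
            });
}
'''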
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23776
llvm-svn: 279870
Summary:
There are a few different sgpr pressure sets, but we only care about
the one which covers all of the sgprs. We were using hard-coded
register pressure set names to determine the reg set id for the
biggest sgpr set. However, we were using the wrong name, and this
method is pretty fragile, since the reg pressure set names may
change.
The new method just looks for the pressure set that contains the most
reg units and sets that set as our SGPR pressure set. We've also
adopted the same technique for determining our VGPR pressure set.
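A sketch of the selection idea (illustrative only; the real code restricts the search to sets covering SGPRs or VGPRs respectively, and the header path may differ by LLVM version):
'''
#include <algorithm>
#include <vector>
#include "llvm/CodeGen/TargetRegisterInfo.h"

// Hypothetical helper: count how many register units each pressure set
// covers and return the ID of the set covering the most units.
static unsigned findLargestPressureSet(const llvm::TargetRegisterInfo &TRI) {
  std::vector<unsigned> Counts(TRI.getNumRegPressureSets(), 0);
  for (unsigned Unit = 0, E = TRI.getNumRegUnits(); Unit != E; ++Unit)
    for (const int *PS = TRI.getRegUnitPressureSets(Unit); *PS != -1; ++PS)
      ++Counts[*PS];
  return static_cast<unsigned>(
      std::max_element(Counts.begin(), Counts.end()) - Counts.begin());
}
'''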
Reviewers: arsenm
Subscribers: MatzeB, arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23687
llvm-svn: 279867
MCContext already has many tasks, and separating CodeView out from it is
probably a good idea. The .cv_loc tracking was modelled on the DWARF
tracking which lived directly in MCContext.
Removes the inclusion of MCCodeView.h from MCContext.h, so now there are
only 10 build actions while I hack on CodeView support instead of 265.
llvm-svn: 279847
Summary:
This patch implements readlane/readfirstlane intrinsics.
TODO: need to define a new register class to consider the case
that the source could be a vector register or M0.
Reviewed by:
arsenm and tstellarAMD
Differential Revision:
http://reviews.llvm.org/D22489
llvm-svn: 279660
Re-apply this patch; hopefully I will get away without any warnings
in the constructor now.
This patch removes the MachineFunctionAnalysis. Instead we keep a
map from IR Function to MachineFunction in the MachineModuleInfo.
This allows the insertion of ModulePasses into the codegen pipeline
without breaking it because the MachineFunctionAnalysis gets dropped
before a module pass.
Peak memory should stay unchanged without a ModulePass in the codegen
pipeline: Previously the MachineFunction was freed at the end of a codegen
function pipeline because the MachineFunctionAnalysis was dropped; With
this patch the MachineFunction is freed after the AsmPrinter has
finished.
Differential Revision: http://reviews.llvm.org/D23736
llvm-svn: 279602
I want to compute the SSA property of .mir files automatically in
upcoming patches. The problem with this is that some inputs will be
reported as static single assignment with some passes claiming not to
support SSA form. In reality, though, those passes do not support PHI
instructions => track the presence of PHI instructions separately from the
SSA property.
Differential Revision: https://reviews.llvm.org/D22719
llvm-svn: 279573
Re-apply this commit with the deletion of a MachineFunction delegated to
a separate pass to avoid use after free when doing this directly in
AsmPrinter.
This patch removes the MachineFunctionAnalysis. Instead we keep a
map from IR Function to MachineFunction in the MachineModuleInfo.
This allows the insertion of ModulePasses into the codegen pipeline
without breaking it because the MachineFunctionAnalysis gets dropped
before a module pass.
Peak memory should stay unchanged without a ModulePass in the codegen
pipeline: Previously the MachineFunction was freed at the end of a codegen
function pipeline because the MachineFunctionAnalysis was dropped; With
this patch the MachineFunction is freed after the AsmPrinter has
finished.
Differential Revision: http://reviews.llvm.org/D23736
llvm-svn: 279564
This patch removes the MachineFunctionAnalysis. Instead we keep a
map from IR Function to MachineFunction in the MachineModuleInfo.
This allows the insertion of ModulePasses into the codegen pipeline
without breaking it because the MachineFunctionAnalysis gets dropped
before a module pass.
Peak memory should stay unchanged without a ModulePass in the codegen
pipeline: Previously the MachineFunction was freed at the end of a codegen
function pipeline because the MachineFunctionAnalysis was dropped; With
this patch the MachineFunction is freed after the AsmPrinter has
finished.
Differential Revision: http://reviews.llvm.org/D23736
llvm-svn: 279502
Do most of the lowering in a pre-RA pass. Keep the skip jump
insertion late, plus a few other things that require more
work to move out.
One concern I have is that there may now be COPY instructions
which do not have the necessary implicit exec uses
if they will be lowered to v_mov_b32.
This has a positive effect on SGPR usage in shader-db.
llvm-svn: 279464
Currently nodes_iterator may dereference to a NodeType* or a NodeType&. Make them all dereference to NodeType*, which is NodeRef later.
Differential Revision: https://reviews.llvm.org/D23704
Differential Revision: https://reviews.llvm.org/D23705
llvm-svn: 279326
The names of the tablegen defs now match the names of the ISD nodes.
This makes the world a slightly saner place, as previously "fround" matched
ISD::FP_ROUND and not ISD::FROUND.
Differential Revision: https://reviews.llvm.org/D23597
llvm-svn: 279129
This is a mechanical change of comments in switches like fallthrough,
fall-through, or fall-thru to use the LLVM_FALLTHROUGH macro instead.
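For illustration, the mechanical rewrite turns the comment into the macro (a made-up switch, not one of the converted sites):
'''
#include "llvm/Support/Compiler.h"

// The old "// fall through" comment on case 0 becomes LLVM_FALLTHROUGH.
static int classify(int Kind) {
  int Score = 0;
  switch (Kind) {
  case 0:
    Score += 1;
    LLVM_FALLTHROUGH;   // was: // fall through
  case 1:
    Score += 1;
    break;
  default:
    break;
  }
  return Score;
}
'''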
llvm-svn: 278902
This ports the always inliner to the new pass manager in a much more
minimal and boring form than the old pass manager's version.
This pass does the very minimal amount of work necessary to inline
functions declared as always-inline. It doesn't support a wide array of
things that the legacy pass manager did support, but is also ... about
20 lines of code. So it has that going for it. Notably things this
doesn't support:
- Array alloca merging
- To support the above, bottom-up inlining with careful history
tracking and call graph updates
- DCE of the functions that become dead after this inlining.
- Inlining through call instructions with the always_inline attribute.
Instead, it focuses on inlining functions with that attribute.
The first I've omitted because I'm hoping to just turn it off for the
primary pass manager. If that doesn't pan out, I can add it here but it
will be reasonably expensive to do so.
The second should really be handled by running global-dce after the
inliner. I don't want to re-implement the non-trivial logic necessary to
do comdat-correct DCE of functions. This means the -O0 pipeline will
have to be at least 'always-inline,global-dce', but that seems
reasonable to me. If others are seriously worried about this I'd like to
hear about it and understand why. Again, this is all solvable by
factoring that logic into a utility and calling it here, but I'd like to
wait to do that until there is a clear reason why the existing
pass-based factoring won't work.
The final point is a serious one. I can fairly easily add support for
this, but it seems both costly and a confusing construct for the use
case of the always inliner running at -O0. This attribute can of course
still impact the normal inliner easily (although I find that
a questionable re-use of the same attribute). I've started a discussion
to sort out what semantics we want here and based on that can figure out
if it makes sense to have this complexity at O0 or not.
One other advantage of this design is that it should be quite a bit
faster due to checking for whether the function is a viable candidate
for inlining exactly once per function instead of doing it for each call
site.
Anyways, hopefully a reasonable starting point for this pass.
Differential Revision: https://reviews.llvm.org/D23299
llvm-svn: 278896
The structs ImmOp and RegOp are in AArch64AsmParser.cpp (inside
anonymous namespace).
This diff changes the order of fields and removes the excessive padding
(8 bytes).
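The general pattern looks like this, with made-up fields rather than the actual ImmOp/RegOp members (sizes assume a typical 64-bit ABI):
'''
#include <cstdint>

// Hypothetical layouts illustrating the reordering; not the real structs.
struct Before {           // 24 bytes: the pointer's alignment forces padding
  bool Negated;           // 1 byte + 7 bytes of padding
  const void *Expr;       // 8 bytes
  uint32_t ShiftAmount;   // 4 bytes + 4 bytes of tail padding
};

struct After {            // 16 bytes: same fields, largest first
  const void *Expr;       // 8 bytes
  uint32_t ShiftAmount;   // 4 bytes
  bool Negated;           // 1 byte + 3 bytes of tail padding
};
'''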
Patch by Alexander Shaposhnikov
llvm-svn: 278844
Recall that MSVC always gives enums the type 'int', nothing else. MSVC
2015 does not appear to have this problem anymore.
Clang-cl -Wmicrosoft-enum-value flags this, FWIW, so now I have a true
positive for my warning. :)
llvm-svn: 278762
Remove all ilist_iterator to pointer casts. There were two reasons for
casts:
- Checking for an uninitialized (i.e., null) iterator. I added
MachineInstrBundleIterator::isValid() to check for that case.
- Comparing an iterator against the underlying pointer value while
avoiding converting the pointer value to an iterator. This is
occasionally necessary in MachineInstrBundleIterator, since there is
an assertion in the constructors that the underlying MachineInstr is
not bundled (but we don't care about that if we're just checking for
pointer equality).
To support the latter case, I rewrote the == and != operators for
ilist_iterator and MachineInstrBundleIterator.
- The implicit constructors now use enable_if to exclude
const-iterator => non-const-iterator conversions from overload
resolution (previously it was a compiler error on instantiation, now
it's SFINAE).
- The == and != operators are now global (friends), and are not
templated.
- MachineInstrBundleIterator has overloads to compare against both
const_pointer and const_reference. This avoids the implicit
conversions to MachineInstrBundleIterator that assert, instead just
checking the address (and I added unit tests to confirm this).
Notably, the only remaining uses of ilist_iterator::getNodePtrUnchecked
are in ilist.h, and no code outside of ilist*.h directly relies on this
UB end-iterator-to-pointer conversion anymore. It's still needed for
ilist_*sentinel_traits, but I'll clean that up soon.
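A simplified sketch of the enable_if trick described above (not the actual ilist_iterator code):
'''
#include <type_traits>

// The converting constructor participates in overload resolution only when
// OtherTy* converts to NodeTy*, so a const_iterator -> iterator conversion
// simply finds no viable constructor (SFINAE) instead of hitting an error
// inside the constructor body at instantiation time.
template <typename NodeTy> class SimpleIter {
  NodeTy *Ptr = nullptr;

public:
  SimpleIter() = default;
  explicit SimpleIter(NodeTy *N) : Ptr(N) {}

  template <typename OtherTy,
            typename = typename std::enable_if<
                std::is_convertible<OtherTy *, NodeTy *>::value>::type>
  SimpleIter(const SimpleIter<OtherTy> &RHS) : Ptr(RHS.getNodePtr()) {}

  NodeTy *getNodePtr() const { return Ptr; }
};

// SimpleIter<int> I;  SimpleIter<const int> CI = I;   // ok
// SimpleIter<int> J = CI;                             // no viable constructor
'''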
llvm-svn: 278478
Summary:
This patch define and implement amdgcn image intrinsics with sampler.
1. define vdata type to be llvm_anyfloat_ty, address type to be llvm_anyfloat_ty,
and rsrc type to be llvm_anyint_ty. As a result, we expect the intrinsics name
to have three suffixes to overload each of these three types;
2. R128 as well as two other flags are implied in the three types, for example,
if you use v8i32 as resource type, then r128 is 0!
3. don't expose TFE flag, and other flags are exposed in the instruction order:
unrm, glc, slc, lwe and da.
Differential Revision: http://reviews.llvm.org/D22838
Reviewed by:
arsenm and tstellarAMD
llvm-svn: 278291
Insert before the skip branch if one is created.
This is a somewhat more natural placement relative
to the skip branches, and makes it possible to implement
analyzeBranch for skip blocks.
The test changes are mostly due to a quirk where
the block label is not emitted if there is a terminator
that is not also a branch.
llvm-svn: 278273
Summary:
This is the setting of the Vulkan closed source driver.
It decreases the max wave count from 10 to 8.
26010 shaders in 14650 tests
Totals:
VGPRS: 829593 -> 808440 (-2.55 %)
Spilled SGPRs: 81878 -> 42226 (-48.43 %)
Spilled VGPRs: 367 -> 358 (-2.45 %)
Scratch VGPRs: 1764 -> 1748 (-0.91 %) dwords per thread
Code Size: 36677864 -> 35923932 (-2.06 %) bytes
There is a massive decrease in SGPR spilling in general and -7.4% spilled
VGPRs for DiRT Showdown (= SGPRs spilled to scratch?)
Reviewers: arsenm, tstellarAMD, nhaehnle
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23034
llvm-svn: 277867
Adding missing tests for OCL type names for half, float, double, char, short, long, and unknown.
Patch by Aaron En Ye Shi.
Differential Revision: https://reviews.llvm.org/D22964
llvm-svn: 277759
Summary:
TargetBaseAlign is no longer required since LSV checks if target allows misaligned accesses.
A constant defining a base alignment is still needed for stack accesses where alignment can be adjusted.
Previous patch (D22936) was reverted because tests were failing. This patch also fixes the cause of those failures:
- x86 failing tests either did not have the right target, or the right alignment.
- NVPTX failing tests did not have the right alignment.
- AMDGPU failing test (merge-stores) should allow vectorization with the given alignment but the target info
considers <3xi32> a non-standard type and gives up early. This patch removes the condition and only checks
for a maximum size allowed and relies on the next condition checking for %4 for correctness.
This should be revisited to include 3xi32 as a MVT type (on arsenm's non-immediate todo list).
Note that checking the sizeInBits for a MVT is undefined (leads to an assertion failure),
so we need to create an EVT, hence the interface change in allowsMisaligned to include the Context.
Reviewers: arsenm, jlebar, tstellarAMD
Subscribers: jholewinski, arsenm, mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D23068
llvm-svn: 277735
On modern Intel processors hardware SQRT in many cases is faster than RSQRT
followed by Newton-Raphson refinement. The patch introduces a simple heuristic
to choose between hardware SQRT instruction and Newton-Raphson software
estimation.
The patch treats scalars and vectors differently. The heuristic is that for
scalars the compiler should optimize for latency while for vectors it should
optimize for throughput. It is based on the assumption that throughput bound
code is likely to be vectorized.
Basically, the patch disables scalar NR for big cores and disables NR completely
for Skylake. Firstly, scalar SQRT has shorter latency than NR code in big cores.
Secondly, vector SQRT has been greatly improved in Skylake and has better
throughput compared to NR.
Differential Revision: https://reviews.llvm.org/D21379
llvm-svn: 277725
Summary:
Two types of stores are possible in pixel shaders: stores to memory that are
explicitly requested at the API level, and stores that are an implementation
detail of register spilling or lowering of arrays.
For the first kind of store, we must ensure that helper pixels have no effect
and hence WQM must be disabled. The second kind of store must always be
executed, because the written value may be loaded again in a way that is
relevant for helper pixels as well -- and there are no externally visible
effects anyway.
This is a candidate for the 3.9 release branch.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D22675
llvm-svn: 277504
Summary:
There are cases where uniform branch conditions are computed in VGPRs, and
we didn't correctly mark those as WQM.
The stray change in basic-branch.ll is because invoking the LiveIntervals
analysis leads to the detection of a dead register that would otherwise not
be seen at -O0.
This is a candidate for the 3.9 branch, as it fixes a possible hang.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D22673
llvm-svn: 277500
This adds a target hook getInstSizeInBytes to TargetInstrInfo that a lot of
subclasses already implement.
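Implementing the hook in a backend looks roughly like this (a hypothetical fixed-width target, not one of the in-tree overrides):
'''
#include "llvm/CodeGen/MachineInstr.h"

// Made-up backend helper: on a fixed-width ISA every real instruction is
// 4 bytes, while pseudo instructions occupy no space in the final encoding.
struct MyTargetInstSize {
  unsigned getInstSizeInBytes(const llvm::MachineInstr &MI) const {
    return MI.isPseudo() ? 0 : 4;
  }
};
'''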
Differential Revision: https://reviews.llvm.org/D22885
llvm-svn: 277126
Summary:
We were using reserved VGPRs for SGPR spilling and this was causing
some programs with a workgroup size of 1024 to use more than 64
registers, which is illegal.
Reviewers: arsenm, mareko, nhaehnle
Subscribers: nhaehnle, arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D22032
llvm-svn: 276980
Summary:
SI_ELSE is lowered into two parts:
s_or_saveexec_b64 dst, src (at the start of the basic block)
s_xor_b64 exec, exec, dst (at the end of the basic block)
The idea is that dst contains the exec mask of the preceding IF block. It can
happen that SIWholeQuadMode decides to switch from WQM to Exact mode inside
the basic block that contains SI_ELSE, in which case it introduces an instruction
s_and_b64 exec, exec, s[...]
which masks out bits that can correspond to both the IF and the ELSE paths.
So the resulting sequence must be:
s_or_saveexec_b64 dst, src
s_and_b64 exec, exec, s[...] <-- added by SIWholeQuadMode
s_and_b64 dst, dst, exec <-- added by SILowerControlFlow
s_xor_b64 exec, exec, dst
Whether to add the additional s_and_b64 dst, dst, exec is currently determined
via the ExecModified tracking. With this change, it is instead determined by
an additional flag on SI_ELSE which is set by SIWholeQuadMode.
Finally: It also occurred to me that an alternative approach for the long run
is for SILowerControlFlow to unconditionally emit
s_or_saveexec_b64 dst, src
...
s_and_b64 dst, dst, exec
s_xor_b64 exec, exec, dst
and have a pass that detects and cleans up the "redundant AND with exec"
pattern where possible. This could be useful anyway, because we also add
instructions
s_and_b64 vcc, exec, vcc
before s_cbranch_scc (in moveToALU), and those are often redundant. I have
some pending changes to how KILL is lowered that could also benefit from
such a cleanup pass.
In any case, this current patch could help in the short term with the whole
ExecModified business.
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D22846
llvm-svn: 276972