We create MMOs for the VLDn/VSTn intrinsics in ARMTargetLowering::
getTgtMemIntrinsic, but they do not currently make it all the way through
ISel. This changes that in the various places it needs changing, making
sure that the MMO is propagated through to the final instruction. This
can help in scheduling, as the VLD2/VST2 is no longer treated as a
scheduling barrier.
Differential Revision: https://reviews.llvm.org/D101096
Setting the preferred function alignment to 16 for Cortex-A53/A55
improves performance in a wide range of benchmarks. This brings it
in line with the Cortex-A53/A55 tuning that is used in GCC
(gcc/config/aarch64/aarch64.c).
Differential Revision: https://reviews.llvm.org/D101636
Change-Id: I2ce47fe7ab5e3b54f49c89038d8da4e404742de2
This shrinks the immediate that the isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers, similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses fewer bytes for negative numbers, but
increases the number of bytes needed to encode 64, which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
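As a rough illustration of the size trade-off described above, the sketch
below counts encoded bytes under two schemes. It assumes a VBR-style encoding
with 7 payload bits per byte, plus a sign-rotated variant that keeps the sign
in the low bit. The helper names are hypothetical and the real isel table
encoding may differ in detail.

```python
MASK64 = (1 << 64) - 1


def vbr_len(value):
    """Bytes needed for an unsigned value with 7 payload bits per byte."""
    count = 1
    while value >= 0x80:
        value >>= 7
        count += 1
    return count


def plain_len(value):
    """Length when a signed value is emitted as its 64-bit two's complement."""
    return vbr_len(value & MASK64)


def sign_rotated_len(value):
    """Length when the magnitude is shifted left and the sign sits in bit 0."""
    rotated = (abs(value) << 1) | (1 if value < 0 else 0)
    return vbr_len(rotated)
```

Under these assumptions -1 shrinks from ten bytes to one, 64 grows from one
byte to two, and Log2(64) = 6 still fits in one byte.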
X32 uses 32-bit ELF object files with 32-bit alignment, so the
.note.gnu.property section needs to be emitted as it is for X86.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101689
Introduce a basic schedule model for AMD Zen 3 CPUs, a.k.a. `znver3`.
This is fully built from scratch, from llvm-mca measurements
and documented reference materials.
Nothing was copied from `znver2`/`znver1`.
I believe this is in a reasonable state of completion for inclusion,
probably better than D52779 `bdver2` was :)
Namely:
* uops are pretty spot-on (at least what llvm-mca can measure)
{F16422596}
* latency is also pretty spot-on (at least what llvm-mca can measure)
{F16422601}
* throughput is within reason
{F16422607}
I haven't run many benchmarks with this,
however RawSpeed benchmarks say this is beneficial:
{F16603978}
{F16604029}
I'll call out the obvious problems there:
* I didn't really bother with X87 instructions
* I didn't really bother with obviously-microcoded/system instructions
* There is a large discrepancy in throughput for `mr` and `rm` instructions.
I'm not really sure if it's a modelling defect that needs to be fixed,
or a defect of the measurements.
* Pipe distributions are probably bad :)
I can't do much here until AMD allows that to be fixed
by documenting the appropriate counters and updating libpfm
That being said, as @RKSimon notes:
>>! In D94395#2647381, @RKSimon wrote:
> I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions <...>
so how much worse could this possibly be?!
Things that aren't there:
* Various tunings: zero idioms, etc. Those are follow-ups.
Differential Revision: https://reviews.llvm.org/D94395
Apply the same logic used to check if CMPXCHG nodes should be expanded
at -O0: the register allocator may end up spilling some register in
between the atomic load/store pairs, breaking the atomicity and possibly
stalling the execution.
Fixes PR48017
Reviewed By: efriedman
Differential Revision: https://reviews.llvm.org/D101163
Related to PR50172.
Protects us against regressions once we start doing the cttz(zext(x)) -> zext(cttz(x)) transformation in the middle-end.
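A small worked check of the cttz(zext(x)) and zext(cttz(x)) forms, written as
a sketch with hypothetical helpers that model an i8 value widened to i32. The
two forms agree for every non-zero x; they differ for x == 0 when cttz is
defined to return the bit width, so the sketch treats that case separately.

```python
def cttz(value, bits):
    """Count trailing zeros of a `bits`-wide value; returns `bits` for zero."""
    value &= (1 << bits) - 1
    if value == 0:
        return bits
    return (value & -value).bit_length() - 1


def zext(value, from_bits):
    """Zero-extend: the numeric value is unchanged, only the width grows."""
    return value & ((1 << from_bits) - 1)


def forms_agree(x):
    wide_first = cttz(zext(x, 8), 32)    # cttz(zext(x))
    narrow_first = zext(cttz(x, 8), 8)   # zext(cttz(x))
    return wide_first == narrow_first
```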
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101662
This is a long overdue cleanup. Not every use is eliminated; I stuck to uses
that were directly being called from select(), and not the render functions.
Differential Revision: https://reviews.llvm.org/D101590
SIPreEmitPeephole did not try to remove redundant s_set_gpr_idx_*
instructions in blocks that end with a conditional branch instruction.
This seems like a simple oversight.
Differential Revision: https://reviews.llvm.org/D101629
Fixes the following warnings observed when building the experimental
m68k backend (-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="M68k"):
../lib/Target/M68k/M68kMachineFunction.h:71:3: warning: explicitly
defaulted default constructor is implicitly deleted
[-Wdefaulted-function-deleted]
M68kMachineFunctionInfo() = default;
^
../lib/Target/M68k/M68kMachineFunction.h:24:20: note: default
constructor of 'M68kMachineFunctionInfo' is implicitly deleted because
field 'MF' of reference type 'llvm::MachineFunction &' would not be
initialized
MachineFunction &MF;
^
In file included from ../lib/Target/M68k/M68kISelLowering.cpp:18:
In file included from ../lib/Target/M68k/M68kSubtarget.h:17:
../lib/Target/M68k/M68kFrameLowering.h:60:8: warning:
'llvm::M68kFrameLowering::emitCalleeSavedFrameMoves' hides overloaded
virtual functions [-Woverloaded-virtual]
void emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
../include/llvm/CodeGen/TargetFrameLowering.h:215:3: note: hidden
overloaded virtual function
'llvm::TargetFrameLowering::emitCalleeSavedFrameMoves' declared here:
different number of parameters (2 vs 3)
emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
../include/llvm/CodeGen/TargetFrameLowering.h:218:16: note: hidden
overloaded virtual function
'llvm::TargetFrameLowering::emitCalleeSavedFrameMoves' declared here:
different number of parameters (4 vs 3)
virtual void emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
pr/50071
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D101588
These operations don't exist natively, so just let the
target-independent code expand to plain shifts.
The generated sequences could probably be optimized a bit more, but
they seem good enough for now.
Differential Revision: https://reviews.llvm.org/D101574
This patch merges STR<S,D,Q,W,X>pre-STR<S,D,Q,W,X>ui and
LDR<S,D,Q,W,X>pre-LDR<S,D,Q,W,X>ui instruction pairs into a single
STP<S,D,Q,W,X>pre and LDP<S,D,Q,W,X>pre instruction, respectively.
For each pair, there is a MIR test that verifies this optimization.
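The store half of this merge can be modelled with a toy interpreter. This is
only a sketch: the tuple instruction format, the fixed 8-byte access size, and
the merge condition (the second store's immediate equals the access size) are
assumptions made for illustration, not the checks the patch performs.

```python
SIZE = 8  # assumed access size in bytes (X registers)


def execute(instrs, base):
    """Run the toy instructions; return the written memory and final base."""
    mem = {}
    for ins in instrs:
        if ins[0] == "str_pre":
            _, val, off = ins
            base += off              # pre-index: write back the base first
            mem[base] = val
        elif ins[0] == "str_ui":
            _, val, off = ins
            mem[base + off] = val    # unsigned immediate offset, no writeback
        elif ins[0] == "stp_pre":
            _, first, second, off = ins
            base += off
            mem[base] = first
            mem[base + SIZE] = second
    return mem, base


def merge_pairs(instrs):
    """Fold an adjacent str_pre/str_ui pair into one stp_pre."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if cur[0] == "str_pre" and nxt and nxt[0] == "str_ui" and nxt[2] == SIZE:
            out.append(("stp_pre", cur[1], nxt[1], cur[2]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

Running both the original and the merged sequence from the same base should
leave memory and the base register in the same state.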
Differential Revision: https://reviews.llvm.org/D99272
Change-Id: Ie97a20c8c716c08492fe229c22e14e3c98ef08b7
The functionality in SVEIntrinsicOpts::isReinterpretToSVBool was moved in
D101302, however the original now unused function was not removed (NFC).
Differential Revision: https://reviews.llvm.org/D101642
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
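The failure mode can be sketched with a toy exclusive monitor. The class and
function names are hypothetical, and the model assumes that any ordinary store
between the exclusive load and the exclusive store clears the reservation;
real hardware behaviour is implementation defined.

```python
class ExclusiveMonitor:
    def __init__(self):
        self.mem = {}
        self.reserved = None

    def load_exclusive(self, addr):
        self.reserved = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        # Assumption: an ordinary store always clears the reservation.
        self.mem[addr] = value
        self.reserved = None

    def store_exclusive(self, addr, value):
        ok = self.reserved == addr
        if ok:
            self.mem[addr] = value
        self.reserved = None
        return ok


def atomic_add(mon, addr, delta, spill_slot=None, max_tries=100):
    """Return the number of attempts needed, or None if the loop never exits."""
    for tries in range(1, max_tries + 1):
        old = mon.load_exclusive(addr)
        if spill_slot is not None:
            mon.store(spill_slot, old)  # spill inserted by the register allocator
        if mon.store_exclusive(addr, old + delta):
            return tries
    return None
```

Without the spill the loop succeeds on the first attempt; with a spill inside
the loop the exclusive store fails every time, which is the infinite loop the
pseudo-instruction expansion avoids.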
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
This patch introduces a new infrastructure that is used to select the load and
store instructions in the PPC backend.
The primary motivation is that the current implementation of selecting loads/stores
is dependent on the ordering of patterns in TableGen. Given this limitation, we
are not able to easily and reliably generate the P10 prefixed load and store
instructions (such as when the immediates fit within 34 bits). This
refactoring is meant to provide us with more control over the patterns/different
forms to exploit, as well as eliminating the dependency on pattern declaration order in TableGen.
The idea of this refactoring is that it introduces a set of addressing modes that
correspond to different instruction formats of a particular load and store
instruction, along with a set of common flags that describes a load/store.
Whenever a load/store instruction is being selected, we analyze the instruction
and compute a set of flags for it. The computed flags are then used to
select the best load/store addressing mode.
This patch is the first of a series of patches to be committed - it contains the
initial implementation of the refactored load/store selection infrastructure and
also updates P8/P9 patterns to adopt this infrastructure. The idea is that
incremental patches will add more implementation and support, and eventually
the old implementation will be removed.
Differential Revision: https://reviews.llvm.org/D93370
Summary:
This patch implements backend support for adding global variables
directly to the table of contents (TOC), rather than adding the address of the
variable to the TOC.
Currently, this patch will look for the "toc-data" attribute on symbols in the
IR, and then add those symbols to the TOC.
At the moment, this is implemented for 32-bit AIX.
Reviewers: sfertile
Differential Revision: https://reviews.llvm.org/D101178
As discussed in D100107, this patch first converts index_vector to
step_vector, and converts step_vector back to index_vector after LegalizeDAG.
Differential Revision: https://reviews.llvm.org/D100816
DAGCombiner was recently taught how to combine STEP_VECTOR nodes,
meaning the step value is no longer guaranteed to be one by the time it
reaches the backend for lowering.
This patch supports such cases on RISC-V by lowering other step
values to a multiply following the vid.v instruction. It includes a
small optimization for common cases where the multiply can be expressed
as a shift left.
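The lowering strategy can be sketched on plain lists. Here vid() stands in
for the vid.v instruction, and lower_step_vector is a hypothetical helper, not
the backend's actual code.

```python
def vid(n):
    """Model of vid.v: element i holds the index i."""
    return list(range(n))


def lower_step_vector(n, step):
    ids = vid(n)
    if step == 1:
        return ids                            # vid.v alone is enough
    if step > 0 and step & (step - 1) == 0:
        shift = step.bit_length() - 1
        return [i << shift for i in ids]      # power of two: shift left
    return [i * step for i in ids]            # general case: multiply
```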
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D100856
When RVV vector objects exist, accessing a fixed stack object through sp has to skip over the RVV vector objects area. So the StackOffset needs to add a scalable offset equal to the size of the RVV vector objects area.
Differential Revision: https://reviews.llvm.org/D100286
When creating G_SBFX/G_UBFX opcodes, the last operand is the
width instead of the bit position. The bit position is used
for the AArch64 SBFM and UBFM instructions. The bit position
is converted to a width if the SBFX/UBFX aliases are generated.
For other SBFM/UBFM aliases, such as shifts, the bit position
is used.
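The relationship between the two operand forms can be written out as a short
sketch. It relies on UBFX Rd, Rn, #lsb, #width being the alias of
UBFM Rd, Rn, #lsb, #(lsb + width - 1); the function names are hypothetical.

```python
def ubfx_to_ubfm(lsb, width):
    """Width form to bit-position form: returns (immr, imms)."""
    return lsb, lsb + width - 1


def ubfm_to_ubfx(immr, imms):
    """Bit-position form back to width form: returns (lsb, width)."""
    return immr, imms - immr + 1


def extract_unsigned(value, lsb, width):
    """What the unsigned bitfield extract computes."""
    return (value >> lsb) & ((1 << width) - 1)
```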
Differential Revision: https://reviews.llvm.org/D101543
Refactor IsHazardFn and IsExpiredFn to use constant references as these should not be mutating the instructions visited and the instruction can never be null.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D101430
Remove an early-out in wait state counting which can never be
taken.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D101520
As explained in the comments, matchSwap matches:
// mov t, x
// mov x, y
// mov y, t
and turns it into:
// mov t, x (t is potentially dead and move eliminated)
// v_swap_b32 x, y
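A toy register-file model of the two sequences, as a sketch with hypothetical
helper names. It shows why the remaining "mov t, x" may only be dropped when t
is dead afterwards.

```python
def run_movs(regs):
    r = dict(regs)
    r["t"] = r["x"]
    r["x"] = r["y"]
    r["y"] = r["t"]
    return r


def run_swap(regs, keep_t=True):
    r = dict(regs)
    if keep_t:
        r["t"] = r["x"]                  # mov t, x (removable only if t is dead)
    r["x"], r["y"] = r["y"], r["x"]      # v_swap_b32 x, y
    return r
```

With keep_t=False the values of x and y still match the original sequence,
but t does not, so a wrong liveness answer for t changes program behaviour.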
On physical registers we don't have full use-def chains so the check
for T being live-out was not working properly with subregs/superregs.
Differential Revision: https://reviews.llvm.org/D101546
Added an extra analysis to choose a better shuffle kind in the
getShuffleCost functions, giving a better cost estimate when a mask is
provided.
Differential Revision: https://reviews.llvm.org/D100865
When the return value of an atomic image intrinsic is unused, the register
class for the destination of a sub-register copy of the return value ends up
not being set. This copy then hits the 'Register class not set' assert later.
If the return value has uses, the register class is determined by the use
instruction. The fix is to not create the sub-register copy when the image
intrinsic destination has no uses, because it would be deleted by
dead-mi-elimination later anyway.
Differential Revision: https://reviews.llvm.org/D101448
- Add new variantKinds for the symbol's variable offset and region handle
- Print the proper relocation specifier @gd in the asm streamer when emitting
the TC Entry for the variable offset for the symbol
- Fix the switch section failure between the TC Entry of variable offset and
region handle
- Put .__tls_get_addr symbol in the ProgramCodeSects with XTY_ER property
Reviewed by: sfertile
Differential Revision: https://reviews.llvm.org/D100956
This reverts commit 3b8ec86fd5.
Revert "[X86] Refine AMX fast register allocation"
This reverts commit c3f95e9197.
This pass breaks using LLVM in a multi-threaded environment by
introducing global state.
Similar for or/xor with 0 in place of -1.
This is the canonical form produced by InstCombine for something like `c ? x & y : x;`. Since we have to use control flow to expand select, we'll usually end up with a mv in a basic block. By folding this we may be able to pull the and/or/xor into the block instead and avoid a mv instruction.
The code here is based on code from ARM that uses this to create predicated instructions. I'm doing it on SELECT_CC so it happens late, but we could do it on select earlier which is what ARM does. I'm not sure if we lose any combine opportunities if we do it earlier.
I left out add and sub because this can separate sext.w from the add/sub. It also made a conditional i64 addition/subtraction on RV32 worse. I guess both of those would be fixed by doing this earlier on select.
The select-binop-identity.ll test has not been committed yet, but I made the diff show the changes to it.
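The identity behind the fold can be checked exhaustively on a few values.
This is a sketch on 32-bit integers with hypothetical helper names: and uses
-1 (all ones) as the identity, or/xor use 0.

```python
import operator

ALL_ONES = 0xFFFFFFFF  # -1 as a 32-bit value


def select(cond, if_true, if_false):
    return if_true if cond else if_false


def before_fold(op, cond, x, y):
    return select(cond, op(x, y), x)           # c ? x op y : x


def after_fold(op, cond, x, y, identity):
    return op(x, select(cond, y, identity))    # x op (c ? y : identity)


CASES = [
    (operator.and_, ALL_ONES),
    (operator.or_, 0),
    (operator.xor, 0),
]
```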
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D101485
- Currently, the "." (Dot) character, when not identifying an Identifier or a Constant, refers to the current PC (Program Counter)
- However, in z/OS, for the HLASM dialect, it strictly accepts only the "*" as the current PC (Support for this will be put up in a follow-up patch)
- The changes in this patch allow individual platforms to choose whether they would like to use the "." (Dot) character as a marker for the current PC or not.
- It is achieved by introducing a new field in MCAsmInfo.h called `DotIsPC` (similar to `DollarIsPC`)
Reviewed By: abhina.sreeskantharajan
Differential Revision: https://reviews.llvm.org/D100975
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if the user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument which is size_t.
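Why a known upper bound makes the extension removable can be sketched as
follows. VL_MAX here is a hypothetical bound picked only for illustration, not
the value the patch uses; any bound below 2^31 gives the same result.

```python
VL_MAX = 1 << 16  # hypothetical upper bound, for illustration only


def trunc32(value):
    return value & 0xFFFFFFFF


def sext32to64(value):
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000   # replicate the sign bit upwards
    return value


def zext32to64(value):
    return value & 0xFFFFFFFF


def roundtrip_is_noop(vl):
    """True if size_t -> int -> size_t leaves vl unchanged either way."""
    narrowed = trunc32(vl)
    return sext32to64(narrowed) == vl and zext32to64(narrowed) == vl
```

Values within the bound survive the round trip unchanged, so the sext/zext
can be dropped; a value with bit 31 set would not.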
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
At the intrinsic layer the sve.insr operation takes a scalar. When this
scalar is an integer we are forcing a data transition between GPRs and
ZPRs that is potentially costly.
Often the integer scalar is the result of a vector extract, when
performing a reduction for example. In such cases we should keep all
data within the ZPRs.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101169
By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and ZPRs.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Depends on D101302
Differential Revision: https://reviews.llvm.org/D101167
As part of this the ptrue coalescing done in SVEIntrinsicOpts has been
modified to not introduce redundant converts, since the convert removal
will no longer run after that optimisation to clean up.
Differential Revision: https://reviews.llvm.org/D101302
This allows calling buildSpillLoadStore for an empty basic block, where
MI points at the end of the block instead of to an instruction.
This only happens with downstream CFI changes, so I was not able to
create a testcase that works with upstream LLVM.
Differential Revision: https://reviews.llvm.org/D101356
Otherwise the CMP glue may be used in multiple nodes, needing to be
emitted multiple times. Currently this either increases instruction
count or fails as it attempts to insert the same node multiple times.
This patch enables support on SPE for constrained arithmetic and
comparison operations. This fixes bugzilla 50070.
One thing not covered is fcmp vs. fcmps on SPE. Some condition codes
generate a signaling comparison while others do not. In this patch, all are
considered signaling. So there might still be some issues when
compiling from C code.
Reviewed By: jhibbits
Differential Revision: https://reviews.llvm.org/D101282
This is a complementary/alternative fix for D99068. It takes a slightly
different approach by explicitly summing up all of the required split
part type sizes and ensuring we allocate enough space for them. It also
takes the maximum alignment of each part.
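The sizing rule can be sketched as a small helper. The function name and the
(size, alignment) tuple layout are assumptions for illustration; the sketch
only sums the sizes and takes the largest alignment, as described above.

```python
def stack_slot_for_parts(parts):
    """parts is a list of (size_in_bytes, alignment) for each split part."""
    total_size = sum(size for size, _ in parts)
    max_align = max(align for _, align in parts)
    return total_size, max_align
```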
Compared with D99068 there are fewer changes to the stack objects in
existing tests. However, @luismarques has shown in that patch that there
are opportunities to reduce our stack usage in the future.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D99087
As X32 uses 32-bit pointers without having 32-bit indirect branch
instructions, we need to fix up indirect branches by extending the
branch targets to 64 bits. This was already done for BRIND but not yet
for NT_BRIND. The same logic works for both, so this applies that
existing logic to NT_BRIND as well.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101499
The ARMConstantIsland pass will convert any t2B to tB if they are within
range after it has added or moved any constant pools. They don't need to
be deliberately converted beforehand, and it doesn't deal with needing
to convert tB to t2B very well.