A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is placed onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).
The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.
Differential Revision: https://reviews.llvm.org/D131253
Src1 for mbcnt can be a non-zero literal or register. Take this into account
when calculating known bits.
Differential Revision: https://reviews.llvm.org/D131478
The patch uses a peephole to fold merge.vvm and unmasked intrinsics into
masked intrinsics. A peephole is used instead of tablegen patterns to avoid a
large amount of auto-generated code.
Note: The patch ignores segment loads since I don't know how to test them.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D130442
Commit 8922adf646 recently made JITTargetMachineBuilder honor the
hasJIT property of the target. LLVM supports just-in-time compilation
on RISC-V, so set the flag.
Differential Revision: https://reviews.llvm.org/D131617
This patch makes the variants of `mm*_cast*` intel intrinsics that use `shufflevector(freeze(poison), ..)` emit efficient assembly.
(These intrinsics are planned to use `shufflevector(freeze(poison), ..)` after shufflevector's semantics update; relevant thread: D103874)
To do so, this patch
1. Updates `LowerAVXCONCAT_VECTORS` in X86ISelLowering.cpp to recognize `FREEZE(UNDEF)` operand of `CONCAT_VECTOR` in addition to `UNDEF`
2. Updates X86InstrVecCompiler.td to recognize `insert_subvector` of `FREEZE(UNDEF)` vector as its first operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D130339
Extending from or truncating to a mask vector does not use the same instructions as a normal cast, so this patch changes the cost to 2, which is the number of instructions used.
Differential Revision: https://reviews.llvm.org/D131552
All the prologue instructions should have unknown source location
co-ordinates while the epilogue instructions should have source
location of the last non-debug instruction after which the epilogue
instructions are inserted.
This ensures the prologue/epilogue markers are generated correctly
in the line table.
Changes are brought in from the downstream CFI patches.
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D131485
These changes address issue
https://github.com/llvm/llvm-project/issues/55857.
Since R30/S30 is used as pointer (32 bits) for GOT Table in the ppc32 ABI,
remove it from the SPE callee save register when PIC is enabled.
This prevents emitting the SPE load and store for S30 and S31 regs.
Differential revision: https://reviews.llvm.org/D127495
This addresses various regressions in D131260, and is also a useful optimization in itself.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D131358
SPE doesn't have a fmadd instruction, so don't bother hoisting a
multiply and add sequence to this, as it'd become just a library call.
Hoisting happens too late for the CTR usability test to veto using the
CTR in a loop, and results in an assert "Invalid PPC CTR loop!".
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128642
Prior to this patch, libcalls inserted by the SelectionDAG legalizer
could never be tailcalled. The eligibility of libcalls for tail calling
is partly determined by checking TargetLowering::isInTailCallPosition
and comparing the return type of the libcall and the caller.
isInTailCallPosition in turn calls TargetLowering::isUsedByReturnOnly
(which always returns false if not implemented by the target).
This patch provides a minimal implementation of
TargetLowering::isUsedByReturnOnly - enough to support tail calling
libcalls on hard float ABIs. Soft-float ABIs are left for a follow on
patch. libcall-tail-calls.ll also shows missed opportunities to tail
call integer libcalls, but this is due to issues outside of
the isUsedByReturnOnly hook.
Differential Revision: https://reviews.llvm.org/D131087
This specific optimisation is handled in OptimizeBlock in BranchFolding
so is redundant. As discussed on the review thread, I've verified that
we have test coverage for that optimisation within test/CodeGen/X86 by
disabling the BranchFolding version of this transform after applying
this patch and rerunning the test suite.
Differential Revision: https://reviews.llvm.org/D129204
WebAssembly globals are represented as IR globals with the wasm_var
address space (AS1). Prior to this patch, a wasm global load that isn't
lowerable will produce a failure to select, while a wasm global store
will produce incorrect code. This patch ensures we consistently produce
a clear error.
As noted in the test cases, it's conceivable that a frontend or an
optimisation pass could produce similar IR even in the presence of the
semantic restrictions on pointers to Wasm globals in the frontend, which
is a separate problem to address.
Differential Revision: https://reviews.llvm.org/D131387
The cost of converting from or to a mask vector is different from other cases, so we cannot use PowDiff to calculate it. This patch sets it to 3, as we use 3 instructions to perform the conversion.
Differential Revision: https://reviews.llvm.org/D131149
This change finalizes the series of patches aiming to replace the old
strategy of lowering VGPR to SGPR copies. Following
https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367, code
parts that are no longer used were removed. The pass main loop no longer
changes the MIR but collects information for further analysis. The actual MIR
lowering happens later, according to the analysis result, in a set of separate
functions. Another important change concerns the order of lowering: VGPR to
SGPR copies are lowered first so that they take priority over the rest of the
MIR changes.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131246
ROPI/RWPI are not supported with LOAD_STACK_GUARD currently.
Reviewed By: nickdesaulniers, rengolin
Differential Revision: https://reviews.llvm.org/D131427
After D121595 was committed, I noticed regressions associated with small trip
count numbers in vectorisation by tail folding with scalable vectors. As a
solution for those issues I propose introducing a minimal trip count threshold
value.
Differential Revision: https://reviews.llvm.org/D130755
This patch adds the names of the Arm Architecture Reference Manual (ARM)
features to the corresponding Subtarget Features in the AArch64 backend
and target parser.
The aim of this is to make it clearer what architectural features a
subtarget feature might enable (so, which features a CPU must provide to
support that subtarget feature), and so make it easier to add new CPUs
in the future.
Differential Revision: https://reviews.llvm.org/D131257
si-annotate-control-flow does depth first traversal of BB's of
a function to insert amdgcn if intrinsics for conditional
branches so that isel can generate correct instructions later.
si-annotate-control-flow checks whether the successor BB for the 'else'
branch of a conditional branch has been visited. If it has been
visited, si-annotate-control-flow assumes the conditional
branch has been handled and will not try to insert if intrinsic
for it.
This assumption is not correct when the IR contains multiple
unreachable BB's. Then 'if' intrinsics are not inserted and incorrect
ISA is generated.
This patch fixes the issue by letting amdgpu-unify-divergent-exit-nodes
unify unreachables even if they are uniformly reached. In this way
the IR will not contain multiple exits, and the structurizer is able to
structurize the IR containing one unified exit.
Reviewed by: Ruiling Song, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D131181
Fixes: SWDEV-343244
This adds a +forced-atomics target feature with the same semantics
as +atomics-32 on ARM (D130480). For RISCV targets without the +a
extension, this forces LLVM to assume that lock-free atomics
(up to 32/64 bits for riscv32/64 respectively) are available.
This means that atomic load/store are lowered to a simple load/store
(and fence as necessary), as these are guaranteed to be atomic
(as long as they're aligned). Atomic RMW/CAS are lowered to __sync
(rather than __atomic) libcalls. Responsibility for providing the
__sync libcalls lies with the user (for privileged single-core code
they can be implemented by disabling interrupts). Code using
+forced-atomics and -forced-atomics are not ABI compatible if atomic
variables cross the ABI boundary.
For context, the difference between __sync and __atomic is that the
former are required to be lock-free, while the latter requires a
shared global lock provided by a shared object library. See
https://llvm.org/docs/Atomics.html#libcalls-atomic for a detailed
discussion on the topic.
This target feature will be used by Rust's riscv32i target family
to support the use of atomic load/store without atomic RMW/CAS.
Differential Revision: https://reviews.llvm.org/D130621
This allows relaxing some relocations to STT_SECTION symbol+offset
instead of emitting a relocation against a symbol.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131433
ARMAsmPrinter::emitFunctionEntryLabel() was not calling the base class
function so the $local alias was not being emitted. This should not have
any functional effect right now since ARM does not generate different code
for the $local symbols, but it could be improved in the future.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131392
This allows a number of optimisation passes to work.
E.g. BranchFolding and MachineBlockPlacement.
Differential Revision: https://reviews.llvm.org/D131316
The register operand of DBG_VALUE is not selected to a proper register
bank in both AArch64 and X86. This would cause getRegClass crash after
global ISel. After discussion, we think the MIR should assume that all
virtual registers have been assigned a proper register class after global ISel,
so this patch fixes the gap for DBG_VALUE on AArch64 and X86.
Differential Revision: https://reviews.llvm.org/D129037
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
is added and CompleteModel is set to one.
2. Information for pseudo SVE instructions is added. Those
instructions are present at the time of scheduling.
3. Resource and latency information for SVE instructions is modified
to be more accurate.
For example, the description for CMPEQ, which consumes one cycle
each of unit FLA and PPR, is as follows.
```
Previous:
def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```
Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf
Reviewed By: dmgreen, kawashima-fj
Differential Revision: https://reviews.llvm.org/D131165
Map hardware loop intrinsics loop_decrement and set_loop_iteration
to the new PowerPC pseudo instructions, so that the hardware loop
intrinsics will be expanded to normal cmp+branch form or ctrloop
form based on the CTR register usage on MIR level.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D123366
The floating point stores use a different register class, so it
probably makes sense to have a different SchedRead.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D131379
The ABI for big-endian AArch32, as specified by AAELF32, is above-
averagely complicated. Relocatable object files are expected to store
instruction encodings in byte order matching the ELF file's endianness
(so, big-endian for a BE ELF file). But executable images can
//either// do that //or// store instructions little-endian regardless
of data and ELF endianness (to support BE32 and BE8 platforms
respectively). They signal the latter by setting the EF_ARM_BE8 flag
in the ELF header.
(In the case of the Thumb instruction set, this all means that each
16-bit halfword of a Thumb instruction is stored in one or other
endianness. The two halfwords of a 32-bit Thumb instruction must
appear in the same order no matter what, because the first halfword is
the one that must avoid overlapping the encoding of any 16-bit Thumb
instruction.)
llvm-objdump was unconditionally expecting Arm instructions to be
stored little-endian. So it would correctly disassemble a BE8 image,
but if you gave it a BE32 image or a BE object file, it would retrieve
every instruction in byte-swapped form and disassemble it to
nonsense. (Even an object file output by LLVM itself, because
ARMMCCodeEmitter outputs instructions big-endian in big-endian mode,
which is correct for writing an object file.)
This patch allows llvm-objdump to correctly disassemble all three of
those classes of Arm ELF file. It does it by introducing a new
SubtargetFeature for big-endian instructions, setting it from the ELF
image type and flags during llvm-objdump setup, and teaching both
ARMDisassembler and llvm-objdump itself to pay attention to it when
retrieving instruction data from a section being disassembled.
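The byte-order rules described above can be sketched as follows. This is only an illustration, not the llvm-objdump implementation: the helper names `instruction_byteorder` and `read_arm_word` are hypothetical, and the sketch covers 32-bit Arm-state words only (Thumb halfwords are not modelled).

```python
EF_ARM_BE8 = 0x00800000  # e_flags bit from the Arm ELF ABI
ET_REL = 1               # ELF e_type value for relocatable object files


def instruction_byteorder(elf_is_big_endian, e_type, e_flags):
    """Return 'big' or 'little' for instruction words (hypothetical helper)."""
    if not elf_is_big_endian:
        return "little"
    if e_type != ET_REL and (e_flags & EF_ARM_BE8):
        # BE8 image: data is big-endian, instructions are little-endian.
        return "little"
    # BE object file or BE32 image: instructions follow the ELF endianness.
    return "big"


def read_arm_word(data, byteorder):
    """Decode one 32-bit Arm instruction from the first four bytes."""
    return int.from_bytes(data[:4], byteorder)


MOV_R0_0 = 0xE3A00000                    # encoding of `mov r0, #0`
stored_be = MOV_R0_0.to_bytes(4, "big")  # as written into a BE object file

order = instruction_byteorder(True, ET_REL, 0)
print(hex(read_arm_word(stored_be, order)))     # decoded with the right order
print(hex(read_arm_word(stored_be, "little")))  # byte-swapped nonsense
```

Reading the same four bytes with the wrong order yields a different word, which is the "nonsense" disassembly the message refers to.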
Differential Revision: https://reviews.llvm.org/D130902
Add patterns to select predicated instructions when lowering:
fadd(a, select(mask, b, splat(0)))
fsub(a, select(mask, b, splat(0)))
'fadd' is unsafe unless no-signed zeros fast-math flag is set, since
-0.0 + 0.0 = 0.0
changes the sign. Alive2: https://alive2.llvm.org/ce/z/wbhJh_
Also adds FMA patterns for:
fadd(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)
fsub(a, select(mask, mul(b, c), splat(0))) -> fmls(a, mask, b, c)
These patterns require the 'contract' fast-math flag to be set, and the
fadd 'nsz' as above.
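The signed-zero hazard can be reproduced with ordinary IEEE-754 doubles. The sketch below models a single lane; `unpredicated` and `predicated` are hypothetical helper names, not anything from the patch.

```python
import math


def unpredicated(a, b, mask):
    # One lane of fadd(a, select(mask, b, splat(0))).
    return a + (b if mask else 0.0)


def predicated(a, b, mask):
    # One lane of a predicated add: inactive lanes keep `a` unchanged.
    return a + b if mask else a


a = -0.0
# copysign exposes the sign bit, which `==` cannot (since -0.0 == 0.0).
print(math.copysign(1.0, unpredicated(a, 1.0, False)))
print(math.copysign(1.0, predicated(a, 1.0, False)))
# Subtracting +0.0 preserves the sign of -0.0, so fsub needs no nsz flag.
print(math.copysign(1.0, a - 0.0))
```

For an inactive lane holding -0.0, the unpredicated form produces +0.0 while the predicated form leaves -0.0 in place, so the two only agree when signed zeros may be ignored.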
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130564
This patch ensures the `$fp` always points to the bottom of the vararg
spill region.
Includes support for expand `ISD::DYNAMIC_STACKALLOC`.
Differential Revision: https://reviews.llvm.org/D130250
std::iterator has been deprecated in C++17, and some standard library implementations such as MS STL or libc++ emit deprecation messages when the class is used.
Since LLVM has now switched to C++17, these will emit warnings on these implementations, or worse, errors in build configurations using -Werror.
This patch fixes these issues by replacing them with LLVM's own llvm::iterator_facade_base, which offers a superset of the functionality of std::iterator.
Differential Revision: https://reviews.llvm.org/D131320
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work. For example adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
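A rough sketch of the preference order described above. The function `choose_add_sequence` and its exact checks are illustrative assumptions, not the backend's actual selection code.

```python
def fits_simm12(v):
    # Range of the 12-bit signed immediate accepted by addi.
    return -2048 <= v <= 2047


def choose_add_sequence(c):
    """Pick a sequence for `x + c` (illustrative ordering only)."""
    if fits_simm12(c):
        return "addi"
    if c % 4096 == 0 and -(1 << 31) <= c < (1 << 31):
        # Low 12 bits are zero, so a single lui materializes c.
        return "lui+add"
    if c % 8 == 0 and fits_simm12(c // 8):
        # li materializes c/8, then sh3add computes ((c/8) << 3) + x.
        return "li+sh3add"
    return "other"


for c in (100, 8192, 8200):
    print(c, choose_add_sequence(c))
```

With the lui check placed before the sh3add check, 8192 is handled by lui+add even though 8192 / 8 = 1024 would also fit the li+sh3add trick.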
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:
vsetvli a1, a0, e32, m1, ta, mu # both DefInfo and PrevInfo
vsetvli a2, a1, e32, m4, ta, mu
With the unsound result being:
vsetvli a1, a0, e32, m1, ta, mu
vsetvli a2, a0, e32, m4, ta, mu
Consider the case where this is running on a machine with VLEN=512. For this case, the VLMAXs are 16 and 64 respectively.
Consider for a0 = 33. The correct result is: a1 = 16, and a2 = 16
After the unsound optimization: a1 = 16 and a2 = 33
This particular example used VLMAXs which differed by more than a power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that one's more complicated to construct, as many examples turn out to be accidentally sound.
This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.
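The arithmetic in the example above can be checked with a small model. This is a simplification: it assumes vl = min(AVL, VLMAX), which matches the ISA for the values used here but ignores the extra freedom implementations have when VLMAX < AVL < 2 x VLMAX. The helper names are hypothetical.

```python
def vlmax(vlen, sew, lmul):
    # Number of elements in a register group: VLEN / SEW * LMUL.
    return vlen // sew * lmul


def set_vl(avl, vlen, sew, lmul):
    # Simplified model of the vl result: min(AVL, VLMAX).
    return min(avl, vlmax(vlen, sew, lmul))


VLEN = 512
a0 = 33

a1 = set_vl(a0, VLEN, 32, 1)           # e32, m1 -> VLMAX = 16
a2_correct = set_vl(a1, VLEN, 32, 4)   # AVL taken from a1
a2_unsound = set_vl(a0, VLEN, 32, 4)   # AVL wrongly forwarded from a0

print(a1, a2_correct, a2_unsound)
```

Forwarding a0 past the first instruction skips the clamp to the smaller VLMAX, which is why the second result changes from 16 to 33.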
Differential Revision: https://reviews.llvm.org/D131264
NOTE: i8 vector splats are ignored because the immediate range of
DUP already has full coverage.
Differential Revision: https://reviews.llvm.org/D131078
There are no AMDGPUSampleVariant versions for _G16; it is treated more like a
modifier for derivatives (_D) (also for intrinsics, where it is an overloaded
type instead of part of the intrinsic name), so we ended up making more variants
for these instructions than we actually needed.
32-bit derivatives need 6 dwords at most, while 16-bit need 4 at most. Using the
same AMDGPUSampleVariant for both, we ended up creating 2 more variants per
instruction than were necessary.
In total this deletes 260 unused tablegen records.
Differential Revision: https://reviews.llvm.org/D131252
According to the LoongArch ABI documentation
(https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
the calling convention of LoongArch is almost the same as the RISCV's
(except for the vector part), so we borrow the implementation of RISCV.
This patch only guarantees the correctness of lp64d, because only the
part of lp64d is described in detail in the documentation.
Differential Revision: https://reviews.llvm.org/D130249
This is a simple addition to emitConditionalComparison, to match CCMP
with immediates using getIConstantVRegValWithLookThrough, letting it
select the CCMPri variants of the instructions.
Differential Revision: https://reviews.llvm.org/D131073
The BMI instruction `tzcnt` performs better than `bsf` on newer
processors. Its encoding differs from that of `bsf` by a mandatory
'0xf3' prefix. If we always emit a `rep` prefix for `bsf`, the same code
gains performance when it runs on newer processors.
GCC already does this: https://c.godbolt.org/z/6xere6fs1
Fixes #34191
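As a byte-level illustration of the encoding relationship, the sketch below spells out the register forms of `bsf eax, ecx` and `tzcnt eax, ecx` by hand. The helper names are hypothetical, and `tzcnt32` is only a reference model of the instruction's result.

```python
BSF_EAX_ECX = bytes([0x0F, 0xBC, 0xC1])          # bsf eax, ecx
TZCNT_EAX_ECX = bytes([0xF3, 0x0F, 0xBC, 0xC1])  # tzcnt eax, ecx
REP_PREFIX = bytes([0xF3])


def with_rep_prefix(encoding):
    # Prepend the 0xf3 prefix byte to an existing encoding.
    return REP_PREFIX + encoding


def tzcnt32(x):
    """Reference model of 32-bit tzcnt: trailing zero count, 32 for zero."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1


print(with_rep_prefix(BSF_EAX_ECX) == TZCNT_EAX_ECX)
print(tzcnt32(8), tzcnt32(0))
```

Processors without BMI ignore the prefix and execute `bsf`, while newer ones decode the same bytes as `tzcnt`. The two agree for non-zero inputs; for a zero input `bsf` leaves its destination undefined, so the trick only applies where that case does not matter.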
Reviewed By: craig.topper, skan
Differential Revision: https://reviews.llvm.org/D130956
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.
After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).
This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
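The equivalence behind this fold can be checked numerically. The sketch below models 64-bit registers with Python integers; it assumes the constant on the original add has zero low 32 bits (written `c1 << 32` here), which is an interpretation of the shorthand in the message rather than something it states. The helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed(v, bits):
    # Interpret the low `bits` bits of v as a two's complement value.
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v


def before(x, c1, c):
    # (sra (add (shl X, 32), C1 << 32), 32 - C)
    t = ((x << 32) + (c1 << 32)) & MASK64
    return (to_signed(t, 64) >> (32 - c)) & MASK64


def after(x, c1, c):
    # (shl (sext_inreg (add X, C1), i32), C)
    s = to_signed(x + c1, 32)
    return (s << c) & MASK64


samples = [(0x7FFFFFFF, 1, 2), (5, -7, 3), (0x123456789, 16, 0), (0xFFFFFFFF, 1, 31)]
print(all(before(x, c1, c) == after(x, c1, c) for x, c1, c in samples))
```

Because every sra user of the add rewrites to the same `(sext_inreg (add X, C1), i32)`, the differing shift amounts end up as separate shl nodes over one shared value.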
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
ignore the use count on the (shl X, 32).
The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
This allows the construct to be shared between different backends. However, it
still remains illegal to use TypedPointerType in LLVM IR--the type is intended
to remain an auxiliary type, not a real LLVM type. So no support is provided for
LLVM-C, nor bitcode, nor LLVM assembly (besides the bare minimum needed to make
Type->dump() work properly).
Reviewed By: beanz, nikic, aeubanks
Differential Revision: https://reviews.llvm.org/D130592
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice the input vector's SEW. In this situation, the cost of an extended add reduction should be the same as a single-width add reduction. The same applies to the vector floating-point widening reduction.
Differential Revision: https://reviews.llvm.org/D129994
The isOnlyUserOf prevented the fold if the chain result had any
users. What we really care about is that the data result from the
AND is only used by the TEST, and the flags results from the ANDs
aren't used at all. It's ok if the chain has users, we just need
to replace those users with the chain from the TESTrm.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D131117
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.
To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.
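A small model of why bit 31 matters. The helpers `and_mask`, `sext_inreg_i32` and `should_transform` are hypothetical stand-ins, and `known_zero` plays the role of the zero mask that computeKnownBits would report.

```python
MASK32 = 0xFFFFFFFF
HIGH32 = 0xFFFFFFFF00000000


def and_mask(x):
    # (and X, 0xffffffff) on a 64-bit value.
    return x & MASK32


def sext_inreg_i32(x):
    # (sext_inreg X, i32), returned as an unsigned 64-bit value.
    lo = x & MASK32
    return lo | HIGH32 if lo & 0x80000000 else lo


def should_transform(known_zero):
    # Skip the rewrite when bit 31 is known to be zero: the two forms are
    # then identical, and the sext_inreg would be turned back into an 'and'.
    return not (known_zero & (1 << 31))


x = 0x1234567812345678  # bit 31 of x is clear
print(and_mask(x) == sext_inreg_i32(x))
print(should_transform(known_zero=1 << 31))
```

When bit 31 is clear the sign extension adds nothing, which is exactly the situation in which the two combines would otherwise keep undoing each other.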
Fixes PR56905.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131113
If we're going to emit a rep prefix before bsf as proposed in
D130956, it makes sense to promote i16 operations to i32 to avoid
the false dependency of tzcntw.
Reviewed By: skan, pengfei
Differential Revision: https://reviews.llvm.org/D130995
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.
Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.
Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.
Differential Revision: https://reviews.llvm.org/D130191
Enable SGPRs for the following operands of these opcodes:
- src operands of VOP3 variant.
- src2 operand of DPP variants.
Differential Revision: https://reviews.llvm.org/D130989
The BMI instruction `tzcnt` performs better than `bsf` on newer
processors. Its encoding differs from that of `bsf` by a mandatory
'0xf3' prefix. If we always emit a `rep` prefix for `bsf`, the same code
gains performance when it runs on newer processors.
GCC already does this: https://c.godbolt.org/z/6xere6fs1
Fixes #34191
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D130956
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...
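As a language-neutral analogy (not the C++ change itself), Python's `enum.IntFlag` shows the same idea: once the enum is a bitmask type, union and intersection are plain `|` and `&`. The member values below are an assumption made for the sketch.

```python
from enum import IntFlag


class ModRefInfo(IntFlag):
    # Values are assumed for illustration.
    NoModRef = 0
    Ref = 1
    Mod = 2
    ModRef = Ref | Mod


# What helper functions such as unionModRef/intersectModRef would express:
union = ModRefInfo.Ref | ModRefInfo.Mod
intersection = ModRefInfo.ModRef & ModRefInfo.Mod

print(union == ModRefInfo.ModRef)
print(intersection == ModRefInfo.Mod)
```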
Differential Revision: https://reviews.llvm.org/D130870
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
When compiling for multiple targets the scheduler that is selected via the
-misched option is applied globally. This patch adds a target CL option instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131022
This used to print from the ADDI where the operand number was
correct. It recently changed to print from the LUI or AUIPC which
needs to use operand 1 instead of 2.
This shows up as a crash with -debug.
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130869
When emitting an object file from clang directly, the DirectX backend hits an assert because the passes for AsmPrinter are not initialized.
The fix initializes the passes by calling createPassConfig.
Also ignore global variables that have no section in DXILAsmPrinter::emitGlobalVariable to avoid hitting llvm_unreachable in DXILTargetObjectFile::SelectSectionForGlobal.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D130856
rGcf97e0ec42b8 makes $x18 be treated as callee-saved in functions with
the Windows calling convention on non-Windows OSes.
Here we mark $x18 as callee-saved for functions with Windows calling
convention on Darwin, as well as on other non-Windows platforms, in
order to prevent some miscompilations (like miscompilation of
win64cc-darwin-backup-x18.ll).
Since getCalleeSavedRegs doesn't return x18 in list of callee-saved
registers, assignCalleeSavedSpillSlots and determineCalleeSaves
consider different sets of registers as callee-saved. It causes an
error:
```
Assertion failed: ((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file
AArch64MachineFunctionInfo.h, line 292.
```
Differential Revision: https://reviews.llvm.org/D130676
The ZERO register should be exposed as a constant physical register through the interface TargetRegisterInfo::isConstantPhysReg.
Differential Revision: https://reviews.llvm.org/D130932
I think these pseudos will exist when the post-RA scheduler runs
so they should have sched classes.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D130945
In 2e29b0138c we introduced a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
At the same time we still have the code that blindly converts REG_SEQUENCE and PHIs to VALU
in case they produce SGPR but have VGPR input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
the copy to v_readfirstlane_b32, further lowering them to VALU leads to several kinds of issues.
First, we have a v_readfirstlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.
Differential Revision: https://reviews.llvm.org/D130258
This improves a corner case where v_fmac can be converted to v_fma on
GFX10+ even if it has a literal operand.
Differential Revision: https://reviews.llvm.org/D130992
The problem Alexander reported on D127982 was caused by an optimization
for AVX512-FP16 instructions. We must limit it to targets with that feature enabled.
During the investigation, I found we didn't expand fp_round/fp_extend
without F16C. This may result in a runtime crash, so change them too.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130817