std::iterator has been deprecated in C++17 and some standard library implementations such as MS STL or libc++ emit deperecation messages when using the class.
Since LLVM has now switched to C++17 these will emit warnings on these implementations, or worse, errors in build configurations using -Werror.
This patch fixes these issues by replacing them with LLVMs own llvm::iterator_facade_base which offers a superset of functionality of std::iterator.
Differential Revision: https://reviews.llvm.org/D131320
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work. For example adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:
vsetivli a1, a0, e32, m1, ta, mu # both DefInfo and PrevInfo
vsetivli a2, a1, e32, m4, ta, mu
With the unsound result being:
vsetivli a1, a0, e32, m1, ta, mu
vsetivli a2, a0, e32, m4, ta, mu
Consider the case where this is running on a machine with VLEN=512,. For this case, the VLMAXs are 16 and 64 respectively.
Consider for a0 = 33. The correct result is: a1 = 16, and a2 = 16
After the unsound optimization: a1 = 16 and a2 = 33
This particular example used VLMAXs which differed by more than a power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that ones more complicated to construct as many examples turn out accidentally sound.
This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.
Differential Revision: https://reviews.llvm.org/D131264
NOTE: i8 vector splats are ignored because the immediate range of
DUP already has full coverage.
Differential Revision: https://reviews.llvm.org/D131078
There are no AMDGPUSampleVariant versions for _G16, it is treated more like a
modifier for derivatives (_D) (also for intrinsics where it is overloaded type
instead of part of instrinsic name) so we ended up making more variants for
these instruction then we actually needed.
32-bit derivatives need 6 dwords at most, while 16-bit need 4 at most. Using
same AMDGPUSampleVariant for both, we ended up creating 2 extra variants per
instruction than were necessary.
In total this deletes 260 unused tablegen records.
Differential Revision: https://reviews.llvm.org/D131252
According to the description of the LoongArch abi documentation,
(https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
the calling convention of LoongArch is almost the same as the RISCV's
(except for the vector part), so we borrow the implementation of RISCV.
This patch only guarantees the correctness of lp64d, because only the
part of lp64d is described in detail in the documentation.
Differential Revision: https://reviews.llvm.org/D130249
This is a simple addition to emitConditionalComparison, to match CCMP
with immediates using getIConstantVRegValWithLookThrough, letting it
select the CCMPri variants of the instructions.
Differential Revision: https://reviews.llvm.org/D131073
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.
GCC has already done this way: https://c.godbolt.org/z/6xere6fs1Fixes#34191
Reviewed By: craig.topper, skan
Differential Revision: https://reviews.llvm.org/D130956
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.
After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).
This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), C)
ignore the use count on the (shl X, 32).
The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
This allows the construct to be shared between different backends. However, it
still remains illegal to use TypedPointerType in LLVM IR--the type is intended
to remain an auxiliary type, not a real LLVM type. So no support is provided for
LLVM-C, nor bitcode, nor LLVM assembly (besides the bare minimum needed to make
Type->dump() work properly).
Reviewed By: beanz, nikic, aeubanks
Differential Revision: https://reviews.llvm.org/D130592
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.
Differential Revision: https://reviews.llvm.org/D129994
The isOnlyUserOf prevented the fold if the chain result had any
users. What we really care about is the the data result from the
AND is only used by the TEST, and the flags results from the ANDs
aren't used at all. It's ok if the chain has users, we just need
to replace those users with the chain from the TESTrm.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D131117
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.
To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.
Fixes PR56905.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131113
If we're going to emit a rep prefix before bsf as proposed in
D130956, it makes sense to promote i16 operations to i32 to avoid
the false depedency of tzcntw.
Reviewed By: skan, pengfei
Differential Revision: https://reviews.llvm.org/D130995
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.
Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.
Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.
Differential Revision: https://reviews.llvm.org/D130191
Enable SGPRs for the following operands of these opcodes:
- src operands of VOP3 variant.
- src2 operand of DPP variants.
Differential Revision: https://reviews.llvm.org/D130989
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.
GCC has already done this way: https://c.godbolt.org/z/6xere6fs1Fixes#34191
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D130956
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...
Differential Revision: https://reviews.llvm.org/D130870
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
When compiling for multiple targets the scheduler that is selected via the
-misched option is applied globally. This patch adds a target CL option instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131022
This used to print from the ADDI where the operand number was
correct. It recently changed to print from the LUI or AUIPC which
needs to use operand 1 instead of 2.
This shows up as a crash with -debug.
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130869
When emit-obj from clang directly, DirectX backend will hit assert caused by not initialize passes for AsmPrinter.
The fix will initialize the passes by calling createPassConfig.
Also ignore global variable which not has section in DXILAsmPrinter::emitGlobalVariable to avoid hit llvm_unreachable in DXILTargetObjectFile::SelectSectionForGlobal.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D130856
rGcf97e0ec42b8 makes $x18 to be treated as callee-saved in functions with
Windows calling convention on non-Windows OSes.
Here we mark $x18 as callee-saved for functions with Windows calling
convention on Darwin, as well as on other non-Windows platforms, in
order to prevent some miscompilations (like miscompilation of
win64cc-darwin-backup-x18.ll).
Since getCalleeSavedRegs doesn't return x18 in list of callee-saved
registers, assignCalleeSavedSpillSlots and determineCalleeSaves
consider different sets of registers as callee-saved. It causes an
error:
```
Assertion failed: ((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file
AArch64MachineFunctionInfo.h, line 292.
```
Differential Revision: https://reviews.llvm.org/D130676
The ZERO register should be exposed as a constant physical register through the interface TargetRegisterInfo::isConstantPhysReg.
Differential Revision: https://reviews.llvm.org/D130932
I think these pseudos will exist when the post-RA scheduler runs
so they should have sched classes.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D130945
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.
Differential Revision: https://reviews.llvm.org/D130258
This improves a corner case where v_fmac can be converted to v_fma on
GFX10+ even if it has a literal operand.
Differential Revision: https://reviews.llvm.org/D130992
The problem Alexander reported on D127982 was caused by an optimization
for AVX512-FP16 instruction. We must limit it to the feature enabled only.
During the investigation, I found we didn't expand for fp_round/fp_extend
without F16C. This may result runtime crash, so change them too.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130817
This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16
CMLTz instruction. The Srl and And extract the top bit (whether the
input is negative) and the Mul sets all values in the i16 half to all
1/0 depending on if that top bit was set. This is equivalent to a v8i16
CMLTz instruction. The same applies to other sizes with equivalent
constants.
Differential Revision: https://reviews.llvm.org/D130874
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
* TargetFrameLowering has a TransientStackAlignment field that "returns
the number of bytes to which the stack pointer must be aligned at all
times, even between calls.
* As explained in the [RISC-V calling
convention](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc),
the stack pointer must remain fully aligned throughout execution for
compliant code. This is important for embedded targets that might avoid
realigning the stack pointer for interrupt service routines. Systems
running full OSes may always realign the stack anyway.
* TransientStackAlignment is used in estimateStackSize in
MachineFrameInfo and in PEI::calculateFrameObjectOffsets.
* estimateStackSize is only used in the RISC-V backend for scavenging
slots. It may be possible to craft a function where the difference
is observable, but it wouldn't be a meaningful test.
* calculateFrameObjectOffsets makes use of TransientStackAlignment,
but then sets the stack alignment to the max of that alignment and
MaxAlign, which is unconditionally set to 16 in
RISCVFrameLowering::processFunctionBeforeFrameFinalized
* I've changed this logic to only set MaxAlign if there are RVV frame
objects. There should be no functional change here for either RVV
targets (MaxAlign is set as before) or non-RVV targets
(TransientStackAlign is now 16 anyway).
Differential Revision: https://reviews.llvm.org/D130068
Now the API getExtendedAddReductionCost is used to determine the cost of extended Add reduction with optional Mul. For Arm, it could cover the cases. But for other target, for example: RISCV, they support other kinds of extended recution, such as FAdd.
This patch does the following changes:
1, Split getExtendedAddReductionCost into 2 new API: getExtendedReductionCost which handles the extended reduction with addtional input of Opcode; getMulAccReductionCost which handle the MLA cases the getExtendedAddReductionCost.
2, Refactor getReductionPatternCost, add some contraint condition to make sure the getMulAccReductionCost should only handle the reuction of Add + Mul.
Differential Revision: https://reviews.llvm.org/D130868
It's possible we have:
lui a0, %hi(sym)
addi a0, %lo(sym)
addi a0, <offset1>
lw a0, <offset2>(a0)
We want to arrive at
lui a0, %hi(sym+offset1+offset2)
lw a0, %lo(sym+offset1+offset2)
We currently fail to do this because we only consider loads/stores
if we didn't find any arithmetic.
This patch splits arithmetic folding and load/store folding into
two separate phases. The load/store folding can no longer assume
the offset in hi/lo is 0 so we must combine the offsets. I've applied
the same simm32 limit that we applied in the arithmetic folding.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D130931
The patch replaces SPIRVBaseInfo.* previously created using macros by
the tablegen approach. There are many small changes in other files due to
differences in namespaces. Also, functions in SPIRVUtils are moved to
the llvm namespace.
Differential Revision: https://reviews.llvm.org/D130518
Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
2xi64 is the legalized type for wide reductions (like 16xi64) and setting the
cost to 2 makes `load-reduce` and `load-zext-reduce` patterns profitable.
The few performance measurments that I did on an aarch64 machine confirm that
these patterns are actually faster when vectorized.
Differential Revision: https://reviews.llvm.org/D130740
For VALU write and memory (VM, L/DS, FLAT) instructions, SQ would insert
wait-states to avoid data hazard. However when there is a DGEMM instruction
in-between them, SQ incorrectly disables the wait-states thus the data hazard
needs to be handled with this workaround.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130677
At least based on the lit tests, the coalescer sometimes fails to
propagate the copy from X0 into the branch instruction. This patch
does it manually during isel. The majority of the changes are from
the select patterns.
Some of the changes are just register allocation changes. Only
the Select change affects the whether a b*z instruction is generated
in the tests. I changed the branch pattern for consistency.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D130809
The only iterator we're holding points to HiLUI and we never
delete that so I think it is safe to delete everything else
immediately.
I want to split detectAndFoldOffset into two phases. First, combine
LUI+ADDI with any ADD/ADDI/SHXADD that comes after it. This may
open opportunities to fold the ADDI from the LUI+ADDI into a
load/store address. So the load/store folding should run as a
second phase even if the ADD/ADDI/SHXADD made changes.
In order to do this we need to eagerly delete instructions in the
first phase so that we don't have dead users of the LUI+ADDI
when we start the second phase.
Patches to split the phases will come later.
Reviewed By: asb, luismarques
Differential Revision: https://reviews.llvm.org/D130119
Extend hazard recognizer of ReadM0MovRelInterpHazard with
DS_READ_ADDTID and DS_WRITE_ADDTID, as they also
require a manually inserted S_NOP after SALU writing m0.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130783
Fix all instances of:
*** Bad machine code: Kill missing from LiveVariables ***
in the X86 CodeGen tests with D129213 applied, which adds verification
of LiveIntervals after the TwoAddressInstruction pass runs.
Differential Revision: https://reviews.llvm.org/D129634
According to the ABI for the Arm Architecture, the value for the
Tag_also_compatible_with eabi attribute is represented by an NTBS entry.
This string value, in turn, is composed of a pair of tag+value encoded
in one of two formats:
- ULEB128: tag, ULEB128: value, 0.
- ULEB128: tag, NBTS: data.
(See [[ 60a8eb8c55/addenda32/addenda32.rst (3373secondary-compatibility-tag) | section 3.3.7.3 on the Addenda to, and Errata in, the ABI for the Arm Architecture ]].)
Currently the Arm assembly parser and streamer ignore the encoding of
the attribute's NTBS value, which can result in incorrect attributes
being emitted in both assembly and object file outputs.
This patch fixes these issues by properly handing the value's encoding.
An update to llvm-readobj to properly handle the attribute's value will be
covered by a separate patch.
Patch by Victor Campos and Lucas Prates.
Reviewed By: vhscampos
Differential Revision: https://reviews.llvm.org/D129500
Builds upon D123264, adding support for merging the low part of the LLA
address into the load/store instruction offsets.
Differential Revision: https://reviews.llvm.org/D123265
Expand load address pseudo-instructions earlier (pre-ra) to allow follow-up
patches to fold the addi of PseudoLLA instructions into the immediate
operand of load/store instructions.
Differential Revision: https://reviews.llvm.org/D123264
Only PACKSS/PACKUS faux shuffles make use of the demanded elts at the moment, but this at least improves the handling of a couple of truncation patterns.
This adds a merge operand to all of the binary _VL nodes. Including
integer and widening. They all share multiclasses in tablegen
so doing them all at once was easiest.
I plan to use FADD_VL in an upcoming patch. The rest are just for
consistency to keep tablegen working.
This does reduce the isel table size by about 25k so that's nice.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130816
Noticed by inspection and I can't seem to make a test case, but SSE arithmetic bit shifts clamp to the max shift amount (i.e. create a sign splat) - combineVectorShiftImm already does something similar.
For constants in the range [-2047, 2048] we use addi. If the constant
is -2048 we can use xori. If we don't match this explicitly, we'll
emit an LI for the -2048 followed by an XOR.
It is not necessary to wait for all outstanding memory operations before
barriers on hardware that can back off of the barrier in the event of an
exception when traps are enabled. Add a new subtarget feature which
tracks which HW has this ability.
Reviewed By: #amdgpu, rampitec
Differential Revision: https://reviews.llvm.org/D130722
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
Updated version - original version rG057db2002bb3 didn't correctly account for multiple uses of the mask that might be folding "OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))" in canonicalizeBitSelect
Since 814a0abcce, this would break if we
had a function in the module that becomes dead in any codegen IR
pass. The function wasn't deleted since it was initially used in dead
code, but is detached from the call graph and doesn't appear in the PO
traversal. Do a second walk over the module to populate the resources
of any functions which weren't already processed.
Ensure non-terminators don't follow terminators.
This patch fixes the `sdiv-udiv-srem-urem.ll` test failure with
expensive check.
Differential Revision: https://reviews.llvm.org/D130247
A build vector of two extracted elements is equivalent to an extract
subvector where the inner vector is any-extended to the
extract_vector_elt VT, because extract_vector_elt has the effect of an
any-extend.
(build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1))
=> (extract_subvector (anyext_i16_to_i32 vec) Idx)
Depends on D130697
Differential Revision: https://reviews.llvm.org/D130698
Summary:
Flat scratch load of D16 type by default has tied vdst_in operand (with vdst). This should be taken
care of at the time of "removeOperand" in eliminateFrameIndex. Otherwise we will hit an assert saying
"Cannot move tied operands". This patch unties vdst_in before the move, and retie it with vdst afterwards.
Reviewers:
arsenm, foad
Differential Revision: https://reviews.llvm.org/D130537
When register pressure tracking is disabled, the scheduler attempts to load
pressures at SReg_32 and VGPR_32. This causes an index out of bounds error.
This patch fixes this issue by disabling the initialization of RPTracker
when not needed. NFC
Reviewed By: rampitec, kerbowa, arsenm
Differential Revision: https://reviews.llvm.org/D129322
This builtin allows the creation of custom scheduling pipelines on a per-region
basis. Like the sched_barrier builtin this is intended to be used either for
testing, in situations where the default scheduler heuristics cannot be
improved, or in critical kernels where users are trying to get performance that
is close to handwritten assembly. Obviously using these builtins will require
extra work from the kernel writer to maintain the desired behavior.
The builtin can be used to create groups of instructions called "scheduling
groups" where ordering between the groups is enforced by the scheduler.
__builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter
is a mask that determines the types of instructions that you would like to
synchronize around and add to a scheduling group. These instructions will be
selected from the bottom up starting from the sched_group_barrier's location
during instruction scheduling. The second parameter is the number of matching
instructions that will be associated with this sched_group_barrier. The third
parameter is an identifier which is used to describe what other
sched_group_barriers should be synchronized with. Note that multiple
sched_group_barriers must be added in order for them to be useful since they
only synchronize with other sched_group_barriers. Only "scheduling groups" with
a matching third parameter will have any enforced ordering between them.
As an example, the code below tries to create a pipeline of 1 VMEM_READ
instruction followed by 1 VALU instruction followed by 5 MFMA instructions...
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 3 VALU
__builtin_amdgcn_sched_group_barrier(2, 3, 0)
// 2 VMEM_WRITE
__builtin_amdgcn_sched_group_barrier(64, 2, 0)
Reviewed By: jrbyrnes
Differential Revision: https://reviews.llvm.org/D128158
This avoids a vmerge at the end and avoids spurious fflags updates.
This isn't used for constrained intrinsic so we technically don't have
to worry about fflags, but it doesn't cost much to support it.
To support I've extend our FCOPYSIGN_VL node to support a passthru
operand. Similar to what was done for VRGATHER*_VL nodes.
I plan to do a similar update for trunc, floor, and ceil.
Reviewed By: reames, frasercrmck
Differential Revision: https://reviews.llvm.org/D130659
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
We will insert a new operand which is identical to the Dest for complex
FMUL with a mask. https://godbolt.org/z/eTEdnYv3q
Complex FMA and FMUL with maskz don't have this problem.
Reviewed By: LuoYuanke, skan
Differential Revision: https://reviews.llvm.org/D130638
By not clustering loads and adjusting heuristics to more aggressively reduce
register pressure we may be able to increase occupancy for the function if it
was dropped in a first pass scheduling.
Similarly, try to reduce spilling if register usage exceeds lower bound
occupancy.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130329
Clear all kill flags on source register when folding a COPY.
This is necessary because the kills may now be out of order with the uses.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D130622
InstCombine and DAGCombine prefer to keep shl before binops.
This patch teaches isel to convert to (shl (and/or/xor X, C1 >> C2), C2)
if (C1 >> C2) is a simm12. The idea was taken from X86's isel code.
There's a special case implemented for a sext_inreg between the
shift and the binop.
Differential Revision: https://reviews.llvm.org/D130610
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration.
At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible.
This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised. Given that, having issues fall out is not unexpected. If you find issues, please make sure to include as much information as you can when reverting this change.
Differential Revision: https://reviews.llvm.org/D129013
The instruction is used to modify wave priority with the intent
to affect VALU execution and currently we can reschedule VALU
around it since that VALU does not have side effects.
Differential Revision: https://reviews.llvm.org/D130654
Without this, the intrinsic will be expanded to an integer; thereby an
explicit copy (from GPR to SIMD register) will be codegen'd. This matches the
general convention of using "v1" types to represent scalar integer operations in
vector registers.
The similar approach is observed in D56616, and the pattern likely applies on
other intrinsic that accepts integer scalars (e.g.,
int_aarch64_neon_sqdmulls_scalar)
Differential Revision: https://reviews.llvm.org/D130548
This adds similar heuristics to G_GLOBAL_VALUE, querying the cost of
materializing a specific constant in code size. Doing so prevents us from
sinking constants which require multiple instructions to generate into
use blocks.
Code size savings on CTMark -Os:
Program size.__text
before after diff
ClamAV/clamscan 381940.00 382052.00 0.0%
lencod/lencod 428408.00 428428.00 0.0%
SPASS/SPASS 411868.00 411876.00 0.0%
kimwitu++/kc 449944.00 449944.00 0.0%
Bullet/bullet 463588.00 463556.00 -0.0%
sqlite3/sqlite3 284696.00 284668.00 -0.0%
consumer-typeset/consumer-typeset 414492.00 414424.00 -0.0%
7zip/7zip-benchmark 595244.00 594972.00 -0.0%
mafft/pairlocalalign 247512.00 247368.00 -0.1%
tramp3d-v4/tramp3d-v4 372884.00 372044.00 -0.2%
Geomean difference -0.0%
Differential Revision: https://reviews.llvm.org/D130554
I don't have any evidence these particular uses are actually causing any
issues, but we should avoid accidentally truncating immediate values
depending on the host.
We can use slli.uw by C followed by sh1add. Similar can be done
for multiples of 5 and 9. We need to make sure that C is less than
32 to stay in bounds of the 5-bit immediate for slli.uw.
We have existing patterns for (mul X, 3<<C) that use sh1add
followed by slli. That order doesn't allow the and to be folded.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130146
A mul by a negated power of 2 is a slli followed by neg. This doesn't
require any constant materialization and may be lower latency than mul.
The neg may also be foldable into other arithmetic.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130047
We can't guarantee the long always 64 bits like WINDOWS or LLP64 data
model (rare but we should consider).
So use int64_t from inttypes.h and safe in this case.
Fixes https://github.com/llvm/llvm-project/issues/55911 .
It errors out in the Bazel CI:
AMDGPULowerModuleLDSPass.cpp:384:12: error: chosen constructor is
explicit in copy-initialization
return {SGV, std::move(Map)};
Reviewed By: rupprecht
Differential Revision: https://reviews.llvm.org/D130623
Normally, generic processor does not have any SubtargetFeature. And it
can just generate most basic instructions which have no Predicates to
guard.
But it needs to enbale predicate for the btsti16 instruction as one of the most basic instructions.
Or the generic processor can't finish codegen process. So Add FeatureBTST16 SubtargetFeature to generic ProcessorModel.
Given a patch like D129506, using instructions not valid for the current
feature set becomes an error. This updates the Arm hint-space
instructions for pac/bti to require thumbv7m as opposed to 8.1-m.main, to
make them valid when compiling for thumbv7m with -mbranch-protection.
Differential Revision: https://reviews.llvm.org/D129692
This adds a +atomic-32 target feature, which instructs LLVM to assume
that lock-free 32-bit atomics are available for this target, even
if they usually wouldn't be.
If only atomic loads/stores are used, then this won't emit libcalls.
If atomic CAS is used, then the user is responsible for providing
any necessary __sync implementations (e.g. by masking interrupts
for single-core privileged use cases).
See https://reviews.llvm.org/D120026#3674333 for context on this
change. The tl;dr is that the thumbv6m target in Rust has
historically made atomic load/store only available, which is
incompatible with the change from D120026, which switched these to
use libatomic.
Differential Revision: https://reviews.llvm.org/D130480
These are aliases that allow the immediate offset to be ommitted.
We had predicates for the RV64, RV32+F, and D versions, but
not the base versions.
I've also re-ordered them to share Predicate lines to improve
readability.
Tries to make the different scheduling stages a bit more self contained and
modifiable. Intended to be NFC. Preface to other changes.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130147
This patch starts small, only detecting sequences of the form
<a, a+n, a+2n, a+3n, ...> where a and n are ConstantSDNodes.
Differential Revision: https://reviews.llvm.org/D125194
This helps fold away the ptest instructions, which needs the knowledge on whether
the general predicate is known to zero the inactive lanes.
This fixes some PTEST regressions introduced by D129282.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D129852
Currently, when llvm-objdump is disassembling a code section and
encounters a point where no instruction can be decoded, it uses the
same policy on all targets: consume one byte of the section, emit it
as "<unknown>", and try disassembling from the next byte position.
On an architecture where instructions are always 4 bytes long and
4-byte aligned, this makes no sense at all. If a 4-byte word cannot be
decoded as an instruction, then the next place that a valid
instruction could //possibly// be found is 4 bytes further on.
Disassembling from a misaligned address can't possibly produce
anything that the code generator intended, or that the CPU would even
attempt to execute.
This patch introduces a new MCDisassembler virtual method called
`suggestBytesToSkip`, which allows each target to choose its own
resynchronization policy. For Arm (as opposed to Thumb) and AArch64,
I've filled in the new method to return a fixed width of 4.
Thumb is a more interesting case, because the criterion for
identifying 2-byte and 4-byte instruction encodings is very simple,
and doesn't require the particular instruction to be recognized. So
`suggestBytesToSkip` is also passed an ArrayRef of the bytes in
question, so that it can take that into account. The new test case
shows Thumb disassembly skipping over two unrecognized instructions,
and identifying one as 2-byte and one as 4-byte.
For targets other than Arm and AArch64, this is NFC: the base class
implementation of `suggestBytesToSkip` still returns 1, so that the
existing behavior is unchanged. Other targets can fill in their own
implementations as they see fit; I haven't attempted to choose a new
behavior for each one myself.
I've updated all the call sites of `MCDisassembler::getInstruction` in
llvm-objdump, and also one in sancov, which was the only other place I
spotted the same idiom of `if (Size == 0) Size = 1` after a call to
`getInstruction`.
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D130357