VMV_V_X_VL nodes should always have a passthru, a splat, and a VL.
We were sometimes missing the VL.
This went unnoticed because these cases were all selected into the
following node to form a .vx or .vi instruction. The ComplexPattern
that does this, doesn't check the VL operand. I've added an assert
to the ComplexPattern to catch if the operand is missing.
@qcolombet spotted some of these in D134703.
This captures the target shader model and pipeline stage into the DXIL
metadata for consumption by the DirectX runtime.
Reviewed By: python3kgae
Differential Revision: https://reviews.llvm.org/D134469
This is a refactor for another patch. For now we move the vreg
creation to the caller.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D135008
Currently, AArch64 doesn't support vectorization for non temporal loads because `isLegalNTLoad` is not implemented for the target.
This patch applies similar functionality as `D73158` but for non temporal loads
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D131964
Before, the isPreLegalize() query in CombinerHelper only checked for the
presence of a LegalizerInfo object. This is problematic when we want to have
a combine actually check for legality in a pre-legalizer combine pass, since
if we pass a LegalizerInfo object to the constructor it causes the combines to
think that we're running *post* legalizer, which isn't true.
This change fixes it to instead check an explicit bool that passes to signal
whether the pass will be run before or after legalization.
Doing so exposed a bug in the extending loads combine, which tried to check for
legality of candidate extending loads if LegalizerInfo was present. Since we
only ran it pre-legalizer and therefore with a null LegalizerInfo, it never
actually ran. Also fixes the legality checks to keep the tests passing.
Differential Revision: https://reviews.llvm.org/D135044
One of the sources is the same size as the destination so that source
doesn't have an overlap with the destination register. By using the _TIED
form we avoid an early clobber contraint for that source.
This matches what was already done for instrinsics. ConvertToThreeAddress
will fix it if it can't stay tied.
Change the costmodel to lower a = b * C where C = -(2^n - 2^m) to
lsl w8, w0, m
sub w0, w8, w0, lsl n
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134934
VALU use of an SGPR (pair) as mask followed by SALU write to the
same SGPR can cause incorrect execution of subsequent SALU reads
of the SGPR.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D134151
If True has a Chain result, the other operands of the vmerge may
depend on it through that Chain. We need to ensure it isn't a
predecessor of those operands.
Reviewed By: fakepaper56
Differential Revision: https://reviews.llvm.org/D134980
This stops reporting CostPerUse 1 for `R8`-`R15` and `XMM8`-`XMM31`.
This was previously done because instruction encoding require a REX
prefix when using them resulting in longer instruction encodings. I
found that this regresses the quality of the register allocation as the
costs impose an ordering on eviction candidates. I also feel that there
is a bit of an impedance mismatch as the actual costs occure when
encoding instructions using those registers, but the order of VReg
assignments is not primarily ordered by number of Defs+Uses.
I did extensive measurements with the llvm-test-suite wiht SPEC2006 +
SPEC2017 included, internal services showed similar patterns. Generally
there are a log of improvements but also a lot of regression. But on
average the allocation quality seems to improve at a small code size
regression.
Results for measuring static and dynamic instruction counts:
Dynamic Counts (scaled by execution frequency) / Optimization Remarks:
Spills+FoldedSpills -5.6%
Reloads+FoldedReloads -4.2%
Copies -0.1%
Static / LLVM Statistics:
regalloc.NumSpills mean -1.6%, geomean -2.8%
regalloc.NumReloads mean -1.7%, geomean -3.1%
size..text mean +0.4%, geomean +0.4%
Static / LLVM Statistics:
mean -2.2%, geomean -3.1%) regalloc.NumSpills
mean -2.6%, geomean -3.9%) regalloc.NumReloads
mean +0.6%, geomean +0.6%) size..text
Static / LLVM Statistics:
regalloc.NumSpills mean -3.0%
regalloc.NumReloads mean -3.3%
size..text mean +0.3%, geomean +0.3%
Differential Revision: https://reviews.llvm.org/D133902
When returning from a function with both SCS and PAC-RET enabled, we need to
authenticate the return address from the stack and then load from the SCS,
but this was happening in the reverse order when RETA[AB] were being used.
Fix it by disabling the use of RETA[AB] when SCS is enabled.
Fixes pr58072.
Differential Revision: https://reviews.llvm.org/D134931
1. Save typed pointer type for GlobalVariable/Function instead of the ObjectType.
This will allow use GlobalVariable/Function as value.
2. Save target type for global ctors for Constant.
3. In DXILBitcodeWriter::getTypeID, check PointerMap first for Constant case.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D133283
Simplify and make the pair-wise relocation more precise. If either of
the symbol references are textual, the relocation must be delayed. If
the difference is across sections, delay it as well which partially
matches the behaviour of gas. We unfortunately do not handle the case
where the difference references a symbol that is not yet defined. In
such a case, we simply fail to resolve the difference, which should
hopefully not be too onerous (particularly since no other target
supports cross-section references and it is not clear if this was
intentional on the part of RISCV).
Differential Revision: https://reviews.llvm.org/D132262
Reviewed By: @MaskRay
This code is directly ported from the X86 backend which applies the same rewrite (along with several others). Planning on looking more closely at the other branchless variants from x86 to see if any are worth porting in future changes.
Motivation here is the coremark crc8 routine from https://github.com/eembc/coremark/blob/main/core_util.c#L165. This patch significantly reduces the number of unpredictable branches in the workload.
Differential Revision: https://reviews.llvm.org/D134881
Recognize more opcodes in the function.
Fixes some regressions introduced in D134857 for fdiv.f16 too.
Depends on D134857
Reviewed By: arsenm, foad
Differential Revision: https://reviews.llvm.org/D134862
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.
This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134433
Using this helper makes work about neutral elements more easier. Although I only
find one case now, I think it will have more chance to be used since so many
combine works are related to neutral elements.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D133866
This is a port of an existing optimization in AArch64 ISelLowering, handling
a case when the same input vector can be used for both ext inputs.
Differential Revision: https://reviews.llvm.org/D134891
In combineOr (X86ISelLowering.cpp) there is a DAG combine that rewrite
a "(0 - SetCC) | C" pattern into something simpler given that a LEA
can be used. Another requirement is that C has some specific value,
for example 1 or 7. When checking those requirements the code used a
32-bit unsigned variable to store the value of C. So for a 64-bit OR
this could miscompile in case any of the 32 most significant bits in
C were non zero.
This patch adds fixes the bug by using a large enough type for the
C value.
The faulty code seem to have been introduced by commit 9bceb8981d
(D131358).
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134892
Decompose the const 14 can be separated from D132322
Change the costmodel to lower a = b * C where C = 2^n - 2^m to
lsl w8, w0, n
sub w0, w8, w0, lsl m
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134706
This change updates the costs to make constant pool loads match their actual cost, and adds the broadcast special case to avoid too many regressions. We really need more information about the constants being rematerialized, but this is an incremental improvement.
Differential Revision: https://reviews.llvm.org/D134746
k: A memory operand whose address is formed by a base register and
(optionally scaled) index register.
m: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as st.w and ld.w.
ZB: An address that is held in a general-purpose register. The offset
is zero.
ZC: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as ll.w and sc.w.
Differential Revision: https://reviews.llvm.org/D134638
This change sets
-amdgpu-assume-{external-call-stack-size | dynamic-stack-object-size}
options to zero by default for code object v5 and later. The runtime is
expected to adjust the scratch size if the amdhsa_uses_dynamic_stack bit
in the kernel descriptor is set.
Differential Revision: https://reviews.llvm.org/D128346
LoongArchELFObjectWriter::getRelocType check IsPCRel for FK_Data_4
(which we produce a R_LARCH_32_PCREL relocation for if IsPCRel).
R_LARCH_32_PCREL is required for FDE relocation.
Differential Revision: https://reviews.llvm.org/D134715
The target selection DAG lowering information is needed for
SelectionDAGBuilder to lower a call like memcmp into an optimized
form.
Differential Revision: https://reviews.llvm.org/D134712
A few issues:
1. There was no legalizer test for G_PTRTOINT
2. Same clamping issue as in many other opcodes
3. AArch64 pointers can only be 64b, so in reality we always have to trunc or
extend with any size other than p0 anyway.
This seems to actually produce more correct selection for narrow types as well.
Differential Revision: https://reviews.llvm.org/D107588
This is intended to be equivalent to the s32 + s64 cases in
AArch64TargetLowering::LowerFCOPYSIGN.
Widen everything and then use G_BIT + a mask to handle the actual copysign
operation. Then, narrow back down to s32/s64.
I wasn't sure about what the best/most canonical INSERT_SUBREG-selectable
pattern is. I chose G_INSERT_VECTOR_ELT + an undef vector because it produces
reasonably okay codegen. (It doesn't produce INSERT_SUBREG right now though.)
If there's a better way to do this then I'm happy to change it.
We also have a couple codegen deficiencies with how we emit vector constants
right now. (We need a GISel equivalent to the tryAdvSIMDModImm64 stuff)
Differential Revision: https://reviews.llvm.org/D108725
This is necessary for custom-legalizing G_FCOPYSIGN.
This is equivalent to the BIT instruction (bitwise insert if true).
Add selection testcases for imported patterns.
Differential Revision: https://reviews.llvm.org/D108714
1. Save typed pointer type for GlobalVariable/Function instead of the ObjectType.
This will allow use GlobalVariable/Function as value.
2. Save target type for global ctors for Constant.
3. In DXILBitcodeWriter::getTypeID, check PointerMap first for Constant case.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D133283
One of the conditions to flush the vmcnt counter in loop preheaders is: The loop
contains a use of a vgpr that is defined out of the loop. The code currently
checks if a waitcnt is needed by looking at the score of that vgpr in the score
brackets. This is not enough and may cause the generation of an unnecessary
vmcnt flush. This patch fixes that case.
Differential Revision: https://reviews.llvm.org/D130313
The association between kernel and struct is done by symbol name.
This doesn't work robustly for anonymous kernels as shown by the modified
test case.
An alternative association between function and struct can be constructed
if necessary, probably though metadata, but on the basis that we currently
miscompile anonymous kernels and that they are difficult to construct from
application code and difficult to call from the runtime, this patch makes
it a fatal error for now.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134741
Surprisingly these were getting legalized to something
zero initialized.
This fixes an infinite loop when combining some vector types.
Also fixes zero initializing some undef values.
SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking
for the legality of the output undefs they are replacing unused
operations with. This resulted in turning vectors into undefs
that were later re-legalized back into zero vectors.
For gathers which load in 8 and 16 bit data then use that data
as an index, the index can be extended to 32 bits instead of
64 bits
Differential Revision: https://reviews.llvm.org/D130692
This reverts commit 1c62af3e23.
The commit causes the test below to fail. Revert for now to get the bots
back to green.
Failing test:
lvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll
Currently over 256 non-temporal loads are broken inefficently. For example, `v17i32` gets broken into 2 128-bit loads. It is better if we can use
256-bit loads instead.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133421
A kernel may have an associated struct for laying out LDS variables.
This patch puts that instance, if present, at a deterministic address by
allocating it at the same time as the module scope instance.
This is relatively likely to be where the instance was allocated anyway (~NFC)
but will allow later patches to calculate where a given field can be found,
which means a function which is only reachable from a single kernel will be
able to access a LDS variable with zero overhead. That will be particularly
helpful for applications that instantiate a function template containing LDS
variables once per kernel.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127052
fp16 and bf16 values can be used in GCC's inline assembly using the "t"
constraint, which means "VFP floating-point registers s0-s31" - fp16 and
bf16 values are stored in S registers too.
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 't' constraint.
Fixes#57753
Differential Revision: https://reviews.llvm.org/D134553
Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D134780
Defines LoongArch registers for getExceptionPointerRegister() and
getExceptionSelectorRegister().
Differential Revision: https://reviews.llvm.org/D134709
These instructions are flag setting so the ptest is redundant, the
TableGen class wasn't setting the element size for the predicate causing
the checks in AArch64InstrInfo::optimizePTestInstr to fail.
This is purely NFC restructure in advance of a change which actually exposes zero strides. This is mostly because I find this interface confusing each time I look at it.
I wasn't able to produce a testcase for that because right now VWSUB is
only generated from VWSUB_W and from there to trigger the commutative
bug we would need to grab VWSUB where the splat value is on the LHS,
which is currently not matched.
Differential Revision: https://reviews.llvm.org/D134701
The motivation here is to enable a change I'm exploring in vectorizer to prefer base + offset_vector addressing for scatter/gather. The form the vectorizer would end up emitting would be a gep whose vector operand is an add of the scalar IV (splated) and the index vector. This change makes sure we can recognize that pattern as well as the current code structure. As a side effect, it might improve scatter/gathers from other sources.
Differential Revision: https://reviews.llvm.org/D134755
Previous commit 8b00b24f85 missed to add `int_ceil` anchor for the
llvm.ceil.* section under LangRef.rst
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D134586
My original intent had been to reuse this for arithmetic instructions as well, but due to the availability of a immediate splat encoding there, we will need different heuristics. So specialize the existing code for the store case.
Initial table.get/set implementation would match and lower combinations
of GEP+load/store to table.get/set instructions. However, this is error
prone due to potential combinations of GEP+load/store we don't implement,
and load/store optimizations. By changing the code to using intrinsics, we
avoid both issues and simplify the code.
New builtins implemented:
* @llvm.wasm.table.get.externref
* @llvm.wasm.table.get.funcref
* @llvm.wasm.table.set.externref
* @llvm.wasm.table.set.funcref
Reviewed By: asb, tlively
Differential Revision: https://reviews.llvm.org/D134436
Add vp.maxnum and vp.minnum which are vector predicted intrinsics of llvm.maxnum
and llvm.minnum.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D134639
This feature implements support for making entries in the exception section
on XCOFF on the direct assembly path using the ".except" pseudo-op. It also
provides functionality to lower entries (comprised of language and reason
codes) into the exception section through the use of annotation metadata
attached to llvm.ppc.trap/trapd/tw/tdw intrinsics. Integrated assembler
support will be provided in another review. https://reviews.llvm.org/D133030
needs to merge first for LIT tests
Reviewed By: shchenz, RKSimon
Differential Revision: https://reviews.llvm.org/D132146
Summary:
With opaque pointer support, the "ptr" type is introduced and thus BitCast is not necessary in some cases.
This work takes care of this change, and recognizes the new address patterns to do appropriate optimizations.
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D134596
Disable FMAX/FMIN selection from select_cc in VEInstrInfo.td because of
the lack of NaN consideration. This patch removes such selection from
VEInstrInfo.td and lets llvm work on it in combineMinNumMaxNum.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134595
Support smax/smin in VEInstrInfo.td. Remove obsolete patterns for
smax/smin. Add regression tests for smax/smin/umax/umin.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134583
The `CodeGenPrepare` pass can sink bitwise `and` used by compare to
zero into the basic blocks where the users are. This operation is
guarded by lowering hook, which is disabled for ARM. In the ARM
architecture versions from v7-M up these two operations can be folded
into `tst rN, #imm` instruction. Sinking of `and` can also enable
the cmov-to-bfi DAG combiner.
This patch fixes some benchmark regressions caused
by https://reviews.llvm.org/D129370 as well scoring slightly better overall.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D134360
The commit D120104 enabled FeatureFuseAdrpAdd for -mcpu=generic,
allowing the linker to relax adrp;add pairs where possible. D132075
extended that to neoverse-n1, this patch extends it to all other cortex
and neoverse cpus for the same reasons.
Differential Revision: https://reviews.llvm.org/D134521
This patch uses a unified interface for lower GlobalAddress ConstantPool
BlockAddress and JumpTable.
This patch allows lowering addresses by using PC-relative addressing
for DSO-local symbols, and accessing the address through the global
offset table for DSO-preemptable symbols.
Remove hardcoded `MininumJumpTableEntries` for test lower JumpTable.
Also updated some test cases using ConstantPool, due to the addition of
relocation information.
Differential Revision: https://reviews.llvm.org/D134431
As the LoongArch port is largely modeled after RISCV it has the same
behavior of not accepting `generic` as a CPU name. For better
compatibility with consumers of LLVM (e.g. mesa) follow D121149's suit
and treat `generic` the same as an empty CPU name.
Differential Revision: https://reviews.llvm.org/D134412
As explained in D68559 the `fastcc` calling convention may be requested
under certain conditions, hence the need for supporting it. But unlike
RISCV we actually treat it exactly like ccc, without actually inventing
any performance hack right here. And CSKY does the same thing.
This is going to fix a few more test cases with native LoongArch builds.
Differential Revision: https://reviews.llvm.org/D134443
For the pattern of IR (%if terminates with a divergent branch.),
divergence analysis will report %phi as uniform to help optimal code
generation.
```
%if
| \
| %then
| /
%endif: %phi = phi [ %uniform, %if ], [ %undef, %then ]
```
In the backend, %phi and %uniform will be assigned a scalar register.
But the %undef from %then will make the scalar register dead in %then.
This will likely cause the register being over-written in %then. To fix
the issue, we will rewrite %undef as %uniform. For details, please refer
the comment in AMDGPURewriteUndefForPHI.cpp. Currently there is no test
changes shown, but this is mandatory for later changes.
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D133840
This patch adds support for constraints `f`, `l`, `I`, `K` according
to [1]. The remain constraints (`k`, `m`, `ZB`, `ZC`) will be added
later as they are a little more complex than the others.
f: A floating-point register (if available).
l: A signed 16-bit constant.
I: A signed 12-bit constant (for arithmetic instructions).
K: An unsigned 12-bit constant (for logic instructions).
For now, no need to support register alias (e.g. `$a0`) in llvm as
clang will correctly decode the usage of register name aliases into
their official names. And AFAIK, the not yet upstreamed `rustc` for
LoongArch will always use official register names (e.g. `$r4`).
[1] https://gcc.gnu.org/onlinedocs/gccint/Machine-Constraints.html
Differential Revision: https://reviews.llvm.org/D134157
This is a follow-on to https://reviews.llvm.org/D134073.
It renames a few fields to have consistent names, as well as renaming
operands to match the field names.
The encoder behavior is unchanged by this cleanup, but a few
instructions were previously being disassembled incorrectly, and have
been corrected by this change. All of the affected instructions were
missing disassembly tests, which are now added.
Differential Revision: https://reviews.llvm.org/D134185
This is a follow-on to https://reviews.llvm.org/D134073.
It renames a couple of fields to match their operands, as well as
introducing sub-operand names where required.
This change _only_ fixes the 'R600' half of the target, not the
'AMDGPU' half. Fixing the AMDGPU half will be a significantly more
difficult change (which I've not yet attempted.)
Differential Revision: https://reviews.llvm.org/D134078
This is a follow-on to https://reviews.llvm.org/D134073.
Lanai was almost clean: the only issue is that 'bit' behaves
differently than 'bits<1>', because only the 'bits' type preserves
unresolved references via 'keepUnsetBits()' in
TableGen/Record.h. Thus, use bits instead.
This issue _would_ have caused invalid instruction emission/decoding,
except that the PQ bits were being overriden after the fact by code in
'adjustPqBits' in MCTargetDesc/LanaiMCCodeEmitter.cpp, and
'PostOperandDecodeAdjust' in Disassembler/LanaiDisassembler.cpp.
Differential Revision: https://reviews.llvm.org/D134075
Fix regression from clang opencl test in builtins-fp-atomics-gfx90a.cl
test_flat_add_local_f64 caused by D130579
Revert a3becb333d.
Differential Revision: https://reviews.llvm.org/D134568
Very straight forward extension of the existing pattern matching pass to handle scalable types as well as fixed length types. The only extra bit beyond removing a bailout is recognizing stepvector.
Differential Revision: https://reviews.llvm.org/D134502
The code previously assumed fixed length vectors; make the relevant code conditional.
Having the lowering in place is neccessary for an upcoming change to generalize scatter/gather matching to scalable vectors.
Differential Revision: https://reviews.llvm.org/D134489
Summary:
The existing undefined-bitfield-to-operand matching behavior is very
hard to understand, due to the combination of positional and named
matching. This can make it difficult to track down a bug in a target's
instruction definitions.
Over the last decade, folks have tried to work-around this in various
ways, but it's time to finally ditch the positional matching. With
https://reviews.llvm.org/D131003, there are no longer cases that
_require_ positional matching, and it's time to start removing usage
and support for it.
Therefore: add a (default-false) option, and set it to true only in
those targets that require positional matching today. Subsequent
changes will start cleaning up additional in-tree targets.
NOTE TO OUT OF TREE TARGET MAINTAINERS:
If this change breaks your build, you may restore the previous
behavior simply by adding:
let useDeprecatedPositionallyEncodedOperands = 1;
to your target's InstrInfo tablegen definition. However, this is
temporary -- the option will be removed in the future.
If your target does not set 'decodePositionallyEncodedOperands', you
may thus start migrating to named operands. However, if you _do_
currently set that option, I recommend waiting until a subsequent
change lands, which adds decoder support for named sub-operands.
Differential Revision: https://reviews.llvm.org/D134073
These names can then be matched by name against 'bits' fields in a
record, to populate an instruction's encoding.
This does _not_ yet change DecoderEmitter to allow by-name matching of
sub-operands. Unlike the encoder, the decoder already defaulted to not
supporting positional matching, and backends had workarounds in place
for the missing decoding support.
Additionally, use this new capability to allow the ARM and AArch64
backends not to require any positional operand matching.
Differential Revision: https://reviews.llvm.org/D131003
The full complement of physical VGPRs for GFX11 is 50% more than GFX10.
Some subtargets have this, others stay the same as GFX10. This affects
occupancy calculations.
Differential Revision: https://reviews.llvm.org/D134522
This has the advantage of dealing with live EFLAGS, using LEA instead of
SUB if needed to avoid clobbering. That also respects feature "lea-sp".
We could allow unrolled stack probing from blocks with live-EFLAGS, if
canUseAsEpilogue learns when emitStackProbeInlineGeneric will be used.
Differential Revision: https://reviews.llvm.org/D134495
Remove manual selection for atomic fadd from global-isel.
Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD
which corresponds to llvm-ir's atomicrmw fadd instruction.
global and flat atomic fadd patterns changes:
Split rtn/no-rtn patterns
Add missing patterns or fix predicates
Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors).
Patterns now check addrspace of pointer, added patterns for flat intrinsic.
with global addrspace pointer that selects into global atomic instruction.
buffer atomic fadd patterns changes:
Rdit patterns to import into global-isel.
Remove gfx6/gfx7 _addr64 and _offset patterns.
Remove patterns that can't be reached (same pattern but different feature).
Differential Revision: https://reviews.llvm.org/D130579
Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11
as for gfx90a. Add missing globalisel legalizer support for flat
atomicrmw fadd f32 on gfx940 and gfx11.
Isel support for gfx11 will be added in D130579.
Differential Revision: https://reviews.llvm.org/D131560
Feature used by targets that have flat_atomic_add_f32 instruction
(gfx940 and gfx11). Remove isGFX940GFX11Plus.
Add hasFlatAtomicFaddF32Inst Subtarget check for codegen.
Differential Revision: https://reviews.llvm.org/D134532
The mul by constant costmodels handle power-of-2 constants, but not negated-power-of-2, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling.
Fixes#50778
Differential Revision: https://reviews.llvm.org/D111968
This patch removes the aarch64 instrinsic svget/svset/svcreate from llvm.
It also implements the InstCombine for vector.extract that used to be in svget.
Depends on: D131547
Differential Revision: https://reviews.llvm.org/D131548
RISCV doesn't actually support a scaled form of indexed load and store. We previously handled this by forming the scaled SDNode, and then doing custom legalization during lowering. This patch instead adds a callback via TLI to prevent formation entirely.
This has two effects:
* First, the GEP gets expanded (and used). Instead of the shift being created with an SDLoc of the memory operation, it has the SDLoc of the GEP instruction. This avoids the scheduler perturbing IR order when there's no reason to.
* Second, we fix what appears to be a bug in index calculation with RV32. The rules for GEPs require index calculation be done in particular bitwidth, and it appears the custom legalization code got this wrong for the case where index type exceeds pointer width. (Or at least, I trust the generic GEP lowering to be correct a lot more.)
The DAGCombiner change to handle VPScatter/VPGather is technically separate, but is required to prevent a regression on those intrinsics.
Differential Revision: https://reviews.llvm.org/D134382
DXIL relies on a whole bunch of IR metadata constructs being populated
in the right shape. Rather than just hard coding or using complicated
arrangements of constant data strings, let's make first-class objects
that reprensent the metadata and manage reading and writing the
metadata from the module.
Reviewed By: python3kgae
Differential Revision: https://reviews.llvm.org/D134397
This patch uses structured bindings to simplify a couple of specific
cases when lowering RVV operations where we commonly declare two
SDValues and immediately 'tie' them to the mask and vector length.
There's also a couple places where we split vectors that structured
bindings make sense to use.
This patch tries to keep these sorts of changes minimal and to cases
where the returned types are commonly understood, rather than applying
this wholesale to the RISCV backend.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D134442
Various bits of existing code assume the presence of one operand implies the presence of another. Add verifier rules to catch violations.
Differential Revision: https://reviews.llvm.org/D133810
The default fixed vector legalization is to unroll. The default
scalable vector legalization is to clamp in the FP domain. The
RVV vfcvt instructions have saturating behavior so we can use them
directly. The only difference is that RVV instruction turn nan into
the max value, but the _SAT intrinsics want 0.
I'm only supporting 1 step of narrowing for now. I think we can
support more steps by using VNCLIP to saturate and narrower.
The only case that needs 2 steps of widening is f16->i64 which we can
do as f16->f32->i64.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D134400
They're roughly ARMv8.6. This works in the .td file, but in
AArch64TargetParser.def, marking them v8.6 brings in support for the SM4
cryptographic hash and we don't actually have that. So TargetParser side
they're marked as v8.5, with the extra features (BF16 and I8MM added manually).
Finally, A16 supports the HCX extension in addition to v8.6. This has no
TargetParser implications.
This mainly just adds costs for the targets where we have actual funnelshift/rotate instructions (VBMI2/XOP etc.) - the cases where we expand still need addressing, although for many the default shift+or expansion, especially for uniform cases, isn't that bad.
This was achieved with the 'cost-tables vs llvm-mca' script D103695
The patch fixes the SPIRV backend build using clang. It also replaces
UndefValue with PoisonValue in SPIRVRegularizer.cpp.
Fixes: #57773
Differential Revision: https://reviews.llvm.org/D134071
With Zbp removed, we no longer need the generalized forms.
The computeKnownBitsForTargetNode code brev8/orc.b is still based
on the general form with the shift amount forced to 7.
Use load32_zero instead of load32_splat to load the low 32 bits from memory to
v128. Test cases are added to cover this change.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D134257
Specifically predicates for extensions that are subsets of other
extensions. These predicates should never be used. Should always
check the superset extension or the superset ORed with the sub extendsion.
We have namespaces `DXIL` and `dxil`, which is just confusing. This
renames `DXIL` -> `dxil` making everything consistent.
While the LLVM coding standards don't have a clear direction here, I
chose lower case because by my current unscientific count there are
more places where we had the lowercase namespace than the uppercase.
Name them after the instructions VFCVT_RTZ_X(U)_F_VL to make it
clear that the ISD nodes don't have the poison semantics of
ISD::SINT_TO_FP/UINT_TO_FP.
I play to reuse this node for a FP_TO_SINT_SAT/FP_TO_UINT_SAT
patch and need the instruction semantics.
This validation was introduced in D34003 for v_qsad/v_mqsad instructions
but it applies to all instructions with earlyclobber operands, which now
includes v_mad_i64/v_mad_u64.
In all these cases I do not think there is documentation saying that the
destination must not overlap the sources. Rather there are *some* cases
where the instruction may not function correctly if there is an overlap,
and we are using earlyclobber as a conservative way of preventing
codegen from generating those cases.
I think it is unhelpful for the assembler to enforce the earlyclobber
restriction because it prevents assembling cases where the programmer
knows that in fact the overlap is safe.
See also: https://github.com/llvm/llvm-project/issues/57610
Differential Revision: https://reviews.llvm.org/D134272
Resizing operations (e.g. sign extension) in DAG can go from any width
to any other width, e.g. i8 -> i32. If the input and the result differ
by a factor larger than 2, the operation cannot be legal in HVX, since
the only two legal vector sizes in HVX are a single vector and a pair
of vectors.
To simplify the legalization, such operations are expanded into steps
that only double/halve the type size, so that each such step can be fully
legalized on its own. The complication is that DAG will automatically
fold these steps back into one, e.g. sext(sext) -> sext. To prevent that
new HexagonISD nodes are introduced: TL_EXTEND and TL_TRUNCATE. Once
legalized, these nodes are replaced with the original opcodes.
The type legalization is now common to aext/sext/zext/trunc and Hexagon-
specific ssat/usat nodes.
shuffle (tbl2, tbl2) can be folded into a single tbl4 if the mask for
the selected elements is constant.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133491
Add costs for the funnel shift instructions - fixes some discrepancies I was hitting with costs numbers from the 'cost-tables vs llvm-mca' script D103695
Remove obsolete ANDrm patterns for MIMM operands. We add these
translations to optimize commonly used cast operations before
we support MIMM operands directly by each isntruction. Such
translations are obsolete now.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134341
The llvm.aarch64.neon.scalar.sqxtn.i32.i64 intrinsics take and return
integer types, but operate on fp registers. This can create some
inefficiencies in their lowering, where the registers are converted to
fp a little too late. This patch adds lowering for the intrinsics,
creating bitcasts to/from fp types to allow nicer folding later when the
instructions are selected, especially around insert/extracts.
Differential Revision: https://reviews.llvm.org/D134024
We previously added l2i/i2l macros to simpily EXTRACT_SUBREG/INSERT_SUBREG
conversions. This patch changes VEInstrInfo.td to use such macros to
simplify existing code.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134118
Add maxnum and minnum for float and double. Lowering is already
implemented, so this patch changes them legal and adds regression
tests.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134108
VE has fused multiply-add instruction for only vector calculations. This
patch forces to expand scalar FMA to multiply and add instructions.
This patch also adds regression test.
Reviewed By: efocht
Differential Revision: https://reviews.llvm.org/D134107
This adds some quick tablegen patterns for vector_insert(bitcast(..))
and bitcast(vector_extract(..)), allowing us to avoid a round-trip
through GPRs.
Differential Revision: https://reviews.llvm.org/D134022
Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
when the callee's function body is executed in a different streaming mode than
its caller. This is needed because function calls are the boundaries for
streaming mode changes.
More details about the SME attributes and design can be found
in D131562.
Differential Revision: https://reviews.llvm.org/D131581
This extension does not appear to be on its way to ratification.
Out of the unratified bitmanip extensions, this one had the
largest impact on the compiler.
Posting this patch to start a discussion about whether we should
remove these extensions. We'll talk more at the RISC-V sync meeting this
Thursday.
Reviewed By: asb, reames
Differential Revision: https://reviews.llvm.org/D133834
Summary:
Under code object version 5, ockl_get_local_size returns the value computed by the expression:
workgroup_id < hidden_block_count ? hidden_group_size : hidden_remainder
For functions with the attribute uniform-work-group-size=true. we can evaluate workgroup_id < hidden_block_count
as true, and thus hidden_group_size is returned for ockl_get_local_size.
With uniform-workgroup-size=true, this work also set all remainders to zero, and if there
is reqd_work_group_size, we also set work-group-size to the required value from the metadata.
Reviewers:
arsenm and bcahoon
Differential Revision:
https://reviews.llvm.org/D131276
Clean up ahead of a patch to fix bugs in the AMDGPUDisassembler.
Use lit.local.cfg substitutions and more idiomatic use of split-file to
simplify and extend existing kernel-descriptor disassembly tests.
Add a comment to AMDHSAKernelDescriptor.h, as at least one small set
towards keeping all kernel-descriptor sensitive code in sync.
Reviewed By: kzhuravl, arsenm
Differential Revision: https://reviews.llvm.org/D130105
Since there is no guaranteed correspondence of SDNode and MI operands, we need
getters simular to RISCVII::get*OpNum for SDNodes.
More uses of getVecPolicyOpIdx will be added in D130895.
Reviewed By: craig.topper, arcbbb
Differential Revision: https://reviews.llvm.org/D134179
This implements experimental support for the Zawrs extension as specified here: https://github.com/riscv/riscv-zawrs/releases/download/V1.0-rc3/Zawrs.pdf. Despite the 1.0 version name, this has not been ratified and there was a major change to proposed specification between rc2 and rc3. Once this is ratified, it'll move out of experimental status.
This change adds assembly support, but does not include C language or IR intrinsics. We can decide if we want them, and handle that in a separate patch.
Differential Revision: https://reviews.llvm.org/D133443
This patch enables the LSLFast feature for Cortex-A76, Cortex-A77,
Cortex-A78, Cortex-A78C, Cortex-A710, Cortex-X1, Cortex-X2, Neoverse N1,
Neoverse N2, Neoverse V1 and the Neoverse 512TB pseudo-cpu, in-line with
the software optimization guides for those CPUs.
Differntial revision: https://reviews.llvm.org/D134273
Due to the encoding changes in GFX11, we had a hack in place that
disables the use of VGPRs above 128. This patch removes the need for
that hack.
We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
operands of VOP1, VOP2, and VOPC instructions. This register class only has the
low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
VOP2, and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.
We introduce new pseduo-instructions used on GFX11 which have the suffix
t16 (True 16) to use the VGPR_32_Lo128 register class.
Reviewed By: foad, rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D133723
This patch removes the intrinsic aarch64.sve.ldN from tablegen in favour of
using arch64.sve.ldN.sret.
Depends on: D133023
Differential Revision: https://reviews.llvm.org/D133025
The change add support for the cases when return value is passed in
memory rathen than in registers.
Differential Revision: https://reviews.llvm.org/D134181
If we have already calculated the incoming state before, use that
as our starting point to ensure we are conservative.
This fixes an infinite loop found in our downstream where we
we allowed two waves of updates to propagate through a loop and
the merge points allowed us to toggle back and forth between states.
No small reproducer right now.
Differential Revision: https://reviews.llvm.org/D134229
This change finalizes the series of patches aiming to replace the old strategy of VGPR to SGPR copy lowering.
# Following the https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367 code parts that are no longer used were removed.
# The first pass over the MachineFunctoin collects all the necessary information.
# Lowering is done in 3 phases:
- VGPR to SGPR copies analysis lowering
- REG_SEQUENCE, PHIs, and SGPR to VGPR copies lowering
- SCC copies lowering is done in a separate pass over the Machine Function
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131246
We were only setting this flag the first time we added the blocks
not when we mark them for revisiting.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D134193
Previously only using the UnsafeFPMath option, this now looks for the
fast moth flags on the instructions, using the same flag flags as other
backends.
shrinkdemandedconstant does some optimizations, but is not very friendly to riscv, targetShrinkDemandedConstant to limit the damage.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D134155
The LEA optimization pass visits each basic block of a given machine
function. In each basic block, for each pair of LEAs that differ only
in their displacement fields, we replace all uses of the second LEA
with the first LEA while adjusting the displacement.
Now, without this patch, after all the replacements are made, the
following assert triggers:
assert(MRI->use_empty(LastVReg) &&
"The LEA's def register must have no uses");
The replacement loop uses:
for (MachineOperand &MO :
llvm::make_early_inc_range(MRI->use_operands(LastVReg))) {
which is equivalent to:
for (auto UI = MRI->use_begin(LastVReg), UE = MRI->use_end();
UI != UE;) {
MachineOperand &MO = *UI++; // <-- Look!
That is, immediately after the post increment, make_early_inc_range
already has the iterator for the next iteration in its mind.
The problem is that in one iteration of the loop, we could replace two
uses in a debug instruction like:
DBG_VALUE_LIST !"r", !DIExpression(DW_OP_LLVM_arg, 0), %0:gr64, %0:gr64, ...
So, the iterator for the next iteration becomes invalid. We end up
traversing a garbage use list from that point on. In turn, we don't
get to visit remaining uses.
The patch fixes the problem by switching to a "draining" while loop:
while (!MRI->use_empty(LastVReg)) {
MachineOperand &MO = *MRI->use_begin(LastVReg);
MachineInstr &MI = *MO.getParent();
The credit goes to Simon Pilgrim for reducing the test case.
Fixes https://github.com/llvm/llvm-project/issues/57673
Differential Revision: https://reviews.llvm.org/D133631
Special registers, e.g. MODE, do not have register classes so
will cause null pointer exception if passed to isSGPRReg.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134025
When a streaming mode change is (or may be) required for a call, it will
need to restore the original mode after the call, which prevents the use of
tail-call optimization. The same holds true for a call that requires the lazy-save
mechanism to be set up before the call, and possibly restored after.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131579
This is a partial port of the code used by the SelectionDAGBuilder to
translate selects.
In particular, see matchSelectPattern in ValueTracking.cpp. This is a
GISel-equivalent of the portion which handles fminnum/fmaxnum/fminimum/fmaximum.
I tried to set it up so it'd be easy to add the non-FP cases. Those are simpler.
On the AArch64-end, it seems like the FP cases are more important for perf
right now, so I bit the bullet and went at the more complicated problem. :)
I elected to do this as a post-legalize combine rather than in the
IRTranslator because
Deciding which fmax/fmin to use can depend on legalization rules
Philosophically-speaking (TM), putting it in a combine just feels cleaner
Being able to enable/disable the combine is handy
Another option would be to use the ValueTracking code in the IRTranslator and
match what SelectionDAGBuilder::visitSelect does. I think that may be somewhat
annoying since we'd need to write lowerings back into the selects in the
legalizer. I'm not strongly opposed to the approach.
We'd also want to be careful with vector selects once that's implemented,
which explicitly check if a vector select is legal on the target. That'd
probably need a hook.
From what I can tell, doing this as a combine is probably a cleaner option
long-term.
Differential Revision: https://reviews.llvm.org/D116702
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (and recent fixes to the bdver2 + alderlake models)
Adding full CostKinds costs are affecting some other tests as they make assumptions about SizeLatency costs, so they need addressing first
When a function is streaming-compatible and calls a function with a normal or streaming
interface, it may need to enable/disable stremaing mode before the call, and
needs to restore PSTATE.SM after the call.
This patch implements this with a Pseudo node that gets expanded to a
conditional branch and smstart/smstop node.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131578
Although unsupported on HSW, we reuse this model for KNL which does require them
Noticed when running the cost model fuzz script from D103695 with -mcpu=knl
This patch implements the ABI for calls from:
Normal -> Streaming
Normal -> Streaming-compatible
Streaming -> Normal
Streaming -> Streaming-compatible
Streaming -> Streaming
The compiler inserts SMSTART/SMSTOP instructions before and after the call,
depending on the required transition.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131576
On AArch64, doing the vector truncate separately after the fptoui
conversion can be lowered more efficiently using tbl.4, building on
D133495.
https://alive2.llvm.org/ce/z/T538CC
Depends on D133495
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133496
These were based off a mixture of vector integer add/sub costs and the numbers from the 'cost-tables vs llvm-mca' script from D103695 - the extra costs for different predicates are still proving tricky to implement, but I've gotten most costs to within +/1 now - the AVX512 are tricky as we still don't handle predicate results properly, so most of these were done by hand.
Similar to using tbl to lower vector ZExts, tbl4 can be used to lower
vector truncates.
The initial version support i32->i8 conversions.
Depends on D120571
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133495
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133593
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.
This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.
This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no way other convenient way.
This improves the generated code for loops like the one below in
combination with D96522.
int foo(uint8_t *p, int N) {
unsigned long long sum = 0;
for (int i = 0; i < N ; i++, p++) {
unsigned int v = *p;
sum += (v < 127) ? v : 256 - v;
}
return sum;
}
https://clang.godbolt.org/z/Wco866MjY
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D120571
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduced amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
These are the worst case generic vector shift costs, where nothing is known about the shift amounts - in particular this should stop us using the default sizelatency cost of 1 for so many pre-AVX2 vector shifts that can often actually expand during lowering to +20 uops, just for 128-bit vectors, resulting in some horrible inline/unroll decisions.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.
It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.
The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
A thread may not have access to SME or TPIDR2_EL0, so in order to
safely query PSTATE.SM in a streaming-compatible function, the
code should call `__arm_sme_state()`, as described in the ABI:
c2bb09c4d4
This means that the value of pstate.sm is:
* 0 if the function is non-streaming.
* 1 if the function has `arm_streaming` or `arm_locally_streaming`.
* evaluated at runtime by a call to __arm_sme_state() otherwise.
This patch also adds a calling convention for calls to SME support routines.
At some point we can remove the need for the llvm.aarch64.get.pstatesm() intrinsic
and use function calls (with the corresponding cc) directly instead.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131571
The class priority is expected to be at most 5 bits before it starts
clobbering bits used for other fields. Also clamp the instruction
distance in case we have millions of instructions.
AMDGPU was accidentally overflowing into the global priority bit in
some cases. I think in principal we would have wanted this, but in the
cases I've looked at, it had the counter intuitive effect and
de-prioritized the large register tuple.
Avoid using weird bit hack PPC uses for global priority. The
AllocationPriority field is really 5 bits, and PPC was relying on
overflowing this to 6-bits to forcibly set the global priority
bit. Split this out as a separate flag to avoid having magic behavior
for values above 31.
Vector shift by const uniform is the cheapest shift instruction we have, non-const uniform have a marginally higher cost - some targets 'splat' the amount internally to use the shift-per-element instruction, others see a higher cost for the explicit zeroing of the upper bits for the (64-bit) shift amount.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
A complete implementation of `applyFixup` for D132323.
Makes `LoongArchAsmBackend::shouldForceRelocation` to determine
if the relocation types must be forced.
This patch also adds range and alignment checks for `b*` instructions'
operands, at which point the offset to a label is known.
Differential Revision: https://reviews.llvm.org/D132818
The patch adds the regularization pass that prepare LLVM IR for
the IR translation. It also contains following changes:
- reduce indentation, make getNonParametrizedType, getSamplerType,
getPipeType, getImageType, getSampledImageType static in SPIRVBuiltins,
- rename mayBeOclOrSpirvBuiltin to getOclOrSpirvBuiltinDemangledName,
- move isOpenCLBuiltinType, isSPIRVBuiltinType, isSpecialType from
SPIRVGlobalRegistry.cpp to SPIRVUtils.cpp, renaming isSpecialType to
isSpecialOpaqueType,
- implment getTgtMemIntrinsic() in SPIRVISelLowering,
- add hasSideEffects = 0 in Pseudo (SPIRVInstrFormats.td),
- add legalization rule for G_MEMSET, correct G_BRCOND rule,
- add capability processing for OpBuildNDRange in SPIRVModuleAnalysis,
- don't correct types of registers holding constants and used in
G_ADDRSPACE_CAST (SPIRVPreLegalizer.cpp),
- lower memset/bswap intrinsics to functions in SPIRVPrepareFunctions,
- change TargetLoweringObjectFileELF to SPIRVTargetObjectFile
in SPIRVTargetMachine.cpp,
- correct comments.
5 LIT tests are added to show the improvement.
Differential Revision: https://reviews.llvm.org/D133253
Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
doesn't happened in peephole optimizer.
Summary: Converting a comparison against 1 or -1 into a comparison
against 0 can exploit record-form instructions for comparison optimization.
The conversion will happen only when a record-form instruction can be used
to replace the comparison during the peephole optimizer (see function optimizeCompareInstr).
In post-RA, we also want to optimize the comparison by using the record
form (see D131873) and it requires additional dataflow analysis to reliably
find uses of the CR register set.
It's reasonable to common the conversion for both peephole optimizer and
post-RA optimizer.
Converting to comparison against zero even when the optimization doesn't
happened in peephole optimizer may create additional opportunities for the
post-RA optimization.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D131374
Add fix for propagation of !pcsections metadata for expanded atomics,
together with more tests for interesting atomic instructions (based on
llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D133710
`MOVEM` is used to spill the register, which will cause problem with 1 byte data, since it only supports word (2 bytes) and long (4 bytes) size.
We change to use the normal `move` instruction to spill 1 byte data.
Fixes#57660
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D133636
This code was written as if it lived in the MC layer instead of
the CodeGen layer. We get the MCInstrDesc directly from MachineInstr.
And we can use RISCVSubtarget::is64Bit instead of going to the
Triple.
Differential Revision: https://reviews.llvm.org/D133905
Copy the asserts from the printing code, and turn them into actual verifier rules. Doing this revealed an existing bug - see 0a14551.
Differential Revision: https://reviews.llvm.org/D133869
Found this when adding verifier rules. The case which arises is that we have a DefMBBI which has a VecPolicy operand. The code was not expecting this, and the unconditional copy of the last two operands resulted in the SEW and VecPolicy fields being added to the VMV_V_V as AVL and SEW respectively.
Oddly, this appears to be a silent in practice. There's no test change despite verifier changes proving that we definitely hit this in existing tests.
Differential Revision: https://reviews.llvm.org/D133868
The rest of the code section assumes there are exactly two elements
in the vector (Lo, Hi), so add the check before entering the section.
Differential Revision: https://reviews.llvm.org/D133852
SLH will fall back to a different technique if X16 is being used,
so there is no need to warn for inline asm use. Only prevent other codegen
from using it.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D133766
The current code for generating nontemporal load outputs the wrong assembly for big endian architecture.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133789
Corrects the shift by constant costs to better account for them being converted to multiples for lowering - which demonstrates that we should probably be trying harder NOT to convert these to multiplies for some CPUs (v4i32 in particular).
Bug noted in D112717 can be sidestepped with this change.
Expanding all ConstantExpr involved with LDS up front makes the variable specialisation simpler. Excludes ConstantExpr that don't access LDS to avoid disturbing codegen elsewhere.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133422