The old canRealignStack called TRI::canRealignStack and hasReservedCallFrame.
But hasReservedCallFrame always returns true for VE, since VE
allocates the call frame all the time. That makes this canRealignStack
identical to TRI::canRealignStack. This patch removes VE's
canRealignStack and lets callers call TRI::canRealignStack directly.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91929
This is part of the discussion on D91876 about trying to reduce custom lowering of MIN/MAX ops on older SSE targets - if we can improve generic vector expansion, we should be able to relax the limitations in SelectionDAGBuilder on when MIN/MAX ops may be generated, and avoid having to flag so many ops as 'custom'.
We generate two 4-byte loads or two stores as part of the expansion.
Previously the MemOperand was set the same for both, covering the
full 8 bytes. Now we set a separate 4-byte mem operand for each,
with a 4-byte offset for the high part.
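A minimal sketch of the shape of that change, assuming the MachineFunction::getMachineMemOperand overload that derives a new operand from an existing one (the surrounding variable names are illustrative, not from this patch):
```
// Hedged sketch: derive two 4-byte memory operands from the original
// 8-byte operand MMO, offsetting the high half by 4.
MachineMemOperand *LoMMO =
    MF.getMachineMemOperand(MMO, /*Offset=*/0, /*Size=*/4);
MachineMemOperand *HiMMO =
    MF.getMachineMemOperand(MMO, /*Offset=*/4, /*Size=*/4);
```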
Summary: We have the patterns to fold 2 RLWINMs before RA, but some RLWINMs are generated after RA, for example rGc4690b007743. If an RLWINM generated after RA is followed by another RLWINM, we expect to perform the optimization too.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D89855
All these potential null pointer dereferences are reported by my static analyzer for null smart pointer dereferences, which has a different implementation from `alpha.cplusplus.SmartPtr`.
The checked pointers in this patch are initialized by Target::createXXX functions. When the creator function pointer is not correctly set, a null pointer will be returned, or the creator function may itself return a null pointer.
Some of them may not make sense, as they may be checked before entering the function, but I fixed them all in this patch. I submit this fix because 1) similar checks are found in some other places in the LLVM codebase for the same return value of the function; and 2) some of the pointers are dereferenced before they are checked, which would definitely trigger a null pointer dereference if the return value is nullptr.
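For reference, the guarded pattern looks roughly like this (a sketch, not code from this patch; the variable names are invented):
```
// Hedged sketch: Target::createTargetMachine returns null when no creator
// was registered, so the result must be checked before use.
TargetMachine *TM = TheTarget->createTargetMachine(
    TripleName, CPU, Features, Options, /*RM=*/None);
if (!TM)
  report_fatal_error("unable to create target machine");
```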
Reviewed By: tejohnson, MaskRay, jpienaar
Differential Revision: https://reviews.llvm.org/D91410
%rip was only included for 64-bit RIP-relative relocations, but needs to be included for 32-bit as well.
Reviewed By: MaskRay, RKSimon
Differential Revision: https://reviews.llvm.org/D91339
The getAdjustedFrameSize function may need to handle integers larger than
32 bits, so change int to uint64_t.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91862
Also use DataLayout to get type size. Relying on the IR type size is
also pretty broken here, since this won't perfectly capture how types
are legalized.
Prior to this the DefaultMode was never selected, but RISCVGenDAGISel.inc, RISCVGenRegisterInfo.inc, and RISCVGenGlobalISel.inc all ended up with extra table entries for that mode.
This patch removes the RV32 HwMode and uses DefaultMode for RV32. This impressively reduces the size of my release+asserts llc binary by about 270K: about 15K from RISCVGenDAGISel.inc, 1-2K from RISCVGenRegisterInfo.inc, but the vast majority from RISCVGenGlobalISel.inc.
Differential Revision: https://reviews.llvm.org/D90973
Previously we required an sra to pattern match these properly in isel. If the consumer didn't need the result sign extended, we'd have an srl instead of an sra and fail to match.
This patch switches to custom legalizing to GREVIW using portions of D91259.
Differential Revision: https://reviews.llvm.org/D91457
This should result in better utilization of RORIW since we
don't need to look for a SIGN_EXTEND_INREG that may not exist.
Also remove rotl/rotr isel matching to GREVI and just prefer RORI.
This is to keep consistency so we don't have to match ROLW/RORW
to GREVIW as well. I imagine RORI/RORIW performance will be the
same as or better than GREVI's.
Differential Revision: https://reviews.llvm.org/D91449
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on SSE targets.
Enable custom lowering for v8i32/v4i64 and generalize the 128-bit lowering code for any vector size - this also lets us use the slightly cheaper codegen for icmp_ugt instead of umin/umax.
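The expansion shape, as a generic sketch (illustrative SelectionDAG calls, not the literal X86 lowering; assumes all-ones boolean vector contents, as on X86):
```
// Hedged sketch of uaddsat(x, y): on unsigned overflow the compare
// produces an all-ones lane, which the OR saturates to -1.
SDValue Sum = DAG.getNode(ISD::ADD, DL, VT, X, Y);
SDValue Ovf = DAG.getSetCC(DL, VT, Sum, X, ISD::SETULT);
SDValue Res = DAG.getNode(ISD::OR, DL, VT, Sum, Ovf);
// usubsat(x, y) is the mirror image: AND the difference with the
// icmp_ugt(x, y) mask so lanes that would underflow clamp to 0.
```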
The default version only works if the returned node has a single
result. The X86 and PowerPC versions support multiple results
and allow a single result to be returned from a node with
multiple outputs, and allow a single result that is not result 0
of the node.
Also replace the Mips version since the new version should work
for it. The original version handled multiple results, but only
if the new node and original node had the same number of results.
Differential Revision: https://reviews.llvm.org/D91846
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on pre-SSE4 targets.
We're still using X86ISD::BLENDV on some AVX targets as we don't do custom lowering for >= 256-bit vectors.
Really this (and combineVSelectWithAllOnesOrZeros) needs moving to DAGCombiner, but pre-SSE42 we see the vXi64 comparison type as a 2 x 32-bit result, so we can't just rely on ComputeNumSignBits to give us the 'all bits' result we need.
This will ensure that passes that add new global variables will create them
in address space 1 once the passes have been updated to no longer default
to the implicit address space zero.
This also changes AutoUpgrade.cpp to add -G1 to the DataLayout if it wasn't
already present, to ensure bitcode backwards compatibility.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D84345
Just something I forgot when I added the R82. Need to have a look
at crypto and fusing, but will do that as a follow up.
Differential Revision: https://reviews.llvm.org/D91848
This checks to see if the loop will likely become a tail predicated loop
and disables wls loop generation if so, as the likelihood for reverting
is currently too high. These should be fairly rare situations anyway due
to the way iterations and element counts are used during lowering. Just
not trying can alter how SCEV's are materialized however, leading to
different codegen.
It also adds an option to disable all while low overhead loops, for
debugging.
Differential Revision: https://reviews.llvm.org/D91663
This patch implements the out-of-line atomics mechanism for LSE
deployment. Details on how it works can be found in llvm/docs/Atomics.rst.
Options -moutline-atomics and -mno-outline-atomics to enable and disable it
were added to the clang driver. This is the clang and llvm part of the
out-of-line atomics interface; the library part is already supported by
libgcc. Compiler-rt support is provided in a separate patch.
Differential Revision: https://reviews.llvm.org/D91157
Implement getMinimumJumpTableEntries() to specify the threshold for jump
table generation. We use 8 in PIC mode to relieve the impact of the PIC
calculation required to implement PIC-mode jump tables.
Also update the jump table regression test.
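A sketch of what such an override can look like (hedged; not necessarily the exact VE code):
```
// Hedged sketch: raise the jump-table threshold under PIC, where each
// table entry costs extra PIC address computation.
unsigned VETargetLowering::getMinimumJumpTableEntries() const {
  if (isJumpTableRelative())
    return 8;
  return TargetLowering::getMinimumJumpTableEntries();
}
```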
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91785
Extract the scratch offset from the scratch buffer descriptor that is
stored in the global table.
Differential Revision: https://reviews.llvm.org/D91701
For MASM syntax, the prefixes are not enclosed in braces.
The assembly code should look like:
"evex vcvtps2pd xmm0, xmm1"
Differential Revision: https://reviews.llvm.org/D90441
Clang generates a '%' prefix for some registers in CFI directives. E.g.
".cfi_register lr, r12" becomes ".cfi_register lr, %r12" after
processing.
Differential Revision: https://reviews.llvm.org/D91735
When constructing a MemoryLocation by hand, require that a
LocationSize is explicitly specified. D91649 will split up
LocationSize::unknown() into two different states, and callers
should make an explicit choice regarding the kind of MemoryLocation
they want to have.
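For illustration, a call site now spells the size out explicitly (a sketch; the variable names are invented):
```
// Hedged sketch: the LocationSize argument is now mandatory, so pick an
// explicit precise size (or an explicitly-unknown one) at each call site.
MemoryLocation Loc(Ptr, LocationSize::precise(DL.getTypeStoreSize(Ty)));
```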
This moves the recognition of GREVI and GORCI from TableGen patterns
into a DAGCombine. This is done primarily to match "deeper" patterns in
the future, like (grevi (grevi x, 1) 2) -> (grevi x, 3).
TableGen is not best suited to matching patterns such as these as the compile
time of the DAG matchers quickly gets out of hand due to the expansion of
commutative permutations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91259
This converts the intermediate VPR use assertion to a condition in the if-statement, to protect against assertion failures in case behaviour is changed.
This is a follow-up to https://reviews.llvm.org/D90935 and implements the post-approval comments.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91790
This was already something that was handled by one of the "else"
branches in maybeLoweredToCall, so this patch is an NFC but makes it
explicit and adds a test. We may in the future want to support this
under certain situations, but for the moment just don't try to create
low overhead loops with inline asm in them.
Differential Revision: https://reviews.llvm.org/D91257
D57663 allowed us to reuse broadcasts of the same scalar value by extracting low subvectors from the widest type.
Unfortunately we weren't ensuring the broadcasts were from the same SDValue, just the same SDNode - which failed on multiple-value nodes like ISD::SDIVREM.
FYI: I intend to request this be merged into the 11.x release branch.
Differential Revision: https://reviews.llvm.org/D91709
This defines the vec_broadcast SDNode along with lowering and isel code.
We also remove unused type mappings for the vector register classes (all vector MVTs that are not used in the ISA are dropped).
We will implement support for short vectors later by intercepting nodes with illegal vector EVTs before LLVM has had a chance to widen them.
Reviewed By: kaz7
Differential Revision: https://reviews.llvm.org/D91646
2c196bbc6b asserted that
`SmallVector::push_back` doesn't invalidate the parameter when it needs
to grow. Do the same for `resize`, `append`, `assign`, `insert`, and
`emplace_back`.
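The hazard these assertions catch looks like this (an illustrative sketch, not code from the patch):
```
// Hedged sketch: push_back takes a reference for non-trivial element
// types; if the grow reallocates first, V[0] dangles mid-call. The new
// assertions fire on exactly this kind of interior reference.
SmallVector<std::string, 2> V = {"a", "b"};
V.push_back(V[0]);
```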
Differential Revision: https://reviews.llvm.org/D91744
@tangxingxin1008 found a bug where vadd.vv v1, v3, a0 was regarded as a valid V
instruction. We should remove the VRegAsmOperand operand class and use
the VR register class directly.
Patched by: tangxingxin1008, Hsiangkai
Differential Revision: https://reviews.llvm.org/D91712
In some situations, the compiler may insert an accumulator prime instruction and
an accumulator unprime instruction with no use of that accumulator between the two.
That's for example the case when we store an accumulator after assembling it or
restoring it. This patch adds a peephole to remove these prime and unprime instructions.
Differential Revision: https://reviews.llvm.org/D91386
We have workarounds for two different cases where vccz can get out of
sync with the value in vcc. This fixes them in two ways:
1. Fix the case where the def of vcc was in a previous basic block, by
pessimistically assuming that vccz might be incorrect at a basic block
boundary.
2. Fix the handling of pre-existing waitcnt instructions by calling
generateWaitcntInstBefore before examining ScoreBrackets to determine
whether there's an outstanding smem read operation.
Differential Revision: https://reviews.llvm.org/D91636
The SystemZISD::IABS node is no longer needed since ISD::ABS can be used
instead.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D91697
We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.
We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW.
I think we might be able to use this for scalars too by moving into a vector and back. But I'll save that for a follow up as it's a little more involved.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D91515
Implement JumpTable lowering to make BRIND work on VE. Also update an
existing br_jt regression test.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91582
This patch factors out the part of printInstruction that gets the
mnemonic string for a given MCInst. This is intended to be used
subsequently for the instruction-mix remarks to display the final
mnemonic (D90040).
Unfortunately making `getMnemonic` available to the AsmPrinter
seems to require making it virtual. Not sure if there's a way around
that with the current layering of the AsmPrinters.
Reviewed By: Paul-C-Anagnostopoulos
Differential Revision: https://reviews.llvm.org/D90039
When we see
```
xor = G_XOR xor_lhs, -1
select = G_SELECT cc, tval, xor
```
Fold this into
```
select = CSINV tval, xor_lhs, cc
```
Update select-select.mir to reflect the changes.
For now, only handle the case where the G_XOR is the false-value for the
G_SELECT. It may make more sense to handle the true-value case in post-legalizer
lowering.
Differential Revision: https://reviews.llvm.org/D90774
- In certain cases, a generic pointer could be assumed to be a pointer to
the global memory space or other spaces. With a dedicated target hook
to query that address space from a given value, the infer-address-space
pass can infer and propagate that to all its users.
Differential Revision: https://reviews.llvm.org/D91121
Add lvm/svm intrinsic instructions and a regression test. Change
RegisterInfo to specify that VM0/VMP0 are constant and reserved
registers; this modifies a vst regression test, so update it.
Also add pseudo instructions for the VM512 register classes
and a mechanism to expand them after register allocation.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91541
The G_ZEXT in these cases seems to actually come from a combine that we do but
SelectionDAG doesn't. Looking through it allows us to match "uxtw #2" addressing
modes.
Differential Revision: https://reviews.llvm.org/D91475
We need to make sure the upper 32 bits are all ones to ensure the result is properly sign extended. Previously we only checked the lower 32 bits of the mask. I've also added a check that the shift amount is less than 32. Without that the original code asserts inside maskLeadingOnes if the SROI check is removed or the SROIW pattern is checked first. I've refactored the code to use early outs to reduce nesting.
I've also updated SLOIW matching with the same changes, but I couldn't find a broken test case with the existing code.
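The added check amounts to something like the following (an illustrative sketch with invented variable names, not the literal patch):
```
// Hedged sketch: for SROIW the result must be properly sign extended, so
// the upper 32 bits of the mask must all be ones, and the shift amount
// must be less than 32.
uint64_t Mask = C->getZExtValue();
unsigned ShAmt = ShC->getZExtValue();
if ((Mask & 0xffffffff00000000ULL) != 0xffffffff00000000ULL || ShAmt >= 32)
  return false; // not an SROIW pattern
```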
Differential Revision: https://reviews.llvm.org/D90961
This defines a 'fastcc' for the VE target and implements vreg-to-vreg
copy for parameter passing. The 'fastcc' extends the standard CC for
SX-Aurora with register passing of vector-typed parameters and return
values.
Reviewed By: kaz7
Differential Revision: https://reviews.llvm.org/D90842
This patch fixes the function isWideningInstruction for scalable vectors.
Now the cost model can check the widening pattern for SVE.
Differential Revision: https://reviews.llvm.org/D91260
This patch adds the SchedMachineModel for Cortex-M7. It
also adds test cases for the scheduling information.
Details of the pipeline and descriptions are in comments
in file ARMScheduleM7.td included in this patch.
Differential Revision: https://reviews.llvm.org/D91355
Similar to the X86 and AMDGPU targets, this uses a macro to cut down on
repetitive and error-prone code when converting RISCVISD node names to
strings in getTargetNodeName.
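The macro style in question is roughly this (a sketch; the node names shown are placeholders):
```
// Hedged sketch of a macro-based getTargetNodeName body.
#define NODE_NAME_CASE(NODE)                                                   \
  case RISCVISD::NODE:                                                         \
    return "RISCVISD::" #NODE;
switch ((RISCVISD::NodeType)Opcode) {
  NODE_NAME_CASE(RET_FLAG)
  NODE_NAME_CASE(SELECT_CC)
  // ... one line per node ...
}
#undef NODE_NAME_CASE
```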
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D91414
The VE backend represents vector instructions with an explicit 'i32'
vector length operand. In the VE ISA, the vector length is always read
from the VL hardware register. The LVLGen pass inserts 'lvl'
instructions as necessary to set VL to the right value before each
vector instruction.
Reviewed By: kaz7
Differential Revision: https://reviews.llvm.org/D91416
We unconditionally marked i64 as Custom, but did not install a
handler in ReplaceNodeResults for when i64 isn't a legal type. This
led to ReplaceNodeResults asserting.
We have two options to fix this. Only mark i64 as Custom on
64-bit targets and let it expand to two i32 bitreverses which
each need a VPPERM. Or the other option is to add the Custom
handling to ReplaceNodeResults. This is what I went with.
When the load value is folded into the sin/cos operation, the
AMDGPU library call simplifier could still mark the function
as unmodified. Instead, ensure that on an early return we
return whether the load was folded into the sin/cos call.
Authored by MJDSys
Differential Revision: https://reviews.llvm.org/D91401
I'm not sure why it was added to DAGToDAG originally, but it seems
to make sense alongside the non-TLS version, LowerGlobalAddress.
Differential Revision: https://reviews.llvm.org/D91432
When we see
```
%sub = G_SUB 0, %x
%select = G_SELECT %cc, %t, %sub
```
Fold away the G_SUB by producing
```
%select = CSNEG %t, %x, cc
```
Simple IR example: https://godbolt.org/z/K8TEnh
This is valid on both sides of the select, but for now, just handle one side.
It may make more sense to handle swapping sides during post-legalizer lowering.
Differential Revision: https://reviews.llvm.org/D90723
Reducing some code duplication.
We had a helper for checking if a predicate is unsigned. Remove that and use
the existing function in Instructions.cpp.
Differential Revision: https://reviews.llvm.org/D91288
It's fairly common to need matchers for a specific constant value, or for
common idioms like finding a negated register.
Add
- `m_SpecificICst`, which returns true when matching a specific value.
- `m_ZeroInt`, which returns true when an integer 0 is matched.
- `m_Neg`, which returns true when a register is negated.
Also update a few places which use idioms related to the new matchers.
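Typical usage looks like this (an illustrative sketch; the register names are invented):
```
// Hedged usage sketch of the new matchers with mi_match.
Register Src;
if (mi_match(Dst, MRI, m_Neg(m_Reg(Src)))) {
  // Dst is defined as 0 - Src.
}
if (mi_match(Dst, MRI, m_SpecificICst(42))) {
  // Dst is the constant 42.
}
```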
Differential Revision: https://reviews.llvm.org/D91397
These relocations represent offsets from the __tls_base symbol.
Previously we were just using normal MEMORY_ADDR relocations and relying
on the linker to select a segment offset rather than an absolute value in
Symbol::getVirtualAddress(). Using an explicit relocation type allows
us to clearly distinguish absolute from relative relocations based
on the relocation information alone.
One place this is useful is being able to reject absolute relocations in
the PIC case, but still accept TLS relocations.
Differential Revision: https://reviews.llvm.org/D91276
If the scatter store is able to perform the sign/zero extend of
its index, this is folded into the instruction with refineIndexType().
Additionally, refineUniformBase() will return the base pointer and index
from an add + splat_vector.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90942
No longer rely on an external tool to build the llvm component layout.
Instead, leverage the existing `add_llvm_componentlibrary` cmake function and
introduce `add_llvm_component_group` to accurately describe component behavior.
These functions store extra properties in the created targets. These properties
are processed once all components are defined to resolve library dependencies
and produce the header expected by llvm-config.
Differential Revision: https://reviews.llvm.org/D90848
This was a mistake introduced in D91294. I'm not sure how to
exercise this with the existing code, but I hit it while trying
some follow up experiments.
We can't store garbage in the unused bits. It's possible that something like a zextload from i1/i2/i4 is created to read the memory. Those zextloads would be legalized assuming the extra bits are 0.
I'm not sure that the code in lowerStore is executed for the v1i1/v2i1/v4i1 case. It looks like the DAG combine in combineStore may have converted them to v8i1 first. And I think we're missing some cases to avoid going to the stack in the first place. But I don't have time to investigate those things at the moment so I wanted to focus on the correctness issue.
Should fix PR48147.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D91294
Select the following:
- G_SELECT cc, 0, 1 -> CSINC zreg, zreg, cc
- G_SELECT cc, 0, -1 -> CSINV zreg, zreg, cc
- G_SELECT cc, 1, f -> CSINC f, zreg, inv_cc
- G_SELECT cc, -1, f -> CSINV f, zreg, inv_cc
- G_SELECT cc, t, 1 -> CSINC t, zreg, cc
- G_SELECT cc, t, -1 -> CSINV t, zreg, cc
(IR example: https://godbolt.org/z/YfPna9)
These correspond to a bunch of the AArch64csel patterns in AArch64InstrInfo.td.
Unfortunately, it doesn't seem like we can import patterns that use NZCV like
those ones do. E.g.
```
def : Pat<(AArch64csel GPR32:$tval, (i32 1), (i32 imm:$cc), NZCV),
(CSINCWr GPR32:$tval, WZR, (i32 imm:$cc))>;
```
So we have to manually select these for now.
This replaces `selectSelectOpc` with an `emitSelect` function, which performs
these optimizations.
Differential Revision: https://reviews.llvm.org/D90701
Also fix a similar issue in SIInsertWaitcnts, but I don't think that fix
has any effect in practice.
Differential Revision: https://reviews.llvm.org/D91290
Follow-up from a similar patch on RISCV: 637f19c36b
Nothing reads this Glue value that I could see. The SDNode def in
the td file does not have the SDNPOutGlue flag so I don't think
this glue would get properly propagated to MachineSDNodes if it
was used.
- Use MCRegister instead of Register in MC layer.
- Move some enums from RISCVInstrInfo.h to RISCVBaseInfo.h to be with other TSFlags bits.
Differential Revision: https://reviews.llvm.org/D91114
The fshl and fshr intrinsics are defined to modulo their shift amount by the bitwidth of one of their inputs. The FSR/FSL instructions read one extra bit from the shift amount. If that bit is set the inputs are swapped. In order to preserve the semantics of the llvm intrinsics we need to make sure that the extra bit isn't set. DAG combine or instcombine may have removed any mask that was originally present.
We could be smarter here and try to use computeKnownBits to check if the bit is known zero, but wanted to start with correctness.
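Conceptually the fix inserts an explicit mask on the shift amount (a sketch, not the literal code; the variable names are invented):
```
// Hedged sketch: mask the shift amount to BitWidth-1 so the extra bit
// FSR/FSL read is guaranteed clear, preserving the llvm.fshl/llvm.fshr
// modulo semantics.
SDValue ShAmt = Op.getOperand(2);
SDValue Mask = DAG.getConstant(BitWidth - 1, DL, VT);
ShAmt = DAG.getNode(ISD::AND, DL, VT, ShAmt, Mask);
```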
Differential Revision: https://reviews.llvm.org/D90905
Of course there was something missing, in this case a check that the def
of the count register we are adding to a t2DoLoopStartTP would dominate
the insertion point.
In the future, when we remove some of these COPY's in between, the
t2DoLoopStartTP will always become the last instruction in the block,
preventing this from happening. In the meantime we need to check they
are created in a sensible order.
Differential Revision: https://reviews.llvm.org/D91287
Change the default type of v64 register class from v512i32 to v256f64.
Add a regression test also.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91301
When passing SVE types as arguments to function calls we can run
out of hardware SVE registers. This is normally fine, since we
switch to an indirect mode where we pass a pointer to a SVE stack
object in a GPR. However, if we switch over part-way through
processing a SVE tuple then part of it will be in registers and
the other part will be on the stack.
I've fixed this by ensuring that:
1. When we don't have enough registers to allocate the whole block
we mark any remaining SVE registers temporarily as allocated.
2. We temporarily remove the InConsecutiveRegs flags from the last
tuple part argument and reinvoke the autogenerated calling
convention handler. Doing this prevents the code from entering
an infinite recursion and, in combination with 1), ensures we
switch over to the Indirect mode.
3. After allocating a GPR register for the pointer to the tuple we
then deallocate any SVE registers we marked as allocated in 1).
We also set the InConsecutiveRegs flags back how they were before.
4. I've changed the AArch64ISelLowering LowerCALL and
LowerFormalArguments functions to detect the start of a tuple,
which involves allocating a single stack object and doing the
correct number of legal loads and stores.
Differential Revision: https://reviews.llvm.org/D90219
When there is full fp16 support, there is no reason to widen 16-bit
G_FCONSTANTs to 32 bits. Mark them as legal in this case.
Also, we currently import a pattern for materializing a 16-bit 0.0.
Add a testcase showing we select it.
(All other 16-bit G_FCONSTANTS are not yet selected.)
Differential Revision: https://reviews.llvm.org/D89164
We were creating RISCVISD::SELECT_CC nodes with Glue output that was never being used, and the tablegen SDNode had the SDNPInGlue flag instead of the SDNPOutGlue flag.
Since we don't seem to need the Glue just get rid of it from both places.
Differential Revision: https://reviews.llvm.org/D91199
The manual selection code for add/sub was not checking if it was possible to
fold in shifts + extends (the *rx opcode variants).
As a result, we could never select things like
```
cmp x1, w0, uxtw #2
```
because we don't import any patterns for compares.
This adds support for the arithmetic shifted register forms and updates tests
for instructions selected using `emitADD`, `emitADDS`, and `emitSUBS`.
This is a 0.1% geomean code size improvement on SPECINT2000 at -Os.
Differential Revision: https://reviews.llvm.org/D91207
Previously, we only handled negative arithmetic immediates in the imported
selector code.
Since we don't import code for, say, compares, we were missing opportunities
for things like
```
%cst:gpr(s64) = G_CONSTANT i64 -10
%cmp:gpr(s32) = G_ICMP intpred(eq), %reg0(s64), %cst
->
%adds = ADDSXri %reg0, 10, 0, implicit-def $nzcv
%cmp = CSINCWr $wzr, $wzr, 1, implicit $nzcv
```
Instead, we would have to materialize the constant and emit a SUBS.
This adds support for selection like above for SUB, SUBS, ADD, and ADDS.
This is a 0.1% geomean code size improvement on SPECINT2000 at -Os.
Differential Revision: https://reviews.llvm.org/D91108
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them.
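The correct merge, for reference (a sketch):
```
// Hedged sketch: the intersection of what two KnownBits agree on.
// A bit is commonly known-zero only if it is known zero in both, and
// likewise for known-one; other ways of combining the Zero/One members
// compute something else entirely.
KnownBits Common;
Common.Zero = LHS.Zero & RHS.Zero;
Common.One = LHS.One & RHS.One;
```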
Lowers the llvm.masked.scatter intrinsics (scalar plus vector addressing mode only).
Changes included in this patch:
- Custom lowering for MSCATTER, which chooses the appropriate scatter store opcode to use.
Floating-point scatters are cast to integer, with patterns added to match FP reinterpret_casts.
- Added the getCanonicalIndexType function to convert redundant addressing
modes (e.g. scaling is redundant when accessing bytes)
- Tests with 32 & 64-bit scaled & unscaled offsets
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90941
This patch adds the IsTruncatingStore flag to MaskedScatterSDNode, set by getMaskedScatter().
Updated SelectionDAGDumper::print_details for MaskedScatterSDNode to print
the details of masked scatters (is truncating, signed or scaled).
This is the first in a series of patches which add support for scalable masked scatters.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90939
These do things like turn a multiply of a pow-2+1 into a shift and an add,
which is a common pattern that pops up, and is universally better than expensive
madd instructions with a constant.
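For instance, with MachineIRBuilder the rewrite of a multiply by 9 looks roughly like this (a sketch; the register and type names are invented):
```
// Hedged sketch: x * 9 == (x << 3) + x, since 9 == (1 << 3) + 1.
LLT S64 = LLT::scalar(64);
auto ShlAmt = Builder.buildConstant(S64, 3);
auto Shl = Builder.buildShl(S64, X, ShlAmt);
Builder.buildAdd(Dst, Shl, X);
```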
I've added check lines to an existing codegen test since the code being ported
is almost identical; however, the mul by negative pow2 constant tests don't generate
the same code because we're missing some generic G_MUL combines still.
Differential Revision: https://reviews.llvm.org/D91125
Previously we used setRegClass to rgpr, which may expand the register
domain if the result was already in a constrained class (tcgpr in the
above PR).
Differential Revision: https://reviews.llvm.org/D91192
These are opsel opcodes with op_sel actually being ignored. As such,
op_sel_hi needs to be set to the default of 1 even though
these bits are ignored. This is a compatibility change.
Differential Revision: https://reviews.llvm.org/D91202
This introduces a new pseudo instruction, almost identical to a
t2DoLoopStart but taking 2 parameters - the original loop iteration
count needed for a low overhead loop, plus the VCTP element count needed
for a DLSTP instruction setting up a tail predicated loop. The idea is
that the instruction holds both values and the backend
ARMLowOverheadLoops pass can pick between the two, depending on whether
it creates a tail predicated loop or falls back to a low overhead loop.
To do that there needs to be something that converts a t2DoLoopStart to
a t2DoLoopStartTP, for which this patch repurposes the
MVEVPTOptimisationsPass as a "tail predication and vpt optimisation"
pass. The extra operand for the t2DoLoopStartTP is chosen based on the
operands of VCTP's in the loop, and the instruction is moved as late in
the block as possible to attempt to increase the likelihood of making
tail predicated loops.
Differential Revision: https://reviews.llvm.org/D90591
We already do not unroll loops with vector instructions under MVE, but
that does not include the remainder loops that the vectorizer produces.
These remainder loops will be rarely executed and are not worth
unrolling, as the trip count is likely to be low if they get executed at
all. Luckily they get llvm.loop.isvectorized to make recognizing them
simpler.
We have wanted to do this for a while but hit issues with low overhead
loops being reverted due to difficult register allocation. With recent
changes that seems to be less of an issue now.
Differential Revision: https://reviews.llvm.org/D90055
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.
The hint is added after others (from COPY's etc) which still take
precedence. It needed to find a place to add the hint, which currently
uses the post isel custom inserter.
Differential Revision: https://reviews.llvm.org/D89883
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR
This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.
This is a fairly simple change in itself, but leads to a number of other
required alterations.
- The hardware loop pass, if UsePhi is set, now generates loops of the
form:
%start = llvm.start.loop.iterations(%N)
loop:
%p = phi [%start], [%dec]
%dec = llvm.loop.decrement.reg(%p, 1)
%c = icmp ne %dec, 0
br %c, loop, exit
- For this a new llvm.start.loop.iterations intrinsic was added, identical
to llvm.set.loop.iterations but produces a value as seen above, gluing
the loop together more through def-use chains.
- This new intrinsic conceptually produces the same output as input,
which is taught to SCEV so that the checks in MVETailPredication are not
affected.
- Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
been left mostly as before. We should now more reliably be able to tell
that the t2DoLoopStart is correct without having to prove it, but
t2WhileLoopStart and tail-predicated loops will remain the same.
- And all the tests have been updated. There are a lot of them!
This patch on its own might cause more trouble than it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.
Differential Revision: https://reviews.llvm.org/D89881
We used cast where we should have used dyn_cast, so change it this time.
The old code caused problems when I implemented the brind instruction and
compiled openmp using the new compiler.
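The distinction, for reference (illustrative node type):
```
// cast<> asserts when the operand is not the expected subclass;
// dyn_cast<> returns nullptr so the caller can fall through gracefully.
if (const auto *GA = dyn_cast<GlobalAddressSDNode>(Op)) {
  // ... safe to use GA here ...
}
// A non-GlobalAddress operand (e.g. the target of an indirect branch)
// simply skips the block instead of crashing.
```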
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D91151
Some use cases (e.g. kernel devs) have strict requirements to only enable
features available with -march=armv8-a, e.g. no armv8.1-a. Enabling RAS 1.1 in
all AArch64 means they can consider supporting it.
Bear in mind that the first versions of the Armv8 architecture still do not
support RAS 1.1. This patch only lets devs write code with the user-friendly
register mnemonic instead of the ugly generic S<op0>_<op1>_<Cn>_<Cm>_<op2>.
They still need to place runtime checks to make sure that the CPU to run on
supports RAS 1.1.
Differential Revision: https://reviews.llvm.org/D90594
We can use KnownBitsAnalysis to cover cases when the mask is not trivial. It can
also help with cases when the mask is not constant but can still be folded into
one. Since 'and' is commutative we should treat both operands as possible
replacements.
Differential Revision: https://reviews.llvm.org/D90674
Summary: This patch tries to do the following transformation if the multiplier doesn't fit in an int16:
(mul X, c1 << c2) -> (rldicr (mulli X, c1) c2)
Reviewed By: jsji, steven.zhang
Differential Revision: https://reviews.llvm.org/D87384
If the source of S_MOV_{B32,B64}_term is an immediate then it
cannot be lowered to a COPY.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D90451
For example, if the sign extension is only used for TBZ, and the value is used elsewhere with a zero extension, this can eliminate the sign extension.
Reviewed By: samparker
Differential Revision: https://reviews.llvm.org/D90606