Intrinsic declarations use the default subtarget, but this should be
using the subtarget for the calling function. I haven't been able to
come up with a case where it matters though.
This patch adjusts the following ARM/AArch64 LLVM IR intrinsics:
- neon_bfmmla
- neon_bfmlalb
- neon_bfmlalt
so that they take and return bf16 and float types. Previously these
intrinsics used <8 x i8> and <4 x i8> vectors (a rudiment from
implementation lacking bf16 IR type).
The neon_vbfdot[q] intrinsics are adjusted similarly. This change
required some additional selection patterns for vbfdot itself and
also for vector shuffles (in a previous patch) because of SelectionDAG
transformations kicking in and mangling the original code.
This patch makes the generated IR cleaner (less useless bitcasts are
produced), but it does not affect the final assembly.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D86146
Original Commit Message:
After the commit r368987 (rG643adb55769e) was landed, the frame record (FP and LR register)
may be placed in the middle of a stack frame if a function has both callee-saved
general-purpose registers and floating point registers. This will break the stack unwinders
that simply walk through the frame records (based on the guarantee from AAPCS64
"The Frame Pointer" section). This commit fixes the problem by adding the frame record offset.
Patch By: logan
Differential Revision: D70800
This patch adds code to recognize vector shuffles which can be
represented as VDUP (splat) of a vector lane with of a different
(wider) type than the original vector lane type.
For example:
shufflevector <4 x i16> %v, <4 x i16> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
is essentially:
shufflevector <2 x i32> %v, <2 x i32> undef, <2 x i32> <i32 0, i32 0>
Such patterns are generated by the SelectionDAG machinery in some cases
(see DAGCombiner::visitBITCAST in DAGCombiner.cpp, the "Remove double
bitcasts from shuffles" part).
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D86225
Since the canonical floatig-point move is fsgnj rd, rs, rs, we should
handle this case in RISCVInstrInfo::isAsCheapAsAMove().
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D86518
The isTriviallyRematerializable hook is only called for instructions that are
tagged as isAsCheapAsAMove. Since ADDI 0 is used for "mv" it should definitely
be marked with "isAsCheapAsAMove". This change avoids one stack spill in most of
the atomic-rmw.ll tests functions. It also avoids stack spills in two of our
out-of-tree CHERI tests.
ORI/XORI with zero may or may not be the same as a move micro-architecturally,
but since we are already doing it for register == x0, we might as well
do the same if the immediate is zero.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D86480
There is no justification for changing vcc_lo to vcc
when shrinking V_CNDMASK, and such a change could
later confuse live variable analysis.
Make sure the original register is preserved.
Differential Revision: https://reviews.llvm.org/D86541
Enable default outlining when the function has the minsize attribute
and we're targeting an m-class core.
Differential Revision: https://reviews.llvm.org/D82951
Implements the assemble and disassemble support of RISCV Vector
extension zvamo instructions, base on the 0.9 spec version.
Reviewed by HsiangKai
Differential Revision: https://reviews.llvm.org/D85069
Fix the ARM backend's analyzeBranch so it doesn't ignore predicated
return instructions, and make the MachineVerifier rule more strict.
Differential Revision: https://reviews.llvm.org/D40061
This patch implements the function prototypes vec_mulh and vec_dive in order to
utilize the vector multiply high (vmulh[s|u][w|d]) and vector divide extended
(vdive[s|u][w|d]) instructions introduced in Power10.
Differential Revision: https://reviews.llvm.org/D82609
AArch64, X86 and Mips currently directly consumes these and custom
lowering to produce a libcall, but really these should follow the
normal legalization process through the libcall/lower action.
Original Commit Message:
After the commit r368987 (rG643adb55769e) was landed, the frame record (FP and LR register)
may be placed in the middle of a stack frame if a function has both callee-saved
general-purpose registers and floating point registers. This will break the stack unwinders
that simply walk through the frame records (based on the guarantee from AAPCS64
"The Frame Pointer" section). This commit fixes the problem by adding the frame record offset.
Patch By: logan
The version of `st1d` that operates with vector plus immediate
addressing mode uses the alias `st1d { <Zn>.d }, <Pg>, [<Za>.d]` for
rendering `st1d { <Zn>.d }, <Pg>, [<Za>.d, #0]`. The disassembler was
generating `<Zn>.s` instead of `<Zn>.d>`.
Differential Revision: https://reviews.llvm.org/D86633
This would assert with unaligned DS access enabled. The offset may not
be aligned. Theoretically the pattern predicate should check the
memory alignment, although it is possible to have the memory be
aligned but not the immediate offset.
In this case I would expect it to use ds_{read|write}_b64 with
unaligned access, but am not clear if there's a reason it doesn't.
This is an older syntax than the {disp32} and {disp8} pseudo
prefixes that were added a few weeks ago. We can reuse most of
the support for that to support .d32 and .d8 as well.
Summary:
Support TOCU and TOCL relocation type for object file generation.
Reviewed by: DiggerLin
Differential Revision: https://reviews.llvm.org/D84549
When floating point callee-saved registers were used, the frame pointer would
incorrectly point to the bottom of the CSR space (containing saved floating-point
registers), rather than to the frame record.
While all frame offsets were calculated consistently, resulting in working code,
this prevented stack walkers from being about to traverse the frame list.
If the condition output is negated, swap the branch targets. This is
similar to what SelectionDAG does for when SelectionDAGBuilder
decides to invert the condition and swap the branches.
This is leaving behind a dead constant def for some reason.
If a workgroup size is known to be not greater than wavefront size
the s_barrier instruction is not needed since all threads are guaranteed
to come to the same point at the same time.
This is the same optimization that was implemented for SelectionDAG in
D31731.
Differential Revision: https://reviews.llvm.org/D86609
MVE Gather scatter codegeneration is looking a lot better than it used
to, but still has some issues. The instructions we currently model as 1
cycle per element, which is a bit low for some cases. Increasing the
cost by the MVECostFactor brings them in-line with our other instruction
costs. This will have the effect of only generating then when the extra
benefit is more likely to overcome some of the issues. Notably in
running out of registers and vectorizing loops that could otherwise be
SLP vectorized.
In the short-term whilst we look at other ways of dealing with those
more directly, we can increase the costs of gathers to make them more
likely to be beneficial when created.
Differential Revision: https://reviews.llvm.org/D86444
pointer.
mwaitx uses EBX as one of its argument.
Using this instruction clobbers RBX as it is defined to hold one of the
input. When the backend uses dynamically allocated stack, RBX is used as
a reserved register for the base pointer.
This patch is adapted from @qcolombet patch for cmpxchg at r263325.
This fixes PR43528.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D73475
The switch in AArch64Operand::print was changed in D45688 so the shift
can be printed after printing the register. This is implemented with
LLVM_FALLTHROUGH and was broken in D52485 when BTIHint was put between
the register and shift operands.
Reviewed By: ostannard
Differential Revision: https://reviews.llvm.org/D86535
This fixes an issue where the restore point of callee-saves in the
function epilogues was incorrectly calculated when the basic block
consisted of only a RET instruction. This caused dealloc instructions
to be inserted in between the block of callee-save restore instructions,
rather than before it.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D86099
Since we can only copy to GR32 we had to EXTRACT from GR32, but
we would first go to GR16 and then the truncate would extra again
to GR8. This adds a special case to go directly from GR32 to GR8.
This would eventually get cleaned up, but though maybe we should
avoid doing it in the first place. Our k-register handling is weird
and we could probably stand to have some more special ISD nodes
for the conversions so the i32 type would be explicit.
The IsExtractedElement already called getOperand(0) so Extract
here is the source vector. We shouldn't call getOperand(0). This
worked for the original test cases because the result was a
bitcast so the getOperand(0) accidently peeked through the bitcast
which is what we wanted.
In the failing case here, the operand turns out to be undef so
the getOperand(0) asserts because undef has no operands.
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25184
Differential Revision: https://reviews.llvm.org/D86428
KMOVWkr produces VK16, there's no reason to copy it to VK16 again.
Test changes are presumably because we were scheduling based on
the COPY that is no longer there.
Most notably, we were incorrectly reporting <3 x s16> as a legal type
for these. Make sure these aren't legal to help make progress on
fixing the artifact combiner and vector legalizer
rules. Unfortunately, this means spreading the -global-isel-abort=0
hack, although this doesn't change the legalizer result in any
situation.
This adapts tail-predication to the new semantics of get.active.lane.mask as
defined in D86147. This means that:
- we can remove the BTC + 1 overflow checks because now the loop tripcount is
passed in to the intrinsic,
- we can immediately use that value to setup a counter for the number of
elements processed by the loop and don't need to materialize BTC + 1.
Differential Revision: https://reviews.llvm.org/D86303
Without the fix gcc 7.4 warns with
../lib/Target/PowerPC/PPCAsmPrinter.cpp: In member function 'void {anonymous}::PPCAsmPrinter::EmitTlsCall(const llvm::MachineInstr*, llvm::MCSymbolRefExpr::VariantKind)':
../lib/Target/PowerPC/PPCAsmPrinter.cpp:525:53: warning: enumeral and non-enumeral type in conditional expression [-Wextra]
MCInstBuilder(Subtarget->isPPC64() ? Opcode : PPC::BL_TLS)
~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
Also updates isConstOrConstSplatFP to allow the mul(A,-1) -> neg(A)
transformation when -1 is expressed as an ISD::SPLAT_VECTOR.
Differential Revision: https://reviews.llvm.org/D86415
Support -march=sapphirerapids for x86.
Compare with Icelake Server, it includes 14 more new features. They are
amxtile, amxint8, amxbf16, avx512bf16, avx512vp2intersect, cldemote,
enqcmd, movdir64b, movdiri, ptwrite, serialize, shstk, tsxldtrk, waitpkg.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D86503
PC-Relative addressing introduces a fair bit of complexity for correctly
eliminating TOC accesses. FastISel does not include any of that handling so we
miscompile code with -mcpu=pwr10 -O0 if it includes an external call that
FastISel does not handle followed by any of the following:
Floating point constant materialization
Materialization of a GlobalValue
Call that FastISel does handle
This patch switches to SDISel for any of the above.
Differential revision: https://reviews.llvm.org/D86343
This is preparation for making clang default to -mtune=generic when no -march is specified. This will allow the default tuning to be "generic" even though our default march is "pentium4" or "x86-64".
To avoid llc lit test regressions, if no mcpu is specified, I've defaulted tune to use i586 to match the old tuning settings of no CPU. Some tests explicitly used -mcpu=generic which I've removed so they instead get this default of architecture features from generic and tune from i586.
I updated one llvm-mca test to check a different CPU since generic has a scheduler model now
Differential Revision: https://reviews.llvm.org/D86312
Current custom lowering of truncate vector handles a source of up to 128 bits, but that only uses one of the two shuffle vector operands. Extend it to use both operands to handle 256 bit sources.
Differential Revision: https://reviews.llvm.org/D68035
This interferes with GlobalISel's much better handling of the
situation.
This should really be disable for GlobalISel. However, the fallback
only re-runs the selection passes, and doesn't go back and rerun any
codegen IR passes. I haven't come up with a good solution to this
problem.
This patch adds frontend and backend options to enable and disable
the PowerPC MMA operations added in ISA 3.1. Instructions using these
options will be added in subsequent patches.
Differential Revision: https://reviews.llvm.org/D81442
Handle workitem intrinsics. There isn't really away to adequately test
this right now, since none of the known bits users are fine grained
enough to test the edge conditions. This triggers a number of
instances of the new 64-bit to 32-bit shift combine in the existing
tests.
shl ([sza]ext x, y) => zext (shl x, y).
Turns expensive 64 bit shifts into 32 bit if it does not overflow the
source type:
This is a port of an AMDGPU DAG combine added in
5fa289f0d8. InstCombine does this
already, but we need to do it again here to apply it to shifts
introduced for lowered getelementptrs. This will help matching
addressing modes that use 32-bit offsets in a future patch.
TableGen annoyingly assumes only a single match data operand, so
introduce a reusable struct. However, this still requires defining a
separate GIMatchData for every combine which is still annoying.
Adds a morally equivalent function to the existing
getShiftAmountTy. Without this, we would have to do try to repeatedly
query the legalizer info and guess at what type to use for the shift.
If gather/scatters are enabled, ARMTargetTransformInfo now allows
tail predication for loops with a much wider range of strides, up
to anything that is loop invariant.
Differential Revision: https://reviews.llvm.org/D85410
We may meet Invalid CTR loop crash when there's constrained ops inside.
This patch adds constrained FP intrinsics to the list so that CTR loop
verification doesn't complain about it.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D81924
This patch makes these operations legal, and add necessary codegen
patterns.
There's still some issue similar to D77033 for conversion from v1i128
type. But normal type tests synced in vector-constrained-fp-intrinsic
are passed successfully.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D83654
The following program miscompiles because rL216012 added static
relocation model support but not for PIC.
```
// clang -fpic -mcmodel=large -O0 a.cc
double foo() { return 42.0; }
```
This patch adds PIC support.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D86024
As disscussed in post-commit review starting with
https://reviews.llvm.org/D84108#2227365
while this appears to be mostly a win overall, especially code-size-wise,
this appears to shake //certain// code pattens in a way that is extremely
unfavorable for performance (+30% runtime regression)
on certain CPU's (i personally can't reproduce).
So until the behaviour is better understood, and a path forward is mapped,
let's back this out for now.
This reverts commit 1d51dc38d8.
This is the slowest operation in the already slow pass.
Instead of sorting just put a stall list into an ordered
map.
Differential Revision: https://reviews.llvm.org/D86253
This patch adds support for constrained scalar int to fp operations on
PowerPC. Besides, this also fixes the FP exception bit of FCFID*
instructions.
Reviewed By: steven.zhang, uweigand
Differential Revision: https://reviews.llvm.org/D81669
This patch is the initial support for the Intial Exec Thread Local
Local Storage model to produce code sequence and relocations correct
to the ABI for the model when using PC relative memory operations.
Reviewed By: stefanp
Differential Revision: https://reviews.llvm.org/D81947
PseudoBRIND had seemingly inherited incorrect annotations denoting it as
a call instruction and that it defines X1/ra. This caused excess
save/restore code to be emitted for ra.
Differential Revision: https://reviews.llvm.org/D86286
Do not break down local loads and stores so ds_read/write_b96/b128 in
ISelLowering can be selected on subtargets that support them and if align
requirements allow them.
Differential Revision: https://reviews.llvm.org/D84403
Fix local ds_read/write_b96/b128 so they can be selected if the alignment
allows. Otherwise, either pick appropriate ds_read2/write2 instructions or break
them down.
Differential Revision: https://reviews.llvm.org/D81638
Features UnalignedBufferAccess and UnalignedDSAccess are now used to determine
whether hardware supports such access.
UnalignedAccessMode should be used to enable them.
hasUnalignedBufferAccessEnabled() and hasUnalignedDSAccessEnabled() can be
now used to quickly check both.
Differential Revision: https://reviews.llvm.org/D84522
Adjust alignment requirements for ds_read/write_b96/b128.
GFX9 and onwards allow misaligned access for reads and writes but only if
SH_MEM_CONFIG.alignment_mode allows it.
UnalignedDSAccess is set on GCN subtargets from GFX9 onward to let us know if we
can relax alignment requirements.
UnalignedAccessMode acts similary to UnalignedBufferAccess for DS instructions
but only from GFX9 onward and is supposed to match alignment_mode. By default
alignment of 4 is required.
Differential Revision: https://reviews.llvm.org/D82788
In SelectionDAGBuilder always translate the fshl and fshr intrinsics to
FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and
ORs. Improve the legalization of FSHL and FSHR to avoid code quality
regressions.
Differential Revision: https://reviews.llvm.org/D77152
Modify the ARM getCmpSelInstrCost implementation for the code size
costs of selects. Now consider the legalization cost and increase
the cost of i1 because those values wouldn't live in a general purpose
register. We also make selects +1 more expensive to account for the IT
instruction.
Differential Revision: https://reviews.llvm.org/D82091
As part of D84741, this adds a target hook for the
preferPredicatedReductionSelect option and makes use
of it under MVE, allowing us to tail predicate most
reduction loops.
Differential Revision: https://reviews.llvm.org/D85980
Summary:
- HIP uses an unsized extern array `extern __shared__ T s[]` to declare
the dynamic shared memory, which size is not known at the
compile time.
Reviewers: arsenm, yaxunl, kpyzhov, b-sumner
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82496
Assuming this is used to split a memory access into smaller pieces,
the new access should still have the same aliasing properties as the
original memory access. As far as I can tell, this wasn't
intentionally dropped. It may be necessary to drop this if you are
moving the operand outside of the bounds of the original object in
such a way that it may alias another IR object, but I don't think any
of the existing users are doing this. Some of the uses widen into
unused alignment padding, which I think is OK.
Custom lower and widen odd sized loads up to the alignment. The
default set of legalization actions doesn't have a way to represent
this. This fixes naturally aligned <3 x s8> and <3 x s16> loads.
This also starts moving towards eliminating the buggy and
overcomplicated legalization rules for narrowing. All the memory size
changes should be done in the lower or custom action, not NarrowScalar
/ FewerElements. These currently have redundant and ambiguous code
with the lower action.
The SGPR spills happen in SILowerSGPRSpills() and allSGPRSpillsAreDead()
make sure there are no SGPR spills pending during PEI. But the FP/BP
spills happen during PEI and are exceptions.
Use actual frame indices of FP/BP in allSGPRSpillsAreDead() to
accommodate the exceptions.
Differential Revision: https://reviews.llvm.org/D86291
This patch is the initial support for the General Dynamic Thread Local
Local Storage model to produce code sequence and relocations correct
to the ABI for the model when using PC relative memory operations.
Patch by: NeHuang
Reviewed By: stefanp
Differential Revision: https://reviews.llvm.org/D82315
There are no nxv16i8/nxv8i16 SDIV instructions, so these fixed width operations must be promoted to nxv4i32.
Differential Revision: https://reviews.llvm.org/D86114
This ensures that we never encode an instruction which is unavailable,
such as if we explicitly insert a forbidden instruction when lowering.
This is particularly important on RISC-V given its high degree of
modularity, and will become increasingly important as new standard
extensions appear.
Reviewed By: asb, lenary
Differential Revision: https://reviews.llvm.org/D85015
The getSrcFromCopy helper nowadays return a MachineOperand pointer,
so talking about zero_reg was incorrect as it nowadays return
a nullptr when not finding a copy like instruction.
For scalable vector shifts the prediacte is typically all active,
which gets selected to an unpredicated shift by immediate. When
code generating for fixed length vectors the predicate is based
on the vector length and so additional patterns are required to
make use of SVE's predicated shift by immediate instructions.
Differential Revision: https://reviews.llvm.org/D86204
When sampling from images with coordinates that only have 16 bit
accuracy, convert the image intrinsic call to use a16 or g16.
This does only happen if the target hardware supports it.
An alternative would be to always apply this combination, independent of
the target hardware and extend 16 bit arguments to 32 bit arguments
during legalization. To me, this sounds like an unnecessary roundtrip
that could prevent some further InstCombine optimizations.
Differential Revision: https://reviews.llvm.org/D85887
The `UnrollMaxBlockToAnalyze` parameter is used at the stage when we have no
information about a loop body BB cost. In some cases, e.g. for simple loop
```
for(int i=0; i<32; ++i){
D = Arr2[i*8 + C1];
Arr1[i*64 + C2] += C3 * D;
Arr1[i*64 + C2 + 2048] += C4 * D;
}
```
current default parameter value is not enough to run deeper cost analyze so the
loop is not completely unrolled.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D86248
Use the stack to save and restore the link register when there is no
available register to do it.
Differential Revision: https://reviews.llvm.org/D76069
This patch adds support for constrained scalar fp to int operations on
PowerPC. Besides, this fixes the FP exception bit of quad-precision
convert & truncate instructions.
Reviewed By: steven.zhang, uweigand
Differential Revision: https://reviews.llvm.org/D81537
TargetRegisterInfo::getMinimalPhysRegClass() returns rtcGPR64RegClassID for X16
and X17, as it's the last matching class. This in turn gets passed to
AArch64RegisterBankInfo::getRegBankFromRegClass(), which hits an unreachable.
It seems sensible to handle this case, so copies from X16 and X17 work.
Copying from X17 is used in inline assembly in libunwind for pointer
authentication.
Differential Revision: https://reviews.llvm.org/D85720
Previously we weren't adding the LegalizerInfo to the post-legalizer
combiner. Since that's fixed, we don't need to try to filter out the
one case that was breaking.
If we have a mask, and a value x, where (x & mask) == x, we can drop the AND
and just use x.
This is about a 0.4% geomean code size improvement on CTMark at -O3 for AArch64.
In AArch64, this is most useful post-legalization. Patterns like this often
show up when legalizing s1s, which must be extended to larger types.
e.g.
```
%cmp:_(s32) = G_ICMP ...
%and:_(s32) = G_AND %cmp, 1
```
Since G_ICMP only produces a single bit, there's no reason to mask it with the
G_AND.
Differential Revision: https://reviews.llvm.org/D85463
Add handling for storing the extracted lower (truncated bits) element from a X86ISD::VTRUNC node - this can be lowered to a generic truncated store directly.
Differential Revision: https://reviews.llvm.org/D86158
This implements the assemble and disassemble support of RISCV Vector
extension Zvlsseg instructions, base on the 0.9 spec version.
Reviewed by HsiangKai
Differential Revision: https://reviews.llvm.org/D84416
Summary:
When the resource descriptor is of vgpr, we need a waterfall loop
to read into a sgpr. In this patchm we generalized the implementation
to work for any regster class sizes, and extend the work to MIMG
instructions.
Fixes: SWDEV-223405
Reviewers:
arsenm, nhaehnle
Differential Revision:
https://reviews.llvm.org/D82603
These instructions weren't in the initial version of MMX, but
were added when SSE1 was introduced. We already have the intrinsic
named correctly to include sse and the frontened header enforces
sse. We have one place in the backend where we DAG combine to
this intrinsic, but that's also qualified. So don't know of anything
currently broken unless someone writes their own IR and doesn't
set the sse feature.
We probably want to introduce pseudo-instructions at some point, like
we have for binary operations, but this seems okay for now.
One thing I'm not sure about is whether we should be doing this as a
DAGCombine instead of directly pattern-matching it. I don't see any big
downside to doing it this way, though.
Differential Revision: https://reviews.llvm.org/D85681
This isn't necessaary for ACLE, but could be useful in other situations.
And the change is simple.
Differential Revision: https://reviews.llvm.org/D85251
VLD2/4 instructions cannot be predicated, so we cannot tail predicate
them from autovec. From intrinsics though, they should be valid as they
will just end up loading extra values into off vector lanes, not
effecting the on lanes. The same is true for loads in general where so
long as we are not using the other vector lanes, an unpredicated load
can be converted to a predicated one.
This marks VLD2 and VLD4 instructions as validForTailPredication and
allows any unpredicated load in tail predication loop, which seems to be
valid given the other checks we have.
Differential Revision: https://reviews.llvm.org/D86022
There are some cases where the instruction that sets up the iteration
count for a tail predicated loop cannot be moved before the dlstp,
stopping tail predication entirely. This patch checks if the mov operand
can be used and if so, uses that instead.
Differential Revision: https://reviews.llvm.org/D86087
Summary:
This is a follow up for D82481. For .lcomm directive, although it's
not necessary to have .rename emitted, it's still desirable to do
it so that we do not see internal 'Rename..' gets print out in
symbol table. And we could have consistent naming between TC entry
and .lcomm. And also have consistent naming between IR and final
object file.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D86075
Allow non-VLX targets to use 512-bits VPERMV/VPERMV3 for 128/256-bit shuffles.
TBH I'm not sure these targets actually exist in the wild, but we're testing for them and its good test coverage for shuffle lowering/combines across different subvector widths.
Previously, it would successfully select and assert if not HSA or PAL
when expanding the pseudoinstruction. We don't need the
pseudoinstruction anymore since we know the total size after
legalization.
The code to determine the value size was overcomplicated and only
correct in the case where the result register already had a register
class assigned. We can always take the size directly from the
register's type.
Right shift patterns will no longer incorrectly accept a shift
amount of zero. At the same time they will allow larger shift
amounts that are now saturated to their upper bound.
Patterns have been extended to enable immediate forms for shifts
taking an arbitrary predicate.
This patch also unifies the code path for immediate parsing so the
i64 based shifts are no longer treated specially.
Differential Revision: https://reviews.llvm.org/D86084
This patch adds lowerShuffleWithVTRUNC to handle basic binary shuffles that can be lowered either as a pure ISD::TRUNCATE or a X86ISD::VTRUNC (with undef/zero values in the remaining upper elements).
We concat the binary sources together into a single 256-bit source vector. To avoid regressions we perform this after we've tried to lower with PACKS/PACKUS which typically does a cleaner job than a concat.
For non-AVX512VL cases we have to canonicalize VTRUNC cases to use a 512-bit source vectors (inserting undefs/zeros in the upper elements as necessary), truncate and then (possibly) extract the 128-bit result.
This should address the last regressions in D66004
Differential Revision: https://reviews.llvm.org/D86093
This patch implements the vec_extractm function prototypes in altivec.h in
order to utilize the vector extract with mask instructions introduced in Power10.
Differential Revision: https://reviews.llvm.org/D82675
Doesn't really matter in practice but that's how the nodes are
normally created by SelectionDAGBuilder. So we should match.
Found by temporarily hacking type checks into isel table.
This is the type declared in X86InstrFragmentsSIMD.td. ISel pattern
matching doesn't check so it doesn't matter in practice. Maybe for
SelectionDAG CSE it would matter.
When diffing disassembly dump of two binaries, I see lots of noises from mismatched jump target addresses and global data references, which unnecessarily causes diffs on every function, making it impractical. I'm trying to symbolize the raw binary addresses to minimize the diff noise.
In this change, a local branch target is modeled as a label and the branch target operand will simply be printed as a label. Local labels are collected by a separate pre-decoding pass beforehand. A global data memory operand will be printed as a global symbol instead of the raw data address. Unfortunately, due to the way the disassembler is set up and to be less intrusive, a global symbol is always printed as the last operand of a memory access instruction. This is less than ideal but is probably acceptable from checking code quality point of view since on most targets an instruction can have at most one memory operand.
So far only the X86 disassemblers are supported.
Test Plan:
llvm-objdump -d --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:
<_start>:
push rax
mov dword ptr [rsp + 4], 0
mov dword ptr [rsp], 0
mov eax, dword ptr [rsp]
cmp eax, dword ptr [rip + 4112] # 202182 <g>
jge 0x20117e <_start+0x25>
call 0x201158 <foo>
inc dword ptr [rsp]
jmp 0x201169 <_start+0x10>
xor eax, eax
pop rcx
ret
```
llvm-objdump -d **--symbolize-operands** --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:
<_start>:
push rax
mov dword ptr [rsp + 4], 0
mov dword ptr [rsp], 0
<L1>:
mov eax, dword ptr [rsp]
cmp eax, dword ptr <g>
jge <L0>
call <foo>
inc dword ptr [rsp]
jmp <L1>
<L0>:
xor eax, eax
pop rcx
ret
```
Note that the jump instructions like `jge 0x20117e <_start+0x25>` without this work is printed as a real target address and an offset from the leading symbol. With a change in the optimizer that adds/deletes an instruction, the address and offset may shift for targets placed after the instruction. This will be a problem when diffing the disassembly from two optimizers where there are unnecessary false positives due to such branch target address changes. With `--symbolize-operand`, a label is printed for a branch target instead to reduce the false positives. Similarly, the disassemble of PC-relative global variable references is also prone to instruction insertion/deletion.
Reviewed By: jhenderson, MaskRay
Differential Revision: https://reviews.llvm.org/D84191
The previous implementation was incorrect, and based off incorrect
instruction definitions. Unfortunately we can't match natural
addressing in a lot of cases due to the shift/scale applied in
getelementptrs. This relies on reducing the 64-bit shift to 32-bits.
We may have an SGPR->VGPR copy if a totally uniform pointer
calculation is used for a VGPR pointer operand.
Also hack around a bug in MUBUF matching which would incorrectly use
MUBUF for global when flat was requested. This should really be a
predicate on the parent pattern, but the DAG always checked this
manually inside the complex pattern.
If the same stream object is used for multiple compiles, the PAL metadata from eariler compilations will leak into later one. See https://github.com/GPUOpen-Drivers/llpc/issues/882 for how this is happening in LLPC.
No tests were added because multiple compiles will have to happen using the same pass manager, and I do not see a setup for that on the LLVM side. Let me know if there is a good way to test this.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D85667