In a future change, we will sometimes use a VGPR offset for doing
spills to memory, in which case we need 2 free VGPRs to do the SGPR
spill. In most cases we could spill the VGPR along with the SGPR being
spilled, but we don't have any free lanes for SGPR_1024 in wave32, so
we could still potentially need a second scavenging slot.
As noted in D116804, we want to effectively invert that patch
for CPUs (Intel) that don't break the false dependency on
sbb %eax, %eax
So we will likely want to create that here in the
X86DAGToDAGISel::Select() case for X86::SETCC_CARRY.
setjmp can return twice, but PostDominatorTree is unaware of this. As
such, it overestimates postdominance, leaving some cases where memory
does not get untagged on return. This causes false positives later in
the program execution.
This is a workaround for now; in the longer term, PostDominatorTree
should be made aware of returns_twice, as this may cause problems
elsewhere.
See D118647 for the equivalent fix to HWASan.
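To illustrate the kind of detection involved (a hedged sketch, not the code in this patch; the helper name is invented):
```
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/InstrTypes.h"

// Illustrative helper: does F contain a call to a returns_twice function
// such as setjmp? If so, PostDominatorTree will overestimate postdominance
// in F, so a tagging pass can fall back to conservative untagging.
static bool callsReturnsTwice(const llvm::Function &F) {
  for (const llvm::Instruction &I : llvm::instructions(F))
    if (const auto *CB = llvm::dyn_cast<llvm::CallBase>(&I))
      if (CB->hasFnAttr(llvm::Attribute::ReturnsTwice))
        return true;
  return false;
}
```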
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D118749
VLMaxSentinel happens to be represented as a -1 TargetConstant. A
user-provided -1 would be an ISD::Constant. We shouldn't assume that they
are the same thing. I'm still not entirely convinced that we should be
treating -1 from the user as VLMAX.
Also fix one place that failed to use XLenVT for the VLMaxSentinel,
using MVT::i64 in code that only executes on RV32.
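A minimal sketch of the distinction (hypothetical helper, not code from this patch; RISCV::VLMaxSentinel is declared in the backend's RISCVISelLowering.h):
```
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// The internal sentinel is a -1 TargetConstant; a -1 written by the user
// reaches us as a plain ISD::Constant, so matching only on the value -1
// conflates the two.
static bool isVLMaxSentinelVL(SDValue VL) {
  return VL.getOpcode() == ISD::TargetConstant &&
         cast<ConstantSDNode>(VL)->getSExtValue() == RISCV::VLMaxSentinel;
}
```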
Warn on inline assembly clobbering reserved registers. It should also
warn on at least some reserved register defs, but that isn't happening
right now. If you have a def and re-use of a register we reserve, the
register coalescer will eliminate the intermediate virtual
register. When the reserved reg def is introduced later by the
backend, it will end up clobbering the value the register coalescer
assumed was live through the range.
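Roughly, the kind of check being added looks like this (an illustrative sketch only; the helper and diagnostic text are invented, and the real code distinguishes clobbers from other defs):
```
#include "llvm/ADT/Twine.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/IR/DiagnosticInfo.h"

using namespace llvm;

// Illustrative only: after the reserved set is frozen, warn on INLINEASM
// operands that define (clobber) a reserved physical register.
static void warnOnReservedInlineAsmClobber(const MachineInstr &MI,
                                           const MachineFunction &MF) {
  if (!MI.isInlineAsm())
    return;
  const MachineRegisterInfo &MRI = MF.getRegInfo();
  const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg() || !MO.isDef() || !MO.getReg().isPhysical())
      continue;
    MCRegister Reg = MO.getReg().asMCReg();
    if (MRI.isReserved(Reg))
      MF.getFunction().getContext().diagnose(DiagnosticInfoInlineAsm(
          "inline asm clobbers reserved register " + Twine(TRI.getName(Reg)),
          DS_Warning));
  }
}
```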
There is also isInlineAsmReadOnlyReg, although I don't understand what
the distinction really is. It's called in SelectionDAGBuilder, long
before the set of reserved registers is frozen, so I'm not sure how
that can possibly work reliably.
Unfortunately this is also using the ugly TableGen-generated names for
the registers.
This adds or reuses ISD opcodes for vwadd.wv, vwaddu.wv, vwadd.vv and
vwaddu.vv, and a similar set for sub.
I've included support for narrowing scalar splats that have known
sign/zero bits similar to what was done for MUL_VL.
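The known-bits test is roughly the following (illustrative helper, not the exact code):
```
#include "llvm/ADT/APInt.h"
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

// Can the splatted scalar Op be truncated to NarrowBits and then sign- or
// zero-extended back without changing its value?
static bool isNarrowableSplatScalar(SDValue Op, unsigned NarrowBits,
                                    bool IsSigned, SelectionDAG &DAG) {
  unsigned OpBits = Op.getValueType().getScalarSizeInBits();
  if (IsSigned)
    // Need more sign bits than the number of bits we intend to drop.
    return DAG.ComputeNumSignBits(Op) > OpBits - NarrowBits;
  // For the unsigned case every dropped bit must be known zero.
  return DAG.MaskedValueIsZero(Op, APInt::getBitsSetFrom(OpBits, NarrowBits));
}
```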
The conversion to vwadd.vv proceeds in two phases. First we'll form
a vwadd.wv by narrowing one of the operands. Then we'll visit the
vwadd.wv to try to narrow the other operand. This turned out to be
simpler than catching all the cases in one step. The forming of
vwadd.wv can happen for either operand of add, but only the right-hand
side of sub, since sub isn't commutable.
An interesting quirk is that ADD_VL and VZEXT_VL/VSEXT_VL are formed
during vector op legalization, but VMV_V_X_VL isn't usually formed
until op legalization when BUILD_VECTORS are handled. This leads to
VWADD_W_VL forming in one DAG combine round, and then a later DAG combine
round sees the VMV_V_X_VL and needs to commute the operands to get the
splat in position. This alone necessitated a VWADD_W_VL combine function
which made forming vwadd.vv in two stages an easy choice.
I've left out trying hard to form vwadd.wx instructions for now. It would
only save an extend in the scalar domain, which isn't as interesting.
Might need to review the test coverage a bit. Most of the vwadd.wv
instructions are coming from vXi64 tests on rv64. The tests were
copy-pasted from the existing multiply tests.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117954
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.
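Conceptually the change amounts to something like this (a hedged sketch; the flag value is target-specific, so it is passed in here rather than named):
```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineMemOperand.h"

// Illustrative only: mark the load's own memory operand(s) with the
// target's no-clobber MMO flag, so the property applies to this load and
// not to every load of the same pointer.
static void markLoadNoClobber(llvm::MachineInstr &LoadMI,
                              llvm::MachineMemOperand::Flags NoClobberFlag) {
  for (llvm::MachineMemOperand *MMO : LoadMI.memoperands())
    MMO->setFlags(NoClobberFlag);
}
```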
Differential Revision: https://reviews.llvm.org/D118775
This patch introduces the conversions from math function calls
to MASS library calls. To resolve calls generated with these conversions, one
needs to link the libxlopt.a library. This patch is tested on PowerPC Linux and AIX.
Differential: https://reviews.llvm.org/D101759
Reviewer: bmahjour
LLVM has a couple of ways of producing ccmp - either from chains in isel
or from a later ifcvt-style pass. This adds a simple DAG combine to
capture more cases, converting and(csel(0, 1, cc0), csel(0, 1, cc1))
into a csel(ccmp(.., cc0)), depending on cc1 (a SUBS in this case).
Differential Revision: https://reviews.llvm.org/D118327
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a first
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
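A DAG-combine-style sketch of the basic fold (simplified; the in-tree X86 code restricts this to the vector fadd/fsub cases above and carries extra legality/profitability checks):
```
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// True if V is a constant such that "X op V == X" always holds:
// -0.0 for fadd, +0.0 on the RHS of fsub (with nsz either zero works).
static bool isStrictFPNeutral(unsigned Opcode, SDValue V) {
  ConstantFPSDNode *C = isConstOrConstSplatFP(V, /*AllowUndefs=*/true);
  if (!C || !C->isZero())
    return false;
  return Opcode == ISD::FADD ? C->isNegative() : !C->isNegative();
}

// Sketch: fadd N0, (vselect Cond, Neutral, FVal)
//           --> vselect Cond, N0, (fadd N0, FVal)
// i.e. pull the binop into the select arm, inverting the IR canonicalization.
static SDValue foldBinopIntoSelectArm(SDNode *N, SelectionDAG &DAG) {
  unsigned Opcode = N->getOpcode();
  if (Opcode != ISD::FADD && Opcode != ISD::FSUB)
    return SDValue();
  SDValue N0 = N->getOperand(0), N1 = N->getOperand(1);
  if (N1.getOpcode() != ISD::VSELECT || !N1.hasOneUse())
    return SDValue();
  SDValue Cond = N1.getOperand(0);
  SDValue TVal = N1.getOperand(1), FVal = N1.getOperand(2);
  if (!isStrictFPNeutral(Opcode, TVal))
    return SDValue();
  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  SDValue Arm = DAG.getNode(Opcode, DL, VT, N0, FVal, N->getFlags());
  return DAG.getNode(ISD::VSELECT, DL, VT, Cond, N0, Arm);
}
```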
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actually touch these (they're purely used internally down the recursive call) - it's trivial to add another wrapper if anything ever does want to track known elements.
Apparently GCC 5.4 (a supported compiler) has a bug where it will
use the "MachineInstr &MI" defined by the range-based for loop
to evaluate the for loop expression. Pick a different variable
name to avoid this.
This patch adds custom lowering support for ISD::SDIV and ISD::UDIV
when SVE is enabled, regardless of the minimum SVE vector length. We do
this because NEON simply does not have vector integer divide support, so
we want to take advantage of these instructions in SVE.
As part of this patch I've also simplified LowerToPredicatedOp to avoid
re-asking the same question about whether we should be using SVE for
fixed-length vectors. Once we've made the decision to call
LowerToPredicatedOp, we should simply assert that we are using SVE.
I've updated the 128-bit min SVE vector bits tests here:
CodeGen/AArch64/sve-fixed-length-int-div.ll
CodeGen/AArch64/sve-fixed-length-int-rem.ll
Differential Revision: https://reviews.llvm.org/D117871
Whilst adding legal types <-> register classes for Streaming SVE in
D118561, I noticed the hasSVE-predicated block sets operation actions for
opcodes that may not be legal in Streaming SVE. Move these operations to
the later hasSVE block, which contains loops over the same types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118560
The new LEGALAVL node annotates that the AVL refers to packs of 64 bits.
We use a two-stage lowering approach with LEGALAVL:
First, standard SDNodes are translated into illegal VVP layer nodes.
Regardless of source (VP or standard), all VVP nodes have a mask and AVL
parameter. The AVL parameter refers to the element position (just as in
VP intrinsics).
Second, we legalize the AVL usage in VVP layer nodes. If the element
size is less than 64 bits, the AVL parameter has to be adjusted to refer
to packs of 64 bits. We wrap the legalized AVL in a LEGALAVL node to track this.
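For the 32-bit-element (packed) case the adjustment is roughly a divide-and-round-up by two, as sketched below (illustrative helper; the i32 AVL type and operand handling are assumptions, not the exact in-tree code):
```
// Two 32-bit elements share one 64-bit pack, so the hardware AVL is
// ceil(AVL / 2). The result is wrapped in LEGALAVL so later code knows
// this AVL is already legalized.
static SDValue legalizePackedAVL(SDValue AVL, const SDLoc &DL,
                                 SelectionDAG &DAG) {
  if (AVL.getOpcode() == VEISD::LEGALAVL)
    return AVL; // already legalized
  SDValue One = DAG.getConstant(1, DL, MVT::i32);
  SDValue PlusOne = DAG.getNode(ISD::ADD, DL, MVT::i32, AVL, One);
  SDValue Packs = DAG.getNode(ISD::SRL, DL, MVT::i32, PlusOne, One);
  return DAG.getNode(VEISD::LEGALAVL, DL, MVT::i32, Packs);
}
```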
Reviewed By: kaz7
Differential Revision: https://reviews.llvm.org/D118321
This patch fixes the atomicrmw result value to be the value before the
operation instead of the value after the operation. This was a bug, left
as a FIXME in the code (see https://reviews.llvm.org/D97127).
From the LangRef:
> The contents of memory at the location specified by the <pointer>
> operand are atomically read, modified, and written back. The original
> value at the location is returned.
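As a generic IR-level illustration of that contract (a hedged sketch using a cmpxchg loop, not the backend expansion changed in this patch), note that the atomicrmw's users receive the value loaded before the add:
```
#include <cassert>

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Expand "atomicrmw add" into a cmpxchg loop. The value RAUW'd over the
// atomicrmw is the memory contents *before* the add, per the LangRef.
// (The initial load is shown as a plain load for brevity.)
static void expandAtomicAddToCASLoop(AtomicRMWInst *AI) {
  assert(AI->getOperation() == AtomicRMWInst::Add && "sketch handles add only");
  BasicBlock *Entry = AI->getParent();
  Function *F = Entry->getParent();
  LLVMContext &Ctx = F->getContext();

  // Everything from the atomicrmw onwards moves to 'End'; 'Loop' retries.
  BasicBlock *End = Entry->splitBasicBlock(AI->getIterator(), "atomic.end");
  BasicBlock *Loop = BasicBlock::Create(Ctx, "atomic.loop", F, End);
  Entry->getTerminator()->setSuccessor(0, Loop);

  Value *Ptr = AI->getPointerOperand();
  Value *Val = AI->getValOperand();
  Type *Ty = AI->getType();

  IRBuilder<> B(Loop);
  PHINode *Loaded = B.CreatePHI(Ty, 2, "loaded");
  Value *NewVal = B.CreateAdd(Loaded, Val, "new");
  Value *Pair = B.CreateAtomicCmpXchg(
      Ptr, Loaded, NewVal, AI->getAlign(), AI->getOrdering(),
      AtomicCmpXchgInst::getStrongestFailureOrdering(AI->getOrdering()));
  Value *OldVal = B.CreateExtractValue(Pair, 0, "old");
  Value *Success = B.CreateExtractValue(Pair, 1, "success");
  B.CreateCondBr(Success, End, Loop);

  // Seed the PHI: a plain load before the loop, the CAS result on retries.
  IRBuilder<> EntryB(Entry->getTerminator());
  Loaded->addIncoming(EntryB.CreateLoad(Ty, Ptr, "init"), Entry);
  Loaded->addIncoming(OldVal, Loop);

  // The atomicrmw's result is the original value at the location.
  AI->replaceAllUsesWith(OldVal);
  AI->eraseFromParent();
}
```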
Doing this expansion early allows the register allocator to arrange
registers in such a way that commutable operations are simply swapped
around as needed, which results in shorter code while still being
correct.
Differential Revision: https://reviews.llvm.org/D117725
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless it took a lot of care to avoid hidden header dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest changes below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k fewer lines to process is not that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
MemorySSA considers any atomic to be a def for any operation it
dominates, just like a barrier or fence. That is correct from a memory
state perspective, but not required for the no-clobber metadata since
we are not using it for reordering. Skip such atomics during the
scan, just like a barrier, if they do not alias with the load.
Differential Revision: https://reviews.llvm.org/D118661
Currently we cannot convert a vector load into a scalar load if there
is a dominating barrier or fence, because these are considered
clobbering memory accesses in order to prevent the reordering of memory
operations. While such reordering is indeed not possible, the actual
memory is not clobbered by a barrier or fence, and we can still use a
scalar load for a uniform pointer.
The solution is not to bail on the first clobbering access but to
traverse MemorySSA to the root, excluding barriers and fences.
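A sketch of the walk being described (the helper and its barrier predicate are illustrative, not the exact code):
```
#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Illustrative only: instead of giving up at the first clobbering access,
// walk the defining accesses towards liveOnEntry, skipping fences and
// target barriers (identified by IsBarrier) and anything that does not
// actually modify the loaded location.
static bool hasRealClobber(LoadInst *Load, MemorySSA &MSSA, AAResults &AA,
                           function_ref<bool(const Instruction *)> IsBarrier) {
  const MemoryLocation Loc = MemoryLocation::get(Load);
  MemoryAccess *MA = MSSA.getWalker()->getClobberingMemoryAccess(Load);
  while (!MSSA.isLiveOnEntryDef(MA)) {
    auto *Def = dyn_cast<MemoryDef>(MA);
    if (!Def)
      return true; // MemoryPhi: be conservative in this sketch.
    Instruction *DefInst = Def->getMemoryInst();
    // Fences and barriers order memory but do not write the loaded data.
    if (isa<FenceInst>(DefInst) || IsBarrier(DefInst) ||
        !isModSet(AA.getModRefInfo(DefInst, Loc))) {
      MA = Def->getDefiningAccess();
      continue;
    }
    return true;
  }
  return false; // reached function entry without a real clobber
}
```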
Differential Revision: https://reviews.llvm.org/D118419
The first phase of the analysis can avoid a vsetvli if an earlier
instruction in the block used an SEW and LMUL that when combined with
the EEW of the load/store would produce the desired EMUL. If we
avoided a vsetvli this will affect the global analysis we do in the
second phase.
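The compatibility condition the first phase checks can be stated as a ratio test; a small illustrative helper (LMUL and EMUL expressed in eighths so that fractional values stay integral):
```
#include <cstdint>

// A load/store with element width EEW and register-group multiplier EMUL
// can reuse the VL from an earlier vsetvli with SEW/LMUL exactly when the
// SEW:LMUL and EEW:EMUL ratios match, since equal ratios give equal VLMAX.
static bool sameVLMAX(unsigned PrevSEW, unsigned PrevLMULx8, unsigned EEW,
                      unsigned EMULx8) {
  return static_cast<uint64_t>(PrevSEW) * EMULx8 ==
         static_cast<uint64_t>(EEW) * PrevLMULx8;
}
```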
The third phase where we really insert the vsetvlis needs to agree
with the first phase. If it doesn't we can insert vsetvlis that
invalidate the global analysis.
In the test case there is a VSETVLI in the preheader that sets
SEW=64 and LMUL=1. Inside the loop there is a VADD with SEW=64 and LMUL=1.
This VADD is followed by a store that wants SEW=32 and LMUL=1/2.
Because it has EEW=32 as part of the opcode, the SEW=64, LMUL=1 from the
VADD can become EMUL=1/2 for the store. So the first phase determines no
vsetvli is needed.
The third phase manages CurInfo differently than BBInfo.Change from the
first phase. CurInfo is only updated when we see a vsetvli or insert
a vsetvli. This was done to allow predecessor block information from
the global analysis to be applied to multiple instructions. Since the
loop body has no vsetvli we won't update CurInfo for either the VADD
or the VSE. This prevented us from checking the store vsetvli elision
for the VSE, resulting in a vsetvli with SEW=32, LMUL=1/2 being emitted,
which invalidated the global analysis.
To mitigate this, I've added a BBLocalInfo variable that more closely
matches the first phase propagation. This gets updated based on the
VADD and prevents emitting a vsetvli for the store, matching the
decision made in the first phase.
I wonder if we should do an earlier phase to handle the load/store case
by adding more pseudo opcodes and changing the SEW/LMUL for those
instructions before the insertion analysis. That might be more robust
than trying to guarantee two phases make the same decision.
Fixes the test from D118629.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118667
This patch updates the P10 patterns with a load feeding into an insertelt to
utilize the refactored load and store infrastructure, as well as updating any
tests that exhibit any codegen changes.
Furthermore, custom legalization is added for v4f32 on Power9 and above, not
only to assist with adjusting the refactored load/stores for P10 vector insert,
but also to enable the utilization of direct moves.
Differential Revision: https://reviews.llvm.org/D115691
Pointer element types do not imply that the pointer is ABI aligned.
We should either use an explicit align attribute here or fall back to
an alignment of 1. This fixes a new element type access introduced in
D117764.
I don't think this makes any practical difference though, as the
lowering does not depend on alignment.
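For illustration, the conservative alternative looks like this (generic IRBuilder usage, not the code touched here):
```
#include "llvm/IR/IRBuilder.h"
#include "llvm/Support/Alignment.h"

// Without an explicit align attribute, request the conservative alignment
// of 1 instead of deriving ABI alignment from the pointee type.
static llvm::LoadInst *emitByteAlignedLoad(llvm::IRBuilderBase &B,
                                           llvm::Type *EltTy,
                                           llvm::Value *Ptr) {
  return B.CreateAlignedLoad(EltTy, Ptr, llvm::Align(1));
}
```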
Differential Revision: https://reviews.llvm.org/D118681
Currently, ARMBaseInstrInfo::getInstSizeInBytes() uses hard-coded
instruction sizes for some pseudo-instructions, while this
information should ideally be found in the ARMInstrInfo.td and
ARMInstrThumb(2).td files (which can be accessed via MCInstrDesc). Hence,
the .td files should be updated and no hard-coded instruction sizes
should be used by getInstSizeInBytes() anymore.
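The direction described amounts to something like this (illustrative, not the exact ARM change):
```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/MC/MCInstrDesc.h"

// Read the byte size recorded by TableGen (the .td 'Size' field) via
// MCInstrDesc instead of keeping hard-coded per-pseudo sizes in C++.
// A result of 0 means the .td entry does not specify a size yet.
static unsigned instSizeFromDesc(const llvm::MachineInstr &MI) {
  return MI.getDesc().getSize();
}
```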
Differential Revision: https://reviews.llvm.org/D118009
Currently, AArch64InstrInfo::getInstSizeInBytes() uses hard-coded
instruction sizes for some pseudo-instructions, while this
information should ideally be found in the AArch64InstrInfo.td file (which
can be accessed via MCInstrDesc). Hence, the .td file should be updated
and no hard-coded instruction sizes should be used by
getInstSizeInBytes() anymore.
Differential Revision: https://reviews.llvm.org/D117970
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
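The extra case amounts to looking through SPLAT_VECTOR as well (hypothetical helper for illustration):
```
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Return the splatted integer constant whether the value is a constant
// BUILD_VECTOR splat (fixed-length vectors) or a SPLAT_VECTOR of a
// constant (scalable vectors), so checks like isConstTrueVal can handle
// both forms.
static ConstantSDNode *getSplatConstantInt(SDValue V) {
  if (auto *BV = dyn_cast<BuildVectorSDNode>(V))
    return BV->getConstantSplatNode();
  if (V.getOpcode() == ISD::SPLAT_VECTOR)
    return dyn_cast<ConstantSDNode>(V.getOperand(0));
  return dyn_cast<ConstantSDNode>(V);
}
```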
Differential Revision: https://reviews.llvm.org/D117210
Summary:
Add code object v5 support (default is still v4)
Generate metadata for implicit kernel args for the new ABI
Set the metadata version to be 1.2
Reviewers:
t-tye, b-sumner, arsenm, and bcahoon
Fixes:
SWDEV-307188, SWDEV-307189
Differential Revision:
https://reviews.llvm.org/D118272
This removes `HasPAUTH` from `AArch64SubTarget`, as it seems to be a
redundant, unused copy of `HasPAuth`.
Differential Revision: https://reviews.llvm.org/D117782
New target SDNodes are added: AArch64ISD::MOPS_MEMSET, etc.
Each intrinsic is translated to one of these in SelectionDAGBuilder
via EmitTargetCodeForMOPS.
A custom lowering routine for INTRINSIC_W_CHAIN is added to handle
llvm.aarch64.mops.memset.tag. This takes a separate path from the common
intrinsics but ultimately ends up in the same EmitMOPS().
This is part 4/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Patch by Tomas Matheson, Lucas Prates and Son Tuan Vu.
Differential Revision: https://reviews.llvm.org/D117764
This implements codegen for Armv8.8/9.3 Memory Operations extension (MOPS).
Any memcpy/memset/memmove intrinsics will always be emitted as a series
of three consecutive instructions P, M and E which perform the
operation. The SelectionDAG implementation is split into a separate
patch.
AArch64LegalizerInfo will now consider the following generic opcodes
if +mops is available, instead of legalising by expanding them to
libcalls: G_BZERO, G_MEMCPY_INLINE, G_MEMCPY, G_MEMMOVE, G_MEMSET
The s8 value of memset is legalised to s64 to match the pseudos.
AArch64O0PreLegalizerCombinerInfo will still be able to combine
G_MEMCPY_INLINE even if +mops is present, as it is unclear whether it is
better to generate fixed length copies or MOPS instructions for the
inline code of small or zero-sized memory operations, so we choose to be
conservative for now.
AArch64InstructionSelector will select the above as new pseudo
instructions: AArch64::MOPSMemory{Copy/Move/Set/SetTagging}. These are
each expanded to a series of three instructions (e.g. SETP/SETM/SETE),
which must be emitted together during code emission to avoid scheduler
reordering.
This is part 3/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Patch by Tomas Matheson and Son Tuan Vu
Differential Revision: https://reviews.llvm.org/D117763
According to the specification, MOPS instructions define/use NZCV flags as
part of their semantics (see discussion in
https://reviews.llvm.org/D116157).
More specifically, the specification of the MOPS extension states that
each memcpy/memset/memmove operation will be performed by a series
of three MOPS instructions P, M and E. The P instruction writes to the
NZCV flags, while the others (M and E) read from the NZCV flags.
This is part 2/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Differential Revision: https://reviews.llvm.org/D117757
This reverts commit 00bf4755e9.
This change broke the emscripten builder (among other things):
https://ci.chromium.org/ui/p/emscripten-releases/builders/try/linux/b8823500584349280721/overview
Sample failure:
```
test_unistd_unlink (test_core.core0) ...
wasm-ld: error: symbol type mismatch: __stdio_write
>>> defined as WASM_SYMBOL_TYPE_FUNCTION in /usr/local/google/home/sbc/dev/wasm/emscripten/cache/sysroot/lib/wasm32-emscripten/libc-debug.a(__stdio_write.o)
>>> defined as WASM_SYMBOL_TYPE_DATA in /usr/local/google/home/sbc/dev/wasm/emscripten/cache/sysroot/lib/wasm32-emscripten/libc-debug.a(stderr.o)
```
The optimization added in D118139 causes a crash on the added test case
while trying to zero-extend a vector of floats.
Fix the crash by bailing out for floating point operands.
Reviewed By: DavidTruby
Differential Revision: https://reviews.llvm.org/D118615
We convert VLEN to vscale by dividing by RVVBitsPerBlock which is
currently 64. This is only correct if VLEN is evenly divisible by
64. With only Zvl32b we can't assume that.
This patch adds a fatal_error to prevent generating code that may
be broken.
We probably need to look at how we size stack frame objects too.
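The guard amounts to something like this (illustrative sketch; RVVBitsPerBlock is 64):
```
#include "llvm/Support/ErrorHandling.h"

// The VLEN -> vscale conversion divides by 64, which is only sound when
// VLEN is a multiple of 64; e.g. Zvl32b alone does not guarantee that.
static unsigned vlenToVScale(unsigned VLenInBits) {
  constexpr unsigned RVVBitsPerBlock = 64;
  if (VLenInBits % RVVBitsPerBlock != 0)
    llvm::report_fatal_error(
        "VLEN that is not a multiple of 64 is not yet supported");
  return VLenInBits / RVVBitsPerBlock;
}
```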
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118583
We had previously hardcoded this to assume that vector registers
are 128 bits. This was true when only V existed, but after Zve
extensions were added this became incorrect.
This patch adjusts it to support 128-, 64-, or 32-bit vectors depending
on Zvl. The 128-bit limit is artificial, but we don't have any test
coverage showing that we handle larger values, so I was being conservative.
None of our lit tests depend on this code today due to the custom
lowering of ISD::VSCALE that inserts the appropriate left or right
shift to convert from VLENB to VSCALE. That code was added after
this code in computeKnownBitsForTargetNode.
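The computation this would feed looks roughly like the following (illustrative helper; it relies on VLEN being a power of two between the Zvl minimum and the artificial 128-bit cap mentioned above):
```
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/MathExtras.h"

// VLENB = VLEN / 8 and VLEN is a power of two in [MinVLen, MaxVLen], so
// all bits below log2(MinVLen/8) and above log2(MaxVLen/8) are known zero.
static llvm::KnownBits vlenbKnownBits(unsigned MinVLen, unsigned MaxVLen,
                                      unsigned BitWidth) {
  llvm::KnownBits Known(BitWidth);
  Known.Zero.setLowBits(llvm::Log2_32(MinVLen / 8));
  Known.Zero.setBitsFrom(llvm::Log2_32(MaxVLen / 8) + 1);
  return Known;
}
```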
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118582
The spec doesn't seem to be written as if Zfh implies Zfhmin. They
seem to be separate extensions.
This patch changes the Zfhmin instructions to be enabled with
either the Zfh or Zfhmin extension.
Reviewed By: achieveartificialintelligence
Differential Revision: https://reviews.llvm.org/D118581
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero; this just extends that to also try SimplifyDemandedBits, using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits wrapper that takes both DemandedBits and DemandedElts.