Cleanup CCMP pattern matching code in preparation for review/bugfix:
- Rename `isConjunctionDisjunctionTree()` to `canEmitConjunction()`
(it won't accept arbitrary disjunctions and is really about whether we
can transform the subtree into a conjunction that we can emit).
- Rename `emitConjunctionDisjunctionTree()` to `emitConjunction()`
llvm-svn: 346203
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.
llvm-svn: 346180
This was added in r330630. GCC's -Wimplicit-fallthrough seems to not
fire when the previous case contains a switch itself.
This fallthrough was bening because the helper function implementing the
case used dyn_cast to re-check the type of the node in question. After
fixing the fallthrough, we can strengthen the cast.
llvm-svn: 345864
Summary:
Thunk functions in Windows are varag functions that call a musttail function
to pass the arguments after the fixup is done. We need to make sure that we
forward the arguments from the caller vararg to the callee vararg function.
This is the same mechanism that is used for Windows on X86.
Reviewers: ssijaric, eli.friedman, TomTan, mgrang, mstorsjo, rnk, compnerd, efriedma
Reviewed By: efriedma
Subscribers: efriedma, kristof.beyls, chrib, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D53843
llvm-svn: 345641
Re-apply r345315 with testcase fixes.
Include all of the store's source vector operands when creating the
MachineMemOperand. Previously, we were missing the first operand,
making the store size seem smaller than it really is.
Differential Revision: https://reviews.llvm.org/D52816
llvm-svn: 345631
Include all of the store's source vector operands when creating the
MachineMemOperand. Previously, we were missing the first operand,
making the store size seem smaller than it really is.
Differential Revision: https://reviews.llvm.org/D52816
llvm-svn: 345315
Summary:
Changes all uses of minnan/maxnan to minimum/maximum
globally. These names emphasize that the semantic difference between
these operations is more than just NaN-propagation.
Reviewers: arsenm, aheejin, dschuff, javed.absar
Subscribers: jholewinski, sdardis, wdng, sbc100, jgravelle-google, jrtc27, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D53112
llvm-svn: 345218
AARCH64 equivalent to D53257 - uses widening pairwise adds on vXi8 CTPOP to support i16/i32/i64 vectors.
This is a blocker for generic vector CTPOP expansion (P32655) - this will remove the aarch64 diff from D53258.
Differential Revision: https://reviews.llvm.org/D53259
llvm-svn: 344554
Summary:
AArch64 can fold some shift+extend operations on the RHS operand of
comparisons, so swap the operands if that makes sense.
This provides a fix for https://bugs.llvm.org/show_bug.cgi?id=38751
Reviewers: efriedma, t.p.northover, javed.absar
Subscribers: mcrosier, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D53067
llvm-svn: 344439
Share predecessor search bookkeeping in both perform PostLD1Combine
and performNEONPostLDSTCombine. This should be approximately a 4x and
2x performance improvement.
llvm-svn: 342986
Summary:
Specifying X[8-15,18] registers as callee-saved is used to support
CONFIG_ARM64_LSE_ATOMICS in Linux kernel. As part of this patch we:
- use custom CSR list/mask when user specifies custom CSRs
- update Machine Register Info's list of CSRs with additional custom CSRs in
LowerCall and LowerFormalArguments.
Reviewers: srhines, nickdesaulniers, efriedma, javed.absar
Reviewed By: nickdesaulniers
Subscribers: kristof.beyls, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D52216
llvm-svn: 342824
This involves changing the shouldExpandAtomicCmpXchgInIR interface, but I have
updated the in-tree backends using this hook (ARM, AArch64, Hexagon) so they
will see no functional change. Previously this hook returned bool, but it now
returns AtomicExpansionKind.
This hook allows targets to select how a given cmpxchg is to be expanded.
D48131 uses this to expand part-word cmpxchg to a target-specific intrinsic.
See my associated RFC for more info on the motivation for this change
<http://lists.llvm.org/pipermail/llvm-dev/2018-June/123993.html>.
Differential Revision: https://reviews.llvm.org/D48130
llvm-svn: 342550
This patch adds parsing support for the 'aarch64_vector_pcs'
calling convention attribute to calls and function declarations.
More information describing the vector ABI and procedure call standard
can be found here:
https://developer.arm.com/products/software-development-tools/\
hpc/arm-compiler-for-hpc/vector-function-abi
Reviewers: t.p.northover, rnk, rengolin, javed.absar, thegameg, SjoerdMeijer
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D51477
llvm-svn: 342030
Summary:
Reserving registers x1-7 is used to support CONFIG_ARM64_LSE_ATOMICS in Linux kernel. This change adds support for reserving registers x1 through x7.
Reviewers: javed.absar, phosek, srhines, nickdesaulniers, efriedma
Reviewed By: nickdesaulniers, efriedma
Subscribers: niravd, jfb, manojgupta, nickdesaulniers, jyknight, efriedma, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D48580
llvm-svn: 341706
Summary:
I added a few ARM64 memset codegen tests in r341406 and r341493, and annotated
where the generated code was bad. This patch fixes the majority of the issues by
requesting that a 2xi64 vector be used for memset of 32 bytes and above.
The patch leaves the former request for f128 unchanged, despite f128
materialization being suboptimal: doing otherwise runs into other asserts in
isel and makes this patch too broad.
This patch hides the issue that was present in bzero_40_stack and bzero_72_stack
because the code now generates in a better order which doesn't have the store
offset issue. I'm not aware of that issue appearing elsewhere at the moment.
<rdar://problem/44157755>
Reviewers: t.p.northover, MatzeB, javed.absar
Subscribers: eraman, kristof.beyls, chrib, dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D51706
llvm-svn: 341558
The runtime pseudo relocations can't handle the AArch64 format PC
relative addressing in adrp+add/ldr pairs. By using stubs, the potentially
dllimported addresses can be touched up by the runtime pseudo relocation
framework.
Differential Revision: https://reviews.llvm.org/D51452
llvm-svn: 341401
When initial support for dllimport was added for aarch64 in
SVN r316555, ClassifyGlobalReference didn't set the MO_DLLIMPORT
flag - that was only completed in SVN r323810. Reuse the return
value from ClassifyGlobalReference for this purpose as well.
llvm-svn: 341310
This adds the plumbing for the Tiny code model for the AArch64 backend. This,
instead of loading addresses through the normal ADRP;ADD pair used in the Small
model, uses a single ADR. The 21 bit range of an ADR means that the code and
its statically defined symbols need to be within 1MB of each other.
This makes it mostly interesting for embedded applications where we want to fit
as much as we can in as small a space as possible.
Differential Revision: https://reviews.llvm.org/D49673
llvm-svn: 340397
`MachineMemOperand` pointers attached to `MachineSDNodes` and instead
have the `SelectionDAG` fully manage the memory for this array.
Prior to this change, the memory management was deeply confusing here --
The way the MI was built relied on the `SelectionDAG` allocating memory
for these arrays of pointers using the `MachineFunction`'s allocator so
that the raw pointer to the array could be blindly copied into an
eventual `MachineInstr`. This creates a hard coupling between how
`MachineInstr`s allocate their array of `MachineMemOperand` pointers and
how the `MachineSDNode` does.
This change is motivated in large part by a change I am making to how
`MachineFunction` allocates these pointers, but it seems like a layering
improvement as well.
This would run the risk of increasing allocations overall, but I've
implemented an optimization that should avoid that by storing a single
`MachineMemOperand` pointer directly instead of allocating anything.
This is expected to be a net win because the vast majority of uses of
these only need a single pointer.
As a side-effect, this makes the API for updating a `MachineSDNode` and
a `MachineInstr` reasonably different which seems nice to avoid
unexpected coupling of these two layers. We can map between them, but we
shouldn't be *surprised* at where that occurs. =]
Differential Revision: https://reviews.llvm.org/D50680
llvm-svn: 339740
Intentionally excluding nodes from the DAGCombine worklist is likely to
lead to weird optimizations and infinite loops, so it's generally a bad
idea.
To avoid the infinite loops, fix DAGCombine to use the
isDesirableToCommuteWithShift target hook before performing the
transforms in question, and implement the target hook in the ARM backend
disable the transforms in question.
Fixes https://bugs.llvm.org/show_bug.cgi?id=38530 . (I don't have a
reduced testcase for that bug. But we should have sufficient test
coverage for PerformSHLSimplify given that we're not playing weird
tricks with the worklist. I can try to bugpoint it if necessary,
though.)
Differential Revision: https://reviews.llvm.org/D50667
llvm-svn: 339734
Summary:
Ensure that NormalizedBuildVector returns a BUILD_VECTOR with operands of the
same type. This fixes an assertion failure in VerifySDNode.
Reviewers: SjoerdMeijer, t.p.northover, javed.absar
Reviewed By: SjoerdMeijer
Subscribers: kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D50202
llvm-svn: 339013
The vector contains the SDNodes that these functions create. The number of nodes is always a small number so we should use SmallVector to avoid a heap allocation.
llvm-svn: 338329
This patch adds a custom trunc store lowering for v4i8 vector types.
Since there is not v.4b register, the v4i8 is promoted to v4i16 (v.4h)
and default action for v4i8 is to extract each element and issue 4
byte stores.
A better strategy would be to extended the promoted v4i16 to v8i16
(with undef elements) and extract and store the word lane which
represents the v4i8 subvectores. The construction:
define void @foo(<4 x i16> %x, i8* nocapture %p) {
%0 = trunc <4 x i16> %x to <4 x i8>
%1 = bitcast i8* %p to <4 x i8>*
store <4 x i8> %0, <4 x i8>* %1, align 4, !tbaa !2
ret void
}
Can be optimized from:
umov w8, v0.h[3]
umov w9, v0.h[2]
umov w10, v0.h[1]
umov w11, v0.h[0]
strb w8, [x0, #3]
strb w9, [x0, #2]
strb w10, [x0, #1]
strb w11, [x0]
ret
To:
xtn v0.8b, v0.8h
str s0, [x0]
ret
The patch also adjust the memory cost for autovectorization, so the C
code:
void foo (const int *src, int width, unsigned char *dst)
{
for (int i = 0; i < width; i++)
*dst++ = *src++;
}
can be vectorized to:
.LBB0_4: // %vector.body
// =>This Inner Loop Header: Depth=1
ldr q0, [x0], #16
subs x12, x12, #4 // =4
xtn v0.4h, v0.4s
xtn v0.8b, v0.8h
st1 { v0.s }[0], [x2], #4
b.ne .LBB0_4
Instead of byte operations.
llvm-svn: 335735
Register x20 is a callee-saved register which may be used for other
purposes in certain contexts, for example to hold special variables
within the kernel. This change adds support for reserving this register
both to frontend and backend to make this register usable for these
purposes.
Differential Revision: https://reviews.llvm.org/D46552
llvm-svn: 334531
As suggested in https://bugs.llvm.org/show_bug.cgi?id=32384#c1, this change
makes the inlining of `memset()` and `memcpy()` more aggressive when
compiling for speed. The tuning remains the same when optimizing for size.
Patch by: Sebastian Pop <s.pop@samsung.com>
Evandro Menezes <e.menezes@samsung.com>
Differential revision: https://reviews.llvm.org/D45098
llvm-svn: 333429
Keep loads and stores together (target defines how many loads
and stores to gang up), such that it will help in pairing
and vectorization.
Differential Revision https://reviews.llvm.org/D46477
llvm-svn: 332482
This patch re-introduces the "S" inline assembler constraint. This matches
an absolute symbolic address or a label reference. The primary use case is
asm("adrp %0, %1\n\t"
"add %0, %0, :lo12:%1" : "=r"(addr) : "S"(&var));
I say re-introduces as it seems like "S" was implemented in the original
AArch64 backend, but it looks like it wasn't carried forward to the merged
backend. The original implementation had A and L modifiers that could be
used to print ":lo12:" to the string. It looks like gcc doesn't use these
and :lo12: is expected to be written in the inline assembly string so I've
not implemented A and L. Clang already supports the S modifier.
Fixes PR37180
Differential Revision: https://reviews.llvm.org/D46745
llvm-svn: 332444
The DEBUG() macro is very generic so it might clash with other projects.
The renaming was done as follows:
- git grep -l 'DEBUG' | xargs sed -i 's/\bDEBUG\s\?(/LLVM_DEBUG(/g'
- git diff -U0 master | ../clang/tools/clang-format/clang-format-diff.py -i -p1 -style LLVM
- Manual change to APInt
- Manually chage DOCS as regex doesn't match it.
In the transition period the DEBUG() macro is still present and aliased
to the LLVM_DEBUG() one.
Differential Revision: https://reviews.llvm.org/D43624
llvm-svn: 332240
Summary:
performPostLD1Combine in AArch64ISelLowering looks for vector
insert_vector_elt of a loaded value which it can optimize into a single
LD1LANE instruction. The code checking for the pattern was not checking
if the lane index was a constant which could cause two problems:
- an assert when lowering the LD1LANE ISD node since it assumes an
constant operand
- an assert in isel if the lane index value depends on the
post-incremented base register
Both of these issues are avoided by simply checking that the lane index
is a constant.
Fixes bug 35822.
Reviewers: t.p.northover, javed.absar
Subscribers: rengolin, kristof.beyls, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D46591
llvm-svn: 332103
Accessing the members of a large data structures needs a lot of GEPs which
usually have large offsets due to the size of the underlying data structure. If
the offsets are too large to fit into the r+i addressing mode, these GEPs cannot
be sunk to their users' blocks and many extra registers are needed then to carry
the values of these GEPs.
This patch tries to split a large data struct starting from %base like the
following.
Before:
BB0:
%base =
BB1:
%gep0 = gep %base, off0
%gep1 = gep %base, off1
%gep2 = gep %base, off2
BB2:
%load1 = load %gep0
%load2 = load %gep1
%load3 = load %gep2
After:
BB0:
%base =
%new_base = gep %base, off0
BB1:
%new_gep0 = %new_base
%new_gep1 = gep %new_base, off1 - off0
%new_gep2 = gep %new_base, off2 - off0
BB2:
%load1 = load i32, i32* %new_gep0
%load2 = load i32, i32* %new_gep1
%load3 = load i32, i32* %new_gep2
In the above example, the struct is split into two parts. The first part still
starts from %base and the second part starts from %new_base. After the
splitting, %new_gep1 and %new_gep2 have smaller offsets and then can be sunk to
BB2 and folded into their users.
The algorithm to split data structure is simple and very similar to the work of
merging SExts. First, it collects GEPs that have large offsets when iterating
the blocks. Second, it splits the underlying data structures and updates the
collected GEPs to use smaller offsets.
Differential Revision: https://reviews.llvm.org/D42759
llvm-svn: 332015
Inspired by r331508, I did a grep and found these.
Mostly just change from dyn_cast to cast. Some cases also showed a dyn_cast result being converted to bool, so those I changed to isa.
llvm-svn: 331577
Summary: Adding support for Fast flags in the SDNode to leverage fast math sub flag usage.
Reviewers: spatel, arsenm, jbhateja, hfinkel, escha, qcolombet, echristo, wristow, javed.absar
Reviewed By: spatel
Subscribers: llvm-commits, rampitec, nhaehnle, tstellar, FarhanaAleen, nemanjai, javed.absar, jbhateja, hfinkel, wdng
Differential Revision: https://reviews.llvm.org/D45710
llvm-svn: 331547
This patch adds a custom lowering for ISD::MULH{S,U} used on divide by
constant optimization (DAGCombiner::BuildSDIV and DAGCombiner::BuildUDIV).
New patterns for smull and umull are added, so AArch64ISD::{S,U}MULL
can be correctly lowered to smull2 and umull2.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D46009
llvm-svn: 331522
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
This reland includes a check to prevent the DAG combiner from folding an
offset that is smaller than the existing one. This can cause oscillations
between two possible DAGs, which was the cause of the hang and later assertion
failure observed on the lnt-ctmark-aarch64-O3-flto bot.
http://green.lab.llvm.org/green/job/lnt-ctmark-aarch64-O3-flto/2024/
Original commit message:
> This is a code size win in code that takes offseted addresses
> frequently, such as C++ constructors that typically need to compute
> an offseted address of a vtable. This reduces the size of Chromium
> for Android's .text section by 108KB.
Differential Revision: https://reviews.llvm.org/D45199
llvm-svn: 330630
This is a code size win in code that takes offseted addresses
frequently, such as C++ constructors that typically need to compute
an offseted address of a vtable. This reduces the size of Chromium
for Android's .text section by 108KB.
Differential Revision: https://reviews.llvm.org/D45199
llvm-svn: 329956
This is a code size win in code that takes offseted addresses
frequently, such as C++ constructors that typically need to compute
an offseted address of a vtable. It reduces the size of Chromium for
Android's .text section by 46KB, or 56KB with ThinLTO (which exposes
more opportunities to use a direct access rather than a GOT access).
Because the addend range is limited in COFF and Mach-O, this is
enabled for ELF only.
Differential Revision: https://reviews.llvm.org/D45199
llvm-svn: 329611
Currently EVT is in the IR layer only because of Function.cpp needing a very small piece of the functionality of EVT::getEVTString(). The rest of EVT is used in codegen making CodeGen a better place for it.
The previous code converted a Type* to EVT and then called getEVTString. This was only expected to handle the primitive types from Type*. Since there only a few primitive types, we can just print them as strings directly.
Differential Revision: https://reviews.llvm.org/D45017
llvm-svn: 328806
This is used by llvm tblgen as well as by LLVM Targets, so the only
common place is Support for now. (maybe we need another target for these
sorts of things - but for now I'm at least making them correct & we can
make them better if/when people have strong feelings)
llvm-svn: 328395
Loads and stores can only shift the offset register by the size of the value
being loaded, but currently the DAGCombiner will reduce the width of the load
if it's followed by a trunc making it impossible to later combine the shift.
Solve this by implementing shouldReduceLoadWidth for the AArch64 backend and
make it prevent the width reduction if this is what would happen, though do
allow it if reducing the load width will let us eliminate a later sign or zero
extend.
Differential Revision: https://reviews.llvm.org/D44794
llvm-svn: 328321
This extends the use of this attribute on ARM and AArch64 from
SVN r325900 (where it was only checked for fixed stack
allocations on ARM/AArch64, but for all stack allocations on X86).
This also adds a testcase for the existing use of disabling the
fixed stack probe with the attribute on ARM and AArch64.
Differential Revision: https://reviews.llvm.org/D44291
llvm-svn: 327897
Following the ARM-neon backend, define isExtractSubvectorCheap to return true
when extracting low and high part of a neon register.
The patch disables a test in llvm/test/CodeGen/AArch64/arm64-ext.ll This
testcase is fragile in the sense that it requires a BUILD_VECTOR to "survive"
all DAG transforms until ISelLowering. The testcase is supposed to check that
AArch64TargetLowering::ReconstructShuffle() works, and for that we need a
BUILD_VECTOR in ISelLowering. As we now transform the BUILD_VECTOR earlier into
an VEXT + vector_shuffle, we don't have the BUILD_VECTOR pattern when we get to
ISelLowering. As there is no way to disable the combiner to only exercise the
code in ISelLowering, the patch disables the testcase.
Differential revision: https://reviews.llvm.org/D43973
llvm-svn: 326811
The error occurs when reading i16 elements (as in the testcase) from a v8i8
with a pattern of <0,2,4,6>. As all the data in the vector is accessed, the
operation is not a VUZP. The patch stops the pattern recognition of VUZP when
EXTRACT_VECTOR_ELT has a different element type than BUILD_VECTOR.
llvm-svn: 326722
Use the whole gammut of constant immediates available to set up a vector.
Instead of using, for example, `mov w0, #0xffff; dup v0.4s, w0`, which
transfers between register files, use the more efficient `movi v0.4s, #-1`
instead. Not limited to just a few values, but any immediate value that can
be encoded by all the variants of `FMOV`, `MOVI`, `MVNI`, thus eliminating
the need to there be patterns to optimize special cases.
Differential revision: https://reviews.llvm.org/D42133
llvm-svn: 326718
when a BUILD_VECTOR is created out of a sequence of EXTRACT_VECTOR_ELT with a
specific pattern sequence, either <0, 2, 4, ...> or <1, 3, 5, ...>, replace the
BUILD_VECTOR with either vuzp1 or vuzp2.
With this patch LLVM generates the following code for the first function fun1 in the testcase:
adrp x8, .LCPI0_0
ldr q0, [x8, :lo12:.LCPI0_0]
tbl v0.16b, { v0.16b }, v0.16b
ext v1.16b, v0.16b, v0.16b, #8
uzp1 v0.8b, v0.8b, v1.8b
str d0, [x8]
ret
Without this patch LLVM currently generates this code:
adrp x8, .LCPI0_0
ldr q0, [x8, :lo12:.LCPI0_0]
tbl v0.16b, { v0.16b }, v0.16b
mov v1.16b, v0.16b
mov v1.b[1], v0.b[2]
mov v1.b[2], v0.b[4]
mov v1.b[3], v0.b[6]
mov v1.b[4], v0.b[8]
mov v1.b[5], v0.b[10]
mov v1.b[6], v0.b[12]
mov v1.b[7], v0.b[14]
str d1, [x8]
ret
llvm-svn: 326443
Emulated TLS is enabled by llc flag -emulated-tls,
which is passed by clang driver.
When llc is called explicitly or from other drivers like LTO,
missing -emulated-tls flag would generate wrong TLS code for targets
that supports only this mode.
Now use useEmulatedTLS() instead of Options.EmulatedTLS to decide whether
emulated TLS code should be generated.
Unit tests are modified to run with and without the -emulated-tls flag.
Differential Revision: https://reviews.llvm.org/D42999
llvm-svn: 326341
Get rid of icky goto loops and make the code easier to maintain. Otherwise,
NFC.
Restore r324903 and fix PR36369.
Differentail revision: https://reviews.llvm.org/D43364
llvm-svn: 325621
This makes sure that alloca() function calls properly probe the
stack as needed.
Differential Revision: https://reviews.llvm.org/D42356
llvm-svn: 325433
The data type is assumed to be a vector, but sometimes it is not, leading
to an assertion.
Add simple test-case to verify this.
Differential revision: https://reviews.llvm.org/D42599
llvm-svn: 325378
It caused "Cannot select: t33: f64 = AArch64ISD::FMOV Constant:i32<0>"
in Chromium builds. See PR36369.
> Get rid of icky goto loops and make the code easier to maintain (NFC).
>
> Differential revision: https://reviews.llvm.org/D42723
llvm-svn: 325034
Armv8.1-A added an atomic load-clear instruction (which performs bitwise
and with the complement of it's operand), but not a load-and
instruction. Our current code-generation for atomic load-and always
inserts an MVN instruction to invert its argument, even if it could be
folded into a constant or another instruction.
This adds lowering early in selection DAG to convert a load-and
operation into an xor with -1 and a load-clear, allowing the normal DAG
optimisations to work on it.
To do this, I've had to add a new ISD opcode, ATOMIC_LOAD_CLR. I don't
see any easy way to do this with an AArch64-specific ISD node, because
the code-generation for atomic operations assumes the SDNodes are of
type AtomicSDNode.
I've left the old tablegen patterns in because they are still needed for
global isel.
Differential revision: https://reviews.llvm.org/D42478
llvm-svn: 324908
Armv8.1-A added an atomic load-add instruction, but not a load-subtract
instruction. Our current code-generation for atomic load-subtract always
inserts a NEG instruction to negate it's argument, even if it could be
folded into a constant or another instruction.
This adds lowering early in selection DAG to convert a load-subtract
operation into a subtract and a load-add, allowing the normal DAG
optimisations to work on it.
I've left the old tablegen patterns in because they are still needed for
global isel.
Some of the tests in this patch are copied from D35375 by Chad Rosier (which
was abandoned).
Differential revision: https://reviews.llvm.org/D42477
llvm-svn: 324892
We were generating "fmov h0, wzr" instructions when FullFP16 is not enabled.
I've not added any tests, because the problem was visible in:
test/CodeGen/AArch64/arm64-zero-cycle-zeroing.ll,
which I had to change: I don't think Cyclone has FullFP16 enabled
by default, so it shouldn't be using this v8.2a instruction.
I've also removed these rdar tags, please shout if there are any objections.
Differential Revision: https://reviews.llvm.org/D43020
llvm-svn: 324581
I added this comment with D42323, but as discussed in D42806, the architecture
does the right thing for denorms. We don't even need the select on 0.0 here?
llvm-svn: 323996
As shown in the example in PR34994:
https://bugs.llvm.org/show_bug.cgi?id=34994
...we can return a very wrong answer (inf instead of 0.0) for square root when
using a reciprocal square root estimate instruction.
Here, I've conditionalized the filtering out of denorms based on the function
having "denormal-fp-math"="ieee" in its attributes. The other options for this
attribute are 'preserve-sign' and 'positive-zero'.
So we don't generate this extra code by default with just '-ffast-math' (because
then there's no denormal attribute string at all), but it works if you specify
'-ffast-math -fdenormal-fp-math=ieee' from clang.
As noted in the review, there may be other problems in clang that affect the
results depending on platform (Linux x86 at least), but this should allow
creating the desired codegen.
Differential Revision: https://reviews.llvm.org/D42323
llvm-svn: 323981
This reverts commit r322917 due to multiple performance regressions in spec2006
and spec2017. XFAILed llvm/test/CodeGen/AArch64/big-callframe.ll which initially
motivated this change.
llvm-svn: 323683
The Large System Extension added an atomic compare-and-swap instruction
that operates on a pair of 64-bit registers, which we can use to
implement a 128-bit cmpxchg.
Because i128 is not a legal type for AArch64 we have to do all of the
instruction selection in C++, and the instruction requires even/odd
register pairs, so we have to wrap it in REG_SEQUENCE and EXTRACT_SUBREG
nodes. This is very similar to what we do for 64-bit cmpxchg in the ARM
backend.
Differential revision: https://reviews.llvm.org/D42104
llvm-svn: 323634
This patch enables aggressive FMA by default on T99, and provides a -mllvm
option to enable the same on other AArch64 micro-arch's (-mllvm
-aarch64-enable-aggressive-fma).
Test case demonstrating the effects on T99 is included.
Patch by: steleman (Stefan Teleman)
Differential Revision: https://reviews.llvm.org/D40696
llvm-svn: 323474
Summary:
Loads/stores of some NEON vector types are promoted to other vector
types with different lane sizes but same vector size. This is not a
problem in little-endian but, when in big-endian, it requires
additional byte reversals required to preserve the lane ordering
while keeping the right endianness of the data inside each lane.
For example:
%1 = load <4 x half>, <4 x half>* %p
results in the following assembly:
ld1 { v0.2s }, [x1]
rev32 v0.4h, v0.4h
This patch changes the promotion of these loads/stores so that the
actual vector load/store (LD1/ST1) takes care of the endianness
correctly and there is no need for further byte reversals. The
previous code now results in the following assembly:
ld1 { v0.4h }, [x1]
Reviewers: olista01, SjoerdMeijer, efriedma
Reviewed By: efriedma
Subscribers: aemerson, rengolin, javed.absar, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D42235
llvm-svn: 323325
Improves the code generation for v4f16 FCMP instructions when FullFP16 is not supported.
Generating FCTVL(s) rather than a longer series of FCVTs.
Differential Revision: https://reviews.llvm.org/D41772
llvm-svn: 323118
Re-commit of r322200: The testcase shouldn't hit machineverifiers
anymore with r322917 in place.
Large callframes (calls with several hundreds or thousands or
parameters) could lead to situations in which the emergency spillslot is
out of range to be addressed relative to the stack pointer.
This commit forces the use of a frame pointer in the presence of large
callframes.
This commit does several things:
- Compute max callframe size at the end of instruction selection.
- Add mirFileLoaded target callback. Use it to compute the max callframe size
after loading a .mir file when the size wasn't specified in the file.
- Let TargetFrameLowering::hasFP() return true if there exists a
callframe > 255 bytes.
- Always place the emergency spillslot close to FP if we have a frame
pointer.
- Note that `useFPForScavengingIndex()` would previously return false
when a base pointer was available leading to the emergency spillslot
getting allocated late (that's the whole effect of this callback).
Which made no sense to me so I took this case out: Even though the
emergency spillslot is technically not referenced by FP in this case
we still want it allocated early.
Differential Revision: https://reviews.llvm.org/D40876
llvm-svn: 322919
Do not create CALLSEQ_START/CALLSEQ_END when there is no callframe to
setup and the callframe size is 0.
- Fixes an invalid callframe nesting for byval arguments, which would
look like this before this patch (as in `big-byval.ll`):
...
ADJCALLSTACKDOWN 32768, 0, ... # Setup for extfunc
...
ADJCALLSTACKDOWN 0, 0, ... # setup for memcpy
...
BL &memcpy ...
ADJCALLSTACKUP 0, 0, ... # destroy for memcpy
...
BL &extfunc
ADJCALLSTACKUP 32768, 0, ... # destroy for extfunc
- Saves us two instructions in the common case of zero-sized stackframes.
- Remove an unnecessary scheduling barrier (hence the small unittest
changes).
Differential Revision: https://reviews.llvm.org/D42006
llvm-svn: 322917
Revert for now as the testcase is hitting a pre-existing verifier error
that manifest as a failure when expensive checks are enabled (or
-verify-machineinstrs) is used.
This reverts commit r322200.
llvm-svn: 322231
Large callframes (calls with several hundreds or thousands or
parameters) could lead to situations in which the emergency spillslot is
out of range to be addressed relative to the stack pointer.
This commit forces the use of a frame pointer in the presence of large
callframes.
This commit does several things:
- Compute max callframe size at the end of instruction selection.
- Add mirFileLoaded target callback. Use it to compute the max callframe size
after loading a .mir file when the size wasn't specified in the file.
- Let TargetFrameLowering::hasFP() return true if there exists a
callframe > 255 bytes.
- Always place the emergency spillslot close to FP if we have a frame
pointer.
- Note that `useFPForScavengingIndex()` would previously return false
when a base pointer was available leading to the emergency spillslot
getting allocated late (that's the whole effect of this callback).
Which made no sense to me so I took this case out: Even though the
emergency spillslot is technically not referenced by FP in this case
we still want it allocated early.
Differential Revision: https://reviews.llvm.org/D40876
llvm-svn: 322200
Currently the promotion for these ignores the normal getTypeToPromoteTo and instead just tries to double the element width. This is because the default behavior of getTypeToPromote to just adds 1 to the SimpleVT, which has the affect of increasing the element count while keeping the scalar size the same.
If multiple steps are required to get to a legal operation type, int_to_fp will be promoted multiple times. And fp_to_int will keep trying wider types in a loop until it finds one that works.
getTypeToPromoteTo does have the ability to query a promotion map to get the type and not do the increasing behavior. It seems better to just let the target specify the promotion type in the map explicitly instead of letting the legalizer iterate via widening.
FWIW, it's worth I think for any other vector operations that need to be promoted, we have to specify the type explicitly because the default behavior of getTypeToPromote isn't useful for vectors. The other types of promotion already require either the element count is constant or the total vector width is constant, but neither happens by incrementing the SimpleVT enum.
Differential Revision: https://reviews.llvm.org/D40664
llvm-svn: 321629
Note:
- X86ISelLowering: setLibcallName(SINCOS) was superfluous as
InitLibcalls() already does it.
- ARMISelLowering: Setting libcallnames for sincos/sincosf seemed
superfluous as in the darwin case it wouldn't be used while for all
other cases InitLibcalls already does it.
llvm-svn: 321036
Rather than adding more bits to express every
MMO flag you could want, just directly use the
MMO flags. Also fixes using a bunch of bool arguments to
getMemIntrinsicNode.
On AMDGPU, buffer and image intrinsics should always
have MODereferencable set, but currently there is no
way to do that directly during the initial intrinsic
lowering.
llvm-svn: 320746
As suggested by Eli Friedman, instead of aborting if an overflow check
uses something other than SETEQ or SETNE, simply do not apply the
optimization.
Differential Revision: https://reviews.llvm.org/D39147
llvm-svn: 319837
This matches how it is done on X86.
This allows using emulated tls on windows; in MinGW environments,
native tls isn't supported at the moment.
Set the right Data*bitsDirective for windows to match the existing
tests for other platforms. Make parts of the existing tests a regex,
to allow matching .section .rdata for windows, to avoid having to
duplicate the rest of the tests for windows.
Differential Revision: https://reviews.llvm.org/D40770
llvm-svn: 319644
Summary:
Now that store-merge is only generates type-safe stores, do a second
pass just before instruction selection to allow lowered intrinsics to
be merged as well.
Reviewers: jyknight, hfinkel, RKSimon, efriedma, rnk, jmolloy
Subscribers: javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D33675
llvm-svn: 319036
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
This header includes CodeGen headers, and is not, itself, included by
any Target headers, so move it into CodeGen to match the layering of its
implementation.
llvm-svn: 317647
The number of iterations was incorrectly determined for DP FP vector types
and the tests were insufficient to flag this issue.
Differential revision: https://reviews.llvm.org/D39507
llvm-svn: 317349
Previously, the dllimport attribute did the right thing in terms
of treating it as a pointer to a value, but this makes sure the
names get mangled properly, and calls to such functions load the
function from the __imp_ pointer.
This is based on SVN r212431 and r212430 where the same was
implemented for ARM.
Differential Revision: https://reviews.llvm.org/D38530
llvm-svn: 316555
E.g. if we have a (xor(overflow-bit), 1) where overflow-bit comes from an
intrinsic like llvm.sadd.with.overflow then we can kill the xor and use the
inverted condition code for the CSEL.
rdar://28495949
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D38160
llvm-svn: 315205
Summary:
Avoid using XZR/WZR directly as operands to split stores of zero
vectors. Doing so can lead to the XZR/WZR being used by an instruction
that doesn't allow it (e.g. add).
Fixes bug 34674.
Reviewers: t.p.northover, efriedma, MatzeB
Subscribers: aemerson, rengolin, javed.absar, mcrosier, eraman, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D38146
llvm-svn: 313916