On CPUs with the zero cycle zeroing feature enabled "movi v.2d" should
be used to zero a vector register. This was previously done at
instruction selection time, however the register coalescer sometimes
widened multiple vregs to the Q width because of that leading to extra
spills. This patch leaves the decision on how to zero a register to the
AsmPrinter phase where it doesn't affect register allocation anymore.
This patch also sets isAsCheapAsAMove=1 on FMOVS0, FMOVD0.
This fixes http://llvm.org/PR27454, rdar://25866262
Differential Revision: http://reviews.llvm.org/D21826
llvm-svn: 274686
The way the named arguments for various system instructions are handled at the
moment has a few problems:
- Large-scale duplication between AArch64BaseInfo.h and AArch64BaseInfo.cpp
- That weird Mapping class that I have no idea what I was on when I thought
it was a good idea.
- Searches are performed linearly through the entire list.
- We print absolutely all registers in upper-case, even though some are
canonically mixed case (SPSel for example).
- The ARM ARM specifies sysregs in terms of 5 fields, but those are relegated
to comments in our implementation, with a slightly opaque hex value
indicating the canonical encoding LLVM will use.
This adds a new TableGen backend to produce efficiently searchable tables, and
switches AArch64 over to using that infrastructure.
llvm-svn: 274576
This reverts commit r259387 because it inserts illegal code after legalization
in some backends where i64 OR type is illegal for example.
llvm-svn: 274573
The other use really does only care about the SDNode (it checks the
opcode against a whitelist), but bitFieldPlacement can be misled if
the node produces multiple results.
Patch by Ismail Badawi.
llvm-svn: 274567
In bidirectional scheduling this gives more stable results than just
comparing the "reason" fields of the top/bottom node because the reason
field may be higher depending on what other nodes are in the queue.
Differential Revision: http://reviews.llvm.org/D19401
llvm-svn: 273755
r273271 changed the RUN line of the regression test to use
-march=cyclone instead of -mtriple=aarch64-none-none.
This caused a change in the output syntax for the ext
instruction, causing the test to fail. Change this test
back to using -mtriple=aarch64-none-none.
llvm-svn: 273286
Summary:
We have switched to using features for all heuristics, but
the tests for these are still using -mcpu, which means we
are not directly testing the features.
This converts at least some of the existing regression tests
to use the new features.
This still leaves the following features untested:
merge-narrow-ld
predictable-select-expensive
alternate-sextload-cvt-f32-pattern
disable-latency-sched-heuristic
Reviewers: mcrosier, t.p.northover, rengolin
Subscribers: MatzeB, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D21288
llvm-svn: 273271
The backend has been around for years, it's pretty ridiculous that we can't
even use the preferred form for printing "MOV" aliases. Unfortunately, TableGen
can't handle the complex predicates when printing so it's a bunch of nasty C++.
Oh well.
llvm-svn: 272865
Of course the assembly was right but because the opcode was MOVZWi it was
encoded as "movz w16, #65535, lsl #32" which is an unallocated encoding and
would go horribly wrong on a CPU.
No idea how this bug survived this long. It seems nobody is using that aspect
of patchpoints.
llvm-svn: 272831
This reapplies commit r271930, r271915, r271923. They hit a bug in
Thumb which is fixed in r272258 now.
The original message:
The code layout that TailMerging (inside BranchFolding) works on is not the
final layout optimized based on the branch probability. Generally, after
BlockPlacement, many new merging opportunities emerge.
This patch calls Tail Merging after MBP and calls MBP again if Tail Merging
merges anything.
llvm-svn: 272267
Summary:
Consider the following diamond CFG:
A
/ \
B C
\/
D
Suppose A->B and A->C have probabilities 81% and 19%. In block-placement, A->B is called a hot edge and the final placement should be ABDC. However, the current implementation outputs ABCD. This is because when choosing the next block of B, it checks if Freq(C->D) > Freq(B->D) * 20%, which is true (if Freq(A) = 100, then Freq(B->D) = 81, Freq(C->D) = 19, and 19 > 81*20%=16.2). Actually, we should use 25% instead of 20% as the probability here, so that we have 19 < 81*25%=20.25, and the desired ABDC layout will be generated.
Reviewers: djasper, davidxl
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D20989
llvm-svn: 272203
Teach AArch64RegisterBankInfo that G_OR can be mapped on either GPR or
FPR for 64-bit or 32-bit values.
Add test cases demonstrating how this information is used to coalesce a
computation on a single register bank.
llvm-svn: 272170
The code layout that TailMerging (inside BranchFolding) works on is not the
final layout optimized based on the branch probability. Generally, after
BlockPlacement, many new merging opportunities emerge.
This patch calls Tail Merging after MBP and calls MBP again if Tail Merging
merges anything.
Differential Revision: http://reviews.llvm.org/D20276
llvm-svn: 271925
We were assuming all SBFX-like operations would have the shl/asr form, but often
when the field being extracted is an i8 or i16, we end up with a
SIGN_EXTEND_INREG acting on a shift instead.
This is a port of r213754 from ARM to AArch64.
llvm-svn: 271677
Summary:
If the target requests it, use emptry spaces in the fixed and
callee-save stack area to allocate local stack objects.
AArch64: Change last callee-save reg stack object alignment instead of
size to leave a gap to take advantage of above change.
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: rengolin, mcrosier, llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D20220
llvm-svn: 271527
A constant pool holding the address of a variable in equivalent to
a got entry. It produces exactly the same instruction sequence as a
got use and unlike a got use this is not uniqued by the linker.
llvm-svn: 271311
Canonicalize (srl (bswap i32 x), 16) to (rotr (bswap i32 x), 16), if the high
16-bits of x are zero. Similarly, canonicalize (srl (bswap i64 x), 32) to
(rotr (bswap i64 x), 32), if the high 32-bits of x are zero.
test_rev_w_srl16: test_rev_w_srl16:
and w8, w0, #0xffff and w8, w0, #0xffff
rev w8, w8 ---> rev16 w0, w8
lsr w0, w8, #16
test_rev_x_srl32: test_rev_x_srl32:
rev x8, x8 ---> rev32 x0, x8
lsr x0, x8, #32
llvm-svn: 270896
If and only if the value being inserted sets only known zero bits.
This combine transforms things like
and w8, w0, #0xfffffff0
movz w9, #5
orr w0, w8, w9
into
movz w8, #5
bfxil w0, w8, #0, #4
The combine is tuned to make sure we always reduce the number of instructions.
We avoid churning code for what is expected to be performance neutral changes
(e.g., converted AND+OR to OR+BFI).
Differential Revision: http://reviews.llvm.org/D20387
llvm-svn: 270846
Summary:
As this optimization converts two loads into one load with two shift instructions,
it could potentially hurt performance if a loop is arithmetic operation intensive.
Reviewers: t.p.northover, mcrosier, jmolloy
Subscribers: evandro, jmolloy, aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D20172
llvm-svn: 270251
When matching an interleaved load to an ldN pattern, the interleaved access
pass checks that all users of the load are shuffles. If the load is used by an
instruction other than a shuffle, the pass gives up and an ldN is not
generated. This patch considers users of the load that are extractelement
instructions. It attempts to modify the extracts to use one of the available
shuffles rather than the load. After the transformation, the load is only used
by shuffles and will then be matched with an ldN pattern.
Differential Revision: http://reviews.llvm.org/D20250
llvm-svn: 270142
Mask0Imm and ~Mask1Imm must be equivalent and one of the MaskImms is a shifted
mask (e.g., 0x000ffff0). Both 'and's must have a single use.
This changes code like:
and w8, w0, #0xffff000f
and w9, w1, #0x0000fff0
orr w0, w9, w8
into
lsr w8, w1, #4
bfi w0, w8, #4, #12
llvm-svn: 270063
When processing inline asm that contains errors, make sure we can recover
gracefully by creating an UNDEF SDValue for the inline asm statement before
returning from SelectionDAGBuilder::visitInlineAsm. This is necessary for
consumers that don't exit on the first error that is emitted (e.g. clang)
and that would assert later on.
Fixes PR24071.
Patch by Diana Picus.
llvm-svn: 269811
Without a diagnostic handler installed, llc's behaviour is to exit on the first
error that it encounters. This is very different from the behaviour of clang
and other front ends, which try to gather as many errors as possible before
exiting.
This commit adds a diagnostic handler to llc, allowing it to find and report
more than one error. The old behaviour is preserved under a flag (-exit-on-error).
Some of the tests fail with the new diagnostic handler, so they have to use the
new flag in order to run under the previous behaviour. Some of these are known
bugs, others need further investigation. Ideally, we should fix the tests and
remove the flag at some point in the future.
Reapplied after fixing the LLDB build that was broken due to the new
DiagnosticSeverity in LLVMContext.h, and fixed an UB in the new change.
Patch by Diana Picus.
llvm-svn: 269655
Without a diagnostic handler installed, llc's behaviour is to exit on the first
error that it encounters. This is very different from the behaviour of clang
and other front ends, which try to gather as many errors as possible before
exiting.
This commit adds a diagnostic handler to llc, allowing it to find and report
more than one error. The old behaviour is preserved under a flag (-exit-on-error).
Some of the tests fail with the new diagnostic handler, so they have to use the
new flag in order to run under the previous behaviour. Some of these are known
bugs, others need further investigation. Ideally, we should fix the tests and
remove the flag at some point in the future.
Reapplied after fixing the LLDB build that was broken due to the new
DiagnosticSeverity in LLVMContext.h.
Patch by Diana Picus.
llvm-svn: 269563
Most immediates are printed in Aarch64InstPrinter using 'formatImm' macro,
but not all of them.
Implementation contains following rules:
- floating point immediates are always printed as decimal
- signed integer immediates are printed depends on flag settings
(for negative values 'formatImm' macro prints the value as i.e -0x01
which may be convenient when imm is an address or offset)
- logical immediates are always printed as hex
- the 64-bit immediate for advSIMD, encoded in "a🅱️c:d:e:f:g:h" is always printed as hex
- the 64-bit immedaite in exception generation instructions like:
brk, dcps1, dcps2, dcps3, hlt, hvc, smc, svc is always printed as hex
- the rest of immediates is printed depends on availability
of -print-imm-hex
Signed-off-by: Maciej Gabka <maciej.gabka@arm.com>
Signed-off-by: Paul Osmialowski <pawel.osmialowski@arm.com>
Differential Revision: http://reviews.llvm.org/D16929
llvm-svn: 269446
This reverts commit r269428, as it breaks the LLDB build. We need to
understand how to change LLDB in the same way as LLC before landing this
again.
llvm-svn: 269432
Without a diagnostic handler installed, llc's behaviour is to exit on the first
error that it encounters. This is very different from the behaviour of clang
and other front ends, which try to gather as many errors as possible before
exiting.
This commit adds a diagnostic handler to llc, allowing it to find and report
more than one error. The old behaviour is preserved under a flag (-exit-on-error).
Some of the tests fail with the new diagnostic handler, so they have to use the
new flag in order to run under the previous behaviour. Some of these are known
bugs, others need further investigation. Ideally, we should fix the tests and
remove the flag at some point in the future.
Patch by Diana Picus.
llvm-svn: 269428
For BITREVERSE, bit shifting/masking every bit in a vector element is a very lengthy procedure.
If the input vector type is a whole multiple of bytes wide then we can split this into a BSWAP shuffle stage (to reverse at the byte level) and then a BITREVERSE stage applied to each byte. Most vector capable targets can efficiently BSWAP using shuffles resulting in a considerable reduction in instructions.
With this patch targets would only need to implement a target specific vXi8 BITREVERSE implementation to efficiently reverse most legal vector types.
Differential Revision: http://reviews.llvm.org/D19978
llvm-svn: 269290
For narrow stores (e.g., strb, srth) we know the upper bits of the register are
unused/not useful. In some cases we can use this information to eliminate
unnecessary instructions.
For example, without this patch we generate (from the 2nd test case):
ldr w8, [x0]
and w8, w8, #0xfff0
bfxil w8, w2, #16, #4
strh w8, [x1]
and after the patch the 'and' is removed:
ldr w8, [x0]
bfxil w8, w2, #16, #4
strh w8, [x1]
ret
During the lowering of the bitfield insert instruction the 'and' is eliminated
because we know the upper 16-bits that are masked off are unused and the lower
4-bits that are masked off are overwritten by the insert itself. Therefore, the
'and' is unnecessary.
Differential Revision: http://reviews.llvm.org/D20175
llvm-svn: 269226
Summary: When emitting comparison for fp16, in addition to promote the LHS and RHS to fp32, we need to change the VT as well.
Reviewers: t.p.northover
Subscribers: t.p.northover, aemerson, rengolin, llvm-commits
Differential Revision: http://reviews.llvm.org/D19922
llvm-svn: 269151
Unlike xN/wN, the size of vN is genuinely ambiguous in the assembly, so we
should try to infer what was intended from the type. But only down to 64-bits
(vN can never represent sN, hN or bN).
llvm-svn: 269132
Summary:
This implements the lowering of the X constraint on
AArch64.
The default behaviour of the X constraint lowering is to
restrict it to "f". This is a problem because the "f"
constraint is not implemented on AArch64 and would be too
restrictive anyway. Therefore, the AArch64 hook will
lower this to "w" (if the operand is a floating point or
vector) or "r" otherwise.
The implementation is similar with the one added for
ARM (r267411).
This is the AArch64 side of the fix for http://llvm.org/PR26493
Reviewers: rengolin
Subscribers: aemerson, rengolin, llvm-commits, t.p.northover
Differential Revision: http://reviews.llvm.org/D19967
llvm-svn: 268907
Summary:
If a function needs to allocate both callee-save stack memory and local
stack memory, we currently decrement/increment the SP in two steps:
first for the callee-save area, and then for the local stack area. This
changes the code to allocate them both at once at the very beginning/end
of the function. This has two benefits:
1) there is one fewer sub/add micro-op in the prologue/epilogue
2) the stack adjustment instructions act as a scheduling barrier, so
moving them to the very beginning/end of the function increases post-RA
scheduler's ability to move instructions (that only depend on argument
registers) before any of the callee-save stores
This change can cause an increase in instructions if the original local
stack SP decrement could be folded into the first store to the stack.
This occurs when the first local stack store is to stack offset 0. In
this case we are trading off one more sub instruction for one fewer sub
micro-op (along with benefits (2) and (3) above).
Reviewers: t.p.northover
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18619
llvm-svn: 268746
If we don't, values that aren't precisely representable in f16 could
be used as-is in a promoted f32 operation, which would produce
incorrect results.
AArch64 had the correct behavior; add a focused test.
Fixes http://llvm.org/PR26871
llvm-svn: 268700
This patch adds support for estimating the square root, its reciprocal and
division or reciprocal using the combiner generic reciprocal machinery.
llvm-svn: 268539
transferSuccessors() would LoadCmpBB a successor of DoneBB,
whereas it should be a successor of the original MBB.
Follow-up to r266339.
Unfortunately, it's tricky to catch this in the verifier.
llvm-svn: 267779
log2(Mask) is smaller than 32, we must use the 32-bit variant because the 64-bit
variant cannot encode it. Therefore, set the subreg part accordingly.
[AArch64] Fix optimizeCondBranch logic.
The opcode for the optimized branch does not depend on the size
of the activate bits in the AND masks, but the AND opcode itself.
Indeed, we need to use a X or W variant based on the AND variant
not based on whether the mask fits into the related variant.
Otherwise, we may end up using the W variant of the optimized branch
for 64-bit register inputs!
This fixes the last make check verifier issues for AArch64: PR27479.
llvm-svn: 267465
The original patch caused crashes because it could derefence a null pointer
for SelectionDAGTargetInfo for targets that do not define it.
Evaluates fmul+fadd -> fmadd combines and similar code sequences in the
machine combiner. It adds support for float and double similar to the existing
integer implementation. The key features are:
- DAGCombiner checks whether it should combine greedily or let the machine
combiner do the evaluation. This is only supported on ARM64.
- It gives preference to throughput over latency: the heuristic used is
to combine always in loops. The targets decides whether the machine
combiner should optimize for throughput or latency.
- Supports for fmadd, f(n)msub, fmla, fmls patterns
- On by default at O3 ffast-math
llvm-svn: 267328
The opcode for the optimized branch does not depend on the size
of the activate bits in the AND masks, but the AND opcode itself.
Indeed, we need to use a X or W variant based on the AND variant
not based on whether the mask fits into the related variant.
Otherwise, we may end up using the W variant of the optimized branch
for 64-bit register inputs!
This fixes the last make check verifier issues for AArch64: PR27479.
llvm-svn: 267206
Avoid quadratic complexity in unusually large basic blocks by limiting
the size of the ready lists.
Differential Revision: http://reviews.llvm.org/D19349
llvm-svn: 267189
We used to simply set the kill flags to true when transforming a scalar
instruction to a vector one.
SrcScalar1 = copy SrcVector1
... = opScalar SrcScalar1
=>
SrcScalar1 = copy SrcVector1
... = opVector SrcVector1<kill>
This is obviously wrong. The proper update consists in:
1. Propagate the kill status from the copy to the new opVector
2. Reset the kill status on the copy, since the live-range of
SrcVector1 got extended.
This fixes some of the machine verifier errors for AArch64 with make check.
llvm-svn: 267180
Evaluates fmul+fadd -> fmadd combines and similar code sequences in the
machine combiner. It adds support for float and double similar to the existing
integer implementation. The key features are:
- DAGCombiner checks whether it should combine greedily or let the machine
combiner do the evaluation. This is only supported on ARM64.
- It gives preference to throughput over latency: the heuristic used is
to combine always in loops. The targets decides whether the machine
combiner should optimize for throughput or latency.
- Supports for fmadd, f(n)msub, fmla, fmls patterns
- On by default at O3 ffast-math
llvm-svn: 267098
This test used to write a .s file until r266971 fixed that. But on most bots,
the .s file still exists. Add an rm statement to clean up the bots. In a few
days, this statement can go away again.
llvm-svn: 267095
AArch64InstrInfo::optimizeCompareInstr has bug PR27158 which causes generation of incorrect code.
A compare instruction is substituted with another instruction which does not
produce the same flags as the original compare instruction.
This patch contains:
1. Fix of the bug.
2. A regression test in MIR.
3. A new test to check that SUBS is replaced by SUB.
Differential Revision: http://reviews.llvm.org/D18838
llvm-svn: 266969
Both AArch64 and ARM support llvm.<arch>.thread.pointer intrinsics that
just return the thread pointer. I have a pending patch that does the same
for SystemZ (D19054), and there are many more targets that could benefit
from one.
This patch merges the ARM and AArch64 intrinsics into a single target
independent one that will also be used by subsequent targets.
Differential Revision: http://reviews.llvm.org/D19098
llvm-svn: 266818
With this change, ideally IR pass can always generate llvm.stackguard
call to get the stack guard; but for now there are still IR form stack
guard customizations around (see getIRStackGuard()). Future SSP
customization should go through LOAD_STACK_GUARD.
There is a behavior change: stack guard values are not CSEed anymore,
since we should never reuse the value in case that it has been spilled (and
corrupted). See ssp-guard-spill.ll. This also cause the change of stack
size and codegen in X86 and AArch64 test cases.
Ideally we'd like to know if the guard created in llvm.stackprotector() gets
spilled or not. If the value is spilled, discard the value and reload
stack guard; otherwise reuse the value. This can be done by teaching
register allocator to know how to rematerialize LOAD_STACK_GUARD and
force a rematerialization (which seems hard), or check for spilling in
expandPostRAPseudo. It only makes sense when the stack guard is a global
variable, which requires more instructions to load. Anyway, this seems to go out
of the scope of the current patch.
llvm-svn: 266806
This improves AA in the MI schduler when reason about paired instructions.
Phabricator Revision: http://reviews.llvm.org/D17098
PR26358
llvm-svn: 266462
Currently each Function points to a DISubprogram and DISubprogram has a
scope field. For member functions the scope is a DICompositeType. DIScopes
point to the DICompileUnit to facilitate type uniquing.
Distinct DISubprograms (with isDefinition: true) are not part of the type
hierarchy and cannot be uniqued. This change removes the subprograms
list from DICompileUnit and instead adds a pointer to the owning compile
unit to distinct DISubprograms. This would make it easy for ThinLTO to
strip unneeded DISubprograms and their transitively referenced debug info.
Motivation
----------
Materializing DISubprograms is currently the most expensive operation when
doing a ThinLTO build of clang.
We want the DISubprogram to be stored in a separate Bitcode block (or the
same block as the function body) so we can avoid having to expensively
deserialize all DISubprograms together with the global metadata. If a
function has been inlined into another subprogram we need to store a
reference the block containing the inlined subprogram.
Attached to https://llvm.org/bugs/show_bug.cgi?id=27284 is a python script
that updates LLVM IR testcases to the new format.
http://reviews.llvm.org/D19034
<rdar://problem/25256815>
llvm-svn: 266446
Summary:
Without MMOs, the callee-save load/store instructions were treated as
volatile by the MI post-RA scheduler and AArch64LoadStoreOptimizer.
Reviewers: t.p.northover, mcrosier
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D17661
llvm-svn: 266439
Perform store clustering just like load clustering. This change add
StoreClusterMutation in machine-scheduler. To control StoreClusterMutation,
added enableClusterStores() in TargetInstrInfo.h. This is enabled only on
AArch64 for now.
This change also add support for unscaled stores which were not handled in
getMemOpBaseRegImmOfs().
llvm-svn: 266437
FastRegAlloc works only at the basic-block level and spills all live-out
registers. Unfortunately for a stack-based cmpxchg near the spill slots, this
can perpetually clear the exclusive monitor, which means the cmpxchg will never
succeed.
I believe the only way to handle this within LLVM is by expanding the loop
post-regalloc. We don't want this in general because it severely limits the
optimisations that can be done, so we limit this to -O0 compilations.
It's an ugly hack, and about the one good point in the whole mess is that we
can treat all cmpxchg operations in the most naive way possible (seq_cst, no
clrex faff) without affecting correctness.
Should fix PR25526.
llvm-svn: 266339
It is very likely that the swiftself parameter is alive throughout most
functions function so putting it into a callee save register should
avoid spills for the callers with only a minimum amount of extra spills
in the callees.
Currently the generated code is correct but unnecessarily spills and
reloads arguments passed in callee save registers, I will address this
in upcoming patches.
This also adds a missing check that for tail calls the preserved value
of the caller must be the same as the callees parameter.
Differential Revision: http://reviews.llvm.org/D19007
llvm-svn: 266251
Disable LDP/STP for quads on Exynos M1 as they are not as efficient as pairs
of regular LDR/STR.
Patch by Abderrazek Zaafrani <a.zaafrani@samsung.com>.
llvm-svn: 266223
This patch fixes a bug (PR26827) when using anti-aliasing in store
merging. This sets the chain users of the component stores to point to
the new store instead of the component stores chain parent.
Reviewers: jyknight
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18909
llvm-svn: 266217
two fixes with one about error verify-regalloc reported, and
another about live range update of phi after rematerialization.
r265547:
Replace analyzeSiblingValues with new algorithm to fix its compile
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is a N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes siginificant compile time issue when N is large, it is also
important for performance since it removes redundent spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the define expr of the original register for
rematerializaiton and keep it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Patches on top of r265547:
r265610 "Fix the compare-clang diff error introduced by r265547."
r265639 "Fix the sanitizer bootstrap error in r265547."
r265657 "InlineSpiller.cpp: Escap \@ in r265547. [-Wdocumentation]"
Differential Revision: http://reviews.llvm.org/D15302
Differential Revision: http://reviews.llvm.org/D18934
Differential Revision: http://reviews.llvm.org/D18935
Differential Revision: http://reviews.llvm.org/D18936
llvm-svn: 266162
Summary:
In getUnderlyingObjectsForInstr(): Don't give up on instructions with
multiple MMOs, instead look through all the MMOs and if they all meet
the conservative criteria previously used for single MMO instructions,
then return all of the underlying objects derived from the MMOs.
The change to ScheduleDAGInstrs::buildSchedGraph() is needed to avoid
the case where multiple underlying objects are present and are related
in such a way that successive iterations of the loop end up adding a
dependency from an instruction to itself.
Reviewers: atrick, hfinkel
Subscribers: MatzeB, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18093
llvm-svn: 266084
It caused PR27275: "ARM: Bad machine code: Using an undefined physical register"
Also reverting the following commits that were landed on top:
r265610 "Fix the compare-clang diff error introduced by r265547."
r265639 "Fix the sanitizer bootstrap error in r265547."
r265657 "InlineSpiller.cpp: Escap \@ in r265547. [-Wdocumentation]"
llvm-svn: 265790
when DenseMap growed and moved memory. I verified it fixed the bootstrap
problem on x86_64-linux-gnu but I cannot verify whether it fixes
the bootstrap error on clang-ppc64be-linux. I will watch the build-bot
result closely.
Replace analyzeSiblingValues with new algorithm to fix its compile
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is a N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes siginificant compile time issue when N is large, it is also
important for performance since it removes redundent spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the define expr of the original register for
rematerializaiton and keep it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265547
Bionic has a defined thread-local location for the stack protector
cookie. Emit a direct load instead of going through __stack_chk_guard.
llvm-svn: 265481
Presently, CodeGenPrepare deletes all nearly empty (only phi and branch)
basic blocks. This pass can delete loop preheaders which frequently creates
critical edges. A preheader can be a convenient place to spill registers to
the stack. If the entrance to a loop body is a critical edge, then spills
may occur in the loop body rather than immediately before it. This patch
protects loop preheaders from deletion in CodeGenPrepare even if they are
nearly empty.
Since the patch alters the CFG, it affects a large number of test cases.
In most cases, the changes are merely cosmetic (basic blocks have different
names or instruction orders change slightly). I am somewhat concerned about
the test/CodeGen/Mips/brdelayslot.ll test case. If the loop preheader is not
deleted, then the MIPS backend does not take advantage of a branch delay
slot. Consequently, I would like some close review by a MIPS expert.
The patch also partially subsumes D16893 from George Burgess IV. George
correctly notes that CodeGenPrepare does not actually preserve the dominator
tree. I think the dominator tree was usually not valid when CodeGenPrepare
ran, but I am using LoopInfo to mark preheaders, so the dominator tree is
now always valid before CodeGenPrepare.
Author: Tom Jablin (tjablin)
Reviewers: hfinkel george.burgess.iv vkalintiris dsanders kbarton cycheng
http://reviews.llvm.org/D16984
llvm-svn: 265397
We can only perform a tail call to a callee that preserves all the
registers that the caller needs to preserve.
This situation happens with calling conventions like preserver_mostcc or
cxx_fast_tls. It was explicitely handled for fast_tls and failing for
preserve_most. This patch generalizes the check to any calling
convention.
Related to rdar://24207743
Differential Revision: http://reviews.llvm.org/D18680
llvm-svn: 265329
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is a N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes siginificant compile time issue when N is large, it is also
important for performance since it removes redundent spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the define expr of the original register for
rematerializaiton and keep it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265309
This mostly cosmetic patch moves the DebugEmissionKind enum from DIBuilder
into DICompileUnit. DIBuilder is not the right place for this enum to live
in — a metadata consumer should not have to include DIBuilder.h.
I also added a Verifier check that checks that the emission kind of a
DICompileUnit is actually legal.
http://reviews.llvm.org/D18612
<rdar://problem/25427165>
llvm-svn: 265077
Summary:
This change will allow loads with imp-def to be clustered in machine-scheduler pass.
areMemAccessesTriviallyDisjoint() can also handle loads with imp-def.
Reviewers: mcrosier, jmolloy, t.p.northover
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18665
llvm-svn: 265051
Summary:
This change will handle missing store pair opportunity where the first store
instruction stores zero followed by the non-zero store. For example, this change
will convert :
str wzr, [x8]
str w1, [x8, #4]
into:
stp wzr, w1, [x8]
Reviewers: jmolloy, t.p.northover, mcrosier
Subscribers: flyingforyou, aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18570
llvm-svn: 265021
When merging stores in DAGCombiner, add check to ensure that no
dependenices exist that would cause the construction of a cycle in our
DAG. This may happen if one store has a data dependence on another
instruction (e.g. a load) which itself has a (chain) dependence on
another store being merged. These stores cannot be merged safely and
doing so results in a cycle that is discovered in LegalizeDAG.
This test is only done in cases where Antialias analysis is used (UseAA)
as non-AA store merge candidates will be merged logically after all
loads which have been checked to not alias.
Reviewers: ahatanak, spatel, niravd, arsenm, hfinkel, tstellarAMD, jyknight
Subscribers: llvm-commits, tberghammer, danalbert, srhines
Differential Revision: http://reviews.llvm.org/D18336
llvm-svn: 264461
This patch adds unscaled loads and sign-extend loads to the TII
getMemOpBaseRegImmOfs API, which is used to control clustering in the MI
scheduler. This is done to create more opportunities for load pairing. I've
also added the scaled LDRSWui instruction, which was missing from the scaled
instructions. Finally, I've added support in shouldClusterLoads for clustering
adjacent sext and zext loads that too can be paired by the load/store optimizer.
Differential Revision: http://reviews.llvm.org/D18048
llvm-svn: 263819
When the SP in not changed because of realignment/VLAs etc., we restore the SP
by using the previous value of SP and not the FP. Breaking the dependency will
help in cases when the epilog of a callee is close to the epilog of the caller;
for then "sub sp, fp, #" depends on the load restoring the FP in the epilog of
the callee.
http://reviews.llvm.org/D18060
Patch by Aditya Kumar and Evandro Menezes.
llvm-svn: 263458
Summary:
Peephole optimization that generates a single TBZ/TBNZ instruction
for test and branch sequences like in the example below. This handles
the cases that miss folding of AND into TBZ/TBNZ during ISelLowering of BR_CC
Examples:
and w8, w8, #0x400
cbnz w8, L1
to
tbnz w8, #10, L1
Reviewers: MatzeB, jmolloy, mcrosier, t.p.northover
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D17942
llvm-svn: 263136
This change adds a support for a preserve_most calling convention to the AArch64 backend, similar to how it was done for X86-64.
There is also a subsequent patch on top of this one to add a tail-calls support for this calling convention.
Differential Revision: http://reviews.llvm.org/D18016
llvm-svn: 263092
This test failed in some ARM bots after a divmod change because it was
running on a native llc, instead of targeted one. This makes sure the test
is target-specific (as intended), and also copies to ARM and AArch64
directories. If it is also supposed to work on other architectures, I'll
leave as an exercise to the respective maintainers.
llvm-svn: 262620
Summary:
This change enables frame pointer elimination in non-leaf functions.
The -fomit-frame-pointer option still needs to be used when compiling
via clang (or an equivalent method of not setting the
'no-frame-pointer-elim*' function attributes if generating llvm IR via
some other method) to take advantage of this optimization.
This change should be NFC when compiling via clang without
-fomit-frame-pointer.
Reviewers: t.p.northover
Subscribers: aemerson, rengolin, tberghammer, qcolombet, llvm-commits, danalbert, mcrosier, srhines
Differential Revision: http://reviews.llvm.org/D17730
llvm-svn: 262495
Revert r262248 in an attempt to fix the clang-native-aarch64-full
bot and to investigate a performance regression in
SingleSource/Benchmarks/CoyoteBench/huffbench
llvm-svn: 262388
Summary:
Both the hardware and LLVM have changed since 2012.
Now, load-based heuristic don't show big differences any more on OoO cores.
There is no notable regressons and improvements on spec2000/2006. (Cortex-A57, Core i5).
Reviewers: spatel, zansari
Differential Revision: http://reviews.llvm.org/D16836
llvm-svn: 261809
Summary:
Fix a bug in epilog generation where the incoming stack arguments were
not being popped for fastcc functions when -tailcallopt was passed.
Reviewers: t.p.northover, mcrosier, jmolloy, rengolin
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D16894
llvm-svn: 261650
Summary:
Without this, this command
$ llvm-run llc -stop-after machine-cp -o - <( echo '' )
outputs an error, because we close stdout twice -- once when closing the
file opened for "-o", and again when closing outs().
Also clarify in the outs() definition that you can't ever call it if you
want to open your own raw_fd_ostream on stdout.
Reviewers: jroelofs, tstellarAMD
Subscribers: jholewinski, qcolombet, dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D17422
llvm-svn: 261286
After r261154, we were only clearing flags if the known-zero register was
originally live-in to the basic block, but we have to do it even if not when
more than one COPY has been eliminated, otherwise the user of the first COPY
may still have <kill> marked.
E.g.
BB#N:
%X0 = COPY %XZR
STRXui %X0<kill>, <fi#0>
%X0 = COPY %XZR
STRXui %X0<kill>, <fi#1>
We can eliminate both copies, X0 is not live-in, but we must clear the kill on
the first store.
Unfortunately, I've been unable to come up with a non-fragile test for this.
I've only seen it in the wild with regalloc-created spills, and attempts to
reproduce that in a reasonable way run afoul of COPY coalescing. Even volatile
asm clobbers were moved around. Should fix the aarch64 bot though.
llvm-svn: 261175
Mostly, this fixes the bug that if the CBZ guaranteed Xn but Wn was used, we
didn't sort out the use-def chain properly.
I've also made it check more than just the last instruction for a compatible
CBZ (so it can cope without fallthroughs). I'd have liked to do that
separately, but it's helps writing the test.
Finally, I removed some custom loops in favour of MachineInstr helpers and
refactored the control flow to flatten it and avoid possibly quadratic
iterations in blocks with many copies. NFC for these, just a general tidy-up.
llvm-svn: 261154
Summary:
This change will add a pass to remove unnecessary zero copies in target blocks
of cbz/cbnz instructions. E.g., the copy instruction in the code below can be
removed because the cbz jumps to BB1 when x0 is zero :
BB0:
cbz x0, .BB1
BB1:
mov x0, xzr
Jun
Reviewers: gberry, jmolloy, HaoLiu, MatzeB, mcrosier
Subscribers: mcrosier, mssimpso, haicheng, bmakam, llvm-commits, aemerson, rengolin
Differential Revision: http://reviews.llvm.org/D16203
llvm-svn: 261004
Summary:
Before this change, callee-save registers would be rounded up to even
pairs of GPRs and FPRs. This change eliminates these extra padding
load/stores, though it does keep the stack allocation the same size
unless both the GPR and FPR sets have an odd size, in which case one
full pair stack slot (16 bytes) is saved.
This optimization cannot currently be done for MachO targets since they
rely on a fast-path .debug_frame equivalent that can only encode
callee-save registers as pairs.
Reviewers: t.p.northover, rengolin, mcrosier, jmolloy
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D17000
llvm-svn: 260689
This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.
This is a reapplication of r259812, which had an incorrect assert. The
test_stur_str_no_assert() test is a reduced version of the issue hit in
the AArch64 self-host.
PR24465
llvm-svn: 260523
Summary:
Fix case where a pre-inc/dec load/store would not be formed if the
add/sub that forms the inc/dec part of the operation was the first
instruction in the block being examined.
Reviewers: mcrosier, jmolloy, t.p.northover, junbuml
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D16785
llvm-svn: 260275
If a range has a lower bound of 0, add an AssertZext from the
nearest floor power of two.
This allows operations with some workitem intrinsics with known
maximum ranges to use fast 24-bit multiplies.
llvm-svn: 260109
This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.
PR24465
http://reviews.llvm.org/D12116
Many thanks to Ahmed and Michael for fixes and code review.
This is a reapplication of r246769 and r259790. The tramp3d failure was caused
by an incorrect refactoring in the patch. Specifically, we weren't always
properly clearing the SExtIdx flag.
llvm-svn: 259812
During instruction selection, the AArch64 backend can recognise the
following pattern and generate an [U|S]MADDL instruction, i.e. a
multiply of two 32-bit operands with a 64-bit result:
(mul (sext i32), (sext i32))
However, when one of the operands is constant, the sign extension
gets folded into the constant in SelectionDAG::getNode(). This means
that the instruction selection sees this:
(mul (sext i32), i64)
...which doesn't match the pattern. Sign-extension and 64-bit
multiply instructions are generated, which are slower than one 32-bit
multiply.
Add a pattern to match this and generate the correct instruction, for
both signed and unsigned multiplies.
Patch by Chris Diamand!
llvm-svn: 259800
This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.
PR24465
http://reviews.llvm.org/D12116
Many thanks to Ahmed and Michael for fixes and code review.
This is a reapplication of r246769, which was reverted in r246782 due to a
test-suite failure. I'm unable to reproduce the issue at this time.
llvm-svn: 259790
Recommited, after some fixing with test cases.
Updated test cases:
test/CodeGen/AArch64/arm64-misched-memdep-bug.ll
test/CodeGen/AArch64/tailcall_misched_graph.ll
Temporarily disabled test cases:
test/CodeGen/AMDGPU/split-vector-memoperand-offsets.ll
test/CodeGen/PowerPC/ppc64-fastcc.ll (partially updated)
test/CodeGen/PowerPC/vsx-fma-m.ll
test/CodeGen/PowerPC/vsx-fma-sp.ll
http://reviews.llvm.org/D8705
Reviewers: Hal Finkel, Andy Trick.
llvm-svn: 259673
Summary:
This is an extension to the existing implementation of r242436 which
restricts to only select inputs. This version fixes missed opportunities
in pr26084 by attempting to lower conditional compare sequences of
and/or trees with setcc leafs. This will additionaly handle the case
when a tree with select input is not a conjunction-disjunction tree
but some of the sub trees are conjunction-disjunction trees.
Reviewers: jmolloy, t.p.northover, mcrosier, MatzeB
Subscribers: mcrosier, llvm-commits, junbuml, haicheng, mssimpso, gberry
Differential Revision: http://reviews.llvm.org/D16291
llvm-svn: 259387
Since we only have pair - not single - nontemporal store instructions,
we have to extract the high part into a separate register to be able
to use them.
When the initial nontemporal codegen support was added, I wrote the
extract using the nonsensical UBFX [0,32[.
Use the correct LSR form instead.
llvm-svn: 259134
Summary:
findBetterNeighborChains does not handle volatile or indexed stores.
However, it did not check when adding stores to ChainedStores.
Reviewers: arsenm
Differential Revision: http://reviews.llvm.org/D16463
llvm-svn: 259024
For historic reasons, the behavior of .align differs between targets.
Fortunately, there are alternatives, .p2align and .balign, which make the
interpretation of the parameter explicit, and which behave consistently across
targets.
This patch teaches MC to use .p2align instead of .align, so that people reading
code for multiple architectures don't have to remember which way each platform
does its .align directive.
Differential Revision: http://reviews.llvm.org/D16549
llvm-svn: 258750
Some of the conditions necessary to produce ccmp sequences were only
checked in recursive calls to emitConjunctionDisjunctionTree() after
some of the earlier expressions were already built. Move all checks over
to isConjunctionDisjunctionTree() so they are all checked before we
start emitting instructions.
Also rename some variable to better reflect their usage.
llvm-svn: 258605
The current behavior is incorrect, as the two CCs returned by
changeFPCCToAArch64CC, intended to be OR'ed, are instead used
in an AND ccmp chain.
Consider:
define i32 @t(float %a, float %b, float %c, float %d, i32 %e, i32 %f) {
%cc1 = fcmp one float %a, %b
%cc2 = fcmp olt float %c, %d
%and = and i1 %cc1, %cc2
%r = select i1 %and, i32 %e, i32 %f
ret i32 %r
}
Assuming (%a < %b) and (%c < %d); we used to do:
fcmp s0, s1 # nzcv <- 1000
orr w8, wzr, #0x1 # w8 <- 1
csel w9, w8, wzr, mi # w9 <- 1
csel w8, w8, w9, gt # w8 <- 1
fcmp s2, s3 # nzcv <- 1000
cset w9, mi # w9 <- 1
tst w8, w9 # (w8 & w9) == 1, so: nzcv <- 0000
csel w0, w0, w1, ne # w0 <- w0
We now do:
fcmp s2, s3 # nzcv <- 1000
fccmp s0, s1, #0, mi # mi, so: nzcv <- 1000
fccmp s0, s1, #8, le # !le, so: nzcv <- 1000
csel w0, w0, w1, pl # !pl, so: w0 <- w1
In other words, we transformed:
(c < d) && ((a < b) || (a > b))
into:
(c < d) && (a u>= b) && (a u<= b)
whereas, per De Morgan's, we wanted:
(c < d) && !((a u>= b) && (a u<= b))
Note that this problem doesn't occur in the test-suite.
changeFPCCToAArch64CC produces disjunct CCs; here, one -> mi/gt.
We can't represent that in the fccmp chain; it can't express
arbitrary OR sequences, as one comment explains:
In general we can create code for arbitrary "... (and (and A B) C)"
sequences. We can also implement some "or" expressions, because
"(or A B)" is equivalent to "not (and (not A) (not B))" and we can
implement some negation operations. [...] However there is no way
to negate the result of a partial sequence.
Instead, introduce changeFPCCToANDAArch64CC, which produces the
conjunct cond codes:
- (a one b)
== ((a olt b) || (a ogt b))
== ((a ord b) && (a une b))
- (a ueq b)
== ((a uno b) || (a oeq b))
== ((a ule b) && (a uge b))
Note that, at first, one might think that, when PushNegate is true,
we should use the disjunct CCs, in effect doing:
(a || b)
= !(!a && !(b))
= !(!a && !(b1 || b2)) <- changeFPCCToAArch64CC(b, b1, b2)
= !(!a && !b1 && !b2)
However, we can take advantage of the fact that the CC is already
negated, which lets us avoid special-casing PushNegate and doing
the simpler to reason about:
(a || b)
= !(!a && (!b))
= !(!a && (b1 && b2)) <- changeFPCCToANDAArch64CC(!b, b1, b2)
= !(!a && b1 && b2)
This makes both emitConditionalCompare cases behave identically,
and produces correct ccmp sequences for the 2-CC fcmps.
llvm-svn: 258533
Summary:
SETCC with f16 vectors has OperationAction set to Expand but still gets
lowered to FCM* intrinsics based on its result type. This patch skips
lowering of VSETCC if the operand is an f16 vector.
v4 and v8 tests included.
Reviewers: ab, jmolloy
Subscribers: srhines, llvm-commits
Differential Revision: http://reviews.llvm.org/D15361
llvm-svn: 258471
When we have a single basic block, the explicit copy-back instructions should
be inserted right before the terminator. Before this fix, they were wrongly
placed at the beginning of the basic block.
I will commit fixes to other platforms as well.
PR26136
llvm-svn: 257929
Summary:
This pass may modify the Cmp operands. However, the flag reg may be used by both the branch and CSEL.
Modifying CMP will have side effect on CSEL.
Reviewers: t.p.northover
Subscribers: llvm-commits, aemerson, rengolin
Differential Revision: http://reviews.llvm.org/D16147
llvm-svn: 257844
Previous implementation in http://reviews.llvm.org/D10522
created external references to __emutls_v.* variables.
Such references are inaccurate and cannot be handled by
all linkers, e.g. Android dynamic and gold linkers for aarch64.
Now a new LowerEmuTLS pass to go through all global variables,
and add emutls_v.* and emutls_t.* variables.
These __emutls* variables have the same linkage and
visibility as the associated user defined TLS variable.
Also removed old code that dump __emutls* variables in AsmPrinter.cpp,
and updated TLS unit tests.
Differential Revision: http://reviews.llvm.org/D15300
llvm-svn: 257718
This is a recommit of r257253 which was reverted in r257270.
Previous testcase can make failure on some targets due to using opt with O3 option.
Original Summary:
Merge MBBICommon and MBBI's MMOs.
Differential Revision: http://reviews.llvm.org/D15990
llvm-svn: 257317
Summary:
In buildSchedGraph(), when adding memory dependencies for loads, move
the call to adjustChainDeps() after the call to
addChainDependency(AliasChain) to handle the case where
addChainDependency(AliasChain) ends up not adding a dependency and
instead putting the SU on the RejectMemNodes list. The call to
adjustChainDeps() must be done after the call to addChainDependency() in
order to process the SU added to the RejectMemNodes list to create
memory dependencies for it.
Reviewers: hfinkel, atrick, jonpa, resistor
Subscribers: mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D15927
llvm-svn: 256950
Summary:
There are a number of files in the tree which have been accidentally checked in with DOS line endings. Convert these to native line endings.
There are also a few files which have DOS line endings on purpose, and I have set the svn:eol-style property to 'CRLF' on those.
Reviewers: joerg, aaron.ballman
Subscribers: aaron.ballman, sanjoy, dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D15848
llvm-svn: 256707
This is a recommit of r256004 which was reverted in r256160. The issue was the
incorrect promotion for half and byte loads transformed into mov instructions.
This fix will replace half and byte type loads only with bit field extracts.
Original commit message:
This change promotes load instructions which directly read from stored by
replacing them with mov instructions. If the store is wider than the load,
the load will be replaced with a bitfield extract.
For example :
STRWui %W1, %X0, 1
%W0 = LDRHHui %X0, 3
becomes
STRWui %W1, %X0, 1
%W0 = UBFMWri %W1, 16, 31
llvm-svn: 256249
This patch adds to the target description two additional patterns for matching
extract-extend operations to SMOV. The patterns catch the v16i8-to-i64 and
v8i16-to-i64 cases. The existing patterns miss these cases because the
extracted elements must first be legalized to i32, resulting in any_extend
nodes.
This was originally implemented as a DAG combine (r255895), but was reverted
due to failing out-of-tree tests.
llvm-svn: 256176
Disable post-ra scheduler for perturbed tests to appease the bots and to
preserve the history of the tests.
http://reviews.llvm.org/D15652
llvm-svn: 256158
This change promotes load instructions which directly read from stores by
replacing them with mov instructions. If the store is wider than the load,
the load will be replaced with a bitfield extract.
For example :
STRWui %W1, %X0, 1
%W0 = LDRHHui %X0, 3
becomes
STRWui %W1, %X0, 1
%W0 = UBFMWri %W1, 16, 31
llvm-svn: 256004
This patch adds a DAG combine for (any_extend (extract_vector_elt v, i)) ->
(extract_vector_elt v, i). The combine enables us to better match some SMOV
patterns.
Differential Revision: http://reviews.llvm.org/D15515
llvm-svn: 255895
The access function has a short entry and a short exit, the initialization
block is only run the first time. To improve the performance, we want to
have a short frame at the entry and exit.
We explicitly handle most of the CSRs via copies. Only the CSRs that are not
handled via copies will be in CSR_SaveList.
Frame lowering and prologue/epilogue insertion will generate a short frame
in the entry and exit according to CSR_SaveList. The majority of the CSRs will
be handled by register allcoator. Register allocator will try to spill and
reload them in the initialization block.
We add CSRsViaCopy, it will be explicitly handled during lowering.
1> we first set FunctionLoweringInfo->SplitCSR if conditions are met (the target
supports it for the given machine function and the function has only return
exits). We also call TLI->initializeSplitCSR to perform initialization.
2> we call TLI->insertCopiesSplitCSR to insert copies from CSRsViaCopy to
virtual registers at beginning of the entry block and copies from virtual
registers to CSRsViaCopy at beginning of the exit blocks.
3> we also need to make sure the explicit copies will not be eliminated.
The target independent portion was committed as r255353.
rdar://problem/23557469
Differential Revision: http://reviews.llvm.org/D15341
llvm-svn: 255821
PR25763 demonstrated an issue with D14683 - vector comparison constant folding only works for i1 results, so we need to split off the sign-extension of the result to the required type. Luckily this can be done with the existing type legalization code.
llvm-svn: 255289
Otherwise, we think that most types that look like they'd fit in a
legal vector type are legal (so, basically, *any* vector type with a
size between 33 and 128 bits, I think, since we use pow2 alignment;
e.g., v2i25, v3f32, ...).
DataLayout::getTypeAllocSize rounds up based on alignment.
When checking for target intrinsic legality, that's not what we want:
if rounding makes a difference, the type isn't legal, and the
target intrinsics shouldn't be used, as they are always assumed legal.
One could make the argument that alloc size is ultimately the most
relevant here, since we're dealing with LD/ST intrinsics. That's only
true if we did legalize them though; that's a problem for another day.
Use DataLayout::getTypeSizeInBits instead of getTypeAllocSizeInBits.
Type::getSizeInBits can't be used because that'd gratuitously break
pointer vector support.
Some of these uses are currently fine, because we only hit them when
the type is already known legal (e.g., r114454). Update them for
consistency. It's faster to avoid the rounding anyway!
llvm-svn: 255089
Summary:
This fixes failure when trying to select
insertelement <4 x half> undef, half %a, i64 0
which gets transformed to a scalar_to_vector node.
The accompanying v4 and v8 tests fail instruction selection without this
patch.
Reviewers: ab, jmolloy
Subscribers: srhines, llvm-commits
Differential Revision: http://reviews.llvm.org/D15322
llvm-svn: 255072
In the case of a conditional branch without a preceding cmp we used to emit
a "and; cmp; b.eq/b.ne" sequence, use tbz/tbnz instead.
Differential Revision: http://reviews.llvm.org/D15122
llvm-svn: 254621
The ARM ARM is clear that 128-bit loads are only guaranteed to have been atomic
if there has been a corresponding successful stxp. It's less clear for AArch32, so
I'm leaving that alone for now.
llvm-svn: 254524
We mustn't introduce a shift of exactly 64-bits for any inputs, since that's an
UNDEF value (and worse, it's not what you want with the natural Arch64
implementation).
The generated code is pretty horrific, but I couldn't come up with an obviously
better alternative (if the amount is constant EXTR could help). Turns out
128-bit shifts are just nasty.
rdar://22491037
llvm-svn: 254475
Summary:
When not useful bits, BitWidth becomes 0 and APInt will not be happy.
See https://llvm.org/bugs/show_bug.cgi?id=25571
We can just mark the operand as IMPLICIT_DEF is none bits of it is used.
Reviewers: t.p.northover, jmolloy
Subscribers: gberry, jmolloy, mgrang, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D14803
llvm-svn: 254440