Summary:
A __shared__ variable may now emit an undef value as its initializer; do not
throw an error in that case.
Test Plan: test/CodeGen/NVPTX/global-addrspace.ll
Patch by Xuetian Weng
Reviewers: jholewinski, tra, jingyue
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D12242
llvm-svn: 245785
Although the basic s_load_* instructions happen to use the same
opcode, some of the special case SMRD instructions have
different opcodes.
llvm-svn: 245775
We can wait on either VM, EXP or LGKM.
The waits are independent.
Without this patch, a wait inserted because of one of them would also wait
for everything outstanding on the other counters. This patch makes s_waitcnt
wait only on the counters that the next instruction actually needs.
Here's an example of subtle perf reduction this patch solves:
This is without the patch:
buffer_load_format_xyzw v[8:11], v0, s[44:47], 0 idxen
buffer_load_format_xyzw v[12:15], v0, s[48:51], 0 idxen
s_load_dwordx4 s[44:47], s[8:9], 0xc
s_waitcnt lgkmcnt(0)
buffer_load_format_xyzw v[16:19], v0, s[52:55], 0 idxen
s_load_dwordx4 s[48:51], s[8:9], 0x10
s_waitcnt vmcnt(1)
buffer_load_format_xyzw v[20:23], v0, s[44:47], 0 idxen
The s_waitcnt vmcnt(1) is useless.
It is added because the last buffer_load_format_xyzw needs s[44:47], which was
written by the first s_load_dwordx4, so we end up waiting for all VM operations
issued before that load to have finished.
Internally, three counters (for VM, EXP and LGKM) are updated after every
instruction. For example, buffer_load_format_xyzw increases the VM counter,
and s_load_dwordx4 the LGKM one.
Without the patch, for every defined register the current three counters are
stored and later used to decide how long to wait when an instruction needs the
register. Because of that, the counters stored for s[44:47] imply that using
the register also requires waiting for the previous buffer_load_format_xyzw.
Instead, this patch stores only the counters that matter for the register and
sets the others to zero, since no wait is needed on them.
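A rough C++ sketch of that bookkeeping; the types and names are invented for illustration and are not the actual SIInsertWaits code:
#include <cstdint>
#include <unordered_map>

struct Counters { uint32_t VM = 0, EXP = 0, LGKM = 0; };

// Counter values after the most recently issued instruction.
static Counters Current;
// Snapshot taken for each register when it is defined.
static std::unordered_map<unsigned, Counters> DefCounters;

static void recordDef(unsigned Reg, bool ViaVM, bool ViaEXP, bool ViaLGKM) {
  Counters C; // start from zero instead of copying all three current counters
  if (ViaVM)   C.VM   = Current.VM;   // remember only the queue this def
  if (ViaEXP)  C.EXP  = Current.EXP;  // actually went through, so a later
  if (ViaLGKM) C.LGKM = Current.LGKM; // use waits only on that queue
  DefCounters[Reg] = C;
}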
Patch by: Axel Davy
Differential Revision: http://reviews.llvm.org/D11883
llvm-svn: 245755
When PPCVSXFMAMutate would look at the input addend register, it would get its
input value number. This would fail, however, if the register was undef,
causing a segfault. Don't segfault (just skip such FMA instructions).
Fixes the test case from PR24542 (although that may have been over-reduced).
llvm-svn: 245741
See discussion in D12154 ( http://reviews.llvm.org/D12154 ), AMD Software
Optimization Guides for 10H/12H/15H/16H, and Agner Fog's experimental data.
llvm-svn: 245733
This is a 'no functional change intended' patch. It removes one FIXME, but adds several more.
Motivation: the FeatureFastUAMem attribute may be too general. It is used to determine if any
sized misaligned memory access under 32-bytes is 'fast'. From the added FIXME comments, however,
you can see that we're not consistent about this. Changing the name of the attribute makes it
clearer to see the logic holes.
Changing this to a 'slow' attribute also means we don't have to add an explicit 'fast' attribute
to new chips; fast unaligned accesses have been standard for several generations of CPUs now.
Differential Revision: http://reviews.llvm.org/D12154
llvm-svn: 245729
Note: I do not implement a base pointer, so it's still impossible to
have dynamic realignment AND dynamic alloca in the same function.
This also moves the code for determining the frame index reference
into getFrameIndexReference, where it belongs, instead of inline in
eliminateFrameIndex.
[Begin long-winded screed]
Now, stack realignment for Sparc is actually a silly thing to support,
because the Sparc ABI has no need for it -- unlike the situation on
x86, the stack is ALWAYS aligned to the required alignment for the CPU
instructions: 8 bytes on sparcv8, and 16 bytes on sparcv9.
However, LLVM unfortunately implements user-specified overalignment
using stack realignment support, so for now, I'm going to go along
with that tradition. GCC instead treats objects which have alignment
specification greater than the maximum CPU-required alignment for the
target as a separate block of stack memory, with their own virtual
base pointer (which gets aligned). Doing it that way avoids needing to
implement per-target support for stack realignment, except for the
targets which *actually* have an ABI-specified stack alignment which
is too small for the CPU's requirements.
Further unfortunately in LLVM, the default canRealignStack for all
targets effectively returns true, even though implementing it is
something a target needs to do specifically. So, the previous behavior
on Sparc was to silently ignore the user's specified stack
alignment. Ugh.
Yet MORE unfortunate, if a target actually does return false from
canRealignStack, that also causes the user-specified alignment to be
*silently ignored*, rather than emitting an error.
(I started looking into fixing that last, but it broke a bunch of
tests, because LLVM actually *depends* on having it silently ignored:
some architectures (e.g. non-linux i386) have smaller stack alignment
than spilled-register alignment. But, the fact that a register needs
spilling is not known until within the register allocator. And by that
point, the decision to not reserve the frame pointer has been frozen
in place. And without a frame pointer, stack realignment is not
possible. So, canRealignStack() returns false, and
needsStackRealignment() then returns false, assuming everyone can just
go on their merry way assuming the alignment requirements were
probably just suggestions after-all. Sigh...)
Differential Revision: http://reviews.llvm.org/D12208
llvm-svn: 245668
When producing conditional compare sequences for or operations we need
to negate the operands and the finally tested flags. The thing is, if we negate
the finally tested flags, this equals a logical negation of all previously
emitted expressions. There was a case missing where we have to order OR
expressions so they get emitted first.
This fixes http://llvm.org/PR24459
llvm-svn: 245641
Creating CMP;CCMP sequences from and/or trees does not gain us anything if
the and/or tree is materialized to a GP register anyway. While most of
the code already checked for hasOneUse(), there was one important case
missing.
llvm-svn: 245640
Fixes PR23464: one way to use the broadcast intrinsics is:
_mm256_broadcastw_epi16(_mm_cvtsi32_si128(*(int*)src));
We don't currently fold this, but now that we use native IR for
the intrinsics (r245605), we can look through one bitcast to find
the broadcast scalar.
Differential Revision: http://reviews.llvm.org/D10557
llvm-svn: 245613
Summary:
Add an LSR test that exercises isTruncateFree. Without this change, LSR creates
another indvar representing the truncated value.
Reviewers: jholewinski, eliben
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D12058
llvm-svn: 245611
Since r245605, the clang headers don't use these anymore.
r245165 updated some of the tests already; update the others, add
an autoupgrade, remove the intrinsics, and clean up the definitions.
Differential Revision: http://reviews.llvm.org/D10555
llvm-svn: 245606
It won't go well. We've already marked 64-bit SETCCs as non-Custom, but it's just possible that a SETCC has a legal result type but an illegal operand type. If this happens, bail out before we create unselectable nodes.
Fixes PR24292. I tried to create a testcase but in 99% of cases we can't trigger this - not surprising that this bug has been latent since 2009.
llvm-svn: 245577
COMISD should receive QWORD because it is defined as
(V)COMISD xmm1, xmm2/m64
COMISS should receive DWORD because it is defined as
(V)COMISS xmm1, xmm2/m32
Differential Revision: http://reviews.llvm.org/D11712
llvm-svn: 245551
XVCMPEQDP is used for VSX v2f64 equality comparisons, but the value type needs
to be v2i64 (as that's the corresponding SETCC type).
Fixes PR24225.
llvm-svn: 245535
This DAGCombine was creating custom SDAG nodes with an illegal ppc_fp128
operand type because it was triggering on f64/f32 int2fp(fp2int(ppc_fp128 x)),
but shouldn't (it should only apply to f32/f64 types). The result was a crash.
llvm-svn: 245530
We are already falling back to SelectionDAG when encountering a shift with UB.
This adds the same checks for shifts with UB that get folded into arithmetic or
logical operations.
This fixes rdar://problem/22345295.
llvm-svn: 245499
We don't do a great job with >= 0 comparisons against zero when the
result is used as an i8.
Given something like:
void f(long long LL, bool *B) {
*B = LL >= 0;
}
We used to generate:
shrq $63, %rdi
xorb $1, %dil
movb %dil, (%rsi)
Now we generate:
testq %rdi, %rdi
setns (%rsi)
Differential Revision: http://reviews.llvm.org/D12136
llvm-svn: 245498
Previously WebAssembly's datalayout string had -v128:8:128. This had been an
attempt to declare a certain level of support for unaligned SIMD accesses.
However, clang makes its own determinations for SIMD alignment that are
independent of the datalayout string, so this wasn't actually meaningful.
llvm-svn: 245494
This revision has introduced an issue that only affects the bootstrapped compiler
when it is printing the ASM. I am working on resolving the issue, but in the
meantime, I'm disabling the legalization of scalar_to_vector operation for v2i64
and the associated testing until I can get this fixed.
llvm-svn: 245481
Reintroduce r245442. Remove an overly conservative assertion introduced
in r245442. We could replace the assertion with one that uses
`shareSameRegisterFile` instead, but at that point in `insertPHI` we have
already lost the original Def subreg to check against, so drop the assertion
completely.
Original commit message:
- Teaches the ValueTracker in the PeepholeOptimizer to look through PHI
instructions.
- Add a findNextSourceAndRewritePHI method to look up the multiple sources
returned by the ValueTracker and rewrite PHIs with new sources.
With these changes we can find more register sources and rewrite more
copies to allow coalescing of bitcast instructions. Hence, we eliminate
unnecessary VR64 <-> GR64 copies in x86, but it could be extended to
other archs by marking "isBitcast" on target specific instructions. The
x86 example follows:
A:
psllq %mm1, %mm0
movd %mm0, %r9
jmp C
B:
por %mm1, %mm0
movd %mm0, %r9
jmp C
C:
movd %r9, %mm0
pshufw $238, %mm0, %mm0
Becomes:
A:
psllq %mm1, %mm0
jmp C
B:
por %mm1, %mm0
jmp C
C:
pshufw $238, %mm0, %mm0
Differential Revision: http://reviews.llvm.org/D11197
rdar://problem/20404526
llvm-svn: 245479
Since r244955, we try to use the short-form ErrorInfo when both
tries failed, and the long-form match failed on a suffix operand.
However, this means we sometimes mix ErrorInfo and MatchResult
(one manifestation of this being PR24498). Instead, restore both.
llvm-svn: 245469
This patch updates the X86 lowering so that the Exception Pointer and Selector
are 64-bit wide only if Subtarget.isTarget64BitLP64.
Patch by João Porto
Reviewers: dschuff, rnk
Differential Revision: http://reviews.llvm.org/D12111
llvm-svn: 245454
Reapply r243486.
- Teaches the ValueTracker in the PeepholeOptimizer to look through PHI
instructions.
- Add a findNextSourceAndRewritePHI method to look up the multiple sources
returned by the ValueTracker and rewrite PHIs with new sources.
With these changes we can find more register sources and rewrite more
copies to allow coalescing of bitcast instructions. Hence, we eliminate
unnecessary VR64 <-> GR64 copies in x86, but it could be extended to
other archs by marking "isBitcast" on target specific instructions. The
x86 example follows:
A:
psllq %mm1, %mm0
movd %mm0, %r9
jmp C
B:
por %mm1, %mm0
movd %mm0, %r9
jmp C
C:
movd %r9, %mm0
pshufw $238, %mm0, %mm0
Becomes:
A:
psllq %mm1, %mm0
jmp C
B:
por %mm1, %mm0
jmp C
C:
pshufw $238, %mm0, %mm0
Differential Revision: http://reviews.llvm.org/D11197
rdar://problem/20404526
llvm-svn: 245442
Summary:
The mid-end was generating vector smin/smax/umin/umax nodes, but
we were using vbsl to generate the code. This adds the vmin/vmax
patterns and a test to check that we are now generating vmin/vmax
instructions.
Reviewers: rengolin, jmolloy
Subscribers: aemerson, rengolin, llvm-commits
Differential Revision: http://reviews.llvm.org/D12105
llvm-svn: 245439
There are some cases where the mul sequence is smaller, but for the most part,
using a div is preferable. This does not apply to vectors, since x86 doesn't
have vector idiv, and a vector mul/shifts sequence ought to be smaller than a
scalarized division.
Differential Revision: http://reviews.llvm.org/D12082
llvm-svn: 245431
This removes the isPow2SDivCheap() query, as it is not currently used in
any meaningful way. isIntDivCheap() no longer relies on a state variable
(as all in-tree targets set it to false), but the interface allows querying
based on the type and the optimization level.
NFC.
Differential Revision: http://reviews.llvm.org/D12082
llvm-svn: 245430
This commit adds support for bit mask target flag serialization to the MIR
printer and the MIR parser. It also adds support for the machine operand's
target flag serialization to the AArch64 target.
Reviewers: Duncan P. N. Exon Smith
llvm-svn: 245383
This consolidates use of isUnalignedMem32Slow() in one place.
There is a slight change in logic although I'm not sure that it would ever
come up in the real world: we were assuming that an alignment of the type
size is always fast; now, we actually check the data layout to confirm that.
llvm-svn: 245382
To properly handle this, define the *a instructions as separate
instruction classes by refactoring the LoadA and StoreA multiclasses.
Move the instruction tests into the sparcv9 file to test the difference.
llvm-svn: 245360
State numbers are calculated by performing a walk from the innermost
funclet to the outermost funclet. Rudimentary support for the new EH
constructs has been added to the assembly printer, just enough to test
the new machinery.
Differential Revision: http://reviews.llvm.org/D12098
llvm-svn: 245331
Summary: This is the correct way to handle JAL instructions when PIC is enabled.
Patch by Toma Tabacu
Reviewers: seanbruno, tomatabacu
Subscribers: brooks, seanbruno, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D6231
llvm-svn: 245305
Summary:
This information is needed to decide whether we do the PIC-only JAL expansions or not. It's also needed for an upcoming patch which implements the .cprestore assembler directive (which can only be used effectively in PIC mode).
By making this information available to the MipsAsmParser, we will know when to insert the instructions mandated by the .cprestore assembler directive and we will be able to give some useful warnings when we encounter a potential misuse of this directive.
Patch by Toma Tabacu
Reviewers: dsanders, seanbruno
Subscribers: brooks, seanbruno, rafael, llvm-commits
Differential Revision: http://reviews.llvm.org/D5626
llvm-svn: 245291
Summary:
Increase the estimated costs for insert/extract element operations on
AArch64. This is motivated by results from benchmarking interleaved
accesses.
Add missing costs for zext/sext/trunc instructions and some integer to
floating point conversions. These costs were previously calculated
by scalarizing these operations and were affected by the cost increase of
the insert/extract element operations.
Reviewers: rengolin
Subscribers: mcrosier, aemerson, rengolin, llvm-commits
Differential Revision: http://reviews.llvm.org/D11939
llvm-svn: 245226
Summary:
This change limits the minimum cost of an insert/extract
element operation to 2 in cases where this would result
in mixing of NEON and VFP code.
Reviewers: rengolin
Subscribers: mssimpso, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D12030
llvm-svn: 245225
Summary: It is the same as LA, except that it can also load 64-bit addresses and it only works on 64-bit MIPS architectures.
Reviewers: tomatabacu, seanbruno, vkalintiris
Subscribers: brooks, seanbruno, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D9524
llvm-svn: 245208
This change makes ScalarEvolution a stand-alone object and just produces
one from a pass as needed. Making this work well requires making the
object movable, using references instead of overwritten pointers in
a number of places, and other refactorings.
I've also wired it up to the new pass manager and added a RUN line to
a test to exercise it under the new pass manager. This includes basic
printing support much like with other analyses.
But there is a big and somewhat scary change here. Prior to this patch
ScalarEvolution was never *actually* invalidated!!! Re-running the pass
just re-wired up the various other analyses and didn't remove any of the
existing entries in the SCEV caches or clear out anything at all. This
might seem OK as everything in SCEV that can do so uses ValueHandles to track
updates to the values that serve as SCEV keys. However, this still means
that as we ran SCEV over each function in the module, we kept
accumulating more and more SCEVs into the cache. At the end, we would
have a SCEV cache with every value that we ever needed a SCEV for in the
entire module!!! Yowzers. The releaseMemory routine would dump all of
this, but that isn't really called during normal runs of the pipeline as
far as I can see.
To make matters worse, there *is* actually a key that we don't update
with value handles -- there is a map keyed off of Loop*s. Because
LoopInfo *does* release its memory from run to run, it is entirely
possible to run SCEV over one function, then over another function, and
then lookup a Loop* from the second function but find an entry inserted
for the first function! Ouch.
To make matters still worse, there are plenty of updates that *don't*
trip a value handle. It seems incredibly unlikely that today GVN or
another pass that invalidates SCEV can update values in *just* such
a way that a subsequent run of SCEV will incorrectly find lookups in
a cache, but it is theoretically possible and would be a nightmare to
debug.
With this refactoring, I've fixed all this by actually destroying and
recreating the ScalarEvolution object from run to run. Technically, this
could increase the amount of malloc traffic we see, but then again it is
also technically correct. ;] I don't actually think we're suffering from
tons of malloc traffic from SCEV because if we were, the fact that we
never clear the memory would seem more likely to have come up as an
actual problem before now. So, I've made the simple fix here. If in fact
there are serious issues with too much allocation and deallocation,
I can work on a clever fix that preserves the allocations (while
clearing the data) between each run, but I'd prefer to do that kind of
optimization with a test case / benchmark that shows why we need such
cleverness (and that can test that we actually make it faster). It's
possible that this will make some things faster by making the SCEV
caches have higher locality (due to being significantly smaller) so
until there is a clear benchmark, I think the simple change is best.
Differential Revision: http://reviews.llvm.org/D12063
llvm-svn: 245193
We can set additional bits in a mask given that we know the other
operand of an AND already has some bits set to zero. This can be more
efficient if doing so allows us to use an instruction which implicitly
sign extends the immediate.
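A scalar C++ illustration of the idea; the values are made up, and the real transform works on the known-zero bits computed during DAG combining:
#include <cstdint>

// If the other AND operand is known to have its upper 16 bits zero, the mask
// 0x0000FFF0 can be widened to 0xFFFFFFF0 (i.e. -16), which encodes as a
// small sign-extended immediate. Both functions return the same value.
uint32_t narrowMask(uint32_t x) { return (x & 0xFFFFu) & 0x0000FFF0u; }
uint32_t wideMask(uint32_t x)   { return (x & 0xFFFFu) & 0xFFFFFFF0u; }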
This fixes PR24085.
Differential Revision: http://reviews.llvm.org/D11289
llvm-svn: 245169
For cases where we TRUNCATE and then ZERO_EXTEND to a larger size (often from vector legalization), see if we can mask the source data and then ZERO_EXTEND (instead of after an ANY_EXTEND). This can help avoid having to generate a larger mask, and possibly applying it to several sub-vectors.
(zext (truncate x)) -> (zext (and (x, m)))
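The scalar identity behind the fold, shown in C++ for an i32 value truncated to i8 (illustrative only):
#include <cstdint>

uint32_t viaTruncZext(uint32_t x) { return static_cast<uint8_t>(x); } // zext(trunc x)
uint32_t viaMask(uint32_t x)      { return x & 0xFFu; }               // zext(and x, 0xFF)
// Both return the same value for every x; preferring the masked form keeps
// the mask at the source width instead of regenerating it at the wider type.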
Includes a minor patch to SystemZ to better recognise 8/16-bit zero extension patterns from RISBG bit-extraction code.
This is the first of a number of minor patches to help improve the conversion of byte masks to clear mask shuffles.
Differential Revision: http://reviews.llvm.org/D11764
llvm-svn: 245160
When trying to fix SGPR live ranges, skip defs that are
killed in the same block as the def. I don't think
we need to worry about these cases as long as the
live ranges of the SGPRs in dominating blocks are
correct.
This reduces the number of elements the second
loop over the function needs to look at, and makes
it generally easier to understand. The second loop
also only considers whether the live range is live
into a block, which logically means it
must have been live out from another.
llvm-svn: 245150
function.
This was the same as getFrameIndexReference, but without the FrameReg
output.
Differential Revision: http://reviews.llvm.org/D12042
llvm-svn: 245148
This is just an initial checkin of an implementation of the Relooper algorithm, in preparation for WebAssembly codegen to utilize. It doesn't do anything yet by itself.
The Relooper algorithm takes an arbitrary control flow graph and generates structured control flow from that, utilizing a helper variable when necessary to handle irreducibility. The WebAssembly backend will be able to use this in order to generate an AST for its binary format.
Author: azakai
Reviewers: jfb, sunfish
Subscribers: jevinskie, arsenm, jroelofs, llvm-commits
Differential revision: http://reviews.llvm.org/D11691
llvm-svn: 245142
True branch instructions do behave as expected with liveness.
Avoid the phrasing "branch decision is based on a value in an SGPR"
because this could be misleading. A VALU compare instruction's
result is still based on an SGPR, even though that condition
may be divergent.
llvm-svn: 245131
Although targeting CoreCLR is similar to targeting MSVC, there are
certain important differences that the backend must be aware of
(e.g. differences in stack probes, EH, and library calls).
Differential Revision: http://reviews.llvm.org/D11012
llvm-svn: 245115
We canonicalize V64 vectors to V128 through insert_subvector: the other
FMLA/FMLS/FMUL/FMULX patterns match that already, but this one doesn't,
so we'd fail to match fmls and generate fneg+fmla instead.
The vector equivalents are already tested and functional.
llvm-svn: 245107
This patch makes the Darwin ARM backend take advantage of TargetParser. It
also teaches TargetParser about ARMV7K for the first time. This makes target
triple parsing more consistent across llvm.
Differential Revision: http://reviews.llvm.org/D11996
llvm-svn: 245081
This patch fixes the x86 implementation of allowsMisalignedMemoryAccesses() to correctly
return the 'Fast' output parameter for 32-byte accesses. To test that, an existing load
merging optimization is changed to use the TLI hook. This exposes a shortcoming in the
current logic and results in the regression test update. Changing other direct users of
the isUnalignedMem32Slow() x86 CPU attribute would be a follow-on patch.
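A sketch of how a caller consults the hook; this is a simplified illustration, not the code from this patch, and the exact parameter list is assumed from the TargetLowering interface of this era:
#include "llvm/Target/TargetLowering.h"
using namespace llvm;

// Returns true when forming a single unaligned 32-byte vector load is
// expected to be profitable on the current subtarget.
static bool canMergeInto32ByteLoad(const TargetLowering &TLI) {
  bool Fast = false;
  return TLI.allowsMisalignedMemoryAccesses(MVT::v8f32, /*AddrSpace=*/0,
                                            /*Align=*/1, &Fast) &&
         Fast;
}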
Without the fix in allowsMisalignedMemoryAccesses(), we will infinite loop when targeting
SandyBridge because LowerINSERT_SUBVECTOR() creates 32-byte loads from two 16-byte loads
while PerformLOADCombine() splits them back into 16-byte loads.
Differential Revision: http://reviews.llvm.org/D10662
llvm-svn: 245075
This reverts commit r245047.
It was failing on the darwin bots. The problem was that when running
./bin/llc -march=msp430
llc gets to
if (TheTriple.getTriple().empty())
TheTriple.setTriple(sys::getDefaultTargetTriple());
Which means that we go with an arch of msp430 but a triple of
x86_64-apple-darwin14.4.0 which fails badly.
That code has to be updated to select a triple based on the value of
march, but that is not a trivial fix.
llvm-svn: 245062
Other than some places that were handling unknown as ELF, this should
have no change. The test updates are because we were detecting
arm-coff or x86_64-win64-coff as ELF targets before.
It is not clear if the enum should live on the Triple. At least now it lives
in a single location and should be easier to move somewhere else.
llvm-svn: 245047
Spotted by Ahmed - in r244594 I inadvertently marked f16 min/max as legal.
I've reverted it here, and marked min/max on scalar f16's as promote. I've also added a testcase. The test just checks that the compiler doesn't fall over - it doesn't create fmin nodes for f16 yet.
llvm-svn: 245035
This introduces the basic functionality to support "token types".
The motivation stems from the need to perform operations on a Value
whose provenance cannot be obscured.
There are several applications for such a type but my immediate
motivation stems from WinEH. Our personality routine enforces a
single-entry - single-exit regime for cleanups. After several rounds of
optimizations, we may be left with a terminator whose "cleanup-entry
block" is not entirely clear because control flow has merged two
cleanups together. We have experimented with using labels as operands
inside of instructions which are not terminators to indicate where we
came from but found that LLVM does not expect such exotic uses of
BasicBlocks.
Instead, we can use this new type to clearly associate the "entry point"
and "exit point" of our cleanup. This is done by having the cleanuppad
yield a Token and consuming it at the cleanupret.
The token type makes it impossible to obscure or otherwise hide the
Value, making it trivial to track the relationship between the two
points.
What is the burden to the optimizer? Well, it turns out we have already
paid down this cost by accepting that there are certain calls that we
are not permitted to duplicate; optimizations have to watch out for
such instructions anyway. There are additional places in the optimizer
that we will probably have to update but early examination has given me
the impression that this will not be heroic.
Differential Revision: http://reviews.llvm.org/D11861
llvm-svn: 245029
We used to just say "invalid type suffix for instruction", which is
misleading. This is because we fall back to the long-form matcher if the
short-form matcher failed, losing the error information on the way.
Save it, so that we can provide a little better diagnostics when the
long-form matcher thinks a suffix is the cause of the error.
llvm-svn: 244955
Follow up to D10947 - D9746 added general SMAX/SMIN/UMAX/UMIN pattern matching to SelectionDAGBuilder::visitSelect.
This patch removes the X86 implementation and improves the AVX1/AVX2 support to correctly lower 256-bit integer vectors.
Differential Revision: http://reviews.llvm.org/D12006
llvm-svn: 244949
After r244870 flush() will only compare two null pointers and return,
doing nothing but wasting run time. The call is not required any more
as the stream and its SmallString are always in sync.
Thanks to David Blaikie for reviewing.
llvm-svn: 244928
This patch corresponds to review:
http://reviews.llvm.org/D11471
It improves the code generated for converting a scalar to a vector value. With
direct moves from GPRs to VSRs, we no longer require expensive stack operations
for this. Subsequent patches will handle the reverse case and more general
operations between vectors and their scalar elements.
llvm-svn: 244921
This was my error. We've got f32 marked as legal because they're simulated using a v2f32 instruction, but there's no equivalent for f64.
This will get test coverage imminently when D12015 lands.
llvm-svn: 244916
This overrides the default to more closely resemble the hand-crafted matching logic in ISelLowering. It makes sense, as there is no VFP equivalent of vmin or vmax, to use them when they're available even if in general VFP ops should be preferred.
This should be NFC.
llvm-svn: 244915
Recent mesa/llvmpipe crashes on SystemZ due to a failed assertion when
attempting to compile a routine with a return type of
{ <4 x float>, <4 x float>, <4 x float>, <4 x float> }
on a system without vector instruction support.
This is because after legalizing the vector type, we get a return value
consisting of 16 floats, which cannot all be returned in registers.
Usually, what should happen in this case is that the target's CanLowerReturn
routine rejects the return type, in which case SelectionDAG falls back to
implementing a structure return in memory via implicit reference.
However, the SystemZ target never actually implemented any CanLowerReturn
routine, and thus would accept any struct return type.
This patch fixes the crash by implementing CanLowerReturn. As a side effect,
this also handles fp128 return values, fixing a todo that was noted in
SystemZCallingConv.td.
llvm-svn: 244889
Other than PC-relative loads/stores, the patterns that match the various
load/store addressing modes have the same complexity, so the order that they
are matched is the order that they appear in the .td file.
Rearrange the instruction definitions in ARMInstrThumb.td, and make use of
AddedComplexity for PC-relative loads, so that the instruction matching order
is the order that results in the simplest selection logic. This also makes
register-offset load/store be selected when it should, as previously it was
only selected for too-large immediate offsets.
Differential Revision: http://reviews.llvm.org/D11800
llvm-svn: 244882
We can lower them using our cool tricks if we fpext/fptrunc the second
input, like we do for f32/f64.
Follow-up to r243924, r243926, and r244858.
llvm-svn: 244860
Summary:
D11924 implemented part of the floating-point comparisons, this patch implements the rest:
* Tell ISelLowering that all booleans are either 0 or 1.
* Expand the eq/ne/lt/le/gt/ge floating-point comparisons to the canonical ones (similar to what Mips32r6InstrInfo.td does).
* Add tests for ord/uno.
* Add tests for ueq/one/ult/ule/ugt/uge.
* Fix existing comparison tests to remove the (res & 1) code, which setBooleanContents stops from generating.
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11970
llvm-svn: 244779
This abstracts away the test for "when can we fold across a MachineInstruction"
into the MI interface, and changes call-frame optimization to use the same test
the peephole optimizer uses.
Differential Revision: http://reviews.llvm.org/D11945
llvm-svn: 244729
As discussed in D11886, this patch moves the SSE/AVX vector blend folding to instcombiner from PerformINTRINSIC_WO_CHAINCombine (which allows us to remove this completely).
InstCombiner already had partial support for this, I just had to add support for zero (ConstantAggregateZero) masks and also the case where both selection inputs were the same (allowing us to ignore the mask).
I also moved all the relevant combine tests into InstCombine/blend_x86.ll
Differential Revision: http://reviews.llvm.org/D11934
llvm-svn: 244723
The same value is used multiple times throughout the function. Hoist the condition
into a variable. This should fix a silly static analysis warning where the
conditions flip around. No functional change intended.
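A minimal C++ illustration of the hoisting described; the condition itself is hypothetical:
static void useCondition(bool A, bool B) {
  // Before: "if (A && B)" was re-evaluated at each use, and the analyzer
  // complained when the sub-conditions appeared in different orders.
  const bool Cond = A && B; // hoisted: evaluated once, reused below
  if (Cond) { /* first use */ }
  if (Cond) { /* second use */ }
}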
llvm-svn: 244713
This commit transforms the mips-specific 'MipsCallEntry' subclass of the
'PseudoSourceValue' class into two, target-independent subclasses named
'GlobalValuePseudoSourceValue' and 'ExternalSymbolPseudoSourceValue'.
This change makes it easier to serialize the pseudo source values by removing
target-specific pseudo source values.
Reviewers: Akira Hatanaka
llvm-svn: 244698
This commit removes the global manager variable which is responsible for
storing and allocating pseudo source values and instead it introduces a new
manager class named 'PseudoSourceValueManager'. Machine functions now own an
instance of the pseudo source value manager class.
This commit also modifies the 'get...' methods in the 'MachinePointerInfo'
class to construct pseudo source values using the instance of the pseudo
source value manager object from the machine function.
This commit updates calls to the 'get...' methods from the 'MachinePointerInfo'
class in a lot of different files because those calls now need to pass in a
reference to a machine function to those methods.
This change will make it easier to serialize pseudo source values as it will
enable me to transform the mips specific MipsCallEntry PseudoSourceValue
subclass into two target independent subclasses.
Reviewers: Akira Hatanaka
llvm-svn: 244693
This commit introduces a new enumerator named 'PSVKind' in the
'PseudoSourceValue' class. This enumerator is now used to distinguish between
the various kinds of pseudo source values.
This change is done in preparation for the changes to the pseudo source value
object management and to the PseudoSourceValue's class hierarchy - the next two
PseudoSourceValue commits will get rid of the global variable that manages the
pseudo source values and the mips specific MipsCallEntry subclass.
Reviewers: Akira Hatanaka
llvm-svn: 244687
For NVPTX, try to use 32-bit division instead of 64-bit division when the dividend and divisor
fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying reason
is that many index computations are carried out in 64-bits but never actually exceed the
capacity of a 32-bit word.
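The idea, sketched in C++; this shows the runtime check conceptually rather than the NVPTX lowering itself:
#include <cstdint>

uint64_t divide(uint64_t a, uint64_t b) {
  if (((a | b) >> 32) == 0) {
    // Both operands fit in 32 bits, so the much cheaper 32-bit divide suffices.
    return static_cast<uint32_t>(a) / static_cast<uint32_t>(b);
  }
  return a / b; // otherwise fall back to the full 64-bit division
}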
llvm-svn: 244684
Some of the FP comparisons (ueq, one, ult, ule, ugt, uge) are currently broken; I'll fix them in a follow-up.
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11924
llvm-svn: 244665
Summary: Implementation is the same as in AArch64.
Subscribers: aemerson, jfb, llvm-commits, sunfish
Differential Revision: http://reviews.llvm.org/D11956
llvm-svn: 244655
First step in preventing immediates that occur more than once within a single
basic block from being pulled into their users, in order to prevent
unnecessarily large instruction encodings. Currently enabled only when
optimizing for size.
Patch by: zia.ansari@intel.com
Differential Revision: http://reviews.llvm.org/D11363
llvm-svn: 244601
Lower Intrinsic::aarch64_neon_fmin/fmax to fminnum/fmaxnum and match that instead. Minimal functional change:
- Extra tests added because coverage of scalar fminnm/fmaxnm instructions was nonexistent.
- f16 test updated because now that we actually generate scalar fminnm/fmaxnm, we no longer need to bail out to a libcall!
llvm-svn: 244595
Lower Intrinsic::arm_neon_vmins/vmaxs to fminnan/fmaxnan and match that instead. This is important because SDAG will soon be able to select FMINNAN itself, so we need a unified lowering path for intrinsics and SDAG.
NFCI.
llvm-svn: 244593
Lower the intrinsic to a FMINNUM/FMAXNUM node and select that instead. This is important because soon SDAG will be able to select FMINNUM/FMAXNUM itself, so we need an integrated lowering path between SDAG and intrinsics.
NFCI.
llvm-svn: 244592
REPE, REPZ, REPNZ, REPNE should have mnemonics for Intel syntax as well.
Currently using these instructions causes compilation errors for Intel syntax.
Differential Revision: http://reviews.llvm.org/D11794
llvm-svn: 244584
The "imul reg, imm" alias is not defined for intel syntax.
In intel syntax there is no w/l/q suffix for the imul instruction.
Differential Revision: http://reviews.llvm.org/D11887
llvm-svn: 244582
Summary:
This patch remaps the assembly idiom 'move' to 'or' instead of 'daddu' or
'addu'. The use of addu/daddu instead of or as move was highlighted as a
performance issue during the analysis of a recent 64bit design. Originally
move was encoded as 'or' by binutils but was changed for the r10k cpu family
due to their pipeline which had 2 arithmetic units and a single logical unit,
and so could issue multiple (d)addu based moves at the same time but only 1
logical move.
This patch preserves the disassembly behaviour so that disassembling an old-style
(d)addu move still appears as a move, but assembling a move always gives an or.
Patch by Simon Dardis.
Reviewers: vkalintiris
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11796
llvm-svn: 244579
When optimizing for size, replace "addl $4, %esp" and "addl $8, %esp"
following a call by one or two pops, respectively. We don't try to do it in
general, but only when the stack adjustment immediately follows a call - which
is the most common case.
That allows taking a short-cut when trying to find a free register to pop into,
instead of a full-blown liveness check. If the adjustment immediately follows a
call, then every register the call clobbers but doesn't define should be dead at
that point, and can be used.
Differential Revision: http://reviews.llvm.org/D11749
llvm-svn: 244578
Summary: I somehow forgot to add these when I added the basic floating-point opcodes. Also remove ceil/floor/trunc/nearestint for now, and add them only when properly tested.
Subscribers: llvm-commits, sunfish, jfb
Differential Revision: http://reviews.llvm.org/D11927
llvm-svn: 244562
Summary: convertToHexString doesn't represent them correctly at this point in time. This is a follow-up to sunfish's suggestion in D11914.
Subscribers: llvm-commits, sunfish, jfb
Differential Revision: http://reviews.llvm.org/D11925
llvm-svn: 244551
Summary:
For now output using C99's hexadecimal floating-point representation.
This patch also cleans up how machine operands are printed: instead of special-casing per type of machine instruction, the code now handles operands generically.
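For reference, C99's %a conversion is what produces this representation; a tiny standalone example:
#include <cstdio>

int main() {
  std::printf("%a\n", 1.5);    // e.g. 0x1.8p+0
  std::printf("%a\n", -0.125); // e.g. -0x1p-3
  return 0;
}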
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11914
llvm-svn: 244520
NaCl's sandbox doesn't allow PUSHF/POPF out of security concerns (privileged emulators have forgotten to mask system bits in the past, and EFLAGS's DF bit is a constant source of hilarity). Commit r220529 fixed PR20376 by saving cmpxchg's flags result using EFLAGS; this commit now generates LAHF/SAHF instead, for all of x86 (not just NaCl), because it leads to an overall performance gain over PUSHF/POPF.
As with the previous patch, this code generation is pretty bad because it occurs very late, after register allocation, and in many cases it rematerializes flags which were already available (e.g. already in a register through SETE). Fortunately it's somewhat rare that this code needs to fire.
I did [[ https://github.com/jfbastien/benchmark-x86-flags | a bit of benchmarking ]], the results on an Intel Haswell E5-2690 CPU at 2.9GHz are:
| Time per call (ms) | Runtime (ms) | Benchmark |
| 0.000012514 | 6257 | sete.i386 |
| 0.000012810 | 6405 | sete.i386-fast |
| 0.000010456 | 5228 | sete.x86-64 |
| 0.000010496 | 5248 | sete.x86-64-fast |
| 0.000012906 | 6453 | lahf-sahf.i386 |
| 0.000013236 | 6618 | lahf-sahf.i386-fast |
| 0.000010580 | 5290 | lahf-sahf.x86-64 |
| 0.000010304 | 5152 | lahf-sahf.x86-64-fast |
| 0.000028056 | 14028 | pushf-popf.i386 |
| 0.000027160 | 13580 | pushf-popf.i386-fast |
| 0.000023810 | 11905 | pushf-popf.x86-64 |
| 0.000026468 | 13234 | pushf-popf.x86-64-fast |
Clearly `PUSHF`/`POPF` are suboptimal. It doesn't really seem to be worth teaching LLVM about individual flags, at least not for this purpose.
Reviewers: rnk, jvoung, t.p.northover
Subscribers: llvm-commits
Differential revision: http://reviews.llvm.org/D6629
llvm-svn: 244503
As discussed in D11760, this patch moves the (V)PSRA(WD) arithmetic shift-by-constant folding to InstCombine to match the logical shift implementations.
Differential Revision: http://reviews.llvm.org/D11886
llvm-svn: 244495
The LDD/STD instructions can load/store a 64bit quantity from/to
memory to/from a consecutive even/odd pair of (32-bit) registers. They
are part of SparcV8, and also present in SparcV9. (Although deprecated
there, as you can store 64bits in one register).
As recommended on llvmdev in the thread "How to enable use of 64bit
load/store for 32bit architecture" from Apr 2015, I've modeled the
64-bit load/store operations as working on a v2i32 type, rather than
making i64 a legal type, but with few legal operations. The latter
does not (currently) work, as there is much code in llvm which assumes
that if i64 is legal, operations like "add" will actually work on it.
The same assumption does not hold for v2i32 -- for vector types, it is
workable to support only load/store, and expand everything else.
This patch:
- Adds a new register class, IntPair, for even/odd pairs of registers.
- Modifies the list of reserved registers, the stack spilling code,
and register copying code to support the IntPair register class.
- Adds support in AsmParser. (note that in asm text, you write the
name of the first register of the pair only. So the parser has to
morph the single register into the equivalent paired register).
- Adds the new instructions themselves (LDD/STD/LDDA/STDA).
- Hooks up the instructions and registers as a vector type v2i32. Adds
custom legalizer to transform i64 load/stores into v2i32 load/stores
and bitcasts, so that the new instructions can actually be
generated, and marks all operations other than load/store on v2i32
as needing to be expanded.
- Copies the unfortunate SelectInlineAsm hack from ARMISelDAGToDAG.
This hack undoes the transformation of i64 operands into two
arbitrarily-allocated separate i32 registers in
SelectionDAGBuilder, and instead passes them in a single
IntPair. (Arbitrarily allocated registers are not useful; asm code
expects to be receiving a pair, which can be passed to ldd/std.)
Also adds a bunch of test cases covering all the bugs I've added along
the way.
Differential Revision: http://reviews.llvm.org/D8713
llvm-svn: 244484
The SP was always unconditionally assigned to later, but initialised early.
This delays the initialisation, and avoids the dead store. Identified by
clang static analysis. No functional change intended.
llvm-svn: 244423
The pass adds new kernel arguments for image attributes, and
resolves calls to dummy attribute and resource id getter functions.
Patch by: Zoltan Gilian
llvm-svn: 244372
At this point the given Opc must be valid, otherwise we should
not look for a matching pair to form paired load or store.
Thanks to Chad for pointing out this piece of code!
llvm-svn: 244366
Summary:
With InstAlias, we don't need to print the _e32 portion of the mnemonic
when we print the $dst operand. This change makes it possible to
include vcc in the asm string when we switch VOPC over to having
implicit vcc defs.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11813
llvm-svn: 244362
Summary: We were using the SI encoding for VI.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11812
llvm-svn: 244332
Summary:
Port the ReconstructShuffle function from AArch64 to ARM
to handle mismatched incoming types in the BUILD_VECTOR
node.
This fixes an outstanding FIXME in the ReconstructShuffle
code.
Reviewers: t.p.northover, rengolin
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D11720
llvm-svn: 244314
Summary: WebAssembly's tablegen instructions have the names WebAssembly expects, but by LLVM convention they're uppercase and suffixed with their type after an underscore. Leave the C++ code that way, but print out the names WebAssembly expects (lowercase, no type). We could teach tablegen to do this later, maybe by using `!cast<string>(node)` in the .td files.
Reviewers: sunfish
Subscribers: jfb, llvm-commits
Differential Revision: http://reviews.llvm.org/D11776
llvm-svn: 244305
When we are not emitting the condition for the branch, because the condition is
in another BB or SDAG did the selection for us, then we have to mask the flag in
the register with AND.
This is required when the condition comes from a truncate, because SDAG only
truncates down to a legal size of i32.
This fixes rdar://problem/22161062.
llvm-svn: 244291
This reverts commits r243198 and r243304.
Turns out this wasn't the correct fix for this problem. It works only within
FastISel, but fails when the truncate is selected by SDAG.
llvm-svn: 244287
After r244074, we now have a successors() method to iterate over
all the successors of a TerminatorInst. This commit changes a bunch
of eligible loops to use it.
llvm-svn: 244260
Summary: This allows us to consolidate several of the TableGen patterns.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11602
llvm-svn: 244253
This change improves EmitLoweredSelect() so that multiple contiguous CMOV pseudo
instructions with the same (or exactly opposite) conditions get lowered using a single
new basic-block. This eliminates unnecessary extra basic-blocks (and CFG merge points)
when contiguous CMOVs are being lowered.
Patch by: kevin.b.smith@intel.com
Differential Revision: http://reviews.llvm.org/D11428
llvm-svn: 244202
This commit implements the initial serialization of the machine operand target
flags. It extends the 'TargetInstrInfo' class to add two new methods that help
to provide text based serialization for the target flags.
This commit can serialize only the X86 target flags, and the target flags for
the other targets will be serialized in the follow-up commits.
Reviewers: Duncan P. N. Exon Smith
llvm-svn: 244185
Summary: The casts from String to PatFrag weren't needed if we instead provided an SDNode. This fix was suggested by @pete in D11382.
Subscribers: pete, llvm-commits
Differential Revision: http://reviews.llvm.org/D11788
llvm-svn: 244167
More specifically, make NVPTXISelDAGToDAG able to emit cached loads (LDG) for pointer induction variables.
Also fix latent bug where LDG was not restricted to kernel functions. I believe that this could not be triggered so far since we do not currently infer that a pointer is global outside a kernel function, and only loads of global pointers are considered for cached loads.
llvm-svn: 244166
Summary: PR24191 finds that the expected memory-register operations aren't generated when relaxed { load ; modify ; store } is used. This is similar to PR17281 which was addressed in D4796, but only for memory-immediate operations (and for memory orderings up to acquire and release). This patch also handles some floating-point operations.
Reviewers: reames, kcc, dvyukov, nadav, morisset, chandlerc, t.p.northover, pete
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11382
llvm-svn: 244128
rather than 'unsigned' for their costs.
For something like costs in particular there is a natural "negative"
value, that of savings or saved cost. As a consequence, there is a lot
of code that subtracts or creates negative values based on cost, all of
which is prone to awkwardness or bugs when dealing with an unsigned
type. Similarly, we *never* want these values to wrap, as that would
cause Very Bad code generation (likely perceived as an infinite loop as
we try to emit over 2^32 instructions or some such insanity).
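A small C++ illustration of the wrapping hazard:
static void costExample() {
  unsigned UCost = 2;
  UCost -= 5;        // wraps to 4294967293: an absurdly "large" cost
  int SCost = 2 - 5; // -3: a sensible negative cost, i.e. a saving
  (void)UCost;
  (void)SCost;
}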
All around 'int' seems a much better fit for these basic metrics. I've
added asserts to ensure that at least the TTI interface never returns
negative numbers here. If we ever have a use case for negative numbers,
we can remove this, but this way a bug where someone used '-1' to
produce a 'very large' cost will be caught by the assert.
This passes all tests, and is also UBSan clean.
No functional change intended.
Differential Revision: http://reviews.llvm.org/D11741
llvm-svn: 244080
To get the successors of a BB we currently do successors(BB) which
ultimately walks the successors of the BB's terminator.
This moves the iterator to TerminatorInst as that's what we're actually
using to do the iteration, and adds a member function to TerminatorInst
to allow us to iterate directly over successors given an instruction.
For example, we can now do
for (auto *Succ : BI->successors())
instead of
for (unsigned i = 0, e = BI->getNumSuccessors(); i != e; ++i)
Reviewed by Tobias Grosser.
llvm-svn: 244074
Summary: Among other things, this allows -print-after-all/-print-before-all to
dump IR around this pass.
IIRC, this pass is off by default, but it's still helpful when debugging.
llvm-svn: 244056
Summary: Among other things, this allows -print-after-all/-print-before-all to
dump IR around this pass.
This is the AArch64 version of r243052.
llvm-svn: 244041
return StringSwitch<int>(Flags)
.Case("g", 0x1)
.Case("nzcvq", 0x2)
.Case("nzcvqg", 0x3)
.Default(-1);
...
// The _g and _nzcvqg versions are only valid if the DSP extension is
// available.
if (!Subtarget->hasThumb2DSP() && (Mask & 0x2))
return -1;
ARMARM confirms that the comment is right, and the code was wrong.
llvm-svn: 244029
Create wrapper methods in the Function class for the OptimizeForSize and MinSize
attributes. We want to hide the logic of "or'ing" them together when optimizing
just for size (-Os).
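A sketch of the wrappers being described, written here as free functions; the member names used in the actual patch are assumed, not quoted:
#include "llvm/IR/Function.h"
using namespace llvm;

// -Oz (MinSize) implies -Os behaviour, so "optimizing for size" checks both.
static bool optForMinSize(const Function &F) {
  return F.hasFnAttribute(Attribute::MinSize);
}
static bool optForSize(const Function &F) {
  return F.hasFnAttribute(Attribute::OptimizeForSize) || optForMinSize(F);
}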
Currently, we are not consistent about this and rely on a front-end to always set
OptimizeForSize (-Os) if MinSize (-Oz) is on. Thus, there are 18 FIXME changes here
that should be added as follow-on patches with regression tests.
This patch is NFC-intended: it just replaces existing direct accesses of the attributes
by the equivalent wrapper call.
Differential Revision: http://reviews.llvm.org/D11734
llvm-svn: 243994
In the commentary for D11660, I wasn't sure if it was alright to create new
integer machine instructions without also creating the implicit EFLAGS operand.
From what I can see, the implicit operand is always created by the MachineInstrBuilder
based on the instruction type, so we don't have to do that explicitly. However, in
reviewing the debug output, I noticed that the operand was not marked as 'dead'.
The machine combiner should do that to preserve future optimization opportunities
that may be checking for that dead EFLAGS operand themselves.
Differential Revision: http://reviews.llvm.org/D11696
llvm-svn: 243990
Summary:
Previously, we would check whether the target is supported or not, only in
fastSelectInstruction(). This means that 64-bit targets could use FastISel too.
We fix this by checking every overridden method of the FastISel class and
by falling back to SelectionDAG if the target isn't supported. This change
should have been committed along with r243638, but somehow I missed it.
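A hedged sketch of the pattern: every overridden entry point tests a flag computed once in the constructor, so unsupported configurations fall back to SelectionDAG (class and member names are invented):
#include "llvm/IR/Instruction.h"
using namespace llvm;

struct MipsFastISelSketch {
  bool TargetSupported = false; // set in the constructor for supported 32-bit configs only

  bool fastSelectInstruction(const Instruction *I) {
    if (!TargetSupported)
      return false; // the caller then falls back to SelectionDAG
    (void)I;
    // ... real instruction selection would go here ...
    return false;
  }
};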
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11755
llvm-svn: 243986
It introduced two regressions on 64-bit big-endian targets running under N32
(MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4, and
MultiSource/Applications/kimwitu++/kc). The issue is that on 64-bit targets
comparisons such as BEQ compare the whole GPR64 but incorrectly tell the
instruction selector that they operate on GPR32's. This leads to the
elimination of i32->i64 extensions that are actually required by
comparisons to work correctly.
There's currently a patch under review that fixes this problem.
llvm-svn: 243984
This adds the software division routines for the Windows RTABI. These are not
expected to be used often though as most modern Windows ARM capable targets
support hardware division. In the case that the target CPU doesn't support
hardware division, this will be the fallback.
llvm-svn: 243952
Some are named "FP", others "SD", others still "FP*SD".
Rename all this to just use "FP", which, except for conversions
(which don't use this format naming scheme), implies "SD" anyway.
llvm-svn: 243936
It's already in SysRegMappings, no need to also have it in MSRMappings:
the latter is only used if we didn't find a match in the former.
llvm-svn: 243933
There's a bunch of code in LowerFCOPYSIGN that does smart lowering, and
is actually already vector-aware; let's use it instead of scalarizing!
The only interesting change is that for v2f32, we previously always
used v4i32 as the integer vector type.
Use v2i32 instead, and mark FCOPYSIGN as Custom.
llvm-svn: 243926
This is necessary for WatchOS support, where the compact unwind format assumes
this kind of layout. For now we only want this on Swift-like CPUs though, where
it's been the Xcode behaviour for ages. Also, since it can expand the prologue
we don't want it at -Oz.
llvm-svn: 243884
Enabling merging of extern globals appears to be generally either beneficial or
harmless. On some benchmarks suites (on Cortex-M4F, Cortex-A9, and Cortex-A57)
it gives improvements in the 1-5% range, but in the rest the overall effect is
zero.
Differential Revision: http://reviews.llvm.org/D10966
llvm-svn: 243874
In http://reviews.llvm.org/rL215382, IT forming was made more conservative under
the belief that a flag-setting instruction was unpredictable inside an IT block on ARMv6M.
But actually, ARMv6M doesn't even support IT blocks so that's impossible. In the ARMARM for
v7M, v7AR and v8AR it states that the semantics of such an instruction changes inside an
IT block - it doesn't set the flags. So actually it is fine to use one inside an IT block
as long as the flags register is dead afterwards.
This gives significant performance improvements in a variety of MPEG based workloads.
Differential revision: http://reviews.llvm.org/D11680
llvm-svn: 243869
Summary: This currently sets the shift amount RHS to the same type as the LHS, and assumes that the LHS is a simple type. This isn't currently the case e.g. with weird integers sizes, but will eventually be true and will assert if not. That's what you get for having an experimental backend: break it and you get to keep both pieces. Most backends either set the RHS to MVT::i32 or MVT::i64, but WebAssembly is a virtual ISA and tries to have regular-looking binary operations where both operands are the same type (even if a 64-bit RHS shifter is slightly silly, hey it's free!).
Subscribers: llvm-commits, sunfish, jfb
Differential Revision: http://reviews.llvm.org/D11715
llvm-svn: 243860
Remove some unnecessary explicit special members in Hexagon that, once
removed, allow the other implicit special members to be used without
depending on deprecated features.
llvm-svn: 243825
Summary: Also test 64-bit integers, except shifts for now which are broken because isel dislikes the 32-bit truncate that precedes them.
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11699
llvm-svn: 243822
Various targets use std::swap on specific MCAsmOperands (ARM and
possibly Hexagon as well). It might be helpful to mark those subclasses
as final, to ensure that the availability of move/copy operations can't
lead to slicing. (same sort of requirements as the non-virtual dtor -
protected or a final class)
llvm-svn: 243820
This commit fixes a bug in the class 'SIInstrInfo' where the implicit register
machine operands were added to a machine instruction in an incorrect order -
the implicit uses were added before the implicit defs.
I found this bug while working on moving the implicit register operand
verification code from the MIR parser to the machine verifier.
This commit also makes the method 'addImplicitDefUseOperands' in the machine
instruction class public so that it can be reused in the 'SIInstrInfo' class.
Reviewers: Matt Arsenault
Differential Revision: http://reviews.llvm.org/D11689
llvm-svn: 243799
Summary:
For example, in
struct S {
int *x;
int *y;
};
__global__ void foo(S s) {
int *b = s.y;
// use b
}
"b" is guaranteed to point to global. NVPTX should emit ld.global/st.global for
accessing "b".
Reviewers: jholewinski
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11505
llvm-svn: 243790
Summary:
Use -1 as numoperands for the return SDTypeProfile, denoting that return is variadic. Note that the patterns in InstrControl.td still need to match the inputs, so this isn't an "anything goes" variadic on ret!
The next step will be to handle other local types (not just int32).
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11692
llvm-svn: 243783
Summary:
This prints assembly for int32 integer operations defined in WebAssemblyInstrInteger.td only, with major caveats:
- The operation names are currently incorrect.
- Other integer and floating-point types will be added later.
- The printer isn't factored out to handle recursive AST code yet, since it can't even handle control flow anyways.
- The assembly format isn't full s-expressions yet either, this will be added later.
- This currently disables PrologEpilogCodeInserter as well as MachineCopyPropagation because they don't like virtual registers, which WebAssembly likes quite a bit. This will be fixed by factoring out NVPTX's change (currently a fork of PrologEpilogCodeInserter).
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11671
llvm-svn: 243763
Add i16, i32, i64 imul machine instructions to the list of reassociation
candidates.
A new bit of logic is needed to handle integer instructions: they have an
implicit EFLAGS operand, so we have to make sure it's dead in order to do
any reassociation with integer ops.
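A sketch of the extra legality check; simplified, with the flags register passed in so the snippet doesn't depend on the X86 generated headers:
#include "llvm/CodeGen/MachineInstr.h"
using namespace llvm;

// Reassociation of integer ops is only safe when the instruction's implicit
// EFLAGS def is dead, i.e. nothing reads the flags it produces.
static bool hasDeadFlagsDef(const MachineInstr &MI, unsigned FlagsReg) {
  int Idx = MI.findRegisterDefOperandIdx(FlagsReg);
  return Idx != -1 && MI.getOperand(Idx).isDead();
}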
Differential Revision: http://reviews.llvm.org/D11660
llvm-svn: 243756
Summary:
Favor the extended reg patterns over the shifted reg patterns that match
only the operand shift and not the full sign/zero extend and shift.
Reviewers: jmolloy, t.p.northover
Subscribers: mcrosier, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D11569
llvm-svn: 243753
For a modulo (remainder) operation,
clang -target armv7-none-linux-gnueabi generates "__modsi3"
clang -target armv7-none-eabi generates "__aeabi_idivmod"
clang -target armv7-linux-androideabi generates "__modsi3"
Android's bionic libc doesn't provide __modsi3; instead it provides
"__aeabi_idivmod". This patch fixes the LLVM ARMISelLowering to generate
the correct call whenever there is a modulo operation.
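A minimal example (hypothetical function name) of the affected source pattern; with this patch, -target armv7-linux-androideabi emits a call to __aeabi_idivmod for it instead of __modsi3:

int rem32(int a, int b) {
  return a % b;  // signed remainder lowered to a runtime library call on ARM
}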
Differential Revision: http://reviews.llvm.org/D11661
llvm-svn: 243717
Fixing MinSize attribute handling was discussed in D11363.
This is a prerequisite patch to doing that.
The handling of OptSize when lowering mem* functions was broken
on Darwin because it wants to ignore -Os for these cases, but the
existing logic also made it ignore -Oz (MinSize).
The Linux change demonstrates a widespread problem. The backend
doesn't usually recognize the MinSize attribute by itself; it
assumes that if the MinSize attribute exists, then the OptSize
attribute must also exist.
Fixing this more generally will be a follow-on patch or two.
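A small sketch of the distinction, assuming clang's minsize function attribute (the function name here is hypothetical): a function can carry MinSize without OptSize, so the mem* lowering must check both instead of inferring one from the other:

#include <cstring>

// -Os would add the optsize IR attribute; the attribute below adds MinSize
// (the -Oz semantics) independently of that.
__attribute__((minsize))
void copy_small(char *dst, const char *src) {
  std::memcpy(dst, src, 64);   // lowering must still honor MinSize here, even
                               // on Darwin where -Os is deliberately ignored
                               // for the mem* expansion.
}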
Differential Revision: http://reviews.llvm.org/D11568
llvm-svn: 243693
I'm not sure what reasons the comment here could have
had for not setting these. Without these set, there is
an assertion hit during DWARF emission.
llvm-svn: 243661
Copy implementation of applyFixup from AArch64 with AArch64 bits
ripped out.
Tests will be included with a later commit. Several other
problems must be fixed before binary debug info emission
will work.
llvm-svn: 243660
Summary:
Replace the switch on instruction opcode with a switch on register size.
This way we don't need to update the switch statement when we add new
SMRD variants.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11601
llvm-svn: 243652
Summary:
This function is never called. isReallyTriviallyReMaterializable() is
the function that should be implemented instead.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11620
llvm-svn: 243651
Summary:
This hidden option would disable code generation through FastISel by
default. It was removed from the available options and from the
Fast-ISel tests that required it in order to run the tests.
Reviewers: dsanders
Subscribers: qcolombet, llvm-commits
Differential Revision: http://reviews.llvm.org/D11610
llvm-svn: 243638
Summary:
Previously, we would sign-extend non-boolean negative constants and
zero-extend otherwise. This was problematic for PHI instructions with
negative values that had a type with bitwidth less than that of the
register used for materialization.
More specifically, ComputePHILiveOutRegInfo() assumes the constants
present in a PHI node are zero extended in their container and
afterwards deduces the known bits.
For example, previously we would materialize an i16 -4 with the
following instruction:
addiu $r, $zero, -4
The register would end up with the 32-bit 2's complement representation
of -4. However, ComputePHILiveOutRegInfo() would generate a constant
with the upper 16-bits set to zero. The SelectionDAG builder would use
that information to generate an AssertZero node that would remove any
subsequent trunc & zero_extend nodes.
In theory, we should modify ComputePHILiveOutRegInfo() to consult
target-specific hooks about the way they prefer to materialize the
given constants. However, git-blame reports that this specific code
has not been touched since 2011 and it seems to be working well for every
target so far.
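A worked example of the mismatch (plain C++ with the bit patterns in the comments; not LLVM source):

#include <cstdint>
#include <cstdio>

int main() {
  int16_t c = -4;
  // What addiu $r, $zero, -4 leaves in the 32-bit register (sign-extended):
  uint32_t materialized = static_cast<uint32_t>(static_cast<int32_t>(c)); // 0xFFFFFFFC
  // What ComputePHILiveOutRegInfo() assumed (zero-extended in its container):
  uint32_t assumed = static_cast<uint16_t>(c);                            // 0x0000FFFC
  std::printf("materialized=%08x assumed=%08x\n",
              (unsigned)materialized, (unsigned)assumed);
  return 0;
}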
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11592
llvm-svn: 243636
Bonus change to remove emacs major mode marker from SystemZMachineFunctionInfo.cpp because emacs already knows it's C++ from the extension. Also fix typo "appeary" in AMDGPUMCAsmInfo.h.
llvm-svn: 243585
This patch improves the 32-bit target i64 constant matching to detect the shuffle vector splats that are introduced by i64 vector shift vectorization (D8416).
Differential Revision: http://reviews.llvm.org/D11327
llvm-svn: 243577
It's potentially more efficient on Cyclone, and from the optimization guides &
schedulers looks like it has no effect on Cortex-A53 or A57. In general you'd
expect a MOV to be about the most efficient instruction with its semantics,
even though the official "UXTW" alias is really a UBFX.
llvm-svn: 243576
This patch vectorizes the v2i64/v4i64 ASHR shift operations - the last remaining integer vector shifts that are still being transferred to/from the scalar unit to be completed.
Differential Revision: http://reviews.llvm.org/D11439
llvm-svn: 243569
No functional change because "lsl #12" is actually encoded as 12, but one less
bug if someone ever decides to change that for the giggles.
llvm-svn: 243536
Given certain shuffle-vector masks, LLVM emits splat instructions
which splat the wrong bytes from the source register. The issue is
that the function PPC::isSplatShuffleMask() in PPCISelLowering.cpp
does not ensure that the splat pattern found is requesting bytes that
are aligned on an EltSize boundary. This patch detects this situation
as not a valid splat mask, resulting in a permute being generated
instead of a splat.
Patch and test case by Tyler Kenney, cleaned up a bit by me.
This is a simple bug fix that would be good to incorporate into 3.7.
llvm-svn: 243519
This commit defines subtarget feature strict-align and uses it instead of
cl::opt -aarch64-strict-align to decide whether strict alignment should be
forced.
rdar://problem/21529937
llvm-svn: 243516
This fix was suggested as part of D11345 and is part of fixing PR24141.
With this change, we can avoid walking the uses of a divisor node if the target
doesn't want the combineRepeatedFPDivisors transform in the first place.
Other than that, no functional change is intended.
Differential Revision: http://reviews.llvm.org/D11531
llvm-svn: 243498
This commit defines subtarget feature strict-align and uses it instead of
cl::opt -arm-strict-align to decide whether strict alignment should be
forced. Also, remove the logic that was checking the OS and architecture
as clang is now responsible for setting strict-align based on the command
line options specified and the target architecture and OS.
rdar://problem/21529937
http://reviews.llvm.org/D11470
llvm-svn: 243493
Reapply 243271 with more fixes; although we are not handling multiple
sources with coalescable copies, we were not properly skipping this
case.
- Teaches the ValueTracker in the PeepholeOptimizer to look through PHI
instructions.
- Add findNextSourceAndRewritePHI method to look up into multiple sources
returned by the ValueTracker and rewrite PHIs with new sources.
With these changes we can find more register sources and rewrite more
copies to allow coalescing of bitcast instructions. Hence, we eliminate
unnecessary VR64 <-> GR64 copies in x86, but it could be extended to
other archs by marking "isBitcast" on target specific instructions. The
x86 example follows:
A:
psllq %mm1, %mm0
movd %mm0, %r9
jmp C
B:
por %mm1, %mm0
movd %mm0, %r9
jmp C
C:
movd %r9, %mm0
pshufw $238, %mm0, %mm0
Becomes:
A:
psllq %mm1, %mm0
jmp C
B:
por %mm1, %mm0
jmp C
C:
pshufw $238, %mm0, %mm0
Differential Revision: http://reviews.llvm.org/D11197
rdar://problem/20404526
llvm-svn: 243486
Summary:
Currently, we support only the MIPS O32 ABI calling convention for call
lowering. With this change we avoid using the O32 calling convention for
lowering calls marked as using the fast calling convention.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11515
llvm-svn: 243485
Summary:
Generate correct code for the select instruction by zero-extending
its boolean/condition operand to GPR-width. This is necessary because
the conditional-move instructions operate on the whole register.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11506
llvm-svn: 243469
If the pointer is the store's value operand, this would produce
a broken module. Make sure the use is actually for the pointer operand.
llvm-svn: 243462
Summary: MCAsmInfo is set up with the default AssemblerDialect, which is zero.
Subscribers: llvm-commits, sunfish, jfb
Differential Revision: http://reviews.llvm.org/D11567
llvm-svn: 243452
The 'common' section TLS is not implemented.
Current C/C++ TLS variables are not placed in common section.
DWARF debug info to get the address of TLS variables is not generated yet.
clang and driver changes in http://reviews.llvm.org/D10524
Added -femulated-tls flag to select the emulated TLS model,
which will be used for old targets like Android that do not
support ELF TLS models.
Added TargetLowering::LowerToTLSEmulatedModel as a target-independent
function to convert a SDNode of TLS variable address to a function call
to __emutls_get_address.
Added into lib/Target/*/*ISelLowering.cpp to call LowerToTLSEmulatedModel
for TLSModel::Emulated. Although all targets supporting ELF TLS models are
enhanced, emulated TLS model has been tested only for Android ELF targets.
Modified AsmPrinter.cpp to print the emutls_v.* and emutls_t.* variables for
emulated TLS variables.
Modified DwarfCompileUnit.cpp to skip some DIEs for emulated TLS variables.
TODO: Add proper DIE for emulated TLS variables.
Added new unit tests with emulated TLS.
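As a rough sketch (hypothetical variable name): with -femulated-tls, the access below is rewritten into a call to __emutls_get_address on the control variable __emutls_v.counter (with __emutls_t.counter holding the initial value), instead of using native ELF TLS relocations:

__thread int counter = 1;

int bump() {
  return ++counter;   // address obtained via __emutls_get_address under -femulated-tls
}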
Differential Revision: http://reviews.llvm.org/D10522
llvm-svn: 243438
Summary:
Add patterns for doing floating point round with various rounding modes
followed by conversion to int as a single FCVT* instruction.
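An illustrative case (hypothetical function name): a libm round followed by a conversion to int, which these patterns can now select as a single round-and-convert instruction (for example, floor followed by fptosi maps to FCVTMS):

#include <cmath>

int floor_to_int(float x) {
  return static_cast<int>(std::floor(x));
}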
Reviewers: t.p.northover, jmolloy
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D11424
llvm-svn: 243422
This patch adds the AArch64 lowering of __builtin_thread_pointer. It uses
the already implemented AArch64ISD::THREAD_POINTER used in TLS generation.
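Minimal usage sketch (hypothetical function name); on AArch64 this lowers through AArch64ISD::THREAD_POINTER, typically to a read of the thread-pointer system register:

void *current_thread_pointer() {
  return __builtin_thread_pointer();
}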
llvm-svn: 243412
X86FrameLowering has both a mergeSPUpdates() that accepts a direction and a
mergeSPUpdatesUp(); the two seem to do the same thing, except for a slightly
different interface. Removed the less general function.
NFC.
Differential Revision: http://reviews.llvm.org/D11510
llvm-svn: 243396
VPAND is a lot faster than VPSHUFB and VPBLENDVB - this patch ensures we attempt to lower to a basic bitmask before lowering to the slower byte shuffle/blend instructions.
Split off from D11518.
Differential Revision: http://reviews.llvm.org/D11541
llvm-svn: 243395
This is a follow-up to the FIXME that was added with D7474 ( http://reviews.llvm.org/rL229531 ).
I thought this load folding bug had been made hard-to-hit, but it turns out to be very easy
when targeting 32-bit x86 and causes a miscompile/crash in Wine:
https://bugs.winehq.org/show_bug.cgi?id=38826
https://llvm.org/bugs/show_bug.cgi?id=22371#c25
The quick fix is to simply remove the scalar FP logical instructions from the load folding table
in X86InstrInfo, but that causes us to miss load folds that should be possible when lowering fabs,
fneg, fcopysign. So the majority of this patch is altering those lowerings to use *vector* FP
logical instructions (because that's all x86 gives us anyway). That lets us do the load folding
legally.
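For illustration (hypothetical function name): fabs is lowered to a bitwise AND with a sign-mask constant, and x86 only provides FP logical operations in vector form (e.g. ANDPS), which is why the lowering switches to the vector instructions:

#include <cmath>

float abs_f(float x) {
  return std::fabs(x);   // sign bit cleared via an AND with a mask constant
}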
Differential Revision: http://reviews.llvm.org/D11477
llvm-svn: 243361
Summary: WebAssemblySubtarget.cpp expects a default 'generic' CPU to exist, and this seems to be prevalent with other targets. It makes sense to have something between MVP and bleeding-edge, even though for now it's the same as MVP. This removes a warning that's currently generated.
Subscribers: jfb, llvm-commits, sunfish
Differential Revision: http://reviews.llvm.org/D11546
llvm-svn: 243345
be reserved.
The decision to reserve x18 is going to be made solely by the front-end,
so it isn't necessary to check if the OS is Darwin in the backend.
llvm-svn: 243308
There is an ODR conflict between lib/ExecutionEngine/ExecutionEngineBindings.cpp
and lib/Target/TargetMachineC.cpp. The inline definitions should simply
be marked static (thanks dblaikie for the hint).
llvm-svn: 243298
Author: Dave Airlie <airlied@redhat.com>
In order to implement indirect sampler loads, we don't
want to match on a VGPR load but an SGPR one for constants,
as we cannot feed VGPRs to the sampler, only SGPRs.
This should be applicable for LLVM 3.7 as well.
llvm-svn: 243294
This reverts commit r243146.
Feedback from Craig Topper and David Blaikie was that we don't put const on Type as it has no mutable state.
llvm-svn: 243282
Reapply r242295 with fixes in the implementation.
- Teaches the ValueTracker in the PeepholeOptimizer to look through PHI
instructions.
- Add findNextSourceAndRewritePHI method to look up into multiple sources
returned by the ValueTracker and rewrite PHIs with new sources.
With these changes we can find more register sources and rewrite more
copies to allow coalescing of bitcast instructions. Hence, we eliminate
unnecessary VR64 <-> GR64 copies in x86, but it could be extended to
other archs by marking "isBitcast" on target specific instructions. The
x86 example follows:
A:
psllq %mm1, %mm0
movd %mm0, %r9
jmp C
B:
por %mm1, %mm0
movd %mm0, %r9
jmp C
C:
movd %r9, %mm0
pshufw $238, %mm0, %mm0
Becomes:
A:
psllq %mm1, %mm0
jmp C
B:
por %mm1, %mm0
jmp C
C:
pshufw $238, %mm0, %mm0
Differential Revision: http://reviews.llvm.org/D11197
rdar://problem/20404526
llvm-svn: 243271
Summary:
Fix the cost of interleaved accesses for ARM/AArch64.
We were calling getTypeAllocSize and using it to check
the number of bits, when we should have called
getTypeAllocSizeInBits instead.
This would potentially cause the vectorizer to
generate loads/stores and shuffles which cannot
be matched with an interleaved access instruction.
No performance changes are expected for now since
matching/generating interleaved accesses is still
disabled by default.
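A minimal sketch of the unit mismatch, assuming the standard DataLayout API (the helper function is hypothetical): getTypeAllocSize() returns bytes while getTypeAllocSizeInBits() returns bits, so comparing the former against a bit width is off by a factor of eight:

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Type.h"

bool fitsIn128Bits(const llvm::DataLayout &DL, llvm::Type *Ty) {
  // Wrong: DL.getTypeAllocSize(Ty) is 16 (bytes) for a 128-bit vector,
  // which never compares equal to 128.
  return DL.getTypeAllocSizeInBits(Ty) <= 128;
}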
Reviewers: rengolin
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D11524
llvm-svn: 243270
When truncating to non-legal types (such as i16, i8 and i1) always use an AND
instruction to mask out the upper bits. This was only done when the source type
was an i64, but not when the source type was an i32.
This commit fixes this and adds the missing i32 truncate tests.
This fixes rdar://problem/21990703.
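An illustrative truncation (hypothetical function name): with this fix, an i32 source truncated to i8 gets the same AND mask (e.g. and w0, w0, #0xff) that the i64 source case already received, so stale upper bits cannot leak through:

unsigned char trunc_to_u8(unsigned int x) {
  return static_cast<unsigned char>(x);
}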
llvm-svn: 243198
extension property we're requesting - zero or sign extended.
This fixes cases where we want to return a zero extended 32-bit -1
and not be sign extended for the entire register. Also updated the
already out of date comment with the current behavior.
llvm-svn: 243192
This commit defines subtarget feature "reserve-x18" and uses it to decide
whether register x18 should be reserved.
This change is needed because we cannot use a backend option to set
cl::opt "aarch64-reserve-x18" when doing LTO.
Out-of-tree projects currently using cl::opt option "-aarch64-reserve-x18"
to reserve x18 should make changes to add subtarget feature "reserve-x18"
to the IR.
rdar://problem/21529937
Differential Revision: http://reviews.llvm.org/D11463
llvm-svn: 243186
Instead of the pattern
for (auto I = x.rbegin(), E = x.rend(); I != E; ++I)
we can use make_range to construct the reverse range and iterate using
that instead.
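A minimal sketch of the rewritten form, assuming llvm::make_range from llvm/ADT/iterator_range.h (the container and names are hypothetical):

#include "llvm/ADT/iterator_range.h"
#include <vector>

void visitInReverse(const std::vector<int> &Xs) {
  for (int X : llvm::make_range(Xs.rbegin(), Xs.rend())) {
    (void)X; // ... use X ...
  }
}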
llvm-svn: 243163
We had a few places where we did
for (unsigned i = 0, e = STy->getNumElements(); i != e; ++i) {
but those could instead do
for (auto *EltTy : STy->elements()) {
llvm-svn: 243136
Summary:
Replace getDataLayout() with a createDataLayout() method to make
explicit that it is intended to create a DataLayout only and not
accessing it for other purpose.
This change is the last of a series of commits dedicated to have a
single DataLayout during compilation by using always the one owned
by the module.
Reviewers: echristo
Subscribers: jholewinski, llvm-commits, rafael, yaron.keren
Differential Revision: http://reviews.llvm.org/D11103
(cherry picked from commit 5609fc56bca971e5a7efeaa6ca4676638eaec5ea)
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 243114
Some shufflevectors are currently being incorrectly lowered in the AArch32
backend as the existing checks for detecting the NEON operations from the
shufflevector instruction expects the shuffle mask and the vector operands to be
of the same length.
This is not always the case as the mask may be twice as long as the operand;
here only the lower half of the shufflemask gets checked, so provided the lower
half of the shufflemask looks like a vector transpose (or even is just all -1
for undef) then the intrinsics may get incorrectly lowered into a vector
transpose (VTRN) instruction.
This patch fixes this by accommodating for both cases and adds regression tests.
Differential Revision: http://reviews.llvm.org/D11407
llvm-svn: 243103
is an immediate, in this check the value is negated and stored in an int64_t.
The value can be -2^63, yet the result cannot be stored in an int64_t and this
gives some undefined behaviour causing failures. The negation is only necessary
when the value is within a certain range, and so it should not need to negate
-2^63; this patch introduces this and also a regression test.
Differential Revision: http://reviews.llvm.org/D11408
llvm-svn: 243100
This reverts commit 0f720d984f419c747709462f7476dff962c0bc41.
It breaks clang too badly, I need to prepare a proper patch for clang
first.
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 243089
Summary:
Replace getDataLayout() with a createDataLayout() method to make
explicit that it is intended to create a DataLayout only and not
accessing it for other purpose.
This change is the last of a series of commits dedicated to have a
single DataLayout during compilation by using always the one owned
by the module.
Reviewers: echristo
Subscribers: jholewinski, llvm-commits, rafael, yaron.keren
Differential Revision: http://reviews.llvm.org/D11103
(cherry picked from commit 5609fc56bca971e5a7efeaa6ca4676638eaec5ea)
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 243083
Summary: Among other things, this allows -print-after-all/-print-before-all to dump IR around this pass.
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D11373
llvm-svn: 243052
Adds pushes to the folding tables.
This also required a fix to the TD definition, since the memory forms of
the push instructions did not have the right mayLoad/mayStore flags.
Differential Revision: http://reviews.llvm.org/D11340
llvm-svn: 243010
The DAG Node "SCALAR_TO_VECTOR" may be created if the type of the scalar element is legal.
Added a check for the scalar type before creating this node.
Added a test that fails with an assertion on the current version.
Differential Revision: http://reviews.llvm.org/D11413
llvm-svn: 242994
This commit broke the build. Numerous build bots broken, and it was
blocking my progress so reverting.
It should be trivial to reproduce -- enable the BPF backend and it
should fail when running llvm-tblgen.
llvm-svn: 242992
Summary:
Add a basic CodeGen bitcode test which (for now) only prints out the function name and nothing else. The current code merely implements the basic needed for the test run to not crash / assert. Getting to that point required:
- Basic InstPrinter.
- Basic AsmPrinter.
- DiagnosticInfoUnsupported (not strictly required, but nice to have, duplicated from AMDGPU/BPF's ISelLowering).
- Some SP and register setup in WebAssemblyTargetLowering.
- Basic LowerFormalArguments.
- GenInstrInfo.
- Placeholder LowerFormalArguments.
- Placeholder CanLowerReturn and LowerReturn.
- Basic DAGToDAGISel::Select, which requires GenDAGISel.inc as well as GET_INSTRINFO_ENUM with GenInstrInfo.inc.
- Remove WebAssemblyFrameLowering::determineCalleeSaves and rely on default.
- Implement WebAssemblyFrameLowering::hasFP, same as AArch64's implementation.
Follow-up patches will implement a real AsmPrinter, which will require adding MI opcodes specific to WebAssembly.
Reviewers: sunfish
Subscribers: aemerson, jfb, llvm-commits
Differential Revision: http://reviews.llvm.org/D11369
llvm-svn: 242939
through APIs that are no longer necessary now that the update API has
been removed.
This will make changes to the AA interfaces significantly less
disruptive (I hope). Either way, it seems like a really nice cleanup.
llvm-svn: 242882
Summary:
MCRegAliasIterator only works for physical registers. So, do not run it
on virtual registers.
With this issue fixed, we can resurrect the BranchFolding pass in NVPTX
backend.
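A minimal sketch of the guard, assuming the usual register-info APIs of this era (the helper function is hypothetical): only walk MC alias information for physical registers:

#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Target/TargetRegisterInfo.h"

bool aliasesReg(unsigned Reg, unsigned Other, const llvm::TargetRegisterInfo *TRI) {
  if (!llvm::TargetRegisterInfo::isPhysicalRegister(Reg))
    return false; // virtual registers carry no MC alias information
  for (llvm::MCRegAliasIterator AI(Reg, TRI, /*IncludeSelf=*/true); AI.isValid(); ++AI)
    if (*AI == Other)
      return true;
  return false;
}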
Reviewers: jholewinski, bkramer
Subscribers: henryhu, meheff, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11174
llvm-svn: 242871
This makes one substantive change and a few stylistic changes to the
VSX swap optimization pass.
The substantive change is to permit LXSDX and LXSSPX instructions to
participate in swap optimization computations. The previous change to
insert a swap following a SUBREG_TO_REG widening operation makes this
almost trivial.
I experimented with also permitting STXSDX and STXSSPX instructions.
This can be done using similar techniques: we could insert a swap
prior to a narrowing COPY operation, and then permit these stores to
participate. I prototyped this, but discovered that the pattern of a
narrowing COPY followed by an STXSDX does not occur in any of our
test-suite code. So instead, I added commentary indicating that this
could be done.
Other TLC:
- I changed SH_COPYSCALAR to SH_COPYWIDEN to more clearly indicate
the direction of the copy.
- I factored the insertion of swap instructions into a separate
function.
Finally, I added a new test case to check that the scalar-to-vector
loads are working properly with swap optimization.
llvm-svn: 242838
This commit defines subtarget feature "reserve-r9" and uses it to decide
whether register r9 should be reserved.
This recommits r242737, which broke bots because the number of subtarget
features went over the limit of 64.
This change is needed because we cannot use a backend option to set
cl::opt "arm-reserve-r9" when doing LTO.
Out-of-tree projects currently using cl::opt option "-arm-reserve-r9" to
reserve r9 should make changes to add subtarget feature "reserve-r9" to
the IR.
rdar://problem/21529937
Differential Revision: http://reviews.llvm.org/D11320
llvm-svn: 242756
Re-apply of r241928 which had to be reverted because of the r241926
revert.
This commit factors out common code from MergeBaseUpdateLoadStore() and
MergeBaseUpdateLSMultiple() and introduces a new function
MergeBaseUpdateLSDouble() which merges adds/subs preceding/following a
strd/ldrd instruction into an strd/ldrd instruction with writeback where
possible.
Differential Revision: http://reviews.llvm.org/D10676
llvm-svn: 242743
Re-apply r241926 with an additional check that r13 and r15 are not used
for LDRD/STRD. See http://llvm.org/PR24190. This also already includes
the fix from r241951.
Differential Revision: http://reviews.llvm.org/D10623
llvm-svn: 242742
This commit defines subtarget feature "reserve-r9" and uses it to decide
whether register r9 should be reserved.
This change is needed because we cannot use a backend option to set
cl::opt "arm-reserve-r9" when doing LTO.
Out-of-tree projects currently using cl::opt option "-arm-reserve-r9" to
reserve r9 should make changes to add subtarget feature "reserve-r9" to
the IR.
rdar://problem/21529937
Differential Revision: http://reviews.llvm.org/D11320
llvm-svn: 242737
Even though this is just some hinting for the scheduler it doesn't make
sense to do that unless you know the target can perform the fusion.
llvm-svn: 242732
This patch does the following:
* Fix FIXME on `needsStackRealignment`: it is now shared between multiple targets, implemented in `TargetRegisterInfo`, and isn't `virtual` anymore. This will break out-of-tree targets, silently if they used `virtual` and with a build error if they used `override`.
* Factor out `canRealignStack` as a `virtual` function on `TargetRegisterInfo`; by default it only looks for the `no-realign-stack` function attribute (a minimal sketch follows after this description).
Multiple targets duplicated the same `needsStackRealignment` code:
- Aarch64.
- ARM.
- Mips almost: had extra `DEBUG` diagnostic, which the default implementation now has.
- PowerPC.
- WebAssembly.
- x86 almost: has an extra `-force-align-stack` option, which the default implementation now has.
The default implementation of `needsStackRealignment` used to just return `false`. My current patch changes the behavior by simply using the above shared behavior. This affects:
- AMDGPU
- BPF
- CppBackend
- MSP430
- NVPTX
- Sparc
- SystemZ
- XCore
- Out-of-tree targets
This is a breaking change! `make check` passes.
The only implementation of the `virtual` function (besides the slight difference in x86) was Hexagon (which did `MF.getFrameInfo()->getMaxAlignment() > 8`), and potentially some out-of-tree targets. Hexagon now uses the default implementation.
`needsStackRealignment` was being overwritten in `<Target>GenRegisterInfo.inc`, to return `false` as the default also did. That was odd and is now gone.
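A minimal sketch (not the actual implementation) of the shared default behaviour: realignment is permitted unless the function opts out through the `no-realign-stack` attribute (the `-force-align-stack` handling is omitted here):

#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/IR/Function.h"

bool canRealignStackDefault(const llvm::MachineFunction &MF) {
  // Shared default: honor only the function attribute.
  return !MF.getFunction()->hasFnAttribute("no-realign-stack");
}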
Reviewers: sunfish
Subscribers: aemerson, llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11160
llvm-svn: 242727
This is the first step toward supporting shrink-wrapping for this target.
The changes could be summarized by these items:
- Expand the tail-call return as part of the expand pseudo pass.
- Get rid of the assumptions that the epilogue is the exit block:
* Do not assume which registers are free in the epilogue. (This indirectly
improves the lowering of the code for the segmented stacks, see the test
cases.)
* Take into account that the basic block can be empty.
Related to <rdar://problem/20821730>
llvm-svn: 242714
Summary:
[NVPTX] Make loads from read-only global memory use ldg
As described in [1], ld.global.nc may be used by nvcc to load memory when
__restrict__ is used and the compiler can detect that the read-only data cache
is safe to use.
This patch will try to check whether ldg is safe to use and use it to
replace ld.global when possible. This change can improve the performance
by 18~29% on affected kernels (ratt*_kernel and rwdot*_kernel) in
S3D benchmark of shoc [2].
Patched by Xuetian Weng.
[1] http://docs.nvidia.com/cuda/kepler-tuning-guide/#read-only-data-cache
[2] https://github.com/vetter/shoc
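An illustrative kernel (hypothetical names) of the pattern described in [1]: with const and __restrict__ the compiler can prove the loads are read-only for the lifetime of the kernel and emit ld.global.nc for them:

__global__ void scale(const float *__restrict__ in, float *__restrict__ out, float k) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = k * in[i];   // the load of in[i] is eligible for ld.global.nc
}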
Test Plan: test/CodeGen/NVPTX/load-with-non-coherent-cache.ll
Reviewers: jholewinski, jingyue
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D11314
llvm-svn: 242713
Summary:
This change generalizes the implicit null checks pass to work with
instructions that don't have any explicit register defs. This lets us
use X86's `cmp` against memory as faulting load instructions.
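A rough illustration (hypothetical types and names) of the kind of guarded access this generalization targets: the compare against memory can itself be the faulting instruction that stands in for the explicit null check:

struct Node { int Tag; };

int hasTag(const Node *N, int Tag) {
  if (N == nullptr)          // explicit check the pass tries to make implicit
    return 0;
  return N->Tag == Tag;      // can be selected as a cmp against memory, which
                             // faults (and is recovered from) if N is null
}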
Reviewers: reames, JosephTremoulet
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11286
llvm-svn: 242703
Summary:
The MUBUF addr64 bit has been removed on VI, so we must use FLAT
instructions when the pointer is stored in VGPRs.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11067
llvm-svn: 242673
Reordered the data tables at the top and placed the lookups after. The first stage in the yak shaving necessary to get more accurate costs for a variety of targets given the recent improvements to SINT_TO_FP/UINT_TO_FP/SIGN_EXTEND vector lowering.
llvm-svn: 242643
canFoldMemoryOperand is not actually used anywhere in the codebase - all existing users instead call foldMemoryOperand directly when they wish to fold and can correctly deduce what they need from the return value.
This patch removes the canFoldMemoryOperand base function and the target implementations; only x86 had a real (bit-rotted) implementation, although AMDGPU had a preparatory stub that had never needed to be completed.
Differential Revision: http://reviews.llvm.org/D11331
llvm-svn: 242638
SKX supports conversion for all FP types. Integer types include doublewords and quadwords.
I added "Legal" status for these nodes and a bunch of tests.
I added "NoVLX" for AVX DAG selection to force VLX instructions selection when VLX is supported.
Differential Revision: http://reviews.llvm.org/D11255
llvm-svn: 242637
The standard containers are not designed to be inherited from, as
illustrated by the MSVC hacks for NodeOrdering. No functional change
intended.
llvm-svn: 242616
Reapply r242500 now that the swift schedmodel includes LDRLIT.
This is mostly done to disable the PostRAScheduler which optimizes for
instruction latencies which isn't a good fit for out-of-order
architectures. This also allows to leave out the itinerary table in
swift in favor of the SchedModel ones.
This change leads to performance improvements/regressions by as much as
10% in some benchmarks; in fact we lose 0.4% performance over the
llvm-testsuite for reasons that appear to be unknown or out of the
compiler's control. rdar://20803802 documents the investigation of
these effects.
While it is probably a good idea to perform the same switch for the
other ARM out-of-order CPUs, I limited this change to swift as I cannot
perform the benchmark verification on the other CPUs.
Differential Revision: http://reviews.llvm.org/D10513
llvm-svn: 242588
These pseudo instructions are only lowered after register allocation and
are therefore still present when the machine scheduler runs.
Add a run: line to a testcase that uses the uncommon flags necessary to
actually produce a LDRLIT instruction on swift.
llvm-svn: 242587
This is mostly done to disable the PostRAScheduler which optimizes for
instruction latencies which isn't a good fit for out-of-order
architectures. This also allows to leave out the itinerary table in
swift in favor of the SchedModel ones.
This change leads to performance improvements/regressions by as much as
10% in some benchmarks; in fact we lose 0.4% performance over the
llvm-testsuite for reasons that appear to be unknown or out of the
compiler's control. rdar://20803802 documents the investigation of
these effects.
While it is probably a good idea to perform the same switch for the
other ARM out-of-order CPUs, I limited this change to swift as I cannot
perform the benchmark verification on the other CPUs.
Differential Revision: http://reviews.llvm.org/D10513
llvm-svn: 242500
Constructing a name based on the function name didn't give us a unique
symbol if we had more than one setjmp in a function. Using
MCContext::createTempSymbol() always gives us a unique name.
Differential Revision: http://reviews.llvm.org/D9314
llvm-svn: 242482
llvm.eh.sjlj.setjmp was used as part of the SjLj exception handling
style but is also used in clang to implement __builtin_setjmp. The ARM
backend needs to output additional dispatch tables for the SjLj
exception handling style, these tables however can't be emitted if
llvm.eh.sjlj.setjmp is simply used for __builtin_setjmp and no actual
landing pad blocks exist.
To solve this issue a new llvm.eh.sjlj.setup_dispatch intrinsic is
introduced which is used instead of llvm.eh.sjlj.setjmp in the SjLj
exception handling lowering, so we can differentiate between the case
where we actually need to setup a dispatch table and the case where we
just need the __builtin_setjmp semantic.
Differential Revision: http://reviews.llvm.org/D9313
llvm-svn: 242481
C11 leaves the choice on whether round-to-integer operations set the inexact
flag implementation-defined. Darwin does expect it to be set, but this seems to
be against the intent of the IEEE document and slower to implement anyway. So
it should be opt-in.
llvm-svn: 242446
I was looking at some vector code generation and kept seeing
unnecessary vector copies into the Altivec half of the VSX registers.
I discovered that we overlooked v4i32 when adding the register classes
for VSX; we only added v4f32 and v2f64. This means that anything that
canonicalizes into v4i32 (which is a LOT of stuff) ends up being
forced into VRRC on its way to VSRC.
The fix is one line. The rest of the patch is fixing up some test
cases whose code generation has changed as a result.
This seems like it would be a good candidate for backport to 3.7.
llvm-svn: 242442
Summary:
SpeculativeExecution enables a series of straight-line optimizations (such
as SLSR and NaryReassociate) on conditional code. For example,
if (...)
... b * s ...
if (...)
... (b + 1) * s ...
speculative execution can hoist b * s and (b + 1) * s from then-blocks,
so that we have
... b * s ...
if (...)
...
... (b + 1) * s ...
if (...)
...
Then, SLSR can rewrite (b + 1) * s to (b * s + s) because after
speculative execution b * s dominates (b + 1) * s.
The performance impact of this change is significant. It speeds up the
benchmarks running EigenFloatContractionKernelInternal16x16
(ba68f42fa6/unsupported/Eigen/CXX11/src/Tensor/TensorContractionCuda.h?at=default#cl-526)
by roughly 2%. Some internal benchmarks that have the above code pattern
are improved by up to 40%. No significant slowdowns are observed on
Eigen CUDA microbenchmarks.
Reviewers: jholewinski, broune, eliben
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11201
llvm-svn: 242437
This is a new iteration of the reverted r238793 /
http://reviews.llvm.org/D8232, which wrongly assumed that any and/or
trees can be represented by conditional compare sequences; however, there
are some restrictions to that. This version fixes this and adds comments
that explain exactly what types of and/or trees can actually be
implemented as conditional compare sequences.
Related to http://llvm.org/PR20927, rdar://18326194
Differential Revision: http://reviews.llvm.org/D10579
llvm-svn: 242436
Summary:
We can safely assume that the high bit of scratch offsets will never
be set, because this would require at least 128 GB of GPU memory.
Reviewers: arsenm
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11225
llvm-svn: 242433
This reverts commit r242300.
This is causing buildbot failures which we are investigating.
I'll reapply once we know what's going on, but for now want to
get the bots green.
llvm-svn: 242428
Summary:
This fixes an issue on MIPS where the infinite-loop-evergreen.ll test
was failing to terminate.
Fixes PR24147.
Reviewers: arsenm, dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D11260
llvm-svn: 242410
This allows more call sequences to use pushes instead of movs when optimizing for size.
In particular, calling conventions that pass some parameters in registers (e.g. thiscall) are now supported.
This should no longer cause miscompiles, now that a bug in emitPrologue was fixed in r242395.
llvm-svn: 242398
When X86FrameLowering::emitPrologue() looks for where to insert the %esp subtraction
to allocate stack space for local allocations, it assumes that any sequence of push
instructions that starts at function entry consists purely of spills of callee-save
registers.
This may be false, since from some point forward, the pushes may be pushing
to a subsequent function call.
This caused a miscompile that was exposed by r240257, and is not easily testable
since r240257 was reverted. A test will be committed separately after r240257 is
reapplied.
llvm-svn: 242395
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
This patch is quite boring overall, except for some ugliness in
AsmPrinter, which has a getDataLayout function but has some clients
that use it without a Module (llvm-dsymutil, llvm-dwarfdump), so
some methods are taking a DataLayout as parameter.
Reviewers: echristo
Subscribers: yaron.keren, rafael, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11090
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 242386
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
Reviewers: echristo
Subscribers: yaron.keren, rafael, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11079
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 242385
This commit defines subtarget feature "no-movt" and uses it to decide whether to emit movt/movw
pairs for 32-bit immediates.
This change is needed to avoid emitting movt/movw pairs when doing LTO
and do so on a per-function basis.
Out-of-tree projects currently using cl::opt option -arm-use-movt=0 or
false to avoid emitting movt/movw pairs should make changes to add
subtarget feature "+no-movt" (see the changes made to clang in r242368).
rdar://problem/21529937
Differential Revision: http://reviews.llvm.org/D11026
llvm-svn: 242369
The pass here was clearing kill flags on instructions which had
their sources killed in the instruction being combined. But
given that the new instruction is inserted after the existing ones,
any existing instructions with kill flags will lead to the verifier
complaining that we are reading an undefined physreg.
For example, what we had prior to this optimization is
t2STRi12 %R1, %SP, 12
t2STRi12 %R1<kill>, %SP, 16
t2STRi12 %R0<kill>, %SP, 8
and prior to this fix that would generate
t2STRi12 %R1<kill>, %SP, 16
t2STRDi8 %R0<kill>, %R1, %SP, 8
This is clearly incorrect as it didn't clear the kill flag on R1
used with offset 16 because there was no kill flag on the instruction
with offset 12.
After this change we clear the kill flag on the offset 16 instruction
because we know it will be used afterwards in the new instruction.
I haven't provided a test case. I have a small test, but even it is
very sensitive to register allocation order which isn't ideal.
llvm-svn: 242359
Pass a const reference to LiveRegMatrix to getRegAllocationHints()
because some targets can provide better hints if they can test whether a
physreg has been used for register allocation yet.
llvm-svn: 242340
These were the cause of a verifier error when building 7zip with
-verify-machineinstrs. Running 'make check' with the verifier
triggered the same error on the test here so i've updated the test
to run the verifier on one of its runs instead of adding a new one.
While looking at this code, there was a stale comment that these
instructions were only used for disassembly. This probably used to
be the case, but they are now used in the 'ARM load / store optimization pass' too.
llvm-svn: 242300
The vec_sld interface provides access to the vsldoi instruction.
Unlike most of the vec_* interfaces, we do not attempt to change the
generated code for vec_sld based on the endian mode. It is too
difficult to correctly infer the desired semantics because of
different element types, and the corrected instruction sequence is
expensive, involving loading a permute control vector and performing a
generalized permute.
For GCC, this was implemented as "Don't touch the vec_sld"
implementation. When it came time for the LLVM implementation, I did
the same thing. However, this was hasty and incorrect. In LLVM's
version of altivec.h, vec_sld was previously defined in terms of the
vec_perm interface. Because vec_perm semantics are adjusted for
little endian, this means that leaving vec_sld untouched causes it to
generate something different for LE than for BE. Not good.
This back-end patch accompanies the changes to altivec.h that change
vec_sld's behavior for little endian. Those changes mean that we see
slightly different code in the back end when trying to recognize a
VSLDOI instruction in isVSLDOIShuffleMask. In particular, a
ShuffleKind of 1 (where the two inputs are identical) must now be
treated the same way as a ShuffleKind of 2 (little endian with
different inputs) when little endian mode is in force. This is
because ShuffleKind of 1 is defined using big-endian numbering.
This has a ripple effect on LowerBUILD_VECTOR, where we create our own
internal VSLDOI instructions. Because these are a ShuffleKind of 1,
they will now have their shift amounts subtracted from 16 when
recognizing the shuffle mask. To avoid problems we have to subtract
them from 16 again before creating the VSLDOI instructions.
There are a couple of other uses of BuildVSLDOI, but these do not need
to be modified because the shift amount is 8, which is unchanged when
subtracted from 16.
llvm-svn: 242296
- Teaches the ValueTracker in the PeepholeOptimizer to look through PHI
instructions.
- Add findNextSourceAndRewritePHI method to look up into multiple sources
returned by the ValueTracker and rewrite PHIs with new sources.
With these changes we can find more register sources and rewrite more
copies to allow coalescing of bitcast instructions. Hence, we eliminate
unnecessary VR64 <-> GR64 copies in x86, but it could be extended to
other archs by marking "isBitcast" on target specific instructions. The
x86 example follows:
A:
psllq %mm1, %mm0
movd %mm0, %r9
jmp C
B:
por %mm1, %mm0
movd %mm0, %r9
jmp C
C:
movd %r9, %mm0
pshufw $238, %mm0, %mm0
Becomes:
A:
psllq %mm1, %mm0
jmp C
B:
por %mm1, %mm0
jmp C
C:
pshufw $238, %mm0, %mm0
Differential Revision: http://reviews.llvm.org/D11197
rdar://problem/20404526
llvm-svn: 242295
This is a direct port of the code from the X86 backend (r239486/r240361), which
uses the MachineCombiner to reassociate (floating-point) adds/muls to increase
ILP, to the PowerPC backend. The rationale is the same.
There is a lot of copy-and-paste here between the X86 code and the PowerPC
code, and we should extract at least some of this into CodeGen somewhere.
However, I don't want to do that until this code is enhanced to handle FMAs as
well. After that, we'll be in a better position to extract the common parts.
llvm-svn: 242279
If the source of the copy that defines the addend is a physical register, then
its existing live range may not extend to the FMA being mutated. Make sure we
extend the live range of the register to meet the FMA because it will become
its operand in this case.
I don't have an independent test case, but it will be exposed by change to be
committed shortly enabling the use of the machine combiner to do fadd/fmul
reassociation, and will be covered by one of the associated regression tests.
llvm-svn: 242278
Bitpatterns rejected by the decoder method of `MSR (immediate)` should be
decoded as the `extended MSR (register)` instruction.
Differential Revision: http://reviews.llvm.org/D7174
llvm-svn: 242276
This code was breaking from the case statement if the getStoreSizeInBits()
value was not a multiple of 8. Given that the implementation returns
getStoreSize() * 8, it can only be a multiple of 8.
llvm-svn: 242255
Summary:
processFunctionBeforeCalleeSavedScan was renamed to determineCalleeSaves and now takes a BitVector parameter as of rL242165, reviewed in http://reviews.llvm.org/D10909
WebAssembly is still marked as experimental and therefore doesn't build by default. It does, however, grep by default! I notice that processFunctionBeforeCalleeSavedScan is still mentioned in a few comments and error messages, which I also fixed.
Reviewers: qcolombet, sunfish
Subscribers: jfb, dsanders, hfinkel, MatzeB, llvm-commits
Differential Revision: http://reviews.llvm.org/D11199
llvm-svn: 242242
Follow-up r235483, with the corresponding support in PPC. We use a regular call
for symbolic targets (because they're much cheaper than indirect calls).
llvm-svn: 242239
We used to take the address specified as the direct target of the patchpoint
and did no TOC-pointer handling. This, however, as not all that useful,
because MCJIT tends to create a lot of modules, and they have their own TOC
sections. Thus, to call from the generated code to other generated code, you
really need to switch TOC pointers. Make this work as expected, and under
ELFv1, treat the address as the function descriptor address so that the correct
TOC pointer can be loaded.
llvm-svn: 242217
SelectionDAG already had begin/end methods for iterating over all
the nodes, but didn't define an iterator_range for us in foreach
loops.
This adds such a method and uses it in some of the eligible places
throughout the backends.
llvm-svn: 242212
Summary: This patch has the most basic instruction codegen for 32 and 64 bit int/fp.
Reviewers: sunfish
Subscribers: llvm-commits, jfb
Differential Revision: http://reviews.llvm.org/D11193
llvm-svn: 242201
MOVSDto64rr and MOV64toSDrr are defined to convert between FR64 (%xmm)
<-> GR64 registers, not VR64 (%mm) <-> GR64. This is wrong.
I found this by inspection and could not find a suitable testcase for it
since (1) we don't handle MMX bitcasts in Peephole optimizer as to
generate COPYs that (2) could be expanded back to the appropriate x86
instruction in ExpandPostRA.
Switch to use the appropriate instructions: MMX_MOVD64from64rr and
MMX_MOVD64to64rr here.
llvm-svn: 242191
PowerPC uses itineraries to describe processor pipelines (and dispatch-group
restrictions for P7/P8 cores). Unfortunately, the target-independent
implementation of TII.getInstrLatency calls ItinData->getStageLatency, and that
looks for the largest cycle count in the pipeline for any given instruction.
This, however, yields the wrong answer for the PPC itineraries, because we
don't encode the full pipeline. Because the functional units are fully
pipelined, we only model the initial stages (there are no relevant hazards in
the later stages to model), and so the technique employed by getStageLatency
does not really work. Instead, we should take the maximum output operand
latency, and that's what PPCInstrInfo::getInstrLatency now does.
This caused some test-case churn, including two unfortunate side effects.
First, the new arrangement of copies we get from function parameters now
sometimes blocks VSX FMA mutation (a FIXME has been added to the code and the
test cases), and we have one significant test-suite regression:
SingleSource/Benchmarks/BenchmarkGame/spectral-norm
56.4185% +/- 18.9398%
In this benchmark we have a loop with a vectorized FP divide, and it with the
new scheduling both divides end up in the same dispatch group (which in this
case seems to cause a problem, although why is not exactly clear). The grouping
structure is hard to predict from the bottom of the loop, and there may not be
much we can do to fix this.
Very few other test-suite performance effects were really significant, but
almost all weakly favor this change. However, in light of the issues
highlighted above, I've left the old behavior available via a
command-line flag.
llvm-svn: 242188
This can be done only with moves which theoretically
will optimize better later.
Although this transform increases the instruction count,
it should be code size / cycle count neutral in the worst
VALU case. It also seems to slightly improve a couple
of testcases due to other DAG combines this exposes.
This is probably slightly worse for the SALU case, so
it might be better to handle this during moveToVALU,
although then you lose some simplifications like
the load width reducing in the simple testcase.
llvm-svn: 242177
If the read2 produced was supposed to be writing into a
super register, it would use the wrong subregister indices.
Fix this by inserting copies, so we only ever write to a vreg_64.
Run the register coalescer again to clean this up, although this
isn't ideal and often does result in an extra move.
Also remove the assert that offset1 > offset0.
There isn't a real reason to not allow this other than a minor
convenience in the compiler, and it doesn't seem worth the effort
of avoiding it.
llvm-svn: 242174
We have a detailed def/use lists for every physical register in
MachineRegisterInfo anyway, so there is little use in maintaining an
additional bitset of which ones are used.
Removing it frees us from extra book keeping. This simplifies
VirtRegMap.
Differential Revision: http://reviews.llvm.org/D10911
llvm-svn: 242173
This changes TargetFrameLowering::processFunctionBeforeCalleeSavedScan():
- Rename the function to determineCalleeSaves()
- Pass a bitset of callee saved registers by reference, thus avoiding
the function-global PhysRegUsed bitset in MachineRegisterInfo.
- Without PhysRegUsed the implementation is fine-tuned to not save
physical registers which are only read but never modified.
Related to rdar://21539507
Differential Revision: http://reviews.llvm.org/D10909
llvm-svn: 242165