PR14478 highlights a serious problem in SROA that simply wasn't being
exercised due to a lack of vector input code mixed with C-library
function calls. Part of SROA was written carefully to handle subvector
accesses via memset and memcpy, but the rewriter never grew support for
this. Fixing it required refactoring the subvector access code in other
parts of SROA so it could be shared, and then fixing the splat formation
logic and using subvector insertion (this patch).
The PR isn't quite fixed yet, as memcpy is still broken in the same way.
I'm starting on that series of patches now.
Hopefully this will be enough to bring the bullet benchmark back to life
with the bb-vectorizer enabled, but that may require fixing memcpy as
well.
llvm-svn: 170301
Mips16 is really a processor decoding mode (ala thumb 1) and in the same
program, mips16 and mips32 functions can exist and can call each other.
If a jal type instruction encounters an address with the lower bit set, then
the processor switches to mips16 mode (if it is not already in it). If the
lower bit is not set, then it switches to mips32 mode.
The linker knows which functions are mips16 and which are mips32.
When relocation is performed on code labels, this lower order bit is
set if the code label is a mips16 code label.
In general this works just fine, however when creating exception handling
tables and dwarf, there are cases where you don't want this lower order
bit added in.
This has been traditionally distinguished in gas assembly source by using a
different syntax for the label.
lab1: ; this will cause the lower order bit to be added
lab2=. ; this will not cause the lower order bit to be added
In some cases, it does not matter because in dwarf and debug tables
the difference of two labels is used and in that case the lower order
bits subtract each other out.
To fix this, I have added to mcstreamer the notion of a debuglabel.
The default is for label and debug label to be the same. So calling
EmitLabel and EmitDebugLabel produce the same result.
For various reasons, there is only one set of labels that needs to be
modified for the mips exceptions to work. These are the "$eh_func_beginXXX"
labels.
Mips overrides the debug label suffix from ":" to "=." .
This initial patch fixes exceptions. More changes most likely
will be needed to DwarfCFException to make all of this work
for actual debugging. These changes will be to emit debug labels in some
places where a simple label is emitted now.
Some historical discussion on this from gcc can be found at:
http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00623.htmlhttp://gcc.gnu.org/ml/gcc-patches/2008-11/msg01273.html
llvm-svn: 170279
We match the pattern "x >= y ? x-y : 0" into "subus x, y" and two special cases
if y is a constant. DAGCombiner canonicalizes those so we first have to undo the
canonicalization for those cases. The pattern occurs in gzip when the loop
vectorizer is enabled. Part of PR14613.
llvm-svn: 170273
In this case, essentially it is soft float with different library routines.
The next step will be to make this fully interoperational with mips32 floating
point and that requires creating stubs for functions with signatures that
contain floating point types.
I have a more sophisticated design for mips16 hardfloat which I hope to
implement at a later time that directly does floating point without the need
for function calls.
The mips16 encoding has no floating point instructions so one needs to
switch to mips32 mode to execute floating point instructions.
llvm-svn: 170259
immediate generates the narrow version. Needed when doing round-trip
assemble/disassemble testing using the alternate syntax that specifies
'pc' directly.
llvm-svn: 170255
for TLS dynamic models on 64-bit PowerPC ELF. The default sort routine
for relocations only sorts on the r_offset field; but with TLS, there
can be two relocations with the same r_offset. For PowerPC, this patch
sorts secondarily on descending r_type, which matches the behavior
expected by the linker.
llvm-svn: 170237
for a wider range of GOT entries that can hold thread-relative offsets.
This matches the behavior of GCC, which was not documented in the PPC64 TLS
ABI. The ABI will be updated with the new code sequence.
Former sequence:
ld 9,x@got@tprel(2)
add 9,9,x@tls
New sequence:
addis 9,2,x@got@tprel@ha
ld 9,x@got@tprel@l(9)
add 9,9,x@tls
Note that a linker optimization exists to transform the new sequence into
the shorter sequence when appropriate, by replacing the addis with a nop
and modifying the base register and relocation type of the ld.
llvm-svn: 170209
This assumes (1 << n) is always not zero. Consider n is greater than word size.
Although I know it is undefined, this transforms undefined behavior hidden.
This led clang unexpected behavior with some failures. I will investigate to fix undefined shl in clang.
llvm-svn: 170128
Better controls the inlining of functions when the caller function has MinSize attribute.
Basically, when the caller function has this attribute, we do not "force" the inlining
of callee functions carrying the InlineHint attribute (i.e., functions defined with
inline keyword)
llvm-svn: 170065
predictable when compiled on at least one non-PowerPC host. Source of
nondeterminism not apparent. Restrict the test to build on PowerPC hosts
for now while looking into the issue further.
llvm-svn: 170016
PowerPC target. This is the last of the four models, so we now have
full TLS support.
This is mostly a straightforward extension of the general dynamic model.
I had to use an additional Chain operand to tie ADDIS_DTPREL_HA to the
register copy following ADDI_TLSLD_L; otherwise everything above the
ADDIS_DTPREL_HA appeared dead and was removed.
As before, there are new test cases to test the assembly generation, and
the relocations output during integrated assembly. The expected code
gen sequence can be read in test/CodeGen/PowerPC/tls-ld.ll.
There are a couple of things I think can be done more efficiently in the
overall TLS code, so there will likely be a clean-up patch forthcoming;
but for now I want to be sure the functionality is in place.
Bill
llvm-svn: 170003
When ASan replaces <alloca instruction> with
<offset into a common large alloca>, it should also patch
llvm.dbg.declare calls and replace debug info descriptors to mark
that we've replaced alloca with a value that stores an address
of the user variable, not the user variable itself.
See PR11818 for more context.
llvm-svn: 169984
fsub X, +0 ==> X
fsub X, -0 ==> X, when we know X is not -0
fsub +/-0.0, (fsub -0.0, X) ==> X
fsub nsz +/-0.0, (fsub +/-0.0, X) ==> X
fsub nnan ninf X, X ==> 0.0
fadd nsz X, 0 ==> X
fadd [nnan ninf] X, (fsub [nnan ninf] 0, X) ==> 0
where nnan and ninf have to occur at least once somewhere in this expression
fmul X, 1.0 ==> X
llvm-svn: 169940
Given a thread-local symbol x with global-dynamic access, the generated
code to obtain x's address is:
Instruction Relocation Symbol
addis ra,r2,x@got@tlsgd@ha R_PPC64_GOT_TLSGD16_HA x
addi r3,ra,x@got@tlsgd@l R_PPC64_GOT_TLSGD16_L x
bl __tls_get_addr(x@tlsgd) R_PPC64_TLSGD x
R_PPC64_REL24 __tls_get_addr
nop
<use address in r3>
The implementation borrows from the medium code model work for introducing
special forms of ADDIS and ADDI into the DAG representation. This is made
slightly more complicated by having to introduce a call to the external
function __tls_get_addr. Using the full call machinery is overkill and,
more importantly, makes it difficult to add a special relocation. So I've
introduced another opcode GET_TLS_ADDR to represent the function call, and
surrounded it with register copies to set up the parameter and return value.
Most of the code is pretty straightforward. I ran into one peculiarity
when I introduced a new PPC opcode BL8_NOP_ELF_TLSGD, which is just like
BL8_NOP_ELF except that it takes another parameter to represent the symbol
("x" above) that requires a relocation on the call. Something in the
TblGen machinery causes BL8_NOP_ELF and BL8_NOP_ELF_TLSGD to be treated
identically during the emit phase, so this second operand was never
visited to generate relocations. This is the reason for the slightly
messy workaround in PPCMCCodeEmitter.cpp:getDirectBrEncoding().
Two new tests are included to demonstrate correct external assembly and
correct generation of relocations using the integrated assembler.
Comments welcome!
Thanks,
Bill
llvm-svn: 169910
try to reduce the width of this load, and would end up transforming:
(truncate (lshr (sextload i48 <ptr> as i64), 32) to i32)
to
(truncate (zextload i32 <ptr+4> as i64) to i32)
We lost the sext attached to the load while building the narrower i32
load, and replaced it with a zext because lshr always zext's the
results. Instead, bail out of this combine when there is a conflict
between a sextload and a zext narrowing. The rest of the DAG combiner
still optimize the code down to the proper single instruction:
movswl 6(...),%eax
Which is exactly what we wanted. Previously we read past the end *and*
missed the sign extension:
movl 6(...), %eax
llvm-svn: 169802
This shouldn't affect codegen for -O0 compiles as tail call markers are not
emitted in unoptimized compiles. Testing with the external/internal nightly
test suite reveals no change in compile time performance. Testing with -O1,
-O2 and -O3 with fast-isel enabled did not cause any compile-time or
execution-time failures. All tests were performed on my x86 machine.
I'll monitor our arm testers to ensure no regressions occur there.
In an upcoming clang patch I will be marking the objc_autoreleaseReturnValue
and objc_retainAutoreleaseReturnValue as tail calls unconditionally. While
it's theoretically true that this is just an optimization, it's an
optimization that we very much want to happen even at -O0, or else ARC
applications become substantially harder to debug.
Part of rdar://12553082
llvm-svn: 169796
controls each of the abbreviation sets (only a single one at the
moment) and computes offsets separately as well for each set
of DIEs.
No real function change, ordering of abbreviations for the skeleton
CU changed but only because we're computing in a separate order. Fix
the testcase not to care.
llvm-svn: 169793
1. Teach it to use overlapping unaligned load / store to copy / set the trailing
bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is. e.g.
x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant
if it's not possible to materialize an integer immediate with a single
instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they
are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
Also increase the threshold to something reasonable (8 for memset, 4 pairs
for memcpy).
This significantly improves Dhrystone, up to 50% on ARM iOS devices.
rdar://12760078
llvm-svn: 169791
Analyse Phis under the starting assumption that they are NoAlias. Recursively
look at their inputs.
If they MayAlias/MustAlias there must be an input that makes them so.
Addresses bug 14351.
llvm-svn: 169788
misched used GetUnderlyingObject in order to break false load/store
dependencies, and the -enable-aa-sched-mi feature similarly relied on
GetUnderlyingObject in order to ensure it is safe to use the aliasing analysis.
Unfortunately, GetUnderlyingObject does not recurse through phi nodes, and so
(especially due to LSR) all of these mechanisms failed for
induction-variable-dependent loads and stores inside loops.
This change replaces uses of GetUnderlyingObject with GetUnderlyingObjects
(which will recurse through phi and select instructions) in misched.
Andy reviewed, tested and simplified this patch; Thanks!
llvm-svn: 169744
When SROA was evaluating a mixture of i1 and i8 loads and stores, in
just a particular case, it would tickle a latent bug where we compared
bits to bytes rather than bits to bits. As a consequence of the latent
bug, we would allow integers through which were not byte-size multiples,
a situation the later rewriting code was never intended to handle.
In release builds this could trigger all manner of oddities, but the
reported issue in PR14548 was forming invalid bitcast instructions.
The only downside of this fix is that it makes it more clear that SROA
in its current form is not capable of handling mixed i1 and i8 loads and
stores. Sometimes with the previous code this would work by luck, but
usually it would crash, so I'm not terribly worried. I'll watch the LNT
numbers just to be sure.
llvm-svn: 169719
- added function to VectorTargetTransformInfo to query cost of intrinsics
- vectorize trivially vectorizable intrinsic calls such as sin, cos, log, etc.
Reviewed by: Nadav
llvm-svn: 169711
The limit seems to break newer pythons (see PR13598) so just drop it for now.
Eventually lit should learn to set limits for its children instead of a global
limit in the makefile.
If some PPC bots fail after this change: That's a good thing, they actually run
clang tests now.
llvm-svn: 169695
There are still bugs in this pass, as well as other issues that are
being worked on, but the bugs are crashers that occur pretty easily in
the wild. Test cases have been sent to the original commit's review
thread.
This reverts the commits:
r169671: Fix a logic error.
r169604: Move the popcnt tests to an X86 subdirectory.
r168931: Initial commit adding the pass.
llvm-svn: 169683
Before this patch, when you objdump an LLVM-compiled file, objdump tried to
decode data-in-code sections as if they were code. This patch adds the missing
Mapping Symbols, as defined by "ELF for the ARM Architecture" (ARM IHI 0044D).
Patch based on work by Greg Fitzgerald.
llvm-svn: 169609
Buildbots for some hosts may choose to build only their own backend in order to
maximise testing-turnaround time. Move the test into a prefixed directory so
lit's standard "backend specific" suppression can be done.
llvm-svn: 169604
by virtue of inbounds GEPs that preclude a null pointer.
This is a very common pattern in the code generated by std::vector and
other standard library routines which use allocators that test for null
pervasively. This is one step closer to teaching Clang+LLVM to be able
to produce an empty function for:
void f() {
std::vector<int> v;
v.push_back(1);
v.push_back(2);
v.push_back(3);
v.push_back(4);
}
Which is related to getting them to completely fold SmallVector
push_back sequences into constants when inlining and other optimizations
make that a possibility.
llvm-svn: 169573
Instead of unconditionally storing origin with every application store,
only do this when the shadow of the stored value is != 0.
This change also delays instrumentation of stores until after the walk over
function's instructions, because adding new basic blocks confuses InstVisitor.
We only keep 1 origin value per 4 bytes of application memory. This change
fixes the bug when a store of a single clean byte wiped the origin for the
whole 4-byte area.
Since stores of uninitialized values are relatively uncommon, this change
improves performance of track-origins mode by 5% median and by up to 47% on
specs.
llvm-svn: 169490
Some languages, e.g. Ada and Pascal, allow you to specify that the array bounds
are different from the default (1 in these cases). If we have a lower bound
that's non-default, then we emit the lower bound. We also calculate the correct
upper bound in those cases.
llvm-svn: 169484
RUN: a
RUN: b || true
as "a && (b || true)" in Tcl mode, and as "(a && b) || true" in sh mode.
Everyone seems to (quite reasonably) write tests assuming the Tcl behavior,
so use that in sh mode too.
llvm-svn: 169441
This is much simpler to reason about, more efficient, and
fixes some corner cases involving implicit super-register defs.
Fixed rdar://12797931.
llvm-svn: 169425
The encoding of NOP in ARMAsmBackend.cpp is missing a trailing zero, which
causes the emission of a coprocessor instruction rather than "mov r0, r0"
as indicated in the comment. The test also checks for the wrong encoding.
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20121203/157919.html
llvm-svn: 169420
The new command line option -unwind-info dumps the Win64 EH unwind
data to the console. This is a nice feature if you need to debug
generated EH data (e.g. from LLVM). Includes a test case.
Initial patch by João Matos, extensions and rework by Kai Nacke.
llvm-svn: 169415
This is for the lldb team so most of but not all of the values are
to be printed as hex with this option. Some small values like the
scale in an X86 address were requested to printed in decimal
without the leading 0x.
There may be some tweaks need to places that may still be in
decimal that they want in hex. Specially for arm. I made my best
guess. Any tweaks from here should be simple.
I also did the best I know now with help from the C++ gurus
creating the cleanest formatImm() utility function and containing
the changes. But if someone has a better idea to make something
cleaner I'm all ears and game for changing the implementation.
rdar://8109283
llvm-svn: 169393
This change attempts to simplify (X^Y) -> X or Y in the user's context if we know that
only bits from X or Y are demanded.
A minimized case is provided bellow. This change will simplify "t>>16" into "var1 >>16".
=============================================================
unsigned foo (unsigned val1, unsigned val2) {
unsigned t = val1 ^ 1234;
return (t >> 16) | t; // NOTE: t is used more than once.
}
=============================================================
Note that if the "t" were used only once, the expression would be finally optimized as well.
However, with with this change, the optimization will take place earlier.
Reviewed by Nadav, Thanks a lot!
llvm-svn: 169317
The count attribute is more accurate with regards to the size of an array. It
also obviates the upper bound attribute in the subrange. We can also better
handle an unbound array by setting the count to -1 instead of the lower bound to
1 and upper bound to 0.
llvm-svn: 169312
This reapplies the fix for PR13303 now with more justification. Based on my
execution of the GDB 7.5 test suite this results in:
expected passes: 16101 -> 20890 (+30%)
unexpected failures: 4826 -> 637 (-77%)
There are 23 checks that used to pass and now fail. They are all in
gdb.reverse. Investigating a few looks like they were accidentally passing
due to extra breakpoints being set by this bug. They're generally due to the
difference in end location between gcc and clang, the test suite is trying to
set breakpoints on the closing '}' that clang doesn't associate with any
instructions.
llvm-svn: 169304
on 64-bit PowerPC ELF.
The patch includes code to handle external assembly and MC output with the
integrated assembler. It intentionally does not support the "old" JIT.
For the initial-exec TLS model, the ABI requires the following to calculate
the address of external thread-local variable x:
Code sequence Relocation Symbol
ld 9,x@got@tprel(2) R_PPC64_GOT_TPREL16_DS x
add 9,9,x@tls R_PPC64_TLS x
The register 9 is arbitrary here. The linker will replace x@got@tprel
with the offset relative to the thread pointer to the generated GOT
entry for symbol x. It will replace x@tls with the thread-pointer
register (13).
The two test cases verify correct assembly output and relocation output
as just described.
PowerPC-specific selection node variants are added for the two
instructions above: LD_GOT_TPREL and ADD_TLS. These are inserted
when an initial-exec global variable is encountered by
PPCTargetLowering::LowerGlobalTLSAddress(), and later lowered to
machine instructions LDgotTPREL and ADD8TLS. LDgotTPREL is a pseudo
that uses the same LDrs support added for medium code model's LDtocL,
with a different relocation type.
The rest of the processing is straightforward.
llvm-svn: 169281
The count field is necessary because there isn't a difference between the 'lo'
and 'hi' attributes for a one-element array and a zero-element array. When the
count is '0', we know that this is a zero-element array. When it's >=1, then
it's a normal constant sized array. When it's -1, then the array is unbounded.
llvm-svn: 169218
Added the code that actually performs the if-conversion during vectorization.
We can now vectorize this code:
for (int i=0; i<n; ++i) {
unsigned k = 0;
if (a[i] > b[i]) <------ IF inside the loop.
k = k * 5 + 3;
a[i] = k; <---- K is a phi node that becomes vector-select.
}
llvm-svn: 169217
the alignment is clamped to TargetFrameLowering.getStackAlignment if the target
does not support stack realignment or the option "realign-stack" is off.
This will cause miscompile if the address is treated as aligned and add is
replaced with or in DAGCombine.
Added a bool StackRealignable to TargetFrameLowering to check whether stack
realignment is implemented for the target. Also added a bool RealignOption
to MachineFrameInfo to check whether the option "realign-stack" is on.
rdar://12713765
llvm-svn: 169197
This change tries to simmplify E1 = " X >> C1 << C2" into :
- E2 = "X << (C2 - C1)" if C2 > C1, or
- E2 = "X >> (C1 - C2)" if C1 > C2, or
- E2 = X if C1 == C2.
Reviewed by Nadav. Thanks!
llvm-svn: 169182
is not yet good enough for more sophistication. The important goal of this
test is to make sure llc doesn't crash on this IR like it used to.
llvm-svn: 169146
; CHECK: [[VAR:[a-z]]]
The problem was that to find the end of the regex var definition, it was
simplistically looking for the next ]] and finding the incorrect one. A
better approach is to count nesting of brackets (taking escaping into
account). This way the brackets that are part of the regex can be discovered
and skipped properly, and the ]] ending is detected in the right place.
llvm-svn: 169109
Also check in a case to repeat the issue, on which 'opt -globalopt' consumes 1.6GB memory.
The big memory footprint cause is that current GlobalOpt one by one hoists and stores the leaf element constant into the global array, in each iteration, it recreates the global array initializer constant and leave the old initializer alone. This may result in many obsolete constants left.
For example: we have global array @rom = global [16 x i32] zeroinitializer
After the first element value is hoisted and installed: @rom = global [16 x i32] [ 1, 0, 0, ... ]
After the second element value is installed: @rom = global [16 x 32] [ 1, 2, 0, 0, ... ] // here the previous initializer is obsolete
...
When the transform is done, we have 15 obsolete initializers left useless.
llvm-svn: 169079
The TwoAddressInstructionPass takes the machine code out of SSA form by
expanding REG_SEQUENCE instructions into copies. It is no longer
necessary to rewrite the registers used by a REG_SEQUENCE instruction
because the new coalescer algorithm can do it now.
REG_SEQUENCE is just converted to a sequence of sub-register copies now.
llvm-svn: 169067
Codegen was failing with an assertion because of unexpected vector
operands when legalizing the selection DAG for a MUL instruction.
The asserting code was legalizing multiplies for vectors of size 128
bits. It uses a custom lowering to try and detect cases where it can
use a VMULL instruction instead of a VMOVL + VMUL. The code was
looking for input operands to the MUL that had been sign or zero
extended. If it found the extended operands it would drop the
sign/zero extension and use the original vector size as input to a
VMULL instruction.
The code assumed that the original input vector was 64 bits so that
after dropping the extension it would fit directly into a D register
and could be used as an operand of a VMULL instruction. The input
code that trigger the failure used a vector of <4 x i8> that was
sign extended to <4 x i32>. It was not safe to drop the sign
extension in this case because the original vector is only 32 bits
wide. The fix is to insert a sign extension for the vector to reach
the required 64 bit size. In this particular example, the vector would
need to be sign extented to a <4 x i16>.
llvm-svn: 169024
instruction (vmaddfp) to conform with IEEE to ensure the sign of a zero
result when resulting product is -0.0.
The -0.0 vector addend to vmaddfp is generated by a creating a vector
with full bits sets and then shifting each elements by 31-bits to the
left, resulting in a vector of 0x80000000 (or -0.0 as float).
The 'buildvec_canonicalize.ll' was adjusted to reflect this change and
the 'vec_mul.ll' was complemented with the float vector multiplication
test.
llvm-svn: 168998
This revision attempts to recognize following population-count pattern:
while(a) { c++; ... ; a &= a - 1; ... },
where <c> and <a>could be used multiple times in the loop body.
TODO: On X8664 and ARM, __buildin_ctpop() are not expanded to a efficent
instruction sequence, which need to be improved in the following commits.
Reviewed by Nadav, really appreciate!
llvm-svn: 168931
the last invoke instruction in the function. This also removes the last landing
pad in an function. This is fine, but with SjLj EH code, we've already placed a
bunch of code in the 'entry' block, which expects the landing pad to stick
around.
When we get to the situation where CGP has removed the last landing pad, go
ahead and nuke the SjLj instructions from the 'entry' block.
<rdar://problem/12721258>
llvm-svn: 168930
This patch migrates the puts optimizations from the simplify-libcalls
pass into the instcombine library call simplifier.
All the simplifiers from simplify-libcalls have now been migrated to
instcombine. Yay! Just a few other bits to migrate (prototype attribute
inference and a few statistics) and simplify-libcalls can finally be put
to rest.
llvm-svn: 168925
If we need to split the operand of a VSELECT, it must be the mask operand. We
split the entire VSELECT operand with EXTRACT_SUBVECTOR.
llvm-svn: 168883
The createPPCMCAsmInfo routine used PPC::R1 as the initial frame
pointer register, but on PPC64 the 32-bit R1 register does not
have a corresponding DWARF number, causing invalid CIE initial
frame state to be emitted. Fix by using PPC::X1 instead.
llvm-svn: 168799
Accordingly, update a testcase with a broken datalayout string.
Also, we never parse negative numbers, because '-' is used as a
separator. Therefore, use unsigned as result type.
llvm-svn: 168785
This is a simple, cheap infrastructure for analyzing the shape of a
DAG. It recognizes uniform DAGs that take the shape of bottom-up
subtrees, such as the included matrix multiplication example. This is
useful for heuristics that balance register pressure with ILP. Two
canonical expressions of the heuristic are implemented in scheduling
modes: -misched-ilpmin and -misched-ilpmax.
llvm-svn: 168773
This fixes a hole in the "cheap" alias analysis logic implemented within
the DAG builder itself, regardless of whether proper alias analysis is
enabled. It now handles this pattern produced by LSR+CodeGenPrepare.
%sunkaddr1 = ptrtoint * %obj to i64
%sunkaddr2 = add i64 %sunkaddr1, %lsr.iv
%sunkaddr3 = inttoptr i64 %sunkaddr2 to i32*
store i32 %v, i32* %sunkaddr3
llvm-svn: 168768
When the CodeGenInfo is to be created for the PPC64 target machine,
a default code-model selection is converted to CodeModel::Medium
provided we are not targeting the Darwin OS. Defaults for Darwin
are unaffected.
llvm-svn: 168747
boundaries.
Given the following case:
BB0
%vreg1<def> = SUBrr %vreg0, %vreg7
%vreg2<def> = COPY %vreg7
BB1
%vreg10<def> = SUBrr %vreg0, %vreg2
We should be able to CSE between SUBrr in BB0 and SUBrr in BB1.
rdar://12462006
llvm-svn: 168717
My commit to migrate the printf simplifiers from the simplify-libcalls
in r168604 introduced a regression reported by Duncan [1]. The problem
is that in some cases the library call simplifier can return a new value
that has no uses and the new value's type is different than the old value's
type (which is fine because there are no uses). The specific case that
triggered the bug looked something like:
declare void @printf(i8*, ...)
...
call void (i8*, ...)* @printf(i8* %fmt)
Which we want to optimized into:
call i32 @putchar(i32 104)
However, the code was attempting to replace all uses of the printf with
the putchar and the types differ, hence a crash. This is fixed by *just*
deleting the original instruction when there are no uses. The old
simplify-libcalls pass is already doing something similar.
[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-November/056338.html
llvm-svn: 168716
when the destination register is wider than the memory load.
These load instructions load from m32 or m64 and set the upper bits to zero,
while the folded instructions may accept m128.
rdar://12721174
llvm-svn: 168710
The default for 64-bit PowerPC is small code model, in which TOC entries
must be addressable using a 16-bit offset from the TOC pointer. Additionally,
only TOC entries are addressed via the TOC pointer.
With medium code model, TOC entries and data sections can all be addressed
via the TOC pointer using a 32-bit offset. Cooperation with the linker
allows 16-bit offsets to be used when these are sufficient, reducing the
number of extra instructions that need to be executed. Medium code model
also does not generate explicit TOC entries in ".section toc" for variables
that are wholly internal to the compilation unit.
Consider a load of an external 4-byte integer. With small code model, the
compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
With medium model, it instead generates:
addis 3, 2, .LC1@toc@ha
ld 3, .LC1@toc@l(3)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
Here .LC1@toc@ha is a relocation requesting the upper 16 bits of the
32-bit offset of ei's TOC entry from the TOC base pointer. Similarly,
.LC1@toc@l is a relocation requesting the lower 16 bits. Note that if
the linker determines that ei's TOC entry is within a 16-bit offset of
the TOC base pointer, it will replace the "addis" with a "nop", and
replace the "ld" with the identical "ld" instruction from the small
code model example.
Consider next a load of a function-scope static integer. For small code
model, the compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc test_fn_static.si[TC],test_fn_static.si
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
For medium code model, the compiler generates:
addis 3, 2, test_fn_static.si@toc@ha
addi 3, 3, test_fn_static.si@toc@l
lwz 4, 0(3)
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
Again, the linker may replace the "addis" with a "nop", calculating only
a 16-bit offset when this is sufficient.
Note that it would be more efficient for the compiler to generate:
addis 3, 2, test_fn_static.si@toc@ha
lwz 4, test_fn_static.si@toc@l(3)
The current patch does not perform this optimization yet. This will be
addressed as a peephole optimization in a later patch.
For the moment, the default code model for 64-bit PowerPC will remain the
small code model. We plan to eventually change the default to medium code
model, which matches current upstream GCC behavior. Note that the different
code models are ABI-compatible, so code compiled with different models will
be linked and execute correctly.
I've tested the regression suite and the application/benchmark test suite in
two ways: Once with the patch as submitted here, and once with additional
logic to force medium code model as the default. The tests all compile
cleanly, with one exception. The mandel-2 application test fails due to an
unrelated ABI compatibility with passing complex numbers. It just so happens
that small code model was incredibly lucky, in that temporary values in
floating-point registers held the expected values needed by the external
library routine that was called incorrectly. My current thought is to correct
the ABI problems with _Complex before making medium code model the default,
to avoid introducing this "regression."
Here are a few comments on how the patch works, since the selection code
can be difficult to follow:
The existing logic for small code model defines three pseudo-instructions:
LDtoc for most uses, LDtocJTI for jump table addresses, and LDtocCPT for
constant pool addresses. These are expanded by SelectCodeCommon(). The
pseudo-instruction approach doesn't work for medium code model, because
we need to generate two instructions when we match the same pattern.
Instead, new logic in PPCDAGToDAGISel::Select() intercepts the TOC_ENTRY
node for medium code model, and generates an ADDIStocHA followed by either
a LDtocL or an ADDItocL. These new node types correspond naturally to
the sequences described above.
The addis/ld sequence is generated for the following cases:
* Jump table addresses
* Function addresses
* External global variables
* Tentative definitions of global variables (common linkage)
The addis/addi sequence is generated for the following cases:
* Constant pool entries
* File-scope static global variables
* Function-scope static variables
Expanding to the two-instruction sequences at select time exposes the
instructions to subsequent optimization, particularly scheduling.
The rest of the processing occurs at assembly time, in
PPCAsmPrinter::EmitInstruction. Each of the instructions is converted to
a "real" PowerPC instruction. When a TOC entry needs to be created, this
is done here in the same manner as for the existing LDtoc, LDtocJTI, and
LDtocCPT pseudo-instructions (I factored out a new routine to handle this).
I had originally thought that if a TOC entry was needed for LDtocL or
ADDItocL, it would already have been generated for the previous ADDIStocHA.
However, at higher optimization levels, the ADDIStocHA may appear in a
different block, which may be assembled textually following the block
containing the LDtocL or ADDItocL. So it is necessary to include the
possibility of creating a new TOC entry for those two instructions.
Note that for LDtocL, we generate a new form of LD called LDrs. This
allows specifying the @toc@l relocation for the offset field of the LD
instruction (i.e., the offset is replaced by a SymbolLo relocation).
When the peephole optimization described above is added, we will need
to do similar things for all immediate-form load and store operations.
The seven "mcm-n.ll" test cases are kept separate because otherwise the
intermingling of various TOC entries and so forth makes the tests fragile
and hard to understand.
The above assumes use of an external assembler. For use of the
integrated assembler, new relocations are added and used by
PPCELFObjectWriter. Testing is done with "mcm-obj.ll", which tests for
proper generation of the various relocations for the same sequences
tested with the external assembler.
llvm-svn: 168708
argument. Instead, use a pair of .local and .comm directives.
This avoids spurious differences between binaries built by the
integrated assembler vs. those built by the external assembler,
since the external assembler may impose alignment requirements
on .lcomm symbols where the integrated assembler does not.
llvm-svn: 168704
If the Src and Dst are the same instruction,
no loop-independent dependence is possible,
so we force the PossiblyLoopIndependent flag to false.
The test case results are updated appropriately.
llvm-svn: 168678
It currently assumes register numbering and any harmless change in the X86
register naming makes it fail. It's enough to match the register names.
llvm-svn: 168632
InstCombineLoadStoreAlloca.cpp, which had many issues.
(At least two bugs were noted on llvm-commits, and it was overly conservative.)
Instead, use getOrEnforceKnownAlignment.
llvm-svn: 168629
This pass was conservative in that it always reserved the FP to enable dynamic
stack realignment, which allowed the RA to use aligned spills for vector
registers. This happens even when spills were not necessary. The RA has
since been improved to use unaligned spills when necessary.
The new behavior is to realign the stack if the frame pointer was already
reserved for some other reason, but don't reserve the frame pointer just
because a function contains vector virtual registers.
Part of rdar://12719844
llvm-svn: 168627
Enhancement to InstCombine. Try to catch this opportunity:
---------------------------------------------------------------
((X^C1) >> C2) ^ C3 => (X>>C2) ^ ((C1>>C2)^C3)
where the subexpression "X ^ C1" has more than one uses, and
"(X^C1) >> C2" has single use.
----------------------------------------------------------------
Reviewed by Nadav (with minor change per his request).
llvm-svn: 168615
In preparation for the FileCheck functionality change which will allow using
a variable later on the same line.
No functionality change.
llvm-svn: 168588
to support it. Original patch with the parsing and plumbing by the PaX team and
Roman Divacky. I added the bits in MCDwarf.cpp and the test.
llvm-svn: 168565
The last remaining bit is "bcl 20, 31, AnonSymbol", which I couldn't find the
instruction definition for. Only whitespace changes in assembly output.
llvm-svn: 168541
I discovered a few more missing functions while migrating optimizations
from the simplify-libcalls pass to the instcombine (I already added some
in r167659).
llvm-svn: 168501
analysis. Better is to look for cases with useful GEPs and use them
when possible. When a pair of useful GEPs is not available, use the
raw SCEVs directly. This approach supports better analysis of pointer
dereferencing.
In parallel, all the test cases are updated appropriately.
Cases where we have a store to *B++ can now be analyzed!
llvm-svn: 168474
This patch provides support for the MIPS relocations:
*) R_MIPS_GOT_HI16
*) R_MIPS_GOT_LO16
*) R_MIPS_CALL_HI16
*) R_MIPS_CALL_LO16
These are used for large GOT instruction sequences.
Contributer: Jack Carter
llvm-svn: 168471
Now if we can transform an alloca into a single vector value, but it has
subvector, non-element accesses, we form the appropriate shufflevectors
to allow SROA to proceed. This fixes PR14055 which pointed out a very
common pattern that SROA couldn't handle -- mixed vec3 and vec4
operations on a single alloca.
llvm-svn: 168418
The issue is that we may end up with newly OOB loads when speculating
a load into the predecessors of a PHI node, and this confuses the new
integer splitting logic in some cases, triggering an assertion failure.
In fact, the branch in question must be dead code as it loads from
a too-narrow alloca. Add code to handle this gracefully and leave the
requisite FIXMEs for both optimizing more aggressively and doing more to
aid sanitizing invalid code which triggers these patterns.
llvm-svn: 168361
to properly handle the combinations of these with split integer loads
and stores. This essentially replaces Evan's r168227 by refactoring the
code in a different way, and trynig to mirror that refactoring in both
the load and store sides of the rewriting.
Generally speaking there was some really problematic duplicated code
here that led to poorly founded assumptions and then subtle bugs. Now
much of the code actually flows through and follows a more consistent
style and logical path. There is still a tiny bit of duplication on the
store side of things, but it is much less bad.
This also changes the logic to never re-use a load or store instruction
as that was simply too error prone in practice.
I've added a few tests (one a reduction of the one in Evan's original
patch, which happened to be the same as the report in PR14349). I'm
going to look at adding a few more tests for things I found and fixed in
passing (such as the volatile tests in the vectorizable predicate).
This patch has survived bootstrap, and modulo one bugfix survived
Duncan's test suite, but let me know if anything else explodes.
llvm-svn: 168346
operands of the expression being written was wrongly thought to be reusable as
an inner node of the expression resulting in it turning up as both an inner node
*and* a leaf, creating a cycle in the def-use graph. This would have caused the
verifier to blow up if things had gotten that far, however it managed to provoke
an infinite loop first.
llvm-svn: 168291
This is a partial solution to PR14351. It removes some of the special
significance of the first incoming phi value in the phi aliasing checking logic
in BasicAA. In the context of a loop, the old logic assumes that the first
incoming value is the interesting one (meaning that it is the one that comes
from outside the loop), but this is often not the case. With this change, we
now test first the incoming value that comes from a block other than the parent
of the phi being tested.
llvm-svn: 168245
This patch replaces the hard coded GPR pair [R0, R1] of
Intrinsic:arm_ldrexd and [R2, R3] of Intrinsic:arm_strexd with
even/odd GPRPair reg class.
Similar to the lowering of atomic_64 operation.
llvm-svn: 168207
Before, the parser would assert on the following code:
@a2 = global i8 addrspace(1)* @a
@a = addrspace(1) global i8 0
because the type of @a was "i8*" instead of "i8 addrspace(1)*" when parsing
the initializer for @a2.
llvm-svn: 168197
replaced by this patch is equivalent to the new logic, but you'd be wrong, and
that's exactly where the bug was. There's a similar bug in instsimplify which
manifests itself as instsimplify failing to simplify this, rather than doing it
wrong, see next commit.
llvm-svn: 168181
It turns out that the operands of a Constant are not always themselves
Constant. For example, one of the operands of BlockAddress is
BasicBlock, which is not a Constant.
This should fix the dragonegg-x86_64-linux-gcc-4.6-test build which
broke in r168037.
llvm-svn: 168147
This patch lowers the llvm.floor, llvm.ceil, llvm.trunc, and
llvm.nearbyint to Altivec instruction when using 4 single-precision
float vectors.
llvm-svn: 168086
For global variables that get the same value stored into them
everywhere, GlobalOpt will replace them with a constant. The problem is
that a thread-local GlobalVariable looks like one value (the address of
the TLS var), but is different between threads.
This patch introduces Constant::isThreadDependent() which returns true
for thread-local variables and constants which depend on them (e.g. a GEP
into a thread-local array), and teaches GlobalOpt not to track such
values.
llvm-svn: 168037
the utility for extracting a chain of operations from the IR, thought that it
might as well combine any constants it came across (rather than just returning
them along with everything else). On the other hand, the factorization code
would like to see the individual constants (this is quite reasonable: it is
much easier to pull a factor of 3 out of 2*3 than it is to pull it out of 6;
you may think 6/3 isn't so hard, but due to overflow it's not as easy to undo
multiplications of constants as it may at first appear). This patch therefore
makes LinearizeExprTree stupider: it now leaves optimizing to the optimization
part of reassociate, and sticks to just analysing the IR.
llvm-svn: 168035
PPC64 target. The five tests modified herein test code generation that is
sensitive to the code model selected. So I've added -code-model=small to
the RUN commands for each.
Since small code model is the default, this has no effect for now; but this
prepares us for eventually changing the default to medium code model for PPC64.
Test changes verified with small and medium code model as default on
powerpc64-unknown-linux-gnu. All tests continue to pass.
llvm-svn: 167999
The stack realignment code was fixed to work when there is stack realignment and
a dynamic alloca is present so this shouldn't cause correctness issues anymore.
Note that this also enables generation of AVX instructions for memset
under the assumptions:
- Unaligned loads/stores are always fast on CPUs supporting AVX
- AVX is not slower than SSE
We may need some tweaked heuristics if one of those assumptions turns out not to
be true.
Effectively reverts r58317. Part of PR2962.
llvm-svn: 167967
This patch changes the definition of negative from -0..-255 to -1..-255. I am changing this because of
a bug that we had in some of the patterns that assumed that "subs" of zero does not set the carry flag.
rdar://12028498
llvm-svn: 167963
When an instruction as written requires 32-bit mode and we're assembling
in 64-bit mode, or vice-versa, issue a more specific diagnostic about
what's wrong.
rdar://12700702
llvm-svn: 167937
chain is correctly setup.
As an example, if the original load must happen before later stores, we need
to make sure the constructed VZEXT_LOAD is constrained to be before the stores.
rdar://12684358
llvm-svn: 167859
physical register as candidate for common subexpression elimination
in MachineCSE.
This fixes a bug on PowerPC in MultiSource/Applications/oggenc/oggenc
caused by MachineCSE invalidly merging two separate DYNALLOC insns.
llvm-svn: 167855
On MSYS, 70 is not seen, but 1.
r127726 should be reworked. Candidate options are;
1) Use not exit(70), but _exit(70), in report_fatal_error().
2) Return with _exit(70) in ~raw_ostream().
llvm-svn: 167836
Previously in a vector of pointers, the pointer couldn't be any pointer type,
it had to be a pointer to an integer or floating point type. This is a hassle
for dragonegg because the GCC vectorizer happily produces vectors of pointers
where the pointer is a pointer to a struct or whatever. Vector getelementptr
was restricted to just one index, but now that vectors of pointers can have
any pointer type it is more natural to allow arbitrary vector getelementptrs.
There is however the issue of struct GEPs, where if each lane chose different
struct fields then from that point on each lane will be working down into
unrelated types. This seems like too much pain for too little gain, so when
you have a vector struct index all the elements are required to be the same.
llvm-svn: 167828
This patch migrates the math library call simplifications from the
simplify-libcalls pass into the instcombine library call simplifier.
I have typically migrated just one simplifier at a time, but the math
simplifiers are interdependent because:
1. CosOpt, PowOpt, and Exp2Opt all depend on UnaryDoubleFPOpt.
2. CosOpt, PowOpt, Exp2Opt, and UnaryDoubleFPOpt all depend on
the option -enable-double-float-shrink.
These two factors made migrating each of these simplifiers individually
more of a pain than it would be worth. So, I migrated them all together.
llvm-svn: 167815
Don't choose a vectorization plan containing only shuffles and
vector inserts/extracts. Due to inperfections in the cost model,
these can lead to infinite recusion.
llvm-svn: 167811
If we have a type 'int a[1]' and a type 'int b[0]', the generated DWARF is the
same for both of them because we use the 'upper_bound' attribute. Instead use
the 'count' attrbute, which gives the correct number of elements in the array.
<rdar://problem/12566646>
llvm-svn: 167806
This fixes another infinite recursion case when using target costs.
We can only replace insert element input chains that are pure (end
with inserting into an undef).
llvm-svn: 167784
The old checking code, which assumed that input shuffles and insert-elements
could always be folded (and thus were free) is too simple.
This can only happen in special circumstances.
Using the simple check caused infinite recursion.
llvm-svn: 167750
The pass would previously assert when trying to compute the cost of
compare instructions with illegal vector types (like struct pointers).
llvm-svn: 167743
The assertion is trigged when the Reassociater tries to transform expression
... + 2 * n * 3 + 2 * m + ...
into:
... + 2 * (n*3 + m).
In the process of the transformation, a helper routine folds the constant 2*3 into 6,
confusing optimizer which is trying the to eliminate the common factor 2, and cannot
find 2 any more.
Review is pending. But I'd like commit first in order to help those who are waiting
for this fix.
llvm-svn: 167740
This adds support for weak DAG edges to the general scheduling
infrastructure in preparation for MachineScheduler support for
heuristics based on weak edges.
llvm-svn: 167738
This fixes a bug where shuffles were being fused such that the
resulting input types were not legal on the target. This would
occur only when both inputs and dependencies were also foldable
operations (such as other shuffles) and there were other connected
pairs in the same block.
llvm-svn: 167731
The library call simplifier folds memcmp calls with all constant arguments
to a constant. For example:
memcmp("foo", "foo", 3) -> 0
memcmp("hel", "foo", 3) -> 1
memcmp("foo", "hel", 3) -> -1
The folding is implemented in terms of the system memcmp that LLVM gets
linked with. It currently just blindly uses the value returned from
the system memcmp as the folded constant.
This patch normalizes the values returned from the system memcmp to
(-1, 0, 1) so that we get consistent results across multiple platforms.
The test cases were adjusted accordingly.
llvm-svn: 167726
Each SM and PTX version is modeled as a subtarget feature/CPU. Additionally,
PTX 3.1 is added as the default PTX version to be out-of-the-box compatible
with CUDA 5.0.
Available CPUs for this target:
sm_10 - Select the sm_10 processor.
sm_11 - Select the sm_11 processor.
sm_12 - Select the sm_12 processor.
sm_13 - Select the sm_13 processor.
sm_20 - Select the sm_20 processor.
sm_21 - Select the sm_21 processor.
sm_30 - Select the sm_30 processor.
sm_35 - Select the sm_35 processor.
Available features for this target:
ptx30 - Use PTX version 3.0.
ptx31 - Use PTX version 3.1.
sm_10 - Target SM 1.0.
sm_11 - Target SM 1.1.
sm_12 - Target SM 1.2.
sm_13 - Target SM 1.3.
sm_20 - Target SM 2.0.
sm_21 - Target SM 2.1.
sm_30 - Target SM 3.0.
sm_35 - Target SM 3.5.
llvm-svn: 167699
Transforms/InstCombine/memcmp-1.ll has a test case that looks like:
@foo = constant [4 x i8] c"foo\00"
@hel = constant [4 x i8] c"hel\00"
...
%mem1 = getelementptr [4 x i8]* @hel, i32 0, i32 0
%mem2 = getelementptr [4 x i8]* @foo, i32 0, i32 0
%ret = call i32 @memcmp(i8* %mem1, i8* %mem2, i32 3)
ret i32 %ret
; CHECK: ret i32 2
The folded return value (2 above) is computed using the system memcmp
that the compiler is linked with. This can return different values on
different systems. The test was originally written on an OS X 10.7.5
x86-64 box and passed. However, it failed on one of the x86-64 FreeBSD
buildbots because the system memcpy on that machine returned a different
value (1 instead of 2).
I fixed the test by checking the folding constants with regexes.
llvm-svn: 167691
Several of the simplifiers migrated from the simplify-libcalls pass to
the instcombine pass were not correctly checking the target library
information to gate the simplifications. This patch ensures that the
check is made.
llvm-svn: 167660
mov lr, pc
b.w _foo
The "mov" instruction doesn't set bit zero to one, it's putting incorrect
value in lr. It messes up backtraces.
rdar://12663632
llvm-svn: 167657
The RegMaskSlots contains 'r' slots while NewIdx and OldIdx are 'B'
slots. This broke the checks in the assertions.
This fixes PR14302.
llvm-svn: 167625
Improve ARM build attribute emission for architectures types.
This also changes the default architecture emitted for a generic CPU to "v7".
llvm-svn: 167574
- Add RTM code generation support throught 3 X86 intrinsics:
xbegin()/xend() to start/end a transaction region, and xabort() to abort a
tranaction region
llvm-svn: 167573
misched is disabled by default. With -enable-misched, these heuristics
balance the schedule to simultaneously avoid saturating processor
resources, expose ILP, and minimize register pressure. I've been
analyzing the performance of these heuristics on everything in the
llvm test suite in addition to a few other benchmarks. I would like
each heuristic check to be verified by a unit test, but I'm still
trying to figure out the best way to do that. The heuristics are still
in considerable flux, but as they are refined we should be rigorous
about unit testing the improvements.
llvm-svn: 167527
to be extended to a full register. This is modeled in the IR by marking
the return value (or argument) with a signext or zeroext attribute.
However, while these attributes are respected for function arguments,
they are currently ignored for function return values by the PowerPC
back-end. This patch updates PPCCallingConv.td to ask for the promotion
to i64, and fixes LowerReturn and LowerCallResult to implement it.
The new test case verifies that both arguments and return values are
properly extended when passing them; and also that the optimizers
understand incoming argument and return values are in fact guaranteed
by the ABI to be extended.
The patch caused a spurious breakage in CodeGen/PowerPC/coalesce-ext.ll,
since the test case used a "ret" instruction to create a use of an i32
value at the end of the function (to set up data flow as required for
what the test is intended to test). Since there's now an implicit
promotion to i64, that data flow no longer works as expected. To fix
this, this patch now adds an extra "add" to ensure we have an appropriate
use of the i32 value.
llvm-svn: 167396
The Z constraint specifies an r+r memory address, and the y modifier expands
to the "r, r" in the asm string. For this initial implementation, the base
register is forced to r0 (which has the special meaning of 0 for r+r addressing
on PowerPC) and the full address is taken in the second register. In the
future, this should be improved.
llvm-svn: 167388
'nocapture' attribute.
The nocapture attribute only specifies that no copies are made that
outlive the function. This isn't the same as there being no copies at all.
This fixes PR14045.
llvm-svn: 167381
The new analysis is not yet ready for prime time. It has a *critical*
flawed assumption, and some troubling shortages of testing. Until it's
been hammered into better shape, let's stick with the working code. This
should be easy to revert itself when the analysis is ready.
Fixes PR14241, a miscompile of any memcpy-able loop which uses a pointer
as the induction mechanism. If you have been seeing miscompiles in this
revision range, you really want to test with this backed out. The
results of this miscompile are a bit subtle as they can lead to
downstream passes concluding things are impossible which are in fact
possible.
Thanks to David Blaikie for the majority of the reduction of this
miscompile. I'll be checking in the test case in a non-revert commit.
Revesions reverted here:
r167045: LoopIdiom: Fix a serious missed optimization: we only turned
top-level loops into memmove.
r166877: LoopIdiom: Add checks to avoid turning memmove into an infinite
loop.
r166875: LoopIdiom: Recognize memmove loops.
r166874: LoopIdiom: Replace custom dependence analysis with
DependenceAnalysis.
llvm-svn: 167286
When target cost information is available, compute explicit costs of inserting and
extracting values from vectors. At this point, all costs are estimated using the
target information, and the chain-depth heuristic is not needed. As a result, it is now, by
default, disabled when using target costs.
llvm-svn: 167256
run through the 'C' preprocessor. That is pick up the file name
and line numbers from the cpp hash file line comments for the
dwarf file and line numbers tables.
rdar://9275556
llvm-svn: 167237
getIntPtrType support for multiple address spaces via a pointer type,
and also introduced a crasher bug in the constant folder reported in
PR14233.
These commits also contained several problems that should really be
addressed before they are re-committed. I have avoided reverting various
cleanups to the DataLayout APIs that are reasonable to have moving
forward in order to reduce the amount of churn, and minimize the number
of commits that were reverted. I've also manually updated merge
conflicts and manually arranged for the getIntPtrType function to stay
in DataLayout and to be defined in a plausible way after this revert.
Thanks to Duncan for working through this exact strategy with me, and
Nick Lewycky for tracking down the really annoying crasher this
triggered. (Test case to follow in its own commit.)
After discussing with Duncan extensively, and based on a note from
Micah, I'm going to continue to back out some more of the more
problematic patches in this series in order to ensure we go into the
LLVM 3.2 branch with a reasonable story here. I'll send a note to
llvmdev explaining what's going on and why.
Summary of reverted revisions:
r166634: Fix a compiler warning with an unused variable.
r166607: Add some cleanup to the DataLayout changes requested by
Chandler.
r166596: Revert "Back out r166591, not sure why this made it through
since I cancelled the command. Bleh, sorry about this!
r166591: Delete a directory that wasn't supposed to be checked in yet.
r166578: Add in support for getIntPtrType to get the pointer type based
on the address space.
llvm-svn: 167221
The adc/sbb optimization is to able to convert following expression
into a single adc/sbb instruction:
(ult) ... = x + 1 // where the ult is unsigned-less-than comparison
(ult) ... = x - 1
This change is to flip the "x >u y" (i.e. ugt comparison) in order
to expose the adc/sbb opportunity.
llvm-svn: 167180
BBVectorize would, except for loads and stores, always fuse instructions
so that the first instruction (in the current source order) would always
represent the low part of the input vectors and the second instruction
would always represent the high part. This lead to too many shuffles
being produced because sometimes the opposite order produces fewer of them.
With this change, BBVectorize tracks the kind of pair connections that form
the DAG of candidate pairs, and uses that information to reorder the pairs to
avoid excess shuffles. Using this information, a future commit will be able
to add VTTI-based shuffle costs to the pair selection procedure. Importantly,
the number of remaining shuffles can now be estimated during pair selection.
There are some trivial instruction reorderings in the test cases, and one
simple additional test where we certainly want to do a reordering to
avoid an unnecessary shuffle.
llvm-svn: 167122
By propagating the value for the switch condition, LLVM can now build
lookup tables for code such as:
switch (x) {
case 1: return 5;
case 2: return 42;
case 3: case 4: case 5:
return x - 123;
default:
return 123;
}
Given that x is known for each case, "x - 123" becomes a constant for
cases 3, 4, and 5.
llvm-svn: 167115
parameters. Examples of these are:
struct { } a;
union { } b[256];
int a[0];
An empty aggregate has an address, although dereferencing that address is
pointless. When passed as a parameter, an empty aggregate does not consume
a protocol register, nor does it consume a doubleword in the parameter save
area. Passing an empty aggregate by reference passes an address just as
for any other aggregate. Returning an empty aggregate uses GPR3 as a hidden
address of the return value location, just as for any other aggregate.
The patch modifies PPCTargetLowering::LowerFormalArguments_64SVR4 and
PPCTargetLowering::LowerCall_64SVR4 to properly skip empty aggregate
parameters passed by value. The handling of return values and by-reference
parameters was already correct.
Built on powerpc64-unknown-linux-gnu and tested with no new regressions.
A test case is included to test proper handling of empty aggregate
parameters on both sides of the function call protocol.
llvm-svn: 167090
This patch migrates the stpcpy optimizations from the simplify-libcalls
pass into the instcombine library call simplifier. Note that the
__stpcpy_chk simplifications were migrated in a previous commit.
llvm-svn: 167083
r166198 migrated the strcpy optimization to instcombine. The strcpy
simplifier that was migrated from Transforms/Scalar/SimplifyLibCalls.cpp
was also doing some __strcpy_chk simplifications. Those fortified
simplifications were migrated as well, but introduced a bug in the
__stpcpy_chk simplifier in the process. This happened because the
__strcpy_chk and __stpcpy_chk simplifiers were both mapped to StrCpyChkOpt
which was updated with simplifications that worked for __strcpy_chk, but
not __stpcpy_chk.
This patch fixes the problem by adding proper test coverage and creating a
new simplifier for __stpcpy_chk (instead of sharing one with __strcpy_chk).
llvm-svn: 167082
the first source operand is tied to the destination operand.
This is to accurately model the corresponding instructions where the upper
bits are unmodified.
rdar://12558838
PR14221
llvm-svn: 167064
integers in that the code to handle split alloca-wide integer loads or
stores doesn't come first. It should, for the same reasons as with
integers, and the PR attests to that. Also had to fix a busted assert in
that this test case also covers.
llvm-svn: 167051
This patch adds more support for vector type comparisons using altivec.
It adds correct support for v16i8, v8i16, v4i32, and v4f32 vector
types for comparison operators ==, !=, >, >=, <, and <=.
llvm-svn: 167015
When the switch-to-lookup tables transform landed in SimplifyCFG, it
was pointed out that this could be inappropriate for some targets.
Since there was no way at the time for the pass to know anything about
the target, an awkward reverse-transform was added in CodeGenPrepare
that turned lookup tables back into switches for some targets.
This patch uses the new TargetTransformInfo to determine if a
switch should be transformed, and removes
CodeGenPrepare::ConvertLoadToSwitch.
llvm-svn: 167011
getCastInstrCost had an assert prohibiting scalar to vector casts. Such casts,
however, are allowed. This should make the vectorizer buildbot happier.
llvm-svn: 166998
When the operand is a plain immediate rather than a label, print it
as [pc, #imm] like we do for the Thumb2 wide encoding variant.
rdar://12154503
llvm-svn: 166991
We will make them delay slot forms if there is something that can be
placed in the delay slot during a separate pass. Mips16 extended instructions
cannot be placed in delay slots.
llvm-svn: 166990
is 24 bits not 20 and the decoding needed to correctly handle converting the
J1 and J2 bits to their I1 and I2 values to reconstruct the displacement.
llvm-svn: 166982
%0 = load <8 x i16>* %dest
%1 = shufflevector <8 x i16> %0, <8 x i16> %in,
<8 x i32> < i32 0, i32 1, i32 2, i32 3, i32 13, i32 undef, i32 14, i32 14>
store <8 x i16> %1, <8 x i16>* %dest
We get:
vmovlpd (%eax), %xmm0, %xmm0
instead of:
vmovaps (%eax), %xmm1
vmovsd %xmm1, %xmm0, %xmm0
No extra test-case is added. I just fixed the existing one
(also it uses FileCheck now).
llvm-svn: 166971
ELF ABI.
A varargs parameter consisting of a single-precision floating-point value,
or of a single-element aggregate containing a single-precision floating-point
value, must be passed in the low-order (rightmost) four bytes of the
doubleword stack slot reserved for that parameter. If there are GPR protocol
registers remaining, the parameter must also be mirrored in the low-order
four bytes of the reserved GPR.
Prior to this patch, such parameters were being passed in the high-order
four bytes of the stack slot and the mirrored GPR.
The patch adds a new test case to verify the correct code generation.
llvm-svn: 166968
checks to avoid performing compile-time arithmetic on PPCDoubleDouble.
Now that APFloat supports arithmetic on PPCDoubleDouble, those checks
are no longer needed, and we can treat the type like any other.
llvm-svn: 166958
Partial copies can show up even when CoalescerPair.isPartial() returns
false. For example:
%vreg24:dsub_0<def> = COPY %vreg31:dsub_0; QPR:%vreg24,%vreg31
Such a partial-partial copy is not good enough for the transformation
adjustCopiesBackFrom() needs to do.
llvm-svn: 166944
incorrect instruction sequence due to it not being aware that an
inline assembly instruction may reference memory.
This patch fixes the problem by causing the scheduler to always assume that any
inline assembly code instruction could access memory. This is necessary because
the internal representation of the inline instruction does not include
any information about memory accesses.
This should fix PR13504.
llvm-svn: 166929
ELF subtarget.
The existing logic is used as a fallback to avoid any changes to the Darwin
ABI. PPC64 ELF now has two possible data layout strings: one for FreeBSD,
which requires 8-byte alignment, and a default string that requires
16-byte alignment.
I've added a test for PPC64 Linux to verify the 16-byte alignment. If
somebody wants to add a separate test for FreeBSD, that would be great.
Note that there is a companion patch to update the alignment information
in Clang, which I am committing now as well.
llvm-svn: 166928
output of both
llvm-extract foo.ll -func=bar
and
llvm-extract foo.ll -func=bar -delete
so the two new files could not be linked together anymore. With this change
alias are handled almost like functions and global variables. Almost because
with alias we cannot just clear the initializer/body, we have to create a new
declaration and replace the alias with it.
The net result is that now the output of the above commands can be linked
even if foo.ll has aliases.
llvm-svn: 166907
Previously mips16 was sharing the pattern addr which is used for mips32
and mips64. This had a number of problems:
1) Storing and loading byte and halfword quantities for mips16 has particular
problems due to the primarily non mips16 nature of SP. When we must
load/store byte/halfword stack objects in a function, we must create a mips16
alias register for SP. This functionality is tested in stchar.ll.
2) We need to have an FP register under certain conditions (such as
dynamically sized alloca). We use mips16 register S0 for this purpose.
In this case, we also use this register when accessing frame objects so this
issue also affects the complex pattern addr16. This functionality is
tested in alloca16.ll.
The Mips16InstrInfo.td has been updated to use addr16 instead of addr.
The complex pattern C++ function for addr has been copied to addr16 and
updated to reflect the above issues.
llvm-svn: 166897
This turns loops like
for (unsigned i = 0; i != n; ++i)
p[i] = p[i+1];
into memmove, which has a highly optimized implementation in most libcs.
This was really easy with the new DependenceAnalysis :)
llvm-svn: 166875
Requires a lot less code and complexity on loop-idiom's side and the more
precise analysis can catch more cases, like the one I included as a test case.
This also fixes the edge-case miscompilation from PR9481.
Compile time performance seems to be slightly worse, but this is mostly due
to an extra LCSSA run scheduled by the PassManager and should be fixed there.
llvm-svn: 166874
Add getCostXXX calls for different families of opcodes, such as casts, arithmetic, cmp, etc.
Port the LoopVectorizer to the new API.
The LoopVectorizer now finds instructions which will remain uniform after vectorization. It uses this information when calculating the cost of these instructions.
llvm-svn: 166836
Keep the integer_insertelement test case, the new coalescer can handle
this kind of lane insertion without help from pseudo-instructions.
llvm-svn: 166835
The LoopSimplify bug is pretty harmless because the loop goes from unanalyzable
to analyzable but the LCSSA bug is very nasty. It only comes into play with a
specific order of the LoopPassManager worklist and can cause actual
miscompilations, when a SCEV refers to a value that has been replaced with PHI
node. SCEVExpander may then insert code into the wrong place, either violating
domination or randomly miscompiling stuff.
Comes with an extensive test case reduced from the test-suite with
bugpoint+SCEVValidator.
llvm-svn: 166787
This is the first of several steps to incorporate information from the new
TargetTransformInfo infrastructure into BBVectorize. Two things are done here:
1. Target information is used to determine if it is profitable to fuse two
instructions. This means that the cost of the vector operation must not
be more expensive than the cost of the two original operations. Pairs that
are not profitable are no longer considered (because current cost information
is incomplete, for intrinsics for example, equal-cost pairs are still
considered).
2. The 'cost savings' computed for the profitability check are also used to
rank the DAGs that represent the potential vectorization plans. Specifically,
for nodes of non-trivial depth, the cost savings is used as the node
weight.
The next step will be to incorporate the shuffle costs into the DAG weighting;
this will give the edges of the DAG weights as well. Once that is done, when
target information is available, we should be able to dispense with the
depth heuristic.
llvm-svn: 166716
The isValueEqualityComparison() guard at the top of SimplifySwitch()
only applies to some of the possible transformations.
The newer transformations work just fine on large switches, and the
check on predecessor count is nonsensical.
llvm-svn: 166710
structs having size 3, 5, 6, or 7. Such a struct must be passed and received
as right-justified within its register or memory slot. The problem is only
present for structs that are passed in registers.
Previously, as part of a patch handling all structs of size less than 8, I
added logic to rotate the incoming register so that the struct was left-
justified prior to storing the whole register. This was incorrect because
the address of the parameter had already been adjusted earlier to point to
the right-adjusted value in the storage slot. Essentially I had accidentally
accounted for the right-adjustment twice.
In this patch, I removed the incorrect logic and reorganized the code to make
the flow clearer.
The removal of the rotates changes the expected code generation, so test case
structsinregs.ll has been modified to reflect this. I also added a new test
case, jaggedstructs.ll, to demonstrate that structs of these sizes can now
be properly received and passed.
I've built and tested the code on powerpc64-unknown-linux-gnu with no new
regressions. I also ran the GCC compatibility test suite and verified that
earlier problems with these structs are now resolved, with no new regressions.
llvm-svn: 166680
This patch adds initial PPC64 TOC MC object creation using the small mcmodel
(a single 64K TOC) adding the some TOC relocations (R_PPC64_TOC,
R_PPC64_TOC16, and R_PPC64_TOC16DS).
The addition of 'undefinedExplicitRelSym' hook on 'MCELFObjectTargetWriter'
is meant to avoid the creation of an unreferenced ".TOC." symbol (used in
the .odp creation) as well to set the R_PPC64_TOC relocation target as the
temporary ".TOC." symbol. On PPC64 ABI, the R_PPC64_TOC relocation should
not point to any symbol.
llvm-svn: 166677
smaller integer loads and stores.
The high-level motivation is that the frontend sometimes generates
a single whole-alloca integer load or store during ABI lowering of
splittable allocas. We need to be able to break this apart in order to
see the underlying elements and properly promote them to SSA values. The
hope is that this fixes some performance regressions on x86-32 with the
new SROA pass.
Unfortunately, this causes quite a bit of churn in the test cases, and
bloats some IR that comes out. When we see an alloca that consists soley
of bits and bytes being extracted and re-inserted, we now do some
splitting first, before building widened integer "bucket of bits"
representations. These are always well folded by instcombine however, so
this shouldn't actually result in missed opportunities.
If this splitting of all-integer allocas does cause problems (perhaps
due to smaller SSA values going into the RA), we could potentially go to
some extreme measures to only do this integer splitting trick when there
are non-integer component accesses of an alloca, but discovering this is
quite expensive: it adds yet another complete walk of the recursive use
tree of the alloca.
Either way, I will be watching build bots and LNT bots to see what
fallout there is here. If anyone gets x86-32 numbers before & after this
change, I would be very interested.
llvm-svn: 166662
into a sbc with a positive number, the immediate should be complemented, not
negated. Also added a missing pattern for ARM codegen.
rdar://12559385
llvm-svn: 166613
When the trip count is -1, getSmallConstantTripMultiple could return zero,
and this would cause runtime loop unrolling to assert. Instead of returning
zero, one is now returned (consistent with the existing overflow cases).
Fixes PR14167.
llvm-svn: 166612
- If more than 1 elemennts are defined and target supports the vectorized
conversion, use the vectorized one instead to reduce the strength on
conversion operation.
llvm-svn: 166546
- As there's no 64-bit GPRs in 32-bit mode, a custom conversion from v2u32 to
v2f32 is added to improve the efficiency of the code generated.
llvm-svn: 166545
the difference from "int x" (which should go in registers and
"struct y {int x;}" (which should not).
Clang will be updated in the next patches.
llvm-svn: 166536
loads. It's not really profitable and may result in GVN going into an infinite
loop when it hits constructs like this:
%x = gep %some.type %x, ...
Found via an LTO build of LLVM.
llvm-svn: 166490
%V = mul i64 %N, 4
%t = getelementptr i8* bitcast (i32* %arr to i8*), i32 %V
into
%t1 = getelementptr i32* %arr, i32 %N
%t = bitcast i32* %t1 to i8*
incorporating the multiplication into the getelementptr.
This happens all the time in dragonegg, for example for
int foo(int *A, int N) {
return A[N];
}
because gcc turns this into byte pointer arithmetic before it hits the plugin:
D.1590_2 = (long unsigned int) N_1(D);
D.1591_3 = D.1590_2 * 4;
D.1592_5 = A_4(D) + D.1591_3;
D.1589_6 = *D.1592_5;
return D.1589_6;
The D.1592_5 line is a POINTER_PLUS_EXPR, which is turned into a getelementptr
on a bitcast of A_4 to i8*, so this becomes exactly the kind of IR that the
transform fires on.
An analogous transform (with no testcases!) already existed for bitcasts of
arrays, so I rewrote it to share code with this one.
llvm-svn: 166474
The CFG of the machine function needs to know that the targets of the indirect
branch are successors to the indirect branch.
<rdar://problem/12529625>
llvm-svn: 166448
Per the October 12, 2012 Proposal for annotated disassembly output sent out by
Jim Grosbach this set of changes implements this for X86 and arm. The llvm-mc
tool now has a -mdis option to produced the marked up disassembly and a couple
of small example test cases have been added.
rdar://11764962
llvm-svn: 166445
Unreachable blocks can have invalid instructions. For example,
jump threading can produce self-referential instructions in
unreachable blocks. Also, we should not be spending time
optimizing unreachable code. Fixes PR14133.
llvm-svn: 166423
very small but very important bugfix:
bool shouldExplore(Use *U) {
Value *V = U->get();
if (isa<CallInst>(V) || isa<InvokeInst>(V))
[...]
should have read:
bool shouldExplore(Use *U) {
Value *V = U->getUser();
if (isa<CallInst>(V) || isa<InvokeInst>(V))
Fixes PR14143!
llvm-svn: 166407
This is important for vectors of pointers because only DataLayout,
not the underlying vector type, knows how to calculate the size
of the pointers in the vector. Fixes PR14138.
llvm-svn: 166401
It passes all tests, produces better results than the old code but uses the
wrong pass, LoopDependenceAnalysis, which is old and unmaintained. "Why is it
still in tree?", you might ask. The answer is obviously: "To confuse developers."
Just swapping in the new dependency pass sends the pass manager into an infinte
loop, I'll try to figure out why tomorrow.
llvm-svn: 166399
Requires a lot less code and complexity on loop-idiom's side and the more
precise analysis can catch more cases, like the one I included as a test case.
This also fixes the edge-case miscompilation from PR9481. I'm not entirely
sure that all cases are handled that the old checks handled but LDA will
certainly become smarter in the future.
llvm-svn: 166390
We used a SCEV to detect that A[X] is consecutive. We assumed that X was
the induction variable. But X can be any expression that uses the induction
for example: X = i + 2;
llvm-svn: 166388
This is important for nested-loop reductions such as :
In the innermost loop, the induction variable does not start with zero:
for (i = 0 .. n)
for (j = 0 .. m)
sum += ...
llvm-svn: 166387
If the pointer is consecutive then it is safe to read and write. If the pointer is non-loop-consecutive then
it is unsafe to vectorize it because we may hit an ordering issue.
llvm-svn: 166371
which is supposed to consistently raise SIGTRAP across all systems. In contrast,
__builtin_trap() behave differently on different systems. e.g. it raises SIGTRAP on ARM, and
SIGILL on X86. The purpose of __builtin_debugtrap() is to consistently provide "trap"
functionality, in the mean time preserve the compatibility with on gcc on __builtin_trap().
The X86 backend is already able to handle debugtrap(). This patch is to:
1) make front-end recognize "__builtin_debugtrap()" (emboddied in the one-line change to Clang).
2) In DAG legalization phase, by default, "debugtrap" will be replaced with "trap", which
make the __builtin_debugtrap() "available" to all existing ports without the hassle of
changing their code.
3) If trap-function is specified (via -trap-func=xyz to llc), both __builtin_debugtrap() and
__builtin_trap() will be expanded into the function call of the specified trap function.
This behavior may need change in the future.
The provided testing-case is to make sure 2) and 3) are working for ARM port, and we
already have a testing case for x86.
llvm-svn: 166300
- If INSERT_VECTOR_ELT is supported (above SSE2, either by custom
sequence of legal insn), transform BUILD_VECTOR into SHUFFLE +
INSERT_VECTOR_ELT if most of elements could be built from SHUFFLE with few
(so far 1) elements being inserted.
llvm-svn: 166288
Removed extra stack frame object for fixed byval arguments,
VarArgsStyleRegisters invocation was reworked due to some improper usage in
past. PR14099 also demonstrates it.
llvm-svn: 166273
The LTO Internalize pass is hiding symbols needed by the bugpoint-passes
plug-in. We need to add a flag to control whether Internalize should be run.
This is a temporary workaround to make these tests pass in the meantime.
llvm-svn: 166239
When merging stack slots, if StackColoring::remapInstructions gets a
value back from GetUnderlyingObject that it does not know about or is
not itself a stack slot, clear the memory operand in case it aliases
the merged slot. This prevents the introduction of incorrect aliasing
information.
Author: Matthew Curtis <mcurtis@codeaurora.org>
llvm-svn: 166216
This patch migrates the strcpy optimizations from the simplify-libcalls pass
into the instcombine library call simplifier. Note also that StrCpyChkOpt
has been updated with a few simplifications that were being done in the
simplify-libcalls version of StrCpyOpt, but not in the migrated implementation
of StrCpyOpt. There is no reason to overload StrCpyOpt with fortified and
regular simplifications in the new model since there is already a dedicated
simplifier for __strcpy_chk.
llvm-svn: 166198
test case on PowerPC caused by rounding errors when converting from a 64-bit
integer to a single-precision floating point. The reason for this are
double-rounding effects, since on PowerPC we have to convert to an
intermediate double-precision value first, which gets rounded to the
final single-precision result.
The patch fixes the problem by preparing the 64-bit integer so that the
first conversion step to double-precision will always be exact, and the
final rounding step will result in the correctly-rounded single-precision
result. The generated code sequence is equivalent to what GCC would generate.
When -enable-unsafe-fp-math is in effect, that extra effort is omitted
and we accept possible rounding errors (just like GCC does as well).
llvm-svn: 166178
- Folding (trunc (concat ... X )) to (concat ... (trunc X) ...) is valid
when '...' are all 'undef's.
- r166125 relies on this transformation.
llvm-svn: 166155
- If the extracted vector has the same type of all vectored being concatenated
together, it should be simplified directly into v_i, where i is the index of
the element being extracted.
llvm-svn: 166125
a pointer. A very bad idea. Let's not do that. Fixes PR14105.
Note that this wasn't *that* glaring of an oversight. Originally, these
routines were only called on offsets within an alloca, which are
intrinsically positive. But over the evolution of the pass, they ended
up being called for arbitrary offsets, and things went downhill...
llvm-svn: 166095
- MBB address is only valid as an immediate value in Small & Static
code/relocation models. On other models, LEA is needed to load IP address of
the restore MBB.
- A minor fix of MBB in MC lowering is added as well to enable target
relocation flag being propagated into MC.
llvm-svn: 166084
PR14098 contains an example where we would rematerialize a MOV8ri
immediately after the original instruction:
%vreg7:sub_8bit<def> = MOV8ri 9; GR32_ABCD:%vreg7
%vreg22:sub_8bit<def> = MOV8ri 9; GR32_ABCD:%vreg7
Besides being pointless, it is also wrong since the original instruction
only redefines part of the register, and the value read by the new
instruction is wrong.
The problem was the LiveRangeEdit::allUsesAvailableAt() didn't
special-case OrigIdx == UseIdx and found the wrong SSA value.
llvm-svn: 166068
An obfuscated splat is where the frontend poorly generates code for a splat
using several different shuffles to create the splat, i.e.,
%A = load <4 x float>* %in_ptr, align 16
%B = shufflevector <4 x float> %A, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 undef, i32 undef>
%C = shufflevector <4 x float> %B, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
%D = shufflevector <4 x float> %C, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
llvm-svn: 166061
For the PowerPC 64-bit ELF Linux ABI, aggregates of size less than 8
bytes are to be passed in the low-order bits ("right-adjusted") of the
doubleword register or memory slot assigned to them. A previous patch
addressed this for aggregates passed in registers. However, small
aggregates passed in the overflow portion of the parameter save area are
still being passed left-adjusted.
The fix is made in PPCTargetLowering::LowerCall_Darwin_Or_64SVR4 on the
caller side, and in PPCTargetLowering::LowerFormalArguments_64SVR4 on
the callee side. The main fix on the callee side simply extends
existing logic for 1- and 2-byte objects to 1- through 7-byte objects,
and correcting a constant left over from 32-bit code. There is also a
fix to a bogus calculation of the offset to the following argument in
the parameter save area.
On the caller side, again a constant left over from 32-bit code is
fixed. Additionally, some code for 1, 2, and 4-byte objects is
duplicated to handle the 3, 5, 6, and 7-byte objects for SVR4 only. The
LowerCall_Darwin_Or_64SVR4 logic is getting fairly convoluted trying to
handle both ABIs, and I propose to separate this into two functions in a
future patch, at which time the duplication can be removed.
The patch adds a new test (structsinmem.ll) to demonstrate correct
passing of structures of all seven sizes. Eight dummy parameters are
used to force these structures to be in the overflow portion of the
parameter save area.
As a side effect, this corrects the case when aggregates passed in
registers are saved into the first eight doublewords of the parameter
save area: Previously they were stored left-justified, and now are
properly stored right-justified. This requires changing the expected
output of existing test case structsinregs.ll.
llvm-svn: 166022
Stack is formed improperly for long structures passed as byval arguments for
EABI mode.
If we took AAPCS reference, we can found the next statements:
A: "If the argument requires double-word alignment (8-byte), the NCRN (Next
Core Register Number) is rounded up to the next even register number." (5.5
Parameter Passing, Stage C, C.3).
B: "The alignment of an aggregate shall be the alignment of its most-aligned
component." (4.3 Composite Types, 4.3.1 Aggregates).
So if we have structure with doubles (9 double fields) and 3 Core unused
registers (r1, r2, r3): caller should use r2 and r3 registers only.
Currently r1,r2,r3 set is used, but it is invalid.
Callee VA routine should also use r2 and r3 regs only. All is ok here. This
behaviour is guessed by rounding up SP address with ADD+BFC operations.
Fix:
Main fix is in ARMTargetLowering::HandleByVal. If we detected AAPCS mode and
8 byte alignment, we waste odd registers then.
P.S.:
I also improved LDRB_POST_IMM regression test. Since ldrb instruction will
not generated by current regression test after this patch.
llvm-svn: 166018
Original message:
The attached is the fix to radar://11663049. The optimization can be outlined by following rules:
(select (x != c), e, c) -> select (x != c), e, x),
(select (x == c), c, e) -> select (x == c), x, e)
where the <c> is an integer constant.
The reason for this change is that : on x86, conditional-move-from-constant needs two instructions;
however, conditional-move-from-register need only one instruction.
While the LowerSELECT() sounds to be the most convenient place for this optimization, it turns out to be a bad place. The reason is that by replacing the constant <c> with a symbolic value, it obscure some instruction-combining opportunities which would otherwise be very easy to spot. For that reason, I have to postpone the change to last instruction-combining phase.
The change passes the test of "make check-all -C <build-root/test" and "make -C project/test-suite/SingleSource".
Original message since r165661:
My previous change has a bug: I negated the condition code of a CMOV, and go ahead creating a new CMOV using the *ORIGINAL* condition code.
llvm-svn: 166017
- Besides used in SjLj exception handling, __builtin_setjmp/__longjmp is also
used as a light-weight replacement of setjmp/longjmp which are used to
implementation continuation, user-level threading, and etc. The support added
in this patch ONLY addresses this usage and is NOT intended to support SjLj
exception handling as zero-cost DWARF exception handling is used by default
in X86.
llvm-svn: 165989
includes extracting ints for copying elsewhere and inserting ints when
copying into the alloca. This should fix the CanSROA assertion coming
out of Clang's regression test suite.
llvm-svn: 165931
cases where we have partial integer loads and stores to an otherwise
promotable alloca to widen[1] those loads and stores to cover the entire
alloca and bitcast them into the appropriate type such that promotion
can proceed.
These partial loads and stores stem from an annoying confluence of ARM's
calling convention and ABI lowering and the FCA pre-splitting which
takes place in SROA. Clang lowers a { double, double } in-register
function argument as a [4 x i32] function argument to ensure it is
placed into integer 32-bit registers (a really unnerving implicit
contract between Clang and the ARM backend I would add). This results in
a FCA load of [4 x i32]* from the { double, double } alloca, and SROA
decomposes this into a sequence of i32 loads and stores. Inlining
proceeds, code gets folded, but at the end of the day, we still have i32
stores to the low and high halves of a double alloca. Widening these to
be i64 operations, and bitcasting them to double prior to loading or
storing allows promotion to proceed for these allocas.
I looked quite a bit changing the IR which Clang produces for this case
to be more friendly, but small changes seem unlikely to help. I think
the best representation we could use currently would be to pass 4 i32
arguments thereby avoiding any FCAs, but that would still require this
fix. It seems like it might eventually be nice to somehow encode the ABI
register selection choices outside of the parameter type system so that
the parameter can be a { double, double }, but the CC register
annotations indicate that this should be passed via 4 integer registers.
This patch does not address the second problem in PR14059, which is the
reverse: when a struct alloca is loaded as a *larger* single integer.
This patch also does not address some of the code quality issues with
the FCA-splitting. Those don't actually impede any optimizations really,
but they're on my list to clean up.
[1]: Pedantic footnote: for those concerned about memory model issues
here, this is safe. For the alloca to be promotable, it cannot escape or
have any use of its address that could allow these loads or stores to be
racing. Thus, widening is always safe.
llvm-svn: 165928
This patch migrates the strcmp and strncmp optimizations from the
simplify-libcalls pass into the instcombine library call simplifier.
llvm-svn: 165915
The new coalescer can merge a dead def into an unused lane of an
otherwise live vector register.
Clear the <dead> flag when that happens since the flag refers to the
full virtual register which is still live after the partial dead def.
This fixes PR14079.
llvm-svn: 165877
This patch migrates the strchr and strrchr optimizations from the
simplify-libcalls pass into the instcombine library call simplifier.
llvm-svn: 165875
This patch migrates the strcat and strncat optimizations from the
simplify-libcalls pass into the instcombine library call simplifier.
llvm-svn: 165874
It is possible that the live range of the value being pruned loops back
into the kill MBB where the search started. When that happens, make sure
that the beginning of KillMBB is also pruned.
Instead of starting a DFS at KillMBB and skipping the root of the
search, start a DFS at each KillMBB successor, and allow the search to
loop back to KillMBB.
This fixes PR14078.
llvm-svn: 165872
type coercion code, especially when targetting ARM. Things like [1
x i32] instead of i32 are very common there.
The goal of this logic is to ensure that when we are picking an alloca
type, we look through such wrapper aggregates and across any zero-length
aggregate elements to find the simplest type possible to form a type
partition.
This logic should (generally speaking) rarely fire. It only ends up
kicking in when an alloca is accessed using two different types (for
instance, i32 and float), and the underlying alloca type has wrapper
aggregates around it. I noticed a significant amount of this occurring
looking at stepanov_abstraction generated code for arm, and suspect it
happens elsewhere as well.
Note that this doesn't yet address truly heinous IR productions such as
PR14059 is concerning. Those result in mismatched *sizes* of types in
addition to mismatched access and alloca types.
llvm-svn: 165870
X86 doesn't have i8 cmovs so isel would emit a branch. Emitting branches at this
level is often not a good idea because it's too late for many optimizations to
kick in. This solution doesn't add any extensions (truncs are free) and tries
to avoid introducing partial register stalls by filtering direct copyfromregs.
I'm seeing a ~10% speedup on reading a random .png file with libpng15 via
graphicsmagick on x86_64/westmere, but YMMV depending on the microarchitecture.
llvm-svn: 165868
local frame causes problem.
For example:
void f(StructToPass s) {
g(&s, sizeof(s));
}
will cause problem with tail-call since part of s is passed via registers and
saved in f's local frame. When g tries to access s, part of s may be corrupted
since f's local frame is popped out before the tail-call.
The current fix is to disable tail-call if getVarArgsRegSaveSize is not 0 for
the caller. This is a conservative approach, if we can prove the address of
s or part of s is not taken and passed to g, it should be okay to perform
tail-call.
rdar://12442472
llvm-svn: 165853
The backend already pattern matches to form VBSL when it can. We may want to
teach it to use the vbsl intrinsics at some point to prevent machine licm from
mucking with this, but using the Expand is completely correct.
http://llvm.org/bugs/show_bug.cgi?id=13831http://llvm.org/bugs/show_bug.cgi?id=13961
Patch by Peter Couperus <peter.couperus@st.com>.
llvm-svn: 165845
Completely update one interval at a time instead of collecting live
range fragments to be updated. This avoids building data structures,
except for a single SmallPtrSet of updated intervals.
Also share code between handleMove() and handleMoveIntoBundle().
Add support for moving dead defs across other live values in the
interval. The MI scheduler can do that.
llvm-svn: 165824
PHIElimination inserts IMPLICIT_DEF instructions to guarantee that all
PHI predecessors have a live-out value. These IMPLICIT_DEF values are
not considered to be real interference when coalescing virtual
registers:
%vreg1 = IMPLICIT_DEF
%vreg2 = MOV32r0
When joining %vreg1 and %vreg2, the IMPLICIT_DEF instruction and its
value number should simply be erased since the %vreg2 value number now
provides a live-out value for the PHI predecesor block.
llvm-svn: 165813
On PowerPC, a bitcast of <16 x i8> to i128 may run through a code
path in ExpandRes_BITCAST that attempts to do an intermediate
bitcast to a <4 x i32> vector, and then construct the Hi and Lo parts
of the resulting i128 by pairing up two of those i32 vector elements
each. The code already recognizes that on a big-endian system, the
first two vector elements form the Hi part, and the final two vector
elements form the Lo part (vice-versa from the little-endian situation).
However, we also need to take endianness into account when forming each
of those separate pairs: on a big-endian system, vector element 0 is
the *high* part of the pair making up the Hi part of the result, and
vector element 1 is the low part of the pair. The code currently always
uses vector element 0 as the low part and vector element 1 as the high
part, as is appropriate for little-endian platforms only.
This patch fixes this by swapping the vector elements as they are
paired up as appropriate.
llvm-svn: 165802
not legal. However, it should use a div instruction + mul + sub if divide is
legal. The rem legalization code was missing a check and incorrectly uses a
divrem libcall even when div is legal.
rdar://12481395
llvm-svn: 165778
to the instruction position. The old encoding would give an absolute
ID which counts up within a function, and only resets at the next function.
I.e., Instead of having:
... = icmp eq i32 n-1, n-2
br i1 ..., label %bb1, label %bb2
it will now be roughly:
... = icmp eq i32 1, 2
br i1 1, label %bb1, label %bb2
This makes it so that ids remain relatively small and can be encoded
in fewer bits.
With this encoding, forward reference operands will be given
negative-valued IDs. Use signed VBRs for the most common case
of forward references, which is phi instructions.
To retain backward compatibility we bump the bitcode version
from 0 to 1 to distinguish between the different encodings.
llvm-svn: 165739
Not all instructions define a virtual register in their first operand.
Specifically, INLINEASM has a different format.
<rdar://problem/12472811>
llvm-svn: 165721
For function calls on the 64-bit PowerPC SVR4 target, each parameter
is mapped to as many doublewords in the parameter save area as
necessary to hold the parameter. The first 13 non-varargs
floating-point values are passed in registers; any additional
floating-point parameters are passed in the parameter save area. A
single-precision floating-point parameter (32 bits) must be mapped to
the second (rightmost, low-order) word of its assigned doubleword
slot.
Currently LLVM violates this ABI requirement by mapping such a
parameter to the first (leftmost, high-order) word of its assigned
doubleword slot. This is internally self-consistent but will not
interoperate correctly with libraries compiled with an ABI-compliant
compiler.
This patch corrects the problem by adjusting the parameter addressing
on both sides of the calling convention.
llvm-svn: 165714
Note: [D]M{T,F}CP2 is just a recommended encoding. Vendors often provide a
custom CP2 that interprets instructions differently and may wish to add their
own instructions that use this opcode. We should ensure that this is easy to
do. I will probably add a 'has custom CP{0-3}' subtarget flag to make this
easy: We want to avoid the GCC situation where every MIPS vendor makes a custom
fork that breaks every other MIPS CPU and so can't be merged upstream.
llvm-svn: 165711
Patch from Preston Briggs <preston.briggs@gmail.com>.
This is an updated version of the dependence-analysis patch, including an MIV
test based on Banerjee's inequalities.
It's a fairly complete implementation of the paper
Practical Dependence Testing
Gina Goff, Ken Kennedy, and Chau-Wen Tseng
PLDI 1991
It cannot yet propagate constraints between coupled RDIV subscripts (discussed
in Section 5.3.2 of the paper).
It's organized as a FunctionPass with a single entry point that supports testing
for dependence between two instructions in a function. If there's no dependence,
it returns null. If there's a dependence, it returns a pointer to a Dependence
which can be queried about details (what kind of dependence, is it loop
independent, direction and distance vector entries, etc). I haven't included
every imaginable feature, but there's a good selection that should be adequate
for supporting many loop transformations. Of course, it can be extended as
necessary.
Included in the patch file are many test cases, commented with C code showing
the loops and array references.
llvm-svn: 165708
value but later turns out to be a function.
Unfortunately, we can't fold tests into a single file because we only get one
error out of llvm-as.
llvm-svn: 165680
Original message:
The attached is the fix to radar://11663049. The optimization can be outlined by following rules:
(select (x != c), e, c) -> select (x != c), e, x),
(select (x == c), c, e) -> select (x == c), x, e)
where the <c> is an integer constant.
The reason for this change is that : on x86, conditional-move-from-constant needs two instructions;
however, conditional-move-from-register need only one instruction.
While the LowerSELECT() sounds to be the most convenient place for this optimization, it turns out to be a bad place. The reason is that by replacing the constant <c> with a symbolic value, it obscure some instruction-combining opportunities which would otherwise be very easy to spot. For that reason, I have to postpone the change to last instruction-combining phase.
The change passes the test of "make check-all -C <build-root/test" and "make -C project/test-suite/SingleSource".
llvm-svn: 165661
the compiler makes use of GPR0. However, there are two flavors of
GPR0 defined by the target: the 32-bit GPR0 (R0) and the 64-bit GPR0
(X0). The spill/reload code makes use of R0 regardless of whether we
are generating 32- or 64-bit code.
This patch corrects the problem in the obvious manner, using X0 and
ADDI8 for 64-bit and R0 and ADDI for 32-bit.
llvm-svn: 165658
the Altivec extensions were introduced. Its use is optional, and
allows the compiler to communicate to the operating system which
vector registers should be saved and restored during a context switch.
In practice, this information is ignored by the various operating
systems using the SVR4 ABI; the kernel saves and restores the entire
register state. Setting the VRSAVE register is no longer performed by
the AIX XL compilers, the IBM i compilers, or by GCC on Power Linux
systems. It seems best to avoid this logic within LLVM as well.
This patch avoids generating code to update and restore VRSAVE for the
PowerPC SVR4 ABIs (32- and 64-bit). The code remains in place for the
Darwin ABI.
llvm-svn: 165656
- Due to the current matching vector elements constraints in
ISD::FP_ROUND, rounding from v2f64 to v4f32 (after legalization from
v2f32) is scalarized. Add a customized v2f32 widening to convert it
into a target-specific X86ISD::VFPROUND to work around this
constraints.
llvm-svn: 165631
SDNode for LDRB_POST_IMM is invalid: number of registers added to SDNode fewer
that described in .td.
7 ops is needed, but SDNode with only 6 is created.
In more details:
In ARMInstrInfo.td, in multiclass AI2_ldridx, in definition _POST_IMM, offset
operand is defined as am2offset_imm. am2offset_imm is complex parameter type,
and actually it consists from dummy register and imm itself. As I understood
trick with dummy reg was made for AsmParser. In ARMISelLowering.cpp, this dummy
register was not added to SDNode, and it cause crash in Peephole Optimizer pass.
The problem fixed by setting up additional dummy reg when emitting
LDRB_POST_IMM instruction.
llvm-svn: 165617
SchedulerDAGInstrs::buildSchedGraph ignores dependencies between FixedStack
objects and byval parameters. So loading byval parameters from stack may be
inserted *before* it will be stored, since these operations are treated as
independent.
Fix:
Currently ARMTargetLowering::LowerFormalArguments saves byval registers with
FixedStack MachinePointerInfo. To fix the problem we need to store byval
registers with MachinePointerInfo referenced to first the "byval" parameter.
Also commit adds two new fields to the InputArg structure: Function's argument
index and InputArg's part offset in bytes relative to the start position of
Function's argument. E.g.: If function's argument is 128 bit width and it was
splitted onto 32 bit regs, then we got 4 InputArg structs with same arg index,
but different offset values.
llvm-svn: 165616
This patch provides initial implementation of load address
macro instruction for Mips. We have implemented two kinds
of expansions with their variations depending on the size
of immediate operand:
1) load address with immediate value directly:
* la d,j => addiu d,$zero,j (for -32768 <= j <= 65535)
* la d,j => lui d,hi16(j)
ori d,d,lo16(j) (for any other 32 bit value of j)
2) load load address with register offset value
* la d,j(s) => addiu d,s,j (for -32768 <= j <= 65535)
* la d,j(s) => lui d,hi16(j) (for any other 32 bit value of j)
ori d,d,lo16(j)
addu d,d,s
This patch does not cover the case when the address is loaded
from the value of the label or function.
Contributer: Vladimir Medic
llvm-svn: 165561
DeadArgumentElimination pass can replace one LLVM function with another,
invalidating a pointer stored in debug info metadata entry for this function.
To fix this, we collect debug info descriptors for functions before
running a DeadArgumentElimination pass and "patch" pointers in metadata nodes
if we replace a function.
llvm-svn: 165490
When the CFG contains a loop with multiple entry blocks, the traces
computed by MachineTraceMetrics don't always have the same nice
properties. Loop back-edges are normally excluded from traces, but
MachineLoopInfo doesn't recognize loops with multiple entry blocks, so
those back-edges may be included.
Avoid asserting when that happens by adding an isEarlierInSameTrace()
function that accurately determines if a dominating block is part of the
same trace AND is above the currrent block in the trace.
llvm-svn: 165434
Vector compare using altivec 'vcmpxxx' instructions have as third argument
a vector register instead of CR one, different from integer and float-point
compares. This leads to a failure in code generation, where 'SelectSETCC'
expects a DAG with a CR register and gets vector register instead.
This patch changes the behavior by just returning a DAG with the
vector compare instruction based on the type. The patch also adds a testcase
for all vector types llvm defines.
It also included a fix on signed 5-bits predicates printing, where
signed values were not handled correctly as signed (char are unsigned by
default for PowerPC). This generates 'vspltisw' (vector splat)
instruction with SIM out of range.
llvm-svn: 165419
are in fact identity operations. We detect these and kill their
partitions so that even splitting is unaffected by them. This is
particularly important because Clang relies on emitting identity memcpy
operations for struct copies, and these fold away to constants very
often after inlining.
Fixes the last big performance FIXME I have on my plate.
llvm-svn: 165285
Make sure functions located in user specified text sections (via the
section attribute) are located together with the default text sections.
Otherwise, for large object files, the relocations for call instructions
are more likely to be out of range. This becomes even more likely in the
presence of LTO.
rdar://12402636
llvm-svn: 165254
a) frame setup instructions define the prologue
b) we shouldn't change our location mid-stream
Add a test to make sure that the stack adjustment stays within
the prologue.
llvm-svn: 165250
We conservatively only check the first use to avoid walking long use chains.
This catches the common case of having both a load and a store to a pointer
supplied by a PHI node.
llvm-svn: 165232
cpyDest can be mutated in some cases, which would then cause a crash later if
indeed the memory was underaligned. This brought down several buildbots, so
I guess the underaligned case is much more common than I thought!
llvm-svn: 165228
Currently, we re-visit allocas when something changes about the way they
might be *split* to allow better scalarization to take place. However,
we weren't handling the case when the *promotion* is what would change
the behavior of SROA. When an address derived from an alloca is stored
into another alloca, we consider the first to have escaped. If the
second is ever promoted to an SSA value, we will suddenly be able to run
the SROA pass on the first alloca.
This patch adds explicit support for this form if iteration. When we
detect a store of a pointer derived from an alloca, we flag the
underlying alloca for reprocessing after promotion. The logic works hard
to only do this when there is definitely going to be promotion and it
might remove impediments to the analysis of the alloca.
Thanks to Nick for the great test case and Benjamin for some sanity
check review.
llvm-svn: 165223
was less aligned than the old. In the testcase this results in an overaligned
memset: the memset alignment was correct for the original memory but is too much
for the new memory. Fix this by either increasing the alignment of the new
memory or bailing out if that isn't possible. Should fix the gcc-4.7 self-host
buildbot failure.
llvm-svn: 165220
Sorry for this being broken so long. =/
As part of this, switch all of the existing tests to be Little Endian,
which is the behavior I was asserting in them anyways! Add in a new
big-endian test that checks the interesting behavior there.
Another part of this is to tighten the rules abotu when we perform the
full-integer promotion. This logic now rejects cases where there fully
promoted integer is a non-multiple-of-8 bitwidth or cases where the
loads or stores touch bits which are in the allocated space of the
alloca but are not loaded or stored when accessing the integer. Sadly,
these aren't really observable today as the rest of the pass will
already ensure the invariants hold. However, the latter situation is
likely to become a potential concern in the future.
Thanks to Benjamin and Duncan for early review of this patch. I'm still
looking into whether there are further endianness issues, please let me
know if anyone sees BE failures persisting past this.
llvm-svn: 165219
macro instruction (li) in the assembler.
We have identified three possible expansions depending on
the size of immediate operand:
1) for 0 ≤ j ≤ 65535.
li d,j =>
ori d,$zero,j
2) for −32768 ≤ j < 0.
li d,j =>
addiu d,$zero,j
3) for any other value of j that is representable as a 32-bit integer.
li d,j =>
lui d,hi16(j)
ori d,d,lo16(j)
All of the above have been implemented in ths patch.
Contributer: Vladimir Medic
llvm-svn: 165199
.set option
The patch implements following options
at - lets the assembler use the $at register for macros,
but generates warnings if the source program uses $at
noat - let source programs use $at without issuingwarnings.
noreorder - prevents the assembler from reordering machine
language instructions.
nomacro - causes the assembler to print a warning whenever
an assembler operation generates more than one
machine language instruction.
macro - lets the assembler generate multiple machine instructions
from a single assembler instruction
reorder - lets the assembler reorder machine language
instructions to improve performance
The above variants are parsed and their boolean values set or unset.
The code to actually use them will come later.
Following options are not implemented yet:
nomips16
nomicromips
move
nomove
Contributer: Vladimir Medic
llvm-svn: 165194
in the Intel syntax.
The MC layer supports emitting in the Intel syntax, but this would require the
inline assembly MachineInstr to be lowered to an MCInst before emission. This
is potential future work, but for now emitting directly from the MachineInstr
suffices.
llvm-svn: 165173
multiple stores with a single load. We create the wide loads and stores (and their chains)
before we remove the scalar loads and stores and fix the DAG chain. We attempted to merge
loads with a different chain. When that happened, the assumption that it is safe to RAUW
broke and a cycle was introduced.
llvm-svn: 165148
is not profitable in many cases because modern processors perform multiple stores
in parallel and merging stores prior to merging requires extra work. We handle two main cases:
1. Store of multiple consecutive constants:
q->a = 3;
q->4 = 5;
In this case we store a single legal wide integer.
2. Store of multiple consecutive loads:
int a = p->a;
int b = p->b;
q->a = a;
q->b = b;
In this case we load/store either ilegal vector registers or legal wide integer registers.
llvm-svn: 165125
a memcpy to reflect that '0' has a different meaning when applied to
a load or store. Now we correctly use underaligned loads and stores for
the test case added.
llvm-svn: 165101
necessary during rewriting. As part of this, fix a real think-o here
where we might have left off an alignment specification when the address
is in fact underaligned. I haven't come up with any way to trigger this,
as there is always some other factor that reduces the alignment, but it
certainly might have been an observable bug in some way I can't think
of. This also slightly changes the strategy for placing explicit
alignments on loads and stores to only do so when the alignment does not
match that required by the ABI. This causes a few redundant alignments
to go away from test cases.
I've also added a couple of tests that really push on the alignment that
we end up with on loads and stores. More to come here as I try to fix an
underlying bug I have conjectured and produced test cases for, although
it's not clear if this bug is the one currently hitting dragonegg's
gcc47 bootstrap.
llvm-svn: 165100
Enable the pass by default for targets that request it, and change the
-enable-early-ifcvt to the opposite -disable-early-ifcvt.
There are still some x86 regressions when enabling early if-conversion
because of the missing machine models. Disable the pass for x86 until
machine models are added.
llvm-svn: 165075
X86DAGToDAGISel::PreprocessISelDAG(), isel is moving load inside
callseq_start / callseq_end so it can be folded into a call. This can
create a cycle in the DAG when the call is glued to a copytoreg. We
have been lucky this hasn't caused too many issues because the pre-ra
scheduler has special handling of call sequences. However, it has
caused a crash in a specific tailcall case.
rdar://12393897
llvm-svn: 165072
scheduled for processing on the worklist eventually gets deleted while
we are processing another alloca, fixing the original test case in
PR13990.
To facilitate this, add a remove_if helper to the SetVector abstraction.
It's not easy to use the standard abstractions for this because of the
specifics of SetVectors types and implementation.
Finally, a nice small test case is included. Thanks to Benjamin for the
fantastic reduced test case here! All I had to do was delete some empty
basic blocks!
llvm-svn: 165065
JoinVals::pruneValues() calls LIS->pruneValue() to avoid conflicts when
overlapping two different values. This produces a set of live range end
points that are used to reconstruct the live range (with SSA update)
after joining the two registers.
When a value is pruned twice, the set of end points was insufficient:
v1 = DEF
v1 = REPLACE1
v1 = REPLACE2
KILL v1
The end point at KILL would only reconstruct the live range from
REPLACE2 to KILL, leaving the range REPLACE1-REPLACE2 dead.
Add REPLACE2 as an end point in this case so the full live range is
reconstructed.
This fixes PR13999.
llvm-svn: 165056
This adds 'elf' as a recognized target triple environment value and overrides the default generated object format on Windows platforms if that value is present. This patch also enables MCJIT tests on Windows using the new environment value.
llvm-svn: 165030
The target backend can support data-in-code load commands even when
the assembler doesn't, or vice-versa. Allow targets to opt-in for
direct-to-object.
PR13973.
llvm-svn: 164974
- Update maximal stack alignment when stack arguments are prepared before a
call.
- Test cases are enhanced to show it's not a Win32 specific issue but a generic
one.
llvm-svn: 164946
alignment requirements of the new alloca. As one consequence which was
reported as a bug by Duncan, we overaligned memcpy calls to ranges of
allocas after they were rewritten to types with lower alignment
requirements. Other consquences are possible, but I don't have any test
cases for them.
llvm-svn: 164937
a pair of instructions, one for the used pointer and the second for the
user. This simplifies the representation and also makes it more dense.
This was noticed because of the miscompile in PR13926. In that case, we
were running up against a fundamental "bad idea" in the speculation of
PHI and select instructions: the speculation and rewriting are
interleaved, which requires phi speculation to also perform load
rewriting! This is bad, and causes us to miss opportunities to do (for
example) vector rewriting only exposed after PHI speculation, etc etc.
It also, in the old system, required us to insert *new* load uses into
the current partition's use list, which would then be ignored during
rewriting because we had already extracted an end iterator for the use
list. The appending behavior (and much of the other oddities) stem from
the strange de-duplication strategy in the PartitionUse builder.
Amusingly, all this went without notice for so long because it could
only be triggered by having *different* GEPs into the same partition of
the same alloca, where both different GEPs were operands of a single
PHI, and where the GEP which was not encountered first also had multiple
uses within that same PHI node... Hence the insane steps required to
reproduce.
So, step one in fixing this fundamental bad idea is to make the
PartitionUse actually contain a Use*, and to make the builder do proper
deduplication instead of funky de-duplication. This is enough to remove
the appending behavior, and fix the miscompile in PR13926, but there is
more work to be done here. Subsequent commits will lift the speculation
into its own visitor. It'll be a useful step toward potentially
extracting all of the speculation logic into a generic utility
transform.
The existing PHI test case for repeated operands has been made more
extreme to catch even these issues. This test case, run through the old
pass, will exactly reproduce the miscompile from PR13926. ;] We were so
close here!
llvm-svn: 164925
source of false positives due to globals being declared in a header with some
kind of incomplete (small) type, but the actual definition being bigger.
llvm-svn: 164912
because moden processos can store multiple values in parallel, and preparing the consecutive store requires
some work. We only handle these cases:
1. Consecutive stores where the values and consecutive loads. For example:
int a = p->a;
int b = p->b;
q->a = a;
q->b = b;
2. Consecutive stores where the values are constants. Foe example:
q->a = 4;
q->b = 5;
llvm-svn: 164910
alignment could lose it due to the alloca type moving down to a much
smaller alignment guarantee.
Now SROA will actively compute a proper alignment, factoring the target
data, any explicit alignment, and the offset within the struct. This
will in some cases lower the alignment requirements, but when we lower
them below those of the type, we drop the alignment entirely to give
freedom to the code generator to align it however is convenient.
Thanks to Duncan for the lovely test case that pinned this down. =]
llvm-svn: 164891
buildbots. Original commit message:
A DAGCombine optimization for merging consecutive stores. This optimization is not profitable in many cases
because moden processos can store multiple values in parallel, and preparing the consecutive store requires
some work. We only handle these cases:
1. Consecutive stores where the values and consecutive loads. For example:
int a = p->a;
int b = p->b;
q->a = a;
q->b = b;
2. Consecutive stores where the values are constants. Foe example:
q->a = 4;
q->b = 5;
llvm-svn: 164890
because moden processos can store multiple values in parallel, and preparing the consecutive store requires
some work. We only handle these cases:
1. Consecutive stores where the values and consecutive loads. For example:
int a = p->a;
int b = p->b;
q->a = a;
q->b = b;
2. Consecutive stores where the values are constants. Foe example:
q->a = 4;
q->b = 5;
llvm-svn: 164885
If the width is very large it gets truncated from uint64_t to uint32_t when
passed to TD->fitsInLegalInteger. The truncated value can fit in a register.
This manifested in massive memory usage or crashes (PR13946).
llvm-svn: 164784
If the offset is more than 24-bits, it won't fit in a scattered
relocation offset field, so we fall back to using a non-scattered
relocation.
rdar://12358909
llvm-svn: 164724
teach the callgraph logic to not create callgraph edges to intrinsics for invoke
instructions; it already skips this for call instructions. Fixes PR13903.
llvm-svn: 164707
- Put statistics in alphabetical order
- Don't use getZextValue when building TableInt, just use APInts
- Introduce Create{Z,S}ExtOrTrunc in IRBuilder.
llvm-svn: 164696
rewriter in SROA to carry a proper alignment. This involves
interrogating various sources of alignment, etc. This is a more complete
and principled fix to PR13920 as well as related bugs pointed out by Eli
in review and by inspection in the area.
Also by inspection fix the integer and vector promotion paths to create
aligned loads and stores. I still need to work up test cases for
these... Sorry for the delay, they were found purely by inspection.
llvm-svn: 164689
tables in bitmaps when they fit in a target-legal register.
This saves some space, and it also allows for building tables that would
otherwise be deemed too sparse.
One interesting case that this hits is example 7 from
http://blog.regehr.org/archives/320. We currently generate good code
for this when lowering the switch to the selection DAG: we build a
bitmask to decide whether to jump to one block or the other. My patch
will result in the same bitmask, but it removes the need for the jump,
as the return value can just be retrieved from the mask.
llvm-svn: 164684
This should really, really fix PR13916. For real this time. The
underlying bug is... a bit more subtle than I had imagined.
The setup is a code pattern that leads to an @llvm.memcpy call with two
equal pointers to an alloca in the source and dest. Now, not any pattern
will do. The alloca needs to be formed just so, and both pointers should
be wrapped in different bitcasts etc. When this precise pattern hits,
a funny sequence of events transpires. First, we correctly detect the
potential for overlap, and correctly optimize the memcpy. The first
time. However, we do simplify the set of users of the alloca, and that
causes us to run the alloca back through the SROA pass in case there are
knock-on simplifications. At this point, a curious thing has happened.
If we happen to have an i8 alloca, we have direct i8 pointer values. So
we don't bother creating a cast, we rewrite the arguments to the memcpy
to dircetly refer to the alloca.
Now, in an unrelated area of the pass, we have clever logic which
ensures that when visiting each User of a particular pointer derived
from an alloca, we only visit that User once, and directly inspect all
of its operands which refer to that particular pointer value. However,
the mechanism used to detect memcpy's with the potential to overlap
relied upon getting visited once per *Use*, not once per *User*. This is
always true *unless* the same exact value is both source and dest. It
turns out that almost nothing actually produces that pattern though.
We can hand craft test cases that more directly test this behavior of
course, and those are included. Also, note that there is a significant
missed optimization here -- we prove in many cases that there is
a non-volatile memcpy call with identical source and dest addresses. We
shouldn't prevent splitting the alloca in that case, and in fact we
should just remove such memcpy calls eagerly. I'll address that in
a subsequent commit.
llvm-svn: 164669
scalar-to-vector conversion that we cannot handle. For instance, when an invalid
constraint is used in an inline asm statement.
<rdar://problem/12284092>
llvm-svn: 164662
scalar-to-vector conversion that we cannot handle. For instance, when an invalid
constraint is used in an inline asm statement.
<rdar://problem/12284092>
llvm-svn: 164657
only a missed optimization opportunity if the store is over-aligned, but a
miscompile if the store's new type has a higher natural alignment than the
memcpy did. Fixes PR13920!
llvm-svn: 164641
Chandler, it's not obvious that it's okay that this alloca gets into the list
twice to begin with. Please review and see whether this is the fix you really
want, but I wanted to get a fix checked in quickly.
llvm-svn: 164634
When a BL/BLX references a symbol in the same translation unit that is
out of range, use an external relocation. The linker will use this to
generate a branch island rather than a direct reference, allowing the
relocation to resolve correctly.
rdar://12359919
llvm-svn: 164615
to chains or cycles between PHIs and/or selects. Also add a couple of
really nice test cases reduced from Kostya's reports in PR13905 and
PR13906. Both are fixed by this patch.
llvm-svn: 164596
Previously it was only be able to detect problems if the pointer was a numerical
value (eg inttoptr i32 1 to i32*), but not if it was an alloca or globa. The
reason was the use of ComputeMaskedBits: imagine you have "alloca i8, align 2",
and ask ComputeMaskedBits what it knows about the bits of the alloca pointer.
It can tell you that the bottom bit is known zero (due to align 2) but it can't
tell you that bit 1 is known one. That's because the address could be an even
multiple of 2 rather than an odd multiple, eg it might be a multiple of 4. Thus
trying to use KnownOne is ineffective in the case of an alloca as it will never
have any bits set. Instead look explicitly for constant offsets from allocas
and globals.
llvm-svn: 164595
Even out-of-line jump tables can be in the code section, so mark them
as data-regions for those targets which support the directives.
rdar://12362871&12362974
llvm-svn: 164571
integer promotion analogous to vector promotion. When there is an
integer alloca being accessed both as its integer type and as a narrower
integer type, promote the narrower access to "insert" and "extract" the
smaller integer from the larger one, and make the integer alloca
a candidate for promotion.
In the new formulation, we don't care about target legal integer or use
thresholds to control things. Instead, we only perform this promotion to
an integer type which the frontend has already emitted a load or store
for. This bounds the scope and prevents optimization passes from
coalescing larger and larger entities into a single integer.
llvm-svn: 164479
across the uses of the alloca. It's entirely possible for negative
numbers to come up here, and in some rare cases simply doing the 2's
complement arithmetic isn't the correct decision. Notably, we can't zext
the index of the GEP. The definition of GEP is that these offsets are
sign extended or truncated to the size of the pointer, and then wrapping
2's complement arithmetic used.
This patch fixes an issue that comes up with *no* input from the
buildbots or bootstrap afaict. The only place where it manifested,
disturbingly, is Clang's own regression test suite. A reduced and
targeted collection of tests are added to cope with this. Note that I've
tried to pin down the potential cases of overflow, but may have missed
some cases. I've tried to add a few cases to test this, but its hard
because LLVM has quite limited support for >64bit constructs.
llvm-svn: 164475
This patch fixes load/store instructions to handle less common cases
like "asr #32", "rrx" properly throughout the MC layer.
Patch by Chris Lidbury.
llvm-svn: 164455
selects with a constant condition. This resulted in the operands
remaining live through the SROA rewriter. Most of the time, this just
caused some dead allocas to persist and get zapped by later passes, but
in one case found by Joerg, it caused a crash when we tried to *promote*
the alloca despite it having this dead use. We already have the
mechanisms in place to handle this, just wire select up to them.
llvm-svn: 164427
We inserted a placeholder that was never replaced because the function was
already visited. Assert that all placeholders have been resolved when tearing
down the bitcode reader.
Fixes PR13895.
llvm-svn: 164369
The expression based expansion too often results in IR level optimizations
splitting the intermediate values into separate basic blocks, preventing
the formation of the VBSL instruction as the code author intended. In
particular, LICM would often hoist part of the computation out of a loop.
rdar://11011471
llvm-svn: 164340
A PHI can't create interference on its own. If two live ranges interfere
at a PHI, they must also interfere when leaving one of the PHI
predecessors.
llvm-svn: 164330
We already have HoistThenElseCodeToIf, this patch implements
SinkThenElseCodeToEnd. When END block has only two predecessors and each
predecessor terminates with unconditional branches, we compare instructions in
IF and ELSE blocks backwards and check whether we can sink the common
instructions down.
rdar://12191395
llvm-svn: 164325
- Rewrite/merge pseudo-atomic instruction emitters to address the
following issue:
* Reduce one unnecessary load in spin-loop
previously the spin-loop looks like
thisMBB:
newMBB:
ld t1 = [bitinstr.addr]
op t2 = t1, [bitinstr.val]
not t3 = t2 (if Invert)
mov EAX = t1
lcs dest = [bitinstr.addr], t3 [EAX is implicit]
bz newMBB
fallthrough -->nextMBB
the 'ld' at the beginning of newMBB should be lift out of the loop
as lcs (or CMPXCHG on x86) will load the current memory value into
EAX. This loop is refined as:
thisMBB:
EAX = LOAD [MI.addr]
mainMBB:
t1 = OP [MI.val], EAX
LCMPXCHG [MI.addr], t1, [EAX is implicitly used & defined]
JNE mainMBB
sinkMBB:
* Remove immopc as, so far, all pseudo-atomic instructions has
all-register form only, there is no immedidate operand.
* Remove unnecessary attributes/modifiers in pseudo-atomic instruction
td
* Fix issues in PR13458
- Add comprehensive tests on atomic ops on various data types.
NOTE: Some of them are turned off due to missing functionality.
- Revise tests due to the new spin-loop generated.
llvm-svn: 164281
A common coalescing conflict in vector code is lane insertion:
%dst = FOO
%src = BAR
%dst:ssub0 = COPY %src
The live range of %src interferes with the ssub0 lane of %dst, but that
lane is never read after %src would have clobbered it. That makes it
safe to merge the live ranges and eliminate the COPY:
%dst = FOO
%dst:ssub0 = BAR
This patch teaches the new coalescer to resolve conflicts where dead
vector lanes would be clobbered, at least as long as the clobbered
vector lanes don't escape the basic block.
llvm-svn: 164250
to improve compatibility with GNU as.
Based on a patch by PaX Team.
Fixed assertion failures on non-Darwin and added additional test cases.
llvm-svn: 164248
- Merge the processing of LOAD_ADD with other atomic load-arith
operations
- Separate the logic getting target constant for atomic-load-op and add
an optimization for atomic-load-add on i16 with negative value
- Optimize a minor case for atomic-fetch-add i16 with negative operand. Test
case is revised.
llvm-svn: 164243
lib/Target/PowerPC/PPCISelLowering.{h,cpp}
Rename LowerFormalArguments_Darwin to LowerFormalArguments_Darwin_Or_64SVR4.
Rename LowerFormalArguments_SVR4 to LowerFormalArguments_32SVR4.
Receive small structs right-justified in LowerFormalArguments_Darwin_Or_64SVR4.
Rename LowerCall_Darwin to LowerCall_Darwin_Or_64SVR4.
Rename LowerCall_SVR4 to LowerCall_32SVR4.
Pass small structs right-justified in LowerCall_Darwin_Or_64SVR4.
test/CodeGen/PowerPC/structsinregs.ll
New test.
llvm-svn: 164228
two variables where the first variable is returned and the second
ignored.
I don't think this occurs in practice (other passes should have cleaned
up the unused phi node), but it should still be handled correctly.
Also make the logic for determining if we should return early less
sketchy.
llvm-svn: 164225
This is a follow-up from r163302, which added a transformation to
SimplifyCFG that turns some switches into loads from lookup tables.
It was pointed out that some targets, such as GPUs and deeply embedded
targets, might not find this appropriate, but SimplifyCFG doesn't have
enough information about the target to decide this.
This patch adds the reverse transformation to CodeGenPrep: it turns
loads from lookup tables back into switches for targets where we do not
build jump tables (assuming these are also the targets where lookup
tables are inappropriate).
Hopefully we will eventually get to have target information in
SimplifyCFG, and then this CodeGenPrep transformation can be removed.
llvm-svn: 164206
from the dragonegg build bots when we turned on the full version of the
pass. Included a much reduced test case for this pesky bug, despite
bugpoint's uncooperative behavior.
Also, I audited all the similar code I could find and didn't spot any
other cases where this mistake cropped up.
llvm-svn: 164178
working on FCA splitting. Instead of refusing to form a common type when
there are uses of a subsection of the alloca as well as a use of the
entire alloca, just skip the subsection uses and continue looking for
a whole-alloca use with a type that we can use.
This produces slightly prettier IR I think, and also fixes the other
failure in the test.
llvm-svn: 164146
FCAs. This is essential in order to promote allocas that are used in
struct returns by frontends like Clang. The FCA load would block the
rest of the pass from firing, resulting is significant regressions with
the bullet benchmark in the nightly test suite.
Thanks to Duncan for repeated discussions about how best to do this, and
to both him and Benjamin for review.
This appears to have blocked many places where the pass tries to fire,
and so I'm expect somewhat different results with this fix added.
As with the last big patch, I'm including a change to enable the SROA by
default *temporarily*. Ben is going to remove this as soon as the LNT
bots pick up the patch. I'm just trying to get a round of LNT numbers
from the stable machines in the lab.
NOTE: Four clang tests are expected to fail in the brief window where
this is enabled. Sorry for the noise!
llvm-svn: 164119
aligned address. Based on patch by David Peixotto.
Also use vld1.64 / vst1.64 with 128-bit alignment to take advantage of alignment
hints. rdar://12090772, rdar://12238782
llvm-svn: 164089
Add LIS::pruneValue() and extendToIndices(). These two functions are
used by the register coalescer when merging two live ranges requires
more than a trivial value mapping as supported by LiveInterval::join().
The pruneValue() function can remove the part of a value number that is
going to conflict in join(). Afterwards, extendToIndices can restore the
live range, using any new dominating value numbers and updating the SSA
form.
Use this complex value mapping to support merging a register into a
vector lane that has a conflicting value, but the clobbered lane is
undef.
llvm-svn: 164074
It had patterns for zext-loading and extending. This commit adds patterns for loading a wide type, performing a bitcast,
and extending. This is an odd pattern, but it is commonly used when writing code with intrinsics.
rdar://11897677
llvm-svn: 163995
new one, and add support for running the new pass in that mode and in
that slot of the pass manager. With this the new pass can completely
replace the old one within the pipeline.
The strategy for enabling or disabling the SSAUpdater logic is to do it
by making the requirement of the domtree analysis optional. By default,
it is required and we get the standard mem2reg approach. This is usually
the desired strategy when run in stand-alone situations. Within the
CGSCC pass manager, we disable requiring of the domtree analysis and
consequentially trigger fallback to the SSAUpdater promotion.
In theory this would allow the pass to re-use a domtree if one happened
to be available even when run in a mode that doesn't require it. In
practice, it lets us have a single pass rather than two which was
simpler for me to wrap my head around.
There is a hidden flag to force the use of the SSAUpdater code path for
the purpose of testing. The primary testing strategy is just to run the
existing tests through that path. One notable difference is that it has
custom code to handle lifetime markers, and one of the tests has been
enhanced to exercise that code.
This has survived a bootstrap and the test suite without serious
correctness issues, however my run of the test suite produced *very*
alarming performance numbers. I don't entirely understand or trust them
though, so more investigation is on-going.
To aid my understanding of the performance impact of the new SROA now
that it runs throughout the optimization pipeline, I'm enabling it by
default in this commit, and will disable it again once the LNT bots have
picked up one iteration with it. I want to get those bots (which are
much more stable) to evaluate the impact of the change before I jump to
any conclusions.
NOTE: Several Clang tests will fail because they run -O3 and check the
result's order of output. They'll go back to passing once I disable it
again.
llvm-svn: 163965
destination.
Updated previous implementation to fix a case not covered:
// PBI: br i1 %x, TrueDest, BB
// BI: br i1 %y, TrueDest, FalseDest
The other case was handled correctly.
// PBI: br i1 %x, BB, FalseDest
// BI: br i1 %y, TrueDest, FalseDest
Also tried to use 64-bit arithmetic instead of APInt with scale to simplify the
computation. Let me know if you have other opinions about this.
llvm-svn: 163954
This is essentially a ground up re-think of the SROA pass in LLVM. It
was initially inspired by a few problems with the existing pass:
- It is subject to the bane of my existence in optimizations: arbitrary
thresholds.
- It is overly conservative about which constructs can be split and
promoted.
- The vector value replacement aspect is separated from the splitting
logic, missing many opportunities where splitting and vector value
formation can work together.
- The splitting is entirely based around the underlying type of the
alloca, despite this type often having little to do with the reality
of how that memory is used. This is especially prevelant with unions
and base classes where we tail-pack derived members.
- When splitting fails (often due to the thresholds), the vector value
replacement (again because it is separate) can kick in for
preposterous cases where we simply should have split the value. This
results in forming i1024 and i2048 integer "bit vectors" that
tremendously slow down subsequnet IR optimizations (due to large
APInts) and impede the backend's lowering.
The new design takes an approach that fundamentally is not susceptible
to many of these problems. It is the result of a discusison between
myself and Duncan Sands over IRC about how to premptively avoid these
types of problems and how to do SROA in a more principled way. Since
then, it has evolved and grown, but this remains an important aspect: it
fixes real world problems with the SROA process today.
First, the transform of SROA actually has little to do with replacement.
It has more to do with splitting. The goal is to take an aggregate
alloca and form a composition of scalar allocas which can replace it and
will be most suitable to the eventual replacement by scalar SSA values.
The actual replacement is performed by mem2reg (and in the future
SSAUpdater).
The splitting is divided into four phases. The first phase is an
analysis of the uses of the alloca. This phase recursively walks uses,
building up a dense datastructure representing the ranges of the
alloca's memory actually used and checking for uses which inhibit any
aspects of the transform such as the escape of a pointer.
Once we have a mapping of the ranges of the alloca used by individual
operations, we compute a partitioning of the used ranges. Some uses are
inherently splittable (such as memcpy and memset), while scalar uses are
not splittable. The goal is to build a partitioning that has the minimum
number of splits while placing each unsplittable use in its own
partition. Overlapping unsplittable uses belong to the same partition.
This is the target split of the aggregate alloca, and it maximizes the
number of scalar accesses which become accesses to their own alloca and
candidates for promotion.
Third, we re-walk the uses of the alloca and assign each specific memory
access to all the partitions touched so that we have dense use-lists for
each partition.
Finally, we build a new, smaller alloca for each partition and rewrite
each use of that partition to use the new alloca. During this phase the
pass will also work very hard to transform uses of an alloca into a form
suitable for promotion, including forming vector operations, speculating
loads throguh PHI nodes and selects, etc.
After splitting is complete, each newly refined alloca that is
a candidate for promotion to a scalar SSA value is run through mem2reg.
There are lots of reasonably detailed comments in the source code about
the design and algorithms, and I'm going to be trying to improve them in
subsequent commits to ensure this is well documented, as the new pass is
in many ways more complex than the old one.
Some of this is still a WIP, but the current state is reasonbly stable.
It has passed bootstrap, the nightly test suite, and Duncan has run it
successfully through the ACATS and DragonEgg test suites. That said, it
remains behind a default-off flag until the last few pieces are in
place, and full testing can be done.
Specific areas I'm looking at next:
- Improved comments and some code cleanup from reviews.
- SSAUpdater and enabling this pass inside the CGSCC pass manager.
- Some datastructure tuning and compile-time measurements.
- More aggressive FCA splitting and vector formation.
Many thanks to Duncan Sands for the thorough final review, as well as
Benjamin Kramer for lots of review during the process of writing this
pass, and Daniel Berlin for reviewing the data structures and algorithms
and general theory of the pass. Also, several other people on IRC, over
lunch tables, etc for lots of feedback and advice.
llvm-svn: 163883
- Enhance the fix to PR12312 to support wider integer, such as 256-bit
integer. If more than 1 fully evaluated vectors are found, POR them
first followed by the final PTEST.
llvm-svn: 163832
- Find a legal vector type before casting and extracting element from it.
- As the new vector type may have more than 2 elements, build the final
hi/lo pair by BFS pairing them from bottom to top.
llvm-svn: 163830
Add a PatFrag to match X86tcret using 6 fixed registers or less. This
avoids folding loads into TCRETURNmi64 using 7 or more volatile
registers.
<rdar://problem/12282281>
llvm-svn: 163819
by xoring the high-bit. This fails if the source operand is a vector because we need to negate
each of the elements in the vector.
Fix rdar://12281066 PR13813.
llvm-svn: 163802
are within the lifetime zone. Sometime legitimate usages of allocas are
hoisted outside of the lifetime zone. For example, GEPS may calculate the
address of a member of an allocated struct. This commit makes sure that
we only check (abort regions or assert) for instructions that read and write
memory using stack frames directly. Notice that by allowing legitimate
usages outside the lifetime zone we also stop checking for instructions
which use derivatives of allocas. We will catch less bugs in user code
and in the compiler itself.
llvm-svn: 163791
We don't have enough GR64_TC registers when calling a varargs function
with 6 arguments. Since %al holds the number of vector registers used,
only %r11 is available as a scratch register.
This means that addressing modes using both base and index registers
can't be folded into TCRETURNmi64.
<rdar://problem/12282281>
llvm-svn: 163761
Add some support for dealing with an object pointer on arguments.
Part of rdar://9797999
which now supports adding the object pointer attribute to the
subprogram as it should.
llvm-svn: 163754
- BlockAddress has no support of BA + offset form and there is no way to
propagate that offset into machine operand;
- Add BA + offset support and a new interface 'getTargetBlockAddress' to
simplify target block address forming;
- All targets are modified to use new interface and X86 backend is enhanced to
support BA + offset addressing.
llvm-svn: 163743
nonvolatile condition register fields across calls under the SVR4 ABIs.
* With the 64-bit ABI, the save location is at a fixed offset of 8 from
the stack pointer. The frame pointer cannot be used to access this
portion of the stack frame since the distance from the frame pointer may
change with alloca calls.
* With the 32-bit ABI, the save location is just below the general
register save area, and is accessed via the frame pointer like the rest
of the save areas. This is an optional slot, so it must only be created
if any of CR2, CR3, and CR4 were modified.
* For both ABIs, save/restore logic is generated only if one of the
nonvolatile CR fields were modified.
I also took this opportunity to clean up an extra FIXME in
PPCFrameLowering.h. Save area offsets for 32-bit GPRs are meaningless
for the 64-bit ABI, so I removed them for correctness and efficiency.
Fixes PR13708 and partially also PR13623. It lets us enable exception handling
on PPC64.
Patch by William J. Schmidt!
llvm-svn: 163713
SelectionDAG::getConstantFP(double Val, EVT VT, bool isTarget);
should not be used when Val is not a simple constant (as the comment in
SelectionDAG.h indicates). This patch avoids using this function
when folding an unknown constant through a bitcast, where it cannot be
guaranteed that Val will be a simple constant.
llvm-svn: 163703
The input program may contain intructions which are not inside lifetime
markers. This can happen due to a bug in the compiler or due to a bug in
user code (for example, returning a reference to a local variable).
This commit adds checks that all of the instructions in the function and
invalidates lifetime ranges which do not contain all of the instructions.
llvm-svn: 163678
a pair of switch/branch where both depend on the value of the same variable and
the default case of the first switch/branch goes to the second switch/branch.
Code clean up and fixed a few issues:
1> handling the case where some cases of the 2nd switch are invalidated
2> correctly calculate the weight for the 2nd switch when it is a conditional eq
Testing case is modified from Alastair's original patch.
llvm-svn: 163635
The ARM backend can eliminate cmp instructions by reusing flags from a
nearby sub instruction with similar arguments.
Don't do that if the sub is predicated - the flags are not written
unconditionally.
<rdar://problem/12263428>
llvm-svn: 163535
- If a boolean value is generated from CMOV and tested as boolean value,
simplify the use of test result by referencing the original condition.
RDRAND intrinisc is one of such cases.
llvm-svn: 163516
For some reason .lcomm uses byte alignment and .comm log2 alignment so we can't
use the same setting for both. Fix this by reintroducing the LCOMM enum.
I verified this against mingw's gcc.
llvm-svn: 163420
- Darwin lied about not supporting .lcomm and turned it into zerofill in the
asm parser. Push the zerofill-conversion down into macho-specific code.
- This makes the tri-state LCOMMType enum superfluous, there are no targets
without .lcomm.
- Do proper error reporting when trying to use .lcomm with alignment on a target
that doesn't support it.
- .comm and .lcomm alignment was parsed in bytes on COFF, should be power of 2.
- Fixes PR13755 (.lcomm crashes on ELF).
llvm-svn: 163395
gas accepts this and it seems to be common enough to be worth supporting. This
doesn't affect the parsing of reg operands outside of .cfi directives.
llvm-svn: 163390
The assembler can alias one instruction into another based
on the operands. For example the jump instruction "J" takes
and immediate operand, but if the operand is a register the
assembler will change it into a jump register "JR" instruction.
These changes are in the instruction td file.
Test cases included
Contributer: Vladimir Medic
llvm-svn: 163368
Actually these are just stubs for parsing the directives.
Semantic support will come later.
Test cases included
Contributer: Vladimir Medic
llvm-svn: 163364
- This patch is inspired by the failure of the following code snippet
which is used to convert enumerable values into encoding bits to
improve the readability of td files.
class S<int s> {
bits<2> V = !if(!eq(s, 8), {0, 0},
!if(!eq(s, 16), {0, 1},
!if(!eq(s, 32), {1, 0},
!if(!eq(s, 64), {1, 1}, {?, ?}))));
}
Later, PR8330 is found to report not exactly the same bug relevant
issue to bit/bits values.
- Instead of resolving bit/bits values separately through
resolveBitReference(), this patch adds getBit() for all Inits and
resolves bit value by resolving plus getting the specified bit. This
unifies the resolving of bit with other values and removes redundant
logic for resolving bit only. In addition,
BitsInit::resolveReferences() is optimized to take advantage of this
origanization by resolving VarBitInit's variable reference first and
then getting bits from it.
- The type interference in '!if' operator is revised to support possible
combinations of int and bits/bit in MHS and RHS.
- As there may be illegal assignments from integer value to bit, says
assign 2 to a bit, but we only check this during instantiation in some
cases, e.g.
bit V = !if(!eq(x, 17), 0, 2);
Verbose diagnostic message is generated when invalid value is
resolveed to help locating the error.
- PR8330 is fixed as well.
llvm-svn: 163360
The RegisterCoalescer understands overlapping live ranges where one
register is defined as a copy of the other. With this change, register
allocators using LiveRegMatrix can do the same, at least for copies
between physical and virtual registers.
When a physreg is defined by a copy from a virtreg, allow those live
ranges to overlap:
%CL<def> = COPY %vreg11:sub_8bit; GR32_ABCD:%vreg11
%vreg13<def,tied1> = SAR32rCL %vreg13<tied0>, %CL<imp-use,kill>
We can assign %vreg11 to %ECX, overlapping the live range of %CL.
llvm-svn: 163336
Enhances basic alias analysis to recognize phis whose first incoming values are
NoAlias and whose other incoming values are just the phi node itself through
some amount of recursion.
Example: With this change basicaa reports that ptr_phi and ptr_phi2 do not alias
each other.
bb:
ptr = ptr2 + 1
loop:
ptr_phi = phi [bb, ptr], [loop, ptr_plus_one]
ptr2_phi = phi [bb, ptr2], [loop, ptr2_plus_one]
...
ptr_plus_one = gep ptr_phi, 1
ptr2_plus_one = gep ptr2_phi, 1
This enables the elimination of one load in code like the following:
extern int foo;
int test_noalias(int *ptr, int num, int* coeff) {
int *ptr2 = ptr;
int result = (*ptr++) * (*coeff--);
while (num--) {
*ptr2++ = *ptr;
result += (*coeff--) * (*ptr++);
}
*ptr = foo;
return result;
}
Part 2/2 of fix for PR13564.
llvm-svn: 163319
If we can show that the base pointers of two GEPs don't alias each other using
precise analysis and the indices and base offset are equal then the two GEPs
also don't alias each other.
This is primarily needed for the follow up patch that analyses NoAlias'ing PHI
nodes.
Part 1/2 of fix for PR13564.
llvm-svn: 163317
The lookup tables did not get built in a deterministic order.
This makes them get built in the order that the corresponding phi nodes
were found.
llvm-svn: 163305
If we have a BUILD_VECTOR that is mostly a constant splat, it is often better to splat that constant then insertelement the non-constant lanes instead of insertelementing every lane from an undef base.
llvm-svn: 163304
This adds a transformation to SimplifyCFG that attemps to turn switch
instructions into loads from lookup tables. It works on switches that
are only used to initialize one or more phi nodes in a common successor
basic block, for example:
int f(int x) {
switch (x) {
case 0: return 5;
case 1: return 4;
case 2: return -2;
case 5: return 7;
case 6: return 9;
default: return 42;
}
This speeds up the code by removing the hard-to-predict jump, and
reduces code size by removing the code for the jump targets.
llvm-svn: 163302