Before:
movabsq $4294967296, %rax ## encoding: [0x48,0xb8,0x00,0x00,0x00,0x00,0x01,0x00,0x00,0x00]
testq %rax, %rdi ## encoding: [0x48,0x85,0xf8]
jne LBB0_2 ## encoding: [0x75,A]
After:
btq $32, %rdi ## encoding: [0x48,0x0f,0xba,0xe7,0x20]
jb LBB0_2 ## encoding: [0x72,A]
btq is usually slower than testq because it doesn't fuse with the jump, but here we're better off
saving one register and a giant movabsq.
llvm-svn: 145103
When this field is true it means that the load is from constant (runt-time or compile-time) and so can be hoisted from loops or moved around other memory accesses
llvm-svn: 144100
On x86: (shl V, 1) -> add V,V
Hardware support for vector-shift is sparse and in many cases we scalarize the
result. Additionally, on sandybridge padd is faster than shl.
llvm-svn: 143311
fixes: Use a separate register, instead of SP, as the
calling-convention resource, to avoid spurious conflicts with
actual uses of SP. Also, fix unscheduling of calling sequences,
which can be triggered by pseudo-two-address dependencies.
llvm-svn: 143206
it fixes the dragonegg self-host (it looks like gcc is miscompiled).
Original commit messages:
Eliminate LegalizeOps' LegalizedNodes map and have it just call RAUW
on every node as it legalizes them. This makes it easier to use
hasOneUse() heuristics, since unneeded nodes can be removed from the
DAG earlier.
Make LegalizeOps visit the DAG in an operands-last order. It previously
used operands-first, because LegalizeTypes has to go operands-first, and
LegalizeTypes used to be part of LegalizeOps, but they're now split.
The operands-last order is more natural for several legalization tasks.
For example, it allows lowering code for nodes with floating-point or
vector constants to see those constants directly instead of seeing the
lowered form (often constant-pool loads). This makes some things
somewhat more complicated today, though it ought to allow things to be
simpler in the future. It also fixes some bugs exposed by Legalizing
using RAUW aggressively.
Remove the part of LegalizeOps that attempted to patch up invalid chain
operands on libcalls generated by LegalizeTypes, since it doesn't work
with the new LegalizeOps traversal order. Instead, define what
LegalizeTypes is doing to be correct, and transfer the responsibility
of keeping calls from having overlapping calling sequences into the
scheduler.
Teach the scheduler to model callseq_begin/end pairs as having a
physical register definition/use to prevent calls from having
overlapping calling sequences. This is also somewhat complicated, though
there are ways it might be simplified in the future.
This addresses rdar://9816668, rdar://10043614, rdar://8434668, and others.
Please direct high-level questions about this patch to management.
Delete #if 0 code accidentally left in.
llvm-svn: 143188
on every node as it legalizes them. This makes it easier to use
hasOneUse() heuristics, since unneeded nodes can be removed from the
DAG earlier.
Make LegalizeOps visit the DAG in an operands-last order. It previously
used operands-first, because LegalizeTypes has to go operands-first, and
LegalizeTypes used to be part of LegalizeOps, but they're now split.
The operands-last order is more natural for several legalization tasks.
For example, it allows lowering code for nodes with floating-point or
vector constants to see those constants directly instead of seeing the
lowered form (often constant-pool loads). This makes some things
somewhat more complicated today, though it ought to allow things to be
simpler in the future. It also fixes some bugs exposed by Legalizing
using RAUW aggressively.
Remove the part of LegalizeOps that attempted to patch up invalid chain
operands on libcalls generated by LegalizeTypes, since it doesn't work
with the new LegalizeOps traversal order. Instead, define what
LegalizeTypes is doing to be correct, and transfer the responsibility
of keeping calls from having overlapping calling sequences into the
scheduler.
Teach the scheduler to model callseq_begin/end pairs as having a
physical register definition/use to prevent calls from having
overlapping calling sequences. This is also somewhat complicated, though
there are ways it might be simplified in the future.
This addresses rdar://9816668, rdar://10043614, rdar://8434668, and others.
Please direct high-level questions about this patch to management.
llvm-svn: 143177
http://lab.llvm.org:8011/builders/llvm-x86_64-linux/builds/101
--- Reverse-merging r141854 into '.':
U test/MC/Disassembler/X86/x86-32.txt
U test/MC/Disassembler/X86/simple-tests.txt
D test/CodeGen/X86/bmi.ll
U lib/Target/X86/X86InstrInfo.td
U lib/Target/X86/X86ISelLowering.cpp
U lib/Target/X86/X86.td
U lib/Target/X86/X86Subtarget.h
llvm-svn: 141857
- x87: no min or max.
- SSE1: min/max for single precision scalars and vectors.
- SSE2: min/max for single and double precision scalars and vectors.
- AVX: as SSE2, but also supports the wider ymm vectors. (this is covered by the isTypeLegal check)
llvm-svn: 140296
dag-combine optimization to implement the ext-load efficiently (using shuffles).
For example the type <4 x i8> is stored in memory as i32, but it needs to
find its way into a <4 x i32> register. Previously we scalarized the memory
access, now we use shuffles.
llvm-svn: 139995
maxps and maxpd). This broke the sse41-blend.ll testcase by causing
maxpd to be produced rather than a cmp+blend pair, which is the reason
I tweaked it. Gives a small speedup on doduc with dragonegg when the
GCC vectorizer is used.
llvm-svn: 139986
take into consideration the presence of AVX. This change, together with
the SSEDomainFix enabled for AVX, makes AVX codegen to always (hopefully)
emit the same code as SSE for 128-bit vector ops. I don't
have a testcase for this, but AVX now beats SSE in performance for
128-bit ops in the majority of programas in the llvm testsuite
llvm-svn: 139817
However with this fix it does now.
Basically the operand order for the x86 target specific node
is not the same as the instruction, but since the intrinsic need that
specific order at the instruction definition, just change the order
during legalization. Also, there were some wrong invertions of condition
codes, such as GE => LE, GT => LT, fix that too. Fix PR10907.
llvm-svn: 139528
assert("not implemented for target shuffle node");
to:
assert(0 && "not implemented for target shuffle node");
This causes a test failure in CodeGen/X86/palignr.ll which has
been marked as XFAIL for the time being.
Test failure filed at PR10901.
llvm-svn: 139454
in Nadav's r139285 and r139287 commits.
1) Rename vsel.ll to a more descriptive name
2) Change the order of BLEND operands to "Op1, Op2, Cond", this is
necessary because PBLENDVB is already used in different places with
this order, and it was being emitted in the wrong way for vselect
3) Add AVX patterns and tests for the same SSE41 instructions
llvm-svn: 139305
with a vector condition); such selects become VSELECT codegen nodes.
This patch also removes VSETCC codegen nodes, unifying them with SETCC
nodes (codegen was actually often using SETCC for vector SETCC already).
This ensures that various DAG combiner optimizations kick in for vector
comparisons. Passes dragonegg bootstrap with no testsuite regressions
(nightly testsuite as well as "make check-all"). Patch mostly by
Nadav Rotem.
llvm-svn: 139159
init.trampoline and adjust.trampoline intrinsics, into two intrinsics
like in GCC. While having one combined intrinsic is tempting, it is
not natural because typically the trampoline initialization needs to
be done in one function, and the result of adjust trampoline is needed
in a different (nested) function. To get around this llvm-gcc hacks the
nested function lowering code to insert an additional parent variable
holding the adjust.trampoline result that can be accessed from the child
function. Dragonegg doesn't have the luxury of tweaking GCC code, so it
stored the result of adjust.trampoline in the memory GCC set aside for
the trampoline itself (this is always available in the child function),
and set up some new memory (using an alloca) to hold the trampoline.
Unfortunately this breaks Go which allocates trampoline memory on the
heap and wants to use it even after the parent has exited (!). Rather
than doing even more hacks to get Go working, it seemed best to just use
two intrinsics like in GCC. Patch mostly by Sanjoy Das.
llvm-svn: 139140
- Duplicate some store patterns to their AVX forms!
- Catched a bug while restricting the patterns subtarget, fix it
and update a testcase to check it properly
llvm-svn: 138851
code is inserted to first check if the current stacklet has enough
space. If so, space is allocated by simply decrementing the stack
pointer. Otherwise a runtime routine (__morestack_allocate_stack_space
in libgcc) is called which allocates the required memory from the
heap.
Patch by Sanjoy Das.
llvm-svn: 138818
from DYNAMIC_STACKALLOC.
Two new pseudo instructions (SEG_ALLOCA_32 and SEG_ALLOCA_64) which
will match X86SegAlloca (based on word size) are also added. They
will be custom emitted to inject the actual stack handling code.
Patch by Sanjoy Das.
llvm-svn: 138814
X86. Modify the pass added in the previous patch to call this new
code.
This new prologues generated will call a libgcc routine (__morestack)
to allocate more stack space from the heap when required
Patch by Sanjoy Das.
llvm-svn: 138812
explicit about which subtarget they refer to, and add AVX versions of
the ones we currently don't. Make the mask check more strict, to be
clear it won't be used to match to 256-bit versions!
llvm-svn: 138514
match splats in the form (splat (scalar_to_vector (load ...))) whenever
the load can be folded. All the logic and instruction emission is
working but because of PR8156, there are no ways to match loads, cause
they can never be folded for splats. Thus, the tests are XFAILed, but
I've tested and exercised all the logic using a relaxed version for
checking the foldable loads, as if the bug was already fixed. This
should work out of the box once PR8156 gets fixed since MayFoldLoad will
work as expected.
llvm-svn: 137810
vinsertf128 $1 + vpermilps $0, remove the old code that used to first
do the splat in a 128-bit vector and then insert it into a larger one.
This is better because the handling code gets simpler and also makes a
better room for the upcoming vbroadcast!
llvm-svn: 137807
there is no support for native 256-bit shuffles, be more smart in some
cases, for example, when you can extract specific 128-bit parts and use
regular 128-bit shuffles for them. Example:
For this shuffle:
shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32>
<i32 1, i32 0, i32 7, i32 6>
This was expanded to:
vextractf128 $1, %ymm1, %xmm2
vpextrq $0, %xmm2, %rax
vmovd %rax, %xmm1
vpextrq $1, %xmm2, %rax
vmovd %rax, %xmm2
vpunpcklqdq %xmm1, %xmm2, %xmm1
vpextrq $0, %xmm0, %rax
vmovd %rax, %xmm2
vpextrq $1, %xmm0, %rax
vmovd %rax, %xmm0
vpunpcklqdq %xmm2, %xmm0, %xmm0
vinsertf128 $1, %xmm1, %ymm0, %ymm0
ret
Now we get:
vshufpd $1, %xmm0, %xmm0, %xmm0
vextractf128 $1, %ymm1, %xmm1
vshufpd $1, %xmm1, %xmm1, %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
llvm-svn: 137733
vectors. It operates on 128-bit elements instead of regular scalar
types. Recognize shuffles that are suitable for VPERM2F128 and teach
the x86 legalizer how to handle them.
llvm-svn: 137519
(for example, after integer operation), do not pack the registers into a YMM
before saving. Its better to save as two XMM registers.
Before:
vinsertf128 $1, %xmm3, %ymm0, %ymm3
vinsertf128 $0, %xmm1, %ymm3, %ymm1
vmovaps %ymm1, 416(%rsp)
After:
vmovaps %xmm3, 416+16(%rsp)
vmovaps %xmm1, 416(%rsp)
llvm-svn: 137308
data in-register prior to saving to memory. When we reorder the data in memory
we prevent the need to save multiple scalars to memory, making a single regular
store.
llvm-svn: 137238
avoid returning early for v8i32 types, which would only be valid for
vector with all zeros. Also split the handling of zeros and ones into separate
checking logic since they are handled differently. This fixes PR10547
llvm-svn: 136642
working on x86 (at least for trivial testcases); other architectures will
need more work so that they actually emit the appropriate instructions for
orderings stricter than 'monotonic'. (As far as I can tell, the ARM, PPC,
Mips, and Alpha backends need such changes.)
llvm-svn: 136457
usage of the shuffle bitmask. Both work in 128-bit lanes without
crossing, but in the former the mask of the high part is the same
used by the low part while in the later both lanes have independent
masks. Handle this properly and and add support for vpermilpd.
llvm-svn: 136200
On x86 we can't encode an immediate LHS of a sub directly. If the RHS comes from a XOR with a constant we can
fold the negation into the xor and add one to the immediate of the sub. Then we can turn the sub into an add,
which can be commuted and encoded efficiently.
This code is generated for __builtin_clz and friends.
llvm-svn: 136167
shuffle before inserting on a 256-bit vector.
- Add AVX versions of movd/movq instructions
- Introduce a few COPY patterns to match insert_subvector instructions.
This turns a trivial insert_subvector instruction into a register copy,
coalescing the xmm into a ymm and avoid emiting on more instruction.
llvm-svn: 136002
the way to go. Doing this here will prevent several node matches later,
and would have to force looking all the way through several
VINSERTF128/VEXTRACTF128 chains to optimize simple things.
llvm-svn: 135730
and was actually very wrong, fix it and make it simpler. Also remove the
ConcatVectors function, which is unused now.
- Fix a introduction of useless nodes in r126664 and r126264. The
VUNPCKL* should never be introduced cause we don't want duplicate
nodes for 128 AVX and non-AVX modes, the actual instruction
difference only exists during isel, but not for target specific DAG
nodes. We only introduce V* target nodes when there is no 128-bit
version already there.
- Fix a fragile test and make it more useful.
llvm-svn: 135729
instruction introduced in AVX, which can operate on 128 and 256-bit vectors.
It considers a 256-bit vector as two independent 128-bit lanes. It can permute
any 32 or 64 elements inside a lane, and restricts the second lane to
have the same permutation of the first one. With the improved splat support
introduced early today, adding codegen for this instruction enable more
efficient 256-bit code:
Instead of:
vextractf128 $0, %ymm0, %xmm0
punpcklbw %xmm0, %xmm0
punpckhbw %xmm0, %xmm0
vinsertf128 $0, %xmm0, %ymm0, %ymm1
vinsertf128 $1, %xmm0, %ymm1, %ymm0
vextractf128 $1, %ymm0, %xmm1
shufps $1, %xmm1, %xmm1
movss %xmm1, 28(%rsp)
movss %xmm1, 24(%rsp)
movss %xmm1, 20(%rsp)
movss %xmm1, 16(%rsp)
vextractf128 $0, %ymm0, %xmm0
shufps $1, %xmm0, %xmm0
movss %xmm0, 12(%rsp)
movss %xmm0, 8(%rsp)
movss %xmm0, 4(%rsp)
movss %xmm0, (%rsp)
vmovaps (%rsp), %ymm0
We get:
vextractf128 $0, %ymm0, %xmm0
punpcklbw %xmm0, %xmm0
punpckhbw %xmm0, %xmm0
vinsertf128 $0, %xmm0, %ymm0, %ymm1
vinsertf128 $1, %xmm0, %ymm1, %ymm0
vpermilps $85, %ymm0, %ymm0
llvm-svn: 135662
refactor the code and add a bunch of comments. The final shuffle
emitted by handling 256-bit types is suitable for the VPERM shuffle
instruction which is going to be introduced in a next commit (with
a testcase which cover this commit)
llvm-svn: 135661
to MCRegisterInfo. Also initialize the mapping at construction time.
This patch eliminate TargetRegisterInfo from TargetAsmInfo. It's another step
towards fixing the layering violation.
llvm-svn: 135424
1) Make non-legal 256-bit loads to be promoted to v4i64. This lets us
canonize the loads and handle things the same way we use to handle
for 128-bit registers. Despite of what one of the removed comments
explained, the load promotion would not mess with VPERM, it's only a
matter of doing the appropriate bitcasts when this instructions comes
to be introduced. Also make LOAD v8i32 legal.
2) Doing 1) exposed two bugs:
- v4i64 was being promoted to itself for several opcodes (introduced
in r124447 by David Greene) causing endless recursion and the stack to
explode.
- there was no support for allOnes BUILD_VECTORs and ANDNP would fail to
match because it was generating early target constant pools during
lowering.
3) The testcases are already checked-in, doing 1) exposed the
bugs in the current testcases.
4) Tidy up code to be more clear and explicit about AVX.
llvm-svn: 135313
when determining validity of matching constraint. Allow i1
types access to the GR8 reg class for x86.
Fixes PR10352 and rdar://9777108
llvm-svn: 135180
During type legalization we often use the SIGN_EXTEND_INREG SDNode.
When this SDNode is legalized during the LegalizeVector phase, it is
scalarized because non-simple types are automatically marked to be expanded.
In this patch we add support for lowering SIGN_EXTEND_INREG manually.
This fixes CodeGen/X86/vec_sext.ll when running with the '-promote-elements'
flag.
llvm-svn: 135144
Drop the FpMov instructions, use plain COPY instead.
Drop the FpSET/GET instruction for accessing fixed stack positions.
Instead use normal COPY to/from ST registers around inline assembly, and
provide a single new FpPOP_RETVAL instruction that can access the return
value(s) from a call. This is still necessary since you cannot tell from
the CALL instruction alone if it returns anything on the FP stack. Teach
fast isel to use this.
This provides a much more robust way of handling fixed stack registers -
we can tolerate arbitrary FP stack instructions inserted around calls
and inline assembly. Live range splitting could sometimes break x87 code
by inserting spill code in unfortunate places.
As a bonus we handle floating point inline assembly correctly now.
llvm-svn: 134018
optimizations when emitting calls to the function; instead those calls may
use faster relocations which require the function to be immediately resolved
upon loading the dynamic object featuring the call. This is useful when it
is known that the function will be called frequently and pervasively and
therefore there is no merit in delaying binding of the function.
Currently only implemented for x86-64, where it turns into a call through
the global offset table.
Patch by Dan Gohman, who assures me that he's going to add LangRef documentation
for this once it's committed.
llvm-svn: 133080
floating-point comparison, generate a mask of 0s or 1s, and generally
DTRT with NaNs. Only profitable when the user wants a materialized 0
or 1 at runtime. rdar://problem/5993888
llvm-svn: 132404
non-zero.
- Teach X86 cmov optimization to eliminate the cmov from ctlz, cttz extension
when the source of X86ISD::BSR / X86ISD::BSF is proven to be non-zero.
rdar://9490949
llvm-svn: 131948
to have single return block (at least getting there) for optimizations. This
is general goodness but it would prevent some tailcall optimizations.
One specific case is code like this:
int f1(void);
int f2(void);
int f3(void);
int f4(void);
int f5(void);
int f6(void);
int foo(int x) {
switch(x) {
case 1: return f1();
case 2: return f2();
case 3: return f3();
case 4: return f4();
case 5: return f5();
case 6: return f6();
}
}
=>
LBB0_2: ## %sw.bb
callq _f1
popq %rbp
ret
LBB0_3: ## %sw.bb1
callq _f2
popq %rbp
ret
LBB0_4: ## %sw.bb3
callq _f3
popq %rbp
ret
This patch teaches codegenprep to duplicate returns when the return value
is a phi and where the phi operands are produced by tail calls followed by
an unconditional branch:
sw.bb7: ; preds = %entry
%call8 = tail call i32 @f5() nounwind
br label %return
sw.bb9: ; preds = %entry
%call10 = tail call i32 @f6() nounwind
br label %return
return:
%retval.0 = phi i32 [ %call10, %sw.bb9 ], [ %call8, %sw.bb7 ], ... [ 0, %entry ]
ret i32 %retval.0
This allows codegen to generate better code like this:
LBB0_2: ## %sw.bb
jmp _f1 ## TAILCALL
LBB0_3: ## %sw.bb1
jmp _f2 ## TAILCALL
LBB0_4: ## %sw.bb3
jmp _f3 ## TAILCALL
rdar://9147433
llvm-svn: 127953
not have native support for this operation (such as X86).
The legalized code uses two vector INT_TO_FP operations and is faster
than scalarizing.
llvm-svn: 127951
comparisons on x86. Essentially, the way this works is that SUB+SBB sets
the relevant flags the same way a double-width CMP would.
This is a substantial improvement over the generic lowering in LLVM. The output
is also shorter than the gcc-generated output; I haven't done any detailed
benchmarking, though.
llvm-svn: 127852
rather than an int. Thankfully, this only causes LLVM to miss optimizations, not
generate incorrect code.
This just fixes the zext at the return. We still insert an i32 ZextAssert when
reading a function's arguments, but it is followed by a truncate and another i8
ZextAssert so it is not optimized.
llvm-svn: 127766
testcases accordingly. Some are currently xfailed and will be filed
as bugs to be fixed or understood.
Performance results:
roughly neutral on SPEC
some micro benchmarks in the llvm suite are up between 100 and 150%, only
a pair of regressions that are due to be investigated
john-the-ripper saw:
10% improvement in traditional DES
8% improvement in BSDI DES
59% improvement in FreeBSD MD5
67% improvement in OpenBSD Blowfish
14% improvement in LM DES
Small compile time impact.
llvm-svn: 127208
regs. This is the only change in this checkin that may affects the
default scheduler. With better register tracking and heuristics, it
doesn't make sense to artificially lower the register limit so much.
Added -sched-high-latency-cycles and X86InstrInfo::isHighLatencyDef to
give the scheduler a way to account for div and sqrt on targets that
don't have an itinerary. It is currently defaults to 10 (the actual
number doesn't matter much), but only takes effect on non-default
schedulers: list-hybrid and list-ilp.
Added several heuristics that can be individually disabled for the
non-default sched=list-ilp mode. This helps us determine how much
better we can do on a given benchmark than the default
scheduler. Certain compute intensive loops run much faster in this
mode with the right set of heuristics, and it doesn't seem to have
much negative impact elsewhere. Not all of the heuristics are needed,
but we still need to experiment to decide which should be disabled by
default for sched=list-ilp.
llvm-svn: 127067
and 256-bit forms. Because the number of elements in a vector
does not determine the vector type (4 elements could be v4f32 or
v4f64), pass the full type of the vector to decode routines.
llvm-svn: 126664
In other words, do not keep track of argument's location. The debugger (gdb) is not prepared to see line table entries for arguments. For the debugger, "second" line table entry marks beginning of function body.
This requires some coordination with debugger to get this working.
- The debugger needs to be aware of prolog_end attribute attached with line table entries.
- The compiler needs to accurately mark prolog_end in line table entries (at -O0 and at -O1+)
llvm-svn: 126155
(LLVMX86Utils.a) to break cyclic library dependencies between
LLVMX86CodeGen.a and LLVMX86AsmParser.a. Previously this code was in
a header file and marked static but AVX requires some additional
functionality here that won't be used by all clients. Since including
unused static functions causes a gcc compiler warning, keeping it as a
header would break builds that use -Werror. Putting this in its own
library solves both problems at once.
llvm-svn: 125765
have their low bits set to zero. This allows us to optimize
out explicit stack alignment code like in stack-align.ll:test4 when
it is redundant.
Doing this causes the code generator to start turning FI+cst into
FI|cst all over the place, which is general goodness (that is the
canonical form) except that various pieces of the code generator
don't handle OR aggressively. Fix this by introducing a new
SelectionDAG::isBaseWithConstantOffset predicate, and using it
in places that are looking for ADD(X,CST). The ARM backend in
particular was missing a lot of addressing mode folding opportunities
around OR.
llvm-svn: 125470
This allows us to easily support 256-bit operations that don't have
native 256-bit support. This applies to integer operations, certain
types of shuffles and various othher things.
llvm-svn: 124910
matching EXTRACT_SUBVECTOR to VEXTRACTF128 along with support routines
to examine and translate index values. VINSERTF128 comes next. With
these two in place we can begin supporting more AVX operations as
INSERT/EXTRACT can be used as a fallback when 256-bit support is not
available.
llvm-svn: 124797
Reversing the operands allows us to fold, but doesn't force us to. Also, at
this point the DAG is still being optimized, so the check for hasOneUse is not
very precise.
llvm-svn: 124773
default implementation for x86, going through the stack in a similr
fashion to how the codegen implements BUILD_VECTOR. Eventually this
will get matched to VINSERTF128 if AVX is available.
llvm-svn: 124307
implementation of EXTRACT_SUBVECTOR for x86, going through the stack
in a similr fashion to how the codegen implements BUILD_VECTOR.
Eventually this will get matched to VEXTRACTF128 if AVX is available.
llvm-svn: 124292
The theory is it's still faster than a pair of movq / a quad of movl. This
will probably hurt older chips like P4 but should run faster on current
and future Intel processors. rdar://8817010
llvm-svn: 122955
the same as setcc. Optimize ADDC(0,0,FLAGS) -> SET_CARRY(FLAGS). This is
a step towards finishing off PR5443. In the testcase in that bug we now get:
movq %rdi, %rax
addq %rsi, %rax
sbbq %rcx, %rcx
testb $1, %cl
setne %dl
ret
instead of:
movq %rdi, %rax
addq %rsi, %rax
movl $0, %ecx
adcq $0, %rcx
testq %rcx, %rcx
setne %dl
ret
llvm-svn: 122219
their carry depenedencies with MVT::Flag operands) and use clean and beautiful
EFLAGS dependences instead.
We do this by changing the modelling of SBB/ADC to have EFLAGS input and outputs
(which is what requires the previous scheduler change) and change X86 ISelLowering
to custom lower ADDC and friends down to X86ISD::ADD/ADC/SUB/SBB nodes.
With the previous series of changes, this causes no changes in the testsuite, woo.
llvm-svn: 122213
the output to the correct register. Fixes a hidden problem uncovered
by the last patch where we'd try to DAG combine our MVT::Other node
oddly.
llvm-svn: 121358
result. This allows us to compile:
void *test12(long count) {
return new int[count];
}
into:
test12:
movl $4, %ecx
movq %rdi, %rax
mulq %rcx
movq $-1, %rdi
cmovnoq %rax, %rdi
jmp __Znam ## TAILCALL
instead of:
test12:
movl $4, %ecx
movq %rdi, %rax
mulq %rcx
seto %cl
testb %cl, %cl
movq $-1, %rdi
cmoveq %rax, %rdi
jmp __Znam
Of course it would be even better if the regalloc inverted the cmov to 'cmovoq',
which would eliminate the need for the 'movq %rdi, %rax'.
llvm-svn: 120936
backend that they were all implemented except umul. This one fell back
to the default implementation that did a hi/lo multiply and compared the
top. Fix this to check the overflow flag that the 'mul' instruction
sets, so we can avoid an explicit test. Now we compile:
void *func(long count) {
return new int[count];
}
into:
__Z4funcl: ## @_Z4funcl
movl $4, %ecx ## encoding: [0xb9,0x04,0x00,0x00,0x00]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
mulq %rcx ## encoding: [0x48,0xf7,0xe1]
seto %cl ## encoding: [0x0f,0x90,0xc1]
testb %cl, %cl ## encoding: [0x84,0xc9]
movq $-1, %rdi ## encoding: [0x48,0xc7,0xc7,0xff,0xff,0xff,0xff]
cmoveq %rax, %rdi ## encoding: [0x48,0x0f,0x44,0xf8]
jmp __Znam ## TAILCALL
instead of:
__Z4funcl: ## @_Z4funcl
movl $4, %ecx ## encoding: [0xb9,0x04,0x00,0x00,0x00]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
mulq %rcx ## encoding: [0x48,0xf7,0xe1]
testq %rdx, %rdx ## encoding: [0x48,0x85,0xd2]
movq $-1, %rdi ## encoding: [0x48,0xc7,0xc7,0xff,0xff,0xff,0xff]
cmoveq %rax, %rdi ## encoding: [0x48,0x0f,0x44,0xf8]
jmp __Znam ## TAILCALL
Other than the silly seto+test, this is using the o bit directly, so it's going in the right
direction.
llvm-svn: 120935
The user (i.e. whoever generated a call to the intrinsic in the first place) is
essentially asking for a particular instruction to be placed in the assembler.
If that instruction won't execute on the target machine, that's their problem
not ours. Two buildbots with processors that don't support SSE3 were barfing
on the apm.ll test in CodeGen/X86 because of this assertion.
llvm-svn: 120574
legalization time. Since at legalization time there is no mapping from
SDNode back to the corresponding LLVM instruction and the return
SDNode is target specific, this requires a target hook to check for
eligibility. Only x86 and ARM support this form of sibcall optimization
right now.
rdar://8707777
llvm-svn: 120501
nodes to indicate when ha16/lo16 modifiers should be used. This lets
us pass PowerPC/indirectbr.ll.
The one annoying thing about this patch is that the MCSymbolExpr isn't
expressive enough to represent ha16(label1-label2) which we need on
PowerPC. I have a terrible hack in the meantime, but this will have
to be revisited at some point.
Last major conversion item left is global variable references.
llvm-svn: 119105