types, as well as int<->ptr casts. This allows us to tailcall functions
with some trivial casts between the call and return (i.e. because the
return types disagree).
llvm-svn: 157798
We handle struct byval by inserting a pseudo op, which will be expanded to a
loop at ExpandISelPseudos.
A separate patch for clang will be submitted to enable struct byval.
rdar://9877866
llvm-svn: 157793
- compute size & offset at the same time. The side-effects of this are that we now support negative GEPs. It's now approaching a phase that it can be reused by other passes (e.g., lowering of the objectsize intrinsic)
- use APInt throughout to handle wrap-arounds
- add support for PHI instrumentation
- add a cache (required for recursive PHIs anyway)
- remove hoisting support for now, since it was wrong in a few cases
sorry for the churn here.. tests will follow soon.
llvm-svn: 157775
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 157755
Reg-units are named after their root registers, and most units have a
single root, so they simply print as 'AL', 'XMM0', etc. The rare dual
root reg-units print as FPSCR~FPSCR_NZCV, FP0~ST7, ...
The printing piggybacks on the existing register name tables, so no
extra const data space is required.
llvm-svn: 157754
be non contiguous, non overlapping and sorted by the lower end.
While this is technically a backward incompatibility, every frontent currently
produces range metadata with a single interval and we don't have any pass
that merges intervals yet, so no existing bitcode files should be rejected by
this.
llvm-svn: 157741
I disabled FMA3 autodetection, since the result may differ from expected for some benchmarks.
I added tests for GodeGen and intrinsics.
I did not change llvm.fma.f32/64 - it may be done later.
llvm-svn: 157737
It helps compile exotic inline asm. In the test case, normal GR32
virtual registers use up eax-edx so the final GR32_ABCD live range has
no registers left. Since all the live ranges were tiny, we had no way of
prioritizing the smaller register class.
This patch allows tiny unspillable live ranges to be evicted by tiny
unspillable live ranges from a smaller register class.
<rdar://problem/11542429>
llvm-svn: 157715
This also required making recursive simplifications until
nothing changes or a hard limit (currently 3) is hit.
With the simplification in place indvars can canonicalize
loops of the form
for (unsigned i = 0; i < a-b; ++i)
into
for (unsigned i = 0; i != a-b; ++i)
which used to fail because SCEV created a weird umax expr
for the backedge taken count.
llvm-svn: 157701
integer registers. This is already supported by the fastcc convention, but it doesn't
hurt to support it in the standard conventions as well.
In cases where we can cheat at the calling convention, this allows us to avoid returning
things through memory in more cases.
llvm-svn: 157698
If integer overflow causes one of the terms to reach zero, that can
force the entire expression to zero.
Fixes PR12929: cast<Ty>() argument of incompatible type
llvm-svn: 157673
Besides adding the new insertPass function, this patch uses it to
enhance the existing -print-machineinstrs so that the MachineInstrs
after a specific pass can be printed.
Patch by Bin Zeng!
llvm-svn: 157655
- hoist checks out of loops where SCEV is smart enough
- add additional statistics to measure how much we loose for not supporting interprocedural and pointers loaded from memory
llvm-svn: 157649
This required light surgery on the assembler and disassembler
because the instructions use an uncommon encoding. They are
the only two instructions in x86 that use register operands
and two immediates.
llvm-svn: 157634
ranges for the instruction about to be bundled. This fixes a bug in an external
project where an assertion was triggered due to spurious 'multiple defs' within
the bundle.
Patch by Ivan Llopard. Thanks Ivan!
llvm-svn: 157632
The test case feeds the following into InstCombine's visitSelect:
%tobool8 = icmp ne i32 0, 0
%phitmp = select i1 %tobool8, i32 3, i32 0
Then instcombine replaces the right side of the switch with 0, doesn't notice
that nothing changes and tries again indefinitely.
This fixes PR12897.
llvm-svn: 157587
Attribute bits above 1<<30 are now encoded correctly. Additionally,
the encoding/decoding functionality has been hoisted to helper functions
in Attributes.h in an effort to help the encoding/decoding to stay in
sync with the Attribute bitcode definitions.
llvm-svn: 157581
Implemented IntItem - the wrapper around APInt. Why not to use APInt item directly right now?
1. It will very difficult to implement case ranges as series of small patches. We got several large and heavy patches. Each patch will about 90-120 kb. If you replace ConstantInt with APInt in SwitchInst you will need to changes at the same time all Readers,Writers and absolutely all passes that uses SwitchInst.
2. We can implement APInt pool inside and save memory space. E.g. we use several switches that works with 256 bit items (switch on signatures, or strings). We can avoid value duplicates in this case.
3. IntItem can be easyly easily replaced with APInt.
4. Currenly we can interpret IntItem both as ConstantInt and as APInt. It allows to provide SwitchInst methods that works with ConstantInt for non-updated passes.
Why I need it right now? Currently I need to update SimplifyCFG pass (EqualityComparisons). I need to work with APInts directly a lot, so peaces of code
ConstantInt *V = ...;
if (V->getValue().ugt(AnotherV->getValue()) {
...
}
will look awful. Much more better this way:
IntItem V = ConstantIntVal->getValue();
if (AnotherV < V) {
}
Of course any reviews are welcome.
P.S.: I'm also going to rename ConstantRangesSet to IntegersSubset, and CRSBuilder to IntegersSubsetMapping (allows to map individual subsets of integers to the BasicBlocks).
Since in future these classes will founded on APInt, it will possible to use them in more generic ways.
llvm-svn: 157576
replicating the code for every place it's needed, we instead generate a function
that does that for us. This function is local to the executable, so there
shouldn't be any writing violations.
llvm-svn: 157564
making it stronger and more sane.
Delete the code from tblgen that produced the old code.
Besides being a path forward in intrinsic sanity, this also eliminates a bunch of
machine generated code that was compiled into Function.o
llvm-svn: 157545
definition in the map before calling itself to retrieve the
DIE for the declaration. Without this change, if this causes
getOrCreateSubprogramDIE to be recursively called on the definition,
it will create multiple DIEs for that definition. Fixes PR12831.
llvm-svn: 157541
SimplifyCFG tends to form a lot of 2-3 case switches when merging branches. Move
the most likely condition to the front so it is checked first and the others can
be skipped. This is currently not as effective as it could be because SimplifyCFG
destroys profiling metadata when merging branches and switches. Merging branch
weight metadata is tricky though.
This code touches at most 3 cases so I didn't use a proper sorting algorithm.
llvm-svn: 157521
then it doesn't alter the instructions composing it, however it would continue
to move the instructions to just before the expression root. Ensure it doesn't
move them either, so now it really does nothing if there is nothing to do. That
commit also ensured that nsw etc flags weren't cleared if the expression was not
being changed. Tweak this a bit so that it doesn't clear flags on the initial
part of a computation either if that part didn't change but later bits did.
llvm-svn: 157518
are passed in. However, those arguments may be in a write-protected area, as far
as the runtime library is concerned. For instance, the data could be placed into
a 'linkedit' section, which isn't writable. Emit the code from
llvm_gcda_increment_indirect_counter directly into the function instead.
Note: The code for this is ugly, and can lead to bloat. We should look into
simplifying this code instead of having all of these branches.
<rdar://problem/11181370>
llvm-svn: 157505
to pass around a struct instead of a large set of individual values. This
cleans up the interface and allows more information to be added to the struct
for future targets without requiring changes to each and every target.
NV_CONTRIB
llvm-svn: 157479
with arbitrary topologies (previously it would give up when hitting a diamond
in the use graph for example). The testcase from PR12764 is now reduced from
a pile of additions to the optimal 1617*%x0+208. In doing this I changed the
previous strategy of dropping all uses for expression leaves to one of dropping
all but one use. This works out more neatly (but required a bunch of tweaks)
and is also safer: some recently fixed bugs during recursive linearization were
because the linearization code thinks it completely owns a node if it has no uses
outside the expression it is linearizing. But if the node was also in another
expression that had been linearized (and thus all uses of the node from that
expression dropped) then the conclusion that it is completely owned by the
expression currently being linearized is wrong. Keeping one use from within each
linearized expression avoids this kind of mistake.
llvm-svn: 157467
The Hazard checker implements in-order contraints, or interlocked
resources. Ready instructions with hazards do not enter the available
queue and are not visible to other heuristics.
The major code change is the addition of SchedBoundary to encapsulate
the state at the top or bottom of the schedule, including both a
pending and available queue.
The scheduler now counts cycles in sync with the hazard checker. These
are minimum cycle counts based on known hazards.
Targets with no itinerary (x86_64) currently remain at cycle 0. To fix
this, we need to provide some maximum issue width for all targets. We
also need to add the concept of expected latency vs. minimum latency.
llvm-svn: 157427
I'm not sure it's really worth expressing this as a range rather than 3 specific equalities, but it doesn't seem fundamentally wrong either.
llvm-svn: 157398
LowerSwitch::Clusterify : main functinality was replaced with CRSBuilder::optimize, so big part of Clusterify's code was reduced.
test/Transform/LowerSwitch/feature.ll - this test was refactored: grep + count was replaced with FileCheck usage.
llvm-svn: 157384
Live ranges with a constrained register class may benefit from splitting
around individual uses. It allows the remaining live range to use a
larger register class where it may allocate. This is like spilling to a
different register class.
This is only attempted on constrained register classes.
<rdar://problem/11438902>
llvm-svn: 157354
Now that the coalescer keeps live intervals and machine code in sync at
all times, it needs to deal with identity copies differently.
When merging two virtual registers, all identity copies are removed
right away. This means that other identity copies must come from
somewhere else, and they are going to have a value number.
Deal with such copies by merging the value numbers before erasing the
copy instruction. Otherwise, we leave dangling value numbers in the live
interval.
This fixes PR12927.
llvm-svn: 157340
leader table. That's because it wasn't expecting instructions to turn up as
leader for a value number that is not its own, but equality propagation could
create this situation. One solution is to have the leader table use a WeakVH
but this slows down GVN by about 5%. Instead just have equality propagation not
add instructions to the leader table, only constants and arguments. In theory
this might cause GVN to run more (each time it changes something it runs again)
but it doesn't seem to occur enough to cause a slow down.
llvm-svn: 157251
instruction encodings can be excluded during mips16 processing.
This revision fixes the issue raised by Jim Grosbach.
bool hasStandardEncoding() const { return !inMips16Mode(); }
When micromips is added it will be
bool StandardEncoding() const { return !inMips16Mode()&& !inMicroMipsMode(); }
No additional testing is needed other than to assure that there is no regression
from this patch.
Patch by Reed Kotler.
llvm-svn: 157234
32-bit offset jump tables just use real branch instructions and so aren't
marked as data regions. We were still emitting the .end_data_region
marker though, which assert()ed.
rdar://11499158
llvm-svn: 157221
This helps compile time when the greedy register allocator splits live
ranges in giant functions. Without the bias, we would try to grow
regions through the giant edge bundles, usually to find out that the
region became too big and expensive.
If a live range has many uses in blocks near the giant bundle, the small
negative bias doesn't make a big difference, and we still consider
regions including the giant edge bundle.
Giant edge bundles are usually connected to landing pads or indirect
branches.
llvm-svn: 157174
With physreg joining out of the way, it is easy to recognize the
instructions that need their kill flags cleared while testing for
interference.
This allows us to skip the final scan of all instructions for an 11%
speedup of the coalescer pass.
llvm-svn: 157169
may be RAUW'd by the recursive call to LegalizeOps; instead, retrieve
the other operands when calling UpdateNodeOperands. Fixes PR12889.
llvm-svn: 157162
There should be no difference in the resulting binary, given a sufficiently
smart compiler. However we already had compiler timeouts on the generated
code in Intrinsics.gen, this hopefully makes the lives of slow buildbots a
little easier.
llvm-svn: 157161
Dead code elimination during coalescing could cause a virtual register
to be split into connected components. The following rewriting would be
confused about the already joined copies present in the code, but
without a corresponding value number in the live range.
Erase all joined copies instantly when joining intervals such that the
MI and LiveInterval representations are always in sync.
llvm-svn: 157135
The current code will generate a prologue which starts with something like:
mflr 0
stw 31, -4(1)
stw 0, 4(1)
stwu 1, -16(1)
But under the PPC32 SVR4 ABI, access to negative offsets from R1 is not allowed.
This was pointed out by Peter Bergner.
llvm-svn: 157133
Dead code and joined copies are now eliminated on the fly, and there is
no need for a post pass.
This makes the coalescer work like other modern register allocator
passes: Code is changed on the fly, there is no pending list of changes
to be committed.
llvm-svn: 157132
The late dead code elimination is no longer necessary.
The test changes are cause by a register hint that can be either %rdi or
%rax. The choice depends on the use list order, which this patch changes.
llvm-svn: 157131
Before rewriting uses of one value in A to register B, check that there
are no tied uses. That would require multiple A values to be rewritten.
This bug can't bite in the current version of the code for a fairly
subtle reason: A tied use would have caused 2-addr to insert a copy
before the use. If the copy has been coalesced, it will be found by the
same loop changed by this patch, and the optimization is aborted.
This was exposed by 400.perlbench and lua after applying a patch that
deletes joined copies aggressively.
llvm-svn: 157130
Remaining virtreg->physreg copies were rematerialized during
updateRegDefsUses(), but we already do the same thing in joinCopy() when
visiting the physreg copy instruction.
Eliminate the preserveSrcInt argument to reMaterializeTrivialDef(). It
is now always true.
llvm-svn: 157103
Dead copies cause problems because they are trivial to coalesce, but
removing them gived the live range a dangling end point. This patch
enables full dead code elimination which trims live ranges to their uses
so end points don't dangle.
DCE may erase multiple instructions. Put the pointers in an ErasedInstrs
set so we never risk visiting erased instructions in the work list.
There isn't supposed to be any dead copies entering RegisterCoalescer,
but they do slip by as evidenced by test/CodeGen/X86/coalescer-dce.ll.
llvm-svn: 157101
getUDivExpr attempts to simplify by checking for overflow.
isLoopEntryGuardedByCond then evaluates the loop predicate which
may lead to the same getUDivExpr causing endless recursion.
Fixes PR12868: clang 3.2 segmentation fault.
llvm-svn: 157092
Use a dedicated MachO load command to annotate data-in-code regions.
This is the same format the linker produces for final executable images,
allowing consistency of representation and use of introspection tools
for both object and executable files.
Data-in-code regions are annotated via ".data_region"/".end_data_region"
directive pairs, with an optional region type.
data_region_directive := ".data_region" { region_type }
region_type := "jt8" | "jt16" | "jt32" | "jta32"
end_data_region_directive := ".end_data_region"
The previous handling of ARM-style "$d.*" labels was broken and has
been removed. Specifically, it didn't handle ARM vs. Thumb mode when
marking the end of the section.
rdar://11459456
llvm-svn: 157062
It is no longer necessary to separate VirtCopies, PhysCopies, and
ImpDefCopies. Implicitly defined copies are extremely rare after we
added the ProcessImplicitDefs pass, and physical register copies are not
joined any longer.
llvm-svn: 157059
This has been disabled for a while, and it is not a feature we want to
support. Copies between physical and virtual registers are eliminated by
good hinting support in the register allocator. Joining virtual and
physical registers is really a form of register allocation, and the
coalescer is not properly equipped to do that. In particular, it cannot
backtrack coalescing decisions, and sometimes that would cause it to
create programs that were impossible to register allocate, by exhausting
a small register class.
It was also very difficult to keep track of the live ranges of aliasing
registers when extending the live range of a physreg. By disabling
physreg joining, we can let fixed physreg live ranges remain constant
throughout the register allocator super-pass.
One type of physreg joining remains: A virtual register that has a
single value which is a copy of a reserved register can be merged into
the reserved physreg. This always lowers register pressure, and since we
don't compute live ranges for reserved registers, there are no problems
with aliases.
llvm-svn: 157055
SelectionDAGBuilder::Clusterify : main functinality was replaced with CRSBuilder::optimize, so big part of Clusterify's code was reduced.
llvm-svn: 157046
non-profitable commute using outdated info. The test case would still fail
because of poor pre-RA schedule. That will be fixed by MI scheduler.
rdar://11472010
llvm-svn: 157038
the 0b10 mask encoding bits. Make MSR APSR writes without a _<bits> qualifier
an alias for MSR APSR_nzcvq even though ARM as deprecated it use. Also add
support for suffixes (_nzcvq, _g, _nzcvqg) for APSR versions. Some FIXMEs in
the code for better error checking when versions shouldn't be used.
rdar://11457025
llvm-svn: 157019
Introduce the basic strategy for register pressure scheduling.
1) Respect target limits at all times.
2) Indentify critical register classes (pressure sets).
Track pressure within the scheduled region.
Avoid increasing scheduled pressure for critical registers.
3) Avoid exceeding the max pressure of the region prior to scheduling.
Added logic for picking between the top and bottom ready Q's based on
regpressure heuristics.
Status: functional but needs to be asjusted to achieve good results.
llvm-svn: 157006
RegisterCoalescer set <undef> flags on all operands of copy instructions
that are scheduled to be removed. This is so they won't affect
shrinkToUses() by introducing false register reads.
Make sure those <undef> flags are never cleared, or shrinkToUses() could
cause live intervals to end at instructions about to be deleted.
This would be a lot simpler if RegisterCoalescer could just erase joined
copies immediately instead of keeping all the to-be-deleted instructions
around.
This fixes PR12862. Unfortunately, bugpoint can't create a sane test
case for this. Like many other coalescer problems, this failure depends
of a very fragile series of events.
<rdar://problem/11474428>
llvm-svn: 157001
separate side table, using the handy SequenceToOffsetTable class. This encodes all
these weird things into another 256 bytes, allowing all intrinsics to be encoded this way.
llvm-svn: 156995
llc to recognize MIPS16 as a MIPS ASE extension. -mips16 will mean the
mips16 ASE for mips32 by default.
As part of fixing of adding this we discovered some small changes that
need to be made to MipsInstrInfo::storeRegToStackSLot and
MipsInstrInfo::loadRegFromStackSlot. We were using some "==" equality tests
where in fact we should have been using Mips::<regclas>.hasSubClassEQ instead,
per suggestion of Jakob Stoklund Olesen.
Patch by Reed Kotler.
llvm-svn: 156958
When widening an existing <def,reads-undef> operand to a super-register,
it may be necessary to clear the <undef> flag because the wider register
is now read-modify-write through the instruction.
Conversely, it may be necessary to add an <undef> flag when the
coalescer turns a full-register def into a sub-register def, but the
larger register wasn't live before the instruction.
This happens in test/CodeGen/ARM/coalesce-subregs.ll, but the test
is too small for the <undef> flags to affect the generated code.
llvm-svn: 156951
It's more flexible for MCJIT tasks, in addition it's provides a invalidation instruction cache for code sections which will be used before JIT code will be executed.
llvm-svn: 156933
generated code (for Intrinsic::getType) into a table. This handles common cases right now,
but I plan to extend it to handle all cases and merge in type verification logic as well
in follow-on patches.
llvm-svn: 156905
It is now possible to coalesce weird skewed sub-register copies by
picking a super-register class larger than both original registers. The
included test case produces code like this:
vld2.32 {d16, d17, d18, d19}, [r0]!
vst2.32 {d18, d19, d20, d21}, [r0]
We still perform interference checking as if it were a normal full copy
join, so this is still quite conservative. In particular, the f1 and f2
functions in the included test case still have remaining copies because
of false interference.
llvm-svn: 156878
It is possible to coalesce two overlapping registers to a common
super-register that it larger than both of the original registers.
The important difference is that it may be necessary to rewrite DstReg
operands as well as SrcReg operands because the sub-register index has
changed.
This behavior is still disabled by CoalescerPair.
llvm-svn: 156869
Now both SrcReg and DstReg can be sub-registers of the final coalesced
register.
CoalescerPair::setRegisters still rejects such copies because
RegisterCoalescer doesn't yet handle them.
llvm-svn: 156848
This feature avoids creating edges in the scheduler's dependence graph
for non-aliasing memory operations according to whichever alias
analysis is available. It has been fully tested in Hexagon. Before
making this default, it needs to be extended to handle multiple
MachineMemOperands, compile time needs more evaluation, and
benchmarking on X86 and ARM is needed.
Patch by Sergei Larin!
llvm-svn: 156842
The purpose of this option is to silence error messages issued by machine
verifier passes and enable them to run to the end. If this option is not
provided, -verify-machineinstrs complains when it discovers there is a
non-terminator instruction (an instruction that is in a delay slot) after the
first terminator in a basic block.
llvm-svn: 156790
so that it can be reused in MemCpyOptimizer. This analysis is needed to remove
an unnecessary memcpy when returning a struct into a local variable.
rdar://11341081
PR12686
llvm-svn: 156776
Ordinary patch for PR1255.
Added new case-ranges orientated methods for adding/removing cases in SwitchInst. After this patch cases will internally representated as ConstantArray-s instead of ConstantInt, externally cases wrapped within the ConstantRangesSet object.
Old methods of SwitchInst are also works well, but marked as deprecated. So on this stage we have no side effects except that I added support for case ranges in BitcodeReader/Writer, of course test for Bitcode is also added. Old "switch" format is also supported.
llvm-svn: 156704
the ones that get or set the frame index for the $gp save slot.
Remove the piece of code in MipsFunctionInfo::getGlobalBaseReg() which returns
GP. This function should always return a virtual register.
llvm-svn: 156695
- Stop creating stack frame objects needed for saving $gp.
- Insert a node that copies the global pointer register to register $gp
before the call node. This will ensure $gp is valid at the entry of the
called function.
llvm-svn: 156692
- Stop emitting instructions needed to initialize the global pointer register.
- Stop emitting .cprestore directive.
- Do not take into account the $gp save slot when computing stack size.
llvm-svn: 156691
- Remove code which lowers pseudo SETGP01.
- Fix LowerSETGP01. The first two of the three instructions that are emitted to
initialize the global pointer register now use register $2.
- Stop emitting .cpload directive.
llvm-svn: 156689
pointer register.
This is the first of the series of patches which clean up the way global pointer
register is used. The patches will make the following improvements:
- Make $gp an allocatable temporary register rather than reserving it.
- Use a virtual register as the global pointer register and let the register
allocator decide which register to assign to it or whether spill/reloads are
needed.
- Make sure $gp is valid at the entry of a called function, which is necessary
for functions using lazy binding.
- Remove the need for emitting .cprestore and .cpload directives.
llvm-svn: 156671
This patch will optimize the following cases:
sub r1, r3 | sub r1, imm
cmp r3, r1 or cmp r1, r3 | cmp r1, imm
bge L1
TO
subs r1, r3
bge L1 or ble L1
If the branch instruction can use flag from "sub", then we can replace
"sub" with "subs" and eliminate the "cmp" instruction.
rdar: 10734411
llvm-svn: 156599
Prioritize the instruction that comes closest to keeping pressure
under the target's limit. Then prioritize instructions that avoid
increasing the max pressure in the scheduled region. The max pressure
heuristic is a tad aggressive. Later I'll fix it to consider the
unscheduled pressure as well.
WIP: This is mostly functional but untested and not likely to do much good yet.
llvm-svn: 156574
Added getMaxExcessUpward/DownwardPressure. They somewhat abuse the
tracker by speculatively handling an instruction out of order. But it
is convenient for now. In the future, we will cache each instruction's
pressure contribution to make this efficient.
llvm-svn: 156561
This patch will optimize the following cases:
sub r1, r3 | sub r1, imm
cmp r3, r1 or cmp r1, r3 | cmp r1, imm
bge L1
TO
subs r1, r3
bge L1 or ble L1
If the branch instruction can use flag from "sub", then we can replace
"sub" with "subs" and eliminate the "cmp" instruction.
rdar: 10734411
llvm-svn: 156550
Instruction::IsIdenticalToWhenDefined.
This manifested itself when inlining two calls to the same function. The
inlined function had a switch statement that returned one of a set of
global variables. Without this modification, the two phi instructions that
chose values from the branches of the switch instruction inlined from the
callee were considered equivalent and jump-threading replaced a load for the
first switch value with a phi selecting from the second switch, thereby
producing incorrect code.
This patch has been tested with "make check-all", "lnt runteste nt", and
llvm self-hosted, and on the original program that had this problem,
wireshark.
<rdar://problem/11025519>
llvm-svn: 156548
refactor code a bit to enable future changes to support run-time information
add support to compute allocation sizes at run-time if penalty > 1 (e.g., malloc(x), calloc(x, y), and VLAs)
llvm-svn: 156515
For the Family 6 switch in sys::getHostCPUName, an unrecognized model was
reported as "i686". That's a really bad default since it means that new
CPUs will be treated as if they can only use 32-bit code. This just looks
at the cpuid extended feature flag for 64 bit support, and if that is set,
it uses a default x86-64 cpu. Similar logic is already used for the Family
15 code. <rdar://problem/11314502>
llvm-svn: 156486
This lets you save the textual representation of the LLVM IR to a file.
Before this patch it could only be printed to STDERR from llvm-c.
Patch by Carlo Kok!
llvm-svn: 156479
When a combine twiddles an extract_vector, care should be take to preserve
the type of the index operand. No luck extracting a reasonable testcase,
unfortunately.
rdar://11391009
llvm-svn: 156419
replace the operands of expressions with only one use with undef and generate
a new expression for the original without using RAUW to update the original.
Thus any copies of the original expression held in a vector may end up
referring to some bogus value - and using a ValueHandle won't help since there
is no RAUW. There is already a mechanism for getting the effect of recursion
non-recursively: adding the value to be recursed on to RedoInsts. But it wasn't
being used systematically. Have various places where recursion had snuck in at
some point use the RedoInsts mechanism instead. Fixes PR12169.
llvm-svn: 156379
Added new case-ranges orientated methods for adding/removing cases in SwitchInst. After this patch cases will internally representated as ConstantArray-s instead of ConstantInt, externally cases wrapped within the ConstantRangesSet object.
Old methods of SwitchInst are also works well, but marked as deprecated. So on this stage we have no side effects except that I added support for case ranges in BitcodeReader/Writer, of course test for Bitcode is also added. Old "switch" format is also supported.
llvm-svn: 156374
At least some of them:
%vreg1:sub_16bit = COPY %vreg2:sub_16bit; GR64:%vreg1, GR32: %vreg2
Previously, we couldn't figure out that the above copy could be
eliminated by coalescing %vreg2 with %vreg1:sub_32bit.
The new getCommonSuperRegClass() hook makes it possible.
This is not very useful yet since the unmodified part of the destination
register usually interferes with the source register. The coalescer
needs to understand sub-register interference checking first.
llvm-svn: 156334
The getPointerRegClass() hook can return register classes that depend on
the calling convention of the current function (ptr_rc_tailcall).
So far, we have been able to infer the calling convention from the
subtarget alone, but as we add support for multiple calling conventions
per target, that no longer works.
Patch by Yiannis Tsiouris!
llvm-svn: 156328
optional library support to the llvm-build tool:
- Add new command line parameter to llvm-build: “--enable-optional-libraries”
- Add handing of new llvm-build library type “OptionalLibrary”
- Update Cmake and automake build systems to pass correct flags to llvm-build
based on configuration
Patch by Dan Malea!
llvm-svn: 156319
This function is a generalization of getMatchingSuperRegClass() to the
symmetric case where both sides are using a sub-register index. It will
find a super-register class and sub-register indexes that make this
diagram commute:
PreA
SuperRC ----------> RCA
| |
| |
PreB | | SubA
| |
| |
V V
RCB ----------> SubRC
SubB
This can be used to coalesce copies like:
%vreg1:sub16 = COPY %vreg2:sub16; GR64:%vreg1, GR32: %vreg2
llvm-svn: 156317
This patch will optimize -(x != 0) on X86
FROM
cmpl $0x01,%edi
sbbl %eax,%eax
notl %eax
TO
negl %edi
sbbl %eax %eax
In order to generate negl, I added patterns in Target/X86/X86InstrCompiler.td:
def : Pat<(X86sub_flag 0, GR32:$src), (NEG32r GR32:$src)>;
rdar: 10961709
llvm-svn: 156312
The primitive conservative heuristic seems to give a slight overall
improvement while not regressing stuff. Make it available to wider
testing. If you notice any speed regressions (or significant code
size regressions) let me know!
llvm-svn: 156258
- Just use sys::Process::GetRandomNumber instead of having two poor
implementations.
- This is ~70 times (!) faster on my OS X machine.
llvm-svn: 156238
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic, it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-Order architectures are
unlikely to benefit as well, those are included in the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case to him and Evan Cheng for providing
me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
This will be used to determine whether it's profitable to turn a select into a
branch when the branch is likely to be predicted.
Currently enabled for everything but Atom on X86 and Cortex-A9 devices on ARM.
I'm not entirely happy with the name of this flag, suggestions welcome ;)
llvm-svn: 156233
We want the representative register class to contain the largest
super-registers available. This makes the function less sensitive to the
register class numbering.
llvm-svn: 156220
In file included from ../lib/Target/NVPTX/VectorElementize.cpp:53:
../lib/Target/NVPTX/NVPTX.h:44:3: warning: default label in switch which covers all enumeration values [-Wcovered-switch-default]
default: assert(0 && "Unknown condition code");
^
1 warning generated.
The prevailing pattern in LLVM is to not use a default label, and instead to
use llvm_unreachable to denote that the switch in fact covers all return paths
from the function.
llvm-svn: 156209
add a new Region::block_iterator which actually iterates over the basic
blocks of the region.
The old iterator, now call 'block_node_iterator' iterates over
RegionNodes which contain a single basic block. This works well with the
GraphTraits-based iterator design, however most users actually want an
iterator over the BasicBlocks inside these RegionNodes. Now the
'block_iterator' is a wrapper which exposes exactly this interface.
Internally it uses the block_node_iterator to walk all nodes which are
single basic blocks, but transparently unwraps the basic block to make
user code simpler.
While this patch is a bit of a wash, most of the updates are to internal
users, not external users of the RegionInfo. I have an accompanying
patch to Polly that is a strict simplification of every user of this
interface, and I'm working on a pass that also wants the same simplified
interface.
This patch alone should have no functional impact.
llvm-svn: 156202
The new target machines are:
nvptx (old ptx32) => 32-bit PTX
nvptx64 (old ptx64) => 64-bit PTX
The sources are based on the internal NVIDIA NVPTX back-end, and
contain more functionality than the current PTX back-end currently
provides.
NV_CONTRIB
llvm-svn: 156196
of the CodeExtractor utility. This allows speculatively computing input
and output sets to measure the likely size impact of the code
extraction.
These sets cannot be reused sadly -- we mutate the function prior to
forming the final sets used by the actual extraction.
The interface has been revamped slightly to make it easier to use
correctly by making the interface const and sinking the computation of
the number of exit blocks into the full extraction function and away
from the rest of this logic which just computed two output parameters.
llvm-svn: 156168
and expose it as a utility class rather than as free function wrappers.
The simple free-function interface works well for the bugpoint-specific
pass's uses of code extraction, but in an upcoming patch for more
advanced code extraction, they simply don't expose a rich enough
interface. I need to expose various stages of the process of doing the
code extraction and query information to decide whether or not to
actually complete the extraction or give up.
Rather than build up a new predicate model and pass that into these
functions, just take the class that was actually implementing the
functions and lift it up into a proper interface that can be used to
perform code extraction. The interface is cleaned up and re-documented
to work better in a header. It also is now setup to accept the blocks to
be extracted in the constructor rather than in a method.
In passing this essentially reverts my previous commit here exposing
a block-level query for eligibility of extraction. That is no longer
necessary with the more rich interface as clients can query the
extraction object for eligibility directly. This will reduce the number
of walks of the input basic block sequence by quite a bit which is
useful if this enters the normal optimization pipeline.
llvm-svn: 156163
This moves the logic for selecting a TLS model to a single place,
instead of the previous three (ARM, Mips, and X86 which already
uses this function).
llvm-svn: 156162
The masks returned by SuperRegClassIterator are computed automatically
by TableGen. This is better than depending on the manually specified
SuperRegClasses.
llvm-svn: 156147
This iterator class provides a more abstract interface to the (Idx,
Mask) lists of super-registers for a register class. The layout of the
tables shouldn't be exposed to clients.
llvm-svn: 156144
minor behavior changes with this, but nothing I have seen evidence of in
the wild or expect to be meaningful. The real goal is unifying our logic
and simplifying the interfaces. A summary of the changes follows:
- Make 'callIsSmall' actually accept a callsite so it can handle
intrinsics, and simplify callers appropriately.
- Nuke a completely bogus declaration of 'callIsSmall' that was still
lurking in InlineCost.h... No idea how this got missed.
- Teach the 'isInstructionFree' about the various more intelligent
'free' heuristics that got added to the inline cost analysis during
review and testing. This mostly surrounds int->ptr and ptr->int casts.
- Switch most of the interesting parts of the inline cost analysis that
were essentially computing 'is this instruction free?' to use the code
metrics routine instead. This way we won't keep duplicating logic.
All of this is motivated by the desire to allow other passes to compute
a roughly equivalent 'cost' metric for a particular basic block as the
inline cost analysis. Sadly, re-using the same analysis for both is
really messy because only the actual inline cost analysis is ever going
to go to the contortions required for simplification, SROA analysis,
etc.
llvm-svn: 156140
for the assembler and disassembler. Which were not being set/read correctly
for offsets greater than 22 bits in some cases.
Changes to lib/Target/ARM/ARMAsmBackend.cpp from Gideon Myles!
llvm-svn: 156118
extraction into a public interface. Also clean it up and apply it more
consistently such that we check for landing pads *anywhere* in the
extracted code, not just in single-block extraction.
This will be used to guide decisions in passes that are planning to
eventually perform a round of code extraction.
llvm-svn: 156114
to catch cases like:
%reg1024<def> = MOV r1
%reg1025<def> = MOV r0
%reg1026<def> = ADD %reg1024, %reg1025
r0 = MOV %reg1026
By commuting ADD, it let coalescer eliminate all of the copies. However, there
was a bug in the heuristics where it ended up commuting the ADD in:
%reg1024<def> = MOV r0
%reg1025<def> = MOV 0
%reg1026<def> = ADD %reg1024, %reg1025
r0 = MOV %reg1026
That did no benefit but rather ensure the last MOV would not be coalesced.
rdar://11355268
llvm-svn: 156048
The ensures that virtual registers always belong to an allocatable class.
If your target attempts to create a vreg for an operand that has no
allocatable register subclass, you will crash quickly.
This ensures that targets define register classes as intended.
llvm-svn: 156046
Expressions for movw/movt don't always have an :upper16: or :lower16:
on them and that's ok. When they don't, it's just a plain [0-65536]
immediate result, effectively the same as a :lower16: variant kind.
rdar://10550147
llvm-svn: 155941
in order to avoid assertion failures in the register scavenger. The assertion
failures were “Bad machine code: Using an undefined physical register” and
“Bad machine code: MBB exits via unconditional fall-through but its successor
differs from its CFG successor!”.
llvm-svn: 155930
Previously, an unsupported/unknown assembler directive issued a warning.
That's generally unsafe, and inconsistent with the behaviour of pretty
much every system assembler. Now that the MC assemblers are mature
enough to be the default on multiple targets, it's reasonable to
issue errors for these.
For target or platform directives that need to stay warnings, we
should add explicit handlers for them in, e.g., ELFAsmParser.cpp,
DarwinAsmParser.cpp, et. al., and issue the warning there.
rdar://9246275
llvm-svn: 155926
The caller is already responsible for eating any additional input on the
line. Putting an additional EatToEndOfStatement() in ParseStatement()
causes an entire extra statement to be consumed when treating warnings
as errors. For example, test/MC/macros.s will assert() because the
.endmacro directive is missed as a result.
rdar://11355843
llvm-svn: 155925
This patch will optimize the following cases on X86
(a > b) ? (a-b) : 0
(a >= b) ? (a-b) : 0
(b < a) ? (a-b) : 0
(b <= a) ? (a-b) : 0
FROM
movl %edi, %ecx
subl %esi, %ecx
cmpl %edi, %esi
movl $0, %eax
cmovll %ecx, %eax
TO
xorl %eax, %eax
subl %esi, %edi
cmovll %eax, %edi
movl %edi, %eax
rdar: 10734411
llvm-svn: 155919
- Improved parameter names for clarity
- Added comments
- emitCommonSymbols should return void because its return value is not being
used anywhere
- Attempt to reduce the usage of the RelocationValueRef type. Restricts it
for a single goal and may serve as a step for eventual removal.
llvm-svn: 155908
The TargetPassManager's default constructor wants to initialize the PassManager
to 'null'. But it's illegal to bind a null reference to a null l-value. Make the
ivar a pointer instead.
PR12468
llvm-svn: 155902
- There's no point having a different type for the local and global symbol
tables.
- Renamed SymbolTable to GlobalSymbolTable to clarify the intention
- Improved const correctness where relevant
llvm-svn: 155898
<rdar://problem/11291436>.
This is a second attempt at a fix for this, the first was r155468. Thanks
to Chandler, Bob and others for the feedback that helped me improve this.
llvm-svn: 155866
Replace some assert() calls w/ actual diagnostics. In a perfect world,
there'd be range checks on these values long before things ever reached
this code. For now, though, issuing a better-late-than-never diagnostic
is still a big improvement over assert().
rdar://11347287
llvm-svn: 155851
This was exposed by SingleSource/UnitTests/Vector/constpool.c.
The computed size of a basic block isn't always a multiple of its known
alignment, and that can introduce extra alignment padding after the
block.
<rdar://problem/11347135>
llvm-svn: 155845
On x86-32, structure return via sret lets the callee pop the hidden
pointer argument off the stack, which the caller then re-pushes.
However if the calling convention is fastcc, then a register is used
instead, and the caller should not adjust the stack. This is
implemented with a check of IsTailCallConvention
X86TargetLowering::LowerCall but is now checked properly in
X86FastISel::DoSelectCall.
(this time, actually commit what was reviewed!)
llvm-svn: 155825
ARM BUILD_VECTORs created after type legalization cannot use i8 or i16
operands, since those types are not legal. Instead use i32 operands, which
will be implicitly truncated by the BUILD_VECTOR to match the element type.
llvm-svn: 155824
relocations are resolved. It's much more reasonable to do this decision when
relocations are just being added - we have all the information at that point.
Also a bit of renaming and extra comments to clarify extensions.
llvm-svn: 155819
Allow the "SplitCriticalEdge" function to split the edge to a landing pad. If
the pass is *sure* that it thinks it knows what it's doing, then it may go ahead
and specify that the landing pad can have its critical edge split. The loop
unswitch pass is one of these passes. It will split the critical edges of all
edges coming from a loop to a landing pad not within the loop. Doing so will
retain important loop analysis information, such as loop simplify.
llvm-svn: 155817
- Add comments
- Change field names to be more reasonable
- Fix indentation and naming to conform to coding conventions
- Remove unnecessary includes / replace them by forward declatations
llvm-svn: 155815
The code could search past the end of the basic block when there was
already a constant pool entry after the block.
Test case with giant basic block in SingleSource/UnitTests/Vector/constpool.c
llvm-svn: 155753
This time, also fix the caller of AddGlue to properly handle
incomplete chains. AddGlue had failure modes, but shamefully hid them
from its caller. It's luck ran out.
Fixes rdar://11314175: BuildSchedUnits assert.
llvm-svn: 155749
Make sure when parsing the Thumb1 sp+register ADD instruction that
the source and destination operands match. In thumb2, just use the
wide encoding if they don't. In Thumb1, issue a diagnostic.
rdar://11219154
llvm-svn: 155748
On x86-32, structure return via sret lets the callee pop the hidden
pointer argument off the stack, which the caller then re-pushes.
However if the calling convention is fastcc, then a register is used
instead, and the caller should not adjust the stack. This is
implemented with a check of IsTailCallConvention
X86TargetLowering::LowerCall but is now checked properly in
X86FastISel::DoSelectCall.
llvm-svn: 155745
Previously, ARMConstantIslandPass would conservatively compute the
address of an aligned basic block as:
RoundUpToAlignment(Offset + UnknownPadding)
This worked fine for the layout algorithm itself, but it could fool the
verify() function because it accounts for alignment padding twice: Once
when adding the worst case UnknownPadding, and again by rounding up the
fictional block offset. This meant that when optimizeThumb2Instructions
would shrink an instruction, the conservative distance estimate could
grow. That shouldn't be possible since the woorst case alignment padding
wss already included.
This patch drops the use of RoundUpToAlignment, and depends only on
worst case padding to compute conservative block offsets. This has the
weird effect that the computed offset for an aligned block may not be
aligned.
The important difference is that shrinking an instruction can never
cause the estimated distance between two instructions to grow. The
estimated distance is always larger than the real distance that only the
assembler knows.
<rdar://problem/11339352>
llvm-svn: 155744
x == -y --> x+y == 0
x != -y --> x+y != 0
On x86, the generated code goes from
negl %esi
cmpl %esi, %edi
je .LBB0_2
to
addl %esi, %edi
je .L4
This case is correctly handled for ARM with "cmn".
Patch by Manman Ren.
rdar://11245199
PR12545
llvm-svn: 155739
Target specific types should not be vectorized. As a practical matter,
these types are already register matched (at least in the x86 case),
and codegen does not always work correctly (at least in the ppc case,
and this is not worth fixing because ppc_fp128 is currently broken and
will probably go away soon).
llvm-svn: 155729
* Model FPSW (the FPU status word) as a register.
* Add ISel patterns for the FUCOM*, FNSTSW and SAHF instructions.
* During Legalize/Lowering, build a node sequence to transfer the comparison
result from FPSW into EFLAGS. If you're wondering about the right-shift: That's
an implicit sub-register extraction (%ax -> %ah) which is handled later on by
the instruction selector.
Fixes PR6679. Patch by Christoph Erhardt!
llvm-svn: 155704
instead of getAggregateElement. This has the advantage of being
more consistent and allowing higher-level constant folding to
procede even if an inner extract element cannot be folded.
Make ConstantFoldInstruction call ConstantFoldConstantExpression
on the instruction's operands, making it more consistent with
ConstantFoldConstantExpression itself. This makes sure that
ConstantExprs get TargetData-aware folding before being handed
off as operands for further folding.
This causes more expressions to be folded, but due to a known
shortcoming in constant folding, this currently has the side effect
of stripping a few more nuw and inbounds flags in the non-targetdata
side of constant-fold-gep.ll. This is mostly harmless.
This fixes rdar://11324230.
llvm-svn: 155682
The required checks are moved to ChainInstruction() itself and the
policy decisions are moved to IVChain::isProfitableInc().
Also cache the ExprBase in IVChain to avoid frequent recomputations.
No functional change intended.
llvm-svn: 155676
DAGCombine strangeness may result in multiple loads from the same
offset. They both may try to glue themselves to another load. We could
insist that the redundant loads glue themselves to each other, but the
beter fix is to bail out from bad gluing at the time we detect it.
Fixes rdar://11314175: BuildSchedUnits assert.
llvm-svn: 155668
The base address for the PC-relative load is Align(PC,4), so it's the
address of the word containing the 16-bit instruction, not the address
of the instruction itself. Ugh.
rdar://11314619
llvm-svn: 155659
On some cores it's a bad idea for performance to mix VFP and NEON instructions
and since these patterns are NEON anyway, the NEON load should be used.
llvm-svn: 155630
elements to minimize the number of multiplies required to compute the
final result. This uses a heuristic to attempt to form near-optimal
binary exponentiation-style multiply chains. While there are some cases
it misses, it seems to at least a decent job on a very diverse range of
inputs.
Initial benchmarks show no interesting regressions, and an 8%
improvement on SPASS. Let me know if any other interesting results (in
either direction) crop up!
Credit to Richard Smith for the core algorithm, and helping code the
patch itself.
llvm-svn: 155616
the feature set of v7a. This comes about if the user specifies something like
-arch armv7 -mcpu=cortex-m3. We shouldn't be generating instructions such as
uxtab in this case.
rdar://11318438
llvm-svn: 155601
Cross-class joins have been normal and fully supported for a while now.
With TableGen generating the getMatchingSuperRegClass() hook, they are
unlikely to cause problems again.
llvm-svn: 155552
Remove the heuristic for disabling cross-class joins. The greedy
register allocator can handle the narrow register classes, and when it
splits a live range, it can pick a larger register class.
Benchmarks were unaffected by this change.
<rdar://problem/11302212>
llvm-svn: 155551
When an instruction match is found, but the subtarget features it
requires are not available (missing floating point unit, or thumb vs arm
mode, for example), issue a diagnostic that identifies what the feature
mismatch is.
rdar://11257547
llvm-svn: 155499
constants in C++11 mode. I have no idea why it required such particular
circumstances to get here, the code seems clearly to rely upon unchecked
assumptions.
Specifically, when we decide to form an index into a struct type, we may
have gone through (at least one) zero-length array indexing round, which
would have left the offset un-adjusted, and thus not necessarily valid
for use when indexing the struct type.
This is just an canonicalization step, so the correct thing is to refuse
to canonicalize nonsensical GEPs of this form. Implemented, and test
case added.
Fixes PR12642. Pair debugged and coded with Richard Smith. =] I credit
him with most of the debugging, and preventing me from writing the wrong
code.
llvm-svn: 155466
MachineInstr sequence.
This uses the new target interface for tracking register pressure
using pressure sets to model overlapping register classes and
subregisters.
RegisterPressure results can be tracked incrementally or stored at
region boundaries. Global register pressure can be deduced from local
RegisterPressure results if desired.
This is an early, somewhat untested implementation. I'm working on
testing it within the context of a register pressure reducing
MachineScheduler.
llvm-svn: 155454
immediate. We can't use it here because the shuffle code does not check that
the lower part of the word is identical to the upper part.
llvm-svn: 155440
using the pattern (vbroadcast (i32load src)). In some cases, after we generate
this pattern new users are added to the load node, which prevent the selection
of the blend pattern. This commit provides fallback patterns which perform
in-vector broadcast (using in-vector vbroadcast in AVX2 and pshufd on AVX1).
llvm-svn: 155437
on X86 Atom. Some of our tests failed because the tail merging part of
the BranchFolding pass was creating new basic blocks which did not
contain live-in information. When the anti-dependency code in the Post-RA
scheduler ran, it would sometimes rename the register containing
the function return value because the fact that the return value was
live-in to the subsequent block had been lost. To fix this, it is necessary
to run the RegisterScavenging code in the BranchFolding pass.
This patch makes sure that the register scavenging code is invoked
in the X86 subtarget only when post-RA scheduling is being done.
Post RA scheduling in the X86 subtarget is only done for Atom.
This patch adds a new function to the TargetRegisterClass to control
whether or not live-ins should be preserved during branch folding.
This is necessary in order for the anti-dependency optimizations done
during the PostRASchedulerList pass to work properly when doing
Post-RA scheduling for the X86 in general and for the Intel Atom in particular.
The patch adds and invokes the new function trackLivenessAfterRegAlloc()
instead of using the existing requiresRegisterScavenging().
It changes BranchFolding.cpp to call trackLivenessAfterRegAlloc() instead of
requiresRegisterScavenging(). It changes the all the targets that
implemented requiresRegisterScavenging() to also implement
trackLivenessAfterRegAlloc().
It adds an assertion in the Post RA scheduler to make sure that post RA
liveness information is available when it is needed.
It changes the X86 break-anti-dependencies test to use –mcpu=atom, in order
to avoid running into the added assertion.
Finally, this patch restores the use of anti-dependency checking
(which was turned off temporarily for the 3.1 release) for
Intel Atom in the Post RA scheduler.
Patch by Andy Zhang!
Thanks to Jakob and Anton for their reviews.
llvm-svn: 155395
When building LLVM on Linux with libc++ with CMake TIME_WITH_SYS_TIME is
undefined, and HAVE_SYS_TIME_H is defined. This ends up including
sys/time.h but not time.h. Unix/TimeValue.inc requires time.h for asctime_r
and localtime. libstdc++ seems to include time.h anyway, but libc++ does
not.
Fix this by always including time.h
llvm-svn: 155382
test suite failures. The failures occur at each stage, and only get
worse, so I'm reverting all of them.
Please resubmit these patches, one at a time, after verifying that the
regression test suite passes. Never submit a patch without running the
regression test suite.
llvm-svn: 155372
Original commit message:
Defer some shl transforms to DAGCombine.
The shl instruction is used to represent multiplication by a constant
power of two as well as bitwise left shifts. Some InstCombine
transformations would turn an shl instruction into a bit mask operation,
making it difficult for later analysis passes to recognize the
constsnt multiplication.
Disable those shl transformations, deferring them to DAGCombine time.
An 'shl X, C' instruction is now treated mostly the same was as 'mul X, C'.
These transformations are deferred:
(X >>? C) << C --> X & (-1 << C) (When X >> C has multiple uses)
(X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2) (When C2 > C1)
(X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2) (When C1 > C2)
The corresponding exact transformations are preserved, just like
div-exact + mul:
(X >>?,exact C) << C --> X
(X >>?,exact C1) << C2 --> X << (C2-C1)
(X >>?,exact C1) << C2 --> X >>?,exact (C1-C2)
The disabled transformations could also prevent the instruction selector
from recognizing rotate patterns in hash functions and cryptographic
primitives. I have a test case for that, but it is too fragile.
llvm-svn: 155362
The problem is that the struct file_status on UNIX systems has two
members called st_dev and st_ino; those are also members of the
struct stat, and they are reserved identifiers which can also be
provided as #define (and this is the case for st_dev on Hurd).
The solution (attached) is to rename them, for example adding a
"fs_" prefix (= file status) to them.
Patch by Pino Toscano
llvm-svn: 155354
The X86 target is editing the selection DAG while isel is selecting
nodes following a topological ordering. When the DAG hacking triggers
CSE, nodes can be deleted and bad things happen.
llvm-svn: 155257
Now that multiple DAGUpdateListeners can be active at the same time,
ISelPosition can become a local variable in DoInstructionSelection.
We simply register an ISelUpdater with CurDAG while ISelPosition exists.
llvm-svn: 155249
Instead of passing listener pointers to RAUW, let SelectionDAG itself
keep a linked list of interested listeners.
This makes it possible to have multiple listeners active at once, like
RAUWUpdateListener was already doing. It also makes it possible to
register listeners up the call stack without controlling all RAUW calls
below.
DAGUpdateListener uses an RAII pattern to add itself to the SelectionDAG
list of active listeners.
llvm-svn: 155248
The <undef> flag on a def operand only applies to partial register
redefinitions. Only print the flag when relevant, and print it as
<def,read-undef> to make it clearer what it means.
llvm-svn: 155239
This nicely handles the most common case of virtual register sets, but
also handles anticipated cases where we will map pointers to IDs.
The goal is not to develop a completely generic SparseSet
template. Instead we want to handle the expected uses within llvm
without any template antics in the client code. I'm adding a bit of
template nastiness here, and some assumption about expected usage in
order to make the client code very clean.
The expected common uses cases I'm designing for:
- integer keys that need to be reindexed, and may map to additional
data
- densely numbered objects where we want pointer keys because no
number->object map exists.
llvm-svn: 155227
Use the new TwoOperandAliasConstraint to handle lots of the two-operand aliases
for NEON instructions. There's still more to go, but this is a good chunk of
them.
llvm-svn: 155210
(load only has one operand) and smuggle in some whitespace changes too
NB: I am obviously testing the water here, and believe that the unguarded
cast is still wrong, but why is the getZExtValue of the load's operand
tested against zero here? Any review is appreciated.
llvm-svn: 155190
While the patch was perfect and defect free, it exposed a really nasty
bug in X86 SelectionDAG that caused an llc crash when compiling lencod.
I'll put the patch back in after fixing the SelectionDAG problem.
llvm-svn: 155181
The shl instruction is used to represent multiplication by a constant
power of two as well as bitwise left shifts. Some InstCombine
transformations would turn an shl instruction into a bit mask operation,
making it difficult for later analysis passes to recognize the
constsnt multiplication.
Disable those shl transformations, deferring them to DAGCombine time.
An 'shl X, C' instruction is now treated mostly the same was as 'mul X, C'.
These transformations are deferred:
(X >>? C) << C --> X & (-1 << C) (When X >> C has multiple uses)
(X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2) (When C2 > C1)
(X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2) (When C1 > C2)
The corresponding exact transformations are preserved, just like
div-exact + mul:
(X >>?,exact C) << C --> X
(X >>?,exact C1) << C2 --> X << (C2-C1)
(X >>?,exact C1) << C2 --> X >>?,exact (C1-C2)
The disabled transformations could also prevent the instruction selector
from recognizing rotate patterns in hash functions and cryptographic
primitives. I have a test case for that, but it is too fragile.
llvm-svn: 155136
symbolicated. These have and operand type of TYPE_RELv which was not handled
as isBranch in translateImmediate() in X86Disassembler.cpp. rdar://11268426
llvm-svn: 155074
commits have had several major issues pointed out in review, and those
issues are not being addressed in a timely fashion. Furthermore, this
was all committed leading up to the v3.1 branch, and we don't need piles
of code with outstanding issues in the branch.
It is possible that not all of these commits were necessary to revert to
get us back to a green state, but I'm going to let the Hexagon
maintainer sort that out. They can recommit, in order, after addressing
the feedback.
Reverted commits, with some notes:
Primary commit r154616: HexagonPacketizer
- There are lots of review comments here. This is the primary reason
for reverting. In particular, it introduced large amount of warnings
due to a bad construct in tablegen.
- Follow-up commits that should be folded back into this when
reposting:
- r154622: CMake fixes
- r154660: Fix numerous build warnings in release builds.
- Please don't resubmit this until the three commits above are
included, and the issues in review addressed.
Primary commit r154695: Pass to replace transfer/copy ...
- Reverted to minimize merge conflicts. I'm not aware of specific
issues with this patch.
Primary commit r154703: New Value Jump.
- Primarily reverted due to merge conflicts.
- Follow-up commits that should be folded back into this when
reposting:
- r154703: Remove iostream usage
- r154758: Fix CMake builds
- r154759: Fix build warnings in release builds
- Please incorporate these fixes and and review feedback before
resubmitting.
Primary commit r154829: Hexagon V5 (floating point) support.
- Primarily reverted due to merge conflicts.
- Follow-up commits that should be folded back into this when
reposting:
- r154841: Remove unused variable (fixing build warnings)
There are also accompanying Clang commits that will be reverted for
consistency.
llvm-svn: 155047
DenseMap's hash function uses slightly more entropy and reduces hash collisions
significantly. I also experimented with Hashing.h, but it didn't gave a lot of
improvement while being much more expensive to compute.
llvm-svn: 154996
If the loop contains invoke instructions, whose unwind edge escapes the loop,
then don't try to unswitch the loop. Doing so may cause the unwind edge to be
split, which not only is non-trivial but doesn't preserve loop simplify
information.
Fixes PR12573
llvm-svn: 154987
This introduces a threshold of 200 IV Users, which is very
conservative but should be sufficient to avoid serious compile time
sink or stack overflow. The llvm test-suite with LTO never exceeds 190
users per loop.
The bug doesn't relate to a specific type of loop. Checking in an
arbitrary giant loop as a unit test would be silly.
Fixes rdar://11262507.
llvm-svn: 154983
also fix SimplifyLibCalls to use TLI rather than compile-time conditionals to enable optimizations on floor, ceil, round, rint, and nearbyint
llvm-svn: 154960
transformation:
(X op C1) ^ C2 --> (X op C1) & ~C2 iff (C1&C2) == C2
should be done.
This change has been tested:
Using a debug+asserts build:
on the specific test case that brought this bug to light
make check-all
lnt nt
using this clang to build a release version of clang
Using the release+asserts clang-with-clang build:
on the specific test case that brought this bug to light
make check-all
lnt nt
Checking in because Evan wants it checked in. Test case forthcoming after
scrubbing.
llvm-svn: 154955
for the life of me remember why I wrote it this way, but I can't see any good
reason for it now. This patch replaces the custom linked list with an ilist.
This change should preserve the existing numberings exactly, so no generated code
should change (if it does, file a bug!).
llvm-svn: 154904
the MCJIT execution engine.
The GDB JIT debugging integration support works by registering a loaded
object image with a pre-defined function that GDB will monitor if GDB
is attached. GDB integration support is implemented for ELF only at this
time. This integration requires GDB version 7.0 or newer.
Patch by Andy Kaylor!
llvm-svn: 154868
both fallthrough and a conditional branch target the same successor.
Gracefully delete the conditional branch and introduce any unconditional
branch needed to reach the actual successor. This fixes memory
corruption in 2009-06-15-RegScavengerAssert.ll and possibly other tests.
Also, while I'm here fix a latent bug I spotted by inspection. I never
applied the same fundamental fix to this fallthrough successor finding
logic that I did to the logic used when there are no conditional
branches. As a consequence it would have selected landing pads had they
be aligned in just the right way here. I don't have a test case as
I spotted this by inspection, and the previous time I found this
required have of TableGen's source code to produce it. =/ I hate backend
bugs. ;]
Thanks to Jim Grosbach for helping me reason through this and reviewing
the fix.
llvm-svn: 154867
A trailing comma means no argument at all (i.e., as if the comma were not
present), not an empty argument to the invokee.
rdar://11252521
llvm-svn: 154863
through the use of 'fpmath' metadata. Currently this only provides a 'fpaccuracy'
value, which may be a number in ULPs or the keyword 'fast', however the intent is
that this will be extended with additional information about NaN's, infinities
etc later. No optimizations have been hooked up to this so far.
llvm-svn: 154822
This is mostly to test the waters. I'd like to get results from FNT
build bots and other bots running on non-x86 platforms.
This feature has been pretty heavily tested over the last few months by
me, and it fixes several of the execution time regressions caused by the
inlining work by preventing inlining decisions from radically impacting
block layout.
I've seen very large improvements in yacr2 and ackermann benchmarks,
along with the expected noise across all of the benchmark suite whenever
code layout changes. I've analyzed all of the regressions and fixed
them, or found them to be impossible to fix. See my email to llvmdev for
more details.
I'd like for this to be in 3.1 as it complements the inliner changes,
but if any failures are showing up or anyone has concerns, it is just
a flag flip and so can be easily turned off.
I'm switching it on tonight to try and get at least one run through
various folks' performance suites in case SPEC or something else has
serious issues with it. I'll watch bots and revert if anything shows up.
llvm-svn: 154816
rotation. When there is a loop backedge which is an unconditional
branch, we will end up with a branch somewhere no matter what. Try
placing this backedge in a fallthrough position above the loop header as
that will definitely remove at least one branch from the loop iteration,
where whole loop rotation may not.
I haven't seen any benchmarks where this is important but loop-blocks.ll
tests for it, and so this will be covered when I flip the default.
llvm-svn: 154812
laid out in a form with a fallthrough into the header and a fallthrough
out of the bottom. In that case, leave the loop alone because any
rotation will introduce unnecessary branches. If either side looks like
it will require an explicit branch, then the rotation won't add any, do
it to ensure the branch occurs outside of the loop (if possible) and
maximize the benefit of the fallthrough in the bottom.
llvm-svn: 154806
This is a complex change that resulted from a great deal of
experimentation with several different benchmarks. The one which proved
the most useful is included as a test case, but I don't know that it
captures all of the relevant changes, as I didn't have specific
regression tests for each, they were more the result of reasoning about
what the old algorithm would possibly do wrong. I'm also failing at the
moment to craft more targeted regression tests for these changes, if
anyone has ideas, it would be welcome.
The first big thing broken with the old algorithm is the idea that we
can take a basic block which has a loop-exiting successor and a looping
successor and use the looping successor as the layout top in order to
get that particular block to be the bottom of the loop after layout.
This happens to work in many cases, but not in all.
The second big thing broken was that we didn't try to select the exit
which fell into the nearest enclosing loop (to which we exit at all). As
a consequence, even if the rotation worked perfectly, it would result in
one of two bad layouts. Either the bottom of the loop would get
fallthrough, skipping across a nearer enclosing loop and thereby making
it discontiguous, or it would be forced to take an explicit jump over
the nearest enclosing loop to earch its successor. The point of the
rotation is to get fallthrough, so we need it to fallthrough to the
nearest loop it can.
The fix to the first issue is to actually layout the loop from the loop
header, and then rotate the loop such that the correct exiting edge can
be a fallthrough edge. This is actually much easier than I anticipated
because we can handle all the hard parts of finding a viable rotation
before we do the layout. We just store that, and then rotate after
layout is finished. No inner loops get split across the post-rotation
backedge because we check for them when selecting the rotation.
That fix exposed a latent problem with our exitting block selection --
we should allow the backedge to point into the middle of some inner-loop
chain as there is no real penalty to it, the whole point is that it
*won't* be a fallthrough edge. This may have blocked the rotation at all
in some cases, I have no idea and no test case as I've never seen it in
practice, it was just noticed by inspection.
Finally, all of these fixes, and studying the loops they produce,
highlighted another problem: in rotating loops like this, we sometimes
fail to align the destination of these backwards jumping edges. Fix this
by actually walking the backwards edges rather than relying on loopinfo.
This fixes regressions on heapsort if block placement is enabled as well
as lots of other cases where the previous logic would introduce an
abundance of unnecessary branches into the execution.
llvm-svn: 154783
As an example, attach range info to the "invalid instruction" message:
$ clang -arch arm -c asm.c
asm.c:2:11: error: invalid instruction
__asm__("foo r0");
^
<inline asm>:1:2: note: instantiated into assembly here
foo r0
^~~
llvm-svn: 154765
thinking of generalizing it to be able to specify other freedoms beyond accuracy
(such as that NaN's don't have to be respected). I'd like the 3.1 release (the
first one with this metadata) to have the more generic name already rather than
having to auto-upgrade it in 3.2.
llvm-svn: 154744
When vectorizing pointer types it is important to realize that potential
pairs cannot be connected via the address pointer argument of a load or store.
This is because even after vectorization, the address is still a scalar because
the address of the higher half of the pair is implicit from the address of the
lower half (it need not be, and should not be, explicitly computed).
llvm-svn: 154735
This is a special flag for targets that really want their block
terminators in the DAG. The default scheduler cannot handle this
correctly, so it becomes the specialized scheduler's responsibility to
schedule terminators.
llvm-svn: 154712
- Don't copy offsets into HashData, the underlying vector won't change once the table is finalized.
- Allocate HashData and HashDataContents in a BumpPtrAllocator.
- Allocate string map entries in the same allocator.
- Random cleanups.
llvm-svn: 154694
As has been suggested by Duncan and others, Early-CSE and GVN should
do similar redundancy elimination, but Early-CSE is much less expensive.
Most of my autovectorization benchmarks show a performance regresion, but
all of these are < 0.1%, and so I think that it is still worth using
the less expensive pass.
llvm-svn: 154673
obviously cannot know that this code is present, let alone used. So prevent the
internalize pass from internalizing those global values which code-gen may
insert.
llvm-svn: 154645
directly instead of a user Instruction. This allows them to test
whether a def dominates a particular operand if the user instruction
is a PHI.
llvm-svn: 154631
of zero-initialized sections, virtual sections and common symbols
and preventing the loading of sections which are not required for
execution such as debug information.
Patch by Andy Kaylor!
llvm-svn: 154610
- FCOPYSIGN nodes that have operands of different types were not handled.
- Different code was generated depending on the endianness of the target.
Additionally, code is added that emits INS and EXT instructions, if they are
supported by target (they are R2 instructions).
llvm-svn: 154540
While there is an encoding for it in VUZP, the result of that is undefined,
so we should avoid it. Define the instruction as a pseudo for VTRN.32
instead, as the ARM ARM indicates.
rdar://11222366
llvm-svn: 154511
While there is an encoding for it in VZIP, the result of that is undefined,
so we should avoid it. Define the instruction as a pseudo for VTRN.32
instead, as the ARM ARM indicates.
rdar://11221911
llvm-svn: 154505
FoldingSet is implemented as a chained hash table. When there is a hash
collision during insertion, which is common as we fill the table until a
load factor of 2.0 is hit, we walk the chained elements, comparing every
operand with the new element's operands. This can be very expensive if the
MDNode has many operands.
We sacrifice a word of space in MDNode to cache the full hash value, reducing
compares on collision to a minimum. MDNode grows from 28 to 32 bytes + operands
on x86. On x86_64 the new bits fit nicely into existing padding, not growing
the struct at all.
The actual speedup depends a lot on the test case and is typically between
1% and 2% for C++ code with clang -c -O0 -g.
llvm-svn: 154497
Fix a dagcombine optimization which assumes that the vsetcc result type is always
of the same size as the compared values. This is ture for SSE/AVX/NEON but not
for all targets.
llvm-svn: 154490
Original message:
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendV uses a register for the selection while Vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154483
predicates.
Also remove NEON2 since it's not really useful and it is confusing. If
NEON + VFP4 implies NEON2 but NEON2 doesn't imply NEON + VFP4, what does it
really mean?
rdar://10139676
llvm-svn: 154480
1. The new instruction itinerary entries are not properly described.
2. The asm parser can't handle vfms and vfnms.
3. There were no assembler, disassembler test cases.
4. HasNEON2 has the wrong assembler predicate.
rdar://10139676
llvm-svn: 154456
Allow cheap instructions to be hoisted if they are register pressure
neutral or better. This happens if the instruction is the last loop use
of another virtual register.
Only expensive instructions are allowed to increase loop register
pressure.
llvm-svn: 154455
Hoisting a value that is used by a PHI in the loop will introduce a
copy because the live range is extended to cross the PHI.
The same applies to PHIs in exit blocks.
Also use this opportunity to make HasLoopPHIUse() non-recursive.
llvm-svn: 154454
- don't isntrument reads from constant globals.
Saves ~1.5% of instrumented instructions on CPU2006
(counting static instructions, not their execution).
- don't insrument reads from vtable (which is a global constant too).
Saves ~5%.
I did not measure the run-time impact of this,
but it is certainly non-negative.
llvm-svn: 154444
StringMap. This was redundant and unnecessarily bloated the MDString class.
Because the MDString class is a "Value" and will never have a "name", and
because the Name field in the Value class is a pointer to a StringMap entry, we
repurpose the Name field for an MDString. It stores the StringMap entry in the
Name field, and uses the normal methods to get the string (name) back.
PR12474
llvm-svn: 154429
a write to the same temp follows in the same BB.
Also add stats printing.
On Spec CPU2006 this optimization saves roughly 4% of instrumented reads
(which is 3% of all instrumented accesses):
Writes : 161216
Reads : 446458
Reads-before-write: 18295
llvm-svn: 154418
We were incorrectly conflating some add variants which don't have a
cc_out operand with the mirroring sub encodings, which do. Part of the
awesome non-orthogonality legacy of thumb1. Similarly, handling of
add/sub of an immediate was sometimes incorrectly removing the cc_out
operand for add/sub register variants.
rdar://11216577
llvm-svn: 154411
blendv uses a register for the selection while vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154396
the loop header has a non-loop predecessor which has been pre-fused into
its chain due to unanalyzable branches. In this case, rotating the
header into the body of the loop in order to place a loop exit at the
bottom of the loop is a Very Bad Idea as it makes the loop
non-contiguous.
I'm working on a good test case for this, but it's a bit annoynig to
craft. I should get one shortly, but I'm submitting this now so I can
begin the (lengthy) performance analysis process. An initial run of LNT
looks really, really good, but there is too much noise there for me to
trust it much.
llvm-svn: 154395
Take this opportunity to generalize the indirectbr bailout logic for
loop transformations. CFG transformations will never get indirectbr
right, and there's no point trying.
llvm-svn: 154386
legalizer always use the DAG entry node. This is wrong when the libcall is
emitted as a tail call since it effectively folds the return node. If
the return node's input chain is not the entry (i.e. call, load, or store)
use that as the tail call input chain.
PR12419
rdar://9770785
rdar://11195178
llvm-svn: 154370
in-register, such that we can use a single vector store rather then a
series of scalar stores.
For func_4_8 the generated code
vldr d16, LCPI0_0
vmov d17, r0, r1
vadd.i16 d16, d17, d16
vmov.u16 r0, d16[3]
strb r0, [r2, #3]
vmov.u16 r0, d16[2]
strb r0, [r2, #2]
vmov.u16 r0, d16[1]
strb r0, [r2, #1]
vmov.u16 r0, d16[0]
strb r0, [r2]
bx lr
becomes
vldr d16, LCPI0_0
vmov d17, r0, r1
vadd.i16 d16, d17, d16
vuzp.8 d16, d17
vst1.32 {d16[0]}, [r2, :32]
bx lr
I'm not fond of how this combine pessimizes 2012-03-13-DAGCombineBug.ll,
but I couldn't think of a way to judiciously apply this combine.
This
ldrh r0, [r0, #4]
strh r0, [r1]
becomes
vldr d16, [r0]
vmov.u16 r0, d16[2]
vmov.32 d16[0], r0
vuzp.16 d16, d17
vst1.32 {d16[0]}, [r1, :32]
PR11158
rdar://10703339
llvm-svn: 154340
This patch restores TwoAddressInstructionPass's pre-r153892 behaviour when
rescheduling instructions in TryInstructionTransform. Hopefully this will fix
PR12493. To refix PR11861, lowering of INSERT_SUBREGS is deferred until after
the copy that unties the operands is emitted (this seems to be a more
appropriate fix for that issue anyway).
llvm-svn: 154338
A couple of cases where we were accidentally creating constant conditions by
something like "x == a || b" instead of "x == a || x == b". In one case a
conditional & then unreachable was used - I transformed this into a direct
assert instead.
llvm-svn: 154324
x86 addressing modes. This allows PIE-based TLS offsets to fit directly
into an addressing mode immediate offset, which is the last remaining
code quality issue from PR12380. With this patch, that PR is completely
fixed.
To understand why this patch is correct to match these offsets into
addressing mode immediates, break it down by cases:
1) 32-bit is trivially correct, and unmodified here.
2) 64-bit non-small mode is unchanged and never matches.
3) 64-bit small PIC code which is RIP-relative is handled specially in
the match to try to fit RIP into the base register. If it fails, it
now early exits. This behavior is unchanged by the patch.
4) 64-bit small non-PIC code which is not RIP-relative continues to work
as it did before. The reason these immediates are safe is because the
ABI ensures they fit in small mode. This behavior is unchanged.
5) 64-bit small PIC code which is *not* using RIP-relative addressing.
This is the only case changed by the patch, and the primary place you
see it is in TLS, either the win64 section offset TLS or Linux
local-exec TLS model in a PIC compilation. Here the ABI again ensures
that the immediates fit because we are in small mode, and any other
operations required due to the PIC relocation model have been handled
externally to the Wrapper node (extra loads etc are made around the
wrapper node in ISelLowering).
I've tested this as much as I can comparing it with GCC's output, and
everything appears safe. I discussed this with Anton and it made sense
to him at least at face value. That said, if there are issues with PIC
code after this patch, yell and we can revert it.
llvm-svn: 154304
when -ffast-math, i.e. don't just always do it if the reciprocal can
be formed exactly. There is already an IR level transform that does
that, and it does it more carefully.
llvm-svn: 154296
optimizations which are valid for position independent code being linked
into a single executable, but not for such code being linked into
a shared library.
I discussed the design of this with Eric Christopher, and the decision
was to support an optional bit rather than a completely separate
relocation model. Fundamentally, this is still PIC relocation, its just
that certain optimizations are only valid under a PIC relocation model
when the resulting code won't be in a shared library. The simplest path
to here is to expose a single bit option in the TargetOptions. If folks
have different/better designs, I'm all ears. =]
I've included the first optimization based upon this: changing TLS
models to the *Exec models when PIE is enabled. This is the LLVM
component of PR12380 and is all of the hard work.
llvm-svn: 154294
in TargetLowering. There was already a FIXME about this location being
odd. The interface is simplified as a consequence. This will also make
it easier to change TLS models when compiling with PIE.
llvm-svn: 154292
where a chain outside of the loop block-set ended up in the worklist for
scheduling as part of the contiguous loop. However, asserting the first
block in the chain is in the loop-set isn't a valid check -- we may be
forced to drag a chain into the worklist due to one block in the chain
being part of the loop even though the first block is *not* in the loop.
This occurs when we have been forced to form a chain early due to
un-analyzable branches.
No test case here as I have no idea how to even begin reducing one, and
it will be hopelessly fragile. We have to somehow end up with a loop
header of an inner loop which is a successor of a basic block with an
unanalyzable pair of branch instructions. Ow. Self-host triggers it so
it is unlikely it will regress.
This at least gets block placement back to passing selfhost and the test
suite. There are still a lot of slowdown that I don't like coming out of
block placement, although there are now also a lot of speedups. =[ I'm
seeing swings in both directions up to 10%. I'm going to try to find
time to dig into this and see if we can turn this on for 3.1 as it does
a really good job of cleaning up after some loops that degraded with the
inliner changes.
llvm-svn: 154287
GEPs, bit casts, and stores reaching it but no other instructions. These
often show up during the iterative processing of the inliner, SROA, and
DCE. Once we hit this point, we can completely remove the alloca. These
were actually showing up in the final, fully optimized code in a bunch
of inliner tests I've been working on, and notably they show up after
LLVM finishes optimizing away all function calls involved in
hash_combine(a, b).
llvm-svn: 154285
Previously we used three instructions to broadcast an immediate value into a
vector register.
On Sandybridge we continue to load the broadcasted value from the constant pool.
llvm-svn: 154284
An MDNode has a list of MDNodeOperands allocated directly after it as part of
its allocation. Therefore, the Parent of the MDNodeOperands can be found by
walking back through the operands to the beginning of that list. Mark the first
operand's value pointer as being the 'first' operand so that we know where the
beginning of said list is.
This saves a *lot* of space during LTO with -O0 -g flags.
llvm-svn: 154280
shuffle node because it could introduce new shuffle nodes that were not
supported efficiently by the target.
2. Add a more restrictive shuffle-of-shuffle optimization for cases where the
second shuffle reverses the transformation of the first shuffle.
llvm-svn: 154266
reciprocal if converting to the reciprocal is exact. Do it even if inexact
if -ffast-math. This substantially speeds up ac.f90 from the polyhedron
benchmarks.
llvm-svn: 154265
speculate. Without this, loop rotate (among many other places) would
suddenly stop working in the presence of debug info. I found this
looking at loop rotate, and have augmented its tests with a reduction
out of a very hot loop in yacr2 where failing to do this rotation costs
sometimes more than 10% in runtime performance, perturbing numerous
downstream optimizations.
This should have no impact on performance without debug info, but the
change in performance when debug info is enabled can be extreme. As
a consequence (and this how I got to this yak) any profiling of
performance problems should be treated with deep suspicion -- they may
have been wildly innacurate of debug info was enabled for profiling. =/
Just a heads up.
llvm-svn: 154263
The tLDRr instruction with the last register operand set to the zero register
prints in assembly as if no register was specified, and the assembler encodes
it as a tLDRi instruction with a zero immediate. With the integrated assembler,
that zero register gets emitted as "r0", so we get "ldr rx, [ry, r0]" which
is broken. Emit the instruction as tLDRi with a zero immediate. I don't
know if there's a good way to write a testcase for this. Suggestions welcome.
Opportunities for follow-up work:
1) The asm printer should complain if a non-optional register operand is set
to the zero register, instead of silently dropping it.
2) The integrated assembler should complain in the same situation, instead of
silently emitting the operand as "r0".
llvm-svn: 154261
Cygwin-1.7 supports dw2. Some recent mingw distros support one, too.
I have confirmed test-suite/SingleSource/Benchmarks/Shootout-C++/except.cpp can pass on Cygwin.
llvm-svn: 154247
by default.
This is a behaviour configurable in the MCAsmInfo. I've decided to turn
it on by default in (possibly optimistic) hopes that most assemblers are
reasonably sane. If this proves a problem, switching to default seems
reasonable.
I'm not sure if this is the opportune place to test, but it seemed good
to make sure it was tested somewhere.
llvm-svn: 154235
After register masks were introdruced to represent the call clobbers, it
is no longer necessary to have duplicate instruction for iOS.
llvm-svn: 154209
disassembler requires a MCSubtargetInfo and a
MCInstrInfo to exist in order to initialize the
instruction printer and disassembler; however,
although the printer and disassembler keep
references to these objects they do not own them.
Previously, the MCSubtargetInfo and MCInstrInfo
objects were just leaked.
I have extended LLVMDisasmContext to own these
objects and delete them when it is destroyed.
llvm-svn: 154192
simplification has been performed. This is a bit less efficient
(requires another ilist walk of the basic blocks) but shouldn't matter
in practice. More importantly, it's just too much work to keep track of
all the various ways the return instructions can be mutated while
simplifying them. This fixes yet another crasher, reported by Daniel
Dunbar.
llvm-svn: 154179
dead code, including dead return instructions in some cases. Otherwise,
we end up having a bogus poniter to a return instruction that blows up
much further down the road.
It turns out that this pattern is both simpler to code, easier to update
in the face of enhancements to the inliner cleanup, and likely cheaper
given that it won't add dead instructions to the list.
Thanks to John Regehr's numerous test cases for teasing this out.
llvm-svn: 154157
We had special instructions for iOS because r9 is call-clobbered, but
that is represented dynamically by the register mask operands now, so
there is no need for the pseudo-instructions.
llvm-svn: 154144
The load/store optimizer splits LDRD/STRD into two instructions when the
register pairing doesn't work out. For negative offsets in Thumb2, it uses
t2STRi8 to do that. That's fine, except for the case when the offset is in
the range [-4,-1]. In that case, we'll also form a second t2STRi8 with
the original offset plus 4, resulting in a t2STRi8 with a non-negative
offset, which ends up as if it were an STRT, which is completely bogus.
Similarly for loads.
No testcase, unfortunately, as any I've been able to construct is both large
and extremely fragile.
rdar://11193937
llvm-svn: 154141
'add r2, #-1024' should just use 'sub r2, #1024' rather than erroring out.
Thumb1 aliases for adding a negative immediate to the stack pointer,
also.
rdar://11192734
llvm-svn: 154123
LSR always tries to make the ICmp in the loop latch use the incremented
induction variable. This allows the induction variable to be kept in a
single register.
When the induction variable limit is equal to the stride,
SimplifySetCC() would break LSR's hard work by transforming:
(icmp (add iv, stride), stride) --> (cmp iv, 0)
This forced us to use lea for the IC update, preventing the simpler
incl+cmp.
<rdar://problem/7643606>
<rdar://problem/11184260>
llvm-svn: 154119
of the BBVectorizePass without using command line option. As pointed out
by Hal, we can ask the TargetLoweringInfo for the architecture specific
VectorizeConfig to perform vectorizing with architecture specific
information.
llvm-svn: 154096
the caller requested a null-terminated one.
When mapping the file there could be a racing issue that resulted in the file being larger
than the FileSize passed by the caller. We already have an assertion
for this in MemoryBuffer::init() but have a runtime guarantee that
the buffer will be null-terminated, so do a copy that adds a null-terminator.
Protects against crash of rdar://11161822.
llvm-svn: 154082
LSR can fold three addressing modes into its ICmpZero node:
ICmpZero BaseReg + Offset => ICmp BaseReg, -Offset
ICmpZero -1*ScaleReg + Offset => ICmp ScaleReg, Offset
ICmpZero BaseReg + -1*ScaleReg => ICmp BaseReg, ScaleReg
The first two cases are only used if TLI->isLegalICmpImmediate() likes
the offset.
Make sure the right Offset sign is passed to this method in the second
case. The ARM version is not symmetric.
<rdar://problem/11184260>
llvm-svn: 154079
A MOVCCr instruction can be commuted by inverting the condition. This
can help reduce register pressure and remove unnecessary copies in some
cases.
<rdar://problem/11182914>
llvm-svn: 154033
This allows us to keep passing reduced masks to SimplifyDemandedBits, but
know about all the bits if SimplifyDemandedBits fails. This allows instcombine
to simplify cases like the one in the included testcase.
llvm-svn: 154011
When folding X == X we need to check getBooleanContents() to determine if the
result is a vector of ones or a vector of negative ones.
I tried creating a test case, but the problem seems to only be exposed on a
much older version of clang (around r144500).
rdar://10923049
llvm-svn: 153966
brace) so that we get more accurate line number information about the
declaration of a given function and the line where the function
first starts.
Part of rdar://11026482
llvm-svn: 153916
This patch allows llvm to recognize that a 64 bit object file is being produced
and that the subsequently generated ELF header has the correct information.
The test case checks for both big and little endian flavors.
Patch by Jack Carter.
llvm-svn: 153889
http://llvm.org/bugs/show_bug.cgi?id=12343
We have not trivial way for splitting edges that are goes from indirect branch. We can do it with some tricks, but it should be additionally discussed. And it is still dangerous due to difficulty of indirect branches controlling.
Fix forbids this case for unswitching.
llvm-svn: 153879
Do not try to optimize swizzles of shuffles if the source shuffle has more than
a single user, except when the source shuffle is also a swizzle.
llvm-svn: 153864
Post-RA scheduling gives a significant performance improvement on
the embedded cores, so turn it on. Using full anti-dep. breaking is
important for FP-intensive blocks, so turn it on (just on the
embedded cores for now; this should also be good on the 970s because
post-ra scheduling is all that we have for now, but that should have
more testing first).
llvm-svn: 153843
This adds a full itinerary for IBM's PPC64 A2 embedded core. These
cores form the basis for the CPUs in the new IBM BG/Q supercomputer.
llvm-svn: 153842
As a side note, I really dislike array_pod_sort... Do we really still
care about any STL implementations that get this so wrong? Does libc++?
llvm-svn: 153834
a single missing character. Somehow, this had gone untested. I've added
tests for returns-twice logic specifically with the always-inliner that
would have caught this, and fixed the bug.
Thanks to Matt for the careful review and spotting this!!! =D
llvm-svn: 153832
Loads and stores can have different pipeline behavior, especially on
embedded chips. This change allows those differences to be expressed.
Except for the 440 scheduler, there are no functionality changes.
On the 440, the latency adjustment is only by one cycle, and so this
probably does not affect much. Nevertheless, it will make a larger
difference in the future and this removes a FIXME from the 440 itin.
llvm-svn: 153821
This is the CodeGen equivalent of r153747. I tested that there is not noticeable
performance difference with any combination of -O0/-O2 /-g when compiling
gcc as a single compilation unit.
llvm-svn: 153817
Dynamic linking on PPC64 has had problems since we had to move the top-down
hazard-detection logic post-ra. For dynamic linking to work there needs to be
a nop placed after every call. It turns out that it is really hard to guarantee
that nothing will be placed in between the call (bl) and the nop during post-ra
scheduling. Previous attempts at fixing this by placing logic inside the
hazard detector only partially worked.
This is now fixed in a different way: call+nop codegen-only instructions. As far
as CodeGen is concerned the pair is now a single instruction and cannot be split.
This solution works much better than previous attempts.
The scoreboard hazard detector is also renamed to be more generic, there is currently
no cpu-specific logic in it.
llvm-svn: 153816
the very high overhead of the complex inline cost analysis when all it
wants to do is detect three patterns which must not be inlined. Comment
the code, clean it up, and leave some hints about possible performance
improvements if this ever shows up on a profile.
Moving this off of the (now more expensive) inline cost analysis is
particularly important because we have to run this inliner even at -O0.
llvm-svn: 153814
interfaces. These methods were used in the old inline cost system where
there was a persistent cache that had to be updated, invalidated, and
cleared. We're now doing more direct computations that don't require
this intricate dance. Even if we resume some level of caching, it would
almost certainly have a simpler and more narrow interface than this.
llvm-svn: 153813
on a per-callsite walk of the called function's instructions, in
breadth-first order over the potentially reachable set of basic blocks.
This is a major shift in how inline cost analysis works to improve the
accuracy and rationality of inlining decisions. A brief outline of the
algorithm this moves to:
- Build a simplification mapping based on the callsite arguments to the
function arguments.
- Push the entry block onto a worklist of potentially-live basic blocks.
- Pop the first block off of the *front* of the worklist (for
breadth-first ordering) and walk its instructions using a custom
InstVisitor.
- For each instruction's operands, re-map them based on the
simplification mappings available for the given callsite.
- Compute any simplification possible of the instruction after
re-mapping, and store that back int othe simplification mapping.
- Compute any bonuses, costs, or other impacts of the instruction on the
cost metric.
- When the terminator is reached, replace any conditional value in the
terminator with any simplifications from the mapping we have, and add
any successors which are not proven to be dead from these
simplifications to the worklist.
- Pop the next block off of the front of the worklist, and repeat.
- As soon as the cost of inlining exceeds the threshold for the
callsite, stop analyzing the function in order to bound cost.
The primary goal of this algorithm is to perfectly handle dead code
paths. We do not want any code in trivially dead code paths to impact
inlining decisions. The previous metric was *extremely* flawed here, and
would always subtract the average cost of two successors of
a conditional branch when it was proven to become an unconditional
branch at the callsite. There was no handling of wildly different costs
between the two successors, which would cause inlining when the path
actually taken was too large, and no inlining when the path actually
taken was trivially simple. There was also no handling of the code
*path*, only the immediate successors. These problems vanish completely
now. See the added regression tests for the shiny new features -- we
skip recursive function calls, SROA-killing instructions, and high cost
complex CFG structures when dead at the callsite being analyzed.
Switching to this algorithm required refactoring the inline cost
interface to accept the actual threshold rather than simply returning
a single cost. The resulting interface is pretty bad, and I'm planning
to do lots of interface cleanup after this patch.
Several other refactorings fell out of this, but I've tried to minimize
them for this patch. =/ There is still more cleanup that can be done
here. Please point out anything that you see in review.
I've worked really hard to try to mirror at least the spirit of all of
the previous heuristics in the new model. It's not clear that they are
all correct any more, but I wanted to minimize the change in this single
patch, it's already a bit ridiculous. One heuristic that is *not* yet
mirrored is to allow inlining of functions with a dynamic alloca *if*
the caller has a dynamic alloca. I will add this back, but I think the
most reasonable way requires changes to the inliner itself rather than
just the cost metric, and so I've deferred this for a subsequent patch.
The test case is XFAIL-ed until then.
As mentioned in the review mail, this seems to make Clang run about 1%
to 2% faster in -O0, but makes its binary size grow by just under 4%.
I've looked into the 4% growth, and it can be fixed, but requires
changes to other parts of the inliner.
llvm-svn: 153812
The powi intrinsic requires special handling because it always takes a single
integer power regardless of the result type. As a result, we can vectorize
only if the powers are equal. Fixes PR12364.
llvm-svn: 153797
ARMConstantIslandPass still has bugs where jump table compression can
cause constant pool entries to go out of range.
Add a safety margin of 2 bytes when placing constant islands, but use
the real max displacement for verification.
<rdar://problem/11156595>
llvm-svn: 153789
When an immediate is both a value [t2_]so_imm and a [t2_]so_imm_neg,
we want to use the non-negated form to make sure we prefer the normal
encoding, not the aliased encoding via the negation of, e.g., 'cmp.w'.
llvm-svn: 153770
For 'adds r2, r2, #56' outside of an IT block, the 16-bit encoding T2
can be used for this syntax. Prefer the narrow encoding when possible.
rdar://11156277
llvm-svn: 153759
1. The main works will made in the RuntimeDyLdImpl with uses the ObjectFile class. RuntimeDyLdMachO and RuntimeDyLdELF now only parses relocations and resolve it. This is allows to make improvements of the RuntimeDyLd more easily. In addition the support for COFF can be easily added.
2. Added ARM relocations to RuntimeDyLdELF.
3. Added support for stub functions for the ARM, allowing to do a long branch.
4. Added support for external functions that are not loaded from the object files, but can be loaded from external libraries. Now MCJIT can correctly execute the code containing the printf, putc, and etc.
5. The sections emitted instead functions, thanks Jim Grosbach. MemoryManager.startFunctionBody() and MemoryManager.endFunctionBody() have been removed.
6. MCJITMemoryManager.allocateDataSection() and MCJITMemoryManager. allocateCodeSection() used JMM->allocateSpace() instead of JMM->allocateCodeSection() and JMM->allocateDataSection(), because I got an error: "Cannot allocate an allocated block!" with object file contains more than one code or data sections.
llvm-svn: 153754
here but it has no other uses, then we have a problem. E.g.,
int foo (const int *x) {
char a[*x];
return 0;
}
If we assign 'a' a vreg and fast isel later on has to use the selection
DAG isel, it will want to copy the value to the vreg. However, there are
no uses, which goes counter to what selection DAG isel expects.
<rdar://problem/11134152>
llvm-svn: 153705
This pass splits basic blocks to insert constant islands, and it
doesn't recompute the live-in lists. No later passes depend on accurate
liveness information.
This fixes PR12410 where the machine code verifier was complaining.
llvm-svn: 153700
We are sometimes allocatinog from the DPair register class which
contains odd-even pairs in addition to the Q registers.
Place the Q registers first in the DPair allocation order as they can be
copied with a single instruction. The odd-even pairs should only be
allocated as a last resort.
llvm-svn: 153699
ARM recently gained DPair, DTriple, and DQuad register classes.
Update copyPhysReg() to handle copies in these register classes.
No test case, it is difficult to make the register allocator emit the
odd copies reliably. The missing DPair copy caused a failure on
partialsums in the nightly test suite.
<rdar://problem/11147997>
llvm-svn: 153686
CodeGenPrepare sinks compare instructions down to their uses to prevent
live flags and predicate registers across basic blocks.
PRE of a compare instruction prevents that, forcing the i1 compare
result into a general purpose register. That is usually more expensive
than the redundant compare PRE was trying to eliminate in the first
place.
llvm-svn: 153657
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
llvm-svn: 153635
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
llvm-svn: 153617
Some targets still mess up the liveness information, but that isn't
verified after MRI->invalidateLiveness().
The verifier can still check other useful things like register classes
and CFG, so it should be enabled after all passes.
llvm-svn: 153615
The late scheduler depends on accurate liveness information if it is
breaking anti-dependencies, so we should be able to verify it.
Relax the terminator checking in the machine code verifier so it can
handle the basic blocks created by if conversion.
llvm-svn: 153614
When an strd instruction doesn't get the registers it wants, it can be
expanded into two str instructions. Make sure the first str doesn't kill
the base register in the case where the base and data registers are
identical:
t2STRi12 %R0<kill>, %R0, 4, pred:14, pred:%noreg
t2STRi12 %R2<kill>, %R0, 8, pred:14, pred:%noreg
<rdar://problem/11101911>
llvm-svn: 153611
When a number of sub-register VLRDS instructions are combined into a
VLDM, preserve any super-register implicit defs. This is required to
keep the register scavenger and machine code verifier happy.
Enable machine code verification after ARMLoadStoreOptimizer.
ARM/2012-01-26-CopyPropKills.ll was failing because of this.
llvm-svn: 153610
The arm_neon intrinsics can create virtual registers from the DPair
register class which allows both even-odd and odd-even D-register pairs.
This fixes PR12389.
llvm-svn: 153603
Extract the liveness verification into its own method.
This makes it possible to run the machine code verifier after liveness
information is no longer required to be valid.
llvm-svn: 153596
Revert r153519: "ARMLoadStoreOptimizer invalidates register liveness."
These patches caused miscompilations in povray by turning off branch
folding's updating of live-in lists.
It turns out the the late scheduler depends on the live-in lists, even
if it doesn't need correct kill flags.
<rdar://problem/11139228>
llvm-svn: 153593
Original commit message for r153521 (aka r153423):
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loding a boolean value.
llvm-svn: 153587
blocks in the function cloner. This removes the last case of trivially
dead code that I've been seeing in the wild getting inlined, analyzed,
re-inlined, optimized, only to be deleted. Nukes a FIXME from the
cleanup tests.
llvm-svn: 153572
them as machine instructions. Directives ".set noat" and ".set at" are now
emitted only at the beginning and end of a function except in the case where
they are emitted to enclose .cpload with an immediate operand that doesn't fit
in 16-bit field or unaligned load/stores.
Also, make the following changes:
- Remove function isUnalignedLoadStore and use a switch-case statement to
determine whether an instruction is an unaligned load or store.
- Define helper function CreateMCInst which generates an instance of an MCInst
from an opcode and a list of operands.
llvm-svn: 153552
undefined behavior, which Rafael was kind enough to fix.
Original commit message for r153423:
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loding a boolean value.
llvm-svn: 153521
This pass tries to update kill flags, but there are still many bugs.
Passes after the load/store optimizer don't need accurate liveness, so
don't even try.
<rdar://problem/11101911>
llvm-svn: 153519
Branch folding can use a register scavenger to update liveness
information when required. Don't do that if liveness information is
already invalid.
llvm-svn: 153517
Late optimization passes like branch folding and tail duplication can
transform the machine code in a way that makes it expensive to keep the
register liveness information up to date. There is a fuzzy line between
register allocation and late scheduling where the liveness information
degrades.
The MRI::tracksLiveness() flag makes the line clear: While true,
liveness information is accurate, and can be used for register
scavenging. Once the flag is false, liveness information is not
accurate, and can only be used as a hint.
Late passes generally don't need the liveness information, but they will
sometimes use the register scavenger to help update it. The scavenger
enforces strict correctness, and we have to spend a lot of code to
update register liveness that may never be used.
llvm-svn: 153511
size bloat. Unfortunately, I expect this to disable the majority of the
benefit from r152737. I'm hopeful at least that it will fix PR12345. To
explain this requires... quite a bit of backstory I'm afraid.
TL;DR: The change in r152737 actually did The Wrong Thing for
linkonce-odr functions. This change makes it do the right thing. The
benefits we saw were simple luck, not any actual strategy. Benchmark
numbers after a mini-blog-post so that I've written down my thoughts on
why all of this works and doesn't work...
To understand what's going on here, you have to understand how the
"bottom-up" inliner actually works. There are two fundamental modes to
the inliner:
1) Standard fixed-cost bottom-up inlining. This is the mode we usually
think about. It walks from the bottom of the CFG up to the top,
looking at callsites, taking information about the callsite and the
called function and computing th expected cost of inlining into that
callsite. If the cost is under a fixed threshold, it inlines. It's
a touch more complicated than that due to all the bonuses, weights,
etc. Inlining the last callsite to an internal function gets higher
weighth, etc. But essentially, this is the mode of operation.
2) Deferred bottom-up inlining (a term I just made up). This is the
interesting mode for this patch an r152737. Initially, this works
just like mode #1, but once we have the cost of inlining into the
callsite, we don't just compare it with a fixed threshold. First, we
check something else. Let's give some names to the entities at this
point, or we'll end up hopelessly confused. We're considering
inlining a function 'A' into its callsite within a function 'B'. We
want to check whether 'B' has any callers, and whether it might be
inlined into those callers. If so, we also check whether inlining 'A'
into 'B' would block any of the opportunities for inlining 'B' into
its callers. We take the sum of the costs of inlining 'B' into its
callers where that inlining would be blocked by inlining 'A' into
'B', and if that cost is less than the cost of inlining 'A' into 'B',
then we skip inlining 'A' into 'B'.
Now, in order for #2 to make sense, we have to have some confidence that
we will actually have the opportunity to inline 'B' into its callers
when cheaper, *and* that we'll be able to revisit the decision and
inline 'A' into 'B' if that ever becomes the correct tradeoff. This
often isn't true for external functions -- we can see very few of their
callers, and we won't be able to re-consider inlining 'A' into 'B' if
'B' is external when we finally see more callers of 'B'. There are two
cases where we believe this to be true for C/C++ code: functions local
to a translation unit, and functions with an inline definition in every
translation unit which uses them. These are represented as internal
linkage and linkonce-odr (resp.) in LLVM. I enabled this logic for
linkonce-odr in r152737.
Unfortunately, when I did that, I also introduced a subtle bug. There
was an implicit assumption that the last caller of the function within
the TU was the last caller of the function in the program. We want to
bonus the last caller of the function in the program by a huge amount
for inlining because inlining that callsite has very little cost.
Unfortunately, the last caller in the TU of a linkonce-odr function is
*not* the last caller in the program, and so we don't want to apply this
bonus. If we do, we can apply it to one callsite *per-TU*. Because of
the way deferred inlining works, when it sees this bonus applied to one
callsite in the TU for 'B', it decides that inlining 'B' is of the
*utmost* importance just so we can get that final bonus. It then
proceeds to essentially force deferred inlining regardless of the actual
cost tradeoff.
The result? PR12345: code bloat, code bloat, code bloat. Another result
is getting *damn* lucky on a few benchmarks, and the over-inlining
exposing critically important optimizations. I would very much like
a list of benchmarks that regress after this change goes in, with
bitcode before and after. This will help me greatly understand what
opportunities the current cost analysis is missing.
Initial benchmark numbers look very good. WebKit files that exhibited
the worst of PR12345 went from growing to shrinking compared to Clang
with r152737 reverted.
- Bootstrapped Clang is 3% smaller with this change.
- Bootstrapped Clang -O0 over a single-source-file of lib/Lex is 4%
faster with this change.
Please let me know about any other performance impact you see. Thanks to
Nico for reporting and urging me to actually fix, Richard Smith, Duncan
Sands, Manuel Klimek, and Benjamin Kramer for talking through the issues
today.
llvm-svn: 153506
MachinePointerInfo when getStore is called to create a node that stores an
argument passed in register to the stack. Without this change, the post RA
scheduler will fail to discover the dependencies between the stores
instructions and the instructions that load from a structure passed by value.
The link to the related discussion is here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-March/048055.html
llvm-svn: 153499
copies being considered for removal. Make sure to track all of the copies,
rather than just the most recent encountered, by holding a DenseSet instead of
an unsigned in SrcMap.
No test case - couldn't reduce something with a sane size.
llvm-svn: 153487
produces a 32-bit immediate which is consumed by the use. It tries to
fold the immediate by breaking it into two parts and fold them into the
immmediate fields of two uses. e.g
movw r2, #40885
movt r3, #46540
add r0, r0, r3
=>
add.w r0, r0, #3019898880
add.w r0, r0, #30146560
;
However, this transformation is incorrect if the user produces a flag. e.g.
movw r2, #40885
movt r3, #46540
adds r0, r0, r3
=>
add.w r0, r0, #3019898880
adds.w r0, r0, #30146560
Note the adds.w may not set the carry flag even if the original sequence
would.
rdar://11116189
llvm-svn: 153484
relocations. The algorithm is the same as
that for x86_64. Scattered relocations, a
feature present in i386 but not on x86_64,
are not yet supported.
llvm-svn: 153466
Original commit message:
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loading a boolean value.
llvm-svn: 153452
constant-offsets of a common base using the generic GEP-walking logic
I added for computing pointer differences in the same situation.
llvm-svn: 153419
inbounds GEPs. This isn't really necessary for simplifying pointer
differences, but I'm planning to re-use the same code to simplify
pointer comparisons where it is necessary. Since real code almost
exclusively uses inbounds GEPs, it doesn't seem worth it to support the
extra complexity of turning it on and off. If anyone would like that
back, feel free to shout. Note that instcombine will still catch any of
these patterns.
llvm-svn: 153418
aggressively. There are lots of dire warnings about this being expensive
that seem to predate switching to the TrackingVH-based value remapper
that is automatically updated on RAUW. This makes it easy to not just
prune single-entry PHIs, but to fully simplify PHIs, and to recursively
simplify the newly inlined code to propagate PHINode simplifications.
This introduces a bit of a thorny problem though. We may end up
simplifying a branch condition to a constant when we fold PHINodes, and
we would like to nuke any dead blocks resulting from this so that time
isn't wasted continually analyzing them, but this isn't easy. Deleting
basic blocks *after* they are fully cloned and mapped into the new
function currently requires manually updating the value map. The last
piece of the simplification-during-inlining puzzle will require either
switching to WeakVH mappings or some other piece of refactoring. I've
left a FIXME in the testcase about this.
llvm-svn: 153410
to instead rely on much more generic and powerful instruction
simplification in the function cloner (and thus inliner).
This teaches the pruning function cloner to use instsimplify rather than
just the constant folder to fold values during cloning. This can
simplify a large number of things that constant folding alone cannot
begin to touch. For example, it will realize that 'or' and 'and'
instructions with certain constant operands actually become constants
regardless of what their other operand is. It also can thread back
through the caller to perform simplifications that are only possible by
looking up a few levels. In particular, GEPs and pointer testing tend to
fold much more heavily with this change.
This should (in some cases) have a positive impact on compile times with
optimizations on because the inliner itself will simply avoid cloning
a great deal of code. It already attempted to prune proven-dead code,
but now it will be use the stronger simplifications to prove more code
dead.
llvm-svn: 153403
fire if anything ever invalidates the assumption of a terminator
instruction being unchanged throughout the routine.
I've convinced myself that the current definition of simplification
precludes such a transformation, so I think getting some asserts
coverage that we don't violate this agreement is sufficient to make this
code safe for the foreseeable future.
Comments to the contrary or other suggestions are of course welcome. =]
The bots are now happy with this code though, so it appears the bug here
has indeed been fixed.
llvm-svn: 153401
list. This is a bad idea. ;] I'm hopeful this is the bug that's showing
up with the MSVC bots, but we'll see.
It is definitely unnecessary. InstSimplify won't do anything to
a terminator instruction, we don't need to even include it in the
iteration range. We can also skip the now dead terminator check,
although I've made it an assert to help document that this is an
important invariant.
I'm still a bit queasy about this because there is an implicit
assumption that the terminator instruction cannot be RAUW'ed by the
simplification code. While that appears to be true at the moment, I see
no guarantee that would ensure it remains true in the future. I'm
looking at the cleanest way to solve that...
llvm-svn: 153399
spotted by inspection, and I've crafted no test case that triggers it on
my machine, but some of the windows builders are hitting what looks like
memory corruption, so *something* is amiss here.
This patch takes a more generalized approach to eliminating
double-visits. Imagine code such as:
%x = ...
%y = add %x, 1
%z = add %x, %y
You can imagine that if we simplify %x, we would add %y and %z to the
list. If the use-chain order happens to cause us to add them in reverse
order, we could pull %y off first, and simplify it, adding %z to the
list. We now have %z on the list twice, and will reference it after it
is deleted.
Currently, all my test cases happen to not trigger this, likely due to
the use-chain ordering, but there seems no guarantee that such
a situation could not occur, so we should handle it correctly.
Again, if anyone knows how to craft a testcase that actually triggers
this, please let me know.
llvm-svn: 153397
worklist. This can happen in theory when an instruction uses itself,
such as a PHI node. This was spotted by inspection, and unfortunately
I've not been able to come up with a test case that would trigger it. If
anyone has ideas, let me know...
llvm-svn: 153396
bit simpler by handling a common case explicitly.
Also, refactor the implementation to use a worklist based walk of the
recursive users, rather than trying to use value handles to detect and
recover from RAUWs during the recursive descent. This fixes a very
subtle bug in the previous implementation where degenerate control flow
structures could cause mutually recursive instructions (PHI nodes) to
collapse in just such a way that From became equal to To after some
amount of recursion. At that point, we hit the inf-loop that the assert
at the top attempted to guard against. This problem is defined away when
not using value handles in this manner. There are lots of comments
claiming that the WeakVH will protect against just this sort of error,
but they're not accurate about the actual implementation of WeakVHs,
which do still track RAUWs.
I don't have any test case for the bug this fixes because it requires
running the recursive simplification on unreachable phi nodes. I've no
way to either run this or easily write an input that triggers it. It was
found when using instruction simplification inside the inliner when
running over the nightly test-suite.
llvm-svn: 153393
The PPC64 SVR4 ABI requires integer stack arguments, and thus the var. args., that
are smaller than 64 bits be zero extended to 64 bits.
llvm-svn: 153373
Code such as:
%vreg100 = setcc %vreg10, -1, SETNE
brcond %vreg10, %tgt
was being incorrectly morphed into
%vreg100 = and %vreg10, 1
brcond %vreg10, %tgt
where the 'and' instruction could be eliminated since
such logic is on 1-bit types in the PTX back-end, leaving
us with just:
brcond %vreg10, %tgt
which essentially gives us inverted branch conditions.
llvm-svn: 153364
destination module, but one of them isn't used in the destination module. If
another module comes along and the uses the unused type, there could be type
conflicts when the modules are finally linked together. (This happened when
building LLVM.)
The test that was reduced is:
Module A:
%Z = type { %A }
%A = type { %B.1, [7 x x86_fp80] }
%B.1 = type { %C }
%C = type { i8* }
declare void @func_x(%C*, i64, i64)
declare void @func_z(%Z* nocapture)
Module B:
%B = type { %C.1 }
%C.1 = type { i8* }
%A.2 = type { %B.3, [5 x x86_fp80] }
%B.3 = type { %C.1 }
define void @func_z() {
%x = alloca %A.2, align 16
%y = getelementptr inbounds %A.2* %x, i64 0, i32 0, i32 0
call void @func_x(%C.1* %y, i64 37, i64 927) nounwind
ret void
}
declare void @func_x(%C.1*, i64, i64)
declare void @func_y(%B* nocapture)
(Unfortunately, this test doesn't fail under llvm-link, only during an LTO
linking.) The '%C' and '%C.1' clash. The destination module gets the '%C'
declaration. When merging Module B, it looks at the '%C.1' subtype of the '%B'
structure. It adds that in, because that's cool. And when '%B.3' is processed,
it uses the '%C.1'. But the '%B' has used '%C' and we prefer to use '%C'. So the
'@func_x' type is changed to 'void (%C*, i64, i64)', but the type of '%x' in
'@func_z' remains '%A.2'. The GEP resolves to a '%C.1', which conflicts with the
'@func_x' signature.
We can resolve this situation by making sure that the type is used in the
destination before saying that it should be used in the module being merged in.
With this fix, LLVM and Clang both compile under LTO.
<rdar://problem/10913281>
llvm-svn: 153351
same basic block, and it's not safe to insert code in the successor
blocks if the edges are critical edges. Splitting those edges is
possible, but undesirable, especially on the unwind side. Instead,
make the bottom-up code motion to consider invokes to be part of
their successor blocks, rather than part of their parent blocks, so
that it doesn't push code past them and onto the edges. This fixes
PR12307.
llvm-svn: 153343
This is necessary if the client wants to be able to mutate TargetOptions (for example, fast FP math mode) after the initial creation of the ExecutionEngine.
llvm-svn: 153342
dominated by Root, check that B is available throughout the scope. This
is obviously true (famous last words?) given the current logic, but the
check may be helpful if more complicated reasoning is added one day.
llvm-svn: 153323
(and hopefully on Windows). The bots have been down most of the day
because of this, and it's not clear to me what all will be required to
fix it.
The commits started with r153205, then r153207, r153208, and r153221.
The first commit seems to be the real culprit, but I couldn't revert
a smaller number of patches.
When resubmitting, r153207 and r153208 should be folded into r153205,
they were simple build fixes.
llvm-svn: 153241
execution-time regression for nsieve-bits on the ARMv7 -O0 -g nightly tester.
This may also improve compile-time on architectures that would otherwise
generate a libcall for urem (e.g., ARM) or fall back to the DAG selector.
rdar://10810716
llvm-svn: 153230
1. Declare a virtual function getPointerToNamedFunction() in JITMemoryManager
2. Move the implementation of getPointerToNamedFunction() form JIT/MCJIT to DefaultJITMemoryManager.
llvm-svn: 153205
Type legalization can zero-extend the elements of the build_vector node, so,
for example, we may have an <8 x i8> with i32 elements of value 255. That
should return 'true' for the vector being all ones.
llvm-svn: 153203
not attched to a basic block or function. There are conservatively
correct answers in these cases, and this makes the analysis more useful
in contexts where we have a partially formed bit of IR.
I don't have any way to test this directly... suggestions welcome here,
but I'm not seeing anything sadly. I only found this using a subsequent
patch to the inliner which runs instsimplify on partially inlined
instructions, and even then only on a quite large program. I never got
a reasonable testcase out of it, and anything I do get is likely to be
quite fragile due to requiring an interaction of two different passes,
and the only result being a segfault if it goes wrong.
llvm-svn: 153176
These changes allow us to compile big endian from the command line for 32 bit
Mips targets. This patch will result in code and data actually being produced
in the correct endianess.
llvm-svn: 153153
relocations (i.e., pieces of data whose addresses
are referred to elsewhere in the binary image) and
update the references when the section containing
the relocations moves. The way this works is that
there is a map from section IDs to lists of
relocations.
Because the relocations are associated with the
section containing the data being referred to, they
are updated only when the target moves. However,
many data references are relative and also depend
on the location of the referrer.
To solve this problem, I introduced a new data
structure, Referrer, which simply contains the
section being referred to and the index of the
relocation in that section. These referrers are
associated with the source containing the
reference that needs to be updated, so now
regardless of which end of the relocation moves,
the relocation will now be updated correctly.
llvm-svn: 153147
Do not call SplitBlockPredecessors on a loop preheader when one of the
predecessors is an indirectbr. Otherwise, you will hit this assert:
!isa<IndirectBrInst>(Preds[i]->getTerminator()) && "Cannot split an edge from an IndirectBrInst"
llvm-svn: 153134
instead of skipping the current loop.
My prior fix was incomplete because of an overzealous compile-time optimization:
Better fix for: <rdar://problem/11049788> Segmentation fault: 11 in LoopStrengthReduce
llvm-svn: 153131
ARMBaseRegisterInfo::canRealignStack was checking for variable-sized objects
but not for stack adjustments around calls. Use hasReservedCallFrame() to
check for both. The hasBasePointer function was already correctly checking
both conditions, so the effect of this was that a base pointer would be used
without checking whether the base pointer register could be reserved. I don't
have a small testcase for this.
<rdar://problem/11075906>
llvm-svn: 153110
This results in things such as
vmovups 16(%rdi), %xmm0
vinsertf128 $1, %xmm0, %ymm0, %ymm0
to be combined to
vinsertf128 $1, 16(%rdi), %ymm0, %ymm0
rdar://11076953
llvm-svn: 153092
i128). In that case, we may not be able to print out the MCExpr as an
expression. For instance, we could have an MCExpr like this:
0xBEEF0000BEEF0000 | (0xBEEF0000BEEF0000 << 64)
The MCExpr printer handles sizes up to 64-bits, but this expression would
require 128-bits. In this situation, try to evaluate the constant expression and
emit that as the value into 64-bit chunks.
<rdar://problem/11070338>
llvm-svn: 153081
a variable. The previous code would break the debug info changing
code invariant. This will regress debug info for arguments where
we elide the alloca created.
Fixes rdar://11066468
llvm-svn: 153074
X86InstrCompiler.td.
It also adds –mcpu-generic to the legalize-shift-64.ll test so the test
will pass if run on an Intel Atom CPU, which would otherwise
produce an instruction schedule which differs from that which the test expects.
llvm-svn: 153033
fast-isel before emitting code. If the program bails after code was emitted,
then it could lead to the stack being adjusted more than once (two
CALLSEQ_BEGINs emitted) but being adjuste back only once after the call. This
leads to general badness and gnashing of teeth.
<rdar://problem/11050630>
llvm-svn: 152959
It's not a good style idea, as the registers will be laid down in memory in
numerical order, not the order they're in the list, but it's legal. vldm/vstm
are stricter.
rdar://11064740
llvm-svn: 152943
alignment. If that's the case, then we want to make sure that we don't increase
the alignment of the store instruction. Because if we increase it to be "more
aligned" than the pointer, code-gen may use instructions which require a greater
alignment than the pointer guarantees.
<rdar://problem/11043589>
llvm-svn: 152907
It was added in 2007 as the first cut at supporting no-inline
attributes, but we didn't have function attributes of any form at the
time. However, it was added without any mention in the LangRef or other
documentation.
Later on, in 2008, Devang added function notes for 'inline=never' and
then turned them into proper function attributes. From that point
onward, as far as I can tell, the world moved on, and no one has touched
'llvm.noinline' in any meaningful way since.
It's time has now come. We have had better mechanisms for doing this for
a long time, all the frontends I'm aware of use them, and this is just
holding back progress. Given that it was never a documented feature of
the IR, I've provided no auto-upgrade support. If people know of real,
in-the-wild bitcode that relies on this, yell at me and I'll add it, but
I *seriously* doubt anyone cares.
llvm-svn: 152904
directly query the function information which this set was representing.
This simplifies the interface of the inline cost analysis, and makes the
always-inline pass significantly more efficient.
Previously, always-inline would first make a single set of every
function in the module *except* those marked with the always-inline
attribute. It would then query this set at every call site to see if the
function was a member of the set, and if so, refuse to inline it. This
is quite wasteful. Instead, simply check the function attribute directly
when looking at the callsite.
The normal inliner also had similar redundancy. It added every function
in the module with the noinline attribute to its set to ignore, even
though inside the cost analysis function we *already tested* the
noinline attribute and produced the same result.
The only tricky part of removing this is that we have to be able to
correctly remove only the functions inlined by the always-inline pass
when finalizing, which requires a bit of a hack. Still, much less of
a hack than the set of all non-always-inline functions was. While I was
touching this function, I switched a heavy-weight set to a vector with
sort+unique. The algorithm already had a two-phase insert and removal
pattern, we were just needlessly paying the uniquing cost on every
insert.
This probably speeds up some compiles by a small amount (-O0 compiles
with lots of always-inline, so potentially heavy libc++ users), but I've
not tried to measure it.
I believe there is no functional change here, but yell if you spot one.
None are intended.
Finally, the direction this is going in is to greatly simplify the
inline cost query interface so that we can replace its implementation
with a much more clever one. Along the way, all the APIs get simplified,
so it seems incrementally good.
llvm-svn: 152903
analysis implementation. The header was already separated. Also cleanup
all the comments in the header to follow a nice modern doxygen form.
There is still plenty of cruft here, but some of that will fall out in
subsequent refactorings and this was an easy step in the right
direction. No functionality changed here.
llvm-svn: 152898
These edges are not really necessary, but it is consistent with the
way we currently create physreg edges. Scheduler heuristics that
expect a DAG edge to the block terminator could benefit from this
change. Although in the future I hope we have a better mechanism for
modeling latency across scheduling regions.
llvm-svn: 152895
Only record IVUsers that are dominated by simplified loop
headers. Otherwise SCEVExpander will crash while looking for a
preheader.
I previously tried to work around this in LSR itself, but that was
insufficient. This way, LSR can continue to run if some uses are not
in simple loops, as long as we don't attempt to analyze those users.
Fixes <rdar://problem/11049788> Segmentation fault: 11 in LoopStrengthReduce
llvm-svn: 152892
on our internal nightly testers. So, basically revert r152486 again.
Abbreviated original commit message:
Implement a more intelligent way of spilling uses across an invoke boundary.
It looks as if Chander's inlining work, r152737, exposed an issue.
llvm-svn: 152887
It caused MSP430DAGToDAGISel::SelectIndexedBinOp() to be miscompiled.
When two ReplaceUses()'s are expanded as inline, vtable in base class is stored to latter (ISelUpdater)ISU.
llvm-svn: 152877
theoretical fix since it only matters for types with >= 2^63 bits (!) and also
only matters if pointers have more than 64 bits, which is not supported anyway.
llvm-svn: 152831
This needs a test, but it will take some time to figure
out the best way to get an input that will produce > 2^16 relocs.
Patch by Graydon Hoare!
llvm-svn: 152787
out the DW_AT_name. Older gdbs unfortunately still use it to
disambiguate member functions in templated classes (gdb.cp/templates.exp).
rdar://11043421 (which is now deferred for a bit)
llvm-svn: 152782