Remove an accidentally-added instruction definition and add a comment in the
test case. This is in response to a post-commit review by Bill Schmidt.
No functionality change intended.
llvm-svn: 177404
The ARM backend currently has poor codegen for long sext/zext
operations, such as v8i8 -> v8i32. This patch addresses this
by performing a custom expansion in ARMISelLowering. It also
adds/changes the cost of such lowering in ARMTTI.
This partially addresses PR14867.
Patch by Pete Couperus
llvm-svn: 177380
PPC64 supports unaligned loads and stores of 64-bit values, but
in order to use the r+i forms, the offset must be a multiple of 4.
Unfortunately, this cannot always be determined by examining the
immediate itself because it might be available only via a TOC entry.
In order to get around this issue, we additionally predicate the
selection of the r+i form on the alignment of the load or store
(forcing it to be at least 4 in order to select the r+i form).
llvm-svn: 177338
The default logic marks them as too expensive.
For example, before this patch we estimated:
cost of 16 for instruction: %r = uitofp <4 x i16> %v0 to <4 x float>
While this translates to:
vmovl.u16 q8, d16
vcvt.f32.u32 q8, q8
All other costs are left to the values assigned by the fallback logic. Theses
costs are mostly reasonable in the sense that they get progressively more
expensive as the instruction sequences emitted get longer.
radar://13445992
llvm-svn: 177334
Fix cost of some "cheap" cast instructions. Before this patch we used to
estimate for example:
cost of 16 for instruction: %r = fptoui <4 x float> %v0 to <4 x i16>
While we would emit:
vcvt.s32.f32 q8, q8
vmovn.i32 d16, q8
vuzp.8 d16, d17
All other costs are left to the values assigned by the fallback logic. Theses
costs are mostly reasonable in the sense that they get progressively more
expensive as the instruction sequences emitted get longer.
radar://13434072
llvm-svn: 177333
We hitch a ride with the existing OpndItins class that was used to add
instruction itinerary classes in the many multiclasses in this file.
Use the link provided by the X86FoldableSchedWrite.Folded to find the
right SchedWrite for folded loads.
llvm-svn: 177326
This new-style scheduling information is going to replace the
instruction iteneraries.
This also serves as a test case for Andy's fix in r177317.
llvm-svn: 177323
This commit fixes an assert that would occur on loops with large constant counts
(like looping for ((uint32_t) -1) iterations on PPC64). The existing code did
not handle counts that it computed to be negative (asserting instead), but
these can be created with valid inputs.
This bug was discovered by bugpoint while I was attempting to isolate a
completely different problem.
Also, in writing test cases for the negative-count problem, I discovered that
the ori/lsi handling was broken (there was a typo which caused the logic that
was supposed to detect these pairs and extract the iteration count to always
fail). This has now also been corrected (and is covered by one of the new test
cases).
llvm-svn: 177295
Because the initial-value constants had not been added to the list
of instructions considered for DCE the resulting code had redundant
constant-materialization instructions.
llvm-svn: 177294
Unfortunately the previous fix for inserting waits for unordered
defines wasn't sufficient, cause it's possible that even ordered
defines are only partially used (or not used at all).
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 177271
MinGW is almost completely compatible to MSVC, with the exception of the _tls_array global not being available.
Patch by David Nadlinger!
llvm-svn: 177257
This change cleans up two issues with Altivec register spilling:
1. The spilling code was inefficient (using two instructions, and add and a
load, when just one would do)
2. The code assumed that r0 would always be available (true for now, but this
will change)
The new code handles VR spilling just like GPR spills but forced into r+r mode.
As a result, when any VR spills are present, we must now always allocate the
register-scavenger spill slot.
llvm-svn: 177231
As a follow-up to r158719, remove PPCRegisterInfo::avoidWriteAfterWrite.
Jakob pointed out in response to r158719 that this callback is currently unused
and so this has no effect (and the speedups that I thought that I had observed
as a result of implementing this function must have been noise).
llvm-svn: 177228
Since almost all X86 instructions can fold loads, use a multiclass to
define register/memory pairs of SchedWrites.
An X86FoldableSchedWrite represents the register version of an
instruction. It holds a reference to the SchedWrite to use when the
instruction folds a load.
This will be used inside multiclasses that define rr and rm instruction
versions together.
llvm-svn: 177210
I was too pessimistic in r177105. Vector selects that fit into a legal register
type lower just fine. I was mislead by the code fragment that I was using. The
stores/loads that I saw in those cases came from lowering the conditional off
an address.
Changing the code fragment to:
%T0_3 = type <8 x i18>
%T1_3 = type <8 x i1>
define void @func_blend3(%T0_3* %loadaddr, %T0_3* %loadaddr2,
%T1_3* %blend, %T0_3* %storeaddr) {
%v0 = load %T0_3* %loadaddr
%v1 = load %T0_3* %loadaddr2
==> FROM:
;%c = load %T1_3* %blend
==> TO:
%c = icmp slt %T0_3 %v0, %v1
==> USE:
%r = select %T1_3 %c, %T0_3 %v0, %T0_3 %v1
store %T0_3 %r, %T0_3* %storeaddr
ret void
}
revealed this mistake.
radar://13403975
llvm-svn: 177170
Unaligned access is supported on PPC for non-vector types, and is generally
more efficient than manually expanding the loads and stores.
A few of the existing test cases were using expanded unaligned loads and stores
to test other features (like load/store with update), and for these test cases,
unaligned access remains disabled.
llvm-svn: 177160
In preparation for the addition of other SIMD ISA extensions (such as QPX) we
need to make sure that all Altivec patterns are properly predicated on having
Altivec support.
No functionality change intended (one test case needed to be updated b/c it
assumed that Altivec intrinsics would be supported without enabling Altivec
support).
llvm-svn: 177152
For spills into a large stack frame, the FI-elimination code uses the register
scavenger to obtain a free GPR for use with an r+r-addressed load or store.
When there are no available GPRs, the scavenger gets one by using its spill
slot. Previously, we were not always allocating that spill slot and the RS
would assert when the spill slot was needed.
I don't currently have a small test that triggered the assert, but I've
created a small regression test that verifies that the spill slot is now
added when the stack frame is sufficiently large.
llvm-svn: 177140
The new InstrSchedModel is easier to use than the instruction
itineraries. It will be used to model instruction latency and throughput
in modern Intel microarchitectures like Sandy Bridge.
InstrSchedModel should be able to coexist with instruction itinerary
classes, but for cleanliness we should switch the Atom processor model
to the new InstrSchedModel as well.
llvm-svn: 177122
See the Mips16ISetLowering.cpp patch to see a use of this.
For now now the extra code in Mips16ISetLowering.cpp is a nop but is
used for test purposes. Mips32 registers are setup and then removed and
then the Mips16 registers are setup.
Normally you need to add register classes and then call
computeRegisterProperties.
llvm-svn: 177120
This is a generic function (derived from PEI); moving it into
MachineFrameInfo eliminates a current redundancy between the ARM and AArch64
backends, and will allow it to be used by the PowerPC target code.
No functionality change intended.
llvm-svn: 177111
Add the current PEI register scavenger as a parameter to the
processFunctionBeforeFrameFinalized callback.
This change is necessary in order to allow the PowerPC target code to
set the register scavenger frame index after the save-area offset
adjustments performed by processFunctionBeforeFrameFinalized. Only
after these adjustments have been made is it possible to estimate
the size of the stack frame.
llvm-svn: 177108
Make requiresFrameIndexScavenging return true, and create virtual registers in
the spilling code instead of using the register scavenger directly. This makes
the target-level code simpler, and importantly, delays the scavenging until
after callee-saved register processing (which will be important for later
changes).
Also cleans up trackLivenessAfterRegAlloc (makes it inline in the header with
the other related functions). This makes it clear that it always returns true.
No functionality change intended.
llvm-svn: 177107
We used to add a spill slot for the register scavenger whenever the function
has a frame pointer. This is unnecessarily conservative: We may need the spill
slot for dynamic stack allocations, and functions with dynamic stack
allocations always have a FP, but we might also have a FP for other reasons
(such as the user explicitly disabling frame-pointer elimination), and we don't
necessarily need a spill slot for those functions.
The structsinregs test needed adjustment because it disables FP elimination.
llvm-svn: 177106
By terrible I mean we store/load from the stack.
This matters on PAQp8 in _Z5trainPsS_ii (which is inlined into Mixer::update)
where we decide to vectorize a loop with a VF of 8 resulting in a 25%
degradation on a cortex-a8.
LV: Found an estimated cost of 2 for VF 8 For instruction: icmp slt i32
LV: Found an estimated cost of 2 for VF 8 For instruction: select i1, i32, i32
The bug that tracks the CodeGen part is PR14868.
radar://13403975
llvm-svn: 177105
I don't think that it is otherwise clear how the overlapping offsets
are processed into distinct spill slots. Comment that this is done
in processFunctionBeforeFrameFinalized.
llvm-svn: 177094
This doesn't reset all of the target options within the TargetOptions
object. This is because some of those are ABI-specific and must be determined if
it's okay to change those on the fly.
llvm-svn: 176986
Increase the cost of v8/v16-i8 to v8/v16-i32 casts and truncates as the backend
currently lowers those using stack accesses.
This was responsible for a significant degradation on
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
where we vectorize one loop to a vector factor of 16. After this patch we select
a vector factor of 4 which will generate reasonable code.
unsigned char cle[32];
void test(short c) {
unsigned short compte;
for (compte = 0; compte <= 31; compte++) {
cle[compte] = cle[compte] ^ c;
}
}
radar://13220512
llvm-svn: 176898
Now that only the register-scavenger version of the CR spilling code remains,
we no longer need the Darwin R2 hack. Darwin can use R0 as a spare register in
any case where the System V ABI uses it (R0 is special architecturally, and so
is reserved under all common ABIs).
A few test cases needed to be updated to reflect the register-allocation changes.
llvm-svn: 176868
This removes the -disable-ppc[32|64]-regscavenger options; the code
that uses the register scavenger has been working well (and has been the default)
for some time, and we don't need options to enable the old (broken) CR spilling code.
llvm-svn: 176865
The strlen+memcmp was hidden in a call to StringRef::operator==. We check if
there are any null bytes in the string upfront so we can simplify the comparison
Small speedup when compiling code with many function calls.
llvm-svn: 176766
fold selectcc (selectcc x, y, a, b, cc), b, a, b, setne ->
selectcc x, y, a, b, cc
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 176700
Two changes:
1. Prefer SET* instructions when possible
2. Handle the CND*_INT case with floating-point args
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 176699
LegalizeDAG.cpp uses the value of the comparison operands when checking
the legality of BR_CC, so DAGCombiner should do the same.
v2:
- Expand more BR_CC value types for NVPTX
v3:
- Expand correct BR_CC value types for Hexagon, Mips, and XCore.
llvm-svn: 176694
This is certainly not the last word on scheduling for this target, but
right now this allows a few apps to run / finish with radeonsi, most
notably UT2004 / Lightsmark. They fail to compile some shaders with the
default scheduler because it ends up trying to spill registers, which
we don't support yet (and which is probably a bad idea in general for
performance if it can be avoided).
NOTE: This is a candidate for the Mesa stable branch.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176687
That can usually be lowered efficiently and is common in sandybridge code.
It would be nice to do this in DAGCombiner but we can't insert arbitrary
BUILD_VECTORs this late.
Fixes PR15462.
llvm-svn: 176634
v2: update CMakeLists.txt as well
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176626
v2: fix R600 regressions
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176624
Just encode the type as target specific attribute.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 176622
- Phi nodes should be replaced/updated after lowering CMOV into branch
because 'mainMBB' updating operand in Phi node is changed.
- Add EFLAGS in livein before lowering the 2nd CMOV. It's necessary as
we will reuse the EFLAGS generated before the 1st lowered CMOV, which
won't clobber EFLAGS. However, we need explicitly specify that.
- '-attr=-cmov' test case are added.
llvm-svn: 176598