Summary:
With D9096 and D9101, there's no need to run DCE after SLSR and
SeparateConstOffsetFromGEP.
Test Plan: no regression
Reviewers: jholewinski, meheff
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D9172
llvm-svn: 235415
There doesn't seem to be a reason to perform this target ISD node matching
in an DAGCombine, moving it to lowering fixes PR23296.
Differential Revision: http://reviews.llvm.org/D9137
llvm-svn: 235394
Summary:
This directive is exactly the same as .asciz, except it's only used by MIPS.
It is used to store null terminated strings in object files.
Reviewers: rafael, dsanders, echristo
Reviewed By: dsanders, echristo
Subscribers: echristo, llvm-commits
Differential Revision: http://reviews.llvm.org/D7530
llvm-svn: 235382
Summary:
The 64-bit version of the variable shift instructions uses the
shift_rotate_reg class which uses a GPR32Opnd to specify the variable
shift amount. With this patch we avoid the generation of a redundant
SLL instruction for the variable shift instructions in 64-bit targets.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7413
llvm-svn: 235376
This is an updated version of Chandler's patch D7402 that got accepted but never committed, and has bit-rotted a bit since.
I've updated the execution domain declarations to match the approach of the packed templates and also added some extra scalar unary tests.
Differential Revision: http://reviews.llvm.org/D9095
llvm-svn: 235372
X86ISD::ADDSUB, X86ISD::(F)HADD, X86ISD::(F)HSUB should not be selected
if the operand types do not match the result type because vector type
legalization cannot deal with this for custom nodes.
Testcase X86ISD::ADDSUB is attached. I could not create a testcase for
the FHADD/FHSUB cases because of: https://llvm.org/bugs/show_bug.cgi?id=23296
Differential Revision: http://reviews.llvm.org/D9120
llvm-svn: 235367
Summary:
Set operation action for FP16 conversion opcodes, so the Op legalizer
can choose the gnu_* libcalls for Mips.
Set LoadExtAction and TruncStoreAction for f16 scalars and vectors to
prevent (fpext (load )) and (store (fptrunc)) from getting combined into
unsupported operations.
Added test cases to test that these operations are handled correctly
for f16 scalars and vectors. This patch depends on
http://reviews.llvm.org/D8755.
Reviewers: srhines
Subscribers: llvm-commits, ab
Differential Revision: http://reviews.llvm.org/D8804
llvm-svn: 235341
This fixes a regression introduced at revision 231243.
The target-independent selection algorithm in FastISel knows how to select
a SINT_TO_FP if the target is SSE but not AVX. That is because on X86, the
tablegen'd 'fastEmit' functions know how to select CVTSI2SSrr and CVTSI2SDrr.
Method X86FastISel::X86SelectSIToFP was therefore working under the
wrong assumption that the target was AVX. That assumption was incorrect since
we can have a target that is neither AVX nor SSE.
So, rather than asserting for the presence of AVX, we should have had an
early exit from 'X86SelectSIToFP' if the target was not AVX.
This patch fixes the issue replacing the invalid assertion with an early exit.
Thanks to Dimitry Andric for reporting this problem and for providing a small
reproducible testcase. Added test pr23273.ll.
llvm-svn: 235295
The fix ensures that scalar sources inserted into a vector are the correct bit size.
Integer scalar sources from BUILD_VECTOR and SCALAR_TO_VECTOR nodes may require truncation that this function doesn't currently support.
llvm-svn: 235281
The result is either an Untyped reg sequence, on ldN with N > 1, or
just the type of the input vector, on ld1. Don't force Untyped.
Instead, just use the type of the reg sequence.
This mirrors the behavior of createTuple, which feeds the LD1*_POST.
The narrow code path wasn't actually covered by tests, because V64
insert_vector_elt are widened to V128 before the LD1LANEpost combine
has the chance to run, usually.
The only case where it does run on V64 vectors is if the vector ops
legalizer ran. So, tickle the code with a ctpop.
Fixes PR23265.
llvm-svn: 235243
Summary: Implement the method FastMaterializeAlloca in Mips fast-isel
Based on a patch by Reed Kotler.
Test Plan:
Passes test-suite at O0/O2 for mips32 r1/r2
fastalloca.ll
Reviewers: dsanders, rkotler
Subscribers: rfuhler, llvm-commits
Differential Revision: http://reviews.llvm.org/D6742
llvm-svn: 235213
Summary:
Add shift operators implementation to fast-isel for Mips. These are shift ops
for non legal forms, i.e. i8 and i16.
Based on a patch by Reed Kotler.
Test Plan:
Reviewers: dsanders
Subscribers: echristo, rfuhler, llvm-commits
Differential Revision: http://reviews.llvm.org/D6726
llvm-svn: 235194
Summary:
Previously, the presence of KILL instructions would block valid candidates
from filling a specific delay slot. With the elimination of the KILL
instructions, in the appropriate range, we are able to fill more slots and
keep the information from future def/use analysis consistent.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: hfinkel, llvm-commits
Differential Revision: http://reviews.llvm.org/D7724
llvm-svn: 235183
Summary:
For example, a common idiom was 'isN64 ? Mips::SP_64 : Mips::SP'. This has
been moved to MipsABIInfo and replaced with 'ABI.GetStackPtr()'.
There are others that should also be moved. This patch sticks to the ones that
are obviously non-functional. The others have minor mistakes that need fixing
at the same time, mostly involving checks for 64-bit GPR's instead of checks
for 64-bit pointers.
Reviewers: tomatabacu
Reviewed By: tomatabacu
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8972
llvm-svn: 235173
Found by code inspection, but breaking i16 at least breaks other tests.
They aren't checking this in particular though, so also add some
explicit tests for the already working types.
llvm-svn: 235148
A big-endian vector return needs a byte-swap which we aren't doing right now.
For now just bail on these cases to get correctness back.
llvm-svn: 235133
Fixed compilation with clang on some buildbots with "-Werror -Wmissing-field-initializers"
Related to: http://reviews.llvm.org/rL235089
llvm-svn: 235099
Summary: Previously, this was only happening for functions, but because of .insn, objects can also be marked now.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8007
llvm-svn: 235095
In order to introduce v8.1a-specific entities, Mappers should be aware of SubtargetFeatures available.
This patch introduces refactoring, that will then allow to easily introduce:
- v8.1-specific "pan" PState for PStateMapper (PAN extension)
- v8.1-specific sysregs for SysRegMapper (LOR,VHE extensions)
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8496
Patch by Tom Coxon
llvm-svn: 235089
Summary:
This assembler directive marks the current label as an instruction label in microMIPS and MIPS16.
This initial implementation works only for microMIPS.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8006
llvm-svn: 235084
BXJ was incorrectly said to be unsupported in ARMv8-A. It is not
supported in the A64 instruction set, but it is supported in the T32
and A32 instruction sets, because it's listed as an instruction in the
ARM ARM section F7.1.28.
Using SP as an operand to BXJ changed from UNPREDICTABLE to
PREDICTABLE in v8-A. This patch reflects that update as well.
This was found by MCHammer.
llvm-svn: 235024
This is a 1-line patch (with a TODO for AVX because that will affect
even more regression tests) that lets us substitute the appropriate
64-bit store for the float/double/int domains.
It's not clear to me exactly what the difference is between the 0xD6 (MOVPQI2QImr) and
0x7E (MOVSDto64mr) opcodes, but this is apparently the right choice.
Differential Revision: http://reviews.llvm.org/D8691
llvm-svn: 235014
Set the transform bar at 2 divisions because the fastest current
x86 FP divider circuit is in SandyBridge / Haswell at 10 cycle
latency (best case) relative to a 5 cycle multiplier.
So that's the worst case for this transform (no latency win),
but multiplies are obviously pipelined while divisions are not,
so there's still a big throughput win which we would expect to
show up in typical FP code.
These are the sequences I'm comparing:
divss %xmm2, %xmm0
mulss %xmm1, %xmm0
divss %xmm2, %xmm0
Becomes:
movss LCPI0_0(%rip), %xmm3 ## xmm3 = mem[0],zero,zero,zero
divss %xmm2, %xmm3
mulss %xmm3, %xmm0
mulss %xmm1, %xmm0
mulss %xmm3, %xmm0
[Ignore for the moment that we don't optimize the chain of 3 multiplies
into 2 independent fmuls followed by 1 dependent fmul...this is the DAG
version of: https://llvm.org/bugs/show_bug.cgi?id=21768 ...if we fix that,
then the transform becomes even more profitable on all targets.]
Differential Revision: http://reviews.llvm.org/D8941
llvm-svn: 235012
Summary:
MSP430 doesn't seem to have any additional constraints. Therefore remove
the target hook.
No functional change intended.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8208
llvm-svn: 235003
Summary:
Refactor MipsAsmParser::getATReg to return an internal register number instead of a register index.
Also change all the int's to unsigned, seeing as the current AT register index is stored as an unsigned in MipsAssemblerOptions.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8478
llvm-svn: 234996
The ARMv8 ARMARM states that for these instructions in A64 state:
"Unspecified bits in "imm5" are ignored but should be set to zero by an assembler.", (imm4 for INS).
Make the disassembler accept any encoding with these ignored bits set to 1.
llvm-svn: 234896
This pass will always try to insert llvm.SI.ifbreak intrinsics
in the same block that its conditional value is computed in. This is
a problem when conditions for breaks or continue are computed outside
of the loop, because the llvm.SI.ifbreak intrinsic ends up being inserted
outside of the loop.
This patch fixes this problem by inserting the llvm.SI.ifbreak
intrinsics in the loop header when the condition is computed outside
the loop.
llvm-svn: 234891
Some targets (ie. Mips) have additional rules for ordering the relocation
table entries. Allow them to override generic sortRelocs(), which sorts
entries by Offset.
Then override this function for Mips, to emit HI16 and GOT16 relocations
against the local symbol in pair with the corresponding LO16 relocation.
Patch by Vladimir Stefanovic.
Differential Revision: http://reviews.llvm.org/D7414
llvm-svn: 234883
Gut the `DIDescriptor` wrappers around `MDLocalScope` subclasses. Note
that `DILexicalBlock` wraps `MDLexicalBlockBase`, not `MDLexicalBlock`.
llvm-svn: 234850
Gut all the non-pointer API from the variable wrappers, except an
implicit conversion from `DIGlobalVariable` to `DIDescriptor`. Note
that if you're updating out-of-tree code, `DIVariable` wraps
`MDLocalVariable` (`MDVariable` is a common base class shared with
`MDGlobalVariable`).
llvm-svn: 234840
Revert "Remove default in fully-covered switch (to fix Clang -Werror -Wcovered-switch-default)"
Revert "R600: Add carry and borrow instructions. Use them to implement UADDO/USUBO"
Revert "LegalizeDAG: Try to use Overflow operations when expanding ADD/SUB"
Using overflow operations fails CodeGen/Generic/2011-07-07-ScheduleDAGCrash.ll
on hexagon, nvptx, and r600. Revert while I investigate.
llvm-svn: 234768
v2: tighten the sub64 tests
v3: rename to CARRY/BORROW
v4: fixup test cmdline
add known bits computation
use sign extend instead of sub 0,x
better add test
v5: remove redundant break
move lowering to separate functions
fix comments
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewers: arsenm
llvm-svn: 234759
When I fixed these a couple of days ago to iterate over all loops, not just
depth == 1 loops, I inadvertently made it such that we'd only look at the first
top-level loop. Make sure that we really look at all of them.
llvm-svn: 234705
As it turns out, even though these are part of ISA 2.06, the P7 does not
support them (or, at least, not any P7s we're tested so far).
llvm-svn: 234686
This patch corresponds to review:
http://reviews.llvm.org/D8928
It adds direct move instructions to/from VSX registers to GPR's. These are
exploited for FP <-> INT conversions.
llvm-svn: 234682
The patch is generated using clang-tidy misc-use-override check.
This command was used:
tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py \
-checks='-*,misc-use-override' -header-filter='llvm|clang' \
-j=32 -fix -format
http://reviews.llvm.org/D8925
llvm-svn: 234679
This pass had the same problem as the data-prefetching pass: it was only
checking for depth == 1 loops in practice. Fix that, add some debugging
statements, and make sure that, when we grab an AddRec, it is for the loop we
expect.
llvm-svn: 234670
Currently, there's a single flag, checked by the pass itself.
It can't force-enable the pass (and is on by default), because it
might not even have been created, as that's the targets decision.
Instead, have separate explicit flags, so that the decision is
consistently made in the target.
Keep the flag as a last-resort "force-disable GlobalMerge" for now,
for backwards compatibility.
llvm-svn: 234666
The spilled registers are pristine and thus, correctly handled by
the register scavenger and so on, but the liveness information is
strictly speaking wrong at this point.
Fix that.
llvm-svn: 234664
Iterating over loops from the LoopInfo instance only provides top-level loops.
We need to search the whole tree of loops to find the inner ones.
llvm-svn: 234603
Using SchedAliases is convenient and works well for latency and resource
lookup for instructions. However, this creates an entry in
AArch64WriteLatencyTable with a WriteResourceID of 0, breaking any
SchedReadAdvance since the lookup will fail.
http://reviews.llvm.org/D8043
Patch by Dave Estes <cestes@codeaurora.org>!
llvm-svn: 234594
Summary:
Some optimizations such as jump threading and loop unswitching can negatively
affect performance when applied to divergent branches. The divergence analysis
added in this patch conservatively estimates which branches in a GPU program
can diverge. This information can then help LLVM to run certain optimizations
selectively.
Test Plan: test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
Reviewers: resistor, hfinkel, eliben, meheff, jholewinski
Subscribers: broune, bjarke.roune, madhur13490, tstellarAMD, dberlin, echristo, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8576
llvm-svn: 234567
When we have an instruction for this (and, thus, don't generate a runtime
call), we need to custom type legalize this (in a trivial way, just as we do
for fp_to_sint).
Fixes PR23173.
llvm-svn: 234561
For the most common ones (such as fadd), we already did the promotion.
Do the same thing for all the others.
Currently, we'll just crash/assert on all these operations, as
there's no hardware or libcall support whatsoever.
f16 (half) is specified as an interchange - not arithmetic - format,
and is expected to be promoted to single-precision for arithmetic
operations.
While there, teach the legalizer about promoting some of the (mostly
floating-point) operations that we never needed before.
Differential Revision: http://reviews.llvm.org/D8648
See related discussion on the thread for: http://reviews.llvm.org/D8755
llvm-svn: 234550
This is the patch corresponding to review:
http://reviews.llvm.org/D8406
It adds some missing instructions from ISA 2.06 to the PPC back end.
llvm-svn: 234546
formatted_raw_ostream is a wrapper over another stream to add column and line
number tracking.
It is used only for asm printing.
This patch moves the its creation down to where we know we are printing
assembly. This has the following advantages:
* Simpler lifetime management: std::unique_ptr
* We don't compute column and line number of object files :-)
llvm-svn: 234535
The integer extend optimization tries to fold the extend into the load
instruction. This requires us to identify if the extend has already been
emitted or not and act accordingly on it.
The check that was originally performed for this was not sufficient. Besides
checking the ValueMap for a mapped register we also need to check if the
virtual register has already an associated machine instruction that defines it.
This fixes rdar://problem/20470788.
llvm-svn: 234529
Revert "Add classof implementations to the raw_ostream classes."
Revert "Use the cast machinery to remove dummy uses of formatted_raw_ostream."
The underlying issue can be fixed without classof.
llvm-svn: 234495
Currently, llvm (backend) doesn't know cortex-r4, even though it is the
default target for armv7r. Using "--target=armv7r-arm-none-eabi" provokes
'cortex-r4' is not a recognized processor for this target' by llvm.
This patch adds support for cortex-r4 and, very closely related, r4f.
llvm-svn: 234486
Summary:
Make the code more readable by fusing the for-loops together and explicitly checking for each register class.
Also, this version is more straightforward because it doesn't assume that FPU registers always come before CPU registers in the CalleeSavedInfo vector.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8033
llvm-svn: 234475
restrictions when choosing a type for small-memcpy inlining in
SelectionDAGBuilder.
This ensures that the loads and stores output for the memcpy won't be further
expanded during legalization, which would cause the total number of instructions
for the memcpy to exceed (often significantly) the inlining thresholds.
<rdar://problem/17829180>
llvm-svn: 234462
Because -menable-no-nans causes fcmp conditions to be rewritten
without 'o' or 'u' the recognition code in needs to cope. Also
extended it to handle 'le' and 'ge.
Differential Revision: http://reviews.llvm.org/D8725
llvm-svn: 234421
Summary:
Even though there is no 2nd register operand in the "lw/sw $8, symbol" case, we still try to find one,
and we end up with $0, which makes us generate an unnecessary "addu $8, $8, $0" (a.k.a. "move $8, $8").
We can avoid this by checking if the 2nd register operand is different from $0, before generating the addu.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8055
llvm-svn: 234406
Summary:
They are of the form "bnezl/beqzl $rs, offset" and expand to "bnel/beql $rs, $zero, offset".
These instructions are used in Linux inline assembly.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8540
llvm-svn: 234401
Summary: Looks like new code from [[ http://reviews.llvm.org/rL222057 | rL222057 ]] doesn't account for early `return` in `ARMFrameLowering::emitPrologue`, which leads to loosing `.cfi_def_cfa_offset` directive for functions without stack frame.
Reviewers: echristo, rengolin, asl, t.p.northover
Reviewed By: t.p.northover
Subscribers: llvm-commits, rengolin, aemerson
Differential Revision: http://reviews.llvm.org/D8606
llvm-svn: 234399
Summary:
These AssemblerPredicate's are unnecessary and actually make some instructions unusable when assembling pre-MIPS32 ISAs.
For example, this was causing the IAS to reject the 'j' instruction for MIPS I-V.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8300
llvm-svn: 234398
This is currently considered experimental, but most of the more
commonly used instructions should work.
So far only SI has been extensively tested, CI and VI probably work too,
but may be buggy. The current set of tests cases do not give complete
coverage, but I think it is sufficient for an experimental assembler.
See the documentation in R600Usage for more information.
llvm-svn: 234381
We weren't checking the sign of the floating point immediate before translating
it to "fmov sD, wzr". Similarly for D-regs.
Technically "movi vD.2s, #0x80, lsl #24" would work most of the time, but it's
not a blessed alias (and I don't think it should be since people expect writing
sD to zero out the high lanes, and there's no dD equivalent). So an error it is.
rdar://20455398
llvm-svn: 234372
This shouldn't affect anything in-tree, as the OperandType users are
mostly smart disassemblers and such; more information is helpful there.
However, on the flip side, that + the fact that this is just hinting at
the meaning of operands makes this not really test-worthy or testable.
Differential Revision: http://reviews.llvm.org/D8620
llvm-svn: 234350
Instead of lowering SELECT to SELECT_CC which is further lowered later
immediately call the SELECT_CC lowering code. This is preferable
because:
- Avoids an unnecessary roundtrip through the legalization queues with
an intermediate node.
- More importantly: Lowered operations get visited last leading to SELECT_CC
getting visited with legalized operands and unlegalized ones for preexisting
SELECT_CC nodes. This does not hurt the current code (hence no testcase) but
is required for another patch I am working on.
Differential Revision: http://reviews.llvm.org/D8187
llvm-svn: 234334
Summary:
This is not possible when using the IAS for MIPS, but it is possible when using the IAS for other architectures and when using GAS for MIPS.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8578
llvm-svn: 234316
After recognising that a certain narrow instruction might need a relocation to
be represented, we used to unconditionally relax it to a Thumb2 instruction to
permit this. Unfortunately, some CPUs (e.g. v6m) don't even have most Thumb2
instructions, so we end up emitting a completely invalid instruction.
Theoretically, ELF does have relocations for these situations; but they are
fairly unusable with such short ranges and the ABI document even says they're
documented "for completeness". So an error is probably better there too.
rdar://20391953
llvm-svn: 234195
This patch allows SSE4.1 targets to use (V)PINSRB to create 16i8 vectors by inserting i8 scalars directly into a XMM register instead of merging pairs of i8 scalars into a i16 and using the SSE2 PINSRW instruction.
This allows folding of byte loads and reduces scalar register usage as well.
Differential Revision: http://reviews.llvm.org/D8839
llvm-svn: 234193
As pr19627 points out, every use of AliasedSymbol is likely a bug.
The main use was to avoid the oddity of a variable showing up as undefined. That
was fixed in r233995, which made these calls nops.
llvm-svn: 234169
This allows the compiler/assembly programmer to switch back to a
section. This in turn fixes the bootstrap failure on powerpc (tested
on gcc110) without changing the ppc codegen at all.
I will try to cleanup the various getELFSection overloads in a followup patch.
Just using a default argument now would lead to ambiguities.
llvm-svn: 234099
Previously the patterns didn't have high enough priority and we would only use the GR32 form if the only the upper 32 or 56 bits were zero.
Fixes PR23100.
llvm-svn: 234075
We don't need to represent UnwindHelp in IR. Instead, we can use the
knowledge that we are emitting the parent function to decide if we
should create the UnwindHelp stack object.
llvm-svn: 234061
The plan here is to push the API changes out from the common components
(like Constant::getGetElementPtr and IRBuilder::CreateGEP related
functions) and just update callers to either pass the type if it's
obvious, or pass null.
Do this with LoadInst as well and anything else that comes up, then to
start porting specific uses to not pass null anymore - this may require
some refactoring in each case.
llvm-svn: 234042
As a follow-up to r234021, assert that a debug info intrinsic variable's
`MDLocalVariable::getInlinedAt()` always matches the
`MDLocation::getInlinedAt()` of its `!dbg` attachment.
The goal here is to get rid of `MDLocalVariable::getInlinedAt()`
entirely (PR22778), but I'll let these assertions bake for a while
first.
If you have an out-of-tree backend that just broke, you're probably
attaching the wrong `DebugLoc` to a `DBG_VALUE` instruction. The one
you want is the location that was attached to the corresponding
`@llvm.dbg.declare` or `@llvm.dbg.value` call that you started with.
llvm-svn: 234038
When enabling PPC64LE, I disabled some optimizations of BUILD_VECTOR
nodes for little endian because wrong results were produced. I've
subsequently investigated and found this is due to a call to
BuildVectorSDNode::isConstantSplat that was always specifying
big-endian. With this changed to correctly identify the target
endianness, the optimizations work as expected.
I found another case of a call to the same method with big-endian
hardcoded, in PPC::isAllNegativeZeroVector(). I discovered this was
an orphaned method with no callers, so I've just removed it.
The existing test/CodeGen/PowerPC/vec_constants.ll checks these
optimizations, so for testing I've just added a variant for little
endian.
llvm-svn: 234011
Fixes PR19582.
Previously, when an asm assignment (.set or =) was created, we would look up
the section immediately in MCSymbol::setVariableValue. This caused symbols
to receive the wrong section if the RHS of the assignment had not been seen
yet. This had a knock-on effect in the object file emitters, causing them
to emit extra symbols, or to give symbols the wrong visibility or the wrong
section. For example, in the following asm:
.data
.Llocal:
.text
leaq .Llocal1(%rip), %rdi
.Llocal1 = .Llocal2
.Llocal2 = .Llocal
the first assignment would give .Llocal1 a null section, which would never get
fixed up by the second assignment. This would cause the ELF object file emitter
to consider .Llocal1 to be an undefined symbol and give it external linkage,
even though .Llocal1 should not have been emitted at all in the object file.
Or in the following asm:
alias_to_local = Ltmp0
Ltmp0:
the Mach-O object file emitter would give the alias_to_local symbol a n_type
of N_SECT and a n_sect of 0. This is invalid under the Mach-O specification,
which requires N_SECT symbols to receive a non-zero section number if the
symbol is defined in a section in the object file.
https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/#//apple_ref/c/tag/nlist
After this change we do not look up the section when the assignment is created,
but instead look it up on demand and store it in Section, which is treated
as a cache if the symbol is a variable symbol.
This change also fixes a bug in MCExpr::FindAssociatedSection. Previously,
if we saw a subtraction, we would return the first referenced section, even in
cases where we should have been returning the absolute pseudo-section. Now we
always return the absolute pseudo-section for expressions that subtract two
section-derived expressions. This isn't always correct (e.g. if one of the
sections ends up being laid out at an absolute address), but it's probably
the best we can do without more context.
This allows us to remove code in two places where we appear to have been
working around this bug, in MachObjectWriter::markAbsoluteVariableSymbols
and in X86AsmPrinter::EmitStartOfAsmFile.
Re-applies r233595 (aka D8586), which was reverted in r233898.
Differential Revision: http://reviews.llvm.org/D8798
llvm-svn: 233995
Register coalescing can change the target of a RegPair hint to a
physreg, we should not crash on this. This also slightly improved the
way ARMBaseRegisterInfo::updateRegAllocHint() works.
llvm-svn: 233987
Without this patch, we split the 256-bit vector into halves and produced something like:
movzwl (%rdi), %eax
vmovd %eax, %xmm0
vxorps %xmm1, %xmm1, %xmm1
vblendps $15, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
Now, we eliminate the xor and blend because those zeros are free with the vmovd:
movzwl (%rdi), %eax
vmovd %eax, %xmm0
This should be the final fix needed to resolve PR22685:
https://llvm.org/bugs/show_bug.cgi?id=22685
llvm-svn: 233941
Require the pointee type to be passed explicitly and assert that it is
correct. For now it's possible to pass nullptr here (and I've done so in
a few places in this patch) but eventually that will be disallowed once
all clients have been updated or removed. It'll be a long road to get
all the way there... but if you have the cahnce to update your callers
to pass the type explicitly without depending on a pointer's element
type, that would be a good thing to do soon and a necessary thing to do
eventually.
llvm-svn: 233938
For code like this:
define <8 x i32> @load_v8i32() {
ret <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
}
We produce this AVX code:
_load_v8i32: ## @load_v8i32
movl $7, %eax
vmovd %eax, %xmm0
vxorps %ymm1, %ymm1, %ymm1
vblendps $1, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7]
retq
There are at least 2 bugs in play here:
We're generating a blend when a move scalar does the same job using 2 less instruction bytes (see FIXMEs).
We're not matching an existing pattern that would eliminate the xor and blend entirely. The zero bytes are free with vmovd.
The 2nd fix involves an adjustment of "AddedComplexity" [1] and mostly masks the 1st problem.
[1] AddedComplexity has close to no documentation in the source.
The best we have is this comment: "roughly corresponds to the number of nodes that are covered".
It appears that x86 has bastardized this definition by inflating its values for some other
undocumented reason. For example, we have a pattern with "AddedComplexity = 400" (!).
I searched my way to this page:
https://groups.google.com/forum/#!topic/llvm-dev/5UX-Og9M0xQ
Differential Revision: http://reviews.llvm.org/D8794
llvm-svn: 233931
Summary:
Avoid duplicate code in Mips16FrameLowering and MipsSEFrameLowering by
providing an implementation of the eliminateCallFramePseudoInstr()
function from their base class.
Depends on D8640.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8641
llvm-svn: 233909
Summary:
adjustStackPtr() is implemented from both MipsSEInstrInfo and
Mips16InstrInfo. It makes sense to expose this function from
MipsInstrInfo and avoid explicit casting in some cases.
Depends on D8638.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8640
llvm-svn: 233905
addl has higher throughput and this was needlessly picking a suboptimal
encoding causing PR23098.
I wish there was a way of doing this without further duplicating tbl-
generated patterns, but so far I haven't found one.
llvm-svn: 233832
v8.1a is renamed to architecture, following current entity naming approach.
Excess generic cpu is removed. Intended use: "generic" cpu with "v8.1a" subtarget feature
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8767
llvm-svn: 233811
v8.1a is renamed to architecture, accordingly to approaches in ARM backend.
Excess generic cpu is removed. Intended use: "generic" cpu with "v8.1a" subtarget feature
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8766
llvm-svn: 233810
Under normal circumstances, use of CR bits is disabled when running at -O0, but
it is enabled by default otherwise, and if you have optnone functions, they'll
still generally be generated with crbits turned on (because nothing else turns
them off). FastISel can't handle most things dealing with i1 values when using
CR bits, and checks for that, but was not checking the return type on
functions; we can't fast-isel function calls with i1 return values either when
using CR bits for boolean values.
Fixes PR22664.
llvm-svn: 233775
This lets us catch exceptions in simple cases.
N.B. Things that do not work include (but are not limited to):
- Throwing from within a catch handler.
- Catching an object with a named catch parameter.
- 'CatchHigh' is fictitious, we aren't sure of its purpose.
- We aren't entirely efficient with regards to the number of EH states
that we generate.
- IP-to-State tables are sensitive to the order of emission.
llvm-svn: 233767
Even at -O0, we fall back to SDAG when we hit intrinsics, and if the intrinsic
is a memset/memcpy/etc. we might normally use vector types. At -O0, this is
probably not a good idea (because, if there is a bug in the lowering code,
there would be no good way to turn it off). At -O0, only use scalar preferred
types.
Related to PR22754.
llvm-svn: 233755
extended loads.
Implement the related target lowering hook so that the optimization has a better
estimation of the cost of an extension.
rdar://problem/19267165
llvm-svn: 233753
Change lowerCTPOP to:
- Gracefully handle a known-zero input value
- Simplify computation of significant bit size
Thanks to Jay Foad for the review!
llvm-svn: 233736
I suggested this change in D7898 (http://llvm.org/viewvc/llvm-project?view=revision&revision=231354)
It improves the v4i64 case although not optimally. This AVX codegen:
vmovq {{.*#+}} xmm0 = mem[0],zero
vxorpd %ymm1, %ymm1, %ymm1
vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
Becomes:
vmovsd {{.*#+}} xmm0 = mem[0],zero
Unfortunately, this doesn't completely solve PR22685. There are still at least 2 problems under here:
We're not handling v32i8 / v16i16.
We're not getting the FP / int domains right for instruction selection.
But since this patch alone appears to do no harm, reduces code duplication, and helps v4i64,
I'm submitting this patch ahead of fixing the above.
Differential Revision: http://reviews.llvm.org/D8341
llvm-svn: 233704
So far, we do not yet support any instruction specific to zEC12.
Most of the facilities added with zEC12 are indeed not very useful
to compiler code generation, but there is one exception: the
miscellaneous-extensions facility provides the RISBGN instruction,
which is a variant of RISBG that does not set the condition code.
Add support for this facility, MC support for RISBGN, and CodeGen
support for prefering RISBGN over RISBG on zEC12, unless we can
actually make use of the condition code set by RISBG.
llvm-svn: 233690
We already exploit a number of instructions specific to z196,
but not yet POPCNT. Add support for the population-count
facility, MC support for the POPCNT instruction, CodeGen
support for using POPCNT, and implement the getPopcntSupport
TargetTransformInfo hook.
llvm-svn: 233689
This hooks up the TargetTransformInfo machinery for SystemZ,
and provides an implementation of getIntImmCost.
In addition, the patch adds the isLegalICmpImmediate and
isLegalAddImmediate TargetLowering overrides, and updates
a couple of test cases where we now generate slightly
better code.
llvm-svn: 233688
We used to miss non-Q YMM integer vectors, and, non-Q/D XMM integer
vectors.
While there, change the v4i32 patterns to prefer MOVNTDQ.
llvm-svn: 233668
When we expand the RET_ReallyLR pseudo instruction we also need to transfer the
implicit operands.
The return register is an implicit operand and without it the liveness
calculation generates an incorrect live-out set for the patchpoint.
This fixes rdar://problem/19068476.
llvm-svn: 233635
This fixes the visibility of symbols in certain edge cases involving aliases
with multiple levels of indirection.
Fixes PR19582.
Differential Revision: http://reviews.llvm.org/D8586
llvm-svn: 233595
When a new SM architecture is introduced, it is only supported by the
current PTX version and later. Make sure we are using at least the
minimum PTX version for the target architecture.
This also removes support for PTX ISA < 3.2.
llvm-svn: 233583
Compiling the following function with -O0 would crash, since LLVM would
hit an assertion in getTestUnderMaskCond:
int test(unsigned long x)
{
return x >= 0 && x <= 15;
}
Fixed by detecting the case in the caller of getTestUnderMaskCond.
llvm-svn: 233541
Summary:
The 'R' constraint is actually supposed to be much more complicated than
this and is defined in terms of whether it will cause macro expansion in
the assembler. 'R' is getting less useful due to architecture changes and
ought to be replaced by other constraints. We therefore implement 9-bit
offsets which will work for all subtargets and all instructions.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8440
llvm-svn: 233537
All the ports have been fixed to read the feature bits from the subtarget passed
to the print methods. Also, delete the call to setAvailableFeatures in the
constructor of NVPTX's instprinter as the instprinter wasn't using the feature
bits anywhere.
llvm-svn: 233486
The asm syntax for the 32-bit rotate-and-mask instructions can take a 32-bit
bitmask instead of an (mb, me) pair. This syntax is not specified in the Power
ISA manual, but is accepted by GNU as, and is documented in IBM's Assembler
Language Reference. The GNU Multiple Precision Arithmetic Library (gmp)
contains assembly that uses this syntax.
To implement this, I moved the isRunOfOnes utility function from
PPCISelDAGToDAG.cpp to PPCMCTargetDesc.h.
llvm-svn: 233483
instead of the one passed to the constructor.
Unfortunately, I don't have a test case for this change. In order to test my
change, I will have to run the code after line 90 in printSparcAliasInstr. I
couldn't make that happen because printAliasInstr would always handle the
printing of fcmp instructions that the code after line 90 is supposed to handle.
llvm-svn: 233471
method.
This enables the instprinter to print a different system register name based on
the feature bits of the per-function subtarget.
Differential Revision: http://reviews.llvm.org/D8668
llvm-svn: 233412
per-function subtarget.
Currently, code-gen passes the default or generic subtarget to the constructors
of MCInstPrinter subclasses (see LLVMTargetMachine::addPassesToEmitFile), which
enables some targets (AArch64, ARM, and X86) to change their instprinter's
behavior based on the subtarget feature bits. Since the backend can now use
different subtargets for each function, instprinter has to be changed to use the
per-function subtarget rather than the default subtarget.
This patch takes the first step towards enabling instprinter to change its
behavior based on the per-function subtarget. It adds a bit "PassSubtarget" to
AsmWriter which tells table-gen to pass a reference to MCSubtargetInfo to the
various print methods table-gen auto-generates.
I will follow up with changes to instprinters of AArch64, ARM, and X86.
llvm-svn: 233411
Expose bpf pseudo load instruction via intrinsic. It is used by front-ends that
can encode file descriptors directly into IR instead of relying on relocations.
llvm-svn: 233396
Subtarget features must not be a part of the target machine. So, they are now not being stored in SysRegMapper, but provided each time fromString()/toString() are called
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8655
llvm-svn: 233386
Summary:
The ARM backend can use a loop to implement copying byval parameters before
a call. In non-thumb2 mode it uses a constant pool load to materialize the
trip count. For targets that need movt instead (e.g. Native Client), use
the same code as in thumb2 mode to materialize the trip count.
Reviewers: jfb, t.p.northover
Differential Revision: http://reviews.llvm.org/D8442
llvm-svn: 233324
Third element is to be added soon to "struct AArch64NamedImmMapper::Mapping". So its instances are renamed from ...Pairs to ...Mappings
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8582
llvm-svn: 233300
class AArch64NamedImmMapper is to become dependent of SubTargetFeatures, while class AArch64Operand don't have access to the latter.
So, AArch64NamedImmMapper constructor invocations are refactored away from methods of AArch64Operand.
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8579
llvm-svn: 233297
Summary: This groups all of the MipsAssemblerOptions functionality together, making it more reader-friendly.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8445
llvm-svn: 233271
This patch teaches fast-isel how to select 128-bit vector load instructions.
Added test CodeGen/X86/fast-isel-vecload.ll
Differential Revision: http://reviews.llvm.org/D8605
llvm-svn: 233270
for PPC due to some unfortunate default setting via TargetMachine
creation. I've added a FIXME on how this can be unraveled in the
backend and a test to make sure we successfully legalize 64-bit things
if we say we're 64-bits.
llvm-svn: 233239
This patch adds Hardware Transaction Memory (HTM) support supported by ISA 2.07
(POWER8). The intrinsic support is based on GCC one [1], but currently only the
'PowerPC HTM Low Level Built-in Function' are implemented.
The HTM instructions follows the RC ones and the transaction initiation result
is set on RC0 (with exception of tcheck). Currently approach is to create a
register copy from CR0 to GPR and comapring. Although this is suboptimal, since
the branch could be taken directly by comparing the CR0 value, it generates code
correctly on both test and branch and just return value. A possible future
optimization could be elimitate the MFCR instruction to branch directly.
The HTM usage requires a recently newer kernel with PPC HTM enabled. Tested on
powerpc64 and powerpc64le.
This is send along a clang patch to enabled the builtins and option switch.
[1] https://gcc.gnu.org/onlinedocs/gcc/PowerPC-Hardware-Transactional-Memory-Built-in-Functions.html
Phabricator Review: http://reviews.llvm.org/D8247
llvm-svn: 233204
This patch allows AVX blend instructions to handle insertion into the low
element of a 256-bit vector for the appropriate data types.
For f32, instead of:
vblendps $1, %xmm1, %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[1,2,3]
vblendps $15, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
we get:
vblendps $1, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0],ymm0[1,2,3,4,5,6,7]
For f64, instead of:
vmovsd %xmm1, %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[1]
vblendpd $3, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0,1],ymm0[2,3]
we get:
vblendpd $1, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm1[0],ymm0[1,2,3]
For the hardware-neglected integer data types, I left a TODO comment in the
code and added regression tests for a follow-on patch.
Differential Revision: http://reviews.llvm.org/D8609
llvm-svn: 233199
We can't use TargetFrameLowering::getFrameIndexOffset directly, because
Win64 really wants the offset from the stack pointer at the end of the
prologue. Instead, use X86FrameLowering::getFrameIndexOffsetFromSP(),
which is a pretty close approximiation of that. It fails to handle cases
with interestingly large stack alignments, which is pretty uncommon on
Win64 and is TODO.
llvm-svn: 233137
The changes to InstCombine (& SCEV) do seem a bit silly - it doesn't make
anything obviously better to have the caller access the pointers element
type (the thing I'm trying to remove) than the GEP itself, but it's a
helpful migration step. This will allow me to more obviously lock down
GEP (& Load, etc) API usage, then fix all the code that accesses pointer
element types except the places that need to be removed (most of the
InstCombines) anyway - at which point I'll need to just remove all that
code because it won't be meaningful anymore (there will be no pointer
types, so no bitcasts to combine)
SCEV looks like it'll need some restructuring - we'll have to do a bit
more work for GEP canonicalization, since it'll depend on how it's used
if we can even manage to canonicalize it to a non-ugly GEP. I guess we
can do some fun stuff like voting (do 2 out of 3 load from the GEP with
a certain type that gives a pretty GEP? Does every typed use of the GEP
use either a specific type or a generic type (i8*, etc)?)
llvm-svn: 233131
This code depended on a bug in the FindAssociatedSection function that would
cause it to return the wrong result for certain absolute expressions. Instead,
use EvaluateAsRelocatable.
llvm-svn: 233119
vperm2x128 instructions have the special ability (aka free hardware capability)
to shuffle zero values into a vector.
This patch recognizes that type of shuffle and generates the appropriate
control byte.
https://llvm.org/bugs/show_bug.cgi?id=22984
Differential Revision: http://reviews.llvm.org/D8563
llvm-svn: 233100
Simplify boolean expressions using `true` and `false` with `clang-tidy`
Patch by Richard Thomson.
Reviewed By: rengolin
Differential Revision: http://reviews.llvm.org/D8525
llvm-svn: 233089
This reverts commit r233055.
It still causes buildbot failures (gcc running out of memory on several platforms, and a self-host failure on arm), although less than the previous time.
llvm-svn: 233068
Summary:
Previous behaviour of 'R' and 'm' has been preserved for now. They will be
improved in subsequent commits.
The offset permitted by ZC varies according to the subtarget since it is
intended to match the restrictions of the pref, ll, and sc instructions.
The restrictions on these instructions are:
* For microMIPS: 12-bit signed offset.
* For Mips32r6/Mips64r6: 9-bit signed offset.
* Otherwise: 16-bit signed offset.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8414
llvm-svn: 233063
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first time this was committed (r229831), it caused several buildbot failures.
At least some of the ARM ones were due to gcc/binutils issues, and should now be fixed.
Differential Revision: http://reviews.llvm.org/D8542
llvm-svn: 233055
The pass used to be enabled by default with CodeGenOpt::Less (-O1).
This is too aggressive, considering the pass indiscriminately merges
all globals together.
Currently, performance doesn't always improve, and, on code that uses
few globals (e.g., the odd file- or function- static), more often than
not is degraded by the optimization. Lengthy discussion can be found
on llvmdev (AArch64-focused; ARM has similar problems):
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-February/082800.html
Also, it makes tooling and debuggers less useful when dealing with
globals and data sections.
GlobalMerge needs to better identify those cases that benefit, and this
will be done separately. In the meantime, move the pass to run with
-O3 rather than -O1, on both ARM and AArch64.
llvm-svn: 233024
Simplify boolean expressions with `true` and `false` using `clang-tidy`
Patch by Richard Thomson.
Differential Revision: http://reviews.llvm.org/D8520
llvm-svn: 233020
Simplify boolean expressions with `true` and `false` with `clang-tidy`
Patch by Richard Thomson.
Differential Revision: http://reviews.llvm.org/D8519
llvm-svn: 233002
This enables very common cases to switch to the
smaller encoding.
All of the standard LLVM canonicalizations of comparisons
are the opposite of what we want. Compares with constants
are moved to the RHS, but the first operand can be an inline
immediate, literal constant, or SGPR using the 32-bit VOPC
encoding.
There are additional bad canonicalizations that should
also be fixed, such as canonicalizing ge x, k to gt x, (k + 1)
if this makes k no longer an inline immediate value.
llvm-svn: 232988
This change is incorrect since it converts double rounding into single rounding,
which can produce different results. Instead this optimization will be done by
modifying Clang's codegen to not produce double rounding in the first place.
This reverts commit r232954.
llvm-svn: 232962
Anton tried this 5 years ago but it was reverted due to extra VMOVs
being emitted. This can be easily fixed with a liberal application
of patterns - matching loads/stores and extractelts.
llvm-svn: 232958
Specifically when the conversion is done in two steps, f16 -> f32 -> f64.
For example:
%1 = tail call float @llvm.convert.from.fp16.f32(i16 %0)
%conv = fpext float %1 to double
to:
vcvtb.f64.f16
llvm-svn: 232954
Fixing sign extension in makeLibCall for MIPS64. In MIPS64 architecture all
32 bit arguments (int, unsigned int, float 32 (soft float)) must be sign
extended. This fixes test "MultiSource/Applications/oggenc/".
Patch by Strahinja Petrovic.
Differential Revision: http://reviews.llvm.org/D7791
llvm-svn: 232943
Summary:
But still handle them the same way since I don't know how they differ on
this target.
Clang also has code for 'Ump', 'Utf', 'Usa', and 'Ush' but calls
llvm_unreachable() on this code path so they are not converted to a
constraint id at the moment.
No functional change intended.
Reviewers: t.p.northover
Subscribers: aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D8177
llvm-svn: 232941
strchr("123!", C) != nullptr is a common pattern to check if C is one
of 1, 2, 3 or !. If the largest element of the string is smaller than
the target's register size we can easily create a bitfield and just
do a simple test for set membership.
int foo(char C) { return strchr("123!", C) != nullptr; } now becomes
cmpl $64, %edi ## range check
sbbb %al, %al
movabsq $0xE000200000001, %rcx
btq %rdi, %rcx ## bit test
sbbb %cl, %cl
andb %al, %cl ## and the two conditions
andb $1, %cl
movzbl %cl, %eax ## returning an int
ret
(imho the backend should expand this into a series of branches, but
that's a different story)
The code is currently limited to bit fields that fit in a register, so
usually 64 or 32 bits. Sadly, this misses anything using alpha chars
or {}. This could be fixed by just emitting a i128 bit field, but that
can generate really ugly code so we have to find a better way. To some
degree this is also recreating switch lowering logic, but we can't
simply emit a switch instruction and thus change the CFG within
instcombine.
llvm-svn: 232902
TargetMachine::getSubtargetImpl routines.
This keeps the target independent code free of bare subtarget
calls while the remainder of the backends are migrated, or not
if they don't wish to support per-function subtargets as would
be needed for function multiversioning or LTO of disparate
cpu subarchitecture types, e.g.
clang -msse4.2 -c foo.c -emit-llvm -o foo.bc
clang -c bar.c -emit-llvm -o bar.bc
llvm-link foo.bc bar.bc -o baz.bc
llc baz.bc
and get appropriate code for what the command lines requested.
llvm-svn: 232885
As preparation for removing the getSubtargetImpl() call from
TargetMachine go ahead and flip the switch on caching the function
dependent subtarget and remove the bare getSubtargetImpl call
from the X86 port. As part of this add a few tests that show we
can generate code and assemble on X86 based on features/cpu on
the Function.
llvm-svn: 232879
thumb-ness similar to the rest of the Module level asm printing
infrastructure as debug info finalization happens after the function
may be missing.
llvm-svn: 232875
With this patch, for this one exact case, we'll generate:
blendps %xmm0, %xmm1, $1
instead of:
insertps %xmm0, %xmm1, $0
If there's a memory operand available for load folding and we're
optimizing for size, we'll still generate the insertps.
The detailed performance data motivation for this may be found in D7866;
in summary, blendps has 2-3x throughput vs. insertps on widely used chips.
Differential Revision: http://reviews.llvm.org/D8332
llvm-svn: 232850
The code this patch removes was there to make sure the text sections went
before the dwarf sections. That is necessary because MachO uses offsets
relative to the start of the file, so adding a section can change relaxations.
The dwarf sections were being printed at the start just to produce symbols
pointing at the start of those sections.
The underlying issue was fixed in r231898. The dwarf sections are now printed
when they are about to be used, which is after we printed the text sections.
To make sure we don't regress, the patch makes the MachO streamer assert
if CodeGen puts anything unexpected after the DWARF sections.
llvm-svn: 232842
The main differences are:
* Split in 32 and 64 bit functions.
* First switch on the Modifier so that we have only one non fully covered
switch.
* Map the fixup kind first to a x86_64 (or i386) specific enum, to make
it easy to handle cases like X86::reloc_riprel_4byte_movq_load.
* Switch on IsPCRel last, which reduces code duplication.
Fixes pr22308.
llvm-svn: 232837
LocalStackSlotPass assumes that isFrameOffsetLegal doesn't change its
answer when the base register changes. Unfortunately this isn't true
in thumb1, where SP-based loads allow a larger offset than
non-SP-based loads, and this causes the base register reuse code to
generate instructions that are unencodable, causing an assertion
failure.
Solve this by adding a BaseReg parameter to isFrameOffsetLegal, which
ARMBaseRegisterInfo can then make use of to give the correct answer.
Differential Revision: http://reviews.llvm.org/D8419
llvm-svn: 232825
This is needed for AVX512 masked scatter/gather support.
The R600 change is necessary to remove a hack that was working around the lack of multiple results.
llvm-svn: 232798
This enables us to remove calls to the subtarget from the TargetMachine
and with a small hack for backends that require global subtarget
information for module level code generation, e.g. mips abi flags, as
mentioned in a fixme in the code.
llvm-svn: 232776
Another case of x86-specific shuffle strength reduction:
avoid generating insert*128 instructions with index 0 because
they are slower than their non-lane-changing blend equivalents.
Shuffle lowering already catches most of these cases, but
the zero vector case and some other paths such as in the
modified test in vector-shuffle-256-v32.ll were getting
through.
Differential Revision: http://reviews.llvm.org/D8366
llvm-svn: 232773
Summary:
CUDA 7.0's libdevice uses slightly different IR to call __nvvm_reflect
and that triggers an assertion in nvvm_reflect optimization pass. This
change allows nvvm_reflect pass to deal with both old and new ways to
pass an argument to __nvvm_reflect.
Test Plan: ninja check-all
Reviewers: eliben, echristo
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8399
llvm-svn: 232732
There are two main advantages to doing this
* Targets that only need to handle one of the formats specially don't have
to worry about the others. For example, x86 now only registers a
constructor for the COFF streamer.
* Changes to the arguments passed to one format constructor will not impact
the other formats.
llvm-svn: 232699
as we don't necessarily need to do this yet - though we could move
the base class to the TargetMachine as it isn't subtarget dependent.
This reverts commit r232103.
llvm-svn: 232665
Currently v2i64 vectors shifts (non-equal shift amounts) are scalarized, costing 4 x extract, 2 x x86-shifts and 2 x insert instructions - and it gets even more awkward on 32-bit targets.
This patch separately shifts the vector by both shift amounts and then shuffles the partial results back together, costing 2 x shuffles and 2 x sse-shifts instructions (+ 2 movs on pre-AVX hardware).
Note - this patch only improves the SHL / LSHR logical shifts as only these are supported in SSE hardware.
Differential Revision: http://reviews.llvm.org/D8416
llvm-svn: 232660
Memcpy, and other memory intrinsics, typically tries to use LDM/STM if
the source and target addresses are 4-byte aligned. In CodeGenPrepare
look for calls to memory intrinsics and, if the object is on the
stack, 4-byte align it if it's large enough that we expect that memcpy
would want to use LDM/STM to copy it.
Differential Revision: http://reviews.llvm.org/D7908
llvm-svn: 232627
Currently, there are no itineraries defined for ext and ins instructions.
This patch adds these itineraries and uses them in the instruction definitions.
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D7209
llvm-svn: 232613
COFF COMDATs (for selection kinds other than 'select any') require at
least one non-section symbol in the symbol table.
Satisfy this by morally enhancing the linkage from private to internal.
Differential Revision: http://reviews.llvm.org/D8394
llvm-svn: 232570
Summary: Building FP16 constant vectors caused the FP16 data to be bitcast to i64. This patch creates a BITCAST node with the correct value, and adds a test to verify correct handling.
Reviewers: mcrosier
Reviewed By: mcrosier
Subscribers: mcrosier, jmolloy, ab, srhines, llvm-commits, rengolin, aemerson
Differential Revision: http://reviews.llvm.org/D8369
llvm-svn: 232562
Summary:
COFF COMDATs (for selection kinds other than 'select any') require at
least one non-section symbol in the symbol table.
Satisfy this by morally enhancing the linkage from private to internal.
Reviewers: rafael
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8374
llvm-svn: 232539
Before this patch code wanting to create temporary labels for a given entity
(function, cu, exception range, etc) had to keep its own counter to have stable
symbol names.
createTempSymbol would still add a suffix to make sure a new symbol was always
returned, but it kept a single counter. Because of that, if we were to use
just createTempSymbol("cu_begin"), the label could change from cu_begin42 to
cu_begin43 because some other code started using temporary labels.
Simplify this by just keeping one counter per prefix and removing the various
specialized counters.
llvm-svn: 232535
We have observed that noreg was being generated due to a bug in FastIsel and was not being detected during emission. It happens that in the Asm emission there is an assertion that detects this in getRegisterName() from the tbl-generated file PPCGenAsmWriter.inc. However, when emitting an Obj file, invalid registers can be emitted given that no check are made in getBinaryCodeFromInstr() from PPCGenMCCodeEmitter.inc. In order to cover all cases this adds an assertion for reg operands in LowerPPCMachineInstrToMCInst.
llvm-svn: 232525
The input offset to needsFrameBaseReg is a negative value below the top of the
stack frame, but when converting to a positive offset from the bottom of the
stack frame this value was negated, causing the final offset to be too large
by twice the input offset's magnitude. Fix that by not negating the offset.
Patch by John Brawn
Differential Revision: http://reviews.llvm.org/D8316
llvm-svn: 232513
Summary:
But still handle them the same way since I don't know how they differ on
this target.
No functional change intended.
Reviewers: uweigand
Reviewed By: uweigand
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8251
llvm-svn: 232495
The VSX stores are sometimes generated with a undefined index register, causing %noreg to be used and R0 to be emitted later on. The semantics of the VSX store (e.g. stdsdx) requires R0 to be used as base if we want zero to be used in the computation of the effective address instead of the content of R0. This patch checks if no index register was generated and forces R0 to be used as base address.
llvm-svn: 232486
Summary:
But still handle them the same way since I don't know how they differ on
this target.
No functional change intended.
Reviewers: kparzysz, adasgupt
Reviewed By: kparzysz, adasgupt
Subscribers: colinl, llvm-commits
Differential Revision: http://reviews.llvm.org/D8204
Like for the PowerPC target, I've had to add 'i' to the constraint mappings in
order to pass 2007-12-17-InvokeAsm.ll. It's not clear why 'i' has historically
been treated as a memory constraint.
llvm-svn: 232480
Summary:
This adds a MipsInstAlias which expands to XORi $reg,$reg,imm. For example, "xor $6, 0x3A" should be expanded to "xori $6, $6, 58".
This should work for all MIPS ISAs.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8284
llvm-svn: 232473
It's not completely clear why 'i' has historically been treated as a memory
constraint. According to the documentation, it represents a constant immediate.
llvm-svn: 232470
ARMv6K is another layer between ARMV6 and ARMV6T2. This is the LLVM
side of the changes.
ARMV6 family LLVM implementation.
+-------------------------------------+
| ARMV6 |
+----------------+--------------------+
| ARMV6M (thumb) | ARMV6K (arm,thumb) | <- From ARMV6K and ARMV6M processors
+----------------+--------------------+ have support for hint instructions
| ARMV6T2 (arm,thumb,thumb2) | (SEV/WFE/WFI/NOP/YIELD). They can
+-------------------------------------+ be either real or default to NOP.
| ARMV7 (arm,thumb,thumb2) | The two processors also use
+-------------------------------------+ different encoding for them.
Patch by Vinicius Tinti.
llvm-svn: 232468
Summary:
But still handle them the same way since I don't know how they differ on
this target.
Of these, 'es', and 'Q' do not have backend tests but are accepted by
clang.
No functional change intended. Depends on D8173.
Reviewers: hfinkel
Reviewed By: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8213
llvm-svn: 232466
Optimize concat_vectors of truncated vectors, where the intermediate
type is illegal, to avoid said illegality, e.g.,
(v4i16 (concat_vectors (v2i16 (truncate (v2i64))),
(v2i16 (truncate (v2i64)))))
->
(v4i16 (truncate (v4i32 (concat_vectors (v2i32 (truncate (v2i64))),
(v2i32 (truncate (v2i64)))))))
This isn't really target-specific, and, as such, would best go in the
DAGCombiner. However, ISD::TRUNCATE legality isn't keyed on both input
and result type, so we might generate worse code when we don't know
better. On AArch64 we know it's fine for v2i64->v4i16 and v4i32->v8i8.
rdar://20022387
llvm-svn: 232459
Fix justify error for small structures bigger than 32 bits in fixed
arguments for MIPS64 big endian. There was a problem when small structures
are passed as fixed arguments. The structures that are bigger than 32 bits
but smaller than 64 bits were not left justified properly on MIPS64 big
endian. This is fixed by shifting the value to make it left justified when
appropriate.
Patch by Aleksandar Beserminji.
Differential Revision: http://reviews.llvm.org/D8174
llvm-svn: 232382
Summary:
But still handle them the same way since I don't know how they differ on
this target.
No functional change intended.
Reviewers: kparzysz, adasgupt
Reviewed By: kparzysz, adasgupt
Subscribers: colinl, llvm-commits
Differential Revision: http://reviews.llvm.org/D8204
llvm-svn: 232374
Summary:
This is instead of doing this in target independent code and is the last
non-functional change before targets begin to distinguish between
different memory constraints when selecting code for the ISD::INLINEASM
node.
Next, each target will individually move away from the idea that all
memory constraints behave like 'm'.
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8173
llvm-svn: 232373
This fixes pr22854.
The core issue on the bug is that there are multiple instructions that
print the same in assembly. In fact, there doesn't seem to be any
syntax for specifying that a constant that fits in 8 bits should use a 32 bit
immediate.
The attached patch changes fast isel to consider i16immSExt8,
i32immSExt8, and i64immSExt8. They were disabled because fastisel didn’t know
to call the predicate back in the day.
llvm-svn: 232223
Fixes random crashes in for-loop piglit.
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Matt Arsenault <Matthew.Arsenault@amd.com>
llvm-svn: 232181
This patch fixes a bug in the shuffle lowering logic implemented by function
'lowerV2X128VectorShuffle'.
The are few cases where function 'lowerV2X128VectorShuffle' wrongly expands a
shuffle of two v4X64 vectors into a CONCAT_VECTORS of two EXTRACT_SUBVECTOR
nodes. The problematic expansion only occurs when the shuffle mask M has an
'undef' element at position 2, and M is equivalent to mask <0,1,4,5>.
In that case, the algorithm propagates the wrong vector to one of the two
new EXTRACT_SUBVECTOR nodes.
Example:
;;
define <4 x double> @test(<4 x double> %A, <4 x double> %B) {
entry:
%0 = shufflevector <4 x double> %A, <4 x double> %B, <4 x i32><i32 undef, i32 1, i32 undef, i32 5>
ret <4 x double> %0
}
;;
Before this patch, llc (-mattr=+avx) generated:
vinsertf128 $1, %xmm0, %ymm0, %ymm0
With this patch, llc correctly generates:
vinsertf128 $1, %xmm1, %ymm0, %ymm0
Added test lower-vec-shuffle-bug.ll
Differential Revision: http://reviews.llvm.org/D8259
llvm-svn: 232179
The operand flag word for ISD::INLINEASM nodes now contains a 15-bit
memory constraint ID when the operand kind is Kind_Mem. This constraint
ID is a numeric equivalent to the constraint code string and is converted
with a target specific hook in TargetLowering.
This patch maps all memory constraints to InlineAsm::Constraint_m so there
is no functional change at this point. It just proves that using these
previously unused bits in the encoding of the flag word doesn't break
anything.
The next patch will make each target preserve the current mapping of
everything to Constraint_m for itself while changing the target independent
implementation of the hook to return Constraint_Unknown appropriately. Each
target will then be adapted in separate patches to use appropriate
Constraint_* values.
PR22883 was caused the matching operands copying the whole of the operand flags
for the matched operand. This included the constraint id which needed to be
replaced with the operand number. This has been fixed with a conversion
function. Following on from this, matching operands also used the operand
number as the constraint id. This has been fixed by looking up the matched
operand and taking it from there.
llvm-svn: 232165
Summary: Make emitMipsAbiFlags a direct member of MipsTargetELFStreamer, as that's the only place where it's used, and remove the empty implementations from MipsTargetStreamer and MipsTargetAsmStreamer.
Reviewers: dsanders, rafael
Reviewed By: rafael
Subscribers: rafael, llvm-commits
Differential Revision: http://reviews.llvm.org/D8199
llvm-svn: 232161
This should complete the job started in r231794 and continued in r232045:
We want to replace as much custom x86 shuffling via intrinsics
as possible because pushing the code down the generic shuffle
optimization path allows for better codegen and less complexity
in LLVM.
AVX2 introduced proper integer variants of the hacked integer insert/extract
C intrinsics that were created for this same functionality with AVX1.
This should complete the removal of insert/extract128 intrinsics.
The Clang precursor patch for this change was checked in at r232109.
llvm-svn: 232120
Instead print them as part of the $dst operand. The AsmMatcher
requires the 32-bit and 64-bit encodings have the same mnemonic in
order to parse them correctly.
llvm-svn: 232105
This (r232027) has caused PR22883; so it seems those bits might be used by
something else after all. Reverting until we can figure out what else to do.
Original commit message:
The operand flag word for ISD::INLINEASM nodes now contains a 15-bit
memory constraint ID when the operand kind is Kind_Mem. This constraint
ID is a numeric equivalent to the constraint code string and is converted
with a target specific hook in TargetLowering.
This patch maps all memory constraints to InlineAsm::Constraint_m so there
is no functional change at this point. It just proves that using these
previously unused bits in the encoding of the flag word doesn't break anything.
The next patch will make each target preserve the current mapping of
everything to Constraint_m for itself while changing the target independent
implementation of the hook to return Constraint_Unknown appropriately. Each
target will then be adapted in separate patches to use appropriate Constraint_*
values.
llvm-svn: 232093
The permps and permd instructions have their operands swapped compared to the
intrinsic definition. Therefore, they do not fall into the INTR_TYPE_2OP
category.
I did not create a new category for those two, as they are the only one AFAICT
in that case.
<rdar://problem/20108262>
llvm-svn: 232085
Part of the folding logic implemented by function 'PerformISDSETCCCombine'
only worked under the assumption that the condition code in input could have
been either SETNE or SETEQ.
Unfortunately that assumption was incorrect, and in some cases the algorithm
ended up incorrectly folding SETCC nodes.
The incorrect folding only affected SETCC dag nodes where:
- one of the operands was a build_vector of all zeroes;
- the other operand was a SIGN_EXTEND from a vector of MVT:i1 elements;
- the condition code was neither SETNE nor SETEQ.
Example:
(setcc (v4i32 (sign_extend v4i1:%A)), (v4i32 VectorOfAllZeroes), setge)
Before this patch, the entire dag node sequence from the example was
incorrectly folded to node %A.
With this patch, the dag node sequence is folded to a
(xor %A, (v4i1 VectorOfAllOnes)).
Added test setcc-combine.ll.
Thanks to Greg Bedwell for spotting this issue.
llvm-svn: 232046
Summary:
The operand flag word for ISD::INLINEASM nodes now contains a 15-bit
memory constraint ID when the operand kind is Kind_Mem. This constraint
ID is a numeric equivalent to the constraint code string and is converted
with a target specific hook in TargetLowering.
This patch maps all memory constraints to InlineAsm::Constraint_m so there
is no functional change at this point. It just proves that using these
previously unused bits in the encoding of the flag word doesn't break anything.
The next patch will make each target preserve the current mapping of
everything to Constraint_m for itself while changing the target independent
implementation of the hook to return Constraint_Unknown appropriately. Each
target will then be adapted in separate patches to use appropriate Constraint_*
values.
Reviewers: hfinkel
Reviewed By: hfinkel
Subscribers: hfinkel, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8171
llvm-svn: 232027
Summary:
I don't know why every singled backend had to redeclare its own DataLayout.
There was a virtual getDataLayout() on the common base TargetMachine, the
default implementation returned nullptr. It was not clear from this that
we could assume at call site that a DataLayout will be available with
each Target.
Now getDataLayout() is no longer virtual and return a pointer to the
DataLayout member of the common base TargetMachine. I plan to turn it into
a reference in a future patch.
The only backend that didn't have a DataLayout previsouly was the CPPBackend.
It now initializes the default DataLayout. This commit is NFC for all the
other backends.
Test Plan: clang+llvm ninja check-all
Reviewers: echristo
Subscribers: jfb, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8243
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 231987
The PowerPC backend had a number of loads that were marked as canFoldAsLoad
(and I'm partially at fault here for copying around the relevant line of
TableGen definitions without really looking at what it meant). This is not
right; PPC (non-memory) instructions don't support direct memory operands, and
so there is nothing a 'foldable' instruction could be folded into.
Noticed by inspection, no test case.
The one thing we might lose by doing this is ability to fold some loads into
stackmap/patchpoint pseudo-instructions. However, this was untested, and would
not obviously have worked for extending loads, and I'd rather re-add support
for that once it can be tested.
llvm-svn: 231982
that control, individually, all of the disparate things it was
controlling.
At the same time move a FIXME in the Hexagon port to a new
subtarget function that will enable a user of the machine
scheduler to avoid using the source scheduler for pre-RA-scheduling.
The FIXME would have this removed, but involves either testcase
changes or adding -pre-RA-sched=source to a few testcases.
llvm-svn: 231980
time. The target independent code was passing in one all the
time and targets weren't checking validity before using. Update
a few calls to pass in a MachineFunction where necessary.
llvm-svn: 231970
The main issue being fixed here is that APCS targets handling a "byval align N"
parameter with N > 4 were miscounting what objects were where on the stack,
leading to FrameLowering setting the frame pointer incorrectly and clobbering
the stack.
But byval handling had grown over many years, and had multiple layers of cruft
trying to compensate for each other and calculate padding correctly. This only
really needs to be done once, in the HandleByVal function. Elsewhere should
just do what it's told by that call.
I also stripped out unnecessary APCS/AAPCS distinctions (now that Clang emits
byvals with the correct C ABI alignment), which simplified HandleByVal.
rdar://20095672
llvm-svn: 231959
This is a follow-up to r231182. This adds the "vbroadcasti128" instruction
back, but without the intrinsic mapping. Also add a test to check the
instriction encoding.
This is related to rdar://problem/18742778.
llvm-svn: 231945
Summary:
The generic ELF TargetObjectFile defaults to .ctors, but Linux's
defaults to .init_array by calling InitializeELF with the value of
UseInitArray from TargetMachine. Make NaCl's behavior match.
Reviewers: jvoung
Differential Revision: http://reviews.llvm.org/D8240
llvm-svn: 231934
This lets us pass the symbol to the constructor and avoid the mutable field.
This also opens the way for outputting the symbol only when needed, instead
of outputting them at the start of the file.
llvm-svn: 231859
This adds new node types for each intrinsic.
For instance, for addv, we have AArch64ISD::UADDV, such that:
(v4i32 (uaddv ...))
is the same as
(v4i32 (scalar_to_vector (i32 (int_aarch64_neon_uaddv ...))))
that is,
(v4i32 (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
(i32 (int_aarch64_neon_uaddv ...)), ssub)
In a combine, we transform all such across-vector-lanes intrinsics to:
(i32 (extract_vector_elt (uaddv ...), 0))
This has one big advantage: by making the extract_element explicit, we
enable the existing patterns for lane-aware instructions to fire.
This lets us avoid needlessly going through the GPRs. Consider:
uint32x4_t test_mul(uint32x4_t a, uint32x4_t b) {
return vmulq_n_u32(a, vaddvq_u32(b));
}
We now generate:
addv.4s s1, v1
mul.4s v0, v0, v1[0]
instead of the previous:
addv.4s s1, v1
fmov w8, s1
dup.4s v1, w8
mul.4s v0, v1, v0
rdar://20044838
llvm-svn: 231840
Most are redundant, and they never seem to fire.
The V128 integer patterns already exist in the INS multiclass.
The duplicates only fire when the vector index type isn't i64,
because they accept "imm" instead of an explicit "i64", as the
instruction definition patterns do.
TLI::getVectorIdxTy is i64 on AArch64, so this should never happen.
Also, one of them had a typo: for i64, INSvi32lane was used.
I noticed because I mistakenly used an explicit i32 as the idx type,
and got ins.s for an i64 vector_insert.
The V64 patterns also don't seem to ever fire, as V64 vector
extract/insert are legalized to V128.
The equivalent float patterns are unique and useful, so keep them.
No functional change intended; none exhibited on the LIT and LNT tests.
llvm-svn: 231838
If anyone is using this for some strange reason,
LLVMInitializeNVPTXAsmPrinter does exactly the same thing and is what
other LLVM tools are calling.
llvm-svn: 231810
When referring to a symbol in a dwarf section on ELF we should use
.long foo
instead of
.long foo - .debug_something
because ELF is unaware of the content of the sections and therefore needs
relocations. This has nothing to do with optimizing a -0.
llvm-svn: 231751
They mark the start of a compile unit, so name them .Lcu_*. Using
Section->getLabelBeginName() makes it looks like they mark the start of the
section.
While at it, switch to createTempSymbol to avoid collisions with labels
created in inline assembly. Not sure if a "don't crash" test is worth it.
With this getLabelBeginName is dead, delete it.
llvm-svn: 231750
Summary:
Now that the DataLayout is a mandatory part of the module, let's start
cleaning the codebase. This patch is a first attempt at doing that.
This patch is not exactly NFC as for instance some places were passing
a nullptr instead of the DataLayout, possibly just because there was a
default value on the DataLayout argument to many functions in the API.
Even though it is not purely NFC, there is no change in the
validation.
I turned as many pointer to DataLayout to references, this helped
figuring out all the places where a nullptr could come up.
I had initially a local version of this patch broken into over 30
independant, commits but some later commit were cleaning the API and
touching part of the code modified in the previous commits, so it
seemed cleaner without the intermediate state.
Test Plan:
Reviewers: echristo
Subscribers: llvm-commits
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 231740
clang-cl would warn that this value is not representable in 'int':
enum { FeatureX = 1ULL << 31 };
All MS enums are 'ints' unless otherwise specified, so we have to use an
explicit type. The AMDGPU target just hit 32 features, triggering this
warning.
Now that we have C++11 strong enum types, we can also eliminate the
'const uint64_t' codepath from tablegen and just use 'enum : uint64_t'.
llvm-svn: 231697
Summary:
Code is mostly copied from AArch64 port and modified where needed for Mips.
This handles the "non" legal cases of logical ops. Legal cases are handled by tablegen patterns.
Test Plan:
Make check test logopm.ll
All of test-suite passes at O0/O2 and mips32 r1/r2 with this new change.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: echristo, llvm-commits, aemerson, rfuhler
Differential Revision: http://reviews.llvm.org/D6599
llvm-svn: 231665
For inner one of nested loops, it is more likely to be a hot loop,
and the runtime check can be promoted out from patch 0001, so the
overhead is less, we can try a doubled threshold to unroll more loops.
llvm-svn: 231632
There were cases where the backend computed a wrong permute mask for a VPERM2X128 node.
Example:
\code
define <8 x float> @foo(<8 x float> %a, <8 x float> %b) {
%shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 undef, i32 undef, i32 6, i32 7, i32 undef, i32 undef, i32 6, i32 7>
ret <8 x float> %shuffle
}
\code end
Before this patch, llc (with -mattr=+avx) emitted the following vperm2f128:
vperm2f128 $0, %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[0,1,0,1]
With this patch, llc emits a vperm2f128 with a correct permute mask:
vperm2f128 $17, %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[2,3,2,3]
Differential Revision: http://reviews.llvm.org/D8119
llvm-svn: 231601
We have an increasing number of cases where we are creating commuted shuffle masks - all implementing nearly the same code.
This patch adds a static helper function - ShuffleVectorSDNode::commuteMask() and replaces a number of cases to use it.
Differential Revision: http://reviews.llvm.org/D8139
llvm-svn: 231581
In theory this allows the compiler to skip materializing the array on
the stack. In practice clang often fails to do that, but that's a
different story. NFC.
llvm-svn: 231571
to disable lane switching if we don't actually have the instruction
set we want to switch to. Models the earlier check above the
conditional for the pass.
The testcase is one that triggered with the assert that's added
as part of the fix, use it to avoid adding a new testcase as it
highlights the same problem.
llvm-svn: 231539
Teach the load store optimizer how to sign extend a result of a load pair when
it helps creating more pairs.
The rational is that loads are more expensive than sign extensions, so if we
gather some in one instruction this is better!
<rdar://problem/20072968>
llvm-svn: 231527
Add MachO 32-bit (i.e. arm and x86) support for replacing global GOT equivalent
symbol accesses. Unlike 64-bit targets, there's no GOTPCREL relocation, and
access through a non_lazy_symbol_pointers section is used instead.
-- before
_extgotequiv:
.long _extfoo
_delta:
.long _extgotequiv-_delta
-- after
_delta:
.long L_extfoo$non_lazy_ptr-_delta
.section __IMPORT,__pointers,non_lazy_symbol_pointers
L_extfoo$non_lazy_ptr:
.indirect_symbol _extfoo
.long 0
llvm-svn: 231475
Follow up r230264 and add ARM64 support for replacing global GOT
equivalent symbol accesses by references to the GOT entry for the final
symbol instead, example:
-- before
.globl _foo
_foo:
.long 42
.globl _gotequivalent
_gotequivalent:
.quad _foo
.globl _delta
_delta:
.long _gotequivalent-_delta
-- after
.globl _foo
_foo:
.long 42
.globl _delta
Ltmp3:
.long _foo@GOT-Ltmp3
llvm-svn: 231474
Summary:
None of the .set directives can be used before the .module directives. The .set mips0/pop/push were not triggering this constraint.
Also added testing for all the other implemented directives which are supposed to trigger this constraint.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7140
llvm-svn: 231465
We supported forming IMGREL relocations from ConstantExprs involving
__ImageBase if the minuend was a GlobalVariable. Extend this
functionality to all GlobalObjects.
llvm-svn: 231456
This patch reduces code size for all AVX targets and increases speed for some chips.
SSE 4.1 introduced the useless (see code comments) 2-register form of BLENDV and
only in the packed float/double flavors.
AVX subsequently made the instruction useful by adding a 4-register operand form.
So we just need to paper over the lack of scalar forms of this instruction, complicate
the code to choose float or double forms, and use blendv on scalars since all FP is in
xmm registers anyway.
This gives us an approximately 50% speed up for a blendv microbenchmark sequence
on SandyBridge and Haswell:
blendv : 29.73 cycles/iter
logic : 43.15 cycles/iter
No new test cases with this patch because:
1. fast-isel-select-sse.ll tests the positive side for regular X86 lowering and fast-isel
2. sse-minmax.ll and fp-select-cmp-and.ll confirm that we're not firing for scalar selects without AVX
3. fp-select-cmp-and.ll and logical-load-fold.ll confirm that we're not firing for scalar selects with constants.
http://llvm.org/bugs/show_bug.cgi?id=22483
Differential Revision: http://reviews.llvm.org/D8063
llvm-svn: 231408
This commit enables forming vector extloads for ARM.
It only does so for legal types, and when we can't fold the extension
in a wide/long form of the user instruction.
Enabling it for larger types isn't as good an idea on ARM as it is on
X86, because:
- we pretend that extloads are legal, but end up generating vld+vmov
- we have instructions like vld {dN, dM}, which can't be generated
when we "manually expand" extloads to vld+vmov.
For legal types, the combine doesn't fire that often: in the
integration tests only in a big endian testcase, where it removes a
pointless AND.
Related to rdar://19723053
Differential Revision: http://reviews.llvm.org/D7423
llvm-svn: 231396
Added lowering for ISD::CONCAT_VECTORS and ISD::INSERT_SUBVECTOR for i1 vectors,
it is needed to pass all masked_memop.ll tests for SKX.
llvm-svn: 231371
Summary:
DataLayout keeps the string used for its creation.
As a side effect it is no longer needed in the Module.
This is "almost" NFC, the string is no longer
canonicalized, you can't rely on two "equals" DataLayout
having the same string returned by getStringRepresentation().
Get rid of DataLayoutPass: the DataLayout is in the Module
The DataLayout is "per-module", let's enforce this by not
duplicating it more than necessary.
One more step toward non-optionality of the DataLayout in the
module.
Make DataLayout Non-Optional in the Module
Module->getDataLayout() will never returns nullptr anymore.
Reviewers: echristo
Subscribers: resistor, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D7992
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 231270
Summary:
In PNaCl, most atomic instructions have their own @llvm.nacl.atomic.* function, each one, with a few exceptions, represents a consistent behaviour across all NaCl-supported targets. Unfortunately, the atomic RMW operations nand, [u]min, and [u]max aren't directly represented by any such @llvm.nacl.atomic.* function. This patch refines shouldExpandAtomicRMWInIR in TargetLowering so that a future `Le32TargetLowering` class can selectively inform the caller how the target desires the atomic RMW instruction to be expanded (ie via load-linked/store-conditional for ARM/AArch64, via cmpxchg for X86/others?, or not at all for Mips) if at all.
This does not represent a behavioural change and as such no tests were added.
Patch by: Richard Diamond.
Reviewers: jfb
Reviewed By: jfb
Subscribers: jfb, aemerson, t.p.northover, llvm-commits
Differential Revision: http://reviews.llvm.org/D7713
llvm-svn: 231250
This "itinerary class map" in PPCSchedule.td is incomplete and
redundant with the actual code. As it provides no value, we've
decided to remove it.
No functional change.
llvm-svn: 231246
The target-independent selection algorithm in FastISel already knows how
to select a SINT_TO_FP if the target is SSE but not AVX.
On targets that have SSE but not AVX, the tablegen'd 'fastEmit' functions
for ISD::SINT_TO_FP know how to select instruction X86::CVTSI2SSrr
(for an i32 to f32 conversion) and X86::CVTSI2SDrr (for an i32 to f64
conversion).
This patch simplifies the logic in method X86SelectSIToFP knowing that
the code would not be reachable if the subtarget doesn't have AVX.
No functional change intended.
llvm-svn: 231243
Summary:
Use more reasonable names for these pseudo-instructions.
As there's only one definition tied to any one of these classes, I named them with abbreviated versions of their respective class' name.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7831
llvm-svn: 231240
Summary:
Move the "Filler" parameter to the end of the parameter list as it is,
conceptually, the only output parameter of that function.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7726
llvm-svn: 231239
This commit fixes a bug introduced in r230956 where we were creating
CMovFP_{T,F} nodes with multiple return value types (one for each operand).
With this change the return value type of the new node is the same as the
value type of the True/False operands of the original node.
llvm-svn: 231237
As is described at http://llvm.org/bugs/show_bug.cgi?id=22408, the GNU linkers
ld.bfd and ld.gold currently only support a subset of the whole range of AArch64
ELF TLS relocations. Furthermore, they assume that some of the code sequences to
access thread-local variables are produced in a very specific sequence.
When the sequence is not as the linker expects, it can silently mis-relaxe/mis-optimize
the instructions.
Even if that wouldn't be the case, it's good to produce the exact sequence,
as that ensures that linkers can perform optimizing relaxations.
This patch:
* implements support for 16MiB TLS area size instead of 4GiB TLS area size. Ideally clang
would grow an -mtls-size option to allow support for both, but that's not part of this patch.
* by default doesn't produce local dynamic access patterns, as even modern ld.bfd and ld.gold
linkers do not support the associated relocations. An option (-aarch64-elf-ldtls-generation)
is added to enable generation of local dynamic code sequence, but is off by default.
* makes sure that the exact expected code sequence for local dynamic and general dynamic
accesses is produced, by making use of a new pseudo instruction. The patch also removes
two (AArch64ISD::TLSDESC_BLR, AArch64ISD::TLSDESC_CALL) pre-existing AArch64-specific pseudo
SDNode instructions that are superseded by the new one (TLSDESC_CALLSEQ).
llvm-svn: 231227
The intrinsic is no longer generated by the front-end. Remove the intrinsic and
auto-upgrade it to a vector shuffle.
Reviewed by Nadav
This is related to rdar://problem/18742778.
llvm-svn: 231182
Accidentally committed a few more of these cleanup changes than
intended. Still breaking these out & tidying them up.
This reverts commit r231135.
llvm-svn: 231136
There doesn't seem to be any need to assert that iterator assignment is
between iterators over the same node - if you want to reuse an iterator
variable to iterate another node, that's perfectly acceptable. Just
don't mix comparisons between iterators into disjoint sequences, as
usual.
llvm-svn: 231135
Previously we had only Linux using DTPOFF for these; all X86 ELF
targets should. Fixes a side issue mentioned in PR21077.
Differential Revision: http://reviews.llvm.org/D8011
llvm-svn: 231130
This lets us avoid a few copies that are otherwise hard to get rid of.
The way this is done is, the custom-inserter looks at the following
instruction for another CMOV, and replaces both at the same time.
A previous version used a new CMOV2 opcode, but the custom inserter
is expected to be able to return a different basic block anyway, which
means it's OK - though far from ideal - to alter that block's contents.
Explicitly document that, in case it ever makes a difference.
Alternatives welcome!
Follow-up to r231045.
rdar://19767934
Closes http://reviews.llvm.org/D8019
llvm-svn: 231046
Fold and/or of setcc's to double CMOV:
(CMOV F, T, ((cc1 | cc2) != 0)) -> (CMOV (CMOV F, T, cc1), T, cc2)
(CMOV F, T, ((cc1 & cc2) != 0)) -> (CMOV (CMOV T, F, !cc1), F, !cc2)
When we can't use the CMOV instruction, it might increase branch
mispredicts. When we can, or when there is no mispredict, this
improves throughput and reduces register pressure.
These can't be catched by generic combines, because the pattern can
appear when legalizing some instructions (such as fcmp une).
rdar://19767934
http://reviews.llvm.org/D7634
llvm-svn: 231045
Summary:
When the RHS of a conditional move node is zero, we can utilize the $zero
register by inverting the conditional move instruction and by swapping the
order of its True/False operands.
Reviewers: dsanders
Differential Revision: http://reviews.llvm.org/D7945
llvm-svn: 230956
With initializer lists there is a really neat idiomatic way to write
this, 'ArrayRef.equals({1, 2, 3, 4, 5})'. Remove the equal method which
always had a hard limit on the number of arguments. I considered
rewriting it with variadic templates but that's not really a good fit
for a function with homogeneous arguments.
'ArrayRef == {1, 2, 3, 4, 5}' would've been even more awesome, but C++11
doesn't allow init lists with binary operators.
llvm-svn: 230907
All of the cases were just appending from random access iterators to a
vector. Using insert/append can grow the vector to the perfect size
directly and moves the growing out of the loop. No intended functionalty
change.
llvm-svn: 230845
Straightforward patch to emit an alignment directive when emitting a
TOC entry. The test case was generated from the test in PR22711 that
demonstrated a misaligned .toc section. The object code is run
through llvm-readobj to verify that the correct alignment has been
applied to the .toc section.
Thanks to Ulrich Weigand for running down where the fix was needed.
llvm-svn: 230801
Summary:
Until now, we did this (among other things) based on whether or not the
target was Windows. This is clearly wrong, not just for Win64 ABI functions
on non-Windows, but for System V ABI functions on Windows, too. In this
change, we make this decision based on the ABI the calling convention
specifies instead.
Reviewers: rnk
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7953
llvm-svn: 230793
When using Altivec, we can use vector loads and stores for aligned memcpy and
friends. Starting with the P7 and VXS, we have reasonable unaligned vector
stores. Starting with the P8, we have fast unaligned loads too.
For QPX, we use vector loads are stores, but only for aligned memory accesses.
llvm-svn: 230788
vectors. This lets us fix the rest of the v16 lowering problems when
pshufb is clearly better.
We might still be able to improve some of the lowerings by enabling the
other combine-based rewriting to fire for non-128-bit vectors, but this
at least should remove any regressions from using the fancy v16i16
lowering strategy.
llvm-svn: 230753
repeated 128-bit lane shuffles of wider vector types and use it to lower
256-bit v16i16 vector shuffles where applicable.
This should let us perfectly lowering the pattern of pshuflw and pshufhw
even for AVX2 256-bit patterns.
I've not added AVX-512 support, but it should be trivial for someone
working on that to wire up.
Note that currently this generates bad, long shuffle chains because we
don't combine 256-bit target shuffles. The subsequent patches will fix
that.
llvm-svn: 230751
Summary:
We identify the cases where the operand to an ADDE node is a constant
zero. In such cases, we can avoid generating an extra ADDu instruction
disguised as an identity move alias (ie. addu $r, $r, 0 --> move $r, $r).
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7906
llvm-svn: 230742
Summary:
This change causes us to actually save non-volatile registers in a Win64
ABI function that calls a System V ABI function, and vice-versa.
Reviewers: rnk
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7919
llvm-svn: 230714
uses of TM->getSubtargetImpl and propagate to all calls.
This could be a debugging regression in places where we had a
TargetMachine and/or MachineFunction but don't have it as part
of the MachineInstr. Fixing this would require passing a
MachineFunction/Function down through the print operator, but
none of the existing uses in tree seem to do this.
llvm-svn: 230710
a lookup, pass that in rather than use a naked call to getSubtargetImpl.
This involved passing down and around either a TargetMachine or
TargetRegisterInfo. Update all callers/definitions around the targets
and SelectionDAG.
llvm-svn: 230699
blend as legal.
We made the same mistake in two different places. Whenever we are custom
lowering a v32i8 blend we need to check whether we are custom lowering
it only for constant conditions that can be shuffled, or whether we
actually have AVX2 and full dynamic blending support on bytes. Both are
fixed, with comments added to make it clear what is going on and a new
test case.
llvm-svn: 230695
dynamic blends.
This makes it much more clear what is going on. The case we're handling
is that of dynamic conditions, and we're bailing when the nature of the
vector types and subtarget preclude lowering the dynamic condition
vselect as an actual blend.
No functionality changed here, but this will make a subsequent bug-fix
to this code much more clear.
llvm-svn: 230690
There was a problem when passing structures as variable arguments.
The structures smaller than 64 bit were not left justified on MIPS64
big endian. This is now fixed by shifting the value to make it left-
justified when appropriate.
This fixes the bug http://llvm.org/bugs/show_bug.cgi?id=21608
Patch by Aleksandar Beserminji.
Differential Revision: http://reviews.llvm.org/D7881
llvm-svn: 230657
In case of "krait" CPU, asm printer doesn't emit any ".cpu" so the
features bits are not computed. This patch lets the asm printer
emit ".cpu cortex-a9" directive for krait and the hwdiv feature is
enabled through ".arch_extension". In short, krait is treated
as "cortex-a9" with hwdiv. We can not emit ".krait" as CPU since
it is not supported bu GNU GAS yet
llvm-svn: 230651
This patch is in response to r223147 where the avaiable features are
computed based on ".cpu" directive. This will work clean for the standard
variants like cortex-a9. For custom variants which rely on standard cpu names
for assembly, the additional features of a CPU should be propagated. This can be
done via ".arch_extension" as long as the assembler supports it. The
implementation for krait along with unit test will be submitted in next patch.
llvm-svn: 230650
The latency for the WriteMULm class was set to 4, which is actually lower than the latency for WriteMULr (5).
A better estimate would be 4 added to WriteMULr, that is, 9.
llvm-svn: 230634
formulaic into the top v8i16 lowering routine.
This makes the generalized lowering a completely general and single path
lowering which will allow generalizing it in turn for multiple 128-bit
lanes.
llvm-svn: 230623
Explanation: This function is in TargetLowering because it uses
RegClassForVT which would need to be moved to TargetRegisterInfo
and would necessitate moving isTypeLegal over as well - a massive
change that would just require TargetLowering having a TargetRegisterInfo
class member that it would use.
llvm-svn: 230585
This required plumbing a TargetRegisterInfo through computeRegisterProperties
and into findRepresentativeClass which uses it for register class
iteration. This required passing a subtarget into a few target specific
initializations of TargetLowering.
llvm-svn: 230583
LDtocL, and other loads that roughly correspond to the TOC_ENTRY SDAG node,
represent loads from the TOC, which is invariant. As a result, these loads can
be hoisted out of loops, etc. In order to do this, we need to generate
GOT-style MMOs for TOC_ENTRY, which requires treating it as a legitimate memory
intrinsic node type. Once this is done, the MMO transfer is automatically
handled for TableGen-driven instruction selection, and for nodes generated
directly in PPCISelDAGToDAG, we need to transfer the MMOs manually.
Also, we were not transferring MMOs associated with pre-increment loads, so do
that too.
Lastly, this fixes an exposed bug where R30 was not added as a defined operand of
UpdateGBR.
This problem was highlighted by an example (used to generate the test case)
posted to llvmdev by Francois Pichet.
llvm-svn: 230553
The Win64 epilogue structure is very restrictive, it permits a very
small number of opcodes and none of them are 'mov'.
This means that given:
mov %rbp, %rsp
pop %rbp
The mov isn't the epilogue, only the pop is. This is problematic unless
a frame pointer is present in which case we are free to do whatever we'd
like in the "body" of the function. If a frame pointer is present,
unwinding will undo the prologue operations in reverse order regardless
of the fact that we are at an instruction which is reseting the stack
pointer.
llvm-svn: 230543
We had somehow accumulated a few target-specific SDAG nodes dealing with PPC64
TOC access that were referenced only in TableGen patterns. The associated
(pseudo-)instructions are used, but are being generated directly. NFC.
llvm-svn: 230518
Reapply r230248.
Teach the peephole optimizer to work with MMX instructions by adding
entries into the foldable tables. This covers folding opportunities not
handled during isel.
llvm-svn: 230499
MMX_MOVD64rm zero-extends i32 load results into i64 registers.
The peephole optimizer will try to fold it in other MMX foldable
instructions, the wrong thing to do, since there's no MMX memory
instruction that loads from i32 and does implict zero extension.
Remove 'canFoldAsLoad' from MOVD64rm in order to prevent such folding.
The current MMX tests already test this, but since there are no MMX
instructions in the foldable tables yet, this did not trigger. This
commit prepares the addition of those instructions.
llvm-svn: 230498
Thumb-1 only allows SP-based LDR and STR to be word-sized, and SP-base LDR,
STR, and ADD only allow offsets that are a multiple of 4. Make some changes
to better make use of these instructions:
* Use word loads for anyext byte and halfword loads from the stack.
* Enforce 4-byte alignment on objects accessed in this way, to ensure that
the offset is valid.
* Do the same for objects whose frame index is used, in order to avoid having
to use more than one ADD to generate the frame index.
* Correct how many bits of offset we think AddrModeT1_s has.
Patch by John Brawn.
llvm-svn: 230496
Gather and scatter instructions additionally write to one of the source operands - mask register.
In this case Gather has 2 destination values - the loaded value and the mask.
Till now we did not support code gen pattern for gather - the instruction was generated from
intrinsic only and machine node was hardcoded.
When we introduce the masked_gather node, we need to select instruction automatically,
in the standard way.
I added a flag "hasTwoExplicitDefs" that allows to handle 2 destination operands.
(Some code in the X86InstrFragmentsSIMD.td is commented out, just to split one big
patch in many small patches)
llvm-svn: 230471
This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bytes
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1> are actually also represented as floating-point values
(essentially { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirrors that supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).
I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).
The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.
llvm-svn: 230413
The reason why these large shift sizes happen is because OpaqueConstants
currently inhibit alot of DAG combining, but that has to be addressed in
another commit (like the proposal in D6946).
Differential Revision: http://reviews.llvm.org/D6940
llvm-svn: 230355
The logic is almost there already, with our special homogeneous aggregate
handling. Tweaking it like this allows front-ends to emit AAPCS compliant code
without ever having to count registers or add discarded padding arguments.
Only arrays of i32 and i64 are needed to model AAPCS rules, but I decided to
apply the logic to all integer arrays for more consistency.
llvm-svn: 230348
Summary: Separated some instruction and pseudo-instruction definitions from InstAlias definitions, added banner for pseudo-instructions and removed a redundant whitespace from a pseudo-instruction definition. No functional change.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7552
llvm-svn: 230327
Summary: Begin to add various address modes; including alloca.
Test Plan: Make sure there are no regressions in test-suite at O0/02 in mips32r1/r2
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: echristo, rfuhler, llvm-commits
Differential Revision: http://reviews.llvm.org/D6426
llvm-svn: 230300
This is a follow up to r230233 to fix something that I noticed by
inspection. The AddrModeT2_i8s4 addressing mode does not support
negative offsets. I spent a good chunk of the day trying to come up with
a testcase for this but was not successful. This addressing mode is used
to spill and restore GPRPair registers in Thumb2 code and that does not
happen often. We also make very limited used of negative offsets when
lowering frame indexes. I am going ahead with the change anyway, because
I am pretty confident that it is correct. I also added a missing assertion
to check that the low bits of the scaled offset are zero.
llvm-svn: 230297
Prologue emission, in some cases, requires calls to a stack probe helper
function. The amount of stack to probe is passed as a register
argument in the Win64 ABI but the instruction sequence used is
pessimistic: it assumes that the number of bytes to probe is greater
than 4 GB.
Instead, select a more appropriate opcode depending on the number of
bytes we are going to probe.
llvm-svn: 230270
Front-ends could use global unnamed_addr to hold pointers to other
symbols, like @gotequivalent below:
@foo = global i32 42
@gotequivalent = private unnamed_addr constant i32* @foo
@delta = global i32 trunc (i64 sub (i64 ptrtoint (i32** @gotequivalent to i64),
i64 ptrtoint (i32* @delta to i64))
to i32)
The global @delta holds a data "PC"-relative offset to @gotequivalent,
an unnamed pointer to @foo. The darwin/x86-64 assembly output for this follows:
.globl _foo
_foo:
.long 42
.globl _gotequivalent
_gotequivalent:
.quad _foo
.globl _delta
_delta:
.long _gotequivalent-_delta
Since unnamed_addr indicates that the address is not significant, only
the content, we can optimize the case above by replacing pc-relative
accesses to "GOT equivalent" globals, by a PC relative access to the GOT
entry of the final symbol instead. Therefore, "delta" can contain a pc
relative relocation to foo's GOT entry and we avoid the emission of
"gotequivalent", yielding the assembly code below:
.globl _foo
_foo:
.long 42
.globl _delta
_delta:
.long _foo@GOTPCREL+4
There are a couple of advantages of doing this: (1) Front-ends that need
to emit a great deal of data to store pointers to external symbols could
save space by not emitting such "got equivalent" globals and (2) IR
constructs combined with this opt opens a way to represent GOT pcrel
relocations by using the LLVM IR, which is something we previously had
no way to express.
Differential Revision: http://reviews.llvm.org/D6922
rdar://problem/18534217
llvm-svn: 230264
It was previously using the subtarget to get values for the global
offset without actually checking each function as it was generating
code. Go ahead and solidify the current behavior and make the
existing FIXMEs more prominent.
As a note the ARM backend previously had a thumb1 and non-thumb1
set of defaults. Only the former was tested so I've changed the
behavior to only use that for now.
llvm-svn: 230245
This patch adds the isProfitableToHoist API. For AArch64, we want to prevent a
fmul from being hoisted in cases where it is more profitable to form a
fmsub/fmadd.
Phabricator Review: http://reviews.llvm.org/D7299
Patch by Lawrence Hu <lawrence@codeaurora.org>
llvm-svn: 230241
Summary:
-mno-odd-spreg prohibits the use of odd-numbered single-precision floating
point registers. However, vector insert/extract was still using them when
manipulating the subregisters of an MSA register. Fixed this by ensuring
that insertion/extraction is only performed on even-numbered vector
registers when -mno-odd-spreg is given.
Reviewers: vmedic, sstankovic
Reviewed By: sstankovic
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7672
llvm-svn: 230235
The natural way to handle this addressing mode would be to say that it has
8 bits and gets scaled by 4, but since the MC layer is expecting the scaling
to be already reflected in the immediate value, we have been setting the
Scale to 1. That's fine, but then NumBits needs to be adjusted to reflect
the effective increase in the range of the immediate. That adjustment was
missing.
The consequence is that the register scavenger can fail.
The estimateRSStackSizeLimit() function in ARMFrameLowering.cpp correctly
assumes that the AddrModeT2_i8s4 address mode can handle scaled offsets up to
1020. Under just the right circumstances, we fail to reserve space for the
scavenger because it thinks that nothing will be needed. However, the overly
pessimistic behavior in rewriteT2FrameIndex causes some frame indexes to be
out of range and require scavenged registers, and so the scavenger asserts.
Unfortunately I have not been able to come up with a testcase for this. I
can only reproduce it on an internal branch where the frame layout and
register allocation is slightly different than trunk. We really need a
way to serialize MachineInstr-level IR to write reasonable tests for things
like this.
rdar://problem/19909005
llvm-svn: 230233
Teach the peephole optimizer to work with MMX instructions by adding
entries into the foldable tables. This covers folding opportunities not
handled during isel.
llvm-svn: 230226
I made the templates general, no need to define pattern separately for each instruction/intrinsic.
Now only need to add r_Int pattern for AVX.
llvm-svn: 230221
Synthesizing a call directly using the MI layer would confuse the frame
lowering code. This is problematic as frame lowering is highly
sensitive the particularities of calls, etc.
llvm-svn: 230129
Everyone except R600 was manually passing the length of a static array
at each callsite, calculated in a variety of interesting ways. Far
easier to let ArrayRef handle that.
There should be no functional change, but out of tree targets may have
to tweak their calls as with these examples.
llvm-svn: 230118
Stack realignment occurs after the prolog, not during, for Win64.
Because of this, don't factor in the maximum stack alignment when
establishing a frame pointer.
This fixes PR22572.
llvm-svn: 230113
The expansion code does the same thing. Since
the operands were not defined with the correct
types, this has the side effect of fixing operand
folding since the expanded pseudo would never use
SGPRs or inline immediates.
llvm-svn: 230072
This enables a few useful combines that used to only
use fma.
Also since v_mad_f32 apparently does not support denormals,
disable the existing cases that are custom handled if they are
requested.
llvm-svn: 230071
usage of instruction ADDU16 by CodeGen. For this instruction an improper
register is allocated, i.e. the register that is not from register set defined
for the instruction.
llvm-svn: 230053
changes to remove non-Function based subtargets out of the asm
printer. For module level emission we'll need to construct up
an MCSubtargetInfo so that we can encode instructions for
emission.
llvm-svn: 230050
This patch teaches X86FastISel how to select intrinsic 'convert_from_fp16' and
intrinsic 'convert_to_fp16'.
If the target has F16C, we can select VCVTPS2PHrr for a float-half conversion,
and VCVTPH2PSrr for a half-float conversion.
Differential Revision: http://reviews.llvm.org/D7673
llvm-svn: 230043
EmitFunctionStubs is called from doFinalization and so can't
depend on the Subtarget existing. It's also irrelevant as
we know we're darwin since we're in the darwin asm printer.
llvm-svn: 230039
This canonicalization step saves us 3 pattern matching possibilities * 4 math ops
for scalar FP math that uses xmm regs. The backend can re-commute the operands
post-instruction-selection if that makes register allocation better.
The tests in llvm/test/CodeGen/X86/sse-scalar-fp-arith.ll cover this scenario already,
so there are no new tests with this patch.
Differential Revision: http://reviews.llvm.org/D7777
llvm-svn: 230024
the wrong answer. We also got initializer lists which are *way* cleaner
for this kind of thing. Let's use those and make this a normal, boring
functionn accepting ArrayRef.
llvm-svn: 230004
The IBM BG/Q supercomputer's A2 cores have a hardware prefetching unit, the
L1P, but it does not prefetch directly into the A2's L1 cache. Instead, it
prefetches into its own L1P buffer, and the latency to access that buffer is
significantly higher than that to the L1 cache (although smaller than the
latency to the L2 cache). As a result, especially when multiple hardware
threads are not actively busy, explicitly prefetching data into the L1 cache is
advantageous.
I've been using this pass out-of-tree for data prefetching on the BG/Q for well
over a year, and it has worked quite well. It is enabled by default only for
the BG/Q, but can be enabled for other cores as well via a command-line option.
Eventually, we might want to add some TTI interfaces and move this into
Transforms/Scalar (there is nothing particularly target dependent about it,
although only machines like the BG/Q will benefit from its simplistic
strategy).
llvm-svn: 229966
The new shuffle lowering has been the default for some time. I've
enabled the new legality testing by default with no really blocking
regressions. I've fuzz tested this very heavily (many millions of fuzz
test cases have passed at this point). And this cleans up a ton of code.
=]
Thanks again to the many folks that helped with this transition. There
was a lot of work by others that went into the new shuffle lowering to
make it really excellent.
In case you aren't using a diff algorithm that can handle this:
X86ISelLowering.cpp: 22 insertions(+), 2940 deletions(-)
llvm-svn: 229964
is going well, remove the flag and the code for the old legality tests.
This is the first step toward removing the entire old vector shuffle
lowering. *Much* more code to delete coming up next.
llvm-svn: 229963
reflects the fact that the x86 backend can in fact lower any shuffle you
want it to with reasonably high code quality.
My recent work on the new vector shuffle has made this regress *very*
little. The diff in the test cases makes me very, very happy.
llvm-svn: 229958
This re-applies r223862, r224198, r224203, and r224754, which were
reverted in r228129 because they exposed Clang misalignment problems
when self-hosting.
The combine caused the crashes because we turned ISD::LOAD/STORE nodes
to ARMISD::VLD1/VST1_UPD nodes. When selecting addressing modes, we
were very lax for the former, and only emitted the alignment operand
(as in "[r1:128]") when it was larger than the standard alignment of
the memory type.
However, for ARMISD nodes, we just used the MMO alignment, no matter
what. In our case, we turned ISD nodes to ARMISD nodes, and this
caused the alignment operands to start being emitted.
And that's how we exposed alignment problems that were ignored before
(but I believe would have been caught with SCTRL.A==1?).
To fix this, we can just mirror the hack done for ISD nodes: only
take into account the MMO alignment when the access is overaligned.
Original commit message:
We used to only combine intrinsics, and turn them into VLD1_UPD/VST1_UPD
when the base pointer is incremented after the load/store.
We can do the same thing for generic load/stores.
Note that we can only combine the first load/store+adds pair in
a sequence (as might be generated for a v16f32 load for instance),
because other combines turn the base pointer addition chain (each
computing the address of the next load, from the address of the last
load) into independent additions (common base pointer + this load's
offset).
rdar://19717869, rdar://14062261.
llvm-svn: 229932
In preparation for a future patch:
- rename isLoad to isLoadOp: the former is confusing, and can be taken
to refer to the fact that the node is an ISD::LOAD. (it isn't, yet.)
- change formatting here and there.
- add some comments.
- const-ify bools.
llvm-svn: 229929
systematic lowering of v8i16.
This required a slight strategy shift to prefer unpack lowerings in more
places. While this isn't a cut-and-dry win in every case, it is in the
overwhelming majority. There are only a few places where the old
lowering would probably be a touch faster, and then only by a small
margin.
In some cases, this is yet another significant improvement.
llvm-svn: 229859
addition to lowering to trees rooted in an unpack.
This saves shuffles and or registers in many various ways, lets us
handle another class of v4i32 shuffles pre SSE4.1 without domain
crosses, etc.
llvm-svn: 229856
terribly complex partial blend logic.
This code path was one of the more complex and bug prone when it first
went in and it hasn't faired much better. Ultimately, with the simpler
basis for unpack lowering and support bit-math blending, this is
completely obsolete. In the worst case without this we generate
different but equivalent instructions. However, in many cases we
generate much better code. This is especially true when blends or pshufb
is available.
This does expose one (minor) weakness of the unpack lowering that I'll
try to address.
In case you were wondering, this is actually a big part of what I've
been trying to pull off in the recent string of commits.
llvm-svn: 229853
needed, and significantly improve the SSSE3 path.
This makes the new strategy much more clear. If we can blend, we just go
with that. If we can't blend, we try to permute into an unpack so
that we handle cases where the unpack doing the blend also simplifies
the shuffle. If that fails and we've got SSSE3, we now call into
factored-out pshufb lowering code so that we leverage the fact that
pshufb can set up a blend for us while shuffling. This generates great
code, especially because we *know* we don't have a fast blend at this
point. Finally, we fall back on decomposing into permutes and blends
because we do at least have a bit-math-based blend if we need to use
that.
This pretty significantly improves some of the v8i16 code paths. We
never need to form pshufb for the single-input shuffles because we have
effective target-specific combines to form it there, but we were missing
its effectiveness in the blends.
llvm-svn: 229851
them into permutes and a blend with the generic decomposition logic.
This works really well in almost every case and lets the code only
manage the expansion of a single input into two v8i16 vectors to perform
the actual shuffle. The blend-based merging is often much nicer than the
pack based merging that this replaces. The only place where it isn't we
end up blending between two packs when we could do a single pack. To
handle that case, just teach the v2i64 lowering to handle these blends
by digging out the operands.
With this we're down to only really random permutations that cause an
explosion of instructions.
llvm-svn: 229849
v16i8 shuffles, and replace it with new facilities.
This uses precise patterns to match exact unpacks, and the new
generalized unpack lowering only when we detect a case where we will
have to shuffle both inputs anyways and they terminate in exactly
a blend.
This fixes all of the blend horrors that I uncovered by always lowering
blends through the vector shuffle lowering. It also removes *sooooo*
much of the crazy instruction sequences required for v16i8 lowering
previously. Much cleaner now.
The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when
it would be marginally nicer to use pshufb+pshufb+por. However, the
difference there is *tiny*. In many cases its a win because we re-use
the pshufb mask. In others, we get to avoid the pshufb entirely. I've
left a FIXME, but I'm dubious we can really do better than this. I'm
actually pretty happy with this lowering now.
For SSE2 this exposes some horrors that were really already there. Those
will have to fixed by changing a different path through the v16i8
lowering.
llvm-svn: 229846
on things not being marked as either custom or legal, but we now do
custom lowering of more VSELECT nodes. To cope with this, manually
replicate the legality tests here. These have to stay in sync with the
set of tests used in the custom lowering of VSELECT.
Ideally, we wouldn't do any of this combine-based-legalization when we
have an actual custom legalization step for VSELECT, but I'm not going
to be able to rewrite all of that today.
I don't have a test case for this currently, but it was found when
compiling a number of the test-suite benchmarks. I'll try to reduce
a test case and add it.
This should at least fix the test-suite fallout on build bots.
llvm-svn: 229844
lowering paths. I'm going to be leveraging this to simplify a lot of the
overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes.
Sadly, this isn't profitable on v4i32 and v2i64. There, the float and
double blending instructions for pre-SSE4.1 are actually pretty good,
and we can't beat them with bit math. And once SSE4.1 comes around we
have direct blending support and this ceases to be relevant.
Also, some of the test cases look odd because the domain fixer
canonicalizes these to floating point domain. That's OK, it'll use the
integer domain when it matters and some day I may be able to update
enough of LLVM to canonicalize the other way.
This restores almost all of the regressions from teaching x86's vselect
lowering to always use vector shuffle lowering for blends. The remaining
problems are because the v16 lowering path is still doing crazy things.
I'll be re-arranging that strategy in more detail in subsequent commits
to finish recovering the performance here.
llvm-svn: 229836
First, don't combine bit masking into vector shuffles (even ones the
target can handle) once operation legalization has taken place. Custom
legalization of vector shuffles may exist for these patterns (making the
predicate return true) but that custom legalization may in some cases
produce the exact bit math this matches. We only really want to handle
this prior to operation legalization.
However, the x86 backend, in a fit of awesome, relied on this. What it
would do is mark VSELECTs as expand, which would turn them into
arithmetic, which this would then match back into vector shuffles, which
we would then lower properly. Amazing.
Instead, the second change is to teach the x86 backend to directly form
vector shuffles from VSELECT nodes with constant conditions, and to mark
all of the vector types we support lowering blends as shuffles as custom
VSELECT lowering. We still mark the forms which actually support
variable blends as *legal* so that the custom lowering is bypassed, and
the legal lowering can even be used by the vector shuffle legalization
(yes, i know, this is confusing. but that's how the patterns are
written).
This makes the VSELECT lowering much more sensible, and in fact should
fix a bunch of bugs with it. However, as you'll see in the test cases,
right now what it does is point out the *hilarious* deficiency of the
new vector shuffle lowering when it comes to blends. Fortunately, my
very next patch fixes that. I can't submit it yet, because that patch,
somewhat obviously, forms the exact and/or pattern that the DAG combine
is matching here! Without this patch, teaching the vector shuffle
lowering to produce the right code infloops in the DAG combiner. With
this patch alone, we produce terrible code but at least lower through
the right paths. With both patches, all the regressions here should be
fixed, and a bunch of the improvements (like using 2 shufps with no
memory loads instead of 2 andps with memory loads and an orps) will
stay. Win!
There is one other change worth noting here. We had hilariously wrong
vectorization cost estimates for vselect because we fell through to the
code path that assumed all "expand" vector operations are scalarized.
However, the "expand" lowering of VSELECT is vector bit math, most
definitely not scalarized. So now we go back to the correct if horribly
naive cost of "1" for "not scalarized". If anyone wants to add actual
modeling of shuffle costs, that would be cool, but this seems an
improvement on its own. Note the removal of 16 and 32 "costs" for doing
a blend. Even in SSE2 we can blend in fewer than 16 instructions. ;] Of
course, we don't right now because of OMG bad code, but I'm going to fix
that. Next patch. I promise.
llvm-svn: 229835
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
Differential Revision: http://reviews.llvm.org/D7065
llvm-svn: 229831
A null MCTargetStreamer allows IRObjectFile to ignore target-specific
directives. Previously we were crashing.
Differential Revision: http://reviews.llvm.org/D7711
llvm-svn: 229797
This involved moving two non-subtarget dependent features (64-bitness
and the driver interface) to the NVPTX target machine and updating
the uses (or migrating around the subtarget use for ease of review).
Otherwise use the cached subtarget or create a default subtarget
based on the TargetMachine cpu and feature string for the module
level assembler emission.
llvm-svn: 229785
VOP2 declares vsrc1, but VOP3 declares src1.
We can't use the same "ins" if the operands have different names in VOP2
and VOP3 encodings.
This fixes a hang in geometry shaders which spill M0 on VI.
(BTW it doesn't look like M0 needs spilling and the spilling seems
duplicated 3 times)
llvm-svn: 229752
Summary:
These ISA's didn't add any instructions so they are almost identical to
Mips32r2 and Mips64r2. Even the ELF e_flags are the same, However the ISA
revision in .MIPS.abiflags is 3 or 5 respectively instead of 2.
Reviewers: vmedic
Reviewed By: vmedic
Subscribers: tomatabacu, llvm-commits, atanasyan
Differential Revision: http://reviews.llvm.org/D7381
llvm-svn: 229695
Summary:
Parse for an MCExpr instead of an Identifier and use the symbol for relocations, not just the symbol's name.
This fixes errors when using local labels in .cpsetup (PR22518).
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: seanbruno, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D7697
llvm-svn: 229671
quite literally the same work, we just need to special case the >64-bit
element shift code emission to emit the byte shift instructions and
offsets. This also makes reasoning about each of the vector lowering
strategies easier as we don't have to remember to use both forms.
llvm-svn: 229662
Add some of the missing M and R class Cortex CPUs, namely:
Cortex-M0+ (called Cortex-M0plus for GCC compatibility)
Cortex-M1
SC000
SC300
Cortex-R5
llvm-svn: 229660
Removed (unreachable) default case in switch to clean up warning:
lib/Target/SystemZ/SystemZISelLowering.cpp:1974:5:
error: default label in switch which covers all enumeration values
[-Werror,-Wcovered-switch-default]
llvm-svn: 229658
code.
While this didn't have the miscompile (it used MatchLeft consistently)
it missed some cases where it could use right shifts. I've added a test
case Craig Topper came up with to exercise the right shift matching.
This code is really identical between the two. I'm going to merge them
next so that we don't keep two copies of all of this logic.
llvm-svn: 229655
The current SystemZ back-end only supports the local-exec TLS access model.
This patch adds all required CodeGen support for the other TLS models, which
means in particular:
- Expand initial-exec TLS accesses by loading TLS offsets from the GOT
using @indntpoff relocations.
- Expand general-dynamic and local-dynamic accesses by generating the
appropriate calls to __tls_get_offset. Note that this routine has
a non-standard ABI and requires loading the GOT pointer into %r12,
so the patch also adds support for the GLOBAL_OFFSET_TABLE ISD node.
- Add a new platform-specific optimization pass to remove redundant
__tls_get_offset calls in the local-dynamic model (modeled after
the corresponding X86 pass).
- Add test cases verifying all access models and optimizations.
llvm-svn: 229654
The current SystemZ back-end only supports the local-exec TLS access model.
This patch adds all required MC support for the other TLS models, which
means in particular:
- Support additional relocation types for
Initial-exec model: R_390_TLS_IEENT
Local-dynamic-model: R_390_TLS_LDO32, R_390_TLS_LDO64,
R_390_TLS_LDM32, R_390_TLS_LDM64, R_390_TLS_LDCALL
General-dynamic model: R_390_TLS_GD32, R_390_TLS_GD64, R_390_TLS_GDCALL
- Support assembler syntax to generate additional relocations
for use with __tls_get_offset calls:
:tls_gdcall:
:tls_ldcall:
The patch also adds a new test to verify fixups and relocations,
and removes the (already unused) FK_390_PLT16DBL/FK_390_PLT32DBL
fixup kinds.
llvm-svn: 229652
track state.
I didn't like this in the code review because the pattern tends to be
error prone, but I didn't see a clear way to rewrite it. Turns out that
there were bugs here, I found them when fuzz testing our shuffle
lowering for correctness on x86.
The core of the problem is that we need to consistently test all our
preconditions for the same directionality of shift and the same input
vector. Instead, formulate this as two predicates (one doesn't depend on
the input in any way), pass things like the directionality and input
vector as inputs, and loop over the alternatives.
This fixes a pattern of very rare miscompiles coming out of this code.
Turned up roughly 4 out of every 1 million v8 shuffles in my fuzz
testing. The new code is over half a million test runs with no failures
yet. I've also fuzzed every other function in the lowering code with
over 3.5 million test cases and not discovered any other miscompiles.
llvm-svn: 229642
Some formats capitalized these, but most didn't. Change
them all to be consistently lowercase.
Now, non-encoding fields and convenience bits are capitalized.
Also remove weird looking empty line in some of the formats.
llvm-svn: 229613
initialization. Initialize the subtarget once per function and
migrate EmitStartOfAsmFile to either use calls on the
TargetMachine or get information from the subtarget we'd use
for assembling.
The top-level-ness of the MIPS attribute output for assembly is,
by nature, contrary to how we'd want to do this for an LTO
situation where we have multiple cpu architectures so this
solution is good enough for now.
llvm-svn: 229596
This patch teaches fast-isel how to select a (V)CVTSI2SSrr for an integer to
float conversion, and how to select a (V)CVTSI2SDrr for an integer to double
conversion.
Added test 'fast-isel-int-float-conversion.ll'.
Differential Revision: http://reviews.llvm.org/D7698
llvm-svn: 229589
The problem in the original patch was not switching back to .text after printing
an eh table.
Original message:
On ELF, put PIC jump tables in a non executable section.
Fixes PR22558.
llvm-svn: 229586
Change the memory operands in sse12_fp_packed_scalar_logical_alias from scalars to vectors.
That's what the hardware packed logical FP instructions define: 128-bit memory operands.
There are no scalar versions of these instructions...because this is x86.
Generating the wrong code (folding a scalar load into a 128-bit load) is still possible
using the peephole optimization pass and the load folding tables. We won't completely
solve this bug until we either fix the lowering in fabs/fneg/fcopysign and any other
places where scalar FP logic is created or fix the load folding in foldMemoryOperandImpl()
to make sure it isn't changing the size of the load.
Differential Revision: http://reviews.llvm.org/D7474
llvm-svn: 229531
initialization. Initialize the subtarget once per function and
migrate Emit{Start|End}OfAsmFile to either use attributes on the
TargetMachine or get information from the subtarget we'd use
for assembling. One bit (getISAEncoding) touched the general
AsmPrinter and the debug output. Handle this one by passing
the function for the subprogram down and updating all callers
and users.
The top-level-ness of the ARM attribute output for assembly is,
by nature, contrary to how we'd want to do this for an LTO
situation where we have multiple cpu architectures so this
solution is good enough for now.
llvm-svn: 229528
GCC 4.8 reported two new warnings due to comparisons
between signed and unsigned integer expressions. The new warnings were
accidentally introduced by revision 229480.
Added explicit casts to silence the warnings. No functional change intended.
llvm-svn: 229488
- added mask types v8i1 and v16i1 to possible function parameters
- enabled passing 512-bit vectors in standard CC
- added a test for KNL intel_ocl_bi conventions
llvm-svn: 229482
Vector zext tends to get legalized into a vector anyext, represented as a vector shuffle with an undef vector + a bitcast, that gets ANDed with a mask that zeroes the undef elements.
Combine this into an explicit shuffle with a zero vector instead. This allows shuffle lowering to match it as a zext, instead of matching it as an anyext and emitting an explicit AND.
This combine only covers a subset of the cases, but it's a start.
Differential Revision: http://reviews.llvm.org/D7666
llvm-svn: 229480
initialization. Initialize the subtarget once per function and
migrate EmitStartOfAsmFile to either use attributes on the
TargetMachine or get information from all of the various
subtargets.
llvm-svn: 229475
This allows it to match still more places where previously we would have
to fall back on floating point shuffles or other more complex lowering
strategies.
I'm hoping to replace some of the hand-rolled unpack matching with this
routine is it gets more and more clever.
llvm-svn: 229463
Our register allocation has become better recently, it seems, and is now
starting to generate cross-block copies into inflated register classes. These
copies are not transformed into subregister insertions/extractions by the
PPCVSXCopy class, and so need to be handled directly by
PPCInstrInfo::copyPhysReg. The code to do this was *almost* there, but not
quite (it was unnecessarily restricting itself to only the direct
sub/super-register-class case (not copying between, for example, something in
VRRC and the lower-half of VSRC which are super-registers of F8RC).
Triggering this behavior manually is difficult; I'm including two
bugpoint-reduced test cases from the test suite.
llvm-svn: 229457
Patch to explicitly add the SSE MOVQ (rr,mr,rm) instructions to SSEPackedInt domain - prevents a number of costly domain switches.
Differential Revision: http://reviews.llvm.org/D7600
llvm-svn: 229439
This adds a safe interface to the machine independent InputArg struct
for accessing the index of the original (IR-level) argument. When a
non-native return type is lowered, we generate the hidden
machine-level sret argument on-the-fly. Before this fix, we were
representing this argument as OrigArgIndex == 0, which is an outright
lie. In particular this crashed in the AArch64 backend where we
actually try to access the type of the original argument.
Now we use a sentinel value for machine arguments that have no
original argument index. AArch64, ARM, Mips, and PPC now check for this
case before accessing the original argument.
Fixes <rdar://19792160> Null pointer assertion in AArch64TargetLowering
llvm-svn: 229413
to generically lower blends and is particularly nice because it is
available frome SSE2 onward. This removes a lot of the remaining domain
crossing blends in SSE2 code.
I'm hoping to replace some of the "interleaved" lowering hacks with
something closer to this which should be more principled. First, this
needs to learn how to detect and use other interleavings besides that of
the natural type provided. That will be a follow-up patch though.
llvm-svn: 229378
This blend instruction is ... really lame. The register usage is insane.
As a consequence this is probably only *barely* better than 2 pshufbs
followed by a por, and that mostly because it only has to read from
a single memory location.
However, this doesn't fix as much as I kind of expected, so more to go.
Pretty sure that the ordering and delegation of v16i8 is just really,
really bad.
llvm-svn: 229373
template now that we can use them.
This is, of course, horribly ugly because of the required recursive
formulation. Suggestions for making it less ugly welcome.
llvm-svn: 229367
advantage of the existence of a reasonable blend instruction.
The 256-bit vector shuffle lowering has leveraged the general technique
of decomposed shuffles and blends for quite some time, but this never
made it back into the 128-bit code, and there are a large number of
patterns where this is substantially better. For example, this removes
almost all domain crossing in vector shuffles that involve some blend
and some permutation with SSE4.1 and later. See the massive reduction
in 'shufps' for integer test cases in this commit.
This isn't perfect yet for a few reasons:
1) The v8i16 shuffle lowering continues to plague me. We don't always
form an unpack-based blend when that would be better. But the wins
pretty drastically outstrip the losses here.
2) The v16i8 shuffle lowering is just a disaster here. I never went and
implemented blend support here for some terrible reason. I'll do
that next probably. I've not updated it for now.
More variations on this technique are coming as well -- we don't
shuffle-into-unpack or shuffle-into-palignr, both of which would also be
profitable.
Note that some test cases grow significantly in the number of
instructions, but I expect to actually be faster. We use
pshufd+pshufd+blendw instead of a single shufps, but the pshufd's are
very likely to pipeline well (two ports on most modern intel chips) and
the blend is a *very* fast instruction. The domain switch penalty will
essentially always be more than a blend instruction, which is the only
increase in tree height.
llvm-svn: 229350
This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets.
It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions.
Differential Revision: http://reviews.llvm.org/D7596
llvm-svn: 229311
when that will allow it to lower with a single permute instead of
multiple permutes.
It tries to detect when it will only have to do a single permute in
either case to maximize folding of loads and such.
This cuts a *lot* of the avx2 shuffle permute counts in half. =]
llvm-svn: 229309
vectors and detect equivalent inputs.
This lets the code match unpck-style instructions when only one of the
inputs are lined up but the other input is a splat and so which lanes we
pull from doesn't matter. Today, this doesn't really happen, but just by
accident. I have a patch that normalizes how we shuffle splats, and with
that patch this will be necessary for a lot of the mask equivalence
tests to work.
I don't really know how to write a test case for this specific change
until the other change lands though.
llvm-svn: 229307
don't try to do element insertion for non-zero-index floating point
vectors.
We don't have any useful patterns or lowering for element insertion into
high elements of a floating point vector, and the generic shuffle
lowering will end up being better -- namely it will fall back to unpck.
But we should try to handle other forms of element insertion before
matching unpck patterns.
While this doesn't matter much right now, I'm working on a patch that
makes unpck matching much more powerful, and that patch will break
without this re-ordering.
llvm-svn: 229306
I was somewhat surprised this pattern really came up, but it does. It
seems better to just directly handle it than try to special case every
place where we end up forming a shuffle that devolves to a shuffle of
a zero vector.
llvm-svn: 229301
subvectors from buildvectors. That doesn't really make any sense and it
breaks all of the down-stream matching of buildvectors to cleverly lower
shuffles.
With this, we now get the shift-based lowering of 256-bit vector
shuffles with AVX1 when we split them into 128-bit vectors. We also do
much better on the zero-extension patterns, although there remains quite
a bit of room for improvement here.
llvm-svn: 229299
least in theory.
I don't actually have a test case that benefits from this, but
theoretically, it could come up, and I don't want to try to think about
whether this is the culprit or something else is, so I'd rather just
make this code powerful. =/ Makes me sad that I can't really test it
though.
llvm-svn: 229298
lowerings -- one which decomposes into an initial blend followed by
a permute.
Particularly on newer chips, blends are handled independently of
shuffles and so this is much less bottlenecked on the single port that
floating point shuffles are executed with on Intel.
I'll be adding this lowering to a bunch of other code paths in
subsequent commits to handle still more places where we can effectively
leverage blends when they're available in the ISA.
llvm-svn: 229292
Patch to allow XOP instructions (integer comparison and integer multiply-add) to be commuted. The comparison instructions sometimes require the compare mode to be flipped but the remaining instructions can use default commutation modes.
This patch also sets the SSE domains of all the XOP instructions.
Differential Revision: http://reviews.llvm.org/D7646
llvm-svn: 229267
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229261
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229260
This should allow finally fixing the f64 fdiv implementation.
Test is disabled for VI since there seems to be a problem with one
of the buffer load instructions on it.
llvm-svn: 229236
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229224
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229222
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229221
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229220
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229218
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
llvm-svn: 229214
This takes the preposterous number of patterns in this section
that were last added to in r219033 down to just plain obnoxious.
With a little more work, we might get this down to just comical.
I've added more test cases to the existing file that checks these
patterns, but it seems that some of these patterns simply don't
exist with today's shuffle lowering.
llvm-svn: 229158
This patch adds functionality in MIPS delay slot filler such as if delay slot
filler have to put NOP instruction into the delay slot of microMIPS JR
instruction, then instead of emitting NOP this instruction is replaced by
compact jump instruction JRC.
Differential Revision: http://reviews.llvm.org/D7522
llvm-svn: 229128
Summary:
Made the following changes:
Added calls to emitDirectiveSetNoAt() and emitDirectiveSetAt().
Added special emit function for .set at=$reg, emitDirectiveSetAtWithArg(unsigned RegNo).
Improved parsing error checks for .set at.
Refactored parser code for .set at.
Improved testing of both directives.
Improved code readability and comments.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7176
llvm-svn: 229097
LLVM's include tree and the use of using declarations to hide the
'legacy' namespace for the old pass manager.
This undoes the primary modules-hostile change I made to keep
out-of-tree targets building. I sent an email inquiring about whether
this would be reasonable to do at this phase and people seemed fine with
it, so making it a reality. This should allow us to start bootstrapping
with modules to a certain extent along with making it easier to mix and
match headers in general.
The updates to any code for users of LLVM are very mechanical. Switch
from including "llvm/PassManager.h" to "llvm/IR/LegacyPassManager.h".
Qualify the types which now produce compile errors with "legacy::". The
most common ones are "PassManager", "PassManagerBase", and
"FunctionPassManager".
llvm-svn: 229094
Constant pool entries are uniqued by their contents regardless of their
type. This means that a pshufb can have a shuffle mask which isn't a
simple array of bytes.
The code path which attempts to decode the mask didn't check for
failure, causing PR22559.
llvm-svn: 228979
Summary:
Implement the bulk of returning values in Mips fast-isel
Test Plan:
reatabi.ll
Passes test-suite at -O0,-O2 and with mips32r2 and mips32r1.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits, aemerson, rfuhler
Differential Revision: http://reviews.llvm.org/D5920
llvm-svn: 228958
The changes in r223113 (ARM modified-immediate syntax) have broken
instructions like:
mov r0, #~0xffffff00
The problem is that I've added a spurious range check on the immediate
operand to ensure that it lies between INT32_MIN and UINT32_MAX. While
this range check is correct in theory, it causes problems because the
operand is stored in an int64_t (by MC). So valid 32-bit constants like
\#~0xffffff00 become out of range. The solution is to simply remove this
range check. It is not possible to validate the range of the immediate
operand with the current setup because: 1) The operand is stored in an
int64_t by MC, 2) The immediate can be of the forms #imm, #-imm, #~imm
or even #((~imm)) etc. So we just chop the value to 32 bits and use it.
Also noted that the original range check was note tested by any of the
unit tests. I've added a new test to cover #~imm kind of operands.
Change-Id: I411e90d84312a2eff01b732bb238af536c4a7599
llvm-svn: 228920
Using KORTESTW for comparison i1 value with zero was wrong since the instruction tests 16 bits.
KORTESTW may be used with KSHIFTL+KSHIFTR that clean the 15 upper bits.
I removed (X86cmp i1, 0) pattern and zero-extend i1 to i8 and then use TESTB.
There are some cases where i1 is in the mask register and the upper bits are already zeroed.
Then KORTESTW is the better solution, but it is subject for optimization.
Meanwhile, I'm fixing the correctness issue.
llvm-svn: 228916
This gives a rough estimate of whether using pushes instead of movs is profitable, in terms of size.
We go over all calls in the MachineFunction and compute:
a) For each callsite that can not use pushes, the penalty of not having a reserved call frame.
b) For each callsite that can use pushes, the gain of actually replacing the movs with pushes (and the potential penalty of having to readjust the stack).
Differential Revision: http://reviews.llvm.org/D7561
llvm-svn: 228915
On PowerPC, which has a full set of logical operations on (its multiple sets
of) condition-register bits, it is not profitable to break of complex
conditions feeding a jump into multiple jumps. We can turn off this feature of
CGP/SDAGBuilder by marking jumps as "expensive".
P7 test-suite speedups (no regressions):
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2
-0.626647% +/- 0.323583%
MultiSource/Benchmarks/Olden/power/power
-18.2821% +/- 8.06481%
llvm-svn: 228895
Summary:
Currently we have Mips32 and Mips64 disassemblers and this causes the target
triple to affect the disassembly despite all the relevant information being in
the ELF header. These implementations do not need to be separate.
This patch merges them together such that the appropriate tables are checked
for the subtarget (e.g. Mips64 is checked when GP64 is enabled).
Reviewers: vmedic
Reviewed By: vmedic
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7498
llvm-svn: 228825
This splits collecting information from actually performing the transformation, so that we can add a heuristic in between the two.
NFC.
Differential Revision: http://reviews.llvm.org/D7497
llvm-svn: 228817
The NodeMetadata are maintained in an incremental way. When an edge between
2 nodes has its cost updated, in the course of graph reduction for example,
the NodeMetadata need first to have the old edge cost removed, then the new
edge cost added. Only once the NodeMetadata have been fully updated, it
becomes safe to consider promoting the nodes to the
ConservativelyAllocatable or OptimallyReducible sets. Previously, this
promotion was occuring right after the removing the old cost, and this was
breaking the assumption that a ConservativelyAllocatable should not be
spilled.
This patch also adds asserts to:
- enforces the invariant that a node's reduction can not be downgraded,
- only not provably allocatable or optimally reducible nodes can be spilled.
llvm-svn: 228816
This allows IDEs to recognize the entire set of header files for
each of the core LLVM projects.
Differential Revision: http://reviews.llvm.org/D7526
Reviewed By: Chris Bieneman
llvm-svn: 228798
Simply loading or storing the frame pointer is not sufficient for
Windows targets. Instead, create a synthetic frame object that we will
lower later. References to this synthetic object will be replaced with
the correct reference to the frame address.
llvm-svn: 228748
See full discussion in http://reviews.llvm.org/D7491.
We now hide the add-immediate and call instructions together in a
separate pseudo-op, which is tagged to define GPR3 and clobber the
call-killed registers. The PPCTLSDynamicCall pass prior to RA now
expands this op into the two separate addi and call ops, with explicit
definitions of GPR3 on both instructions, and explicit clobbers on the
call instruction. The pass is now marked as requiring and preserving
the LiveIntervals and SlotIndexes analyses, and fixes these up after
the replacement sequences are introduced.
Self-hosting has been verified on LE P8 and BE P7 with various
optimization levels, etc. It has also been verified with the
--no-tls-optimize flag workaround removed.
llvm-svn: 228725
Walk the instructions marked FrameSetup and consider any stores of XMM
registers to the stack as needing a SaveXMM opcode.
This fixes PR22521.
Differential Revision: http://reviews.llvm.org/D7527
llvm-svn: 228724
Added most of the missing vector folding patterns for AVX2 (as well as fixing the vpermpd and verpmq patterns)
Differential Revision: http://reviews.llvm.org/D7492
llvm-svn: 228688
This patch adds the complete AMD Bulldozer XOP instruction set to the memory folding pattern tables for stack folding, etc.
Note: Many of the XOP instructions have multiple table entries as it can fold loads from different sources.
Differential Revision: http://reviews.llvm.org/D7484
llvm-svn: 228685
This patch teaches X86FastISel how to select AVX instructions for scalar
float/double convert operations.
Before this patch, X86FastISel always selected legacy SSE instructions
for FPExt (from float to double) and FPTrunc (from double to float).
For example:
\code
define double @foo(float %f) {
%conv = fpext float %f to double
ret double %conv
}
\end code
Before (with -mattr=+avx -fast-isel) X86FastIsel selected a CVTSS2SDrr which is
legacy SSE:
cvtss2sd %xmm0, %xmm0
With this patch, X86FastIsel selects a VCVTSS2SDrr instead:
vcvtss2sd %xmm0, %xmm0, %xmm0
Added test fast-isel-fptrunc-fpext.ll to check both the register-register and
the register-memory float/double conversion variants.
Differential Revision: http://reviews.llvm.org/D7438
llvm-svn: 228682
Win64 has specific contraints on what valid prologues and epilogues look
like. This constraint is born from the flexibility and descriptiveness
of Win64's unwind opcodes.
Prologues previously emitted by LLVM could not be represented by the
unwind opcodes, preventing operations powered by stack unwinding to
successfully work.
Differential Revision: http://reviews.llvm.org/D7520
llvm-svn: 228641
veqv (vector equivalence)
vnand
vorc
I increased the AddedComplexity for these instructions to 500 to ensure they are generated instead of issuing other VSX instructions.
Phabricator review: http://reviews.llvm.org/D7469
llvm-svn: 228580
While various DAG combines try to guarantee that a vector SETCC
operation will have the same output size as input, there's nothing
intrinsic to either creation or LegalizeTypes that actually guarantees
it, so the function needs to be ready to handle a mismatch.
Fortunately this is easy enough, just extend or truncate the naturally
compared result.
I couldn't reproduce the failure in other backends that I know have
SIMD, so it's probably only an issue for these two due to shared
heritage.
Should fix PR21645.
llvm-svn: 228518
If a loop predecessor has an invoke as its terminator, and the return value
from that invoke is used to determine the loop iteration space, then we can't
insert a computation based on that value in the loop predecessor prior to the
terminator (oops). If there's such an invoke, or just no predecessor for that
matter, insert a new loop preheader.
llvm-svn: 228488
from a conditional branch fed by an add/sub/mul-with-overflow node.
We previously used the SDLoc of the overflow node, for no good reason.
In some cases, this led to the Bcc and B terminators having different
source orders, and DBG_VALUEs being inserted between them.
The real issue is with the code that can't handle DBG_VALUEs between
terminators: the few places affected by this will be fixed soon.
In the meantime, fixing the SDLoc is a positive change no matter what.
No tests, as I have no idea how to get .loc emitted for branches?
rdar://19347133
llvm-svn: 228463
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.
llvm-svn: 228460
Doesn't seem necessary anymore. I think this was mostly compensating for
not enabling WQM for texture sampling instructions.
v2: Add test coverage
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 228373
If whole quad mode isn't enabled for these, the level of detail is
calculated incorrectly for pixels along diagonal triangle edges, causing
artifacts.
v2: Use a TSFlag instead of lots of switch cases
v3: Add test coverage
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=88642
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 228372
PassManager instance. In one case we can make the determination
from the Triple, in the other (execution dependency pass) the
pass will avoid running if we don't have any code that uses that
register class so go ahead and add it to the pipeline.
llvm-svn: 228334
dealing with module level emission. Currently this is using
the Triple to determine, but eventually the logic should
probably migrate to TLOF.
llvm-svn: 228332
PowerPC supports pre-increment load/store instructions (except for Altivec/VSX
vector load/stores). Using these on embedded cores can be very important, but
most loops are not naturally set up to use them. We can often change that,
however, by placing loops into a non-canonical form. Generically, this means
transforming loops like this:
for (int i = 0; i < n; ++i)
array[i] = c;
to look like this:
T *p = array[-1];
for (int i = 0; i < n; ++i)
*++p = c;
the key point is that addresses accessed are pulled into dedicated PHIs and
"pre-decremented" in the loop preheader. This allows the use of pre-increment
load/store instructions without loop peeling.
A target-specific late IR-level pass (running post-LSR), PPCLoopPreIncPrep, is
introduced to perform this transformation. I've used this code out-of-tree for
generating code for the PPC A2 for over a year. Somewhat to my surprise,
running the test suite + externals on a P7 with this transformation enabled
showed no performance regressions, and one speedup:
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
-2.32514% +/- 1.03736%
So I'm going to enable it on everything for now. I was surprised by this
because, on the POWER cores, these pre-increment load/store instructions are
cracked (and, thus, harder to schedule effectively). But seeing no regressions,
and feeling that it is generally easier to split instructions apart late than
it is to combine them late, this might be the better approach regardless.
In the future, we might want to integrate this functionality into LSR (but
currently LSR does not create new PHI nodes, so (for that and other reasons)
significant work would need to be done).
llvm-svn: 228328
PowerPC supports pre-increment floating-point load/store instructions, both r+r
and r+i, and we had patterns for them, but they were not marked as legal. Mark
them as legal (and add a test case).
llvm-svn: 228327
The combine that forms extloads used to be disabled on vector types,
because "None of the supported targets knows how to perform load and
sign extend on vectors in one instruction."
That's not entirely true, since at least SSE4.1 X86 knows how to do
those sextloads/zextloads (with PMOVS/ZX).
But there are several aspects to getting this right.
First, vector extloads are controlled by a profitability callback.
For instance, on ARM, several instructions have folded extload forms,
so it's not always beneficial to create an extload node (and trying to
match extloads is a whole 'nother can of worms).
The interesting optimization enables folding of s/zextloads to illegal
(splittable) vector types, expanding them into smaller legal extloads.
It's not ideal (it introduces some legalization-like behavior in the
combine) but it's better than the obvious alternative: form illegal
extloads, and later try to split them up. If you do that, you might
generate extloads that can't be split up, but have a valid ext+load
expansion. At vector-op legalization time, it's too late to generate
this kind of code, so you end up forced to scalarize. It's better to
just avoid creating egregiously illegal nodes.
This optimization is enabled unconditionally on X86.
Note that the splitting combine is happy with "custom" extloads. As
is, this bypasses the actual custom lowering, and just unrolls the
extload. But from what I've seen, this is still much better than the
current custom lowering, which does some kind of unrolling at the end
anyway (see for instance load_sext_4i8_to_4i64 on SSE2, and the added
FIXME).
Also note that the existing combine that forms extloads is now also
enabled on legal vectors. This doesn't have a big effect on X86
(because sext+load is usually combined to sext_inreg+aextload).
On ARM it fires on some rare occasions; that's for a separate commit.
Differential Revision: http://reviews.llvm.org/D6904
llvm-svn: 228325
The return value's address must be returned in %rax.
i.e. the callee needs to copy the sret argument (%rdi)
into the return value (%rax).
This probably won't manifest as a bug when the caller is LLVM-compiled
code. But it is an ABI guarantee and tools expect it.
llvm-svn: 228321
We should be setting UnrollingPreferences::MaxCount to MAX_UINT instead
of UnrollingPreferences::Count.
Count is a 'forced unrolling factor', while MaxCount sets an upper
limit to the unrolling factor.
Setting Count to MAX_UINT was causing the loop in the testcase to be
unrolled 15 times, when it only had a maximum of 4 iterations.
llvm-svn: 228303
The llvm.SI.end.cf intrinsic is used to mark the end of if-then blocks,
if-then-else blocks, and loops. It is responsible for updating the
exec mask to re-enable threads that had been masked during the preceding
control flow block. For example:
s_mov_b64 exec, 0x3 ; Initial exec mask
s_mov_b64 s[0:1], exec ; Saved exec mask
v_cmpx_gt_u32 exec, s[2:3], v0, 0 ; llvm.SI.if
do_stuff()
s_or_b64 exec, exec, s[0:1] ; llvm.SI.end.cf
The bug fixed by this patch was one where the llvm.SI.end.cf intrinsic
was being inserted into the header of loops. This would happen when
an if block terminated in a loop header and we would end up with
code like this:
s_mov_b64 exec, 0x3 ; Initial exec mask
s_mov_b64 s[0:1], exec ; Saved exec mask
v_cmpx_gt_u32 exec, s[2:3], v0, 0 ; llvm.SI.if
do_stuff()
LOOP: ; Start of loop header
s_or_b64 exec, exec, s[0:1] ; llvm.SI.end.cf <-BUG: The exec mask has the
same value at the beginning of each loop
iteration.
do_stuff();
s_cbranch_execnz LOOP
The fix is to create a new basic block before the loop and insert the
llvm.SI.end.cf there. This way the exec mask is restored before the
start of the loop instead of at the beginning of each iteration.
llvm-svn: 228302
Patch by Kit Barton.
Add the vector count leading zeros instruction for byte, halfword,
word, and doubleword sizes. This is a fairly straightforward addition
after the changes made for vpopcnt:
1. Add the correct definitions for the various instructions in
PPCInstrAltivec.td
2. Make the CTLZ operation legal on vector types when using P8Altivec
in PPCISelLowering.cpp
Test Plan
Created new test case in test/CodeGen/PowerPC/vec_clz.ll to check the
instructions are being generated when the CTLZ operation is used in
LLVM.
Check the encoding and decoding in test/MC/PowerPC/ppc_encoding_vmx.s
and test/Disassembler/PowerPC/ppc_encoding_vmx.txt respectively.
llvm-svn: 228301
Implement a BITCAST dag combine to transform i32->mmx conversion patterns
into a X86 specific node (MMX_MOVW2D) and guarantee that moves between
i32 and x86mmx are better handled, i.e., don't use store-load to do the
conversion..
llvm-svn: 228293
Summary: When evaluating floating point instructions in the inliner, ask the TTI whether it is an expensive operation. By default, it's not an expensive operation. This keeps the default behavior the same as before. The ARM TTI has been updated to return back TCC_Expensive for targets which don't have hardware floating point.
Reviewers: chandlerc, echristo
Reviewed By: echristo
Subscribers: t.p.northover, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D6936
llvm-svn: 228263
v2i32, i32, trunc i32 to i16, and truc i32 to i8 stores are legal for
all address spaces. We had marked them as custom in order to lower
them for the private address space, but this is no longer necessary.
This enables lowering of misaligned stores of these types in the
DAGLegalizer.
llvm-svn: 228189
This is a bug that was caused due to storing the feature bitset in a 32-bit
variable when it is a 64-bit mask, discarding the top half of the feature set.
llvm-svn: 228151
Currently, Cortex-A72 is modelled as an Cortex-A57 except the fp
load balancing pass isn't enabled for Cortex-A72 as it's not
profitable to have it enabled for this core.
Patch by Ranjeet Singh.
llvm-svn: 228140
This associates movss and movsd with the packed single and packed double
execution domains (resp.). While this is largely cosmetic, as we now
don't have weird ping-pong-ing between single and double precision, it
is also useful because it avoids the domain fixing algorithm from seeing
domain breaks that don't actually exist. It will also be much more
important if we have an execution domain default other than packed
single, as that would cause us to mix movss and movsd with integer
vector code on a regular basis, a very bad mixture.
llvm-svn: 228135
This reverts patches 223862, 224198, 224203, and 224754, which were all
related to the vector load/store combining and were reverted/reaplied
a few times due to the same alignment problems we're seeing now.
Further tests, mainly self-hosting Clang, will be needed to reapply this
patch in the future.
llvm-svn: 228129
This is the simplest form of bit-math based blending which only fires
when we are blending with zero and is relatively profitable. I've only
enabled this path on very specific lowering strategies. I'm planning to
widen its applicability in subsequent patches, but so far you'll notice
that even though we get fewer shufps instructions, we *still* do the bit
math in the FP execution port. I'm looking into why this is still
happening.
llvm-svn: 228124
Specifically, the existing patterns were scalar-only. These cover the
packed vector bitwise operations when specifically requested with pseudo
instructions. This is particularly important in SSE1 where we can't
actually emit a logical operation on a v2i64 as that isn't a legal type.
This will be tested in subsequent patches which form the floating point
and patterns in more places.
llvm-svn: 228123
The ARM assembler allows register alias redefinitions as long as it
targets the same register. r222319 broke that. In the AArch64 case
it would just produce a new warning, but in the ARM case it would
error out on previously accepted assembler.
llvm-svn: 228109
Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699.
Differential Revision: http://reviews.llvm.org/D6649
llvm-svn: 228047
Patch by Kit Barton.
Add the vector population count instructions for byte, halfword, word,
and doubleword sizes. There are two major changes here:
PPCISelLowering.cpp: Make CTPOP legal for vector types.
PPCRegisterInfo.td: Added v2i64 to the VRRC register
definition. This is needed for the doubleword variations of the
integer ops that were added in P8.
Test Plan
Test the instruction vpcnt* encoding/decoding in ppc64-encoding-vmx.s
Test the generation of the vpopcnt instructions for various vector
data types. When adding the v2i64 type to the Vector Register set, I
also needed to add the appropriate bit conversion patterns between
v2i64 and the existing vector types. Testing for these conversions
were also added in the test case by passing a different vector type as
a parameter into the test functions. There is also a run step that
will ensure the vpopcnt instructions are generated when the vsx
feature is disabled.
llvm-svn: 228046
What this does is that if you accidentally select these instructions on VI,
the code generation will fail, because the pseudo -> _vi mapping will be
undefined.
The idea is to be able to catch possible future bugs easily.
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 228038
This patch adds general shuffle pattern matching for the MOVQ zero-extend instruction (copy lower 64bits, zero upper) for all 128-bit integer vectors, it is added as a fallback test in lowerVectorShuffleAsZeroOrAnyExtend.
llvm-svn: 228022
Summary:
Straight-line strength reduction (SLSR) is implemented in GCC but not yet in
LLVM. It has proven to effectively simplify statements derived from an unrolled
loop, and can potentially benefit many other cases too. For example,
LLVM unrolls
#pragma unroll
foo (int i = 0; i < 3; ++i) {
sum += foo((b + i) * s);
}
into
sum += foo(b * s);
sum += foo((b + 1) * s);
sum += foo((b + 2) * s);
However, no optimizations yet reduce the internal redundancy of the three
expressions:
b * s
(b + 1) * s
(b + 2) * s
With SLSR, LLVM can optimize these three expressions into:
t1 = b * s
t2 = t1 + s
t3 = t2 + s
This commit is only an initial step towards implementing a series of such
optimizations. I will implement more (see TODO in the file commentary) in the
near future. This optimization is enabled for the NVPTX backend for now.
However, I am more than happy to push it to the standard optimization pipeline
after more thorough performance tests.
Test Plan: test/StraightLineStrengthReduce/slsr.ll
Reviewers: eliben, HaoLiu, meheff, hfinkel, jholewinski, atrick
Reviewed By: jholewinski, atrick
Subscribers: karthikthecool, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D7310
llvm-svn: 228016
This patch detects consecutive vector loads using the existing
EltsFromConsecutiveLoads() logic. This fixes:
http://llvm.org/bugs/show_bug.cgi?id=22329
This patch effectively reverts the tablegen additions of D6492 /
http://reviews.llvm.org/rL224344 ...which in hindsight were a horrible hack.
The test cases that were added with that patch are simply modified to load
from varying offsets of a base pointer. These loads did not match the existing
tablegen patterns.
A happy side effect of doing this optimization earlier is that we can now fold
the load into a math op where possible; this is shown in some of the updated
checks in the test file.
Differential Revision: http://reviews.llvm.org/D7303
llvm-svn: 228006
This can happen when a REV instruction is commuted.
The trick is not to define the _vi versions of instructions, which has these
consequences:
- code generation will always fail if a pseudo cannot be lowered
(very useful to catch bugs where an unsupported instruction somehow makes
it to the printer)
- ability to query if a pseudo can be lowered, which is done in commuteOpcode
to prevent REV from commuting to non-REV on VI
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227990
The getCommute* functions are only used with pseudos, so this commit doesn't
change anything.
The issue with missing non-rev versions of shift instructions on VI will fixed
separately.
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227989
- V_MAC_LEGACY_F32 exists on VI, but it's VOP3-only.
- Define CVT_PK opcodes which are different between SI and VI. These are
unused. The idea is to define all chip differences.
v2: keep V_MUL_LO_U32
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227988
These are VOP2 on SI and VOP3 on VI, and their pseudos are neither, which can
be a problem. In order to make isVOP2 and isVOP3 queries behave as expected,
the encoding must be determined first.
This doesn't fix any known issue, but better safe than sorry.
v2: add and use getMCOpcodeFromPseudo
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227987
This fixes a hang when using an empty geometry shader.
v2: - don't add s_nop when followed by s_waitcnt
- comestic changes
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227986
r224330 introduced a bug by misinterpreting the "FeatureVectorUAMem" bit.
The commit log says that change did not affect anything, but that's not correct.
That change allowed SSE instructions to have unaligned mem operands folded into
math ops, and that's not allowed in the default specification for any SSE variant.
The bug is exposed when compiling for an AVX-capable CPU that had this feature
flag but without enabling AVX codegen. Another mistake in r224330 was not adding
the feature flag to all AVX CPUs; the AMD chips were excluded.
This is part of the fix for PR22371 ( http://llvm.org/bugs/show_bug.cgi?id=22371 ).
This feature bit is SSE-specific, so I've renamed it to "FeatureSSEUnalignedMem".
Changed the existing test case for the feature bit to reflect the new name and
renamed the test file itself to better reflect the feature.
Added runs to fold-vex.ll to check for the failing codegen.
Note that the feature bit is not set by default on any CPU because it may require a
configuration register setting to enable the enhanced unaligned behavior.
llvm-svn: 227983
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.
In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations. This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3. This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.
In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering. This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.
However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy. The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity. It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves. MachineCSE
does not have any machinery to common idempotent calls. This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace. In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.
I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes. However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.
All of which leads to the design presented here. This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering. Virtual registers
are used until prior to register allocation. At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them. The register allocator then coalesces these copies away. With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).
One additional problem must be dealt with: After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead. Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.
One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes. Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address. This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.
Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not. I've fixed the test case to reflect this.
There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.
I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.
llvm-svn: 227976
Improve EXTRACT_VECTOR_ELT DAG combine to catch conversion patterns
between x86mmx and i32 with more layers of indirection.
Before:
movq2dq %mm0, %xmm0
movd %xmm0, %eax
After:
movd %mm0, %eax
llvm-svn: 227969