Summary: Implement bswap intrinsic for MIPS FastISel. It's very different for misp32 r1/r2 .
Based on a patch by Reed Kotler.
Test Plan:
bswap1.ll
test-suite
Reviewers: dsanders, rkotler
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D7219
llvm-svn: 238760
Summary:
Implement the intrinsics memset, memcopy and memmove in MIPS FastISel.
Make some needed infrastructure fixes so that this can work.
Based on a patch by Reed Kotler.
Test Plan:
memtest1.ll
The patch passes test-suite for mips32 r1/r2 and at O0/O2
Reviewers: rkotler, dsanders
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D7158
llvm-svn: 238759
Summary: Implement the LLVM assembly urem/srem and sdiv/udiv instructions in MIPS FastISel.
Based on a patch by Reed Kotler.
Test Plan:
srem1.ll
div1.ll
test-suite at O0/O2 for mips32 r1/r2
Reviewers: dsanders, rkotler
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D7028
llvm-svn: 238757
Summary: Implement the LLVM IR select statement for MIPS FastISelsel.
Based on a patch by Reed Kotler.
Test Plan:
"Make check" test included now.
Passes test-suite at O2/O0 mips32 r1/r2.
Reviewers: dsanders, rkotler
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D6774
llvm-svn: 238756
Summary:
The contents of the HI/LO registers are unpredictable after the execution of
the MUL instruction. In addition to implicitly defining these registers in the
MUL instruction definition, we have to mark those registers as dead too.
Without this the fast register allocator is running out of registers when the
MUL instruction is followed by another one that tries to allocate the AC0
register.
Based on a patch by Reed Kotler.
Reviewers: dsanders, rkotler
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D9825
llvm-svn: 238755
This is important because of different addressing modes
depending on the address space for GPU targets.
This only adds the argument, and does not update
any of the uses to provide the correct address space.
llvm-svn: 238723
The original version didn't properly account for the base register
being modified before the final jump, so caused miscompilations in
Chromium and LLVM. I've fixed this and tested with an LLVM self-host
(I don't have the means to build & test Chromium).
The general idea remains the same: in pathological cases jump tables
can be too far away from the instructions referencing them (like other
constants) so they need to be movable.
Should fix PR23627.
llvm-svn: 238680
best approach of each.
For vNi16, we use SHL + ADD + SRL pattern that seem easily the best.
For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput!!!! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both are the exact same uops though. Hard to be confident of
anything past that.
If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHl
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with both SSE3, SSSE3, and AVX selected as
I saw unique behavior in each of these buckets with IACA all of which
should be checked against measured performance.
But this patch is still a useful improvement by dropping duplicate work
and getting the much nicer PSADBW lowering for v2i64.
I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.
llvm-svn: 238652
The plan was to move the whole table into the already existing ArchExtNames
but some fields depend on a table-generated file, and we don't yet have this
feature in the generic lib/Support side.
Once the minimum target-specific table-generated files are available in a
generic fashion to these libraries, we'll have to keep it in the ASM parser.
llvm-svn: 238651
shorter one. NFC.
In addition to being much shorter to type and requiring fewer arguments,
this change saves over 30 lines from this one file, all wasted on total
boilerplate...
llvm-svn: 238640
shifting vectors of bytes as x86 doesn't have direct support for that.
This removes a bunch of redundant masking in the generated code for SSE2
and SSE3.
In order to avoid the really significant code size growth this would
have triggered, I also factored the completely repeatative logic for
shifting and masking into two lambdas which in turn makes all of this
much easier to read IMO.
llvm-svn: 238637
in-register LUT technique.
Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html
The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
The base implemantion of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.
I have extended Bruno's patch in the following ways:
0) I committed the new tests with baseline sequences so this shows
a diff, and regenerated the tests using the update scripts.
1) Bruno had noticed and mentioned in IRC a redundant mask that
I removed.
2) I introduced a particular optimization for the i32 vector cases where
we use PSHL + PSADBW to compute the the low i32 popcounts, and PSHUFD
+ PSADBW to compute doubled high i32 popcounts. This takes advantage
of the fact that to line up the high i32 popcounts we have to shift
them anyways, and we can shift them by one fewer bit to effectively
divide the count by two. While the PSHUFD based horizontal add is no
faster, it doesn't require registers or load traffic the way a mask
would, and provides more ILP as it happens on different ports with
high throughput.
3) I did some code cleanups throughout to simplify the implementation
logic.
4) I refactored it to continue to use the parallel bitmath lowering when
SSSE3 is not available to preserve the performance of that version on
SSE2 targets where it is still much better than scalarizing as we'll
still do a bitmath implementation of popcount even in scalar code
there.
With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.
With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.
Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.
Differential Revision: http://reviews.llvm.org/D10084
llvm-svn: 238636
a separate routine, generalize it to work for all the integer vector
sizes, and do general code cleanups.
This dramatically improves lowerings of byte and short element vector
popcount, but more importantly it will make the introduction of the
LUT-approach much cleaner.
The biggest cleanup I've done is to just force the legalizer to do the
bitcasting we need. We run these iteratively now and it makes the code
much simpler IMO. Other changes were minor, and mostly naming and
splitting things up in a way that makes it more clear what is going on.
The other significant change is to use a different final horizontal sum
approach. This is the same number of instructions as the old method, but
shifts left instead of right so that we can clear everything but the
final sum with a single shift right. This seems likely better than
a mask which will usually have to read the mask from memory. It is
certaily fewer u-ops. Also, this will be temporary. This and the LUT
approach share the need of horizontal adds to finish the computation,
and we have more clever approaches than this one that I'll switch over
to.
llvm-svn: 238635
It turns out that _except_handler3 and _except_handler4 really use the
same stack allocation layout, at least today. They just make different
choices about encoding the LSDA.
This is in preparation for lowering the llvm.eh.exceptioninfo().
llvm-svn: 238627
This patch corresponds to review:
http://reviews.llvm.org/D9941
It adds the various FMA instructions introduced in the version 2.07 of
the ISA along with the testing for them. These are operations on single
precision scalar values in VSX registers.
llvm-svn: 238578
Small (really small!) C++ exception handling examples work on 32-bit x86
now.
This change disables the use of .seh_* directives in WinException when
CFI is not in use. It also uses absolute symbol references in the tables
instead of imagerel32 relocations.
Also fixes a cache invalidation bug in MMI personality classification.
llvm-svn: 238575
MIOperands/ConstMIOperands are classes iterating over the MachineOperand
of a MachineInstr, however MachineInstr::mop_iterator does the same
thing.
I assume these two iterators exist to have a uniform interface to
iterate over the operands of a machine instruction bundle and a single
machine instruction. However in practice I find it more confusing to have 2
different iterator classes, so this patch transforms (nearly all) the
code to use mop_iterators.
The only exception being MIOperands::anlayzePhysReg() and
MIOperands::analyzeVirtReg() still needing an equivalent, I leave that
as an exercise for the next patch.
Differential Revision: http://reviews.llvm.org/D9932
This version is slightly modified from the proposed revision in that it
introduces MachineInstr::getOperandNo to avoid the extra counting
variable in the few loops that previously used MIOperands::getOperandNo.
llvm-svn: 238539
This moves all the state numbering code for C++ EH to WinEHPrepare so
that we can call it from the X86 state numbering IR pass that runs
before isel.
Now we just call the same state numbering machinery and insert a bunch
of stores. It also populates MachineModuleInfo with information about
the current function.
llvm-svn: 238514
For x86 targets, do not do sibling call optimization when materializing
the callee's address would require a GOT relocation. We can still do
tail calls to internal functions, hidden functions, and protected
functions, because they do not require this kind of relocation. It is
still possible to get GOT relocations when the user explicitly asks for
it with musttail or -tailcallopt, both of which are supposed to
guarantee TCO.
Based on a patch by Chih-hung Hsieh.
Reviewers: srhines, timmurray, danalbert, enh, void, nadav, rnk
Subscribers: joerg, davidxl, llvm-commits
Differential Revision: http://reviews.llvm.org/D9799
llvm-svn: 238487
We were previously codegen'ing these as regular load/store operations and
hoping that the register allocator would allocate registers in ascending order
so that we could apply an LDM/STM combine after register allocation. According
to the commit that first introduced this code (r37179), we planned to teach
the register allocator to allocate the registers in ascending order. This
never got implemented, and up to now we've been stuck with very poor codegen.
A much simpler approach for achiveing better codegen is to create LDM/STM
instructions with identical sets of virtual registers, let the register
allocator pick arbitrary registers and order register lists when printing an
MCInst. This approach also avoids the need to repeatedly calculate offsets
which ultimately ought to be eliminated pre-RA in order to decrease register
pressure.
This is implemented by lowering the memcpy intrinsic to a series of SD-only
MCOPY pseudo-instructions which performs a memory copy using a given number
of registers. During SD->MI lowering, we lower MCOPY to LDM/STM. This is a
little unusual, but it avoids the need to encode register lists in the SD,
and we can take advantage of SD use lists to decide whether to use the _UPD
variant of the instructions.
Fixes PR9199.
Differential Revision: http://reviews.llvm.org/D9508
llvm-svn: 238473
Octeon CPUs use dmtc2 rt,imm16 and dmfcp2 rt,imm16 for the crypto coprocessor.
E.g. dmtc2 rt,0x4057 starts calculation of sha-1.
I had to introduce a new deconding namespace to avoid a decoding conflict.
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D10083
llvm-svn: 238439
Add support for resolving MIPS64r2 and MIPS64r6 relocations in MCJIT.
Patch by Vladimir Radosavljevic.
Differential Revision: http://reviews.llvm.org/D9667
llvm-svn: 238424
Summary:
This patch made two improvements to NaryReassociate and the NVPTX pipeline
1. Run EarlyCSE/GVN after NaryReassociate to get rid of redundant common
expressions.
2. When adding an instruction to SeenExprs, maps both the SCEV before and after
reassociation to that instruction.
Test Plan: updated @reassociate_gep_nsw in nary-gep.ll
Reviewers: meheff, broune
Reviewed By: broune
Subscribers: dberlin, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D9947
llvm-svn: 238396
Now that most of the methods in Clang and LLVM that were parsing arch/cpu/fpu
strings are using ARMTargetParser, it's time to make it a bit more conforming
with what the ABI says.
This commit adds some clarification on what build attributes are accepted and
which are "non-standard". It also makes clear that the "defaultCPU" and
"defaultArch" methods were really just build attribute getters.
It also diverges from GCC's behaviour to say that armv2/armv3 are really an
ARMv4 in the build attributes, when the ABI has a clear state for that: Pre-v4.
llvm-svn: 238344
This broke the llvm-mips-linux builder and several of our out-of-tree builders.
Initial investigations show that the commit probably isn't the problem but
reverting anyway while I investigate.
llvm-svn: 238302
With this patch the x86 backend is now shrink-wrapping capable
and this functionality can be tested by using the
-enable-shrink-wrap switch.
The next step is to make more test and enable shrink-wrapping by
default for x86.
Related to <rdar://problem/20821487>
llvm-svn: 238293
This gets gas and llc -filetype=obj to agree on the order of prefixes.
For llvm-mc we need to fix the asm parser to know that it makes a difference
on which line the "lock" is in.
Part of pr23594.
llvm-svn: 238232
v2: Use C++ comments and end with periods
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Matt Arsenault <Matthew.Arsenault@amd.com>
llvm-svn: 238228
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first several times this was committed (e.g. r229831, r233055), it caused several buildbot failures.
Apparently the reason for most failures was both clang and gcc's inability to deal with large numbers (> 10K) of bitset constructor calls in tablegen-generated initializers of instruction info tables.
This should now be fixed.
llvm-svn: 238192
Summary:
Following on from r209907 which made personality encodings indirect, do the
same for TType encodings. This fixes the case where a try/catch block needs
to generate references to, for example, std::exception in the
.gcc_except_table.
This commit uses DW_EH_PE_sdata8 for N64 as far as is possible at the moment.
However, it is possible to end up with DW_EH_PE_sdata4 when a TargetMachine is
not available. There's no risk of issues with inconsistency here since the
tables are self describing but it does mean there is a small chance of the
PC-relative offset being out of range for particularly large programs.
Reviewers: petarj
Reviewed By: petarj
Subscribers: srhines, joerg, tberghammer, llvm-commits
Differential Revision: http://reviews.llvm.org/D9669
llvm-svn: 238190
Part of D9474, this patch extends AVX2 v16i16 types to 2 x 8i32 vectors and uses i32 shift variable shifts before packing back to i16.
Adds AVX2 tests for v8i16 and v16i16
llvm-svn: 238149
This lets us drop a parameter the opName parameter to the VINTRP
multiclass and makes it possible to create multiple VINTRP defs
with the same asm mnemonic.
llvm-svn: 238146
in POWER8:
vadduqm
vaddeuqm
vaddcuq
vaddecuq
vsubuqm
vsubeuqm
vsubcuq
vsubecuq
In addition to adding the instructions themselves, it also adds support for the
v1i128 type for intrinsics (Intrinsics.td, Function.cpp, and
IntrinsicEmitter.cpp).
http://reviews.llvm.org/D9081
llvm-svn: 238144
The semantics of the scalar FMA intrinsics are that the high vector elements are copied from the first source.
The existing pattern switches src1 and src2 around, to match the "213" order, which ends up tying the original src2 to the dest. Since the actual scalar fma3 instructions copy the high elements from the dest register, the wrong values are copied.
This modifies the pattern to leave src1 and src2 in their original order.
Differential Revision: http://reviews.llvm.org/D9908
llvm-svn: 238131
On GPU targets, materializing constants is cheap and stores are
expensive, so only doing this for zero vectors was silly.
Most of the new testcases aren't optimally merged, and are for
later improvements.
llvm-svn: 238108
When the compare feeding a branch was in a different BB from the branch, we'd
try to "regenerate" the compare in the block with the branch, possibly trying
to make use of values not available there. Copy a page from AArch64's play book
here to fix the problem (at least in terms of correctness).
Fixes PR23640.
llvm-svn: 238097
This is part of the work to remove TargetMachine::resetTargetOptions.
In this patch, instead of updating global variable NoFramePointerElim in
resetTargetOptions, its use in DisableFramePointerElim is replaced with a call
to TargetFrameLowering::noFramePointerElim. This function determines on a
per-function basis if frame pointer elimination should be disabled.
There is no change in functionality except that cl:opt option "disable-fp-elim"
can now override function attribute "no-frame-pointer-elim".
llvm-svn: 238080
This patch adds a class for processing many recip codegen possibilities.
The TargetRecip class is intended to handle both command-line options to llc as well
as options passed in from a front-end such as clang with the -mrecip option.
The x86 backend is updated to use the new functionality.
Only -mcpu=btver2 with -ffast-math should see a functional change from this patch.
All other CPUs continue to *not* use reciprocal estimates by default with -ffast-math.
Differential Revision: http://reviews.llvm.org/D8982
llvm-svn: 238051
The 'off' field of 'struct bpf_insn' is in cpu-endianness,
since the rest is emitted as little endian, make sure
that 'off' field is little endian as well.
llvm-svn: 238038
The problem was that I slipped a change required for shrink-wrapping, namely I
used getFirstTerminator instead of the getLastNonDebugInstr that was here before
the refactoring, whereas the surrounding code is not yet patched for that.
Original message:
[X86] Refactor the prologue emission to prepare for shrink-wrapping.
- Add a late pass to expand pseudo instructions (tail call and EH returns).
Instead of doing it in the prologue emission.
- Factor some static methods in X86FrameLowering to ease code sharing.
NFC.
Related to <rdar://problem/20821487>
llvm-svn: 238035
This patch adds support for the ISA 2.07 additions involving the
branch history rolling buffer and event-based branching. These will
not be used by typical applications, so built-in support is not
required. They will only be available via inline assembly.
Assembly/disassembly tests are included in the patch.
llvm-svn: 238032
The list of subtarget features for the 7em triple contains 't2xtpk',
which actually disables that subtarget feature. Correct that to
'+t2xtpk' and test that the instructions enabled by that feature do
actually work.
Differential Revision: http://reviews.llvm.org/D9936
llvm-svn: 238022
Revert "[X86] Refactor the prologue emission to prepare for shrink-wrapping."
This reverts commit 6b3b93fc8b68a2c806aa992ee4bd3d7f61898d4b.
This reverts commit ab0b15dff8539826283a59c2dd700a18a9680e0f.
llvm-svn: 238011
- Add a late pass to expand pseudo instructions (tail call and EH returns).
Instead of doing it in the prologue emission.
- Factor some static methods in X86FrameLowering to ease code sharing.
NFC.
Related to <rdar://problem/20821487>
llvm-svn: 237977
Unfortunately, I can't reduce a small test case for this (although compiling
mpfr-3.1.2 with -O2 -mcpu=a2 would fairly reliably trigger a crash), but the
problem is fairly clear (at least once you know you're looking for one). If the
TLS instruction being replaced was at the end of the block, we'd increment the
iterator past it (so it would then point to MBB.end()), and then we'd increment
it again as part of the for statement, thus overrunning the end of the list.
Don't do that.
llvm-svn: 237974
The raw non-instruction/constant form of this is still relying on being
able to access the pointee type from a pointer type - those will be
cleaned up later. For now, just focus on the cases where the pointee
type is easily accessible.
llvm-svn: 237958
My recent patch to add support for ISA 2.07 vector pack/unpack
instructions didn't properly check for availability of the vpkudum
instruction when recognizing it as a special vector shuffle case.
This causes us to leave the vector shuffle in place (rather than
converting it to a vector permute) so that it can be recognized later
as a vpkudum, but that pattern is invalid for processors prior to
POWER8. Thus LLVM crashes with an "unable to select" message. We
observed this since one of our buildbots is configured to generate
code for a POWER7.
This patch fixes the problem by checking for availability of the
vpkudum instruction during custom lowering of vector shuffles.
I've added a test case variant for the vpkudum pattern when the
instruction isn't available.
llvm-svn: 237952
On X86 (and similar OOO cores) unrolling is very limited, and even if the
runtime unrolling is otherwise profitable, the expense of a division to compute
the trip count could greatly outweigh the benefits. On the A2, we unroll a lot,
and the benefits of unrolling are more significant (seeing a 5x or 6x speedup
is not uncommon), so we're more able to tolerate the expense, on average, of a
division to compute the trip count.
llvm-svn: 237947
http://reviews.llvm.org/D9891
Following up on the VSX single precision loads and stores added earlier, this
adds support for elementary arithmetic operations on single precision values
in VSX registers. These instructions utilize the new VSSRC register class.
Instructions added:
xsaddsp
xsdivsp
xsmulsp
xsresp
xsrsqrtesp
xssqrtsp
xssubsp
llvm-svn: 237937
This starts merging MCSection and MCSectionData.
There are a few issues with the current split between MCSection and
MCSectionData.
* It optimizes the the not as important case. We want the production
of .o files to be really fast, but the split puts the information used
for .o emission in a separate data structure.
* The ELF/COFF/MachO hierarchy is not represented in MCSectionData,
leading to some ad-hoc ways to represent the various flags.
* It makes it harder to remember where each item is.
The attached patch starts merging the two by moving the alignment from
MCSectionData to MCSection.
Most of the patch is actually just dropping 'const', since
MCSectionData is mutable, but MCSection was not.
llvm-svn: 237936
Predicate UseAVX depricates pattern selection on AVX-512.
This predicate is necessary for DAG selection to select EVEX form.
But mapping SSE intrinsics to AVX-512 instructions is not ready yet.
So I replaced UseAVX with HasAVX for intrinsics patterns.
llvm-svn: 237903
This patch improves support for sign extension of the lower lanes of vectors of integers by making use of the SSE41 pmovsx* sign extension instructions where possible, and optimizing the sign extension by shifts on pre-SSE41 targets (avoiding the use of i64 arithmetic shifts which require scalarization).
It converts SIGN_EXTEND nodes to SIGN_EXTEND_VECTOR_INREG where necessary, that more closely matches the pmovsx* instruction than the default approach of using SIGN_EXTEND_INREG which splits the operation (into an ANY_EXTEND lowered to a shuffle followed by shifts) making instruction matching difficult during lowering. Necessary support for SIGN_EXTEND_VECTOR_INREG has been added to the DAGCombiner.
Differential Revision: http://reviews.llvm.org/D9848
llvm-svn: 237885
Ideally this is going to be and LLVM IR pass (shared, among others
with AArch64), but for the time being just enable it if consumers
ask us for optimization and not unconditionally.
Discussed with Tim Northover on IRC.
llvm-svn: 237837
Now that Intrinsic::ID is a typed enum, we can forward declare it and so return it from this method.
This updates all users which were either using an unsigned to store it, or had a now unnecessary cast.
llvm-svn: 237810
fixed extract-insert i1 element,
load i1, zextload i1 should be with "and $1, %reg" to prevent loading garbage.
added a bunch of new tests.
llvm-svn: 237793
Summary:
For N32/N64, private labels begin with '.L' but for O32 they begin with '$'.
MCAsmInfo now has an initializer function which can be used to provide information from the TargetMachine to control the assembly syntax.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: jfb, sandeep, llvm-commits, rafael
Differential Revision: http://reviews.llvm.org/D9821
llvm-svn: 237789
Summary:
The documentation writes vectors highest-index first whereas LLVM-IR writes
them lowest-index first. As a result, instructions defined in terms of
left_half() and right_half() had the halves reversed.
In addition to correcting them, they have been improved to allow shuffles
that use the same operand twice or in reverse order. For example, ilvev
used to accept masks of the form:
<0, n, 2, n+2, 4, n+4, ...>
but now accepts:
<0, 0, 2, 2, 4, 4, ...>
<n, n, n+2, n+2, n+4, n+4, ...>
<0, n, 2, n+2, 4, n+4, ...>
<n, 0, n+2, 2, n+4, 4, ...>
One further improvement is that splati.[bhwd] is now the preferred instruction
for splat-like operations. The other special shuffles are no longer used
for splats. This lead to the discovery that <0, 0, ...> would not cause
splati.[hwd] to be selected and this has also been fixed.
This fixes the enc-3des test from the test-suite on Mips64r6 with MSA.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9660
llvm-svn: 237689
This changes the ABI used on 32-bit x86 for passing vector arguments.
Historically, clang passes the first 4 vector arguments in-register, and additional vector arguments on the stack, regardless of platform. That is different from the behavior of gcc, icc, and msvc, all of which pass only the first 3 arguments in-register.
The 3-register convention is documented, unofficially, in Agner's calling convention guide, and, officially, in the recently released version 1.0 of the i386 psABI.
Darwin is kept as is because the OS X ABI Function Call Guide explicitly documents the current (4-register) behavior.
This fixes PR21510
Differential revision: http://reviews.llvm.org/D9644
llvm-svn: 237682
This reverts commit r237210.
Also fix X86/complex-fca.ll to match the code that we used to generate
on win32 and now generate everwhere to conform to SysV.
llvm-svn: 237639
ld64 currently mishandles internal pointer relocations (i.e.
ARM64_RELOC_UNSIGNED referred to by section & offset rather than symbol). The
existing __cfstring clause was an early discovery and workaround for this, but
the problem is wider and we should avoid such relocations wherever possible for
now.
This code should be reverted to allowing internal relocations as soon as
possible.
PR23437.
llvm-svn: 237621
This was previously returning int. However there are no negative opcode
numbers and more importantly this was needlessly different from
MCInstrDesc::getOpcode() (which even is the value returned here) and
SDValue::getOpcode()/SDNode::getOpcode().
llvm-svn: 237611
Previously, they were forced to immediately follow the actual branch
instruction. This was usually OK (the LEAs actually accessing them got emitted
nearby, and weren't usually separated much afterwards). Unfortunately, a
sufficiently nasty phi elimination dumps many instructions right before the
basic block terminator, and this can increase the range too much.
This patch frees them up to be placed as usual by the constant islands pass,
and consequently has to slightly modify the form of TBB/TBH tables to refer to
a PC-relative label at the final jump. The other jump table formats were
already position-independent.
rdar://20813304
llvm-svn: 237590
This pseudo-instruction expands into 'sethi' and 'or' instructions,
or, just one of them, if the other isn't necessary for a given value.
Differential Revision: http://reviews.llvm.org/D9089
llvm-svn: 237585
- Adds support for the asm syntax, which has an immediate integer
"ASI" (address space identifier) appearing after an address, before
a comma.
- Adds the various-width load, store, and swap in alternate address
space instructions. (ldsba, ldsha, lduba, lduha, lda, stba, stha,
sta, swapa)
This does not attempt to hook these instructions up to pointer address
spaces in LLVM, although that would probably be a reasonable thing to
do in the future.
Differential Revision: http://reviews.llvm.org/D8904
llvm-svn: 237581
(Note that register "Y" is essentially just ASR0).
Also added some test cases for divide and multiply, which had none before.
Differential Revision: http://reviews.llvm.org/D8670
llvm-svn: 237580
This patch implements LLVM support for the ACLE special register intrinsics in
section 10.1, __arm_{w,r}sr{,p,64}.
This patch is intended to lower the read/write_register instrinsics, used to
implement the special register intrinsics in the clang patch for special
register intrinsics (see http://reviews.llvm.org/D9697), to ARM specific
instructions MRC,MCR,MSR etc. to allow reading an writing of coprocessor
registers in AArch32 and AArch64. This is done by inspecting the register
string passed to the intrinsic and then lowering to the appropriate
instruction.
Patch by Luke Cheeseman.
Differential Revision: http://reviews.llvm.org/D9699
llvm-svn: 237579
instructions. These intrinsics are comming with rounding mode.
Added intrinsics for MAXSS/D, MINSS/D - with and without sae.
By Asaf Badouh (asaf.badouh@intel.com)
llvm-svn: 237560
If some commits are happy, and some commits are sad, this is a sad commit. It
is sad because it restricts instruction scheduling to work around a binutils
linker bug, and moreover, one that may never be fixed. On 2012-05-21, GCC was
updated not to produce code triggering this bug, and now we'll do the same...
When resolving an address using the ELF ABI TOC pointer, two relocations are
generally required: one for the high part and one for the low part. Only
the high part generally explicitly depends on r2 (the TOC pointer). And, so,
we might produce code like this:
.Ltmp526:
addis 3, 2, .LC12@toc@ha
.Ltmp1628:
std 2, 40(1)
ld 5, 0(27)
ld 2, 8(27)
ld 11, 16(27)
ld 3, .LC12@toc@l(3)
rldicl 4, 4, 0, 32
mtctr 5
bctrl
ld 2, 40(1)
And there is nothing wrong with this code, as such, but there is a linker bug
in binutils (https://sourceware.org/bugzilla/show_bug.cgi?id=18414) that will
misoptimize this code sequence to this:
nop
std r2,40(r1)
ld r5,0(r27)
ld r2,8(r27)
ld r11,16(r27)
ld r3,-32472(r2)
clrldi r4,r4,32
mtctr r5
bctrl
ld r2,40(r1)
because the linker does not know (and does not check) that the value in r2
changed in between the instruction using the .LC12@toc@ha (TOC-relative)
relocation and the instruction using the .LC12@toc@l(3) relocation.
Because it finds these instructions using the relocations (and not by
scanning the instructions), it has been asserted that there is no good way
to detect the change of r2 in between. As a result, this bug may never be
fixed (i.e. it may become part of the definition of the ABI). GCC was
updated to add extra dependencies on r2 to instructions using the @toc@l
relocations to avoid this problem, and we'll do the same here.
This is done as a separate pass because:
1. These extra r2 dependencies are not really properties of the
instructions, but rather due to a linker bug, and maybe one day we'll be
able to get rid of them when targeting linkers without this bug (and,
thus, keeping the logic centralized here will make that
straightforward).
2. There are ISel-level peephole optimizations that propagate the @toc@l
relocations to some user instructions, and so the exta dependencies do
not apply only to a fixed set of instructions (without undesirable
definition replication).
The test case was reduced with the help of bugpoint, with minimal cleaning. I'm
looking forward to our upcoming MI serialization support, and with that, much
better tests can be created.
llvm-svn: 237556
Summary:
But still handle them the same way since I don't know how they differ on
this target.
Of these, 'o' and 'v' are not tested but were already implemented.
I'm not sure why 'i' is required for X86 since it's supposed to be an
immediate constraint rather than a memory constraint. A test asserts
without it so I've included it for now.
No functional change intended.
Reviewers: nadav
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8254
llvm-svn: 237517
This patch adds support for the following new instructions in the
Power ISA 2.07:
vpksdss
vpksdus
vpkudus
vpkudum
vupkhsw
vupklsw
These instructions are available through the vec_packs, vec_packsu,
vec_unpackh, and vec_unpackl built-in interfaces. These are
lane-sensitive instructions, so the built-ins have different
implementations for big- and little-endian, and the instructions must
be marked as killing the vector swap optimization for now.
The first three instructions perform saturating pack operations. The
fourth performs a modulo pack operation, which means it can be
represented with a vector shuffle, and conversely the appropriate
vector shuffles may cause this instruction to be generated. The other
instructions are only generated via built-in support for now.
Appropriate tests have been added.
There is a companion patch to clang for the rest of this support.
llvm-svn: 237499
Other pieces of CodeGen want to negate frame object offsets to account
for architectures where the stack grows down. Our object is a pseudo
object so it's offset doesn't matter. However, we shouldn't choose an
offset which results in undefined behavior if you negate it.
llvm-svn: 237474
Summary:
To maintain compatibility with GAS, we need to stop treating negative 32-bit immediates as 64-bit values when expanding LI/DLI.
This currently happens because of sign extension.
To do this we need to choose the 32-bit value expansion for values which use their upper 33 bits only for sign extension (i.e. no 0's, only 1's).
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8662
llvm-svn: 237428
Instead of doing that, create a temporary copy of MCTargetOptions and reset its
SanitizeAddress field based on the function's attribute every time an InlineAsm
instruction is emitted in AsmPrinter::EmitInlineAsm.
This is part of the work to remove TargetMachine::resetTargetOptions (the FIXME
added to TargetMachine.cpp in r236009 explains why this function has to be
removed).
Differential Revision: http://reviews.llvm.org/D9570
llvm-svn: 237412
The induction variable in the vectorized loop wasn't
recognized properly, so a hardware loop wasn't generated.
Differential Revision: http://reviews.llvm.org/D9722
llvm-svn: 237388
After converting a loop to a hardware loop, the pass should remove
any unnecessary instructions from the old compare-and-branch
code. This patch removes a dead constant assignment that was
used in the compare instruction.
Differential Revision: http://reviews.llvm.org/D9720
llvm-svn: 237373
If the loop trip count may underflow or wrap, the compiler should
not generate a hardware loop since the trip count will be
incorrect.
llvm-svn: 237365
Summary:
When we are trying to fill the delay slot of a call instruction, we must avoid
filler instructions that use the $ra register. This fixes the test
MultiSource/Applications/JM/lencod when we enable the forward delay slot filler.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9670
llvm-svn: 237362
Summary:
If we only pass the necessary operands, we don't have to determine the position of the symbol operand when entering expandLoadAddressSym().
This simplifies the expandLoadAddressSym() code.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9291
llvm-svn: 237355
i1 type is a legal type on AVX-512 and can be passed as parameter or return value.
i1 is promoted to i8 on return and to i32 for call arguments (i8 is also promoted to i32 here).
The result code is similar to the previous X86 targets, where i1 is allways promoted to i8.
llvm-svn: 237350
There's no need to manually pass modifier strings around to tell an operand how
to print now, that information is encoded in the operand itself since the MC
layer came along.
llvm-svn: 237295
We were creating and propagating two separate indices for each jump table (from
back in the mists of time). However, the generic index used by other backends
is sufficient to emit a unique symbol so this was unneeded.
llvm-svn: 237294
The previous logic mixed 2 separate questions:
+ Can we form a TBB/TBH instruction?
+ Can we remove the jump-table calculation before it?
It then performed a bunch of random tests on the instructions earlier in the
basic block, which were probably sufficient to answer 2 but only because of the
very limited ways in which a t2BR_JT can actually be created.
For example there's no reason to expect the LeaInst to define the same base
register as the following indexing calulation. In practice this means we might
have missed opportunities to form TBB/TBH, in theory you could end up
misidentifying a sequence and removing the wrong LEA:
%R1 = t2LEApcrelJT ...
%R2 = t2LEApcrelJT ...
<... using and killing %R2 ...>
%R2 = t2ADDr %R1, $Ridx
Before we would have looked for an LEA defining %R2 and found the wrong one. We
just got lucky that jump table setup was (almost?) always confined to a single
basic block and there was only one jump table per block.
llvm-svn: 237293
Some compilers warn about using the ternary operator with an unsigned variable
and enum.
I haven't seen this trigger in the llvm.org buildbots yet, but it probably will
at some point.
Reported by Daniel Sanders.
llvm-svn: 237262
The hardware loop pass should try to generate a hardware loop
instruction when the original loop has a critical edge.
Differential Revision: http://reviews.llvm.org/D9678
llvm-svn: 237258
Summary: A side-effect of this is that LA gains proper handling of unsigned and positive signed 16-bit immediates and more accurate error messages.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9290
llvm-svn: 237255
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first two times this was committed (r229831, r233055), it caused several buildbot failures.
At least some of the ARM and MIPS ones were due to gcc/binutils issues, and should now be fixed.
llvm-svn: 237234
Summary:
This change adds two new parameters to the statepoint intrinsic, `i64 id`
and `i32 num_patch_bytes`. `id` gets propagated to the ID field
in the generated StackMap section. If the `num_patch_bytes` is
non-zero then the statepoint is lowered to `num_patch_bytes` bytes of
nops instead of a call (the spill and reload code remains unchanged).
A non-zero `num_patch_bytes` is useful in situations where a language
runtime requires complete control over how a call is lowered.
This change brings statepoints one step closer to patchpoints. With
some additional work (that is not part of this patch) it should be
possible to get rid of `TargetOpcode::STATEPOINT` altogether.
PlaceSafepoints generates `statepoint` wrappers with `id` set to
`0xABCDEF00` (the old default value for the ID reported in the stackmap)
and `num_patch_bytes` set to `0`. This can be made more sophisticated
later.
Reviewers: reames, pgavlin, swaroop.sridhar, AndyAyers
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9546
llvm-svn: 237214
They do more harm than good when used in the MachineScheduler as they
tend to take preference to register pressure minimsation which is more
important for swift.
Differential Revision: http://reviews.llvm.org/D9718
llvm-svn: 237179
Summary:
This rule was always in the old SysV i386 ABI docs and the new ones that
H.J. Lu has put together, but we never noticed:
EAX scratch register; also used to return integer and pointer values
from functions; also stores the address of a returned struct or union
Fixes PR23491.
Reviewers: majnemer
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9715
llvm-svn: 237175
AMDGPU::SI_SPILL_V96_RESTORE was missing from a switch statement, which
caused the srsrc and soffset register to not be set correctly.
This commit replaces the switch statement with a SITargetInfo query
to make sure all spill instructions are covered.
Differential Revision: http://reviews.llvm.org/D9582
llvm-svn: 237164
On Mips, frame pointer points to the same side of the frame as the stack
pointer. This function is used to decide where to put register scavenging
spill slot. So far, it was put on the wrong side of the frame, and thus it
was too far away from $fp when frame was larger than 2^15 bytes.
Patch by Vladimir Radosavljevic.
http://reviews.llvm.org/D8895
llvm-svn: 237153
Spilling can insert instructions almost anywhere, and this can mess
up control flow lowering in a multitude of ways, due to instruction
reordering. Let's sort this out the easy way: never spill registers
involved with control flow, i.e. saved EXEC masks.
Unfortunately, this does not work at all with optimizations disabled,
as the register allocator ignores spill weights. This should be
addressed in a future commit.
The test was reduced from the "stacks" shader of [1]. Some issues
trigger the machine verifier while another one is checked manually.
[1] http://madebyevan.com/webgl-path-tracing/
v2: only insert pass with optimizations enabled, merge test runs.
Patch by: Grigori Goronzy
llvm-svn: 237152
We had code to do this in SIRegisterInfo::eliminateFrameIndex(), but
it is easier to just change the definition of SI_SPILL_S32_RESTORE to
only allow numbered sgprs.
llvm-svn: 237143
Instead add m0 as an implicit operand. This allows us to avoid using
the M0Reg register class and eliminates a number of unnecessary spills
when using s_sendmsg instructions. This impacts one shader in the
shader-db:
SGPRS: 48 -> 40 (-16.67 %)
VGPRS: 112 -> 108 (-3.57 %)
Code Size: 40132 -> 38796 (-3.33 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Scratch: 2048 -> 0 (-100.00 %) bytes per wave
llvm-svn: 237133
The other changes in the LowerShift() are not functional,
just to make the code more convenient.
So, the functional changes for SKX only.
llvm-svn: 237129
AEABI defines aligned variants of memcpy etc. that can be faster than
the default version due to not having to do alignment checks. When
emitting target code for these functions make use of these aligned
variants if possible. Also convert memset to memclr if possible.
Differential Revision: http://reviews.llvm.org/D8060
llvm-svn: 237127
Before revision 171146, function 'PerformTruncateCombine' used to perform
a premature lowering of TRUNCATE dag nodes.
Revision 171146 then moved all the logic implemented by PerformTruncateCombine
to a custom lowering hook. However, that revision forgot to delete
function PerformTruncateCombine from the code.
This patch removes function 'PerformTruncateCombine' since it has no effect
on the SelectionDAG. No functional change intended.
llvm-svn: 237122
Summary: Allow calls with non legal integer types based on i8 and i16 to be processed by mips fast-isel.
Based on a patch by Reed Kotler.
Test Plan:
"Make check" test forthcoming.
Test-suite passes at O0/O2 and with mips32 r1/r2
Reviewers: rkotler, dsanders
Subscribers: llvm-commits, rfuhler
Differential Revision: http://reviews.llvm.org/D6770
llvm-svn: 237121
Summary:
Try to compute addresses when the offset from a memory location is a constant
expression.
Based on a patch by Reed Kotler.
Test Plan:
Passes test-suite for -O0/O2 and mips 32 r1/r2
Reviewers: rkotler, dsanders
Subscribers: llvm-commits, aemerson, rfuhler
Differential Revision: http://reviews.llvm.org/D6767
llvm-svn: 237117
sys/time.h on Solaris (and possibly other systems) defines "SEC" as "1"
using a cpp macro. The result is that this fails to compile.
Fixes https://llvm.org/PR23482
llvm-svn: 237112
The X86-specific DAGCombine for stores should not assume vector types are always simple.
This fixes PR23476.
Differential Revision: http://reviews.llvm.org/D9659
llvm-svn: 237097
to use the information in the module rather than TargetOptions.
We've had and clang has used the use-soft-float attribute for some
time now so have the backends set a subtarget feature based on
a particular function now that subtargets are created based on
functions and function attributes.
For the one middle end soft float check go ahead and create
an overloadable TargetLowering::useSoftFloat function that
just checks the TargetSubtargetInfo in all cases.
Also remove the command line option that hard codes whether or
not soft-float is set by using the attribute for all of the
target specific test cases - for the generic just go ahead and
add the attribute in the one case that showed up.
llvm-svn: 237079
The TargetRegistry is just a namespace-like class, instantiated in one
place to use a range-based for loop. Instead, expose access to the
registry via a range-based 'targets()' function instead. This makes most
uses a bit awkward/more verbose - but eventually we should just add a
range-based find_if function which will streamline these functions. I'm
happy to mkae them a bit awkward in the interim as encouragement to
improve the algorithms in time.
llvm-svn: 237059
Summary:
r235215 adds support for f16 to be considered as a load/store type and
promote f16 operations to f32.
This patch has miscellaneous fixes for the X86 backend so all f16
operations are handled:
1. Set loadextaction for f16 vectors to expand.
2. Handle FP_EXTEND in a switch statement when handling v2f32
3. Do not fold (FP_TO_SINT (load f16)) into FP_TO_INT*_IN_MEM or
(store (SINT_TO_FP )) to a FILD.
Tests included.
Reviewers: ab, srhines, delena
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9092
llvm-svn: 237004
The code that builds the dependence graph assumes that two PseudoSourceValues
don't alias. In a tail calling function two FixedStackObjects might refer to the
same location. Worse 'immutable' fixed stack objects like function arguments are
not immutable and will be clobbered.
Change this so that a load from a FixedStackObject is not invariant in a tail
calling function and don't return a PseudoSourceValue for an instruction in tail
calling functions when building the dependence graph so that we handle function
arguments conservatively.
Fix for PR23459.
rdar://20740035
llvm-svn: 236916
This new class in a global context contain arch-specific knowledge in order
to provide LLVM libraries, tools and projects with the ability to understand
the architectures. For now, only FPU, ARCH and ARCH extensions on ARM are
supported.
Current behaviour it to parse from free-text to enum values and back, so that
all users can share the same parser and codes. This simplifies a lot both the
ASM/Obj streamers in the back-end (where this came from), and the front-end
parsers for command line arguments (where this is going to be used next).
The previous implementation, using .def/.h includes is deprecated due to its
inflexibility to be built without the backend support and for being too
cumbersome. As more architectures join this scheme, and as more features of
such architectures are added (such as hardware features, type sizes, etc) into
a full blown TargetDescription class, having a set of classes is the most
sane implementation.
The ultimate goal of this refactor both LLVM's and Clang's target description
classes into one unique interface, so that we can de-duplicate and standardise
the descriptions, as well as make it available for other front-ends, tools,
etc.
The FPU parsing for command line options in Clang has been converted to use
this new library and a number of aliases were added for compatibility:
* A bogus neon-vfpv3 alias (neon defaults to vfp3)
* armv5/v6
* {fp4/fp5}-{sp/dp}-d16
Next steps:
* Port Clang's ARCH/EXT parsing to use this library.
* Create a TableGen back-end to generate this information.
* Run this TableGen process regardless of which back-ends are built.
* Expose more information and rename it to TargetDescription.
* Continue re-factoring Clang to use as much of it as possible.
llvm-svn: 236900
Refactored parts of the hardware loop pass to generate
more. Also, added more tests.
Differential Revision: http://reviews.llvm.org/D9568
llvm-svn: 236896
A trunc from i32 to i1 on x86_64 generates an instruction such as
%vreg19<def> = COPY %vreg9:sub_8bit<kill>; GR8:%vreg19 GR32:%vreg9
However, the copy here should only have the kill flag on the 32-bit path, not the 64-bit one.
Otherwise, we are killing the source of the truncate which could be used later in the program.
llvm-svn: 236890
This changes the shape of the statepoint intrinsic from:
@llvm.experimental.gc.statepoint(anyptr target, i32 # call args, i32 unused, ...call args, i32 # deopt args, ...deopt args, ...gc args)
to:
@llvm.experimental.gc.statepoint(anyptr target, i32 # call args, i32 flags, ...call args, i32 # transition args, ...transition args, i32 # deopt args, ...deopt args, ...gc args)
This extension offers the backend the opportunity to insert (somewhat) arbitrary code to manage the transition from GC-aware code to code that is not GC-aware and back.
In order to support the injection of transition code, this extension wraps the STATEPOINT ISD node generated by the usual lowering lowering with two additional nodes: GC_TRANSITION_START and GC_TRANSITION_END. The transition arguments that were passed passed to the intrinsic (if any) are lowered and provided as operands to these nodes and may be used by the backend during code generation.
Eventually, the lowering of the GC_TRANSITION_{START,END} nodes should be informed by the GC strategy in use for the function containing the intrinsic call; for now, these nodes are instead replaced with no-ops.
Differential Revision: http://reviews.llvm.org/D9501
llvm-svn: 236888
Improved the AnalyzeBranch, InsertBranch, and RemoveBranch
functions in order to handle more of our branch instructions.
This requires changes to analyzeCompare and PredicateInstructions.
Specifically, we've added support for new value compare jumps,
improved handling of endloop, added more compare instructions,
and improved support for predicate instructions.
Differential Revision: http://reviews.llvm.org/D9559
llvm-svn: 236876
The function 'getTargetShuffleMask' already knows how to deal with PSHUFB nodes
where the mask node is a load from constant pool, and the constant pool node
is wrapped by a X86ISD::Wrapper node. This patch extends that logic by teaching
it how to also look through X86ISD::WrapperRIP.
This helps function combineX86ShufflesRecusively to combine more shuffle
sequences containing PSHUFB nodes if we are in RIPRel PIC mode.
Before this change, llc (with -relocation-model=pic -march=x86-64) was unable
to decode a pshufb where the mask was loaded from a constant pool. For example,
the no-op shuffle from test 'x86-fold-pshufb.ll' was not folded into its
operand, so instead of generating a single 'movaps' the backend always
generated a sub-optimal 'movdqa + pshufb' sequence.
Added test x86-fold-pshufb.ll.
llvm-svn: 236863
Summary:
In microMIPS, labels need to know whether they are on code or data. This is
indicated with STO_MIPS_MICROMIPS and can be inferred by being followed
by instructions. For empty basic blocks, we can ensure this by emitting the
.insn directive after the label.
Also, this fixes some failures in our out-of-tree microMIPS buildbots, for the
exception handling regression tests under: SingleSource/Regression/C++/EH
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9530
llvm-svn: 236815
We were accidentally folding a sign/zero extend in to address arithmetic in a different BB when the extend wasn't available there.
Cross BB fast-isel isn't safe, so restrict this to only when the extend is in the same BB as the use.
llvm-svn: 236764
This patch corresponds to review:
http://reviews.llvm.org/D9440
It adds a new register class to the PPC back end to contain single precision
values in VSX registers. Additionally, it adds scalar loads and stores for
VSX registers.
llvm-svn: 236755
This is a follow-on to r236740 where I took Andrea's advice
in D9504 to remove a redundant pattern...except that I removed
the wrong pattern!
AFAICT, there is no change in the final code produced because
subsequent passes would clean up the extra instructions created
by the more complicated pattern.
llvm-svn: 236743
Finish the job that was abandoned in D6958 following the refactoring in
http://reviews.llvm.org/rL230221:
1. Uncomment the intrinsic def for the AVX r_Int instruction.
2. Add missing r_Int entries to the load folding tables; there are already
tests that check these in "test/Codegen/X86/fold-load-unops.ll", so I
haven't added any more in this patch.
3. Add patterns to solve PR21507 ( https://llvm.org/bugs/show_bug.cgi?id=21507 ).
So instead of this:
movaps %xmm0, %xmm1
rcpss %xmm1, %xmm1
movss %xmm1, %xmm0
We should now get:
rcpss %xmm0, %xmm0
And instead of this:
vsqrtss %xmm0, %xmm0, %xmm1
vblendps $1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]
We should now get:
vsqrtss %xmm0, %xmm0, %xmm0
Differential Revision: http://reviews.llvm.org/D9504
llvm-svn: 236740
http://reviews.llvm.org/D9517
The separate header file allows to reuse the MIPS ABI flags structure
constants in other LLVM tools like the llvm-readobj.
No functional changes.
llvm-svn: 236732
Added intrinsics for the instructions. CC parameter of the intrinsics was changed from i8 to i32 according to the spec.
By Igor Breger (igor.breger@intel.com)
llvm-svn: 236714
Summary: This will enable the IAS to reject floating point instructions if soft-float is enabled.
Reviewers: dsanders, echristo
Reviewed By: dsanders
Subscribers: jfb, llvm-commits, mpf
Differential Revision: http://reviews.llvm.org/D9053
llvm-svn: 236713
When folding a load in to another instruction, we need to fix the class of the index register
Otherwise, it could be something like GR64 not GR64_NOSP and would fail the machine verifier.
llvm-svn: 236644
The patch disabled unrolling in loop vectorization pass when VF==1 on x86 architecture,
by setting MaxInterleaveFactor to 1. Unrolling in loop vectorization pass may introduce
the cost of overflow check, memory boundary check and extra prologue/epilogue code when
regular unroller will unroll the loop another time. Disable it when VF==1 remove the
unnecessary cost on x86. The same can be done for other platforms after verifying
interleaving/memory bound checking to be not perf critical on those platforms.
Differential Revision: http://reviews.llvm.org/D9515
llvm-svn: 236613
With neon enabled, we reach SelectBinaryFPOp and are able to get registers for a <2 x double> add.
However, we shouldn't actually attempt arithmetic on it as ARMIselLowering says "v2f64 is legal so that QR subregs can be extracted as f64 elements, but neither Neon nor VFP support any arithmetic operations on it."
This commit disables SelectBinaryFPOp for any vector types. There's already a FIXME to try handle neon. Doing so would require fixing this conditional which isn't safe for vectors 'VT == MVT::f64 || VT == MVT::i64'
llvm-svn: 236609
The initial code drop for VSX swap optimization permitted the
optimization only when all operations in a web of related computation
are lane-insensitive. For some lane-sensitive operations, we can
still permit the optimization provided that we make adjustments to
those operations. This patch adds special handling for vector splats
so that their presence doesn't kill the optimization.
Vector splats are lane-sensitive since they identify by number a
vector element to be used as the source of a splat. When swap
optimizations take place, the desired vector element will move to the
opposite doubleword of the quadword vector. We thus replace the index
I by (I + N/2) % N, where N is the number of elements in the vector.
A new test case is added to test that swap optimization succeeds when
vector splats are present, and that the proper input element is used
as the source of the splat.
An ancillary change removes SH_BUILDVEC as one of the kinds of special
handling that may be required by VSX swap optimization. From
experience with GCC, I had expected to need some modifications for
vector build operations, but I did not find that to be the case.
llvm-svn: 236606
Since r234249, i1 are sext instead of zext; because of that, doing
"CMP rN, #0; IT EQ/NE" isn't correct anymore.
"TST #1" is the conservatively correct alternative - the tradeoff being
that it doesn't have a 16-bit encoding -, so use that instead.
llvm-svn: 236569
This patch adds the minimum plumbing necessary to use IR-level
fast-math-flags (FMF) in the backend without actually using
them for anything yet. This is a follow-on to:
http://reviews.llvm.org/rL235997
...which split the existing nsw / nuw / exact flags and FMF
into their own struct.
There are 2 structural changes here:
1. The main diff is that we're preparing to extend the optimization
flags to affect more than just binary SDNodes. Eg, IR intrinsics
( https://llvm.org/bugs/show_bug.cgi?id=21290 ) or non-binop nodes
that don't even exist in IR such as FMA, FNEG, etc.
2. The other change is that we're actually copying the FP fast-math-flags
from the IR instructions to SDNodes.
Differential Revision: http://reviews.llvm.org/D8900
llvm-svn: 236546
The register set for LDMIA begins at offset 3, not 4. We were previously
missing the short encoding of this instruction in the case where the base
register was the first register in the register set.
Also clean up some dead code:
- The isARMLowRegister check is redundant with what VerifyLowRegs does;
replace with an assert.
- Remove handling of LDMDB instruction, which has no short encoding (and
does not appear in ReduceTable).
Differential Revision: http://reviews.llvm.org/D9485
llvm-svn: 236535
This adds intrinsics to allow access to all of the z13 vector instructions.
Note that instructions whose semantics can be described by standard LLVM IR
do not get any intrinsics.
For each instructions whose semantics *cannot* (fully) be described, we
define an LLVM IR target-specific intrinsic that directly maps to this
instruction.
For instructions that also set the condition code, the LLVM IR intrinsic
returns the post-instruction CC value as a second result. Instruction
selection will attempt to detect code that compares that CC value against
constants and use the condition code directly instead.
Based on a patch by Richard Sandiford.
llvm-svn: 236527
The ABI specifies that <1 x i128> and <1 x fp128> are supposed to be
passed in vector registers. We do not yet support those types, and
some infrastructure is missing before we can do so.
In order to prevent accidentally generating code violating the ABI,
this patch adds checks to detect those types and error out if user
code attempts to use them.
llvm-svn: 236526
The ABI allows sub-128 vectors to be passed and returned in registers,
with the vector occupying the upper part of a register. We therefore
want to legalize those types by widening the vector rather than promoting
the elements.
The patch includes some simple tests for sub-128 vectors and also tests
that we can recognize various pack sequences, some of which use sub-128
vectors as temporary results. One of these forms is based on the pack
sequences generated by llvmpipe when no intrinsics are used.
Signed unpacks are recognized as BUILD_VECTORs whose elements are
individually sign-extended. Unsigned unpacks can have the equivalent
form with zero extension, but they also occur as shuffles in which some
elements are zero.
Based on a patch by Richard Sandiford.
llvm-svn: 236525
The z13 vector facility includes some instructions that operate only on the
high f64 in a v2f64, effectively extending the FP register set from 16
to 32 registers. It's still better to use the old instructions if the
operands happen to fit though, since the older instructions have a shorter
encoding.
Based on a patch by Richard Sandiford.
llvm-svn: 236524
The architecture doesn't really have any native v4f32 operations except
v4f32->v2f64 and v2f64->v4f32 conversions, with only half of the v4f32
elements being used. Even so, using vector registers for <4 x float>
and scalarising individual operations is much better than generating
completely scalar code, since there's much less register pressure.
It's also more efficient to do v4f32 comparisons by extending to 2
v2f64s, comparing those, then packing the result.
This particularly helps with llvmpipe.
Based on a patch by Richard Sandiford.
llvm-svn: 236523
This adds ABI and CodeGen support for the v2f64 type, which is natively
supported by z13 instructions.
Based on a patch by Richard Sandiford.
llvm-svn: 236522
This the first of a series of patches to add CodeGen support exploiting
the instructions of the z13 vector facility. This patch adds support
for the native integer vector types (v16i8, v8i16, v4i32, v2i64).
When the vector facility is present, we default to the new vector ABI.
This is characterized by two major differences:
- Vector types are passed/returned in vector registers
(except for unnamed arguments of a variable-argument list function).
- Vector types are at most 8-byte aligned.
The reason for the choice of 8-byte vector alignment is that the hardware
is able to efficiently load vectors at 8-byte alignment, and the ABI only
guarantees 8-byte alignment of the stack pointer, so requiring any higher
alignment for vectors would require dynamic stack re-alignment code.
However, for compatibility with old code that may use vector types, when
*not* using the vector facility, the old alignment rules (vector types
are naturally aligned) remain in use.
These alignment rules are not only implemented at the C language level
(implemented in clang), but also at the LLVM IR level. This is done
by selecting a different DataLayout string depending on whether the
vector ABI is in effect or not.
Based on a patch by Richard Sandiford.
llvm-svn: 236521
This patch adds support for the z13 processor type and its vector facility,
and adds MC support for all new instructions provided by that facilily.
Apart from defining the new instructions, the main changes are:
- Adding VR128, VR64 and VR32 register classes.
- Making FP64 a subclass of VR64 and FP32 a subclass of VR32.
- Adding a D(V,B) addressing mode for scatter/gather operations
- Adding 1-, 2-, and 3-bit immediate operands for some 4-bit fields.
Until now all immediate operands have been the same width as the
underlying field (hence the assert->return change in decode[SU]ImmOperand).
In addition, sys::getHostCPUName is extended to detect running natively
on a z13 machine.
Based on a patch by Richard Sandiford.
llvm-svn: 236520
This reverts commit r236360.
This change exposed a bug in WinEHPrepare by opting win32 code into EH
preparation. We already knew that WinEHPrepare has bugs, and is the
status quo for x64, so I don't think that's a reason to hold off on this
change. I disabled exceptions in the sanitizer tests in r236505 and an
earlier revision.
llvm-svn: 236508
This patch introduces a new pass that computes the safe point to insert the
prologue and epilogue of the function.
The interest is to find safe points that are cheaper than the entry and exits
blocks.
As an example and to avoid regressions to be introduce, this patch also
implements the required bits to enable the shrink-wrapping pass for AArch64.
** Context **
Currently we insert the prologue and epilogue of the method/function in the
entry and exits blocks. Although this is correct, we can do a better job when
those are not immediately required and insert them at less frequently executed
places.
The job of the shrink-wrapping pass is to identify such places.
** Motivating example **
Let us consider the following function that perform a call only in one branch of
a if:
define i32 @f(i32 %a, i32 %b) {
%tmp = alloca i32, align 4
%tmp2 = icmp slt i32 %a, %b
br i1 %tmp2, label %true, label %false
true:
store i32 %a, i32* %tmp, align 4
%tmp4 = call i32 @doSomething(i32 0, i32* %tmp)
br label %false
false:
%tmp.0 = phi i32 [ %tmp4, %true ], [ %a, %0 ]
ret i32 %tmp.0
}
On AArch64 this code generates (removing the cfi directives to ease
readabilities):
_f: ; @f
; BB#0:
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
LBB0_2: ; %false
mov sp, x29
ldp x29, x30, [sp], #16
ret
With shrink-wrapping we could generate:
_f: ; @f
; BB#0:
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
add sp, x29, #16 ; =16
ldp x29, x30, [sp], #16
LBB0_2: ; %false
ret
Therefore, we would pay the overhead of setting up/destroying the frame only if
we actually do the call.
** Proposed Solution **
This patch introduces a new machine pass that perform the shrink-wrapping
analysis (See the comments at the beginning of ShrinkWrap.cpp for more details).
It then stores the safe save and restore point into the MachineFrameInfo
attached to the MachineFunction.
This information is then used by the PrologEpilogInserter (PEI) to place the
related code at the right place. This pass runs right before the PEI.
Unlike the original paper of Chow from PLDI’88, this implementation of
shrink-wrapping does not use expensive data-flow analysis and does not need hack
to properly avoid frequently executed point. Instead, it relies on dominance and
loop properties.
The pass is off by default and each target can opt-in by setting the
EnableShrinkWrap boolean to true in their derived class of TargetPassConfig.
This setting can also be overwritten on the command line by using
-enable-shrink-wrap.
Before you try out the pass for your target, make sure you properly fix your
emitProlog/emitEpilog/adjustForXXX method to cope with basic blocks that are not
necessarily the entry block.
** Design Decisions **
1. ShrinkWrap is its own pass right now. It could frankly be merged into PEI but
for debugging and clarity I thought it was best to have its own file.
2. Right now, we only support one save point and one restore point. At some
point we can expand this to several save point and restore point, the impacted
component would then be:
- The pass itself: New algorithm needed.
- MachineFrameInfo: Hold a list or set of Save/Restore point instead of one
pointer.
- PEI: Should loop over the save point and restore point.
Anyhow, at least for this first iteration, I do not believe this is interesting
to support the complex cases. We should revisit that when we motivating
examples.
Differential Revision: http://reviews.llvm.org/D9210
<rdar://problem/3201744>
llvm-svn: 236507
It adds v1i128 to the appropriate register classes and checks parameter passing
and return values.
This is related to http://reviews.llvm.org/D9081, which will add instructions
that exploit the v1i128 datatype.
Phabricator review: http://reviews.llvm.org/D9475
llvm-svn: 236503
Summary:
When using the N64 ABI, element-indices use the i64 type instead of i32.
In many cases, we can use iPTR to account for this but additional patterns
and pseudo's are also required.
This fixes most (but not quite all) failures in the test-suite when using
N64 and MSA together.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9342
llvm-svn: 236494
When forming an IT block from the first MOV here:
%R2<def> = t2MOVr %R0, pred:1, pred:%CPSR, opt:%noreg
%R3<def> = tMOVr %R0<kill>, pred:14, pred:%noreg
the move in to R3 is moved out of the IT block so that later instructions on the same predicate can be inside this block, and we can share the IT instruction.
However, when moving the R3 copy out of the IT block, we need to clear its kill flags for anything in use at this point in time, ie, R0 here.
This appeases the machine verifier which thought that R0 wasn't defined when used.
I have a test case, but its extremely register allocator specific. It would be too fragile to commit a test which depends on the register allocator here.
llvm-svn: 236468
At the moment, all subregs defined by the SystemZ target can be modified
independently of the wider register. E.g. writing to a GR32 does not
change the upper 32 bits of the GR64. Writing to an FP32 does not change
the lower 32 bits of the FP64.
Hoewver, the upcoming support for the vector extension redefines FP64 as
one half of a V128. Floating-point operations leave the other half of
a V128 in an unpredictable state, so it's no longer the case that writing
to an FP32 leaves the bits of the underlying register (the V128) alone.
I'd prefer to have separate subreg_ names for this situation, so that
it's obvious at a glance whether we're talking about a subreg that leaves
the other parts of the register alone.
No behavioral change intended.
Patch originally by Richard Sandiford.
llvm-svn: 236433
We know what MemoryKind an operand has at the time we construct it,
so we might as well just record it in an unused part of the structure.
This makes it easier to add scatter/gather addresses later.
No behavioral change intended.
Patch originally by Richard Sandiford.
llvm-svn: 236432
It seems SystemZTargetLowering::getTargetNodeName got out of sync with
some recent changes to the SystemZISD opcode list. Add back all the
missing opcodes (and re-sort to the same order as SystemISelLowering.h).
llvm-svn: 236430
Removed code that was replicating v8i16 'shift + mask' implementation that is done more nicely by making use of LowerScalarImmediateShift
llvm-svn: 236388
This pass is responsible for constructing the EH registration object
that gets linked into fs:00, which is all it does in this change. In the
future, it will also insert stores to update the EH state number.
I considered keeping this functionality in WinEHPrepare, but it's pretty
separable and X86 specific. It has conceptually very little to do with
the task of WinEHPrepare, which is currently outlining. WinEHPrepare is
also in theory useful on ARM, but this logic is pretty x86 specific.
Reviewers: andrew.w.kaylor, majnemer
Differential Revision: http://reviews.llvm.org/D9422
llvm-svn: 236339
Converting from t2LDRs to tLDRr caused the shift argument to drop the internal flag. This would then throw machine verifier errors.
Unfortunately i'm having trouble reducing a test case. I'm going to keep trying, but so far its a scary combination of machine sinking, an 'and i1', loads feeding loads, and a bunch of code which shouldn't change IT block formation, but does. Its not useful to commit a test in that state as we have no way of knowing if it even hits this code reliably in future.
rdar://problem/20752113
llvm-svn: 236333
Functions with jump tables need an alignment of 4 because they use the ADR
instruction, which aligns the PC to 4 bytes before adding an offset.
Differential Revision: http://reviews.llvm.org/D9424
llvm-svn: 236327
Summary:
LI should never accept immediates larger than 32 bits.
The additional Is32BitImm boolean also paves the way for unifying the functionality that LA and LI have in common.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9289
llvm-svn: 236313
Summary:
Generate one DSLL32 of 0 instead of two consecutive DSLL of 16.
In order to do this I had to change createLShiftOri's template argument from a bool to an unsigned.
This also gave me the opportunity to rewrite the mips64-expansions.s test, as it was testing the same cases multiple times and skipping over other cases.
It was also somewhat unreadable, as the CHECK lines were grouped in a huge block of text at the beginning of the file.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8974
llvm-svn: 236311
This pass was generating 'Instruction does not dominate all uses!'
errors for programs which had loops with a condition variable that
depended on the result of a phi instruction from outside of the loop.
The pass was inserting new phi nodes outside of the loop which used values
defined inside the loop.
http://bugs.freedesktop.org/show_bug.cgi?id=90056
llvm-svn: 236306
If we move an instruction from one block down to a MOVC and predicate it,
then the original instruction could be moved in to a loop. In this case,
its invalid for any kill flags to remain on there.
Fails with -verfy-machineinstrs.
rdar://problem/20752113
llvm-svn: 236290
The expansion for t2ABS was always setting the kill flag on the rsb instruction.
It should instead only be set on rsb if it was set on the original ABS instruction.
rdar://problem/20752113
llvm-svn: 236272
This helps reduce the frequency of stack realignment prologues in 32-bit
X86 Windows code. Before this change and the corresponding clang change,
we would take the max of the type preferred alignment and the explicit
alignment on the alloca.
If you don't override aggregate alignment in datalayout, you get a
default of 8. This dates back to 2007 / r34356, and changing it seems
prohibitively difficult at this point.
llvm-svn: 236270
temporary.
Because of that:
1. The machine verifier was complaining on such code.
2. The generate code worked just because the thumb reduction size pass fixed the
opcode.
rdar://problem/20749824
llvm-svn: 236247
changes:
Don't apply on hexagon and NVPTX since they no longer claim to support UADDO/USUBO
Add location to getConstant
Drop comment about the ops being turned into expand
llvm-svn: 236240
This was breaking sqlite with the machine verifier because operand 0 was a def according to tablegen, but didn't have the 'isDef' flag set.
Looking at the ISA, its clear that this operand is a source as writing to st(0) is implicit. So move the operand to the correct place in the td file.
rdar://problem/20751584
llvm-svn: 236183
There's probably no way to test BXJ, but if the compiler ever did emit it
during CodeGen it would have to be a block terminator so "isBranch" is
appropriate.
BLX is more tricky. Clearly a call, but it affects surprisingly little.
rdar://18719544
llvm-svn: 236140
x86 Windows uses the '_' prefix for all global symbols, and this was
mistakenly being applied to frameescape labels, which are not externally
visible global symbols. They use the private global prefix 'L'.
The *right* way to fix this is probably to stop masquerading this label
as an ExternalSymbol and create a new SDNode type. These labels are not
"external", and we know they will be resolved by assembly time. Having a
custom SDNode type would allow us to do better X86 address mode
matching, so it's probably worth doing eventually.
llvm-svn: 236123
Finish off PR23080 by renaming the debug info IR constructs from `MD*`
to `DI*`. The last of the `DIDescriptor` classes were deleted in
r235356, and the last of the related typedefs removed in r235413, so
this has all baked for about a week.
Note: If you have out-of-tree code (like a frontend), I recommend that
you get everything compiling and tests passing with the *previous*
commit before updating to this one. It'll be easier to keep track of
what code is using the `DIDescriptor` hierarchy and what you've already
updated, and I think you're extremely unlikely to insert bugs. YMMV of
course.
Back to *this* commit: I did this using the rename-md-di-nodes.sh
upgrade script I've attached to PR23080 (both code and testcases) and
filtered through clang-format-diff.py. I edited the tests for
test/Assembler/invalid-generic-debug-node-*.ll by hand since the columns
were off-by-three. It should work on your out-of-tree testcases (and
code, if you've followed the advice in the previous paragraph).
Some of the tests are in badly named files now (e.g.,
test/Assembler/invalid-mdcompositetype-missing-tag.ll should be
'dicompositetype'); I'll come back and move the files in a follow-up
commit.
llvm-svn: 236120
Reg+%g0 is preferred to Reg+imm0 by the manual, and is what GCC produces.
Futhermore, reg+imm is invalid for the (not yet supported) "alternate
address space" instructions.
Differential Revision: http://reviews.llvm.org/D8753
llvm-svn: 236107
Summary:
The existing code was correct for 32-bit GPR's but not 64-bit GPR's. It now
accounts for both cases.
Reviewers: vkalintiris
Reviewed By: vkalintiris
Subscribers: llvm-commits, mohit.bhakkad, sagar
Differential Revision: http://reviews.llvm.org/D9337
llvm-svn: 236099
Summary:
Do the assemble-time shifts from createLShiftOri at the source, which groups all the shifting together, closer to the main logic path, and
store the results in concisely-named variables to improve code clarity.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8973
llvm-svn: 236096
We were trying to look through COPY instructions, but only to the next
instruction in a BB and incorrectly anyway. The cases where that would actually
be a good idea are rare enough (and not even tested!) that it's not worth
trying to get right.
rdar://20721342
llvm-svn: 236050
We don't need codegen-only intrinsic instructions for the vector forms of these instructions.
This makes the reciprocal estimate instruction lowering identical to how we handle normal
square roots: (V)SQRTPS / (V)SQRTPD.
No existing regression tests fail with this patch.
Differential Revision: http://reviews.llvm.org/D9301
llvm-svn: 236013
Fixes a crash with basically any OpenGL application using the radeonsi
driver.
Patch by: Michel Dänzer
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=90176
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 236004
llc converts all feature strings to lower case, while the LLVM C API
does not, so we need a lower case alias in order to test this with llc.
llvm-svn: 236003
We need to track if an AddrSpaceCast expression was seen when
generating an MCExpr for a ConstantExpr. This change introduces a
custom lowerConstant method to the NVPTX asm printer that will create
NVPTXGenericMCSymbolRefExpr nodes at the appropriate places to encode
the information that a given symbol needs to be casted to a generic
address.
llvm-svn: 236000
This is a preliminary step to using the IR-level floating-point fast-math-flags in the SDAG (D8900).
In this patch, we introduce the optimization flags as their own struct. As noted in the TODO comment,
we should eventually share this data between the IR passes and the backend.
We also switch the existing nsw / nuw / exact bit functionality of the BinaryWithFlagsSDNode class to
use the new struct.
The tradeoff is that instead of using the free but limited space of SDNode's SubclassData, we add a
data member to the subclass. This means we don't have to repeat all of the get/set methods per flag,
but we're potentially adding size to all nodes of this subclassi type.
In practice on 64-bit systems (measured on Linux and MacOS X), there is no size difference between an
SDNode and BinaryWithFlagsSDNode after this change: they're both 80 bytes. This means that we had at
least one free byte to play with due to struct alignment.
Differential Revision: http://reviews.llvm.org/D9325
llvm-svn: 235997
Summary: If the immediate is 0, the ORi is pointless.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8969
llvm-svn: 235990
[DebugInfo] Add debug locations to constant SD nodes
This adds debug location to constant nodes of Selection DAG and updates
all places that create constants to pass debug locations
(see PR13269).
Can't guarantee that all locations are correct, but in a lot of cases choice
is obvious, so most of them should be. At least all tests pass.
Tests for these changes do not cover everything, instead just check it for
SDNodes, ARM and AArch64 where it's easy to get incorrect locations on
constants.
This is not complete fix as FastISel contains workaround for wrong debug
locations, which drops locations from instructions on processing constants,
but there isn't currently a way to use debug locations from constants there
as llvm::Constant doesn't cache it (yet). Although this is a bit different
issue, not directly related to these changes.
Differential Revision: http://reviews.llvm.org/D9084
llvm-svn: 235989
Summary: The new name is more accurate with regard to the functionality.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8968
llvm-svn: 235984
Summary: This removes multiple calls to getReg() and saves us column space in the source file.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8924
llvm-svn: 235978
This adds debug location to constant nodes of Selection DAG and updates
all places that create constants to pass debug locations
(see PR13269).
Can't guarantee that all locations are correct, but in a lot of cases choice
is obvious, so most of them should be. At least all tests pass.
Tests for these changes do not cover everything, instead just check it for
SDNodes, ARM and AArch64 where it's easy to get incorrect locations on
constants.
This is not complete fix as FastISel contains workaround for wrong debug
locations, which drops locations from instructions on processing constants,
but there isn't currently a way to use debug locations from constants there
as llvm::Constant doesn't cache it (yet). Although this is a bit different
issue, not directly related to these changes.
Differential Revision: http://reviews.llvm.org/D9084
llvm-svn: 235977
This matches other assemblers and is less unexpected (e.g. PR23227).
On ELF, I tried binutils gas v2.24 and nasm 2.10.09, and they both
agree on LShr. On COFF, I couldn't get my hands on an assembler yet,
so don't change the behavior. For now, don't change it on non-AArch64
Darwin either, as the other assembler is gas v1.38, which does an AShr.
llvm-svn: 235963
After legalization, scalar SETCC has an i32 result type on AArch64.
The i1 requirement seems too conservative, replace it with an assert.
This also means that we now can run after legalization. That should also
be fine, since the ops legalizer runs again after each combine, and
all types created all have the same sizes as the (legal) inputs.
Exposed by r235917; while there, robustize its tests (bsl also uses the
register it defines).
llvm-svn: 235922
When the setcc has f64 operands, we can't build a vector setcc mask
to feed a vselect, because f64 doesn't divide v3f32 evenly.
Just bail out when that happens.
llvm-svn: 235917
This patch adds a new SSA MI pass that runs on little-endian PPC64
code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors
without alignment constraints are accomplished for little-endian using
lxvd2x/xxswapd and xxswapd/stxvd2x. The existence of the additional
xxswapd instructions hurts performance in comparison with big-endian
code, but they are necessary in the general case to support correct
semantics.
However, the general case does not apply to most vector code. Many
vector instructions are lane-insensitive; they do not "care" which
lanes the parallel computations are performed within, provided that
the resulting data is stored into the correct locations. Thus this
pass looks for computations that perform only lane-insensitive
operations, and remove the unnecessary swaps from loads and stores in
such computations.
Future improvements will allow computations using certain
lane-sensitive operations to also be optimized in this manner, by
modifying the lane-sensitive operations to account for the permuted
order of the lanes. However, this patch only adds the infrastructure
to permit this; no lane-sensitive operations are optimized at this
time.
This code is heavily exercised by the various vectorizing applications
in the projects/test-suite tree. For the time being, I have only added
one simple test case to demonstrate what the pass is doing. Although
it is quite simple, it provides coverage for much of the code,
including the special case handling of copies and subreg-to-reg
operations feeding the swaps. I plan to add additional tests in the
future as I fill in more of the "special handling" code.
Two existing tests were affected, because they expected the swaps to
be present, but they are now removed.
llvm-svn: 235910
Use a loop instruction with a constant extender for a hardware
loop instruction that is too far away from the start of the loop.
This is cheaper than changing the SA register value.
Differential Revision: http://reviews.llvm.org/D9262
llvm-svn: 235882
Summary:
Changed the warning message to show the current value of $at, similar to what clang does for typedef's, and renamed warnIfAssemblerTemporary to a more descriptive name.
I also changed the type of variables which store registers from int to unsigned, updated the relevant test and tried to make the related comments clearer.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8479
llvm-svn: 235881
This reapplies r235194, which was reverted in r235495 because it was causing a
failure in our out-of-tree buildbots for MIPS. With the sign-extension patch
in r235718, this patch doesn't cause any problem any more.
llvm-svn: 235878
Patch to allow int8 vectors to be multiplied on the SSE unit instead of being scalarized.
The patch sign extends the i8 lanes to i16, uses the SSE2 pmullw multiplication instruction, then packs the lower byte from each result.
Differential Revision: http://reviews.llvm.org/D9115
llvm-svn: 235837
Summary:
Perform integer extension only when the destination type is one of
i8, i16 & i32 and when the source type is i1, i8 or i16. For other
combinations we fall back to SelectionDAG.
This fixes the test MultiSource/Benchmarks/7zip that was failing in our
out-of-tree MIPS buildbots.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9243
llvm-svn: 235718
Summary:
Fixes a bug in the NVPTX codegen. The code used to miss necessary "generic()"
on aggregates of addrspacecasts.
Test Plan: addrspacecast-gvar.ll
Reviewers: eliben, jholewinski
Reviewed By: jholewinski
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D9130
llvm-svn: 235689
Match binutils by supporting the optional register name prefix for new vector
registers ("vs" for VSX registers and "q" for QPX registers).
llvm-svn: 235665
Add assembler/disassembler support for dcbt/dcbtst (and aliases) with the hint
field specified (non-zero). Unforunately, the syntax for this instruction is
special in that it differs for server vs. embedded cores:
dcbt ra, rb, th [server]
dcbt th, ra, rb [embedded]
where th can be omitted when it is 0. dcbtst is the same. Thus we need to play
games in the parser and the printer to flip the operands around on the embedded
cores. We'll use the server syntax as the default (binutils currently uses the
embedded form by default, but IBM is changing that).
We also stop marking dcbtst as having unmodeled side effects (this is not
necessary, it is just a hint like dcbt -- noticed by inspection, so no separate
test case).
llvm-svn: 235657
When the base register index of the vector plus the constant offset
was less than zero, we were passing the wrong base register to the indirect
addressing instruction.
In this case, we need to set the base register to v0 and then add
the computed (negative) index to m0.
llvm-svn: 235641
The order in which branches appear in ImmBranches is approximately their
order within the function body. By visiting later branches first, we reduce
the distance between earlier forward branches and their targets, making it
more likely that the cbn?z optimization, which can only apply to forward
branches, will succeed for those earlier branches.
Differential Revision: http://reviews.llvm.org/D9185
llvm-svn: 235640
In particular, this preserves the kill flag, which allows the Thumb2 cbn?z
optimization to be applied in cases where a branch has been re-created after
the live variables analysis pass, e.g. by the machine block placement pass.
This appears to be low risk; a number of other targets seem to already be
doing something similar, e.g. AArch64, PowerPC.
Differential Revision: http://reviews.llvm.org/D9184
llvm-svn: 235639
This allows the constant island pass to lower these branches to cbn?z
instructions, resulting in a shorter instruction sequence.
Differential Revision: http://reviews.llvm.org/D9183
llvm-svn: 235638
This makes it more likely that we can use the 16-bit push and pop instructions
on Thumb-2, saving around 4 bytes per function.
Differential Revision: http://reviews.llvm.org/D9165
llvm-svn: 235637
This appears to have been introduced back in r76698 as part of an unrelated
change. I can find no official ARM documentation stating that Thumb-2 functions
require 4-byte alignment; in fact, ARM documentation appears to contradict
this (see, e.g., ARM Architecture Reference Manual Thumb-2 Supplement,
section 2.6.1: "Thumb-2 enforces 16-bit alignment on all instructions.").
Also remove code that sets alignment for ARM functions, which is redundant
with code in the MachineFunction constructor, and remove the hidden
-arm-align-constant-islands flag, which has been enabled by default since
r146739 (Dec 2011) and has probably received sufficient testing by now.
Differential Revision: http://reviews.llvm.org/D9138
llvm-svn: 235636
Summary:
We pick this order because SeparateConstOffsetFromGEP may create more
opportunities for SLSR.
Test Plan:
reassociate-geps-and-slsr.ll
no performance regression on internal benchmarks
Reviewers: meheff
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D9230
llvm-svn: 235632
TableGen had been nicely generating code to print a number of instructions using
shorter aliases (and PowerPC has plenty of short mnemonics), but we were not
calling it. For some of the aliases we support in the parser, TableGen can't
infer the "inverse" alias relationship, so there is still more to do.
Thus, after some hours of updating test cases...
llvm-svn: 235616
Summary:
Constant stores of f16 vectors can create NvCast nodes from various
operand types to v4f16 or v8f16 depending on patterns in the stored
constants. This patch adds nvcast rules with v4f16 and v8f16 values.
AArchISelLowering::LowerBUILD_VECTOR has the details on which constant
patterns generate the nvcast nodes.
Reviewers: jmolloy, srhines, ab
Subscribers: rengolin, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D9201
llvm-svn: 235610
Summary:
Set operation action for SINT_TO_FP and UINT_TO_FP nodes with v4i32,
v8i8, v8i16 inputs to allow promotion of v4f16 results.
Add tests for sitofp and uitofp for vec4, vec8, vec16, and i8, i16, i32,
and i64 vectors. Only missing tests are for v16i8 and v16i16 as the
shift operations are too complicated to write a proper check sequence.
The conversions from v4i64 to v4f16 do not depend on this patch - v4i64
is split and the conversion gets handled while lowering v2i64. I am
adding a test here for completeness.
Reviewers: aemerson, rengolin, ab, jmolloy, srhines
Subscribers: rengolin, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D9166
llvm-svn: 235609
Third time's the charm. The previous commit was reverted as a
reverse for-loop in SelectionDAGBuilder::lowerWorkItem did 'I--'
on an iterator at the beginning of a vector, causing asserts
when using debugging iterators. This commit fixes that.
llvm-svn: 235608
This is a re-commit of r235101, which also fixes the problems with the previous patch:
- Switches with only a default case and non-fallthrough were handled incorrectly
- The previous patch tickled a bug in PowerPC Early-Return Creation which is fixed here.
> This is a major rewrite of the SelectionDAG switch lowering. The previous code
> would lower switches as a binary tre, discovering clusters of cases
> suitable for lowering by jump tables or bit tests as it went along. To increase
> the likelihood of finding jump tables, the binary tree pivot was selected to
> maximize case density on both sides of the pivot.
>
> By not selecting the pivot in the middle, the binary trees would not always
> be balanced, leading to performance problems in the generated code.
>
> This patch rewrites the lowering to search for clusters of cases
> suitable for jump tables or bit tests first, and then builds the binary
> tree around those clusters. This way, the binary tree will always be balanced.
>
> This has the added benefit of decoupling the different aspects of the lowering:
> tree building and jump table or bit tests finding are now easier to tweak
> separately.
>
> For example, this will enable us to balance the tree based on profile info
> in the future.
>
> The algorithm for finding jump tables is quadratic, whereas the previous algorithm
> was O(n log n) for common cases, and quadratic only in the worst-case. This
> doesn't seem to be major problem in practice, e.g. compiling a file consisting
> of a 10k-case switch was only 30% slower, and such large switches should be rare
> in practice. Compiling e.g. gcc.c showed no compile-time difference. If this
> does turn out to be a problem, we could limit the search space of the algorithm.
>
> This commit also disables all optimizations during switch lowering in -O0.
>
> Differential Revision: http://reviews.llvm.org/D8649
llvm-svn: 235560
The CondOpt pass currently uses LiveIntervals to set the dead flag on a def. This patch uses MachineRegisterInfo::use_empty instead as that is equivalent to the def being dead.
This removes an instance of LiveIntervals in the pass manager pipeline and saves 3.8% of compile time on llc conpiled for AArch64.
Reviewed by Chad Rosier and Zhaoshi.
llvm-svn: 235532
This fixes a regression introduced at revision 218263.
On AVX, if we optimize for size, a splat build_vector of a load
is lowered into a VBROADCAST node. This is done even if the value type of the
splat build_vector node is v2i64.
Since AVX doesn't support v2f64/v2i64 broadcasts, revision 218263 added two
extra tablegen patterns to allow selecting a VMOVDDUPrm from an X86VBroadcast
where the scalar element comes from a loadi64/loadf64.
However, revision 218263 forgot to add an extra fallback pattern for the case
where we have a X86VBroadcast of a loadi64 with multiple uses.
This patch adds the missing tablegen pattern in X86InstrSSE.td.
This patch also adds an extra test to 'splat-for-size.ll' to verify that ISel
doesn't crash with a 'fatal error in the backend' due to a missing AVX pattern
to select v2i64 X86ISD::BROADCAST nodes.
llvm-svn: 235509
Enough concerns were raised that this optimization is pessimising some code patterns.
The obvious fix, to add a Reassociate run afterwards, causes even more pessimisation in some cases due to fewer complex addressing modes being matched. As there isn't a trivial fix for this, backing this out by default until someone gets a chance to fix the addressing mode matcher.
llvm-svn: 235491
X86 backend.
The code generated for symbolic targets is identical to the code generated for
constant targets, except that a relocation is emitted to fix up the actual
target address at link-time. This allows IR and object files containing
patchpoints to be cached across JIT-invocations where the target address may
change.
llvm-svn: 235483
With SSE2, we can generate a 'movq' or other 64-bit store op on a 32-bit system
even though 64-bit integers are not legal types.
So instead of producing this:
pshufd $229, %xmm0, %xmm1 ## xmm1 = xmm0[1,1,2,3]
movd %xmm0, (%eax)
movd %xmm1, 4(%eax)
We can do:
movq %xmm0, (%eax)
This is a fix for the problem noted in D7296.
Differential Revision: http://reviews.llvm.org/D9134
llvm-svn: 235460
Summary:
With D9096 and D9101, there's no need to run DCE after SLSR and
SeparateConstOffsetFromGEP.
Test Plan: no regression
Reviewers: jholewinski, meheff
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D9172
llvm-svn: 235415
There doesn't seem to be a reason to perform this target ISD node matching
in an DAGCombine, moving it to lowering fixes PR23296.
Differential Revision: http://reviews.llvm.org/D9137
llvm-svn: 235394
Summary:
This directive is exactly the same as .asciz, except it's only used by MIPS.
It is used to store null terminated strings in object files.
Reviewers: rafael, dsanders, echristo
Reviewed By: dsanders, echristo
Subscribers: echristo, llvm-commits
Differential Revision: http://reviews.llvm.org/D7530
llvm-svn: 235382
Summary:
The 64-bit version of the variable shift instructions uses the
shift_rotate_reg class which uses a GPR32Opnd to specify the variable
shift amount. With this patch we avoid the generation of a redundant
SLL instruction for the variable shift instructions in 64-bit targets.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7413
llvm-svn: 235376
This is an updated version of Chandler's patch D7402 that got accepted but never committed, and has bit-rotted a bit since.
I've updated the execution domain declarations to match the approach of the packed templates and also added some extra scalar unary tests.
Differential Revision: http://reviews.llvm.org/D9095
llvm-svn: 235372
X86ISD::ADDSUB, X86ISD::(F)HADD, X86ISD::(F)HSUB should not be selected
if the operand types do not match the result type because vector type
legalization cannot deal with this for custom nodes.
Testcase X86ISD::ADDSUB is attached. I could not create a testcase for
the FHADD/FHSUB cases because of: https://llvm.org/bugs/show_bug.cgi?id=23296
Differential Revision: http://reviews.llvm.org/D9120
llvm-svn: 235367
Summary:
Set operation action for FP16 conversion opcodes, so the Op legalizer
can choose the gnu_* libcalls for Mips.
Set LoadExtAction and TruncStoreAction for f16 scalars and vectors to
prevent (fpext (load )) and (store (fptrunc)) from getting combined into
unsupported operations.
Added test cases to test that these operations are handled correctly
for f16 scalars and vectors. This patch depends on
http://reviews.llvm.org/D8755.
Reviewers: srhines
Subscribers: llvm-commits, ab
Differential Revision: http://reviews.llvm.org/D8804
llvm-svn: 235341
This fixes a regression introduced at revision 231243.
The target-independent selection algorithm in FastISel knows how to select
a SINT_TO_FP if the target is SSE but not AVX. That is because on X86, the
tablegen'd 'fastEmit' functions know how to select CVTSI2SSrr and CVTSI2SDrr.
Method X86FastISel::X86SelectSIToFP was therefore working under the
wrong assumption that the target was AVX. That assumption was incorrect since
we can have a target that is neither AVX nor SSE.
So, rather than asserting for the presence of AVX, we should have had an
early exit from 'X86SelectSIToFP' if the target was not AVX.
This patch fixes the issue replacing the invalid assertion with an early exit.
Thanks to Dimitry Andric for reporting this problem and for providing a small
reproducible testcase. Added test pr23273.ll.
llvm-svn: 235295
The fix ensures that scalar sources inserted into a vector are the correct bit size.
Integer scalar sources from BUILD_VECTOR and SCALAR_TO_VECTOR nodes may require truncation that this function doesn't currently support.
llvm-svn: 235281
The result is either an Untyped reg sequence, on ldN with N > 1, or
just the type of the input vector, on ld1. Don't force Untyped.
Instead, just use the type of the reg sequence.
This mirrors the behavior of createTuple, which feeds the LD1*_POST.
The narrow code path wasn't actually covered by tests, because V64
insert_vector_elt are widened to V128 before the LD1LANEpost combine
has the chance to run, usually.
The only case where it does run on V64 vectors is if the vector ops
legalizer ran. So, tickle the code with a ctpop.
Fixes PR23265.
llvm-svn: 235243
Summary: Implement the method FastMaterializeAlloca in Mips fast-isel
Based on a patch by Reed Kotler.
Test Plan:
Passes test-suite at O0/O2 for mips32 r1/r2
fastalloca.ll
Reviewers: dsanders, rkotler
Subscribers: rfuhler, llvm-commits
Differential Revision: http://reviews.llvm.org/D6742
llvm-svn: 235213
Summary:
Add shift operators implementation to fast-isel for Mips. These are shift ops
for non legal forms, i.e. i8 and i16.
Based on a patch by Reed Kotler.
Test Plan:
Reviewers: dsanders
Subscribers: echristo, rfuhler, llvm-commits
Differential Revision: http://reviews.llvm.org/D6726
llvm-svn: 235194
Summary:
Previously, the presence of KILL instructions would block valid candidates
from filling a specific delay slot. With the elimination of the KILL
instructions, in the appropriate range, we are able to fill more slots and
keep the information from future def/use analysis consistent.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: hfinkel, llvm-commits
Differential Revision: http://reviews.llvm.org/D7724
llvm-svn: 235183
Summary:
For example, a common idiom was 'isN64 ? Mips::SP_64 : Mips::SP'. This has
been moved to MipsABIInfo and replaced with 'ABI.GetStackPtr()'.
There are others that should also be moved. This patch sticks to the ones that
are obviously non-functional. The others have minor mistakes that need fixing
at the same time, mostly involving checks for 64-bit GPR's instead of checks
for 64-bit pointers.
Reviewers: tomatabacu
Reviewed By: tomatabacu
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8972
llvm-svn: 235173
Found by code inspection, but breaking i16 at least breaks other tests.
They aren't checking this in particular though, so also add some
explicit tests for the already working types.
llvm-svn: 235148
A big-endian vector return needs a byte-swap which we aren't doing right now.
For now just bail on these cases to get correctness back.
llvm-svn: 235133
Fixed compilation with clang on some buildbots with "-Werror -Wmissing-field-initializers"
Related to: http://reviews.llvm.org/rL235089
llvm-svn: 235099
Summary: Previously, this was only happening for functions, but because of .insn, objects can also be marked now.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8007
llvm-svn: 235095
In order to introduce v8.1a-specific entities, Mappers should be aware of SubtargetFeatures available.
This patch introduces refactoring, that will then allow to easily introduce:
- v8.1-specific "pan" PState for PStateMapper (PAN extension)
- v8.1-specific sysregs for SysRegMapper (LOR,VHE extensions)
Reviewers: jmolloy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8496
Patch by Tom Coxon
llvm-svn: 235089
Summary:
This assembler directive marks the current label as an instruction label in microMIPS and MIPS16.
This initial implementation works only for microMIPS.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8006
llvm-svn: 235084
BXJ was incorrectly said to be unsupported in ARMv8-A. It is not
supported in the A64 instruction set, but it is supported in the T32
and A32 instruction sets, because it's listed as an instruction in the
ARM ARM section F7.1.28.
Using SP as an operand to BXJ changed from UNPREDICTABLE to
PREDICTABLE in v8-A. This patch reflects that update as well.
This was found by MCHammer.
llvm-svn: 235024
This is a 1-line patch (with a TODO for AVX because that will affect
even more regression tests) that lets us substitute the appropriate
64-bit store for the float/double/int domains.
It's not clear to me exactly what the difference is between the 0xD6 (MOVPQI2QImr) and
0x7E (MOVSDto64mr) opcodes, but this is apparently the right choice.
Differential Revision: http://reviews.llvm.org/D8691
llvm-svn: 235014
Set the transform bar at 2 divisions because the fastest current
x86 FP divider circuit is in SandyBridge / Haswell at 10 cycle
latency (best case) relative to a 5 cycle multiplier.
So that's the worst case for this transform (no latency win),
but multiplies are obviously pipelined while divisions are not,
so there's still a big throughput win which we would expect to
show up in typical FP code.
These are the sequences I'm comparing:
divss %xmm2, %xmm0
mulss %xmm1, %xmm0
divss %xmm2, %xmm0
Becomes:
movss LCPI0_0(%rip), %xmm3 ## xmm3 = mem[0],zero,zero,zero
divss %xmm2, %xmm3
mulss %xmm3, %xmm0
mulss %xmm1, %xmm0
mulss %xmm3, %xmm0
[Ignore for the moment that we don't optimize the chain of 3 multiplies
into 2 independent fmuls followed by 1 dependent fmul...this is the DAG
version of: https://llvm.org/bugs/show_bug.cgi?id=21768 ...if we fix that,
then the transform becomes even more profitable on all targets.]
Differential Revision: http://reviews.llvm.org/D8941
llvm-svn: 235012
Summary:
MSP430 doesn't seem to have any additional constraints. Therefore remove
the target hook.
No functional change intended.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8208
llvm-svn: 235003
Summary:
Refactor MipsAsmParser::getATReg to return an internal register number instead of a register index.
Also change all the int's to unsigned, seeing as the current AT register index is stored as an unsigned in MipsAssemblerOptions.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8478
llvm-svn: 234996
The ARMv8 ARMARM states that for these instructions in A64 state:
"Unspecified bits in "imm5" are ignored but should be set to zero by an assembler.", (imm4 for INS).
Make the disassembler accept any encoding with these ignored bits set to 1.
llvm-svn: 234896
This pass will always try to insert llvm.SI.ifbreak intrinsics
in the same block that its conditional value is computed in. This is
a problem when conditions for breaks or continue are computed outside
of the loop, because the llvm.SI.ifbreak intrinsic ends up being inserted
outside of the loop.
This patch fixes this problem by inserting the llvm.SI.ifbreak
intrinsics in the loop header when the condition is computed outside
the loop.
llvm-svn: 234891
Some targets (ie. Mips) have additional rules for ordering the relocation
table entries. Allow them to override generic sortRelocs(), which sorts
entries by Offset.
Then override this function for Mips, to emit HI16 and GOT16 relocations
against the local symbol in pair with the corresponding LO16 relocation.
Patch by Vladimir Stefanovic.
Differential Revision: http://reviews.llvm.org/D7414
llvm-svn: 234883
Gut the `DIDescriptor` wrappers around `MDLocalScope` subclasses. Note
that `DILexicalBlock` wraps `MDLexicalBlockBase`, not `MDLexicalBlock`.
llvm-svn: 234850
Gut all the non-pointer API from the variable wrappers, except an
implicit conversion from `DIGlobalVariable` to `DIDescriptor`. Note
that if you're updating out-of-tree code, `DIVariable` wraps
`MDLocalVariable` (`MDVariable` is a common base class shared with
`MDGlobalVariable`).
llvm-svn: 234840
Revert "Remove default in fully-covered switch (to fix Clang -Werror -Wcovered-switch-default)"
Revert "R600: Add carry and borrow instructions. Use them to implement UADDO/USUBO"
Revert "LegalizeDAG: Try to use Overflow operations when expanding ADD/SUB"
Using overflow operations fails CodeGen/Generic/2011-07-07-ScheduleDAGCrash.ll
on hexagon, nvptx, and r600. Revert while I investigate.
llvm-svn: 234768
v2: tighten the sub64 tests
v3: rename to CARRY/BORROW
v4: fixup test cmdline
add known bits computation
use sign extend instead of sub 0,x
better add test
v5: remove redundant break
move lowering to separate functions
fix comments
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewers: arsenm
llvm-svn: 234759
When I fixed these a couple of days ago to iterate over all loops, not just
depth == 1 loops, I inadvertently made it such that we'd only look at the first
top-level loop. Make sure that we really look at all of them.
llvm-svn: 234705
As it turns out, even though these are part of ISA 2.06, the P7 does not
support them (or, at least, not any P7s we're tested so far).
llvm-svn: 234686
This patch corresponds to review:
http://reviews.llvm.org/D8928
It adds direct move instructions to/from VSX registers to GPR's. These are
exploited for FP <-> INT conversions.
llvm-svn: 234682
The patch is generated using clang-tidy misc-use-override check.
This command was used:
tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py \
-checks='-*,misc-use-override' -header-filter='llvm|clang' \
-j=32 -fix -format
http://reviews.llvm.org/D8925
llvm-svn: 234679
This pass had the same problem as the data-prefetching pass: it was only
checking for depth == 1 loops in practice. Fix that, add some debugging
statements, and make sure that, when we grab an AddRec, it is for the loop we
expect.
llvm-svn: 234670
Currently, there's a single flag, checked by the pass itself.
It can't force-enable the pass (and is on by default), because it
might not even have been created, as that's the targets decision.
Instead, have separate explicit flags, so that the decision is
consistently made in the target.
Keep the flag as a last-resort "force-disable GlobalMerge" for now,
for backwards compatibility.
llvm-svn: 234666
The spilled registers are pristine and thus, correctly handled by
the register scavenger and so on, but the liveness information is
strictly speaking wrong at this point.
Fix that.
llvm-svn: 234664
Iterating over loops from the LoopInfo instance only provides top-level loops.
We need to search the whole tree of loops to find the inner ones.
llvm-svn: 234603
Using SchedAliases is convenient and works well for latency and resource
lookup for instructions. However, this creates an entry in
AArch64WriteLatencyTable with a WriteResourceID of 0, breaking any
SchedReadAdvance since the lookup will fail.
http://reviews.llvm.org/D8043
Patch by Dave Estes <cestes@codeaurora.org>!
llvm-svn: 234594
Summary:
Some optimizations such as jump threading and loop unswitching can negatively
affect performance when applied to divergent branches. The divergence analysis
added in this patch conservatively estimates which branches in a GPU program
can diverge. This information can then help LLVM to run certain optimizations
selectively.
Test Plan: test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
Reviewers: resistor, hfinkel, eliben, meheff, jholewinski
Subscribers: broune, bjarke.roune, madhur13490, tstellarAMD, dberlin, echristo, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8576
llvm-svn: 234567
When we have an instruction for this (and, thus, don't generate a runtime
call), we need to custom type legalize this (in a trivial way, just as we do
for fp_to_sint).
Fixes PR23173.
llvm-svn: 234561
For the most common ones (such as fadd), we already did the promotion.
Do the same thing for all the others.
Currently, we'll just crash/assert on all these operations, as
there's no hardware or libcall support whatsoever.
f16 (half) is specified as an interchange - not arithmetic - format,
and is expected to be promoted to single-precision for arithmetic
operations.
While there, teach the legalizer about promoting some of the (mostly
floating-point) operations that we never needed before.
Differential Revision: http://reviews.llvm.org/D8648
See related discussion on the thread for: http://reviews.llvm.org/D8755
llvm-svn: 234550
This is the patch corresponding to review:
http://reviews.llvm.org/D8406
It adds some missing instructions from ISA 2.06 to the PPC back end.
llvm-svn: 234546
formatted_raw_ostream is a wrapper over another stream to add column and line
number tracking.
It is used only for asm printing.
This patch moves the its creation down to where we know we are printing
assembly. This has the following advantages:
* Simpler lifetime management: std::unique_ptr
* We don't compute column and line number of object files :-)
llvm-svn: 234535
The integer extend optimization tries to fold the extend into the load
instruction. This requires us to identify if the extend has already been
emitted or not and act accordingly on it.
The check that was originally performed for this was not sufficient. Besides
checking the ValueMap for a mapped register we also need to check if the
virtual register has already an associated machine instruction that defines it.
This fixes rdar://problem/20470788.
llvm-svn: 234529
Revert "Add classof implementations to the raw_ostream classes."
Revert "Use the cast machinery to remove dummy uses of formatted_raw_ostream."
The underlying issue can be fixed without classof.
llvm-svn: 234495
Currently, llvm (backend) doesn't know cortex-r4, even though it is the
default target for armv7r. Using "--target=armv7r-arm-none-eabi" provokes
'cortex-r4' is not a recognized processor for this target' by llvm.
This patch adds support for cortex-r4 and, very closely related, r4f.
llvm-svn: 234486
Summary:
Make the code more readable by fusing the for-loops together and explicitly checking for each register class.
Also, this version is more straightforward because it doesn't assume that FPU registers always come before CPU registers in the CalleeSavedInfo vector.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8033
llvm-svn: 234475
restrictions when choosing a type for small-memcpy inlining in
SelectionDAGBuilder.
This ensures that the loads and stores output for the memcpy won't be further
expanded during legalization, which would cause the total number of instructions
for the memcpy to exceed (often significantly) the inlining thresholds.
<rdar://problem/17829180>
llvm-svn: 234462
Because -menable-no-nans causes fcmp conditions to be rewritten
without 'o' or 'u' the recognition code in needs to cope. Also
extended it to handle 'le' and 'ge.
Differential Revision: http://reviews.llvm.org/D8725
llvm-svn: 234421