Commit Graph

27435 Commits

Author SHA1 Message Date
Sanjay Patel 21aa6ddc14 [x86] narrow a shuffle that doesn't use or set any high elements
This isn't the final fix for our reduction/horizontal codegen, but it takes care 
of a lot of the problems. After we narrow the shuffle, existing combines for 
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.

The avg and mul reduction tests show an existing shuffle lowering hole for 
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.

Differential Revision: https://reviews.llvm.org/D57156

llvm-svn: 352209
2019-01-25 15:37:42 +00:00
Alex Bradbury 38c4ec31cb [RISCV] Add tests to demonstrate bitcasted fneg/fabs dagcombines
This target-independent code won't trigger for cases such as RV32FD where
custom SelectionDAG nodes are generated. These new tests demonstrate such
cases. Additionally, float-arith.ll was updated so that fneg.s, fsgnjn.s, and
fabs.s selection patterns are actually exercised.

llvm-svn: 352199
2019-01-25 14:33:08 +00:00
Simon Pilgrim d41ccddda9 [X86] Add addcarry/subborrow combine tests
Show failure to simplify cases with zero op/flags

llvm-svn: 352196
2019-01-25 12:26:27 +00:00
Diana Picus 8976ad12a9 [ARM GlobalISel] Support shifts for Thumb2
Same as ARM.

On this occasion we split some of the instruction select tests for more
complicated instructions into their own files, so we can reuse them for
ARM and Thumb mode. Likewise for the legalizer tests.

llvm-svn: 352188
2019-01-25 10:48:42 +00:00
Anton Korobeynikov 509d5c4a7d [MSP430] Fix absolute addressing mode printing in AsmPrinter
Align checks for absolute addressing mode with its current
implementation (SR is used as a base register).

This fixes https://bugs.llvm.org/show_bug.cgi?id=39993

Patch by Kristina Bessonova!

Differential Revision: https://reviews.llvm.org/D56785

llvm-svn: 352178
2019-01-25 09:14:05 +00:00
Zi Xuan Wu 308a609c6e [PowerPC] Enhance the fast selection of cmp instruction and clean up related asserts
Fast selection of llvm icmp and fcmp instructions is not handled well about VSX instruction support.

We'd use VSX float comparison instruction instead of non-vsx float comparison instruction 
if the operand register class is VSSRC or VSFRC because i32 and i64 are mapped to VSSRC and 
VSFRC correspondingly if VSX feature is opened.

If the target does not have corresponding VSX instruction comparison for some type, 
just copy VSX-related register to common float register class and use non-vsx comparison instruction.

Differential Revision: https://reviews.llvm.org/D57078

llvm-svn: 352174
2019-01-25 07:24:59 +00:00
Alex Bradbury 456d3798d6 [RISCV] Custom-legalise i32 SDIV/UDIV/UREM on RV64M
Follow the same custom legalisation strategy as used in D57085 for
variable-length shifts (see that patch summary for more discussion). Although
we may lose out on some late-stage DAG combines, I think this custom
legalisation strategy is ultimately easier to reason about.

There are some codegen changes in rv64m-exhaustive-w-insts.ll but they are all
neutral in terms of the number of instructions.

Differential Revision: https://reviews.llvm.org/D57096

llvm-svn: 352171
2019-01-25 05:11:34 +00:00
Alex Bradbury 299d690a50 [RISCV] Custom-legalise 32-bit variable shifts on RV64
The previous DAG combiner-based approach had an issue with infinite loops
between the target-dependent and target-independent combiner logic (see
PR40333). Although this was worked around in rL351806, the combiner-based
approach is still potentially brittle and can fail to select the 32-bit shift
variant when profitable to do so, as demonstrated in the pr40333.ll test case.

This patch instead introduces target-specific SelectionDAG nodes for
SHLW/SRLW/SRAW and custom-lowers variable i32 shifts to them. pr40333.ll is a
good example of how this approach can improve codegen.

This adds DAG combine that does SimplifyDemandedBits on the operands (only
lower 32-bits of first operand and lower 5 bits of second operand are read).
This seems better than implementing SimplifyDemandedBitsForTargetNode as there
is no guarantee that would be called (and it's not for e.g. the anyext return
test cases). Also implements ComputeNumSignBitsForTargetNode.

There are codegen changes in atomic-rmw.ll and atomic-cmpxchg.ll but the new
instruction sequences are semantically equivalent.

Differential Revision: https://reviews.llvm.org/D57085

llvm-svn: 352169
2019-01-25 05:04:00 +00:00
Matt Arsenault 3e08b772b3 AMDGPU/GlobalISel: Scalarize add/sub
llvm-svn: 352167
2019-01-25 04:53:57 +00:00
Matt Arsenault e6cebd0d69 GlobalISel: fewerElementsVector for more cast types
llvm-svn: 352166
2019-01-25 04:37:33 +00:00
Matt Arsenault 95fd95cfe0 GlobalISel: fewerElementsVector for a few more trivial ops
llvm-svn: 352165
2019-01-25 04:03:38 +00:00
Matt Arsenault 5d622fbcc1 AMDGPU/GlobalISel: Legalize smulh/umulh and scalarize mul
llvm-svn: 352162
2019-01-25 03:23:04 +00:00
Matt Arsenault 1b1e685f10 GlobalISel: Support fewerElementsVector for icmp/fcmp
Also legalize 64-bit compares for AMDGPU

llvm-svn: 352157
2019-01-25 02:59:34 +00:00
Matt Arsenault ca676343a9 GlobalISel: Implement fewerElementsVector for extensions
llvm-svn: 352155
2019-01-25 02:36:32 +00:00
Nemanja Ivanovic b9b75de0ae [PowerPC] Exploit store instructions that store a single vector element
This patch exploits the instructions that store a single element from a vector
to preform a (store (extract_elt)). We already have code that does this with
ISA 3.0 instructions that were added to handle i8/i16 types. However, we had
never exploited the existing ones that handle f32/f64/i32/i64 types.

Differential revision: https://reviews.llvm.org/D56175

llvm-svn: 352131
2019-01-24 23:44:28 +00:00
Aditya Nandakumar 3ba0d94bce [GISel]: Change how CSE is enabled by default for each pass
https://reviews.llvm.org/D57178

Now add a hook in TargetPassConfig to query if CSE needs to be
enabled. By default this hook returns false only for O0 opt level but
this can be overridden by the target.
As a consequence of the default of enabled for non O0, a few tests
needed to be updated to not use CSE (by passing in -O0) to the run
line.

reviewed by: arsenm

llvm-svn: 352126
2019-01-24 23:11:25 +00:00
Matt Arsenault baa5d2e69c RegBankSelect: Support some more complex part mappings
llvm-svn: 352123
2019-01-24 22:47:04 +00:00
Jessica Paquette 245047dfe8 [GlobalISel][AArch64] Add isel support for FP16 vector @llvm.ceil
This patch adds support for vector @llvm.ceil intrinsics when full 16 bit
floating point support isn't available.

To do this, this patch...

- Implements basic isel for G_UNMERGE_VALUES
- Teaches the legalizer about 16 bit floats
- Teaches AArch64RegisterBankInfo to respect floating point registers on
  G_BUILD_VECTOR and G_UNMERGE_VALUES
- Teaches selectCopy about 16-bit floating point vectors

It also adds

- A legalizer test for the 16-bit vector ceil which verifies that we create a
  G_UNMERGE_VALUES and G_BUILD_VECTOR when full fp16 isn't supported
- An instruction selection test which makes sure we lower to G_FCEIL when
  full fp16 is supported
- A test for selecting G_UNMERGE_VALUES

And also updates arm64-vfloatintrinsics.ll to show that the new ceiling types
work as expected.

https://reviews.llvm.org/D56682

llvm-svn: 352113
2019-01-24 22:00:41 +00:00
Simon Pilgrim c12a634326 [X86] Regenerate SBB test to fix buildbots.
Some local WIP code unexpectedly managed to get in the way.

llvm-svn: 352081
2019-01-24 18:57:48 +00:00
James Y Knight 2c36240a82 Fix emission of _fltused for MSVC.
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.

The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.

This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)

That latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and also, is consistent with
MSVC.

Differential Revision: https://reviews.llvm.org/D56548

llvm-svn: 352076
2019-01-24 18:34:00 +00:00
Simon Pilgrim f4a1b54097 [X86] Add PR25858 test cases
llvm-svn: 352075
2019-01-24 18:30:45 +00:00
Sanjay Patel 55787a7e77 [x86] add tests for unpack shuffle lowering; NFC
https://bugs.llvm.org/show_bug.cgi?id=40434

llvm-svn: 352048
2019-01-24 14:12:34 +00:00
Petar Avramovic 79df859685 [MIPS GlobalISel] Select zero extending and sign extending load
Select zero extending and sign extending load for MIPS32.
Use size from MachineMemOperand to determine number of bytes to load.

Differential Revision: https://reviews.llvm.org/D57099

llvm-svn: 352038
2019-01-24 10:27:21 +00:00
Petar Avramovic b5a939d246 [MIPS GlobalISel] Combine extending loads
Use CombinerHelper to combine extending load instructions.
G_LOAD combined with G_ZEXT, G_SEXT or G_ANYEXT gives G_ZEXTLOAD,
G_SEXTLOAD or G_LOAD with same type as def of extending instruction
respectively.
Similarly G_ZEXTLOAD combined with G_ZEXT gives G_ZEXTLOAD and
G_SEXTLOAD combined with G_SEXT gives G_SEXTLOAD with same type
as def of extending instruction.

Differential Revision: https://reviews.llvm.org/D56914

llvm-svn: 352037
2019-01-24 10:09:52 +00:00
Jonas Paulsson 5916dea338 [SystemZ] Remember to reset the NoPHIs property on MF in createPHIsForSelects()
After creating new PHI instructions during isel pseudo expansion, the NoPHIs
property of MF should be reset in case it was previously set.

Review: Ulrich Weigand
llvm-svn: 352030
2019-01-24 07:54:41 +00:00
Craig Topper e79b779fbb [X86] Add test cases for opportunities to fold a truncate and a masked store into a truncating masked store.
llvm-svn: 352027
2019-01-24 06:15:03 +00:00
Reid Kleckner f9ebacfd29 Revert r351938 "[ARM] Alter the register allocation order for minsize on Thumb2"
This change caused fatal backend errors when compiling a file in libvpx
for Android.

llvm-svn: 351979
2019-01-23 21:10:48 +00:00
Craig Topper aa0e74c1fc [X86] Autogenerate complete checks. NFC
llvm-svn: 351970
2019-01-23 18:25:49 +00:00
Andrea Di Biagio d768d35515 [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.

When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.

Differential Revision: https://reviews.llvm.org/D57056

llvm-svn: 351965
2019-01-23 16:35:07 +00:00
Tim Renouf f64f8efe13 [AMDGPU] With XNACK, cannot clause a load with result coalesced with operand
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.

The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.

Differential Revision: https://reviews.llvm.org/D57008

Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
2019-01-23 13:38:06 +00:00
Jonas Paulsson 6046d087c5 [SystemZ] Fix test case for buildbot.
llvm-clang-x86_64-expensive-checks-win triggered this assert:

"llvm.dbg.value intrinsic requires a !dbg attachment"

Hopefully, adding reasonable !dbg operands solves this.

llvm-svn: 351939
2019-01-23 10:29:12 +00:00
David Green 6a858a9425 [ARM] Alter the register allocation order for minsize on Thumb2
Currently in Arm code, we allocate LR first, under the assumption that
it needs to be saved anyway. Unfortunately this has the disadvantage
that it will require any instructions using it to be the longer thumb2
instructions, not the shorter thumb1 ones.

This switches the order when we are optimising for minsize, returning to
the default order so that more lower registers can be used. It can end
up requiring more pushed registers, but on average produces smaller code.

Differential Revision: https://reviews.llvm.org/D56008

llvm-svn: 351938
2019-01-23 10:18:30 +00:00
Sam Parker 31bef63bb4 [ARM][CGP] Check trunc type before replacing
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This is okay when we only
allowed one type to be optimised, but now its the case that the trunc
maybe needed to produce a more narrow type than the one we were
optimising for. So we need to check this before doing the replacement.

Differential Revision: https://reviews.llvm.org/D57041

llvm-svn: 351935
2019-01-23 09:18:44 +00:00
Sam Parker 9a2a89d58f [DAGCombine] Enable more pre-indexed stores
The current check in CombineToPreIndexedLoadStore is too
conversative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.

Differential Revision: https://reviews.llvm.org/D56719

llvm-svn: 351933
2019-01-23 09:11:49 +00:00
Kristof Beyls e70ba0a1bc [SLH][AArch64] Remove accidentally retained -debug-only line from test.
llvm-svn: 351932
2019-01-23 09:10:12 +00:00
Kristof Beyls 3ff5dfd735 [SLH] AArch64: correctly pick temporary register to mask SP
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:

mov x17, sp
and x17, x17, x16
mov sp, x17
However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).

To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.

In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.

Differential Revision: https://reviews.llvm.org/D56717

llvm-svn: 351930
2019-01-23 08:18:39 +00:00
Jonas Paulsson 961c47ec98 [SystemZ] Handle DBG_VALUE instructions in two places in backend.
Two backend optimizations failed to handle cases when compiled with -g, due
to failing to consider DBG_VALUE instructions. This was in
SystemZTargetLowering::emitSelect() and
SystemZElimCompare::getRegReferences().

This patch makes sure that DBG_VALUEs are recognized so that they do not
affect these optimizations.

Tests for branch-on-count, load-and-trap and consecutive selects.

Review: Ulrich Weigand
https://reviews.llvm.org/D57048

llvm-svn: 351928
2019-01-23 07:42:26 +00:00
Brendon Cahoon 59d9973146 [Pipeliner] Add two pragmas to control software pipelining optimization
#pragma clang loop pipeline(disable)
  
    Disable SWP optimization for the next loop.
    “disable” is the only possible value.
  
#pragma clang loop pipeline_initiation_interval(number)
  
    Set value of initiation interval for SWP
    optimization to specified number value for
    the next loop. Number is the positive value
    greater than 0.
  
These pragmas could be used for debugging or reducing
compile time purposes. It is possible to disable SWP for
concrete loops to save compilation time or to find bugs
by not doing SWP to certain loops. It is possible to set
value of initiation interval to concrete number to save
compilation time by not doing extra pipeliner passes or
to check created schedule for specific initiation interval.

That is llvm part of the fix

Clang part of fix: https://reviews.llvm.org/D55710

Patch by Alexey Lapshin!

Differential Revision: https://reviews.llvm.org/D56403

llvm-svn: 351923
2019-01-23 03:26:10 +00:00
Peter Collingbourne 73078ecd38 hwasan: Move memory access checks into small outlined functions on aarch64.
Each hwasan check requires emitting a small piece of code like this:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#memory-accesses

The problem with this is that these code blocks typically bloat code
size significantly.

An obvious solution is to outline these blocks of code. In fact, this
has already been implemented under the -hwasan-instrument-with-calls
flag. However, as currently implemented this has a number of problems:
- The functions use the same calling convention as regular C functions.
  This means that the backend must spill all temporary registers as
  required by the platform's C calling convention, even though the
  check only needs two registers on the hot path.
- The functions take the address to be checked in a fixed register,
  which increases register pressure.
Both of these factors can diminish the code size effect and increase
the performance hit of -hwasan-instrument-with-calls.

The solution that this patch implements is to involve the aarch64
backend in outlining the checks. An intrinsic and pseudo-instruction
are created to represent a hwasan check. The pseudo-instruction
is register allocated like any other instruction, and we allow the
register allocator to select almost any register for the address to
check. A particular combination of (register selection, type of check)
triggers the creation in the backend of a function to handle the check
for specifically that pair. The resulting functions are deduplicated by
the linker. The pseudo-instruction (really the function) is specified
to preserve all registers except for the registers that the AAPCS
specifies may be clobbered by a call.

To measure the code size and performance effect of this change, I
took a number of measurements using Chromium for Android on aarch64,
comparing a browser with inlined checks (the baseline) against a
browser with outlined checks.

Code size: Size of .text decreases from 243897420 to 171619972 bytes,
or a 30% decrease.

Performance: Using Chromium's blink_perf.layout microbenchmarks I
measured a median performance regression of 6.24%.

The fact that a perf/size tradeoff is evident here suggests that
we might want to make the new behaviour conditional on -Os/-Oz.
But for now I've enabled it unconditionally, my reasoning being that
hwasan users typically expect a relatively large perf hit, and ~6%
isn't really adding much. We may want to revisit this decision in
the future, though.

I also tried experimenting with varying the number of registers
selectable by the hwasan check pseudo-instruction (which would result
in fewer variants being created), on the hypothesis that creating
fewer variants of the function would expose another perf/size tradeoff
by reducing icache pressure from the check functions at the cost of
register pressure. Although I did observe a code size increase with
fewer registers, I did not observe a strong correlation between the
number of registers and the performance of the resulting browser on the
microbenchmarks, so I conclude that we might as well use ~all registers
to get the maximum code size improvement. My results are below:

Regs | .text size | Perf hit
-----+------------+---------
~all | 171619972  | 6.24%
  16 | 171765192  | 7.03%
   8 | 172917788  | 5.82%
   4 | 177054016  | 6.89%

Differential Revision: https://reviews.llvm.org/D56954

llvm-svn: 351920
2019-01-23 02:20:10 +00:00
Matt Arsenault 4c5e8f51e7 AMDGPU/GlobalISel: Start selectively legalizing 16-bit operations
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this was requiring more boilerplate and disables the coverage
assertions.

llvm-svn: 351886
2019-01-22 22:00:19 +00:00
Matt Arsenault 736cfa9ffb AMDGPU/GlobalISel: Handle legality/regbanks for 32/64-bit shifts
llvm-svn: 351884
2019-01-22 21:51:38 +00:00
Matt Arsenault 30989e492b GlobalISel: Allow shift amount to be a different type
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.

X86 uses i8, but seemed to be hacking around this before.

llvm-svn: 351882
2019-01-22 21:42:11 +00:00
Matt Arsenault 6378629609 GlobalISel: Implement widen for extract_vector_elt elt type
llvm-svn: 351871
2019-01-22 20:38:15 +00:00
Matt Arsenault aebb2ee036 GlobalISel: Implement fewerElementsVector for basic FP ops
llvm-svn: 351866
2019-01-22 20:14:29 +00:00
Matt Arsenault 6614f852b6 GlobalISel: Support narrowing zextload/sextload
llvm-svn: 351856
2019-01-22 19:02:10 +00:00
Matt Arsenault a7cd83bc88 GlobalISel: Disallow vectors for G_CONSTANT/G_FCONSTANT
llvm-svn: 351853
2019-01-22 18:53:41 +00:00
Matt Arsenault a5840c3c39 Codegen support for atomicrmw fadd/fsub
llvm-svn: 351851
2019-01-22 18:36:06 +00:00
Sanjay Patel 58256802ce [x86] add partial undef 'and' test; NFC
llvm-svn: 351840
2019-01-22 17:01:06 +00:00
Sanjay Patel 6019e6f866 [x86] add another partial undef vector binop test; NFC
The existing test unintentionally shows that we have prematurely
optimized the shuffle into a vector concat and lost the undef info, 
so it is not affected by a basic improvement to 
SimplifyDemandedVectorElts.

llvm-svn: 351834
2019-01-22 16:26:09 +00:00
Sanjay Patel effee52c59 [DAGCombiner] narrow vector binop with 2 insert subvector operands
vecbo (insertsubv undef, X, Z), (insertsubv undef, Y, Z) --> insertsubv VecC, (vecbo X, Y), Z

This is another step in generic vector narrowing. It's also a step towards more horizontal op 
formation specifically for x86 (although we still failed to match those in the affected tests).

The scalarization cases are also not optimal (we should be scalarizing those), but it's still 
an improvement to use a narrower vector op when we know part of the result must be constant 
because both inputs are undef in some vector lanes.

I think a similar match but checking for a constant operand might help some of the cases in 
D51553.

Differential Revision: https://reviews.llvm.org/D56875

llvm-svn: 351825
2019-01-22 14:24:13 +00:00
Simon Pilgrim 933673d878 [X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.

This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.

Differential Revision: https://reviews.llvm.org/D55935

llvm-svn: 351819
2019-01-22 13:44:49 +00:00
Simon Pilgrim aa6a4339ac [X86][BtVer2] SSE2 vector shifts has local forwarding disabled
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57026

llvm-svn: 351817
2019-01-22 13:27:18 +00:00
Simon Pilgrim 2c69f90171 [X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57022

llvm-svn: 351815
2019-01-22 13:13:57 +00:00
Simon Pilgrim 180fcff5a7 [X86][SSE] Add selective commutation support for insertps (PR40340)
When we are inserting 1 "inline" element, and zeroing 2 of the other elements then we can safely commute the insertps source inputs to improve memory folding.

Differential Revision: https://reviews.llvm.org/D56843

llvm-svn: 351807
2019-01-22 12:17:48 +00:00
Alex Bradbury cd26560e46 [RISCV] Quick fix for PR40333
Avoid the infinite loop caused by the target DAG combine converting ANYEXT to
SIGNEXT and the target-independent DAG combine logic converting back to
ANYEXT. Do this by not adding the new node to the worklist.

Committing directly as this definitely doesn't make the problem any worse, and
I intend to follow-up with a patch that avoids this custom combiner logic
altogether and just lowers the i32 operations to a target-specific
SelectionDAG node. This should be easier to reason about and improve codegen
quality in some cases (though may miss out on some later DAG combines).

llvm-svn: 351806
2019-01-22 12:11:53 +00:00
Simon Pilgrim 372afb7ec4 [X86] Add test for matchAddressRecursively's MUL handling
Noticed in code coverage tests that this isn't tested.

llvm-svn: 351804
2019-01-22 11:39:21 +00:00
Eli Friedman 1eaa04d682 [ARM] Combine ands+lsls to lsls+lsrs for Thumb1.
This patch may seem familiar... but my previous patch handled the
equivalent lsls+and, not this case.  Usually instcombine puts the
"and" after the shift, so this case doesn't come up. However, if the
shift comes out of a GEP, it won't get canonicalized by instcombine,
and DAGCombine doesn't have an equivalent transform.

This also modifies isDesirableToCommuteWithShift to suppress DAGCombine
transforms which would make the overall code worse.

I'm not really happy adding a bunch of code to handle this, but it would
probably be tricky to substantially improve the behavior of DAGCombine
here.

Differential Revision: https://reviews.llvm.org/D56032

llvm-svn: 351776
2019-01-22 01:51:37 +00:00
Eli Friedman 23e60c7893 [AArch64] Add patterns for zext/sext of shift amount.
Not sure this is the best fix, but it saves an instruction for certain
constructs involving variable shifts.

Differential Revision: https://reviews.llvm.org/D55572

llvm-svn: 351768
2019-01-22 00:21:35 +00:00
Matt Arsenault fb67164ebc AMDGPU/GlobalISel: Legalize more fp<->int conversions
llvm-svn: 351767
2019-01-22 00:20:17 +00:00
Sanjay Patel 9884dd1034 [x86] add another test for xor with undefs; NFC
llvm-svn: 351764
2019-01-21 22:12:35 +00:00
Sanjay Patel 43345f29f4 [x86] add tests for vector ops with undef lanes; NFC
llvm-svn: 351763
2019-01-21 21:52:27 +00:00
Stanislav Mekhanoshin f92ed6966e [AMDGPU] Fixed hazard recognizer to walk predecessors
Fixes two problems with GCNHazardRecognizer:
1. It only scans up to 5 instructions emitted earlier.
2. It does not take control flow into account. An earlier instruction
from the previous basic block is not necessarily a predecessor.
At the same time a real predecessor block is not scanned.

The patch provides a way to distinguish between scheduler and
hazard recognizer mode. It is OK to work with emitted instructions
in the scheduler because we do not really know what will be emitted
later and its order. However, when pass works as a hazard recognizer
the schedule is already finalized, and we have full access to the
instructions for the whole function, so we can properly traverse
predecessors and their instructions.

Differential Revision: https://reviews.llvm.org/D56923

llvm-svn: 351759
2019-01-21 19:11:26 +00:00
Simon Pilgrim 9b73ae96c5 [X86][BtVer2] Update latency of mmx horizontal operations
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.

Confirmed with @andreadb

llvm-svn: 351755
2019-01-21 18:04:25 +00:00
Sanjay Patel fe3a1b56eb [AArch64] add more tests for buildvec to shuffle transform; NFC
These are copied from the sibling x86 file. I'm not sure which
of the current outputs (if any) is considered optimal, but
someone more familiar with AArch may want to take a look.

llvm-svn: 351754
2019-01-21 17:46:35 +00:00
Sanjay Patel e713c47d49 [DAGCombiner] fix crash when converting build vector to shuffle
The regression test is reduced from the example shown in D56281.
This does raise a question as noted in the test file: do we want
to handle this pattern? I don't have a motivating example for
that on x86 yet, but it seems like we could have that pattern 
there too, so we could avoid the back-and-forth using a shuffle.

llvm-svn: 351753
2019-01-21 17:30:14 +00:00
Andrea Di Biagio b68dd05c14 [X86][BtVer2] Update the WriteLoad latency.
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.

Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.

Differential Revision: https://reviews.llvm.org/D56922

llvm-svn: 351742
2019-01-21 12:04:10 +00:00
Craig Topper f608dc1f57 [X86] Remove and autoupgrade vpmovqd/vpmovwb intrinsics using trunc+select.
llvm-svn: 351729
2019-01-21 08:16:59 +00:00
Dylan McKay 5c23410fdf [AVR] Insert unconditional branch when inserting MBBs between blocks with fallthrough
This updates the AVR Select8/Select16 expansion code so that, when
inserting the two basic blocks for true and false conditions, any
existing fallthrough on the previous block is preserved.

Prior to this patch, if the block before the Select pseudo fell through
to the subsequent block, two new basic blocks would be inserted at the
prior fallthrough point, changing the fallthrough destination.

The predecessor or successor lists were not updated, causing the
BranchFolding pass at -O1 and above the rearrange basic blocks, causing
an infinite loop. Not to mention the unconditional fallthrough to the
true block is incorrect in of itself.

This patch modifies the Select8/16 expansion so that, if inserting true
and false basic blocks at a fallthrough point, the implicit branch is
preserved by means of an explicit, unconditional branch to the previous
fallthrough destination.

Thanks to Carl Peto for reporting this bug.

This fixes avr-rust bug https://github.com/avr-rust/rust/issues/123.

llvm-svn: 351721
2019-01-21 04:32:02 +00:00
Dylan McKay ce0ab06353 Revert "[AVR] Insert unconditional branch when inserting MBBs between blocks with fallthrough"
This reverts commit r351718.

Carl pointed out that the unit test could be improved.

This patch will be recommitted once the test is made more resilient.

llvm-svn: 351719
2019-01-21 02:46:13 +00:00
Dylan McKay 33acba43f0 [AVR] Insert unconditional branch when inserting MBBs between blocks with fallthrough
This updates the AVR Select8/Select16 expansion code so that, when
inserting the two basic blocks for true and false conditions, any
existing fallthrough on the previous block is preserved.

Prior to this patch, if the block before the Select pseudo fell through
to the subsequent block, two new basic blocks would be inserted at the
prior fallthrough point, changing the fallthrough destination.

The predecessor or successor lists were not updated, causing the
BranchFolding pass at -O1 and above the rearrange basic blocks, causing
an infinite loop. Not to mention the unconditional fallthrough to the
true block is incorrect in of itself.

This patch modifies the Select8/16 expansion so that, if inserting true
and false basic blocks at a fallthrough point, the implicit branch is
preserved by means of an explicit, unconditional branch to the previous
fallthrough destination.

Thanks to Carl Peto for reporting this bug.

This fixes avr-rust bug https://github.com/avr-rust/rust/issues/123.

llvm-svn: 351718
2019-01-21 02:44:09 +00:00
Matt Arsenault 7ac79ed8f0 AMDGPU: Legalize more bitcasts
llvm-svn: 351700
2019-01-20 19:45:18 +00:00
Matt Arsenault 46ffe68d77 AMDGPU/GlobalISel: Really legalize exts from i1
There is a combine that was hiding these tests
not actually testing what they should be, although
they were producing the expected end result.

llvm-svn: 351698
2019-01-20 19:28:20 +00:00
Simon Pilgrim e1143c1322 [X86] Auto upgrade VPCOM/VPCOMU intrinsics to generic integer comparisons
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.

Noticed while cleaning up vector integer comparison costs for PR40376.

llvm-svn: 351697
2019-01-20 19:27:40 +00:00
Matt Arsenault 745fd9f547 GlobalISel: Implement widenScalar for basic FP ops
llvm-svn: 351696
2019-01-20 19:10:31 +00:00
Matt Arsenault cfd9e7f594 AMDGPU/GlobalISel: Legalize f32->f16 fptrunc
llvm-svn: 351695
2019-01-20 19:10:26 +00:00
Matt Arsenault ff6a9a275b AMDGPU/GlobalISel: Fix some crashs in g_unmerge_values/g_merge_values
This was crashing in the predicate function assuming the value
is a vector.

Copy more of what AArch64 uses. This probably needs more refinement
later, but I don't exactly understand what it means in some cases,
particularly since any legalization for these seems to be missing.

llvm-svn: 351693
2019-01-20 18:40:36 +00:00
Matt Arsenault 2a2086b830 AMDGPU/GlobalISel: Regbank select for fpext
llvm-svn: 351692
2019-01-20 18:35:41 +00:00
Matt Arsenault 24563ef628 AMDGPU/GlobalISel: Cleanup legality for extensions
llvm-svn: 351691
2019-01-20 18:34:24 +00:00
Simon Pilgrim b590e4f7e5 [X86] Auto upgrade old style VPCOM/VPCOMU intrinsics to generic integer comparisons
We were upgrading these to the new style VPCOM/VPCOMU intrinsics (which includes the condition code immediate), but we'll be getting rid of those shortly, so convert these to generics first.

This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.

Noticed while cleaning up vector integer comparison costs for PR40376.

llvm-svn: 351690
2019-01-20 17:36:22 +00:00
Simon Pilgrim 4fd2459c4d [X86] Replace VPCOM/VPCOMU with generic integer comparisons (llvm)
These intrinsics can always be replaced with generic integer comparisons without any regression in codegen, even for -O0/-fast-isel cases.

Noticed while cleaning up vector integer comparison costs for PR40376.

A future commit will remove/autoupgrade the existing VPCOM/VPCOMU llvm intrinsics.

llvm-svn: 351688
2019-01-20 16:40:44 +00:00
Dylan McKay a6241a5dc0 [AVR] Remove unneeded XFAILs from the Generic CodeGen tests
These have been in place for quite a while now.

Several bugs have since been fixed, and these tests now pass.

llvm-svn: 351679
2019-01-20 11:16:58 +00:00
Dylan McKay 6afef286d9 [AVR] Fix codegen bug in 16-bit loads
Prior to this patch, the AVR::LDWRdPtr instruction was always lowered to
instructions of this pattern:

    ld  $GPR8, [PTR:XYZ]+
    ld  $GPR8, [PTR]+1

This has a problem; the [PTR] is incremented in-place once, but never
decremented.

Future uses of the same pointer will use the now clobbered value,
leading to the pointer being incorrect by an offset of one.

This patch modifies the expansion code of the LDWRdPtr pseudo
instruction so that the pointer variable is not silently clobbered in
future uses in the same live range.

Bug first reported by Keshav Kini.

Patch by Kaushik Phatak.

llvm-svn: 351673
2019-01-20 03:41:08 +00:00
Dylan McKay 52846ab09a Revert "[AVR] Fix codegen bug in 16-bit loads"
This reverts commit r351544.

In that commit, I had mistakenly misattributed the issue submitter as
the patch author, Kaushik Phatak.

The patch will be recommitted immediately with the correct attribution.

llvm-svn: 351672
2019-01-20 03:41:00 +00:00
Amara Emerson d5015edb37 Revert r351584: "GlobalISel: Verify g_zextload and g_sextload"
This new assertion triggered on the AArch64 GlobalISel bots. Reverting while it's being investigated.

llvm-svn: 351617
2019-01-19 00:36:11 +00:00
Matt Arsenault 96e4701401 AMDGPU/GlobalISel: Legalize more types for select
llvm-svn: 351599
2019-01-18 21:42:55 +00:00
Matt Arsenault 4599159ac3 AMDGPU/GlobalISel: Legalize illegal g_constant
llvm-svn: 351596
2019-01-18 21:33:50 +00:00
Matt Arsenault bd3a5b29cb GlobalISel: Verify G_BITCAST
llvm-svn: 351594
2019-01-18 21:04:59 +00:00
Sanjay Patel 4453e4292d [x86] add more movmsk tests; NFC
The existing tests already show a sub-optimal transform,
but this should make it clear that we can't just match
an 'and' op when creating movmsk instructions.

llvm-svn: 351590
2019-01-18 20:42:12 +00:00
Matt Arsenault f67ae61131 GlobalISel: Verify g_zextload and g_sextload
llvm-svn: 351584
2019-01-18 20:17:37 +00:00
Craig Topper b9d4461f9f [X86] Lower avx2/avx512f gather intrinsics to X86MaskedGatherSDNode instead of going directly to MachineSDNode.:
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.

Differential Revision: https://reviews.llvm.org/D56827

llvm-svn: 351570
2019-01-18 18:22:26 +00:00
Dmitry Preobrazhensky 6bc26aaada [AMDGPU][MC][GFX8+][DISASSEMBLER] Corrected 1/2pi value for 64-bit operands
See bug 39332: https://bugs.llvm.org/show_bug.cgi?id=39332

Reviewers: artem.tamazov, arsenm

Differential Revision: https://reviews.llvm.org/D56794

llvm-svn: 351555
2019-01-18 15:17:17 +00:00
Dylan McKay 77364be497 [AVR] Fix codegen bug in 16-bit loads
Prior to this patch, the AVR::LDWRdPtr instruction was always lowered to
instructions of this pattern:

    ld  $GPR8, [PTR:XYZ]+
    ld  $GPR8, [PTR]+1

This has a problem; the [PTR] is incremented in-place once, but never
decremented.

Future uses of the same pointer will use the now clobbered value,
leading to the pointer being incorrect by an offset of one.

This patch modifies the expansion code of the LDWRdPtr pseudo
instruction so that the pointer variable is not silently clobbered in
future uses in the same live range.

Patch by Keshav Kini.

llvm-svn: 351544
2019-01-18 11:27:38 +00:00
Shiva Chen e84c729aca [ScheduleDAGRRList] Do not preschedule the node has ADJCALLSTACKDOWN parent
We should not pre-scheduled the node has ADJCALLSTACKDOWN parent,
or else, when bottom-up scheduling, ADJCALLSTACKDOWN and
ADJCALLSTACKUP may hold CallResource too long and make other
calls can't be scheduled. If there's no other available node
to schedule, the scheduler will try to rename the register by
creating copy to avoid the conflict which will fail because
CallResource is not a real physical register.

llvm-svn: 351527
2019-01-18 08:36:06 +00:00
Hsiangkai Wang 66609a8255 [CodeGen] Fix bugs in LiveDebugVariables when debug labels are generated.
Remove DBG_LABELs in LiveDebugVariables and generate them in
VirtRegRewriter.

This bug is reported in
https://bugs.chromium.org/p/chromium/issues/detail?id=898152.

Differential Revision: https://reviews.llvm.org/D54465

llvm-svn: 351525
2019-01-18 07:17:09 +00:00
Dylan McKay 7203e00b5e [AVR] Expand 8/16-bit multiplication to libcalls on MCUs that don't have hardware MUL
This change modifies the LLVM ISel lowering settings so that
8-bit/16-bit multiplication is expanded to calls into the compiler
runtime library if the MCU being targeted does not support
multiplication in hardware.

Before this, MUL instructions would be generated on CPUs like the
ATtiny85, triggering a CPU reset due to an illegal instruction at
runtime.

First raised in https://github.com/avr-rust/rust/issues/124.

llvm-svn: 351523
2019-01-18 06:10:41 +00:00
Craig Topper f47bef661e [X86] Add test cases showing failure to fold a global variable address into the gather addressing mode when using the target specific intrinsics. NFC
llvm-svn: 351522
2019-01-18 06:06:03 +00:00
Craig Topper 13c7a9f81f [X86] Change avx512-gather-scatter-intrin.ll to use x86_64-unknown-unknown instead of x86_64-apple-darwin. NFC
Will help with an upcoming patch.

llvm-svn: 351521
2019-01-18 06:06:01 +00:00
Thomas Lively c6795e07f0 [WebAssembly] Add languages from debug info to producers section
Reviewers: aheejin, dschuff, sbc100

Subscribers: aprantl, jgravelle-google, hiraditya, sunfish

Differential Revision: https://reviews.llvm.org/D56889

llvm-svn: 351507
2019-01-18 02:47:48 +00:00
Matt Arsenault 456b93b4c2 AMDGPU: Convert tests away from llvm.SI.load.const
llvm-svn: 351494
2019-01-17 22:47:26 +00:00
Vladimir Stefanovic 3daf8bc96a [mips] Emit .reloc R_{MICRO}MIPS_JALR along with j(al)r(c) $25
The callee address is added as an optional operand (MCSymbol) in
AdjustInstrPostInstrSelection() and then used by asm printer to insert:
'.reloc tmplabel, R_MIPS_JALR, symbol
tmplabel:'.
Controlled with '-mips-jalr-reloc', default is true.

Differential revision: https://reviews.llvm.org/D56694

llvm-svn: 351485
2019-01-17 21:50:37 +00:00
Sanjin Sijaric 4d1450298c Fix the buildbot failure introduced by r351404
EXPENSIVE_CHECKS buildbots are failing due to r351404.

Add x1 as live in to the funclet basic block for SEH funclets, as well as
-verify-machineinstrs to the test case that triggered the failure.

llvm-svn: 351472
2019-01-17 20:24:14 +00:00
Simon Pilgrim 6cc9c3cd75 [X86][SSE] Add PR40340 test case
llvm-svn: 351430
2019-01-17 11:20:23 +00:00
Simon Pilgrim 8260bf9db2 [X86] Add AVX512 test to insertps
Pre-commit for PR40340

llvm-svn: 351429
2019-01-17 11:11:15 +00:00
Matt Arsenault 0cb08e448a Allow FP types for atomicrmw xchg
llvm-svn: 351427
2019-01-17 10:49:01 +00:00
Diana Picus d5c2499aec [ARM GlobalISel] Allow calls to varargs functions
Allow varargs functions to be called, both in arm and thumb mode. This
boils down to choosing the correct calling convention, which we can
easily test by making sure arm_aapcscc is used instead of
arm_aapcs_vfpcc when the callee is variadic.

llvm-svn: 351424
2019-01-17 10:11:55 +00:00
Alex Bradbury 07f1c62371 [RISCV] Add codegen support for RV64A
In order to support codegen RV64A, this patch:
* Introduces masked atomics intrinsics for atomicrmw operations and cmpxchg
  that use the i64 type. These are ultimately lowered to masked operations
  using lr.w/sc.w, but we need to use these alternate intrinsics for RV64
  because i32 is not legal
* Modifies RISCVExpandPseudoInsts.cpp to handle PseudoAtomicLoadNand64 and
  PseudoCmpXchg64
* Modifies the AtomicExpandPass hooks in RISCVTargetLowering to sext/trunc as
  needed for RV64 and to select the i64 intrinsic IDs when necessary
* Adds appropriate patterns to RISCVInstrInfoA.td
* Updates test/CodeGen/RISCV/atomic-*.ll to show RV64A support

This ends up being a fairly mechanical change, as the logic for RV32A is
effectively reused.

Differential Revision: https://reviews.llvm.org/D53233

llvm-svn: 351422
2019-01-17 10:04:39 +00:00
Sanjin Sijaric b694030647 [ARM64][Windows] Share unwind codes between epilogues
There are cases where we have multiple epilogues that have the exact same unwind
code sequence.  In that case, the epilogues can share the same unwind codes in
the .xdata section.  This should get us past the assert "SEH unwind data
splitting not yet implemented" in many cases.

We still need to add support for generating multiple .pdata/.xdata sections for
those functions that need to be split into fragments.

Differential Revision: https://reviews.llvm.org/D56813

llvm-svn: 351421
2019-01-17 09:45:17 +00:00
Thomas Lively cbda16eb8e [WebAssembly] Parse llvm.ident into producers section
llvm-svn: 351413
2019-01-17 02:29:55 +00:00
Thomas Lively 3cfcc94c09 Revert "[WebAssembly] Parse llvm.ident into producers section"
This reverts commit eccdbba3a02a33e13b5262e92200a33e2ead873d.

llvm-svn: 351410
2019-01-17 00:39:49 +00:00
Sanjin Sijaric 685565ae9a [SEH] [ARM64] Retrieve the frame pointer from SEH funclets
The Windows ARM64 runtime passes the establisher frame to funclets as the first
argument.

llvm-svn: 351404
2019-01-17 00:24:38 +00:00
Thomas Lively a56c23c5ba [WebAssembly] Parse llvm.ident into producers section
Summary:
Everything before the word "version" is the tool, and everything after
the word "version" is the version.

Reviewers: aheejin, dschuff

Subscribers: sbc100, jgravelle-google, sunfish, llvm-commits

Differential Revision: https://reviews.llvm.org/D56742

llvm-svn: 351399
2019-01-16 23:46:14 +00:00
Craig Topper 59abdf5f3f [X86] Add X86ISD::VSHLV and X86ISD::VSRLV nodes for psllv and psrlv
Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits.

This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts.

We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change.

Patch by zhutianyang and Craig Topper

Differential Revision: https://reviews.llvm.org/D56695

llvm-svn: 351381
2019-01-16 21:46:32 +00:00
Changpeng Fang fe9269f804 AMDGPU: Adjust the chain for loads writing to the HI part of a register.
Summary:
  For these loads that write to the HI part of a register, we should chain them to the op that writes to the LO part
of the register to maintain the appropriate order.

Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D56454

llvm-svn: 351379
2019-01-16 21:32:53 +00:00
Craig Topper e5b7cc8aa0 [X86] Add a one use check to the setcc inversion code in combineVSelectWithAllOnesOrZeros
If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc.

Differential Revision: https://reviews.llvm.org/D56765

llvm-svn: 351378
2019-01-16 21:29:29 +00:00
Craig Topper 238ad13b3a [X86] Add test case for D56765. NFC
llvm-svn: 351377
2019-01-16 21:29:26 +00:00
Nikita Popov 5fa75a0488 [X86] Add additional saturating add/sub vector tests; NFC
Additional tests for vNi32 and vNi64. I've added these for
usub.sat before, this covers uadd.sat, ssub.sat and sadd.sat.

llvm-svn: 351375
2019-01-16 20:53:23 +00:00
Mandeep Singh Grang 33c49c0c82 [COFF, ARM64] Implement support for SEH extensions __try/__except/__finally
Summary:
This patch supports MS SEH extensions __try/__except/__finally. The intrinsics localescape and localrecover are responsible for communicating escaped static allocas from the try block to the handler.

We need to preserve frame pointers for SEH. So we create a new function/property HasLocalEscape.

Reviewers: rnk, compnerd, mstorsjo, TomTan, efriedma, ssijaric

Reviewed By: rnk, efriedma

Subscribers: smeenai, jrmuizel, alex, majnemer, ssijaric, ehsan, dmajor, kristina, javed.absar, kristof.beyls, chrib, llvm-commits

Differential Revision: https://reviews.llvm.org/D53540

llvm-svn: 351370
2019-01-16 19:52:59 +00:00
Andrea Di Biagio c5f0f5309e [X86][BtVer2] Update latency of horizontal operations.
On Jaguar, horizontal adds/subs have local forwarding disable.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.

This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.

I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).

Differential Revision: https://reviews.llvm.org/D56777

llvm-svn: 351366
2019-01-16 18:18:01 +00:00
Simon Pilgrim d40ab8db30 [X86] Regenerate test
Split check-prefixes to support a future commit

llvm-svn: 351362
2019-01-16 18:00:09 +00:00
Sanjay Patel 546c9f6d8f [x86] add tests for extracted scalar casts (PR39974); NFC
https://bugs.llvm.org/show_bug.cgi?id=39974

llvm-svn: 351354
2019-01-16 16:11:30 +00:00
Marek Olsak c5cec5e1fa AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52944

llvm-svn: 351351
2019-01-16 15:43:53 +00:00
Sanjay Patel 0dbecd05ed [x86] lower shuffle of extracts to AVX2 vperm instructions
I was trying to prevent shuffle regressions while matching more horizontal ops 
and ended up here:
  shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0

The affected tests were added for:
https://bugs.llvm.org/show_bug.cgi?id=34380

This patch won't change the examples in the bug report itself, but we should be 
able to extend this to catch more types.

Differential Revision: https://reviews.llvm.org/D56756

llvm-svn: 351346
2019-01-16 14:15:18 +00:00
Anton Korobeynikov cbdb4effae [MSP430] Emit a separate section for every interrupt vector
This is LLVM part of D56663

Linker scripts shipped by TI require to have every
interrupt vector in a separate section with a specific name:

 SECTIONS
 {
   __interrupt_vector_XX   : { KEEP (*(__interrupt_vector_XX )) } > VECTXX
   ...
 }

Follow the requirement emit the section for every vector
which contain address of interrupt handler:

  .section  __interrupt_vector_XX,"ax",@progbits
  .word %isr%

Patch by Kristina Bessonova!

Differential Revision: https://reviews.llvm.org/D56664

llvm-svn: 351345
2019-01-16 14:03:41 +00:00
Simon Pilgrim 1bdc049767 [X86][SSE] Add additional PR40318 shuffle test cases
llvm-svn: 351333
2019-01-16 13:15:59 +00:00
Sam Parker dd8cd6d26b [DAGCombine] Fix ReduceLoadWidth for shifted offsets
ReduceLoadWidth can trigger using a shifted mask is used and this
requires that the function return a shl node to correct for the
offset. However, the way that this was implemented meant that the
returned result could be an existing node, which would be incorrect.
This fixes the method of inserting the new node and replacing uses.

Differential Revision: https://reviews.llvm.org/D50432

llvm-svn: 351310
2019-01-16 08:40:12 +00:00
Aditya Nandakumar 500e3ead9f [GISel]: Add support for CSEing continuously during GISel passes.
https://reviews.llvm.org/D52803

This patch adds support to continuously CSE instructions during
each of the GISel passes. It consists of a GISelCSEInfo analysis pass
that can be used by the CSEMIRBuilder.

llvm-svn: 351283
2019-01-16 00:40:37 +00:00
Mandeep Singh Grang 436735c3fe [EH] Rename llvm.x86.seh.recoverfp intrinsic to llvm.eh.recoverfp
Summary:
Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc.
Refer D53541 for the context. Clang counterpart D56748.

Reviewers: rnk, efriedma

Reviewed By: rnk, efriedma

Subscribers: javed.absar, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D56747

llvm-svn: 351281
2019-01-16 00:37:13 +00:00
Craig Topper 34ac509ac8 [X86] Add avx512 scatter intrinsics that use a vXi1 mask instead of a scalar integer.
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.

llvm-svn: 351275
2019-01-15 23:36:25 +00:00
Changpeng Fang 20fe3d2f35 AMDGPU: Raise the priority of MAD24 in instruction selection.
Summary:
  We have seen performance regression when v_add3 is generated. The major reason is that the v_mad pattern
is broken when v_add3 is generated. We also see the register pressure increased. While we could not properly
estimate register pressure during instruction selection, we can give mad a higher priority.

In this work, we raise the priority for mad24 in selection and resolve the performance regression.

Reviewers:
  rampitec

Differential Revision:
  https://reviews.llvm.org/D56745

llvm-svn: 351273
2019-01-15 23:12:36 +00:00
Roman Lebedev fb4eed381d X86DAGToDAGISel::matchBitExtract() with truncation (PR36419)
Summary:
Previously in D54095 i have added support for extraction of `lshr` from `X` if we are to produce `BEXTR`.
That was good, but the fix was partial, there was still [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]].

That pattern can also appear, roughly, when you have a large (64-bit) storage, and the consume bits from it.
It will not be unexpected if you will be doing further computations in 32-bit width.
And then the current code breaks, as the tests show.

The basic idea/pattern here is following:
1. We have `i64` input
2. We perform `i64` right-shift on it.
3. We `trunc`ate that shifted value
4. We do all further work (masking) in `i32`

Since we see `trunc`ation and not `lshr`, we give up, and stop trying to extract that right-shift.
BUT. The mask is `i32`, therefore we can extend both of the operands of the masking (`and`) to `i64`
and truncate the result after masking: https://rise4fun.com/Alive/K4B
```
Name: @bextr64_32_b1 -> @bextr64_32_b0
  %shiftedval = lshr i64 %val, %numskipbits
  %truncshiftedval = trunc i64 %shiftedval to i32
  %widenumlowbits1 = zext i8 %numlowbits to i32
  %notmask1 = shl nsw i32 -1, %widenumlowbits1
  %mask1 = xor i32 %notmask1, -1
  %res = and i32 %truncshiftedval, %mask1
=>
  %shiftedval = lshr i64 %val, %numskipbits
  %widenumlowbits = zext i8 %numlowbits to i64
  %notmask = shl nsw i64 -1, %widenumlowbits
  %mask = xor i64 %notmask, -1
  %wideres = and i64 %shiftedval, %mask
  %res = trunc i64 %wideres to i32
```

Thus, we are again able to extract that `lshr` into `BEXTR`'s control.

Now, the perf (via `llvm-exegesis`) of the snippet suggests that it is not a good idea:
```
$ cat /tmp/old.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RSI
# LLVM-EXEGESIS-LIVEIN EDX
# LLVM-EXEGESIS-LIVEIN RDI
movq %rsi, %rcx
shrq %cl, %rdi
shll $8, %edx
bextrl %edx, %edi, %eax
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-1e0082.o
---
mode:            latency
key:
  instructions:
    - 'MOV64rr RCX RSI'
    - 'SHR64rCL RDI RDI'
    - 'SHL32ri EDX EDX i_0x8'
    - 'BEXTR32rr EAX EDI EDX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.6638, per_snippet_value: 2.6552 }
error:           ''
info:            ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-43e346.o
---
mode:            uops
key:
  instructions:
    - 'MOV64rr RCX RSI'
    - 'SHR64rCL RDI RDI'
    - 'SHL32ri EDX EDX i_0x8'
    - 'BEXTR32rr EAX EDI EDX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 0, per_snippet_value: 0 }
  - { key: PdFPU1, value: 0, per_snippet_value: 0 }
  - { key: PdFPU2, value: 0, per_snippet_value: 0 }
  - { key: PdFPU3, value: 0, per_snippet_value: 0 }
  - { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error:           ''
info:            ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
```
vs
```
$ cat /tmp/new.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RDX
# LLVM-EXEGESIS-LIVEIN SIL
# LLVM-EXEGESIS-LIVEIN RDI
shlq $8, %rdx
movzbl %sil, %eax
orq %rdx, %rax
bextrq %rax, %rdi, %rax
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8944f1.o
---
mode:            latency
key:
  instructions:
    - 'SHL64ri RDX RDX i_0x8'
    - 'MOVZX32rr8 EAX SIL'
    - 'OR64rr RAX RAX RDX'
    - 'BEXTR64rr RAX RDI RAX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.7454, per_snippet_value: 2.9816 }
error:           ''
info:            ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-da403c.o
---
mode:            uops
key:
  instructions:
    - 'SHL64ri RDX RDX i_0x8'
    - 'MOVZX32rr8 EAX SIL'
    - 'OR64rr RAX RAX RDX'
    - 'BEXTR64rr RAX RDI RAX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 0, per_snippet_value: 0 }
  - { key: PdFPU1, value: 0, per_snippet_value: 0 }
  - { key: PdFPU2, value: 0, per_snippet_value: 0 }
  - { key: PdFPU3, value: 0, per_snippet_value: 0 }
  - { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error:           ''
info:            ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
```
^ latency increased (worse).

Except //maybe// not really.
Like with all synthetic benchmarks, they //may// be misleading.

Let's take a look on some actual real-world hotpath.
In this case it's 'my' [[ https://github.com/darktable-org/rawspeed | RawSpeed ]]'s `BitStream<>::peekBitsNoFill()`, in [[ e3316dc851/src/librawspeed/decompressors/VC5Decompressor.cpp (L814) | GoPro VC5 decompressor ]]:
```
raw.pixls.us-unique/GoPro/HERO6 Black$ /usr/src/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-clangs1-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR
RUNNING: /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplwbKEM
2018-12-22 21:23:03
Running /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench
Run on (8 X 4012.81 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 3.41, 2.41, 2.03
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean           40 ms         40 ms        128   0.322244          7.96974        12M       37.4457M        298.534M      3.12047       24.8778   0.040465
GOPR9172.GPR/threads:8/real_time_median         39 ms         39 ms        128   0.312606          7.99155        12M        38.387M        306.788M      3.19891       25.5656   0.039115
GOPR9172.GPR/threads:8/real_time_stddev          4 ms          3 ms        128  0.0271557         0.130575          0        2.4941M        21.3909M     0.207842       1.78257   3.81081m
RUNNING: /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpWAkan9
2018-12-22 21:23:08
Running /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Run on (8 X 4013.1 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 3.78, 2.50, 2.06
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean           39 ms         39 ms        128   0.311533          7.97323        12M       38.6828M        308.471M      3.22356        25.706  0.0390928
GOPR9172.GPR/threads:8/real_time_median         38 ms         38 ms        128   0.304231          7.99005        12M       39.4437M        315.527M      3.28698        26.294  0.0380316
GOPR9172.GPR/threads:8/real_time_stddev          3 ms          3 ms        128  0.0229149         0.133814          0       2.26225M        19.1421M     0.188521       1.59517   3.13671m
Comparing /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 128 vs 128
GOPR9172.GPR/threads:8/real_time_mean                  -0.0339         -0.0316            40            39            40            39
GOPR9172.GPR/threads:8/real_time_median                -0.0277         -0.0274            39            38            39            38
GOPR9172.GPR/threads:8/real_time_stddev                -0.1769         -0.1267             4             3             3             3
```
I.e. this results in //roughly// -3% improvements in perf.

While this will help [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]], it won't address it fully.

Reviewers: RKSimon, craig.topper, andreadb, spatel

Reviewed By: craig.topper

Subscribers: courbet, llvm-commits

Differential Revision: https://reviews.llvm.org/D56052

llvm-svn: 351253
2019-01-15 21:31:18 +00:00
Craig Topper 82015b633b [X86] Add versions of the avx512 gather intrinsics that take the mask as a vXi1 vector instead of a scalar
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.

I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.

I'll work on scatter and gatherpf/scatterpf next.

Differential Revision: https://reviews.llvm.org/D56527

llvm-svn: 351234
2019-01-15 20:12:33 +00:00
Craig Topper 99fcbf67d0 [Nios2] Remove Nios2 backend
As mentioned here http://lists.llvm.org/pipermail/llvm-dev/2019-January/129121.html This backend is incomplete and has not been maintained in several months.

Differential Revision: https://reviews.llvm.org/D56691

llvm-svn: 351231
2019-01-15 19:59:19 +00:00
Nikita Popov d3b86b79fa Reapply "[CodeGen][X86] Expand USUBSAT to UMAX+SUB, also for vectors"
Related to https://bugs.llvm.org/show_bug.cgi?id=40123.

Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB,
which produces much better code for X86.

Reapplying with updated SLPVectorizer tests.

Differential Revision: https://reviews.llvm.org/D56636

llvm-svn: 351219
2019-01-15 18:43:41 +00:00
Simon Pilgrim b8f08c8d7b [X86] Bailout of lowerVectorShuffleAsPermuteAndUnpack for shuffle-with-zero (PR40306)
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.

We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.

llvm-svn: 351203
2019-01-15 16:56:55 +00:00
Simon Pilgrim fa7a8c7ddc [X86] Add PR40318 shuffle test case
The other test case is already covered by the PR40306 test case, which was mainly concerned with SSSE3 codegen.

llvm-svn: 351201
2019-01-15 16:31:10 +00:00
James Y Knight 693d39dd12 Remove irrelevant references to legacy git repositories from
compiler identification lines in test-cases.

(Doing so only because it's then easier to search for references which
are actually important and need fixing.)

llvm-svn: 351200
2019-01-15 16:18:52 +00:00
Sanjay Patel fad5bdaf95 [DAGCombiner] reduce buildvec of zexted extracted element to shuffle
The motivating case for this is shown in the first regression test. We are 
transferring to scalar and back rather than just zero-extending with 'vpmovzxdq'.

That's a special-case for a more general pattern as shown here. In all tests, 
we're avoiding the vector-scalar-vector moves in favor of vector ops.

We aren't producing optimal shuffle code in some cases though, so the patch is
limited to reduce regressions.

Differential Revision: https://reviews.llvm.org/D56281

llvm-svn: 351198
2019-01-15 16:11:05 +00:00
Roman Lebedev 58000f804a [NFC][X86] extract-bits.ll: add test with truncation with extra-use.
That extra-use *should* prevent D56052 from looking past the trunc.

llvm-svn: 351182
2019-01-15 10:36:20 +00:00
Craig Topper 2581249f05 [X86] Upgrade some avx512bw shift intrinsics that were removed a while ago. NFC
Masking was removed from these intrinsics and I guess we didn't update the tests then.

llvm-svn: 351165
2019-01-15 07:15:20 +00:00
Craig Topper 3c5423b26e [X86] Add test cases for D56695. NFC
llvm-svn: 351162
2019-01-15 06:39:51 +00:00
Craig Topper a3cfdcc21f [X86] Switch the triple on avx2-intrinsics-x86.ll to be -unknown-unknown instead of darwin so the constant pool entries will be filtered better by the script.
Darwin uses LCPI instead of .LCPI so the filter doesn't work.

This is silly, but it will help reduce some future some test diffs.

llvm-svn: 351161
2019-01-15 06:39:49 +00:00
Thomas Lively 6bf2b40051 [WebAssembly] Expand SIMD shifts while V8's implementation disagrees
Summary:
V8 currently implements SIMD shifts as taking an immediate operation,
which disagrees with the spec proposal and the toolchain
implementation. As a stopgap measure to get things working, unroll all
vector shifts. Since this is a temporary measure, there are no tests.

Reviewers: aheejin, dschuff

Subscribers: sbc100, jgravelle-google, sunfish, dmgreen, llvm-commits

Differential Revision: https://reviews.llvm.org/D56520

llvm-svn: 351151
2019-01-15 02:16:03 +00:00
Marek Olsak 33eb4d947d AMDGPU: Add a fast path for icmp.i1(src, false, NE)
Summary:
This allows moving the condition from the intrinsic to the standard ICmp
opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic
is an identity for retrieving the SGPR mask.

And we can also get the mask from and i1, or i1, xor i1.

Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52060

llvm-svn: 351150
2019-01-15 02:13:18 +00:00
Evandro Menezes f793fe1402 [AArch64] Adjust the feature set for Exynos
Enable the fusion of arithmetic and logic instructions for Exynos M4.

llvm-svn: 351149
2019-01-15 01:53:49 +00:00
Reid Kleckner fe5e5dcab0 [X86] Avoid clobbering ESP/RSP in the epilogue.
Summary:
In r345197 ESP and RSP were added to GR32_TC/GR64_TC, allowing them to
be used for tail calls, but this also caused `findDeadCallerSavedReg` to
think they were acceptable targets for clobbering. Filter them out.

Fixes PR40289.

Patch by Geoffry Song!

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D56617

llvm-svn: 351146
2019-01-15 01:24:18 +00:00
Evandro Menezes 111033d917 [AArch64] Fix typo (NFC)
Fix another typo, this time in the `RUN` line, which used a syntax not
universally supported, in test case added by D56572.

llvm-svn: 351144
2019-01-15 00:58:59 +00:00
Evandro Menezes 4b6c290fbd [AArch64] Fix typo (NFC)
Fix typo in test case added by D56572 (rL351139).

llvm-svn: 351143
2019-01-15 00:20:57 +00:00
Eli Friedman 295e346da4 [EarlyIfConversion] Don't if-convert unconditional branches.
A block ending in an unconditional branch can have two successors if one
is a landing pad.  In practice, I think this only has an effect on
Windows because landing pads are never empty for Itanium unwinding.

(Alternatively, I could add a check to
AArch64InstrInfo::canInsertSelect, but this seems more obvious.)

Differential Revision: https://reviews.llvm.org/D56468

llvm-svn: 351142
2019-01-15 00:19:46 +00:00
Eli Friedman 33aecc8182 [AArch64] Explicitly use v1i64 type for llvm.aarch64.neon.abs.i64 .
Otherwise, with D56544, the intrinsic will be expanded to an integer
csel, which is probably not what the user expected.  This matches the
general convention of using "v1" types to represent scalar integer
operations in vector registers.

While I'm here, also add some error checking so we don't generate
illegal ABS nodes.

Differential Revision: https://reviews.llvm.org/D56616

llvm-svn: 351141
2019-01-15 00:15:24 +00:00
Evandro Menezes bf59cb02c3 [AArch64] Add new target feature to fuse arithmetic and logic operations
This feature enables the fusion of some arithmetic and logic instructions
together.

Differential revision: https://reviews.llvm.org/D56572

llvm-svn: 351139
2019-01-14 23:54:36 +00:00