Commit Graph

3487 Commits

Author SHA1 Message Date
Eli Friedman 96e3cd85bd [ARM] Lower llvm.ctlz.i32 to a libcall when clz is not available.
The inline sequence is very long (about 70 bytes on Thumb1), so it's
not really a good idea to inline it, especially when optimizing for
size.

Differential Revision: https://reviews.llvm.org/D47917

llvm-svn: 340458
2018-08-22 21:47:14 +00:00
Martin Storsjo 5ab1d107bb [ARM] Avoid injecting constant islands in movw+movt pairs on Windows
On Windows, movw+movt pairs with relocations are handled with a single
relocation that covers them both. Therefore we can't inject anything
between these instructions, otherwise the relocation (which in LLVM
only is treated as the movw instruction's relocation, while the movt
instruction's relocation is dropped) will end up bogus.

These instructions are bundled up until right before the constant
islands pass, making this effectively the only place that can split
them apart.

Differential Revision: https://reviews.llvm.org/D51032

llvm-svn: 340451
2018-08-22 20:34:12 +00:00
Eli Friedman c11e2b9470 [ARM] Handle all-ones mask explicitly in targetShrinkDemandedConstant.
This avoids a potential infinite loop setting and unsetting bits in the
mask.

Reduced from a failure on the polly-aosp bot.

Differential Revision: https://reviews.llvm.org/D51066

llvm-svn: 340446
2018-08-22 20:13:45 +00:00
Sam Parker 4d519fc3b5 [ARM] Rotated operand patterns for *xtb16
Add intrinsic isel patterns for sxtb16, sxtab16, uxtb16 and uxtab16
so that they can perform a ror.

Differential Revision: https://reviews.llvm.org/D51034

llvm-svn: 340405
2018-08-22 12:58:36 +00:00
Sam Parker 597811e7a7 [DAGCombiner] Reduce load widths of shifted masks
During combining, ReduceLoadWdith is used to combine AND nodes that
mask loads into narrow loads. This patch allows the mask to be a
shifted constant. This results in a narrow load which is then left
shifted to compensate for the new offset.

Differential Revision: https://reviews.llvm.org/D50432

llvm-svn: 340261
2018-08-21 10:26:59 +00:00
Eli Friedman 73e8a784e6 [SelectionDAG] Improve the legalisation lowering of UMULO.
There is no way in the universe, that doing a full-width division in
software will be faster than doing overflowing multiplication in
software in the first place, especially given that this same full-width
multiplication needs to be done anyway.

This patch replaces the previous implementation with a direct lowering
into an overflowing multiplication algorithm based on half-width
operations.

Correctness of the algorithm was verified by exhaustively checking the
output of this algorithm for overflowing multiplication of 16 bit
integers against an obviously correct widening multiplication. Baring
any oversights introduced by porting the algorithm to DAG, confidence in
correctness of this algorithm is extremely high.

Following table shows the change in both t = runtime and s = space. The
change is expressed as a multiplier of original, so anything under 1 is
“better” and anything above 1 is worse.

+-------+-----------+-----------+-------------+-------------+
| Arch  | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s |
+-------+-----------+-----------+-------------+-------------+
|   X64 |     -     |     -     |    ~0.5     |    ~0.64    |
|  i686 |   ~0.5    |   ~0.6666 |    ~0.05    |    ~0.9     |
| armv7 |     -     |   ~0.75   |      -      |    ~1.4     |
+-------+-----------+-----------+-------------+-------------+

Performance numbers have been collected by running overflowing
multiplication in a loop under `perf` on two x86_64 (one Intel Haswell,
other AMD Ryzen) based machines. Size numbers have been collected by
looking at the size of function containing an overflowing multiply in
a loop.

All in all, it can be seen that both performance and size has improved
except in the case of armv7 where code size has regressed for 128-bit
multiply. u128*u128 overflowing multiply on 32-bit platforms seem to
benefit from this change a lot, taking only 5% of the time compared to
original algorithm to calculate the same thing.

The final benefit of this change is that LLVM is now capable of lowering
the overflowing unsigned multiply for integers of any bit-width as long
as the target is capable of lowering regular multiplication for the same
bit-width. Previously, 128-bit overflowing multiply was the widest
possible.

Patch by Simonas Kazlauskas!

Differential Revision: https://reviews.llvm.org/D50310

llvm-svn: 339922
2018-08-16 18:39:39 +00:00
Sam Parker 0d51197051 [ARM] Ignore GEPs in ARMCodeGenPrepare
While searching through the use-def tree, ignore GetElementPtrInst
instructions because they don't need promoting and neither do their
indices. Otherwise, the wide indices prevent the transformation from
happening.

Differential Revision: https://reviews.llvm.org/D50762

llvm-svn: 339871
2018-08-16 12:24:40 +00:00
Sam Parker 0e2f0bd48e [ARM] Allow zext in ARMCodeGenPrepare
Treat zext instructions as roots, like we do for truncs.

Differential Revision: https://reviews.llvm.org/D50759

llvm-svn: 339868
2018-08-16 11:54:09 +00:00
Sam Parker 13567dbbd8 [ARM] Allow signed icmps in ARMCodeGenPrepare
Originally committed in r339755 which was reverted in r339806 due to
an asan issue. The issue was caused by my assumption that operands to
a CallInst mapped to the FunctionType Params. CallInsts are now
handled by iterating over their ArgOperands instead of Operands.
    
Original Message:
  Treat signed icmps as 'sinks', allowing them to be in the use-def
  tree, enabling more promotions to be performed. As a sink, any
  promoted incoming values need to be truncated before being used by
  the signed icmp.
    
  Differential Revision: https://reviews.llvm.org/D50067

llvm-svn: 339858
2018-08-16 10:05:39 +00:00
Vitaly Buka ed4239f482 Revert "[ARM] Allow signed icmps in ARMCodeGenPrepare"
use-after-poison in check-llvm under asan

This reverts commit r339755.

llvm-svn: 339806
2018-08-15 20:09:35 +00:00
Sam Parker fabf7fe5f8 [ARM] TypeSize lower bound for ARMCodeGenPrepare
We only try to promote types with are smaller than 16-bits, but we
also need to check that the type is not less than 8-bits.

Differential Revision: https://reviews.llvm.org/D50769

llvm-svn: 339770
2018-08-15 13:29:50 +00:00
Sam Parker 6548cd3905 [ARM] Allow signed icmps in ARMCodeGenPrepare
Treat signed icmps as 'sinks', allowing them to be in the use-def
tree, enabling more promotions to be performed. As a sink, any
promoted incoming values need to be truncated before being used by
the signed icmp.

Differential Revision: https://reviews.llvm.org/D50067

llvm-svn: 339755
2018-08-15 08:23:03 +00:00
Sam Parker 7def86bbdb [ARM] Allow pointer values in ARMCodeGenPrepare
Add pointers to the list of allowed types, but don't try to promote
them. Also fixed a bug with the promotion of undef values, so a new
value is now created instead of mutating in place. We also now only
promote if there's an instruction in the use-def chains other than
the icmp, sinks and sources.

Differential Revision: https://reviews.llvm.org/D50054

llvm-svn: 339754
2018-08-15 07:52:35 +00:00
Luke Geeson 4ce41d2bb7 [ARM] Added FP16 VREV Vector Instrinsic CodeGen support
llvm-svn: 339546
2018-08-13 08:37:41 +00:00
Eli Friedman e1687a89e8 [ARM] Adjust AND immediates to make them cheaper to select.
LLVM normally prefers to minimize the number of bits set in an AND
immediate, but that doesn't always match the available ARM instructions.
In Thumb1 mode, prefer uxtb or uxth where possible; otherwise, prefer
a two-instruction sequence movs+ands or movs+bics.

Some potential improvements outlined in
ARMTargetLowering::targetShrinkDemandedConstant, but seems to work
pretty well already.

The ARMISelDAGToDAG fix ensures we don't generate an invalid UBFX
instruction due to a larger-than-expected mask. (It's orthogonal, in
some sense, but as far as I can tell it's either impossible or nearly
impossible to reproduce the bug without this change.)

According to my testing, this seems to consistently improve codesize by
a small amount by forming bic more often for ISD::AND with an immediate.

Differential Revision: https://reviews.llvm.org/D50030

llvm-svn: 339472
2018-08-10 21:21:53 +00:00
Sam Parker 8c4b964c5a [ARM] Disallow zexts in ARMCodeGenPrepare
Enabling ARMCodeGenPrepare by default caused a whole load of
failures. This is due to zexts and truncs not being handled properly.
ZExts are messy so it's just easier to disable for now and truncs
are allowed only as 'sinks'. I still need to figure out why allowing
them as 'sources' causes so many failures. The other main changes are
that we are explicit in the types that we converting to, it's now
always 'TypeSize'. Type support is also now performed while checking
for valid opcodes as it unnecessarily complicated having the checks
are different stages.
    
I've moved the tests around too, so we have the zext and truncs in
their own file as well as the overflowing opcode tests.

Differential Revision: https://reviews.llvm.org/D50518

llvm-svn: 339432
2018-08-10 13:57:13 +00:00
Evandro Menezes 9a92fe0c9e [ARM] Replace processor check with feature
Add new feature, `FeatureUseWideStrideVFP`, that replaces the need for a
processor check.  Otherwise, NFC.

llvm-svn: 339354
2018-08-09 16:13:24 +00:00
Sjoerd Meijer 806f70d229 [ARM] FP16: codegen support for VTRN
Differential Revision: https://reviews.llvm.org/D50454

llvm-svn: 339340
2018-08-09 12:45:09 +00:00
Petr Hosek 7b27454477 [ADT] Normalize empty triple components
LLVM triple normalization is handling "unknown" and empty components
differently; for example given "x86_64-unknown-linux-gnu" and
"x86_64-linux-gnu" which should be equivalent, triple normalization
returns "x86_64-unknown-linux-gnu" and "x86_64--linux-gnu". autoconf's
config.sub returns "x86_64-unknown-linux-gnu" for both
"x86_64-linux-gnu" and "x86_64-unknown-linux-gnu". This changes the
triple normalization to behave the same way, replacing empty triple
components with "unknown".

This addresses PR37129.

Differential Revision: https://reviews.llvm.org/D50219

llvm-svn: 339294
2018-08-08 22:23:57 +00:00
Eli Friedman 5b45a39056 [ARM] Avoid spilling lr with Thumb1 tail calls.
Normally, if any registers are spilled, we prefer to spill lr on Thumb1
so we can fold the "bx lr" into the "pop".  However, if there are tail
calls involved, restoring lr is expensive, so skip the optimization in
that case.

The spill of r7 in the new test also isn't necessary, but that's
mostly orthogonal to this patch. (It's the same code in
ARMFrameLowering, but it's not related to tail calls.)

Differential Revision: https://reviews.llvm.org/D49459

llvm-svn: 339283
2018-08-08 20:03:10 +00:00
Ties Stuij 0244aa67d6 revert tests of '[CodeGen] emit inline asm clobber list warnings for reserved'
llvm-svn: 339276
2018-08-08 17:19:32 +00:00
Ties Stuij 52f3631f4b [CodeGen] emit inline asm clobber list warnings for reserved
Summary:
Currently, in line with GCC, when specifying reserved registers like sp or pc on an inline asm() clobber list, we don't always preserve the original value across the statement. And in general, overwriting reserved registers can have surprising results.

For example:


```
extern int bar(int[]);

int foo(int i) {
  int a[i]; // VLA
  asm volatile(
      "mov r7, #1"
    :
    :
    : "r7"
  );

  return 1 + bar(a);
}
```

Compiled for thumb, this gives:
```
$ clang --target=arm-arm-none-eabi -march=armv7a -c test.c -o - -S -O1 -mthumb
...
foo:
        .fnstart
@ %bb.0:                                @ %entry
        .save   {r4, r5, r6, r7, lr}
        push    {r4, r5, r6, r7, lr}
        .setfp  r7, sp, #12
        add     r7, sp, #12
        .pad    #4
        sub     sp, #4
        movs    r1, #7
        add.w   r0, r1, r0, lsl #2
        bic     r0, r0, #7
        sub.w   r0, sp, r0
        mov     sp, r0
        @APP
        mov.w   r7, #1
        @NO_APP
        bl      bar
        adds    r0, #1
        sub.w   r4, r7, #12
        mov     sp, r4
        pop     {r4, r5, r6, r7, pc}
...
```

r7 is used as the frame pointer for thumb targets, and this function needs to restore the SP from the FP because of the variable-length stack allocation a. r7 is clobbered by the inline assembly (and r7 is included in the clobber list), but LLVM does not preserve the value of the frame pointer across the assembly block.

This type of behavior is similar to GCC's and has been discussed on the bugtracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11807 . No consensus seemed to have been reached on the way forward.  Clang behavior has briefly been discussed on the CFE mailing (starting here: http://lists.llvm.org/pipermail/cfe-dev/2018-July/058392.html). I've opted for following Eli Friedman's advice to print warnings when there are reserved registers on the clobber list so as not to diverge from GCC behavior for now.

The patch uses MachineRegisterInfo's target-specific knowledge of reserved registers, just before we convert the inline asm string in the AsmPrinter.

If we find a reserved register, we print a warning:
```
repro.c:6:7: warning: inline asm clobber list contains reserved registers: R7 [-Winline-asm]
      "mov r7, #1"
      ^
```

Reviewers: eli.friedman, olista01, javed.absar, efriedma

Reviewed By: efriedma

Subscribers: efriedma, eraman, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D49727

llvm-svn: 339257
2018-08-08 15:15:59 +00:00
Sjoerd Meijer 1919ecfd0b [ARM][NFC] Replaced tab-characters in test file vtrn.ll
llvm-svn: 339251
2018-08-08 14:42:11 +00:00
Sjoerd Meijer f8c394f0f5 [ARM] FP16: codegen support for VEXT
Differential Revision: https://reviews.llvm.org/D50427

llvm-svn: 339241
2018-08-08 13:26:38 +00:00
Sjoerd Meijer db5908deb9 [ARM] FP16: vector vmov and vdup support
This adds codegen support for the vmov_n_f16 and vdup_n_f16 variants.

Differential Revision: https://reviews.llvm.org/D50329

llvm-svn: 339238
2018-08-08 13:11:31 +00:00
Sjoerd Meijer 920a453485 [ARM] FP16: vector VMUL variants
This adds codegen support for the vmul_lane_f16 and vmul_n_f16 variants.

Differential Revision: https://reviews.llvm.org/D50326

llvm-svn: 339232
2018-08-08 10:27:34 +00:00
Sjoerd Meijer b33a4c02cc [ARM] FP16: support vector INT_TO_FP and FP_TO_INT
This adds codegen support for the different vcvt_f16 variants.

Differential Revision: https://reviews.llvm.org/D50393

llvm-svn: 339227
2018-08-08 09:45:34 +00:00
Thomas Preud'homme 4107b31df2 Support inline asm with multiple 64bit output in 32bit GPR
Summary: Extend fix for PR34170 to support inline assembly with multiple output operands that do not naturally go in the register class it is constrained to (eg. double in a 32-bit GPR as in the PR).

Reviewers: bogner, t.p.northover, lattner, javed.absar, efriedma

Reviewed By: efriedma

Subscribers: efriedma, tra, eraman, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D45437

llvm-svn: 339225
2018-08-08 09:35:26 +00:00
Sjoerd Meijer b264944ed5 [ARM] FP16: support the vector vmin and vmax variants
Differential Revision: https://reviews.llvm.org/D50238

llvm-svn: 339221
2018-08-08 07:20:15 +00:00
Sjoerd Meijer b39cd886b9 [ARM] FP16: codegen support for VACGT
Differential Revision: https://reviews.llvm.org/D50236

llvm-svn: 339148
2018-08-07 15:11:47 +00:00
Sjoerd Meijer a2ddddfd3e [ARM][NFC] Replaced tab characters in test file vfcmp.ll.
llvm-svn: 339111
2018-08-07 08:05:15 +00:00
Sjoerd Meijer d62c5ec2fe [ARM] FP16: support vector zip and unzip
This is addressing PR38404.

Differential Revision: https://reviews.llvm.org/D50186

llvm-svn: 338835
2018-08-03 09:24:29 +00:00
Sjoerd Meijer 9b30213828 [ARM] FP16: support VFMA
This is addressing PR38404.

llvm-svn: 338830
2018-08-03 09:12:56 +00:00
Sjoerd Meijer 8e7fab0443 [ARM][NFC] Follow up of r338568
I disabled more tests than necessary, this enables them.

llvm-svn: 338717
2018-08-02 14:04:48 +00:00
Alexander Ivchenko 49168f6778 [GlobalISel] Rewrite CallLowering::lowerReturn to accept multiple VRegs per Value
This is logical continuation of https://reviews.llvm.org/D46018 (r332449)

Differential Revision: https://reviews.llvm.org/D49660

llvm-svn: 338685
2018-08-02 08:33:31 +00:00
Sjoerd Meijer 590e4e8dde [ARM] Armv8.2-A FP16 vector intrinsics tests
Clang support for the Armv8.2-A FP16 vector intrinsic was committed in
rC328277, but this was never followed up, i.e. the LLVM part is missing.

I've raised PR38404, and this is the first step to address this. I.e.,
this adds tests for the Armv8.2-A FP16 vector intrinsic, and thus shows
which intrinsics already work, and which need further work.

Differential Revision: https://reviews.llvm.org/D50142

llvm-svn: 338568
2018-08-01 14:43:59 +00:00
Reid Kleckner b32ff46ff7 Revert r338354 "[ARM] Revert r337821"
Disable ARMCodeGenPrepare by default again. It is causing verifier
failues in V8 that look like:

  Duplicate integer as switch case
  switch i32 %trunc, label %if.end13 [
    i32 0, label %cleanup36
    i32 0, label %if.then8
  ], !dbg !4981
  i32 0
  fatal error: error in backend: Broken function found, compilation aborted!

I will continue reducing the test case and send it along.

llvm-svn: 338452
2018-07-31 23:09:42 +00:00
Sam Parker 2a6c842fda [ARM] Revert r337821
Re-enabling ARMCodeGenPrepare by default after failing to reproduce
the bootstrap issues that I was concerned it was causing.

llvm-svn: 338354
2018-07-31 09:04:14 +00:00
Thomas Preud'homme 196149c943 Reapply "Fix crash on inline asm with 64bit matching input in 32bit GPR"
This reapplies commit r338206 reverted by r338214 since the bug that
r338206 uncovered has been fixed in r338268.

Add support for inline assembly with matching input operand that do not
naturally go in the register class it is constrained to (eg. double in a
32-bit GPR). Note that regular input is already handled by existing
code.

llvm-svn: 338269
2018-07-30 16:48:39 +00:00
Thomas Preud'homme 6c1b075299 Fix uninitialized read in ARM's PrintAsmOperand
Summary:
Fix read of uninitialized RC variable in ARM's PrintAsmOperand when
hasRegClassConstraint returns false. This was causing
inline-asm-operand-implicit-cast test to fail in r338206.

Reviewers: t.p.northover, weimingz, javed.absar, chill

Reviewed By: chill

Subscribers: chill, eraman, kristof.beyls, chrib, llvm-commits

Differential Revision: https://reviews.llvm.org/D49984

llvm-svn: 338268
2018-07-30 16:45:40 +00:00
Petr Pavlu 8b6eff4e77 [ARM] Fix over-alignment in arguments that are HA of 128-bit vectors
Code in `CC_ARM_AAPCS_Custom_Aggregate()` is responsible for handling
homogeneous aggregates for `CC_ARM_AAPCS_VFP`. When an aggregate ends up
fully on stack, the function tries to pack all resulting items of the
aggregate as tightly as possible according to AAPCS.

Once the first item was laid out, the alignment used for consecutive
items was the size of one item. This logic went wrong for 128-bit
vectors because their alignment is normally only 64 bits, and so could
result in inserting unexpected padding between the first and second
element.

The patch fixes the problem by updating the alignment with the item size
only if this results in reducing it.

Differential Revision: https://reviews.llvm.org/D49720

llvm-svn: 338233
2018-07-30 08:49:30 +00:00
Sanjay Patel 7312206f2f revert r338206 because the test does not pass
Example of bot failure:
http://lab.llvm.org:8011/builders/clang-cmake-armv8-quick/builds/5107/steps/ninja%20check%201/logs/FAIL%3A%20LLVM%3A%3Ainline-asm-operand-implicit-cast.ll

llvm-svn: 338214
2018-07-29 14:30:49 +00:00
Thomas Preud'homme 74ffd14e15 Fix crash on inline asm with 64bit matching input in 32bit GPR
Add support for inline assembly with matching input operand that do not
naturally go in the register class it is constrained to (eg. double in a
32-bit GPR). Note that regular input is already handled by existing
code.

llvm-svn: 338206
2018-07-28 21:33:39 +00:00
Craig Topper 50b1d4303d [DAGCombiner] Teach DAG combiner that A-(B-C) can be folded to A+(C-B)
This can be useful since addition is commutable, and subtraction is not.

This matches a transform that is also done by InstCombine.

llvm-svn: 338181
2018-07-28 00:27:25 +00:00
Evandro Menezes fcca45f0dd [ARM] Add new target feature to fuse literal generation
This feature enables the fusion of such operations on Cortex A57 and Cortex
A72, as recommended in their Software Optimisation Guides, sections 4.14 and
4.11, respectively.

Differential revision: https://reviews.llvm.org/D49563

llvm-svn: 338147
2018-07-27 18:16:47 +00:00
Thomas Preud'homme 768d6ce4a3 Fix PR34170: Crash on inline asm with 64bit output in 32bit GPR
Add support for inline assembly with output operand that do not
naturally go in the register class it is constrained to (eg. double in a
32-bit GPR as in the PR).

llvm-svn: 337903
2018-07-25 11:11:12 +00:00
Sam Parker 8b93e82c3d [ARM] Disable ARMCodeGenPrepare by default
ARM Stage 2 builders have been suspiciously broken since the pass was
committed. Disabling to hopefully fix the bots and give me time to
debug.

llvm-svn: 337821
2018-07-24 12:04:23 +00:00
Sam Parker 3828c6ff94 [ARM] ARMCodeGenPrepare backend pass
Arm specific codegen prepare is implemented to perform type promotion
on icmp operands, which can enable the removal of uxtb and uxth
(unsigned extend) instructions. This is possible because performing
type promotion before ISel alleviates this duty from the DAG builder
which has to perform legalisation, but has a limited view on data
ranges.
    
The pass visits any instruction operand of an icmp and creates a
worklist to traverse the use-def tree to determine whether the values
can simply be promoted. Our concern is values in the registers
overflowing the narrow (i8, i16) data range, so instructions marked
with nuw can be promoted easily. For add and sub instructions, we are
able to use the parallel dsp instructions to operate on scalar data
types and avoid overflowing bits. Underflowing adds and subs are also
permitted when the result is only used by an unsigned icmp.

Differential Revision: https://reviews.llvm.org/D48832

llvm-svn: 337687
2018-07-23 12:27:47 +00:00
Evandro Menezes fffa9b5897 [ARM] Add new feature to enable optimizing the VFP registers
Enable the optimization of operations on DPR and SPR via a feature instead
of checking the target.

Differential revision: https://reviews.llvm.org/D49463

llvm-svn: 337575
2018-07-20 16:49:28 +00:00
Tim Northover a7c18ee8c4 ARM: switch armv7em MachO triple to hard-float defaults and libcalls.
We were emitting incorrect calls to libm functions that LLVM had decided it
knew about because the default is soft-float.

Recommitted without breaking ELF this time.

llvm-svn: 337450
2018-07-19 12:44:51 +00:00