Commit Graph

59397 Commits

Author SHA1 Message Date
Craig Topper ab7151f1cf [X86] Make PreprocessISelDAG create X86ISD::VRNDSCALE nodes with i32 constants instead of i8.
This is the type declared in X86InstrFragmentsSIMD.td. ISel pattern
matching doesn't check so it doesn't matter in practice. Maybe for
SelectionDAG CSE it would matter.
2020-08-17 17:25:51 -07:00
Hongtao Yu 819b2d9c79 [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.
When diffing the disassembly dumps of two binaries, I see lots of noise from mismatched jump target addresses and global data references, which unnecessarily causes diffs on every function, making the diff impractical. I'm trying to symbolize the raw binary addresses to minimize the diff noise.
In this change, a local branch target is modeled as a label and the branch target operand will simply be printed as a label. Local labels are collected by a separate pre-decoding pass beforehand. A global data memory operand will be printed as a global symbol instead of the raw data address. Unfortunately, due to the way the disassembler is set up and to be less intrusive, a global symbol is always printed as the last operand of a memory access instruction. This is less than ideal but is probably acceptable from a code-quality-checking point of view, since on most targets an instruction can have at most one memory operand.

So far only the X86 disassemblers are supported.

Test Plan:

llvm-objdump -d  --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr [rip + 4112]  # 202182 <g>
               	jge	0x20117e <_start+0x25>
               	call	0x201158 <foo>
               	inc	dword ptr [rsp]
               	jmp	0x201169 <_start+0x10>
               	xor	eax, eax
               	pop	rcx
               	ret
```

llvm-objdump -d  **--symbolize-operands** --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
<L1>:
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr  <g>
               	jge	 <L0>
               	call	 <foo>
               	inc	dword ptr [rsp]
               	jmp	 <L1>
<L0>:
               	xor	eax, eax
               	pop	rcx
               	ret
```

Note that without this work, jump instructions like `jge 0x20117e <_start+0x25>` are printed as a real target address plus an offset from the leading symbol. With a change in the optimizer that adds or deletes an instruction, the address and offset may shift for targets placed after the instruction. This is a problem when diffing the disassembly from two optimizers, since such branch target address changes cause unnecessary false positives. With `--symbolize-operands`, a label is printed for a branch target instead, reducing the false positives. Similarly, the disassembly of PC-relative global variable references is also prone to instruction insertion/deletion.

Reviewed By: jhenderson, MaskRay

Differential Revision: https://reviews.llvm.org/D84191
2020-08-17 16:55:12 -07:00
Kazushi (Jam) Marukawa 68cb29eff1 [VE] Modify ISelLowering following clang-tidy
Modify case style of function names following clang-tidy.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D86076
2020-08-18 07:43:19 +09:00
Matt Arsenault a9ee0589a8 AMDGPU/GlobalISel: Match global saddr addressing mode 2020-08-17 15:48:06 -04:00
Matt Arsenault e1a2f4713c AMDGPU: Match global saddr addressing mode
The previous implementation was incorrect, and based off incorrect
instruction definitions. Unfortunately we can't match natural
addressing in a lot of cases due to the shift/scale applied in
getelementptrs. This relies on reducing the 64-bit shift to 32-bits.
2020-08-17 15:28:14 -04:00
Stanislav Mekhanoshin 24182f14b6 [AMDGPU] Define spill opcodes for all AGPR sizes
Since we have defined all these sizes I believe we shall be
able to spill these as well.

Differential Revision: https://reviews.llvm.org/D86098
2020-08-17 12:17:23 -07:00
Matt Arsenault c8a9872259 AMDGPU/GlobalISel: Look through copies in getPtrBaseWithConstantOffset
We may have an SGPR->VGPR copy if a totally uniform pointer
calculation is used for a VGPR pointer operand.

Also hack around a bug in MUBUF matching which would incorrectly use
MUBUF for global when flat was requested. This should really be a
predicate on the parent pattern, but the DAG always checked this
manually inside the complex pattern.
2020-08-17 12:31:38 -04:00
Steven Perron eed6476a87 Reset PAL metadata when AMDGPU target stream finishes
If the same stream object is used for multiple compiles, the PAL metadata from earlier compilations will leak into later ones.  See https://github.com/GPUOpen-Drivers/llpc/issues/882 for how this is happening in LLPC.

No tests were added because multiple compiles will have to happen using the same pass manager, and I do not see a setup for that on the LLVM side.  Let me know if there is a good way to test this.

Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D85667
2020-08-17 10:56:11 -04:00
Matt Arsenault c7b9cd31bf AMDGPU/GlobalISel: Fix missing 256-bit AGPR mapping 2020-08-17 09:53:26 -04:00
Matt Arsenault af162ac785 AMDGPU/GlobalISel: Fix using readfirstlane with ballot intrinsics
This should use the default mapping and insert a copy to the vcc bank,
and not try to insert a readfirstlane.
2020-08-17 09:53:25 -04:00
Matt Arsenault da3f357de6 AMDGPU: Don't look at dbg users for foldable operands
These would have always failed to fold, so checking them or adding
them to the fold candidates is useless.
2020-08-17 09:53:25 -04:00
Matt Arsenault 66ffa0e91f AMDGPU/GlobalISel: Fix using post-legal combiner without LegalizerInfo 2020-08-17 09:19:22 -04:00
Matt Arsenault e0375dbcb3 AMDGPU: Fix using wrong offsets for global atomic fadd intrinsics
Global instructions have the signed offsets.
2020-08-17 09:19:15 -04:00
Sam Elliott 3f7068ad98 [RISCV] Enable the use of the old mucounteren name
The RISC-V Privileged Specification 1.11 defines `mcountinhibit`, which
has the same numeric CSR value as `mucounteren` from 1.09.1. This patch
enables the use of the old `mucounteren` name.

Patch by Yuichi Sugiyama.

Reviewed By: lenary, jrtc27, pzheng

Differential Revision: https://reviews.llvm.org/D85067
2020-08-17 13:11:49 +01:00
Sam Elliott 5f9ecc5d85 [RISCV] Indirect branch generation in position independent code
This fixes the "Unable to insert indirect branch" fatal error sometimes
seen when generating position-independent code.

Patch by msizanoen1

Reviewed By: jrtc27

Differential Revision: https://reviews.llvm.org/D84833
2020-08-17 13:09:26 +01:00
Simon Pilgrim 1d2ede87ea [X86][AVX] Move lowerShuffleWithVPMOV inside explicit shuffle lowering cases
Perform lowerShuffleWithVPMOV as part of the v16i8/v8i16 shuffle lowering stages, which are the only types that are currently supported.

We need to expand support for lowering shuffles as truncations to fix the remaining regressions in D66004
2020-08-17 11:58:51 +01:00
Kazushi (Jam) Marukawa 40f1e7e804 [VE] Support f128
Support f128 using VE instructions.  Update regression tests.
I've noticed there is no load or store i128 test, so I add them too.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D86035
2020-08-17 17:26:52 +09:00
Craig Topper a206f85091 [X86] Reject dirflag in inline asm constraints other than clobber.
Fixes the crash from PR47195.
2020-08-16 23:33:45 -07:00
Chen Zheng 4d52ebb9b9 [PowerPC] Make StartMI ignore COPY like instructions.
Reviewed By: lkail

Differential Revision: https://reviews.llvm.org/D85659
2020-08-17 02:12:30 -04:00
Simon Pilgrim f25d47b7ed [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
We can now enable this for AVX1 targets since it can now assist with canonicalizeShuffleMaskWithHorizOp cleanup.

There are still a few missed opportunities for merging subvector insert/extracts into shuffles, but they shouldn't cause any regressions now.
2020-08-16 15:00:41 +01:00
Simon Pilgrim dca7eb7d60 [X86][SSE] Replace combineShuffleWithHorizOp with canonicalizeShuffleMaskWithHorizOp
Instead of just attempting to fold shuffle(HOP,HOP) for a specific target shuffle, make this part of combineX86ShufflesRecursively so we can perform this on the combined shuffle chain, which is particularly useful for recognising more cases of where we're performing multiple HOPs that can be merged and pre-AVX where we don't have good blend/unary target shuffle support.
2020-08-16 12:26:27 +01:00
Simon Pilgrim c27baa54b7 [X86] isRepeatedTargetShuffleMask - don't require specific MVT type. NFC.
Split the isRepeatedTargetShuffleMask into a wrapper variant that takes a MVT describing the mask width, and an internal version that just needs the raw mask element bit size.

This will be necessary for an upcoming change where the horizontal ops element width might not match the shuffle mask element width.
2020-08-16 11:51:44 +01:00
Amara Emerson 7006bb69ef [GlobalISel] Enable copy-propagation in post-legalizer combiner.
This cleans up copies that the legalizer or other combines leave around. They
can occasionally end up escaping as moves.

Differential Revision: https://reviews.llvm.org/D85964
2020-08-15 13:44:30 -07:00
Matt Arsenault f0af434b79 AMDGPU: Remove register class params from flat memory patterns 2020-08-15 12:12:33 -04:00
Matt Arsenault a7455652c0 AMDGPU: Fix global atomic saddr operand class 2020-08-15 12:12:28 -04:00
Matt Arsenault 625db2fe5b AMDGPU: Remove slc from flat offset complex patterns
This was always set to 0. Use a default value of 0 in this context to
satisfy the instruction definition patterns. We can't unconditionally
use SLC with a default value of 0 due to limitations in TableGen's
handling of defaulted operands when followed by non-default operands.
2020-08-15 12:12:24 -04:00
Matt Arsenault e5077b5c2a AMDGPU: Fix matching wrong offsets for global atomic loads
These used signed offsets with a different size.
2020-08-15 12:12:17 -04:00
Matt Arsenault 8cb022982a AMDGPU: Remove redundant FLAT complex patterns
These were identical to the non-atomic cases. I'm not sure why these
were ever separated.
2020-08-15 12:12:01 -04:00
Matt Arsenault 47af1ac69a AMDGPU: Correct definitions for global saddr instructions
The VGPR component is a 32-bit offset, not 64-bits.

I'm not sure what the correct syntax is for this. This maintains the
vaddr position and leaves saddr in the end "off" position. This is
particularly terrible for stores, since the operand order is now <vgpr
offset>, <data>, <sgpr base>, splitting the pointer operands. I
suppose this is a logical consequence from the mistake of not putting
the data operand first. I'm not sure what sp3 does.
2020-08-15 12:11:57 -04:00
Matt Arsenault 79298a5067 AMDGPU: Remove SIFixupVectorISel pass
This was only used for matching the saddr addressing mode of global
instructions, but this was not implemented correctly. The instruction
definitions aren't even correct, and are defined as using a 64-bit
VGPR component. Eliminate this pass to enable correcting the
instruction definitions. A new matching implementation can work in
GlobalISel or relying on DAG divergence information for the base
address.
2020-08-15 12:11:51 -04:00
Stanislav Mekhanoshin 43a38dc251 [AMDGPU] Fix MAI ld/st hazard handling
The hazard was not processed for ds_permute because it does not
load or store even though it is a DS instruction.

Differential Revision: https://reviews.llvm.org/D86003
2020-08-14 17:07:37 -07:00
Cameron McInally 92593f9e77 [SVE] Lower fixed length vXi32/vXi64 SDIV to scalable vectors.
Differential Revision: https://reviews.llvm.org/D85982
2020-08-14 18:47:22 -05:00
Fangrui Song 58f5966d5b Fix TargetSubtargetInfo derivatives after D85165 2020-08-14 15:50:53 -07:00
Craig Topper c7a0b2684f [X86][MC][Target] Initial backend support for a tune CPU to support -mtune
This patch implements initial backend support for a -mtune CPU controlled by a "tune-cpu" function attribute. If the attribute is not present, X86 will use the resolved CPU from the target-cpu attribute or the command line.

This patch adds MC layer support for a tune CPU. Each CPU now has two sets of features stored in their GenSubtargetInfo.inc tables. These feature lists are passed separately to the Processor and ProcessorModel classes in tablegen. The tune list defaults to an empty list to avoid changes to non-X86 targets. This annoyingly increases the size of the static tables on all targets as we now store 24 more bytes per CPU. I haven't quantified the overall impact, but I can if we're concerned.

One new test is added to X86 to show a few tuning features with mismatched tune-cpu and target-cpu/target-feature attributes to demonstrate independent control. Another new test is added to demonstrate that the scheduler model follows the tune CPU.

I have not added a -mtune to llc/opt or MC layer command line yet. With no attributes we'll just use the -mcpu for both. MC layer tools will always follow the normal CPU for tuning.
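As a rough illustration of how a backend might consume the attribute described above, here is a minimal sketch; the helper name `resolveTuneCPU` is hypothetical and this is not the actual X86 code:
```
#include "llvm/IR/Function.h"
using namespace llvm;

// Prefer the "tune-cpu" function attribute; otherwise fall back to the
// resolved target CPU, matching the fallback behaviour described above.
static StringRef resolveTuneCPU(const Function &F, StringRef TargetCPU) {
  if (F.hasFnAttribute("tune-cpu"))
    return F.getFnAttribute("tune-cpu").getValueAsString();
  return TargetCPU;
}
```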

Differential Revision: https://reviews.llvm.org/D85165
2020-08-14 15:31:50 -07:00
Xiangling Liao f759b4e43b [AIX] Generate unique module id based on Pid and timestamp
A unique module id, which forms part of the sinit and sterm function names, is
required. However, `getUniqueModuleId` will fail if there is no strong external
symbol within a module. We fall back to using the Pid and timestamp when this
happens.

Differential Revision: https://reviews.llvm.org/D85527
2020-08-14 16:22:50 -04:00
Simon Pilgrim e9eb2dc332 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813

Recommit of rG9bd97d036398 with fixed offset adjustments
2020-08-14 18:43:19 +01:00
Matt Arsenault 40a142fa57 AMDGPU/GlobalISel: Match andn2/orn2 for more types
Unfortunately this ends up not working as expected on targets with
16-bit operations due to AMDGPUCodeGenPrepare's promotion of uniform
16-bit ops to i32.

The vector case annoyingly requires switching the checked opcode,
since constants for vectors aren't directly handled.

I also need to think more carefully about whether this is valid for i1.
2020-08-14 13:18:03 -04:00
Kazushi (Jam) Marukawa 2f01af764b [VE] Remove obsolete I8/I16 register classes
Remove the I8/I16 register classes which were previously added to implement
the VE ABI. However, it is possible to implement the VE ABI correctly without
them, so remove them now.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D85905
2020-08-14 21:52:22 +09:00
Sam Parker eb82d58f83 [NFC][ARM] Port MaybeCall into ARMTTImpl method
Renamed to maybeLoweredToCall.
2020-08-14 10:23:20 +01:00
Sebastian Neubauer 9aa0ff77bd [AMDGPU] Enable .rodata for amdpal os
PAL recently got support for multiple ELF sections and relocations,
therefore we can now use .rodata sections instead of forcing constants
into .text.

Differential Revision: https://reviews.llvm.org/D85895
2020-08-14 09:05:48 +02:00
David Sherwood 6c7957c990 [SVE] Fix bug in SVEIntrinsicOpts::optimizePTest
The code wasn't taking into account that the two operands
passed to ptest could be identical and was trying to erase
them twice.

Differential Revision: https://reviews.llvm.org/D85892
2020-08-14 07:57:21 +01:00
David Green 0c390c22a5 Revert "[ARM] Fix IT block generation after Thumb2SizeReduce with -Oz"
This reverts commit 18279a54b5 as it is
causing some chromium android test problems.
2020-08-13 22:40:36 +01:00
Austin Kerbow 7d1cb187fb [AMDGPU] Fix FP/BP spills when MUBUF constant offset exceeded
If we need a scratch register for the spill don't use the same scratch
register that is being used for the MBUF offset.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D85772
2020-08-13 14:12:00 -07:00
Thomas Lively d53d952810 [WebAssembly] Allow inlining functions with different features
Allow inlining only when the Callee has a subset of the Caller's
features. In principle, we should be able to inline regardless of any
features because WebAssembly supports features at module granularity,
not function granularity, but without this restriction it would be
possible for a module to "forget" about features if all the functions
that used them were inlined.

Requested in PR46812.

Differential Revision: https://reviews.llvm.org/D85494
2020-08-13 13:57:43 -07:00
Cameron McInally 21810b0e14 [SVE] Lower fixed length vector integer UMIN/UMAX
Differential Revision: https://reviews.llvm.org/D85926
2020-08-13 14:48:36 -05:00
Stanislav Mekhanoshin 0462aef5f3 [AMDGPU] Inhibit SDWA if target instruction has FI
Differential Revision: https://reviews.llvm.org/D85918
2020-08-13 11:34:28 -07:00
Stanislav Mekhanoshin d25cb5a8a2 [AMDGPU] Fix misleading SDWA verifier error. NFC.
The old error message referring to GFX9 is updated to GFX9+.
2020-08-13 11:32:17 -07:00
David Green 2632c625ed [ARM] Mark VMINNMA/VMAXNMA as commutative
These operations take Qda and Rn register operands, which are
commutative so long as the instruction is not predicated.

Differential Revision: https://reviews.llvm.org/D85813
2020-08-13 18:01:11 +01:00
Cameron McInally e1a87f0a9b [SVE] Lower fixed length vector integer SMIN/SMAX
Differential Revision: https://reviews.llvm.org/D85855
2020-08-13 11:41:20 -05:00
Simon Pilgrim cd3b850a4c rG9bd97d0363987b582 - Revert "[X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))"
This reverts commit 9bd97d0363.

Seeing some codegen issues in internal testing.
2020-08-13 15:21:15 +01:00
Carl Ritson d538c5837a [AMDGPU] Fix missed SI_RETURN_TO_EPILOG in pre-emit peephole
SIPreEmitPeephole does not process all terminators, which means
it can fail to handle SI_RETURN_TO_EPILOG if immediately preceded
by a branch to the early exit block.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D85872
2020-08-13 21:52:41 +09:00
Simon Pilgrim a31d20e67e [X86][SSE] IsElementEquivalent - add HOP(X,X) support
For HADD/HSUB/PACKS ops with repeated operands the lower/upper half element of each lane are known to be equivalent
2020-08-13 12:42:59 +01:00
Paul Walker e63cc8105a [SVE] Lower fixed length vector integer shifts.
Differential Revision: https://reviews.llvm.org/D85724
2020-08-13 12:35:47 +01:00
Anna Welker 9eb9ba076a [ARM][MVE] Fix for tail predication for loops containing MVE gather/scatters
Fix to include the non-predicated version of the write-back gather in the
special case treatment for deducing the instruction type.
(This is fixing https://reviews.llvm.org/D85138 for corner cases)

Differential Revision: https://reviews.llvm.org/D85889
2020-08-13 12:24:19 +01:00
Paul Walker 130098228d [SVE] Lower fixed length vector integer ISD::SETCC operations.
Differential Revision: https://reviews.llvm.org/D85831
2020-08-13 12:01:56 +01:00
Paul Walker 9e04895258 [SVE] Lower fixed length integer extend operations.
Differential Revision: https://reviews.llvm.org/D85640
2020-08-13 11:54:53 +01:00
Ruiling Song 18b1e67523 [AMDGPU] Fix crash when dag-combining bitcast
The code after the 'break' processes 64-bit scalar and vector bitcasts, so I
think the break condition should be (cond1 || cond2). This means we only
execute the following code if (64-bit and dest-is-vector).

Also remove a previous fix which is not needed with this new fix.
(introduced in: 1349a04ef5)

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D85804
2020-08-13 10:23:13 +08:00
Albion Fung 3136cbe29e [PowerPC] Implement Vector Shift Builtins
This patch implements the builtins for the vector shifts (shl, srl, sra), and
adds the appropriate test cases for these builtins. The builtins utilize the
vector shift instructions introduced within ISA 3.1.

Differential Revision: https://reviews.llvm.org/D83338
2020-08-12 18:26:58 -05:00
Francesco Petrogalli c561f4d2ec [SVE][VLS] Don't combine logical AND.
Testing is performed when targeting 128, 256 and 512-bit wide vectors.

For 128-bit vectors, the original behavior of using NEON instructions is
preserved.

Differential Revision: https://reviews.llvm.org/D85479
2020-08-12 20:00:07 +01:00
Simon Pilgrim 39de63aef9 Fix signed/unsigned comparison warnings. NFC. 2020-08-12 19:22:13 +01:00
David Green 1bb3488685 [ARM] Predicated VFMA patterns
Similar to the Two op + select patterns that were added recently, this
adds some patterns for select + fma to turn them into predicated
operations.

Differential Revision: https://reviews.llvm.org/D85824
2020-08-12 18:35:01 +01:00
Simon Pilgrim 13d6cf0951 [X86][SSE] Pull out BUILD_VECTOR operand equivalence tests. NFC.
Pull out element equivalence code from isShuffleEquivalent/isTargetShuffleEquivalent, I've also removed many of the index modulos where possible.

First step toward simply adding some additional equivalence tests.
2020-08-12 18:20:18 +01:00
Craig Topper 5f7cdb2eff [X86][GlobalISel] Legalize G_ICMP results to s8.
We need to produce a setcc instruction which has an 8-bit result.
This gets rid of a bunch of cases that were using the s1->s8/s16/s32/s64
handling in selectZExt.

I'm not very familiar with GlobalISel yet so I'm not yet sure
the best way to do things. I'd especially like feedback on the
best way to handle the currently split 32-bit and 64-bit mode
handling.

Differential Revision: https://reviews.llvm.org/D85814
2020-08-12 10:13:59 -07:00
Cameron McInally ce2c991061 [SVE] Lower fixed length FP minnum/maxnum
Lower fixed length MINNUM/MAXNUM to scalable vectors. Cherry-picked from D71767 with added tests.

Differential Revision: https://reviews.llvm.org/D85744
2020-08-12 12:02:52 -05:00
Krzysztof Parzyszek a2dc19b81b [Hexagon] Return scalar size in getMinVectorRegisterBitWidth() when no HVX
This fixes https://llvm.org/PR47128.
2020-08-12 10:13:58 -05:00
Anna Welker 4fe5615eab [ARM][MVE] Enable tail predication for loops containing MVE gather/scatters
Widen the scope of memory operations that are allowed to be tail predicated
to include gathers and scatters, such that loops that are auto-vectorized
with the option -enable-arm-maskedgatscat (and actually end up containing
an MVE gather or scatter) can be tail predicated.

Differential Revision: https://reviews.llvm.org/D85138
2020-08-12 15:32:37 +01:00
Matt Arsenault e14474a39a AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.fadd
Remove the intermediate transform in the DAG path. I believe this is
the last non-deprecated intrinsic that needs handling.
2020-08-12 10:04:53 -04:00
Matt Arsenault 701228c411 AMDGPU: Handle intrinsics in performMemSDNodeCombine
This avoids a possible regression in a future patch
2020-08-12 10:04:53 -04:00
Sam Parker ea8448e361 [LoopUnroll] Adjust CostKind query
When TTI was updated to use an explicit cost, TCK_CodeSize was used
although the default implicit cost would have been the hand-wavey
cost of size and latency. So, revert back to this behaviour. This is
not expected to have (much) impact on targets since most (all?) of
them return the same value for SizeAndLatency and CodeSize.

When optimising for size, the logic has been changed to query
CodeSize costs instead of SizeAndLatency.

This patch also adds a testing option in the unroller so that
OptSize thresholds can be specified.
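A minimal sketch of the cost-kind selection described above, assuming a boolean that reflects whether the loop is being optimised for size; the helper is illustrative, not the exact LoopUnroll code:
```
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Query CodeSize costs when optimising for size, otherwise keep the default
// size-and-latency cost kind.
static TargetTransformInfo::TargetCostKind selectCostKind(bool OptForSize) {
  return OptForSize ? TargetTransformInfo::TCK_CodeSize
                    : TargetTransformInfo::TCK_SizeAndLatency;
}
```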

Differential Revision: https://reviews.llvm.org/D85723
2020-08-12 12:56:09 +01:00
Simon Pilgrim 9bd97d0363 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813
2020-08-12 12:16:36 +01:00
Simon Pilgrim a0c2c6aa42 [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
Only do this for AVX2+ targets as we still get some regressions on AVX1 without PERMPD/PERMQ
2020-08-12 11:31:05 +01:00
Sjoerd Meijer 6716e7868e [ARM][MVE] tail-predication: overflow checks for backedge taken count.
This picks up the work on the overflow checks for get.active.lane.mask,
which ensure that it is safe to insert the VCTP intrinsic that enables
tail-predication. For a 2d auto-correlation kernel and its inner loop j:

  M = Size - i;
  for (j = 0; j < M; j++)
    Sum += Input[j] * Input[j+i];

For this inner loop, the SCEV backedge taken count (BTC) expression is:

  {(-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>

and LoopUtil cannotBeMaxInLoop couldn't calculate a bound on this, thus "BTC
cannot be max" could not be determined. So overflow behaviour had to be assumed
in the loop tripcount expression that uses the BTC. As a result
tail-predication had to be forced (with an option) for this case.

This change solves that by using ScalarEvolution's helper
getConstantMaxBackedgeTakenCount which is able to determine the range of BTC,
thus can determine it is safe, so that we no longer need to force tail-predication
as reflected in the changed test cases.
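A hedged sketch of the kind of ScalarEvolution query described above; the helper name and the exact check are illustrative, not the actual tail-predication code:
```
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
using namespace llvm;

// Ask SCEV for a constant upper bound on the backedge-taken count and check
// that it is not the all-ones value, i.e. "BTC cannot be max".
static bool btcCannotBeMax(ScalarEvolution &SE, const Loop *L) {
  const SCEV *MaxBTC = SE.getConstantMaxBackedgeTakenCount(L);
  if (const auto *C = dyn_cast<SCEVConstant>(MaxBTC))
    return !C->getAPInt().isAllOnesValue();
  return false; // no constant bound: conservatively assume it can be max
}
```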

Differential Revision: https://reviews.llvm.org/D85737
2020-08-12 09:32:26 +01:00
David Sherwood 88bbd30736 [SVE][CodeGen] Fix issues with EXTRACT_SUBVECTOR when using scalable FP vectors
In this patch I have fixed two issues:

1. Our SVE tuple get/set intrinsics were using the wrong constant type
for the index passed to EXTRACT_SUBVECTOR. I have fixed this by using the
function SelectionDAG::getVectorIdxConstant to create the value. Also, I
have updated the documentation for EXTRACT_SUBVECTOR describing what type
the constant index should be and we now enforce this when creating the
node.
2. The AArch64 backend was missing the appropriate patterns for
extracting certain subvectors (nxv4f16 and nxv2f32) from legal SVE types.
I have added them as part of this patch.

The only way that I could find to test the new patterns was to use the
SVE tuple get intrinsics, although I realise it looks a bit unusual.
Tests added here:

  test/CodeGen/AArch64/sve-extract-subvector.ll

Differential Revision: https://reviews.llvm.org/D85516
2020-08-12 08:35:46 +01:00
Kazushi (Jam) Marukawa 5d549219df [VE] Change to promote i32 AND/OR/XOR operations
VE has only 64-bit AND/OR/XOR instructions.  We pretended that VE also has
32-bit instructions, but doing so increases the number of generated
instructions.  Therefore, we decided to promote 32-bit operations and use only
64-bit instructions in the backend.  We also avoid pretending that VE has a
32-bit LEA instruction.  Update regression tests also.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D85726
2020-08-12 16:23:50 +09:00
Craig Topper 6b3dc96e59 [X86][GlobalISel] Replace a misuse of SUBREG_TO_REG with INSERT_SUBREG.
SUBREG_TO_REG is supposed to be used when we know the producing
instruction already zeroed the bits we're extending. But that's
not the case here. So INSERT_SUBREG with an IMPLICIT_DEF is the
correct thing to use.
2020-08-11 23:51:02 -07:00
Jordan Rupprecht 1a67522d3e [NFC] Inline variable only used in debug builds 2020-08-11 19:38:01 -07:00
Thomas Lively 2985c02f79 [WebAssembly][AsmParser] Name missing features in error message
Rather than just saying that some feature is missing, report the exact
features to make the error message more useful and actionable.

Differential Revision: https://reviews.llvm.org/D85795
2020-08-11 17:26:14 -07:00
Jian Cai 277873ce0f [AARCH64] [MC] add memtag as an alias of mte architecture extension
Add memtag as an alias of the mte architecture extension to be consistent
with GNU as.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=26339

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D85620
2020-08-11 13:28:47 -07:00
Thomas Lively 1a69f02397 [WebAssembly][NFC] Replace WASM with standard Wasm
The officially specified abbreviation for WebAssembly is Wasm and the
spec explicitly calls out WASM as being an incorrect spelling. This
patch fixes a few comments and error messages to use the
spec-compliant abbreviation.

Differential Revision: https://reviews.llvm.org/D85764
2020-08-11 12:27:59 -07:00
diggerlin e9ac1495e2 [AIX][XCOFF] change the operand of branch instruction from symbol name to qualified symbol name for function declarations
SUMMARY:

1. In this patch, remove setting the storage class in the function getXCOFFSection and in the constructor of class MCSectionXCOFF.
The MCSectionXCOFF class has

XCOFF::StorageMappingClass MappingClass;
XCOFF::SymbolType Type;
XCOFF::StorageClass StorageClass;

These attributes are only used in the XCOFFObjectWriter (the asm path does not need the StorageClass),
yet we had to obtain the values of StorageClass, Type and MappingClass every time before invoking getXCOFFSection.

Actually, we can get the StorageClass of the MCSectionXCOFF from its delegated symbol.

2. We also change the operand of branch instructions from the symbol name to the qualified symbol name.
For example, change
bl .foo
extern .foo
to
bl .foo[PR]
extern .foo[PR]

3. And if there is an indirect call referencing a function bar,
we also add
  extern .bar[PR]

Reviewers:  Jason liu, Xiangling Liao

Differential Revision: https://reviews.llvm.org/D84765
2020-08-11 15:26:19 -04:00
Jessica Paquette bebe6a6449 [GlobalISel] Combine (logic_op (op x...), (op y...)) -> (op (logic_op x, y))
This implements

```
(logic_op (op x...), (op y...)) -> (op (logic_op x, y))
```

when `op` is an extend, a shift, or an and.

This is similar to `DAGCombiner::hoistLogicOpWithSameOpcodeHands`
(with a bunch of missing cases, e.g. G_TRUNC, G_BITCAST, etc.)

This is implemented so it works both pre and post-legalization.

This also adds a general way to add a series of instructions in a combine.
(`applyBuildInstructionSteps`).

Differential Revision: https://reviews.llvm.org/D85050
2020-08-11 10:40:06 -07:00
Simon Pilgrim 2655bd51d6 [X86][SSE] combineShuffleWithHorizOp - canonicalize SHUFFLE(HOP(X,Y),HOP(Y,X)) -> SHUFFLE(HOP(X,Y))
Attempt to canonicalize binary shuffles of HOPs with commuted operands to a unary shuffle.
2020-08-11 18:13:03 +01:00
Eric Christopher 8155cb27a2 Fold Opcode into assert uses to fix an unused variable warning without asserts. 2020-08-11 09:30:51 -07:00
Simon Pilgrim fe1f36986b [X86][SSE] combineShuffleWithHorizOp - avoid unnecessary subtraction. NFCI.
We can safely replace ((M - NumElts) % NumEltsPerLane) with (M % NumEltsPerLane) as the modulo result will be the same.
2020-08-11 17:07:32 +01:00
Matt Arsenault 0dc4c36d3a AMDGPU/GlobalISel: Manually select llvm.amdgcn.writelane
Fixup the special case constant bus handling pre-gfx10.
2020-08-11 11:56:16 -04:00
Simon Pilgrim 91d59cbf1b [X86][SSE] Add HADD/SUB support to combineHorizOpWithShuffle
Handles some HOP(SHUFFLE,SHUFFLE) patterns and sets us up to improve some of the cases mentioned in PR41813.
2020-08-11 16:14:14 +01:00
Matt Arsenault 076305568c AMDGPU/GlobalISel: Prepare for more custom load lowerings
Slight restructuring of the code to avoid formatting changes when more
cases are handled here.
2020-08-11 11:09:05 -04:00
Matt Arsenault e2f1b48f86 GlobalISel: Implement bitcast action for G_INSERT_VECTOR_ELT
This mirrors the support for the equivalent extracts. This also
creates a huge mess that would be greatly improved if we had any bit
operation combines.
2020-08-11 10:39:14 -04:00
Matt Arsenault 53f21e0fb7 TableGen/GlobalISel: Hack the operand order for atomic_store
ISD::ATOMIC_STORE arbitrarily has the operands in the opposite order
from regular ISD::STORE, which always introduced an annoying
duplication of patterns to handle both cases. Since in GlobalISel
there's just the one G_STORE, we need to swap the operands to
correctly emit the type check for the pointer operand.

Some work started in 20aafa3156 to
migrate SelectionDAG to use ISD::STORE for atomics, but that work
seems to have stalled. Since this is pretty much the last
operation which matters which isn't supported for AMDGPU, use this
compatibility hack to unblock declaring it functionally complete.

Not sure what's going on with the pending_phis AArch64 test. It seems
it didn't always use atomics, and I'm not sure what it was originally
testing matters anymore.
2020-08-11 10:22:44 -04:00
Kerry McLaughlin 85c7e89f3b [CodeGen] Refactor getMemBasePlusOffset & getObjectPtrOffset to accept a TypeSize
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()
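A hedged usage example of the updated call style described above; it assumes `DAG`, `Base`, `Offset` and `DL` are in scope at a typical call site:
```
// Fixed-size offsets are now wrapped in TypeSize::Fixed() at the call site.
SDValue Ptr = DAG.getMemBasePlusOffset(Base, TypeSize::Fixed(Offset), DL);
```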

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D85220
2020-08-11 12:17:10 +01:00
Kazushi (Jam) Marukawa 59703f1736 [VE] Update bit operations
Change bitreverse/bswap/ctlz/ctpop/cttz regression tests to support i128
and signext/zeroext i32 types.  This patch also changes the way i32 types
are supported using 64-bit VE instructions.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D85712
2020-08-11 19:42:12 +09:00
Paul Walker b6c7b7fa31 [SVE] Add ISD nodes for predicated integer extend inreg operations.
These are useful instructions when lowering fixed length vector
extends, so I've broken this patch out as kind of NFC like work.

Differential Revision: https://reviews.llvm.org/D85546
2020-08-11 11:39:26 +01:00
Simon Pilgrim 49016eeab6 [X86] Rename combineVectorPackWithShuffle -> combineHorizOpWithShuffle. NFC.
The plan is to use this for (F)HADD/SUB opcodes as well as PACKs - similar to how we use combineShuffleWithHorizOp
2020-08-11 11:38:43 +01:00
Paul Walker d542feb8e4 [SVE] Lower fixed length vector integer subtract operations.
Differential Revision: https://reviews.llvm.org/D85665
2020-08-11 11:32:12 +01:00
Kai Nacke d6f710fd46 [NFC] Fix typo in comment.
Twelvth -> Twelfth
2020-08-11 05:27:56 -04:00
Craig Topper 9201efb3b9 [X86] Custom match X86ISD::VPTERNLOG in X86ISelDAGToDAG in order to reduce isel patterns.
By factoring out the end of tryVPTERNLOG, we can use the same code
to directly match X86ISD::VPTERNLOG. This allows us to remove
around 3-4K worth of X86GenDAGISel.inc.
2020-08-10 23:15:58 -07:00
Wang, Pengfei 9512525947 [X86][FPEnv] Teach X86 mask compare intrinsics to respect strict FP semantics.
When we use mask compare intrinsics under the strict FP option, the masked
elements shouldn't raise any exception. So we can't replace the
intrinsic with a full compare + "and" operation.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D85385
2020-08-11 10:28:41 +08:00
jasonliu 20abff0481 [XCOFF][AIX] Use TE storage mapping class when large code model is enabled
Summary:
Use TE SMC instead of TC SMC in large code model mode,
so that large code model TOC entries could get placed after all
the small code model TOC entries, which reduces the chance of TOC overflow.

Reviewed By: Xiangling_L

Differential Revision: https://reviews.llvm.org/D85455
2020-08-10 19:52:10 +00:00
Puyan Lotfi 7bc03f5553 [MachineOutliner][AArch64] WA for multiple stack fixup cases in MachineOutliner.
In cases where MachineOutliner candidates either are:

  * noreturn
  * have calls with no available LR or free regs
  * Don't use SP

we can end up hitting stack fixup code for the caller and the callee for
a FrameID of MachineOutlinerDefault. This triggers the assert:

  `assert(OF.FrameConstructionID != MachineOutlinerDefault &&
          "Can only fix up stack references once");`

in AArch64InstrInfo.cpp. This assert exists for now because a lot of the
fixup code is not tested to handle fixing up more than once and needs
some better checks and enhancements to avoid potentially generating
illegal code.

I've filed a Bugzilla report to track this until these cases are handled
by the AArch64 MachineOutliner: https://bugs.llvm.org/show_bug.cgi?id=46767

This diff detects cases that will cause these multiple stack fixups and
prunes the Candidates from `RepeatedSequenceLocs`.

    Differential Revision: https://reviews.llvm.org/D83923
2020-08-10 15:43:30 -04:00
Matt Arsenault 6fe6b29c29 AMDGPU: Fix assertion in performSHLPtrCombine for 64-bit pointers 2020-08-10 13:46:52 -04:00
Matt Arsenault 68fab44acf AMDGPU: Fix visiting physreg dest users when folding immediate copies
This can fold the immediate into the physical destination, but this
should not look for further users of the register. Fixes regression
introduced by 766cb615a3.
2020-08-10 13:46:51 -04:00
Craig Topper 96dfc783b2 [BreakFalseDeps][X86] Move operand loop out of X86's getUndefRegClearance and put in the pass.
X86 is the only user of this interface in tree. Previously the
X86 pass would loop over operands looking for one undef operand for
the pass to fix. But there could theoretically be multiple operands
to fix. So it makes more sense for the pass to do the looping and
ask the target if an operand needs to be fixed.
2020-08-10 10:32:29 -07:00
Wouter van Oortmerssen 582fd474dd [WebAssembly] wasm64: fix memory.init operand types
I had assumed they would all become i64, but this is not necessary as long as data segments stay 32-bit, see:
https://github.com/WebAssembly/memory64/blob/master/proposals/memory64/Overview.md

Differential Revision: https://reviews.llvm.org/D85552
2020-08-10 10:15:20 -07:00
Simon Pilgrim 9a368d2b00 [X86][SSE] shuffle(hop,hop) - canonicalize unary hop(x,x) shuffle masks
If a shuffle is referring to both the lower and upper half lanes of a unary horizontal op, then canonicalize the mask to only refer to the lower half.
2020-08-10 16:09:27 +01:00
jasonliu 7866442b3f [XCOFF] Adjust .rename emission sequence
Summary:
The AIX assembler does not generate correct relocations when .rename
appears between the TC entry label and the .tc directive.
So only emit .rename after .tc/.comm or other linkage is emitted.

Reviewed By: daltenty, hubert.reinterpretcast

Differential Revision: https://reviews.llvm.org/D85317
2020-08-10 14:48:24 +00:00
Xiangling Liao 6ef801aa6b [AIX] Static init frontend recovery and backend support
On the frontend side, this patch recovers the AIX static init implementation to
use the linkage type and function names Clang chooses for sinit related functions.

On the backend side, this patch sets correct linkage and function names on aliases
created for sinit/sterm functions.

Differential Revision: https://reviews.llvm.org/D84534
2020-08-10 10:10:49 -04:00
Simon Pilgrim 07e673a02b [X86][SSE] Pull out shuffle(hop,hop) combine into combineShuffleWithHorizOp helper. NFC. 2020-08-10 15:08:57 +01:00
Stefan Pintilie 81883ca074 [PowerPC] Add option to control PCRel GOT indirect linker optimization
Add a hidden option to the compiler to control the PC Relative GOT indirect
linker optimization.

If this option is set to false, the compiler will no longer produce the
relocations required by the linker to perform the optimization.

Reviewed By: nemanjai, NeHuang, #powerpc

Differential Revision: https://reviews.llvm.org/D85377
2020-08-10 09:07:17 -05:00
Sam Parker 4f9f4b21e0 [ARM] Unrestrict Armv8-a IT when at minsize
IT blocks with more than one instruction were performance deprecated in Armv8,
but that doesn't mean we should follow that advice when optimising for size.

Differential Revision: https://reviews.llvm.org/D85638
2020-08-10 14:59:53 +01:00
Simon Pilgrim e6dc2c8ce7 [X86][SSE] combineTargetShuffle - rearrange shuffle(hop,hop) matching to delay shuffle mask manipulation. NFC.
Check that we're shuffling hadd/pack ops first before altering shuffle masks.

First step towards adding extra functionality, plus it avoids costly shuffle mask manipulation if not necessary.
2020-08-10 14:13:19 +01:00
Matt Arsenault 40188f807d AMDGPU/GlobalISel: Don't try to handle undef source operand
This is now illegal MIR
2020-08-10 08:49:43 -04:00
Matt Arsenault f9c279b057 PeepholeOptimizer: Use Register 2020-08-10 08:49:36 -04:00
Matt Arsenault a0ec81f70d AMDGPU/GlobalISel: Merge load/store select cases 2020-08-10 08:46:26 -04:00
Matt Arsenault c8b17874e5 AMDGPU/GlobalISel: Fix typo 2020-08-10 08:41:17 -04:00
Matt Arsenault 9533f0ea68 AMDGPU/GlobalISel: Use nicer form of buildInstr 2020-08-10 08:41:07 -04:00
Qiu Chaofan dbcfbffc7a [PowerPC] Add intrinsic to read or set FPSCR register
This patch introduces two intrinsics: llvm.ppc.setflm and
llvm.ppc.readflm. They read from or write to FPSCR register
(floating-point status & control) which contains rounding mode and
exception status.

To ensure correctness of program, we need to prevent FP operations from
being moved across these intrinsics (mffs/mtfsf instruction), so here I
set them as scheduling boundaries. We can relax such restriction if
FPSCR is modeled well in the future.

Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D84914
2020-08-10 18:27:45 +08:00
Petar Avramovic 0d58d9e8fb AMDGPU/GlobalISel: Lower G_FREM
Add custom lower for G_FREM.

Differential Revision: https://reviews.llvm.org/D84324
2020-08-10 10:10:46 +02:00
Piotr Sobczak 62d8b8a225 Fix 64-bit copy to SCC
Fix 64-bit copy to SCC by restricting the pattern resulting
in such a copy to subtargets supporting 64-bit scalar compare,
and mapping the copy to S_CMP_LG_U64.

Before introducing the S_CSELECT pattern with explicit SCC
(0045786f14), there was no need
for handling 64-bit copy to SCC ($scc = COPY sreg_64).

The proposed handling to read only the low bits was however
based on a false premise that it is only one bit that matters,
while in fact the copy source might be a vector of booleans and
all bits need to be considered.

The practical problem of mapping the 64-bit copy to SCC is that
the natural instruction to use (S_CMP_LG_U64) is not available
on old hardware. Fix it by restricting the problematic pattern
to subtargets supporting the instruction (hasScalarCompareEq64).

Differential Revision: https://reviews.llvm.org/D85207
2020-08-09 20:50:30 +02:00
David Green 186a7f81e8 [ARM] Add VADDV and VMLAV patterns for v16i16
This adds patterns for v16i16's vecreduce, using all the existing code
to go via an i32 VADDV/VMLAV and truncating the result.

Differential Revision: https://reviews.llvm.org/D85452
2020-08-09 11:09:49 +01:00
David Green 8590e5abad [ARM] Allow vecreduce_add in tail predicated loops
This allows vecreduce_add in loops so that we can tailpredicate them.

Differential Revision: https://reviews.llvm.org/D85454
2020-08-09 10:57:17 +01:00
David Green 296faa91ed [ARM] Some formatting and predicate VRHADD patterns. NFC
This formats some of the MVE patterns, and adds a missing
Predicates = [HasMVEInt] to some VRHADD patterns I noticed
as going through. Although I don't believe NEON would ever
use the patterns (as it would use ADDL and VSHRN instead)
they should ideally be predicated on having MVE instructions.
2020-08-09 10:07:52 +01:00
Craig Topper bc8be30540 [X86][GlobalISel] Remove unneeded code for handling zext i8->16, i8->i64, i16->i64, i32->i64.
These all seem to be handled by tablegen pattern imports.
2020-08-09 00:26:15 -07:00
Thomas Lively cc612c2908 [WebAssembly] Fix FastISel address calculation bug
Fixes PR47040, in which an assertion was improperly triggered during
FastISel's address computation. The issue was that an `Address` set to
be relative to the FrameIndex with offset zero was incorrectly
considered to have an unset base. When the left hand side of an add
set the Address to be 0 off the FrameIndex, the right side would not
detect that the Address base had already been set and could try to set
the Address to be relative to a register instead, triggering an
assertion.

This patch fixes the issue by explicitly tracking whether an `Address`
has been set rather than interpreting an offset of zero to mean the
`Address` has not been set.
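A minimal sketch of the tracking described above; the class shape and names are illustrative, not the exact WebAssemblyFastISel code:
```
// Record explicitly whether a base has been assigned rather than treating a
// zero frame-index offset as "unset".
class Address {
  int FrameIndex = 0;
  bool IsBaseSet = false;

public:
  void setFrameIndex(int FI) {
    FrameIndex = FI;
    IsBaseSet = true; // base is now set, even when FI == 0
  }
  int getFrameIndex() const { return FrameIndex; }
  bool isBaseSet() const { return IsBaseSet; }
};
```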

Differential Revision: https://reviews.llvm.org/D85581
2020-08-08 15:23:11 -07:00
Craig Topper d3153b5ca2 [X86] Remove a DCI.isBeforeLegalize() call from combineVSelectWithAllOnesOrZeros.
This was blocking isTypeLegal call so that we could do a particular
transform on illegal types before type legalization. But the we
create a target specific node using that type. We shouldn't do
that if the type isn't legal. So I think we should just always
make sure the type is legal.

I suspect that in order to get the condition VT to not be a vector
of i1 we already completed type legalization anyway so this probably
doesn't matter much in practice.
2020-08-08 14:19:13 -07:00
Craig Topper 966a58e329 [X86] Support matching VPTERNLOG when the root node is X86ISD::ANDNP. 2020-08-08 13:11:47 -07:00
Dávid Bolvanský c814eca3e4 [AArch64RegisterInfo] Supress new warning 2020-08-08 21:47:01 +02:00
Craig Topper 815a9b256b [X86] Remove isSafeToClobberEFLAGS helper and just inline it into the call sites.
This is just a thin wrapper around computeRegisterLiveness which
we can just call directly. The only real difference is that
isSafeToClobberEFLAGS returns a bool and computeRegisterLiveness
returns an enum. So we need to check for the specific enum value
that isSafeToClobberEFLAGS was hiding.

I've also adjusted which sites pass an explicit value for
Neighborhood since the default for computeRegisterLiveness is 10.
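A hedged sketch of the direct query described above; it is not the exact X86InstrInfo code and assumes `MBB`, a `const TargetRegisterInfo *TRI` and the insertion iterator `I` are in scope:
```
// EFLAGS is only safe to clobber when the liveness query returns the
// specific "dead" result; LQR_Live and LQR_Unknown must be treated as unsafe.
bool EflagsDead =
    MBB.computeRegisterLiveness(TRI, X86::EFLAGS, I, /*Neighborhood=*/4) ==
    MachineBasicBlock::LQR_Dead;
```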
2020-08-08 12:31:58 -07:00
Craig Topper 8d3ae64b04 Recommit "[X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places"
I messed up the bug numbers in the commit message before

Previously this function searched 4 instructions forwards or
backwards to determine if it was ok to clobber eflags.

This is called in 3 places: rematerialization, turning 2 operand
leas into adds or splitting 3 ops leas into an lea and add on some
CPU targets.

This patch increases the search limit to 10 instructions for
rematerialization and 2 operand lea to add. I've left the old
threshold for 3 ops lea splitting as that increases code size.

Fixes PR47024 and PR46315.
2020-08-08 11:53:14 -07:00
Craig Topper 761f568420 Revert "[X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places"
This reverts commit 44b260cb0a.

I messed up the bug number in the commit message so I'm reverting
to fix it.
2020-08-08 11:53:14 -07:00
Simon Pilgrim cc15380f10 [X86][SSE] combineTargetShuffle - use scaleShuffleMask helper to widen shuffle mask. NFCI.
Use scaleShuffleMask helper for the shuffle(hadd,hadd) canonicalization.
2020-08-08 19:36:18 +01:00
Craig Topper 44b260cb0a [X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places
Previously this function searched 4 instructions forwards or
backwards to determine if it was ok to clobber eflags.

This is called in 3 places: rematerialization, turning 2 operand
leas into adds or splitting 3 ops leas into an lea and add on some
CPU targets.

This patch increases the search limit to 10 instructions for
rematerialization and 2 operand lea to add. I've left the old
threshold for 3 ops lea splitting as that increases code size.

Fixes PR47024 and PR43014
2020-08-08 11:29:41 -07:00
Craig Topper 514b00c439 [X86] Limit the scope of the min/max canonicalization in combineSelect
Previously the transform was doing these two canonicalizations
(x > y) ? x : y -> (x >= y) ? x : y
(x < y) ? x : y -> (x <= y) ? x : y

But those don't seem to be useful generally. And they actively
pessimize the cases in PR47049.

This patch limits it to
(x > 0) ? x : 0 -> (x >= 0) ? x : 0
(x < -1) ? x : -1 -> (x <= -1) ? x : -1

These are the cases mentioned in the comments as the motivation
for the canonicalization. These allow the CMOV to use the S
flag from the compare thus improving opportunities to use a TEST
or the flags from an arithmetic instruction.
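A small C++ illustration of the kept pattern (not taken from the patch's tests): clamping against a constant is the shape where rewriting the comparison lets the CMOV reuse the sign flag from a preceding TEST or arithmetic instruction instead of needing an extra compare.
```
// Source-level shape of the retained canonicalization:
// (x > 0) ? x : 0 is rewritten as (x >= 0) ? x : 0 during selection.
int clampToZero(int x) {
  return x > 0 ? x : 0;
}
```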
2020-08-07 22:51:49 -07:00
Keno Fischer c58674df14 [X86] Don't produce bad x86andp nodes for i1 vectors
In D85499, I attempted to fix this same issue by canonicalizing
andnp for i1 vectors, but since there was some opposition to such
a change, this commit just fixes the bug by using two different
forms depending on which kind of vector type is in use. We can
then always decide to switch the canonical forms later.

Description of the original bug:
We have a DAG combine that tries to fold (vselect cond, 0000..., X) -> (andnp cond, x).
However, it does so by attempting to create an i64 vector with the number
of elements obtained by truncating division by 64 from the bitwidth. This is
bad for mask vectors like v8i1, since that division is just zero. Besides,
we don't want i64 vectors anyway. For i1 vectors, switch the pattern
to (andnp (not cond), x), which is the canonical form for `kandn`
on mask registers.

Fixes https://github.com/JuliaLang/julia/issues/36955.

Differential Revision: https://reviews.llvm.org/D85553
2020-08-07 20:05:47 -04:00
Matt Arsenault 3c0597a9e4 AMDGPU: Avoid explicitly listing all the memory nodes 2020-08-07 19:22:46 -04:00
Arthur Eubanks 1bf4629f11 [PPC] Rename bool-ret-to-int -> ppc-bool-ret-to-int
Reviewed By: #powerpc, nemanjai

Differential Revision: https://reviews.llvm.org/D85391
2020-08-07 11:27:05 -07:00
Vang Thao 04bd5b5286 [AMDGPU] Fix not rescheduling without clustering
Regions which should be rescheduled without memory op clustering are sometimes
skipped. RegionIdx is not incremented when iterating over regions that
are flagged to be skipped, causing the index to be incorrect.

Thanks to Vang Thao for discovering this bug!

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D85498
2020-08-07 11:15:58 -07:00
Amy Kwan 98eccec3ae [PowerPC] Add Vector Extract/Expand/Count with Mask, Move to VSR Mask Instruction Definitions and MC Tests
This patch adds the instruction definitions and assembly/disassembly tests for
the following set of instructions:

Vector Extract [byte | half | word | doubleword | quad] with mask
Vector Expand [byte | half | word | doubleword | quad] with mask
Move to VSR [byte | byte immediate | half | word | doubleword | quad] with mask
Vector Count Mask Bits [byte | half | word | doubleword]

Differential Revision: https://reviews.llvm.org/D83724
2020-08-07 11:02:08 -05:00
Kamau Bridgeman d8c6d083c9 [PowerPC][PCRelative] Set TLS unsupported with PC relative memops
Introduce a fatal error if any thread local storage code is compiled
using pc relative memory operations as well as a hidden override
option `-enable-ppc-pcrel-tls` so that this support can be incrementally
added if possible.

Reviewed By: #powerpc, nemanjai

Differential Revision: https://reviews.llvm.org/D85448
2020-08-07 10:56:24 -05:00
Bevin Hansson 5de6c56f7e [Intrinsic] Add sshl.sat/ushl.sat, saturated shift intrinsics.
Summary:
This patch adds two intrinsics, llvm.sshl.sat and llvm.ushl.sat,
which perform signed and unsigned saturating left shift,
respectively.
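A hedged scalar model of the signed variant's saturating behaviour, purely illustrative; the authoritative semantics are whatever the LangRef entry for llvm.sshl.sat specifies. The function name is made up, and it assumes the shift amount is smaller than the bit width:
```
#include <cstdint>
#include <limits>

// Widen, shift (via a multiply to stay well-defined for negative values),
// then clamp back into the i32 range on overflow.
int32_t sshlSat32(int32_t X, unsigned Amt) {
  int64_t Wide = static_cast<int64_t>(X) * (int64_t{1} << Amt);
  if (Wide > std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();
  if (Wide < std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(Wide);
}
```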

These are useful for implementing the Embedded-C fixed point
support in Clang, originally discussed in
http://lists.llvm.org/pipermail/llvm-dev/2018-August/125433.html
and
http://lists.llvm.org/pipermail/cfe-dev/2018-May/058019.html

Reviewers: leonardchan, craig.topper, bjope, jdoerfert

Subscribers: hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83216
2020-08-07 15:09:24 +02:00
Kazushi (Jam) Marukawa 63bc5d7863 [VE] Change to expand multiply related instructions
Change to expand MULHU/MULHS/UMUL_LOHI/SMUL_LOHI for i32 and i64 since
those instructions are not available on Aurora SX VE.  Some of them
are used in the expansion of i128 multiply, so they need to be modified to
support i128.  Then, update basic arithmetic regression tests for
i128 and signed/unsigned i32 typed integer values.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D85490
2020-08-07 18:22:25 +09:00
David Sherwood 0905d9f31e [SVE][CodeGen] Fix bug with store of unpacked FP scalable vectors
Fixed an incorrect pattern in lib/Target/AArch64/AArch64SVEInstrInfo.td
for storing out <vscale x 2 x f32> unpacked scalable vectors. Added
a couple of tests to

  test/CodeGen/AArch64/sve-st1-addressing-mode-reg-imm.ll

Differential Revision: https://reviews.llvm.org/D85441
2020-08-07 07:19:09 +01:00
biplmish cce1b0e891 [PowerPC] Implement Vector Extract Low/High Order Builtins in LLVM/Clang
This patch implements the function prototypes vec_extractl and vec_extracth in altivec.h to utilize the vector extract double element instructions introduced in Power10.

Differential Revision: https://reviews.llvm.org/D84622
2020-08-07 01:02:29 -05:00
QingShan Zhang 55de46f3b2 [PowerPC] Support constrained fp operation for setcc
The constrained fp operation fcmp was added by https://reviews.llvm.org/D69281.
This patch is trying to add the support for PowerPC backend.

Reviewed By: uweigand

Differential Revision: https://reviews.llvm.org/D81727
2020-08-07 05:16:36 +00:00
Kazushi (Jam) Marukawa f92e0d9384 [VE] Optimize trunc related instructions
Change to not generate truncate instructions if no use of a truncate
operation cares about the higher bits.  For example, an i32 add
instruction doesn't care about the higher 32 bits in 64-bit registers.
Regression tests are updated also.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D85418
2020-08-07 09:21:05 +09:00
Yonghong Song c50f5dece9 BPF: fix libLLVMBPFCodeGen.so build failure
Buildbot reported a build failure when building shared
library libLLVMBPFCodeGen.so with unknown reference to
"createCFGSimplificationPass".

Commit 87cba43402 ("BPF: add a SimplifyCFG IR pass during
generic Scalar/IPO optimization") added an IR pass SimplifyCFG
by BPF target. The commit called function
createCFGSimplificationPass() defined in "Scalar" library.
Add this library to Target/BPF/LLVMBuild.txt so
the shared library build can succeed.
2020-08-06 15:27:15 -07:00
Matt Arsenault 87b2af8140 AMDGPU/GlobalISel: Enable s_{and|or}n2_{b32|b64} patterns 2020-08-06 18:00:38 -04:00
Yonghong Song 87cba43402 BPF: add a SimplifyCFG IR pass during generic Scalar/IPO optimization
The following bpf linux kernel selftest failed with latest
llvm:
  $ ./test_progs -n 7/10
  ...
  The sequence of 8193 jumps is too complex.
  verification time 126272 usec
  stack depth 320
  processed 114799 insns (limit 1000000)
  ...
  libbpf: failed to load object 'pyperf600_nounroll.o'
  test_bpf_verif_scale:FAIL:110
  #7/10 pyperf600_nounroll.o:FAIL
  #7 bpf_verif_scale:FAIL

After some investigation, I found the following llvm patch
  https://reviews.llvm.org/D84108
is responsible. The patch disabled hoisting of common instructions
in SimplifyCFG by default. Later on, the code changes and a
SimplifyCFG phase with hoisting enabled can no longer do the work.

A test is provided to demonstrate the problem.
The IR before simplifyCFG looks like:
  for.cond:
    %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
    %cmp = icmp ult i32 %i.0, 6
    br i1 %cmp, label %for.body, label %for.cond.cleanup

  for.cond.cleanup:
    %2 = load i8*, i8** %frame_ptr, align 8, !tbaa !2
    %cmp2 = icmp eq i8* %2, null
    %conv = zext i1 %cmp2 to i32
    call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %1) #3
    call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %0) #3
    ret i32 %conv

  for.body:
    %3 = load i8*, i8** %frame_ptr, align 8, !tbaa !2
    %tobool.not = icmp eq i8* %3, null
    br i1 %tobool.not, label %for.inc, label %land.lhs.true

The first two insns of `for.cond.cleanup` and `for.body`, load and
icmp, can be hoisted to the `for.cond` block. With patch D84108, the
optimization is delayed. But unfortunately, later on loop rotation
adds additional phi nodes to `for.body` and hoisting cannot
be done any more.

Note such hoisting is beneficial to bpf programs as the
bpf verifier does path-sensitive analysis and verification.
The hoisting prevents reloading from the stack, which would have to
assume a conservative value and increase the number of processed insns.
In this case, it caused a verifier failure.

To fix this problem, I added an IR pass to the bpf target
to perform an additional simplifycfg with hoisting of common
instructions enabled.

Differential Revision: https://reviews.llvm.org/D85434
2020-08-06 13:16:00 -07:00
dfukalov 4ccc38813e [AMDGPU][CostModel] Add f16, f64 and contract cases to fused costs estimation.
Add cases of fused fmul+fadd/fsub with f16 and f64 operands to cost model.
Also added operations with contract attribute.

Fixed line endings in test.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D84995
2020-08-06 21:43:27 +03:00
Matt Arsenault e00201539f GlobalISel: Implement fewerElementsVector for G_EXTRACT_VECTOR_ELT
Use the same basic strategy as LegalizeVectorTypes. Try to index into
smaller pieces if there's a constant index, and otherwise fall back to
a stack temporary.
2020-08-06 14:33:16 -04:00
Matt Arsenault 1a0c0944c6 AMDGPU: Define raw/struct variants of buffer atomic fadd
Somehow the new FP atomic buffer intrinsics ended up using the legacy
style for buffer intrinsics.
2020-08-06 13:36:19 -04:00
Matt Arsenault eae9c54148 AArch64/GlobalISel: Fix verifier error after selecting returnaddress
This was caching the wrong register to re-use later.
2020-08-06 13:18:05 -04:00
Matt Arsenault 90eb7d5283 AMDGPU: Fix spilling of 96-bit AGPRs 2020-08-06 12:42:07 -04:00
Matt Arsenault 56270d1d42 AMDGPU/GlobalISel: Start trying to handle AGPR bank
Try to use AGPR banks for the various merge/unmerge type
operations. Previously these would introduce copies to VGPR.
2020-08-06 12:39:50 -04:00
Matt Arsenault 34040a4f61 GlobalISel: Define InvalidRegBankID enum value 2020-08-06 12:39:49 -04:00
Matt Arsenault 63cdc9a49f AMDGPU/GlobalISel: Handle llvm.amdgcn.ds.{fadd|fmin|fmax}
These intrinsics are missing mangling for both the pointer and data
type.
2020-08-06 11:09:08 -04:00
Matt Arsenault 63c4be53cf AMDGPU/GlobalISel: Try to promote to use packed saturating add/sub
This produces worse results right now for i8 vectors, but that should
be addressed when we actually try to optimize packed vectors.
2020-08-06 11:08:45 -04:00
Matt Arsenault dcf3ffb0a8 AMDGPU/GlobalISel: Move frame index selection to patterns
Doesn't really save any code until global value is handled too.
2020-08-06 10:42:15 -04:00
Matt Arsenault d188a608bd AMDGPU: Fix code duplication between the selectors
Not sure this is the right place for this helper.
2020-08-06 10:42:15 -04:00
Matt Arsenault 5a503521e7 AMDGPU/GlobalISel: Implement expansion for rsq.clamp
Not sure why we handle this removed instruction on newer subtargets
for this one and no others, but maintain compatibility with the DAG.
2020-08-06 10:23:25 -04:00
Matt Arsenault c015cbc68b AMDGPU/GlobalISel: Fix trying to widen <3 x s1> boolean ops 2020-08-06 10:07:22 -04:00
Matt Arsenault 28124a0a63 AMDGPU/GlobalISel: Stop using G_EXTRACT in argument lowering
We really need to put this undef padding stuff into a helper
somewhere, but leave that for when this is moved to generic code.
2020-08-06 09:55:35 -04:00
Matt Arsenault 6c7f640bf7 AMDGPU/GlobalISel: Implement LLT version of allowsMisalignedMemoryAccesses 2020-08-06 09:50:36 -04:00
Matt Arsenault 37894ba661 AMDGPU/GlobalISel: Make s16 phi legal
If we were to have an operation with an s16 def that needs to be
executed in a waterfall loop, not having s16 legal would place an
avoidable burden on RegBankSelect to widen it.
2020-08-06 09:41:14 -04:00
Matt Arsenault 5316256709 AMDGPU/GlobalISel: Fix assert on copy to vcc
This was trying to constrain a physical register. By the verifier's
understanding, it's impossible to have a 1-bit copy to vcc/vcc_lo so
don't try to handle physregs.
2020-08-06 09:41:14 -04:00
Paul Walker 0d33a8ef5b [SVE] Lower scalable vector mul operations.
This allows us to remove extra patterns from AArch64SVEInstrInfo.td
because we can reuse those required for fixed length vectors.

Differential Revision: https://reviews.llvm.org/D85328
2020-08-06 11:15:35 +01:00
Paul Walker 3ed59b775d [SVE] Implement lowering for fixed length vector multiplication.
NOTE: Also uses SVE code generation for NEON size vectors, instead
of expanding i64 based vector multiplications.

Differential Revision: https://reviews.llvm.org/D85327
2020-08-06 11:01:39 +01:00
Martin Storsjö f5e6fbac24 [AArch64] [Windows] Error out on unsupported symbol locations
These might occur in seemingly generic assembly. Previously when
targeting COFF, they were silently ignored, which certainly won't
give the right result. Instead clearly error out, to make it clear
that the assembly needs to be adjusted for this target.

Also change a preexisting report_fatal_error into a proper error
message, pointing out the offending source instruction. This isn't
strictly an internal error, as it can be triggered by user input.

Differential Revision: https://reviews.llvm.org/D85242
2020-08-06 09:23:46 +03:00
Martin Storsjö 5eedc01a82 [ARM, AArch64] Fix a comment typo. NFC. 2020-08-06 09:23:45 +03:00
Craig Topper 0215ae9735 [X86] Remove incomplete custom handling of i128 sdivrem/udivrem on Windows.
We need to have special handling of i128 div/rem on Windows due
to a weird calling convention needed for the libcall. There was
also some code that made it look like we do the same for sdivrem/udivrem,
but the code didn't account for multiple return values of those
functions so couldn't possibly work. I think this code never
triggers because we don't have libcall names defined for those
functions by default so DAGCombine never creates DIVREM nodes.
2020-08-05 23:01:07 -07:00
Matt Arsenault 0ee1eba581 AMDGPU: Remove ATOMIC_PK_FADD
The f32 and v2f16 cases should be handled the same way.
2020-08-05 22:00:52 -04:00
Ruiling Song 5ddc8b49ba [AMDGPU] add buffer_atomic_swap for float
The functionality is used when calling imageAtomicExhange() on float
type imageBuffer in Graphics shaders.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D85187
2020-08-06 09:45:48 +08:00
Craig Topper 08b2d0a963 [X86] Disable copy elision in LowerMemArgument for scalarized vectors when the loc VT is a different size than the original element.
For example a v4f16 argument is scalarized to 4 i32 values. So
the values are spread out instead of being packed tightly like
in the original vector.

Fixes PR47000.
2020-08-05 15:44:54 -07:00
Stanislav Mekhanoshin 0bcda1a261 [AMDGPU] Scavenge temp reg for AGPR spill
Differential Revision: https://reviews.llvm.org/D85234
2020-08-05 13:29:19 -07:00
Matt Arsenault ec8c172d01 AMDGPU: Correct prolog SP initialization logic
Having callees that will read SP is not the only reason we need to
reference the stack pointer.
2020-08-05 15:47:53 -04:00
Stanislav Mekhanoshin ea7d0e2996 [AMDGPU] gfx1031 target
Differential Revision: https://reviews.llvm.org/D85337
2020-08-05 12:36:26 -07:00
Matt Arsenault 83eaf5d55d AMDGPU: Eliminate BUFFER_ATOMIC_PK_ADD_F16 node
This is redundant with the other no return buffer atomic node, and we
don't really need a separate type profile for it.
2020-08-05 15:16:51 -04:00
Matt Arsenault 43c0c9252a AMDGPU: Refactor buffer atomic intrinsic lowering
Move raw/struct buffer atomic lowering to separate functions. This
avoids a long nested switch, and simplifies a future patch.
2020-08-05 14:44:55 -04:00
Matt Arsenault 3e52667433 AMDGPU: Fix verifier error with undef source producing s_bitset*
This needs to preserve the undef flag.
2020-08-05 14:42:20 -04:00
Simon Pilgrim b60f998859 [X86][SSE] Fold 128-bit PACK(EXTEND(X),EXTEND(Y)) -> CONCAT(X,Y) subvectors
This is seen in the sub-128-bit vector trunc(ext()) of comparison results

Fixes pr46585.ll regression in D66004
2020-08-05 18:27:40 +01:00
Simon Pilgrim 6a06c7a0a7 [X86] isHorizontalBinOp - only update LHS/RHS references on success
We've had issues in the past where isHorizontalBinOp calls would affect later combines as the LHS/RHS references had been commuted but still failed to match.
2020-08-05 15:09:52 +01:00
Simon Pilgrim a57bfb44bc [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for integer types 2020-08-05 15:09:51 +01:00
Sam Parker f2675ab45f [ARM][CostModel] Implement getCFInstrCost
As with other targets, set the throughput cost of control-flow
instructions to free so that we don't miss out of vectorization
opportunities.

Differential Revision: https://reviews.llvm.org/D85283
2020-08-05 12:44:51 +01:00
Paul Walker 927fc536ca [SVE] Add lowering for fixed length vector and, or & xor operations.
Since there are no ill effects when performing these operations
with undefined elements, they are lowered to the already supported
unpredicated scalable vector equivalents.

Differential Revision: https://reviews.llvm.org/D85117
2020-08-05 11:28:34 +01:00
Sander de Smalen f2916636f8 [AArch64][SVE] Disable tail calls if callee does not preserve SVE regs.
This fixes an issue triggered by the following code, where emitEpilogue
got confused when trying to restore the SVE registers after the call,
whereas the call to bar() is implemented as a TCReturn:

  int non_sve();
  int sve(svint32_t x) { return non_sve(); }

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D84869
2020-08-05 09:38:54 +01:00
Jay Foad 8cbf4a17ac [AMDGPU] Propagate fast math flags in frem lowering
Differential Revision: https://reviews.llvm.org/D84518
2020-08-05 09:09:38 +01:00
Jay Foad 04cf4a5a65 [AMDGPU] Lower frem f16
Without this it would fail to select on subtargets that have 16-bit
instructions.

Differential Revision: https://reviews.llvm.org/D84517
2020-08-05 09:08:40 +01:00
Yonghong Song 00602ee7ef BPF: simplify IR generation for __builtin_btf_type_id()
This patch simplified IR generation for __builtin_btf_type_id().
For __builtin_btf_type_id(obj, flag), previously the IR builtin
looked like
   if (obj is a lvalue)
     llvm.bpf.btf.type.id(obj.ptr, 1, flag)  !type
   else
     llvm.bpf.btf.type.id(obj, 0, flag)  !type
The purpose of the 2nd argument is to differentiate
   __builtin_btf_type_id(obj, flag) where obj is an lvalue
vs.
   __builtin_btf_type_id(obj.ptr, flag)

Note that obj or obj.ptr is never used by the backend
and the `obj` argument is only used to derive the type.
This code sequence is subject to potential llvm CSE when
  - obj is the same, e.g., nullptr
  - flag is the same
  - metadata type is different, e.g., typedef of struct "s"
    and struct "s".
In the above, we don't want CSE since their metadata is different.

This patch changes the IR builtin to
   llvm.bpf.btf.type.id(seq_num, flag)  !type
and seq_num is always increasing. This will prevent potential
llvm CSE.

Also report an error if the type name is empty for
remote relocation since remote relocation needs non-empty
type name to do relocation against vmlinux.

Differential Revision: https://reviews.llvm.org/D85174
2020-08-04 16:29:42 -07:00
Arthur Eubanks f50b3ff02e [Hexagon] Use InstSimplify instead of ConstantProp
This is the last remaining use of ConstantProp, migrate it to InstSimplify in the goal of removing ConstantProp.

Add -hexagon-instsimplify option to enable skipping of instsimplify in
tests that can't handle the extra optimization.

Differential Revision: https://reviews.llvm.org/D85047
2020-08-04 15:42:39 -07:00
Krzysztof Parzyszek 09897b146a [RDF] Remove uses of RDFRegisters::normalize (deprecate)
This function has been reduced to an identity function for some time.
2020-08-04 17:02:12 -05:00
Matt Arsenault 486e84dfa4 AMDGPU/GlobalISel: Use live in helper function for returnaddress 2020-08-04 17:36:01 -04:00
Matt Arsenault 89011fc3c9 AMDGPU/GlobalISel: Select llvm.returnaddress 2020-08-04 17:14:38 -04:00
Matt Arsenault f8fb7835d6 GlobalISel: Add utility for getting function argument live ins
Get the argument register and ensure there's a copy to the virtual
register. AMDGPU and AArch64 have similarish code to get the livein
value, and I also want to use this in multiple places.

This is a bit more aggressive about setting the register class than
the original function, but that's probably OK.

I think we're missing a few verifier checks for function live ins. I
noticed AArch64's calling convention code is not actually adding
liveins to functions, only the entry block (which apparently might not
matter that much?). There should probably be a verifier check that
entry block live ins are also live into the function. We also might
need a verifier check that the copy to the livein virtual register is
in the entry block.
2020-08-04 16:55:55 -04:00
Eli Friedman 95efea4b93 [AArch64][SVE] Widen narrow sdiv/udiv operations.
The SVE instruction set only supports sdiv/udiv for 32-bit and 64-bit
integers.  If we see an 8-bit or 16-bit divide, widen the operands to 32
bits, and narrow the result.
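
As a rough, hedged illustration of this widening strategy (plain scalar C++,
not SVE code; the function name is made up for the example), an 8-bit signed
divide can be done by sign-extending to 32 bits, dividing there, and
truncating the result back:

```
#include <cstdint>

// Illustrative only: widen an 8-bit sdiv to a 32-bit sdiv and narrow the
// result. The widened divide produces the same value for every defined
// 8-bit input pair.
int8_t sdiv8_via_i32(int8_t a, int8_t b) {
  int32_t wa = a, wb = b;        // sign-extend the operands
  return (int8_t)(wa / wb);      // divide in 32 bits, narrow the result
}
```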

Differential Revision: https://reviews.llvm.org/D85170
2020-08-04 13:22:15 -07:00
Yonghong Song 6d218b4adb BPF: support type exist/size and enum exist/value relocations
Four new CO-RE relocations are introduced:
  - TYPE_EXISTENCE: whether a typedef/record/enum type exists
  - TYPE_SIZE: the size of a typedef/record/enum type
  - ENUM_VALUE_EXISTENCE: whether an enum value of an enum type exists
  - ENUM_VALUE: the enum value of an enum type

These additional relocations will make CO-RE bpf programs
more adaptive for potential kernel internal data structure
changes.

Differential Revision: https://reviews.llvm.org/D83878
2020-08-04 12:35:39 -07:00
David Blaikie e31cfc4cd3 Fix -Wconstant-conversion warning with explicit cast
Introduced by fd6584a220

Following similar use of casts in AsmParser.cpp, for instance - ideally
this type would use unsigned chars as they're more representative of raw
data and don't get confused around implementation defined choices of
char's signedness, but this is what it is & the signed/unsigned
conversions are (so far as I understand) safe/bit preserving in this
usage and what's intended, given the API design here.
2020-08-04 10:41:27 -07:00
Matt Arsenault 0de547ed4a AMDGPU/GlobalISel: Ensure subreg is valid when selecting G_UNMERGE_VALUES
Fixes verifier error with SGPR unmerges with 96-bit result types.
2020-08-04 12:27:34 -04:00
Nemanja Ivanovic 14d726acd6 [PowerPC] Don't remove single swap between the load and store
The swap removal pass looks to remove swaps when a loaded value is swapped, some
number of lane-insensitive operations are performed and then the value is
swapped again and stored.

However, in a situation where we load the value, swap it and then store it
without swapping again, the pass erroneously removes the single swap. The
reason is that both checks in the same equivalence class:

- load feeds a swap
- swap feeds a store

pass. However, there is no check for whether the two swaps are actually one and the same swap.
This patch just fixes that.

Differential revision: https://reviews.llvm.org/D84785
2020-08-04 10:38:15 -05:00
Jay Foad 28e322ea93 [PowerPC] Custom lowering for funnel shifts
The custom lowering saves an instruction over the generic expansion, by
taking advantage of the fact that PowerPC shift instructions are well
defined in the shift-by-bitwidth case.
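
For reference, a hedged sketch of the generic expansion in plain C++ (not the
PowerPC lowering itself): the portable form needs a guard against a shift by
the full bit width, which is exactly the extra instruction a target with
well-defined shift-by-bitwidth behaviour can drop.

```
#include <cstdint>

// Generic 64-bit funnel-shift-left: conceptually concat(hi, lo) shifted left
// by amt, keeping the high 64 bits. The amt == 0 guard avoids the undefined
// (lo >> 64); a shift instruction that yields 0 for amounts >= 64 makes the
// guard unnecessary.
uint64_t fshl64(uint64_t hi, uint64_t lo, unsigned amt) {
  amt &= 63;
  if (amt == 0)
    return hi;
  return (hi << amt) | (lo >> (64 - amt));
}
```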

Differential Revision: https://reviews.llvm.org/D83948
2020-08-04 16:30:49 +01:00
Jay Foad 8ec8ad868d [AMDGPU] Use fma for lowering frem
This gives shorter f64 code and perhaps better accuracy.

Differential Revision: https://reviews.llvm.org/D84516
2020-08-04 16:18:23 +01:00
Simon Pilgrim 6f0da46d53 [X86] getFauxShuffleMask - drop unnecessary computeKnownBits OR(X,Y) shuffle decoding.
Now that rG47cea9e82dda941e lets us aggressively decode multi-use shuffles for the OR(SHUFFLE(),SHUFFLE()) case we don't need the computeKnownBits variant any more.
2020-08-04 15:57:47 +01:00
Simon Pilgrim 051f293b78 [X86] Remove unused canScaleShuffleElements helper
The only use was removed at rG36750ba5bd0e9e72

Thanks to @nemanjai for the heads up
2020-08-04 14:51:23 +01:00
Simon Pilgrim 36750ba5bd [X86][AVX] isHorizontalBinOp - relax lane-crossing limits for AVX1-only targets.
Permit lane-crossing post shuffles on AVX1 targets as long as every element comes from the same source lane, which for v8f32/v4f64 cases can be efficiently lowered with the LowerShuffleAsLanePermuteAnd* style methods.
2020-08-04 14:27:01 +01:00
Sander de Smalen bb3344c7d8 [AArch64][SVE] Add missing unwind info for SVE registers.
This patch adds a CFI entry for each SVE callee saved register
that needs unwind info at an offset from the CFA. The offset is
a DWARF expression because the offset is partly scalable.

The CFI entries only cover a subset of the SVE callee-saves and
only encodes the lower 64-bits, thus implementing the lowest
common denominator ABI. Existing unwinders may support VG but
only restore the lower 64-bits.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D84044
2020-08-04 11:47:06 +01:00
Sander de Smalen fd6584a220 [AArch64][SVE] Fix CFA calculation in presence of SVE objects.
The CFA is calculated as (SP/FP + offset), but when there are
SVE objects on the stack the SP offset is partly scalable and
should instead be expressed as the DWARF expression:

     SP + offset + scalable_offset * VG

where VG is the Vector Granule register, containing the
number of 64bits 'granules' in a scalable vector.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D84043
2020-08-04 11:47:06 +01:00
Paul Walker 4be13b15d6 [SVE] Replace remaining _MERGE_OP1 nodes with _PRED variants.
This is the final bit of work to relax the register allocation
requirements when code generating normal LLVM IR, which rarely
care about the result of inactive lanes. By using _PRED nodes
we can make better use of SVE's reversed instructions.

Also removes a redundant parameter from the min/max tests.

Differential Revision: https://reviews.llvm.org/D85142
2020-08-04 11:19:17 +01:00
Meera Nakrani 20283ff491 [ARM] Generated SSAT and USAT instructions with shift
Added patterns so that both SSAT and USAT instructions are generated with shifts. Added corresponding regression tests.

Differential Revision: https://reviews.llvm.org/D85120
2020-08-04 09:38:17 +00:00
Simon Pilgrim 47cea9e82d Revert rG66e7dce714fab "Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.""
[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero (REAPPLIED)

This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.

Reapplied with fixed signed/unsigned comparisons.
2020-08-04 10:32:39 +01:00
Florian Hahn f7658241cb [AArch64] Consider instruction-level contract FMFs in combiner patterns.
Currently, instruction level fast math flags are not considered when
generating patterns for the machine combiner.

This currently leads to some missed opportunities to generate FMAs in
combination with `#pragma clang fp contract (fast)`.

For example, when building the example below with -O3 for AArch64, no
FMADD is generated. If built with -O2 and the DAGCombiner is used
instead of the MachineCombiner for FMAs, an FMADD is generated.

With this patch, the same code is generated in both cases.

    float madd_contract(float a, float b, float c) {
    #pragma clang fp contract (fast)
      return (a * b) + c;
    }

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D84930
2020-08-04 10:25:16 +01:00
Qiu Chaofan 6a78a8dd37 [NFC] [PowerPC] Refactor fp/int conversion lowering
For FP_TO_INT and INT_TO_FP lowering, we have direct-move and
non-direct-move methods. But they share some conversion logic, so we can
reduce redundant code by introducing new methods.

Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D81818
2020-08-04 15:48:16 +08:00
Wang, Pengfei 6bc7ea2d8d [X86][AVX512] Fix build fail after D81548
Test function mask_cmp_128 failed during ISEL
LLVM ERROR: Cannot select: t37: v8i1 = X86ISD::KSHIFTL t48, TargetConstant:i8<4>
due to v8i1 only being available under AVX512DQ.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84922
2020-08-04 12:31:04 +08:00
Chen Zheng 45c46d180e [PowerPC] mark r+i as legal address mode for vector type after pwr9
Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D84735
2020-08-04 00:02:37 -04:00
Carl Ritson 57899934ea [AMDGPU] Make GCNRegBankReassign assign based on subreg banks
When scavenging consider the sub-register of the source operand
to determine the bank of a candidate register (not just sub0).
Without this it is possible to introduce an infinite loop,
e.g. $sgpr15_sgpr16_sgpr17 can be assigned for a conflict between
$sgpr0 and SGPR_96:sub1.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D84910
2020-08-04 12:54:44 +09:00
Chen Zheng ba955397ac [SCEVExpander][PowerPC]clear scev rewriter before deleting instructions.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D85130
2020-08-03 20:36:08 -04:00
Christopher Tetreault c9e6887f83 [SVE] Remove bad calls to VectorType::getNumElements() from X86
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D85156
2020-08-03 16:34:10 -07:00
hgreving 509f5c4ec2 [MC] Fix memory leak when allocating MCInst with bump allocator
Adds the function createMCInst() to MCContext that creates an MCInst using
a typed bump allocator.

MCInst contains a SmallVector<MCOperand, 8>. The SmallVector is POD only
for <= 8 operands. The default untyped bump pointer allocator of MCContext
does not delete the MCInst, so if the SmallVector grows, it's a leak.
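
A standalone toy illustration of that failure mode (not LLVM code; the Arena
and Inst names are invented for the example): a bump allocator never runs
destructors, so any heap memory the object itself acquires, such as a small
vector that outgrew its inline storage, is leaked.

```
#include <cstddef>
#include <new>
#include <vector>

struct Arena {                     // toy bump allocator: never frees, never destroys
  alignas(std::max_align_t) char Buf[1 << 16];
  std::size_t Used = 0;
  void *allocate(std::size_t N, std::size_t Align) {
    Used = (Used + Align - 1) & ~(Align - 1);
    void *P = Buf + Used;
    Used += N;
    return P;
  }
};

struct Inst { std::vector<int> Operands; };  // stand-in for MCInst's operand list

int main() {
  Arena A;
  auto *I = new (A.allocate(sizeof(Inst), alignof(Inst))) Inst();
  I->Operands.resize(16);  // heap allocation owned by *I
  // The arena memory is simply dropped; ~Inst() never runs, so the
  // 16-element heap buffer leaks.
}
```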

This fixes https://bugs.llvm.org/show_bug.cgi?id=46900.
2020-08-03 16:08:26 -07:00
Christopher Tetreault 3b92db4c84 [SVE] Remove bad call to VectorType::getNumElements() from AMDGPU
Differential Revision: https://reviews.llvm.org/D85151
2020-08-03 15:56:10 -07:00
Christopher Tetreault b5059b7140 [SVE] Remove bad call to VectorType::getNumElements() from ARM
Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D85152
2020-08-03 15:41:14 -07:00
Jordan Rupprecht af3ec731d5 [NFC][ARM] Silence unused variable in release builds 2020-08-03 15:21:44 -07:00
Christopher Tetreault b43791e701 [SVE] Remove bad calls to VectorType::getNumElements() from PowerPC
Differential Revision: https://reviews.llvm.org/D85154
2020-08-03 15:15:20 -07:00
Mitch Phillips 9a05fa10bd [HWASan] [GlobalISel] Add +tagged-globals backend feature for GlobalISel
GlobalISel is the default ISel for aarch64 at -O0. Prior to D78465, GlobalISel
didn't have support for dealing with address-of-global lowerings, so it fell
back to SelectionDAGISel.

HWASan Globals require special handling, as they contain the pointer tag in the
top 16-bits, and are thus outside the code model. We need to generate a `movk`
in the instruction sequence with a G3 relocation to ensure the bits are
relocated properly. This is implemented in SelectionDAGISel, this patch does
the same for GlobalISel.

GlobalISel and SelectionDAGISel differ in their lowering sequence, so there are
differences in the final instruction sequence, explained in
`tagged-globals.ll`. Both of these implementations are correct, but GlobalISel
is slightly larger code size / slightly slower (by a couple of arithmetic
instructions). I don't see this as a problem for now as GlobalISel is only on
by default at `-O0`.

Reviewed By: aemerson, arsenm

Differential Revision: https://reviews.llvm.org/D82615
2020-08-03 14:28:44 -07:00
David Green 22916481c1 [ARM] Convert VPSEL to VMOV in tail predicated loops
VPSEL has slightly different semantics under tail predication (it can
end up selecting from Qn, Qm and Qd). We do not model that at the moment
so they block tail predicated loops from being formed.

This just converts them into a predicated VMOV instead (via a VORR),
allowing tail predication to happen whilst still modelling the original
behaviour of the input.

Differential Revision: https://reviews.llvm.org/D85110
2020-08-03 22:03:14 +01:00
Thomas Lively cb32792210 [WebAssembly] Implement prototype v128.load{32,64}_zero instructions
Specified in https://github.com/WebAssembly/simd/pull/237, these
instructions load the first vector lane from memory and zero the other
lanes. Since these instructions are not officially part of the SIMD
proposal, they are only available on an opt-in basis via LLVM
intrinsics and clang builtin functions. If these instructions are
merged to the proposal, this implementation will change so that the
instructions will be generated from normal IR. At that point the
intrinsics and builtin functions would be removed.

This PR also changes the opcodes for the experimental f32x4.qfm{a,s}
instructions because their opcodes conflicted with those of the
v128.load{32,64}_zero instructions. The new opcodes were chosen to
match those used in V8.

Differential Revision: https://reviews.llvm.org/D84820
2020-08-03 13:54:00 -07:00
Mitch Phillips 66e7dce714 Revert "[X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero."
This reverts commit 219f32f4b6.

Commit contains unsigned comparisons that break bots that build with
-Wsign-compare.
2020-08-03 13:48:30 -07:00
Eli Friedman dca23ed895 [AArch64] Add missing isel patterns for fcvtzs/u intrinsic on v1f64.
Fixes test-suite compile failure caused by 8dfb5d7.

While I'm in the area, add some more test coverage to related
operations, to make sure we aren't missing any other patterns.
2020-08-03 13:04:59 -07:00
Jian Cai c6334db577 [X86] support .nops directive
Add support for the .nops directive on X86. This addresses llvm.org/PR45788.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D82826
2020-08-03 11:50:56 -07:00
Joao Moreira f208c659fb [X86] Make ENDBR instruction a scheduling boundary
Instructions should not be scheduled across ENDBR instructions, as this would result in the ENDBR being displaced, breaking the parity needed for the Indirect Branch Tracking feature of CET.

Currently, the X86IndirectBranchTracking pass runs later than instruction scheduling in the pipeline, which causes the bug to go unnoticed and makes it very hard (if not infeasible) to trigger while compiling C files with the standard LLVM setup. Yet, for correctness and to prevent issues in future changes, the compiler should prevent such scheduling.

Differential Revision: https://reviews.llvm.org/D84862
2020-08-03 10:47:23 -07:00
Simon Pilgrim 219f32f4b6 [X86][SSE] Shuffle combine blends to OR(X,Y) if the relevant elements are known zero.
This allows us to remove the (depth violating) code in getFauxShuffleMask where we were combining the OR(SHUFFLE,SHUFFLE) shuffle inputs as well, and not just the OR().

This is a minor step toward being able to shuffle combine from/to SELECT/BLENDV as a faux shuffle.
2020-08-03 18:32:47 +01:00
Craig Topper ac82b918c7 [X86] Use h-register for final XOR of __builtin_parity on 64-bit targets.
This adds an isel pattern and special XOR8rr_NOREX instruction
to enable the use of h-registers for __builtin_parity. This avoids
a copy and a shift instruction. The NOREX instruction is in case
register allocation doesn't use the matching l-register for some
reason. If a R8-R15 register gets picked instead, we won't be
able to encode the instruction since an h-register can't be used
with a REX prefix.

Fixes PR46954
2020-08-03 10:10:17 -07:00
Cameron McInally 31c7a2fd5c [FPEnv] Don't transform FSUB(-0,X)->FNEG(X) in SelectionDAGBuilder.
This patch stops unconditionally transforming FSUB(-0,X) into an FNEG(X) while building the DAG. There is also one small change to handle the new FSUB(-0,X) similarly to FNEG(X) in the AMDGPU backend.

Differential Revision: https://reviews.llvm.org/D84056
2020-08-03 10:22:25 -05:00
Matt Arsenault 2414bab5d7 AMDGPU/GlobalISel: Remove old hacks for boolean selection
There were various hacks used to try to avoid making s1 SGPR vs. s1
VCC ambiguous after constraining the register before we had a strategy
to deal with this. This also attempted to handle undef operands, which
are now illegal gMIR.
2020-08-03 09:04:14 -04:00
Matt Arsenault fd63e46941 AMDGPU/GlobalISel: Apply load bitcast to s.buffer.load intrinsic
Should also apply this to the non-scalar buffer loads.
2020-08-03 08:54:29 -04:00
Simon Pilgrim 99a971cadf [X86][SSE] Start shuffle combining from ANY_EXTEND_VECTOR_INREG on SSE targets
We already do this on AVX (+ for ZERO_EXTEND_VECTOR_INREG), but this enables it for all SSE targets - we attempted something similar back at rL357057 but hit issues with the ZERO_EXTEND_VECTOR_INREG handling (PR41249).

I'm still looking at the vector-mul.ll regression - which is due to 32-bit targets performing the load as a f64, resulting in the shuffle combiner thinking it has to create a shuffle in the float domain.
2020-08-03 13:41:48 +01:00
Matt Arsenault d8ef1d1251 AMDGPU/GlobalISel: Fix selecting broken copies for s32->s64 anyext
These should probably not be legal in the first place, but that might
also be a pain.
2020-08-03 08:36:41 -04:00
Nicholas Guy 18279a54b5 [ARM] Fix IT block generation after Thumb2SizeReduce with -Oz
Fixes a regression caused by D82439, in which IT blocks were no longer being
generated when -Oz is present. This was due to the CPSR register being marked as
dead, while this case was not accounted for.

Differential Revision: https://reviews.llvm.org/D83667
2020-08-03 13:20:32 +01:00
Fangrui Song 40da58a04b [MC] Default MCAsmBackend::mayNeedRelaxation() to false 2020-08-02 22:13:59 -07:00
QingShan Zhang 62e4644616 [NFC][PowerPC] Add a multiclass for fsetcc to define them in a uniform way
This is a refactoring patch to prepare for adding support for strict-fsetcc
in the PowerPC backend. We want to define these nodes in a uniform way so
that the strict nodes can be added more easily.

Reviewed By: shchenz

Differential Revision: https://reviews.llvm.org/D81712
2020-08-03 03:28:03 +00:00
StephenFan a96921afa7 [RISCV] eliminate the repetition declare of SDLoc DL
Differential revision: https://reviews.llvm.org/D85002
2020-08-03 10:24:30 +08:00
Craig Topper 64516ec7c1 [X86] Use parity flag from byte test/cmp instruction for __builtin_parity when input fits in 8 bits.
If the upper bits of the __builtin_parity idiom are known to be
0 we were previously emitting an xor with 0 to get the parity flag.
But we can use cmp/test instead which may expose opportunities for
load folding or combining an AND.
2020-08-02 10:45:04 -07:00
Matt Arsenault 212570abcf GlobalISel: Implement bitcast action for G_EXTRACT_VECTOR_ELEMENT
For AMDGPU, vectors with elements < 32 bits should be indexed in
32-bit elements and the desired bits extracted from there. For
elements > 64 bits, these should be reduced to 64/32-bit elements to enable
the normal dynamic indexing paths.

In the dynamic index cases, this produces shorter code most of the
time. This does immediately regress the constant index cases, but this
should be fixed once we have the most basic of shift combines.

The element size > 64 case is pretty much ported from the existing
DAG implementation for extract element promote. The increasing element
size case is new.
2020-08-02 10:42:07 -04:00
Simon Pilgrim 00d0f354f2 X86InstrInfo.cpp - fix include ordering. NFCI. 2020-08-02 15:34:18 +01:00
Simon Pilgrim 7dd4f03595 Use merge null and isa<> tests into isa_and_nonnull<>. NFCI. 2020-08-02 15:34:18 +01:00
Simon Pilgrim d14a22da5e [DAG] TargetLowering::LowerAsmOutputForConstraint - pass SDLoc as const&
Try to be more consistent with the SDLoc param in the TargetLowering methods.
2020-08-02 15:12:02 +01:00
Simon Pilgrim 20fbbbc583 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:23 +01:00
Simon Pilgrim d7e2616741 [X86] Pass SDLoc by const reference. NFCI. 2020-08-02 14:32:22 +01:00
Simon Pilgrim 3f276840b6 [X86] Use const APInt& in for-range loop to avoid unnecessary copies. NFCI.
Fixes clang-tidy warning.
2020-08-02 14:32:22 +01:00
Simon Pilgrim 2700311cce [X86] combineX86ShuffleChain - pull out repeated RootVT.getSizeInBits() calls. NFCI. 2020-08-02 14:32:22 +01:00
Craig Topper 56166a3a52 [X86] Improve parity idiom recognition to handle (and (truncate (ctpop X)), 1).
Fixes part of PR46954
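
As a hedged example (assuming typical clang lowering, not taken from the
patch), a 64-bit popcount used as a 32-bit parity result produces exactly this
shape: an i64 ctpop, a truncate to i32, and an `and` with 1.

```
#include <cstdint>

// Parity of a 64-bit value; the popcount is computed as an i64 ctpop and
// then truncated to the 32-bit return type before the mask with 1.
int parity64(uint64_t x) {
  return __builtin_popcountll(x) & 1;
}
```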
2020-08-01 22:59:43 -07:00
Kazu Hirata 60434989e5 Use llvm::is_contained where appropriate (NFC)
Use llvm::is_contained where appropriate (NFC)

Reviewed By: kazu

Differential Revision: https://reviews.llvm.org/D85083
2020-08-01 21:51:06 -07:00
Craig Topper e297d928dc [X86] Add assembler support for {disp8} and {disp32} to control the size of displacement used for memory operands.
These prefixes should override the default behavior and force a larger immediate size. I don't believe gas issues any warning if you use {disp8} when a 32-bit displacement is already required. And this patch doesn't either.

This completes the {disp8} and {disp32} support from PR46650.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84793
2020-08-01 13:26:35 -07:00
Simon Pilgrim 82a5c848e7 [X86][AVX512] Fold concat(and(x,y),and(z,w)) -> and(concat(x,z),concat(y,w)) for 512-bit vectors
Helps vpternlog folding on non-AVX512BW targets
2020-08-01 20:34:39 +01:00
Simon Pilgrim bb13c34c3a [X86][AVX] Ensure we only combine to PSHUFLW/PSHUFHW on supporting targets
Noticed while investigating combining from concatenated shuffle vectors, we weren't checking that PSHUFLW/PSHUFHW was legal - we were depending on lowering splitting to subvectors.
2020-08-01 19:18:11 +01:00
David Green fd69df62ed [ARM] Distribute post-inc for Thumb2 sign/zero extending loads/stores
This adds sign/zero extending scalar loads/stores to the MVE
instructions added in D77813, allowing us to create more post-inc
instructions. These are comparatively simple, compared to LDR/STR (which
may be better turned into an LDRD/LDM), but still require some additions
over MVE instructions. Because there are i12 and i8 variants of the
offset loads/stores dealing with different signs, we may need to convert
an i12 address to a i8 negative instruction. t2LDRBi12 can also be
shrunk to a tLDRi under the right conditions, so we need to be careful
with codesize too.

Differential Revision: https://reviews.llvm.org/D78625
2020-08-01 14:01:18 +01:00
Simon Pilgrim 1b1901536a [X86][AVX] Extend v2f64 BROADCAST(LOAD) -> BROADCAST_LOAD to v2i64/v4f32/v4i32
Minor precursor fix for D66004, but helps the SSE41 tests as well as they run with -disable-peephole
2020-08-01 12:28:29 +01:00
Craig Topper 75f134eec1 [X86] Refactor the broadcast and load folding in tryVPTESTM to reduce some code.
Now we try to load and broadcast together for operand 1, followed
by load and broadcast for operand 0. Previously we tried load
operand 1, load operand 0, broadcast operand 1, broadcast operand 0.

Now we have a single helper that tries load and broadcast for
one operand that we can just call twice.
2020-07-31 23:57:13 -07:00
Craig Topper 1bd7046e4c [X86] Use TargetLowering::getRegClassFor to simplify some code in tryVPTESTM. NFCI 2020-07-31 21:39:10 -07:00
Justin Hibbits 7e9153e940 PowerPC: Don't lower SELECT_CC to PPCISD::FSEL on SPE
SPE doesn't have a fsel instruction, so don't try to lower to it.

This fixes a "Cannot select: tN: f64 = PPCISD::FSEL tX, tY, tZ" error.

Reviewed By: #powerpc, lkail
Differential Revision: https://reviews.llvm.org/D77773
2020-07-31 22:52:47 -05:00
Justin Hibbits 914dbf4808 PowerPC: Fix SPE extloadf32 handling.
The patterns were incorrect copies from the FPU code, and are
unnecessary, since there's no extended load for SPE.  Just let LLVM
itself do the work by marking it expand.

Reviewed By: #powerpc, lkail
Differential Revision: https://reviews.llvm.org/D78670
2020-07-31 22:42:57 -05:00
Kazushi (Jam) Marukawa 605fd4d77c [VE] Change calling convention to follow ABI
Change to expand all arguments and return values to i64 to follow ABI.
Update regression tests also.

Reviewed By: simoll

Differential Revision: https://reviews.llvm.org/D84581
2020-08-01 10:08:54 +09:00
Huihui Zhang 01bfe2e494 [AArch64][SVE] Allow vector of pointers as legal type for masked load/store.
Refer to the LangRef (http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics):
the 'llvm.masked.load/store.*' intrinsics are overloaded intrinsics, which allow the
loaded/stored data to be a vector of any integer, floating-point or pointer data type.

Therefore, allow pointer data type when checking 'isLegalMaskedLoadStore()'.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D85045
2020-07-31 17:30:23 -07:00
Craig Topper 93c678a79b [X86] Simplify vpternlog immediate selection.
Rather than hardcoding immediate values for 12 different combinations
in a nested pair of switches, we can perform the matched logic
operation on 3 magic constants to calculate the immediate.

Special thanks to this tweet https://twitter.com/rygorous/status/1187034321992871936
for making me realize I could do this.
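
A minimal standalone sketch of that trick (illustrative C++, not the in-tree
selection code): evaluate the desired logic expression on the constants 0xF0,
0xCC and 0xAA, whose bits enumerate all eight input combinations, and the
result is the truth-table immediate.

```
#include <cstdint>
#include <cstdio>

// Bit position i of (A, B, C) spells out one row of the truth table, so
// applying the logic op bitwise yields the ternary-logic imm8.
constexpr uint8_t A = 0xF0, B = 0xCC, C = 0xAA;

constexpr uint8_t ternlogImm(uint8_t a, uint8_t b, uint8_t c) {
  return (a & b) | c;  // example op: (A AND B) OR C
}

int main() {
  std::printf("imm8 = 0x%02X\n", (unsigned)ternlogImm(A, B, C));  // prints 0xEA
}
```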
2020-07-31 17:16:27 -07:00
Hsiangkai Wang 47a4a27f47 Upgrade MC to v0.9.
Differential revision: https://reviews.llvm.org/D80802
2020-08-01 07:42:06 +08:00
Sidharth Baveja b7cfa6ca92 [Loop Peeling] Separate the Loop Peeling Utilities from the Loop Unrolling Utilities
Summary: This patch separates the Loop Peeling Utilities from Loop Unrolling.
The reason for this change is that Loop Peeling is no longer only being used by
loop unrolling; Patch D82927 introduces loop peeling with fusion, such that
loops can be modified to have the same trip count, making them legal to be
peeled.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D83056
2020-07-31 18:31:58 +00:00
Albion Fung 93fd8dbdc2 [PowerPC] Add Vector String Isolate instruction definitions and MC Tests
This patch implements the instruction definition and MC tests for the vector
string isolate instructions.

Differential Revision: https://reviews.llvm.org/D84197
2020-07-31 12:32:29 -05:00
Benjamin Kramer c6f08b14d4 Hide some internal symbols. NFC. 2020-07-31 17:28:02 +02:00
Matt Arsenault 57bd64ff84 Support addrspacecast initializers with isNoopAddrSpaceCast
Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs
with the DataLayout.
2020-07-31 10:42:43 -04:00
Vitaly Buka b0eb40ca39 [NFC] Remove unused GetUnderlyingObject paramenter
Depends on D84617.

Differential Revision: https://reviews.llvm.org/D84621
2020-07-31 02:10:03 -07:00
QingShan Zhang 9b04fec002 [PowerPC] Retrieve the offset from load/store if it stores to stack slots
The scheduler will try to retrieve the offset and base address to determine if two
loads/stores are disjoint memory accesses. PowerPC failed to handle this for
frame indexes, which brings extra memory dependencies for loads/stores.

Reviewed By: jji

Differential Revision: https://reviews.llvm.org/D84308
2020-07-31 07:08:20 +00:00
Craig Topper 30a0dbb70d [X86] Remove x86_sse42_crc32_64_64 from X86TTIImpl::simplifyDemandedUseBitsIntrinsic
It doesn't do any simplifying. It just computes known bits. We
can just let InstCombine call computeKnownBits which will handle
this just as well.
2020-07-30 21:51:23 -07:00
Vitaly Buka 89051ebace [NFC] GetUnderlyingObject -> getUnderlyingObject
I am going to touch them in the next patch anyway
2020-07-30 21:08:24 -07:00
Craig Topper 916d9e1877 [X86] Pass the OperandVector by reference to ParseIntelOperand and ParseRoundingMode. NFCI
Similar to what was recently done to ParseATTOperand. Make
ParseIntelOperand directly responsible for adding to the operand
vector instead of returning the operand. Return a bool for error.

Remove ErrorOperand since it is no longer used.
2020-07-30 19:52:38 -07:00
Vitaly Buka b256cb88a7 [ValueTracking] Remove AllocaForValue parameter
findAllocaForValue uses AllocaForValue to cache resolved values.
The function is used only to resolve arguments of lifetime
intrinsics, which usually are not far from allocas, so result reuse
is likely unnoticeable.

In followup patches I'd like to replace the function with
GetUnderlyingObjects.

Depends on D84616.

Differential Revision: https://reviews.llvm.org/D84617
2020-07-30 18:48:34 -07:00
Scott Constable ec1445c5af [X86] Fix for ballooning compile times due to Load Value Injection (LVI) mitigations
Fix for the issue raised in https://github.com/rust-lang/rust/issues/74632.

The current heuristic for inserting LFENCEs uses a quadratic-time algorithm. This can apparently cause substantial compilation slowdowns for building Rust projects, where functions > 5000 LoC are apparently common.

The updated heuristic in this patch implements a linear-time algorithm. On a set of benchmarks, the slowdown factor for the generated code was comparable (2.55x geo mean for the quadratic-time heuristic, vs. 2.58x for the linear-time heuristic). Both heuristics offer the same security properties, namely, mitigating LVI.

This patch also includes some formatting fixes.

Differential Revision: https://reviews.llvm.org/D84471
2020-07-30 17:22:33 -07:00
Craig Topper 3ad09fd03c [X86] Separate CPU Feature lists in X86.td between architecture features and tuning features
After the recent change to the tuning settings for pentium4 to improve our default 32-bit behavior, I've decided to see about implementing -mtune support. This way we could have a default architecture CPU of "pentium4" or "x86-64" and a default tuning cpu of "generic". And we could change our "pentium4" tuning settings back to what they were before.

As a step to supporting this, this patch separates all of the features lists for the CPUs into 2 lists. I'm using the Proc class and a new ProcModel class to concat the 2 lists before passing to the target independent ProcessorModel. Future work to truly support mtune would change ProcessorModel to take 2 lists separately. I've diffed the X86GenSubtargetInfo.inc file before and after this patch to ensure that the final feature list for the CPUs isn't changed.

Differential Revision: https://reviews.llvm.org/D84879
2020-07-30 17:19:19 -07:00
Amara Emerson 09f9f7dd1b [AArch64][GlobalISel] Add legalization & selection support for G_INTRINSIC_LRINT.
Differential Revision: https://reviews.llvm.org/D84552
2020-07-30 16:14:56 -07:00
Matt Arsenault e56e9022bc AMDGPU: Fix liveness errors when copying AGPR tuples
Avoid recursively calling copyPhysReg for AGPR handling. This was
dropping the necessary super register implicit defs to avoid liveness
verifier errors.
2020-07-30 18:13:04 -04:00
Changpeng Fang 243376cdc7 AMDGPU: Put inexpensive ops first in AMDGPUAnnotateUniformValues::visitLoadInst
Summary:
  This is in response to the review of https://reviews.llvm.org/D84873:
The expensive check should be reordered last

Reviewers:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D84890
2020-07-30 14:37:06 -07:00
Wouter van Oortmerssen ce1eb7af9d [WebAssembly] Fixed 64-bit indices in br_table
LLVM selection dag assumes "switch" indices are pointer sized, which causes problems for our 32-bit br_table. The new function ensures 32-bit operands don't get unnecessarily extended, and 64-bit operands get truncated.

Note that the changes to the existing test check exactly that: the addition of -NEXT in 2 places ensures no extension is inserted (which the test previously ignored) and that the wrap is present (previously omitted in wasm64 mode).

Differential Revision: https://reviews.llvm.org/D84705
2020-07-30 10:52:16 -07:00
Stanislav Mekhanoshin 5b32518f96 [AMDGPU] Do not use undef on indirect source
We are using undef on the indirect move source subreg and then
using implicit super-reg. This creates a problem in RA when
Greedy decides to split the register. It reassigns the implicit
super-reg but does not bother to change the undef source because
it really does not matter. The fix is to stop lying to RA and
drop the undef flag.

This also hit a problem in SIFoldOperands, as it can now fold an
immediate into an indirect move since there is no undef flag
anymore. That resulted in multiple test failures, so a check for
this case was added.

Differential Revision: https://reviews.llvm.org/D84899
2020-07-30 10:41:59 -07:00
Craig Topper 3632f765dc [WebAssembly] Fix GCC 5 build.
Hans' speculative fix in b7292f2db0
didn't work for me. This seems to.
2020-07-30 10:00:28 -07:00
hsmahesha 33fd4a18e7 [AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic
Get rid of all fixmes and base heuristic on `num-clustered-dwords`. The main intuition behind this is as
follows. The existing heuristic roughly summarizes as below:

* Assume, all the mem ops instructions participating in the clustering process,  loads/stores same num bytes
* If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes
* If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes
* If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes

So, we need to make sure that the new heuristic does not completely deviate from the above one, and that it
properly handles both the sub-word loads and the wide loads.
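
A hedged sketch of such a byte-budget rule (the names are illustrative, not
the actual in-tree code) that reproduces the numbers above for the default of
5 clustered dwords:

```
#include <algorithm>
#include <cstdio>

// Max mem ops to cluster so the clustered bytes track the examples above:
// 4-byte ops -> 5, 8-byte ops -> 3, 16-byte ops -> 2.
unsigned maxClusterSize(unsigned BytesPerMemOp, unsigned NumClusteredDWords = 5) {
  unsigned DWordsPerOp = std::max(1u, BytesPerMemOp / 4);
  return (NumClusteredDWords + DWordsPerOp - 1) / DWordsPerOp;  // ceiling divide
}

int main() {
  for (unsigned Bytes : {4u, 8u, 16u})
    std::printf("%2u-byte ops -> cluster at most %u\n", Bytes, maxClusterSize(Bytes));
}
```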

Reviewed By: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D84354
2020-07-30 21:41:13 +05:30
Fangrui Song d2c2248722 [X86] Parse and ignore .arch directives
We parse .arch so that some `.arch i386; .code32` code can assemble. It seems
that X86AsmParser does not do a good job tracking what features are needed to
assemble instructions. GNU as's x86 port supports a very wide range of .arch
operands. Ignore the operand for now.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D84900
2020-07-30 08:30:06 -07:00
Momchil Velikov ef4e665435 [AArch64] Fix operand definitions of XPACI/XPACD
The operand to these instructions is both input and output.

These are not yet emitted by the compiler and the assembler already
works fine, so can't test in this patch.  But D75044 will use XPACI
and provide test coverage for this patch as well.

Differential Revision: https://reviews.llvm.org/D84298
2020-07-30 15:31:44 +01:00
Hans Wennborg b7292f2db0 Speculative GCC 5 build fix
It's complaining about specializing the template in a different namespace.
2020-07-30 16:12:52 +02:00
jasonliu 04dc9691eb [XCOFF][AIX] Enable -ffunction-sections
Summary:
This patch implements -ffunction-sections on AIX.
This patch focuses on assembly generation.
Follow-on patch needs to handle:
1. -ffunction-sections implication for jump table.
2. Object file generation path and associated testing.

Differential Revision: https://reviews.llvm.org/D83875
2020-07-30 13:30:01 +00:00
Simon Pilgrim cc529285fd VectorUtils.h - reduce unnecessary includes. NFC.
Replace TargetLibraryInfo.h include with forward declaration and fix implicit dependencies.

Reduce SmallSet.h include to SmallVector.h include.
2020-07-30 12:27:49 +01:00
Simon Pilgrim 2dec72ba5c [X86][SSE] combineExtractWithShuffle - extend extract(truncate(x),0) for any source vector size
As long as we can extract the lowest 128-bit subvector from the pre-truncated source vector, then we don't care what size it is.

The next stage will be to support non-zero extraction indices, as long as its still coming from the lowest 128-bit subvector.
2020-07-30 12:27:49 +01:00
David Sherwood 23ad660b5d [SVE][CodeGen] At -O0 fallback to DAG ISel when translating alloca with scalable types
When building code at -O0 we weren't falling back to DAG ISel correctly
when encountering alloca instructions with scalable vector types. This
is because the alloca has no operands that are scalable. I've fixed this by
adding a check in AArch64ISelLowering::fallBackToDAGISel for alloca
instructions with scalable types.
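
A hedged illustration of where such an alloca comes from (assumes a clang
build with SVE support; not taken from the patch): at -O0 the local variable
below gets a stack slot of a scalable vector type, even though the alloca
itself has no scalable operands.

```
#include <arm_sve.h>

// At -O0 the local copy is kept in a stack slot, i.e. an alloca of
// <vscale x 4 x i32>.
svint32_t passthrough(svint32_t v) {
  svint32_t tmp = v;
  return tmp;
}
```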

Differential Revision: https://reviews.llvm.org/D84746
2020-07-30 08:40:53 +01:00
Craig Topper 07bb8240a0 [X86] Pass the OperandVector to ParseMemOperand instead of returning the operand. NFCI
Continue the change made to ParseATTOperand to take the vector by
reference. Let ParseMemOperand add its memory operand to the
vector and just return true/false to indicate error.
2020-07-29 23:44:56 -07:00
Craig Topper 17597442db [X86] Don't pass some many parameters to ParseMemOperand by reference.
Pointers and SMLocs are cheap to copy. Even though the function
modifies some of these the caller doesn't use them after the call.
2020-07-29 23:44:56 -07:00
Craig Topper 9611ee5f40 [X86] Teach the assembler parser to handle a '*' between segment register and base/index/displacement part of an address
A '*' after the segment is equivalent to a '*' before the segment register. To make the AsmMatcher table work we need to place the '*' token into the operand vector before the full memory operand. To accomplish this I've modified some portions of operand parsing to expose the operand vector to ParseATTOperand so that the token can be pushed to the vector after parsing the segment register and before creating the memory operand using that segment register.

Fixes PR46879

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D84895
2020-07-29 21:15:04 -07:00
Kang Zhang a18953c1c0 [PowerPC] Fix RM operands for some instructions
Summary:
Some instructions have set the wrong [RM] flag, this patch is to fix it.

Instructions x(v|s)r(d|s)pi[zmp]? and fri[npzm] use fixed rounding
directions without referencing current rounding mode.

Also, the SETRNDi, SETRND, BCLRn, MTFSFI, MTFSB0, MTFSB1, MTFSFb,
MTFSFI, MTFSFI_rec, MTFSF, MTFSF_rec should also fix the RM flag.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D81360
2020-07-30 02:10:49 +00:00
Matt Arsenault 0da582d9b6 GlobalISel: Handle llvm.roundeven
I still think it's highly questionable that we have two intrinsics
with identical behavior and only vary by the name of the libcall used
if it happens to be lowered that way, but try to reduce the feature
delta between SDAG and GlobalISel for recently added intrinsics. I'm
not sure which opcode should be considered the canonical one, but
lower roundeven back to round.
2020-07-29 20:01:12 -04:00
Craig Topper b1c1825b99 [X86] Remove unused argument from HandleAVX512Operand in the assembly parser. 2020-07-29 14:23:01 -07:00
Simon Pilgrim a1c9529e60 [X86][AVX] isHorizontalBinOp - relax no-lane-crossing limit for AVX1-only targets.
Instead of never accepting v8f32/v4f64 FHADD/FHSUB if the input shuffle masks cross lanes, perform the matching and determine if the post shuffle mask simplifies to a 'whole lane shuffle' mask - in which case we are guaranteed to cheaply perform this as a VPERM2F128 shuffle.
2020-07-29 20:49:10 +01:00
Stanislav Mekhanoshin decfdb8ce3 [AMDGPU] Fixed formatting in GCNHazardRecognizer.cpp. NFC. 2020-07-29 12:21:28 -07:00
Stanislav Mekhanoshin 13b63be472 [AMDGPU] prefer non-mfma in post-RA schedule
MFMA instructions shall not be scheduled back to back
to avoid an MAI SIMD stall. Tell the post-RA scheduler we would
prefer some other instruction instead.

Differential Revision: https://reviews.llvm.org/D84883
2020-07-29 12:17:50 -07:00
Baptiste Saleil 7aaa85627b [PowerPC] Add options to control paired vector memops support
Adds frontend and backend options to enable and disable the
PowerPC paired vector memory operations added in ISA 3.1.
Instructions using these options will be added in subsequent patches.

Differential Revision: https://reviews.llvm.org/D83722
2020-07-29 14:00:53 -05:00
Amara Emerson d8ba622209 [AArch64][GlobalISel] Selection support for vector DUP[X]lane instructions.
In future, we'd like to use the perfect-shuffle mechanism to deal with these
shuffle permutations. For now, this improves performance by avoiding the
super-expensive const-pool load + tbl instruction.

Differential Revision: https://reviews.llvm.org/D84866
2020-07-29 11:41:37 -07:00
Matt Arsenault 59fac51ff2 AMDGPU/GlobalISel: Handle llvm.amdgcn.reloc.constant 2020-07-29 14:24:21 -04:00
Matt Arsenault 0b7de7966f GlobalISel: Implement lower for G_EXTRACT_VECTOR_ELT
Use the basic store to stack and reload.
2020-07-29 14:16:28 -04:00
Jessica Paquette 7ff9575594 [AArch64][GlobalISel] Select XRO addressing mode with wide immediates
Port the wide immediate case from AArch64DAGToDAGISel::SelectAddrModeXRO.

If we have a wide immediate which can't be represented in an add, we can end up
with code like this:

```
mov  x0, imm
add x1, base, x0
ldr  x2, [x1, 0]
```

If we use the [base, xN] addressing mode instead, we can produce this:

```
mov  x0, imm
ldr  x2, [base, x0]
```

This saves 0.4% code size on 7zip at -O3, and gives a geomean code size
improvement of 0.1% on CTMark.

Differential Revision: https://reviews.llvm.org/D84784
2020-07-29 11:02:10 -07:00
Matt Arsenault 766cb615a3 AMDGPU: Relax restriction on folding immediates into physregs
I never completed the work on the patches referenced by
f8bf7d7f42, but this was intended to
avoid folding immediate writes into m0 which the coalescer doesn't
understand very well. Relax this to allow simple SGPR immediates to
fold directly into VGPR copies. This pattern shows up routinely in
current GlobalISel code since nothing is smart enough to emit VGPR
constants yet.
2020-07-29 14:01:53 -04:00
Heejin Ahn 276f9e8cfa [WebAssembly] Fix getBottom for loops
When it was first created, CFGSort only made sure BBs in each
`MachineLoop` are sorted together. After we added exception support,
CFGSort now also sorts BBs in each `WebAssemblyException`, which
represents a `catch` block, together, and
`Region` class was introduced to be a thin wrapper for both
`MachineLoop` and `WebAssemblyException`.

But how we compute those loops and exceptions is different.
`MachineLoopInfo` is constructed using the standard loop computation
algorithm in LLVM; the definition of loop is "a set of BBs that are
dominated by a loop header and have a path back to the loop header". So
even if some BBs are semantically contained by a loop in the original
program, or in other words dominated by a loop header, if they don't
have a path back to the loop header, they are not considered a part of
the loop. For example, if a BB is dominated by a loop header but
contains `call abort()` or `rethrow`, it wouldn't have a path back to
the header, so it is not included in the loop.

But `WebAssemblyException` is a wasm-specific data structure, and its
algorithm is simple: a `WebAssemblyException` consists of an EH pad and
all BBs dominated by the EH pad. So this scenario is possible: (This is
also the situation in the newly added test in cfg-stackify-eh.ll)

```
Loop L: header, A, ehpad, latch
Exception E: ehpad, latch, B
```
(B contains `abort()`, so it does not have a path back to the loop
header, so it is not included in L.)

And it is sorted in this order:
```
header
A
ehpad
latch
B
```

And when CFGStackify places `end_loop` or `end_try` markers, it
previously used `WebAssembly::getBottom()`, which returns the latest BB
in the sorted order, and placed the marker there. So in this case the
marker placements will be like this:
```
loop
  header
  try
    A
  catch
    ehpad
    latch
end_loop         <-- misplaced!
    B
  end_try
```
in which nesting between the loop and the exception is not correct.
`end_loop` marker has to be placed after `B`, and also after `end_try`.

Maybe the fundamental way to solve this problem is to come up with our
own algorithm for computing loop region too, in which we include all BBs
dominated by a loop header in a loop. But this takes a lot more effort.
The only thing we actually need to fix is `getBottom()`. If we make it
return the right BB (meaning, in case of a loop, the latest BB of the
loop itself and of all exceptions contained in it), we are good.

This renames `Region` and `RegionInfo` to `SortRegion` and
`SortRegionInfo` and extracts them into their own file. And add
`getBottom` to `SortRegionInfo` class, from which it can access
`WebAssemblyExceptionInfo`, so that it can compute a correct bottom
block for loops.

Reviewed By: dschuff

Differential Revision: https://reviews.llvm.org/D84724
2020-07-29 10:36:32 -07:00
Craig Topper c4823b24a4 [X86] Add custom lowering for llvm.roundeven with sse4.1.
We can use the roundss/sd/ps/pd instructions like we do for
ceil/floor/trunc/rint/nearbyint.

Differential Revision: https://reviews.llvm.org/D84592
2020-07-29 10:23:08 -07:00
Roman Lebedev 1d51dc38d8
[SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach the vectorizer in a rather mangled form, with weird PHIs,
and some of the loops aren't even in a rotated form.

After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
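
For reference, a minimal sketch (not from the patch) of what the common
instruction hoisting transform does: when both successors of a conditional
branch start with the same instruction, it is pulled up in front of the
branch.

```
// Before hoisting: both arms start with the same add.
int before(int a, int b, bool c) {
  int x;
  if (c) { x = a + b; x *= 2; }
  else   { x = a + b; x *= 3; }
  return x;
}

// After hoisting the common "a + b" above the branch.
int after(int a, int b, bool c) {
  int x = a + b;
  if (c) x *= 2;
  else   x *= 3;
  return x;
}
```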

Surprisingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default and is always run, unlike its friend, the common code sinking
transform `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once, very late in the pipeline.

I'm proposing to harmonize this, and disable common code hoisting
until //late// in the pipeline. The definition of //late// may vary;
here I've currently picked the same one as for code sinking,
but I suppose we could enable it as soon as right after
loop rotation happens.

Experimentation shows that this does indeed, unsurprisingly, help:
more loops get rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run-time performance and code size. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.

As per the benchmarks I've run {F12360204}, this is mostly within the noise;
there are some small improvements and some small regressions.
One big regression I saw I fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but I'm sure
this will expose many more pre-existing missed optimizations, as usual :S

llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprisingly)
* size impact varies; for ThinLTO it's actually an improvement

The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.

If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
  * -14 (-73.68%) loops not rotated due to the header size (yay)
  * -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
  * -3937 (-64.19%) common instructions hoisted
  * +561 (+0.06%) x86 asm instructions
  * -2 basic blocks
  * +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable  {F12360201}
  * -36396 (-65.29%) common instructions hoisted
  * +1676 (+0.02%) x86 asm instructions
  * +662 (+0.06%) basic blocks
  * +4395 (+0.04%) IR instructions

It is likely to be sub-optimal when optimizing for code size,
so one might want to tune the pipeline by enabling sinking/hoisting
when optimizing for size.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D84108
2020-07-29 20:05:30 +03:00
Kang Zhang 802c043078 [PowerPC] Set v1i128 to expand for SETCC to avoid crash
Summary:
PPC only supports instruction selection for ISD::SETCC with v16i8,
v8i16, v4i32, v2i64, v4f32 and v2f64; it doesn't support v1i128, so
ISD::SETCC on v1i128 will crash.

This patch sets v1i128 to Expand to avoid the crash.
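
A minimal sketch of the fix (assuming, as is typical for such changes,
that it lives in the PPCTargetLowering constructor):
```cpp
// Let legalization expand ISD::SETCC on v1i128 instead of letting it
// reach instruction selection, where it currently crashes.
setOperationAction(ISD::SETCC, MVT::v1i128, Expand);
```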

Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D84238
2020-07-29 16:39:27 +00:00
Matt Arsenault d42c7b2211 AMDGPU: Account for the size of LDS globals used through constant
expressions.

Also "fix" the longstanding bug where the computed size depends on the
order of the visitation. We could try to predict the allocation order
used by legalization, but it would never be 100% perfect. Until we
start fixing the addresses somehow (or have a more reliable allocation
scheme later), just try to compute the size based on the worst case
padding.
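
A rough sketch of the worst-case computation described above (the
container and field names here are hypothetical, for illustration only):
```cpp
// Assume every LDS global may be preceded by maximal alignment padding,
// so the computed size no longer depends on the visitation order.
uint64_t WorstCaseSize = 0;
for (const LDSGlobal &G : UsedLDSGlobals) // hypothetical bookkeeping
  WorstCaseSize += (G.Align > 1 ? G.Align - 1 : 0) + G.SizeInBytes;
```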
2020-07-29 11:40:42 -04:00
Simon Pilgrim d1abca187d [CostModel][X86] Add SSE costs for SMAX/SMIN/UMAX/UMIN intrinsics 2020-07-29 15:55:43 +01:00
Simon Pilgrim 0a0f28254a [CostModel][X86] Add SSE costs for ABS intrinsics 2020-07-29 14:33:59 +01:00
David Green 9ddb28964c [ARM] Tune getCastInstrCost for extending masked loads and truncating masked stores
This patch uses the feature added in D79162 to fix the cost of a
sext/zext of a masked load, or a trunc for a masked store.
Previously, those were considered cheap or even free, but it's
not the case as we cannot split the load in the same way we would for
normal loads.

This updates the costs to better reflect reality, and adds a test for it
in test/Analysis/CostModel/ARM/cast.ll.

It also adds a vectorizer test that showcases the improvement: in some
cases, the vectorizer will now choose a smaller VF when
tail-predication is enabled, which results in better codegen. (Because
if it were to use a higher VF in those cases, the code we see above
would be generated, and the vmovs would block tail-predication later in
the process, resulting in very poor codegen overall)

Original Patch by Pierre van Houtryve

Differential Revision: https://reviews.llvm.org/D79163
2020-07-29 13:41:34 +01:00
David Green 60280e9818 [Analysis] TTI: Add CastContextHint for getCastInstrCost
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types.  Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization.  Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.

For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.

To fix that, this patch adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.
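
A sketch of what such a hint can look like (the enumerator names below
are assumed for illustration, not quoted from the patch):
```cpp
// Context in which a cast appears, used to refine its cost.
enum class CastContextHint : uint8_t {
  None,          // cast is not paired with a load/store
  Normal,        // normal load/store
  Masked,        // masked load/store
  GatherScatter, // gather/scatter load/store
  Interleave,    // interleaved load/store
  Reversed,      // reversed load/store
};
```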

Original patch by Pierre van Houtryve

Differential Revision: https://reviews.llvm.org/D79162
2020-07-29 13:32:53 +01:00
Sjoerd Meijer 85342c27a3 [ARM] Optimize immediate selection
Optimize some specific immediates selection by materializing them with sub/mvn
instructions as opposed to loading them from the constant pool.

Patch by Ben Shi, powerman1st@163.com.

Differential Revision: https://reviews.llvm.org/D83745
2020-07-29 13:29:17 +01:00
Matt Arsenault 200bb5191a AMDGPU/GlobalISel: Refactor special argument management 2020-07-29 08:27:31 -04:00
Matt Arsenault c230965ccf AMDGPU: Make saturating add/sub legal for DAG path 2020-07-29 08:27:31 -04:00
Matt Arsenault cdd45d5f9c AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.csub
Remove the custom node boilerplate. Not sure why this tried to handle
the LDS atomic stuff.
2020-07-29 08:27:31 -04:00
Simon Pilgrim 0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but the broadcast case was a simpler win.
2020-07-29 11:32:44 +01:00
Ikhlas Ajbar d50d4c3d44 [Hexagon] Correct the order of operands when lowering funnel shift-left
This patch corrects the order of operands in the pattern that lowers fshl
in Hexagon.
2020-07-28 21:22:41 -05:00
Kang Zhang 00046d789c [PowerPC] Add Def CR1 for MTFSFI_rec and MTFSF_rec 2020-07-29 01:47:23 +00:00
Matt Arsenault 44211f20a8 AMDGPU: Optimize copies to exec with other insts after exec def
It's possible to have terminator instructions after a write to exec,
so skip over them to find it.
2020-07-28 21:34:50 -04:00
Matt Arsenault b6ebc77326 AMDGPU/GlobalISel: Fix selecting llvm.amdgcn.s.getreg
This introduces the same bug llvm.amdgcn.s.setreg has: if the
user specifies an immediate outside of the valid 16-bit range, it will
select into a verifier error.
2020-07-28 21:34:50 -04:00
Thomas Lively 11bb7eef41 [WebAssembly] Remove intrinsics for SIMD widening ops
Instead, pattern match extends of extract_subvectors to generate
widening operations. Since extract_subvector is not a legal node, this
is implemented via a custom combine that recognizes extract_subvector
nodes before they are legalized. The combine produces custom ISD nodes
that are later pattern matched directly, just like the intrinsic was.

Also removes the clang builtins for these operations since the
instructions can now be generated from portable code sequences.

Differential Revision: https://reviews.llvm.org/D84556
2020-07-28 18:25:55 -07:00
Craig Topper 06cf6f770d [X86] Add FeatureCMPXCHG8B and FeatureSlowUAMem16 to 'lakemont' in X86.td
We already had the CMPXCHG8B feature on this CPU for the frontend, so
this doesn't have much effect.

The FeatureSlowUAMem16 only matters if someone compiles with
-march=lakemont -msse which doesn't make sense, but is consistent
with all our pre-sse4.2 CPUs. Maybe the feature flag should be
FeatureFastUAMem16 and set on the newer CPUs instead.
2020-07-28 18:24:46 -07:00
Matt Arsenault 6a7b6dd54b AMDGPU: Don't assert in canInsertSelect
Currently GlobalISel doesn't force all VGPR phi operands to VGPRs, so
this hit a case where it was queried with a VGPR and SGPR. This could
arguably be a verifier error, but it's currently not.
2020-07-28 21:01:06 -04:00
Thomas Lively ffd8c23ccb [WebAssembly] Implement truncating vector stores
Rather than expanding truncating stores so that vectors are stored one
lane at a time, lower them to a sequence of instructions using
narrowing operations instead, when possible. Since the narrowing
operations have saturating semantics, but truncating stores require
truncation, mask the stored value to manually truncate it before
narrowing. Also, since narrowing is a binary operation, pass in the
original vector as the unused second argument.

Differential Revision: https://reviews.llvm.org/D84377
2020-07-28 17:46:45 -07:00
Matt Arsenault 068808d102 AMDGPU: Don't assume call targets are registers
GlobalISel let through a call to null, which would then fold into the
source operand like any other inline immediate. The SelectionDAG
lowering deletes calls to null and undef as a workaround from before
calls were supported. We should probably drop the special handling
case in the DAG lowering now, since the middle end optimizers delete
null calls anyway.
2020-07-28 20:46:06 -04:00
Matt Arsenault 8860daf0ed AMDGPU: Handle a few missing cases in getAddrModeArguments 2020-07-28 20:22:38 -04:00
Matt Arsenault b3e63aa8a4 AMDGPU: Don't assume there is only one terminator copy
This would stop on the first in reverse order, failing the verifier if
there were more earlier in the block.
2020-07-28 20:22:38 -04:00
Matt Arsenault 592f2e8d1c AMDGPU: Fix verifier error on spilling partially defined SGPRs
This needs an implicit def of the super-register in case one of the
lanes isn't defined, similar to copyPhysReg (or the not-VGPR spill
case below). This showed up in GlobalISel testing since it currently
doesn't fold out many undef instructions.
2020-07-28 20:01:57 -04:00
Matt Arsenault 66d60e06cb AMDGPU: Serialize MFI spill fields
These should probably be inferred from the function on parse, but the
target specific infrastructure currently does not give you a way to do
this. SILowerSGPRSpills early exits without this reporting spills,
which makes it difficult to write a MIR test for.
2020-07-28 20:01:57 -04:00
Matt Arsenault e87356b498 GlobalISel: Don't assert on operations with no type indices
Fix not marking G_FENCE as legal on AMDGPU. This was apparently
defaulting to legal using the "legacy" rules, whatever those are.
2020-07-28 16:49:55 -04:00
Matt Arsenault 5174e7b443 GlobalISel: Add typeIsNot LegalityPredicate
This allows sorting the legal/custom rules first, as is recommended.
2020-07-28 16:49:55 -04:00
Matt Arsenault 9731ef3ec5 AMDGPU/GlobalISel: Add SReg_96 to SGPRRegBank 2020-07-28 16:49:55 -04:00
Matt Arsenault e9b236f411 AMDGPU: Check for other defs when folding conditions into s_andn2_b64
We can't fold the masked compare value through the select if the
select condition is redefined after the and instruction. This fixes a
verifier error caused by trying to use the outgoing value defined in the
block.

I'm not sure why this pass is bothering to handle physregs. It's
making this more complex and forces extra liveness computation.
2020-07-28 16:36:23 -04:00
Roman Lebedev e1dd212c87
[X86] Remove disabled miscompiling X86CondBrFolding pass
As briefly discussed on IRC with @craig.topper,
the pass has been disabled basically since its original introduction (Nov 2018)
due to known correctness issues (miscompilations),
and there hasn't been much work done to fix that.

While I won't promise that I will "fix" the pass,
I have looked at it previously, and I'm sure I won't try to fix it
if that requires actually fixing this existing code.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D84775
2020-07-28 23:35:04 +03:00
Craig Topper 69152a11cf [X86] Merge the two 'Emit the normal disp32 encoding' cases in SIB byte handling in emitMemModRMByte. NFCI
By repeating the Disp.isImm() check in a couple spots we can
make the normal case for immediate and for expression the same.
And then always rely on the ForceDisp32 flag to remove a later
non-zero immediate check.

This should make {disp32} pseudo prefix handling
slightly easier, as we need the normal disp32 handler to handle an
immediate of 0.
2020-07-28 12:12:09 -07:00
jasonliu f8ab66538c [NFC][XCOFF] Use getFunctionEntryPointSymbol from TLOF to simplify logic
Reviewed By: Xiangling_L

Differential Revision: https://reviews.llvm.org/D84693
2020-07-28 18:59:51 +00:00
Simon Pilgrim 4838cd46a9 [X86][XOP] Shuffle v16i8 using VPPERM(X,Y) instead of OR(PSHUFB(X),PSHUFB(Y)) 2020-07-28 19:56:10 +01:00
Austin Kerbow adeeac9d5a [AMDGPU] Spill CSR VGPR which is reserved for SGPR spills
Update the logic for reserving a VGPR for SGPR spills. A CSR VGPR being reserved for
SGPR spills could be clobbered if there were no free lower VGPRs available.
Create a stack object so that it will be spilled in the prologue. Also
adds more tests.

Differential Revision: https://reviews.llvm.org/D83730
2020-07-28 11:53:02 -07:00
Craig Topper 91b8c1fd0f [X86] Simplify some code in emitMemModRMByte. NFCI 2020-07-28 10:46:04 -07:00
Craig Topper 6c3dc6e1d5 [X86] Merge disp8 and cdisp8 handling into a single helper function to reduce some code.
We currently handle EVEX and non-EVEX separately in two places. By sinking the EVEX
check into the existing helper for CDisp8 we can simplify these two places.

Differential Revision: https://reviews.llvm.org/D84730
2020-07-28 10:46:01 -07:00
Anna Welker 0c64233bb7 [ARM][MVE] Teach MVEGatherScatterLowering to merge successive getelementpointers
A patch following up on the introduction of pointer induction variables, adding
a preprocessing step to the address optimisation in the MVEGatherScatterLowering
pass. If the getelementpointer that is the address is itself using a
getelementpointer as base, they will be merged into one by summing up the
offsets, after checking that this will not cause an overflow (this can be
repeated recursively).

Differential Revision: https://reviews.llvm.org/D84027
2020-07-28 17:31:20 +01:00
Arthur Eubanks 2ca6c422d2 [FunctionAttrs] Rename functionattrs -> function-attrs
To match NewPM pass name, and also for readability.
Also rename rpo-functionattrs -> rpo-function-attrs while we're here.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D84694
2020-07-28 09:09:13 -07:00
Matt Arsenault 16bcd54570 AMDGPU/GlobalISel: Mark GlobalISel classes as final 2020-07-28 11:42:17 -04:00
Matt Arsenault bb23b5cfe0 AMDGPU/GlobalISel: Merge identical select cases 2020-07-28 11:42:17 -04:00
Matt Arsenault a4edc04693 AMDGPU/GlobalISel: Use clamp modifier for [us]addsat/[us]subsat
We also have never handled this for SelectionDAG, which needs
additional work.
2020-07-28 11:18:05 -04:00
Sander de Smalen cda2eb3ad2 [AArch64][SVE] Fix epilogue for SVE when the stack is realigned.
While deallocating the stackframe, the offset used to reload the
callee-saved registers was not pointing to the SVE callee-saves,
but rather to the whole SVE area.

   +--------------+
   | GPR callee   |
   |     saves    |
   +--------------+ <- FP
   | SVE callee   |
   |     saves    |
   +--------------+ <- Should restore SVE callee saves from here
   |  SVE Spills  |
   |  and Locals  |
   +--------------+ <- instead of from here.
   |              |
   :              :
   |              |
   +--------------+ <- SP

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D84539
2020-07-28 15:45:53 +01:00
Sander de Smalen 26b4ef3694 [AArch64][SVE] Don't align the last SVE callee save.
Instead of aligning the last callee-saved-register slot to the stack
alignment (16 bytes), just align the SVE callee-saved block. This also
simplifies the code that allocates space for the callee-saves.

This change is needed to make sure the offset to which the callee-saved
register is spilled, corresponds to the offset used for e.g. unwind call
frame instructions.

Reviewers: efriedma, paulwalker-arm, david-arm, rengolin

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D84042
2020-07-28 15:45:53 +01:00
Sander de Smalen 54492a5843 [AArch64][SVE] Don't support fixedStack for SVE objects.
Fixed stack objects are preallocated and defined to be allocated before
any of the regular stack objects. These are normally used to model stack
arguments.

The AAPCS does not support passing SVE registers on the stack by value
(only by reference). The current layout also doesn't place them before
all stack objects, but rather before all SVE objects. Removing this
simplifies the code that emits the allocation/deallocation
around callee-saved registers (D84042).

This patch also removes all uses of fixedStack from from
framelayout-sve.mir, where this was used purely for testing purposes.

Reviewers: paulwalker-arm, efriedma, rengolin

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D84538
2020-07-28 15:45:53 +01:00
Jinsong Ji d28f86723f Re-land "[PowerPC] Remove QPX/A2Q BGQ/BGP CNK support"
This reverts commit bf544fa1c3.

Fixed the typo in PPCInstrInfo.cpp.
2020-07-28 14:00:11 +00:00
Tim Northover 39108f4c7a ARM: make Thumb1 instructions non-flag-setting in IT block.
Many Thumb1 instructions are defined to set CPSR if executed outside an IT
block, but leave it alone from inside one. In MachineIR this is represented by
whether an optional register is CPSR or NoReg (0), and affects how the
instructions are printed.

This sets the instruction to the appropriate form during if-conversion.
2020-07-28 13:31:17 +01:00
Stefan Pintilie 97470897c4 [PowerPC] Split s34imm into two types
Currently the instruction paddi always takes s34imm as the type for the
34 bit immediate. However, the PC Relative form of the instruction should
not produce the same fixup as the non PC Relative form.
This patch splits the s34imm type into s34imm and s34imm_pcrel so that two
different fixups can be emitted.

Reviewed By: nemanjai, #powerpc, kamaub

Differential Revision: https://reviews.llvm.org/D83255
2020-07-28 05:55:56 -05:00
Simon Pilgrim 182111777b [X86][SSE] Attempt to match OP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
An initial backend patch towards fixing the various poor HADD combines (PR34724, PR41813, PR45747 etc.).

This extends isHorizontalBinOp to check if we have per-element horizontal ops (odd+even element pairs), but not in the expected serial order - in which case we build a "post shuffle mask" that we can apply to the HOP result, assuming we have fast-hops/optsize etc.

The next step will be to extend the SHUFFLE(HOP(X,Y)) combines as suggested on PR41813 - accepting more post-shuffle masks even on slow-hop targets if we can fold it into another shuffle.

Differential Revision: https://reviews.llvm.org/D83789
2020-07-28 10:04:14 +01:00
Craig Topper 647e861e08 [X86] Detect if EFLAGs is live across XBEGIN pseudo instruction. Add it as livein to the basic blocks created when expanding the pseudo
XBEGIN causes several basic blocks to be inserted. If flags are live across it, we need to make EFLAGS live-in to the new basic blocks to avoid machine verifier errors.

Fixes PR46827

Reviewed By: ivanbaev

Differential Revision: https://reviews.llvm.org/D84479
2020-07-27 21:15:35 -07:00
Craig Topper 25f193fb46 [X86] Add support for {disp32} to control size of jmp and jcc instructions in the assembler
By default we pick a 1-byte displacement and let relaxation enlarge it if necessary. The GNU assembler supports a pseudo prefix to basically pre-relax the instruction to the larger size.

I plan to add {disp8} and {disp32} support for memory operands in another patch which is why I've included the parsing code and enum for {disp8} pseudo prefix as well.

Reviewed By: echristo

Differential Revision: https://reviews.llvm.org/D84709
2020-07-27 21:11:48 -07:00
Craig Topper a0ebac52df [X86] Properly encode a 32-bit address with an index register and no base register in 16-bit mode.
In 16-bit mode we can encode a 32-bit address using 0x67 prefix.
We were failing to do this when the index register was a 32-bit
register, the base register was not present, and the displacement
fit in 16-bits.

Fixes PR46866.
2020-07-27 21:11:42 -07:00
Matt Arsenault ce944af33c AMDGPU/GlobalISel: Mark G_ATOMICRMW_{NAND|FSUB} as lower
These aren't implemented and we're still relying on the AtomicExpand
pass, but mark these as lower to eliminate a few of the remaining
cases with no rules defined.
2020-07-27 18:47:40 -04:00
Matt Arsenault 8b81d0633f AMDGPU: global_atomic_csub is not always dereferenceable 2020-07-27 18:47:39 -04:00
Francesco Petrogalli adb28e0fb2 [llvm][CodeGen] Addressing modes for SVE ldN.
Reviewers: c-rhodes, efriedma, sdesmalen

Subscribers: huihuiz, tschuett, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77251
2020-07-27 22:18:28 +00:00
Jinsong Ji bf544fa1c3 Revert "[PowerPC] Remove QPX/A2Q BGQ/BGP CNK support"
This reverts commit adffce7153.

This is breaking test-suite, revert while investigation.
2020-07-27 21:07:00 +00:00
Arthur Eubanks beb7e3bb70 Rename t2-reduce-size -> thumb2-reduce-size
For readability and consistency with other thumb2 passes like
"thumb2-it".

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D84696
2020-07-27 13:42:13 -07:00
Jinsong Ji adffce7153 [PowerPC] Remove QPX/A2Q BGQ/BGP CNK support
Per RFC http://lists.llvm.org/pipermail/llvm-dev/2020-April/141295.html
no one is making use of QPX/A2Q/BGQ/BGP CNK anymore.

This patch remove the support of QPX/A2Q in llvm, BGQ/BGP in clang,
CNK support in openmp/polly.

Reviewed By: hfinkel

Differential Revision: https://reviews.llvm.org/D83915
2020-07-27 19:24:39 +00:00
Arthur Eubanks 2a672767cc Prefix some AArch64/ARM passes with "aarch64-"/"arm-"
For consistency with other target specific passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D84560
2020-07-27 11:00:39 -07:00
Kazu Hirata 902cbcd59e Use llvm::is_contained where appropriate (NFC)
Summary:
This patch replaces std::find with llvm::is_contained where
appropriate.

Reviewers: efriedma, nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, rogfer01, kerbowa, llvm-commits, vkmr

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84489
2020-07-27 10:20:44 -07:00
Craig Topper 51e1c028d4 [X86] Add back comment inadvertently lost in 1a1448e656. 2020-07-27 10:02:38 -07:00
Francesco Petrogalli dbeb184b7f [NFC][AArch64] Replace some template methods/invocations...
...with the non-template version, as the template version might
increase the size of the compiler build.

Methods affected:

1.`findAddrModeSVELoadStore`
2. `SelectPredicatedStore`

Also, remove the `const` qualifier from the `unsigned` parameters of
the methods to conform with other similar methods in the class.
2020-07-27 16:52:08 +00:00
Simon Pilgrim 4d84d94969 [X86][SSE] Relax 128-bit restriction on extract_subvector(ext_vector_inreg(X),0) -> ext_vector_inreg(extract_subvector(X,0)) fold
We only need to ensure that the source is larger than the subvector result type
2020-07-27 17:50:36 +01:00
jasonliu c25f61cf6a [XCOFF][AIX] Handle llvm.used and llvm.compiler.used global array
For now, just return and do nothing when we see the llvm.used and
llvm.compiler.used global arrays.
Hopefully, we can come up with a good solution later to prevent the
linker from eliminating symbols in the llvm.used array.

Reviewed By: DiggerLin, daltenty

Differential Revision: https://reviews.llvm.org/D84363
2020-07-27 15:28:32 +00:00
Jon Roelofs f5e1ec8c58 [AArch64] fjcvtzs,rmif,cfinv,setf* all clobber nzcv
Differential Revision: https://reviews.llvm.org/D83818
2020-07-27 09:17:53 -06:00
Simon Pilgrim ab4ffa52f0 [X86][AVX] Fold extract_subvector(truncate(x),0) -> truncate(extract_subvector(x),0)
This is currently only supported for VLX targets where the op should be legal.
2020-07-27 14:51:29 +01:00
Simon Pilgrim f720c9c68c [X86] combineExtractSubvector - pull out repeated getSizeInBits() calls. NFCI. 2020-07-27 14:51:28 +01:00
Tim Northover 0f1494be43 AArch64: avoid UB shift of negative value
Left shifting a negative value is undefined behaviour, so this just moves the
negation afterwards to avoid it.
2020-07-27 13:49:50 +01:00
Tim Northover 216b67e202 AArch64: diagnose out of range relocation addends on MachO.
MachO only has 24-bit addends for most relocations, small enough that it can
overflow in semi-reasonable functions and cause insidious bugs if compiled
without assertions enabled. Switch it to an actual error instead.

The condition isn't quite identical because ld64 treats the addend as a signed
number.
2020-07-27 13:01:22 +01:00
Piotr Sobczak 590dd73c6e [AMDGPU] Make generating cache invalidating instructions optional
Summary:
D78800 skipped generating cache invalidating instructions altogether
on AMDPAL. However, this is sometimes too restrictive - we want a
more flexible option to be able to toggle this behaviour on and off
while we work towards developing a correct implementation of the
alternative memory model.

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, dexonsmith, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84448
2020-07-27 09:24:11 +02:00
David Sherwood 14bc85e0eb [SVE] Don't use LocalStackAllocation for SVE objects
I have introduced a new TargetFrameLowering query function:

  isStackIdSafeForLocalArea

that queries whether or not it is safe for objects of a given stack
id to be bundled into the local area. The default behaviour is to
always bundle regardless of the stack id; however, for AArch64 this is
overridden so that it's only safe for fixed-size stack objects.
There is future work here to extend this algorithm for multiple local
areas so that SVE stack objects can be bundled together and accessed
from their own virtual base-pointer.

Differential Revision: https://reviews.llvm.org/D83859
2020-07-27 08:22:01 +01:00
biplmish 825ed2d43d [PowerPC] Add Vector Extract Double Instruction Definitions and MC tests.
This patch adds the td definitions and asm/disasm tests for the following instructions:

Vector Extract Double Left Index - vextdubvlx, vextduhvlx, vextduwvlx, vextddvlx
Vector Extract Double Right Index - vextdubvrx, vextduhvrx, vextduwvrx, vextddvrx

Differential Revision: https://reviews.llvm.org/D84384
2020-07-26 23:56:19 -05:00
Matt Arsenault e97aa5609f AMDGPU/GlobalISel: Don't assert in LegalizerInfo constructor
We don't really need these asserts. The LegalizerInfo is also
overly aggressively constructed, even when not in use. It needs to not
assert on dummy targets that have manually specified, unrelated
features.
2020-07-26 23:01:28 -04:00
Craig Topper df12524e6b [X86] Turn X86DAGToDAGISel::tryVPTERNLOG into a fully custom instruction selector that can handle bitcasts between logic ops
Previously we just matched the logic ops and replaced with an
X86ISD::VPTERNLOG node that we would send through the normal
pattern match. But that approach couldn't handle a bitcast
between the logic ops. Extending that approach would require us
to peek through the bitcasts and emit new bitcasts to match
the types. Those new bitcasts would then have to be properly
topologically sorted.

This patch instead switches to directly emitting the
MachineSDNode and skips the normal tablegen pattern matching.
We do have to handle load folding and broadcast load folding
ourselves now. Which also means commuting the immediate control.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83630
2020-07-26 12:19:08 -07:00
Craig Topper 1a75d88b3e [X86] Move getGatherOverhead/getScatterOverhead into X86TargetTransformInfo.
These cost methods don't make much sense in X86Subtarget. Make
them methods in X86's TTI and move the feature checks from the
X86Subtarget constructor into these methods.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D84594
2020-07-26 10:38:42 -07:00
Simon Pilgrim 17eafe0841 [X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening
If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0, (elements 0,1 of the v4i32) which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls.

By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.
2020-07-26 16:04:22 +01:00
Matt Arsenault d35e2c101d AMDGPU/GlobalISel: Fix not constraining ds_append/consume operands 2020-07-26 10:17:36 -04:00
Matt Arsenault f6176f8a5f GlobalISel: Handle G_PTR_ADD in narrowScalar 2020-07-26 10:08:17 -04:00
Matt Arsenault 7c09c173a2 AMDGPU/GlobalISel: Reorder G_CONSTANT legality rules
The legal cases should be the first rules.
2020-07-26 10:05:05 -04:00
Matt Arsenault bcf5184a68 AMDGPU/GlobalISel: Make sure <2 x s1> phis are scalarized 2020-07-26 10:04:47 -04:00
Matt Arsenault 6f961a1e7e AMDGPU/GlobalISel: Legalize GDS atomics
I noticed these don't use the _gfx9, non-m0-reading variants, but I'm not
sure if that's a bug or not. It's the same in the DAG.
2020-07-26 10:03:34 -04:00
Matt Arsenault 5819159995 AMDGPU/GlobalISel: Pack constant G_BUILD_VECTOR_TRUNCs when selecting 2020-07-26 09:55:34 -04:00
Matt Arsenault 4033aa1467 AMDGPU/GlobalISel: Sign extend integer constants
This matches the DAG behavior and fixes immediate folding
2020-07-26 09:30:14 -04:00
Amara Emerson 9b19400004 [AArch64][GlobalISel] Make <8 x s16> and <16 x s8> legal types for G_SHUFFLE_VECTOR and G_IMPLICIT_DEF.
Trivial change, we're still missing support for rev matching for these types
in the combiner.
2020-07-26 00:48:09 -07:00
Craig Topper 1a1448e656 [X86] Merge X86MCInstLowering's maxLongNopLength into emitNop and remove check for FeatureNOPL.
The switch in emitNop uses 64-bit registers for nops exceeding
2 bytes. This isn't valid outside 64-bit mode. We could fix this
easily enough, but there are no users that ask for more than 2
bytes outside 64-bit mode.

Inlining the method makes the coupling between the two methods
more explicit.
2020-07-25 22:11:47 -07:00
Craig Topper 14c59b4577 [X86] Remove getProcFamily() method from X86Subtarget. NFC
This isn't used and we've decided in the past that a CPU enum
for tuning is not a good idea.
2020-07-25 22:11:45 -07:00
Craig Topper 1df8804ce5 [X86] Replace a use of ProcIntelSLM with FeatureFast7ByteNOP. 2020-07-25 20:46:48 -07:00
Nemanja Ivanovic cdead4f89c [PowerPC][NFC] Fix an assert that cannot trip from 7d076e19e3
I mixed up the precedence of operators in the assert and thought I
had it right since there was no compiler warning. This just
adds the parentheses in the expression as needed.
2020-07-25 20:28:52 -04:00
Matt Arsenault 392b969c32 AMDGPU/GlobalISel: Don't assert on G_INSERT > 128-bits
Just fall back for now. Really, tablegen needs to generate all of the
subregister index handling we need.
2020-07-25 10:05:44 -04:00
Simon Pilgrim 3b21823e4a [X86][SSE] combineX86ShufflesRecursively - move all Root node asserts to the same location. NFCI.
Minor tidyup for some upcoming shuffle combine improvements.
2020-07-25 12:48:14 +01:00
Simon Pilgrim 66998ae59f [X86][SSE] getFauxShuffle - ignore undemanded sources for PACKSS/PACKUS faux shuffles
If we don't care about an entire LHS/RHS of the PACK op, then we can just treat it the same as undef (we don't care if it saturates) and it is safe to treat as a shuffle.

This can happen if we attempt to decode as a faux shuffle before SimplifyDemandedVectorElts has been called on the PACK which should replace the source with UNDEF entirely.
2020-07-25 10:51:14 +01:00
Jessica Paquette 604e33e83a [AArch64][GlobalISel] Look through constants when selecting stores of 0
Very minor code size improvements (hits 8 times in Bullet at -O3), but still
something.

Also very minor NFC change to make sure we only search for a 0 constant when
selecting a store. Before, we'd do this for loads as well.

Differential Revision: https://reviews.llvm.org/D84573
2020-07-24 22:46:14 -07:00
Amy Kwan 739cd2638b [PowerPC] Exploit the High Order Vector Multiply Instructions on Power10
This patch aims to exploit the following vector multiply high instructions on Power10.
vmulhsw VRT, VRA, VRB
vmulhsd VRT, VRA, VRB
vmulhuw VRT, VRA, VRB
vmulhud VRT, VRA, VRB

Differential Revision: https://reviews.llvm.org/D82584
2020-07-24 20:57:57 -05:00
Amy Kwan 74790a5dde [PowerPC] Implement Truncate and Store VSX Vector Builtins
This patch implements the `vec_xst_trunc` function in altivec.h in order to
utilize the Store VSX Vector Rightmost [byte | half | word | doubleword] Indexed
instructions introduced in Power10.

Differential Revision: https://reviews.llvm.org/D82467
2020-07-24 19:22:39 -05:00
Jessica Paquette fcc55c0952 [AArch64][GlobalISel] Use wzr/xzr for 16 and 32 bit stores of zero
We weren't performing this optimization on 16 and 32 bit stores. SDAG happily
does this though.

e.g. https://godbolt.org/z/cWocKr

This saves about 0.2% in code size on CTMark at -O3.

Differential Revision: https://reviews.llvm.org/D84568
2020-07-24 17:15:20 -07:00
Amara Emerson f320f83f3a [AArch64][GlobalISel] Promote G_UITOFP vector operands to same elt size as result.
Fixes legalization failures.
2020-07-24 17:00:50 -07:00
Matt Arsenault 2bd72abef0 AMDGPU: Skip other terminators before inserting s_cbranch_exec[n]z
PHIElimination/createPHISourceCopy inserts non-branch terminators
after the control flow pseudo if a successor phi reads a register
defined by the control flow pseudo. If this happens, we need to split
the expansion of the control flow pseudo to ensure all the branches
are after all of the other mask management instructions.

GlobalISel hit this in testscases that happened to be tail
duplicated. The original testcase still does not work, since the same
problem appears to be present in a later pass.
2020-07-24 16:51:59 -04:00
Eli Friedman c02aa53ecb [AArch64][SVE] Add "fast" fcmp operations.
dacf8d3 added support for most fcmp operations, but there are some extra
variations I hadn't considered: SelectionDAG supports float comparisons
that are neither ordered nor unordered. Add support for the missing
operations.

Differential Revision: https://reviews.llvm.org/D84460
2020-07-24 13:22:41 -07:00
Nemanja Ivanovic 7d076e19e3 [PowerPC] Fix computation of offset for load-and-splat for permuted loads
Unfortunately this is another regression from my canonicalization patch
(1fed131660). The patch contained two implicit assumptions:
1. That we would have a permuted load only if we are loading a partial vector
2. That a partial vector load would necessarily be as wide as the splat

However, assumption 2 is not correct since it is possible to do a wider
load and only splat a half of it. This patch corrects this assumption by
simply checking if the load is permuted and adjusting the offset if it is.
2020-07-24 15:38:46 -04:00
madhur13490 4a577c3a22 [AMDGPU] Fix incorrect arch assert while setting up FlatScratchInit
Reviewers: arsenm, foad, rampitec, scott.linder

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84391
2020-07-24 18:19:04 +00:00
Craig Topper 945ed22f33 [X86] Move the implicit enabling of sse2 for 64-bit mode from X86Subtarget::initSubtargetFeatures to X86_MC::ParseX86Triple.
ParseX86Triple already checks for 64-bit mode and produces a
static string. We can just add +sse2 to the end of that static
string. This avoids a potential reallocation when appending it
to the std::string at runtime.

This is a slight change to the behavior of tools that only use
MC layer which weren't implicitly enabling sse2 before, but will
now. I don't think we check for sse2 explicitly in any MC layer
components so this shouldn't matter in practice. And if it did
matter the new behavior is more correct.
2020-07-24 11:14:20 -07:00
Francesco Petrogalli 809600d664 [llvm][sve] Reg + Imm addressing mode for ld1ro.
Reviewers: kmclaughlin, efriedma, sdesmalen

Subscribers: tschuett, hiraditya, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83357
2020-07-24 17:48:47 +00:00
Craig Topper 8158f0cefe [X86] Use X86_MC::ParseX86Triple to add mode features to feature string in X86Subtarget::initSubtargetFeatures.
Remove mode flags from constructor and remove calls to
ToggleFeature for the mode bits.

By adding them to the feature string we handle initializing the
mode member variables in X86Subtarget and the feature bits in
MCSubtargetInfo in one shot.
2020-07-24 10:48:22 -07:00
Meera Nakrani db37937a47 [ARM] Added additional patterns to VABD instruction
Added extra patterns to VABD instruction so it is selected in place of VSUB and VABS. Added corresponding regression test too.

Differential Revision: https://reviews.llvm.org/D84500
2020-07-24 17:46:25 +00:00
Meera Nakrani 805e6bcf22 Test Commit
Test commit - added whitespace in ARMInstrMVE.td
2020-07-24 17:22:56 +00:00
Dmitry Preobrazhensky 6b8948922c [AMDGPU][MC] Added support of SP3 syntax for MTBUF format modifier
Currently supported LLVM MTBUF syntax is shown below. It is not compatible with SP3.

    op     dst, addr, rsrc, FORMAT, soffset

This change adds support for SP3 syntax:

    op     dst, addr, rsrc, soffset SP3FORMAT

In addition to being compatible with SP3, this syntax allows using symbolic names for data, numeric and unified formats. Below is a list of added syntax variants.

format:<expression>
format:[<numeric-format-name>,<data-format-name>]
format:[<data-format-name>,<numeric-format-name>]
format:[<data-format-name>]
format:[<numeric-format-name>]
format:[<unified-format-name>]

The last syntax variant is supported for GFX10 only.

See llvm bug 37738

Reviewers: arsenm, rampitec, vpykhtin

Differential Revision: https://reviews.llvm.org/D84026
2020-07-24 16:41:03 +03:00
Simon Pilgrim 0128b9505c Revert rG5dd566b7c7b78bd- "PassManager.h - remove unnecessary Function.h/Module.h includes. NFCI."
This reverts commit 5dd566b7c7.

Causing some buildbot failures that I'm not seeing on MSVC builds.
2020-07-24 13:02:33 +01:00
Simon Pilgrim 5dd566b7c7 PassManager.h - remove unnecessary Function.h/Module.h includes. NFCI.
PassManager.h is one of the top headers in the ClangBuildAnalyzer frontend worst offenders list.

This exposes a large number of implicit dependencies on various forward declarations/includes in other headers that need addressing.
2020-07-24 12:40:50 +01:00
Petar Avramovic 47bd41d099 AMDGPU/GlobalISel: Select set.inactive intrinsic
Differential Revision: https://reviews.llvm.org/D84407
2020-07-24 10:14:14 +02:00
Craig Topper 205e8b7e89 [X86] Make the X86ProcFamilyEnum private to X86Subtarget. Removed unneeded 'protected' from X86Subtarget. NFC 2020-07-23 23:42:11 -07:00
Matt Arsenault 891759db73 GlobalISel: Add scalarSameSizeAs LegalizeRule
Widen or narrow a type to a type with the same scalar size as
another. This can be used to force G_PTR_ADD/G_PTRMASK's scalar
operand to match the bitwidth of the pointer type. Use this to
disallow narrower types for G_PTRMASK.
2020-07-23 21:17:31 -04:00
Eli Friedman 993c1a3219 [AArch64][SVE] Teach copyPhysReg to copy ZPR2/3/4.
It's sort of tricky to hit this in practice, but not impossible. I have
a synthetic C testcase if anyone is interested.

The implementation is identical to the equivalent NEON register copies.

Differential Revision: https://reviews.llvm.org/D84373
2020-07-23 16:41:37 -07:00
Amy Kwan 1dc1a3fb0c [PowerPC] Implement low-order Vector Multiply, Modulus and Divide Instructions
This patch aims to implement the low order vector multiply, divide and modulo
instructions available on Power10.

The patch involves legalizing the ISD nodes MUL, UDIV, SDIV, UREM and SREM for
v2i64 and v4i32 vector types in order to utilize the following instructions:
- Vector Multiply Low Doubleword: vmulld
- Vector Modulus Word/Doubleword: vmodsw, vmoduw, vmodsd, vmodud
- Vector Divide Word/Doubleword: vdivsw, vdivsd, vdivuw, vdivud
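
A sketch of the legalization step described above (assuming it sits in
the PPCTargetLowering constructor behind a Power10 feature check):
```cpp
// Mark the operations as Legal for the vector types that now have
// native Power10 instructions.
for (MVT VT : {MVT::v4i32, MVT::v2i64}) {
  setOperationAction(ISD::MUL,  VT, Legal);
  setOperationAction(ISD::SDIV, VT, Legal);
  setOperationAction(ISD::UDIV, VT, Legal);
  setOperationAction(ISD::SREM, VT, Legal);
  setOperationAction(ISD::UREM, VT, Legal);
}
```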

Differential Revision: https://reviews.llvm.org/D82510
2020-07-23 17:18:36 -05:00
David Green b37e92201c [ARM] Add predicated mla reduction patterns
Similar to 8fa824d7a3 but this time for MLA patterns, this selects
predicated vmlav/vmlava/vmlalv/vmlava instructions from
vecreduce.add(select(p, mul(x, y), 0)) nodes.

Differential Revision: https://reviews.llvm.org/D84102
2020-07-23 21:47:59 +01:00
Matt Arsenault b9c644ec61 AMDGPU: Fix failures from overflowing uint8_t number of operands
If the operand index exceeded the limit of unsigned char, it wrapped
and would point to the wrong operand. Increase the size of the operand
index field to avoid this, and also don't bother trying to fold into
implicit operands.
2020-07-23 15:39:33 -04:00
Amara Emerson 3b10e42ba1 [AArch64][GlobalISel] Add post-legalize combine for sext(trunc(sextload)) -> trunc/copy
On AArch64 we generate redundant G_SEXTs or G_SEXT_INREGs because of this.

Differential Revision: https://reviews.llvm.org/D81993
2020-07-23 12:06:35 -07:00
Matt Arsenault d2b8fcff34 AMDGPU/GlobalISel: Handle call return values
The only case that I know doesn't work is the implicit sret case when
the return type doesn't fit in the return registers.
2020-07-23 14:29:35 -04:00
Craig Topper 5dbcf5e3cc [X86] Add Feature64Bit to the 'generic' CPU and remove feature string hacking in X86Subtarget constructor
Feature64Bit is only used by a check in the X86Subtarget
constructor to ensure that the CPU selected supports 64-bit mode
when the triple is for 64-bit mode.

'generic' is the default CPU in llc and so needs to be able to
pass this check. Previously we did this by detecting the name and
adding the feature to the feature string. But there doesn't seem
to be any reason we can't just add the feature to the CPU directly.
2020-07-23 09:16:18 -07:00
Sebastian Neubauer 896679733d [AMDGPU] Fix typo. NFC 2020-07-23 17:01:12 +02:00
Simon Pilgrim d720ba1e4b [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add SSE shift multiple use handling
Add SimplifyMultipleUseDemandedVectorElts peek through for imm/var SSE shifts
2020-07-23 14:39:03 +01:00
Ulrich Weigand 68a80a4436 [SystemZ] Ensure -mno-vx disables any use of vector features
When passing the -vector feature to LLVM (or equivalently the
-mno-vx command line argument to clang), the intent is that
generated code must not use any vector features (in particular,
no vector registers must be used).

However, there are some cases where we still could generate
such uses; these are all related to some of the additional
vector features (like +vector-enhancements-1).  Since none
of those features are actually usable with -vector, just make
sure we disable them all if -vector is given.
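
A minimal sketch of the idea (the subtarget member names are assumed
here, not quoted from the patch):
```cpp
// With -vector, force-disable every additional feature that could still
// introduce uses of vector registers.
if (!HasVector) {
  HasVectorEnhancements1 = false;
  HasVectorEnhancements2 = false;
  HasVectorPackedDecimal = false;
  HasVectorPackedDecimalEnhancement = false;
}
```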
2020-07-23 15:34:59 +02:00
Jay Foad b35833b84e [GlobalISel][AMDGPU] Legalize saturating add/subtract
Add support in LegalizerHelper for lowering G_SADDSAT etc. either
using add/subtract-with-overflow or using max/min instructions.

Enable this lowering for AMDGPU so it can be tested. The legalization
rules are still approximate and skip out on using the clamp bit to
treat these as legal, which has never been used before. This also
doesn't yet try to deal with expanding SALU cases.
2020-07-23 09:06:42 -04:00
Craig Topper ebe5f17f9c [X86] Remove the DeprecatedMPX feature flag.
We deprecated the mpx feature in 10.0. I left this feature flag
in case someone still had IR files containing the feature
in a target-feature attribute. At the time I think I thought it
would fail the test if the feature couldn't be found. Further
review suggests that at worst it prints a message to
stderr about ignoring the feature.
2020-07-22 17:44:07 -07:00
Craig Topper b2c65beb14 [X86] Rework the "sahf" feature flag to only apply to 64-bit mode.
SAHF/LAHF instructions are always available in 32-bit mode. Early
64-bit capable CPUs made them undefined opcodes in 64-bit mode. This
was changed on later CPUs.

We have a feature flag to control our usage of these instructions.
This feature flag is hooked up to a clang command line option
-msahf/-mno-sahf specifically to give control of the 64-bit mode
behavior.

In the backend X86Subtarget constructor we were explicitly forcing
+sahf into the feature flag string if we were not compiling for
64-bit mode. This was intended to make the predicates always allow
the instructions outside of 64-bit mode. Unfortunately, the way
it was placed into the string allowed -mno-sahf from clang to disable
SAHF instructions in 32-bit mode. This causes an assertion to fire
if you compile a floating point comparison with something like
"-march=pentium -mno-sahf" as our floating point comparison
handling on CPUs that don't support FCOMI/FUCOMI instructions
requires SAHF.

To fix this, this commit restricts the feature flag to only apply to
64-bit mode by ignoring the flag outside 64-bit mode in
X86Subtarget::hasLAHFSAHF(). This way we don't need to mess with
the feature string at all.
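
A minimal sketch of the described check (member names are assumed for
illustration):
```cpp
// Ignore the feature flag outside 64-bit mode, where SAHF/LAHF are
// always available.
bool hasLAHFSAHF() const { return HasLAHFSAHF || !In64BitMode; }
```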
2020-07-22 16:57:46 -07:00
Amy Kwan 5f11027395 [PowerPC][Power10] Fix vins*vlx instructions to have i32 arguments.
Previously, the vins*vlx instructions were incorrectly defined with i64 as the
second argument. This patch fixes this issue by correcting the second argument
of the vins*vlx instructions/intrinsics to be i32.

Differential Revision: https://reviews.llvm.org/D84277
2020-07-22 17:58:14 -05:00
Craig Topper deeb2fdbf4 [X86] Remove a couple temporary std::string for CPU names that I don't need to exist.
The input to these functions is a StringRef. We then convert it
to a std::string, and then maybe replace it with "generic". I think we
can just overwrite the incoming StringRef with "generic" if needed
and then pass it along without creating any std::string.
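
A minimal sketch of the idea (the helper name and placement are
illustrative, not taken from the patch):
```cpp
#include "llvm/ADT/StringRef.h"

// Substitute the default CPU name without materializing a std::string.
static llvm::StringRef normalizeCPUName(llvm::StringRef CPU) {
  return CPU.empty() ? llvm::StringRef("generic") : CPU;
}
```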
2020-07-22 15:55:04 -07:00
David Green 411eb87c79 [ARM] Fix missing MVE_VMUL_qr predicate
This was missed out of 1030e82598, but hopefully fixes the issues
reported with NEON accidentally generating MVE instructions.
2020-07-22 20:43:02 +01:00
Amy Kwan 08b4a50e39 [PowerPC][Power10] Fix the Test LSB by Byte (xvtlsbb) Builtins Implementation
The implementation of the xvtlsbb builtins/intrinsics was not correct, as the
intrinsics previously used i1 as an argument type. This patch changes the i1
argument type used in these intrinsics to be i32 instead, as having the second
argument as an i1 can lead to issues in the backend.

Differential Revision: https://reviews.llvm.org/D84291
2020-07-22 13:27:05 -05:00
Matt Arsenault d26526fd09 AArch64: Use Register 2020-07-22 14:14:44 -04:00
Matt Arsenault 0c92bfa4b8 GlobalISel: Don't use virtual for distinguishing arg handlers
There's no reason to involve the hassle of a virtual method targets
have to override for a simple boolean.

Not sure exactly what's going on with Mips, but it seems to define its
own totally separate handler classes.
2020-07-22 14:14:43 -04:00
Matt Arsenault 6f437117af AMDGPU: Don't assert on f16 inv2pi immediates pre-gfx8
v_cvt_f32_f16 can still accept this value as a literal constant. This
showed up in GlobalISel since it doesn't have constant folding for
G_FPEXT.
2020-07-22 13:59:03 -04:00
Matt Arsenault b98f902f18 GlobalISel: Restructure argument lowering loop in handleAssignments
This was structured in a way that implied every split argument is in
memory, or in registers. It is possible to pass an original argument
partially in registers, and partially in memory. Transpose the logic
here to only consider a single piece at a time. Every individual
CCValAssign should be treated independently, and any merge to original
value needs to be handled later.

This is in preparation for merging some preprocessing hacks in the
AMDGPU calling convention lowering into the generic code.

I'm also not sure what the correct behavior is for memlocs where the
promoted size is larger than the original value. I've opted to clamp
the memory access size to not exceed the value register to avoid the
explicit trunc/extend/vector widen/vector extract instruction. This
happens for AMDGPU for i8 arguments that end up stack passed, which
are promoted to i16 (I think this is a preexisting DAG bug though, and
they should not really be promoted when in memory).
2020-07-22 13:31:11 -04:00
Matt Arsenault 1fd1beea18 AMDGPU/GlobalISel: Fix translation of indirect calls 2020-07-22 13:13:21 -04:00
David Green 8fa824d7a3 [ARM] Add predicated add reduction patterns
Given a vecreduce.add(select(p, x, 0)), we can convert that to a
predicated vaddv, as the else value for the select is the identity
value, a zero. That is what this patch does for the vaddv, vaddva,
vaddlv and vaddlva instructions, copying the existing patterns to also
handle predication through a select.

Differential Revision: https://reviews.llvm.org/D84101
2020-07-22 17:30:02 +01:00
Dmitry Preobrazhensky 0b8fd77ad9 [AMDGPU][MC] Corrected decoding of 16-bit literals
16-bit literals are encoded as 32-bit values. If the high 16 bits of the value are 0xFFFF, the decoded instruction cannot be reassembled.

For example, the following code

0xff,0x04,0x04,0x52,0xcd,0xab,0xff,0xff

was decoded as

v_mul_lo_u16_e32 v2, 0xffffabcd, v2

However this literal is actually a 64-bit constant 0x00000000ffffabcd which violates requirements described in the documentation - the truncation is not safe.

This change corrects decoding to make reassembly possible.

Reviewers: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D84098
2020-07-22 17:20:43 +03:00
Stefan Pintilie a60251d739 [PowerPC] Add linker opt for PC Relative GOT indirect accesses
A linker optimization is available on PowerPC for GOT indirect PCRelative loads.

The idea is that we can mark a usual GOT indirect load:

pld 3, vec@got@pcrel(0), 1
lwa 3, 4(3)

With a relocation to say that if we don't need to go through the GOT we can let
the linker further optimize this and replace a load with a nop.

  pld 3, vec@got@pcrel(0), 1
.Lpcrel1:
.reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
  lwa 3, 4(3)

This patch adds the logic that allows the compiler to add the R_PPC64_PCREL_OPT.

Reviewers: nemanjai, lei, hfinkel, sfertile, efriedma, tstellar, grosbach

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D79864
2020-07-22 09:08:23 -05:00
jasonliu b98b1700ef [XCOFF] Enable symbol alias for AIX
Summary:
AIX assembly's .set directive is not usable for aliasing purposes.
We need to use an extra-label-at-definition strategy to generate symbol
aliasing on AIX.

Reviewed By: DiggerLin, Xiangling_L

Differential Revision: https://reviews.llvm.org/D83252
2020-07-22 14:03:55 +00:00
Sebastian Neubauer 2a6c871596 [InstCombine] Move target-specific inst combining
For a long time, the InstCombine pass handled target specific
intrinsics. Having target specific code in general passes was noted as
an area for improvement for a long time.

D81728 moves most target specific code out of the InstCombine pass.
Applying the target specific combinations in an extra pass would
probably result in inferior optimizations compared to the current
fixed-point iteration; therefore, the InstCombine pass resorts to newly
introduced functions in the TargetTransformInfo when it encounters
unknown intrinsics.
The patch should not have any effect on generated code (under the
assumption that code never uses intrinsics from a foreign target).

This introduces three new functions:
TargetTransformInfo::instCombineIntrinsic
TargetTransformInfo::simplifyDemandedUseBitsIntrinsic
TargetTransformInfo::simplifyDemandedVectorEltsIntrinsic

A few target specific parts are left in the InstCombine folder, where
it makes sense to share code. The largest left-over part in
InstCombineCalls.cpp is the code shared between arm and aarch64.

This allows to move about 3000 lines out from InstCombine to the targets.

Differential Revision: https://reviews.llvm.org/D81728
2020-07-22 15:59:49 +02:00
David Green f8abecf337 [ARM] Extra MVE select(binop) patterns
This is very similar to 243970d03cace2, but handling a slightly
different form of predicated operations. When starting with a pattern of
the form select(p, BinOp(x, y), x), Instcombine will often transform
this to BinOp(x, select(p, y, 0)), where 0 is the identity value of the
binop (0 for adds/subs, 1 for muls, -1 for ands etc). This adds the
patterns that transform those back into predicated binary operations.

There is also a very minor adjustment to tablegen null_frag in here, to
allow it to also be recognized as a PatLeaf node, so that it can be used
in MVE_TwoOpPattern to easily exclude the cases where we do not need the
alternate transform.

Differential Revision: https://reviews.llvm.org/D84091
2020-07-22 14:08:29 +01:00
David Green 3533e0a08d [ARM] Add patterns for select(p, BinOp(x, y), z) -> BinOpT(x, y,p z)
Most MVE instructions can be predicated to fold a select into the
instruction, using the predicate and the select's else value as a passthrough.
This adds tablegen patterns for most two operand instructions using the
newly added TwoOpPattern from 1030e82598.

Differential Revision: https://reviews.llvm.org/D83222
2020-07-22 13:24:01 +01:00
Chen Zheng 36f9fe2d34 [PowerPC] fixupIsDeadOrKill start and end in different block fixing
In fixupIsDeadOrKill, we assumed StartMI and EndMI could not exist in
different basic blocks, so we added an assertion for that in the
function. This is wrong before RA, as before RA the true definition may
exist in another block, reached through copy-like instructions.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D83365
2020-07-22 06:27:13 -04:00
Sander de Smalen bef56f7fe2 [AArch64][SVE] Correctly allocate scavenging slot in presence of SVE.
This patch addresses two issues:

* Forces the availability of the base-pointer (x19) when the frame has
  both scalable vectors and variable-length arrays. Otherwise it will
  be expensive to access non-SVE locals.

* In presence of SVE stack objects, it will allocate the emergency
  scavenging slot close to the SP, so that they can be accessed from
  the SP or BP if available. If accessed from the frame-pointer, it will
  otherwise need an extra register to access the scavenging slot because
  of mixed scalable/non-scalable addressing modes.

Reviewers: efriedma, ostannard, cameron.mcinally, rengolin, david-arm

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D70174
2020-07-22 10:50:36 +01:00
Simon Wallis 94e4e37d55 [Thumb] set code alignment for 16-bit load from constant pool
Summary:
[Thumb] set code alignment for 16-bit load from constant pool

LLVM miscompiles this code when compiling for a target with v8.2-A FP16 and the Thumb ISA at -O0:

extern void bar(__fp16 P5);

int main() {
  __fp16 P5 = 1.96875;
  bar(P5);
}

The code section containing main has 2 byte alignment.
It needs to have 4 byte alignment,
because the load literal instruction has an offset from the
load address with the low 2 bits zeroed.

I do not include a test case in this check-in.
llc and llvm-mc do not exhibit this bug. They do not set code section alignment
in the same manner as clang.

Reviewers: dnsampaio

Reviewed By: dnsampaio

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84169
2020-07-22 10:12:41 +01:00
Petar Avramovic 44967fc604 AMDGPU: Simplify f16 to i64 custom lowering
The range that f16 can represent fits into i32.
Lower as f16->i32->i64 instead of f16->f32->i64,
since f32->i64 has a long expansion.

Differential Revision: https://reviews.llvm.org/D84166
2020-07-22 10:32:14 +02:00
David Spickett 3a34194606 [ARM] Fix Asm/Disasm of TBB/TBH instructions
Summary:
This fixes Bugzilla #46616 in which it was reported
that "tbb  [pc, r0]" was marked as SoftFail
(aka unpredictable) incorrectly.

Expected behaviour is:
* ARMv8 is required to use sp as rn or rm
  (tbb/tbh only have a Thumb encoding so using Arm mode
  is not an option)
* If rm is the pc then the instruction is always
  unpredictable

Some of this was implemented already and this fixes the
rest. Added tests cover the new and pre-existing handling.

Reviewers: ostannard

Reviewed By: ostannard

Subscribers: kristof.beyls, hiraditya, danielkiss, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D84227
2020-07-22 09:31:56 +01:00
Kai Luo c3f9697f1f [PowerPC] Fix wrong codegen when stack pointer has to realign performing dynalloc
The current PowerPC backend generates a wrong code sequence when the
stack pointer has to be realigned and `-fstack-clash-protection` is
enabled. When probing a dynamic stack allocation, the current
`PREPARE_PROBED_ALLOCA` takes `NegSizeReg` as input and returns
`FinalStackPtr`. `FinalStackPtr = StackPtr + ActualNegSize` is calculated
correctly; however, the code following `PREPARE_PROBED_ALLOCA` still uses
the value of `NegSizeReg`, which does not contain `ActualNegSize` when
`MaxAlign > TargetAlign`, to calculate the loop trip count and the
residual number of bytes.

This patch is part of fix of
https://bugs.llvm.org/show_bug.cgi?id=46759.

Differential Revision: https://reviews.llvm.org/D84152
2020-07-22 06:35:12 +00:00
Kai Luo 8912252252 [PowerPC] Fix wrong codegen when stack pointer has to realign in prologue
The current PowerPC backend generates a wrong code sequence when the
stack pointer has to be realigned and -fstack-clash-protection is
enabled. When probing in the prologue, the backend should generate a
subtraction instruction rather than a `stux` instruction to realign the
stack pointer.

This patch is part of fix of
https://bugs.llvm.org/show_bug.cgi?id=46759.

Differential Revision: https://reviews.llvm.org/D84218
2020-07-22 06:35:12 +00:00
Kang Zhang 9bbf0ecff3 [PowerPC] Fix the implicit operands in PredicateInstruction()
Summary:
In the function `PPCInstrInfo::PredicateInstruction()`, we replace
non-predicated instructions with predicated instructions, but we forgot
to add the new implicit operands that the new predicated instruction
needs. This patch fixes that.

Reviewed By: jsji, efriedma

Differential Revision: https://reviews.llvm.org/D82390
2020-07-22 05:51:03 +00:00
Chen Zheng e8425b27fe [PowerPC] add store (load float*) pattern to isProfitableToHoist
store (load float*) can be optimized to store (load i32*) in the InstCombine pass.

Add store (load float*) to isProfitableToHoist to make sure we don't break
that optimization in the InstCombine pass.
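
A tiny example (mine, not from the patch) of the kind of code involved:

```c
/* The loaded float is only stored, so InstCombine can rewrite the float
 * load/store pair as an i32 load/store; this patch adds that pattern to
 * isProfitableToHoist so the InstCombine optimization is not broken. */
void copy_float(float *dst, const float *src) {
  *dst = *src;  /* store (load float*) */
}
```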

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D82341
2020-07-21 20:55:13 -04:00
Amy Kwan 1eb279d2a8 [PowerPC][Power10] Add Vector Multiply/Mod/Divide Instruction Definitions and MC Tests
This patch adds the td definitions and asm/disasm tests for the following instructions:
- Vector Multiply Low Doubleword: vmulld
- Vector Modulus Word/Doubleword: vmodsw, vmoduw, vmodsd, vmodud
- Vector Divide Word/Doubleword: vdivsw, vdivuw, vdivsd, vdivud
- Vector Multiply High Word/Doubleword: vmulhsw, vmulhsd, vmulhuw, vmulhud
- Vector Divide Extended Word/Doubleword: vdivesw, vdiveuw, vdivesd, vdiveud

Differential Revision: https://reviews.llvm.org/D82929
2020-07-21 18:05:35 -05:00
Amara Emerson 791544422a Revert "[AArch64][GlobalISel] Add post-legalize combine for sext_inreg(trunc(sextload)) -> copy"
This reverts commit 64eb3a4915.

It caused miscompiles with optimizations enabled. Reverting while I investigate.
2020-07-21 16:01:18 -07:00
Amara Emerson f1ae96d9bf [AArch64][GlobalISel] Fix TLS accesses clobbering registers incorrectly.
This was happening because the BLR didn't have a use of the X0 arg register,
which would end up being re-used in high reg pressure situations.
The change also avoids hard coding the use of X0 for the sequence except to
copy the value for the call. ld64 should still be able to optimize it.

rdar://65438258
2020-07-21 16:01:17 -07:00
Matt Arsenault b258920095 AMDGPU/GlobalISel: Fix not erasing inst when lowering G_FRINT 2020-07-21 18:29:41 -04:00
Matt Arsenault 7cd8a0256d GlobalISel: Legalize G_FPOWI 2020-07-21 18:13:04 -04:00
Matt Arsenault 1168119c2f AMDGPU: Start interpreting byref on kernel arguments
These are treated identically to value aggregates placed in the kernel
argument list. A %struct.foo or %struct.foo addrspace(4)*
byref(sizeof(%struct.foo)) align(alignof(%struct.foo)) argument should
produce the same offsets and argument metadata.

This handles all 3 kernel ABI implementations, and the two HSA
metadata emission paths.
2020-07-21 18:11:22 -04:00
Simon Pilgrim 5b5dc2442a [X86][AVX] getTargetShuffleMask - don't decode VBROADCAST(EXTRACT_SUBVECTOR(X,0)) patterns.
getTargetShuffleMask is used by the various "SimplifyDemanded" folds, so we can't assume that the bypassed extract_subvector can be safely simplified. getFauxShuffleMask performs a more general decode that allows us to catch many of these cases more safely, so the impact is minimal.
2020-07-21 21:55:44 +01:00
Matt Arsenault 2fe0ea8261 DAG: Handle expanding strict_fsub into fneg and strict_fadd
The AMDGPU handling of f16 vectors is still terrible, since they get
scalarized even when the vector operation is legal.

The code is essentially duplicated between the non-strict and
strict cases. Apparently no other expansions are currently trying to do
this. This is mostly because I found the behavior of
getStrictFPOperationAction to be confusing. In the ARM case, it would
expand strict_fsub even though it shouldn't due to the later check. At
that point, the logic required to check for legality was more complex
than just duplicating the 2 instruction expansion.
2020-07-21 16:17:10 -04:00
diggerlin 11546898e2 [AIX][XCOFF]emit extern linkage for the llvm intrinsic symbol
SUMMARY:

When we call memset, memcpy, memmove, etc. (these are LLVM intrinsic functions) in C source code, LLVM will generate IR
like call void @llvm.memset.p0i8.i32(i8* align 4 bitcast (%struct.S* @s to i8*), i8 %1, i32 %2, i1 false).
For the C source code
bash> cat test_memset.c

struct S{
 int a;
 int b;
};
extern struct  S s;
void bar() {
  memset(&s, s.b, s.b);
}

the generated IR is

%struct.S = type { i32, i32 }
@s = external global %struct.S, align 4
; Function Attrs: noinline nounwind optnone
define void @bar() #0 {
entry:
  %0 = load i32, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32 0, i32 1), align 4
  %1 = trunc i32 %0 to i8
  %2 = load i32, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32 0, i32 1), align 4
  call void @llvm.memset.p0i8.i32(i8* align 4 bitcast (%struct.S* @s to i8*), i8 %1, i32 %2, i1 false)
  ret void
}
declare void @llvm.memset.p0i8.i32(i8* nocapture writeonly, i8, i32, i1 immarg) #1

If we want the AIX assembler to accept this assembly without -u,
it needs to contain the following line:
.extern .memset
(We do not currently output extern linkage for LLVM intrinsic functions.
Even if we did output extern linkage for an LLVM intrinsic function, we should not emit .extern llvm.memset.p0i8.i32;
instead we should emit .extern memset.)

There are also compiler built-in functions such as __floatdidf. Even if we do not call __floatdidf in the C source code (and the generated IR does not call __floatdidf either), the call
can be introduced during LLVM optimization.
The function is then not in the Module's function list, but we still need to emit .extern .__floatdidf

The solution is as follows:
We record all the LLVM intrinsic extern symbols in transformCallee(), and emit all these symbols in AsmPrinter::doFinalization(Module &M)

Reviewers:  jasonliu, Sean Fertile, hubert.reinterpretcast,

Differential Revision: https://reviews.llvm.org/D78929
2020-07-21 16:03:04 -04:00
David Green 1030e82598 [ARM] Add MVE_TwoOpPattern. NFC
This commons out a chunk of the different two-operand MVE patterns into
a single helper multidef. Or technically two multidef patterns, so that
the Dup qr patterns can also get the same treatment. This covers most of
the two-address instructions that we have some codegen pattern for (not
ones that we select purely from intrinsics). It does not include shifts,
which are more spread out and will need some extra work to be given the
same treatment.

Differential Revision: https://reviews.llvm.org/D83219
2020-07-21 19:51:37 +01:00
Sander de Smalen 9bacf15885 [AArch64][SVE] Fix PCS for functions taking/returning scalable types.
The default calling convention needs to save/restore the SVE callee
saves according to the SVE PCS when the function takes or returns
scalable types, even when the `aarch64_sve_vector_pcs` CC is not
specified for the function.

Reviewers: efriedma, paulwalker-arm, david-arm, rengolin

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D84041
2020-07-21 15:55:39 +01:00
Petre-Ionut Tudor 1af9fc8213 [ARM] Generate [SU]HADD from ((a + b) >> 1)
Summary:
Teach LLVM to recognize the above pattern, where the operands are
either signed or unsigned types.
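
A sketch (my own, not from the patch) of the kind of loop this applies to:

```c
/* (a[i] + b[i]) >> 1, with the operands widened by the usual C promotion;
 * with this patch the backend is expected to use a halving add ([SU]HADD)
 * instead of widening, adding and shifting. The unsigned variant is the
 * same pattern with unsigned element types. */
void halving_add(short *restrict r, const short *restrict a,
                 const short *restrict b, int n) {
  for (int i = 0; i < n; ++i)
    r[i] = (short)((a[i] + b[i]) >> 1);
}
```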

Subscribers: kristof.beyls, hiraditya, danielkiss, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83777
2020-07-21 13:22:07 +01:00
David Green 30371df85f [ARM] More unpredictable VCVT instructions.
These extra vcvt instructions were missed from 74ca67c109 because they
live in a different Domain, but should be treated in the same way.

Differential Revision: https://reviews.llvm.org/D83204
2020-07-21 07:24:37 +01:00
Matt Arsenault 107c954c13 AMDGPU/GlobalISel: Remove unnecessary parameter 2020-07-20 20:53:01 -04:00
Artem Belevich bf66003a4f [MC,NVPTX] Add MCAsmPrinter support for unsigned-only data directives.
PTX does not support negative values in .bNN data directives, so we must
cast such values to unsigned before printing them.

MCAsmInfo can now specify whether such casting is necessary for a
particular target.
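
A minimal sketch of the idea (not the actual MCAsmPrinter code; the function name here is made up):

```c
#include <inttypes.h>
#include <stdio.h>

/* PTX rejects negative literals in .b32 directives, so print the value's
 * unsigned bit pattern instead; e.g. -1 is emitted as 4294967295. */
static void emit_b32(FILE *os, int32_t value) {
  fprintf(os, ".b32 %" PRIu32 "\n", (uint32_t)value);
}
```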

Differential Revision: https://reviews.llvm.org/D83423
2020-07-20 16:24:41 -07:00
Eli Friedman b8f765a1e1 [AArch64][SVE] Add support for trunc to <vscale x N x i1>.
This isn't a natively supported operation, so convert it to a
mask+compare.

In addition to the operation itself, fix up some surrounding stuff to
make the testcase work: we need concat_vectors on i1 vectors, we need
legalization of i1 vector truncates, and we need to fix up all the
relevant uses of getVectorNumElements().

Differential Revision: https://reviews.llvm.org/D83811
2020-07-20 13:11:02 -07:00
Yuanfang Chen ca1e69a675 [NFC] remove unused includes of SelectionDAGISel.h 2020-07-20 10:43:29 -07:00
Yuanfang Chen 589c646a7e [llc] (almost) remove `--print-machineinstrs`
Its effect can be achieved with
`-stop-after`, `-print-after`, or `-print-after-all`. But a few tests need to
print MIR after ISel, which cannot be done with
`-print-after`/`-stop-after` since the isel pass does not have a command-line name.
That's the reason `--print-machineinstrs` is downgraded to
`--print-after-isel` in this patch. `--print-after-isel` can be
removed after we switch to the new pass manager, since the isel pass will
then have a command-line name usable with `print-after` or equivalent switches.

The motivation of this patch is to reduce the tests' dependency on a
would-be-deprecated feature.

Reviewed By: arsenm, dsanders

Differential Revision: https://reviews.llvm.org/D83275
2020-07-20 10:43:28 -07:00
Matt Arsenault ce76d15a70 AMDGPU: Use MCRegister for preloaded arguments
Attempt to fix build error with ancient GCC
2020-07-20 13:34:28 -04:00
Matt Arsenault 21ef01b7e3 AMDGPU: Remove outdated fixme 2020-07-20 11:41:41 -04:00
Matt Arsenault 84704d989b AMDGPU: Fix not accounting for constantexpr uses of LDS globals
This was failing to add the size of LDS globals that weren't directly
used by an instruction. They could be used by constant expressions
which are transitively used by the function. This requires a better
search, but just abort on this for now for correctness.
2020-07-20 11:41:41 -04:00
Matt Arsenault 61f1f2a204 AMDGPU/GlobalISel: Initial Implementation of calls
Return values, and tail calls are not yet handled.
2020-07-20 11:13:22 -04:00
Matt Arsenault ad8e900cb3 Verifier: Disallow byval and similar for AMDGPU calling conventions
These imply stack-like semantics, which doesn't make any sense for
entry points.
2020-07-20 10:58:57 -04:00
Simon Pilgrim 017e5c949b MCFixup.h - remove unnecessary MCExpr.h include. NFCI.
Move the include down to files that actually depend on MCExpr definitions.

Also exposes an implicit dependency on MCContext in AVRAsmBackend.h
2020-07-20 15:17:19 +01:00
Petar Avramovic 6a1030aa0e AMDGPU/GlobalISel: Legalize s16->s64 G_FPEXT
Legalize using narrowScalar as s16->s32 G_FPEXT
followed by s32->s64 G_FPEXT.

Differential Revision: https://reviews.llvm.org/D84030
2020-07-20 16:12:19 +02:00
Matt Arsenault 100564bdf8 AMDGPU/GlobalISel: Remove outdated comment 2020-07-20 10:06:18 -04:00
Matt Arsenault 93311a9812 AMDGPU/GlobalISel: Fix custom lowering of llvm.trunc.f64 for SI
This was missing an operand from BFE and not erasing the original
instruction.
2020-07-20 10:06:18 -04:00
Paul Walker 6384ec4099 [SVE] Add lowering for fixed length vector fdiv, fma, fmul and fsub operations.
Differential Revision: https://reviews.llvm.org/D84034
2020-07-20 11:57:34 +00:00
Tim Northover 88464a55b4 AArch64: emit @llvm.debugtrap as `brk #0xf000` on all platforms
It's useful for a debugger to be able to distinguish an @llvm.debugtrap
from a (noreturn) @llvm.trap, so this extends the existing Windows
behaviour to other platforms.
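
For context, a small hand-written example (not part of the commit) of how @llvm.debugtrap is typically reached from C:

```c
/* clang lowers __builtin_debugtrap() to @llvm.debugtrap, which with this
 * change is emitted as `brk #0xf000` on all AArch64 platforms, so a debugger
 * can tell it apart from the (noreturn) trap used for __builtin_trap(). */
void pause_for_debugger(void) {
  __builtin_debugtrap();
}
```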
2020-07-20 10:31:26 +01:00
Petar Avramovic ba938f6388 AMDGPU/GlobalISel: Legalize s16->s64 G_FPTOSI/G_FPTOUI
Add narrowScalarFor action.
Add narrow scalar for typeIndex == 0 for G_FPTOSI/G_FPTOUI.
Legalize using narrowScalarFor as s16->s32 G_FPTOSI/G_FPTOUI
followed by s32->s64 G_SEXT/G_ZEXT.

Differential Revision: https://reviews.llvm.org/D84010
2020-07-20 11:06:11 +02:00
Sanjay Patel 50afa18772 [x86] split FMA with fast-math-flags to avoid libcall
fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware)

C/C++ code may use explicit fma() calls (which become LLVM fma
intrinsics in IR) but then gets compiled with -ffast-math or similar.
For targets that do not have FMA hardware, we don't want to go out to
the math library for a precise but slow FMA result.
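
For example (a hand-written reproducer, not taken from the patch; the exact flags are assumptions):

```c
#include <math.h>

/* Built with fast-math (e.g. -O2 -ffast-math) for an x86 target without FMA
 * hardware, the fma() call becomes a reassociable @llvm.fma in IR and, with
 * this patch, is split into fmul + fadd instead of calling libm's fma(). */
double mac(double a, double b, double c) {
  return fma(a, b, c);
}
```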

I tried this as a generic DAGCombine, but it caused infinite looping
on more than 1 other target, so there's likely some over-reaching fma
formation happening.

There's also a potential intersection of strict FP with fast-math here.
Deferring to current behavior for that case (assuming that strict-ness
overrides fast-ness).

Differential Revision: https://reviews.llvm.org/D83981
2020-07-19 10:03:55 -04:00
David Green 3504acc33e [ARM] Don't mark vctp as having sideeffects
As far as I can tell, it should not be necessary for VCTP to be
unpredictable in tail predicated loops. Either it has a valid loop
counter as an operand, which will naturally keep it in the right loop, or
it doesn't and it won't be converted to a tail predicated loop. Not
marking it as having side effects allows it to be scheduled more cleanly
for cases where it is not expected to become a tail predicated loop.

Differential Revision: https://reviews.llvm.org/D83907
2020-07-19 09:28:09 +01:00
Kang Zhang d37befdfe5 [PowerPC] Remove the redundant implicit operands in ppc-early-ret pass
Summary:
In the `ppc-early-ret` pass, we use `BuildMI` and `copyImplicitOps` when a branch instruction can do the early return. But the two functions add the implicit operands twice, which is not correct.

This patch removes the redundant implicit operands in the `ppc-early-ret` pass.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D76042
2020-07-19 07:01:45 +00:00
Dmitry Preobrazhensky 2e87acac9b [AMDGPU] Removed s_mov_regrd and mov_fed opcodes
These opcodes are not intended for public use.

Reviewers: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D81659
2020-07-17 19:52:54 +03:00
Yonghong Song 0e347c0ff0 BPF: generate .rodata BTF datasec for certain initialized local var's
Currently, BTF datasec type for .rodata is generated only if there are
user-defined readonly global variables which have debuginfo generated.

Certain readonly global variables may be generated from initialized
local variables. For example,
  void foo(const void *);
  int test() {
    const struct {
      unsigned a[4];
      char b;
    } val = { .a = {2, 3, 4, 5}, .b = 6 };
    foo(&val);
    return 0;
  }

Clang will create a private linkage const global to store
the initialized value:
  @__const.test.val = private unnamed_addr constant %struct.anon
      { [4 x i32] [i32 2, i32 3, i32 4, i32 5], i8 6 }, align 4

This global variable eventually is put in .rodata ELF section.

If there is a .rodata ELF section, libbpf expects a BTF .rodata
datasec as well, even if it is empty, meaning there are no
global readonly variables with proper debuginfo. Martin reported
a bug where, without this empty BTF .rodata datasec, bpftool
gen exits with an error.

This patch fixes the issue by generating a .rodata BTF datasec
if there exists local-variable initial data that will result in
a .rodata ELF section.

Differential Revision: https://reviews.llvm.org/D84002
2020-07-17 09:45:57 -07:00
Matt Arsenault 994fb86bc2 AMDGPU: Fix promoting f16 fpowi with legal f16 2020-07-17 11:29:05 -04:00
Jay Foad 760af7a074 [AMDGPU] Avoid splitting FLAT offsets in unsafe ways
As explained in the comment:

// For a FLAT instruction the hardware decides whether to access
// global/scratch/shared memory based on the high bits of vaddr,
// ignoring the offset field, so we have to ensure that when we add
// remainder to vaddr it still points into the same underlying object.
// The easiest way to do that is to make sure that we split the offset
// into two pieces that are both >= 0 or both <= 0.

In particular FLAT (as opposed to SCRATCH and GLOBAL) instructions have
an unsigned immediate offset field, so we can't use it to help split a
negative offset.
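
A rough sketch of the splitting constraint (my own illustration; the real code deals with the target's actual immediate range and register forms):

```c
/* Split `offset` into an immediate part plus a remainder added to vaddr, so
 * that both pieces have the same sign (or are zero). Because FLAT's immediate
 * field is unsigned, a negative offset gets imm = 0 and the whole offset is
 * folded into vaddr, keeping vaddr inside the original object. */
void split_flat_offset(long long offset, long long max_imm,
                       long long *imm, long long *rem) {
  if (offset >= 0)
    *imm = offset < max_imm ? offset : max_imm;
  else
    *imm = 0;
  *rem = offset - *imm;
}
```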

Differential Revision: https://reviews.llvm.org/D83394
2020-07-17 11:44:10 +01:00
Simon Wallis 3e0ccf9a90 [ARM] halfword store hits llvm_unreachable with big-endian
Summary:
[ARM] halfword store hits llvm_unreachable with big-endian

Provide missing case in getFixupKindContainerSizeBytes().

This stops execution from reaching llvm_unreachable("Unknown fixup kind!").

D83947

Reviewers: olista01, ostannard

Reviewed By: ostannard

Subscribers: ostannard, kristof.beyls, hiraditya, danielkiss, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83947

Change-Id: I598aa1fb51fd1c6f424c557c85d6df6d1958bc62
2020-07-17 08:56:44 +01:00
hsmahesha 4905536086 Revert "[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size"
This reverts commit cc9d693856.
2020-07-17 12:20:37 +05:30
Craig Topper 6bba95831e [X86] Change the scheduler model for 'pentium4' to SandyBridgeModel.
I meant to do this in D83913, but missed it while updating the
feature list.

Interestingly I think this is disabling the postRA scheduler. But
it does match our default 64-bit behavior.

Reviewed By: echristo

Differential Revision: https://reviews.llvm.org/D83996
2020-07-16 22:04:29 -07:00
Craig Topper addbf732c8 [X86] Reorder how the subtarget map key is created.
We used a SmallString<512> and attempted to reserve enough space
for CPU plus Features, but that doesn't account for all the things
that get added to the string.

Reorder the string so the shortest things go first, which shouldn't
exceed the small size. Finally, add the feature string at the end,
which might be long. This should ensure at most one heap allocation
without needing to use reserve.

I don't know if this matters much in practice, but I was looking
into something else that will require more code here and noticed
the odd reserve call.
2020-07-16 21:41:45 -07:00
Carl Ritson 3a18665748 [AMDGPU] Translate s_and/s_andn2 to s_mov in vcc optimisation
When SCC is dead but VCC is required, replace s_and / s_andn2
with an s_mov into VCC when the mask value is 0 or -1.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D83850
2020-07-17 11:48:57 +09:00
Albion Fung c273563552 [PowerPC][Power10] Add 128-bit Binary Integer Operation instruction definitions and MC Tests
This patch adds the instruction definitions and MC tests for the 128-bit Binary
Integer Operation instructions introduced in Power10.

Differential Revision: https://reviews.llvm.org/D83516
2020-07-16 17:16:43 -05:00
Wouter van Oortmerssen cc1b9b680f [WebAssembly] 64-bit (function) pointer fixes.
Accounting for the fact that Wasm function indices are 32-bit, but in wasm64 we want uniform 64-bit pointers.
Includes reloc types for 64-bit table indices.

Differential Revision: https://reviews.llvm.org/D83729
2020-07-16 14:10:22 -07:00
Craig Topper 5408024fa8 [X86] Move integer hadd/hsub formation into a helper function shared by combineAdd and combineSub.
There was a lot of duplicate code here for checking the VT and
subtarget. Moving it into a helper avoids that.

It also fixes a bug where combineAdd reused Op0/Op1 after a call
to isHorizontalBinOp may have changed them. The new helper function
has its own local versions of Op0/Op1 that aren't shared by other
code.

Fixes PR46455.

Reviewed By: spatel, bkramer

Differential Revision: https://reviews.llvm.org/D83971
2020-07-16 13:27:27 -07:00
Craig Topper ad171d24b9 [X86] Change the tuning settings for pentium4 to be more modern since its the default 32-bit cpu in clang
Alternative to D83897. I believe the big change here is that I removed slow unaligned memory 16.

The downside is that it may adversely affect tuning if someone explicitly targets -march=pentium4 and expects pentium4-tuned code. Of course, pentium4 is so old that our default behavior with the previous settings may not have been the best either.

Reviewed By: echristo, RKSimon

Differential Revision: https://reviews.llvm.org/D83913
2020-07-16 12:51:25 -07:00
Zakk Chen 294d1eae75 [RISCV] Add support for -mcpu option.
Summary:
1. gcc uses the `-march` and `-mtune` flags to choose the arch and
pipeline model, but clang does not have an `-mtune` flag, so
we use `-mcpu` to choose both.
2. Add the SiFive e31 and u54 CPUs, which have a default march
and pipeline model.
3. Specifying `-mcpu` with rocket-rv[32|64] selects the
pipeline model only, and uses the driver's arch-choosing
logic to get the default arch.

Reviewers: lenary, asb, evandro, HsiangKai

Reviewed By: lenary, asb, evandro

Tags: #llvm, #clang

Differential Revision: https://reviews.llvm.org/D71124
2020-07-16 11:46:22 -07:00
Thomas Lively ecb2e5bcd7 [WebAssembly] Implement v128.select
Although the SIMD spec proposal does not specifically include a
select instruction, the select instruction in MVP WebAssembly is
polymorphic over the selected types, so it is able to work on v128
values when they are enabled. This patch introduces a new variant of
the select instruction for each legal vector type. Additional ISel
patterns are adapted from the SELECT_I32 and SELECT_I64 patterns.

Depends on D83736.

Differential Revision: https://reviews.llvm.org/D83737
2020-07-16 11:37:25 -07:00