20958 Commits

Author SHA1 Message Date
Hiroshi Yamauchi 0ab6a15698 [X86] Add support for using fast short rep mov for memcpy lowering.
Disabled by default behind an option.

Differential Revision: https://reviews.llvm.org/D86883
2020-09-09 12:46:40 -07:00
Simon Pilgrim 6e45b98934 X86CallFrameOptimization.cpp - use const references where possible. NFCI. 2020-09-09 16:35:08 +01:00
Simon Pilgrim e706116e11 X86FrameLowering::adjustStackWithPops - cleanup auto usage. NFCI.
Don't use auto for non-obvious types, and use const references.
2020-09-09 16:15:02 +01:00
Craig Topper b1e68f885b [SelectionDAGBuilder] Pass fast math flags to getNode calls rather than trying to set them after the fact.
This removes the after the fact FMF handling from D46854 in favor of passing fast math flags to getNode. This should be a superset of D87130.

This required adding a SDNodeFlags to SelectionDAG::getSetCC.

Now we manage to constant fold some undefs during the
initial getNode that we don't do in later DAG combines.

Differential Revision: https://reviews.llvm.org/D87200
2020-09-08 15:27:21 -07:00
Simon Pilgrim fcff2c32c0 X86CallLowering.cpp - improve auto const/pointer/reference qualifiers. NFCI.
Fix clang-tidy warnings by ensuring auto variables are more cleanly qualified, or just avoid auto entirely.
2020-09-08 13:01:23 +01:00
Simon Pilgrim 0729ae367a X86DomainReassignment.cpp - improve auto const/pointer/reference qualifiers. NFCI.
Fix clang-tidy warnings by ensuring auto variables are more cleanly qualified, or just avoid auto entirely.
2020-09-08 13:01:23 +01:00
Craig Topper da79b1eecc [SelectionDAG][X86][ARM] Teach ExpandIntRes_ABS to use sra+add+xor expansion when ADDCARRY is supported.
Rather than using SELECT instructions, use SRA, UADDO/ADDCARRY and
XORs to expand ABS. This is the multi-part version of the sequence
we use in LegalizeDAG.

It's also the same as the Custom sequence uses for i64 on 32-bit
and i128 on 64-bit. So we can remove the X86 customization.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D87215
2020-09-07 13:15:26 -07:00
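The sra+add+xor sequence named in this commit can be sketched in plain C++ for a 128-bit value held in two 64-bit halves. This only illustrates the arithmetic, it is not the legalizer code: the `U128` struct and the `abs128` function are hypothetical names, and the carry handling stands in for what UADDO/ADDCARRY would provide.

```cpp
#include <cstdint>

// Hypothetical 128-bit integer split into two 64-bit parts, standing in
// for a type that has to be expanded into legal halves.
struct U128 {
  std::uint64_t Lo;
  std::uint64_t Hi;
};

// abs(x) == (x + sign) ^ sign, where sign is x shifted arithmetically
// right by (bitwidth - 1): all ones when x is negative, zero otherwise.
constexpr U128 abs128(U128 X) {
  // SRA on the high part replicates the sign bit into a full 64-bit mask.
  // (Assumes arithmetic right shift of negative values, which C++20
  // guarantees and mainstream compilers implement before that.)
  std::uint64_t Sign =
      static_cast<std::uint64_t>(static_cast<std::int64_t>(X.Hi) >> 63);
  // Low half: plain add, remembering the carry-out (the UADDO step).
  std::uint64_t Lo = X.Lo + Sign;
  std::uint64_t Carry = Lo < X.Lo ? 1 : 0;
  // High half: add with the carry-in (the ADDCARRY step).
  std::uint64_t Hi = X.Hi + Sign + Carry;
  // XOR both halves with the sign mask to finish the negation.
  return U128{Lo ^ Sign, Hi ^ Sign};
}

// -5 as a 128-bit two's complement value becomes 5.
static_assert(abs128(U128{UINT64_MAX - 4, UINT64_MAX}).Lo == 5, "");
static_assert(abs128(U128{UINT64_MAX - 4, UINT64_MAX}).Hi == 0, "");
```

No select is needed: a non-negative input has a zero sign mask, so the add and both XORs leave it unchanged.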
Craig Topper 01b3e16757 [X86] Use the same sequence for i128 ISD::ABS on 64-bit targets as we use for i64 on 32-bit targets.
Differential Revision: https://reviews.llvm.org/D87214
2020-09-07 11:14:05 -07:00
Simon Pilgrim 9de0a3da6a [X86][SSE] Don't use LowerVSETCCWithSUBUS for unsigned compare with +ve operands (PR47448)
We already simplify the unsigned comparisons if we've found the operands are non-negative, but we were still calling LowerVSETCCWithSUBUS which resulted in the PR47448 regressions.
2020-09-07 16:11:40 +01:00
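The simplification this commit relies on is that an unsigned comparison gives the same answer as a signed one when both operands have a clear sign bit. A minimal scalar sketch follows; `unsignedGT` and `signedGT` are hypothetical helper names, not LLVM functions.

```cpp
#include <cstdint>

// Compare the bit patterns as unsigned 32-bit values.
constexpr bool unsignedGT(std::int32_t A, std::int32_t B) {
  return static_cast<std::uint32_t>(A) > static_cast<std::uint32_t>(B);
}

// Compare the same bit patterns as signed 32-bit values.
constexpr bool signedGT(std::int32_t A, std::int32_t B) { return A > B; }

// With both sign bits clear the two orderings coincide.
static_assert(unsignedGT(7, 3) == signedGT(7, 3), "");
static_assert(unsignedGT(0, 0x7fffffff) == signedGT(0, 0x7fffffff), "");
// A negative operand breaks the equivalence: 0xFFFFFFFF is large as an
// unsigned value but is -1 as a signed value.
static_assert(unsignedGT(-1, 3) != signedGT(-1, 3), "");
```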
Simon Pilgrim 5bb27e735d X86AvoidStoreForwardingBlocks.cpp - use unsigned for Opcode values. NFCI.
Fixes clang-tidy cppcoreguidelines-narrowing-conversions warnings.
2020-09-07 12:56:27 +01:00
Simon Pilgrim 9b645ebfff [X86][AVX] Use lowerShuffleWithPERMV in shuffle combining to support non-VLX targets
lowerShuffleWithPERMV allows us to use the ZMM variants for 128/256-bit variable shuffles on non-VLX AVX512 targets.

This is another step towards shuffle combining between vector widths - we still end up with an annoying regression (combine_vpermilvar_vperm2f128_zero_8f32) but we're going in the right direction...
2020-09-07 12:50:50 +01:00
Benjamin Kramer 7ba0f81934 [X86] Unbreak the build after 22fa6b20d9 2020-09-07 12:24:30 +02:00
Simon Pilgrim 71dfdbe2c7 [X86] getFauxShuffleMask - handle insert_subvector(zero, sub, C)
Directly use SM_SentinelZero elements if we're (widening)inserting into a zero vector.
2020-09-07 11:10:40 +01:00
Simon Pilgrim 9ad261540d [X86] Use Register instead of unsigned. NFCI.
Fixes llvm-prefer-register-over-unsigned clang-tidy warnings.
2020-09-07 10:49:29 +01:00
Simon Pilgrim 22fa6b20d9 [X86] Use Register instead of unsigned. NFCI.
Fixes llvm-prefer-register-over-unsigned clang-tidy warnings.
2020-09-07 10:38:09 +01:00
Simon Pilgrim 0dbe2504af [X86] Use Register instead of unsigned. NFCI.
Fixes llvm-prefer-register-over-unsigned clang-tidy warning.
2020-09-07 10:38:08 +01:00
Simon Pilgrim ecac5c2808 [X86][AVX] lowerShuffleWithPERMV - adjust binary shuffle masks to account for widening on non-VLX targets
rGabd33bf5eff2 enabled us to pad 128/256-bit shuffles to 512-bit on non-VLX targets, but wasn't updating binary shuffles to account for the new vector width.
2020-09-06 14:52:25 +01:00
Craig Topper 35b35a373d [X86] Prevent shuffle combining from creating an identical X86ISD::SHUF128.
This can cause an infinite loop if SimplifiedDemandedElts asks
for the node to replace itself.

A similar protection exists in other places in shuffle combining.

Fixes ISPC https://github.com/ispc/ispc/issues/1864
2020-09-04 14:12:49 -07:00
Simon Pilgrim 740625fecd [X86] Make lowerShuffleAsLanePermuteAndPermute use sublanes on AVX2
Extends lowerShuffleAsLanePermuteAndPermute to search for opportunities to use vpermq (64-bit cross-lane shuffle) and vpermd (32-bit cross-lane shuffle) to get elements into the correct lane, in addition to the 128-bit full-lane permutes it previously searched for.

This is especially helpful in cross-lane byte shuffles, where the alternative tends to be "vpshufb both lanes separately and blend them with a vpblendvb", which is very expensive, especially on Haswell where vpblendvb uses the same execution port as all the shuffles.

Addresses PR47262

Patch By: @TellowKrinkle (TellowKrinkle)

Differential Revision: https://reviews.llvm.org/D86429
2020-09-04 11:41:26 +01:00
Craig Topper 0851350557 [X86] Update stale comment. NFC
The optimization in ExpandIntOp_UINT_TO_FP was removed in D72728
in January 2020.
2020-09-03 16:19:10 -07:00
Simon Pilgrim 58afaecdc2 X86/X86TargetObjectFile.cpp - remove unused headers. NFCI. 2020-09-03 15:17:44 +01:00
Simon Pilgrim 0563cd6739 Fix spelling mistake. NFC. 2020-09-03 15:17:44 +01:00
Simon Pilgrim 890707aa01 [X86] Avoid llvm-qualified-auto warning by not using auto. NFC.
Try to consistently use the actual type name in the file.
2020-09-03 14:21:17 +01:00
Simon Pilgrim 23d9f4b958 [X86] Fix llvm-qualified-auto warning by using auto*. NFC. 2020-09-03 14:21:17 +01:00
Simon Pilgrim 5b29269744 [X86] Fix llvm-qualified-auto warning by using const auto*. NFC. 2020-09-03 14:21:17 +01:00
Simon Pilgrim e56edb801b [X86][SSE] Fold select(X > -1, A, B) -> select(0 > X, B, A) (PR47404)
Help PBLENDVB peek through to the sign bit source of the selection mask by swapping the select condition and inputs.
2020-09-03 13:02:08 +01:00
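The fold in this commit swaps the select operands so that the condition becomes a pure sign-bit test. A scalar sketch of the equivalence, using the hypothetical names `selectOriginal` and `selectFolded`:

```cpp
#include <cstdint>

// Original form: pick A when X is non-negative.
constexpr std::int32_t selectOriginal(std::int32_t X, std::int32_t A,
                                      std::int32_t B) {
  return X > -1 ? A : B;
}

// Folded form: the condition is now "X is negative", which depends only
// on the sign bit of X, and the two select inputs are swapped.
constexpr std::int32_t selectFolded(std::int32_t X, std::int32_t A,
                                    std::int32_t B) {
  return 0 > X ? B : A;
}

static_assert(selectOriginal(-5, 1, 2) == selectFolded(-5, 1, 2), "");
static_assert(selectOriginal(0, 1, 2) == selectFolded(0, 1, 2), "");
static_assert(selectOriginal(7, 1, 2) == selectFolded(7, 1, 2), "");
```

Because the folded condition is just the sign bit of X, a blend that keys off the sign bit of its mask can in principle consume X directly, which is the benefit the commit message describes for PBLENDVB.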
Simon Pilgrim 888049b97a [X86][SSE] Fold vselect(pshufb,pshufb) -> or(pshufb,pshufb)
If the PSHUFBs have no other uses, then we can force the unselected elements to zero to OR them instead, avoiding both an extra mask load and a costly variable blend.

Eventually we should try to bring this into shuffle combining, once we can more easily convert between shuffles + select patterns.
2020-09-02 16:55:00 +01:00
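The idea behind this fold can be modelled with a scalar emulation of PSHUFB. The sketch below assumes the select condition and both shuffle masks are known constants, so the masks can be rewritten ahead of time; `pshufb`, `selectOfShuffles`, `orOfShuffles` and `foldMatchesSelectOnSample` are hypothetical helpers written for this illustration, not LLVM code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;
using Cond = std::array<bool, 16>;

// Scalar model of PSHUFB: a mask byte with bit 7 set produces zero,
// otherwise its low 4 bits index into Src.
Vec pshufb(const Vec &Src, const Vec &Mask) {
  Vec R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = (Mask[I] & 0x80) ? std::uint8_t{0} : Src[Mask[I] & 0x0F];
  return R;
}

// Reference form: per-byte select between two shuffle results.
Vec selectOfShuffles(const Cond &PickA, const Vec &A, const Vec &MaskA,
                     const Vec &B, const Vec &MaskB) {
  Vec SA = pshufb(A, MaskA), SB = pshufb(B, MaskB);
  Vec R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = PickA[I] ? SA[I] : SB[I];
  return R;
}

// Folded form: force the unselected lane of each shuffle to zero by
// setting bit 7 in its mask, then OR the two results together.
Vec orOfShuffles(const Cond &PickA, const Vec &A, Vec MaskA, const Vec &B,
                 Vec MaskB) {
  for (std::size_t I = 0; I != MaskA.size(); ++I) {
    if (PickA[I])
      MaskB[I] = 0x80;
    else
      MaskA[I] = 0x80;
  }
  Vec SA = pshufb(A, MaskA), SB = pshufb(B, MaskB);
  Vec R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = static_cast<std::uint8_t>(SA[I] | SB[I]);
  return R;
}

// Compare both forms on one made-up input.
bool foldMatchesSelectOnSample() {
  Vec A{}, B{}, MaskA{}, MaskB{};
  Cond PickA{};
  for (std::size_t I = 0; I != 16; ++I) {
    A[I] = static_cast<std::uint8_t>(0x10 + I);
    B[I] = static_cast<std::uint8_t>(0xA0 + I);
    MaskA[I] = static_cast<std::uint8_t>(15 - I);       // reverse A
    MaskB[I] = static_cast<std::uint8_t>((I * 3) % 16); // scatter B
    PickA[I] = (I % 2) == 0;
  }
  return selectOfShuffles(PickA, A, MaskA, B, MaskB) ==
         orOfShuffles(PickA, A, MaskA, B, MaskB);
}
```

In this model the select condition is absorbed into the two shuffle masks, so no separate blend mask is needed.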
Martin Storsjö 4820af2bfc [X86] Remove superfluous trailing semicolons, fixing warnings. NFC. 2020-09-02 11:43:27 +03:00
Simon Pilgrim 21d02dc595 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support
This patch uses partial DemandedElts masks to further simplify target shuffle chains and finally starts making target shuffle combining part of SimplifyDemandedBits/SimplifyDemandedVectorElts.

We already manage this for Depth == 0 cases, where combineX86ShuffleChain would early-out if the shuffle combined to the same op, but the patch generalizes this by manipulating the depth handling of combineX86ShufflesRecursively - calling with a new Depth = 0 and reducing the maximum shuffle combine depth accordingly.

Differential Revision: https://reviews.llvm.org/D66004
2020-09-02 09:24:46 +01:00
Eric Astor a57fdcdd40 x87 FPU state instructions do not use an f32 memory location
These instructions actually use a 512-byte location, where bytes 464-511 are ignored.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D86942
2020-09-01 13:50:07 -04:00
Sanjay Patel 2d3e12818e [FastISel] update to use intrinsic's isCommutative(); NFC
This requires adding a missing 'const' to the definition because
the callers are using const args, but there should be no change
in behavior.

The intrinsic method was added with D86798 / rG096527214033
2020-08-30 11:36:41 -04:00
Craig Topper aab90384a3 [Attributes] Add a method to check if an Attribute has AttrKind None. Use instead of hasAttribute(Attribute::None)
There's a special case in hasAttribute for None when pImpl is null. If pImpl is not null we dispatch to pImpl->hasAttribute which will always return false for Attribute::None.

So if we just want to check for None its sufficient to just check that pImpl is null. Which can even be done inline.

This patch adds a helper for that case which I hope will speed up our getSubtargetImpl implementations.

Differential Revision: https://reviews.llvm.org/D86744
2020-08-28 13:23:45 -07:00
Craig Topper ba852e1e19 [X86] Don't call hasFnAttribute and getFnAttribute for 'prefer-vector-width' and 'min-legal-vector-width' in getSubtargetImpl
We only need to call getFnAttribute and then check if the Attribute
is None or not.
2020-08-27 10:40:20 -07:00
Matt Arsenault 0b7f6cc71a GlobalISel: Add generic instructions for memory intrinsics
AArch64, X86 and Mips currently consume these directly and custom
lower them to produce a libcall, but really these should follow the
normal legalization process through the libcall/lower action.
2020-08-26 20:08:45 -04:00
Craig Topper 92d3e70df3 [X86] Change pentium4 tuning settings and scheduler model back to their values before D83913.
Clang now defaults to -march=pentium4 -mtune=generic so we don't
need modern tune settings on pentium4.
2020-08-26 15:38:12 -07:00
Craig Topper 09288bcbf5 [X86] Add assembler support for .d32 and .d8 mnemonic suffixes to control displacement size.
This is an older syntax than the {disp32} and {disp8} pseudo
prefixes that were added a few weeks ago. We can reuse most of
the support for that to support .d32 and .d8 as well.
2020-08-26 10:45:50 -07:00
Pierre Gousseau cda6b09242 [X86] Make sure we do not clobber RBX with mwaitx when used as a base
pointer.

mwaitx uses EBX as one of its arguments.
Using this instruction clobbers RBX as it is defined to hold one of the
inputs. When the backend uses a dynamically allocated stack, RBX is used as
a reserved register for the base pointer.

This patch is adapted from @qcolombet patch for cmpxchg at r263325.

This fixes PR43528.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D73475
2020-08-26 11:20:31 +01:00
Craig Topper 1d1515a9e2 [X86] Add an isel pattern for (i8 (trunc (i16 (bitconvert (v16i1 X))))) to avoid an extra EXTRACT_SUBREG
Since we can only copy to GR32 we had to EXTRACT from GR32, but
we would first go to GR16 and then the truncate would extract again
to GR8. This adds a special case to go directly from GR32 to GR8.
This would eventually get cleaned up, but I thought maybe we should
avoid doing it in the first place. Our k-register handling is weird
and we could probably stand to have some more special ISD nodes
for the conversions so the i32 type would be explicit.
2020-08-25 18:20:43 -07:00
Craig Topper b8ec8f5776 [X86] Remove extra getOperand(0) call from recently introduced store(extract_element(vtrunc)) to truncated store combine.
The IsExtractedElement already called getOperand(0) so Extract
here is the source vector. We shouldn't call getOperand(0). This
worked for the original test cases because the result was a
bitcast so the getOperand(0) accidentally peeked through the bitcast
which is what we wanted.

In the failing case here, the operand turns out to be undef so
the getOperand(0) asserts because undef has no operands.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=25184

Differential Revision: https://reviews.llvm.org/D86428
2020-08-25 16:16:54 -07:00
Craig Topper ba319ac47e [X86] Remove a redundant COPY_TO_REGCLASS for VK16 after a KMOVWkr in an isel output pattern.
KMOVWkr produces VK16, there's no reason to copy it to VK16 again.

Test changes are presumably because we were scheduling based on
the COPY that is no longer there.
2020-08-25 15:19:27 -07:00
Freddy Ye e02d081f2b [X86] Support -march=sapphirerapids
Support -march=sapphirerapids for x86.
Compared with Icelake Server, it includes 14 new features. They are
amxtile, amxint8, amxbf16, avx512bf16, avx512vp2intersect, cldemote,
enqcmd, movdir64b, movdiri, ptwrite, serialize, shstk, tsxldtrk, waitpkg.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D86503
2020-08-25 14:21:21 +08:00
Craig Topper f7c87b7e37 [X86] Copy the tuning features and scheduler model from pentium4/x86-64 to generic
This is preparation for making clang default to -mtune=generic when no -march is specified. This will allow the default tuning to be "generic" even though our default march is "pentium4" or "x86-64".

To avoid llc lit test regressions, if no mcpu is specified, I've defaulted tune to use i586 to match the old tuning settings of no CPU. Some tests explicitly used -mcpu=generic which I've removed so they instead get this default of architecture features from generic and tune from i586.

I updated one llvm-mca test to check a different CPU since generic has a scheduler model now

Differential Revision: https://reviews.llvm.org/D86312
2020-08-24 14:47:10 -07:00
Fangrui Song bef684154d [X86][FastISel] Support materializing floating-point constants for large code model & PIC
The following program miscompiles because rL216012 added static
relocation model support but not for PIC.

```
// clang -fpic -mcmodel=large -O0 a.cc
double foo() { return 42.0; }
```

This patch adds PIC support.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D86024
2020-08-23 08:36:18 -07:00
Hiroshi Yamauchi 28ccc52c40 [X86] Add feature for Fast Short REP MOV (FSRM) for Icelake or newer.
Differential Revision: https://reviews.llvm.org/D85989
2020-08-19 13:39:42 -07:00
Simon Pilgrim 057bdd63a4 [X86][AVX] lowerShuffleWithVPMOV - minor refactor to more closely match lowerShuffleAsVTRUNC
Replace isBuildVectorAllZeros check by using the Zeroable bitmask instead.
2020-08-19 14:34:32 +01:00
Simon Pilgrim 9fee2bad6d [X86] lowerShuffleWithVPMOV - remove unnecessary shuffle commutation. NFCI.
canonicalizeShuffleMaskWithCommute should have already ensured the lower elements are from V1, we do have test coverage for this already.
2020-08-19 13:28:59 +01:00
Simon Pilgrim b61cef3a92 [X86][AVX] getAVX512TruncNode - don't truncate from illegal vector widths.
Thanks to @fhahn for the test case.
2020-08-19 13:00:26 +01:00
Simon Pilgrim 80a0dc59b7 [X86][AVX] computeKnownBitsForTargetNode - add VTRUNC/VTRUNCS/VTRUNCUS known zero upper elements handling.
Like many of the AVX512 conversion ops, the VTRUNC ops guarantee the upper destination elements are zero.
2020-08-19 11:39:27 +01:00
Simon Pilgrim 46fc9a0dfc [X86][AVX] Fold store(extract_element(vtrunc)) to truncated store
Add handling for storing the extracted lower (truncated bits) element from a X86ISD::VTRUNC node - this can be lowered to a generic truncated store directly.

Differential Revision: https://reviews.llvm.org/D86158
2020-08-19 11:10:20 +01:00
Craig Topper 9028c03ce6 [X86] Fix the Predicates on MMX_PSHUFWri/PSHUFWmi to include SSE1 in addition to MMX.
These instructions weren't in the initial version of MMX, but
were added when SSE1 was introduced. We already have the intrinsic
named correctly to include sse and the frontend header enforces
sse. We have one place in the backend where we DAG combine to
this intrinsic, but that's also qualified. So don't know of anything
currently broken unless someone writes their own IR and doesn't
set the sse feature.
2020-08-18 14:28:26 -07:00
Simon Pilgrim 11ff5176c4 [X86][AVX] lowerShuffleWithVPMOV - add non-VLX support.
We can efficiently handle non-VLX cases now that we have the getAVX512TruncNode helper.
2020-08-18 17:51:14 +01:00
Simon Pilgrim abd33bf5ef [X86][AVX] lowerShuffleWithPERMV - pad 128/256-bit shuffles on non-VLX targets
Allow non-VLX targets to use 512-bits VPERMV/VPERMV3 for 128/256-bit shuffles.

TBH I'm not sure these targets actually exist in the wild, but we're testing for them and it's good test coverage for shuffle lowering/combines across different subvector widths.
2020-08-18 15:46:02 +01:00
Simon Pilgrim 011bf4fd96 [X86][AVX] lowerShuffleWithVTRUNC - extend to support v16i16/v32i8 binary shuffles.
This requires a few additional SrcVT vs DstVT padding cases in getAVX512TruncNode.
2020-08-18 15:30:02 +01:00
Simon Pilgrim d5621b83a5 [X86][AVX] lowerShuffleWithVTRUNC - pull out TRUNCATE/VTRUNC creation into helper code. NFCI.
Prep work toward adding v16i16/v32i8 support for lowerShuffleWithVTRUNC and improving lowerShuffleWithVPMOV.
2020-08-18 14:52:42 +01:00
Simon Pilgrim 7db5124736 [X86][AVX] lowerShuffleWithVTRUNC - avoid unnecessary division in element counts. NFCI.
(256 / SrcEltBits) == ((2 * EltSizeInBits * NumElts) / (EltSizeInBits * Scale)) == (2 * (NumElts / Scale)) == NumSrcElts
2020-08-18 13:48:22 +01:00
Simon Pilgrim d2057a8015 [X86][AVX] Lower v16i8/v8i16 binary shuffles using VTRUNC/TRUNCATE
This patch adds lowerShuffleWithVTRUNC to handle basic binary shuffles that can be lowered either as a pure ISD::TRUNCATE or a X86ISD::VTRUNC (with undef/zero values in the remaining upper elements).

We concat the binary sources together into a single 256-bit source vector. To avoid regressions we perform this after we've tried to lower with PACKS/PACKUS which typically does a cleaner job than a concat.

For non-AVX512VL cases we have to canonicalize VTRUNC cases to use 512-bit source vectors (inserting undefs/zeros in the upper elements as necessary), truncate and then (possibly) extract the 128-bit result.

This should address the last regressions in D66004

Differential Revision: https://reviews.llvm.org/D86093
2020-08-18 11:11:58 +01:00
Craig Topper b673dfbb9a [X86] When manually creating intrinsic nodes in X86ISelLowering, make sure we use getTargetConstant and pointer type for the intrinsic ID.
Doesn't really matter in practice but that's how the nodes are
normally created by SelectionDAGBuilder. So we should match.

Found by temporarily hacking type checks into isel table.
2020-08-17 17:25:53 -07:00
Craig Topper 2ffa5d218f [X86] Rename INTR_TYPE_4OP to INTR_TYPE_4OP_IMM8 and truncate immediates to MVT::i8
This makes sure VPTERNLOG is generated with MVT::i8 immediate
as its SDNode declaration in X86InstrFragmentsSIMD.td declares.
2020-08-17 17:25:52 -07:00
Craig Topper bc244f08cf [X86] Truncate immediate to i8 for INTR_TYPE_3OP_IMM8
This is used for DBPSADBW which has a i32 immediate for its
intrinsic and an i8 immediate in tablegen isel patterns.
2020-08-17 17:25:51 -07:00
Craig Topper ab7151f1cf [X86] Make PreprocessISelDAG create X86ISD::VRNDSCALE nodes with i32 constants instead of i8.
This is the type declared in X86InstrFragmentsSIMD.td. ISel pattern
matching doesn't check so it doesn't matter in practice. Maybe for
SelectionDAG CSE it would matter.
2020-08-17 17:25:51 -07:00
Hongtao Yu 819b2d9c79 [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.
When diffing the disassembly dumps of two binaries, I see a lot of noise from mismatched jump target addresses and global data references, which unnecessarily causes diffs on every function, making it impractical. I'm trying to symbolize the raw binary addresses to minimize the diff noise.
In this change, a local branch target is modeled as a label and the branch target operand will simply be printed as a label. Local labels are collected by a separate pre-decoding pass beforehand. A global data memory operand will be printed as a global symbol instead of the raw data address. Unfortunately, due to the way the disassembler is set up and to be less intrusive, a global symbol is always printed as the last operand of a memory access instruction. This is less than ideal but is probably acceptable from checking code quality point of view since on most targets an instruction can have at most one memory operand.

So far only the X86 disassemblers are supported.

Test Plan:

llvm-objdump -d  --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr [rip + 4112]  # 202182 <g>
               	jge	0x20117e <_start+0x25>
               	call	0x201158 <foo>
               	inc	dword ptr [rsp]
               	jmp	0x201169 <_start+0x10>
               	xor	eax, eax
               	pop	rcx
               	ret
```

llvm-objdump -d  **--symbolize-operands** --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :
```
Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
<L1>:
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr  <g>
               	jge	 <L0>
               	call	 <foo>
               	inc	dword ptr [rsp]
               	jmp	 <L1>
<L0>:
               	xor	eax, eax
               	pop	rcx
               	ret
```

Note that without this work, jump instructions like `jge 0x20117e <_start+0x25>` are printed with a real target address and an offset from the leading symbol. With a change in the optimizer that adds/deletes an instruction, the address and offset may shift for targets placed after the instruction. This is a problem when diffing the disassembly from two optimizers, where such branch target address changes produce unnecessary false positives. With `--symbolize-operands`, a label is printed for a branch target instead to reduce the false positives. Similarly, the disassembly of PC-relative global variable references is also prone to instruction insertion/deletion.

Reviewed By: jhenderson, MaskRay

Differential Revision: https://reviews.llvm.org/D84191
2020-08-17 16:55:12 -07:00
Simon Pilgrim 1d2ede87ea [X86][AVX] Move lowerShuffleWithVPMOV inside explicit shuffle lowering cases
Perform lowerShuffleWithVPMOV as part of the v16i8/v8i16 shuffle lowering stages, which are the only types that are currently supported.

We need to expand support for lowering shuffles as truncations to fix the remaining regressions in D66004
2020-08-17 11:58:51 +01:00
Craig Topper a206f85091 [X86] Reject dirflag in inline asm constraints other than clobber.
Fixes the crash from PR47195.
2020-08-16 23:33:45 -07:00
Simon Pilgrim f25d47b7ed [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
We can now enable this for AVX1 targets, since canonicalizeShuffleMaskWithHorizOp can now assist with the cleanup.

There's still a few missed opportunities for merging subvector insert/extracts into shuffles, but they shouldn't cause any regressions now.
2020-08-16 15:00:41 +01:00
Simon Pilgrim dca7eb7d60 [X86][SSE] Replace combineShuffleWithHorizOp with canonicalizeShuffleMaskWithHorizOp
Instead of just attempting to fold shuffle(HOP,HOP) for a specific target shuffle, make this part of combineX86ShufflesRecursively so we can perform this on the combined shuffle chain, which is particularly useful for recognising more cases of where we're performing multiple HOPs that can be merged and pre-AVX where we don't have good blend/unary target shuffle support.
2020-08-16 12:26:27 +01:00
Simon Pilgrim c27baa54b7 [X86] isRepeatedTargetShuffleMask - don't require specific MVT type. NFC.
Split the isRepeatedTargetShuffleMask into a wrapper variant that takes a MVT describing the mask width, and an internal version that just needs the raw mask element bit size.

This will be necessary for an upcoming change where the horizontal ops element width might not match the shuffle mask element width.
2020-08-16 11:51:44 +01:00
Craig Topper c7a0b2684f [X86][MC][Target] Initial backend support for a tune CPU to support -mtune
This patch implements initial backend support for a -mtune CPU controlled by a "tune-cpu" function attribute. If the attribute is not present X86 will use the resolved CPU from target-cpu attribute or command line.

This patch adds MC layer support for a tune CPU. Each CPU now has two sets of features stored in their GenSubtargetInfo.inc tables. These feature lists are passed separately to the Processor and ProcessorModel classes in tablegen. The tune list defaults to an empty list to avoid changes to non-X86. This annoyingly increases the size of static tables on all targets as we now store 24 more bytes per CPU. I haven't quantified the overall impact, but I can if we're concerned.

One new test is added to X86 to show a few tuning features with mismatched tune-cpu and target-cpu/target-feature attributes to demonstrate independent control. Another new test is added to demonstrate that the scheduler model follows the tune CPU.

I have not added a -mtune to llc/opt or MC layer command line yet. With no attributes we'll just use the -mcpu for both. MC layer tools will always follow the normal CPU for tuning.

Differential Revision: https://reviews.llvm.org/D85165
2020-08-14 15:31:50 -07:00
Simon Pilgrim e9eb2dc332 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813

Recommit of rG9bd97d036398 with fixed offset adjustments
2020-08-14 18:43:19 +01:00
Simon Pilgrim cd3b850a4c rG9bd97d0363987b582 - Revert "[X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))"
This reverts commit 9bd97d0363.

Seeing some codegen issues in internal testing.
2020-08-13 15:21:15 +01:00
Simon Pilgrim a31d20e67e [X86][SSE] IsElementEquivalent - add HOP(X,X) support
For HADD/HSUB/PACKS ops with repeated operands the lower/upper half element of each lane are known to be equivalent
2020-08-13 12:42:59 +01:00
Simon Pilgrim 39de63aef9 Fix signed/unsigned comparison warnings. NFC. 2020-08-12 19:22:13 +01:00
Simon Pilgrim 13d6cf0951 [X86][SSE] Pull out BUILD_VECTOR operand equivalence tests. NFC.
Pull out element equivalence code from isShuffleEquivalent/isTargetShuffleEquivalent, I've also removed many of the index modulos where possible.

First step toward simply adding some additional equivalence tests.
2020-08-12 18:20:18 +01:00
Craig Topper 5f7cdb2eff [X86][GlobalISel] Legalize G_ICMP results to s8.
We need to produce a setcc instruction which has an 8-bit result.
This gets rid of a bunch of cases that were using the s1->s8/s16/s32/s64
handling in selectZExt.

I'm not very familiar with GlobalISel yet so I'm not yet sure
the best way to do things. I'd especially like feedback on the
best way to handle the currently split 32-bit and 64-bit mode
handling.

Differential Revision: https://reviews.llvm.org/D85814
2020-08-12 10:13:59 -07:00
Simon Pilgrim 9bd97d0363 [X86][SSE] Fold HOP(SHUFFLE(X),SHUFFLE(Y)) --> SHUFFLE(HOP(X,Y))
This is beginning to look like a canonicalization stage that could be performed as part of shuffle combining

Another step towards PR41813
2020-08-12 12:16:36 +01:00
Simon Pilgrim a0c2c6aa42 [X86][AVX] Fold CONCAT(HOP(X,Y),HOP(Z,W)) -> HOP(CONCAT(X,Z),CONCAT(Y,W)) for float types
Only do this for AVX2+ targets as we still get some regressions on AVX1 without PERMPD/PERMQ
2020-08-12 11:31:05 +01:00
Craig Topper 6b3dc96e59 [X86][GlobalISel] Replace a misuse of SUBREG_TO_REG with INSERT_SUBREG.
SUBREG_TO_REG is supposed to be used when we know the producing
instruction already zeroed the bits we're extending. But that's
not the case here. So INSERT_SUBREG with an IMPLICIT_DEF is the
correct thing to use.
2020-08-11 23:51:02 -07:00
Simon Pilgrim 2655bd51d6 [X86][SSE] combineShuffleWithHorizOp - canonicalize SHUFFLE(HOP(X,Y),HOP(Y,X)) -> SHUFFLE(HOP(X,Y))
Attempt to canonicalize binary shuffles of HOPs with commuted operands to a unary shuffle.
2020-08-11 18:13:03 +01:00
Eric Christopher 8155cb27a2 Fold Opcode into assert uses to fix an unused variable warning without asserts. 2020-08-11 09:30:51 -07:00
Simon Pilgrim fe1f36986b [X86][SSE] combineShuffleWithHorizOp - avoid unnecessary subtraction. NFCI.
We can safely replace ((M - NumElts) % NumEltsPerLane) with (M % NumEltsPerLane) as the modulo result will be the same.
2020-08-11 17:07:32 +01:00
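The replacement in this commit is safe when NumElts is a whole multiple of NumEltsPerLane and M refers to the second shuffle operand (M >= NumElts), so that both dividends are non-negative. That precondition is an assumption made for this sketch, and the helper names `sameLaneOffset` and `holdsForSecondOperand` are hypothetical.

```cpp
// True when dropping the "- NumElts" does not change the in-lane offset.
constexpr bool sameLaneOffset(int M, int NumElts, int NumEltsPerLane) {
  return ((M - NumElts) % NumEltsPerLane) == (M % NumEltsPerLane);
}

// Check every mask index that refers to the second operand, that is
// NumElts <= M < 2 * NumElts.
constexpr bool holdsForSecondOperand(int NumElts, int NumEltsPerLane) {
  for (int M = NumElts; M < 2 * NumElts; ++M)
    if (!sameLaneOffset(M, NumElts, NumEltsPerLane))
      return false;
  return true;
}

// 8 elements in lanes of 4, and 32 elements in lanes of 16.
static_assert(holdsForSecondOperand(8, 4), "");
static_assert(holdsForSecondOperand(32, 16), "");
// For M < NumElts the subtraction goes negative and the C++ remainder
// differs, so the identity is only claimed for the second operand.
static_assert(!sameLaneOffset(1, 8, 4), "");
```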
Simon Pilgrim 91d59cbf1b [X86][SSE] Add HADD/SUB support to combineHorizOpWithShuffle
Handles some HOP(SHUFFLE,SHUFFLE) patterns and sets us up to improve some of the cases mentioned in PR41813.
2020-08-11 16:14:14 +01:00
Kerry McLaughlin 85c7e89f3b [CodeGen] Refactor getMemBasePlusOffset & getObjectPtrOffset to accept a TypeSize
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D85220
2020-08-11 12:17:10 +01:00
Simon Pilgrim 49016eeab6 [X86] Rename combineVectorPackWithShuffle -> combineHorizOpWithShuffle. NFC.
The plan is to use this for (F)HADD/SUB opcodes as well as PACKs - similar to how we use combineShuffleWithHorizOp
2020-08-11 11:38:43 +01:00
Craig Topper 9201efb3b9 [X86] Custom match X86ISD::VPTERNLOG in X86ISelDAGToDAG in order to reduce isel patterns.
By factoring out the end of tryVPTERNLOG, we can use the same code
to directly match X86ISD::VPTERNLOG. This allows us to remove
around 3-4K worth of X86GenDAGISel.inc.
2020-08-10 23:15:58 -07:00
Wang, Pengfei 9512525947 [X86][FPEnv] Teach X86 mask compare intrinsics to respect strict FP semantics.
When we use mask compare intrinsics under strict FP option, the masked
elements shouldn't raise any exception. So, we can't replace the
intrinsic with a full compare + "and" operation.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D85385
2020-08-11 10:28:41 +08:00
Craig Topper 96dfc783b2 [BreakFalseDeps][X86] Move operand loop out of X86's getUndefRegClearance and put in the pass.
X86 is the only user of this interface in tree. Previously the
X86 pass would loop over operands looking for one undef operand for
the pass to fix. But there could theoretically be multiple operands
to fix. So it makes more sense for the pass to do the looping and
ask the target if an operand needs to be fixed.
2020-08-10 10:32:29 -07:00
Simon Pilgrim 9a368d2b00 [X86][SSE] shuffle(hop,hop) - canonicalize unary hop(x,x) shuffle masks
If a shuffle is referring to both the lower and upper half lanes of a unary horizontal op, then canonicalize the mask to only refer to the lower half.
2020-08-10 16:09:27 +01:00
Simon Pilgrim 07e673a02b [X86][SSE] Pull out shuffle(hop,hop) combine into combineShuffleWithHorizOp helper. NFC. 2020-08-10 15:08:57 +01:00
Simon Pilgrim e6dc2c8ce7 [X86][SSE] combineTargetShuffle - rearrange shuffle(hop,hop) matching to delay shuffle mask manipulation. NFC.
Check that we're shuffling hadd/pack ops first before altering shuffle masks.

First step towards adding extra functionality, plus it avoids costly shuffle mask manipulation if not necessary.
2020-08-10 14:13:19 +01:00
Matt Arsenault f9c279b057 PeepholeOptimizer: Use Register 2020-08-10 08:49:36 -04:00
Craig Topper bc8be30540 [X86][GlobalISel] Remove unneeded code for handling zext i8->16, i8->i64, i16->i64, i32->i64.
These all seem to be handled by tablegen pattern imports.
2020-08-09 00:26:15 -07:00
Craig Topper d3153b5ca2 [X86] Remove a DCI.isBeforeLegalize() call from combineVSelectWithAllOnesOrZeros.
This was blocking isTypeLegal call so that we could do a particular
transform on illegal types before type legalization. But then we
create a target specific node using that type. We shouldn't do
that if the type isn't legal. So I think we should just always
make sure the type is legal.

I suspect that in order to get the condition VT to not be a vector
of i1 we already completed type legalization anyway so this probably
doesn't matter much in practice.
2020-08-08 14:19:13 -07:00
Craig Topper 966a58e329 [X86] Support matching VPTERNLOG when the root node is X86ISD::ANDNP. 2020-08-08 13:11:47 -07:00
Craig Topper 815a9b256b [X86] Remove isSafeToClobberEFLAGS helper and just inline it into the call sites.
This is just a thin wrapper around computeRegisterLiveness which
we can just call directly. The only real difference is that
isSafeToClobberEFLAGS returns a bool and computeRegisterLiveness
returns an enum. So we need to check for the specific enum value
that isSafeToClobberEFLAGS was hiding.

I've also adjusted which sites pass an explicit value for
Neighborhood since the default for computeRegisterLiveness is 10.
2020-08-08 12:31:58 -07:00
Craig Topper 8d3ae64b04 Recommit "[X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places"
I messed up the bug numbers in the commit message before

Previously this function searched 4 instructions forwards or
backwards to determine if it was ok to clobber eflags.

This is called in 3 places: rematerialization, turning 2 operand
leas into adds or splitting 3 ops leas into an lea and add on some
CPU targets.

This patch increases the search limit to 10 instructions for
rematerialization and 2 operand lea to add. I've left the old
threshold for 3 ops lea splitting as that increases code size.

Fixes PR47024 and PR46315.
2020-08-08 11:53:14 -07:00
Craig Topper 761f568420 Revert "[X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places"
This reverts commit 44b260cb0a.

I messed up the bug number in the commit message so I'm reverting
to fix it.
2020-08-08 11:53:14 -07:00
Simon Pilgrim cc15380f10 [X86][SSE] combineTargetShuffle - use scaleShuffleMask helper to widen shuffle mask. NFCI.
Use scaleShuffleMask helper for the shuffle(hadd,hadd) canonicalization.
2020-08-08 19:36:18 +01:00
Craig Topper 44b260cb0a [X86] Increase the number of instructions searched for isSafeToClobberEFLAGS in a couple places
Previously this function searched 4 instructions forwards or
backwards to determine if it was ok to clobber eflags.

This is called in 3 places: rematerialization, turning 2 operand
leas into adds or splitting 3 ops leas into an lea and add on some
CPU targets.

This patch increases the search limit to 10 instructions for
rematerialization and 2 operand lea to add. I've left the old
threshold for 3 ops lea splitting as that increases code size.

Fixes PR47024 and PR43014
2020-08-08 11:29:41 -07:00
Craig Topper 514b00c439 [X86] Limit the scope of the min/max canonicalization in combineSelect
Previously the transform was doing these two canonicalizations
(x > y) ? x : y -> (x >= y) ? x : y
(x < y) ? x : y -> (x <= y) ? x : y

But those don't seem to be useful generally. And they actively
pessimize the cases in PR47049.

This patch limits it to
(x > 0) ? x : 0 -> (x >= 0) ? x : 0
(x < -1) ? x : -1 -> (x <= -1) ? x : -1

These are the cases mentioned in the comments as the motivation
for the canonicalization. These allow the CMOV to use the S
flag from the compare thus improving opportunities to use a TEST
or the flags from an arithmetic instruction.
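
A small check of why the two retained rewrites are value-preserving, modelled on plain 32-bit integers rather than on SelectionDAG nodes. All function names here are hypothetical. At the only input where the strict and relaxed comparisons disagree (x == 0 for the first pair, x == -1 for the second), both arms of the select hold the same value, and the relaxed comparisons depend only on the sign bit, which is what lets the CMOV consume the S flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers modelling the select patterns on int32_t.
constexpr int32_t clampLowStrict(int32_t x) { return (x > 0) ? x : 0; }
constexpr int32_t clampLowRelaxed(int32_t x) { return (x >= 0) ? x : 0; }
constexpr int32_t clampHighStrict(int32_t x) { return (x < -1) ? x : -1; }
constexpr int32_t clampHighRelaxed(int32_t x) { return (x <= -1) ? x : -1; }

// True when the top bit of the two's complement value is set.
constexpr bool signBitSet(int32_t x) {
  return (static_cast<uint32_t>(x) >> 31) != 0;
}

// Compares the strict and relaxed forms on boundary values and confirms
// that the relaxed conditions are pure sign-bit tests.
void checkCanonicalization() {
  const int32_t samples[] = {INT32_MIN, -2, -1, 0, 1, 2, INT32_MAX};
  for (int32_t x : samples) {
    assert(clampLowStrict(x) == clampLowRelaxed(x));
    assert(clampHighStrict(x) == clampHighRelaxed(x));
    assert((x >= 0) == !signBitSet(x));
    assert((x <= -1) == signBitSet(x));
  }
}
```

The general forms with an arbitrary y are also value-preserving, but there the relaxed comparison is not a sign-bit test, which is consistent with the message's point that they bring no general benefit.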
2020-08-07 22:51:49 -07:00
Keno Fischer c58674df14 [X86] Don't produce bad x86andnp nodes for i1 vectors
In D85499, I attempted to fix this same issue by canonicalizing
andnp for i1 vectors, but since there was some opposition to such
a change, this commit just fixes the bug by using two different
forms depending on which kind of vector type is in use. We can
then always decide to switch the canonical forms later.

Description of the original bug:
We have a DAG combine that tries to fold (vselect cond, 0000..., X) -> (andnp cond, x).
However, it does so by attempting to create an i64 vector with the number
of elements obtained by truncating division by 64 from the bitwidth. This is
bad for mask vectors like v8i1, since that division is just zero. Besides,
we don't want i64 vectors anyway. For i1 vectors, switch the pattern
to (andnp (not cond), x), which is the canonical form for `kandn`
on mask registers.
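
The arithmetic behind the bug can be sketched as follows, assuming the element count is computed as the total bit width divided by 64 with truncation, as the description states. The helper names vectorBits and numI64Elements are hypothetical and are not taken from the LLVM source.

```cpp
#include <cassert>

// Total size in bits of a vector with numElts elements of eltBits each.
constexpr unsigned vectorBits(unsigned numElts, unsigned eltBits) {
  return numElts * eltBits;
}

// Number of i64 elements obtained by truncating division of the bit width.
constexpr unsigned numI64Elements(unsigned totalBits) {
  return totalBits / 64;
}

// An ordinary 128-bit vector such as v4i32 maps cleanly onto two i64s.
static_assert(numI64Elements(vectorBits(4, 32)) == 2, "v4i32 -> v2i64");

// A mask vector such as v8i1 is only 8 bits wide, so the division
// truncates to zero and there is no valid i64 vector type to build.
static_assert(numI64Elements(vectorBits(8, 1)) == 0, "v8i1 -> 0 elements");
static_assert(numI64Elements(vectorBits(16, 1)) == 0, "v16i1 -> 0 elements");
```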

Fixes https://github.com/JuliaLang/julia/issues/36955.

Differential Revision: https://reviews.llvm.org/D85553
2020-08-07 20:05:47 -04:00
Craig Topper 0215ae9735 [X86] Remove incomplete custom handling of i128 sdivrem/udivrem on Windows.
We need to have special handling of i128 div/rem on Windows due
to a weird calling convention needed for the libcall. There was
also some code that made it look like we do the same for sdivrem/udivrem,
but the code didn't account for multiple return values of those
functions so couldn't possibly work. I think this code never
triggers because we don't have libcall names defined for those
functions by default so DAGCombine never creates DIVREM nodes.
2020-08-05 23:01:07 -07:00