Commit Graph

6556 Commits

Author SHA1 Message Date
Sanjay Patel 2bf82879bd [x86] split more 256-bit stores of concatenated vectors
As suggested in D62498 - collectConcatOps() matches both
concat_vectors and insert_subvector patterns, and we see
more test improvements by using the more general match.

llvm-svn: 362620
2019-06-05 16:40:57 +00:00
Simon Pilgrim de586bd1fd [X86][AVX] Generalize split256BitStore to splitVectorStore. NFCI.
Enables us to use this to split 512-bit vectors in future patches.

llvm-svn: 362617
2019-06-05 16:14:14 +00:00
Simon Pilgrim 886a55eaa0 [X86][AVX] combineX86ShuffleChain - combine shuffle(extractsubvector(x),extractsubvector(y))
We already handle the case where we combine shuffle(extractsubvector(x),extractsubvector(x)), this relaxes the requirement to permit different sources as long as they have the same value type.

This causes a couple of cases where the VPERMV3 binary shuffles occur at a wider width than before, which I intend to improve in future commits - but as only the subvector's mask indices are defined, these will broadcast so we don't see any increase in constant size.

llvm-svn: 362599
2019-06-05 12:56:53 +00:00
Craig Topper 78fdce25a1 [X86] Cleanup convertIntLogicToFPLogic a little. NFCI
-Use early returns to reduce indentation
-Replace multipe ifs with a switch.
-Replace an assert with an llvm_unreachable default in the switch.
-Check that the FP type we're going to use for the
 X86ISD::FAND/FOR/FXOR is legal rather than checking that the
 integer type matches the width of a legal scalar fp type. This all
 runs after legalization so it shouldn't really matter, but making
 sure we're using a valid type in the X86ISD node is really
 whats important.

llvm-svn: 362565
2019-06-05 01:00:34 +00:00
Benjamin Kramer 03ff1b3c30 [X86] Fold single-use variable into assert. NFC.
Avoids an unused variable warning in Release builds.

llvm-svn: 362534
2019-06-04 18:01:07 +00:00
Sanjay Patel 606eb2367f [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is a reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 362524
2019-06-04 16:40:04 +00:00
Simon Pilgrim a6e289e9f8 [X86][SSE] Pulled out (sub (xor X, M), M) 'ConditionalNegate' out pattern match code. NFCI.
As discussed on D62777 - we should be able to use this in more SSE41+ cases as well but that requires us to separate it from the OR(AND(),ANDN()) matcher.

llvm-svn: 362504
2019-06-04 15:02:33 +00:00
Simon Pilgrim 71a39bcf68 [X86] isHorizontalBinOp - add extract_subvector(shuffle(x)) handling (PR39921)
Let's us match horizontal op patterns on fast-variable-shuffle targets (Haswell etc.)

llvm-svn: 362327
2019-06-02 15:47:49 +00:00
Simon Pilgrim 7a869e7036 [DAGCombine] Fold insert_subvector(bitcast(x),bitcast(y),c1) -> bitcast(insert_subvector(x,y),c2)
Move this combine from x86 into generic DAGCombine, which currently only manages cases where the bitcast is between types of the same scalarsize.

Differential Revision: https://reviews.llvm.org/D59188

llvm-svn: 362324
2019-06-02 14:42:11 +00:00
Pengfei Wang 2e67d0c842 [X86] Add VP2INTERSECT instructions
Support Intel AVX512 VP2INTERSECT instructions in llvm

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D62366

llvm-svn: 362188
2019-05-31 02:50:41 +00:00
Craig Topper d6b74cc859 [X86] Remove code that unnecessarily sets EXTLOAD with src type of v2f32/v4f32/v8f32 as Legal for SSE2/AVX/AVX512 respectively. NFC
The LoadExt table defaults to all combinations being Legal. For
vector types, only src VTs with an i1 element type were ever changed.
So we don't need to mark them legal manually.

llvm-svn: 362170
2019-05-30 22:29:06 +00:00
Simon Pilgrim 32aac1727a [X86][SSE] Improve bool vector extload (PR26091)
We already have good codegen for (vXiY *ext(vXi1 bitcast(iX))) cases, this patch uses it for loads of vXi1 types as well - changing the load into a iX integer load, and bitcasting so that combineToExtendBoolVectorInReg can then use it.

Differential Revision: https://reviews.llvm.org/D62449

llvm-svn: 362081
2019-05-30 10:25:20 +00:00
Pengfei Wang 1f67d94279 [X86] Add ENQCMD instructions
For more details about these instructions, please refer to the latest
ISE document:
https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference.

Patch by Tianqing Wang (tianqing)

Differential Revision: https://reviews.llvm.org/D62281

llvm-svn: 362053
2019-05-30 03:59:16 +00:00
Adhemerval Zanella 6d7bf5e8df [CodeGen] Add lrint/llrint builtins
This patch add the ISD::LRINT and ISD::LLRINT along with new
intrinsics.  The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.

The idea is to optimize lrint/llrint generation for AArch64
in a subsequent patch.  Current semantic is just route it to libm
symbol.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D62017

llvm-svn: 361875
2019-05-28 20:47:44 +00:00
Sanjay Patel f7980e727f Revert "[x86] split 256-bit store of concatenated vectors"
This reverts commit d5a8637072.

Most likely suspect for this bot failure:
http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/9684

llvm-svn: 361850
2019-05-28 17:37:58 +00:00
Sanjay Patel d5a8637072 [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is the reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 361822
2019-05-28 13:54:17 +00:00
Sanjay Patel 6bf4ca9d2e [x86] fix 256-bit vector store splitting to honor 'volatile'
Forking this out of the discussion in D62498
(and assuming that will be committed later, so adding the helper function here).
The LangRef says:
"the backend should never split or merge target-legal volatile load/store instructions."

Differential Revision: https://reviews.llvm.org/D62506

llvm-svn: 361815
2019-05-28 12:58:07 +00:00
Benjamin Kramer 57e267a2e9 [X86] Custom lower CONCAT_VECTORS of v2i1
The generic legalizer cannot handle this. Add an assert instead of
silently miscompiling vectors with elements smaller than 8 bits.

llvm-svn: 361814
2019-05-28 12:52:57 +00:00
Simon Pilgrim a044410f37 [X86][SSE] Add shuffle combining support for ISD::ANY_EXTEND_VECTOR_INREG
Reuses what we already have in place for ISD::ZERO_EXTEND_VECTOR_INREG just with a different sentinel

llvm-svn: 361734
2019-05-26 16:00:35 +00:00
Simon Pilgrim 58a8541dcc [X86][AVX] combineBitcastvxi1 - peek through bitops to determine size of original vector
We were only testing for direct SETCC results - this allows us to peek through AND/OR/XOR combinations of the comparison results as well.

There's a missing SEXT(PACKSS) fold that I need to investigate for v8i1 cases before I can enable it there as well.

llvm-svn: 361716
2019-05-26 10:54:23 +00:00
Simon Pilgrim 40fa52b174 [X86] lowerBuildVectorToBitOp - support build_vector(shift()) -> shift(build_vector(),C)
Commonly occurs in sign-extension cases

llvm-svn: 361706
2019-05-25 18:02:17 +00:00
Nikita Popov d87eceda0e [X86] Combine fminnum/fmaxnum with non-nan operand to fmin/fmax
If we have a known non-nan operand, place it in the second operand
of fmin/fmax that is returned if either operand is nan.

Differential Revision: https://reviews.llvm.org/D62448

llvm-svn: 361704
2019-05-25 16:44:29 +00:00
Simon Pilgrim 95b8d9bbf8 [SelectionDAG] computeKnownBits - support constant pool values from target
This patch adds the overridable TargetLowering::getTargetConstantFromLoad function which allows targets to return any constant value loaded by a LoadSDNode node - only X86 makes use of this so far but everything should be in place for other targets.

computeKnownBits then uses this function to improve codegen, notably vector code after legalization.

A future commit will do the same for ComputeNumSignBits but computeKnownBits sees the bigger benefit.

This required a couple of fixes:
* SimplifyDemandedBits must early-out for getTargetConstantFromLoad cases to prevent infinite loops of constant regeneration (similar to what we already do for BUILD_VECTOR).
* Fix a DAGCombiner::visitTRUNCATE issue as we had trunc(shl(v8i32),v8i16) <-> shl(trunc(v8i16),v8i32) infinite loops after legalization on AVX512 targets.

Differential Revision: https://reviews.llvm.org/D61887

llvm-svn: 361620
2019-05-24 10:03:11 +00:00
Nikita Popov 15df05152d [X86] Don't compare i128 through vector if construction not cheap (PR41971)
Fix for https://bugs.llvm.org/show_bug.cgi?id=41971. Make the
combineVectorSizedSetCCEquality() transform more conservative by
checking that the bitcast to the vector type will be cheap/free
for both operands. I'm considering it cheap if it's a constant,
a load or already a vector. I've dropped the explicit check for
f128 because it should fall out naturally (in the cases where
it'd be detrimental).

Differential Revision: https://reviews.llvm.org/D62220

llvm-svn: 361352
2019-05-22 06:47:06 +00:00
Craig Topper ed6df47bae [X86] Remove an unneeded ZERO_EXTEND creation from LowerINTRINSIC_W_CHAIN. NFC
We were trying to ZERO_EXTEND from an i8 X86ISD::SETCC to i8 again.

llvm-svn: 361288
2019-05-21 19:03:45 +00:00
Simon Pilgrim 4b82e50315 [X86][SSE] computeKnownBitsForTargetNode - add X86ISD::ANDNP support
Fixes PACKSS-PSHUFB shuffle regressions mentioned on D61692

llvm-svn: 361270
2019-05-21 15:20:24 +00:00
Craig Topper 3164b50af7 [X86] Remove combineShift function. Just dispatch directly to the handler for each flavor from the main switch. NFC
llvm-svn: 361108
2019-05-19 01:01:46 +00:00
Simon Pilgrim 065431c82b [X86][SSE] Fold movmsk(not(x)) -> not(movmsk)
Helps to improve folding of comparisons with movmsk results.

llvm-svn: 361056
2019-05-17 17:56:25 +00:00
Simon Pilgrim 2c2f8e74b9 [X86][SSE] Match all-of bool scalar reductions into a bitcast/movmsk + cmp.
Same as what we do for vector reductions in combineHorizontalPredicateResult, use movmsk+cmp for scalar (and(extract(x,0),extract(x,1)) reduction patterns. 

llvm-svn: 361052
2019-05-17 17:25:55 +00:00
Simon Pilgrim 279314e81b [X86][AVX] Remove LowerCTTZ's AVX1 custom vector handling.
We can now rely on generic expansion to handle this.

llvm-svn: 361038
2019-05-17 14:37:19 +00:00
Simon Pilgrim 62c7032c18 [X86][AVX] isNOT - add extract_subvector(xor X, -1) -> extract_subvector(X) fold.
Prep work for the removal of the remaining x86 CTTZ vector lowering.

llvm-svn: 361035
2019-05-17 14:04:56 +00:00
Simon Pilgrim a6d3bd486b [X86] Pull out IsNOT helper. NFCI.
Return the input value for the NOT pattern: (xor X, -1) -> X

llvm-svn: 361012
2019-05-17 10:37:08 +00:00
Reid Kleckner 08c15df29f [X86] Deduplicate symbol lowering logic, NFC
Summary:
This refactors four pieces of code that create SDNodes for references to
symbols:
- normal global address lowering (LEA, MOV, etc)
- callee global address lowering (CALL)
- external symbol address lowering (LEA, MOV, etc)
- external symbol address lowering (CALL)

Each of these pieces of code need to:
- classify the reference
- lower the symbol
- emit a RIP wrapper if needed
- emit a load if needed
- add offsets if needed

I think handling them all in one place will make the code easier to
maintain in the future.

Reviewers: craig.topper, RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61690

llvm-svn: 360952
2019-05-16 23:15:26 +00:00
Adhemerval Zanella 73643b5041 [CodeGen] Add lround/llround builtins
This patch add the ISD::LROUND and ISD::LLROUND along with new
intrinsics.  The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.

The idea is to optimize lround/llround generation for AArch64
in a subsequent patch.  Current semantic is just route it to libm
symbol.

llvm-svn: 360889
2019-05-16 13:15:27 +00:00
Craig Topper 384d46c0d5 [X86] Use OR32mi8Locked instead of LOCK_OR32mi8 in emitLockedStackOp.
They encode the same way, but OR32mi8Locked sets hasUnmodeledSideEffects set
which should be stronger than the mayLoad/mayStore on LOCK_OR32mi8. I think
this makes sense since we are using it as a fence.

This also seems to hide the operation from the speculative load hardening pass
so I've reverted r360511.

llvm-svn: 360747
2019-05-15 04:15:46 +00:00
Philip Reames 658cad1287 [NFC] Reuse a helper function to eliminate duplicate code
llvm-svn: 360740
2019-05-15 01:39:07 +00:00
Philip Reames 445f942fc4 Use an offset from TOS for idempotent rmw locked op lowering
This was the portion split off D58632 so that it could follow the redzone API cleanup. Note that I changed the offset preferred from -8 to -64. The difference should be very minor, but I thought it might help address one concern which had been previously raised.

Differential Revision: https://reviews.llvm.org/D61862

llvm-svn: 360719
2019-05-14 22:32:42 +00:00
Simon Pilgrim c2d9cfd925 [X86] Disable shouldFoldConstantShiftPairToMask for scalar shifts on AMD targets (PR40758)
D61068 handled vector shifts, this patch does the same for scalars where there are similar number of pipes for shifts as bit ops - this is true almost entirely for AMD targets where the scalar ALUs are well balanced.

This combine avoids AND immediate mask which usually means we reduce encoding size.

Some tests show use of (slow, scaled) LEA instead of SHL in some cases, but thats due to particular shift immediates - shift+mask generate these just as easily.

Differential Revision: https://reviews.llvm.org/D61830

llvm-svn: 360684
2019-05-14 15:21:28 +00:00
Simon Pilgrim 2747ee2c83 [X86] X86TargetLowering::LowerINTRINSIC_WO_CHAIN - ensure rounding control is initialized. NFCI.
Fixes scan-build warnings

llvm-svn: 360664
2019-05-14 11:30:39 +00:00
Philip Reames 3098e44daa [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit targets
This is a follow on to D58632, with the same logic. Given a memory operation which needs ordering, but doesn't need to modify any particular address, prefer to use a locked stack op over an mfence.

Differential Revision: https://reviews.llvm.org/D61863

llvm-svn: 360649
2019-05-14 04:43:37 +00:00
Sanjay Patel 3a13d970aa [SDAG, x86] allow targets to override test for binop opcodes
This follows the pattern of the existing isCommutativeBinOp().

x86 shows improvements from vector narrowing for the min/max opcodes.

llvm-svn: 360639
2019-05-14 00:39:40 +00:00
Craig Topper e2966473dd [X86] Use ISD::MERGE_VALUES to return from lowerAtomicArith instead of calling ReplaceAllUsesOfValueWith and returning SDValue().
Returning SDValue() makes the caller think that nothing happened and it will
end up executing the Expand path. This generates extra nodes that will need to
be pruned as dead code.

Returning an ISD::MERGE_VALUES will tell the caller that we'd like to make a
change and it will take care of replacing uses. This will prevent falling into
the Expand path.

llvm-svn: 360627
2019-05-13 22:17:13 +00:00
Craig Topper 5f999c2bea [X86] Various type corrections to the code that creates LOCK_OR32mi8/OR32mi8Locked to the stack for idempotent atomic rmw and atomic fence.
These are updates to match how isel table would emit a LOCK_OR32mi8 node.

-Use i32 for the immediate zero even though only 8 bits are encoded.
-Use i16 for segment register.
-Use LOCK_OR32mi8 for idempotent atomic operations in 32-bit mode to match
64-bit mode. I'm not sure why OR32mi8Locked and LOCK_OR32mi8 both exist. The
only difference seems to be that OR32mi8Locked is marked as UnmodeledSideEffects=1.
-Emit an extra i32 result for the flags output.

I don't know if the types here really matter just noticed it was inconsistent
with normal behavior.

llvm-svn: 360619
2019-05-13 21:01:24 +00:00
Nick Desaulniers c33f754e74 [TargetLowering] Handle multi depth GEPs w/ inline asm constraints
Summary:
X86TargetLowering::LowerAsmOperandForConstraint had better support than
TargetLowering::LowerAsmOperandForConstraint for arbitrary depth
getelementpointers for "i", "n", and "s" extended inline assembly
constraints. Hoist its support from the derived class into the base
class.

Link: https://github.com/ClangBuiltLinux/linux/issues/469

Reviewers: echristo, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, E5ten, kees, jyknight, nemanjai, javed.absar, eraman, hiraditya, jsji, llvm-commits, void, craig.topper, nathanchance, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61560

llvm-svn: 360604
2019-05-13 17:27:44 +00:00
Simon Pilgrim 73aee29095 [X86][SSE] LowerBuildVectorv4x32 - don't insert MOVQ for undef elts
Fixes the regression noted in D61782 where a VZEXT_MOVL was being inserted because we weren't discriminating between 'zeroable' and 'all undef' for the upper elts.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360596
2019-05-13 16:10:11 +00:00
Simon Pilgrim cf5a8eb7cd [X86][SSE] Relax use limits for lowerAddSubToHorizontalOp (PR32433)
Now that we can use HADD/SUB for scalar additions from any pair of extracted elements (D61263), we can relax the one use limit as we will be able to merge multiple uses into using the same HADD/SUB op.

This exposes a couple of missed opportunities in LowerBuildVectorv4x32 which will be committed separately.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360594
2019-05-13 16:02:45 +00:00
Simon Pilgrim d9aa928603 [X86] Add SimplifyDemandedBits support for PEXTRB/PEXTRW (PR39709)
Test case will be included in a followup - its being used but its tricky to show a case that isn't caught at a later stage anyway.

llvm-svn: 360588
2019-05-13 15:31:27 +00:00
Simon Pilgrim a7fc763082 [X86][AVX] Split VZEXT_MOVL ymm/zmm if the upper elements are not demanded.
Removes unnecessary vzeroupper noted in D61806

llvm-svn: 360543
2019-05-12 15:16:29 +00:00
Simon Pilgrim fda6bffd3b [X86][SSE] SimplifyDemandedBits - call PEXTRB/PEXTRW SimplifyDemandedVectorElts as well.
See if we can simplify the demanded vector elts from the extraction before trying to simplify the demanded bits.

This helps us with target shuffles and hops in particular.

llvm-svn: 360535
2019-05-11 21:35:50 +00:00
Simon Pilgrim e4c5b6d9bd [X86][SSE] Add SimplifyDemandedVectorElts HADD/HSUB handling.
Still missing PHADDW/PHSUBW tests because PEXTRW doesn't call SimplifyDemandedVectorElts

llvm-svn: 360526
2019-05-11 16:07:12 +00:00
Craig Topper c9d7484aa3 [X86] Add CMOV_FR32X/CMOV_FR64X pseudo instructions. Use them in fast isel to fix a machine verifier error after adding test cases.
Fast isel picks the FR32X/FR64X register classes when lowering pseudo select, but it didn't have the right opcode to go with it.

llvm-svn: 360524
2019-05-11 16:00:28 +00:00
Simon Pilgrim a0b1518a4a [X86][SSE] Add getHopForBuildVector vector splitting
If we only use the lower xmm of a ymm hop, then extract the xmm's (for free), perform the xmm hop and then insert back into a ymm (for free).

Fixes some of the regressions noted in D61782

llvm-svn: 360435
2019-05-10 15:46:04 +00:00
Philip Reames bd588dfd59 [X86] Improve lowering of idemptotent RMW operations
The current lowering uses an mfence. mfences are substaintially higher latency than the locked operations originally requested, but we do want to avoid contention on the original cache line. As such, use a locked instruction on a cache line assumed to be thread local.

Differential Revision: https://reviews.llvm.org/D58632

llvm-svn: 360393
2019-05-09 23:23:42 +00:00
Simon Pilgrim 93bfa5af48 [X86][SSE] Fold add(shuffle(),shuffle()) to hadd on 'slow' targets (PR39920)
As reported on PR39920, "slow horizontal ops" targets tend to internally expand to 2*shuffle+add/sub - so if we can reduce 2*shuffle+add/sub to a hadd/sub then we should do it - similar port usage but reduced instruction count.

This works out in most cases, although the "PR22377" regression in vector-shuffle-combining.ll is annoying - going from 2*shuffle+add+shuffle to hadd+2*shuffle - I've opened PR41813 to cover this.

Differential Revision: https://reviews.llvm.org/D61308

llvm-svn: 360360
2019-05-09 17:45:01 +00:00
Reid Kleckner 6bf108d77a [COFF] Use COFF stubs for extern_weak functions
Summary:
A COFF stub indirects the reference to a symbol through memory. A
.refptr.$sym global variable pointer is created to refer to $sym.
Typically mingw uses these for external global variable declarations,
but we can use them for weak function declarations as well.

Updates the dso_local classification to add a special case for
extern_weak symbols on COFF in both clang and LLVM.

Fixes PR37598

Reviewers: smeenai, mstorsjo

Subscribers: hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D61615

llvm-svn: 360207
2019-05-07 23:06:21 +00:00
Eric Christopher 4727221734 Make sure that the DAG combiner doesn't merge stores that we explicitly
asked not be greater than preferred vector width for the vectorizer.
Test for both 128 and 256 with a skylake architecture.

llvm-svn: 360183
2019-05-07 19:25:34 +00:00
Simon Pilgrim debb2b2a1e Fix local shadow variable warning. NFCI.
llvm-svn: 360157
2019-05-07 14:56:34 +00:00
Simon Pilgrim b0f51266b8 [X86][AVX] Fold concat(packus(),packus()) -> packus(concat(),concat()) (PR34773)
Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining.

llvm-svn: 360134
2019-05-07 11:17:39 +00:00
Craig Topper a75630302d [X86] Use extended vector register classes in getRegForInlineAsmConstraint to support x/y/zmm16-31 when the type is mismatched.
The FR32/FR64/VR128/VR256 register classes don't contain the upper 16 registers. For most cases we use the default implementation which will find any register class that contains the register in question if the VT is legal for the register class. But if the VT is i32 or i64, we won't find a matching register class and will instead up in the code modified in this patch.

If the requested register is x/y/zmm16-31 we weren't returning a register class that contains those registers and will hit an assertion in the caller.

To fix this, I've changed to use the extended register class instead. I don't believe we need a subtarget check to see if avx512 is enabled. The default implementation just pick whatever register class it finds first. I checked and we currently pick FR32X for XMM0 with an f32 type using the default implementation regardless of whether avx512 is enabled. So I assume its it is ok to do the same for i32.

Differential Revision: https://reviews.llvm.org/D61457

llvm-svn: 360102
2019-05-06 23:57:42 +00:00
Simon Pilgrim 07d91cd98a [X86] lowerVectorShuffle - use any_of to detect out of bounds shuffle indices. NFCI.
Fixes cppcheck local shadow warning as well.

llvm-svn: 360027
2019-05-06 10:11:24 +00:00
Luo, Yuanke beec41c656 Enable AVX512_BF16 instructions, which are supported for BFLOAT16 in Cooper Lake
Summary:
1. Enable infrastructure of AVX512_BF16, which is supported for BFLOAT16 in Cooper Lake;
2. Enable VCVTNE2PS2BF16, VCVTNEPS2BF16 and DPBF16PS  instructions, which are Vector Neural Network Instructions supporting BFLOAT16 inputs and conversion instructions from IEEE single precision.
VCVTNE2PS2BF16: Convert Two Packed Single Data to One Packed BF16 Data.
VCVTNEPS2BF16: Convert Packed Single Data to Packed BF16 Data.
VDPBF16PS: Dot Product of BF16 Pairs Accumulated into Packed Single Precision.
For more details about BF16 isa, please refer to the latest ISE document: https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference

Author: LiuTianle

Reviewers: craig.topper, smaslov, LuoYuanke, wxiao3, annita.zhang, RKSimon, spatel

Reviewed By: craig.topper

Subscribers: kristina, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60550

llvm-svn: 360017
2019-05-06 08:22:37 +00:00
Simon Pilgrim 5170c0e5fe Move getOpcode() call into if statement. NFCI.
Avoids a cppcheck "Local variable name shadows outer variable" warning. 

llvm-svn: 359991
2019-05-05 18:34:38 +00:00
Simon Pilgrim cbcd9b1b92 [X86] Fix some cppcheck "Local variable name shadows outer variable" warnings. NFCI.
llvm-svn: 359976
2019-05-05 12:00:14 +00:00
Simon Pilgrim b323d5ec7c [X86] LowerToHorizontalOp - Tidyup calls to getHopForBuildVector. NFCI.
Merge the if() tests for the various HADD/SUB + Subtarget tests

llvm-svn: 359901
2019-05-03 15:56:06 +00:00
Simon Pilgrim bfdd0f75a8 [X86] Remove repeated variables. NFCI.
llvm-svn: 359889
2019-05-03 14:37:00 +00:00
Simon Pilgrim aa49be4926 Avoid cppcheck operator precedence warnings. NFCI.
Prefer ((X & Y) ? A : B) to (X & Y ? A : B)

llvm-svn: 359884
2019-05-03 13:50:38 +00:00
Simon Pilgrim a359ef192b [X86] LowerMULH - remove unused Lo/Hi vector indices. NFCI.
Leftover from before we had the extract128BitVector helpers.

llvm-svn: 359871
2019-05-03 10:32:07 +00:00
Simon Pilgrim 88f9117168 Reduce variable scope to just the if() block its actually used in. NFCI.
llvm-svn: 359869
2019-05-03 10:13:41 +00:00
Craig Topper e1e38d4248 [X86] Correct the register class for specific mask register constraints in getRegForInlineAsmConstraint when the VT is a scalar type
The default impementation in the base class for TargetLowering::getRegForInlineAsmConstraint doesn't work for mask registers when the VT is a scalar type integer types since the only legal mask types are vXi1. So we end up just getting whatever the first register class that contains the register. Currently this appears to be VK1, but its really dependent on the order tablegen outputs the register classes.

Some code in the caller ends up looking up the type for this register class and find v1i1 then generates a copyfromreg from the physical k-register with the v1i1 type. Then it generates an any_extend from v1i1 to the scalar VT which isn't legal. This bad any_extend sticks around until isel where it selects a MOVZX32rr8 with a v1i1 input or maybe a i8 input. Not sure but eventually we pick up a copy from VK1 to GR8 in MachineIR which isn't supported. This leads to a failure in physical register copying.

This patch uses the scalar type to find a VK class of the right size. In the attached test case this will be VK16. This causes a bitcast from vk16 to i16 to be generated instead of an any_extend. This will be properly iseled to a VK16 to GR32 copy and a GR32->GR16 extract_subreg.

Fixes PR41678

Differential Revision: https://reviews.llvm.org/D61453

llvm-svn: 359837
2019-05-02 22:26:40 +00:00
Simon Pilgrim df8daf0ef4 [X86][SSE] lowerAddSubToHorizontalOp - enable ymm extraction+fold
Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting one upper xmm whatever, and we can remove a shuffle by using the hop which is inline with what shouldUseHorizontalOp expects to happen anyway.

Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result:
https://godbolt.org/z/0R-U-K

Differential Revision: https://reviews.llvm.org/D61426

llvm-svn: 359786
2019-05-02 14:00:55 +00:00
Simon Pilgrim 9fa56f7829 [X86][SSE] Move shouldUseHorizontalOp inside isHorizontalBinOp. NFCI.
Matches what we do for lowerAddSubToHorizontalOp and will make it easier to peek through subvectors to help fix PR39921

llvm-svn: 359782
2019-05-02 12:18:24 +00:00
Simon Pilgrim 9f04d97cd7 [X86][SSE] Fold scalar horizontal add/sub for non-0/1 element extractions
We already perform horizontal add/sub if we extract from elements 0 and 1, this patch extends it to non-0/1 element extraction indices (as long as they are from the lowest 128-bit vector).

Differential Revision: https://reviews.llvm.org/D61263

llvm-svn: 359707
2019-05-01 17:13:35 +00:00
Simon Pilgrim f5bdff7747 Fix 80 column violation. NFCI.
llvm-svn: 359694
2019-05-01 16:01:49 +00:00
Simon Pilgrim 6711b9699a [X86][SSE] Add demanded elts support X86ISD::PMULDQ\PMULUDQ
Add to SimplifyDemandedVectorEltsForTargetNode and SimplifyDemandedBitsForTargetNode

llvm-svn: 359686
2019-05-01 14:50:50 +00:00
Simon Pilgrim 3d6899e369 [X86][SSE] Add SSE vector shift support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359680
2019-05-01 13:51:09 +00:00
Simon Pilgrim ba372c6e62 [X86][SSE] Split 512-bit -> 128-bit vector directly in SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 359678
2019-05-01 12:48:42 +00:00
Simon Pilgrim 951a6b4579 [X86][SSE] Add 512-bit vector support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359677
2019-05-01 12:37:41 +00:00
Simon Pilgrim 37c2419cc7 [X86][SSE] Add X86ISD::PACKSS\PACKUS to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359673
2019-05-01 11:29:36 +00:00
Simon Pilgrim 3353cee06c [X86][SSE] Add X86ISD::UNPCKL\UNPCK to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359670
2019-05-01 11:08:03 +00:00
Simon Pilgrim f7b978a71b [X86][SSE] Move extract_subvector(pshufb) fold to SimplifyDemandedVectorEltsForTargetNode
This lets us hit more cases than combineExtractSubvector and allows us reuse more code.

llvm-svn: 359669
2019-05-01 10:58:38 +00:00
Simon Pilgrim a7d107a3e0 [X86] SimplifyDemandedVectorEltsForTargetNode - pull out vector halving code. NFCI.
Pull out the HADD/HSUB code to halve vector widths if the upper half isn't used - prep work to adding support for other opcodes.

llvm-svn: 359667
2019-05-01 10:38:10 +00:00
Simon Pilgrim 99eefe94b5 [X86][SSE] Extract i1 elements from vXi1 bool vectors
This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.

Differential Revision: https://reviews.llvm.org/D61189

llvm-svn: 359666
2019-05-01 10:02:22 +00:00
Simon Pilgrim 07ab4e7db8 [X86][SSE] Fold extract_subvector(extend(x)) -> extend_vector_inreg(x)
This adds any extend support - folding to zero_extend_vector_inreg (PMOVZX) for legality

Minor improvement for PR39709

llvm-svn: 359608
2019-04-30 20:31:07 +00:00
Simon Pilgrim 22641cc194 Fix for bug 41512: lower INSERT_VECTOR_ELT(ZeroVec, 0, Elt) to SCALAR_TO_VECTOR(Elt) for all SSE flavors
Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) insead of much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest element initialization in _mm_set_epi32.
So such inefficient lowering leads to significant performance digradations in ceratin cases switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512

Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable since latter is closer match to desired behavior and always efficiently lowered to movd and alike.

Committed on behalf of @Serge_Preis (Serge Preis)

Differential Revision: https://reviews.llvm.org/D60852

llvm-svn: 359545
2019-04-30 10:18:25 +00:00
Sjoerd Meijer 180f1ae57c [TargetLowering] Change getOptimalMemOpType to take a function attribute list
The MachineFunction wasn't used in getOptimalMemOpType, but more importantly,
this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType.

This is the groundwork for the changes in D59766 and D59787, that allows
implementation of TTI::getMemcpyCost.

Differential Revision: https://reviews.llvm.org/D59785

llvm-svn: 359537
2019-04-30 08:38:12 +00:00
Simon Pilgrim 028485d7b9 [X86][SSE] isHorizontalBinOp - add support for target shuffles
Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support.

This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type

Fixes PR39936 and should help with PR39920/PR39921

Differential Revision: https://reviews.llvm.org/D61245

llvm-svn: 359491
2019-04-29 19:52:59 +00:00
Simon Pilgrim d5cc753b6d [X86][SSE] combineExtractVectorElt - add early-out to return zero/undef for out-of-range extraction indices.
llvm-svn: 359406
2019-04-28 19:12:58 +00:00
Simon Pilgrim 22d1476bfa [X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.

llvm-svn: 359400
2019-04-28 14:31:01 +00:00
Simon Pilgrim 93ad48210c [X86][SSE] Optimize llvm.experimental.vector.reduce.xor.vXi1 parity reduction (PR38840)
An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0.

Differential Revision: https://reviews.llvm.org/D61230

llvm-svn: 359396
2019-04-28 10:46:17 +00:00
Simon Pilgrim 03c4e2663c Revert rL359389: [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.
........
Causing irregular buildbot failures.

llvm-svn: 359391
2019-04-27 20:44:08 +00:00
Simon Pilgrim 4118be3af6 [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.

llvm-svn: 359389
2019-04-27 20:04:44 +00:00
Simon Pilgrim 2a2d422400 [X86][AVX512] Improve vector bool reductions
As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern.

llvm-svn: 359386
2019-04-27 17:32:46 +00:00
Simon Pilgrim acc1e6d1c6 [X86][AVX] Merge mask select with shuffles across extract_subvector (PR40332)
Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector.

We can extend this in the future to handle more opcodes and non-zero selections.

llvm-svn: 359378
2019-04-27 13:35:32 +00:00
Craig Topper 063b471ff7 [X86] Use MOVQ for i64 atomic_stores when SSE2 is enabled
Summary: If we have SSE2 we can use a MOVQ to store 64-bits and avoid falling back to a cmpxchg8b loop. If its a seq_cst store we need to insert an mfence after the store.

Reviewers: spatel, RKSimon, reames, jfb, efriedma

Reviewed By: RKSimon

Subscribers: hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60546

llvm-svn: 359368
2019-04-27 03:38:15 +00:00
Simon Pilgrim 27e01e675c [X86][AVX] Fold extract_subvector(broadcast(x)) -> broadcast(x) iff x has one use
llvm-svn: 359332
2019-04-26 18:02:14 +00:00
Simon Pilgrim c3a34c3e07 Fix Wparentheses warning. NFCI.
llvm-svn: 359299
2019-04-26 12:23:42 +00:00
Simon Pilgrim bb230c5e79 [X86][SSE] Pull out OR(EXTRACTELT(X,0),OR(EXTRACTELT(X,1),...)) matching code from LowerVectorAllZeroTest
Create a matchBitOpReduction helper that checks for the pattern with any opcode.

First step towards reusing this code to recognize other scalar reduction patterns.

llvm-svn: 359296
2019-04-26 11:45:54 +00:00
Simon Pilgrim 5d6ef94c36 [X86][SSE] Disable shouldFoldConstantShiftPairToMask for btver1/btver2 targets (PR40758)
As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs).

Differential Revision: https://reviews.llvm.org/D61068

llvm-svn: 359293
2019-04-26 10:49:13 +00:00
Simon Pilgrim 5e161df9f8 [X86][AVX] Combine shuffles extracted from a common vector
A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380.

Differential Revision: https://reviews.llvm.org/D60512

llvm-svn: 359292
2019-04-26 09:56:14 +00:00
Simon Pilgrim 0a7d1b3ce1 [X86][SSE] combineBitcastvxi1 - add support for bitcasting to non-scalar integers
Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type.

We still have the issue identified in PR41594 but D61114 should handle this.

llvm-svn: 359176
2019-04-25 09:34:36 +00:00
Sanjay Patel b1b3368907 [x86] make sure horizontal op and broadcast types match to simplify (PR41414)
If the types don't match, we can't just remove the shuffle.
There may be some other opportunity for optimization here,
but this should prevent the crashing seen in:
https://bugs.llvm.org/show_bug.cgi?id=41414

llvm-svn: 359095
2019-04-24 14:05:08 +00:00
Simon Pilgrim d30745b2a0 [X86] Add shouldFoldConstantShiftPairToMask override placeholder. NFCI.
Prep work toward fixing PR40758

llvm-svn: 359088
2019-04-24 12:34:08 +00:00
Sanjay Patel 12a561fa1b [x86] use psubus for more vsetcc lowering (PR39859)
Circling back to a leftover bit from PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859#c1

...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'.
This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca.
We already do this transform for the SETULT predicate, so this makes the code more
symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect
anything but pre-SSE4.1 subtargets.

  $ cat before.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768]
	pxor	%xmm0, %xmm2
	pcmpgtw	-32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255]
	pand	%xmm2, %xmm0
	pandn	%xmm1, %xmm2
	por	%xmm2, %xmm0

  $ cat after.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [256,256,256,256,256,256,256,256]
	psubusw	%xmm0, %xmm2
	pxor	%xmm3, %xmm3
	pcmpeqw	%xmm2, %xmm3
	pand	%xmm3, %xmm0
	pandn	%xmm1, %xmm3
	por	%xmm3, %xmm0

  $ llvm-mca before.s -mcpu=haswell
  Iterations:        100
  Instructions:      600
  Total Cycles:      909
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    0.77
  IPC:               0.66
  Block RThroughput: 1.8

  $ llvm-mca after.s -mcpu=haswell
  Iterations:        100
  Instructions:      700
  Total Cycles:      409
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    1.71
  IPC:               1.71
  Block RThroughput: 1.8

Differential Revision: https://reviews.llvm.org/D60838

llvm-svn: 358999
2019-04-23 15:20:17 +00:00
Simon Pilgrim 0e4992ce27 [X86] Pull out collectConcatOps helper. NFCI.
Create collectConcatOps helper that returns all the subvector ops for CONCAT_VECTORS or a INSERT_SUBVECTOR series.

llvm-svn: 358989
2019-04-23 14:07:49 +00:00
Sanjay Patel bf8aacb715 [SelectionDAG] move splat util functions up from x86 lowering
This was supposed to be NFC, but the change in SDLoc
definitions causes instruction scheduling changes.

There's nothing x86-specific in this code, and it can
likely be used from DAGCombiner's simplifyVBinOp().

llvm-svn: 358930
2019-04-22 22:43:36 +00:00
Craig Topper 5c43ab337f [X86] Reject 512-bit types in getRegForInlineAsmConstraint when AVX512 is not enabled. Same for 256 bit and AVX.
llvm-svn: 358872
2019-04-22 06:12:02 +00:00
Craig Topper 3980d1ca6b [X86] Disable argument copy elision for arguments passed via pointers
Summary:
If you pass two 1024 bit vectors in IR with AVX2 on Windows 64. Both vectors will be split in four 256 bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory.

The PartOffsets stored for the second argument are all in terms of its original 1024 bit size. So the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8 byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too.

This issue was encountered by ISPC. I'm working on getting a reduce test case, but wanted to go ahead and get feedback on the fix.

Reviewers: rnk

Reviewed By: rnk

Subscribers: dbabokin, llvm-commits, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60801

llvm-svn: 358817
2019-04-20 15:26:44 +00:00
Simon Pilgrim 4171a91e92 [X86] combineVectorTruncationWithPACKUS - remove split/concatenation of mask
combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together.

This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).

This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.

My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.

Differential Revision: https://reviews.llvm.org/D60375#inline-539623

llvm-svn: 358692
2019-04-18 17:23:09 +00:00
Simon Pilgrim 8f87e53462 [X86][SSE] Lower ICMP EQ(AND(X,C),C) -> SRA(SHL(X,LOG2(C)),BW-1) iff C is power-of-2.
This replaces the MOVMSK combine introduced at D52121/rL342326

(movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))

with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra.

Differential Revision: https://reviews.llvm.org/D60625

llvm-svn: 358651
2019-04-18 09:58:59 +00:00
Simon Pilgrim e5573f4f4e [TargetLowering] Rename preferShiftsToClearExtremeBits and shouldFoldShiftPairToMask (PR41359)
As discussed on PR41359, this patch renames the pair of shift-mask target feature functions to make their purposes more obvious.

shouldFoldShiftPairToMask -> shouldFoldConstantShiftPairToMask

preferShiftsToClearExtremeBits -> shouldFoldMaskToVariableShiftPair

llvm-svn: 358526
2019-04-16 20:57:28 +00:00
Simon Pilgrim d769bb1e58 [X86][AVX] X86ISD::PERMV/PERMV3 node types can never fold index ops
Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.

This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle' that recognises that the op is used for a X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.

Differential Revision: https://reviews.llvm.org/D60562

llvm-svn: 358516
2019-04-16 19:18:53 +00:00
Craig Topper 0495f29e42 [X86] Limit the 'x' inline assembly constraint to zmm0-15 when used for a 512 type.
The 'v' constraint is used to select zmm0-31. This makes 512 bit consistent with 128/256-bit.a

llvm-svn: 358450
2019-04-15 21:06:32 +00:00
Craig Topper 3d9b47c770 [X86] Block i32/i64 for 'k' and 'Yk' in getRegForInlineAsmConstraint without avx512bw.
32 and 64 bit k-registers require avx512bw. If we don't block this properly, it leads to a crash.

llvm-svn: 358436
2019-04-15 18:39:45 +00:00
Simon Pilgrim 6c8f4ada36 [X86][SSE] Recognise vXi1 boolean anyof/allof reduction patterns
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.

This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.

This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.

Differential Revision: https://reviews.llvm.org/D60610

llvm-svn: 358286
2019-04-12 14:22:57 +00:00
Craig Topper 68a5d619a4 [X86] Restrict vselect handling in scalarizeExtEltFP to only case to pre type legalization where the setcc result type is vXi1.
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.

The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.

llvm-svn: 358218
2019-04-11 19:57:44 +00:00
Craig Topper 586fad50ac [X86] Add patterns for using movss/movsd for atomic load/store of f32/64. Remove atomic fadd pseudos use isel patterns instead.
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.

It also removes the pseudo instructions for atomic RMW fadd. Instead just adding isel patterns for folding an atomic load into addss/sd. And relying on the new movss/sd store pattern to handle the write part.

This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.

Differential Revision: https://reviews.llvm.org/D60394

llvm-svn: 358215
2019-04-11 19:19:52 +00:00
Craig Topper f7e548c076 Recommit r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
With correct test checks this time.

If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integ

This matches what gcc and icc do for this case and removes an existing FIXME.

llvm-svn: 358214
2019-04-11 19:19:42 +00:00
Craig Topper 8200880c9a Revert r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
I seem to have messed up the test checks.

llvm-svn: 358212
2019-04-11 19:04:38 +00:00
Craig Topper 1c2dfc3100 [X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.

This matches what gcc and icc do for this case and removes an existing FIXME.

Differential Revision: https://reviews.llvm.org/D60156

llvm-svn: 358211
2019-04-11 18:40:21 +00:00
Simon Pilgrim 40b647ae8e [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV3 mask support
Completes SimplifyDemandedVectorElts's basic variable shuffle mask support which should help D60512 + D60562 

llvm-svn: 358186
2019-04-11 15:29:15 +00:00
Simon Pilgrim 8a25154fa7 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV mask support
llvm-svn: 358174
2019-04-11 14:35:45 +00:00
Simon Pilgrim 6f3866c6fb [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMILPV mask support
llvm-svn: 358170
2019-04-11 14:15:01 +00:00
Simon Pilgrim cb5218ad48 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMIL2 mask support
llvm-svn: 358167
2019-04-11 14:04:19 +00:00
Simon Pilgrim e468cc7f14 [X86] SimplifyDemandedVectorElts - add VPPERM support
We need to add support for all variable shuffle mask ops, but VPPERM is the only one that already has test coverage.

llvm-svn: 358165
2019-04-11 13:30:38 +00:00
David Green 0861c87b06 Revert rL357745: [SelectionDAG] Compute known bits of CopyFromReg
Certain optimisations from ConstantHoisting and CGP rely on Selection DAG not
seeing through to the constant in other blocks. Revert this patch while we come
up with a better way to handle that.

I will try to follow this up with some better tests.

llvm-svn: 358113
2019-04-10 18:00:41 +00:00
Simon Pilgrim 37d8d55823 [X86][AVX] getTargetConstantBitsFromNode - extract bits from X86ISD::SUBV_BROADCAST
llvm-svn: 358096
2019-04-10 16:24:47 +00:00
Craig Topper 3a4c2192a4 [X86] Fix a couple lowering functions that called ReplaceAllUsesOfValueWith for the newly created code and then return SDValue(). Use MERGE_VALUES instead.
Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and end up being pruned later. But it was useless unnecessary work to create it.

Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements.

For one of the cases I had to use UNDEF has a dummy value for a result we know is unused. This should get pruned later.

llvm-svn: 357935
2019-04-08 19:44:07 +00:00
Sanjay Patel 50c3b290ed [x86] make 8-bit shl undesirable
I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops.

We've hinted at making all 8-bit ops undesirable for the reason in the code comment:

// TODO: Almost no 8-bit ops are desirable because they have no actual
//       size/speed advantages vs. 32-bit ops, but they do have a major
//       potential disadvantage by causing partial register stalls.

...but that leads to massive diffs and exposes all kinds of optimization holes itself.

Differential Revision: https://reviews.llvm.org/D60286

llvm-svn: 357912
2019-04-08 13:58:50 +00:00
Craig Topper 6a6da233b9 [X86] Make LowerOperationWrapper more robust. Remove now unnecessary ReplaceAllUsesWith from LowerMSCATTER.
Previously LowerOperationWrapper took the number of results from the original
node and counted that many results from the new node. This was intended to drop
chain operands from FP_TO_SINT lowering that uses X87 with memory operations to
stack temporaries. The final load had an extra chain output that needs to be
ignored.

Unfortunately, it didn't work with scatter which has 2 result operands, the
mask output which is discarded and a chain output. The chain output is the one
that is needed but it comes second and it would be dropped by the previous
logic here. To workaround this we were doing a ReplaceAllUses in the lowering
code so that the generic legalization code wouldn't see any uses to replace
since it had been given the wrong result/type.

After this change we take the LowerOperation result directly if the original
node has one result. This allows us to directly return the chain from scatter
or the load data from the FP_TO_SINT case. When the original node has multiple
results we'll ensure the returned node has the same number and copy them over.
For cases where the original node has multiple results and the new code for some
reason has even more results, MERGE_VALUES can be used to pass only the needed
results.

llvm-svn: 357887
2019-04-08 07:39:17 +00:00
Simon Pilgrim 07adb6abda [X86][SSE] SimplifyDemandedBitsForTargetNode - Add initial PACKSS support
In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands.

This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved.

Differential Revision: https://reviews.llvm.org/D60333

llvm-svn: 357859
2019-04-07 10:40:01 +00:00
Simon Pilgrim d0a53d4914 [X86] combineBitcastvxi1 - provide dst VT and src SDValue directly. NFCI.
Prep work to make it easier to reuse the BITCAST->MOVSMK combine in other cases.

llvm-svn: 357847
2019-04-06 18:54:17 +00:00
Simon Pilgrim af1cbdd3ba Fix spelling mistake. NFCI.
llvm-svn: 357843
2019-04-06 15:38:34 +00:00
Francis Visoiu Mistrih 9d9d1b6b2b [X86] Enable tail calls for CallingConv::Swift
It's currently only enabled on AArch64 (enabled in r281376).

llvm-svn: 357809
2019-04-05 20:18:25 +00:00
Craig Topper 80aa2290fb [X86] Merge the different Jcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon

Reviewed By: RKSimon

Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60228

llvm-svn: 357802
2019-04-05 19:28:09 +00:00
Piotr Sobczak 0376ac1d94 [SelectionDAG] Compute known bits of CopyFromReg
Summary:
Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual reg used has one def only.

This can be particularly useful when calling isBaseWithConstantOffset()
with the ISD::CopyFromReg argument, as more optimizations may get enabled
in the result.

Also add a missing truncation on X86, found by testing of this patch.

Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa

Reviewers: bogner, craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59535

llvm-svn: 357745
2019-04-05 07:44:09 +00:00
Craig Topper 94f1772b1e [X86] Promote i16 SRA instructions to i32
We already promote SRL and SHL to i32.

This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions.

I think there might be some DAG combine improvement opportunities in the test changes here.

Differential Revision: https://reviews.llvm.org/D60278

llvm-svn: 357743
2019-04-05 06:32:50 +00:00
Evandro Menezes 85bd3978ae [IR] Refactor attribute methods in Function class (NFC)
Rename the functions that query the optimization kind attributes.

Differential revision: https://reviews.llvm.org/D60287

llvm-svn: 357731
2019-04-04 22:40:06 +00:00
James Y Knight a040174418 Revert [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
It unnecessarily breaks previously-working code which used varargs,
but didn't pass any float/double arguments (such as EDK2).

Also revert the fixup on top of that:
Revert [X86] Fix a test from r357317

This reverts r357317 (git commit d413f41de6)
This reverts r357380 (git commit 7af32444b9)

llvm-svn: 357718
2019-04-04 19:05:48 +00:00
Sanjay Patel 17648b848e [x86] eliminate unnecessary broadcast of horizontal op
This is another pattern that comes up if we more aggressively
scalarize FP ops.

llvm-svn: 357703
2019-04-04 14:46:13 +00:00
Craig Topper 051bd16faf [X86] Remove CustomInserters for RDPKRU/WRPKRU. Use some custom lowering and new ISD opcodes instead.
These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers.

This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes
that take the extra zeroes as inputs. The zeros will then go through isel on their own to select
the MOV32r0 pseudo. Then we just need to mention the physical registers directly
in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary
copies to/from physical registers.

llvm-svn: 357659
2019-04-04 00:28:49 +00:00
Craig Topper 52cac4b79f [X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom instruction selection instead.
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think was largely so we could be pointer
size agnostic in the isel pattern.

To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.

With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.

I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defing rax or vice versa depending on whether we define the instruction to use EAX/RAX.

llvm-svn: 357652
2019-04-03 23:28:30 +00:00
Sanjay Patel c9a012e4ea [x86] fold shuffles of h-ops that have an undef operand
If an operand is undef, we can assume it's the same as the
other operand.

llvm-svn: 357644
2019-04-03 22:40:35 +00:00
Sanjay Patel 61b5e3c6a9 [x86] eliminate movddup of horizontal op
This pattern would show up as a regression if we more
aggressively convert vector FP ops to scalar ops.

There's still a missed optimization for the v4f64 legal
case (AVX) because we create that h-op with an undef operand.
We should probably just duplicate the operands for that
pattern to avoid trouble.

llvm-svn: 357642
2019-04-03 22:15:29 +00:00
Krzysztof Parzyszek 4841643a1d [X86] Extend boolean arguments to inline-asm according to getBooleanType
Differential Revision: https://reviews.llvm.org/D60208

llvm-svn: 357615
2019-04-03 17:43:14 +00:00
Simon Pilgrim 15919ad306 [X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 reduction on AVX1
Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result.

llvm-svn: 357611
2019-04-03 17:28:34 +00:00
Simon Pilgrim 9e28dddf55 [X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction on AVX1
Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets.

llvm-svn: 357608
2019-04-03 17:17:13 +00:00
Craig Topper 2e1bf89e3a [X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics.
This is the appropriate opcode for only having a chain output. Though I'm not
sure it matters much.

llvm-svn: 357375
2019-04-01 05:26:12 +00:00
Sanjay Patel e1bc360fc6 [x86] allow movmsk with 2-element reductions
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.

If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0

  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5

  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400

  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0

  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600

  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.

Differential Revision: https://reviews.llvm.org/D59997

llvm-svn: 357367
2019-03-31 15:11:34 +00:00
Simon Pilgrim 10c9032c02 [X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316)
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.

llvm-svn: 357351
2019-03-30 17:12:29 +00:00
Simon Pilgrim 3293455595 [X86][SSE] detectAVGPattern - begin generalizing ADD matches
Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316

llvm-svn: 357349
2019-03-30 15:31:53 +00:00
Amara Emerson d413f41de6 [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.

llvm-svn: 357317
2019-03-29 21:30:51 +00:00
Simon Pilgrim aeaf7fcdde [X86] Add X86TargetLowering::isCommutativeBinOp override.
We currently just have test coverage for PMULUDQ - will add more in the future.

llvm-svn: 357244
2019-03-29 11:25:58 +00:00
Sanjay Patel 5bbf6f0bd8 [x86] avoid cmov in movmsk reduction
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.

We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.

There seems to be a missing fold for sext of a bool bit here:

negl %ecx
movslq %ecx, %rax

...but that's an independent transform.

Differential Revision: https://reviews.llvm.org/D59818

llvm-svn: 357172
2019-03-28 14:16:13 +00:00
Sanjay Patel 1df0bb6264 [x86] improve AVX lowering of vector zext
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.

I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.

From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.

Differential Revision: https://reviews.llvm.org/D59777

llvm-svn: 357129
2019-03-27 22:42:11 +00:00
Sanjay Patel 704817912a [x86] look through bitcast operand of MOVMSK
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
  // Also support integer VTs to avoid a int->fp bitcast in the DAG.

llvm-svn: 357128
2019-03-27 22:24:03 +00:00
Simon Pilgrim ccb71b2985 Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249

llvm-svn: 357057
2019-03-27 10:25:02 +00:00
Simon Pilgrim 87d4ab8b92 [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.

llvm-svn: 356864
2019-03-24 19:06:35 +00:00
Simon Pilgrim a71c0ed471 [X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).

Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.

llvm-svn: 356858
2019-03-24 16:30:35 +00:00
Sanjay Patel 7d676dfd86 [x86] improve the default expansion of uaddsat/usubsat
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0

We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.

Differential Revision: https://reviews.llvm.org/D59006

llvm-svn: 356855
2019-03-24 13:55:54 +00:00
Sanjay Patel 2e92846d36 [x86] reduce code duplication; NFC
llvm-svn: 356836
2019-03-23 15:00:52 +00:00
Craig Topper ce1ed55a4a [X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences.

This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.

Differential Revision: https://reviews.llvm.org/D59662

llvm-svn: 356808
2019-03-22 20:47:02 +00:00
Craig Topper 1ffd8e8114 [X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enable
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.

I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.

So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?

I've left some FIXMEs for improvements we can do without SSE2.

Differential Revision: https://reviews.llvm.org/D59679

llvm-svn: 356807
2019-03-22 20:46:56 +00:00
Simon Pilgrim 564392d752 [X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203)
llvm-svn: 356784
2019-03-22 17:23:55 +00:00
Craig Topper b3bad3dce3 [X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI
For the future day when the pointer's don't have element types, we shoudl just use the type of the load result instead.

llvm-svn: 356721
2019-03-21 21:37:18 +00:00
Simon Pilgrim c2e4405475 [X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers
We don't use X86ISD::ANDNP for mask registers.

Test case from @craig.topper (Craig Topper)

llvm-svn: 356696
2019-03-21 18:32:38 +00:00
Craig Topper 8d46403b8e [X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set.
CMPXCHG8B was introduced on i586/pentium generation.

If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.

Differential Revision: https://reviews.llvm.org/D59576

llvm-svn: 356631
2019-03-20 23:35:49 +00:00
Craig Topper 0367553304 [X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend.
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immedate, GPR to k-register, and masked op.

I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.

This trades a load for the move immediate and GPR move which is higher latency.
But its probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.

Differential Revision: https://reviews.llvm.org/D59479

llvm-svn: 356618
2019-03-20 21:30:20 +00:00
Simon Pilgrim 2acca37a2d [X86] Use getConstantOperandAPInt to detect out-of-range shifts.
llvm-svn: 356549
2019-03-20 11:41:52 +00:00
Andrea Di Biagio 624f5deff4 [X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI
This patch removes the following dag node opcodes from namespace X86ISD:

RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG

The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:

    RDTSC/RDTSCP don't implicitly read ECX.
    RDTSCP also implicitly writes ECX.

I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.

No functional change intended.

Differential Revision: https://reviews.llvm.org/D59547

llvm-svn: 356546
2019-03-20 11:21:15 +00:00
Simon Pilgrim e744f513c4 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts
If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed.

llvm-svn: 356483
2019-03-19 17:23:25 +00:00
Jordan Rupprecht f74d45a775 [NFC] Fix unused variable in release builds
This was introduced in rL356468.

llvm-svn: 356477
2019-03-19 16:52:40 +00:00
Simon Pilgrim a56f2822d0 [SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect
These changes are related to PR37743 and include:

    SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.

    Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.

    Add promoting the integer ABS node in the LegalizeIntegerType.

    Expand-based legalization of integer result for the ABS nodes.

    Expand-based legalization of ABS vector operations.

    Add some integer abs testcases for different typesizes for Thumb arch

    Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
        tmp = (SRA, Hi, 31)
        Lo = (UADDO tmp, Lo)
        Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
        Lo = (XOR tmp, Lo)

    The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
        (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).

    Change integer abs testcases for codegen with the ABS node support for AArch64.
        Indicate that the ABS is legal for the i64 type when the NEON is supported.
        Change the integer abs testcases to show changing of codegen.

    Add combine and legalization of ABS nodes for Thumb arch.

    Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.

For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743

Patch by: @ikulagin (Ivan Kulagin)

Differential Revision: https://reviews.llvm.org/D49837

llvm-svn: 356468
2019-03-19 16:24:55 +00:00
Adhemerval Zanella 664c1ef528 [TargetLowering] Add code size information on isFPImmLegal. NFC
This allows better code size for aarch64 floating point materialization
in a future patch.

Reviewers: evandro

Differential Revision: https://reviews.llvm.org/D58690

llvm-svn: 356389
2019-03-18 18:40:07 +00:00
Simon Pilgrim f2c53b5d6c [X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.
Replaces existing i1-only fold.

llvm-svn: 356325
2019-03-16 15:02:00 +00:00
Simon Pilgrim 0f472e1d01 [X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

llvm-svn: 356324
2019-03-16 14:29:50 +00:00
Roman Lebedev 9f37790608 [X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

llvm-svn: 356301
2019-03-15 21:18:05 +00:00
Roman Lebedev b6e376ddfa [X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
  That is because there we are computing mid point between zero and some random number,
  thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
  `cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted,  then the PGO
  will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, i'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

llvm-svn: 356300
2019-03-15 21:17:53 +00:00
Craig Topper af856db961 [X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table.
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.

Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.

Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.

llvm-svn: 356294
2019-03-15 19:59:35 +00:00
Simon Pilgrim d33e62c826 [X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x)))
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.

llvm-svn: 356292
2019-03-15 19:14:28 +00:00
Simon Pilgrim 65165d54bb [X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW
llvm-svn: 356270
2019-03-15 16:16:49 +00:00
Simon Pilgrim 0ad17402a9 [X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm.
Prep work for PR40203

llvm-svn: 356249
2019-03-15 11:05:42 +00:00
Sanjay Patel 5d1df114e8 [x86] prevent infinite looping from vselect commutation (PR41066)
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.

llvm-svn: 356158
2019-03-14 15:32:34 +00:00
Craig Topper 84abec2855 [X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b()
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.

We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.

I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.

Differential Revision: https://reviews.llvm.org/D59308

llvm-svn: 356078
2019-03-13 18:48:50 +00:00
Simon Pilgrim bef4fe056d [X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 356067
2019-03-13 17:00:18 +00:00
Simon Pilgrim d9aa879b67 [X86][AVX] Add combineConcatVectors support to improve subvector handling
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.

This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.

The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.

llvm-svn: 356064
2019-03-13 16:37:30 +00:00
Sanjay Patel 0a251e4076 [x86] limit extractelement of setcc to pre-legalization
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700

The bug was introduced recently here:
rL355741

This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8). 

llvm-svn: 356053
2019-03-13 14:49:52 +00:00
Simon Pilgrim 0c1e5aacd3 Fix signed/unsigned mismatch warning. NFCI.
llvm-svn: 356046
2019-03-13 13:14:14 +00:00
Simon Pilgrim 7abbd70300 [X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.

This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.

Differential Revision: https://reviews.llvm.org/D58888

llvm-svn: 356043
2019-03-13 12:20:39 +00:00
Sanjay Patel 737c27a9cd [x86] scalarize extractelement 0 of FP vselect
llvm-svn: 355955
2019-03-12 19:20:45 +00:00
Craig Topper f19d6a4073 [X86] Add SCALAR_SINT_TO_FP/SCALAR_UINT_TO_FP ISD opcodes without rounding mode.
After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those.

llvm-svn: 355807
2019-03-11 04:37:01 +00:00
Craig Topper ecbc141dbf [X86] Split SCALEF(S) ISD opcodes into a version without rounding mode.
llvm-svn: 355806
2019-03-11 04:36:59 +00:00
Craig Topper a0b5338834 [X86] Split RCP28/RSQRT/GETEXP/EXP2 ISD opcodes into SAE and current direction nodes. Remove rounding mode operand.
llvm-svn: 355805
2019-03-11 04:36:57 +00:00
Craig Topper ba7d654526 [X86] Rename _RND versions of RANGE/REDUCE/GETMANT/RDNSCALE ISD opcodes to _SAE. Remove SAE operand.
No need to explicitly store it and match it during isel.

llvm-svn: 355804
2019-03-11 04:36:55 +00:00
Craig Topper 244ffcdf0d [X86] Rename X86ISD::CVTPH2PS_RND to CVTPH2PS_SAE. Remove SAE operand.
llvm-svn: 355803
2019-03-11 04:36:53 +00:00
Craig Topper 6059b1737e [X86] Rename the CVTT*_RND ISD nodes to _SAE and remove the SAE operand. Split VFPROUNDS_RND/VFPEXT(S)_RND into versions without rounding operand.
For VFPEXT(S) we only need current rounding mode and an SAE version. Neither need extra operand.

llvm-svn: 355802
2019-03-11 04:36:51 +00:00
Craig Topper 4c544ca993 [X86] Rename X86ISD::CMPM_RND and X86ISD::FSETCCM_RND to _SAE instead of _RND. Remove rounding operand.
The operand could only be the SAE encoding so no need to include it.

llvm-svn: 355801
2019-03-11 04:36:49 +00:00
Craig Topper 704303a2a1 [X86] Split the VFIXUPIMM/VFIXUPIMMS nodes into a current rounding mode and SAE ISD opcode.
Remove matching of FROUND_CURRENT and FROUND_NO_EXC for these nodes from isel table.

llvm-svn: 355800
2019-03-11 04:36:47 +00:00
Craig Topper b7e6bfe579 [X86] Begin removing matching of FROUND_CURRENT and FROUND_NO_EXC from isel tables.
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.

This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.

llvm-svn: 355799
2019-03-11 04:36:44 +00:00
Sanjay Patel 26e06e859e [x86] add x86-specific opcodes to extractelement scalarization list
llvm-svn: 355792
2019-03-10 18:56:21 +00:00
Craig Topper 66c9690ad6 [X86] Remove unused variable. NFC
llvm-svn: 355790
2019-03-10 17:36:41 +00:00
Craig Topper 93e15dfacc [X86] Make lowering of intrinsics with rounding mode stricter so that only valid rounding modes are lowered. Update tests accordingly
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.

With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.

llvm-svn: 355789
2019-03-10 17:20:45 +00:00
Craig Topper 0dc8c52d4e [X86] Remove dead code from the handler for INTR_TYPE_SCALAR_MASK_RM.
The code in here handles nodes with 6 or 7 operands. But only the 6 operand case is ever used these days.

llvm-svn: 355788
2019-03-10 17:20:42 +00:00
Sanjay Patel f84083b4db [x86] scalarize extract element 0 of FP cmp
An extension of D58282 noted in PR39665:
https://bugs.llvm.org/show_bug.cgi?id=39665

This doesn't answer the request to use movmsk, but that's an
independent problem. We need this and probably still need
scalarization of FP selects because we can't do that as a
target-independent transform (although it seems likely that
targets besides x86 should have this transform).

llvm-svn: 355741
2019-03-08 21:54:41 +00:00
Sanjay Patel b22f438df3 [x86] prevent infinite looping from inverse shuffle transforms
llvm-svn: 355713
2019-03-08 19:20:28 +00:00
Craig Topper 3acc4236b8 [X86] Enable combineFMinNumFMaxNum for 512 bit vectors when AVX512 is enabled.
Simplified by just checking if the vector type is legal rather than listing all combinations of types and features.

Fixes PR40984.

llvm-svn: 355582
2019-03-07 06:30:19 +00:00
Simon Pilgrim 9d6347cfc1 [DAGCombine] Improve select (not Cond), N1, N2 -> select Cond, N2, N1 fold
Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well.

Requested by @spatel on D59006

llvm-svn: 355533
2019-03-06 18:52:52 +00:00
Simon Pilgrim 468bb2e601 [X86][SSE] VSELECT(XOR(Cond,-1), LHS, RHS) --> VSELECT(Cond, RHS, LHS)
As noticed on D58965

DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point.

Differential Revision: https://reviews.llvm.org/D58974

llvm-svn: 355494
2019-03-06 10:54:43 +00:00
Jeremy Morse 09d8ea5282 [X86] Avoid codegen changes when DBG_VALUE appears between lowered selects
X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo
instructions without accounting for debug intrinsics. This leads to different
codegen with and without option -g, if a DBG_VALUE instruction lands in the
middle of several lowered selects.

Work around this by skipping over debug instructions when looking for CMOV
sequences, and sinking those debug insts into the EmitLoweredSelect sunk block.
This might slightly shift where variables appear in the instruction sequence,
but won't re-order assignments.

Differential Revision: https://reviews.llvm.org/D58672

llvm-svn: 355307
2019-03-04 10:56:02 +00:00
Simon Pilgrim e48be5d698 Remove unused variable. NFCI.
llvm-svn: 355289
2019-03-03 14:23:07 +00:00
Simon Pilgrim d8e91a54c0 [X86] getShuffleScalarElt - peek through insert/extract subvector nodes.
llvm-svn: 355288
2019-03-03 14:11:05 +00:00
Simon Pilgrim 11149ea433 [X86] Pull out combineToConsecutiveLoads helper. NFCI.
llvm-svn: 355287
2019-03-03 13:53:27 +00:00
Amaury Sechet f24abf6511 [X86] Improve use of SHLD/SHRD
Summary:
This extends the variety of pattern that can generate a SHLD instead of using two shifts.

This fixes a regression that would be introduced by D57367 or D33587

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57389

llvm-svn: 355260
2019-03-02 02:44:16 +00:00
Sanjay Patel 7fc6ef7dd7 [x86] scalarize extract element 0 of FP math
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the tests diffs:

  1. Memory loads may get scalarized resulting in more efficient code.
  2. Memory stores may get scalarized resulting in more efficient code.
  3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
  4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.

The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.

Differential Revision: https://reviews.llvm.org/D58282

llvm-svn: 355130
2019-02-28 19:47:04 +00:00
Craig Topper 38427c47b9 [X86] Don't peek through bitcasts before checking ISD::isBuildVectorOfConstantSDNodes in combineTruncatedArithmetic
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop.

Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.

Differential Revision: https://reviews.llvm.org/D58705

llvm-svn: 355116
2019-02-28 18:49:29 +00:00
Bjorn Pettersson d30f308a9f Add support for computing "zext of value" in KnownBits. NFCI
Summary:
The description of KnownBits::zext() and
KnownBits::zextOrTrunc() has confusingly been telling
that the operation is equivalent to zero extending the
value we're tracking. That has not been true, instead
the user has been forced to explicitly set the extended
bits as known zero afterwards.

This patch adds a second argument to KnownBits::zext()
and KnownBits::zextOrTrunc() to control if the extended
bits should be considered as known zero or as unknown.

Reviewers: craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58650

llvm-svn: 355099
2019-02-28 15:45:29 +00:00
Simon Pilgrim 134bc19079 [X86][AVX] Remove superfluous insert_subvector(zero, bitcast(x)) -> bitcast(insert_subvector(zero, x)) fold
This is caught by other existing bitcast folds.

llvm-svn: 355084
2019-02-28 11:39:52 +00:00
Simon Pilgrim 87aeff8bbb [X86][AVX] Fold vf64 concat_vectors(movddup(x),movddup(x)) -> broadcast(x)
llvm-svn: 355078
2019-02-28 10:53:58 +00:00
Simon Pilgrim 1001a6ab03 [X86][AVX] Pull out some INSERT_SUBVECTOR combines into a combineConcatVectorOps helper. NFCI
A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node.

I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered.

Differential Revision: https://reviews.llvm.org/D58583

llvm-svn: 355015
2019-02-27 18:46:32 +00:00
Simon Pilgrim 71bb6850cf [X86][AVX] Only combine loads to broadcasts for legal types
Thanks to @echristo for spotting this.

llvm-svn: 354961
2019-02-27 11:17:25 +00:00
Reid Kleckner 2f055f026a [X86] Fix bug in x86_intrcc with arg copy elision
Summary:
Use a custom calling convention handler for interrupts instead of fixing
up the locations in LowerMemArgument. This way, the offsets are correct
when constructed and we don't need to account for them in as many
places.

Depends on D56883

Replaces D56275

Reviewers: craig.topper, phil-opp

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D56944

llvm-svn: 354837
2019-02-26 02:11:25 +00:00
Simon Pilgrim c61f1e8e6c [X86] Merge ISD::ADD/SUB nodes into X86ISD::ADD/SUB equivalents (PR40483)
Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.

Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.

Differential Revision: https://reviews.llvm.org/D58597

llvm-svn: 354771
2019-02-25 11:19:37 +00:00
Simon Pilgrim cfaf663a35 [X86] Combine zext(packus(x),packus(y)) -> concat(x,y) (PR39637)
Its proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.

llvm-svn: 354757
2019-02-24 19:57:52 +00:00
Craig Topper be3348573e [LegalizeTypes][AArch64][X86] Make type legalization of vector (S/U)ADD/SUB/MULO follow getSetCCResultType for the overflow bits. Make UnrollVectorOverflowOp properly convert from scalar boolean contents to vector boolean contents
Summary:
When promoting the over flow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.

We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put a the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.

Reviewers: spatel, RKSimon, nikic

Reviewed By: nikic

Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58567

llvm-svn: 354753
2019-02-24 19:23:36 +00:00
Simon Pilgrim 4f4f9abdfa [X86][AVX] Rename lowerShuffleByMerging128BitLanes to lowerShuffleAsLanePermuteAndRepeatedMask. NFC.
Name better matches the other similar 'lane permute' and 'repeated mask' functions we have.

llvm-svn: 354749
2019-02-24 17:30:06 +00:00
Craig Topper be9eeb5526 Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
And its follow ups r354511, r354640.

A follow patch will fix the issue that caused it to be reverted.

llvm-svn: 354737
2019-02-23 21:41:42 +00:00
Craig Topper ccc860cb81 Recommit r354647 and r354648 "[LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract"
r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."

These were reverted in r354713 as their context depended on other patches that were reverted for a bug.

llvm-svn: 354734
2019-02-23 19:51:32 +00:00
Simon Pilgrim f383a47b7d [X86][AVX] combineInsertSubvector - remove concat_vectors(load(x),load(x)) --> sub_vbroadcast(x)
D58053/rL354340 added this to EltsFromConsecutiveLoads directly

llvm-svn: 354732
2019-02-23 18:53:03 +00:00
Simon Pilgrim e08f177ea2 [X86][AVX] concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) --> broadcast(x)
For AVX1, limit this to i32/f32/i64/f64 loading cases only.

llvm-svn: 354730
2019-02-23 18:34:05 +00:00
Simon Pilgrim 31793733a0 [X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place
Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD.

llvm-svn: 354729
2019-02-23 17:10:47 +00:00
Reid Kleckner e3876637cf Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
r354363 caused https://crbug.com/934963#c1, which has a plain C reduced
test case.

I also had to revert some dependent changes:
- r354648
- r354647
- r354640
- r354511

llvm-svn: 354713
2019-02-23 01:19:42 +00:00
Craig Topper a9697f24cf [X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalize to 128-bits.
If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.

Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.

llvm-svn: 354709
2019-02-23 00:35:02 +00:00
Sanjay Patel a9e289174a [x86] allow narrowing of vector UINT_TO_FP
As discussed in:
D56864
D58197

Always use the narrow (128-bit) instruction when possible.
We already had the signed int version of this transform.

llvm-svn: 354675
2019-02-22 15:47:45 +00:00
Sanjay Patel 1baf7896cc [x86] simplify code in combineExtractSubvector; NFC
Only the 1st fold is attempted pre-legalization, but it requires
legal (simple) types too, so we don't need an EVT in any of the code.

llvm-svn: 354674
2019-02-22 15:28:22 +00:00
Craig Topper 3a391fc0e8 [X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit.
Type legalization is causing two nodes to be created here, but we can use a single node to extend from v8i16 to v2i64.

llvm-svn: 354648
2019-02-22 01:49:53 +00:00
Sanjay Patel 234a5e8ea4 [x86] vectorize more cast ops in lowering to avoid register file transfers
This is a follow-up to D56864.

If we're extracting from a non-zero index before casting to FP,
then shuffle the vector and optionally narrow the vector before doing the cast:

cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0

This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974

Differential Revision: https://reviews.llvm.org/D58197

llvm-svn: 354619
2019-02-21 20:40:39 +00:00
Nirav Dave dce91c1edb [X86] Fix copy-paste error in @ccz flag.
@ccz operand should be equivalent to @cce.

llvm-svn: 354588
2019-02-21 15:28:31 +00:00
Simon Pilgrim e6b338cbef [X86][SSE] combineX86ShufflesRecursively - moved to generic op input index lookup. NFCI.
We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement.

llvm-svn: 354575
2019-02-21 12:24:49 +00:00
Nikita Popov c3b496de7a [SDAG] Support vector UMULO/SMULO
Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.

This adds an extra UnrollVectorOverflowOp() method to SDAG, because
the general UnrollOverflowOp() method can't deal with multiple results.

Additionally we need to expand UMULO/SMULO during vector op
legalization, as it may result in unrolling, which may need additional
type legalization.

Differential Revision: https://reviews.llvm.org/D57997

llvm-svn: 354513
2019-02-20 20:41:44 +00:00
Simon Pilgrim dca47c659c [X86][SSE] combineX86ShufflesRecursively - begin generalizing the number of shuffle inputs. NFCI.
We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit.

llvm-svn: 354489
2019-02-20 17:58:29 +00:00
Simon Pilgrim 0b3b9424ca [X86][SSE] Generalize X86ISD::BLENDI support to more value types
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.

This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.

Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361

Differential Revision: https://reviews.llvm.org/D57888

llvm-svn: 354363
2019-02-19 18:05:42 +00:00
Simon Pilgrim d6add74915 Cast from SDValue directly instead of superfluous getNode(). NFCI.
llvm-svn: 354343
2019-02-19 16:20:09 +00:00
Simon Pilgrim 952abcefe4 [X86][AVX] EltsFromConsecutiveLoads - Add BROADCAST lowering support
This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.

It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.

Differential Revision: https://reviews.llvm.org/D58053

llvm-svn: 354340
2019-02-19 15:57:09 +00:00
Sanjay Patel d8b4efcb6b [CGP] form usub with overflow from sub+icmp
The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754
https://bugs.llvm.org/show_bug.cgi?id=40487
..and those are shown in the IR test file and x86 codegen file.

Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.

This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo
formation by default. Only x86 overrides that setting for now although other targets will likely benefit
by forming usbuo too.

Differential Revision: https://reviews.llvm.org/D57789

llvm-svn: 354298
2019-02-18 23:33:05 +00:00
Sanjay Patel fff628274d [x86] split more v8f32/v8i32 shuffles in lowering
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.

This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.

Differential Revision: https://reviews.llvm.org/D58181

llvm-svn: 354279
2019-02-18 16:46:12 +00:00
Craig Topper ce3c5ac6a6 [X86] In FP_TO_INTHelper, when moving data from SSE register to X87 register file via the stack, use the same stack slot we use for the integer conversion.
No need for a separate stack slot. The lifetimes don't overlap.

Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot.

llvm-svn: 354234
2019-02-17 19:23:49 +00:00
Craig Topper db5aa955cb [X86] When type legalizing the result of a i64 fp_to_uint on 32-bit targets. Generate all of the ops as i64 and let them be legalized.
No need to manually split everything. We can let the type legalizer work for us.

The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86SISD::CMP. This dropped the use count of the setcc, but wasn't what was intended.

llvm-svn: 354197
2019-02-16 08:25:42 +00:00
Craig Topper db2f084aa9 [X86] Don't set exception mask bits when modifying FPCW to change rounding mode for fp->int conversion
When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption.

This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block)

Differential Revision: https://reviews.llvm.org/D57788

llvm-svn: 354178
2019-02-15 21:59:33 +00:00
Nirav Dave 7875841121 [X86] Fix LowerAsmOutputForConstraint.
Summary:
Update Flag when generating cc output.

Fixes PR40737.

Reviewers: rnk, nickdesaulniers, craig.topper, spatel

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58283

llvm-svn: 354163
2019-02-15 20:01:55 +00:00
Craig Topper 9c6a9276da [X86] Move all the SSE legality checks out of FP_TO_INTHelper and up to LowerFP_TO_INT. NFCI
These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there.

Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner.

llvm-svn: 354161
2019-02-15 19:21:39 +00:00
Simon Pilgrim 6ce08672fb [X86][AVX] lowerShuffleAsLanePermuteAndPermute - fully populate the lane shuffle mask (PR40730)
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.

This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.

This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but its the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch.

Differential Revision: https://reviews.llvm.org/D58237

llvm-svn: 354117
2019-02-15 11:39:21 +00:00
Nirav Dave 5ffdc43dc9 [X86] cleanup inline asm register generation. NFCI.
llvm-svn: 354042
2019-02-14 18:06:21 +00:00
Craig Topper 1d158dd930 [X86] Make (f80 (sint_to_fp (i16))) use fistps/fisttps instead of fistpl/fisttpl when SSE is enabled.
When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion.

Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type.

llvm-svn: 354003
2019-02-14 01:41:43 +00:00
Craig Topper 9b61f48e4b [X86] Use default expansion for (i64 fp_to_uint f80) when avx512 is enabled on 64-bit targets to match what happens without avx512.
In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this.

llvm-svn: 353921
2019-02-13 07:42:34 +00:00
Craig Topper 3099e442a6 [X86] Refactor the FP_TO_INTHelper interface. NFCI
-Pull the final stack load creation from the two callers into the helper.
-Return a single SDValue instead of a std::pair.
-Remove the Replace flag which isn't really needed.

llvm-svn: 353920
2019-02-13 07:42:31 +00:00
Simon Pilgrim 5338f41ced [X86][AVX] Enable shuffle combining support for zero_extend
A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars.

llvm-svn: 353860
2019-02-12 17:22:35 +00:00
Craig Topper 7670ede434 [X86] Collapse FP_TO_INT16_IN_MEM/FP_TO_INT32_IN_MEM/FP_TO_INT64_IN_MEM into a single opcode using memory VT to distinquish. NFC
llvm-svn: 353798
2019-02-12 06:14:18 +00:00
Craig Topper d7303ecd0b [X86] Remove the value type operand from the floating point load/store MemIntrinsicSDNodes. Use the MemoryVT instead. NFCI
We already have the memory VT, we can just match from that during isel.

llvm-svn: 353797
2019-02-12 06:14:16 +00:00
Craig Topper 75eb0af874 [X86] Correct the memory operand for the FLD emitted in FP_TO_INTHelper for 32-bit SSE targets.
We were using DstTy, but that represents the integer type we are converting to which is i64 in this
case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers.
If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte
access. The store we used to get from SSE to the stack is using the corect size.

While there, consistenly use TheVT in place of Op.getOperand(0).getValueType() throughout the function.

llvm-svn: 353745
2019-02-11 20:38:10 +00:00
Sam McCall e825ba9165 Revert "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
This reverts commit r353610.
It causes a miscompile visible in macro expansion in a bootstrapped clang.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190211/626590.html

llvm-svn: 353699
2019-02-11 14:05:36 +00:00
Simon Pilgrim f6e6c369c0 [X86] EltsFromConsecutiveLoads - replace SmallBitVector with APInt (NFC).
Minor refactor to simplify some incoming patches to improve broadcast loads.

llvm-svn: 353655
2019-02-10 22:45:48 +00:00
Sanjay Patel 833550fc74 [x86] narrow 256-bit horizontal ops via demanded elements
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).

The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.

All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.

In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758
https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.

I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.

Differential Revision: https://reviews.llvm.org/D57841

llvm-svn: 353641
2019-02-10 15:22:06 +00:00
Simon Pilgrim 6bf7b30b10 [X86] CombineOr - fold to generic funnel shifts
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.

There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.

llvm-svn: 353626
2019-02-09 20:34:59 +00:00
Simon Pilgrim 690a2889d8 [X86][SSE] Generalize X86ISD::BLENDI support to more value types
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.

This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.

Differential Revision: https://reviews.llvm.org/D57888

llvm-svn: 353610
2019-02-09 13:13:59 +00:00
Sanjay Patel e9cc26a56a [x86] fix formatting; NFC
(test commit #2 migrating to git)

llvm-svn: 353533
2019-02-08 16:48:40 +00:00
Craig Topper 738180cc7f Fix the lowering issue of intrinsics llvm.localaddress on X86
Patch by Yuanke Luo

Reviewers: craig.topper, annita.zhang, smaslov, rnk, wxiao3

Reviewed By: rnk

Subscribers: efriedma, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57501

llvm-svn: 353492
2019-02-08 01:14:12 +00:00
Sanjay Patel 81f859d169 [x86] fix formatting; NFC
llvm-svn: 353477
2019-02-07 22:36:55 +00:00
Simon Pilgrim fe3ac70b18 [DAGCombiner] (add (umax X, C), -C) --> (usubsat X, C) (PR40111)
Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner

First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111.

Differential Revision: https://reviews.llvm.org/D57754

llvm-svn: 353457
2019-02-07 20:14:43 +00:00
Sanjay Patel a5c4a5e958 [x86] split more 256/512-bit shuffles in lowering
This is intentionally a small step because it's hard to know exactly 
where we might introduce a conflicting transform with the code that 
tries to form wider shuffles. But I think this is safe - if we have 
a wide shuffle with 2 operands, then we should do better with an 
extract + narrow shuffle.

Differential Revision: https://reviews.llvm.org/D57867

llvm-svn: 353427
2019-02-07 17:10:49 +00:00
Nirav Dave 84e5bf0c95 [X86] Simplify casing. NFC.
llvm-svn: 353417
2019-02-07 15:43:40 +00:00
Nirav Dave c6bfa103a5 [X86][DAG] Avoid creating dangling bitcast.
combineExtractWithShuffle may leave a dangling bitcast which may
prevent further optimization in later passes. Avoid constructing it
unless it is used.

llvm-svn: 353333
2019-02-06 19:45:47 +00:00
Nirav Dave e5c37958f9 [InlineAsm][X86] Add backend support for X86 flag output parameters.
Allow custom handling of inline assembly output parameters and add X86
flag parameter support.

llvm-svn: 353307
2019-02-06 15:26:29 +00:00
Sanjay Patel e84fbb67a1 [x86] vectorize cast ops in lowering to avoid register file transfers
The proposal in D56796 may cross the line because we're trying to avoid vectorization 
transforms in generic DAG combining. So this is an alternate, later, x86-specific 
translation of that patch.

There are several potential follow-ups to enhance this:
1. Allow extraction from non-zero element index.
2. Peek through extends of smaller width integers.
3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P

Differential Revision: https://reviews.llvm.org/D56864

llvm-svn: 353302
2019-02-06 14:59:39 +00:00
Simon Pilgrim b0afc69435 [X86][SSE] Disable ZERO_EXTEND shuffle combining
rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly.

llvm-svn: 353198
2019-02-05 19:15:48 +00:00
Simon Pilgrim 822d2e35e7 [X86][AVX] Attempt to combine shuffles to subvector broadcast load
llvm-svn: 353189
2019-02-05 17:02:49 +00:00
Simon Pilgrim 62af24cc93 [X86][SSE] Add SimplifyDemandedVectorElts support for X86ISD::BLENDV
llvm-svn: 353165
2019-02-05 12:27:29 +00:00
Simon Pilgrim 9e595e3663 [X86][AVX] Attempt to share broadcasts of different widths (PR39454)
If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free).

Differential Revision: https://reviews.llvm.org/D57663

llvm-svn: 353154
2019-02-05 10:58:43 +00:00
Craig Topper f86eb00f12 [X86] Connect the default fpsr and dirflag clobbers in inline assembly to the registers we have defined for them.
Summary:
We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly.

This could have problems for proper dependency tracking in the machine schedulers though I don't have a test case that shows that.

Reviewers: rnk

Reviewed By: rnk

Subscribers: eraman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57641

llvm-svn: 353141
2019-02-05 06:13:06 +00:00
Simon Pilgrim 6e5350a367 [X86][SSE] SimplifyDemandedBitsForTargetNode - PCMPGT(0,X) sign mask
For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly.

Differential Revision: https://reviews.llvm.org/D57667

llvm-svn: 353051
2019-02-04 15:43:36 +00:00
Simon Pilgrim 9899967464 Use auto for dyn_cast case to save a line. NFCI.
llvm-svn: 353041
2019-02-04 12:32:39 +00:00
Simon Pilgrim 1fce5a8b75 [X86][AVX] Support shuffle combining for VBROADCAST with smaller vector sources
getTargetShuffleMask can only do this safely if we're extracting the lowest subvector from a vector of the same result type.

llvm-svn: 352999
2019-02-03 16:51:33 +00:00
Simon Pilgrim 18b73a655b [X86][AVX] Support shuffle combining for VPMOVZX with smaller vector sources
llvm-svn: 352997
2019-02-03 16:10:18 +00:00
Simon Pilgrim a2a3e5b811 [X86][AVX] More aggressively simplify BROADCAST source operand
Aim to use scalar source or lowest 128-bit vector directly.

We're still missing some VZMOVL_LOAD combines.

llvm-svn: 352994
2019-02-03 14:39:41 +00:00
Craig Topper 950ca192f6 [X86] Lower ISD::UADDO to use the Z flag instead of C flag when the RHS is a constant 1 to encourage INC formation.
Summary:
Add an additional combine to combineCarryThroughADD to reverse it back to the C flag to avoid regressions.

I believe this catches the cases that D57547 got.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57637

llvm-svn: 352984
2019-02-03 07:25:06 +00:00
Simon Pilgrim dbf302c9f1 [X86][AVX] Enable INSERT_SUBVECTOR(SRC0, SHUFFLE(SRC1)) shuffle combining
Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles.

The is exposes a couple of minor issues that will be fixed shortly:
Missed broadcast folds - we have a mixture of vzext_load lengths that need cleaning up
combine-sdiv.ll - AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts).

llvm-svn: 352963
2019-02-02 18:08:04 +00:00
Simon Pilgrim bd42f97946 [SDAG] Add SDNode/SDValue getConstantOperandAPInt helper. NFCI.
We already have the getConstantOperandVal helper which returns a uint64_t, but along comes the fuzzer and inserts a i128 -1 constant or something and the whole thing asserts.......

I've updated a few obvious cases, and tried to make use of the const reference where possible, but there's more to do. A number of existing oss-fuzz tickets should be fixed if we start using APInt and perform value clamping where necessary.

llvm-svn: 352961
2019-02-02 17:35:06 +00:00
James Y Knight 14359ef1b6 [opaque pointer types] Pass value type to LoadInst creation.
This cleans up all LoadInst creation in LLVM to explicitly pass the
value type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57172

llvm-svn: 352911
2019-02-01 20:44:24 +00:00
James Y Knight 7976eb5838 [opaque pointer types] Pass function types to CallInst creation.
This cleans up all CallInst creation in LLVM to explicitly pass a
function type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57170

llvm-svn: 352909
2019-02-01 20:43:25 +00:00
Simon Pilgrim 85184017e9 [X86][SSE] Use PSLLDQ/PSRLDQ to mask out zeroable ends of a shuffle
As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section.

For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts).

Differential Revision: https://reviews.llvm.org/D56784

llvm-svn: 352883
2019-02-01 16:02:12 +00:00
Simon Pilgrim 1a529f58f9 [X86][AVX] Combine INSERT_SUBVECTOR(SRC0, BITCAST(SHUFFLE(EXTRACT_SUBVECTOR(SRC1)))
Enable peeking through one use bitcasts to the subvector shuffle.

This still depends on the subvector being the same scalar-size but D57514 has already helped with the more tricky patterns

llvm-svn: 352879
2019-02-01 15:31:01 +00:00
James Y Knight 13680223b9 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc
doesn't choke on it, hopefully.

Original Message:
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352827
2019-02-01 02:28:03 +00:00
James Y Knight fadf25068e Revert "[opaque pointer types] Add a FunctionCallee wrapper type, and use it."
This reverts commit f47d6b38c7 (r352791).

Seems to run into compilation failures with GCC (but not clang, where
I tested it). Reverting while I investigate.

llvm-svn: 352800
2019-01-31 21:51:58 +00:00
James Y Knight f47d6b38c7 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352791
2019-01-31 20:35:56 +00:00
Simon Pilgrim 00cefe1158 Trim trailing whitespace. NFCI.
llvm-svn: 352775
2019-01-31 17:49:25 +00:00
Simon Pilgrim eb6aef6db3 [X86][AVX] Fold concat(broadcast(x),broadcast(x)) -> broadcast(x)
Differential Revision: https://reviews.llvm.org/D57514

llvm-svn: 352774
2019-01-31 17:48:35 +00:00
Simon Pilgrim d04a2d2d5e [X86][AVX] insert_subvector(bitcast(v), bitcast(s), c1) -> bitcast(insert_subvector(v,s,c2))
Similar to what we already do in DAGCombiner, but this version also handles bitcasts from types with different scalar sizes, which x86 is better at handling.

Differential Revision: https://reviews.llvm.org/D57514

llvm-svn: 352773
2019-01-31 17:38:10 +00:00
Simon Pilgrim 63f3383ece [X86][AVX] Fold broadcast(bitcast(src)) -> bitcast(broadcast(src))
llvm-svn: 352751
2019-01-31 14:04:07 +00:00
Simon Pilgrim a001008a09 [X86] combineExtractWithShuffle - more aggressively peek through bitcasts
Fixes regression introduced by rL352743

llvm-svn: 352745
2019-01-31 11:55:30 +00:00
Simon Pilgrim b96a2c7fed [X86][AVX] Enable AVX1 broadcasts in shuffle combining
Enables 32/64-bit scalar load broadcasts on AVX1 targets

The extractelement-load.ll regression will be fixed shortly in a followup commit.

llvm-svn: 352743
2019-01-31 11:41:10 +00:00
Simon Pilgrim 51c2efc104 [X86][AVX] Fold vt1 concat_vectors(vt2 undef, vt2 broadcast(x)) --> vt1 broadcast(x)
If we're not inserting the broadcast into the lowest subvector then we can avoid the insertion by just performing a larger broadcast.

Avoids a regression when we enable AVX1 broadcasts in shuffle combining

llvm-svn: 352742
2019-01-31 11:15:05 +00:00
Craig Topper 8bdc203d4b [X86] Remove handling of ISD::INTRINSIC_WO_CHAIN in ReplaceNodeResults.
I believe this was there to handle avx512bw intrinsics that returned i64 type in 32-bit mode. But all those intrinsics have since been changed to v64i1 results or replaced with generic IR.

llvm-svn: 352698
2019-01-31 00:04:46 +00:00
Simon Pilgrim 317fad5921 [X86][AVX] Prefer to combine shuffle to broadcasts whenever possible
This is the first step towards improving broadcast support on AVX1 targets.

llvm-svn: 352634
2019-01-30 16:19:19 +00:00
Mikael Holmen b792627ce9 Fix compiler warning when using clang 3.6.0
Without the fix we get the following (with -Werror):

../lib/Target/X86/X86ISelLowering.cpp:14181:58: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
  SmallVector<std::array<int, 2>, 2> LaneSrcs(NumLanes, {-1, -1});
                                                         ^~~~~~
                                                         {     }
1 error generated.

llvm-svn: 352455
2019-01-29 06:51:28 +00:00
Craig Topper 390ac61b93 Recommit r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer"
This did not cause the buildbot failure it was previously reverted for.

Original commit message:

I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.

This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inre

On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.

llvm-svn: 352433
2019-01-28 21:38:47 +00:00
Simon Pilgrim 2c17512456 [X86][AVX] Remove lowerShuffleByMerging128BitLanes 2-lane restriction
First step towards adding support for 64-bit unary "sublane" handling (a bit like lowerShuffleAsRepeatedMaskAndLanePermute). 

This allows us to add lowerV64I8Shuffle handling.

llvm-svn: 352389
2019-01-28 17:02:35 +00:00
Sanjay Patel 94cca60b82 [x86] allow more shuffle splitting to avoid vpermps (PR40434)
This is tricky to make optimal: sometimes we're better off using 
a single wider op, but other times it makes more sense to combine
a narrow ops to achieve the same result.

This solves the case from:
https://bugs.llvm.org/show_bug.cgi?id=40434

There's potentially a similar change for vectors with 64-bit elements,
but it needs adjustments similar to rL352333 to avoid creating infinite
loops.

llvm-svn: 352380
2019-01-28 15:51:34 +00:00
Craig Topper 453150bc18 [X86] Add new variadic avx512 compress/expand intrinsics that use vXi1 types for the mask argument.
Remove and autoupgrade the old intrinsics

llvm-svn: 352343
2019-01-28 07:03:03 +00:00
Sanjay Patel ebe6b43aec [x86] add restriction for lowering to vpermps
This transform was added with rL351346, and we had
an escape for shufps, but we also want one for
unpckps vs. vpermps because vpermps doesn't take
an immediate shuffle index operand.

llvm-svn: 352333
2019-01-27 21:53:33 +00:00
Simon Pilgrim 670a6971f8 [X86][SSE] Add UNDEF handling to combineSelect ISD::USUBSAT matching (PR40083)
llvm-svn: 352330
2019-01-27 21:01:23 +00:00
Simon Pilgrim f10b6623cc [X86][SSE] Permit UNDEFs in combineAddToSUBUS matching (PR40083)
llvm-svn: 352328
2019-01-27 20:36:37 +00:00
Sanjay Patel 5f1fdaa192 [x86] refactor logic in lowerShuffleWithUndefHalf
Although this is longer code, this is no-functional-change-intended.
The goal is to untangle the conditions under which we bail out, so 
that's easier to adjust.

llvm-svn: 352320
2019-01-27 18:12:03 +00:00
Simon Pilgrim a914fa4dd8 [X86] combineAddOrSubToADCOrSBB/combineCarryThroughADD - use oneuse for entire SDNode
Fix issue noted in D57281 that only tested the one use for the SDValue (the result flag), not the entire SUB.

I've added the getNode() to make it clearer what is intended than just the -> redirection.

llvm-svn: 352291
2019-01-26 21:29:16 +00:00
Simon Pilgrim 37a8e65a60 [X86] combineCarryThroughADD - add support for X86::COND_A commutations (PR24545)
As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to a SBB instruction.

Differential Revision: https://reviews.llvm.org/D57281

llvm-svn: 352289
2019-01-26 20:23:04 +00:00
Simon Pilgrim b7a15acd38 [X86] Fold X86ISD::SBB(ISD::SUB(X,Y),0) -> X86ISD::SBB(X,Y) (PR25858)
We often generate X86ISD::SBB(X, 0) for carry flag arithmetic.

I had tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find anything yet.

Differential Revision: https://reviews.llvm.org/D57169

llvm-svn: 352288
2019-01-26 20:13:44 +00:00
Simon Pilgrim 6162fba57c [X86][SSE] Generalized unsigned compares to support nonsplat constant vectors (PR39859)
llvm-svn: 352283
2019-01-26 16:40:03 +00:00
Sanjay Patel a03c63b77f [x86] add helper for creating a half-width shuffle; NFC
This reduces a bit of duplication between the combining and
lowering places that use it, but the primary motivation is
to make it easier to rearrange the lowering logic and solve
PR40434:
https://bugs.llvm.org/show_bug.cgi?id=40434

llvm-svn: 352280
2019-01-26 16:20:22 +00:00
Craig Topper 58e6b37e62 Revert r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer"
This might be breaking an lldb windows buildbot.

llvm-svn: 352268
2019-01-26 02:44:58 +00:00
Craig Topper 7a8e74775c [X86] Add DAG combine to merge vzext_movl with the various fp<->int conversion operations that only write the lower 64-bits of an xmm register and zero the rest.
Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56971

llvm-svn: 352260
2019-01-26 01:17:09 +00:00
Craig Topper b1d3457c03 [SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer
Summary:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.

This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization.

On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57186

llvm-svn: 352255
2019-01-26 00:26:37 +00:00
Craig Topper 4cf28bad5b [X86] Combine masked store and truncate into masked truncating stores.
We also need to combine to masked truncating with saturation stores, but I'm leaving that for a future patch.

This does regress some tests that used truncate wtih saturation followed by a masked store. Those now use a truncating store and use min/max to saturate.

Differential Revision: https://reviews.llvm.org/D57218

llvm-svn: 352230
2019-01-25 18:37:36 +00:00
Sanjay Patel 0020f8bb23 [x86] simplify logic in lowerShuffleWithUndefHalf(); NFCI
This seems unnecessarily complicated because we gave names to
opposite polarity bools and have code comments that don't really
line up with the logic. 

Step 1: remove UndefUpper and assert that it is the opposite of 
UndefLower after the initial early exit.

llvm-svn: 352217
2019-01-25 17:00:41 +00:00
Simon Pilgrim f56298f4b9 [X86] Simplify X86ISD::ADD/SUB if we don't use the result flag
Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag.

This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further.

Noticed in some of the test cases in PR31754

Differential Revision: https://reviews.llvm.org/D57234

llvm-svn: 352210
2019-01-25 15:58:28 +00:00
Sanjay Patel 21aa6ddc14 [x86] narrow a shuffle that doesn't use or set any high elements
This isn't the final fix for our reduction/horizontal codegen, but it takes care 
of a lot of the problems. After we narrow the shuffle, existing combines for 
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.

The avg and mul reduction tests show an existing shuffle lowering hole for 
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.

Differential Revision: https://reviews.llvm.org/D57156

llvm-svn: 352209
2019-01-25 15:37:42 +00:00
Sanjay Patel 4c304b2923 [x86] move half-size shuffle mask creation to helper; NFC
As noted in D57156, we want to check at least part of
this pattern earlier (in combining), so this will allow
the code to be shared instead of duplicated.

llvm-svn: 352127
2019-01-24 23:12:36 +00:00
Sanjay Patel e524639d72 [x86] rename VectorShuffle -> Shuffle; NFC
This wasn't consistent within the file, so made it harder to search.
Standardize on the shorter name to save some typing.

llvm-svn: 352077
2019-01-24 18:52:12 +00:00
Sanjay Patel e5a0bcf7b8 [x86] add low/high undef half shuffle mask helpers; NFC
This is the most common usage for isUndefInRange, 
so make the code slightly less duplicated and more 
readable.

llvm-svn: 352063
2019-01-24 17:05:02 +00:00
Matt Arsenault a5840c3c39 Codegen support for atomicrmw fadd/fsub
llvm-svn: 351851
2019-01-22 18:36:06 +00:00
Simon Pilgrim 933673d878 [X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.

This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.

Differential Revision: https://reviews.llvm.org/D55935

llvm-svn: 351819
2019-01-22 13:44:49 +00:00
Craig Topper bcbdf61078 [X86] Use X86ISD::VFPROUND instead of ISD::FP_ROUND for 256 and 512 bit cvtpd2ps intrinsics.
Summary:
Use X86ISD::VFPROUND in the instruction isel patterns. Add new patterns for ISD::FP_ROUND to maintain support for fptrunc in IR.

In the process I found a couple duplicate isel patterns which I also deleted in this patch.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56991

llvm-svn: 351762
2019-01-21 20:14:09 +00:00
Craig Topper c2087d8f3f [X86] Change avx512 COMPRESS and EXPAND lowering to use a single masked node instead of expand/compress+select.
Summary:
For compress, a select node doesn't semantically reflect the behavior of the instruction. The mask would have holes in it, but the resulting write is to contiguous elements at the bottom of the vector.

Furthermore, as far as the compressing and expanding is concerned the behavior is depended on the mask. You can't just have an expand/compress node that only reads the input vector. That node would have no meaning by itself.

This all only works because we pattern match the compress/expand+select back to the instruction. But conceivably an optimization of the select could break the pattern and leave something meaningless.

This patch modifies the expand and compress node to take the mask and passthru as additional inputs and gets rid of the select all together.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57002

llvm-svn: 351761
2019-01-21 20:02:28 +00:00
Craig Topper 4aa74fff1f [X86] Add masked MCVTSI2P/MCVTUI2P ISD opcodes to model the cvtqq2ps cvtuqq2ps nodes that produce less than 128-bits of results.
These nodes zero the upper half of the result and can't be represented with vselect.

llvm-svn: 351666
2019-01-19 21:26:20 +00:00
Chandler Carruth 2946cd7010 Update the file headers across all of the LLVM projects in the monorepo
to reflect the new license.

We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.

Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.

llvm-svn: 351636
2019-01-19 08:50:56 +00:00
Reid Kleckner 38f9900aa5 [X86] Deduplicate static calling convention helpers for code size, NFC
Summary:
Right now we include ${TGT}GenCallingConv.inc once per each instruction
selection method implemented by ${TGT}:
- ${TGT}ISelLowering.cpp
- ${TGT}CallLowering.cpp
- ${TGT}FastISel.cpp

Instead, add a mechanism to tablegen for marking a particular convention
as "External", which causes tablegen to emit into the ::llvm namespace,
instead of as a static helper. This allows us to provide a header to
forward declare it, so we can simply call the function from all the
places it is referenced. Typically the calling convention analyzer is
called indirectly, so it doesn't benefit from inlining.

This saves a bit of final binary size, but mostly just saves object file
size:

before  after   diff   artifact
12852K  12492K  -360K  X86ISelLowering.cpp.obj
4640K   4280K   -360K  X86FastISel.cpp.obj
1704K   2092K   +388K  X86CallingConv.cpp.obj
52448K  52336K  -112K  llc.exe

I didn't collect before numbers for X86CallLowering.cpp.obj, which is
for GlobalISel, but we should save 360K there as well.

This patch applies the strategy to the X86 backend, but there is no
reason it couldn't be applied to the other backends that implement
multiple ISel strategies, like AArch64.

Reviewers: craig.topper, hfinkel, efriedma

Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D56883

llvm-svn: 351616
2019-01-19 00:33:02 +00:00
Craig Topper 08d3d32ead [X86] Lower avx512f scatter intrinsics to X86MaskedScatterSDNode instead of going directly to MachineSDNode.
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.

llvm-svn: 351583
2019-01-18 20:14:46 +00:00
Craig Topper b9d4461f9f [X86] Lower avx2/avx512f gather intrinsics to X86MaskedGatherSDNode instead of going directly to MachineSDNode.:
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.

Differential Revision: https://reviews.llvm.org/D56827

llvm-svn: 351570
2019-01-18 18:22:26 +00:00
Sanjay Patel b6c91a1a59 [x86] simplify code for SDValue.getOperand(); NFC
llvm-svn: 351557
2019-01-18 15:55:21 +00:00
Craig Topper 59abdf5f3f [X86] Add X86ISD::VSHLV and X86ISD::VSRLV nodes for psllv and psrlv
Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits.

This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts.

We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change.

Patch by zhutianyang and Craig Topper

Differential Revision: https://reviews.llvm.org/D56695

llvm-svn: 351381
2019-01-16 21:46:32 +00:00
Craig Topper 5ea3120718 [X86] Use X86ISD::BLENDV for blendv intrinsics. Replace vselect with blendv just before isel table lookup. Remove vselect isel patterns.
This cleans up the duplication we have with both intrinsic isel patterns and vselect isel patterns. This should also allow the intrinsics to get SimplifyDemandedBits support for the condition.

I've switched the canonical pattern in isel to use the X86ISD::BLENDV node instead of VSELECT. Since it always seemed weird to move from BLENDV with its relaxed rules on condition bits to VSELECT which has strict rules about all bits of the condition element being the same. Its more correct to go from VSELECT to BLENDV.

Differential Revision: https://reviews.llvm.org/D56771

llvm-svn: 351380
2019-01-16 21:46:28 +00:00
Craig Topper e5b7cc8aa0 [X86] Add a one use check to the setcc inversion code in combineVSelectWithAllOnesOrZeros
If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc.

Differential Revision: https://reviews.llvm.org/D56765

llvm-svn: 351378
2019-01-16 21:29:29 +00:00
Simon Pilgrim 5a2bbe267a [X86] getFauxShuffleMask - bail for non-byte aligned shuffle types
Remove the existing assertion and just return false for unexpected shuffle value types (<X x i1> mainly....).

Found while updating combineX86ShufflesRecursively to run within SimplifyDemandedVectorElts/SimplifyDemandedBits.

llvm-svn: 351365
2019-01-16 18:15:31 +00:00
Simon Pilgrim 524daea429 [X86] Add combineX86ShufflesRecursively helper. NFCI.
combineX86ShufflesRecursively is pretty cumbersome with a lot of arguments that only matter later in recursion.

This commit adds a wrapper version that only takes the initial root Op to simplify calls that don't need to worry about these.

An early, cleanup step towards merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts/SimplifyDemandedBits.

llvm-svn: 351352
2019-01-16 16:01:42 +00:00
Sanjay Patel 0dbecd05ed [x86] lower shuffle of extracts to AVX2 vperm instructions
I was trying to prevent shuffle regressions while matching more horizontal ops 
and ended up here:
  shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0

The affected tests were added for:
https://bugs.llvm.org/show_bug.cgi?id=34380

This patch won't change the examples in the bug report itself, but we should be 
able to extend this to catch more types.

Differential Revision: https://reviews.llvm.org/D56756

llvm-svn: 351346
2019-01-16 14:15:18 +00:00
Mandeep Singh Grang 436735c3fe [EH] Rename llvm.x86.seh.recoverfp intrinsic to llvm.eh.recoverfp
Summary:
Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc.
Refer D53541 for the context. Clang counterpart D56748.

Reviewers: rnk, efriedma

Reviewed By: rnk, efriedma

Subscribers: javed.absar, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D56747

llvm-svn: 351281
2019-01-16 00:37:13 +00:00
Craig Topper 0e420e6a62 [X86] Rename SHRUNKBLEND ISD node to BLENDV.
That's really what it is. If we didn't use intrinsics for BLENDVPS/BLENDVPD/PBLENDVB all the way to isel, this is the node we would use.

llvm-svn: 351278
2019-01-16 00:20:30 +00:00
Craig Topper 34ac509ac8 [X86] Add avx512 scatter intrinsics that use a vXi1 mask instead of a scalar integer.
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.

llvm-svn: 351275
2019-01-15 23:36:25 +00:00
Craig Topper 82015b633b [X86] Add versions of the avx512 gather intrinsics that take the mask as a vXi1 vector instead of a scalar
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.

I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.

I'll work on scatter and gatherpf/scatterpf next.

Differential Revision: https://reviews.llvm.org/D56527

llvm-svn: 351234
2019-01-15 20:12:33 +00:00
Nirav Dave dbc41bac0c [X86] Fix register class for assembly constraints to ST(7). NFCI.
Modify getRegForInlineAsmConstraint to return special singleton
register class when a constraint references ST(7) not RFP80 for which
ST(7) is not a member.

llvm-svn: 351206
2019-01-15 17:09:14 +00:00
Simon Pilgrim b8f08c8d7b [X86] Bailout of lowerVectorShuffleAsPermuteAndUnpack for shuffle-with-zero (PR40306)
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.

We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.

llvm-svn: 351203
2019-01-15 16:56:55 +00:00
Benjamin Kramer 2fc8ede082 [X86] Fix unused variable warning in Release builds. NFC.
llvm-svn: 351136
2019-01-14 23:29:54 +00:00
Craig Topper 9906f77f82 [X86] Silence a -Wparentheses warning on gcc. NFC
llvm-svn: 351111
2019-01-14 19:44:02 +00:00
Simon Pilgrim bfe2ee453a [X86][SSSE3] Bailout of lowerVectorShuffleAsPermuteAndUnpack for shuffle-with-zero (PR40306)
If we have PSHUFB and we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.

llvm-svn: 351103
2019-01-14 19:07:26 +00:00