Commit Graph

6328 Commits

Author SHA1 Message Date
Simon Pilgrim 2c2f8e74b9 [X86][SSE] Match all-of bool scalar reductions into a bitcast/movmsk + cmp.
Same as what we do for vector reductions in combineHorizontalPredicateResult, use movmsk+cmp for scalar (and(extract(x,0),extract(x,1)) reduction patterns. 

llvm-svn: 361052
2019-05-17 17:25:55 +00:00
Simon Pilgrim 279314e81b [X86][AVX] Remove LowerCTTZ's AVX1 custom vector handling.
We can now rely on generic expansion to handle this.

llvm-svn: 361038
2019-05-17 14:37:19 +00:00
Simon Pilgrim 62c7032c18 [X86][AVX] isNOT - add extract_subvector(xor X, -1) -> extract_subvector(X) fold.
Prep work for the removal of the remaining x86 CTTZ vector lowering.

llvm-svn: 361035
2019-05-17 14:04:56 +00:00
Simon Pilgrim a6d3bd486b [X86] Pull out IsNOT helper. NFCI.
Return the input value for the NOT pattern: (xor X, -1) -> X

llvm-svn: 361012
2019-05-17 10:37:08 +00:00
Reid Kleckner 08c15df29f [X86] Deduplicate symbol lowering logic, NFC
Summary:
This refactors four pieces of code that create SDNodes for references to
symbols:
- normal global address lowering (LEA, MOV, etc)
- callee global address lowering (CALL)
- external symbol address lowering (LEA, MOV, etc)
- external symbol address lowering (CALL)

Each of these pieces of code need to:
- classify the reference
- lower the symbol
- emit a RIP wrapper if needed
- emit a load if needed
- add offsets if needed

I think handling them all in one place will make the code easier to
maintain in the future.

Reviewers: craig.topper, RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61690

llvm-svn: 360952
2019-05-16 23:15:26 +00:00
Adhemerval Zanella 73643b5041 [CodeGen] Add lround/llround builtins
This patch add the ISD::LROUND and ISD::LLROUND along with new
intrinsics.  The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.

The idea is to optimize lround/llround generation for AArch64
in a subsequent patch.  Current semantic is just route it to libm
symbol.

llvm-svn: 360889
2019-05-16 13:15:27 +00:00
Craig Topper 384d46c0d5 [X86] Use OR32mi8Locked instead of LOCK_OR32mi8 in emitLockedStackOp.
They encode the same way, but OR32mi8Locked sets hasUnmodeledSideEffects set
which should be stronger than the mayLoad/mayStore on LOCK_OR32mi8. I think
this makes sense since we are using it as a fence.

This also seems to hide the operation from the speculative load hardening pass
so I've reverted r360511.

llvm-svn: 360747
2019-05-15 04:15:46 +00:00
Philip Reames 658cad1287 [NFC] Reuse a helper function to eliminate duplicate code
llvm-svn: 360740
2019-05-15 01:39:07 +00:00
Philip Reames 445f942fc4 Use an offset from TOS for idempotent rmw locked op lowering
This was the portion split off D58632 so that it could follow the redzone API cleanup. Note that I changed the offset preferred from -8 to -64. The difference should be very minor, but I thought it might help address one concern which had been previously raised.

Differential Revision: https://reviews.llvm.org/D61862

llvm-svn: 360719
2019-05-14 22:32:42 +00:00
Simon Pilgrim c2d9cfd925 [X86] Disable shouldFoldConstantShiftPairToMask for scalar shifts on AMD targets (PR40758)
D61068 handled vector shifts, this patch does the same for scalars where there are similar number of pipes for shifts as bit ops - this is true almost entirely for AMD targets where the scalar ALUs are well balanced.

This combine avoids AND immediate mask which usually means we reduce encoding size.

Some tests show use of (slow, scaled) LEA instead of SHL in some cases, but thats due to particular shift immediates - shift+mask generate these just as easily.

Differential Revision: https://reviews.llvm.org/D61830

llvm-svn: 360684
2019-05-14 15:21:28 +00:00
Simon Pilgrim 2747ee2c83 [X86] X86TargetLowering::LowerINTRINSIC_WO_CHAIN - ensure rounding control is initialized. NFCI.
Fixes scan-build warnings

llvm-svn: 360664
2019-05-14 11:30:39 +00:00
Philip Reames 3098e44daa [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit targets
This is a follow on to D58632, with the same logic. Given a memory operation which needs ordering, but doesn't need to modify any particular address, prefer to use a locked stack op over an mfence.

Differential Revision: https://reviews.llvm.org/D61863

llvm-svn: 360649
2019-05-14 04:43:37 +00:00
Sanjay Patel 3a13d970aa [SDAG, x86] allow targets to override test for binop opcodes
This follows the pattern of the existing isCommutativeBinOp().

x86 shows improvements from vector narrowing for the min/max opcodes.

llvm-svn: 360639
2019-05-14 00:39:40 +00:00
Craig Topper e2966473dd [X86] Use ISD::MERGE_VALUES to return from lowerAtomicArith instead of calling ReplaceAllUsesOfValueWith and returning SDValue().
Returning SDValue() makes the caller think that nothing happened and it will
end up executing the Expand path. This generates extra nodes that will need to
be pruned as dead code.

Returning an ISD::MERGE_VALUES will tell the caller that we'd like to make a
change and it will take care of replacing uses. This will prevent falling into
the Expand path.

llvm-svn: 360627
2019-05-13 22:17:13 +00:00
Craig Topper 5f999c2bea [X86] Various type corrections to the code that creates LOCK_OR32mi8/OR32mi8Locked to the stack for idempotent atomic rmw and atomic fence.
These are updates to match how isel table would emit a LOCK_OR32mi8 node.

-Use i32 for the immediate zero even though only 8 bits are encoded.
-Use i16 for segment register.
-Use LOCK_OR32mi8 for idempotent atomic operations in 32-bit mode to match
64-bit mode. I'm not sure why OR32mi8Locked and LOCK_OR32mi8 both exist. The
only difference seems to be that OR32mi8Locked is marked as UnmodeledSideEffects=1.
-Emit an extra i32 result for the flags output.

I don't know if the types here really matter just noticed it was inconsistent
with normal behavior.

llvm-svn: 360619
2019-05-13 21:01:24 +00:00
Nick Desaulniers c33f754e74 [TargetLowering] Handle multi depth GEPs w/ inline asm constraints
Summary:
X86TargetLowering::LowerAsmOperandForConstraint had better support than
TargetLowering::LowerAsmOperandForConstraint for arbitrary depth
getelementpointers for "i", "n", and "s" extended inline assembly
constraints. Hoist its support from the derived class into the base
class.

Link: https://github.com/ClangBuiltLinux/linux/issues/469

Reviewers: echristo, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, E5ten, kees, jyknight, nemanjai, javed.absar, eraman, hiraditya, jsji, llvm-commits, void, craig.topper, nathanchance, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61560

llvm-svn: 360604
2019-05-13 17:27:44 +00:00
Simon Pilgrim 73aee29095 [X86][SSE] LowerBuildVectorv4x32 - don't insert MOVQ for undef elts
Fixes the regression noted in D61782 where a VZEXT_MOVL was being inserted because we weren't discriminating between 'zeroable' and 'all undef' for the upper elts.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360596
2019-05-13 16:10:11 +00:00
Simon Pilgrim cf5a8eb7cd [X86][SSE] Relax use limits for lowerAddSubToHorizontalOp (PR32433)
Now that we can use HADD/SUB for scalar additions from any pair of extracted elements (D61263), we can relax the one use limit as we will be able to merge multiple uses into using the same HADD/SUB op.

This exposes a couple of missed opportunities in LowerBuildVectorv4x32 which will be committed separately.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360594
2019-05-13 16:02:45 +00:00
Simon Pilgrim d9aa928603 [X86] Add SimplifyDemandedBits support for PEXTRB/PEXTRW (PR39709)
Test case will be included in a followup - its being used but its tricky to show a case that isn't caught at a later stage anyway.

llvm-svn: 360588
2019-05-13 15:31:27 +00:00
Simon Pilgrim a7fc763082 [X86][AVX] Split VZEXT_MOVL ymm/zmm if the upper elements are not demanded.
Removes unnecessary vzeroupper noted in D61806

llvm-svn: 360543
2019-05-12 15:16:29 +00:00
Simon Pilgrim fda6bffd3b [X86][SSE] SimplifyDemandedBits - call PEXTRB/PEXTRW SimplifyDemandedVectorElts as well.
See if we can simplify the demanded vector elts from the extraction before trying to simplify the demanded bits.

This helps us with target shuffles and hops in particular.

llvm-svn: 360535
2019-05-11 21:35:50 +00:00
Simon Pilgrim e4c5b6d9bd [X86][SSE] Add SimplifyDemandedVectorElts HADD/HSUB handling.
Still missing PHADDW/PHSUBW tests because PEXTRW doesn't call SimplifyDemandedVectorElts

llvm-svn: 360526
2019-05-11 16:07:12 +00:00
Craig Topper c9d7484aa3 [X86] Add CMOV_FR32X/CMOV_FR64X pseudo instructions. Use them in fast isel to fix a machine verifier error after adding test cases.
Fast isel picks the FR32X/FR64X register classes when lowering pseudo select, but it didn't have the right opcode to go with it.

llvm-svn: 360524
2019-05-11 16:00:28 +00:00
Simon Pilgrim a0b1518a4a [X86][SSE] Add getHopForBuildVector vector splitting
If we only use the lower xmm of a ymm hop, then extract the xmm's (for free), perform the xmm hop and then insert back into a ymm (for free).

Fixes some of the regressions noted in D61782

llvm-svn: 360435
2019-05-10 15:46:04 +00:00
Philip Reames bd588dfd59 [X86] Improve lowering of idemptotent RMW operations
The current lowering uses an mfence. mfences are substaintially higher latency than the locked operations originally requested, but we do want to avoid contention on the original cache line. As such, use a locked instruction on a cache line assumed to be thread local.

Differential Revision: https://reviews.llvm.org/D58632

llvm-svn: 360393
2019-05-09 23:23:42 +00:00
Simon Pilgrim 93bfa5af48 [X86][SSE] Fold add(shuffle(),shuffle()) to hadd on 'slow' targets (PR39920)
As reported on PR39920, "slow horizontal ops" targets tend to internally expand to 2*shuffle+add/sub - so if we can reduce 2*shuffle+add/sub to a hadd/sub then we should do it - similar port usage but reduced instruction count.

This works out in most cases, although the "PR22377" regression in vector-shuffle-combining.ll is annoying - going from 2*shuffle+add+shuffle to hadd+2*shuffle - I've opened PR41813 to cover this.

Differential Revision: https://reviews.llvm.org/D61308

llvm-svn: 360360
2019-05-09 17:45:01 +00:00
Reid Kleckner 6bf108d77a [COFF] Use COFF stubs for extern_weak functions
Summary:
A COFF stub indirects the reference to a symbol through memory. A
.refptr.$sym global variable pointer is created to refer to $sym.
Typically mingw uses these for external global variable declarations,
but we can use them for weak function declarations as well.

Updates the dso_local classification to add a special case for
extern_weak symbols on COFF in both clang and LLVM.

Fixes PR37598

Reviewers: smeenai, mstorsjo

Subscribers: hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D61615

llvm-svn: 360207
2019-05-07 23:06:21 +00:00
Eric Christopher 4727221734 Make sure that the DAG combiner doesn't merge stores that we explicitly
asked not be greater than preferred vector width for the vectorizer.
Test for both 128 and 256 with a skylake architecture.

llvm-svn: 360183
2019-05-07 19:25:34 +00:00
Simon Pilgrim debb2b2a1e Fix local shadow variable warning. NFCI.
llvm-svn: 360157
2019-05-07 14:56:34 +00:00
Simon Pilgrim b0f51266b8 [X86][AVX] Fold concat(packus(),packus()) -> packus(concat(),concat()) (PR34773)
Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining.

llvm-svn: 360134
2019-05-07 11:17:39 +00:00
Craig Topper a75630302d [X86] Use extended vector register classes in getRegForInlineAsmConstraint to support x/y/zmm16-31 when the type is mismatched.
The FR32/FR64/VR128/VR256 register classes don't contain the upper 16 registers. For most cases we use the default implementation which will find any register class that contains the register in question if the VT is legal for the register class. But if the VT is i32 or i64, we won't find a matching register class and will instead up in the code modified in this patch.

If the requested register is x/y/zmm16-31 we weren't returning a register class that contains those registers and will hit an assertion in the caller.

To fix this, I've changed to use the extended register class instead. I don't believe we need a subtarget check to see if avx512 is enabled. The default implementation just pick whatever register class it finds first. I checked and we currently pick FR32X for XMM0 with an f32 type using the default implementation regardless of whether avx512 is enabled. So I assume its it is ok to do the same for i32.

Differential Revision: https://reviews.llvm.org/D61457

llvm-svn: 360102
2019-05-06 23:57:42 +00:00
Simon Pilgrim 07d91cd98a [X86] lowerVectorShuffle - use any_of to detect out of bounds shuffle indices. NFCI.
Fixes cppcheck local shadow warning as well.

llvm-svn: 360027
2019-05-06 10:11:24 +00:00
Luo, Yuanke beec41c656 Enable AVX512_BF16 instructions, which are supported for BFLOAT16 in Cooper Lake
Summary:
1. Enable infrastructure of AVX512_BF16, which is supported for BFLOAT16 in Cooper Lake;
2. Enable VCVTNE2PS2BF16, VCVTNEPS2BF16 and DPBF16PS  instructions, which are Vector Neural Network Instructions supporting BFLOAT16 inputs and conversion instructions from IEEE single precision.
VCVTNE2PS2BF16: Convert Two Packed Single Data to One Packed BF16 Data.
VCVTNEPS2BF16: Convert Packed Single Data to Packed BF16 Data.
VDPBF16PS: Dot Product of BF16 Pairs Accumulated into Packed Single Precision.
For more details about BF16 isa, please refer to the latest ISE document: https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference

Author: LiuTianle

Reviewers: craig.topper, smaslov, LuoYuanke, wxiao3, annita.zhang, RKSimon, spatel

Reviewed By: craig.topper

Subscribers: kristina, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60550

llvm-svn: 360017
2019-05-06 08:22:37 +00:00
Simon Pilgrim 5170c0e5fe Move getOpcode() call into if statement. NFCI.
Avoids a cppcheck "Local variable name shadows outer variable" warning. 

llvm-svn: 359991
2019-05-05 18:34:38 +00:00
Simon Pilgrim cbcd9b1b92 [X86] Fix some cppcheck "Local variable name shadows outer variable" warnings. NFCI.
llvm-svn: 359976
2019-05-05 12:00:14 +00:00
Simon Pilgrim b323d5ec7c [X86] LowerToHorizontalOp - Tidyup calls to getHopForBuildVector. NFCI.
Merge the if() tests for the various HADD/SUB + Subtarget tests

llvm-svn: 359901
2019-05-03 15:56:06 +00:00
Simon Pilgrim bfdd0f75a8 [X86] Remove repeated variables. NFCI.
llvm-svn: 359889
2019-05-03 14:37:00 +00:00
Simon Pilgrim aa49be4926 Avoid cppcheck operator precedence warnings. NFCI.
Prefer ((X & Y) ? A : B) to (X & Y ? A : B)

llvm-svn: 359884
2019-05-03 13:50:38 +00:00
Simon Pilgrim a359ef192b [X86] LowerMULH - remove unused Lo/Hi vector indices. NFCI.
Leftover from before we had the extract128BitVector helpers.

llvm-svn: 359871
2019-05-03 10:32:07 +00:00
Simon Pilgrim 88f9117168 Reduce variable scope to just the if() block its actually used in. NFCI.
llvm-svn: 359869
2019-05-03 10:13:41 +00:00
Craig Topper e1e38d4248 [X86] Correct the register class for specific mask register constraints in getRegForInlineAsmConstraint when the VT is a scalar type
The default impementation in the base class for TargetLowering::getRegForInlineAsmConstraint doesn't work for mask registers when the VT is a scalar type integer types since the only legal mask types are vXi1. So we end up just getting whatever the first register class that contains the register. Currently this appears to be VK1, but its really dependent on the order tablegen outputs the register classes.

Some code in the caller ends up looking up the type for this register class and find v1i1 then generates a copyfromreg from the physical k-register with the v1i1 type. Then it generates an any_extend from v1i1 to the scalar VT which isn't legal. This bad any_extend sticks around until isel where it selects a MOVZX32rr8 with a v1i1 input or maybe a i8 input. Not sure but eventually we pick up a copy from VK1 to GR8 in MachineIR which isn't supported. This leads to a failure in physical register copying.

This patch uses the scalar type to find a VK class of the right size. In the attached test case this will be VK16. This causes a bitcast from vk16 to i16 to be generated instead of an any_extend. This will be properly iseled to a VK16 to GR32 copy and a GR32->GR16 extract_subreg.

Fixes PR41678

Differential Revision: https://reviews.llvm.org/D61453

llvm-svn: 359837
2019-05-02 22:26:40 +00:00
Simon Pilgrim df8daf0ef4 [X86][SSE] lowerAddSubToHorizontalOp - enable ymm extraction+fold
Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting one upper xmm whatever, and we can remove a shuffle by using the hop which is inline with what shouldUseHorizontalOp expects to happen anyway.

Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result:
https://godbolt.org/z/0R-U-K

Differential Revision: https://reviews.llvm.org/D61426

llvm-svn: 359786
2019-05-02 14:00:55 +00:00
Simon Pilgrim 9fa56f7829 [X86][SSE] Move shouldUseHorizontalOp inside isHorizontalBinOp. NFCI.
Matches what we do for lowerAddSubToHorizontalOp and will make it easier to peek through subvectors to help fix PR39921

llvm-svn: 359782
2019-05-02 12:18:24 +00:00
Simon Pilgrim 9f04d97cd7 [X86][SSE] Fold scalar horizontal add/sub for non-0/1 element extractions
We already perform horizontal add/sub if we extract from elements 0 and 1, this patch extends it to non-0/1 element extraction indices (as long as they are from the lowest 128-bit vector).

Differential Revision: https://reviews.llvm.org/D61263

llvm-svn: 359707
2019-05-01 17:13:35 +00:00
Simon Pilgrim f5bdff7747 Fix 80 column violation. NFCI.
llvm-svn: 359694
2019-05-01 16:01:49 +00:00
Simon Pilgrim 6711b9699a [X86][SSE] Add demanded elts support X86ISD::PMULDQ\PMULUDQ
Add to SimplifyDemandedVectorEltsForTargetNode and SimplifyDemandedBitsForTargetNode

llvm-svn: 359686
2019-05-01 14:50:50 +00:00
Simon Pilgrim 3d6899e369 [X86][SSE] Add SSE vector shift support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359680
2019-05-01 13:51:09 +00:00
Simon Pilgrim ba372c6e62 [X86][SSE] Split 512-bit -> 128-bit vector directly in SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 359678
2019-05-01 12:48:42 +00:00
Simon Pilgrim 951a6b4579 [X86][SSE] Add 512-bit vector support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359677
2019-05-01 12:37:41 +00:00
Simon Pilgrim 37c2419cc7 [X86][SSE] Add X86ISD::PACKSS\PACKUS to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359673
2019-05-01 11:29:36 +00:00
Simon Pilgrim 3353cee06c [X86][SSE] Add X86ISD::UNPCKL\UNPCK to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359670
2019-05-01 11:08:03 +00:00
Simon Pilgrim f7b978a71b [X86][SSE] Move extract_subvector(pshufb) fold to SimplifyDemandedVectorEltsForTargetNode
This lets us hit more cases than combineExtractSubvector and allows us reuse more code.

llvm-svn: 359669
2019-05-01 10:58:38 +00:00
Simon Pilgrim a7d107a3e0 [X86] SimplifyDemandedVectorEltsForTargetNode - pull out vector halving code. NFCI.
Pull out the HADD/HSUB code to halve vector widths if the upper half isn't used - prep work to adding support for other opcodes.

llvm-svn: 359667
2019-05-01 10:38:10 +00:00
Simon Pilgrim 99eefe94b5 [X86][SSE] Extract i1 elements from vXi1 bool vectors
This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.

Differential Revision: https://reviews.llvm.org/D61189

llvm-svn: 359666
2019-05-01 10:02:22 +00:00
Simon Pilgrim 07ab4e7db8 [X86][SSE] Fold extract_subvector(extend(x)) -> extend_vector_inreg(x)
This adds any extend support - folding to zero_extend_vector_inreg (PMOVZX) for legality

Minor improvement for PR39709

llvm-svn: 359608
2019-04-30 20:31:07 +00:00
Simon Pilgrim 22641cc194 Fix for bug 41512: lower INSERT_VECTOR_ELT(ZeroVec, 0, Elt) to SCALAR_TO_VECTOR(Elt) for all SSE flavors
Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) insead of much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest element initialization in _mm_set_epi32.
So such inefficient lowering leads to significant performance digradations in ceratin cases switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512

Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable since latter is closer match to desired behavior and always efficiently lowered to movd and alike.

Committed on behalf of @Serge_Preis (Serge Preis)

Differential Revision: https://reviews.llvm.org/D60852

llvm-svn: 359545
2019-04-30 10:18:25 +00:00
Sjoerd Meijer 180f1ae57c [TargetLowering] Change getOptimalMemOpType to take a function attribute list
The MachineFunction wasn't used in getOptimalMemOpType, but more importantly,
this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType.

This is the groundwork for the changes in D59766 and D59787, that allows
implementation of TTI::getMemcpyCost.

Differential Revision: https://reviews.llvm.org/D59785

llvm-svn: 359537
2019-04-30 08:38:12 +00:00
Simon Pilgrim 028485d7b9 [X86][SSE] isHorizontalBinOp - add support for target shuffles
Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support.

This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type

Fixes PR39936 and should help with PR39920/PR39921

Differential Revision: https://reviews.llvm.org/D61245

llvm-svn: 359491
2019-04-29 19:52:59 +00:00
Simon Pilgrim d5cc753b6d [X86][SSE] combineExtractVectorElt - add early-out to return zero/undef for out-of-range extraction indices.
llvm-svn: 359406
2019-04-28 19:12:58 +00:00
Simon Pilgrim 22d1476bfa [X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.

llvm-svn: 359400
2019-04-28 14:31:01 +00:00
Simon Pilgrim 93ad48210c [X86][SSE] Optimize llvm.experimental.vector.reduce.xor.vXi1 parity reduction (PR38840)
An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0.

Differential Revision: https://reviews.llvm.org/D61230

llvm-svn: 359396
2019-04-28 10:46:17 +00:00
Simon Pilgrim 03c4e2663c Revert rL359389: [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.
........
Causing irregular buildbot failures.

llvm-svn: 359391
2019-04-27 20:44:08 +00:00
Simon Pilgrim 4118be3af6 [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.

llvm-svn: 359389
2019-04-27 20:04:44 +00:00
Simon Pilgrim 2a2d422400 [X86][AVX512] Improve vector bool reductions
As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern.

llvm-svn: 359386
2019-04-27 17:32:46 +00:00
Simon Pilgrim acc1e6d1c6 [X86][AVX] Merge mask select with shuffles across extract_subvector (PR40332)
Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector.

We can extend this in the future to handle more opcodes and non-zero selections.

llvm-svn: 359378
2019-04-27 13:35:32 +00:00
Craig Topper 063b471ff7 [X86] Use MOVQ for i64 atomic_stores when SSE2 is enabled
Summary: If we have SSE2 we can use a MOVQ to store 64-bits and avoid falling back to a cmpxchg8b loop. If its a seq_cst store we need to insert an mfence after the store.

Reviewers: spatel, RKSimon, reames, jfb, efriedma

Reviewed By: RKSimon

Subscribers: hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60546

llvm-svn: 359368
2019-04-27 03:38:15 +00:00
Simon Pilgrim 27e01e675c [X86][AVX] Fold extract_subvector(broadcast(x)) -> broadcast(x) iff x has one use
llvm-svn: 359332
2019-04-26 18:02:14 +00:00
Simon Pilgrim c3a34c3e07 Fix Wparentheses warning. NFCI.
llvm-svn: 359299
2019-04-26 12:23:42 +00:00
Simon Pilgrim bb230c5e79 [X86][SSE] Pull out OR(EXTRACTELT(X,0),OR(EXTRACTELT(X,1),...)) matching code from LowerVectorAllZeroTest
Create a matchBitOpReduction helper that checks for the pattern with any opcode.

First step towards reusing this code to recognize other scalar reduction patterns.

llvm-svn: 359296
2019-04-26 11:45:54 +00:00
Simon Pilgrim 5d6ef94c36 [X86][SSE] Disable shouldFoldConstantShiftPairToMask for btver1/btver2 targets (PR40758)
As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs).

Differential Revision: https://reviews.llvm.org/D61068

llvm-svn: 359293
2019-04-26 10:49:13 +00:00
Simon Pilgrim 5e161df9f8 [X86][AVX] Combine shuffles extracted from a common vector
A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380.

Differential Revision: https://reviews.llvm.org/D60512

llvm-svn: 359292
2019-04-26 09:56:14 +00:00
Simon Pilgrim 0a7d1b3ce1 [X86][SSE] combineBitcastvxi1 - add support for bitcasting to non-scalar integers
Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type.

We still have the issue identified in PR41594 but D61114 should handle this.

llvm-svn: 359176
2019-04-25 09:34:36 +00:00
Sanjay Patel b1b3368907 [x86] make sure horizontal op and broadcast types match to simplify (PR41414)
If the types don't match, we can't just remove the shuffle.
There may be some other opportunity for optimization here,
but this should prevent the crashing seen in:
https://bugs.llvm.org/show_bug.cgi?id=41414

llvm-svn: 359095
2019-04-24 14:05:08 +00:00
Simon Pilgrim d30745b2a0 [X86] Add shouldFoldConstantShiftPairToMask override placeholder. NFCI.
Prep work toward fixing PR40758

llvm-svn: 359088
2019-04-24 12:34:08 +00:00
Sanjay Patel 12a561fa1b [x86] use psubus for more vsetcc lowering (PR39859)
Circling back to a leftover bit from PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859#c1

...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'.
This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca.
We already do this transform for the SETULT predicate, so this makes the code more
symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect
anything but pre-SSE4.1 subtargets.

  $ cat before.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768]
	pxor	%xmm0, %xmm2
	pcmpgtw	-32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255]
	pand	%xmm2, %xmm0
	pandn	%xmm1, %xmm2
	por	%xmm2, %xmm0

  $ cat after.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [256,256,256,256,256,256,256,256]
	psubusw	%xmm0, %xmm2
	pxor	%xmm3, %xmm3
	pcmpeqw	%xmm2, %xmm3
	pand	%xmm3, %xmm0
	pandn	%xmm1, %xmm3
	por	%xmm3, %xmm0

  $ llvm-mca before.s -mcpu=haswell
  Iterations:        100
  Instructions:      600
  Total Cycles:      909
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    0.77
  IPC:               0.66
  Block RThroughput: 1.8

  $ llvm-mca after.s -mcpu=haswell
  Iterations:        100
  Instructions:      700
  Total Cycles:      409
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    1.71
  IPC:               1.71
  Block RThroughput: 1.8

Differential Revision: https://reviews.llvm.org/D60838

llvm-svn: 358999
2019-04-23 15:20:17 +00:00
Simon Pilgrim 0e4992ce27 [X86] Pull out collectConcatOps helper. NFCI.
Create collectConcatOps helper that returns all the subvector ops for CONCAT_VECTORS or a INSERT_SUBVECTOR series.

llvm-svn: 358989
2019-04-23 14:07:49 +00:00
Sanjay Patel bf8aacb715 [SelectionDAG] move splat util functions up from x86 lowering
This was supposed to be NFC, but the change in SDLoc
definitions causes instruction scheduling changes.

There's nothing x86-specific in this code, and it can
likely be used from DAGCombiner's simplifyVBinOp().

llvm-svn: 358930
2019-04-22 22:43:36 +00:00
Craig Topper 5c43ab337f [X86] Reject 512-bit types in getRegForInlineAsmConstraint when AVX512 is not enabled. Same for 256 bit and AVX.
llvm-svn: 358872
2019-04-22 06:12:02 +00:00
Craig Topper 3980d1ca6b [X86] Disable argument copy elision for arguments passed via pointers
Summary:
If you pass two 1024 bit vectors in IR with AVX2 on Windows 64. Both vectors will be split in four 256 bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory.

The PartOffsets stored for the second argument are all in terms of its original 1024 bit size. So the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8 byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too.

This issue was encountered by ISPC. I'm working on getting a reduce test case, but wanted to go ahead and get feedback on the fix.

Reviewers: rnk

Reviewed By: rnk

Subscribers: dbabokin, llvm-commits, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60801

llvm-svn: 358817
2019-04-20 15:26:44 +00:00
Simon Pilgrim 4171a91e92 [X86] combineVectorTruncationWithPACKUS - remove split/concatenation of mask
combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together.

This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).

This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.

My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.

Differential Revision: https://reviews.llvm.org/D60375#inline-539623

llvm-svn: 358692
2019-04-18 17:23:09 +00:00
Simon Pilgrim 8f87e53462 [X86][SSE] Lower ICMP EQ(AND(X,C),C) -> SRA(SHL(X,LOG2(C)),BW-1) iff C is power-of-2.
This replaces the MOVMSK combine introduced at D52121/rL342326

(movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))

with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra.

Differential Revision: https://reviews.llvm.org/D60625

llvm-svn: 358651
2019-04-18 09:58:59 +00:00
Simon Pilgrim e5573f4f4e [TargetLowering] Rename preferShiftsToClearExtremeBits and shouldFoldShiftPairToMask (PR41359)
As discussed on PR41359, this patch renames the pair of shift-mask target feature functions to make their purposes more obvious.

shouldFoldShiftPairToMask -> shouldFoldConstantShiftPairToMask

preferShiftsToClearExtremeBits -> shouldFoldMaskToVariableShiftPair

llvm-svn: 358526
2019-04-16 20:57:28 +00:00
Simon Pilgrim d769bb1e58 [X86][AVX] X86ISD::PERMV/PERMV3 node types can never fold index ops
Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.

This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle' that recognises that the op is used for a X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.

Differential Revision: https://reviews.llvm.org/D60562

llvm-svn: 358516
2019-04-16 19:18:53 +00:00
Craig Topper 0495f29e42 [X86] Limit the 'x' inline assembly constraint to zmm0-15 when used for a 512 type.
The 'v' constraint is used to select zmm0-31. This makes 512 bit consistent with 128/256-bit.a

llvm-svn: 358450
2019-04-15 21:06:32 +00:00
Craig Topper 3d9b47c770 [X86] Block i32/i64 for 'k' and 'Yk' in getRegForInlineAsmConstraint without avx512bw.
32 and 64 bit k-registers require avx512bw. If we don't block this properly, it leads to a crash.

llvm-svn: 358436
2019-04-15 18:39:45 +00:00
Simon Pilgrim 6c8f4ada36 [X86][SSE] Recognise vXi1 boolean anyof/allof reduction patterns
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.

This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.

This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.

Differential Revision: https://reviews.llvm.org/D60610

llvm-svn: 358286
2019-04-12 14:22:57 +00:00
Craig Topper 68a5d619a4 [X86] Restrict vselect handling in scalarizeExtEltFP to only case to pre type legalization where the setcc result type is vXi1.
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.

The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.

llvm-svn: 358218
2019-04-11 19:57:44 +00:00
Craig Topper 586fad50ac [X86] Add patterns for using movss/movsd for atomic load/store of f32/64. Remove atomic fadd pseudos use isel patterns instead.
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.

It also removes the pseudo instructions for atomic RMW fadd. Instead just adding isel patterns for folding an atomic load into addss/sd. And relying on the new movss/sd store pattern to handle the write part.

This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.

Differential Revision: https://reviews.llvm.org/D60394

llvm-svn: 358215
2019-04-11 19:19:52 +00:00
Craig Topper f7e548c076 Recommit r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
With correct test checks this time.

If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integ

This matches what gcc and icc do for this case and removes an existing FIXME.

llvm-svn: 358214
2019-04-11 19:19:42 +00:00
Craig Topper 8200880c9a Revert r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
I seem to have messed up the test checks.

llvm-svn: 358212
2019-04-11 19:04:38 +00:00
Craig Topper 1c2dfc3100 [X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2
If we have X87, but not SSE2 we can atomicaly load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.

This matches what gcc and icc do for this case and removes an existing FIXME.

Differential Revision: https://reviews.llvm.org/D60156

llvm-svn: 358211
2019-04-11 18:40:21 +00:00
Simon Pilgrim 40b647ae8e [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV3 mask support
Completes SimplifyDemandedVectorElts's basic variable shuffle mask support which should help D60512 + D60562 

llvm-svn: 358186
2019-04-11 15:29:15 +00:00
Simon Pilgrim 8a25154fa7 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV mask support
llvm-svn: 358174
2019-04-11 14:35:45 +00:00
Simon Pilgrim 6f3866c6fb [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMILPV mask support
llvm-svn: 358170
2019-04-11 14:15:01 +00:00
Simon Pilgrim cb5218ad48 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMIL2 mask support
llvm-svn: 358167
2019-04-11 14:04:19 +00:00
Simon Pilgrim e468cc7f14 [X86] SimplifyDemandedVectorElts - add VPPERM support
We need to add support for all variable shuffle mask ops, but VPPERM is the only one that already has test coverage.

llvm-svn: 358165
2019-04-11 13:30:38 +00:00
David Green 0861c87b06 Revert rL357745: [SelectionDAG] Compute known bits of CopyFromReg
Certain optimisations from ConstantHoisting and CGP rely on Selection DAG not
seeing through to the constant in other blocks. Revert this patch while we come
up with a better way to handle that.

I will try to follow this up with some better tests.

llvm-svn: 358113
2019-04-10 18:00:41 +00:00
Simon Pilgrim 37d8d55823 [X86][AVX] getTargetConstantBitsFromNode - extract bits from X86ISD::SUBV_BROADCAST
llvm-svn: 358096
2019-04-10 16:24:47 +00:00
Craig Topper 3a4c2192a4 [X86] Fix a couple lowering functions that called ReplaceAllUsesOfValueWith for the newly created code and then return SDValue(). Use MERGE_VALUES instead.
Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and end up being pruned later. But it was useless unnecessary work to create it.

Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements.

For one of the cases I had to use UNDEF has a dummy value for a result we know is unused. This should get pruned later.

llvm-svn: 357935
2019-04-08 19:44:07 +00:00
Sanjay Patel 50c3b290ed [x86] make 8-bit shl undesirable
I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops.

We've hinted at making all 8-bit ops undesirable for the reason in the code comment:

// TODO: Almost no 8-bit ops are desirable because they have no actual
//       size/speed advantages vs. 32-bit ops, but they do have a major
//       potential disadvantage by causing partial register stalls.

...but that leads to massive diffs and exposes all kinds of optimization holes itself.

Differential Revision: https://reviews.llvm.org/D60286

llvm-svn: 357912
2019-04-08 13:58:50 +00:00
Craig Topper 6a6da233b9 [X86] Make LowerOperationWrapper more robust. Remove now unnecessary ReplaceAllUsesWith from LowerMSCATTER.
Previously LowerOperationWrapper took the number of results from the original
node and counted that many results from the new node. This was intended to drop
chain operands from FP_TO_SINT lowering that uses X87 with memory operations to
stack temporaries. The final load had an extra chain output that needs to be
ignored.

Unfortunately, it didn't work with scatter which has 2 result operands, the
mask output which is discarded and a chain output. The chain output is the one
that is needed but it comes second and it would be dropped by the previous
logic here. To workaround this we were doing a ReplaceAllUses in the lowering
code so that the generic legalization code wouldn't see any uses to replace
since it had been given the wrong result/type.

After this change we take the LowerOperation result directly if the original
node has one result. This allows us to directly return the chain from scatter
or the load data from the FP_TO_SINT case. When the original node has multiple
results we'll ensure the returned node has the same number and copy them over.
For cases where the original node has multiple results and the new code for some
reason has even more results, MERGE_VALUES can be used to pass only the needed
results.

llvm-svn: 357887
2019-04-08 07:39:17 +00:00
Simon Pilgrim 07adb6abda [X86][SSE] SimplifyDemandedBitsForTargetNode - Add initial PACKSS support
In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands.

This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved.

Differential Revision: https://reviews.llvm.org/D60333

llvm-svn: 357859
2019-04-07 10:40:01 +00:00
Simon Pilgrim d0a53d4914 [X86] combineBitcastvxi1 - provide dst VT and src SDValue directly. NFCI.
Prep work to make it easier to reuse the BITCAST->MOVSMK combine in other cases.

llvm-svn: 357847
2019-04-06 18:54:17 +00:00
Simon Pilgrim af1cbdd3ba Fix spelling mistake. NFCI.
llvm-svn: 357843
2019-04-06 15:38:34 +00:00
Francis Visoiu Mistrih 9d9d1b6b2b [X86] Enable tail calls for CallingConv::Swift
It's currently only enabled on AArch64 (enabled in r281376).

llvm-svn: 357809
2019-04-05 20:18:25 +00:00
Craig Topper 80aa2290fb [X86] Merge the different Jcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon

Reviewed By: RKSimon

Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60228

llvm-svn: 357802
2019-04-05 19:28:09 +00:00
Piotr Sobczak 0376ac1d94 [SelectionDAG] Compute known bits of CopyFromReg
Summary:
Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual reg used has one def only.

This can be particularly useful when calling isBaseWithConstantOffset()
with the ISD::CopyFromReg argument, as more optimizations may get enabled
in the result.

Also add a missing truncation on X86, found by testing of this patch.

Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa

Reviewers: bogner, craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59535

llvm-svn: 357745
2019-04-05 07:44:09 +00:00
Craig Topper 94f1772b1e [X86] Promote i16 SRA instructions to i32
We already promote SRL and SHL to i32.

This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions.

I think there might be some DAG combine improvement opportunities in the test changes here.

Differential Revision: https://reviews.llvm.org/D60278

llvm-svn: 357743
2019-04-05 06:32:50 +00:00
Evandro Menezes 85bd3978ae [IR] Refactor attribute methods in Function class (NFC)
Rename the functions that query the optimization kind attributes.

Differential revision: https://reviews.llvm.org/D60287

llvm-svn: 357731
2019-04-04 22:40:06 +00:00
James Y Knight a040174418 Revert [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
It unnecessarily breaks previously-working code which used varargs,
but didn't pass any float/double arguments (such as EDK2).

Also revert the fixup on top of that:
Revert [X86] Fix a test from r357317

This reverts r357317 (git commit d413f41de6)
This reverts r357380 (git commit 7af32444b9)

llvm-svn: 357718
2019-04-04 19:05:48 +00:00
Sanjay Patel 17648b848e [x86] eliminate unnecessary broadcast of horizontal op
This is another pattern that comes up if we more aggressively
scalarize FP ops.

llvm-svn: 357703
2019-04-04 14:46:13 +00:00
Craig Topper 051bd16faf [X86] Remove CustomInserters for RDPKRU/WRPKRU. Use some custom lowering and new ISD opcodes instead.
These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers.

This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes
that take the extra zeroes as inputs. The zeros will then go through isel on their own to select
the MOV32r0 pseudo. Then we just need to mention the physical registers directly
in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary
copies to/from physical registers.

llvm-svn: 357659
2019-04-04 00:28:49 +00:00
Craig Topper 52cac4b79f [X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom instruction selection instead.
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think was largely so we could be pointer
size agnostic in the isel pattern.

To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.

With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.

I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defing rax or vice versa depending on whether we define the instruction to use EAX/RAX.

llvm-svn: 357652
2019-04-03 23:28:30 +00:00
Sanjay Patel c9a012e4ea [x86] fold shuffles of h-ops that have an undef operand
If an operand is undef, we can assume it's the same as the
other operand.

llvm-svn: 357644
2019-04-03 22:40:35 +00:00
Sanjay Patel 61b5e3c6a9 [x86] eliminate movddup of horizontal op
This pattern would show up as a regression if we more
aggressively convert vector FP ops to scalar ops.

There's still a missed optimization for the v4f64 legal
case (AVX) because we create that h-op with an undef operand.
We should probably just duplicate the operands for that
pattern to avoid trouble.

llvm-svn: 357642
2019-04-03 22:15:29 +00:00
Krzysztof Parzyszek 4841643a1d [X86] Extend boolean arguments to inline-asm according to getBooleanType
Differential Revision: https://reviews.llvm.org/D60208

llvm-svn: 357615
2019-04-03 17:43:14 +00:00
Simon Pilgrim 15919ad306 [X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 reduction on AVX1
Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result.

llvm-svn: 357611
2019-04-03 17:28:34 +00:00
Simon Pilgrim 9e28dddf55 [X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction on AVX1
Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets.

llvm-svn: 357608
2019-04-03 17:17:13 +00:00
Craig Topper 2e1bf89e3a [X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics.
This is the appropriate opcode for only having a chain output. Though I'm not
sure it matters much.

llvm-svn: 357375
2019-04-01 05:26:12 +00:00
Sanjay Patel e1bc360fc6 [x86] allow movmsk with 2-element reductions
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.

If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0

  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5

  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400

  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0

  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600

  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.

Differential Revision: https://reviews.llvm.org/D59997

llvm-svn: 357367
2019-03-31 15:11:34 +00:00
Simon Pilgrim 10c9032c02 [X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316)
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.

llvm-svn: 357351
2019-03-30 17:12:29 +00:00
Simon Pilgrim 3293455595 [X86][SSE] detectAVGPattern - begin generalizing ADD matches
Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316

llvm-svn: 357349
2019-03-30 15:31:53 +00:00
Amara Emerson d413f41de6 [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.

llvm-svn: 357317
2019-03-29 21:30:51 +00:00
Simon Pilgrim aeaf7fcdde [X86] Add X86TargetLowering::isCommutativeBinOp override.
We currently just have test coverage for PMULUDQ - will add more in the future.

llvm-svn: 357244
2019-03-29 11:25:58 +00:00
Sanjay Patel 5bbf6f0bd8 [x86] avoid cmov in movmsk reduction
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.

We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.

There seems to be a missing fold for sext of a bool bit here:

negl %ecx
movslq %ecx, %rax

...but that's an independent transform.

Differential Revision: https://reviews.llvm.org/D59818

llvm-svn: 357172
2019-03-28 14:16:13 +00:00
Sanjay Patel 1df0bb6264 [x86] improve AVX lowering of vector zext
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.

I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.

From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.

Differential Revision: https://reviews.llvm.org/D59777

llvm-svn: 357129
2019-03-27 22:42:11 +00:00
Sanjay Patel 704817912a [x86] look through bitcast operand of MOVMSK
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
  // Also support integer VTs to avoid a int->fp bitcast in the DAG.

llvm-svn: 357128
2019-03-27 22:24:03 +00:00
Simon Pilgrim ccb71b2985 Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249

llvm-svn: 357057
2019-03-27 10:25:02 +00:00
Simon Pilgrim 87d4ab8b92 [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.

llvm-svn: 356864
2019-03-24 19:06:35 +00:00
Simon Pilgrim a71c0ed471 [X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).

Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.

llvm-svn: 356858
2019-03-24 16:30:35 +00:00
Sanjay Patel 7d676dfd86 [x86] improve the default expansion of uaddsat/usubsat
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0

We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.

Differential Revision: https://reviews.llvm.org/D59006

llvm-svn: 356855
2019-03-24 13:55:54 +00:00
Sanjay Patel 2e92846d36 [x86] reduce code duplication; NFC
llvm-svn: 356836
2019-03-23 15:00:52 +00:00
Craig Topper ce1ed55a4a [X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences.

This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.

Differential Revision: https://reviews.llvm.org/D59662

llvm-svn: 356808
2019-03-22 20:47:02 +00:00
Craig Topper 1ffd8e8114 [X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enable
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.

I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.

So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?

I've left some FIXMEs for improvements we can do without SSE2.

Differential Revision: https://reviews.llvm.org/D59679

llvm-svn: 356807
2019-03-22 20:46:56 +00:00
Simon Pilgrim 564392d752 [X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203)
llvm-svn: 356784
2019-03-22 17:23:55 +00:00
Craig Topper b3bad3dce3 [X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI
For the future day when the pointer's don't have element types, we shoudl just use the type of the load result instead.

llvm-svn: 356721
2019-03-21 21:37:18 +00:00
Simon Pilgrim c2e4405475 [X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers
We don't use X86ISD::ANDNP for mask registers.

Test case from @craig.topper (Craig Topper)

llvm-svn: 356696
2019-03-21 18:32:38 +00:00
Craig Topper 8d46403b8e [X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set.
CMPXCHG8B was introduced on i586/pentium generation.

If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.

Differential Revision: https://reviews.llvm.org/D59576

llvm-svn: 356631
2019-03-20 23:35:49 +00:00
Craig Topper 0367553304 [X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend.
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immedate, GPR to k-register, and masked op.

I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.

This trades a load for the move immediate and GPR move which is higher latency.
But its probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.

Differential Revision: https://reviews.llvm.org/D59479

llvm-svn: 356618
2019-03-20 21:30:20 +00:00
Simon Pilgrim 2acca37a2d [X86] Use getConstantOperandAPInt to detect out-of-range shifts.
llvm-svn: 356549
2019-03-20 11:41:52 +00:00
Andrea Di Biagio 624f5deff4 [X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI
This patch removes the following dag node opcodes from namespace X86ISD:

RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG

The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:

    RDTSC/RDTSCP don't implicitly read ECX.
    RDTSCP also implicitly writes ECX.

I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.

No functional change intended.

Differential Revision: https://reviews.llvm.org/D59547

llvm-svn: 356546
2019-03-20 11:21:15 +00:00
Simon Pilgrim e744f513c4 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts
If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed.

llvm-svn: 356483
2019-03-19 17:23:25 +00:00
Jordan Rupprecht f74d45a775 [NFC] Fix unused variable in release builds
This was introduced in rL356468.

llvm-svn: 356477
2019-03-19 16:52:40 +00:00
Simon Pilgrim a56f2822d0 [SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect
These changes are related to PR37743 and include:

    SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.

    Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.

    Add promoting the integer ABS node in the LegalizeIntegerType.

    Expand-based legalization of integer result for the ABS nodes.

    Expand-based legalization of ABS vector operations.

    Add some integer abs testcases for different typesizes for Thumb arch

    Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
        tmp = (SRA, Hi, 31)
        Lo = (UADDO tmp, Lo)
        Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
        Lo = (XOR tmp, Lo)

    The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
        (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).

    Change integer abs testcases for codegen with the ABS node support for AArch64.
        Indicate that the ABS is legal for the i64 type when the NEON is supported.
        Change the integer abs testcases to show changing of codegen.

    Add combine and legalization of ABS nodes for Thumb arch.

    Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.

For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743

Patch by: @ikulagin (Ivan Kulagin)

Differential Revision: https://reviews.llvm.org/D49837

llvm-svn: 356468
2019-03-19 16:24:55 +00:00
Adhemerval Zanella 664c1ef528 [TargetLowering] Add code size information on isFPImmLegal. NFC
This allows better code size for aarch64 floating point materialization
in a future patch.

Reviewers: evandro

Differential Revision: https://reviews.llvm.org/D58690

llvm-svn: 356389
2019-03-18 18:40:07 +00:00
Simon Pilgrim f2c53b5d6c [X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.
Replaces existing i1-only fold.

llvm-svn: 356325
2019-03-16 15:02:00 +00:00
Simon Pilgrim 0f472e1d01 [X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

llvm-svn: 356324
2019-03-16 14:29:50 +00:00
Roman Lebedev 9f37790608 [X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

llvm-svn: 356301
2019-03-15 21:18:05 +00:00
Roman Lebedev b6e376ddfa [X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
  That is because there we are computing mid point between zero and some random number,
  thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
  `cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted,  then the PGO
  will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, i'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

llvm-svn: 356300
2019-03-15 21:17:53 +00:00
Craig Topper af856db961 [X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table.
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.

Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.

Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.

llvm-svn: 356294
2019-03-15 19:59:35 +00:00