Commit Graph

19030 Commits

Author SHA1 Message Date
Craig Topper fdcdf74b0e [X86] Remove some unused tablegen multiclasses. NFC
llvm-svn: 358345
2019-04-14 04:20:38 +00:00
Bill Wendling 191f1487b6 [X86] Use PC-relative mode for the kernel code model
Summary:
The Linux kernel uses PC-relative mode, so allow that when the code model is
"kernel".

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits, kees, nickdesaulniers

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60643

llvm-svn: 358343
2019-04-13 21:39:28 +00:00
Craig Topper 55b0d987fd [X86] Use int64_t and isInt<N> instead of APInt operations in foldLoadStoreIntoMemOperand. NFC
We know all our values are limited to 64 bits here so we don't need an APInt.

This should save some generated code that checks between large and small sizes.
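For reference, a standalone sketch of the kind of check llvm::isInt<N> performs on a plain int64_t (the isIntN helper below is an illustrative reimplementation, not LLVM's actual code):

  #include <cassert>
  #include <cstdint>

  // Returns true if x is representable as an N-bit signed integer.
  template <unsigned N>
  static bool isIntN(int64_t x) {
    static_assert(N > 0 && N < 64, "sketch only handles 1..63 bits");
    return x >= -(INT64_C(1) << (N - 1)) && x < (INT64_C(1) << (N - 1));
  }

  int main() {
    assert(isIntN<8>(127) && !isIntN<8>(128));
    assert(isIntN<8>(-128) && !isIntN<8>(-129));
    assert(isIntN<32>(INT32_MIN) && !isIntN<32>(int64_t(INT32_MAX) + 1));
    return 0;
  }

This avoids constructing an APInt just to ask whether a known-64-bit value fits in a smaller signed immediate.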

llvm-svn: 358338
2019-04-13 18:57:41 +00:00
Simon Pilgrim 6c8f4ada36 [X86][SSE] Recognise vXi1 boolean anyof/allof reduction patterns
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.

This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.

This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.

Differential Revision: https://reviews.llvm.org/D60610

llvm-svn: 358286
2019-04-12 14:22:57 +00:00
Eric Christopher 6b06c6a5ef Add explicit dependencies on MCSection.h and MCDwarf.h to the .cpp
files rather than rely on transitive includes from MCStreamer.h.

llvm-svn: 358263
2019-04-12 07:40:01 +00:00
Nick Desaulniers 8ec304c9fd [X86AsmPrinter] refactor static functions into private methods. NFC
Summary:
A lot of the code for printing special cases of operands in this
translation unit are static functions. While I too have suffered many
years of abuse at the hands of C, we should prefer private methods,
particularly when you start passing around *this as your first argument,
which is a code smell.

This will help make generic vs arch specific asm printing easier, as it
brings X86AsmPrinter more in line with other arch's derived AsmPrinters.
We will then be able to more easily move architecture generic code to
the base class, and architecture specific code to the derived classes.

Some other small refactorings while we're here:
- the parameter Op is now consistently OpNo
- add spaces around binary expressions. I know we're not millionaires
  but c'mon.

Reviewers: echristo

Reviewed By: echristo

Subscribers: smeenai, hiraditya, llvm-commits, srhines, craig.topper

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60577

llvm-svn: 358236
2019-04-11 22:47:13 +00:00
Craig Topper 68a5d619a4 [X86] Restrict vselect handling in scalarizeExtEltFP to only the pre-type-legalization case where the setcc result type is vXi1.
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.

The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.

llvm-svn: 358218
2019-04-11 19:57:44 +00:00
Craig Topper 586fad50ac [X86] Add patterns for using movss/movsd for atomic load/store of f32/64. Remove atomic fadd pseudos; use isel patterns instead.
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.

It also removes the pseudo instructions for atomic RMW fadd, instead just adding isel patterns for folding an atomic load into addss/sd and relying on the new movss/sd store pattern to handle the write part.

This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.

Differential Revision: https://reviews.llvm.org/D60394

llvm-svn: 358215
2019-04-11 19:19:52 +00:00
Craig Topper f7e548c076 Recommit r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
With correct test checks this time.

If we have X87, but not SSE2, we can atomically load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.

This matches what gcc and icc do for this case and removes an existing FIXME.

llvm-svn: 358214
2019-04-11 19:19:42 +00:00
Craig Topper 8200880c9a Revert r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
I seem to have messed up the test checks.

llvm-svn: 358212
2019-04-11 19:04:38 +00:00
Craig Topper 1c2dfc3100 [X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2
If we have X87, but not SSE2, we can atomically load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer and store it to a stack temporary. From there we can do two 32-bit loads to get the value into integer registers without worrying about atomicness.

This matches what gcc and icc do for this case and removes an existing FIXME.

Differential Revision: https://reviews.llvm.org/D60156

llvm-svn: 358211
2019-04-11 18:40:21 +00:00
Simon Pilgrim 40b647ae8e [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV3 mask support
Completes SimplifyDemandedVectorElts's basic variable shuffle mask support which should help D60512 + D60562 

llvm-svn: 358186
2019-04-11 15:29:15 +00:00
Luo, Yuanke a2b4d3fab6 [X86] Add MM register mapping from CodeView to MC register id
Differential Revision: https://reviews.llvm.org/D60437

Change-Id: I2183a6d825d0284b22705d423b88882992b236c5
llvm-svn: 358179
2019-04-11 15:01:03 +00:00
Simon Pilgrim 8a25154fa7 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMV mask support
llvm-svn: 358174
2019-04-11 14:35:45 +00:00
Simon Pilgrim 6f3866c6fb [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMILPV mask support
llvm-svn: 358170
2019-04-11 14:15:01 +00:00
Simon Pilgrim cb5218ad48 [X86] SimplifyDemandedVectorElts - add X86ISD::VPERMIL2 mask support
llvm-svn: 358167
2019-04-11 14:04:19 +00:00
Simon Pilgrim e468cc7f14 [X86] SimplifyDemandedVectorElts - add VPPERM support
We need to add support for all variable shuffle mask ops, but VPPERM is the only one that already has test coverage.

llvm-svn: 358165
2019-04-11 13:30:38 +00:00
Craig Topper 61f31cbcb2 [X86] Teach foldMaskedShiftToScaledMask to look through an any_extend from i32 to i64 between the and & shl
foldMaskedShiftToScaledMask tries to reorder and & shl to enable the shl to fold into an LEA. But if there is an any_extend between them it doesn't work.

This patch modifies the code to look through any_extend from i32 to i64 when the and mask only uses bits that weren't from the extended part.

This will prevent a regression from D60358 caused by 64-bit SHL being narrowed to 32-bits when their upper bits aren't demanded.

Differential Revision: https://reviews.llvm.org/D60532

llvm-svn: 358139
2019-04-10 21:42:08 +00:00
Craig Topper 4a32ce39b7 [X86] Make _Int instructions the preferred instruction for the assembly parser and disassembly parser to remove inconsistencies between VEX and EVEX.
Many of our instructions have both a _Int form used by intrinsics and a form
used by other IR constructs. In the EVEX space the _Int versions usually cover
all the capabilities, including broadcasting and rounding, while the other version
only covers simple register/register or register/load forms. For this reason
in EVEX, the non intrinsic form is usually marked isCodeGenOnly=1.

In the VEX encoding space we were less consistent, but usually the _Int version
was the isCodeGenOnly version.

This commit makes the VEX instructions match the EVEX instructions. This was
done by manually studying the AsmMatcher table so it's possible I missed some
cases, but we should be closer now.

I'm thinking about using the isCodeGenOnly bit to simplify the EVEX2VEX
tablegen code that disambiguates the _Int and non _Int versions. Currently it
checks register class sizes and which Record the memory operands come from. I have
some other changes I was looking into for D59266 that may break the memory check.

I had to make a few scheduler hacks to keep the _Int versions from being treated
differently than the non _Int version.

Differential Revision: https://reviews.llvm.org/D60441

llvm-svn: 358138
2019-04-10 21:29:41 +00:00
Craig Topper 87a8f9761e [X86] Replace some if statements in isel address matching that should never be true with asserts. And move them earlier before we looked through operands that don't change size. NFC
These ifs were ensuring we don't have to handle types larger than 64 bits probably because we use getZExtValue in several places below them.

None of the callers of this code pass types larger than 64-bits so we can just assert instead of branching in release code.

I've also moved them earlier since we're just looking through operations that don't affect bit width.

This is prep work for some refactoring I plan to do to the (and (shl)) handling code.

llvm-svn: 358123
2019-04-10 19:08:59 +00:00
Nick Desaulniers ad8f3a1440 [X86AsmPrinter] refactor to limit use of Modifier. NFC
Summary:
The Modifier memory operand is used in 2 cases of memory references
(H & P ExtraCodes). Rather than pass around the likely nullptr Modifier,
refactor the handling of the Modifier out from printOperand().

The refactorings in this patch:
- Don't forward declare printOperand, move its definition up.
  - The diff makes it look like there's a change to printPCRelImm
    (narrator: there's not).
- Create printModifiedOperand()
  - Move logic for Modifier to there from printOperand
  - Use printModifiedOperand in 3 call sites that actually create
    Modifiers.
- Remove now unused Modifier parameter from printOperand
- Remove default parameter from printLeaMemReference as it only has 1
  call site that explicitly passes a parameter.
- Remove default parameter from printMemReference, make the lone call
  site explicitly pass nullptr.
- Drop Modifier parameter from printIntelMemReference, as Intel style
  memory references don't support the Modifiers in question.

This will allow future changes to printOperand() to make it a pure virtual
method on the base AsmPrinter class, allowing for more generic handling
of some architecture generic constraints. X86AsmPrinter was the only
derived class of AsmPrinter to have additional parameters on its
printOperand function.

Reviewers: craig.topper, echristo

Reviewed By: echristo

Subscribers: hiraditya, llvm-commits, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60526

llvm-svn: 358122
2019-04-10 19:01:44 +00:00
Roman Lebedev dc67659ba5 [X86] X86ScheduleBdVer2: use !listsplat operator to cleanup loadres calculation
The problem is that one can't concatenate an empty list
(implied all-ones) with non-empty list here. The result
will be the non-empty list, and it won't match the length
of the ExePorts list.

The problems begin when LoadRes != 1 here,
which is the case in PdWriteResYMMPair,
and more importantly i think it will be the case for PdWriteResExPair.

llvm-svn: 358118
2019-04-10 18:26:42 +00:00
David Green 0861c87b06 Revert rL357745: [SelectionDAG] Compute known bits of CopyFromReg
Certain optimisations from ConstantHoisting and CGP rely on Selection DAG not
seeing through to the constant in other blocks. Revert this patch while we come
up with a better way to handle that.

I will try to follow this up with some better tests.

llvm-svn: 358113
2019-04-10 18:00:41 +00:00
Nick Desaulniers 5277b3ff25 [AsmPrinter] refactor to remove AsmVariant. NFC
Summary:
The InlineAsm::AsmDialect is only required for X86; no other architecture
makes use of it, yet it gets passed around between arch-specific
and general code while being unused for all architectures but X86.

Since the AsmDialect is queried from a MachineInstr, which we also pass
around, remove the additional AsmDialect parameter and query for it deep
in the X86AsmPrinter only when needed/as late as possible.

This refactor should help later planned refactors to AsmPrinter, as this
difference in the X86AsmPrinter makes it harder to make AsmPrinter more
generic.

Reviewers: craig.topper

Subscribers: jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, eraman, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, llvm-commits, peter.smith, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60488

llvm-svn: 358101
2019-04-10 16:38:43 +00:00
Simon Pilgrim 37d8d55823 [X86][AVX] getTargetConstantBitsFromNode - extract bits from X86ISD::SUBV_BROADCAST
llvm-svn: 358096
2019-04-10 16:24:47 +00:00
Craig Topper 391d5caa10 [X86] Move the 2 byte VEX optimization for MOV instructions back to the X86AsmParser::processInstruction where it used to be. Block when {vex3} prefix is present.
Years ago I moved this to an InstAlias using VR128H/VR128L. But now that we support the {vex3} pseudo prefix, we need to block the optimization when it is set, to match gas behavior.

llvm-svn: 358046
2019-04-10 05:43:20 +00:00
Craig Topper 9ca3a95f79 [X86] Support the EVEX versions vcvt(t)ss2si and vcvt(t)sd2si with the {evex} pseudo prefix in the assembler.
The EVEX versions are ambiguous with the VEX versions based on operands alone so we had explicitly dropped
them from the AsmMatcher table. Unfortunately, when we add them they incorrectly show up in the table before
their VEX counterparts. This is different from how the prioritization normally works.

To fix this we have to explicitly reject the instructions unless the {evex} prefix has been seen.

llvm-svn: 358041
2019-04-10 01:29:59 +00:00
Craig Topper 7143224272 [X86] Add VEX_LIG to scalar VEX/EVEX instructions that were missing it.
Scalar VEX/EVEX instructions don't use the L bit and don't look at it for decoding either.
So we should ignore it in our disassembler.

The missing instructions here were found by grepping the raw tablegen class definitions in
the tablegen debug output.

llvm-svn: 358040
2019-04-09 23:30:36 +00:00
Craig Topper 60f83544bb [X86] Fix a dangling StringRef issue introduced in r358029.
I was attempting to convert mnemonics to lower case after processing a pseudo prefix. But the ParseOperands just hold a StringRef for tokens so there is nowhere to allocate the memory.
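A minimal illustration of the hazard described above, using std::string_view as a stand-in for llvm::StringRef (the Token and makeLoweredToken names are made up for the example): a token that only holds a view cannot own a lowercased copy, so that copy has to live somewhere that outlives the token.

  #include <algorithm>
  #include <cctype>
  #include <string>
  #include <string_view>

  struct Token { std::string_view Text; };  // non-owning, like an operand holding a StringRef

  Token makeLoweredToken(std::string_view Mnemonic) {
    std::string Lowered(Mnemonic);
    std::transform(Lowered.begin(), Lowered.end(), Lowered.begin(),
                   [](unsigned char c) { return char(std::tolower(c)); });
    return Token{Lowered};  // BUG: the view points into a local string destroyed on return
  }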

Add FIXMEs for the lower case issue which also exists in the prefix parsing code.

llvm-svn: 358036
2019-04-09 21:37:21 +00:00
Amara Emerson 2b523f8162 [GlobalISel][AArch64] Allow CallLowering to handle types which are normally
required to be passed as different register types. E.g. <2 x i16> may need to
be passed as a larger <2 x i32> type, so formal arg lowering needs to be able to
truncate it back. Likewise, when dealing with returns of these types, they need
to be widened back in the appropriate way.

Differential Revision: https://reviews.llvm.org/D60425

llvm-svn: 358032
2019-04-09 21:22:33 +00:00
Craig Topper 8e2871cd2c [X86] Add support for {vex2}, {vex3}, and {evex} to the assembler to match gas. Use {evex} to improve one of our 32-bit AVX512 tests.
These can be used to force the encoding used for instructions.

{vex2} will fail if the instruction is not VEX encoded, but otherwise won't do anything since we prefer vex2 when possible. Might need to skip use of the _REV MOV instructions for this too, but I haven't done that yet.

{vex3} will force the instruction to use the 3 byte VEX encoding or fail if there is no VEX form.

{evex} will force the instruction to use the EVEX version or fail if there is no EVEX version.

Differential Revision: https://reviews.llvm.org/D59266

llvm-svn: 358029
2019-04-09 18:45:15 +00:00
Craig Topper 53ee783c6e [X86] Have EVEX2VEX tablegenerator use HasVEX_L and HasEVEX_L2 fields instead of the composite EVEX_LL field. Remove the EVEX_LL field. NFCI
The composite existed to simplify some other tablegen code and not really in an
important way. Remove the combined field and just calculate the vector size
using two ifs.

llvm-svn: 357972
2019-04-09 07:40:14 +00:00
Craig Topper f19f991b7f [X86] Use VEX_WIG for VPINSRB/W and VPEXTRB/W to match what is done for EVEX.
The instructions document this as W0 for the VEX encoding. But there's a
footnote mentioning that VEX.W is ignored in 64-bit mode. And the main VEX
encoding description says the VEX.W bit is ignored for instructions that are
equivalent to a legacy SSE instruction that uses REX.W to select a GPR which
would apply here.

By making this match EVEX we can remove a special case of allowing EVEX2VEX to
turn an EVEX.WIG instruction into VEX.W0.

llvm-svn: 357971
2019-04-09 07:40:10 +00:00
Craig Topper 2f9c1732b8 [X86] Split the VEX_WPrefix in X86Inst tablegen class into 3 separate fields with clear meanings.
llvm-svn: 357970
2019-04-09 07:40:06 +00:00
Craig Topper 6c11a31bce [X86] Derive ssmem and sdmem from X86MemOperand. NFCI
This changes the operand type from v4f32/v2f64 to iPTR which seems more correct. But that doesn't seem to do anything other than change the comments in X86GenDAGISel.inc. Probably because we use a ComplexPattern to do the matching so there's no autogenerated code to change.

llvm-svn: 357959
2019-04-09 00:24:17 +00:00
Craig Topper 3a4c2192a4 [X86] Fix a couple lowering functions that called ReplaceAllUsesOfValueWith for the newly created code and then return SDValue(). Use MERGE_VALUES instead.
Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and be pruned later. But it was unnecessary work to create it.

Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements.

For one of the cases I had to use UNDEF as a dummy value for a result we know is unused. This should get pruned later.

llvm-svn: 357935
2019-04-08 19:44:07 +00:00
Sanjay Patel 50c3b290ed [x86] make 8-bit shl undesirable
I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops.

We've hinted at making all 8-bit ops undesirable for the reason in the code comment:

// TODO: Almost no 8-bit ops are desirable because they have no actual
//       size/speed advantages vs. 32-bit ops, but they do have a major
//       potential disadvantage by causing partial register stalls.

...but that leads to massive diffs and exposes all kinds of optimization holes itself.

Differential Revision: https://reviews.llvm.org/D60286

llvm-svn: 357912
2019-04-08 13:58:50 +00:00
Craig Topper 6a6da233b9 [X86] Make LowerOperationWrapper more robust. Remove now unnecessary ReplaceAllUsesWith from LowerMSCATTER.
Previously LowerOperationWrapper took the number of results from the original
node and counted that many results from the new node. This was intended to drop
chain operands from FP_TO_SINT lowering that uses X87 with memory operations to
stack temporaries. The final load had an extra chain output that needs to be
ignored.

Unfortunately, it didn't work with scatter which has 2 result operands, the
mask output which is discarded and a chain output. The chain output is the one
that is needed but it comes second and it would be dropped by the previous
logic here. To workaround this we were doing a ReplaceAllUses in the lowering
code so that the generic legalization code wouldn't see any uses to replace
since it had been given the wrong result/type.

After this change we take the LowerOperation result directly if the original
node has one result. This allows us to directly return the chain from scatter
or the load data from the FP_TO_SINT case. When the original node has multiple
results we'll ensure the returned node has the same number and copy them over.
For cases where the original node has multiple results and the new code for some
reason has even more results, MERGE_VALUES can be used to pass only the needed
results.

llvm-svn: 357887
2019-04-08 07:39:17 +00:00
Craig Topper 424417da79 [X86] Use (SUBREG_TO_REG (MOV32rm)) for extloadi64i8/extloadi64i16 when the load is 4 byte aligned or better and not volatile.
Summary:
Previously we would use MOVZXrm8/MOVZXrm16, but those are longer encodings.

This is similar to what we do in the loadi32 predicate.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60341

llvm-svn: 357875
2019-04-07 19:19:44 +00:00
Simon Pilgrim 6d7fdd9ab7 [CostModel][X86] Masked load legalization requires a binary shuffle, not a select (PR39812)
Expansion/truncation is better described by SK_PermuteTwoSrc than SK_Select

llvm-svn: 357864
2019-04-07 13:26:09 +00:00
Simon Pilgrim 07adb6abda [X86][SSE] SimplifyDemandedBitsForTargetNode - Add initial PACKSS support
In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands.

This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved.
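The saturation argument can be checked in isolation: signed saturation from i16 to i8 (what PACKSSWB does per element) never changes the sign bit, so only the sign bit of each source element needs to be demanded. A standalone check, not code from the patch:

  #include <cassert>
  #include <cstdint>

  static int8_t packss_word(int16_t v) {  // per-element behaviour of PACKSSWB
    if (v > 127) return 127;
    if (v < -128) return -128;
    return int8_t(v);
  }

  int main() {
    for (int v = INT16_MIN; v <= INT16_MAX; ++v)
      assert((packss_word(int16_t(v)) < 0) == (v < 0));
    return 0;
  }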

Differential Revision: https://reviews.llvm.org/D60333

llvm-svn: 357859
2019-04-07 10:40:01 +00:00
Craig Topper 399102b464 [X86] When converting (x << C1) AND C2 to (x AND (C2>>C1)) << C1 during isel, try using andl over andq by favoring 32-bit unsigned immediates.
llvm-svn: 357848
2019-04-06 19:00:11 +00:00
Simon Pilgrim d0a53d4914 [X86] combineBitcastvxi1 - provide dst VT and src SDValue directly. NFCI.
Prep work to make it easier to reuse the BITCAST->MOVSMK combine in other cases.

llvm-svn: 357847
2019-04-06 18:54:17 +00:00
Craig Topper f9b9f8d2e4 [X86] Use a signed mask in foldMaskedShiftToScaledMask to enable a shorter immediate encoding.
This function reorders AND and SHL to enable the SHL to fold into an LEA. The
upper bits of the AND will be shifted out by the SHL so it doesn't matter what
mask value we use for these bits. By using sign bits from the original mask in
these upper bits we might enable a shorter immediate encoding to be used.
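The don't-care-bits argument can be checked directly: for (x & M) << C, any mask that agrees with M on the low 64 - C bits gives the same result, so the top C bits may be filled with sign bits to obtain a shorter immediate encoding. A standalone check with illustrative constants (not taken from the patch):

  #include <cassert>
  #include <cstdint>

  int main() {
    const unsigned C = 3;                               // shift that becomes an LEA scale
    const uint64_t Mask = 0x1FFFFFFFFFFFFFF8ULL;        // mask that is actually needed
    const uint64_t SignedMask = 0xFFFFFFFFFFFFFFF8ULL;  // same low 61 bits; -8 fits a sign-extended imm8
    assert(((Mask ^ SignedMask) << C) == 0);            // the masks differ only in bits shifted out
    const uint64_t Xs[] = {0, 1, 0x1234, 0xFFFFFFFFULL, 0x123456789ABCDEF0ULL, ~0ULL};
    for (uint64_t x : Xs)
      assert(((x & Mask) << C) == ((x & SignedMask) << C));
    return 0;
  }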

llvm-svn: 357846
2019-04-06 18:00:50 +00:00
Simon Pilgrim af1cbdd3ba Fix spelling mistake. NFCI.
llvm-svn: 357843
2019-04-06 15:38:34 +00:00
Francis Visoiu Mistrih 9d9d1b6b2b [X86] Enable tail calls for CallingConv::Swift
It's currently only enabled on AArch64 (enabled in r281376).

llvm-svn: 357809
2019-04-05 20:18:25 +00:00
Francis Visoiu Mistrih ab051a378c [X86] Preserve operand flag when expanding TCRETURNri
The expansion of TCRETURNri(64) would not keep operand flags like
undef/renamable/etc. which can result in machine verifier issues.

Also add plumbing to be able to use `-run-pass=x86-pseudo`.

llvm-svn: 357808
2019-04-05 20:18:21 +00:00
Craig Topper 80aa2290fb [X86] Merge the different Jcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon

Reviewed By: RKSimon

Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60228

llvm-svn: 357802
2019-04-05 19:28:09 +00:00
Craig Topper 7323c2bf85 [X86] Merge the different SETcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between SETcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: andreadb, courbet, RKSimon, spatel, lebedev.ri

Reviewed By: andreadb

Subscribers: hiraditya, lebedev.ri, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60138

llvm-svn: 357801
2019-04-05 19:27:49 +00:00
Craig Topper e0bfeb5f24 [X86] Merge the different CMOV instructions for each condition code into single instructions that store the condition code as an immediate.
Summary:
Reorder the condition code enum to match their encodings. Move it to MC layer so it can be used by the scheduler models.

This avoids needing an isel pattern for each condition code. And it removes
translation switches for converting between CMOV instructions and condition
codes.

Now the printer, encoder and disassembler take care of converting the immediate.
We use InstAliases to handle the assembly matching. But we print using the
asm string in the instruction definition. The instruction itself is marked
IsCodeGenOnly=1 to hide it from the assembly parser.

This does complicate the scheduler models a little since we can't assign the
A and BE instructions to a separate class now.

I plan to make similar changes for SETcc and Jcc.

Reviewers: RKSimon, spatel, lebedev.ri, andreadb, courbet

Reviewed By: RKSimon

Subscribers: gchatelet, hiraditya, kristina, lebedev.ri, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60041

llvm-svn: 357800
2019-04-05 19:27:41 +00:00
Fangrui Song 2c5c12c041 Change some dyn_cast to more appropriate isa. NFC
llvm-svn: 357773
2019-04-05 16:16:23 +00:00
Sanjay Patel 50a8652785 [DAGCombiner][x86] scalarize splatted vector FP ops
There are a variety of vector patterns that may be profitably reduced to a
scalar op when scalar ops are performed using a subset (typically, the
first lane) of the vector register file.

For x86, this is true for float/double ops and element 0 because
insert/extract is just a sub-register rename.

Other targets should likely enable the hook in a similar way.

Differential Revision: https://reviews.llvm.org/D60150

llvm-svn: 357760
2019-04-05 13:32:17 +00:00
Piotr Sobczak 0376ac1d94 [SelectionDAG] Compute known bits of CopyFromReg
Summary:
Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if
the virtual reg used has one def only.

This can be particularly useful when calling isBaseWithConstantOffset()
with the ISD::CopyFromReg argument, as more optimizations may get enabled
in the result.

Also add a missing truncation on X86, found by testing of this patch.

Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa

Reviewers: bogner, craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59535

llvm-svn: 357745
2019-04-05 07:44:09 +00:00
Craig Topper 94f1772b1e [X86] Promote i16 SRA instructions to i32
We already promote SRL and SHL to i32.

This will introduce sign extends sometimes which might be harder to deal with than the zero extends we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions.

I think there might be some DAG combine improvement opportunities in the test changes here.
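A minimal illustration (not from the patch) of why promoting i16 SRA to i32 introduces sign extends, while the existing SRL promotion only needs a zero extend:

  #include <cassert>
  #include <cstdint>

  int main() {
    int16_t x = -2;  // 0xFFFE
    // SRL: shifting the zero-extended value matches the narrow logical shift.
    assert(uint16_t(uint16_t(x) >> 1) == uint16_t(uint32_t(uint16_t(x)) >> 1));
    // SRA: the promoted shift has to see the sign-extended value...
    assert(int16_t(x >> 1) == int16_t(int32_t(x) >> 1));
    // ...because a zero-extended input would lose the sign.
    assert(int16_t(int32_t(uint16_t(x)) >> 1) != int16_t(x >> 1));
    return 0;
  }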

Differential Revision: https://reviews.llvm.org/D60278

llvm-svn: 357743
2019-04-05 06:32:50 +00:00
Evandro Menezes 85bd3978ae [IR] Refactor attribute methods in Function class (NFC)
Rename the functions that query the optimization kind attributes.

Differential revision: https://reviews.llvm.org/D60287

llvm-svn: 357731
2019-04-04 22:40:06 +00:00
James Y Knight a040174418 Revert [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
It unnecessarily breaks previously-working code which used varargs,
but didn't pass any float/double arguments (such as EDK2).

Also revert the fixup on top of that:
Revert [X86] Fix a test from r357317

This reverts r357317 (git commit d413f41de6)
This reverts r357380 (git commit 7af32444b9)

llvm-svn: 357718
2019-04-04 19:05:48 +00:00
Sanjay Patel 17648b848e [x86] eliminate unnecessary broadcast of horizontal op
This is another pattern that comes up if we more aggressively
scalarize FP ops.

llvm-svn: 357703
2019-04-04 14:46:13 +00:00
Craig Topper 3649c20884 [X86] Use INSERT_SUBREG rather than SUBREG_TO_REG when creating LEA64_32 during isel.
SUBREG_TO_REG is supposed to be used to assert that we know the upper bits are
zero. But that isn't the case here. We've done no analysis of the inputs.

llvm-svn: 357673
2019-04-04 05:00:18 +00:00
Craig Topper 051bd16faf [X86] Remove CustomInserters for RDPKRU/WRPKRU. Use some custom lowering and new ISD opcodes instead.
These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers.

This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes
that take the extra zeroes as inputs. The zeros will then go through isel on their own to select
the MOV32r0 pseudo. Then we just need to mention the physical registers directly
in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary
copies to/from physical registers.

llvm-svn: 357659
2019-04-04 00:28:49 +00:00
Craig Topper 52cac4b79f [X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom instruction selection instead.
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think it was largely so we could be pointer
size agnostic in the isel pattern.

To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.

With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.

I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defining rax or vice versa depending on whether we define the instruction to use EAX/RAX.

llvm-svn: 357652
2019-04-03 23:28:30 +00:00
Sanjay Patel c9a012e4ea [x86] fold shuffles of h-ops that have an undef operand
If an operand is undef, we can assume it's the same as the
other operand.

llvm-svn: 357644
2019-04-03 22:40:35 +00:00
Sanjay Patel 61b5e3c6a9 [x86] eliminate movddup of horizontal op
This pattern would show up as a regression if we more
aggressively convert vector FP ops to scalar ops.

There's still a missed optimization for the v4f64 legal
case (AVX) because we create that h-op with an undef operand.
We should probably just duplicate the operands for that
pattern to avoid trouble.

llvm-svn: 357642
2019-04-03 22:15:29 +00:00
Krzysztof Parzyszek 4841643a1d [X86] Extend boolean arguments to inline-asm according to getBooleanType
Differential Revision: https://reviews.llvm.org/D60208

llvm-svn: 357615
2019-04-03 17:43:14 +00:00
Simon Pilgrim 15919ad306 [X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 reduction on AVX1
Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result.

llvm-svn: 357611
2019-04-03 17:28:34 +00:00
Simon Pilgrim 9e28dddf55 [X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction on AVX1
Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets.

llvm-svn: 357608
2019-04-03 17:17:13 +00:00
Clement Courbet 26a8ed3ac9 [X86] Make the post machine scheduler macrofusion-aware.
Summary:
Given that X86 does not use this currently, this is an NFC. I'll
experiment with enabling and will report numbers.

Reviewers: andreadb, lebedev.ri

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60185

llvm-svn: 357568
2019-04-03 09:37:30 +00:00
Craig Topper 9224c114a9 [X86] Mark the default case of the X86InstrInfo::convertToThreeAddress switch as unreachable.
This function should only be called with instructions that are really convertible. And all
convertible instructions need to be handled by the switch. So nothing should use the default.

llvm-svn: 357529
2019-04-02 20:52:16 +00:00
Craig Topper ffd8662558 [X86] Check MI.isConvertibleTo3Addr() before calling convertToThreeAddress in X86FixupLEAs.
X86FixupLEAs just assumes convertToThreeAddress will return nullptr for any instruction that isn't convertible.

But the code in convertToThreeAddress for X86 assumes that any instruction coming in has at least 2 operands and that the second one is a register. But those properties aren't guaranteed of all instructions. We should check the instruction property first.

llvm-svn: 357528
2019-04-02 20:52:10 +00:00
Craig Topper 0d3a533270 [X86] Allow FixupLEAs to form INC/DEC under OptSize not just MinSize
This matches our usual INC/DEC heuristic used during isel.

llvm-svn: 357497
2019-04-02 17:13:03 +00:00
Craig Topper c5903c935c [X86] Use unsigned type for opcodes throughout X86FixupLEAs.
All of the interfaces related to opcode in MachineInstr and MCInstrInfo refer to opcodes as unsigned.

llvm-svn: 357444
2019-04-02 00:50:58 +00:00
Craig Topper 4307172b84 [X86] Classify the AVX512 rounding control operand as X86::OPERAND_ROUNDING_CONTROL instead of MCOI::OPERAND_IMMEDIATE. Add an assert on legal values of rounding control in the encoder and remove an explicit mask.
This should allow llvm-exegesis to intelligently constrain the rounding mode.

The mask in the encoder shouldn't be necessary any more. We used to allow codegen to use 8-11 for rounding mode and the assembler would use 0-3 to mean the same thing so we masked here and in the printer. Codegen now matches the assembler and the printer was updated, but I forgot to update the encoder.

llvm-svn: 357419
2019-04-01 19:08:15 +00:00
Matt Arsenault ebf90db084 X86: Fix override warning
llvm-svn: 357388
2019-04-01 14:08:26 +00:00
Clement Courbet 7e062c9b1f [X86] Make post-ra scheduling macrofusion-aware.
Subscribers: MatzeB, arsenm, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59688

llvm-svn: 357384
2019-04-01 13:48:50 +00:00
Craig Topper 2e1bf89e3a [X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics.
This is the appropriate opcode for only having a chain output. Though I'm not
sure it matters much.

llvm-svn: 357375
2019-04-01 05:26:12 +00:00
Sanjay Patel e1bc360fc6 [x86] allow movmsk with 2-element reductions
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.

If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0

  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5

  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400

  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0

  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600

  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.

Differential Revision: https://reviews.llvm.org/D59997

llvm-svn: 357367
2019-03-31 15:11:34 +00:00
Liang Zou 9f4a4d3974 fix typo: "\t" => " "
Reviewers: llvm.org, Jim

Reviewed By: Jim

Subscribers: arsenm, jvesely, nhaehnle, rupprecht, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59983

llvm-svn: 357365
2019-03-31 14:49:00 +00:00
Craig Topper e4a0fc7d75 [X86] Teach isel for RMW binops to handle negate
Negate updates flags like a subtract. We should be able to use the flags from the RMW form of negate when we have (store (X86ISD::SUB 0, load A), A)

Differential Revision: https://reviews.llvm.org/D60007

llvm-svn: 357353
2019-03-30 18:59:17 +00:00
Simon Pilgrim 10c9032c02 [X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316)
Fixes PR41316 where the expanded PAVG intrinsic had one of its ADDs turned into an OR due to its operands having no conflicting bits.
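The rewrite that defeats the plain match is easy to check in isolation: when two values share no set bits, their ADD and OR agree, so DAG combine may legitimately replace one with the other. A standalone check, not code from the patch:

  #include <cassert>
  #include <cstdint>

  int main() {
    for (uint32_t a = 0; a < 4096; ++a)
      for (uint32_t b = 0; b < 4096; ++b)
        if ((a & b) == 0)
          assert(a + b == (a | b));  // the 'add like' OR that detectAVGPattern must now accept
    return 0;
  }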

llvm-svn: 357351
2019-03-30 17:12:29 +00:00
Simon Pilgrim 3293455595 [X86][SSE] detectAVGPattern - begin generalizing ADD matches
Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316

llvm-svn: 357349
2019-03-30 15:31:53 +00:00
Amara Emerson d413f41de6 [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.

llvm-svn: 357317
2019-03-29 21:30:51 +00:00
Craig Topper 4ccb3b96b6 [X86] Use cached OptForSize in X86ISelDAGToDAG.cpp instead of pulling it from the function attribute. NFCI
llvm-svn: 357297
2019-03-29 18:36:40 +00:00
Simon Pilgrim aeaf7fcdde [X86] Add X86TargetLowering::isCommutativeBinOp override.
We currently just have test coverage for PMULUDQ - will add more in the future.

llvm-svn: 357244
2019-03-29 11:25:58 +00:00
Craig Topper c25c9b4d16 [X86] Teach the isel optimization for (x << C1) op C2 to (x op (C2>>C1)) << C1 to consider cases where C2>>C1 can fit an unsigned 32-bit immediate
For 64-bit operations we should consider whether the immediate can be made to fit
in an unsigned 32-bit immediate. For OR/XOR this allows us to load the immediate
with MOV32ri instead of movabsq. For AND this allows us to fold the immediate.
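The underlying identity can be checked standalone: (x << C1) op C2 equals (x op (C2 >> C1)) << C1 for AND unconditionally, and for OR/XOR when the low C1 bits of C2 are zero; the point of the change is that C2 >> C1 may then fit an unsigned 32-bit immediate even when C2 does not. The constants below are illustrative, not from the patch:

  #include <cassert>
  #include <cstdint>

  int main() {
    const unsigned C1 = 8;
    const uint64_t C2 = 0x0000001234567800ULL;  // low C1 bits are zero; C2 >> C1 fits in 32 bits
    const uint64_t Xs[] = {0, 1, 0x1234, 0xFFFFFFFFULL, 0x0123456789ABCDEFULL, ~0ULL};
    for (uint64_t x : Xs) {
      assert(((x << C1) & C2) == ((x & (C2 >> C1)) << C1));
      assert(((x << C1) | C2) == ((x | (C2 >> C1)) << C1));
      assert(((x << C1) ^ C2) == ((x ^ (C2 >> C1)) << C1));
    }
    return 0;
  }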

Differential Revision: https://reviews.llvm.org/D59867

llvm-svn: 357196
2019-03-28 18:05:37 +00:00
Sanjay Patel 5bbf6f0bd8 [x86] avoid cmov in movmsk reduction
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.

We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.

There seems to be a missing fold for sext of a bool bit here:

negl %ecx
movslq %ecx, %rax

...but that's an independent transform.

Differential Revision: https://reviews.llvm.org/D59818

llvm-svn: 357172
2019-03-28 14:16:13 +00:00
Clement Courbet 699dc025a6 [X86MacroFusion] Handle branch fusion (AMD CPUs).
Summary:
This adds a BranchFusion feature to replace the usage of the MacroFusion
for AMD CPUs.

See D59688 for context.

Reviewers: andreadb, lebedev.ri

Subscribers: hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59872

llvm-svn: 357171
2019-03-28 14:12:46 +00:00
Roman Lebedev c325be6cef [X86] AMD Piledriver (BdVer2): fine-tune some latencies
Based on llvm-exegesis measurements.

Now that llvm-exegesis is ~2 magnitudes faster, and is a bit smarter,
it is now possible to continue cleanup of the scheduler model.

With this, there are no more latency inconsistencies for the
opcodes that produce stable measurements, and only a few inconsistencies
for unstable measurements (MMX_* opcodes, opcodes that llvm-exegesis
measures by chaining - CMP, TEST, BT, SETcc, CVT, MOV, etc.)

llvm-svn: 357169
2019-03-28 13:40:34 +00:00
Clement Courbet 54c95e5172 [NFC] Format InlineFeatureIgnoreList.
To avoid more spurious clang-format changes when adding features (D59872).

llvm-svn: 357168
2019-03-28 13:38:58 +00:00
Simon Pilgrim 22be913ac0 [X86][AVX] Add missing vXi16 broadcast fold patterns
Now that D59484 has landed it's easier to add these.

Added missing AVX512BW v32i16 equivalents while I was at it.

llvm-svn: 357155
2019-03-28 10:25:13 +00:00
Sanjay Patel 1df0bb6264 [x86] improve AVX lowering of vector zext
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.

I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.

From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.

Differential Revision: https://reviews.llvm.org/D59777

llvm-svn: 357129
2019-03-27 22:42:11 +00:00
Sanjay Patel 704817912a [x86] look through bitcast operand of MOVMSK
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
  // Also support integer VTs to avoid a int->fp bitcast in the DAG.

llvm-svn: 357128
2019-03-27 22:24:03 +00:00
Craig Topper 4bc38cfe29 [X86ISelDAGToDAG] Move initialization of OptForSize and OptForMinSize from PreprocessISelDAG to runOnMachineFunction. NFCI
This makes more sense as a place to initialize these. I don't think runOnMachineFunction was overridden when these cached values were originally created.

llvm-svn: 357123
2019-03-27 21:05:07 +00:00
Craig Topper 7c9afc35bc [X86] Add post-isel pseudos for rotate by immediate using SHLD/SHRD
Haswell CPUs have special support for SHLD/SHRD with the same register for both sources. Such an instruction will go to the rotate/shift unit on port 0 or 6. This gives it 1 cycle latency and 0.5 cycle reciprocal throughput. When the register is not the same, it becomes a 3 cycle operation on port 1. Sandybridge and Ivybridge always have 1 cyc latency and 0.5 cycle reciprocal throughput for any SHLD.

When FastSHLDRotate feature flag is set, we try to use SHLD for rotate by immediate unless BMI2 is enabled. But MachineCopyPropagation can look through a copy and change one of the sources to be different. This will break the hardware optimization.

This patch adds a pseudo instruction to hide the second source input until after register allocation and MachineCopyPropagation. I'm not sure if this is the best way to do this or if there's some other way we can make this work.

Fixes PR41055

Differential Revision: https://reviews.llvm.org/D59391

llvm-svn: 357096
2019-03-27 17:29:34 +00:00
Simon Pilgrim ccb71b2985 Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249

llvm-svn: 357057
2019-03-27 10:25:02 +00:00
Craig Topper 7da7b97487 [X86] When iselling (x << C1) and/or/xor C2 as (x and/or/xor (C2>>C1)) << C1, go through the isel table instead of manually selecting.
Previously we manually selected the AND/OR/XOR with immediate and the SHL(or ADD if the shift is 1). But this was missing out on the opportunity to use a 64 bit AND with a 32-bit immediate and possibly other isel tricks we have built into the tables.

Instead, insert the new nodes into the DAG using insertDAGNode and allow them each to be selected through the normal table.

llvm-svn: 357049
2019-03-27 04:45:58 +00:00
Craig Topper 22387a56fe [X86] Simplify some code in matchBitExtract by using ANY_EXTEND.
We were manually outputting the code we would get from selecting ANY_EXTEND. We
can save some code by just letting an ANY_EXTEND go through isel on its own.

llvm-svn: 357045
2019-03-27 02:08:03 +00:00
Craig Topper 4dcabf8ddf [X86] In matchBitExtract, place all of the new nodes before Node's position in the DAG for the topological sort.
We were using OrigNBits, but that put all the nodes before the node we used to start the control computation. This caused some node earlier than the sequence we inserted to be selected before the sequence we created. We want our new sequence to be selected first since it depends on OrigNBits.

I don't have a test case. Found by reviewing the code.

llvm-svn: 356979
2019-03-26 05:31:32 +00:00
Craig Topper 10576fea82 [X86] In matchBitExtract, if we need to truncate the BEXTR make sure we put the BEXTR at Node's position in the DAG for the topological sort.
We were using OrigNBits, but that doesn't guarantee that it will be selected before the nodes that make up X.

llvm-svn: 356978
2019-03-26 05:12:23 +00:00
Craig Topper 795ebe3bff [X86] Remove unneeded FIXME. NFC
We do fold loads right below this.

llvm-svn: 356977
2019-03-26 05:12:21 +00:00
Craig Topper fd880d30b1 X86Parser: Fix potential reference to deleted object
Within the MatchFPUWaitAlias function, Operands[0] is potentially overwritten leading to &Op referencing a deleted object. To fix this, assign the reference after the function.

Differential Revision: https://reviews.llvm.org/D57376

llvm-svn: 356973
2019-03-26 03:12:43 +00:00
Craig Topper 3dce29b8e9 X86AsmParser: Do not process a non-existent token
This error can only happen if an unfinished operation is at Eof.

Patch by Brandon Jones

Differential Revision: https://reviews.llvm.org/D57379

llvm-svn: 356972
2019-03-26 03:12:41 +00:00
Craig Topper a17287f084 [X86] Update some of the getMachineNode calls from X86ISelDAGToDAG to also include a VT for a EFLAGS result.
This makes the nodes consistent with how they would be emitted from the isel
table.

llvm-svn: 356870
2019-03-25 07:22:18 +00:00
Craig Topper 1cc01c3228 [X86] When selecting (x << C1) op C2 as (x op (C2>>C1)) << C1, use the operation VT for the target constant.
Normally when the nodes we use here(AND32ri8 for example) are selected their
immediates are just converted from ConstantSDNode to TargetConstantSDNode
without changing VT from the original operation VT. So we should still be
emitting them with the operation VT.

Theoretically this could expose more accurate opportunities for CSE.

llvm-svn: 356869
2019-03-25 06:53:45 +00:00
Craig Topper 3810e35d3f [X86] Remove GetLo8XForm and use GetLo32XForm instead. NFCI
We were using this to create an AND32ri8 node from a 64-bit and, but that node
normally still uses a 32-bit immediate. So we should just truncate the existing
immediate to i32. We already verified it has the same value in bits 31:7.

llvm-svn: 356868
2019-03-25 06:53:44 +00:00
Craig Topper 5b43446831 [X86] Remove a couple unused SDNodeXForms. NFC
llvm-svn: 356867
2019-03-25 06:53:43 +00:00
Craig Topper 7c2554dd92 Revert r356688 "[X86] Don't avoid folding multiple use sign extended 8-bit immediate into instructions under optsize."
Looking back over how the one use optimization works, I don't think this is the right way to fix this.

llvm-svn: 356866
2019-03-25 01:25:32 +00:00
Simon Pilgrim 87d4ab8b92 [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.

llvm-svn: 356864
2019-03-24 19:06:35 +00:00
Simon Pilgrim a71c0ed471 [X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).

Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.

llvm-svn: 356858
2019-03-24 16:30:35 +00:00
Sanjay Patel 7d676dfd86 [x86] improve the default expansion of uaddsat/usubsat
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0

We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
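A standalone check of the two expansions above, exhaustive over an illustrative 8-bit element type (not code from the patch):

  #include <algorithm>
  #include <cassert>
  #include <cstdint>

  int main() {
    for (unsigned x = 0; x < 256; ++x) {
      for (unsigned y = 0; y < 256; ++y) {
        uint8_t X = uint8_t(x), Y = uint8_t(y);
        uint8_t Sum = uint8_t(X + Y);
        uint8_t UAddSat = (X > Sum) ? uint8_t(0xFF) : Sum;        // (X >u (X + Y)) ? -1 : X + Y
        uint8_t USubSat = (X > Y) ? uint8_t(X - Y) : uint8_t(0);  // (X >u Y) ? X - Y : 0
        assert(UAddSat == uint8_t(std::min(255u, x + y)));
        assert(USubSat == uint8_t(std::max(int(x) - int(y), 0)));
      }
    }
    return 0;
  }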

Differential Revision: https://reviews.llvm.org/D59006

llvm-svn: 356855
2019-03-24 13:55:54 +00:00
Sanjay Patel 2e92846d36 [x86] reduce code duplication; NFC
llvm-svn: 356836
2019-03-23 15:00:52 +00:00
Craig Topper ce1ed55a4a [X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32-bit halves together. If we have xmm registers we can use those to implement the operation instead. This results in fewer instructions than doing two separate 32-bit popcnt sequences.

This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
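For reference, the per-half arithmetic/logic expansion the message refers to is essentially the standard bit-counting sequence below (illustrative, not code from the patch); the xmm-based lowering replaces the two copies of it plus the final add:

  #include <cassert>
  #include <cstdint>

  static uint32_t popcnt32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (v * 0x01010101u) >> 24;
  }

  static uint32_t popcnt64_via_halves(uint64_t v) {
    return popcnt32(uint32_t(v)) + popcnt32(uint32_t(v >> 32));
  }

  int main() {
    assert(popcnt64_via_halves(0) == 0);
    assert(popcnt64_via_halves(~0ULL) == 64);
    assert(popcnt64_via_halves(0x8000000000000001ULL) == 2);
    return 0;
  }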

Differential Revision: https://reviews.llvm.org/D59662

llvm-svn: 356808
2019-03-22 20:47:02 +00:00
Craig Topper 1ffd8e8114 [X86] Use movq for i64 atomic load on 32-bit targets when SSE2 is enabled
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.

I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.

So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?

I've left some FIXMEs for improvements we can do without SSE2.
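A minimal source-level example of the kind of load affected (the function name is illustrative); per the message above, on 32-bit x86 with SSE2 this can now lower to a plain movq through an xmm register instead of a lock cmpxchg8b:

  #include <atomic>
  #include <cstdint>

  int64_t loadCounter(const std::atomic<int64_t> &Counter) {
    return Counter.load(std::memory_order_seq_cst);
  }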

Differential Revision: https://reviews.llvm.org/D59679

llvm-svn: 356807
2019-03-22 20:46:56 +00:00
Simon Pilgrim 564392d752 [X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203)
llvm-svn: 356784
2019-03-22 17:23:55 +00:00
Craig Topper b3bad3dce3 [X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI
For the future day when pointers don't have element types, we should just use the type of the load result instead.

llvm-svn: 356721
2019-03-21 21:37:18 +00:00
Simon Pilgrim c2e4405475 [X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers
We don't use X86ISD::ANDNP for mask registers.

Test case from @craig.topper (Craig Topper)

llvm-svn: 356696
2019-03-21 18:32:38 +00:00
Craig Topper c14f3e4222 [X86] Don't avoid folding multiple use sign extended 8-bit immediate into instructions under optsize.
Under optsize we try to avoid folding immediates into instructions. But if the immediate is 16 or 32 bits yet can be encoded as an 8-bit immediate, we don't save enough from disabling the folding unless the immediate has enough uses to make up for the size of the move, which is either 3 bytes or 5 bytes since there are no sign extended 8-bit moves. We would also save something if the immediate was a live out of the basic block and thus a move was unavoidable, but that would require a more advanced heuristic than just counting uses.

Note we only avoid folding multiple use immediates into the patterns that use X86ISD::ADD/SUB/XOR/OR/AND/CMP/ADC/SBB nodes and not the more common ISD::ADD/SUB/XOR/OR/AND nodes.

Differential Revision: https://reviews.llvm.org/D59522

llvm-svn: 356688
2019-03-21 17:38:58 +00:00
Craig Topper 9f0b17a248 [ScalarizeMaskedMemIntrin] Add support for scalarizing expandload and compressstore intrinsics.
This adds support for scalarizing these intrinsics as well as the X86TargetTransformInfo support to avoid scalarizing them in the cases X86 can handle.

I've omitted handling special cases for constant masks for this first pass. Though CodeGenPrepare can constant fold the branch conditions and remove some of the control flow anyway.

Fixes PR40994 and covers most of PR3666. Might want to implement constant masks to close that.

Differential Revision: https://reviews.llvm.org/D59180

llvm-svn: 356687
2019-03-21 17:38:52 +00:00
Craig Topper 8d46403b8e [X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set.
CMPXCHG8B was introduced with the i586/Pentium generation.

If it's not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.

Differential Revision: https://reviews.llvm.org/D59576

llvm-svn: 356631
2019-03-20 23:35:49 +00:00
Craig Topper 0367553304 [X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend.
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immediate, GPR to k-register, and masked op.

I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.

This trades a load for the move immediate and GPR move which is higher latency.
But it's probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.

Differential Revision: https://reviews.llvm.org/D59479

llvm-svn: 356618
2019-03-20 21:30:20 +00:00
Simon Pilgrim 2acca37a2d [X86] Use getConstantOperandAPInt to detect out-of-range shifts.
llvm-svn: 356549
2019-03-20 11:41:52 +00:00
Andrea Di Biagio 624f5deff4 [X86] Remove X86 specific dag nodes for RDTSC/RDTSCP/RDPMC. NFCI
This patch removes the following dag node opcodes from namespace X86ISD:

RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG

The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:

    RDTSC/RDTSCP don't implicitly read ECX.
    RDTSCP also implicitly writes ECX.

I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.

No functional change intended.

Differential Revision: https://reviews.llvm.org/D59547

llvm-svn: 356546
2019-03-20 11:21:15 +00:00
Craig Topper 97d104cbee [X86] Re-disable cmpxchg16b for 32-bit mode assembly parsing.
This was broken recently when I factored the 64-bit mode check into hasCmpxchg16b without thinking about the AssemblerPredicate.

llvm-svn: 356531
2019-03-19 23:57:16 +00:00
Simon Pilgrim e744f513c4 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - handle repeated shift amounts
If a value with multiple uses is only ever used for SSE shift amounts then we know that only the bottom 64-bits are needed.

llvm-svn: 356483
2019-03-19 17:23:25 +00:00
Jordan Rupprecht f74d45a775 [NFC] Fix unused variable in release builds
This was introduced in rL356468.

llvm-svn: 356477
2019-03-19 16:52:40 +00:00
Simon Pilgrim a56f2822d0 [SelectionDAG] Handle unary SelectPatternFlavor for ABS case in SelectionDAGBuilder::visitSelect
These changes are related to PR37743 and include:

    SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.

    Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.

    Add promoting the integer ABS node in the LegalizeIntegerType.

    Expand-based legalization of integer result for the ABS nodes.

    Expand-based legalization of ABS vector operations.

    Add some integer abs testcases for different type sizes for the Thumb arch.

    Add the custom ABS expansion and change the SAD pattern recognizer for the X86 arch: the i64 result of the ABS is expanded to (a C-like sketch of this expansion follows the list below):
        tmp = (SRA, Hi, 31)
        Lo = (UADDO tmp, Lo)
        Hi = (XOR tmp, (ADDCARRY tmp, Hi, Lo:1))
        Lo = (XOR tmp, Lo)

    The "detectZextAbsDiff" function is changed to recognize the following pattern given an ABS node:
        (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).

    Change integer abs testcases for codegen with the ABS node support for AArch64.
        Indicate that the ABS is legal for the i64 type when the NEON is supported.
        Change the integer abs testcases to show changing of codegen.

    Add combine and legalization of ABS nodes for Thumb arch.

    Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
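
A C-like sketch of the i64 ABS expansion listed above (illustrative only; the Lo/Hi/tmp names follow the expansion, everything else is mine):

```
#include <stdint.h>

// Illustrative only: the i64 ABS expansion written out for a 32-bit target.
// tmp is the sign mask of the high half; add it with carry, then XOR both halves.
uint64_t abs64(int64_t x) {
  uint32_t Lo = (uint32_t)x;
  uint32_t Hi = (uint32_t)((uint64_t)x >> 32);
  uint32_t tmp = (x < 0) ? 0xFFFFFFFFu : 0u;                  // (SRA Hi, 31)
  uint64_t sum = (uint64_t)Lo + tmp;                          // UADDO: low add, carry in bit 32
  uint32_t NewLo = (uint32_t)sum ^ tmp;                       // (XOR tmp, Lo)
  uint32_t NewHi = (Hi + tmp + (uint32_t)(sum >> 32)) ^ tmp;  // ADDCARRY, then XOR
  return ((uint64_t)NewHi << 32) | NewLo;
}
```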

For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743

Patch by: @ikulagin (Ivan Kulagin)

Differential Revision: https://reviews.llvm.org/D49837

llvm-svn: 356468
2019-03-19 16:24:55 +00:00
Craig Topper b24bdf626a [X86] Disable CQTO and CLTQ instructions in the assembly parser outside 64-bit mode.
llvm-svn: 356419
2019-03-18 22:06:14 +00:00
Craig Topper e732bc6bea [X86] Allow any 8-bit immediate to be used with BT/BTC/BTR/BTS not just sign extended 8-bit immediates.
We need to allow [128,255] in addition to [-128, 127] to match gas.

llvm-svn: 356413
2019-03-18 21:33:59 +00:00
Craig Topper f086e562f9 [X86] Use relocImm in the ROL8ri/ROL16ri/ROL32ri/ROL64ri patterns to be consistent with the ROR patterns.
llvm-svn: 356407
2019-03-18 20:43:15 +00:00
Craig Topper 0b9c640fe0 [X86] Replace uses of i64immSExt32_su with i64relocImmSExt32_su.
For the i8, i16, and i32 instructions we were using a relocImm. Presumably we should for i64 as well.

llvm-svn: 356406
2019-03-18 20:43:09 +00:00
Craig Topper f07062a798 [X86] Rename imm8_su/imm16_su/imm32_su to relocImm8_su/relocImm16_su/relocImm32_su/ to accurately reflect what they are.
llvm-svn: 356393
2019-03-18 18:54:06 +00:00
Adhemerval Zanella 664c1ef528 [TargetLowering] Add code size information on isFPImmLegal. NFC
This allows better code size for aarch64 floating point materialization
in a future patch.

Reviewers: evandro

Differential Revision: https://reviews.llvm.org/D58690

llvm-svn: 356389
2019-03-18 18:40:07 +00:00
Craig Topper c2b35ebc1d [X86] Remove the _alt forms of (V)CMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Similar to previous change done for VPCOM and VPCMP

Differential Revision: https://reviews.llvm.org/D59468

llvm-svn: 356384
2019-03-18 17:59:59 +00:00
Craig Topper ba898da132 [X86] Hopefully fix a tautological compare warning in printVecCompareInstr.
llvm-svn: 356359
2019-03-18 07:05:01 +00:00
Craig Topper b4c49255aa [X86] Make ADD*_DB post-RA pseudos and expand them in expandPostRAPseudo.
These are used to help convert OR->LEA when needed to avoid a copy. They
aren't needed after register allocation.

Happens to remove an ugly goto from X86MCCodeEmitter.cpp

llvm-svn: 356356
2019-03-18 05:48:18 +00:00
Craig Topper 860a27208e [X86] Add tab character to the custom printing of VPCMP and VPCOM instructions.
All the other instructions are printed with a preceding tab.

llvm-svn: 356355
2019-03-18 02:53:11 +00:00
Craig Topper 04cc28fe13 [X86] Merge printf32mem/printi32mem into a single printdwordmem. Do the same for all other printing functions.
The only thing the print methods currently need to know is the string to print for the memory size in intel syntax.

This patch merges the functions based on this string. If we ever need something else in the future, it's easy to split them back out.

This reduces the number of cases in the assembly printers. It shrinks the intel printer to only use 7 bytes per instruction instead of 8.

llvm-svn: 356352
2019-03-17 22:57:21 +00:00
Craig Topper affead9ad0 [X86] Remove the _alt forms of AVX512 VPCMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Similar to the previous patch for VPCOM.

Differential Revision: https://reviews.llvm.org/D59398

llvm-svn: 356344
2019-03-17 21:21:40 +00:00
Craig Topper 12509d87f3 [X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7, and an _alt form that allowed the full 8-bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.

The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".

The _alt form on the other hand parsed and printed like any other instruction with no specialness.

With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.

I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd, which have 32 immediate values, 3 encoding flavors, 3 register sizes, etc., this didn't seem to scale well for clang binary size. So custom printing seemed a better trade-off.

I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.

Differential Revision: https://reviews.llvm.org/D59398

llvm-svn: 356343
2019-03-17 21:21:37 +00:00
Simon Pilgrim f2c53b5d6c [X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.
Replaces existing i1-only fold.

llvm-svn: 356325
2019-03-16 15:02:00 +00:00
Simon Pilgrim 0f472e1d01 [X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

llvm-svn: 356324
2019-03-16 14:29:50 +00:00
Roman Lebedev 9f37790608 [X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

llvm-svn: 356301
2019-03-15 21:18:05 +00:00
Roman Lebedev b6e376ddfa [X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are performant, even the 8-bit case.
  That is because there we are computing the midpoint between zero and some random number,
  so if the branch predictor is in use, it is in an optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that a well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`;
  `cmov` is up to +10% worse than a well-predicted branch.
* However, I do not believe this is a concern. If the branch is well predicted, then PGO
  will also say that it is well predicted, and LLVM will happily expand the cmov back into a branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What I can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll get a branch),
in ~50% of cases you will have to pay the branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPUs.
For Intel CPUs, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, I'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

llvm-svn: 356300
2019-03-15 21:17:53 +00:00
Craig Topper af856db961 [X86] Strip the SAE bit from the rounding mode passed to the _RND opcodes. Use TargetConstant to save a conversion in the isel table.
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.

Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.

Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.

llvm-svn: 356294
2019-03-15 19:59:35 +00:00
Simon Pilgrim d33e62c826 [X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x)))
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.

llvm-svn: 356292
2019-03-15 19:14:28 +00:00
Philip Reames d238bf7855 [X86][GlobalISEL] Support lowering aligned unordered atomics
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering, and simply requires the actual load or store to be done as a single well aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.

Differential Revision: https://reviews.llvm.org/D57803

llvm-svn: 356280
2019-03-15 17:50:30 +00:00
Simon Pilgrim 65165d54bb [X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW
llvm-svn: 356270
2019-03-15 16:16:49 +00:00
Simon Pilgrim 0ad17402a9 [X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm.
Prep work for PR40203

llvm-svn: 356249
2019-03-15 11:05:42 +00:00
Craig Topper c747ac3f93 [X86] Fix the pattern changes from r356121 so that the ROR*r1/ROR*m1 pattern use the rotr opcode.
These instructions used to use rotl with a bitwidth-1 immediate. I changed the immediate to 1,
but failed to change the opcode.

Thankfully this seems to have not caused a functional issue because we now had two rotl by 1 patterns,
but the correct ones were earlier and took priority. So we just missed some optimization.

llvm-svn: 356164
2019-03-14 16:53:24 +00:00
Sanjay Patel 5d1df114e8 [x86] prevent infinite looping from vselect commutation (PR41066)
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.

llvm-svn: 356158
2019-03-14 15:32:34 +00:00
Craig Topper 54a0b53308 [X86] Add patterns for rotr by immediate to fix PR41057.
Prior to the introduction of funnel shift intrinsics we could count on rotate
by immediates preferring to use rotl since that's what MatchRotate would check
first. The or+shift pattern doesn't have a direction so one must be chosen
arbitrarily.

With funnel shift, there is a direction and fshr will try to use rotr first.
While fshl will try to use rotl first.

This patch adds the isel patterns for rotr to complement the rotl patterns. I've
put the rotr by 1 patterns in the instruction patterns and moved the rotl by
bitwidth-1 patterns to separate Pat patterns.
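
As a hedged illustration (my example, not from the commit), the usual or+shift rotate idiom is now canonicalized through a funnel shift, so the rotate-right direction ends up wanting the rotr patterns added here:

```
#include <cstdint>

// A rotate right by a constant; compilers canonicalize this to a funnel
// shift (fshr), which prefers the ROR-by-immediate form.
uint32_t rotr7(uint32_t x) { return (x >> 7) | (x << 25); }
```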

Fixes PR41057.

llvm-svn: 356121
2019-03-14 07:07:26 +00:00
Craig Topper 84abec2855 [X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b()
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.

We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.

I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.
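
For illustration (my example, not from the patch), a 16-byte atomic like the one below can only be handled lock-free when CMPXCHG16B is usable, which requires 64-bit mode in addition to the feature flag; otherwise it has to go through AtomicExpandPass/libcalls:

```
#include <atomic>
#include <cstdint>

// A 16-byte payload; lock-free handling needs CMPXCHG16B, which only exists
// in 64-bit mode, so trusting the feature flag alone is not enough.
struct alignas(16) Pair { uint64_t Lo, Hi; };

std::atomic<Pair> Shared{Pair{0, 0}};

Pair snapshot() { return Shared.load(std::memory_order_seq_cst); }
```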

Differential Revision: https://reviews.llvm.org/D59308

llvm-svn: 356078
2019-03-13 18:48:50 +00:00
Simon Pilgrim bef4fe056d [X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 356067
2019-03-13 17:00:18 +00:00
Simon Pilgrim d9aa879b67 [X86][AVX] Add combineConcatVectors support to improve subvector handling
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.

This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.

The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.

llvm-svn: 356064
2019-03-13 16:37:30 +00:00
Sanjay Patel 0a251e4076 [x86] limit extractelement of setcc to pre-legalization
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700

The bug was introduced recently here:
rL355741

This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8). 

llvm-svn: 356053
2019-03-13 14:49:52 +00:00
Simon Pilgrim 0c1e5aacd3 Fix signed/unsigned mismatch warning. NFCI.
llvm-svn: 356046
2019-03-13 13:14:14 +00:00
Simon Pilgrim 7abbd70300 [X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.

This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.

Differential Revision: https://reviews.llvm.org/D58888

llvm-svn: 356043
2019-03-13 12:20:39 +00:00
Jonas Hahnfeld c64d73cce2 [ELF] Fix GCC8 warnings about "fall through", NFCI
Add break statements in Object/ELF.cpp since the code should consider the
generic tags for Hexagon, MIPS, and PPC. Add a test (copied from llvm-readobj)
to show that this works correctly (earlier versions of this patch would have
asserted).

The warnings in X86ELFObjectWriter.cpp are actually false-positives since
the nested switch() handles all possible values and returns in all cases.
Make this explicit by adding llvm_unreachable's.

Differential Revision: https://reviews.llvm.org/D58837

llvm-svn: 356037
2019-03-13 10:38:17 +00:00
Craig Topper 750efba67c [X86] Enable printAliasInstr for the Intel assembly printer so that AAM and AAD will print without an immediate when the immediate is 10.
llvm-svn: 355997
2019-03-13 00:43:03 +00:00
Philip Reames 9134f84ba4 For faulting ops, include a comment w/the fault destination
A faulting_op is one that has specified behavior when a fault occurs, generally redirecting control flow to another location. This change just adds a comment to the assembly output which makes it both human readable and machine checkable without having to parse the FaultMap section. This is used to split a test file into two parts, so that I can (in a near-future commit) easily extend the test file to demonstrate another case.

llvm-svn: 355982
2019-03-12 21:05:31 +00:00
Sanjay Patel 737c27a9cd [x86] scalarize extractelement 0 of FP vselect
llvm-svn: 355955
2019-03-12 19:20:45 +00:00
Craig Topper 5c1177a68f [X86] Arrange more CPU features to inherit from earlier CPUs. NFCI
This makes SandyBridge inherit back to Westmere/Nehalem.

Make bdver1-4 inherit from each other and btver2 inherit from btver1.

llvm-svn: 355935
2019-03-12 16:35:30 +00:00
Craig Topper a958d40e78 [X86] Remove ProcModel and ProcFeatures tablegen classes. Move all feature lists into a ProcessorFeatures class.
ProcFeatures was a class that just concatenated two feature lists together and gave it a name. We used it to inherit features between CPUs.

ProcModel took two CPU feature lists and concatenated them before deferring to ProcessorModel. This was to allow inherited features and specific features to be passed to each CPU.

Both of these allowed for only very rigid CPU inheritance rules.

With this patch we now store all of the lists we were using for inheritance in one object and do any list concatenation we want there. Then we just pass whatever list we want from this class into the ProcessorModel class for each CPU.

Hopefully this gives us more flexibility to build up feature lists in whatever ways we think make sense. Perhaps untangling ISA flags and tuning flags.

I've only touched the CPUs that were directly affected by the removal of the ProcModel and ProcFeatures classes. We should move more of the feature lists into ProcessorFeatures.

llvm-svn: 355872
2019-03-11 22:29:00 +00:00
Evgeniy Stepanov aedec3f684 Remove ASan asm instrumentation.
Summary: It is incomplete and has no users AFAIK.

Reviewers: pcc, vitalybuka

Subscribers: srhines, kubamracek, mgorny, krytarowski, eraman, hiraditya, jdoerfert, #sanitizers, llvm-commits, thakis

Tags: #sanitizers, #llvm

Differential Revision: https://reviews.llvm.org/D59154

llvm-svn: 355870
2019-03-11 21:50:10 +00:00
Stanislav Mekhanoshin e98944ed47 Use bitset for assembler predicates
The AMDGPU target ran out of Subtarget feature flags, hitting the limit of 64.
AssemblerPredicates use at most a uint64_t for their representation.
At the same time, CodeGen exhausted this long ago and switched
to a FeatureBitset with the current limit of 192 bits.

This patch completes transition to the bitset for feature bits extending
it to asm matcher and MC code emitter.

Differential Revision: https://reviews.llvm.org/D59002

llvm-svn: 355839
2019-03-11 17:04:35 +00:00
Craig Topper 00afa193f1 [X86] Enable sse2_cvtsd2ss intrinsic to use an EVEX encoded instruction.
llvm-svn: 355810
2019-03-11 06:01:04 +00:00
Craig Topper f1e7482e69 [X86] Remove apparently unneeded patterns for storing a bitcasted extractelement.
I suspect if this pattern was seen, DAG combine would just change the type of the store to eliminate the bitcast.

llvm-svn: 355809
2019-03-11 06:01:02 +00:00
Craig Topper dc488767b2 [X86] Use 'UseAVX' in place of 'HasAVX, NoAVX512'. NFC
They mean the same thing, but 'HasAVX, NoAVX512' only appears in this one place. Every other place uses UseAVX.

llvm-svn: 355808
2019-03-11 06:01:00 +00:00
Craig Topper f19d6a4073 [X86] Add SCALAR_SINT_TO_FP/SCALAR_UINT_TO_FP ISD opcodes without rounding mode.
After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those.

llvm-svn: 355807
2019-03-11 04:37:01 +00:00
Craig Topper ecbc141dbf [X86] Split SCALEF(S) ISD opcodes into a version without rounding mode.
llvm-svn: 355806
2019-03-11 04:36:59 +00:00
Craig Topper a0b5338834 [X86] Split RCP28/RSQRT/GETEXP/EXP2 ISD opcodes into SAE and current direction nodes. Remove rounding mode operand.
llvm-svn: 355805
2019-03-11 04:36:57 +00:00
Craig Topper ba7d654526 [X86] Rename _RND versions of RANGE/REDUCE/GETMANT/RDNSCALE ISD opcodes to _SAE. Remove SAE operand.
No need to explicitly store it and match it during isel.

llvm-svn: 355804
2019-03-11 04:36:55 +00:00
Craig Topper 244ffcdf0d [X86] Rename X86ISD::CVTPH2PS_RND to CVTPH2PS_SAE. Remove SAE operand.
llvm-svn: 355803
2019-03-11 04:36:53 +00:00
Craig Topper 6059b1737e [X86] Rename the CVTT*_RND ISD nodes to _SAE and remove the SAE operand. Split VFPROUNDS_RND/VFPEXT(S)_RND into versions without rounding operand.
For VFPEXT(S) we only need current rounding mode and an SAE version. Neither need extra operand.

llvm-svn: 355802
2019-03-11 04:36:51 +00:00
Craig Topper 4c544ca993 [X86] Rename X86ISD::CMPM_RND and X86ISD::FSETCCM_RND to _SAE instead of _RND. Remove rounding operand.
The operand could only be the SAE encoding so no need to include it.

llvm-svn: 355801
2019-03-11 04:36:49 +00:00
Craig Topper 704303a2a1 [X86] Split the VFIXUPIMM/VFIXUPIMMS nodes into a current rounding mode and SAE ISD opcode.
Remove matching of FROUND_CURRENT and FROUND_NO_EXC for these nodes from isel table.

llvm-svn: 355800
2019-03-11 04:36:47 +00:00
Craig Topper b7e6bfe579 [X86] Begin removing matching of FROUND_CURRENT and FROUND_NO_EXC from isel tables.
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.

This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.

llvm-svn: 355799
2019-03-11 04:36:44 +00:00
Craig Topper d8ebbe4a76 [X86] Remove unneeded isel patterns from VCVTSI2SDZ and VCVTUSI2SDZ. NFC
We had patterns using X86ISD::SCALAR_SINT_TO_FP_RND/SCALAR_UINT_TO_FP_RND for
these instructions. There's nothing to round. Instead, we use a regular
sint_to_fp/uint_to_fp and a movsd as the pattern for these.

llvm-svn: 355796
2019-03-11 01:20:38 +00:00
Craig Topper 4cf8cdc51d [X86] Remove VCVTSI2SDZrrb_Int as it shouldn't exist.
This would convert a signed 32-bit integer to double precision with rounding. But there's nothing to round.

llvm-svn: 355795
2019-03-11 01:20:37 +00:00
Sanjay Patel 26e06e859e [x86] add x86-specific opcodes to extractelement scalarization list
llvm-svn: 355792
2019-03-10 18:56:21 +00:00
Craig Topper 66c9690ad6 [X86] Remove unused variable. NFC
llvm-svn: 355790
2019-03-10 17:36:41 +00:00
Craig Topper 93e15dfacc [X86] Make lowering of intrinsics with rounding mode stricter so that only valid rounding modes are lowered. Update tests accordingly
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.

With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.

llvm-svn: 355789
2019-03-10 17:20:45 +00:00
Craig Topper 0dc8c52d4e [X86] Remove dead code from the handler for INTR_TYPE_SCALAR_MASK_RM.
The code in here handles nodes with 6 or 7 operands. But only the 6 operand case is ever used these days.

llvm-svn: 355788
2019-03-10 17:20:42 +00:00
Craig Topper 1a872f2b15 Recommit r355224 "[TableGen][SelectionDAG][X86] Add specific isel matchers for immAllZerosV/immAllOnesV. Remove bitcasts from X86 patterns that are no longer necessary."
Includes a fix to emit a CheckOpcode for build_vector when immAllZerosV/immAllOnesV is used as a pattern root. This means it can't be used to look through bitcasts when used as a root, but that's probably ok. This extra CheckOpcode will ensure that the first match in the isel table will be a SwitchOpcode which is needed by the caching optimization in the ISel Matcher.

Original commit message:

Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.

By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.

This removes something like 40,000 bytes from the X86 isel table.

Differential Revision: https://reviews.llvm.org/D58595

llvm-svn: 355784
2019-03-10 05:21:52 +00:00
Sanjay Patel f84083b4db [x86] scalarize extract element 0 of FP cmp
An extension of D58282 noted in PR39665:
https://bugs.llvm.org/show_bug.cgi?id=39665

This doesn't answer the request to use movmsk, but that's an
independent problem. We need this and probably still need
scalarization of FP selects because we can't do that as a
target-independent transform (although it seems likely that
targets besides x86 should have this transform).

llvm-svn: 355741
2019-03-08 21:54:41 +00:00
Sanjay Patel b22f438df3 [x86] prevent infinite looping from inverse shuffle transforms
llvm-svn: 355713
2019-03-08 19:20:28 +00:00
Craig Topper 4505c99e72 [X86] Improve the type checking in isLegalMaskedLoad and isLegalMaskedGather.
We were just checking pointer size and type primitive size. But this caused unintended things like vectors of half being accepted by masked load/store.

For FP we now explicitly check for only double and float.

For pointers we now let any pointer through, trusting that only 32- and 64-bit pointers would be used to generate assembly.

We only check bitwidth after checking that the type is an integer.
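
A hedged sketch of the kind of check described (simplified; the function name and the accepted integer widths are assumptions for the sketch, not the actual X86TargetTransformInfo code):

```
#include "llvm/IR/Type.h"
using namespace llvm;

// Simplified sketch: accept f32/f64, any pointer, and integers of the widths
// the masked load/store lowering actually supports.
static bool isSupportedMaskedScalarTy(Type *ScalarTy) {
  if (ScalarTy->isFloatTy() || ScalarTy->isDoubleTy())
    return true;
  if (ScalarTy->isPointerTy())
    return true;
  if (!ScalarTy->isIntegerTy())
    return false;
  unsigned W = ScalarTy->getIntegerBitWidth();
  return W == 32 || W == 64;  // supported widths here are an assumption
}
```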

llvm-svn: 355667
2019-03-08 07:33:43 +00:00
Craig Topper d0c2dba644 [X86] Correct scheduler information for rotate by constant for Haswell, Broadwell, and Skylake.
Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not.

The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that.

RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd.

Differential Revision: https://reviews.llvm.org/D59077

llvm-svn: 355636
2019-03-07 21:22:56 +00:00
Craig Topper b3af5d3e57 [X86] Model ADC/SBB with immediate 0 more accurately in the Haswell scheduler model
Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies to GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms.

Differential Revision: https://reviews.llvm.org/D59058

llvm-svn: 355635
2019-03-07 21:22:51 +00:00
Vlad Tsyrklevich 2e1479e2f2 Delete x86_64 ShadowCallStack support
Summary:
ShadowCallStack on x86_64 suffered from the same racy security issues as
Return Flow Guard and had performance overhead as high as 13% depending
on the benchmark. x86_64 ShadowCallStack was always an experimental
feature and never shipped a runtime required to support it; as such,
there are no expected downstream users.

Reviewers: pcc

Reviewed By: pcc

Subscribers: mgorny, javed.absar, hiraditya, jdoerfert, cfe-commits, #sanitizers, llvm-commits

Tags: #clang, #sanitizers, #llvm

Differential Revision: https://reviews.llvm.org/D59034

llvm-svn: 355624
2019-03-07 18:56:36 +00:00
Craig Topper 3acc4236b8 [X86] Enable combineFMinNumFMaxNum for 512 bit vectors when AVX512 is enabled.
Simplified by just checking if the vector type is legal rather than listing all combinations of types and features.

Fixes PR40984.

llvm-svn: 355582
2019-03-07 06:30:19 +00:00
Simon Pilgrim 9d6347cfc1 [DAGCombine] Improve select (not Cond), N1, N2 -> select Cond, N2, N1 fold
Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well.

Requested by @spatel on D59006

llvm-svn: 355533
2019-03-06 18:52:52 +00:00
Simon Pilgrim 468bb2e601 [X86][SSE] VSELECT(XOR(Cond,-1), LHS, RHS) --> VSELECT(Cond, RHS, LHS)
As noticed on D58965

DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point.

Differential Revision: https://reviews.llvm.org/D58974

llvm-svn: 355494
2019-03-06 10:54:43 +00:00
Craig Topper c0e01d29a4 [X86] Enable the add with 128 -> sub with -128 encoding trick with X86ISD::ADD when the carry flag isn't used.
This allows us to use an 8-bit sign extended immediate instead of a 16 or 32 bit immediate.

Also do similar for 0x80000000 with 64-bit adds to avoid having to use a movabsq.
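
A minimal illustration (mine, not from the commit) of why the trick shrinks the encoding:

```
#include <cstdint>

// x + 128 cannot use the sign-extended 8-bit immediate form (imm8 tops out at
// +127), but x - (-128) computes the same value and -128 does fit in imm8,
// so the sub encoding is shorter. The same idea applies to 0x80000000 with
// 64-bit adds, avoiding a movabsq of the constant.
int64_t add128(int64_t x) { return x + 128; }
```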

llvm-svn: 355485
2019-03-06 07:36:38 +00:00
Craig Topper 97a1c4c340 [X86] Suppress load folding for add/sub with 128 immediate.
128 won't fit in a sign extended 8-bit immediate, but we can negate it to -128 and use the other operation. This results in a shorter encoding since the move would have used 16 or 32 bits for the immediate.

llvm-svn: 355484
2019-03-06 07:36:36 +00:00
Craig Topper 112ea336c3 [X86] Remove periods from the end of SubtargetFeature descriptions since the help printer adds a period.
Most features don't have periods already, but some did. When there is a period it causes llc -mattr=+help to print 2 periods.

llvm-svn: 355474
2019-03-06 02:36:48 +00:00
Craig Topper 57fd733140 Revert r355224 "[TableGen][SelectionDAG][X86] Add specific isel matchers for immAllZerosV/immAllOnesV. Remove bitcasts from X86 patterns that are no longer necessary."
This caused the first matcher in the isel table for many targets to be Opc_Scope instead of Opc_SwitchOpcode. This leads to a significant increase in isel match failures.

llvm-svn: 355433
2019-03-05 19:18:16 +00:00
Guozhi Wei f124e75656 [X86] In X86DomainReassignment.cpp add enclosed registers to EnclosedEdges
The variable X86DomainReassignment::EnclosedEdges is used to store registers that have been enclosed in some closure, so those registers will be ignored when creating new closures. But no register was ever actually put into this set, so a single register could be enclosed in multiple closures, which significantly increases compile time.

This patch adds a register into EnclosedEdges when it is enclosed into a closure.
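
A minimal sketch of the bookkeeping being fixed (illustrative only; this is not the X86DomainReassignment code itself):

```
#include <set>

// Sketch: record each register the moment it is absorbed into a closure so
// later closure construction skips it instead of enclosing it again.
static std::set<unsigned> EnclosedEdges;

static bool encloseRegister(unsigned Reg) {
  if (!EnclosedEdges.insert(Reg).second)
    return false;  // already part of some closure; ignore it
  // ... add Reg to the closure being built ...
  return true;
}
```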

Differential Revision: https://reviews.llvm.org/D58646

llvm-svn: 355430
2019-03-05 18:54:34 +00:00
Craig Topper 4a9dd7c39b [X86] Enable 8-bit SHL to convert to LEA
Differential Revision: https://reviews.llvm.org/D58870

llvm-svn: 355425
2019-03-05 18:37:41 +00:00
Craig Topper 216bf7f03b [X86] Allow 8-bit INC/DEC to be converted to LEA.
We already do this for 16/32/64 as well as 8-bit add with register/immediate. Might as well do it for 8-bit INC/DEC too.

Differential Revision: https://reviews.llvm.org/D58869

llvm-svn: 355424
2019-03-05 18:37:37 +00:00
Craig Topper 572e94ca02 [X86] Enable 8-bit OR with disjoint bits to convert to LEA
We already support 8-bit adds in convertToThreeAddress. But we can also support 8-bit OR if the bits are disjoint. We already do this for 16/32/64.
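
An illustrative case (mine, not from the patch) where the disjoint-bits condition holds:

```
#include <cstdint>

// The low nibble of (a & 0xF0) is known to be zero, so OR-ing in 0x05 touches
// only disjoint bits and (a & 0xF0) | 0x05 == (a & 0xF0) + 0x05; an add of
// this form can be emitted as a three-address LEA, avoiding a copy. Whether
// LEA is actually chosen depends on register allocation.
uint8_t tag(uint8_t a) { return (a & 0xF0) | 0x05; }
```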

Differential Revision: https://reviews.llvm.org/D58863

llvm-svn: 355423
2019-03-05 18:37:33 +00:00
Craig Topper 6a6ce5be84 [X86] Reduce some patterns by using FP instructions for integer types even when AVX2 is available and execution domain fixing will do the right thing
We have quite a few cases of using FP instructions for integer operations when only AVX1 is available. Then we switch to integer instructions with AVX2. In a lot of these cases execution domain fixing will take care of turning FP instructions into integer if it's profitable.

With this patch we just keep on using the FP instructions even with AVX2. I've only handled some cases that don't require messing with patterns that are defined in the instruction definition. Those will require more subtle multiclass work possibly involving null_frag, hasSideEffects = 0, etc.

Differential Revision: https://reviews.llvm.org/D58470

llvm-svn: 355361
2019-03-05 01:14:25 +00:00
Jeremy Morse 09d8ea5282 [X86] Avoid codegen changes when DBG_VALUE appears between lowered selects
X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo
instructions without accounting for debug intrinsics. This leads to different
codegen with and without option -g, if a DBG_VALUE instruction lands in the
middle of several lowered selects.

Work around this by skipping over debug instructions when looking for CMOV
sequences, and sinking those debug insts into the EmitLoweredSelect sunk block.
This might slightly shift where variables appear in the instruction sequence,
but won't re-order assignments.

Differential Revision: https://reviews.llvm.org/D58672

llvm-svn: 355307
2019-03-04 10:56:02 +00:00
Simon Pilgrim e48be5d698 Remove unused variable. NFCI.
llvm-svn: 355289
2019-03-03 14:23:07 +00:00
Simon Pilgrim d8e91a54c0 [X86] getShuffleScalarElt - peek through insert/extract subvector nodes.
llvm-svn: 355288
2019-03-03 14:11:05 +00:00
Simon Pilgrim 11149ea433 [X86] Pull out combineToConsecutiveLoads helper. NFCI.
llvm-svn: 355287
2019-03-03 13:53:27 +00:00
Craig Topper ce68659772 [X86] Prefer VPBLENDD for v2i64/v4i64 blends with AVX2.
We were using VPBLENDW for v2i64 and VBLENDPD for v4i64. VPBLENDD has better throughput than VPBLENDW on some CPUs so it makes sense to use it when possible. VBLENDPD will probably become VBLENDD during execution domain fixing, but we might as well use integer in isel while we can.

This should work around some issues with the domain fixing pass preferring PBLENDW when we start with PBLENDW. There may still be some v8i16 cases that could use PBLENDD.

llvm-svn: 355281
2019-03-03 00:18:07 +00:00
Amaury Sechet f24abf6511 [X86] Improve use of SHLD/SHRD
Summary:
This extends the variety of patterns that can generate a SHLD instead of using two shifts.

This fixes a regression that would be introduced by D57367 or D33587
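
For reference, a hedged example (not taken from the patch) of the classic double-shift pattern that SHLD implements:

```
#include <cstdint>

// Shift the 64-bit pair (hi:lo) left by n (1..31) and return the new high
// word: exactly what SHLD r32, r32, imm8/CL computes in one instruction
// instead of two shifts and an OR.
uint32_t shld(uint32_t hi, uint32_t lo, unsigned n) {
  return (hi << n) | (lo >> (32 - n));
}
```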

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57389

llvm-svn: 355260
2019-03-02 02:44:16 +00:00
Craig Topper 4cfc39179e [TableGen][SelectionDAG][X86] Add specific isel matchers for immAllZerosV/immAllOnesV. Remove bitcasts from X86 patterns that are no longer necessary.
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.

By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.

This removes something like 40,000 bytes from the X86 isel table.

Differential Revision: https://reviews.llvm.org/D58595

llvm-svn: 355224
2019-03-01 20:18:38 +00:00
Sanjay Patel 7fc6ef7dd7 [x86] scalarize extract element 0 of FP math
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the test diffs:

  1. Memory loads may get scalarized resulting in more efficient code.
  2. Memory stores may get scalarized resulting in more efficient code.
  3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
  4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.

The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.
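
A small intrinsics-level illustration (my example, assuming SSE) of the scalarization being performed:

```
#include <immintrin.h>

// Only lane 0 of the vector add is consumed, so the whole-vector addps can be
// replaced by a scalar addss on the low elements; that is what extracting
// element 0 of FP math scalarizes to.
float lane0_sum(__m128 a, __m128 b) {
  return _mm_cvtss_f32(_mm_add_ps(a, b));  // may compile to a single addss
}
```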

Differential Revision: https://reviews.llvm.org/D58282

llvm-svn: 355130
2019-02-28 19:47:04 +00:00
Craig Topper 38427c47b9 [X86] Don't peek through bitcasts before checking ISD::isBuildVectorOfConstantSDNodes in combineTruncatedArithmetic
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern: (binop (trunc X), (trunc (bitcast (build_vector)))), which has two truncates in it. That will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner, causing an infinite loop.

Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.

Differential Revision: https://reviews.llvm.org/D58705

llvm-svn: 355116
2019-02-28 18:49:29 +00:00
Bjorn Pettersson d30f308a9f Add support for computing "zext of value" in KnownBits. NFCI
Summary:
The description of KnownBits::zext() and
KnownBits::zextOrTrunc() has confusingly been telling
that the operation is equivalent to zero extending the
value we're tracking. That has not been true, instead
the user has been forced to explicitly set the extended
bits as known zero afterwards.

This patch adds a second argument to KnownBits::zext()
and KnownBits::zextOrTrunc() to control if the extended
bits should be considered as known zero or as unknown.
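
A self-contained illustration of the distinction (this is not LLVM's KnownBits API; the struct and function here are made up for the example):

```
#include <cstdint>

// Tracking known-zero/known-one bits of an 8-bit value being widened to 32
// bits. A real zero extension makes bits 8..31 known zero; if we merely know
// the low 8 bits of some wider value, those high bits stay unknown.
struct Known { uint32_t Zero, One; };

Known zext8To32(Known K8, bool ExtendedBitsAreKnownZero) {
  Known K32{K8.Zero & 0xFFu, K8.One & 0xFFu};
  if (ExtendedBitsAreKnownZero)
    K32.Zero |= 0xFFFFFF00u;  // the new high bits are definitely zero
  return K32;
}
```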

Reviewers: craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58650

llvm-svn: 355099
2019-02-28 15:45:29 +00:00
Simon Pilgrim 134bc19079 [X86][AVX] Remove superfluous insert_subvector(zero, bitcast(x)) -> bitcast(insert_subvector(zero, x)) fold
This is caught by other existing bitcast folds.

llvm-svn: 355084
2019-02-28 11:39:52 +00:00
Simon Pilgrim 87aeff8bbb [X86][AVX] Fold vf64 concat_vectors(movddup(x),movddup(x)) -> broadcast(x)
llvm-svn: 355078
2019-02-28 10:53:58 +00:00
Craig Topper 6ca7398a1e [X86] Use PreprocessISelDAG to convert vector sra/srl/shl to the X86 specific variable shift ISD opcodes.
This allows us to use the same set of isel patterns for sra/srl/shl, which are undefined for out-of-range shifts, and for the intrinsic shifts, which aren't undefined.

Doing this late allows DAG combine to have every opportunity to optimize the sra/srl/shl nodes.

This removes about 7000 bytes from the isel table and simplifies the td files.

llvm-svn: 355071
2019-02-28 07:21:26 +00:00
Craig Topper 240315aa64 [X86] Use X86::LAST_VALID_COND instead of assuming X86::COND_S is the last encoding. NFC
llvm-svn: 355059
2019-02-28 01:00:31 +00:00
Simon Pilgrim 1001a6ab03 [X86][AVX] Pull out some INSERT_SUBVECTOR combines into a combineConcatVectorOps helper. NFCI
A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node.

I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered.

Differential Revision: https://reviews.llvm.org/D58583

llvm-svn: 355015
2019-02-27 18:46:32 +00:00
Simon Pilgrim 71bb6850cf [X86][AVX] Only combine loads to broadcasts for legal types
Thanks to @echristo for spotting this.

llvm-svn: 354961
2019-02-27 11:17:25 +00:00
Reid Kleckner 8fda7e15e6 [X86] Fix bug in vectorcall calling convention
The original implementation can't correctly handle __m256 and __m512 types
passed by reference through the stack. This patch fixes it.

Patch by Wei Xiao!

Differential Revision: https://reviews.llvm.org/D57643

llvm-svn: 354921
2019-02-26 19:48:16 +00:00
Ganesh Gopalasubramanian e172d7008d [X86] AMD znver2 enablement
This patch enables the following

1) AMD family 17h "znver2" tune flag (-march, -mcpu).
2) ISAs that are enabled for "znver2" architecture.
3) For the time being, it uses the znver1 scheduler model.
4) Tests are updated.
5) Scheduler descriptions are yet to be put in place.

Reviewers: craig.topper

Differential Revision: https://reviews.llvm.org/D58343

llvm-svn: 354897
2019-02-26 16:55:10 +00:00
Reid Kleckner 2f055f026a [X86] Fix bug in x86_intrcc with arg copy elision
Summary:
Use a custom calling convention handler for interrupts instead of fixing
up the locations in LowerMemArgument. This way, the offsets are correct
when constructed and we don't need to account for them in as many
places.

Depends on D56883

Replaces D56275

Reviewers: craig.topper, phil-opp

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D56944

llvm-svn: 354837
2019-02-26 02:11:25 +00:00
Craig Topper 316c58e8f1 [X86] Improve detection of unneeded shift amount masking to also handle the case that the LHS has known zeroes in it
If the LHS has known zeros, the RHS immediate will have had bits removed. So call computeKnownBits to get the known zeroes so we can handle this case.

Differential Revision: https://reviews.llvm.org/D58475

llvm-svn: 354811
2019-02-25 19:42:47 +00:00
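
A quick standalone check of the idea (illustrative values, not the LLVM code): x86 shifts mask their amount to the low 5 bits anyway, so when the bits the immediate mask would clear are already known zero in the amount, the explicit AND is redundant.

```
// Brute-force check (illustrative, not the LLVM code). Here the shift amount
// y is known to be < 10, so bit 4 is known zero; masking with 0x0F therefore
// matches the hardware's implicit 5-bit (0x1F) masking and can be dropped.
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t z = 0; z < 1000; ++z) {
    uint32_t y = z % 10;                  // analysis knows bit 4 of y is zero
    uint32_t withMask = 1u << (y & 0x0F); // shl with the shrunk AND mask
    uint32_t hardware = 1u << (y & 0x1F); // what the shl instruction does
    assert(withMask == hardware);
  }
  return 0;
}
```
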
Simon Pilgrim c61f1e8e6c [X86] Merge ISD::ADD/SUB nodes into X86ISD::ADD/SUB equivalents (PR40483)
Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.

Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.

Differential Revision: https://reviews.llvm.org/D58597

llvm-svn: 354771
2019-02-25 11:19:37 +00:00
Simon Pilgrim cfaf663a35 [X86] Combine zext(packus(x),packus(y)) -> concat(x,y) (PR39637)
It's proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.

llvm-svn: 354757
2019-02-24 19:57:52 +00:00
Craig Topper 3fe4bd464c [X86] Fix tls variable lowering issue with large code model
Summary:
The problem here is the lowering for tls variable. Below is the DAG for the code.
SelectionDAG has 11 nodes:

t0: ch = EntryToken
      t8: i64,ch = load<(load 8 from `i8 addrspace(257)* null`, addrspace 257)> t0, Constant:i64<0>, undef:i64
        t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
      t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64
    t12: i64 = add t8, t11
  t4: i32,ch = load<(dereferenceable load 4 from @x)> t0, t12, undef:i64
t6: ch = CopyToReg t0, Register:i32 %0, t4
And when mcmodel is large, the instruction below cannot be folded.

  t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64
So "t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64" is lowered to " Morphed node: t11: i64,ch = MOV64rm<Mem:(load 8 from got)> t10, TargetConstant:i8<1>, Register:i64 $noreg, TargetConstant:i32<0>, Register:i32 $noreg, t0"

When LLVM starts to lower "t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]", it fails.

The patch is to fold the load and X86ISD::WrapperRIP.

Fixes PR26906

Patch by LuoYuanke

Reviewers: craig.topper, rnk, annita.zhang, wxiao3

Reviewed By: rnk

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58336

llvm-svn: 354756
2019-02-24 19:33:37 +00:00
Craig Topper 5532a98737 [X86][SSE] Use pblendw for v4i32/v2i64 during isel.
Summary:

Previously we used BLENDPS/BLENDPD but that puts the blend in the FP domain. Under optsize, the two address instruction pass can cause blendps/blendpd to commute to blendps/blendpd. But we probably shouldn't do that if the original type was an integer. So use pblendw instead.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58574

llvm-svn: 354755
2019-02-24 19:23:41 +00:00
Craig Topper ce2bd19c49 [X86] Correct some ADC/SBB with immediate scheduler data for Broadwell and Skylake.
Summary:
The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate.

The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate.

Reviewers: RKSimon, andreadb

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58581

llvm-svn: 354754
2019-02-24 19:23:39 +00:00
Craig Topper be3348573e [LegalizeTypes][AArch64][X86] Make type legalization of vector (S/U)ADD/SUB/MULO follow getSetCCResultType for the overflow bits. Make UnrollVectorOverflowOp properly convert from scalar boolean contents to vector boolean contents
Summary:
When promoting the overflow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.

We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have put the boolean value only in bit 0 of the element and left all other bits of the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.

Reviewers: spatel, RKSimon, nikic

Reviewed By: nikic

Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58567

llvm-svn: 354753
2019-02-24 19:23:36 +00:00
Simon Pilgrim 4f4f9abdfa [X86][AVX] Rename lowerShuffleByMerging128BitLanes to lowerShuffleAsLanePermuteAndRepeatedMask. NFC.
Name better matches the other similar 'lane permute' and 'repeated mask' functions we have.

llvm-svn: 354749
2019-02-24 17:30:06 +00:00
Craig Topper be9eeb5526 Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
And its follow ups r354511, r354640.

A follow patch will fix the issue that caused it to be reverted.

llvm-svn: 354737
2019-02-23 21:41:42 +00:00
Craig Topper ccc860cb81 Recommit r354647 and r354648 "[LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract"
r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."

These were reverted in r354713 as their context depended on other patches that were reverted for a bug.

llvm-svn: 354734
2019-02-23 19:51:32 +00:00
Simon Pilgrim f383a47b7d [X86][AVX] combineInsertSubvector - remove concat_vectors(load(x),load(x)) --> sub_vbroadcast(x)
D58053/rL354340 added this to EltsFromConsecutiveLoads directly

llvm-svn: 354732
2019-02-23 18:53:03 +00:00
Simon Pilgrim 398d0b9e96 Fix MSVC constant truncation warnings. NFCI.
llvm-svn: 354731
2019-02-23 18:49:02 +00:00
Simon Pilgrim e08f177ea2 [X86][AVX] concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) --> broadcast(x)
For AVX1, limit this to i32/f32/i64/f64 loading cases only.

llvm-svn: 354730
2019-02-23 18:34:05 +00:00
Simon Pilgrim 31793733a0 [X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place
Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD.

llvm-svn: 354729
2019-02-23 17:10:47 +00:00
Craig Topper 75afc0105c [X86] Sign extend the 8-bit immediate when commuting blend instructions to match isel.
Conversion from ConstantSDNode to MachineInstr sign extends immediates from their APInt representation to int64_t.

This commit makes sure we do the same for commuting. The test changes show how this improves CSE. This issue was made worse by the MachineCSE using commuteInstruction to undo a commute. So we virtually guarantee the sign extend from isel would be lost.

The improved CSE also occurred with r354363, but that was reverted. I'm working to undo the revert, but wanted to get this fix in while it was easy to see the results.

llvm-svn: 354724
2019-02-23 08:34:10 +00:00
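
The mismatch reduces to two integer encodings of the same 8-bit immediate; a small standalone illustration (the immediate value is made up):

```
// Illustration of the CSE mismatch (made-up immediate; not the LLVM code).
// The same 8-bit blend mask compares unequal once one copy has been sign
// extended to int64_t and the other has not.
#include <cassert>
#include <cstdint>

int main() {
  uint8_t imm8 = 0xAA;                          // blend mask with bit 7 set

  int64_t fromIsel    = int64_t(int8_t(imm8));  // sign extended: -86
  int64_t fromCommute = int64_t(imm8);          // zero extended: 170

  assert(fromIsel != fromCommute);              // MachineCSE sees two values
  assert(uint8_t(fromIsel) == uint8_t(fromCommute)); // same encoded bits
  return 0;
}
```
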
Reid Kleckner e3876637cf Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
r354363 caused https://crbug.com/934963#c1, which has a plain C reduced
test case.

I also had to revert some dependent changes:
- r354648
- r354647
- r354640
- r354511

llvm-svn: 354713
2019-02-23 01:19:42 +00:00
Craig Topper a9697f24cf [X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalize to 128-bits.
If the input type will be promoted to 128 bits, it's better to put a sign_extend_inreg/and in the 128-bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.

Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.

llvm-svn: 354709
2019-02-23 00:35:02 +00:00
Sanjay Patel a9e289174a [x86] allow narrowing of vector UINT_TO_FP
As discussed in:
D56864
D58197

Always use the narrow (128-bit) instruction when possible.
We already had the signed int version of this transform.

llvm-svn: 354675
2019-02-22 15:47:45 +00:00
Sanjay Patel 1baf7896cc [x86] simplify code in combineExtractSubvector; NFC
Only the 1st fold is attempted pre-legalization, but it requires
legal (simple) types too, so we don't need an EVT in any of the code.

llvm-svn: 354674
2019-02-22 15:28:22 +00:00
Craig Topper 3a391fc0e8 [X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit.
Type legalization is causing two nodes to be created here, but we can use a single node to extend from v8i16 to v2i64.

llvm-svn: 354648
2019-02-22 01:49:53 +00:00
Craig Topper 427404c769 [X86] Fix some copy/paste mistakes that caused a VR128 to be used as the address of a load in an isel pattern
This was introduced in r354511.

Fixes PR40811.

llvm-svn: 354640
2019-02-22 00:04:35 +00:00
Craig Topper 2b34fdc67f [X86] Remove hasSideEffects=1 from the X87 pseudos with folded load.
This was done in r321424 to prevent scheduling from reordering things. But now that we model FPCW as a dependency, I don't think the same scheduling we were trying to prevent can occur.

llvm-svn: 354628
2019-02-21 22:00:15 +00:00
Sanjay Patel 234a5e8ea4 [x86] vectorize more cast ops in lowering to avoid register file transfers
This is a follow-up to D56864.

If we're extracting from a non-zero index before casting to FP,
then shuffle the vector and optionally narrow the vector before doing the cast:

cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0

This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974

Differential Revision: https://reviews.llvm.org/D58197

llvm-svn: 354619
2019-02-21 20:40:39 +00:00
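
A standalone intrinsics sketch of the equivalence (index 2 picked for illustration; assumes SSE2, not the lowering code): shuffling the wanted lane down to lane 0 and converting in the vector domain produces the same scalar, without a GPR round trip.

```
// Sketch of the equivalence (assumes SSE2; index 2 chosen for illustration).
// (float)(extractelement V, 2) equals: shuffle lane 2 to lane 0, convert the
// vector with cvtdq2ps, then read lane 0 -- keeping the value in XMM registers.
#include <immintrin.h>
#include <cassert>

int main() {
  __m128i v = _mm_setr_epi32(10, -7, 123456, 42);

  // Scalar form: extract the element, then convert it (cvtsi2ss).
  alignas(16) int tmp[4];
  _mm_store_si128((__m128i *)tmp, v);
  float scalarForm = (float)tmp[2];

  // Vector form: bring lane 2 to lane 0, convert, read lane 0.
  __m128i shuf = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));
  float vectorForm = _mm_cvtss_f32(_mm_cvtepi32_ps(shuf));

  assert(scalarForm == vectorForm);
  return 0;
}
```
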
Nirav Dave dce91c1edb [X86] Fix copy-paste error in @ccz flag.
@ccz operand should be equivalent to @cce.

llvm-svn: 354588
2019-02-21 15:28:31 +00:00
Simon Pilgrim e6b338cbef [X86][SSE] combineX86ShufflesRecursively - moved to generic op input index lookup. NFCI.
We currently bail if the target shuffle decodes to more than 2 input vectors; this change alters the input index to work for any number of inputs, for when we drop that requirement.

llvm-svn: 354575
2019-02-21 12:24:49 +00:00
Nikita Popov c3b496de7a [SDAG] Support vector UMULO/SMULO
Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.

This adds an extra UnrollVectorOverflowOp() method to SDAG, because
the general UnrollOverflowOp() method can't deal with multiple results.

Additionally we need to expand UMULO/SMULO during vector op
legalization, as it may result in unrolling, which may need additional
type legalization.

Differential Revision: https://reviews.llvm.org/D57997

llvm-svn: 354513
2019-02-20 20:41:44 +00:00
Craig Topper 31823fba2e [X86] Add more load folding patterns for blend instructions as a follow up to r354363.
This avoids depending on the peephole pass to do load folding.

Also adds some load folding for some insert_subvector patterns that use blend.

All of this was found by temporarily adding TB_NO_FORWARD to the blend immediate entries in the load folding tables.

I've added -disable-peephole to some of the affected tests from that experiment to ensure we're testing isel patterns.

llvm-svn: 354511
2019-02-20 20:18:20 +00:00
Simon Pilgrim dca47c659c [X86][SSE] combineX86ShufflesRecursively - begin generalizing the number of shuffle inputs. NFCI.
We currently bail if the target shuffle decodes to more than 2 input vectors; this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit.

llvm-svn: 354489
2019-02-20 17:58:29 +00:00
Craig Topper e4025c5eb1 [X86] Remove FeatureSlowIncDec from Sandy Bridge and later Intel Core CPUs
Summary:
Inc and Dec were at one point slow on Intel CPUs due to their tendency to cause partial flag stalls on P6 derived CPU cores. This is because these instructions are defined to preserve the carry flag. This partial flag stall issue persisted until Sandy Bridge when flag merging was changed to be handled as a data dependency instead of as a stall until retirement. Sandy Bridge and later CPUs rename the C flag separately from OSPAZ so there is no flag merge needed on INC/DEC to preserve the C flag.

Given these improvements I don't know why INC/DEC was ever considered slow on Sandy Bridge. If anything they should have been disabled on the earlier CPUs instead.

Note after this patch, INC/DEC are still considered slow on Silvermont, Goldmont, Knights Landing and our generic "x86-64" CPU.

Reviewers: spatel, RKSimon, chandlerc

Reviewed By: chandlerc

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D58412

llvm-svn: 354436
2019-02-20 05:39:11 +00:00
Eric Christopher 2534592b9f Temporarily Revert "[X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)"
As this has broken the lto bootstrap build for 3 days and is
showing a significant regression on the Dither_benchmark results (from
the LLVM benchmark suite) -- specifically, on the
BENCHMARK_FLOYD_DITHER_128, BENCHMARK_FLOYD_DITHER_256, and
BENCHMARK_FLOYD_DITHER_512; the others are unchanged.  These have
regressed by about 28% on Skylake, 34% on Haswell, and over 40% on
Sandybridge.

This reverts commit r353923.

llvm-svn: 354434
2019-02-20 04:42:07 +00:00
Craig Topper 8eade09249 [X86] Mark FP32_TO_INT16_IN_MEM/FP32_TO_INT32_IN_MEM/FP32_TO_INT64_IN_MEM as clobbering EFLAGS to prevent mis-scheduling during conversion from SelectionDAG to MIR.
After r354178, these instruction expand to a sequence that uses an OR instruction. That OR clobbers EFLAGS so we need to state that to avoid accidentally using the clobbered flags.

Our tests show the bug, but I didn't notice because the SETcc instructions didn't move after r354178 since it used to be safe to do the fp->int conversion first.

We should probably convert this whole sequence to SelectionDAG instead of a custom inserter to avoid mistakes like this.

Fixes PR40779

llvm-svn: 354395
2019-02-19 22:37:00 +00:00
Craig Topper 6d0190f21c [X86] Don't consider functions ABI compatible for ArgumentPromotion pass if they view 512-bit vectors differently.
The use of the -mprefer-vector-width=256 command line option mixed with functions
using vector intrinsics can create situations where one function thinks 512-bit vectors
are legal, but another function does not.

If a 512-bit vector is passed between them via a pointer, it's possible ArgumentPromotion
might try to pass by value instead. This will result in type legalization for the two
functions handling the 512 bit vector differently leading to runtime failures.

Had the 512 bit vector been passed by value from clang codegen, both functions would
have been tagged with a min-legal-vector-width=512 function attribute. That would
make them be legalized the same way.

I observed this issue in 32-bit mode where a union containing a 512 bit vector was
being passed by a function that used intrinsics to one that did not. The caller
ended up passing in zmm0 and the callee tried to read it from ymm0 and ymm1.

The fix implemented here is just to consider it a mismatch if two functions
would handle 512-bit vectors differently, without looking at the types that are being
considered. This is the easiest and safest fix, but it can be improved in the future.

Differential Revision: https://reviews.llvm.org/D58390

llvm-svn: 354376
2019-02-19 20:12:20 +00:00
Simon Pilgrim 0b3b9424ca [X86][SSE] Generalize X86ISD::BLENDI support to more value types
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

With this ability, we can avoid most of the bitcasts/scaling in the DAG that were occurring with X86ISD::BLENDI lowering/combining, blend the vXi32/vXi64 vectors directly, and use isel patterns to lower to the equivalent float vector instructions.

This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.

Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361

Differential Revision: https://reviews.llvm.org/D57888

llvm-svn: 354363
2019-02-19 18:05:42 +00:00
Simon Pilgrim dce9c2a811 [X86][AVX2] Hide VPBLENDD instructions behind AVX2 predicate
This was the cause of the regression in D57888 - the commuted load pattern wasn't hidden by the predicate so once we enabled v4i32 blends on SSE41+ targets then isel was incorrectly matched against AVX2+ instructions.

llvm-svn: 354358
2019-02-19 17:23:55 +00:00
Craig Topper 51a2e88990 [X86] Bugfix for nullptr check by klocwork
klocwork critical issues in CG files:

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D58363

llvm-svn: 354357
2019-02-19 17:16:23 +00:00
Craig Topper d8acfe69f0 X86AsmParser AVX-512: Return error instead of hitting assert
When parsing a sequence of tokens beginning with {, it will hit an assert and crash if the token afterwards is not an identifier. Instead of this, return a more verbose error as seen elsewhere in the function.

Patch by Brandon Jones (BrandonTJones)

Differential Revision: https://reviews.llvm.org/D57375

llvm-svn: 354356
2019-02-19 17:13:40 +00:00
Craig Topper 236e1ce1d9 [X86] Filter out tuning feature flags and a few ISA feature flags when checking for function inline compatibility.
Tuning flags don't have any effect on the available instructions so aren't a good reason to prevent inlining.

There are also some ISA flags that don't have any intrinsics or ABI requirements that we can exclude. I've put only the most basic ones like cmpxchg16b and lahfsahf. These are interesting because they aren't present in all 64-bit CPUs, but we have codegen workarounds when they aren't present.

Loosening these checks can help with scenarios where a caller has a more specific CPU than a callee. The default tuning flags on our generic 'x86-64' CPU can currently make it inline compatible with other CPUs. I've also added an example test for 'nocona' and 'prescott' where 'nocona' is just a 64-bit capable version of 'prescott' but in 32-bit mode they should be completely compatible.

I've based the implementation here of the similar code in AMDGPU.

Differential Revision: https://reviews.llvm.org/D58371

llvm-svn: 354355
2019-02-19 17:05:11 +00:00
Simon Pilgrim 9d575db85e [X86][AVX] Update VBROADCAST folds to always use v2i64 X86vzload
The VBROADCAST combines and SimplifyDemandedVectorElts improvements mean that we now more consistently use shorter (128-bit) X86vzload input operands.

Follow up to D58053

llvm-svn: 354346
2019-02-19 16:33:17 +00:00
Simon Pilgrim d6add74915 Cast from SDValue directly instead of superfluous getNode(). NFCI.
llvm-svn: 354343
2019-02-19 16:20:09 +00:00
Simon Pilgrim 952abcefe4 [X86][AVX] EltsFromConsecutiveLoads - Add BROADCAST lowering support
This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.

It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.

Differential Revision: https://reviews.llvm.org/D58053

llvm-svn: 354340
2019-02-19 15:57:09 +00:00
Craig Topper c81e0c67ba [X86] Remove command line strings from the ProcIntel* features.
These should always follow the CPU string. There's no reason to control them independently.

llvm-svn: 354304
2019-02-19 03:04:14 +00:00
Sanjay Patel d8b4efcb6b [CGP] form usub with overflow from sub+icmp
The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754
https://bugs.llvm.org/show_bug.cgi?id=40487
..and those are shown in the IR test file and x86 codegen file.

Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.

This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo
formation by default. Only x86 overrides that setting for now although other targets will likely benefit
by forming usbuo too.

Differential Revision: https://reviews.llvm.org/D57789

llvm-svn: 354298
2019-02-18 23:33:05 +00:00
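
The two forms compute the same pair of values; a small standalone sketch using the GCC/Clang overflow builtin (a sketch of the semantics, not the CGP code):

```
// Sketch of the semantics (not the CGP implementation): the separate
// sub + compare pattern and the fused overflow operation compute the same
// result/overflow pair. __builtin_sub_overflow is a GCC/Clang builtin.
#include <cassert>

int main() {
  unsigned a = 3, b = 10;

  // Pattern as written in source: independent icmp and sub.
  bool ovf1 = a < b;
  unsigned diff1 = a - b;

  // What CGP forms: a single usub.with.overflow-style operation.
  unsigned diff2;
  bool ovf2 = __builtin_sub_overflow(a, b, &diff2);

  assert(ovf1 == ovf2 && diff1 == diff2);
  return 0;
}
```
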
Sanjay Patel fff628274d [x86] split more v8f32/v8i32 shuffles in lowering
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.

This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.

Differential Revision: https://reviews.llvm.org/D58181

llvm-svn: 354279
2019-02-18 16:46:12 +00:00
Craig Topper ce3c5ac6a6 [X86] In FP_TO_INTHelper, when moving data from SSE register to X87 register file via the stack, use the same stack slot we use for the integer conversion.
No need for a separate stack slot. The lifetimes don't overlap.

Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot.

llvm-svn: 354234
2019-02-17 19:23:49 +00:00
Craig Topper db5aa955cb [X86] When type legalizing the result of a i64 fp_to_uint on 32-bit targets. Generate all of the ops as i64 and let them be legalized.
No need to manually split everything. We can let the type legalizer work for us.

The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86ISD::CMP. This dropped the use count of the setcc, but wasn't what was intended.

llvm-svn: 354197
2019-02-16 08:25:42 +00:00
Craig Topper 61da80584d [X86] Don't prevent load folding for cvtsi2ss/cvtsi2sd based on hasPartialRegUpdate.
Preventing the load fold won't fix the partial register update since the
input we can fold is a GPR. So it will do nothing to prevent a false dependency
on an XMM register.

llvm-svn: 354193
2019-02-16 03:34:54 +00:00
Craig Topper db2f084aa9 [X86] Don't set exception mask bits when modifying FPCW to change rounding mode for fp->int conversion
When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits to 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption.

This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function (or at least in a basic block)

Differential Revision: https://reviews.llvm.org/D57788

llvm-svn: 354178
2019-02-15 21:59:33 +00:00
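
A standalone sketch of the control-word arithmetic (bit layout of the x87 FPCW: exception masks in bits 0..5, rounding control in bits 10..11; the 0xc7f/0xc00 constants are the ones discussed above, the sample control word is made up):

```
// Sketch of the FPCW manipulation (bit arithmetic only, no fnstcw/fldcw).
// Bits 0..5 of the x87 control word are the exception masks, bits 10..11 the
// rounding control; 0b11 there selects round-toward-zero (truncation).
#include <cassert>
#include <cstdint>

int main() {
  // Example control word where some exceptions are *unmasked* (bits 0..5 not
  // all set). The value is made up for illustration.
  uint16_t oldCW = 0x0340;

  // Old behaviour: blindly load 0xC7F -- truncation, but it also forces all
  // exception mask bits to 1, silently masking exceptions the program enabled.
  uint16_t blind = 0x0C7F;

  // New behaviour: only set the rounding-control field, keep everything else.
  uint16_t newCW = oldCW | 0x0C00;

  assert((newCW & 0x0C00) == 0x0C00);           // truncation selected
  assert((newCW & 0x003F) == (oldCW & 0x003F)); // exception masks preserved
  assert((blind & 0x003F) != (oldCW & 0x003F)); // ...unlike the old constant
  return 0;
}
```
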
Nirav Dave 7875841121 [X86] Fix LowerAsmOutputForConstraint.
Summary:
Update Flag when generating cc output.

Fixes PR40737.

Reviewers: rnk, nickdesaulniers, craig.topper, spatel

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58283

llvm-svn: 354163
2019-02-15 20:01:55 +00:00
Craig Topper 9c6a9276da [X86] Move all the SSE legality checks out of FP_TO_INTHelper and up to LowerFP_TO_INT. NFCI
These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there.

Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner.

llvm-svn: 354161
2019-02-15 19:21:39 +00:00
Simon Pilgrim 6ce08672fb [X86][AVX] lowerShuffleAsLanePermuteAndPermute - fully populate the lane shuffle mask (PR40730)
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.

This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.

This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but it's the most cautious approach as this needs to be cherry-picked into the 8.0.0 release branch.

Differential Revision: https://reviews.llvm.org/D58237

llvm-svn: 354117
2019-02-15 11:39:21 +00:00
Matt Arsenault 2a5488b877 X86: Replace isSafeToClobberEFLAGS implementation
Also use modifiesRegister instead of looping over operands.

llvm-svn: 354098
2019-02-15 04:01:39 +00:00
Nirav Dave 5ffdc43dc9 [X86] cleanup inline asm register generation. NFCI.
llvm-svn: 354042
2019-02-14 18:06:21 +00:00
Craig Topper 1d158dd930 [X86] Make (f80 (sint_to_fp (i16))) use fistps/fisttps instead of fistpl/fisttpl when SSE is enabled.
When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion.

Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type.

llvm-svn: 354003
2019-02-14 01:41:43 +00:00
Anton Afanasyev ca9aff9353 [X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)
Try to use 64-bit SLP vectorization. In addition to horizontal instrs
this change triggers optimizations for partial vector operations (for instance,
using low halfs of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by
<2 x float>).

Fixes llvm.org/PR32433

llvm-svn: 353923
2019-02-13 08:26:43 +00:00
Craig Topper 9b61f48e4b [X86] Use default expansion for (i64 fp_to_uint f80) when avx512 is enabled on 64-bit targets to match what happens without avx512.
In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this.

llvm-svn: 353921
2019-02-13 07:42:34 +00:00
Craig Topper 3099e442a6 [X86] Refactor the FP_TO_INTHelper interface. NFCI
-Pull the final stack load creation from the two callers into the helper.
-Return a single SDValue instead of a std::pair.
-Remove the Replace flag which isn't really needed.

llvm-svn: 353920
2019-02-13 07:42:31 +00:00
Simon Pilgrim 5338f41ced [X86][AVX] Enable shuffle combining support for zero_extend
A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars.

llvm-svn: 353860
2019-02-12 17:22:35 +00:00
Craig Topper 7670ede434 [X86] Collapse FP_TO_INT16_IN_MEM/FP_TO_INT32_IN_MEM/FP_TO_INT64_IN_MEM into a single opcode using memory VT to distinguish. NFC
llvm-svn: 353798
2019-02-12 06:14:18 +00:00
Craig Topper d7303ecd0b [X86] Remove the value type operand from the floating point load/store MemIntrinsicSDNodes. Use the MemoryVT instead. NFCI
We already have the memory VT, we can just match from that during isel.

llvm-svn: 353797
2019-02-12 06:14:16 +00:00
Craig Topper 75eb0af874 [X86] Correct the memory operand for the FLD emitted in FP_TO_INTHelper for 32-bit SSE targets.
We were using DstTy, but that represents the integer type we are converting to which is i64 in this
case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers.
If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte
access. The store we used to get from SSE to the stack is using the correct size.

While there, consistently use TheVT in place of Op.getOperand(0).getValueType() throughout the function.

llvm-svn: 353745
2019-02-11 20:38:10 +00:00
Sam McCall e825ba9165 Revert "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
This reverts commit r353610.
It causes a miscompile visible in macro expansion in a bootstrapped clang.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190211/626590.html

llvm-svn: 353699
2019-02-11 14:05:36 +00:00
Craig Topper 5b1beda001 [X86] Removed unused SDTypeProfile. NFC
llvm-svn: 353659
2019-02-11 07:30:48 +00:00
Simon Pilgrim f6e6c369c0 [X86] EltsFromConsecutiveLoads - replace SmallBitVector with APInt (NFC).
Minor refactor to simplify some incoming patches to improve broadcast loads.

llvm-svn: 353655
2019-02-10 22:45:48 +00:00
Sanjay Patel 833550fc74 [x86] narrow 256-bit horizontal ops via demanded elements
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).

The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.

All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.

In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758
https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.

I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.

Differential Revision: https://reviews.llvm.org/D57841

llvm-svn: 353641
2019-02-10 15:22:06 +00:00
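
An intrinsics-level sketch of why the narrowing is sound (assumes AVX when compiling and running, e.g. -mavx; not the DAG combine itself): the low 128 bits of a 256-bit horizontal add depend only on the low 128 bits of its inputs.

```
// Sketch (assumes AVX, e.g. compile with -mavx): the low half of a 256-bit
// horizontal add only reads the low halves of its inputs, so when only that
// half is demanded the op can be shrunk to the 128-bit form.
#include <immintrin.h>
#include <cassert>

int main() {
  __m256 a = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
  __m256 b = _mm256_setr_ps(10, 20, 30, 40, 50, 60, 70, 80);

  __m128 wideLow = _mm256_castps256_ps128(_mm256_hadd_ps(a, b));
  __m128 narrow  = _mm_hadd_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));

  alignas(16) float w[4], n[4];
  _mm_store_ps(w, wideLow);
  _mm_store_ps(n, narrow);
  for (int i = 0; i < 4; ++i)
    assert(w[i] == n[i]);
  return 0;
}
```
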
Craig Topper f37ea96922 [X86] Move some vector InstAliases out from under unnecessary 'let Predicates'. NFCI
We don't have any assembler predicates for vector ISAs so this isn't necessary. It just adds extra lines and indentation.

llvm-svn: 353631
2019-02-10 02:34:31 +00:00
Simon Pilgrim 6bf7b30b10 [X86] CombineOr - fold to generic funnel shifts
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.

There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.

llvm-svn: 353626
2019-02-09 20:34:59 +00:00
Simon Pilgrim 690a2889d8 [X86][SSE] Generalize X86ISD::BLENDI support to more value types
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

With this ability, we can avoid most of the bitcasts/scaling in the DAG that were occurring with X86ISD::BLENDI lowering/combining, blend the vXi32/vXi64 vectors directly, and use isel patterns to lower to the equivalent float vector instructions.

This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.

Differential Revision: https://reviews.llvm.org/D57888

llvm-svn: 353610
2019-02-09 13:13:59 +00:00
Craig Topper fcb63c4c6c [X86] Add FPCW as an implicit use on floating point load instructions.
These instructions can generate a stack overflow exception so technically they read the stack overflow exception mask bit.

llvm-svn: 353564
2019-02-08 20:50:09 +00:00
Craig Topper 784929d045 Implementation of asm-goto support in LLVM
This patch accompanies the RFC posted here:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127239.html

This patch adds a new CallBr IR instruction to support asm-goto
inline assembly like gcc as used by the linux kernel. This
instruction is both a call instruction and a terminator
instruction with multiple successors. Only inline assembly
usage is supported today.

This also adds a new INLINEASM_BR opcode to SelectionDAG and
MachineIR to represent an INLINEASM block that is also
considered a terminator instruction.

There will likely be more bug fixes and optimizations to follow
this, but we felt it had reached a point where we would like to
switch to an incremental development model.

Patch by Craig Topper, Alexander Ivchenko, Mikhail Dvoretckii

Differential Revision: https://reviews.llvm.org/D53765

llvm-svn: 353563
2019-02-08 20:48:56 +00:00
Craig Topper 41a1792b15 [X86] Remove isReMaterializable from X87 floating point constant loads and constant pool loads.
Summary: These instructions update FPSW so they aren't generically safe to rematerialize into any location if FPSW is live for a comparison result. They also use FPCW for exception masking control. Though the only exception they can generate is stack overflow and we manage the stack ourselves so that's not really going to occur.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57934

llvm-svn: 353536
2019-02-08 17:07:54 +00:00
Sanjay Patel e9cc26a56a [x86] fix formatting; NFC
(test commit #2 migrating to git)

llvm-svn: 353533
2019-02-08 16:48:40 +00:00
Craig Topper 738180cc7f Fix the lowering issue of intrinsics llvm.localaddress on X86
Patch by Yuanke Luo

Reviewers: craig.topper, annita.zhang, smaslov, rnk, wxiao3

Reviewed By: rnk

Subscribers: efriedma, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57501

llvm-svn: 353492
2019-02-08 01:14:12 +00:00
Craig Topper c782f18835 [X86] Add FPCW as a register and start using it as an implicit use on floating point instructions.
Summary:
FPCW contains the rounding mode control which we manipulate to implement fp to integer conversion by changing the rounding mode, storing the value to the stack, and then changing the rounding mode back. Because we didn't model FPCW and its dependency chain, other instructions could be scheduled into the middle of the sequence.

This patch introduces the register and adds it as an implicit def of FLDCW and implicit use of the FP binary arithmetic instructions and store instructions. There are more instructions that need to be updated, but this is a good start. I believe this fixes at least the reduced test case from PR40529.

Reviewers: RKSimon, spatel, rnk, efriedma, andrew.w.kaylor

Subscribers: dim, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57735

llvm-svn: 353489
2019-02-08 00:44:39 +00:00
Sanjay Patel 81f859d169 [x86] fix formatting; NFC
llvm-svn: 353477
2019-02-07 22:36:55 +00:00
Simon Pilgrim fe3ac70b18 [DAGCombiner] (add (umax X, C), -C) --> (usubsat X, C) (PR40111)
Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner

First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111.

Differential Revision: https://reviews.llvm.org/D57754

llvm-svn: 353457
2019-02-07 20:14:43 +00:00
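
The scalar identity behind the fold, checked exhaustively for one sample constant (standalone sketch, not the DAGCombiner code):

```
// Standalone check of the identity (not the DAGCombiner code):
//   (umax(X, C) + (-C))  ==  usubsat(X, C)  ==  X >= C ? X - C : 0
#include <algorithm>
#include <cassert>
#include <cstdint>

int main() {
  const uint8_t C = 100;
  for (unsigned i = 0; i <= 255; ++i) {
    uint8_t x = (uint8_t)i;
    uint8_t lhs = (uint8_t)(std::max(x, C) + (uint8_t)(-C)); // add (umax X, C), -C
    uint8_t sat = x >= C ? (uint8_t)(x - C) : (uint8_t)0;    // usubsat X, C
    assert(lhs == sat);
  }
  return 0;
}
```
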
Sanjay Patel a5c4a5e958 [x86] split more 256/512-bit shuffles in lowering
This is intentionally a small step because it's hard to know exactly 
where we might introduce a conflicting transform with the code that 
tries to form wider shuffles. But I think this is safe - if we have 
a wide shuffle with 2 operands, then we should do better with an 
extract + narrow shuffle.

Differential Revision: https://reviews.llvm.org/D57867

llvm-svn: 353427
2019-02-07 17:10:49 +00:00
Nirav Dave 84e5bf0c95 [X86] Simplify casing. NFC.
llvm-svn: 353417
2019-02-07 15:43:40 +00:00
Nirav Dave c6bfa103a5 [X86][DAG] Avoid creating dangling bitcast.
combineExtractWithShuffle may leave a dangling bitcast which may
prevent further optimization in later passes. Avoid constructing it
unless it is used.

llvm-svn: 353333
2019-02-06 19:45:47 +00:00
Nirav Dave e5c37958f9 [InlineAsm][X86] Add backend support for X86 flag output parameters.
Allow custom handling of inline assembly output parameters and add X86
flag parameter support.

llvm-svn: 353307
2019-02-06 15:26:29 +00:00
Sanjay Patel e84fbb67a1 [x86] vectorize cast ops in lowering to avoid register file transfers
The proposal in D56796 may cross the line because we're trying to avoid vectorization 
transforms in generic DAG combining. So this is an alternate, later, x86-specific 
translation of that patch.

There are several potential follow-ups to enhance this:
1. Allow extraction from non-zero element index.
2. Peek through extends of smaller width integers.
3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P

Differential Revision: https://reviews.llvm.org/D56864

llvm-svn: 353302
2019-02-06 14:59:39 +00:00
Simon Pilgrim b0afc69435 [X86][SSE] Disable ZERO_EXTEND shuffle combining
rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly.

llvm-svn: 353198
2019-02-05 19:15:48 +00:00
Simon Pilgrim 822d2e35e7 [X86][AVX] Attempt to combine shuffles to subvector broadcast load
llvm-svn: 353189
2019-02-05 17:02:49 +00:00
Simon Pilgrim 62af24cc93 [X86][SSE] Add SimplifyDemandedVectorElts support for X86ISD::BLENDV
llvm-svn: 353165
2019-02-05 12:27:29 +00:00
Simon Pilgrim 9e595e3663 [X86][AVX] Attempt to share broadcasts of different widths (PR39454)
If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free).

Differential Revision: https://reviews.llvm.org/D57663

llvm-svn: 353154
2019-02-05 10:58:43 +00:00
Craig Topper f86eb00f12 [X86] Connect the default fpsr and dirflag clobbers in inline assembly to the registers we have defined for them.
Summary:
We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly.

This could cause problems for proper dependency tracking in the machine schedulers, though I don't have a test case that shows that.

Reviewers: rnk

Reviewed By: rnk

Subscribers: eraman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57641

llvm-svn: 353141
2019-02-05 06:13:06 +00:00
Roman Lebedev b7ecc9b624 [X86] X86DAGToDAGISel::matchBitExtract(): prepare 'control' in 32 bits
Summary:
Noticed while looking at D56052.
```
  // The 'control' of BEXTR has the pattern of:
  // [15...8 bit][ 7...0 bit] location
  // [ bit count][     shift] name
  // I.e. 0b000000011'00000001 means  (x >> 0b1) & 0b11
```
I.e. we do not care about any of the bits aside from the low 16 bits.
So there is no point in doing the `shl`/`or` in 64 bits,
let's just do everything in 32 bits, and anyext if needed.

We could do that in 16 bits even, but we intentionally don't
zext to i16 (longer encoding IIRC),
so I'm guessing the same applies here.

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D56715

llvm-svn: 353073
2019-02-04 19:04:26 +00:00
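
For reference, a standalone software model of how BEXTR consumes its control operand (illustrative, not the isel code); since only the low 16 bits are read, computing the control in 32-bit operations loses nothing.

```
// Software model of BEXTR's use of its control operand (illustrative, not the
// isel code). Only the low 16 bits are read: bits 7..0 = start, 15..8 = length.
#include <cassert>
#include <cstdint>

static uint32_t bextrModel(uint32_t x, uint32_t control) {
  uint32_t start = control & 0xFF;       // bits 7..0
  uint32_t len = (control >> 8) & 0xFF;  // bits 15..8
  if (start >= 32 || len == 0)
    return 0;                            // nothing left to extract
  if (len >= 32)
    return x >> start;                   // length covers all remaining bits
  return (x >> start) & ((1u << len) - 1);
}

int main() {
  uint32_t x = 0xDEADBEEF;
  // control = (length << 8) | start: extract 8 bits starting at bit 4.
  uint32_t control = (8u << 8) | 4u;
  assert(bextrModel(x, control) == ((x >> 4) & 0xFF));
  // Bits above 15 of the control are ignored, so producing them is wasted work.
  assert(bextrModel(x, control | 0xFFFF0000u) == bextrModel(x, control));
  return 0;
}
```
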
Craig Topper 8ea72a8201 [X86] Add ST0 as an implicit def/use of x87 load/store instructions during FP stackifying.
These instructions implicitly operate on ST0, but we don't currently add that information to the MachineInstr. We also don't add it to the tablegen definitions.

For the most part this doesn't cause any problems because the stackifying occurs after register allocation. All the instructions are marked as having side effects so the postRA scheduler won't reorder them amongst themselves.

But nothing stops inline assembly using X87 instructions from being reordered around other x87 instructions if that inline assembly wasn't marked volatile.

The two test cases I've identified so far in PR40539 involve loads and stores used to set up the inline assembly or capture the results of the inline assembly ending up in the wrong order.

This patch adds implicit ST0 uses/defs to the load/store instructions to prevent this from happening.

I plan to fix all of the FP instructions, but the binops are bit trickier to get right. So I've chosen fixing the known test cases as a good first step.

I think we also need to update the tablegen descriptions so MS inline assembly infers the right clobbers, but I haven't checked that yet.

Differential Revision: https://reviews.llvm.org/D57644

llvm-svn: 353070
2019-02-04 18:43:55 +00:00
Craig Topper bf7593ec4a [X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments where one is %st.
All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read.

This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.

llvm-svn: 353061
2019-02-04 17:28:18 +00:00
Simon Pilgrim 6e5350a367 [X86][SSE] SimplifyDemandedBitsForTargetNode - PCMPGT(0,X) sign mask
For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly.

Differential Revision: https://reviews.llvm.org/D57667

llvm-svn: 353051
2019-02-04 15:43:36 +00:00
Andrea Di Biagio edbf06a767 [AsmPrinter] Remove hidden flag -print-schedule.
This patch removes hidden codegen flag -print-schedule effectively reverting the
logic originally committed as r300311
(https://llvm.org/viewvc/llvm-project?view=revision&revision=300311).

Flag -print-schedule was originally introduced by r300311 to address PR32216
(https://bugs.llvm.org/show_bug.cgi?id=32216). That bug was about adding "Better
testing of schedule model instruction latencies/throughputs".

These days, we can use llvm-mca to test scheduling models. So there is no longer
a need for flag -print-schedule in LLVM. The main use case for PR32216 is
now addressed by llvm-mca.
Flag -print-schedule is mainly used for debugging purposes, and it is only
actually used by x86 specific tests. We already have extensive (latency and
throughput) tests under "test/tools/llvm-mca" for X86 processor models. That
means, most (if not all) existing -print-schedule tests for X86 are redundant.

When flag -print-schedule was first added to LLVM, several files had to be
modified; a few APIs gained new arguments (see for example method
MCAsmStreamer::EmitInstruction), and MCSubtargetInfo/TargetSubtargetInfo gained
a couple of getSchedInfoStr() methods.

Method getSchedInfoStr() had to originally work for both MCInst and
MachineInstr. The original implementation of getSchedInfoStr() introduced a
subtle layering violation (reported as PR37160 and then fixed/worked-around by
r330615).
In retrospect, that new API could have been designed more optimally. We can
always query MCSchedModel to get the latency and throughput. More importantly,
the "sched-info" string should not have been generated by the subtarget.
Note, r317782 fixed an issue where "print-schedule" didn't work very well in the
presence of inline assembly. That commit is also reverted by this change.

Differential Revision: https://reviews.llvm.org/D57244

llvm-svn: 353043
2019-02-04 12:51:26 +00:00
Simon Pilgrim 9899967464 Use auto for dyn_cast case to save a line. NFCI.
llvm-svn: 353041
2019-02-04 12:32:39 +00:00
Craig Topper b5e945c260 Recommit r352660 "[X86] Mark EMMS and FEMMS as clobbering MM0-7 and ST0-7."
We now print ST0 as 'st' when generating the clobber list for MS inline assembly in clang. This matches what the gcc reg name list expects.

Original commit message:

This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.

Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.

Differential Revision: https://reviews.llvm.org/D57298

llvm-svn: 353016
2019-02-04 04:44:20 +00:00
Craig Topper 7a2944efe1 [X86] Print %st(0) as %st when its implicit to the instruction. Continue printing it as %st(0) when its encoded in the instruction.
This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior.

llvm-svn: 353015
2019-02-04 04:15:10 +00:00
Craig Topper f77b858dc3 Revert r352985 "[X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly"
Looking into gcc and objdump behavior more, this was overly aggressive. If the register is encoded in the instruction we should print %st(0); if it's implicit we should print %st.

I'll be making a more directed change in a future patch.

llvm-svn: 353013
2019-02-04 04:15:02 +00:00
Simon Pilgrim 1fce5a8b75 [X86][AVX] Support shuffle combining for VBROADCAST with smaller vector sources
getTargetShuffleMask can only do this safely if we're extracting the lowest subvector from a vector of the same result type.

llvm-svn: 352999
2019-02-03 16:51:33 +00:00
Simon Pilgrim 18b73a655b [X86][AVX] Support shuffle combining for VPMOVZX with smaller vector sources
llvm-svn: 352997
2019-02-03 16:10:18 +00:00
Simon Pilgrim a2a3e5b811 [X86][AVX] More aggressively simplify BROADCAST source operand
Aim to use scalar source or lowest 128-bit vector directly.

We're still missing some VZMOVL_LOAD combines.

llvm-svn: 352994
2019-02-03 14:39:41 +00:00
Craig Topper 5a570dd437 [X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly
Summary:
When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found when I attempted to make emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name.

This also matches what objdump disassembly prints. It's also what is printed by gcc -S.

Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri

Reviewed By: rnk

Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D57621

llvm-svn: 352985
2019-02-03 07:53:39 +00:00
Craig Topper 950ca192f6 [X86] Lower ISD::UADDO to use the Z flag instead of C flag when the RHS is a constant 1 to encourage INC formation.
Summary:
Add an additional combine to combineCarryThroughADD to reverse it back to the C flag to avoid regressions.

I believe this catches the cases that D57547 got.

Reviewers: RKSimon, spatel

Reviewed By: spatel

Subscribers: javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57637

llvm-svn: 352984
2019-02-03 07:25:06 +00:00
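
The observation behind using the Z flag, as a standalone check (not the lowering code): for x + 1, unsigned overflow happens exactly when the result wraps to zero, so INC plus a ZF test carries the same information as ADD plus a CF test.

```
// Standalone check (not the lowering code): for "x + 1", unsigned overflow
// happens exactly when the result is zero, so the Z flag after an INC carries
// the same information as the C flag after an ADD.
#include <cassert>
#include <cstdint>

int main() {
  uint32_t samples[] = {0u, 1u, 41u, 0x7FFFFFFFu, 0xFFFFFFFEu, 0xFFFFFFFFu};
  for (uint32_t x : samples) {
    uint32_t sum = x + 1u;     // wraps at 2^32
    bool carry = sum < x;      // what the C flag would report
    bool zero = (sum == 0u);   // what the Z flag reports after INC
    assert(carry == zero);
  }
  return 0;
}
```
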
Simon Pilgrim dbf302c9f1 [X86][AVX] Enable INSERT_SUBVECTOR(SRC0, SHUFFLE(SRC1)) shuffle combining
Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles.

This exposes a couple of minor issues that will be fixed shortly:
Missed broadcast folds - we have a mixture of vzext_load lengths that need cleaning up
combine-sdiv.ll - AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts).

llvm-svn: 352963
2019-02-02 18:08:04 +00:00
Simon Pilgrim bd42f97946 [SDAG] Add SDNode/SDValue getConstantOperandAPInt helper. NFCI.
We already have the getConstantOperandVal helper which returns a uint64_t, but along comes the fuzzer and inserts an i128 -1 constant or something and the whole thing asserts...

I've updated a few obvious cases, and tried to make use of the const reference where possible, but there's more to do. A number of existing oss-fuzz tickets should be fixed if we start using APInt and perform value clamping where necessary.

llvm-svn: 352961
2019-02-02 17:35:06 +00:00
Simon Pilgrim e95550f508 [X86][AVX] Add VMOVDDUP-VPBROADCASTQ execution domain mapping
Noticed in D57514.

Differential Revision: https://reviews.llvm.org/D57519

llvm-svn: 352922
2019-02-01 21:41:30 +00:00
James Y Knight 7716075a17 [opaque pointer types] Pass value type to GetElementPtr creation.
This cleans up all GetElementPtr creation in LLVM to explicitly pass a
value type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57173

llvm-svn: 352913
2019-02-01 20:44:47 +00:00
James Y Knight 14359ef1b6 [opaque pointer types] Pass value type to LoadInst creation.
This cleans up all LoadInst creation in LLVM to explicitly pass the
value type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57172

llvm-svn: 352911
2019-02-01 20:44:24 +00:00
James Y Knight 7976eb5838 [opaque pointer types] Pass function types to CallInst creation.
This cleans up all CallInst creation in LLVM to explicitly pass a
function type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57170

llvm-svn: 352909
2019-02-01 20:43:25 +00:00
Simon Pilgrim 85184017e9 [X86][SSE] Use PSLLDQ/PSRLDQ to mask out zeroable ends of a shuffle
As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section.

For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts).

Differential Revision: https://reviews.llvm.org/D56784

llvm-svn: 352883
2019-02-01 16:02:12 +00:00
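
A standalone intrinsics sketch of the single-zeroable-end case (assumes SSE2; byte counts illustrative): a right byte-shift followed by a left byte-shift zeroes the low bytes while keeping the rest of the vector in place.

```
// Sketch of the single-zeroable-end case (assumes SSE2; counts illustrative):
// shifting the register right by N bytes and then left by N bytes keeps bytes
// N..15 in place and zeroes bytes 0..N-1, i.e. a mask like
// <zero,zero,2,3,...,15> can be done with two byte-shifts instead of a pshufb.
#include <immintrin.h>
#include <cassert>
#include <cstdint>

int main() {
  alignas(16) uint8_t in[16], out[16];
  for (int i = 0; i < 16; ++i)
    in[i] = (uint8_t)(i + 1);

  __m128i v = _mm_load_si128((const __m128i *)in);
  __m128i r = _mm_slli_si128(_mm_srli_si128(v, 2), 2); // psrldq $2; pslldq $2
  _mm_store_si128((__m128i *)out, r);

  assert(out[0] == 0 && out[1] == 0);  // low end zeroed
  for (int i = 2; i < 16; ++i)
    assert(out[i] == in[i]);           // inner section untouched
  return 0;
}
```
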
Simon Pilgrim 1a529f58f9 [X86][AVX] Combine INSERT_SUBVECTOR(SRC0, BITCAST(SHUFFLE(EXTRACT_SUBVECTOR(SRC1)))
Enable peeking through one use bitcasts to the subvector shuffle.

This still depends on the subvector being the same scalar-size but D57514 has already helped with the more tricky patterns

llvm-svn: 352879
2019-02-01 15:31:01 +00:00
Roman Lebedev 7857215f8e [X86][BdVer2] Transfer delays from the integer to the floating point unit.
Summary:
I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as `2`,
which matches the latencies specified in "AMD SOG for family 15h".

However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency.

Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`,
which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay.

Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does
state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`.

Reviewers: andreadb, RKSimon, craig.topper

Reviewed By: andreadb

Subscribers: gbedwell, courbet, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57300

llvm-svn: 352861
2019-02-01 11:15:13 +00:00
James Y Knight 13680223b9 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc
doesn't choke on it, hopefully.

Original Message:
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaration with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.
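
A minimal sketch of the resulting usage pattern (the runtime hook name and signature here are invented for illustration):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Sketch: getOrInsertFunction yields a {FunctionType*, Value*} pair that
// can be handed straight to CreateCall, with no cast to Function*.
llvm::CallInst *emitRuntimeHookCall(llvm::Module &M, llvm::IRBuilder<> &B,
                                    llvm::Value *Arg) {
  llvm::FunctionType *FTy = llvm::FunctionType::get(
      B.getVoidTy(), {B.getInt8PtrTy()}, /*isVarArg=*/false);
  llvm::FunctionCallee Hook = M.getOrInsertFunction("__my_runtime_hook", FTy);
  return B.CreateCall(Hook, {Arg});
}
```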

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352827
2019-02-01 02:28:03 +00:00
James Y Knight fadf25068e Revert "[opaque pointer types] Add a FunctionCallee wrapper type, and use it."
This reverts commit f47d6b38c7 (r352791).

Seems to run into compilation failures with GCC (but not clang, where
I tested it). Reverting while I investigate.

llvm-svn: 352800
2019-01-31 21:51:58 +00:00
James Y Knight f47d6b38c7 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaration with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352791
2019-01-31 20:35:56 +00:00
Craig Topper a8f0745440 Revert "[X86] Mark EMMS and FEMMS as clobbering MM0-7 and ST0-7."
This is causing a failure in chromium

llvm-svn: 352782
2019-01-31 19:05:22 +00:00
Simon Pilgrim 00cefe1158 Trim trailing whitespace. NFCI.
llvm-svn: 352775
2019-01-31 17:49:25 +00:00
Simon Pilgrim eb6aef6db3 [X86][AVX] Fold concat(broadcast(x),broadcast(x)) -> broadcast(x)
Differential Revision: https://reviews.llvm.org/D57514

llvm-svn: 352774
2019-01-31 17:48:35 +00:00
Simon Pilgrim d04a2d2d5e [X86][AVX] insert_subvector(bitcast(v), bitcast(s), c1) -> bitcast(insert_subvector(v,s,c2))
Similar to what we already do in DAGCombiner, but this version also handles bitcasts from types with different scalar sizes, which x86 is better at handling.

Differential Revision: https://reviews.llvm.org/D57514

llvm-svn: 352773
2019-01-31 17:38:10 +00:00
Simon Pilgrim 63f3383ece [X86][AVX] Fold broadcast(bitcast(src)) -> bitcast(broadcast(src))
llvm-svn: 352751
2019-01-31 14:04:07 +00:00
Simon Pilgrim a001008a09 [X86] combineExtractWithShuffle - more aggressively peek through bitcasts
Fixes regression introduced by rL352743

llvm-svn: 352745
2019-01-31 11:55:30 +00:00
Simon Pilgrim b96a2c7fed [X86][AVX] Enable AVX1 broadcasts in shuffle combining
Enables 32/64-bit scalar load broadcasts on AVX1 targets

The extractelement-load.ll regression will be fixed shortly in a followup commit.

llvm-svn: 352743
2019-01-31 11:41:10 +00:00
Simon Pilgrim 51c2efc104 [X86][AVX] Fold vt1 concat_vectors(vt2 undef, vt2 broadcast(x)) --> vt1 broadcast(x)
If we're not inserting the broadcast into the lowest subvector then we can avoid the insertion by just performing a larger broadcast.

Avoids a regression when we enable AVX1 broadcasts in shuffle combining

llvm-svn: 352742
2019-01-31 11:15:05 +00:00
Sjoerd Meijer f7cc34cae8 [SelectionDAG] Codesize: don't expand SHIFT to SHIFT_PARTS
And instead just generate a libcall. My motivating example on ARM was a simple:
  
  shl i64 %A, %B

for which the code bloat is quite significant. For other targets that also
accept __int128/i128 such as AArch64 and X86, it is also beneficial for these
cases to generate a libcall when optimising for minsize. On these 64-bit targets,
the 64-bits shifts are of course unaffected because the SHIFT/SHIFT_PARTS
lowering operation action is not set to custom/expand.

Differential Revision: https://reviews.llvm.org/D57386

llvm-svn: 352736
2019-01-31 08:07:30 +00:00
Matt Arsenault 2a64598ef2 GlobalISel: Fix creating MMOs with align 0
llvm-svn: 352712
2019-01-31 01:38:47 +00:00
Craig Topper 8bdc203d4b [X86] Remove handling of ISD::INTRINSIC_WO_CHAIN in ReplaceNodeResults.
I believe this was there to handle avx512bw intrinsics that returned i64 type in 32-bit mode. But all those intrinsics have since been changed to v64i1 results or replaced with generic IR.

llvm-svn: 352698
2019-01-31 00:04:46 +00:00
Craig Topper 22b3de5b51 [X86] Mark EMMS and FEMMS as clobbering MM0-7 and ST0-7.
This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.

Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.

Differential Revision: https://reviews.llvm.org/D57298

llvm-svn: 352660
2019-01-30 19:57:01 +00:00
Simon Pilgrim 317fad5921 [X86][AVX] Prefer to combine shuffle to broadcasts whenever possible
This is the first step towards improving broadcast support on AVX1 targets.

llvm-svn: 352634
2019-01-30 16:19:19 +00:00
Craig Topper 11133d2531 [X86] Remove unnecessary code from the top of handleCompareFP in X86FloatingPoint.cpp.
There were checks to ensure some tables were sorted, but those tables aren't used by this function. The same tables are checked in the function that does use them. Maybe this was copy/pasted?

llvm-svn: 352609
2019-01-30 08:04:06 +00:00
Craig Topper 594f76aea2 [X86] Remove a couple places where we unnecessarily pass 0 to the EmitPriority of some FP instruction aliases. NFC
As far as I can tell we already won't emit these aliases due to an operand count check in the tablegen code. Removing these because I couldn't make sense of the inconsistency between fadd and fmul from reading the code.

I checked the AsmMatcher and AsmWriter files before and after this change and there were no differences.

llvm-svn: 352608
2019-01-30 07:33:24 +00:00
Craig Topper 9dfe9b086e [X86] Add FPSW as a Def on some FP instructions that were missing it.
llvm-svn: 352607
2019-01-30 07:08:44 +00:00
Andrea Di Biagio 815cdbff29 [X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions.
Account for bypass delays when computing the latency of scalar int-to-float
conversions.
On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG).
This patch also fixes the number of micro opcodes for the register-memory variants
of scalar int-to-float conversions.

Differential Revision: https://reviews.llvm.org/D57148

llvm-svn: 352518
2019-01-29 16:47:27 +00:00
Mikael Holmen b792627ce9 Fix compiler warning when using clang 3.6.0
Without the fix we get the following (with -Werror):

../lib/Target/X86/X86ISelLowering.cpp:14181:58: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
  SmallVector<std::array<int, 2>, 2> LaneSrcs(NumLanes, {-1, -1});
                                                         ^~~~~~
                                                         {     }
1 error generated.

llvm-svn: 352455
2019-01-29 06:51:28 +00:00
Craig Topper 390ac61b93 Recommit r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer"
This did not cause the buildbot failure it was previously reverted for.

Original commit message:

I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.

This patch changes this and then modifies the X86 post legalization combine to emit an extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad.

On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.

llvm-svn: 352433
2019-01-28 21:38:47 +00:00
Nikita Popov 8e1a464e6a [CodeGen][X86] Expand UADDSAT to NOT+UMIN+ADD
Followup to D56636, this time handling the UADDSAT case by expanding
uadd.sat(a, b) to umin(a, ~b) + b.
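
A scalar sketch of the identity (plain C++, not the DAG expansion itself):

```cpp
#include <algorithm>
#include <cstdint>

// uadd.sat(a, b) == umin(a, ~b) + b: clamping a to ~b (i.e. UINT32_MAX - b)
// first guarantees the add cannot wrap, so the result saturates at
// UINT32_MAX instead.
uint32_t uadd_sat_expanded(uint32_t a, uint32_t b) {
  return std::min(a, ~b) + b;
}
```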

Differential Revision: https://reviews.llvm.org/D56869

llvm-svn: 352409
2019-01-28 19:19:09 +00:00
Simon Pilgrim 2c17512456 [X86][AVX] Remove lowerShuffleByMerging128BitLanes 2-lane restriction
First step towards adding support for 64-bit unary "sublane" handling (a bit like lowerShuffleAsRepeatedMaskAndLanePermute). 

This allows us to add lowerV64I8Shuffle handling.

llvm-svn: 352389
2019-01-28 17:02:35 +00:00
Sanjay Patel 94cca60b82 [x86] allow more shuffle splitting to avoid vpermps (PR40434)
This is tricky to make optimal: sometimes we're better off using 
a single wider op, but other times it makes more sense to combine
narrow ops to achieve the same result.

This solves the case from:
https://bugs.llvm.org/show_bug.cgi?id=40434

There's potentially a similar change for vectors with 64-bit elements,
but it needs adjustments similar to rL352333 to avoid creating infinite
loops.

llvm-svn: 352380
2019-01-28 15:51:34 +00:00
Craig Topper 453150bc18 [X86] Add new variadic avx512 compress/expand intrinsics that use vXi1 types for the mask argument.
Remove and autoupgrade the old intrinsics

llvm-svn: 352343
2019-01-28 07:03:03 +00:00
Sanjay Patel ebe6b43aec [x86] add restriction for lowering to vpermps
This transform was added with rL351346, and we had
an escape for shufps, but we also want one for
unpckps vs. vpermps because vpermps doesn't take
an immediate shuffle index operand.

llvm-svn: 352333
2019-01-27 21:53:33 +00:00
Simon Pilgrim 670a6971f8 [X86][SSE] Add UNDEF handling to combineSelect ISD::USUBSAT matching (PR40083)
llvm-svn: 352330
2019-01-27 21:01:23 +00:00
Simon Pilgrim f10b6623cc [X86][SSE] Permit UNDEFs in combineAddToSUBUS matching (PR40083)
llvm-svn: 352328
2019-01-27 20:36:37 +00:00
Sanjay Patel 5f1fdaa192 [x86] refactor logic in lowerShuffleWithUndefHalf
Although this is longer code, this is no-functional-change-intended.
The goal is to untangle the conditions under which we bail out, so 
that's easier to adjust.

llvm-svn: 352320
2019-01-27 18:12:03 +00:00
Gabor Buella a0f743b77a [X86] Add some missing blsr patterns
The add+and sequence followed by a branch can
happen e.g. when looping over the set bits of an integer:

```
while (x != 0) {
   func(x & -x);   // isolate the lowest set bit
   x &= x - 1;     // clear the lowest set bit (blsr)
}
```

Reviewed By: ctopper

Differential Revision: https://reviews.llvm.org/D57296

llvm-svn: 352306
2019-01-27 06:15:39 +00:00
Craig Topper e65d4c5525 [X86] Add a pattern for (i64 (and (anyext def32:), 0x00000000FFFFFFFF)) to produce SUBREG_TO_REG
def32 here means the producing instruction zeroed bits 63:32. We already do this for zext, but it looks like we can get an and+anyext sometimes.
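
A hedged scalar sketch of the shape being matched (not the tablegen pattern itself):

```cpp
#include <cstdint>

// The 32-bit producer already left bits 63:32 of the wide value as zero,
// so the explicit mask below is a no-op; the DAG equivalent
// (and (anyext def32:...), 0x00000000FFFFFFFF) can therefore be selected
// as a plain SUBREG_TO_REG.
uint64_t widenDef32(uint32_t x, uint32_t y) {
  uint64_t Wide = x + y;                // 32-bit sum, implicitly zero-extended
  return Wide & 0x00000000FFFFFFFFull;  // redundant mask
}
```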

Spotted in the diffs from D33587.

llvm-svn: 352303
2019-01-27 03:37:05 +00:00
Simon Pilgrim a914fa4dd8 [X86] combineAddOrSubToADCOrSBB/combineCarryThroughADD - use oneuse for entire SDNode
Fix an issue noted in D57281 where we only checked the one use for the SDValue (the result flag), not the entire SUB node.

I've added the getNode() call to make the intent clearer than just the -> redirection.

llvm-svn: 352291
2019-01-26 21:29:16 +00:00
Simon Pilgrim 37a8e65a60 [X86] combineCarryThroughADD - add support for X86::COND_A commutations (PR24545)
As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to a SBB instruction.
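
As a scalar illustration of the commutation (not the DAG code):

```cpp
#include <cstdint>

// 'a >u b' (X86::COND_A after cmp a, b) is the same predicate as
// 'b <u a' (X86::COND_B after cmp b, a); the commuted form leaves the
// borrow in CF, which a following SBB can consume directly.
bool isUGT(uint32_t a, uint32_t b) {
  return b < a;
}
```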

Differential Revision: https://reviews.llvm.org/D57281

llvm-svn: 352289
2019-01-26 20:23:04 +00:00
Simon Pilgrim b7a15acd38 [X86] Fold X86ISD::SBB(ISD::SUB(X,Y),0) -> X86ISD::SBB(X,Y) (PR25858)
We often generate X86ISD::SBB(X, 0) for carry flag arithmetic.

I had tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find anything yet.

Differential Revision: https://reviews.llvm.org/D57169

llvm-svn: 352288
2019-01-26 20:13:44 +00:00
Simon Pilgrim 6162fba57c [X86][SSE] Generalized unsigned compares to support nonsplat constant vectors (PR39859)
llvm-svn: 352283
2019-01-26 16:40:03 +00:00
Sanjay Patel a03c63b77f [x86] add helper for creating a half-width shuffle; NFC
This reduces a bit of duplication between the combining and
lowering places that use it, but the primary motivation is
to make it easier to rearrange the lowering logic and solve
PR40434:
https://bugs.llvm.org/show_bug.cgi?id=40434

llvm-svn: 352280
2019-01-26 16:20:22 +00:00
Craig Topper 3b5e01b386 [X86] Remove and autoupgrade vpconflict intrinsics that take a mask and passthru argument.
We have unmasked versions as of r352172

llvm-svn: 352270
2019-01-26 06:27:01 +00:00
Craig Topper 58e6b37e62 Revert r352255 "[SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer"
This might be breaking an lldb windows buildbot.

llvm-svn: 352268
2019-01-26 02:44:58 +00:00
Craig Topper 6c9c7d0796 [X86] Remove GCCBuiltins from 512-bit cvt(u)qqtops, cvt(u)qqtopd, and cvt(u)dqtops intrinsics. Add new variadic uitofp/sitofp with rounding mode intrinsics.
Summary: See clang patch D56998 for a full description.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56999

llvm-svn: 352266
2019-01-26 02:41:54 +00:00
Craig Topper 7a8e74775c [X86] Add DAG combine to merge vzext_movl with the various fp<->int conversion operations that only write the lower 64-bits of an xmm register and zero the rest.
Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56971

llvm-svn: 352260
2019-01-26 01:17:09 +00:00
Craig Topper b1d3457c03 [SelectionDAG][X86] Don't use SEXTLOAD for promoting masked loads in the type legalizer
Summary:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.

This patch changes this and then modifies the X86 post legalization combine to emit an extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization.

On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57186

llvm-svn: 352255
2019-01-26 00:26:37 +00:00
Mircea Trofin 519f42d914 [llvm] Opt-in flag for X86DiscriminateMemOps
Summary:
Currently, if an instruction with a memory operand has no debug information,
X86DiscriminateMemOps will generate one based on the first line of the
enclosing function, or the last seen debug info.

This may cause confusion in certain debugging scenarios. The long term
approach would be to use the line number '0' in such cases, however, that
brings in challenges: the base discriminator value range is limited
(4096 values).

For the short term, adding an opt-in flag for this feature.

See bug 40319 (https://bugs.llvm.org/show_bug.cgi?id=40319)

Reviewers: dblaikie, jmorse, gbedwell

Reviewed By: dblaikie

Subscribers: aprantl, eraman, hiraditya

Differential Revision: https://reviews.llvm.org/D57257

llvm-svn: 352246
2019-01-25 21:49:54 +00:00
Craig Topper 4cf28bad5b [X86] Combine masked store and truncate into masked truncating stores.
We also need to combine into masked truncating-with-saturation stores, but I'm leaving that for a future patch.

This does regress some tests that used truncate with saturation followed by a masked store. Those now use a truncating store and use min/max to saturate.

Differential Revision: https://reviews.llvm.org/D57218

llvm-svn: 352230
2019-01-25 18:37:36 +00:00
Sanjay Patel 0020f8bb23 [x86] simplify logic in lowerShuffleWithUndefHalf(); NFCI
This seems unnecessarily complicated because we gave names to
opposite polarity bools and have code comments that don't really
line up with the logic. 

Step 1: remove UndefUpper and assert that it is the opposite of 
UndefLower after the initial early exit.

llvm-svn: 352217
2019-01-25 17:00:41 +00:00
Simon Pilgrim f56298f4b9 [X86] Simplify X86ISD::ADD/SUB if we don't use the result flag
Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag.

This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further.

Noticed in some of the test cases in PR31754

Differential Revision: https://reviews.llvm.org/D57234

llvm-svn: 352210
2019-01-25 15:58:28 +00:00
Sanjay Patel 21aa6ddc14 [x86] narrow a shuffle that doesn't use or set any high elements
This isn't the final fix for our reduction/horizontal codegen, but it takes care 
of a lot of the problems. After we narrow the shuffle, existing combines for 
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.

The avg and mul reduction tests show an existing shuffle lowering hole for 
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.

Differential Revision: https://reviews.llvm.org/D57156

llvm-svn: 352209
2019-01-25 15:37:42 +00:00
Craig Topper 6fd9af587a [X86] Add non-masked versions of vpconflict intrinsics so we can use a select in the header file in clang.
I'll remove and autoupgrade the old intrinsics in a future commit.

llvm-svn: 352172
2019-01-25 07:08:07 +00:00
Sanjay Patel 4c304b2923 [x86] move half-size shuffle mask creation to helper; NFC
As noted in D57156, we want to check at least part of
this pattern earlier (in combining), so this will allow
the code to be shared instead of duplicated.

llvm-svn: 352127
2019-01-24 23:12:36 +00:00
Sanjay Patel e524639d72 [x86] rename VectorShuffle -> Shuffle; NFC
This wasn't consistent within the file, so made it harder to search.
Standardize on the shorter name to save some typing.

llvm-svn: 352077
2019-01-24 18:52:12 +00:00
James Y Knight 2c36240a82 Fix emission of _fltused for MSVC.
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.

The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.

This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)

That latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and is also consistent with
MSVC.

Differential Revision: https://reviews.llvm.org/D56548

llvm-svn: 352076
2019-01-24 18:34:00 +00:00
Sanjay Patel e5a0bcf7b8 [x86] add low/high undef half shuffle mask helpers; NFC
This is the most common usage for isUndefInRange, 
so make the code slightly less duplicated and more 
readable.

llvm-svn: 352063
2019-01-24 17:05:02 +00:00
Nirav Dave c5cb2bed58 [X86] Add missing isReg() guards in FixupSetCCs pass.
llvm-svn: 352051
2019-01-24 15:04:17 +00:00
Simon Pilgrim 47ca8606ba [TTI] Add generic SADDO/SSUBO costs
Added x86 scalar sadd_with_overflow/ssub_with_overflow costs.

llvm-svn: 352045
2019-01-24 13:36:45 +00:00
Simon Pilgrim 2d1964b90f [TTI] Add generic UADDO/USUBO costs
Added x86 scalar uadd_with_overflow/usub_with_overflow costs.

Differential Revision: https://reviews.llvm.org/D56907

llvm-svn: 352043
2019-01-24 12:10:20 +00:00
Craig Topper e79b779fbb [X86] Add test cases for opportunities to fold a truncate and a masked store into a truncating masked store.
llvm-svn: 352027
2019-01-24 06:15:03 +00:00
Andrea Di Biagio d768d35515 [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means this patch is a functional change
for the Jaguar cpu model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.

When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is specified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
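
A rough sketch of the arithmetic being described (illustrative only; these names are not the actual MCSchedModel hook):

```cpp
// A negative ReadAdvance effectively increases latency: the operand is
// treated as becoming available ReadAdvance cycles early (or, if negative,
// late). With ReadAdvance = -6 on the GPR input of (V)PINSR*, the +6cy
// int->fpu transfer delay is modelled on top of the 1cy insert itself.
int effectiveOperandLatency(int DefLatency, int ReadAdvanceCycles) {
  return DefLatency - ReadAdvanceCycles;
}
```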

Differential Revision: https://reviews.llvm.org/D57056

llvm-svn: 351965
2019-01-23 16:35:07 +00:00
Matt Arsenault 30989e492b GlobalISel: Allow shift amount to be a different type
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.

X86 uses i8, but seemed to be hacking around this before.

llvm-svn: 351882
2019-01-22 21:42:11 +00:00
Matt Arsenault a5840c3c39 Codegen support for atomicrmw fadd/fsub
llvm-svn: 351851
2019-01-22 18:36:06 +00:00
Simon Pilgrim 933673d878 [X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))
For constant bit select patterns, replace one AND with an ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.

This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.
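
In scalar terms the pattern being canonicalized is the classic constant bit select (sketch only):

```cpp
#include <cstdint>

// (x & c) | (y & ~c) takes bits of x where c is set and bits of y where c
// is clear. Writing the second term as (~c & y), the ANDN/ANDNP form,
// lets both halves reuse the single constant c.
uint64_t bitSelect(uint64_t x, uint64_t y, uint64_t c) {
  return (x & c) | (~c & y);
}
```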

Differential Revision: https://reviews.llvm.org/D55935

llvm-svn: 351819
2019-01-22 13:44:49 +00:00
Simon Pilgrim aa6a4339ac [X86][BtVer2] SSE2 vector shifts has local forwarding disabled
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57026

llvm-svn: 351817
2019-01-22 13:27:18 +00:00
Simon Pilgrim 9e2c2cfcd9 Fix "comparison of unsigned expression >= 0 is always true" warning. NFCI.
llvm-svn: 351816
2019-01-22 13:18:26 +00:00
Simon Pilgrim 2c69f90171 [X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57022

llvm-svn: 351815
2019-01-22 13:13:57 +00:00
Simon Pilgrim ee900efb30 [CostModel][X86] Add ICMP Predicate specific costs
First step towards PR40376, this patch adds support for getCmpSelInstrCost to use the (optional) Instruction CmpInst predicate to indicate the type of integer comparison we're performing and alter the costs accordingly.

Differential Revision: https://reviews.llvm.org/D57013

llvm-svn: 351810
2019-01-22 12:29:38 +00:00
Simon Pilgrim 180fcff5a7 [X86][SSE] Add selective commutation support for insertps (PR40340)
When we are inserting one "inline" element and zeroing two of the other elements, we can safely commute the insertps source inputs to improve memory folding.

Differential Revision: https://reviews.llvm.org/D56843

llvm-svn: 351807
2019-01-22 12:17:48 +00:00
Simon Pilgrim 3adf50b2c0 [X86] HADDPS/HADDPD scalar lowering was added at rL350421
llvm-svn: 351797
2019-01-22 10:49:41 +00:00
Craig Topper bcbdf61078 [X86] Use X86ISD::VFPROUND instead of ISD::FP_ROUND for 256 and 512 bit cvtpd2ps intrinsics.
Summary:
Use X86ISD::VFPROUND in the instruction isel patterns. Add new patterns for ISD::FP_ROUND to maintain support for fptrunc in IR.

In the process I found a couple duplicate isel patterns which I also deleted in this patch.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56991

llvm-svn: 351762
2019-01-21 20:14:09 +00:00
Craig Topper c2087d8f3f [X86] Change avx512 COMPRESS and EXPAND lowering to use a single masked node instead of expand/compress+select.
Summary:
For compress, a select node doesn't semantically reflect the behavior of the instruction. The mask would have holes in it, but the resulting write is to contiguous elements at the bottom of the vector.

Furthermore, as far as compressing and expanding are concerned, the behavior depends on the mask. You can't just have an expand/compress node that only reads the input vector. That node would have no meaning by itself.

This all only works because we pattern match the compress/expand+select back to the instruction. But conceivably an optimization of the select could break the pattern and leave something meaningless.

This patch modifies the expand and compress nodes to take the mask and passthru as additional inputs and gets rid of the select altogether.
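
A scalar model of the compress behavior described above (purely illustrative, not the ISel code): surviving elements are packed to the bottom of the destination, which is why a lane-wise select cannot express it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Selected source elements are written contiguously starting at element 0;
// the remaining lanes keep the passthru value.
template <std::size_t N>
std::array<uint32_t, N> compressModel(const std::array<uint32_t, N> &Src,
                                      const std::array<bool, N> &Mask,
                                      const std::array<uint32_t, N> &Passthru) {
  std::array<uint32_t, N> Out = Passthru;
  std::size_t OutIdx = 0;
  for (std::size_t I = 0; I < N; ++I)
    if (Mask[I])
      Out[OutIdx++] = Src[I];
  return Out;
}
```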

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D57002

llvm-svn: 351761
2019-01-21 20:02:28 +00:00
Simon Pilgrim 9b73ae96c5 [X86][BtVer2] Update latency of mmx horizontal operations
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.

Confirmed with @andreadb

llvm-svn: 351755
2019-01-21 18:04:25 +00:00
Andrea Di Biagio b68dd05c14 [X86][BtVer2] Update the WriteLoad latency.
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.

Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.

Differential Revision: https://reviews.llvm.org/D56922

llvm-svn: 351742
2019-01-21 12:04:10 +00:00
Craig Topper f608dc1f57 [X86] Remove and autoupgrade vpmovqd/vpmovwb intrinsics using trunc+select.
llvm-svn: 351729
2019-01-21 08:16:59 +00:00
Simon Pilgrim e1143c1322 [X86] Auto upgrade VPCOM/VPCOMU intrinsics to generic integer comparisons
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.

Noticed while cleaning up vector integer comparison costs for PR40376.

llvm-svn: 351697
2019-01-20 19:27:40 +00:00
Simon Pilgrim c934d3a01b [CostModel][X86] Add explicit vector select costs
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.

Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).

The increased pre-SSE41 select costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests alongside the existing SSE2 tests.

llvm-svn: 351685
2019-01-20 13:55:01 +00:00