Commit Graph

1018 Commits

Author SHA1 Message Date
Craig Topper f7ba8b808a [X86] Convert f32/f64 FANDN/FAND/FOR/FXOR to vector logic ops and scalar_to_vector/extract_vector_elts to reduce isel patterns.
Previously we did the equivalent operation in isel patterns with
COPY_TO_REGCLASS operations to transition. By inserting
scalar_to_vectors and extract_vector_elts before isel we can
allow each piece to be selected individually and accomplish the
same final result.
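
A rough sketch of what such a PreProcessIselDAG rewrite looks like (condensed; iterator and dead-node bookkeeping omitted, and exact placement is an assumption, not quoted from the patch):
```
// Widen a scalar FP logic op into a vector op plus scalar_to_vector /
// extract_vector_elt so each piece can be selected on its own.
MVT VT = N->getSimpleValueType(0);                 // f32 or f64
MVT VecVT = VT == MVT::f64 ? MVT::v2f64 : MVT::v4f32;
SDLoc dl(N);
SDValue Op0 = CurDAG->getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT,
                              N->getOperand(0));
SDValue Op1 = CurDAG->getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT,
                              N->getOperand(1));
SDValue Res = CurDAG->getNode(N->getOpcode(), dl, VecVT, Op0, Op1);
Res = CurDAG->getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, Res,
                      CurDAG->getIntPtrConstant(0, dl));
CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 0), Res);
```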

Ideally we'd use vector operations earlier in lowering/combine,
but that looks to be more difficult.

The scalar-fp-to-i64.ll changes are because we have a pattern for
using movlpd for store+extract_vector_elt, while an f64 store
uses movsd. The encoding sizes are the same.

llvm-svn: 362914
2019-06-10 00:41:07 +00:00
Craig Topper 7d8494c41c [X86] Mutate scalar fceil/ffloor/ftrunc/fnearbyint/frint into X86ISD::RNDSCALE during PreProcessIselDAG to cut down on number of isel patterns.
Similar was done for vectors in r362535. Removes about 1200 bytes from the isel table.

llvm-svn: 362894
2019-06-08 23:53:31 +00:00
Craig Topper 137de38009 [X86] Mutate fceil/ffloor/ftrunc/fnearbyint/frint into X86ISD::RNDSCALE during PreProcessIselDAG to cut down on pattern permutations
We already need to have patterns for X86ISD::RNDSCALE to support software intrinsics. But we currently have 5 sets of patterns for the 5 rounding operations. For each of these patterns we have to support 3 vector widths, 2 element sizes, sse/vex/evex encodings, load folding, and broadcast load folding. This results in a fair amount of bytes in the isel table.

This patch adds code to PreProcessIselDAG to morph the fceil/ffloor/ftrunc/fnearbyint/frint to X86ISD::RNDSCALE. This way we can remove everything but the intrinsic pattern, while still allowing the operations to be considered Legal for DAGCombine and Legalization. This shrinks the DAGISel table by somewhere between 9K and 10K.

There is one complication to this: the STRICT versions of these nodes are currently mutated to their non-strict equivalents at isel time when the node is visited. This won't be true in the future since that loses the chain ordering information. For now I've also added support for the non-STRICT nodes to Select so we can change the STRICT versions there after they've been mutated to their non-STRICT versions. We'll probably need a STRICT version of RNDSCALE or something to handle this in the future, which will take us back to needing 2 sets of patterns for strict and non-strict, but that's still better than the 11 or 12 sets of patterns we'd need otherwise.

We can probably do something similar for scalar, but I haven't looked at it yet.
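
A sketch of the mutation this describes, assuming the standard roundps/roundpd immediate encodings (the RNDSCALE node is spelled X86ISD::VRNDSCALE in the enum, if memory serves):
```
unsigned Imm;
switch (N->getOpcode()) {
default: llvm_unreachable("not a rounding op");
case ISD::FCEIL:      Imm = 0xA; break; // toward +inf, no precision exception
case ISD::FFLOOR:     Imm = 0x9; break; // toward -inf, no precision exception
case ISD::FTRUNC:     Imm = 0xB; break; // toward zero, no precision exception
case ISD::FNEARBYINT: Imm = 0xC; break; // current mode, no exceptions
case ISD::FRINT:      Imm = 0x4; break; // current mode, may raise inexact
}
SDLoc dl(N);
SDValue Res = CurDAG->getNode(X86ISD::VRNDSCALE, dl, N->getValueType(0),
                              N->getOperand(0),
                              CurDAG->getTargetConstant(Imm, dl, MVT::i8));
CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 0), Res);
```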

Differential Revision: https://reviews.llvm.org/D62757

llvm-svn: 362535
2019-06-04 18:03:07 +00:00
Pengfei Wang 1f67d94279 [X86] Add ENQCMD instructions
For more details about these instructions, please refer to the latest
ISE document:
https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference.

Patch by Tianqing Wang (tianqing)

Differential Revision: https://reviews.llvm.org/D62281

llvm-svn: 362053
2019-05-30 03:59:16 +00:00
Craig Topper e43bdf144c [X86] Delay creating index register negations during address matching until after we know for sure the match will succeed
If we're trying to match an LEA, it's possible the LEA match will be deemed unprofitable, in which case the negation we created in matchAddress would be left dangling in the SelectionDAG. This could artificially increase use counts for other nodes in the DAG. I don't have an example of that, but it just seems like bad form to have dangling nodes in isel.

Differential Revision: https://reviews.llvm.org/D61047

llvm-svn: 360823
2019-05-15 21:59:53 +00:00
Simon Pilgrim 2a0ef0530b [X86] Fix uninitialized members in constructor warnings. NFCI.
Initialize all member variables in X86ATTInstPrinter and X86DAGToDAGISel constructors to fix cppcheck warning.

llvm-svn: 360047
2019-05-06 14:48:02 +00:00
Simon Pilgrim d672d0e246 X86DAGToDAGISel::tryVPTESTM - fix uninitialized variable warning. NFCI.
findBroadcastedOp should always initialize the value if it returns true but static-analyzer isn't great at recognising this.

llvm-svn: 360037
2019-05-06 11:52:16 +00:00
Craig Topper 3958719dda [X86] If PreprocessISelDAG reorders a load before a call, make sure we remove dead nodes from the graph
The reordering can leave at least a dead TokenFactor in the graph. This causes the linearize scheduler to fail with something like the assert seen in PR22614. This is only one of many ways we can break the linearize scheduler today, so I can't say for sure that any of the other failures in that bug were caused by this issue.

This takes the heavy hammer approach of just running RemoveDeadNodes unconditionally at the end of the PreprocessISelDAG. If this turns out to be a compile time hit, we can try to refine it.
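
A sketch of the heavy-hammer fix (the surrounding preprocessing is elided):
```
void X86DAGToDAGISel::PreprocessISelDAG() {
  // ... existing preprocessing, including moving loads ahead of calls ...

  // The reordering can strand nodes such as a TokenFactor with no users.
  // Sweep them unconditionally so later passes (e.g. the linearize
  // scheduler) never see them.
  CurDAG->RemoveDeadNodes();
}
```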

Differential Revision: https://reviews.llvm.org/D61164

llvm-svn: 359582
2019-04-30 17:56:47 +00:00
Craig Topper 354247c08d [X86] Sink NoRegister creation for unused Base/Index registers into getAddressOperands. NFCI
llvm-svn: 359318
2019-04-26 16:39:38 +00:00
Craig Topper ad662cf4c1 [X86] Segment registers should have i16 type not i32.
Probably doesn't really matter, but was inconsistent with the rest of the code.

llvm-svn: 359317
2019-04-26 16:39:35 +00:00
Craig Topper 013503c78d [X86] Remove part of an if condition that should always be true.
The IndexReg will always be non-null at this point. Earlier in the function, if
IndexReg was null we set it to CurDAG->getRegister(0, VT) which made it
non-null.

llvm-svn: 359170
2019-04-25 06:08:02 +00:00
Craig Topper 6932abee2c [X86] Attempt to fix use-after-poison from r359121.
llvm-svn: 359143
2019-04-24 21:48:24 +00:00
Craig Topper af194e9380 [X86] Prevent folding a load into an AND if that AND is really a ZEXT_INREG that should use movzx.
This can save a 32-bit immediate move.

We would shrink the load and fold it if it was non-volatile, but that's trickier to check for.
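
A hypothetical sketch of the check being described (variable names and the exact masks are invented for illustration):
```
// If the AND is really a zero-extending load in disguise, don't fold the
// load into the AND; MOVZX needs no immediate at all.
auto *MaskC = dyn_cast<ConstantSDNode>(N->getOperand(1));
if (MaskC) {
  uint64_t Mask = MaskC->getZExtValue();
  if (Mask == 0xFF || Mask == 0xFFFF)
    return false; // hypothetical "don't fold" result; isel emits movzx
}
```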

llvm-svn: 359129
2019-04-24 19:28:38 +00:00
Craig Topper 882ca6d484 [X86] Remove dead nodes left after ReplaceAllUsesWith calls during address matching
ReplaceAllUsesWith doesn't remove the node that was replaced, so it's left around in the graph, messing up use counts on other nodes.

One thing to note is that this isn't valid if the node being deleted is the root node of an LEA match that gets rejected. In that case the node needs to stay alive because the isel table walking code would still have a reference to it that it's going to try to match next. I don't think that's the case here though, because the nodes being deleted here should be "and", "srl", and "zero_extend", none of which can be the root node of an LEA match.

Differential Revision: https://reviews.llvm.org/D61048

llvm-svn: 359121
2019-04-24 18:02:07 +00:00
Craig Topper 4d4b5d952e [X86] Don't turn (and (shl X, C1), C2) into (shl (and X, (C2 >> C1)), C1) if the original AND can be represented by MOVZX.
The MOVZX doesn't require an immediate to be encoded at all, though it does use
a 2-byte opcode, so it's the same size as a 1-byte immediate. But it has
separate source and dest registers, so it can help avoid copies.

llvm-svn: 358805
2019-04-20 04:38:53 +00:00
Craig Topper 8b8264828c [X86] Turn (and (anyextend (shl X, C1)), C2) into (shl (and (anyextend X), (C2 >> C1)), C1) if the AND could match a movzx.
There's one slight regression in here because we don't check that the immediate
already allowed movzx before the shift. I'll fix that next.

llvm-svn: 358804
2019-04-20 04:38:49 +00:00
Craig Topper bb769a2946 [X86] Turn (and (shl X, C1), C2) into (shl (and X, (C2 >> C1)), C1) if the AND could match a movzx.
Could get further improvements by recognizing (i64 and (anyext (i32 shl))).

llvm-svn: 358737
2019-04-19 05:48:13 +00:00
Craig Topper f73caae956 [X86] Make sure we copy the HandleSDNode back to N before executing the default code after the switch in matchAddressRecursively
Summary:
There are two places where we create a HandleSDNode in address matching in order to handle the case where N is changed by CSE. But if we end up not matching, we fall back to code at the bottom of the switch that really would like N to point to something that wasn't CSEd away. So we should make sure we copy the handle back to N on any paths that can reach that code.

This appears to be the true reason we needed to check DELETED_NODE in the negation matching. In pr32329.ll we had two subtracts back to back. We recursed through the first subtract, and onto the second subtract. The second subtract called matchAddressRecursively on its LHS, which caused that subtract to CSE. We ultimately failed the match and ended up in the default code, but N was pointing at the old node that had been deleted; the default code didn't know that and took it as the base register. Then we unwound back to the first subtract and tried to access this bogus base reg, requiring the check for deleted node. With this patch we now use the CSE result as the base reg instead.

matchAdd has been broken since sometime in 2015 when it was pulled out of the switch into a helper function. The assignment to N at the end was still there, but N was passed by value and not by reference so the update didn't go anywhere.
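
A sketch of the handle discipline being described (surrounding matching logic elided):
```
// Guard N across recursive matching, which can CSE the node away.
HandleSDNode Handle(N);
// ... recursive calls such as matchAddressRecursively may trigger CSE ...
// Every path that falls through to the default base-register code must
// first refresh N from the handle:
N = Handle.getValue();
```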

Reviewers: niravd, spatel, RKSimon, bkramer

Reviewed By: niravd

Subscribers: llvm-commits, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60843

llvm-svn: 358735
2019-04-19 04:52:21 +00:00
Sanjay Patel fb363a778f [x86] try to widen 'shl' as part of LEA formation
The test file has pairs of tests that are logically equivalent:
https://rise4fun.com/Alive/2zQ

%t4 = and i8 %t1, 8
%t5 = zext i8 %t4 to i16
%sh = shl i16 %t5, 2
%t6 = add i16 %sh, %t0
=>
%t4 = and i8 %t1, 8
%sh2 = shl i8 %t4, 2
%z5 = zext i8 %sh2 to i16
%t6 = add i16 %z5, %t0

...so if we can fold the shift op into LEA in the 1st pattern, then we
should be able to do the same in the 2nd pattern (unnecessary 'movzbl'
is a separate bug I think).

We don't want to do this any sooner though because that would conflict
with generic transforms that try to narrow the width of the shift.

Differential Revision: https://reviews.llvm.org/D60789

llvm-svn: 358622
2019-04-17 22:38:51 +00:00
Craig Topper 3c57976447 [X86] Move VPTESTM matching from the isel table to custom code in X86ISelDAGToDAG.
We had many tablegen patterns for these instructions. And due to the
commutability of the patterns, tablegen expands them to even more patterns. All
together VPTESTMD patterns accounted for more than 50K of the 610K isel table.
This had gotten bad when we stopped canonicalizing AND to vXi64. This required
a pattern for every combination of bitcast input type.

This change moves the matching to custom code where it is easier to look through
the bitcasts without being concerned with the specific types.

The test changes are because we are now stricter with one-use checks, as it's
required to make load folding legal. We now require the AND and any BITCAST to
only have a single use. This prevents forming VPTESTM and a VPAND with the same
inputs.

We now support broadcast loads for 128/256 patterns without VLX. We'll widen to
512 bits and still fold the broadcast since the amount of memory read
doesn't change.

There are a few tests that got slightly longer because we are now preferring
load + VPTESTM over XOR+VPCMPEQ for (seteq (load), allzeros). Previously we were
able to share the XOR with multiple VPTESTM instructions.
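
A sketch of the custom-matcher shape this describes, assuming the usual SelectionDAG helpers (Setcc is a placeholder name):
```
// Strip bitcasts so an AND of any vXi8/vXi16/vXi32/vXi64 flavor can feed
// VPTESTM without one pattern per bitcast combination.
SDValue Src = peekThroughBitcasts(Setcc.getOperand(0));
if (Src.getOpcode() != ISD::AND || !Src.hasOneUse())
  return false;
// One-use bitcasts only, so folding a load remains legal.
SDValue Op0 = peekThroughOneUseBitcasts(Src.getOperand(0));
SDValue Op1 = peekThroughOneUseBitcasts(Src.getOperand(1));
```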

llvm-svn: 358359
2019-04-14 18:26:11 +00:00
Bill Wendling 191f1487b6 [X86] Use PC-relative mode for the kernel code model
Summary:
The Linux kernel uses PC-relative mode, so allow that when the code model is
"kernel".

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits, kees, nickdesaulniers

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60643

llvm-svn: 358343
2019-04-13 21:39:28 +00:00
Craig Topper 55b0d987fd [X86] Use int64_t and isInt<N> instead of APInt operations in foldLoadStoreIntoMemOperand. NFC
We know all our values are limited to 64 bits here so we don't need an APInt.

This should save some generated code that checks between large and small sizes.
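
Roughly the shape of the change, with placeholder variable names:
```
// Everything here is known to fit in 64 bits, so plain integer helpers
// from MathExtras replace the APInt operations.
int64_t OperandV = cast<ConstantSDNode>(Operand)->getSExtValue();
bool FitsImm8  = isInt<8>(OperandV);
bool FitsImm32 = isInt<32>(OperandV);
```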

llvm-svn: 358338
2019-04-13 18:57:41 +00:00
Craig Topper 61f31cbcb2 [X86] Teach foldMaskedShiftToScaledMask to look through an any_extend from i32 to i64 between the and & shl
foldMaskedShiftToScaledMask tries to reorder and & shl to enable the shl to fold into an LEA. But if there is an any_extend between them it doesn't work.

This patch modifies the code to look through any_extend from i32 to i64 when the and mask only uses bits that weren't from the extended part.

This will prevent a regression from D60358 caused by 64-bit SHLs being narrowed to 32 bits when their upper bits aren't demanded.

Differential Revision: https://reviews.llvm.org/D60532

llvm-svn: 358139
2019-04-10 21:42:08 +00:00
Craig Topper 87a8f9761e [X86] Replace some if statements in isel address matching that should never be true with asserts. And move them earlier before we looked through operands that don't change size. NFC
These ifs were ensuring we don't have to handle types larger than 64 bits probably because we use getZExtValue in several places below them.

None of the callers of this code pass types larger than 64-bits so we can just assert instead of branching in release code.

I've also moved them earlier since we're just looking through operations that don't affect bit width.

This is prep work for some refactoring I plan to do to the (and (shl)) handling code.

llvm-svn: 358123
2019-04-10 19:08:59 +00:00
Craig Topper 399102b464 [X86] When converting (x << C1) AND C2 to (x AND (C2>>C1)) << C1 during isel, try using andl over andq by favoring 32-bit unsigned immediates.
llvm-svn: 357848
2019-04-06 19:00:11 +00:00
Craig Topper f9b9f8d2e4 [X86] Use a signed mask in foldMaskedShiftToScaledMask to enable a shorter immediate encoding.
This function reorders AND and SHL to enable the SHL to fold into an LEA. The
upper bits of the AND will be shifted out by the SHL so it doesn't matter what
mask value we use for these bits. By using sign bits from the original mask in
these upper bits we might enable a shorter immediate encoding to be used.
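
A hypothetical worked example of why the sign bits help:
```
// (and (shl X, 3), 0xFFFFFFF8) reorders to (shl (and X, M), 3); the top
// 3 bits of M are shifted out by the SHL, so they are free to choose.
//   logical shift:    0xFFFFFFF8 >> 3 == 0x1FFFFFFF  -> needs an imm32
//   arithmetic shift: 0xFFFFFFF8 >> 3 == 0xFFFFFFFF  -> -1, fits an imm8
```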

llvm-svn: 357846
2019-04-06 18:00:50 +00:00
Craig Topper 80aa2290fb [X86] Merge the different Jcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon

Reviewed By: RKSimon

Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60228

llvm-svn: 357802
2019-04-05 19:28:09 +00:00
Craig Topper 7323c2bf85 [X86] Merge the different SETcc instructions for each condition code into single instructions that store the condition code as an operand.
Summary:
This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between SETcc instructions and condition codes.

Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser.

Reviewers: andreadb, courbet, RKSimon, spatel, lebedev.ri

Reviewed By: andreadb

Subscribers: hiraditya, lebedev.ri, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60138

llvm-svn: 357801
2019-04-05 19:27:49 +00:00
Craig Topper e0bfeb5f24 [X86] Merge the different CMOV instructions for each condition code into single instructions that store the condition code as an immediate.
Summary:
Reorder the condition code enum to match their encodings. Move it to MC layer so it can be used by the scheduler models.

This avoids needing an isel pattern for each condition code. And it removes
translation switches for converting between CMOV instructions and condition
codes.

Now the printer, encoder and disassembler take care of converting the immediate.
We use InstAliases to handle the assembly matching. But we print using the
asm string in the instruction definition. The instruction itself is marked
IsCodeGenOnly=1 to hide it from the assembly parser.

This does complicate the scheduler models a little since we can't assign the
A and BE instructions to a separate class now.

I plan to make similar changes for SETcc and Jcc.
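
A sketch of the reordered enum; since the values mirror the hardware condition-code encoding, the printer, encoder, and disassembler can treat the immediate as a straight copy:
```
enum CondCode {
  COND_O = 0,   COND_NO = 1,  COND_B = 2,   COND_AE = 3,
  COND_E = 4,   COND_NE = 5,  COND_BE = 6,  COND_A = 7,
  COND_S = 8,   COND_NS = 9,  COND_P = 10,  COND_NP = 11,
  COND_L = 12,  COND_GE = 13, COND_LE = 14, COND_G = 15,
};
```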

Reviewers: RKSimon, spatel, lebedev.ri, andreadb, courbet

Reviewed By: RKSimon

Subscribers: gchatelet, hiraditya, kristina, lebedev.ri, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60041

llvm-svn: 357800
2019-04-05 19:27:41 +00:00
Fangrui Song 2c5c12c041 Change some dyn_cast to more appropriate isa. NFC
llvm-svn: 357773
2019-04-05 16:16:23 +00:00
Evandro Menezes 85bd3978ae [IR] Refactor attribute methods in Function class (NFC)
Rename the functions that query the optimization kind attributes.

Differential revision: https://reviews.llvm.org/D60287

llvm-svn: 357731
2019-04-04 22:40:06 +00:00
Craig Topper 3649c20884 [X86] Use INSERT_SUBREG rather than SUBREG_TO_REG when creating LEA64_32 during isel.
SUBREG_TO_REG is supposed to be used to assert that we know the upper bits are
zero. But that isn't the case here. We've done no analysis of the inputs.
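
A sketch of the distinction, with Src32 as a placeholder for the 32-bit input:
```
// Build the 64-bit value without asserting anything about the upper bits.
SDValue ImpDef = SDValue(
    CurDAG->getMachineNode(TargetOpcode::IMPLICIT_DEF, DL, MVT::i64), 0);
SDValue Full = CurDAG->getTargetInsertSubreg(X86::sub_32bit, DL, MVT::i64,
                                             ImpDef, Src32);
// SUBREG_TO_REG would instead claim the upper 32 bits are known zero,
// which nothing here has proven.
```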

llvm-svn: 357673
2019-04-04 05:00:18 +00:00
Craig Topper 52cac4b79f [X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom instruction selection instead.
This custom inserter existed so we could do a weird thing where we pretended that the instructions support
a full address mode instead of taking a pointer in EAX/RAX. I think it was largely so we could be pointer
size agnostic in the isel pattern.

To make this work we would then put the address into EAX/RAX with an LEA in front of the instruction after
isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV
instruction.

With this change we now just do custom selection during isel instead and just assign the incoming address
of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take
care of selecting an LEA or other operation to do any address computation needed in this basic block.

I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit
use is properly sized based on the pointer. Without this we get comments in the assembly output about
killing eax and defining rax or vice versa depending on whether we define the instruction to use EAX/RAX.

llvm-svn: 357652
2019-04-03 23:28:30 +00:00
Craig Topper e4a0fc7d75 [X86] Teach isel for RMW binops to handle negate
Negate updates flags like a subtract. We should be able to use the flags from the RMW form of negate when we have (store (X86ISD::SUB 0, load A), A)

Differential Revision: https://reviews.llvm.org/D60007

llvm-svn: 357353
2019-03-30 18:59:17 +00:00
Craig Topper 4ccb3b96b6 [X86] Use cached OptForSize in X86ISelDAGToDAG.cpp instead of pulling it from the function attribute. NFCI
llvm-svn: 357297
2019-03-29 18:36:40 +00:00
Craig Topper c25c9b4d16 [X86] Teach the isel optimization for (x << C1) op C2 to (x op (C2>>C1)) << C1 to consider cases where C2>>C1 can fit an unsigned 32-bit immediate
For 64-bit operations we should consider if the immediate can be made to fit
in an unsigned 32-bit immediate. For OR/XOR this allows us to load the immediate
with MOV32ri instead of movabsq. For AND this allows us to fold the immediate.
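
A sketch of the check, with CVal and ShAmt as placeholder names:
```
// After shifting the constant right by the shl amount, see whether it
// became a zero-extendable 32-bit immediate.
uint64_t ShiftedVal = (uint64_t)CVal >> ShAmt;
if (isUInt<32>(ShiftedVal)) {
  // OR/XOR: materialize with MOV32ri (implicitly zero-extended) instead
  // of a movabsq; AND: fold the immediate directly (e.g. AND64ri32).
}
```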

Differential Revision: https://reviews.llvm.org/D59867

llvm-svn: 357196
2019-03-28 18:05:37 +00:00
Craig Topper 4bc38cfe29 [X86ISelDAGToDAG] Move initialization of OptForSize and OptForMinSize from PreprocessISelDAG to runOnMachineFunction. NFCI
This makes more sense as a place to initialize these. I don't think runOnMachineFunction was overridden when these cached values were originally created.

llvm-svn: 357123
2019-03-27 21:05:07 +00:00
Craig Topper 7da7b97487 [X86] When iselling (x << C1) and/or/xor C2 as (x and/or/xor (C2>>C1)) << C1, go through the isel table instead of manually selecting.
Previously we manually selected the AND/OR/XOR with immediate and the SHL (or ADD if the shift is 1). But this was missing out on the opportunity to use a 64-bit AND with a 32-bit immediate and possibly other isel tricks we have built into the tables.

Instead, insert the new nodes into the DAG using insertDAGNode and allow them each to be selected through the normal table.
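
A sketch of the shape, assuming the file-local insertDAGNode helper; X, MaskVal, and ShAmtVal are placeholders:
```
// Build the reordered nodes, place them before N topologically, and let
// normal table-driven isel pick the best encodings.
SDValue NewMask  = CurDAG->getConstant(MaskVal >> ShAmtVal, DL, VT);
SDValue NewBinOp = CurDAG->getNode(N->getOpcode(), DL, VT, X, NewMask);
SDValue NewSHL   = CurDAG->getNode(ISD::SHL, DL, VT, NewBinOp,
                                   CurDAG->getConstant(ShAmtVal, DL, MVT::i8));
insertDAGNode(*CurDAG, SDValue(N, 0), NewMask);
insertDAGNode(*CurDAG, SDValue(N, 0), NewBinOp);
insertDAGNode(*CurDAG, SDValue(N, 0), NewSHL);
CurDAG->ReplaceAllUsesWith(N, NewSHL.getNode());
```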

llvm-svn: 357049
2019-03-27 04:45:58 +00:00
Craig Topper 22387a56fe [X86] Simplify some code in matchBitExtract by using ANY_EXTEND.
We were manually outputting the code we would get from selecting ANY_EXTEND. We
can save some code by just letting an ANY_EXTEND go through isel on its own.

llvm-svn: 357045
2019-03-27 02:08:03 +00:00
Craig Topper 4dcabf8ddf [X86] In matchBitExtract, place all of the new nodes before Node's position in the DAG for the topological sort.
We were using OrigNBits, but that put all the nodes before the node we used to start the control computation. This caused some nodes earlier than the sequence we inserted to be selected before the sequence we created. We want our new sequence to be selected first since it depends on OrigNBits.

I don't have a test case. Found by reviewing the code.

llvm-svn: 356979
2019-03-26 05:31:32 +00:00
Craig Topper 10576fea82 [X86] In matchBitExtract, if we need to truncate the BEXTR make sure we put the BEXTR at Node's position in the DAG for the topological sort.
We were using OrigNBits, but that doesn't guarantee that it will be selected before the nodes that make up X.

llvm-svn: 356978
2019-03-26 05:12:23 +00:00
Craig Topper 795ebe3bff [X86] Remove unneeded FIXME. NFC
We do fold loads right below this.

llvm-svn: 356977
2019-03-26 05:12:21 +00:00
Craig Topper a17287f084 [X86] Update some of the getMachineNode calls from X86ISelDAGToDAG to also include a VT for an EFLAGS result.
This makes the nodes consistent with how they would be emitted from the isel
table.

llvm-svn: 356870
2019-03-25 07:22:18 +00:00
Craig Topper 1cc01c3228 [X86] When selecting (x << C1) op C2 as (x op (C2>>C1)) << C1, use the operation VT for the target constant.
Normally when the nodes we use here (AND32ri8 for example) are selected their
immediates are just converted from ConstantSDNode to TargetConstantSDNode
without changing VT from the original operation VT. So we should still be
emitting them with the operation VT.

Theoretically this could expose more accurate opportunities for CSE.

llvm-svn: 356869
2019-03-25 06:53:45 +00:00
Craig Topper 4c544ca993 [X86] Rename X86ISD::CMPM_RND and X86ISD::FSETCCM_RND to _SAE instead of _RND. Remove rounding operand.
The operand could only be the SAE encoding so no need to include it.

llvm-svn: 355801
2019-03-11 04:36:49 +00:00
Craig Topper 97a1c4c340 [X86] Suppress load folding for add/sub with 128 immediate.
128 won't fit in a sign extended 8-bit immediate, but we can negate it to -128 and use the other operation. This results in a shorter encoding since the move would have used 16 or 32 bits for the immediate.

llvm-svn: 355484
2019-03-06 07:36:36 +00:00
Craig Topper 6ca7398a1e [X86] Use PreprocessISelDAG to convert vector sra/srl/shl to the X86 specific variable shift ISD opcodes.
This allows us to use the same set of isel patterns for sra/srl/shl, which are undefined for out-of-range shifts, and the intrinsic shifts, which aren't undefined.

Doing this late allows DAG combine to have every opportunity to optimize the sra/srl/shl nodes.

This removes about 7000 bytes from the isel table and simplifies the td files.

llvm-svn: 355071
2019-02-28 07:21:26 +00:00
Craig Topper 316c58e8f1 [X86] Improve detection of unneeded shift amount masking to also handle the case that the LHS has known zeroes in it
If the LHS has known zeros, the RHS immediate will have had bits removed. So call computeKnownBits to get the known zeroes so we can handle this case.
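
A sketch of the idea, with ShAmt and BitWidth as placeholders for the masked shift amount and the shifted type's width:
```
// The explicit mask is redundant if, together with the bits already known
// zero, it keeps every bit the shift actually reads.
KnownBits Known = CurDAG->computeKnownBits(ShAmt.getOperand(0));
APInt MaskV = cast<ConstantSDNode>(ShAmt.getOperand(1))->getAPIntValue();
APInt Needed = APInt::getLowBitsSet(MaskV.getBitWidth(), Log2_64(BitWidth));
bool MaskIsRedundant = ((MaskV | Known.Zero) & Needed) == Needed;
```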

Differential Revision: https://reviews.llvm.org/D58475

llvm-svn: 354811
2019-02-25 19:42:47 +00:00
Craig Topper 3fe4bd464c [X86] Fix tls variable lowering issue with large code model
Summary:
The problem here is the lowering for a TLS variable. Below is the DAG for the code.
SelectionDAG has 11 nodes:

t0: ch = EntryToken
      t8: i64,ch = load<(load 8 from `i8 addrspace(257)* null`, addrspace 257)> t0, Constant:i64<0>, undef:i64
        t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
      t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64
    t12: i64 = add t8, t11
  t4: i32,ch = load<(dereferenceable load 4 from @x)> t0, t12, undef:i64
t6: ch = CopyToReg t0, Register:i32 %0, t4
And when mcmodel is large, the instruction below cannot be folded.

  t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64
So "t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64" is lowered to " Morphed node: t11: i64,ch = MOV64rm<Mem:(load 8 from got)> t10, TargetConstant:i8<1>, Register:i64 $noreg, TargetConstant:i32<0>, Register:i32 $noreg, t0"

When LLVM starts to lower "t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]", it fails.

The patch is to fold the load and X86ISD::WrapperRIP.

Fixes PR26906

Patch by LuoYuanke

Reviewers: craig.topper, rnk, annita.zhang, wxiao3

Reviewed By: rnk

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58336

llvm-svn: 354756
2019-02-24 19:33:37 +00:00
Roman Lebedev b7ecc9b624 [X86] X86DAGToDAGISel::matchBitExtract(): prepare 'control' in 32 bits
Summary:
Noticed while looking at D56052.
```
  // The 'control' of BEXTR has the pattern of:
  // [15...8 bit][ 7...0 bit] location
  // [ bit count][     shift] name
  // I.e. 0b000000011'00000001 means  (x >> 0b1) & 0b11
```
I.e. we do not care about any of the bits aside from the low 16 bits.
So there is no point in doing the `shl`, `or` in 64 bits,
let's just do everything in 32 bits, and anyext if needed.

We could do that in 16 even, but we intentionally don't
zext to i16 (longer encoding IIRC),
so I'm guessing the same applies here.
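
A sketch of the resulting shape; NBits, ShiftAmt, and XVT are placeholders for the matcher's values:
```
// Assemble the [bit count][shift] control entirely in i32 and only
// any_extend when the BEXTR itself is 64-bit.
SDValue Control = CurDAG->getNode(ISD::SHL, DL, MVT::i32, NBits,
                                  CurDAG->getConstant(8, DL, MVT::i8));
Control = CurDAG->getNode(ISD::OR, DL, MVT::i32, Control, ShiftAmt);
if (XVT == MVT::i64) // only the low 16 bits matter
  Control = CurDAG->getNode(ISD::ANY_EXTEND, DL, MVT::i64, Control);
```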

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D56715

llvm-svn: 353073
2019-02-04 19:04:26 +00:00
Chandler Carruth 2946cd7010 Update the file headers across all of the LLVM projects in the monorepo
to reflect the new license.

We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.

Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.

llvm-svn: 351636
2019-01-19 08:50:56 +00:00
Craig Topper 5ea3120718 [X86] Use X86ISD::BLENDV for blendv intrinsics. Replace vselect with blendv just before isel table lookup. Remove vselect isel patterns.
This cleans up the duplication we have with both intrinsic isel patterns and vselect isel patterns. This should also allow the intrinsics to get SimplifyDemandedBits support for the condition.

I've switched the canonical pattern in isel to use the X86ISD::BLENDV node instead of VSELECT, since it always seemed weird to move from BLENDV, with its relaxed rules on condition bits, to VSELECT, which has strict rules about all bits of the condition element being the same. It's more correct to go from VSELECT to BLENDV.

Differential Revision: https://reviews.llvm.org/D56771

llvm-svn: 351380
2019-01-16 21:46:28 +00:00
Craig Topper 0e420e6a62 [X86] Rename SHRUNKBLEND ISD node to BLENDV.
That's really what it is. If we didn't use intrinsics for BLENDVPS/BLENDVPD/PBLENDVB all the way to isel, this is the node we would use.

llvm-svn: 351278
2019-01-16 00:20:30 +00:00
Roman Lebedev fb4eed381d X86DAGToDAGISel::matchBitExtract() with truncation (PR36419)
Summary:
Previously, in D54095, I added support for extraction of `lshr` from `X` if we are to produce `BEXTR`.
That was good, but the fix was partial; there was still [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]].

That pattern can also appear, roughly, when you have a large (64-bit) storage and you consume bits from it.
It would not be unexpected to be doing further computations in 32-bit width.
And then the current code breaks, as the tests show.

The basic idea/pattern here is the following:
1. We have `i64` input
2. We perform `i64` right-shift on it.
3. We `trunc`ate that shifted value
4. We do all further work (masking) in `i32`

Since we see `trunc`ation and not `lshr`, we give up, and stop trying to extract that right-shift.
BUT. The mask is `i32`, therefore we can extend both of the operands of the masking (`and`) to `i64`
and truncate the result after masking: https://rise4fun.com/Alive/K4B
```
Name: @bextr64_32_b1 -> @bextr64_32_b0
  %shiftedval = lshr i64 %val, %numskipbits
  %truncshiftedval = trunc i64 %shiftedval to i32
  %widenumlowbits1 = zext i8 %numlowbits to i32
  %notmask1 = shl nsw i32 -1, %widenumlowbits1
  %mask1 = xor i32 %notmask1, -1
  %res = and i32 %truncshiftedval, %mask1
=>
  %shiftedval = lshr i64 %val, %numskipbits
  %widenumlowbits = zext i8 %numlowbits to i64
  %notmask = shl nsw i64 -1, %widenumlowbits
  %mask = xor i64 %notmask, -1
  %wideres = and i64 %shiftedval, %mask
  %res = trunc i64 %wideres to i32
```

Thus, we are again able to extract that `lshr` into `BEXTR`'s control.

Now, the perf (via `llvm-exegesis`) of the snippet suggests that it is not a good idea:
```
$ cat /tmp/old.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RSI
# LLVM-EXEGESIS-LIVEIN EDX
# LLVM-EXEGESIS-LIVEIN RDI
movq %rsi, %rcx
shrq %cl, %rdi
shll $8, %edx
bextrl %edx, %edi, %eax
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-1e0082.o
---
mode:            latency
key:
  instructions:
    - 'MOV64rr RCX RSI'
    - 'SHR64rCL RDI RDI'
    - 'SHL32ri EDX EDX i_0x8'
    - 'BEXTR32rr EAX EDI EDX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.6638, per_snippet_value: 2.6552 }
error:           ''
info:            ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-43e346.o
---
mode:            uops
key:
  instructions:
    - 'MOV64rr RCX RSI'
    - 'SHR64rCL RDI RDI'
    - 'SHL32ri EDX EDX i_0x8'
    - 'BEXTR32rr EAX EDI EDX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 0, per_snippet_value: 0 }
  - { key: PdFPU1, value: 0, per_snippet_value: 0 }
  - { key: PdFPU2, value: 0, per_snippet_value: 0 }
  - { key: PdFPU3, value: 0, per_snippet_value: 0 }
  - { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error:           ''
info:            ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
```
vs
```
$ cat /tmp/new.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RDX
# LLVM-EXEGESIS-LIVEIN SIL
# LLVM-EXEGESIS-LIVEIN RDI
shlq $8, %rdx
movzbl %sil, %eax
orq %rdx, %rax
bextrq %rax, %rdi, %rax
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8944f1.o
---
mode:            latency
key:
  instructions:
    - 'SHL64ri RDX RDX i_0x8'
    - 'MOVZX32rr8 EAX SIL'
    - 'OR64rr RAX RAX RDX'
    - 'BEXTR64rr RAX RDI RAX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.7454, per_snippet_value: 2.9816 }
error:           ''
info:            ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-da403c.o
---
mode:            uops
key:
  instructions:
    - 'SHL64ri RDX RDX i_0x8'
    - 'MOVZX32rr8 EAX SIL'
    - 'OR64rr RAX RAX RDX'
    - 'BEXTR64rr RAX RDI RAX'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 0, per_snippet_value: 0 }
  - { key: PdFPU1, value: 0, per_snippet_value: 0 }
  - { key: PdFPU2, value: 0, per_snippet_value: 0 }
  - { key: PdFPU3, value: 0, per_snippet_value: 0 }
  - { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error:           ''
info:            ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
```
^ latency increased (worse).

Except //maybe// not really.
Like with all synthetic benchmarks, they //may// be misleading.

Let's take a look at an actual real-world hotpath.
In this case it's 'my' [[ https://github.com/darktable-org/rawspeed | RawSpeed ]]'s `BitStream<>::peekBitsNoFill()`, in the [[ e3316dc851/src/librawspeed/decompressors/VC5Decompressor.cpp (L814) | GoPro VC5 decompressor ]]:
```
raw.pixls.us-unique/GoPro/HERO6 Black$ /usr/src/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-clangs1-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR
RUNNING: /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplwbKEM
2018-12-22 21:23:03
Running /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench
Run on (8 X 4012.81 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 3.41, 2.41, 2.03
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean           40 ms         40 ms        128   0.322244          7.96974        12M       37.4457M        298.534M      3.12047       24.8778   0.040465
GOPR9172.GPR/threads:8/real_time_median         39 ms         39 ms        128   0.312606          7.99155        12M        38.387M        306.788M      3.19891       25.5656   0.039115
GOPR9172.GPR/threads:8/real_time_stddev          4 ms          3 ms        128  0.0271557         0.130575          0        2.4941M        21.3909M     0.207842       1.78257   3.81081m
RUNNING: /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpWAkan9
2018-12-22 21:23:08
Running /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Run on (8 X 4013.1 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 3.78, 2.50, 2.06
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean           39 ms         39 ms        128   0.311533          7.97323        12M       38.6828M        308.471M      3.22356        25.706  0.0390928
GOPR9172.GPR/threads:8/real_time_median         38 ms         38 ms        128   0.304231          7.99005        12M       39.4437M        315.527M      3.28698        26.294  0.0380316
GOPR9172.GPR/threads:8/real_time_stddev          3 ms          3 ms        128  0.0229149         0.133814          0       2.26225M        19.1421M     0.188521       1.59517   3.13671m
Comparing /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 128 vs 128
GOPR9172.GPR/threads:8/real_time_mean                  -0.0339         -0.0316            40            39            40            39
GOPR9172.GPR/threads:8/real_time_median                -0.0277         -0.0274            39            38            39            38
GOPR9172.GPR/threads:8/real_time_stddev                -0.1769         -0.1267             4             3             3             3
```
I.e. this results in //roughly// a 3% improvement in perf.

While this will help [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]], it won't address it fully.

Reviewers: RKSimon, craig.topper, andreadb, spatel

Reviewed By: craig.topper

Subscribers: courbet, llvm-commits

Differential Revision: https://reviews.llvm.org/D56052

llvm-svn: 351253
2019-01-15 21:31:18 +00:00
Craig Topper 90fe6edcba [X86] Remove X86ISD::SELECT as it's no longer used by any of our intrinsic lowering.
llvm-svn: 350995
2019-01-12 08:15:54 +00:00
Craig Topper 6265a15f2e [X86] Add post-isel peephole to fold KAND+KORTEST into KTEST if only the zero flag is used.
Doing this late so we will prefer to fold the AND into a masked comparison first. That can be better for the live range of the mask register.

Differential Revision: https://reviews.llvm.org/D56246

llvm-svn: 350374
2019-01-04 00:10:58 +00:00
Craig Topper df5304d8de [X86] Add load folding support to the custom isel we do for X86ISD::UMUL/SMUL.
The peephole pass isn't always able to fold the load because it can't commute the implicit usage of AL/AX/EAX/RAX.

llvm-svn: 350272
2019-01-02 23:24:08 +00:00
Craig Topper 9d4860ec4e [X86] Remove X86ISD::INC/DEC. Just select them from X86ISD::ADD/SUB at isel time
INC/DEC are pretty much the same as ADD/SUB except that they don't update the C flag.

This patch removes the special nodes and just pattern matches from ADD/SUB during isel if the C flag isn't being used.

I had to avoid selecting DEC if the result isn't used. It would become a SUB immediate, which would be turned into a CMP later by optimizeCompareInstr. This led to the one test change, where we use a CMP instead of a DEC for an overflow intrinsic, since we only checked the flag.

This also exposed a hole in our RMW flag matching use of hasNoCarryFlagUses. Our root node for the match is a store and there's no guarantee that all the flag users have been selected yet. So hasNoCarryFlagUses needs to check copyToReg and machine opcodes, but it also needs to check for the pre-match SETCC, SETCC_CARRY, BRCOND, and CMOV opcodes.
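
A sketch of the hasNoCarryFlagUses fix, with User as a placeholder for each flag consumer:
```
// A flag user may not be selected yet, so check ISD-level target opcodes
// as well as machine opcodes.
switch (User->getOpcode()) {
case X86ISD::SETCC:
case X86ISD::SETCC_CARRY:
case X86ISD::BRCOND:
case X86ISD::CMOV:
  // Inspect the condition-code operand and reject carry-reading CCs.
  break;
default:
  break; // CopyToReg and already-selected machine nodes, as before
}
```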

Differential Revision: https://reviews.llvm.org/D55975

llvm-svn: 350245
2019-01-02 19:01:05 +00:00
Craig Topper f7cc7e3201 [X86] Remove the separate SMUL8/UMUL8 X86ISD opcodes by merging with SMUL/UMUL. Remove the second result from X86ISD::UMUL.
All of these use custom isel so we can pretty easily detect the differences in the custom code in X86ISelDAGToDAG. The ISD opcodes just need to express the desired semantics not the details of how they would be selected by isel. So unifying them lets us remove the special casing from lowering.

llvm-svn: 350206
2019-01-02 06:40:11 +00:00
Craig Topper d8217b23ff [X86] Move the optimization that turns 'CMP (AND+IMM64), 0' into SRL/SHL+TEST to X86ISelDAGToDAG.
This cleans more code out of EmitTest.

llvm-svn: 350041
2018-12-24 05:27:13 +00:00
Craig Topper e8c50fc6af [X86] Remove the ANDN check from EmitTest.
Remove the TESTmr isel patterns and add another postprocessing combine for TESTrr+ANDrm->TESTmr. We already have a postprocessing combine for TESTrr+ANDrr->TESTrr. With this we can give ANDN a chance to match first, and clean it up during post processing if we ended up with just a regular AND.
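
A sketch of the PostprocessISelDAG peephole shape (64-bit case shown; the rebuild step is elided):
```
if (N->getMachineOpcode() == X86::TEST64rr &&
    N->getOperand(0) == N->getOperand(1) &&
    N->getOperand(0).isMachineOpcode() &&
    N->getOperand(0).getMachineOpcode() == X86::AND64rm &&
    N->getOperand(0).getNode()->hasOneUse()) {
  // Rebuild as TEST64mr, transferring the AND's memory operands, then
  // replace the TEST and delete the now-dead AND.
}
```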

This is another step towards my plan to gut EmitTest and do more flag handling during isel matching or by using optimizeCompare.

llvm-svn: 350038
2018-12-24 01:10:13 +00:00
Craig Topper e58cd9cbc6 [X86] Add isel patterns to match BMI/TBMI instructions when lowering has turned the root nodes into one of the flag producing binops.
This fixes the patterns that have or/xor as a root. 'and' is handled differently since they usually have a CMP wrapped around them.

I had to look for uses of the CF flag because all these nodes have non-standard CF flag behavior. A real or/xor would always clear CF. In practice we shouldn't be using the CF flag from these nodes as far as I know.

Differential Revision: https://reviews.llvm.org/D55813

llvm-svn: 349962
2018-12-21 21:42:43 +00:00
Simon Pilgrim 57733507fe [X86] Always use the version of computeKnownBits that returns a value. NFCI.
Continues the work started by @bogner in rL340594 to remove uses of the old KnownBits output parameter version.

llvm-svn: 349902
2018-12-21 14:25:14 +00:00
Craig Topper 54f1a7be13 [X86] Refactor hasNoCarryFlagUses and hasNoSignFlagUses in X86ISelDAGToDAG.cpp to translate opcode to condition code using the helpers in X86InstrInfo.cpp.
This shortens the switches in X86ISelDAGToDAG.cpp to only need to check condition code instead of a list of opcodes.

This also fixes a bug where the memory forms of SETcc were missing from hasNoCarryFlagUses.

llvm-svn: 349868
2018-12-21 01:14:25 +00:00
Craig Topper e0cff10289 [X86] Add memory forms of some SETCC instructions to hasNoCarryFlagUses.
Found while working on another patch

llvm-svn: 349867
2018-12-21 01:14:23 +00:00
Craig Topper 84a00bd98a [X86] Don't match TESTrr from (cmp (and X, Y), 0) during isel. Defer to post processing
The (cmp (and X, Y) 0) pattern is greedy and ends up forming a TESTrr and consuming the and, when it might be better to use one of the BMI/TBM instructions like BLSR or BLSI.

This patch removes the pattern from isel and adds a post processing check to combine TESTrr+ANDrr into just a TESTrr. With this patch we are able to select the BMI/TBM instructions, but we'll also emit a TESTrr when the result is compared to 0. In many cases the peephole pass will be able to use optimizeCompareInstr to remove the TEST, but it's probably not perfect.

Differential Revision: https://reviews.llvm.org/D55870

llvm-svn: 349661
2018-12-19 18:49:13 +00:00
Craig Topper 1fc257d97f [X86] Rename hasNoSignedComparisonUses to hasNoSignFlagUses. Add the instructions that only modify the O flag to the waiver list.
The only caller of this turns CMP with 0 into TEST. CMP with 0 and TEST both set OF to 0 so we should have no issues with instructions that only use OF.

Though I don't think there's any reason we would read just OF after a compare with 0 anyway. So this probably isn't an observable change.

llvm-svn: 349223
2018-12-15 01:07:19 +00:00
Craig Topper 5c304eac41 [X86] Make hasNoCarryFlagUses/hasNoSignedComparisonUses take an SDValue that indicates which result is the flag result. NFCI
hasNoCarryFlagUses hardcoded that the flag result is 1 and used that to filter which uses were of interest. hasNoSignedComparisonUses just assumes the only result is flags and checks whether any user of the node is a CopyToReg instruction.

After this patch we now do a result number check in both and rely on the caller to provide the result number.

This shouldn't change behavior; it was just an odd difference between the two functions that I noticed.

llvm-svn: 349222
2018-12-15 01:07:16 +00:00
Craig Topper d1c61861dd [X86] Don't emit MULX by default with BMI2
MULX has somewhat improved register allocation constraints compared to the legacy MUL instruction. Both output registers are encoded instead of fixed to EAX/EDX, but EDX is used as input. It also doesn't touch flags. Unfortunately, the encoding is longer.

Preferring it whenever BMI2 is enabled is probably not optimal. Choosing it should somehow be a function of register allocation constraints, like converting adds to three-address form. gcc and icc definitely don't pick MULX by default. Not sure what, if any, rules they have for using it.

Differential Revision: https://reviews.llvm.org/D55565

llvm-svn: 348975
2018-12-12 21:21:31 +00:00
Roman Lebedev 90c5b3f78e [X86] X86DAGToDAGISel::matchBitExtract(): extract 'lshr' from `X`
Summary:
As discussed in the previous review, and noted in the FIXME, if `X` is actually an `lshr Y, Z` (logical!),
we can fold the `Z` into `control`, and let the `BEXTR` do this too.
We could just insert those 8 bits of shift amount into control,
but it is better to instead zero-extend them, and 'or' them in place.

We can only do this for `lshr`, not `ashr`, because we do not know that the mask covers only the bits of `Y`,
and not any of the sign-extended bits.

The obvious question is, is this actually legal to do?
I believe it is. Relevant quotes, from `Intel® 64 and IA-32 Architectures Software Developer’s Manual`, `BEXTR — Bit Field Extract`:
* `Bit 7:0 of the second source operand specifies the starting bit position of bit extraction.`
* `A START value exceeding the operand size will not extract any bits from the second source operand.`
* `Only bit positions up to (OperandSize -1) of the first source operand are extracted.`
* `All higher order bits in the destination operand (starting at bit position LENGTH) are zeroed.`
* `The destination register is cleared if no bits are extracted.`

FIXME: if we can do this, I wonder if we should prefer `BEXTR` over `BZHI` in such cases.

Reviewers: RKSimon, craig.topper, spatel, andreadb

Reviewed By: RKSimon, craig.topper, andreadb

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54095

llvm-svn: 347048
2018-11-16 13:04:54 +00:00
Craig Topper 6c3f1692c8 Revert r345165 "[X86] Bring back the MOV64r0 pseudo instruction"
Google is reporting regressions on some benchmarks.

llvm-svn: 345785
2018-10-31 21:53:24 +00:00
Roman Lebedev b3a14208ac [X86][BMI1] X86DAGToDAGISel: select BEXTR from x & (-1 >> (32 - y)) pattern
Summary:
The final pattern.
There are no test changes:
* We are looking for the pattern with one-use of its mask,
* If the mask is one-use, D48768 will unfold it into pattern d.
* Thus, the tests have extra-use on the mask.
* Thus, only the BMI2 BZHI can be tested, and it already worked.
* So there is no BMI1 test coverage, we just assume it works since it uses the same codepath.

Reviewers: craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53575

llvm-svn: 345584
2018-10-30 11:12:34 +00:00
Craig Topper 2417273255 [X86] Bring back the MOV64r0 pseudo instruction
This patch brings back the MOV64r0 pseudo instruction for zeroing a 64-bit register. This replaces the SUBREG_TO_REG MOV32r0 sequence we use today. Post register allocation we will rewrite the MOV64r0 to a 32-bit xor with an implicit def of the 64-bit register similar to what we do for the various XMM/YMM/ZMM zeroing pseudos.
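
A sketch of the post-RA rewrite being described, assuming the usual pseudo-expansion context (MIB is the MachineInstrBuilder around the pseudo, TRI the register info; names are illustrative):
```
// A 32-bit xor of the low sub-register implicitly zeroes all 64 bits,
// mirroring the XMM/YMM/ZMM zeroing pseudos.
unsigned Reg = MIB->getOperand(0).getReg();
unsigned Reg32 = TRI->getSubReg(Reg, X86::sub_32bit);
MIB->setDesc(get(X86::XOR32rr));
MIB->getOperand(0).setReg(Reg32);
MIB.addReg(Reg32, RegState::Undef)
   .addReg(Reg32, RegState::Undef)
   .addReg(Reg, RegState::ImplicitDefine);
```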

My main motivation is to enable the spill optimization in foldMemoryOperandImpl, as we were seeing some code that repeatedly did "xor eax, eax; store eax;" to spill several registers with a new xor for each store. With this optimization enabled we get a store of a 0 immediate instead of an xor. Though I admit the ideal solution would be one xor where there are multiple spills. I don't believe we have a test case that shows this optimization in here. I'll see if I can try to reduce one from the code we're looking at.

There's definitely some other machine CSE (and maybe other passes) behavior changes exposed by this patch. So it seems like there might be some other deficiencies in SUBREG_TO_REG handling.

Differential Revision: https://reviews.llvm.org/D52757

llvm-svn: 345165
2018-10-24 17:32:09 +00:00
Roman Lebedev 2fae985793 X86DAGToDAGISel::matchBitExtract(): lambdas can't have default arguments.
As reported by ctopper.
That is a gcc-only warning at the moment.

llvm-svn: 345065
2018-10-23 18:27:10 +00:00
Roman Lebedev 06e4db07af Experimental re-land of [X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern
This initially landed in rL345014, but was reverted in rL345017
due to a sanitizer-x86_64-linux-fast buildbot failure in the
check-lld (ELF/relocatable-versioned.s) test.

While I'm not yet quite sure what the problem is, one obvious
thing here is that extra truncation roundtrip.
Maybe that's it? If not, I will re-revert.

Differential Revision: https://reviews.llvm.org/D53521

llvm-svn: 345027
2018-10-23 13:19:31 +00:00
Roman Lebedev c29dbbdb10 Revert "[X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern"
*Seems* to be breaking sanitizer-x86_64-linux-fast buildbot,
the ELF/relocatable-versioned.s test:

==17758==MemorySanitizer CHECK failed: /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:191 "((kBlockMagic)) == ((((u64*)addr)[0]))" (0x6a6cb03abcebc041, 0x0)
    #0 0x59716b in MsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/msan/msan.cc:393
    #1 0x586635 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79
    #2 0x57d5ff in __sanitizer::InternalFree(void*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:191
    #3 0x7fc21b24193f  (/lib/x86_64-linux-gnu/libc.so.6+0x3593f)
    #4 0x7fc21b241999 in exit (/lib/x86_64-linux-gnu/libc.so.6+0x35999)
    #5 0x7fc21b22c2e7 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e7)
    #6 0x57c039 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_msan/bin/lld+0x57c039)

This reverts commit r345014.

llvm-svn: 345017
2018-10-23 10:34:57 +00:00
Roman Lebedev 1c95b2f779 [X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern
Summary:
Continuation of D52348.

We also get the `c) x & (-1 >> (32 - y))` pattern here, because of D48768.
I will add extra uses into those tests and follow up with a patch to handle those patterns too.

Reviewers: RKSimon, craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53521

llvm-svn: 345014
2018-10-23 09:08:44 +00:00
Craig Topper c8e183f9ee Recommit r344877 "[X86] Stop promoting integer loads to vXi64"
I've included a fix to DAGCombiner::ForwardStoreValueToDirectLoad that I believe will prevent the previous miscompile.

Original commit message:

Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344965
2018-10-22 22:14:05 +00:00
Craig Topper 8d8dcfe690 Revert r344877 "[X86] Stop promoting integer loads to vXi64"
Sam McCall reported miscompiles in some tensorflow code. Reverting while I try to figure it out.

llvm-svn: 344921
2018-10-22 16:59:24 +00:00
Roman Lebedev 898808504d [X86] X86DAGToDAGISel: handle BZHI selection too, not just BEXTR.
Summary:
As discussed in D52304 / IRC, we now have pattern matching for
'bit extract' in two places - tablegen and `X86DAGToDAGISel`.
There are 4 patterns.
And we will have a problem with the `x & (-1 >> (32 - y))` pattern.
* If the mask is one-use, then it is always unfolded into `x << (32 - y) >> (32 - y)` first.
  Thus, the existing test coverage is already broken.
* If it is not one-use, then it is not unfolded, and is matched as BZHI.
* If it is not one-use, we will not match it as BEXTR. And if it is one-use, it will have been unfolded already.
So we will either not handle that pattern for BEXTR, or not have test coverage for it.
This is bad.

As discussed with @craig.topper, let's unify this matching, and do everything in `X86DAGToDAGISel`.
Then we will not have code duplication, and will have proper test coverage.

This indeed does not affect any tests, and this is great.
It means that for these two patterns, the `X86DAGToDAGISel` is identical to the tablegen version.

Please review carefully; I'm not fully sure about that intrinsic change, and the introduction of the new `X86ISD` opcode.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: craig.topper

Subscribers: llvm-commits, craig.topper

Differential Revision: https://reviews.llvm.org/D53164

llvm-svn: 344904
2018-10-22 14:12:44 +00:00
Roman Lebedev 13c5ab2e27 [X86][BMI1]: X86DAGToDAGISel: select BEXTR from x & ((1 << nbits) + (-1)) pattern
Summary:
Trivial continuation of D52304.
While this pattern is not canonical, we do select it in the BZHI case,
so this should not be any different.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52348

llvm-svn: 344902
2018-10-22 13:54:17 +00:00
Craig Topper 321df5b0d4 [X86] Stop promoting integer loads to vXi64
Summary:
Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344877
2018-10-21 21:30:26 +00:00
Craig Topper 5eea94edd4 [X86] Remove SDIVREM8_SEXT_HREG/UDIVREM8_ZEXT_HREG and their associated DAG combine and target bits support. Use a post isel peephole instead.
Summary:
These nodes exist to overcome an isel problem where we can generate a zero extend of an AH register followed by an extract subreg, and another zero extend. The first zero extend exists to avoid a partial register update when copying the AH register into the low 8-bits. The second zero extend exists if the user wanted the remainder zero extended.

To make this work we had a DAG combine to morph the DIVREM opcode to a special opcode that included the extend. But then we had to add the new node to computeKnownBits and computeNumSignBits to process the extension portion.

This patch instead removes all of that and adds a late peephole to detect the two extends.
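
As an illustration (my example, not taken from the patch), source like the following produces the AH-register remainder plus the two extends described above:
```
// Illustrative only (assumes b != 0): an i8 urem whose result is returned
// zero-extended. The 8-bit DIV leaves the remainder in AH; copying AH out
// without a partial-register update needs one zero extend, and the
// user-visible extension to 32 bits is the second one the peephole detects.
unsigned rem8(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(a % b);
}
```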

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53449

llvm-svn: 344874
2018-10-21 21:07:27 +00:00
Craig Topper 5c81c68385 [X86] In PostprocessISelDAG, start from allnodes_end, not the root.
There is no guarantee the root is at the end if isel created any nodes without morphing them. This includes the nodes created by manual isel from C++ code in X86ISelDAGToDAG.

This is similar to r333415 from PowerPC which is where I originally stole the peephole loop from.

I don't have a test case, but without this a future patch doesn't work which is how I found it.

llvm-svn: 344808
2018-10-19 19:24:42 +00:00
Kristina Brooks 312fcc116b [X86] Support for the mno-tls-direct-seg-refs flag
Allows disabling direct TLS segment access (%fs or %gs). GCC supports
a similar flag; it can be useful in some circumstances, e.g. when a thread
context block needs to be updated directly from user space. More info
and specific use cases: https://bugs.llvm.org/show_bug.cgi?id=16145

There is another revision for clang as well.
Related: D53102

All X86 CodeGen tests appear to pass:
```
[46/47] Running lit suite /SourceCache/llvm-trunk-8.0/test/CodeGen
Testing Time: 23.17s
  Expected Passes    : 3801
  Expected Failures  : 15
  Unsupported Tests  : 8021
```

Reviewed by: Craig Topper.

Patch by nruslan (Ruslan Nikolaev).

Differential Revision: https://reviews.llvm.org/D53103

llvm-svn: 344723
2018-10-18 03:14:37 +00:00
Craig Topper e0a992918b [X86] Match (cmp (and (shr X, C), mask), 0) to BEXTR+TEST.
Without this we match the CMP+AND to a TEST and then match the SHR separately. I'm trusting analyzeCompare to remove the TEST during the peephole pass. Otherwise we need to check the flag users to see if they only use the Z flag.
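
An illustrative C++ shape for this pattern (my example, not from the commit):
```
// A shifted, masked field compared against zero, i.e.
// (cmp (and (shr X, C), mask), 0), which can now select BEXTR to extract
// the field followed by TEST for the zero check.
bool field_is_nonzero(unsigned x) {
    return ((x >> 7) & 0x3FF) != 0;  // start = 7, length = 10
}
```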

This recovers a case lost by r344270.

Differential Revision: https://reviews.llvm.org/D53310

llvm-svn: 344649
2018-10-16 22:29:36 +00:00
Simon Pilgrim c844bc84dd [X86] Ignore float/double non-temporal loads (PR39256)
Scalar non-temporal loads were asserting instead of just being ignored.

Reduced from https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10895

llvm-svn: 344331
2018-10-12 10:20:16 +00:00
Craig Topper fb2ac8969e [X86] Restore X86ISelDAGToDAG::matchBEXTRFromAnd. Teach address matching to create a BEXTR pattern from a (shl (and X, mask >> C1) if C1 can be folded into addressing mode.
This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about.

We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR.

I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that.

I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet.

Differential Revision: https://reviews.llvm.org/D53126

llvm-svn: 344270
2018-10-11 18:06:07 +00:00
Roman Lebedev 4225f4adff [X86][BMI1]: X86DAGToDAGISel: select BEXTR from x & ~(-1 << nbits) pattern
Summary:
As discussed in D48491, we can't really do this in the TableGen,
since we need to produce *two* instructions. This only implements
one single pattern. The other 3 patterns will be in follow-ups.

I'm not sure yet if we want to also fuse shift into here
(i.e `(x >> start) & ...`)

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D52304

llvm-svn: 344224
2018-10-11 07:51:13 +00:00
Craig Topper b5421c498d [X86] Prevent non-temporal loads from folding into instructions by blocking them in X86DAGToDAGISel::IsProfitableToFold rather than with a predicate.
Remove tryFoldVecLoad since tryFoldLoad would call IsProfitableToFold and pick up the new check.

This saves about 5K out of ~600K on the generated isel table.

llvm-svn: 344189
2018-10-10 21:48:34 +00:00
Roman Lebedev 33d84c6dac [X86] Move X86DAGToDAGISel::matchBEXTRFromAnd() into X86ISelLowering
Summary:
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 | PR38938 ]],
we fail to emit `BEXTR` if the mask is shifted.
We can't deal with that in `X86DAGToDAGISel` before the address mode for the inc is selected,
and we can't really do it in the normal DAGCombine, because we don't have a generic `ISD::BitFieldExtract` node,
and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back.
So it would seem X86ISelLowering is the place to handle this.

This patch only moves the matchBEXTRFromAnd()
from X86DAGToDAGISel to X86ISelLowering.
It does not add support for the 'shifted mask' pattern.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52426

llvm-svn: 344179
2018-10-10 20:40:12 +00:00
Craig Topper 42cd8cd862 Recommit r343499 "[X86] Enable load folding in the test shrinking code"
Original message:
This patch adds load folding support to the test shrinking code. This was noticed missing in the review for D52669

llvm-svn: 343540
2018-10-01 21:35:28 +00:00
Craig Topper f06a57fc89 Recommit r343498 "[X86] Improve test instruction shrinking when the sign flag is used and the output of the and is truncated."
This includes a fix to prevent i16 compares with i32/i64 ands from being shrunk if bit 15 of the and is set and the sign bit is used.

Original commit message:
Currently we skip looking through truncates if the sign flag is used. But that's overly restrictive.

It's safe to look through the truncate as long as we ensure one of the 3 things when we shrink. Either the MSB of the mask at the shrunken size isn't set. If the mask bit is set then either the shrunk size needs to be equal to the compare size or the sign flag needs to be unused.

There are still missed opportunities to shrink a load and fold it in here. This will be fixed in a future patch.

llvm-svn: 343539
2018-10-01 21:35:26 +00:00
Craig Topper e072934d28 Revert r343499 and r343498. X86 test improvements
There's a subtle bug in the handling of truncate from i32/i64 to i16 without minsize.

I'll be adding more test cases and trying to find a fix.

llvm-svn: 343516
2018-10-01 18:40:44 +00:00
Craig Topper aa84e1bba2 [X86] Enable load folding in the test shrinking code
This patch adds load folding support to the test shrinking code. This was noticed missing in the review for D52669

Differential Revision: https://reviews.llvm.org/D52699

llvm-svn: 343499
2018-10-01 17:10:50 +00:00
Craig Topper 2b587ad071 [X86] Improve test instruction shrinking when the sign flag is used and the output of the and is truncated
Currently we skip looking through truncates if the sign flag is used. But that's overly restrictive.

It's safe to look through the truncate as long as we ensure one of the 3 things when we shrink. Either the MSB of the mask at the shrunken size isn't set. If the mask bit is set then either the shrunk size needs to be equal to the compare size or the sign flag needs to be unused.

There are still missed opportunities to shrink a load and fold it in here. This will be fixed in a future patch.

Differential Revision: https://reviews.llvm.org/D52669

llvm-svn: 343498
2018-10-01 17:10:45 +00:00
Craig Topper 99ad2a5723 [X86] Copy memrefs when folding a load for division instruction selection.
llvm-svn: 343419
2018-09-30 17:47:18 +00:00
Craig Topper 1709829fed [X86] Disable BMI BEXTR in X86DAGToDAGISel::matchBEXTRFromAnd unless we're on compiling for a CPU with single uop BEXTR
Summary:
This function turns (X >> C1) & C2 into a BMI BEXTR or TBM BEXTRI instruction. For BMI BEXTR we have to materialize an immediate into a register to feed to the BEXTR instruction.

The BMI BEXTR instruction is 2 uops on Intel CPUs. It looks like on SKL it's one port 0/6 uop and one port 1/5 uop, despite what Agner's tables say. I know one of the uops is a regular shift uop so it would have to go through the port 0/6 shifter unit. So that's the same or worse execution-wise than the shift+and, which is one 0/6 uop and one 0/1/5/6 uop. The move immediate into register is an additional 0/1/5/6 uop.

For now I've limited this transform to AMD CPUs which have a single uop BEXTR. It may also make sense if we can fold a load, or if the and immediate is larger than 32 bits and can't be encoded as a sign-extended 32-bit value, or if LICM or CSE can hoist the move immediate and share it. But we'd need to look more carefully at that. In the regression I looked at it doesn't look like load folding or large immediates were occurring, so the regression isn't caused by the loss of those. So we could try to be smarter here if we find a compelling case.

Reviewers: RKSimon, spatel, lebedev.ri, andreadb

Reviewed By: RKSimon

Subscribers: llvm-commits, andreadb, RKSimon

Differential Revision: https://reviews.llvm.org/D52570

llvm-svn: 343399
2018-09-30 03:01:46 +00:00
Craig Topper 313d09af51 [X86] Teach X86DAGToDAGISel::foldLoadStoreIntoMemOperand to handle loads in operand 1 of commutable operations.
Previously we only handled loads in operand 0, but nothing guarantees the load will be operand 0 for commutable operations.

Differential Revision: https://reviews.llvm.org/D51768

llvm-svn: 341675
2018-09-07 16:27:55 +00:00
Craig Topper 13148564d4 [X86] Fix some incorrect comments. NFC
llvm-svn: 341624
2018-09-07 01:29:42 +00:00
Craig Topper 0fd5cdee3a [X86] Add isel patterns for commuting X86adc_flag with a load in the LHS.
The peephole pass likely gets this normally, but we should be doing it during isel.

Ideally we'd just make the X86adc_flag pattern SDNPCommutable, but the tablegen doesn't handle that when one of the operands is a register reference.

llvm-svn: 341596
2018-09-06 22:41:44 +00:00
Sanjay Patel 40aa86751a [x86] add debug option for and-immediate shrinking
The commit that added this functionality:
rL322957

may be causing/exposing a miscompile in PR38648:
https://bugs.llvm.org/show_bug.cgi?id=38648

so allow enabling/disabling to make debugging easier.

llvm-svn: 340540
2018-08-23 15:58:07 +00:00
Chandler Carruth ae0cafece8 [x86/retpoline] Split the LLVM concept of retpolines into separate
subtarget features for indirect calls and indirect branches.

This is in preparation for enabling *only* the call retpolines when
using speculative load hardening.

I've continued to use subtarget features for now as they continue to
seem the best fit given the lack of other retpoline like constructs so
far.

The LLVM side is pretty simple. I'd like to eventually get rid of the
old feature, but not sure what backwards compatibility issues that will
cause.

This does remove the "implies" from requesting an external thunk. This
always seemed somewhat questionable and is now clearly not desirable --
you specify a thunk the same way no matter which set of things are
getting retpolines.

I really want to keep this nicely isolated from end users and just an
LLVM implementation detail, so I've moved the `-mretpoline` flag in
Clang to no longer rely on a specific subtarget feature by that name and
instead to be directly handled. In some ways this is simpler, but in
order to preserve existing behavior I've had to add some fallback code
so that users who relied on merely passing -mretpoline-external-thunk
continue to get the same behavior. We should eventually remove this
I suspect (we have never tested that it works!) but I've not done that
in this patch.

Differential Revision: https://reviews.llvm.org/D51150

llvm-svn: 340515
2018-08-23 06:06:38 +00:00
Craig Topper 538f8ab438 [X86] Replace (32/64 - n) shift amounts with (neg n) since the shift amount is masked in hardware
Inspired by what AArch64 does for shifts, this patch attempts to replace shift amounts with neg if we can.

This is done directly as part of isel so it's as late as possible to avoid breaking some BZHI patterns since those patterns need an unmasked (32-n) to be correct.

To avoid manual load folding and custom instruction selection for the negate, I've inserted new nodes in the DAG above the shift node in topological order.
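
A small sketch of why the replacement is sound (my example; both functions compute the same value because the hardware masks the count to 5 bits for 32-bit shifts):
```
// (32 - n) mod 32 == (-n) mod 32, so once the count is masked the two
// shift amounts select the same bits; the explicit '& 31' makes it visible.
unsigned shl_sub(unsigned x, unsigned n) { return x << ((32u - n) & 31); }
unsigned shl_neg(unsigned x, unsigned n) { return x << ((0u - n) & 31); }  // same value
```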

Differential Revision: https://reviews.llvm.org/D48789

llvm-svn: 340441
2018-08-22 19:39:09 +00:00
Chandler Carruth 66654b72c9 [SDAG] Remove the reliance on MI's allocation strategy for
`MachineMemOperand` pointers attached to `MachineSDNodes` and instead
have the `SelectionDAG` fully manage the memory for this array.

Prior to this change, the memory management was deeply confusing here --
The way the MI was built relied on the `SelectionDAG` allocating memory
for these arrays of pointers using the `MachineFunction`'s allocator so
that the raw pointer to the array could be blindly copied into an
eventual `MachineInstr`. This creates a hard coupling between how
`MachineInstr`s allocate their array of `MachineMemOperand` pointers and
how the `MachineSDNode` does.

This change is motivated in large part by a change I am making to how
`MachineFunction` allocates these pointers, but it seems like a layering
improvement as well.

This would run the risk of increasing allocations overall, but I've
implemented an optimization that should avoid that by storing a single
`MachineMemOperand` pointer directly instead of allocating anything.
This is expected to be a net win because the vast majority of uses of
these only need a single pointer.

As a side-effect, this makes the API for updating a `MachineSDNode` and
a `MachineInstr` reasonably different which seems nice to avoid
unexpected coupling of these two layers. We can map between them, but we
shouldn't be *surprised* at where that occurs. =]

Differential Revision: https://reviews.llvm.org/D50680

llvm-svn: 339740
2018-08-14 23:30:32 +00:00
Craig Topper a7a12399a1 [X86] Remove all the vector NOP bitcast patterns. Use a few lines of code in the Select method in X86ISelDAGToDAG.cpp instead.
There are a lot of permutations of types here generating a lot of patterns in the isel table. It's more efficient to just ReplaceUses and RemoveDeadNode from the Select function.

The test changes are because we have some shuffle patterns that have a bitcast as their root node. But the behavior is identical to another instruction whose pattern doesn't start with a bitcast. So this isn't a functional change.

llvm-svn: 338824
2018-08-03 07:01:10 +00:00
Craig Topper a80352c04e [X86] When post-processing the DAG to remove zero extending moves for YMM/ZMM, make sure the producing instruction is VEX/XOP/EVEX encoded.
If the producing instruction is legacy encoded it doesn't implicitly zero the upper bits. This is important for the SHA instructions which don't have a VEX encoded version. We might also be able to hit this with the incomplete f128 support that hasn't been ported to VEX.

llvm-svn: 338812
2018-08-03 04:49:42 +00:00
Reid Kleckner 980c4df037 Re-land r335297 "[X86] Implement more of x86-64 large and medium PIC code models"
Don't try to generate large PIC code for non-ELF targets. Neither COFF
nor MachO have relocations for large position independent code, and
users have been using "large PIC" code models to JIT 64-bit code for a
while now. With this change, if they are generating ELF code, their
JITed code will truly be PIC, but if they target MachO or COFF, it will
contain 64-bit immediates that directly reference external symbols. For
a JIT, that's perfectly fine.

llvm-svn: 337740
2018-07-23 21:14:35 +00:00
Craig Topper abc307e6fa [X86] Connect the flags user from PCMPISTR instructions to the correct node from the instruction.
We were accidentally connecting it to result 0 instead of result 1. This was caught by the machine verifier that noticed the flags were dead, but we were using them somehow. I'm still not clear what actually happened downstream.

llvm-svn: 336925
2018-07-12 18:04:05 +00:00
Craig Topper 38b290f7d7 [X86] Remove patterns for inserting a load into a zero vector.
We can instead block the load folding in isProfitableToFold. Then isel will emit a register->register move for the zeroing part and a separate load. The PostProcessISelDAG should be able to remove the register->register move.

This saves us patterns and fixes the fact that we only had unaligned load patterns. The test changes show places where we should have been using an aligned load.

llvm-svn: 336828
2018-07-11 18:09:04 +00:00
Craig Topper 08b81a5508 [X86] Use IsProfitableToFold to block vinsertf128rm in favor of insert_subreg instead of artificially increasing pattern complexity to give priority.
This is a much more direct way to solve the issue than just giving extra priority.

llvm-svn: 336639
2018-07-10 06:19:54 +00:00
Craig Topper 90317d1d94 [X86] Suppress load folding into and/or/xor if it will prevent matching btr/bts/btc.
This is a follow up to r335753. At the time I forgot about isProfitableToFold which makes this pretty easy.

Differential Revision: https://reviews.llvm.org/D48706

llvm-svn: 335895
2018-06-28 17:58:01 +00:00
Jonas Devlieghere b757fc3878 Revert "Re-land r335297 "[X86] Implement more of x86-64 large and medium PIC code models""
Reverting because this is causing failures in the LLDB test suite on
GreenDragon.

  LLVM ERROR: unsupported relocation with subtraction expression, symbol
  '__GLOBAL_OFFSET_TABLE_' can not be undefined in a subtraction
  expression

llvm-svn: 335894
2018-06-28 17:56:43 +00:00
Craig Topper ab70f58891 [X86] Change how we prefer shift by immediate over folding a load into a shift.
BMI2 added new shift by register instructions that have the ability to fold a load.

Normally without doing anything special isel would prefer folding a load over folding an immediate because the load folding pattern has higher "complexity". This would require an instruction to move the immediate into a register. We would rather fold the immediate instead and have a separate instruction for the load.

We used to enforce this priority by artificially lowering the complexity of the load pattern.

This patch changes this to instead reject the load fold in isProfitableToFoldLoad if there is an immediate. This is more consistent with other binops and feels less hacky.

llvm-svn: 335804
2018-06-28 00:47:41 +00:00
Craig Topper 880e34ed45 [X86] In X86DAGToDAGISel::PreprocessISelDAG, make sure we don't access N after we delete it.
If we turn X86ISD::AND into ISD::AND, we delete N. But we were continuing onto the next block of code even though N no longer existed.

Just happened to notice it. I assume asan didn't notice it because we explicitly unpoison deleted nodes and give them a DELETE_NODE opcode.

llvm-svn: 335787
2018-06-27 20:58:46 +00:00
Reid Kleckner 88fee5fdbc Re-land r335297 "[X86] Implement more of x86-64 large and medium PIC code models"
The large code model allows code and data segments to exceed 2GB, which
means that some symbol references may require a displacement that cannot
be encoded as a displacement from RIP. The large PIC model even relaxes
the assumption that the GOT itself is within 2GB of all code. Therefore,
we need a special code sequence to materialize it:
  .LtmpN:
    leaq .LtmpN(%rip), %rbx
    movabsq $_GLOBAL_OFFSET_TABLE_-.LtmpN, %rax # Scratch
    addq %rax, %rbx # GOT base reg

From that, non-local references go through the GOT base register instead
of being PC-relative loads. Local references typically use GOTOFF
symbols, like this:
    movq extern_gv@GOT(%rbx), %rax
    movq local_gv@GOTOFF(%rbx), %rax

All calls end up being indirect:
    movabsq $local_fn@GOTOFF, %rax
    addq %rbx, %rax
    callq *%rax

The medium code model retains the assumption that the code segment is
less than 2GB, so calls are once again direct, and the RIP-relative
loads can be used to access the GOT. Materializing the GOT is easy:
    leaq _GLOBAL_OFFSET_TABLE_(%rip), %rbx # GOT base reg

DSO local data accesses will use it:
    movq local_gv@GOTOFF(%rbx), %rax

Non-local data accesses will use RIP-relative addressing, which means we
may not always need to materialize the GOT base:
    movq extern_gv@GOTPCREL(%rip), %rax

Direct calls are basically the same as they are in the small code model:
They use direct, PC-relative addressing, and the PLT is used for calls
to non-local functions.

This patch adds reasonably comprehensive testing of LEA, but there are
lots of interesting folding opportunities that are unimplemented.

I restricted the MCJIT/eh-lg-pic.ll test to Linux, since the large PIC
code model is not implemented for MachO yet.

Differential Revision: https://reviews.llvm.org/D47211

llvm-svn: 335508
2018-06-25 18:16:27 +00:00
Reid Kleckner 3a2fd1c2f3 Revert r335297 "[X86] Implement more of x86-64 large and medium PIC code models"
MCJIT can't handle R_X86_64_GOT64 yet.

llvm-svn: 335300
2018-06-21 22:19:05 +00:00
Reid Kleckner 247fe6aeab [X86] Implement more of x86-64 large and medium PIC code models
Summary:
The large code model allows code and data segments to exceed 2GB, which
means that some symbol references may require a displacement that cannot
be encoded as a displacement from RIP. The large PIC model even relaxes
the assumption that the GOT itself is within 2GB of all code. Therefore,
we need a special code sequence to materialize it:
  .LtmpN:
    leaq .LtmpN(%rip), %rbx
    movabsq $_GLOBAL_OFFSET_TABLE_-.LtmpN, %rax # Scratch
    addq %rax, %rbx # GOT base reg

From that, non-local references go through the GOT base register instead
of being PC-relative loads. Local references typically use GOTOFF
symbols, like this:
    movq extern_gv@GOT(%rbx), %rax
    movq local_gv@GOTOFF(%rbx), %rax

All calls end up being indirect:
    movabsq $local_fn@GOTOFF, %rax
    addq %rbx, %rax
    callq *%rax

The medium code model retains the assumption that the code segment is
less than 2GB, so calls are once again direct, and the RIP-relative
loads can be used to access the GOT. Materializing the GOT is easy:
    leaq _GLOBAL_OFFSET_TABLE_(%rip), %rbx # GOT base reg

DSO local data accesses will use it:
    movq local_gv@GOTOFF(%rbx), %rax

Non-local data accesses will use RIP-relative addressing, which means we
may not always need to materialize the GOT base:
    movq extern_gv@GOTPCREL(%rip), %rax

Direct calls are basically the same as they are in the small code model:
They use direct, PC-relative addressing, and the PLT is used for calls
to non-local functions.

This patch adds reasonably comprehensive testing of LEA, but there are
lots of interesting folding opportunities that are unimplemented.

Reviewers: chandlerc, echristo

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D47211

llvm-svn: 335297
2018-06-21 21:55:08 +00:00
Craig Topper c2696d577b [X86] Use setcc ISD opcode for AVX512 integer comparisons all the way to isel
I don't believe there is any real reason to have separate X86 specific opcodes for vector compares. Setcc has the same behavior; it just uses a different encoding for the condition code.

I had to change the CondCodeAction for SETLT and SETLE to prevent some transforms from changing SETGT lowering.

Differential Revision: https://reviews.llvm.org/D43608

llvm-svn: 335173
2018-06-20 21:05:02 +00:00
Craig Topper b0e986f88e [X86] Pass the parent SDNode to X86DAGToDAGISel::selectScalarSSELoad to simplify the hasSingleUseFromRoot handling.
Some of the calls to hasSingleUseFromRoot were passing the load itself. If the load's chain result has a user this would count against that. By getting the true parent of the match and ensuring any intermediates between the match and the load have a single use, we can avoid this case. isLegalToFold will take care of checking users of the load's data output.

This fixed at least fma-scalar-memfold.ll to succeed without the peephole pass.

llvm-svn: 334908
2018-06-17 16:29:46 +00:00
Craig Topper 3efdb7ce19 [X86] Push some variable declarations down into the individual switch cases that need them. NFC
All of the cases are already wrapped in curly braces so declaring a variable there isn't an issue. And the variables aren't assigned or used in the larger scope.

llvm-svn: 334436
2018-06-11 20:50:58 +00:00
Simon Pilgrim 3d14158891 [X86][BMI][TBM] Only demand bottom 16-bits of the BEXTR control op (PR34042)
Only the bottom 16-bits of BEXTR's control op are required (7:0 INDEX, 15:8 LENGTH).
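
A hedged sketch of the control-word layout this refers to (helper name is mine):
```
// BEXTR control word: bits 7:0 hold the start index, bits 15:8 the length;
// everything above bit 15 is ignored, which is what lets us drop the demand.
unsigned makeBextrControl(unsigned start, unsigned length) {
    return (start & 0xFF) | ((length & 0xFF) << 8);
}
```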

Differential Revision: https://reviews.llvm.org/D47690

llvm-svn: 334083
2018-06-06 10:52:10 +00:00
Reid Kleckner 537917d13c [X86] Simplify some X86 address mode folding code, NFCI
This code should really do exactly the same thing for 32-bit x86 and
64-bit small code models, with the exception that RIP-relative
addressing can't use base and index registers.

llvm-svn: 332893
2018-05-21 21:03:19 +00:00
Nicola Zaghen d34e60ca85 Rename DEBUG macro to LLVM_DEBUG.
The DEBUG() macro is very generic so it might clash with other projects.
The renaming was done as follows:
- git grep -l 'DEBUG' | xargs sed -i 's/\bDEBUG\s\?(/LLVM_DEBUG(/g'
- git diff -U0 master | ../clang/tools/clang-format/clang-format-diff.py -i -p1 -style LLVM
- Manual change to APInt
- Manually change DOCS as regex doesn't match it.

In the transition period the DEBUG() macro is still present and aliased
to the LLVM_DEBUG() one.

Differential Revision: https://reviews.llvm.org/D43624

llvm-svn: 332240
2018-05-14 12:53:11 +00:00
Adrian Prantl 5f8f34e459 Remove \brief commands from doxygen comments.
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.

Patch produced by

  for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done

Differential Revision: https://reviews.llvm.org/D46290

llvm-svn: 331272
2018-05-01 15:54:18 +00:00
Nico Weber 432a38838d IWYU for llvm-config.h in llvm, additions.
See r331124 for how I made a list of files missing the include.
I then ran this Python script:

    for f in open('filelist.txt'):
        f = f.strip()
        fl = open(f).readlines()

        found = False
        for i in xrange(len(fl)):
            p = '#include "llvm/'
            if not fl[i].startswith(p):
                continue
            if fl[i][len(p):] > 'Config':
                fl.insert(i, '#include "llvm/Config/llvm-config.h"\n')
                found = True
                break
        if not found:
            print 'not found', f
        else:
            open(f, 'w').write(''.join(fl))

and then looked through everything with `svn diff | diffstat -l | xargs -n 1000 gvim -p`
and tried to fix include ordering and whatnot.

No intended behavior change.

llvm-svn: 331184
2018-04-30 14:59:11 +00:00
Craig Topper d656410293 [X86] Make the STTNI flag intrinsics use the flags from pcmpestrm/pcmpistrm if the mask instrinsics are also used in the same basic block.
Summary:
Previously the flag intrinsics always used the index instructions even if a mask instruction also exists.

To fix this I've created a single ISD node type that returns index, mask, and flags. The SelectionDAG CSE process will merge all flavors of intrinsics with the same inputs to a single node. Then during isel we just have to look at which results are used to know what instruction to generate. If both mask and index are used we'll need to emit two instructions. But for all other cases we can emit a single instruction.

Since I had to do manual isel anyway, I've removed the pseudo instructions and custom inserter code that was working around tablegen limitations with multiple implicit defs.

I've also renamed the recently added sse42.ll test case to sttni.ll since it focuses on that subset of the sse4.2 instructions.

Reviewers: chandlerc, RKSimon, spatel

Reviewed By: chandlerc

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D46202

llvm-svn: 331091
2018-04-27 22:15:33 +00:00
Craig Topper 7e42af87a6 [X86] Prevent folding loads with 64-bit ANDs with immediates that fit in 32-bits.
Prefer to use the 32-bit AND with immediate instead.

Primarily I'm doing this to ensure that immediates created by shrinkAndImmediate will always get absorbed into the AND. But I do believe this would be a reduction in the number of uops that need to execute. Ideally we should shrink the 'and' and the 'load' during DAG combine to re-enable the fold.
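
An illustrative case (my example, not from the commit):
```
// A 64-bit AND whose immediate fits in 32 bits. 'andl $255, %eax'
// implicitly zeroes bits 63:32, so the 32-bit form is preferred over
// folding the load into a 64-bit AND.
unsigned long long low_byte(unsigned long long x) { return x & 0xFFull; }
```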

Fixes PR37063.

llvm-svn: 329667
2018-04-10 03:44:15 +00:00
Craig Topper 88e38e3e3e [X86] Remove more dead code left over from the handling of i8/i16 UMUL_LOHI/SMUL_LOHI that is no longer needed. NFC
llvm-svn: 329152
2018-04-04 07:00:16 +00:00
Craig Topper afa22edcf0 [X86] Remove dead code for handling i8/i16 UMUL_LOHI/SMUL_LOHI from X86ISelDAGToDAG.cpp. NFC
These are promoted to i16/i32 multiplies by a DAG combine.

llvm-svn: 329147
2018-04-04 04:38:55 +00:00
Nirav Dave 8c5f47ac40 [DAG, X86] Fix ISel-time node insertion ids
As in the SystemZ backend, correctly propagate node ids when inserting
new unselected nodes into the DAG during instruction selection for the
X86 target.

Fixes PR36865.

Reviewers: jyknight, craig.topper

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D44797

llvm-svn: 328233
2018-03-22 19:32:07 +00:00
Craig Topper ad7c685791 [X86] Rename MOVSX32_NOREXrr8 to MOVSX32rr8_NOREX so that the scheduler model regular expressions will pick it up with the regular version.
Do the same for MOVSX32_NOREXrm8, MOVZX32_NOREXrr8, and MOVZX32_NOREXrm8

llvm-svn: 327948
2018-03-20 05:00:20 +00:00
Nirav Dave 3264c1bdf6 [DAG, X86] Revert r327197 "Revert r327170, r327171, r327172"
Reland ISel cycle checking improvements after simplifying node id
invariant traversal and correcting a typo.

llvm-svn: 327898
2018-03-19 20:19:46 +00:00
Nirav Dave 5f0ab71b62 Revert "[DAG, X86] Revert r327197 "Revert r327170, r327171, r327172""
as it times out building test-suite on PPC.

llvm-svn: 327778
2018-03-17 19:24:54 +00:00
Nirav Dave 982d3a56ea [DAG, X86] Revert r327197 "Revert r327170, r327171, r327172"
Reland ISel cycle checking improvements after simplifying and reducing
node id invariant traversal.

llvm-svn: 327777
2018-03-17 17:42:10 +00:00
Craig Topper 25007c4f32 [X86] Pass SelectionDAG into X86ISelAddressMode::dump and on to SDNode::dump.
This prevents a crash in SelectionDAGDumper with -debug when trying to print mem operands if one of the registers in the addressing mode comes from a load.

llvm-svn: 327744
2018-03-16 21:10:07 +00:00
Craig Topper e6913ec340 [X86] Post process the DAG after isel to remove vector moves that were added to zero upper bits.
We previously avoided inserting these moves during isel in a few cases, which was implemented using a whitelist of opcodes. But it's too difficult to generate a perfect list of opcodes to whitelist. Especially with AVX512F without AVX512VL using 512 bit vectors to implement some 128/256 bit operations. Since isel is done bottom-up, we'd have to check the VT and opcode and subtarget in order to determine whether an EXTRACT_SUBREG would be generated for some operations.

So instead of doing that, this patch adds a post processing step that detects when the moves are unnecessary after isel. At that point any EXTRACT_SUBREGs would have already been created and appear in the DAG. So then we just need to ensure the input to the move isn't one.

Differential Revision: https://reviews.llvm.org/D44289

llvm-svn: 327724
2018-03-16 17:13:42 +00:00
Nirav Dave 042678bd55 Revert: r327172 "Correct load-op-store cycle detection analysis"
r327171 "Improve Dependency analysis when doing multi-node Instruction Selection"
        r328170 "[DAG] Enforce stricter NodeId invariant during Instruction selection"

Reverting patch as NodeId invariant change is causing pathological
increases in compile time on PPC

llvm-svn: 327197
2018-03-10 02:16:15 +00:00
Nirav Dave 0fab41782d Correct load-op-store cycle detection analysis
Add missing cycle dependency checks in load-op-store fusion.

Fixes PR36274.

Reviewers: craig.topper, bogner

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D43154

llvm-svn: 327172
2018-03-09 20:58:07 +00:00
Nirav Dave d668f69ee7 Improve Dependency analysis when doing multi-node Instruction Selection
Relanding after fixing NodeId Invariant.

Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies pruning the search when topological
property of NodeId allows.

As part of this, propagate the NodeId-based cutoffs to narrow
hasPredecessorHelper searches.

Reviewers: craig.topper, bogner

Subscribers: llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D41293

llvm-svn: 327171
2018-03-09 20:57:42 +00:00
Nirav Dave 071699bf82 [DAG] Enforce stricter NodeId invariant during Instruction selection
Instruction Selection makes use of the topological ordering of nodes
by node id (a node's operands have smaller node id than it) when doing
cycle detection.  During selection we may violate this property as a
selection of multiple nodes may induce a use dependence (and thus a
node id restriction) between two unrelated nodes. If a selected node
has an unselected successor, this may allow us to miss a cycle and make
an invalid selection.

This patch fixes this by marking all unselected successors of a
selected node with a negated node id. We avoid pruning on such negative
ids but can still reconstruct the original id for pruning.

In-tree targets have been updated to replace DAG-level replacements
with ISel-level ones which enforce this property.

This preemptively fixes PR36312 before triggering commit r324359 relands

Reviewers: craig.topper, bogner, jyknight

Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D43198

llvm-svn: 327170
2018-03-09 20:57:15 +00:00
Craig Topper 48d5ed265c [X86] Don't use EXTRACT_ELEMENT from v1i1 with i8/i32 result type when we need to guarantee zeroes in the upper bits of return.
An extract_element where the result type is larger than the scalar element type is semantically an any_extend from the scalar element type to the result type. If we expect zeroes in the upper bits of the i8/i32 we need to make sure those zeroes are explicit in the DAG.

For these cases the best way to accomplish this is to use an insert_subvector to pad zeroes to the upper bits of the v1i1 first. We extend to either v16i1 (for i32) or v8i1 (for i8). Then bitcast that to a scalar and finish with a zero_extend up to i32 if necessary. We can't extend past v16i1 because that's the largest mask size on KNL. But isel is smart enough to know that a zext of a bitcast from v16i1 to i16 can use a KMOVW instruction. The insert_subvectors will be dropped during isel because we can determine that the producing instruction already zeroed the upper bits of the k-register.

llvm-svn: 326308
2018-02-28 08:14:28 +00:00
Chandler Carruth a1d6107b14 [DAG, X86] Revert r324797, r324491, and r324359.
Sadly, r324359 caused at least PR36312. There is a patch out for review
but it seems to be taking a bit and we've already had these crashers in
tree for too long. We're hitting this PR in real code now and are
blocked on shipping new compilers as a consequence so I'm reverting us
back to green.

Sorry for the churn due to the stacked changes that I had to revert. =/

llvm-svn: 325420
2018-02-17 02:26:25 +00:00
Craig Topper 2b2d8c5eb2 [X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the upper 32-bits of a 64-bit operation.
We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.

For now only do this for optsize.
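
Illustrative cases (my examples, not from the commit):
```
// Bit 40 cannot be encoded as a sign-extended 32-bit immediate, so under
// optsize an OR of this constant becomes BTS, an AND of its complement
// becomes BTR, and an XOR becomes BTC.
unsigned long long set_bit40(unsigned long long x)    { return x |  (1ull << 40); }
unsigned long long clear_bit40(unsigned long long x)  { return x & ~(1ull << 40); }
unsigned long long toggle_bit40(unsigned long long x) { return x ^  (1ull << 40); }
```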

Differential Revision: https://reviews.llvm.org/D37418

llvm-svn: 325287
2018-02-15 19:57:35 +00:00
Craig Topper 88939fefe8 [X86] Simplify X86DAGToDAGISel::matchBEXTRFromAnd by creating an X86ISD::BEXTR node and calling Select. Add isel patterns to recognize this node.
This removes a bunch of special case code for selecting the immediate and folding loads.

llvm-svn: 324939
2018-02-12 21:18:11 +00:00
Craig Topper b424fafa9f [X86] Don't look for TEST instruction shrinking opportunities when the root node is a X86ISD::SUB.
I don't believe we ever create an X86ISD::SUB with a 0 constant which is what the TEST handling needs. The ternary operator at the end of this code shows up as only going one way in the llvm-cov report from the bots.

llvm-svn: 324865
2018-02-12 03:02:02 +00:00
Craig Topper 3ccbd3f32f [X86] Remove check for X86ISD::AND with no flag users from the TEST instruction immediate shrinking code.
We turn X86ISD::AND with no flag users back to ISD::AND in PreprocessISelDAG.

llvm-svn: 324864
2018-02-12 03:02:01 +00:00
Nirav Dave c8c9d4fe35 [DAG] Make early exit hasPredecessorHelper return true. NFCI.
All uses conservatively assume in the early exit case that it will be a
predecessor. Changing the default removes checking code in all uses.

llvm-svn: 324797
2018-02-10 02:41:22 +00:00
Nirav Dave 27721e8617 [DAG, X86] Improve Dependency analysis when doing multi-node
Instruction Selection

Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies pruning the search when topological
property of NodeId allows.

As part of this, propagate the NodeId-based cutoffs to narrow
hasPredecessorHelper searches.

Reviewers: craig.topper, bogner

Subscribers: llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D41293

llvm-svn: 324359
2018-02-06 16:14:29 +00:00
Craig Topper 57e0643160 [X86] Teach X86DAGToDAGISel::shrinkAndImmediate to preserve upper 32 zeroes of a 64 bit mask.
If the upper 32 bits of a 64-bit mask are all zeros, we have special isel patterns to use a 32-bit and instead of a 64-bit and by relying on the implicit zeroing of 32-bit ops.

This patch teaches shrinkAndImmediate not to break that optimization.
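
The property being preserved, as a sketch (my example, not from the patch):
```
// A 64-bit mask whose upper 32 bits are already zero can be done as a
// 32-bit AND, relying on the implicit zeroing of the upper half;
// shrinkAndImmediate must not set any of bits 63:32 when rewriting it.
unsigned long long keep_mid_bits(unsigned long long x) {
    return x & 0x00000000FFFF0000ull;  // upper 32 bits of the mask are zero
}
```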

Differential Revision: https://reviews.llvm.org/D42899

llvm-svn: 324249
2018-02-05 16:54:07 +00:00
Craig Topper 7e910a9e85 [X86] Turn X86ISD::AND nodes that have no flag users back into ISD::AND just before isel to enable test instruction matching
Summary:
EmitTest sometimes creates X86ISD::AND specifically to hide the AND from DAG combine. But this prevents isel patterns that look for (cmp (and X, Y), 0) from being able to see it. So we end up with an AND and a TEST. The TEST gets removed by compare instruction optimization during the peephole pass.
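
The shape the isel patterns want to see, illustratively (my example):
```
// A compare of an AND against zero; once X86ISD::AND is turned back into
// ISD::AND, this can select a single TEST instruction instead of AND + TEST.
bool any_common_bits(unsigned x, unsigned y) { return (x & y) != 0; }
```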

This patch attempts to fix this by converting X86ISD::AND with no flag users back into ISD::AND during the DAG preprocessing just before isel.

In order to do this correctly I had to make the X86ISD::AND node created by EmitTest in this case really have a flag output, which arguably it should have had anyway so that the number of operands would be consistent for the opcode in all cases. Then I had to modify the ReplaceAllUsesWith to understand that we might be looking at an instruction with 2 outputs. In this case there are no uses to replace since we just created the node, but that's what the code did before, so I kept it working.

Reviewers: spatel, RKSimon, niravd, deadalnix

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42764

llvm-svn: 323982
2018-02-01 17:08:39 +00:00
Amaury Sechet f9a9e9a251 [X86] Generate testl instruction through truncates.
Summary:
This was introduced in D42646 but ended up being reverted because the original implementation was buggy.

Depends on D42646

Reviewers: craig.topper, niravd, spatel, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42741

llvm-svn: 323899
2018-01-31 19:20:06 +00:00
Amaury Sechet f89f188ddb [X86] Avoid using high register trick for test instruction
Summary:
It seems its main effect is to create additional copies when values are in registers that do not support this trick, which increases register pressure and makes the code bigger.

Reviewers: craig.topper, niravd, spatel, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42646

llvm-svn: 323888
2018-01-31 16:48:54 +00:00
Eric Liu 0b69b5ed85 Revert "[X86] Avoid using high register trick for test instruction"
This reverts commit r323690. This causes crash in llc. See the original commit thread for details.

llvm-svn: 323761
2018-01-30 14:18:33 +00:00
Amaury Sechet 015184b79e [X86] Avoid using high register trick for test instruction
Summary:
It seems its main effect is to create additional copies when values are in registers that do not support this trick, which increases register pressure and makes the code bigger.

The main noteworthy regression I was able to observe was a pattern of the type (setcc (trunc (and X, C)), 0) where C is such that it would benefit from the hi register trick. To prevent this, a new pattern is added to materialize such patterns using a 32-bit test. This has the added benefit of working with any constant that is materializable as a 32-bit immediate, not just the ones that can leverage the high register trick, as demonstrated by the test case in test-shrink.ll using the constant 2049.
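
An illustrative version of that shape (my example, assuming the truncation is to i16):
```
// (setcc (trunc (and X, C)), 0) with C = 2049: since 2049 fits in a 32-bit
// immediate, this can be materialized as a 32-bit TEST against the full
// constant rather than via the high-register trick.
bool masked_bits_clear(int x) {
    return static_cast<short>(x & 2049) == 0;
}
```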

Reviewers: craig.topper, niravd, spatel, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42646

llvm-svn: 323690
2018-01-29 20:54:33 +00:00
Craig Topper 15d69739e2 [X86] Remove VPTESTM/VPTESTNM ISD opcodes. Use isel patterns matching cmpm eq/ne with immallzeros.
llvm-svn: 323612
2018-01-28 00:56:30 +00:00
Craig Topper 513d3fa674 [X86] Remove X86ISD::PCMPGTM/PCMPEQM and instead just use X86ISD::PCMPM and pattern match the immediate value during isel.
Legalization is still biased to turn LT compares into GT by swapping operands to avoid needing extra isel patterns to commute.

I'm hoping to remove TESTM/TESTNM next and this should simplify that by making EQ/NE more similar.

llvm-svn: 323604
2018-01-27 20:19:02 +00:00
Craig Topper 8f324bb1a4 [SelectionDAGISel] Add a debug print before call to Select. Adjust where blank lines are printed during isel process to make things more sensibly grouped.
Previously some targets printed their own message at the start of Select to indicate what they were selecting. For the targets that didn't, it means there was no print of the root node before any custom handling in the target executed. So if the target did something custom and never called SelectNodeCommon, no print would be made. For the targets that did print a message in Select, if they didn't custom handle a node SelectNodeCommon would reprint the root node before walking the isel table.

It seems better to just print the message before the call to Select so all targets behave the same. And then remove the root node printing from SelectNodeCommon and just leave a message that says we're starting the table search.

There were also some oddities in blank line behavior. Usually due to a \n after a call to SelectionDAGNode::dump which already inserted a new line.

llvm-svn: 323551
2018-01-26 19:34:20 +00:00
Chandler Carruth c58f2166ab Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.
Summary:
First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processors
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.

On 64-bit x86 ABIs, this is especially easily done in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

We also support a target feature that disables emission of the retpoline
thunk by the compiler to allow for custom thunks if users want them.
These are particularly useful in environments like kernels that
routinely do hot-patching on boot and want to hot-patch their thunk to
different code sequences. They can write this custom thunk and use
`-mretpoline-external-thunk` *in addition* to `-mretpoline`. In this
case, on x86-64 the thunk name must be:
```
  __llvm_external_retpoline_r11
```
or on 32-bit:
```
  __llvm_external_retpoline_eax
  __llvm_external_retpoline_ecx
  __llvm_external_retpoline_edx
  __llvm_external_retpoline_push
```
And the target of the retpoline is passed in the named register, or in
the case of the `push` suffix on the top of the stack via a `pushl`
instruction.

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch `-mretpoline` is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with `-mretpoline` and additionally link the dynamic
executable and all shared libraries with LLD and pass `-z retpolineplt`
(or use similar functionality from some other linker). We strongly
recommend also using `-z now` as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually applying transformations similar to `-mretpoline` to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer

Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D41723

llvm-svn: 323155
2018-01-22 22:05:25 +00:00
Sanjay Patel 74a1eef7c4 [x86] shrink 'and' immediate values by setting the high bits (PR35907)
Try to reverse the constant-shrinking that happens in SimplifyDemandedBits()
for 'and' masks when it results in a smaller sign-extended immediate.
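
A hedged illustration of the reversal (my example, not from the patch):
```
// SimplifyDemandedBits may have cleared the undemanded high bits of a
// 64-bit mask like -256, leaving 0x00000000FFFFFF00, which no longer fits
// a sign-extended 32-bit immediate. Setting the high bits back yields
// 0xFFFFFFFFFFFFFF00 (-256), which encodes as a small sign-extended imm.
long long and_small_imm(long long x) { return x & -256ll; }
```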

We are also able to detect dead 'and' ops here (the mask is all ones). In
that case, we replace and return without selecting the 'and'.

Other targets might want to share some of this logic by enabling this under a
target hook, but I didn't see diffs for simple cases with PowerPC or AArch64,
so they may already have some specialized logic for this kind of thing or have
different needs.

This should solve PR35907:
https://bugs.llvm.org/show_bug.cgi?id=35907

Differential Revision: https://reviews.llvm.org/D42088

llvm-svn: 322957
2018-01-19 16:37:25 +00:00
Nirav Dave 72d32f24f5 [X86] Extend load-op-store fusion merge to ADC/SBB.
Summary: Add handling of EFLAG input to X86 Load-op-store fusion checking.

Reviewers: craig.topper, RKSimon

Subscribers: llvm-commits, hiraditya

Differential Revision: https://reviews.llvm.org/D42128

llvm-svn: 322952
2018-01-19 15:37:57 +00:00
Craig Topper af4eb17223 [SelectionDAG][X86] Explicitly store the scale in the gather/scatter ISD nodes
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.
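
A sketch of the old inference being removed (a plain C++ paraphrase of mine, not the actual code):
```
// Old behavior: the scale was not stored on the node; isel guessed it from
// whether the base was a constant zero, else used the element store size.
unsigned inferScale(bool baseIsConstantZero, unsigned eltSizeInBytes) {
    return baseIsConstantZero ? 1 : eltSizeInBytes;
}
```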

Most of this patch is just making sure we copy the scale around everywhere.

Differential Revision: https://reviews.llvm.org/D40055

llvm-svn: 322210
2018-01-10 19:16:05 +00:00
Craig Topper d58c165545 [X86] Make v2i1 and v4i1 legal types without VLX
Summary:
There are a few oddities that occur due to v1i1, v8i1, v16i1 being legal without v2i1 and v4i1 being legal when we don't have VLX. Particularly during legalization of v2i32/v4i32/v2i64/v4i64 masked gather/scatter/load/store. We end up promoting the mask argument to these during type legalization and then have to widen the promoted type to v8iX/v16iX and truncate it to get the element size back down to v8i1/v16i1 to use a 512-bit operation. Since we need to fill the upper bits of the mask we have to fill with 0s at the promoted type.

It would be better if we could just have the v2i1/v4i1 types as legal so they don't undergo any promotion. Then we can just widen with 0s directly in a k register. There are no real v4i1/v2i1 instructions anyway. Everything is done on a larger register anyway.

This also fixes an issue where we couldn't implement a masked vextractf32x4 from zmm to xmm properly.

We now have to support widening more compares to 512-bit to get a mask result out so new tablegen patterns got added.

I had to hack the legalizer for widening the operand of a setcc a bit so it didn't try to create a setcc returning v4i32, extract from it, then try to promote it using a sign extend to v2i1. Now we create the setcc with v4i1 if the original setcc's result type is v2i1. Then extract that and don't sign extend it at all.

There's definitely room for improvement with some follow up patches.

Reviewers: RKSimon, zvi, guyblank

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41560

llvm-svn: 321967
2018-01-07 18:20:37 +00:00
Craig Topper eff84ed204 [X86] Improve the printing of address mode during isel matching.
Fix some inconsistent new line behavior and only print the FrameIndex when the address mode is a FrameIndexBase addressing mode.

llvm-svn: 321368
2017-12-22 17:18:10 +00:00
Matthias Braun f1caa2833f MachineFunction: Return reference from getFunction(); NFC
The Function can never be nullptr so we can return a reference.

llvm-svn: 320884
2017-12-15 22:22:58 +00:00
Michael Zolotukhin ad24af7f58 Remove redundant includes from lib/Target/X86.
llvm-svn: 320636
2017-12-13 21:31:19 +00:00
Matt Morehouse 9e658c974b Revert "[X86] Improvement in CodeGen instruction selection for LEAs."
This reverts r319543, due to ASan bot breakage.

llvm-svn: 319591
2017-12-01 22:20:26 +00:00
Jatin Bhateja 328199ec26 [X86] Improvement in CodeGen instruction selection for LEAs.
Summary:
1/  Operand folding during complex pattern matching for LEAs has been extended, such that it promotes Scale to
     accommodate a similar operand appearing in the DAG, e.g.
                 T1 = A + B
                 T2 = T1 + 10
                 T3 = T2 + A
    For the above DAG rooted at T3, X86AddressMode will now look like
                Base = B , Index = A , Scale = 2 , Disp = 10

2/  During OptimizeLEAPass, further down the pipeline, factorization is now performed over LEAs so that if there is an opportunity
     then complex LEAs (having 3 operands) can be factored out, e.g.
                 leal 1(%rax,%rcx,1), %rdx
                 leal 1(%rax,%rcx,2), %rcx
     will be factored as follows
                 leal 1(%rax,%rcx,1), %rdx
                 leal (%rdx,%rcx)   , %edx

3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops, thus avoiding creation of any complex LEAs within a loop.

4/ Simplify LEA converts lea (BASE,1,INDEX,0) --> add (BASE, INDEX), which offers better throughput.

PR32755 will be taken care of by this patch.
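
A minimal IR sketch of the DAG shape from point 1 (function and value names are illustrative):

```
define i64 @lea_example(i64 %A, i64 %B) {
  %T1 = add i64 %A, %B    ; T1 = A + B
  %T2 = add i64 %T1, 10   ; T2 = T1 + 10
  %T3 = add i64 %T2, %A   ; T3 = T2 + A; A appears twice, so address matching
  ret i64 %T3             ; can promote it to the index with Scale = 2
}
```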

Previous patch revisions : r313343 , r314886

Reviewers: lsaba, RKSimon, craig.topper, qcolombet, jmolloy, jbhateja

Reviewed By: lsaba, RKSimon, jbhateja

Subscribers: jmolloy, spatel, igorb, llvm-commits

Differential Revision: https://reviews.llvm.org/D35014

llvm-svn: 319543
2017-12-01 14:07:38 +00:00
Craig Topper f31b0b850b [X86] Teach isel that X86ISD::CMPM_RND zeros the upper bits of the mask register.
llvm-svn: 318933
2017-11-23 18:41:21 +00:00
Craig Topper ee74044f93 [X86] Add an X86ISD::MSCATTER node for consistency with the X86ISD::MGATHER.
This way, the fact that X86 needs an explicit mask output is no longer part of the type constraint for ISD::MSCATTER.

This also gives the X86ISD::MGATHER/MSCATTER nodes a common base class simplifying the address selection code in X86ISelDAGToDAG.cpp

llvm-svn: 318823
2017-11-22 08:10:54 +00:00
Craig Topper c314f461dd [X86] Allow X86ISD::Wrapper to be folded into the base of gather/scatter address
If the base of our gather corresponds to something contained in X86ISD::Wrapper we should be able to fold it into the address.

This patch refactors some of the address matching to more fully use the X86ISelAddressMode struct and the getAddressOperands helper. A new helper function matchVectorAddress is added to call matchWrapper or fall back to matchAddressBase.

We should also be able to support constant offsets from a wrapper, but I'll look into that in a future patch. We may even be able to completely reuse matchAddress here, but I wanted to start simple and work up to it.
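
As an illustrative sketch (typed-pointer IR; the global and all names are made up), a gather from a global array has its base address wrapped in an X86ISD::Wrapper, which can now fold into the gather's address:

```
@table = global [16 x i32] zeroinitializer

declare <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*>, i32, <4 x i1>, <4 x i32>)

define <4 x i32> @gather_global(<4 x i64> %idx, <4 x i1> %m, <4 x i32> %pt) {
  ; The address of @table reaches the gather through an X86ISD::Wrapper node;
  ; matchVectorAddress can now fold it into the base of the address.
  %ptrs = getelementptr [16 x i32], [16 x i32]* @table, i64 0, <4 x i64> %idx
  %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> %ptrs, i32 4, <4 x i1> %m, <4 x i32> %pt)
  ret <4 x i32> %v
}
```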

Differential Revision: https://reviews.llvm.org/D39927

llvm-svn: 318057
2017-11-13 17:53:59 +00:00
Craig Topper bb001c6ddc [X86] Merge the template method selectAddrOfGatherScatterNode into selectVectorAddr. NFCI
Just need to initialize a couple variables differently based on the node type. No need for a whole separate template method.

llvm-svn: 317915
2017-11-10 19:26:04 +00:00
Craig Topper 61f81f9637 [X86] Preserve memory refs when folding loads into divides.
This is similar to what we already do for multiplies. Without this we can't unfold and hoist an invariant load.
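
A minimal sketch (illustrative names) of a divide with a foldable load; preserving the memory operands on the folded instruction is what later lets the load be unfolded and hoisted when it is loop-invariant:

```
define i32 @div_load(i32 %a, i32* %p) {
  ; The load can be folded into the divide instruction; keeping its memory
  ; refs lets later passes unfold and hoist it out of a loop if invariant.
  %b = load i32, i32* %p
  %q = udiv i32 %a, %b
  ret i32 %q
}
```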

llvm-svn: 317732
2017-11-08 22:26:39 +00:00
Craig Topper 55029d811f [X86] Remove an if check on the result of a cast. NFC
cast takes a non-null input and produces a non-null output. So this if can never fail.

llvm-svn: 317731
2017-11-08 22:26:37 +00:00
Craig Topper 78a770402a [X86] Correct the implementation of BEXTR load folding to use the shift as the parent node and pass a separate root.
We were calling tryFoldLoad with the 'and' node as both the root and the parent node of the load. But the parent of the load should be the shift that precedes the 'and', while the 'and' node is correctly the root node.
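
For reference, a sketch of the node chain in question (illustrative constants): the load feeds the shift, and the shift feeds the 'and' that roots the match:

```
define i32 @bextr_load(i32* %p) {
  ; DAG shape: (and (srl (load %p), 4), 0xFFF). The load's parent is the
  ; srl, while the and is the root node of the BEXTR match.
  %x = load i32, i32* %p
  %s = lshr i32 %x, 4
  %m = and i32 %s, 4095
  ret i32 %m
}
```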

To fix this I had to make tryFoldLoad take a separate use and root input. I've added a convenience version with the old signature to avoid updating the other call sites.

llvm-svn: 317720
2017-11-08 20:17:33 +00:00
Uriel Korach bb86686a8b [X86][AVX512] Improve lowering of AVX512 test intrinsics
Added TESTM and TESTNM to the list of instructions that already zero unused upper bits
and do not need the redundant shift-left and shift-right instructions afterwards.
Added a pattern for TESTM and TESTNM in iselLowering, so now icmp(ne, and(X,Y), 0) folds into TESTM
and icmp(eq, and(X,Y), 0) folds into TESTNM.
This commit is a preparation for lowering the test and testn X86 intrinsics to IR.
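
A rough IR sketch of the two forms (AVX-512; names illustrative):

```
define i16 @testm_example(<16 x i32> %x, <16 x i32> %y) {
  ; icmp ne (and X, Y), 0 can select to VPTESTMD, whose mask-register result
  ; already has its unused upper bits zeroed; icmp eq would map to VPTESTNMD.
  %a = and <16 x i32> %x, %y
  %c = icmp ne <16 x i32> %a, zeroinitializer
  %b = bitcast <16 x i1> %c to i16
  ret i16 %b
}
```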

Differential Revision: https://reviews.llvm.org/D38732

llvm-svn: 317465
2017-11-06 09:22:38 +00:00
Uriel Korach eb47d95d52 [X86] Replace duplicate function call with variable. NFC
Change from:
if (N->getOperand(0).getValueType() == MVT::v8i32 ||
    N->getOperand(0).getValueType() == MVT::v8f32)

to:
EVT OpVT = N->getOperand(0).getValueType();
if (OpVT == MVT::v8i32 || OpVT == MVT::v8f32)

Change-Id: I5a105f8710b73a828e6cfcd55fac2eae6153ce25
llvm-svn: 317464
2017-11-06 08:32:45 +00:00
Craig Topper 40f0584f08 [X86] Fix a mistake in the X86ISelDAGToDAG.cpp code for MUL8r/IMUL8r.
I think this code is unreachable due to some promotions that occur elsewhere. I'll look into that to be sure, but for now I thought I should at least fix the obvious typo.

llvm-svn: 316840
2017-10-28 19:56:57 +00:00
Craig Topper b8d7d4d683 [X86] Improve handling of UDIVREM8_ZEXT_HREG/SDIVREM8_SEXT_HREG to support 64-bit extensions.
If the extend type is 64 bits, emit a 32-bit -> 64-bit extend after the UDIVREM8_ZEXT_HREG/SDIVREM8_SEXT_HREG operation.

This gives a shorter encoding for the second extend in the sext case, and allows us to completely remove the second extend in the zext case.
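
A minimal sketch of the zext case (names illustrative):

```
define i64 @udiv8_example(i8 %a, i8 %b) {
  ; The 8-bit divide leaves its result in AL. With the extend emitted as a
  ; separate 32-bit -> 64-bit step, the zext case needs no second extend at
  ; all, since a 32-bit result is implicitly zero-extended to 64 bits.
  %q = udiv i8 %a, %b
  %z = zext i8 %q to i64
  ret i64 %z
}
```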

This also adds known bits and num sign bits support for UDIVREM8_ZEXT_HREG/SDIVREM8_SEXT_HREG.

Differential Revision: https://reviews.llvm.org/D38275

llvm-svn: 316702
2017-10-26 21:12:03 +00:00
Aaron Ballman 615eb47035 Reverting r315590; it did not include changes for llvm-tblgen, which is causing link errors for several people.
Error LNK2019 unresolved external symbol "public: void __cdecl `anonymous namespace'::MatchableInfo::dump(void)const " (?dump@MatchableInfo@?A0xf4f1c304@@QEBAXXZ) referenced in function "public: void __cdecl `anonymous namespace'::AsmMatcherEmitter::run(class llvm::raw_ostream &)" (?run@AsmMatcherEmitter@?A0xf4f1c304@@QEAAXAEAVraw_ostream@llvm@@@Z) llvm-tblgen D:\llvm\2017\utils\TableGen\AsmMatcherEmitter.obj 1

llvm-svn: 315854
2017-10-15 14:32:27 +00:00
Don Hinton 3e0199f7eb [dump] Remove NDEBUG from test to enable dump methods [NFC]
Summary:
Add LLVM_FORCE_ENABLE_DUMP cmake option, and use it along with
LLVM_ENABLE_ASSERTIONS to set LLVM_ENABLE_DUMP.

Remove NDEBUG and only use LLVM_ENABLE_DUMP to enable dump methods.

Move definition of LLVM_ENABLE_DUMP from config.h to llvm-config.h so
it'll be picked up by public headers.

Differential Revision: https://reviews.llvm.org/D38406

llvm-svn: 315590
2017-10-12 16:16:06 +00:00
Craig Topper 9563cab961 [X86] Simplify some code in getInsertVINSERTImmediate and getExtractVEXTRACTImmediate. NFC
Replace one of the divides with a multiply.

llvm-svn: 315162
2017-10-08 01:33:42 +00:00
Hans Wennborg 2a6c9adb2f Revert r314886 "[X86] Improvement in CodeGen instruction selection for LEAs (re-applying post required revision changes.)"
It broke the Chromium / SQLite build; see PR34830.

> Summary:
>    1/  Operand folding during complex pattern matching for LEAs has been
>        extended, such that it promotes Scale to accommodate a similar operand
>        appearing in the DAG.
>        e.g.
>          T1 = A + B
>          T2 = T1 + 10
>          T3 = T2 + A
>        For the above DAG rooted at T3, X86AddressMode will now look like
>          Base = B , Index = A , Scale = 2 , Disp = 10
>
>    2/  During OptimizeLEAPass down the pipeline factorization is now performed over LEAs
>        so that if there is an opportunity then complex LEAs (having 3 operands)
>        could be factored out.
>        e.g.
>          leal 1(%rax,%rcx,1), %rdx
>          leal 1(%rax,%rcx,2), %rcx
>        will be factored as follows
>          leal 1(%rax,%rcx,1), %rdx
>          leal (%rdx,%rcx)   , %edx
>
>    3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops,
>       thus avoiding creation of any complex LEAs within a loop.
>
> Reviewers: lsaba, RKSimon, craig.topper, qcolombet, jmolloy
>
> Reviewed By: lsaba
>
> Subscribers: jmolloy, spatel, igorb, llvm-commits
>
>     Differential Revision: https://reviews.llvm.org/D35014

llvm-svn: 314919
2017-10-04 17:54:06 +00:00
Jatin Bhateja 3c29bacd43 [X86] Improvement in CodeGen instruction selection for LEAs (re-applying post required revision changes.)
Summary:
   1/  Operand folding during complex pattern matching for LEAs has been
       extended, such that it promotes Scale to accommodate a similar operand
       appearing in the DAG.
       e.g.
         T1 = A + B
         T2 = T1 + 10
         T3 = T2 + A
       For the above DAG rooted at T3, X86AddressMode will now look like
         Base = B , Index = A , Scale = 2 , Disp = 10

   2/  During OptimizeLEAPass down the pipeline factorization is now performed over LEAs
       so that if there is an opportunity then complex LEAs (having 3 operands)
       could be factored out.
       e.g.
         leal 1(%rax,%rcx,1), %rdx
         leal 1(%rax,%rcx,2), %rcx
       will be factored as follows
         leal 1(%rax,%rcx,1), %rdx
         leal (%rdx,%rcx)   , %edx

   3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops,
      thus avoiding creation of any complex LEAs within a loop.

Reviewers: lsaba, RKSimon, craig.topper, qcolombet, jmolloy

Reviewed By: lsaba

Subscribers: jmolloy, spatel, igorb, llvm-commits

    Differential Revision: https://reviews.llvm.org/D35014

llvm-svn: 314886
2017-10-04 09:02:10 +00:00
Craig Topper 6255c7b675 [X86] Don't select (cmp (and, imm), 0) to testw
Summary:
X86ISelDAGToDAG tries to analyze ANDs compared with 0 to optimize to narrower immediates using subregisters.

I don't think we should be optimizing to 16-bit test instructions. It goes against our normal behavior of promoting i16 operations to i32. It only saves one byte due to the need to add a 0x66 prefix. I think it would also be subject to a length-changing prefix penalty in the decoders on Intel CPUs.
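
A small sketch of the kind of compare in question (illustrative constant):

```
define i1 @test16_example(i32 %x) {
  ; (cmp (and %x, 0x1200), 0): the immediate fits in 16 bits, but selecting
  ; testw would require a 0x66 prefix; a plain 32-bit testl is preferable.
  %a = and i32 %x, 4608
  %c = icmp eq i32 %a, 0
  ret i1 %c
}
```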

Reviewers: RKSimon, zvi, spatel

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38273

llvm-svn: 314474
2017-09-28 23:35:36 +00:00
Craig Topper fd6b8a67fb [X86] Remove dead code from X86ISelDAGToDAG.cpp multiply handling
Summary:
Lowering never creates X86ISD::UMUL for 8-bit types; X86ISD::UMUL8 is used instead. If X86ISD::UMUL were ever used with an 8-bit type, it would crash.

DAGCombiner replaces UMUL_LOHI/SMUL_LOHI with a wider MUL and a shift if the type twice as wide is legal, so we should never see i8 UMUL_LOHI/SMUL_LOHI. In fact, I think there was a bug in part of the i8 code. The same is true for i16, though without the bug.

Reviewers: RKSimon, spatel, zvi

Reviewed By: zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38276

llvm-svn: 314430
2017-09-28 16:56:36 +00:00
Craig Topper ba3cc2e0da [AVX-512] Replace large number of explicit patterns that check for insert_subvector with zero after masked compares with fewer patterns with predicate
This replaces the large number of patterns that handle every possible case of zeroing after a masked compare with a few simpler patterns that use a predicate to check for a masked compare producer.

This is similar to what we do for detecting free GR32->GR64 zero extends and free xmm->ymm/zmm zero extends.

This shrinks the isel table from ~590k to ~531k. This is a roughly 10% reduction in size.

Differential Revision: https://reviews.llvm.org/D38217

llvm-svn: 314133
2017-09-25 18:43:13 +00:00
Craig Topper 092c2f4357 [X86] Move the getInsertVINSERTImmediate and getExtractVEXTRACTImmediate helper functions over to X86ISelDAGToDAG.cpp
Redefine them to call getI8Imm and return that directly.

llvm-svn: 314059
2017-09-23 05:34:07 +00:00
Craig Topper 75370b9b49 [X86] Convert X86ISD::SELECT to ISD::VSELECT just before instruction selection to avoid duplicate patterns
Similar to what we do for X86ISD::SHRUNKBLEND, just turn X86ISD::SELECT into ISD::VSELECT. This allows us to remove the duplicated TRUNC patterns.

Differential Revision: https://reviews.llvm.org/D38022

llvm-svn: 313644
2017-09-19 17:19:45 +00:00
Craig Topper e92327e236 [X86] Don't emit COPY_TO_REG to ABCD registers before EXTRACT_SUBREG of sub_8bit
This is similar to D37843, but for sub_8bit. This fixes all of the patterns except for the 2 that emit only an EXTRACT_SUBREG. That causes a verifier error with global isel because global isel doesn't know to issue the ABCD when doing this extract on 32-bit targets.

Differential Revision: https://reviews.llvm.org/D37890

llvm-svn: 313558
2017-09-18 19:21:21 +00:00
Craig Topper b2155159a8 [X86] Don't emit COPY_TO_REG to ABCD registers before EXTRACT_SUBREG of sub_8bit_hi
I'm pretty sure that InstrEmitter::EmitSubregNode will take care of this itself by calling ConstrainForSubReg which in turn calls TRI->getSubClassWithSubReg.

I think Jakob Stoklund Olesen alluded to this in his commit message for r141207 which added the code to EmitSubregNode.

Differential Revision: https://reviews.llvm.org/D37843

llvm-svn: 313557
2017-09-18 19:21:19 +00:00
Hans Wennborg 534bfbd3ba Revert r313343 "[X86] PR32755 : Improvement in CodeGen instruction selection for LEAs."
This caused PR34629: asserts firing when building Chromium. It also broke some
buildbots building test-suite as reported on the commit thread.

> Summary:
>    1/  Operand folding during complex pattern matching for LEAs has been
>        extended, such that it promotes Scale to accommodate a similar operand
>        appearing in the DAG.
>        e.g.
>           T1 = A + B
>           T2 = T1 + 10
>           T3 = T2 + A
>        For the above DAG rooted at T3, X86AddressMode will now look like
>           Base = B , Index = A , Scale = 2 , Disp = 10
>
>    2/  During OptimizeLEAPass down the pipeline factorization is now performed over LEAs
>        so that if there is an opportunity then complex LEAs (having 3 operands)
>        could be factored out.
>        e.g.
>           leal 1(%rax,%rcx,1), %rdx
>           leal 1(%rax,%rcx,2), %rcx
>        will be factored as follows
>           leal 1(%rax,%rcx,1), %rdx
>           leal (%rdx,%rcx)   , %edx
>
>    3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops,
>       thus avoiding creation of any complex LEAs within a loop.
>
> Reviewers: lsaba, RKSimon, craig.topper, qcolombet
>
> Reviewed By: lsaba
>
> Subscribers: spatel, igorb, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D35014

llvm-svn: 313376
2017-09-15 18:40:26 +00:00
Jatin Bhateja 908c8b37c2 [X86] PR32755 : Improvement in CodeGen instruction selection for LEAs.
Summary:
   1/  Operand folding during complex pattern matching for LEAs has been
       extended, such that it promotes Scale to accommodate a similar operand
       appearing in the DAG.
       e.g.
          T1 = A + B
          T2 = T1 + 10
          T3 = T2 + A
       For the above DAG rooted at T3, X86AddressMode will now look like
          Base = B , Index = A , Scale = 2 , Disp = 10

   2/  During OptimizeLEAPass down the pipeline factorization is now performed over LEAs
       so that if there is an opportunity then complex LEAs (having 3 operands)
       could be factored out.
       e.g.
          leal 1(%rax,%rcx,1), %rdx
          leal 1(%rax,%rcx,2), %rcx
       will be factored as follows
          leal 1(%rax,%rcx,1), %rdx
          leal (%rdx,%rcx)   , %edx

   3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops,
      thus avoiding creation of any complex LEAs within a loop.

Reviewers: lsaba, RKSimon, craig.topper, qcolombet

Reviewed By: lsaba

Subscribers: spatel, igorb, llvm-commits

Differential Revision: https://reviews.llvm.org/D35014

llvm-svn: 313343
2017-09-15 05:29:51 +00:00
Craig Topper 2b6bfda561 [X86] Make sure we emit a SUBREG_TO_REG after the MOV32ri when creating a BEXTR64rr instruction from a shift/and pair.
Fixes PR34589.

llvm-svn: 313126
2017-09-13 07:53:21 +00:00
Craig Topper 0a3bcebcc2 [X86] Use isUInt<32> to simplify some code. NFC
llvm-svn: 313112
2017-09-13 02:29:59 +00:00
Craig Topper 958106d0f1 [X86] Move matching of (and (srl/sra, C), (1<<C) - 1) to BEXTR/BEXTRI instruction to custom isel
Recognizing this pattern during DAG combine hides information about the 'and' and the shift from other combines. I think it should be recognized at isel so it's as late as possible. But it can't be done with table-based isel because you need to be able to look at both immediates. This patch moves it to custom isel in X86ISelDAGToDAG.cpp.
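
A sketch of the shape now matched during custom isel (illustrative constants):

```
define i32 @bextr_example(i32 %x) {
  ; (and (srl %x, 5), (1 << 10) - 1): extract 10 bits starting at bit 5.
  ; With BMI this can select to BEXTR, but the match has to inspect both
  ; immediates at once, hence custom isel rather than a tablegen pattern.
  %s = lshr i32 %x, 5
  %m = and i32 %s, 1023
  ret i32 %m
}
```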

This does break a couple of tests in tbm_patterns because we are now emitting an and_flag node or (cmp and, 0) that we don't recognize yet. We already had this problem for several other TBM patterns, so I think this is fine and we can address all of them together.

I've also fixed a bug where the combine to BEXTR was preventing us from using a trick of zero extending AH to handle extracts of bits 15:8. We might still want to use BEXTR if it enables load folding. But honestly I hope we narrow the load instead before we get to isel.

I think we should probably also support matching BEXTR from (srl/sra (and mask << C), C). But that should be a different patch.

Differential Revision: https://reviews.llvm.org/D37592

llvm-svn: 313054
2017-09-12 17:40:25 +00:00
Craig Topper 6bed9de3d5 [X86] Call removeDeadNode when we're done doing custom isel for mul, div and test
Summary:
Once we've done our custom isel for these nodes, I think we should be calling removeDeadNode to prune them out of the DAG. Table-driven isel ultimately either calls morphNodeTo, which modifies a node in place and doesn't leave dead nodes, or it emits new nodes and then calls removeDeadNode as part of Opc_CompleteMatch.

If you run a simple multiply test case like this through llc with -debug, you'll see a umul_lohi node get printed as part of the dump when Instruction Selection ends.

```
define i64 @foo(i64 %a, i64 %b) local_unnamed_addr #0 {
entry:
  %conv = zext i64 %a to i128
  %conv1 = zext i64 %b to i128
  %mul = mul nuw nsw i128 %conv1, %conv
  %shr = lshr i128 %mul, 64
  %conv2 = trunc i128 %shr to i64
  ret i64 %conv2
}
```

Reviewers: RKSimon, spatel, zvi, guyblank, niravd

Reviewed By: niravd

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37547

llvm-svn: 312857
2017-09-09 05:57:20 +00:00
Craig Topper 63c5047a4e [X86] Use ReplaceNode instead of ReplaceUses when converting X86ISD::SHRUNKBLEND to ISD::VSELECT during isel.
This ensures that the SHRUNKBLEND node gets erased immediately.

llvm-svn: 312856
2017-09-09 05:57:19 +00:00
Chandler Carruth 38e2b506db [x86] Fix GCC pedantic warnings about default arguments for lambdas.
llvm-svn: 312809
2017-09-08 18:23:42 +00:00
Chandler Carruth acbcf06f03 [x86] Flesh out the custom ISel for RMW arithmetic ops with used flags to
cover the bitwise operators.

Nothing really exciting here; this just stamps out the rest of the core
operations that can RMW memory and set flags.
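
A minimal sketch of an RMW bitwise op whose flags feed a branch (names illustrative):

```
define void @rmw_and_example(i32* %p, i32 %x) {
entry:
  ; Can select to an `andl %esi, (%rdi)` whose flags the branch reuses,
  ; instead of a separate load/and/store plus test.
  %v = load i32, i32* %p
  %n = and i32 %v, %x
  store i32 %n, i32* %p
  %z = icmp eq i32 %n, 0
  br i1 %z, label %zero, label %nonzero

zero:
  ret void

nonzero:
  ret void
}
```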

Still not implemented here: ADC, SBB. Those will require more
interesting logic to channel the flags *in*, and I'm not currently
planning to try to tackle that. It might be interesting for someone who
wants to improve our code generation for bignum implementations.

Differential Revision: https://reviews.llvm.org/D37141

llvm-svn: 312768
2017-09-08 00:17:12 +00:00