Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.
This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.
The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.
llvm-svn: 356064
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700
The bug was introduced recently here:
rL355741
This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8).
llvm-svn: 356053
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.
This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.
Differential Revision: https://reviews.llvm.org/D58888
llvm-svn: 356043
Add break statements in Object/ELF.cpp since the code should consider the
generic tags for Hexagon, MIPS, and PPC. Add a test (copied from llvm-readobj)
to show that this works correctly (earlier versions of this patch would have
asserted).
The warnings in X86ELFObjectWriter.cpp are actually false-positives since
the nested switch() handles all possible values and returns in all cases.
Make this explicit by adding llvm_unreachable's.
Differential Revision: https://reviews.llvm.org/D58837
llvm-svn: 356037
A faulting_op is one that has specified behavior when a fault occurs, generally redirecting control flow to another location. This change just adds a comment to the assembly output which makes it both human readable, and machine checkable w/o having to parse the FaultMap section. This is used to split a test file into two parts, so that I can (in a near future commit) easily extend the test file to demonstrate another case.
llvm-svn: 355982
ProcFeatures was a class that just concatenated two feature lists together and gave it a name. We used it to inherit features between CPUs.
ProcModel took a two CPU feature lists and concatenated them before deferring to ProcessorModel. This was to allow inherited features and specific features to be passed to each CPU.
Both of these allowed for only very rigid CPU inheritance rules.
With this patch we now store all of the lists we were using for inheritance in one object and do any list oncatenation we want there. Then we just pass whatever list we want from this class into the ProcessorModel class for each CPU.
Hopefully this gives us more flexibility to build up feature lists in whatever ways we think make sense. Perhaps untangling ISA flags and tuning flags.
I've only touched the CPUs that were directly affected by the removal of the ProcModel and ProcFeatures classes. We should move more of the feature lists into ProcessorFeatures.
llvm-svn: 355872
AMDGPU target run out of Subtarget feature flags hitting the limit of 64.
AssemblerPredicates uses at most uint64_t for their representation.
At the same time CodeGen has exhausted this a long time ago and switched
to a FeatureBitset with the current limit of 192 bits.
This patch completes transition to the bitset for feature bits extending
it to asm matcher and MC code emitter.
Differential Revision: https://reviews.llvm.org/D59002
llvm-svn: 355839
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.
This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.
llvm-svn: 355799
We had patterns using X86ISD::SCALAR_SINT_TO_FP_RND/SCALAR_UINT_TO_FP_RND for
these instructions. There's nothing to round. Instead, we use a regular
sint_to_fp/uint_to_fp and a movsd as the pattern for these.
llvm-svn: 355796
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.
With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.
llvm-svn: 355789
Includes a fix to emit a CheckOpcode for build_vector when immAllZerosV/immAllOnesV is used as a pattern root. This means it can't be used to look through bitcasts when used as a root, but that's probably ok. This extra CheckOpcode will ensure that the first match in the isel table will be a SwitchOpcode which is needed by the caching optimization in the ISel Matcher.
Original commit message:
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355784
An extension of D58282 noted in PR39665:
https://bugs.llvm.org/show_bug.cgi?id=39665
This doesn't answer the request to use movmsk, but that's an
independent problem. We need this and probably still need
scalarization of FP selects because we can't do that as a
target-independent transform (although it seems likely that
targets besides x86 should have this transform).
llvm-svn: 355741
We were just checking pointer size and type primitive size. But this caused unintended things like vectors of half being accepted by masked load/store.
For FP we now explicitly check for only double and float.
For pointers we now let any pointer through. Trusting that only 32 and 64 would be used to generate assembly.
We only check bitwidth after checking that the type is an integer.
llvm-svn: 355667
Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not.
The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that.
RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd.
Differential Revision: https://reviews.llvm.org/D59077
llvm-svn: 355636
Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms.
Differential Revision: https://reviews.llvm.org/D59058
llvm-svn: 355635
Summary:
ShadowCallStack on x86_64 suffered from the same racy security issues as
Return Flow Guard and had performance overhead as high as 13% depending
on the benchmark. x86_64 ShadowCallStack was always an experimental
feature and never shipped a runtime required to support it, as such
there are no expected downstream users.
Reviewers: pcc
Reviewed By: pcc
Subscribers: mgorny, javed.absar, hiraditya, jdoerfert, cfe-commits, #sanitizers, llvm-commits
Tags: #clang, #sanitizers, #llvm
Differential Revision: https://reviews.llvm.org/D59034
llvm-svn: 355624
Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well.
Requested by @spatel on D59006
llvm-svn: 355533
As noticed on D58965
DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point.
Differential Revision: https://reviews.llvm.org/D58974
llvm-svn: 355494
This allows us to use an 8-bit sign extended immediate instead of a 16 or 32 bit immediate.
Also do similar for 0x80000000 with 64-bit adds to avoid having to use a movabsq.
llvm-svn: 355485
128 won't fit in a sign extended 8-bit immediate, but we can negate it to -128 and use the other operation. This results in a shorter encoding since the move would have used 16 or 32 bits for the immediate.
llvm-svn: 355484
This caused the first matcher in the isel table for many targets to Opc_Scope instead of Opc_SwitchOpcode. This leads to a significant increase in isel match failures.
llvm-svn: 355433
The variable X86DomainReassignment::EnclosedEdges is used to store registers that have been enclosed in some closure, so those registers will be ignored when create new closures. But there is no registers has ever been put into this set, so a single register can be enclosed in multiple closures, it significantly increase compile time.
This patch adds a register into EnclosedEdges when it is enclosed into a closure.
Differential Revision: https://reviews.llvm.org/D58646
llvm-svn: 355430
We already do this for 16/32/64 as well as 8-bit add with register/immediate. Might as well do it for 8-bit INC/DEC too.
Differential Revision: https://reviews.llvm.org/D58869
llvm-svn: 355424
We already support 8-bits adds in convertToThreeAddress. But we can also support 8-bit OR if the bits are disjoint. We already do this for 16/32/64.
Differential Revision: https://reviews.llvm.org/D58863
llvm-svn: 355423
We have quite a few cases of using FP instructions for integer operations when only AVX1 is available. Then we switch to integer instructions with AVX2. In a lot of these cases execution domain fixing will take care of turning FP instructions into integer if its profitable.
With this patch we just keep on using the FP instructions even with AVX2. I've only handled some cases that don't require messing with patterns that are defined in the instruction definition. Those will require more subtle multiclass work possibly involving null_frag, hasSideEffects = 0, etc.
Differential Revision: https://reviews.llvm.org/D58470
llvm-svn: 355361
X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo
instructions without accounting for debug intrinsics. This leads to different
codegen with and without option -g, if a DBG_VALUE instruction lands in the
middle of several lowered selects.
Work around this by skipping over debug instructions when looking for CMOV
sequences, and sinking those debug insts into the EmitLoweredSelect sunk block.
This might slightly shift where variables appear in the instruction sequence,
but won't re-order assignments.
Differential Revision: https://reviews.llvm.org/D58672
llvm-svn: 355307
We were using VPBLENDW for v2i64 and VBLENDPD for v4i64. VPBLENDD has better throughput than VPBLENDW on some CPUs so it makes sense to use it when possible. VBLENDPD will probably become VBLENDD during execution domain fixing, but we might as well use integer in isel while we can.
This should work around some issues with the domain fixing pass prefering PBLENDW when we start with PBLENDW. There may still be some v8i16 cases that could use PBLENDD.
llvm-svn: 355281
Summary:
This extends the variety of pattern that can generate a SHLD instead of using two shifts.
This fixes a regression that would be introduced by D57367 or D33587
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57389
llvm-svn: 355260
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355224
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the tests diffs:
1. Memory loads may get scalarized resulting in more efficient code.
2. Memory stores may get scalarized resulting in more efficient code.
3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.
The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.
Differential Revision: https://reviews.llvm.org/D58282
llvm-svn: 355130
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop.
Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.
Differential Revision: https://reviews.llvm.org/D58705
llvm-svn: 355116
Summary:
The description of KnownBits::zext() and
KnownBits::zextOrTrunc() has confusingly been telling
that the operation is equivalent to zero extending the
value we're tracking. That has not been true, instead
the user has been forced to explicitly set the extended
bits as known zero afterwards.
This patch adds a second argument to KnownBits::zext()
and KnownBits::zextOrTrunc() to control if the extended
bits should be considered as known zero or as unknown.
Reviewers: craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58650
llvm-svn: 355099
These allows use to use the same set of isel patterns for sra/srl/shl which are undefined for out of range shifts and intrinsic shifts which aren't undefined.
Doing this late allows DAG combine to have every opportunity to optimize the sra/srl/shl nodes.
This removes about 7000 bytes from the isel table and simplies the td files.
llvm-svn: 355071
A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node.
I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered.
Differential Revision: https://reviews.llvm.org/D58583
llvm-svn: 355015
Original implementation can't correctly handle __m256 and __m512 types
passed by reference through stack. This patch fixes it.
Patch by Wei Xiao!
Differential Revision: https://reviews.llvm.org/D57643
llvm-svn: 354921
This patch enables the following
1) AMD family 17h "znver2" tune flag (-march, -mcpu).
2) ISAs that are enabled for "znver2" architecture.
3) For the time being, it uses the znver1 scheduler model.
4) Tests are updated.
5) Scheduler descriptions are yet to be put in place.
Reviewers: craig.topper
Differential Revision: https://reviews.llvm.org/D58343
llvm-svn: 354897
Summary:
Use a custom calling convention handler for interrupts instead of fixing
up the locations in LowerMemArgument. This way, the offsets are correct
when constructed and we don't need to account for them in as many
places.
Depends on D56883
Replaces D56275
Reviewers: craig.topper, phil-opp
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D56944
llvm-svn: 354837
If the LHS has known zeros, the RHS immediate will have had bits removed. So call computeKnownBits to get the known zeroes so we can handle this case.
Differential Revision: https://reviews.llvm.org/D58475
llvm-svn: 354811
Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.
Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.
Differential Revision: https://reviews.llvm.org/D58597
llvm-svn: 354771
Its proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.
llvm-svn: 354757
Summary:
Previously we used BLENDPS/BLENDPD but that puts the blend in the FP domain. Under optsize, the two address instruction pass can cause blendps/blendpd to commute to blendps/blendpd. But we probably shouldn't do that if the original type was a integer. So use pblendw instead.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58574
llvm-svn: 354755
Summary:
The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate.
The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate.
Reviewers: RKSimon, andreadb
Reviewed By: RKSimon
Subscribers: gbedwell, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58581
llvm-svn: 354754
Summary:
When promoting the over flow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.
We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put a the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.
Reviewers: spatel, RKSimon, nikic
Reviewed By: nikic
Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58567
llvm-svn: 354753
r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."
These were reverted in r354713 as their context depended on other patches that were reverted for a bug.
llvm-svn: 354734
Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD.
llvm-svn: 354729
Conversion from ConstantSDNode to MachineInstr sign extends immediates from their APInt representation to int64_t.
This commit makes sure we do the same for commuting. The tests changes show how this improves CSE. This issue was made worse by the MachineCSE using commuteInstruction to undo a commute. So we virtually guarantee the sign extend from isel would be lost.
The improved CSE also occurred with r354363, but that was reverted. I'm working to undo the revert, but wanted to get this fix in while it was easy to see the results.
llvm-svn: 354724
r354363 caused https://crbug.com/934963#c1, which has a plain C reduced
test case.
I also had to revert some dependent changes:
- r354648
- r354647
- r354640
- r354511
llvm-svn: 354713
If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.
Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.
llvm-svn: 354709
As discussed in:
D56864
D58197
Always use the narrow (128-bit) instruction when possible.
We already had the signed int version of this transform.
llvm-svn: 354675
Only the 1st fold is attempted pre-legalization, but it requires
legal (simple) types too, so we don't need an EVT in any of the code.
llvm-svn: 354674
This was done in r321424 to prevent scheduling from reordering things. But now that we model FPCW as a dependency, I don't think the same scheduling we were trying to prevent can occur.
llvm-svn: 354628
This is a follow-up to D56864.
If we're extracting from a non-zero index before casting to FP,
then shuffle the vector and optionally narrow the vector before doing the cast:
cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0
This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974
Differential Revision: https://reviews.llvm.org/D58197
llvm-svn: 354619
We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement.
llvm-svn: 354575
Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.
This adds an extra UnrollVectorOverflowOp() method to SDAG, because
the general UnrollOverflowOp() method can't deal with multiple results.
Additionally we need to expand UMULO/SMULO during vector op
legalization, as it may result in unrolling, which may need additional
type legalization.
Differential Revision: https://reviews.llvm.org/D57997
llvm-svn: 354513
This avoids depending on the peephole pass to do load folding.
Also adds some load folding for some insert_subvector patterns that use blend.
All of this was found by temporarily adding TB_NO_FORWARD to the blend immediate entries in the load folding tables.
I've added -disable-peephole to some of the affected tests from that experiment to ensure we're testing isel patterns.
llvm-svn: 354511
We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit.
llvm-svn: 354489
Summary:
Inc and Dec were at one point slow on Intel CPUs due to their tendency to cause partial flag stalls on P6 derived CPU cores. This is because these instructions are defined to preserve the carry flag. This partial flag stall issue persisted until Sandy Bridge when flag merging was changed to be handled as a data dependency instead of as a stall until retirement. Sandy Bridge and later CPUs rename the C flag separately from OSPAZ so there is no flag merge needed on INC/DEC to preserve the C flag.
Given these improvements I don't know why INC/DEC was ever considered slow on Sandy Bridge. If anything they should have been disabled on the earlier CPUs instead.
Note after this patch, INC/DEC are still considered slow on Silvermont, Goldmont, Knights Landing and our generic "x86-64" CPU.
Reviewers: spatel, RKSimon, chandlerc
Reviewed By: chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D58412
llvm-svn: 354436
As this has broken the lto bootstrap build for 3 days and is
showing a significant regression on the Dither_benchmark results (from
the LLVM benchmark suite) -- specifically, on the
BENCHMARK_FLOYD_DITHER_128, BENCHMARK_FLOYD_DITHER_256, and
BENCHMARK_FLOYD_DITHER_512; the others are unchanged. These have
regressed by about 28% on Skylake, 34% on Haswell, and over 40% on
Sandybridge.
This reverts commit r353923.
llvm-svn: 354434
After r354178, these instruction expand to a sequence that uses an OR instruction. That OR clobbers EFLAGS so we need to state that to avoid accidentally using the clobbered flags.
Our tests show the bug, but I didn't notice because the SETcc instructions didn't move after r354178 since it used to be safe to do the fp->int conversion first.
We should probably convert this whole sequence to SelectionDAG instead of a custom inserter to avoid mistakes like this.
Fixes PR40779
llvm-svn: 354395
The use of the -mprefer-vector-width=256 command line option mixed with functions
using vector intrinsics can create situations where one function thinks 512 vectors
are legal, but another fucntion does not.
If a 512 bit vector is passed between them via a pointer, its possible ArgumentPromotion
might try to pass by value instead. This will result in type legalization for the two
functions handling the 512 bit vector differently leading to runtime failures.
Had the 512 bit vector been passed by value from clang codegen, both functions would
have been tagged with a min-legal-vector-width=512 function attribute. That would
make them be legalized the same way.
I observed this issue in 32-bit mode where a union containing a 512 bit vector was
being passed by a function that used intrinsics to one that did not. The caller
ended up passing in zmm0 and the callee tried to read it from ymm0 and ymm1.
The fix implemented here is just to consider it a mismatch if two functions
would handle 512 bit differently without looking at the types that are being
considered. This is the easist and safest fix, but it can be improved in the future.
Differential Revision: https://reviews.llvm.org/D58390
llvm-svn: 354376
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 354363
This was the cause of the regression in D57888 - the commuted load pattern wasn't hidden by the predicate so once we enabled v4i32 blends on SSE41+ targets then isel was incorrectly matched against AVX2+ instructions.
llvm-svn: 354358
When parsing a sequence of tokens beginning with {, it will hit an assert and crash if the token afterwards is not an identifier. Instead of this, return a more verbose error as seen elsewhere in the function.
Patch by Brandon Jones (BrandonTJones)
Differential Revision: https://reviews.llvm.org/D57375
llvm-svn: 354356
Tuning flags don't have any effect on the available instructions so aren't a good reason to prevent inlining.
There are also some ISA flags that don't have any intrinsics our ABI requirements that we can exclude. I've put only the most basic ones like cmpxchg16b and lahfsahf. These are interesting because they aren't present in all 64-bit CPUs, but we have codegen workarounds when they aren't present.
Loosening these checks can help with scenarios where a caller has a more specific CPU than a callee. The default tuning flags on our generic 'x86-64' CPU can currently make it inline compatible with other CPUs. I've also added an example test for 'nocona' and 'prescott' where 'nocona' is just a 64-bit capable version of 'prescott' but in 32-bit mode they should be completely compatible.
I've based the implementation here of the similar code in AMDGPU.
Differential Revision: https://reviews.llvm.org/D58371
llvm-svn: 354355
The VBROADCAST combines and SimplifyDemandedVectorElts improvements mean that we now more consistently use shorter (128-bit) X86vzload input operands.
Follow up to D58053
llvm-svn: 354346
This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.
It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.
Differential Revision: https://reviews.llvm.org/D58053
llvm-svn: 354340
The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754https://bugs.llvm.org/show_bug.cgi?id=40487
..and those are shown in the IR test file and x86 codegen file.
Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.
This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo
formation by default. Only x86 overrides that setting for now although other targets will likely benefit
by forming usbuo too.
Differential Revision: https://reviews.llvm.org/D57789
llvm-svn: 354298
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.
This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.
Differential Revision: https://reviews.llvm.org/D58181
llvm-svn: 354279
No need for a separate stack slot. The lifetimes don't overlap.
Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot.
llvm-svn: 354234
No need to manually split everything. We can let the type legalizer work for us.
The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86SISD::CMP. This dropped the use count of the setcc, but wasn't what was intended.
llvm-svn: 354197
Preventing the load fold won't fix the partial register update since the
input we can fold is a GPR. So it will do nothing to prevent a false dependency
on an XMM register.
llvm-svn: 354193
When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption.
This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block)
Differential Revision: https://reviews.llvm.org/D57788
llvm-svn: 354178
These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there.
Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner.
llvm-svn: 354161
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.
This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.
This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but its the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch.
Differential Revision: https://reviews.llvm.org/D58237
llvm-svn: 354117
When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion.
Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type.
llvm-svn: 354003
Try to use 64-bit SLP vectorization. In addition to horizontal instrs
this change triggers optimizations for partial vector operations (for instance,
using low halfs of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by
<2 x float>).
Fixes llvm.org/PR32433
llvm-svn: 353923
In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this.
llvm-svn: 353921
-Pull the final stack load creation from the two callers into the helper.
-Return a single SDValue instead of a std::pair.
-Remove the Replace flag which isn't really needed.
llvm-svn: 353920
A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars.
llvm-svn: 353860
We were using DstTy, but that represents the integer type we are converting to which is i64 in this
case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers.
If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte
access. The store we used to get from SSE to the stack is using the corect size.
While there, consistenly use TheVT in place of Op.getOperand(0).getValueType() throughout the function.
llvm-svn: 353745
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).
The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.
All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.
In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.
I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.
Differential Revision: https://reviews.llvm.org/D57841
llvm-svn: 353641
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.
There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.
llvm-svn: 353626
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 353610
This patch accompanies the RFC posted here:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127239.html
This patch adds a new CallBr IR instruction to support asm-goto
inline assembly like gcc as used by the linux kernel. This
instruction is both a call instruction and a terminator
instruction with multiple successors. Only inline assembly
usage is supported today.
This also adds a new INLINEASM_BR opcode to SelectionDAG and
MachineIR to represent an INLINEASM block that is also
considered a terminator instruction.
There will likely be more bug fixes and optimizations to follow
this, but we felt it had reached a point where we would like to
switch to an incremental development model.
Patch by Craig Topper, Alexander Ivchenko, Mikhail Dvoretckii
Differential Revision: https://reviews.llvm.org/D53765
llvm-svn: 353563
Summary: These instructions update FPSW so they aren't generically safe to rematerialize into any location if FPSW is live for a comparison result. They also use FPCW for exception masking control. Though the only exception they can generate is stack overflow and we manage the stack ourselves so that's not really going to occur.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57934
llvm-svn: 353536
Summary:
FPCW contains the rounding mode control which we manipulate to implement fp to integer conversion by changing the roudning mode, storing the value to the stack, and then changing the rounding mode back. Because we didn't model FPCW and its dependency chain, other instructions could be scheduled into the middle of the sequence.
This patch introduces the register and adds it as an implciit def of FLDCW and implicit use of the FP binary arithmetic instructions and store instructions. There are more instructions that need to be updated, but this is a good start. I believe this fixes at least the reduced test case from PR40529.
Reviewers: RKSimon, spatel, rnk, efriedma, andrew.w.kaylor
Subscribers: dim, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57735
llvm-svn: 353489
Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner
First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111.
Differential Revision: https://reviews.llvm.org/D57754
llvm-svn: 353457
This is intentionally a small step because it's hard to know exactly
where we might introduce a conflicting transform with the code that
tries to form wider shuffles. But I think this is safe - if we have
a wide shuffle with 2 operands, then we should do better with an
extract + narrow shuffle.
Differential Revision: https://reviews.llvm.org/D57867
llvm-svn: 353427
combineExtractWithShuffle may leave a dangling bitcast which may
prevent further optimization in later passes. Avoid constructing it
unless it is used.
llvm-svn: 353333
The proposal in D56796 may cross the line because we're trying to avoid vectorization
transforms in generic DAG combining. So this is an alternate, later, x86-specific
translation of that patch.
There are several potential follow-ups to enhance this:
1. Allow extraction from non-zero element index.
2. Peek through extends of smaller width integers.
3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P
Differential Revision: https://reviews.llvm.org/D56864
llvm-svn: 353302
rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly.
llvm-svn: 353198
If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free).
Differential Revision: https://reviews.llvm.org/D57663
llvm-svn: 353154
Summary:
We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly.
This could have problems for proper dependency tracking in the machine schedulers though I don't have a test case that shows that.
Reviewers: rnk
Reviewed By: rnk
Subscribers: eraman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57641
llvm-svn: 353141
Summary:
Noticed while looking at D56052.
```
// The 'control' of BEXTR has the pattern of:
// [15...8 bit][ 7...0 bit] location
// [ bit count][ shift] name
// I.e. 0b000000011'00000001 means (x >> 0b1) & 0b11
```
I.e. we do not care about any of the bits aside from the low 16 bits.
So there is no point in doing the `slh`,`or` in 64 bits,
let's just do everything in 32 bits, and anyext if needed.
We could do that in 16 even, but we intentionally don't
zext to i16 (longer encoding IIRC),
so i'm guessing the same applies here.
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D56715
llvm-svn: 353073
These instructions implicitly operate on ST0, but we don't currently add that information to the MachineInstr. We also don't add it the tablegen definitions either.
For the most part this doesn't cause any problems because the stackifying occurs after register allocation. All the instructions are marked as having side effects so the postRA scheduler won't reorder them amongst themselves.
But nothing stops inline assembly using X87 instructions from being reordered around other x87 instructions if that inline assembly wasn't marked volatile.
The two test cases I've identified so far in PR40539 involve loads and stores used to set up the inline assembly or capture the results of the inline assembly ending up in the wrong order.
This patch adds implicit ST0 uses/defs to the load/store instructions to prevent this from happening.
I plan to fix all of the FP instructions, but the binops are bit trickier to get right. So I've chosen fixing the known test cases as a good first step.
I think we also need to update the tablegen descriptions so MS inline assembly infers the right clobbers, but I haven't checked that yet.
Differential Revision: https://reviews.llvm.org/D57644
llvm-svn: 353070
All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read.
This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.
llvm-svn: 353061
For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly.
Differential Revision: https://reviews.llvm.org/D57667
llvm-svn: 353051
This patch removes hidden codegen flag -print-schedule effectively reverting the
logic originally committed as r300311
(https://llvm.org/viewvc/llvm-project?view=revision&revision=300311).
Flag -print-schedule was originally introduced by r300311 to address PR32216
(https://bugs.llvm.org/show_bug.cgi?id=32216). That bug was about adding "Better
testing of schedule model instruction latencies/throughputs".
These days, we can use llvm-mca to test scheduling models. So there is no longer
a need for flag -print-schedule in LLVM. The main use case for PR32216 is
now addressed by llvm-mca.
Flag -print-schedule is mainly used for debugging purposes, and it is only
actually used by x86 specific tests. We already have extensive (latency and
throughput) tests under "test/tools/llvm-mca" for X86 processor models. That
means, most (if not all) existing -print-schedule tests for X86 are redundant.
When flag -print-schedule was first added to LLVM, several files had to be
modified; a few APIs gained new arguments (see for example method
MCAsmStreamer::EmitInstruction), and MCSubtargetInfo/TargetSubtargetInfo gained
a couple of getSchedInfoStr() methods.
Method getSchedInfoStr() had to originally work for both MCInst and
MachineInstr. The original implmentation of getSchedInfoStr() introduced a
subtle layering violation (reported as PR37160 and then fixed/worked-around by
r330615).
In retrospect, that new API could have been designed more optimally. We can
always query MCSchedModel to get the latency and throughput. More importantly,
the "sched-info" string should not have been generated by the subtarget.
Note, r317782 fixed an issue where "print-schedule" didn't work very well in the
presence of inline assembly. That commit is also reverted by this change.
Differential Revision: https://reviews.llvm.org/D57244
llvm-svn: 353043
We now print ST0 as 'st' when generating the clobber list for MS inline assembly in clang. This matches what the gcc reg name list expects.
Original commit message:
This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.
Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.
Differential Revision: https://reviews.llvm.org/D57298
llvm-svn: 353016
Looking into gcc and objdump behavior more this was overly aggressive. If the register is encoded in the instruction we should print %st(0), if its implicit we should print %st.
I'll be making a more directed change in a future patch.
llvm-svn: 353013
Summary:
When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found with when I attempted to make a emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name.
This also matches what objdump disassembly prints. It's also what is printed by gcc -S.
Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri
Reviewed By: rnk
Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits
Differential Revision: https://reviews.llvm.org/D57621
llvm-svn: 352985
Summary:
Add an additional combine to combineCarryThroughADD to reverse it back to the C flag to avoid regressions.
I believe this catches the cases that D57547 got.
Reviewers: RKSimon, spatel
Reviewed By: spatel
Subscribers: javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57637
llvm-svn: 352984
Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles.
The is exposes a couple of minor issues that will be fixed shortly:
Missed broadcast folds - we have a mixture of vzext_load lengths that need cleaning up
combine-sdiv.ll - AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts).
llvm-svn: 352963
We already have the getConstantOperandVal helper which returns a uint64_t, but along comes the fuzzer and inserts a i128 -1 constant or something and the whole thing asserts.......
I've updated a few obvious cases, and tried to make use of the const reference where possible, but there's more to do. A number of existing oss-fuzz tickets should be fixed if we start using APInt and perform value clamping where necessary.
llvm-svn: 352961
This cleans up all GetElementPtr creation in LLVM to explicitly pass a
value type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57173
llvm-svn: 352913
This cleans up all LoadInst creation in LLVM to explicitly pass the
value type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57172
llvm-svn: 352911
This cleans up all CallInst creation in LLVM to explicitly pass a
function type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57170
llvm-svn: 352909
As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section.
For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts).
Differential Revision: https://reviews.llvm.org/D56784
llvm-svn: 352883
Enable peeking through one use bitcasts to the subvector shuffle.
This still depends on the subvector being the same scalar-size but D57514 has already helped with the more tricky patterns
llvm-svn: 352879
Summary:
I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as `2`,
which matches the latencies specified in "AMD SOG for family 15h".
However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency.
Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`,
which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay.
Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does
state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`.
Reviewers: andreadb, RKSimon, craig.topper
Reviewed By: andreadb
Subscribers: gbedwell, courbet, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57300
llvm-svn: 352861
Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc
doesn't choke on it, hopefully.
Original Message:
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.
Then:
- update the CallInst/InvokeInst instruction creation functions to
take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.
One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.
However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)
Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.
Differential Revision: https://reviews.llvm.org/D57315
llvm-svn: 352827
This reverts commit f47d6b38c7 (r352791).
Seems to run into compilation failures with GCC (but not clang, where
I tested it). Reverting while I investigate.
llvm-svn: 352800
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.
Then:
- update the CallInst/InvokeInst instruction creation functions to
take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.
One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.
However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)
Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.
Differential Revision: https://reviews.llvm.org/D57315
llvm-svn: 352791
Similar to what we already do in DAGCombiner, but this version also handles bitcasts from types with different scalar sizes, which x86 is better at handling.
Differential Revision: https://reviews.llvm.org/D57514
llvm-svn: 352773
Enables 32/64-bit scalar load broadcasts on AVX1 targets
The extractelement-load.ll regression will be fixed shortly in a followup commit.
llvm-svn: 352743
If we're not inserting the broadcast into the lowest subvector then we can avoid the insertion by just performing a larger broadcast.
Avoids a regression when we enable AVX1 broadcasts in shuffle combining
llvm-svn: 352742
And instead just generate a libcall. My motivating example on ARM was a simple:
shl i64 %A, %B
for which the code bloat is quite significant. For other targets that also
accept __int128/i128 such as AArch64 and X86, it is also beneficial for these
cases to generate a libcall when optimising for minsize. On these 64-bit targets,
the 64-bits shifts are of course unaffected because the SHIFT/SHIFT_PARTS
lowering operation action is not set to custom/expand.
Differential Revision: https://reviews.llvm.org/D57386
llvm-svn: 352736
I believe this was there to handle avx512bw intrinsics that returned i64 type in 32-bit mode. But all those intrinsics have since been changed to v64i1 results or replaced with generic IR.
llvm-svn: 352698
This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.
Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.
Differential Revision: https://reviews.llvm.org/D57298
llvm-svn: 352660
There were checks to ensure some tables were sorted, but those tables aren't used by this function. The same tables are checked in the function that does use them. Maybe this was copy/pasted?
llvm-svn: 352609
As far as I can tell we already won't emit these aliases due to an operand count check in the tablegen code. Removing these because I couldn't make sense of the inconsistency between fadd and fmul from reading the code.
I checked the AsmMatcher and AsmWriter files before and after this change and there were no differences.
llvm-svn: 352608
Account for bypass delays when computing the latency of scalar int-to-float
conversions.
On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG).
This patch also fixes the number of micropcodes for the register-memory variants
of scalar int-to-float conversions.
Differential Revision: https://reviews.llvm.org/D57148
llvm-svn: 352518
Without the fix we get the following (with -Werror):
../lib/Target/X86/X86ISelLowering.cpp:14181:58: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
SmallVector<std::array<int, 2>, 2> LaneSrcs(NumLanes, {-1, -1});
^~~~~~
{ }
1 error generated.
llvm-svn: 352455
This did not cause the buildbot failure it was previously reverted for.
Original commit message:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.
This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inre
On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.
llvm-svn: 352433
Followup to D56636, this time handling the UADDSAT case by expanding
uadd.sat(a, b) to umin(a, ~b) + b.
Differential Revision: https://reviews.llvm.org/D56869
llvm-svn: 352409
First step towards adding support for 64-bit unary "sublane" handling (a bit like lowerShuffleAsRepeatedMaskAndLanePermute).
This allows us to add lowerV64I8Shuffle handling.
llvm-svn: 352389
This is tricky to make optimal: sometimes we're better off using
a single wider op, but other times it makes more sense to combine
a narrow ops to achieve the same result.
This solves the case from:
https://bugs.llvm.org/show_bug.cgi?id=40434
There's potentially a similar change for vectors with 64-bit elements,
but it needs adjustments similar to rL352333 to avoid creating infinite
loops.
llvm-svn: 352380
This transform was added with rL351346, and we had
an escape for shufps, but we also want one for
unpckps vs. vpermps because vpermps doesn't take
an immediate shuffle index operand.
llvm-svn: 352333
Although this is longer code, this is no-functional-change-intended.
The goal is to untangle the conditions under which we bail out, so
that's easier to adjust.
llvm-svn: 352320
The add+and sequence followed by a branch can
happen e.g. when looping over the set bits of an integer:
```
while (x != 0) {
func(x & ~x);
x &= x - 1;
}
```
Reviewed By: ctopper
Differential Revision: https://reviews.llvm.org/D57296
llvm-svn: 352306
def32 here means the producing instruction zeroed bits 63:32. We already do this for zext, but it looks like we can get an and+anyext sometimes.
Spotted in the diffs from D33587.
llvm-svn: 352303
Fix issue noted in D57281 that only tested the one use for the SDValue (the result flag), not the entire SUB.
I've added the getNode() to make it clearer what is intended than just the -> redirection.
llvm-svn: 352291
As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to a SBB instruction.
Differential Revision: https://reviews.llvm.org/D57281
llvm-svn: 352289
We often generate X86ISD::SBB(X, 0) for carry flag arithmetic.
I had tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find anything yet.
Differential Revision: https://reviews.llvm.org/D57169
llvm-svn: 352288
This reduces a bit of duplication between the combining and
lowering places that use it, but the primary motivation is
to make it easier to rearrange the lowering logic and solve
PR40434:
https://bugs.llvm.org/show_bug.cgi?id=40434
llvm-svn: 352280
Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56971
llvm-svn: 352260
Summary:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.
This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization.
On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57186
llvm-svn: 352255
Summary:
Currently, if an instruction with a memory operand has no debug information,
X86DiscriminateMemOps will generate one based on the first line of the
enclosing function, or the last seen debug info.
This may cause confusion in certain debugging scenarios. The long term
approach would be to use the line number '0' in such cases, however, that
brings in challenges: the base discriminator value range is limited
(4096 values).
For the short term, adding an opt-in flag for this feature.
See bug 40319 (https://bugs.llvm.org/show_bug.cgi?id=40319)
Reviewers: dblaikie, jmorse, gbedwell
Reviewed By: dblaikie
Subscribers: aprantl, eraman, hiraditya
Differential Revision: https://reviews.llvm.org/D57257
llvm-svn: 352246
We also need to combine to masked truncating with saturation stores, but I'm leaving that for a future patch.
This does regress some tests that used truncate wtih saturation followed by a masked store. Those now use a truncating store and use min/max to saturate.
Differential Revision: https://reviews.llvm.org/D57218
llvm-svn: 352230
This seems unnecessarily complicated because we gave names to
opposite polarity bools and have code comments that don't really
line up with the logic.
Step 1: remove UndefUpper and assert that it is the opposite of
UndefLower after the initial early exit.
llvm-svn: 352217
Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag.
This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further.
Noticed in some of the test cases in PR31754
Differential Revision: https://reviews.llvm.org/D57234
llvm-svn: 352210
This isn't the final fix for our reduction/horizontal codegen, but it takes care
of a lot of the problems. After we narrow the shuffle, existing combines for
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.
The avg and mul reduction tests show an existing shuffle lowering hole for
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.
Differential Revision: https://reviews.llvm.org/D57156
llvm-svn: 352209
As noted in D57156, we want to check at least part of
this pattern earlier (in combining), so this will allow
the code to be shared instead of duplicated.
llvm-svn: 352127
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.
The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.
This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)
That latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and also, is consistent with
MSVC.
Differential Revision: https://reviews.llvm.org/D56548
llvm-svn: 352076
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.
X86 uses i8, but seemed to be hacking around this before.
llvm-svn: 351882
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.
This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.
Differential Revision: https://reviews.llvm.org/D55935
llvm-svn: 351819
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57026
llvm-svn: 351817
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57022
llvm-svn: 351815
First step towards PR40376, this patch adds support for getCmpSelInstrCost to use the (optional) Instruction CmpInst predicate to indicate the type of integer comparison we're performing and alter the costs accordingly.
Differential Revision: https://reviews.llvm.org/D57013
llvm-svn: 351810
When we are inserting 1 "inline" element, and zeroing 2 of the other elements then we can safely commute the insertps source inputs to improve memory folding.
Differential Revision: https://reviews.llvm.org/D56843
llvm-svn: 351807
Summary:
Use X86ISD::VFPROUND in the instruction isel patterns. Add new patterns for ISD::FP_ROUND to maintain support for fptrunc in IR.
In the process I found a couple duplicate isel patterns which I also deleted in this patch.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56991
llvm-svn: 351762
Summary:
For compress, a select node doesn't semantically reflect the behavior of the instruction. The mask would have holes in it, but the resulting write is to contiguous elements at the bottom of the vector.
Furthermore, as far as the compressing and expanding is concerned the behavior is depended on the mask. You can't just have an expand/compress node that only reads the input vector. That node would have no meaning by itself.
This all only works because we pattern match the compress/expand+select back to the instruction. But conceivably an optimization of the select could break the pattern and leave something meaningless.
This patch modifies the expand and compress node to take the mask and passthru as additional inputs and gets rid of the select all together.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57002
llvm-svn: 351761
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.
Confirmed with @andreadb
llvm-svn: 351755
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.
Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.
Differential Revision: https://reviews.llvm.org/D56922
llvm-svn: 351742
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.
Noticed while cleaning up vector integer comparison costs for PR40376.
llvm-svn: 351697
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.
Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).
The increase pre-SSE41 selection costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.
llvm-svn: 351685
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Summary:
Right now we include ${TGT}GenCallingConv.inc once per each instruction
selection method implemented by ${TGT}:
- ${TGT}ISelLowering.cpp
- ${TGT}CallLowering.cpp
- ${TGT}FastISel.cpp
Instead, add a mechanism to tablegen for marking a particular convention
as "External", which causes tablegen to emit into the ::llvm namespace,
instead of as a static helper. This allows us to provide a header to
forward declare it, so we can simply call the function from all the
places it is referenced. Typically the calling convention analyzer is
called indirectly, so it doesn't benefit from inlining.
This saves a bit of final binary size, but mostly just saves object file
size:
before after diff artifact
12852K 12492K -360K X86ISelLowering.cpp.obj
4640K 4280K -360K X86FastISel.cpp.obj
1704K 2092K +388K X86CallingConv.cpp.obj
52448K 52336K -112K llc.exe
I didn't collect before numbers for X86CallLowering.cpp.obj, which is
for GlobalISel, but we should save 360K there as well.
This patch applies the strategy to the X86 backend, but there is no
reason it couldn't be applied to the other backends that implement
multiple ISel strategies, like AArch64.
Reviewers: craig.topper, hfinkel, efriedma
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D56883
llvm-svn: 351616
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.
llvm-svn: 351583
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.
Differential Revision: https://reviews.llvm.org/D56827
llvm-svn: 351570
Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits.
This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts.
We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change.
Patch by zhutianyang and Craig Topper
Differential Revision: https://reviews.llvm.org/D56695
llvm-svn: 351381
This cleans up the duplication we have with both intrinsic isel patterns and vselect isel patterns. This should also allow the intrinsics to get SimplifyDemandedBits support for the condition.
I've switched the canonical pattern in isel to use the X86ISD::BLENDV node instead of VSELECT. Since it always seemed weird to move from BLENDV with its relaxed rules on condition bits to VSELECT which has strict rules about all bits of the condition element being the same. Its more correct to go from VSELECT to BLENDV.
Differential Revision: https://reviews.llvm.org/D56771
llvm-svn: 351380
If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc.
Differential Revision: https://reviews.llvm.org/D56765
llvm-svn: 351378
On Jaguar, horizontal adds/subs have local forwarding disable.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.
This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.
I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).
Differential Revision: https://reviews.llvm.org/D56777
llvm-svn: 351366
Remove the existing assertion and just return false for unexpected shuffle value types (<X x i1> mainly....).
Found while updating combineX86ShufflesRecursively to run within SimplifyDemandedVectorElts/SimplifyDemandedBits.
llvm-svn: 351365
combineX86ShufflesRecursively is pretty cumbersome with a lot of arguments that only matter later in recursion.
This commit adds a wrapper version that only takes the initial root Op to simplify calls that don't need to worry about these.
An early, cleanup step towards merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts/SimplifyDemandedBits.
llvm-svn: 351352
I was trying to prevent shuffle regressions while matching more horizontal ops
and ended up here:
shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0
The affected tests were added for:
https://bugs.llvm.org/show_bug.cgi?id=34380
This patch won't change the examples in the bug report itself, but we should be
able to extend this to catch more types.
Differential Revision: https://reviews.llvm.org/D56756
llvm-svn: 351346
Summary:
Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc.
Refer D53541 for the context. Clang counterpart D56748.
Reviewers: rnk, efriedma
Reviewed By: rnk, efriedma
Subscribers: javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D56747
llvm-svn: 351281
That's really what it is. If we didn't use intrinsics for BLENDVPS/BLENDVPD/PBLENDVB all the way to isel, this is the node we would use.
llvm-svn: 351278
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.
llvm-svn: 351275
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.
I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.
I'll work on scatter and gatherpf/scatterpf next.
Differential Revision: https://reviews.llvm.org/D56527
llvm-svn: 351234
Modify getRegForInlineAsmConstraint to return special singleton
register class when a constraint references ST(7) not RFP80 for which
ST(7) is not a member.
llvm-svn: 351206
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.
llvm-svn: 351203
Summary:
In r345197 ESP and RSP were added to GR32_TC/GR64_TC, allowing them to
be used for tail calls, but this also caused `findDeadCallerSavedReg` to
think they were acceptable targets for clobbering. Filter them out.
Fixes PR40289.
Patch by Geoffry Song!
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D56617
llvm-svn: 351146
If we have PSHUFB and we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
llvm-svn: 351103
add (extractelt (X, 0), extractelt (X, 1)) --> extractelt (hadd X, X), 0
This is the integer sibling to D56011.
There's an additional restriction to only to do this transform in the
case where we don't have extra extracts from the source vector. Without
that, we can fail to match larger horizontal patterns that are more
beneficial than this minimal case. An improvement to the more general
h-op lowering may allow us to remove the restriction here in a follow-up.
llvm-svn: 351093
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877
llvm-svn: 351018
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877.
llvm-svn: 351017
This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types.
Differential Revision: https://reviews.llvm.org/D56544
llvm-svn: 350998
The 128-bit input produces 64-bits of output and fills the upper 64-bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do.
This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there.
Fixes another instruction for PR34877.
llvm-svn: 350994
We no longer need to extend mask scalars before bitcasting them to vXi1. This was only needed for the truncate intrinsics. And was really a bug in our lowering of them.
llvm-svn: 350991
We still use i8 for the load/store type. So we need to convert to/from i16 to around the mask type.
By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned.
llvm-svn: 350989
We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask.
This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that.
This fixes some of PR34877
llvm-svn: 350985
Teach x86 assembly operand parsing to distinguish between assembler
variable assigned to named registers and those assigned to immediate
values.
Reviewers: rnk, nickdesaulniers, void
Subscribers: hiraditya, jyknight, llvm-commits
Differential Revision: https://reviews.llvm.org/D56287
llvm-svn: 350966
Previously, we limited this transform to cases where the
extraction into the build vector happens from vectors of
the same type as the build vector, but that's not required.
There's a slight potential regression seen in the AVX512
result for phadd -- we're using the 256-bit flavor of the
instruction now even though the 128-bit subset is sufficient.
The same problem could already be seen in the AVX2 result.
Follow-up patches will attempt to narrow that back down.
llvm-svn: 350928
We were lowering the last step extract_vector_elt to a bitcast+truncate. Change it to use an extract_vector_elt of index 0 instead. Add isel patterns to do the equivalent of what the bitcast would have done. Plus an isel pattern for an any_extend+extract to prevent some regressions.
Finally add a DAG combine to turn v1i1 scalar_to_vector+extract_vector_elt of 0 into an extract_subvector.
This fixes some of the regressions from D350800.
llvm-svn: 350918
This extends to combineVSelectToShrunkBlend to be able to resimplify SHRUNKBLENDS that have already been created.
This should help some of the regressions from D56387
Differential Revision: https://reviews.llvm.org/D56421
llvm-svn: 350875
Despite what the comment says, FCMP_UNE would be an OR not an AND. In the lowering code the first branch created still goes to the original destination. The second branch was exchanged to go to where the subsequent unconditional branch went. This is different than what we do for FCMP_OEQ where both branches that we create go to the original unconditional branch.
As far as I can tell, I think this means we don't need to exchange the branch target with the unconditional branch for FCMP_UNE at all.
Differential Revision: https://reviews.llvm.org/D56309
llvm-svn: 350873
When we use the partial-matching function on a 128-bit chunk, we must
account for the possibility that we've matched undef halves of the
original source vectors, so the outputs may need to be reset.
This should allow closing PR40243:
https://bugs.llvm.org/show_bug.cgi?id=40243
llvm-svn: 350830
This is a partial fix for:
https://bugs.llvm.org/show_bug.cgi?id=40243
...as seen in the integer test, we still need to correct the result when using the
existing (old) horizontal op matching function because it does not model the way
x86 256-bit horizontal ops return results (each 128-bit half is its own horizontal-op).
A potential follow-up change for that is discussed in the bug report - see also D56490.
This generally duplicates a lot of the existing matching code, but we can't just remove
that without introducing regressions, so the existing code is renamed and used less often.
Follow-ups may try to reduce that overlap.
Differential Revision: https://reviews.llvm.org/D56450
llvm-svn: 350826
Summary:
This pass replaces GR8/GR16/GR32/GR64 with their equivalent sized mask register classes. But VK32/VK64 aren't legal without AVX512BW. Apparently this mostly appears to work if the register coalescer is able to remove the VK32/VK64 register class reference. Or if we don't ever spill it. But there's no guarantee of that.
Another Intel employee managed to trigger a crash due to this with ISPC. Unfortunately, I've lost the test case he sent me at the time. I'm trying to get him to reproduce it for me. I'd like to get this in before 8.0 branches since its a little scary.
The regressions here are unfortunate, but I think we can make some improvements to DAG combine, load folding, etc. to fix them. Just not sure if we can get that done for 8.0.
Fixes PR39741
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56460
llvm-svn: 350800
Found while trying to figure out why my second version of D56421 worked better than the first version. We weren't deleting the vselect in a timely fashion and that caused SimplfyDemandedBit to see an additional user.
The new version doesn't have this problem so this fix isn't needed there, but seemed like the right thing to do.
llvm-svn: 350781
When the result type is v2i64/v2f64 and the index element size is i32, the index vector has two unused elements making the type v4i32. The mask VT should match the number of memory accesses that will be made.
This is consistent with the isel patterns used for the target independent gather/scatter intrinsic.
llvm-svn: 350687
For stack frames on the size of a register in x86, a code size optimization
emits "push rax/eax" instead of "sub" for stack allocation. For example:
foo:
.cfi_startproc
BB#0:
pushq %rax
Ltmp0:
.cfi_def_cfa_offset 16
...
.cfi_endproc
However, we are falling back to DWARF in this case because we cannot
encode %rax as a saved register.
This requirement is wrong, since we don't care about the contents of
%rax, it is the equivalent of a sub.
In order to specify that we care about the contents of %rax, we would
need a .cfi_offset %rax, <offset>.
It's also overzealous in the case where there are pushes for callee saved
registers followed by a "push rax/eax" instead of "sub", in which case we should
also be able to encode the callee saved regs and everything else using compact
unwind.
Patch authored by Bruno Cardoso Lopes.
Differential Revision: https://reviews.llvm.org/D13793
llvm-svn: 350623
Summary: AVX512VBMI2 supports a funnel shift by immediate and a funnel shift by a variable vector.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56361
llvm-svn: 350498
There are no test changes here in the existing cost model
regression tests because integer add/sub have a default
legal cost of 1 already. This would break, however, if
we custom lower those ops because the default cost model
assumes that custom-lowered ops are more expensive.
This is similar to the change in rL350403. See discussion
in D56011 for more details. When we enhance that patch to
handle integer ops, we need this cost model change to avoid
unintended diffs here from the custom lowering.
llvm-svn: 350496
This was here because out and in instructions allow '(%dx)' even though its not a memory reference. To handle this we build a special operand for the DX register reference before we get to the call to CheckBaseRegAndIndexRegAndScale. So we no longer need this special case.
llvm-svn: 350483
This is especially helpful on targets without avx512bw since we don't have a good way to convert from v16i8/v32i8 to v16i1/v32i1 for the truncate anyway. If we're just going to convert it to a GPR we might as well use pmovmskb to accomplish both.
llvm-svn: 350480
We don't need to require the first operand to be an integer because we already said it was the same type as the result which we also constrained to an integer.
llvm-svn: 350455
The 1st try for this was at rL350369, but it caused IR-level diffs because
our cost models differentiate custom vs. legal/promote lowering. So that was
reverted at rL350373. The cost models were fixed independently at rL350403,
so this is effectively the same patch as last time.
Original commit message:
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350421
Noticed in D56011 - handle the case that scalar fp ops are quicker on P3 than P4
Add the other costs so that we're not relying on the default "is legal/custom" cost logic.
llvm-svn: 350403
Doing this late so we will prefer to fold the AND into a masked comparison first. That can be better for the live range of the mask register.
Differential Revision: https://reviews.llvm.org/D56246
llvm-svn: 350374
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350369
As noted in PR39973 and D55558:
https://bugs.llvm.org/show_bug.cgi?id=39973
...this is a partial implementation of a fold that we do as an IR canonicalization in instcombine:
// extelt (binop X, Y), Index --> binop (extelt X, Index), (extelt Y, Index)
We want to have this in the DAG too because as we can see in some of the test diffs (reductions),
the pattern may not be visible in IR.
Given that this is already an IR canonicalization, any backend that would prefer a vector op over
a scalar op is expected to already have the reverse transform in DAG lowering (not sure if that's
a realistic expectation though). The transform is limited with a TLI hook because there's an
existing transform in CodeGenPrepare that tries to do the opposite transform.
Differential Revision: https://reviews.llvm.org/D55722
llvm-svn: 350354
INC/DEC are pretty much the same as ADD/SUB except that they don't update the C flag.
This patch removes the special nodes and just pattern matches from ADD/SUB during isel if the C flag isn't being used.
I had to avoid selecting DEC is the result isn't used. This will become a SUB immediate which will turned into a CMP later by optimizeCompareInstr. This lead to the one test change where we use a CMP instead of a DEC for an overflow intrinsic since we only checked the flag.
This also exposed a hole in our RMW flag matching use of hasNoCarryFlagUses. Our root node for the match is a store and there's no guarantee that all the flag users have been selected yet. So hasNoCarryFlagUses needs to check copyToReg and machine opcodes, but it also needs to check for the pre-match SETCC, SETCC_CARRY, BRCOND, and CMOV opcodes.
Differential Revision: https://reviews.llvm.org/D55975
llvm-svn: 350245
Peek through shift modulo masks while matching double shift patterns.
I was hoping to delay this until I could remove the X86 code with generic funnel shift matching (PR40081) but this will do for now.
Differential Revision: https://reviews.llvm.org/D56199
llvm-svn: 350222
All of these use custom isel so we can pretty easily detect the differences in the custom code in X86ISelDAGToDAG. The ISD opcodes just need to express the desired semantics not the details of how they would be selected by isel. So unifying them lets us remove the special casing from lowering.
llvm-svn: 350206
These require a different X86ISD node to be created than i16/i32/i64. I guess no one wanted to add the special code for that except in LowerXALUO. But now LowerXALUO, LowerSELECT, and LowerBRCOND all use a common helper function so they all share the special code.
Unfortunately, there are no test changes because we seem to correct the miss in a DAG combine later. I did verify it manually using test cases from xmulo.ll
llvm-svn: 350205
This makes it easier to keep the LowerBRCOND and LowerSELECT code in sync with LowerXALUO so they always pick the same operation for overflowing instructions.
This is inspired by the helper functions used by ARM and AArch64 for the same purpose.
The test change is because LowerSELECT was not in sync with LowerXALUO with regard to INC/DEC for SADDO/SSUBO.
llvm-svn: 350198
This seems to be getting in the way more than its helping. This does mean we stop scalarizing some cases, but I'm not convinced the scalarization was really better.
Some of the changes to vsel-cmp-load.ll are a regression but D56156 should fix it.
llvm-svn: 350159
This allows us to sign extend to v4i32 first. And then share that extension to implement the final steps to v4i64 using a pcmpgt and punpckl and punpckh.
We already do something similar for SIGN_EXTEND with -x86-experimental-vector-widening-legalization.
llvm-svn: 350158
This was tricking us into making these operations and then letting them get scalarized later. But I can't prove that the scalarized version is actually better.
llvm-svn: 350141
Previously we emitted a multiply and some masking that was supposed to matched to PMULUDQ, but the masking could sometimes be removed before we got a chance to match it. So instead just emit the PMULUDQ directly.
Remove the DAG combine that was added when the ReplaceNodeResults code was originally added. Add a new DAG combine to avoid regressions in shrink_vmul.ll
Some of the shrink_vmul.ll test cases now pick PMULUDQ instead of PMADDWD/PMULLD, but I think this should be an improvement on most CPUs.
I think all of this can go away if/when we switch to -x86-experimental-vector-widening-legalization
llvm-svn: 350134
Create PMULDQ/PMULUDQ as long as the number of elements is a power of 2.
This seems to give some improvements in our ability to use SimplifyDemandedBits.
llvm-svn: 350084
Make each of the helper functions only return their comparison node and the condition code. Leave X86ISD::SETCC creation to the LowerSETCC function itself.
Looking into whether we can use this code directly in BRCOND and SELECT lowering instead of going through LowerSETCC which creates an X86ISD::SETCC node we need to look through.
llvm-svn: 350082
Only one of the 3 callers of LowerAndToBT need the SETCC node. Two of them have to look through it to find the operands they really need. Instead create it after the one call that needs it.
LowerAndToBT now returns both the BT node and the X86 specific condition code separately.
llvm-svn: 350081
This is an alternative to what I attempted in D56057.
GetDemandedBits is a special version of SimplifyDemandedBits that allows simplifications even when the operand has other uses. GetDemandedBits will only do simplifications that allow a node to be bypassed. It won't create new nodes or alter any of the other users.
I had to add support for bypassing SIGN_EXTEND_INREG to GetDemandedBits.
Based on a patch that Simon Pilgrim sent me in email.
Fixes PR40142.
llvm-svn: 350059
Remove the TESTmr isel patterns and add another postprocessing combine for TESTrr+ANDrm->TESTmr. We already have a postprocessing combine for TESTrr+ANDrr->TESTrr. With this we can give ANDN a chance to match first. And clean it up during post processing if we ended up with just a regular AND.
This is another step towards my plan to gut EmitTest and do more flag handling during isel matching or by using optimizeCompare.
llvm-svn: 350038
The missed load folding noticed in D55898 is visible independent of that change
either with an adjusted IR pattern to start or with AVX2/AVX512 (where the build
vector becomes a broadcast first; movddup is not produced until we get into isel
via tablegen patterns).
Differential Revision: https://reviews.llvm.org/D55936
llvm-svn: 350005
Summary:
Added a pair of APIs for encoding/decoding the 3 components of a DWARF discriminator described in http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html: the base discriminator, the duplication factor (useful in profile-guided optimization) and the copy index (used to identify copies of code in cases like loop unrolling)
The encoding packs 3 unsigned values in 32 bits. This CL addresses 2 issues:
- communicates overflow back to the user
- supports encoding all 3 components together. Current APIs assume a sequencing of events. For example, creating a new discriminator based on an existing one by changing the base discriminator was not supported.
Reviewers: davidxl, danielcdh, wmi, dblaikie
Reviewed By: dblaikie
Subscribers: zzheng, dmgreen, aprantl, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D55681
llvm-svn: 349973
This fixes the patterns that have or/and as a root. 'and' is handled differently since thy usually have a CMP wrapped around them.
I had to look for uses of the CF flag because all these nodes have non-standard CF flag behavior. A real or/xor would always clear CF. In practice we shouldn't be using the CF flag from these nodes as far as I know.
Differential Revision: https://reviews.llvm.org/D55813
llvm-svn: 349962
The BEXTR instruction documents the SF bit as undefined.
The TBM BEXTR instruction has the same issue, but I'm not sure how to test it. With the control being an immediate we can determine the sign bit is 0 or the BEXTR would have been removed.
Fixes PR40060
Differential Revision: https://reviews.llvm.org/D55807
llvm-svn: 349956
This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right.
In the motivating example, note that we have movddup before SSE4.1 and
again with AVX2. That's because insertps isn't available pre-SSE41 and
vbroadcast is (more generally) available with AVX2 (and the splat is
reduced to movddup via isel pattern).
Differential Revision: https://reviews.llvm.org/D55898
llvm-svn: 349937
This shortens the switches in X86ISelDAGToDAG.cpp to only need to check condition code instead of a list of opcodes.
This also fixes a bug where the memory forms of SETcc were missing from hasNoCarryFlagUses.
llvm-svn: 349868
Summary:
This allows expanding {7,11,13,14,15,21,22,23,25,26,27,28,29,30,31}-byte memcmp
in just two loads on X86. These were previously calling memcmp.
Reviewers: spatel, gchatelet
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55263
llvm-svn: 349731
The (cmp (and X, Y) 0) pattern is greedy and ends up forming a TESTrr and consuming the and when it might be better to use one of the BMI/TBM like BLSR or BLSI.
This patch moves removes the pattern from isel and adds a post processing check to combine TESTrr+ANDrr into just a TESTrr. With this patch we are able to select the BMI/TBM instructions, but we'll also emit a TESTrr when the result is compared to 0. In many cases the peephole pass will be able to use optimizeCompareInstr to remove the TEST, but its probably not perfect.
Differential Revision: https://reviews.llvm.org/D55870
llvm-svn: 349661
Fixes https://bugs.llvm.org/show_bug.cgi?id=38743
The function removeRedundantBlockingStores is supposed to remove any blocking stores contained in each other in lockingStoresDispSizeMap.
But it currently looks only at the previous one, which will miss some cases that result in assert.
This patch refine the function to check all previous layouts until find the uncontained one. So all redundant stores will be removed.
Patch by Pengfei Wang
Differential Revision: https://reviews.llvm.org/D55642
llvm-svn: 349660
We already had BSF here as part of __builtin_ffs improvements and I was just wondering yesterday whether we should have BSR there.
This addresses one issue from PR40090.
llvm-svn: 349531
Migrate the X86 backend from X86ISD opcodes ADDS and SUBS to generic
ISD opcodes SADDSAT and SSUBSAT. This also improves scodegen for
@llvm.sadd.sat() and @llvm.ssub.sat() intrinsics.
This is a followup to D55787 and part of PR40056.
Differential Revision: https://reviews.llvm.org/D55833
llvm-svn: 349520
InstCombine seems to canonicalize or PSUB patter into a max with the cosntant and an add with an inverse of the constant.
This patch recognizes this pattern and turns it into PSUBUS. Future work could improve undef element handling.
Fixes some of PR40053
Differential Revision: https://reviews.llvm.org/D55780
llvm-svn: 349519
Replace the X86ISD opcodes ADDUS and SUBUS with generic ISD opcodes
UADDSAT and USUBSAT. As a side-effect, this also makes codegen for
the @llvm.uadd.sat and @llvm.usub.sat intrinsics reasonable.
This only replaces use in the X86 backend, and does not move any of
the ADDUS/SUBUS X86 specific combines into generic codegen.
Differential Revision: https://reviews.llvm.org/D55787
llvm-svn: 349481
(VSRAI (VSHLI X, C1), C1) --> X iff NumSignBits(X) > C1
This works better as part of SimplifyDemandedBits than part of the general combine.
llvm-svn: 349462
This fold was incredibly specific - replace with a SimplifyDemandedBits fold to remove a VSRAI if only the original sign bit is demanded (its guaranteed to stay the same).
Test change is merely a rescheduling.
llvm-svn: 349459
Convert VSRAI to VSRLI is the sign bit is known zero and improve KnownBits output for all shift instruction.
Fixes the poor codegen comments in D55768.
llvm-svn: 349407
This allows a TEST to be used and can be combined with any AND that may already exist as an input to the shift.
This was already done in EmitTest, but was easily tricked by multiple uses because the setcc might be used by multiple instructions. Once the SETCC and users are legalized then we can look for the shift to be used by a single CMP, but the CMP itself can have multiple users.
This appears to fix the case in PR39968.
llvm-svn: 349385
This is an initial patch to add the necessary support for a DemandedElts argument to SimplifyDemandedBits, more closely matching computeKnownBits and to help improve vector codegen.
I've added only a small amount of the changes necessary to get at least one test to update - a lot more can be done but I'd like to add these methodically with proper test coverage, at the same time the hope is to slowly move some/all of SimplifyDemandedVectorElts into SimplifyDemandedBits as well.
Differential Revision: https://reviews.llvm.org/D55768
llvm-svn: 349374
We keep a few iterators into the basic block we're selecting while
performing FastISel. Usually this is fine, but occasionally code wants
to remove already-emitted instructions. When this happens we have to be
careful to update those iterators so they're not pointint at dangling
memory.
llvm-svn: 349365
I'd like to try to move a lot of the flag matching out of EmitTest and push it to isel or isel preprocessing. This is a step towards that.
The test-shrink-bug.ll changie is an improvement because we are no longer interfering with test shrink handling in isel.
The pr34137.ll change is a regression, but the IR came from -O0 and was not reduced by InstCombine. So it contains a lot of redundancies like duplicate loads that made it combine poorly.
llvm-svn: 349315
Use consistent rules for when to lower to SHLD/SHRD for slow machines - fixes a weird issue where funnel shift gets expanded but then X86ISelLowering's combineOr sees the optsize and combines to SHLD/SHRD, but now with the modulo amount guard......
llvm-svn: 349285
The only caller of this turns CMP with 0 into TEST. CMP with 0 and TEST both set OF to 0 so we should have no issues with instructions that only use OF.
Though I don't think there's any reason we would read just OF after a compare with 0 anyway. So this probably isn't an observable change.
llvm-svn: 349223
hasNoCarryFlagUses hardcoded that the flag result is 1 and used that to filter which uses were of interest. hasNoSignedComparisonUses just assumes the only result is flags and checks whether any user of the node is a CopyToReg instruction.
After this patch we now do a result number check in both and rely on the caller to provide the result number.
This shouldn't change behavior it was just an odd difference between the two functions that I noticed.
llvm-svn: 349222
Summary:
If the setcc already has the target desired type we can reach the getSetCC/getSExtOrTrunc after the MatchingVecType check with the exact same types as the nodes we started with. This causes those causes VsetCC to be CSEd to N0 and the getSExtOrTrunc will CSE to N. When we return N, the caller will think that meant we called CombineTo and did our own worklist management. But that's not what happened. This prevents target hooks from being called for the node.
To fix this, I've now returned SDValue if the setcc is already the desired type. But to avoid some regressions in X86 I've had to disable one of the target combines that wasn't being reached before in the case of a (sext (setcc)). If we get vector widening legalization enabled that entire function will be deleted anyway so hopefully this is only for the short term.
Reviewers: RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55459
llvm-svn: 349137
This requires the two callers to manifest a 0 to make EmitCmp call EmitTest.
I'm looking into changing how we combine TEST and flag setting instructions to not be part of lowering. And instead be part of DAG combine or isel. Which will mean EmitTest will probably become gutted and maybe disappear entirely.
llvm-svn: 349094
Summary:
Macros are expanded on a single line. In case of large expansions,
with sufficiently many instructions with memory operands (and when
-fdebug-info-for-profiling is requested), we may be unable to generate
new base discriminator values - new values overflow (base
discriminators may not be larger than 2^12).
This CL warns instead of asserting in such a case. A subsequent CL
will add APIs to check for overflow before creating new debug info.
See https://bugs.llvm.org/show_bug.cgi?id=39890
Reviewers: davidxl, wmi, gbedwell
Reviewed By: davidxl
Subscribers: aprantl, llvm-commits
Differential Revision: https://reviews.llvm.org/D55643
llvm-svn: 349075
There's still a couple of minor SimplifyDemandedElts regressions in some of the shift amount splats that will be fixed in future patches.
llvm-svn: 349052
Move existing rotation expansion code into TargetLowering and set it up for vectors as well.
Ideally this would share more of the funnel shift expansion, but we handle the shift amount modulo quite differently at the moment.
Begun removing x86 vector rotate custom lowering to use the expansion.
llvm-svn: 349025
MULX has somewhat improved register allocation constraints compared to the legacy MUL instruction. Both output registers are encoded instead of fixed to EAX/EDX, but EDX is used as input. It also doesn't touch flags. Unfortunately, the encoding is longer.
Prefering it whenever BMI2 is enabled is probably not optimal. Choosing it should somehow be a function of register allocation constraints like converting adds to three address. gcc and icc definitely don't pick MULX by default. Not sure what if any rules they have for using it.
Differential Revision: https://reviews.llvm.org/D55565
llvm-svn: 348975
I'm hoping we can just replace SETCC_CARRY with SBB. This is another step towards that.
I've explicitly used zero as the input to the setcc to avoid a false dependency that we've had with the SETCC_CARRY. I changed one of the patterns that used NEG to instead use an explicit compare with 0 on the LHS. We needed the zero anyway to avoid the false dependency. The negate would clobber its input register. By using a CMP we can avoid that which could be useful.
Differential Revision: https://reviews.llvm.org/D55414
llvm-svn: 348959
This patch introduces a generic function to determine whether a given vector type is known to be a splat value for the specified demanded elements, recursing up the DAG looking for BUILD_VECTOR or VECTOR_SHUFFLE splat patterns.
It also keeps track of the elements that are known to be UNDEF - it returns true if all the demanded elements are UNDEF (as this may be useful under some circumstances), so this needs to be handled by the caller.
A wrapper variant is also provided that doesn't take the DemandedElts or UndefElts arguments for cases where we just want to know if the SDValue is a splat or not (with/without UNDEFS).
I had hoped to completely remove the X86 local version of this function, but I'm seeing some regressions in shift/rotate codegen that will take a little longer to fix and I hope to get this in sooner so I can continue work on PR38243 which needs more capable splat detection.
Differential Revision: https://reviews.llvm.org/D55426
llvm-svn: 348953
This extends the code that handles 16-bit add promotion to form LEA to also allow 8-bit adds.
That allows us to combine add ops with register moves and save some instructions. This is
another step towards allowing add truncation in generic DAGCombiner (see D54640).
Differential Revision: https://reviews.llvm.org/D55494
llvm-svn: 348946
Summary:
When doing X86CondBrFolding::analyzeCompare, it will meet the SUB32ri instruction as below to use the global address for its operand,
%733:gr32 = SUB32ri %62:gr32(tied-def 0), @img2buf_normal, implicit-def $eflags
JNE_1 %bb.41, implicit $eflags
so the assertion "assert(MI.getOperand(ValueIndex).isImm() && "Expecting Imm operand")" is not correct and change the assert to if make X86CondBrFolding::analyzeCompare return false as not finding the compare for this
Patch by Jianping Chen
Reviewers: smaslov, LuoYuanke, liutianle, Jianping
Reviewed By: Jianping
Subscribers: lebedev.ri, llvm-commits
Differential Revision: https://reviews.llvm.org/D54250
llvm-svn: 348853
As discussed in D55494, we want to extend this to handle 8-bit
ops too, but that could be extended further to enable this on
32-bit systems too.
llvm-svn: 348851
As discussed in:
D55494
...this code has been disabled/dead for a long time (the code references
Athlon and Pentium 4), and there's almost no chance that it will be used
given the last decade of uarch evolution. Also, in SDAG we promote 16-bit
ops to 32-bit, so there's almost no way to test this code any more.
llvm-svn: 348845
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.
This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.
Differential Revisions: https://reviews.llvm.org/D53629
llvm-svn: 348788
This should really be generalized to allow increment and/or
we should replace it by using ISD::matchUnaryPredicate().
See D55515 for context.
llvm-svn: 348776
Fixes https://bugs.llvm.org/show_bug.cgi?id=39926.
The size of the first copy was computed as
std::abs(std::abs(LdDisp2) - std::abs(LdDisp1)), which results in
skipped bytes if the signs of LdDisp2 and LdDisp1 differ. As far as
I can see, this should just be LdDisp2 - LdDisp1. The case where
LdDisp1 > LdDisp2 is already handled in the code above, in which case
LdDisp2 is set to LdDisp1 and this subtraction will evaluate to
Size1 = 0, which is the correct value to skip an overlapping copy.
Differential Revision: https://reviews.llvm.org/D55485
llvm-svn: 348750
Both intrinsics do the exact same thing so we really only need one.
Earlier in the 8.0 cycle we changed the signature of this intrinsic without renaming it. But it looks difficult to get the autoupgrade code to allow me to merge the intrinsics and change the signature at the same time. So I've renamed the intrinsic slightly for the new merged intrinsic. I'm skipping autoupgrading from the previous new to 8.0 signature. I've also renamed the subborrow for consistency.
llvm-svn: 348737
Previously we had to take the carry in and add -1 to it to set the carry flag so we could use it with ADC/SBB. But if we know its 0 then we don't need to bother.
This should go a long way towards fixing PR24545.
llvm-svn: 348727
The dependency was added in r213995 in response to r213986 which did make
X86/Utils depend on IR, but r256680 later removed that dependency again.
llvm-svn: 348724
The existing code tries to handle an undef operand while transforming an add to an LEA,
but it's incomplete because we will crash on the i16 test with the debug output shown below.
It's better to just give up instead. Really, GlobalIsel should have folded these before we
could get into trouble.
# Machine code for function add_undef_i16: NoPHIs, TracksLiveness, Legalized, RegBankSelected, Selected
bb.0 (%ir-block.0):
liveins: $edi
%1:gr32 = COPY killed $edi
%0:gr16 = COPY %1.sub_16bit:gr32
%5:gr64_nosp = IMPLICIT_DEF
%5.sub_16bit:gr64_nosp = COPY %0:gr16
%6:gr64_nosp = IMPLICIT_DEF
%6.sub_16bit:gr64_nosp = COPY %2:gr16
%4:gr32 = LEA64_32r killed %5:gr64_nosp, 1, killed %6:gr64_nosp, 0, $noreg
%3:gr16 = COPY killed %4.sub_16bit:gr32
$ax = COPY killed %3:gr16
RET 0, implicit killed $ax
# End machine code for function add_undef_i16.
*** Bad machine code: Reading virtual register without a def ***
- function: add_undef_i16
- basic block: %bb.0 (0x7fe6cd83d940)
- instruction: %6.sub_16bit:gr64_nosp = COPY %2:gr16
- operand 1: %2:gr16
LLVM ERROR: Found 1 machine code errors.
Differential Revision: https://reviews.llvm.org/D54710
llvm-svn: 348722
Extension to rL348617, turns out llvm-exegesis doesn't need to match the perf counter name against a scheduler model resource name - so I've added a few more counters that I could find in the libpfm4 source code (and fix a typo in the knl/knm retired_uops counter - which uses 'all' instead of 'any').
llvm-svn: 348721
To make X86CondBrFoldingPass can be run with --run-pass option, this can test one wrong assertion on analyzeCompare function for SUB32ri when its operand is not imm
Patch by Jianping Chen
Differential Revision: https://reviews.llvm.org/D55412
llvm-svn: 348620
This patch attempts to improve pfm perf counter coverage for all the x86 CPUs that libpfm4 supports.
Intel/AMD CPU families tend to share names for cycle/uops counters so even if they don't have a scheduler model yet they can at least use the default values (checked against the libpfm4 source code).
The remaining CPUs (where their port/pipe resource counters are known) I've tried to add to the existing model mappings.
These are untested but don't represent a regression to current llvm-exegesis behaviour for these CPUs.
Differential Revision: https://reviews.llvm.org/D55432
llvm-svn: 348617
Adds fatal errors for any target that does not support the Tiny or Kernel
codemodels by rejigging the getEffectiveCodeModel calls.
Differential Revision: https://reviews.llvm.org/D50141
llvm-svn: 348585
This addresses a FIXME and avoids depending on an isel pattern match I think. I've remove the isel patterns too since he have no lit tests left that cover them. Hopefully that really means they are unused.
I'm trying to decide if we need SETCC_CARRY. This removes one of its usages.
Differential Revision: https://reviews.llvm.org/D55355
llvm-svn: 348536
Initial step towards making the function more generic (and probably move into SelectionDAG).
This is necessary to avoid massive codegen bloat for PR38243 (Add modulo rotate support to LowerRotate).
llvm-svn: 348498
Whenever we effectively take the address of a basic block we need to
manually update that basic block to reflect that fact or later passes
such as tail duplication and tail merging can break the invariants of
the code. =/ Sadly, there doesn't appear to be any good way of
automating this or even writing a reasonable assert to catch it early.
The change seems trivially and obviously correct, but sadly the only
really good test case I have is 1000s of basic blocks. I've tried
directly writing a test case that happens to make tail duplication do
something that crashes later on, but this appears to require an
*amazingly* complex set of conditions that I've not yet reproduced.
The change is technically covered by the tests because we mark the
blocks as having their address taken, but that doesn't really count as
properly testing the functionality.
llvm-svn: 348374
Prep work for PR38243 - mainly adding comments on where we need to add modulo support (doing so at the moment causes massive codegen regressions).
I've also consistently added support for modulo folding for uniform constants (although at the moment we have no way to trigger this) and removed the old assertions.
llvm-svn: 348366
This is an initial patch to add a minimum level of support for funnel shifts to the SelectionDAG and to begin wiring it up to the X86 SHLD/SHRD instructions.
Some partial legalization code has been added to handle the case for 'SlowSHLD' where we want to expand instead and I've added a few DAG combines so we don't get regressions from the existing DAG builder expansion code.
Differential Revision: https://reviews.llvm.org/D54698
llvm-svn: 348353
PR17686 demonstrates that for some targets FP exceptions can fire in cases where the FP_TO_UINT is expanded using a FP_TO_SINT instruction.
The existing code converts both the inrange and outofrange cases using FP_TO_SINT and then selects the result, this patch changes this for 'strict' cases to pre-select the FP_TO_SINT input and the offset adjustment.
The X87 cases don't need the strict flag but generates much nicer code with it....
Differential Revision: https://reviews.llvm.org/D53794
llvm-svn: 348251
We only needed this because it provided really aggressive constant folding even through constant pool entries created from build_vectors. The main case was for vXi8 MULH legalization which was happening as part of legalize DAG instead of as part of legalize vector ops. Now its part of vector op legalization and we've added special handling for build vectors of all constants there. This has removed the need for this code on the list tests we have.
llvm-svn: 348237
This is the smallest vector enhancement I could find to D54640.
Here, we're allowing narrowing to only legal vector ops because we'll see
regressions without that. All of the test diffs are wins from what I can tell.
With AVX/AVX512, we can shrink ymm/zmm ops to xmm.
x86 vector multiplies are the problem case that we're avoiding due to the
patchwork ISA, and it's not clear to me if we can dance around those
regressions using TLI hooks or if we need preliminary patches to plug those
holes.
Differential Revision: https://reviews.llvm.org/D55126
llvm-svn: 348195
Summary:
We need to unpackl and unpackh the operands to use two vXi16 multiplies. Previously it looks like the low unpack would get constant folded at least in the 128-bit case after shuffle lowering turned the unpackl into ZERO_EXTEND_VECTOR_INREG and X86 custom DAG combined it. The same doesn't happen for the high half. So we'd load a constant and then shuffle it. But the low half would just be loaded and used by the multiply directly.
After this patch we now end up with a constant pool entry for the low and high unpacks separately with no shuffle operations.
This is a step towards removing custom constant folding for ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG in the X86 backend.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55165
llvm-svn: 348159
Summary:
Under -x86-experimental-vector-widening-legalization, fp_to_uint/fp_to_sint with a smaller than 128 bit vector type results are custom type legalized by promoting the result to a 128 bit vector by promoting the elements, inserting an assertzext/assertsext, then truncating back to original type. The truncate will be further legalizdd to a pack shuffle. In the case of a v8i8 result type, we'll end up with a v8i16 fp_to_sint. This will need to be further legalized during vector op legalization by promoting to v8i32 and then truncating again. Under avx2 this produces good code with two pack instructions, but Under avx512 this will result in a truncate instruction and a packuswb instruction. But we should be able to get away with a single truncate instruction.
The other option is to promote all the way to vXi32 result type during the first type legalization. But in some experimentation that seemed to require more work to produce good code for other configurations.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54836
llvm-svn: 348158
Previously this code generated its own extracts and build_vector. But we can use a simpler concat_vectors or scalar_to_vector operation and let type legalization do additional legalization of those operations.
llvm-svn: 348087
The generic legalizer will fall back to a stack spill that uses a truncating store. That store will get expanded into a shuffle and non-truncating store on pre-avx512 targets. Once that happens the stack store/load pair will be combined away leaving behind the shuffle and bitcasts. On avx512 targets the truncating store is legal so doesn't get folded away.
By custom legalizing it we can avoid this churn and maybe produce better code.
llvm-svn: 348085
Summary: With sse4.1 we use two zero_extend_vector_inreg and a pshufd to expand the v16i8 input into two v8i16 vectors for the multiply. That's 3 shuffles to extend one operand. The other operand is usually constant as this is mostly used by division by constant optimization. Pre sse4.1 we use a punpckhbw and a punpcklbw with a zero vector. That's two shuffles and an xor and a copy due to tied register constraints. That seems maybe better than the 3 shuffles. With AVX we avoid the copy so that's obviously better.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55138
llvm-svn: 348079
This reduces the number of shuffle operations that need to be done. The splitting strategy requires the shuffle unit for the extraction and the extension. With the unpack strategy the unpacks accomplish a splitting and extending in one operation.
llvm-svn: 348019
This does require a constant pool load instead of loading an immediate into a gpr, moving to a k register and masking. But its less instructions and more consistent with previous ISAs. It probably opens up more combine opportunities as one of the test cases demonstrates.
llvm-svn: 348018
Previously we emitted a punpcklbw/punpckhbw to move the byte elements into the upper half of 16 bit elements then shifted right by 8 to zero the upper bits. After DAG combine we end up with punpcklbw/punpckhbw into the lower bits with zeros in the uppers bits and no shifts. So just emit that directly.
llvm-svn: 347966
We had a EVT variable capturing the result of getSimpleValueType which returns an MVT. Another place using EVT that could have been MVT. And an 'int' that should be 'unsigned'.
llvm-svn: 347959
Summary:
Suppressed warnings in release builds due to variable used
only in assert statement.
Subscribers: llvm-commits, eraman, mgorny
Differential Revision: https://reviews.llvm.org/D55100
llvm-svn: 347939
I believe we should be legalizing these with the rest of vector binary operations. If any custom lowering is required for these nodes, this will give the DAG combine between LegalizeVectorOps and LegalizeDAG to run on the custom code before constant build_vectors are lowered in LegalizeDAG.
I've moved MULHU/MULHS handling in AArch64 from Lowering to isel. Moving the lowering earlier caused build_vector+extract_subvector simplifications to kick in which made the generated code worse.
Differential Revision: https://reviews.llvm.org/D54276
llvm-svn: 347902
This is another patch for -x86-experimental-vector-widening. This pre widens narrow division by constants so that we can get pass the legal type check in the generic DAG combiner. Otherwise we end up scalarizing.
I've restricted this to splats for now because it was easy to just call DAG.getConstant. Not sure what we should do for non-splat? Increase the element size?Widen the constant vector by padding with 1?
Differential Revision: https://reviews.llvm.org/D54919
llvm-svn: 347898
It causes asserts building BoringSSL. See https://crbug.com/91009#c3 for
repro.
This also reverts the follow-ups:
Revert r347724 "Do not insert prefetches with unsupported memory operands."
Revert r347606 "[X86] Add dependency from X86 to ProfileData after rL347596"
Revert r347607 "Add new passes to X86 pipeline tests"
llvm-svn: 347864
This patch adds the ability to specify via tablegen which processor resources
are load/store queue resources.
A new tablegen class named MemoryQueue can be optionally used to mark resources
that model load/store queues. Information about the load/store queue is
collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to
initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and
`StoreQueueID`. Those two fields are identifiers for buffered resources used to
describe the load queue and the store queue.
Field `BufferSize` is interpreted as the number of entries in the queue, while
the number of units is a throughput indicator (i.e. number of available pickers
for loads/stores).
At construction time, LSUnit in llvm-mca checks for the presence of extra
processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If
that information is available, and fields LoadQueueID and StoreQueueID are set
to a value different than zero (i.e. the invalid processor resource index), then
LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value
declared by the two processor resources.
With this patch, we more accurately track dynamic dispatch stalls caused by the
lack of LS tokens (i.e. load/store queue full). This is also shown by the
differences in two BdVer2 tests. Stalls that were previously classified as
generic SCHEDULER FULL stalls, are not correctly classified either as "load
queue full" or "store queue full".
About the differences in the -scheduler-stats view: those differences are
expected, because entries in the load/store queue are not released at
instruction issue stage. Instead, those are released at instruction executed
stage. This is the main reason why for the modified tests, the load/store
queues gets full before PdEx is full.
Differential Revision: https://reviews.llvm.org/D54957
llvm-svn: 347857
This failed to select (which might be a separate bug) in
X86ISelDAGToDAG because we try to create a select node
that can be simplified away after rL347227.
This change avoids the problem by simplifying the SHRUNKBLEND
node sooner. In the test case, we manage to realize that the
true/false values of the select (SHRUNKBLEND) are the same thing,
so it simplifies away completely.
llvm-svn: 347818
Unlike most cost model functions this code makes a lot of table lookups without using the results from getTypeLegalizationCost. This means 512-bit vectors can be looked up even when the type isn't legal.
This patch adds a check around the two tables that contain 512-bit types to make sure that neither of the types would be split by type legalization. Meaning 512 bit types are illegal. I wanted to write this in a somewhat generic way that uses type legalization query hooks. But if prefered, I can switch to just using is512BitVector and the subtarget feature.
Differential Revision: https://reviews.llvm.org/D54984
llvm-svn: 347786
This fixes some of scalarization costs reported for sext/zext using avx512bw. This does not fix all scalarization costs being reported. Just the worst.
I've restricted this only to combinations of types that are legal with avx512bw like v32i1/v64i1/v32i16/v64i8 and conversions between vXi1 and vXi8/vXi16 with legal vXi8/vXi16 result types.
Differential Revision: https://reviews.llvm.org/D54979
llvm-svn: 347785
Expansion of SIGN_EXTEND_INREG can create a VSRAI instruction. If there is already a VSRAI after it, we should combine them into a larger VSRAI
Differential Revision: https://reviews.llvm.org/D54959
llvm-svn: 347784
Currently, instructions doing memory accesses through a base operand that is
not a register can not be analyzed using `TII::getMemOpBaseRegImmOfs`.
This means that functions such as `TII::shouldClusterMemOps` will bail
out on instructions using an FI as a base instead of a register.
The goal of this patch is to refactor all this to return a base
operand instead of a base register.
Then in a separate patch, I will add FI support to the mem op clustering
in the MachineScheduler.
Differential Revision: https://reviews.llvm.org/D54846
llvm-svn: 347746
We're already mixing this APInt with other 'unsigned' variables. This allows us to use regular comparison operators instead of needing to use APInt::ult or APInt::uge. And it removes a later conversion from APInt to unsigned.
I might be adding another combine to this function and this will probably simplify the logic required for that.
llvm-svn: 347684
This is skylake-avx512 with the addition of avx512vnni ISA.
Patch by Jianping Chen
Differential Revision: https://reviews.llvm.org/D54785
llvm-svn: 347681
If we fold the bitcast into the store we'll end up creating a truncating store to vXi1 that will get scalarized. Instead allow the bitcast to be turned into a movmsk.
We probably need to do something if the store itself is a vXi1 type, but I'll leave that til a testcase appears.
llvm-svn: 347632
Summary:
Support for profile-driven cache prefetching (X86)
This change is part of a larger system, consisting of a cache prefetches recommender, create_llvm_prof (https://github.com/google/autofdo), and LLVM.
A proof of concept recommender is DynamoRIO's cache miss analyzer. It processes memory access traces obtained from a running binary and identifies patterns in cache misses. Based on them, it produces a csv file with recommendations. The expectation is that, by leveraging such recommendations, we can reduce the amount of clock cycles spent waiting for data from memory. A microbenchmark based on the DynamoRIO analyzer is available as a proof of concept: https://goo.gl/6TM2Xp.
The recommender makes prefetch recommendations in terms of:
* the binary offset of an instruction with a memory operand;
* a delta;
* and a type (nta, t0, t1, t2)
meaning: a prefetch of that type should be inserted right before the instrution at that binary offset, and the prefetch should be for an address delta away from the memory address the instruction will access.
For example:
0x400ab2,64,nta
and assuming the instruction at 0x400ab2 is:
movzbl (%rbx,%rdx,1),%edx
means that the recommender determined it would be beneficial for a prefetchnta instruction to be inserted right before this instruction, as such:
prefetchnta 0x40(%rbx,%rdx,1)
movzbl (%rbx, %rdx, 1), %edx
The workflow for prefetch cache instrumentation is as follows (the proof of concept script details these steps as well):
1. build binary, making sure -gmlt -fdebug-info-for-profiling is passed. The latter option will enable the X86DiscriminateMemOps pass, which ensures instructions with memory operands are uniquely identifiable (this causes ~2% size increase in total binary size due to the additional debug information).
2. collect memory traces, run analysis to obtain recommendations (see above-referenced DynamoRIO demo as a proof of concept).
3. use create_llvm_prof to convert recommendations to reference insertion locations in terms of debug info locations.
4. rebuild binary, using the exact same set of arguments used initially, to which -mllvm -prefetch-hints-file=<file> needs to be added, using the afdo file obtained at step 3.
Note that if sample profiling feedback-driven optimization is also desired, that happens before step 1 above. In this case, the sample profile afdo file that was used to produce the binary at step 1 must also be included in step 4.
The data needed by the compiler in order to identify prefetch insertion points is very similar to what is needed for sample profiles. For this reason, and given that the overall approach (memory tracing-based cache recommendation mechanisms) is under active development, we use the afdo format as a syntax for capturing this information. We avoid confusing semantics with sample profile afdo data by feeding the two types of information to the compiler through separate files and compiler flags. Should the approach prove successful, we can investigate improvements to this encoding mechanism.
Reviewers: davidxl, wmi, craig.topper
Reviewed By: davidxl, wmi, craig.topper
Subscribers: davide, danielcdh, mgorny, aprantl, eraman, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D54052
llvm-svn: 347596
SplitVecOp_TruncateHelper tries to promote the result type while splitting FP_TO_SINT/UINT. It then concatenates the result and introduces a truncate to the original result type. But it does this without inserting the AssertZExt/AssertSExt that the regular result type promotion would insert. Nor does it turn FP_TO_UINT into FP_TO_SINT the way normal result type promotion for these operations does. This is bad on X86 which doesn't support FP_TO_SINT until AVX512.
This patch disables the use of SplitVecOp_TruncateHelper for these operations and just lets normal promotion handle it. I've tweaked a couple things in X86ISelLowering to avoid a few obvious regressions there. I believe all the changes on X86 are improvements. The other targets look neutral.
Differential Revision: https://reviews.llvm.org/D54906
llvm-svn: 347593
Summary:
Add a hook to the GCMetadataPrinter for emitting stack maps in
custom format. The hook will be called at stack map generation
time. The default stack map format is used if there is no hook.
For this to be useful a few data structures and accessors are
exposed from the StackMaps class, so the custom printer can
access the stack map data.
This patch authored by Cherry Zhang <cherryyz@google.com>.
Reviewers: thanm, apilipenko, reames
Reviewed By: reames
Subscribers: reames, apilipenko, nemanjai, javed.absar, kbarton, jsji, llvm-commits
Differential Revision: https://reviews.llvm.org/D53892
llvm-svn: 347584
We have these 2 "isDesirable" promotion hooks (I'm not sure why we need both of them, but that's
independent of this patch), and we can adjust them to promote "mul i8 X, C" to i32. Then, all of
our existing LEA and other multiply expansion magic happens as it would for i32 ops.
Some of the test diffs show that we could end up with an actual 32-bit mul instruction here
because we choose not to expand to simpler ops. That instruction could be slower depending on the
subtarget. On the plus side, this means we don't need a separate instruction to load the constant
operand and possibly an extra instruction to move the result. If we need to tune mul i32 further,
we could add a later transform that tries to shrink it back to i8 based on subtarget timing.
I did not bother to duplicate all of the 32-bit test file RUNs and target settings that exist to
test whether LEA expansion is cheap or not. The diffs here assume a default target, so that means
LEA is generally cheap.
Differential Revision: https://reviews.llvm.org/D54803
llvm-svn: 347557
This should likely be adjusted to limit this transform
further, but these diffs should be clear wins.
If we have blendv/conditional move, then we should assume
those are cheap ops. The loads become independent of the
compare, so those can be speculated before we need to use
the values in the blend/mov.
llvm-svn: 347526
These are AVX2 instructions, but have been incorrectly marked in tablegen for a while. This wasn't a problem until r346784 switched the patterns to use target independent ISD opcodes. This made the patterns visible to fast isel.
Fixes PR39733
llvm-svn: 347375
We can't guarantee that demanded bits passing through the vector shuffle won't cause the AND in front of this to be removed. This would prevent the PACKUS from being matched during shuffle lowering.
Unfortunately, this adds a packuswb to one of the vector-reduce-mul.ll tests since we were removing the shuffle via SimplifyDemandedVectorElts. We appear to have similar issues with vpmovwb on the same test case on other targets.
llvm-svn: 347361
Previously we emitted to separate shuffles, one for unpcklbw and one for unpcklwd. Instead emit a single shuffle equivalent to both of the original shuffles. Shuffle lowering seems able to handle it. This avoids a bitcast between the two shuffles which seems helpful to DAG combine.
Remove the custom type legalization for v8i8->v8i32. I had put that in to avoid some almost duplicate punpcklbw instructions I was seeing, but this lowering change seems to fix that. It also fixes some duplicate shuffles seen in vector-sext.ll
llvm-svn: 347348
Pull out getPackDemandedElts demanded elts remapping helper from computeKnownBitsForTargetNode and use in computeKnownBits/ComputeNumSignBits.
llvm-svn: 347303
Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output.
As near as I can tell this makes v16i8 behavior consistent with every other VT now.
This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types.
llvm-svn: 347296
getZeroVector produces a specifically canonicalized zero vector, but we can just let DAG legalization take care of it.
The test changes are because MULH lowering happens later than it should and this change gave us the opportunity to constant fold away a multiply during a DAG combine before the build_vector got legalized with a bitcast.
llvm-svn: 347290
We're seeing some issues internally where we sent some intrinsics into the cost model that the getTypeLegalizationCost call fails on, but X86 specific tables don't care about. Our base class implementation takes care of them. We'd just like X86 backend to ignore them.
This patch makes sure the switch returned something X86 cares about and skips the table lookups and type legalization call if not. Probably more efficient too since we don't go scanning the tables for every intrinsic we could possibly see.
Differential Revision: https://reviews.llvm.org/D54711
llvm-svn: 347248
SSE PSHUFB vector ctlz lowering works at the i4 nibble level. As detailed in PR39703, we were masking the lower nibble off but we only actually use it in the case where the upper nibble is known to be zero, making it safe to remove the mask and save an instruction.
Differential Revision: https://reviews.llvm.org/D54707
llvm-svn: 347242
Previously we split the vectors in half to allow the two halves to be any extended then concatenated the results back together.
This patch instead instead extends the v16i8 sse algorithm to extend half of each 128-bit lane using punpcklbw/punpckhbw. Multiplies all the low half lanes and high half lanes together in separate operations. Then merges the half lane results back together using packuswb.
Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split.
Differential Revision: https://reviews.llvm.org/D54668
llvm-svn: 347240
The shift requires a copy to avoid clobbering a register. Comparing with 0 uses an xor to produce 0 that will be overwritten with the compare results. So still requires 2 instructions, but should be one byte shorter since it doesn't need to encode an immediate.
llvm-svn: 347185
Previously we used an arithmetic shift right by 31, but that requires a copy to preserve the input. So we might as well materialize a zero and compare to it since the comparison will overwrite the register that contains the zeros. This should be one byte shorter.
llvm-svn: 347181
Leave just the v4i8->v4i64 and v8i8->v8i64, but only enable them on pre-sse4.1 targets when 64-bit mode is enabled. In those cases we end up creating sext loads that get scalarized to code that looks better than what we get from loading into a vector register and doing a multiple step sign extend using unpacks and shifts.
llvm-svn: 347180
Pre-SSE4.1 sext_invec for v2i64 is complicated because we don't have a v2i64 sra instruction. So instead we sign extend to i32 using unpack and sra, then copy the elements and do a v4i32 sra to fill with sign bits, then interleave the i32 sign extend and the sign bits. So really we're doing to two sign extends but only using half of the v4i32 intermediate result.
When the result is more than 128 bits, default type legalization would prefer to split the destination type all the way down to v2i64 with shuffles followed by v16i8/v8i16->v2i64 sext_inreg operations. This results in more instructions than necessary because we are only utilizing the lower 2 elements of the v4i32 intermediate result. Instead we can custom split a v4i8/v4i16->v4i64 sign_extend. Then we can sign extend v4i8/v4i16->v4i32 invec producing a full v4i32 result. Create the sign bit vector as a v4i32 then split and interleave with the sign bits using an punpackldq and punpackhdq.
llvm-svn: 347176
If we widen illegal types instead of promoting, we should be able to rely on the type legalizer to create the vector_inreg operations for us with some caveats.
This patch disables combineToExtendVectorInReg when we are using widening.
I've enabled custom legalization for v8i8->v8i64 extends under avx512f since the type legalizer would want to create a vector_inreg with a v64i8 input type which isn't legal without avx512bw. So we go to v16i8 with custom code using the relaxation of rules we get from D54346.
I've also enable custom legalization of v8i64 and v16i32 operations with with AVX. When the input type is 128 bits, the default splitting legalization would extend first 128->256, then do the a split to two 128 pieces. Extend each half to 256 and then concat the result. The custom legalization I've added instead uses a 128->256 bit vector_inreg extend that only reads the lower 64-bits for the low half of the split. Then shuffles the high 64-bits to the low 64-bits and does another vector_inreg extend.
llvm-svn: 347172
Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54671
llvm-svn: 347171
Refactor towards making this recursive (necessary for PR38243 rotation splat detection).
IsSplatVector returns the original vector source of the splat and the splat index.
GetSplatValue returns the scalar splatted value as an extraction from IsSplatVector.
llvm-svn: 347168
We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs.
llvm-svn: 347158
The zero extend will require two stages of unpacks to implement. So its better to shrink the multiply using pmullw and then extend that result back to v4i32 using a single unpack.
llvm-svn: 347149
This tries to force the result type to vXi32 followed by a truncate. This can help avoid scalarization that would otherwise occur.
There's some annoying examples of an avx512 truncate instruction followed by a packus where we should really be able to just use one truncate. But overall this is still a net improvement.
llvm-svn: 347105
Summary:
As discussed in previous review, and noted in the FIXME, if `X` is actually an `lshr Y, Z` (logical!),
we can fold the `Z` into 'control`, and let the `BEXTR` do this too.
We could just insert those 8 bits of shift amount into control,
but it is better to instead zero-extend them, and 'or' them in place.
We can only do this for `lshr`, not `ashr`, because we do not know that the mask cover only the bits of `Y`,
and not any of the sign-extended bits.
The obvious question is, is this actually legal to do?
I believe it is. Relevant quotes, from `Intel® 64 and IA-32 Architectures Software Developer’s Manual`, `BEXTR — Bit Field Extract`:
* `Bit 7:0 of the second source operand specifies the starting bit position of bit extraction.`
* `A START value exceeding the operand size will not extract any bits from the second source operand.`
* `Only bit positions up to (OperandSize -1) of the first source operand are extracted.`
* `All higher order bits in the destination operand (starting at bit position LENGTH) are zeroed.`
* `The destination register is cleared if no bits are extracted.`
FIXME: if we can do this, i wonder if we should prefer `BEXTR` over `BZHI` in such cases.
Reviewers: RKSimon, craig.topper, spatel, andreadb
Reviewed By: RKSimon, craig.topper, andreadb
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54095
llvm-svn: 347048
By early promoting the multiply to use an i16 element type we can avoid op legalization emit a second multiply for the 8 upper elements of the v16i8 type we would otherwise get.
llvm-svn: 347032
We aren't going to use the upper bits of the multiply result that the extend would effect. So we don't need a specific type of extend.
This makes some reduction test cases shorter because we were previously trying to sign_extend a truncate which we can't eliminate.
llvm-svn: 347011
Removing this code doesn't affect any lit tests so it doesn't appear to be tested anymore. I assume it was when it was added, but I guess something else changed? Code coverage report also says its unused.
I mostly didn't like that it seemed to count the sign bits as if it was a sign_extend, but then set isPositive as if it was a zero_extend. It feels like we should have picked one interpretation?
Differential Revision: https://reviews.llvm.org/D54596
llvm-svn: 346995
Use unsigned to calculate the subvector index to avoid a cast.
Remove an unnecessary condition and replace it with a stronger assert.
Use the InVT variable we updated when we extracted instead of grabbing it from the In SDValue.
llvm-svn: 346983
In reduceVMULWidth, we no longer need to worry about extending the vector to 128 bits first. Regular widening of extends, muls and shuffles will take care of that for us.
In combineMulToPMADDWD, we can handle v2i32 multiplies and allow the VPMADDWD to be widened to v4i32 during type legalization by adding custom widening like we do have for AVG/ADDUS/SUBUS. I had to modify that code a little to allow different and output VTs.
Differential Revision: https://reviews.llvm.org/D54512
llvm-svn: 346980
This fixes -filetype=null support when compiling for a Win32 target and the module has a CodeView flag.
The only places changed are the uses of getTargetStreamer function - this patch guards both of them with null checks.
Committed on behalf of @eush (Eugene Sharygin)
Differential Revision: https://reviews.llvm.org/D54008
llvm-svn: 346962
This avoids some nasty shuffles when we have avx512. It will also prevent using zmm truncate instructions when a ymm instruction that zeroes part of an xmm register will do. Also avoid using avx512 truncate instructions when the input is 128 bits or less. These instructions are 2 uops on skx so we can probably find a better single uop shuffle like pshufb.
llvm-svn: 346936
The narrow types end up requesting widening, but generic legalization will end up scalaring and using a build_vector to do the widening.
llvm-svn: 346916
On 64-bit targets the type legalizer will use i64 to legalize these. But when i64 isn't legal, the type legalizer won't try an FP type. So do it manually instead.
There are a few regressions in here due to some v2i32 operations like mul and div now being reassembled into a full vector just to store instead of storing the pieces. But this was already occuring in 64-bit mode so its not a new issue.
llvm-svn: 346908
The machine scheduler currently biases register copies to/from
physical registers to be closer to their point of use / def to
minimize their live ranges. This change extends this to also physical
register assignments from immediate values.
This causes a reduction in reduction in overall register pressure and
minor reduction in spills and indirectly fixes an out-of-registers
assertion (PR39391).
Most test changes are from minor instruction reorderings and register
name selection changes and direct consequences of that.
Reviewers: MatzeB, qcolombet, myatsina, pcc
Subscribers: nemanjai, jvesely, nhaehnle, eraman, hiraditya,
javed.absar, arphaman, jfb, jsji, llvm-commits
Differential Revision: https://reviews.llvm.org/D54218
llvm-svn: 346894
Narrower vectors will be widened to 128 bits without changing the element size. And generic type legalization can already handle widening mulhu/mulhs.
Differential Revision: https://reviews.llvm.org/D54513
llvm-svn: 346879
Add support for the expansion of funnelshift/rotates to getIntrinsicInstrCost.
This also required us to move the X86 fshl/fshr costs to the same place as the rotates to avoid expansion and get correct scalarization vs vectorization costs.
llvm-svn: 346854
This patch removes the last use of the constant pool shuffle decode helper and consistently uses the 'getTargetShuffleMaskIndices' versions instead. The constant pool versions are now purely used for assembly comments.
The avx512vbmi intrinsic upgrades had to be altered as they were being decoded as broadcasts, similar to what I fixed in rL346032. I don't think the change is critical - although its annoying that we lose the {k}{z} instruction test coverage as they are tricky to generate....
Differential Revision: https://reviews.llvm.org/D54083
llvm-svn: 346850
Previously, the extend_vector_inreg opcode required their input register to be the same total width as their output. But this doesn't match up with how the X86 instructions are defined. For X86 the input just needs to be a legal type with at least enough elements to cover the output.
This patch weakens the check on these nodes and allows them to be used as long as they have more input elements than output elements. I haven't changed type legalization behavior so it will still create them with matching input and output sizes.
X86 will custom legalize these nodes by shrinking the input to be a 128 bit vector and once we've done that we treat them as legal operations. We still have one case during type legalization where we must custom handle v64i8 on avx512f targets without avx512bw where v64i8 isn't a legal type. In this case we will custom type legalize to a *extend_vector_inreg with a v16i8 input. After that the input is a legal type so type legalization should ignore the node and doesn't need to know about the relaxed restriction. We are no longer allowed to use the default expansion for these nodes during vector op legalization since the default expansion uses a shuffle which required the widths to match. Custom legalization for all types will prevent us from reaching the default expansion code.
I believe DAG combine works correctly with the released restriction because it doesn't check the number of input elements.
The rest of the patch is changing X86 to use either the vector_inreg nodes or the regular zero_extend/sign_extend nodes. I had to add additional isel patterns to handle any_extend during isel since simplifydemandedbits can create them at any time so we can't legalize to zero_extend before isel. We don't yet create any_extend_vector_inreg in simplifydemandedbits.
Differential Revision: https://reviews.llvm.org/D54346
llvm-svn: 346784
This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single input permute to put them in the right location.
Differential Revision: https://reviews.llvm.org/D54267
llvm-svn: 346706
Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis.
This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not.
llvm-svn: 346697
When we repeat the 2 shifting operands then this is a bit rotation - annoyingly this has to be done in the other getIntrinsicInstrCost than most intrinsics as we need to check the operands are the same.
llvm-svn: 346688