Normally when the nodes we use here(AND32ri8 for example) are selected their
immediates are just converted from ConstantSDNode to TargetConstantSDNode
without changing VT from the original operation VT. So we should still be
emitting them with the operation VT.
Theoretically this could expose more accurate opportunities for CSE.
llvm-svn: 356869
We were using this to create an AND32ri8 node from a 64-bit and, but that node
normally still uses a 32-bit immediate. So we should just truncate the existing
immediate to i32. We already verified it has the same value in bits 31:7.
llvm-svn: 356868
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.
This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
llvm-svn: 356864
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).
Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.
llvm-svn: 356858
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613
uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0
We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
Differential Revision: https://reviews.llvm.org/D59006
llvm-svn: 356855
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32 bit halves together. If we have xmm registers we can use use those to implement the operation instead. This results in less instructions then doing two separate 32-bit popcnt sequences.
This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
Differential Revision: https://reviews.llvm.org/D59662
llvm-svn: 356808
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.
I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.
So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?
I've left some FIXMEs for improvements we can do without SSE2.
Differential Revision: https://reviews.llvm.org/D59679
llvm-svn: 356807
Under optsize we try to avoid folding immediates into instructions under optsize. But if the immediate is 16-bits or 32 bits, but can be encoded as an 8-bit immediate we don't save enough from disabling the folding unless the immediate has enough uses to make up for the size of the move which is either 3 bytes or 5 bytes since there are no sign extended 8-bit moves. We would also save something if the immediate was a live out of the basic block and thus a move was unavoidable, but that would require a more advanced heuristic than just counting uses.
Note we only avoid folding multiple use immediates into the patterns that use X86ISD::ADD/SUB/XOR/OR/AND/CMP/ADC/SBB nodes and not the more common ISD::ADD/SUB/XOR/OR/AND nodes.
Differential Revision: https://reviews.llvm.org/D59522
llvm-svn: 356688
This adds support for scalarizing these intrinsics as well the X86TargetTransformInfo support to avoid scalarizing them in the cases X86 can handle.
I've omitted handling special cases for constant masks for this first pass. Though CodeGenPrepare can constant fold the branch conditions and remove some of the control flow anyway.
Fixes PR40994 and is covers most of PR3666. Might want to implement constant masks to close that.
Differential Revision: https://reviews.llvm.org/D59180
llvm-svn: 356687
CMPXCHG8B was introduced on i586/pentium generation.
If its not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.
Differential Revision: https://reviews.llvm.org/D59576
llvm-svn: 356631
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immedate, GPR to k-register, and masked op.
I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.
This trades a load for the move immediate and GPR move which is higher latency.
But its probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.
Differential Revision: https://reviews.llvm.org/D59479
llvm-svn: 356618
This patch removes the following dag node opcodes from namespace X86ISD:
RDTSC_DAG,
RDTSCP_DAG,
RDPMC_DAG
The logic that expands RDTSC/RDPMC/XGETBV intrinsics is basically the same. The
only differences are:
RDTSC/RDTSCP don't implicitly read ECX.
RDTSCP also implicitly writes ECX.
I moved the common expansion logic into a helper function with the goal to get
rid of code repetition. That helper is now used for the expansion of
RDTSC/RDTSCP/RDPMC/XGETBV intrinsics.
No functional change intended.
Differential Revision: https://reviews.llvm.org/D59547
llvm-svn: 356546
These changes are related to PR37743 and include:
SelectionDAGBuilder::visitSelect handles the unary SelectPatternFlavor::SPF_ABS case to build ABS node.
Delete the redundant recognizer of the integer ABS pattern from the DAGCombiner.
Add promoting the integer ABS node in the LegalizeIntegerType.
Expand-based legalization of integer result for the ABS nodes.
Expand-based legalization of ABS vector operations.
Add some integer abs testcases for different typesizes for Thumb arch
Add the custom ABS expanding and change the SAD pattern recognizer for X86 arch: The i64 result of the ABS is expanded to:
tmp = (SRA, Hi, 31)
Lo = (UADDO tmp, Lo)
Hi = (XOR tmp, (ADDCARRY tmp, hi, Lo:1))
Lo = (XOR tmp, Lo)
The "detectZextAbsDiff" function is changed for the recognition of pattern with the ABS node. Given a ABS node, detect the following pattern:
(ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
Change integer abs testcases for codegen with the ABS node support for AArch64.
Indicate that the ABS is legal for the i64 type when the NEON is supported.
Change the integer abs testcases to show changing of codegen.
Add combine and legalization of ABS nodes for Thumb arch.
Extend 'matchSelectPattern' to recognize the ABS patterns with ICMP_SGE condition.
For discussion, see https://bugs.llvm.org/show_bug.cgi?id=37743
Patch by: @ikulagin (Ivan Kulagin)
Differential Revision: https://reviews.llvm.org/D49837
llvm-svn: 356468
This allows better code size for aarch64 floating point materialization
in a future patch.
Reviewers: evandro
Differential Revision: https://reviews.llvm.org/D58690
llvm-svn: 356389
These are used to help convert OR->LEA when needed to avoid avoid a copy. They
aren't need after register allocation.
Happens to remove an ugly goto from X86MCCodeEmitter.cpp
llvm-svn: 356356
The only thing the print methods currently need to know is the string to print for the memory size in intel syntax.
This patch merges the functions based on this string. If we ever need something else in the future, its easy to split them back out.
This reduces the number of cases in the assembly printers. It shrinks the intel printer to only use 7 bytes per instruction instead of 8.
llvm-svn: 356352
Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.
The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".
The _alt form on the other hand parsed and printed like any other instruction with no specialness.
With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.
I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.
I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.
Differential Revision: https://reviews.llvm.org/D59398
llvm-svn: 356343
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780
If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits, andreadb
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59412
llvm-svn: 356301
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.
Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.
Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.
llvm-svn: 356294
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.
llvm-svn: 356292
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering, and simply requires the actual load or store to be done as a single well aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.
Differential Revision: https://reviews.llvm.org/D57803
llvm-svn: 356280
These instructions used to use rotl with a bitwidth-1 immediate. I changed the immediate to 1,
but failed to change the opcode.
Thankfully this seems to have not caused a functional issue because we now had two rotl by 1 patterns,
but the correct ones were earlier and took priority. So we just missed some optimization.
llvm-svn: 356164
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.
llvm-svn: 356158
Prior to the introduction of funnel shift intrinsics we could count on rotate
by immediates prefering to use rotl since that's what MatchRotate would check
first. The or+shift pattern doesn't have a direction so one must be chosen
arbitrarily.
With funnel shift, there is a direction and fshr will try to use rotr first.
While fshl will try to use rotl first.
This patch adds the isel patterns for rotr to complement the rotl patterns. I've
put the rotr by 1 patterns in the instruction patterns. And moved the rotl by
bitwidth-1 patterns to separate Pat patterns.
Fixes PR41057.
llvm-svn: 356121
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.
We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.
I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.
Differential Revision: https://reviews.llvm.org/D59308
llvm-svn: 356078
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.
This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.
The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.
llvm-svn: 356064
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700
The bug was introduced recently here:
rL355741
This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8).
llvm-svn: 356053
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.
This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.
Differential Revision: https://reviews.llvm.org/D58888
llvm-svn: 356043
Add break statements in Object/ELF.cpp since the code should consider the
generic tags for Hexagon, MIPS, and PPC. Add a test (copied from llvm-readobj)
to show that this works correctly (earlier versions of this patch would have
asserted).
The warnings in X86ELFObjectWriter.cpp are actually false-positives since
the nested switch() handles all possible values and returns in all cases.
Make this explicit by adding llvm_unreachable's.
Differential Revision: https://reviews.llvm.org/D58837
llvm-svn: 356037
A faulting_op is one that has specified behavior when a fault occurs, generally redirecting control flow to another location. This change just adds a comment to the assembly output which makes it both human readable, and machine checkable w/o having to parse the FaultMap section. This is used to split a test file into two parts, so that I can (in a near future commit) easily extend the test file to demonstrate another case.
llvm-svn: 355982
ProcFeatures was a class that just concatenated two feature lists together and gave it a name. We used it to inherit features between CPUs.
ProcModel took a two CPU feature lists and concatenated them before deferring to ProcessorModel. This was to allow inherited features and specific features to be passed to each CPU.
Both of these allowed for only very rigid CPU inheritance rules.
With this patch we now store all of the lists we were using for inheritance in one object and do any list oncatenation we want there. Then we just pass whatever list we want from this class into the ProcessorModel class for each CPU.
Hopefully this gives us more flexibility to build up feature lists in whatever ways we think make sense. Perhaps untangling ISA flags and tuning flags.
I've only touched the CPUs that were directly affected by the removal of the ProcModel and ProcFeatures classes. We should move more of the feature lists into ProcessorFeatures.
llvm-svn: 355872
AMDGPU target run out of Subtarget feature flags hitting the limit of 64.
AssemblerPredicates uses at most uint64_t for their representation.
At the same time CodeGen has exhausted this a long time ago and switched
to a FeatureBitset with the current limit of 192 bits.
This patch completes transition to the bitset for feature bits extending
it to asm matcher and MC code emitter.
Differential Revision: https://reviews.llvm.org/D59002
llvm-svn: 355839
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.
This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.
llvm-svn: 355799
We had patterns using X86ISD::SCALAR_SINT_TO_FP_RND/SCALAR_UINT_TO_FP_RND for
these instructions. There's nothing to round. Instead, we use a regular
sint_to_fp/uint_to_fp and a movsd as the pattern for these.
llvm-svn: 355796
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.
With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.
llvm-svn: 355789
Includes a fix to emit a CheckOpcode for build_vector when immAllZerosV/immAllOnesV is used as a pattern root. This means it can't be used to look through bitcasts when used as a root, but that's probably ok. This extra CheckOpcode will ensure that the first match in the isel table will be a SwitchOpcode which is needed by the caching optimization in the ISel Matcher.
Original commit message:
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355784
An extension of D58282 noted in PR39665:
https://bugs.llvm.org/show_bug.cgi?id=39665
This doesn't answer the request to use movmsk, but that's an
independent problem. We need this and probably still need
scalarization of FP selects because we can't do that as a
target-independent transform (although it seems likely that
targets besides x86 should have this transform).
llvm-svn: 355741
We were just checking pointer size and type primitive size. But this caused unintended things like vectors of half being accepted by masked load/store.
For FP we now explicitly check for only double and float.
For pointers we now let any pointer through. Trusting that only 32 and 64 would be used to generate assembly.
We only check bitwidth after checking that the type is an integer.
llvm-svn: 355667
Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not.
The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that.
RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd.
Differential Revision: https://reviews.llvm.org/D59077
llvm-svn: 355636
Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms.
Differential Revision: https://reviews.llvm.org/D59058
llvm-svn: 355635
Summary:
ShadowCallStack on x86_64 suffered from the same racy security issues as
Return Flow Guard and had performance overhead as high as 13% depending
on the benchmark. x86_64 ShadowCallStack was always an experimental
feature and never shipped a runtime required to support it, as such
there are no expected downstream users.
Reviewers: pcc
Reviewed By: pcc
Subscribers: mgorny, javed.absar, hiraditya, jdoerfert, cfe-commits, #sanitizers, llvm-commits
Tags: #clang, #sanitizers, #llvm
Differential Revision: https://reviews.llvm.org/D59034
llvm-svn: 355624
Move the x86 combine from D58974 into the DAGCombine VSELECT code and update the SELECT version to use the isBooleanFlip helper as well.
Requested by @spatel on D59006
llvm-svn: 355533
As noticed on D58965
DAGCombiner::visitSELECT has something similar, so we should be able to move this to DAGCombiner and support VSELECT as well at some point.
Differential Revision: https://reviews.llvm.org/D58974
llvm-svn: 355494
This allows us to use an 8-bit sign extended immediate instead of a 16 or 32 bit immediate.
Also do similar for 0x80000000 with 64-bit adds to avoid having to use a movabsq.
llvm-svn: 355485
128 won't fit in a sign extended 8-bit immediate, but we can negate it to -128 and use the other operation. This results in a shorter encoding since the move would have used 16 or 32 bits for the immediate.
llvm-svn: 355484
This caused the first matcher in the isel table for many targets to Opc_Scope instead of Opc_SwitchOpcode. This leads to a significant increase in isel match failures.
llvm-svn: 355433
The variable X86DomainReassignment::EnclosedEdges is used to store registers that have been enclosed in some closure, so those registers will be ignored when create new closures. But there is no registers has ever been put into this set, so a single register can be enclosed in multiple closures, it significantly increase compile time.
This patch adds a register into EnclosedEdges when it is enclosed into a closure.
Differential Revision: https://reviews.llvm.org/D58646
llvm-svn: 355430
We already do this for 16/32/64 as well as 8-bit add with register/immediate. Might as well do it for 8-bit INC/DEC too.
Differential Revision: https://reviews.llvm.org/D58869
llvm-svn: 355424
We already support 8-bits adds in convertToThreeAddress. But we can also support 8-bit OR if the bits are disjoint. We already do this for 16/32/64.
Differential Revision: https://reviews.llvm.org/D58863
llvm-svn: 355423
We have quite a few cases of using FP instructions for integer operations when only AVX1 is available. Then we switch to integer instructions with AVX2. In a lot of these cases execution domain fixing will take care of turning FP instructions into integer if its profitable.
With this patch we just keep on using the FP instructions even with AVX2. I've only handled some cases that don't require messing with patterns that are defined in the instruction definition. Those will require more subtle multiclass work possibly involving null_frag, hasSideEffects = 0, etc.
Differential Revision: https://reviews.llvm.org/D58470
llvm-svn: 355361
X86TargetLowering::EmitLoweredSelect presently detects sequences of CMOV pseudo
instructions without accounting for debug intrinsics. This leads to different
codegen with and without option -g, if a DBG_VALUE instruction lands in the
middle of several lowered selects.
Work around this by skipping over debug instructions when looking for CMOV
sequences, and sinking those debug insts into the EmitLoweredSelect sunk block.
This might slightly shift where variables appear in the instruction sequence,
but won't re-order assignments.
Differential Revision: https://reviews.llvm.org/D58672
llvm-svn: 355307
We were using VPBLENDW for v2i64 and VBLENDPD for v4i64. VPBLENDD has better throughput than VPBLENDW on some CPUs so it makes sense to use it when possible. VBLENDPD will probably become VBLENDD during execution domain fixing, but we might as well use integer in isel while we can.
This should work around some issues with the domain fixing pass prefering PBLENDW when we start with PBLENDW. There may still be some v8i16 cases that could use PBLENDD.
llvm-svn: 355281
Summary:
This extends the variety of pattern that can generate a SHLD instead of using two shifts.
This fixes a regression that would be introduced by D57367 or D33587
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57389
llvm-svn: 355260
Previously we had build_vector PatFrags that called ISD::isBuildVectorAllZeros/Ones. Internally the ISD::isBuildVectorAllZeros/Ones look through bitcasts, but we aren't able to take advantage of that in isel. Instead of we have to canonicalize the types of the all zeros/ones build_vectors and insert bitcasts. Then we have to pattern match those exact bitcasts.
By emitting specific matchers for these 2 nodes, we can make isel look through any bitcasts without needing to explicitly match them. We should also be able to remove the canonicalization to vXi32 from lowering, but I've left that for a follow up.
This removes something like 40,000 bytes from the X86 isel table.
Differential Revision: https://reviews.llvm.org/D58595
llvm-svn: 355224
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the tests diffs:
1. Memory loads may get scalarized resulting in more efficient code.
2. Memory stores may get scalarized resulting in more efficient code.
3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.
The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.
Differential Revision: https://reviews.llvm.org/D58282
llvm-svn: 355130
We don't have any combines that can look through a bitcast to truncate a build vector of constants. So the truncate will stick around and give us something like this pattern (binop (trunc X), (trunc (bitcast (build_vector)))) which has two truncates in it. Which will be reversed by hoistLogicOpWithSameOpcodeHands in the generic DAG combiner. Thus causing an infinite loop.
Even if we had a combine for (truncate (bitcast (build_vector))), I think it would need to be implemented in getNode otherwise DAG combiner visit ordering would probably still visit the binop first and reverse it. Or combineTruncatedArithmetic would need to do its own constant folding.
Differential Revision: https://reviews.llvm.org/D58705
llvm-svn: 355116
Summary:
The description of KnownBits::zext() and
KnownBits::zextOrTrunc() has confusingly been telling
that the operation is equivalent to zero extending the
value we're tracking. That has not been true, instead
the user has been forced to explicitly set the extended
bits as known zero afterwards.
This patch adds a second argument to KnownBits::zext()
and KnownBits::zextOrTrunc() to control if the extended
bits should be considered as known zero or as unknown.
Reviewers: craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: javed.absar, hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58650
llvm-svn: 355099
These allows use to use the same set of isel patterns for sra/srl/shl which are undefined for out of range shifts and intrinsic shifts which aren't undefined.
Doing this late allows DAG combine to have every opportunity to optimize the sra/srl/shl nodes.
This removes about 7000 bytes from the isel table and simplies the td files.
llvm-svn: 355071
A lot of the INSERT_SUBVECTOR combines can be more generally handled as if they have come from a CONCAT_VECTORS node.
I've been investigating adding a CONCAT_VECTORS combine to X86, but this is a much easier first step that avoids the issue of handling a number of pre-legalization issues that I've encountered.
Differential Revision: https://reviews.llvm.org/D58583
llvm-svn: 355015
Original implementation can't correctly handle __m256 and __m512 types
passed by reference through stack. This patch fixes it.
Patch by Wei Xiao!
Differential Revision: https://reviews.llvm.org/D57643
llvm-svn: 354921
This patch enables the following
1) AMD family 17h "znver2" tune flag (-march, -mcpu).
2) ISAs that are enabled for "znver2" architecture.
3) For the time being, it uses the znver1 scheduler model.
4) Tests are updated.
5) Scheduler descriptions are yet to be put in place.
Reviewers: craig.topper
Differential Revision: https://reviews.llvm.org/D58343
llvm-svn: 354897
Summary:
Use a custom calling convention handler for interrupts instead of fixing
up the locations in LowerMemArgument. This way, the offsets are correct
when constructed and we don't need to account for them in as many
places.
Depends on D56883
Replaces D56275
Reviewers: craig.topper, phil-opp
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D56944
llvm-svn: 354837
If the LHS has known zeros, the RHS immediate will have had bits removed. So call computeKnownBits to get the known zeroes so we can handle this case.
Differential Revision: https://reviews.llvm.org/D58475
llvm-svn: 354811
Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.
Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.
Differential Revision: https://reviews.llvm.org/D58597
llvm-svn: 354771
Its proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.
llvm-svn: 354757
Summary:
Previously we used BLENDPS/BLENDPD but that puts the blend in the FP domain. Under optsize, the two address instruction pass can cause blendps/blendpd to commute to blendps/blendpd. But we probably shouldn't do that if the original type was a integer. So use pblendw instead.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58574
llvm-svn: 354755
Summary:
The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate.
The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate.
Reviewers: RKSimon, andreadb
Reviewed By: RKSimon
Subscribers: gbedwell, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58581
llvm-svn: 354754
Summary:
When promoting the over flow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.
We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have only put a the boolean value in bit 0 of the element and left all other bits in the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.
Reviewers: spatel, RKSimon, nikic
Reviewed By: nikic
Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58567
llvm-svn: 354753
r354648 was a follow up to fix a regression "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."
These were reverted in r354713 as their context depended on other patches that were reverted for a bug.
llvm-svn: 354734
Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 its just a single VPERMPD), followed by a BLENDPD.
llvm-svn: 354729
Conversion from ConstantSDNode to MachineInstr sign extends immediates from their APInt representation to int64_t.
This commit makes sure we do the same for commuting. The tests changes show how this improves CSE. This issue was made worse by the MachineCSE using commuteInstruction to undo a commute. So we virtually guarantee the sign extend from isel would be lost.
The improved CSE also occurred with r354363, but that was reverted. I'm working to undo the revert, but wanted to get this fix in while it was easy to see the results.
llvm-svn: 354724
r354363 caused https://crbug.com/934963#c1, which has a plain C reduced
test case.
I also had to revert some dependent changes:
- r354648
- r354647
- r354640
- r354511
llvm-svn: 354713
If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.
Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.
llvm-svn: 354709
As discussed in:
D56864
D58197
Always use the narrow (128-bit) instruction when possible.
We already had the signed int version of this transform.
llvm-svn: 354675
Only the 1st fold is attempted pre-legalization, but it requires
legal (simple) types too, so we don't need an EVT in any of the code.
llvm-svn: 354674
This was done in r321424 to prevent scheduling from reordering things. But now that we model FPCW as a dependency, I don't think the same scheduling we were trying to prevent can occur.
llvm-svn: 354628
This is a follow-up to D56864.
If we're extracting from a non-zero index before casting to FP,
then shuffle the vector and optionally narrow the vector before doing the cast:
cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0
This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974
Differential Revision: https://reviews.llvm.org/D58197
llvm-svn: 354619
We currently bail if the target shuffle decodes to more than 2 input vectors, this change alters the input index to work for any number of inputs for when we drop that requirement.
llvm-svn: 354575
Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.
This adds an extra UnrollVectorOverflowOp() method to SDAG, because
the general UnrollOverflowOp() method can't deal with multiple results.
Additionally we need to expand UMULO/SMULO during vector op
legalization, as it may result in unrolling, which may need additional
type legalization.
Differential Revision: https://reviews.llvm.org/D57997
llvm-svn: 354513
This avoids depending on the peephole pass to do load folding.
Also adds some load folding for some insert_subvector patterns that use blend.
All of this was found by temporarily adding TB_NO_FORWARD to the blend immediate entries in the load folding tables.
I've added -disable-peephole to some of the affected tests from that experiment to ensure we're testing isel patterns.
llvm-svn: 354511
We currently bail if the target shuffle decodes to more than 2 input vectors, this is some initial cleanup that still has the limit but generalizes the opindices to an array that will be necessary when we drop the limit.
llvm-svn: 354489
Summary:
Inc and Dec were at one point slow on Intel CPUs due to their tendency to cause partial flag stalls on P6 derived CPU cores. This is because these instructions are defined to preserve the carry flag. This partial flag stall issue persisted until Sandy Bridge when flag merging was changed to be handled as a data dependency instead of as a stall until retirement. Sandy Bridge and later CPUs rename the C flag separately from OSPAZ so there is no flag merge needed on INC/DEC to preserve the C flag.
Given these improvements I don't know why INC/DEC was ever considered slow on Sandy Bridge. If anything they should have been disabled on the earlier CPUs instead.
Note after this patch, INC/DEC are still considered slow on Silvermont, Goldmont, Knights Landing and our generic "x86-64" CPU.
Reviewers: spatel, RKSimon, chandlerc
Reviewed By: chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D58412
llvm-svn: 354436
As this has broken the lto bootstrap build for 3 days and is
showing a significant regression on the Dither_benchmark results (from
the LLVM benchmark suite) -- specifically, on the
BENCHMARK_FLOYD_DITHER_128, BENCHMARK_FLOYD_DITHER_256, and
BENCHMARK_FLOYD_DITHER_512; the others are unchanged. These have
regressed by about 28% on Skylake, 34% on Haswell, and over 40% on
Sandybridge.
This reverts commit r353923.
llvm-svn: 354434
After r354178, these instruction expand to a sequence that uses an OR instruction. That OR clobbers EFLAGS so we need to state that to avoid accidentally using the clobbered flags.
Our tests show the bug, but I didn't notice because the SETcc instructions didn't move after r354178 since it used to be safe to do the fp->int conversion first.
We should probably convert this whole sequence to SelectionDAG instead of a custom inserter to avoid mistakes like this.
Fixes PR40779
llvm-svn: 354395
The use of the -mprefer-vector-width=256 command line option mixed with functions
using vector intrinsics can create situations where one function thinks 512 vectors
are legal, but another fucntion does not.
If a 512 bit vector is passed between them via a pointer, its possible ArgumentPromotion
might try to pass by value instead. This will result in type legalization for the two
functions handling the 512 bit vector differently leading to runtime failures.
Had the 512 bit vector been passed by value from clang codegen, both functions would
have been tagged with a min-legal-vector-width=512 function attribute. That would
make them be legalized the same way.
I observed this issue in 32-bit mode where a union containing a 512 bit vector was
being passed by a function that used intrinsics to one that did not. The caller
ended up passing in zmm0 and the callee tried to read it from ymm0 and ymm1.
The fix implemented here is just to consider it a mismatch if two functions
would handle 512 bit differently without looking at the types that are being
considered. This is the easist and safest fix, but it can be improved in the future.
Differential Revision: https://reviews.llvm.org/D58390
llvm-svn: 354376
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 354363
This was the cause of the regression in D57888 - the commuted load pattern wasn't hidden by the predicate so once we enabled v4i32 blends on SSE41+ targets then isel was incorrectly matched against AVX2+ instructions.
llvm-svn: 354358
When parsing a sequence of tokens beginning with {, it will hit an assert and crash if the token afterwards is not an identifier. Instead of this, return a more verbose error as seen elsewhere in the function.
Patch by Brandon Jones (BrandonTJones)
Differential Revision: https://reviews.llvm.org/D57375
llvm-svn: 354356
Tuning flags don't have any effect on the available instructions so aren't a good reason to prevent inlining.
There are also some ISA flags that don't have any intrinsics our ABI requirements that we can exclude. I've put only the most basic ones like cmpxchg16b and lahfsahf. These are interesting because they aren't present in all 64-bit CPUs, but we have codegen workarounds when they aren't present.
Loosening these checks can help with scenarios where a caller has a more specific CPU than a callee. The default tuning flags on our generic 'x86-64' CPU can currently make it inline compatible with other CPUs. I've also added an example test for 'nocona' and 'prescott' where 'nocona' is just a 64-bit capable version of 'prescott' but in 32-bit mode they should be completely compatible.
I've based the implementation here of the similar code in AMDGPU.
Differential Revision: https://reviews.llvm.org/D58371
llvm-svn: 354355
The VBROADCAST combines and SimplifyDemandedVectorElts improvements mean that we now more consistently use shorter (128-bit) X86vzload input operands.
Follow up to D58053
llvm-svn: 354346
This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.
It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.
Differential Revision: https://reviews.llvm.org/D58053
llvm-svn: 354340
The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754https://bugs.llvm.org/show_bug.cgi?id=40487
..and those are shown in the IR test file and x86 codegen file.
Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.
This adds a TLI hook that should preserve the existing behavior for uaddo formation, but disables usubo
formation by default. Only x86 overrides that setting for now although other targets will likely benefit
by forming usbuo too.
Differential Revision: https://reviews.llvm.org/D57789
llvm-svn: 354298
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.
This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.
Differential Revision: https://reviews.llvm.org/D58181
llvm-svn: 354279
No need for a separate stack slot. The lifetimes don't overlap.
Also fix the MachinePointerInfo for the final load after the integer conversion to indicate it came from the stack slot.
llvm-svn: 354234
No need to manually split everything. We can let the type legalizer work for us.
The test change seems to be caused by some DAG ordering issue that was previously circumventing a one use check in LowerSELECT where FP selects are turned into blends if the setcc has one use. But it was running after an integer select and the same setcc had been legalized to cmov and X86SISD::CMP. This dropped the use count of the setcc, but wasn't what was intended.
llvm-svn: 354197
Preventing the load fold won't fix the partial register update since the
input we can fold is a GPR. So it will do nothing to prevent a false dependency
on an XMM register.
llvm-svn: 354193
When we need to do an fp->int conversion using x87 instructions, we need to temporarily change the rounding mode to 0b11 and perform a store. To do this we save the old value of the fpcw to the stack, then set the fpcw to 0xc7f, do the store, then restore fpcw. But the 0xc7f value forces the exception mask bits 1. While this is what they would be in the default FP environment, as we move to support changing the FP environments, we shouldn't make this assumption.
This patch changes the code to explicitly OR 0xc00 with the old value so that only the rounding mode is changed. Unfortunately, this requires two stack temporaries instead of one. One to hold the old value and one to hold the new value. Without two stack temporaries we would need an additional GPR. We already need one to do the OR operation in. This is similar to what gcc and icc do for this operation. Though they are both better at reusing the stack temporaries when there are multiple truncates in a function(or at least in a basic block)
Differential Revision: https://reviews.llvm.org/D57788
llvm-svn: 354178
These checks aren't needed on the call to FP_TO_INTHelper from the type legalizer for splitting i64. We always want to use X87 FIST/FISTT to memory there.
Moving up the SSE checks will allow this routine to focus on what it cares about and makes its return semantics cleaner.
llvm-svn: 354161
As detailed on PR40730, we are not correctly filling in the lane shuffle mask (D53148/rL344446) - we fill in for the correct src lane but don't add it to the correct mask element, so any reference to the correct element is likely to see an UNDEF mask index.
This allows constant folding to propagate UNDEFs prior to the lane mask being (correctly) lowered to vperm2f128.
This patch fixes the issue by fully populating the lane shuffle mask - this is more than is necessary (if we only filled in the required mask elements we might be able to match other shuffle instructions - broadcasts etc.), but its the most cautious approach as this needs to be cherrypicked into the 8.0.0 release branch.
Differential Revision: https://reviews.llvm.org/D58237
llvm-svn: 354117
When SSE is enabled sint_to_fp with i16 is blindly promoted to i32, but that changes the behavior of f80 conversion.
Move the promotion to i16 to LowerFP_TO_INT so we can limit it based on the floating point type.
llvm-svn: 354003
Try to use 64-bit SLP vectorization. In addition to horizontal instrs
this change triggers optimizations for partial vector operations (for instance,
using low halfs of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by
<2 x float>).
Fixes llvm.org/PR32433
llvm-svn: 353923
In 64-bit mode prior to avx512 we use Expand, but with avx512 we need to make f32/f64 conversions Legal so we use Custom and then do our own expansion for f80. But this seems to produce codegen differences relative to avx2. This patch corrects this.
llvm-svn: 353921
-Pull the final stack load creation from the two callers into the helper.
-Return a single SDValue instead of a std::pair.
-Remove the Replace flag which isn't really needed.
llvm-svn: 353920
A more limited version of rL352997 that had to be disabled in rL353198 - allow extension of any 128/256/512 bit vector that at least uses byte sized scalars.
llvm-svn: 353860
We were using DstTy, but that represents the integer type we are converting to which is i64 in this
case. The FLD is part of an intermediate step to get from the SSE registers to the x87 registers.
If the floating point type is f32, the memory operand should reflect a 4 byte access not an 8 byte
access. The store we used to get from SSE to the stack is using the corect size.
While there, consistenly use TheVT in place of Op.getOperand(0).getValueType() throughout the function.
llvm-svn: 353745
256-bit horizontal math ops are an x86 monstrosity (and thankfully have
not been extended to 512-bit AFAIK).
The two 128-bit halves operate on separate halves of the inputs. So if we
don't demand anything in the upper half of the result, we can extract the
low halves of the inputs, do the math, and then insert that result into a
256-bit output.
All of the extract/insert is free (ymm<-->xmm), so we're left with a
narrower (cheaper) version of the original op.
In the affected tests based on:
https://bugs.llvm.org/show_bug.cgi?id=33758https://bugs.llvm.org/show_bug.cgi?id=38971
...we see that the h-op narrowing can result in further narrowing of other
math via existing generic transforms.
I originally drafted this patch as an exact pattern match starting from
extract_vector_elt, but I thought we might see diffs starting from
extract_subvector too, so I changed it to a more general demanded elements
solution. There are no extra existing regression test improvements from
that switch though, so we could go back.
Differential Revision: https://reviews.llvm.org/D57841
llvm-svn: 353641
As discussed on D57389, this is a first step towards moving the SHLD/SHRD matching code to DAGCombiner using FSHL/FSHR instead.
There's a bit of work to do before I can do that, so this just folds to FSHL/FSHR in the existing code (handling the different SHRD/FSHR argument ordering), which fixes the issue we had with i16 shift amounts not being correctly masked.
llvm-svn: 353626
D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.
This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
Differential Revision: https://reviews.llvm.org/D57888
llvm-svn: 353610
This patch accompanies the RFC posted here:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127239.html
This patch adds a new CallBr IR instruction to support asm-goto
inline assembly like gcc as used by the linux kernel. This
instruction is both a call instruction and a terminator
instruction with multiple successors. Only inline assembly
usage is supported today.
This also adds a new INLINEASM_BR opcode to SelectionDAG and
MachineIR to represent an INLINEASM block that is also
considered a terminator instruction.
There will likely be more bug fixes and optimizations to follow
this, but we felt it had reached a point where we would like to
switch to an incremental development model.
Patch by Craig Topper, Alexander Ivchenko, Mikhail Dvoretckii
Differential Revision: https://reviews.llvm.org/D53765
llvm-svn: 353563
Summary: These instructions update FPSW so they aren't generically safe to rematerialize into any location if FPSW is live for a comparison result. They also use FPCW for exception masking control. Though the only exception they can generate is stack overflow and we manage the stack ourselves so that's not really going to occur.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57934
llvm-svn: 353536
Summary:
FPCW contains the rounding mode control which we manipulate to implement fp to integer conversion by changing the roudning mode, storing the value to the stack, and then changing the rounding mode back. Because we didn't model FPCW and its dependency chain, other instructions could be scheduled into the middle of the sequence.
This patch introduces the register and adds it as an implciit def of FLDCW and implicit use of the FP binary arithmetic instructions and store instructions. There are more instructions that need to be updated, but this is a good start. I believe this fixes at least the reduced test case from PR40529.
Reviewers: RKSimon, spatel, rnk, efriedma, andrew.w.kaylor
Subscribers: dim, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57735
llvm-svn: 353489
Move the (add (umax X, C), -C) --> (usubsat X, C) X86 combine into generic DAGCombiner
First of a number of saturated arithmetic folds that can be moved out of X86-specific code for PR40111.
Differential Revision: https://reviews.llvm.org/D57754
llvm-svn: 353457
This is intentionally a small step because it's hard to know exactly
where we might introduce a conflicting transform with the code that
tries to form wider shuffles. But I think this is safe - if we have
a wide shuffle with 2 operands, then we should do better with an
extract + narrow shuffle.
Differential Revision: https://reviews.llvm.org/D57867
llvm-svn: 353427
combineExtractWithShuffle may leave a dangling bitcast which may
prevent further optimization in later passes. Avoid constructing it
unless it is used.
llvm-svn: 353333
The proposal in D56796 may cross the line because we're trying to avoid vectorization
transforms in generic DAG combining. So this is an alternate, later, x86-specific
translation of that patch.
There are several potential follow-ups to enhance this:
1. Allow extraction from non-zero element index.
2. Peek through extends of smaller width integers.
3. Support x86-specific conversion opcodes like X86ISD::CVTSI2P
Differential Revision: https://reviews.llvm.org/D56864
llvm-svn: 353302
rL352997 enabled ZERO_EXTEND from non-shuffle-able value types. I've disabled it for now to fix a regression identified by @asbirlea until I can fix this properly.
llvm-svn: 353198
If we have broadcasts of different vector widths, keep the longest vector width and extract subvectors for the shorter vectors (which should be free).
Differential Revision: https://reviews.llvm.org/D57663
llvm-svn: 353154
Summary:
We don't currently map these constraints to physical register numbers so they don't make it to the MachineIR representation of inline assembly.
This could have problems for proper dependency tracking in the machine schedulers though I don't have a test case that shows that.
Reviewers: rnk
Reviewed By: rnk
Subscribers: eraman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57641
llvm-svn: 353141
Summary:
Noticed while looking at D56052.
```
// The 'control' of BEXTR has the pattern of:
// [15...8 bit][ 7...0 bit] location
// [ bit count][ shift] name
// I.e. 0b000000011'00000001 means (x >> 0b1) & 0b11
```
I.e. we do not care about any of the bits aside from the low 16 bits.
So there is no point in doing the `slh`,`or` in 64 bits,
let's just do everything in 32 bits, and anyext if needed.
We could do that in 16 even, but we intentionally don't
zext to i16 (longer encoding IIRC),
so i'm guessing the same applies here.
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D56715
llvm-svn: 353073
These instructions implicitly operate on ST0, but we don't currently add that information to the MachineInstr. We also don't add it the tablegen definitions either.
For the most part this doesn't cause any problems because the stackifying occurs after register allocation. All the instructions are marked as having side effects so the postRA scheduler won't reorder them amongst themselves.
But nothing stops inline assembly using X87 instructions from being reordered around other x87 instructions if that inline assembly wasn't marked volatile.
The two test cases I've identified so far in PR40539 involve loads and stores used to set up the inline assembly or capture the results of the inline assembly ending up in the wrong order.
This patch adds implicit ST0 uses/defs to the load/store instructions to prevent this from happening.
I plan to fix all of the FP instructions, but the binops are bit trickier to get right. So I've chosen fixing the known test cases as a good first step.
I think we also need to update the tablegen descriptions so MS inline assembly infers the right clobbers, but I haven't checked that yet.
Differential Revision: https://reviews.llvm.org/D57644
llvm-svn: 353070
All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read.
This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.
llvm-svn: 353061
For PCMPGT(0, X) patterns where we only demand the sign bit (e.g. BLENDV or MOVMSK) then we can use X directly.
Differential Revision: https://reviews.llvm.org/D57667
llvm-svn: 353051
This patch removes hidden codegen flag -print-schedule effectively reverting the
logic originally committed as r300311
(https://llvm.org/viewvc/llvm-project?view=revision&revision=300311).
Flag -print-schedule was originally introduced by r300311 to address PR32216
(https://bugs.llvm.org/show_bug.cgi?id=32216). That bug was about adding "Better
testing of schedule model instruction latencies/throughputs".
These days, we can use llvm-mca to test scheduling models. So there is no longer
a need for flag -print-schedule in LLVM. The main use case for PR32216 is
now addressed by llvm-mca.
Flag -print-schedule is mainly used for debugging purposes, and it is only
actually used by x86 specific tests. We already have extensive (latency and
throughput) tests under "test/tools/llvm-mca" for X86 processor models. That
means, most (if not all) existing -print-schedule tests for X86 are redundant.
When flag -print-schedule was first added to LLVM, several files had to be
modified; a few APIs gained new arguments (see for example method
MCAsmStreamer::EmitInstruction), and MCSubtargetInfo/TargetSubtargetInfo gained
a couple of getSchedInfoStr() methods.
Method getSchedInfoStr() had to originally work for both MCInst and
MachineInstr. The original implmentation of getSchedInfoStr() introduced a
subtle layering violation (reported as PR37160 and then fixed/worked-around by
r330615).
In retrospect, that new API could have been designed more optimally. We can
always query MCSchedModel to get the latency and throughput. More importantly,
the "sched-info" string should not have been generated by the subtarget.
Note, r317782 fixed an issue where "print-schedule" didn't work very well in the
presence of inline assembly. That commit is also reverted by this change.
Differential Revision: https://reviews.llvm.org/D57244
llvm-svn: 353043
We now print ST0 as 'st' when generating the clobber list for MS inline assembly in clang. This matches what the gcc reg name list expects.
Original commit message:
This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.
Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.
Differential Revision: https://reviews.llvm.org/D57298
llvm-svn: 353016
Looking into gcc and objdump behavior more this was overly aggressive. If the register is encoded in the instruction we should print %st(0), if its implicit we should print %st.
I'll be making a more directed change in a future patch.
llvm-svn: 353013
Summary:
When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found with when I attempted to make a emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name.
This also matches what objdump disassembly prints. It's also what is printed by gcc -S.
Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri
Reviewed By: rnk
Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits
Differential Revision: https://reviews.llvm.org/D57621
llvm-svn: 352985
Summary:
Add an additional combine to combineCarryThroughADD to reverse it back to the C flag to avoid regressions.
I believe this catches the cases that D57547 got.
Reviewers: RKSimon, spatel
Reviewed By: spatel
Subscribers: javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57637
llvm-svn: 352984
Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles.
The is exposes a couple of minor issues that will be fixed shortly:
Missed broadcast folds - we have a mixture of vzext_load lengths that need cleaning up
combine-sdiv.ll - AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts).
llvm-svn: 352963
We already have the getConstantOperandVal helper which returns a uint64_t, but along comes the fuzzer and inserts a i128 -1 constant or something and the whole thing asserts.......
I've updated a few obvious cases, and tried to make use of the const reference where possible, but there's more to do. A number of existing oss-fuzz tickets should be fixed if we start using APInt and perform value clamping where necessary.
llvm-svn: 352961
This cleans up all GetElementPtr creation in LLVM to explicitly pass a
value type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57173
llvm-svn: 352913
This cleans up all LoadInst creation in LLVM to explicitly pass the
value type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57172
llvm-svn: 352911
This cleans up all CallInst creation in LLVM to explicitly pass a
function type rather than deriving it from the pointer's element-type.
Differential Revision: https://reviews.llvm.org/D57170
llvm-svn: 352909
As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section.
For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts).
Differential Revision: https://reviews.llvm.org/D56784
llvm-svn: 352883
Enable peeking through one use bitcasts to the subvector shuffle.
This still depends on the subvector being the same scalar-size but D57514 has already helped with the more tricky patterns
llvm-svn: 352879
Summary:
I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as `2`,
which matches the latencies specified in "AMD SOG for family 15h".
However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency.
Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`,
which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay.
Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does
state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`.
Reviewers: andreadb, RKSimon, craig.topper
Reviewed By: andreadb
Subscribers: gbedwell, courbet, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57300
llvm-svn: 352861
Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc
doesn't choke on it, hopefully.
Original Message:
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.
Then:
- update the CallInst/InvokeInst instruction creation functions to
take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.
One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.
However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)
Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.
Differential Revision: https://reviews.llvm.org/D57315
llvm-svn: 352827
This reverts commit f47d6b38c7 (r352791).
Seems to run into compilation failures with GCC (but not clang, where
I tested it). Reverting while I investigate.
llvm-svn: 352800
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.
Then:
- update the CallInst/InvokeInst instruction creation functions to
take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.
One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.
However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)
Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.
Differential Revision: https://reviews.llvm.org/D57315
llvm-svn: 352791
Similar to what we already do in DAGCombiner, but this version also handles bitcasts from types with different scalar sizes, which x86 is better at handling.
Differential Revision: https://reviews.llvm.org/D57514
llvm-svn: 352773
Enables 32/64-bit scalar load broadcasts on AVX1 targets
The extractelement-load.ll regression will be fixed shortly in a followup commit.
llvm-svn: 352743
If we're not inserting the broadcast into the lowest subvector then we can avoid the insertion by just performing a larger broadcast.
Avoids a regression when we enable AVX1 broadcasts in shuffle combining
llvm-svn: 352742
And instead just generate a libcall. My motivating example on ARM was a simple:
shl i64 %A, %B
for which the code bloat is quite significant. For other targets that also
accept __int128/i128 such as AArch64 and X86, it is also beneficial for these
cases to generate a libcall when optimising for minsize. On these 64-bit targets,
the 64-bits shifts are of course unaffected because the SHIFT/SHIFT_PARTS
lowering operation action is not set to custom/expand.
Differential Revision: https://reviews.llvm.org/D57386
llvm-svn: 352736
I believe this was there to handle avx512bw intrinsics that returned i64 type in 32-bit mode. But all those intrinsics have since been changed to v64i1 results or replaced with generic IR.
llvm-svn: 352698
This fixes the test case in PR35982 by preventing MMX instructions that read MM0-7 from being moved below EMMS/FEMMS by the post RA scheduler.
Though as discussed in bugzilla, this is not a complete fix. There is still the possibility of reordering in IR or by the pre-RA scheduler.
Differential Revision: https://reviews.llvm.org/D57298
llvm-svn: 352660
There were checks to ensure some tables were sorted, but those tables aren't used by this function. The same tables are checked in the function that does use them. Maybe this was copy/pasted?
llvm-svn: 352609
As far as I can tell we already won't emit these aliases due to an operand count check in the tablegen code. Removing these because I couldn't make sense of the inconsistency between fadd and fmul from reading the code.
I checked the AsmMatcher and AsmWriter files before and after this change and there were no differences.
llvm-svn: 352608
Account for bypass delays when computing the latency of scalar int-to-float
conversions.
On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG).
This patch also fixes the number of micropcodes for the register-memory variants
of scalar int-to-float conversions.
Differential Revision: https://reviews.llvm.org/D57148
llvm-svn: 352518
Without the fix we get the following (with -Werror):
../lib/Target/X86/X86ISelLowering.cpp:14181:58: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
SmallVector<std::array<int, 2>, 2> LaneSrcs(NumLanes, {-1, -1});
^~~~~~
{ }
1 error generated.
llvm-svn: 352455
This did not cause the buildbot failure it was previously reverted for.
Original commit message:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.
This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inre
On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.
llvm-svn: 352433
Followup to D56636, this time handling the UADDSAT case by expanding
uadd.sat(a, b) to umin(a, ~b) + b.
Differential Revision: https://reviews.llvm.org/D56869
llvm-svn: 352409
First step towards adding support for 64-bit unary "sublane" handling (a bit like lowerShuffleAsRepeatedMaskAndLanePermute).
This allows us to add lowerV64I8Shuffle handling.
llvm-svn: 352389
This is tricky to make optimal: sometimes we're better off using
a single wider op, but other times it makes more sense to combine
a narrow ops to achieve the same result.
This solves the case from:
https://bugs.llvm.org/show_bug.cgi?id=40434
There's potentially a similar change for vectors with 64-bit elements,
but it needs adjustments similar to rL352333 to avoid creating infinite
loops.
llvm-svn: 352380
This transform was added with rL351346, and we had
an escape for shufps, but we also want one for
unpckps vs. vpermps because vpermps doesn't take
an immediate shuffle index operand.
llvm-svn: 352333
Although this is longer code, this is no-functional-change-intended.
The goal is to untangle the conditions under which we bail out, so
that's easier to adjust.
llvm-svn: 352320
The add+and sequence followed by a branch can
happen e.g. when looping over the set bits of an integer:
```
while (x != 0) {
func(x & ~x);
x &= x - 1;
}
```
Reviewed By: ctopper
Differential Revision: https://reviews.llvm.org/D57296
llvm-svn: 352306
def32 here means the producing instruction zeroed bits 63:32. We already do this for zext, but it looks like we can get an and+anyext sometimes.
Spotted in the diffs from D33587.
llvm-svn: 352303
Fix issue noted in D57281 that only tested the one use for the SDValue (the result flag), not the entire SUB.
I've added the getNode() to make it clearer what is intended than just the -> redirection.
llvm-svn: 352291
As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to a SBB instruction.
Differential Revision: https://reviews.llvm.org/D57281
llvm-svn: 352289
We often generate X86ISD::SBB(X, 0) for carry flag arithmetic.
I had tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find anything yet.
Differential Revision: https://reviews.llvm.org/D57169
llvm-svn: 352288
This reduces a bit of duplication between the combining and
lowering places that use it, but the primary motivation is
to make it easier to rearrange the lowering logic and solve
PR40434:
https://bugs.llvm.org/show_bug.cgi?id=40434
llvm-svn: 352280
Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56971
llvm-svn: 352260
Summary:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.
This patch changes this and then modifies the X86 post legalization combine to emit a extending shuffle instead of a sign_extend_vector_inreg. Could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization.
On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57186
llvm-svn: 352255
Summary:
Currently, if an instruction with a memory operand has no debug information,
X86DiscriminateMemOps will generate one based on the first line of the
enclosing function, or the last seen debug info.
This may cause confusion in certain debugging scenarios. The long term
approach would be to use the line number '0' in such cases, however, that
brings in challenges: the base discriminator value range is limited
(4096 values).
For the short term, adding an opt-in flag for this feature.
See bug 40319 (https://bugs.llvm.org/show_bug.cgi?id=40319)
Reviewers: dblaikie, jmorse, gbedwell
Reviewed By: dblaikie
Subscribers: aprantl, eraman, hiraditya
Differential Revision: https://reviews.llvm.org/D57257
llvm-svn: 352246
We also need to combine to masked truncating with saturation stores, but I'm leaving that for a future patch.
This does regress some tests that used truncate wtih saturation followed by a masked store. Those now use a truncating store and use min/max to saturate.
Differential Revision: https://reviews.llvm.org/D57218
llvm-svn: 352230
This seems unnecessarily complicated because we gave names to
opposite polarity bools and have code comments that don't really
line up with the logic.
Step 1: remove UndefUpper and assert that it is the opposite of
UndefLower after the initial early exit.
llvm-svn: 352217
Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag.
This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further.
Noticed in some of the test cases in PR31754
Differential Revision: https://reviews.llvm.org/D57234
llvm-svn: 352210
This isn't the final fix for our reduction/horizontal codegen, but it takes care
of a lot of the problems. After we narrow the shuffle, existing combines for
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.
The avg and mul reduction tests show an existing shuffle lowering hole for
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.
Differential Revision: https://reviews.llvm.org/D57156
llvm-svn: 352209
As noted in D57156, we want to check at least part of
this pattern earlier (in combining), so this will allow
the code to be shared instead of duplicated.
llvm-svn: 352127
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.
The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.
This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)
That latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and also, is consistent with
MSVC.
Differential Revision: https://reviews.llvm.org/D56548
llvm-svn: 352076
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.
X86 uses i8, but seemed to be hacking around this before.
llvm-svn: 351882
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.
This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.
Differential Revision: https://reviews.llvm.org/D55935
llvm-svn: 351819
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57026
llvm-svn: 351817
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57022
llvm-svn: 351815
First step towards PR40376, this patch adds support for getCmpSelInstrCost to use the (optional) Instruction CmpInst predicate to indicate the type of integer comparison we're performing and alter the costs accordingly.
Differential Revision: https://reviews.llvm.org/D57013
llvm-svn: 351810
When we are inserting 1 "inline" element, and zeroing 2 of the other elements then we can safely commute the insertps source inputs to improve memory folding.
Differential Revision: https://reviews.llvm.org/D56843
llvm-svn: 351807
Summary:
Use X86ISD::VFPROUND in the instruction isel patterns. Add new patterns for ISD::FP_ROUND to maintain support for fptrunc in IR.
In the process I found a couple duplicate isel patterns which I also deleted in this patch.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56991
llvm-svn: 351762
Summary:
For compress, a select node doesn't semantically reflect the behavior of the instruction. The mask would have holes in it, but the resulting write is to contiguous elements at the bottom of the vector.
Furthermore, as far as the compressing and expanding is concerned the behavior is depended on the mask. You can't just have an expand/compress node that only reads the input vector. That node would have no meaning by itself.
This all only works because we pattern match the compress/expand+select back to the instruction. But conceivably an optimization of the select could break the pattern and leave something meaningless.
This patch modifies the expand and compress node to take the mask and passthru as additional inputs and gets rid of the select all together.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57002
llvm-svn: 351761
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.
Confirmed with @andreadb
llvm-svn: 351755
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.
Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.
Differential Revision: https://reviews.llvm.org/D56922
llvm-svn: 351742
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.
Noticed while cleaning up vector integer comparison costs for PR40376.
llvm-svn: 351697
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.
Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).
The increase pre-SSE41 selection costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.
llvm-svn: 351685
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Summary:
Right now we include ${TGT}GenCallingConv.inc once per each instruction
selection method implemented by ${TGT}:
- ${TGT}ISelLowering.cpp
- ${TGT}CallLowering.cpp
- ${TGT}FastISel.cpp
Instead, add a mechanism to tablegen for marking a particular convention
as "External", which causes tablegen to emit into the ::llvm namespace,
instead of as a static helper. This allows us to provide a header to
forward declare it, so we can simply call the function from all the
places it is referenced. Typically the calling convention analyzer is
called indirectly, so it doesn't benefit from inlining.
This saves a bit of final binary size, but mostly just saves object file
size:
before after diff artifact
12852K 12492K -360K X86ISelLowering.cpp.obj
4640K 4280K -360K X86FastISel.cpp.obj
1704K 2092K +388K X86CallingConv.cpp.obj
52448K 52336K -112K llc.exe
I didn't collect before numbers for X86CallLowering.cpp.obj, which is
for GlobalISel, but we should save 360K there as well.
This patch applies the strategy to the X86 backend, but there is no
reason it couldn't be applied to the other backends that implement
multiple ISel strategies, like AArch64.
Reviewers: craig.topper, hfinkel, efriedma
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D56883
llvm-svn: 351616
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.
llvm-svn: 351583
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.
Differential Revision: https://reviews.llvm.org/D56827
llvm-svn: 351570
Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits.
This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts.
We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change.
Patch by zhutianyang and Craig Topper
Differential Revision: https://reviews.llvm.org/D56695
llvm-svn: 351381
This cleans up the duplication we have with both intrinsic isel patterns and vselect isel patterns. This should also allow the intrinsics to get SimplifyDemandedBits support for the condition.
I've switched the canonical pattern in isel to use the X86ISD::BLENDV node instead of VSELECT. Since it always seemed weird to move from BLENDV with its relaxed rules on condition bits to VSELECT which has strict rules about all bits of the condition element being the same. Its more correct to go from VSELECT to BLENDV.
Differential Revision: https://reviews.llvm.org/D56771
llvm-svn: 351380
If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc.
Differential Revision: https://reviews.llvm.org/D56765
llvm-svn: 351378
On Jaguar, horizontal adds/subs have local forwarding disable.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.
This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.
I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).
Differential Revision: https://reviews.llvm.org/D56777
llvm-svn: 351366
Remove the existing assertion and just return false for unexpected shuffle value types (<X x i1> mainly....).
Found while updating combineX86ShufflesRecursively to run within SimplifyDemandedVectorElts/SimplifyDemandedBits.
llvm-svn: 351365
combineX86ShufflesRecursively is pretty cumbersome with a lot of arguments that only matter later in recursion.
This commit adds a wrapper version that only takes the initial root Op to simplify calls that don't need to worry about these.
An early, cleanup step towards merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts/SimplifyDemandedBits.
llvm-svn: 351352
I was trying to prevent shuffle regressions while matching more horizontal ops
and ended up here:
shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0
The affected tests were added for:
https://bugs.llvm.org/show_bug.cgi?id=34380
This patch won't change the examples in the bug report itself, but we should be
able to extend this to catch more types.
Differential Revision: https://reviews.llvm.org/D56756
llvm-svn: 351346
Summary:
Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc.
Refer D53541 for the context. Clang counterpart D56748.
Reviewers: rnk, efriedma
Reviewed By: rnk, efriedma
Subscribers: javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D56747
llvm-svn: 351281
That's really what it is. If we didn't use intrinsics for BLENDVPS/BLENDVPD/PBLENDVB all the way to isel, this is the node we would use.
llvm-svn: 351278
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.
llvm-svn: 351275
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.
I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.
I'll work on scatter and gatherpf/scatterpf next.
Differential Revision: https://reviews.llvm.org/D56527
llvm-svn: 351234
Modify getRegForInlineAsmConstraint to return special singleton
register class when a constraint references ST(7) not RFP80 for which
ST(7) is not a member.
llvm-svn: 351206
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.
llvm-svn: 351203
Summary:
In r345197 ESP and RSP were added to GR32_TC/GR64_TC, allowing them to
be used for tail calls, but this also caused `findDeadCallerSavedReg` to
think they were acceptable targets for clobbering. Filter them out.
Fixes PR40289.
Patch by Geoffry Song!
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D56617
llvm-svn: 351146
If we have PSHUFB and we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
llvm-svn: 351103
add (extractelt (X, 0), extractelt (X, 1)) --> extractelt (hadd X, X), 0
This is the integer sibling to D56011.
There's an additional restriction to only to do this transform in the
case where we don't have extra extracts from the source vector. Without
that, we can fail to match larger horizontal patterns that are more
beneficial than this minimal case. An improvement to the more general
h-op lowering may allow us to remove the restriction here in a follow-up.
llvm-svn: 351093
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877
llvm-svn: 351018
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877.
llvm-svn: 351017
This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types.
Differential Revision: https://reviews.llvm.org/D56544
llvm-svn: 350998
The 128-bit input produces 64-bits of output and fills the upper 64-bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do.
This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there.
Fixes another instruction for PR34877.
llvm-svn: 350994
We no longer need to extend mask scalars before bitcasting them to vXi1. This was only needed for the truncate intrinsics. And was really a bug in our lowering of them.
llvm-svn: 350991
We still use i8 for the load/store type. So we need to convert to/from i16 to around the mask type.
By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned.
llvm-svn: 350989
We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask.
This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that.
This fixes some of PR34877
llvm-svn: 350985
Teach x86 assembly operand parsing to distinguish between assembler
variable assigned to named registers and those assigned to immediate
values.
Reviewers: rnk, nickdesaulniers, void
Subscribers: hiraditya, jyknight, llvm-commits
Differential Revision: https://reviews.llvm.org/D56287
llvm-svn: 350966
Previously, we limited this transform to cases where the
extraction into the build vector happens from vectors of
the same type as the build vector, but that's not required.
There's a slight potential regression seen in the AVX512
result for phadd -- we're using the 256-bit flavor of the
instruction now even though the 128-bit subset is sufficient.
The same problem could already be seen in the AVX2 result.
Follow-up patches will attempt to narrow that back down.
llvm-svn: 350928
We were lowering the last step extract_vector_elt to a bitcast+truncate. Change it to use an extract_vector_elt of index 0 instead. Add isel patterns to do the equivalent of what the bitcast would have done. Plus an isel pattern for an any_extend+extract to prevent some regressions.
Finally add a DAG combine to turn v1i1 scalar_to_vector+extract_vector_elt of 0 into an extract_subvector.
This fixes some of the regressions from D350800.
llvm-svn: 350918
This extends to combineVSelectToShrunkBlend to be able to resimplify SHRUNKBLENDS that have already been created.
This should help some of the regressions from D56387
Differential Revision: https://reviews.llvm.org/D56421
llvm-svn: 350875
Despite what the comment says, FCMP_UNE would be an OR not an AND. In the lowering code the first branch created still goes to the original destination. The second branch was exchanged to go to where the subsequent unconditional branch went. This is different than what we do for FCMP_OEQ where both branches that we create go to the original unconditional branch.
As far as I can tell, I think this means we don't need to exchange the branch target with the unconditional branch for FCMP_UNE at all.
Differential Revision: https://reviews.llvm.org/D56309
llvm-svn: 350873
When we use the partial-matching function on a 128-bit chunk, we must
account for the possibility that we've matched undef halves of the
original source vectors, so the outputs may need to be reset.
This should allow closing PR40243:
https://bugs.llvm.org/show_bug.cgi?id=40243
llvm-svn: 350830
This is a partial fix for:
https://bugs.llvm.org/show_bug.cgi?id=40243
...as seen in the integer test, we still need to correct the result when using the
existing (old) horizontal op matching function because it does not model the way
x86 256-bit horizontal ops return results (each 128-bit half is its own horizontal-op).
A potential follow-up change for that is discussed in the bug report - see also D56490.
This generally duplicates a lot of the existing matching code, but we can't just remove
that without introducing regressions, so the existing code is renamed and used less often.
Follow-ups may try to reduce that overlap.
Differential Revision: https://reviews.llvm.org/D56450
llvm-svn: 350826
Summary:
This pass replaces GR8/GR16/GR32/GR64 with their equivalent sized mask register classes. But VK32/VK64 aren't legal without AVX512BW. Apparently this mostly appears to work if the register coalescer is able to remove the VK32/VK64 register class reference. Or if we don't ever spill it. But there's no guarantee of that.
Another Intel employee managed to trigger a crash due to this with ISPC. Unfortunately, I've lost the test case he sent me at the time. I'm trying to get him to reproduce it for me. I'd like to get this in before 8.0 branches since its a little scary.
The regressions here are unfortunate, but I think we can make some improvements to DAG combine, load folding, etc. to fix them. Just not sure if we can get that done for 8.0.
Fixes PR39741
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56460
llvm-svn: 350800
Found while trying to figure out why my second version of D56421 worked better than the first version. We weren't deleting the vselect in a timely fashion and that caused SimplfyDemandedBit to see an additional user.
The new version doesn't have this problem so this fix isn't needed there, but seemed like the right thing to do.
llvm-svn: 350781
When the result type is v2i64/v2f64 and the index element size is i32, the index vector has two unused elements making the type v4i32. The mask VT should match the number of memory accesses that will be made.
This is consistent with the isel patterns used for the target independent gather/scatter intrinsic.
llvm-svn: 350687
For stack frames on the size of a register in x86, a code size optimization
emits "push rax/eax" instead of "sub" for stack allocation. For example:
foo:
.cfi_startproc
BB#0:
pushq %rax
Ltmp0:
.cfi_def_cfa_offset 16
...
.cfi_endproc
However, we are falling back to DWARF in this case because we cannot
encode %rax as a saved register.
This requirement is wrong, since we don't care about the contents of
%rax, it is the equivalent of a sub.
In order to specify that we care about the contents of %rax, we would
need a .cfi_offset %rax, <offset>.
It's also overzealous in the case where there are pushes for callee saved
registers followed by a "push rax/eax" instead of "sub", in which case we should
also be able to encode the callee saved regs and everything else using compact
unwind.
Patch authored by Bruno Cardoso Lopes.
Differential Revision: https://reviews.llvm.org/D13793
llvm-svn: 350623
Summary: AVX512VBMI2 supports a funnel shift by immediate and a funnel shift by a variable vector.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56361
llvm-svn: 350498
There are no test changes here in the existing cost model
regression tests because integer add/sub have a default
legal cost of 1 already. This would break, however, if
we custom lower those ops because the default cost model
assumes that custom-lowered ops are more expensive.
This is similar to the change in rL350403. See discussion
in D56011 for more details. When we enhance that patch to
handle integer ops, we need this cost model change to avoid
unintended diffs here from the custom lowering.
llvm-svn: 350496
This was here because out and in instructions allow '(%dx)' even though its not a memory reference. To handle this we build a special operand for the DX register reference before we get to the call to CheckBaseRegAndIndexRegAndScale. So we no longer need this special case.
llvm-svn: 350483
This is especially helpful on targets without avx512bw since we don't have a good way to convert from v16i8/v32i8 to v16i1/v32i1 for the truncate anyway. If we're just going to convert it to a GPR we might as well use pmovmskb to accomplish both.
llvm-svn: 350480
We don't need to require the first operand to be an integer because we already said it was the same type as the result which we also constrained to an integer.
llvm-svn: 350455
The 1st try for this was at rL350369, but it caused IR-level diffs because
our cost models differentiate custom vs. legal/promote lowering. So that was
reverted at rL350373. The cost models were fixed independently at rL350403,
so this is effectively the same patch as last time.
Original commit message:
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350421
Noticed in D56011 - handle the case that scalar fp ops are quicker on P3 than P4
Add the other costs so that we're not relying on the default "is legal/custom" cost logic.
llvm-svn: 350403
Doing this late so we will prefer to fold the AND into a masked comparison first. That can be better for the live range of the mask register.
Differential Revision: https://reviews.llvm.org/D56246
llvm-svn: 350374
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350369
As noted in PR39973 and D55558:
https://bugs.llvm.org/show_bug.cgi?id=39973
...this is a partial implementation of a fold that we do as an IR canonicalization in instcombine:
// extelt (binop X, Y), Index --> binop (extelt X, Index), (extelt Y, Index)
We want to have this in the DAG too because as we can see in some of the test diffs (reductions),
the pattern may not be visible in IR.
Given that this is already an IR canonicalization, any backend that would prefer a vector op over
a scalar op is expected to already have the reverse transform in DAG lowering (not sure if that's
a realistic expectation though). The transform is limited with a TLI hook because there's an
existing transform in CodeGenPrepare that tries to do the opposite transform.
Differential Revision: https://reviews.llvm.org/D55722
llvm-svn: 350354
INC/DEC are pretty much the same as ADD/SUB except that they don't update the C flag.
This patch removes the special nodes and just pattern matches from ADD/SUB during isel if the C flag isn't being used.
I had to avoid selecting DEC is the result isn't used. This will become a SUB immediate which will turned into a CMP later by optimizeCompareInstr. This lead to the one test change where we use a CMP instead of a DEC for an overflow intrinsic since we only checked the flag.
This also exposed a hole in our RMW flag matching use of hasNoCarryFlagUses. Our root node for the match is a store and there's no guarantee that all the flag users have been selected yet. So hasNoCarryFlagUses needs to check copyToReg and machine opcodes, but it also needs to check for the pre-match SETCC, SETCC_CARRY, BRCOND, and CMOV opcodes.
Differential Revision: https://reviews.llvm.org/D55975
llvm-svn: 350245
Peek through shift modulo masks while matching double shift patterns.
I was hoping to delay this until I could remove the X86 code with generic funnel shift matching (PR40081) but this will do for now.
Differential Revision: https://reviews.llvm.org/D56199
llvm-svn: 350222
All of these use custom isel so we can pretty easily detect the differences in the custom code in X86ISelDAGToDAG. The ISD opcodes just need to express the desired semantics not the details of how they would be selected by isel. So unifying them lets us remove the special casing from lowering.
llvm-svn: 350206
These require a different X86ISD node to be created than i16/i32/i64. I guess no one wanted to add the special code for that except in LowerXALUO. But now LowerXALUO, LowerSELECT, and LowerBRCOND all use a common helper function so they all share the special code.
Unfortunately, there are no test changes because we seem to correct the miss in a DAG combine later. I did verify it manually using test cases from xmulo.ll
llvm-svn: 350205
This makes it easier to keep the LowerBRCOND and LowerSELECT code in sync with LowerXALUO so they always pick the same operation for overflowing instructions.
This is inspired by the helper functions used by ARM and AArch64 for the same purpose.
The test change is because LowerSELECT was not in sync with LowerXALUO with regard to INC/DEC for SADDO/SSUBO.
llvm-svn: 350198
This seems to be getting in the way more than its helping. This does mean we stop scalarizing some cases, but I'm not convinced the scalarization was really better.
Some of the changes to vsel-cmp-load.ll are a regression but D56156 should fix it.
llvm-svn: 350159
This allows us to sign extend to v4i32 first. And then share that extension to implement the final steps to v4i64 using a pcmpgt and punpckl and punpckh.
We already do something similar for SIGN_EXTEND with -x86-experimental-vector-widening-legalization.
llvm-svn: 350158
This was tricking us into making these operations and then letting them get scalarized later. But I can't prove that the scalarized version is actually better.
llvm-svn: 350141
Previously we emitted a multiply and some masking that was supposed to matched to PMULUDQ, but the masking could sometimes be removed before we got a chance to match it. So instead just emit the PMULUDQ directly.
Remove the DAG combine that was added when the ReplaceNodeResults code was originally added. Add a new DAG combine to avoid regressions in shrink_vmul.ll
Some of the shrink_vmul.ll test cases now pick PMULUDQ instead of PMADDWD/PMULLD, but I think this should be an improvement on most CPUs.
I think all of this can go away if/when we switch to -x86-experimental-vector-widening-legalization
llvm-svn: 350134
Create PMULDQ/PMULUDQ as long as the number of elements is a power of 2.
This seems to give some improvements in our ability to use SimplifyDemandedBits.
llvm-svn: 350084
Make each of the helper functions only return their comparison node and the condition code. Leave X86ISD::SETCC creation to the LowerSETCC function itself.
Looking into whether we can use this code directly in BRCOND and SELECT lowering instead of going through LowerSETCC which creates an X86ISD::SETCC node we need to look through.
llvm-svn: 350082
Only one of the 3 callers of LowerAndToBT need the SETCC node. Two of them have to look through it to find the operands they really need. Instead create it after the one call that needs it.
LowerAndToBT now returns both the BT node and the X86 specific condition code separately.
llvm-svn: 350081
This is an alternative to what I attempted in D56057.
GetDemandedBits is a special version of SimplifyDemandedBits that allows simplifications even when the operand has other uses. GetDemandedBits will only do simplifications that allow a node to be bypassed. It won't create new nodes or alter any of the other users.
I had to add support for bypassing SIGN_EXTEND_INREG to GetDemandedBits.
Based on a patch that Simon Pilgrim sent me in email.
Fixes PR40142.
llvm-svn: 350059
Remove the TESTmr isel patterns and add another postprocessing combine for TESTrr+ANDrm->TESTmr. We already have a postprocessing combine for TESTrr+ANDrr->TESTrr. With this we can give ANDN a chance to match first. And clean it up during post processing if we ended up with just a regular AND.
This is another step towards my plan to gut EmitTest and do more flag handling during isel matching or by using optimizeCompare.
llvm-svn: 350038
The missed load folding noticed in D55898 is visible independent of that change
either with an adjusted IR pattern to start or with AVX2/AVX512 (where the build
vector becomes a broadcast first; movddup is not produced until we get into isel
via tablegen patterns).
Differential Revision: https://reviews.llvm.org/D55936
llvm-svn: 350005
Summary:
Added a pair of APIs for encoding/decoding the 3 components of a DWARF discriminator described in http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html: the base discriminator, the duplication factor (useful in profile-guided optimization) and the copy index (used to identify copies of code in cases like loop unrolling)
The encoding packs 3 unsigned values in 32 bits. This CL addresses 2 issues:
- communicates overflow back to the user
- supports encoding all 3 components together. Current APIs assume a sequencing of events. For example, creating a new discriminator based on an existing one by changing the base discriminator was not supported.
Reviewers: davidxl, danielcdh, wmi, dblaikie
Reviewed By: dblaikie
Subscribers: zzheng, dmgreen, aprantl, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D55681
llvm-svn: 349973
This fixes the patterns that have or/and as a root. 'and' is handled differently since thy usually have a CMP wrapped around them.
I had to look for uses of the CF flag because all these nodes have non-standard CF flag behavior. A real or/xor would always clear CF. In practice we shouldn't be using the CF flag from these nodes as far as I know.
Differential Revision: https://reviews.llvm.org/D55813
llvm-svn: 349962
The BEXTR instruction documents the SF bit as undefined.
The TBM BEXTR instruction has the same issue, but I'm not sure how to test it. With the control being an immediate we can determine the sign bit is 0 or the BEXTR would have been removed.
Fixes PR40060
Differential Revision: https://reviews.llvm.org/D55807
llvm-svn: 349956
This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right.
In the motivating example, note that we have movddup before SSE4.1 and
again with AVX2. That's because insertps isn't available pre-SSE41 and
vbroadcast is (more generally) available with AVX2 (and the splat is
reduced to movddup via isel pattern).
Differential Revision: https://reviews.llvm.org/D55898
llvm-svn: 349937
This shortens the switches in X86ISelDAGToDAG.cpp to only need to check condition code instead of a list of opcodes.
This also fixes a bug where the memory forms of SETcc were missing from hasNoCarryFlagUses.
llvm-svn: 349868
Summary:
This allows expanding {7,11,13,14,15,21,22,23,25,26,27,28,29,30,31}-byte memcmp
in just two loads on X86. These were previously calling memcmp.
Reviewers: spatel, gchatelet
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55263
llvm-svn: 349731
The (cmp (and X, Y) 0) pattern is greedy and ends up forming a TESTrr and consuming the and when it might be better to use one of the BMI/TBM like BLSR or BLSI.
This patch moves removes the pattern from isel and adds a post processing check to combine TESTrr+ANDrr into just a TESTrr. With this patch we are able to select the BMI/TBM instructions, but we'll also emit a TESTrr when the result is compared to 0. In many cases the peephole pass will be able to use optimizeCompareInstr to remove the TEST, but its probably not perfect.
Differential Revision: https://reviews.llvm.org/D55870
llvm-svn: 349661
Fixes https://bugs.llvm.org/show_bug.cgi?id=38743
The function removeRedundantBlockingStores is supposed to remove any blocking stores contained in each other in lockingStoresDispSizeMap.
But it currently looks only at the previous one, which will miss some cases that result in assert.
This patch refine the function to check all previous layouts until find the uncontained one. So all redundant stores will be removed.
Patch by Pengfei Wang
Differential Revision: https://reviews.llvm.org/D55642
llvm-svn: 349660
We already had BSF here as part of __builtin_ffs improvements and I was just wondering yesterday whether we should have BSR there.
This addresses one issue from PR40090.
llvm-svn: 349531
Migrate the X86 backend from X86ISD opcodes ADDS and SUBS to generic
ISD opcodes SADDSAT and SSUBSAT. This also improves scodegen for
@llvm.sadd.sat() and @llvm.ssub.sat() intrinsics.
This is a followup to D55787 and part of PR40056.
Differential Revision: https://reviews.llvm.org/D55833
llvm-svn: 349520
InstCombine seems to canonicalize or PSUB patter into a max with the cosntant and an add with an inverse of the constant.
This patch recognizes this pattern and turns it into PSUBUS. Future work could improve undef element handling.
Fixes some of PR40053
Differential Revision: https://reviews.llvm.org/D55780
llvm-svn: 349519
Replace the X86ISD opcodes ADDUS and SUBUS with generic ISD opcodes
UADDSAT and USUBSAT. As a side-effect, this also makes codegen for
the @llvm.uadd.sat and @llvm.usub.sat intrinsics reasonable.
This only replaces use in the X86 backend, and does not move any of
the ADDUS/SUBUS X86 specific combines into generic codegen.
Differential Revision: https://reviews.llvm.org/D55787
llvm-svn: 349481
(VSRAI (VSHLI X, C1), C1) --> X iff NumSignBits(X) > C1
This works better as part of SimplifyDemandedBits than part of the general combine.
llvm-svn: 349462
This fold was incredibly specific - replace with a SimplifyDemandedBits fold to remove a VSRAI if only the original sign bit is demanded (its guaranteed to stay the same).
Test change is merely a rescheduling.
llvm-svn: 349459
Convert VSRAI to VSRLI is the sign bit is known zero and improve KnownBits output for all shift instruction.
Fixes the poor codegen comments in D55768.
llvm-svn: 349407
This allows a TEST to be used and can be combined with any AND that may already exist as an input to the shift.
This was already done in EmitTest, but was easily tricked by multiple uses because the setcc might be used by multiple instructions. Once the SETCC and users are legalized then we can look for the shift to be used by a single CMP, but the CMP itself can have multiple users.
This appears to fix the case in PR39968.
llvm-svn: 349385
This is an initial patch to add the necessary support for a DemandedElts argument to SimplifyDemandedBits, more closely matching computeKnownBits and to help improve vector codegen.
I've added only a small amount of the changes necessary to get at least one test to update - a lot more can be done but I'd like to add these methodically with proper test coverage, at the same time the hope is to slowly move some/all of SimplifyDemandedVectorElts into SimplifyDemandedBits as well.
Differential Revision: https://reviews.llvm.org/D55768
llvm-svn: 349374
We keep a few iterators into the basic block we're selecting while
performing FastISel. Usually this is fine, but occasionally code wants
to remove already-emitted instructions. When this happens we have to be
careful to update those iterators so they're not pointint at dangling
memory.
llvm-svn: 349365
I'd like to try to move a lot of the flag matching out of EmitTest and push it to isel or isel preprocessing. This is a step towards that.
The test-shrink-bug.ll changie is an improvement because we are no longer interfering with test shrink handling in isel.
The pr34137.ll change is a regression, but the IR came from -O0 and was not reduced by InstCombine. So it contains a lot of redundancies like duplicate loads that made it combine poorly.
llvm-svn: 349315
Use consistent rules for when to lower to SHLD/SHRD for slow machines - fixes a weird issue where funnel shift gets expanded but then X86ISelLowering's combineOr sees the optsize and combines to SHLD/SHRD, but now with the modulo amount guard......
llvm-svn: 349285
The only caller of this turns CMP with 0 into TEST. CMP with 0 and TEST both set OF to 0 so we should have no issues with instructions that only use OF.
Though I don't think there's any reason we would read just OF after a compare with 0 anyway. So this probably isn't an observable change.
llvm-svn: 349223
hasNoCarryFlagUses hardcoded that the flag result is 1 and used that to filter which uses were of interest. hasNoSignedComparisonUses just assumes the only result is flags and checks whether any user of the node is a CopyToReg instruction.
After this patch we now do a result number check in both and rely on the caller to provide the result number.
This shouldn't change behavior it was just an odd difference between the two functions that I noticed.
llvm-svn: 349222
Summary:
If the setcc already has the target desired type we can reach the getSetCC/getSExtOrTrunc after the MatchingVecType check with the exact same types as the nodes we started with. This causes those causes VsetCC to be CSEd to N0 and the getSExtOrTrunc will CSE to N. When we return N, the caller will think that meant we called CombineTo and did our own worklist management. But that's not what happened. This prevents target hooks from being called for the node.
To fix this, I've now returned SDValue if the setcc is already the desired type. But to avoid some regressions in X86 I've had to disable one of the target combines that wasn't being reached before in the case of a (sext (setcc)). If we get vector widening legalization enabled that entire function will be deleted anyway so hopefully this is only for the short term.
Reviewers: RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55459
llvm-svn: 349137
This requires the two callers to manifest a 0 to make EmitCmp call EmitTest.
I'm looking into changing how we combine TEST and flag setting instructions to not be part of lowering. And instead be part of DAG combine or isel. Which will mean EmitTest will probably become gutted and maybe disappear entirely.
llvm-svn: 349094
Summary:
Macros are expanded on a single line. In case of large expansions,
with sufficiently many instructions with memory operands (and when
-fdebug-info-for-profiling is requested), we may be unable to generate
new base discriminator values - new values overflow (base
discriminators may not be larger than 2^12).
This CL warns instead of asserting in such a case. A subsequent CL
will add APIs to check for overflow before creating new debug info.
See https://bugs.llvm.org/show_bug.cgi?id=39890
Reviewers: davidxl, wmi, gbedwell
Reviewed By: davidxl
Subscribers: aprantl, llvm-commits
Differential Revision: https://reviews.llvm.org/D55643
llvm-svn: 349075
There's still a couple of minor SimplifyDemandedElts regressions in some of the shift amount splats that will be fixed in future patches.
llvm-svn: 349052
Move existing rotation expansion code into TargetLowering and set it up for vectors as well.
Ideally this would share more of the funnel shift expansion, but we handle the shift amount modulo quite differently at the moment.
Begun removing x86 vector rotate custom lowering to use the expansion.
llvm-svn: 349025
MULX has somewhat improved register allocation constraints compared to the legacy MUL instruction. Both output registers are encoded instead of fixed to EAX/EDX, but EDX is used as input. It also doesn't touch flags. Unfortunately, the encoding is longer.
Prefering it whenever BMI2 is enabled is probably not optimal. Choosing it should somehow be a function of register allocation constraints like converting adds to three address. gcc and icc definitely don't pick MULX by default. Not sure what if any rules they have for using it.
Differential Revision: https://reviews.llvm.org/D55565
llvm-svn: 348975
I'm hoping we can just replace SETCC_CARRY with SBB. This is another step towards that.
I've explicitly used zero as the input to the setcc to avoid a false dependency that we've had with the SETCC_CARRY. I changed one of the patterns that used NEG to instead use an explicit compare with 0 on the LHS. We needed the zero anyway to avoid the false dependency. The negate would clobber its input register. By using a CMP we can avoid that which could be useful.
Differential Revision: https://reviews.llvm.org/D55414
llvm-svn: 348959
This patch introduces a generic function to determine whether a given vector type is known to be a splat value for the specified demanded elements, recursing up the DAG looking for BUILD_VECTOR or VECTOR_SHUFFLE splat patterns.
It also keeps track of the elements that are known to be UNDEF - it returns true if all the demanded elements are UNDEF (as this may be useful under some circumstances), so this needs to be handled by the caller.
A wrapper variant is also provided that doesn't take the DemandedElts or UndefElts arguments for cases where we just want to know if the SDValue is a splat or not (with/without UNDEFS).
I had hoped to completely remove the X86 local version of this function, but I'm seeing some regressions in shift/rotate codegen that will take a little longer to fix and I hope to get this in sooner so I can continue work on PR38243 which needs more capable splat detection.
Differential Revision: https://reviews.llvm.org/D55426
llvm-svn: 348953
This extends the code that handles 16-bit add promotion to form LEA to also allow 8-bit adds.
That allows us to combine add ops with register moves and save some instructions. This is
another step towards allowing add truncation in generic DAGCombiner (see D54640).
Differential Revision: https://reviews.llvm.org/D55494
llvm-svn: 348946
Summary:
When doing X86CondBrFolding::analyzeCompare, it will meet the SUB32ri instruction as below to use the global address for its operand,
%733:gr32 = SUB32ri %62:gr32(tied-def 0), @img2buf_normal, implicit-def $eflags
JNE_1 %bb.41, implicit $eflags
so the assertion "assert(MI.getOperand(ValueIndex).isImm() && "Expecting Imm operand")" is not correct and change the assert to if make X86CondBrFolding::analyzeCompare return false as not finding the compare for this
Patch by Jianping Chen
Reviewers: smaslov, LuoYuanke, liutianle, Jianping
Reviewed By: Jianping
Subscribers: lebedev.ri, llvm-commits
Differential Revision: https://reviews.llvm.org/D54250
llvm-svn: 348853
As discussed in D55494, we want to extend this to handle 8-bit
ops too, but that could be extended further to enable this on
32-bit systems too.
llvm-svn: 348851
As discussed in:
D55494
...this code has been disabled/dead for a long time (the code references
Athlon and Pentium 4), and there's almost no chance that it will be used
given the last decade of uarch evolution. Also, in SDAG we promote 16-bit
ops to 32-bit, so there's almost no way to test this code any more.
llvm-svn: 348845
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.
This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.
Differential Revisions: https://reviews.llvm.org/D53629
llvm-svn: 348788
This should really be generalized to allow increment and/or
we should replace it by using ISD::matchUnaryPredicate().
See D55515 for context.
llvm-svn: 348776
Fixes https://bugs.llvm.org/show_bug.cgi?id=39926.
The size of the first copy was computed as
std::abs(std::abs(LdDisp2) - std::abs(LdDisp1)), which results in
skipped bytes if the signs of LdDisp2 and LdDisp1 differ. As far as
I can see, this should just be LdDisp2 - LdDisp1. The case where
LdDisp1 > LdDisp2 is already handled in the code above, in which case
LdDisp2 is set to LdDisp1 and this subtraction will evaluate to
Size1 = 0, which is the correct value to skip an overlapping copy.
Differential Revision: https://reviews.llvm.org/D55485
llvm-svn: 348750
Both intrinsics do the exact same thing so we really only need one.
Earlier in the 8.0 cycle we changed the signature of this intrinsic without renaming it. But it looks difficult to get the autoupgrade code to allow me to merge the intrinsics and change the signature at the same time. So I've renamed the intrinsic slightly for the new merged intrinsic. I'm skipping autoupgrading from the previous new to 8.0 signature. I've also renamed the subborrow for consistency.
llvm-svn: 348737
Previously we had to take the carry in and add -1 to it to set the carry flag so we could use it with ADC/SBB. But if we know its 0 then we don't need to bother.
This should go a long way towards fixing PR24545.
llvm-svn: 348727
The dependency was added in r213995 in response to r213986 which did make
X86/Utils depend on IR, but r256680 later removed that dependency again.
llvm-svn: 348724
The existing code tries to handle an undef operand while transforming an add to an LEA,
but it's incomplete because we will crash on the i16 test with the debug output shown below.
It's better to just give up instead. Really, GlobalIsel should have folded these before we
could get into trouble.
# Machine code for function add_undef_i16: NoPHIs, TracksLiveness, Legalized, RegBankSelected, Selected
bb.0 (%ir-block.0):
liveins: $edi
%1:gr32 = COPY killed $edi
%0:gr16 = COPY %1.sub_16bit:gr32
%5:gr64_nosp = IMPLICIT_DEF
%5.sub_16bit:gr64_nosp = COPY %0:gr16
%6:gr64_nosp = IMPLICIT_DEF
%6.sub_16bit:gr64_nosp = COPY %2:gr16
%4:gr32 = LEA64_32r killed %5:gr64_nosp, 1, killed %6:gr64_nosp, 0, $noreg
%3:gr16 = COPY killed %4.sub_16bit:gr32
$ax = COPY killed %3:gr16
RET 0, implicit killed $ax
# End machine code for function add_undef_i16.
*** Bad machine code: Reading virtual register without a def ***
- function: add_undef_i16
- basic block: %bb.0 (0x7fe6cd83d940)
- instruction: %6.sub_16bit:gr64_nosp = COPY %2:gr16
- operand 1: %2:gr16
LLVM ERROR: Found 1 machine code errors.
Differential Revision: https://reviews.llvm.org/D54710
llvm-svn: 348722
Extension to rL348617, turns out llvm-exegesis doesn't need to match the perf counter name against a scheduler model resource name - so I've added a few more counters that I could find in the libpfm4 source code (and fix a typo in the knl/knm retired_uops counter - which uses 'all' instead of 'any').
llvm-svn: 348721
To make X86CondBrFoldingPass can be run with --run-pass option, this can test one wrong assertion on analyzeCompare function for SUB32ri when its operand is not imm
Patch by Jianping Chen
Differential Revision: https://reviews.llvm.org/D55412
llvm-svn: 348620
This patch attempts to improve pfm perf counter coverage for all the x86 CPUs that libpfm4 supports.
Intel/AMD CPU families tend to share names for cycle/uops counters so even if they don't have a scheduler model yet they can at least use the default values (checked against the libpfm4 source code).
The remaining CPUs (where their port/pipe resource counters are known) I've tried to add to the existing model mappings.
These are untested but don't represent a regression to current llvm-exegesis behaviour for these CPUs.
Differential Revision: https://reviews.llvm.org/D55432
llvm-svn: 348617
Adds fatal errors for any target that does not support the Tiny or Kernel
codemodels by rejigging the getEffectiveCodeModel calls.
Differential Revision: https://reviews.llvm.org/D50141
llvm-svn: 348585
This addresses a FIXME and avoids depending on an isel pattern match I think. I've remove the isel patterns too since he have no lit tests left that cover them. Hopefully that really means they are unused.
I'm trying to decide if we need SETCC_CARRY. This removes one of its usages.
Differential Revision: https://reviews.llvm.org/D55355
llvm-svn: 348536
Initial step towards making the function more generic (and probably move into SelectionDAG).
This is necessary to avoid massive codegen bloat for PR38243 (Add modulo rotate support to LowerRotate).
llvm-svn: 348498
Whenever we effectively take the address of a basic block we need to
manually update that basic block to reflect that fact or later passes
such as tail duplication and tail merging can break the invariants of
the code. =/ Sadly, there doesn't appear to be any good way of
automating this or even writing a reasonable assert to catch it early.
The change seems trivially and obviously correct, but sadly the only
really good test case I have is 1000s of basic blocks. I've tried
directly writing a test case that happens to make tail duplication do
something that crashes later on, but this appears to require an
*amazingly* complex set of conditions that I've not yet reproduced.
The change is technically covered by the tests because we mark the
blocks as having their address taken, but that doesn't really count as
properly testing the functionality.
llvm-svn: 348374
Prep work for PR38243 - mainly adding comments on where we need to add modulo support (doing so at the moment causes massive codegen regressions).
I've also consistently added support for modulo folding for uniform constants (although at the moment we have no way to trigger this) and removed the old assertions.
llvm-svn: 348366
This is an initial patch to add a minimum level of support for funnel shifts to the SelectionDAG and to begin wiring it up to the X86 SHLD/SHRD instructions.
Some partial legalization code has been added to handle the case for 'SlowSHLD' where we want to expand instead and I've added a few DAG combines so we don't get regressions from the existing DAG builder expansion code.
Differential Revision: https://reviews.llvm.org/D54698
llvm-svn: 348353
PR17686 demonstrates that for some targets FP exceptions can fire in cases where the FP_TO_UINT is expanded using a FP_TO_SINT instruction.
The existing code converts both the inrange and outofrange cases using FP_TO_SINT and then selects the result, this patch changes this for 'strict' cases to pre-select the FP_TO_SINT input and the offset adjustment.
The X87 cases don't need the strict flag but generates much nicer code with it....
Differential Revision: https://reviews.llvm.org/D53794
llvm-svn: 348251
We only needed this because it provided really aggressive constant folding even through constant pool entries created from build_vectors. The main case was for vXi8 MULH legalization which was happening as part of legalize DAG instead of as part of legalize vector ops. Now its part of vector op legalization and we've added special handling for build vectors of all constants there. This has removed the need for this code on the list tests we have.
llvm-svn: 348237
This is the smallest vector enhancement I could find to D54640.
Here, we're allowing narrowing to only legal vector ops because we'll see
regressions without that. All of the test diffs are wins from what I can tell.
With AVX/AVX512, we can shrink ymm/zmm ops to xmm.
x86 vector multiplies are the problem case that we're avoiding due to the
patchwork ISA, and it's not clear to me if we can dance around those
regressions using TLI hooks or if we need preliminary patches to plug those
holes.
Differential Revision: https://reviews.llvm.org/D55126
llvm-svn: 348195
Summary:
We need to unpackl and unpackh the operands to use two vXi16 multiplies. Previously it looks like the low unpack would get constant folded at least in the 128-bit case after shuffle lowering turned the unpackl into ZERO_EXTEND_VECTOR_INREG and X86 custom DAG combined it. The same doesn't happen for the high half. So we'd load a constant and then shuffle it. But the low half would just be loaded and used by the multiply directly.
After this patch we now end up with a constant pool entry for the low and high unpacks separately with no shuffle operations.
This is a step towards removing custom constant folding for ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG in the X86 backend.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55165
llvm-svn: 348159
Summary:
Under -x86-experimental-vector-widening-legalization, fp_to_uint/fp_to_sint with a smaller than 128 bit vector type results are custom type legalized by promoting the result to a 128 bit vector by promoting the elements, inserting an assertzext/assertsext, then truncating back to original type. The truncate will be further legalizdd to a pack shuffle. In the case of a v8i8 result type, we'll end up with a v8i16 fp_to_sint. This will need to be further legalized during vector op legalization by promoting to v8i32 and then truncating again. Under avx2 this produces good code with two pack instructions, but Under avx512 this will result in a truncate instruction and a packuswb instruction. But we should be able to get away with a single truncate instruction.
The other option is to promote all the way to vXi32 result type during the first type legalization. But in some experimentation that seemed to require more work to produce good code for other configurations.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54836
llvm-svn: 348158
Previously this code generated its own extracts and build_vector. But we can use a simpler concat_vectors or scalar_to_vector operation and let type legalization do additional legalization of those operations.
llvm-svn: 348087