Commit Graph

21769 Commits

Author SHA1 Message Date
David Sherwood 0aff1798b5 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
Simon Pilgrim 1cfecf4fc4 [X86][AVX] Add getBROADCAST_LOAD helper function. NFCI.
Begin replacing individual getMemIntrinsicNode calls and setup (for X86ISD::VBROADCAST_LOAD + X86ISD::SUBV_BROADCAST_LOAD opcodes) with this getBROADCAST_LOAD helper.
2021-07-25 20:37:58 +01:00
Simon Pilgrim b95f66ad78 [X86][SSE] LowerRotate - perform modulo on the amount splat source directly.
If the rotation amount is a known splat, perform the modulo on the splat source, and then perform the splat. That way the amount-extension performed later by LowerScalarVariableShift can fold the splats away without any multiple-use issues.

Fixes one of the concerns raised on D104156
2021-07-25 17:30:32 +01:00
Sanjay Patel 1ce05ad619 [x86] improve CMOV codegen by pushing add into operands, part 2
This is a minimum extension of D106607 to allow folding for
2 non-zero constantsi that can be materialized as immediates..

In the reduced test examples, we save 1 instruction by rolling
the constants into LEA/ADD. In the motivating test from the bullet
benchmark, we absorb both of the constant moves into add ops via
LEA magic, so we reduce by 2 instructions.

Differential Revision: https://reviews.llvm.org/D106684
2021-07-25 10:05:41 -04:00
Simon Pilgrim 15b883f457 [X86][AVX] Adjust AllowBWIVPERMV3 tolerance to account for VariableCrossLaneShuffleDepth
As noticed on D105390 - we were hardwiring the depth limit for combining to VPERMI2W/VPERMI2B instructions. Not only had we made the limit too low, we hadn't accounted for slow/fast shuffles via the VariableCrossLaneShuffleDepth control
2021-07-25 14:05:11 +01:00
Craig Topper cc6d302c91 [X86] Fix a bug in TEST with immediate creation
This code tries to form a TEST from CMP+AND with an optional
truncate in between. If we looked through the truncate, we may
have extra bits in the AND mask that shouldn't participate in
the checks. Normally SimplifyDemendedBits takes care of this, but
the AND may have another user. So manually mask out any extra bits.

Fixes PR51175.

Differential Revision: https://reviews.llvm.org/D106634
2021-07-23 09:03:53 -07:00
Sanjay Patel f060aa1cf3 [x86] improve CMOV codegen by pushing add into operands
This is not the transform direction we want in general,
but by the time we have a CMOV, we've already tried
everything else that could be better.
The transform increases the uses of the other add operand,
but that is safe according to Alive2:
https://alive2.llvm.org/ce/z/Yn6p-A

We could probably extend this to other binops (not just add).
This is the motivating pattern discussed in:
https://llvm.org/PR51069

The test with i8 shows a missed fold because there's a trunc
sitting in front of the add. That can be handled with a small
follow-up.

Differential Revision: https://reviews.llvm.org/D106607
2021-07-23 09:39:32 -04:00
Simon Pilgrim 71d0fd3564 [X86][AVX] lowerV2X128Shuffle - attempt to recognise broadcastf128 subvector load
As noticed on PR50053 we were failing to recognise when a shuffle of a load was really a subvector broadcast load
2021-07-23 13:10:38 +01:00
Simon Pilgrim 4185c5502c [CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.
2021-07-22 20:07:32 +01:00
Simon Pilgrim d073b19dbf [X86] Fix SLM FP<->INT throughputs.
Noticed while trying to clean up the shift costs model for SSE4 targets using the script in D10369 - SLM double-pumps all the 128-bit vector conversion ops and only use FP0 pipe - numbers taken from Intel AOM + Agner.
2021-07-22 19:39:04 +01:00
Simon Pilgrim e1bdb57958 [CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.
2021-07-22 18:12:49 +01:00
Eric Astor 69551486fd [ms] [llvm-ml] Restrict implicit RIP-relative addressing to named-variable references
ML64.EXE applies implicit RIP-relative addressing only to memory references that include a named-variable reference.

Reviewed By: mstorsjo

Differential Revision: https://reviews.llvm.org/D105372
2021-07-21 11:49:58 -04:00
Tianqing Wang bec4a8157d [X86] Update MachineLoopInfo in CMOV conversion.
If a CMOV is in a loop and is converted to branches, CMOV conversion wouldn't
add newly created basic blocks to loop info. Since the candidates is collected
based on loops, instructions in these basic blocks will be ignored.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D104623
2021-07-21 10:53:46 +08:00
Simon Pilgrim c188f0b876 [X86] X86InstCombineIntrinsic.cpp - silence clang-tidy warnings about incorrect uses of auto. NFCI.
We were using auto instead of auto* in a number of places which failed the llvm-qualified-auto check.

Additionally we were using auto in some places where the type wasn't immediately obvious - the style guide rule of thumb is only to use auto from casts etc. where the type is already explicitly stated.
2021-07-20 13:37:45 +01:00
Simon Pilgrim 142e60f40b [X86] Fix case of IsAfterLegalize argument. NFC.
Pulled out of D106280
2021-07-19 17:15:28 +01:00
Jeremy Morse f46321207f [InstrRef][X86] Drop debug instruction numbers from x87 instructions
Avoid a crash when using instruction referencing if x87 floating point
instructions are used. These instructions are significantly mutated when
they're rewritten from referring to registers, to referring to
floating-point-stack positions. As a result, their operands are re-ordered,
and (InstrRef) LiveDebugValues asserts when it sees a DBG_INSTR_REF
referring to a non-reg non-def register operand.

To fix this, drop the instruction numbers, and thus variable locations.
This patch adds a helper utility do do that.

Dropping the variable locations is sub-optimal, but applying DBG_VALUEs to
the $fp0 and similar registers is dropped on emission too. It seems we've
never done well at describing variables that live in x87 registers, at all.

Differential Revision: https://reviews.llvm.org/D105657
2021-07-19 15:08:27 +01:00
Eli Friedman 6601be4419 [X86] Remove incorrect use of known bits in shuffle simplification.
This reverts commit 2a419a0b99.

The result of a shufflevector must not propagate poison from any element
other than the one noted in the shuffle mask.

The regressions outside of fptoui-may-overflow.ll can probably be
recovered some other way; for example, using isGuaranteedNotToBePoison.

See discussion on https://reviews.llvm.org/D106053 for more background.

Differential Revision: https://reviews.llvm.org/D106222
2021-07-18 18:13:11 -07:00
Simon Pilgrim 51a12d2ff0 [X86][SSE] matchShuffleWithPACK - avoid poison pollution from bitcasting multiple elements together.
D106053 exposed that we've not been taking into account that by bitcasting smaller elements together and then performing a ComputeKnownBits on the result we'd be allowing a poison element to influence other neighbouring elements being used in the pack. Instead we now peek through any existing bitcast to ensure that the source type already matches the width source of the pack node we're trying to match.

This has also been a chance to stop matchShuffleWithPACK creating unused nodes on the fly which could affect oneuse tests during shuffle lowering/combining.

The only regression we're seeing is due to being unable to peek through a bitcast as its on the other side of a extract_subvector - which should go away once we finally allow shuffle combining across different vector widths (by making matchShuffleWithPACK using const SelectionDAG& we've gotten closer to this - see PR45974).
2021-07-18 14:25:28 +01:00
Simon Pilgrim d2458bcdc6 [X86][SSE] combineX86ShufflesRecursively - bail if constant folding fails due to oneuse limits.
Fixes issue reported on D105827 where a single shuffle of a constant (with multiple uses) was caught in an infinite loop where one shuffle (UNPCKL) used an undef arg but then that got recombined to SHUFPS as the constant value had its own undef that confused matching.....
2021-07-16 19:21:46 +01:00
Guozhi Wei 5609c8b607 [X86FixupLEAs] Try again to transform the sequence LEA/SUB to SUB/SUB
This patch transforms the sequence
    lea (reg1, reg2), reg3
    sub reg3, reg4
to two sub instructions
    sub reg1, reg4
    sub reg2, reg4

Similar optimization can also be applied to LEA/ADD sequence.

The modifications to TwoAddressInstructionPass is to ensure the operands of ADD
instruction has expected order (the dest register of LEA should be src register
of ADD).

Differential Revision: https://reviews.llvm.org/D104684
2021-07-16 10:16:03 -07:00
Harald van Dijk a8ad917054
[X86] Fix handling of maskmovdqu in X32
The maskmovdqu instruction is an odd one: it has a 32-bit and a 64-bit
variant, the former using EDI, the latter RDI, but the use of the
register is implicit. In 64-bit mode, a 0x67 prefix can be used to get
the version using EDI, but there is no way to express this in
assembly in a single instruction, the only way is with an explicit
addr32.

This change adds support for the instruction. When generating assembly
text, that explicit addr32 will be added. When not generating assembly
text, it will be kept as a single instruction and will be emitted with
that 0x67 prefix. When parsing assembly text, it will be re-parsed as
ADDR32 followed by MASKMOVDQU64, which still results in the correct
bytes when converted to machine code.

The same applies to vmaskmovdqu as well.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103427
2021-07-15 22:56:08 +01:00
Simon Pilgrim ee71c1bbcc [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
Simon Pilgrim 3cee36c5ac [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result (REAPPLIED)
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))

Reapplied rGe4aa6ad13216 with fix for intrinsic variants of the opcode which uses a vector return type
2021-07-13 12:31:09 +01:00
Vitaly Buka 606551ee98 Revert "[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result"
Fails here https://lab.llvm.org/buildbot/#/builders/37/builds/5267

This reverts commit e4aa6ad132.
2021-07-12 22:26:54 -07:00
Simon Pilgrim ae0d73ac3b [CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 20:38:25 +01:00
Craig Topper d5c97f4bf0 [X86] Teach X86FloatingPoint's handleCall to only erase the FP stack if there is a regmask operand that clobbers the FP stack.
There are some calls to functions like `__alloca` that are missing
a regmask operand. Lack of a regmask operand means that all
registers that aren't mentioned by def operands are preserved.
__alloca only updates EAX and ESP and has def operands for
them so this is ok. Because there is no regmask the register
allocator won't spill the FP registers across the call. Assuming
we want to keep the FP stack untoched across these calls, we
need to handle this is in the FP stackifier.

We might want to add a proper regmask operand to the code that
creates these calls to indicate all registers are preserved, but we'd
still need this change to the FP stackifier to know to preserve the
FP stack for such a regmask.

The test is kind of long, but bugpoint wasn't able to reduce it
any further.

Fixes PR50782

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D105762
2021-07-12 10:15:38 -07:00
Simon Pilgrim 96b4117d51 [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports.
Update truncation costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 13:50:43 +01:00
Simon Pilgrim e4aa6ad132 [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))
2021-07-12 09:56:59 +01:00
Jeremy Morse 30cce54dad [X86] Return src/dest register from stack spill/restore recogniser
LLVM provides target hooks to recognise stack spill and restore
instructions, such as isLoadFromStackSlot, and it also provides post frame
elimination versions such as isLoadFromStackSlotPostFE. These are supposed
to return the store-source and load-destination registers; unfortunately on
X86, the PostFE recognisers just return "1", apparently to signify "yes
it's a spill/load". This patch alters the hooks to correctly return the
store-source and load-destination registers:

This is really useful for debug-info as we it helps follow variable values
as they move on/off the stack. There should be no codegen changes: the only
other users of these PostFE target hooks are MachineInstr::getRestoreSize
and MachineInstr::getSpillSize, which don't attempt to interpret the
returned register location.

While we're here, delete the (InstrRef) LiveDebugValues heuristic that
tries to find the spill source register by looking for a killed reg -- we
should be able to rely on the target hooks for that. This involves
temporarily turning off a n InstrRef LivedDebugValues test on aarch64
(patch to re-enable it is in D104521).

Differential Revision: https://reviews.llvm.org/D105428
2021-07-09 18:12:30 +01:00
Simon Pilgrim 9dbeac16ba [X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to let us add AssertSext/AssertZext
Its proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.

This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).

This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
2021-07-09 12:07:33 +01:00
David Green 38c9a4068d [TTI] Remove IsPairwiseForm from getArithmeticReductionCost
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.

This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.

Differential Revision: https://reviews.llvm.org/D105484
2021-07-09 11:51:16 +01:00
Alexey Bataev 0d74fd3fdf [SLP][COST][X86]Improve cost model for masked gather.
Revived D101297 in its original form + added some changes in X86
legalization cehcking for masked gathers.

This solution is the most stable and the most correct one. We have to
check the legality before trying to build the masked gather in SLP.
Without this check we have incorrect cost (for SLP) in case if the masked gather
is not legal/slower than the gather. And we're missing some
vectorization opportunities.

This can be fixed in the cost model, but in this case we need to add
special checks for the cost of GEPs for ScatterVectorize node, add
special check for small trees, etc., i.e. there are a lot of corner
cases here and there, which insrease code base and make it harder to
maintain the code.

> Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed.

The question from D101297. Actually, no, it can't. Actually, simple
gather may give us better result, especially after we started
vectorization of insertelements. Plus, like I said before, the cost for
non-legal masked gathers leads to missed vectorization opportunities.

Differential Revision: https://reviews.llvm.org/D105042
2021-07-08 11:53:30 -07:00
Matt Arsenault 9b057f647d GlobalISel: Track original argument index in ArgInfo
SelectionDAG's equivalents in ISD::InputArg/OutputArg track the
original argument index. Mips relies on this, and its currently
reinventing its own parallel CallLowering infrastructure which tracks
these indexes on the side. Add this to help move towards deleting the
custom mips handling.
2021-07-08 13:39:02 -04:00
Simon Pilgrim 8ef67fa9d2 [CostModel][X86] Account for older SSE targets with slow fp->int conversions
Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets
2021-07-08 18:08:24 +01:00
Jeremy Morse 63cc251eb9 [DebugInfo][InstrRef][4/4] Support DBG_INSTR_REF through all backend passes
This is a cleanup patch -- we're now able to support all flavours of
variable location in instruction referencing mode. This patch updates
various tests for debug instructions to be broader: numerous code paths
try to ignore debug isntructions, and they now have to ignore the
additional DBG_PHI and DBG_INSTR_REFs that we can generate.

A small amount of rework happens for LiveDebugVariables: as we don't need
to track live intervals through regalloc any more, we can get away with
unlinking debug instructions before regalloc, then re-inserting them after.
Note that this isn't (yet) true of DBG_VALUE_LISTs, they still have to go
through live interval tracking.

In SelectionDAG, add a helper lambda that emits half-formed DBG_INSTR_REFs
for arguments in instr-ref mode, DBG_VALUE otherwise. This is one of the
final locations where DBG_VALUEs are emitted for vreg arguments.

X86InstrInfo now un-sets the debug instr number on SUB instructions that
get mutated into CMP instructions. As the instruction no longer computes a
subtraction, we can't use it for variable locations.

Differential Revision: https://reviews.llvm.org/D88898
2021-07-08 16:42:24 +01:00
Simon Pilgrim ded8866f4a [X86][Atom] Fix vector fp<->int resource/throughputs
Match whats documented in the Intel AOM - almost all the conversion instructions requires BOTH ports (apart from the MMX cvtpi2ps/cvtpi2ps instructions which we already override) - this was being incorrectly modelled as EITHER port.

Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
2021-07-07 16:52:34 +01:00
Simon Pilgrim 4c7e9a3852 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim a7da0296a6 [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 12:03:45 +01:00
Simon Pilgrim b298308ba2 [CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32
Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
2021-07-06 17:28:03 +01:00
Simon Pilgrim 6f3f9535fc [CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.

We get the extension for free for non-vector loads.
2021-07-06 13:58:52 +01:00
Wang, Pengfei 9ab99f773f [X86] Twist shuffle mask when fold HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
This patch fixes PR50823.

The shuffle mask should be twisted twice before gotten the correct one due to the difference between inner HOP and outer.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104903
2021-07-05 21:29:42 +08:00
Simon Pilgrim 5db826e4ce [CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack
Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.
2021-07-05 13:26:53 +01:00
Simon Pilgrim 65e4240fa1 [CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner).
Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64
2021-07-05 13:26:53 +01:00
Nikita Popov fabc17192e [IRBuilder] Add type argument to CreateMaskedLoad/Gather
Same as other CreateLoad-style APIs, these need an explicit type
argument to support opaque pointers.

Differential Revision: https://reviews.llvm.org/D105395
2021-07-04 12:17:59 +02:00
Amir Ayupov 884bc6a6ed [X86] Modify LOOP*, HLT control flow attributes
Add missing control flow attributes:
- LOOP*: isBranch, isTerminator
- HLT: isTerminator

This helps downstream disassemblers (such as BOLT) reconstruct the control
flow graph more accurately.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D102297
2021-07-02 10:34:29 -07:00
Simon Pilgrim e5fdff1cf8 [X86][SLM] Keep similar scheduler costs types together. NFCI.
The SLM model is inconsistent about where it kept its 'unsupported' schedule classes - better to keep them close to similar classes.

I'm not sure why some ymm classes are defined and others are unsupported though (but I haven't altered them) - the only SLM-like CPU supporting any ymm is KNL and that currently uses the HSW model.
2021-07-02 14:50:24 +01:00
Simon Pilgrim d867634fbd [CostModel][X86] Update comment describing source of costs - we now use llvm-mca more than IACA 2021-07-02 14:29:32 +01:00
Simon Pilgrim d181fd918d [CostModel][X86] Drop some hard coded fp<->int scalarization costs
Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worse case numbers from the script in D103695
2021-07-02 14:29:32 +01:00
Simon Pilgrim 2aecffcd40 [CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match
Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.
2021-07-02 13:49:31 +01:00
Simon Pilgrim cdca1785d3 [CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports.
Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695.

Fixes a few regressions before we start adding AVX costs for legalized types.
2021-07-02 13:09:00 +01:00
Matt Arsenault 99c7e918b5 GlobalISel: Use LLT in call lowering callbacks
This preserves the memory type so the lowerings can rely on them.
2021-07-01 12:15:54 -04:00
Simon Pilgrim 5e5ba14b4d [CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports.
Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695.

To account for different numbers of src/dst legalized type registers we must scale the cost by maximum of the src/dst, not just use src
2021-07-01 15:34:20 +01:00
Simon Pilgrim 2a1ef8784a [CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types.
Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides.

The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.
2021-07-01 12:06:40 +01:00
Jeremy Morse 47c3fe2a22 [DebugInfo][InstrRef][1/4] Support transformations that widen values
Very late in compilation, backends like X86 will perform optimisations like
this:

    $cx = MOV16rm $rax, ...
    ->
    $rcx = MOV64rm $rax, ...

Widening the load from 16 bits to 64 bits. SEeing how the lower 16 bits
remain the same, this doesn't affect execution. However, any debug
instruction reference to the defined operand now refers to a 64 bit value,
nto a 16 bit one, which might be unexpected. Elsewhere in codegen, there's
often this pattern:

    CALL64pcrel32 @foo, implicit-def $rax
    %0:gr64 = COPY $rax
    %1:gr32 = COPY %0.sub_32bit

Where we want to refer to the definition of $eax by the call, but don't
want to refer the copies (they don't define values in the way
LiveDebugValues sees it). To solve this, add a subregister field to the
existing "substitutions" facility, so that we can describe a field within
a larger value definition. I would imagine that this would be used most
often when a value is widened, and we need to refer to the original,
narrower definition.

Differential Revision: https://reviews.llvm.org/D88891
2021-07-01 11:19:27 +01:00
Simon Pilgrim 59fa435ea6 [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760)
This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE which should reduce the number of EFLAGs bits we need to read.

As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.

I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.

Differential Revision: https://reviews.llvm.org/D101074
2021-06-30 18:46:50 +01:00
Simon Pilgrim 47941d601d [CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports
Based off the worse case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput).

The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables.

Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).
2021-06-30 15:23:34 +01:00
Craig Topper 81f6d7c082 [X86] Tighten up some inline assembly constraint handling.
Don't allow vectors to split into GPRs for 'r' and other scalar
constraints. Prevents assertion in getCopyToPartsVector.

Makes PR50907 give a better error instead of crashing.
2021-06-26 22:57:22 -07:00
Eric Astor e074d580b2 [ms] [llvm-ml] Disable C-style comments 2021-06-25 23:09:13 -04:00
Luo, Yuanke 36003c20ad [X86] Selecting fld0 for undefined value in fast ISEL.
When set opt-bisect-limit to some value that is less than ISel pass
in command line and CurBisectNum expired, "DAG to DAG" pass lower
its opt level to O0. However "processimpdefs" and "X86 FP Stackifier"
is not stopped due to the CurBisectNum expiration. So undefined fp0
is generated. This cause crash in the "X86 FP Stackifier" pass,
because Stackifier doesn't expect any undefined fp value.

Here is the scenario that cause compiler crash.

  successors: %bb.26
  liveins: $r14
    ST_FPrr $st0, implicit-def $fpsw, implicit $fpcw
    renamable $rdi = MOV64ri @.str.3.16422
    renamable $rdx = LEA64r %stack.6, 1, $noreg, 0, $noreg
    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def dead
    $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp
    dead $esi = MOV32r0 implicit-def dead $eflags, implicit-def $rsi
    CALL64pcrel32 @foo, implicit $rsp, implicit $ssp, implicit $rdi,
    implicit $rsi, implicit $rdx, implicit-def dead $fp0
    renamable $xmm0 = MOVSDrm_alt %stack.10, 1, $noreg, 0, $noreg :: (load 8
    from %stack.10)
    ADJCALLSTACKUP64 0, 0, implicit-def $rsp, implicit-def dead $eflags,
    implicit-def $ssp, implicit $rsp, implicit $ssp
    renamable $fp2 = CHS_Fp80 killed undef renamable $fp0, implicit-def
    $fpsw
    JMP_1 %bb.26
The CALL64pcrel32 mark fp0 dead, so llvm free the stack slot for fp0
and the stack become empty. In the late instruction CHS_Fp80, it use
undefined register fp0, the original code assume there must be a stack
slot for the src register (fp0) without respecting it is undefined,
so llvm report error.

We have some discussion in https://reviews.llvm.org/D104440 and we
decide to fix it in fast ISel. The fix is to lower undefined fp value to
zero value, so that it release the burden of "X86 FP Stackifier" pass.
Thank Craig for the suggestion and the initial patch to fix it.

Differential Revision: https://reviews.llvm.org/D104678
2021-06-26 08:43:09 +08:00
Craig Topper 0f3bc00a7d [X86] Simplify part of the isel for X86ISD::FCMP/STRICT_FCMP/STRICT_FCMPS.
We don't need to have the compare output a value and then copy it
to FPSW for use by FNSTSW. Instead we can just have the compare
output Glue and glue the FNSTSW to it. InstrEmitter effectively
performed this optimization when emitting the Machine IR. Doing
it directly simplifies the codes and reduces the work in
InstrEmitter. There's no change in the machine IR at the end of
isel before and after this change.
2021-06-25 11:39:01 -07:00
Serge Pavlov b36d214bed [X86] Add description of FXAM instruction
Previously this instruction could be used only in assembler. This change
makes it available for compiler also. Scheduling information was copied
from FTST instruction, hopefully this can be a satisfactory approximation.

Differential Revision: https://reviews.llvm.org/D104853
2021-06-25 12:26:51 +07:00
Martin Storsjö 42f74e8249 [llvm] Rename StringRef _lower() method calls to _insensitive()
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class, however these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
2021-06-25 00:22:01 +03:00
Florian Hahn a54c6fc083
[X86] Exclude invalid element types for bitcast/broadcast folding.
It looks like the fold introduced in 63f3383ece can cause crashes
if the type of the bitcasted value is not a valid vector element type,
like x86_mmx.

To resolve the crash, reject invalid vector element types. The way it is
done in the patch is a bit clunky. Perhaps there's a better way to
check?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104792
2021-06-24 12:39:01 +01:00
Simon Pilgrim c4d3eedc7f [X86] Fold nested select_cc to select (cmp*ge/le Cond0, Cond1), LHS, Y)
select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y)
etc,

We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes.

Pulled out of D101074 to help address the remaining regressions.

Differential Revision: https://reviews.llvm.org/D104707
2021-06-24 11:27:57 +01:00
Sander de Smalen d5e14ba88c [GlobalISel] NFC: Change LLT::vector to take ElementCount.
This also adds new interfaces for the fixed- and scalable case:
* LLT::fixed_vector
* LLT::scalable_vector

The strategy for migrating to the new interfaces was as follows:
* If the new LLT is a (modified) clone of another LLT, taking the
  same number of elements, then use LLT::vector(OtherTy.getElementCount())
  or if the number of elements is halfed/doubled, it uses .divideCoefficientBy(2)
  or operator*. That is because there is no reason to specifically restrict
  the types to 'fixed_vector'.
* If the algorithm works on the number of elements (as unsigned), then
  just use fixed_vector. This will need to be fixed up in the future when
  modifying the algorithm to also work for scalable vectors, and will need
  then need additional tests to confirm the behaviour works the same for
  scalable vectors.
* If the test used the '/*Scalable=*/true` flag of LLT::vector, then
  this is replaced by LLT::scalable_vector.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D104451
2021-06-24 11:26:12 +01:00
Eli Friedman 74909e4b6e Rename MachineMemOperand::getOrdering -> getSuccessOrdering.
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving.  This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.

We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.

Differential Revision: https://reviews.llvm.org/D103338
2021-06-21 16:49:27 -07:00
Fangrui Song c618692218 [AArch64][X86] Allow 64-bit label differences lower to IMAGE_REL_*_REL32
`IMAGE_REL_ARM64_REL64/IMAGE_REL_AMD64_REL64` do not exist and `.quad a - .` is
currently not representable.

For instrumentation, `.quad a - .` is useful representing a cross-section
reference in a metadata section, to allow ELF medium/large code models. The COFF
limitation makes such generic instrumentations inconvenient. I plan to make a
PGO/coverage metadata section field relative in D104556.

Differential Revision: https://reviews.llvm.org/D104564
2021-06-21 14:32:25 -07:00
Roman Lebedev 834aafa55b
[NFC] AMD Zen 3: fix typo in a comment 2021-06-19 22:05:17 +03:00
Roman Lebedev 69caacc626
[X86] AMD Zen 3: don't confuse shift and shuffle, NFC
These proc res groups occupy the exact same pipes,
so this doesn't affect the modelling,
but it's confusing nontheless.
2021-06-17 21:07:35 +03:00
Simon Pilgrim cdb4fcf9a1 [X86] combineSelect - refactor MIN/MAX detection code to make it easier to add additional select(setcc,x,y) folds. NFCI.
I need to add some additional handling to address some of the regressions from D101074
2021-06-17 13:50:59 +01:00
Roman Lebedev 308f6a5245
[NFC][X86] lowerVECTOR_SHUFFLE(): drop FIXME about widening to i128 (YMM half) element type
As per the discussion in D103818, so far, this does not appear to be worthwhile.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103818
2021-06-16 10:24:33 +03:00
Saleem Abdulrasool 17bdc0ff6f X86: balance the frame prologue and epilogue on Win64
This was broken in ba1509da7b.  The Win64
frame would not perform the setup of the Swift async context parameter
but would tear down the setup in the epilogue resulting in crashes.
This ensures that we do the full setup when we do the tear down.
Although this is non-conforming to the Win64 calling convention, it
corrects the setup and exposes the actual issue that the change
introduced: incorrect frame setup.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104246
2021-06-15 20:13:52 -07:00
Bob Haarman 3bc899b4de [X86] avoid assert with varargs, soft float, and no-implicit-float
Fixes:
 - PR36507 Floating point varargs are not handled correctly with
   -mno-implicit-float
 - PR48528 __builtin_va_start assumes it can pass SSE registers
   when using -Xclang -msoft-float -Xclang -no-implicit-float

On x86_64, floating-point parameters are normally passed in XMM
registers. For va_start, we spill those to memory so va_arg can
find them. There is an interaction here with -msoft-float and
-no-implicit-float:

When -msoft-float is in effect, instead of passing floating-point
parameters in XMM registers, they are passed in general-purpose
registers.

When -no-implicit-float is in effect, it "disables implicit
floating-point instructions" (per the LangRef). The intended
effect is to not have the compiler generate floating-point code
unless explicit floating-point operations are present in the
source code, but what exactly counts as an explicit floating-point
operation is not specified. The existing behavior of LLVM here has
led to some surprises and PRs.

This change modifies the behavior as follows:

  | soft | no-implicit | old behavior    | new behavior    |
  |  no  |   no        | spill XMM regs  | spill XMM regs  |
  | yes  |   no        | don't spill XMM | don't spill XMM |
  |  no  |  yes        | don't spill XMM | spill XMM regs  |
  | yes  |  yes        | assert          | don't spill XMM |

In particular, this avoids the assert that happens when
-msoft-float and -no-implicit-float are both in effect. This
seems like a perfectly reasonable combination: If we don't want
to rely on hardware floating-point support, we want to both
avoid using float registers to pass parameters and avoid having
the compiler generate floating-point code that wasn't in the
original program. Instead of crashing the compiler, the new
behavior is to not synthesize floating-point code in this
case. This fixes PR48528.

The other interesting case is when -no-implicit-float is in
effect, but -msoft-float is not. In that case, any floating-point
parameters that are present will be in XMM registers, and so we
have to spill them to correctly handle those. This fixes
PR36507. The spill is conditional on %al indicating that
parameters are present in XMM registers, so no floating-point
code will be executed unless the function is called with
floating-point parameters.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104001
2021-06-15 11:27:35 -07:00
Roman Lebedev 88da6c1ead
[X86] Schedule-model second (mask) output of GATHER instruction
Much like `mulx`'s `WriteIMulH`, there are two outputs of
AVX2 GATHER instructions. This was changed back in rL160110,
but the sched model change wasn't present.

So right now, for sched models that are marked as complete
(`znver3` only now), codegen'ning `GATHER` results in a crash:
```
DefIdx 1 exceeds machine model writes for early-clobber renamable $ymm3, dead early-clobber renamable $ymm2 = VPGATHERDDYrm killed renamable $ymm3(tied-def 0), undef renamable $rax, 4, renamable $ymm0, 0, $noreg, killed renamable $ymm2(tied-def 1) :: (load 32, align 1)
```
https://godbolt.org/z/Ks7zW7WGh

I'm guessing we need to deal with this like we deal with `WriteIMulH`.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104205
2021-06-15 12:04:33 +03:00
Craig Topper 4017d0335a [X86] Use EVT::getVectorVT instead of changeVectorElementType in reduceVMULWidth.
Changing vector element type doesn't work for v6i32->v6i16 now
that v6i32 is an MVT and v6i16 is not.

I would like to fix this in changeVectorElementType, but you
need a LLVMContext to call getVectorVT which we can't get from
an MVT.

Fixes PR50709.
2021-06-14 22:07:04 -07:00
Saleem Abdulrasool 8c8dbc1082 X86: pass swift_async context in R14 on Win64
Pass swift_async context in a callee-saved register rather than as a
regular parameter.  This is similar to the Swift `self` and `error`
parameters.
2021-06-14 11:02:21 -07:00
Eric Astor f09e200b31 [ms] [llvm-ml] When parsing MASM, "jmp short" instructions are case insensitive
Handle "short" in a case-insensitive fashion in MASM.

Required to correctly parse z_Windows_NT-586_asm.asm from the OpenMP runtime.

Reviewed By: thakis

Differential Revision: https://reviews.llvm.org/D104195
2021-06-13 18:36:00 -04:00
Eric Astor 56edcbc2ad Fix misspelled instruction in X86 assembly parser
Did not correctly handle "jecxz short <address>".

Discovered while working on LLVM-ML; shows up in z_Windows_NT-586_asm.asm from the OpenMP runtime

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D104194
2021-06-13 18:34:15 -04:00
Luo, Yuanke 5be314f79b [X86] Check immediate before get it.
For CMP imm instruction, when the operand 1 is symbol address we should
check if it is immediate first. Here is the example code.
`CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def
$eflags`
Many thanks to Craig, Topper for the test case to reproduce this issue.

Differential Revision: https://reviews.llvm.org/D104037
2021-06-13 15:40:52 +08:00
Luo, Yuanke 1e72b9d52f Revert "[X86] Check immediate before get it."
This reverts commit 9eb2f723c2.
2021-06-13 13:55:38 +08:00
Luo, Yuanke 9eb2f723c2 [X86] Check immediate before get it.
For CMP imm instruction, when the operand 1 is symbol address we should
check if it is immediate first. Here is the example code.
`CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def
$eflags`
Many thanks to Craig, Topper for the test case to reproduce this issue.

Differential Revision: https://reviews.llvm.org/D104037
2021-06-13 09:08:40 +08:00
Craig Topper c997867dc0 [X86] Add ISD::FREEZE and ISD::AssertAlign to the list of opcodes that don't guarantee upper 32 bits are zero.
The freeze issue was reported here
https://llvm.discourse.group/t/bug-or-feature-freeze-instruction/3639

I don't have a test for AssertAlign. I just noticed it was missing
and assume it should be similar to the other two Asserts.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104178
2021-06-12 09:52:29 -07:00
Florian Hahn 5cd66420cc
Revert "[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB"
This reverts commit 1b748faf2b because it
breaks building the llvm-test-suite with -verify-machineinstrs on X86:
http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64-O3/9585/

Running llc -verify-machineinstr on X86 crashes on the IR below:

    target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

    %struct.widget = type { i32, i32, i32, i32, i32*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [16 x [16 x i16]], [6 x [32 x i32]], [16 x [16 x i32]], [4 x [12 x [4 x [4 x i32]]]], [16 x i32], i8**, i32*, i32***, i32**, i32, i32, i32, i32, %struct.baz*, %struct.wobble.1*, i32, i32, i32, i32, i32, i32, %struct.quux.2*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x i32], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32***, i32***, i32****, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x [2 x i32]], [3 x [2 x i32]], i32, i32, i64, i64, %struct.zot.3, %struct.zot.3, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.baz = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, %struct.snork*, %struct.wombat.0*, %struct.wobble*, i32, i32*, i32*, i32*, i32, i32*, i32*, i32*, i32 (%struct.widget*, %struct.eggs*)*, i32, i32, i32, i32 }
    %struct.snork = type { %struct.spam*, %struct.zot, i32 (%struct.wombat*, %struct.widget*, %struct.snork*)* }
    %struct.spam = type { i32, i32, i32, i32, i8*, i32 }
    %struct.zot = type { i32, i32, i32, i32, i32, i8*, i32* }
    %struct.wombat = type { i32, i32, i32, i32, i32, i32, i32, i32, void (i32, i32, i32*, i32*)*, void (%struct.wombat*, %struct.widget*, %struct.zot*)* }
    %struct.wombat.0 = type { [4 x [11 x %struct.quux]], [2 x [9 x %struct.quux]], [2 x [10 x %struct.quux]], [2 x [6 x %struct.quux]], [4 x %struct.quux], [4 x %struct.quux], [3 x %struct.quux] }
    %struct.quux = type { i16, i8 }
    %struct.wobble = type { [2 x %struct.quux], [4 x %struct.quux], [3 x [4 x %struct.quux]], [10 x [4 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]] }
    %struct.eggs = type { [1000 x i8], [1000 x i8], [1000 x i8], i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.wobble.1 = type { i32, [2 x i32], i32, i32, %struct.wobble.1*, %struct.wobble.1*, i32, [2 x [4 x [4 x [2 x i32]]]], i32, i64, i64, i32, i32, [4 x i8], [4 x i8], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.quux.2 = type { i32, i32, i32, i32, i32, %struct.quux.2* }
    %struct.zot.3 = type { i64, i16, i16, i16 }

    define void @blam(%struct.widget* %arg, i32 %arg1) local_unnamed_addr {
    bb:
      %tmp = load i32, i32* undef, align 4
      %tmp2 = sdiv i32 %tmp, 6
      %tmp3 = sdiv i32 undef, 6
      %tmp4 = load i32, i32* undef, align 4
      %tmp5 = icmp eq i32 %tmp4, 4
      %tmp6 = select i1 %tmp5, i32 %tmp3, i32 %tmp2
      %tmp7 = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 0, i64 0
      %tmp8 = zext i16 undef to i32
      %tmp9 = zext i16 undef to i32
      %tmp10 = load i16, i16* undef, align 2
      %tmp11 = zext i16 %tmp10 to i32
      %tmp12 = zext i16 undef to i32
      %tmp13 = zext i16 undef to i32
      %tmp14 = zext i16 undef to i32
      %tmp15 = load i16, i16* undef, align 2
      %tmp16 = zext i16 %tmp15 to i32
      %tmp17 = zext i16 undef to i32
      %tmp18 = sub nsw i32 %tmp8, %tmp9
      %tmp19 = shl nsw i32 undef, 1
      %tmp20 = add nsw i32 %tmp19, %tmp18
      %tmp21 = sub nsw i32 %tmp11, %tmp12
      %tmp22 = shl nsw i32 undef, 1
      %tmp23 = add nsw i32 %tmp22, %tmp21
      %tmp24 = sub nsw i32 %tmp13, %tmp14
      %tmp25 = shl nsw i32 undef, 1
      %tmp26 = add nsw i32 %tmp25, %tmp24
      %tmp27 = sub nsw i32 %tmp16, %tmp17
      %tmp28 = shl nsw i32 undef, 1
      %tmp29 = add nsw i32 %tmp28, %tmp27
      %tmp30 = sub nsw i32 %tmp20, %tmp29
      %tmp31 = sub nsw i32 %tmp23, %tmp26
      %tmp32 = shl nsw i32 %tmp30, 1
      %tmp33 = add nsw i32 %tmp32, %tmp31
      store i32 %tmp33, i32* undef, align 4
      %tmp34 = mul nsw i32 %tmp31, -2
      %tmp35 = add nsw i32 %tmp34, %tmp30
      store i32 %tmp35, i32* undef, align 4
      %tmp36 = select i1 %tmp5, i32 undef, i32 undef
      br label %bb37

    bb37:                                             ; preds = %bb
      %tmp38 = load i32, i32* undef, align 4
      %tmp39 = ashr i32 %tmp38, %tmp6
      %tmp40 = load i32, i32* undef, align 4
      %tmp41 = sdiv i32 %tmp39, %tmp40
      store i32 %tmp41, i32* undef, align 4
      ret void
    }
2021-06-12 11:41:38 +01:00
Florian Hahn e087b4f149
Revert "[X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization"
This reverts commit f35bcea1d4 because it
depends on 1b748faf2b, which breaks
building the llvm-test-suite with -verify-machineinstrs on X86.

See 154adc0f135cff3f8a8861c335d2b88c8049d098 for more details.
2021-06-12 11:40:47 +01:00
Guozhi Wei f35bcea1d4 [X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization
In function searchALUInst, sub register usage of LEA dest should also block LEA/SUB optimization, otherwise the sub register usage gets an undefined value.

This patch fixes https://bugs.llvm.org/show_bug.cgi?id=50615.

Differential Revision: https://reviews.llvm.org/D103922
2021-06-11 09:45:56 -07:00
Simon Pilgrim 61cdaf66fe [ADT] Remove APInt/APSInt toString() std::string variants
<string> is currently the highest impact header in a clang+llvm build:

https://commondatastorage.googleapis.com/chromium-browser-clang/llvm-include-analysis.html

One of the most common places this is being included is the APInt.h header, which needs it for an old toString() implementation that returns std::string - an inefficient method compared to the SmallString versions that it actually wraps.

This patch replaces these APInt/APSInt methods with a pair of llvm::toString() helpers inside StringExtras.h, adjusts users accordingly and removes the <string> from APInt.h - I was hoping that more of these users could be converted to use the SmallString methods, but it appears that most end up creating a std::string anyhow. I avoided trying to use the raw_ostream << operators as well as I didn't want to lose having the integer radix explicit in the code.

Differential Revision: https://reviews.llvm.org/D103888
2021-06-11 13:19:15 +01:00
Bing1 Yu 56d5c46b49 [X86] Support __tile_stream_loadd intrinsic for new AMX interface
Adding support for __tile_stream_loadd intrinsic.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D103784
2021-06-11 17:28:43 +08:00
Luo, Yuanke 63233da723 [X86][NFC] Fix typo. 2021-06-10 22:49:11 +08:00
Craig Topper 765ef4bb2a [X86] Check destination element type before forming VTRUNCS/VTRUNCUS in combineTruncateWithSat.
Fixes crash reported here https://reviews.llvm.org/D73607

Using a store to keep the trunc intact. Returning v16i24 would
cause the trunc to be optimized away in SelectionDAGBuilder.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103940
2021-06-09 07:08:17 -07:00
Simon Pilgrim 630820bafc [X86][SLM] Adjust XMM non-PMULLD throughput costs to half rate.
Match what's reported in the costs table, Agner's tables and the Intel AOM
2021-06-09 13:51:40 +01:00
Nick Desaulniers 3787ee4571 reland [IR] make -stack-alignment= into a module attr
Relands commit 433c8d950c with fixes for
MIPS.

Similar to D102742, specifying the stack alignment via CodegenOpts means
that this flag gets dropped during LTO, unless the command line is
re-specified as a plugin opt. Instead, encode this information as a
module level attribute so that we don't have to expose this llvm
internal flag when linking the Linux kernel with LTO.

Looks like external dependencies might need a fix:
* https://github.com/llvm-hs/llvm-hs/issues/345
* https://github.com/halide/Halide/issues/6079

Link: https://github.com/ClangBuiltLinux/linux/issues/1377

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D103048
2021-06-08 10:59:46 -07:00
Nick Desaulniers a596b54d47 Revert "[IR] make -stack-alignment= into a module attr"
This reverts commit 433c8d950c.

Breaks the MIPS build.
2021-06-08 08:55:50 -07:00
Nick Desaulniers 433c8d950c [IR] make -stack-alignment= into a module attr
Similar to D102742, specifying the stack alignment via CodegenOpts means
that this flag gets dropped during LTO, unless the command line is
re-specified as a plugin opt. Instead, encode this information as a
module level attribute so that we don't have to expose this llvm
internal flag when linking the Linux kernel with LTO.

Looks like external dependencies might need a fix:
* https://github.com/llvm-hs/llvm-hs/issues/345
* https://github.com/halide/Halide/issues/6079

Link: https://github.com/ClangBuiltLinux/linux/issues/1377

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D103048
2021-06-08 08:31:04 -07:00
Simon Pilgrim 49d3a367c0 [CostModel][X86] Improve AVX1/AVX2 truncation costs
Based off the worse case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:

AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8

Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.
2021-06-08 10:41:03 +01:00
Harald van Dijk 75521bd9d8
[X32] Add Triple::isX32(), use it.
So far, support for x86_64-linux-gnux32 has been handled by explicit
comparisons of Triple.getEnvironment() to GNUX32. This worked as long as
x86_64-linux-gnux32 was the only X32 environment to worry about, but we
now have x86_64-linux-muslx32 as well. To support this, this change adds
an isX32() function and uses it. It replaces all checks for GNUX32 or
MuslX32 by isX32(), except for the following:

- Triple::isGNUEnvironment() and Triple::isMusl() are supposed to treat
  GNUX32 and MuslX32 differently.
- computeTargetTriple() needs to be able to transform triples to add or
  remove X32 from the environment and needs to map GNU to GNUX32, and
  Musl to MuslX32.
- getMultiarchTriple() completely lacks any Musl support and retains the
  explicit check for GNUX32 as it can only return x86_64-linux-gnux32.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D103777
2021-06-07 20:48:39 +01:00
Simon Pilgrim 432eff22ab [CostModel][X86] Add 512-bit bswap costs 2021-06-06 22:36:34 +01:00
Simon Pilgrim ae973380c5 [CostModel][X86] Improve AVX512 FDIV costs
Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model
2021-06-06 21:41:05 +01:00
Simon Pilgrim 8ab8b3fad7 [X86][SSE] LowerFP_TO_INT - remove dead code. NFCI.
Non-Strict v2f32->v2i64 cases have already early-returned to be handled by legalization.
2021-06-06 20:04:15 +01:00
Simon Pilgrim 4879c8f3b0 [X86][SSE] combineVectorTruncation - simplify PSHUFB-is-better logic. NFCI.
OutSVT is guaranteed to be i8/i16 and we accept any InSVT that isn't i64
2021-06-06 20:04:14 +01:00
Simon Pilgrim b69e16b5cc X86MachObjectWriter.cpp - silence null deference warnings. NFCI.
The MCSymbol data should always be present for non-absolute sections so assert that it is to silence static analysis warnings.
2021-06-06 15:33:47 +01:00