Introduced masks where they were previously missing and improved
target-dependent cost models to avoid returning incorrect cost results
after adding masks.
Differential Revision: https://reviews.llvm.org/D100486
Before this patch `Args` was used by SLP to pass a broadcast's arguments.
This patch changes this: `Args` is now used to pass the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
Instead of reporting a fatal error, this patch emits an error message and
exits when shapes are not pre-defined. This makes compilation fail without
crashing.
Differential Revision: https://reviews.llvm.org/D124342
The related instructions are:
VPERMD/Q/PS/PD
VRANGEPD/PS/SD/SS
VGETMANTSS/SD/SH
VGETMANTPS/PD - mem version only
VPMULLQ
VFMULCSH/PH
VFCMULCSH/PH
Differential Revision: https://reviews.llvm.org/D116072
This is x86 specific, and adds statefulness to
MachineModuleInfo. Instead of explicitly tracking this, infer whether we
need to declare the symbol based on the previously inserted reference.
This produces a small change in the output due to the move from
AsmPrinter::doFinalization to X86's emitEndOfAsmFile. This will now be
emitted relative to other end-of-file fields, which I'm assuming doesn't
matter (e.g. the __morestack_addr declaration is now after the
.note.GNU-split-stack part).
This also produces another small change in code if the module happened
to define/declare __morestack_addr, but I assume that's invalid and
doesn't really matter.
This is used to emit one field in doFinalization for the module. We
can accumulate this when emitting all individual functions directly in
the AsmPrinter, rather than accumulating additional state in
MachineModuleInfo.
Move the special case behavior predicate into MachineFrameInfo to
share it. This now promotes it to generic behavior. I'm assuming this
is fine because no other target implements adjustForSegmentedStacks,
or has tests using the split-stack attribute.
ValueMap should only be necessary if the IR values can be
replaced. This is only used during codegen, when it's illegal to
change the underlying IR. This allows using the default copy
constructor for X86MachineFunctionInfo.
I'm not happy about targets keeping state here that's only used in one
specific pass, but we don't have a better place to put it right now.
Calling hasOneUse can be expensive on nodes with multiple results,
especially when some results are Chains. By checking the opcode first,
we can avoid walking the uses if it isn't an interesting node,
and thus avoid calling hasOneUse on a node that might have many uses.
Found by profiling the IR given in D123857.
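As a minimal sketch of the ordering idea (illustrative only - the opcode parameter is a stand-in, not the code from this patch; SDNode::getOpcode() and SDNode::hasOneUse() are the relevant SelectionDAG APIs):
```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Put the O(1) opcode test before the use-list walk, so hasOneUse() is
// never called on nodes we don't care about.
bool worthCombining(const SDNode *N, unsigned InterestingOpc) {
  return N->getOpcode() == InterestingOpc && // cheap field load
         N->hasOneUse();                     // may walk many uses
}
```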
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D123881
Checking the opcode is cheap. hasOneUse might not be if the node has
multiple results. By checking the opcode we can rule out nodes
with multiple results we aren't interested in.
znver1/2 models were incorrectly modelling these as 3-cycle latency instructions on the wrong pipe, and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
An unsigned int 0 will be converted to float/double -0.0 when the rounding
mode is set to 'FE_DOWNWARD'. Use the FILD instruction instead of SSE
instructions on 32-bit targets when strictfp is enabled.
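A small source-level illustration of the hazard (my own example, not from the patch): an expansion that computes the conversion with a floating-point subtraction can round zero down to -0.0 once FE_DOWNWARD is active.
```cpp
#include <cfenv>
#include <cstdio>

int main() {
  std::fesetround(FE_DOWNWARD);
  volatile unsigned u = 0;
  // Under strictfp the conversion must honor the rounding mode and still
  // yield +0.0; a lowering built on 'x - bias' arithmetic can return -0.0.
  double d = static_cast<double>(u);
  std::printf("%g\n", d); // expected: 0, not -0
}
```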
Differential Revision: https://reviews.llvm.org/D123660
This patch adds support for inline assembly address operands using the "p"
constraint on X86 and SystemZ.
This was in fact broken on X86 (see example at
https://reviews.llvm.org/D110267, Nov 23).
These operands should probably be treated the same as memory operands by
CodeGenPrepare, which has been noted with a "TODO" there.
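For reference, a minimal use of the constraint (my own example; the `%a` operand modifier that prints the operand as a bare address is an assumption worth double-checking against the GCC/Clang inline-asm docs):
```cpp
// Pass a pointer to inline asm as an address operand via "p".
void prefetchLine(const void *Ptr) {
  asm volatile("prefetcht0 %a0" : : "p"(Ptr));
}
```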
Review: Xiang Zhang and Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D122220
This reverts the functional changes of D103427 but keeps its tests, and
reimplements the functionality by reusing the existing 32-bit
MASKMOVDQU and VMASKMOVDQU instructions, as suggested by skan in review.
These instructions were previously predicated on Not64BitMode. This
reimplementation restores the disassembly of a class of instructions,
which will see a test added in followup patch D122449.
In 64-bit mode these instructions are special-cased in
X86MCInstLower::Lower, because we use flags with one meaning for subtly
different things: we have an AdSize32 class which indicates both that
the instruction needs a 0x67 prefix and that the text form of the
instruction implies a 0x67 prefix. These instructions are special in
needing a 0x67 prefix but having a text form that does *not* imply a
0x67 prefix, so we encode this in MCInst as an instruction that has an
explicit address size override.
Note that originally VMASKMOVDQU64 was special cased to be excluded from
disassembly, as we cannot distinguish between VMASKMOVDQU and
VMASKMOVDQU64 and rely on the fact that these are indistinguishable, or
close enough to it, at the MCInst level that it does not matter which we
use. Because VMASKMOVDQU now receives special casing, even though it
does not make a difference in the current implementation, as a
precaution VMASKMOVDQU is excluded from disassembly rather than
VMASKMOVDQU64.
Reviewed By: RKSimon, skan
Differential Revision: https://reviews.llvm.org/D122540
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.
Now matches AMD SoG, Agner and instlatx64 numbers.
Fixes #54811
extract_subvector(insert_subvector(V,X,C1),C1) -> insert_subvector(extract_subvector(V,C1),X,0)
More aggressively attempt to reduce the width of an extract_subvector source - we currently only do this if we're inserting into a zero vector (i.e. canonicalizing to the AVX implicit zero upper elts pattern).
But if we're extracting from the same point as the inner insert_subvector then the fold is still relatively trivial - we can probably do even better if we can ensure the subvector isn't badly split.
When walking the user chain to get the shape of a phi node, if another
phi node appears in the chain, we should walk to the users of that phi
node instead of the original phi node.
When we fold vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)), ensure we convert all undef elements to zero elements - this should help us expose more known zero elements for deeper chains of these cases.
Noticed while triaging Issue #54819
znver2 is mainly a search+replace of the znver1 model, but for no clear reason some lines have been moved around - try to keep these in sync (no actual changes in the models).
Don't try to directly use the with.overflow flag result in a cmov
if we need to materialize constants between the instruction
producing the overflow flag and the cmov. The current code is
careful to check that there are no other instructions in between,
but misses the constant materialization case (which may clobber
eflags via xor or constant expression evaluation).
Fixes https://github.com/llvm/llvm-project/issues/54369.
Differential Revision: https://reviews.llvm.org/D122825
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.
We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.
Fixes #36407
rGa3b8695bf592 enabled this for znver3, but AMD SoG, Agner and uops.info all agree that even znver1 has a fast per-lane shuffle op (VPSHUFB), while cross-lane shuffles seem to be slow (PERMPS etc.)
Fixes #44140
Differential Revision: https://reviews.llvm.org/D123306
smin(x, 0):
(select (x < 0), x, 0) -> ((x >> (size_in_bits(x)-1))) & x
smax(x, 0):
(select (x > 0), x, 0) -> (~(x >> (size_in_bits(x)-1))) & x
Since the comparison is testing for a positive value, we have to invert the
sign-bit mask in the smax case, so we only do that transform if the target
has a bitwise 'and not' instruction (making the invert free).
The transform is performed only when CMP has a single user, to avoid
increasing the total instruction count.
https://alive2.llvm.org/ce/z/euUnNm
https://alive2.llvm.org/ce/z/37339J
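A quick scalar check of the two identities (my own snippet; it assumes an arithmetic right shift on signed int, which mainstream implementations provide):
```cpp
#include <cassert>

int smin0(int x) { return (x >> 31) & x; }  // smin(x, 0)
int smax0(int x) { return ~(x >> 31) & x; } // smax(x, 0)

int main() {
  for (int x : {-7, -1, 0, 1, 42}) {
    assert(smin0(x) == (x < 0 ? x : 0));
    assert(smax0(x) == (x > 0 ? x : 0));
  }
}
```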
Differential Revision: https://reviews.llvm.org/D123109
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.
Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
```
X86::MMX_MOVD64from64rr -> X86::MMX_MOVQ64mr
X86::MMX_MOVD64grr -> X86::MMX_MOVD64mr
```
These two entries were added in llvm-svn: 372770.
I think these two should be reversible.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D122217
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.
Differential Revision: https://reviews.llvm.org/D123117
Intuitively, the memory folding pair should have the same mnemonic.
This patch removes
```
{X86::SENDUIPI,X86::VMXON}
```
in the auto-generated table.
And `NotMemoryFoldable` for `TPAUSE` and `CLWB` can be saved.
```
{X86::MOVLHPSrr,X86::MOVHPSrm}
{X86::VMOVLHPSZrr,X86::VMOVHPSZ128rm}
{X86::VMOVLHPSrr,X86::VMOVHPSrm}
```
It seems the three pairs above were mistakenly removed,
but we can add them back manually later.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D122477
The current implementation only recognizes the absolute-value operation
implemented via a select instruction. This patch adds support for the abs
intrinsic.
Differential Revision: https://reviews.llvm.org/D122777
Without VBMI, we are better off permuting v16i32 sub-lanes, even though it's a variable shuffle, if it allows us to then shuffle v64i8 inlane repeated masks (PSHUFB etc.)
Fixes #54658
Allows us to fold XOR(X, MIN_SIGNED_VALUE) == ADD(X, MIN_SIGNED_VALUE) into LEA patterns
As mentioned on PR52267.
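The equivalence holds because flipping the sign bit is the same as adding it: any carry out of the top bit simply falls off the register. A quick check (my own snippet):
```cpp
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x : {0u, 1u, 0x7fffffffu, 0x80000000u, 0xdeadbeefu})
    assert((x ^ 0x80000000u) == x + 0x80000000u);
}
```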
Differential Revision: https://reviews.llvm.org/D122815
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.
Recommitted with a fix to ensure we zext/trunc the SETCC result to the original type.
Differential Revision: https://reviews.llvm.org/D122891
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.
Differential Revision: https://reviews.llvm.org/D122891
This approach is used by AArch64/RISCV to make frame-setup/frame-destroy
instructions contiguous instead of being interleaved by CFI instructions. Code
checking `MBBI->getFlag(MachineInstr::FrameSetup) || MBBI->isCFIInstruction()`
can be simplified to just check FrameSetup.
This helps locate all CFI instructions in the prologue, which can be handy to use
.cfi_remember_state/.cfi_restore_state to decrease unwind table size (D114545).
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122541
This inverts a fold recently added to IR with:
3491f2f4b0
We can put -bidirectional on the Alive2 examples to show that
the reverse transforms work:
https://alive2.llvm.org/ce/z/8iVQwB
The motivation for the IR change was to improve matching to
'fabs' in IR (see https://github.com/llvm/llvm-project/issues/38828 ),
but it regressed x86 codegen for 'not-quite-fabs' patterns like
(X > -X) ? X : -X.
I.e., when there is no fast-math (nsz), the cmp+select is not a proper
fabs operation, but it does map nicely to the unusual NAN semantics
of MINSS/MAXSS.
I drafted this as a target-independent fold, but it doesn't appear to
help any other targets and seems to cause regressions for SystemZ at
least.
Differential Revision: https://reviews.llvm.org/D122726
The AMX combiner would store undef or zero to the stack and invoke tileload
to load the data into a tile register. To avoid the store/load, we can
materialize the undef or zero value with tilezero.
Differential Revision: https://reviews.llvm.org/D122714
As mentioned on D122482, if we've generated a masked overflow test see if we can fold it to X86ISD::BT to feed a X86ISD::ADC/SBB
Differential Revision: https://reviews.llvm.org/D122572
If we're not relying on the flag result, we can fold the constants together into the RHS immediate operand and set the LHS operand to zero, simplifying for further folds.
We could do something similar if the flag result is in use and the constant fold doesn't affect it, but I don't have any real test cases for this yet.
As suggested by @davezarzycki on Issue #35256
Differential Revision: https://reviews.llvm.org/D122482
Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput
This is used for f16 emulation. We emulate f16 for SSE2 targets and
above. The refactoring makes future code cleaner.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D122475
1. Add comments to explain why we set `isAsmParserOnly` for XACQUIRE and XRELEASE
2. Check `X86Inst` in the constructor of `RecognizableInstrBase` so that
we can avoid the case where one of its fields is not initialized but is
accessed by a user. (e.g. in X86EVEX2VEXTablesEmitter.cpp)
3. Move `Rec` from `RecognizableInstrBase` to `RecognizableInstr` to reduce
size of `RecognizableInstrBase`
4. Remove out-of-date comments for shouldBeEmitted() (filter() was removed)
5. Add a basic field `IsAsmParserOnly` and remove the field
`ShouldBeEmitted` b/c we can deduce it w/ little overhead
Asan complained about uninitialized bool
`invalid-bool-load`
llvm/lib/Target/X86/AsmParser/X86Operand.h:389:12: runtime error: load
of value 171, which is not a valid value for type 'bool'
Differential Revision: https://reviews.llvm.org/D122405
MMX_MOVD64from64rr moves an MMX register to a 64-bit GPR.
MMX_MOVD64from64mr is the memory version of moving a MMX register to a
64-bit GPR. It requires the REX.W bit to be set. There are no isel
patterns that use this instruction.
MMX_MOVQ64mr is the MMX register store instruction. It doesn't
require a REX.W prefix. This makes it one byte shorter to encode
than MMX_MOVD64from64mr in many cases.
Both store instructions output the same mnemonic string. The assembler
would choose MMX_MOVQ64mr if it were to parse the output, which is
another reason using it is the correct thing to do.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122241
As suggested on PR35908, if we are adding/subtracting an extracted bit, attempt to use BT instead to fold the op and use a ADC/SBB op.
Reapply with extra type legality checks - LowerAndToBT was originally only used during lowering; now that it can occur earlier we might encounter illegal types that we can either promote to i32 or just bail on.
Differential Revision: https://reviews.llvm.org/D122084
AVX512 has excellent broadcast ops for everything but vXi1 bool vectors - so if we're broadcasting a comparison result, see if we can broadcast the comparison operands instead.
Ensure we don't attempt to fold to illegal types to ADC/SBB nodes.
After D122084 its possible for ADD(X,AND(SRL(Y,Z),1) patterns to be matched before type legalization.
As suggested on PR35908, if we are adding/subtracting an extracted bit, attempt to use BT instead to fold the op and use a ADC/SBB op.
Differential Revision: https://reviews.llvm.org/D122084
Instead of taking a SkipDefs parameter, rename to getCondSrcNoFromDesc
and have it return the source operand number. Make getCondFromMI
responsible for adding the number of Defs for MI instructions.
While there, remove some unneeded casts to unsigned and check for
negative numbers instead of explicitly comparing against -1. Less than 0
is easier for a compiler to codegen.
Differential Revision: https://reviews.llvm.org/D122113
This is not an NFC change b/c we add more instructions like
IMUL16/32/64r, MOV16ao16 and MOV16rr_REV etc. to the list.
But I think it's reasonable.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D122063
Split combineAddOrSubToADCOrSBB into wrapper (which handles ADDs with commuted args) and the real combine, which no longer has to account for commutation.
I'm intending to extend combineAddOrSubToADCOrSBB to detect patterns other than just X86ISD::SETCC, so we need to detect all patterns without detecting them as part of a commutation swap.
Rename hasCMPXCHG16B() to canUseCMPXCHG16B() to make it less like other
feature functions. Add a similar canUseCMPXCHG8B() that aliases
hasCX8() to keep similar naming.
Differential Revision: https://reviews.llvm.org/D121978
If a X86ISD::BLENDV op appears before legalization (in this test case due to the icmp_slt x, 0) its constant mask was being treated as a vselect mask (mask != 0) instead of blendv (mask < 0)
This just prevents constant folding entirely for non-VSELECT ops.
We can use MOVMSK+TEST/BT to extract individual bool elements even if the index isn't constant
This relies on combineBitcastvxi1 so some AVX512 cases still aren't optimized as they avoid MOVMSK usage.
- Rename Mode*Bit to Is*Bit to match X86Subtarget.
- Rename FeatureLAHFSAHF to FeatureLAHFSAHF64 to match X86Subtarget.
- Use consistent capitalization.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D121975
The compiler only emits a comment for `Int_MemBarrier`, so it should
be marked as a meta-instruction, which can help improve the accuracy
of debug locations.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D121879
Splat loads are inexpensive on X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), %xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated to splat loads.
Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
- But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer
Because of this some of the lit tests contain more IR instructions.
Differential Revision: https://reviews.llvm.org/D121354
We favor 'and' and 'test' in earlier phases of optimization,
and that's usually the better option, but we can save a few
instruction bytes by converting a mask constant to a shift here.
Differential Revision: https://reviews.llvm.org/D121147
Fix prefix emission order to emit REX immediately before the opcode (SDM vol2,
2.1, Figure 2-1). According to SDM vol2 2.2.1, "Other placements are ignored".
This fix has a side effect of outputting segment override prefix in a different
order than previously (benign).
Follow-up to https://reviews.llvm.org/D120592
Reviewed By: skan, craig.topper
Differential Revision: https://reviews.llvm.org/D120871
Print and emit redundant Address-Size override prefix if it's set on the
instruction.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D120592
Optimize a pattern where a sequence of 8/16 or 32 bits is tested for
zero: LLVM normalizes this towards an `AND` with a mask, which is usually
good, but does not work well on X86 when the mask does not fit into a
64-bit register. This DAGToDAG peephole transforms sequences like:
```
movabsq $562941363486720, %rax # imm = 0x1FFFE00000000
testq %rax, %rdi
```
to
```
shrq $33, %rdi
testw %di, %di
```
The result has a shorter encoding and saves a register if the tested
value isn't used otherwise.
Differential Revision: https://reviews.llvm.org/D121320
This replaces the attempt in 20af71f8ec to use combineToExtendBoolVectorInReg to create X86ISD::BLENDV masks directly, instead we use it to canonicalize the iX bitcast to a sign-extended mask and then truncate it back to vXi1 prior to legalization breaking it apart.
Fixes #53760
Discussed extensively on D98232. The functionality introduced in D35816
never worked correctly. In D98232, it was fixed, but, as it was
introducing a large compile-time regression, and the value of the
original patch was called into doubt, we disabled it by default
everywhere. A year later, it appears that caused no grief, so it seems
safe to remove the disabled code.
This should be accompanied by re-opening bug 26810.
Differential Revision: https://reviews.llvm.org/D121128
If we're comparing a value against zero, strip away any zero-extension and perform the comparison on the pre-extended value
Fixes #38308
Differential Revision: https://reviews.llvm.org/D121472
Move out the table and prepare the code to reuse it for the reverse mapping.
Follows the example of memory folding/unfolding tables in X86InstrFoldTables.cpp
Preparation step to unify `llvm::X86::getRelaxedOpcodeArith` and
`getShortArithOpcode` in BOLT X86MCPlusBuilder.cpp.
Addresses https://lists.llvm.org/pipermail/llvm-dev/2022-January/154526.html
Reviewed By: skan, MaskRay
Differential Revision: https://reviews.llvm.org/D121402
If the SETCC fp-condcode is supported on SSE as a single CMPPS/PD op then we can use convertIntLogicToFPLogic to reduce EFLAGS and XMM->GPR traffic like we do for AVX targets.
Differential Revision: https://reviews.llvm.org/D121210
Fix a number of issues with MCSymbolizer::tryAddingSymbolicOperand()
in X86Disassembler:
* Pass instruction size instead of immediate size.
* Correctly adjust the value of PC-relative operands.
* Set operand offset to zero when the operand is specified
implicitly.
Reviewed By: Amir, skan
Differential Revision: https://reviews.llvm.org/D121065
If the shift amount has been zero-extended, peek through as this might help us further canonicalize the shift amount.
Fixes regression mentioned in rG147cfcbef1255ba2b4875b76708dab1a685085f5
This completes the removal of uses of SelectionDAG::getSplatValue started in D119090 - by avoiding extracting the splatted element we make it a lot easier to zero-extend the bottom 64-bits of the shift amount and fixes issues we had on 32-bit targets where i64 isn't legal.
I've removed the old version of getTargetVShiftNode that took the scalar shift amount argument and LowerRotate can finally efficiently handle vXi16 rotates-by-scalar (using the same code as general funnel-shifts).
The only regression we see is in the X86-AVX2 PR52719 test case in vector-shift-ashr-256.ll - this is now hitting the same problem as the X86-AVX1 case (failure to simplify a multi-use X86ISD::VBROADCAST_LOAD) which I intend to address in a follow up patch.
Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists.
Noticed while investigating reduction add optimizations.
This wraps up D119053. The 2 headers are moved as described, the file
headers and include guards are fixed, all files where the old paths were
detected are updated (simple grep through the repo), and it is all
`clang-format`-ed.
Differential Revision: https://reviews.llvm.org/D119876
For i16/32/64 vectors, if the upper bits are known to be zero, then we can try to truncate to vXi8 (if it's worth it) and perform this as a PSADBW to add+zext each v4i8 subvector to an i64 sum, which we can then reduce together.
This addresses some of the PR42674 test cases where the source data was vXi8 but had been extended to match a wider unsigned integer accumulator.
Differential Revision: https://reviews.llvm.org/D120193
Introduce an option to expand all CMOV groups into hammocks, matching GCC's
`-fno-if-conversion2` flag. The motivation is to leave CMOV conversion
opportunities to a binary optimizer that can make the decision based on branch
misprediction rate (available e.g. in Intel's LBR).
Reviewed By: MaskRay, skan
Differential Revision: https://reviews.llvm.org/D119777
Using getSplatValue causes poor codegen due to not always being able to remove the EXTRACT_VECTOR_ELT created inside getSplatValue.
The vXi16 shifts/rotates are still showing occasional regressions but vXi8 is a definite improvement.
combineX86ShuffleChain no longer has to assume that the shuffle inputs are the right size, so don't create unnecessary nodes messing up oneuse limits as detailed on Issue #45319
combineX86ShuffleChain no longer has to assume that the shuffle inputs are the right size, so don't create unnecessary nodes messing up oneuse limits as detailed on Issue #45319
Removing widening from combineX86ShufflesRecursively will be the next step, followed by removing combineX86ShuffleChainWithExtract entirely
With only a load-fold the diffs look neutral. If there's a load and store (rmw)
fold opportunity as shown in the test based on #53862, then we end up with an
extra instruction.
Fixes #53862
Differential Revision: https://reviews.llvm.org/D120281
Peek through if we're extracting a non-zero'th subvector in an attempt to fold the extract into a lane-crossing shuffle
This also exposes a failure to fold extract_subvector(movddup(x),c) -> movddup(extract_subvector(x,c))
Extension to PR45974: unless we actually combine the target shuffles we shouldn't be generating temporary nodes, as they may interfere with the one-use checks in the shuffle recursions.
When building 32b x86 code as PIC, the existing handling of "i"
constraints is conservative since generally we have to go through the
GOT to find references to functions.
But generally, BlockAddresses from C code refer to the Function in the
current TU. Permit BlockAddresses to be used with the "i" constraint
for those cases.
I regressed this in
commit 4edb9983cb ("[SelectionDAG] treat X constrained labels as i for asm")
Fixes: https://github.com/llvm/llvm-project/issues/53868
Reviewed By: efriedma, MaskRay
Differential Revision: https://reviews.llvm.org/D119905
After D118128 relaxed the heuristic to require only one EFLAGS generating operand, it now makes sense to avoid X86ISD::SMUL/UMULO duplication as well.
Differential Revision: https://reviews.llvm.org/D119578
It's not particularly user-friendly to have to call `initLRU` everywhere. Also,
it wasn't particularly great that the LRU for registers used in a sequence was
also initialized by `initLRU`.
This patch hides this stuff behind some helper functions:
* `isAvailableAcrossAndOutOfSeq`
* `isAnyUnavailableAcrossOrOutOfSeq`
* `isAvailableInsideSeq`
This allows the user to avoid calling `initLRU` explicitly. Also, it allows
us to separate initializing the used-in-sequence LRU from the main LRU.
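A rough usage sketch (the signature is assumed from the description above, so treat it as illustrative rather than the exact API):
```cpp
#include "llvm/CodeGen/MachineOutliner.h"
using namespace llvm;

// Liveness is queried through the candidate; no explicit initLRU call.
bool regIsFreeAcrossCandidate(outliner::Candidate &C, Register Reg,
                              const TargetRegisterInfo &TRI) {
  return C.isAvailableAcrossAndOutOfSeq(Reg, TRI);
}
```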
Since both ARM and AArch64 check LR liveness in `insertOutlinedCall`, this
refactor requires that we de-const the Candidate there.
Some other quality-of-code improvements:
* LRUs in outliner::Candidate now have more descriptive names
* Use `Register` instead of `unsigned` in some places
* Improve readability in some places by using ranges rather than `std::for_each`
This is a preparatory commit for a larger compile time related change for the
AArch64 outliner.
Extend the existing split where we already do this for v32i16/v64i8
We can end up trying to use PCMPEQ/GT if the result needs to be sign-extended (typically due to the DAGCombiner::foldSextSetcc fold).
Fixes #53842
The "avoid trailing call pass" makes sure that no function ends with a call instruction for the purpose of the unwinder.
It starts off by skipping over any non-real instruction, which is approximated via the Pseudo and Meta properties. This sadly leads to issues when the last machine instruction is a STATEPOINT, as it is skipped despite lowering to a call.
This patch fixes the use of a statepoint in the trailing call position by making sure call instructions are not skipped.
Differential Revision: https://reviews.llvm.org/D119644
This is a retry of b4b97ec813 - that was reverted because it
could cause miscompiles by illegally reordering memory operations.
A new test based on #53695 is added here to verify we do not have
that same problem.
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes #50310
Fixes #53695
Differential Revision: https://reviews.llvm.org/D118376
D108887 fixed alignment mismatch by changing the caller's alignment in
ABI. However, we found some cases that still assume the alignment is the
vector size. This patch fixes them to avoid runtime crashes.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D114536
Extend the existing fold to use SimplifyMultipleUseDemandedBits as well as SimplifyDemandedVectorElts/SimplifyDemandedBits when attempting to simplify based off known zero vector elements.
To find uniform shift/rotation amounts, we currently use SelectionDAG::getSplatValue which creates a node that extracts the scalar value from the source vector, this makes it more difficult for later combines to remove the extraction and stay on the SIMD unit, and can be a problem when the scalar type is illegal (i.e. i64 vs v2i64 on 32-bit targets).
This patch begins to use SelectionDAG::getSplatSourceVector (which SelectionDAG::getSplatValue uses internally) and adds a new variant of getTargetVShiftNode that takes the source vector and the splat index, and adjusts the vector in place to create the zero-extended value suitable for the SSE PSLL/PSRL/PSRA uniform instructions.
I'm still addressing a number of regressions when used for normal vector shifts, so I've just handled the funnelshift/rotation lowering for this first patch. I can then focus on the yak shaving (SimplifyDemandedBits/Elts in particular) necessary to always use SelectionDAG::getSplatSourceVector.
Differential Revision: https://reviews.llvm.org/D119090
Pulled out of D106237, this replaces the X86ISD::AVG DAG node with the
generic ISD::AVGCEILU. It doesn't remove the detectAVGPattern method,
but the extra generic ISel matching does alter the existing test.
Differential Revision: https://reviews.llvm.org/D119073
Replace the *_EXTEND node with the raw operands, this will make it easier to use combineToExtendBoolVectorInReg for any boolvec extension combine.
Cleanup prep for Issue #53760
The introduction and some examples are on this page:
https://devblogs.microsoft.com/cppblog/announcing-jmc-stepping-in-visual-studio/
The `/JMC` flag enables these instrumentations:
- Insert, at the beginning of every function immediately after the prologue,
a call to `void __fastcall __CheckForDebuggerJustMyCode(unsigned char *JMC_flag)`.
The argument for `__CheckForDebuggerJustMyCode` is the address of a boolean
global variable (the global variable is initialized to 1) with the name
convention `__<hash>_<filename>`. All such global variables are placed in
the `.msvcjmc` section.
- The `<hash>` part of `__<hash>_<filename>` has a one-to-one mapping
with a directory path. MSVC uses some unknown hashing function. Here I
used DJB.
- Add a dummy/empty COMDAT function `__JustMyCode_Default`.
- Add `/alternatename:__CheckForDebuggerJustMyCode=__JustMyCode_Default` link
option via ".drectve" section. This is to prevent failure in
case `__CheckForDebuggerJustMyCode` is not provided during linking.
Implementation:
All the instrumentations are implemented in an IR codegen pass. The pass is placed immediately before CodeGenPrepare pass. This is to not interfere with mid-end optimizations and make the instrumentation target-independent (I'm still working on an ELF port in a separate patch).
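Conceptually, the instrumented code looks roughly like this at the source level (a sketch only - the hash value and function name are made up, and the real pass inserts the call at the IR level):
```cpp
// Flag global placed in .msvcjmc, initialized to 1; the name follows the
// __<hash>_<filename> convention described above.
unsigned char __A1B2C3_example_cpp = 1;

extern "C" void __fastcall __CheckForDebuggerJustMyCode(unsigned char *JMC_flag);

void someFunction() {
  // Inserted immediately after the prologue of every instrumented function.
  __CheckForDebuggerJustMyCode(&__A1B2C3_example_cpp);
  // ... original body ...
}
```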
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D118428
This new-ish LEA-fixup code path creates two substitutions for an
instruction number -- this is incorrect because each Value should be
replaced by a single replacement Value. Fix by deleting the duplicate
substitution. Add some test coverage for this path with debug-info
attached.
Differential Revision: https://reviews.llvm.org/D119232
This ensures that the Windows unwinder will work at every instruction
boundary, and allows other targets to read and write flags without
setting up a frame pointer.
Fixes GH-46875
Differential Revision: https://reviews.llvm.org/D119391
This fix is similar to 3cf3ffce240e ("Fix the TCRETURNmi64 bug differently."):
after allocating registers for index+base, we will only have one register left.
This bug affects linux kernel compilation for x86 target. Error happens when compiling kmod_si476x_core.
clang complains:
error: ran out of registers during register allocation
The full command is:
clang -Wp,-MMD,drivers/mfd/.si476x-cmd.o.d -nostdinc -isystem /opt/toolchain/main/lib/clang/14.0.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu89 -no-integrated-as --prefix=/usr/bin/ -Werror=unknown-warning-option -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m32 -msoft-float -mregparm=3 -freg-struct-return -fno-pic -mstack-alignment=4 -march=atom -mtune=atom -mtune=generic -Wa,-mtune=generic32 -ffreestanding -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-address-of-packed-member -O2 -Wframe-larger-than=1024 -fno-stack-protector -Wno-format-invalid-specifier -Wno-gnu -mno-global-merge -Wno-unused-but-set-variable -Wno-unused-const-variable -fomit-frame-pointer -ftrivial-auto-var-init=pattern -fno-stack-clash-protection -falign-functions=32 -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-array-bounds -fno-strict-overflow -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-pointer-to-enum-cast -Wno-tautological-constant-out-of-range-compare -DKBUILD_MODFILE='"drivers/mfd/si476x-core"' -DKBUILD_BASENAME='"si476x_cmd"' -DKBUILD_MODNAME='"si476x_core"' -D__KBUILD_MODNAME=kmod_si476x_core -c -o drivers/mfd/si476x-cmd.o drivers/mfd/si476x-cmd.c
-------------
LLVM cannot compile the following code for the 32-bit x86 target: the tail call (TCRETURNmi) uses 2 registers for index+base, while we want to use more than one register for passing function args, and that is impossible.
This fix is similar to 3cf3ffce240e ("Fix the TCRETURNmi64 bug differently.").
We will only use a tail call when it uses <=1 registers for passing args.
```
struct BIG_PARM {
int ver;
};
static struct {
int (*foo) (struct BIG_PARM* a, void *b);
int (*bar) (struct BIG_PARM* a);
int (*zoo0) (void);
int (*zoo1) (void);
int (*zoo2) (void);
int (*zoo3) (void);
int (*zoo4) (void);
} vtable[] = {
[0] = {
.foo = (int (*)(struct BIG_PARM* a, void *b))0xdeadbeef,
},
};
int something(struct BIG_PARM *a, void* b) {
return vtable[a->ver].foo(a,b);
}
```
```
$ clang -std=gnu89 -m32 -mregparm=3 -mtune=generic -fno-strict-overflow -O2 -c t0.c -o t0.c.o
error: ran out of registers during register allocation
1 error generated.
```
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D118312
Previously the `let Predicates = ...` line only applied to the rr version, and
so VMOVSH was being emitted whenever HasAVX512 (the default) applied. This is
not right.
There are a few relevant forward declarations in there that may require
downstream users to add explicit includes:
llvm/MC/MCContext.h no longer includes llvm/BinaryFormat/ELF.h, llvm/MC/MCSubtargetInfo.h, llvm/MC/MCTargetOptions.h
llvm/MC/MCObjectStreamer.h no longer includes llvm/MC/MCAssembler.h
llvm/MC/MCAssembler.h no longer includes llvm/MC/MCFixup.h, llvm/MC/MCFragment.h
Counting preprocessed lines required to rebuild llvm-project on my setup:
before: 1052436830
after: 1049293745
Which is significant and backs up the change, in addition to the usual
benefits of decreased coupling between headers and compilation units.
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D119244
The "-fzero-call-used-regs" option tells the compiler to zero out
certain registers before the function returns. It's also available as a
function attribute: zero_call_used_regs.
The two upper categories are:
- "used": Zero out used registers.
- "all": Zero out all registers, whether used or not.
The individual options are:
- "skip": Don't zero out any registers. This is the default.
- "used": Zero out all used registers.
- "used-arg": Zero out used registers that are used for arguments.
- "used-gpr": Zero out used registers that are GPRs.
- "used-gpr-arg": Zero out used GPRs that are used as arguments.
- "all": Zero out all registers.
- "all-arg": Zero out all registers used for arguments.
- "all-gpr": Zero out all GPRs.
- "all-gpr-arg": Zero out all GPRs used for arguments.
This is used to help mitigate Return-Oriented Programming exploits.
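For example, the function-attribute form can be applied per function (a usage sketch; the option spellings are the ones listed above):
```cpp
// Zero out used GPRs that carried arguments before this function returns.
__attribute__((zero_call_used_regs("used-gpr-arg")))
int sum(int a, int b) {
  return a + b;
}
```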
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D110869
Most Intel CPU scheduler files lumped the by-immediate and by-1 instructions
together, but uops.info shows they are quite different.
For the most part the by-1 instructions were pretty accurate to the uops.info
data, except that the latency was modelled as 3 instead of the 2 that
uops.info indicates. The by-immediate instructions need 7 or 8 uops and have
higher latency. It looks like the 8-bit by-immediate instructions may need
even more uops, but I just lumped them with the 16/32/64-bit ones.
Noticed while checking out PR53648. So mostly I cared about the by 1
instructions.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D119217
As suggested by @craig.topper, relaxing LEA matching to only require the ADD to be fed from a single op with EFLAGS helps avoid duplication when the EFLAGS are consumed in a later, dependent instruction.
There was some concern about whether the heuristic is too simple, not taking into account lost loads that can't fold by using a LEA, but some basic tests (included in select-lea.ll) don't suggest that's really a problem.
Differential Revision: https://reviews.llvm.org/D118128
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.
This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.
I've updated a number of cases that were either performing separate size/position calculations or had created their own local wrapper versions of these.
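A sketch of how the new form reads (the exact signature is assumed from the description, so treat it as illustrative):
```cpp
#include "llvm/Support/MathExtras.h"
#include <cstdint>

// Recover the position and length of a shifted mask in one call instead of
// computing them separately after isShiftedMask_64 succeeds.
bool decomposeMask(uint64_t Mask, unsigned &MaskIdx, unsigned &MaskLen) {
  // On success: Mask == (((1ULL << MaskLen) - 1) << MaskIdx)
  return llvm::isShiftedMask_64(Mask, MaskIdx, MaskLen);
}
```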
Differential Revision: https://reviews.llvm.org/D119019
This is effectively inverting the transform added with D116804
because the downside of the false dependency of something like
"sbb %eax, %eax" is much greater than the upside of eliminating
a zeroing instruction on (all?) Intel CPUs.
Differential Revision: https://reviews.llvm.org/D118843
This reverts commit b4b97ec813.
As discussed in post-commit feedback at:
https://reviews.llvm.org/D118376
...there's a stage 2 failure on a Mac running a clang-refactor tool test.
As noted in D116804, we want to effectively invert that patch
for CPUs (intel) that don't break the false dependency on
sbb %eax, %eax
So we will likely want to create that here in the
X86DAGToDAGISel::Select() case for X86::SETCC_CARRY.
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a 1st
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actually touch these (they're purely used internally down the recursive call) - it's trivial to add another wrapper if anything ever does want to track known elements.
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden header dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest changes below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k lines less to process is not that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
Without AVX512 (which can efficiently extend/truncate to vXi16/vXi32), unpacking/packing to vXi16 is more efficient than relying on the (uops-heavy) PBLENDV shift expansion.
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes #50310
Differential Revision: https://reviews.llvm.org/D118376
Previous folds by combineSetCCMOVMSK might have converted these to CMP when changing the bitwidth, and the CMP->SUB fold might not have happened (or will happen)
rG9103b73fe052 was assuming that we could OR/AND with the source vector, but that will fail on float/double vectors without bitcasting - it also missed the case that any_of checks might be testing less than all the source elements
InstCombine performs this more generally with SimplifyUsingDistributiveLaws, but we don't need anything that complex here - this is mainly to fix up cases where logic ops get created late on during lowering, often in conjunction with sext/zext ops for type legalization.
https://alive2.llvm.org/ce/z/gGpY5v
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
There are no further reasons to limit this to cmpeq-with-zero; the outstanding regressions with lowering to PTEST have now been addressed.
Improves codegen for Issue #53379
vXi16 allof(cmp()) reduction patterns will have to pack the comparison
results to vXi8 to use PMOVMSKB.
If we're reducing cmpeq(), then we can compare the vXi8 halves directly - similar to what we already do for vXi64 -> vXi32 for cases without PCMPEQQ.
As suggested on PR53379, for all-of icmp-eq patterns, we can use ptest(sub(x,y)) on SSE41+ targets
This is a generalization of the existing allof(cmpeq(x,0)) -> ptest(x) pattern
We can probably extend this further, in particularly to handle 256-bit cases on pre-AVX2 targets, but this part of the generalization is pretty trivial
Fixes Issue #53379
MSVC currently doesn't support 80-bit long double. ICC supports it when
the option `/Qlong-double` is specified. Changing the alignment of f80
to 16 bytes so that we can be compatible with ICC's option.
Reviewed By: rnk, craig.topper
Differential Revision: https://reviews.llvm.org/D115942
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
Intel's CET/IBT requires every indirect branch target to be an ENDBR instruction. Because of that, the compiler needs to correctly emit these instructions in function prologues. Because this is a security feature, it is desirable that only actual indirect-branch-targeted functions are emitted with ENDBRs. While it is possible to identify address-taken functions through LTO, minimizing these ENDBR instructions remains a hard task for user-space binaries because exported functions may end up being reachable through PLT entries, which will use an indirect branch. Because this cannot be determined at compile time, the compiler currently emits ENDBRs in every non-local-linkage function.
Despite the challenge presented for user-space, the kernel landscape is different as no PLTs are used. With the intent of providing the most fit ENDBR emission for the kernel, kernel developers proposed an optimization named "ibt-seal" which replaces the ENDBRs for NOPs directly in the binary. The discussion of this feature can be seen in [1].
This diff brings the enablement of the flag -mibt-seal, which in combination with LTO enforces a different policy for ENDBR placement when the code model is set to "kernel". In this scenario, the compiler will only emit ENDBRs for address-taken functions, ignoring non-address-taken functions that don't have local linkage.
A comparison between an LTO-compiled kernel binaries without and with the -mibt-seal feature enabled shows that when -mibt-seal was used, the number of ENDBRs in the vmlinux.o binary patched by objtool decreased from 44383 to 33192, and that the number of superfluous ENDBR instructions nopped-out decreased from 11730 to 540.
The 540 missed superfluous ENDBRs need to be investigated further, but hypotheses are: assembly code not being taken care of by the compiler, kernel exported symbols mechanisms creating bogus address taken situations or even these being removed due to other binary optimizations like kernel's static_calls. For now, I assume that the large drop in the number of ENDBR instructions already justifies the feature being merged.
[1] - https://lkml.org/lkml/2021/11/22/591
Reviewed By: xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D116070
AVX512 doesn't provide an ADDSUB instruction, but if we've built this from a build vector of scalar fsub/fadd elements we can still lower to blend(fsub,fadd).
Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc.
These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered.
Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first.
@lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed.
Based on an original patch by @andreadb (Andrea Di Biagio)
Differential Revision: https://reviews.llvm.org/D117497
This allows us to match shuffle<1,3,5,7,9,11,13,15> style shift+trunc/pack patterns as well as the existing shuffle<0,2,4,6,8,10,12,14> style shuffle trunc/pack patterns
In the future, interleaving patterns might benefit from an even more general implementation for higher strides
Allows shuffle combining through per-element shift nodes
This exposed a number of issues with shuffle combining with target intrinsics that are lowered to nodes later during legalization - in particular shuffle combining and SimplifyDemandedVectorElts were being called after canonicalizeShuffleWithBinOps, meaning that shuffles didn't have a chance to be combined away before the shuffle(binop(x,y)) -> binop(shuffle(x),shuffle(y)) fold.
Partial element rotate patterns (e.g. for element insertion on Issue #53124) were being split if every lane wasn't crossing, but really there's a good repeated mask hiding in there.
This is required to query the legality more precisely in the LoopVectorizer.
This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.
Differential Revision: https://reviews.llvm.org/D115329
MSVC currently doesn't support 80-bit long double. ICC supports it when
the option `/Qlong-double` is specified. Changing the alignment of f80
to 16 bytes so that we can be compatible with ICC's option.
Reviewed By: rnk, craig.topper
Differential Revision: https://reviews.llvm.org/D115942
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.
Differential Revision: https://reviews.llvm.org/D116779
This is a re-commit of e2c7ee0743 which
was reverted in a2a58d91e8 and
ea81cea816. This includes a fix to
consistently check for EFLAGS being live-out. See phabricator
review.
Original Summary:
This extends `optimizeCompareInstr` to re-use previous comparison
results if the previous comparison was with an immediate that was 1
bigger or smaller. Example:
```
CMP x, 13
...
CMP x, 12 ; can be removed if we change the SETg
SETg ...  ; x > 12 changed to `SETge` (x >= 13) removing CMP
```
Motivation: This often happens because SelectionDAG canonicalization
tends to add/subtract 1 often when optimizing for fallthrough blocks.
Example for `x > C` the fallthrough optimization switches true/false
blocks with `!(x > C)` --> `x <= C` and canonicalization turns this into
`x < C + 1`.
Differential Revision: https://reviews.llvm.org/D110867
This diff renames emitCalleeSavedFrameMoves to avoid conflicts with
non-virtual methods of derived classes having the same name but different semantics.
E.g. the class AArch64FrameLowering used to have (non-virtual) "emitCalleeSavedFrameMoves"
but it started to override TargetFrameLowering::emitCalleeSavedFrameMoves after
https://github.com/llvm/llvm-project/commit/c3e6555616 though its usage and semantics didn't change.
P.S. for x86 there was no conflict because the signature of
non-virtual X86FrameLowering::emitCalleeSavedFrameMoves is different
Test plan: make check-all
Differential revision: https://reviews.llvm.org/D114140
This is a suggested follow-up to D116765.
This removes a clear of the register operand, so it is better
for code size, but it does potentially create a false register
dependency on surrounding code. If that is a problem, it should
be solvable using dependency-breaking code that is used for
other instructions.
Differential Revision: https://reviews.llvm.org/D116804
To enable this on all targets there's still a number of regressions due to getSplatValue/getTargetVShiftNode but these don't really affect pre-AVX targets.
This is very similar to the existing ROTL/ROTR support for scalar shifts in LowerRotate, I think as time goes on we should be able to share much of this code in helpers between Funnel Shift + Rotation lowering.
select (X != 0), -1, Y --> 0 - X; or (sbb), Y
select (X != 0), Y, -1 --> X - 1; or (sbb), Y
We already had these x86 carry-flag transforms, but one was over-specified to
handle a "0" select arm only. That's just a special-case of the more general
pattern (the 'or' will be deleted if Y is zero).
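A scalar model of the carry trick behind the first line (my own snippet): `0 - X` sets the borrow exactly when X is nonzero, and `sbb r, r` turns that borrow into 0 or -1.
```cpp
#include <cassert>
#include <cstdint>

uint32_t fold(uint32_t X, uint32_t Y) {
  uint32_t borrow = X != 0;   // carry flag produced by '0 - X' (neg)
  uint32_t sbb = 0u - borrow; // sbb r, r => 0 or 0xFFFFFFFF
  return sbb | Y;             // or (sbb), Y
}

int main() {
  for (uint32_t X : {0u, 1u, 5u})
    for (uint32_t Y : {0u, 7u, ~0u})
      assert(fold(X, Y) == ((X != 0) ? ~0u : Y));
}
```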
This is part of solving #53006, but it misses that example because some other
combine has already converted that exact pattern into math ops.
Differential Revision: https://reviews.llvm.org/D116765