1. When checking if a candidate contains a CFI instruction, actually
iterate over all of the instructions, instead of stopping halfway
through.
2. Make sure copied CFI directives refer to the correct instruction.
Fixes https://github.com/llvm/llvm-project/issues/55842
Differential Revision: https://reviews.llvm.org/D126930
In the same spirit as D73543 and in reply to https://reviews.llvm.org/D126768#3549920, this patch adds support for `__builtin_memset_inline`.
The idea is to get support from the compiler to easily write efficient memory function implementations.
This patch could be split in two:
- one for the LLVM part adding the `llvm.memset.inline.*` intrinsics.
- and another one for the Clang part providing the intrinsic as a builtin.
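A minimal usage sketch (the helper name is illustrative): the builtin behaves like memset but requires a constant size and is always expanded inline.

```
// Hedged sketch: the size argument must be a compile-time constant, and
// the call is guaranteed to expand inline (never a libc memset call).
void zero32(void *dst) {
  __builtin_memset_inline(dst, 0, 32);
}
```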
Differential Revision: https://reviews.llvm.org/D126903
Fixes a regression on D127115 - splitting was creating extract_subvector(bitcast(build_vector())) patterns which prevented the build vectors being split before being bitcast to vXi16 types, resulting in various issues with further folding of the (now legal) build vectors
It's just as safe to move shuffles across X86ISD::ANDNP as any other logical bitop; they just tend to appear too late to matter.
Noticed while triaging D127115 regressions.
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
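A hedged sketch of what a target override might look like, following the description above (exact parameter types and the `cloneInfo` helper are assumptions based on this patch series, not verbatim code):

```
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/Support/Allocator.h"
// X86MachineFunctionInfo.h is the target's private header.
using namespace llvm;

// Sketch: the default copy is fine unless fields point into the old
// function, in which case Src2DstMBB would be used to remap them.
MachineFunctionInfo *X86MachineFunctionInfo::clone(
    BumpPtrAllocator &Allocator, MachineFunction &DestMF,
    const DenseMap<MachineBasicBlock *, MachineBasicBlock *> &Src2DstMBB)
    const {
  return DestMF.cloneInfo<X86MachineFunctionInfo>(*this);
}
```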
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
First step towards enabling shuffle combining starting from VSELECT/BLENDV nodes - this should eventually help improve the codegen reported at Issue #54819
We may need to peek through a bitcast when identifying an fneg idiom
via its pool constant, but we can't allow a different-sized constant
in that match.
This is noted in issue #55758 with an example that needs fast-math,
but as the test here shows, this has potential to miscompile more
generally (no fast-math required).
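For reference, the idiom being matched is a sign-bit XOR against a pool constant; a small sketch of the scalar equivalent (illustrative only, not the backend code):

```
#include <cstdint>
#include <cstring>

// fneg as the backend sees it: XOR with a sign-bit constant. The fix is
// about ensuring the matched pool constant has the same width as x.
float fneg_via_xor(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits ^= UINT32_C(0x80000000); // flip only the sign bit
  std::memcpy(&x, &bits, sizeof(bits));
  return x;
}
```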
Differential Revision: https://reviews.llvm.org/D126775
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node
Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
Originally we tried to use default expansion for v4i64 types to make it easier to concatenate the results back together, but this can cause infinite loop issues with existing VSELECT splitting code in narrowExtractedVectorSelect if we have other uses of the VSELECT results (e.g. reduction patterns).
To fix the infinite loop, this patch always splits MIN/MAX v4i64 nodes during lowering and I've added a TODO for combineConcatVectorOps to investigate when we can cheaply concatenate VSELECT/BLENDV nodes together.
Fixes #55648 - regression test case will be added in a follow-up.
We don't seem to need this for any test coverage and it was making tracking of the uses() of the source vector more difficult
Noticed while investigating Issue #55648
We have a generic DAG combine to attempt to fold extract_subvector(bitcast(vec)) -> bitcast(extract_subvector(vec)) but if we create these patterns late in lowering then we often miss them.
Noticed while investigating Issue #55648 which gets caught in an infinite loop trying to split extract_subvector(bitcast(vselect())) patterns - this doesn't fix the issue yet but reduces the regressions from the WIP fix.
znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping.
Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r
znver1 ymm variants of VPMOVSX**/VPMOVZX** instructions require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
There is an intrinsic `@llvm.x86.ldtilecfg` which is lowered to LDTILECFG.
This intrinsic is open for users to configure tile registers by
themselves. There is a chance that `@llvm.x86.ldtilecfg` would be mixed
with the new AMX intrinsics, which depend on the compiler to configure
tile registers. A separate pseudo instruction PLDTILECFGV avoids
unexpected behaviour when `@llvm.x86.ldtilecfg` is mixed with the new
AMX intrinsics. Though users should not mix the two programming models,
the compiler should avoid crashes or UB when they are mixed.
Differential Revision: https://reviews.llvm.org/D126519
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.
For X86, the new interface allows us to fix a couple of issues:
* Correctly adjust the value of PC-relative operands.
* Set operand size to zero when the operand is specified implicitly.
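A hedged sketch of the revised hook shape (parameter names follow the description above; the exact LLVM signature may differ), along with the PC-relative computation that motivates it:

```
#include <cstdint>

// Sketch only: the symbolizer callback now receives both sizes instead
// of one overloaded Size parameter.
struct SymbolizerSketch {
  virtual bool tryAddingSymbolicOperand(int64_t Value, uint64_t Address,
                                        bool IsBranch, uint64_t Offset,
                                        uint64_t OpSize, uint64_t InstSize) = 0;
  virtual ~SymbolizerSketch() = default;
};

// Why both sizes matter: an X86 PC-relative operand is relative to the
// *end* of the instruction, which Offset + OpSize cannot reliably give
// (e.g. a trailing immediate can follow the displacement).
uint64_t pcRelTarget(uint64_t Address, uint64_t InstSize, int64_t Value) {
  return Address + InstSize + static_cast<uint64_t>(Value);
}
```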
Differential Revision: https://reviews.llvm.org/D126101
I think we need to be sure the load isn't volatile before we
duplicate and shrink it.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D126353
We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs.
This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for the upper 128-bit subvector extraction before it can handle the element. For scalarization costs we only need to extract each demanded subvector once.
Differential Revision: https://reviews.llvm.org/D125527
The previous solution depends on variable names to record the shape
information. However, it is not reliable, because in release builds the
compiler does not set variable names. This can be worked around with the
additional option `-fno-discard-value-names`, but that is not acceptable
for users.
This patch preconfigures the tile registers with machine instructions,
following the same approach as single configuration. In the future we
can fall back to multiple configurations when single configuration fails
due to shape dependency issues.
The algorithm to configure the tile registers is simple in this patch;
we may improve it in the future. It configures tile registers per basic
block, and the compiler spills a tile register if it is live out of the
basic block. After configuration there should be no spill across a tile
configuration in register allocation. Just like fast register allocation,
the algorithm walks the instructions in reverse order. When the shape
dependency isn't met, it inserts ldtilecfg after the last instruction
that defines the shape.
In post-configuration the compiler also walks the basic block to collect
the physical tile register numbers and generates instructions to fill the
stack slots with the corresponding shape information.
TODO: There is some follow-up work in D125602. The risk is that modifying
the fast RA may cause regressions, as fast RA is used for other targets.
We may create an independent RA for tile registers.
Differential Revision: https://reviews.llvm.org/D125075
Support the "-fzero-call-used-regs" option on AArch64. This involves much less
specialized code than the X86 version. Most of the checks can be done with
TableGen.
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D124836
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
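A small before/after sketch of the migration (assuming an LLVM build context):

```
#include "llvm/ADT/APInt.h"
using llvm::APInt;

// zext to the same width is now an accepted no-op, so the OrSelf form
// is redundant (and extending to a *smaller* width still asserts).
APInt widenTo64(const APInt &V) {
  // Before: V.zextOrSelf(64). After: plain zext handles BitWidth == 64.
  return V.zext(64);
}
```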
Differential Revision: https://reviews.llvm.org/D125557
We already use combineAddOrSubToADCOrSBB to fold extended EFLAGS results into ISD::ADD/SUB ops as X86ISD::ADC/SBB carry ops.
This patch extends this to also try to fold EFLAGS results with X86ISD::ADD/SUB ops
Differential Revision: https://reviews.llvm.org/D125642
znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.
Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.
This is NFC for non-X86 targets.
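A hedged sketch of the X86 policy (the callback's exact name and signature in the patch may differ from this free-function form):

```
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/Support/MachineValueType.h"
using namespace llvm;

// Small integer conditions (i8/i16) often force zero-extensions on
// X86, so widen them to i32; otherwise keep the type unchanged.
static MVT getSwitchConditionTypeSketch(EVT CondVT) {
  if (CondVT.isScalarInteger() && CondVT.getSizeInBits() < 32)
    return MVT::i32;
  return CondVT.getSimpleVT(); // assumes a simple (legal) VT here
}
```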
Differential Revision: https://reviews.llvm.org/D124894
These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both
Fixes #53242
Based off the script from D103695, on AVX1, Jaguar/Bulldozer both have low throughput for ymm select patterns (BLENDV + OR(AND,ANDN)), and even on AVX2 Haswell still struggles with BLENDV ops
To match the cost of other scheduling models. This is expected to
schedule mov instructions around INLINEASM less frequently for the
default machineschedule (pre-RA scheduling).
Suggested by Craig Topper.
Link: https://github.com/llvm/llvm-project/issues/41914
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122350
To generate a zero value, the PXOR instruction needs 3 operands that are
tied to the same vreg. This is not good in SSA form, and with undef
values the two-address instruction pass may convert
`%0:vr128 = PXORrr undef %0, undef %0`
to `%1:vr128 = PXORrr undef %1:vr128(tied-def 0), undef %0:vr128`,
which is not expected.
It can be simplified to the SET0 instruction, which only takes 1
destination operand. That should be friendlier to the two-address
instruction pass and the register allocation pass:
`%0:vr128 = V_SET0`
Also add an AVX1 code path so that it is consistent with other code.
Differential Revision: https://reviews.llvm.org/D124903
znver2 is mainly a search+replace of the znver1 model, but for no reason the HADD and DPPS entries have been moved around - try to keep these in sync (no actual changes in the models).
We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.
The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.
The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
If the legalized src/dst types are the same, assume the "truncation" is free.
This fixes some edge cases such as mul lo/hi ops and bool vectors which will get legalized back to legal vector widths
Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput
We need to normalize the mask to avoid possible crashes when estimating
the cost of very long shuffles with a non-power-of-2 number of elements
in the mask.
In MASM, if a QWORD symbol is passed to a jmp or call instruction in
64-bit mode or a DWORD or WORD symbol is passed in 32-bit mode, then
MSVC's assembler recognizes that as an indirect call. Additionally, if
the operand is qualified as a ptr, then that should also be an indirect
call.
Furthermore, in 64-bit mode, such operands are implicitly rip-relative
(in fact, MSVC's assembler ml64.exe does not allow explicitly specifying
rip as a base register.)
To keep this patch manageable, it does not include:
* error messages for wrong operand types (e.g. passing a QWORD in 32-bit
mode)
* resolving indirect calls if the symbol is declared after its first
use (llvm-ml currently only runs a single pass).
* implementing the extern keyword (required to resolve
https://crbug.com/762167.)
This patch is likely missing a bunch of edge cases, so please do point
them out in the review.
Reviewed By: epastor, hans, MaskRay
Committed By: epastor (on behalf of ayzhao)
Differential Revision: https://reviews.llvm.org/D124413
Introduced masks where they were not added before and improved
target-dependent cost models to avoid returning incorrect cost results
after adding masks.
Differential Revision: https://reviews.llvm.org/D100486
Ideally we'd fold this with generic DAGCombiner, but that only works for !isTruncateFree cases - we might be able to adapt IsDesirableToPromoteOp to find truncated src ops in the future, but for now just use this peephole.
Noticed in Issue #55138
The `llvm.x86.cast.tile.to.vector` intrinsic is lowered to
`llvm.x86.tilestored64.internal` and `load <256 x i32>`. The
`llvm.x86.cast.vector.to.tile` is lowered to `store <256 x i32>` and
`llvm.x86.tileloadd64.internal`. When `llvm.x86.cast.tile.to.vector` is
used by `store <256 x i32>`, or `load <256 x i32>` is used by
`llvm.x86.cast.vector.to.tile`, they can be combined into
`llvm.x86.tilestored64.internal` and `llvm.x86.tileloadd64.internal`.
Differential Revision: https://reviews.llvm.org/D124378
Before this patch, `Args` was used by SLP to pass a broadcast's
arguments. This patch changes this: `Args` is now used for passing the
operands of the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
Instead of reporting a fatal error, this patch emits an error message
and exits when shapes are not pre-defined. This causes compilation to
fail but not crash.
Differential Revision: https://reviews.llvm.org/D124342
The related instructions are:
VPERMD/Q/PS/PD
VRANGEPD/PS/SD/SS
VGETMANTSS/SD/SH
VGETMANTPS/PD - mem version only
VPMULLQ
VFMULCSH/PH
VFCMULCSH/PH
Differential Revision: https://reviews.llvm.org/D116072
This is x86 specific, and adds statefulness to
MachineModuleInfo. Instead of explicitly tracking this, infer if we
need to declare the symbol based on the reference previously inserted.
This produces a small change in the output due to the move from
AsmPrinter::doFinalization to X86's emitEndOfAsmFile. This will now be
moved relative to other end of file fields, which I'm assuming doesn't
matter (e.g. the __morestack_addr declaration is now after the
.note.GNU-split-stack part)
This also produces another small change in code if the module happened
to define/declare __morestack_addr, but I assume that's invalid and
doesn't really matter.
This is used to emit one field in doFinalization for the module. We
can accumulate this when emitting all individual functions directly in
the AsmPrinter, rather than accumulating additional state in
MachineModuleInfo.
Move the special case behavior predicate into MachineFrameInfo to
share it. This now promotes it to generic behavior. I'm assuming this
is fine because no other target implements adjustForSegmentedStacks,
or has tests using the split-stack attribute.
ValueMap should only be necessary if the IR values can be
replaced. This is only used during codegen, when it's illegal to
change the underlying IR. This allows using the default copy
constructor for X86MachineFunctionInfo.
I'm not happy about targets keeping state here that's only used in one
specific pass, but we don't have a better place to put it right now.
Calling hasOneUse can be expensive on nodes with multiple results,
especially when some results are Chains. By checking the opcode first,
we can avoid walking the uses if it isn't an interesting node,
and thus avoid calling hasOneUse on a node that might have many uses.
Found by profiling the IR given in D123857.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D123881
Checking opcode is cheap. hasOneUse might not be if the node has
multiple results. By checking the opcode we can rule out nodes
with multiple results we aren't interested in.
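A sketch of the resulting check ordering (the opcode here is illustrative, not the one from the patch):

```
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// getOpcode() is a cheap field read; hasOneUse() may walk a long use
// list on multi-result nodes, so it goes second and short-circuits.
static bool isOneUseTruncate(SDValue Op) {
  return Op.getOpcode() == ISD::TRUNCATE && Op.hasOneUse();
}
```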
znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
unsigned int 0 will be converted to float/double -0.0 when the rounding
mode is set to 'FE_DOWNWARD'. Use the FILD instruction instead of SSE
instructions on 32-bit targets if strictfp is enabled.
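The corner case can be demonstrated in plain C++ (a sketch; strict FENV_ACCESS handling varies by compiler):

```
#include <cfenv>
#include <cstdio>

// Under round-downward, an exact zero produced by subtraction is -0.0.
// An SSE-based uint->float conversion of 0 goes through such a
// subtraction, which is why FILD is preferred here under strictfp.
int main() {
  std::fesetround(FE_DOWNWARD);
  double z = 0.0 - 0.0;   // IEEE-754: -0.0 in round-toward-negative
  std::printf("%g\n", z); // prints -0
}
```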
Differential Revision: https://reviews.llvm.org/D123660
This patch adds support for inline assembly address operands using the "p"
constraint on X86 and SystemZ.
This was in fact broken on X86 (see example at
https://reviews.llvm.org/D110267, Nov 23).
These operands should probably be treated the same as memory operands by
CodeGenPrepare, which have been commented with "TODO" there.
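For reference, a typical GNU-syntax use of the "p" constraint (a hedged example; the %a modifier prints the operand as an address):

```
// "p" asks for a valid address operand; %a0 emits it as a memory
// address in AT&T syntax (classic prefetch idiom).
static inline void prefetch(const void *ptr) {
  __asm__ volatile("prefetcht0 %a0" : : "p"(ptr));
}
```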
Review: Xiang Zhang and Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D122220
This reverts the functional changes of D103427 but keeps its tests, and
reimplements the functionality by reusing the existing 32-bit
MASKMOVDQU and VMASKMOVDQU instructions, as suggested by skan in review.
These instructions were previously predicated on Not64BitMode. This
reimplementation restores the disassembly of a class of instructions,
which will see a test added in followup patch D122449.
In 64-bit mode, these instructions are special cased in
X86MCInstLower::Lower, because we use flags with one meaning for subtly
different things: we have an AdSize32 class which indicates both that
the instruction needs a 0x67 prefix and that the text form of the
instruction implies a 0x67 prefix. These instructions are special in
needing a 0x67 prefix but having a text form that does *not* imply a
0x67 prefix, so we encode this in MCInst as an instruction that has an
explicit address size override.
Note that originally VMASKMOVDQU64 was special cased to be excluded from
disassembly, as we cannot distinguish between VMASKMOVDQU and
VMASKMOVDQU64 and rely on the fact that these are indistinguishable, or
close enough to it, at the MCInst level that it does not matter which we
use. Because VMASKMOVDQU now receives special casing, even though it
does not make a difference in the current implementation, as a
precaution VMASKMOVDQU is excluded from disassembly rather than
VMASKMOVDQU64.
Reviewed By: RKSimon, skan
Differential Revision: https://reviews.llvm.org/D122540
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.
Now matches AMD SoG, Agner and instlatx64 numbers.
Fixes #54811
extract_subvector(insert_subvector(V,X,C1),C1) -> insert_subvector(extract_subvector(V,C1),X,0)
More aggressively attempt to reduce the width of an extract_subvector source - we currently only do this if we're inserting into a zero vector (i.e. canonicalizing to the AVX implicit zero upper elts pattern).
But if we're extracting from the same point as the inner insert_subvector then the fold is still relatively trivial - we can probably do even better if we can ensure the subvector isn't badly split.
When walking the user chain to get the shape of a phi node, if there is
a phi node in the chain, we should walk to the users of that phi node
instead of the original phi node.
When we fold vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)), ensure we convert all undef elements to zero elements - this should help us expose more known zero elements for deeper chains of these cases.
Noticed while triaging Issue #54819
znver2 is mainly a search+replace of the znver1 model, but for no reason some lines have been moved around - try to keep these in sync (no actual changes in the models).
Don't try to directly use the with.overflow flag result in a cmov
if we need to materialize constants between the instruction
producing the overflow flag and the cmov. The current code is
careful to check that there are no other instructions in between,
but misses the constant materialization case (which may clobber
eflags via xor or constant expression evaluation).
Fixes https://github.com/llvm/llvm-project/issues/54369.
Differential Revision: https://reviews.llvm.org/D122825
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.
We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.
Fixes #36407
rGa3b8695bf592 enabled this for znver3, but AMD SoG, Agner and uops.info all agree that even znver1 has a fast per-lane shuffle op (VPSHUFB), while cross-lane shuffles seem to be slow (PERMPS etc.)
Fixes #44140
Differential Revision: https://reviews.llvm.org/D123306
smin(x, 0):
(select (x < 0), x, 0) -> ((x >> (size_in_bits(x)-1))) & x
smax(x, 0):
(select (x > 0), x, 0) -> (~(x >> (size_in_bits(x)-1))) & x
The comparison is testing for a positive value, so we have to invert the
sign-bit mask; only do this transform if the target has a bitwise
'and not' instruction (making the invert free).
The transform is performed only when CMP has a single user, to avoid
increasing the total instruction count.
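A worked scalar form of the two folds (a sketch; it relies on arithmetic right shift of signed values, which C++20 guarantees):

```
#include <cassert>
#include <cstdint>

// x >> 31 is all-ones for negative x, all-zeros otherwise.
int32_t smin0(int32_t x) { return (x >> 31) & x; }  // smin(x, 0)
int32_t smax0(int32_t x) { return ~(x >> 31) & x; } // smax(x, 0)

int main() {
  for (int32_t x : {-5, -1, 0, 1, 42}) {
    assert(smin0(x) == (x < 0 ? x : 0));
    assert(smax0(x) == (x > 0 ? x : 0));
  }
}
```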
https://alive2.llvm.org/ce/z/euUnNm
https://alive2.llvm.org/ce/z/37339J
Differential Revision: https://reviews.llvm.org/D123109
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.
Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
```
X86::MMX_MOVD64from64rr -> X86::MMX_MOVQ64mr
X86::MMX_MOVD64grr -> X86::MMX_MOVD64mr
```
These two entries were added in llvm-svn: 372770.
I think these two should be reversible.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D122217
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.
Differential Revision: https://reviews.llvm.org/D123117
Intuitively, the memory folding pair should have the same mnemonic.
This patch removes
```
{X86::SENDUIPI,X86::VMXON}
```
in the auto-generated table.
And the `NotMemoryFoldable` annotations for `TPAUSE` and `CLWB` can be
dropped.
```
{X86::MOVLHPSrr,X86::MOVHPSrm}
{X86::VMOVLHPSZrr,X86::VMOVHPSZ128rm}
{X86::VMOVLHPSrr,X86::VMOVHPSrm}
```
It seems the three pairs above were mistakenly dropped, but we can add
them back manually later.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D122477
The current implementation only recognizes the absolute-value operation
implemented by a select instruction. This patch adds support for the abs
intrinsic.
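A hedged sketch of the extra match (names are illustrative, not the patch's exact code):

```
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Value.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Alongside the select-based idiom, also accept the llvm.abs intrinsic;
// its second operand is the is-int-min-poison flag.
static bool matchAbsIntrinsic(Value *V, Value *&Src) {
  return match(V, m_Intrinsic<Intrinsic::abs>(m_Value(Src), m_Value()));
}
```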
Differential Revision: https://reviews.llvm.org/D122777
Without VBMI, we are better off permuting v16i32 sub-lanes, even though it's a variable shuffle, if it allows us to then shuffle v64i8 inlane repeated masks (PSHUFB etc.)
Fixes #54658
Allows us to fold XOR(X, MIN_SIGNED_VALUE) == ADD(X, MIN_SIGNED_VALUE) into LEA patterns
As mentioned on PR52267.
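The underlying identity, checked in a tiny sketch: adding the sign bit and XORing it are the same because the carry out of the top bit is discarded.

```
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x : {0u, 1u, 0x7fffffffu, 0x80000000u, 0xdeadbeefu}) {
    assert((x ^ 0x80000000u) == x + 0x80000000u);
  }
}
```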
Differential Revision: https://reviews.llvm.org/D122815
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.
Recommitted with a fix to ensure we zext/trunc the SETCC result to the original type.
Differential Revision: https://reviews.llvm.org/D122891
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.
Differential Revision: https://reviews.llvm.org/D122891
This approach is used by AArch64/RISCV to make frame-setup/frame-destroy
instructions contiguous instead of being interleaved by CFI instructions. Code
checking `MBBI->getFlag(MachineInstr::FrameSetup) || MBBI->isCFIInstruction()`
can be simplified to just check FrameSetup.
This helps locate all CFI instructions in the prologue, which makes it
handy to use .cfi_remember_state/.cfi_restore_state to decrease unwind
table size (D114545).
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122541
This inverts a fold recently added to IR with:
3491f2f4b0
We can put -bidirectional on the Alive2 examples to show that
the reverse transforms work:
https://alive2.llvm.org/ce/z/8iVQwB
The motivation for the IR change was to improve matching to
'fabs' in IR (see https://github.com/llvm/llvm-project/issues/38828 ),
but it regressed x86 codegen for 'not-quite-fabs' patterns like
(X > -X) ? X : -X.
I.e., when there is no fast-math (nsz), the cmp+select is not a proper
fabs operation, but it does map nicely to the unusual NAN semantics
of MINSS/MAXSS.
I drafted this as a target-independent fold, but it doesn't appear to
help any other targets and seems to cause regressions for SystemZ at
least.
Differential Revision: https://reviews.llvm.org/D122726
The AMX combiner would store undef or zero to the stack and invoke
tileload to load the data into a tile register. To avoid the store/load,
we can materialize undef or zero values with tilezero.
Differential Revision: https://reviews.llvm.org/D122714
As mentioned on D122482, if we've generated a masked overflow test see if we can fold it to X86ISD::BT to feed a X86ISD::ADC/SBB
Differential Revision: https://reviews.llvm.org/D122572
If we're not relying on the flag result, we can fold the constants together into the RHS immediate operand and set the LHS operand to zero, simplifying for further folds.
We could do something similar if the flag result is in use and the constant fold doesn't affect it, but I don't have any real test cases for this yet.
As suggested by @davezarzycki on Issue #35256
Differential Revision: https://reviews.llvm.org/D122482
Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput
This is used for f16 emulation. We emulate f16 for SSE2 targets and
above. The refactoring makes future code cleaner.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D122475
1. Add comments to explain why we set `isAsmParserOnly` for XACQUIRE and XRELEASE
2. Check `X86Inst` in the constructor of `RecognizableInstrBase` so that
we can avoid the case where one of its fields is not initialized but
accessed by users. (e.g. in X86EVEX2VEXTablesEmitter.cpp)
3. Move `Rec` from `RecognizableInstrBase` to `RecognizableInstr` to reduce
the size of `RecognizableInstrBase`
4. Remove out-of-date comments for shouldBeEmitted() (filter() was removed)
5. Add a basic field `IsAsmParserOnly` and remove the field
`ShouldBeEmitted` because we can deduce it with little overhead
Asan complained about an uninitialized bool (`invalid-bool-load`):
```
llvm/lib/Target/X86/AsmParser/X86Operand.h:389:12: runtime error: load
of value 171, which is not a valid value for type 'bool'
```
Differential Revision: https://reviews.llvm.org/D122405
MMX_MOVD64from64rr moves an MMX register to a 64-bit GPR.
MMX_MOVD64from64mr is the memory version of moving an MMX register to a
64-bit GPR. It requires the REX.W bit to be set. There are no isel
patterns that use this instruction.
MMX_MOVQ64mr is the MMX register store instruction. It doesn't
require a REX.W prefix. This makes it one byte shorter to encode
than MMX_MOVD64from64mr in many cases.
Both store instructions output the same mnemonic string. The assembler
would choose MMX_MOVQ64mr if it were to parse the output, which is
another reason using it is the correct thing to do.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122241
As suggested on PR35908, if we are adding/subtracting an extracted bit, attempt to use BT instead to fold the op and use an ADC/SBB op.
Reapply with extra type legality checks - LowerAndToBT was originally only used during lowering, now that it can occur earlier we might encounter illegal types that we can either promote to i32 or just bail.
Differential Revision: https://reviews.llvm.org/D122084
AVX512 has excellent broadcast ops for everything but vXi1 bool vectors - so if we're broadcasting a comparison result, see if we can broadcast the comparison operands instead.