The previous code modeled the tile config register as a use of each
AMX instruction. That causes a problem when the tile config register is
spilled: across function calls, an ldtilecfg may be inserted before
every AMX instruction that uses the tile config register, and each such
reload clobbers all tile data registers.
To fix this, we remove the modeling of the tile config register.
Instead, we analyze the regmask of each call instruction and insert an
ldtilecfg if any tile data register is live across the call. Inserting
an sttilecfg before the call is unnecessary, because the tile config
doesn't change and we can simply reload it.
Besides that, we also need to check tile config register interference.
Since we don't model the config register, we check interference from
the ldtilecfg to each tile data register def.
          ldtilecfg
          /       \
       BB1         BB2
       /             \
     call            BB3
     /                 \
  %1=tileload       %2=tilezero
We can start from each tile def instruction and walk backward to the
ldtilecfg. If any call instruction along the way does not preserve the
tile data register, we insert an ldtilecfg after that call instruction.
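As a rough sketch of that backward scan (an illustration of the idea,
not the in-tree pass; the helper name and the way the reload point is
reported are invented here), the single-block portion could look like:
```
// Walk backward from a tile-register def toward the ldtilecfg (only the
// single-block part of the walk is shown) and return the first call whose
// regmask does not preserve the tile register; the ldtilecfg reload would
// be inserted right after that call.
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

using namespace llvm;

static MachineInstr *findCallNeedingReload(MachineInstr &TileDef,
                                           MCRegister TileReg) {
  for (MachineInstr *MI = TileDef.getPrevNode(); MI; MI = MI->getPrevNode()) {
    if (!MI->isCall())
      continue;
    for (const MachineOperand &MO : MI->operands())
      if (MO.isRegMask() &&
          MachineOperand::clobbersPhysReg(MO.getRegMask(), TileReg))
        return MI; // a reload is needed after this call
  }
  return nullptr; // no clobbering call here; keep walking the predecessors
}
```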
Differential Revision: https://reviews.llvm.org/D94155
The x86_amx type is used for AMX intrinsics. <256 x i32> is bitcast to
x86_amx when it is used by AMX intrinsics, and x86_amx is bitcast to
<256 x i32> when it is used by load/store instructions. So AMX intrinsics
only operate on the x86_amx type, which helps separate AMX intrinsics
from ordinary LLVM IR instructions (+, -, *, /).
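A minimal sketch of the bitcast convention, assuming the
Type::getX86_AMXTy and FixedVectorType helpers; the function names below
are placeholders for illustration only:
```
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// <256 x i32> -> x86_amx, done right before the value feeds an AMX intrinsic.
static Value *castToAMX(IRBuilder<> &Builder, Value *Vec) {
  return Builder.CreateBitCast(Vec, Type::getX86_AMXTy(Builder.getContext()));
}

// x86_amx -> <256 x i32>, done right before a plain load/store style use.
static Value *castToVector(IRBuilder<> &Builder, Value *Tile) {
  Type *VecTy = FixedVectorType::get(Builder.getInt32Ty(), 256);
  return Builder.CreateBitCast(Tile, VecTy);
}
```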
Thanks to Craig for the idea. This patch depends on https://reviews.llvm.org/D87981.
Differential Revision: https://reviews.llvm.org/D91927
The ABI explains that %fs:(%eax) zero-extends %eax to 64 bits and adds
the TLS base address, but the TLS base address need not be at the start
of the TLS block; TLS references may use negative offsets.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D93158
This patch implements the AMX programming model discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
Thanks to Hal for the good suggestion on the RA. The fast RA is not in the patch yet.
This patch implements 7 components:
1. The C interface to the end user (see the sketch after this list).
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
type into two <128 x i32>.
4. Lowering from AMX intrinsics to AMX pseudo instructions.
5. Insert pseudo ldtilecfg and build the def-use chain between ldtilecfg
and each AMX instruction.
6. Register allocation for tile registers.
7. Morph AMX pseudo instructions into real AMX instructions.
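As a hedged illustration of component 1, a user-level kernel might look
like the following; the _tile_* intrinsic names from immintrin.h and the
need to configure tile shapes beforehand (e.g. via _tile_loadconfig) are
assumptions of this sketch rather than details taken from the patch.
```
#include <immintrin.h>

// Assumes the tile palette/shapes were already configured and the file is
// compiled with AMX support enabled.
void tile_multiply_accumulate(const void *A, const void *B, void *C,
                              long stride) {
  _tile_loadd(0, A, stride);  // load tile register 0 from A
  _tile_loadd(1, B, stride);  // load tile register 1 from B
  _tile_zero(2);              // clear the accumulator tile
  _tile_dpbssd(2, 0, 1);      // tile2 += tile0 * tile1 (signed int8 dot product)
  _tile_stored(2, C, stride); // store the accumulator back to C
}
```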
Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0
Differential Revision: https://reviews.llvm.org/D87981
Pretty sure we meant to be checking signed 32-bit immediates here
rather than unsigned 32-bit. I suspect I messed this up because
MathExtras.h has isIntN and isUIntN, so isIntN differs in signedness
depending on whether you're using APInt or plain integers.
This fixes a case where we didn't fold a constant created
by shrinkAndImmediate. Since shrinkAndImmediate doesn't topologically
sort constants it creates, we can fail to convert the Constant
to a TargetConstant. This leads to very strange behavior later.
Fixes PR48458.
Since x32 supports PC-relative addressing, it shouldn't use EBX for the
TLS address. Instead of checking N.getValueType(), we should check
Subtarget->is32Bit(). This fixes PR22676.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D16474
This uses PreprocessISelDAG to replace the constant before
instruction selection instead of matching opcodes after.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D89178
In the small code model, the program and its symbols are linked in the
lower 2 GB of the address space. In that case, try encoding the global
address even when the range is unknown.
Differential Revision: https://reviews.llvm.org/D89341
We should avoid emitting MachineSDNodes from lowering.
We can use the implicit def handling in InstrEmitter to avoid
manually copying from each xmm result register. We only need to
manually emit the copies for the implicit uses.
Instead of emitting MachineSDNodes during lowering, emit X86ISD
opcodes. These opcodes will either be selected by tablegen
patterns or custom selection code.
Emitting MachineSDNodes during lowering is uncommon so this makes
things more consistent. It also allows selectAddr to be called to
perform address matching during instruction selection.
I had trouble getting tablegen to accept XMM0-XMM7 as results in
an isel pattern for the WIDE instructions so I had to use custom
instruction selection.
This is the type declared in X86InstrFragmentsSIMD.td. ISel pattern
matching doesn't check it, so it doesn't matter in practice. Maybe it
would matter for SelectionDAG CSE.
By factoring out the end of tryVPTERNLOG, we can use the same code
to directly match X86ISD::VPTERNLOG. This allows us to remove
around 3-4K worth of X86GenDAGISel.inc.
When we use mask compare intrinsics under the strict FP option, the
masked elements shouldn't raise any exception. So, we can't replace the
intrinsic with a full compare + "and" operation.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D85385
Now we try load and broadcast together for operand 0, followed by load
and broadcast for operand 1. Previously we tried load operand 0, load
operand 1, broadcast operand 0, broadcast operand 1.
Now we have a single helper that tries load and broadcast for
one operand that we can just call twice.
Rather than hardcoding immediate values for 12 different combinations
in a nested pair of switches, we can perform the matched logic
operation on 3 magic constants to calculate the immediate.
Special thanks to this tweet https://twitter.com/rygorous/status/1187034321992871936
for making me realize I could do this.
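As a minimal sketch of the trick (the exact constant-to-operand mapping
is an assumption here): give each source operand the byte that
enumerates its truth-table column, then evaluate the matched logic
operation on those bytes to get the VPTERNLOG immediate.
```
#include <cstdint>
#include <cstdio>

int main() {
  const uint8_t A = 0xF0, B = 0xCC, C = 0xAA; // truth-table bytes for the sources
  uint8_t AndOrImm = (A & B) | C;             // immediate for (a & b) | c
  uint8_t Xor3Imm = A ^ B ^ C;                // immediate for a ^ b ^ c
  std::printf("0x%02X 0x%02X\n", AndOrImm, Xor3Imm); // prints 0xEA 0x96
}
```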
Previously we just matched the logic ops and replaced with an
X86ISD::VPTERNLOG node that we would send through the normal
pattern match. But that approach couldn't handle a bitcast
between the logic ops. Extending that approach would require us
to peek through the bitcasts and emit new bitcasts to match
the types. Those new bitcasts would then have to be properly
topologically sorted.
This patch instead switches to directly emitting the
MachineSDNode and skips the normal tablegen pattern matching.
We do have to handle load folding and broadcast load folding
ourselves now. Which also means commuting the immediate control.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D83630
I think this mostly looks ok. The only weird thing I noticed was
a couple rotate vXi8 tests picked up an extra logic op where we have
(and (or (and), (andn)), X). Previously we matched the (or (and), (andn))
to vpternlog, but now we match the (and (or), X) and leave the and/andn
unmatched.
This function picks the X86 opcode name based on type, masking, and
whether or not a load or broadcast has been folded, using multiple
switch statements. The contents of the switches mostly just vary in a
few characters of the instruction name. So use some macros to build the
instruction names and reduce the repetitiveness.
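A self-contained illustration of the macro idea; the enum values and
helper below are hypothetical stand-ins for the real X86 opcode enums
and are only meant to show the token-pasting pattern.
```
#include <cstdio>

enum Opc { VPTESTM128rr, VPTESTM128rrk, VPTESTM256rr, VPTESTM256rrk };

static Opc pickOpc(int VecBits, bool Masked) {
// Paste together the parts of the name that vary, so each width needs one
// macro invocation instead of a hand-written pair of cases.
#define VPTESTM_CASE(BITS)                                                     \
  case BITS:                                                                   \
    return Masked ? VPTESTM##BITS##rrk : VPTESTM##BITS##rr;
  switch (VecBits) {
    VPTESTM_CASE(128)
    VPTESTM_CASE(256)
  }
#undef VPTESTM_CASE
  return VPTESTM128rr;
}

int main() { std::printf("%d\n", pickOpc(256, true)); } // prints 3
```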
> relocImm was a complexPattern that handled both ConstantSDNode
> and X86Wrapper. But it was only applied selectively because using
> it would cause patterns to be not importable into FastISel or
> GlobalISel. So it only got applied to flag setting instructions,
> stores, RMW arithmetic instructions, and rotates.
>
> Most of the test changes are a result of making patterns available
> to GlobalISel or FastISel. The absolute-cmp.ll change is due to
> this fixing a pattern ordering issue to make an absolute symbol
> match to an 8-bit immediate before trying a 32-bit immediate.
>
> I tried to use PatFrags to reduce the repetition, but I was getting
> errors from TableGen.
This caused "Invalid EmitNode" assertions, see the llvm-commits thread for
discussion.
relocImm was a complexPattern that handled both ConstantSDNode
and X86Wrapper. But it was only applied selectively because using
it would cause patterns to be not importable into FastISel or
GlobalISel. So it only got applied to flag setting instructions,
stores, RMW arithmetic instructions, and rotates.
Most of the test changes are a result of making patterns available
to GlobalISel or FastISel. The absolute-cmp.ll change is due to
this fixing a pattern ordering issue to make an absolute symbol
match to an 8-bit immediate before trying a 32-bit immediate.
I tried to use PatFrags to reduce the repetition, but I was getting
errors from TableGen.
As shown in PR46237:
https://bugs.llvm.org/show_bug.cgi?id=46237
The size-savings win for hoisting an 8-bit ALU immediate (intentionally
excluding store constants) requires extreme conditions; it may not even
be possible when including REX prefix bytes on x86-64.
I did draft a version of this patch that included use counts after the
loop, but I suspect that accounting is not working as expected. I think
that is because the number of constant uses is changing as we select
instructions (for example as we transform shl/add into LEA).
Differential Revision: https://reviews.llvm.org/D81468
The instruction is defined to only produce the high result if both
destinations are the same. We can exploit this to avoid
unnecessarily clobbering a register.
In order to hide this from register allocation we use a pseudo
instruction and expand the result during MCInst creation.
Differential Revision: https://reviews.llvm.org/D80500
Looking back over gcc and icc behavior it looks like icc does
use mulx32 on 32-bit targets and mulx64 on 64-bit targets. It's
also used when dividing i32 by constant on 32-bit targets and
i64 by constant on 64-bit targets.
gcc uses it for multiplies producing a 64-bit result on 32-bit targets
and a 128-bit result on 64-bit targets. gcc does not appear to use it
for division by constant.
After this patch clang is closer to the icc behavior. This
basically reverts d1c61861dd, but
there were no strong feelings at the time.
Fixes PR45518.
Differential Revision: https://reviews.llvm.org/D80498
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
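A hedged sketch of that bookkeeping; the container layout and helper
name below are illustrative and do not claim to match the actual
X86MachineFunctionInfo members.
```
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Value.h"

using namespace llvm;

struct PreallocatedInfo {
  DenseMap<const Value *, unsigned> IdForCallSite;     // call Value -> index
  SmallVector<unsigned, 4> StackAdjustment;            // index -> total size
  SmallVector<SmallVector<unsigned, 4>, 4> ArgOffsets; // index -> arg offsets

  // Hand out a small integer index for each preallocated call site; the
  // index later keys the %esp adjustment and the per-argument offsets.
  unsigned getId(const Value *CallSite) {
    auto It = IdForCallSite.try_emplace(CallSite, StackAdjustment.size());
    if (It.second) { // first time we see this call site
      StackAdjustment.push_back(0);
      ArgOffsets.emplace_back();
    }
    return It.first->second;
  }
};
```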
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer of the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer of the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
This ensures we create mem operands for these instructions, fixing PR45949.
Unfortunately, it increases the size of X86GenDAGISel.inc, but some dag
combine canonicalization could reduce the types of load we need to match.
This patch stores the alignment for ConstantPoolSDNode as an
Align and updates the getConstantPool interface to take a MaybeAlign.
Removing getAlignment() will be done as a follow up.
Differential Revision: https://reviews.llvm.org/D79436
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.
This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.
The cost model is still a bit of a mess. In some places I tried to
match existing behavior, but really we need to account for splitting
and concatenating costs. The cost model for shuffles is especially
pessimistic.
Differential Revision: https://reviews.llvm.org/D76212
-Drop llvm:: on MVT::i32
-Use getValueType instead of getSimpleValueType for an equality
check just because it's shorter and doesn't matter.
-Don't create a const SDValue & since it's cheap to copy.
-Remove the explicit cast from MVT enum to EVT.
-Add message to assert.
This will cause the operation to be repeated in both a masked form and
another masked or unmasked form. This can be a waste of execution
resources.
Differential Revision: https://reviews.llvm.org/D60940
Instead add it when we make the machine nodes during instruction
selection.
This makes this ISD node closer to ISD::MGATHER. Trying to see if we
can remove the X86-specific ones.
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming its an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
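A small sketch of that canonicalization (the helper name is a
placeholder, not the in-tree function): reinterpret an FP mask as the
equivalent integer vector before any bit-level analysis runs on it.
```
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

static SDValue canonicalizeGatherMask(SelectionDAG &DAG, SDValue Mask) {
  EVT MaskVT = Mask.getValueType();
  if (!MaskVT.isFloatingPoint())
    return Mask; // already an integer vector, nothing to do
  // Same bits, integer element type, so DemandedBits-style reasoning is valid.
  return DAG.getBitcast(MaskVT.changeVectorElementTypeToInteger(), Mask);
}
```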
The type profile we used for the isel patterns lied about how many
operands the gather/scatter node has, in order to skip the index and
scale operands. This allowed us to expand the baseptr
operand into base, displacement, and segment and then merge
the index and scale with them in the final instruction during
isel. This is kind of a hack that relies on isel not checking the
number of operands at all.
This commit switches to custom isel where we can manage this
directly without relying on holes in the isel checking.
This allows it to work properly with masked inc/dec for avx512. Those
would have a vselect as the root node, so they didn't get a chance to
call combineIncDecVector.
This also simplifies the logic because we don't have to manage
the topological ordering.
The flag isn't used, but I believe this matches the MOV32r0 that
would be created by the table emitter. This should allow this node
to be CSEed with any others created by the table.
A vselect+strictfp node is not equivalent to a masked operation.
The exceptions of the strictfp node are not masked by a vselect
after it so we can't match it to a masked operation.
We already had a hack in IsLegalToFold to prevent these patterns from
matching. This patch removes that hack and removes the patterns.
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.
Previously this was done by calling ConvertCmpIfNecessary in
multiple places, which would emit the extra FNSTSW, shift, truncate,
and SAHF instructions. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.
This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.
Differential Revision: https://reviews.llvm.org/D73863
Only 32 and 64 bit SBB are dependency breaking instructions on some
CPUs. The 8 and 16 bit forms have to preserve upper bits of the GPR.
This patch removes the smaller forms and selects the wider form
instead. I had to do this with custom code as the tblgen generated
code glued the eflags copytoreg to the extract_subreg instead of
to the SETB pseudo.
Longer term I think we can remove X86ISD::SETCC_CARRY and use
(X86ISD::SBB zero, zero). We'll want to keep the pseudo and select
(X86ISD::SBB zero, zero) to either a MOV32r0+SBB for targets where
there is no dependency break and SETB_C32/SETB_C64 for targets
that have a dependency break. May want some way to avoid the MOV32r0
if the instruction that produced the carry flag happened to def a
register that we can use for the dependency.
I think the flag copy lowering should be using NEG instead of SUB to
handle SETB. That would avoid the MOV32r0 there. Or maybe it should
use a ADC with -1 to recreate the carry flag and keep the SETB?
That would avoid a MOVZX on the input of the SUB.
Differential Revision: https://reviews.llvm.org/D74024
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate the existing AM.Disp against the
symbolic displacement.
This is my second attempt at committing this after failing
build bots previously. One thing I realized about the previous
attempt is that it's possible that AM.Disp is already non-zero
and the new Offset changes it back to zero. In that case my
previous attempt failed to update AM.Disp to zero. So this patch
removes the early out for 0 and appropriately handles the 0 case
in each check so we still update AM.Disp at the end.
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate the existing AM.Disp against the
symbolic displacement.
This passes fold-add-pcrel.ll.
Differential Revision: https://reviews.llvm.org/D73608
For `ret i64 add (i64 ptrtoint (i32* @foo to i64), i64 1701208431)`,
```
X86DAGToDAGISel::matchAdd
...
// AM.setBaseReg(CurDAG->getRegister(X86::RIP, MVT::i64));
if (!matchAddressRecursively(N.getOperand(0), AM, Depth+1) &&
// Try folding offset but fail; there is a symbolic displacement, so offset cannot be too large
!matchAddressRecursively(Handle.getValue().getOperand(1), AM, Depth+1))
return false;
...
// Try again after commuting the operands.
// AM.Disp = Val; foldOffsetIntoAddress() does not know there will be a symbolic displacement
if (!matchAddressRecursively(Handle.getValue().getOperand(1), AM, Depth+1) &&
// AM.setBaseReg(CurDAG->getRegister(X86::RIP, MVT::i64));
!matchAddressRecursively(Handle.getValue().getOperand(0), AM, Depth+1))
// Succeeded! Produced leaq sym+disp(%rip),...
return false;
```
`foldOffsetIntoAddress()` currently does not know there is a symbolic
displacement and can fold a large offset.
The produced `leaq sym+disp(%rip), %rax` instruction is relocated by
an R_X86_64_PC32. If disp is large and sym+disp-rip>=2**31, there
will be a relocation overflow.
This approach is still not elegant. Unfortunately the isRIPRelative
interface is a bit clumsy. I tried several solutions and eventually
picked this one.
Differential Revision: https://reviews.llvm.org/D73606
We use the stack for X87 fp_round and for moving from SSE f32/f64 to
X87 f64/f80, or from X87 f64/f80 to SSE f32/f64.
Note that for the SSE<->X87 conversions, the conversion always happens
in the X87 domain. The load/store ops in the X87 instructions are able
to signal exceptions.
This allows us to delete InlineAsm::Constraint_i workarounds in
SelectionDAGISel::SelectInlineAsmMemoryOperand overrides and
TargetLowering::getInlineAsmMemConstraint overrides.
They were introduced to X86 in r237517 to prevent crashes for
constraints like "=*imr". They were later copied to other targets.
Fix several additional problems with the int <-> FP conversion logic,
both in common code and in the X86 target. In particular:
- The STRICT_FP_TO_UINT expansion emits a floating-point compare. This
compare can raise exceptions and therefore needs to be a strict compare.
I've made it signaling (even though quiet would also be correct) as
signaling is the more usual default for an LT. This code exists both
in common code and in the X86 target.
- The STRICT_UINT_TO_FP expansion algorithm was incorrect for strict mode:
it emitted two STRICT_SINT_TO_FP nodes and then used a select to choose one
of the results. This can cause spurious exceptions from the STRICT_SINT_TO_FP
that ends up not chosen. I've fixed the algorithm to use only a single
STRICT_SINT_TO_FP instead (a scalar model of the corrected expansion is
sketched after this list).
- The !isStrictFPEnabled logic in DoInstructionSelection would sometimes do
the wrong thing because it calls getOperationAction using the result VT.
But for some opcodes, including [SU]INT_TO_FP, getOperationAction needs to
be called using the operand VT.
- Remove some (obsolete) code in X86DAGToDAGISel::Select that would mutate
STRICT_FP_TO_[SU]INT to non-strict versions unnecessarily.
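As a rough scalar model of the corrected STRICT_UINT_TO_FP expansion (a
sketch of the idea, not the DAG-level code): adjust the input first so
only one signed conversion is ever executed, then double the result for
large inputs.
```
#include <cstdint>

// Convert a uint64_t to double with a single signed conversion.
double UIntToFP(uint64_t X) {
  bool Small = X < (1ull << 63);
  // For large X, halve it while folding the dropped bit back in so that
  // round-to-nearest-even still sees the correct sticky bit.
  int64_t Adjusted = Small ? (int64_t)X : (int64_t)((X >> 1) | (X & 1));
  double Converted = (double)Adjusted; // the one and only SINT_TO_FP
  return Small ? Converted : Converted + Converted;
}
```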
Reviewed by: craig.topper
Differential Revision: https://reviews.llvm.org/D71840
We really need to update the isel patterns to prevent this, but
that requires some tablegen de-tangling. So this hack will work
for correctness in the short term.
This has two main effects:
- Optimizes debug info size by saving 221.86 MB of obj file size in a
Windows optimized+debug build of 'all'. This is 3.03% of 7,332.7MB of
object file size.
- Incremental step towards decoupling target intrinsics.
The enums are still compact, so adding and removing a single
target-specific intrinsic will trigger a rebuild of all of LLVM.
Assigning distinct target id spaces is potential future work.
Part of PR34259
Reviewers: efriedma, echristo, MaskRay
Reviewed By: echristo, MaskRay
Differential Revision: https://reviews.llvm.org/D71320
Summary: This is a follow-up of D69281; it enables X86 backend support for FP comparison.
Reviewers: uweigand, kpn, craig.topper, RKSimon, cameron.mcinally, andrew.w.kaylor
Subscribers: hiraditya, llvm-commits, annita.zhang, LuoYuanke, LiuChen3
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70582
Summary:
This follows a previous patch that changes the X86 datalayout to represent
mixed size pointers (32-bit sext, 32-bit zext, and 64-bit) with address spaces
(https://reviews.llvm.org/D64931)
This patch implements the address space cast lowering to the corresponding
sign extension, zero extension, or truncate instructions.
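A simplified sketch of that lowering (not the in-tree code); the
address-space numbers follow D64931 (270 = 32-bit sext pointer, 271 =
32-bit zext pointer, 272 = 64-bit pointer) and should be treated as
assumptions of this sketch.
```
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

static SDValue lowerPtrCast(SelectionDAG &DAG, const SDLoc &DL, SDValue Src,
                            unsigned SrcAS, EVT DstVT) {
  EVT SrcVT = Src.getValueType();
  if (SrcVT == DstVT)
    return Src; // same width, no code needed
  if (DstVT.bitsLT(SrcVT))
    return DAG.getNode(ISD::TRUNCATE, DL, DstVT, Src); // 64-bit -> 32-bit
  // 32-bit -> 64-bit: sign or zero extend based on the source address space.
  unsigned ExtOp = (SrcAS == 270) ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
  return DAG.getNode(ExtOp, DL, DstVT, Src);
}
```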
Related to https://bugs.llvm.org/show_bug.cgi?id=42359
Reviewers: rnk, craig.topper, RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69639
This is a follow-up patch to D68854.
This patch adds the basic operations of X87 instructions, including +, -, *, /, fp extensions and fp truncations.
Patch by Chen Liu (LiuChen3)
Differential Revision: https://reviews.llvm.org/D68857