Summary:
ARMAsmParser was incorrectly dropping a leading dollar sign character
from symbol names in targets of branch instructions. This was caused by
an incorrect assumption that the contents following the dollar sign
token should be handled as a constant immediate, similarly to the #
token.
This patch prevents the operand parser from consuming the dollar sign
token when it is followed by an identifier, making sure it is properly
parsed as part of the expression.
Reviewers: efriedma
Reviewed By: efriedma
Subscribers: danielkiss, chill, carwil, vhscampos, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73176
Handle LinkOnceODRLinkage;
Handle AppendingLinkage type for llvm.global_ctors/dtors static init global arrays;
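For illustration (not taken from the patch itself), an appending-linkage static-init array looks like this in IR; the @init_foo name is made up:
@llvm.global_ctors = appending global [1 x { i32, void ()*, i8* }]
  [{ i32, void ()*, i8* } { i32 65535, void ()* @init_foo, i8* null }]
define internal void @init_foo() {
  ret void
}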
Differential Revision: https://reviews.llvm.org/D75305
Previously, for any copy from a register bigger than the destination, we:
- copied to a same-sized register in the destination register bank, then
- did a subregister copy of that to the destination.
This fails for copies from 128-bit FPRs to GPRs because the GPR register bank
can't accommodate 128-bit values.
Instead of special-casing such copies to perform the truncation beforehand in
the source register bank, generalize this:
a) Perform a subregister copy straight from source register whenever possible.
This results in shorter MIR and fixes the above problem.
b) Perform a full copy to target bank and then do a subregister copy only if
source bank can't support target's size. E.g. GPR to 8-bit FPR copy.
Patch by Raul Tambre (tambre)!
Differential Revision: https://reviews.llvm.org/D75421
Summary:
This seems like an obvious error - cut and paste issue?
The change does affect one of the lit tests - it stops an s_buffer_load
from being re-ordered past an MUBUF instruction (which is not surprising).
Change-Id: I80be99de5b62af4f42e91af2591b76a52ac9efa6
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75686
This is a follow up to the previous patch: [AIX] Implement caller
arguments passed in stack memory.
This corrects a defect in AIX 64-bit where an i32 is written to the
stack with stw (4 bytes) rather than the expected std (8 bytes). Integer
arguments are passed on the stack as images of their register representation.
I also took the opportunity to tidy up some of the calling convention
AIX tests I added in my last commit. This patch adds the missed assembly
expected output for the stack arg int case, which would have caught this
problem.
Differential Revision: https://reviews.llvm.org/D75126
Summary:
Hint instructions are printed as "hint\t#hintnum", except for ARM v8.3a
instructions, where only "hint #hintnum" is printed.
This patch changes all of them to the first format.
Reviewers: pbarrio, LukeCheeseman, vsk
Reviewed By: vsk
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75625
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.
Now we handle the scalar case directly by broadcasting using the
scalar type as the element type, then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.
For the vector case, we cast to VT or a subvector of VT, and then
do the broadcast.
I think this all matches what we generated before, just in a more
readable way.
Summary: X86 can reduce the number of NOP bytes by padding instructions with prefixes to get better performance in some cases. So a private member function `determinePaddingPrefix` is added to determine which prefix is the most suitable.
Reviewers: annita.zhang, reames, MaskRay, craig.topper, LuoYuanke, jyknight
Reviewed By: reames
Subscribers: llvm-commits, dexonsmith, hiraditya
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75357
If we have an explicit align directive, we currently default to emitting nops to fill the space. As discussed in the context of the prefix padding work for branch alignment (D72225), we're allowed to play other tricks such as extending the size of previous instructions instead.
This patch will convert near jumps to far jumps if doing so decreases the number of bytes of nops needed for a following align. It does so as a post-pass after relaxation is complete. It intentionally works without moving any labels or doing anything which might require another round of relaxation.
The point of this patch is mainly to mock out the approach. The optimization implemented is real, and possibly useful, but the main point is to demonstrate an approach for implementing such "pad previous instruction" approaches. The key notion in this patch is to treat padding previous instructions as an optional optimization, not as a core part of relaxation. The benefit to this is that we avoid the potential concern about increasing the distance between two labels and thus causing further potentially non-local code grown due to relaxation. The downside is that we may miss some opportunities to avoid nops.
For the moment, this patch only implements a small set of existing relaxations. Assuming the approach is satisfactory, I plan to extend this to a broader set of instructions where there are obvious "relaxations" which are roughly performance equivalent.
Note that this patch *doesn't* change which instructions are relaxable. We may wish to explore that separately to increase optimization opportunity, but I figured that deserved its own separate discussion.
There are possible downsides to this optimization (and all "pad previous instruction" variants). The major two are potentially increasing instruction fetch and perturbing uop caching. (i.e. the usual alignment risks) Specifically:
* If we pad an instruction such that it crosses a fetch window (16 bytes on modern X86-64), we may cause the decoder to have to trigger a fetch it wouldn't have otherwise. This can affect both decode speed and icache pressure.
* Intel's uop cache has particular restrictions on which instruction combinations can fit in a particular way. By moving instructions around, we can both cause new misses and change misses into hits. Many of the most painful cases are around branch density, so I don't expect this to be too bad on the whole.
On the whole, I expect to see small swings (i.e. the typical alignment change problem), but nothing major or systematic in either direction.
Differential Revision: https://reviews.llvm.org/D75203
Previously we tried to promote these to xmm/ymm/zmm by promoting
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a nonsensical scalar-to-vector conversion. This led
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.
Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved. Except when the calling convention is one
that passes the vXi1 type in a k register.
The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we go to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.
Differential Revision: https://reviews.llvm.org/D75154
I believe this is the correct fix for D75506 rather than disabling all commuting. We can still commute the remaining two sources.
Differential Revision: https://reviews.llvm.org/D75526
Create a wider source vector, and unmerge with dead defs like the
legalizer. The legalization handling for G_EXTRACT is incomplete, and
it's preferable to keep everything in 32-bit pieces.
We should probably start moving these functions into utils, since we
have a growing number of places that do almost the same thing.
SLH had two functions named isDataInvariant and isDataInvariantLoad that
checked whether the passed instruction was data invariant. For some instructions,
if the EFLAGS were dead then they were considered data invariant, otherwise
they were not considered data invariant.
In this patch, I extracted that EFLAGS liveness check and made it
explicit at every call to isDataInvariant and isDataInvariantLoad.
This makes the isDataInvariant function behave more generally
and preserves the liveness check behavior that SLH would like to have.
Tested via llvm-lit llvm/test/CodeGen/X86/speculative-load-hardening*
This is the first step in making these two data invariance checks
available for non-SLH passes. The second step is to move these functions from
SLH to X86InstrInfo.cpp. I'll follow up with a patch that does that.
Differential Revision: https://reviews.llvm.org/D70283
https://reviews.llvm.org/D42848 only handled CFA related cfi directives but
didn't handle CSR-related CFI. The patch adds the CSR part. Basically it reuses
the framework created in D42848. For each basic block, the patch tracks which
CSR set has been saved at its CFG predecessors' exits, and compares that CSR
set with the set at its previous basic block's exit (the previous block is the
block laid out before the current block). If the saved CSR set at the previous
basic block's exit is larger, .cfi_restore will be inserted.
The patch also generates proper .cfi_restore in epilogue to make sure the
saved CSR set is consistent for the incoming edges of each block.
Differential Revision: https://reviews.llvm.org/D74303
Spin-off from D75407. As described there, ConstantFoldConstant()
currently returns null for non-ConstantExpr/ConstantVector inputs,
but otherwise always returns non-null, independently of whether
any folding has happened or not.
This is confusing and makes consumer code more complicated.
I would expect ConstantFoldConstant() either to return non-null only if
it actually folded something, or to always return non-null.
I'm going with the latter here, which appears to be more
useful considering existing usage.
Differential Revision: https://reviews.llvm.org/D75543
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.
I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
If SimplifyDemandedBits succeeds in simplifying the byte src, add the CVT_F32_UBYTE node back to the worklist as we might be able to simplify further.
Yet another step towards removing SelectionDAG::GetDemandedBits.
Summary:
The VSHLC instruction performs a left shift of a whole vector register
by an immediate shift count up to 32, shifting in new bits at the low
end from a GPR and delivering the shifted-out bits from the high end
back into the same GPR.
Since the instruction produces two outputs (the shifted vector
register and the output GPR of shifted-out bits), it has to be
instruction-selected in C++ rather than Tablegen.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75445
Summary:
These are exactly parallel to the existing `vadciq` intrinsics, which
we implemented last year as part of the original MVE intrinsics
framework setup.
Just like VADC/VADCI, the MVE VSBC/VSBCI instructions deliver two
outputs, both of which the intrinsic exposes: a modified vector
register and a carry flag. So they have to be instruction-selected in
C++ rather than Tablegen. However, in this case, that's trivial: the
same C++ isel routine we already have for VADC works unchanged, and
all we have to do is to pass it a different instruction id.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75444
The computation here didn't really make sense to me, and reported
wildly different results depending on the flat work group size
attribute.
I think this should really report a range derived from the possible
work group size bounds, and only allow an occupancy that is a multiple
of the group size.
uops.info says these should be 15-cycle instructions. Uops.info also shows the 512-bit form uses ports 0 and 5 for both register and memory. We had the memory form using ports 0 and 1.
Differential Revision: https://reviews.llvm.org/D75549
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.
Differential Revision: https://reviews.llvm.org/D75413
Also add a DAG combine to combine different sized broadcasts from
constant pool to avoid a regression.
Differential Revision: https://reviews.llvm.org/D75412
The build_vector needs to be the only user of the data, but the
chain will likely have another use. So we can't make sure the
build_vector is the only user of the node.
On SystemZ there are a set of "access registers" that can be copied in and
out of 32-bit GPRs with special instructions. These instructions can only
perform the copy using low 32-bit parts of the 64-bit GPRs. However, the
default register class for 32-bit integers is GRX32, which also contains the
high 32-bit part registers.
In order to never end up with a case of such a COPY into a high reg, this
patch adds a new simple pre-RA pass that selects such COPYs into target
instructions.
This pass also handles COPYs from CC (Condition Code register), and COPYs to
CC can now also be emitted from a high reg in copyPhysReg().
Fixes: https://bugs.llvm.org/show_bug.cgi?id=44254
Review: Ulrich Weigand.
Differential Revision: https://reviews.llvm.org/D75014
Summary:
The argument that sets the prefetch type of a prefetch intrinsic must
be an immediate value.
Reviewers: andwar, sdesmalen, efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75482
Use MIOperand in collectLocalKilledOperands to make the search
global, as we already have to search for global uses too. This
allows us to delete more dead code when tail predicating.
Differential Revision: https://reviews.llvm.org/D75167
In RDA, check against the already decided dead instructions when
looking at users. This allows an instruction to be removed if it
has multiple users, but they're all dead.
This means that IT instructions can be considered killed once all
the itstate using instructions are dead.
Differential Revision: https://reviews.llvm.org/D75245
The incoming back chain slot was implicitly allocated whenever a GPR was
saved in SystemZFrameLowering::getRegSpillOffset(), but in cases where no
GPRs were saved/restored this did not take effect.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D75367
Forming subtract with overflow is beneficial on SystemZ, just like additions.
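For reference, a minimal IR example (my own, not taken from the patch's tests) of the pattern that can now form a subtract-with-overflow:
declare { i32, i1 } @llvm.ssub.with.overflow.i32(i32, i32)
define i1 @sub_overflows(i32 %a, i32 %b) {
  %res = call { i32, i1 } @llvm.ssub.with.overflow.i32(i32 %a, i32 %b)
  %ov = extractvalue { i32, i1 } %res, 1
  ret i1 %ov
}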
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D75290
Summary:
The LDRdPtr expanded from LDWRdPtr shouldn't define its second operand (SrcReg).
The second operand is its source register.
Adding -verify-machineinstrs to the command line of the test cases triggers this error.
Reviewers: dylanmckay
Reviewed By: dylanmckay
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75437
Summary:
It is not safe for ARMConstantIslands to undoLRSpillRestore. PrologEpilogInserter is
the one that ensures stack alignment, taking into consideration whether LR is spilled or not.
For a noreturn function with StackAlignment 8 (the function contains a call/alloca),
undoLRSpillRestore causes the stack to be misaligned. Fixing stack alignment in
ARMConstantIslands doesn't give us much benefit, as undoing the LR spill/restore only
occurs in large functions that contain only near branches and have no callee-saved LR spill.
Reviewers: t.p.northover, rengolin, efriedma, apazos, samparker, ostannard
Reviewed By: ostannard
Subscribers: dmgreen, ostannard, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75288
X86 has several instructions which are documented as enabling interrupts exactly one instruction *after* the one which changes the SS segment register. Inserting a nop between these two instructions allows an interrupt to arrive before the execution of the following instruction, which changes the semantic behaviour.
The list of instructions is documented in "Table 24-3. Format of Interruptibility State" in Volume 3c of the Intel manual. They basically all come down to different ways to write to the SS register.
Differential Revision: https://reviews.llvm.org/D75359
CFI instructions can only safely be outlined when the outlined call is a tail
call, or when the outlined frame is fixed up.
For the sake of correctness, disable outlining from CFI instructions.
Add machine-outliner-cfi.mir to test this.
- Remove unnecessary includes from the headers
- Fix cppcheck definition/declaration arg mismatch warnings
- Tidyup old comments (MVT usage was removed a long time ago)
- Use SmallVector::append for repeated mask entries
This patch upstreams support for the Arm Armv8.1-M CPU Cortex-M55.
In detail, it adds support for:
- mcpu option in clang
- Arm Target Features in clang
- llvm Arm TargetParser definitions
Details of the CPU can be found here:
https://developer.arm.com/ip-products/processors/cortex-m/cortex-m55
Reviewers: chill
Reviewed By: chill
Subscribers: dmgreen, kristof.beyls, hiraditya, cfe-commits,
llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74966
Summary:
This patch adds the following LLVM IR intrinsics for SVE:
1. non-temporal gather loads
* @llvm.aarch64.sve.ldnt1.gather
* @llvm.aarch64.sve.ldnt1.gather.uxtw
* @llvm.aarch64.sve.ldnt1.gather.scalar.offset
2. non-temporal scatter stores
* @llvm.aarch64.sve.stnt1.scatter
* @llvm.aarch64.sve.stnt1.scatter.uxtw
* @llvm.aarch64.sve.stnt1.scatter.scalar.offset
These intrinsic are mapped to the corresponding SVE instructions
(example for half-words, zero-extending):
* ldnt1h { z0.s }, p0/z, [z0.s, x0]
* stnt1h { z0.s }, p0/z, [z0.s, x0]
Note that for non-temporal gathers/scatters, the SVE spec defines only
one instruction type: "vector + scalar". For this reason, we swap the
arguments when processing intrinsics that implement the "scalar +
vector" addressing mode:
* @llvm.aarch64.sve.ldnt1.gather
* @llvm.aarch64.sve.ldnt1.gather.uxtw
* @llvm.aarch64.sve.stnt1.scatter
* @llvm.aarch64.sve.stnt1.scatter.uxtw
In other words, all intrinsics for gather-loads and scatter-stores
implemented in this patch are mapped to the same load and store
instruction, respectively.
The sve2_mem_gldnt_vs multiclass (and its counterpart for scatter
stores) from SVEInstrFormats.td was split into:
* sve2_mem_gldnt_vec_vs_32_ptrs (32-bit wide base addresses)
* sve2_mem_gldnt_vec_vs_64_ptrs (64-bit wide base addresses)
This is consistent with what we did for
@llvm.aarch64.sve.ld1.scalar.offset and highlights the actual split in
the spec and the implementation.
Reviewed by: sdesmalen
Differential Revision: https://reviews.llvm.org/D74858
Summary:
These instructions convert a vector of floats to a vector of integers
of the same size, with assorted non-default rounding modes.
Implemented in IR as target-specific intrinsics, because as far as I
can see there are no matches for that functionality in the standard IR
intrinsics list.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75255
Summary:
These instructions make a vector of `<4 x float>` by widening every
other lane of a vector of `<8 x half>`.
I wondered about representing these using standard IR, along the lines
of a shufflevector to extract elements of the input into a `<4 x half>`
followed by an `fpext` to turn that into `<4 x float>`. But it looks as
if that would take a lot of work in isel lowering to make it match any
pattern I could sensibly write in Tablegen, and also I haven't been
able to think of any other case where that pattern might be generated
in IR, so there wouldn't be any extra code generation win from doing
it that way.
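For reference, that considered standard-IR formulation would look roughly like the sketch below (the function name and the choice of even-numbered lanes are illustrative, not taken from the patch):
define <4 x float> @widen_even_lanes(<8 x half> %v) {
  %narrow = shufflevector <8 x half> %v, <8 x half> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %wide = fpext <4 x half> %narrow to <4 x float>
  ret <4 x float> %wide
}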
Therefore, I've just used another target-specific intrinsic. We can
always change it to the other way later if anyone thinks of a good
reason.
(In order to put the intrinsic definition near similar things in
`IntrinsicsARM.td`, I've also lifted the definition of the
`MVEMXPredicated` multiclass higher up the file, without changing it.)
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75254
Summary:
The two MVE instructions that convert between v4f32 and v8f16 were
implemented as instances of the same class, with the same MC operand
list.
But that's not really appropriate, because the narrowing conversion
only partially overwrites its output register (it only has 4 f16
values to write into a vector of 8), so even when unpredicated, it
needs a $Qd_src input, a constraint tying that to the $Qd output, and
a vpred_n.
The widening conversion is better represented like any other
instruction that completely replaces its output when unpredicated: it
should have no $Qd_src operand, and instead, a vpred_r containing a
$inactive parameter. That's a better match to other similar
instructions, such as its integer analogue, the VMOVL instruction that
makes a v4i32 by sign- or zero-extending every other lane of a v8i16.
This commit brings the widening VCVT.F32.F16 into line with the other
instructions that behave like it. That means you can write isel
patterns that use it unpredicated, without having to add a pointless
undefined $QdSrc operand.
No existing code generation uses that instruction yet, so there should
be no functional change from this fix.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75253
Summary:
These instructions work like VMOVN (narrowing a vector of wide values
to half size, and overwriting every other lane of an output register
with the result), except that the narrowing conversion is saturating.
They come in three signedness flavours: signed to signed, unsigned to
unsigned, and signed to unsigned. All are represented in IR by a
target-specific intrinsic that takes two separate 'unsigned' flags.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75252
The MVE gather instructions smaller than 32 bits zero-extend the values
in the offset register, as opposed to sign-extending them. We need to
make sure that the code that we select from is suitably extended, which
this patch attempts to fix by tightening up the offset checks.
Differential Revision: https://reviews.llvm.org/D75361
Summary:
It should be a normal constant instead of a target constant.
The CMPri pattern can be matched if the constant fits into the immediate field.
Otherwise, the CMPrr pattern will be matched.
This fixed bug https://bugs.llvm.org/show_bug.cgi?id=44091.
Reviewers: dcederman, jyknight
Reviewed By: jyknight
Subscribers: jonpa, hiraditya, fedor.sergeev, jrtc27, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75227
Summary:
Currently the boundary-align fragment caches its size during layout
and is then relaxed, updating that size in each iteration. This
behaviour is unnecessary and ugly.
Reviewers: annita.zhang, reames, MaskRay, craig.topper, LuoYuanke, jyknight
Reviewed By: MaskRay
Subscribers: hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75404
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.
Fixes PR45067.
This removes everything but int_x86_avx512_mask_vcvtph2ps_512 which provides the SAE variant, but even this can use the fpext generic if the rounding control is the default.
Differential Revision: https://reviews.llvm.org/D75162
MCObjectStreamer is more suitable for creating fragments than
X86AsmBackend; for example, the function getOrCreateDataFragment is
defined in MCObjectStreamer.
Differential Revision: https://reviews.llvm.org/D75351
When bundling is enabled, the data fragment itself has space to emit NOPs
to bundle-align instructions. This behaviour makes it impossible for
us to determine whether macro fusion really happens when emitting
instructions. In addition, the boundary-align fragment is also used to
emit NOPs to align instructions, and currently using them together sometimes
makes the code crazy.
Differential Revision: https://reviews.llvm.org/D75346
We already combine non-extending loads with broadcasts in DAG
combine. All these patterns are picking up is the aligned extload
special case. But the only lit test we have that exercises it is
using a v8i1 load for which the datalayout reports an alignment of 8. That
seems generous. So without a realistic test case I don't think
there is much value in these patterns.
As a narrow stopgap for the assertion failure described in PR45025, add
a describeLoadedValue override to ARMBaseInstrInfo and use it to detect
copies in which the forwarding reg is a super/sub reg of the copy
destination. For the moment this is unsupported.
Several follow ups are possible:
1) Handle VORRq. At the moment, we do not, because isCopyInstrImpl
returns early when !MI.isMoveReg().
2) In the case where forwarding reg is a super-reg of the copy
destination, we should be able to describe the forwarding reg as a
subreg within the copy destination. I'm not 100% sure about this, but
it looks like that's what's done in AArch64InstrInfo.
3) In the case where the forwarding reg is a sub-reg of the copy
destination, maybe we could describe the forwarding reg using the
copy destination and a DW_OP_LLVM_fragment (I guess this should be
possible after D75036).
https://bugs.llvm.org/show_bug.cgi?id=45025
rdar://59772698
Differential Revision: https://reviews.llvm.org/D75273
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
Summary:
Final patch in series to fix inlining between functions with different
nobuiltin attributes/options, which was specifically an issue in LTO.
See discussion on D61634 for background.
The prior patch in this series (D67923) enabled per-Function TLI
construction that identified the nobuiltin attributes.
Here I have allowed inlining to proceed if the callee's nobuiltins are a
subset of the caller's nobuiltins, but not in the reverse case, which
should be conservatively correct. This is controlled by a new option,
-inline-caller-superset-nobuiltin, which is enabled by default.
Reviewers: hfinkel, gchatelet, chandlerc, davidxl
Subscribers: arsenm, jvesely, nhaehnle, mehdi_amini, eraman, hiraditya, haicheng, dexonsmith, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74162
In some cases when HexagonTargetLowering::allowsMemoryAccess returned
true, it did not set the "Fast" argument, leaving it uninitialized.
[Hexagon] Improve casting of boolean HVX vectors to scalars
- Mark memory access for bool vectors as disallowed in target lowering.
This will prevent combining bitcasts of bool vectors with stores.
- Replace the actual bitcasting code with a faster version.
- Handle casting of v16i1 to i16.
This adds extra patterns for the VMLAS MVE instruction, which performs
Qda = Qda * Qn + Rm, a similar pattern to the existing VMLA. The sinking
of splat(Rm) into the loop is already performed, meaning we just need
extra Pat's in tablegen.
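Roughly, the IR shape being matched looks like this sketch (names are made up, and in practice the splat is sunk into a loop body):
define <4 x i32> @vmlas_shape(<4 x i32> %qda, <4 x i32> %qn, i32 %rm) {
  %ins = insertelement <4 x i32> undef, i32 %rm, i32 0
  %splat = shufflevector <4 x i32> %ins, <4 x i32> undef, <4 x i32> zeroinitializer
  %mul = mul <4 x i32> %qda, %qn
  %add = add <4 x i32> %mul, %splat
  ret <4 x i32> %add
}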
Differential Revision: https://reviews.llvm.org/D75115
When running under LTO, it is common to not specify the architecture
spec, which is used for setting up the target machine, and instead rely
on features specified in each function to generate the correct
instructions.
This works for the code generator, but the RISC-V backend uses the
AsmPrinter to do instruction compression, which does not see these
features but instead uses a MCSubtargetInfo object to see whether
compression is enabled. Since this is configured based on the
TargetMachine at startup, it will result in compressed instructions not
being emitted when it has not been given the 'c' TargetFeature, but the
function has it.
This changes the RISCVAsmPrinter to re-initialize the STI feature set
based on the current MachineFunction, such that compressed instructions
are now correctly emitted regardless of the method used to enable them.
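As an illustrative sketch of the situation (not taken from the patch's tests), under LTO the compression extension may only be visible as a per-function attribute rather than on the TargetMachine:
define void @uses_compression() #0 {
  ret void
}
attributes #0 = { "target-features"="+c" }
With this change, the AsmPrinter picks up "+c" from the function's features rather than from the startup-time subtarget.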
Differential revision: https://reviews.llvm.org/D73339
Add ELF relocations for the following fixups:
fixup_thumb_adr_pcrel_10 -> R_ARM_THM_PC8
fixup_thumb_cp -> R_ARM_THM_PC8
fixup_t2_adr_pcrel_12 -> R_ARM_THM_PREL_11_0
fixup_t2_ldst_pcrel_12 -> R_ARM_THM_PC12
While these relocations are short-ranged, there is support in the open
source ELF linkers in binutils and soon in LLD. MC will no longer
resolve pc-relative fixups to global symbols due to interpositioning
concerns. We can handle these at link time by implementing the relocations.
The R_ARM_THM_PC8 has some extra encoding rules for addends that llvm-mc
sidesteps by not supporting addends for these instructions, using the wide
Thumb 2 instruction if it is available. I think that this is a reasonable
compromise given that these are rare.
This partially reverts D72892; the Thumb fixups no longer need to be
evaluated at assembly time.
Differential Revision: https://reviews.llvm.org/D75039
Ensure that we're recording implicit defs, as well as visiting implicit
uses and implicit defs when we're walking through operands.
Differential Revision: https://reviews.llvm.org/D75185
Support the explicit wide assembler qualifier for the dmb/dsb/isb synchronization barrier instructions.
Differential revision: https://reviews.llvm.org/D75143
The code changes here are hopefully straightforward:
1. Use MachineInstruction flags to decide if FP ops can be reassociated
(use both "reassoc" and "nsz" to be consistent with IR transforms;
we probably don't need "nsz", but that's a safer interpretation of
the FMF).
2. Check that both nodes allow reassociation to change instructions.
This is a stronger requirement than we've usually implemented in
IR/DAG, but this is needed to solve the motivating bug (see below),
and it seems unlikely to impede optimization at this late stage.
3. Intersect/propagate MachineIR flags to enable further reassociation
in MachineCombiner.
We managed to make MachineCombiner flexible enough that no changes are
needed to that pass itself. So this patch should only affect x86
(assuming no other targets have implemented the hooks using MachineIR
flags yet).
The motivating example in PR43609 is another case of fast-math transforms
interacting badly with special FP ops created during lowering:
https://bugs.llvm.org/show_bug.cgi?id=43609
The special fadd ops used for converting int to FP assume that they will
not be altered, so those are created without FMF.
However, the MachineCombiner pass was being enabled for FP ops using the
global/function-level TargetOption for "UnsafeFPMath". We managed to run
instruction/node-level FMF all the way down to MachineIR sometime in the
last 1-2 years though, so we can do better now.
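As a minimal made-up example, reassociation is now keyed off per-instruction fast-math flags like these rather than a global unsafe-math option:
define float @reassoc_chain(float %a, float %b, float %c, float %d) {
  %t0 = fadd reassoc nsz float %a, %b
  %t1 = fadd reassoc nsz float %t0, %c
  %t2 = fadd reassoc nsz float %t1, %d
  ret float %t2
}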
The test diffs require some explanation:
1. llvm/test/CodeGen/X86/fmf-flags.ll - no target option for unsafe math was
specified here, so MachineCombiner kicks in where it did not previously;
to make it behave consistently, we need to specify a CPU schedule model,
so use the default model, and there are no code diffs.
2. llvm/test/CodeGen/X86/machine-combiner.ll - replace the target option for
unsafe math with the equivalent IR-level flags, and there are no code diffs;
we can't remove the NaN/nsz options because those are still used to drive
x86 fmin/fmax codegen (special SDAG opcodes).
3. llvm/test/CodeGen/X86/pow.ll - similar to #1
4. llvm/test/CodeGen/X86/sqrt-fastmath.ll - similar to #1, but MachineCombiner
does some reassociation of the estimate sequence ops; presumably these are
perf wins based on latency/throughput (and we get some reduction of move
instructions too); I'm not sure how it affects numerical accuracy, but the
test reflects reality better now because we would expect MachineCombiner to
be enabled if the IR was generated via something like "-ffast-math" with clang.
5. llvm/test/CodeGen/X86/vec_int_to_fp.ll - this is the test added to model PR43609;
the fadds are not reassociated now, so we should get the expected results.
6. llvm/test/CodeGen/X86/vector-reduce-fadd-fast.ll - similar to #1
7. llvm/test/CodeGen/X86/vector-reduce-fmul-fast.ll - similar to #1
Differential Revision: https://reviews.llvm.org/D74851
This tries to improve the accuracy of extract/insert element costs by accounting for subvector extraction/insertion for >128-bit vectors and the shuffling of elements to/from the 0'th index.
It also adds INSERTPS for f32 types and PINSR/PEXTR costs for integer types (at the moment we assume the same cost as MOVD/MOVQ - which isn't always true).
Differential Revision: https://reviews.llvm.org/D74976
This addresses the issues P8198 and P8199 (from D73534).
The methods were not handling bundles properly.
Differential Revision: https://reviews.llvm.org/D74904
Summary:
The following intrinsics are added:
* @llvm.aarch64.sve.ldff1.gather
* @llvm.aarch64.sve.ldff1.gather.index
* @llvm.aarch64.sve.ldff1.gather_sxtw
* @llvm.aarch64.sve.ldff1.gather.uxtw
* @llvm.aarch64.sve.ldff1.gather_sxtw.index
* @llvm.aarch64.sve.ldff1.gather.uxtw.index
* @llvm.aarch64.sve.ldff1.gather.scalar.offset
Although this patch is quite substantial, the vast majority of the
implementation is just a 'copy & paste' of the implementation of regular
gather loads, including tests. There's only a handful of new
definitions:
* AArch64ISD nodes defined in AArch64ISelLowering.h (e.g. GLDFF1)
* Selection DAG types in AArch64SVEInstrInfo.td (e.g.
AArch64ldff1_gather)
* intrinsics in IntrinsicsAArch64.td (e.g. aarch64_sve_ldff1_gather)
* Pseudo instructions in SVEInstrFormats.td to work around the issue of
use-before-def for the FFR register.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D75128
Under fp16 we optimise the bitcast between a VMOVhr and a CopyToReg via
custom lowering. This rewrites that to be a DAG combine instead, which
helps produce better code in the cases where the bitcast is actaully
legal.
Differential Revision: https://reviews.llvm.org/D72753
MC currently does not emit these relocation types, and lld does not
handle them. Add FKF_Constant as a work-around of some ARM code after
D72197. Eventually we probably should implement these relocation types.
By Fangrui Song!
Differential revision: https://reviews.llvm.org/D72892
In some cases Clang does not perform merging of instructions AND and TST (aka
ANDS xzr).
Example:
tst x2, x1
and x3, x2, x1
to:
ands x3, x2, x1
This patch adds such merging during instruction selection: when AND is replaced
with an ANDS instruction in LowerSELECT_CC, all users of the AND should also be
changed to use this ANDS instruction.
Short discussion on mailing list:
http://llvm.1065342.n5.nabble.com/llvm-dev-ARM-Peephole-optimization-instructions-tst-add-tp133109.html
Patch by Pavel Kosov.
Differential Revision: https://reviews.llvm.org/D71701
Previously this code was called in two ways: either a FrameIndexSDNode
was passed in StackSlot, or a load node was passed in the argument
called StackSlot. This was determined by a dyn_cast to FrameIndexSDNode.
In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.
For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.
This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guess work and reverse engineering.
I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just clean up to make the code less surprising.
Since all types <32b on gpr end up being assigned gpr32 regclasses, we can end
up with PHIs here which try to select between a gpr32 and an fpr16. Ideally RBS
shouldn't be selecting heterogeneous regbanks for operands if possible, but we
still need to be able to deal with it here.
To fix this, if we have a gpr-bank operand < 32b in size and at least one other
operand is on the fpr bank, then we add cross-bank copies to homogenize the
operand banks. For simplicity the bank that we choose to settle on is whatever
bank the def operand has. For example:
%endbb:
%dst:gpr(s16) = G_PHI %in1:gpr(s16), %bb1, %in2:fpr(s16), %bb2
=>
%bb2:
...
%in2_copy:gpr(s16) = COPY %in2:fpr(s16)
...
%endbb:
%dst:gpr(s16) = G_PHI %in1:gpr(s16), %bb1, %in2_copy:gpr(s16), %bb2
Differential Revision: https://reviews.llvm.org/D75086
Summary: ParsingInlineAsm was a misleading name. These values are only set for MS-style inline assembly.
Reviewed By: rnk
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75198
This is a small pet peeve from me. This change makes sure the AVR backend uses
the correct private label prefix (.L) so that private labels are hidden in
avr-objdump.
Example code:
define i8 @foo(i1 %cond) {
br i1 %cond, label %then, label %else
then:
ret i8 3
else:
ret i8 5
}
When compiling this:
llc -march=avr -filetype=obj -o test.o test.ll
and then dumping it:
avr-objdump -d test.o
You would previously get an ugly temporary label:
00000000 <foo>:
0: 81 70 andi r24, 0x01 ; 1
2: 80 30 cpi r24, 0x00 ; 0
4: f9 f3 breq .-2 ; 0x4 <foo+0x4>
6: 83 e0 ldi r24, 0x03 ; 3
8: 08 95 ret
0000000a <LBB0_2>:
a: 85 e0 ldi r24, 0x05 ; 5
c: 08 95 ret
This patch fixes that, the output is now:
00000000 <foo>:
0: 81 70 andi r24, 0x01 ; 1
2: 80 30 cpi r24, 0x00 ; 0
4: 01 f0 breq .+0 ; 0x6 <foo+0x6>
6: 83 e0 ldi r24, 0x03 ; 3
8: 08 95 ret
a: 85 e0 ldi r24, 0x05 ; 5
c: 08 95 ret
Note that as you can see the breq operand is different. However it is
still the same after linking:
4: 11 f0 breq .+4
Differential Revision: https://reviews.llvm.org/D75124
Adjusting by 2 breaks DWARF output. With this fix, programs start to
compile and produce valid DWARF output.
Differential Revision: https://reviews.llvm.org/D74213
- Mark memory access for bool vectors as disallowed in target lowering.
This will prevent combining bitcasts of bool vectors with stores.
- Replace the actual bitcasting code with a faster version.
- Handle casting of v16i1 to i16.
Summary:
This commit adds the predicated MVE intrinsics for the same set of
unary operations that I added in their unpredicated forms in
* D74333 (vrint)
* D74334 (vrev)
* D74335 (vclz, vcls)
* D74336 (vmovl)
* D74337 (vmovn)
but since the predicated versions are a lot more similar to each
other, I've kept them all together in a single big patch. Everything
here is done in the standard way we've been doing other predicated
operations: an IR intrinsic called `@llvm.arm.mve.foo.predicated` and
some isel rules that match that alongside whatever they accept for the
unpredicated version of the same instruction.
In order to write the isel rules conveniently, I've refactored the
existing isel rules for the affected instructions into multiclasses
parametrised by a vector-type class, in the usual way. All those
refactorings are intended to leave the existing isel rules unchanged:
the only difference should be that new ones for the predicated
intrinsics are introduced.
The only tiny infrastructure change I needed in this commit was to
change the implementation of `IntrinsicMX` in `arm_mve_defs.td` so
that the records it defines are anonymous rather than named (and use
`NameOverride` to set the output intrinsic name), which allows me to
call it twice in two multiclasses with the same `NAME` without a
tablegen-time error.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75165
Allow all ExternalSymbolSDNodes on AIX, and rely on linker errors to find
symbols for which we don't have definitions in any library/compiler-rt.
Differential Revision: https://reviews.llvm.org/D75075
Summary:
The old code made some incorrect assumptions about the order in which
basic blocks are laid out in a function. This could lead to incorrect
early-exits, especially when kills occurred inside of loops.
The new approach is to check whether the point where the conditional
kill occurs dominates all reachable code. If that is the case, there
cannot be any other threads in the wave that are waiting to rejoin
at a later point in the CFG, i.e. if exec=0 at that point, then all
threads really are dead and we can exit the wave.
Make some other minor cleanups to the pass while we're at it.
v2: preserve the dominator tree
Reviewers: arsenm, cdevadas, foad, critson
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74908
Change-Id: Ia0d2b113ac944ad642d1c622b6da1b20aa1aabcc
Summary:
Implements the following intrinsics:
- @llvm.aarch64.sve.bdep.x
- @llvm.aarch64.sve.bext.x
- @llvm.aarch64.sve.bgrp.x
- @llvm.aarch64.sve.tbl2
- @llvm.aarch64.sve.tbx
The SelectTableSVE2 function in this patch is used to select the TBL2
intrinsic & ensures that the vector registers allocated are consecutive.
Reviewers: sdesmalen, andwar, dancgr, cameron.mcinally, efriedma, rengolin
Reviewed By: efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74912
Add getUniqueReachingMIDef to RDA which performs a global search for
a machine instruction that produces a unique definition of a given
register at a given point. Also add two helper functions
(getMIOperand) that wrap around this functionality to get the
incoming definition uses of a given instruction. These now replace
the uses of getReachingMIDef in ARMLowOverheadLoops. getReachingMIDef
has been renamed to getReachingLocalMIDef and has been made private
along with getInstFromId.
Differential Revision: https://reviews.llvm.org/D74605
Summary:
The patch D62993 (`[PowerPC] Emit scalar min/max instructions with unsafe fp math`)
modified the behaviour when `Subtarget.hasP9Vector() && (!HasNoInfs || !HasNoNaNs)`;
this modification was not intended.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D74701
Fixed an issue exposed by D74006.
In clang cc1as, MCContext::UseNamesOnTempLabels is true.
When parsing a .fnstart directive, FnStart gets redefined to a temporary symbol of a different name (.Ltmp0, .Ltmp1, ...).
MCContext::getELFSection() called by SwitchToEHSection() will create a different .ARM.exidx each time.
llvm-mc uses `Ctx.setUseNamesOnTempLabels(false);` and FnStart is unnamed.
MCContext::getELFSection() called by SwitchToEHSection() will reuse the same .ARM.exidx .
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75095
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.
This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've updated all in-tree targets to connect their chain through their lowering code.
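For reference, the node in question is the one produced for the rounding-mode query intrinsic; a minimal IR example (mine, not from the patch):
declare i32 @llvm.flt.rounds()
define i32 @current_rounding_mode() {
  %mode = call i32 @llvm.flt.rounds()
  ret i32 %mode
}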
Differential Revision: https://reviews.llvm.org/D75132
This reverts commit eee22ec3c3.
This is not the correct fix, the root cause seems to be a bug in the
stage1 host clang compiler. See https://reviews.llvm.org/D75091 for more
discussion.
Summary:
Removes patterns that were not doing useful work, changes the
default extract instructions to be the unsigned versions now that
they are enabled by default, fixes PR44988, and adds tests for
sext_inreg lowering.
Reviewers: aheejin
Reviewed By: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75005
Summary:
Future patches will make use of TTI to perform cost-model-driven `SCEVExpander::isHighCostExpansionHelper()`
This is a fully NFC patch to make things reviewable.
Reviewers: reames, mkazantsev, wmi, sanjoy
Reviewed By: mkazantsev
Subscribers: hiraditya, zzheng, javed.absar, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73704
Summary:
Implement the DWARF register mapping described in
llvm/docs/AMDGPUUsage.rst
This is currently limited to wave64 VGPRs/AGPRs.
This also includes some minor changes in AMDGPUInstPrinter,
AMDGPUMCTargetDesc, and AMDGPUAsmParser to make generating CFI assembly
text and ELF sections possible to ease testing, although complete CFI
support is not yet implemented.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74915
Summary:
This patch renames functions and TableGen classes for SVE gathers and
scatters. The original names implied that the corresponding
methods/classes are only suited for regular gathers/scatters (i.e. LD1
and ST1), which is not the case. Indeed, we will be re-using them for
non-temporal and first-faulting gathers/scatters in the forthcoming
patches. The new names also highlight the split into Vector-Scalar (VS)
and Scalar-Vector (SV) cases.
List of changes:
* `performLD1GatherCombine` and `performST1ScatterCombine` are renamed
as `performGatherLoadCombine` and `performScatterStoreCombine`,
respectively.
* Selection DAG types for scatters and gathers from
AArch64SVEInstrInfo.td are renamed. For example, `SDT_AArch64_GLD1` is
renamed as `SDT_AArch64_GATHER_SV`. SV stands for Scalar-Vector, as
opposed to Vector-Scalar (VS).
* The intrinsic classes from IntrinsicsAArch64.td are renamed. For
example, `AdvSIMD_GatherLoad_64bitOffset_Intrinsic` is renamed as
`AdvSIMD_GatherLoad_SV_64b_Offsets_Intrinsic`.
* Updated comments in `performGatherLoadCombine` and
`performScatterStoreCombine`.
Reviewers: sdesmalen, rengolin, efriedma
Reviewed By: sdesmalen
Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75035
Summary:
Implements the following intrinsics:
* llvm.aarch64.sve.convert.to.svbool
* llvm.aarch64.sve.convert.from.svbool
For converting the ACLE svbool_t type (<n x 16 x i1>) to and from the
other predicate types: <n x 8 x i1>, <n x 4 x i1> and <n x 2 x i1>.
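A minimal sketch of the expected use; the mangled intrinsic name below is my reading of the overload and should be treated as illustrative:
declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
define <vscale x 16 x i1> @to_svbool(<vscale x 4 x i1> %p) {
  %r = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %p)
  ret <vscale x 16 x i1> %r
}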
Reviewers: sdesmalen, kmclaughlin, efriedma, dancgr, rengolin
Reviewed By: sdesmalen, efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74471
Instead add it when we make the machine nodes during instruction
selections.
This makes this ISD node closer to ISD::MGATHER. Trying to see
if we can remove the X86 specific ones.
The current set of custom combines is only really useful after
legalization, so move them there. There is a lot of overlap in the
boilerplate here, but I think we do want a pretty different set of
combines before and after legalize. I think we will want a lot of
overlap between the post-legalize and a post-regbankselect combiner.
This is explicitly guaranteed in ARMARM. And it makes reasoning about
vectors easier: we can assume that if a vector operation is legal, the
corresponding scalar operation is also legal.
Differential Revision: https://reviews.llvm.org/D74993
Previously we emitted an fmadd and a fmadd+fneg and combined them with a shufflevector. But this doesn't follow the correct exception behavior for unselected elements so the backend can't merge them into the fmaddsub/fmsubadd instructions.
This patch restores the fmaddsub intrinsics so we don't have two arithmetic operations. We lose out on optimization opportunity in the non-strict FP case, but I don't think this is a big loss. If someone gives us a test case we can look into adding instcombine/dagcombine improvements. I'd rather not have the frontend do completely different things for strict and non-strict.
This still has problems because target specific intrinsics don't support strict semantics yet. We also still have all of the problems with masking. But we at least generate the right instruction in constrained mode now.
Differential Revision: https://reviews.llvm.org/D74268
GCC 9.2 seems to incorrectly issue a warning about an out-of-bounds
access. This situation should not happen in any way.
Differential Revision: https://reviews.llvm.org/D75071
Simply by implementing a few functions I was able to correctly
disassemble a much larger number of instructions.
Differential Revision: https://reviews.llvm.org/D74045
Not all operands are correctly disassembled at the moment. This means
that some machine instructions won't have all the necessary operands
set.
To avoid asserting, print an error instead until the necessary support
has been implemented.
Differential Revision: https://reviews.llvm.org/D73958
A number of multiplication instructions (muls, mulsu, fmul, fmuls,
fmulsu) had the wrong register class for an operand. This resulted in
the wrong register being used for the instruction.
Example:
target datalayout = "e-P1-p:16:8-i8:8-i16:8-i32:8-i64:8-f32:8-f64:8-n8-a:8"
target triple = "avr-atmel-none"
define i16 @sliceAppend(i16, i16, i16, i16, i16, i16) addrspace(1) {
%d = mul i16 %0, %5
ret i16 %d
}
The first instruction would be muls r24, r31 before this patch. The r31
should have been r15 if you look at the intermediate forms during
instruction selection / register allocation, but the generated
instruction uses r31. After this patch, an extra movw is inserted to get
%5 in range for muls.
To make sure this bug is fixed everywhere, I checked all instructions
and found that most multiplication instructions suffered from this bug,
which I have fixed with this patch. No other instructions appear to be
affected.
Differential Revision: https://reviews.llvm.org/D74281
Summary:
The function descriptor csect on AIX should be 4-byte aligned instead of 1-byte aligned.
Reviewer: daltenty
Differential Revision: https://reviews.llvm.org/D74974
I'm hoping to begin improving shuffle combining across different vector sizes, but before that we must ensure that all existing getTargetShuffleInputs calls bail if the inputs aren't the same size.
Extends the existing support for spilling and restoring the condition
register to the linkage area for 32-bit targets, and enables for AIX.
Differential Revision: https://reviews.llvm.org/D74349
D74976 will handle larger vector types, but since SLM doesn't support AVX+ then we will always be extracting from 128-bit vectors so don't need to scale the cost.
This adds infrastructure to print and parse MIR MachineOperand comments.
The motivation for the ARM backend is to print condition code names instead of
magic constants that are difficult to read (for human beings). For example,
instead of this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14, $noreg
t2Bcc %bb.4, 0, killed $cpsr
we now print this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14 /* CC::always */, $noreg
t2Bcc %bb.4, 0 /* CC:eq */, killed $cpsr
This shows that MachineOperand comments are enclosed between /* and */. In this
example, the EOR instruction is not conditionally executed (i.e. it is "always
executed"), which is encoded by the 14 immediate machine operand. Thus, now
this machine operand has /* CC::always */ as a comment. The 0 on the next
conditional branch instruction represents the equal condition code, thus now
this operand has /* CC:eq */ as a comment.
As it is a comment, the MI lexer/parser completely ignores it. The benefit is
that this keeps the change in the lexer extremely minimal and no target
specific parsing needs to be done. The changes on the MIPrinter side are also
minimal, as there is only one target hook that is used to create the machine
operand comments.
Differential Revision: https://reviews.llvm.org/D74306
Summary:
Implements the @llvm.aarch64.sve.dupq.lane intrinsic.
As specified in the ACLE, the behaviour of:
svdupq_lane_u64(data, index)
...is identical to:
svtbl(data, svadd_x(svptrue_b64(),
svand_x(svptrue_b64(), svindex_u64(0, 1), 1),
index * 2))
If the index is in the range [0,3], the operation is equivalent
to a single DUP (.q) instruction.
Reviewers: sdesmalen, c-rhodes, cameron.mcinally, efriedma, dancgr, rengolin
Reviewed By: sdesmalen
Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74734
Change the way that we remove the redundant iteration count code in
the presence of IT blocks. collectLocalKilledOperands has been
introduced to scan an instruction's operands, collecting the killed
instructions and then visiting them too. This is used to delete the
code in the preheader which calculates the iteration count. We also
track any IT blocks within the preheader and, if we remove all the
instructions from the IT block, we also remove the IT instruction.
isSafeToRemove is used to remove any redundant uses of the iteration
count within the loop body.
Differential Revision: https://reviews.llvm.org/D74975
Summary:
The type used to represent functional units in MC is
'unsigned', which is 32 bits wide. This is currently
not a problem in any upstream target as no one seems
to have hit the limit on this yet, but in our
downstream one, we need to define more than 32
functional units.
Increasing the size does not seem to cause a huge
size increase in the binary (an llc debug build went
from 1366497672 to 1366523984, a difference of 26k),
so perhaps it would be acceptable to have this patch
applied upstream as well.
Subscribers: hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71210
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming it's an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
The type profile we use for the isel patterns lied about how
many operands the gather/scatter node has to skip the index
and scale operands. This allowed us to expand the baseptr
operand into base, displacement, and segment and then merge
the index and scale with them in the final instruction during
isel. This is kind of a hack that relies on isel not checking the
number of operands at all.
This commit switches to custom isel where we can manage this
directly without relying on holes in the isel checking.
Leave the gather/scatter subclasses, but make them inherit from
MemIntrinsicSDNode and delete their constructor and destructor.
This way we can still have the getIndex, getMask, etc. convenience
functions.
In order to build the Linux kernel, the back chain must be supported with
packed-stack. The back chain is then stored topmost in the register save
area.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D74506
If a simplification occurs, the operand will be added to the worklist.
But since the demanded mask was based on N, we need to make sure
we revisit N in case there are more simplifications to be done.
Returning SDValue(N, 0), as we do, only tells DAG combine that
something changed, but that won't make it add anything to the
worklist.
Found while playing around with using VEXTRACT_STORE in more cases.
But I guess this doesn't affect any of our existing tests.
We can use MOVLPS which will load 64 bits, but we need a v4f32
result type. We already have isel patterns for this.
The code here is a little hacky. We can probably improve it with
more isel patterns.
This is similar to using movd which we do for sse2 targets.
I've added a DAG combine for VEXTRACT_STORE to use SimplifyDemandedVectorElts
to clean up some artifacts from type legalization.
Similar to what we do for other operations that use a subset of bits.
Allows us to remove a pattern that shrinks a load. Which was
incorrect if the load was volatile.
Summary:
We already sorted the blocks when fixing up a set of mutual
loop entries, however, there can be multiple sets of such
mutual loop entries, and the order we encounter them
should not be random, so sort them too.
Fixes https://bugs.llvm.org/show_bug.cgi?id=44982
Patch by Alon Zakai (kripken)
Reviewers: aheejin, sbc100, dschuff
Subscribers: mgrang, sunfish, hiraditya, jgravelle-google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74999
This reverts commit 977cd661cf.
It breaks OpenCL testing. OpenCL Runtime is using PT_LOAD information
to calculate memory for global variables. This commit should be relanded once
the OpenCL runtime stops relying on PT_LOAD information for calculating global
variable memory size.
Differential Revision: https://reviews.llvm.org/D74995
Add support for DestructiveBinaryComm DestructiveInstType, as well as the lowering code to expand the new Pseudos into the final movprfx+instruction pairs.
Differential Revision: https://reviews.llvm.org/D73711
This moves all the logic of converting LLVM Triples to
MachO::CPU_(SUB_)TYPE from the specific target (Target)AsmBackend to
more convenient functions in lib/BinaryFormat.
This also gets rid of the separate two X86AsmBackend classes.
The previous attempt was to add it to libObject, but that adds an
unnecessary dependency to libObject from all the targets.
Differential Revision: https://reviews.llvm.org/D74808
isPrefix was added to support the patches to align branches.
It relies on a switch over instruction names.
This moves those opcodes to a new format so the information is
in tablegen and we can just check for a specific value in some
bits in TSFlags instead.
I've left the other function in place for now so that the
existing patches in phabricator will still work. I'll work with
the owner to get them migrated.
Summary:
Added register + immediate and register + register addressing modes for the following intrinsics:
1. Masked load and stores:
* Sign and zero extended load and truncated stores.
* No extension or truncation.
2. Masked non-temporal load and store.
Reviewers: andwar, efriedma
Subscribers: cameron.mcinally, sdesmalen, tschuett, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74254
The legalizer helper functions are unusably awkward to perform the 3-5
part legalization. This needs to be widened, scalarized, lowered, and
we should avoid creating vector extends and truncates. Manually do all
of this and expand.
I tried to use some of the new tablegen features to avoid creating
different operand list permutations, but I still don't see a way to
programmatically build a source pattern dag.
Also add GlobalISel tests, which now all import successfully.
Some of the fneg fold tests are incorrect, which need to be fixed in a
future commit
G_SHUFFLE_VECTOR is legal since it theoretically may help match op_sel
for VOP3P instructions. Expand it in some other way in case it doesn't
fold into the use instructions.
A cost query for a vector instruction should return a cost even without
target vector support, and not trigger an assert.
VectorCombine issues such queries for vectors that appear directly in the source code.
Review: Ulrich Weigand
We don't use this, and matching from the def doesn't make much sense.
There are multiple tablegen bugs with default operand
handling. undef_tied_input should work to handle the vdst_in
correctly, but this breaks the operand register class constraint which
it should be able to infer.
We should try the generated matchers before the manual selection. This
means the patterns are now handling the common cases, but the manual
selection code is not yet dead. It's still handling the non-s32/s64
cases (like v2s16 and v2s32). Currently tablegen doesn't have a nice
way to have a single pattern that covers multiple types.
We have patterns for s_pack* selection, but they assume the inputs are
a build_vector with 16-bit inputs, not a truncating build
vector. Since there's still outstanding work for how to handle
mismatched result and source element vector operations, and since I'm
trying a different packed vector strategy than SelectionDAG, just
manually select this for now.
There are a few differences from the DAG handling. First, the DAG
handling uses a primitive selection pattern instead of custom
legalizing it. Because of this, this makes use of source modifiers
while the DAG does not.
Also instead of promoting f16, try to use the f16 log/exp. There's no
f16 fmul_legacy, so widen just for the multiply, although I'm not sure
that's the best solution.
This looked through copies to find the source modifiers, which may
have been SGPR->VGPR copies added to avoid potential constant bus
violations. Re-insert a copy to a VGPR if this happens.
Remove some cumbersome Darwin specific logic for updating the frame
offsets of the condition-register spill slots. The containing function has an
early return if the subtarget is not ELF based which makes the Darwin logic
dead.
The (overloaded) intrinsic is llvm.hexagon.V6.pred.typecast[.128B]. The
types of the operand and the return value are HVX boolean vector types.
For each cast, there needs to be a corresponding intrinsic declared,
with different suffixes appended to the name, e.g.
; cast <128 x i1> to <32 x i1>
declare <32 x i1> @llvm.hexagon.V6.pred.typecast.128B.s1(<128 x i1>)
; cast <32 x i1> to <64 x i1>
declare <64 x i1> @llvm.hexagon.V6.pred.typecast.128B.s2(<32 x i1>)
etc.
At this point in the code we know that Op1 or Op2 is
all ones. Y points to the other operand. In the case that
Op2 is zero, Op1 must be all ones and Y is Op2. The OR
ORs Y into Res. But if Y is 0 the OR will be folded away by
getNode so we don't need to check for it.
The combineSelect code was casting to i64 without any check that
i64 was legal. This can break after type legalization.
It also required splitting the mmx register on 32-bit targets.
It's not clear that this makes sense. Instead switch to using
a cmov pseudo like we do for XMM/YMM/ZMM.
VK1 was being used as the output of the copy to regclass, but it
should be VK2/VK4. Shouldn't matter in practice though since
VK1/VK2/VK4/VK8/VK16 are all identical and just have different VTs.
The code at https://reviews.llvm.org/D74808 has broken builds that are
configured with -DBUILD_SHARED_LIBS=On.
This patch adds the correct library dependencies.
The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)
There are other potential improvements visible like allowing scalarization or vector
narrowing.
Differential Revision: https://reviews.llvm.org/D74909
This moves all the logic of converting LLVM Triples to
MachO::CPU_(SUB_)TYPE from the specific target (Target)AsmBackend to
more convenient functions in libObject.
This also gets rid of the separate two X86AsmBackend classes.
Differential Revision: https://reviews.llvm.org/D74808
There's a lot of old leftover code in LowerBRCOND, especially
the detection of AND or OR of X86ISD::SETCC nodes. Those were
needed before LegalizeDAG was changed to visit nodes before
their operands.
It also relied on reversing the output of LowerSETCC to find the
flags producing node to use for the X86ISD::BRCOND node.
Rather than using LowerSETCC this patch uses emitFlagsForSetcc to
handle the integer ISD::SETCC case. This gives the flag producer
and the comparison code to use directly. I've removed the addTest
flag and just produce a X86ISD::BRCOND and return immediately.
Floating point ISD::SETCC case is just an X86ISD::FCMP with special
care for OEQ and UNE derived from the previous code. I've left
f128 out so it will emit a test. And LowerSETCC will be called
later to produce a libcall and X86ISD::SETCC. We have combines
that can merge the test and X86ISD::SETCC.
We need to handle two cases for overflow ops. Either they are used
directly or they have a seteq 0 or setne 1 to invert the overflow.
The old code did not handle the setne 1 case, but I think some
other combines were making up for it.
If we fail to find a condition, we'll wrap an AND with 1 on the
original condition and tell emitFlagsForSetcc to emit a compare
with 0. This will pick up the LowerAndToBT and/or the EmitTest case.
I kept the isTruncWithZeroHighBitsInput call, but we might be able
to fold that in to emitFlagsForSetcc.
Differential Revision: https://reviews.llvm.org/D74750
Only handle power of 2 element counts for simplicity. Not sure what to do with vXf64->vXf16 fp_round to avoid double rounding.
Differential Revision: https://reviews.llvm.org/D74886
Marking a section as ALLOC tells the ELF loader to load the section into memory.
As we do not want to load the notes into VRAM, the flag should not be there.
Differential Revision: https://reviews.llvm.org/D74600
GetDemandedBits mostly just calls SimplifyMultipleUseDemandedBits now, but it does a very blunt constant simplification that SimplifyMultipleUseDemandedBits avoids.
If we need to demand bits from constants we should handle this through ShrinkDemandedConstant/targetShrinkDemandedConstant.
@arsenm confirmed that the sign extended immediates are better for code size.
Differential Revision: https://reviews.llvm.org/D74857
Summary:
This patch adds two families of ACLE intrinsics: vqdmullbq and
vqdmulltq (including vector-vector and vector-scalar variants) and the
corresponding LLVM IR intrinsics llvm.arm.mve.vqdmull and
llvm.arm.mve.vqdmull.predicated.
Reviewers: simon_tatham, MarkMurrayARM, dmgreen, ostannard
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74845
Summary:
The instruction at `DefI` can sometimes be destroyed by
`rematerializeCheapDef`, so it should not be used after calling that
function. The fix is to use `Insert` instead when examining additional
multivalue stackifications. `Insert` is the address of the new
defining instruction after all moves and rematerializations have taken
place.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74875
We probably want this, and I've meant to turn this on for a long
time. SC actually emits a special case to early-out for a 1
denominator, which perhaps should also be considered.
It uses VGPR_32.RegTypes, which includes 16-bit types. As a
result DS_WRITE_B32 may be generated for "store i16", which
is a bug. The only reason we do not hit it now is relative
pattern complexity and sorting; should the DS_WRITE_B16 pattern's
complexity become higher, the bug would appear.
Differential Revision: https://reviews.llvm.org/D74868
This commit removes the artificial types <512 x i1> and <1024 x i1>
from HVX intrinsics, and makes v512i1 and v1024i1 no longer legal on
Hexagon.
It may cause existing bitcode files to become invalid.
* Converting between vector predicates and vector registers must be
done explicitly via vandvrt/vandqrt instructions (their intrinsics),
i.e. (for 64-byte mode):
%Q = call <64 x i1> @llvm.hexagon.V6.vandvrt(<16 x i32> %V, i32 -1)
%V = call <16 x i32> @llvm.hexagon.V6.vandqrt(<64 x i1> %Q, i32 -1)
The conversion intrinsics are:
declare <64 x i1> @llvm.hexagon.V6.vandvrt(<16 x i32>, i32)
declare <128 x i1> @llvm.hexagon.V6.vandvrt.128B(<32 x i32>, i32)
declare <16 x i32> @llvm.hexagon.V6.vandqrt(<64 x i1>, i32)
declare <32 x i32> @llvm.hexagon.V6.vandqrt.128B(<128 x i1>, i32)
They are all pure.
* Vector predicate values cannot be loaded/stored directly. This directly
reflects the architecture restriction. Loading and storing of vector
predicates must be done indirectly via vector registers and explicit
conversions via vandvrt/vandqrt instructions.
We only need to split after type legalization. If we're before
we can just use a wide store and type legalization will split it.
Add a v128i1 test to exercise it post type legalization.
Summary:
Some predicated MVE intrinsics return a vector with element size
different from the input vector element size. In this case the
predicate type must correspond to the output vector type.
The following intrinsics use the incorrect predicate type:
* llvm.arm.mve.mull.int.predicated
* llvm.arm.mve.mull.poly.predicated
* llvm.arm.mve.vshll.imm.predicated
This patch fixes the issue.
Reviewers: simon_tatham, dmgreen, ostannard, MarkMurrayARM
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74838
On PowerPC we will soon need to use pcrel to indicate PC Relative addressing.
Renamed the Hexagon specific variant kind to a non target specific VK so that
it can be used on both Hexagon and PowerPC.
Differential Revision: https://reviews.llvm.org/D74788
Check that no Q-regs are live out of the loop, unless the instruction
within the loop is predicated on the vctp.
Differential Revision: https://reviews.llvm.org/D72713
Similar to VADDV and VADDLV that have been added recently, this adds
lowering and patterns for VMLAV, VMLAVA, VMLALV and VMLALVA. They
perform the same roles as the add's, just folding a mul into the same
instruction (and so taking two inputs). As such, they need to be lowered
in the same way as the types are often not legal.
Differential Revision: https://reviews.llvm.org/D74390
This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits.
Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively.
We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer.
Differential Revision: https://reviews.llvm.org/D74786
Following on from the extra VADDV lowering, this extends things to
handle VADDLV which allows summing values into a pair of i32 registers,
together treated as a i64. This needs to be done in DAGCombine too as
the types are otherwise illegal, which is a fairly simple addition on
top of the existing code.
There is also a VADDLVA instruction handled here, that adds the incoming
values from the two general purpose registers. As opposed to the
non-long version where we could just add patterns for add(x, VADDV), the
long version needs to handle this early before the i64 has being split
into too many pieces.
Differential Revision: https://reviews.llvm.org/D74224
Custom legalize non-power-of-2 and unaligned load and store for MIPS32r5
and older, custom legalize non-power-of-2 load and store for MIPS32r6.
Don't attempt to combine non power of 2 loads or unaligned loads when
subtarget doesn't support them (MIPS32r5 and older).
Differential Revision: https://reviews.llvm.org/D74625
Improve legality checks for load and store, 4 byte scalar
load and store are now legal for all subtargets.
During regbank selection 4 byte unaligned loads and stores
for MIPS32r5 and older get mapped to gprb.
Select 4 byte unaligned loads and stores for MIPS32r5.
Fix tests that unintentionally had unaligned load or store.
Differential Revision: https://reviews.llvm.org/D74624
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.
This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.
Reviewers: nikic, RKSimon, lebedev.ri, spatel
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D74722
We already make use of the VADDV vector reduction instruction for cases
where the input and the output start out at the same type. The MVE
instruction however will sum into an i32, so if we are summing a v16i8
into an i32, we can still use the same instructions. In terms of IR,
this looks like a sext of a legal type (v16i8) into a very illegal type
(v16i32) and a vecreduce.add of that into the result. This means we have
to catch the pattern early in a DAG combine, producing a target VADDVs/u
node, where the signedness is now important.
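Roughly, the IR the combine looks for is the following sketch (the reduction intrinsic is shown with its current name; at the time it was the experimental.vector.reduce form):
  ; reduction intrinsic shown with its current name (was experimental.vector.reduce at the time)
  %ext = sext <16 x i8> %x to <16 x i32>
  %sum = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %ext)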
This is the first part, handling VADDV and VADDVA. There are also
VADDVL/VADDVLA instructions, which are interesting because they sum into
a 64bit value. And VMLAV and VMLALV, which are interesting because they
also do a multiply of two values. It may look a little odd in places as
a result.
On its own this will probably not do very much, as the vectorizer will
not produce this IR yet.
Differential Revision: https://reviews.llvm.org/D74218
Consider large operands in G_MERGE_VALUES and G_UNMERGE_VALUES as
Ambiguous during regbank selection.
Introducing new InstType AmbiguousWithMergeOrUnmerge which will
allow us to recognize whether to narrow scalar or use s64:fprb.
This change exposed a bug when reusing data from TypeInfoForMF.
Thus, when an Instr is about to get destroyed (during narrow scalar),
clear its data in TypeInfoForMF: internal data is keyed on the
Instr's address, which will no longer be valid.
Add detailed asserts for InstType and operand size.
Generate generic instructions instead of MIPS target instructions
during argument lowering and custom legalizer.
Select G_UNMERGE_VALUES and G_MERGE_VALUES when proper banks are
selected: {s32:gprb, s32:gprb, s64:fprb} for G_UNMERGE_VALUES and
{s64:fprb, s32:gprb, s32:gprb} for G_MERGE_VALUES.
Update tests. One improvement: when a floating point argument in a
gpr (or two gprs) gets passed to another function through gprs,
unnecessary fpr-to-gpr moves are no longer generated.
Differential Revision: https://reviews.llvm.org/D74623
LowerSELECT will detect the constant inputs and convert to scalar
selects, but we can do it directly here.
I might remove some of the code from LowerSELECT and move it to
DAG combine so doing this explicitly will make us less dependent
on it happening in lowering.
Summary:
Extends the multivalue call infrastructure to tail calls, removes all
legacy calls specialized for particular result types, and removes the
CallIndirectFixup pass, since all indirect call arguments are now
fixed up directly in the post-insertion hook.
In order to keep supporting pretty-printed defs and uses in test
expectations, MCInstLower now inserts an immediate containing the
number of defs for each call and call_indirect. The InstPrinter is
updated to query this immediate if it is present and determine which
MCOperands are defs and uses accordingly.
Depends on D72902.
Reviewers: aheejin
Subscribers: dschuff, mgorny, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74192
Summary:
There is still room for improvement in the handling of multivalue
nodes in both passes, but the current algorithm is at least correct
and optimizes some simpler cases. In order to make future
optimizations of these passes easier and build confidence that the
current algorithms are correct, this CL also adds a script that
automatically and exhaustively generates interesting multivalue test
cases.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72902
Essentially, fold OrderedBasicBlock into BasicBlock, and make it
auto-invalidate the instruction ordering when new instructions are
added. Notably, we don't need to invalidate it when removing
instructions, which is helpful when a pass mostly deletes dead
instructions rather than transforming them.
The downside is that Instruction grows from 56 bytes to 64 bytes. The
resulting LLVM code is substantially simpler and automatically handles
invalidation, which makes me think that this is the right speed and size
tradeoff.
The important change is in SymbolTableTraitsImpl.h, where the numbering
is invalidated. Everything else should be straightforward.
We probably want to implement a fancier re-numbering scheme so that
local updates don't invalidate the ordering, but I plan for that to be
future work, maybe for someone else.
Reviewed By: lattner, vsk, fhahn, dexonsmith
Differential Revision: https://reviews.llvm.org/D51664
Summary:
Unlike normal calls, call_indirects have immediate arguments that
caused a MachineVerifier failure without a small tweak to loosen the
verifier's requirements for variadicOpsAreDefs instructions.
One nice thing about the new call_indirects is that they do not need
to participate in the PCALL_INDIRECT mechanism because their post-isel
hook handles moving the function pointer argument and adding the flags
and typeindex arguments itself.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74191
This reverts commit 649aba93a2, now that
the approach started there has been shown to be workable in the patch
series culminating in https://reviews.llvm.org/D74192.
Summary:
This patch adds a new MVE intrinsics family, `vbrsrq`: vector bit
reverse and shift right. The intrinsics are compiled into the VBRSR
instruction. Two new LLVM IR intrinsics were also added: arm.mve.vbrsr
and arm.mve.vbrsr.predicated.
Reviewers: simon_tatham, dmgreen, ostannard, MarkMurrayARM
Reviewed By: simon_tatham
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74721
Create preprocessor defines for callee saved floating-point register spill
slots, vector register spill slots, and both 32-bit and 64-bit general
purpose register spill slots. This is an NFC refactor to prepare for
adding ABI compliant callee saves and restores for AIX.
Implement TargetLowering callback mayBeEmittedAsTailCall for riscv in CodeGenPrepare,
which will duplicate return instructions to enable tailcall optimization.
Differential Revision: https://reviews.llvm.org/D73699
Summary:
Making `Scale` a `TypeSize` in AArch64InstrInfo::getMemOpInfo,
has the effect that all places where this information is used
(notably, TargetInstrInfo::getMemOperandWithOffset) will need
to consider Scale (and, derived from it, Offset) possibly being scalable.
This patch adds a new operand `bool &OffsetIsScalable` to
TargetInstrInfo::getMemOperandWithOffset and fixes up all
the places where this function is used, to consider the
offset possibly being scalable.
In most cases, this means bailing out because the algorithm does not
(or cannot) support scalable offsets in places where it does some
form of alias checking for example.
Reviewers: rovka, efriedma, kristof.beyls
Reviewed By: efriedma
Subscribers: wuzish, kerbowa, MatzeB, arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, javed.absar, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72758
Summary:
Codegen and tests for thread-local storage.
This implements only the general dynamic model due to limitations in nld 2.26.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D74718
This patch upstreams support for the AArch64 Armv8-A cpu Cortex-A34.
In detail adding support for:
- mcpu option in clang
- AArch64 Target Features in clang
- llvm AArch64 TargetParser definitions
details of the cpu can be found here:
https://developer.arm.com/ip-products/processors/cortex-a/cortex-a34
Reviewers: SjoerdMeijer
Reviewed By: SjoerdMeijer
Subscribers: SjoerdMeijer, kristof.beyls, hiraditya, cfe-commits,
llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74483
Change-Id: Ida101fc544ca183a0a0e61a1277c8957855fde0b
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
Summary:
These are in some sense the inverse of vmovl[bt]q: they take a vector
of n wide elements and truncate each to half its width. So they only
write half a vector's worth of output data, and therefore they also
take an 'inactive' parameter to provide the other half of the data in
the output vector. So vmovnb overwrites the even lanes of 'inactive'
with the narrowed values from the main input, and vmovnt overwrites
the odd lanes.
LLVM had existing codegen which generates these MVE instructions in
response to IR that takes two vectors of wide elements, or two vectors
of narrow ones. But in this case, we have one vector of each. So my
clang codegen strategy is to narrow the input vector of wide elements
by simply reinterpreting it as the output type, and then we have two
narrow vectors and can represent the operation as a vector shuffle
that interleaves lanes from both of them.
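As a rough little-endian sketch of that strategy for vmovnb (the reinterpret and the exact lane order are assumptions for illustration only):
  ; lane order is an assumption for little-endian vmovnb
  %narrow = bitcast <4 x i32> %wide to <8 x i16>
  %out = shufflevector <8 x i16> %inactive, <8 x i16> %narrow, <8 x i32> <i32 8, i32 1, i32 10, i32 3, i32 12, i32 5, i32 14, i32 7>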
Even so, not all the cases I needed ended up being selected as a
single MVE instruction, so I've added a couple more patterns that spot
combinations of the 'MVEvmovn' and 'ARMvrev32' SDNodes which can be
generated as a VMOVN instruction with operands swapped.
This commit adds the unpredicated forms only.
Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74337
Summary:
These intrinsics take a vector of 2n elements, and return a vector of
n wider elements obtained by sign- or zero-extending every other
element of the input vector. They're represented in IR as a
shufflevector that extracts the odd or even elements of the input,
followed by a sext or zext.
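A sketch of that IR pattern, for a signed widening of the odd-numbered lanes (the element types are picked just for illustration):
  ; widen the odd-numbered i16 lanes to i32 (types illustrative)
  %odd  = shufflevector <8 x i16> %x, <8 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %wide = sext <4 x i16> %odd to <4 x i32>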
Existing LLVM codegen already matches this pattern and generates the
VMOVLB instruction (which widens the even-index input lanes). But no
existing isel rule was generating VMOVLT, so I've added some. However,
the new rules currently only work in little-endian MVE, because the
pattern they expect from isel lowering includes a bitconvert which
doesn't have the right semantics in big-endian.
The output of one existing codegen test is improved by those new
rules.
This commit adds the unpredicated forms only.
Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74336
Summary:
When we start putting instances of `ARMVectorRegCast` in complex isel
patterns, it will be awkward that they're often turned into the more
standard `bitconvert` in little-endian mode. We'd rather not have to
write separate isel patterns for the two endiannesses, matching
different but equivalent cast operations.
This change aims to fix that awkwardness in advance, by turning the
Tablegen record `ARMVectorRegCast` from a simple `SDNode` instance
into a `PatFrags` that can match either kind of cast – with a
predicate that prevents it matching a bitconvert in the big-endian
case, where bitconvert isn't semantically identical.
No existing code generation should be affected by this change, but it
will enable the patterns introduced by D74336 to work in both
endiannesses.
Reviewers: dmgreen
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74716
Summary:
vclzq maps nicely to the existing target-independent @llvm.ctlz IR
intrinsic. But vclsq ('count leading sign bits') has no corresponding
target-independent intrinsic, so I've made up @llvm.arm.mve.vcls.
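For example, vclzq on 32-bit lanes can be expressed with the existing generic intrinsic (element type illustrative):
  ; vclzq on 32-bit lanes, expressed with the generic ctlz intrinsic
  %r = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %x, i1 false)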
This commit adds the unpredicated forms only.
Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74335
Summary:
This adds the unpredicated forms of six different MVE intrinsics which
all round a vector of floating-point numbers to integer values,
leaving them still in FP format, differing only in rounding mode and
exception settings.
Five of them map to existing target-independent intrinsics in LLVM IR,
such as @llvm.trunc and @llvm.rint. The sixth, mapping to the `vrintn`
instruction, is done by inventing a target-specific intrinsic.
(`vrintn` behaves the same as `vrintx` in terms of the output value:
the side effects on the FPSCR flags are the only difference between
the two. But ACLE specifies separate user-callable intrinsics for the
two, so the side effects matter enough to make sure we generate the
right one of the two instructions in each case.)
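For instance, the trunc-flavoured rounding simply becomes a call to the generic intrinsic (element type illustrative):
  ; the trunc-flavoured rounding, element type illustrative
  %r = call <4 x float> @llvm.trunc.v4f32(<4 x float> %x)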
Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74333
This helps this transform occur earlier so we can fold the not
with setcc. If we delay it until after type legalization we might
have introduced instructions to widen the mask if the vselect was
widened. This can prevent the not from making it to the setcc.
We could of course add more DAG combines to handle that, but
moving this earlier is easier.
AMDGPUCodeGenPrepare expands this most of the time, but not always. We
will always at least need a fallback option here. This is the 3rd
implementation of the same expansion in the backend. Eventually I
would like to eliminate the IR expansion (and the DAG version
obviously).
Currently the new legalizer path produces a better result, since the
IR expansion results in extra operations which need to be combined
out. Notably, the IR expansion results in multiplies by 0.
mutateStrictFPToFP can delete the node and replace it with another with the same
value which can later cause problems, and returning the result of
mutateStrictFPToFP doesn't work because SelectionDAGLegalize expects that the
returned value has the same number of results as the original. Instead handle
things by doing the mutation manually.
Differential Revision: https://reviews.llvm.org/D74726
Summary:
This patch adds vector-scalar variants to the following families of
MVE intrinsics:
* vaddq
* vsubq
* vmulq
* vqaddq
* vqsubq
* vhaddq
* vhsubq
* vqdmulhq
* vqrdmulhq
The vector-scalar variants perform a splat operation on the scalar
operand and then perform the same operations as their vector-vector
counterparts. Code generation is done accordingly (using LLVM IR 'insert'
and 'shuffle' operations which are later converted into an ARMvdup
SDNode).
Reviewers: simon_tatham, dmgreen, MarkMurrayARM, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74620
D73835 will make IRBuilder no longer trivially copyable. This patch
deletes the copy constructor in advance, to separate out the breakage.
Currently, the IRBuilder copy constructor is usually used by accident,
not by intention. In rG7c362b25d7a9 I've fixed a number of cases where
functions accepted IRBuilder rather than IRBuilder &, thus performing
an unnecessary copy. In rG5f7b92b1b4d6 I've fixed cases where an
IRBuilder was copied, while an InsertPointGuard should have been used
instead.
The only non-trivial use of the copy constructor is the
getIRBForDbgInsertion() helper, for which I separated construction and
setting of the insertion point in this patch.
Differential Revision: https://reviews.llvm.org/D74693
The way fallback to SelectionDAG works is somewhat surprising to
me. When the fallback path is enabled, the entire set of SelectionDAG
selector passes is added to the pass pipeline, and each one needs to
check if the function was selected. This results in the surprising
behavior of running SIFixSGPRCopies for example, but only if
-global-isel-abort=2 is used.
SIAddIMGInitPass is also added in addInstSelector, but I'm not sure
why we have this pass or if it should be added somewhere else for
GlobalISel.
Produce an unmerge to a narrower type and introduce a narrower shift
if needed. I wasn't sure if there was a better way to parameterize the
target's preferred shift type for the GICombineRule, so manually call
the combine helper.
Summary:
This patch adds assembly-level support for a new Arm M-profile
architecture extension, Custom Datapath Extension (CDE).
A brief description of the extension is available at
https://developer.arm.com/architectures/instruction-sets/custom-instructions
The latest specification for CDE is currently a beta release and is
available at
https://static.docs.arm.com/ddi0607/aa/DDI0607A_a_armv8m_arm_supplement_cde.pdf
CDE allows chip vendors to add custom CPU instructions. The CDE
instructions re-use the same encoding space as existing coprocessor
instructions (such as MRC, MCR, CDP etc.). Each coprocessor in range
cp0-cp7 can be configured as either general purpose (GCP) or custom
datapath (CDEv1). This configuration is defined by the CPU vendor and
is provided to LLVM using 8 subtarget features: cdecp0 ... cdecp7.
The semantics of CDE instructions are implementation-defined, but the
instructions are guaranteed to be pure (that is, they are stateless,
they do not access memory or any registers except their explicit
inputs/outputs).
CDE requires the CPU to support at least Armv8.0-M mainline
architecture. CDE includes 3 sets of instructions:
* Instructions that operate on general purpose registers and NZCV
flags
* Instructions that operate on the S or D register file (require
either FP or MVE extension)
* Instructions that operate on the Q register file, require MVE
The user-facing names that can be specified on the command line are
the same as the 8 subtarget feature names. For example:
$ clang -target arm-none-none-eabi -march=armv8m.main+cdecp0+cdecp3
tells the compiler that the coprocessors 0 and 3 are configured as
CDEv1 and the remaining coprocessors are configured as GCP (which is
the default).
Reviewers: simon_tatham, ostannard, dmgreen, eli.friedman
Reviewed By: simon_tatham
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D74044
While looking at the output on real sized programs, there is a lot of
extra SGPR spilling compared to the DAG path. This seems to largely be
from all constants being SGPRs in the entry block.
Summary:
This patch implements the part of the calling convention
where SVE Vectors are passed by reference. This means the
caller must allocate stack space for these objects and
pass the address to the callee.
Reviewers: efriedma, rovka, cameron.mcinally, c-rhodes, rengolin
Reviewed By: efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71216
Try to handle arbitrary scalar BFEs by packing the operands. The DAG
gives up on non-constant arguments. We're still missing any constant
folding, so we end up with pretty ugly code most of the time. Also
handle the 64-bit scalar case, which the DAG doesn't try to do.
Summary:
Implements the @llvm.aarch64.sve.index intrinsic, which
takes a scalar base and step value.
This patch also adds the printSImm function to AArch64InstPrinter
to ensure that immediates of type i8 & i16 are printed correctly.
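A minimal IR sketch of a call to the intrinsic (the type mangling is illustrative):
  ; type mangling illustrative
  %r = call <vscale x 4 x i32> @llvm.aarch64.sve.index.nxv4i32(i32 %base, i32 %step)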
Reviewers: sdesmalen, andwar, efriedma, dancgr, cameron.mcinally, rengolin
Reviewed By: cameron.mcinally
Subscribers: tatyana-krasnukha, tschuett, kristof.beyls, hiraditya, rkruppe, arphaman, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74550
We have the InstAlias rules for 32-bit rotate but missing the 64-bit one.
  Rotate left immediate:  rotlwi ra,rs,n   ->  rlwinm ra,rs,n,0,31
  Rotate left:            rotlw ra,rs,rb   ->  rlwnm ra,rs,rb,0,31
Differential Revision: https://reviews.llvm.org/D72676
On PowerPC, set instruction count as the first priority of LSR by default.
Add an option ppc-lsr-no-insns-cost to return to the default LSR cost model.
Reviewed By: steven.zhang, jsji
Differential Revision: https://reviews.llvm.org/D72683
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.
This removes the need to pass Chain and IsSignaling all the way
down.
Summary:
Many directives are unavailable, and support for others may be limited.
This first draft has preliminary support for:
- conditional directives (including errors),
- data allocation (unsigned types up to 8 bytes, and ALIGN),
- equates/variables (numeric and text),
- and procedure directives (without parameters),
as well as COMMENT, ECHO, INCLUDE, INCLUDELIB, PUBLIC, and EXTERN. Text variables (aka text macros) are expanded in-place wherever the identifier occurs.
We deliberately ignore all ml.exe processor directives.
Prominent features not yet supported:
- structs
- macros (both procedures and functions)
- procedures (with specified parameters)
- substitution & expansion operators
Conditional directives are complicated by the fact that "ifdef rax" is a valid way to check if a file is being assembled for a 64-bit x86 processor; we add support for "ifdef <register>" in general, which requires adding a tryParseRegister method to all MCTargetAsmParsers. (Some targets require backtracking in the non-register case.)
Reviewers: rnk, thakis
Reviewed By: thakis
Subscribers: kerbowa, merge_guards_bot, wuzish, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, mgorny, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72680
The unseen logic diff occurs because MayFoldLoad() is defined like this:
  static bool MayFoldLoad(SDValue Op) {
    return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
  }
The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.
Differential Revision: https://reviews.llvm.org/D74088
This allows it to work properly with masked inc/dec for avx512. Those
would have a vselect as the root node so didn't get a chance to call
combineIncDecVector.
This also simplifies the logic because we don't have to manage
the topological ordering.
CallPreservedMask is used to describe the register liveness after a
function call. A function call in an interrupt handler should use the same
CallPreservedMask as normal functions, so that only callee-saved registers
can live through the function call.
The division expansions in AMDGPUCodeGenPrepare can't be relied on for
correctness, since they punt to later optimization and possibly
legalization in some cases. We still need a way to be able to write
tests for the legalizer versions of the expansion. This is mostly for
GlobalISel, since the optimizations it is expecting aren't
implemented.
The interaction with the flag to expand 64-bit division in the IR is
pretty confusing, but these flags have different purposes.
I didn't realize we were already expanding 24/32-bit division
here. Use the available IntegerDivision utilities. This uses loops,
so produces significantly smaller code than the inline DAG expansion.
This now requires width reductions of 64-bit divisions before
introducing the expanded loops.
This helps work around missing legalization in GlobalISel for
division, which are the only remaining core instructions that didn't
work at all.
I think this is plausibly a better implementation than exists in the
DAG, although turning it on by default misses out on the constant
value optimizations and also needs benchmarking.
This is more or less directly ported from the AMDGPU custom lowering
for FP_TO_FP16. I made a few minor fixups (using G_UNMERGE_VALUES
instead of creating shift/trunc to extract the two halves, and zexting
an inverted compare instead of select_cc).
This also does not include the fast math expansion the DAG which
converts to f32 and then to f16. I think that belongs in a
pre-legalize combine instead.
Like COPY instructions explained in D70616, we don't check the constraints
when combining G_UNMERGE_VALUES. Use the same logic used in D70616 to check
if registers can be replaced, or a COPY instruction needs to be built.
https://reviews.llvm.org/D70564
Assembler now permits pairs like 'v0:1', which are encoded
differently from the odd-first pairs like 'v1:0'.
The compiler will require more work to leverage these new register
pairs.
This patch added generation of SIMD bitwise insert BIT/BIF instructions.
In the absence of GCC-like functionality for optimal constraint satisfaction
during register allocation, the bitwise insert and select patterns are matched
by a pseudo bitwise select BSP instruction whose def is not tied.
It is expanded later, after register allocation, with the def tied,
to BSL/BIT/BIF depending on the operand registers.
This allows redundant moves to be eliminated.
Reviewers: t.p.northover, samparker, dmgreen
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D74147
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
Summary: Support for PIC with tests for global variables and function calls.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D74536
Summary:
Also make return calls terminator instructions so epilogues are
inserted before them rather than after them. Together, these changes
make WebAssembly's tail call optimization more stack-safe.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73943
If we widen the compare we might trigger a spurious exception from
the garbage data.
We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.
I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.
Differential Revision: https://reviews.llvm.org/D74522
Summary:
Fixes a crash in the backend where optimizations produce calls to the
cbrt runtime functions. Fixes PR 44227.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74259
This allows it to recognize more loads as being consecutive when the loads' addresses are complex at the start.
Differential Revision: https://reviews.llvm.org/D74444
Also greatly improve i64 lowering. LegalizeIntegerTypes does the
correct narrowing if i64 isn't legal. Just workaround this for
SelectionDAG by making i64 legal and splitting in the patterns.
If the target has FP64 but not FP16 then we have custom lowering for FP_EXTEND
and STRICT_FP_EXTEND with type f64. However if the extend is from f32 to f64 the
current implementation will cause an infinite loop for STRICT_FP_EXTEND due to
emitting a merge_values of the original node which after replacement becomes a
merge_values of itself.
Fix this by not doing anything for f32 to f64 extend when we have FP64, though
for STRICT_FP_EXTEND we have to do the strict-to-nonstrict mutation as that
doesn't happen automatically for opcodes with custom lowering.
Differential Revision: https://reviews.llvm.org/D74559
Skip the loop over the CalleeSavedInfos in 'restoreCalleeSavedRegisters' when
the register is a CR field and we are not targeting 32-bit ELF. This is safe
because:
1) The helper function 'restoreCRs' returns if the target is not 32-bit ELF,
making all the code in the loop related to CR fields dead for every other
subtarget. This code is only called on ELF right now, but the patch
to extend it for AIX also needs to skip 'restoreCRs'.
2) The loop will not otherwise modify the iterator, so the iterator
manipulations at the bottom of the loop end up setting 'I' to its
current value.
This simplification allows us to remove one argument from 'restoreCRs'.
Also add a helper function to determine if a register is one of the
callee saved condition register fields.
Exploit the native VSX rounding instruction, x(v|s)r(d|s)pic, which does
rounding using the current rounding mode.
According to the C standard library, rint may raise the INEXACT exception
while nearbyint won't.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D72685
In some cases a BTI landing pad was inserted even though a compatible
instruction was already there. Meta instructions do not count in this case,
therefore skip them when checking the first instructions in the function.
Differential revision: https://reviews.llvm.org/D74492
Simon pointed out that this function is doing a bitcast, which can be
incorrect for big endian. That makes the lowering of VMOVN in MVE
wrong, but the function is shared between Neon and MVE so both can
be incorrect.
This attempts to fix things by using the newly added VECTOR_REG_CAST
instead of the BITCAST. As it may now be used on Neon, I've added the
relevant patterns for it there too. I've also added a quick dag combine
for it to remove them where possible.
Differential Revision: https://reviews.llvm.org/D74485
Currently, BPF does not support dynamic stack allocation.
For a program like below:
  extern void bar(int *);
  void foo(int n) {
    int a[n];
    bar(a);
  }
The current error message looks like:
unimplemented operand
UNREACHABLE executed at /.../llvm/lib/Target/BPF/BPFISelLowering.cpp:199!
Let us make error message explicit so it will be clear to the user
what is the problem. With this patch, the error message looks like:
fatal error: error in backend: Unsupported dynamic stack allocation
...
Differential Revision: https://reviews.llvm.org/D74521
When SI_IF is inserted, it constrains the source register with a
register class, which was quite likely a G_ICMP. This was incorrectly
treating it as a scalar, and then applyMappingImpl would end up
producing invalid MIR since this was unexpected.
Also fix not using all VGPR sources for vcc outputs.
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.
I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.
NFC.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74482
When we have to widen to a 64-bit register, we have to emit a SUBREG_TO_REG.
Add a general-purpose widening helper which emits the correct SUBREG_TO_REG
instruction based off of a desired size and add a testcase.
Also remove some asserts which are technically incorrect in `emitTestBit`.
- p0 doesn't count as a scalar type, so we need to check `!Ty.isVector()`
instead
- Whenever we have a s1, the Size/Bit checks are too conservative, so just
remove them
Replace these asserts with less conservative ones where applicable.
Differential Revision: https://reviews.llvm.org/D74427
This has a really interesting side effect in that it improves some UMAX/UMIN reduction code which had redundant XOR(SHUFFLE(XOR(X,SIGNMASK)),SIGNMASK) patterns - the getNegatibleCost recognises it as FNEG(SHUFFLE(FNEG(X))).... We have a lot of FNEG patterns bitcasted to the integer domain for XOR signbit twiddling which is similar to what we do to allow UMAX/UMIN to be lowered using SMAX/SMIN.
Differential Revision: https://reviews.llvm.org/D74231
An option is added for PowerPC to disable use of non-volatile CR
register fields and avoid CR spilling in the prologue.
Differential Revision: https://reviews.llvm.org/D69835
Added support for the intrinsics llvm.ppc.dcbfl and llvm.ppc.dcbflp.
These will be used for emitting the cache control instructions dcbfl and
dcbflp, which are actually mnemonics for the dcbf instruction with different
immediate arguments.
  dcbfl ra, rb  -> dcbf ra, rb, 1
  dcbflp ra, rb -> dcbf ra, rb, 3
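A call at the IR level would look roughly like the line below; the pointer-typed signature is an assumption based on the existing llvm.ppc.dcbf intrinsic:
  ; pointer-typed signature is an assumption
  call void @llvm.ppc.dcbfl(i8* %addr)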
Differential Revision: https://reviews.llvm.org/D68411
Load extra bits if suitably aligned. This allows using widened
3-vector loads on SI, and fixes legalization for <9 x s32> (which LSV
apparently forms frequently on lowered kernel argument lists).
Fix incorrectly treating these as legal on SI. This should emit a
64-bit store and a 32-bit store.
I think all of the load and store rules are just about complete, but
due for a rewrite.
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.
This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.
It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect what's going on.
Differential Revision: https://reviews.llvm.org/D74221
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
Summary:
Consider:
%r = call i32 @llvm.amdgcn.writelane(i32 0, i32 1, i32 2)
This produces a value that is 0 on lane 1, and 2 everywhere else; i.e.,
it is divergent.
Reported-by: Marek Olsak <Marek.Olsak@amd.com>
Reviewers: arsenm, foad, mareko
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74400
These should require AVX512VL not AVX512F. The legacy VEX patterns
will match first unless AVX512VL is enabled so this doesn't cause
a functional issue.
This adds a strict version of FP16_TO_FP and FP_TO_FP16 and uses
them to implement soft promotion for the half type. This is
enough to provide basic support for __fp16 with strictfp.
Add the necessary X86 support to use VCVTPS2PH/VCVTPH2PS when F16C
is enabled.
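For context, a strict half-to-float extension at the IR level uses the constrained intrinsics, roughly as below; the exact mangled name is an assumption:
  ; mangled name is an assumption
  %e = call float @llvm.experimental.constrained.fpext.f32.f16(half %h, metadata !"fpexcept.strict")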
This was creating a select on true/false values, and then comparing
that later. This produced more work for later combines, which can be
avoided by just using the boolean values. This was copied from the
original DAG expansion, which also has the same problem. This doesn't
have an observable change using SelectionDAG, but since GlobalISel is
missing these optimizations, the final code was noticeably longer.
These have nicer expansions implemented in the DAG. Ideally we would
either directly implement all of these special expansions, or stop
expanding division in the IR.
This is apparently worse than 1-byte alignment. This does not attempt
to decompose 2-byte aligned wide stores, but will stop trying to
produce them.
Also fix bug in LoadStoreVectorizer which was decreasing the alignment
and vectorizing stack accesses. It was assuming a stack object was an
alloca that could have its base alignment changed, which is not true
if the pointer is derived from a function argument.
Since natural fdiv lowering is now more conservative even with
denormals disabled, we get a slower expansion from just a plain
1.0/fdiv. Directly emit the rcp intrinsic when using it to implement
integer division to avoid a pointlessly complex sequence.
We aren't doing a good job of optimizing AVX512 outside of this code. So remove the bail out for AVX512 and replace with a FIXME. This at least gets us the AVX2 codegen.
Differential Revision: https://reviews.llvm.org/D74431
This patch adds the support required for using the __riscv_save and
__riscv_restore libcalls to implement a size-optimization for prologue
and epilogue code, whereby the spill and restore code of callee-saved
registers is implemented by common functions to reduce code duplication.
Logic is also included to ensure that if both this optimization and
shrink wrapping are enabled then the prologue and epilogue code can be
safely inserted into the basic blocks chosen by shrink wrapping.
Differential Revision: https://reviews.llvm.org/D62686
Summary:
SIInstrInfo::expandPostRAPseudo converts ENTER_WWM in-place into an
S_OR_SAVEEXEC instruction that needs certain implicit operands. Without
this patch I get errors like this that make it harder to use -stop-after
to bisect the pass pipeline:
$ llc -march=amdgcn test/CodeGen/AMDGPU/wqm.ll -stop-after=postrapseudos -o - | sed -E 's/ (from|into) custom "TargetCustom[0-9]+"//' | llc -march=amdgcn -x=mir
error: <stdin>:1295:70: missing implicit register operand 'implicit-def $scc'
renamable $sgpr2_sgpr3 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec
^
Note that this error is currently only generated by MIParser but it
comes with a FIXME comment:
// FIXME: Move the implicit operand verification to the machine verifier.
Reviewers: critson, arsenm, rampitec, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74428
Based on uops.info these should have 5 cycle latency as they did on Haswell/Broadwell. I have no additional internal information from Intel.
This was also shown as a discrepancy in the spreadsheet that was sent with an early llvm-dev post about llvm-exegesis.
It also matches Agner Fog.
Differential Revision: https://reviews.llvm.org/D74357
Currently, isTruncateFree() and isZExtFree() callbacks return false
as they are not implemented in BPF backend. This may cause suboptimal
code generation. For example, if the load in the context of zero extension
has more than one use, the pattern zextload{i8,i16,i32} will
not be generated. Rather, the load will be matched first and
then the result is zero extended.
For example, in the test together with this commit, we have
I1: %0 = load i32, i32* %data_end1, align 4, !tbaa !2
I2: %conv = zext i32 %0 to i64
...
I3: %2 = load i32, i32* %data, align 4, !tbaa !7
I4: %conv2 = zext i32 %2 to i64
...
I5: %4 = trunc i64 %sub.ptr.lhs.cast to i32
I6: %conv13 = sub i32 %4, %2
...
The I1 and I2 will match to one zextloadi32 DAG node, where SUBREG_TO_REG is
used to convert a 32bit register to 64bit one. During code generation,
SUBREG_TO_REG is a noop.
The %2 in I3 is used in both I4 and I6. If isTruncateFree() is false,
the current implementation will generate a SLL_ri and SRL_ri
for the zext part during lowering.
This patch implements isTruncateFree() in the BPF backend, so for the
above example, I3 and I4 will generate a zextloadi32 DAG node, and a
SUBREG_TO_REG is generated during lowering to Machine IR.
isZExtFree() is also implemented as it should help code gen as well.
This patch also enables the change in https://reviews.llvm.org/D73985,
since it will no longer kick in and generate a MOV_32_64 machine instruction.
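To make the reasoning concrete, here is a standalone illustration (not LLVM code) of the identity the backend relies on: truncating a 64-bit value to 32 bits and zero-extending it back is just a mask of the low 32 bits, so no shift pair is needed when the hardware already zeroes the upper bits.
```
#include <cassert>
#include <cstdint>

int main() {
  uint64_t wide = 0x123456789abcdef0ULL;
  uint32_t truncated = static_cast<uint32_t>(wide);    // "free" truncate
  uint64_t rezext = static_cast<uint64_t>(truncated);  // SUBREG_TO_REG-style no-op
  assert(rezext == (wide & 0xffffffffULL));
  return 0;
}
```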
Differential Revision: https://reviews.llvm.org/D74101
Fix/workaround for https://bugs.llvm.org/show_bug.cgi?id=44539.
As discussed there, this pass makes some overly optimistic
assumptions, as it does not have access to actual branch weights.
This patch makes the computation of the depth of the optimized cmov
more conservative, by assuming a distribution of 75/25 rather than
50/50 and placing the weights to get the more conservative result
(larger depth). The fully conservative choice would be
std::max(TrueOpDepth, FalseOpDepth), but that would break at least
one existing test (which may or may not be an issue in practice).
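As a rough illustration of the heuristic described above (a hedged sketch with made-up names, not the actual X86 cmov-conversion code), the depth estimate weights the slower operand at 75% and the faster one at 25%, which lands between the old 50/50 average and the fully conservative max:
```
#include <algorithm>

// Hypothetical helper mirroring the description above.
unsigned estimateCmovDepth(unsigned TrueOpDepth, unsigned FalseOpDepth) {
  unsigned Slow = std::max(TrueOpDepth, FalseOpDepth);
  unsigned Fast = std::min(TrueOpDepth, FalseOpDepth);
  // 75/25 split biased toward the slower operand; previously (Slow + Fast) / 2.
  return (3 * Slow + Fast) / 4;
}
```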
Differential Revision: https://reviews.llvm.org/D74155
Summary:
Add a new method (tryParseRegister) that attempts to parse a register specification.
MASM allows the use of IFDEF <register>, as well as IFDEF <symbol>. To accommodate this, we make it possible to check whether a register specification can be parsed at the current location, without failing the entire parse if it can't.
Reviewers: thakis
Reviewed By: thakis
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73486
When more than one SelectPseudo instruction is handled, a new MBB is
returned. This must not be done if it would result in leaving an unhandled
isel pseudo behind in the original MBB.
Fixes https://bugs.llvm.org/show_bug.cgi?id=44849.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D74352
A small IR change in calculating the active lanes resulted in no longer
recognising tail-predication. Now recognise both an 'add' and 'or' in
the expression that calculates the active lanes.
Differential Revision: https://reviews.llvm.org/D74394
ADDI (C.ADDI) may achieve better code size than XORI, since XORI has no compressed (C extension) form.
This patch transforms two patterns and gets almost equivalent results.
Differential Revision: https://reviews.llvm.org/D71774
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.
New intrinsics are implemented for when we need to port SIMD code from other
architectures and only load or store portions of MSA registers.
The following intrinsics are added, which only load/store element 0 of a vector:
v4i32 __builtin_msa_ldrq_w (const void *, imm_n2048_2044);
v2i64 __builtin_msa_ldr_d (const void *, imm_n4096_4088);
void __builtin_msa_strq_w (v4i32, void *, imm_n2048_2044);
void __builtin_msa_str_d (v2i64, void *, imm_n4096_4088);
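A hedged usage sketch of the new builtins listed above (assumes a MIPS target with MSA enabled, e.g. -mmsa; the vector typedef and function names here are only illustrative):
```
// Only element 0 of the vector register is loaded/stored by these builtins.
typedef long long v2i64 __attribute__((vector_size(16)));

v2i64 load_elem0(const void *p) {
  return __builtin_msa_ldr_d(p, 0);  // immediate offset must be in range
}

void store_elem0(v2i64 v, void *p) {
  __builtin_msa_str_d(v, p, 0);
}
```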
Differential Revision: https://reviews.llvm.org/D73644
Summary:
As far as I know this did not affect code generation, but it did affect
the order of -debug-only=si-wqm output and the naming of autonamed
values in -print-after=si-wqm output.
Reviewers: arsenm, rampitec, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, mgrang, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74317
We need to use vector instructions for these operations. Previously
we handled this with isel patterns that used extra instructions
and copies to handle the conversions.
Now we use custom lowering to emit the conversions. This allows
them to be pattern matched and optimized on their own. For
example we can now emit vpextrw to store the result if it's going
directly to memory.
I've forced the upper elements of the VCVTPH2PS input to zero to keep some
code similar. Zeroes will be needed for strictfp. I've added a
DAG combine for (fp16_to_fp (fp_to_fp16 X)) to avoid extra
instructions in between to be closer to the previous codegen.
This is a step towards strictfp support for f16 conversions.
This patch:
- enable frame pointer for AIX;
- update some of red zone comments;
- add/update testcases;
Differential Revision: https://reviews.llvm.org/D72454
Summary:
This patch enables support for Mergeable2ByteCString and Mergeable4ByteCString.
Reviewers: daltenty
Subscribers: wuzish, nemanjai, hiraditya
Differential Revision: https://reviews.llvm.org/D74164
With this change, each function is compiled with the SystemZSubtarget
initialized from the function's attributes.
Review: Ulrich Weigand.
Differential Revision: https://reviews.llvm.org/D74086
Non-AVX512BW targets failed to concatenate 256-bit shifts back to 512-bits (split during 512-bit shuffle lowering as they don't have v32i16/v64i8 types).
These are generated and do not need to have the same values.
We are defining separate subregs for R600 and GCN but then
using AMDGPU subregs on R600.
Differential Revision: https://reviews.llvm.org/D74248
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD::ROTL nodes - ISD::ROTR isn't supported by XOP and the two are interchangeable for constant rotation amounts anyway.
There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
REAPPLIED rGe82e17d4d4ca after reversion at rG39eade73a567 - fixed offset matching in matchShuffleAsBitRotate.
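A standalone check of the equivalences relied on above: a constant rotate-left expands to OR(SHL, SRL), and a rotate-left by c is the same as a rotate-right by (width - c), which is why ROTL and ROTR are interchangeable for constant amounts.
```
#include <cassert>
#include <cstdint>

uint8_t rotl8(uint8_t x, unsigned c) { return uint8_t((x << c) | (x >> (8 - c))); }
uint8_t rotr8(uint8_t x, unsigned c) { return uint8_t((x >> c) | (x << (8 - c))); }

int main() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned c = 1; c < 8; ++c)
      assert(rotl8(uint8_t(x), c) == rotr8(uint8_t(x), 8 - c));
  return 0;
}
```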
This patch makes the following System Registers Read Only:
- CurrentEL
- ICH_MISR_EL2
- PMBIDR_EL1
- PMSIDR_EL1
as found in:
https://developer.arm.com/docs/ddi0595/e/aarch64-system-registers
Relative line numbers were also added to the tests so we get more
informative error messages on failure.
Change-Id: I963b4f01ca5737b58f9e8e7abe9ca1d99e328758
This change implements the llvm intrinsic llvm.read_register for
the SystemZ platform which returns the value of the specified
register
(http://llvm.org/docs/LangRef.html#llvm-read-register-and-llvm-write-register-intrinsics).
This implementation returns the value of the stack register, and
can be extended to return the value of other registers. The
implementation for this intrinsic exists on various other platforms
including Power, x86, ARM, etc. but missing on SystemZ.
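As a hedged source-level example of how this intrinsic is typically reached (assuming the frontend's named-register-variable support for the stack pointer applies on this target, and that r15 is the SystemZ stack pointer; this is illustrative, not part of the patch):
```
// Reads of a global named register variable are lowered to
// llvm.read_register("r15") when targeting s390x.
register unsigned long stack_pointer asm("r15");

unsigned long current_sp() { return stack_pointer; }
```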
Reviewers: uweigand
Differential Revision: https://reviews.llvm.org/D73378
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD::ROTL nodes - ISD::ROTR isn't supported by XOP and the two are interchangeable for constant rotation amounts anyway.
There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
---
Internal shuffle tests indicate there's a bug somewhere that I haven't been able to track down yet.
When specifying -march=arch[8|9|10], those CPU types do NOT support
the vector extension. In this case the vector ABI must be disabled.
The generated data layout should NOT contain 64-v128.
Reviewers: uweigand
Differential Revision: https://reviews.llvm.org/D74146
Use isCandidateForCallSiteEntry().
This should mostly be an NFC, but there are some parts ensuring that
moveCallSiteInfo() and copyCallSiteInfo() operate on call site
entry candidates (both Src and Dest should be call site entry
candidates).
Differential Revision: https://reviews.llvm.org/D74122
Based on D72931
This adds a new feature called A16 which is enabled for gfx10.
gfx9 keeps the R128A16 feature so it can share all the instruction encodings
with gfx7/8.
Differential Revision: https://reviews.llvm.org/D73956
Using sign extend forces the adjacent element to either all zeros
or all ones. But all ones is a NAN. So that doesn't seem like a
great idea.
Trying to work on supporting this with strict FP where NAN would
definitely be bad.
When the FP exists, the FP base CFI directive offset should take the size of variable arguments into account.
Differential Revision: https://reviews.llvm.org/D73862
Vector indexing with a constant index should be folded out in the
legalizer, but this was accidentally falling through. This would
produce the indexing operation with $noreg. Handle this case as a
dynamic index just in case a bug like this happens again in the
future.
We were failing to find constants that were casted. I feel like the
artifact combiner should have folded the constant in the trunc before
the custom lowering, but that doesn't happen.
Reverts part of 6524a7a2b9. Since that
commit, the expansion was ignoring the actual save exec register
produced by the instruction, and looking at other instructions. I do
not understand why it was looking at other instructions, but relying
on this scan was wrong.
Fixes verifier errors after SI_IF is tail duplicated, which should be
correct to do. The results were fed into a phi, which was lowered to
the S_MOV_B64_term instructions.
Fix issue mentioned on rGe82e17d4d4ca - non-AVX512BW targets failed to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
The flag isn't used, but I believe this matches the MOV32r0 that
would be created by the table emitter. This should allow this node
to be CSEed with any others created by the table.
As noted on PR44379, we didn't attempt to lower vector shuffles using bit rotations on XOP/AVX512F targets.
This patch lowers to uniform ISD::ROTL nodes - ISD::ROTR isn't supported by XOP and the two are interchangeable for constant rotation amounts anyway.
There might be cases where targets without ISD::ROTL support would benefit from this (expanding to SRL+SHL+OR), which I'll investigate in a future patch.
Also, non-AVX512BW targets fail to concatenate 256-bit rotations back to 512-bits (split during shuffle lowering as they don't have v32i16/v64i8 types).
A vselect+strictfp node is not equivalent to a masked operation.
The exceptions of the strictfp node are not masked by a vselect
after it so we can't match it to a masked operation.
We already had a hack in IsLegalToFold to prevent these patterns from
matching. This patch removes that hack and removes the patterns.
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This is a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling, and more portable testing.
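A hedged user-level illustration (assuming the clang driver flag mentioned in the later revert note, -fstack-clash-protection, and a 4 KiB page size): a frame larger than a page must be touched at least once per page so a guard page cannot be skipped over.
```
// With inline probing enabled, the x86 prologue for this illustrative
// function touches the stack at page-sized intervals while setting up
// the 64 KiB frame.
void big_frame() {
  volatile char buf[1 << 16];  // spans many 4 KiB pages
  buf[0] = 1;
}
```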
Differential Revision: https://reviews.llvm.org/D68720
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This is a recommit of 39f50da2a3 with proper LiveIn
declaration, better option handling, and more portable testing.
Differential Revision: https://reviews.llvm.org/D68720
Making sure not to use them with patterns for masked instructions.
Also fix FMA patterns that were matching strict_fma+x86selects to
masked instructions.
Remove code from LegalizeTypes that allowed this to work.
We were already using BUILD_PAIR for this in some places so this
standardizes on a single way to do this.
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This is a recommit of 39f50da2a3 with better option
handling and more portable testing.
Differential Revision: https://reviews.llvm.org/D68720
This hasn't been used for years, its original implementation, D35700, had bugs that caused the reversion of most of the code, and since then x86 shuffle lowering/combining has handled most cases and can deal with the rest as well.
Replace the lambda function
static auto InitializeRegisterBankOnce = [this](const auto &TRI) {
with
static auto InitializeRegisterBankOnce = [&]() {
Capture by reference instead of passing an argument, as there are related
buildbot compile errors when passing the argument.
Replace the lambda function
static auto InitializeRegisterBankOnce = [this](const auto &TRI) {
with
static auto InitializeRegisterBankOnce = [&]() {
Capture by reference instead of passing an argument, as there are related
buildbot compile errors when passing the argument.
On little endian targets prior to Power9, we spill vector registers using a
swapping store (i.e. stxvd2x saves the vector with the two doublewords in
big endian order regardless of endianness). This is generally not a problem
since we restore them using the corresponding swapping load (lxvd2x). However
if the restore is done by the unwinder, the vector register contains data in
the incorrect order.
This patch fixes that by using Altivec loads/stores for vector saves and
restores in PEI (which keep the order correct) under those specific conditions:
- EH aware function
- Subtarget requires swaps for VSX memops (Little Endian prior to Power9)
Differential revision: https://reviews.llvm.org/D73692
Summary:
The accuracy limit to use rcp is adjusted to 1.0 ulp from 2.5 ulp.
Also, afn instead of arcp is used to allow inaccurate rcp to be used.
Reviewers:
arsenm
Differential Revision: https://reviews.llvm.org/D73588
The issue in the previous commits was that we swapped the LHS and RHS while
looking for the constant. In SLT/SGT, the constant must be on the RHS, or the
optimization is invalid.
Move the swapping logic after the check for the SLT/SGT case and update tests.
Original commits:
d78cefb160, a373841407
Summary:
Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to O(N^3) compile time for SIShrinkInstructions.
But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
linear compile-time consumption.
Reviewers:
rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D74180
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses a function call before stack allocation, while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
This is a recommit of 39f50da2a3 with correct option
flags set.
Differential Revision: https://reviews.llvm.org/D68720
hasReservedSpillSlot returns a dummy frame index of '0' on PPC64 for the
non-volatile condition registers, which leads to the CalleeSavedInfo
either referencing an unrelated stack object, or an invalid object if
there are no stack objects. The latter case causes the mir-printer to
crash due to an assertion that checks whether the frame index referenced by a
CalleeSavedInfo is valid.
To fix the problem, create an immutable FixedStack object at the correct offset
in the linkage area of the previous stack frame (i.e. SP + positive offset).
Differential Revision: https://reviews.llvm.org/D73709
Previously we took the restored flag in a GPR, extended it to 32 or 64 bits, and then used it as an input to a subtract from 0. This requires creating a zero extend and materializing a 0.
This patch changes this to just use an ADD with 255 to restore the carry flag and keep the SETB_C32r/SETB_C64r, exactly like we handle SBB, which is what SETB becomes.
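A standalone check of the trick above: if the carry flag was saved into a byte as 0 or 1, adding 255 to that byte produces a carry-out exactly when the saved value was non-zero, so CF is restored without materializing a zero or widening the value.
```
#include <cassert>
#include <cstdint>

bool carryAfterAdd255(uint8_t savedCF) {
  unsigned sum = unsigned(savedCF) + 255u;
  return sum > 0xFF;  // hardware CF for an 8-bit ADD
}

int main() {
  assert(!carryAfterAdd255(0));
  assert(carryAfterAdd255(1));
  return 0;
}
```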
Differential Revision: https://reviews.llvm.org/D74152
Allows more flexible use of buildMerge in places where
use operands are available as SrcOp since it does not
require explicit conversion to Register.
Simplify code with new buildMerge.
Differential Revision: https://reviews.llvm.org/D74223
We were executing this in a waterfall loop as a placeholder, but this
should really be converted to a MUBUF load. Also execute in a
waterfall loop if the resource isn't an SGPR. This is a case where the
DAG handling was wrong because doing the right thing was too hard.
Currently, this will mishandle 96-bit loads. There's currently no way
to track the original memory size with an MMO, so these loads will be
widened and the resulting memory size will be 128 bits.
The type passed to lower was invalid, so I'm not sure how this was
even working before. The source and destination type also do not have
to match, so make sure to use the right ones.
The registers TRCEXTINSELR and TRCEXTINSELR0 are distinct registers,
defined by separate extension specifications (ETM and ETE,
respectively), yet they use the same encoding in MSR/MRS.
When performing a system register lookup by encoding, we would
essentially return a random one, depending on the number, relative
position in the TableGen file, whether the TableGen records for system
registers are named or not, and, if they are named, depending on
record (not register!) name as well.
This patch works around the issue by explicitly checking for the
TRCEXTINSELR/TRCEXTINSELR0 encoding and always returning TRCEXTINSELR.
Differential Revision: https://reviews.llvm.org/D74074
This reverts commit 39f50da2a3.
The -fstack-clash-protection is being passed to the linker too, which
is not intended.
Reverting and fixing that in a later commit.
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.
Reviewers: courbet
Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73964
Implement protection against the stack clash attack [0] through inline stack
probing.
Probe stack allocation every PAGE_SIZE during frame lowering or dynamic
allocation to make sure the page guard, if any, is touched when touching the
stack, in a similar manner to GCC[1].
This extends the existing `probe-stack' mechanism with a special value `inline-asm'.
Technically the former uses function call before stack allocation while this
patch provides inlined stack probes and chunk allocation.
Only implemented for x86.
[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
Differential Revision: https://reviews.llvm.org/D68720
We are using countPopulation on a LaneBitmask to determine
the number of registers it covers. This assumption does not
necessarily need to hold. It is not changed here, but it is
factored into a single call, SIRegisterInfo::getNumCoveredRegs().
Some other places are cleaned up with respect to assumptions
about subreg index values and tablegen behavior.
Differential Revision: https://reviews.llvm.org/D74177
Summary:
Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to O(N^3) compile time for SIShrinkInstructions.
But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
linear compile-time consumption.
Reviewers:
rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D74180
This reverts commit a373841407.
It looks like this broke set_shadow_test.c, so I'm reverting until I can fix it.
I also reverted the SGT change because it's probably also broken.
Update the lambda function argument "[this](const auto &TRI)" to
"[this](const TargetRegisterInfo &TRI)".
This looks like a bug in g++-6; there is no issue compiling with g++-9.
X86 uses i8 for shift amounts. This code can fail on a 32-bit target
if it runs after type legalization.
This code was copied from AArch64 and modified for X86, but the
shift amount wasn't changed to the correct type for X86.
Fixes PR44812
When we have a G_BRCOND fed by a sgt compare against -1, we can just emit a TBZ.
This is similar to the code in `AArch64TargetLowering::LowerBR_CC`.
Also while we're here, properly scope the commutative constant check in
`selectCompareBranch`, since it sometimes would call
`getConstantVRegValWithLookThrough` twice.
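A standalone check of the identity behind this fold: for a signed value, x > -1 holds exactly when the sign bit is clear, which is what a TBZ on the top bit tests.
```
#include <cassert>
#include <cstdint>

int main() {
  const int64_t vals[] = {INT64_MIN, -2, -1, 0, 1, INT64_MAX};
  for (int64_t x : vals) {
    bool sgtMinusOne = x > -1;
    bool signBitClear = ((static_cast<uint64_t>(x) >> 63) & 1) == 0;
    assert(sgtMinusOne == signBitClear);
  }
  return 0;
}
```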
Differential Revision: https://reviews.llvm.org/D74149
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.
Previously this was done by calling ConvertCmpIfNecessary in
multiple places which would emit the extra code for the FNSTSW,
a shift, a truncate, and a SAHF instructions. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.
This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.
Differential Revision: https://reviews.llvm.org/D73863
Only 32 and 64 bit SBB are dependency breaking instructions on some
CPUs. The 8 and 16 bit forms have to preserve upper bits of the GPR.
This patch removes the smaller forms and selects the wider form
instead. I had to do this with custom code as the tblgen generated
code glued the eflags copytoreg to the extract_subreg instead of
to the SETB pseudo.
Longer term I think we can remove X86ISD::SETCC_CARRY and use
(X86ISD::SBB zero, zero). We'll want to keep the pseudo and select
(X86ISD::SBB zero, zero) to either a MOV32r0+SBB for targets where
there is no dependency break and SETB_C32/SETB_C64 for targets
that have a dependency break. May want some way to avoid the MOV32r0
if the instruction that produced the carry flag happened to def a
register that we can use for the dependency.
I think the flag copy lowering should be using NEG instead of SUB to
handle SETB. That would avoid the MOV32r0 there. Or maybe it should
use a ADC with -1 to recreate the carry flag and keep the SETB?
That would avoid a MOVZX on the input of the SUB.
Differential Revision: https://reviews.llvm.org/D74024
When multiple instructions are moved into a waterfall loop, it's
possible some of them re-use the same operands. Avoid creating
multiple sequences of readfirstlanes for them. None of the current
uses will hit this, but will be used in a future patch.
This patch implements the caller side of placing function call arguments
in stack memory. This removes the current limitation where LLVM on AIX
will report fatal error when arguments can't be contained in registers.
There is a particular oddity that a float argument that passes in a
register and also in stack memory requires that the caller initialize
both. From what AIX "ABI" documentation I have it's not clear that this
needs to be done, however, it is necessary for compatibility with the
AIX XL compiler so I think it's best to implement it the same way.
Note a later patch will follow to address the callee side.
Differential Revision: https://reviews.llvm.org/D73209
When we have a G_ICMP which checks SLT, and the comparison is against 0, we
can emit a TBNZ instead of a CBZ.
This lets us fold in things into the branch, which can provide some code size
savings.
This is similar to the case in `AArch64TargetLowering::LowerBR_CC`.
https://reviews.llvm.org/D74090
Factor it out into `emitTestBit` and add some asserts to the new function.
This will be useful for implementing TB(N)Z emission for SLT/SGT compares.
Differential Revision: https://reviews.llvm.org/D74080
Add support for walking through G_LSHR in `getTestBitReg`. Equivalent to the
code in `getTestBitOperand` in AArch64ISelLowering.
```
(tbz (lshr x, c), b) -> (tbz x, b+c) when b + c is < # bits in x
```
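A standalone check of this fold (illustration only): bit b of (x >> c) is bit (b + c) of x, as long as b + c is still within the width of x.
```
#include <cassert>
#include <cstdint>

bool bit(uint64_t x, unsigned i) { return (x >> i) & 1; }

int main() {
  const uint64_t x = 0xdeadbeefcafef00dULL;
  for (unsigned c = 0; c < 64; ++c)
    for (unsigned b = 0; b + c < 64; ++b)
      assert(bit(x >> c, b) == bit(x, b + c));
  return 0;
}
```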
Differential Revision: https://reviews.llvm.org/D74077
The "{=v0}" constraint did not result in the expected error message in the
absence of the vector facility, because 'v0' matches as a string into the
AnyRegBitRegClass in common code.
This patch adds checks for vector support in case of "{v" and soft-float in
case of "{f" to remedy this.
Review: Ulrich Weigand.
The load ports need a cycle for each potentially loaded element just like Haswell and Skylake. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead the load uops run for multiple cycles.
I've taken the latency number from the uops.info. The port binding for the non-load uops is taken from the original IACA data I have.
Differential Revision: https://reviews.llvm.org/D74000
Really the intrinsic definition is wrong, but work around this
here. The DAG lowering introduces an MMO. We have to introduce a new
operation to avoid the verifier complaining about the missing mayLoad.
Use cmp ord instead of cmp_class compared to the DAG version for the
nan check, but mostly try to match the existing pattern.
I think the sign doesn't matter for fract, so we could do a little
better with the source modifier matching.
I think this is also still broken as in D22898, but I'm leaving it
as-is for now while I don't have an SI system to test on.
contractCrossBankCopyIntoStore() finds the instruction that defines the
source register and uses its output to replace the register. There are,
however, instructions that have multiple outputs, e.g. G_UNMERGE_VALUES.
Current implementation hardcodes to operand 0 and has no way of knowing
which output should be used.
This change adds another function that directly returns the source
register and uses that for folding.
This fixes https://bugs.llvm.org/show_bug.cgi?id=44783
Differential Revision: https://reviews.llvm.org/D74005
This implements walking over G_ASHR in the same way as `getTestBitOperand` in
AArch64ISelLowering.
```
(tbz (ashr x, c), b) -> (tbz x, b+c) or (tbz x, msb) if b+c is > # bits in x
```
Differential Revision: https://reviews.llvm.org/D73933
(1) The check needs to be on the 0th operand of whatever we're folding
(2) Checks for validity should happen before we change the bit
Fixes a bug which caused MultiSource/Applications/JM/lencod to fail at -O3.
Differential Revision: https://reviews.llvm.org/D74002
Rewrite the result register pair into the expected single register
format in the legalizer.
I'm also operating under the assumption that TFE doesn't apply to
stores or atomics, but don't know if this is true or not.
The mask results of these should be uniform. The trickier part is the
dummy booleans used as IR glue need to be treated as divergent. This
should make the divergence analysis results correct for the IR the DAG
is constructed from.
This should allow us to eliminate requiresUniformRegister, which has
an expensive, recursive scan over all users looking for control flow
intrinsics. This should avoid recent compile time regressions.
The 96-bit results need to be widened.
I find the interaction between LegalizerHelper and MIRBuilder somewhat
awkward. The custom legalization is called by the LegalizerHelper, but
then does not have access to the helper. You have to construct a new
helper, which then does not own the MachineIRBuilder, but does modify
it. Maybe custom legalization should be passed the helper?
The adjusted iterator range included the last instruction we just inserted,
which we don't want to process. Figure out the new iterator range before
inserting phis. This was a harmless problem, but added an unnecessary
complication for a future patch.
If we have s_pack_* instructions, legalize this to
G_BUILD_VECTOR_TRUNC from s32 elements. This is closer to how the
s_pack_* instructions really behave.
If we don't have s_pack_* instructions, expand this by creating a merge
to s32 and bitcasting. This expands to the expected bit operations. I
think this eventually should go in a new bitcast legalize action type
in LegalizerHelper.
We already directly emit the shift operations in RegBankSelect for the
vector case. This could possibly be cleaned up, but I also may want to
defer doing this expansion to selection anyway. I'll see about that
when I try to actually match VOP3P instructions.
This breaks the selection of the build_vector since tablegen doesn't
know how to match G_BUILD_VECTOR_TRUNC yet, so just xfail it for now.
Once we have created a tail-predicated hardware-loop, and thus know the number
of elements that are processed, we want to clean-up the iteration count
expression of that loop. In D73682, we bailed the analysis on conditionally
executed instructions. This adds support for IT-blocks, so that we can handle
these cases again. The restriction is that we only support IT blocks containing
1 statement, but that seems to cover most cases and forms of the iteration
count expression.
Differential Revision: https://reviews.llvm.org/D73947
Checking that the use-def chain that performs the loop count
isSafeToRemove is not sufficient, because it means that we can
remove register copies that we need in order to restore lr to its
correct value. This change now prevents the transform from kicking in
for the 'remove-elem-moves' test, which needs to be addressed later on.
Differential Revision: https://reviews.llvm.org/D74037
While validating each MVE instruction, check that all instructions
that touch memory are somehow predicated upon the VCTP.
Differential Revision: https://reviews.llvm.org/D73616
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers
Differential Revision: https://reviews.llvm.org/D73482
This should lower the amount of used registers for gfx9.
I updated some of the changed tests with the update script because
changing them by hand is tedious.
Differential Revision: https://reviews.llvm.org/D73884
Same for any_extend though we don't have coverage for that.
The test changes are because isel didn't check one use of the
setcc_carry. So in isel we would end up with two different
sized setcc_carry instructions. And since it clobbers
the flags we would need to recreate the flags for the second
instruction.
This code handles additional uses by truncating the new wide
setcc_carry back to the original size for those uses.
The old version might be faster on EG (RECIP_IEEE is Trans only),
but it'd need extra corner case checks.
This gives correct corner case behaviour and saves a register.
Fixes OCL CTS sqrt test (1-thread, scalar) on Turks.
Reviewer: arsenm
Differential Revision: https://reviews.llvm.org/D74017
Summary:
This reverts commit 3ef169e586. The
purpose of this commit was to allow stack machines to perform
instruction selection for instructions with variadic defs. However,
MachineInstrs fundamentally cannot support variadic defs right now, so
this change does not turn out to be useful.
Depends on D73927.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73928
This was incorrectly rounding up to the next power of 2. v4f32 was
rounding up to v8f32, which was just wrong. There are also v3i16/v3f16
available in MVT, so we don't even need to round the f16 cases
anymore. Additionally, this field is really an EVT so we don't even
need to consider this.
Also switch some asserts to return invalid. We should have an IR
verifier for these intrinsic return types, but for now it's better to
not assert on IR that passes the verifier.
This should also probably be fixed to consider that dmask is really
eliminating some of the loaded components.
Summary:
This reverts commit 28857d14a8. This
commit worked toward a solution that did not turn out to be feasible
because MachineInstrs cannot contain an arbitrary number of defs.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73927
The compiler may transform the following code
ctx = ctx + reloc_offset
... (*(u32 *)ctx) & 0x8000 ...
to
ctx = ctx + reloc_offset
... (*(u8 *)(ctx + 1)) & 0x80 ...
where reloc_offset will be replaced with a constant during
AsmPrinter phase.
The above transformed code will be rejected by the kernel verifier,
as it does not allow the
*(type *)((ctx + non_zero_offset1) + non_zero_offset2)
style access pattern.
It is hard at the SelectionDAG phase to identify whether a load
is related to the context or not. Sometimes, interprocedural analysis
may be needed. So let us simply prevent such an optimization
from happening.
Differential Revision: https://reviews.llvm.org/D73997
Summary:
Moves a batch of instructions from unimplemented-simd128 to simd128
because they have recently become available in V8.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73926
lrint/llrint are defined as rounding using the current rounding
mode. Numbers that can't be converted raise FE_INVALID and an
implementation defined value is returned. They may also write to
errno.
I believe this means we can use cvtss2si/cvtsd2si or fist to
convert as long as -fno-math-errno is passed on the command line.
Clang will leave them as libcalls if errno is enabled so they
won't become ISD::LRINT/LLRINT in SelectionDAG.
For 64-bit results on a 32-bit target we can't use cvtss2si/cvtsd2si
but we can use fist since it can write to a 64-bit memory location.
Though maybe we could consider using vcvtps2qq/vcvtpd2qq on avx512dq
targets?
gcc also does this optimization.
I think we might be able to do this with STRICT_LRINT/LLRINT as
well, but I've left that for future work.
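A hedged source-level example of the pattern this affects (assuming -fno-math-errno so the calls become ISD::LRINT/ISD::LLRINT rather than libcalls):
```
#include <cmath>

long round_to_long(double x) {
  return std::lrint(x);    // rounds using the current rounding mode
}

long long round_to_llong(float x) {
  return std::llrint(x);   // 64-bit result; may use fist on 32-bit x86
}
```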
Differential Revision: https://reviews.llvm.org/D73859
The CATCHPAD node mostly existed to be selected into the EH_RESTORE
instruction, which sets the frame back up when 32-bit Windows exceptions
return to the parent function. However, creating this MachineInstr early
increases the risk that other passes will come along and insert
instructions that use the stack before ESP and EBP are restored. That
happened in PR44697.
Instead of representing these in the instruction stream early, delay it
until PEI. Mark the blocks where this needs to happen as EHPads, but not
funclet entry blocks. Passes after PEI have to be careful not to hoist
instructions that can use stack across frame setup instructions, so this
should be relatively reliable.
Fixes PR44697
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D73752
https://reviews.llvm.org/D72312 introduced an infinite loop which involves
DAGCombiner::visitFMA and AMDGPUTargetLowering::performFNegCombine.
fma( a, fneg(b), fneg(c) ) => fneg( fma (a, b, c) ) => fma( a, fneg(b), fneg(c) ) ...
This only breaks with types where 'isFNegFree' returns false, e.g. v4f32.
Reproducing the issue also needs the attribute 'no-signed-zeros-fp-math',
and no source mods allowed on one of the users of the Op.
This fix makes changes to indicate that it is not free to negate a fma if it
has users with source mods.
Differential Revision: https://reviews.llvm.org/D73939
This time with correct types for the data result from the SUB.
Original commit message:
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.
This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
The usage of the Imm out argument from SelectSMRDOffset is pretty
confusing. Stop trying to reject CI immediates in the case where the
offset field can be used. It's not an illegal way to encode the
immediate, so just prefer the better encoding pattern with
AddedComplexity.
We probably don't even really need the different opcodes for the
different offset types anymore, but that will be more work to cleanup.
The SMRD non-buffer load patterns could also use a cleanup to be done
separately.
AMDGPU and x86 at least both have separate controls for whether
denormal results are flushed on output, and for whether denormals are
implicitly treated as 0 as an input. The current DAGCombiner use only
really cares about the input treatment of denormals.
Similar to D73680 (AArch64 BTI).
A local linkage function whose address is not taken does not need ENDBR32/ENDBR64. Placing the patch label after ENDBR32/ENDBR64 has the advantage that code does not need to differentiate whether the function has an initial ENDBR.
Also, add 32-bit tests and test that .cfi_startproc is at the function
entry. The line information has a general implementation and is tested
by AArch64/patchable-function-entry-empty.mir
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D73760
Linux commit
1cf5b23988 (diff-289313b9fec99c6f0acfea19d9cfd949)
uses "#pragma clang attribute push (__attribute__((preserve_access_index)),
apply_to = record)"
to apply CO-RE relocations to all records including the following pattern:
#pragma clang attribute push (__attribute__((preserve_access_index)), apply_to = record)
typedef struct {
int a;
} __t;
#pragma clang attribute pop
int test(__t *arg) { return arg->a; }
The current approach to use struct/union type in the relocation record will
result in an anonymous struct, which makes later type matching difficult
in the bpf loader. In fact, the current BPF backend will fail the above program
with assertion:
clang: ../lib/Target/BPF/BPFAbstractMemberAccess.cpp:796: ...
Assertion `TypeName.size()' failed.
clang will change to use the type of the base of the member access
which will preserve the typedef modifier for the
preserve_{struct,union}_access_index intrinsics in the above example.
Here we adjust the BPF backend to accept that the debuginfo
type metadata may be 'typedef' and handle them properly.
Differential Revision: https://reviews.llvm.org/D73902
Summary:
After following Simon's suggestion about additional testing posted at
https://reviews.llvm.org/D73906, I found several more places that
need to be updated.
Reviewers: simon_tatham, dmgreen, ostannard, eli.friedman
Reviewed By: simon_tatham
Subscribers: merge_guards_bot, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73963
Summary:
This patch changes the underlying type of the ARM::ArchExtKind
enumeration to uint64_t and adjusts the related code.
The goal of the patch is to prepare the code base for a new
architecture extension.
Reviewers: simon_tatham, eli.friedman, ostannard, dmgreen
Reviewed By: dmgreen
Subscribers: merge_guards_bot, kristof.beyls, hiraditya, cfe-commits, llvm-commits, pbarrio
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73906
Under MVE, we do not have any lowering for fminimum, which a
vector_reduce_fmin without NoNan will be expanded into. As with the
other recent patches, force this to expand in the pre-isel pass. Note
that Neon lowering would be OK because the scalar fminimum uses the
vector VMIN instruction, but it is probably better to just rely on the
scalar operations, which is what is done here.
Also fixes what appears to be the reversal of INF vs -INF in the
vector_reduce_fmin widening code.
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating to. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.
I believe this code is a leftover from when i1 was a legal type.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
We were creating two with different operand orders, and then only
using one of them.
Instead just swap the operands when needed and create a single node.
Broadwell was missing half the gather instructions. Both models
had some mixups in the resource costs and number of uops.
I've updated here based on what I think the original IACA source
says with some cross checking against the microcode.
I'm not sure about latency as the IACA source I have doesn't have
that information. So I'm using the latency from uops.info.
I plan to update Skylake models as well, but I'll do that in a
separate patch.
Differential Revision: https://reviews.llvm.org/D73844
This ports the existing case for G_XOR from `getTestBitOperand` in
AArch64ISelLowering into GlobalISel.
The idea is to flip between TBZ and TBNZ while walking through G_XORs.
Let's say we have
```
tbz (xor x, c), b
```
Let's say the `b`-th bit in `c` is 1. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 0.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 1.
So, then
```
tbz (xor x, c), b == tbnz x, b
```
Let's say the `b`-th bit in `c` is 0. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 1.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 0.
So, then
```
tbz (xor x, c), b == tbz x, b
```
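A standalone check of the flip rule above: testing bit b of (x ^ c) is the same as testing bit b of x, inverted exactly when bit b of c is set.
```
#include <cassert>
#include <cstdint>

bool bit(uint64_t v, unsigned i) { return (v >> i) & 1; }

int main() {
  const uint64_t x = 0x0123456789abcdefULL;
  const uint64_t c = 0xf0f0f0f0f0f0f0f0ULL;
  for (unsigned b = 0; b < 64; ++b)
    assert(bit(x ^ c, b) == (bit(c, b) ? !bit(x, b) : bit(x, b)));
  return 0;
}
```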
Differential Revision: https://reviews.llvm.org/D73929
This implements the following optimization:
```
(tbz (shl x, c), b) -> (tbz x, b-c)
```
Which appears in `getTestBitOperand` in AArch64ISelLowering.cpp.
If we test bit `b` of `shl x, c`, we can fold away the `shl` by looking `c` bits
to the right of `b` in `x` when this fits in the type. So, we can just test the
`b-c`th bit.
Differential Revision: https://reviews.llvm.org/D73924
Summary:
The AIX assembler .space directive can't take a second non-zero argument to fill
with. But LLVM emitFill currently assumes it can. We add a flag to the AsmInfo
to check if non-zero fill is supported, and if we can't zerofill non-zero values
we just splat the .byte directives.
Reviewers: stevewan, sfertile, DiggerLin, jasonliu, Xiangling_L
Reviewed By: jasonliu
Subscribers: Xiangling_L, wuzish, nemanjai, hiraditya, kbarton, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73554
Given
```
tb(n)z (and x, m), b
```
Where the `b`-th bit of `m` is 1,
```
tb(n)z (and x, m), b == tb(n)z x, b
```
So, we can walk past a `G_AND` in this case.
Also add test/CodeGen/AArch64/GlobalISel/opt-fold-and-tbz-tbnz.mir to test this.
Differential Revision: https://reviews.llvm.org/D73790
convertPtrAddToAdd improved overall code size and quality by a significant amount,
but on -O0 we generate some cross-class copies due to the fact that we emitted
G_PTRTOINT and G_INTTOPTR around the G_ADD. Unfortunately at -O0 we don't run any
register coalescing, so these cross class copies end up escaping as moves, and
we ended up regressing 3 benchmarks on CTMark (though still a winner overall).
This patch changes the lowering to instead directly emit the G_ADD into the
destination register, and then force changes the dest LLT to s64 from p0. This
should be ok, as all uses of the register should now be selected and therefore
the LLT doesn't matter for the users. It does however matter for the importer
patterns, which will fail to select a G_ADD if there's a p0 LLT.
I'm not able to get rid of the G_PTRTOINT on the source yet however. We can't
use the same trick of breaking the type system since that could break the
selection of the defining instruction. Thus with -O0 we still end up with a
cross class copy on source.
Code size improvements on -O0:
Program baseline new diff
test-suite :: CTMark/Bullet/bullet.test 965520 949164 -1.7%
test-suite...TMark/7zip/7zip-benchmark.test 1069456 1052600 -1.6%
test-suite...ark/tramp3d-v4/tramp3d-v4.test 1213692 1199804 -1.1%
test-suite...:: CTMark/sqlite3/sqlite3.test 421680 419736 -0.5%
test-suite...-typeset/consumer-typeset.test 837076 833380 -0.4%
test-suite :: CTMark/lencod/lencod.test 799712 796976 -0.3%
test-suite...:: CTMark/ClamAV/clamscan.test 688264 686132 -0.3%
test-suite :: CTMark/kimwitu++/kc.test 1002344 999648 -0.3%
test-suite...Mark/mafft/pairlocalalign.test 422296 421768 -0.1%
test-suite :: CTMark/SPASS/SPASS.test 656792 656532 -0.0%
Geomean difference -0.6%
Differential Revision: https://reviews.llvm.org/D73910
Start using a new strategy with a combination of merge and unmerges.
This allows scalarizing before lowering, which in cases like
<2 x s128> avoids producing giant illegal shifts.
Followup to D73135. If the target doesn't have hard float (default
for ARM), then we assert when trying to soften the result of vector
reduction intrinsics. This patch marks these for expansion as well.
(A bit odd to use vectors on a target without hard float ... but
that's where you end up if you expose target-independent vector types.)
Differential Revision: https://reviews.llvm.org/D73854
As detailed on PR43463, this fixes a static analyzer null dereference warning by sinking Changed = true into the if() blocks where the MIB is actually created.
I did a quick check that suggested that one of those if() blocks is always guaranteed to be hit (so we could change it to if-else), but this seems like a safer approach
Differential Revision: https://reviews.llvm.org/D73883
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73885
These instructions can set the exception in FPSW. But I
don't think they can change FPCW. So this looks like a typo.
Differential Revision: https://reviews.llvm.org/D73864
These can be lowered to code sequences using CMPFP and CMPFPE which then get
selected to VCMP and VCMPE. The implementation isn't fully correct, as the chain
operand isn't handled correctly, but resolving that looks like it would involve
changes around FPSCR-handling instructions and how the FPSCR is modelled.
The fp-intrinsics test was already testing some of this but as the entire test
was being XFAILed it wasn't noticed. Un-XFAIL the test and instead leave the
cases where we aren't generating the right instruction sequences as FIXME.
Differential Revision: https://reviews.llvm.org/D73194
Summary:
In big-endian MVE, the simple vector load/store instructions (i.e.
both contiguous and non-widening) don't all store the bytes of a
register to memory in the same order: it matters whether you did a
VSTRB.8, VSTRH.16 or VSTRW.32. Put another way, the in-register
formats of different vector types relate to each other in a different
way from the in-memory formats.
So, if you want to 'bitcast' or 'reinterpret' one vector type as
another, you have to carefully specify which you mean: did you want to
reinterpret the //register// format of one type as that of the other,
or the //memory// format?
The ACLE `vreinterpretq` intrinsics are specified to reinterpret the
register format. But I had implemented them as LLVM IR bitcast, which
is specified for all types as a reinterpretation of the memory format.
So a `vreinterpretq` intrinsic, applied to values already in registers,
would code-generate incorrectly if compiled big-endian: instead of
emitting no code, it would emit a `vrev`.
To fix this, I've introduced a new IR intrinsic to perform a
register-format reinterpretation: `@llvm.arm.mve.vreinterpretq`. It's
implemented by a trivial isel pattern that expects the input in an
MQPR register, and just returns it unchanged.
In the clang codegen, I only emit this new intrinsic where it's
actually needed: I prefer a bitcast wherever it will have the right
effect, because LLVM understands bitcasts better. So we still generate
bitcasts in little-endian mode, and even in big-endian when you're
casting between two vector types with the same lane size.
For testing, I've moved all the codegen tests of vreinterpretq out
into their own file, so that they can have a different set of RUN
lines to check both big- and little-endian.
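A hedged source-level example (assumes an MVE-enabled toolchain providing arm_mve.h; shown only to make the semantics concrete): the intrinsic reinterprets the register format, so even in big-endian mode it should emit no instructions, unlike a memory-format bitcast of differing lane sizes.
```
#include <arm_mve.h>

// Register-format reinterpretation; with this change, no vrev is emitted
// in big-endian mode.
uint8x16_t as_bytes(uint32x4_t v) {
  return vreinterpretq_u8_u32(v);
}
```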
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73786
Summary:
These instructions generate a vector of consecutive elements starting
from a given base value and incrementing by 1, 2, 4 or 8. The `wdup`
versions also wrap the values back to zero when they reach a given
limit value. The instruction updates the scalar base register so that
another use of the same instruction will continue the sequence from
where the previous one left off.
At the IR level, I've represented these instructions as a family of
target-specific intrinsics with two return values (the constructed
vector and the updated base). The user-facing ACLE API provides a set
of intrinsics that throw away the written-back base and another set
that receive it as a pointer so they can update it, plus the usual
predicated versions.
Because the intrinsics return two values (as do the underlying
instructions), the isel has to be done in C++.
This is the first family of MVE intrinsics that use the `imm_1248`
immediate type in the clang Tablegen framework, so naturally, I found
I'd given it the wrong C integer type. Also added some tests of the
check that the immediate has a legal value, because this is the first
time those particular checks have been exercised.
Finally, I also had to fix a bug in MveEmitter which failed an
assertion when I nested two `seq` nodes (the inner one used to extract
the two values from the pair returned by the IR intrinsic, and the
outer one put on by the predication multiclass).
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73357
Summary:
The unpredicated case of this is trivial: the clang codegen just makes
a vector splat of the input, and LLVM isel is already prepared to
handle that. For the predicated version, I've generated a `select`
between the same vector splat and the `inactive` input parameter, and
added new Tablegen isel rules to match that pattern into a predicated
`MVE_VDUP` instruction.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73356
Summary: There are no counters for individual ports, but this is already
enough to find a lot of issues in the current model (upcoming patch).
Reviewers: dblaikie, gchatelet
Subscribers: hiraditya, tschuett, RKSimon, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72032
Summary:
D68092 introduced a new SIRemoveShortExecBranches optimization pass and
broke some graphics shaders. The problem is that it was removing
branches over KILL pseudo instructions, and the fix is to explicitly
check for that in mustRetainExeczBranch.
Reviewers: critson, arsenm, nhaehnle, cdevadas, hakzsam
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73771
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
This reverts commit e34801c8e6 and the followup due to multiple
problems.
I've tried to keep the tests and RDA parts where possible, as those
still seem useful.
Summary:
Virtual registers that are undef have an empty LiveInterval at this
point, which means beginIndex() and endIndex() cannot be used. We
only need those indices to determine the range in which to scan for
affected other NSA instructions, and undef operands cannot contribute
to that range.
Reviewers: arsenm, rampitec, mareko
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73831
We were checking that the original Value * for the compare operands
were null. But that can never happen.
I believe we intended to check for 0 registers here instead.
Fixes PR44749.
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate that AM.Disp with the symbolic
displacement.
This is my second attempt at committing this after failing
build bots previously. One thing I realized about the previous
attempt is that it's possible that AM.Disp is already non-zero
and the new Offset changes it back to zero. In that case my
previous attempt failed to update AM.Disp to zero. So this patch
removes the early out for 0 and appropriately handles the 0 case
in each check so we still update AM.Disp at the end.
This is based on this llvm-dev thread http://lists.llvm.org/pipermail/llvm-dev/2019-December/137521.html
The current strategy for f16 is to promote the type to float everywhere except where the specific width is required, like loads, stores, and bitcasts. This results in rounding occurring in odd places instead of immediately after arithmetic operations. This interacts in weird ways with the __fp16 type in clang, which is a storage-only type where arithmetic is always promoted to float. InstCombine can remove some fpext/fptruncs around such arithmetic and turn it into arithmetic on half. This wouldn't be so bad if SelectionDAG were able to put those fpext/fpround back in when it promotes.
It is also not obvious how to handle to make the existing strategy work with STRICT fp. We need to use STRICT versions of the conversions which require chain operands. But if the conversions are created for a bitcast, there is no place to get an appropriate chain from.
This patch implements a different strategy where conversions are emitted directly around arithmetic operations, and otherwise the value is passed around as an i16, including in arguments and return values. This can result in more conversions between arithmetic operations, but is closer to matching the IR the frontend generates for __fp16. And it will allow us to use the chain from constrained arithmetic nodes to link the STRICT_FP_TO_FP16/STRICT_FP16_TO_FP that will need to be added. I've set it up so that each target can opt into the new behavior. Converting all the targets myself was more than I was able to handle.
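A hedged source-level illustration of the __fp16 interaction described above (assuming clang's storage-only __fp16 extension is available on the target):
```
// Arithmetic on __fp16 is performed in float and rounded back to half;
// where those conversions end up in the DAG depends on the f16
// legalization strategy discussed above.
__fp16 scale(__fp16 a, __fp16 b) {
  return a * b + a;
}
```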
Differential Revision: https://reviews.llvm.org/D73749
This fixes legalization of global stores wider than 128 bits. It seems more work
is needed on how this split actually occurs. For example, we get the right code
for s160, with an s128 and an s32 load, but get five s32 loads for <5 x s32>.
Summary:
Implements the jump pseudo-instruction, which is used in e.g. the Linux kernel.
Reviewers: asb, lenary
Reviewed By: lenary
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73178
When you encounter a G_TRUNC, you are moving from a larger type to a smaller
type.
Asking for the i-th bit on a larger value is the same as asking for the i-th
bit on a smaller value.
So, we should always be able to walk through G_TRUNC when computing the bit
for a TB(N)Z.
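A standalone illustration of that bit-level fact (plain C++, not GlobalISel code):
```
#include <cassert>
#include <cstdint>

int main() {
  uint64_t Wide = 0xDEADBEEFCAFEF00DULL;
  uint32_t Narrow = static_cast<uint32_t>(Wide); // models a G_TRUNC s64 -> s32
  // Every bit index that exists in the narrow type reads the same in both
  // values, so a single-bit test can walk through the truncate and look at
  // the wide source directly.
  for (unsigned I = 0; I != 32; ++I)
    assert(((Narrow >> I) & 1) == ((Wide >> I) & 1));
  return 0;
}
```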
Differential Revision: https://reviews.llvm.org/D73748
Summary: This is a first step before changing the types to llvm::Align and introducing functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
Try out using combine definition rules.
This really should be a post-legalizer combine, but the combiner pass
is currently pre-legalize. Most of the target combines are really
post-legalize, so we should probably move the pass.
For an MC_GlobalAddress reference to a dso_local external GlobalValue with a definition, emit .Lfoo$local to avoid a relocation.
-fno-pic and -fpie can infer dso_local but -fpic cannot. In the future,
we can explore the possibility of inferring dso_local with -fpic. As the
description of D73228 says, LLVM's existing IPO optimization behaviors
(like -fno-semantic-interposition) and a previous assembly behavior give
us enough license to be aggressive here.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D73230
I believe this also fixes bugs with CI 32-bit handling, which was
incorrectly skipping offsets that look like signed 32-bit values. Also
validate the offsets are dword aligned before folding.
This is similar to the code in getTestBitOperand in AArch64ISelLowering. Instead
of implementing all of the TB(N)Z optimizations at once, this patch implements
the simplest case first. The way that this is set up should make it fairly easy
to add the rest as we go along.
The idea here is that after determining that we can use a TB(N)Z, we can
continue looking through instructions and perform further folding.
In this case, when we have a G_ZEXT or G_ANYEXT where the extended bits are not
used, we can fold it into the TB(N)Z.
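A rough illustration of why the fold is sound (plain C++, not GlobalISel code): when only a low bit is inspected, the extended bits never matter, so the bit test can be performed on the unextended source.
```
bool bit3IsSet(unsigned char V) {
  unsigned Wide = V; // models a G_ZEXT (or G_ANYEXT) s8 -> s32
  // Only bit 3 is ever inspected; the extended high bits are unused, so the
  // test could equally be done on V itself, which is what folding the
  // extend into the TB(N)Z amounts to.
  return (Wide & (1u << 3)) != 0;
}
```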
Differential Revision: https://reviews.llvm.org/D73673
Found by inspection, but there's no test for this yet because G_PTR_ADD is
currently illegal for vectors. I'll add the test at a later time when the
legalizer support has landed.
There's not much value in keeping this node separate from the intrinsic. Make
the operand structure the same as the intrinsic's, so we can reuse the
same pattern for GlobalISel.
Summary:
Added file headers for the files which implement iterative lightweight scheduling
strategies. This is basically an exercise which I undertook in order to get
used to the LLVM development process.
Reviewers: arsenm, vpykhtin, cdevadas
Reviewed By: vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73417
Summary:
For -fpatchable-function-entry=N,0 -mbranch-protection=bti, after
9a24488cb6, we place the NOP sled after
the initial BTI.
```
.Lfunc_begin0:
bti c
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lfunc_begin0
```
This patch adds a label after the initial BTI and changes the __patchable_function_entries entry to reference the label:
```
.Lfunc_begin0:
bti c
.Lpatch0:
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lpatch0
```
This placement is compatible with the resolution in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424 .
A local linkage function whose address is not taken does not need a BTI.
Placing the patch label after BTI has the advantage that code does not
need to differentiate whether the function has an initial BTI.
Reviewers: mrutland, nickdesaulniers, nsz, ostannard
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73680
fadd/fmul reductions without reassoc are lowered to
VECREDUCE_STRICT_FADD/FMUL nodes, which don't have legalization
support. Until that is in place, expand these intrinsics on
ARM and AArch64. Other targets always expand the vector reduction
intrinsics.
Additionally, expand fmax/fmin reductions without the nnan flag on
AArch64, as the backend asserts that the flag is present when
lowering VECREDUCE_FMIN/FMAX.
This fixes https://bugs.llvm.org/show_bug.cgi?id=44600.
Differential Revision: https://reviews.llvm.org/D73135
The recommended optimization level for BPF programs
is -O2, since (1) BPF programs run inside the kernel and
the Linux kernel won't work at the -O0 level, and (2) the verifier
is not able to handle -O0 code properly, e.g., due to potentially
large stack size and a lot of spills.
But we should at least keep -O0 compiling.
This patch fixes a bug in the BPFMISimplifyPatchable phase
where, with -O0, a segmentation fault happens for a
simple program like:
int test(int a, int b) { return a + b; }
A test case is added to capture such a case.
Differential Revision: https://reviews.llvm.org/D73681
Summary:
This patch intends to support the three most common relocation types
on AIX: R_POS, R_TOC, R_RBR.
These three relocation types will be needed for object file generation
on AIX for the small code model.
We will have follow-up patches to bring relocation support for
the large code model on AIX.
Reviewers: hubert.reinterpretcast, daltenty, DiggerLin
Differential Revision: https://reviews.llvm.org/D72027
With the addition of prefixed instructions, branch distances are no longer
computed correctly. Since a prefixed instruction cannot cross a 64-byte
boundary, we have to assume that a prefixed instruction may have a nop
prepended to it. This patch tries to take that nop into consideration
when computing the size of basic blocks.
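A simplified model of that conservative estimate (not the actual PowerPC branch-selection code; the 8-byte prefixed-instruction size and the possible 4-byte alignment nop are the assumptions being encoded):
```
#include <cstdint>

// Worst-case byte count for a single instruction when sizing basic blocks:
// a prefixed instruction is 8 bytes and, because it may not cross a 64-byte
// boundary, may additionally have a 4-byte alignment nop prepended.
uint64_t worstCaseSize(bool IsPrefixed) {
  uint64_t Size = IsPrefixed ? 8 : 4;
  if (IsPrefixed)
    Size += 4; // possible alignment nop
  return Size;
}
```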
Differential Revision: https://reviews.llvm.org/D72572
The commit https://reviews.llvm.org/rG14fc20ca6 added some options to the X86
back end that cause the help text for opt/llc to become much harder to read.
The issue is that the cl::value_desc is part of the option name and is used to
compute the indentation of the description text (i.e. the maximum length option
name is what everything aligns to). Since the commit puts a large number of
characters into that text, everything is aligned to that width.
This patch just reformats the option so that the descriptive text is contained in the
description and only the list of possible values appears within the angle brackets.
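A hedged sketch of the resulting shape (the option name and strings below are invented for illustration; cl::desc and cl::value_desc are the real llvm::cl fields involved):
```
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Keep the long prose in cl::desc. cl::value_desc is printed inside the
// angle brackets as part of the option name, and the widest option name
// determines the column that every description aligns to.
static cl::opt<std::string> WidgetStrategy( // hypothetical option
    "x86-widget-strategy",
    cl::desc("Choose the widget lowering strategy: fast, small or auto"),
    cl::value_desc("strategy"), cl::init("auto"), cl::Hidden);
```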
Note: the readability issue of the help text was fixed in commit
70cbf8c71c, but the reformatting wasn't
included in that commit, so I am still committing this.
Differential revision: https://reviews.llvm.org/D73267
Strict fp-to-int and int-to-fp conversions can be handled in the same way that
the non-strict versions are (by using the appropriate instruction or converting
to a function call when we have no instruction).
Differential Revision: https://reviews.llvm.org/D73625