MBFIWrapper keeps track of block frequencies of newly created blocks and
modified blocks, modified block frequencies should also impact block profile
count. This class doesn't provide interface getBlockProfileCount, users can only
use the underlying MBFI to query profile count, the underlying MBFI doesn't know
the modifications made in MBFIWrapper, so it either provides stale profile count
for modified block or simply crashes on new blocks.
So this patch add function getBlockProfileCount to class MBFIWrapper to handle
new blocks or modified blocks.
Differential Revision: https://reviews.llvm.org/D87802
The shift amount type does not necessarily match the result type. This
was inserting a trunc from s32 to s32, which asserted. Just preserve
the original shift amount type which can be legalized later.
An existing function Type::getScalarSizeInBits returns a uint64_t
instead of a TypeSize class because the caller is requesting a
scalar size, which cannot be scalable. This patch makes other
similar functions requesting a scalar size consistent with that,
thereby eliminating more than 1000 implicit TypeSize -> uint64_t
casts.
Differential revision: https://reviews.llvm.org/D87889
Just scalarize trunc stores - GenWidenVectorTruncStores does the same thing but is flawed (PR42046) and unused.
Differential Revision: https://reviews.llvm.org/D87708
If we are going to write handler data (that is written as variable
length data following after the unwind info in .xdata), we need to
emit the handler data immediately, but for cases where no such
info is going to be written, skip emitting it right away. (Unwind
info for all remaining functions that hasn't gotten it emitted
directly is emitted at the end.)
This does slightly change the ordering of sections (triggering a
bunch of updates to DebugInfo/COFF tests), but the change should be
benign.
This also matches GCC's assembly output, which doesn't output
.seh_handlerdata unless it actually is needed.
For ARM64, the unwind info can be packed into the runtime function
entry itself (leaving no data in the .xdata section at all), but
that can only be done if there's no follow-on data in the .xdata
section. If emission of the unwind info is triggered via
EmitWinEHHandlerData (or the .seh_handlerdata directive), which
implicitly switches to the .xdata section, there's a chance of the
caller wanting to pass further data there, so the packed format
can't be used in that case.
Differential Revision: https://reviews.llvm.org/D87448
When exporting statepoint results to virtual registers we try to avoid
generating exports for duplicated inputs. But we erroneously use
IR Value* to check if inputs are duplicated. Instead, we should use
SDValue, because even different IR values can get lowered to the same
SDValue.
I'm adding a (degenerate) test case which emphasizes importance of this
feature for invoke statepoints.
If we fail to export only unique values we will end up with something
like that:
%0 = STATEPOINT
%1 = COPY %0
landing_pad:
<use of %1>
And when exceptional path is taken, %1 is left uninitialized (COPY is never
execute).
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D87695
This reverts commit 0345d88de6.
Google internal backend uses EntrySU, we are looking into removing
dependency on it.
Differential Revision: https://reviews.llvm.org/D88018
SelectionDAGBuilder was inconsistently mangling values based on ABI
Calling Conventions when getting them through copyFromRegs in
SelectionDAGBuilder, causing duplicate value type convertions for
function arguments. The checking for the mangling requirement was based
on the value's originating instruction and was performed outside of, and
inspite of, the regular Calling Convention Lowering.
The issue could be observed in a scenario such as:
```
%arg1 = load half, half* %const, align 2
%arg2 = call fastcc half @someFunc()
call fastcc void @otherFunc(half %arg1, half %arg2)
; Here, %arg2 was incorrectly mangled twice, as the CallConv data from
; the call to @someFunc() was taken into consideration for the check
; when getting the value for processing the call to @otherFunc(...),
; after the proper convertion had taken place when lowering the return
; value of the first call.
```
This patch fixes the issue by disregarding the Calling Convention
information for such copyFromRegs, making sure the ABI mangling is
properly contanined in the Calling Convention Lowering.
This fixes Bugzilla #47454.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87844
This is a follow-up of D86605. For strict DAG FP node, if its FP
exception behavior metadata is ignore, it should have nofpexcept flag.
But during custom lowering, this flag isn't passed down.
This is also seen on X86 target.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D87390
This changes the order of output sections and the output assembly, but
is otherwise NFC.
It simplifies the TLOF interface by removing two COFF-only methods.
findPHICopyInsertPoint special cases placement in a block with a
callbr or invoke in it. In that case, we must ensure that the copy is
placed before the INLINEASM_BR or call instruction, if the register is
defined prior to that instruction, because it may jump out of the
block.
Previously, the code placed it immediately after the last def _or
use_. This is wrong, if the use is the instruction which may jump. We
could correctly place it immediately after the last def (ignoring
uses), but that is non-optimal for register pressure.
Instead, place the copy after the last def, or before the
call/inlineasm_br, whichever is later.
Differential Revision: https://reviews.llvm.org/D87865
This rewrites big parts of the fast register allocator. The basic
strategy of doing block-local allocation hasn't changed but I tweaked
several details:
Track register state on register units instead of physical
registers. This simplifies and speeds up handling of register aliases.
Process basic blocks in reverse order: Definitions are known to end
register livetimes when walking backwards (contrary when walking
forward then uses may or may not be a kill so we need heuristics).
Check register mask operands (calls) instead of conservatively
assuming everything is clobbered. Enhance heuristics to detect
killing uses: In case of a small number of defs/uses check if they are
all in the same basic block and if so the last one is a killing use.
Enhance heuristic for copy-coalescing through hinting: We check the
first k defs of a register for COPYs rather than relying on there just
being a single definition. When testing this on the full llvm
test-suite including SPEC externals I measured:
average 5.1% reduction in code size for X86, 4.9% reduction in code on
aarch64. (ranging between 0% and 20% depending on the test) 0.5%
faster compiletime (some analysis suggests the pass is slightly slower
than before, but we more than make up for it because later passes are
faster with the reduced instruction count)
Also adds a few testcases that were broken without this patch, in
particular bug 47278.
Patch mostly by Matthias Braun
This change enables the generic implicit null transformation for the AArch64 target. As background for those unfamiliar with our implicit null check support:
An implicit null check is the use of a signal handler to catch and redirect to a handler a null pointer. Specifically, it's replacing an explicit conditional branch with such a redirect. This is only done for very cold branches under frontend control w/appropriate metadata.
FAULTING_OP is used to wrap the faulting instruction. It is modelled as being a conditional branch to reflect the fact it can transfer control in the CFG.
FAULTING_OP does not need to be an analyzable branch to achieve it's purpose. (Or at least, that's the x86 model. I find this slightly questionable.)
When lowering to MC, we convert the FAULTING_OP back into the actual instruction, record the labels, and lower the original instruction.
As can be seen in the test changes, currently the AArch64 backend does not eliminate the unconditional branch to the fallthrough block. I've tried two approaches, neither of which worked. I plan to return to this in a separate change set once I've wrapped my head around the interactions a bit better. (X86 handles this via AllowModify on analyzeBranch, but adding the obvious code causing BranchFolding to crash. I haven't yet figured out if it's a latent bug in BranchFolding, or something I'm doing wrong.)
Differential Revision: https://reviews.llvm.org/D87851
Before this patch, the last chance recoloring and deferred spilling
techniques were solely controled by command line options.
This patch adds target hooks for these two techniques so that it
is easier for backend writers to override the default behavior.
The default behavior of the hooks preserves the default values of
the related command line options.
NFC
Initial support for dwarf fission sections (-gsplit-dwarf) on wasm.
The most interesting change is support for writing 2 files (.o and .dwo) in the
wasm object writer. My approach moves object-writing logic into its own function
and calls it twice, swapping out the endian::Writer (W) in between calls.
It also splits the import-preparation step into its own function (and skips it when writing a dwo).
Differential Revision: https://reviews.llvm.org/D85685
For <8 x s32> = fptrunc <8 x s64> the fewerElementsVector action tries to break
down the source vector into the final source vectors of <2 x s64> using unmerge.
This fixes a crash due to using the wrong number of elements for the breakdown
type.
Also add some legalizer tests for explicitly G_FPTRUNC which we didn't have.
Differential Revision: https://reviews.llvm.org/D87814
When splitting a live interval with subranges, only insert copies for
the lanes that are live at the point of the split. This avoids some
unnecessary copies and fixes a problem where copying dead lanes was
generating MIR that failed verification. The test case for this is
test/CodeGen/AMDGPU/splitkit-copy-live-lanes.mir.
Without this fix, some earlier live range splitting would create %430:
%430 [256r,848r:0)[848r,2584r:1) 0@256r 1@848r L0000000000000003 [848r,2584r:0) 0@848r L0000000000000030 [256r,2584r:0) 0@256r weight:1.480938e-03
...
256B undef %430.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
848B %430.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B %431:vreg_128 = COPY %430:vreg_128
Then RAGreedy::tryLocalSplit would split %430 into %432 and %433 just
before 848B giving:
%432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03
%433 [844r,848r:0)[848r,2584r:1) 0@844r 1@848r L0000000000000030 [844r,2584r:0) 0@844r L0000000000000003 [844r,844d:0)[848r,2584r:1) 0@844r 1@848r weight:2.831776e-03
...
256B undef %432.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128 {
internal %433.sub2:vreg_128 = COPY %432.sub2:vreg_128
848B }
%433.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B %431:vreg_128 = COPY %433:vreg_128
Note that the copy from %432 to %433 at 844B is a curious
bundle-without-a-BUNDLE-instruction that SplitKit creates deliberately,
and it includes a copy of .sub0 which is not live at this point, and
that causes it to fail verification:
*** Bad machine code: No live subrange at use ***
- function: zextload_global_v64i16_to_v64i64
- basic block: %bb.0 (0x7faed48) [0B;2848B)
- instruction: 844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128
- operand 1: %432.sub0:vreg_128
- interval: %432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03
- at: 844B
Using real bundles with a BUNDLE instruction might also fix this
problem, but the current fix is less invasive and also avoids some
unnecessary copies.
https://bugs.llvm.org/show_bug.cgi?id=47492
Differential Revision: https://reviews.llvm.org/D87757