Currently for atomic load, store, and rmw instructions, as long as the
operand is floating-point value, they are casted to integer. Nowadays many
targets can actually support part of atomic operations with floating-point
operands. For example, NVPTX supports atomic load and store of floating-point
values. This patch adds a series interface functions `shouldCastAtomicXXXInIR`,
and the default implementations are same as what we currently do. Later for
targets can have their specialization.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D125652
If the SafeWrap operation is a subtract, we negated the constant
to treat the subtract as an addition. The sext was based on the
operation being addition. So we really need to do (neg (sext (neg C)))
when promoting the constant. This is equivalent to (sext C) for
every value of C except the min signed value. For min signed value
we need to do (zext C) instead.
Fixes PR55490.
Differential Revision: https://reviews.llvm.org/D125653
This is the first commit for the cmov-vs-branch optimization pass.
The goal is to develop a new profile-guided and target-independent cost/benefit analysis
for selecting conditional moves over branches when optimizing for performance.
Initially, this new pass is expected to be enabled only for instrumentation-based PGO.
RFC: https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D120230
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
An upcoming patch will extend llvm-symbolizer to provide the source line
information for global variables. The goal is to move AddressSanitizer
off of internal debug info for symbolization onto the DWARF standard
(and doing a clean-up in the process). Currently, ASan reports the line
information for constant strings if a memory safety bug happens around
them. We want to keep this behaviour, so we need to emit debuginfo for
these variables as well.
Reviewed By: dblaikie, rnk, aprantl
Differential Revision: https://reviews.llvm.org/D123534
I noticed https://reviews.llvm.org/D87415 added SDAG combines to fold
FMIN/MAX instrs with NaNs.
The patch implements the same NaN combines for GISel GMIR FMIN/MAX opcodes:
G_FMINNUM(X, NaN) -> X
G_FMAXNUM(X, NaN) -> X
G_FMINIMUM(X, NaN) -> NaN
G_FMAXIMUM(X, NaN) -> NaN
The patch adds AArch64 tests for these combines as well.
Reviewed by: arsenm
Differential revision: https://reviews.llvm.org/D125819
This function tries to match (a >> 8) | (a << 8) as (bswap a) >> 16.
If the SRL isn't masked and the high bits aren't demanded, we still
need to ensure that bits 23:16 are zero. After the right shift they
will be in bits 15:8 which is where the important bits from the SHL
end up. It's only a bswap if the OR on bits 15:8 only takes the bits
from the SHL.
Fixes PR55484.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D125641
The patch does not pass math flags to float VPCmpIntrinsics because LLParser
could not identify float VPCmpIntrinsics as FPMathOperators.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125600
If we're using shift pairs to mask, then relax the one use limit if the shift amounts are equal - we'll only be generating a single AND node.
AArch64 has a couple of regressions due to this, so I've enforced the existing one use limit inside a AArch64TargetLowering::shouldFoldConstantShiftPairToMask callback.
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125607
Add a new TargetRegisterInfo hook to allow targets to tweak the
priority of live ranges, so that AllocationPriority of the register
class will be treated as more important than whether the range is local
to a basic block or global. This is determined per-MachineFunction.
Differential Revision: https://reviews.llvm.org/D125102
This patch uses VP_REDUCE_AND and VP_REDUCE_OR to replace VP_REDUCE_SMAX,VP_REDUCE_SMIN,VP_REDUCE_UMAX and VP_REDUCE_UMIN for mask vector type.
Differential Revision: https://reviews.llvm.org/D125002
The documentation for this specifically mentions that this should not
happen. We could think about adding target hooks to permit it (and how
to merge IDs) in the future if that is desirable.
This specific test case was merging a scalable-vector slot into a
non-scalable one and dropping the notion of scalability, meaning we
failed to allocate enough stack space for the object.
Reviewed By: arsenm, MaskRay, sdesmalen
Differential Revision: https://reviews.llvm.org/D125699
An upcoming patch will extend llvm-symbolizer to provide the source line
information for global variables. The goal is to move AddressSanitizer
off of internal debug info for symbolization onto the DWARF standard
(and doing a clean-up in the process). Currently, ASan reports the line
information for constant strings if a memory safety bug happens around
them. We want to keep this behaviour, so we need to emit debuginfo for
these variables as well.
Reviewed By: dblaikie, rnk, aprantl
Differential Revision: https://reviews.llvm.org/D123534
The existing redundant copy elimination required a virtual register source, but the same logic works for any physreg where we don't have to worry about clobbers. On RISCV, this helps eliminate redundant CSR reads from VLENB.
Differential Revision: https://reviews.llvm.org/D125564
During early gather/scatter enablement two different approaches
were taken to represent scaled indices:
* A Scale operand whereby byte_offsets = Index * Scale
* An IndexType whereby byte_offsets = Index * sizeof(MemVT.ElementType)
Having multiple representations is bad as shown by this patch which
fixes instances where the two are out of sync. The dedicated scale
operand is more flexible and pervasive so this patch removes the
UNSCALED values from IndexType. This means all indices are scaled
but the scale can be one, hence unscaled. SDNodes now use the scale
operand to answer the "isScaledIndex" question.
I toyed with the idea of keeping the UNSCALED enums and helper
functions but because they will have no uses and force SDNodes to
validate the set of supported values I figured it's best to remove
them. We can re-add them if there's a real need. For similar
reasons I've kept the IndexType enum when a bool could be used as I
think being explicitly looks better.
Depends On D123347
Differential Revision: https://reviews.llvm.org/D123381
If we use multiply it would be with 0x0101 which is 1 more than a power
of 2. On some targets we would expand this to shl+add. By avoiding the
multiply earlier, we can generate better code.
Note, PowerPC doesn't do the shl+add expansion of multiply so one of
the tests increased in instruction count.
Limiting to scalars because it almost always increased the number of
instructions in vector tests.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D125638
This change adds the constant splat versions of m_ICst() (by using
getBuildVectorConstantSplat()) and uses it in
matchOrShiftToFunnelShift(). The getBuildVectorConstantSplat() name is
shortened to getIConstantSplatVal() so that the *SExtVal() version would
have a more compact name.
Differential Revision: https://reviews.llvm.org/D125516
The patch make users not need to know getNode with SDNodeFlags argument may not
pass its flags.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125659
FunctionLoweringInfo::StatepointRelocationMaps map is used to pass GC pointer
lowering information from statepoint to gc.relocate which may appear ini
different block.
D124444 introduced different lowering for local and non-local relocates.
Local relocates use SDValue and non-local relocates use value exported to VReg.
But I overlooked the fact that StatepointRelocationMap is indexed not by
GCRelocate instruction, but by derived pointer. This works incorrectly when
we have two relocates (one local and another non-local) of the same value,
because they need different relocation records.
This patch fixes the problem by recording relocation information per relocate
instruction, not per derived pointer. This way, each gc.relocate can be lowered
differently.
Reviewed By: skatkov
Differential Revision: https://reviews.llvm.org/D125538
FastISel tries to fold loads into the single using instruction.
However, if the register has fixups, then there may be additional
uses through an alias of the register.
In particular, this fixes the problem reported at
https://reviews.llvm.org/D119432#3507087. The load register is
(at the time of load folding) only used in a single call instruction.
However, selection of the bitcast has added a fixup between the
load register and the cross-BB register of the bitcast result.
After fixups are applied, there would now be two uses of the load
register, so load folding is not legal.
Differential Revision: https://reviews.llvm.org/D125459
SelectionDAG::FoldConstantArithmetic determines if operands are foldable constants, so we don't need to bother with isConstantOrConstantVector / Opaque tests before calling it directly.
SelectionDAG::FoldConstantArithmetic determines if operands are foldable constants, so we don't need to bother with isConstantOrConstantVector / Opaque tests before calling it directly.
Pulled out of D77804 as its going to be easier to address the regressions individually.
This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.
The lost RISCV gorc2 fold shouldn't be a problem - instcombine would have already destroyed that pattern - see https://github.com/llvm/llvm-project/issues/50553
Differential Revision: https://reviews.llvm.org/D124839
When GlobalISel fails, we need to report the error, and we need to set
the FailedISel property. We skipped those steps if stack protector
insertion failed, which led to a very strange miscompile.
Differential Revision: https://reviews.llvm.org/D125584
We commonly want to create either an inbounds or non-inbounds GEP
based on a boolean value, e.g. when preserving inbounds from
existing GEPs. Directly accept such a boolean in the API, rather
than requiring a ternary between CreateGEP and CreateInBoundsGEP.
This change is not entirely NFC, because we now preserve an
inbounds flag in a constant expression edge-case in InstCombine.
Previously it built MIR for the results and returned a Register.
This avoids building constants for earlier elements of the vector if
later elements will fail to fold, and allows CSEMIRBuilder::buildInstr
to avoid unconditionally building a copy from the result.
Use a new helper function MachineIRBuilder::buildBuildVectorConstant
to build a G_BUILD_VECTOR of G_CONSTANTs.
Differential Revision: https://reviews.llvm.org/D117758
If we're promoting an undef I think that means that we expect the
upper bits are zero. undef doesn't guarantee that.
This patch replaces undef with 0 to ensure this. This matches how
a zext or sext of undef would be folded by InstCombine/InstSimplify.
I haven't found a failure from this was just thinking through the code.
Differential Revision: https://reviews.llvm.org/D123174
This is a re-apply of D123599, which was reverted in 4fe2ab5279, now
with a more appropriate assertion. Original commit message follow:
InstrRefBasedLDV can track and describe variable values that are spilt to
the stack -- however it does not current describe the size of the value on
the stack. This can cause uninitialized bytes to be read from the stack if
a small register is spilt for a larger variable, or theoretically on
big-endian machines if a large value on the stack is used for a small
variable.
Fix this by using DW_OP_deref_size to specify the amount of data to load
from the stack, if there's any possibility for ambiguity. There are a few
scenarios where this can be omitted (such as when using DW_OP_piece and a
non-DW_OP_stack_value location), see deref-spills-with-size.mir for an
explicit table of inputs flavours and output expressions.
Differential Revision: https://reviews.llvm.org/D123599
As pointed out in #55342, given non-canonical IR with multiple
constants, we check the second operand in isSafeWrap, but can promote
both with sext. Fix that as suggested by @craig.topper by ensuring we
only extend the second constant if multiple are present.
Fixes#55342
Differential Revision: https://reviews.llvm.org/D125294
This clang-formats the TypePromotion code, with the only meaningful
change being the removal of a verifyFunction call inside a LLVM_DEBUG,
and the printing of the entire function which can be better handled
via -print-after-all.
We often see code like the following after running SCCP:
switch (x) { case 42: phi(42, ...); }
This tends to produce bad code as we currently materialize the constant
phi-argument in the switch-block. This increases register pressure and
if the pattern repeats for `n` case statements, we end up generating `n`
constant values.
This changes CodeGenPrepare to catch this pattern and revert it back to:
switch (x) { case 42: phi(x, ...); }
Differential Revision: https://reviews.llvm.org/D124552
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.
Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.
This is NFC for non-X86 targets.
Differential Revision: https://reviews.llvm.org/D124894
This allows the compiler to support more features than those supported by a
model. The only requirement (development mode only) is that the new
features must be appended at the end of the list of features requested
from the model. The support is transparent to compiler code: for
unsupported features, we provide a valid buffer to copy their values;
it's just that this buffer is disconnected from the model, so insofar
as the model is concerned (AOT or development mode), these features don't
exist. The buffers are allocated at setup - meaning, at steady state,
there is no extra allocation (maintaining the current invariant). These
buffers has 2 roles: one, keep the compiler code simple. Second, allow
logging their values in development mode. The latter allows retraining
a model supporting the larger feature set starting from traces produced
with the old model.
For release mode (AOT-ed models), this decouples compiler evolution from
model evolution, which we want in scenarios where the toolchain is
frequently rebuilt and redeployed: we can first deploy the new features,
and continue working with the older model, until a new model is made
available, which can then be picked up the next time the compiler is built.
Differential Revision: https://reviews.llvm.org/D124565
As suggested from 02f8519502, this uses the
isAnyConstantBuildVector method in lieu of separate
isBuildVectorOfConstantSDNodes calls. It should
otherwise be an NFC.
This prevents an infinite loop from D123801, where code trying to reduce
the total number of bitcasts, but also handling constants, could create
the opposite transform. Prevent the transform in these case to let the
bitcast of a constant transform naturally.
Fixes#55345
Like other shifts, the type isn't required to match. We shouldn't
assume we can call ZExtPromotedInteger.
I tested the PromoteIntOp_FunnelShift locally by removing the promotion
of the shift amount from PromoteIntRes_FunnelShift. But with the final
version of this patch it is never executed on any tests.
Differential Revision: https://reviews.llvm.org/D125106
This is part of an ongoing effort toward making DAGCombine process the nodes in topological order.
This is able to discover a couple of new optimizations, but also causes a couple of regression. I nevertheless chose to submit this patch for review as to start the discussion with people working on the backend so we can find a good way forward.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124743
Add helper functions to query the signed and scaled properties
of ISD::IndexType along with functions to change them.
Remove setIndexType from MaskedGatherSDNode because it only has
one usage and typically should only be changed alongside its
index operand.
Minimise the direct use of the enum values to lay the groundwork
for more refactoring.
Differential Revision: https://reviews.llvm.org/D123347
Something is going wrong with the BigEndian PowerPC bot. It is hard to
tell what is wrong from here, but attempt to fix it by disabling the
combineShuffleOfBitcast combine for bigendian.
Otherwise we have garbage in the upper bits that can affect the
results of the UREM.
Fixes PR55296.
Differential Revision: https://reviews.llvm.org/D125076
If the mask is made up of elements that form a mask in the higher type
we can convert shuffle(bitcast into the bitcast type, simplifying the
instruction sequence. A v4i32 2,3,0,1 for example can be treated as a
1,0 v2i64 shuffle. This helps clean up some of the AArch64 concat load
combines, along with helping simplify a number of other tests.
The PowerPC combine for v16i8 splat vector loads needed some fixes to
keep it working for v16i8 vectors. This improves the handling of v2i64
shuffles to match too, hopefully improving them in general.
Differential Revision: https://reviews.llvm.org/D123801
The result of sign_extend_inreg needs to have as many sign bits
as requested by the VT argument. The easiest way to guarantee this
is to fold it to 0.
SystemZ test was modified to avoid using undef.
Fixes https://github.com/llvm/llvm-project/issues/55178
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124696
There are many more instances of this pattern, but I chose to limit this change to .rst files (docs), anything in libcxx/include, and string literals. These have the highest chance of being seen by end users.
Reviewed By: #libc, Mordante, martong, ldionne
Differential Revision: https://reviews.llvm.org/D124708
Prior to ordering instructions to be scheduled, the machine pipeliner
update recurrence node sets in groupRemainingNodes() by adding in a
given node set any node on the dependency path from a node set with
higher priority to the given node set. The function computePath() that
determine what constitutes a path follows artificial dependencies.
However, when ordering the nodes in the resulting node sets,
computeNodeOrder() calls ignoreDependence when looking at dependencies
which ignores artificial dependencies. This can cause a node not to be
scheduled which then causes wrong code generation and in the case of a
debug build will lead to an assert failure in generatePhis() in
ModuloScheduler.cpp.
This commit adds calls to ignoreDependence() in computePath() to not add
any node in groupRemainingNodes() that would not be ordered by
computeNodeOrder().
Reviewed By: sgundapa
Differential Revision: https://reviews.llvm.org/D124267
Summary:
When -ffunction-sections is on, this patch makes the compiler to generate unique LSDA and EH info sections for functions on AIX by appending the function name to the section name as a suffix. This will allow the AIX linker to garbage-collect unused function.
Reviewed by: MaskRay, hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D124855
This extends the (X & ~Y) | Y to X | Y fold to also work if ~Y is
a truncated not (when taking into account the mask X). This is
done by exporting the infrastructure added in D124856 and reusing
it here.
I've retained the old value of AllowUndefs=false, though probably
this can be switched to true with extra test coverage.
Differential Revision: https://reviews.llvm.org/D124930
Demanded bits analysis may replace a full-width not with a
any_extend (not (truncate X)) pattern. This patch looks through
this kind of pattern in haveNoCommonBitsSet(). Of course, we can
only do this if we only need negated bits in the non-extended part,
as the other bits may now be arbitrary. For example, if we have
haveNoCommonBitsSet(~X & Y, X) then ~X only needs to actually
negate bits set in Y.
This is only a partial solution to the problem in that it allows
add -> or conversion, but the resulting or doesn't get folded yet.
(I guess that will involve exposing getBitwiseNotOperand() as a
more general helper and using that in the relevant transform.)
Differential Revision: https://reviews.llvm.org/D124856
If the tied use is undef value, fastregalloc should free the def
register. There is no reload needed for the undef value.
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D124834
Don't assume the rotation amounts have been correctly normalized - do it as part of the constant folding.
Also, the normalization should be performed with UREM not SREM.
This is the DAG variant of D124763. The code already handles the
general pattern, but not this degenerate case.
This allows folding A + (B&~A) to A | (B&~A) which further holds
to A | B.
Handling on the SDAG level is needed because in the motivating
case the add is actually a getelementptr, which only gets converted
into an add on the SDAG level. However, this patch is not quite
sufficient to handle the getelementptr case yet, because of an
interfering demanded bits simplification.
Differential Revision: https://reviews.llvm.org/D124772
In SelectionDAG, DBG_PHI instructions are created to "read" physreg values
and give them an instruction number, when they can't be traced back to a
defining instruction. The most common scenario if arguments to a function.
Unfortunately, if you have 100 inlined methods, each of which has the same
"this" pointer, then the 100 dbg.value instructions become 100
DBG_INSTR_REFs plus 100 DBG_PHIs, where only one DBG_PHI would suffice.
This patch adds a vreg cache for MachienFunction::salvageCopySSA, if we've
already traced a value back to the start of a block and created a DBG_PHI
then it allows us to re-use the DBG_PHI, as well as reducing work.
Differential Revision: https://reviews.llvm.org/D124517
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.
The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.
Differential Revision: https://reviews.llvm.org/D124358
When looking for memory uses,
reassociationCanBreakAddressingModePattern should check uses of
the outer ADD rather than the inner ADD. We want to know if the
two ops we're reassociating are used by a load/store.
In practice, the existing check usually works because CodeGenPrepare
will make one of the load/stores have an offset of 0 relative to
split GEP. That will make the inner add have a memory use.
To test this, I've manually split the GEPs so there is no 0 offset
store.
This issue was recently discussed in the original review D60294.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124644
SIGN_EXTEND_INREG expansion can trigger a TypeSize error because
"VT.getSizeInBits() == 1" is used to detect for a boolean without
first verifying VT is a scalar.
We try to match as a disguised rotate by constant of these forms
(shl (X | Y), C1) | (srl X, C2) --> (rotl X, C1) | (shl Y, C1)
(shl X, C1) | (srl (X | Y), C2) --> (rotl X, C1) | (srl Y, C2)
We may have also looked through an AND to find the shift. If we
did, we need to apply a mask to the result.
I'll add an AArch64 test and pre-commit it and the RISC-V test
tomorrow.
Fixes PR55201.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124711
Fixed "private field is not used" warning when compiled
with clang.
original commit: 28d09bbbc3
reverted in: fa49021c68
------
This patch permits Swing Modulo Scheduling for ARM targets
turns it on by default for the Cortex-M7. The t2Bcc
instruction is recognized as a loop-ending branch.
MachinePipeliner is extended by adding support for
"unpipelineable" instructions. These instructions are
those which contribute to the loop exit test; in the SMS
papers they are removed before creating the dependence graph
and then inserted into the final schedule of the kernel and
prologues. Support for these instructions was not previously
necessary because current targets supporting SMS have only
supported it for hardware loop branches, which have no
loop-exit-contributing instructions in the loop body.
The current structure of the MachinePipeliner makes it difficult
to remove/exclude these instructions from the dependence graph.
Therefore, this patch leaves them in the graph, but adds a
"normalization" method which moves them in the schedule to
stage 0, which causes them to appear properly in kernel and
prologues.
It was also necessary to be more careful about boundary nodes
when iterating across successors in the dependence graph because
the loop exit branch is now a non-artificial successor to
instructions in the graph. In additional, schedules with physical
use/def pairs in the same cycle should be treated as creating an
invalid schedule because the scheduling logic doesn't respect
physical register dependence once scheduled to the same cycle.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D122672
When looking through extends of gather/scatter indices it's safe
to convert a known positive signed index to unsigned, but unsigned
indices must remain unsigned.
Depends On D123318
Differential Revision: https://reviews.llvm.org/D123326
This is an alternative to D124530. In getUniformBase() only create
scales that match the gather/scatter element size. If targets also
support other scales, then they can produce those scales in target
DAG combines. This is what X86 already does (as long as the
resulting scale would be 1, 2, 4 or 8).
This essentially restores the pre-opaque-pointer state of things.
Fixes https://github.com/llvm/llvm-project/issues/55021.
Differential Revision: https://reviews.llvm.org/D124605
refineUniformBase and selectGatherScatterAddrMode both attempt the
transformation:
base(0) + index(A+splat(B)) => base(B) + index(A)
However, this is only safe when index is not implicitly scaled.
Differential Revision: https://reviews.llvm.org/D123222
PowerPC supports `ppc_fp128`, which is not an IEEE floating point
type. The generic lowering of llvm.is_fpclass could not handle it
properly. This change extends the generic lowering code to
support `ppc_fp128`.
The change was tested on emulator using runtime tests from
https://reviews.llvm.org/D112933 and the patch for clang
https://reviews.llvm.org/D112932.
Differential Revision: https://reviews.llvm.org/D113908
This reverts commit a15b66e76d.
This causes linker to crash at assertion: `Assertion failed: !Expr->isComplex(), file C:\b\s\w\ir\cache\builder\src\third_party\llvm\llvm\lib\CodeGen\LiveDebugValues\InstrRefBasedImpl.cpp, line 907`.
This patch permits Swing Modulo Scheduling for ARM targets
turns it on by default for the Cortex-M7. The t2Bcc
instruction is recognized as a loop-ending branch.
MachinePipeliner is extended by adding support for
"unpipelineable" instructions. These instructions are
those which contribute to the loop exit test; in the SMS
papers they are removed before creating the dependence graph
and then inserted into the final schedule of the kernel and
prologues. Support for these instructions was not previously
necessary because current targets supporting SMS have only
supported it for hardware loop branches, which have no
loop-exit-contributing instructions in the loop body.
The current structure of the MachinePipeliner makes it difficult
to remove/exclude these instructions from the dependence graph.
Therefore, this patch leaves them in the graph, but adds a
"normalization" method which moves them in the schedule to
stage 0, which causes them to appear properly in kernel and
prologues.
It was also necessary to be more careful about boundary nodes
when iterating across successors in the dependence graph because
the loop exit branch is now a non-artificial successor to
instructions in the graph. In additional, schedules with physical
use/def pairs in the same cycle should be treated as creating an
invalid schedule because the scheduling logic doesn't respect
physical register dependence once scheduled to the same cycle.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D122672
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
The description of SETCC says
/// SetCC operator - This evaluates to a true value iff the condition is
/// true. If the result value type is not i1 then the high bits conform
/// to getBooleanContents.
Without this patch, we sign extended the i1 to the used larger type
regardless of getBooleanContents. This resulted in miscompiles, as
shown in the attached testcase that ended up returning -1 instead of
1 when using -mattr=+v.
Fixes https://github.com/llvm/llvm-project/issues/55168
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124618
The current testcase I'm trying to reduce only reproduces with IPRA
enabled and requires handling multiple functions.
The only real difference vs. the IR is the extra indirect to look for
the underlying MachineFunction, so treat the ReduceWorkItem as the
module instead of the function.
The ugliest piece of this is really the ugliness of
MachineModuleInfo. It not only tracks actual module state, but has a
number of transient fields used for isel and/or the asm printer. These
shouldn't do any harm for the use here, though they should be
separated out.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
Cuurently we always export STATEPOINT results (GC pointers lowered via VRegs)
to virtual registers. When processing gc.relocate instructions we have to
generate CopyFromRegs node and then export it to VReg again if gc.relocate
is used in other basic blocks. This results in generation of extra COPY MIR
instruction if statepoint and its gc.relocate are in the same BB, but gc.relocate
result is used in other blocks.
This patch changes this behavior to export statepoint results only if used
in other basic blocks. For local uses StatepointLoweringState.(get|set)Location()
API is used to communicate appropriate statepoint result from `LowerStatepoint()`
to `visitGCRelocate()`
This is NFC and is purely compile time optimization. On big methids it can improve
codegen compile time up to 10%.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124444
InstrRefBasedLDV can track and describe variable values that are spilt to
the stack -- however it does not current describe the size of the value on
the stack. This can cause uninitialized bytes to be read from the stack if
a small register is spilt for a larger variable, or theoretically on
big-endian machines if a large value on the stack is used for a small
variable.
Fix this by using DW_OP_deref_size to specify the amount of data to load
from the stack, if there's any possibility for ambiguity. There are a few
scenarios where this can be omitted (such as when using DW_OP_piece and a
non-DW_OP_stack_value location), see deref-spills-with-size.mir for an
explicit table of inputs flavours and output expressions.
Differential Revision: https://reviews.llvm.org/D123599
Default behavior for .file directory was changed in D105856, but
ptxas (CUDA 11.5 release) refuses to parse it:
$ llc -march=nvptx64 llvm/test/DebugInfo/NVPTX/debug-file-loc.ll
$ ptxas debug-file-loc.s
ptxas debug-file-loc.s, line 42; fatal : Parsing error near
'"foo.h"': syntax error
Added a new field to MCAsmInfo to control default value of
UseDwarfDirectory. This value is used if -dwarf-directory command line
option is not specified.
Differential Revision: https://reviews.llvm.org/D121299
This was reverted twice, in 987cd7c3ed and 13815e8cbf. The latter
stemed from not accounting for rare register classes in a pre-allocated
array, and the former from an array not being completely initialized,
leading to asan complaining.
This change introduces a new intrinsic, `llvm.is.fpclass`, which checks
if the provided floating-point number belongs to any of the the specified
value classes. The intrinsic implements the checks made by C standard
library functions `isnan`, `isinf`, `isfinite`, `isnormal`, `issubnormal`,
`issignaling` and corresponding IEEE-754 operations.
The primary motivation for this intrinsic is the support of strict FP
mode. In this mode using compare instructions or other FP operations is
not possible, because if the value is a signaling NaN, floating-point
exception `Invalid` is raised, but the aforementioned functions must
never raise exceptions.
Currently there are two solutions for this problem, both are
implemented partially. One of them is using integer operations to
implement the check. It was implemented in https://reviews.llvm.org/D95948
for `isnan`. It solves the problem of exceptions, but offers one
solution for all targets, although some can do the check in more
efficient way.
The other, implemented in https://reviews.llvm.org/D96568, introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects a target
specific code into IR to implement `isnan` and some other functions. It is
convenient for targets that have dedicated instruction to determine FP data
class. However using target-specific intrinsic complicates analysis and can
prevent some optimizations.
A special intrinsic for value class checks allows representing data class
tests with enough flexibility. During IR transformations it represents the
check in target-independent way and saves it from undesired transformations.
In the instruction selector it allows efficient lowering depending on the
used target and mode.
This implementation is an extended variant of `llvm.isnan` introduced
in https://reviews.llvm.org/D104854. It is limited to minimal intrinsic
support. Target-specific treatment will be implemented in separate
patches.
Differential Revision: https://reviews.llvm.org/D112025
Last chance recoloring didn't try recoloring a done register with the
same class since it believed there was no point. This doesn't
necessarily apply if the members in that class overlap. Allow the
recoloring to proceed if the assigned interfering physical register
overlaps with the candidate register.
This avoids an allocation failure with overlapping tuples. This
testcase could be handled better, and I don't believe should reach
last chance recoloring. The failure only manifests with the mutually
unsatisfiable register hints to overlapping tuples. The earlier
assignment decisions probably should have figured out that using these
hints was a bad idea.
This was applied in fda4305e53, reverted in 13815e8cbf, the problem
was that fp80 X86 registers that were spilt to the stack aren't expected by
LiveDebugValues. It pre-allocates a position number for all register sizes
that can be spilt, and 80 bits isn't exactly common.
The solution is to scan the register classes to find any unrecognised
register sizes, adn pre-allocate those position numbers, avoiding a later
assertion.
DBG_PHI instructions can refer to stack slots, to indicate that multiple
values merge together on control flow joins in that slot. This is fine --
however the slot might be merged at a later date with a slot of a different
size. In doing so, we lose information about the size the eliminated PHI.
Later analysis passes have to guess.
Improve this by attaching an optional "bit size" operand to DBG_PHI, which
only gets added for stack slots, to let us know how large a size the value
on the stack is.
Differential Revision: https://reviews.llvm.org/D124184
This is a very specific fold to fix an upstream poor codegen issue.
InstCombine has the much more flexible pushFreezeToPreventPoisonFromPropagating but I don't think we're quite there with DAG/TLI handling for canCreateUndefOrPoison/isGuaranteedNotToBeUndefOrPoison value tracking yet.
Fixes#54911
Differential Revision: https://reviews.llvm.org/D124185
The most common situation where G_ASSERT_ZEXT appears for AMDGPU is a
copy from a physical register, which happens to use set the actual
register class on the virtual register. After copy coalescing, the
assert's source operand had a vreg with a set class. The verifier was
strictly rejecting cases where the set class/bank weren't an exact
match. Additionally, RegBankSelect was also expecting a register bank
to be set on the register, not a class.
This is much stricter than regular copies so relax this behavior. This
now allows these 2 cases:
1. Source register has either class or bank, and the result does not
2. Source register has a register class, and the result is a register
with a matching bank.
This should avoid needing some kind of special handling to avoid
violating this constraint when folding copies.
This emits an `st_size` that represents the actual useable size of an object before the redzone is added.
Reviewed By: vitalybuka, MaskRay, hctim
Differential Revision: https://reviews.llvm.org/D123010
Current stack size diagnostics ignore the size of the unsafe stack.
This patch attaches the size of the static portion of the unsafe stack
to the function as metadata, which can be used by the backend to emit
diagnostics regarding stack usage.
Reviewed By: phosek, mcgrathr
Differential Revision: https://reviews.llvm.org/D119996
We can process the long shuffles (working across several actual
vector registers) in the best way if we take the actual register
represantion into account. We can build more correct representation of
register shuffles, improve number of recognised buildvector sequences.
Also, same function can be used to improve the cost model for the
shuffles. in future patches.
Part of D100486
Differential Revision: https://reviews.llvm.org/D115653
This is x86 specific, and adds statefulness to
MachineModuleInfo. Instead of explicitly tracking this, infer if we
need to declare the symbol based on the reference previously inserted.
This produces a small change in the output due to the move from
AsmPrinter::doFinalization to X86's emitEndOfAsmFile. This will now be
moved relative to other end of file fields, which I'm assuming doesn't
matter (e.g. the __morestack_addr declaration is now after the
.note.GNU-split-stack part)
This also produces another small change in code if the module happened
to define/declare __morestack_addr, but I assume that's invalid and
doesn't really matter.
This is used to emit one field in doFinalization for the module. We
can accumulate this when emitting all individual functions directly in
the AsmPrinter, rather than accumulating additional state in
MachineModuleInfo.
Move the special case behavior predicate into MachineFrameInfo to
share it. This now promotes it to generic behavior. I'm assuming this
is fine because no other target implements adjustForSegmentedStacks,
or has tests using the split-stack attribute.
We can process the long shuffles (working across several actual
vector registers) in the best way if we take the actual register
represantion into account. We can build more correct representation of
register shuffles, improve number of recognised buildvector sequences.
Also, same function can be used to improve the cost model for the
shuffles. in future patches.
Part of D100486
Differential Revision: https://reviews.llvm.org/D115653
This can be set up front, and used only as a cache. This avoids a
field that looks like it requires MIR serialization.
I believe this fixes 2 bugs for CodeView. First, this addresses a
FIXME that the flag -diable-debug-info-print only works with
DWARF. Second, it fixes emitting debug info with emissionKind NoDebug.
1. X%C to the equivalent of X-X/C*C is not always fastest path if there is no SDIV pair exist. So check target have faster for srem only first.
2. Add AArch64 faster path for SREM only pow2 case.
Fix https://github.com/llvm/llvm-project/issues/54649
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122968
hasOneUse is not cheap on nodes with chain results that might have
many uses. By checking the opcode first, we can avoid a costly walk
of the use list on nodes we aren't interested in.
Found by investigating calls to hasNUsesOfValue from the example
provided in D123857.
specifying DW_AT_trampoline as a string. Also update the signature
of DIBuilder::createFunction to reflect this addition.
Differential Revision: https://reviews.llvm.org/D123697
Certain applications crashed for us with the AMDGPU backend. While this
is not a proper fix it allows us to compile the code for now. I left a
TODO for someone that understands DWARF.
Differential Revision: https://reviews.llvm.org/D123717
Before that change, constant-size `bcmp` would miss an opportunity to generate
a more efficient equality pattern and would generate a -1/0-1 pattern
instead.
Differential Revision: https://reviews.llvm.org/D123849
For strict FP16 to work correctly needs some changes in lowering and
legalization:
* SelectionDAGLegalize::PromoteNode was missing handling for some
strict fp opcodes.
* Some of the custom lowering of strict fp operations needed to be
adjusted to work with FP16.
* Custom lowering needed to be added for round-to-int operations.
With this, and the previous patches for the rest of the strict fp
isel, we can set IsStrictFPEnabled = true.
Differential Revision: https://reviews.llvm.org/D115620
Offloading sections can be embedded in the host during codegen via a
section. This section was originally marked as metadata to prevent it
from being loaded, but these sections are completely unused at runtime
so the linker should automatically drop them from the final executable
or shard library. This flag adds support for the SHF_EXCLUDE flag in
target lowering and uses it.
Reviewed By: JonChesterfield, MaskRay
Differential Revision: https://reviews.llvm.org/D122987
The lowering code did not use the scale operand of MGATHER/MSCATTER
nodes, but instead assumed scaled indices were always scaled based
on the element type of the memory type. This patch adds the missing
support by rewritting the nodes as unscaled variants.
Differential Revision: https://reviews.llvm.org/D123670
This testcase fails register allocation, but at the failure point
there were also new split virtual registers. Previously this was
assigning the failing register and not enqueueing the newly created
split virtual registers. These would then never be allocated and
assert in VirtRegRewriter.
This patch adds support for inline assembly address operands using the "p"
constraint on X86 and SystemZ.
This was in fact broken on X86 (see example at
https://reviews.llvm.org/D110267, Nov 23).
These operands should probably be treated the same as memory operands by
CodeGenPrepare, which have been commented with "TODO" there.
Review: Xiang Zhang and Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D122220
The condition in canEvictInterferenceBasedOnCost is slightly different
from the assertion in evictInteference.
canEvictInterferenceBasedOnCost uses a <= check for the cascade number
for legality, but the assert was checking for <. For equal cascade
numbers for an urgent eviction, canEvictInterferenceBasedOnCost could
return success. The actual eviction would then hit this assert. Avoid
ever returning true for equivalent cascade numbers.
The resulting failed allocation seems a bit off to me. e.g. in
illegal-eviction-assert.mir, I wuold assume %0 gets allocated starting
at $vgpr0. That was its initial allocation choice, but was later
evicted. In this example no evictions can help improve anything.
This is a replacement for the original fix attempted in
c46aab01c0.
This fixes "overlapping insert" assertion failures when trying to
unwind an unsuccessful recoloring attempt.
The problem would occur when there are multiple recoloring candidates
which recursively required recoloring. If one recoloring candidate was
successfully recolored at one level, and the next recoloring candidate
was unsuccessful, we would not roll back the first candidates
successful recoloring. The forgotten successful recoloring may have
been assigned to something that conflicts with a register that needs
to be restored in a parent recoloring attempt.
See the testcase added in issue48473 for a more concrete example with
explanation.
This was making several invalid assumptions about the incoming
select. First, it was assuming the incoming condition was either s1 or
already sign extended, not accounting for different boolean high bits
behavior between scalar and vector conditions. We only had a vector
boolean due to the intermediate step vector select, which is now
avoided.
Second, it was assuming it can use the result vector type as a boolean
mask. These types don't have anything to do with other, and only makes
sense in the context of the expansion to bit operations. Since these
logically are part of the same lowering, do the complete expansion in
a single step.
The added select_v4s1_s1 test does fail to legalize, since it seems
AArch64's vector legalization support is pretty incomplete.
This patch is similar to D122557, adding an `ArrayRef` version for `setOperationAction`, `setLoadExtAction`, `setCondCodeAction`, `setLibcallName`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123467
As far as I know getNode will never return a null SDValue.
I'm guessing this was modeled after the FoldConstantArithmetic
call earlier.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123550
This is really a replacement for memSizeInBytesNotPow2 that actually
does what most every target wants. In particular, since s1 rounds to 1
byte, it wasn't lowered by this predicate. This results in targets
needing to think harder and add more matchers to catch all the
degenerate cases.
Also small bug fix that prevented the correct insertion of
G_ASSERT_ZEXT in the AArch64 use case.
Materializing constants on RISCV is simpler if the constant is sign
extended from i32. By default i32 constant operands of phis are
zero extended.
This patch adds a hook to allow RISCV to override this for i32. We
have an existing isSExtCheaperThanZExt, but it operates on EVT which
we don't have at these places in the code.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122951
We're just trying to canonicalize here and won't be using the constant
value returned.
The attached test changes are because we were previously commuting
a seteq X, (splat_vector 0) because we also have (sub 0, X). The
0 is larger than the element type so we don't detect it as a splat
without the AllowTruncation flag. By preventing the commute we are
able to match it to the vmseq.vx instruction during isel. We only
look for constants on the RHS in isel.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D123256
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by linear (non-CGA
aware) nature of the unwind tables.
Unlike the `CFIInstrInserer` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustement and save locations of SVE
registers.
This pass takes advantage of the constraints taht LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):
* there is a single basic block, containing the function prologue
* possibly multiple epilogue blocks, where each epilogue block is
complete and self-contained, i.e. CSR restore instructions (and the
corresponding CFI instructions are not split across two or more
blocks.
* prologue and epilogue blocks are outside of any loops
Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:
- "has a call frame", if the function has executed the prologue, or
has not executed any epilogue
- "does not have a call frame", if the function has not executed the
prologue, or has executed an epilogue
These properties can be computed for each basic block by a single RPO
traversal.
From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.
Where these states differ, we insert compensating CFI instructions,
which come in two flavours:
- CFI instructions, which reset the unwind table state to the
initial one. This is done by a target specific hook and is
expected to be trivial to implement, for example it could be:
```
.cfi_def_cfa <sp>, 0
.cfi_same_value <rN>
.cfi_same_value <rN-1>
...
```
where `<rN>` are the callee-saved registers.
- CFI instructions, which reset the unwind table state to the one
created by the function prologue. These are the sequence:
```
.cfi_restore_state
.cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.
Reviewed By: MaskRay, danielkiss, chill
Differential Revision: https://reviews.llvm.org/D114545
fshl (or X, Y), X, C ==/!= 0 --> or (shl Y, C), X ==/!= 0
fshl X, (or X, Y), C ==/!= 0 --> or (srl Y, BW-C), X ==/!= 0
This is similar to an existing setcc-of-rotate fold, but the
matching requires more checks for the more general funnel op:
https://alive2.llvm.org/ce/z/Ab2jDd
We are effectively decomposing the funnel shift into logical
shifts, reassociating, and removing a shift.
This should get us the final improvements for x86-64 that were
originally shown in D111530
( https://github.com/llvm/llvm-project/issues/49541 );
x86-32 still shows some SHLD/SHRD, so the pattern is not
matching there yet.
Differential Revision: https://reviews.llvm.org/D122919
arm64_32 guarantees the high 32 bits of pointer parameters are passed as 0, and
this is modelled in the IR by inserting an AssertZExt after the CopyFromReg.
The function deciding whether registers that need to be preserved actually are
wasn't expecting this so it banned perfectly legitimate tail calls.
This patch aims to overcome an issue in these mappings where, when an ISD
node was registered with BEGIN_REGISTER_VP_SDNODE but outwidth the scope
of a pair of BEGIN_REGISTER_VP_INTRINSIC/END_REGISTER_VP_INTRINSIC
macros, the switch cases fell apart. This in particular happened with
VP_SETCC, where we'd end up with something along the lines of:
case Intrinsic::vp_fcmp:
break;
case Intrinsic::vp_icmp:
break;
ResOpc = ISD::VP_SETCC;
case Intrinsic::vp_store:
...
To remedy this, we introduce a special-purpose mapping macro which can
map any number of VP intrinsic opcodes to an ISD opcode.
As a result, we no longer need to special-case the mapping from vp.icmp
and vp.fcmp to VP_SETCC, as the new helper macro does it for us.
Thanks to @craig.topper for noticing this and to @rogfer01 for the idea.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D123324
Rather than rewriting the alloca pointer to zero, use
removePointerBase() to drop the base pointer. This will simply bail
if the base pointer is not the alloca. We could try doing something
more fancy here (like dropping the sources not based on the alloca
on the premise that they aren't SafeStack-relevant), but I don't
think that's worthwhile.
Fixes https://github.com/llvm/llvm-project/issues/54784.
Differential Revision: https://reviews.llvm.org/D123309
This patch adds the necessary infrastructure to lower vp.fcmp via
ISD::VP_SETCC to RVV instructions.
Most notably this patch adds cond-code legalization for VP_SETCC,
reusing the existing TargetLowering::LegalizeSetCCCondCode by passing in
additional SDValue parameters for the Mask and EVL. This method then
uses VP operations to legalize the condcode.
There is still a general lack of canonicalization on VP_SETCC as opposed
to SETCC which results in worse code than is theoretically possible.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123051
Currently LowerAtomics exists as a separate pass which blindly
replaces all atomics. Add a new lowering strategy option to eliminate
the atomics which the target can control on a per-instruction level.
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.
Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
The VP path was using the split source VTs instead of the split
destination VTs. This may not be a problem today because the VP
nodes going through this have the same source and dest VTs.
It will be a problem when we start using this function for legalizing
VP cast operations.
This patch adds the minimum required to successfully lower vp.icmp via
the new ISD::VP_SETCC node to RVV instructions.
Regular ISD::SETCC goes through a lot of canonicalization which targets
may rely on which has not hereto been ported to VP_SETCC. It also
supports expansion of individual condition codes and a non-boolean
return type. Support for all of that will follow in later patches.
In the case of RVV this largely isn't a problem as the vector integer
comparison instructions are plentiful enough that it can lower all
VP_SETCC nodes on legal integer vectors except for boolean vectors,
which regular SETCC folds away immediately into logical operations.
Floating-point VP_SETCC operations aren't as well supported in RVV and
the backend relies on condition code expansion, so support for those
operations will come in later patches.
Portions of this code were taken from the VP reference patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D122743
Promotion does not affect the base element type and so the original
index type will remain unchanged. This reflects the behaviour of
DAGTypeLegalizer::PromoteIntOp_MGATHER with no tests affected.
Accord the discussion in D122281, we missing an ISD::AND combine for MLOAD
because it relies on BuildVectorSDNode is fails for scalable vectors.
This patch is intend to handle that, so we can circle back the type MVT::nxv2i32
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122703
E.g. in
```
%i0 = zext <2 x i8> to <2 x i16>
%i1 = bitcast <2 x i16> to <4 x i8>
```
the `%i0`'s zero bits are known to be `0xFF00` (upper half of every element is known zero),
but no elements are known to be zero, and for `%i1`, we don't know anything about zero bits,
but the elements under `0b1010` mask are known to be zero (i.e. the odd elements).
But, we didn't perform such a propagation.
Noticed while investigating more aggressive `vpmaddwd` formation.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D123163
Place PersistentId declaration under #if LLVM_ENABLE_ABI_BREAKING_CHECKS to
reduce memory usage when it is not needed.
Differential Revision: https://reviews.llvm.org/D120714
Variable locations now come in two modes, instruction referencing and
DBG_VALUE. At -O0 we pick DBG_VALUE to allow fast construction of variable
information. Unfortunately, SelectionDAG edits the optimisation level in
the presence of opt-bisect-limit, meaning different passes have different
views of what variable location mode we should use. That causes assertions
when they're mixed.
This patch plumbs through a boolean in SelectionDAG from start to
instruction emission, so that we don't rely on the current optimisation
level for correctness.
Differential Revision: https://reviews.llvm.org/D123033
As raised on PR52267, XOR(X,MIN_SIGNED_VALUE) can be treated as ADD(X,MIN_SIGNED_VALUE), so let these cases use the 'AddLike' folds, similar to how we perform no-common-bits OR(X,Y) cases.
define i8 @src(i8 %x) {
%r = xor i8 %x, 128
ret i8 %r
}
=>
define i8 @tgt(i8 %x) {
%r = add i8 %x, 128
ret i8 %r
}
Transformation seems to be correct!
https://alive2.llvm.org/ce/z/qV46E2
Differential Revision: https://reviews.llvm.org/D122754
https://alive2.llvm.org/ce/z/A_auBq
Remove limitation that wouldn't perform the fold if all the inverted bits are known zero
The thumb2 changes look to be benign, although it does show that the TEQ/TST isel patterns could probably be improved.
Fixes movmsk regression in D122754
Differential Revision: https://reviews.llvm.org/D123023
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.
Differential Revision: https://reviews.llvm.org/D123117
Returning `std::array<uint8_t, N>` is better ergonomics for the hashing functions usage, instead of a `StringRef`:
* When returning `StringRef`, client code is "jumping through hoops" to do string manipulations instead of dealing with fixed array of bytes directly, which is more natural
* Returning `std::array<uint8_t, N>` avoids the need for the hasher classes to keep a field just for the purpose of wrapping it and returning it as a `StringRef`
As part of this patch also:
* Introduce `TruncatedBLAKE3` which is useful for using BLAKE3 as the hasher type for `HashBuilder` with non-default hash sizes.
* Make `MD5Result` inherit from `std::array<uint8_t, 16>` which improves & simplifies its API.
Differential Revision: https://reviews.llvm.org/D123100
NFC
When no actual change happens there's no need to notify the
observers about the fact the register class is being constrained.
So we should avoid notifying observers when no change has
happened, because this can dramatically affect compile
time for particular test cases.
Reviewed By: dsanders, arsenm
Differential Revision: https://reviews.llvm.org/D122615
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by linear (non-CFG
aware) nature of the unwind tables.
Unlike the `CFIInstrInserer` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustement and save locations of SVE
registers.
This pass takes advantage of the constraints that LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):
* there is a single basic block, containing the function prologue
* possibly multiple epilogue blocks, where each epilogue block is
complete and self-contained, i.e. CSR restore instructions (and the
corresponding CFI instructions are not split across two or more
blocks.
* prologue and epilogue blocks are outside of any loops
Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:
- "has a call frame", if the function has executed the prologue, or
has not executed any epilogue
- "does not have a call frame", if the function has not executed the
prologue, or has executed an epilogue
These properties can be computed for each basic block by a single RPO
traversal.
In order to accommodate backends which do not generate unwind info in
epilogues we compute an additional property "strong no call frame on
entry" which is set for the entry point of the function and for every
block reachable from the entry along a path that does not execute the
prologue. If this property holds, it takes precedence over the "has a
call frame" property.
From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.
Where these states differ, we insert compensating CFI instructions,
which come in two flavours:
- CFI instructions, which reset the unwind table state to the
initial one. This is done by a target specific hook and is
expected to be trivial to implement, for example it could be:
```
.cfi_def_cfa <sp>, 0
.cfi_same_value <rN>
.cfi_same_value <rN-1>
...
```
where `<rN>` are the callee-saved registers.
- CFI instructions, which reset the unwind table state to the one
created by the function prologue. These are the sequence:
```
.cfi_restore_state
.cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.
Reviewed By: MaskRay, danielkiss, chill
Differential Revision: https://reviews.llvm.org/D114545
The reason why I am making this change is that before this commit,
EmitFuncArgumentDbgValue relied on a boolean flag IsDbgDeclare both to signal
that a DBG_VALUE should be made to be indirect /and/ that the original intrinsic
was a dbg.declare. This is no longer always true if we add support for handling
dbg.addr since we will have an indirect DBG_VALUE that is a different intrinsic
from dbg.declare.
With that in mind, in this NFC patch, we prepare for future fixes by introducing
a 3 case-enum argument to EmitFuncArgumentDbgValue that allows the caller to
explicitly specify how the argument's DBG_VALUE should be emitted. This then
allows us to turn the indirect checks into a != FuncArgumentDbgValueKind::Value
and prepare us for a future where we add support here for llvm.dbg.addr
directly.
rdar://83957028
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D122945
If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.
Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.
This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122933
D122053 set the ExtendType for ConstantSDNodes in getCopyToRegs to
ZERO_EXTEND to match assumptions in ComputePHILiveOutRegInfo. PHIs
are probably not the only way ConstantSDNodeNodes can get to
getCopyToRegs.
This patch adds an ExtendType parameter to CopyValueToVirtualRegister and
has HandlePHINodesInSuccessorBlocks pass ISD::ZERO_EXTEND for ConstantInts.
This way we only affect ConstantSDNodes for PHIs.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122171
This is an extension of D70965 to avoid creating a mathlib
call where it did not exist in the original source. Also see
D70852 for discussion about an alternative proposal that was
abandoned.
In the motivating bug report:
https://github.com/llvm/llvm-project/issues/54554
...we also have a more general issue about handling "no-builtin" options.
Differential Revision: https://reviews.llvm.org/D122610
When shifting by a byte-multiple:
bswap (shl X, C) --> lshr (bswap X), C
bswap (lshr X, C) --> shl (bswap X), C
This is the backend version of D122010 and an alternative
suggested in D120648.
There's an extra check to make sure the shift amount is
valid that was not in the rough draft.
I'm not sure if there is a larger motivating case for RISCV (bug report?),
but the ARM diffs show a benefit from having a late version of the
transform (because we do not combine the loads in IR).
Differential Revision: https://reviews.llvm.org/D122655
This patch fixes a (seemingly very rare) crash during vector constant
folding introduced in D113300.
Normally, during legalization, if we create an illegally-typed node during
a failed attempt at constant folding it's cleaned up before being
visited, due to it having no uses.
If, however, an illegally-typed node is created during one round of
legalization and isn't cleaned up, it's possible for a second round of
legalization to create new illegally-typed nodes which add extra uses to
the old illegal nodes. This means that we can end up visiting the old
nodes before they're known to be dead, at which point we crash.
I'm not happy about this fix. Creating illegal types at all seems like a
bad idea, but we all-too-often rely on illegal constants being
successfully folded and being fixed up afterwards. However, we can't
rely on constant folding actually happening, and we don't have a
foolproof way of peering into the future.
Perhaps the correct fix is to revisit the node-iteration order during
legalization, ensuring we visit all uses of nodes before the nodes
themselves. Or alternatively we could try and clean up dead nodes
immediately after failing constant folding.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122382
Modified DAGCombiner to pass the shift the bittest input and the shift amount
to hasBitTest. This matches the other call to hasBitTest in TargetLowering.h
This is an alternative to D122454.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122458
We could only do this in limited ways (since we emit the TUs first, we
can't use ref_addr (& we can't use that in Split DWARF either) - so we
had to synthesize declarations into the TUs) and they were ambiguous in
some cases (if the CU type had internal linkage, parsing the TU would
require knowing which CU was referencing the TU to know which type the
declaration was for, which seems not-ideal). So to avoid all that, let's
just not reference types defined in the CU from TUs - instead moving the
TU type into the CU (recursively).
This does increase debug info size (by pulling more things out of type
units, into the compile unit) - about 2% of uncompressed dwp file size
for clang -O0 -g -gsplit-dwarf. (5% .debug_info.dwo section size
increase in the .dwp)
There is a case when a function has pseudo probe intrinsics but the module it resides does not have the probe desc. This could happen when the current module is not built with `-fpseudo-probe-for-profiling` while a function in it calls some other function from a probed module. In thinLTO mode, the callee function could be imported and inlined into the current function.
While this is undefined behavior, I'm fixing the asm printer to not ICE and warn user about this.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D121737
The alignment needs to be part of the folding set hash. This is
handled by getAssertAlign when nodes are created, but needs to repeated here.
No test case as I found it as part of a very early experimental patch.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D122279
Non-static class members declared under #ifndef NDEBUG should be declared
under #if LLVM_ENABLE_ABI_BREAKING_CHECKS to make headers library-friendly and
allow cross-linking, as discussed in D120714.
Differential Revision: https://reviews.llvm.org/D121549
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
Instead of using operator[], use DenseMap::find to prevent default
constructing an entry if it isn't already in the map.
Also simplify a condition to check for 0 instead of a virtual register.
I'm pretty sure we can only get 0 or a virtual register out of the value
map.
Sinking must check for interference between the block prologue
and the instruction being sunk.
Specifically check for clobbering of uses by the prologue, and
overwrites to prologue defined registers by the sunk instruction.
Reviewed By: rampitec, ruiling
Differential Revision: https://reviews.llvm.org/D121277
This is only called for instructions and the caller is already holding
an Instruction *. This makes the code more explicit and makes it
obvious the code doesn't make decisions about constants.
Change the implementation of isForwardableRegClassCopy so that it
does not rely on getMinimalPhysRegClass. Instead, iterate over all
classes looking for any that satisfy a required property.
NFCI on current upstream targets, but this copes better with
downstream AMDGPU changes where some new smaller classes have been
introduced, which was breaking regclass equality tests in the old
code like:
if (UseDstRC != CrossCopyRC && CopyDstRC == CrossCopyRC)
Differential Revision: https://reviews.llvm.org/D121903
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported to extensions
from legal types.
Both of normal and masked loads test cases added to guard compile crash.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
Add shl/srl/sra to the list of ops that we canonicalize with a select to expose an identity merge
Differential Revision: https://reviews.llvm.org/D122070
ComputePHILiveOutRegInfo assumes that constant incoming values to
Phis will be zero extended if they aren't a legal type. To guarantee
that we should zero_extend rather than any_extend constants.
This fixes a bug for RISCV where any_extend of constants can be
treated as a sign_extend.
Differential Revision: https://reviews.llvm.org/D122053
Fixes llvm-clang-x86_64-expensive-checks-debian failure with 2f497ec3.
expandAtomicStore always modifies the function, so make sure we set
MadeChange unconditionally. Not sure how nobody else has stumbled over
this before.
When generating a all-one mask value whose bitwidth is larger than 64, signed extension should be used rather then zero extension.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D120865
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
This reverts commit c46aab01c0.
This evidently blocks compiling in some cases that used to work
before. I'm also not fully convinced this is the correct place to fix
this problem.
This patch adjusts what location is picked for a known variable value --
preferring to leave locations on the stack, even when a value is re-loaded
into a register. The benefit is reduced location list entropy, on a
clang-3.4 build I found that .debug_loclists reduces in size by 6%, from
29Mb down to 27Mb.
Testing: a few tests need the stack slot to be written to explicitly, to
force LiveDebugValues into restoring the variable location to a register.
I've added an explicit test for the desired behaviour in
livedebugvalues_recover_clobbers.mir .
Differential Revision: https://reviews.llvm.org/D120732
This fixes a reported bug that caused an infinite loop during the
SelectionDAG optimization phase in ISel, by creating an overridable hook
in `TargetLowering` that allows us to bail out from running
`SimplifyDemandedVectorElts`.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D121869
If we promote the ABS and then Expand in LegalizeDAG, then both the
sra and the xor will have their inputs sign extended. This generates
extra code on RISCV which lacks an i8 or i16 sign extend instructon.
If we expand during type legalization, then only the sra will get its
input sign extended. RISCV is able to combine this with the sra by
doing a shift left followed by an sra.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121664
RISCV strong prefers i32 values be sign extended to i64. This combine
was always zero extending the constant using APInt methods.
This adjusts the code so that it calls getNode using ISD::ANY_EXTEND instead.
getNode will call TLI.isSExtCheaperThanZExt to decide how to handle
the constant.
Tests were copied from D121598 where I noticed that we were creating
constants that were hard to materialize.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121650
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121327
The existing volatile checks only handle aliasing hazards between stores,
but that isn't enough since by that point volatile stores may have already
been added to the current candidate group.
Discussed extensively on D98232. The functionality introduced in D35816
never worked correctly. In D98232, it was fixed, but, as it was
introducing a large compile-time regression, and the value of the
original patch was called into doubt, we disabled it by default
everywhere. A year later, it appears that caused no grief, so it seems
safe to remove the disabled code.
This should be accompanied by re-opening bug 26810.
Differential Revision: https://reviews.llvm.org/D121128
We do not have general reassociation here (and probably
do not need it), but I noticed these were missing in
patches/tests motivated by D111530, so we can at
least handle the simplest patterns.
The VE test diff looks correct, but we miss that
pattern in IR currently:
https://alive2.llvm.org/ce/z/u66_PM
`DIE::getUnitDie` looks up parent DIE until compile unit or type unit is found. However for skeleton CU with debug fission, we would have DW_TAG_skeleton_unit instead of DW_TAG_compile_unit as top level DIE.
This change fixes the look up so we can get DW_TAG_skeleton_unit as UnitDie for skeleton CU.
Differential Revision: https://reviews.llvm.org/D120610
This patch introduces two new experimental IR intrinsics and SDAG nodes
to represent vector strided loads and stores.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D114884
SDNodes with different target flags may now be folded together
rightfully resulting in the assertion in the refineAlignment.
Folding nodes with different target flags may result in the
wrong load instructions produced at least on the AMDGPU.
Fixes: SWDEV-326805
Differential Revision: https://reviews.llvm.org/D121335
This is another fold generalized from D111530.
We can find a common source for a rotate operation hidden inside an 'or':
https://alive2.llvm.org/ce/z/9pV8hn
Deciding when this is profitable vs. a funnel-shift is tricky, but this
does not show any regressions: if a target has a rotate but it does not
have a funnel-shift, then try to form the rotate here. That is why we
don't have x86 test diffs for the scalar tests that are duplicated from
AArch64 ( 74a65e3834 ) - shld/shrd are available. That also makes it
difficult to show vector diffs - the only case where I found a diff was
on x86 AVX512 or XOP with i64 elements.
There's an additional check for a legal type to avoid a problem seen
with x86-32 where we form a 64-bit rotate but then it gets split
inefficiently. We might avoid that by adding more rotate folds, but
I didn't check to see what is missing on that path.
This gets most of the motivating patterns for AArch64 / ARM that are in
D111530.
We still need a couple of enhancements to setcc pattern matching with
rotate/funnel-shift to get the rest.
Differential Revision: https://reviews.llvm.org/D120933
This was disabled in 2acea2786b as a
work-around for Issue #31491. I've reduced the test case from that bug
and confirmed that it is now fixed.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D120866
This relands commit 7313474319.
It failed on Windows/Mac because `-fjmc` is only checked for ELF targets.
Check the flag unconditionally instead and issue a warning for non-ELF targets.
Previously we used sra+add+xor if ADDCARRY is supported. This changes
to sra+xor+sub is SUBCARRY is available.
This is consistent with the recent change to the default expansion
in LegalizeDAG.
Differential Revision: https://reviews.llvm.org/D121039
The motivation is to enable the MSVC-style JMC instrumentation usable by a ELF-based
debugger. Since there is no prior experience implementing JMC feature for ELF-based
debugger, it might be better to just reuse existing MSVC-style JMC instrumentation.
For debuggers that support both ELF&COFF (like lldb), the JMC implementation might
be shared between ELF&COFF. If this is found to inadequate, it is pretty low-cost
switching to alternatives.
Implementation:
- The '-fjmc' is already a driver and cc1 flag. Wire it up for ELF in the driver.
- Refactor the JMC instrumentation pass a little bit.
- The ELF handling is different from MSVC in two places:
* the flag section name is ".just.my.code" instead of ".msvcjmc"
* the way default function is provided: MSVC uses /alternatename; ELF uses weak function.
Based on D118428.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D119910
When inserting undef into buildvectors created from shuffles of
buildvectors, we convert elements to the largest needed type. This had
the effect of converting undef into 0, which isn't needed as the
buildvector implicitly truncates and trunc(zext(undef)) == undef.
Differential Revision: https://reviews.llvm.org/D121002
This extends acb96ffd14 to 'and' and 'xor' opcodes.
Copying from that message:
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
When lowering LLVM-IR to instruction referencing stuff, if a value is
defined by a COPY, we try and follow the register definitions back to where
the value was defined, and build an instruction reference to that
instruction. In a few scenarios (such as arguments), this isn't possible.
I added some assertions to catch cases that weren't explicitly whitelisted.
Over the course of a few months, several more scenarios have cropped up,
the lastest is the llvm.read_register intrinsic, which lets LLVM-IR read an
arbitary register at any point. In the face of this, there's little point
in validating whether debug-info reads a register in an expected scenario.
Thus: this patch just deletes those assertions, and adds a regression test
to check that something is done with the llvm.read_register intrinsic.
Fixes#54190
Differential Revision: https://reviews.llvm.org/D121001
When triggered during operation legalisation the affected combine
generates a splat_vector that when custom lowered for SVE fixed
length code generation, results in the original precombine sequence
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is requirted to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
growRegion() does not scale in code with BBs with a very large number of edges.
In such code growRegion() becomes a compile-time bottleneck, consuming 60% of
the total compilation time.
This patch adds a limit to the complexity of growRegion() by incrementing a counter
in each iteration. We bail out once the limit is reached.
Differential Revision: https://reviews.llvm.org/D120752
https://alive2.llvm.org/ce/z/mJP7XP
This can be viewed as expanding the compare into and/or-of-compares:
https://alive2.llvm.org/ce/z/bkZYWE
followed by reduction of each compare.
This could be extended in several ways:
1. There's a (X & Y) == -1 sibling.
2. We can recurse through more than 1 'or'.
3. The fold could be generalized beyond rotates - any operation that
only changes the order of bits (bswap, bitreverse).
This is a transform noted in D111530.
Instead of emitting 0 > Hi, emit Hi < 0. If Hi needs to be expanded again
this will allow the special case for sign bit tests in ExpandIntOp_SETCC
to trigger.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120761
Also, changes how the CSR loop is indexed, which should avoid bugs like the one fixed by rG4a57bb5a3b74bdad9b0518009a7d7ac7ca2ac650
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D120668
This is an alternative to D120330, which disables MachineSink for
functions with irreducible cycles entirely. This avoids both the
correctness problem, and ensures we don't perform non-profitable
sinks into cycles. At the same time, it may also disable
profitable sinks in the same function. This can be made more
precise by using MachineCycleInfo in the future.
Fixes https://github.com/llvm/llvm-project/issues/53990.
Differential Revision: https://reviews.llvm.org/D120800
Currently we only check for splat shuffles, this extends it to see if the source operand is a splat across the demanded elts based upon the shuffle mask
This patch adds support for recognising vector splats by peeking through bitcasts to vectors with smaller element types - if all the offset subelements are splats then the bitcasted vector is a splat as well.
We don't have great coverage for isSplatValue so I've made this pretty specific to the use case I'm trying to fix - regressions in some vXi64 vector shift by splat cases that 32-bit x86 doesn't recognise because the shift amount buildvector has been type legalised to v2Xi32.
We can add further support (floats, bitcast from larger element types, undef elements) when we have actual test coverage.
Differential Revision: https://reviews.llvm.org/D120553
This wraps up from D119053. The 2 headers are moved as described,
fixed file headers and include guards, updated all files where the old
paths were detected (simple grep through the repo), and `clang-format`-ed it all.
Differential Revision: https://reviews.llvm.org/D119876
If the types aren't legal, the expansions may get type legalized in a
different way preventing code sharing. If the type is legal, we will
share some instructions between the two expansions, but we will need an
extra register.
Since we don't appear to fold (neg (sub A, B)) if the sub has an
additional user, I think it makes sense not to expand NABS.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120513
InstrRefBasedLDV allocates some big tables of ValueIDNum, to store live-in
and live-out block values in, that then get passed around as pointers
everywhere. This patch wraps the allocation in a std::unique_ptr, names
some types based on unique_ptr, and passes references to those around
instead. There's no functional change, but it makes it clearer to the
reader that references to these tables are borrowed rather than owned, and
we get some extra validity assertions too.
Differential Revision: https://reviews.llvm.org/D118774
This is the SDAG equivalent of an instcombine transform added with:
fd807601a7
This is another step towards solving #49541 and part of an alternative
set of more general transforms than what is proposed in D111530.
https://alive2.llvm.org/ce/z/ToxaE8
LOGIC (LOGIC (SH X0, Y), Z), (SH X1, Y) --> LOGIC (SH (LOGIC X0, X1), Y), Z
https://alive2.llvm.org/ce/z/QmR9rR
This is a reassociation + factoring fold. The common shift operation is moved
after a bitwise logic op on 2 input operands.
We get simpler cases of these patterns in IR, but I suspect we would miss all
of these exact tests in IR too. We also handle the simpler form of this plus
several other folds in DAGCombiner::hoistLogicOpWithSameOpcodeHands().
This is a partial implementation of a transform suggested in D111530
(only handles 'or' bitwise logic as a first step - need to stamp out more
tests for other opcodes).
Several of the same tests added for D111530 are altered here (but not
fully optimized). I'm not sure yet if this would help/hinder that patch,
but this should be an improvement for all tests added with ecf606cb43
since it removes a shift operation in those examples.
Differential Revision: https://reviews.llvm.org/D120516
IR level addDiscriminator pass is guarded by DebugInfoForProfiling
(set by option -fdebug-info-for-profiling).
This patch syncs the logic for the MIR and IR level implementations.
Differential Revision: https://reviews.llvm.org/D120536
If the shl is at least half the bitwidth (i.e. the lower half of the bswap source is zero), then we can reduce the shift and perform the bswap at half the bitwidth and just zero extend.
Based off PR51391 + PR53867
Differential Revision: https://reviews.llvm.org/D120192
This is the SDAG translation of D120253 :
https://alive2.llvm.org/ce/z/qHpmNn
The SDAG nodes can have different operand types than the result value.
We can see an example of that with AArch64 - the funnel shift amount
is an i64 rather than i32.
We may need to make that match even more flexible to handle
post-legalization nodes, but I have not stepped into that yet.
Differential Revision: https://reviews.llvm.org/D120264
When parsing MachineMemOperands, MIRParser treated the "align" keyword
the same as "basealign". Really "basealign" should specify the
alignment of the MachinePointerInfo base value, and "align" should
specify the alignment of that base value plus the offset.
This worked OK when the specified alignment was no larger than the
alignment of the offset, but in cases like this it just caused
confusion:
STW killed %18, 4, %stack.1.ap2.i.i :: (store (s32) into %stack.1.ap2.i.i + 4, align 8)
MIRPrinter would never have printed this, with an offset of 4 but an
align of 8, so it must have been written by hand. MIRParser would
interpret "align 8" as "basealign 8", but I think it is better to give
an error and force the user to write "basealign 8" if that is what they
really meant.
Differential Revision: https://reviews.llvm.org/D120400
Change-Id: I7eeeefc55c2df3554ba8d89f8809a2f45ada32d8
The `SplitIndirectBrCriticalEdges` function was originally designed for
`CodeGenPrepare` and skipped splitting of edges when the destination
block didn't contain any `PHI` instructions. This only makes sense when
reducing COPYs like `CodeGenPrepare`. In the case of
`PGOInstrumentation` or `GCOVProfiling` it would result in missed
counters and wrong result in functions with computed goto.
Differential Revision: https://reviews.llvm.org/D120096
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
In combineCarryDiamond() use getAsCarry() to find more candidates for being a carry flag.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118362
This is a fix for a regression discussed in:
https://github.com/llvm/llvm-project/issues/53829
We cleared more high multiplier bits with 995d400,
but that can lead to worse codegen because we would fail
to recognize the now disguised multiplication by neg-power-of-2
as a shift-left. The problem exists independently of the IR
change in the case that the multiply already had cleared high
bits. We also convert shl+sub into mul+add in instcombine's
negator.
This patch fills in the high-bits to see the shift transform
opportunity. Alive2 attempt to show correctness:
https://alive2.llvm.org/ce/z/GgSKVX
The AArch64, RISCV, and MIPS diffs look like clear wins. The
x86 code requires an extra move register in the minimal examples,
but it's still an improvement to get rid of the multiply on all
CPUs that I am aware of (because multiply is never as fast as a
shift).
There's a potential follow-up noted by the TODO comment. We
should already convert that pattern into shl+add in IR, so
it's probably not common:
https://alive2.llvm.org/ce/z/7QY_GaFixes#53829
Differential Revision: https://reviews.llvm.org/D120216
As requested in D107955 <https://reviews.llvm.org/D107955>, this patch
splits off the `MC` and `CodeGen` parts and adds a testcase.
Tested on `sparcv9-sun-solaris2.11`, `amd64-pc-solaris2.11`, and
`x86_64-pc-linux-gnu`.
Differential Revision: https://reviews.llvm.org/D120318
Conceptually, the new encoding emits the offsets and sizes as label differences between each two consecutive basic block begin and end label. When decoding, the offsets must be aggregated along with basic block sizes to calculate the final relative-to-function offsets of basic blocks.
This encoding uses smaller values compared to the existing one (offsets relative to function symbol).
Smaller values tend to occupy fewer bytes in ULEB128 encoding. As a result, we get about 25% reduction
in the size of the bb-address-map section (reduction from about 9MB to 7MB).
Reviewed By: tmsriram, jhenderson
Differential Revision: https://reviews.llvm.org/D106421
This adds very basic support for hashing MachineBasicBlock
and MachineFunction, for use in MachineFunctionPass to
detect passes that modify the MachineFunction wrongly.
Differential Revision: https://reviews.llvm.org/D120122
We use offloading sections in the new Clang driver scheme to embed
device code into the host. We later use these sections to link the
device image, after which point they are completely unused and should
not be loaded into memory if they are still in the executable.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D120275
We found a case in the Swift benchmarks where the MachineOutliner introduces
about a 20% compile time overhead in comparison to building without the
MachineOutliner.
The origin of this slowdown is that the benchmark has long blocks which incur
lots of LRU checks for lots of candidates.
Imagine a case like this:
```
bb:
i1
i2
i3
...
i123456
```
Now imagine that all of the outlining candidates appear early in the block, and
that something like, say, NZCV is defined at the end of the block.
The outliner has to check liveness for certain registers across all candidates,
because outlining from areas where those registers are used is unsafe at call
boundaries.
This is fairly wasteful because in the previously-described case, the outlining
candidates will never appear in an area where those registers are live.
To avoid this, precalculate areas where we will consider outlining from.
Anything outside of these areas is mapped to illegal and not included in the
outlining search space. This allows us to reduce the size of the outliner's
suffix tree as well, giving us a potential memory win.
By precalculating areas, we can also optimize other checks too, like whether
or not LR is live across an outlining candidate.
Doing all of this is about a 16% compile time improvement on the case.
This is likely useful for other targets (e.g. ARM + RISCV) as well, but for now,
this only implements the AArch64 path. The original "is the MBB safe" method
still works as before.
Previous we used sra (X, size(X)-1); xor (add (X, Y), Y).
By placing sub at the end, we allow RISCV to combine sign_extend_inreg
with it to form subw.
Some X86 tests for Z - abs(X) seem to have improved as well.
Other targets look to be a wash.
I had to modify ARM's abs matching code to match from sub instead of
xor. Maybe instead ISD::ABS should be made legal. I'll try that in
parallel to this patch.
This is an alternative to D119099 which was focused on RISCV only.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119171
This code was detecting whether the value returned by getShiftAmountTy
can represent all shift amounts. If not, it would use MVT::i32 as a
placeholder. getShiftAmountTy was updated last year to return i32
if the type returned by the target couldn't represent all values.
This means the MVT::i32 case here is dead and can the logic can
be simplified.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120164
If the "reciprocal-estimates" attribute is present and it doesn't
contain "all", "none", or "default", we previously crashed on f16
operations.
This patch addes an 'h' suffix' to prevent the crash.
I've added simple tests that just enable the estimate for all
vec-sqrt and one test case that explicitly tests the new 'h' suffix
to override the default steps.
There may be some frontend change needed to, but I haven't checked
that yet.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D120158
The code was considering shifts by an about larger than the number of
bits in the original VT to be out of range. Shifts exactly equal to
the original bit width are also out of range.
I don't know how to test this. DAGCombiner should usually fold this
away. I just noticed while looking for something else in this code. The
llvm-cov report shows that we don't have coverage for out of range shifts here.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D120170
getShiftAmountTy will return MVT::i32 if the shift amount
coming from the target's getScalarShiftAmountTy can't reprsent
all possible values. That should eliminate the need to use the
pointer type which is what we do when LegalTypes is false.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D120165
If the "reciprocal-estimates" attribute is present and it doesn't
contain "all", "none", or "default", we previously crashed on f16
operations.
This patch addes an 'h' suffix' to prevent the crash.
I've added simple tests that just enable the estimate for all
vec-sqrt and one test case that explicitly tests the new 'h' suffix
to override the default steps.
There may be some frontend change needed to, but I haven't checked
that yet.
Differential Revision: https://reviews.llvm.org/D120158
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP
There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).
The other diffs are presumed neutral.
Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).
This helps prevent regressions from a potential fold for
issue #53829.
Useful for debugging + evaluating improvements to the outliner.
Stats are the number of illegal, legal, and invisible instructions in the
unsigned vector, and it's total length.
This makes three thread local variables (`__THREW__`, `__threwValue`,
and `__wasm_lpad_context`) unconditionally thread local. If the target
doesn't support TLS, they will be downgraded to normal variables in
`stripThreadLocals`. This makes the object not linkable with other
objects using shared memory, which is what we intend here; these
variables should be thread local when used with shared memory. This is
what we initially tried in D88262.
But D88323 changed this: It only created these variables when threads
were supported, because `__THREW__` and `__threwValue` were always
generated even if Emscripten EH/SjLj was not used, making all objects
built without threads not linkable with shared memory, which was too
restrictive. But sometimes this is not safe. If we build an object using
variables such as `__THREW__` without threads, it can be linked to other
objects using shared memory, because the original object's `__THREW__`
was not created thread local to begin with.
So this CL basically reverts D88323 with some additional improvements:
- This checks each of the functions and global variables created within
`LowerEmscriptenEHSjLj` pass and removes it if it's not used at the
end of the pass. So only modules using those variables will be
affected.
- Moves `CoalesceFeaturesAndStripAtomics` and `AtomicExpand` passes
after all other IR pasess that can create thread local variables. It
is not sufficient to move them to the end of `addIRPasses`, because
`__wasm_lpad_context` is created in `WasmEHPrepare`, which runs inside
`addPassesToHandleExceptions`, which runs before `addISelPrepare`. So
we override `addISelPrepare` and move atomic/TLS stripping and
expanding passes there.
This also removes merges `TLS` and `NO-TLS` FileCheck lines into one
`CHECK` line, because in the bitcode level we always create them as
thread local. Also some function declarations are deleted `CHECK` lines
because they are unused.
Reviewed By: tlively, sbc100
Differential Revision: https://reviews.llvm.org/D120013
This example is not compilable without handling eviction of specific
subregisters. Last chance recoloring was deciding it could try
evicting an overlapping superregister, which doesn't help make any
progress. The LiveIntervalUnion would then assert due to an
overlapping / identical range when trying the new assignment.
Unfortunately this is also producing a verifier error after the
allocation fails. I've seen a number of these, and not sure if we
should just start deleting the function on error rather than trying to
figure out how to put together valid MIR.
I'm not super confident this is the right place to fix this. I also
have a number of failing testcases I need to fix by handling partial
evictions of superregisters.
The current ABD combine doesn't quite work for SVE because only a
single scalable vector per scalar integer type is legal (e.g. for
i32, <vscale x 4 x i32> is the only legal scalable vector type).
This patch extends the combine to also trigger for the cases when
operand extension must be retained.
Differential Revision: https://reviews.llvm.org/D115739
When doing SelectionDAG::ReplaceAllUsesOfValuesWith a worklist is
prepared containing all users that should be updated. Then we use
the RemoveNodeFromCSEMaps/AddModifiedNodeToCSEMaps helpers to handle
recursive CSE updates while doing the replacements.
This patch aims at solving a problem that could arise if the recursive
CSE updates would result in an SDNode present in the worklist is being
removed as a side-effect of morphing a prio user in the worklist.
To examplify such a scenario, imagine that we have these nodes in
the DAG
t12: i64 = add t8, t11
t13: i64 = add t12, t8
t14: i64 = add t11, t11
t15: i64 = add t14, t8
t16: i64 = sub t13, t15
and that the t8 uses should be replaced by t11. An initial worklist
(listing the users that should be morphed) could be [t12, t13, t15].
When updating t12 we get
t12: i64 = add t11, t11
which results in a CSE update that replaces t14 by t12, so we get
t15: i64 = add t12, t8
which results in a CSE update that replaces t13 by t12, so we get
t16: i64 = sub t12, t15
and then t13 is removed given that it was the last use of t13.
So when being done with the updates triggered by rewriting the use
of t8 in t12 the t13 node no longer exist. And we used to end up
hitting an assertion when continuing with the worklist aiming at
replacing the t8 uses in t13.
The solution is based on using a DAGUpdateListener, making sure that
we prune a user from the worklist if it is removed during the
recursive CSE updates.
The bug was found using an OOT target. I think the problem is quite
old, even if the particular intree target reproducer added in this
patch seem to pass when using LLVM 13.0.0.
Differential Revision: https://reviews.llvm.org/D119088
This makes `__wasm_lpad_context`, a struct that is used as a
communication channel between compiler-generated code and personality
function in libunwind, thread local. The library code will be changed to
thread local in the emscripten side.
Reviewed By: sbc100, tlively
Differential Revision: https://reviews.llvm.org/D119803
For AMDGPU the insertion point for a block may not be the first
non-PHI instruction. This happens when a block contains EXEC
mask manipulation related to control flow (converging lanes).
Use SkipPHIsAndLabels to determine the block insertion point
so that the target can skip any block prologue instructions.
Reviewed By: rampitec, ruiling
Differential Revision: https://reviews.llvm.org/D119399
Layering-wise, it seems RegisterBank stuff fits under CodeGen, like
other target abstraction.
In particular, TargetSubtargetInfo has a getRegBankInfo member, but
using that object requires making sure GlobalISel is linked, which is
not always the case (e.g. llvm-jitlink doesn't).
Differential Revision: https://reviews.llvm.org/D119053
This moves the matching of AVGFloor and AVGCeil into a place where
demand bit are available, so that it can detect more cases for more
folds. It changes the transform to start from a shift, not from a
truncate. We match the pattern shr(add(ext(A), ext(B)), 1), transforming
to ext(hadd(A, B)).
For signed values, because only the bottom bits are demanded llvm will
transform the above to use a lshr too, as opposed to ashr. In order to
correctly detect the hadd we need to know the demanded bits to turn it
back. Depending on whether the shift is signed (ashr) or logical (lshr),
and the extensions are signed or unsigned we can create different nodes.
If the shift is signed:
Needs >= 2 sign bits. https://alive2.llvm.org/ce/z/h4gQAW generating signed rhadd.
Needs >= 2 zero bits. https://alive2.llvm.org/ce/z/B64DUA generating unsigned rhadd.
If the shift is unsigned:
Needs >= 1 zero bits. https://alive2.llvm.org/ce/z/ByD8sj generating unsigned rhadd.
Needs 1 demanded bit zero and >= 2 sign bits https://alive2.llvm.org/ce/z/hvPGxX and
https://alive2.llvm.org/ce/z/32P5n1 generating signed rhadd.
Differential Revision: https://reviews.llvm.org/D119072
We have the `clang -cc1` command-line option `-funwind-tables=1|2` and
the codegen option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind
tables (1) or asynchronous unwind tables (2)`. However, this is
encoded in LLVM IR by the presence or the absence of the `uwtable`
attribute, i.e. we lose the information whether to generate want just
some unwind tables or asynchronous unwind tables.
Asynchronous unwind tables take more space in the runtime image, I'd
estimate something like 80-90% more, as the difference is adding
roughly the same number of CFI directives as for prologues, only a bit
simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
more, if you consider tail duplication of epilogue blocks.
Asynchronous unwind tables could also restrict code generation to
having only a finite number of frame pointer adjustments (an example
of *not* having a finite number of `SP` adjustments is on AArch64 when
untagging the stack (MTE) in some cases the compiler can modify `SP`
in a loop).
Having the CFI precise up to an instruction generally also means one
cannot bundle together CFI instructions once the prologue is done,
they need to be interspersed with ordinary instructions, which means
extra `DW_CFA_advance_loc` commands, further increasing the unwind
tables size.
That is to say, async unwind tables impose a non-negligible overhead,
yet for the most common use cases (like C++ exceptions), they are not
even needed.
This patch extends the `uwtable` attribute with an optional
value:
- `uwtable` (default to `async`)
- `uwtable(sync)`, synchronous unwind tables
- `uwtable(async)`, asynchronous (instruction precise) unwind tables
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D114543
This adds very basic combines for AVG nodes, mostly for constant folding
and handling degenerate (zero) cases. The code performs mostly the same
transforms as visitMULHS, adjusted for AVG nodes.
Constant folding extends to a higher bitwidth and drops the lowest bit.
For undef nodes, `avg undef, x` is transformed to x. There is also a
transform for `avgfloor x, 0` transforming to `shr x, 1`.
Differential Revision: https://reviews.llvm.org/D119559
When deciding where to split a block to insert stack guard checks, we should
move past any debug instructions we see that might (e.g.) be separating a tail
call from its frame wrangling.
This time, also don't run off the front of a basic block.
The current FastISel code reuses the register for a bitcast that
doesn't change the IR type, but uses a reg-to-reg copy if it
changes the IR type without changing the MVT. However, we can
simply reuse the register in that case as well.
In particular, this avoids unnecessary reg-to-reg copies for pointer
bitcasts. This was found while inspecting O0 codegen differences
between typed and opaque pointers.
Differential Revision: https://reviews.llvm.org/D119432
This enables fshl to be matched earlier on X86
%6 = lshr i32 %3, 1
%7 = select i1 %4, i32 -2147483648, i32 0
%8 = or i32 %6, %7
X86 uses i8 for shift amounts. SelectionDAGBuilder creates the
ISD::SRL with an i8 shift type. DAGCombiner turns the select into
an ISD::SHL. Prior to this patch it would use i32 for the shift
amount. fshl matching failed because the shift amounts have different
types. LegalizeDAG fixes the ISD::SHL shift amount to i8. This
allowed fshl matching to succeed.
With this patch, the ISD::SHL will be created with an i8 shift
amount. This allows the fshl to match immediately.
No test case beause we still end up with a fshl either way.
I have not found a way to expose a difference for this patch in a test
because it only triggers for a one-use load, but this is the code that
was adapted into D118376 and caused miscompiles. The new code pattern
is the same as what we do in narrowExtractedVectorLoad() (reduces load
width for a subvector extract).
This removes seemingly unnecessary manual worklist management and fixes
the chain updating via "SelectionDAG::makeEquivalentMemoryOrdering()".
Differential Revision: https://reviews.llvm.org/D119549
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for a AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate
lshr==ashr, as the top bits are not demanded. An AVGCEIL also includes
an extra rounding, so includes an extra add of 1.
Differential Revision: https://reviews.llvm.org/D106237
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.
Differential Revision: https://reviews.llvm.org/D110579
When deciding where to split a block to insert stack guard checks, we should
move past any debug instructions we see that might (e.g.) be separating a tail
call from its frame wrangling.
As usual with that header cleanup series, some implicit dependencies now need to
be explicit:
llvm/MC/MCParser/MCAsmParser.h no longer includes llvm/MC/MCParser/MCAsmLexer.h
Preprocessed lines to build llvm on my setup:
after: 1068185081
before: 1068324320
So no compile time benefit to expect, but we still get the looser coupling
between files which is great.
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D119359
The introduction and some examples are on this page:
https://devblogs.microsoft.com/cppblog/announcing-jmc-stepping-in-visual-studio/
The `/JMC` flag enables these instrumentations:
- Insert at the beginning of every function immediately after the prologue with
a call to `void __fastcall __CheckForDebuggerJustMyCode(unsigned char *JMC_flag)`.
The argument for `__CheckForDebuggerJustMyCode` is the address of a boolean
global variable (the global variable is initialized to 1) with the name
convention `__<hash>_<filename>`. All such global variables are placed in
the `.msvcjmc` section.
- The `<hash>` part of `__<hash>_<filename>` has a one-to-one mapping
with a directory path. MSVC uses some unknown hashing function. Here I
used DJB.
- Add a dummy/empty COMDAT function `__JustMyCode_Default`.
- Add `/alternatename:__CheckForDebuggerJustMyCode=__JustMyCode_Default` link
option via ".drectve" section. This is to prevent failure in
case `__CheckForDebuggerJustMyCode` is not provided during linking.
Implementation:
All the instrumentations are implemented in an IR codegen pass. The pass is placed immediately before CodeGenPrepare pass. This is to not interfere with mid-end optimizations and make the instrumentation target-independent (I'm still working on an ELF port in a separate patch).
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D118428
It's inevitable that optimisation passes will fail to update debug-info:
when that happens, it's best if the compiler doesn't crash as a result.
Therefore, downgrade a few assertions / failure modes that would crash
when illegal debug-info was seen, to instead drop variable locations. In
practice this means that an instruction reference to a nonexistant or
illegal operand should be tolerated.
Differential Revision: https://reviews.llvm.org/D118998
At -O0 we claim to CSE constants only. I think this should apply to
G_FCONSTANT as well as G_CONSTANT.
Differential Revision: https://reviews.llvm.org/D119344