We should use AAValueSimplify for all value simplification; however,
there was some leftover logic that predates AAValueSimplify in
AAReturnedValues. This removes the AAReturnedValues part and provides a
replacement by making AAValueSimplifyReturned strong enough to handle
all previously covered cases. Further, this improves
AAValueSimplifyCallSiteReturned to handle returned arguments.
AAReturnedValues is now much simpler, and the collected returned
values/instructions now come from the associated function only, making
it much saner. We also no longer have the brittle logic that looks
for unresolved calls; instead, we use AAValueSimplify to handle
recursion.
Useful code has been split into helper functions, e.g., an Attributor
interface to get a simplified value.
Differential Revision: https://reviews.llvm.org/D103860
Like the `llvm::getUnderlyingObjects` helper, the optimistic version
collects objects that might be the base of a given pointer. In contrast
to the llvm variant, the optimistic one will use assumed information,
e.g., about select conditions or dead blocks, to provide a more precise
result.
Differential Revision: https://reviews.llvm.org/D103859
Not all attributes are able to handle the interprocedural step and
follow the uses into a call site. Let them be able to combine call site
uses instead. This might result in some unused values/arguments being
left over, but it removes problems where we misused "is dead" even though
it was actually "is simplified/replaced".
We explicitly check for dead values due to constant propagation in
`AAIsDeadValueImpl::areAllUsesAssumedDead` instead.
Differential Revision: https://reviews.llvm.org/D103858
Broke check-clang, see https://reviews.llvm.org/D102307#2869065
Ran `git revert -n ebbe149a6f08535ede848a531a601ae6591cfbc5..269416d41908bb670f67af689155d5ab8eea689a`
As with other Attributor interfaces we often want to know if assumed
information was used to answer a query. This is important if only
known information is allowed or if known information can lead to an
early fixpoint. The users have been adjusted but none of them utilizes
the new information yet.
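A hedged sketch of the pattern this describes; the flag name follows the convention above, but the surrounding types and signature are illustrative, not the exact Attributor interface:
```
struct Value {
  Value *KnownSimplified = nullptr;   // proven simplification, if any
  Value *AssumedSimplified = nullptr; // optimistic simplification, if any
};

// Returns a simplified value if available and records whether the answer
// rests on assumed (not yet proven) information.
Value *getSimplifiedValue(Value &V, bool &UsedAssumedInformation) {
  if (V.KnownSimplified)
    return V.KnownSimplified;      // known info only; flag stays false
  if (V.AssumedSimplified) {
    UsedAssumedInformation = true; // caller must tolerate later revision
    return V.AssumedSimplified;
  }
  return nullptr;
}
```
A caller that only accepts known information can reject the result whenever the flag was set; a caller that sees an unset flag may be able to reach an early fixpoint.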
In the spirit of TRegions [0], this patch analyzes a kernel and tracks
if it can be executed in SPMD-mode. If so, we flip the arguments of
the __kmpc_target_init and deinit calls to enable the mode. We also
update the `<kernel>_exec_mode` flag to indicate to the runtime we
changed the mode to SPMD.
The code analysis is done interprocedurally by extending the
AAKernelInfo abstract attribute to track SPMD compatibility as well.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D102307
When we talk to outside analyses, e.g., LVI and ScalarEvolution, we need
to be careful with the query. The particular error occurred because we
folded a PHI node before the LVI query, but the value no longer dominated
the context location. This is not supported by LVI, so we have to filter
out these situations before we query the outside analyses.
We have to be careful when we replace values to not use a non-dominating
instruction. It makes sense that simplification offers those as
"simplified values" but we can't manifest them in the IR without PHI
nodes. In the future we should consider adding those PHI nodes.
In the spirit of TRegions [0], this patch creates a custom state
machine for a generic target region based on the potentially called
parallel regions.
The code analysis is done interprocedurally via an abstract attribute
(AAKernelInfo). All outermost parallel regions are collected and we
check if there might be unknown outermost parallel regions for which
we need an indirect call. Other AAKernelInfo extensions are expected.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D101977
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but will also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
In order to simplify future extensions, e.g., the merge of
AAHeapToShared into AAHeapToStack, we reorganize AAHeapToStack and the
state we keep for each malloc-like call. The result is also less
confusing as we only track malloc-like calls, not all calls. Further, we
only perform the updates necessary for a malloc-like call to argue it can go
to the stack, e.g., we won't check all uses if we moved on to the
"must-be-freed" argument.
This patch also uses Attributor helpers to simplify the allocated size,
alignment, and the potentially freed objects.
Overall, this is mostly a reorganization and only the use of the
optimistic helpers should change (=improve) the capabilities a bit.
Differential Revision: https://reviews.llvm.org/D104993
Instead of performing the isMoreProfitable() operation on
InstructionCost::CostTy the operation is performed on InstructionCost
directly, so that it can handle the case where one of the costs is
Invalid.
This patch also changes the CostTy to be int64_t, so that the type is
wide enough to deal with multiplications with e.g. `unsigned MaxTripCount`.
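A minimal sketch of why the comparison belongs on the wrapper rather than on a raw CostTy; the class below is illustrative, not the real llvm::InstructionCost, and ordering Invalid after every valid cost is an assumption made for the example:
```
#include <cstdint>
#include <optional>

class InstructionCost {
  std::optional<int64_t> Cost; // int64_t mirrors the widened CostTy
  InstructionCost() = default;

public:
  InstructionCost(int64_t C) : Cost(C) {}
  static InstructionCost getInvalid() { return InstructionCost(); }
  bool isValid() const { return Cost.has_value(); }

  // Comparing wrappers lets an Invalid cost participate safely: here it
  // orders after every valid cost, so it is never "more profitable".
  friend bool operator<(const InstructionCost &L, const InstructionCost &R) {
    if (!L.isValid())
      return false;
    if (!R.isValid())
      return true;
    return *L.Cost < *R.Cost;
  }
};

bool isMoreProfitable(const InstructionCost &A, const InstructionCost &B) {
  return A < B; // never unwraps to CostTy, so Invalid is handled
}
```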
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D105113
The original motivation for this was to implement moreElementsVector of shuffles
on AArch64, which resulted in complex sequences of artifacts like unmerge(unmerge(concat...))
which the combiner couldn't handle. It seemed here that the better option,
instead of writing ever-more-complex combines, was to have a way to find
the original "non-artifact" source registers for a given definition, walking
through arbitrary expressions of unmerge/concat/insert. As long as the bits
aren't extended or truncated, this is a pretty simple algorithm that avoids
the need for lots of combines and instead jumps straight to the final result
we want.
I've only used this new technique in 2 places within tryCombineUnmerge;
using it in more general situations resulted in infinite loops in AMDGPU. So for now
it's used when we would otherwise fail to combine and that seems to work.
In order to support looking through G_INSERTs, I also had to add it as an
artifact in isArtifact(), which caused a whole lot of issues in tests. AMDGPU
started infinite looping since full legalization of G_INSERT doesn't seem to
be there. To work around this, I've temporarily added a CLI option to use the
old behaviour so that the MIR tests will still run and terminate.
Other minor changes include no longer making >128b G_MERGE/UNMERGE legal.
We never had isel support for that anyway and it was a remnant of the legacy
legalizer rules. However being legal prevented the combiner from checking if it
was dead and deleting them.
Differential Revision: https://reviews.llvm.org/D104355
Renames CommonOrcRuntimeTypes.h to ExecutorAddress.h and moves ExecutorAddress
into the 'orc' namespace (rather than orc::shared).
Also makes ExecutorAddress a class, adds an ExecutorAddrDiff type and some
arithmetic operations on the pair (subtracting two addresses yields an addrdiff,
adding an addrdiff and an address yields an address).
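A rough sketch of that arithmetic (illustrative shapes, not the exact ORC declarations):
```
#include <cstdint>

struct ExecutorAddrDiff {
  uint64_t Value;
};

class ExecutorAddress {
public:
  explicit ExecutorAddress(uint64_t A) : Addr(A) {}
  uint64_t Addr;
};

// Subtracting two addresses yields an addrdiff...
inline ExecutorAddrDiff operator-(ExecutorAddress LHS, ExecutorAddress RHS) {
  return ExecutorAddrDiff{LHS.Addr - RHS.Addr};
}

// ...and adding an addrdiff to an address yields an address.
inline ExecutorAddress operator+(ExecutorAddress A, ExecutorAddrDiff D) {
  return ExecutorAddress(A.Addr + D.Value);
}
```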
Replace the clang builtin function and LLVM intrinsic previously used to select
the f64x2.promote_low_f32x4 instruction with custom combines from standard
SelectionDAG nodes. Implement the new combines to share code with the similar
combines for f64x2.convert_low_i32x4_{s,u}. Resolves PR50232.
Differential Revision: https://reviews.llvm.org/D105675
A followup to the feature added in
69da27c749 that added the optional "start
file name" to match "start line" - but this didn't work with Split DWARF
because of the need for the decl file number resolution code to refer
back to the skeleton unit to find its .debug_line contribution. So this
patch adds the necessary infrastructure to track the skeleton unit
corresponding to a split full unit for the purpose of this lookup.
Rules:
1. SCEVUnknown is a pointer if and only if the LLVM IR value is a
pointer.
2. SCEVPtrToInt is never a pointer.
3. If any other SCEV expression has no pointer operands, the result is
an integer.
4. If a SCEVAddExpr has exactly one pointer operand, the result is a
pointer.
5. If a SCEVAddRecExpr's first operand is a pointer, and it has no other
pointer operands, the result is a pointer.
6. If every operand of a SCEVMinMaxExpr is a pointer, the result is a
pointer.
7. Otherwise, the SCEV expression is invalid.
I'm not sure how useful rule 6 is in practice. If we exclude it, we can
guarantee that ScalarEvolution::getPointerBase always returns a
SCEVUnknown, which might be a helpful property. Anyway, I'll leave that
for a followup.
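Read as an algorithm, the rules amount to a recursive classifier. A self-contained sketch over stand-in types (not the real SCEV classes) might look like:
```
#include <cassert>
#include <vector>

enum class SCEVKind { Unknown, PtrToInt, Add, AddRec, MinMax, Other };

struct SCEV {
  SCEVKind Kind;
  bool IRValueIsPointer = false; // only meaningful for Unknown
  std::vector<const SCEV *> Ops;
};

// Returns true if the expression is pointer-typed; asserts on rule 7.
bool isPointerSCEV(const SCEV &S) {
  if (S.Kind == SCEVKind::Unknown)  // rule 1
    return S.IRValueIsPointer;
  if (S.Kind == SCEVKind::PtrToInt) // rule 2
    return false;

  unsigned NumPtrOps = 0;
  for (const SCEV *Op : S.Ops)
    NumPtrOps += isPointerSCEV(*Op);

  if (NumPtrOps == 0) // rule 3
    return false;
  if (S.Kind == SCEVKind::Add && NumPtrOps == 1) // rule 4
    return true;
  if (S.Kind == SCEVKind::AddRec && NumPtrOps == 1 &&
      isPointerSCEV(*S.Ops.front())) // rule 5
    return true;
  if (S.Kind == SCEVKind::MinMax && NumPtrOps == S.Ops.size()) // rule 6
    return true;
  assert(false && "rule 7: invalid SCEV expression");
  return false;
}
```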
This is basically mop-up at this point; all the changes with significant
functional effects have landed. Some of the remaining changes could be
split off, but I don't see much point.
Differential Revision: https://reviews.llvm.org/D105510
This patch teaches the sample profile loader to merge function
attributes after inlining functions.
Without this patch, the compiler could inline a function requiring the
512-bit vector width into its caller without merging function
attributes, triggering a failure during instruction selection.
Differential Revision: https://reviews.llvm.org/D105729
`LegalizerHelper::insertParts` uses `extractGCDType` on registers split into
a desired type and a smaller leftover type. This is used to populate a list
of registers. Each register in the list will have the same type as returned by
`extractGCDType`.
If we have
- `ResultTy` = s792
- `PartTy` = s64
- `LeftoverTy` = s24
When we call `extractGCDType`, we'll end up with two different types appended
to the list:
Part: gcd(792, 64, 24) => s8
Leftover: gcd(792, 24, 24) => s24
When this happens, we'll hit an assert while trying to build a G_MERGE_VALUES.
This patch changes the code for the leftover type so that we reuse the GCD from
the desired type.
e.g.
Leftover: gcd(792, 8, 24) => s8
https://llvm.godbolt.org/z/137Kqxj6j
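As a quick sanity check of the arithmetic above (`std::gcd` standing in for the GCD computed by `extractGCDType`):
```
#include <cassert>
#include <numeric> // std::gcd

int main() {
  unsigned Part = std::gcd(std::gcd(792u, 64u), 24u);     // => 8
  unsigned Leftover = std::gcd(std::gcd(792u, 24u), 24u); // => 24, mismatch!
  unsigned Fixed = std::gcd(std::gcd(792u, 8u), 24u);     // => 8, matches Part
  assert(Part == 8 && Leftover == 24 && Fixed == 8);
}
```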
Differential Revision: https://reviews.llvm.org/D105674
This is to protect against nonsensical instruction sequences being assembled,
which would either cause asserts/crashes further down, or result in a Wasm module being output that doesn't validate.
Unlike a validator, this type checker is able to give type-errors as part of the parsing process, which makes the assembler much friendlier to be used by humans writing manual input.
Because the MC system is single pass (instructions aren't even stored in MC format, they are directly output) the type checker has to be single pass as well, which means that from now on .globaltype and .functype decls must come before their use. An extra pass is added to Codegen to collect information for this purpose, since AsmPrinter is normally single pass / streaming as well, and would otherwise generate this information on the fly.
A `-no-type-check` flag was added to llvm-mc (and any other tools that take asm input) that suppresses type errors, as a quick escape hatch for tests that were not intended to be type correct.
This is a first version of the type checker that ignores control flow, i.e. it checks that types are correct along the linear path, but not the branch path. This will still catch most errors. Branch checking could be added in the future.
Differential Revision: https://reviews.llvm.org/D104945
In order to mirror the GetElementPtrInst::indices() API.
Wanted to use this in the IRForTarget code, and was surprised to
find that it didn't exist yet.
This makes it clearer when we have encountered the extra arg.
Also, we may need to adjust the way the operand iteration
works when handling logical and/or.
This change is intended as initial setup. The plan is to add
more semantic checks later. I plan to update the documentation
as more semantic checks are added (instead of documenting the
details up front). Most of the code closely mirrors that for
the Swift calling convention. Three places are marked as
[FIXME: swiftasynccc]; those will be addressed once the
corresponding convention is introduced in LLVM.
Reviewed By: rjmccall
Differential Revision: https://reviews.llvm.org/D95561
LLVM provides target hooks to recognise stack spill and restore
instructions, such as isLoadFromStackSlot, and it also provides post frame
elimination versions such as isLoadFromStackSlotPostFE. These are supposed
to return the store-source and load-destination registers; unfortunately on
X86, the PostFE recognisers just return "1", apparently to signify "yes
it's a spill/load". This patch alters the hooks to correctly return the
store-source and load-destination registers.
This is really useful for debug-info as it helps follow variable values
as they move on/off the stack. There should be no codegen changes: the only
other users of these PostFE target hooks are MachineInstr::getRestoreSize
and MachineInstr::getSpillSize, which don't attempt to interpret the
returned register location.
While we're here, delete the (InstrRef) LiveDebugValues heuristic that
tries to find the spill source register by looking for a killed reg -- we
should be able to rely on the target hooks for that. This involves
temporarily turning off an InstrRef LiveDebugValues test on aarch64
(patch to re-enable it is in D104521).
Differential Revision: https://reviews.llvm.org/D105428
Fails with:
```
/build/llvm-toolchain-snapshot-13~++20210709092633+88326bbce38c/llvm/lib/Target/M68k/GlSel/M68kCallLowering.cpp: In member function 'virtual bool llvm::M68kCallLowering::lowerReturn(llvm::MachineIRBuilder&, const llvm::Value*, llvm::ArrayRef<llvm::Register>, llvm::FunctionLoweringInfo&, llvm::Register) const':
/build/llvm-toolchain-snapshot-13~++20210709092633+88326bbce38c/llvm/lib/Target/M68k/GlSel/M68kCallLowering.cpp:71:42: error: no matching function for call to 'llvm::CallLowering::ArgInfo::ArgInfo(<brace-enclosed initializer list>)'
ArgInfo OrigArg{VRegs, Val->getType()};
```
Differential Revision: https://reviews.llvm.org/D105689
This is NFC-intended currently (so no test diffs). The motivation
is to eventually allow matching for poison-safe logical-and and
logical-or (these are in the form of a select-of-bools).
( https://llvm.org/PR41312 )
Those patterns will not have all of the same constraints as min/max
in the form of cmp+sel. We may also end up removing the cmp+sel
min/max matching entirely (if we canonicalize to intrinsics), so
this will make that step easier.
While working on the elementtype attribute, I felt that the type
attribute handling in AttrBuilder is overly repetitive. This patch
converts the separate Type* members into an std::array<Type*>, so
that all type attribute kinds can be handled generically.
There's more room for improvement here (especially when it comes to
converting the AttrBuilder to an Attribute), but this seems like a
good starting point.
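A hypothetical before/after shape of that refactoring (the enum and member names are invented for illustration):
```
#include <array>

struct Type; // stand-in for llvm::Type

enum TypeAttrKind { ByValIdx, StructRetIdx, ByRefIdx, PreallocatedIdx,
                    NumTypeAttrs };

class AttrBuilderSketch {
  // One generic slot per type attribute kind, replacing a separate
  // Type *ByValType, *StructRetType, ... member for each.
  std::array<Type *, NumTypeAttrs> TypeAttrs = {};

public:
  Type *getTypeAttr(TypeAttrKind K) const { return TypeAttrs[K]; }
  AttrBuilderSketch &setTypeAttr(TypeAttrKind K, Type *Ty) {
    TypeAttrs[K] = Ty;
    return *this;
  }
};
```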
Differential Revision: https://reviews.llvm.org/D105658
This reverts commit 52aeacfbf5.
There isn't full agreement on a path forward yet, but there is agreement that
this shouldn't land as-is. See discussion on https://reviews.llvm.org/D105338
Also reverts unreviewed "[clang] Improve `-Wnull-dereference` diag to be more in-line with reality"
This reverts commit f4877c78c0.
And all the related changes to tests:
This reverts commit 9a0152799f.
This reverts commit 3f7c9cc274.
This reverts commit 329f8197ef.
This reverts commit aa9f58cc2c.
This reverts commit 2df37d5ddd.
This reverts commit a72a441812.
Currently InstructionSimplify.cpp knows how to simplify floating point
instructions that have a NaN operand. It does not know how to handle the
matching constrained FP intrinsic.
This patch teaches it how to simplify so long as the exception handling
is not "fpexcept.strict".
Differential Revision: https://reviews.llvm.org/D103169
The bit order of the has_vec and longtbtable bits in the traceback table generated by the XL compiler flipped at some point after v12.1. This is different from the definition in the AIX header debug.h. The change in the XL compiler that caused the deviation from the OS header definition was unintentional. Since both orderings are extant and the XL compiler runtime also expects the ordering defined by the OS, we will correct the output from LLVM to match the defined ordering given by the OS (which is also consistent with the Assembler Language Reference). Mitigation for traceback tables encoded with the wrong ordering is required for either ordering.
Reviewers: XingXue, HubertTong
Differential Revision: https://reviews.llvm.org/D105487
We keep a record of substitutions between debug value numbers post-isel,
however we never actually look them up until the end of compilation. As a
result, there's nothing gained by the collection being a std::map. This
patch downgrades it to being a vector, that's then sorted at the end of
compilation in LiveDebugValues.
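The pattern, sketched with illustrative types: append during compilation, sort once when lookups begin, then binary-search.
```
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using DebugValueSubst = std::pair<uint64_t, uint64_t>; // (old, new) numbers

struct SubstitutionTable {
  std::vector<DebugValueSubst> Substitutions;

  // Cheap append; no per-insert rebalancing as with std::map.
  void record(uint64_t Old, uint64_t New) {
    Substitutions.emplace_back(Old, New);
  }

  // Called once, before the lookups at the end of compilation.
  void sortForLookup() {
    std::sort(Substitutions.begin(), Substitutions.end());
  }

  const DebugValueSubst *lookup(uint64_t Old) const {
    auto It = std::lower_bound(
        Substitutions.begin(), Substitutions.end(), Old,
        [](const DebugValueSubst &S, uint64_t V) { return S.first < V; });
    return (It != Substitutions.end() && It->first == Old) ? &*It : nullptr;
  }
};
```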
Differential Revision: https://reviews.llvm.org/D105029
This reverts commit 5b350183cd (and
also "[NFC][ScalarEvolution] Cleanup howManyLessThans.",
009436e9c1, to make it apply).
See https://reviews.llvm.org/D105216 for discussion on various
miscompilations caused by that commit.
This reverts commit 4e413e1621,
which landed almost 10 months ago under the premise that the original behavior
didn't match reality and was breaking users, even though it was correct as per
the LangRef. But the LangRef change still hasn't appeared, which might suggest
that the affected parties aren't really worried about this problem.
Please refer to discussion in:
* https://reviews.llvm.org/D87399 (`Revert "[InstCombine] erase instructions leading up to unreachable"`)
* https://reviews.llvm.org/D53184 (`[LangRef] Clarify semantics of volatile operations.`)
* https://reviews.llvm.org/D87149 (`[InstCombine] erase instructions leading up to unreachable`)
clang has `-Wnull-dereference`, which will diagnose the obvious cases
of null dereference. It was adjusted in f4877c78c0,
but it will only catch the cases where the pointer is a null literal;
it will not catch the cases where an arbitrary store is expected to trap.
Differential Revision: https://reviews.llvm.org/D105338
It's proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.
This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).
This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
Added check for switch-terminated blocks in loops.
Now if a block is terminated with a switch, we try to find out which of the
cases is taken on the 1st iteration and mark the corresponding edge from the block
to the case successor as live.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D105688
Reviewed By: nikic, mkazantsev
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between pairwise and
non-pairwise reductions.
This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.
Differential Revision: https://reviews.llvm.org/D105484
FreeBSD's condvar.h (included by user.h in Threading.inc) uses a "struct
thread" that conflicts with llvm::thread if both are visible when it's
included.
So this moves our #include after the FreeBSD code.
It is confusing to have two ways of specifying the same pass
('simple-loop-unswitch' and 'unswitch'). This patch replaces
'unswitch' by 'simple-loop-unswitch' to get a unique identifier.
Using 'simple-loop-unswitch' instead of 'unswitch' also has the
advantage of matching how the pass is named in DEBUG_TYPE etc. So
this makes it a bit more consistent how we refer to the pass in
options such as -passes, -print-after and -debug-only.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D105628
There was an alias between 'simplifycfg' and 'simplify-cfg' in the
PassRegistry. That was the original reason for this patch, which
effectively removes the alias.
This patch also replaces all occurrences of 'simplify-cfg'
by 'simplifycfg'. Reason for choosing that form for the name is
that it matches the DEBUG_TYPE for the pass, and the legacy PM name
and also how it is spelled out in other passes such as
'loop-simplifycfg', and in other options such as
'simplifycfg-merge-cond-stores'.
If for some reason the name should be changed to 'simplify-cfg' in
the future, then I think such a renaming should be more widely done
and not only impacting the PassRegistry.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D105627
To support options like -print-before=<pass> and -print-after=<pass>
the PassBuilder will register PassInstrumentation callbacks as well
as a mapping between internal pass class names and the pass names
used in those options (and other cmd line interfaces). But for
some reason all the passes that take options were missing from those
maps, so for example "-print-after=loop-vectorize" didn't work.
This patch will add the missing entries by also taking care of
function and loop passes with params when setting up the class to
pass name maps.
One might notice that even with this patch it might be tricky to
know what pass name to use in options such as -print-after. This
because there only is a single mapping from class name to pass name,
while the PassRegistry currently is a bit messy as it sometimes
reuses the same class for different pass names (without using the
"pass with params" scheme, or the pass-name<variant> syntax).
It gets extra messy in some situations. For example the
MemorySanitizerPass can run like this (with debug and print-after)
opt -passes='kmsan' -print-after=msan-module -debug-only=msan
The 'kmsan' alias for 'msan<kernel>' is just confusing as one might
think that 'kmsan' is a separate pass (but the DEBUG_TYPE is still
just 'msan'). And since the module pass version of the pass adds
a mapping from 'MemorySanitizerPass' to 'msan-module', one needs to
use 'msan-module' in the print-before and print-after options.
Reviewed By: ychen
Differential Revision: https://reviews.llvm.org/D105006
When the instruction has an imm form and is fed by LI, we can remove the redundant LI instruction.
Below is an example:
```
renamable $x5 = LI8 2
renamable $x4 = exact SRD killed renamable $x4, killed renamable $r5, implicit $x5
```
will be converted to:
```
renamable $x5 = LI8 2
renamable $x4 = exact RLDICL killed renamable $x4, 62, 2, implicit killed $x5
```
But when we do this optimization, we forgot to remove the implicit killed $x5.
This bug caused an LNT test failure. This patch fixes the above bug.
Reviewed By: #powerpc, shchenz
Differential Revision: https://reviews.llvm.org/D85288
AddDiscriminatorsPass is in the legacy PM's O0 pipeline. This patch does the same
for the NPM O0 pipeline.
Reviewed By: aeubanks, MaskRay
Differential Revision: https://reviews.llvm.org/D105650
The rest of the SOP instructions implicitly set SCC and are not
suitable for rematerialization.
Differential Revision: https://reviews.llvm.org/D105670
This parameter controls how much space is reserved for incoming
values. There are always going to be 2 incoming values in this case.
While there, remove the unused std::vector right below.
Found while looking at porting this code to RISCV.
Override the `shouldScalarizeBinop` target lowering hook using the same
implementation used in the x86 backend. This causes `extract_vector_elt`s of
vector binary ops to be scalarized if the scalarized version would be supported.
Differential Revision: https://reviews.llvm.org/D105646
C++23 will make these conversions ambiguous, so fix them to make the
codebase forward-compatible with C++23. (A follow-up change I've made
will make this ambiguous/invalid even before C++23 so we don't regress
this, and it generally improves the code anyway.)
Patch tries to improve the vectorization of stores. Originally, we just
checked the type and the base pointer of the store.
Patch adds some extra checks to avoid non-profitable vectorization
cases. It includes analysis of the scalar values to be stored and
triggers the vectorization attempt only if the scalar values have
same/alt opcode and are from same basic block, i.e. we don't end up
immediately with the gather node, which is not profitable.
This also improves compile time by filtering out non-profitable cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D104122
Noticed while making a related change. This code was doing
something really peculiar: Creating an APInt by parsing a string.
And then creating a SmallVector with one element to create the
GEP.
Instead create the APInt from integers and directly pass the single
index to GetElementPtrInst::Create().
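Roughly the tidier construction (a sketch; the pointee type, pointer, and insertion point are assumed to come from the surrounding code):
```
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

Value *buildGEP(Type *PointeeTy, Value *Ptr, LLVMContext &Ctx,
                Instruction *InsertBefore) {
  // Before: roughly APInt(64, "4", 10) plus a one-element SmallVector.
  // After: build the constant from integers and pass the index directly.
  Value *Idx = ConstantInt::get(Ctx, APInt(/*numBits=*/64, /*val=*/4));
  return GetElementPtrInst::Create(PointeeTy, Ptr, Idx, "gep", InsertBefore);
}
```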
Revived D101297 in its original form + added some changes in X86
legalization checking for masked gathers.
This solution is the most stable and the most correct one. We have to
check the legality before trying to build the masked gather in SLP.
Without this check we have an incorrect cost (for SLP) in case the
masked gather is not legal or is slower than a plain gather, and we're
missing some vectorization opportunities.
This can be fixed in the cost model, but in this case we need to add
special checks for the cost of GEPs for ScatterVectorize node, add
special check for small trees, etc., i.e. there are a lot of corner
cases here and there, which increase the code base and make it harder to
maintain the code.
> Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed.
The question from D101297. Actually, no, it can't: a simple gather may
give us a better result, especially after we started vectorization of
insertelements. Plus, like I said before, the cost for non-legal masked
gathers leads to missed vectorization opportunities.
Differential Revision: https://reviews.llvm.org/D105042
SelectionDAG's equivalents in ISD::InputArg/OutputArg track the
original argument index. Mips relies on this, and it's currently
reinventing its own parallel CallLowering infrastructure which tracks
these indexes on the side. Add this to help move towards deleting the
custom mips handling.
There are two issues with the current implementation of computeBECount:
1. It doesn't account for the possibility that adding "Stride - 1" to
Delta might overflow. For almost all loops, it doesn't, but it's not
actually proven anywhere.
2. It doesn't account for the possibility that Stride is zero. If Delta
is zero, the backedge is never taken; the value of Stride isn't
relevant. To handle this, we have to make sure that the expression
returned by computeBECount evaluates to zero.
To deal with this, add two new checks:
1. Use a variety of tricks to try to prove that the addition doesn't
overflow. If the proof is impossible, use an alternate sequence which
never overflows.
2. Use umax(Stride, 1) to handle the possibility that Stride is zero.
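Both hazards, illustrated with 8-bit unsigned arithmetic in place of SCEV expressions (an illustration only; SCEV reasons symbolically):
```
#include <algorithm>
#include <cassert>
#include <cstdint>

uint8_t naiveBECount(uint8_t Delta, uint8_t Stride) {
  uint8_t Biased = Delta + (Stride - 1); // hazard 1: may wrap
  return Biased / Stride;                // hazard 2: Stride may be zero
}

int main() {
  // Hazard 1: 250 + 9 wraps to 3 in 8 bits, so the naive formula yields
  // 0 instead of the true ceil(250 / 10) = 25.
  assert(naiveBECount(250, 10) == 0);

  // Hazard 2: with Stride == 0 the naive formula divides by zero. Since
  // Delta == 0 means the backedge is never taken, clamping the divisor
  // with umax(Stride, 1) gives the required answer of zero.
  uint8_t Delta = 0, Stride = 0;
  uint8_t BECount = Delta / std::max<uint8_t>(Stride, 1);
  assert(BECount == 0);
}
```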
Differential Revision: https://reviews.llvm.org/D105216
This is a cleanup patch -- we're now able to support all flavours of
variable location in instruction referencing mode. This patch updates
various tests for debug instructions to be broader: numerous code paths
try to ignore debug instructions, and they now have to ignore the
additional DBG_PHI and DBG_INSTR_REFs that we can generate.
A small amount of rework happens for LiveDebugVariables: as we don't need
to track live intervals through regalloc any more, we can get away with
unlinking debug instructions before regalloc, then re-inserting them after.
Note that this isn't (yet) true of DBG_VALUE_LISTs, they still have to go
through live interval tracking.
In SelectionDAG, add a helper lambda that emits half-formed DBG_INSTR_REFs
for arguments in instr-ref mode, DBG_VALUE otherwise. This is one of the
final locations where DBG_VALUEs are emitted for vreg arguments.
X86InstrInfo now un-sets the debug instr number on SUB instructions that
get mutated into CMP instructions. As the instruction no longer computes a
subtraction, we can't use it for variable locations.
Differential Revision: https://reviews.llvm.org/D88898
This adds a new llvm::thread class with the same interface as std::thread
except there is an extra constructor that allows us to set the new thread's
stack size. On Darwin even the default size is boosted to 8MB to match the main
thread.
It also switches all users of the older C-style `llvm_execute_on_thread` API
family over to `llvm::thread` followed by either a `detach` or `join` call and
removes the old API.
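A hedged usage sketch; the exact parameter spelling of the new constructor is an assumption based on the description above:
```
#include "llvm/Support/thread.h"

void spawnWorker() {
  // Request an 8 MiB stack, as the description says Darwin uses for the
  // default; the first-parameter form here is assumed, not verified.
  llvm::thread Worker(/*StackSizeInBytes=*/8 * 1024 * 1024,
                      [] { /* deep-recursion-heavy work */ });
  Worker.join(); // or Worker.detach(), exactly as with std::thread
}
```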
Moved definition of DefaultStackSize into the .cpp file to hopefully
fix the build on some (GCC-6?) machines.
Some of the SPEC tests end up with reduction+(sext/zext(<n x i1>) to <n x im>) pattern, which can be transformed to [-]zext/trunc(ctpop(bitcast <n x i1> to in)) to im.
Also, reduction+(<n x i1>) can be transformed to ctpop(bitcast <n x i1> to in) & 1 != 0.
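A scalar illustration of these equivalences, with the <16 x i1> vector modeled as a bitmask (the `bitcast <n x i1> to in` step):
```
#include <bitset>
#include <cassert>
#include <cstdint>

int main() {
  uint16_t Mask = 0b0000101100010100; // 5 lanes set

  // reduce.add(zext <16 x i1> to <16 x i8>) == popcount(mask);
  // the sext variant contributes -1 per set lane, hence the negation.
  int Pop = static_cast<int>(std::bitset<16>(Mask).count());
  assert(Pop == 5);
  assert(-Pop == -5); // sext case

  // An add/xor reduction of the raw <16 x i1> wraps in i1, which is
  // exactly the parity: ctpop(mask) & 1 != 0.
  bool I1Reduce = (Pop & 1) != 0;
  assert(I1Reduce); // 5 is odd
}
```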
Differential Revision: https://reviews.llvm.org/D105587
- ``externally_initialized`` variables may be initialized or modified
elsewhere. In particular, CUDA or HIP may have host code that initializes
or modifies ``externally_initialized`` device variables, which may not
be explicitly referenced on the device side but may still be used
through the host-side interfaces. Not preserving them triggers their
elimination in GlobalDCE and breaks the user code.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D105135
- In [D98783](https://reviews.llvm.org/D98783), an extra GlobalDCE pass
is inserted before the internalization pass to ensure a global
variable without users could be internalized even if there are dead
users. Instead of inserting a dedicated optimization pass, the
dead user checking, i.e. 'use_empty()', should be preceded by
constant dead user removal to ensure an accurate result.
Differential Revision: https://reviews.llvm.org/D105590
Additionally, lower the floating point compare SVE intrinsics to
SETCC_MERGE_ZERO ISD nodes to avoid duplicating ISel patterns.
Differential Revision: https://reviews.llvm.org/D105486
Several subclasses of User override operator new without also overriding
operator delete. This means that delete expressions fall back to using
operator delete of the base class, which would be User. However, this is
only allowed if the base class has a virtual destructor which is not the
case for User, so this is UB.
See also [expr.delete] (3) for the exact wording.
This is actually detected in some cases by GCC 11's
-Wmismatched-new-delete now, which is how I found this error.
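The shape of the problem and the fix in miniature (stand-ins for User and a subclass):
```
#include <cstddef>
#include <new>

struct Base { // stands in for User: note the non-virtual destructor
  ~Base() = default;
  static void *operator new(std::size_t Size) { return ::operator new(Size); }
};

struct Derived : Base { // a subclass overriding operator new
  static void *operator new(std::size_t Size) { return ::operator new(Size); }
  // The fix: pair the overridden operator new with a matching delete, so
  // delete-expressions no longer fall back to the base class's version.
  static void operator delete(void *Ptr) { ::operator delete(Ptr); }
};

int main() { delete new Derived; } // well-defined with the paired delete
```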
Differential Revision: https://reviews.llvm.org/D103143
The computeNamedSymbolDependencies and computeLocalDeps methods on
ObjectLinkingLayerJITLinkContext are responsible for computing, for each symbol
in the current MaterializationResponsibility, the set of non-locally-scoped
symbols that are depended on. To calculate this we have to consider the effect
of chains of dependence through locally scoped symbols in the LinkGraph. E.g.
.text
.globl foo
foo:
callq bar ## foo depends on external 'bar'
movq Ltmp1(%rip), %rcx ## foo depends on locally scoped 'Ltmp1'
addl (%rcx), %eax
retq
.data
Ltmp1:
.quad x ## Ltmp1 depends on external 'x'
In this example symbol 'foo' depends directly on 'bar', and indirectly on 'x'
via 'Ltmp1', which is locally scoped.
Performance of the existing implementations appears to have been mediocre:
Based on flame graphs posted by @drmeister (in #jit on the LLVM discord server)
the computeLocalDeps function was taking up a substantial amount of time when
starting up Clasp (https://github.com/clasp-developers/clasp).
This commit attempts to address the performance problems in three ways:
1. Using jitlink::Blocks instead of jitlink::Symbols as the nodes of the
dependencies-introduced-by-locally-scoped-symbols graph.
Using either Blocks or Symbols as nodes provides the same information, but since
there may be more than one locally scoped symbol per block the block-based
version of the dependence graph should always be a subgraph of the Symbol-based
version, and so faster to operate on.
2. Improved worklist management.
The older version of computeLocalDeps used a fixed worklist containing all
nodes, and iterated over this list propagating dependencies until no further
changes were required. The worklist was not sorted into a useful order before
the loop started.
The new version uses a variable work-stack, visiting nodes in DFS order and
only adding nodes when there is meaningful work to do on them.
Compared to the old version the new version avoids revisiting nodes which
haven't changed, and I suspect it converges more quickly (due to the DFS
ordering).
3. Laziness and caching.
Mappings of...
jitlink::Symbol* -> Interned Name (as SymbolStringPtr)
jitlink::Block* -> Immediate dependencies (as SymbolNameSet)
jitlink::Block* -> Transitive dependencies (as SymbolNameSet)
are all built lazily and cached while running computeNamedSymbolDependencies.
According to @drmeister these changes reduced Clasp startup time in his test
setup (averaged over a handful of starts) from 4.8 to 2.8 seconds (with
ORC/JITLink linking ~11,000 object files in that time), which seems like
enough to justify switching to the new algorithm in the absence of any other
perf numbers.
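A much-simplified sketch of the lazily cached DFS described in points 2 and 3, over stand-in types for jitlink::Block and interned names (cycles through local symbols are simply cut off at the cache entry here):
```
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

using Name = std::string; // stands in for an interned SymbolStringPtr

struct BlockNode { // stands in for jitlink::Block
  std::set<Name> ImmediateExternalDeps; // e.g. {"bar"} for foo's block
  std::vector<BlockNode *> LocalDeps;   // e.g. foo's block -> Ltmp1's block
};

// Computes a block's transitive external dependencies in DFS order,
// caching results so a node is only ever processed once.
std::set<Name> &transitiveDeps(
    BlockNode &B, std::unordered_map<BlockNode *, std::set<Name>> &Cache) {
  auto It = Cache.find(&B);
  if (It != Cache.end())
    return It->second; // already computed (or in progress, for cycles)
  std::set<Name> &Result = Cache[&B];
  Result = B.ImmediateExternalDeps;
  for (BlockNode *Local : B.LocalDeps) {
    const std::set<Name> &Sub = transitiveDeps(*Local, Cache);
    Result.insert(Sub.begin(), Sub.end());
  }
  return Result;
}
```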
WebAssembly's shift instructions implicitly mask the shift count, so optimize
out redundant explicit masks of the shift count. For vector shifts, this
currently only works if the mask is applied before splatting the shift count,
but this should be addressed in a future commit. Resolves PR49655.
Differential Revision: https://reviews.llvm.org/D105600
MachOJITDylibInitializers::SectionExtent represented the address range of a
section as an (address, size) pair. The new ExecutorAddressRange type
generalizes this to an address range (for any object, not necessarily a section)
represented as a (start-address, end-address) pair.
The aim is to express more of ORC (and the ORC runtime) in terms of simple types
that can be serialized/deserialized via SPS. This will simplify SPS-based RPC
involving arguments/return-values of these types.
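An illustrative shape of the new range type (not the exact ORC definition):
```
#include <cstdint>

struct ExecutorAddressRange {
  uint64_t Start = 0; // start address of the object
  uint64_t End = 0;   // one past the last byte

  bool empty() const { return Start == End; }
  uint64_t size() const { return End - Start; }
  bool contains(uint64_t Addr) const { return Start <= Addr && Addr < End; }
};
```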