When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.
Original patch by Anna Welker
Differential Revision: https://reviews.llvm.org/D86452
This migrates all LLVM (except Kaleidoscope and
CodeGen/StackProtector.cpp) DebugLoc::get to DILocation::get.
The CodeGen/StackProtector.cpp usage may have a nullptr Scope
and can trigger an assertion failure, so I don't migrate it.
Reviewed By: #debug-info, dblaikie
Differential Revision: https://reviews.llvm.org/D93087
This is the first in a series of patches that attempts to migrate
existing cost instructions to return a new InstructionCost class
in place of a simple integer. This new class is intended to be
as light-weight and simple as possible, with a full range of
arithmetic and comparison operators that largely mirror the same
sets of operations on basic types, such as integers. The main
advantage to using an InstructionCost is that it can encode a
particular cost state in addition to a value. The initial
implementation only has two states - Normal and Invalid - but these
could be expanded over time if necessary. An invalid state can
be used to represent an unknown cost or an instruction that is
prohibitively expensive.
This patch adds the new class and changes the getInstructionCost
interface to return the new class. Other cost functions, such as
getUserCost, etc., will be migrated in future patches as I believe
this to be less disruptive. One benefit of this new class is that
it provides a way to unify many of the magic costs in the codebase
where the cost is set to a deliberately high number to prevent
optimisations taking place, e.g. vectorization. It also provides
a route to represent the extremely high, and unknown, cost of
scalarization of scalable vectors, which is not currently supported.
Differential Revision: https://reviews.llvm.org/D91174
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s
Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections. The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead.
**ELF object emission**
The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.
Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication. A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.
The format of `.pseudo_probe_desc` section looks like:
```
.section .pseudo_probe_desc,"",@progbits
.quad 6309742469962978389 // Func GUID
.quad 4294967295 // Func Hash
.byte 9 // Length of func name
.ascii "_Z5funcAi" // Func name
.quad 7102633082150537521
.quad 138828622701
.byte 12
.ascii "_Z8funcLeafi"
.quad 446061515086924981
.quad 4294967295
.byte 9
.ascii "_Z5funcBi"
.quad -2016976694713209516
.quad 72617220756
.byte 7
.ascii "_Z3fibi"
```
For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :
```
FUNCTION BODY (one for each outlined function present in the text section)
GUID (uint64)
GUID of the function
NPROBES (ULEB128)
Number of probes originating from this function.
NUM_INLINED_FUNCTIONS (ULEB128)
Number of callees inlined into this function, aka number of
first-level inlinees
PROBE RECORDS
A list of NPROBES entries. Each entry contains:
INDEX (ULEB128)
TYPE (uint4)
0 - block probe, 1 - indirect call, 2 - direct call
ATTRIBUTE (uint3)
reserved
ADDRESS_TYPE (uint1)
0 - code address, 1 - address delta
CODE_ADDRESS (uint64 or ULEB128)
code address or address delta, depending on ADDRESS_TYPE
INLINED FUNCTION RECORDS
A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
callees. Each record contains:
INLINE SITE
GUID of the inlinee (uint64)
ID of the callsite probe (ULEB128)
FUNCTION BODY
A FUNCTION BODY entry describing the inlined function.
```
To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.
**Assembling**
Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.
A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.
A example assembly looks like:
```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```
With inlining turned on, the assembly may look different around %bb2 with an inlined probe:
```
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 3 0
.pseudoprobe 6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe 837061429793323041 4 0
popq %rax
retq
```
**Disassembling**
We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.
An example disassembly looks like:
```
00000000002011a0 <foo2>:
2011a0: 50 push rax
2011a1: 85 ff test edi,edi
[Probe]: FUNC: foo2 Index: 1 Type: Block
2011a3: 74 02 je 2011a7 <foo2+0x7>
[Probe]: FUNC: foo2 Index: 3 Type: Block
[Probe]: FUNC: foo2 Index: 4 Type: Block
[Probe]: FUNC: foo Index: 1 Type: Block Inlined: @ foo2:6
2011a5: 58 pop rax
2011a6: c3 ret
[Probe]: FUNC: foo2 Index: 2 Type: Block
2011a7: bf 01 00 00 00 mov edi,0x1
[Probe]: FUNC: foo2 Index: 5 Type: IndirectCall
2011ac: ff d6 call rsi
[Probe]: FUNC: foo2 Index: 4 Type: Block
2011ae: 58 pop rax
2011af: c3 ret
```
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D91878
The test is reduced from the example in D82005.
Similar to 94f6d365e, the test here would assert in
the DomTree when we tried to convert a select to a
phi with an unreachable block operand.
We may want to add some kind of guard code in DomTree
itself to avoid this sort of problem.
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s
Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections. The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead.
**ELF object emission**
The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.
Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication. A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.
The format of `.pseudo_probe_desc` section looks like:
```
.section .pseudo_probe_desc,"",@progbits
.quad 6309742469962978389 // Func GUID
.quad 4294967295 // Func Hash
.byte 9 // Length of func name
.ascii "_Z5funcAi" // Func name
.quad 7102633082150537521
.quad 138828622701
.byte 12
.ascii "_Z8funcLeafi"
.quad 446061515086924981
.quad 4294967295
.byte 9
.ascii "_Z5funcBi"
.quad -2016976694713209516
.quad 72617220756
.byte 7
.ascii "_Z3fibi"
```
For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :
```
FUNCTION BODY (one for each outlined function present in the text section)
GUID (uint64)
GUID of the function
NPROBES (ULEB128)
Number of probes originating from this function.
NUM_INLINED_FUNCTIONS (ULEB128)
Number of callees inlined into this function, aka number of
first-level inlinees
PROBE RECORDS
A list of NPROBES entries. Each entry contains:
INDEX (ULEB128)
TYPE (uint4)
0 - block probe, 1 - indirect call, 2 - direct call
ATTRIBUTE (uint3)
reserved
ADDRESS_TYPE (uint1)
0 - code address, 1 - address delta
CODE_ADDRESS (uint64 or ULEB128)
code address or address delta, depending on ADDRESS_TYPE
INLINED FUNCTION RECORDS
A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
callees. Each record contains:
INLINE SITE
GUID of the inlinee (uint64)
ID of the callsite probe (ULEB128)
FUNCTION BODY
A FUNCTION BODY entry describing the inlined function.
```
To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.
**Assembling**
Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.
A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.
A example assembly looks like:
```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```
With inlining turned on, the assembly may look different around %bb2 with an inlined probe:
```
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 3 0
.pseudoprobe 6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe 837061429793323041 4 0
popq %rax
retq
```
**Disassembling**
We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.
An example disassembly looks like:
```
00000000002011a0 <foo2>:
2011a0: 50 push rax
2011a1: 85 ff test edi,edi
[Probe]: FUNC: foo2 Index: 1 Type: Block
2011a3: 74 02 je 2011a7 <foo2+0x7>
[Probe]: FUNC: foo2 Index: 3 Type: Block
[Probe]: FUNC: foo2 Index: 4 Type: Block
[Probe]: FUNC: foo Index: 1 Type: Block Inlined: @ foo2:6
2011a5: 58 pop rax
2011a6: c3 ret
[Probe]: FUNC: foo2 Index: 2 Type: Block
2011a7: bf 01 00 00 00 mov edi,0x1
[Probe]: FUNC: foo2 Index: 5 Type: IndirectCall
2011ac: ff d6 call rsi
[Probe]: FUNC: foo2 Index: 4 Type: Block
2011ae: 58 pop rax
2011af: c3 ret
```
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D91878
*************
* The problem
*************
See motivation examples in compiler-rt/test/dfsan/pair.cpp. The current
DFSan always uses a 16bit shadow value for a variable with any type by
combining all shadow values of all bytes of the variable. So it cannot
distinguish two fields of a struct: each field's shadow value equals the
combined shadow value of all fields. This introduces an overtaint issue.
Consider a parsing function
std::pair<char*, int> get_token(char* p);
where p points to a buffer to parse, the returned pair includes the next
token and the pointer to the position in the buffer after the token.
If the token is tainted, then both the returned pointer and int ar
tainted. If the parser keeps on using get_token for the rest parsing,
all the following outputs are tainted because of the tainted pointer.
The CL is the first change to address the issue.
**************************
* The proposed improvement
**************************
Eventually all fields and indices have their own shadow values in
variables and memory.
For example, variables with type {i1, i3}, [2 x i1], {[2 x i4], i8},
[2 x {i1, i1}] have shadow values with type {i16, i16}, [2 x i16],
{[2 x i16], i16}, [2 x {i16, i16}] correspondingly; variables with
primary type still have shadow values i16.
***************************
* An potential implementation plan
***************************
The idea is to adopt the change incrementially.
1) This CL
Support field-level accuracy at variables/args/ret in TLS mode,
load/store/alloca still use combined shadow values.
After the alloca promotion and SSA construction phases (>=-O1), we
assume alloca and memory operations are reduced. So if struct
variables do not relate to memory, their tracking is accurate at
field level.
2) Support field-level accuracy at alloca
3) Support field-level accuracy at load/store
These two should make O0 and real memory access work.
4) Support vector if necessary.
5) Support Args mode if necessary.
6) Support passing more accurate shadow values via custom functions if
necessary.
***************
* About this CL.
***************
The CL did the following
1) extended TLS arg/ret to work with aggregate types. This is similar
to what MSan does.
2) implemented how to map between an original type/value/zero-const to
its shadow type/value/zero-const.
3) extended (insert|extract)value to use field/index-level progagation.
4) for other instructions, propagation rules are combining inputs by or.
The CL converts between aggragate and primary shadow values at the
cases.
5) Custom function interfaces also need such a conversion because
all existing custom functions use i16. It is unclear whether custome
functions need more accurate shadow propagation yet.
6) Added test cases for aggregate type related cases.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D92261
This is an enhancement to load vectorization that is motivated by
a pattern in https://llvm.org/PR16739.
Unfortunately, it's still not enough to make a difference there.
We will have to handle multi-use cases in some better way to avoid
creating multiple overlapping loads.
Differential Revision: https://reviews.llvm.org/D92858
For stores chain vectorization we choose the size of vector
elements to ensure we fit to minimum and maximum vector register
size for the number of elements given. This patch corrects vector
element size choosing the width of value truncated just before
storing instead of the width of value stored.
Fixes PR46983
Differential Revision: https://reviews.llvm.org/D92824
* Steps are scaled by `vscale`, a runtime value.
* Changes to circumvent the cost-model for now (temporary)
so that the cost-model can be implemented separately.
This can vectorize the following loop [1]:
void loop(int N, double *a, double *b) {
#pragma clang loop vectorize_width(4, scalable)
for (int i = 0; i < N; i++) {
a[i] = b[i] + 1.0;
}
}
[1] This source-level example is based on the pragma proposed
separately in D89031. This patch only implements the LLVM part.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91077
This patch removes a number of asserts that VF is not scalable, even though
the code where this assert lives does nothing that prevents VF being scalable.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91060
This commit adds two new intrinsics.
- llvm.experimental.vector.insert: used to insert a vector into another
vector starting at a given index.
- llvm.experimental.vector.extract: used to extract a subvector from a
larger vector starting from a given index.
The codegen work for these intrinsics has already been completed; this
commit is simply exposing the existing ISD nodes to LLVM IR.
Reviewed By: cameron.mcinally
Differential Revision: https://reviews.llvm.org/D91362
This patch adds new PM support for the pass and the pass can be now used
during middle-end transforms. The old pass is remamed to
ScalarizeMaskedMemIntrinLegacyPass.
Reviewed-By: skatkov, aeubanks
Differential Revision: https://reviews.llvm.org/D92743
ScalarizeMaskedMemIntrinsic is currently a codeGen level pass. The pass
is actually operating on IR level and does not use any code gen specific
passes. It is useful to move it into transforms directory so that it
can be more widely used as a mid-level transform as well (apart from
usage in codegen pipeline).
In particular, we have a usecase downstream where we would like to use
this pass in our mid-level pipeline which operates on IR level.
The next change will be to add support for new PM.
Reviewers: craig.topper, apilipenko, skatkov
Reviewed-By: skatkov
Differential Revision: https://reviews.llvm.org/D92407
This is a rework of D85812, which didn't land.
When callee coroutine function is inlined into caller coroutine function before coro-split pass, llvm will emits "coroutine should have exactly one defining @llvm.coro.begin". It seems that coro-early pass can not handle this quiet well.
So we believe that unsplited coroutine function should not be inlined.
This patch fix such issue by not inlining function if it has attribute "coroutine.presplit" (it means the function has not been splited) to fix this issue
test plan: check-llvm, check-clang
In D85812, there was suggestions on moving the macros to Attributes.td to avoid circular header dependency issue.
I believe it's not worth doing just to be able to use one constant string in one place.
Today, there are already 3 possible attribute values for "coroutine.presplit": c6543cc6b8/llvm/lib/Transforms/Coroutines/CoroInternal.h (L40-L42)
If we move them into Attributes.td, we would be adding 3 new attributes to EnumAttr, just to support this, which I think is an overkill.
Instead, I think the best way to do this is to add an API in Function class that checks whether this function is a coroutine, by checking the attribute by name directly.
Differential Revision: https://reviews.llvm.org/D92706
This guards against cases where the symbol was dead code eliminated in
the binary by ThinLTO, and we have a sample profile collected for one
binary but used to optimize another.
Most of the benefit from ICP comes from inlining the target, which we
can't do with only a declaration anyway. If this is in the pre-ThinLTO
link step (e.g. for instrumentation based PGO), we will attempt the
promotion again in the ThinLTO backend after importing anyway, and we
don't need the early promotion to facilitate that.
Differential Revision: https://reviews.llvm.org/D92804
Currently in some places we use signed type to represent size of an access and put explicit casts from unsigned to signed.
For example: int64_t EarlierSize = int64_t(Loc.Size.getValue());
Even though it doesn't loos bits (immidiatly) it may overflow and we end up with negative size. Potentially that cause later code to work incorrectly. A simple expample is a check that size is not negative.
I think it would be safer and clearer if we use unsigned type for the size and handle it appropriately.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D92648
I'm not sure if it would be legal by the IR reference to introduce
an addrspacecast here, since the IR reference is a bit vague on
the exact semantics, but at least for our usage of it (and I
suspect for many other's usage) it is not. For us, addrspacecasts
between non-integral address spaces carry frontend information that the
optimizer cannot deduce afterwards in a generic way (though we
have frontend specific passes in our pipline that do propagate
these). In any case, I'm sure nobody is using it this way at
the moment, since it would have introduced inttoptrs, which
are definitely illegal.
Fixes PR38375
Co-authored-by: Keno Fischer <keno@alumni.harvard.edu>
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D50010
This patch adds the ConstraintElimination pass to the LTO pipeline and
also runs it after SCCP in the function simplification pipeline.
This increases the number of cases we can elimination. Pending further
tuning.
It is possible to merge reuse and reorder shuffles and reduce the total
cost of the ivectorization tree/number of final instructions.
Differential Revision: https://reviews.llvm.org/D92668
The CountPrev variable was only used to forward a value from
the if statement to the conditional operator under the same
condition.
While there move some variable declarations to their first
assignment.
This change adds the context-senstive sample PGO infracture described in CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduced an abstraction between input profile and profile loader that queries input profile for functions. Specifically, there's now the notion of base profile and context profile, and they are managed by the new SampleContextTracker for adjusting and merging profiles based on inline decisions. It works with top-down profiled guided inliner in profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to CGSCC inliner in order for it to take advantage of context-sensitive profile. This change is the consumption part of context-sensitive profile (The generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Pacthes for integration with Pseudo-probe coming up soon.
Currently the new infrastructure kick in when input profile contains the new context-sensitive profile; otherwise it's no-op and does not affect existing AutoFDO.
**Interface**
There're two sets of interfaces for query and tracking respectively exposed from SampleContextTracker. For query, now instead of simply getting a profile from input for a function, we can explicitly query base profile or context profile for given call path of a function. For tracking, there're separate APIs for marking context profile as inlined, or promoting and merging not inlined context profile.
- Query base profile (`getBaseSamplesFor`)
Base profile is the merged synthetic profile for function's CFG profile from any outstanding (not inlined) context. We can query base profile by function.
- Query context profile (`getContextSamplesFor`)
Context profile is a function's CFG profile for a given calling context. We can query context profile by context string.
- Track inlined context profile (`markContextSamplesInlined`)
When a function is inlined for given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing base profile for that inlined function.
- Track not-inlined context profile (`promoteMergeContextSamplesTree`)
When a function is not inlined for given calling context, we need to promote the context profile tree so the not inlined context becomes top-level context. This preserve the sub-context under that function so later inline decision for that not inlined function will still have context profile for its call tree. Note that profile will be merged if needed when promoting a context profile tree if any of the node already exists at its promoted destination.
**Implementation**
Implementation-wise, `SampleContext` is created as abstraction for context. Currently it's a string for call path, and we can later optimize it to something more efficient, e.g. context id. Each `SampleContext` also has a `ContextState` indicating whether it's raw context profile from input, whether it's inlined or merged, whether it's synthetic profile created by compiler. Each `FunctionSamples` now has a `SampleContext` that tells whether it's base profile or context profile, and for context profile what is the context and state.
On top of the above context representation, a custom trie tree is implemented to track and manager context profiles. Specifically, `SampleContextTracker` is implemented that encapsulates a trie tree with `ContextTireNode` as node. Each node of the trie tree represents a frame in calling context, thus the path from root to a node represents a valid calling context. We also track `FunctionSamples` for each node, so this trie tree can serve efficient query for context profile. Accordingly, context profile tree promotion now becomes moving a subtree to be under the root of entire tree, and merge nodes for subtree if this move encounters existing nodes.
**Integration**
`SampleContextTracker` is now also integrated with AutoFDO, `SampleProfileReader` and `SampleProfileLoader`. When we detected input profile contains context-sensitive profile, `SampleContextTracker` will be used to track profiles, and all profile query will go to `SampleContextTracker` instead of `SampleProfileReader` automatically. Tracking APIs are called automatically for each inline decision from `SampleProfileLoader`.
Differential Revision: https://reviews.llvm.org/D90125
The x86-64 backend currently has a bug which uses a wrong register when for the GOTPCREL reference.
The program will crash without the dso_local specifier.
IsSigned and its accessor, isSigned, were introduced on Oct 25, 2017
in commit 9ac7021a25. The last use was
removed on Nov 20, 2017 in commit
268467869b.
This is a child diff of D92261.
This diff adds APIs that return shadow type/value/zero from origin
objects. For the time being these APIs simply returns primitive
shadow type/value/zero. The following diff will be implementing the
conversion.
As D92261 explains, some cases still use primitive shadow during
the incremential changes. The cases include
1) alloca/load/store
2) custom function IO
3) vectors
At the cases this diff does not use the new APIs, but uses primitive
shadow objects explicitly.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D92629
Prepare to delete `AlignedCharArrayUnion` by migrating its users over to
`std::aligned_union_t`.
I will delete `AlignedCharArrayUnion` and its tests in a follow-up
commit so that it's easier to revert in isolation in case some
downstream wants to keep using it.
Differential Revision: https://reviews.llvm.org/D92516
Update all the users of `AlignedCharArrayUnion` to stop peeking inside
(to look at `buffer`) so that a follow-up patch can replace it with an
alias to `std::aligned_union_t`.
This was reviewed as part of https://reviews.llvm.org/D92512, but I'm
splitting this bit out to commit first to reduce churn in case the
change to `AlignedCharArrayUnion` needs to be reverted for some
unexpected reason.
Currently PassBuilder.cpp is by far the file that takes longest to
compile. This is due to tons of templates being instantiated per pass.
Follow PassManager by using wrappers around passes to avoid making
the adaptors templated on the pass type. This allows us to move various
adaptors' run methods into .cpp files.
This reduces the compile time of PassBuilder.cpp on my machine from 66
to 39 seconds. It also reduces the size of opt from 685M to 676M.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D92616
Currently we have to duplicate the same checks in isPotentiallyReassociatable and tryReassociate. With simple pattern like add/mul this may be not a big deal. But the situation gets much worse when I try to add support for min/max. Min/Max may be represented by several instructions and can take different forms. In order reduce complexity for upcoming min/max support we need to restructure the code a bit to avoid mentioned code duplication.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88286
Currently we delete optimized instructions as we go. That has several negative consequences. First it complicates traversal logic itself. Second if newly generated instruction has been deleted the traversal is repeated from scratch.
But real motivation for the change is upcoming change with support for min/max reassociation. Here we employ SCEV expander to generate code. As a result newly generated instructions may be inserted not right before original instruction (because SCEV may do hoisting) and there is no way to know 'next' instruction.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88285
This patch teaches the jump threading pass to call BPI->eraseBlock
when it folds a conditional branch.
Without this patch, BranchProbabilityInfo could end up with stale edge
probabilities for the basic block containing the conditional branch --
one edge probability with less than 1.0 and the other for a removed
edge.
Differential Revision: https://reviews.llvm.org/D92608
This reverts commit 4bd35cdc3a.
The patch was reverted during the investigation. The investigation
shown that the patch did not cause any trouble, but just exposed
the existing problem that is addressed by the previous patch
"[IndVars] Quick fix LHS/RHS bug". Returning without changes.
The code relies on fact that LHS is the NarrowDef but never
really checks it. Adding the conservative restrictive check,
will follow-up with handling of case where RHS is a NarrowDef.
This is a child diff of D92261.
It extended TLS arg/ret to work with aggregate types.
For a function
t foo(t1 a1, t2 a2, ... tn an)
Its arguments shadow are saved in TLS args like
a1_s, a2_s, ..., an_s
TLS ret simply includes r_s. By calculating the type size of each shadow
value, we can get their offset.
This is similar to what MSan does. See __msan_retval_tls and __msan_param_tls
from llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp.
Note that this change does not add test cases for overflowed TLS
arg/ret because this is hard to test w/o supporting aggregate shdow
types. We will be adding them after supporting that.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D92440
The initial step of the uniform-after-vectorization (lane-0 demanded only) analysis was very awkwardly written. It would revisit use list of each pointer operand of a widened load/store. As a result, it was in the worst case O(N^2) where N was the number of instructions in a loop, and had restricted operand Value types to reduce the size of use lists.
This patch replaces the original algorithm with one which is at most O(2N) in the number of instructions in the loop. (The key observation is that each use of a potentially interesting pointer is visited at most twice, once on first scan, once in the use list of *it's* operand. Only instructions within the loop have their uses scanned.)
In the process, we remove a restriction which required the operand of the uniform mem op to itself be an instruction. This allows detection of uniform mem ops involving global addresses.
Differential Revision: https://reviews.llvm.org/D92056
1. Removed #include "...AliasAnalysis.h" in other headers and modules.
2. Cleaned up includes in AliasAnalysis.h.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D92489
This reverts commit 0c9c6ddf17.
We are seeing some failures with this patch locally. Not clear
if it's causing them or just triggering a problem in another
place. Reverting while investigating.
This is a child diff of D92261.
After supporting field/index-level shadow, the existing shadow with type
i16 works for only primitive types.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D92459
An indirect call site needs to be probed for its potential call targets. With CSSPGO a direct call also needs a probe so that a calling context can be represented by a stack of callsite probes. Unlike pseudo probes for basic blocks that are in form of standalone intrinsic call instructions, pseudo probes for callsites have to be attached to the call instruction, thus a separate instruction would not work.
One possible way of attaching a probe to a call instruction is to use a special metadata that carries information about the probe. The special metadata will have to make its way through the optimization pipeline down to object emission. This requires additional efforts to maintain the metadata in various places. Given that the `!dbg` metadata is a first-class metadata and has all essential support in place , leveraging the `!dbg` metadata as a channel to encode pseudo probe information is probably the easiest solution.
With the requirement of not inflating `!dbg` metadata that is allocated for almost every instruction, we found that the 32-bit DWARF discriminator field which mainly serves AutoFDO can be reused for pseudo probes. DWARF discriminators distinguish identical source locations between instructions and with pseudo probes such support is not required. In this change we are using the discriminator field to encode the ID and type of a callsite probe and the encoded value will be unpacked and consumed right before object emission. When a callsite is inlined, the callsite discriminator field will go with the inlined instructions. The `!dbg` metadata of an inlined instruction is in form of a scope stack. The top of the stack is the instruction's original `!dbg` metadata and the bottom of the stack is for the original callsite of the top-level inliner. Except for the top of the stack, all other elements of the stack actually refer to the nested inlined callsites whose discriminator field (which actually represents a calliste probe) can be used together to represent the inline context of an inlined PseudoProbeInst or CallInst.
To avoid collision with the baseline AutoFDO in various places that handles dwarf discriminators where a check against the `-pseudo-probe-for-profiling` switch is not available, a special encoding scheme is used to tell apart a pseudo probe discriminator from a regular discriminator. For the regular discriminator, if all lowest 3 bits are non-zero, it means the discriminator is basically empty and all higher 29 bits can be reversed for pseudo probe use.
Callsite pseudo probes are inserted in `SampleProfileProbePass` and a target-independent MIR pass `PseudoProbeInserter` is added to unpack the probe ID/type from `!dbg`.
Note that with this work the switch -debug-info-for-profiling will not work with -pseudo-probe-for-profiling anymore. They cannot be used at the same time.
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D91756
At D92261, this type will be used to cache both combined shadow and
converted shadow values.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D92458
Summary:
AIX uses the existing EH infrastructure in clang and llvm.
The major differences would be
1. AIX do not have CFI instructions.
2. AIX uses a new personality routine, named __xlcxx_personality_v1.
It doesn't use the GCC personality rountine, because the
interoperability is not there yet on AIX.
3. AIX do not use eh_frame sections. Instead, it would use a eh_info
section (compat unwind section) to store the information about
personality routine and LSDA data address.
Reviewed By: daltenty, hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D91455
This is yet another attempt at providing support for epilogue
vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none
and reviews D30247 and D88819.
Similar to D88819, this patch achieve epilogue vectorization by
executing a single vplan twice: once on the main loop and a second
time on the epilogue loop (using a different VF). However it's able
to handle more loops, and generates more optimal control flow for
cases where the trip count is too small to execute any code in vector
form.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D89566
This might be a small improvement in readability, but the
real motivation is to make it easier to adapt the code to
deal with intrinsics like 'maxnum' and/or integer min/max.
There is potentially help in doing that with D92086, but
we might also just add specialized wrappers here to deal
with the expected patterns.
OpenMPIRBuilder::createParallel outlines the body region of the parallel
construct into a new function that accepts any value previously defined outside
the region as a function argument. This function is called back by OpenMP
runtime function __kmpc_fork_call, which expects trailing arguments to be
pointers. If the region uses a value that is not of a pointer type, e.g. a
struct, the produced code would be invalid. In such cases, make createParallel
emit IR that stores the value on stack and pass the pointer to the outlined
function instead. The outlined function then loads the value back and uses as
normal.
Reviewed By: jdoerfert, llitchev
Differential Revision: https://reviews.llvm.org/D92189
In this patch I have added support for a new loop hint called
vectorize.scalable.enable that says whether we should enable scalable
vectorization or not. If a user wants to instruct the compiler to
vectorize a loop with scalable vectors they can now do this as
follows:
br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !2
...
!2 = !{!2, !3, !4}
!3 = !{!"llvm.loop.vectorize.width", i32 8}
!4 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
Setting the hint to false simply reverts the behaviour back to the
default, using fixed width vectors.
Differential Revision: https://reviews.llvm.org/D88962
This is yet another attempt at providing support for epilogue
vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none
and reviews D30247 and D88819.
Similar to D88819, this patch achieve epilogue vectorization by
executing a single vplan twice: once on the main loop and a second
time on the epilogue loop (using a different VF). However it's able
to handle more loops, and generates more optimal control flow for
cases where the trip count is too small to execute any code in vector
form.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D89566
This is a straightforward port of MemCpyOpt to MemorySSA following
the approach of D26739. MemDep queries are replaced with MSSA queries
without changing the overall structure of the pass. Some care has
to be taken to account for differences between these APIs
(MemDep also returns reads, MSSA doesn't).
Differential Revision: https://reviews.llvm.org/D89207
We were not correctly splitting a blocks for chains of length 1.
Before that change, additional instructions for blocks in chains of
length 1 were not split off from the block before removing (this was
done correctly for chains of longer size).
If this first block contained an instruction referenced elsewhere,
deleting the block, would result in invalidation of the produced value.
This caused a miscompile which motivated D92297 (before D17993,
nonnull and dereferenceable attributed were not added so MergeICmps were
not triggered.) The new test gep-references-bb.ll demonstrate the issue.
The regression was introduced in
rG0efadbbcdeb82f5c14f38fbc2826107063ca48b2.
This supersedes D92364.
Test case by MaskRay (Fangrui Song).
Differential Revision: https://reviews.llvm.org/D92375
icmp is the preferred spelling in IR because icmp analysis is
expected to be better than any other analysis. This should
lead to more follow-on folding potential.
It's difficult to say exactly what we should do in codegen to
compensate. For example on AArch64, which of these is preferred:
sub w8, w0, w1
lsr w0, w8, #31
vs:
cmp w0, w1
cset w0, lt
If there are perf regressions, then we should deal with those in
codegen on a case-by-case basis.
A possible motivating example for better optimization is shown in:
https://llvm.org/PR43198 but that will require other transforms
before anything changes there.
Alive proof:
https://rise4fun.com/Alive/o4E
Name: sign-bit splat
Pre: C1 == (width(%x) - 1)
%s = sub nsw %x, %y
%r = ashr %s, C1
=>
%c = icmp slt %x, %y
%r = sext %c
Name: sign-bit LSB
Pre: C1 == (width(%x) - 1)
%s = sub nsw %x, %y
%r = lshr %s, C1
=>
%c = icmp slt %x, %y
%r = zext %c
If the shift amount was undef for some lane, the shift amount in opposite
shift is irrelevant for that lane, and the new shift amount for that lane
can be undef.
If the shift amount was undef for some lane, the shift amount in opposite
shift is irrelevant for that lane, and the new shift amount for that lane
can be undef.
There is no correctness need for that, and since we allow live-out
uses, this could theoretically happen, because currently nothing
will move the cond to right before the branch in those tests.
But regardless, lifting that restriction even makes the transform
easier to understand.
This makes the transform happen in 81 more cases (+0.55%)
)
In the following loop the dependence distance is 2 and can only be
vectorized if the vector length is no larger than this.
void foo(int *a, int *b, int N) {
#pragma clang loop vectorize(enable) vectorize_width(4)
for (int i=0; i<N; ++i) {
a[i + 2] = a[i] + b[i];
}
}
However, when specifying a VF of 4 via a loop hint this loop is
vectorized. According to [1][2], loop hints are ignored if the
optimization is not safe to apply.
This patch introduces a check to bail of vectorization if the user
specified VF is greater than the maximum feasible VF, unless explicitly
forced with '-force-vector-width=X'.
[1] https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
[2] https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
Reviewed By: sdesmalen, fhahn, Meinersbur
Differential Revision: https://reviews.llvm.org/D90687
This patch replaces the attribute `unsigned VF` in the class
IntrinsicCostAttributes by `ElementCount VF`.
This is a non-functional change to help upcoming patches to compute the cost
model for scalable vector inside this class.
Differential Revision: https://reviews.llvm.org/D91532
Instruction ExtractValue wasn't handled in
LoopVectorizationCostModel::getInstructionCost(). As a result, it was modeled
as a mul which is not really accurate. Since it is free (most of the times),
this now gets a cost of 0 using getInstructionCost.
This is a follow-up of D92208, that required changing this regression test.
In a follow up I will look at InsertValue which also isn't handled yet.
Differential Revision: https://reviews.llvm.org/D92317
Enable performing mandatory inlinings upfront, by reusing the same logic
as the full inliner, instead of the AlwaysInliner. This has the
following benefits:
- reduce code duplication - one inliner codebase
- open the opportunity to help the full inliner by performing additional
function passes after the mandatory inlinings, but before th full
inliner. Performing the mandatory inlinings first simplifies the problem
the full inliner needs to solve: less call sites, more contextualization, and,
depending on the additional function optimization passes run between the
2 inliners, higher accuracy of cost models / decision policies.
Note that this patch does not yet enable much in terms of post-always
inline function optimization.
Differential Revision: https://reviews.llvm.org/D91567
VPPredInstPHIRecipe is one of the recipes that was missed during the
initial conversion. This patch adjusts the recipe to also manage its
operand using VPUser.
If we decided to widen IV with zext, then unsigned comparisons
should not prevent widening (same for sext/sign comparisons).
The result of comparison in wider type does not change in this case.
Differential Revision: https://reviews.llvm.org/D92207
Reviewed By: nikic
Interleave groups also depend on the values they store. Manage the
stored values as VPUser operands. This is currently a NFC, but is
required to allow VPlan transforms and to manage generated vector values
exclusively in VPTransformState.
Reverting commit due to address sanitizer errors.
> Extracting the similar regions is the first step in the IROutliner.
>
> Using the IRSimilarityIdentifier, we collect the SimilarityGroups and
> sort them by how many instructions will be removed. Each
> IRSimilarityCandidate is used to define an OutlinableRegion. Each
> region is ordered by their occurrence in the Module and the regions that
> are not compatible with previously outlined regions are discarded.
>
> Each region is then extracted with the CodeExtractor into its own
> function.
>
> We test that correctly extract in:
> test/Transforms/IROutliner/extraction.ll
> test/Transforms/IROutliner/address-taken.ll
> test/Transforms/IROutliner/outlining-same-globals.ll
> test/Transforms/IROutliner/outlining-same-constants.ll
> test/Transforms/IROutliner/outlining-different-structure.ll
>
> Reviewers: paquette, jroelofs, yroux
>
> Differential Revision: https://reviews.llvm.org/D86975
This reverts commit bf899e8913.
Extracting the similar regions is the first step in the IROutliner.
Using the IRSimilarityIdentifier, we collect the SimilarityGroups and
sort them by how many instructions will be removed. Each
IRSimilarityCandidate is used to define an OutlinableRegion. Each
region is ordered by their occurrence in the Module and the regions that
are not compatible with previously outlined regions are discarded.
Each region is then extracted with the CodeExtractor into its own
function.
We test that correctly extract in:
test/Transforms/IROutliner/extraction.ll
test/Transforms/IROutliner/address-taken.ll
test/Transforms/IROutliner/outlining-same-globals.ll
test/Transforms/IROutliner/outlining-same-constants.ll
test/Transforms/IROutliner/outlining-different-structure.ll
Reviewers: paquette, jroelofs, yroux
Differential Revision: https://reviews.llvm.org/D86975
This was orginally committed in 2245fb8aaa.
but was immediately reverted in f3abd54958
because of a PHI handling issue.
Original commit message:
1. It doesn't make sense to enforce that the bonus instruction
is only used once in it's basic block. What matters is
whether those user instructions fit within our budget, sure,
but that is another question.
2. It doesn't make sense to enforce that said bonus instructions
are only used within their basic block. Perhaps the branch
condition isn't using the value computed by said bonus instruction,
and said bonus instruction is simply being calculated
to be used in successors?
So iff we can clone bonus instructions, to lift these restrictions,
we just need to carefully update their external uses
to use the new cloned instructions.
Notably, this transform (even without this change) appears to be
poison-unsafe as per alive2, but is otherwise (including the patch) legal.
We don't introduce any new PHI nodes, but only "move" the instructions
around, i'm not really seeing much potential for extra cost modelling
for the transform, especially since now we allow at most one such
bonus instruction by default.
This causes the fold to fire +11.4% more (13216 -> 14725)
as of vanilla llvm test-suite + RawSpeed.
The motivational pattern is IEEE-754-2008 Binary16->Binary32
extension code:
ca57d77fb2/src/librawspeed/common/FloatingPoint.h (L115-L120)
^ that should be a switch, but it is not now: https://godbolt.org/z/bvja5v
That being said, even thought this seemed like this would fix it: https://godbolt.org/z/xGq3TM
apparently that fold is happening somewhere else afterall,
so something else also has a similar 'artificial' restriction.
When widening an IndVar that has LCSSA Phi users outside
the loop, we can safely widen it as usual and then truncate
the result outside the loop without hurting the performance.
Differential Revision: https://reviews.llvm.org/D91593
Reviewed By: skatkov
Many bots are unhappy, at the very least missed a few codegen tests,
and possibly this has a logic hole inducing a miscompile
(will be really awesome to have ready reproducer..)
Need to investigate.
This reverts commit 2245fb8aaa.
1. It doesn't make sense to enforce that the bonus instruction
is only used once in it's basic block. What matters is
whether those user instructions fit within our budget, sure,
but that is another question.
2. It doesn't make sense to enforce that said bonus instructions
are only used within their basic block. Perhaps the branch
condition isn't using the value computed by said bonus instruction,
and said bonus instruction is simply being calculated
to be used in successors?
So iff we can clone bonus instructions, to lift these restrictions,
we just need to carefully update their external uses
to use the new cloned instructions.
Notably, this transform (even without this change) appears to be
poison-unsafe as per alive2, but is otherwise (including the patch) legal.
We don't introduce any new PHI nodes, but only "move" the instructions
around, i'm not really seeing much potential for extra cost modelling
for the transform, especially since now we allow at most one such
bonus instruction by default.
This causes the fold to fire +11.4% more (13216 -> 14725)
as of vanilla llvm test-suite + RawSpeed.
The motivational pattern is IEEE-754-2008 Binary16->Binary32
extension code:
ca57d77fb2/src/librawspeed/common/FloatingPoint.h (L115-L120)
^ that should be a switch, but it is not now: https://godbolt.org/z/bvja5v
That being said, even thought this seemed like this would fix it: https://godbolt.org/z/xGq3TM
apparently that fold is happening somewhere else afterall,
so something else also has a similar 'artificial' restriction.
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).
This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.
Differential Revision: https://reviews.llvm.org/D91649
Update VPReplicateRecipe to inherit from VPValue. This still does not
update scalarizeInstruction to set the result for the VPValue of
VPReplicateRecipe, because this first requires tracking scalar values in
VPTransformState.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D91500
When bailing out in rewriteLoopExitValues() you could be left with PHI
nodes in the DeadInsts vector. Those would be not handled by the use of
RecursivelyDeleteTriviallyDeadInstructions() in IndVarSimplify. This
resulted in the IndVarSimplify pass returning an incorrect modified
status. This was caught by the expensive check introduced in D86589.
This patches changes IndVarSimplify so that it deletes those PHI nodes,
using RecursivelyDeleteDeadPHINode().
This fixes PR47486.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D91153
LoopLoadElim may end up expanding an AddRec from a loop
which is not the current loop. This loop may not be in simplify
form. We figure it out after the no-return point, so cannot bail
in this case.
AddRec requires simplify form to expand. The only way to ensure
this does not crash is to simplify all loops beforehand.
The issue only exists in new PM. Old PM requests LoopSimplify
required pass and it simplifies all loops before the opt begins.
Differential Revision: https://reviews.llvm.org/D91525
Reviewed By: asbirlea, aeubanks
Currently, `-indvars` runs first, and then immediately after `-loop-idiom` does.
I'm not really sure if `-loop-idiom` requires `-indvars` to run beforehand,
but i'm *very* sure that `-indvars` requires `-loop-idiom` to run afterwards,
as it can be seen in the phase-ordering test.
LoopIdiom runs on two types of loops: countable ones, and uncountable ones.
For uncountable ones, IndVars obviously didn't make any change to them,
since they are uncountable, so for them the order should be irrelevant.
For countable ones, well, they should have been countable before IndVars
for IndVars to make any change to them, and since SCEV is used on them,
it shouldn't matter if IndVars have already canonicalized them.
So i don't really see why we'd want the current ordering.
Should this cause issues, it will give us a reproducer test case
that shows flaws in this logic, and we then could adjust accordingly.
While this is quite likely beneficial in-the-wild already,
it's a required part for the full motivational pattern
behind `left-shift-until-bittest` loop idiom (D91038).
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91800
MaxSafeRegisterWidth is a misnomer since it actually returns the maximum
safe vector width. Register suggests it relates directly to a physical
register where it could be a vector spanning one or more physical
registers.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D91727
This is a follow-up to 00a6601136 to make
isa<VPReductionRecipe> work and unifies the VPValue ID names, by making
sure they all consistently start with VPV*.
Similar to other patches, this makes VPWidenRecipe a VPValue. Because of
the way it interacts with the reduction code it also slightly alters the
way that VPValues are registered, removing the up front NeedDef and
using getOrAddVPValue to create them on-demand if needed instead.
Differential Revision: https://reviews.llvm.org/D88447
This converts the VPReductionRecipe into a VPValue, like other
VPRecipe's in preparation for traversing def-use chains. It also makes
it a VPUser, now storing the used VPValues as operands.
It doesn't yet change how the VPReductionRecipes are created. It will
need to call replaceAllUsesWith from the original recipe they replace,
but that is not done yet as VPWidenRecipe need to be created first.
Differential Revision: https://reviews.llvm.org/D88382
When deciding to widen narrow use, we may need to prove some facts
about it. For proof, the context is used. Currently we take the instruction
being widened as the context.
However, we may be more precise here if we take as context the point that
dominates all users of instruction being widened.
Differential Revision: https://reviews.llvm.org/D90456
Reviewed By: skatkov
Some older code - and code copied from older code - still directly tested against the singelton result of SE::getCouldNotCompute. Using the isa<SCEVCouldNotCompute> form is both shorter, and more readable.
We need to preserve wrapping flags to allow better folds.
The cases with geps may be non-intuitive, but that appears to agree with Alive2:
https://alive2.llvm.org/ce/z/JQcqw7
We create 'nsw' ops independent from the original wrapping on the sub.
Previously this option could be used to skip devirtualizations of the
given functions in regular LTO and in the ThinLTO indexing step. This
change allows them to be skipped in the backend as well, which is useful
when debugging WPD in a distributed ThinLTO backend.
Differential Revision: https://reviews.llvm.org/D91812
Fix PR47390.
The primary induction should be considered alive when folding tail by masking,
because it will be used by said masking; even when it may otherwise appear
useless: feeding only its own 'bump', which is correctly considered dead, and
as the 'bump' of another induction variable, which may wrongfully want to
consider its bump = the primary induction, dead.
Differential Revision: https://reviews.llvm.org/D92017
The legacy pass didn't properly detect indirect calls.
We can still remove the convergent attribute when there are indirect
calls. The LangRef says:
> When it appears on a call/invoke, the convergent attribute indicates
that we should treat the call as though we’re calling a convergent
function. This is particularly useful on indirect calls; without this we
may treat such calls as though the target is non-convergent.
So don't skip handling of convergent when there are unknown calls.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D89826
A uniform load is one which loads from a uniform address across all lanes. As currently implemented, we cost model such loads as if we did a single scalar load + a broadcast, but the actual lowering replicates the load once per lane.
This change tweaks the lowering to use the REPLICATE strategy by marking such loads (and the computation leading to their memory operand) as uniform after vectorization. This is a useful change in itself, but it's real purpose is to pave the way for a following change which will generalize our uniformity logic.
In review discussion, there was an issue raised with coupling cost modeling with the lowering strategy for uniform inputs. The discussion on that item remains unsettled and is pending larger architectural discussion. We decided to move forward with this patch as is, and revise as warranted once the bigger picture design questions are settled.
Differential Revision: https://reviews.llvm.org/D91398
This is a retry of 324a53205. I cautiously reverted that at 6aa3fc4
because the rules about gep math were not clear. Since then, we
have added this line to LangRef for gep inbounds:
"The successive addition of offsets (without adding the base address)
does not wrap the pointer index type in a signed sense (nsw)."
See D90708 and post-commit comments on the revert patch for more details.
I disabled the widening in fa5cb4b because it run in an assert, which was
related to replacing values with different types. I forgot that an extend could
also be a zero-extend, which I have added now. This means that the approach now
is to create and insert a trunc value of the outerloop for each user, and use
that to replace IV values.
Differential Revision: https://reviews.llvm.org/D91690
The function was introduced on Jan 23, 2019 in commit
73078ecd38.
Its definition was removed on Oct 27, 2020 in commit
0930763b4b, leaving the declaration
unused.
This change introduces a new IR intrinsic named `llvm.pseudoprobe` for pseudo-probe block instrumentation. Please refer to https://reviews.llvm.org/D86193 for the whole story.
A pseudo probe is used to collect the execution count of the block where the probe is instrumented. This requires a pseudo probe to be persisting. The LLVM PGO instrumentation also instruments in similar places by placing a counter in the form of atomic read/write operations or runtime helper calls. While these operations are very persisting or optimization-resilient, in theory we can borrow the atomic read/write implementation from PGO counters and cut it off at the end of compilation with all the atomics converted into binary data. This was our initial design and we’ve seen promising sample correlation quality with it. However, the atomics approach has a couple issues:
1. IR Optimizations are blocked unexpectedly. Those atomic instructions are not going to be physically present in the binary code, but since they are on the IR till very end of compilation, they can still prevent certain IR optimizations and result in lower code quality.
2. The counter atomics may not be fully cleaned up from the code stream eventually.
3. Extra work is needed for re-targeting.
We choose to implement pseudo probes based on a special LLVM intrinsic, which is expected to have most of the semantics that comes with an atomic operation but does not block desired optimizations as much as possible. More specifically the semantics associated with the new intrinsic enforces a pseudo probe to be virtually executed exactly the same number of times before and after an IR optimization. The intrinsic also comes with certain flags that are carefully chosen so that the places they are probing are not going to be messed up by the optimizer while most of the IR optimizations still work. The core flags given to the special intrinsic is `IntrInaccessibleMemOnly`, which means the intrinsic accesses memory and does have a side effect so that it is not removable, but is does not access memory locations that are accessible by any original instructions. This way the intrinsic does not alias with any original instruction and thus it does not block optimizations as much as an atomic operation does. We also assign a function GUID and a block index to an intrinsic so that they are uniquely identified and not merged in order to achieve good correlation quality.
Let's now look at an example. Given the following LLVM IR:
```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
%cmp = icmp eq i32 %x, 0
br i1 %cmp, label %bb1, label %bb2
bb1:
br label %bb3
bb2:
br label %bb3
bb3:
ret void
}
```
The instrumented IR will look like below. Note that each `llvm.pseudoprobe` intrinsic call represents a pseudo probe at a block, of which the first parameter is the GUID of the probe’s owner function and the second parameter is the probe’s ID.
```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
%cmp = icmp eq i32 %x, 0
call void @llvm.pseudoprobe(i64 837061429793323041, i64 1)
br i1 %cmp, label %bb1, label %bb2
bb1:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 2)
br label %bb3
bb2:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 3)
br label %bb3
bb3:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 4)
ret void
}
```
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D86490
Summary:
Expand existing loopsink testing to also test loopsinking using new pass
manager. Enable memoryssa for loopsink with new pass manager. This
combination exposed a bug that was previously fixed for loopsink
without memoryssa. When sinking an instruction into a loop, the source
block may not be part of the loop but still needs to be checked for
pointer invalidation. This is the fix for bugzilla #39695 (PR 54659)
expanded to also work with memoryssa.
Respond to review comments. Enable Memory SSA in legacy Loop Sink pass
under EnableMSSALoopDependency option control. Update tests accordingly.
Respond to review comments. Add options controlling whether memoryssa is
used for loop sink, defaulting to off. Expand testing based on these
options.
Respond to review comments. Properly indicated preserved analyses.
This relanding addresses a compile-time performance problem by moving
test for profile data earlier to avoid unnecessary computations.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea)
Differential Revision: https://reviews.llvm.org/D90249
This reuses the existing lower-matrix-intrinsics pass rather than going
the legacy pass route of creating a new pass.
Use this new variant in the NPM -O0 pipeline.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D91811
Unused since https://reviews.llvm.org/D91762 and triggering
-Wunused-private-field
```
llvm/lib/Transforms/Instrumentation/DataFlowSanitizer.cpp:365:13: error: private field 'GetArgTLS' is not used [-Werror,-Wunused-private-field]
Constant *GetArgTLS;
^
llvm/lib/Transforms/Instrumentation/DataFlowSanitizer.cpp:366:13: error: private field 'GetRetvalTLS' is not used [-Werror,-Wunused-private-field]
Constant *GetRetvalTLS;
```
Reviewed By: stephan.yichao.zhao
Differential Revision: https://reviews.llvm.org/D91820
One less instruction and reducing use count of zext.
As alive2 confirms, we're fine with all the weird combinations of
undef elts in constants, but unless the shift amount was undef
for a lane, we must sanitize undef mask to zero, since sign bits
are no longer zeros.
https://rise4fun.com/Alive/d7r
```
----------------------------------------
Optimization: zz
Precondition: ((C1 == (width(%r) - width(%x))) && isSignBit(C2))
%o0 = zext %x
%o1 = shl %o0, C1
%r = and %o1, C2
=>
%n0 = sext %x
%r = and %n0, C2
Done: 2016
Optimization is correct!
```
When constructing a MemoryLocation by hand, require that a
LocationSize is explicitly specified. D91649 will split up
LocationSize::unknown() into two different states, and callers
should make an explicit choice regarding the kind of MemoryLocation
they want to have.
rGf571fe6df585127d8b045f8e8f5b4e59da9bbb73 led to a warning of an unused
variable for MaxSafeDepDist (written but not used). It seems this
variable and assignment can be safely removed.
Summary:
Add support for passing source locations to libomptarget runtime functions using the ident_t struct present in the rest of the libomp API. This will allow the runtime system to give much more insightful error messages and debugging values.
Reviewers: jdoerfert grokos
Differential Revision: https://reviews.llvm.org/D87946
The assertion that vector widths are <= 256 elements was hard wired in the LV code. Eg, VE allows for vectors up to 512 elements. Test again the TTI vector register bit width instead - this is an NFC for non-asserting builds.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D91518
Some nested loops may share the same ExitingBB, so after we finishing FoldExit,
we need to notify OuterLoop and SCEV to drop any stored trip count.
Patched by: guopeilin
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D91325
This reverts commit 562addba65.
Reverted change too quickly, the failing test cases passed on the next build.
So reverting revert (to include the changes).
Summary:
This patch adds support for passing in the original delcaration name in the source file to the libomptarget runtime. This will allow the runtime to provide more intelligent debugging messages. This patch takes the original expression parsed from the OpenMP map / update clause and provides a textual representation if it was explicitly mapped, otherwise it takes the name of the variable declaration as a fallback. The information in passed to the runtime in a global array of strings that matches the existing ident_t source location strings using ";name;filename;column;row;;"
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D89802
This is the same fix as 23aeadb89d,
just for CloneScopedAliasMetadata rather than PropagateCallSiteMetadata.
In this case the previous outcome was incorrectly dropped metadata,
as it was not part of the computed metadata map.
The real change in the test is that the first load now retains
metadata, the rest of the changes are due to changes in metadata
numbering.
The VMap also contains a mapping from Argument => Instruction,
where the instruction is part of the original function, not the
inlined one. The code was assuming that all the instructions in
the VMap were inlined.
This was a pre-existing problem for the loop access metadata, but
was extended to the more common noalias metadata by
27f647d117, thus causing miscompiles.
There is a similar assumption inside CloneAliasScopeMetadata(), so
that one likely needs to be fixed as well.
Summary:
Expand existing loopsink testing to also test loopsinking using new pass
manager. Enable memoryssa for loopsink with new pass manager. This
combination exposed a bug that was previously fixed for loopsink
without memoryssa. When sinking an instruction into a loop, the source
block may not be part of the loop but still needs to be checked for
pointer invalidation. This is the fix for bugzilla #39695 (PR 54659)
expanded to also work with memoryssa.
Respond to review comments. Enable Memory SSA in legacy Loop Sink pass
under EnableMSSALoopDependency option control. Update tests accordingly.
Respond to review comments. Add options controlling whether memoryssa is
used for loop sink, defaulting to off. Expand testing based on these
options.
Respond to review comments. Properly indicated preserved analyses.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea)
Differential Revision: https://reviews.llvm.org/D90249
As Wei Mi is reporting in post-commit review
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20201116/853479.html
teaching -reassociate about add-like-or's (70472f3) results in breaking apart
load widening patterns, and reassociating them.
For now, simply exclude any such `or` that appears to be a root of
load widening idiom from the or->add transformation.
Note that the heuristic is greedy, it doesn't ensure that loads
can *actually* be widened into a single load.
This patch generalizes the extraction of a constraint for a given
condition. It allows decompose to return a vector of c * X pairs, which
allows de-composing multiple instructions in the future.
It also adds more clarifying comments.
In some cases we can handle IV and iter count of different types. It's a typical situation
after IV have been widened. This patch adds support for such cases, when legal.
Differential Revision: https://reviews.llvm.org/D88528
Reviewed By: skatkov
Support more instructions in SpeculativeExecution pass:
- ExtractElement
- InsertElement
- ShuffleVector
Differential Revision: https://reviews.llvm.org/D91633
I don't see any reason not to unconditionally retrieve TLI, it's fairly
cheap.
Fixes calls-errno.ll under NPM.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D91476
m_SpecificInt has the same 'no undef element' behaviour as m_APInt so no change there, and anyway we have test coverage for undef elements in the fold.
Noticed while fixing a Wshadow warning about shadow Value *X, *Y variables.
This patch introduces a new VPDef class, which can be used to
manage VPValues defined by recipes/VPInstructions.
The idea here is to mirror VPUser for values defined by a recipe. A
VPDef can produce either zero (e.g. a store recipe), one (most recipes)
or multiple (VPInterleaveRecipe) result VPValues.
To traverse the def-use chain from a VPDef to its users, one has to
traverse the users of all values defined by a VPDef.
VPValues now contain a pointer to their corresponding VPDef, if one
exists. To traverse the def-use chain upwards from a VPValue, we first
need to check if the VPValue is defined by a VPDef. If it does not have
a VPDef, this means we have a VPValue that is not directly defined
iniside the plan and we are done.
If we have a VPDef, it is defined inside the region by a recipe, which
is a VPUser, and the upwards def-use chain traversal continues by
traversing all its operands.
Note that we need to add an additional field to to VPVAlue to link them
to their defs. The space increase is going to be offset by being able to
remove the SubclassID field in future patches.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D90558
This wasn't properly remapping the type like with the other
attributes, so this would end up hitting a verifier error after
linking different modules using byref.
For the scattered operands of load instructions it makes sense
to use gathering load intrinsic, which can lower to native instruction
for X86/AVX512 and ARM/SVE. This also enables building
vectorization tree with entries containing scattered operands.
The next step is to add scattered store.
Fixes PR47629 and PR47623
Differential Revision: https://reviews.llvm.org/D90445
When processing conditional branches, if the condition is an AND of 2 compares
and the true successor only has the current block as predecessor, queue both
conditions for the true successor.
When instructions are cloned from block BB to PredBB in the method
DuplicateCondBranchOnPHIIntoPred() number of successors of PredBB
changes from 1 to number of successors of BB. So we have to copy
branch probabilities from BB to PredBB.
Reviewed By: Kazu Hirata
Differential Revision: https://reviews.llvm.org/D90841
With a function pass manager, it would insert debuginfo metadata before
getting to function passes while processing the pass manager, causing
debugify to skip while running the function passes.
Skip special passes + verifier + printing passes. Compared to the legacy
implementation of -debugify-each, this additionally skips verifier
passes. Probably no need to update the legacy version since it will be
obsolete soon.
This fixes 2 instcombine tests using -debugify-each under NPM.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D91558
- In certain cases, a generic pointer could be assumed as a pointer to
the global memory space or other spaces. With a dedicated target hook
to query that address space from a given value, infer-address-space
pass could infer and propagate that to all its users.
Differential Revision: https://reviews.llvm.org/D91121
When processing conditional branches, if the condition is an OR of 2 compares
and the false successor only has the current block as predecessor, queue both
negated conditions for the false successor
In the existing logic, for a given alloca, as long as its pointer value is stored into another location, it's considered as escaped.
This is a bit too conservative. Specifically, in non-optimized build mode, it's often to have patterns of code that first store an alloca somewhere and then load it right away.
These used should be handled without conservatively marking them escaped.
This patch tracks how the memory location where an alloca pointer is stored into is being used. As long as we only try to load from that location and nothing else, we can still
consider the original alloca not escaping and keep it on the stack instead of putting it on the frame.
Differential Revision: https://reviews.llvm.org/D91305
This patch adds a new pass to add !annotation metadata for entries in
@llvm.global.anotations, which is generated using
__attribute__((annotate("_name"))) on functions in Clang.
This has been discussed on llvm-dev as part of
RFC: Combining Annotation Metadata and Remarks
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D91195
Widen the IV to the widest available and legal integer type, which makes this
transformations always safe so that we can skip overflow checks.
Motivation is to let this pass trigger on 64-bit targets too, and this is the
last patch in a serie to achieve this: D90402 moves pass LoopFlatten to just
before IndVarSimplify so that IVs are not already widened, D90421 factors out
widening from IndVarSimplify into Utils/SimplifyIndVar so that we can also use
it in LoopFlatten.
Differential Revision: https://reviews.llvm.org/D90640
This patch teaches the jump threading pass to call BPI->eraseBlock
when it folds a conditional branch.
Without this patch, BranchProbabilityInfo could end up with stale edge
probabilities for the basic block containing the conditional branch --
one edge probability with less than 1.0 and the other for a removed
edge.
This patch is one of the steps before we can safely re-apply D91017.
Differential Revision: https://reviews.llvm.org/D91511
In the last change to IRCE the BPI is ignored if BFI is present, however
BFI and BPI have a different thresholds. Specifically BPI approach checks only
latch exit probability so it is expected if the loop has only one exit block (latch)
the behavior with BFI and BPI should be the same,
BPI approach by default uses threshold 10, so it considers the loop with estimated
number of iterations less then 10 should not be considered for IRCE optimization.
BFI approach uses the default value 3 and this is inconsistent.
The CL modifies the code to use the same threshold for both approaches..
The test is updated due to it has two side-exits (except latch) and each of them has a
probability 1/16, so BFI estimates the number of runtime iteration is about to 7
(1/16 + 1/16 + some for latch) and test fails.
Reviewers: mkazantsev, ebrevnov
Reviewed By: mkazantsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D91230
There are 1-2 potential follow-up NFC commits to reduce
this further on the way to generalizing this for vectors.
The operand replacing path should be dead code because demanded
bits handles that more generally (D91415).
I noticed an add example like the one from D91343, so here's a similar patch.
The logic is based on existing code for the single-use demanded bits fold.
But I only matched a constant instead of using compute known bits on the
operands because that was the motivating patterni that I noticed.
I think this will allow removing a special-case (but incomplete) dedicated
fold within visitAnd(), but I need to untangle the existing code to be sure.
https://rise4fun.com/Alive/V6fP
Name: add with low mask
Pre: (C1 & (-1 u>> countLeadingZeros(C2))) == 0
%a = add i8 %x, C1
%r = and i8 %a, C2
=>
%r = and i8 %x, C2
Differential Revision: https://reviews.llvm.org/D91415
This patch turns VPWidenGEPRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84683
This is used to test RemoveRedundantDbgInstrs(), which is used by other
passes.
Reviewed By: ychen
Differential Revision: https://reviews.llvm.org/D91477
See discussion in https://bugs.llvm.org/show_bug.cgi?id=45073 / https://reviews.llvm.org/D66324#2334485
the implementation is known-broken for certain inputs,
the bugreport was up for a significant amount of timer,
and there has been no activity to address it.
Therefore, just completely rip out all of misexpect handling.
I suspect, fixing it requires redesigning the internals of MD_misexpect.
Should anyone commit to fixing the implementation problem,
starting from clean slate may be better anyways.
This reverts commit 7bdad08429,
and some of it's follow-ups, that don't stand on their own.
dependency. NFC
Use findSingleDependency in place of FindDependencies and stop passing a
set of Instructions around. Modify FindDependencies to return a boolean
flag which indicates whether the dependencies it has found are all
valid.
Like inlineCallIfPossible and InlinerPass, after inlining mergeAttributesForInlining
should be called to merge callee's attributes to caller. But it is not called in
AlwaysInliner, causes caller's attributes inconsistent with inlined code.
Attached test case demonstrates that attribute "min-legal-vector-width"="512" is
not merged into caller without this patch, and it causes failure in SelectionDAG
when lowering the inlined AVX512 intrinsic.
Differential Revision: https://reviews.llvm.org/D91446
Handle the emission of the add in a single place, instead of three
different ones.
Don't emit an unnecessary add with zero to start with. It will get
dropped by InstCombine, but we may as well not create it in the
first place. This also means that InstCombine does not need to
specially handle this extra add.
This is conceptually NFC, but can affect worklist order etc.
Use exact component name in add_ocaml_library.
Make expand_topologically compatible with new architecture.
Fix quoting in is_llvm_target_library.
Fix LLVMipo component name.
Write release note.
This patch adds a new !annotation metadata kind which can be used to
attach annotation strings to instructions.
It also adds a new pass that emits summary remarks per function with the
counts for each annotation kind.
The intended uses cases for this new metadata is annotating
'interesting' instructions and the remarks should provide additional
insight into transformations applied to a program.
To motivate this, consider these specific questions we would like to get answered:
* How many stores added for automatic variable initialization remain after optimizations? Where are they?
* How many runtime checks inserted by a frontend could be eliminated? Where are the ones that did not get eliminated?
Discussed on llvm-dev as part of 'RFC: Combining Annotation Metadata and Remarks'
(http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html)
Reviewed By: thegameg, jdoerfert
Differential Revision: https://reviews.llvm.org/D91188
No longer rely on an external tool to build the llvm component layout.
Instead, leverage the existing `add_llvm_componentlibrary` cmake function and
introduce `add_llvm_component_group` to accurately describe component behavior.
These function store extra properties in the created targets. These properties
are processed once all components are defined to resolve library dependencies
and produce the header expected by llvm-config.
Differential Revision: https://reviews.llvm.org/D90848
If we cannot prove that the check is trivially true, but can prove that it either
fails on the 1st iteration or never fails, we can replace it with first iteration check.
Differential Revision: https://reviews.llvm.org/D88527
Reviewed By: skatkov
There might be some demanded/known bits way to generalize this,
but I'm not seeing it right now.
This came up as a regression when I was looking at a different
demanded bits improvement.
https://rise4fun.com/Alive/5fl
Name: general
Pre: ((-1 << countTrailingZeros(C1)) & C2) == 0
%a1 = add i8 %x, C1
%a2 = and i8 %x, C2
%r = sub i8 %a1, %a2
=>
%r = and i8 %a1, ~C2
Name: test 1
%a1 = add i8 %x, 192
%a2 = and i8 %x, 10
%r = sub i8 %a1, %a2
=>
%r = and i8 %a1, -11
Name: test 2
%a1 = add i8 %x, -108
%a2 = and i8 %x, 3
%r = sub i8 %a1, %a2
=>
%r = and i8 %a1, -4
Summary:
Refactor SinkAdHoistLICMFlags from a struct to a class with accessors and constructors to allow other
classes to construct flags with meaningful defaults while not exposing LICM internal details.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea)
Differential Revision: https://reviews.llvm.org/D90482
Sometimes the an instruction we are trying to widen is used by the IV
(which means the instruction is the IV increment). Currently this may
prevent its widening. We should ignore such user because it will be
dead once the transform is done anyways.
Differential Revision: https://reviews.llvm.org/D90920
Reviewed By: fhahn
In the existing logic, for a given alloca, as long as its pointer value is stored into another location, it's considered as escaped.
This is a bit too conservative. Specifically, in non-optimized build mode, it's often to have patterns of code that first store an alloca somewhere and then load it right away.
These used should be handled without conservatively marking them escaped.
This patch tracks how the memory location where an alloca pointer is stored into is being used. As long as we only try to load from that location and nothing else, we can still
consider the original alloca not escaping and keep it on the stack instead of putting it on the frame.
Differential Revision: https://reviews.llvm.org/D91305
InstCombine canonicalizes 'sub nuw' instructions to 'add' without the
`nuw` flag. The typical case where we see it is decrementing induction
variables. For them, IndVars fails to prove that it's legal to widen them,
and inserts unprofitable `zext`'s.
This patch adds recognition of such pattern using SCEV.
Differential Revision: https://reviews.llvm.org/D89550
Reviewed By: fhahn, skatkov
We need to be able to call function pointers. Inline the dispatch
function.
Also inline the context projection function.
Transfer debug locations from the suspend point to the inlined functions.
Use the function argument index instead of the function argument in
coro.id.async. This solves any spurious use issues.
Coerce the arguments of the tail call function at a suspend point. The LLVM
optimizer seems to drop casts leading to a vararg intrinsic.
rdar://70097093
Differential Revision: https://reviews.llvm.org/D91098
Previously the inliner did a bit of a hack by adding ref edges for all
new edges introduced by performing an inline before calling
updateCGAndAnalysisManagerForPass(). This was because
updateCGAndAnalysisManagerForPass() didn't handle new non-trivial call
edges.
This adds handling of non-trivial call edges to
updateCGAndAnalysisManagerForPass(). The inliner called
updateCGAndAnalysisManagerForFunctionPass() since it was handling adding
newly introduced edges (so updateCGAndAnalysisManagerForPass() would
only have to handle promotion), but now it needs to call
updateCGAndAnalysisManagerForCGSCCPass() since
updateCGAndAnalysisManagerForPass() is now handling the new call edges
and function passes cannot add new edges.
We follow the previous path of adding trivial ref edges then letting promotion
handle changing the ref edges to call edges and the CGSCC updates. So
this still does not allow adding call edges that result in an addition
of a non-trivial ref edge.
This is in preparation for better detecting devirtualization. Previously
since the inliner itself would add ref edges,
updateCGAndAnalysisManagerForPass() would think that promotion and thus
devirtualization had happened after any sort of inlining.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D91046
Before the change, DFSan always does the propagation. W/o
origin tracking, it is harder to understand such flows. After
the change, the flag is off by default.
Reviewed-by: morehouse
Differential Revision: https://reviews.llvm.org/D91234
This reverts commits:
* [LoopVectorizer] NFCI: Calculate register usage based on TLI.getTypeLegalizationCost.
b873aba394.
* [LoopVectorizer] Silence warning in GetRegUsage.
9ff701100a.
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them........
This patch silences the warning:
error: lambda capture 'DL' is not used [-Werror,-Wunused-lambda-capture]
auto GetRegUsage = [&DL, &TTI=TTI](Type *Ty, ElementCount VF) {
~^~~
1 error generated.
Introduced in:
https://reviews.llvm.org/rGb873aba3943c067a5efd5303cbdf5aeb0732cf88
This is more accurate than dividing the bitwidth based on the element count by the
maximum register size, as it can just reuse whatever has been calculated for
legalization of these types.
This change is also necessary when calculating register usage for scalable vectors, where
the legalization of these types cannot be done based on the widest register size, because
that does not take the 'vscale' component into account.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D91059
delete abs/nabs handling in earlycse pass to avoid bugs related to
hashing values. After abs/nabs is canonicalized to intrinsics in D87188,
we should get CSE ability for abs/nabs back.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D90734
Tracking local variables across suspend points is still somewhat incomplete.
Consider this coroutine snippet:
```
resumable foo() {
int x[10] = {};
int a = 3;
co_await std::experimental::suspend_always();
a++;
x[0] = 1;
a += 2;
x[1] = 2;
a += 3;
x[2] = 3;
}
```
Can't manage to print `a` or `x` if they turn out to be allocas during
CoroSplit (which happens if you build this code with `-O0` prior to this
commit):
```
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x0000000100003729 main-noprint`foo() at main-noprint.cpp:43:5
40 co_await std::experimental::suspend_always();
41 a++;
42 x[0] = 1;
-> 43 a += 2;
44 x[1] = 2;
45 a += 3;
46 x[2] = 3;
(lldb) p x
error: <user expression 21>:1:1: use of undeclared identifier 'x'
x
^
```
The generated IR contains a `llvm.dbg.declare` for `x` in it's initialization
basic block. After CoroSplit, the `llvm.dbg.declare` might not dominate all of
`x` uses and we lose debugging quality.
Add `llvm.dbg.value`s to all relevant basic blocks such that if later
transformations break the dominance the reliable debug info is already in
place. For instance, this BB:
```
await.ready:
...
%arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 0, !dbg !760
...
%arrayidx19 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 1, !dbg !763
...
%arrayidx21 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 2, !dbg !766
```
becomes:
```
await.ready:
...
call void @llvm.dbg.value(metadata [10 x i32]* %x.reload.addr, metadata !751, metadata !DIExpression()), !dbg !753
...
%arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 0, !dbg !760
...
%arrayidx19 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 1, !dbg !763
...
%arrayidx21 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 2, !dbg !766
```
Differential Revision: https://reviews.llvm.org/D90772
This is a prep step for widening induction variables in LoopFlatten if this is
posssible (D90640), to avoid having to perform certain overflow checks. Since
IndVarSimplify may already widen induction variables, we want to run
LoopFlatten just before IndVarSimplify. This is a minor reshuffle as both
passes were already close after each other.
Differential Revision: https://reviews.llvm.org/D90402
This converts LoopFlatten from a LoopPass to a FunctionPass so that we don't
run into problems of a loop pass deleting a (inner)loop.
Differential Revision: https://reviews.llvm.org/D90940
This patch turns VPWidenSelectRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84682
This was missing as discovered by the SystemZ multistage bot:
http://lab.llvm.org:8011/#/builders/8, where wrong code resulted when this
extension was not performed.
Thanks for review by Ulrich Weigand and Roman Lebedev.
Differential Revision: https://reviews.llvm.org/D90760
We already do not unroll loops with vector instructions under MVE, but
that does not include the remainder loops that the vectorizer produces.
These remainder loops will be rarely executed and are not worth
unrolling, as the trip count is likely to be low if they get executed at
all. Luckily they get llvm.loop.isvectorized to make recognizing them
simpler.
We have wanted to do this for a while but hit issues with low overhead
loops being reverted due to difficult registry allocation. With recent
changes that seems to be less of an issue now.
Differential Revision: https://reviews.llvm.org/D90055
The LoopDistribute pass is missing from the LTO pipeline, so
-enable-loop-distribute has no effect during post-link. The pre-link
loop distribution doesn't seem to survive the LTO pipeline either.
With this patch (and -flto -mllvm -enable-loop-distribute) we see a 43%
uplift on SPEC 2006 hmmer for AArch64. The rest of SPECINT 2006 is
unaffected.
Differential Revision: https://reviews.llvm.org/D89896
Interfaces changed to take `ElementCount` as parameters:
* LoopVectorizationPlanner::buildVPlans
* LoopVectorizationPlanner::buildVPlansWithVPRecipes
* LoopVectorizationCostModel::selectVectorizationFactor
This patch is NFC for fixed-width vectors.
Reviewed By: dmgreen, ctetreau
Differential Revision: https://reviews.llvm.org/D90879
For consistency with the IRBuilder, OpenMPIRBuilder has method names starting with 'Create'. However, the LLVM coding style has methods names starting with lower case letters, as all other OpenMPIRBuilder already methods do. The clang-tidy configuration used by Phabricator also warns about the naming violation, adding noise to the reviews.
This patch renames all `OpenMPIRBuilder::CreateXYZ` methods to `OpenMPIRBuilder::createXYZ`, and updates all in-tree callers.
I tested check-llvm, check-clang, check-mlir and check-flang to ensure that I did not miss a caller.
Reviewed By: mehdi_amini, fghanim
Differential Revision: https://reviews.llvm.org/D91109
Prior to D89768, any alloca that's used after suspension points will be put on to the coroutine frame, and hence they will always be reloaded in the resume function.
However D89768 introduced a more precise way to determine whether an alloca should live on the frame. Allocas that are only used within one suspension region (hence does not need to live across suspension points) will not be put on the frame. They will remain local to the resume function.
When creating the new entry for the .resume function, the existing logic only moved all the allocas from the old entry to the new entry. This covers every alloca from the old entry. However allocas that's defined afer coro.begin are put into a separate basic block during CoroSplit (the PostSpill basic block). We need to make sure these allocas are moved to the new entry as well if they are used.
This patch walks through all allocas, and check if they are still used but are not reachable from the new entry, if so, we move them to the new entry.
Differential Revision: https://reviews.llvm.org/D90977
Introduce struct FlattenInfo to group some of the bookkeeping. Besides this
being a bit of a clean-up, it is a prep step for next additions (D90640). I
could take things a bit further, but thought this was a good first step also
not to make this change too large.
Differential Revision: https://reviews.llvm.org/D90408
This patch turns VPWidenCall into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84681
Feeding vector values to `InstCombiner::OptimizeOverflowCheck` produces a scalar boolean flag if it proves the overflow check can be eliminated.
This causes `InstCombiner::CreateOverflowTuple` to crash as it correctly expects a vector of i1 values instead.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D89628
As raised by @nlopes on D90382 - if this is not a rotate then the select was blocking poison from the 'shift-by-zero' non-TVal, but a funnel shift won't - so freeze it.
From C11 and C++11 onwards, a forward-progress requirement has been
introduced for both languages. In the case of C, loops with non-constant
conditionals that do not have any observable side-effects (as defined by
6.8.5p6) can be assumed by the implementation to terminate, and in the
case of C++, this assumption extends to all functions. The clang
frontend will emit the `mustprogress` function attribute for C++
functions (D86233, D85393, D86841) and emit the loop metadata
`llvm.loop.mustprogress` for every loop in C11 or later that has a
non-constant conditional.
This patch modifies LoopDeletion so that only loops with
the `llvm.loop.mustprogress` metadata or loops contained in functions
that are required to make progress (`mustprogress` or `willreturn`) are
checked for observable side-effects. If these loops do not have an
observable side-effect, then we delete them.
Loops without observable side-effects that do not satisfy the above
conditions will not be deleted.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86844
Currently, LoopDeletion refuses to remove dead loops with no exit blocks
because it cannot statically determine the control flow after it removes
the block. This leads to miscompiles if the loop is an infinite loop and
should've been removed.
Differential Revision: https://reviews.llvm.org/D90115
Results of convergent operations are implicitly affected by the
enclosing control flows and should not be hoisted out of arbitrary
loops.
Patch by Xiaoqing Wu <xiaoqing_wu@apple.com>
Differential Revision: https://reviews.llvm.org/D90361
The `llvm.coro.suspend.async` intrinsic takes a function pointer as its
argument that describes how-to restore the current continuation's
context from the context argument of the continuation function. Before
we assumed that the current context can be restored by loading from the
context arguments first pointer field (`first_arg->caller_context`).
This allows for defining suspension points that reuse the current
context for example.
Also:
llvm.coro.id.async lowering: Add llvm.coro.preprare.async intrinsic
Blocks inlining until after the async coroutine was split.
Also, change the async function pointer's context size position
struct async_function_pointer {
uint32_t relative_function_pointer_to_async_impl;
uint32_t context_size;
}
And make the position of the `async context` argument configurable. The
position is specified by the `llvm.coro.id.async` intrinsic.
rdar://70097093
Differential Revision: https://reviews.llvm.org/D90783
This patch changes the type of Start, End in VFRange to be an ElementCount
instead of `unsigned`. This is done as preparation to make VPlans for
scalable vectors, but is otherwise NFC.
Reviewed By: dmgreen, fhahn, vkmr
Differential Revision: https://reviews.llvm.org/D90715
This moves WidenIV from IndVarSimplify to Utils/SimplifyIndVar so that we have
createWideIV available as a generic helper utility. I.e., this is not only
useful in IndVarSimplify, but could be useful for loop transformations. For
example, motivation for this refactoring is the loop flatten transformation: if
induction variables in a loop nest can be widened, we can avoid having to
perform certain overflow checks, enabling this transformation.
Differential Revision: https://reviews.llvm.org/D90421
When replacing an assume(false) with a store, we have to be more careful
with the order we insert the new access. This patch updates the code to
look at the accesses in the block to find a suitable insertion point.
Alterantively we could check the defining access of the assume, but IIRC
there has been some discussion about making assume() readnone, so
looking at the access list might be more future proof.
Fixes PR48072.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D90784
This patch adds the `async` lowering of coroutines.
This will be used by the Swift frontend to lower async functions. In
contrast to the `retcon` lowering the frontend needs to be in control
over control-flow at suspend points as execution might be suspended at
these points.
This is very much work in progress and the implementation will change as
it evolves with the frontend. As such the documentation is lacking
detail as some of it might change.
rdar://70097093
Reapply with fix for memory sanitizer failure and sphinx failure.
Differential Revision: https://reviews.llvm.org/D90612
This patch adds the `async` lowering of coroutines.
This will be used by the Swift frontend to lower async functions. In
contrast to the `retcon` lowering the frontend needs to be in control
over control-flow at suspend points as execution might be suspended at
these points.
This is very much work in progress and the implementation will change as
it evolves with the frontend. As such the documentation is lacking
detail as some of it might change.
rdar://70097093
Differential Revision: https://reviews.llvm.org/D90612
This is slightly better compile-time wise,
since we avoid potentially-costly knownbits analysis that will
ultimately not allow us to actually do anything with said `add`.
This reverts commit 59b22e495c.
That commit broke building for ARM and AArch64, reproducible like this:
$ cat apedec-reduced.c
a;
b(e) {
int c;
unsigned d = f();
c = d >> 32 - e;
return c;
}
g() {
int h = i();
if (a)
h = h << a | b(a);
return h;
}
$ clang -target aarch64-linux-gnu -w -c -O3 apedec-reduced.c
clang: ../lib/Transforms/InstCombine/InstructionCombining.cpp:3656: bool llvm::InstCombinerImpl::run(): Assertion `DT.dominates(BB, UserParent) && "Dominance relation broken?"' failed.
Same thing for e.g. an armv7-linux-gnueabihf target.
There is already an API in BasicBlock that checks and returns the musttail call if it precedes the return instruction.
Use it instead of manually checking in each place.
Differential Revision: https://reviews.llvm.org/D90693
InstCombine is quite aggressive in doing the opposite transform,
folding `add` of operands with no common bits set into an `or`,
and that not many things support that new pattern..
In this case, teaching Reassociate about it is easy,
there's preexisting art for `sub`/`shl`:
just convert such an `or` into an `add`:
https://rise4fun.com/Alive/Xlyv
The LoopDistribute pass is missing from the LTO pipeline, so
-enable-loop-distribute has no effect during post-link. The pre-link
loop distribution doesn't seem to survive the LTO pipeline either.
With this patch (and -flto -mllvm -enable-loop-distribute) we see a 43%
uplift on SPEC 2006 hmmer for AArch64. The rest of SPECINT 2006 is
unaffected.
Differential Revision: https://reviews.llvm.org/D89896
This relaxes one-use restriction on that `sub` fold,
since apparently the addition of Negator broke
preexisting `C-(C2-X) --> X+(C-C2)` (with C=0) fold.
Vectors where all elements have the same known constant range are treated as a
single constant range in the lattice. When bitcasting such vectors, there is a
mis-match between the width of the lattice value (single constant range) and
the original operands (vector). Go to overdefined in that case.
Fixes PR47991.
The fold currently only handles rotation patterns, but with the maturation of backend funnel shift handling we can now realistically handle all funnel shift patterns.
This should allow us to begin resolving PR46896 et al.
Differential Revision: https://reviews.llvm.org/D90625
If we know that some check will not be executed on the last iteration, we can use this
fact to eliminate its check.
Differential Revision: https://reviews.llvm.org/D88210
Reviwed By: ebrevnov
This patch enhances computeOutliningColdRegionsInfo() to allow it to
consider regions containing a single basic block and a single
predecessor as candidate for partial inlining.
Reviewed By: fhann
Differential Revision: https://reviews.llvm.org/D89911
This reverts the revert commit 408c4408fa.
This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.
Original message:
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Similar to -fprofile-generate=, add -fmemory-profile= which takes a
directory path. This is passed down to LLVM via a new module flag
metadata. LLVM in turn provides this name to the runtime via the new
__memprof_profile_filename variable.
Additionally, always pass a default filename (in $cwd if a directory
name is not specified vi the = form of the option). This is also
consistent with the behavior of the PGO instrumentation. Since the
memory profiles will generally be fairly large, it doesn't make sense to
dump them to stderr. Also, importantly, the memory profiles will
eventually be dumped in a compact binary format, which is another reason
why it does not make sense to send these to stderr by default.
Change the existing memprof tests to specify log_path=stderr when that
was being relied on.
Depends on D89086.
Differential Revision: https://reviews.llvm.org/D89087
This patch updates DSE + MemorySSA to use the same check as the legacy
implementation to determine if a location is killed by a free call.
This changes the existing behavior so that a free does not kill
locations before the start of the freed pointer.
This should fix PR48036.
This reverts the revert commit a1b53db324.
This patch includes a fix for a reported issue, caused by
matchSelectPattern returning UMIN for selects of pointers in
some cases by looking to some connected casts.
For now, ensure integer instrinsics are only returned for selects of
ints or int vectors.
This is the last of the rotate->funnel shift InstCombine generalizations for PR46896
We still have foldGuardedRotateToFunnelShift to deal with in AggressiveInstCombine
Differential Revision: https://reviews.llvm.org/D90382
Previously, !noalias and !alias.scope metadata on the call site was
applied as part of CloneAliasScopeMetadata(), which short-circuits
if the callee does not use any noalias metadata itself. However,
these two things have no relation to each other.
Consistently apply !noalias and !alias.scope metadata by integrating
this into an existing function that handled !llvm.access.group and
!llvm.mem.parallel_loop_access metadata. The handling for all of
these metadata kinds essentially the same.
This reverts commit 1922570489.
This appears to cause a crash in the following example
a, b, c;
l() {
int e = a, f = l, g, h, i, j;
float *d = c, *k = b;
for (;;)
for (; g < f; g++) {
k[h] = d[i];
k[h - 1] = d[j];
h += e << 1;
i += e;
}
}
clang -cc1 -triple i386-unknown-linux-gnu -emit-obj -target-cpu pentium-m -O1 -vectorize-loops -vectorize-slp reduced.c
llvm::Type *llvm::Type::getWithNewBitWidth(unsigned int) const: Assertion `isIntOrIntVectorTy() && "Original type expected to be a vector of integers or a scalar integer."' failed.
Add support for match-all tags and GOT-free runtime calls, which
are both required for the kernel to be able to support outlined
checks. This requires extending the access info to let the backend
know when to enable these features. To make the code easier to maintain
introduce an enum with the bit field positions for the access info.
Allow outlined checks to be enabled with -mllvm
-hwasan-inline-all-checks=0. Kernels that contain runtime support for
outlined checks may pass this flag. Kernels lacking runtime support
will continue to link because they do not pass the flag. Old versions
of LLVM will ignore the flag and continue to use inline checks.
With a separate kernel patch [1] I measured the code size of defconfig
+ tag-based KASAN, as well as boot time (i.e. time to init launch)
on a DragonBoard 845c with an Android arm64 GKI kernel. The results
are below:
code size boot time
before 92824064 6.18s
after 38822400 6.65s
[1] https://linux-review.googlesource.com/id/I1a30036c70ab3c3ee78d75ed9b87ef7cdc3fdb76
Depends on D90425
Differential Revision: https://reviews.llvm.org/D90426
This is a workaround for poor heuristics in the backend where we can
end up materializing the constant multiple times. This is particularly
bad when using outlined checks because we materialize it for every call
(because the backend considers it trivial to materialize).
As a result the field containing the shadow base value will always
be set so simplify the code taking that into account.
Differential Revision: https://reviews.llvm.org/D90425
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D88609
This patch modifies two for loops to use the range based syntax.
Since they are equivalent, this patch is tagged NFC.
Differential Revision: https://reviews.llvm.org/D90069
- As convergent intrinsics/calls could only be moved to
control-equivalent blocks, or more precisely the same divergent
branch, PRE needs to skip them.
Differential Revision: https://reviews.llvm.org/D90391
Currently isOverwrite returns OW_MaybePartial even for accesss known not to overlap. This is not a big problem for legacy implementation (since isPartialOverwrite follows isOverwrite and clarifies the result). Contrary SSA based version does a lot of work to later find out that accesses don't overlap. Besides negative impact on compile time we quickly reach MemorySSAPartialStoreLimit and miss optimization opportunities.
Note: In fact, I think it would be cleaner implementation if isOverwrite returned fully clarified result in the first place whithout need to call isPartialOverwrite. This can be done as a follow up. What do you think?
Reviewed By: fhahn, asbirlea
Differential Revision: https://reviews.llvm.org/D90371
As per the comment in VPRecipeBase, clients should not rely on
getVPRecipeID, as it may change in the future. It should only be used in
classof implementations. Use isa instead in getFirstNonPhi.
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Reviewed By: dmgreen, RKSimon
Differential Revision: https://reviews.llvm.org/D90070
Currently we fail to eliminate some noop stores if there is a kill-able
store between the starting def and the load. This is because we
eliminate noop stores first.
In practice it seems like eliminating noop stores after the main
elimination for a def covers slightly more cases.
This patch improves the number of stores slightly in 2 cases for X86 -O3
-flto
Same hash: 235 (filtered out)
Remaining: 2
Metric: dse.NumRedundantStores
Program base patch diff
test-suite...ce/Benchmarks/PAQ8p/paq8p.test 2.00 3.00 50.0%
test-suite...006/453.povray/453.povray.test 18.00 21.00 16.7%
There might be other phase ordering issues, but it appears that they do
not show up in the test-suite/SPEC2000/SPEC2006. We can always tune the
ordering later.
Partly fixes PR47887.
Reviewed By: asbirlea, zoecarver
Differential Revision: https://reviews.llvm.org/D89650
And use it to model LLVM IR's `ptrtoint` cast.
This is essentially an alternative to D88806, but with no chance for
all the problems it caused due to having the cast as implicit there.
(see rG7ee6c402474a2f5fd21c403e7529f97f6362fdb3)
As we've established by now, there are at least two reasons why we want this:
* It will allow SCEV to actually model the `ptrtoint` casts
and their operands, instead of treating them as `SCEVUnknown`
* It should help with initial problem of PR46786 - this should eventually allow us
to not loose pointer-ness of an expression in more cases
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=46786 | PR46786 ]], in principle,
we could just extend `SCEVUnknown` with a `is ptrtoint` cast, because `ScalarEvolution::getPtrToIntExpr()`
should sink the cast as far down into the expression as possible,
so in the end we should always end up with `SCEVPtrToIntExpr` of `SCEVUnknown`.
But i think that it isn't the best solution, because it doesn't really matter
from memory consumption side - there probably won't be *that* many `SCEVPtrToIntExpr`s
for it to matter, and it allows for much better discoverability.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D89456
The existing logic in determining whether an alloca should live on the frame only looks explicit def-use relationships. However a value defined by an alloca may be implicitly needed across suspension points, either because an alias has across-suspension-point def-use relationship, or escaped by store/call/memory intrinsics. To properly handle all these cases, we have to properly visit the alloca pointer up-front. Thie patch extends the exisiting alloca use visitor to determine whether an alloca should live on the frame.
Differential Revision: https://reviews.llvm.org/D89768
Use -0.0 instead of 0.0 as the start value. The previous use of 0.0
was fine for all existing uses of this function though, as it is
always generated with fast flags right now, and thus nsz.
Some architectures do not have general vector select instructions (e.g.
AArch64). But some cmp/select patterns can be vectorized using other
instructions/intrinsics.
One example is using min/max instructions for certain patterns.
This patch updates the cost calculations for selects in the SLP
vectorizer to consider using min/max intrinsics.
This patch does not change SLP vectorizer's codegen itself to actually
generate those intrinsics, but relies on the backends to lower the
vector cmps & selects. This keeps things simple on the SLP side and
works well in practice for AArch64.
This exposes additional SLP vectorization opportunities in some
benchmarks on AArch64 (-O3 -flto).
Metric: SLP.NumVectorInstructions
Program base slp diff
test-suite...ications/JM/ldecod/ldecod.test 502.00 697.00 38.8%
test-suite...ications/JM/lencod/lencod.test 1023.00 1414.00 38.2%
test-suite...-typeset/consumer-typeset.test 56.00 65.00 16.1%
test-suite...6/464.h264ref/464.h264ref.test 804.00 822.00 2.2%
test-suite...006/453.povray/453.povray.test 3335.00 3357.00 0.7%
test-suite...CFP2000/177.mesa/177.mesa.test 2110.00 2121.00 0.5%
test-suite...:: External/Povray/povray.test 2378.00 2382.00 0.2%
Reviewed By: RKSimon, samparker
Differential Revision: https://reviews.llvm.org/D89969
When we promote pointer arguments we did compute a wrong offset and use
a wrong type for the array case.
Bug reported and reduced by Whitney Tsang <whitneyt@ca.ibm.com>.
The following constraints hold for swifterror values:
A swifterror value (either the parameter or the alloca) can only
be loaded and stored from, or used as a swifterror argument.
This patch updates instcombine to not try to convert a bitcast of a
function into a bitcast of a swifterror argument.
Reviewed By: rjmccall
Differential Revision: https://reviews.llvm.org/D90258
This reverts commit e038b60d91.
This reverts commit a0d84d8031.
This revert was a mistake. The reason of the failures was
"Use uint64_t for branch weights instead of uint32_t"
Differential Revision: https://reviews.llvm.org/D87832
Instead of getting the defining access we should be able to use
getClobberingMemoryAccess to skip non-aliasing MemoryDefs. No additional
checks should be needed, because we only remove the starting def if it
matches the defining access of the load. All we need to worry about is
that there are no (may)alias stores between the starting def and the
load and getClobberingMemoryAccess should guarantee that.
Partly fixes PR47887.
This improves the number of redundant stores removed in some cases
(numbers below for MultiSource, SPEC2000, SPEC2006 on X86 with -flto
-O3).
Same hash: 226 (filtered out)
Remaining: 11
Metric: dse.NumRedundantStores
Program base patch1 diff
test-suite...:: External/Povray/povray.test 1.00 5.00 400.0%
test-suite...chmarks/MallocBench/gs/gs.test 1.00 3.00 200.0%
test-suite...0/253.perlbmk/253.perlbmk.test 21.00 37.00 76.2%
test-suite...0.perlbench/400.perlbench.test 24.00 37.00 54.2%
test-suite.../Applications/SPASS/SPASS.test 3.00 4.00 33.3%
test-suite...006/453.povray/453.povray.test 15.00 18.00 20.0%
test-suite...T2006/445.gobmk/445.gobmk.test 27.00 29.00 7.4%
test-suite.../CINT2006/403.gcc/403.gcc.test 136.00 137.00 0.7%
test-suite.../CINT2000/176.gcc/176.gcc.test 6.00 6.00 0.0%
test-suite.../Benchmarks/Bullet/bullet.test NaN 3.00 nan%
test-suite.../Benchmarks/Ptrdist/bc/bc.test NaN 1.00 nan%
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D89647
The types of SEH aren't x86(-32) vs x64 but rather stack-based exception chaining
vs table-based exception handling. x86-32 is the only arch for which Windows
uses the former. 32-bit ARM would use what is called Win64SEH today, which
is a bit confusing so instead let's just rename it to be a bit more clear.
Reviewed By: compnerd, rnk
Differential Revision: https://reviews.llvm.org/D90117
This patch removes extraneous calls to setEdgeProbability introduced
in c91487769d.
The follow-up patch, a7b662d0f4, has
since fixed BranchProbabilityInfo::eraseBlock, so we don't need to
worry about getting stale values from getEdgeProbability.
Also, since getEdgeProbability(BB, BB->getSingleSuccessor()) returns
edge probability 1/1 by default for BB with exactly one successor
edge, we don't need to explicitly call setEdgeProbability.
This patch introduces almost no functional change, but we do end up
reducing debug messages from setEdgeProbability.
Differential Revision: https://reviews.llvm.org/D90284
Before we used to only mark unreachable static functions as dead if all
uses were known dead. Now we optimistically assume uses to be dead until
proven otherwise.
If we are looking at a call site argument it might be a load or call
which is in a different context than the call site argument. We cannot
simply use the call site argument range for the call or load.
Bug reported and reduced by Whitney Tsang <whitneyt@ca.ibm.com>.
In the AANoAlias logic we determine if a pointer may have been captured
before a call. We need to look at other uses in the call not uses of the
call.
The new code is not perfect as it does not allow trivial cases where the
call has multiple arguments but it is at least not unsound and a TODO
was added.
This patch teaches the jump threading pass to set edge probabilities
whenever the pass creates new basic blocks.
Without this patch, the compiler sometimes produces non-deterministic
results. The non-determinism comes from the jump threading pass using
stale edge probabilities in BranchProbabilityInfo. Specifically, when
the jump threading pass creates a new basic block, we don't initialize
its outgoing edge probability.
Edge probabilities are maintained in:
DenseMap<Edge, BranchProbability> Probs;
in class BranchProbabilityInfo, where Edge is an ordered pair of
BasicBlock * and a successor index declared as:
using Edge = std::pair<const BasicBlock *, unsigned>;
Probs maps edges to their corresponding probabilities.
Now, we rarely remove entries from this map, so if we happen to
allocate a new basic block at the same address as a previously deleted
basic block with an edge probability assigned, the newly created basic
block appears to have an edge probability, albeit a stale one.
This patch fixes the problem by explicitly setting edge probabilities
whenever the jump threading pass creates new basic blocks.
Differential Revision: https://reviews.llvm.org/D90106
Summary:
This patch adds support for passing in the original delcaration name in the
source file to the libomptarget runtime. This will allow the runtime to provide
more intelligent debugging messages. This patch takes the original expression
parsed from the OpenMP map / update clause and provides a textual
representation if it was explicitly mapped, otherwise it takes the name of the
variable declaration as a fallback. The information in passed to the runtime in
a global array of strings that matches the existing ident_t source location
strings using ";name;filename;column;row;;". See
clang/test/OpenMP/target_map_names.cpp for an example of the generated output
for a given map clause.
Reviewers: jdoervert
Differential Revision: https://reviews.llvm.org/D89802
These logically belong together since it's a base commit plus
followup fixes to less common build configurations.
The patches are:
Revert "CfgInterface: rename interface() to getInterface()"
This reverts commit a74fc48158.
Revert "Wrap CfgTraitsFor in namespace llvm to please GCC 5"
This reverts commit f2a06875b6.
Revert "Try to make GCC5 happy about the CfgTraits thing"
This reverts commit 03a5f7ce12.
Revert "Introduce CfgTraits abstraction"
This reverts commit c0cdd22c72.
This patch changes MergeBlockIntoPredecessor to skip the call to
RemoveRedundantDbgInstrs, in effect partially reverting D71480 due to
some compile-time issues spotted in LoopUnroll and SimplifyCFG.
The call to RemoveRedundantDbgInstrs appears to have changed the
worst-case behavior of the merging utility. Loosely speaking, it seems
to have gone from O(#phis) to O(#insts).
It might not be possible to mitigate this by scanning a block to
determine whether there are any debug intrinsics to remove, since such a
scan costs O(#insts).
So: skip the call to RemoveRedundantDbgInstrs. There's surprisingly
little fallout from this, and most of it can be addressed by doing
RemoveRedundantDbgInstrs later. The exception is (the block-local
version of) SimplifyCFG, where it might just be too expensive to call
RemoveRedundantDbgInstrs.
Differential Revision: https://reviews.llvm.org/D88928
This reverts commit c6ca26c0bf.
This breaks stage2 builds due to hitting this assert:
```
Assertion failed: (WeightSum <= UINT32_MAX && "Expected weights to scale down to 32 bits"), function calcMetadataWeights
```
when compiling AArch64RegisterBankInfo.cpp in LLVM.
-Oz normally does not allow loop header duplication so this loop wouldn't be
vectorized. However the vectorization pragma should override this and allow
for loop rotation.
rdar://problem/49281061
Original patch by Adam Nemet.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D59832
GVN Load PRE can split the backedge causing breaking the loop structure where the latch
contains the conditional branch with for example induction variable.
Different optimizations expect this form of the loop, so it is better to preserve it for some time.
This CL adds an option to control an ability to split backedge.
Default value is true so technically it is NFC and current behavior is not changed.
Reviewers: fedor.sergeev, mkazantsev, nikic, reames, fhahn
Reviewed By: mkazasntsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D89854
Even if the exact exit count is unknown, we can still prove that this
exit will not be taken. If we can prove that the predicate is monotonic,
fulfilled on first & last iteration, and no overflow happened in between,
then the check can be removed.
Differential Revision: https://reviews.llvm.org/D87832
Reviewed By: apilipenko
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D88609
`-separate-const-offset-from-gep` has not yet be ported, so some tests are not updated.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D90149
Prepend the module name hash with a fixed string ".__uniq." which helps tools
that consume sampled profiles and attribute it to functions to understand
that this symbol belongs to a unique internal linkage type symbol.
Symbols with suffixes can result from various optimizations in the compiler.
Function Multiversioning, function splitting, parameter constant propogation,
unique internal linkage names.
External tools like sampled profile aggregators combine profiles from multiple
runs of a binary. They use various heuristics with symbols that have suffixes
to try and attribute the profile to the right function instance. For instance
multi-versioned symbols like foo.avx, foo.sse4.2, etc even though different
should be attributed to the same source function if a single function is
versioned, using attribute target_clones (supported in GCC but yet to land in
LLVM). Similarly, functions that are split (split part having a .cold suffix)
could have profiles for both the original and split symbols but would be
aggregated and attributed to the original function that was split.
Unique internal linkage functions however have different source instances and
the aggregator must not put them together but attribute it to the appropriate
function instance. To be sure that we are dealing with a symbol of a unique
internal linkage function, we would like to prepend the hash with a known
string ".__uniq." which these tools can check to understand the suffix type.
Differential Revision: https://reviews.llvm.org/D89617
This fixes the bug 47945. It is legal to have a PHI with values
from from the same block, but values must stay the same. In this
case it is illegal to merge different values.
Differential Revision: https://reviews.llvm.org/D89978
The warning would fire when calling canReplaceGEPIdxWithZero on a GEP
whose source element type is a scalable vector. The size of scalable
vector types is not known, so this optimization cannot be performed.
This patch fixes the issue by:
- bailing out early in this routine if the GEP instruction's source
element type is a scalable vector.
- making use of getFixedSize -- this removes the dependency on the
deprecated interface.
Reviewed By: fpetrogalli
Differential Revision: https://reviews.llvm.org/D89968
The warning would fire when calling isDereferenceableAndAlignedInLoop
with a scalable load. Calling isDereferenceableAndAlignedInLoop with a
scalable load would result in the use of the now deprecated implicit
cast of TypeSize to uint64_t through the overloaded operator.
This patch fixes this issue by:
- no longer considering vector loads as candidates in
canVectorizeWithIfConvert. This doesn't make sense in the context of
identifying scalar loads to vectorize.
- making use of getFixedSize inside isDereferenceableAndAlignedInLoop --
this removes the dependency on the deprecated interface, and will
trigger an assertion error if the function is ever called with a
scalable type.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D89798
I'm not certain InstCombinerImpl::matchBSwapOrBitReverse needs to filter the or(op0(),op1()) ops - there are just too many cases that recognizeBSwapOrBitReverseIdiom/collectBitParts handle now (and quickly).
The exit blocks of the versioned and non-versioned loops are not dedicated and thus the two loops are not in simplify form.
Insert dummy exit blocks after loop versioning with `formDedicatedExits()` to preserve the simplify form for subsequence passes.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D89569
As discussed on PR35155, this extends narrowFunnelShift (recently renamed from narrowRotate) to support basic funnel shift patterns.
Unlike matchFunnelShift we don't include the computeKnownBits limitation as extracting the pattern from the zext/trunc layers should be a indicator of reasonable funnel shift codegen, in D89139 we demonstrated how to efficiently promote funnel shifts to wider types.
Differential Revision: https://reviews.llvm.org/D89542
Duplicated callsites share the same callee profile if the original callsite was inlined. The sharing also causes the profile of callee's callee to be shared. This breaks the assert introduced ealier by D84997 in a tricky way.
To illustrate, I'm using an abstract example. Say we have three functions `A`, `B` and `C`. A calls B twice and B calls C once. Some optimize performed prior to the sample profile loader duplicates first callsite to `B` and the program may look like
```
A()
{
B(); // with nested profile B1 and C1
B(); // duplicated, with nested profile B1 and C1
B(); // with nested profile B2 and C2
}
```
For some reason, the sample profile loader inliner then decides to only inline the first callsite in `A` and transforms `A` into
```
A()
{
C(); // with nested profile C1
B(); // duplicated, with nested profile B1 and C1
B(); // with nested profile B2 and C2.
}
```
Here is what happens next:
1. Failing to inline the callsite `C()` results in `C1`'s samples returned to `C`'s base (outlined) profile. In the meantime, `C1`'s head samples are updated to `C1`'s entry sample. This also affects the profile of the middle callsite which shares `C1` with the first callsite.
2. Failing to inline the middle callsite results in `B1` returned to `B`'s base profile, which in turn will cause `C1` merged into `B`'s base profile. Note that the nest `C` profile in `B`'s base has a non-zero head sample count now. The value actually equals to `C1`'s entry count.
3. Failing to inline last callsite results in `B2` returned to `B`'s base profile. Note that the nested `C` profile in `B`'s base now has an entry count equal to the sum of that of `C1` and `C2`, with the head count equal to that of `C1`. This will trigger the assert later on.
4. Compiling `B` using `B`'s base profile. Failing to inline `C` there triggers the returning of the nested `C` profile. Since the nested `C` profile has a non-zero head count, the returning doesn't go through. Instead, the assert goes off.
It's good that `C1` is only returned once, based on using a non-zero head count to ensure an inline profile is only returned once. However C2 is never returned. While it seems hard to solve this perfectly within the current framework, I'm just removing the broken assert. This should be reasonably fixed by the upcoming CSSPGO work where counts returning is based on context-sensitivity and a distribution factor for callsite probes.
The simple example is extracted from one of our internal services. In reality, why the original callsite `B()` and duplicate one having different inline behavior is a magic. It has to do with imperfect counts in profile and extra complicated inlining that makes the hotness for them different.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D90056
This doesn't support -structurizecfg-skip-uniform-regions since that
would require porting LegacyDivergenceAnalysis.
The NPM doesn't support adding a non-analysis pass as a dependency of
another, so I had to add -lowerswitch to some tests or pin them to the
legacy PM.
This is the only RegionPass in tree, so I simply copied the logic for
finding all Regions from the legacy PM's RGManager into
StructurizeCFG::run().
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D89026
This change introduces a GC parseable lowering for element atomic
memcpy/memmove intrinsics. This way runtime can provide an
implementation which can take a safepoint during copy operation.
See "GC-parseable element atomic memcpy/memmove" thread on llvm-dev
for the background and details:
https://groups.google.com/g/llvm-dev/c/NnENHzmX-b8/m/3PyN8Y2pCAAJ
Differential Revision: https://reviews.llvm.org/D88861
It's currently ambiguous in IR whether the source language explicitly
did not want a stack a stack protector (in C, via function attribute
no_stack_protector) or doesn't care for any given function.
It's common for code that manipulates the stack via inline assembly or
that has to set up its own stack canary (such as the Linux kernel) would
like to avoid stack protectors in certain functions. In this case, we've
been bitten by numerous bugs where a callee with a stack protector is
inlined into an __attribute__((__no_stack_protector__)) caller, which
generally breaks the caller's assumptions about not having a stack
protector. LTO exacerbates the issue.
While developers can avoid this by putting all no_stack_protector
functions in one translation unit together and compiling those with
-fno-stack-protector, it's generally not very ergonomic or as
ergonomic as a function attribute, and still doesn't work for LTO. See also:
https://lore.kernel.org/linux-pm/20200915172658.1432732-1-rkir@google.com/https://lore.kernel.org/lkml/20200918201436.2932360-30-samitolvanen@google.com/T/#u
Typically, when inlining a callee into a caller, the caller will be
upgraded in its level of stack protection (see adjustCallerSSPLevel()).
By adding an explicit attribute in the IR when the function attribute is
used in the source language, we can now identify such cases and prevent
inlining. Block inlining when the callee and caller differ in the case that one
contains `nossp` when the other has `ssp`, `sspstrong`, or `sspreq`.
Fixes pr/47479.
Reviewed By: void
Differential Revision: https://reviews.llvm.org/D87956
matchBSwapOrBitReverse was hardcoded to just match bswaps - we're going to need to expose the ability to match bitreverse as well, so make this part of the function call.
This patch copies @vsk's fix to instcombine from D85555 over to mem2reg. The
motivation and rationale are exactly the same: When mem2reg removes an alloca,
it erases the dbg.{addr,declare} instructions which refer to the alloca. It
would be better to instead remove all debug intrinsics which describe the
contents of the dead alloca, namely all dbg.value(<dead alloca>, ...,
DW_OP_deref)'s.
As far as I can tell, prior to D80264 these `dbg.value+deref`s would have been
silently dropped instead of being made `undef`, so we're just returning to
previous behaviour with these patches.
Testing:
`llvm-lit llvm/test` and `ninja check-clang` gave no unexpected failures. Added
3 tests, each of which covers a dbg.value deletion path in mem2reg:
mem2reg-promote-alloca-1.ll
mem2reg-promote-alloca-2.ll
mem2reg-promote-alloca-3.ll
The first is based on the dexter test inlining.c from D89543. This patch also
improves the debugging experience for loop.c from D89543, which suffers
similarly after arg promotion instead of inlining.
Use isKnownXY comparators when one of the operands can be with
scalable vectors or getFixedSize() for all the other cases.
This patch also does bug fixes for getPrimitiveSizeInBits by using
getFixedSize() near the places with the TypeSize comparison.
Differential Revision: https://reviews.llvm.org/D89703
We want to have a caching version of symbolic BE exit count
rather than recompute it every time we need it.
Differential Revision: https://reviews.llvm.org/D89954
Reviewed By: nikic, efriedma
An alwaysinline function may not get inlined in inliner-wrapper due to
the inlining order.
Previously for the following, the inliner would first inline @a() into @b(),
```
define void @a() {
entry:
call void @b()
ret void
}
define void @b() alwaysinline {
entry:
br label %for.cond
for.cond:
call void @a()
br label %for.cond
}
```
making @b() recursive and unable to be inlined into @a(), ending at
```
define void @a() {
entry:
call void @b()
ret void
}
define void @b() alwaysinline {
entry:
br label %for.cond
for.cond:
call void @b()
br label %for.cond
}
```
Running always-inliner first makes sure that we respect alwaysinline in more cases.
Fixes https://bugs.llvm.org/show_bug.cgi?id=46945.
Reviewed By: davidxl, rnk
Differential Revision: https://reviews.llvm.org/D86988
This reverts commit 26ee8aff2b.
It's necessary to insert bitcast the pointer operand of a lifetime
marker if it has an opaque pointer type.
rdar://70560161
When performing a call slot optimization to a GEP destination, it
will currently usually fail, because the GEP is directly before the
memcpy and as such does not dominate the call. We should move it
above the call if that satisfies the domination requirement.
I think that a constant-index GEP is the only useful thing to move
here, as otherwise isDereferenceablePointer couldn't look through
it anyway. As such I'm not trying to generalize this further.
Differential Revision: https://reviews.llvm.org/D89623
Make member function const where possible, use LLVM_DEBUG to print debug traces
rather than a custom option, pass by reference to avoid null checking, ...
Reviewed By: fhann
Differential Revision: https://reviews.llvm.org/D89895
When InstCombine removes an alloca, it erases the dbg.{addr,declare}
instructions which refer to the alloca. It would be better to instead
remove all debug intrinsics which describe the contents of the dead
alloca, namely all dbg.value(<dead alloca>, ..., DW_OP_deref)'s.
This effectively undoes work performed in an InstCombine run earlier in
the pipeline by LowerDbgDeclare, which inserts DW_OP_deref dbg.values
before CallInst users of an alloca. The motivating example looks like:
```
define void @foo(i32 %0) {
%a = alloca i32 ; This alloca is erased.
store i32 %0, i32* %a
dbg.value(i32 %0, "arg0") ; This dbg.value survives.
dbg.value(i32* %a, "arg0", DW_OP_deref)
call void @trivially_inlinable_no_op(i32* %a)
ret void
}
```
If the DW_OP_deref dbg.value is not erased, it becomes dbg.value(undef)
after inlining, making "arg0" unavailable. But we already have dbg.value
descriptions of the alloca's value (from LowerDbgDeclare), so the
DW_OP_deref dbg.value cannot serve its purpose of describing an
initialization of the alloca by some callee. It invalidates other useful
dbg.values, causing large gaps in location coverage, so we should delete
it (even though doing so may cause stale dbg.values to appear, if
there's a dead store to `%a` in @trivially_inlinable_no_op).
OTOH, it wouldn't be correct to delete all dbg.value descriptions of an
alloca. Note that it's possible to describe a variable that takes on
different pointer values, e.g.:
```
void use(int *);
void t(int a, int b) {
int *local = &a; // dbg.value(i32* %a.addr, "local")
local = &b; // dbg.value(i32* undef, "local")
use(&a); // (note: %b.addr is optimized out)
local = &a; // dbg.value(i32* %a.addr, "local")
}
```
In this example, the alloca for "b" is erased, but we need to describe
the value of "local" as <unavailable> before the call to "use". This
prevents "local" from appearing to be equal to "&a" at the callsite.
rdar://66592859
Differential Revision: https://reviews.llvm.org/D85555
Use BFI if it is available and BPI otherwise.
This is a promised follow-up after D89541.
Reviewers: ebrevnov, mkazantsev
Reviewed By: ebrevnov
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D89773
PreservedCFGCheckerInstrumentation was saying that LowerMatrixIntrinsics
didn't properly preserve CFG even though it claimed to. The legacy pass
says it doesn't. Match the legacy pass's preserved analyses.
Reviewed By: thakis
Differential Revision: https://reviews.llvm.org/D89175
For GC parseable element atomic memcpy/memmove we'll need to
shuffle statepoint arguments. Make it possible by storing the
arguments as Value *, not Use *.
Avoid having to instantiate and compile a subset of the dominator tree logic
separately for each node type. More importantly, this allows generic
algorithms to be built on top of dominator trees without writing them as
templates -- such algorithms can now use opaque CfgBlockRef and
CfgInterface instead.
A type-erased implementation of dominator trees could be written in
terms of CfgInterface as well, but doing so would change the current
trade-off: it would slightly reduce code size at the cost of a slight
runtime overhead.
This patch does not change the trade-off, as it only does type-erasure
where basic blocks can be treated in a fully opaque way, i.e. it only
moves methods that don't require iteration over CFG successors and
predecessors.
v5:
- rename generic_{begin,end,children} back without the generic_ prefix
and refer explictly to base class methods in NewGVN, which wants to
mutate the order of dominator tree node children directly
v6:
- style change: iDom -> idom; it's arguable whether this is really
invalid, since it is actually standard camelCase, but clang-tidy
complains about it so... *shrug*
- rename {to,from}Generic -> {wrap,unwrap}Ref
Change-Id: Ib860dc04cf8bb093d8ed00be7def40d662213672
Differential Revision: https://reviews.llvm.org/D83089
isMemTerminator checks if the current def is a memory terminator that
terminates the memory pointed to by DefLoc. We do not have to add any of
their users to the worklist, because the follow-on users cannot read the
memory in question.
This leads to more stores eliminated in the presence of lifetime calls.
Previously we added the users of those intrinsics to the worklist,
limiting elimination.
In terms of removed stores, this gives a nice boost on some benchmarks
(MultiSource/SPEC2000/SPEC2006 on X86 with -flto -O3):
Same hash: 205 (filtered out)
Remaining: 32
Metric: dse.NumFastStores
Program base patch diff
test-suite...000/197.parser/197.parser.test 4.00 8.00 100.0%
test-suite...rolangs-C++/family/family.test 4.00 7.00 75.0%
test-suite...marks/7zip/7zip-benchmark.test 1722.00 2189.00 27.1%
test-suite...CFP2000/177.mesa/177.mesa.test 30.00 38.00 26.7%
test-suite :: External/Nurbs/nurbs.test 44.00 49.00 11.4%
test-suite...lications/sqlite3/sqlite3.test 115.00 128.00 11.3%
test-suite...006/447.dealII/447.dealII.test 2715.00 3013.00 11.0%
test-suite...ProxyApps-C++/CLAMR/CLAMR.test 237.00 261.00 10.1%
test-suite...tions/lambda-0.1.3/lambda.test 40.00 44.00 10.0%
test-suite...3.xalancbmk/483.xalancbmk.test 1366.00 1475.00 8.0%
test-suite...abench/jpeg/jpeg-6a/cjpeg.test 13.00 14.00 7.7%
test-suite...oxyApps-C++/miniFE/miniFE.test 43.00 46.00 7.0%
test-suite...lications/ClamAV/clamscan.test 230.00 246.00 7.0%
test-suite...006/450.soplex/450.soplex.test 284.00 299.00 5.3%
test-suite...nsumer-jpeg/consumer-jpeg.test 21.00 22.00 4.8%
The CfgTraits abstraction simplfies writing algorithms that are
generic over the type of CFG, and enables writing such algorithms
as regular non-template code that operates on opaque references
to CFG blocks and values.
Implementations of CfgTraits provide operations on the concrete
CFG types, e.g. `IrCfgTraits::BlockRef` is `BasicBlock *`.
CfgInterface is an abstract base class which provides operations
on opaque types CfgBlockRef and CfgValueRef. Those opaque types
encapsulate a `void *`, but the meaning depends on the concrete
CFG type. For example, MachineCfgTraits -- for use with MachineIR
in SSA form -- encodes a Register inside CfgValueRef. Converting
between concrete references and opaque/generic ones is done by
CfgTraits::{fromGeneric,toGeneric}. Convenience methods
CfgTraits::{un}wrap{Iterator,Range} are available as well.
Writing algorithms in terms of CfgInterface adds some overhead
(virtual method calls, plus in same cases it removes the
opportunity to inline iterators), but can be much more convenient
since generic algorithms can be written as non-templates.
This patch adds implementations of CfgTraits for all CFGs on
which dominator trees are calculated, so that the dominator
tree can be ported to this machinery. Only IrCfgTraits (LLVM IR)
and MachineCfgTraits (Machine IR in SSA form) are complete, the
other implementations are limited to the absolute minimum
required to make the upcoming dominator tree changes work.
v5:
- fix MachineCfgTraits::blockdef_iterator and allow it to iterate over
the instructions in a bundle
- use MachineBasicBlock::printName
v6:
- implement predecessors/successors for all CfgTraits implementations
- fix error in unwrapRange
- rename toGeneric/fromGeneric into wrapRef/unwrapRef to have naming
that is consistent with {wrap,unwrap}{Iterator,Range}
- use getVRegDef instead of getUniqueVRegDef
v7:
- std::forward fix in wrapping_iterator
- fix typos
v8:
- cleanup operators on CfgOpaqueType
- address other review comments
Change-Id: Ia75f4f268fded33fca11218a7d578c9aec1f3f4d
Differential Revision: https://reviews.llvm.org/D83088
This adds the LLVM IR attribute `mustprogress` as defined in LangRef through D86233. This attribute will be applied to functions with in languages like C++ where forward progress is guaranteed. Functions without this attribute are not required to make progress.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D85393
IRCE has some overhead for runtime checks and in case number of iteration is small
the overhead can kill the benefit from optimizations.
This CL bases on BlockFrequencyInfo of pre-header and header to estimate the
number of loop iterations. If it is less than irce-min-estimated-iters we do not transform the loop.
Probably it is better to make more complex cost model but for simplicity it seems the be enough.
The usage of BFI is added only for new pass manager and tries to use it efficiently.
Reviewers: ebrevnov, dantrushin, asbirlea, mkazantsev
Reviewed By: mkazantsev
Subscribers: llvm-commits, fhahn
Differential Revision: https://reviews.llvm.org/D89541
The main tricky thing here is forward-declaring the enum:
we have to specify it's underlying data type.
In particular, this avoids the danger of switching over the SCEVTypes,
but actually switching over an integer, and not being notified
when some case is not handled.
I have updated most of such switches to be exaustive and not have
a default case, where it's pretty obvious to be the intent,
however not all of them.
If we switch over an enum, compiler can easily issue a diagnostic
if some case is not handled. However with an if cascade that isn't so.
Experimental evidence suggests new behavior to be superior.
Fixes a number of stage2 buildbots that were failing when I generalized the m_ConstantInt() logic - that didn't match for pointer types but m_Zero() does......
Scalar cases were already being handled by foldLogOpOfMaskedICmps (so this was dead code), but refactoring to support non-uniform vectors will take some time, so tweak this fold in the meantime.
This broke Chromium's PGO build, it seems because hot-cold-splitting got turned
on unintentionally. See comment on the code review for repro etc.
> This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
> the splitting pass to be toggled on/off. The current method of passing
> `-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
> correctly (say, with `-O0` or `-Oz`).
>
> To implement the -fsplit-cold-code option, an attribute is applied to
> functions to indicate that they may be considered for splitting. This
> removes some complexity from the old/new PM pipeline builders, and
> behaves as expected when LTO is enabled.
>
> Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
> Differential Revision: https://reviews.llvm.org/D57265
> Reviewed By: Aditya Kumar, Vedant Kumar
> Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar
This reverts commit 273c299d5d.
All existing SCEV cast types operate on integers.
D89456 will add SCEVPtrToIntExpr cast expression type.
I believe this is best for consistency.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D89455
isNoopIntrinsic returns true for some intrinsics that are modeled in
MemorySSA but do not actually read or write any memory and do not block
DSE. Such intrinsics should not be considered as read-clobbers.
This patch adds metadata !noundef and makes load instructions can optionally have it.
A load with !noundef always return a well-defined value (has no undef bit or isn't poison).
If the loaded value isn't well defined, the behavior is undefined.
This metadata can be used to encode the assumption from C/C++ that certain reads of variables should have well-defined values.
It is helpful for optimizing freeze instructions away, because freeze can be removed when its operand has well-defined value, and showing that a load from arbitrary location is well-defined is usually hard otherwise.
The same information can be encoded with llvm.assume with operand bundle; using metadata is chosen because I wasn't sure whether code motion can be freely done when llvm.assume is inserted from clang instead.
The existing codebase already is stripping unknown metadata when doing code motion, so using metadata is UB-safe as well.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D89050
We can not bitcast pointers across different address spaces, and VectorCombine
should be careful when it attempts to find the original source of the loaded
data.
Differential Revision: https://reviews.llvm.org/D89577
Logic of widenWithVariantUse is split into check and transform
part, unlike any other transform in IndVars. We want to pass some
extra flags from analysis to transform part and standartize
the code at once, so merging them together.
Variable ExtendOperExpr only exists to check whether it is a SCEV ext.
We create it as SCEV ext right here, so semantically this check is
trivially true. In theory, it may fail if SCEV is smart enough and can
simplify the expression. However, no matter whether it is an ext or not,
we never use this fact for further reasoning. So this code is currently
useless and in theory may become harmful with SCEV's development.
We do not expect any behavior changes with removing it. If it caused
negative changes, the patch should be reverted.
Some facts have already been checked in widenWithVariantUse and then
checked again in widenWithVariantUseCodegen. The latter is redundant,
we can replace it with asserts.
Prep work for PR35155 - renamed narrowRotate to narrowFunnelShift, rewrote some comments and adjusted code to collect separate shift values, although we bail if they don't match (still only rotations are only actually folded).
I'm trying to match matchFunnelShift as much as possible in case we finally get to merge these one day.
After investigation by @asbirlea, the issue that caused the
revert appears to be an issue in the original source, rather
than a problem with the compiler.
This patch enables MemorySSA DSE again.
This reverts commit 915310bf14.
This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
the splitting pass to be toggled on/off. The current method of passing
`-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
correctly (say, with `-O0` or `-Oz`).
To implement the -fsplit-cold-code option, an attribute is applied to
functions to indicate that they may be considered for splitting. This
removes some complexity from the old/new PM pipeline builders, and
behaves as expected when LTO is enabled.
Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
Differential Revision: https://reviews.llvm.org/D57265
Reviewed By: Aditya Kumar, Vedant Kumar
Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar
This is an initial cleanup of the way LoopVersioning interacts with LAA.
Currently LoopVersioning has 2 ways of initializing things:
1. Passing LAI and passing UseLAIChecks = true
2. Passing UseLAIChecks = false, followed by calling setSCEVChecks and
setAliasChecks.
Both ways of initializing lead to the same result and the duplication
seems more complicated than necessary.
This patch removes the UseLAIChecks flag from the constructor and the
setSCEVChecks & setAliasChecks helpers and move initialization
exclusively to the constructor.
This simplifies things, by providing a single way to initialize
LoopVersioning and reducing duplication.
Reviewed By: Meinersbur, lebedev.ri
Differential Revision: https://reviews.llvm.org/D84406
Following up D81682 and D83903, remove the code for the old value profiling
buckets, which have been replaced with the new, extended buckets and disabled by
default.
Also syncing InstrProfData.inc between compiler-rt and llvm.
Differential Revision: https://reviews.llvm.org/D88838
Replace m_ConstantInt with m_APInt to support uniform vectors (with no undef elements)
Adding non-undef support would involve some refactoring of the MaskOps struct but this might still be worth it.
This reverts commit 25a97c3a43.
We have other constant folds that fold undef funnel shift amounts to 0 - so we need to be consistent.
If we end up with regressions where we lose a splat shift amount pattern we'll have to investigate other canonicalizations, but matchFunnelShift currently protects us from that.
This was broken by 16295d521e, when
instructions started being handled and not just constant
expressions. This was re-inserting an equivalent bitcast to the
original memcpy operand, which made a non-functional IR change on
every iteration.
This also fixes a secondary problem where it was inserting
addrspacecasts which may not have been legal (i.e. it changed the
source address space). Start visiting all pointer users and fail out
if we can't process them. Also start handling the relevant memory
intrinsic users. These cases can be dealt with by running
InferAddressSpaces separately.
This reverts the revert commit 710aceb645
and includes a fix for a memsan failure.
Original message:
This patch turns VPMemoryInstructionRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
m_SpecificInt doesn't accept undef elements in a vector splat value - tweak specific_intval to optionally allow undefs and add the m_SpecificIntAllowUndef variants.
Allows us to remove the m_APIntAllowUndef + comparison hack inside matchFunnelShift
By always performing a modulo on the shift amount constants this was causing undef amounts being replaced with zero, meaning we were losing funnel shift by splat (with undef) patterns.
Tweaked the shift amount bounds check to support (passthrough) undefs, and use Constant::mergeUndefsWith to preserve the undefs after folding.
While we haven't encountered an earth-shattering problem with this yet,
by now it is pretty evident that trying to model the ptr->int cast
implicitly leads to having to update every single place that assumed
no such cast could be needed. That is of course the wrong approach.
Let's back this out, and re-attempt with some another approach,
possibly one originally suggested by Eli Friedman in
https://bugs.llvm.org/show_bug.cgi?id=46786#c20
which should hopefully spare us this pain and more.
This reverts commits 1fb6104293,
7324616660,
aaafe350bb,
e92a8e0c74.
I've kept&improved the tests though.
LV fails with assertion checking that UF > 0. We already set UF to 1 if it is 0 except the case when IC > MaxInterleaveCount. The fix is to set UF to 1 for that case as well.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D87679
Replace m_SpecificInt with m_APIntAllowUndef to matching splats containing undefs, then use ConstantExpr::mergeUndefsWith to merge the undefs together in the result.
The undef funnel shift amounts are getting replaced with zero later on - I'll address this in a later patch, otherwise we lose potential shift by splat value patterns.
D85703 will need to create shallow wrappers in order to track the spmd icv. We need to make it available.
Differential Revision: https://reviews.llvm.org/D89342
-loop-extract-single is just -loop-extract on one loop.
-loop-extract depended on -break-crit-edges and -loop-simplify in the
legacy PM, but the NPM doesn't allow specifying pass dependencies like
that, so manually add those passes to the RUN lines where necessary.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D89016
While promotion currently always has an AST available, it is only
relevant for invalidation purposes in LoopPromoter, so we do not
need to have it as a hard dependency.
This adds an -enable-memcpyopt-memoryssa option that currently does
nothing apart from requiring MSSA as a dependency. The tests are
split to run both with the option disabled and enabled. I went with
this rather than the separate directory DSE uses, as I found it
convenient to have a direct side-by-side comparison of differences.
Differential Revision: https://reviews.llvm.org/D89206
moveUp() moves instructions, so we should move the corresponding
memory accesses as well. We should also move the store instruction
itself: Even though we'll end up removing it later, this gives us
a correct MemoryDef to replace.
The implementation is somewhat more complicated than it should be,
because we also handle the case where P does not have a memory
access due to a degnerate AA pipeline. Hopefully, the need for this
will go away in the future, when the rest of the pass is based on
MSSA.
Differential Revision: https://reviews.llvm.org/D88778
If the memcpy operands are the same (which is allowed since D86815)
then the memcpy is effectively a no-op and the partially overlapping
memset is not dead.
Differential Revision: https://reviews.llvm.org/D89192
MemCpyOpt can shorten a memset if it is later partially overwritten
by a memcpy. It checks that the destination is not read in between,
but we also need to make sure that the destination cannot be observed
via unwinding.
Differential Revision: https://reviews.llvm.org/D89190
The previous code added the scope on each iteration, so that the
same scope was represented many times in the same !noalias metadata.
That's legal, and semantically equivalent to only storing the scope
once, but it's also wasteful and may pessimize further optimization
if AATags get intersected naively, as done by the AliasSetTracker.
Based on the recent patches D88475 and D88429 where we are losing undef values due to extension/comparisons.
I've added a Constant::mergeUndefsWith method that merges the undef scalar/elements from another Constant into a specific Constant.
Differential Revision: https://reviews.llvm.org/D88687
This relands commit 1c021c64ca which was
reverted in commit 17cec6a11a because
an assertion was being triggered, since `BuildConstantFromSCEV()`
wasn't updated to handle the case where the constant we want to truncate
is actually a pointer. I was unsuccessful in coming up with a test case
where we'd end there with constant zext/sext of a pointer,
so i didn't handle those cases there until there is a test case.
Original commit message:
While we indeed can't treat them as no-ops, i believe we can/should
do better than just modelling them as `unknown`. `inttoptr` story
is complicated, but for `ptrtoint`, it seems straight-forward
to model it just as a zext-or-trunc of unknown.
This may be important now that we track towards
making inttoptr/ptrtoint casts not no-op,
and towards preventing folding them into loads/etc
(see D88979/D88789/D88788)
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88806
This patch turns VPMemoryInstructionRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84680
> While we indeed can't treat them as no-ops, i believe we can/should
> do better than just modelling them as `unknown`. `inttoptr` story
> is complicated, but for `ptrtoint`, it seems straight-forward
> to model it just as a zext-or-trunc of unknown.
>
> This may be important now that we track towards
> making inttoptr/ptrtoint casts not no-op,
> and towards preventing folding them into loads/etc
> (see D88979/D88789/D88788)
>
> Reviewed By: mkazantsev
>
> Differential Revision: https://reviews.llvm.org/D88806
It caused the following assert during Chromium builds:
llvm/lib/IR/Constants.cpp:1868:
static llvm::Constant *llvm::ConstantExpr::getTrunc(llvm::Constant *, llvm::Type *, bool):
Assertion `C->getType()->isIntOrIntVectorTy() && "Trunc operand must be integer"' failed.
See code review for a link to a reproducer.
This reverts commit 1c021c64ca.
60b852092c introduced SCEV verification to
deleteDeadLoop, but it appears this check is currently a bit over-eager
and some users of deleteDeadLoop appear to only patch up SE after
calling it (e.g. PR47753).
Remove the extra check for now. We can consider adding it back after we
tracked down the source of the inconsistency for PR47753.
If value tracking can confirm that a shift value is less than the type bitwidth then we can more confidently fold general or(shl(a,x),lshr(b,sub(bw,x))) patterns to a funnel/rotate intrinsic pattern without causing bad codegen regressions in the backend (see D89139).
Reapplied after the shift canonicalization in rG02295e6d1a15 which removed the need to flip the shift values.
Differential Revision: https://reviews.llvm.org/D88783
After rG02295e6d1a15 we no longer need to invert the shift values for fshr - this is just hidden at the moment as funnel shifts only ever match for constant values so never use the fshr "Sub on SHL" path.
While we indeed can't treat them as no-ops, i believe we can/should
do better than just modelling them as `unknown`. `inttoptr` story
is complicated, but for `ptrtoint`, it seems straight-forward
to model it just as a zext-or-trunc of unknown.
This may be important now that we track towards
making inttoptr/ptrtoint casts not no-op,
and towards preventing folding them into loads/etc
(see D88979/D88789/D88788)
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88806
I have introduced a new template PolySize class, where the template
parameter determines the type of quantity, i.e. for an element
count this is just an unsigned value. The ElementCount class is
now just a simple derivation of PolySize<unsigned>, whereas TypeSize
is more complicated because it still needs to contain the uint64_t
cast operator, since there are still many places in the code that
rely upon this implicit cast. As such the class also still needs
some of it's own operators.
I've tried to minimise the amount of code in the base PolySize
class, which led to a couple of changes:
1. In some places we were relying on '==' operator comparisons
between ElementCounts and the scalar value 1. I didn't put this
operator in the new PolySize class, and thought it was actually
clearer to use the isScalar() function instead.
2. I removed the isByteSized function and replaced it with calls
to isKnownMultipleOf(8).
I've also renamed NextPowerOf2 to be coefficientNextPowerOf2 so
that it's more consistent with coefficientDivideBy.
Differential Revision: https://reviews.llvm.org/D88409
And another step towards transforms not introducing inttoptr and/or
ptrtoint casts that weren't there already.
As we've been establishing (see D88788/D88789), if there is a int<->ptr cast,
it basically must stay as-is, we can't do much with it.
I've looked, and the most source of new such casts being introduces,
as far as i can tell, is this transform, which, ironically,
tries to reduce count of casts..
On vanilla llvm test-suite + RawSpeed, @ `-O3`, this results in
-33.58% less `IntToPtr`s (19014 -> 12629)
and +76.20% more `PtrToInt`s (18589 -> 32753),
which is an increase of +20.69% in total.
However just on RawSpeed, where i know there are basically
none `IntToPtr` in the original source code,
this results in -99.27% less `IntToPtr`s (2724 -> 20)
and +82.92% more `PtrToInt`s (4513 -> 8255).
which is again an increase of 14.34% in total.
To me this does seem like the step in the right direction,
we end up with strictly less `IntToPtr`, but strictly more `PtrToInt`,
which seems like a reasonable trade-off.
See https://reviews.llvm.org/D88860 / https://reviews.llvm.org/D88995
for some more discussion on the subject.
(Eventually, `CastInst::isNoopCast()`/`CastInst::isEliminableCastPair`
should be taught about this, yes)
Reviewed By: nlopes, nikic
Differential Revision: https://reviews.llvm.org/D88979
This expands upon the inloop reductions added in e9761688e41cb9e976,
allowing them to be inserted into tail folded loops. Reductions are
generates with the form:
x = select(mask, vecop, zero)
v = vecreduce.add(x)
c = add chain, v
Where zero here is chosen as the identity value for add reductions. The
backend is then expected to fold the select and the vecreduce into a
single predicated instruction.
Most of the code is fairly straight forward, except for the creation of
blockmasks which need to ensure they are created in dominance order. The
order they are added is altered to be after any phis, keeping the
requirements for the underlying IR.
Differential Revision: https://reviews.llvm.org/D84451
As shown in the affected test, we could increase instruction
count without this limitation. There's another test with extra
use that shows we still convert directly to a real "sext" if
possible.
If value tracking can confirm that a shift value is less than the type bitwidth then we can more confidently fold general or(shl(a,x),lshr(b,sub(bw,x))) patterns to a funnel/rotate intrinsic pattern without causing bad codegen regressions in the backend (see D89139).
Differential Revision: https://reviews.llvm.org/D88783
This exposes the helper for other power-of-2 instcombine folds that I'm intending to add vector support to.
The helper only operated on power-of-2 constants so getExactLogBase2 is a more accurate name.
This patch is a refactoring of how we process spills and allocas during CoroSplit.
In the previous implementation, everything that needs to go to the heap is put into Spills, including all the values defined by allocas.
And the way to identify a Spill, is to check whether there exists a use-def relationship that crosses suspension points.
This approach is fundamentally confusing, and unfortunately, incorrect.
First of all, allocas are always process differently than spills, hence it's quite confusing to put them together. It's a much cleaner to separate them and process them separately.
Doing so simplify lots of code and makes the logic more clear and easier to reason about.
Secondly, use-def relationship is insufficient to decide whether a value defined by AllocaInst needs to go to the heap.
There are many cases where a value defined by AllocaInst can implicitly be used across suspension points without a direct use-def relationship.
For example, you can store the address of an alloca into the heap, and load that address after suspension. Or you can escape the address into an object through a function call.
Or you can have a PHINode that takes two allocas, and this PHINode is used across suspension point (when this happens, the existing implementation will spill the PHINode, a.k.a a stack adddress to the heap!).
All these issues suggest that we need to separate spill and alloca in order to properly implement this.
This patch does not yet fix these bugs, however it sets up the code in a better shape so that we can start fixing them in the next patch.
The core idea of this patch is to add a new struct called FrameDataInfo, which contains all Spills, all Allocas, and a map from each definition to its layout index in the frame (FieldIndexMap).
Spills and Allocas are identified, stored and processed independently. When they are initially added to the frame, we record their field index through FieldIndexMap. When the frame layout is finalized, we update each index into their final layout index.
In doing so, I also cleaned up a few things and also discovered a few other bugs.
Cleanups:
1. Found out that PromiseFieldId is not used, delete it.
2. Previously, SpillInfo is a vector, which is strange because every def can have multiple users. This patch cleans it up by turning it into a map from def to users.
3. Previously, a frame Field struct contains a list of Spills that field corresponds to. This isn't necessary since we only need the layout index for each given definition. This patch removes that list. Instead, we connect each field and definition using the FieldIndexMap.
4. All the loops that process Spills are simplified now because we use a map instead of a vector.
Bugs:
It seems that we are only keeping llvm.dbg.declare intrinsics in the .resume part of the function. The ramp function will no longer has it. This means we are dropping some debug information in the ramp function.
The next step is to start fixing the bugs where the implementation fails to identify some allocas that should live on the frame.
Differential Revision: https://reviews.llvm.org/D88872
MemCpyOpt can hoist stores while load+store pairs into memcpy.
This hoisting can currently result in stores being executed that
weren't guaranteed to execute in the original problem.
Differential Revision: https://reviews.llvm.org/D89154
Summary:
The current instruction sink pass uses findNearestCommonDominator of all users to find block to sink the instruction to.
However, a user may be in a dead block, which will result in unexpected behavior.
This patch handles such cases by skipping dead blocks. This patch fixes:
https://bugs.llvm.org/show_bug.cgi?id=47415
Reviewers:
MaskRay, arsenm
Differential Revision:
https://reviews.llvm.org/D89166
If a module has many values that need to be resolved by
ResolvedUndefsIn, compilation takes quadratic time overall. Solve should
do a small amount of work, since not much is added to the worklists each
time markOverdefined is called. But ResolvedUndefsIn is linear over the
length of the function/module, so resolving one undef at a time is
quadratic in general.
To solve this, make ResolvedUndefsIn resolve every undef value at once,
instead of resolving them one at a time. This loses a little
optimization power, but can be a lot faster.
We still need a loop around ResolvedUndefsIn because markOverdefined
could change the set of blocks that are live. That should be uncommon,
hopefully. We could optimize it by tracking which blocks transition from
dead to live, instead of iterating over the whole module to find them.
But I'll leave that for later. (The whole function will become a lot
simpler once we start pruning branches on undef.)
The regression test changes seem minor. The specific cases in question
could probably be optimized with a bit more work, but they seem like
edge cases that don't really matter.
Fixes an "infinite" compile issue my team found on an internal workoad.
Differential Revision: https://reviews.llvm.org/D89080
There are cases that generated OpenMP code consists of multiple,
consecutive OpenMP parallel regions, either due to high-level
programming models, such as RAJA, Kokkos, lowering to OpenMP code, or
simply because the programmer parallelized code this way. This
optimization merges consecutive parallel OpenMP regions to: (1) reduce
the runtime overhead of re-activating a team of threads; (2) enlarge the
scope for other OpenMP optimizations, e.g., runtime call deduplication
and synchronization elimination.
This implementation defensively merges parallel regions, only when they
are within the same BB and any in-between instructions are safe to
execute in parallel.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D83635
In the NPM, a pass cannot depend on another non-analysis pass. So pin
the test that tests that -lowerswitch is run automatically to legacy PM.
Reviewed By: sameerds
Differential Revision: https://reviews.llvm.org/D89051
There might be a better way to specify the pre-conditions,
but this is hopefully clearer than the way it was written:
https://rise4fun.com/Alive/Jhk3
Pre: C2 < 0 && isShiftedMask(C2) && (C1 == C1 & C2)
%a = and %x, C2
%r = add %a, C1
=>
%a2 = add %x, C1
%r = and %a2, C2
IV widening is sometimes a strictly harmful transform (some examples
of this are shown in tests 11, 12 in widen-loop-comp.ll). One of the
reasons of this is that sometimes SCEV fails to prove some facts after
part of guards has been widened.
Though each single such case looks like a bug that can be addressed,
it seems that disabling of IV widening may be profitable in some cases.
We want to have an option to do so. By default, existing behavior is
preserved and IV widening is on.
Annoyingly vectors aren't supported by shouldChangeType(), but we have precedents for always performing this on vector types (e.g. narrowBinOp).
Differential Revision: https://reviews.llvm.org/D89067
Use cast<> as we immediately dereference the pointer afterwards - cast<> will assert if we fail.
Prevents clang static analyzer warning that we could deference a null pointer.
Pre-conditions seem to be optimal, but we don't need a use check
because we are only replacing an add with a sub.
https://rise4fun.com/Alive/hzN
Pre: (~C1 | C2 == -1) && isPowerOf2(C2+1)
%m = and i8 %x, C1
%f = xor i8 %m, C2
%r = add i8 %f, C3
=>
%r = sub i8 C2 + C3, %m
Use cast<> as we immediately dereference the pointer afterwards - cast<> will assert if we fail.
Prevents clang static analyzer warning that we could deference a null pointer.
Complete basic PR46895 fixes by refactoring D87452/D88402 to allow us to match non-uniform constant values.
We still don't handle non-uniform vectors that contain undef elements, but that can wait until we have a decent generic mechanism for this.
Differential Revision: https://reviews.llvm.org/D88420
Use SCEV to salvage additional @llvm.dbg.value that have turned into
referencing undef after transformation (and traditional
salvageDebugInfo). Before transformation compute SCEV for each
@llvm.dbg.value in the loop body and store it (along side its current
DIExpression). After transformation update those @llvm.dbg.value now
referencing undef by comparing its stored SCEV to the SCEV of the
current loop-header PHI-nodes. Allow match with offset by inserting
compensation code in the DIExpression.
Includes fix for the nullptr deref that caused the original commit
to be reverted in 9d63029770.
Fixes : PR38815
Differential Revision: https://reviews.llvm.org/D87494
First step towards extending the existing rotation support to full funnel shift handling now that the backend legalization support has improved.
This enables us to match the shift by constant cases, which are pretty trivial to expand again if necessary.
D88420 will add non-uniform support for funnel shifts as well once its been finalized.
Differential Revision: https://reviews.llvm.org/D88834
The existing code ignores undef values which matches m_SpecificInt_ICMP, although m_SpecificInt_ICMP returns false for an all-undef constant, I've added test coverage at rGfe0197e194a64f9 to show that undef folding should already have dealt with that case.
We currently collect the ICmp and Add from an induction variable,
marking them as dead so that vplan values are not created for them. This
extends that to include any single use trunk from the ICmp, which allows
the Add to more readily be removed too.
This can help with costing vplan nodes, as the ICmp and Add are more
reliably removed and are not double-counted.
Differential Revision: https://reviews.llvm.org/D88873
In some cases, we can negate instruction if only one of it's operands
negates. Previously, we assumed that constants would have been
canonicalized to RHS already, but that isn't guaranteed to happen,
because of InstCombine worklist visitation order,
as the added test (previously-hanging) shows.
So if we only need to negate a single operand,
we should ensure ourselves that we try constant operand first.
Do that by re-doing the complexity sorting ourselves,
when we actually care about it.
Fixes https://bugs.llvm.org/show_bug.cgi?id=47752
And another step towards transformss not introducing inttoptr and/or
ptrtoint casts that weren't there already.
In this case, when load/store uses have conflicting types,
instead of falling back to the iN, we can try to use allocated sub-type.
As disscussed, this isn't the best idea overall (we shouldn't rely on
allocated type), but it works fine as a temporary measure.
I've measured, and @ `-O3` as of vanilla llvm test-suite + RawSpeed,
this results in +0.05% more bitcasts, -5.51% less inttoptr
and -1.05% less ptrtoint (at the end of middle-end opt pipeline)
See https://bugs.llvm.org/show_bug.cgi?id=47592
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D88788
The old function attribute deduction pass ignores reads of constant
memory and we need to copy this behavior to replace the pass completely.
First step are constant globals. TBAA can also describe constant
accesses and there are other possibilities. We might want to consider
asking the alias analyses that are available but for now this is simpler
and cheaper.
If the function is not assumed `noreturn` we should not wait for an
update to mark the call site as "may-return".
This has two kinds of consequences:
- We have less iterations in many tests.
- We have less deductions based on "known information" (since we ask
earlier, point 1, and therefore assumed information is not "known"
yet).
The latter is an artifact that we might want to tackle properly at some
point but which is not easily fixable right now.
The call slot optimization has some home-grown code for checking
whether the destination is dereferenceable. Replace this with the
generic isDereferenceableAndAlignedPointer() helper.
I'm not checking alignment here, because that is currently handled
separately and may be an enforced alignment for allocas. The clean
way of integrating that part would probably be to accept a callback
in isDereferenceableAndAlignedPointer() for the actual isAligned check,
which would then have a chance to use an enforced alignment instead.
This allows the destination to be a GEP (among other things), though
the two open TODOs may prevent it from working in practice.
Differential Revision: https://reviews.llvm.org/D88805
When performing call slot optimization for a non-local destination,
we need to check whether there may be throwing calls between the
call and the copy. Otherwise, the early write to the destination
may be observable by the caller.
This was already done for call slot optimization of load/store,
but not for memcpys. For the sake of clarity, I'm moving this check
into the common optimization function, even if that does need an
additional instruction scan for the load/store case.
As efriedma pointed out, this check is not sufficient due to
potential accesses from another thread. This case is left as a TODO.
Differential Revision: https://reviews.llvm.org/D88799
When we assume a return value is dead we might still visit return
instructions via `Attributor::checkForAllReturnedValuesAndReturnInsts(..)`.
When we do so the "returned value" is potentially simplified to `undef`
as it is the assumed "returned value". This is a problem if there was a
preexisting `noundef` attribute that will only be removed as we manifest
the `undef` return value. We should not use this combination to derive
`unreachable` though. Two test cases fixed.
In AAMemoryBehaviorFloating we used to track benign uses in a SetVector.
With this change we look through benign uses eagerly to reduce the
number of elements (=Uses) we look at during an update.
The test does actually not fail prior to this commit but I already wrote
it so I kept it.
We know V is a IntToPtrInst or PtrToIntInst type so we know its a CastInst - so use cast<> directly.
Prevents clang static analyzer warning that we could deference a null pointer.
This still only gets used for scalar types but now always uses ConstantExpr in preparation for vector support - it was using APInt methods in some places.
Use context to prove that load can be safely executed at a point where load is being hoisted.
Postpone the decision about safety of speculative load execution till the moment we know
where we hoist load and check safety at that context.
Reviewers: nikic, fhahn, mkazantsev, lebedev.ri, efriedma, reames
Reviewed By: reames, mkazantsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D88725
This reverts commit 20797989ea.
This patch (https://reviews.llvm.org/D69257) cannot complete a stage2
build due to the change:
```
CI->getCalledFunction()->getName().contains("longjmp")
```
There are several concrete issues here:
- The callee may not be a function, so `getCalledFunction` can assert.
- The called value may not have a name, so `getName` can assert.
- There's no distinction made between "my_longjmp_test_helper" and the
actual longjmp libcall.
At a higher level, there's a serious layering problem here. The
splitting pass makes policy decisions in a general way (e.g. based on
attributes or profile data). Special-casing certain names breaks the
layering. It subverts the work of library maintainers (who may now need
to opt-out of unexpected optimization behavior for any affected
functions) and can lead to inconsistent optimization behavior (as not
all llvm passes special-case ".*longjmp.*" in the same way).
The patch may need significant revision to address these issues.
But the immediate issue is that this crashes while compiling llvm's unit
tests in a stage2 build (due to the `getName` problem).
(it was introduced in https://lists.llvm.org/pipermail/llvm-dev/2015-January/080956.html)
This canonicalization seems dubious.
Most importantly, while it does not create `inttoptr` casts by itself,
it may cause them to appear later, see e.g. D88788.
I think it's pretty obvious that it is an undesirable outcome,
by now we've established that seemingly no-op `inttoptr`/`ptrtoint` casts
are not no-op, and are no longer eager to look past them.
Which e.g. means that given
```
%a = load i32
%b = inttoptr %a
%c = inttoptr %a
```
we likely won't be able to tell that `%b` and `%c` is the same thing.
As we can see in D88789 / D88788 / D88806 / D75505,
we can't really teach SCEV about this (not without the https://bugs.llvm.org/show_bug.cgi?id=47592 at least)
And we can't recover the situation post-inlining in instcombine.
So it really does look like this fold is actively breaking
otherwise-good IR, in a way that is not recoverable.
And that means, this fold isn't helpful in exposing the passes
that are otherwise unaware of these patterns it produces.
Thusly, i propose to simply not perform such a canonicalization.
The original motivational RFC does not state what larger problem
that canonicalization was trying to solve, so i'm not sure
how this plays out in the larger picture.
On vanilla llvm test-suite + RawSpeed, this results in
increase of asm instructions and final object size by ~+0.05%
decreases final count of bitcasts by -4.79% (-28990),
ptrtoint casts by -15.41% (-3423),
and of inttoptr casts by -25.59% (-6919, *sic*).
Overall, there's -0.04% less IR blocks, -0.39% instructions.
See https://bugs.llvm.org/show_bug.cgi?id=47592
Differential Revision: https://reviews.llvm.org/D88789
... by using MapVector. The issue was caused by 63182c2ac0.
Also use stable_partition instead of partition to get stable results
across different STL implementations.
When retrying the "simplify with operand replaced" select
optimization without poison flags, also handle inbounds on GEPs.
Of course, this particular example would also be safe to transform
while keeping inbounds, but the underlying machinery does not
know this (yet).
Use m_Specific instead of m_Value followed by an equality check - we already do this for the similar folds above, it looks like an oversight in rG2b459fe7e1e where the original pattern match code looked a little different.
This reverts commit a3caf7f610.
The ReleaseLTO-g test-suite configuration has been failing
to build since this commit, because clang segfaults while
building 7zip.
Update the code responsible for deleting VPBBs and recipes to properly
update users and release operands.
This is another preparation for D84680 & following patches towards
enabling modeling def-use chains in VPlan.
Use SCEV to salvage additional @llvm.dbg.value that have turned into
referencing undef after transformation (and traditional
salvageDebugInfo). Before transformation compute SCEV for each
@llvm.dbg.value in the loop body and store it (along side its current
DIExpression). After transformation update those @llvm.dbg.value now
referencing undef by comparing its stored SCEV to the SCEV of the
current loop-header PHI-nodes. Allow match with offset by inserting
compensation code in the DIExpression.
Fixes : PR38815
Differential Revision: https://reviews.llvm.org/D87494
This adds a helper to convert a VPRecipeBase pointer to a VPUser, for
recipes that inherit from VPUser. Once VPRecipeBase directly inherits
from VPUser this helper can be removed.
When updating operands of a VPUser, we also have to adjust the list of
users for the new and old VPValues. This is required once we start
transitioning recipes to become VPValues.
Do not instrument user-defined ELF sections (whose names resemble valid
C identifiers). They may have special use semantics and modifying them
may break programs. This is e.g. the case with NetBSD __link_set API
that expects these sections to store consecutive array elements.
Differential Revision: https://reviews.llvm.org/D76665
Add basic vector handling to recognizeBSwapOrBitReverseIdiom/collectBitParts - this works at the element level, all vector element operations must match (splat constants etc.) and there is no cross-element support (insert/extract/shuffle etc.).
If we're bswap'ing some bytes and zero'ing the remainder we can perform this as a bswap+mask which helps us match 'partial' bswaps as a first step towards folding into a more complex bswap pattern.
Reapplied with early-out if recognizeBSwapOrBitReverseIdiom collects a source wider than the result type.
Differential Revision: https://reviews.llvm.org/D88578
Next to erasing the instruction, we also always want to remove
it from MSSA and MD. Use a common function to do so.
This is a refactoring split out from D26739.
The removal of the cpy instruction is left to the caller of
performCallSlotOptzn(), including the invalidation of MD. Both
call-sites already do this.
Also handle incrementation of NumMemCpyInstr consistently at the
call-site. One of the call-site was already doing this, which
ended up incrementing the statistic twice.
This fix was part of D26739.
While looping through all args or all return values, we may mark a use
of a later iteration as live. Previously when we got to that later value
it would ignore that and continue adding to Uses instead of marking it
live. For example, when looping through arg#0 and arg#1,
MarkValue(arg#0, Live) may cause some use of arg#1 to be live, but
MarkValue(arg#1, MaybeLive) will not notice that and continue adding
into Uses.
Now MarkValue(RA, MaybeLive) will MarkLive(RA) if any use is live.
Fixes PR47444.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D88529