With debug information enabled (-g) Clang will wrap the actual target
region into a new function which is called from the "kernel". The problem
is that the "kernel" is now basically a wrapper without all the things
we expect. More importantly, if we end up asking for an AAKernelInfo
for the "target region function" we might try to turn it into SPMD mode.
That used to trigger an assertion because that function doesn't have an
appropriately named `_exec_mode` global. While the global is going away
soon we still need to make sure to properly handle this case, e.g.,
perform optimizations reliably.
Differential Revision: https://reviews.llvm.org/D122043
Hardcode the function type as ParallelTask, which is the guaranteed
pointee type of this runtime function argument (if pointee types
exist). The elimination of the callee bitcast is left for InstCombine.
Differential Revision: https://reviews.llvm.org/D120885
If the user disables de-globalization we did not seed the AAHeapToShared
and AAHeapToStack but we still could end up with them through in-flight
lookups. With this patch we disable AAHeapToShared completely if the
user disabled de-globalization. Heap-2-stack is still run though.
Differential Revision: https://reviews.llvm.org/D121059
Heap-2-stack and heap-2-shared can replace an allocation call with
something else. To avoid us deriving information from the allocator
implementation we register a simplification callback now that will
force us to stop at the call site. We probably should create the
replacement memory eagerly and return that instead though.
The custom state machine had a check for surplus threads that filtered
the main thread if the kernel was executed by a single warp only. We
now first check for the main thread, then for surplus threads, so the
former is no longer filtered out.
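The ordering issue can be illustrated with a minimal sketch. It assumes a thread layout that is not taken from the patch itself: the last warp of the block hosts the main thread in its first lane and all earlier warps hold workers. The names `ThreadRole` and `classifyThread` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical roles a thread can take in a generic-mode kernel.
enum class ThreadRole { Main, Surplus, Worker };

// Assumed layout (illustrative only): the last warp of the block hosts the
// main thread in its first lane, every warp before it holds workers.
// Requires a power-of-two warpSize and blockSize >= warpSize.
inline ThreadRole classifyThread(uint32_t tid, uint32_t blockSize,
                                 uint32_t warpSize) {
  const uint32_t mainTid = (blockSize - 1) & ~(warpSize - 1);
  // Check for the main thread first ...
  if (tid == mainTid)
    return ThreadRole::Main;
  // ... and only then for surplus threads. With a single warp there are
  // zero workers, so doing this check first would classify every thread,
  // including the main thread, as surplus.
  const uint32_t numWorkers = blockSize - warpSize;
  if (tid >= numWorkers)
    return ThreadRole::Surplus;
  return ThreadRole::Worker;
}
```

Under these assumptions, thread 0 of a single-warp launch is classified as the main thread only because the main-thread check runs before the surplus check.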
Fixes #54214.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121011
Previously there was a debug flag to print the module after
optimizations. Sometimes we wanted to print the module before
optimizations so this is being split into two flags.
`-openmp-opt-print-module` is now `-openmp-opt-print-module-after`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120768
One of the optimizations performed in OpenMPOpt pushes globalized
variables to static shared memory. This is preferable to keeping the
runtime call in all cases; however, if too many variables are pushed to
shared memory the kernel will crash. Since this is an optimization and
not something the user specified explicitly, there should be an option
to limit this optimization in those cases. This patch introduces the
`-openmp-opt-shared-limit=` option to limit the number of bytes that
HeapToShared will place in shared memory.
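A byte limit of this kind could be enforced with a simple running budget. The sketch below is not the pass's implementation; `SharedMemoryBudget` and `tryReserve` are hypothetical names used only to show the idea of refusing a replacement once the limit would be exceeded.

```cpp
#include <cstddef>

// Hypothetical budget tracker; the name and interface are illustrative only.
class SharedMemoryBudget {
public:
  explicit SharedMemoryBudget(std::size_t limitBytes) : Limit(limitBytes) {}

  // Returns true and records the allocation if it still fits under the
  // limit; otherwise leaves the budget untouched so the caller can keep
  // the original runtime call.
  bool tryReserve(std::size_t bytes) {
    if (bytes > Limit - Used) // Used <= Limit always holds, so no underflow.
      return false;
    Used += bytes;
    return true;
  }

  std::size_t used() const { return Used; }

private:
  std::size_t Limit;
  std::size_t Used = 0;
};
```

An allocation that does not fit is simply skipped, and later, smaller allocations may still be accepted.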
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120079
This patch adds the `__kmpc_get_hardware_num_threads_in_block`
OpenMP RTL function to the externalization RAII struct. This was getting
optimized out and then being replaced with an undefined value once added
back in, causing bugs for complex reductions.
Fixes #53909.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120076
If we assume `llvm.amdgcn.s.barrier` is aligned we may remove it and
cause OpenMP GPU applications on the AMD GPU to be stuck or wrongly
synchronized.
Reported by Carlo Bertolli.
Patch originally by Giorgis Georgakoudis (@ggeorgakoudis), typos and
bugs introduced later by me.
This patch allows us to remove redundant barriers if they are part
of a "consecutive" pair of barriers in a basic block with no impacted
memory effect (read or write) in-between them. Memory accesses to
local (=thread private) or constant memory are allowed to appear.
Technically we could also allow any other memory that is not used to
share information between threads, e.g., the result of a malloc that
is also not captured. However, it will be easier to do more reasoning
once the code is put into an AA. That will also allow us to look through
phis/selects reasonably. At that point we should also deal with calls,
barriers in different blocks, and other complexities.
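The per-block scan described above might look roughly like the following sketch. It models instructions with a hypothetical `Kind` enum instead of LLVM IR, treats calls conservatively as blocking, and arbitrarily reports the later barrier of each pair as the redundant one; none of these choices are taken from the patch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instruction kinds for one basic block. `Other` stands for
// instructions without memory effects.
enum class Kind { Barrier, LocalAccess, ConstantAccess, SharedAccess, Call, Other };

// Returns the indices of barriers that directly follow another barrier with
// only local/constant memory accesses (or no memory effects) in between.
inline std::vector<std::size_t>
findRedundantBarriers(const std::vector<Kind> &block) {
  std::vector<std::size_t> redundant;
  bool barrierPending = false; // A barrier was seen and nothing blocked since.
  for (std::size_t i = 0; i < block.size(); ++i) {
    switch (block[i]) {
    case Kind::Barrier:
      if (barrierPending)
        redundant.push_back(i);
      barrierPending = true;
      break;
    case Kind::LocalAccess:
    case Kind::ConstantAccess:
    case Kind::Other:
      break; // Allowed between the two barriers of a pair.
    case Kind::SharedAccess:
    case Kind::Call:
      barrierPending = false; // Potentially shared memory effect.
      break;
    }
  }
  return redundant;
}
```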
Differential Revision: https://reviews.llvm.org/D118002
We used to remove noinline from known OpenMP runtime functions (which
are declared in OMPKinds.td). Now we remove noinline from all functions
with the proper prefixes: __kmpc, _ZN4_OMP (= namespace omp), omp_
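A prefix test over these three strings can be sketched as follows. The helper name `hasOpenMPRuntimePrefix` is hypothetical, and the sketch works on plain strings rather than on LLVM `Function` objects.

```cpp
#include <array>
#include <string_view>

// Hypothetical helper: true if `name` starts with one of the prefixes listed
// in the commit message.
inline bool hasOpenMPRuntimePrefix(std::string_view name) {
  constexpr std::array<std::string_view, 3> prefixes = {"__kmpc", "_ZN4_OMP",
                                                        "omp_"};
  for (std::string_view p : prefixes)
    if (name.substr(0, p.size()) == p)
      return true;
  return false;
}
```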
Summary:
This patch modifies code generation in OpenMPIRBuilder to pass arguments
to the parallel region outlined function in an aggregate (struct),
besides the global_tid and bound_tid arguments. It depends on the
updated CodeExtractor (see D96854) for support. It mirrors functionality
of Clang codegen (see D102107).
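As a rough source-level analogy (not the IR that OpenMPIRBuilder emits), passing captured variables through an aggregate could look like the sketch below. `CapturedArgs`, `outlinedRegion`, and `runRegion` are hypothetical names.

```cpp
// Hypothetical aggregate holding the captured variables of a parallel region.
struct CapturedArgs {
  int *Sum;
  const int *Data;
  int Count;
};

// Outlined region: the two thread-id arguments stay separate, everything
// else arrives through one pointer to the aggregate.
inline void outlinedRegion(int *globalTid, int *boundTid, void *aggregate) {
  (void)globalTid;
  (void)boundTid;
  auto *args = static_cast<CapturedArgs *>(aggregate);
  for (int i = 0; i < args->Count; ++i)
    *args->Sum += args->Data[i];
}

// Call site: pack the captured variables and pass a single pointer.
inline int runRegion(const int *data, int count) {
  int sum = 0;
  CapturedArgs args{&sum, data, count};
  int gtid = 0, btid = 0;
  outlinedRegion(&gtid, &btid, &args);
  return sum;
}
```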
Differential Revision: https://reviews.llvm.org/D110114
Currently we push some variables to a global constant containing shared
memory as an optimization. This generated constant had internal linkage
and should not have collided with any known identifiers in the
translation unit. However, there have been observed cases of this
optimization unintentionally colliding with undocumented PTX
identifiers. This patch adds a suffix to the created globals to
hopefully bypass this.
Depends on D118059
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D118068
Previously in OpenMPOpt we did not correctly inherit the calling
convention of the callee when creating new OpenMP runtime calls. This
created issues when the calling convention was changed during
`GlobalOpt` but a new call was created without the correct calling
convention. This led to the call being replaced with a poison value in
`InstCombine` due to undefined behaviour, causing large portions of
the program to be incorrectly eliminated. This patch correctly inherits
the existing calling convention from the callee.
Reviewed By: tianshilei1992, jdoerfert
Differential Revision: https://reviews.llvm.org/D118059
While we might know the value of an ICV at a getter position, it is not
always clear that we can simply use it. Verify the value is valid first
to avoid invalid IR.
Fixes #53300.
Similar to loads, we want to be aggressive when it comes to store
simplification. Not everything in LLVM handles dead stores well when
address space casts are involved, but we can simply ask the Attributor
to do it for us.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D109998
One of the unused ident_t fields now holds the size of the string
(=const char *) field so we have an easier time dealing with those
in the future.
Differential Revision: https://reviews.llvm.org/D113126
This patch uses the return alignment attribute now present in the
`__kmpc_alloc_shared` runtime call to set the alignment of the shared
memory global created to replace it.
Depends on D115971
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D116319
Reduction functions were guarded before, which was wrong because they
are SPMD compatible.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115159
Before SPMDzation it was sufficient to add an incompatible instruction
to the SPMDCompatibilityTracker. However, now adding instructions means
they need guarding. As calls cannot be guarded in general we need to
explicitly prevent SPMD mode.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115158
If two reaching kernels disagree on the execution mode we cannot guard a
function right now. Ensure we do not, as we would otherwise cause a
deadlock.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D114866
This patch adds the `__kmpc_get_warp_size` OpenMP RTL function to the
externalization RAII struct. This was getting optimized out and then
being replaced with an undefined value once added back in, causing bugs
for complex reductions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D114802
This patch uses the abstract attributor introduced in D111054 to get the
assumption values instead of the `hasAssumption` function. This also
calls it so assumption information should propagate through the device
where applicable.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111445
If we assume SPMD-mode during the fixpoint iteration we have to execute
the kernel in SPMD-mode. If we change our mind during manifest there is
the chance of a mismatch between the simplification, e.g., of
`__kmpc_is_spmd_exec_mode` calls, and the execution mode. This problem
was introduced in D109438.
This patch is a compromise to resolve the problem purely in OpenMP-opt
while trying to keep the benefits of D109438 around. This might not
always work, see `get_hardware_num_threads_in_block_fold` but it often
does. At the same time we do keep value specialization and execution
mode in sync.
Proper solutions to this problem should be considered. I believe a new
execution mode is the easiest way forward (Singleton-SPMD).
Alternatively, SPMD-mode execution can be used with a way to provide a
new thread_limit (here 1) to the runtime. This is more general and could
be useful if we see `num_threads` clauses or workshared loops with small
trip counts in the kernel. In either proposal we need to disable the
guarding for the kernel (which was the motivation for D109438).
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D112894
Before we had aligned barriers the `__kmpc_barrier_simple_spmd` was
OK to be used in the custom state machine. Now that SPMD barriers are
assumed to be aligned we need to use a "generic" barrier in places
that are not aligned.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112893
Global symbols cannot have arbitrary names, so we need to sanitize the
string first. Also remove an assertion that is neither necessary nor
true in general.
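A sanitization step of this sort might be sketched as below. The set of permitted characters (ASCII letters and digits, everything else mapped to `_`) is an assumption for illustration, not the rule used in the patch, and `sanitizeForSymbol` is a hypothetical name.

```cpp
#include <cctype>
#include <string>

// Hypothetical sanitizer: keeps alphanumeric characters and replaces every
// other character with an underscore. The allowed set is an assumption.
inline std::string sanitizeForSymbol(const std::string &raw) {
  std::string out;
  out.reserve(raw.size());
  for (unsigned char c : raw)
    out.push_back(std::isalnum(c) ? static_cast<char>(c) : '_');
  return out;
}
```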
Reviewed By: ggeorgakoudis
Differential Revision: https://reviews.llvm.org/D112892
This patch adds a new command line option `openmp-opt-max-iterations`
that controls the maximum number of iterations the attributor will run
for when compiling OpenMP target device code. This patch also adds a
remark to indicate when the attributor failed because it did not run
for enough iterations.
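The interaction between an iteration cap and a "did not converge" report can be sketched generically. This is not the Attributor's driver loop; `runToFixpoint` and `FixpointResult` are hypothetical, and a caller would emit the remark when `Converged` is false.

```cpp
#include <functional>

// Hypothetical result type: whether a fixpoint was reached and how many
// steps were executed.
struct FixpointResult {
  bool Converged;
  unsigned Iterations;
};

// Runs `step` until it reports no change or the iteration cap is hit.
// `step` returns true if it changed anything.
inline FixpointResult runToFixpoint(const std::function<bool()> &step,
                                    unsigned maxIterations) {
  for (unsigned i = 0; i < maxIterations; ++i)
    if (!step())
      return {true, i + 1};
  return {false, maxIterations};
}
```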
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110749
Summary:
The RTL functions added in https://reviews.llvm.org/D110429 were
mistakenly left out from the list of safe runtime calls in AAKernelInfo.
This patch adds them in.
This patch fixes a problem when the AAKernelInfo state was invalidated,
e.g., due to `optnone` for a kernel, but not all parts indicated the
invalidation properly. We further eliminate most full state
invalidations as they should never be necessary.
Differential Revision: https://reviews.llvm.org/D109468
This is a follow-up of D110029, which uses a bitset to indicate the execution mode. This patch makes the corresponding changes in the function call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110279
The execution mode of a kernel is stored in a global variable, whose value means:
- 0 - SPMD mode
- 1 - generic mode
- 2 - SPMD mode execution with generic mode semantics
We are going to add support for SIMD execution mode. It will come with another
execution mode, such as SIMD-generic mode. As a result, this value-based indicator
is not flexible enough.
This patch switches to a bitset-based solution to encode the execution mode. Each
position is:
[0] - generic mode
[1] - SPMD mode
[2] - SIMD mode (will be added later)
In this way, `0x1` is generic mode, `0x2` is SPMD mode, and `0x3` is SPMD mode
execution with generic mode semantics. In the future after we add the support for
SIMD mode, `0b1xx` will be in SIMD mode.
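The encoding can be written down as a small flag enum. The enumerator and helper names below are hypothetical and not the identifiers used in LLVM; only the bit positions and the resulting values `0x1`, `0x2`, and `0x3` come from the description above. The SIMD bit is left out because it is only planned.

```cpp
#include <cstdint>

// Hypothetical names; bit positions follow the commit message.
enum ExecModeFlags : uint8_t {
  ExecGeneric = 1u << 0,                    // 0x1: generic mode
  ExecSPMD = 1u << 1,                       // 0x2: SPMD mode
  ExecGenericSPMD = ExecGeneric | ExecSPMD, // 0x3: SPMD execution with
                                            //      generic mode semantics
};

// Queries test individual bits instead of comparing against fixed values,
// so adding a further mode bit later does not invalidate them.
inline bool hasGenericSemantics(uint8_t mode) { return mode & ExecGeneric; }
inline bool runsInSPMD(uint8_t mode) { return mode & ExecSPMD; }
```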
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110029
Summary:
The thread ID function was reintroduced in D110195, but could
potentially be removed by the optimizer. Make the function noinline to
preserve the call sites and add it to the externalization RAII so its
definition is not removed by the attributor.