The device runtime contains several calls to __kmpc_get_hardware_num_threads_in_block
and __kmpc_get_hardware_num_blocks. If the thread_limit and the num_teams are constant,
these calls can be folded to the constant value.
Commit D106033 provides the optimization phase; this commit adds the grid-size attributes to
the outlined function. The two attributes are `omp_target_num_teams` and
`omp_target_thread_limit`, and they are added as long as the corresponding values are constant.
Two new functions are created, `getNumThreadsExprForTargetDirective` and
`getNumTeamsExprForTargetDirective`. The original functions, `emitNumTeamsForTargetDirective`
and `emitNumThreadsForTargetDirective`, both identify the expression and emit the code.
However, for the device version of the outlined function we cannot emit anything, so this
is a first attempt at separating the emission of code from the deduction of the values.
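As a rough illustration (the function and variable names below are invented for the example), this is the kind of target region where both clause arguments are compile-time constants, so the outlined kernel can carry `omp_target_num_teams` and `omp_target_thread_limit`:

```c++
// Hypothetical example: num_teams and thread_limit are integer constants,
// so the outlined target function can be annotated with
// omp_target_num_teams=4 and omp_target_thread_limit=64, allowing the
// device runtime queries to be folded.
void saxpy(float *x, float *y, float a, int n) {
  #pragma omp target teams distribute parallel for num_teams(4) \
      thread_limit(64) map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```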
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106298
Parallel regions are outlined as functions, with captured variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic, since there is a variable number of arguments to forward to the outlined function; (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) on the number of arguments; (3) forwarded arguments must be cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating the captured arguments in a struct that is passed to the fork_call.
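A minimal sketch of the aggregation, with hypothetical struct and function names; it only shows how the captured values can travel through a single pointer instead of a variadic argument list:

```c++
// Illustrative sketch only: names and layout are made up; the point is that
// the captured values are packed into one struct whose address is the single
// argument forwarded to the outlined function.
struct CapturedArgs {
  int *a;
  float scale;
  long n;
};

// Shape of the outlined parallel body after aggregation (simplified).
static void outlined(int * /*global_tid*/, int * /*bound_tid*/,
                     CapturedArgs *args) {
  for (long i = 0; i < args->n; ++i)
    args->a[i] = static_cast<int>(args->a[i] * args->scale);
}

void run(int *a, float scale, long n) {
  CapturedArgs args{a, scale, n};
  // Conceptually the fork becomes a fixed-arity call, e.g.
  //   __kmpc_fork_call(loc, /*nargs=*/1, outlined, &args);
  outlined(nullptr, nullptr, &args); // stand-in for the runtime dispatch
}
```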
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102107
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow-up patch.
The "surplus" threads of the "master warp" will not exit early anymore, so we need to use
non-aligned barriers. The new runtime will not have an extra warp but will also require
these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
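For orientation, a schematic sketch of the kernel skeleton such a uniform interface enables; the stub names, signatures, and return-value convention below are placeholders, not the actual device runtime API:

```c++
// Illustrative stand-ins for the kernel set-up/tear-down entry points.
static int device_runtime_init_stub() { return -1; }  // -1: run the user code
static void device_runtime_deinit_stub() {}

static void kernel_body() { /* the outlined target region */ }

// What the OMPIRBuilder conceptually emits for each kernel entry point.
void offloading_kernel_entry() {
  int exec = device_runtime_init_stub();
  if (exec != -1)   // surplus worker threads are serviced elsewhere and exit
    return;
  kernel_body();    // only the main thread runs the target region
  device_runtime_deinit_stub();
}
```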
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
Summary:
Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. The GPU cannot share data between its threads, so it must allocate global or
shared memory to store the data in. Currently, this is implemented entirely in the frontend using the
`__kmpc_data_sharing_push_stack` and `__kmpc_data_sharing_pop_stack` functions to emulate standard
CPU stack sharing. The frontend scans the target region for variables that escape the region and
must be shared between the threads. Each such variable then has a field created for it in a global
record type.
This patch replaces this functionality with a single allocation call, effectively mimicking an
alloca instruction for the variables that must be shared between the threads. This will be much
slower than the current solution, but it makes the code much easier to optimize because we can
analyze each variable independently and determine whether it is captured. In the future, we can
replace these calls with an `alloca`, and small allocations can be pushed to shared memory.
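A hypothetical source pattern that needs this treatment, shown for illustration only; the runtime call names are not spelled out in this message, so the lowering is only described in comments:

```c++
// `x` is declared by the main thread but escapes into the parallel region,
// so it cannot live in a private (thread-local) stack slot on the GPU.
void escaping_local() {
  #pragma omp target teams
  {
    int x = 0;           // lowered to a single per-variable allocation call
                         // (conceptually an alloca in shared/global memory)
                         // instead of a field in a global record type
    #pragma omp parallel
    {
      #pragma omp atomic
      x += 1;            // all threads of the team see the same x
    }
    // the matching deallocation call is emitted at the end of the scope
  }
}
```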
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97680
This patch refactors a subset of Clang OpenMP tests, generating checklines using the update_cc_test_checks script. This refactoring facilitates updating the Clang OpenMP code generation codebase by automating test generation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101849
This patch renames replace-function-regex to replace-value-regex to indicate that the existing regex replacement functionality can replace any IR value, not just functions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101934
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading, with corresponding changes in libomptarget: SPMD and non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will later be commonized between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
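For illustration, these are the two source-level shapes that are now funneled through the single entry point; the example names are made up, and the emitted IR and the entry point's exact signature are not shown:

```c++
// Source-level sketch only; both parallel regions below are outlined and
// dispatched through the unified parallel runtime call.
void unified_parallel(int *out, int n) {
  // SPMD-friendly shape: the parallel region spans the whole target region.
  #pragma omp target teams distribute parallel for map(tofrom: out[0:n])
  for (int i = 0; i < n; ++i)
    out[i] = i;

  // Generic-mode shape: sequential code precedes the nested parallel region.
  #pragma omp target map(tofrom: out[0:n])
  {
    out[0] = 42;                 // runs on the main device thread
    #pragma omp parallel for
    for (int i = 1; i < n; ++i)
      out[i] = out[0] + i;
  }
}
```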
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
Many OpenMP Clang tests do not RUN for version 4.5 and the default
version. This second patch in the series handles test cases that
require updates to their CHECK lines in addition to new RUN lines for
the default version. It also involves updating the line numbers of pragmas.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D85150
Internally generated functions must be marked as always_inline in most
cases. This patch marks some extra reduction functions and outlined parallel
functions as always_inline for better performance, but only if the
optimization is requested.
llvm-svn: 361269
After the previous patch, which handles the number of threads in parallel
regions more correctly, parallel regions with num_threads clauses can be
executed in SPMD mode.
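An illustrative example of the pattern this enables (the function name and the thread count are arbitrary):

```c++
// After this change, the combined construct below can still be executed in
// SPMD mode even though the parallel region carries a num_threads clause.
void spmd_with_num_threads(int *a, int n) {
  #pragma omp target teams map(tofrom: a[0:n])
  #pragma omp parallel for num_threads(64)
  for (int i = 0; i < n; ++i)
    a[i] *= 2;
}
```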
llvm-svn: 358445
In generic data-sharing mode we do not need to globalize variables and
parameters of reference or pointer type; they are already placed
in global memory.
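A hedged illustration of the case this covers; the function and variable names are invented for the example:

```c++
// The pointer parameter `p` and the reference `r` already refer to storage
// in global memory, so no globalization record is needed for them; only
// by-value locals that escape into the parallel region still need it.
void no_globalization_needed(int *p, int &r) {
  #pragma omp target teams
  {
    #pragma omp parallel
    {
      #pragma omp atomic
      p[0] += 1;   // uses the pointer directly, no data-sharing record
      #pragma omp atomic
      r += 1;
    }
  }
}
```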
llvm-svn: 332380
Differential Revision: https://reviews.llvm.org/D43852
This patch extends the SPMD implementation to all target constructs and guards this implementation under a new flag.
llvm-svn: 326368
This patch adds support for the SPMD construct 'target parallel' on the
NVPTX device. This involves ignoring the num_threads clause on the device
since the number of threads in this combined construct is already set on
the host through the call to __tgt_target_teams().
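A sketch of the combined construct this change targets (illustrative only; the clause value is arbitrary):

```c++
// The thread count for 'target parallel' is decided on the host via the
// __tgt_target_teams() call mentioned above, so the device-side codegen can
// ignore the num_threads clause here.
void target_parallel_spmd(int *a, int n) {
  #pragma omp target parallel num_threads(128) map(tofrom: a[0:n])
  {
    #pragma omp for
    for (int i = 0; i < n; ++i)
      a[i] += 1;
  }
}
```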
Reviewers: ABataev
Differential Revision: https://reviews.llvm.org/D29083
llvm-svn: 292999