Previous support for device memory allocators used a single free
routine and did not provide the original kind of the allocation. This is
problematic as some of these memory types required different handling.
Previously this was worked around using a map in runtime to record the
original kind of each pointer. Instead, this patch introduces new free
routines similar to the existing allocation routines. This allows us to
avoid a map traversal every time we free a device pointer.
The only interfaces defined by the standard are `omp_target_alloc` and
`omp_target_free`, these do not take a kind as `omp_alloc` does. The
standard dictates the following:
"The omp_target_alloc routine returns a device pointer that references
the device address of a storage location of size bytes. The storage
location is dynamically allocated in the device data environment of the
device specified by device_num."
Which suggests that these routines only allocate the default device
memory for the kind. So this has been changed to reflect this. This
change is somewhat breaking if users were using `omp_target_free` as
previously shown in the tests.
Reviewed By: JonChesterfield, tianshilei1992
Differential Revision: https://reviews.llvm.org/D133053
In OpenMP 5.2, §5.8.6, page 160 line 32-33, when a device pointer
allocated by omp_target_alloc has implicitly been included on a target
construct as a zero-length array, the pointer initialisation should not
find a matching mapped list item, and so should retain its value as a
firstprivate variable. Previously, we would return a null pointer if the
list item was not found. This patch updates the map handling to the
OpenMP 5.2 semantics.
Reviewed By: jdoerfert, ye-luo
Differential Revision: https://reviews.llvm.org/D133447
Some code previous needed the `used` attribute to prevent the GCC
compiler versions 5 and 6 from removing it. This is no longer required
as the minimum supported GCC version for LLVM 16 is >=7.1.0.
Reviewed By: JonChesterfield, vzakhari
Differential Revision: https://reviews.llvm.org/D132976
Previously, the tripcount was set by a push call. We moved away from
this with the new interface that added the tripcount to the kernel
arguments struct, but kept around the old interface for legacy purposes
for the LLVM 15 release. This patch removes the support for the legacy
method.
This removes the support for the old method, but does not break
backwards compatibility. This will result in applications using the old
interface being slower when run on the device.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D132885
The only RTLs that get added to the `UsedRTLs` list have already been
checked is they were valid binaries. We shouldn't need to do this again
when we unregister all the used binaries as they wouldn't have been used
if they were invalid anyway. Let me know if I'm incorrect in this
assumption.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D131443
This patch replaces uses of `dlopen` and `dlsym` with LLVM's support
with `loadPermanentLibrary` and `getSymbolAddress`. This allows us to
remove the explicit dependency on the `dl` libraries in the CMake. This
removes another explicit dependency and solves an issue encountered
while building on Windows platforms. The one downside to this is that
the LLVM library does not currently support `dlclose` functionality, but
this could be added in the future.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D131507
The runtime makes some use of `std::vector` data structures. We should
be able to replace these trivially with `llvm::SmallVector` instead.
This should allow us to avoid heap allocations in the majority of cases
now.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D130927
Will allow plugins to migrate away from using global variables to
manage lifetime, which will fix a segfault discovered in relation to D127432
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D130712
The previous path changed the linker wrapper to embed the offloading
binary format inside the target image instead. This will allow us to
more generically bundle metadata with these images, such as requires
clauses or the target architecture it was compiled for.
I wasn't sure how to handle this best, so I introduced a new type that
replaces the old `__tgt_device_image` struct that we can expand inside
the runtime library. I made the new `__tgt_device_binary` struct pretty
much the same for now. In the future we could change this struct to
pretty much be the `OffloadBinary` class in the future.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D127432
Summary:
We use a static assert to make sure that someone doesn't change the size
of an argument struct without properly updating all the other logic.
This originally only checked the size on a 64-bit system with 8-byte
pointers, causing builds on 32-bit systems to fail. This patch allows
either pointer size to work.
Fixes#56486
This patch moves the old legacy interfaces into `libomptarget` to a
separate file. These do not need to be included anywhere and are simply
provided for backwards compatibility with the ABI. This cleans up the
interface greatly.
Depends on D128817
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D128818
The previous patch added an argument to the `__tgt_target_kernel`
runtime function which includes the tripcount used for the loop clause.
This was originally passed in via the `__kmpc_push_target_tripcount`
function. Now we move this logic to the kernel launch itself and remove
the need for the push function.
Depends on D128816
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D128817
This patch implements a unified kernel entry function that will be
targeted from both teams and non-teams clauses. We introduce a new
interface and make the old functions call in using the new one. A
following patch will include the necessary changes to Clang to call
these new functions instead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D128549
Libomptarget grew out of a project that was originally not in LLVM. As
we develop libomptarget this has led to an increasingly large clash
between the naming conventions used. This patch fixes most of the
variable names that did not confrom to the LLVM standard, that is
`VariableName` for variables and `functionName` for functions.
This patch was primarily done using my editor's linting messages, if
there are any issues I missed arising from the automation let me know.
Reviewed By: saiislam
Differential Revision: https://reviews.llvm.org/D128997
This patch implements omp_get_device_num() in the host and the device.
It uses the already existing getDeviceNum in the device config for the device.
And in the host it uses the omp_get_num_devices().
Two simple tests added
Differential Revision: https://reviews.llvm.org/D128347
Currently, globals on the device will have an infinite reference count
and an unknown name when using `LIBOMPTARGET_INFO` to print the mapping
table. We already store the name of the global in the offloading entry
so we should be able to use it, although there will be no source
location. To do this we need to create a valid `ident_t` string from a
name only.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D124381
This patch adds the `llvm_omp_target_dynamic_shared_alloc` function to
the `omp.h` header file so users can access it by default. Also changed
the name to keep it consistent with the other target allocators. Added
some documentation so users know how to use it. Didn't add the interface
for Fortran since there's no way to test it right now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123246
If we decided to delete a mapping entry we did not act on it right away
but first issued and waited for memory copies. In the meantime some
other thread might reuse the entry. While there was some logic to avoid
colliding on the actual "deletion" part, there were two races happening:
1) The data transfer back of the thread deleting the entry and
the data transfer back of the thread taking over the entry raced.
2) The update to the shadow map happened regardless if the entry was
actually reused by another thread which left the shadow map in a
inconsistent state.
To fix both issues we will now update the shadow map and delete the
entry only if we are sure the thread is responsible for deletion, hence
no other thread took over the entry and reused it. We also wait for a
potential former data transfer from the device to finish before we issue
another one that would race with it.
Fixes https://github.com/llvm/llvm-project/issues/54216
Differential Revision: https://reviews.llvm.org/D121058
This patch solves two problems with the `HostDataToTargetMap` (HDTT
map) which caused races and crashes before:
1) Any access to the HDTT map needs to be exclusive access. This was not
the case for the "dump table" traversals that could collide with
updates by other threads. The new `Accessor` and `ProtectedObject`
wrappers will ensure we have a hard time introducing similar races in
the future. Note that we could allow multiple concurrent
read-accesses but that feature can be added to the `Accessor` API
later.
2) The elements of the HDTT map were `HostDataToTargetTy` objects which
meant that they could be copied/moved/deleted as the map was changed.
However, we sometimes kept pointers to these elements around after we
gave up the map lock which caused potential races again. The new
indirection through `HostDataToTargetMapKeyTy` will allows us to
modify the map while keeping the (interesting part of the) entries
valid. To offset potential cost we duplicate the ordering key of the
entry which avoids an additional indirect lookup.
We should replace more objects with "protected objects" as we go.
Differential Revision: https://reviews.llvm.org/D121057
There are two problems this patch tries to address:
1) We currently free resources in a random order wrt. plugin and
libomptarget destruction. This patch should ensure the CUDA plugin
is less fragile if something during the deinitialization goes wrong.
2) We need to support (hard) pause runtime calls eventually. This patch
allows us to free all associated resources, though we cannot
reinitialize the device yet.
Follow up patch will associate one event pool per device/context.
Differential Revision: https://reviews.llvm.org/D120089
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Differential Revision: https://reviews.llvm.org/D113124
Atomic handling of map clauses was introduced to comply with the OpenMP
standard (see D104418). However, many apps won't need this feature which
can be costly in certain situations. To allow for applications to
opt-out we now introduce the `LIBOMPTARGET_MAP_FORCE_ATOMIC` environment
flag that voids the atomicity guarantee of the standard for map clauses
again, shifting the burden to the user.
This patch also de-duplicates the code that introduces the events used
to enforce atomicity as a cleanup.
Differential Revision: https://reviews.llvm.org/D117627
The async data movement can cause data race if the target supports it.
Details can be found in [1]. This patch tries to fix this problem by attaching
an event to the entry of data mapping table. Here are the details.
For each issued data movement, a new event is generated and returned to `libomptarget`
by calling `createEvent`. The event will be attached to the corresponding mapping table
entry.
For each data mapping lookup, if there is no need for a data movement, the
attached event has to be inserted into the queue to gaurantee that all following
operations in the queue can only be executed if the event is fulfilled.
This design is to avoid synchronization on host side.
Note that we are using CUDA terminolofy here. Similar mechanism is assumped to
be supported by another targets. Even if the target doesn't support it, it can
be easily implemented in the following fall back way:
- `Event` can be any kind of flag that has at least two status, 0 and 1.
- `waitEvent` can directly busy loop if `Event` is still 0.
My local test shows that `bug49334.cpp` can pass.
Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940
Reviewed By: grokos, JonChesterfield, ye-luo
Differential Revision: https://reviews.llvm.org/D104418
This patch adds an external interface to access the dynamic shared
memory buffer in the device runtime. The function introduced is
``llvm_omp_get_dynamic_shared``. This includes a host-side
definition that only returns a null pointer so that it can be used when
host-fallback is enabled without crashing. Support for dynamic shared
memory was also ported to the old device runtime.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110957
Follow on to D110006, related to D110957
Where implementations have diverged this resolves to match the new DeviceRTL
- replaces definitions of this struct in deviceRTL and plugins with include
- changes the dynamic_shared_size field from D110006 to 32 bits
- handles stdint being unavailable in DeviceRTL
- adds a zero initializer for the field to amdgpu
- moves the extern declaration for deviceRTL to target_interface
(omptarget.h is more natural, but doesn't work due to include order
with debug.h)
- Renames the fields everywhere to match the LLVM format used in DeviceRTL
- Makes debug_level uint32_t everywhere (previously sometimes int32_t)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111069
The defintion of OFFLOAD_SUCCESS and OFFLOAD_FAIL used in plugin APIs and libomptarget public APIs are not consistent.
Create __tgt_target_return_t for libomptarget public APIs.
Differential Revision: https://reviews.llvm.org/D109304
This patch implements OpenMP runtime support for an original OpenMP
extension we have developed to support OpenACC: the `ompx_hold` map
type modifier. The previous patch in this series, D106509, implements
Clang support and documents the new functionality in detail.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D106510
This patch adds the support form event related interfaces, which will be used
later to fix data race. See D104418 for more details.
Reviewed By: jdoerfert, ye-luo
Differential Revision: https://reviews.llvm.org/D108528
This patch introduces the `llvm-omp-device-info` tool, which uses the
omptarget library and interface to query the device info from all the
available devices as seen by OpenMP. This is inspired by PGI's `pgaccelinfo`
Since omptarget usually requires a description structure with executable
kernels, I split the initialization of the RTLs and Devices to be able to
initialize all possible devices and query each of them.
This revision relies on the patch that introduces the print device info.
A limitation is that the order in which the devices are initialized, and the
corresponding device ID is not necesarily the one seen by OpenMP.
The changes are as follows:
1. Separate the RTL initialization that was performed in `RegisterLib` to its own `initRTLonce` function
2. Create an `initAllRTLs` method that initializes all available RTLs at runtime
3. Created the `llvm-deviceinfo.cpp` tool that uses `omptarget` to query each device and prints its information.
Example Output:
```
Device (0):
print_device_info not implemented
Device (1):
print_device_info not implemented
Device (2):
print_device_info not implemented
Device (3):
print_device_info not implemented
Device (4):
CUDA Driver Version: 11000
CUDA Device Number: 0
Device Name: Quadro P1000
Global Memory Size: 4236312576 bytes
Number of Multiprocessors: 5
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536 bytes
Max Shared Memory per Block: 49152 bytes
Registers per Block: 65536
Warp Size: 32 Threads
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647 bytes
Texture Alignment: 512 bytes
Clock Rate: 1480500 kHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: DEFAULT
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2505000 kHz
Memory Bus Width: 128 bits
L2 Cache Size: 1048576 bytes
Max Threads Per SMP: 2048
Async Engines: Yes (2)
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device Boars: No
Compute Capabilities: 61
```
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D106752
This patch introduces a function in the device's plugin to print the
device information. This patch relates to another patch that introduces
a CLI tool to obtain the device information from the omplibrary directly.
It is inspired by PGI's pgaccelinfo.
The modifications are as follows:
1. Introduce the optional `void __tgt_rtl_print_device_info(RTLdevID)` function into the RTL.
2. Introduce the `bool __tgt_print_device_info(devID)` function into `omptarget` interface. Returns false if the RTL is not implemented
3. Added `bool printDeviceInfo(RTLDevID)` to the `DeviceTy`
4. Implement the `__tgt_rtl_print_device_info` for CUDA. Added additional CUDA Runtime calls.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106751
Revision of D102858. Raise dlwrap arity argument to template argument
so the correct value is given in the error message. E.g. '2 == 1' instead of
'2 == trait<>::nargs'.
Arity higher than it should be:
Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: error:
static_assert failed due to requirement '2 == trait<cudaError_enum (*)(unsigned int)>::nargs'
"Arity Error"
DLWRAP_INTERNAL(cuInit, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~
...
$/include/dlwrap.h:166:3: note: expanded from macro
'DLWRAP_COMMON'
static_assert(ARITY == trait<decltype(&SYMBOL)>::nargs, "Arity Error"); \
```
After diff
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
```
$/include/dlwrap.h:131:3: error: static_assert failed due to
requirement '2UL == 1UL' "Arity Error"
static_assert(Requested == Required, "Arity Error");
^ ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
instantiation of function template specialization 'dlwrap::verboseAssert<2UL, 1UL>' requested
here
DLWRAP_INTERNAL(cuInit, 2);
```
Arity lower than it should be:
Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:131:10: error: no
matching function for call to 'dlwrap_cuInit'
return dlwrap_cuInit(X);
^~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: candidate
function not viable: requires 0 arguments, but 1 was provided
DLWRAP_INTERNAL(cuInit, 0);
```
After diff
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
```
$/include/dlwrap.h:131:3: error: static_assert failed due to
requirement '0UL == 1UL' "Arity Error"
static_assert(Requested == Required, "Arity Error");
^ ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
instantiation of function template specialization 'dlwrap::verboseAssert<0UL, 1UL>' requested
here
DLWRAP_INTERNAL(cuInit, 0);
```
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106543
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
[libomptarget] Improve dlwrap compile time error diagnostic
The dlwrap interface takes an explict arity, e.g. DLWRAP(cuAlloc, 2);
This probably can't be eliminated as it controls the argument list of an
external symbol, not an inline header function. If the arity given is too
big, the error from clang referring to the line is in the middle of
implementation details.
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1277:7: error: static_assert failed
due to requirement '0UL < tuple_size<std::tuple<>>::value' "tuple index is in range"
static_assert(__i < tuple_size<tuple<>>::value,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:93:27 ...
/home/amd/llvm-project/openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.cpp:34:1: note: in
instantiation of template class 'dlwrap::trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::arg<2>' requested here
DLWRAP(cuMemAlloc, 3);
^
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:51:31: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:166:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:133:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:186:37: ...
If the arity is too small, the diagnostic is better:
cuda/dynamic_cuda/cuda.cpp:34:1: error: too few
arguments to function call, expected 2, have 1
DLWRAP(cuMemAlloc, 1);
This patch changes the diagnostic to:
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '1 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 1);
or
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '3 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 3);
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102858
Summary:
This patch improves the implementation of D100774 by replacing the global
variable introduced with a function that returns a reference to an internal
one. This removes the need to define the variable in every plugin that uses it.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D101102
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this will require an extern function, but will eventually be
added into an auxilliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
This patch adds the infrastructure for allocator support for target memory.
Three allocators are introduced for device, host and shared memory.
The corresponding API functions have the llvm_ prefix temporarily, until they become part of the OpenMP standard.
Differential Revision: https://reviews.llvm.org/D97883
Summary:
The changes introduced in D87946 changed the API for libomptarget
functions. `__kmpc_push_target_tripcount` was a function in Clang 11.x
but was not given a backward-compatible interface. This change will
require people using Clang 13.x or 12.x to recompile their offloading
programs.
Reviewed By: jdoerfert cchen
Differential Revision: https://reviews.llvm.org/D98358