This patch enables running the new driver tests for AMDGPU. Previously
this was disabled because some tests failed. This was only because the
new driver tests hadn't been listed as unsupported or expected to fail.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D119240
This patch completely removes the old OpenMP device runtime. Previously,
the old runtime had the prefix `libomptarget-new-` and the old runtime
was simply called `libomptarget-`. This patch makes the formerly new
runtime the only runtime available. The entire project has been deleted,
and all references to the `libomptarget-new` runtime has been replaced
with `libomptarget-`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D118934
This patch adds a new target to the tests to run using the new driver as
the method for generating offloading code.
Depends on D116541
Differential Revision: https://reviews.llvm.org/D118637
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674
The old runtime is not tested by CI. Disable the build prior to the llvm-14 branch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118268
This patch makes some minor adjustments to `ResourcePool`:
- Don't initialize the resources if `Size` is 0 which can avoid assertion.
- Add a new interface function `clear` to release all hold resources.
- If initial size is 0, resize to 1 when the first request is encountered.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D116340
Currently CUDA streams are managed by `StreamManagerTy`. It works very well. Now
we have the need that some resources, such as CUDA stream and event, will be
hold by `libomptarget`. It is always good to buffer those resources. What's more
important, given the way that `libomptarget` and plugins are connected, we cannot
make sure whether plugins are still alive when `libomptarget` is destroyed. That
leads to an issue that those resouces hold by `libomptarget` might not be
released correctly. As a result, we need an unified management of all the resources
that can be shared between `libomptarget` and plugins.
`ResourcePoolTy` is designed to manage the type of resource for one device.
It has to work with an allocator which is supposed to provide `create` and
`destroy`. In this way, when the plugin is destroyed, we can make sure that
all resources allocated from native runtime library will be released correctly,
no matter whether `libomptarget` starts its destroy.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D111954
I missed the async info parameter in the first version of this API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115887
This patch extends the AMDGPU plugin for OpenMP target offloading from using a single HSA queue to multiple queues (four in this patch) per device. This enables concurrent threads to concurrently submit kernel launches to the same GPU.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115771
and synchronous kernel launch implementations into a single
synchronous version. This patch prepares the plugin for asynchronous
implementation by:
Privatizing actual kernel launch code (valid in both cases) into
an anonymous namespace base function (submitted at D115267)
- Separating the control flow path of asynchronous and synchronous
kernel launch functions** (this diff)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115273
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
At present, amdgpu plugin merges both asynchronous and synchronous kernel launch implementations into a single synchronous version.
This patch prepares the plugin for asynchronous implementation by:
- Privatizing actual kernel launch code (valid in both cases) into an anonymous namespace base function
Actual separation of kernel launch code (async vs sync) is a following patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115267
amdgpu plugin depends on libhsa-runtime64 library. Add runpath in case it is not on the LD_LIBRARY_PATH.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115198
This was trying to figure out the build path for amdgpu-arch, and
making assumptions about where it is which were not working on my
system. Whether a standalone build or not, we should have a proper
imported target to get the location from.
OpenMP (compiler) does not currently request any implicit kernel
arguments. OpenMP (runtime) allocates and initialises a reasonable guess at
the implicit kernel arguments anyway.
This change makes the plugin check the number of explicit arguments, instead
of all arguments, and puts the pointer to hostcall buffer in both the current
location and at the offset expected when implicit arguments are added to the
metadata by D113538.
This is intended to keep things running while fixing the oversight in the
compiler (in D113538). Once that patch lands, and a following one marks
openmp kernels that use printf such that the backend emits an args element
with the right type (instead of hidden_node), the over-allocation can be
removed and the hardcoded 8*e+3 offset replaced with one read from the
.offset of the corresponding metadata element.
Reviewed By: estewart08
Differential Revision: https://reviews.llvm.org/D114274
Removes a +x/-x pair on the only store/load of a variable
and deletes some nearby dead code. Also reduces the size of the implicit
struct to reflect the code currently emitted by clang.
Differential Revision: https://reviews.llvm.org/D114270
Passes same tests as the current deviceRTL. Includes cmake change from D111987.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112227
Implemented by patching python config instead of modifying all
the tests so that -generic and XFAIL work as usual. Expectation is for
this to be reverted once the old runtime is deleted.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D112225
The plugin currently uses a macro to check if this is a debug built
before assigning the debug kind variable to the device environment
struct. This is being deprecated because the new device runtime does not
maintain separate debug builds and should always be availible.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112083
Follow on to D110006, related to D110957
Where implementations have diverged this resolves to match the new DeviceRTL
- replaces definitions of this struct in deviceRTL and plugins with include
- changes the dynamic_shared_size field from D110006 to 32 bits
- handles stdint being unavailable in DeviceRTL
- adds a zero initializer for the field to amdgpu
- moves the extern declaration for deviceRTL to target_interface
(omptarget.h is more natural, but doesn't work due to include order
with debug.h)
- Renames the fields everywhere to match the LLVM format used in DeviceRTL
- Makes debug_level uint32_t everywhere (previously sometimes int32_t)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111069
The hand-rolled linking logic in elf_common does not account for
the possibility of using LLVM dylib rather than a dozen static
libraries. Since it does not seem to be easily convertible
to add_llvm_library, just hand-roll support for LLVM_LINK_LLVM_DYLIB.
This is necessary to support stand-alone builds against installed LLVM.
Differential Revision: https://reviews.llvm.org/D111038
Use enum for execution mode.
This is partly a port from ROCm and partly a port from D110029. Attempted to
make the same choices as ROCm as far as comments etc go to reduce the merge
conflicts.
There is some cleanup warranted here - in particular I like the cuda patch
factoring out the comparisons into named variables - but I'd like to leave
that for a follow up patch, keeping this one minimal.
Reviewed By: carlo.bertolli
Differential Revision: https://reviews.llvm.org/D110845
Fixes: SWDEV-275232 (With contributions from Ammar Elwazir, Laurent Morichetti, and Tony Tye)
The current code is racy. After the packet is submitted, the GPU will increment the read index. If this wraps around before the memory is read from it'll refer to a signal from an unrelated packet. Change avoids reading from the packet post-submission.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110679
Store queues in unique_ptr so they are destroyed when the global DeviceInfo is. Currently they leak which raises an assert in debug builds of hsa.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109511
The execution mode of a kernel is stored in a global variable, whose value means:
- 0 - SPMD mode
- 1 - indicates generic mode
- 2 - SPMD mode execution with generic mode semantics
We are going to add support for SIMD execution mode. It will be come with another
execution mode, such as SIMD-generic mode. As a result, this value-based indicator
is not flexible.
This patch changes to bitset based solution to encode execution mode. Each
position is:
[0] - generic mode
[1] - SPMD mode
[2] - SIMD mode (will be added later)
In this way, `0x1` is generic mode, `0x2` is SPMD mode, and `0x3` is SPMD mode
execution with generic mode semantics. In the future after we add the support for
SIMD mode, `0b1xx` will be in SIMD mode.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110029
This patch adds support for using dynamic shared memory in the new
device runtime. The new function `__kmpc_get_dynamic_shared` will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.
In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110006
This patch adds fields for the device number and number of devices into
the device environment struct and debugging values.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110004
The hsa library must be initialized before any calls into it and
destructed after the last call into it. There have been a number of bugs in
this area related to member variables which would like to use raii to manage
resources acquired from hsa.
This patch moves the init/shutdown of hsa into a class, such that when used as
the first member variable (could be a base), the lifetime of other member
variables are reliably scoped within it. This will allow other classes to use
raii reliably when used as member variables within the global.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109512
Use the same debug print as the rest of libomptarget plugins with
the same environment control. Also drop the max queue size debugging hook as
I don't believe it is still in use, can bring it back near the rest of the env
handling in rtl.cpp if someone objects.
That makes most of rt.h and all of utils.cpp unused. Clean that up and simplify
control flow in a couple of places.
Behaviour change is that debug prints that used to use the old environment
variable now use the new one and print in slightly different format, and the
removal of the max queue size variable.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D108784
`CU_EVENT_DEFAULT` is defined in CUDA header. It should be added to
`openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.h` for CUDA free build.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108878
This patch adds the support form event related interfaces, which will be used
later to fix data race. See D104418 for more details.
Reviewed By: jdoerfert, ye-luo
Differential Revision: https://reviews.llvm.org/D108528
Lets wavefront size be 32 for amdgpu openmp, as well as 64.
Fixes up as little as possible to pass that through the libraries. This change
is end to end, as opposed to updating clang/devicertl/plugin separately. It can
be broken up for review/commit if preferred. Posting as-is so that others with
a gfx10 can try it out. It works roughly as well as gfx9 for me, but there are
probably bugs remaining as well as the todo: for letting grid values vary more.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108708
Move most debug printing in rtl.cpp behind DP() macro
Adjust the print output for gpu arch mismatch when the architectures match
Convert an assert into graceful failure
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108562
With uses of g_atl_machine gone, a significant portion of dead
code has been removed.
This patch depends on D104691 and D104695.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104696
[nfc] Replaces enum indices into an array with a struct. Named the
fields to match the enum, leaves memory layout and initialization unchanged.
Motivation is to later safely remove dead fields and replace redundant ones
with (compile time) computation. It should also be possible to factor some
common fields into a base and introduce a gfx10 amdgpu instance with less
duplication than the arrays of integers require.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108339
Currently, the runtime returns an error when the `exec_mode` global is
not present. The expected behvaiour is that the region will default to
Generic. This prevents global constructors from being called because
they do not contain execution mode globals.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108255
On FreeBSD, the `environ` symbol is undefined at link time for shared
libraries, but resolved by the dynamic linker at runtime. Therefore,
allow the symbol to be undefined when creating a shared library, by
using the `--allow-shlib-undefined` linker flag, instead of `-z defs`
(a.k.a `--no-undefined`).
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D107698
On FreeBSD, the system `<libelf.h>` already declares `struct Elf_Note`
indirectly (via `<sys/elf_common.h>`). This results in compile errors
when building the libomptarget amdgpu plugin. Avoid redeclaring `struct
Elf_Note` on FreeBSD to fix the errors.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D107661
This patch fixes the error reported in D106751. When there is no CUDA SDK
installed in the system, the build fails due to missing `CU_DEVICE_ATTRIBUTE`
variables.
Using @zsrkmyn sugested fix
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106933
This patch introduces the `llvm-omp-device-info` tool, which uses the
omptarget library and interface to query the device info from all the
available devices as seen by OpenMP. This is inspired by PGI's `pgaccelinfo`
Since omptarget usually requires a description structure with executable
kernels, I split the initialization of the RTLs and Devices to be able to
initialize all possible devices and query each of them.
This revision relies on the patch that introduces the print device info.
A limitation is that the order in which the devices are initialized, and the
corresponding device ID is not necesarily the one seen by OpenMP.
The changes are as follows:
1. Separate the RTL initialization that was performed in `RegisterLib` to its own `initRTLonce` function
2. Create an `initAllRTLs` method that initializes all available RTLs at runtime
3. Created the `llvm-deviceinfo.cpp` tool that uses `omptarget` to query each device and prints its information.
Example Output:
```
Device (0):
print_device_info not implemented
Device (1):
print_device_info not implemented
Device (2):
print_device_info not implemented
Device (3):
print_device_info not implemented
Device (4):
CUDA Driver Version: 11000
CUDA Device Number: 0
Device Name: Quadro P1000
Global Memory Size: 4236312576 bytes
Number of Multiprocessors: 5
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536 bytes
Max Shared Memory per Block: 49152 bytes
Registers per Block: 65536
Warp Size: 32 Threads
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647 bytes
Texture Alignment: 512 bytes
Clock Rate: 1480500 kHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: DEFAULT
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2505000 kHz
Memory Bus Width: 128 bits
L2 Cache Size: 1048576 bytes
Max Threads Per SMP: 2048
Async Engines: Yes (2)
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device Boars: No
Compute Capabilities: 61
```
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D106752
This patch introduces a function in the device's plugin to print the
device information. This patch relates to another patch that introduces
a CLI tool to obtain the device information from the omplibrary directly.
It is inspired by PGI's pgaccelinfo.
The modifications are as follows:
1. Introduce the optional `void __tgt_rtl_print_device_info(RTLdevID)` function into the RTL.
2. Introduce the `bool __tgt_print_device_info(devID)` function into `omptarget` interface. Returns false if the RTL is not implemented
3. Added `bool printDeviceInfo(RTLDevID)` to the `DeviceTy`
4. Implement the `__tgt_rtl_print_device_info` for CUDA. Added additional CUDA Runtime calls.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106751
Default to building the amdgpu plugin to use dlopen when hsa is
not found instead of disabling it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106600
If hsa_init fails, subsequent calls into hsa are not safe. Except for
hsa_init, but we don't retry on failure.
This patch:
- deletes a print that called into hsa to ask why it can't call into hsa
- drops a merge conflict block next to that print
- reliably initializes number of devices to zero
- skips the plugin destructor contents if the constructor failed to init hsa
Tested by making hsa_init return error, and by forcing the dynamic library
use which was then deleted from disk. Before this patch, both segv. After it,
friendly message about offloading being unavailable.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106774
Default to building the amdgpu plugin to use dlopen when hsa is
not found instead of disabling it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106600
This patch adds support for two environment variables to configure the device.
``LIBOMPTARGET_STACK_SIZE`` sets the amount of memory in bytes that each thread
has for its stack. ``LIBOMPTARGET_HEAP_SIZE`` sets the amount of heap memory
that can be allocated using malloc / free on the device.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106627
AMDGPU can assume Elf64 so doesn't need to abstract over Elf32
Drop a few other unused headers at the same time. Now only llvm elf
and libelf are used by the plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106579
AMDGPU plugin equivalent of D95155, build without HSA installed locally
Compiles a new file, plugins/amdgpu/dynamic_hsa/hsa.cpp, to an object file that
exposes the same symbols that the plugin presently uses from hsa. The object
file contains dlopen of hsa and cached dlsym calls. Also provides header files
corresponding to the subset that is used.
This is behind a feature flag, LIBOMPTARGET_FORCE_DLOPEN_LIBHSA, default off.
That allows developers to build against the dlopen/dlsym implementation, e.g.
while testing this mode.
Enabling by default will cause this plugin to build on a wider variety of
machines than it does at present so may break some CI builds. That risk can
be minimised by reviewing the header dependencies of the library and ensuring
it doesn't use any libraries that are not already used by libomptarget.
Separating the implementation from enabling by default in case the latter needs
to be rolled back after wider CI results.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106559
Summary:
Fixes some warning given for uninitialized block counts if the exection mode is
not recognized. This shouldn't happen in practice because the execution mode is
checked when it's read from the device.
This class is instantiated once in rtl.cpp before hsa_init is
called. The hsa_signal_create call therefore fails leaving the pool empty.
This signal pool is a legacy from ATMI where it was constructed after hsa_init.
Moving the state into the rtl.cpp global class disabled the initial populating
of the pool without noticeably changing performance. Just rechecked with a fix
that allocates the signals after hsa_init and that also doesn't noticeably
change performance.
This patch therefore drops the initialisation. Only change from main is to
drop a DEBUG_PRINT statement that would say the pool initial size is zero.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106515
Qualified kernels can be transformed from generic-mode to SPMD mode using an
optimization in OpenMPOpt. This patch introduces a new execution mode to
indicate kernels that have been transformed from generic-mode to SPMD-mode.
These kernels have SPMD-mode execution, but need generic-mode semantics for
scheduling the blocks and threads. Without this far too few blocks will be
scheduled for a generic region as SPMD mode expects the trip count to be
divided by the number of threads.
Reviewed By: ggeorgakoudis
Differential Revision: https://reviews.llvm.org/D106460
Create a hsa_api.h header that includes the ROCr headers in use
Drop some unused headers and _cplusplus macros
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106455
[libomptarget][nfc] Group environment variables, drop accesses to DeviceInfo global
Folds some duplicates logic into a helper function, passes the new environment
struct into getLaunchVals which no longer reads the DeviceInfo global.
Implemented on top of D105237
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D105239
D97883 introduced a compile-time error in the experimental remote offloading
libomptarget plugin, this patch fixes it and resolves a number of
inconsistencies in the plugin as well:
1. Non-functional Asynchronous API
2. Unnecessarily verbose debug printing
3. Misc. code clean ups
This is not intended to make any functional changes to the plugin.
Differential Revision: https://reviews.llvm.org/D105325
This reverts commit 2240b41ee4.
A value of 0 for KernDescVal WG_Size implies it is unknown, so it should be
set to the default. The above change was made without this assumption.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105250
A step towards making this function adequately self contained that it
can be tested easily. No functional change intended here, left variable
names unchanged.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D105229
Removes stdarg header, drops uses of iostream, fix some format string errors.
Also changes a C style struct to C++ style to avoid a warning from clang/
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D104923
This patch is related to https://reviews.llvm.org/D98832. Based on discussions there, I decided to separate out the teams default as this patch. This change is to increase the number of teams per computation unit so as to provide more wavefronts for hiding latency. This change improves performance for some programs, including 20-50% for some Stream benchmarks.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D99003
When max flat workgroup size is not specified, it is set to the default
workgroup size. This prevents kernel launch with a workgroup size larger
than the default. The fix is to ignore a size of 0 and treat it as
unspecified.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105073
The logic is almost similar to that of system.cpp with one change that
instead of adding all the memory pools to a device struct it only
keeps a single pool. The existing approach also always allocated memory on
the first HSA pool found for a GPU.
This depends on D104691. The goal of this series of patches is to remove
_atl_machine global. The next patch will drop g_atl_machine entirely.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104695