Use the same debug print as the rest of the libomptarget plugins, with
the same environment control. Also drop the max queue size debugging hook, as
I don't believe it is still in use. It can be brought back near the rest of the
env handling in rtl.cpp if someone objects.
That makes most of rt.h and all of utils.cpp unused. Clean that up and simplify
control flow in a couple of places.
The behaviour change is that debug prints that used to use the old environment
variable now use the new one and print in a slightly different format, plus the
removal of the max queue size variable.
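For readers unfamiliar with the pattern, here is a minimal sketch of an environment-controlled debug print. It is not the code from this patch: the helper names, the `SKETCH_DP` macro, the output prefix and the variable name `LIBOMPTARGET_DEBUG` are all assumptions made for illustration.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: turn the raw environment string into a debug level.
// An unset variable means debug output is off.
inline int parseDebugLevel(const char *Value) {
  return Value ? std::atoi(Value) : 0;
}

// Read the environment once and cache the result.
// The variable name is an assumption made for this sketch.
inline int debugLevel() {
  static const int Level = parseDebugLevel(std::getenv("LIBOMPTARGET_DEBUG"));
  return Level;
}

// Hypothetical macro: print to stderr only when debugging is enabled.
#define SKETCH_DP(...)                                                         \
  do {                                                                         \
    if (debugLevel() > 0) {                                                    \
      std::fprintf(stderr, "Sketch plugin --> ");                              \
      std::fprintf(stderr, __VA_ARGS__);                                       \
    }                                                                          \
  } while (0)
```

A call such as `SKETCH_DP("found %d devices\n", N);` is then silent unless the variable is set to a positive value.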
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D108784
Lets wavefront size be 32 for amdgpu openmp, as well as 64.
Fixes up as little as possible to pass that through the libraries. This change
is end to end, as opposed to updating clang/devicertl/plugin separately. It can
be broken up for review/commit if preferred. Posting as-is so that others with
a gfx10 can try it out. It works roughly as well as gfx9 for me, but there are
probably bugs remaining as well as the todo: for letting grid values vary more.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108708
Move most debug printing in rtl.cpp behind DP() macro
Adjust the print output for gpu arch mismatch when the architectures match
Convert an assert into graceful failure
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108562
With uses of g_atl_machine gone, a significant portion of dead
code has been removed.
This patch depends on D104691 and D104695.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104696
[nfc] Replaces enum indices into an array with a struct. Names the
fields to match the enum and leaves memory layout and initialization unchanged.
Motivation is to later safely remove dead fields and replace redundant ones
with (compile time) computation. It should also be possible to factor some
common fields into a base and introduce a gfx10 amdgpu instance with less
duplication than the arrays of integers require.
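A minimal sketch of the refactoring shape described above, assuming a three-entry table. The enum, field names and numbers are hypothetical and are not the values used by the plugin. The `static_assert`s check that the struct has the same layout as the array it replaces.

```cpp
#include <cstddef>

// Before: enum values index into a flat array of integers.
// All names and numbers here are hypothetical.
enum GridIndex { Idx_WarpSize, Idx_MaxThreads, Idx_MaxTeams, Idx_Count };
constexpr int GridArray[Idx_Count] = {64, 1024, 128};

// After: one named field per former index, declared in the same order.
struct GridValues {
  int WarpSize;
  int MaxThreads;
  int MaxTeams;
};
constexpr GridValues GridStruct = {64, 1024, 128};

// The struct occupies the same bytes as the array it replaces.
static_assert(sizeof(GridValues) == sizeof(GridArray), "size unchanged");
static_assert(offsetof(GridValues, MaxTeams) == Idx_MaxTeams * sizeof(int),
              "field offsets match the old indices");
```

With named fields, deleting a dead entry becomes a compile error at every remaining use instead of a silent shift of the array indices.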
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108339
On FreeBSD, the `environ` symbol is undefined at link time for shared
libraries, but resolved by the dynamic linker at runtime. Therefore,
allow the symbol to be undefined when creating a shared library, by
using the `--allow-shlib-undefined` linker flag, instead of `-z defs`
(a.k.a. `--no-undefined`).
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D107698
On FreeBSD, the system `<libelf.h>` already declares `struct Elf_Note`
indirectly (via `<sys/elf_common.h>`). This results in compile errors
when building the libomptarget amdgpu plugin. Avoid redeclaring `struct
Elf_Note` on FreeBSD to fix the errors.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D107661
Default to building the amdgpu plugin to use dlopen when hsa is
not found instead of disabling it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106600
If hsa_init fails, subsequent calls into hsa are not safe, except for
hsa_init itself, which we don't retry on failure.
This patch:
- deletes a print that called into hsa to ask why it can't call into hsa
- drops a merge conflict block next to that print
- reliably initializes number of devices to zero
- skips the plugin destructor contents if the constructor failed to init hsa
Tested by making hsa_init return an error, and by forcing use of the dynamic
library, which was then deleted from disk. Before this patch, both cases segv.
After it, both print a friendly message about offloading being unavailable.
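The constructor/destructor handling listed above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: `PluginState`, `Status` and the injected `InitFn` are hypothetical stand-ins so that an init failure can be simulated without the real runtime.

```cpp
#include <cstdio>

// Hypothetical status type standing in for the runtime's own.
enum class Status { Success, Error };

class PluginState {
public:
  // InitFn stands in for the runtime's init call so failure can be simulated.
  explicit PluginState(Status (*InitFn)()) {
    Initialized = (InitFn() == Status::Success);
    if (!Initialized) {
      std::fprintf(stderr, "Runtime init failed, offloading unavailable\n");
      return; // NumberOfDevices stays at its zero default
    }
    NumberOfDevices = 1; // placeholder for a real device query
  }

  ~PluginState() {
    if (!Initialized)
      return; // nothing was set up, so make no further runtime calls
    // Teardown that calls into the runtime would go here.
  }

  int numberOfDevices() const { return NumberOfDevices; }
  bool initialized() const { return Initialized; }

private:
  int NumberOfDevices = 0; // reliably zero even when init fails
  bool Initialized = false;
};
```

Because the device count defaults to zero, callers that iterate over devices simply do nothing after a failed init.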
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106774
AMDGPU can assume Elf64 so doesn't need to abstract over Elf32
Drop a few other unused headers at the same time. Now only llvm elf
and libelf are used by the plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106579
AMDGPU plugin equivalent of D95155, build without HSA installed locally
Compiles a new file, plugins/amdgpu/dynamic_hsa/hsa.cpp, to an object file that
exposes the same symbols that the plugin presently uses from hsa. The object
file contains dlopen of hsa and cached dlsym calls. Also provides header files
corresponding to the subset that is used.
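The dlopen plus cached dlsym approach can be sketched roughly like this. The wrapper class, the library name and the entry point `sketch_runtime_init` are hypothetical and chosen so that the lookup fails on any machine. The real file forwards the hsa symbols the plugin uses and is not reproduced here.

```cpp
#include <dlfcn.h>

// Hypothetical wrapper: open a shared library at runtime and look up symbols.
class DynamicLibrary {
public:
  explicit DynamicLibrary(const char *Path)
      : Handle(dlopen(Path, RTLD_NOW)) {}

  bool isLoaded() const { return Handle != nullptr; }

  // Returns nullptr when the library or the symbol is missing.
  void *lookup(const char *Name) const {
    return Handle ? dlsym(Handle, Name) : nullptr;
  }

private:
  void *Handle; // intentionally never closed; lives for the whole process
};

// Forwarder with the shape of a library entry point. Library and symbol
// names are hypothetical. The lookup runs once and the result is cached.
int sketch_runtime_init() {
  static const DynamicLibrary Lib("libsketch-runtime-does-not-exist.so");
  using FnTy = int (*)();
  static const FnTy Fn =
      reinterpret_cast<FnTy>(Lib.lookup("sketch_runtime_init"));
  return Fn ? Fn() : 1; // 1 is this sketch's "runtime unavailable" code
}
```

Callers link against the forwarder instead of the library itself, so the build no longer needs the library installed and a missing library turns into an error code at runtime. On older glibc versions this needs `-ldl` at link time.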
This is behind a feature flag, LIBOMPTARGET_FORCE_DLOPEN_LIBHSA, default off.
That allows developers to build against the dlopen/dlsym implementation, e.g.
while testing this mode.
Enabling by default will cause this plugin to build on a wider variety of
machines than it does at present so may break some CI builds. That risk can
be minimised by reviewing the header dependencies of the library and ensuring
it doesn't use any libraries that are not already used by libomptarget.
Separating the implementation from enabling by default in case the latter needs
to be rolled back after wider CI results.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106559
Summary:
Fixes some warnings given for uninitialized block counts if the execution mode is
not recognized. This shouldn't happen in practice because the execution mode is
checked when it's read from the device.
This class is instantiated once in rtl.cpp before hsa_init is
called. The hsa_signal_create call therefore fails leaving the pool empty.
This signal pool is a legacy from ATMI where it was constructed after hsa_init.
Moving the state into the rtl.cpp global class disabled the initial populating
of the pool without noticeably changing performance. Just rechecked with a fix
that allocates the signals after hsa_init and that also doesn't noticeably
change performance.
This patch therefore drops the initialisation. Only change from main is to
drop a DEBUG_PRINT statement that would say the pool initial size is zero.
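A minimal sketch of a pool that starts empty, assuming signals are created on demand when the free list has nothing to hand out. `SignalPool` and the integer `Signal` handle are hypothetical stand-ins, and the counter stands in for the real signal creation call.

```cpp
#include <cstddef>
#include <queue>

using Signal = int; // hypothetical stand-in for a runtime signal handle

class SignalPool {
public:
  // No signals are created in the constructor, so the pool can be built
  // before the runtime has been initialised.
  Signal pop() {
    if (Free.empty())
      return NextId++; // stands in for creating a fresh signal on demand
    Signal S = Free.front();
    Free.pop();
    return S;
  }

  // Return a signal to the pool for later reuse.
  void push(Signal S) { Free.push(S); }

  std::size_t size() const { return Free.size(); }

private:
  std::queue<Signal> Free;
  Signal NextId = 0;
};
```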
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106515
Qualified kernels can be transformed from generic-mode to SPMD mode using an
optimization in OpenMPOpt. This patch introduces a new execution mode to
indicate kernels that have been transformed from generic-mode to SPMD-mode.
These kernels have SPMD-mode execution, but need generic-mode semantics for
scheduling the blocks and threads. Without this far too few blocks will be
scheduled for a generic region as SPMD mode expects the trip count to be
divided by the number of threads.
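To make the block count difference concrete, here is a sketch under simplified assumptions: SPMD mode divides the trip count by the threads per block, while generic semantics schedule one block per iteration. The enum and `blocksForLoop` are hypothetical names, and the real launch logic applies further limits that are omitted here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the three modes discussed in this patch.
enum class ExecutionMode { SPMD, Generic, GenericSPMD };

std::uint64_t blocksForLoop(ExecutionMode Mode, std::uint64_t TripCount,
                            std::uint64_t ThreadsPerBlock) {
  assert(ThreadsPerBlock > 0 && "need at least one thread per block");
  switch (Mode) {
  case ExecutionMode::SPMD:
    // Every thread takes an iteration, so round up trip count / threads.
    return (TripCount + ThreadsPerBlock - 1) / ThreadsPerBlock;
  case ExecutionMode::Generic:
  case ExecutionMode::GenericSPMD:
    // Generic scheduling semantics: one block per iteration.
    return TripCount;
  }
  return 1; // unrecognized mode: fall back to a defined value
}
```

With a trip count of 1000 and 256 threads per block, SPMD scheduling gives 4 blocks while the transformed kernel still needs 1000.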
Reviewed By: ggeorgakoudis
Differential Revision: https://reviews.llvm.org/D106460
Create a hsa_api.h header that includes the ROCr headers in use
Drop some unused headers and _cplusplus macros
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106455
[libomptarget][nfc] Group environment variables, drop accesses to DeviceInfo global
Folds some duplicated logic into a helper function, passes the new environment
struct into getLaunchVals which no longer reads the DeviceInfo global.
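A rough sketch of the pattern, assuming a two-field struct. The field set, the helper names and `chooseNumTeams` are hypothetical and only illustrate reading the environment once and passing the result in as an argument rather than consulting a global.

```cpp
#include <cstdlib>

// Hypothetical grouping of environment-derived settings; -1 means "unset".
struct EnvironmentVariables {
  int NumTeams;
  int TeamThreadLimit;
};

// Shared helper replacing duplicated "getenv then atoi" logic.
inline int parseEnvInt(const char *Value, int Default) {
  return Value ? std::atoi(Value) : Default;
}

// Read everything in one place.
inline EnvironmentVariables readEnvironment() {
  return {parseEnvInt(std::getenv("OMP_NUM_TEAMS"), -1),
          parseEnvInt(std::getenv("OMP_TEAMS_THREAD_LIMIT"), -1)};
}

// The consumer receives the struct as a parameter, so it can be called
// with any values without touching the process environment.
int chooseNumTeams(const EnvironmentVariables &Env, int DefaultTeams) {
  return Env.NumTeams > 0 ? Env.NumTeams : DefaultTeams;
}
```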
Implemented on top of D105237
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D105239
This reverts commit 2240b41ee4.
A value of 0 for KernDescVal WG_Size implies it is unknown, so it should be
set to the default. The above change was made without this assumption.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105250
A step towards making this function sufficiently self-contained that it
can be tested easily. No functional change intended here, left variable
names unchanged.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D105229
Removes stdarg header, drops uses of iostream, fixes some format string errors.
Also changes a C style struct to C++ style to avoid a warning from clang.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D104923
This patch is related to https://reviews.llvm.org/D98832. Based on discussions there, I decided to separate out the teams default as this patch. This change is to increase the number of teams per computation unit so as to provide more wavefronts for hiding latency. This change improves performance for some programs, including 20-50% for some Stream benchmarks.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D99003
When max flat workgroup size is not specified, it is set to the default
workgroup size. This prevents kernel launch with a workgroup size larger
than the default. The fix is to ignore a size of 0 and treat it as
unspecified.
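The fix amounts to something like the following sketch, where `clampWorkgroupSize` is a hypothetical helper and not a function from the plugin:

```cpp
#include <algorithm>

// Hypothetical helper. MaxFlatWorkgroupSize == 0 means "not specified".
int clampWorkgroupSize(int Requested, int MaxFlatWorkgroupSize) {
  if (MaxFlatWorkgroupSize == 0)
    return Requested; // no limit recorded for this kernel, so do not clamp
  return std::min(Requested, MaxFlatWorkgroupSize);
}
```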
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105073
The logic is almost the same as that of system.cpp, with one change:
instead of adding all the memory pools to a device struct, it keeps
only a single pool. The existing approach also always allocated memory on
the first HSA pool found for a GPU.
This depends on D104691. The goal of this series of patches is to remove
the g_atl_machine global. The next patch will drop g_atl_machine entirely.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104695
The OpenMP 5.1 standard defines the environment variable
`OMP_TEAMS_THREAD_LIMIT` to limit the number of threads that will be run in a
single block. This patch adds support for this into the AMDGPU and CUDA
plugins.
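A minimal sketch of how such a limit could be applied, assuming an unset or malformed value means no limit. The helper names and the exact clamping order are assumptions for illustration and may differ from what the plugins do.

```cpp
#include <algorithm>
#include <climits>
#include <cstdlib>

// Hypothetical parser: returns 0 when the value is unset or not a
// positive integer, which this sketch treats as "no limit".
int parseThreadLimit(const char *Value) {
  if (!Value)
    return 0;
  char *End = nullptr;
  long Parsed = std::strtol(Value, &End, 10);
  if (End == Value || *End != '\0' || Parsed <= 0 || Parsed > INT_MAX)
    return 0;
  return static_cast<int>(Parsed);
}

// Clamp the requested threads per block by the device maximum and then
// by the environment limit, if one was given.
int threadsPerBlock(int Requested, int DeviceMax, const char *EnvValue) {
  int Threads = std::min(Requested, DeviceMax);
  int Limit = parseThreadLimit(EnvValue);
  return Limit > 0 ? std::min(Threads, Limit) : Threads;
}
```

A caller would pass `std::getenv("OMP_TEAMS_THREAD_LIMIT")` as the last argument.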
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D103923
There does not seem to be any use of these functions. They just
put the value to a local which is never used again.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104512
This change-set removes libelf usage from elf_common part of the plugins.
libelf is still used in x86_64 generic plugin code and in some plugins
(e.g. amdgpu) - these will have to be cleaned up in separate checkins.
Differential Revision: https://reviews.llvm.org/D103545
This patch deletes some of the code that accesses the
g_atl_machine global. Some accesses related to memory_pools
still remain.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103813
This global struct used to hold various flags for monitoring the
initialization of hsa.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103795
Previous logic was to always use the first kernarg pool found to allocate
kernel args. This patch changes this to use only the kernarg pool which
has non-zero size. This logic is also reworked to not use any globals.
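The selection rule can be sketched as below. `MemoryPool` and `pickKernargPool` are hypothetical stand-ins for the runtime's pool handles and queries. The pools are passed in as a parameter, which mirrors the move away from globals.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of a memory pool.
struct MemoryPool {
  int Id;
  bool IsKernarg;
  std::size_t Size;
};

// Return the first kernarg pool with non-zero size, or nullptr if none.
const MemoryPool *pickKernargPool(const std::vector<MemoryPool> &Pools) {
  for (const MemoryPool &Pool : Pools)
    if (Pool.IsKernarg && Pool.Size > 0)
      return &Pool;
  return nullptr;
}
```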
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103600
Turns out the only purpose of this class was to verify whether the device ID
was in range, which can be done easily using g_atl_machine.
Getting rid of g_atl_machine is still pending and will be done in
a later patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103443
This struct was used to specify the device on which memory was
being allocated or freed in atmi_malloc/free. It has now been replaced
with int DeviceId.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103239