We are planning to make LTO the default compilation mode for
offloading. To make sure it works, we should run these tests as part of
the test suite. AMDGPU already uses the LTO compilation path for its
linking, but in LTO mode it also links the static library late.
Performing LTO requires the static library to be built; if we make this
change, that will become a hard requirement and the old bitcode library
will go away. This means users will need to use either a two-step build
or a runtimes build for libomptarget.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D127512
Previously, an opt-in flag `-fopenmp-new-driver` was used to enable the
new offloading driver. After passing tests for a few months, it should
be sufficiently mature to flip the switch and make it the default. The
new offloading driver is now enabled whenever OpenMP and OpenMP
offloading are present and the new `-fno-openmp-new-driver` flag is not
given.
The new offloading driver has three main benefits over the old method:
- Static library support
- Device-side LTO
- Unified clang driver stages
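As a rough illustration of the first point, with the new driver a
static library containing device code can be linked like any host
archive. The file and symbol names below are hypothetical, and the
build steps are only a sketch (both files compiled with the usual
`-fopenmp -fopenmp-targets=<triple>` flags, `bar.c` archived into
`libbar.a`, and the archive then passed on the link line of `main.c`):

  // bar.c: device-capable code, archived into libbar.a (hypothetical).
  #pragma omp declare target
  int bar(int X) { return X + 1; }
  #pragma omp end declare target

  // main.c: calls into the static library from a target region.
  #pragma omp declare target
  int bar(int X);
  #pragma omp end declare target

  int main() {
    int R = 0;
  #pragma omp target map(from : R)
    R = bar(41);
    return R == 42 ? 0 : 1;
  }

The old flow did not handle the device halves of such archives well,
which is why static library support is listed first above.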
Depends on D122683
Differential Revision: https://reviews.llvm.org/D122831
This patch adds a new target to the OpenMP CPU offloading tests. This
tests the use of the new driver for CPU offloading. If this all works,
we can begin transitioning to the new driver as the default.
Depends on D119613
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119736
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.
This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.
In cases where no RUN line was defined for a specific offloading
target, the corresponding target is declared as XFAIL. If it turns out
that a test actually supports the target, the XFAIL line can be removed.
Differential Revision: https://reviews.llvm.org/D101326
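For illustration, after this change a typical test carries a single
generic RUN line plus any per-target XFAIL markers. The substitution
and triple names below follow my reading of the libomptarget lit
configuration and are meant as a sketch, not a verbatim test:

  // RUN: %libomptarget-compile-run-and-check-generic
  //
  // Hypothetical example of a target this test is not expected to pass on yet.
  // XFAIL: nvptx64-nvidia-cuda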
This patch introduces a target memory manager, which aims to manage
target memory such that it is not freed immediately when it is no
longer used, because the overhead of memory allocation and deallocation
is very large. On CUDA devices, cuMemFree even blocks context switching
on the device, which hurts concurrent kernel execution.
The memory manager can be thought of as a memory pool. It divides the
pool into multiple buckets according to allocation size, so that
allocations and frees that fall into different buckets do not affect
each other.
In this version, we use an exact-equality policy to find a free buffer.
Whether best-fit would work better here is an open question. IMO,
best-fit is not a good choice for target memory management because
computation on GPUs usually requires GBs of data, so best-fit could
lead to serious waste. For example, if there is a free buffer of size
1960 MB and we now need a buffer of size 1200 MB, best-fit would return
the free buffer, wasting 760 MB.
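A minimal sketch of the bucketing and the exact-equality lookup
described above (illustrative C++ only, with made-up names and a
simplified size-class scheme, not the actual libomptarget code):

  #include <cstddef>
  #include <mutex>
  #include <vector>

  namespace {
  constexpr int NumBuckets = 32;

  struct FreeBuffer {
    void *Ptr;
    size_t Size;
  };

  // Map a size to a bucket index by its most significant bit (size class).
  int bucketFor(size_t Size) {
    int B = 0;
    while ((Size >>= 1) != 0)
      ++B;
    return B < NumBuckets ? B : NumBuckets - 1;
  }

  struct MemoryManagerSketch {
    // One free list and one lock per bucket, so requests that land in
    // different buckets do not contend with each other.
    std::vector<FreeBuffer> FreeLists[NumBuckets];
    std::mutex Locks[NumBuckets];

    // Exact-equality policy: only a free buffer of exactly the requested
    // size is reused; a 1960 MB buffer is never returned for 1200 MB.
    void *findExact(size_t Size) {
      int B = bucketFor(Size);
      std::lock_guard<std::mutex> G(Locks[B]);
      for (auto It = FreeLists[B].begin(); It != FreeLists[B].end(); ++It) {
        if (It->Size == Size) {
          void *Ptr = It->Ptr;
          FreeLists[B].erase(It);
          return Ptr;
        }
      }
      return nullptr; // No exact match: do a real device allocation instead.
    }
  };
  } // namespace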
An actual device allocation happens only when there is no suitable free
buffer left in the pool, and memory on the device is actually freed in
the following two cases:
1. The program ends, obviously. However, there is a small problem: the
plugin library is destroyed before the memory manager, so the final
calls into the target plugin will not succeed.
2. The device runs out of memory when we request new memory. The
manager then walks through all free buffers, starting from the bucket
with the largest base size, picks one buffer, frees it, and immediately
retries the allocation. If the retry succeeds, it returns right away
rather than freeing every buffer on the free lists.
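A self-contained sketch of the fallback in case 2 (deviceAlloc and
deviceFree are hypothetical stand-ins for the plugin calls, and the map
from bucket base size to cached pointers is a simplification of the
per-bucket free lists above):

  #include <cstddef>
  #include <deque>
  #include <map>

  void *deviceAlloc(size_t Size); // Hypothetical stand-in for the plugin alloc.
  void deviceFree(void *Ptr);     // Hypothetical stand-in for the plugin free.

  // On an out-of-memory allocation, free one cached buffer at a time,
  // starting from the bucket with the largest base size, and retry the
  // device allocation after each free instead of draining every free list.
  void *allocateWithEviction(std::map<size_t, std::deque<void *>> &FreeBuffers,
                             size_t Size) {
    if (void *Ptr = deviceAlloc(Size))
      return Ptr;
    for (auto It = FreeBuffers.rbegin(); It != FreeBuffers.rend(); ++It) {
      while (!It->second.empty()) {
        deviceFree(It->second.back());
        It->second.pop_back();
        if (void *Ptr = deviceAlloc(Size))
          return Ptr; // Retry succeeded; stop freeing early.
      }
    }
    return nullptr; // Still out of memory after draining the pool.
  }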
Update:
A threshold (8 KB by default) is set so that users can control what
sizes of memory are managed by the manager. It can also be configured
via the environment variable `LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`.
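A sketch of how such a threshold could be read; the 8 KB default and
the environment variable name come from the text above, while the
parsing and the exact bypass behavior are assumptions of this sketch
rather than the verbatim libomptarget code:

  #include <cstddef>
  #include <cstdlib>

  size_t getManagerThreshold() {
    // Default of 8 KB; allocations above the threshold are assumed to
    // bypass the pool and go straight to the plugin.
    size_t Threshold = 8 * 1024;
    if (const char *Env = std::getenv("LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD"))
      Threshold = std::strtoul(Env, nullptr, 10);
    return Threshold;
  }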
Reviewed By: jdoerfert, ye-luo, JonChesterfield
Differential Revision: https://reviews.llvm.org/D81054