Recently OpenMP has transitioned to using the "new" driver which
primarily merges the device and host linking phases into a single
wrapper that handles both at the same time. This replaced a few tools
that were only used for OpenMP offloading, such as the
`clang-offload-wrapper` and `clang-nvlink-wrapper`. The new driver
carries some marked benefits compared to the old driver that is now
being deprecated. Things like device-side LTO, static library
support, and more compatible tooling. As such, we should be able to
completely deprecate the old driver, at least for OpenMP. The old driver
support will still exist for CUDA and HIP, although both of these can
currently be compiled on Linux with `--offload-new-driver` to use the new
method.
Note that this does not deprecate the `clang-offload-bundler`, although
it is unused by OpenMP now, it is still used by the HIP toolchain both
as their device binary format and object format.
When I proposed deprecating this code I heard some vendors voice
concernes about needing to update their code in their fork. They should
be able to just revert this commit if it lands.
Reviewed By: jdoerfert, MaskRay, ye-luo
Differential Revision: https://reviews.llvm.org/D130020
We will add some simple implementation of libc functions starting from
this patch, and the first one is `memcmp`, which is reported in #56929. Note that
`malloc` and `free` are not included in this patch because of the use of
`declare variant`. In the near future we will implement the two functions w/o
using any vendor provided function.
This fixes#56929.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D131182
Sometimes libomptarget's CUDA plugin produces unhelpful diagnostics
about a lack of CUDA devices before an application runs:
```
$ clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa hello-world.c
$ ./a.out
CUDA error: Error returned from cuInit
CUDA error: no CUDA-capable device is detected
Hello World: 4
```
This can happen when the CUDA plugin was built but all CUDA devices
are currently disabled in some manner, perhaps because
`CUDA_VISIBLE_DEVICES` is set to the empty string. As shown in the
above example, it can even happen when we haven't compiled the
application for offloading to CUDA.
The following code from `openmp/libomptarget/plugins/cuda/src/rtl.cpp`
appears to be intended to handle this case, and it chooses not to
write a diagnostic to stderr unless debugging is enabled:
```
if (NumberOfDevices == 0) {
DP("There are no devices supporting CUDA.\n");
return;
}
```
The problem is that the above code is never reached because the
earlier `cuInit` returns `CUDA_ERROR_NO_DEVICE`. This patch handles
that `cuInit` case in the same manner as the above code handles the
`NumberOfDevices == 0` case.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D130371
Multiple calls to `omp_get_wtime` could be optimized out due to the function
is mistakenly marked as `readnone`. This patch fixes the issue, and also add the
support to run optimization on `libomptarget` tests.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D130179
Multiple calls to `omp_get_wtime` could be optimized out due to the function
is mistakenly marked as `readnone`. This patch fixes the issue, and also add the
support to run optimization on `libomptarget` tests.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D130179
Summary:
The static linking test ensures that we can statically link offloading
programs. To create the test we used `llvm-ar`. However, this may not
exist in the user's environment. This patch changes it to use the
binutils `ar` which should exist on every system running these tests
currently. In the future we should set up the dependencies properly.
We are planning on making LTO the default compilation mode for
offloading. In order to make sure it works we should run these tests on
the test suite. AMDGPU already uses the LTO compilation path for its
linking, but in LTO mode it also links the static library late.
Performing LTO requires the static library to be built, if we make the
change this will be a hard requirement and the old bitcode library will
go away. This means users will need to use either a two-step build or a
runtimes build for libomptarget.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D127512
Without this patch, arguments to the
`llvm::OpenMPIRBuilder::AtomicOpValue` initializer are reversed.
Reviewed By: ABataev, tianshilei1992
Differential Revision: https://reviews.llvm.org/D126619
OpenMP 5.2, sec. 10.2 "teams Construct", p. 232, L9-12 restricts what
regions can be strictly nested within a `teams` construct. This patch
relaxes Clang's enforcement of this restriction in the case of nested
`atomic` constructs unless `-fno-openmp-extensions` is specified.
Cases like the following then seem to work fine with no additional
implementation changes:
```
#pragma omp target teams map(tofrom:x)
#pragma omp atomic update
x++;
```
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D126323
Currently, globals on the device will have an infinite reference count
and an unknown name when using `LIBOMPTARGET_INFO` to print the mapping
table. We already store the name of the global in the offloading entry
so we should be able to use it, although there will be no source
location. To do this we need to create a valid `ident_t` string from a
name only.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D124381
Previously an opt-in flag `-fopenmp-new-driver` was used to enable the
new offloading driver. After passing tests for a few months it should be
sufficiently mature to flip the switch and make it the default. The new
offloading driver is now enabled if there is OpenMP and OpenMP
offloading present and the new `-fno-openmp-new-driver` is not present.
The new offloading driver has three main benefits over the old method:
- Static library support
- Device-side LTO
- Unified clang driver stages
Depends on D122683
Differential Revision: https://reviews.llvm.org/D122831
This patch adds the necessary AMDGPU calling convention to the ctor /
dtor kernels. These are fundamentally device kenels called by the host
on image load. Without this calling convention information the AMDGPU
plugin is unable to identify them.
Depends on D122504
Fixes#54091
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D122515
`bug49334.cpp` has one issue that causes flaky result reported in #53730.
The root cause is `BlockedC` is never initialized but in `BlockMatMul_TargetNowait`
it is directly read and written (via `+=`). Fixes#53730.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119988
This patch adds a new target to the OpenMP CPU offloading tests. This
tests the usage of the new driver for CPU offloading. If this all works
then we can move to transition to the new driver as the default.
Depends on D119613
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119736
This patch fixes the issue that the for loop in `applyToShadowMapEntries`
is infinite because `Itr` is not incremented in `CB`. Fixes#53727.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119471
`bug49334.cpp` directly uses `!=` to compare two floating point values,
which is almost wrong.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119485
The 'bug49779.cpp' test has been failing recently. This is because the
runtime is sufficiently complex when using nested parallelism without
optimizations that the CUDA tools cannot statically determine the stack
size. Because of this the kernel can exceed the thread stack size and
crash. Work around this using the 'LIBOMPTARGET_STACK_SIZE' environment
variable and add an FAQ entry for this situation.
Fixes#53670
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119357
This patch enables running the new driver tests for AMDGPU. Previously
this was disabled because some tests failed. This was only because the
new driver tests hadn't been listed as unsupported or expected to fail.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D119240
This patch completely removes the old OpenMP device runtime. Previously,
the old runtime had the prefix `libomptarget-new-` and the old runtime
was simply called `libomptarget-`. This patch makes the formerly new
runtime the only runtime available. The entire project has been deleted,
and all references to the `libomptarget-new` runtime has been replaced
with `libomptarget-`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D118934
This patch adds a new target to the tests to run using the new driver as
the method for generating offloading code.
Depends on D116541
Differential Revision: https://reviews.llvm.org/D118637
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674
In most cases, hidden helper task behave similar as detached tasks. That means,
for example, if we have to wait for detached tasks, we have to do the same thing
for hidden helper tasks as well. This patch adds the missing condition for hidden
helper task accordingly along with detached task.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D107316
D113602 broke the custom state machine when a reduction is present, as
revealed by the reproducer this patch adds to the test suite. In that
case, openmp-opts changes the return value to undef in
`__kmpc_get_warp_size` (which the custom state machine calls as of
D113602). Later optimizations then optimize away the custom state
machine code as if all threads are outside the thread block, so the
target region does not execute. D114802 fixed that but didn't add a
reproducer.
This patch also adds a `__OMP_RTL_ATTRS` entry for
`__kmpc_get_warp_size` to OMPKinds.def, which D113602 missed. This
change does not seem to have any impact on the reduction problem.
Reviewed By: JonChesterfield, jdoerfert
Differential Revision: https://reviews.llvm.org/D113824
Extension of D112504. Lower amdgpu printf to `__llvm_omp_vprintf`
which takes the same const char*, void* arguments as cuda vprintf and also
passes the size of the void* alloca which will be needed by a non-stub
implementation of `__llvm_omp_vprintf` for amdgpu.
This removes the amdgpu link error on any printf in a target region in favour
of silently compiling code that doesn't print anything to stdout.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112680
Extension of D112504. Lower amdgpu printf to `__llvm_omp_vprintf`
which takes the same const char*, void* arguments as cuda vprintf and also
passes the size of the void* alloca which will be needed by a non-stub
implementation of `__llvm_omp_vprintf` for amdgpu.
This removes the amdgpu link error on any printf in a target region in favour
of silently compiling code that doesn't print anything to stdout.
The exact set of changes to check-openmp probably needs revision before commit
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112680
Passes same tests as the current deviceRTL. Includes cmake change from D111987.
CI is showing a different set of pass/fails to local, committing this
without the tests enabled by default while debugging that difference.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112227
Passes same tests as the current deviceRTL. Includes cmake change from D111987.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112227
Fixes 51982. Adds a missing CreatePointerCast and allocates a global in
the correct address space.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
Fixes 51982. Minor refactor to remove `return x = y` construct.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
This patch implements OpenMP runtime support for an original OpenMP
extension we have developed to support OpenACC: the `ompx_hold` map
type modifier. The previous patch in this series, D106509, implements
Clang support and documents the new functionality in detail.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D106510
Currently, the runtime returns an error when the `exec_mode` global is
not present. The expected behvaiour is that the region will default to
Generic. This prevents global constructors from being called because
they do not contain execution mode globals.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108255
Bug 50022 [0] reports target nowait fails in certain case, which is added in this
patch. The root cause of the failure is, when the second task is created, its
parent's `td_incomplete_child_tasks` will not be incremented because there is no
parallel region here thus its team is serialized. Therefore, when the initial
thread is waiting for its unfinished children tasks, it thought there is only
one, the first task, because it is hidden helper task, so it is tracked. The
second task will only be pushed to the queue when the first task is finished.
However, when the first task finishes, it first decrements the counter of its
parent, and then release dependences. Once the counter is decremented, the thread
will move on because its counter is reset, but actually, the second task has not
been executed at all. As a result, since in this case, the main function finishes,
then `libomp` starts to destroy. When the second task is pushed somewhere, all
some of the structures might already have already been destroyed, then anything
could happen.
This patch simply moves `__kmp_release_deps` ahead of decrement of the counter.
In this way, we can make sure that the initial thread is aware of the existence
of another task(s) so it will not move on. In addition, in order to tackle
dependence chain starting with hidden helper thread, when hidden helper task is
encountered, we force the task to release dependences.
Reference:
[0] https://bugs.llvm.org/show_bug.cgi?id=50022
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D106519
For example, without this patch:
```
$ cat test.c
int main() {
int x;
#pragma omp target enter data map(alloc: x)
#pragma omp target exit data map(release: x)
;
return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, updated RefCount=1
```
There are two problems in this example:
* `RefCount` is not reported when a mapping is created, but it might
be 1 or infinite. In this case, because it's created by `omp target
enter data`, it's 1. Seeing that would make later `RefCount`
messages easier to understand.
* `RefCount` is still 1 at the `omp target exit data`, but it's
reported as `updated`. The reason it's still 1 is that, upon
deletions, the reference count is generally not updated in
`DeviceTy::getTgtPtrBegin`, where the report is produced. Instead,
it's zeroed later in `DeviceTy::deallocTgtPtr`, where it's actually
removed from the mapping table.
This patch makes the following changes:
* Report the reference count when creating a mapping.
* Where an existing mapping is reported, always report a reference
count action:
* `update suppressed` when `UpdateRefCount=false`
* `incremented`
* `decremented`
* `deferred final decrement`, which replaces the misleading
`updated` in the above example
* Add comments to `DeviceTy::getTgtPtrBegin` to explain why it does
not zero the reference count. (Please advise if these comments miss
the point.)
* For unified shared memory, don't report confusing messages like
`RefCount=` or `RefCount= updated` given that reference counts are
irrelevant in this case. Instead, just report `for unified shared
memory`.
* Use `INFO` not `DP` consistently for `Mapping exists` messages.
* Fix device table dumps to print `INF` instead of `-1` for an
infinite reference count.
Reviewed By: jhuber6, grokos
Differential Revision: https://reviews.llvm.org/D104559
`bug49334.cpp` cannot detect data race in `libomptarget` efficiently. It
is reported that with `N = 256` and `BS = 16`, the data race can be reproduced
more steadily. The next coming pathces will fix it so this patch is expected to
fail now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104552