A new rule is added in 5.0:
If a list item appears in a reduction, lastprivate or linear clause
on a combined target construct then it is treated as if it also appears
in a map clause with a map-type of tofrom.
Currently, map clauses are added implicitly for all captured variables,
but not for list items that are array elements or array sections.
This change adds an implicit map clause for array elements and array
sections used in a reduction clause, and skips adding the map clause if
the expression is not mappable.
Note: linear and lastprivate accept only variable names, so for them the
map has already been added through the captured variables.
To do so:
During the mappable check, if an error is found, suppress the diagnostic
and skip adding the implicit map clause.
The changes:
1> Add code to generate the implicit map in ActOnOpenMPExecutableDirective,
for OpenMP 5.0 and up.
2> Add an extra default parameter, NoDiagnose, to ActOnOpenMPMapClause.
Use it to suppress errors and to skip adding the implicit map during the
mappable check.
Note: only two places need to check NoDiagnose. In the rest, either the
check applies only to OpenMP versions before 5.0, or the error has already
been generated for the reduction clause.
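As an illustration of the rule, here is a minimal sketch that uses an array
section in a reduction clause on a combined target construct. The function
name and data are made up for this example. Under the OpenMP 5.0 rule, the
list item `acc[0:2]` is treated as if it also appeared in
`map(tofrom: acc[0:2])`. When OpenMP is not enabled, the pragma is ignored
and the loop runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical example: sums 0..n-1 into acc[0] and counts iterations in
 * acc[1]. The array section in the reduction clause is the kind of list
 * item that receives an implicit map(tofrom: ...) under the 5.0 rule. */
int sum_below(int n) {
  int acc[2] = {0, 0};
  assert(n >= 0);
#pragma omp target parallel for reduction(+ : acc[0:2])
  for (int i = 0; i < n; ++i) {
    acc[0] += i;
    acc[1] += 1;
  }
  assert(acc[1] == n); /* every iteration was counted exactly once */
  return acc[0];
}
```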
Differential Revision: https://reviews.llvm.org/D108132
Currently, the runtime returns an error when the `exec_mode` global is
not present. The expected behaviour is that the region defaults to
Generic. The error prevents global constructors from being called, because
they do not contain execution mode globals.
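The intended fallback can be sketched as follows. This is not the plugin's
code: the enum, its values, and `resolve_exec_mode` are hypothetical names
chosen for illustration, and the real encoding of the `exec_mode` global may
differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding; the real runtime's values may differ. */
enum exec_mode { EXEC_MODE_SPMD = 0, EXEC_MODE_GENERIC = 1 };

/* exec_mode_global points at the kernel's `exec_mode` global, or is NULL
 * when the image does not contain one (e.g. for a global constructor). */
enum exec_mode resolve_exec_mode(const signed char *exec_mode_global) {
  if (exec_mode_global == NULL)
    return EXEC_MODE_GENERIC; /* default instead of reporting an error */
  assert(*exec_mode_global == EXEC_MODE_SPMD ||
         *exec_mode_global == EXEC_MODE_GENERIC);
  return (enum exec_mode)*exec_mode_global;
}
```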
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108255
Remove --cuda-path=CUDA_TOOLKIT_ROOT_DIR-NOTFOUND
from the invocation of non-nvptx test cases. Better signal
to noise ratio on other architectures.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D107074
Bug 50022 [0] reports that `target nowait` fails in a certain case, which is
added in this patch. The root cause of the failure is that, when the second
task is created, its parent's `td_incomplete_child_tasks` is not incremented:
there is no parallel region, so its team is serialized. Therefore, when the
initial thread waits for its unfinished child tasks, it believes there is only
one, the first task, which is tracked because it is a hidden helper task. The
second task is pushed to the queue only when the first task has finished.
However, when the first task finishes, it first decrements the counter of its
parent and then releases its dependences. Once the counter is decremented, the
thread moves on because its counter is reset, even though the second task has
not been executed at all. Because the main function then finishes in this case,
`libomp` starts to shut down. By the time the second task is pushed, some of
the structures might already have been destroyed, so anything could happen.
This patch simply moves `__kmp_release_deps` ahead of the decrement of the
counter. In this way, we can make sure that the initial thread is aware of the
existence of any other task(s), so it will not move on. In addition, in order
to tackle a dependence chain starting with a hidden helper thread, when a
hidden helper task is encountered, we force the task to release dependences.
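The effect of the reordering can be seen in a single-threaded toy model. This
is not `libomp` code: the struct and helper names are hypothetical, and the
model assumes that releasing dependences is what makes the successor task
visible in the parent's counter.

```c
#include <assert.h>

/* Toy model: `incomplete_children` stands in for the parent's
 * td_incomplete_child_tasks counter. */
struct toy_parent {
  int incomplete_children;
};

/* Assumption of this model: releasing dependences enqueues the successor
 * task and makes it visible in the parent's counter. */
static void toy_release_deps(struct toy_parent *p) {
  p->incomplete_children += 1;
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Finishes the first task and returns the lowest counter value a waiting
 * thread could observe along the way. A value of 0 means the waiter may
 * move on while the successor task is still pending. */
int lowest_counter_seen(int release_deps_first) {
  struct toy_parent p = {1}; /* only the first task is tracked */
  int lowest = p.incomplete_children;
  assert(release_deps_first == 0 || release_deps_first == 1);
  if (release_deps_first) {
    toy_release_deps(&p); /* new order: successor becomes visible first */
    lowest = min_int(lowest, p.incomplete_children);
    p.incomplete_children -= 1;
    lowest = min_int(lowest, p.incomplete_children);
  } else {
    p.incomplete_children -= 1; /* old order: counter drops first */
    lowest = min_int(lowest, p.incomplete_children);
    toy_release_deps(&p);
    lowest = min_int(lowest, p.incomplete_children);
  }
  return lowest;
}
```

In this model the old order lets the counter reach zero before the successor
is registered, while the new order never does.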
Reference:
[0] https://bugs.llvm.org/show_bug.cgi?id=50022
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D106519
D106236 added a new CMake argument for `libomptarget` tests, but when the user's
input contains white space, CMake adds escape characters to the final lit command,
which leads to an error. This patch converts the user's input `LIBOMPTARGET_LIT_ARGS`
into a local array, and then passes the array to the function.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D106247
By default, `lit` uses all threads to invoke tests, which can easily exhaust
GPU memory: most OpenMP offloading tests take about 1GB of GPU memory, but a
typical GPU has only 4-8GB. This patch introduces a
CMake argument `LIBOMPTARGET_LIT_ARGS` to allow users to control the behavior of
`libomptarget` tests, similar to `LLVM_LIT_ARGS`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D106236
Currently, libomptarget will always perform a host-to-device memory transfer in
order to update the device pointer of a PTR_AND_OBJ entry. This is not always
necessary because the device pointer may have been set to the correct pointee
address already, so we can eliminate the redundant memory transfer.
If the base is used in a map clause and later we have a memberexpr with
this base, and the member is a pointer, and this pointer is dereferenced
anyhow (subscript, array section, dereference, etc.), such components
should be considered as overlapped, otherwise it may lead to incorrect
size computations, since we try to map a pointee as a part of the whole
struct, which is not true for the pointer members.
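A minimal sketch of the situation described above: a struct is mapped as a
whole, and a pointer member of the same struct is also dereferenced through an
array section. The type and function names are made up for this example.
Without OpenMP enabled, the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

struct vec {
  int n;
  int *data; /* pointer member: the pointee is not part of the struct */
};

/* Maps the base `s` and, separately, the pointee reached through the
 * member expression `s.data[0:n]`. */
int sum_through_member(int *buf, int n) {
  struct vec s;
  int total = 0;
  assert(buf != NULL && n > 0);
  s.n = n;
  s.data = buf;
#pragma omp target map(to : s) map(to : s.data[0 : n]) map(tofrom : total)
  for (int i = 0; i < s.n; ++i)
    total += s.data[i];
  return total;
}
```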
Differential Revision: https://reviews.llvm.org/D105562
For example, without this patch:
```
$ cat test.c
int main() {
int x;
#pragma omp target enter data map(alloc: x)
#pragma omp target exit data map(release: x)
;
return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, updated RefCount=1
```
There are two problems in this example:
* `RefCount` is not reported when a mapping is created, but it might
be 1 or infinite. In this case, because it's created by `omp target
enter data`, it's 1. Seeing that would make later `RefCount`
messages easier to understand.
* `RefCount` is still 1 at the `omp target exit data`, but it's
reported as `updated`. The reason it's still 1 is that, upon
deletions, the reference count is generally not updated in
`DeviceTy::getTgtPtrBegin`, where the report is produced. Instead,
it's zeroed later in `DeviceTy::deallocTgtPtr`, where it's actually
removed from the mapping table.
This patch makes the following changes:
* Report the reference count when creating a mapping.
* Where an existing mapping is reported, always report a reference
count action:
* `update suppressed` when `UpdateRefCount=false`
* `incremented`
* `decremented`
* `deferred final decrement`, which replaces the misleading
`updated` in the above example
* Add comments to `DeviceTy::getTgtPtrBegin` to explain why it does
not zero the reference count. (Please advise if these comments miss
the point.)
* For unified shared memory, don't report confusing messages like
`RefCount=` or `RefCount= updated` given that reference counts are
irrelevant in this case. Instead, just report `for unified shared
memory`.
* Use `INFO` not `DP` consistently for `Mapping exists` messages.
* Fix device table dumps to print `INF` instead of `-1` for an
infinite reference count.
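The last item can be sketched as a small formatting helper. This is an
illustration only: the helper name and macro are hypothetical, and the sketch
assumes an infinite reference count is stored as the all-ones value, which is
what would print as `-1` when treated as signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: an infinite reference count is stored as the
 * all-ones value. */
#define TOY_INFINITE_REFCOUNT UINT64_MAX

/* Writes "INF" or the decimal count into buf and returns buf. */
const char *format_refcount(uint64_t ref_count, char *buf, size_t len) {
  assert(buf != NULL && len >= 21); /* room for 20 digits plus NUL */
  if (ref_count == TOY_INFINITE_REFCOUNT)
    snprintf(buf, len, "INF");
  else
    snprintf(buf, len, "%llu", (unsigned long long)ref_count);
  return buf;
}
```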
Reviewed By: jhuber6, grokos
Differential Revision: https://reviews.llvm.org/D104559
`bug49334.cpp` cannot detect the data race in `libomptarget` efficiently. It
is reported that with `N = 256` and `BS = 16`, the data race can be reproduced
more reliably. Upcoming patches will fix the race, so this test is expected to
fail for now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104552
This patch adds an information flag that indicates when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
The check for the TO flag when processing firstprivates is missing. As a result,
sometimes the device copy of a firstprivate never gets initialized. Currently we
try to force lambda structs to be allocated immediately by marking them as
non-firstprivate, so that PrivateArgumentManagerTy::addArg allocates memory for
them immediately. However, calling addArg with IsFirstPrivate=false makes the
function skip initializing the device copy. Whether an argument is firstprivate
and whether we need to allocate memory immediately are separate questions, so
this patch introduces one more control variable for immediate allocation and
separates it from initialization.
Differential Revision: https://reviews.llvm.org/D102890
[libomptarget] Disable test bug49334 on amdgpu
Hangs on amdgpu, do not know why. Disable to unblock build.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D102017
Add a `REQUIRES: unified_shared_memory` option to tests that use `#pragma omp requires unified_shared_memory`.
For CUDA, the feature tag is derived from LIBOMPTARGET_DEP_CUDA_ARCH which itself is derived using [[ https://cmake.org/cmake/help/latest/module/FindCUDA.html#commands | cuda_select_nvcc_arch_flags ]]. The latter determines which compute capability the GPU in the system supports. To ensure that this is the CUDA arch being used, we could also set the `-Xopenmp-target -march=` flag.
In the absence of an NVIDIA GPU, LIBOMPTARGET_DEP_CUDA_ARCH will be 35. That is, in that case we are assuming unified_shared_memory is not available. CUDA plugin testing could be disabled entirely in this case, but this currently depends on `LIBOMPTARGET_CAN_LINK_LIBCUDA OR LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA`, not on whether the hardware is actually available.
For all other targets, nothing changes and we are assuming unified shared memory is available. This might need refinement if not the case.
This tries to fix the [[ http://meinersbur.de:8011/#/builders/143 | OpenMP Offloading Buildbot ]] that, although brand-new, only has a Pascal-generation (sm_61) GPU installed. Hence, tests that require unified shared memory are currently failing. I wish I had known in advance.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D101498
Summary:
The allocator interface added in D97883 allows the RTL to allocate shared and
host-pinned memory from the cuda plugin. This patch adds support for these to
the runtime.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D102000
This enables the runtime tests on amdgpu targets.
Ten tests are currently marked as XFAIL on amdgcn, mostly due to
missing printf.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D99656
The OpenMP runtime can be compiled using a CUDA installation at a non-default
location with the -DCUDA_TOOLKIT_ROOT_DIR setting. However, check-openmp
will fail afterwards because Clang needs to know where to find the CUDA
headers.
Fix by passing -cuda-path to Clang using the value of
CUDA_TOOLKIT_ROOT_DIR which has been determined by CMake. Also set
LD_LIBRARY_PATH such that it can find the CUDA runtime when executing.
This ensures that the regression tests do not depend on the current
environment, but use the environment they were configured for.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101266
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.
This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.
In cases where no RUN line was defined for a specific offloading target,
the corresponding target is declared as XFAIL. If it turns out that a test
actually supports the target, the XFAIL line can be removed.
Differential Revision: https://reviews.llvm.org/D101326
This patch creates a separate test directory for each offloading target to be
tested. This allows testing multiple architectures in one configuration, while
still seeing all failing tests separately. The lit test names include the target
triple, so that it is easier to spot the failing target.
This patch also allows marking tests as expected failures based on the
target triple, as the currently used triple is added to the lit "features":
```
// XFAIL: nvptx64-nvidia-cuda
```
Differential Revision: https://reviews.llvm.org/D101315
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using it requires declaring an extern function for now, but it will
eventually be added to an auxiliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and makes the corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
The mappings implicitly generated for allocation/deallocation in the mappers
runtime should be marked as implicit; there is also no need to clear the
member_of flag to avoid a ref counter increment. In addition, the ref counter
should not be incremented for the very first element that comes from the
mapper function.
Differential Revision: https://reviews.llvm.org/D100673
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
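Because `LIBOMPTARGET_INFO` is treated as a set of bit flags, checking
whether this output is requested amounts to a mask test. The sketch below is
illustrative: only the 0x8 position comes from the description above, while
the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Only the 0x8 position comes from the text above; the name is made up. */
#define TOY_INFO_MAPPING_CHANGED 0x8u

/* Returns 1 if the single-bit `flag` is set in `info_level`, else 0. */
int info_enabled(uint32_t info_level, uint32_t flag) {
  assert(flag != 0 && (flag & (flag - 1)) == 0); /* exactly one bit set */
  return (info_level & flag) != 0;
}
```

With this encoding, an info level of 0x8 or 0xF enables the mapping-table
messages, while 0x7 does not.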
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
omp_is_initial_device() is marked as a built-in function in the current
compiler, and user code guarded by this call may be optimized away,
resulting in undesired behavior in some cases. This patch provides a
possible fix for such cases by defining the routine as a variant
function and removing it from the builtin list.
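The kind of user code affected looks roughly like the following sketch, where
the result of `omp_is_initial_device()` inside a target region decides what
the program reports. The wrapper function name is made up, and the stub is an
assumption added so that the example also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Stub for builds without OpenMP, where everything runs on the host. */
static int omp_is_initial_device(void) { return 1; }
#endif

/* Returns 1 if the target region executed on the host, 0 otherwise. */
int target_region_ran_on_host(void) {
  int on_host = -1;
#pragma omp target map(from : on_host)
  { on_host = omp_is_initial_device(); }
  assert(on_host != -1); /* the region must have written the flag */
  return on_host != 0;
}
```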
Differential Revision: https://reviews.llvm.org/D99447
Summary:
The changes introduced in D87946 changed the API for libomptarget
functions. `__kmpc_push_target_tripcount` was a function in Clang 11.x
but was not given a backward-compatible interface. This change will
require people using Clang 13.x or 12.x to recompile their offloading
programs.
Reviewed By: jdoerfert cchen
Differential Revision: https://reviews.llvm.org/D98358
Without this patch, an `omp target exit data` before the runtime is
initialized produces a runtime error. This patch fixes that by
changing `__tgt_target_data_end_mapper` to call `CheckDeviceAndCtors`
like many other runtime routines.
Discussed at
<https://lists.llvm.org/pipermail/openmp-dev/2021-March/003920.html>.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D97907
Without this patch, when the offload device is set to
`omp_get_initial_device()`, the runtime fails with an error diagnostic
when entering target regions or target data regions.
However, OpenMP 5.1, sec. 2.14.5 "target Construct", "Restrictions",
p. 203, L3-5 states:
> The device clause expression must evaluate to a non-negative integer
> value that is less than or equal to the value of
> omp_get_num_devices().
Sec. 3.7.7 "omp_get_initial_device", p. 412, L2-3 states:
> The value of the device number is the value returned by the
> omp_get_num_devices routine.
Similarly, OpenMP 5.0, sec. 2.12.5 "target Construct", "Restrictions",
p. 174 L30-32 states:
> The device clause expression must evaluate to a non-negative integer
> value less than the value of omp_get_num_devices() or to the value
> of omp_get_initial_device().
This patch fixes this behavior by changing the runtime to behave as if
offloading is disabled whenever it finds the offload device (either
from a `device` clause or the default device) is set to the host
device. In the case of mandatory offloading when
`omp_get_num_devices() == 0`, it incorporates the behavior proposed
for OpenMP 5.2 in OpenMP spec github issue 2669.
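A minimal sketch of the case this patch addresses: a target region whose
`device` clause names the host device. The function name is made up for the
example. Without OpenMP enabled, the pragma is ignored and the block simply
runs on the host.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Runs a trivial target region with the offload device set to the value of
 * omp_get_initial_device(), which the spec text quoted above permits. */
int double_on_initial_device(int x) {
  int result = 0;
#pragma omp target device(omp_get_initial_device()) map(to : x) map(from : result)
  { result = 2 * x; }
  assert(result == 2 * x);
  return result;
}
```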
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D97616
If the mapped structure has data members that have 'default' mappers,
these members must be mapped individually using their 'default' mappers.
Differential Revision: https://reviews.llvm.org/D92195
PR#49334 reports a crash when offloading to x86_64 with `target nowait`,
which is caused by referencing a nullptr. The root cause of the issue is, when
pushing a hidden helper task in `__kmp_push_task`, it also maps the gtid to its
shadow gtid, which is wrong.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97329
Without this patch, there's a runtime error for those map types at
exit from an "omp target data" or at "omp target exit data", but the
spec says the list item should be ignored.
This patch tests that fix in data_absent_at_exit.c, and it also
improves other testing for data that is not fully present at exit.
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D96999
OpenMP 5.0 removed many of the restrictions on overlapped mapped items
compared to OpenMP 4.5. This patch restricts the checks for overlapped data
mappings to OpenMP 4.5 and earlier, and reorders the mapping of the
arguments so that present and alloc mappings are processed first and
then all others.
Differential Revision: https://reviews.llvm.org/D86119
Currently, if there is no kernel argument, device synchronization is
skipped. This can lead to two issues:
1. If there is any device error, it will not be captured;
2. The target region might end before the kernel is done, which is not spec
conformant.
The test added in this patch only runs on NVPTX platform, although it will not
be executed by Phab at all. It also requires `not` which is not available on most
systems.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D96067
Summary:
The custom mapper API did not support the mapping names that were added earlier. This means they were not present if a user requested debugging information while using the mapper functions. This adds basic support for passing the mapped names to the runtime library.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D94806
Summary:
This patch adds more fine-grained support over which information is output from the libomptarget runtime when run with the environment variable LIBOMPTARGET_INFO set. An extensible set of flags can be used to pick and choose which information the user is interested in.
Reviewers: jdoerfert JonChesterfield grokos
Differential Revision: https://reviews.llvm.org/D93727
This patch is the runtime support for https://reviews.llvm.org/D84192.
In order not to modify the tgt_target_data_update information but still be
able to pass the extra information for a non-contiguous map item (offset,
count, and stride for each dimension), this patch overloads arg when
the maptype is set to OMP_TGT_MAPTYPE_DESCRIPTOR. The original arg is for
passing the pointer information; the overloaded arg, however, is an
array of descriptor_dim:
```
struct descriptor_dim {
int64_t offset;
int64_t count;
int64_t stride;
};
```
and the array size is the dimension size. In addition, since we
have count and stride information in descriptor_dim, we can replace/overload the
arg_size parameter by using dimension size.
Reviewed By: grokos, tianshilei1992
Differential Revision: https://reviews.llvm.org/D82245
There is a non-conforming use of a variable-sized array in the test case `parallel_offloading_map.c`. This patch fixes it.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D90642
The implementation of target nowait just wraps the target region into a task. The four essential parameters (base ptr, ptr, size, mapper) are taken as firstprivate so that they are copied to the private location. When there is no user-defined mapper, the mapper variable is nullptr, but it is still copied to the corresponding place. Therefore, a memcpy is generated whose source pointer is nullptr, causing a segmentation fault. The root cause is that, when calling `emitOffloadingArraysArgument`, the last argument `Options` has a field that says whether a task is required, and that field only takes the depend clause into account. In this patch, the nowait clause is also included.
Two things will be done in other patches:
1. target data nowait is not supported yet. D90099 adds the support.
2. When there is no mapper, the mapper array can be nullptr regardless of whether an outer task is required. This avoids an unnecessary data copy. It is an optimization that is covered in D90101.
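A minimal sketch of the affected pattern: a `target nowait` region with map
clauses and no user-defined mapper, followed by a `taskwait`. The function
name is made up for this example. Without OpenMP enabled, both pragmas are
ignored and the loop runs synchronously on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Offloads a scaled sum as a deferred target task, then waits for it.
 * No mapper is declared, so the mapper entries for these maps are null. */
int scaled_sum(const int *in, int n, int scale) {
  int total = 0;
  assert(in != NULL && n > 0);
#pragma omp target nowait map(to : in[0 : n]) map(tofrom : total)
  for (int i = 0; i < n; ++i)
    total += in[i] * scale;
#pragma omp taskwait
  return total;
}
```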
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D89844
This patch fixes a problem whereby the pointee object of a PTR_AND_OBJ entry with a `map(to)` motion clause can be overwritten on the device even if its reference counter is >=1.
Currently, we check the reference counter of the parent struct in order to determine whether the motion clause should be respected, but since the pointee object is not part of the struct, it's got its own reference counter which should be used to enqueue the copy or discard it.
The same behavior has already been implemented in targetDataEnd (omptarget.cpp:539-540), but we somehow missed doing the same in targetDataBegin.
Differential Revision: https://reviews.llvm.org/D89597
Summary:
This patch starts adding support for adding information dumps to libomptarget
and rtl plugins. The information printing is controlled by the
LIBOMPTARGET_INFO environment variable introduced in D86483. The goal of this
patch is to provide the user with additional information about the device
during kernel execution and with information dumps in the
case of failure. This patch adds the ability to dump the pointer mapping table
as well as to print the number of blocks and threads in the CUDA RTL.
Reviewers: jdoerfort gkistanova ye-luo
Subscribers: guansong openmp-commits sstefan1 yaxunl ye-luo
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D87165
PrivateArgumentManager shall immediately allocate firstprivates if they
are bases for the next parameters and the next parameters rely on the
fact that the base must already be allocated.
Differential Revision: https://reviews.llvm.org/D86781