[libomptarget][nvptx][nfc] Move target_impl functions out of header
This removes most of the differences between the two target_impl.h.
Also change name mangling from C to C++ for __kmpc_impl_*_lock.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D94728
`omptarget-nvptx` is still a dependence for `check-libomptarget-nvptx`
although it has been removed by D94573.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94725
The comment said CUDA 9 header files use the `nv_weak` attribute which
`clang` is not yet prepared to handle. That was three years ago, and things have
changed since. In my testing, removing the definition causes no problems on
my machine with CUDA 11.1 installed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94700
For NVPTX target, OpenMP provides a static library `libomptarget-nvptx`
built by NVCC, and another bitcode `libomptarget-nvptx-sm_{$sm}.bc` generated by
Clang. When compiling an OpenMP program, the `.bc` file will be fed to `clang`
in the second run on the program that compiles the target part. Then the generated
PTX file will be fed to `ptxas` to generate the object file, and finally the driver
invokes `nvlink` to generate the binary, where the static library will be appended
to `nvlink`.
One question is, why do we need two libraries? The only difference is, the static
library contains `omp_data.cu` and the bitcode library doesn't. It's unclear why
they were implemented in this way, but per D94565, there is no issue if we also
include the file into the bitcode library. Therefore, we can safely drop the
static library.
This patch is about the change in OpenMP. The driver will be updated as well if
this patch is accepted.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D94573
Restore control of kernel launch tracing to be >= 1 as it was before
export LIBOMPTARGET_KERNEL_TRACE=1
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94695
Hierarchical barrier is an experimental barrier algorithm that uses aspects
of machine hierarchy to define the barrier tree structure. This patch fixes
offset calculation in hierarchical barrier. The offset is used to store info
on a flag about sleeping threads waiting on a location stored in the flag.
This commit also fixes a potential deadlock in hierarchical barrier when
using infinite blocktime by adjusting the offset value of leaf kids so that
it matches the value of leaf state. It also adds testing of default barriers
with infinite blocktime, and also tests hierarchical barrier algorithm with
both default and infinite blocktime.
Patch by Terry Wilmarth and Nawrin Sultana.
Differential Revision: https://reviews.llvm.org/D94241
Add extra information to the runtime page describing the error messages and add information to the release notes for clang 12.0
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94562
This change enables volatile use of persistent memory for omp_large_cap_mem*
on supported systems. It depends on libmemkind's support for persistent memory,
and requirements/details can be found at the following url.
https://pmem.io/2020/01/20/memkind-dax-kmem.html
Differential Revision: https://reviews.llvm.org/D94353
A constant static data member can be defined in the class without another
definition after the class in C++17. Although this is a C++17 feature, Clang
accepts it even without the C++17 flag. Unfortunately, GCC does not.
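As an illustrative sketch only (the names below are hypothetical and not taken from the patch), the portable pattern keeps the in-class initializer and adds an out-of-class definition, which older language modes require as soon as the member is odr-used:

```cpp
#include <algorithm>

struct Limits {
  // In-class declaration with an initializer.
  static const int MaxEntries = 16;
};

// Out-of-class definition. It provides the storage that is needed whenever
// the member is odr-used, for example when it is bound to a const reference.
const int Limits::MaxEntries;

// std::min takes its arguments by const reference, so this odr-uses
// Limits::MaxEntries and would fail to link without the definition above.
int clampEntries(int Requested) {
  return std::min(Requested, Limits::MaxEntries);
}
```

Calling `clampEntries(100)` yields the limit, while smaller requests pass through unchanged.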
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D94541
[libomptarget][amdgpu][nfc] Fix build on centos
rtl.cpp replaced 224 with a #define from elf.h, but that
doesn't work on a centos 7 build machine with an old elf.h
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D94528
Some LLVM headers are generated by CMake. Before the installation,
LLVM's headers are distributed everywhere, some of which are in
`${LLVM_SRC_ROOT}/llvm/include/llvm`, and some are in
`${LLVM_BINARY_ROOT}/include/llvm`. After installation, they're all in
`${LLVM_INSTALLATION_ROOT}/include/llvm`.
OpenMP now depends on LLVM headers. Some headers depend on headers generated
by CMake. When building OpenMP along with LLVM, i.e. via `LLVM_ENABLE_RUNTIMES`,
we need to tell OpenMP where it can find those headers, especially those that
have not yet been copied/installed.
Reviewed By: jdoerfert, jhuber6
Differential Revision: https://reviews.llvm.org/D94534
The lifetime of `libomptarget` and its opened plugins are not aligned
and it's hard for `libomptarget` to determine when the plugins are destroyed.
As a result, some issues (see D94256 for details) occur on some platforms.
Actually, if we take target memory as target resources, same as other resources,
such as CUDA streams, in each plugin, then the memory manager should also be in
the plugin. Also, considering that some platforms may want to opt out of the feature, it
makes sense to move the memory manager to the plugin, make it a common interface, and
let plugin developers determine whether they need it. This is what this patch does.
CUDA plugin is taken as example to show how to integrate it. In this way, we can
also get a bonus that different thresholds can be set for different platforms.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D94379
For now `elf_common.c` is taken as a common part included into
different plugin implementations directly via
`#include "../../common/elf_common.c"`, which is not a best practice. Since it
is simple enough such that we don't need to create a real library for it, we just
take it as an interface library so that other targets can link it directly. Another
advantage of this method is, we don't need to add the folder into header search
path which can potentially pollute the search path.
VE and AMD platforms have not been tested because I don't have target machines.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94443
For now, `*_STANDALONE_BUILD` is set to ON even if they're built along
with LLVM because of issues mentioned in the comments. This can cause some issues.
For example, if we build OpenMP along with LLVM, we'd like to copy those OpenMP
headers to `<prefix>/lib/clang/<version>/include` such that `clang` can find
those headers without using `-I <prefix>/include` because those headers will be
copied to `<prefix>/include` if it is built standalone.
In this patch, we fixed the dependence issue in OpenMP such that it can be built
correctly even with `OPENMP_STANDALONE_BUILD=OFF`. The issue is in the call to
`add_lit_testsuite`, where `clang` and `clang-resource-headers` are passed as
`DEPENDS`. Since we're building OpenMP along with LLVM, `clang` is set by CMake
to be the C/C++ compiler, therefore these two dependences are no longer needed,
which caused the dependence issue.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93738
Multiple `RTLInfoTy` objects are stored in a list `AllRTLs`. Since
`RTLInfoTy` contains a `std::mutex`, it is by default not a copyable object.
In order to support `AllRTLs.push_back(...)` which is currently used, a customized
copy constructor is provided. Every time we add a new data member to
`RTLInfoTy`, we must remember to add the corresponding assignment
to the copy constructor. In fact, the only use of the copy constructor is to push
the object into the list. We can instead first emplace a new object at the back
and then use a reference to the last element. In this way we
don't need the copy constructor anymore. If the element is invalid, we just need
to pop it, and that's what this patch does.
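A minimal sketch of the emplace-then-pop pattern described above. `PluginInfo`, `loadPlugin`, and `addPlugin` are hypothetical stand-ins, not the actual `RTLInfoTy` code:

```cpp
#include <list>
#include <mutex>
#include <string>

// Hypothetical stand-in for RTLInfoTy: the std::mutex member makes the type
// neither copyable nor movable.
struct PluginInfo {
  std::string Name;
  bool IsValid = false;
  std::mutex Mtx;
};

// Hypothetical validity check standing in for loading a plugin.
inline bool loadPlugin(PluginInfo &P, const std::string &Name) {
  P.Name = Name;
  P.IsValid = !Name.empty();
  return P.IsValid;
}

// Construct the element in place, fill it in through a reference, and pop it
// again if it turns out to be unusable. No copy constructor is required.
inline bool addPlugin(std::list<PluginInfo> &All, const std::string &Name) {
  All.emplace_back();
  PluginInfo &P = All.back();
  if (!loadPlugin(P, Name)) {
    All.pop_back();
    return false;
  }
  return true;
}
```

Because `std::list` never relocates its elements, the reference returned by `back()` stays valid while the element is being initialized.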
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94361
The Fugaku supercomputer is built with the Fujitsu A64FX microprocessor, whose cache line size is 256 bytes. In current libomp, we only have cache line size 128 for PPC64 and otherwise 64. This patch adds support for a cache line size of 256 for A64FX. It's worth noting that although A64FX is a variant of AArch64, this property is not shared. As a result, in light of UCX source code (392443ab92/src/ucs/arch/aarch64/cpu.c (L17)), we can only determine it by checking whether the CPU is FUJITSU A64FX.
Reviewed By: jdoerfert, Hahnfeld
Differential Revision: https://reviews.llvm.org/D93169
Summary:
Currently error messages from the CUDA plugins are only printed to the user if they have debugging enabled. Change this behaviour to always print the messages that result in offloading failure. This improves the error messages by indicating what happened when the error occurs in the plugin library, such as a segmentation fault on the device.
Reviewed by: jdoerfert
Differential Revision: https://reviews.llvm.org/D94263
Add an example to the OpenMP Documentation on the LIBOMPTARGET_INFO environment variable
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94246
Wrong LLVM headers might be included if we don't set `include_directories`
to a right place. This will cause a compilation error if LLVM is installed in
system directories.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93737
Currently all built libraries in OpenMP are scattered across the build tree if
building along with LLVM. It is not an issue if we don't execute any test. However,
almost all tests for `libomptarget` fail because in the lit configuration, we only set
`<build_dir>/libomptarget` to `LD_LIBRARY_PATH` and `LIBRARY_PATH`. Since those
libraries are everywhere, `clang` can no longer find `libomptarget.so` or those
deviceRTLs anymore.
In this patch, we set a unified path for all built libraries, no matter whether
it is built along with LLVM or not. In this way, our lit configuration can work
properly.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93736
Summary:
This patch adds more fine-grained support over which information is output from the libomptarget runtime when run with the environment variable LIBOMPTARGET_INFO set. An extensible set of flags can be used to pick and choose which information the user is interested in.
Reviewers: jdoerfert JonChesterfield grokos
Differential Revision: https://reviews.llvm.org/D93727
[libomptarget][amdgpu] Call into deviceRTL instead of ockl
Amdgpu codegen presently emits a call into ockl. The same functionality
is already present in the deviceRTL. Adds an amdgpu specific entry point
to avoid the dependency. This lets simple openmp code (specifically, that
which doesn't use libm) run without rocm device libraries installed.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D93356
This patch partially prepares the runtime source code to be built with
-Wconversion, which should trigger warnings if any implicit conversions
can possibly change a value. For builds done with icc or gcc, all such
warnings are handled in this patch. clang gives a much longer list of
warnings, particularly for sign conversions, which the other compilers
don't report. The -Wconversion flag is commented into cmake files, but
I'm not going to turn it on. If someone thinks it is important, and wants
to fix all the clang warnings, they are welcome to.
Types of changes made here involve either improving the consistency of types
used so that no conversion is needed, or else performing careful explicit
conversions, when we're sure a problem won't arise.
Patch is a combination of changes by Terry Wilmarth and Johnny Peyton.
Differential Revision: https://reviews.llvm.org/D92942
Add support to the OpenMP web pages for environment variables supported
by Libomptarget and their usage.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93723
When setting `LLVM_ENABLE_RUNTIMES`, lowercase names should be used;
otherwise, CMake can fail with an error that a specific path is not found.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D93719
After some issues about building runtimes along with LLVM were fixed,
building an OpenMP offloading capable compiler is pretty simple. This patch updates
the FAQ part in the doc.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93671
If a GPU function is externally reachable we give up trying to find the
(unique) kernel it is called from. This can hinder optimizations. Emit a
remark and explain mitigation strategies.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D93439
This patch adds CMake variables to add subdirectories and include
directories for libomptarget and explicitly gives the location of source
files.
Differential Revision: https://reviews.llvm.org/D93290
[libomptarget][nfc] Replace static const with enum
Semantically identical. Replaces 0xff... with ~0 to spare counting the f.
Has the advantage that the compiler doesn't need to prove the 4/8 byte
value dead before discarding it, and sidesteps the compilation question
associated with what static means for a single source language.
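The difference can be sketched as follows; the constant names are hypothetical. Note that this sketch spells the all-ones value `~0ULL` rather than `~0`, because an enumerator with a fixed unsigned underlying type cannot be initialized from the negative `int` that `~0` produces:

```cpp
#include <cstdint>

// A namespace-scope constant object. It is a real object, so the compiler has
// to prove it unused before it can discard the storage.
static const uint64_t AllOnesConst = 0xffffffffffffffffULL;

// An enumerator is a pure compile-time value with no storage at all.
enum : uint64_t { AllOnesEnum = ~0ULL };

// Both spellings denote the same 64-bit value.
static_assert(AllOnesEnum == 0xffffffffffffffffULL, "enumerator is all ones");

inline bool sameValue() { return AllOnesConst == AllOnesEnum; }
```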
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93328
Introduce new kmp_safe_raii_file_t class with RAII semantics for file
open/close. It is essentially a wrapper around the C-style FILE* object.
This also unifies the way we error report if a file can't be opened.
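For readers unfamiliar with the idiom, a minimal RAII file wrapper might look like the sketch below. `ScopedFile` is a hypothetical class written for illustration; the actual `kmp_safe_raii_file_t` interface and its error reporting may differ:

```cpp
#include <cstdio>

// Hypothetical minimal RAII wrapper around a C-style FILE*.
class ScopedFile {
  std::FILE *F = nullptr;

public:
  // Open the file and report failures in one central place.
  ScopedFile(const char *Path, const char *Mode) : F(std::fopen(Path, Mode)) {
    if (!F)
      std::fprintf(stderr, "cannot open file: %s\n", Path);
  }

  // Close the file automatically when the wrapper goes out of scope.
  ~ScopedFile() {
    if (F)
      std::fclose(F);
  }

  // Copying would lead to a double fclose, so it is disabled.
  ScopedFile(const ScopedFile &) = delete;
  ScopedFile &operator=(const ScopedFile &) = delete;

  explicit operator bool() const { return F != nullptr; }
  std::FILE *get() const { return F; }
};
```

A caller can then write `ScopedFile F(Path, "r"); if (!F) return;` and never needs a matching `fclose` on any exit path.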
Differential Revision: https://reviews.llvm.org/D92604
This patch enables serial initialization in the forked child process
to fix unstable runtime behavior when used with Python-based AI tools.
Differential Revision: https://reviews.llvm.org/D93230
[libomptarget][nfc] Remove data_sharing type aliasing
Libomptarget previously used __kmpc_data_sharing_slot to access values of type
__kmpc_data_sharing_{worker,master}_slot_static. This aliasing violation was
benign in practice. The master type has since been removed, so a single type
can be used instead.
This is particularly helpful for the transition to an openmp deviceRTL, as the
c++/openmp compiler for amdgcn currently rejects the flexible array member for
being an incomplete type. Serves the same purpose as abandoned D86324.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93075
This patch introduces a new RTM lock type based on spin lock which is
used for OMP lock with speculative hint on supported architecture.
Differential Revision: https://reviews.llvm.org/D92615
This patch adds a new API __kmpc_taskloop_5 to accommodate the strict
modifier (introduced in OpenMP 5.1) in the num_tasks and grainsize
clauses.
Differential Revision: https://reviews.llvm.org/D92352
KMP_AFFINITY=norespect was triggering an error because the underlying
process affinity mask was not updated to include the entire machine.
The Windows documentation states that the thread affinities must be
subsets of the process affinity. This patch also moves the printing
(for KMP_AFFINITY=verbose) of whether the initial mask was respected
out of each topology detection function and to one location where the
initial affinity mask is read.
Differential Revision: https://reviews.llvm.org/D92587
Check pointer returned by strchr, as it can be NULL in case of broken
format of input string. Introduced new function __kmp_str_loc_numbers
for fast parsing of numbers only in the location string.
Also made some cleanup of __kmp_str_loc_init declaration and usage:
- changed type of init_fname parameter to bool;
- changed input from true to false in places where fname is not used.
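The NULL check can be illustrated with the sketch below. `parseLocNumbers` is a hypothetical helper, and the `;file;func;line;col;;` layout of the location string is an assumption made for this example; it is not the code of `__kmp_str_loc_numbers`:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: extract the line and column numbers from a location
// string assumed to look like ";file;func;line;col;;". Returns false instead
// of dereferencing NULL when a separator is missing.
inline bool parseLocNumbers(const char *Loc, int &Line, int &Col) {
  const char *P = Loc;
  // Skip three semicolons to reach the line-number field.
  for (int I = 0; I < 3; ++I) {
    P = std::strchr(P, ';');
    if (!P)
      return false; // malformed input: strchr found no separator
    ++P;
  }
  Line = std::atoi(P);
  // The column follows the next separator.
  P = std::strchr(P, ';');
  if (!P)
    return false;
  Col = std::atoi(P + 1);
  return true;
}
```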
Differential Revision: https://reviews.llvm.org/D90962
D91692 missed various locations in kmp_gsupport, where the scope for
OMPT_STORE_RETURN_ADDRESS is too narrow, i.e. the scope ends before the OMPT
callback is called in some nested function.
This patch fixes the scoping issue, so that all OMPT tests pass, when the
tests are built with gcc.
Differential Revision: https://reviews.llvm.org/D92121
[libomptarget][amdgpu] Address compiler warnings, drive by fixes
Initialize some variables, remove unused ones.
Changes the debug printing condition to align with the aomp test suite.
Differential Revision: https://reviews.llvm.org/D92559
These changes add support for Intel's umonitor/umwait usage in wait
code, for architectures that support those intrinsic functions. Usage of
umonitor/umwait is off by default, but can be turned on by setting the
KMP_USER_LEVEL_MWAIT environment variable.
Differential Revision: https://reviews.llvm.org/D91189
Added UNLIKELY hint to one-time or rarely executed branches.
This improves performance of the library on some tasking benchmarks.
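A sketch of how such a hint is commonly expressed. `MY_UNLIKELY` is a hypothetical macro for this example; it assumes the GCC/Clang `__builtin_expect` builtin and falls back to a plain expression elsewhere:

```cpp
// Hypothetical branch hint macro: tells the compiler the condition is
// expected to be false so the common path can be laid out as fall-through.
#if defined(__GNUC__) || defined(__clang__)
#define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define MY_UNLIKELY(x) (x)
#endif

// Sum the non-negative entries; negative entries are assumed to be rare.
inline int sumNonNegative(const int *Data, int N) {
  int Sum = 0;
  for (int I = 0; I < N; ++I) {
    if (MY_UNLIKELY(Data[I] < 0))
      continue; // rarely executed path
    Sum += Data[I];
  }
  return Sum;
}
```

The hint does not change the result, only the code layout the compiler is encouraged to produce.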
Differential Revision: https://reviews.llvm.org/D92322
With the change to using shared memory, there were a few problems that needed to be fixed.
- The previous filename that was used for SHM only used the process id. Given that the process id is
usually a 16-bit number, this was causing some conflicts on machines. Thus we add
the UID to the name to prevent this.
- It appears that under some conditions (SIGTERM, etc.) the shared memory files were not getting
cleaned up. Added a call to clean up the shm files under those conditions. For this, the user
needs to set the environment variable KMP_HANDLE_SIGNALS to true.
Patch by Erdner, Todd <todd.erdner@intel.com>
Differential Revision: https://reviews.llvm.org/D91869
Since __kmp_task_finish is not executed for proxy tasks,
move mutexinoutset dependency code to __kmp_release_deps
which is executed for all task kinds.
Differential Revision: https://reviews.llvm.org/D92326
[libomptarget][cuda] Detect missing symbols in plugin at build time
Passes -z,defs to the linker. Error on unresolved symbol references.
Otherwise, those unresolved symbols present as target code running on the host
as the plugin fails to load. This is significantly harder to debug than a link
time error. Flag matches that passed by amdgcn and ve plugins.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D92143
This is an alternative approach to address inconsistencies pointed out in: D90078
This patch makes sure that the return address is reset, when leaving the scope.
In some cases, I had to move the macro out of an if-statement to have it in the
right scope, in some cases I added an additional block to restrict the scope.
This patch does not handle inconsistencies, which might occur if the return
address is still set when we call into the application.
Test case (repeated_calls.c) provided by @hbae
Differential Revision: https://reviews.llvm.org/D91692
OpenMP 5.1 introduces the new env variable
OMP_TOOL_VERBOSE_INIT=(disabled|stdout|stderr|<filename>) to enable verbose
loading and initialization of OMPT tools.
This env variable helps to understand the cause when loading of a tool fails
(e.g., undefined symbols or dependency not in LD_LIBRARY_PATH)
Output of OMP_TOOL_VERBOSE_INIT is added for OMP_DISPLAY_ENV
Tests for this patch are integrated into the different existing tool loading
tests, making these tests more verbose. An Archer specific verbose test is
integrated into an existing Archer test.
Patch prepared by: Isabel Thärigen
Differential Revision: https://reviews.llvm.org/D91464
Adjusted external reference for Darwin/AARCH64 link compatibility.
Made size directive conditional only if __ELF__ defined.
Patch by Michael_Pique <mpique@icloud.com>
Differential Revision: https://reviews.llvm.org/D88252
This patch is the runtime support for https://reviews.llvm.org/D84192.
In order not to modify the tgt_target_data_update information but still be
able to pass the extra information for a non-contiguous map item (offset,
count, and stride for each dimension), this patch overloads arg when
the maptype is set as OMP_TGT_MAPTYPE_DESCRIPTOR. The original arg is for
passing the pointer information; the overloaded arg, however, is an
array of descriptor_dim:
```
struct descriptor_dim {
int64_t offset;
int64_t count;
int64_t stride;
};
```
and the array size is the dimension size. In addition, since we
have count and stride information in descriptor_dim, we can replace/overload the
arg_size parameter by using dimension size.
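To make the descriptor concrete, the sketch below enumerates the byte offsets selected by an array of `descriptor_dim`. It assumes that `offset` and `count` are measured in elements of the given dimension and that `stride` is the distance in bytes between consecutive elements of that dimension; these unit conventions and the helper `elementOffsets` are assumptions for illustration, not the runtime's implementation:

```cpp
#include <cstdint>
#include <vector>

struct descriptor_dim {
  int64_t offset;
  int64_t count;
  int64_t stride;
};

// Hypothetical helper: expand the per-dimension descriptors into the list of
// byte offsets (relative to the base pointer) that the map item covers.
inline std::vector<int64_t>
elementOffsets(const std::vector<descriptor_dim> &Dims) {
  std::vector<int64_t> Result{0};
  for (const descriptor_dim &D : Dims) {
    std::vector<int64_t> Next;
    for (int64_t Base : Result)
      for (int64_t I = 0; I < D.count; ++I)
        Next.push_back(Base + (D.offset + I) * D.stride);
    Result = Next;
  }
  return Result;
}
```

Under these assumptions, selecting rows 1..2 and columns 2..3 of an `int[4][5]` array (4-byte `int`) is described by `{{1, 2, 20}, {2, 2, 4}}` and expands to the byte offsets 28, 32, 48, and 52.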
Reviewed By: grokos, tianshilei1992
Differential Revision: https://reviews.llvm.org/D82245
Summary:
Add support for passing source locations to libomptarget runtime functions using the ident_t struct present in the rest of the libomp API. This will allow the runtime system to give much more insightful error messages and debugging values.
Reviewers: jdoerfert grokos
Differential Revision: https://reviews.llvm.org/D87946
Summary:
This patch adds basic support for printing the source location and names for the mapped variables. This patch does not support names for custom mappers. This is based on D89802.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D90172
Summary:
This patch adds support for passing in the original declaration name in the source file to the libomptarget runtime. This will allow the runtime to provide more intelligent debugging messages. This patch takes the original expression parsed from the OpenMP map / update clause and provides a textual representation if it was explicitly mapped, otherwise it takes the name of the variable declaration as a fallback. The information is passed to the runtime in a global array of strings that matches the existing ident_t source location strings using ";name;filename;column;row;;"
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D89802
This introduces the new `ARCHER_OPTIONS` flag `ignore_serial=0|1` to disable
analysis and logging of memory accesses in the sequential part of the OpenMP
application.
In the sequential part of an OpenMP program no data race is possible, unless
there is non-OpenMP concurrency (such as pthreads, MPI, ...). For the latter
reason, this is not active by default.
Besides reducing the runtime overhead for the sequential part of the program,
this reduces the memory overhead for sequential initialization. In combination
with `flush_shadow=1` this can allow analysis of applications, which run close
to the limit of available memory, but only access smaller parts of shared
memory during each OpenMP parallel region.
A problem for this approach is that Archer only gets active, when the OpenMP
runtime gets initialized, which might be after serial initialization of the
application. In such case, it helps to call for example `omp_get_max_threads()`
at the beginning of main.
Differential Revision: https://reviews.llvm.org/D90473
OpenMP 5.1 adds an extra enum entry for ompt_scope_t, which makes the related
switch statement incomplete.
Also adding cases for newly added barrier variants.
Differential Revision: https://reviews.llvm.org/D90758
Currently the affinity format string has an initial value. When users set
the format via OMP_AFFINITY_FORMAT, it overwrites the format string. However,
when copying the format, the trailing null is missing. As a result, if the user
format string is shorter than the default value, the remaining part of the default
value still takes effect. This bug was not exposed because the test case doesn't
check the end of a string. It only checks whether the given output "contains" the
check string.
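The failure mode can be reproduced in isolation with the sketch below. Both helper functions are hypothetical and only illustrate the missing terminator; they are not the runtime's copy routine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Buggy variant: copies the characters but not the terminator, so the tail
// of whatever was in the buffer before remains visible.
inline void copyWithoutTerminator(char *Dst, const char *Src) {
  std::memcpy(Dst, Src, std::strlen(Src));
}

// Fixed variant: truncates to the buffer capacity and always terminates.
inline void copyWithTerminator(char *Dst, std::size_t Cap, const char *Src) {
  assert(Cap > 0 && "destination must have room for the terminator");
  std::size_t N = std::strlen(Src);
  if (N >= Cap)
    N = Cap - 1;
  std::memcpy(Dst, Src, N);
  Dst[N] = '\0';
}
```

With a buffer that initially holds `default-format`, copying `abc` through the buggy variant leaves `abcault-format`, which still "contains" `abc` and so passes a substring check.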
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D91309
Summary:
This patch begins to add support for a set of scripts that can be used to get information from OpenMP programs to better describe problems and eventually show the data to the user in formatted output. Right now the only support is for formatting the register and memory usage reports from ptxas and nvlink. This is simply done as a wrapper around clang and clang++.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D91085
The deadlock/race happens when primary thread gets initz lock and tries to join
the worker thread which waits for the same lock in TLS key destructor.
The patch removes the lock and the code of setting TLS value which needed
the lock. Also removed setting TLS from __kmp_unregister_root_current_thread.
Differential Revision: https://reviews.llvm.org/D90647
This patch allows the OpenMP runtime tests to pass after configuring with
`cmake . -DOPENMP_TEST_FLAGS:STRING="-Werror"`.
The warnings for OMPT tests are addressed in D90752.
Differential Revision: https://reviews.llvm.org/D91280
This doesn't add functionality, but just adds the new types and renames the
master callback to masked callback.
Differential Revision: https://reviews.llvm.org/D90752
Modern Fortran compilers support Fortran 90, so we do not need to use
the source code for Fortran compilers that do not support Fortran 90.
Differential Revision: https://reviews.llvm.org/D90077
This patch is the runtime support for https://reviews.llvm.org/D84192.
In order not to modify the tgt_target_data_update information but still be
able to pass the extra information for a non-contiguous map item (offset,
count, and stride for each dimension), this patch overloads arg when
the maptype is set as OMP_TGT_MAPTYPE_DESCRIPTOR. The original arg is for
passing the pointer information; the overloaded arg, however, is an
array of descriptor_dim:
```
struct descriptor_dim {
int64_t offset;
int64_t count;
int64_t stride;
};
```
and the array size is the dimension size. In addition, since we
have count and stride information in descriptor_dim, we can replace/overload the
arg_size parameter by using dimension size.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D82245
This patch fixes potential division by 0 in case hwloc does not
recognize cores (or architecture has no cores).
Patch by Andrey Churbanov
Differential Revision: https://reviews.llvm.org/D90954
This patch adds the mask and ident_t function to get the
OpenMP version. It also adds logic to force monotonic:dynamic
behavior when the OpenMP version is less than 5.0.
The OpenMP version is stored in the format:
major*10+minor e.g., OpenMP 5.0 = 50
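The encoding and the version check can be sketched as follows. The function names are hypothetical and do not correspond to the runtime's actual mask or `ident_t` accessor:

```cpp
// Hypothetical helper: encode an OpenMP version as major*10+minor.
constexpr int encodeOpenMPVersion(int Major, int Minor) {
  return Major * 10 + Minor;
}

// Hypothetical predicate: versions older than 5.0 get the forced
// monotonic:dynamic behavior described above.
constexpr bool forcesMonotonicDynamic(int EncodedVersion) {
  return EncodedVersion < encodeOpenMPVersion(5, 0);
}

static_assert(encodeOpenMPVersion(5, 0) == 50, "OpenMP 5.0 encodes as 50");
static_assert(forcesMonotonicDynamic(encodeOpenMPVersion(4, 5)),
              "OpenMP 4.5 is older than 5.0");
```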
Differential Revision: https://reviews.llvm.org/D90632
The macros are used in several places with an if(macro) pattern. This results
in several warnings about extraneous parentheses in equality comparison.
Having the constant at the lhs of the comparison avoids this warning.
Differential Revision: https://reviews.llvm.org/D90756
There is a non-conforming use of variable-sized array in the test case `parallel_offloading_map.c`. This patch fixed it.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D90642
As reported by @ronlieb, the test shows intermittent failures.
The test failed, if the dependent task was already finished, when the depending
task was to be created. We have other tests to check for the dependences pair.
Since detached tasks are supported by clang and the OpenMP runtime, Archer
must expect to receive the corresponding callbacks.
This patch adds support to interpret the synchronization semantics of
omp_fulfill_event and cleans up the handling of task switches.
Presently, there are a number of global variables in libomptarget (devices,
RTLs, tables, mutexes, etc.) that are not placed within a struct. This
patch places them into a struct ``PluginManager``. All of the functions
that act on this data remain free.
Differential Revision: https://reviews.llvm.org/D90519
This adds some initial content as well as structure to the new OpenMP
Sphinx documentation hosted at http://openmp.llvm.org/docs/ .
The content contains some useful links but most pages are still empty.
This uses a "custom" theme which is a copy of the default "agogo" one
with minor modifications to get a nicer table of content in the sidebar.
This way we can also adjust the theme as we go.
Reviewed By: jhuber6, JonChesterfield
Differential Revision: https://reviews.llvm.org/D90256
[AMDGPU] Add __builtin_amdgcn_grid_size
Similar to D76772, loads the data from the dispatch pointer. Marked invariant.
Patch also updates the openmp devicertl to use this builtin.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D90251
Summary:
This patch adds basic support for printing the source location and names for the
mapped variables. This patch does not support names for custom mappers. This is
based on D89802. The names information currently will be printed out only in
debug mode or using env LIBOMPTARGET_INFO during execution. But the information
is added when available to the Device and Private data structures. To get the
information out, the code must be built with debug symbols on using -g or
-Rpass=openmp-opt
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D90172
Summary:
This patch adds support for passing in the original declaration name in the
source file to the libomptarget runtime. This will allow the runtime to provide
more intelligent debugging messages. This patch takes the original expression
parsed from the OpenMP map / update clause and provides a textual
representation if it was explicitly mapped, otherwise it takes the name of the
variable declaration as a fallback. The information is passed to the runtime in
a global array of strings that matches the existing ident_t source location
strings using ";name;filename;column;row;;". See
clang/test/OpenMP/target_map_names.cpp for an example of the generated output
for a given map clause.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D89802
The implementation of target nowait just wraps the target region into a task. The essential four parameters (base ptr, ptr, size, mapper) are taken as firstprivate such that they will be copied to the private location. When there is no user-defined mapper, the mapper variable will be nullptr. However, it will be still copied to the corresponding place. Therefore, a memcpy will be generated and the source pointer will be nullptr, causing a segmentation fault. The root cause is when calling `emitOffloadingArraysArgument`, the last argument `Options` has a field about whether it requires a task. It only takes depend clause into account. In this patch, the nowait clause is also included.
There are two things that will be done in other patches:
1. target data nowait has not been supported yet. D90099 added the support.
2. When there is no mapper, the mapper array can be nullptr no matter whether it requires outer task or not. It can avoid an unnecessary data copy. This is an optimization that is covered in D90101.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D89844
`size_t` has a different width on 32- and 64-bit architectures, but the
computation to floor to a power of two assumed it is 64-bit, which can cause an
integer overflow. In this patch, architecture detection is added so that the
64-bit operation is applied only when `size_t` is 64-bit. Thanks to Luke for
reporting the issue.
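A width-aware version of the bit trick can be sketched as below. This is an illustration under the assumption that the width is detected through `SIZE_MAX`; the patch may use a different detection mechanism, and `floorToPowerOfTwo` here is not copied from the source:

```cpp
#include <cstddef>
#include <cstdint>

// Round Num down to the nearest power of two (0 stays 0).
inline std::size_t floorToPowerOfTwo(std::size_t Num) {
  // Smear the highest set bit into every lower position.
  Num |= Num >> 1;
  Num |= Num >> 2;
  Num |= Num >> 4;
  Num |= Num >> 8;
  Num |= Num >> 16;
#if SIZE_MAX > 0xffffffffu
  // Only valid when size_t is wider than 32 bits; shifting a 32-bit value
  // by 32 would be undefined behavior.
  Num |= Num >> 32;
#endif
  // Num is now 2^k - 1; subtracting its lower half leaves 2^(k-1) and cannot
  // wrap around even when the top bit is set.
  return Num - (Num >> 1);
}
```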
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D89878
[libomptarget] Require LLVM source tree to build libomptarget
This is to permit reliably #including files from the LLVM tree in libomptarget,
as an improvement on the copy and paste that is currently in use. See D87841
for the first example of removing duplication given this new requirement.
The weekly openmp dev call reached consensus on this approach. See also D87841
for some alternatives that were considered. In the future, we may want to
introduce a new top level repo for shared constants, or start using the ADT
library within openmp.
This will break sufficiently exotic build systems, trivial fixes as below.
Building libomptarget as part of the monorepo will continue to work.
If openmp is built separately, it now requires a cmake macro indicating
where to find the LLVM source tree.
If openmp is built separately, without the llvm source tree already on disk,
the build machine will need a copy of a subset of the llvm source tree and
the cmake macro indicating where it is.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D89426
[libomptarget][amdgcn] Refactor memcpy to eliminate maps
Builds on D89776 to remove now dead code.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D89888
The calls to atmi_memcpy presently determine the direction of copy (host to
device or device to host) by storing pointers in a map during malloc and
looking up the pointers during memcpy. As each call site already knows the
direction, this stash+lookup can be eliminated.
This NFC will be followed by a functional one that deletes those map lookups.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D89776
Change-Id: I1d9089bc1e56b3a9a30e334735fa07dee1f84990
[libomptarget][amdgcn] Implement missing symbols in deviceRTL
Malloc, wtime are stubs. Malloc needs a hostrpc implementation which is
a work in progress, wtime needs some experimentation to find out the
multiplier to get a time in seconds as documentation is scarce.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D89725
This patch fixes a problem whereby the pointee object of a PTR_AND_OBJ entry with a `map(to)` motion clause can be overwritten on the device even if its reference counter is >=1.
Currently, we check the reference counter of the parent struct in order to determine whether the motion clause should be respected, but since the pointee object is not part of the struct, it's got its own reference counter which should be used to enqueue the copy or discard it.
The same behavior has already been implemented in targetDataEnd (omptarget.cpp:539-540), but we somehow missed doing the same in targetDataBegin.
Differential Revision: https://reviews.llvm.org/D89597
[openmp][libomptarget] Include header from LLVM source tree
The change is to the amdgpu plugin so is unlikely to break anything.
The point of contention is whether libomptarget can depend on LLVM.
A community discussion was cautiously not opposed yesterday.
This introduces a compile time dependency on the LLVM source tree, in this case
expressed as skipping the building of the plugin if LLVM_MAIN_INCLUDE_DIR is not
set. One of the source files will #include llvm/Frontend/OpenMP/OMPGridValues.h,
instead of copy&pasting the numbers across.
For users that download the monorepo, the llvm tree is already on disk. This will
inconvenience users who download only the openmp source as a tar, as they would
now also have to download (at least a file or two) from the llvm source, if they want
to build the parts of the openmp project that (post this patch) depend on llvm.
There was interest expressed in going further - using llvm tools as part of
building libomp, or linking against llvm libraries. That seems less clear cut
an improvement and worthy of further discussion. This patch seeks only to change
policy to support openmp depending on the llvm source tree. Including in the
other direction, or using libraries / tools etc, are purposefully out of scope.
Reviewers are a best guess at interested parties, please feel free to add others
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D87841
[libomptarget][amdgcn] Implement partial barrier
named_sync is used to coordinate non-spmd kernels. This uses bar.sync on nvptx.
There is no corresponding ISA support on amdgcn, so this is implemented using
shared memory, one word initialized to zero.
Each wave increments the variable by one. Whichever wave is last is responsible
for resetting the variable to zero, at which point it and the others continue.
The race condition on a wave reaching the barrier before another wave has
noticed that it has been released is handled with a generation counter, packed
into the same word.
Uses a shared variable that is not needed on nvptx. Introduces a new hook,
kmpc_impl_target_init, to allow different targets to do extra initialization.
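The counting scheme described above can be sketched on the host as follows. This is only an illustration under stated assumptions: `std::atomic` and host threads stand in for the shared-memory word and the waves, the 16/16 bit split is an arbitrary choice, and `partialBarrierSketch` is a hypothetical name, not the deviceRTL function.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Low 16 bits: number of arrivals in the current round.
// High 16 bits: generation counter, bumped each time the barrier releases.
constexpr uint32_t CountMask = 0xFFFFu;
constexpr uint32_t GenerationOne = 0x10000u;

// Blocks until NumWaiters callers have arrived. Word must start at zero.
void partialBarrierSketch(std::atomic<uint32_t> &Word, uint32_t NumWaiters) {
  assert(NumWaiters > 0 && NumWaiters <= CountMask);
  uint32_t Old = Word.fetch_add(1, std::memory_order_acq_rel);
  uint32_t Generation = Old & ~CountMask;
  if ((Old & CountMask) + 1 == NumWaiters) {
    // Last arrival: clear the count and advance the generation in a single
    // atomic update (subtract NumWaiters from the count, add one generation).
    Word.fetch_add(GenerationOne - NumWaiters, std::memory_order_acq_rel);
    return;
  }
  // Everyone else waits for the generation to change. A caller that has
  // already re-entered the barrier only touches the count bits, so slow
  // waiters from the previous round still observe the new generation.
  while ((Word.load(std::memory_order_acquire) & ~CountMask) == Generation)
    std::this_thread::yield();
}
```

Packing both fields into one word means the reset and the release happen in one atomic operation, which is what closes the race between a fast re-entering caller and a waiter that has not yet noticed the release.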
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D88602
Summary:
This patch changes the CMake files for Clang and Libomptarget to query the
system for its supported CUDA architecture. This makes it much easier for the
user to build optimal code without needing to set the flags manually. This
relies on the now deprecated FindCUDA method in CMake, but full support for
architecture detection is only available in CMake >3.18.
Reviewers: jdoerfert ye-luo
Subscribers: cfe-commits guansong mgorny openmp-commits sstefan1 yaxunl
Tags: #clang #OpenMP
Differential Revision: https://reviews.llvm.org/D87946
The test disables suppression and therefore sometimes triggers a known false
positive in the openmp runtime. The test should only verify that the env
var is handled as expected.
The worker thread can start execution of the task before creation of the second task
Fixes the spurious failure reported in https://reviews.llvm.org/D61657
Currently, the parser used to tokenize the TSAN_OPTIONS in libomp uses
only spaces as separators, even though TSAN in compiler-rt supports
other separators like ':' or ','.
CTest uses ':' to separate sanitizer options by default.
The documentation for other sanitizers mentions ':' as separator,
but TSAN only lists spaces, which is probably where this mismatch originated.
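A small sketch of tokenizing with several separators. It is not the libomp parser: `splitSanitizerOptions` is a hypothetical helper, and the separator set (space, ':' and ',') is taken from the description above rather than from the compiler-rt sources.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split an options string on any of the accepted separator characters,
// skipping empty tokens produced by repeated separators.
std::vector<std::string> splitSanitizerOptions(const std::string &Options) {
  const std::string Separators = " :,";
  std::vector<std::string> Tokens;
  std::string::size_type Begin = Options.find_first_not_of(Separators);
  while (Begin != std::string::npos) {
    std::string::size_type End = Options.find_first_of(Separators, Begin);
    std::string::size_type Length =
        End == std::string::npos ? std::string::npos : End - Begin;
    Tokens.push_back(Options.substr(Begin, Length));
    Begin = Options.find_first_not_of(Separators, End);
  }
  return Tokens;
}
```

With this, an options string that mixes separators, such as `a=1:b=2,c=3 d=4`, yields four tokens.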
Patch provided by upsj
Differential Revision: https://reviews.llvm.org/D87144
This patch updates the expected results for the GOMP interface patches: D87267, D87269, and D87271.
The taskwait-depend test is changed to really use taskwait-depend and copied to a task_if0-depend test.
To pass the tests, the handling of the return address was fixed.
Differential Revision: https://reviews.llvm.org/D87680
OpenMP 5.1 defines omp_get_initial_device to return the same value as omp_get_num_devices.
Since this change is also 5.0 compliant, no versioning is needed.
Differential Revision: https://reviews.llvm.org/D88149
[nfc][libomptarget] Drop parameter to named_sync
named_sync has one call site (in sync.cu) where it always passed L1_BARRIER.
Folding this into the call site and dropping the macro is a simplification.
amdgpu doesn't have ptx' bar.sync instruction. A correct implementation of
__kmpc_impl_named_sync in terms of shared memory is much easier if it can
assume that the barrier argument is this constant. Said implementation is left
for a second patch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D88474
Summary:
Adding a missing directory needed for generating Sphinx documentation without
errors. The directory currently contains a placeholder image just to populate the
directory.
It allows customizing MAX_SM for non-flagship GPU and reduces graphic memory usage.
In addition, so far the size is hard-coded up to __CUDA_ARCH__ 700 and is already a hassle for 800.
Introduce MAX_SM for 800 and protect future arch
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D88185
This change introduces the GOMP_taskwait_depend() function. It implements
the OpenMP 5.0 feature of #pragma omp taskwait with depend() clause by
wrapping around __kmpc_omp_wait_deps().
Differential Revision: https://reviews.llvm.org/D87269
Encapsulate GOMP task dependencies in separate class and introduce the
new mutexinoutset dependency type. This separate class allows
future GOMP task APIs easier access to the task dependency functionality
and better ability to propagate new dependency types to all existing GOMP
task APIs which use task dependencies.
Differential Revision: https://reviews.llvm.org/D87267
Implement GOMP_teams_reg() function which enables GOMP support of the
standalone teams construct. The GOMP_parallel* functions were modified
to call __kmp_fork_call() unconditionally so that the teams-specific
code could be reused within __kmp_fork_call() instead of reproduced
inside the GOMP_* functions.
Differential Revision: https://reviews.llvm.org/D87167
If an error code cannot be recognized by cuGetErrorString, errStr remains null and causes a crash when it is printed via DP().
Protect this case.
Reviewed By: jhuber6, tianshilei1992
Differential Revision: https://reviews.llvm.org/D87980
Summary:
Adding support for generated html documentation for OpenMP. Changing
Cmake files to build the documentation and adding the base templates for
future documentation to be added.
Reviewers: jdoerfert
Subscribers: aaron.ballman arphaman guansong mgorny openmp-commits sstefan1 yaxunl
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D87797
Summary:
This patch adds additional support for printing information from Libomptarget for
already existing maps and printing the final data mapped on the device at
device destruction.
Reviewers: jdoerfort gkistanova
Subscribers: guansong openmp-commits sstefan1 yaxunl
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D87722
Summary:
This patch starts adding support for adding information dumps to libomptarget
and rtl plugins. The information printing is controlled by the
LIBOMPTARGET_INFO environment variable introduced in D86483. The goal of this
patch is to provide the user with additional information about the device
during kernel execution and providing the user with information dumps in the
case of failure. This patch added the ability to dump the pointer mapping table
as well as printing the number of blocks and threads in the cuda RTL.
Reviewers: jdoerfort gkistanova ye-luo
Subscribers: guansong openmp-commits sstefan1 yaxunl ye-luo
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D87165
The size of worker_rootS should have been DS_Max_Warp_Number.
This reduces memory usage by deviceRTL on AMDGPU from around 2.3GB
to around 770MB.
Reviewed By: JonChesterfield, jdoerfert
Differential Revision: https://reviews.llvm.org/D87084
Summary:
This patch consolidates the error handling and messaging routines to a single
file omptargetmessage. The goal is to simplify the error handling interface
prior to adding more error handling support
Reviewers: jdoerfert grokos ABataev AndreyChurbanov ronlieb JonChesterfield ye-luo tianshilei1992
Subscribers: danielkiss guansong jvesely kerbowa nhaehnle openmp-commits sstefan1 yaxunl
PrivateArgumentManager shall immediately allocate firstprivates if they
are bases for the next parameters and the next parameters rely on the
fact that the base must be allocated already.
Differential Revision: https://reviews.llvm.org/D86781
The test command in `private_mapping.c` was set to expect failure by mistake. It is fixed in this patch.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D86758
Summary:
This patch changes the libomptarget runtime to always emit debug messages that
occur before offloading failure. The goal is to provide users with information
about why their application failed in the target region rather than a single
failure message. This is only done in regions that precede offloading failure
so this should not impact runtime performance. If the debug environment
variable is set then the message is forwarded to the debug output as usual.
A new environment variable was added for future use but does nothing in this
current patch. LIBOMPTARGET_INFO will be used to report runtime information to
the user if requested, such as grid size, SPMD usage, or data mapping. It will
take an integer indicating the level of information verbosity and a value of 0
will disable it.
Reviewers: jdoerfort
Subscribers: guansong sstefan1 yaxunl ye-luo
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D86483
In this patch, we pack all small first-private arguments, then allocate and transfer them all at once to reduce the number of data transfers, which are very expensive.
Let's take the test case as an example.
```
int main() {
  int data1[3] = {1}, data2[3] = {2}, data3[3] = {3};
  int sum[16] = {0};
#pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3)
  for (int i = 0; i < 16; ++i) {
    for (int j = 0; j < 3; ++j) {
      sum[i] += data1[j];
      sum[i] += data2[j];
      sum[i] += data3[j];
    }
  }
}
```
Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:
1. First collect all information for the *first*-private arguments. _private_ arguments are not considered because they don't need to be mapped to the target device; they only need a data allocation. With the patch for the memory manager, that data allocation can be very cheap, especially for small sizes. For each qualified argument, push a placeholder pointer `nullptr` to the `vector` of kernel arguments; we will update them later.
2. After we have all the information, create a buffer that can accommodate all arguments plus their padding. Copy the arguments into the buffer at the right place, i.e. at an aligned address.
3. Allocate target memory of the same size as the host buffer, transfer the host buffer to the target device, and finally update all placeholder pointers in the arguments `vector`.
The reason we only consider small arguments is that the data transfer is asynchronous. For a large argument, we can therefore continue to do work on the host side while, hopefully, the data is being transferred. An argument is "small" if its size is less than a predefined value, currently 1024. I'm not sure whether that is a good value, and it is an open question. Another question is whether we need to make it configurable via an environment variable.
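The host-side packing step can be sketched roughly as below. This is an illustration only: `FirstPrivateArg`, `alignUp`, and `packFirstPrivates` are hypothetical names, and the device allocation and transfer in step 3 are left out, with the returned offsets standing in for the placeholder pointers that would be patched afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One first-private argument on the host (hypothetical type).
struct FirstPrivateArg {
  const void *HostPtr;
  size_t Size;
};

constexpr size_t Alignment = 8;

// Round N up to the next multiple of Alignment.
inline size_t alignUp(size_t N) {
  return (N + Alignment - 1) / Alignment * Alignment;
}

// Copy every argument into one contiguous buffer at an aligned offset.
// Offsets[I] is where argument I starts inside the returned buffer.
std::vector<char> packFirstPrivates(const std::vector<FirstPrivateArg> &Args,
                                    std::vector<size_t> &Offsets) {
  Offsets.clear();
  size_t Total = 0;
  for (const FirstPrivateArg &Arg : Args) {
    assert(Arg.HostPtr != nullptr);
    Offsets.push_back(Total);
    Total += alignUp(Arg.Size); // argument size plus padding
  }
  std::vector<char> Buffer(Total, 0);
  for (size_t I = 0; I < Args.size(); ++I)
    std::memcpy(Buffer.data() + Offsets[I], Args[I].HostPtr, Args[I].Size);
  return Buffer;
}
```

For the three 12-byte arrays in the test case, this produces a 48-byte buffer with the arguments at offsets 0, 16, and 32, matching the `(12+4)*3=48` computation above.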
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D86307
Instead of copying and pasting the same `#ifdef` expressions in multiple
places, define a type and a pair of macros in `kmp_os.h`, to handle
whether `va_list` is pointer-like or not:
* `kmp_va_list` is the type to use for `__kmp_fork_call()`
* `kmp_va_deref()` dereferences a `va_list`, if necessary
* `kmp_va_addr_of()` takes the address of a `va_list`, if necessary
Also add FreeBSD to the list of OSes that has a non pointer-like
va_list. This can now be easily extended to other OSes too.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D86397
This patch introduces a target memory manager, which manages target memory so
that buffers are not freed immediately when they are no longer used, because
the overhead of memory allocation and free is very large. For CUDA devices,
cuMemFree even blocks the context switch on the device, which affects
concurrent kernel execution.
The memory manager can be taken as a memory pool. It divides the pool into
multiple buckets according to the size such that memory allocation/free
distributed to different buckets will not affect each other.
In this version, we use the exact-equality policy to find a free buffer. This
is an open question: will best-fit work better here? IMO, best-fit is not good
for target memory management because computation on GPU usually requires GBs of
data. Best-fit might lead to a serious waste. For example, there is a free
buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit,
the free buffer will be returned, leading to a 760MB waste.
The allocation will happen when there is no free memory left, and the memory
free on device will take place in the following two cases:
1. The program ends. Obviously. However, there is a small problem: the plugin
library is destroyed before the memory manager is destroyed, so the call to
the target plugin will not succeed.
2. The device is out of memory when we request new memory. The manager will walk
through all free buffers starting from the bucket with the largest base size,
pick one buffer, free it, and try to allocate immediately. If it succeeds, it
will return right away rather than freeing all buffers in the free list.
Update:
A threshold (8KB by default) is set such that users could control what size of memory
will be managed by the manager. It can also be configured by an environment variable
`LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`.
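A much simplified sketch of a bucketed pool with exact-size reuse follows. It is not the libomptarget implementation: `TargetPoolSketch` is a hypothetical class, `std::malloc`/`std::free` stand in for the device allocation calls, the power-of-two bucketing is an assumption, the caller passes the size back on release, and the out-of-memory fallback from case 2 is omitted. It also assumes that requests larger than the threshold bypass the pool.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

class TargetPoolSketch {
public:
  explicit TargetPoolSketch(size_t Threshold = 8 * 1024)
      : Threshold(Threshold) {}

  // Free everything still cached in the pool when the pool goes away.
  ~TargetPoolSketch() {
    for (auto &Bucket : Buckets)
      for (auto &Entry : Bucket)
        std::free(Entry.second);
  }

  void *allocate(size_t Size) {
    if (Size == 0)
      return nullptr;
    if (Size > Threshold)
      return std::malloc(Size); // too large: not managed by the pool
    auto &Bucket = Buckets[bucketIndex(Size)];
    auto It = Bucket.find(Size); // exact-equality policy
    if (It != Bucket.end()) {
      void *Ptr = It->second;
      Bucket.erase(It);
      return Ptr;
    }
    return std::malloc(Size);
  }

  // Return a buffer to the pool instead of freeing it immediately.
  void release(void *Ptr, size_t Size) {
    if (Ptr == nullptr)
      return;
    if (Size > Threshold) {
      std::free(Ptr);
      return;
    }
    Buckets[bucketIndex(Size)].emplace(Size, Ptr);
  }

private:
  // Bucket I holds sizes in [2^I, 2^(I+1)).
  static size_t bucketIndex(size_t Size) {
    size_t Index = 0;
    while (Size >>= 1)
      ++Index;
    return Index;
  }

  size_t Threshold;
  // 64 buckets cover every possible size_t value on 64-bit hosts.
  std::array<std::multimap<size_t, void *>, 64> Buckets;
};
```

Releasing a 256-byte buffer and then requesting 256 bytes again returns the cached buffer, while a 255-byte request does not reuse it, which is the exact-equality behavior discussed above.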
Reviewed By: jdoerfert, ye-luo, JonChesterfield
Differential Revision: https://reviews.llvm.org/D81054
This patch contains the following changes:
1. Renamed the function `DeviceTy::data_exchange` to `DeviceTy::dataExchange`;
2. Changed the second argument `DeviceTy DstDev` to `DeviceTy &DstDev`;
3. Renamed the last argument.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D86238
Instead of calling `cuFuncGetAttribute` with
`CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation,
we can do it for the first one and cache the result as part of the
`KernelInfo` struct. The only functional change is that we now expect
`cuFuncGetAttribute` to succeed and otherwise propagate the error.
Ignoring any error seems like a slippery slope...
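The caching pattern can be sketched without the CUDA driver as below. `KernelInfoSketch` and `getMaxThreadsPerBlock` are hypothetical names, and the `Query` callback is an assumed stand-in for the `cuFuncGetAttribute` call; the real `KernelInfo` struct and plugin code may look different.

```cpp
#include <cassert>
#include <functional>

struct KernelInfoSketch {
  int MaxThreadsPerBlock = -1; // -1 means "not queried yet"
};

// Query the attribute on first use, then serve it from the cache.
// Returns false and leaves the cache untouched if the query fails, so the
// caller can propagate the error instead of ignoring it.
bool getMaxThreadsPerBlock(KernelInfoSketch &Info,
                           const std::function<bool(int &)> &Query,
                           int &Result) {
  if (Info.MaxThreadsPerBlock < 0) {
    int Value = 0;
    if (!Query(Value))
      return false;
    Info.MaxThreadsPerBlock = Value;
  }
  Result = Info.MaxThreadsPerBlock;
  return true;
}
```

Repeated launches of the same kernel then trigger only one query, and a failing query is reported to the caller on every attempt until it succeeds.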
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D86038
[libomptarget] Implement host plugin for amdgpu
Replacement for D71384. Primary difference is inlining the dependency on atmi
followed by extensive simplification and bugfixes. This is the latest version
from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with
minor patches and a rename from hsa to amdgpu, on the basis that this can't be
used by other implementations of hsa without additional work.
This will not build unless the ROCM_DIR variable is passed so won't break other
builds. That variable is used to locate two amdgpu specific libraries that ship
as part of rocm:
libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface
libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime
These libraries build from source. The build scripts in those repos are for
shared libraries, but can be adapted to statically link both into this plugin.
There are caveats.
- This works well enough to run various tests and benchmarks, and will be used
to support the current clang bring up
- It is adequately thread safe for the above but there will be races remaining
- It is not stylistically correct for llvm, though has had clang-format run
- It has suboptimal memory management and locking strategies
- The debug printing / error handling is inconsistent
I would like to contribute this pretty much as-is and then improve it in-tree.
This would be advantageous because the aomp12 branch that was in use for fixing
this codebase has just been joined with the amd internal rocm dev process.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85742
Starting with 787eb0c637 I got spurious segmentation faults for some testcases. I could nail it down to `brel` trying to release the "memory" of the node allocated on the stack of __kmpc_omp_wait_deps. With this patch, you will see the assertion triggering for some of the tests in the test suite.
My proposed solution for the issue is to just patch __kmpc_omp_wait_deps:
```
__kmp_init_node(&node);
- node.dn.on_stack = 1;
+ // the stack owns the node
+ __kmp_node_ref(&node);
```
What do you think?
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D84472
For example:
```
#pragma omp target data map(tofrom:arr[0:100])
{
#pragma omp target exit data map(delete:arr[0:100])
#pragma omp target enter data map(alloc:arr[98:2])
}
```
Without this patch, the transfer at the end of the target data region
is broken and fails depending on the target device. According to my
read of the spec, the transfer shouldn't even be attempted because
`arr[0:100]` isn't (fully) present there. To fix that, this patch
makes `DeviceTy::getTgtPtrBegin` return null for this case.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D85342
For example, without this patch, the following fails as expected with
or without the `present` modifier, but the `present` modifier doesn't
produce its usual diagnostic:
```
#pragma omp target data map(alloc: arr[0:2])
{
#pragma omp target map(present, tofrom: arr[0:100]) // not fully present
;
}
```
Reviewed By: grokos, vzakhari
Differential Revision: https://reviews.llvm.org/D85320
The standard way of printing `int64_t` data is via the PRId64 macro, `ld`
is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int`
on all platforms. E.g. on Windows we get mismatch warnings.
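A short illustration of the portable form. `formatTransferSize` and the `size=` prefix are hypothetical and only serve to show the `PRId64` macro from `<cinttypes>` in use.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Format an int64_t without assuming it is a typedef for long int.
std::string formatTransferSize(int64_t Size) {
  char Buffer[64];
  int Written = std::snprintf(Buffer, sizeof(Buffer), "size=%" PRId64, Size);
  assert(Written > 0 && static_cast<size_t>(Written) < sizeof(Buffer));
  (void)Written; // silence unused-variable warnings when NDEBUG is set
  return std::string(Buffer);
}
```

`PRId64` expands to the correct length modifier on each platform, so the same format string compiles without mismatch warnings on both LP64 and LLP64 targets.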
Differential Revision: https://reviews.llvm.org/D85353
targetDataMapper function fills arrays with the mapping data in the
direct order. When this function is called by targetDataBegin or
tgt_target_update functions, it works as expected. But targetDataEnd
function processes mapped data in reverse order. In this case, the base
pointer might be deleted before the associated data is deleted. Need to
reverse data, mapped by mapper, too, since it always adds data that must
be deleted at the end of the buffer.
Fixes the test declare_mapper_target_update.cpp.
Also reduces memory fragmentation by preallocating the memory
buffers.
Differential Revision: https://reviews.llvm.org/D85216
OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states:
> If the corresponding list item is not present in the device data
> environment and there is no present modifier in the clause, then no
> assignment occurs to or from the original list item.
L10-11 states:
> If a present modifier appears in the clause and the corresponding
> list item is not present in the device data environment then an
> error occurs and the program terminates.
(OpenMP 5.0 also has the first passage but without mention of the
present modifier of course.)
In both passages, I assume "is not present" includes the case of
partially but not entirely present. However, without this patch, the
target update directive misbehaves in this case both with and without
the present modifier. For example:
```
#pragma omp target enter data map(to:arr[0:3])
#pragma omp target update to(arr[0:5]) // might fail on data transfer
#pragma omp target update to(present:arr[0:5]) // might fail on data transfer
```
The problem is that `DeviceTy::getTgtPtrBegin` does not return a null
pointer in that case, so `target_data_update` sees the data as fully
present, and the data transfer then might fail depending on the target
device. However, without the present modifier, there should never be
a failure. Moreover, with the present modifier, there should always
be a failure, and the diagnostic should mention the present modifier.
This patch fixes `DeviceTy::getTgtPtrBegin` to return null when
`target_data_update` is the caller. I'm wondering if it should do the
same for more callers.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D85246
Without this patch, the following example fails but shouldn't
according to OpenMP TR8:
```
#pragma omp target enter data map(alloc:i)
#pragma omp target data map(present, alloc: i)
{
#pragma omp target exit data map(delete:i)
} // fails presence check here
```
OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states:
> If the map clause appears on a target, target data, target enter
> data or target exit data construct with a present map-type-modifier
> then on entry to the region if the corresponding list item does not
> appear in the device data environment an error occurs and the
> program terminates.
There is no corresponding statement about the exit from a region.
Thus, the `present` modifier should:
1. Check for presence upon entry into any region, including a `target
exit data` region. This behavior is already implemented correctly.
2. Should not check for presence upon exit from any region, including
a `target` or `target data` region. Without this patch, this
behavior is not implemented correctly, breaking the above example.
In the case of `target data`, this patch fixes the latter behavior by
removing the `present` modifier from the map types Clang generates for
the runtime call at the end of the region.
In the case of `target`, we have not found a valid OpenMP program for
which such a fix would matter. It appears that, if a program can
guarantee that data is present at the beginning of a `target` region
so that there's no error there, that data is also guaranteed to be
present at the end. This patch adds a comment to the runtime to
document this case.
Reviewed By: grokos, RaviNarayanaswamy, ABataev
Differential Revision: https://reviews.llvm.org/D84422
This patch fixed the issue that target memory might be deallocated when
they're still being used or before they're used.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84996
This is to address the issue reported at:
https://bugs.llvm.org/show_bug.cgi?id=46863
Since weak is meaningless for a shared library interface function, this patch
disables the attribute, when the OpenMP library is built as shared library.
ompt_start_tool is not an interface function, but an internally called function
possibly implemented by an OMPT tool.
This function needs to be weak if possible to allow overwriting ompt_start_tool
with a function implementation built into the application.
Differential Revision: https://reviews.llvm.org/D84871
Refactored the function `targetDataEnd` in preparation for fixing
the issue of ahead-of-time target memory deallocation. This patch only
renamed `targetDataEnd` related variables and functions to conform
with the LLVM coding standard.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84991
Refactored the function `target` to make preparation for fixing the
issue of ahead-of-time device memory deallocation.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84816
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84767
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
It applies only for global pointers.
Differential Revision: https://reviews.llvm.org/D84767
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
On runtime failures, D83963 causes the runtime to abort instead of
merely exiting with a non-zero value, but many tests in the
libomptarget test suite still expect the former behavior. This patch
updates the test suite and was discussed in post-commit comments on
D83963 and D84557.
Summary:
1. Add DeviceTy::data_alloc, DeviceTy::data_delete, DeviceTy::data_alloc, DeviceTy::synchronize pass-through functions. Avoid directly accessing Device.RTL
2. Fix the type of the first argument of synchronize_ty in rtl.h: the device id is int32_t, which is consistent with other functions.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84487
See PR46515 for the rationale but generally, we want to *really* abort
not gracefully shut down.
Reviewed By: grokos, ABataev
Differential Revision: https://reviews.llvm.org/D83963
Additionally fix the copy if enabled on multi-config targets.
Summary:
This changes the copy command for libomp.so to use the output of the target as
the source of the copy, rather than trying to find it based on
${LIBOMP_LIBRARY_DIR}, which appears to be incorrect in multi-config generator
builds.
Reviewers: jdoerfert
Subscribers: mgorny, yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84148
Summary:
In the function `target`, memory deallocation and `target_data_end` are called
immediately after returning from the kernel launch. This might cause a race
condition in which the corresponding memory is still being used by the kernel,
and a potential issue in which the data a kernel requires has already been
deallocated when the kernel starts to execute, especially when multiple kernels
are running concurrently. Since we block the thread issuing the target
offloading at the end of the target region anyway, we just move the
synchronization ahead a little bit to ensure correctness.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84381
This implements OpenMP runtime support for the OpenMP TR8 `present`
map type modifier. The previous patch in this series implements Clang
front end support. See that patch summary for behaviors that are not
yet supported.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D83062
Following tests were disabled for clang-11 after upgrading to
version 5.0 in D82963:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
They are also failing for clang-12. Thus this temporary disabling
until they are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D84241
Add check of frm to prevent array out-of-bound access;
add check of new_nproc to prevent access of unallocated hot_teams array;
add check of location info pointer to prevent NULL dereference;
add check of d_tn pointer to prevent NULL dereference in release build.
These checks make static analyzers happier.
This is second part of the patch from https://reviews.llvm.org/D84062.
Add check of negative gtid before indexing __kmp_threads.
This makes static analyzers happier.
This is the first part of the patch split in two parts.
Differential Revision: https://reviews.llvm.org/D84062
hwloc documentation guarantees the only object that is always present
in the topology is PU. We can check the presence of other objects
in the topology, just in case.
Differential Revision: https://reviews.llvm.org/D84065
Add barrier/region notification for parallel inside teams construct
when number of teams is 1, as VTune only shows outer level regions for
simplicity.
Differential Revision: https://reviews.llvm.org/D84024
Libomptarget patch adding runtime support for "declare mapper".
Patch co-developed by Lingda Li and George Rokos.
Differential revision: https://reviews.llvm.org/D68100
There are various runtime calls in the device runtime with unused, or
always fixed, arguments. This is bad for all sorts of reasons. Clean up
two before as we match them in OpenMPOpt now.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83268
Summary:
Retaining per device primary context is preferred to creating a context owned by the plugin.
From CUDA documentation
1. "Note that the use of multiple CUcontexts per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX
Two issues are addressed by this patch:
1. Not using the primary context caused interoperability issues with libraries like cublas and cusolver, reported as CUBLAS_STATUS_EXECUTION_FAILED and cudaErrorInvalidResourceHandle
2. On OLCF summit, "Error returned from cuCtxCreate" and "CUDA error is: invalid device ordinal"
Regarding the flags of the primary context: if it is inactive, we set CU_CTX_SCHED_BLOCKING_SYNC. If it is already active, we respect the current flags.
Reviewers: grokos, ABataev, jdoerfert, protze.joachim, AndreyChurbanov, Hahnfeld
Reviewed By: jdoerfert
Subscribers: openmp-commits, yaxunl, guansong, sstefan1, tianshilei1992
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82718
This function uses __builtin_amdgcn_atomic_inc32():
uint32_t atomicInc(uint32_t *address, uint32_t max);
These functions use __builtin_amdgcn_fence():
__kmpc_impl_threadfence()
__kmpc_impl_threadfence_block()
__kmpc_impl_threadfence_system()
They will take the place of the current mechanism of directly calling IR functions.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83132
This patch adds missing GOMP_5.0 loop entry points which incorporate
the new nonmonotonic default into the entry point name. Since monotonic
schedules are a subset of nonmonotonic, it is acceptable to use
monotonic as the implementation. This patch simply has the nonmonotonic
(and possibly nonmonotonic) versions of the loop entry points as
wrappers around the monotonic ones.
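The wrapper idea can be sketched as follows. This is a minimal illustration only: `LoopState`, `loop_dynamic_next`, and `loop_nonmonotonic_dynamic_next` are hypothetical names, and the real GOMP entry points have different signatures.

```cpp
#include <cassert>

// Hypothetical loop descriptor; the real runtime keeps this state internally.
struct LoopState {
  long next;   // first iteration not yet handed out
  long end;    // one past the last iteration
  long chunk;  // iterations per chunk
};

// "Monotonic" scheduler: chunks are handed out in increasing iteration order.
bool loop_dynamic_next(LoopState &s, long *istart, long *iend) {
  assert(s.chunk > 0);
  if (s.next >= s.end)
    return false;
  *istart = s.next;
  *iend = (s.end - s.next < s.chunk) ? s.end : s.next + s.chunk;
  s.next = *iend;
  return true;
}

// A nonmonotonic schedule permits any chunk order, so increasing order is a
// valid implementation and the wrapper can simply forward.
bool loop_nonmonotonic_dynamic_next(LoopState &s, long *istart, long *iend) {
  return loop_dynamic_next(s, istart, iend);
}
```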
Differential Revision: https://reviews.llvm.org/D73922
The following tests fail after upgrading to version 5.0 but pass
for version 4.5:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
To be enabled as soon as these tests are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D82963
If the compilation fails, the test is marked as unsupported.
-> This will never change for a specific version of gcc
If the linking fails, the test is marked as expected to fail.
-> This might change as LLVM/OpenMP implements the missing GOMP interface function
Reviewed by: Hahnfeld
Differential Revision: https://reviews.llvm.org/D83077
When configuring in-tree, the correct names are LLVM_VERSION_MAJOR
and LLVM_VERSION_MINOR. This has been wrong since the code was added
in commits fc473dee98 and 821649229e.
Summary: Warnings are printed by clang when building with LIBOMPTARGET_ENABLE_DEBUG=ON due to an incorrect format string.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82789
Summary:
`config.test_extra_flags` is passed in from `lit.site.cfg.in` files, but it is not used in the LIT configs. This variable can be useful for distros which don't have the standard c/c++ headers in the default search paths. Since the tests run clang on c/c++ source code, we rely on `test_extra_flags` to pass in the necessary header files.
This is similar to the setup already used in libomptarget https://github.com/llvm/llvm-project/blob/master/openmp/libomptarget/test/lit.cfg#L42 and openmp/runtime.
Reviewers: jdoerfert, jdenny, protze.joachim
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82516
Summary:
lookupMapping took significant time due to linear complexity searching.
This is bad for offloading from multiple host threads because lookupMapping is protected by mutex.
Use std::set for logarithmic complexity searching.
Before my change.
libomptarget inclusive time 16.7 sec, exclusive time 8.6 sec.
After the change
libomptarget inclusive time 7.3 sec, exclusive time 0.4 sec.
Most of the overhead of libomptarget (exclusive time) is gone.
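A minimal sketch of an ordered-set lookup of this kind is shown below. It assumes mapped host ranges do not overlap; `MapEntry`, `ByHostBegin`, and `lookup_mapping` are illustrative names, not the actual libomptarget data structures.

```cpp
#include <cstdint>
#include <set>

// Hypothetical, simplified host-to-target map entry.
struct MapEntry {
  std::uintptr_t HstPtrBegin;
  std::uintptr_t HstPtrEnd;    // one past the last mapped host byte
  std::uintptr_t TgtPtrBegin;
};

// Order entries by host begin address; is_transparent lets the set be
// searched with a bare address (C++14 heterogeneous lookup).
struct ByHostBegin {
  using is_transparent = void;
  bool operator()(const MapEntry &a, const MapEntry &b) const {
    return a.HstPtrBegin < b.HstPtrBegin;
  }
  bool operator()(std::uintptr_t p, const MapEntry &e) const {
    return p < e.HstPtrBegin;
  }
  bool operator()(const MapEntry &e, std::uintptr_t p) const {
    return e.HstPtrBegin < p;
  }
};

using MapTable = std::set<MapEntry, ByHostBegin>;

// O(log n) lookup, assuming mapped host ranges do not overlap.
const MapEntry *lookup_mapping(const MapTable &table, std::uintptr_t p) {
  auto it = table.upper_bound(p);  // first entry starting after p
  if (it == table.begin())
    return nullptr;                // every entry starts after p
  --it;                            // last entry starting at or before p
  return p < it->HstPtrEnd ? &*it : nullptr;
}
```

Compared with a linear scan over a list, the time spent holding the mutex grows with the logarithm of the number of mappings rather than with their count.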
Reviewers: jdoerfert, grokos
Reviewed By: grokos
Subscribers: tianshilei1992, yaxunl, guansong, sstefan1
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82264
DeviceID is added for cases where we only have the __tgt_async_info but do
not know its corresponding device id. However, to communicate with target
plugins, we need that information.
Event is added as another way to synchronize.
Summary:
OpenMP loops are normalized and transformed into loops from 0 to the
maximum number of iterations. In some cases, the original scheme may lead to
overflow when calculating the number of iterations. If it is unknown
whether overflow can occur (the bounds are not constant, so overflow
cannot be ruled out), cast the original type to the corresponding
unsigned type.
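To illustrate why the unsigned cast helps, here is a minimal sketch for a loop of the form `for (i = lb; i < ub; i += st)` with a positive stride. The function name `trip_count` is hypothetical, and this is not the code Clang emits.

```cpp
#include <cstdint>

// Number of iterations of `for (int32_t i = lb; i < ub; i += st)`, st > 0.
std::uint32_t trip_count(std::int32_t lb, std::int32_t ub, std::int32_t st) {
  if (lb >= ub)
    return 0;
  // The signed difference ub - lb can overflow (e.g. lb = INT32_MIN,
  // ub = INT32_MAX). Unsigned arithmetic wraps modulo 2^32 and yields the
  // true distance because ub > lb is already known.
  std::uint32_t span =
      static_cast<std::uint32_t>(ub) - static_cast<std::uint32_t>(lb);
  std::uint32_t ust = static_cast<std::uint32_t>(st);
  // Round up without computing span + ust - 1, which could itself wrap.
  return span / ust + (span % ust != 0 ? 1u : 0u);
}
```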
Reviewers: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits, cfe-commits, caomhin
Tags: #clang, #openmp
Differential Revision: https://reviews.llvm.org/D81881
Adds the callbacks for ordered with source/sink dependencies.
The test for task dependencies changed, because callback.h now actually prints
the passed dependencies and the test also checks for the address.
Reviewed by: hbae
Differential Revision: https://reviews.llvm.org/D81807
This patch allows specifying a prefix (default: empty) to be included in the print-out
written by callback.h.
It also adds a cmake target so that other tests can find the header file.
Reviewed by: jdoerfert
Differential Revision: https://reviews.llvm.org/D76008
Summary:
In the current implementation, D2D memcpy first copies data back to the host and then
copies it from the host to the device. This is very inefficient if the device supports D2D
memcpy, like CUDA.
In this patch, D2D memcpy will first try to use the natively supported driver API. If
that fails, it falls back to the original way. It is worth noting that D2D memcpy in this
scenario covers two cases:
- Same devices: this is the D2D memcpy in the CUDA context.
- Different devices: this is the PeerToPeer memcpy in the CUDA context.
My implementation merges these two parts. It chooses the best API according to
the source device and destination device.
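The try-native-then-fall-back control flow can be sketched as below. Everything here is a stand-in: "device memory" is ordinary host memory so the sketch runs anywhere, and `Device`, `native_exchange`, and `device_to_device` are hypothetical names rather than libomptarget or CUDA APIs.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical device model.
struct Device {
  int id;
  bool has_native_exchange;  // whether the driver offers a direct copy
};

enum class CopyPath { Native, ViaHost };

// Stand-in for a driver call that copies directly between device buffers.
// Returns false when the direct path is unavailable.
bool native_exchange(const Device &src, const void *src_ptr,
                     const Device &dst, void *dst_ptr, std::size_t size) {
  if (!src.has_native_exchange || !dst.has_native_exchange)
    return false;
  std::memcpy(dst_ptr, src_ptr, size);
  return true;
}

// Try the direct path first; on failure, stage the data through a host
// buffer (device -> host, then host -> device).
CopyPath device_to_device(const Device &src, const void *src_ptr,
                          const Device &dst, void *dst_ptr, std::size_t size) {
  if (native_exchange(src, src_ptr, dst, dst_ptr, size))
    return CopyPath::Native;
  std::vector<char> staging(size);
  std::memcpy(staging.data(), src_ptr, size);  // "retrieve" to host
  std::memcpy(dst_ptr, staging.data(), size);  // "submit" to device
  return CopyPath::ViaHost;
}
```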
Reviewers: jdoerfert, AndreyChurbanov, grokos
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D80649
The OpenMP spec has the task-fulfill event for a call to omp_fulfill_event.
If the task did not yet finish execution, ompt_task_early_fulfill is used,
otherwise ompt_task_late_fulfill.
If a task has not completed when its execution finishes (i.e., the task goes
into detached mode), ompt_task_detach must be used instead of
ompt_task_complete when the next task is scheduled.
A test for both cases is included, which only works with clang-11+
Reviewed By: hbae
Differential revision: https://reviews.llvm.org/D80843
__kmp_realloc_task_deque implicitly assumes, that the task queue is full
(ntasks == size), therefore tail = size in line 319.
An assertion is added to document this assumption.
The first check for a full queue is before the locking and might not hold
when the lock is taken. So, we need to check again for this condition when
we have the lock.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D80480
Spurious assertion failures are symptoms of a race condition for the handling
of detached tasks:
Assertion failure at kmp_tasking.cpp(3744): taskdata->td_flags.complete == 1.
Assertion failure at kmp_tasking.cpp(710): taskdata->td_flags.executing == 0.
in the case of detach=true, all accesses to taskdata in __kmp_task_finish need
to happen before (~line 873):
taskdata->td_flags.proxy = TASK_PROXY;
This assignment signals to __kmp_fulfill_event that the task will need to be
freed there. So, conceptually, the ownership of taskdata is moved.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D79702
This patch adds a libomptarget plugin for the NEC SX-Aurora TSUBASA Vector
Engine (VE target). The code is largely based on the existing generic-elf
plugin and uses the NEC VEO and VEOSINFO libraries for offloading.
Differential Revision: https://reviews.llvm.org/D76843
D78566 introduced a `\bnot\b` lit substitution in OpenMP test suites.
However, that would corrupt a command like
`FileCheck -implicit-check-not` or any file name like `%t.not`. We
could use lookbehind/lookahead assertions to avoid such cases, but
this patch switches to `%not` (suggested during the D78566 review) as
a safer option.
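The corruption is easy to reproduce with any regex engine that supports `\b`. The sketch below uses C++ `std::regex` purely as an illustration; lit itself is written in Python, and the replacement path `/path/to/not` and both function names are made up.

```cpp
#include <regex>
#include <string>

// Substitution keyed on the bare word "not": also matches inside option
// names and file names, because '-' and '.' are word boundaries.
std::string substitute_word_not(const std::string &cmd) {
  static const std::regex word_not("\\bnot\\b");
  return std::regex_replace(cmd, word_not, "/path/to/not");
}

// Substitution keyed on "%not": only matches where the test author asked
// for it explicitly.
std::string substitute_percent_not(const std::string &cmd) {
  static const std::regex percent_not("%not\\b");
  return std::regex_replace(cmd, percent_not, "/path/to/not");
}
```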
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D79529
Summary: Due to a typo, the return type of DeviceRTLTy::getNumOfDevices is bool. This causes omp_get_num_devices to return the wrong number of devices.
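A reduced illustration of this bug class follows. The names and the device count are hypothetical; only the `bool` versus `int` return type matters.

```cpp
// Pretend the driver reported four devices.
static int device_count = 4;

// Bug: the declared return type is bool, so any nonzero count collapses to
// true, which reads back as 1 when converted to an integer.
bool get_num_devices_buggy() { return device_count; }

// Fix: return the count with an integer type.
int get_num_devices_fixed() { return device_count; }
```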
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D79255
The two locals IsNew and Pointer_IsNew were uninitialized at declaration, and then passed by
reference to Device.getOrAllocTgtPtr which in turn did not assign on all
paths within the function. This resulted in occasional runtime failures in one application.
Device::getOrAllocTgtPtr will now initialize IsNew to false on entry to the function.
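The pattern can be reduced to the following sketch, where `get_or_alloc` is a hypothetical stand-in for a function that reports through a reference parameter whether it allocated something:

```cpp
// Hypothetical reduced version of the pattern; not the libomptarget code.
void *get_or_alloc(void *existing, bool &is_new) {
  is_new = false;  // assign on entry so every return path defines it
  if (existing)
    return existing;  // early return: without the line above, the caller's
                      // uninitialized local would stay indeterminate here
  static int storage;
  is_new = true;
  return &storage;
}
```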
Differential Revision: https://reviews.llvm.org/D78744
Without this patch, target_data_begin continues after an illegal
mapping or an out-of-memory error on the device. With this patch, it
terminates the runtime with an error instead.
The new test exercises only illegal mappings. I didn't think of a
good way to exercise out-of-memory errors from the test suite.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D78170
Without this patch, the openmp project's test suites do not appear to
have support for negative tests. However, D78170 needs to add a test
that an expected runtime failure occurs.
This patch makes `not` visible in all of the openmp project's test
suites. In all but `libomptarget/test`, it should be possible for a
test author to insert `not` before a use of the lit substitution for
running a test program. In `libomptarget/test`, that substitution is
target-specific, and its value is `echo` when the target is not
available. In that case, inserting `not` before a lit substitution
would expect an `echo` fail, so this patch instead defines a separate
lit substitution for expected runtime fails.
Reviewed By: jdoerfert, Hahnfeld
Differential Revision: https://reviews.llvm.org/D78566
On systems with weak memory consistency, this patch fixes an intermittent crash
in the reduction function called by __kmp_hyper_barrier_gather, which suffers
from a race on a child thread's data.
Reviewed-By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D77603
Summary: Current implementation mixed everything up so that there is almost no encapsulation. In this patch, all CUDA related operations are put into a new class DeviceRTLTy and only necessary functions are exposed. In addition, all C++ code now conforms with LLVM code standard, keeping those API functions following C style.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: jfb, yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77951
...onization
Summary: In the previous patch, in order to optimize performance, we only synchronize once
for each target region. The synchronization is done via stream synchronization.
However, in an extreme situation, the performance might be bad. Consider the
following case: There is a task that requires transferring a huge amount of data
(it calls the data transfer function many times). It is scheduled to the first
stream. And then we have 255 very light tasks scheduled to the remaining 255
streams (by default we have 256 streams). They can be finished before we do
synchronization at the end of the first task. Next, we get another very huge
task. It will be scheduled again to the first stream. Now the first task
finishes its kernel launch and calls stream synchronization. Right now, the
stream already contains two kernels, and the synchronization will wait until the
two kernels finish instead of just the first one for the first task.
In this patch, we introduce a stream pool. After each synchronization, the stream
is returned to the pool, so that each synchronization waits
only for the expected operations.
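A minimal sketch of such a pool is given below. It models a stream as an integer id with a count of pending operations; `StreamPool` and its methods are hypothetical, and a real plugin would hold CUDA stream handles and guard the pool with a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class StreamPool {
public:
  explicit StreamPool(int initial) {
    for (int i = 0; i < initial; ++i)
      add_stream();
  }

  // Hand out an idle stream for one target region, growing the pool if
  // every stream is currently in use.
  int acquire() {
    if (idle_.empty())
      add_stream();
    int id = idle_.back();
    idle_.pop_back();
    return id;
  }

  // Record that one operation (copy or kernel launch) was queued.
  void enqueue(int id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < pending_.size());
    ++pending_[id];
  }

  // "Wait" for the stream and return it to the pool. Returns how many
  // operations were waited for; only this region's operations are on it.
  int synchronize_and_release(int id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < pending_.size());
    int waited = pending_[id];
    pending_[id] = 0;
    idle_.push_back(id);
    return waited;
  }

private:
  void add_stream() {
    idle_.push_back(static_cast<int>(pending_.size()));
    pending_.push_back(0);
  }

  std::vector<int> pending_;  // pending operation count per stream id
  std::vector<int> idle_;     // ids of streams not currently handed out
};
```

Because a stream is held exclusively between `acquire` and `synchronize_and_release`, a second large task can never queue work behind the first one on the same stream.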
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: gregrodgers, yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77412
Summary: Following comments at the bi-weekly meeting, this patch puts back the old APIs and adds a new `_async` series.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77822
Summary:
This patch introduces two things for offloading:
1. Asynchronous data transferring: those functions are suffixed with `_async`. They have one more argument compared with their synchronous counterparts: `__tgt_async_info*`, which is a new struct that only has one field, `void *Identifier`. This struct is for information exchange between different asynchronous operations. It can be used for stream selection, like in this case, or operation synchronization, which is also used. We may expect more usages in the future.
2. Optimization of stream selection for data mapping. The previous implementation used asynchronous device memory transfers but synchronized after each one. If, say, kernel A needs four memory copies to the device and two memory copies back to the host, then we can schedule these seven operations (four H2D, two D2H, and one kernel launch) into the same stream and only need to synchronize after the memory copies from device to host. In this way, we save a huge overhead compared with synchronizing after each operation.
Reviewers: jdoerfert, ye-luo
Reviewed By: jdoerfert
Subscribers: yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77005
Summary:
[libomptarget][nfc] Move non-freestanding headers out of common
Lowers the bar for building deviceRTL.
Drops math.h entirely as it wasn't used and libm is a big dependency.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77071
A data race occurs when acquiring the lock for a critical section,
triggering an assertion failure. Added a barrier to ensure
all memory is committed before checking the assertion.
Reviewed By: Hahnfeld
Differential Revision: https://reviews.llvm.org/D76780
Summary:
[libomptarget][nfc] Explicitly static function scope shared variables
`__shared__` in CUDA implies static in function scope. See e.g. D.2.1.1
in CUDA_C_Programming_Guide.pdf,
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/
This is surprising for non-cuda developers, see e.g. D73239 where I thought
local variables would be thread local.
Tested by IR diff of libomptarget.bc (no change), running in tree tests,
and binary diff of the nvcc static archives (no significant change).
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D76713
Summary: Explicitly initialize data members of RTLsTy class upon construction.
Reviewers: grokos
Subscribers: guansong, openmp-commits, caomhin, kkwli0
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75946
Summary:
[libomptarget] Implement locks for amdgcn
The nvptx implementation deadlocks on amdgcn. atomic_cas with multiple
active lanes can deadlock - if one lane succeeds, all the others are locked
out. The set_lock implementation therefore runs on a single lane.
Also uses a sleep intrinsic instead of the system clock for a probably
minor performance improvement. The unset/test implementations may be revised
later, based on code size / performance or similar concerns.
This implements the lock at a per-wavefront scope. That's not strictly as
specified, since openmp describes locks in terms of threads. I think the
nvptx implementation provides true per-thread locking on volta and the same
per-warp locking on other architectures.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75546
Summary:
[libomptarget][nfc] Move GetWarp/LaneId functions into per arch code
No code change for nvptx. Amdgcn currently has two implementations of GetLaneId,
this patch keeps the one a colleague considered to be superior for our ISA.
GetWarpId is currently the same function for amdgcn and nvptx, but I think it's
cleaner to keep it grouped with all the others than to keep it in support.cu.
Reviewers: jdoerfert, grokos, ABataev
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75587
Summary:
[libomptarget] Implement hip atomic functions in terms of intrinsics
All but atomicInc can be implemented using type generic clang intrinsics.
There is not yet a corresponding intrinsic for atomicInc in clang, only one in
LLVM. This patch leaves atomicInc as an unresolved symbol.
Reviewers: jdoerfert, ABataev, hfinkel, grokos, arsenm
Reviewed By: arsenm
Subscribers: sri, saiislam, wdng, jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D73076
Summary: Fixed the warning from gcc, since priorities 0-100 are reserved for internal use.
Reviewers: grokos
Subscribers: kkwli0, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75458
Summary:
Instead of using global variables with an unpredictable time of
deinitialization, use dynamically allocated variables with functions
explicitly marked as global constructor/destructor with a priority. This
prevents a crash caused by an incorrect order of dynamic
library deinitialization.
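The technique can be sketched with the GCC/Clang `constructor` and `destructor` function attributes. This is an assumption-laden illustration: `PluginRegistry`, `registry`, and the two functions are hypothetical names, and priority 101 is chosen only because GCC reserves priorities up to 100.

```cpp
// Hypothetical state that used to be a global object with an implicit,
// unordered static destructor.
struct PluginRegistry {
  int num_plugins = 0;
};

// A plain pointer has no destructor, so nothing is torn down implicitly.
static PluginRegistry *registry = nullptr;

// Runs before main(); lower priority numbers run earlier.
__attribute__((constructor(101))) static void init_registry() {
  registry = new PluginRegistry();
}

// Runs at exit with an explicit priority instead of an unspecified point
// in static destruction.
__attribute__((destructor(101))) static void deinit_registry() {
  delete registry;
  registry = nullptr;
}
```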
Reviewers: grokos, hfinkel
Subscribers: caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74837
From the context, it looks like the test should not be run with `check-all`,
but it is. It turns out the option argument resolves to True/False, which
cannot be passed down as is. There is one such example in
AddLLVM.cmake.
Summary: The 5.0 spec states, "The omp_get_max_threads routine returns an upper bound on the number of threads that could be used to form a new team if a parallel construct without a num_threads clause were encountered after execution returns from this routine." The attached test shows Max Threads: 96, Num Threads: 128 without the proposed change. The number of threads should not exceed the (max) nthreads ICV, hence we should return the higher SPMD thread number even when omp_get_max_threads() is called in a generic kernel. This change does fail the api test, max_threads.c, because now it would return 64 instead of 32.
Reviewers: jdoerfert, ABataev, grokos, JonChesterfield
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74092
Summary:
[libomptarget][nfc] Change enum values to match those in cuda/rtl
support.h and cuda/rtl.cpp (and downstream hsa/rtl.cpp) have enums for execution
mode. These are actually independent - the numbers used within support, or
within the plugin, are never passed across the boundary.
Nevertheless, trying to work out why the values are different between the two
has generated a reasonable amount of confusion. This patch changes support to
match the values in plugin, on the basis that the plugin also has some comments
which I'd have to update if I changed that one instead. Credit to Ron for
working through this in our own fork. See rocm-developer-tools/aomp/issues/7
for that earlier diagnostic write up.
Also happy with generic = 0, spmd = 1 - provided it's the same in both places.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74503
EXCLUDE_FROM_ALL means something else for add_lit_testsuite than it does
for something like add_executable. Distinguish between the two by
renaming the variable and making it an argument to add_lit_testsuite.
Differential revision: https://reviews.llvm.org/D74168
Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=44733 | TEST 'libomp :: ompt/synchronization/reduction/tree_reduce.c' FAILED on 32-bit x86 ]]
For 32-bit we need at least 3 variables to prevent atomic reduction from being
chosen by the runtime function `__kmp_determine_reduction_method`.
This patch adds reduction variables to the testcase.
Reviewers: mgorny, Hahnfeld
Differential Revision: https://reviews.llvm.org/D73850
Summary:
[nfc][libomptarget] Remove SHARED annotation from local variables
A few local variables in reduction.cu were marked SHARED. This patch leaves
all per-kernel global state localised in omp_data.cu.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D73239
Summary:
This patch fixes an issue in the following simple case:
#include <omp.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
  int num = omp_get_num_devices();
  printf("%d\n", num);
  return 0;
}
Currently it returns 0 even if devices exist. Since this file doesn't contain any
target region, the host entry is empty, so further actions like initialization
are not performed, leading to a wrong device number being returned by the runtime
function call.
Reviewers: jdoerfert, ABataev, protze.joachim
Reviewed By: ABataev
Subscribers: protze.joachim
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72576
Summary:
[libomptarget] Implement smid for amdgcn
Implementation is in a new file as it uses an intrinsic with
complicated encoding that warranted substantial comments.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72956
The reference counter for global objects marked with declare target is INF. This patch prevents the runtime from incrementing/decrementing INF refcounts. Without it, the map(delete: global_object) directive actually deallocates the global on the device. With this patch, such a directive becomes a no-op.
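A minimal sketch of a saturating "infinite" reference count is shown below. The sentinel value and the names `INF_REFCOUNT`, `Mapping`, `retain`, and `release` are assumptions for illustration, not the libomptarget implementation.

```cpp
#include <cstdint>

// Hypothetical sentinel meaning "never deallocate".
constexpr std::uint64_t INF_REFCOUNT = ~std::uint64_t{0};

struct Mapping {
  std::uint64_t refcount;
};

// Incrementing an infinite count is a no-op (it would otherwise wrap to 0).
void retain(Mapping &m) {
  if (m.refcount != INF_REFCOUNT)
    ++m.refcount;
}

// Returns true when the caller should deallocate the device copy.
// force_delete models map(delete: ...), which normally zeroes the count.
bool release(Mapping &m, bool force_delete) {
  if (m.refcount == INF_REFCOUNT)
    return false;  // declare target global: the directive is a no-op
  if (force_delete)
    m.refcount = 0;
  else if (m.refcount > 0)
    --m.refcount;
  return m.refcount == 0;
}
```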
Differential Revision: https://reviews.llvm.org/D72525
For any OpenMP application, TSan spuriously reports a race on the initialization
of a runtime-internal mutex:
```
Atomic read of size 1 at 0x7b6800005940 by thread T4:
#0 pthread_mutex_lock <null> (a.out+0x43f39e)
#1 __kmp_resume_64 <null> (libomp.so.5+0x84db4)
Previous write of size 1 at 0x7b6800005940 by thread T7:
#0 pthread_mutex_init <null> (a.out+0x424793)
#1 __kmp_suspend_initialize_thread <null> (libomp.so.5+0x8422e)
```
According to @AndreyChurbanov this is a false positive report, as the control
flow of the runtime guarantees the ordering of the mutex initialization and
the lock:
https://software.intel.com/en-us/forums/intel-open-source-openmp-runtime-library/topic/530363
To suppress this report, I suggest the use of
TSAN_OPTIONS='ignore_uninstrumented_modules=1'.
With this patch, a runtime warning is provided in case an OpenMP application
is built with TSan and executed without this TSan option.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D70412
Summary:
[nfc][libomptarget] Refactor nvptx/target_impl.cu
Use __kmpc_impl_atomic_add instead of atomicAdd to match the rest of the file.
Alternatively, target_impl.cu could use the cuda functions directly. Using a mixture in this
file was an oversight, happy to resolve in either direction.
Removed some comments that look outdated.
Call __kmpc_impl_unset_lock directly to avoid a redundant diagnostic and remove an implicit
dependency on interface.h.
Reviewers: ABataev, grokos, jdoerfert
Reviewed By: jdoerfert
Subscribers: jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72719
Summary:
[nfc][libomptarget] Refactor amdgcn target_impl
Removes references to internal libraries from the header
Standardises on C++ mangling for all the target_impl functions
Update comment block
clang-format
Move some functions into a new target_impl.hip source file
This lays the groundwork for implementing the remaining unresolved
symbols in the target_impl.hip source.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72712
The OpenMP runtime is not instrumented, so entering the runtime leaves no hint
on the source line of the pragma on ThreadSanitizer's function stack.
This patch adds function entry/exit annotations for OpenMP parallel regions,
and synchronization regions (barrier, taskwait, taskgroup).
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D70408
If the openmp project is built standalone, the test compiler is feature tested for an available -fsanitize=thread flag.
If the openmp project is built as part of llvm, the target tsan is needed to test archer.
An additional line (requires tsan) was introduced to the tests, this patch updates the line numbers for the race.
Follow-up for 77ad98c
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D71914
Summary:
If a dynamically loaded module has been compiled with -fopenmp-targets
and has no target regions, it has an empty target descriptor. This leads to a
crash at runtime if another module has at least one target region
and at least one entry in its descriptor. The runtime library is unable
to load the empty binary descriptor and terminates the execution.
Caused by clang-offload-wrapper.
Reviewers: grokos, jdoerfert
Subscribers: caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72472
Including two tests
These callbacks were added late to the 5.0 specification, an implementation is missing.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D70395
Summary:
[libomptarget][nfc] Introduce atomic wrapper function
Wraps atomic functions in a template prefixed __kmpc_atomic that
dispatches to cuda or hip atomic functions. Intended to be easily extended
to dispatch to OpenCL or C++ atomics for a third target.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: Anastasia, jvesely, mgrang, dexonsmith, llvm-commits, mgorny, jfb, openmp-commits
Tags: #openmp, #llvm
Differential Revision: https://reviews.llvm.org/D71404
Summary:
[libomptarget][nfc] Extract function from data_sharing, move to common
Finding the first active thread in the warp is different on nvptx and amdgcn,
mostly due to warp size and the desire for efficiency.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71643
Summary:
[libomptarget][nfc] Move three files under common, build them for amdgcn
Change to reduction.cu to remove two dead includes, otherwise no code change.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71601
Summary:
[libomptarget][nfc] Move omp locks under target_impl
These are likely to be target specific, even down to the lock_t which is
correspondingly moved out of interface.h. The alternative is to include
interface.h in target_impl which substantially increases the scope of
those symbols.
The current nvptx implementation deadlocks on amdgcn. The preferred
implementation for that arch is still under discussion - this change
leaves declarations in target_impl.
The functions could be inline for nvptx. I'd prefer to keep the internals
hidden in the target_impl translation unit, but will add the (possibly renamed)
macros to target_impl.h if preferred.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71574
Summary:
[libomptarget][nfc] Wrap cuda min() in target_impl
nvptx forwards to cuda min, amdgcn implements directly.
Sufficient to build parallel.cu for amdgcn, added to CMakeLists.
All call sites are homogenous except one that passes a uint32_t and an
int32_t. This could be smoothed over by taking two type parameters
and some care over the return type, but overall I think the inline
<uint32_t> calling attention to what was an implicit sign conversion
is cleaner.
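The trade-off at the mixed-type call site can be sketched like this. `impl_min` and `clamp_threads` are hypothetical names standing in for the wrapper and one of its callers.

```cpp
#include <cstdint>

// Single-type-parameter min: both arguments must have the same type.
template <typename T> T impl_min(T x, T y) { return x < y ? x : y; }

std::uint32_t clamp_threads(std::uint32_t requested, std::int32_t limit) {
  // impl_min(requested, limit) would not compile, because T would deduce to
  // both uint32_t and int32_t. Spelling out <std::uint32_t> makes the sign
  // conversion of `limit` visible at the call site.
  return impl_min<std::uint32_t>(requested, limit);
}
```

Note that a negative `limit` converts to a large unsigned value, so `clamp_threads(8, -1)` yields 8. That is exactly the implicit behaviour the explicit type argument draws attention to.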
Reviewers: ABataev, jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71580
Summary:
This reverts commit dd8a7fcdd7.
Alexey reports undefined symbols for the new inline functions defined in target_impl.h
This does not reproduce for me for nvptx, or amdgcn, under release or debug builds.
I believe the patch is fine, based on:
- the semantics of an inline function in C++ (the cuda INLINE functions end
up as linkonce_odr in IR), which are only legal to drop if they have no uses
- the code generated from a debug build of clang 9 does not show these undef symbols
- the tests pass
- the code is trivial
To progress from here I either need:
- A tie break - someone to play the role of CI in determining whether the patch works
- Alexey to provide sufficient information about his build for me to reproduce the failure
- Alexey to debug why the symbols are disappearing for him and report back
Reviewers: ABataev, jdoerfert, grokos
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71502
Summary:
[libomptarget] Build most of common/src for amdgcn
Excluding parallel.cu, which uses an integer min() from cuda,
Excluding support.cu, which calls malloc that is not yet available for amdgcn
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: gregrodgers, ronlieb, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71446
Summary:
[libomptarget][nfc] Add declarations of atomic functions for amdgcn
This enables building more source for amdgcn. The functions are usually available
in a hip runtime header, but are duplicated here to decouple the implementation
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71412
Summary:
[libomptarget][nfc] Move cuda threadfence functions behind kmpc_impl
Part of building code under common/ without requiring a cuda compiler
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: ABataev
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71102
Summary:
[libomptarget][nfc] Move omptarget-nvptx under common
Almost all files depend on omptarget-nvptx, which no longer
contains any obviously architecture dependent code. Moving it under
common unblocks task/loop for amdgcn, and allows moving other code.
At some point there should probably be a widespread symbol renaming to
replace the nvptx string. I'd prefer to get things working first.
Building this (and task.cu, loop.cu) without a cuda library requires
some more refactoring, e.g. wrap threadfence(), use DEVICE macro more
consistently. Patches for that are orthogonal and will be posted shortly.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: mgorny, fedor.sergeev, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71073
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
Repeat of D70414, with an include path fixed. Diff for sanity checking.
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author so would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: ABataev, jdoerfert
Reviewed By: ABataev
Subscribers: jvesely, mgorny, jfb, openmp-commits, jdoerfert
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70971
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author so would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Reviewed By: jdoerfert
Subscribers: jfb, Hahnfeld, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70414
Summary:
"make check-all" or "make check-libomptarget" would attempt to run offloading
tests before the offload plugins are built. This patch corrects that by adding
dependencies to the libomptarget CMake rules.
Reviewers: jdoerfert
Subscribers: mgorny, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70803
Summary:
The termination function duplicated the functionality of the
__attribute((destructor))-annotated function __kmp_internal_end_fini,
and we have no indication that this doesn't work.
The function might cause issues with link-time optimization turned on:
until very recently, none of the usual linkers was reporting functions
named in -Wl,-fini as used to the LTO plugin, so it might be dropped.
If the function is dropped, -Wl,-fini=__kmp_internal_end_fini doesn't
do what we want: with ld.bfd and lld it drops the FINI attribute from
.dynamic and with gold we get FINI = 0x0, which leads to a crash on
cleanup. This can be reproduced by building with
-DLLVM_ENABLE_PROJECTS="clang;openmp" \
-DLLVM_ENABLE_LTO=Thin \
-DLLVM_USE_LINKER=gold
The issue in lld has been fixed in f95273f75a, but gold remains without
fix so far.
Fixes PR43927.
Reviewers: JonChesterfield, jdoerfert, AndreyChurbanov
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D69927
Summary:
[libomptarget][nfc] Move some source into common from nvptx
Moves some source that compiles cleanly under amdgcn into a common subdirectory
Includes some non-trivial files and some headers. Keeps the cuda file extension.
The build systems for different architectures seem unlikely to have much in
common. The idea is therefore to set include paths such that files under
common/src compile as if they were under arch/src as the mechanism for sharing.
In particular, files under common/src need to be able to include target_impl.h.
The corresponding -Icommon is left out in favour of explicit includes on the
basis that it makes it clearer which files under common are used by a given
architecture.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: jfb, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70328
The tool provides TSAN annotations for OpenMP synchronization. The tool
is activated if no other OMPT tool is loaded.
The tool detects whether the application was built with TSan and rejects
activation according to the OMPT protocol if there is no TSan-rt.
Differential Revision: https://reviews.llvm.org/D45890
Summary:
[libomptarget][nfc] Use cuda variable wrappers from support.h
Reimplementation of D69693, after the revert of D69885
Use the wrappers in support.h for cuda builtin variables at all call sites.
Localises use of cuda and removes WARPSIZE==32 assumption in debug.h.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70186
Summary:
[libomptarget] Move supporti.h to support.cu
Reimplementation of D69652, without the unity build and refactors.
Will need a clean build of libomptarget as the cmakelists changed.
Reviewers: ABataev, jdoerfert
Reviewed By: jdoerfert
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70131
Summary:
[libomptarget] Revert all improvements to support
The change to unity build for nvcc has broken the build for some developers.
This patch reverts to a known-working state.
There has been some confusion over exactly how the build broke. I think we
have reached a common understanding that the disappearing symbols are from
the bitcode library built by clang. The static archive built by nvcc may show the
same problem. Some of the confusion arose from building the deviceRTL twice
and using one or the other library based on various environmental factors.
I'm pretty sure the problem is clang expanding `__forceinline__` into both `__inline__`
and `attribute(("always_inline"))`. The `__inline__` attribute resolves to linkonce_odr
which is not safe for exporting symbols from translation units.
"always_inline" is the desired semantic for small functions defined in one translation
unit that are intended to be inlined at link time. "inline" is not.
This therefore reintroduces the dependency hazard of supporti.h and some code
duplication, and blocks progress separating deviceRTL into reusable components.
See also D69857, D69859 for attempts at a fix instead of a revert.
Reviewers: ABataev, jdoerfert, grokos, ikitayama, tianshilei1992
Reviewed By: ABataev
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69885
Summary:
[libomptarget] Implement target_impl for amdgcn
Smallest atomic addition for a new target. Implements enough of the amdgcn
specific code that some of the source files under nvptx/src could be compiled,
without modification, to run on amdgcn.
This foreshadows a work in progress patch to move said source out of nvptx/src.
Patch based on fork at https://github.com/ROCm-Developer-Tools/llvm-project
Reviewers: ABataev, jdoerfert, grokos, ronlieb
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69718
Summary:
[nfc][omptarget] Use builtin var abstraction. Second pass at D69476
Use the wrappers in support.h for cuda builtin variables at all call sites.
Localises use of cuda and removes WARPSIZE==32 assumption in debug.h.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69693
Summary:
[nfc][libomptarget] Reorganise support header
All functions defined in support implementation are now declared in support.h
Reordered functions in support implementation to match the sequence in support.h
Added include guards to support.h
Added #include interface to support.h to provide kmp_Ident declaration
Move supporti.h to support.cu and s/INLINE/EXTERN/g
Add remaining includes to support.cu
A minor side effect is to change the name mangling of the support functions to
extern "C". If this matters another macro along the lines of INLINE/EXTERN
can be added - perhaps DEVICE as that's the obvious implementation.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69652
Summary:
[libomptarget] Change nvcc compilation to use a unity build
This allows nvcc to inline functions between what would otherwise be distinct
translation units, which in turn removes any runtime cost from implementing
functions in source files (as opposed to inline in headers).
This will then allow the circular dependencies in deviceRTL to be readily
broken and individual components more easily shared between architectures.
Reviewers: ABataev, jdoerfert, grokos, RaviNarayanaswamy, hfinkel, ronlieb, gregrodgers
Reviewed By: jdoerfert
Subscribers: mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69489
Summary:
[libomptarget] Always call malloc, free via SafeMalloc, SafeFree wrapper
NFC for release, adds some verbosity to debug printing. Motivation is to provide
one place where local modifications can be made to the behaviour of all heap
allocation or deallocation while debugging.
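A minimal host-side sketch of the wrapper idea, for illustration only. The names SafeMalloc and SafeFree come from the message above; the signatures, the `msg` parameter, and the logging format are assumptions, and the real device runtime code may differ.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Assumed signature: allocate and report the request in one place.
void *SafeMalloc(std::size_t size, const char *msg) {
  void *ptr = std::malloc(size);
  // Single spot to add logging or other local changes while debugging.
  std::fprintf(stderr, "malloc %zu bytes for %s -> %p\n", size, msg, ptr);
  return ptr;
}

// Assumed signature: free and return nullptr so callers can reset a pointer.
void *SafeFree(void *ptr, const char *msg) {
  std::fprintf(stderr, "free %p for %s\n", ptr, msg);
  std::free(ptr);
  return nullptr;
}
```

A caller would write something like `p = SafeMalloc(16, "demo"); p = SafeFree(p, "demo");`, so every heap operation passes through the two functions.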
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69492
Summary:
[nfc][libomptarget] Decrease coupling between files
debug.h used the symbol omptarget_device_environment so implicitly required
an include of omptarget-nvptx.h to compile. Similarly interface.h uses size_t.
Moving this declaration to a new header means cancel, critical can now build
without omptarget-nvptx.h. After this change, debug.h, cancel.cu, critical.cu
could move under a common source directory.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69473
Summary:
[nfc][libomptarget] Inline option into target_impl
Subset of D69423. The macros that were in option.h are all target dependent.
Inlining the header simplifies the dependency graph when looking to move code
into a common subdir.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69472
Summary:
[NFC][libomptarget] move remaining device specific code out of omptarget-nvptx.h
Strictly there is one remaining difference wrt amdgcn - parallelLevel is
volatile qualified on amdgcn and not on nvptx. Determining whether this is
correct - and how to represent the different semantics of 'volatile' under
various conditions - is beyond the scope of this code motion patch.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69424
Details:
- nconflicts field initialized;
- formatting fix (moved declaration out of the long line);
- count conflicts in new hash as opposed to old one.
Differential Revision: https://reviews.llvm.org/D68036
Summary:
[libomptarget][nfc] Make interface.h target independent
Move interface.h under a top level include directory.
Remove #includes to avoid the interface depending on the implementation.
Reviewers: ABataev, jdoerfert, grokos, ronlieb, RaviNarayanaswamy
Reviewed By: jdoerfert
Subscribers: mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D68615
llvm-svn: 374919
Summary:
[libomptarget][nfc] Update remaining uint32 to use lanemask_t
Update a few functions in the API to use lanemask_t instead of i32. NFC for
nvptx. Also update the ActiveThreads type in DataSharingStateTy.
This removes a lot of #ifdef from the downstream amdgcn implementation.
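A rough sketch of why a lane mask typedef removes #ifdefs, assuming one target wants a 32-bit mask and another a 64-bit one. The macro DEMO_WIDE_LANEMASK and the helper demo_active_lanes are hypothetical and not part of the runtime.

```cpp
#include <cstdint>

// Hypothetical build switch; a real build would pick the type per target.
#ifdef DEMO_WIDE_LANEMASK
using lanemask_t = std::uint64_t;
#else
using lanemask_t = std::uint32_t;
#endif

// Count the set bits in the mask without caring how wide the mask is.
int demo_active_lanes(lanemask_t mask) {
  int count = 0;
  for (; mask != 0; mask &= mask - 1) // clear the lowest set bit each step
    ++count;
  return count;
}
```

Code written against lanemask_t compiles unchanged for either width, so the width choice lives in one place.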
Reviewers: ABataev, jdoerfert, grokos, ronlieb, RaviNarayanaswamy
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D68513
llvm-svn: 373806
/proc is not present unless the Linux compatibility layer (for CentOS) is
activated, so rely on a more native way of checking the address.
Reviewers: Hahnfeld, kongyl, jdoerfert, jlpeyton, AndreyChurbanov, emaster, dim
Reviewed By: Hahnfeld
Differential Revision: https://reviews.llvm.org/D67326
llvm-svn: 373152
The linker automatically provides __start_<section name> and __stop_<section name> symbols to satisfy unresolved references if <section name> is representable as a C identifier (see https://sourceware.org/binutils/docs/ld/Input-Section-Example.html for details). These symbols indicate the start address and end address of the output section respectively. Therefore, rename the OpenMP offload entries section from ".omp.offloading_entries" to "omp_offloading_entries" to use this feature.
This is the first part of the patch for eliminating OpenMP linker script (please see https://reviews.llvm.org/D64943).
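To illustrate the linker feature described above (this is not the actual offload entry code), the sketch below places two records in a section whose name is a valid C identifier and walks them through the __start_/__stop_ symbols. The section name demo_entries, the DemoEntry struct, and both helper functions are hypothetical. It assumes an ELF target, GNU ld or lld, and a GCC-compatible compiler.

```cpp
#include <cstddef>

// Hypothetical record type; the real offload entry layout is not shown here.
struct DemoEntry {
  const char *name;
  int id;
};

// Place two records in a section whose name is a valid C identifier.
// "used" keeps the compiler from discarding the unreferenced objects.
static DemoEntry entry_a __attribute__((used, section("demo_entries"))) = {"a", 1};
static DemoEntry entry_b __attribute__((used, section("demo_entries"))) = {"b", 2};

// Defined by the linker because "demo_entries" is a C identifier.
extern "C" DemoEntry __start_demo_entries[];
extern "C" DemoEntry __stop_demo_entries[];

// Number of records the linker gathered into the output section.
std::size_t demo_entry_count() {
  return static_cast<std::size_t>(__stop_demo_entries - __start_demo_entries);
}

// Sum of the ids, visiting every record between the two boundary symbols.
int demo_entry_id_sum() {
  int sum = 0;
  for (DemoEntry *e = __start_demo_entries; e != __stop_demo_entries; ++e)
    sum += e->id;
  return sum;
}
```

With the old name ".omp.offloading_entries" the leading dot and the inner dot make the name an invalid C identifier, so no such symbols would be generated and a linker script would be needed to mark the section boundaries.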
Differential Revision: https://reviews.llvm.org/D68070
llvm-svn: 373118
There's no need to initialize variables with static storage duration
because they're implicitly initialized to zero. See
https://en.cppreference.com/w/c/language/initialization#Implicit_initialization
I think that's already relied upon because the supplied 0 only sets
'kmp_time_global_t g_time;' in 'struct kmp_base_global'. The other fields
are not set in the code, but implicitly initialized by the compiler.
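A small sketch of the rule being relied on. The struct DemoGlobal and its fields are hypothetical stand-ins, not the runtime's types.

```cpp
// Hypothetical stand-in for a runtime global with several fields.
struct DemoGlobal {
  long time_value;
  int flags;
  void *ptr;
};

// No initializer: objects with static storage duration are zero-initialized.
static DemoGlobal demo_global;

// Returns true if every field still holds its implicit zero value.
bool demo_global_is_zeroed() {
  return demo_global.time_value == 0 && demo_global.flags == 0 &&
         demo_global.ptr == nullptr;
}
```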
Differential Revision: https://reviews.llvm.org/D66292
llvm-svn: 370943
Summary:
In non-SPMD mode we may end up with divergent threads when trying to
increment/decrement parallel level counter. It may lead to incorrect
calculations of the parallel level and wrong results when threads are
divergent. We need to reconverge the threads before trying to modify the
parallel level counter.
Reviewers: grokos, jdoerfert
Subscribers: guansong, openmp-commits, caomhin, kkwli0
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66802
llvm-svn: 370803
Summary:
Use target_impl functions to replace more inline asm
Follow on from D65836. Removes remaining asm shuffles and lanemask accessors
Also changes the types of target_impl bitwise functions to unsigned.
Reviewers: jdoerfert, ABataev, grokos, Hahnfeld, gregrodgers, ronlieb, hfinkel
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66809
llvm-svn: 370216
Summary:
[libomptarget] Refactor syncthreads macro to inline function
See also abandoned D66846, split into this diff and others.
Rev 2 of D66855
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66861
llvm-svn: 370210
Summary:
[libomptarget] Refactor syncwarp macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66857
llvm-svn: 370149
Summary:
[libomptarget] Refactor shfl_down_sync macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66853
llvm-svn: 370146
Summary:
[libomptarget] Refactor shfl_sync macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66852
llvm-svn: 370144
Summary:
Added function void __kmpc_syncwarp(int32_t) to expose it to the
compiler. It is required to fix the problem with the critical regions in
Cuda9.0+. We cannot use a barrier in the critical region, but still need
to reconverge the threads in the warp afterwards. This function makes
that possible.
Reviewers: grokos, jdoerfert
Subscribers: guansong, openmp-commits, kkwli0, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66672
llvm-svn: 369933
Summary:
In Cuda 9.0 it is not guaranteed that threads in the warps are
convergent. We need to use __syncwarp() function to reconverge
the threads and to guarantee the memory ordering among threads in the
warps.
This is the first patch to fix the problem with the test
libomptarget/deviceRTLs/nvptx/src/sync.cu on Cuda9+.
This patch just replaces calls to __shfl_sync() function with the call
of __syncwarp() function where we need to reconverge the threads when we
try to modify the value of the parallel level counter.
Reviewers: grokos
Subscribers: guansong, jfb, jdoerfert, caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65013
llvm-svn: 369796
This patch fixes https://bugs.llvm.org/show_bug.cgi?id=42906 by adjusting
the number of threads on entry to the teams construct on the host
according to user settings. This lets the checks pass and avoids assertions
when the team of threads is created.
Patch by Andrey Churbanov
Differential Revision: https://reviews.llvm.org/D66351
llvm-svn: 369430
Fix last warned location in ittnotify_static.cpp using the defined
macro KMP_FALLTHROUGH().
Differential Revision: https://reviews.llvm.org/D65871
llvm-svn: 369003
The variables in kmp_lock.cpp are really arrays of function pointers
that return void or int, not pointers to functions that return void*
or int*. The other changes are only cosmetic.
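The two declarator forms are easy to confuse, so here is a short sketch of the difference. All names (demo_acquire, demo_release, demo_lock_ops, demo_returns_pointer) are hypothetical and unrelated to the actual variables in kmp_lock.cpp.

```cpp
static void demo_acquire(int *lock) { *lock = 1; }
static void demo_release(int *lock) { *lock = 0; }

// An array of two pointers to functions that take int* and return void.
static void (*demo_lock_ops[2])(int *) = {demo_acquire, demo_release};

// By contrast: one pointer to a function that takes int* and returns void*.
void *(*demo_returns_pointer)(int *) = nullptr;

// Acquire through the table and report the lock state (1 means held).
int demo_state_after_acquire() {
  int lock = 0;
  demo_lock_ops[0](&lock);
  return lock;
}

// Release through the table and report the lock state (0 means free).
int demo_state_after_release() {
  int lock = 1;
  demo_lock_ops[1](&lock);
  return lock;
}
```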
Differential Revision: https://reviews.llvm.org/D65870
llvm-svn: 369002
The implementation status can only be one of
ompt_event_UNIMPLEMENTED = ompt_set_never = 1
ompt_event_MAY_ALWAYS = ompt_set_always = 5
In both cases, the condition was already true, so just remove
the check.
Differential Revision: https://reviews.llvm.org/D65869
llvm-svn: 369001
Instead, maintain a list of disabled options to still build libomp and
libomptarget without warnings. This includes -Wno-error and -Wno-pedantic
to silence warnings that LLVM enables when building in-tree.
I tested the following compilers:
* Clang 6.0, 7.0, 8.0
* GCC 4.8.5 (CentOS 7), GCC 6, 7, 8, 9
* Intel Compiler 16, 17, 18, 19
RFC thread on openmp-dev mailing list:
http://lists.llvm.org/pipermail/openmp-dev/2019-August/002668.html
Differential Revision: https://reviews.llvm.org/D65867
llvm-svn: 368999
Summary:
[libomptarget] Factor architecture dependent code out of loop.cu
Related to the patch series starting D64217. Added subscribers to said series as reviewers. This effort is smaller in scope.
This patch factors out just enough architecture dependent code from loop.cu to allow the same source to be used with amdgcn, given a different target_impl.h. Testing is that the same bitcode (modulo variable names) is generated for libomptarget before and after the refactor, for nvptx and the out of tree amdgcn.
Reviewers: jdoerfert, ABataev, bollu, jfb, tra, grokos, Hahnfeld, guansong, xtian, gregrodgers, ronlieb, hfinkel, gtbercea, guraypp, arpith-jacob
Reviewed By: jdoerfert, ABataev
Subscribers: dexonsmith, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65836
llvm-svn: 368751
This patch fixes problem raised in post-review comments of the
https://reviews.llvm.org/D65285. Developers of ittnotify confirmed
that dll_path_ptr field of the __itt_global structure is never used
by ittnotify library, so it is safe to remove the dll_path array.
Differential Revision: https://reviews.llvm.org/D65885
llvm-svn: 368559
Summary:
This patch adds support for the close map modifier.
The close map modifier will override the unified shared memory requirement and create a device copy of the data.
Reviewers: ABataev, Hahnfeld, caomhin, grokos, jdoerfert, AlexEichenberger
Reviewed By: Hahnfeld, AlexEichenberger
Subscribers: guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65340
llvm-svn: 368488
We have one global RTLs.RequiresFlags, I don't see a need to make a
copy per device that the runtime manages. This was problematic anyway
because the copy happened during the first __tgt_register_lib(). This
made it impossible to call __tgt_register_requires() from normal user
functions for testing.
Hence, this change also fixes unified_shared_memory/shared_update.c for
older versions of Clang that don't call __tgt_register_requires() before
__tgt_register_lib().
Differential Revision: https://reviews.llvm.org/D66019
llvm-svn: 368465
Summary:
This patch adds support for using unified memory in the case of regular maps that happen when a target region is offloaded to the device.
For cases where only a single version of the data is required then the host address can be used. When variables need to be privatized in any way or globalized, then the copy to the device is still required for correctness.
Reviewers: ABataev, jdoerfert, Hahnfeld, AlexEichenberger, caomhin, grokos
Reviewed By: Hahnfeld
Subscribers: mgorny, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65001
llvm-svn: 368192
Summary:
[libomptarget] Use forceinline. Necessary for nvcc to inline small functions within the bitcode library
Suggested in D65836
Reviewers: ABataev, jdoerfert, grokos, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65876
llvm-svn: 368177
New OMPT tests with teams construct should be disabled for GCC as it
emits code with a GOMP entry not supported in the LLVM runtime.
Differential Revision: https://reviews.llvm.org/D65757
llvm-svn: 367939
Ensures that CUDA fail reasons (such as "No CUDA-capable device detected")
are printed together with libomptarget's debug message
(e.g. "Error when setting CUDA context"). Previously, the former was
printed only in CMAKE_BUILD_TYPE=Debug builds while the latter was
enabled by LIBOMPTARGET_ENABLE_DEBUG.
With this change, also only call cuGetErrorString when the error will be
printed.
Suggested-by: Ye Luo <xw111luoye@gmail.com>
Differential Revision: https://reviews.llvm.org/D65687
llvm-svn: 367910
This patch implements the libomptarget runtime interface for OpenMP 5.0
declare mapper functions. The declare mapper functions generated by
Clang will call them to complete the mapping of members.
kmpc_mapper_num_components gets the current number of components for a
user-defined mapper; kmpc_push_mapper_component pushes back one
component for a user-defined mapper.
The design slides can be found at
https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
Patch by Lingda Li <lildmh@gmail.com>
Differential Revision: https://reviews.llvm.org/D60972
llvm-svn: 367772
All other files are already C++ and the build system has always
passed '-x c++' for C files, effectively compiling them as C++.
To stay warning free we need one fix in ittnotify_static.{c,cpp}:
The variable dll_path can be written to, so it must not be const.
GCC complained with -Wcast-qual and I think it's right.
Differential Revision: https://reviews.llvm.org/D65285
llvm-svn: 367343
Round the stack size to a multiple of the page size. Older versions of
Android (until KitKat) would fail pthread_attr_setstacksize with
EINVAL if the stack size was not a multiple of the page size.
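The rounding itself can be sketched as follows. The helper name round_up_to_page is hypothetical, and the patch may compute the value differently.

```cpp
#include <cassert>
#include <cstddef>

// Round size up to the next multiple of page_size.
// page_size must be nonzero; it need not be a power of two.
std::size_t round_up_to_page(std::size_t size, std::size_t page_size) {
  assert(page_size != 0);
  std::size_t remainder = size % page_size;
  return remainder == 0 ? size : size + (page_size - remainder);
}
```

On POSIX systems the page size can be queried with sysconf(_SC_PAGESIZE) before the rounded value is passed to pthread_attr_setstacksize.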
Patch by Dan Albert <danalbert@google.com>.
Test: Build, copied into the NDK, passed openmp test on ICS.
Bug: https://github.com/android-ndk/ndk/issues/9
llvm-svn: 367070
Both Clang and GCC complained that they cannot initialize a return
object of type 'kmp_proc_bind_t' with an 'int'. While at it, also
fix a warning about missing parentheses thrown by Clang.
Differential Revision: https://reviews.llvm.org/D65284
llvm-svn: 367041
Summary:
According to the OpenMP standard, barrier operation must perform
implicit flush operation. Currently, if there is only one thread in the
team, barrier does not flush the memory. Patch fixes this problem.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D62398
llvm-svn: 367024
This is a port of libomp for the RISC-V 64-bit Linux target.
We have tested this port on a HiFive Unleashed development board
using a downstream LLVM that has support for the missing bits in
upstream. As of now, all tests are passing, including OMPT.
Patch by Ferran Pallarès!
Differential Revision: https://reviews.llvm.org/D59880
llvm-svn: 367021
If the first target region in a program calls the push_tripcount
function, libomptarget didn't handle the offload policy correctly.
This could lead to unexpected error messages as seen in
http://lists.llvm.org/pipermail/openmp-dev/2019-June/002561.html
To solve this, add a check calling IsOffloadDisabled() as all other
entry points already do. If this method returns true, libomptarget
is effectively disabled.
Differential Revision: https://reviews.llvm.org/D64626
llvm-svn: 366810
This is done at call-site and does not need to be handled in
__kmp_invoke_microtask. It was already absent from the x86
and x86_64 assembly, this patch removes it from the generic
implementation in z_Linux_util.cpp and adds documentation for
AArch64 and PPC64 that it's actually not needed. I can't test
on these architectures, so I don't want to change the code just
because it looks right :)
While at it, rename some variables for consistency and add a
check in test/ompt/parallel/normal.c that the pointer was reset
before entering the barrier.
Differential Revision: https://reviews.llvm.org/D64442
llvm-svn: 366721
Remove loopTripCnt from threaded device stack after consuming it.
Added a libomptarget DP message to aid in future debugging and to
validate the added testcase, which only runs in Debug build.
Differential Revision: https://reviews.llvm.org/D64808
llvm-svn: 366349
This is a follow up patch to D64534 (r365963) which removed all OMP
spec versioning within the OpenMP runtime codebase. This patch removes
REQUIRES: openmp-x.y lines from lit tests.
llvm-svn: 366341
This leads to problems when compiling C++ code with libc++ for Nvidia GPUs
because Clang now uses wrappers for math functions that might include
C++ templates not allowed in 'extern "C"'.
Differential Revision: https://reviews.llvm.org/D64625
llvm-svn: 366229
Summary:
We used CUDART_VERSION macro to check for the installed cuda version
but this macro is defined in cuda_runtime_api.h, which is not used by the
project. It is better to use the CUDA_VERSION macro, which is defined in cuda.h.
Also, added the check if this macro is defined. If macro is undefined,
there is something wrong with the cuda configuration and we should not
continue the compilation.
This also fixes problems with runtime building in cuda 10+.
Reviewers: grokos
Subscribers: guansong, jdoerfert, caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64648
llvm-svn: 366224
Summary:
We used to call the __kmpc_omp_taskwait function with the global thread id set
to 0. This may crash the application at runtime if the thread executing the
target region is not the master thread.
Reviewers: grokos, kkwli0
Subscribers: guansong, jdoerfert, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64571
llvm-svn: 366220
Remove all older OMP spec versioning from the runtime and build system.
Patch by Terry Wilmarth
Differential Revision: https://reviews.llvm.org/D64534
llvm-svn: 365963
These entry points are never called by Clang trunk nor clang-ykt. If
XL doesn't use them either, they can finally go away.
Differential Revision: https://reviews.llvm.org/D52700
llvm-svn: 365817
Summary:
The __kmpc_push_tripcount function is not thread safe and may lead to a data
race when target regions are executed in parallel threads. The patch
makes the loopTripCnt counter thread aware and stores the tripcount value
per thread in a map. Access to the map is guarded by a mutex to prevent
a data race in the map itself.
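A minimal sketch of a per-thread trip count map guarded by a mutex, as described above. The class DemoTripCounts and its methods are hypothetical and not the libomptarget data structure. Erasing the entry when it is consumed is a choice made in this sketch.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the per-device state.
class DemoTripCounts {
public:
  // Record the trip count for the calling thread.
  void push(std::uint64_t trip_count) {
    std::lock_guard<std::mutex> guard(mutex_);
    counts_[std::this_thread::get_id()] = trip_count;
  }

  // Return the calling thread's trip count and erase it, or 0 if none is set.
  std::uint64_t consume() {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = counts_.find(std::this_thread::get_id());
    if (it == counts_.end())
      return 0;
    std::uint64_t value = it->second;
    counts_.erase(it);
    return value;
  }

private:
  std::mutex mutex_;                                // guards counts_
  std::map<std::thread::id, std::uint64_t> counts_; // one slot per thread
};
```

Because the key is the thread id, two host threads launching target regions at the same time never overwrite each other's value, and the mutex protects the map structure while entries are added or removed.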
Test is for NVPTX target because it does not work correctly on the
host. Seems to me, there is a problem in libomp with target regions in
the parallel threads.
Reviewers: grokos
Subscribers: guansong, jfb, jdoerfert, openmp-commits, kkwli0, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64080
llvm-svn: 365332
Summary:
According to the OpenMP standard, flush makes a thread’s temporary view of memory consistent with memory and enforces an order on the memory operations of the variables explicitly specified or implied.
According to the Cuda toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#memory-fence-functions), the __threadfence() function provides the required functionality.
__threadfence_system() also provides required functionality, but it also
includes some extra functionality, like synchronization of page-locked
host memory, synchronization for the host, etc. It is not required per
the standard and we can use more relaxed version of memory fence
operation.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D62397
llvm-svn: 364572
Bug reported in https://bugs.llvm.org/show_bug.cgi?id=42269.
Freeing of the contention group (CG) structure by the master thread looks wrong,
because workers can leave the CG later on. Instead, the freeing
is now done by the last thread leaving the CG.
Differential Revision: https://reviews.llvm.org/D63599
llvm-svn: 364456
Summary:
This patch adds support for handling variables under the:
```
#pragma omp declare target to()
```
clause when the
```
#pragma omp requires unified_shared_memory
```
is used.
The address of the host variable is copied into the device pointer just like for the declare target link case.
Reviewers: ABataev, caomhin, grokos, AlexEichenberger
Reviewed By: grokos
Subscribers: jcownie, guansong, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D63106
llvm-svn: 363825