Restructured dynamic loop dispatcher code.
Fixed use of dispatch buffers for nonmonotonic dynamic (static_steal) schedule:
- eliminated possibility of stealing iterations of the wrong loop when victim
thread changed its buffer to work on another loop;
- fixed race when victim thread changed its buffer to work in nested parallel;
- eliminated "static" property of the schedule, that is now a single thread can
execute whole loop.
Differential Revision: https://reviews.llvm.org/D103648
Two-level distributed barrier is a new experimental barrier designed
for Intel hardware that has better performance in some cases than the
default hyper barrier.
This barrier is designed to handle fine granularity parallelism where
barriers are used frequently with little compute and memory access
between barriers. There is no need to use it for codes with few
barriers and large granularity compute, or memory intensive
applications, as little difference will be seen between this barrier
and the default hyper barrier. This barrier is designed to work
optimally with a fixed number of threads, and has a significant setup
time, so should NOT be used in situations where the number of threads
in a team is varied frequently.
The two-level distributed barrier is off by default -- hyper barrier
is used by default. To use this barrier, you must set all barrier
patterns to use this type, because it will not work with other barrier
patterns. Thus, to turn it on, the following settings are required:
KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
KMP_PLAIN_BARRIER_PATTERN=dist,dist
KMP_REDUCTION_BARRIER_PATTERN=dist,dist
Branching factors (set with KMP_FORKJOIN_BARRIER, KMP_PLAIN_BARRIER,
and KMP_REDUCTION_BARRIER) are ignored by the two-level distributed
barrier.
Differential Revision: https://reviews.llvm.org/D103121
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
Lazily set affinity for root threads. Previously, the root thread
executing middle initialization would attempt to assign affinity
to other existing root threads. This was not working properly as the
set_system_affinity() function wasn't setting the affinity for the
target thread. Instead, the middle init thread was resetting the
its own affinity using the target thread's affinity mask.
Differential Revision: https://reviews.llvm.org/D103625
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
Size of type of the dependence flag changed from 1 to 4 bytes in clang.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
Warnings on deprecated api cannot be suppressed if the library is not initialized.
With this change it is possible to set KMP_WARNINGS=false to suppress the warnings.
Differential Revision: https://reviews.llvm.org/D102676
Bug 49356 (https://bugs.llvm.org/show_bug.cgi?id=49356) reports crash in
the test case `tasking/bug_taskwait_detach.cpp`, which is caused by the wrong
function declaration. `gtid` in `__kmpc_omp_task` should be `kmp_int32`.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D102584
This patch does the following:
1) Introduce kmp_topology_t as the runtime-friendly structure (the
corresponding global variable is __kmp_topology) to determine the
exact machine topology which can vary widely among current and future
architectures. The current design is not easy to expand beyond the assumed
three layer topology: sockets, cores, and threads so a rework capable of
using the existing KMP_AFFINITY mechanisms is required.
This new topology structure has:
* The depth and types of the topology
* Ratio count for each consecutive level (e.g., number of cores per
socket, number of threads per core)
* Absolute count for each level (e.g., 2 sockets, 16 cores, 32 threads)
* Equivalent topology layer map (e.g., Numa domain is equivalent to
socket, L1/L2 cache equivalent to core)
* Whether it is uniform or not
The hardware threads are represented with the kmp_hw_thread_t
structure. This structure contains the ids (e.g., socket 0, core 1,
thread 0) and other information grabbed from the previous Address
structure. The kmp_topology_t structure contains an array of these.
2) Generalize the KMP_HW_SUBSET envirable for the new
kmp_topology_t structure. The algorithm doesn't assume any order with
tiles,numa domains,sockets,cores,threads. Instead it just parses the
envirable, makes sure it is consistent with the detected topology
(including taking into account equivalent layers) and then trims away
the unneeded subset of hardware threads. To enable this, a new
kmp_hw_subset_t structure is introduced which contains a vector of
items (hardware type, number user wants, offset). Any keyword within
__kmp_hw_get_keyword() can be used as a name and can be shortened as
well. e.g.,
KMP_HW_SUBSET=1s,2numa,4tile,2c,3t can be used on the KNL SNC-4 machine.
3) Simplify topology detection functions so they only do the singular
task of detecting the machine's topology. Printing, and all
canonicalizing functionality is now done afterwards. So many lines of
duplicated code are eliminated.
4) Add new ll_caches and numa_domains to OMP_PLACES, and
consequently, KMP_AFFINITY's granularity setting. All the names within
__kmp_hw_get_keyword() are available for use in OMP_PLACES or
KMP_AFFINITY's granularity setting.
5) Simplify and future-proof code where explicit lists of allowed
affinity settings keywords inside if() conditions.
6) Add x86 CPUID leaf 4 cache detection to existing x2apic id method
so equivalent caches could be detected (in particular for the ll_caches
place).
Differential Revision: https://reviews.llvm.org/D100997
Implement the remaining GOMP_* functions to support task reductions
in taskgroup, parallel, loop, and taskloop constructs. The unused mem
argument to many of the work-sharing constructs has to do with the
scan() directive/ inscan() modifier. If mem is set, each function
will call KMP_FATAL() and tell the user scan/inscan is unsupported. The
GOMP reduction implementation is kept separate from our implementation
because of how GOMP presents reduction data and computes the reductions.
GOMP expects the privatized copies to be present even after a #pragma
omp parallel reduction(task:...) region has ended so the data is stored
inside GOMP's uintptr_t* data pseudo-structure. This style is tightly
coupled with GCC compiler codegen. There also isn't any init(),
combiner(), fini() functions in GOMP's codegen so the two
implementations were to disparate to try to wrap GOMP's around our own.
Differential Revision: https://reviews.llvm.org/D98806
Current atfork() handler for child processes does not reset
the affinity masks array which prevents users from setting their own
affinity in child processes.
Differential Revision: https://reviews.llvm.org/D99218
It is reported that after enabling hidden helper thread, the program
can hit the assertion `new_gtid < __kmp_threads_capacity` sometimes. The root
cause is explained as follows. Let's say the default `__kmp_threads_capacity` is
`N`. If hidden helper thread is enabled, `__kmp_threads_capacity` will be offset
to `N+8` by default. If the number of threads we need exceeds `N+8`, e.g. via
`num_threads` clause, we need to expand `__kmp_threads`. In
`__kmp_expand_threads`, the expansion starts from `__kmp_threads_capacity`, and
repeatedly doubling it until the new capacity meets the requirement. Let's
assume the new requirement is `Y`. If `Y` happens to meet the constraint
`(N+8)*2^X=Y` where `X` is the number of iterations, the new capacity is not
enough because we have 8 slots for hidden helper threads.
Here is an example.
```
#include <vector>
int main(int argc, char *argv[]) {
constexpr const size_t N = 1344;
std::vector<int> data(N);
#pragma omp parallel for
for (unsigned i = 0; i < N; ++i) {
data[i] = i;
}
#pragma omp parallel for num_threads(N)
for (unsigned i = 0; i < N; ++i) {
data[i] += i;
}
return 0;
}
```
My CPU is 20C40T, then `__kmp_threads_capacity` is 160. After offset,
`__kmp_threads_capacity` becomes 168. `1344 = (160+8)*2^3`, then the assertions
hit.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D98838
Restrict the chunk_size * chunk_num to only occur for valid
chunk_nums and reimplement calculating the limit to avoid overflow.
Differential Revision: https://reviews.llvm.org/D96747
When including <ostream>, the register_callback macro of the OMPT callback.h
clashes with a function defined in ostream. This patch renames the macro
and includes ompt into the macro name.
This code alleviates some pathological loop parameters (lower,
upper, stride) within calculations involved in the static loop code. It
bounds the chunk size to the trip count if it is greater than the trip
count and also minimizes problematic code for when trip count < nth.
Differential Revision: https://reviews.llvm.org/D96426
This patch limits the number of dispatch buffers (used for
loop worksharing construct) to between 1 and 4096.
Differential Revision: https://reviews.llvm.org/D96749
This patch adds lower-bound and upper-bound to num_teams clause
according to OpenMP 5.1 specification. The initial number of teams
created is implementation defined, but it will be greater than or
equal to lower-bound and less than or equal to upper-bound. If
num_teams clause is not specified, the number of teams created is
implementation defined, but it will be greater or equal to 1.
Differential Revision: https://reviews.llvm.org/D95820
When OMP_PLACES contains an invalid value, the warning informs the user
that the fallback is OMP_PLACES=threads, but the actual internal setting
is OMP_PLACES=cores and is detected as such with KMP_SETTINGS=1.
This patch informs the user that OMP_PLACES=cores is being used instead
of OMP_PLACES=threads.
Differential Revision: https://reviews.llvm.org/D95170
This patch sets the def-allocator-var ICV based on the environment variables
provided in OMP_ALLOCATOR. Previously, only allowed value for OMP_ALLOCATOR
was a predefined memory allocator. OpenMP 5.1 specification allows predefined
memory allocator, predefined mem space, or predefined mem space with traits in
OMP_ALLOCATOR. If an allocator can not be created using the provided environment
variables, the def-allocator-var is set to omp_default_mem_alloc.
Differential Revision: https://reviews.llvm.org/D94985
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
Hierarchical barrier is an experimental barrier algorithm that uses aspects
of machine hierarchy to define the barrier tree structure. This patch fixes
offset calculation in hierarchical barrier. The offset is used to store info
on a flag about sleeping threads waiting on a location stored in the flag.
This commit also fixes a potential deadlock in hierarchical barrier when
using infinite blocktime by adjusting the offset value of leaf kids so that
it matches the value of leaf state. It also adds testing of default barriers
with infinite blocktime, and also tests hierarchical barrier algorithm with
both default and infinite blocktime.
Patch by Terry Wilmarth and Nawrin Sultana.
Differential Revision: https://reviews.llvm.org/D94241
This patch adds new API __kmpc_taskloop_5 to accomadate strict
modifier (introduced in OpenMP 5.1) in num_tasks and grainsize
clause.
Differential Revision: https://reviews.llvm.org/D92352
This is an alternative approach to address inconsistencies pointed out in: D90078
This patch makes sure that the return address is reset, when leaving the scope.
In some cases, I had to move the macro out of an if-statement to have it in the
right scope, in some cases I added an additional block to restrict the scope.
This patch does not handle inconsistencies, which might occur if the return
address is still set when we call into the application.
Test case (repeated_calls.c) provided by @hbae
Differential Revision: https://reviews.llvm.org/D91692
OpenMP 5.1 introduces the new env variable
OMP_TOOL_VERBOSE_INIT=(disabled|stdout|stderr|<filename>) to enable verbose
loading and initialization of OMPT tools.
This env variable helps to understand the cause when loading of a tool fails
(e.g., undefined symbols or dependency not in LD_LIBRARY_PATH)
Output of OMP_TOOL_VERBOSE_INIT is added for OMP_DISPLAY_ENV
Tests for this patch are integrated into the different existing tool loading
tests, making these tests more verbose. An Archer specific verbose test is
integrated into an existing Archer test.
Patch prepared by: Isabel Thärigen
Differential Revision: https://reviews.llvm.org/D91464
Currently the affinity format string has initial value. When users set
the format via OMP_AFFINITY_FORMAT, it will overwrite the format string. However,
when copying the format, the tailing null is missing. As a result, if the user
format string is shorter than default value, the remaining part in the default
value still makes effort. This bug is not exposed because the test case doesn't
check the end of a string. It only checks whether given output "contains" the
check string.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D91309
This patch allows to pass the OpenMP runtime tests after configuring with
`cmake . -DOPENMP_TEST_FLAGS:STRING="-Werror"`.
The warnings for OMPT tests are addressed in D90752.
Differential Revision: https://reviews.llvm.org/D91280
This doesn't add functionality, but just adds the new types and renames the
master callback to masked callback.
Differential Revision: https://reviews.llvm.org/D90752
As reported by @ronlieb, the test shows intermittent fails.
The test failed, if the dependent task was already finished, when the depending
task was to be created. We have other tests to check for the dependences pair.
The worker thread can start execution of the task before creation of the second task
Fixes the spurious failure reported in https://reviews.llvm.org/D61657
This patch updates the expected results for the GOMP interface patches: D87267, D87269, and D87271.
The taskwait-depend test is changed to really use taskwait-depend and copied to an task_if0-depend test.
To pass the tests, the handling of the return address was fixed.
Differential Revision: https://reviews.llvm.org/D87680
This change introduces the GOMP_taskwait_depend() function. It implements
the OpenMP 5.0 feature of #pragma omp taskwait with depend() clause by
wrapping around __kmpc_omp_wait_deps().
Differential Revision: https://reviews.llvm.org/D87269
Encapsulate GOMP task dependencies in separate class and introduce the
new mutexinoutset dependency type. This separate class allows
future GOMP task APIs easier access to the task dependency functionality
and better ability to propagate new dependency types to all existing GOMP
task APIs which use task dependencies.
Differential Revision: https://reviews.llvm.org/D87267
Implement GOMP_teams_reg() function which enables GOMP support of the
standalone teams construct. The GOMP_parallel* functions were modified
to call __kmp_fork_call() unconditionally so that the teams-specific
code could be reused within __kmp_fork_call() instead of reproduced
inside the GOMP_* functions.
Differential Revision: https://reviews.llvm.org/D87167
Following tests were disabled for clang-11 after upgrading to
version 5.0 in D82963:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
They are also failing for clang-12. Thus this temporary disabling
until they are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D84241
Following tests are failing after upgrading to version 5.0 but are passing
for version 4.5:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
To be enabled as soon as these tests are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D82963
If the compilation fails, the test is marked as unsupported.
-> This will never change for a specific version of gcc
If the linking fails, the test is marked as expected to fail.
-> This might change as LLVM/OpenMP implements the missing GOMP interface function
Reviewed by: Hahnfeld
Differential Revision: https://reviews.llvm.org/D83077
Summary:
The OpenMP loops are normalized and transformed into the loops from 0 to
max number of iterations. In some cases, original scheme may lead to
overflow during calculation of number of iterations. If it is unknown,
if we can end up with overflow or not (the bounds are not constant and
we cannot define if there is an overflow), cast original type to the
unsigned.
Reviewers: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits, cfe-commits, caomhin
Tags: #clang, #openmp
Differential Revision: https://reviews.llvm.org/D81881
Adds the callbacks for ordered with source/sink dependencies.
The test for task dependencies changed, because callbach.h now actually prints
the passed dependencies and the test also checks for the address.
Reviewed by: hbae
Differential Revision: https://reviews.llvm.org/D81807
This patch allows to specify a prefix (default:empty) to be included into print-out
written by callback.h.
Also adding a cmake target to find the header file from other tests.
Reviewed by: jdoerfert
Differential Revision: https://reviews.llvm.org/D76008
The OpenMP spec has the task-fulfill event for a call to omp_fulfill_event.
If the task did not yet finish execution, ompt_task_early_fulfill is used,
otherwise ompt_task_late_fulfill.
If a task does not complete, when the execution finishes (i.e., the task goes
in detached mode), ompt_task_detach instead of ompt_task_complete must be
used, when the next task is scheduled.
A test for both cases is included, which only work with clang-11+
Reviewed By: hbae
Differential revision: https://reviews.llvm.org/D80843
D78566 introduced a `\bnot\b` lit substitution in OpenMP test suites.
However, that would corrupt a command like
`FileCheck -implicit-check-not` or any file name like `%t.not`. We
could use lookbehind/lookahead assertions to avoid such cases, but
this patch switches to `%not` (suggested during the D78566 review) as
a safer option.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D79529
Without this patch, the openmp project's test suites do not appear to
have support for negative tests. However, D78170 needs to add a test
that an expected runtime failure occurs.
This patch makes `not` visible in all of the openmp project's test
suites. In all but `libomptarget/test`, it should be possible for a
test author to insert `not` before a use of the lit substitution for
running a test program. In `libomptarget/test`, that substitution is
target-specific, and its value is `echo` when the target is not
available. In that case, inserting `not` before a lit substitution
would expect an `echo` fail, so this patch instead defines a separate
lit substitution for expected runtime fails.
Reviewed By: jdoerfert, Hahnfeld
Differential Revision: https://reviews.llvm.org/D78566
EXCLUDE_FROM_ALL means something else for add_lit_testsuite as it does
for something like add_executable. Distinguish between the two by
renaming the variable and making it an argument to add_lit_testsuite.
Differential revision: https://reviews.llvm.org/D74168
Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=44733 | TEST 'libomp :: ompt/synchronization/reduction/tree_reduce.c' FAILED on 32-bit x86 ]]
For 32-bit we need at least 3 variables to avoid atomic reduction to be
choosen by runtime function `__kmp_determine_reduction_method`.
This patch adds reduction variables to the testcase.
Reviewers: mgorny, Hahnfeld
Differential Revision: https://reviews.llvm.org/D73850
Including two tests
These callbacks were added late to the 5.0 specification, an implementation is missing.
Reviewed By: jdoerfert
Differential Review: https://reviews.llvm.org/D70395
New OMPT tests with teams construct should be disabled for GCC as it
emits code with a GOMP entry not supported in the LLVM runtime.
Differential Revision: https://reviews.llvm.org/D65757
llvm-svn: 367939
This is a port of libomp for the RISC-V 64-bit Linux target.
We have tested this port on a HiFive Unleashed development board
using a downstream LLVM that has support for the missing bits in
upstream. As of now, all tests are passing, including OMPT.
Patch by Ferran Pallarès!
Differential Revision: https://reviews.llvm.org/D59880
llvm-svn: 367021
This is done at call-site and does not need to be handled in
__kmp_invoke_microtask. It was already absent from the x86
and x86_64 assembly, this patch removes it from the generic
implementation in z_Linux_util.cpp and adds documentation for
AArch64 and PPC64 that it's actually not needed. I can't test
on these architectures, so I don't want to change the code just
because it looks right :)
While at it, rename some variables for consistency and add a
check in test/ompt/parallel/normal.c that the pointer was reset
before entering the barrier.
Differential Revision: https://reviews.llvm.org/D64442
llvm-svn: 366721
This is a follow up patch to D64534 (r365963) which removed all OMP
spec versioning within the OpenMP runtime codebase. This patch removes
REQUIRES: openmp-x.y lines from lit tests.
llvm-svn: 366341
Remove all older OMP spec versioning from the runtime and build system.
Patch by Terry Wilmarth
Differential Revision: https://reviews.llvm.org/D64534
llvm-svn: 365963