llvm-project

Commit Graph

Author	SHA1	Message	Date
Joseph Huber	fdbb15355e	[Libomptarget][CUDA] Check CUDA compatibility correctly We recently added support for multi-architecture binaries in libomptarget. This is done by extracting the architecture from the embedded image and comparing it with the major and minor version supported by the current CUDA installation. Previously we just compared these directly, which was not correct for binary compatibility. The CUDA documentation states that an image is compatible with the current device when its major version is the same and its minor version is less than or equal to the device's. Change the check to use this new logic in the CUDA plugin. Fixes #57049 Reviewed By: jdoerfert, ye-luo Differential Revision: https://reviews.llvm.org/D131567	2022-08-10 11:15:27 -04:00
Fangrui Song	0972a390b9	LLVM_FALLTHROUGH => [[fallthrough]]. NFC	2022-08-09 04:06:52 +00:00
Joel E. Denny	cfa6e79df3	[Libomptarget] Don't report lack of CUDA devices Sometimes libomptarget's CUDA plugin produces unhelpful diagnostics about a lack of CUDA devices before an application runs: ``` $ clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa hello-world.c $ ./a.out CUDA error: Error returned from cuInit CUDA error: no CUDA-capable device is detected Hello World: 4 ``` This can happen when the CUDA plugin was built but all CUDA devices are currently disabled in some manner, perhaps because `CUDA_VISIBLE_DEVICES` is set to the empty string. As shown in the above example, it can even happen when we haven't compiled the application for offloading to CUDA. The following code from `openmp/libomptarget/plugins/cuda/src/rtl.cpp` appears to be intended to handle this case, and it chooses not to write a diagnostic to stderr unless debugging is enabled: ``` if (NumberOfDevices == 0) { DP("There are no devices supporting CUDA.\n"); return; } ``` The problem is that the above code is never reached because the earlier `cuInit` returns `CUDA_ERROR_NO_DEVICE`. This patch handles that `cuInit` case in the same manner as the above code handles the `NumberOfDevices == 0` case. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D130371	2022-07-22 14:46:45 -04:00
Joseph Huber	a3804a3145	[Libomptarget] Make the plugins link as LLVM libraries Previously we made `libomptarget` link as an LLVM library so we have access to the LLVM core libraries. After the initial patch stuck we can now apply the same changes to the plugins. This will allow us to use LLVM in all of `libomptarget` when we have uses for them. In the future this should allow us to remove the dependencies on `libelf`, `libffi`, and `dl`. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D130262	2022-07-22 09:34:12 -04:00
Joseph Huber	e01ce4e88a	[Libomptarget] Add checks for CUDA subarchitecture using new info This patch extends the `is_valid_binary` routine to also check if the binary's architecture string matches the one parsed from the runtime. This should allow us to only use the binary whose compute capability matches, allowing us to support basic multi-architecture binaries for CUDA. Depends on D127432 Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D127505	2022-07-21 13:20:06 -04:00
Joseph Huber	1f940b69c3	[Libomptarget][NFC] Fix signed comparison warnings Summary: Non-functional change, just fixing some sign comparison warnings by making both match.	2022-07-15 13:22:55 -04:00
Joseph Huber	d27d0a673c	[Libomptarget][NFC] Make Libomptarget use the LLVM naming convention Libomptarget grew out of a project that was originally not in LLVM. As we develop libomptarget this has led to an increasingly large clash between the naming conventions used. This patch fixes most of the variable names that did not conform to the LLVM standard, that is `VariableName` for variables and `functionName` for functions. This patch was primarily done using my editor's linting messages, if there are any issues I missed arising from the automation let me know. Reviewed By: saiislam Differential Revision: https://reviews.llvm.org/D128997	2022-07-05 14:53:38 -04:00
Shilei Tian	696bca9bb2	[NFC][OpenMP][CUDA] Remove unnecessary default label	2022-07-01 09:50:29 -04:00
Shilei Tian	2695e23ad9	[OpenMP][CUDA] Fix the issue that P2P memcpy doesn't work This patch fixes the issue that P2P memcpy doesn't work. The root cause is we didn't set current context when calling the API function. In addition, a matrix to track the states of each pair of devices is also added such that we only need to query and configure the device once. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D122764	2022-06-28 15:32:03 -04:00
Joseph Huber	d5d836635c	[Libomptarget] Add test config for compiling in LTO-mode We are planning on making LTO the default compilation mode for offloading. In order to make sure it works we should run these tests on the test suite. AMDGPU already uses the LTO compilation path for its linking, but in LTO mode it also links the static library late. Performing LTO requires the static library to be built, if we make the change this will be a hard requirement and the old bitcode library will go away. This means users will need to use either a two-step build or a runtimes build for libomptarget. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D127512	2022-06-14 10:16:03 -04:00
Joseph Huber	9ffa945c40	[Libomptarget] Remove global include directory from libomptarget We used to globally include the libomptarget include directory for all projects. This caused some conflicts with the other files named "Debug.h". This patch changes the cmake to include these files via the target include instead. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D125563	2022-05-13 14:38:47 -04:00
Joseph Huber	ae23be84cb	[OpenMP] Make the new offloading driver the default Previously an opt-in flag `-fopenmp-new-driver` was used to enable the new offloading driver. After passing tests for a few months it should be sufficiently mature to flip the switch and make it the default. The new offloading driver is now enabled if there is OpenMP and OpenMP offloading present and the new `-fno-openmp-new-driver` is not present. The new offloading driver has three main benefits over the old method: - Static library support - Device-side LTO - Unified clang driver stages Depends on D122683 Differential Revision: https://reviews.llvm.org/D122831	2022-04-18 15:05:09 -04:00
Johannes Doerfert	a3a42c3ca2	[OpenMP][FIX] Ensure to set the context for wait events if necessary Differential Revision: https://reviews.llvm.org/D123445	2022-04-12 16:42:50 -05:00
Michael Kruse	7fa7b0cbd8	[libomptarget] Add device RTL to regression test dependencies. In a clean build directory, `check-openmp` or `check-libomptarget` will fail because of missing device RTL .bc files. Ensure that the new custom targets `omptarget.devicertl.nvptx` and `omptarget.devicertl.amdgpu` (corresponding to the plugin rtl targets `omptarget.rtl.cuda` and `omptarget.rtl.amdgpu`, respectively) are dependencies of the regression tests. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D123177	2022-04-06 20:01:47 -05:00
Johannes Doerfert	ba93e4e33e	[OpenMP][NFC] Add missing virtual destructor to silence warning	2022-03-28 22:33:18 -05:00
Shilei Tian	545fcc3d84	[OpenMP][CUDA] Fix potential program crash caused by double free resources As we mentioned in the code comments for function `ResourcePoolTy::release`, at some point there could be two identical resources on the two sides of `Next` mark. It is usually not an issue, unless the following case: 1. Some resources are not returned. 2. We need to iterate the pool and free the element. That will cause double free, which is the case for event pool. Since we don't release events held by the data map, it can happen that the `Next` mark is not reset, and we have two identical items in the pool. When the pool is destroyed, we will call `cuEventDestroy` twice on the same event. In the best case, we can only observe CUDA errors. In the worst case, it can cause internal failures in CUDART and further crash. This patch fixes the issue by tracking all resources that have been given using an `unordered_set`. We don't remove it when a resource is returned. When the pool is destroyed, we merge the pool (a `vector`) and the set. In this way, we can make sure that the set contains all resources allocated from the device. We just need to iterate the set and free the resource accordingly. For now, only event pool is set to use it. Stream pool is not because we can make sure all streams are returned when the plugin is destroyed. Someone might be wondering, why don't we release all events held in the data map. That is because, plugins are determined to be destroyed before `libomptarget`. If we can somehow make the plugin outlast `libomptarget`, life will be much easier. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D122014	2022-03-25 22:49:32 -04:00
Johannes Doerfert	6c2be885ff	Revert "[OpenMP][NFC] Add missing virtual destructor to silence warning" This reverts commit `b9fd8f34ae` as it accidentally contained a unit test change that is not finished (and unrelated).	2022-03-25 16:07:11 -05:00
Johannes Doerfert	b9fd8f34ae	[OpenMP][NFC] Add missing virtual destructor to silence warning	2022-03-25 16:00:53 -05:00
Shilei Tian	f6639a424b	[OpenMP][CUDA] Fix the check of `setContext`	2022-03-09 18:48:44 -05:00
Shilei Tian	39d3283a08	[OpenMP][CUDA] Avoid calling `cuCtxSetCurrent` redundantly Currently we set the context everywhere accordingly, but that causes many unnecessary function calls. For example, in the resource pool, if we need to resize the pool, we need to get from allocator. Each call to allocate sets the current context once, which is unnecessary. In this patch, we set the context only in the entry interface functions, if needed. Ideally this should be implemented via RAII, but since `cuCtxSetCurrent` could return an error, and we don't use exceptions, we can't stop the execution if RAII fails. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D121322	2022-03-09 16:32:47 -05:00
Shilei Tian	5105c7cd78	[OpenMP][CUDA] Fix an issue that multiple `CUmodule` could be overwritten This patch fixes the issue introduced in `14de0820e8` and D120089, that if dynamic libraries are used, the `CUmodule` array could be overwritten. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D121308	2022-03-09 14:55:20 -05:00
Johannes Doerfert	14de0820e8	[OpenMP][FIX] Ensure the modules vector is filled as others are The modules vector was for some reason special which could lead to it not being of the same size (=num devices). Easiest solution is to treat it like we do all the other vectors.	2022-03-08 23:45:43 -06:00
Johannes Doerfert	1660288b28	[OpenMP][CUDA] Use one event pool per device An event pool, similar to the stream pool, needs to be kept per device. For one, events are associated with cuda contexts which means we cannot destroy the former after the latter. Also, CUDA documentation states streams and events need to be associated with the same context, which we did not ensure at all. Differential Revision: https://reviews.llvm.org/D120142	2022-03-07 23:43:05 -06:00
Johannes Doerfert	10aa83ff74	[OpenMP] Allow to explicitly deinitialize device resources There are two problems this patch tries to address: 1) We currently free resources in a random order wrt. plugin and libomptarget destruction. This patch should ensure the CUDA plugin is less fragile if something during the deinitialization goes wrong. 2) We need to support (hard) pause runtime calls eventually. This patch allows us to free all associated resources, though we cannot reinitialize the device yet. Follow up patch will associate one event pool per device/context. Differential Revision: https://reviews.llvm.org/D120089	2022-03-07 23:43:04 -06:00
Shilei Tian	7f7c2c34b6	[OpenMP][CMake] Clean up the CMake variable `LIBOMPTARGET_LLVM_INCLUDE_DIRS` `LIBOMPTARGET_LLVM_INCLUDE_DIRS` is currently checked and included for multiple times redundantly. This patch is simply a clean up. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D121055	2022-03-05 22:37:59 -05:00
Shilei Tian	aca33b0b37	[OpenMP][CUDA] Remove the hard team limit Currently we have a hard team limit, which is set to 65536. It says no matter whether the device can support more teams, or users set more teams, as long as it is larger than that hard limit, the final number to launch the kernel will always be that hard limit. It is way less than the actual hardware limit. For example, my workstation has GTX2080, and the hardware limit of grid size is 2147483647, which is exactly the largest number an `int32_t` can represent. There is no limitation mentioned in the spec. This patch simply removes it. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119313	2022-02-10 18:07:46 -05:00
Shilei Tian	f6685f7746	[OpenMP][CUDA] Refine the logic to determine grid size This patch refines the logic to determine grid size as previous method can escape the check of whether `CudaBlocksPerGrid` could be greater than the actual hardware limit. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119311	2022-02-10 14:13:32 -05:00
Joseph Huber	034adaf5be	[OpenMP] Completely remove old device runtime This patch completely removes the old OpenMP device runtime. Previously, the new runtime had the prefix `libomptarget-new-` and the old runtime was simply called `libomptarget-`. This patch makes the formerly new runtime the only runtime available. The entire project has been deleted, and all references to the `libomptarget-new` runtime have been replaced with `libomptarget-`. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D118934	2022-02-04 15:31:33 -05:00
Joseph Huber	0ac799b5c9	[Libomptarget] Run GPU offloading tests using the new driver This patch adds a new target to the tests to run using the new driver as the method for generating offloading code. Depends on D116541 Differential Revision: https://reviews.llvm.org/D118637	2022-01-31 23:11:43 -05:00
Sri Hari Krishna Narayanan	f44e41af41	Runtime for Interop directive This implements the runtime portion of the interop directive. It expects the frontend and IRBuilder portions to be in place for proper execution. It currently works only for GPUs and has several TODOs that should be addressed going forward. Reviewed By: RaviNarayanaswamy Differential Revision: https://reviews.llvm.org/D106674	2022-01-27 15:16:24 -05:00
Jon Chesterfield	e08f3bfe58	[openmp] Disable build of old runtimes by default The old runtime is not tested by CI. Disable the build prior to the llvm-14 branch. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D118268	2022-01-26 19:17:31 +00:00
Shilei Tian	943d1d83dd	[OpenMP][CUDA] Add resource pool for CUevent Following D111954, this patch adds the resource pool for CUevent. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D116315	2021-12-28 17:42:38 -05:00
Shilei Tian	357c8031ff	[OpenMP][Plugin] Minor adjustments to ResourcePool This patch makes some minor adjustments to `ResourcePool`: - Don't initialize the resources if `Size` is 0 which can avoid assertion. - Add a new interface function `clear` to release all held resources. - If initial size is 0, resize to 1 when the first request is encountered. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D116340	2021-12-28 16:11:03 -05:00
Shilei Tian	a697a0a4b6	[OpenMP][Plugin] Introduce generic resource pool Currently CUDA streams are managed by `StreamManagerTy`. It works very well. Now we have the need that some resources, such as CUDA stream and event, will be held by `libomptarget`. It is always good to buffer those resources. What's more important, given the way that `libomptarget` and plugins are connected, we cannot make sure whether plugins are still alive when `libomptarget` is destroyed. That leads to an issue that those resources held by `libomptarget` might not be released correctly. As a result, we need a unified management of all the resources that can be shared between `libomptarget` and plugins. `ResourcePoolTy` is designed to manage the type of resource for one device. It has to work with an allocator which is supposed to provide `create` and `destroy`. In this way, when the plugin is destroyed, we can make sure that all resources allocated from native runtime library will be released correctly, no matter whether `libomptarget` has started its destruction. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D111954	2021-12-27 11:32:14 -05:00
Jon Chesterfield	38af5b4fd1	[libomptarget][nfc] Refactor dlwrap.h for easier reuse in D115966 and upcoming patches	2021-12-17 22:28:31 +00:00
Jon Chesterfield	a05a0c3c2f	[libomptarget] Add cmake variables to disable building the amdgpu or cuda plugins Analogous to the controls on building device runtimes Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D115148	2021-12-06 16:42:26 +00:00
Jon Chesterfield	bf6f955f39	[libomptarget] Run GPU offloading tests on both new and old runtime Implemented by patching python config instead of modifying all the tests so that -generic and XFAIL work as usual. Expectation is for this to be reverted once the old runtime is deleted. Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D112225	2021-10-22 23:28:44 +01:00
Joseph Huber	b1ce454930	[OpenMP] Remove macro guards for device debugging The plugin currently uses a macro to check if this is a debug build before assigning the debug kind variable to the device environment struct. This is being deprecated because the new device runtime does not maintain separate debug builds and should always be available. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D112083	2021-10-19 12:21:43 -04:00
Shilei Tian	c060c634ef	[OpenMP][NVPTX] Fix an error in configuring #teams and #threads It must be a copy mistake. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D111407	2021-10-08 11:07:43 -04:00
Jon Chesterfield	0c554a4769	[libomptarget] Move device environment to shared header, remove divergence Follow on to D110006, related to D110957 Where implementations have diverged this resolves to match the new DeviceRTL - replaces definitions of this struct in deviceRTL and plugins with include - changes the dynamic_shared_size field from D110006 to 32 bits - handles stdint being unavailable in DeviceRTL - adds a zero initializer for the field to amdgpu - moves the extern declaration for deviceRTL to target_interface (omptarget.h is more natural, but doesn't work due to include order with debug.h) - Renames the fields everywhere to match the LLVM format used in DeviceRTL - Makes debug_level uint32_t everywhere (previously sometimes int32_t) Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D111069	2021-10-07 12:03:48 +01:00
Shilei Tian	ca999f7191	[OpenMP][Offloading] Use bitset to indicate execution mode instead of value The execution mode of a kernel is stored in a global variable, whose value means: - 0 - SPMD mode - 1 - generic mode - 2 - SPMD mode execution with generic mode semantics We are going to add support for SIMD execution mode. It will come with another execution mode, such as SIMD-generic mode. As a result, this value-based indicator is not flexible. This patch changes to a bitset-based solution to encode execution mode. Each position is: [0] - generic mode [1] - SPMD mode [2] - SIMD mode (will be added later) In this way, `0x1` is generic mode, `0x2` is SPMD mode, and `0x3` is SPMD mode execution with generic mode semantics. In the future after we add the support for SIMD mode, `0b1xx` will be in SIMD mode. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D110029	2021-09-22 11:40:52 -04:00
Shilei Tian	49e976c934	[OpenMP][NVPTX] Fix a warning that data argument not used by format string Reviewed By: jhuber6, grokos Differential Revision: https://reviews.llvm.org/D110104	2021-09-20 17:22:14 -04:00
Joseph Huber	f1c821fa85	[OpenMP] Add support for dynamic shared memory in new RTL This patch adds support for using dynamic shared memory in the new device runtime. The new function `__kmpc_get_dynamic_shared` will return a pointer to the buffer of dynamic shared memory. Currently the amount of memory allocated is set by an environment variable. In the future this amount will be added to the amount used for the smart stack which will be configured in a similar way. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D110006	2021-09-17 21:25:36 -04:00
Joseph Huber	ec02c34b6d	[OpenMP] Add additional fields to device environment This patch adds fields for the device number and number of devices into the device environment struct and debugging values. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D110004	2021-09-17 21:25:32 -04:00
Shilei Tian	e8fdacfd81	[OpenMP][NVPTX] Fixed missing variables for CUDA free compilation in NVPTX plugin `CU_EVENT_DEFAULT` is defined in CUDA header. It should be added to `openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.h` for CUDA free build. Reviewed By: ronlieb Differential Revision: https://reviews.llvm.org/D108878	2021-08-28 18:08:10 -04:00
Shilei Tian	29df4ab3f3	[OpenMP][Offloading] Add support for event related interfaces This patch adds support for event-related interfaces, which will be used later to fix a data race. See D104418 for more details. Reviewed By: jdoerfert, ye-luo Differential Revision: https://reviews.llvm.org/D108528	2021-08-28 16:24:14 -04:00
Joseph Huber	edb8acdc6e	[Libomptarget] Correctly default to Generic if exec_mode is not present Currently, the runtime returns an error when the `exec_mode` global is not present. The expected behaviour is that the region will default to Generic. This prevents global constructors from being called because they do not contain execution mode globals. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D108255	2021-08-18 11:24:28 -04:00
Shilei Tian	28939b6ae5	[NFC] Clean up and clang-format openmp/libomptarget/plugins/cuda/src/rtl.cpp	2021-08-05 22:32:28 -04:00
Jose M Monsalve Diaz	88e66fa60a	[OpenMP] Fixing missing variables when CUDA SDK not in system This patch fixes the error reported in D106751. When there is no CUDA SDK installed in the system, the build fails due to missing `CU_DEVICE_ATTRIBUTE` variables. Using @zsrkmyn suggested fix Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106933	2021-07-27 23:46:15 -05:00
Jose M Monsalve Diaz	313c523995	[OpenMP][Tool] Introducing the `llvm-omp-device-info` tool This patch introduces the `llvm-omp-device-info` tool, which uses the omptarget library and interface to query the device info from all the available devices as seen by OpenMP. This is inspired by PGI's `pgaccelinfo`. Since omptarget usually requires a description structure with executable kernels, I split the initialization of the RTLs and Devices to be able to initialize all possible devices and query each of them. This revision relies on the patch that introduces the print device info. A limitation is that the order in which the devices are initialized, and the corresponding device ID is not necessarily the one seen by OpenMP. The changes are as follows: 1. Separate the RTL initialization that was performed in `RegisterLib` to its own `initRTLonce` function 2. Create an `initAllRTLs` method that initializes all available RTLs at runtime 3. Created the `llvm-deviceinfo.cpp` tool that uses `omptarget` to query each device and prints its information. 
Example Output: ``` Device (0): print_device_info not implemented Device (1): print_device_info not implemented Device (2): print_device_info not implemented Device (3): print_device_info not implemented Device (4): CUDA Driver Version: 11000 CUDA Device Number: 0 Device Name: Quadro P1000 Global Memory Size: 4236312576 bytes Number of Multiprocessors: 5 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 bytes Max Shared Memory per Block: 49152 bytes Registers per Block: 65536 Warp Size: 32 Threads Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647 bytes Texture Alignment: 512 bytes Clock Rate: 1480500 kHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: DEFAULT Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 2505000 kHz Memory Bus Width: 128 bits L2 Cache Size: 1048576 bytes Max Threads Per SMP: 2048 Async Engines: Yes (2) Unified Addressing: Yes Managed Memory: Yes Concurrent Managed Memory: Yes Preemption Supported: Yes Cooperative Launch: Yes Multi-Device Boars: No Compute Capabilities: 61 ``` Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D106752	2021-07-27 22:38:35 -04:00
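
The compatibility rule described in commit fdbb15355e (same major version, image minor version no greater than the device's) can be sketched in a few lines. This is only an illustration under stated assumptions: `ComputeCapability`, `parseArch`, and `isImageCompatible` are hypothetical names, not the plugin's actual code, and architecture strings are assumed to have the plain `sm_<major><minor>` form such as `sm_75`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical type for illustration; not the plugin's data structure.
struct ComputeCapability {
  int Major;
  int Minor;
};

// Parses strings of the assumed form "sm_<digits>", e.g. "sm_75" -> {7, 5}.
// Returns false if the string does not match that form.
bool parseArch(const std::string &Arch, ComputeCapability &Out) {
  int SM = 0;
  if (std::sscanf(Arch.c_str(), "sm_%d", &SM) != 1 || SM < 0)
    return false;
  Out = {SM / 10, SM % 10};
  return true;
}

// An image can run on a device when the major versions are equal and the
// device's minor version is at least the image's minor version.
bool isImageCompatible(ComputeCapability Image, ComputeCapability Device) {
  return Image.Major == Device.Major && Image.Minor <= Device.Minor;
}
```

Under this rule an `sm_70` image is accepted on an `sm_75` device, while an `sm_75` image on an `sm_70` device and an `sm_70` image on an `sm_80` device are both rejected. A direct equality comparison, as described for the earlier behaviour, would reject the first case as well.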

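The bitset encoding of the kernel execution mode described in commit ca999f7191 can be illustrated as follows. The bit positions and the values `0x1`, `0x2`, and `0x3` come from the commit message; the enumerator and function names are hypothetical and chosen only for this sketch.

```cpp
#include <cstdint>

// Bit positions follow the commit message; the names are hypothetical.
enum ExecModeFlags : uint8_t {
  ExecModeGeneric = 1 << 0,                             // 0x1: generic mode
  ExecModeSPMD = 1 << 1,                                // 0x2: SPMD mode
  ExecModeGenericSPMD = ExecModeGeneric | ExecModeSPMD  // 0x3: SPMD execution
                                                        // with generic semantics
};

// True when the kernel executes in SPMD fashion (bit 1 set).
constexpr bool executesAsSPMD(uint8_t Mode) {
  return (Mode & ExecModeSPMD) != 0;
}

// True when the kernel carries generic mode semantics (bit 0 set).
constexpr bool hasGenericSemantics(uint8_t Mode) {
  return (Mode & ExecModeGeneric) != 0;
}

// True only for the combined mode, where both bits are set.
constexpr bool isGenericSPMD(uint8_t Mode) {
  return (Mode & ExecModeGenericSPMD) == ExecModeGenericSPMD;
}
```

Because each property is a separate bit, a later mode such as SIMD could be added as bit 2 without renumbering the existing values, which is the flexibility the commit message cites as the motivation for moving away from plain enumerated values.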

117 Commits