Jack Kirk
3e0e5568a6
[CUDA] Fixed sm version constraint for __bmma_m8n8k128_mma_and_popc_b1.
...
As stated in
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma :
".and operation in single-bit wmma requires sm_80 or higher."
tra@: Fixed a bug in builtins-nvptx-mma.py test generator and regenerated the tests.
Differential Revision: https://reviews.llvm.org/D131265
2022-08-05 12:14:06 -07:00
David Green
9727c77d58
[NFC] Rename Instrinsic to Intrinsic
2022-04-25 18:13:23 +01:00
Artem Belevich
d774b4aa5e
[NVPTX, CUDA] Add .and.popc variant of the b1 MMA instruction.
...
That should allow clang to compile mma.h from CUDA-11.3.
Differential Revision: https://reviews.llvm.org/D105384
2021-07-15 12:02:09 -07:00
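For readers unfamiliar with the b1 MMA variants named in the entries above, here is a minimal CPU-side sketch of what a single-bit m8n8k128 multiply-accumulate computes, as the PTX ISA describes it: the elementwise multiply is replaced by a bitwise XOR or AND, and the sum by a population count. This is an illustration only, not the Clang builtin or its lowering. The names `bmma_reference`, `BitOp`, `BitRow`, and `Acc` are hypothetical, and the layout (A stored as rows of 128 bits, B as columns of 128 bits) is an assumption made for readability.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// Shape of the single-bit MMA discussed above: 8x8 accumulators, K = 128 bits.
constexpr int M = 8, N = 8, K = 128;

using BitRow = std::bitset<K>;                          // one row of A or one column of B
using Acc = std::array<std::array<std::int32_t, N>, M>; // 8x8 int32 accumulator tile

// Hypothetical selector for the bitwise operation (.xor.popc vs .and.popc).
enum class BitOp { Xor, And };

// CPU reference (assumed layout: a[i] is row i of A, b[j] is column j of B):
//   d[i][j] = c[i][j] + popcount(a[i] OP b[j])
Acc bmma_reference(const std::array<BitRow, M>& a,
                   const std::array<BitRow, N>& b,
                   const Acc& c,
                   BitOp op) {
  Acc d{};
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      // Combine the 128-bit row and column with the selected bitwise op.
      const BitRow combined = (op == BitOp::And) ? (a[i] & b[j]) : (a[i] ^ b[j]);
      // Accumulate the number of set bits into the int32 accumulator.
      d[i][j] = c[i][j] + static_cast<std::int32_t>(combined.count());
    }
  }
  return d;
}
```

Calling `bmma_reference(a, b, c, BitOp::And)` models the `.and.popc` form, which per the PTX documentation quoted in the newest entry requires sm_80 or higher on real hardware. `BitOp::Xor` models the older `.xor.popc` form.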
Steffen Larsen
3644726a78
[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions
...
Adds NVPTX builtins and intrinsics for the CUDA PTX `wmma.load`, `wmma.store`, `wmma.mma`, and `mma` instructions added in PTX 6.5 and 7.0.
PTX ISA descriptions:
- `wmma.load`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
- `wmma.store`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-st
- `wmma.mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
- `mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
Overview of `wmma.mma` and `mma` matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape
Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D104847
2021-06-29 15:44:07 -07:00
Artem Belevich
5fe85a003f
[CUDA] Implemented _[bi]mma* builtins.
...
These builtins provide access to the new integer and
sub-integer variants of MMA (matrix multiply-accumulate) instructions
provided by CUDA-10.x on sm_75 (AKA Turing) GPUs.
Also added a feature for PTX 6.4. While Clang/LLVM does not generate
any PTX instructions that need it, we still need to pass it through to
ptxas in order to be able to compile code that uses the new 'mma'
instruction as inline assembly (e.g., as used by NVIDIA's CUTLASS library:
https://github.com/NVIDIA/cutlass/blob/master/cutlass/arch/mma.h#L101 ).
Differential Revision: https://reviews.llvm.org/D60279
llvm-svn: 359248
2019-04-25 22:28:09 +00:00