Commit Graph

1107 Commits

Author SHA1 Message Date
uvos aa79524c51
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (#14945) 2025-07-29 20:23:04 +02:00
uvos b77d11179d
HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (#14930)
This is useful for testing for regressions in the GCN code path using CDNA hardware.

With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is essentially GCN renamed, with MFMA and limited-use ACC registers added, this provides a good alternative for regression testing when GCN hardware is not available.
2025-07-29 17:44:30 +02:00
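
A minimal sketch of how such a CMake toggle typically surfaces in the code (hypothetical macro usage, not the actual llama.cpp sources): the option merely defines or omits a compile definition, and the MMQ dispatch compiles the MFMA or generic GCN path accordingly.

```cpp
#include <cstdio>

// Hypothetical illustration: GGML_HIP_MMQ_MFMA=Off leaves the macro
// undefined, so the generic GCN path is compiled even on CDNA devices.
#ifdef GGML_HIP_MMQ_MFMA
static constexpr bool mmq_use_mfma = true;   // CDNA matrix-core path
#else
static constexpr bool mmq_use_mfma = false;  // shared GCN path, testable on CDNA
#endif

int main() {
    std::printf("MMQ MFMA path: %s\n", mmq_use_mfma ? "enabled" : "disabled");
    return 0;
}
```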
uvos c7aa1364fd
HIP: Ignore unsupported unroll transformation in fattn-vec (#14931)
LLVM with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. As in other places in GGML, we simply ignore this warning.
2025-07-29 17:43:43 +02:00
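
A minimal sketch of the suppression pattern the commit above describes, assuming clang's -Wpass-failed group covers the failed-unroll diagnostic; the loop below has a runtime-dependent break, which the amdgcn backend refuses to fully unroll.

```cpp
// Sketch only: a loop with a data-dependent break cannot be fully unrolled
// by llvm on amdgcn, so the resulting "pass failed" warning is silenced.
#ifdef __clang__
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wpass-failed"
#endif

static int find_first(const float * x, int n, float v) {
#pragma unroll
    for (int i = 0; i < 32; ++i) {
        if (i >= n) break;       // runtime-dependent break: not resolvable
        if (x[i] == v) return i; // at compile time, so unrolling fails
    }
    return -1;
}

#ifdef __clang__
#pragma clang diagnostic pop
#endif
```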
hipudding 204f2cf168
CANN: Add ggml_set_rows (#14943) 2025-07-29 22:36:43 +08:00
Sigbjørn Skjæret 138b288b59
cuda : add softcap fusion (#14907) 2025-07-29 14:22:03 +02:00
Aman Gupta 0a5036bee9
CUDA: add roll (#14919)
* CUDA: add roll

* Make everything const, use __restrict__
2025-07-29 14:45:18 +08:00
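
A hedged sketch of what a 1-D roll (circular shift) kernel looks like with the const/__restrict__ style the follow-up change in the commit above mentions; the real ggml op handles four dimensions and per-dimension shifts, so this is illustrative only.

```cpp
#include <cuda_runtime.h>

// Illustrative 1-D roll: dst[i] = src[(i - shift) mod n], with the index
// wrapped so negative shifts also work.
__global__ void roll_1d(float * __restrict__ dst,
                        const float * __restrict__ src,
                        const int n, const int shift) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    const int j = (i - shift % n + n) % n; // always in [0, n)
    dst[i] = src[j];
}
```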
xctan db16e2831c
ggml-cpu : deduplicate scalar implementations (#14897)
* remove redundant code in riscv

* remove redundant code in arm

* remove redundant code in loongarch

* remove redundant code in ppc

* remove redundant code in s390

* remove redundant code in wasm

* remove redundant code in x86

* remove fallback headers

* fix x86 ggml_vec_dot_q8_0_q8_0
2025-07-28 17:40:24 +02:00
Akarshan Biswas cd1fce6d4f
SYCL: Add set_rows support for quantized types (#14883)
* SYCL: Add set_rows support for quantized types

This commit adds support for GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16
type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead

ggml-ci

* Fix formatting

ggml-ci

* Use const for ne11 and size_t variables in set_rows_sycl_q

ggml-ci

* Increase block size for q kernel to 256

ggml-ci

* Cleanup imports

* Add float.h to cpy.hpp
2025-07-28 20:32:15 +05:30
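
For reference, a hedged CPU sketch of the GGML_OP_SET_ROWS semantics implemented in the commit above: source rows are scattered into the destination at the given row indices; the quantized variants additionally run each row through a quantize-copy kernel (e.g. float to Q8_0) rather than the plain float copy shown here.

```cpp
#include <cstdint>

// CPU reference sketch (illustrative, not the SYCL code): scatter n_rows
// consecutive source rows of ne0 floats into dst at the indices in rows[].
static void set_rows_ref(float * dst, const float * src,
                         const int64_t * rows, int64_t n_rows, int64_t ne0) {
    for (int64_t r = 0; r < n_rows; ++r) {
        float       * d = dst + rows[r]*ne0; // destination row index
        const float * s = src + r*ne0;       // consecutive source rows
        for (int64_t i = 0; i < ne0; ++i) {
            d[i] = s[i];
        }
    }
}
```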
Johannes Gäßler 946b1f6859
CUDA: fix pointer incrementation in FA (#14916) 2025-07-28 14:30:22 +02:00
Alberto Cabrera Pérez afc0e89698
sycl: refactor quantization to q8_1 (#14815)
* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat
2025-07-28 11:05:53 +01:00
Kai Pastor 613c5095c3 cmake : Indent ggml-config.cmake (ggml/1310) 2025-07-28 08:15:01 +03:00
Erik Scholz 89d1029559
vulkan : add fp16 support for the conv_2d kernel (#14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-27 12:04:33 +02:00
Jeff Bolz f1a4e72de5
vulkan: skip empty set_rows to avoid invalid API usage (#14860) 2025-07-27 11:05:34 +02:00
deepsek 66906cd82a
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624)
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-K are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes, which as a side effect improves the performance of all quant formats on GCN GPUs besides q4_0 and q4_1, which regress slightly.
2025-07-27 00:28:14 +02:00
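
For orientation, a heavily hedged sketch of the MFMA building block the code path above targets (the exact intrinsic and modifier arguments vary by CDNA generation; treat this as an assumption-laden illustration, not the MMQ code): each wavefront feeds 16x16 f16 tiles into the matrix cores and accumulates into an f32 fragment.

```cpp
// Hedged sketch (compiled with hipcc for gfx908+): clang exposes MFMA as
// builtins over vector types; modifiers cbsz/abid/blgp are left at 0 here.
typedef float    float4_t __attribute__((ext_vector_type(4)));
typedef _Float16 half4_t  __attribute__((ext_vector_type(4)));

__device__ float4_t mfma_16x16x16(half4_t a, half4_t b, float4_t acc) {
    // D = A*B + C on the matrix cores, one 16x16x16 f16 step per wavefront
    return __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, acc, 0, 0, 0);
}
```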
hipudding 11dd5a44eb
CANN: Implement GLU ops (#14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-26 17:56:18 +08:00
R0CKSTAR 9b8f3c6c77
musa: fix build warnings (unused variable) (#14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-26 10:36:02 +08:00
Aaron Teo c7f3169cd5
ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-25 19:09:03 +02:00
Gabe Goodhart 793c0d7f46
metal: SSM_SCAN performance (#14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-25 10:47:39 -06:00
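
A hedged CUDA analogue of the two-stage summation described in the commit above (Metal simd groups map to CUDA warps, threadgroup memory to shared memory); the second warp-level reduction plays the role of the secondary simd_sum that replaced the for loop.

```cpp
#include <cuda_runtime.h>

// Illustrative block-wide sum: per-warp shuffle reduction, partials staged
// in shared memory, then a second warp-level reduction. Assumes blockDim.x
// is a multiple of warpSize and at most warpSize*warpSize.
__device__ float block_sum(float v) {
    __shared__ float partial[32];                  // one slot per warp
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;

    for (int off = warpSize/2; off > 0; off >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, off); // first-stage "simd_sum"
    }
    if (lane == 0) {
        partial[warp] = v;
    }
    __syncthreads();

    const int n_warps = blockDim.x / warpSize;
    v = lane < n_warps ? partial[lane] : 0.0f;
    for (int off = warpSize/2; off > 0; off >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, off); // secondary "simd_sum"
    }
    return v;                                      // valid in lane 0 of warp 0
}
```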
lhez ce111d39d6
opencl: add fused `rms_norm_mul` (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-25 17:12:13 +02:00
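
A hedged sketch of the fusion idea above (single-threaded rows for brevity; the real kernels reduce across a workgroup): the rms_norm and the following element-wise mul are computed in one pass instead of two kernel launches with an intermediate tensor.

```cpp
#include <cuda_runtime.h>

// Illustrative fused rms_norm+mul, one block per row, one thread per block
// (launch as rms_norm_mul<<<nrows, 1>>>); not the OpenCL workgroup version.
__global__ void rms_norm_mul(float * dst, const float * x, const float * w,
                             const int ne0, const float eps) {
    const float * xr = x   + (size_t) blockIdx.x*ne0;
    float       * dr = dst + (size_t) blockIdx.x*ne0;

    float sumsq = 0.0f;
    for (int i = 0; i < ne0; ++i) {
        sumsq += xr[i]*xr[i];
    }
    const float scale = rsqrtf(sumsq/ne0 + eps);

    for (int i = 0; i < ne0; ++i) {
        dr[i] = xr[i]*scale * w[i]; // the mul is fused into the same pass
    }
}
```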
Oliver Simons e2b7621e7c
ggml : remove invalid portPos specifiers from dot files (#14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally that it falls back to the default portPos specifier when an
invalid one is given. As a consequence, we can remove the associated code.
2025-07-25 14:29:57 +03:00
Chris Rohlf 64bf1c3744
rpc : check for null buffers in get/set/copy tensor endpoints (#14868) 2025-07-25 12:17:02 +02:00
Diego Devesa c12bbde372
sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855)
ggml-ci
2025-07-25 11:07:26 +03:00
R0CKSTAR 3f4fc97f1d
musa: upgrade musa sdk to rc4.2.0 (#14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 20:05:37 +01:00
Kai Pastor 60f816a79d cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-24 20:27:23 +03:00
Daniel Bevenius 5592f278b6 ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-24 20:27:23 +03:00
Alberto Cabrera Pérez cb4a63aad6
sycl: fixed semantics of block offset calculation (#14814) 2025-07-24 11:09:57 +01:00
Georgi Gerganov 065908cb09
metal : fix fusion across different encoders (#14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-24 10:24:05 +03:00
Donghyeon Jeong 4ec6291a24
sycl: fix undefined variable in work group size check (#14843) 2025-07-24 12:50:41 +08:00
Johannes Gäßler a86f52b285
CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
Johannes Gäßler b284197df4
CUDA: fix compilation with GGML_CUDA_F16 (#14837) 2025-07-23 18:22:30 +02:00
Johannes Gäßler 07a19e27a2 CUDA: fix quantized KV cache + multiple sequences (#14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-23 14:08:09 +03:00
lixing-star 6c88b3bb25
ggml: fix loongarch quantize_row_q8_1 error (#14827) 2025-07-23 09:39:51 +03:00
chen fan 14c28dfc50
CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-23 11:58:00 +08:00
Aman Gupta 8c988fa41d
CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
Jeff Bolz 84712b6043
vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817) 2025-07-22 17:35:21 +02:00
Sigbjørn Skjæret e28c0b80c2
cuda : implement bf16 cpy ops and enable bf16 cont (#14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-22 12:33:10 +02:00
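
A hedged sketch of one of the enabled copy directions (f32 to bf16, contiguous case); the deduplication mentioned above suggests the actual change templates a shared copy function over the type pairs rather than hand-writing each kernel.

```cpp
#include <cuda_bf16.h>

// Illustrative contiguous f32 -> bf16 copy; the deduplicated code would
// generate such kernels from one template over (src_t, dst_t) pairs.
__global__ void cpy_f32_bf16(__nv_bfloat16 * __restrict__ dst,
                             const float * __restrict__ src, const int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2bfloat16(src[i]);
    }
}
```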
lhez 8e6f8bc875
opencl: remove unreachable `return` (#14806) 2025-07-22 08:53:30 +02:00
R0CKSTAR 48b86c4fdb
cuda: remove linking to cublasLt (#14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-22 07:45:26 +08:00
Sigbjørn Skjæret 38d3af1b73
opencl: fix `im2col` when `KW!=KH` (#14803) 2025-07-21 13:55:10 -07:00
rmatif 6c9ee3b17e
opencl: add conv2d kernel (#14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
Romain Biessy cd465d823c
sycl: Fix im2col (#14797) 2025-07-21 18:39:29 +02:00
Charles Xu 922042601b
kleidiai: add support for get_rows (#14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-21 16:49:52 +03:00
Jeff Bolz c2e058f1b4
vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
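
Spelled out as a hedged sketch (hypothetical loop, not the shader itself), the decomposition above implies the bound must be OW*KH*KW:

```cpp
// With tid = ow + ky*OW + kx*OW*KH, the strides use KH (not KW), so the
// total work item count "ksize" must be OW*KH*KW to cover kx in [0, KW).
static void walk_im2col_indices(const int OW, const int KH, const int KW) {
    const int ksize = OW*KH*KW;
    for (int tid = 0; tid < ksize; ++tid) {
        const int ow = tid % OW;
        const int ky = (tid / OW) % KH;
        const int kx = tid / (OW*KH);
        (void) ow; (void) ky; (void) kx; // gather src at (ow, ky, kx) here
    }
}
```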
Ervin Áron Tasnádi a979ca22db
ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col).

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
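
A hedged CPU sketch of the "direct GEMM" idea from the commit above (stride 1, no padding, NCHW-like layout; illustrative only): the im2col matrix entry is computed on the fly inside the GEMM loop instead of being materialized.

```cpp
// dst[oc][oy][ox] = sum_k ker[oc][k] * A[k][col], where A is the im2col
// matrix with K = IC*KH*KW; here A's entry is decoded and loaded directly.
static void conv2d_direct(float * dst, const float * src, const float * ker,
                          int IW, int IH, int IC, int OC, int KW, int KH) {
    const int OW = IW - KW + 1;
    const int OH = IH - KH + 1;
    for (int oc = 0; oc < OC; ++oc) {
        for (int oy = 0; oy < OH; ++oy) {
            for (int ox = 0; ox < OW; ++ox) {
                float acc = 0.0f;
                for (int k = 0; k < IC*KH*KW; ++k) {   // GEMM K dimension
                    const int ic = k / (KH*KW);
                    const int ky = (k / KW) % KH;
                    const int kx = k % KW;
                    acc += src[(ic*IH + oy + ky)*IW + ox + kx]
                         * ker[oc*IC*KH*KW + k];
                }
                dst[(oc*OH + oy)*OW + ox] = acc;
            }
        }
    }
}
```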
Peter0x44 d4b91ea7b2
vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m 83f5872404
Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
Georgi Gerganov bf9087f59a
metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
Oliver Simons 021cc28bef
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses matrix-matrix addition as part of its input processing, wrongly
triggering CUDA_GRAPH disablement on NVIDIA GPUs even when a batch size of 1
is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
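
A hedged sketch of the node-name matching approach described above (the helper name and matching rule are illustrative assumptions, not the merged code):

```cpp
#include <cstring>

// Instead of disabling CUDA graphs for every batched GGML_OP_ADD, the one
// known Gemma3n node is recognized by name and exempted from the check.
static bool is_gemma3n_per_layer_input(const char * node_name) {
    return strstr(node_name, "project_per_layer_input") != NULL;
}
```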
Aman Gupta f9a31eea06
CUDA: set_rows + cpy.cu refactor (#14712) 2025-07-18 14:54:18 +08:00
Neo Zhang Jianyu 349ea79fce
use max work group size for device to replace the magic number (#14732) 2025-07-18 10:23:14 +08:00