llama.cpp

Commit Graph

Author	SHA1	Message	Date
Diego Devesa	e0e912f49b	llama : add option to override model tensor buffers (#11397 ) * llama : add option to override tensor buffers * ggml : fix possible underflow in ggml_nbytes	2025-04-02 14:52:01 +02:00
Chenguang Li	9bacd6b374	[CANN] get_rows and dup optimization (#12671 ) * [CANN]get_rows and dup optimization. Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]GET_ROWS and CPY/DUP optimization Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-04-02 15:22:13 +08:00
Junil Kim	f423981ac8	opencl : fix memory allocation size (#12649 ) issue: https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283 This patch fixes the memory allocation size not exceeding the maximum size of the OpenCL device.	2025-04-01 09:54:34 -07:00
Georgi Gerganov	3fd072a540	metal : use F32 prec in FA kernels (#12688 ) * metal : use F32 prec in FA kernels ggml-ci * cont : fix FA vec kernel ggml-ci	2025-04-01 14:57:19 +03:00
R0CKSTAR	a6f32f0b34	Fix clang warning in gguf_check_reserved_keys (#12686 ) * Fix clang warning in gguf_check_reserved_keys Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix typo Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-01 13:12:53 +02:00
Wagner Bruna	2bb3597e42	vulkan: fix build when glslc doesn't support coopmat (#12683 )	2025-04-01 11:38:07 +02:00
Romain Biessy	8293970542	SYCL: Rename oneMKL to oneMath (#12192 ) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections	2025-04-01 16:24:29 +08:00
Akarshan Biswas	8bbf26083d	SYCL: switch to SYCL namespace (#12674 )	2025-04-01 10:11:39 +02:00
a3sh	250d7953e8	ggml : faster ssm scan (#10558 ) * faster ssm_scan * delete unused commnet * clang format * add space * modify unnecessary calculations * faster ssm conv implementatioin * modify file name with dash	2025-03-31 18:05:13 +02:00
0cc4m	a8a1f33567	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135 ) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-03-31 14:37:01 +02:00
Georgi Gerganov	1790e73157	cmake : fix whitespace (#0 )	2025-03-31 15:07:32 +03:00
Sandro Hanea	a7724480fd	cmake: improve Vulkan cooperative matrix support checks (whisper/2966) Co-authored-by: Sandro Hanea <me@sandro.rocks>	2025-03-31 15:07:32 +03:00
Akarshan Biswas	6c02a032fa	SYCL: Remove misleading ggml_sycl_op_flatten function (#12387 ) * SYCL: Remove misleading ggml_sycl_op_flatten function * remove trailing whitespace * Fix L2 norm from rebase * remove try catch block from element_wise.cpp * remove comment from common.hp * ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward * norm.cpp: remove try catch exception block	2025-03-31 11:25:24 +02:00
Georgi Gerganov	4663bd353c	metal : use constexpr in FA kernels + fix typedef (#12659 ) * metal : use constexpr in FA kernels ggml-ci * cont ggml-ci * cont : fix typedef ggml-ci	2025-03-30 22:04:04 +03:00
R0CKSTAR	492d7f1ff7	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611 ) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-30 10:59:38 +02:00
Xuan-Son Nguyen	360dc22c00	cpu : rm unused variable (ggml/1166)	2025-03-30 08:33:31 +03:00
cmdr2	a62d7fa7a9	cpu: de-duplicate some of the operators and refactor (ggml/1144) * cpu: de-duplicate some of the operators and refactor * Fix PR comments * Fix PR comments	2025-03-30 08:33:31 +03:00
Daniel Bevenius	e408d4351a	ggml : add logging for native build options/vars (whisper/2935) This commit adds debug level logging for the native build options and variables to ggml/CMakeLists.txt. The motivation for this is that it can be useful to see the effective result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a cmake build. I've found myself adding similar logging a few times now, so I thought it might be a good idea to add this. Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when running cmake produces the following output: ```console -- GGML_NATIVE : OFF -- GGML_NATIVE_DEFAULT : OFF -- INS_ENB : OFF ```	2025-03-30 08:33:31 +03:00
Daniel Bevenius	3891e183c6	examples : command.wasm updates (whisper/2904) This commit updates the command.wasm example by adding a server.py script to make it easy to start a local http server to try out the example, updates the build instructions, and also addresses some of the compiler warnings that were being generated. * emscripten : fix TOTAL_STACK for wasm This commit moves the TOTAL_STACK setting from the compile flags to the linker flags. This is because the TOTAL_STACK setting is a linker setting. The motivation for this change is that currently the following warnings are generated when building: ```console em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] ``` * examples : suppress C++17 deprecation warning for std::codecvt_utf8 This commit suppresses the C++17 deprecation warning for std::codecvt_utf8 similar to what is done in examples/talk-llama/unicode.cpp. The motivation for this change is to suppress these warnings: ```console /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ 4 warnings generated. ``` * ggml : suppress double-promotion warning in GGML_F16x4_REDUCE This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro to suppress a double-promotion warning. Currently the following warning is generated when compiling the command.wasm example: ```console /whisper-work/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1592 \| GGML_F16_VEC_REDUCE(sumf, sum); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /whisper-work/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1640 \| GGML_F16_VEC_REDUCE(sumf[k], sum[k]); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2 warnings generated. ``` wasm_f32x4_extract_lane returns a 32-bit float and this is what the addition is performed on. But there is an implicit conversion from 32-bit float to 64-bit double when the result is assigned to `res`, which is of type `ggml_float`. My understanding here is that this is intentional and adding a cast to `ggml_float` should suppress the warning. * emscripten : add -Wno-deprecated to for emscripten This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten builds. The motivation for this is that currently there a number of warnings generated like the following: ```console warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] ``` The downside of this is that we might miss other deprecation warnings in the future so I'm not sure if this is acceptable. But it make the wasm examples cleaner without the warnings. * examples : fix tautological-compare warning in stb_vorbis.c [no ci] This commit applies a fix to address a tautological-compare warning in stb_vorbis.c. The motivation for this is that currently the following warning is generated when compiling the commmand-wasm example: ```console /Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare] 1404 \| if (f->stream_start + loc >= f->stream_end \|\| f->stream_start + loc < f->stream_start) { \| ^ 1 warning generated. ``` This fix was taken from an open pull request on the stb repository that addreses this issue: https://github.com/nothings/stb/pull/1746 * squash! examples : update command.wasm instructions [no ci] This commit adds a Python script to serve the the wasm examples build in the `build-em` directory. Initially I thought that it would be enough to start a simple python server but I did not notice that there was an error in the browser console when I did that: ```console command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated. at command.js:1:1206224 at new Promise (<anonymous>) at loadWasmModuleToWorker (command.js:1:1204981) at Array.map (<anonymous>) at Object.loadWasmModuleToAllWorkers (command.js:1:1206428) at command.js:1:1204318 at callRuntimeCallbacks (command.js:1:1202062) at preRun (command.js:1:6136) at run (command.js:1:1294094) at removeRunDependency (command.js:1:7046) ``` We need a few CORS headers to be set and in order hopefully make this easy for users a Python script is added to the examples directory. This should be able to server all the wasm examples provided they have been built. command.wasm's README.md is updated to reflect this change. * examples : remove unused functions This commit removed the unused functions convert_to_utf8 and convert_to_wstring from examples/common.cpp. * Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]" This reverts commit 8e3c47d96141c7675c985562ebdc705e839e338a. We should not make this change here and instead when the upstream PR is merged we can sync with it. Refs: https://github.com/ggerganov/whisper.cpp/issues/2784	2025-03-30 08:33:31 +03:00
Jay	a69f846351	cmake : fix ccache conflict (#12522 ) If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in cmake again will lead to conflict and compile fail. Signed-off-by: Jay <BusyJay@users.noreply.github.com>	2025-03-29 11:04:58 +01:00
hipudding	d07a0d7a79	CANN : remove clang-format in ggml-cann (#12607 )	2025-03-29 11:03:28 +01:00
Georgi Gerganov	b4ae50810e	metal : improve FA + improve MoE (#12612 ) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 20:21:59 +02:00
Icenowy Zheng	b86f600723	vulkan: fix coopmat shader generation when cross-compiling (#12272 ) * vulkan: fix coopmat shader generation when cross-compiling Previously the status of coopmat{,2} support isn't passed to the vulkan-shaders-gen project building on the host, which leads to build failure because of the cross-compiling code expecting coopmat{,2} shaders that didn't get generated. Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> * Only call coop-mat shaders once * Fix whitespace --------- Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>	2025-03-28 14:51:06 -03:00
amritahs-ibm	13731766db	llamafile : ppc64le GEMV forwarding for FP32. (#12594 ) This patch enables usage of MMA when one of the dimensions of the matrix(ie either M or N) is 1. This is useful in case of token generation where N < 2. The concept of 'GEMV Forwarding' is used where when one of the matrix has a single row/column, the elements are broadcasted, instead of using packing routine to prepack the matrix elements. This change results in 5% - 15% improvement in total speed(ie all tokens/total time), across various batch sizes. This is in comparision with the corresponding dot product implementation. The patch is tested with FP32 models of Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-28 09:43:22 +02:00
Radoslav Gerganov	ab6ab8f809	rpc : send hash when tensor data is above some fixed threshold (#12496 ) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency	2025-03-28 08:18:04 +02:00
lhez	5dec47dcd4	opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600 ) * opencl: add `im2col` * opencl: add `gelu_quick` * opencl: add mrope * opencl: add vision rope	2025-03-27 08:08:08 -07:00
Georgi Gerganov	771d84371c	scripts : update sync + fix cmake merge ggml-ci	2025-03-27 10:09:29 +02:00
Georgi Gerganov	0306aad1ca	cmake : sync/merge PowerPC build commands (#0 )	2025-03-27 09:04:38 +02:00
amritahs-ibm	c7b43ab608	llamafile : ppc64le MMA implementation for Q4_0. (#12489 ) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le ISA using MMA builtins. This patch handles matrix multiplication between quantised datatypes, block_q4_0 and block_q8_0. This change results in 5% - 50% improvement in total speed(ie all tokens/total time), across various batch sizes. The patch is tested with Meta-Lllama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on a IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-27 08:51:47 +02:00
xctan	24feaec057	ggml : riscv: add 128-bit RVV support (#12530 ) * ggml : add 128-bit RVV support * ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl * remove trailing whitespaces * restructure vector length selection code	2025-03-27 08:38:34 +02:00
Akarshan Biswas	f17a3bb4e8	SYCL: implement memset ggml backend buffer interface (#12580 ) * SYCL: implement memset ggml backend buffer interface * use GGML_ABORT macro * Do not wait for all queues to finish for memset operation	2025-03-27 09:46:00 +08:00
Slobodan Josic	bd40678df7	HIP: Add support for RDNA4 targets (#12372 )	2025-03-26 23:46:30 +01:00
Georgi Gerganov	b3298fa47a	metal : refactor mat-vec code (#12569 ) * metal : refactor mat-vec code ggml-ci * metal : rename all_sum -> sum_all ggml-ci * metal : fix comments [no ci] * metal : fix nr constant [no ci] * metal : mv q6_K support nr0 > 1 ggml-ci * metal : reduce register pressure ggml-ci * metal : fix typo [no ci] * metal : reduce register pressure ggml-ci	2025-03-26 21:38:38 +02:00
Georgi Gerganov	5ed38b6852	ggml : fix MUL_MAT_ID repack with Q8_K (#12544 ) * ggml : fix MUL_MAT_ID repack with Q8_K ggml-ci * ggml : improve repack templates ggml-ci	2025-03-26 13:02:00 +02:00
Dan Johansson	053b3f9aae	ggml-cpu : update KleidiAI to v1.5.0 (#12568 ) ggml-cpu : bug fix related to KleidiAI LHS packing Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-03-25 13:10:18 +02:00
Akarshan Biswas	e2f560175a	SYCL: disable Q4_0 reorder optimization (#12560 ) ggml-ci	2025-03-25 18:40:18 +08:00
lhez	2b65ae3029	opencl: simplify kernel embedding logic in cmakefile (#12503 ) Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-03-24 09:20:47 -07:00
R0CKSTAR	7ea75035b6	CUDA: Fix clang warnings (#12540 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-24 11:28:34 +01:00
Jeff Bolz	9b169a4d4e	vulkan: fix mul_mat_vec failure in backend tests (#12529 ) The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.	2025-03-24 07:56:17 +01:00
Georgi Gerganov	ba932dfb50	ggml : fix quantized cpy op (#12310 ) * ggml : fix quantized cpy op ggml-ci * tests : add cpy tests for all types ggml-ci * tests : add BF16 copy tests ggml-ci * tests : fix loop for same-type copy ggml-ci * tests : add option to permute the dst tensor ggml-ci	2025-03-22 16:23:26 +02:00
R0CKSTAR	fac63a3d78	musa: refine compute capability (#12493 ) * musa: refine compute capability Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-22 10:11:37 +01:00
Jeff Bolz	eddfb43850	vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505 ) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-22 09:40:11 +01:00
stduhpf	4375415b4a	Vulkan: RTE rounding for cpy to quant (#12480 ) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-21 20:34:50 +01:00
Eve	30c42ef5cb	vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472 )	2025-03-21 20:27:47 +01:00
蕭澧邦	1aa87ee53d	[SYCL] Fix build on Windows when ccache enabled (#9954 ) (#9976 ) * [SYCL] Fix build on Windows when ccache enabled (#9954) * take effect only on windows and force it to icl --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-21 14:58:47 +08:00
Svetlozar Georgiev	9ffcc9e374	sycl: cleanup oneDNN related code (#12097 )	2025-03-21 10:15:56 +08:00
Srihari-mcw	3d82dbcbce	ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332 ) * Add block interleaving support for Q4_K quantization * Remove whitespaces and fix CI/CD issues * Update pointer of bsums from int16_t to const int16_t * Add vector version of quantize_q8_K_4x8 function * Update code formatting based on review comments	2025-03-20 13:35:34 +02:00
Gaurav Garg	517b5ddbf0	CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183 ) - Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value. - Prefer vector flash attention kernels over MMA kernel for BS=1 Fixes Issue: #12182 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-19 20:52:06 +01:00
Jeff Bolz	a9b59288e2	vulkan: optimize iq1 coopmat2 dequant functions (#12427 )	2025-03-19 19:56:23 +01:00
Guus Waals	0fd8487b14	Fix visionOS build and add CI (#12415 ) * ci: add visionOS build workflow Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode. * ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs * ci: remove define hacks for u_xxx system types --------- Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>	2025-03-19 11:15:23 +01:00
Jeff Bolz	c446b2edd2	vulkan: Submit once enough matmul work has been recorded (#12406 ) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-19 08:26:26 +01:00
lhez	d84635b1b0	opencl: improve profiling (#12442 ) * opencl: more profiling timing * opencl: generate trace for profiling * opencl: reduce profiling overhead * Populate profiling timing info at the end rather than after each kernel run * opencl: fix for chrome tracing	2025-03-18 12:54:55 -07:00
R0CKSTAR	bb115d2bf7	musa: override warp_size of musa device to 32 (#12445 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-18 19:28:26 +01:00
Łukasz Ślusarczyk	35cae5ba05	SYCL: using graphs is configurable by environment variable and compile option (#12371 ) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-18 11:16:31 +01:00
Prajwal B Mehendarkar	eba92d64c3	cmake : fix PowerPC build (#12241 ) Closes #12240	2025-03-18 11:37:33 +02:00
fj-y-saito	d9a14523bb	ggml : add SVE support for q6_K_q8_K (#12361 )	2025-03-18 10:14:39 +02:00
0cc4m	fd123cfead	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434 )	2025-03-18 07:21:40 +01:00
Łukasz Ślusarczyk	a53f7f7b88	fixed compilation warnings in ggml-sycl (#12424 )	2025-03-18 08:51:25 +08:00
Molly Sophia	7dfad387e3	llama: Add support for RWKV v7 architecture (#12412 ) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-18 07:27:50 +08:00
Gaurav Garg	b1b132efcb	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394 ) * Enable CUDA Graph on CTK < 12.x `cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x. * Fix compilation errors with MUSA * Disable CUDA Graph for MUSA	2025-03-17 20:25:13 +02:00
Guus Waals	01e8f2138b	ggml-vulkan: remove unused find_program(glslc) (#12416 ) It's already found by FindVulkan.cmake in the parent CMakeLists	2025-03-17 13:35:43 -03:00
Jeff Bolz	484a8ab513	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312 )	2025-03-17 09:26:18 -05:00
Daniele	cf2270e4d3	vulkan: subgroup size tuning (#12087 ) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-17 12:42:33 +01:00
Jeff Bolz	f07690c930	vulkan: use fp32 in coopmat2 q4_k dequant function (#12309 )	2025-03-17 10:43:35 +01:00
Jeff Bolz	891c63956d	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273 ) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-17 10:41:59 +01:00
Jeff Bolz	2f21123c1d	vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258 )	2025-03-17 10:35:00 +01:00
Christian Kastner	374101fd74	cmake : enable building llama.cpp using system libggml (#12321 ) * cmake: Factor out compiler flag function from ggml llama.cpps's build requires it, too, and we may want to make use of it without add_subdirectory(ggml). * cmake: Enable building against system ggml This facilitates package maintenance for Linux distributions, where the libggml library most likely will be shipped as an individual package upon which a llama.cpp package depends.	2025-03-17 11:05:23 +02:00
Akarshan Biswas	b3c9a65673	SYCL: set extras only on GGML_TYPE_Q4_0 (#12366 ) * SYCL: set extras only on GGML_TYPE_Q4_0 * release tensor_extras in reset buffer interface	2025-03-17 09:45:12 +08:00
aubreyli	3d35d87b41	SYCL: Delete redundant plus sign and space (#12391 )	2025-03-15 15:49:03 +01:00
fairydreaming	b19bd064c0	SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399 ) * sycl : support non-contiguous tensors in binary ops * sycl : silence unused variable warning --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-03-15 22:19:30 +08:00
Chenguang Li	92a391327e	[CANN]MUL_MAT optimization (#12382 )	2025-03-15 09:31:08 +08:00
Alberto Cabrera Pérez	363f8c5d67	sycl : variable sg_size support for mmvq kernels (#12336 )	2025-03-12 09:57:32 +00:00
uvos	34c961b181	CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315 ) When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need to avoid launching them with parameters for warp64	2025-03-12 10:14:11 +01:00
Jeff Bolz	bf69cfe62f	vulkan: fix bug in coopmat1 mul_mat_id (#12316 ) * tests: run mul_mat_id with a larger N * vulkan: fix bug in coopmat1 mul_mat_id	2025-03-12 06:59:19 +01:00
uvos	10f2e81809	CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177 ) refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-11 20:16:03 +01:00
jklincn	ba7654380a	ggml-backend : fix backend search path (#12330 ) * Fix backend search path * replace .native() with '/' * reverted .native()	2025-03-11 14:25:17 +01:00
BB-fat	6ab2e4765a	metal : Cache the Metal library at the device context level (#12265 )	2025-03-11 13:45:02 +02:00
Eve	2c9f833d17	mat vec double buffer (#12188 )	2025-03-10 19:28:11 +00:00
R0CKSTAR	251364549f	musa: support new arch mp_31 and update doc (#12296 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-10 18:18:25 +01:00
Henry Linjamäki	8acdacb3ea	opencl: use OpenCL C standard supported by the device (#12221 ) This patch nudges the llama.cpp a bit to be supported on PoCL which doesn't support OpenCL C CL2.0. The issue is solved by querying the device for the supported OpenCL C versions and using the highest one available.	2025-03-10 09:57:00 -07:00
Jason C.H	6fefc05a7a	ggml-backend : make path_str compatible with C++20 (#12269 )	2025-03-08 17:02:39 +01:00
Daniel Bevenius	7c7f3b7f43	ggml : skip intermediate .air file when compiling .metallib (#12247 ) This commit updates the compilation of default.metallib to skip the intermediate .air (Apple Intermediate Representation) file. The motivation for this change is to simplify the custom command a little and avoid generating and then removing the .air file.	2025-03-07 14:15:27 +01:00
vmobilis	d6ae2fa061	ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118) * ggml_compute_forward_concat() for arbitrary tensor type * Check that tensors' type match * ggml-cpu.c: check type of source tensors * ggml-cpu.c: move tensor type check to ggml_compute_forward_concat() * ggml.c: check concatenated tensor type * Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c ..., as it was moved to ggml.c.	2025-03-07 14:49:44 +02:00
Rémy O	68d0027f3d	ggml-cpu: faster AVX2 variant for IQ1_M (#12216 )	2025-03-07 13:54:22 +02:00
BB-fat	5e2d57b2b2	metal : simplify kernel arguments using a struct (#3229 ) (#12194 ) * metal : refactor im2col parameters into a struct * metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets * metal : refactor sum_rows parameters into a struct * metal : refactor soft_max parameters into a struct * metal : refactor diag_mask_inf parameters into a struct * metal : refactor ssm_conv parameters into a struct * metal : refactor ssm_scan parameters into a struct * metal : refactor get_rows parameters into a struct * metal : refactor group_norm parameters into a struct * metal : refactor conv_transpose_1d parameters into a struct * metal : refactor upscale parameters into a struct * metal : refactor pad parameters into a struct * metal : refactor pad_reflect_1d parameters into a struct * metal : refactor arange parameters into a struct * metal : refactor timestep_embedding parameters into a struct * metal : refactor argsort parameters into a struct * metal : refactor leaky_relu parameters into a struct * metal : refactor pool_2d parameters into a struct * metal : fix trailing whitespace --------- Co-authored-by: alexju <alexju@tencent.com>	2025-03-07 08:35:57 +01:00
Daniel Bevenius	d6c95b0740	metal : fix default.metallib build (#12224 ) This commit updates the custom command to build the default.metallib file to use the correct path to ../ggml-common.h by using the variable METALLIB_COMMON. The motivation for this change is that currently when building and specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is generated: ```console [ 11%] Linking CXX shared library ../../bin/libggml.dylib [ 11%] Built target ggml make[2]: * No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'. Stop. make[1]: * [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2 ``` With the above change the build could progress but there was a follow on error about not being able to find the ggml-common.h file in ggml-metal.metal where is was included as a relative path: ```console [ 11%] Compiling Metal kernels /Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'? ^~~~~~~~~~~~~~~~~~ "ggml-common.h" 1 error generated. ``` Removing the relative path then allowed the build to complete successfully.	2025-03-07 06:23:16 +01:00
lhez	d76a86d967	opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (#12217 ) * opencl: support noncontiguous `norm` * opencl: support noncontiguous `rms_norm` * opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`	2025-03-07 00:20:35 +00:00
xiaofei	776f9e59cc	cmake : fix undefined reference errors for std::filesystem in ggml (#12092 ) (#12094 ) Signed-off-by: Ray Lee <hburaylee@gmail.com> Co-authored-by: Ray Lee <hburaylee@gmail.com>	2025-03-06 22:58:25 +00:00
Johannes Gäßler	5220a16d18	CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222 )	2025-03-06 18:45:09 +01:00
uvos	e721c05c93	HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. (#12209 ) This avoids conflict with internal cuda/hip runtimes memory managment behavior.	2025-03-06 08:20:52 +01:00
Henry Linjamäki	94bb63e4f0	opencl : fix buffer alignment (#12197 ) Fix the following error: ``` ggml-alloc.c:99: not enough space in the buffer ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024) ``` which occurs when `ggml_backend_opencl_context::alignment` is larger than `cl_ptr_base` (hard-coded to `0x1000`). Also, fix `ggml_backend_opencl_context::alignment` was set to `CL_DEVICE_MEM_BASE_ADDR_ALIGN` which was treated as bytes but the value is reported in bits.	2025-03-06 02:33:40 +01:00
Henry Linjamäki	f79243992c	opencl : fix `ulong` kernel args were set from `int` variables (#12174 ) ... which left garbage bits in the upper half of the kernel args. This caused segmentation faults when running PoCL.	2025-03-06 02:31:14 +01:00
simon886212	ed4ce0dda2	opencl : fix profile-related errors (#12095 ) Co-authored-by: ubuntu <ubuntu@localhost.localdomain>	2025-03-06 02:30:05 +01:00
Rémy O	07d1572347	ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154 ) * ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions * cmake: Add GGML_BMI2 build option * ggml: enable BMI2 on relevant CPU variants * ggml-cpu: include BMI2 in backend score * ggml-cpu: register BMI2 in ggml_backend_cpu_get_features * ggml-cpu: add __BMI2__ define when using MSVC	2025-03-06 02:26:10 +01:00
Akarshan Biswas	5e43f104cc	SYCL: Disable f16 Unary OPs as not supported by the kernels (#12201 )	2025-03-05 16:58:23 +01:00
Plamen Minev	16e4b22c5e	ggml : fix GGMLMetalClass ODR (#12200 ) -- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.	2025-03-05 17:16:01 +02:00
mgroeber9110	5bbe6a9fe9	ggml : portability fixes for VS 2017 (#12150 ) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-04 18:53:26 +02:00
David Huang	becade5de7	HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032 ) Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check Adds rocWMMA support to fattn-wmma-f16 --- Signed-off-by: Carl Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Ben Jackson <ben@ben.com>	2025-03-03 22:10:54 +01:00
cmdr2	b64d7cc272	cuda: unary ops as float + de-duplicate (ggml/1130)	2025-03-03 18:18:11 +02:00
cmdr2	0cbee131ad	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) ggml-ci	2025-03-03 18:18:11 +02:00
cmdr2	87abb7e903	cuda/cpu: Increase support for fp16 unary operations (ggml/1125) * Support fp16 unary operations in the CUDA backend * cpu: increase fp16 support for unary operators in the CPU backend * cuda: increase fp16 support for unary operators in the CUDA backend * Add test cases for fp16 unary operators * metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing * metal: fix PR comments for unary op support after fp16 unary tests	2025-03-03 18:18:11 +02:00
Diego Devesa	6d4c23b81b	whisper : support GGML_BACKEND_DL (whisper/2843) * whisper : support GGML_BACKEND_DL * fix DTW crash * whisper.objc : fix build - add ggml-cpp.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-03 18:18:11 +02:00
midnight	6512a90037	cmake : fix compile assumptions for power9/etc (whisper/2777) * Add small comment re: VSX to readme Co-authored-by: midnight <midnight@example.com>	2025-03-03 18:18:11 +02:00
petterreinholdtsen	4512055792	Told cmake to install ggml-cpp.h as a public header file. (ggml/1126) It is used by Whisper talk-llama example. Co-authored-by: Petter Reinholdtsen <pere@debian.org>	2025-03-03 18:18:11 +02:00
cmdr2	f54a4ba11e	Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) * Support float16-to-float16 add/sub/mul/div operations in the CUDA backend * Add fp16 support for add/sub/mul/div on the CPU backend * Add test cases for fp16 add/sub/mul/div	2025-03-03 18:18:11 +02:00
ag2s20150909	9660ffef58	ggml : fix kleidiai build (#12159 ) The libggml API has changed, but this has not been updated.	2025-03-03 13:54:08 +01:00
Akarshan Biswas	ece9745bb8	SYCL: Move CPY kernels to a separate file and add few missing kernels (#12133 ) * SYCL: refactor and move cpy kernels to a separate file * Add few missing cpy kernels * refactor and add debug logs	2025-03-03 11:07:22 +01:00
Diego Devesa	cc473cac7c	ggml-backend : keep paths in native string type when possible (#12144 )	2025-03-02 22:11:00 +01:00
Erik Scholz	80c41ddd8f	CUDA: compress mode option and default to size (#12029 ) cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".	2025-03-01 12:57:22 +01:00
William Tambellini	70680c48e5	ggml : upgrade init_tensor API to return a ggml_status (#11854 ) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return a OOM status), as agreeed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-02-28 14:41:47 +01:00
Rémy O	438a83926a	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595 ) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-02-28 09:42:52 +01:00
Johannes Gäßler	9c42b1718c	CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098 )	2025-02-28 09:26:43 +01:00
Prashant Vithule	05e6f5aad0	ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064 ) * Added SVE Support for Q2_K Quantized Models * Use 4-space indentation in the switch cases * removed comments lines * Remove the loop Retain the curly bracess for better understanding of code * Remove the comment like added for q3_k_q8_k kernel --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com>	2025-02-28 09:36:12 +02:00
hipudding	673cfef9aa	CANN: Fix build error with GCC 13 (#11990 ) Remove unused header file that causes compilation failure on ARM platform with GCC 13.	2025-02-28 15:23:47 +08:00
Eve	fbeda9002d	vulkan: matmul dequantization improvements (#12015 ) * faster dequant for old quants * dont use unpack for iq4_nl * vec2 unpack for q8	2025-02-28 08:20:08 +01:00
Daniele	581650b7ca	vulkan: improve im2col (#11826 ) * vulkan: improve im2col performance	2025-02-28 07:52:51 +01:00
Vladimir Vuksanovic	b95c8af37c	cmake: Fix ggml backend dependencies and installation (#11818 ) * Fix dependencies between ggml and backends ggml backends link only to ggml-base and ggml links to all backends. * Fix installation of ggml backends Set up GNUInstallDirs before setting the installation directory of ggml backends	2025-02-27 09:42:48 +02:00
Jeff Bolz	a82c9e7c23	vulkan: fix assertion when qy_needs_dequant (#12068 ) Looks like a copy/paste bug from qx_needs_dequant.	2025-02-25 16:30:21 +01:00
Judd	c132239bfb	add OP sigmoid (#12056 ) Co-authored-by: Judd <foldl@boxvest.com>	2025-02-25 12:32:20 +01:00
Molly Sophia	393fca629e	ggml-cpu: Fix build with sve (#12059 ) * ggml-cpu: Fix build with sve Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml-cpu: Remove unused variable in sve q3_k vec dot Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-02-25 19:28:22 +08:00
Rémy O	61d4f39dfe	vulkan: implement more backpropagation operators (#11914 ) * vulkan: implement GGML_OP_ROPE_BACK * vulkan: implement GGML_OP_RMS_NORM_BACK * vulkan: implement GGML_OP_SILU_BACK * vulkan: implement GGML_OP_SOFTMAX_BACK	2025-02-25 12:04:45 +01:00
Gian-Carlo Pascutto	58d07a8043	metal : copy kernels for quant to F32/F16 conversions (#12017 ) metal: use dequantize_q templates --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-25 11:27:58 +02:00
lhez	34a846b584	opencl: fix for small models (#11950 ) * opencl: fix small shape gemv, remove unused extensions * opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size * opencl: fix for token length < 4 * opencl: use wave size of 64 for all Adreno GPUs --------- Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com> Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>	2025-02-24 14:47:07 -07:00
Neo Zhang Jianyu	08d5986290	[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035 ) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-24 22:33:23 +08:00
Akarshan Biswas	8303e8b0fb	SYCL: Fix GGML_SYCL_DEBUG macro (#11995 )	2025-02-24 10:18:25 +00:00
Aaron Teo	af7747c95a	ggml-cpu: Support s390x SIMD Instruction Set (#12019 ) * ggml: add s390x ARCH_FLAGS for compilation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add SIMD for s390x using vector intrinsics SIMD is activated for: * ggml_vec_dot_f32 * ggml_vec_dot_f16 * ggml_vec_mad_f32 * ggml_vec_mad_f16 * ggml_vec_mad_f32_unroll * ggml_vec_scale_f32 * ggml_vec_scale_f16 SIMD is NOT activated for: * ggml_vec_dot_f16_unroll (pending bugfix) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing escape character in GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix s390x GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: full SIMD activation for F32,F16 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add option to disable s390x VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change vecintrin.h include to ggml-cpu-impl * add __VXE__ and __VXE2__ macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * cmake: add s390x target detection for VX/VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move s390x vector intrinsics to ggml-cpu-impl.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x Q8_0 SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct documentation for Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x reduce code complexity Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x bugfix typo Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activated for Q4_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x inline vec_reve Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add VXE backend feature Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove test.py Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: bugfix iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add float, double, and long vector data type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: clean up iq4_xs SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix improper use of restrict keyword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update warning message for ggml_vec_tbl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: switch to restrict for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: slight dot product speed improvement for q4_1_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for q6_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add missing `_t` to ggml_int8x16x4_t Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing `_t` for ggml_vec_xl_s8x4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix more missing `_t` Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q8_0 increase of 3.86% for prompt processing and 32.22% for token generation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: patch Q8_0 to use proper vector sizes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: optimise Q8_0 dot prod compute kernel further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q4_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor Q6_K variable naming for readability Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q6_K typos Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q5_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix wrong charx16_t naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml: Q5_K y0 wrong signness Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q4_K invalid vector intrinsics Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: simplify ggml_padd_s16 compute kernel Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct ggml-cpu vxe wording Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change ggml_aligned_malloc alignment to 256 256 is the cache line size for s390x platforms Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: resolve pr merge via cherry-pick `225bbbf` Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml : fix LoongArch compile error with 128-bit SIMD (#11701) * ggml: resolve pr merge via cherry-pick 4571953 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: cmake remove fork when determining s390x machine type thank you @ericcurtin Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Jinyang He <hejinyang@loongson.cn> Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>	2025-02-22 21:39:24 +00:00
Johannes Gäßler	a28e0d5eb1	CUDA: app option to compile without FlashAttention (#12025 )	2025-02-22 20:44:34 +01:00
Johannes Gäßler	5fa07c2f93	CUDA: optimize FA for GQA + large batches (#12014 )	2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto	d70908421f	cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000 )	2025-02-22 09:43:24 +01:00
PureJourney	ecc8e3aeff	CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984 ) * CUDA: correct the lowest Maxwell supported by CUDA 12 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-21 12:21:05 +01:00
Bodhi	0b3863ff95	MUSA: support ARM64 and enable dp4a .etc (#11843 ) * MUSA: support ARM64 and enable __dp4a .etc * fix cross entropy loss op for musa * update * add cc info log for musa * add comment for the MUSA .cc calculation block --------- Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>	2025-02-21 09:46:23 +02:00
Charles Xu	c5d91a7400	ggml-cpu: Add CPU backend support for KleidiAI library (#11390 ) * ggml-cpu: Add CPU backend support for KleidiAI library * Add environmental variable GGML_KLEIDIAI_SME * Add support for multithread LHS conversion * Switch kernel selection order to dotprod and i8mm * updates for review comments * More updates for review comments * Reorganize and rename KleidiAI files * Move ggml-cpu-traits.h to source file * Update cmake for SME build and add alignment for SME * Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list	2025-02-20 15:06:51 +02:00
Prashant Vithule	4806498bf1	ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917 ) * Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file * Improved Formating of code in ggml-cpu-quants.c file * style : minor fixes * style : less whitespaces * style : ptr spaceing --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-20 12:08:32 +02:00
Johannes Gäßler	73e2ed3ce3	CUDA: use async data loading for FlashAttention (#11894 ) * CUDA: use async data loading for FlashAttention --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-17 14:03:24 +01:00
Rémy O	2eea03d86a	vulkan: implement several ops relevant for ggml_opt (#11769 ) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-17 07:55:57 +01:00
Jeff Bolz	bf42a23d0a	vulkan: support multi/vision rope, and noncontiguous rope (#11902 )	2025-02-16 08:52:23 +01:00
Hale Chan	c2ea16f260	metal : fix the crash caused by the lack of residency set support on Intel Macs. (#11904 )	2025-02-16 08:50:26 +02:00
Adrian Kretz	22885105a6	metal : optimize dequant q6_K kernel (#11892 )	2025-02-15 20:39:20 +02:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
Rémy O	fc1b0d0936	vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528 ) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-15 09:01:40 +01:00
lhez	300907b211	opencl: Fix rope and softmax (#11833 ) * opencl: fix `ROPE` * opencl: fix `SOFT_MAX` * Add fp16 variant * opencl: enforce subgroup size for `soft_max`	2025-02-14 12:12:23 -07:00
Diego Devesa	94b87f87b5	cuda : add ampere to the list of default architectures (#11870 )	2025-02-14 15:33:52 +01:00
Jinyang He	38e32eb6a0	ggml: optimize some vec dot functions for LoongArch ASX (#11842 ) * Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX * Optimize mul_sum_i8_pairs_float for LoongArch ASX * Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX	2025-02-14 10:54:27 +02:00
Eve	a4f011e8d0	vulkan: linux builds + small subgroup size fixes (#11767 ) * mm subgroup size * upload vulkan x86 builds	2025-02-14 02:59:40 +00:00
Jeffrey Morgan	8a8c4ceb60	llamafile: use member variable instead of constant for iq4nlt (#11780 )	2025-02-13 18:05:04 +01:00
R0CKSTAR	bd6e55bfd3	musa: bump MUSA SDK version to rc3.1.1 (#11822 ) * musa: Update MUSA SDK version to rc3.1.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: Remove workaround in PR #10042 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-02-13 13:28:18 +01:00
Diego Devesa	a394039db0	ggml-cpu : add chunking support to mul_mat_id (#11666 ) * ggml-cpu : add chunking support to mul_mat_id * allocate chunk counter in wdata parallelize src1 quantization by column to allows parallelization even when there is only one row * disable for arm * cleanup * better way to disable for arm * fix uninitialized counter when using 1 thread only * revert test-backend-ops changes	2025-02-13 01:02:38 +01:00
Xuan-Son Nguyen	be3bbd6215	ggml : x2 speed for WASM by optimizing SIMD (#11453 ) * ggml : x2 speed for WASM by optimizing SIMD * fix bad merging * rm trailing spaces * rm redundant clamp * better quantize_row_q8_K Co-authored-by: camel-cdr <camel-cdr@protonmail.com> * remove memset that causes buffer overflow Co-authored-by: camel-cdr <camel-cdr@protonmail.com> --------- Co-authored-by: camel-cdr <camel-cdr@protonmail.com>	2025-02-13 00:33:45 +01:00
uvos	5c4284d57b	HIP: Remove GCN from list of devices that avoid MMQ (#11831 )	2025-02-12 22:25:28 +01:00
uvos	e598697d63	HIP: Switch to std::vector in rocblas version check (#11820 )	2025-02-12 17:25:03 +01:00

1 2 3 4 5 ...

791 Commits