llama.cpp

Commit Graph

Author	SHA1	Message	Date
compilade	90083283ec	imatrix : use GGUF to store importance matrices (#9400 ) * imatrix : allow processing multiple chunks per batch * perplexity : simplify filling the batch * imatrix : fix segfault when using a single chunk per batch * imatrix : use GGUF to store imatrix data * imatrix : fix conversion problems * imatrix : use FMA and sort tensor names * py : add requirements for legacy imatrix convert script * perplexity : revert changes * py : include imatrix converter requirements in toplevel requirements * imatrix : avoid using designated initializers in C++ * imatrix : remove unused n_entries * imatrix : allow loading mis-ordered tensors Sums and counts tensors no longer need to be consecutive. * imatrix : more sanity checks when loading multiple imatrix files * imatrix : use ggml_format_name instead of std::string concatenation Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * quantize : use unused imatrix chunk_size with LLAMA_TRACE * common : use GGUF for imatrix output by default * imatrix : two-way conversion between old format and GGUF * convert : remove imatrix to gguf python script * imatrix : use the function name in more error messages * imatrix : don't use FMA explicitly This should make comparisons between the formats easier because this matches the behavior of the previous version. * imatrix : avoid returning from void function save_imatrix * imatrix : support 3d tensors with MUL_MAT * quantize : fix dataset name loading from gguf imatrix * common : move string_remove_suffix from quantize and imatrix Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * imatrix : add warning when legacy format is written * imatrix : warn when writing partial data, to help guess dataset coverage Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used. * imatrix : avoid loading model to convert or combine imatrix * imatrix : avoid using imatrix.dat in README --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-19 12:51:22 -04:00
Georgi Gerganov	225e7a1438	llama : add high-throughput mode (#14363 ) * kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-16 16:35:42 +03:00
Georgi Gerganov	6ffd4e9c44	server : pre-calculate EOG logit biases (#14721 ) ggml-ci	2025-07-16 14:04:12 +03:00
Georgi Gerganov	538cc77f7f	server : fix handling of the ignore_eos flag (#14710 ) ggml-ci	2025-07-16 12:13:57 +03:00
Johannes Gäßler	5cae766541	scripts: synthetic prompt mode for server-bench.py (#14695 )	2025-07-16 09:33:28 +02:00
Johannes Gäßler	494c5899cb	scripts: benchmark for HTTP server throughput (#14668 ) * scripts: benchmark for HTTP server throughput * fix server connection reset	2025-07-14 13:14:30 +02:00
Douglas Hanley	0c1df14b5f	server : fix pooled embedding output (#14645 )	2025-07-12 13:21:02 +03:00
Eric Zhang	a457551332	cmake : do not search for curl libraries by ourselves (#14613 ) * cmake : do not search for curl libraries by ourselves * run : do not search for curl libraries by ourselves	2025-07-10 15:29:05 +03:00
Alawode Oluwandabira	17a1f0d2d4	server: Add ability to mount server at prefix (#14544 ) * Add server_prefix * Correct server path env * Rename cli flag to --api-prefix * Change all to api_prefix	2025-07-08 11:47:33 +03:00
Sigbjørn Skjæret	ddef99522d	server : fix assistant prefilling when content is an array (#14360 )	2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret	28657a8229	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445 )	2025-07-03 23:07:22 +02:00
Vedran Miletić	e9b6350e61	scripts : make the shell scripts cross-platform (#14341 )	2025-06-30 10:17:18 +02:00
matteo	caf5681fcb	server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196 ) * initial commit for handling extra template kwargs * enable_thinking and assistant prefill cannot be enabled at the same time * can set chat_template_kwargs in command line * added doc * fixed formatting * add support for extra context in generic template init * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from code review coding standard: cosmetic changes Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix merge conflict * chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context) * normalize environment variable name * simplify code * prefill cannot be used with thinking models * compatibility with the new reasoning-budget parameter * fix prefill for non thinking models --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>	2025-06-29 20:02:53 +02:00
Renat	83790b0e7e	server : fix appearance of the chats list context menu for Safari (#14322 )	2025-06-29 19:29:57 +02:00
Nigel Bosch	1b809cee22	server : move no API key doc to /health (#14352 )	2025-06-24 10:59:11 +02:00
Sigbjørn Skjæret	abf241045d	main : honor --verbose-prompt on interactive prompts (#14350 )	2025-06-24 09:31:00 +02:00
Molly Sophia	72c6bc3f3d	llama : better rwkv chat template and add missing `inputs.use_jinja` setting (#14336 ) * llama-cli : add missing `inputs.use_jinja` setting Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama : better legacy chat template for rwkv Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-06-23 19:56:19 +08:00
Georgi Gerganov	7b50d589a8	kv-cells : fix tracking of seq_pos (#14339 ) * kv-cells : fix tracking of seq_pos during cache reuse ggml-ci * cont : improve error message ggml-ci * cont : add more comments	2025-06-23 12:27:35 +03:00
Ed Addario	fa4a9f2a1c	quantize : handle user-defined pruning of whole layers (blocks) (#13037 )	2025-06-22 23:16:26 +02:00
Ruikai Peng	66aba7aca9	run : avoid double tokenization (#14327 ) * run : avoid double tokenization by adopting common_tokenize heuristic * build : fix windows gcc and clang warnings * lint : fixed trailing whitepace * run : fix is_first flag	2025-06-23 01:28:06 +08:00
Georgi Gerganov	f1f5e82df6	examples : fix is_first logic for tokenization (#14329 ) ggml-ci	2025-06-22 20:10:07 +03:00
yuiseki	5d5c066de8	mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326 ) Mistral Small 2506 models using Pixtral vision encoder were running out of GPU memory when processing images larger than 1024x1024 pixels due to exponential memory growth from unlimited image size. This fix applies the same 1024x1024 limit used by Qwen2VL models to prevent OOM issues while maintaining compatibility with existing models.	2025-06-22 14:44:57 +02:00
Sigbjørn Skjæret	88fc854b4b	llama : improve sep token handling (#14272 )	2025-06-20 14:04:09 +02:00
Georgi Gerganov	4c9fdfbe15	ubatch : new splitting logic (#14217 ) ggml-ci	2025-06-20 10:14:14 +03:00
aa956	d67341dc18	server : add server parameters for draft model cache type (#13782 ) Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>	2025-06-19 16:01:03 +03:00
bashayer hijji	fffcce535e	llama-bench : add --no-warmup flag (#14224 ) (#14270 ) Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking. - Add no_warmup boolean field to cmd_params struct - Add --no-warmup command-line argument parsing - Add help text documentation for the new flag - Wrap existing warmup logic in conditional check - Maintain full backward compatibility (warmup enabled by default) Addresses #14224	2025-06-19 12:24:12 +02:00
Xuan-Son Nguyen	413977de32	mtmd : refactor llava-uhd preprocessing logic (#14247 ) * mtmd : refactor llava-uhd preprocessing logic * fix editorconfig	2025-06-18 10:43:57 +02:00
Georgi Gerganov	89fea80d29	server : fix incorrect usage of llama_get_embeddings() (#14225 ) * server : fix incorrect usage of llama_get_embeddings() ggml-ci * cont : fix the fix ggml-ci	2025-06-16 22:33:27 +03:00
Georgi Gerganov	d3e64b9f49	llama : rework embeddings logic (#14208 ) * llama : rework embeddings logic ggml-ci * cont : fix rerank ggml-ci * cont : engrish [no ci] * cont : fix rerank ggml-ci * server : support both embeddings and completions with single model ggml-ci * cont : avoid embeddings_org ggml-ci	2025-06-16 14:14:00 +03:00
Eric Curtin	cd355eda7d	server : When listening on a unix domain socket don't print http:// and port (#14180 ) Instead show something like this: main: server is listening on file.sock - starting the main loop Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-06-15 23:36:22 +02:00
Georgi Gerganov	ffad043973	server : fix SWA condition for full context reprocess (#14163 ) ggml-ci	2025-06-13 11:18:25 +03:00
Georgi Gerganov	7d516443dd	server : re-enable SWA speculative decoding (#14131 ) ggml-ci	2025-06-12 11:51:38 +03:00
Aman	7781e5fe99	webui: Wrap long numbers instead of infinite horizontal scroll (#14062 ) * webui: Wrap long numbers instead of infinite horizontal scroll * Use tailwind class * update index.html.gz	2025-06-11 16:42:25 +02:00
Taylor	2baf07727f	server : pass default --keep argument (#14120 )	2025-06-11 13:43:43 +03:00
Juk Armstrong	3a12db23b6	Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104 )	2025-06-10 16:48:07 +01:00
R0CKSTAR	dc0623fddb	webui: fix sidebar being covered by main content (#14082 ) * webui: fix sidebar being covered by main content Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * webui: update index.html.gz Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-06-09 12:01:17 +02:00
Georgi Gerganov	87d34b381d	server : fix LRU check (#14079 ) ggml-ci	2025-06-09 12:57:58 +03:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030 ) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Georgi Gerganov	3637576288	server : disable speculative decoding for SWA models (#13970 ) * server : use swa-full fo draft context ggml-ci * server : disable speculative decoding for SWA models	2025-06-02 21:34:40 +03:00
Olivier Chafik	c9bbc77931	`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 ) * server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat * update unit/test_tool_call.py::test_thoughts	2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen	bfd322796c	mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961 ) * mtmd : fix memory in mtmd_helper_eval_chunk_single * mtmd-cli : fix mem leak * Update tools/mtmd/mtmd-cli.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-02 16:29:28 +02:00
Max Krasnyansky	053b1539c0	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995 ) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggresively which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-05-31 15:39:19 -07:00
Georgi Gerganov	3600cc2886	llama : use n_swa + n_ubatch cells for SWA cache (#13833 ) * llama : use n_swa + n_ubatch cells for SWA cache ggml-ci * llama : add warning about multi-sqeuence SWA contexts	2025-05-31 15:57:44 +03:00
igardev	c7e0a2054b	webui : Replace alert and confirm with custom modals. (#13711 ) * Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons. * use Modal Provider to simplify the use of confirm and alert modals. * Increase the z index of the modal dialogs. * Update index.html.gz * also add showPrompt * rebuild --------- Co-authored-by: igardev <ivailo.gardev@akros.ch> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-31 11:56:08 +02:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching	2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen	51fa76f172	mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917 ) * mtmd : fix missing public header * no object * apply suggestion from Georgi * rm mtmd-helper, merge it to mtmd * missing vendor include dir	2025-05-31 10:14:29 +02:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Georgi Gerganov	53f925074d	sync : vendor (#13901 ) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci	2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen	10961339b2	mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866 ) * mtmd : move helpers to dedicated library * fix server build * rm leftover cmakelist code	2025-05-28 22:35:22 +02:00
Đinh Trọng Huy	e0e3aa231d	llama : add support for BertForSequenceClassification reranker (#13858 ) * convert: add support for BertForSequenceClassification * add support for reranking using BertForSequenceClassification * merge checks of eos and sep * fix lint --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-28 19:01:58 +02:00

1 2 3

127 Commits