# Expert Kit: A Distributed, Expert-Centric Framework for MoE LLM Inference

> [!CAUTION]
> Early Work-in-Progress. This project is currently a proof-of-concept demo and is under active development. It is not intended for production use and may contain significant bugs, security vulnerabilities, and unexpected behavior. We appreciate community feedback and contributions as we continue to build and refine this project.


## About

Expert Kit (EK) is a high-performance framework for scalable MoE (Mixture of Experts) LLM inference. The vision of EK is to provide an efficient foundation for Expert Parallelism (EP) on heterogeneous hardware (e.g., CPU and GPU) over commodity networks (e.g., PCIe, TCP, RDMA), thereby enabling easy deployment and fine-grained expert-level scaling.

EK features an Expert-Attention (E/A) separation architecture, enabling MoE LLMs to be deployed efficiently in a distributed environment composed of any mix of CPUs and GPUs. The motivation behind E/A separation lies in our observation that, in modern MoE LLMs, expert parameters account for the vast majority of the model size (e.g., over 90% in DeepSeek-V3). By decoupling the expert modules and deploying them across distributed GPUs and CPUs, EK leverages the high bandwidth and large capacity of distributed memory and storage systems.
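
To make the E/A split concrete, here is a minimal PyTorch sketch of the idea, assuming a toy top-k router. All names here (`ExpertWorker`, `SeparatedMoELayer`) are hypothetical and do not reflect the actual Expert Kit API: attention runs locally, while each expert FFN sits behind a worker object standing in for a remote CPU/GPU process.

```python
# Illustrative sketch of expert-attention (E/A) separation.
# NOT the Expert Kit API: ExpertWorker and SeparatedMoELayer are hypothetical.
import torch
import torch.nn as nn


class ExpertWorker:
    """Stands in for a remote CPU/GPU worker hosting one expert FFN.

    In a real E/A-separated deployment, forward() would be an RPC
    (e.g., over gRPC or RDMA) carrying only the tokens routed here.
    """

    def __init__(self, dim: int, hidden: int):
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x)


class SeparatedMoELayer(nn.Module):
    """Attention stays local; expert computation is dispatched to workers."""

    def __init__(self, dim, n_heads, workers, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.router = nn.Linear(dim, len(workers))
        self.workers = workers
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) Attention: small parameter footprint, kept local.
        h, _ = self.attn(x, x, x)
        # 2) Routing: pick the top-k experts per token.
        weights, idx = self.router(h).softmax(dim=-1).topk(self.top_k, dim=-1)
        # 3) Dispatch: send each token only to the workers holding its
        #    selected experts, then combine the weighted results.
        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, worker in enumerate(self.workers):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * worker.forward(h[mask])
        return x + out


workers = [ExpertWorker(dim=64, hidden=256) for _ in range(4)]
layer = SeparatedMoELayer(dim=64, n_heads=4, workers=workers)
print(layer(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```

In such a setup only the routed hidden states cross the frontend-worker boundary, which is what allows expert replicas to live on whichever CPU or GPU has spare memory.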

*Architecture illustration*

https://github.com/user-attachments/assets/9f1f5b23-28fe-44cf-b592-2f6ad0ad4dad

## Quick Start

Here are some tutorials to help you get started with Expert Kit quickly.

  1. DeepSeek-tiny: A tailored MoE model with DeepSeek-V3 architecture and small parameter count, designed for quick evaluation and testing of the Expert Kit framework.
  2. DeepSeek-V3: A demo for running the DeepSeek-V3 model with Expert Kit, showcasing the framework's capabilities in handling large-scale MoE models.
  3. Qwen3-30B-A3B: A demo for running the Qwen3-30B-A3B model with Expert Kit, showcasing the framework's capabilities in handling real-world MoE models.

## Key Features

- Low-Cost Deployment: supports distributed, mixed deployments of GPUs and CPUs.
- Fine-Grained Expert-Level Scalability: provides independent scaling of attention and experts, with dynamic scaling of hot experts on demand (see the sketch below).
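
As a rough illustration of demand-driven scaling, the sketch below shows one plausible policy (hypothetical, not Expert Kit's actual scheduler): replicate experts in proportion to their observed routing load, leaving the attention side untouched.

```python
# Hypothetical hot-expert replication policy -- a sketch, not EK's scheduler.
from collections import Counter


def plan_replicas(routed_ids: list[int], n_experts: int, budget: int) -> dict[int, int]:
    """Map expert id -> replica count, spending `budget` extra replicas
    on the most frequently routed ("hot") experts."""
    load = Counter(routed_ids)
    plan = {e: 1 for e in range(n_experts)}  # every expert keeps one replica
    total = sum(load.values()) or 1
    for e, hits in load.most_common():
        # Proportional share; rounding means the budget is spent approximately.
        plan[e] += round(budget * hits / total)
    return plan


# Expert 2 receives 6 of 8 routed tokens, so it gets most of the extra replicas.
print(plan_replicas([2, 2, 2, 2, 0, 1, 2, 2], n_experts=4, budget=4))
# -> {0: 1, 1: 1, 2: 4, 3: 1}
```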

## Performance

| Model | Throughput (tokens/s) | Environment |
| --- | --- | --- |
| DeepSeek-V3 671B W8A16 | 14.26 | 1x NVIDIA 4090 (24 GB) + 5x AMD EPYC 7302 |
| Qwen3-MoE-30B FP16 | 36.38 | 1x NVIDIA A10 (24 GB) + 1x AMD EPYC 7302 + 1x Kunpeng 920 |

## Repository Map

- ek-computation: implements the scheduling frontend and the computation backend.
- ek-db: supports registering and loading expert weights at fine granularity (illustrated after this list).
- ek-benchmark: contains several micro-benchmarks to help you understand performance.
- ek-solution: contains several recipes for quickly setting up a running cluster.
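
For intuition on what fine-grained weight loading buys, here is a toy in-memory illustration in the spirit of ek-db; the key scheme and storage API below are assumptions for illustration, not the project's actual interface. The point is that a worker fetches only the expert tensors it hosts instead of materializing the whole checkpoint.

```python
# Toy per-expert weight store -- the key scheme here is an assumption,
# not ek-db's actual format.
import torch


class ExpertWeightStore:
    """In-memory KV store keyed per (model, layer, expert, tensor name)."""

    def __init__(self):
        self._kv: dict[str, torch.Tensor] = {}

    @staticmethod
    def key(model: str, layer: int, expert: int, name: str) -> str:
        return f"{model}/layer{layer}/expert{expert}/{name}"

    def register(self, model: str, layer: int, expert: int, name: str,
                 tensor: torch.Tensor) -> None:
        self._kv[self.key(model, layer, expert, name)] = tensor

    def load(self, model: str, layer: int, expert: int, name: str) -> torch.Tensor:
        # Fine granularity: fetch one expert's tensor, not the full checkpoint.
        return self._kv[self.key(model, layer, expert, name)]


store = ExpertWeightStore()
store.register("deepseek-v3", layer=3, expert=17, name="gate_proj",
               tensor=torch.randn(256, 64))
print(store.load("deepseek-v3", layer=3, expert=17, name="gate_proj").shape)
```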

## Roadmap

### Core Features

- Frontend for request scheduling
  - Simple Executor
  - Extensible Executor
  - Scheduling Interface
- Backend compute engine for expert computation
  - PyTorch
  - ONNX Runtime
  - Candle
- Integration with existing frameworks for attention computation
  - PyTorch
  - vLLM
- Transport channel between frontend and backend
  - gRPC
  - RDMA
  - DSM

## Contact Us

If you have any questions, please join the discussion at https://expert-kit.zulipchat.com/ or open a new issue.

## License Agreement

- Primary License: This project as a whole is licensed under the GNU GPL 3.0.
- Third-Party Components:
  - Licenses and copyright notices for third-party components are located alongside the corresponding component's code.
  - The following components are included:
    - DeepSeek-V3 (code/complementary material): located in ek-integration/expertkit-torch/expertkit-torch/models/deepseek_v3/. This code is licensed under the DeepSeek License Agreement v1.0 and the MIT License. Please be aware that use of the associated DeepSeek model is subject to the use restrictions detailed in Attachment A of the DeepSeek License Agreement v1.0.
    - Qwen3-MoE: located in ek-integration/expertkit-torch/expertkit-torch/models/. This code is licensed under the Apache License, Version 2.0.
- Compliance: All third-party components are used in compliance with their original license terms.