# Expert Kit: A Distributed, Expert-Centric Framework for MoE LLM Inference

> [!CAUTION]
> Early Work-in-Progress. This project is currently a proof-of-concept demo and is under active development. It is not intended for production use and may contain significant bugs, security vulnerabilities, and unexpected behavior. We appreciate community feedback and contributions as we continue to build and refine this project.


## About

Expert Kit (EK) is a high-performance framework for scalable MoE (Mixture of Experts) LLM inference. The vision of EK is to provide an efficient foundation for Expert Parallelism (EP) on heterogeneous hardware (e.g., CPU and GPU) over commodity networks (e.g., PCIe, TCP, RDMA), thereby enabling easy deployment and fine-grained expert-level scaling.

EK features an Expert-Attention (E/A) separation architecture, enabling MoE LLMs to be deployed efficiently in a distributed environment composed of any mix of CPUs and GPUs. The motivation behind E/A separation lies in our observation that, in modern MoE LLMs, expert parameters account for the vast majority of the model size (e.g., over 90% in DeepSeek-V3). By decoupling the expert modules and deploying them across distributed GPUs and CPUs, EK leverages the high bandwidth and large capacity of distributed memory and storage systems.
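
To make the E/A split concrete, here is a minimal PyTorch sketch of the idea, assuming a toy top-k router. All names here (`ExpertWorker`, `SeparatedMoELayer`) are hypothetical and do not reflect the actual Expert Kit API: attention runs locally, while each expert FFN sits behind a worker object standing in for a remote CPU/GPU process.

```python
# Illustrative sketch of expert-attention (E/A) separation.
# NOT the Expert Kit API: ExpertWorker and SeparatedMoELayer are hypothetical.
import torch
import torch.nn as nn


class ExpertWorker:
    """Stands in for a remote CPU/GPU worker hosting one expert FFN.

    In a real E/A-separated deployment, forward() would be an RPC
    (e.g., over gRPC or RDMA) carrying only the tokens routed here.
    """

    def __init__(self, dim: int, hidden: int):
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x)


class SeparatedMoELayer(nn.Module):
    """Attention stays local; expert computation is dispatched to workers."""

    def __init__(self, dim, n_heads, workers, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.router = nn.Linear(dim, len(workers))
        self.workers = workers
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) Attention: small parameter footprint, kept local.
        h, _ = self.attn(x, x, x)
        # 2) Routing: pick the top-k experts per token.
        weights, idx = self.router(h).softmax(dim=-1).topk(self.top_k, dim=-1)
        # 3) Dispatch: send each token only to the workers holding its
        #    selected experts, then combine the weighted results.
        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, worker in enumerate(self.workers):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * worker.forward(h[mask])
        return x + out


workers = [ExpertWorker(dim=64, hidden=256) for _ in range(4)]
layer = SeparatedMoELayer(dim=64, n_heads=4, workers=workers)
print(layer(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```

In such a setup only the routed hidden states cross the frontend-worker boundary, which is what allows expert replicas to live on whichever CPU or GPU has spare memory.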

*Architecture illustration*

https://github.com/user-attachments/assets/9f1f5b23-28fe-44cf-b592-2f6ad0ad4dad

## Quick Start

Here are some tutorials to help you get started with Expert Kit quickly.

  1. DeepSeek-tiny: A tailored MoE model with DeepSeek-V3 architecture and small parameter count, designed for quick evaluation and testing of the Expert Kit framework.
  2. DeepSeek-V3: A demo for running the DeepSeek-V3 model with Expert Kit, showcasing the framework's capabilities in handling large-scale MoE models.
  3. Qwen3-30B-A3B: A demo for running the Qwen3-30B-A3B model with Expert Kit, showcasing the framework's capabilities in handling real-world MoE models.

## Key Features

- Low-Cost Deployment: supports distributed, mixed deployments of GPUs and CPUs.
- Fine-Grained Expert-Level Scalability: provides independent scaling of attention and experts, with dynamic scaling of hot experts on demand (see the sketch below).
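
As a rough illustration of demand-driven scaling, the sketch below shows one plausible policy (hypothetical, not Expert Kit's actual scheduler): replicate experts in proportion to their observed routing load, leaving the attention side untouched.

```python
# Hypothetical hot-expert replication policy -- a sketch, not EK's scheduler.
from collections import Counter


def plan_replicas(routed_ids: list[int], n_experts: int, budget: int) -> dict[int, int]:
    """Map expert id -> replica count, spending `budget` extra replicas
    on the most frequently routed ("hot") experts."""
    load = Counter(routed_ids)
    plan = {e: 1 for e in range(n_experts)}  # every expert keeps one replica
    total = sum(load.values()) or 1
    for e, hits in load.most_common():
        # Proportional share; rounding means the budget is spent approximately.
        plan[e] += round(budget * hits / total)
    return plan


# Expert 2 receives 6 of 8 routed tokens, so it gets most of the extra replicas.
print(plan_replicas([2, 2, 2, 2, 0, 1, 2, 2], n_experts=4, budget=4))
# -> {0: 1, 1: 1, 2: 4, 3: 1}
```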

## Performance

| Model | Throughput (tokens/s) | Environment |
| --- | --- | --- |
| DeepSeek-V3 671B W8A16 | 14.26 | 1x NVIDIA 4090 (24 GB) + 5x AMD EPYC 7302 |
| Qwen3-MoE-30B FP16 | 36.38 | 1x NVIDIA A10 (24 GB) + 1x AMD EPYC 7302 + 1x Kunpeng 920 |

## Repository Map

- ek-computation: implements the scheduling frontend and the computation backend.
- ek-db: supports registering and loading expert weights at fine granularity (illustrated after this list).
- ek-benchmark: contains several micro-benchmarks to help you understand performance.
- ek-solution: contains several recipes for quickly setting up a running cluster.
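
For intuition on what fine-grained weight loading buys, here is a toy in-memory illustration in the spirit of ek-db; the key scheme and storage API below are assumptions for illustration, not the project's actual interface. The point is that a worker fetches only the expert tensors it hosts instead of materializing the whole checkpoint.

```python
# Toy per-expert weight store -- the key scheme here is an assumption,
# not ek-db's actual format.
import torch


class ExpertWeightStore:
    """In-memory KV store keyed per (model, layer, expert, tensor name)."""

    def __init__(self):
        self._kv: dict[str, torch.Tensor] = {}

    @staticmethod
    def key(model: str, layer: int, expert: int, name: str) -> str:
        return f"{model}/layer{layer}/expert{expert}/{name}"

    def register(self, model: str, layer: int, expert: int, name: str,
                 tensor: torch.Tensor) -> None:
        self._kv[self.key(model, layer, expert, name)] = tensor

    def load(self, model: str, layer: int, expert: int, name: str) -> torch.Tensor:
        # Fine granularity: fetch one expert's tensor, not the full checkpoint.
        return self._kv[self.key(model, layer, expert, name)]


store = ExpertWeightStore()
store.register("deepseek-v3", layer=3, expert=17, name="gate_proj",
               tensor=torch.randn(256, 64))
print(store.load("deepseek-v3", layer=3, expert=17, name="gate_proj").shape)
```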

## Roadmap

### Core Features

- Frontend for request scheduling
  - Simple Executor
  - Extensible Executor
  - Scheduling Interface
- Backend compute engine for expert computation
  - PyTorch
  - ONNX Runtime
  - Candle
- Integration with existing frameworks for attention computation
  - PyTorch
  - vLLM
- Transport channel between frontend and backend
  - gRPC
  - RDMA
  - DSM

## Contact Us

If you have any questions, please join the discussion at https://expert-kit.zulipchat.com/ or open a new issue.

## License Agreement

- Primary License: This project as a whole is licensed under the GNU GPL 3.0.
- Third-Party Components:
  - Licenses and copyright notices for third-party components are located alongside the corresponding component's code.
  - The following components are included:
    - DeepSeek-V3 (code/complementary material): located in ek-integration/expertkit-torch/expertkit-torch/models/deepseek_v3/. This code is licensed under the DeepSeek License Agreement v1.0 and the MIT License. Please be aware that use of the associated DeepSeek model is subject to the use restrictions detailed in Attachment A of the DeepSeek License Agreement v1.0.
    - Qwen3-MoE: located in ek-integration/expertkit-torch/expertkit-torch/models/. This code is licensed under the Apache License, Version 2.0.
- Compliance: All third-party components are used in compliance with their original license terms.