
AReaL: Ant Reasoning Reinforcement Learning for LLMs

| Paper | Documentation | Ask DeepWiki | 🤗 Models & Data | WeChat Group |


AReaL (Ant Reasoning RL) is an open-source, fully asynchronous reinforcement learning training system for large reasoning models, developed at the RL Lab, Ant Research. Built upon the open-source project ReaLHF, AReaL is fully committed to open source: we provide the training details, data, and infrastructure required to reproduce results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy real-world milk tea (cheers).

AReaL Highlights

  • 🔥 [NEW] Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! Experimental support for multi-turn agentic RL is also provided.
  • 🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
  • 🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
  • 🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

News

[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.

[2025/03/31] (v0.2, boba) Here comes our next milestone release - boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.

[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.

Release Highlights

In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the three most important features:

  • A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop. Check the benchmark scripts and instructions here.

  • SOTA coding models, i.e., a 14B model with a 69.1 score on LCB-v5. To reproduce, check the configs and instructions.

  • Experimental support for multi-turn agentic RL training. Check our complete example.

For the complete system design and more training details, please check our v0.3 blog and our research paper.

Jump to the quickstart section if you want to quickly run an experiment and get your hands dirty! 😈

Overview of Asynchronous RL Training

In synchronous RL training, a generation step must wait until the longest sequence in the batch of LLM outputs completes. Because output lengths vary widely for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works (DeepCoder, Intellect) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch still come from the same model version, which leads to waiting and GPU idle time.

Synchronous vs One-step Overlap RL

Fig. 1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of a one-step overlap RL system.

AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.

Asynchronous RL Training

Fig. 2. Execution timeline of our fully asynchronous RL system.

AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the PPO objective to keep asynchronous RL stable.
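To make the system-side idea concrete, below is a minimal, self-contained sketch of the decoupled producer/consumer pattern described above: rollout workers stream samples tagged with the policy version that produced them, and the trainer only accepts samples whose version lag stays within a bound. This is an illustration of the concept rather than AReaL's actual code; the worker/trainer names, the queue, and the staleness bound are all hypothetical.

```python
import queue
import random
import threading
import time

MAX_STALENESS = 2      # max allowed policy-version lag for a training sample
BATCH_SIZE = 4

sample_queue: "queue.Queue[dict]" = queue.Queue()
current_version = 0    # latest policy version on the trainer side
stop = threading.Event()


def rollout_worker(worker_id: int) -> None:
    """Stream samples continuously; never wait for the trainer."""
    while not stop.is_set():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for LLM generation
        sample_queue.put({"worker": worker_id, "version": current_version})


def trainer(num_steps: int) -> None:
    """Consume batches, dropping samples whose policy version is too old."""
    global current_version
    for step in range(num_steps):
        batch = []
        while len(batch) < BATCH_SIZE:
            sample = sample_queue.get()
            if current_version - sample["version"] <= MAX_STALENESS:
                batch.append(sample)  # fresh enough to train on
        current_version += 1          # stand-in for a weight update + param sync
        print(f"step {step}: trained on policy versions "
              f"{sorted(s['version'] for s in batch)}")


workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
trainer(num_steps=5)
stop.set()
for w in workers:
    w.join()
```

In the real system, the in-process queue is replaced by distributed rollout workers backed by an inference server, and each trainer step updates model weights that are then synced back to the generation side.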

We compare the scalability of asynchronous RL training with our AReaL-boba² system against classical synchronous RL training (using veRL, the fastest open-source system, main branch as of 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling of training throughput. This is partly because AReaL decouples training from generation, which also leads to much less GPU memory fragmentation.

Scaling Comparison

Fig. 3. The scaling trend of asynchronous RL (based on AReaL-boba²) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.

SOTA Code Generation Model by AReaL-boba²

We use Qwen3 as our base model. After asynchronous RL training, we achieve SOTA results on the LiveCodeBench, Codeforces, and CodeContests benchmarks.

| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| 🤗 AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | 41.4 |
| 🤗 AReaL-boba²-8B | 63.0 | 1962/97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| 🤗 AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | 46.2 |
| 🤗 AReaL-boba²-14B | 69.1 | 2044/98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.

We highlight tutorials and code walkthroughs covering the key features for asynchronous training.

RL Training for Multi-turn Agent

AReaL-boba² allows you to independently customize the dataset, rollout behavior, and the training algorithm, without needing to modify the heavy system-level code.

In particular, we show a simple example of developing a multi-turn math agent for RL training. Please see the learning curve below and refer to the step-by-step guide if you want to implement your own agentic RL project.
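To illustrate that separation, here is a compact, self-contained sketch of a multi-turn math rollout: the agent keeps answering until a verifier accepts the result or the turn budget runs out, and the final outcome determines the reward. The class and function names are hypothetical and this is not AReaL's actual workflow API; in the real example, the generation call goes to the inference server and the verifier is the math reward function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    prompt: str
    response: str


@dataclass
class Episode:
    turns: List[Turn] = field(default_factory=list)
    reward: float = 0.0


def multi_turn_rollout(
    question: str,
    generate: Callable[[str], str],  # stand-in for the LLM generation call
    verify: Callable[[str], bool],   # stand-in for the math answer verifier
    max_turns: int = 3,
) -> Episode:
    """Let the model retry a math question for up to `max_turns` attempts."""
    episode = Episode()
    prompt = question
    for _ in range(max_turns):
        answer = generate(prompt)
        episode.turns.append(Turn(prompt=prompt, response=answer))
        if verify(answer):
            episode.reward = 1.0  # terminal reward on a correct final answer
            break
        prompt = f"{question}\nYour previous answer was wrong. Try again."
    return episode


# Toy usage with stub generation/verification functions.
episode = multi_turn_rollout(
    "What is 2 + 2?",
    generate=lambda p: "4" if "wrong" in p else "5",
    verify=lambda a: a.strip() == "4",
)
print(len(episode.turns), episode.reward)  # -> 2 1.0
```

Because the dataset, the rollout function, and the reward are independent pieces, each can be swapped out without touching the training loop.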

Getting Started

Obtain the training data:

For code training data, a simple preprocessing script is provided at examples/data_preprocess/preprocess_training_data.py:

python3 preprocess_training_data.py --data_path $original_data_path --output_path $training_data_path

Train Qwen3 1.7B locally (Remember to modify dataset.path in the script below):

bash examples/run_async_ppo.sh

Evaluation:

cd evaluation
# Evaluate the model on math benchmarks
python eval_and_aggregate.py \
  --model_path ${MODEL_PATH} \
  --output_path ${OUTPUT_PATH} \
  --data_names aime24,aime25 \
  --max_gen_tokens 32768

# Evaluate the model on coding benchmarks
python eval_and_aggregate.py \
  --model_path ${MODEL_PATH} \
  --output_path ${OUTPUT_PATH} \
  --data_names codeforces,lcb_v5 \
  --prompt_type qwen3-think-pure \
  --temperature 1.0
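The evaluation commands above assume MODEL_PATH and OUTPUT_PATH are already set in your shell; a minimal setup with placeholder paths (illustrative values, not defaults shipped with the repo) looks like this:

```bash
# Point these at your trained checkpoint and a directory for evaluation outputs.
export MODEL_PATH=/path/to/your/trained/model
export OUTPUT_PATH=/path/to/eval/outputs
```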

Resources

  • Quickstart
  • Benchmark and Reproduction
  • Customization Guide
  • System Code Walkthrough

Future Plan

AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also hiring interns and full-time employees with open positions in both the US and China.

For the research and development plan already in place, please see the following list:

System Development

  • Support for SGLang
  • RL training with coding problems
  • Asynchronous generation and RL training
  • Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
  • RL for vision-language models (VLM)
  • Multi-turn agentic RL
  • Function calling and tool use

Algorithm Development

  • RL training recipes for 1.5B and 7B models
  • A complete RL training recipe for 32B models
  • Sample-efficient multi-task RL algorithms
  • Agentic capabilities with end-to-end RL
  • Stable RL training for larger MOE models

Acknowledgement

We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

We also appreciate all the pioneering works from the community, particularly the ReaLHF project from OpenPsi Inc. and other projects, including but not limited to DeepScaleR, Open-Reasoner-Zero, OpenRLHF, VeRL, SGLang, QwQ, Light-R1 and DAPO.

Citation

@inproceedings{mei2025real,
  author       = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
  title        = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
  booktitle    = {Proceedings of the Eighth Conference on Machine Learning and Systems,
                  MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
  publisher    = {mlsys.org},
  year         = {2025},
}
@misc{fu2025areal,
      title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning}, 
      author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
      year={2025},
      eprint={2505.24298},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.24298}, 
}