[Doc] Update README. (#74)

* update benchmark script

* .

* add benchmark docs

* add v0.3.0 configs

* .

* PullRequest: 178 multi turn math agent training

Merge branch gjx/multi-turn-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/178?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* multi turn math agent training
* training data logging and clean math multi-turn exp
* fix
* .

* fix

* change readme

* fix typo

* revert multiturn

---------

Co-authored-by: 步偶 <sam.gjx@antgroup.com>
Wei Fu 2025-06-04 00:43:19 +08:00 committed by GitHub
parent 2d4d937d10
commit d56df5102e
16 changed files with 627 additions and 48 deletions

README.md (133 changed lines)

@@ -8,68 +8,101 @@
<img align="right" alt="ReaL" src="/assets/logo.png" width="20%">
AReaL (Ant Reasoning RL) is a fully open-sourced, scalable, and efficient reinforcement learning training system for large language models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by releasing training details, data, and infra required to reproduce all the models with the desired performances. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy A-ReaL-milk-tea.🧋
AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by opening training details, data and infra required to reproduce the results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy a real-world milk tea (cheers).
**AReaL Highlights**
+ 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training LLMs.
+ 🔥 **[NEW] Asynchronous RL:** With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
+ 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training LLMs.
+ 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from 1 single node to 1K GPUs.
+ 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities. We are actively working on other domains, such as coding and agent, as well.
+ 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
## News
**[2025/03/31]** **(v0.2, Boba)** Our milestone release Boba! Please call it A-ReaL-Boba! This release includes much accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning.
**[2025/06/03] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).
**[2025/02/24]** **(v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
**[2025/03/31] (v0.2, Boba)** Here comes our next milestone release - Boba! Please call it A-ReaL-Boba! This release includes much accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).
## AReaL-boba Milestones and Highlights
In our boba release, we highlight the 3 most important milestones:
**[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
+ Full SGLang support and a collection of efficiency improvements
+ A SOTA 7B math reasoning model [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) and the corresponding training data [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
+ A particularly competitive 32B model [AReaL-boba-SFT-32B](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) that can be trained with extremely low cost. (Training Data: [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl))
## Release Highlights
For the complete training and model details, please check [our v0.2 technical blog](/blog/AReaL_v0_2.md).
In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:
#### SGLang support with 1.5x speedup on 7B Training
+ A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving [over 2.77x speedup](https://github.com/inclusionAI/AReaL/tree/main/benchmark) without any performance drop.
+ SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. [Reproducible results](https://inclusionai.github.io/AReaL/references/reproduce.html) with fully open-sourced datasets are also provided.
+ Experimental support for **multi-turn** agentic RL training.
![throughput_comparision_with_v0.1.0.png](/assets/thpt_comparison.png)
For a comprehensive presentation of the system design and training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](https://arxiv.org/pdf/2505.24298).
Thanks to a series of system-level optimizations, AReaL v0.2 improves its end-to-end training performance by up to 73%.
### Overview of Asynchronous RL Training
In the following table, we show the convergence time under different resource settings:
During the synchronous RL training process, a generation step must wait until the longest sequence in the batch of LLM outputs completes. Due to the varying output lengths of LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose to overlap a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.
| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** |**32B (SFT)** |
| --- |:--------:|:--------:|:--------:|:------:|:------:|:-------:|
| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
| Step | 250 | 250 | 250 | 400 | 400 | 300 |
| Time (h) | ~240 | ~69 | ~27 | ~252 | ~90 | ~3.5 |
![Synchronous vs One-step Overlap RL](/assets/sync_one_step_gen.png)
*Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
#### SOTA 7B model using RL in math reasoning
| **Model** | **AIME 2024** | **AIME 2025** | **GPQA-Diamond** |
AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
![Asynchronous RL Training](/assets/async_timeline.png)
*Fig 2. Execution timeline of our fully asynchronous RL system.*
AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs up model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
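As a rough illustration of the staleness-control idea on the system side, here is a minimal sketch (names such as `Rollout` and `admit` are hypothetical, not AReaL's actual API): the controller only keeps samples whose generating policy version is within a configurable bound of the trainer's current version.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    tokens: list              # generated token ids
    behavior_version: int     # model version that produced this trajectory
    reward: float

def admit(rollout: Rollout, trainer_version: int, max_staleness: int) -> bool:
    """Keep only samples that are not too stale relative to the current policy."""
    return trainer_version - rollout.behavior_version <= max_staleness
```

Setting the bound to 0 recovers synchronous training, while larger values trade data freshness for throughput.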
We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling w.r.t. training throughput, partly because AReaL decouples training and generation, which leads to much less GPU memory fragmentation. (Check the [benchmark directory](/benchmark) for a detailed benchmark guide.)
![Scaling Comparison](/assets/async_scaling_vs_verl.png)
*Fig.3 The scaling trend of asynchronous RL (based on AReaL-boba²) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.*
### SOTA Code Generation Model by AReaL-boba²
We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforce, and CodeContests benchmarks.
| **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| O1-Preview | 56.7 | - | |
| R1-Distill-Qwen-7B | 55.0 | 39.7 | 47.1 |
| Light-R1-7B-DS | 56.7 | 44.9 | 40.9 |
| [AReaL-boba-RL-7B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | **61.9** | **48.3** | **47.6** |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | **41.4** |
| AReaL-boba²-8B | **63.0** | **1962/97.5%** | 40.8 |
| **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | **46.2** |
| AReaL-boba²-14B | **69.1** | **2044/98.2%** | 46.1 |
We use **R1-Distill-Qwen-7B** as our base model. After RL training, the pass@1 scores on AIME 2024 and AIME 2025 improve by 6.9 and 8.6 points, respectively, achieving SOTA performance among 7B models in mathematical reasoning. We have released the training data at [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
| **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. We plan to open-source more datasets in the future, including code, STEM, and other domains. All the reported numbers are re-evaluated using our evaluation code with more details in our [blog](blog/AReaL_v0_2.md#eval_detail).
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests.*
#### Approaching QwQ-32B performances using only 200 data samples
| **Model** | **AIME 2024** |
| :---: | :---: |
| R1-Distill-Qwen-32B | 72.6 |
| QwQ-32B | 78.9 |
| [AReaL-boba-SFT-32B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) | 78.8 |
We highlight the tutorials about the following key features for asynchronous training in AReaL-boba²:
+ [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
+ [Interruptible rollout](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ [Data staleness control with the rollout controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ [The adoption of decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html)
We provide a step-by-step [code walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html) and [customization guide](https://inclusionai.github.io/AReaL/customization/dataset.html) for these features and recommend users to walk through the corresponding documentation.
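To give a flavor of the decoupled PPO loss mentioned above, here is a minimal PyTorch-style sketch, assuming that clipping is performed against a proximal policy while a capped behavior-policy importance weight reweights the loss (the defaults mirror the `eps_clip` and `behav_imp_weight_cap` values in the released configs); see the v0.3 blog and the algorithm customization guide for the exact formulation used by AReaL.

```python
import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages,
                       eps_clip=0.2, behav_imp_weight_cap=5.0):
    """Sketch of a decoupled PPO policy loss (illustrative, not AReaL's exact code).

    logp_new   : log-probs under the policy being optimized
    logp_prox  : log-probs under the proximal (recent target) policy
    logp_behav : log-probs under the behavior policy that generated the data
    """
    # Clip against the proximal policy rather than the behavior policy.
    ratio = torch.exp(logp_new - logp_prox)
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # A capped importance weight accounts for the gap between the proximal and
    # behavior policies introduced by asynchronous generation.
    behav_weight = torch.exp(logp_prox - logp_behav).clamp(max=behav_imp_weight_cap)
    return -(behav_weight * surrogate).mean()
```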
### RL Training for Multi-turn Agent
AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without the need to modify the heavy system-level code.
In particular, we show a simple example of developing a multi-turn math agent for RL training. Please see the learning curve below and refer to the [step-by-step guide](https://inclusionai.github.io/AReaL/customization/agent.html) if you want to implement your own agentic RL project.
![Multi-turn Agent Learning Curve](/assets/multiturn_reward.png)
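For intuition, a multi-turn rollout can be as simple as the following sketch (the `llm`, `reward_fn`, and message format here are hypothetical placeholders, not AReaL's actual Agent API):

```python
# Hypothetical multi-turn math agent rollout: keep querying the model until it
# commits to a final answer or the turn budget runs out, then score the episode.
async def run_math_episode(llm, reward_fn, prompt: str, max_turns: int = 4):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = await llm.generate(messages)           # streaming generation server
        messages.append({"role": "assistant", "content": reply})
        if "\\boxed{" in reply:                        # agent produced a final answer
            break
        messages.append({"role": "user", "content": "Please continue and give the final answer."})
    return messages, reward_fn(prompt, messages[-1]["content"])
```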
Building upon **R1-Distill-Qwen-32B**, we replicate **QwQ-32B's** inference performance on AIME 2024 using just **200 data points** via Supervised Fine-Tuning (SFT). We have released the training data at [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl).
## Getting Started
### Quick Start
Train Qwen3 1.7B locally:
@@ -95,7 +128,7 @@ python3 training/main_async_ppo.py \
max_head_offpolicyness=4
```
Evaluation
Evaluation:
```bash
bash examples/env/scripts/setup-eval-pip-deps.sh
@@ -117,23 +150,29 @@ python eval_and_aggregate.py \
+ [Quickstart](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
+ [Code Walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html)
+ **Customization Guide**
+ [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
+ [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
+ [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
- [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
- [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
- [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
+ [Contributing](https://inclusionai.github.io/AReaL/contrib.html)
## Future Plan
AReaL is under active development. We will have major releases in a weekly manner. We also highly appreciate efforts from the community as well. Here we highlight our future research and development plan.
AReaL is under active development. We plan to have minor releases in a weekly manner and major releases in a monthly manner. Community engagement and contributions are extremely welcome. We are also **hiring interns and full-timers** with open positions in both the US and China.
For the research and development plan already in place, please see the following list:
### System Development
- [x] Support for SGLang.
- [x] RL training with coding problems.
- [x] Asynchronous generation and RL training.
- [ ] Optimizations for distributed training: expert parallel and zero-bubble pipelining.
- [ ] Optimizations for distributed training: expert parallelism for MoE and zero-bubble pipelining.
- [ ] RL for vision-language models (VLM).
- [ ] Function calling and agent capabilities.
- [x] Multi-turn agentic RL.
- [ ] Function calling and tool use.
### Algorithm Development
- [x] RL training recipes for 1.5B and 7B models.
- [x] A complete RL training recipe for 32B models.
- [ ] Sample-efficient multi-task RL algorithms.
@@ -141,14 +180,14 @@ AReaL is under active development. We will have major releases in a weekly manne
- [ ] Stable RL training for larger MoE models.
## Acknowledgement
We would like to remark that major contributors are from **RL Lab at Ant Research** and **Institute for Interdisciplinary Information Sciences, Tsinghua University**.
We would like to remark that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
Our team has also received invaluable assistance from the Super Computing Technology (SCT) team at Ant Group, particularly in large-scale cluster operations and maintenance.
Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.
We also appreciate all the pioneer works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and many other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and many other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
## Citation
```plain
```bibtex
@inproceedings{mei2025real,
author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
title = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
@@ -159,7 +198,7 @@ We also appreciate all the pioneer works from the community, particularly the [R
}
```
```plain
```bibtex
@misc{areal2025,
author = {RL Lab, Ant Research},
title = {AReaL: Ant Reasoning RL},

BIN assets/algo_ablation.png (new file)
BIN assets/arch.png (new file)
BIN assets/async_timeline.png (new file)
BIN assets/multiturn_reward.png (new file)
BIN 6 additional binary image assets (names and previews not shown)

blog/AReaL_v0_3.md (new file, 180 lines)

@@ -0,0 +1,180 @@
# AReaL v0.3
## Introduction
We now release AReaL v0.3, featuring three major milestones:
- **A fully asynchronous RL training pipeline with system and RL algorithm co-design**, achieving over 2.77x speedup without any performance drop
- **SOTA coding models**, i.e., a 14B model with a **69.1 score on LCB-v5**. Reproducible results with fully open-sourced datasets are also provided
- Experimental support for **multi-turn agentic RL training**
### Performance Results
| **Model (8B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | **41.4** |
| AReaL-boba²-8B | **63.0** | **1962/97.5%** | 40.8 |
| **Model (14B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | **46.2** |
| AReaL-boba²-14B | **69.1** | **2044/98.2%** | 46.1 |
| **Larger Models** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests.*
## Motivation for Asynchronous RL System
### Inference devices are underutilized
During synchronous RL training, the generation step must wait until the longest output within a batch finishes. Because output lengths vary widely for LRMs, a synchronous RL system suffers from severe training inefficiency.
Some works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) use outputs generated by a previous model version to update the current model. For the best performance, the model version used for rollout generation is limited to only one or two steps behind. However, these systems still follow a batched generation setting, so the inefficiency of the generation phase remains unaddressed.
![](/assets/sync_one_step_gen.png)
*Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
### Scalability is poor in synchronous RL systems
Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size. This pushes the decoding process into a memory-IO-bound regime where additional devices fail to improve throughput.
![](/assets/gen_scaling_trend.png)
*Fig2. Left: Strong scaling of batched generation throughput for a 1.5B LRM. Right: Generation becomes memory-IO bound as GPU count increases.*
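A back-of-envelope calculation illustrates why decoding becomes memory-IO bound (all numbers below are illustrative assumptions, not measurements):

```python
# In the memory-IO-bound regime, per-GPU decode throughput is roughly capped by
# how many times per second the model weights can be streamed from HBM
# (KV-cache traffic is ignored here for simplicity).
weight_bytes = 1.5e9 * 2        # 1.5B parameters in bf16 (assumed)
hbm_bandwidth = 3.35e12         # ~3.35 TB/s for an H800-class GPU (assumed)
per_gpu_batch = 4               # small decode batch after sharding generation widely

forward_passes_per_s = hbm_bandwidth / weight_bytes      # weight reads per second
tokens_per_s = forward_passes_per_s * per_gpu_batch      # one token per sequence per pass
print(f"~{tokens_per_s:,.0f} tokens/s per GPU at batch {per_gpu_batch}")

# Doubling the GPU count halves the per-GPU batch, so in this regime the
# aggregate throughput stays roughly flat instead of scaling linearly.
```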
## Asynchronous AReaL Overview
We design a system that fully decouples generation and training across separate GPU clusters. Such a system should be hardware-efficient, scalable, and flexible enough to support customized RL workflows. We implement these principles in AReaL. Fig.3 presents the architecture and data flow of AReaL. The system comprises four core components:
![](/assets/arch.png)
*Fig.3 The architecture featuring asynchronous generation and training components.*
### Core Components
- **Interruptible Rollout Worker** handles two types of requests: (1) the *generate* request, which generates responses given prompts, and (2) the *update_weights* request, which interrupts all ongoing generations and loads parameters of a new version. Upon interruption, the rollout workers discard KV caches computed with the old weights and recompute them using the new weights. Afterwards, the rollout workers continue to decode the unfinished sequences until the next interruption or termination. We emphasize that such interruptions and in-flight weight updates result in trajectories composed of segments produced by different model versions. This introduces a novel algorithmic challenge, which is addressed in the next section.
- **Reward Service** evaluates the accuracy of the responses generated by the model. For example, in the coding task, this service extracts the code and executes unit tests to verify its accuracy.
- **Trainer Workers** continuously sample from the replay buffer, accumulating data until reaching the configured training batch size. They then perform PPO updates and store the resulting parameters in distributed storage. To ensure data freshness, data from the replay buffer is used only once.
- **Rollout Controller** serves as a critical bridge between the rollout workers, the reward service, and the trainer workers. During training, it reads data from the dataset and invokes the rollout workers' *generate* request. The received response is then sent to the reward service to obtain a reward. The trajectory, together with its reward, is stored in the replay buffer, waiting to be consumed by the trainer workers. After the trainer workers update the parameters, the controller calls the rollout workers' *update_weights* request. We illustrate the generation and training management in Fig.4. This asynchronous pipeline ensures continuous full utilization of both generation and training resources.
![](/assets/async_timeline.png)
*Fig 4. Execution timeline of our fully asynchronous RL system.*
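The data flow above can be summarized with a rough sketch of the controller's event loop (hypothetical class and method names, not AReaL's actual implementation):

```python
import asyncio

async def controller_loop(prompts, rollout, reward_service, buffer, trainer,
                          train_batch_size=2048):
    """Minimal sketch: stream rollouts, score them, train, and push new weights."""

    async def produce(prompt):
        # generate() streams tokens and may be interrupted by update_weights();
        # the rollout worker then recomputes KV caches and keeps decoding.
        traj = await rollout.generate(prompt)
        traj.reward = await reward_service.score(prompt, traj)
        await buffer.put(traj)                       # enqueue for training

    producers = [asyncio.create_task(produce(p)) for p in prompts]

    while any(not t.done() for t in producers) or await buffer.size() >= train_batch_size:
        batch = await buffer.take(train_batch_size)  # each trajectory is used only once
        version = trainer.ppo_update(batch)          # PPO step with the decoupled loss
        await rollout.update_weights(version)        # in-flight update of rollout workers

    await asyncio.gather(*producers)
```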
## Algorithmic Challenges and Solutions for Asynchronous RL
### Challenges
While the asynchronous system design offers significant acceleration through improved device utilization, it introduces several technical challenges that require algorithmic considerations.
- **Challenge 1: Data Staleness** - Due to the asynchronous nature of AReaL, each training batch contains data from multiple prior policy versions. Data staleness leads to a distribution gap between the training data and the latest model. In asynchronous RL training for LRMs, this issue can be even more severe for long trajectories because of their extended decoding time.
- **Challenge 2: Inconsistent Policy Versions** - As shown in Fig.4, a single generated trajectory may contain segments produced by different policy versions. This inconsistency fundamentally violates the formulation of standard PPO, which assumes all actions are generated by a single policy.
### Solutions
To overcome these two challenges, we propose two solutions:
- **Solution 1: Staleness-Aware Training** - To mitigate the negative impact of data staleness, we constrain the version discrepancy between the policy that generated a trajectory and the policy being trained. We introduce a hyperparameter η representing the maximum permitted staleness. When η = 0, the system degenerates to the synchronous RL setting where generation exactly matches training batches. When η = 1, the system recovers the previous one-step overlap methods.
- **Solution 2: Decoupled PPO Objective** - We apply a decoupled PPO objective that disentangles the behavior policy from the proximal policy. The behavior policy is the policy used to sample trajectories, while the proximal policy serves as a recent target to regularize the update of the online policy.
![](/assets/decoupled_ppo_obj.png)
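In symbols, the decoupled objective takes roughly the following form (a reconstruction from the description above, where $\pi_{\text{behav}}$ is the behavior policy, $\pi_{\text{prox}}$ the proximal policy, and $\hat{A}_t$ the advantage estimate; the exact expression used by AReaL, e.g. the cap on the behavior importance weight, may differ):

$$
J(\theta) = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\left[
\frac{\pi_{\text{prox}}(a_t \mid s_t)}{\pi_{\text{behav}}(a_t \mid s_t)}
\min\left(
\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}\,\hat{A}_t,\;
\operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t
\right)\right]
$$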
## Validating Asynchronous AReaL
We first evaluate Asynchronous AReaL on a math task to validate our system and algorithmic choices. Building on our previous release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training dataset.
### End-to-End Performance Comparison
We compared synchronous and asynchronous training on 1.5B and 7B models. Under equal resource constraints and training steps, the asynchronous system was over twice as fast as the synchronous one. Evaluations on AIME24 confirmed that this acceleration does not compromise performance.
| **Model (1.5B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|:---:|:---:|:---:|:---:|:---:|
| basemodel | 29.3 | - | - | - |
| Synchronous AReaL | 42.0 | 16 | 1000 | 41.0 |
| Asynchronous AReaL | 42.2 | 16 | 1000 | **14.8** |
| **Model (7B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|:---:|:---:|:---:|:---:|:---:|
| basemodel | 54.3 | - | - | - |
| Synchronous AReaL | 63.0 | 24 | 1000 | 57.7 |
| Asynchronous AReaL | 63.1 | 24 | 1000 | **25.4** |
*Table 2: End-to-end comparison between asynchronous AReaL and synchronous AReaL. Following the last release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training data. Each performance number is the average over 32 sampled responses.*
### Ablation Study
#### System Ablations
We ablate interruptible generation and present the resulting generation throughput in Fig.5. Without interruptible generation, the controller must wait for the longest response. In particular, interruptible generation leads to a 12% and 17% throughput increase for 1.5B and 7B models respectively on 4 nodes.
![](/assets/interrupt_gen_ablation.png)
*Fig.5 Ablation study of interruptible generation.*
#### Algorithm Ablations
We vary the maximum allowed staleness η and compare configurations with and without the decoupled PPO objective. Fig.6 shows the learning curves after 1000 PPO updates, demonstrating that naive PPO fails to match the performance of the synchronous RL oracle (i.e., the performance when η = 0). Even slight staleness can significantly degrade final performance due to the improper clipping center and policy changes during interruptible generation. Furthermore, increasing data staleness consistently degrades learning performance.
![](/assets/algo_ablation.png)
*Fig.6 Left: Learning curves with naive PPO. Right: Learning curves with decoupled PPO objective.*
| **η** | **AIME 24 (naive PPO)** | **AIME 24 (+ decoupled objective)** | **AIME 25 (naive PPO)** | **AIME 25 (+ decoupled objective)** |
|:---:|:---:|:---:|:---:|:---:|
| 0 | 42.0 | | 32.9 | |
| 1 | 41.8 | 42.1 | 30.7 | 31.9 |
| 2 | 40.0 | 41.8 | 32.1 | 32.5 |
| 4 | 23.3 | 42.2 | 23.1 | 32.0 |
| 8 | 35.7 | 41.0 | 27.8 | 31.1 |
| 16 | 35.8 | 38.7 | 26.2 | 32.5 |
| ∞ | 34.0 | 36.9 | 26.9 | 29.9 |
*Table 3: Evaluation scores when varying data staleness, comparing performance with and without the decoupled objective.*
However, the decoupled PPO objective substantially improves training stability when handling stale data. Notably, even with the decoupled objective, unbounded staleness (maximum staleness → ∞) still results in inferior performance compared to the zero-staleness oracle. When properly constrained, moderate staleness (e.g., η ≤ 4) has minimal impact on final performance while significantly accelerating training through the asynchronous pipeline, as demonstrated in Tab.3 and Fig.7.
![](/assets/staleness_throughput.png)
*Fig.7 The relationship between η and training throughput. Larger η leads to higher throughput.*
## Training Recipe with Asynchronous AReaL
### Dataset Curation
We train the 7B and 14B models using open-source data from [DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
### Rollout Strategy
During the rollout phase, the LLM generates 16 responses per question. To minimize output truncation, we set the maximum generation length to 27K tokens.
### Filtering Strategy
In the design of our asynchronous system, we can naturally incorporate strategies to filter some rollouts. Specifically, we exclude the questions where all 16 answers are either entirely correct or entirely incorrect.
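Concretely, the rule can be expressed with the `success_rate_ub`/`success_rate_lb` thresholds listed below (a tiny illustrative helper, not AReaL's actual code):

```python
def keep_group(rewards, lb=0.05, ub=0.95):
    """Keep a prompt only if its group success rate is strictly between lb and ub."""
    success_rate = sum(r > 0 for r in rewards) / len(rewards)
    return lb < success_rate < ub

keep_group([1.0] * 16)               # False: all 16 answers correct, no learning signal
keep_group([0.0] * 16)               # False: all incorrect
keep_group([1.0] * 6 + [0.0] * 10)   # True: mixed outcomes are kept for training
```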
### Key Hyperparameters
| **Parameter** | **Value** |
|:---:|:---:|
| PPO Minibatches | 1 |
| Learning Rate | 2e-5 |
| Adam ε | 1e-5 |
| Batch Size | 2,048 |
| max_head_offpolicyness η | 16 |
| success_rate_ub | 0.95 |
| success_rate_lb | 0.05 |


@@ -8,7 +8,7 @@ The experiment configurations used to train our released models can be found in
+ Dataset: [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl)
+ `examples/configs/v0.3-qwen3-code`: Configuration for reproducing the boba² coding model based on Qwen3
+ Dataset: [DeepCoder Dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset) (duplicated problems and problems with incorrect answers have been filtered out)
+ Dataset: [DeepCoder Dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
## Adjusting Computation Resources


@@ -92,6 +92,8 @@ python3 training/main_sync_ppo.py --help
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training. Defaults to True.
- **`ppo.gen.max_new_tokens`**: Maximum tokens to generate per prompt.
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO update.
- **`success_rate_ub`**: Upper bound of success rate. Prompts with a higher success rate will be filtered out.
- **`success_rate_lb`**: Lower bound of success rate. Prompts with a lower success rate will be filtered out.
## Monitoring the Training Process


@@ -0,0 +1,179 @@
experiment_name: qwen3-14b-code
trial_name: my-trial
seed: 1
mode: ray
metric_discovery_port: 0
wandb:
mode: disabled
entity: null
project: null
name: null
job_type: null
group: null
notes: null
tags: null
config: null
tensorboard:
path: null
recover_mode: auto
recover_retries: 10
recover_after: 10
exp_ctrl:
total_train_epochs: 10
save_freq_epochs: null
save_freq_steps: 20
save_freq_secs: null
ckpt_freq_epochs: null
ckpt_freq_steps: null
ckpt_freq_secs: 600
eval_freq_epochs: null
eval_freq_steps: null
eval_freq_secs: null
benchmark_steps: null
benchmark_n_seqs: null
torch_cache_mysophobia: true
cache_clear_freq: 1
max_head_offpolicyness: 16
n_rollout_workers: null
max_concurrent_rollouts: 512
flush_request_timeout: 600
cpus_per_generation_server: 4
mem_per_generation_server: 61440
cpus_per_gserver_manager: 4
mem_per_gserver_manager: 10240
cpus_per_rollout_worker: 4
mem_per_rollout_worker: 20480
allocation_mode: sglang.d80m4p1+d4p8m2
n_nodes: 48
n_gpus_per_node: 8
ray_temp_path: /tmp/ray
cluster:
fileroot: /home/admin/.cache/realhf
n_nodes: 48
n_gpus_per_node: 8
actor:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-14B/
init_from_scratch: false
gradient_checkpointing: true
bf16: false
optimizer:
type: adam
lr: 2.0e-05
weight_decay: 0.05
beta1: 0.9
beta2: 0.95
eps: 1.0e-05
min_lr_ratio: 0.0
lr_scheduler_type: constant
warmup_steps_proportion: 0.001
initial_loss_scale: 4294967296.0
min_loss_scale: 1.0
loss_scale_window: 5.0
hysteresis: 2
gradient_clipping: 1.0
megatron:
ddp:
grad_reduce_in_fp32: true
overlap_grad_reduce: true
use_distributed_optimizer: true
sglang:
disable_cuda_graph: false
disable_radix_cache: false
disable_cuda_graph_padding: false
enable_nccl_nvls: false
disable_outlines_disk_cache: false
disable_custom_all_reduce: false
disable_overlap_schedule: false
enable_mixed_chunk: false
enable_torch_compile: false
torch_compile_max_bs: 32
cuda_graph_max_bs: null
cuda_graph_bs: null
torchao_config: ''
enable_nan_detection: false
enable_p2p_check: false
triton_attention_reduce_in_fp32: false
triton_attention_num_kv_splits: 16
num_continuous_decode_steps: 1
enable_memory_saver: false
allow_auto_truncate: false
attention_backend: flashinfer
sampling_backend: null
context_length: 30720
mem_fraction_static: 0.7
max_running_requests: null
chunked_prefill_size: -1
max_prefill_tokens: 32768
schedule_policy: lpm
schedule_conservativeness: 1.0
cpu_offload_gb: 0
dtype: float16
kv_cache_dtype: auto
log_level: warning
log_level_http: warning
log_requests: false
log_requests_level: 0
show_time_cost: false
enable_metrics: true
decode_log_interval: 1
ref:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-14B/
init_from_scratch: false
bf16: false
actor_train:
mb_spec:
max_tokens_per_mb: 30720
ref_inf:
mb_spec:
max_tokens_per_mb: 30720
actor_inf:
mb_spec:
max_tokens_per_mb: 30720
shuffle_dataset: true
dataset:
path: /path/to/dataset.jsonl
max_prompt_len: 2048
train_bs_n_seqs: 128
group_size: 16
mask_too_long: false
group_adv_norm: false
rw_type: sparse
success_rate_ub: 0.95
success_rate_lb: 0.05
ppo:
gen:
n: 1
max_new_tokens: 27648
min_new_tokens: 0
greedy: false
top_p: 1.0
top_k: 100000000
temperature: 1.0
ppo_n_minibatches: 1
eps_clip: 0.2
c_clip: null
value_eps_clip: 0.2
early_stop_imp_ratio: 5.0
actor_sample_reuse: 1
critic_sample_reuse: 1
max_reward_clip: 20.0
reward_output_scaling: 5.0
reward_output_bias: 0.0
fuse_rew_ref: true
discount: 1.0
gae_lambda: 1.0
adv_norm: true
kl_ctl: 0.0
use_adaptive_kl_ctl: false
disable_value: true
recompute_logprob: true
use_decoupled_loss: true
behav_imp_weight_cap: 5.0
cpus_per_master_worker: 4
mem_per_master_worker: 20000
cpus_per_model_worker: 4
mem_per_model_worker: 90000


@@ -0,0 +1,179 @@
experiment_name: qwen3-8b-code
trial_name: my-trial
seed: 1
mode: ray
metric_discovery_port: 0
wandb:
mode: disabled
entity: null
project: null
name: null
job_type: null
group: null
notes: null
tags: null
config: null
tensorboard:
path: null
recover_mode: auto
recover_retries: 10
recover_after: 10
exp_ctrl:
total_train_epochs: 10
save_freq_epochs: null
save_freq_steps: 20
save_freq_secs: null
ckpt_freq_epochs: null
ckpt_freq_steps: null
ckpt_freq_secs: 600
eval_freq_epochs: null
eval_freq_steps: null
eval_freq_secs: null
benchmark_steps: null
benchmark_n_seqs: null
torch_cache_mysophobia: true
cache_clear_freq: 1
max_head_offpolicyness: 16
n_rollout_workers: null
max_concurrent_rollouts: 512
flush_request_timeout: 300
cpus_per_generation_server: 4
mem_per_generation_server: 61440
cpus_per_gserver_manager: 4
mem_per_gserver_manager: 10240
cpus_per_rollout_worker: 4
mem_per_rollout_worker: 20480
allocation_mode: sglang.d80m2p1+d4m2p4
n_nodes: 24
n_gpus_per_node: 8
ray_temp_path: /tmp/ray
cluster:
fileroot: /home/admin/.cache/realhf
n_nodes: 32
n_gpus_per_node: 8
actor:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-8B/
init_from_scratch: false
gradient_checkpointing: true
bf16: false
optimizer:
type: adam
lr: 2.0e-05
weight_decay: 0.05
beta1: 0.9
beta2: 0.95
eps: 1.0e-05
min_lr_ratio: 0.0
lr_scheduler_type: constant
warmup_steps_proportion: 0.001
initial_loss_scale: 4294967296.0
min_loss_scale: 1.0
loss_scale_window: 5.0
hysteresis: 2
gradient_clipping: 1.0
megatron:
ddp:
grad_reduce_in_fp32: true
overlap_grad_reduce: true
use_distributed_optimizer: true
sglang:
disable_cuda_graph: false
disable_radix_cache: false
disable_cuda_graph_padding: false
enable_nccl_nvls: false
disable_outlines_disk_cache: false
disable_custom_all_reduce: false
disable_overlap_schedule: false
enable_mixed_chunk: false
enable_torch_compile: false
torch_compile_max_bs: 32
cuda_graph_max_bs: null
cuda_graph_bs: null
torchao_config: ''
enable_nan_detection: false
enable_p2p_check: false
triton_attention_reduce_in_fp32: false
triton_attention_num_kv_splits: 16
num_continuous_decode_steps: 1
enable_memory_saver: false
allow_auto_truncate: false
attention_backend: flashinfer
sampling_backend: null
context_length: 30720
mem_fraction_static: 0.7
max_running_requests: null
chunked_prefill_size: -1
max_prefill_tokens: 32768
schedule_policy: lpm
schedule_conservativeness: 1.0
cpu_offload_gb: 0
dtype: float16
kv_cache_dtype: auto
log_level: warning
log_level_http: warning
log_requests: false
log_requests_level: 0
show_time_cost: false
enable_metrics: true
decode_log_interval: 1
ref:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-8B/
init_from_scratch: false
bf16: false
actor_train:
mb_spec:
max_tokens_per_mb: 30720
ref_inf:
mb_spec:
max_tokens_per_mb: 30720
actor_inf:
mb_spec:
max_tokens_per_mb: 30720
shuffle_dataset: true
dataset:
path: /path/to/dataset.jsonl
max_prompt_len: 2048
train_bs_n_seqs: 128
group_size: 16
mask_too_long: false
group_adv_norm: false
rw_type: sparse
success_rate_ub: 0.95
success_rate_lb: 0.05
ppo:
gen:
n: 1
max_new_tokens: 27648
min_new_tokens: 0
greedy: false
top_p: 1.0
top_k: 100000000
temperature: 1.0
ppo_n_minibatches: 1
eps_clip: 0.2
c_clip: null
value_eps_clip: 0.2
early_stop_imp_ratio: 5.0
actor_sample_reuse: 1
critic_sample_reuse: 1
max_reward_clip: 20.0
reward_output_scaling: 5.0
reward_output_bias: 0.0
fuse_rew_ref: true
discount: 1.0
gae_lambda: 1.0
adv_norm: true
kl_ctl: 0.0
use_adaptive_kl_ctl: false
disable_value: true
recompute_logprob: true
use_decoupled_loss: true
behav_imp_weight_cap: 5.0
cpus_per_master_worker: 4
mem_per_master_worker: 20000
cpus_per_model_worker: 4
mem_per_model_worker: 90000