Wei Fu 2025-06-04 11:27:09 +08:00 committed by GitHub
parent 67dc056dd9
commit 15024d8d32
5 changed files with 98 additions and 442 deletions

152
README.md
View File

@ -8,20 +8,20 @@
<img align="right" alt="ReaL" src="/assets/logo.png" width="20%">
AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by opening training details, data and infra required to reproduce the results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy a real-world milk tea (cheers).
AReaL (Ant Reasoning RL) is an open-source **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We are fully committed to open source, providing the training details, data, and infrastructure required to reproduce our results along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy real-world milk tea (cheers).
**AReaL Highlights**
+ 🔥 **[NEW] Asynchronous RL:** With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
+ 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training LLMs.
+ 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from 1 single node to 1K GPUs.
+ 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
+ 🔥 <span style="color: red; font-weight: bold;">**[NEW] Asynchronous RL:**</span> With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
+ 🛠️ **Open & Reproducible**: We continuously release _all code, datasets, and training recipes_ for RL training of LLMs.
+ 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
+ 🔪 **Cutting-Edge Performance:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
## News
**[2025/06/03] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).
**[2025/03/31] (v0.2, Boba)** Here comes our next milestone release - Boba! Please call it A-ReaL-Boba! This release includes much accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).
**[2025/03/31] (v0.2, Boba)** Here comes our next milestone release - Boba! Please call it A-ReaL-Boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).
**[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
@ -29,19 +29,21 @@ AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement l
In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:
+ A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving [over 2.77x speedup](https://github.com/inclusionAI/AReaL/tree/main/benchmark) without any performance drop.
+ SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. [Reproducible results](https://inclusionai.github.io/AReaL/references/reproduce.html) with fully open-sourced datasets are also provided.
+ Experimental support for **multi-turn** agentic RL training.
+ A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving over 2.77x speedup without any performance drop. Check the [benchmark scripts and instructions here](https://github.com/inclusionAI/AReaL/tree/main/benchmark/verl_v0_3_0_post1_76084d3).
For the complete system design and training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](about:blank) for a more comprehensive presentation of our system design.
+ SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. To reproduce, check the [configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and [instructions](https://inclusionai.github.io/AReaL/references/reproduce.html).
+ Experimental support for **multi-turn** agentic RL training. Check our [complete example](https://inclusionai.github.io/AReaL/customization/agent.html).
For the complete system design and more training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](https://arxiv.org/pdf/2505.24298).
### Overview of Asynchronous RL Training
During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRM, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose to overlap a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.
During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.
![Synchronous vs One-step Overlap RL](/assets/sync_one_step_gen.png)
*Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
*Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
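To make the idle-time argument above concrete, here is a back-of-the-envelope sketch (the output lengths are made up for illustration): if per-sequence decode time grows roughly linearly with output length, every slot in a synchronous batch waits for the longest sequence, so the idle fraction is roughly `1 - mean(lengths) / max(lengths)`.
```python
# Illustrative only: rough estimate of GPU idle time in synchronous batched generation,
# assuming per-sequence decode time is proportional to its output length.
lengths = [512, 1024, 2048, 4096, 16384]   # hypothetical output lengths within one batch

busy = sum(lengths)                    # useful decode work across the batch
total = len(lengths) * max(lengths)    # every slot waits for the longest sequence
idle_fraction = 1 - busy / total
print(f"approximate idle fraction: {idle_fraction:.0%}")   # ~71% for these made-up lengths
```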
AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
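The following is a minimal, self-contained sketch of this producer-consumer pattern, not AReaL's actual worker code (`rollout_worker`, `trainer`, and the in-memory queue are illustrative stand-ins): rollout workers stream finished trajectories into a buffer without ever blocking on training, while the trainer consumes whatever batches are ready and updates the model in parallel.
```python
# Minimal sketch of decoupled generation and training (illustrative, not AReaL's API).
import queue
import threading
import time

trajectory_queue: "queue.Queue[dict]" = queue.Queue(maxsize=1024)

def rollout_worker(worker_id: int, stop: threading.Event) -> None:
    """Continuously generate trajectories and stream them out; never waits for training."""
    step = 0
    while not stop.is_set():
        time.sleep(0.01 * (worker_id + 1))   # stand-in for LLM generation latency
        trajectory_queue.put({"worker": worker_id, "step": step, "tokens": [1, 2, 3]})
        step += 1

def trainer(stop: threading.Event, batch_size: int = 8, max_updates: int = 5) -> None:
    """Consume whatever trajectories are ready and update the model in parallel."""
    for update in range(max_updates):
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        # ... compute the RL loss and apply an optimizer step here ...
        print(f"update {update}: trained on {len(batch)} trajectories")
    stop.set()

stop_event = threading.Event()
workers = [threading.Thread(target=rollout_worker, args=(i, stop_event)) for i in range(4)]
for w in workers:
    w.start()
trainer(stop_event)
for w in workers:
    w.join()
```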
@ -49,9 +51,9 @@ AReaL adopts a fully asynchronous RL training framework that completely decouple
*Fig 2. Execution timeline of our fully asynchronous RL system.*
AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs up model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
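As a rough illustration of the system-side staleness control mentioned above, a controller can tag each trajectory with the policy version that generated it and admit it only if the version lag stays within a bound (the released configs expose such a bound as `max_head_offpolicyness`). The sketch below is a simplified stand-in for that logic, not the actual rollout controller.
```python
# Simplified staleness filter (illustrative): accept a trajectory only if it was generated
# by a policy version at most `max_offpolicyness` updates behind the current trainer version.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    behavior_version: int          # policy version used by the rollout worker
    tokens: List[int] = field(default_factory=list)

class StalenessController:
    def __init__(self, max_offpolicyness: int):
        self.max_offpolicyness = max_offpolicyness
        self.current_version = 0

    def on_model_update(self) -> None:
        self.current_version += 1

    def admit(self, traj: Trajectory) -> bool:
        return self.current_version - traj.behavior_version <= self.max_offpolicyness

ctrl = StalenessController(max_offpolicyness=4)
for _ in range(6):                                  # six trainer updates have happened
    ctrl.on_model_update()
print(ctrl.admit(Trajectory(behavior_version=3)))   # True: lag 3 <= 4
print(ctrl.admit(Trajectory(behavior_version=1)))   # False: lag 5 > 4
```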
We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-sourced system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates the much improved scaling capabilities w.r.t. training throughput. This is also partially due to that AReaL decouples training and generation, leading to much fewer GPU memory fragments. (Check the [benchmark directory](/benchmark) for detailed benchmark guide.)
We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling with respect to training throughput. This is also partly because AReaL decouples training and generation, which leads to much less GPU memory fragmentation.
![Scaling Comparison](/assets/async_scaling_vs_verl.png)
@ -59,48 +61,45 @@ We compare the scalability of **asynchronous RL** training based on our AReaL-bo
### SOTA Code Generation Model by AReaL-boba²
We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforce, and CodeContests benchmarks.
We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.
| **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| [🤗 AReaL-boba²-8B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset) | 62.0 | 1933/97.2% | **41.4** |
| [🤗 AReaL-boba²-8B](https://huggingface.co/inclusionAI/AReaL-boba-2-8B) | **63.0** | **1962/97.5%** | 40.8 |
| **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| [🤗 AReaL-boba²-14B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset) | 67.3 | 1990/97.8% | **46.2** |
| [🤗 AReal-boba²-14B](https://huggingface.co/inclusionAI/AReaL-boba-2-14B) | **69.1** | **2044/98.2%** | 46.1 |
| **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **Codecontest** |
| **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced dataAReaL-boba²-8B/14B models are trained with an additional small amount of internal data could achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests*
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.*
We highlight the tutorials about the following key features for synchronous training in AReaL-boba²:
We highlight the [tutorials](https://inclusionai.github.io/AReaL/customization/dataset.html) and [code walkthroughs](https://inclusionai.github.io/AReaL/developer/overview.html) covering the following key features for asynchronous training:
+ [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
+ [Interruptible rollout](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ [Data staleness control with the rollout controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ [The adoption of decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html) (a simplified sketch follows below)
We provide a step-by-step [code walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html) and [customization guide](https://inclusionai.github.io/AReaL/customization/dataset.html) for these features and recommend users to walk through the corresponding documentation.
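For readers who want a concrete picture of the decoupled PPO loss bullet above, here is a minimal sketch of the underlying idea, under the formulation described in the paper: the rollout (behavior) policy that produced the data is separated from a recomputed proximal policy that anchors the PPO clipping, and the behavior-to-proximal importance weight is capped. The function below is illustrative rather than the exact loss in the codebase; the parameter names `eps_clip` and `behav_imp_weight_cap` mirror fields in the released configs.
```python
# Minimal sketch of a decoupled PPO policy loss (illustrative; token-level, no masking).
import torch

def decoupled_ppo_loss(
    logp_current: torch.Tensor,    # log pi_theta(a|s), requires grad
    logp_proximal: torch.Tensor,   # log pi_prox(a|s), recomputed before the update
    logp_behavior: torch.Tensor,   # log pi_behav(a|s), from the rollout worker
    advantages: torch.Tensor,
    eps_clip: float = 0.2,
    behav_imp_weight_cap: float = 5.0,
) -> torch.Tensor:
    # The PPO ratio is taken against the proximal policy rather than the stale behavior policy.
    ratio = torch.exp(logp_current - logp_proximal)
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # Off-policy correction: capped importance weight between proximal and behavior policies.
    behav_weight = torch.exp(logp_proximal - logp_behavior).clamp(max=behav_imp_weight_cap)
    return -(behav_weight.detach() * surrogate).mean()

loss = decoupled_ppo_loss(
    logp_current=torch.randn(4, requires_grad=True),
    logp_proximal=torch.randn(4),
    logp_behavior=torch.randn(4),
    advantages=torch.randn(4),
)
loss.backward()
```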
### RL Training for Multi-turn Agent
AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without the need to modify the heavy system-level code.
AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without needing to modify the heavy system-level code.
In particular, we show a simple example of developing a multi-turn math agent for RL training. Please see the learning curve below and refer to the [step-by-step guide](https://inclusionai.github.io/AReaL/customization/agent.html) if you want to implement your own agentic RL project.
![Multi-turn Agent Learning Curve](/assets/multiturn_reward.png)
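To give a flavor of what such a multi-turn rollout can look like, here is a minimal, framework-agnostic sketch (it is not AReaL's agent interface; `generate` and `grade_answer` are hypothetical stand-ins): the agent keeps appending feedback to the conversation and re-querying the model until the answer is correct or the turn budget runs out, and the resulting trajectory with its sparse reward is what RL trains on.
```python
# Minimal multi-turn math-agent rollout loop (illustrative; `generate` and `grade_answer`
# are hypothetical stand-ins, not AReaL functions).
from typing import Callable, Dict, List, Tuple

def multi_turn_rollout(
    prompt: str,
    reference_answer: str,
    generate: Callable[[List[Dict[str, str]]], str],   # LLM call: messages -> completion
    grade_answer: Callable[[str, str], bool],          # reward fn: (completion, reference) -> correct?
    max_turns: int = 4,
) -> Tuple[List[Dict[str, str]], float]:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        completion = generate(messages)
        messages.append({"role": "assistant", "content": completion})
        if grade_answer(completion, reference_answer):
            return messages, 1.0        # sparse reward on success
        # Otherwise, give feedback and let the model try again in the next turn.
        messages.append({"role": "user", "content": "Your answer is incorrect. Please try again."})
    return messages, 0.0
```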
## Getting Started
### Quick Start
@ -108,30 +107,12 @@ In particular, we show a simple example to develop a multi-turn math agent for R
Train Qwen3 1.7B locally:
```bash
bash examples/env/scripts/setup-pip-deps.sh
python3 training/main_async_ppo.py \
n_nodes=1 n_gpus_per_node=8 \
allocation_mode=sglang.d4p1m1+d2p2m1 \
cluster.fileroot=/storage/testing/experiments \
actor.type._class=qwen3 \
actor.path=Qwen/Qwen3-1.7B \
ref.type._class=qwen3 \
ref.path=Qwen/Qwen3-1.7B \
dataset.path=/path/to/dataset/boba_106k_0319.jsonl \
dataset.train_bs_n_seqs=32 \
group_size=8 \
ppo.gen.max_new_tokens=4096 \
ppo.ppo_n_minibatches=4 \
actor_train.mb_spec.max_tokens_per_mb=32768 \
actor_inf.mb_spec.max_tokens_per_mb=32768 \
max_concurrent_rollouts=16 \
max_head_offpolicyness=4
bash examples/run_async_ppo.sh
```
Evaluation:
```bash
bash examples/env/scripts/setup-eval-pip-deps.sh
cd evaluation
# Evaluate the model
python eval_and_aggregate.py \
@ -144,49 +125,72 @@ python eval_and_aggregate.py \
--temperature 1.0
```
### Resources
## Resources
+ [Documentation](https://inclusionai.github.io/AReaL/)
+ [Contributing](https://inclusionai.github.io/AReaL/contrib.html)
### Quickstart
+ [Installation](https://inclusionai.github.io/AReaL/tutorial/installation.html)
+ [Quickstart](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
+ [Code Walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html)
+ **Customization Guide**
- [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
- [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
- [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
+ [Contributing](https://inclusionai.github.io/AReaL/contrib.html)
+ [Example: Improving the math capability of Qwen3 with PPO](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
### Benchmark and Reproduction
+ **Reproduce boba² Code Models**
- 🤗 **Model weights**: [8B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-8B), [14B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-14B), [8B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset), [14B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset)
- [Evaluation Guide](https://inclusionai.github.io/AReaL/tutorial/eval.html)
- [Training configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and [instructions](https://inclusionai.github.io/AReaL/references/reproduce.html)
+ [Scripts for Benchmark Training Throughput](https://github.com/inclusionAI/AReaL/tree/main/benchmark/verl_v0_3_0_post1_76084d3)
### Customization Guide
- [Use your own dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
- [Modifying the reward function and rollout behavior (multi-turn agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
- [Modifying PPO to GRPO](https://inclusionai.github.io/AReaL/customization/algorithm.html#grouped-advantage-normalization)
- [Developing the decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html#the-decoupled-ppo-loss)
### System Code Walkthrough
+ [Trainer](https://inclusionai.github.io/AReaL/developer/trainer/model_worker.html)
+ [Model Backend and Algorithm Interface](https://inclusionai.github.io/AReaL/developer/trainer/algo_interface.html)
+ [Rollout Controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
## Future Plan
AReaL is under active development. We plan to have minor releases in a weekly manner and major releases in a monthly manner. Community engagements and contributions are extremely welcomed. We are also **hiring interns and full-timers** with open positions in both the US and China.
AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also **hiring interns and full-time employees** with open positions in both the US and China.
For the research and development plan already in place, please see the following list:
### System Development
- [x] Support for SGLang.
- [x] RL training with coding problems.
- [x] Asynchronous generation and RL training.
- [ ] Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining.
- [ ] RL for vision-language models (VLM).
- [x] Multi-turn agentic RL.
- [ ] Function calling and tool use.
- [x] Support for SGLang
- [x] RL training with coding problems
- [x] Asynchronous generation and RL training
- [ ] Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
- [ ] RL for vision-language models (VLM)
- [x] Multi-turn agentic RL
- [ ] Function calling and tool use
### Algorithm Development
- [x] RL training receipes for 1.5B and 7B models.
- [x] A complete RL training receipe for 32B models.
- [ ] Sample-efficient multi-task RL algorithms.
- [ ] Agentic capabilities with end-to-end RL.
- [ ] Stable RL training for larger MOE models.
- [x] RL training recipes for 1.5B and 7B models
- [x] A complete RL training recipe for 32B models
- [ ] Sample-efficient multi-task RL algorithms
- [ ] Agentic capabilities with end-to-end RL
- [ ] Stable RL training for larger MOE models
## Acknowledgement
We would like to remark that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.
We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and those other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1) and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and other projects, including but not limited to [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1) and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
## Citation
```bibtex
@inproceedings{mei2025real,
author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
@ -199,13 +203,13 @@ We also appreciate all the pioneering works from the community, particularly the
```
```bibtex
@misc{areal2025,
author = {RL Lab, Ant Research},
title = {AReaL: Ant Reasoning RL},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/inclusionAI/AReaL}},
@misc{fu2025areal,
title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
year={2025},
eprint={2505.24298},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.24298},
}
```

View File

@ -10,6 +10,10 @@ parts:
- file: tutorial/quickstart
- file: tutorial/eval
- file: tutorial/troubleshooting
- caption: References
chapters:
- file: references/benchmark
- file: references/reproduce
- caption: Customization
chapters:
- file: customization/dataset
@ -28,10 +32,6 @@ parts:
sections:
- file: developer/rollout/gserver
- file: developer/rollout/rollout_worker
- caption: References
chapters:
- file: references/benchmark
- file: references/reproduce
- caption: Contributing
chapters:
- file: contrib

View File

@ -1,16 +1,26 @@
# Reproduction Guide
The experiment configurations used to train our released models can be found in `examples/configs`. The user can overwrite `training/config/async_ppo.yaml` with our provided config (e.g., by copy-pasting) and then run `python3 training/main_async_ppo.py` as illustrated in the [quickstart section](../tutorial/quickstart.md).
The experiment configurations used to train our released models can be found under [`examples/configs`](https://github.com/inclusionAI/AReaL/tree/main/examples/configs).
**Available Configurations:**
Users should overwrite [training/config/async_ppo.yaml](https://github.com/inclusionAI/AReaL/blob/main/training/configs/async-ppo.yaml) with the provided config (e.g., by copy-pasting) and then run:
+ `examples/configs/v0.2-qwen2-math`: Configuration for reproducing the boba math models based on R1-Distilled-Qwen models
```bash
python3 training/main_async_ppo.py
```
More information can be found in the [quickstart section](../tutorial/quickstart.md).
**Math:**
+ [`examples/configs/v0.2-qwen2-math`](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.2-qwen2-math): Configuration for reproducing the boba math models based on R1-Distilled-Qwen models
+ Dataset: [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl)
+ `examples/configs/v0.3-qwen3-code`: Configuration for reproducing the boba² coding model based on Qwen3
+ Dataset: [DeepCoder Dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
**Code:**
## Adjusting Computation Resources
+ [`examples/configs/v0.3-qwen3-code`](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code): Configuration for reproducing the boba² coding model based on Qwen3
+ Dataset: [AReaL-boba-2-code](https://huggingface.co/datasets/inclusionAI/AReaL-boba-2-RL-Code). We deduplicate and quality-filter the [DeepCoder dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset), leaving approximately 7,600 coding problems. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
## Adjusting Computational Resources
We recommend using the provided number of GPUs and the corresponding parallelization strategy for optimal reproducibility. When resources are limited, try decreasing `n_nodes` and reducing the data parallelism degree in `allocation_mode`.
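As a quick sanity check when scaling down, the GPU budget implied by an `allocation_mode` string can be computed by hand. Assuming the `d`/`p`/`m` digits denote data-, pipeline-, and tensor (model)-parallel degrees, the generation part and the training part each multiply out to a GPU count, and their sum should match `n_nodes * n_gpus_per_node` (e.g., `sglang.d80m4p1+d4p8m2` gives 320 + 64 = 384 GPUs, matching 48 nodes of 8 GPUs). The helper below is illustrative, not a utility shipped with AReaL.
```python
# Illustrative helper (not part of AReaL): count the GPUs implied by an allocation_mode string,
# assuming d/p/m are the data-, pipeline-, and tensor (model)-parallel degrees.
import re

def gpus_in_allocation_mode(allocation_mode: str) -> int:
    total = 0
    for part in allocation_mode.split("+"):                # generation part + training part
        degrees = re.findall(r"[dpm](\d+)", part)          # e.g. "sglang.d80m4p1" -> ["80", "4", "1"]
        product = 1
        for value in degrees:
            product *= int(value)
        total += product
    return total

print(gpus_in_allocation_mode("sglang.d4p1m1+d2p2m1"))     # 8   (quickstart, a single node)
print(gpus_in_allocation_mode("sglang.d80m4p1+d4p8m2"))    # 384 (the released 14B config)
```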

View File

@ -1,179 +0,0 @@
experiment_name: qwen3-14b-code
trial_name: my-trial
seed: 1
mode: ray
metric_discovery_port: 0
wandb:
mode: disbled
entity: null
project: null
name: null
job_type: null
group: null
notes: null
tags: null
config: null
tensorboard:
path: null
recover_mode: auto
recover_retries: 10
recover_after: 10
exp_ctrl:
total_train_epochs: 10
save_freq_epochs: null
save_freq_steps: 20
save_freq_secs: null
ckpt_freq_epochs: null
ckpt_freq_steps: null
ckpt_freq_secs: 600
eval_freq_epochs: null
eval_freq_steps: null
eval_freq_secs: null
benchmark_steps: null
benchmark_n_seqs: null
torch_cache_mysophobia: true
cache_clear_freq: 1
max_head_offpolicyness: 16
n_rollout_workers: null
max_concurrent_rollouts: 512
flush_request_timeout: 600
cpus_per_generation_server: 4
mem_per_generation_server: 61440
cpus_per_gserver_manager: 4
mem_per_gserver_manager: 10240
cpus_per_rollout_worker: 4
mem_per_rollout_worker: 20480
allocation_mode: sglang.d80m4p1+d4p8m2
n_nodes: 48
n_gpus_per_node: 8
ray_temp_path: /tmp/ray
cluster:
fileroot: /home/admin/.cache/realhf
n_nodes: 48
n_gpus_per_node: 8
actor:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-14B/
init_from_scratch: false
gradient_checkpointing: true
bf16: false
optimizer:
type: adam
lr: 2.0e-05
weight_decay: 0.05
beta1: 0.9
beta2: 0.95
eps: 1.0e-05
min_lr_ratio: 0.0
lr_scheduler_type: constant
warmup_steps_proportion: 0.001
initial_loss_scale: 4294967296.0
min_loss_scale: 1.0
loss_scale_window: 5.0
hysteresis: 2
gradient_clipping: 1.0
megatron:
ddp:
grad_reduce_in_fp32: true
overlap_grad_reduce: true
use_distributed_optimizer: true
sglang:
disable_cuda_graph: false
disable_radix_cache: false
disable_cuda_graph_padding: false
enable_nccl_nvls: false
disable_outlines_disk_cache: false
disable_custom_all_reduce: false
disable_overlap_schedule: false
enable_mixed_chunk: false
enable_torch_compile: false
torch_compile_max_bs: 32
cuda_graph_max_bs: null
cuda_graph_bs: null
torchao_config: ''
enable_nan_detection: false
enable_p2p_check: false
triton_attention_reduce_in_fp32: false
triton_attention_num_kv_splits: 16
num_continuous_decode_steps: 1
enable_memory_saver: false
allow_auto_truncate: false
attention_backend: flashinfer
sampling_backend: null
context_length: 30720
mem_fraction_static: 0.7
max_running_requests: null
chunked_prefill_size: -1
max_prefill_tokens: 32768
schedule_policy: lpm
schedule_conservativeness: 1.0
cpu_offload_gb: 0
dtype: float16
kv_cache_dtype: auto
log_level: warning
log_level_http: warning
log_requests: false
log_requests_level: 0
show_time_cost: false
enable_metrics: true
decode_log_interval: 1
ref:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-14B/
init_from_scratch: false
bf16: false
actor_train:
mb_spec:
max_tokens_per_mb: 30720
ref_inf:
mb_spec:
max_tokens_per_mb: 30720
actor_inf:
mb_spec:
max_tokens_per_mb: 30720
shuffle_dataset: true
dataset:
path: /path/to/dataset.jsonl
max_prompt_len: 2048
train_bs_n_seqs: 128
group_size: 16
mask_too_long: false
group_adv_norm: false
rw_type: sparse
success_rate_ub: 0.95
success_rate_lb: 0.05
ppo:
gen:
n: 1
max_new_tokens: 27648
min_new_tokens: 0
greedy: false
top_p: 1.0
top_k: 100000000
temperature: 1.0
ppo_n_minibatches: 1
eps_clip: 0.2
c_clip: null
value_eps_clip: 0.2
early_stop_imp_ratio: 5.0
actor_sample_reuse: 1
critic_sample_reuse: 1
max_reward_clip: 20.0
reward_output_scaling: 5.0
reward_output_bias: 0.0
fuse_rew_ref: true
discount: 1.0
gae_lambda: 1.0
adv_norm: true
kl_ctl: 0.0
use_adaptive_kl_ctl: false
disable_value: true
recompute_logprob: true
use_decoupled_loss: true
behav_imp_weight_cap: 5.0
cpus_per_master_worker: 4
mem_per_master_worker: 20000
cpus_per_model_worker: 4
mem_per_model_worker: 90000

View File

@ -1,179 +0,0 @@
experiment_name: qwen3-8b-code
trial_name: my-trial
seed: 1
mode: ray
metric_discovery_port: 0
wandb:
mode: disbled
entity: null
project: null
name: null
job_type: null
group: null
notes: null
tags: null
config: null
tensorboard:
path: null
recover_mode: auto
recover_retries: 10
recover_after: 10
exp_ctrl:
total_train_epochs: 10
save_freq_epochs: null
save_freq_steps: 20
save_freq_secs: null
ckpt_freq_epochs: null
ckpt_freq_steps: null
ckpt_freq_secs: 600
eval_freq_epochs: null
eval_freq_steps: null
eval_freq_secs: null
benchmark_steps: null
benchmark_n_seqs: null
torch_cache_mysophobia: true
cache_clear_freq: 1
max_head_offpolicyness: 16
n_rollout_workers: null
max_concurrent_rollouts: 512
flush_request_timeout: 300
cpus_per_generation_server: 4
mem_per_generation_server: 61440
cpus_per_gserver_manager: 4
mem_per_gserver_manager: 10240
cpus_per_rollout_worker: 4
mem_per_rollout_worker: 20480
allocation_mode: sglang.d80m2p1+d4m2p4
n_nodes: 24
n_gpus_per_node: 8
ray_temp_path: /tmp/ray
cluster:
fileroot: /home/admin/.cache/realhf
n_nodes: 32
n_gpus_per_node: 8
actor:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-8B/
init_from_scratch: false
gradient_checkpointing: true
bf16: false
optimizer:
type: adam
lr: 2.0e-05
weight_decay: 0.05
beta1: 0.9
beta2: 0.95
eps: 1.0e-05
min_lr_ratio: 0.0
lr_scheduler_type: constant
warmup_steps_proportion: 0.001
initial_loss_scale: 4294967296.0
min_loss_scale: 1.0
loss_scale_window: 5.0
hysteresis: 2
gradient_clipping: 1.0
megatron:
ddp:
grad_reduce_in_fp32: true
overlap_grad_reduce: true
use_distributed_optimizer: true
sglang:
disable_cuda_graph: false
disable_radix_cache: false
disable_cuda_graph_padding: false
enable_nccl_nvls: false
disable_outlines_disk_cache: false
disable_custom_all_reduce: false
disable_overlap_schedule: false
enable_mixed_chunk: false
enable_torch_compile: false
torch_compile_max_bs: 32
cuda_graph_max_bs: null
cuda_graph_bs: null
torchao_config: ''
enable_nan_detection: false
enable_p2p_check: false
triton_attention_reduce_in_fp32: false
triton_attention_num_kv_splits: 16
num_continuous_decode_steps: 1
enable_memory_saver: false
allow_auto_truncate: false
attention_backend: flashinfer
sampling_backend: null
context_length: 30720
mem_fraction_static: 0.7
max_running_requests: null
chunked_prefill_size: -1
max_prefill_tokens: 32768
schedule_policy: lpm
schedule_conservativeness: 1.0
cpu_offload_gb: 0
dtype: float16
kv_cache_dtype: auto
log_level: warning
log_level_http: warning
log_requests: false
log_requests_level: 0
show_time_cost: false
enable_metrics: true
decode_log_interval: 1
ref:
type:
_class: qwen3
path: /storage/testing/models/Qwen__Qwen3-8B/
init_from_scratch: false
bf16: false
actor_train:
mb_spec:
max_tokens_per_mb: 30720
ref_inf:
mb_spec:
max_tokens_per_mb: 30720
actor_inf:
mb_spec:
max_tokens_per_mb: 30720
shuffle_dataset: true
dataset:
path: /path/to/dataset.jsonl
max_prompt_len: 2048
train_bs_n_seqs: 128
group_size: 16
mask_too_long: false
group_adv_norm: false
rw_type: sparse
success_rate_ub: 0.95
success_rate_lb: 0.05
ppo:
gen:
n: 1
max_new_tokens: 27648
min_new_tokens: 0
greedy: false
top_p: 1.0
top_k: 100000000
temperature: 1.0
ppo_n_minibatches: 1
eps_clip: 0.2
c_clip: null
value_eps_clip: 0.2
early_stop_imp_ratio: 5.0
actor_sample_reuse: 1
critic_sample_reuse: 1
max_reward_clip: 20.0
reward_output_scaling: 5.0
reward_output_bias: 0.0
fuse_rew_ref: true
discount: 1.0
gae_lambda: 1.0
adv_norm: true
kl_ctl: 0.0
use_adaptive_kl_ctl: false
disable_value: true
recompute_logprob: true
use_decoupled_loss: true
behav_imp_weight_cap: 5.0
cpus_per_master_worker: 4
mem_per_master_worker: 20000
cpus_per_model_worker: 4
mem_per_model_worker: 90000