[Doc] Update README. (#74)
* update benchmark script * . * add benchmark docs * add v0.3.0 configs * . * PullRequest: 178 multi turn math agent training Merge branch gjx/multi-turn-math of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/178?tab=diff Reviewed-by: 博惟 <bowei.fw@antgroup.com> * multi turn math agent training * training data logging and clean math multi-turn exp * fix * . * fix * change readme * fix typo * revert multiturn --------- Co-authored-by: 步偶 <sam.gjx@antgroup.com>
133
README.md
|
@ -8,68 +8,101 @@
|
|||
|
||||
<img align="right" alt="ReaL" src="/assets/logo.png" width="20%">
|
||||
|
||||
AReaL (Ant Reasoning RL) is a fully open-sourced, scalable, and efficient reinforcement learning training system for large language models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by releasing training details, data, and infra required to reproduce all the models with the desired performances. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy A-ReaL-milk-tea.🧋
|
||||
AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by releasing the training details, data, and infrastructure required to reproduce the results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy a real-world milk tea (cheers).
|
||||
|
||||
**AReaL Highlights**
|
||||
|
||||
+ 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training of LLMs.
|
||||
+ 🔥 **[NEW] Asynchronous RL:** With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
|
||||
+ 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training of LLMs.
|
||||
+ 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
|
||||
+ 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities. We are also actively working on other domains, such as coding and agentic tasks.
|
||||
+ 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
|
||||
|
||||
## News
|
||||
|
||||
**[2025/03/31]** **(v0.2, Boba)** Our milestone release, Boba! Please call it A-ReaL-Boba! This release features significantly accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning.
|
||||
**[2025/06/03] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).
|
||||
|
||||
**[2025/02/24]** **(v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
|
||||
**[2025/03/31] (v0.2, Boba)** Here comes our next milestone release, Boba! Please call it A-ReaL-Boba! This release features significantly accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).
|
||||
|
||||
## AReaL-boba Milestones and Highlights
|
||||
In our boba release, we highlight the 3 most important milestones:
|
||||
**[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
|
||||
|
||||
+ Full SGLang support and a collection of efficiency improvements
|
||||
+ A SOTA 7B math reasoning model [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) and the corresponding training data [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
|
||||
+ A particularly competitive 32B model [AReaL-boba-SFT-32B](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) that can be trained with extremely low cost. (Training Data: [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl))
|
||||
## Release Highlights
|
||||
|
||||
For the complete training and model details, please check [our v0.2 technical blog](/blog/AReaL_v0_2.md).
|
||||
In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the three most important features:
|
||||
|
||||
#### SGLang support with 1.5x speedup on 7B Training
|
||||
+ A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving [over 2.77x speedup](https://github.com/inclusionAI/AReaL/tree/main/benchmark) without any performance drop.
|
||||
+ SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. [Reproducible results](https://inclusionai.github.io/AReaL/references/reproduce.html) with fully open-sourced datasets are also provided.
|
||||
+ Experimental support for **multi-turn** agentic RL training.
|
||||
|
||||

|
||||
For the complete system design and training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](https://arxiv.org/pdf/2505.24298) for a more comprehensive presentation.
|
||||
|
||||
Thanks to a series of system-level optimizations, AReaL v0.2 improves its end-to-end training performance by up to 73%.
|
||||
### Overview of Asynchronous RL Training
|
||||
|
||||
In the following table, we show the convergence time under different resource settings:
|
||||
During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths of LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose to overlap a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.
|
||||
|
||||
| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** |**32B (SFT)** |
| --- |:--------:|:--------:|:--------:|:------:|:------:|:-------:|
| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
| Step | 250 | 250 | 250 | 400 | 400 | 300 |
| Time (h) | ~240 | ~69 | ~27 | ~252 | ~90 | ~3.5 |
|
||||

|
||||
|
||||
*Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
|
||||
|
||||
#### SOTA 7B model using RL in math reasoning
|
||||
| **Model** | **AIME 2024** | **AIME 2025** | **GPQA-Diamond** |
| :---: | :---: | :---: | :---: |
| O1-Preview | 56.7 | - | |
| R1-Distill-Qwen-7B | 55.0 | 39.7 | 47.1 |
| Light-R1-7B-DS | 56.7 | 44.9 | 40.9 |
| [AReaL-boba-RL-7B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | **61.9** | **48.3** | **47.6** |
|
||||
AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
|
||||
|
||||

|
||||
|
||||
*Fig 2. Execution timeline of our fully asynchronous RL system.*
|
||||
|
||||
AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs up model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
|
||||
|
||||
We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-sourced system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling of training throughput. This is partially because AReaL decouples training and generation, which leads to much less GPU memory fragmentation. (Check the [benchmark directory](/benchmark) for a detailed benchmarking guide.)
|
||||
|
||||

|
||||
|
||||
*Fig.3 The scaling trend of asynchronous RL (based on AReaL-boba²) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.*
|
||||
|
||||
### SOTA Code Generation Model by AReaL-boba²
|
||||
|
||||
We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforce, and CodeContests benchmarks.
|
||||
|
||||
| **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | **41.4** |
| AReaL-boba²-8B | **63.0** | **1962/97.5%** | 40.8 |
|
||||
|
||||
| **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | **46.2** |
| AReaL-boba²-14B | **69.1** | **2044/98.2%** | 46.1 |
|
||||
|
||||
We use **R1-Distill-Qwen-7B** as our base model. After RL training, the pass@1 scores on AIME 2024 and AIME 2025 improve by 6.9 and 8.6 points, respectively, achieving SOTA performance among 7B models in mathematical reasoning. We have released the training data at [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
|
||||
| **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
|
||||
|
||||
Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. We plan to open-source more datasets in the future, including code, STEM, and other domains. All the reported numbers are re-evaluated using our evaluation code; more details are available in our [blog](blog/AReaL_v0_2.md#eval_detail).
|
||||
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests.*
|
||||
|
||||
#### Approaching QwQ-32B performance using only 200 data samples
|
||||
| **Model** | **AIME 2024** |
| :---: | :---: |
| R1-Distill-Qwen-32B | 72.6 |
| QwQ-32B | 78.9 |
| [AReaL-boba-SFT-32B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) | 78.8 |
|
||||
We highlight tutorials on the following key features for asynchronous training in AReaL-boba²:
|
||||
|
||||
+ [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
|
||||
+ [Interruptible rollout](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
|
||||
+ [Data staleness control with the rollout controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
|
||||
+ [The adoption of decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html)
|
||||
|
||||
We provide a step-by-step [code walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html) and [customization guide](https://inclusionai.github.io/AReaL/customization/dataset.html) for these features and recommend that users walk through the corresponding documentation.
|
||||
|
||||
### RL Training for Multi-turn Agent
|
||||
|
||||
AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without the need to modify the heavy system-level code.
|
||||
|
||||
In particular, we show a simple example to develop a multi-turn math agent for RL training. Please see the learning curve below and refer to the [step-by-step guide](https://inclusionai.github.io/AReaL/customization/agent.html) if you want to implement your own agentic RL project.
|
||||
|
||||

|
||||
|
||||
Building upon **R1-Distill-Qwen-32B**, we replicate **QwQ-32B's** inference performance on AIME 2024 using just **200 data points** via Supervised Fine-Tuning (SFT). We have released the training data at [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl).
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Quick Start
|
||||
|
||||
Train Qwen3 1.7B locally:
|
||||
|
@ -95,7 +128,7 @@ python3 training/main_async_ppo.py \
|
|||
max_head_offpolicyness=4
|
||||
```
|
||||
|
||||
Evaluation
|
||||
Evaluation:
|
||||
|
||||
```bash
|
||||
bash examples/env/scripts/setup-eval-pip-deps.sh
|
||||
|
@ -117,23 +150,29 @@ python eval_and_aggregate.py \
|
|||
+ [Quickstart](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
|
||||
+ [Code Walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html)
|
||||
+ **Customization Guide**
|
||||
+ [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
|
||||
+ [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
|
||||
+ [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
|
||||
- [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
|
||||
- [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
|
||||
- [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
|
||||
+ [Contributing](https://inclusionai.github.io/AReaL/contrib.html)
|
||||
|
||||
## Future Plan
|
||||
AReaL is under active development. We will have major releases on a weekly basis. We also highly appreciate efforts from the community. Here we highlight our future research and development plan.
|
||||
|
||||
AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are warmly welcomed. We are also **hiring interns and full-timers** with open positions in both the US and China.
|
||||
|
||||
For the research and development plan already in place, please see the following list:
|
||||
|
||||
### System Development
|
||||
|
||||
- [x] Support for SGLang.
|
||||
- [x] RL training with coding problems.
|
||||
- [x] Asynchronous generation and RL training.
|
||||
- [ ] Optimizations for distributed training: expert parallel and zero-bubble pipelining.
|
||||
- [ ] Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining.
|
||||
- [ ] RL for vision-language models (VLM).
|
||||
- [ ] Function calling and agent capabilities.
|
||||
- [x] Multi-turn agentic RL.
|
||||
- [ ] Function calling and tool use.
|
||||
|
||||
### Algorithm Development
|
||||
|
||||
- [x] RL training recipes for 1.5B and 7B models.
|
||||
- [x] A complete RL training recipe for 32B models.
|
||||
- [ ] Sample-efficient multi-task RL algorithms.
|
||||
|
@ -141,14 +180,14 @@ AReaL is under active development. We will have major releases in a weekly manne
|
|||
- [ ] Stable RL training for larger MOE models.
|
||||
|
||||
## Acknowledgement
|
||||
We would like to remark that major contributors are from **RL Lab at Ant Research** and **Institute for Interdisciplinary Information Sciences, Tsinghua University**.
|
||||
We would like to remark that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
|
||||
|
||||
Our team has also received invaluable assistance from the Super Computing Technology (SCT) team at Ant Group, particularly in large-scale cluster operations and maintenance.
|
||||
Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.
|
||||
|
||||
We also appreciate all the pioneer works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and many other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
|
||||
We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and many other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
|
||||
|
||||
## Citation
|
||||
```plain
|
||||
```bibtex
|
||||
@inproceedings{mei2025real,
|
||||
author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
|
||||
title = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
|
||||
|
@ -159,7 +198,7 @@ We also appreciate all the pioneer works from the community, particularly the [R
|
|||
}
|
||||
```
|
||||
|
||||
```plain
|
||||
```bibtex
|
||||
@misc{areal2025,
|
||||
author = {RL Lab, Ant Research},
|
||||
title = {AReaL: Ant Reasoning RL},
|
||||
|
|
|
@ -0,0 +1,180 @@
|
|||
# AReaL v0.3
|
||||
|
||||
## Introduction
|
||||
|
||||
We now release AReaL v0.3, featuring three major milestones:
|
||||
|
||||
- **A fully asynchronous RL training pipeline with system and RL algorithm co-design**, achieving over 2.77x speedup without any performance drop
|
||||
- **SOTA coding models**, i.e., a 14B model with a **69.1 score on LCB-v5**. Reproducible results with fully open-sourced datasets are also provided
|
||||
- Experimental support for **multi-turn agentic RL training**
|
||||
|
||||
### Performance Results
|
||||
|
||||
| **Model (8B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | **41.4** |
| AReaL-boba²-8B | **63.0** | **1962/97.5%** | 40.8 |
|
||||
|
||||
| **Model (14B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | **46.2** |
| AReaL-boba²-14B | **69.1** | **2044/98.2%** | 46.1 |
|
||||
|
||||
| **Larger Models** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforce** | **CodeContests** |
|:---:|:---:|:---:|:---:|
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
|
||||
|
||||
*Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests.*
|
||||
|
||||
## Motivation for Asynchronous RL System
|
||||
|
||||
### Inference devices are underutilized
|
||||
|
||||
During synchronous RL training, the generation step must wait until the longest output within a batch finishes. Due to the varying output lengths of LRMs, a synchronous RL system suffers from severe training inefficiency.
|
||||
|
||||
Some works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) use outputs generated from a previous model version to update the current model. For the best performance, the model version used for rollout generation is limited to only one or two steps older. However, all these systems still follow a batched generation setting, so the inefficiency of the generation phase remains unaddressed.
|
||||
|
||||

|
||||
|
||||
*Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*
|
||||
|
||||
### Scalability is poor in synchronous RL systems
|
||||
|
||||
Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size. This pushes the decoding process into a memory-IO-bound regime where additional devices fail to improve throughput.
|
||||
|
||||

|
||||
|
||||
*Fig.2 Left: Strong scaling of batched generation throughput for a 1.5B LRM. Right: Generation becomes memory-IO bound as GPU count increases.*
|
||||
|
||||
## Asynchronous AReaL Overview
|
||||
|
||||
We design a system that fully decouples generation and training across separate GPU clusters. This system should be hardware-efficient, scalable, and flexible enough to support customized RL workflows. We implement these principles in AReaL. Fig.3 presents the architecture and data flow of AReaL. The system comprises four core components:
|
||||
|
||||

|
||||
|
||||
*Fig.3 The architecture featuring asynchronous generation and training components.*
|
||||
|
||||
### Core Components
|
||||
|
||||
- **Interruptible Rollout Worker** handles two types of requests: (1) The *generate* request generates responses given prompts. (2) The *update_weights* request interrupts all ongoing generations and loads parameters of the new version. Upon interruption, the rollout workers discard the KV caches computed with the old weights and re-compute them using the new weights. Afterwards, the rollout workers continue to decode the unfinished sequences until the next interruption or termination. We emphasize that such interruptions and in-flight weight updates result in trajectories composed of segments produced by different model versions. This introduces a novel algorithmic challenge, which is addressed in the next section.
|
||||
|
||||
- **Reward Service** evaluates the accuracy of the responses generated by the model. For example, in the coding task, this service extracts the code and executes unit tests to verify its accuracy.
|
||||
|
||||
- **Trainer Workers** continuously sample from the replay buffer, accumulating data until reaching the configured training batch size. They then perform PPO updates and store the resulting parameters in distributed storage. To ensure data freshness, data from the replay buffer is used only once.
|
||||
|
||||
- **Rollout Controller** serves as a critical bridge between the rollout workers, the reward service, and the trainer workers. During training, it reads data from the dataset and invokes the rollout worker's *generate* request. The received response is then sent to the reward service to obtain the reward. The trajectory, along with the reward, is stored in the replay buffer, waiting to be consumed by the trainer workers. After the trainer workers update the parameters, the controller calls the rollout worker's *update_weights* request. We illustrate the generation and training management in Fig.4. This asynchronous pipeline ensures continuous, full utilization of both generation and training resources.
|
||||
|
||||

|
||||
|
||||
*Fig 4. Execution timeline of our fully asynchronous RL system.*
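To tie the four components together, below is a minimal, single-threaded Python sketch of the controller's data flow (generate → reward → replay buffer → train → weight update) with a simple staleness check. The class and method names are hypothetical simplifications, not AReaL's actual interfaces; the `max_staleness` field only loosely mirrors the role of `max_head_offpolicyness` in the released configs.

```python
import queue
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    behavior_version: int  # model version that (mostly) produced this response


class ToyRolloutController:
    """Illustrative controller mirroring the data flow of Fig.4 (not AReaL's real API)."""

    def __init__(self, max_staleness: int, train_batch_size: int):
        self.max_staleness = max_staleness          # eta in the text
        self.train_batch_size = train_batch_size
        self.replay_buffer: "queue.Queue[Trajectory]" = queue.Queue()
        self.current_version = 0

    def submit(self, prompt: str,
               generate: Callable[[str], str],
               reward_fn: Callable[[str, str], float]) -> None:
        """Rollout path: generate a response, score it, and push it to the replay buffer."""
        behavior_version = self.current_version
        response = generate(prompt)                 # interruptible in the real system
        traj = Trajectory(prompt, response, reward_fn(prompt, response), behavior_version)
        # Staleness-aware admission: keep the sample only if its behavior policy is
        # at most `max_staleness` versions behind the latest trainer policy.
        if self.current_version - traj.behavior_version <= self.max_staleness:
            self.replay_buffer.put(traj)

    def next_train_batch(self) -> List[Trajectory]:
        """Trainer path: block until a full batch accumulates; each sample is used once."""
        return [self.replay_buffer.get() for _ in range(self.train_batch_size)]

    def on_weights_updated(self) -> None:
        """Called after each PPO update; the real controller also issues update_weights
        to every rollout worker, interrupting their ongoing generations."""
        self.current_version += 1
```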
|
||||
|
||||
## Algorithmic Challenges and Solutions for Asynchronous RL
|
||||
|
||||
### Challenges
|
||||
|
||||
While the asynchronous system design offers significant acceleration through improved device utilization, it introduces several technical challenges that require algorithmic considerations.
|
||||
|
||||
- **Challenge 1: Data Staleness** - Due to the asynchronous nature of AReaL, each training batch contains data from multiple prior policy versions. Data staleness leads to a distribution gap between the training data and the latest model. In asynchronous RL training for LRMs, this issue can be even more severe for long trajectories due to extended decoding time.
|
||||
|
||||
- **Challenge 2: Inconsistent Policy Versions** - As shown in Fig.4, a single generated trajectory may involve segments produced by different policy versions. This inconsistency fundamentally violates the formulation of standard PPO, which assumes that all actions are generated by a single policy.
|
||||
|
||||
### Solutions
|
||||
|
||||
To overcome these two challenges, we propose two solutions:
|
||||
|
||||
- **Solution 1: Staleness-Aware Training** - To mitigate the negative impact of data staleness, we constrain the version discrepancy between the policy that generated each trajectory and the policy being trained. We introduce a hyperparameter η representing the maximum permitted staleness. When η = 0, the system degenerates to the synchronous RL setting, where generation exactly matches training batches. When η = 1, the system recovers the previous one-step overlap methods.
|
||||
|
||||
- **Solution 2: Decoupled PPO Objective** - We apply a decoupled PPO objective that disentangles the behavior policy from the proximal policy. The behavior policy is the policy used for sampling trajectories, while the proximal policy serves as a recent target that regularizes the update of the online policy.
|
||||
|
||||

|
||||
|
||||
## Validating Asynchronous AReaL
|
||||
|
||||
We first evaluate Asynchronous AReaL on a math task to validate our system and algorithmic choices. Building on our previous release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training dataset.
|
||||
|
||||
### End-to-End Performance Comparison
|
||||
|
||||
We compared synchronous and asynchronous training on 1.5B and 7B models. Under equal resource constraints and training steps, the asynchronous system was over twice as fast as the synchronous one. Evaluations on AIME24 confirmed that this acceleration does not compromise performance.
|
||||
|
||||
| **Model (1.5B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|:---:|:---:|:---:|:---:|:---:|
| basemodel | 29.3 | - | - | - |
| Synchronous AReaL | 42.0 | 16 | 1000 | 41.0 |
| Asynchronous AReaL | 42.2 | 16 | 1000 | **14.8** |
|
||||
|
||||
| **Model (7B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|:---:|:---:|:---:|:---:|:---:|
| basemodel | 54.3 | - | - | - |
| Synchronous AReaL | 63.0 | 24 | 1000 | 57.7 |
| Asynchronous AReaL | 63.1 | 24 | 1000 | **25.4** |
|
||||
|
||||
*Table 2: End-to-end comparison between asynchronous AReaL and synchronous AReaL. Following our last release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training data. Each performance number is the average over 32 sampled responses.*
|
||||
|
||||
### Ablation Study
|
||||
|
||||
#### System Ablations
|
||||
|
||||
We ablate interruptible generation and present the resulting generation throughput in Fig.5. Without interruptible generation, the controller must wait for the longest response. Interruptible generation leads to a 12% and 17% throughput increase for the 1.5B and 7B models, respectively, on 4 nodes.
|
||||
|
||||

|
||||
|
||||
*Fig.5 Ablation study of interruptible generation.*
|
||||
|
||||
#### Algorithm Ablations
|
||||
|
||||
We vary the maximum allowed staleness η and compare configurations with and without the decoupled PPO objective. Fig.6 shows the learning curves after 1000 PPO updates, demonstrating that naive PPO fails to match the performance of the synchronous RL oracle (i.e., the performance when η = 0). Even slight staleness can significantly degrade final performance due to the improper clipping center and policy changes during interruptible generation. Furthermore, increasing data staleness consistently degrades learning performance.
|
||||
|
||||

|
||||
|
||||
*Fig.6 Left: Learning curves with naive PPO. Right: Learning curves with decoupled PPO objective.*
|
||||
|
||||
| η | **AIME 24**<br/>(naive PPO) | **AIME 24**<br/>(+ Decoupled Objective) | **AIME 25**<br/>(naive PPO) | **AIME 25**<br/>(+ Decoupled Objective) |
|:---:|:---:|:---:|:---:|:---:|
| 0 | 42.0 | | 32.9 | |
| 1 | 41.8 | 42.1 | 30.7 | 31.9 |
| 2 | 40.0 | 41.8 | 32.1 | 32.5 |
| 4 | 23.3 | 42.2 | 23.1 | 32.0 |
| 8 | 35.7 | 41.0 | 27.8 | 31.1 |
| 16 | 35.8 | 38.7 | 26.2 | 32.5 |
| ∞ | 34.0 | 36.9 | 26.9 | 29.9 |
|
||||
|
||||
*Table 3: Evaluation scores when varying data staleness, comparing performance with and without the decoupled objective.*
|
||||
|
||||
However, the decoupled PPO objective substantially improves training stability when handling stale data. Notably, even with the decoupled objective, unbounded staleness (maximum staleness → ∞) still results in inferior performance compared to the zero-staleness oracle. When properly constrained, moderate staleness (e.g., η ≤ 4) has minimal impact on final performance while significantly accelerating training through the asynchronous pipeline, as demonstrated in Tab.3 and Fig.7.
|
||||
|
||||

|
||||
|
||||
*Fig.7 The relationship between η and training throughput. Larger η leads to higher throughput.*
|
||||
|
||||
## Training Recipe with Asynchronous AReaL
|
||||
|
||||
### Dataset Curation
|
||||
|
||||
We train the 7B and 14B models using open-source data from [DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
|
||||
|
||||
### Rollout Strategy
|
||||
|
||||
During the rollout phase, the LLM generates 16 responses per question. To minimize output truncation, we set the maximum generation length to 27K tokens.
|
||||
|
||||
### Filtering Strategy
|
||||
|
||||
In the design of our asynchronous system, we can naturally incorporate strategies to filter some rollouts. Specifically, we exclude the questions where all 16 answers are either entirely correct or entirely incorrect.
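A minimal sketch of this group-level filter is shown below, assuming one binary correctness score per sampled response. The bounds mirror the `success_rate_lb` / `success_rate_ub` values listed in the hyperparameter table, although AReaL's actual implementation and boundary handling may differ.

```python
from typing import List


def keep_prompt_group(rewards: List[float],
                      success_rate_lb: float = 0.05,
                      success_rate_ub: float = 0.95) -> bool:
    """Keep a prompt only if its group of sampled answers is neither all wrong nor all right.

    `rewards` holds one binary correctness score per sampled response
    (16 responses per question in this recipe)."""
    success_rate = sum(rewards) / len(rewards)
    return success_rate_lb <= success_rate <= success_rate_ub


# Example: a mixed group is kept, an all-correct group is filtered out.
assert keep_prompt_group([1.0] * 8 + [0.0] * 8)
assert not keep_prompt_group([1.0] * 16)
```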
|
||||
|
||||
### Key Hyperparameters
|
||||
|
||||
| **Parameter** | **Value** |
|:---:|:---:|
| PPO Minibatches | 1 |
| Learning Rate | 2e-5 |
| Adam ε | 1e-5 |
| Batch Size | 2,048 |
| max_head_offpolicyness η | 16 |
| success_rate_ub | 0.95 |
| success_rate_lb | 0.05 |
|
|
@ -8,7 +8,7 @@ The experiment configurations used to train our released models can be found in
|
|||
+ Dataset: [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl)
|
||||
|
||||
+ `examples/configs/v0.3-qwen3-code`: Configuration for reproducing the boba² coding model based on Qwen3
|
||||
+ Dataset: [DeepCoder Dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset) (duplicated problems and problems with incorrect answers have been filtered out)
|
||||
+ Dataset: [DeepCoder Dataset](https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.
|
||||
|
||||
## Adjusting Computation Resources
|
||||
|
||||
|
|
|
@ -92,6 +92,8 @@ python3 training/main_sync_ppo.py --help
|
|||
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training. Defaults to True.
|
||||
- **`ppo.gen.max_new_tokens`**: Maximum tokens to generate per prompt.
|
||||
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO update.
|
||||
- **`success_rate_ub`**: Upper bound of success rate. Prompts with a higher success rate will be filtered out.
|
||||
- **`success_rate_lb`**: Lower bound of success rate. Prompts with a lower success rate will be filtered out.
|
||||
|
||||
## Monitoring the Training Process
|
||||
|
||||
|
|
|
@ -0,0 +1,179 @@
|
|||
experiment_name: qwen3-14b-code
|
||||
trial_name: my-trial
|
||||
seed: 1
|
||||
mode: ray
|
||||
metric_discovery_port: 0
|
||||
wandb:
|
||||
mode: disabled
|
||||
entity: null
|
||||
project: null
|
||||
name: null
|
||||
job_type: null
|
||||
group: null
|
||||
notes: null
|
||||
tags: null
|
||||
config: null
|
||||
tensorboard:
|
||||
path: null
|
||||
recover_mode: auto
|
||||
recover_retries: 10
|
||||
recover_after: 10
|
||||
exp_ctrl:
|
||||
total_train_epochs: 10
|
||||
save_freq_epochs: null
|
||||
save_freq_steps: 20
|
||||
save_freq_secs: null
|
||||
ckpt_freq_epochs: null
|
||||
ckpt_freq_steps: null
|
||||
ckpt_freq_secs: 600
|
||||
eval_freq_epochs: null
|
||||
eval_freq_steps: null
|
||||
eval_freq_secs: null
|
||||
benchmark_steps: null
|
||||
benchmark_n_seqs: null
|
||||
torch_cache_mysophobia: true
|
||||
cache_clear_freq: 1
|
||||
max_head_offpolicyness: 16
|
||||
n_rollout_workers: null
|
||||
max_concurrent_rollouts: 512
|
||||
flush_request_timeout: 600
|
||||
cpus_per_generation_server: 4
|
||||
mem_per_generation_server: 61440
|
||||
cpus_per_gserver_manager: 4
|
||||
mem_per_gserver_manager: 10240
|
||||
cpus_per_rollout_worker: 4
|
||||
mem_per_rollout_worker: 20480
|
||||
allocation_mode: sglang.d80m4p1+d4p8m2
|
||||
n_nodes: 48
|
||||
n_gpus_per_node: 8
|
||||
ray_temp_path: /tmp/ray
|
||||
cluster:
|
||||
fileroot: /home/admin/.cache/realhf
|
||||
n_nodes: 48
|
||||
n_gpus_per_node: 8
|
||||
actor:
|
||||
type:
|
||||
_class: qwen3
|
||||
path: /storage/testing/models/Qwen__Qwen3-14B/
|
||||
init_from_scratch: false
|
||||
gradient_checkpointing: true
|
||||
bf16: false
|
||||
optimizer:
|
||||
type: adam
|
||||
lr: 2.0e-05
|
||||
weight_decay: 0.05
|
||||
beta1: 0.9
|
||||
beta2: 0.95
|
||||
eps: 1.0e-05
|
||||
min_lr_ratio: 0.0
|
||||
lr_scheduler_type: constant
|
||||
warmup_steps_proportion: 0.001
|
||||
initial_loss_scale: 4294967296.0
|
||||
min_loss_scale: 1.0
|
||||
loss_scale_window: 5.0
|
||||
hysteresis: 2
|
||||
gradient_clipping: 1.0
|
||||
megatron:
|
||||
ddp:
|
||||
grad_reduce_in_fp32: true
|
||||
overlap_grad_reduce: true
|
||||
use_distributed_optimizer: true
|
||||
sglang:
|
||||
disable_cuda_graph: false
|
||||
disable_radix_cache: false
|
||||
disable_cuda_graph_padding: false
|
||||
enable_nccl_nvls: false
|
||||
disable_outlines_disk_cache: false
|
||||
disable_custom_all_reduce: false
|
||||
disable_overlap_schedule: false
|
||||
enable_mixed_chunk: false
|
||||
enable_torch_compile: false
|
||||
torch_compile_max_bs: 32
|
||||
cuda_graph_max_bs: null
|
||||
cuda_graph_bs: null
|
||||
torchao_config: ''
|
||||
enable_nan_detection: false
|
||||
enable_p2p_check: false
|
||||
triton_attention_reduce_in_fp32: false
|
||||
triton_attention_num_kv_splits: 16
|
||||
num_continuous_decode_steps: 1
|
||||
enable_memory_saver: false
|
||||
allow_auto_truncate: false
|
||||
attention_backend: flashinfer
|
||||
sampling_backend: null
|
||||
context_length: 30720
|
||||
mem_fraction_static: 0.7
|
||||
max_running_requests: null
|
||||
chunked_prefill_size: -1
|
||||
max_prefill_tokens: 32768
|
||||
schedule_policy: lpm
|
||||
schedule_conservativeness: 1.0
|
||||
cpu_offload_gb: 0
|
||||
dtype: float16
|
||||
kv_cache_dtype: auto
|
||||
log_level: warning
|
||||
log_level_http: warning
|
||||
log_requests: false
|
||||
log_requests_level: 0
|
||||
show_time_cost: false
|
||||
enable_metrics: true
|
||||
decode_log_interval: 1
|
||||
ref:
|
||||
type:
|
||||
_class: qwen3
|
||||
path: /storage/testing/models/Qwen__Qwen3-14B/
|
||||
init_from_scratch: false
|
||||
bf16: false
|
||||
actor_train:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
ref_inf:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
actor_inf:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
shuffle_dataset: true
|
||||
dataset:
|
||||
path: /path/to/dataset.jsonl
|
||||
max_prompt_len: 2048
|
||||
train_bs_n_seqs: 128
|
||||
group_size: 16
|
||||
mask_too_long: false
|
||||
group_adv_norm: false
|
||||
rw_type: sparse
|
||||
success_rate_ub: 0.95
|
||||
success_rate_lb: 0.05
|
||||
ppo:
|
||||
gen:
|
||||
n: 1
|
||||
max_new_tokens: 27648
|
||||
min_new_tokens: 0
|
||||
greedy: false
|
||||
top_p: 1.0
|
||||
top_k: 100000000
|
||||
temperature: 1.0
|
||||
ppo_n_minibatches: 1
|
||||
eps_clip: 0.2
|
||||
c_clip: null
|
||||
value_eps_clip: 0.2
|
||||
early_stop_imp_ratio: 5.0
|
||||
actor_sample_reuse: 1
|
||||
critic_sample_reuse: 1
|
||||
max_reward_clip: 20.0
|
||||
reward_output_scaling: 5.0
|
||||
reward_output_bias: 0.0
|
||||
fuse_rew_ref: true
|
||||
discount: 1.0
|
||||
gae_lambda: 1.0
|
||||
adv_norm: true
|
||||
kl_ctl: 0.0
|
||||
use_adaptive_kl_ctl: false
|
||||
disable_value: true
|
||||
recompute_logprob: true
|
||||
use_decoupled_loss: true
|
||||
behav_imp_weight_cap: 5.0
|
||||
cpus_per_master_worker: 4
|
||||
mem_per_master_worker: 20000
|
||||
cpus_per_model_worker: 4
|
||||
mem_per_model_worker: 90000
|
|
@ -0,0 +1,179 @@
|
|||
experiment_name: qwen3-8b-code
|
||||
trial_name: my-trial
|
||||
seed: 1
|
||||
mode: ray
|
||||
metric_discovery_port: 0
|
||||
wandb:
|
||||
mode: disabled
|
||||
entity: null
|
||||
project: null
|
||||
name: null
|
||||
job_type: null
|
||||
group: null
|
||||
notes: null
|
||||
tags: null
|
||||
config: null
|
||||
tensorboard:
|
||||
path: null
|
||||
recover_mode: auto
|
||||
recover_retries: 10
|
||||
recover_after: 10
|
||||
exp_ctrl:
|
||||
total_train_epochs: 10
|
||||
save_freq_epochs: null
|
||||
save_freq_steps: 20
|
||||
save_freq_secs: null
|
||||
ckpt_freq_epochs: null
|
||||
ckpt_freq_steps: null
|
||||
ckpt_freq_secs: 600
|
||||
eval_freq_epochs: null
|
||||
eval_freq_steps: null
|
||||
eval_freq_secs: null
|
||||
benchmark_steps: null
|
||||
benchmark_n_seqs: null
|
||||
torch_cache_mysophobia: true
|
||||
cache_clear_freq: 1
|
||||
max_head_offpolicyness: 16
|
||||
n_rollout_workers: null
|
||||
max_concurrent_rollouts: 512
|
||||
flush_request_timeout: 300
|
||||
cpus_per_generation_server: 4
|
||||
mem_per_generation_server: 61440
|
||||
cpus_per_gserver_manager: 4
|
||||
mem_per_gserver_manager: 10240
|
||||
cpus_per_rollout_worker: 4
|
||||
mem_per_rollout_worker: 20480
|
||||
allocation_mode: sglang.d80m2p1+d4m2p4
|
||||
n_nodes: 24
|
||||
n_gpus_per_node: 8
|
||||
ray_temp_path: /tmp/ray
|
||||
cluster:
|
||||
fileroot: /home/admin/.cache/realhf
|
||||
n_nodes: 32
|
||||
n_gpus_per_node: 8
|
||||
actor:
|
||||
type:
|
||||
_class: qwen3
|
||||
path: /storage/testing/models/Qwen__Qwen3-8B/
|
||||
init_from_scratch: false
|
||||
gradient_checkpointing: true
|
||||
bf16: false
|
||||
optimizer:
|
||||
type: adam
|
||||
lr: 2.0e-05
|
||||
weight_decay: 0.05
|
||||
beta1: 0.9
|
||||
beta2: 0.95
|
||||
eps: 1.0e-05
|
||||
min_lr_ratio: 0.0
|
||||
lr_scheduler_type: constant
|
||||
warmup_steps_proportion: 0.001
|
||||
initial_loss_scale: 4294967296.0
|
||||
min_loss_scale: 1.0
|
||||
loss_scale_window: 5.0
|
||||
hysteresis: 2
|
||||
gradient_clipping: 1.0
|
||||
megatron:
|
||||
ddp:
|
||||
grad_reduce_in_fp32: true
|
||||
overlap_grad_reduce: true
|
||||
use_distributed_optimizer: true
|
||||
sglang:
|
||||
disable_cuda_graph: false
|
||||
disable_radix_cache: false
|
||||
disable_cuda_graph_padding: false
|
||||
enable_nccl_nvls: false
|
||||
disable_outlines_disk_cache: false
|
||||
disable_custom_all_reduce: false
|
||||
disable_overlap_schedule: false
|
||||
enable_mixed_chunk: false
|
||||
enable_torch_compile: false
|
||||
torch_compile_max_bs: 32
|
||||
cuda_graph_max_bs: null
|
||||
cuda_graph_bs: null
|
||||
torchao_config: ''
|
||||
enable_nan_detection: false
|
||||
enable_p2p_check: false
|
||||
triton_attention_reduce_in_fp32: false
|
||||
triton_attention_num_kv_splits: 16
|
||||
num_continuous_decode_steps: 1
|
||||
enable_memory_saver: false
|
||||
allow_auto_truncate: false
|
||||
attention_backend: flashinfer
|
||||
sampling_backend: null
|
||||
context_length: 30720
|
||||
mem_fraction_static: 0.7
|
||||
max_running_requests: null
|
||||
chunked_prefill_size: -1
|
||||
max_prefill_tokens: 32768
|
||||
schedule_policy: lpm
|
||||
schedule_conservativeness: 1.0
|
||||
cpu_offload_gb: 0
|
||||
dtype: float16
|
||||
kv_cache_dtype: auto
|
||||
log_level: warning
|
||||
log_level_http: warning
|
||||
log_requests: false
|
||||
log_requests_level: 0
|
||||
show_time_cost: false
|
||||
enable_metrics: true
|
||||
decode_log_interval: 1
|
||||
ref:
|
||||
type:
|
||||
_class: qwen3
|
||||
path: /storage/testing/models/Qwen__Qwen3-8B/
|
||||
init_from_scratch: false
|
||||
bf16: false
|
||||
actor_train:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
ref_inf:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
actor_inf:
|
||||
mb_spec:
|
||||
max_tokens_per_mb: 30720
|
||||
shuffle_dataset: true
|
||||
dataset:
|
||||
path: /path/to/dataset.jsonl
|
||||
max_prompt_len: 2048
|
||||
train_bs_n_seqs: 128
|
||||
group_size: 16
|
||||
mask_too_long: false
|
||||
group_adv_norm: false
|
||||
rw_type: sparse
|
||||
success_rate_ub: 0.95
|
||||
success_rate_lb: 0.05
|
||||
ppo:
|
||||
gen:
|
||||
n: 1
|
||||
max_new_tokens: 27648
|
||||
min_new_tokens: 0
|
||||
greedy: false
|
||||
top_p: 1.0
|
||||
top_k: 100000000
|
||||
temperature: 1.0
|
||||
ppo_n_minibatches: 1
|
||||
eps_clip: 0.2
|
||||
c_clip: null
|
||||
value_eps_clip: 0.2
|
||||
early_stop_imp_ratio: 5.0
|
||||
actor_sample_reuse: 1
|
||||
critic_sample_reuse: 1
|
||||
max_reward_clip: 20.0
|
||||
reward_output_scaling: 5.0
|
||||
reward_output_bias: 0.0
|
||||
fuse_rew_ref: true
|
||||
discount: 1.0
|
||||
gae_lambda: 1.0
|
||||
adv_norm: true
|
||||
kl_ctl: 0.0
|
||||
use_adaptive_kl_ctl: false
|
||||
disable_value: true
|
||||
recompute_logprob: true
|
||||
use_decoupled_loss: true
|
||||
behav_imp_weight_cap: 5.0
|
||||
cpus_per_master_worker: 4
|
||||
mem_per_master_worker: 20000
|
||||
cpus_per_model_worker: 4
|
||||
mem_per_model_worker: 90000
|