mirror of https://github.com/inclusionAI/AReaL
arealite doc final draft
<img align="right" alt="ReaL" src="/assets/logo.png" width="20%">

AReaL (Ant Reasoning RL) is an open-source **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**. Built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF), we are fully committed to open source by providing the training details, data, and infrastructure required to reproduce our results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it is delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy real-world milk tea (cheers).

**AReaL Highlights**

- 🔥 **Asynchronous RL**: With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
- ⚡ **\[NEW\] Lightweight & AI-centric**: In our new release, AReaLite, we deliver **90%** of AReaL's functionality with only **20%** of the lines of code! AReaLite also follows an **AI-centric** design that lets users build their own **agentic** and **RLVR** training workflows with much less effort.
- 🛠️ **Open & Reproducible**: We continuously release _all code, datasets, and training recipes_ for RL training of LLMs.
- 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
- 🔪 **Cutting-Edge Performance**: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

## News

**\[2025/07/31\] (v0.4, AReaLite)** We introduce **AReaLite**, a **lightweight** version of AReaL with an **AI-centric** API design that inherently supports fully asynchronous **agentic RL**. Check out [our AReaLite Design Doc](/arealite/README.md) and [the quickstart guide](/docs/tutorial/quickstart.md) to begin your journey with **AReaLite**!

**\[2025/06/03\] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).

**\[2025/03/31\] (v0.2, boba)** Here comes our next milestone release, boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check out our [v0.2 technical blog](/blog/AReaL_v0_2.md).

**\[2025/02/24\] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check out our [v0.1 technical blog](/blog/AReaL_v0_1.md).

## Release Highlights

New highlights in AReaLite:

- It follows an *AI-centric* API design, instead of the *system-centric* architecture of legacy AReaL, which makes it easier for AI researchers to adopt, understand, and build upon effectively and efficiently. To learn more about the design philosophy of AReaL, please read the [AReaLite Design Doc](/arealite/README.md)!
- It is a much more *lightweight* codebase than legacy AReaL, with only **20%** of the lines of code, and it comes with a detailed [code walkthrough](/docs/arealite/gsm8k_grpo.md) of a GRPO-on-GSM8K example. This saves you time and effort in reading the code!
- It enables smoother customization of your own **algorithms** and **agentic & RLVR rollouts** through `RolloutWorkflow` and SPMD training scripts (a minimal workflow sketch follows this list). Check [here](/docs/customization/agent.md) for agent & RLVR customization and [here](/docs/customization/algorithm.md) for algorithm customization.

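To give a feel for the AI-centric API, here is a minimal sketch of a custom rollout workflow. `RolloutWorkflow` is the AReaLite extension point named above, but the hook name, engine interface, and reward helper in this sketch are illustrative assumptions rather than the exact AReaLite API; see the [code walkthrough](/docs/arealite/gsm8k_grpo.md) for the real interface.

```python
# Illustrative sketch only: hook and engine names are assumptions, not the exact AReaLite API.
class MathRolloutWorkflow:  # in AReaLite this would subclass RolloutWorkflow
    def __init__(self, reward_fn, max_new_tokens: int = 1024):
        self.reward_fn = reward_fn            # e.g., compares the final answer with the label
        self.max_new_tokens = max_new_tokens

    async def run_episode(self, engine, sample):
        # 1) Roll out a completion from the current policy via the streaming rollout server.
        completion = await engine.generate(
            prompt=sample["prompt"], max_new_tokens=self.max_new_tokens
        )
        # 2) Score it with a task-specific, rule-based reward.
        reward = self.reward_fn(completion, sample["answer"])
        # 3) Return everything the trainer needs to build a training batch.
        return {"prompt": sample["prompt"], "completion": completion, "reward": reward}
```

Because a workflow is just an async Python object, a multi-turn agentic variant can loop the generate-and-score step inside the same hook without touching any system-level code.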

Good old stuff from AReaL:

- High performance and scalability with fully asynchronous RL training. Check our [boba² (v0.3) blog](/blog/AReaL_v0_3.md) for details.
- A single command line to launch an experiment, whether on a single node or on a large-scale distributed cluster.

Now, let us run an example experiment with AReaLite following the quickstart guide below!

## Getting Started with AReaLite

Our training scripts will automatically download the dataset (openai/gsm8k) and the model (Qwen/Qwen2-1.5B-Instruct). On a single node, run:

```bash
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml
```

On a Ray cluster with 2 nodes and 8 GPUs per node, run (remember to change the paths in the YAML file to your own shared storage):

```bash
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml \
    cluster.n_nodes=2 \
    cluster.n_gpus_per_node=8
```

Evaluation (on a single node):

```bash
python3 -m arealite.launcher.local examples/arealite/eval.py --config examples/arealite/configs/eval.yaml
```

## Switching from Old AReaL to AReaLite

## Resources

- [Documentation](https://inclusionai.github.io/AReaL/)
- [Contributing](https://inclusionai.github.io/AReaL/contrib.html)

### Quickstart

- [Installation](https://inclusionai.github.io/AReaL/tutorial/installation.html)
- [Example: Improving the math capability of Qwen3 with PPO](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)

### AReaL Legacy

For old AReaL documentation, check the legacy sections in our [Documentation](https://inclusionai.github.io/AReaL/). To reproduce AReaL boba & boba² results, check our [reproduction guide with legacy AReaL](https://inclusionai.github.io/AReaL/references/reproduce.html).

## Future Plan

AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also **hiring interns and full-time employees** with open positions in both the US and China.

For the research and development plan already in place, please see the following list:

- [x] Support for SGLang
- [x] RL training with coding problems
- [x] Asynchronous generation and RL training
- [ ] Optimizations for distributed training: expert parallelism for MoE and zero-bubble pipelining
- [ ] RL for vision-language models (VLM)
- [x] Multi-turn agentic RL
- [ ] Function calling and tool use

## Acknowledgement

We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and other projects, including but not limited to [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).

## Citation
We now release AReaL v0.3, featuring three major milestones:

- **A fully asynchronous RL training pipeline with system and RL algorithm co-design**, achieving over 2.77x speedup without any performance drop
- **SOTA coding models**, i.e., a 14B model with a **69.1 score on LCB-v5**. Reproducible results with fully open-sourced datasets are also provided
- Experimental support for **multi-turn agentic RL training**

### Performance Results

| **Model (8B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| [🤗 AReaL-boba²-8B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset) | 62.0 | 1933/97.2% | **41.4** |
| [🤗 AReaL-boba²-8B](https://huggingface.co/inclusionAI/AReaL-boba-2-8B) | **63.0** | **1962/97.5%** | 40.8 |

| **Model (14B)** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| [🤗 AReaL-boba²-14B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset) | 67.3 | 1990/97.8% | **46.2** |
| [🤗 AReaL-boba²-14B](https://huggingface.co/inclusionAI/AReaL-boba-2-14B) | **69.1** | **2044/98.2%** | 46.1 |

| **Larger Models** | **LiveCodeBench v5** (2024.10-2025.2) | **Codeforces** | **CodeContests** |
| :---: | :---: | :---: | :---: |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

*Table 1: Coding task performance comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced data. The AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces, and CodeContests.*

To access our latest models and training data, please visit this [Hugging Face collection](https://huggingface.co/collections/inclusionAI/areal-boba-2-683f0e819ccb7bb2e1b2f2d5).

## Motivation for Asynchronous RL System

### Inference devices are underutilized

During the synchronous RL training process, the generation step must wait until the longest output within a batch finishes. Due to the varying output lengths of LRMs, a synchronous RL system suffers from severe training inefficiency.

Some works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) use outputs generated by a previous model version to update the current model. For the best performance, the model version used for rollout generation is limited to being only one or two steps older than the model being trained. However, all of these systems still follow a batched generation setting, so the inefficiency of the generation phase remains unaddressed.



*Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of a one-step overlap RL system.*

### Scalability is poor in synchronous RL systems

Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size. This pushes the decoding process into a memory-IO-bound regime where additional devices fail to improve throughput.



*Fig.2. Left: Strong scaling of batched generation throughput for a 1.5B LRM. Right: Generation becomes memory-IO bound as GPU count increases.*

## Asynchronous AReaL Overview

We design a system that fully decouples generation and training across separate GPU clusters. The system should be hardware-efficient, scalable, and flexible enough to support customized RL workflows. We implement these principles in AReaL. Fig.3 presents the architecture and data flow of AReaL. The system comprises four core components:


### Core Components

- **Interruptible Rollout Worker** handles two types of requests: (1) the *generate* request, which generates responses given prompts, and (2) the *update_weights* request, which interrupts all ongoing generations and loads parameters of a new version. Upon interruption, the rollout workers discard the KV caches computed with the old weights and recompute them using the new weights. Afterwards, the rollout workers continue to decode the unfinished sequences until the next interruption or termination. We emphasize that such interruptions and in-flight weight updates result in trajectories composed of segments produced by different model versions. This introduces a novel algorithmic challenge, which is addressed in the next section.

- **Reward Service** evaluates the accuracy of the responses generated by the model. For example, in coding tasks, this service extracts the code and executes unit tests to verify its correctness.

- **Trainer Workers** continuously sample from the replay buffer, accumulating data until reaching the configured training batch size. They then perform PPO updates and store the resulting parameters in distributed storage. To ensure data freshness, data from the replay buffer is used only once.

- **Rollout Controller** serves as a critical bridge between the rollout workers, the reward service, and the trainer workers. During training, it reads data from the dataset and invokes the rollout workers' *generate* request. Each received response is then sent to the reward service to obtain a reward. The trajectory, along with its reward, is stored in the replay buffer, waiting to be consumed by the trainer workers. After the trainer workers update the parameters, the controller calls the rollout workers' *update_weights* request. We illustrate this generation and training management in Fig.4; a schematic sketch of the control loop is given after the figure. This asynchronous pipeline ensures continuous, full utilization of both generation and training resources.


### Challenges

While the asynchronous system design offers significant acceleration through improved device utilization, it introduces several technical challenges that require algorithmic considerations.

- **Challenge 1: Data Staleness.** Due to the asynchronous nature of AReaL, each training batch contains data from multiple prior policy versions. Data staleness leads to a distribution gap between the training data and the latest model. In asynchronous RL training for LRMs, this issue can be even more severe for long trajectories because of their extended decoding time.

- **Challenge 2: Inconsistent Policy Versions.** As shown in Fig.4, a single generated trajectory may involve segments produced by different policy versions. This inconsistency fundamentally violates the formulation of standard PPO, which assumes that all actions are generated by a single policy.

### Solutions
To overcome these two challenges, we propose two solutions:

- **Solution 1: Staleness-Aware Training.** To avoid the negative impact of data staleness, we constrain the version discrepancy between the policy that generated a trajectory and the policy being trained. We introduce a hyperparameter η representing the maximum permitted staleness. When η = 0, the system degenerates to the synchronous RL setting, where generation exactly matches the training batch. When η = 1, the system recovers the previous one-step overlap methods. A sketch of this admission rule follows this list.

- **Solution 2: Decoupled PPO Objective.** We apply a decoupled PPO objective that disentangles the behavior policy from the proximal policy. The behavior policy is the policy used for sampling trajectories, while the proximal policy serves as a recent target that regularizes the update of the online policy. The objective is written out after the formula figure below.

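As a concrete illustration of Solution 1, the sketch below shows the kind of admission rule the controller can apply before handing a trajectory to the trainer. It is our own simplified rendering; the function and field names are not AReaL's actual identifiers.

```python
# Simplified staleness check (illustrative only, not AReaL's actual code).
def is_fresh_enough(behavior_version: int, current_version: int, eta: int) -> bool:
    """Accept a trajectory only if the policy that generated it is at most
    `eta` versions behind the policy currently being trained."""
    return current_version - behavior_version <= eta

# eta = 0 recovers fully synchronous training, eta = 1 matches one-step overlap systems,
# and the boba² runs use eta = 16 (see `max_head_offpolicyness` in the hyperparameter table below).
```
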

## Validating Asynchronous AReaL

We first evaluate asynchronous AReaL on a math task to validate our system and algorithmic choices. Building on our previous release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training dataset.

### End-to-End Performance Comparison

We compared synchronous and asynchronous training on 1.5B and 7B models. Under equal resource constraints and training steps, the asynchronous system was over twice as fast as the synchronous one. Evaluations on AIME24 confirmed that this acceleration does not compromise performance.

|
||||
|:---:|:---:|:---:|:---:|:---:|
|
||||
| basemodel | 29.3 | - | - | - |
|
||||
| Synchronous AReaL | 42.0 | 16 | 1000 | 41.0 |
|
||||
| Asynchronous AReaL | 42.2 | 16 | 1000 | **14.8** |
|
||||
| **Model (1.5B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|
||||
| :----------------: | :--------: | :---------: | :-----------: | :----------------: |
|
||||
| basemodel | 29.3 | - | - | - |
|
||||
| Synchronous AReaL | 42.0 | 16 | 1000 | 41.0 |
|
||||
| Asynchronous AReaL | 42.2 | 16 | 1000 | **14.8** |
|
||||
|
||||
| **Model (7B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|
||||
|:---:|:---:|:---:|:---:|:---:|
|
||||
| basemodel | 54.3 | - | - | - |
|
||||
| Synchronous AReaL | 63.0 | 24 | 1000 | 57.7 |
|
||||
| Asynchronous AReaL | 63.1 | 24 | 1000 | **25.4** |
|
||||
| **Model (7B)** | **AIME24** | **# Nodes** | **PPO Steps** | **Training Hours** |
|
||||
| :----------------: | :--------: | :---------: | :-----------: | :----------------: |
|
||||
| basemodel | 54.3 | - | - | - |
|
||||
| Synchronous AReaL | 63.0 | 24 | 1000 | 57.7 |
|
||||
| Asynchronous AReaL | 63.1 | 24 | 1000 | **25.4** |
|
||||
|
||||
*Table 2: End-to-end comparison between asynchronous AReaL and synchronous AReaL. Following last release AReaL-boba, we adopt R1-Distill-Qwen as the basemodel and [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl) as the training data. Each performance number represents the average results of 32 sampling responses.*
|
||||
*Table 2: End-to-end comparison between asynchronous AReaL and synchronous AReaL.
|
||||
Following last release AReaL-boba, we adopt R1-Distill-Qwen as the basemodel and
|
||||
[AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl)
|
||||
as the training data. Each performance number represents the average results of 32
|
||||
sampling responses.*
|
||||
|
||||
### Ablation Study
#### System Ablations

We ablate interruptible generation and present the resulting generation throughput in Fig.5. Without interruptible generation, the controller must wait for the longest response in a batch. In particular, interruptible generation leads to a 12% and 17% throughput increase for the 1.5B and 7B models, respectively, on 4 nodes.


#### Algorithm Ablations

We vary the maximum allowed staleness η and compare configurations with and without the decoupled PPO objective. Fig.6 shows the learning curves after 1000 PPO updates, demonstrating that naive PPO fails to match the performance of the synchronous RL oracle (i.e., the performance when η = 0). Even slight staleness can significantly degrade final performance due to the improper clipping center and policy changes during interruptible generation. Furthermore, increasing data staleness consistently degrades learning performance.



*Fig.6 Ablation study on the decoupled PPO objective with DeepSeek-R1-Distill-Qwen-1.5B. Left: Learning curves with naive PPO. Right: Learning curves with the decoupled PPO objective.*

| η | **AIME 24**, naive PPO | **AIME 24**, + Decoupled Objective | **AIME 25**, naive PPO | **AIME 25**, + Decoupled Objective |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 42.0 | | 32.9 | |
| 1 | 41.8 | 42.1 | 30.7 | 31.9 |
| 2 | 40.0 | 41.8 | 32.1 | 32.5 |
| 4 | 23.3 | 42.2 | 23.1 | 32.0 |
| 8 | 35.7 | 41.0 | 27.8 | 31.1 |
| 16 | 35.8 | 38.7 | 26.2 | 32.5 |
| ∞ | 34.0 | 36.9 | 26.9 | 29.9 |

*Table 3: Evaluation scores when varying data staleness, comparing performance with and without the decoupled objective. At η = 0 the two configurations coincide, so only one number is reported.*

```{note}
In practice, the effect of staleness can depend on various factors, such as the learning rate, batch size, number of mini-batches, generation length, and the RL algorithm itself. Here we aim to illustrate the impact of staleness on the training performance of an asynchronous RL system, and to show that staleness control and the decoupled PPO objective can largely mitigate its negative impact. We follow a large-batch-size setting with `batch_size=512`, `n_rollouts=16`, `n_mini_batch=4`, `max_new_tokens=8192`, and `lr=2e-5`. Staleness is defined as the gap in model versions, where the model version increases by 1 each time a batch of `batch_size` x `n_rollouts` samples has been trained on.
```

However, the decoupled PPO objective substantially improves training stability when handling stale data. Notably, even with the decoupled objective, unbounded staleness (maximum staleness → ∞) still results in inferior performance compared to the zero-staleness oracle. When properly constrained, moderate staleness (e.g., η ≤ 4) has minimal impact on final performance while significantly accelerating training through the asynchronous pipeline, as demonstrated in Tab.3 and Fig.7.



*Fig.7 The relationship between η and training throughput. Larger η leads to higher throughput.*

## Training Recipe with Asynchronous AReaL
### Dataset Curation

We train the 7B and 14B models using open-source data from [DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51). After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the model's performance. We are actively working on open-sourcing this portion of the data.

### Rollout Strategy

During the rollout phase, the LLM generates 16 responses per question. To minimize output truncation, we set the maximum generation length to 27K tokens.

### Filtering Strategy

The design of our asynchronous system lets us naturally incorporate strategies for filtering rollouts. Specifically, we exclude questions for which all 16 answers are either entirely correct or entirely incorrect, since such groups provide little learning signal. A minimal sketch of this filter is shown below.

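The following is a minimal sketch of this filter under our own naming; the thresholds correspond to the `success_rate_lb` and `success_rate_ub` values listed in the hyperparameter table below, but the function itself is illustrative rather than AReaL's actual code.

```python
# Illustrative filter: drop prompts whose sampled answers are all correct or all incorrect,
# since such groups carry little learning signal for policy optimization.
def keep_question(rewards: list[float]) -> bool:
    n_correct = sum(r > 0 for r in rewards)   # one binary reward per sampled answer
    success_rate = n_correct / len(rewards)
    return 0.05 <= success_rate <= 0.95       # cf. success_rate_lb / success_rate_ub
```
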
### Key Hyperparameters

| **Parameter** | **Value** |
| :---: | :---: |
| PPO Minibatches | 1 |
| Learning Rate | 2e-5 |
| Adam ε | 1e-5 |
| Batch Size | 2,048 |
| max_head_offpolicyness η | 16 |
| success_rate_ub | 0.95 |
| success_rate_lb | 0.05 |

## Resources

- [boba² Code dataset](https://huggingface.co/datasets/inclusionAI/AReaL-boba-2-RL-Code)
- **Reproduce boba² Code Models**
  - 🤗 **Model weights**: [8B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-8B), [14B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-14B), [8B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset), [14B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset)
  - [Evaluation Guide](https://inclusionai.github.io/AReaL/tutorial/eval.html)
  - [Training configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and [instructions](https://inclusionai.github.io/AReaL/references/reproduce.html)
- [Scripts for Benchmark Training Throughput](https://github.com/inclusionAI/AReaL/tree/main/benchmark/verl_v0_3_0_post1_76084d3)