博惟 2025-03-30 20:32:15 +08:00
parent bbf61d72c6
commit 6794b2f07e
2 changed files with 174 additions and 32 deletions


@@ -9,7 +9,7 @@
</picture>
</p>
AReaL (Ant Reasoning RL) is an open-source, efficient reinforcement learning training system for large reasoning models, developed at **the RL Lab, Ant Research** and built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We are fully committed to open source, releasing the training details, data, and infrastructure required to reproduce our results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it is delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy a real-world milk tea (cheers).
**AReaL Highlights**
@@ -18,22 +18,20 @@ AReaL (Ant Reasoning RL) is an open-sourced and efficient reinforcement learning
+ 🔪 **Cutting-Edge Performance:** AReaL can produce models with cutting-edge reasoning capabilities. We are also actively working on other domains, such as coding and agentic tasks.
## News
**[2025/03/31] (v0.2, nickname Boba)** Our milestone release, Boba! Please call it A-ReaL-Boba! This release includes significantly accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning.
**[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).
## AReaL-boba Milestones and Highlights
In our boba release, we highlight the 3 most important milestones:
+ Full SGLang support and a collection of efficiency improvements
+ A SOTA 7B math reasoning model [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) and the corresponding training data [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
+ A particularly competitive 32B model [AReaL-boba-SFT-32B](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) that can be trained at extremely low cost (training data: [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl)).
For the complete training and model details, please check [our v0.2 technical blog](/blog/AReaL_v0_2.md).
#### SGLang support with 1.5x speedup on 7B Training
![Throughput comparison with v0.1.0](/assets/thpt_comparison.png)
@@ -47,38 +45,44 @@ In the following table, we show the convergence time under different resource settings:
| Step | 250 | 250 | 250 | 400 | 400 | 300 |
| Time (h) | ~230 | ~70 | ~25 | ~290 | ~80 | ~3.5 |
#### SOTA 7B model using RL on math reasoning
| **Model** | **AIME 2024** | **AIME 2025** | **GPQA-Diamond** |
| :---: | :---: | :---: | :---: |
| O1-Preview | 56.7 | - | - |
| R1-Distill-Qwen-7B | 55.0 | 39.7 | 47.1 |
| Light-R1-7B-DS | 56.7 | 44.9 | 40.9 |
| [AReaL-boba-RL-7B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | **61.9** | **48.3** | **47.6** |
We use **R1-Distill-Qwen-7B** as our base model. After RL training, the pass@1 scores on AIME 2024 and AIME 2025 improve by 6.9 and 8.6 points, respectively, achieving SOTA performance among 7B models in mathematical reasoning. We have released the training data at [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. We plan to open-source more datasets in the future, including code, STEM, and other domains.
#### Approaching QwQ-32B's performance using only 200 data samples
| **Model** | **AIME 2024** |
| :---: | :---: |
| R1-Distill-Qwen-32B | 72.6 |
| QwQ-32B | 78.9 |
| [AReaL-boba-SFT-32B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) | 78.8 |
Building upon **R1-Distill-Qwen-32B**, we replicate **QwQ-32B's** inference performance on AIME 2024 using just **200 data points** via Supervised Fine-Tuning (SFT). We have released the training data at [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl).
## Getting Started
### Quick Start
```bash
git clone https://github.com/inclusionAI/AReaL
cd AReaL
# Train the distilled 7B model
REAL_NO_EXT=1 pip3 install -e . --no-build-isolation
python3 -m realhf.apps.quickstart ppo-math \
--config examples/configs/7B-distill/ppo-7B-distill-gpus-128.yaml
# Evaluate the 7B model
python3 evaluation/eval_and_aggregate.py \
    --model_path ${MODEL_PATH} \
    --output_path ${OUTPUT_PATH} \
    --data_names aime24,aime25 \
    --prompt_type AReaL-boba \
    --max_gen_tokens 32768 --temperature 1.0
```
### Resources
@@ -90,7 +94,6 @@ AReaL is under active development. We will have major releases in a weekly manner.
### System Development
- [x] Support for SGLang.
- [ ] Support for the latest vLLM and megatron-core packages.
- [ ] RL training with coding problems.
- [ ] Asynchronous generation and RL training.
- [ ] Optimizations for distributed training: expert parallel and zero-bubble pipelining.
@@ -98,18 +101,18 @@ AReaL is under active development. We will have major releases in a weekly manner.
- [ ] Function calling and agent capabilities.
### Algorithm Development
- [x] RL training recipes for 1.5B and 7B models.
- [ ] A complete RL training recipe for 32B models.
- [ ] Sample-efficient multi-task RL algorithms.
- [ ] Agentic capabilities with end-to-end RL.
- [ ] Stable RL training for larger MoE models.
## Acknowledgement
We would like to note that the major contributors are from the **RL Lab at Ant Research** and the **Institute for Interdisciplinary Information Sciences, Tsinghua University**.
Our team has also received invaluable assistance from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.
We also appreciate the pioneering work from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and other projects, including but not limited to [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1), and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
## Citation

blog/AReaL_v0_2.md Normal file

@@ -0,0 +1,139 @@
<h1 align="center">
<em>AReaL</em> v0.2: Training a SOTA 7B LRM with 1.5x Throughput Improvement
</h1>
<p align="center" style="font-size: 0.8em; color: #666;">
Release Date: 2025-03-31
</p>
## Introduction
We are excited to release AReaL v0.2 (boba), featuring three major milestones:
+ **SGLang Support**: With the addition of SGLang support and a series of engineering optimizations, AReaL v0.2 achieves a 1.5x speedup over AReaL v0.1 on 7B models.
+ **SOTA 7B Model**: AReaL's RL training becomes more stable and sample-efficient. We obtain a SOTA 7B model for mathematical reasoning, which achieves pass@1 scores of 61.9 on AIME 2024 and 48.3 on AIME 2025.
+ **Competitive 32B Model**: We train a highly competitive 32B model at an extremely low cost, achieving results comparable to QwQ-32B using only 200 data samples.
| **Model (7B)** | **AIME 2024** | **AIME 2025** | **GPQA-Diamond** |
| :---: | :---: | :---: | :---: |
| R1-Distill-Qwen-7B | 55.0 | 39.7 | 47.1 |
| Light-R1-7B-DS | 56.7 | 44.9 | 40.9 |
| [AReaL-boba-RL-7B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | **61.9** | **48.3** | **47.6** |

| **Model (32B)** | **AIME 2024** | **AIME 2025** | **GPQA-Diamond** |
| :---: | :---: | :---: | :---: |
| R1-Distill-Qwen-32B | 72.6 | 54.9 | 63.2 |
| QwQ-32B | 78.9 | 70.2 | 64.6 |
| Light-R1-32B-DS | 76.2 | 67.8 | 63.5 |
| [AReaL-boba-SFT-32B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) | 78.8 | 62.1 | 60.1 |
*Table 1: The performance of [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) and [AReaL-boba-SFT-32B](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B). We obtain a SOTA 7B model using RL on math reasoning. Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. Additionally, we train a highly competitive 32B model using only 200 data samples, replicating QwQ-32B's inference performance on AIME 2024.*
### Training Speed Comparison
![Throughput comparison with v0.1.0](/assets/thpt_comparison.png)
AReaL v0.2.0 features the following system optimizations:
+ **Upgraded Generation Backend: vLLM 0.6.3 → SGLang v0.4.0**
The generation backend has been upgraded from vLLM 0.6.3 to SGLang v0.4.0, leveraging its radix attention mechanism to significantly improve throughput in scenarios where multiple responses are sampled from the same prompt. Additionally, SGLang automatically flushes radix caches upon weight updates, ensuring correctness in on-policy reinforcement learning (RL). We will continue tracking advancements in the community to integrate further optimizations.
+ **Optimized Training for Variable-Length Sequences & Large Batches**
To handle variable sequence lengths efficiently, we eliminate padding and instead pack sequences into 1D tensors. A dynamic allocation algorithm distributes sequences (approximately) optimally under a maximum token budget, balancing micro-batch sizes while minimizing the number of micro-batches. This approach maximizes GPU memory utilization, enabling efficient computation over large batches of variable-length inputs; a simplified sketch of the packing strategy is given right after this list.
+ **High-Performance Data Transfer for 1K-GPU Scaling**
AReaL employs NCCL with GPU-Direct RDMA (GDRDMA) over InfiniBand/RoCE, enabling direct GPU-to-GPU communication that bypasses costly CPU-mediated transfers and PCIe bottlenecks. Compared to traditional Ethernet-based approaches, this reduces latency and increases throughput, keeping generation-to-training data transfer overhead below 3 seconds even in a 1,000-GPU cluster.
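To make the packing idea from the second bullet concrete, here is a minimal, hypothetical sketch of a balanced first-fit packer under a token budget. The function name and interface are made up for illustration; the allocator actually used in AReaL is more elaborate.

```python
from typing import List

def pack_sequences(seq_lens: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily pack sequence indices into micro-batches whose total length stays
    within max_tokens, preferring the least-loaded micro-batch that fits so that
    micro-batch sizes remain balanced."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    batches: List[List[int]] = []   # sequence indices per micro-batch
    loads: List[int] = []           # total tokens per micro-batch
    for i in order:
        # Micro-batches that can still take this sequence without exceeding the budget.
        fits = [m for m, load in enumerate(loads) if load + seq_lens[i] <= max_tokens]
        if fits:
            m = min(fits, key=lambda j: loads[j])  # least-loaded to keep batches balanced
            batches[m].append(i)
            loads[m] += seq_lens[i]
        else:
            # Open a new micro-batch (an over-long sequence gets one to itself).
            batches.append([i])
            loads.append(seq_lens[i])
    return batches

# Example: pack five variable-length sequences under a 4096-token budget.
print(pack_sequences([3000, 1200, 800, 2500, 400], max_tokens=4096))
```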
## Training Recipe
### SOTA 7B model using RL on math reasoning
#### Base Model
We use [R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) as our foundation model.
#### Dataset Curation
Our training dataset ([AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/blob/main/data/boba_106k_0319.jsonl)) combines resources from multiple open-source projects:
+ [DeepScaleR](https://github.com/agentica-project/deepscaler)
+ [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/)
+ [Light-R1](https://github.com/Qihoo360/Light-R1)
+ [DAPO](https://github.com/BytedTsinghua-SIA/DAPO)
We enhanced this with challenging problems from:
+ [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) (AoPS/Olympiad subsets)
+ [ZebraLogic](https://hf-mirror.com/datasets/WildEval/ZebraLogic)
To maintain an appropriate difficulty level, we filter out overly simple questions. Specifically, we generate 8 solutions per question using DeepSeek-R1-Distill-Qwen-7B and discard questions for which all 8 solutions are correct.
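As a rough illustration of this filtering rule, the sketch below assumes two user-supplied helpers, `sample_answers` (wrapping the distilled model's sampling) and `is_correct` (an answer verifier); both names are hypothetical and not part of the AReaL codebase.

```python
from typing import Callable, Dict, List

def filter_by_difficulty(
    questions: List[Dict],
    sample_answers: Callable[[str, int], List[str]],  # prompt, n -> n sampled solutions
    is_correct: Callable[[str, Dict], bool],          # solution, question -> verified?
    n_samples: int = 8,
) -> List[Dict]:
    """Drop questions the base model already solves in every sampled attempt,
    keeping only problems that can still teach the model something."""
    kept = []
    for question in questions:
        solutions = sample_answers(question["prompt"], n_samples)
        if not all(is_correct(s, question) for s in solutions):
            kept.append(question)
    return kept
```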
#### Reward Function
We adopt a sparse sequence-level reward mechanism. The model is instructed to enclose the final answer within \boxed{}, and the boxed answer is then verified. Correct responses receive a reward of +5, while incorrect ones are penalized with -5.
Notably, we observe that the KL reward can impair performance, particularly in long chain-of-thought training, so we set it to zero.
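A minimal sketch of such a reward is shown below. The verifier used in practice is more lenient (handling mathematically equivalent answer forms); the helper names here are purely illustrative.

```python
import re
from typing import Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response (nested braces not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def sequence_reward(response: str, reference: str) -> float:
    """Sparse sequence-level reward: +5 if the boxed answer matches the reference, -5 otherwise."""
    answer = extract_boxed_answer(response)
    return 5.0 if answer is not None and answer == reference.strip() else -5.0

# With the KL reward set to zero, this is the only reward signal used during training.
```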
#### RL Algorithm
We employ Proximal Policy Optimization (PPO) as our training algorithm. We remove the critic model to save compute. We set both the discount factor γ and the GAE parameter λ to 1. Such practices are also adopted by the [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) project.
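For reference, here is what these choices imply algebraically (standard GAE algebra under the settings stated above, not code from AReaL). The GAE advantage is

$$\hat{A}_t=\sum_{l=0}^{T-t-1}(\gamma\lambda)^l\,\delta_{t+l},\qquad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t).$$

With $\gamma=\lambda=1$ the sum telescopes to $\hat{A}_t=\sum_{t'=t}^{T-1} r_{t'}-V(s_t)$, and with the critic removed ($V\equiv 0$) the advantage of each token is simply its return-to-go. Since the reward here is a single terminal value of ±5, every token in a trajectory carries the same raw advantage before normalization.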
#### Token-Level Loss Normalization
Averaging the loss at the sequence level can underweight the overall contribution of longer responses. To address this, we normalize the loss at the token level, a practice also highlighted in [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).
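The difference between the two reductions fits in a few lines of PyTorch-style pseudocode; this is a sketch of the idea, not AReaL's actual implementation.

```python
import torch

def ppo_loss_reduction(per_token_loss: torch.Tensor,
                       loss_mask: torch.Tensor,
                       token_level: bool = True) -> torch.Tensor:
    """per_token_loss and loss_mask have shape [batch, seq_len]; the mask is 1 on
    generated (non-prompt, non-padding) tokens.

    Sequence-level averaging gives every response equal weight, so tokens in long
    responses are down-weighted; token-level averaging (used here) weights every
    generated token equally."""
    if token_level:
        return (per_token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1)
    per_sequence = (per_token_loss * loss_mask).sum(dim=1) / loss_mask.sum(dim=1).clamp(min=1)
    return per_sequence.mean()
```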
#### Rollout Strategy
During the rollout phase, we sample 512 questions per batch, and the LLM generates 16 responses per question, resulting in a total batch size of 8,192 responses. To minimize output truncation, we set the maximum generation length to 27K tokens. In our experiments, the truncation rate remained below 5%.
#### Advantage Normalization
During the training phase, we compute advantages using GAE and normalize them across all generated tokens.
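Correspondingly, a sketch of batch-wide advantage whitening over generated tokens (again illustrative rather than the exact AReaL code):

```python
import torch

def normalize_advantages(advantages: torch.Tensor,
                         loss_mask: torch.Tensor,
                         eps: float = 1e-5) -> torch.Tensor:
    """Whiten advantages using the mean and std computed over all generated tokens in the batch."""
    valid = advantages[loss_mask.bool()]
    return (advantages - valid.mean()) / (valid.std() + eps)
```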
#### Key Hyperparameters
| **Parameter** | **Value** |
| :---: | :---: |
| PPO Minibatches | 4 |
| Learning Rate | 2e-5 |
| Adam ε | 1e-5 |
| Batch Size | 8,192 |
This configuration balances convergence speed with training stability, avoiding collapse risks from higher learning rates or smaller ε values.
### Approaching QwQ-32B's performance using only 200 data samples
At the 32B model scale, we further refined the training data and released [AReaL-boba-SFT-200](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-SFT-200.jsonl), a high-quality dataset with only **200 data points**. Together with the accompanying training scripts, it replicates **QwQ-32B's** inference performance on **AIME 2024** via Supervised Fine-Tuning (SFT).
### Evaluation Best Practices
During evaluation, we use vLLM v0.6.3 as the generation framework. We identified several settings that influence evaluation performance, particularly for long-context generation. We recommend manually configuring the following options:
```bash
enforce_eager=True
enable_chunked_prefill=False
disable_custom_all_reduce=True
disable_sliding_window=True
```
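For concreteness, these options map onto arguments of vLLM's `LLM` constructor roughly as in the sketch below; the checkpoint path and tensor-parallel size are placeholders, not values taken from our evaluation scripts.

```python
from vllm import LLM

llm = LLM(
    model="/path/to/AReaL-boba-RL-7B",   # placeholder checkpoint path
    tensor_parallel_size=1,              # adjust to your GPU setup
    enforce_eager=True,
    enable_chunked_prefill=False,
    disable_custom_all_reduce=True,
    disable_sliding_window=True,
)
```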
Following the practice of DeepSeek models, we incorporate a directive in the prompt: "Please reason step by step, and enclose your final answer in \boxed{}." To encourage long-context reasoning, we also enforce that each response begins with "\n".
To ensure reliable pass@1 estimation (a minimal sketch follows this list), we:
+ Sample 32 answers per problem
+ Use temperature=0.6 and top_p=0.95 for SFT models
+ Maintain the training temperature (1.0) for RL models
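Putting the sampling settings together, a per-problem pass@1 estimate could look like the following sketch; the correctness flags are assumed to come from the same answer verifier used during training, and for RL models `temperature` should be set to 1.0.

```python
from typing import List
from vllm import SamplingParams

# SFT-model settings from the list above; 32 samples per problem.
# max_tokens matches --max_gen_tokens 32768 in the quick-start evaluation command.
sampling = SamplingParams(n=32, temperature=0.6, top_p=0.95, max_tokens=32768)

def pass_at_1(correct_flags: List[bool]) -> float:
    """pass@1 estimated as the fraction of sampled answers that are correct."""
    return sum(correct_flags) / len(correct_flags)
```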
## Conclusion & Future Work
Our results demonstrate that **high-quality data is just as critical as algorithmic innovation**.
When conducting RL training on a powerful base model, we require more challenging problems to facilitate learning. Therefore, we integrate resources from multiple recent open-source projects and filter problems by difficulty. A straightforward strategy for data filtering involves removing problems that the base model consistently solves correctly across multiple sampling attempts, as these no longer contribute to improving the model's performance.
AReaL delivers stable and fast training with cutting-edge model performance. Since the initial release, we have continuously improved system efficiency, training stability, and accessibility.
All aforementioned techniques have been implemented in AReaL, with **reproducible configurations** for various model sizes and different hardware setups.
Looking ahead, the AReaL team will:
+ Further optimize system performance
+ Introduce new features
+ Continue open-sourcing training data
+ Expand to broader reasoning tasks
We believe these contributions lower the barrier to high-quality RL training while pushing the boundaries of reasoning capabilities. The project will continue to evolve, and we welcome community feedback and collaboration to drive further progress in this exciting field.