V0.2.0 prerelease (#9)

* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* update throughput figure
* update readme (2025-03-29 20:16)
* update
* update tutorial

* update readme
Author: Wei Fu, 2025-03-30 21:26:48 +08:00 (committed by GitHub)
parent 5b21c9add0
commit df7d84cb0e
2 changed files with 7 additions and 12 deletions


@@ -25,7 +25,7 @@ We are excited to release AReaL v0.2 (boba), featuring three major milestones:
| [AReaL-boba-SFT-32B 🤗](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B) | 78.8 | 62.1 | 60.1 |
- *Table 1: The performance of [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) and [AReaL-boba-SFT-32B](https://huggingface.co/inclusionAI/AReaL-boba-SFT-32B). We obtain SOTA 7B model using RL on math reasoning. Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. Additionally, We train a highly competitive 32B model using only 200 data samples, replicating **QwQ-32B's** inference performance on AIME 2024.*
+ *Table 1: The performance of AReaL-boba-RL-7B and AReaL-boba-SFT-32B. We obtain a SOTA 7B model using RL on math reasoning. Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. Additionally, we train a highly competitive 32B model using only 200 data samples, replicating **QwQ-32B's** inference performance on AIME 2024.*
### Training Speed Comparison
@@ -88,8 +88,6 @@ During the training phase, we compute advantages using GAE and normalize them ac
| PPO Minibatches | 4 |
| Learning Rate | 2e-5 |
| Adam ε | 1e-5 |
| Batch Size | 8,192 |
This configuration balances convergence speed with training stability, avoiding collapse risks from higher learning rates or smaller ε values.
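
For concreteness, here is a minimal sketch of the GAE advantage computation with batch-level normalization, using the hyperparameters from the table above as constants. The `gamma`/`lam` defaults and all function signatures are illustrative assumptions, not AReaL's actual implementation:

```python
import torch

# Values from the hyperparameter table above; everything else is assumed.
LEARNING_RATE = 2e-5   # Adam learning rate
ADAM_EPS = 1e-5        # Adam epsilon
BATCH_SIZE = 8192      # samples per PPO iteration
PPO_MINIBATCHES = 4    # gradient minibatches per iteration

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation for a single trajectory.

    rewards has shape [T]; values has shape [T + 1], where the last entry
    is the bootstrap value. gamma and lam are illustrative defaults here,
    not AReaL's documented settings.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        # TD residual at step t, then the discounted recursive GAE sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages

def normalize_advantages(adv: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Normalize advantages across the whole batch (zero mean, unit variance).
    return (adv - adv.mean()) / (adv.std() + eps)
```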
@@ -117,17 +115,14 @@ To ensure reliable pass@1 estimation, we:
+ Maintain training temperature (1.0) for RL models
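
A minimal sketch of this pass@1 estimation procedure, assuming hypothetical `sample_fn`/`check_fn` helpers and an illustrative sample count (none of these are taken from AReaL's evaluation code):

```python
from typing import Callable, Sequence

def estimate_pass_at_1(problems: Sequence[str],
                       sample_fn: Callable[[str], str],
                       check_fn: Callable[[str, str], bool],
                       n_samples: int = 32) -> float:
    """Average pass@1 over a benchmark.

    sample_fn is assumed to decode at temperature 1.0, matching the RL
    training temperature; check_fn grades a single answer. Both are
    hypothetical stand-ins, and n_samples = 32 is an assumed value.
    """
    scores = []
    for problem in problems:
        correct = sum(check_fn(problem, sample_fn(problem))
                      for _ in range(n_samples))
        scores.append(correct / n_samples)  # per-problem pass@1 estimate
    return sum(scores) / len(scores)
```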
## Conclusion & Future Work
- Our results demonstrate that **high-quality data is equally critical as algorithmic innovations**.
- When conducting RL training on a powerful base model, we require more challenging problems to facilitate learning. Therefore, we integrate resources from multiple recent open-source projects and filter problems by difficulty. A straightforward strategy for data filtering involves removing problems that the base model consistently solves correctly across multiple sampling attempts, as these no longer contribute to improving the model's performance.
+ Our results demonstrate that **high-quality data is just as critical as algorithmic innovations**. When conducting RL training on a powerful base model, we require more challenging problems to facilitate learning. Therefore, we integrate resources from multiple recent open-source projects and filter problems by difficulty. A straightforward filtering strategy is to remove problems that the base model consistently solves correctly across multiple sampling attempts, as they no longer contribute to improving the model's performance.
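
A sketch of this filtering strategy under assumed helpers (`sample_fn`, `check_fn`) and an illustrative attempt count; the actual pipeline and thresholds are not specified here:

```python
from typing import Callable, Sequence

def filter_by_difficulty(problems: Sequence[str],
                         sample_fn: Callable[[str], str],
                         check_fn: Callable[[str, str], bool],
                         n_attempts: int = 16) -> list[str]:
    """Drop problems the base model already solves on every sampled attempt."""
    kept = []
    for problem in problems:
        attempts = [check_fn(problem, sample_fn(problem))
                    for _ in range(n_attempts)]
        # Keep only problems the base model still fails at least once;
        # consistently solved ones no longer provide a learning signal.
        if not all(attempts):
            kept.append(problem)
    return kept
```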
- AReaL delivers stable and fast training with cutting-edge model performances. Since initial release, we've continuously improved system efficiency, training stability, and accessibility.
- All aforementioned techniques have been implemented in AReaL, with **reproducible configurations** for various model sizes and different hardware setups.
+ AReaL delivers stable and fast training with cutting-edge model performance. Since its initial release, we've continuously improved system efficiency, training stability, and accessibility. All of the aforementioned techniques are implemented in AReaL, with [**reproducible configurations**](/examples/configs/) for various model sizes and hardware setups.
Looking ahead, the AReaL team will:
- + Further optimize system performance
- + Introduce new features
+ + Further optimize the RL training throughput
+ + Introduce new algorithmic features
+ Continue open-sourcing training data
+ Expand to broader reasoning tasks