AReaL: Ant Reasoning Reinforcement Learning for LLMs
| Paper | Documentation | Ask DeepWiki | 🤗 Models & Data | WeChat Group |

AReaL (Ant Reasoning RL) is an open-source, fully asynchronous reinforcement learning training system for large reasoning models, developed at the RL Lab, Ant Research. Built upon the open-source project ReaLHF, we are fully committed to open source by providing the training details, data, and infrastructure required to reproduce our results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it is delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy real-world milk tea (cheers).
AReaL Highlights
- 🔥 Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training (see the sketch after this list for the core idea)! Experimental support for multi-turn agentic RL is also provided.
- ⚡ [NEW] Lightweight & AI-centric: In our new release, AReaLite, we deliver 90% of AReaL's functionality with only 20% of the lines of code! AReaLite also follows an AI-centric design that lets users build their own agentic and RLVR training workflows with much less effort.
- 🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
- 🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
- 🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
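To make "fully asynchronous RL" concrete, here is a minimal, self-contained sketch that decouples rollout generation from training through a bounded queue with a staleness cap, so generation never idles while the learner updates. All names in it (generate_rollout, rollout_worker, trainer, MAX_STALENESS) are hypothetical illustrations of the idea, not AReaL's actual implementation.

```python
# Sketch of asynchronous RL: generation and training run concurrently and
# communicate through a bounded queue. Names are illustrative, not AReaL's API.
import queue
import random
import threading

rollout_queue = queue.Queue(maxsize=64)  # bounds how far generation can run ahead
policy_version = 0                        # bumped by the trainer after each update
MAX_STALENESS = 4                         # drop rollouts from policies that are too old

def generate_rollout(version: int) -> dict:
    """Stand-in for LLM generation plus reward computation."""
    return {"version": version, "reward": random.random()}

def rollout_worker() -> None:
    # Generation runs in its own thread; it only blocks when the queue is full,
    # never on an optimizer step.
    while True:
        rollout_queue.put(generate_rollout(policy_version))

def trainer(num_steps: int = 100) -> None:
    global policy_version
    for _ in range(num_steps):
        batch = [rollout_queue.get() for _ in range(8)]
        # Staleness control: train only on samples from sufficiently recent policies.
        fresh = [s for s in batch if policy_version - s["version"] <= MAX_STALENESS]
        if fresh:
            policy_version += 1  # stand-in for a PPO/GRPO update and weight sync

threading.Thread(target=rollout_worker, daemon=True).start()
trainer()
```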
News
[2025/07/31] (v0.4, AReaLite) We introduce AReaLite, a light-weight version of AReaL with an AI-centric API design that inherently supports fully asynchronous agentic RL. Check out our AReaLite Design Doc and the quickstart guide to begin your journey with AReaLite!
[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.
[2025/03/31] (v0.2, boba) Here comes our next milestone release - boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.
[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.
Release Highlights
New highlights in AReaLite:
- Follows an AI-centric API design instead of the system-centric architecture of old AReaL, which makes it easier for AI researchers to adopt, understand, and build upon effectively and efficiently. To learn more about the design philosophy of AReaL, please read the AReaLite Design Doc!
- A much more lightweight codebase, with only 20% of the lines of code of the old AReaL codebase, plus a detailed code walkthrough of a GRPO-on-GSM8K example. Save your time and effort when reading the code!
- Smoother customization of your own algorithms and agentic & RLVR rollouts with RolloutWorkflow and SPMD training scripts (see the schematic sketch after this list)! Check here for agent & RLVR customization and here for algorithm customization.
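To illustrate what customizing a rollout workflow can look like, here is a schematic sketch of the abstraction for a single-turn RLVR task. The class and method names below (RolloutWorkflow, arun_episode, InferenceEngine, agenerate) are illustrative assumptions and may not match arealite's actual interface; see the customization guides above for the real API.

```python
# Schematic illustration of a rollout-workflow abstraction for agentic / RLVR RL.
# Names and signatures here are assumptions for illustration only.
import abc
from typing import Any, Dict


class InferenceEngine(abc.ABC):
    """Stand-in for an async generation backend (e.g. a server client)."""

    @abc.abstractmethod
    async def agenerate(self, prompt: str) -> str: ...


class RolloutWorkflow(abc.ABC):
    """One episode = everything between receiving a prompt and emitting training data."""

    @abc.abstractmethod
    async def arun_episode(self, engine: InferenceEngine, data: Dict[str, Any]) -> Dict[str, Any]: ...


class GSM8KRLVRWorkflow(RolloutWorkflow):
    """Single-turn RLVR: generate an answer and score it with a verifiable reward."""

    async def arun_episode(self, engine: InferenceEngine, data: Dict[str, Any]) -> Dict[str, Any]:
        completion = await engine.agenerate(data["question"])
        reward = float(data["answer"] in completion)  # toy exact-match verifier
        return {"prompt": data["question"], "completion": completion, "reward": reward}
```

The idea, per the bullets above, is that an agentic or RLVR task is expressed entirely inside a workflow like this, while the SPMD training script stays unchanged.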
Good old stuff from AReaL:
- High performance and scalability with fully asynchronous RL training. Check our boba² (v0.3) blog for details.
- A single command line to launch an experiment, whether on a single node or a large-scale distributed cluster.
Now, let us run an example experiment with AReaLite following the quickstart guide below!
Getting Started with AReaLite
Our training scripts will automatically download the dataset (openai/gsm8k) and the model (Qwen/Qwen2-1.5B-Instruct). On a single node, run:
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml
On a Ray cluster with 2 nodes and 8 GPUs per node, run (remember to change the paths in the YAML file to your own shared storage):
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml \
    cluster.n_nodes=2 \
    cluster.n_gpus_per_node=8
Evaluation (on a single node):
python3 -m arealite.launcher.local examples/arealite/eval.py --config examples/arealite/configs/eval.yaml
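The gsm8k_grpo.py example trains with GRPO. As a refresher on what the example optimizes, the sketch below computes the group-relative advantages at the heart of GRPO: for each prompt, several answers are sampled, and each answer's reward is normalized against its own group. This follows the standard GRPO formulation and is only an illustration, not AReaLite's exact implementation.

```python
# Group-relative advantage computation used by GRPO (illustrative, standalone).
# For each prompt, sample G completions, score them, then normalize each reward
# against the mean and std of its own group: A_i = (r_i - mean(r)) / (std(r) + eps).
from statistics import mean, pstdev
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: 4 sampled answers to one GSM8K question, rewarded 1.0 if correct else 0.0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```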
Switching from Old AReaL to AReaLite
Resources
Quickstart
AReaL Legacy
For old AReaL documentation, check legacy sections in our Documentation. To reproduce AReaL boba & boba² results, check our reproduction guide with legacy AReaL.
Future Plan
AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also hiring interns and full-time employees with open positions in both the US and China.
For the research and development plan already in place, please see the following list:
System Development
- Support for SGLang
- RL training with coding problems
- Asynchronous generation and RL training
- Optimizations for distributed training: expert parallelism for MoE and zero-bubble pipelining
- RL for vision-language models (VLM)
- Multi-turn agentic RL
- Function calling and tool use
Algorithm Development
- RL training recipes for 1.5B and 7B models
- A complete RL training recipe for 32B models
- Sample-efficient multi-task RL algorithms
- Agentic capabilities with end-to-end RL
- Stable RL training for larger MoE models
Acknowledgement
We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.
Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.
We also appreciate all the pioneering works from the community, particularly the ReaLHF project from OpenPsi Inc. and other projects, including but not limited to DeepScaleR, Open-Reasoner-Zero, OpenRLHF, VeRL, SGLang, QwQ, Light-R1 and DAPO.
Citation
@inproceedings{mei2025real,
  author    = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
  title     = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
  booktitle = {Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
  publisher = {mlsys.org},
  year      = {2025},
}
@misc{fu2025areal,
  title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
  year={2025},
  eprint={2505.24298},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.24298},
}