AReaL v0.3: SOTA Coding Models with 2.77x Faster Asynchronous RL Training
Introduction
We now release AReaL v0.3, featuring three major milestones:
- A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop
- SOTA coding models, including a 14B model that scores 69.1 on LiveCodeBench v5. Reproducible results with fully open-sourced datasets are also provided
- Experimental support for multi-turn agentic RL training
Performance Results
Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating / percentile) | CodeContests |
---|---|---|---|
Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
🤗 AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | 41.4 |
🤗 AReaL-boba²-8B | 63.0 | 1962/97.5% | 40.8 |
Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating / percentile) | CodeContests |
---|---|---|---|
Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
🤗 AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | 46.2 |
🤗 AReaL-boba²-14B | 69.1 | 2044/98.2% | 46.1 |
Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces (rating) | CodeContests |
---|---|---|---|
Qwen3-235B | 70.7 | 2056 | - |
DeepSeek-R1 | 64.3 | 2029 | - |
OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes models trained on open-sourced data only. AReaL-boba²-8B/14B models are trained with a small amount of additional internal data and achieve SOTA performance on LiveCodeBench, Codeforces, and CodeContests.
To access our latest models and training data, please visit our Hugging Face page.
Motivation for Asynchronous RL System
Inference devices are underutilized
During synchronous RL training, the generation step must wait until the longest output in a batch finishes. Because output lengths vary widely for LRMs, a synchronous RL system suffers from severe training inefficiency.
Some works (DeepCoder, Intellect) use outputs generated by a previous model version to update the current model. For the best performance, the model version used for rollout generation is limited to being only one or two steps older than the trained model. However, these systems still follow a batched generation setting, so the inefficiency of the generation phase remains unaddressed.
Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.
Scalability is poor in synchronous RL systems
Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size. This pushes the decoding process into a memory-IO-bound regime where additional devices fail to improve throughput.
Fig2. Left: Strong scaling of batched generation throughput for a 1.5B LRM. Right: Generation becomes memory-IO bound as GPU count increases.
Asynchronous AReaL Overview
We design a system that fully decouples generation and training across separate GPU clusters. Such a system should be hardware-efficient, scalable, and flexible enough to support customized RL workflows. We implement these principles in AReaL. Fig.3 presents the architecture and data flow of AReaL. The system comprises four core components:
Fig.3 The architecture featuring asynchronous generation and training components.
Core Components
- Interruptible Rollout Worker handles two types of requests: (1) the generate request, which generates responses given prompts, and (2) the update_weights request, which interrupts all ongoing generations and loads parameters of a new version. Upon interruption, the rollout workers discard the KV caches computed with the old weights and recompute them with the new weights. Afterwards, the rollout workers continue decoding the unfinished sequences until the next interruption or termination. We emphasize that such interruptions and in-flight weight updates yield trajectories composed of segments produced by different model versions. This introduces a novel algorithmic challenge, which we address in the next section.
- Reward Service evaluates the accuracy of the responses generated by the model. In the coding task, for example, this service extracts the code and runs unit tests to verify its correctness.
- Trainer Workers continuously sample from the replay buffer, accumulating data until the configured training batch size is reached. They then perform PPO updates and store the resulting parameters in distributed storage. To ensure data freshness, each sample in the replay buffer is used only once.
- Rollout Controller serves as the bridge between the rollout workers, the reward service, and the trainer workers. During training, it reads data from the dataset and invokes the rollout workers' generate request. Each received response is sent to the reward service to obtain a reward, and the trajectory, together with its reward, is stored in the replay buffer until the trainer workers consume it. After the trainer workers update the parameters, the controller calls the rollout workers' update_weights request. Fig.4 illustrates the management of generation and training; a minimal sketch of this control loop follows the figure. This asynchronous pipeline ensures continuous, full utilization of both generation and training resources.
Fig 4. Execution timeline of our fully asynchronous RL system.
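To make the control flow above concrete, here is a minimal Python sketch of the controller and trainer loops. All class and method names (`RolloutWorker.generate`, `update_weights`, `RewardService.evaluate`, etc.) are hypothetical stand-ins for illustration, not AReaL's actual API.

```python
# Minimal sketch of the asynchronous pipeline (hypothetical API, not AReaL's actual code).
import threading


class RolloutController:
    def __init__(self, prompts, workers, reward_service, replay_buffer):
        self.prompts = prompts                # queue of prompts read from the dataset
        self.workers = workers                # interruptible rollout workers
        self.reward_service = reward_service  # e.g., extracts code and runs unit tests
        self.replay_buffer = replay_buffer    # queue consumed by trainer workers

    def rollout_loop(self, worker):
        """Submit prompts, collect rewarded trajectories, and push them to the buffer."""
        while True:
            prompt = self.prompts.get()
            # generate() may be interrupted mid-decoding by update_weights();
            # the worker then recomputes KV caches and resumes with the new weights.
            trajectory = worker.generate(prompt)
            reward = self.reward_service.evaluate(prompt, trajectory)
            self.replay_buffer.put((trajectory, reward, worker.version))

    def on_new_parameters(self, version, checkpoint_path):
        """Called after each trainer update: interrupt generation and hot-swap weights."""
        for worker in self.workers:
            worker.update_weights(version, checkpoint_path)

    def start(self):
        for worker in self.workers:
            threading.Thread(target=self.rollout_loop, args=(worker,), daemon=True).start()


def trainer_loop(replay_buffer, trainer, controller, train_batch_size):
    """Accumulate a batch from the replay buffer, run a PPO update, publish new weights."""
    while True:
        batch = [replay_buffer.get() for _ in range(train_batch_size)]  # each sample used once
        version, checkpoint_path = trainer.ppo_update(batch)            # writes to distributed storage
        controller.on_new_parameters(version, checkpoint_path)
```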
Algorithmic Challenges and Solutions for Asynchronous RL
Challenges
While the asynchronous system design offers significant acceleration through improved device utilization, it introduces several technical challenges that require algorithmic considerations.
- Challenge 1: Data Staleness - Due to the asynchronous nature of AReaL, each training batch contains data from multiple prior policy versions. This staleness creates a distribution gap between the training data and the latest model. In asynchronous RL training for LRMs, the issue can become even more severe for long trajectories because of their extended decoding time.
- Challenge 2: Inconsistent Policy Versions - As shown in Fig.4, a single generated trajectory may contain segments produced by different policy versions. This inconsistency fundamentally violates the formulation of standard PPO, which assumes that all actions are generated by a single policy.
Solutions
To overcome these two challenges, we propose two solutions:
- Solution 1: Staleness-Aware Training - To mitigate the negative impact of data staleness, we constrain the version gap between the policy that generated a trajectory and the policy being trained. We introduce a hyperparameter η representing the maximum permitted staleness. When η = 0, the system degenerates to the synchronous RL setting, where generation exactly matches the training batch. When η = 1, the system recovers the previous one-step overlap methods.
- Solution 2: Decoupled PPO Objective - We apply a decoupled PPO objective that disentangles the behavior policy from the proximal policy. The behavior policy is the policy that actually sampled the trajectories, while the proximal policy serves as a recent target that regularizes the update of the online policy. A simplified sketch of both solutions follows this list.
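The sketch below illustrates both solutions in a simplified form: a staleness check that admits only trajectories whose policy version is at most η behind the trainer, and a per-token decoupled PPO loss that importance-weights with respect to the behavior policy while clipping against the proximal policy. Tensor shapes and function signatures are assumptions for illustration, not AReaL's exact implementation.

```python
import torch


def admit_sample(sample_version: int, trainer_version: int, max_staleness: int) -> bool:
    """Solution 1 (sketch): only train on data at most eta versions behind the trainer."""
    return trainer_version - sample_version <= max_staleness


def decoupled_ppo_loss(logp_online, logp_proximal, logp_behavior, advantages, eps=0.2):
    """Solution 2 (sketch): per-token decoupled PPO objective.

    logp_*     -- log-probs of the taken tokens under the online policy, the proximal
                  (recent target) policy, and the behavior policy that actually
                  generated the tokens; shape [T].
    advantages -- per-token advantage estimates; shape [T].
    """
    # Importance weight correcting for the (possibly stale, possibly mixed-version)
    # behavior policy that produced the trajectory segments.
    behavior_weight = torch.exp(logp_proximal - logp_behavior).detach()
    # The clipped ratio is taken w.r.t. the proximal policy rather than the behavior policy.
    ratio = torch.exp(logp_online - logp_proximal.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    objective = behavior_weight * torch.minimum(ratio * advantages, clipped_ratio * advantages)
    return -objective.mean()
```

In this view, the `max_head_offpolicyness` hyperparameter listed in the training recipe below plays the role of η.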
Validating Asynchronous AReaL
We first evaluate Asynchronous AReaL on a math task to validate our system and algorithmic choices. Building on our previous release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and AReaL-boba-106k as the training dataset.
End-to-End Performance Comparison
We compared synchronous and asynchronous training on 1.5B and 7B models. Under equal resource constraints and training steps, the asynchronous system was over twice as fast as the synchronous one. Evaluations on AIME24 confirmed that this acceleration does not compromise performance.
Model (1.5B) | AIME24 | # Nodes | PPO Steps | Training Hours |
---|---|---|---|---|
Base model | 29.3 | - | - | - |
Synchronous AReaL | 42.0 | 16 | 1000 | 41.0 |
Asynchronous AReaL | 42.2 | 16 | 1000 | 14.8 |
Model (7B) | AIME24 | # Nodes | PPO Steps | Training Hours |
---|---|---|---|---|
Base model | 54.3 | - | - | - |
Synchronous AReaL | 63.0 | 24 | 1000 | 57.7 |
Asynchronous AReaL | 63.1 | 24 | 1000 | 25.4 |
Table 2: End-to-end comparison between synchronous and asynchronous AReaL. Following our last release, AReaL-boba, we adopt R1-Distill-Qwen as the base model and AReaL-boba-106k as the training data. Each score is the average over 32 sampled responses.
Ablation Study
System Ablations
We ablate interruptible generation and report the resulting generation throughput in Fig.5. Without interruptible generation, the controller must wait for the longest response in each batch. Interruptible generation yields a 12% and 17% throughput increase for the 1.5B and 7B models, respectively, on 4 nodes.
Fig.5 Ablation study of interruptible generation.
Algorithm Ablations
We vary the maximum allowed staleness η and compare configurations with and without the decoupled PPO objective. Fig.6 shows the learning curves after 1,000 PPO updates. Naive PPO fails to match the performance of the synchronous RL oracle (i.e., the η = 0 setting): even slight staleness can significantly degrade final performance, due to the improper clipping center and the policy changes that occur during interruptible generation. Moreover, increasing data staleness consistently degrades learning performance.
Fig.6 Ablation Study on Decoupled PPO Objective with DeepSeek-R1-Distill-Qwen-1.5B. Left: Learning curves with naive PPO. Right: Learning curves with decoupled PPO objective.
η | AIME 24 (naive PPO) | AIME 24 (+ Decoupled Objective) | AIME 25 (naive PPO) | AIME 25 (+ Decoupled Objective) |
---|---|---|---|---|
0 (oracle) | 42.0 | 42.0 | 32.9 | 32.9 |
1 | 41.8 | 42.1 | 30.7 | 31.9 |
2 | 40.0 | 41.8 | 32.1 | 32.5 |
4 | 23.3 | 42.2 | 23.1 | 32.0 |
8 | 35.7 | 41.0 | 27.8 | 31.1 |
16 | 35.8 | 38.7 | 26.2 | 32.5 |
∞ | 34.0 | 36.9 | 26.9 | 29.9 |
Table 3: Evaluation scores when varying data staleness, comparing performance with and without the decoupled objective.
In practice, the effect of staleness can depend on many factors, such as the learning rate, batch size, number of mini-batches, generation length, and the RL algorithm itself. Here we aim to illustrate the impact of staleness on the training performance of an asynchronous RL system, and to show that staleness control together with the decoupled PPO objective can largely mitigate its negative impact. We follow a large-batch setting with `batch_size=512`, `n_rollouts=16`, `n_mini_batch=4`, `max_new_tokens=8192`, and `lr=2e-5`. Staleness is defined as the gap between model versions, where the model version increases by 1 whenever a batch of `batch_size` × `n_rollouts` samples has been trained on.
In contrast, the decoupled PPO objective substantially improves training stability when handling stale data. Notably, even with the decoupled objective, unbounded staleness (η → ∞) still results in inferior performance compared with the zero-staleness oracle. When properly constrained, moderate staleness (e.g., η ≤ 4) has minimal impact on final performance while significantly accelerating training through the asynchronous pipeline, as demonstrated in Tab.3 and Fig.7.
Fig.7 The relationship between η and training throughput. Larger η leads to higher throughput.
Training Recipe with Asynchronous AReaL
Dataset Curation
We train the 8B and 14B models using open-source data from DeepCoder. After deduplication and quality filtering, approximately 7,600 coding problems remained. Additionally, we have around 10,000 internal data entries that can further improve the models' performance; we are actively working on open-sourcing this portion of the data.
Rollout Strategy
During the rollout phase, the LLM generates 16 responses per question. To minimize output truncation, we set the maximum generation length to 27K tokens.
Filtering Strategy
The design of our asynchronous system makes it natural to filter rollouts on the fly. Specifically, we exclude questions for which all 16 answers are either entirely correct or entirely incorrect, as sketched below.
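A minimal sketch of this group-level filter is shown below. We assume each question's rollouts have already been graded as correct or incorrect; the thresholds mirror the `success_rate_ub` / `success_rate_lb` hyperparameters listed in the next table, though the exact boundary semantics here are an assumption.

```python
def keep_question(correctness, lb=0.05, ub=0.95):
    """Keep a question only if its rollout group is neither (almost) all correct
    nor (almost) all incorrect, i.e., its success rate lies within [lb, ub]."""
    success_rate = sum(correctness) / len(correctness)
    return lb <= success_rate <= ub


# Example with 16 rollouts per question:
assert keep_question([1] * 5 + [0] * 11)   # 5/16 ~ 0.31 -> kept
assert not keep_question([1] * 16)         # 16/16 correct -> filtered out
assert not keep_question([0] * 16)         # 0/16 correct  -> filtered out
```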
Key Hyperparameters
Parameter | Value |
---|---|
PPO Minibatches | 1 |
Learning Rate | 2e-5 |
Adam ε | 1e-5 |
Batch Size | 2,048 |
max_head_offpolicyness η | 16 |
success_rate_ub | 0.95 |
success_rate_lb | 0.05 |
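For reference, these settings might be collected into a training configuration along the following lines; the key names and the exact maximum generation length are illustrative rather than AReaL's actual config schema.

```python
# Hypothetical configuration sketch mirroring the table above; key names are
# illustrative and may differ from AReaL's actual configuration schema.
train_config = {
    "ppo_minibatches": 1,
    "lr": 2e-5,
    "adam_eps": 1e-5,
    "batch_size": 2048,
    "n_rollouts": 16,              # responses sampled per question (see Rollout Strategy)
    "max_new_tokens": 27 * 1024,   # ~27K-token generation limit (exact value illustrative)
    "max_head_offpolicyness": 16,  # maximum permitted staleness eta
    "success_rate_ub": 0.95,       # filter groups with success rate above this bound
    "success_rate_lb": 0.05,       # filter groups with success rate below this bound
}
```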