AReaL/benchmark/verl_v0_3_0_post1_76084d3
Wei Fu ad6e5bd3fa
[Doc] Add verl benchmark scripts (#71)
2025-06-03 18:10:32 +08:00

README.md

Benchmark Results

We measure effective training throughput, defined as the number of tokens used in each RL training step divided by the time of that step. Note that since AReaL is asynchronous, some generated tokens may not be consumed by the trainer. We do not count these tokens in the throughput calculation.
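The definition above can be written as a small sketch. The function name and the numbers below are hypothetical and chosen only for illustration; they are not measured results. The point is that only the tokens consumed by the trainer enter the numerator, not everything the rollout workers generated.

```python
def effective_throughput(consumed_tokens: int, step_seconds: float) -> float:
    """Tokens used in one RL training step divided by that step's wall time."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return consumed_tokens / step_seconds


# Hypothetical numbers for illustration only (not benchmark results).
generated_tokens = 1_200_000  # produced by asynchronous generation
consumed_tokens = 1_000_000   # actually used by the trainer in this step
step_seconds = 50.0

# Generated-but-unconsumed tokens are deliberately left out of the numerator.
throughput = effective_throughput(consumed_tokens, step_seconds)
print(f"{throughput:.0f} tokens/s")  # prints "20000 tokens/s"
```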

We compare against the latest release of verl (v0.3.0.post1) as of May 7, 2025.

Throughput Comparison

How to Reproduce

verl

We provide code and instructions in this repo.

AReaL

Run build_cmd.py to generate the CLI command to run AReaL:

python3 benchmark/verl_v0_3_0_post1_76084d3/build_cmd.py --model-size 1 --ctx 32768 --n-nodes 4

The script prints the command that runs AReaL with DeepSeek-R1-Distill-Qwen-1.5B at a 32k context length (31k generation length) on 4 nodes (32 GPUs). --model-size accepts 1, 7, or 32, and --n-nodes accepts 4, 8, 16, 32, or 64.
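To sweep several configurations, the build_cmd.py invocations can be assembled programmatically. The following is a minimal sketch under these assumptions: the helper name and the placeholder script path are hypothetical, only the three flags shown above are used, and 8 GPUs per node is inferred from "4 nodes (32 GPUs)". Whether every model size and node count combination is a sensible benchmark point is not stated here, so treat the full grid as illustrative.

```python
import shlex
from itertools import product

MODEL_SIZES = [1, 7, 32]          # accepted --model-size values
NODE_COUNTS = [4, 8, 16, 32, 64]  # accepted --n-nodes values
GPUS_PER_NODE = 8                 # assumption inferred from "4 nodes (32 GPUs)"


def build_cmd_invocation(script_path, model_size, n_nodes, ctx=32768):
    """Return the shell line that calls build_cmd.py (hypothetical helper)."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}")
    if n_nodes not in NODE_COUNTS:
        raise ValueError(f"n_nodes must be one of {NODE_COUNTS}")
    args = [
        "python3", script_path,
        "--model-size", str(model_size),
        "--ctx", str(ctx),
        "--n-nodes", str(n_nodes),
    ]
    return shlex.join(args)


# Placeholder path: replace it with the real location of build_cmd.py.
script = "path/to/build_cmd.py"
for size, nodes in product(MODEL_SIZES, NODE_COUNTS):
    line = build_cmd_invocation(script, size, nodes)
    print(f"{nodes * GPUS_PER_NODE:>4} GPUs: {line}")
```

The sketch only prints the invocations; it does not run build_cmd.py or AReaL itself.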

The throughput can be derived from the log: extract the step time and the number of processed tokens, then divide the tokens by the time yourself.

# obtain the time
grep -ai "execution time" /path/to/main.log
# obtain the number of processed tokens
grep -ai "n_tokens" /path/to/main.log

Alternatively, these metrics can be found in wandb and tensorboard logs, named "time_perf/e2e" and "ppo_actor/n_tokens" respectively.
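The manual division can be scripted. The sketch below is not part of AReaL, and the exact log line format is an assumption: it supposes that each matching line contains the keyword followed by the numeric value, and that time and token entries appear once per step in the same order. The sample lines are invented for illustration; adjust the pattern to what main.log really contains.

```python
import re

# Matches integers, decimals, and scientific notation.
NUMBER = r"([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)"


def extract_values(lines, keyword):
    """Return the first number following `keyword` on every matching line."""
    pattern = re.compile(re.escape(keyword) + r"\D*?" + NUMBER, re.IGNORECASE)
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def per_step_throughput(lines):
    """Pair step times with token counts and divide (assumes equal counts)."""
    times = extract_values(lines, "execution time")
    tokens = extract_values(lines, "n_tokens")
    if len(times) != len(tokens):
        raise ValueError("mismatched number of time and token entries")
    return [n / t for n, t in zip(tokens, times)]


# ASSUMED log format, invented for this example.
sample_log = [
    "20250603-18:10:32 INFO execution time: 50.0s",
    "20250603-18:10:32 INFO ppo_actor/n_tokens: 1000000",
    "20250603-18:11:12 INFO execution time: 40.0s",
    "20250603-18:11:12 INFO ppo_actor/n_tokens: 900000",
]
print(per_step_throughput(sample_log))  # prints [20000.0, 22500.0]
```

For a real run, read the lines with something like open("/path/to/main.log", errors="ignore"), which plays the role of the -a flag in the grep commands.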

Settings

The key settings include:

  • Batch size: 512 prompts with 16 answers each, following DAPO and VAPO.
  • Maximum Generation Length: 31k (32k context length minus 1k prompt length).
  • Minimum Generation Length: 0. This should not be set to the maximum length because the generation length varies for each prompt.
  • Model: Distilled models from DeepSeek-R1, which have long-CoT reasoning abilities. Models without this behavior, such as Qwen2-7B, Qwen2.5-Math-7B, or randomly initialized weights, are not valid substitutes.
  • Dataset: Our math dataset or other similar open-source datasets. The prompts should trigger the model to generate a long-CoT trajectory.
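The arithmetic behind these settings can be checked with a short sketch. It assumes that "1k" means 1024 tokens and that 32k means the 32768 passed to --ctx; the constant names are illustrative. The last value is only a theoretical ceiling, since the actual generation length varies per prompt.

```python
PROMPTS_PER_BATCH = 512
ANSWERS_PER_PROMPT = 16
CONTEXT_LEN = 32768  # value passed to --ctx
PROMPT_LEN = 1024    # assumption: "1k prompt length" means 1024 tokens

# Each training step consumes one answer set per prompt.
sequences_per_step = PROMPTS_PER_BATCH * ANSWERS_PER_PROMPT

# Maximum generation length is the context left after the prompt.
max_generation_len = CONTEXT_LEN - PROMPT_LEN

# Upper bound on tokens per step if every sequence filled the whole context.
max_tokens_per_step = sequences_per_step * CONTEXT_LEN

print(sequences_per_step)   # prints 8192
print(max_generation_len)   # prints 31744
print(max_tokens_per_step)  # prints 268435456
```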