Quickstart
This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly installed dependencies and set up the runtime environment before proceeding.
Option 1: Using AReaLite
AReaLite is an RL training framework that provides the same functionality as AReaL, but is much easier to use, customize, and understand. It does not depend on AReaL except for some common core utilities such as logging.
We provide usage examples in the `examples/arealite` folder. To launch an experiment that trains your LLM to solve GSM8k math problems, run the following command:
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml
You can modify any options in `examples/arealite/configs/gsm8k_grpo.yaml`, such as the base model and the hyperparameters. Note that this example does not support changing the dataset through configuration modifications alone: to use another dataset, modify the dataset processing logic in the training script `examples/arealite/gsm8k_grpo.py` using the HuggingFace `datasets` package, as sketched below.
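As a rough illustration, swapping in a different dataset mostly amounts to changing the `load_dataset` call and mapping its fields onto the prompt/answer format the training workflow expects. The snippet below is only a hedged sketch: the helper name `load_my_dataset` and the `prompt`/`answer` field names are assumptions for illustration, not the actual structure of `gsm8k_grpo.py`.

```python
# Hypothetical sketch of a dataset swap; gsm8k_grpo.py may organize its data
# loading differently, so adapt this to the script's actual structure.
from datasets import load_dataset


def load_my_dataset(split: str):
    # Replace "openai/gsm8k" with any Hugging Face dataset name or local path.
    ds = load_dataset("openai/gsm8k", "main", split=split)
    # Map each record onto the prompt/answer fields used downstream
    # (the field names here are illustrative).
    return ds.map(lambda x: {"prompt": x["question"], "answer": x["answer"]})
```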
Note: This command assumes you can connect to the Hugging Face Hub to download models and datasets. If the Hub is unreachable from your network, use hf-mirror; see the sketch below.
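If you need the mirror, one common approach (an assumption about your environment, not something this example configures for you) is to point the Hugging Face libraries at it through the `HF_ENDPOINT` environment variable, either exported in the shell before launching or set at the very top of the training script:

```python
# Route Hugging Face Hub downloads through hf-mirror. This must take effect
# before huggingface_hub / datasets / transformers are imported; exporting
# HF_ENDPOINT in the shell before launching works just as well.
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
```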
Option 2: Using the old version of AReaL
Dataset
Use `huggingface-cli` to download our open-source dataset:
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data
Note: The command above will display the path of the downloaded dataset. You'll need to pass this path to the training command.
Model
We train using open-source models available on Hugging Face Hub. You can either download the model in advance or use the model identifier when running the experiment.
# If you want to download it in advance
huggingface-cli download Qwen/Qwen3-1.7B
Refer to the official documentation for more information on using `huggingface-cli`.
Training
From the repository directory, run:
# examples/run_async_ppo.sh
python3 training/main_async_ppo.py \
n_nodes=1 n_gpus_per_node=8 \
allocation_mode=sglang.d4p1m1+d2p2m1 \
cluster.fileroot=/path/to/save/logs/checkpoints/ \
actor.type._class=qwen3 \
actor.path=Qwen/Qwen3-1.7B \
ref.type._class=qwen3 \
ref.path=Qwen/Qwen3-1.7B \
dataset.path=/path/to/boba_106k_0319.jsonl \
dataset.train_bs_n_seqs=32 \
group_size=8 \
ppo.gen.max_new_tokens=4096 \
ppo.ppo_n_minibatches=4 \
actor_train.mb_spec.max_tokens_per_mb=32768 \
actor_inf.mb_spec.max_tokens_per_mb=32768 \
max_concurrent_rollouts=16 \
max_head_offpolicyness=4
::::{important}
Running `main_async_ppo.py` with `ppo.recompute_logprob=False`, `ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` essentially replicates the behavior of synchronous PPO. Therefore, running synchronous PPO directly (i.e., `main_sync_ppo.py`) is usually not recommended: the asynchronous RL workflow is more stable and easier to customize.
::::
Command Line Options
To view all available options:
python3 training/main_sync_ppo.py --help
Configuration Parameters
- `experiment_name`: The name of your project.
- `trial_name`: The name of this trial in your project.
- `{actor|ref}.path`: The path to the model files.
- `dataset.path`: The path to the dataset JSONL file.
- `cluster.fileroot`: The root path for saving training outputs (logs and checkpoints).
- `n_nodes`: The number of nodes in the cluster.
- `n_gpus_per_node`: The number of GPUs per node.
- `allocation_mode`: The GPU allocation strategy and 3D parallelism configuration for the experiment. The format `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}` configures the parallel strategies for SGLang generation and for training, respectively. Generation and training use separate GPU sets, and the total GPU count must satisfy DP1×TP1×PP1 + DP2×TP2×PP2 = #GPUs. For example, `allocation_mode=sglang.d4p1m1+d2p2m1` in the command above allocates 4×1×1 + 2×2×1 = 8 GPUs, matching `n_nodes=1` and `n_gpus_per_node=8`.
Training Control
- `exp_ctrl.total_train_epochs`: Number of training epochs (complete passes over the dataset).
- `exp_ctrl.save_freq_{epochs|steps|secs}`: Frequency of saving model parameters to persistent storage. Set to null to disable saving.
- `exp_ctrl.ckpt_freq_{epochs|steps|secs}`: Frequency of saving temporary checkpoints for restart capability.
- `dataset.train_bs_n_seqs`: Training batch size, i.e., the number of prompts sampled per training iteration.
- `group_size`: Number of responses sampled per prompt. With the command above, each step therefore trains on 32 prompts × 8 responses = 256 sampled sequences.
Memory and Performance
- `{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`: Maximum number of tokens per mini-batch for forward/backward passes during reference model inference and actor model training. Reduce this value to avoid OOM errors.
- `max_concurrent_rollouts`: The maximum number of concurrent rollouts. SGLang will run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
Algorithm Configuration
- `max_head_offpolicyness`: The maximum allowed data staleness. A value of 0 recovers synchronous training; larger values increase generation throughput but degrade final performance. We recommend keeping this value at 8 or below.
- `ppo.recompute_logprob`: Whether to recompute proximal log probabilities for training. Defaults to True for asynchronous experiments and False for synchronous baselines.
- `ppo.use_decoupled_loss`: Use the decoupled loss to stabilize asynchronous training (see the sketch after this list). Defaults to True.
- `ppo.gen.max_new_tokens`: Maximum number of tokens to generate per prompt.
- `ppo.ppo_n_minibatches`: Number of mini-batches each batch is divided into for a PPO update.
- `success_rate_ub`: Upper bound on the success rate. Prompts with a higher success rate are filtered out.
- `success_rate_lb`: Lower bound on the success rate. Prompts with a lower success rate are filtered out.
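For intuition, the decoupled loss separates the stale behavior policy that generated a rollout from the recomputed proximal policy used for PPO clipping, and reweights the clipped objective by the ratio between the two. The sketch below is a schematic illustration under assumed per-token tensors; it is not AReaL's actual implementation.

```python
# Schematic sketch of a decoupled PPO loss (not AReaL's actual code).
# All inputs are assumed to be per-token tensors of shape [num_tokens].
import torch


def decoupled_ppo_loss(logp_new, logp_prox, logp_behave, advantages, clip_eps=0.2):
    # Clipping ratio taken against the recomputed proximal log-probabilities.
    ratio = torch.exp(logp_new - logp_prox)
    # Importance weight correcting for the stale behavior policy that
    # generated the rollout asynchronously.
    importance_weight = torch.exp(logp_prox - logp_behave).detach()
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(importance_weight * torch.minimum(surr1, surr2)).mean()
```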
Monitoring the Training Process
- We recommend using Weights & Biases (wandb) or SwanLab for monitoring. Run `wandb login` or `swanlab login`, or set the corresponding API key environment variable (`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or `swanlab.mode="cloud"` in your configuration to upload training statistics. If you cannot connect to the server, you can use `wandb.mode="offline"` or `swanlab.mode="local"` to save data locally without uploading.
- You can also use TensorBoard by setting the `tensorboard.path` parameter.
- The main log will be saved to `${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the statistics uploaded to wandb.
- If SwanLab is enabled, logs will be saved to the directory specified by `swanlab.logdir`.
Key Training Statistics
- `Epoch 1/5`: Indicates the total number of epochs required and the epoch currently being trained.
- `step 6/19`: Shows that the current epoch has 19 steps and the 6th step has just been completed.
- `global step 6`: Step count across all epochs.
- `ppo_actor/task_reward/avg`: Average reward of all sampled responses in this step. It should increase steadily during training and eventually stabilize.
- `ppo_actor/importance_weight/avg`: Average importance sampling ratio across all tokens in the PPO loss. It is typically close to 1.0.
- `ppo_actor/actor_clip_ratio/avg`: Ratio of clipped tokens to total tokens in the PPO loss. It is usually less than 0.1.
- `ppo_actor/actor_loss/avg`: The PPO loss value. It does not show a clear trend during training and should not be used as a performance indicator.
Next Steps
Evaluate your model or check the troubleshooting section if you encounter any issues.