Quickstart
This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly installed dependencies and set up the runtime environment before proceeding.
Option 1: Using AReaLite
AReaLite is an RL training framework that provides the same functionality as AReaL, but is much easier to use, customize, and understand. It does not depend on AReaL except for some common core utilities such as logging.
We provide usage examples in the `examples/arealite` folder. To launch an experiment that trains your LLM to solve GSM8k math problems, run the following command:
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml
You can modify any options in `examples/arealite/configs/gsm8k_grpo.yaml`, such as the base model and the hyperparameters. Note that this example does not support changing the dataset through configuration modifications alone: to use another dataset, modify the dataset processing logic in the training script `examples/arealite/gsm8k_grpo.py` using the HuggingFace `datasets` package, as sketched below.
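As a rough illustration, swapping in a different dataset mostly amounts to changing the `load_dataset` call and mapping its fields onto the prompt/answer format the training workflow expects. The snippet below is only a hedged sketch: the helper name `load_my_dataset` and the `prompt`/`answer` field names are assumptions for illustration, not the actual structure of `gsm8k_grpo.py`.

```python
# Hypothetical sketch of a dataset swap; gsm8k_grpo.py may organize its data
# loading differently, so adapt this to the script's actual structure.
from datasets import load_dataset


def load_my_dataset(split: str):
    # Replace "openai/gsm8k" with any Hugging Face dataset name or local path.
    ds = load_dataset("openai/gsm8k", "main", split=split)
    # Map each record onto the prompt/answer fields used downstream
    # (the field names here are illustrative).
    return ds.map(lambda x: {"prompt": x["question"], "answer": x["answer"]})
```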
Note: This command assumes you can connect to the Hugging Face Hub to download models and datasets. If the Hub is unreachable from your network, use hf-mirror; see the sketch below.
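If you need the mirror, one common approach (an assumption about your environment, not something this example configures for you) is to point the Hugging Face libraries at it through the `HF_ENDPOINT` environment variable, either exported in the shell before launching or set at the very top of the training script:

```python
# Route Hugging Face Hub downloads through hf-mirror. This must take effect
# before huggingface_hub / datasets / transformers are imported; exporting
# HF_ENDPOINT in the shell before launching works just as well.
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
```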
Option 2: Using the old version of AReaL
Dataset
Use `huggingface-cli` to download our open-source dataset:
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data
Note: The command above will display the path of the downloaded dataset. You'll need to pass this path to the training command.
Model
We train using open-source models available on Hugging Face Hub. You can either download the model in advance or use the model identifier when running the experiment.
# If you want to download it in advance
huggingface-cli download Qwen/Qwen3-1.7B
Refer to the official documentation for more information on using `huggingface-cli`.
Training
From the repository directory, run:
# examples/run_async_ppo.sh
python3 training/main_async_ppo.py \
n_nodes=1 n_gpus_per_node=8 \
allocation_mode=sglang.d4p1m1+d2p2m1 \
cluster.fileroot=/path/to/save/logs/checkpoints/ \
actor.type._class=qwen3 \
actor.path=Qwen/Qwen3-1.7B \
ref.type._class=qwen3 \
ref.path=Qwen/Qwen3-1.7B \
dataset.path=/path/to/boba_106k_0319.jsonl \
dataset.train_bs_n_seqs=32 \
group_size=8 \
ppo.gen.max_new_tokens=4096 \
ppo.ppo_n_minibatches=4 \
actor_train.mb_spec.max_tokens_per_mb=32768 \
actor_inf.mb_spec.max_tokens_per_mb=32768 \
max_concurrent_rollouts=16 \
max_head_offpolicyness=4
::::{important}
Running `main_async_ppo.py` with `ppo.recompute_logprob=False`, `ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` essentially replicates the behavior of synchronous PPO. Therefore, running synchronous PPO directly (i.e., `main_sync_ppo.py`) is usually not recommended: the asynchronous RL workflow is more stable and easier to customize.
::::
Command Line Options
To view all available options:
python3 training/main_sync_ppo.py --help
Configuration Parameters
- `experiment_name`: The name of your project.
- `trial_name`: The name of this trial in your project.
- `{actor|ref}.path`: The path to the model files.
- `dataset.path`: The path to the dataset JSONL file.
- `cluster.fileroot`: The root path for saving training outputs (logs and checkpoints).
- `n_nodes`: The number of nodes in the cluster.
- `n_gpus_per_node`: The number of GPUs per node.
- `allocation_mode`: The GPU allocation strategy and 3D parallelism configuration for the experiment. The format `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}` configures the parallel strategies for SGLang generation and for training, respectively. Generation and training use separate GPU sets, and the total GPU count must satisfy DP1×TP1×PP1 + DP2×TP2×PP2 = #GPUs. For example, `allocation_mode=sglang.d4p1m1+d2p2m1` in the command above allocates 4×1×1 + 2×2×1 = 8 GPUs, matching `n_nodes=1` and `n_gpus_per_node=8`.
Training Control
- `exp_ctrl.total_train_epochs`: Number of training epochs (complete passes over the dataset).
- `exp_ctrl.save_freq_{epochs|steps|secs}`: Frequency of saving model parameters to persistent storage. Set to null to disable saving.
- `exp_ctrl.ckpt_freq_{epochs|steps|secs}`: Frequency of saving temporary checkpoints for restart capability.
- `dataset.train_bs_n_seqs`: Training batch size, i.e., the number of prompts sampled per training iteration.
- `group_size`: Number of responses sampled per prompt. With the command above, each step therefore trains on 32 prompts × 8 responses = 256 sampled sequences.
Memory and Performance
- `{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`: Maximum number of tokens per mini-batch for forward/backward passes during reference model inference and actor model training. Reduce this value to avoid OOM errors.
- `max_concurrent_rollouts`: The maximum number of concurrent rollouts. SGLang will run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
Algorithm Configuration
- `max_head_offpolicyness`: The maximum allowed data staleness. A value of 0 recovers synchronous training; larger values increase generation throughput but degrade final performance. We recommend keeping this value at 8 or below.
- `ppo.recompute_logprob`: Whether to recompute proximal log probabilities for training. Defaults to True for asynchronous experiments and False for synchronous baselines.
- `ppo.use_decoupled_loss`: Use the decoupled loss to stabilize asynchronous training (see the sketch after this list). Defaults to True.
- `ppo.gen.max_new_tokens`: Maximum number of tokens to generate per prompt.
- `ppo.ppo_n_minibatches`: Number of mini-batches each batch is divided into for a PPO update.
- `success_rate_ub`: Upper bound on the success rate. Prompts with a higher success rate are filtered out.
- `success_rate_lb`: Lower bound on the success rate. Prompts with a lower success rate are filtered out.
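For intuition, the decoupled loss separates the stale behavior policy that generated a rollout from the recomputed proximal policy used for PPO clipping, and reweights the clipped objective by the ratio between the two. The sketch below is a schematic illustration under assumed per-token tensors; it is not AReaL's actual implementation.

```python
# Schematic sketch of a decoupled PPO loss (not AReaL's actual code).
# All inputs are assumed to be per-token tensors of shape [num_tokens].
import torch


def decoupled_ppo_loss(logp_new, logp_prox, logp_behave, advantages, clip_eps=0.2):
    # Clipping ratio taken against the recomputed proximal log-probabilities.
    ratio = torch.exp(logp_new - logp_prox)
    # Importance weight correcting for the stale behavior policy that
    # generated the rollout asynchronously.
    importance_weight = torch.exp(logp_prox - logp_behave).detach()
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(importance_weight * torch.minimum(surr1, surr2)).mean()
```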
Monitoring the Training Process
- We recommend using Weights & Biases (wandb) or SwanLab for monitoring. Run `wandb login` or `swanlab login`, or set the corresponding API key environment variable (`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or `swanlab.mode="cloud"` in your configuration to upload training statistics. If you cannot connect to the server, you can use `wandb.mode="offline"` or `swanlab.mode="local"` to save data locally without uploading.
- You can also use TensorBoard by setting the `tensorboard.path` parameter.
- The main log will be saved to `${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the statistics uploaded to wandb.
- If SwanLab is enabled, logs will be saved to the directory specified by `swanlab.logdir`.
Key Training Statistics
- `Epoch 1/5`: Indicates the total number of epochs required and the epoch currently being trained.
- `step 6/19`: Shows that the current epoch has 19 steps and the 6th step has just been completed.
- `global step 6`: Step count across all epochs.
- `ppo_actor/task_reward/avg`: Average reward of all sampled responses in this step. It should increase steadily during training and eventually stabilize.
- `ppo_actor/importance_weight/avg`: Average importance sampling ratio across all tokens in the PPO loss. It is typically close to 1.0.
- `ppo_actor/actor_clip_ratio/avg`: Ratio of clipped tokens to total tokens in the PPO loss. It is usually less than 0.1.
- `ppo_actor/actor_loss/avg`: The PPO loss value. It does not show a clear trend during training and should not be used as a performance indicator.
Next Steps
Evaluate your model or check the troubleshooting section if you encounter any issues.