# Quickstart (Legacy)
> **Note**: This is a quickstart guide for launching AReaL experiments with the legacy
> code in `realhf/`. We strongly recommend that users try AReaLite for a better
> experience. [Click here](quickstart.md) for the AReaLite quickstart guide!

This guide walks you through a simple example of training an LLM to solve math problems.
Please ensure you have properly
[installed dependencies and set up the runtime environment](installation.md) before
proceeding.
## Dataset
Use `huggingface-cli` to download our open-source dataset:
```bash
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data
```
> **Note**: The command above will display the path of the downloaded dataset. You'll
> need to pass this path to the training command.
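
If you prefer a predictable location over the default Hugging Face cache, `huggingface-cli download` also accepts a `--local-dir` flag. A minimal sketch (the target directory below is illustrative):
```bash
# Illustrative: place the dataset in a fixed directory so the value you pass to
# `dataset.path` is easy to predict. Adjust the target directory to your setup.
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data \
  --local-dir /path/to/data/AReaL-RL-Data
```
`dataset.path` in the training command should then point to the JSONL file inside that directory (e.g. `boba_106k_0319.jsonl`).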
## Model
We train using open-source models available on Hugging Face Hub. You can either download
the model in advance or use the model identifier when running the experiment.
```bash
# If you want to download it in advance
huggingface-cli download Qwen/Qwen3-1.7B
```
Refer to the
[official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for
more information on using `huggingface-cli`.
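
If you download the model in advance and want to reference it by path rather than by identifier, a `--local-dir` download works the same way as for the dataset (paths are illustrative):
```bash
# Illustrative: pin the model to a known directory...
huggingface-cli download Qwen/Qwen3-1.7B --local-dir /path/to/models/Qwen3-1.7B
# ...then pass actor.path=/path/to/models/Qwen3-1.7B (and the same for ref.path)
# instead of the Hub identifier in the training command below.
```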
## Training
From the repository directory, run:
```bash
# examples/run_async_ppo.sh
python3 training/main_async_ppo.py \
n_nodes=1 n_gpus_per_node=8 \
allocation_mode=sglang.d4p1m1+d2p2m1 \
cluster.fileroot=/path/to/save/logs/checkpoints/ \
actor.type._class=qwen3 \
actor.path=Qwen/Qwen3-1.7B \
ref.type._class=qwen3 \
ref.path=Qwen/Qwen3-1.7B \
dataset.path=/path/to/boba_106k_0319.jsonl \
dataset.train_bs_n_seqs=32 \
group_size=8 \
ppo.gen.max_new_tokens=4096 \
ppo.ppo_n_minibatches=4 \
actor_train.mb_spec.max_tokens_per_mb=32768 \
actor_inf.mb_spec.max_tokens_per_mb=32768 \
max_concurrent_rollouts=16 \
max_head_offpolicyness=4
```
::::{important}
Running `main_async_ppo.py` with `ppo.recompute_logprob=False`,
`ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` will essentially
replicate the behavior of synchronous PPO. Therefore, it's usually not recommended to
run synchronous PPO directly (i.e., `main_sync_ppo.py`). The workflow of asynchronous RL
is more stable and easier to customize.
::::
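
For reference, those overrides would look like this on the command line; this is only a sketch of the flags named above, not a recommended configuration:
```bash
# Sketch only: makes the asynchronous entry point behave like synchronous PPO.
python3 training/main_async_ppo.py \
    <other arguments as in the example above> \
    ppo.recompute_logprob=False \
    ppo.use_decoupled_loss=False \
    max_head_offpolicyness=0
```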
## Command Line Options
To view all available options:
```bash
python3 training/main_async_ppo.py --help
```
### Configuration Parameters
- **`experiment_name`**: The name of your project.
- **`trial_name`**: The name of this trial in your project.
- **`{actor|ref}.path`**: The path to the model files.
- **`dataset.path`**: The path to the dataset JSONL file.
- **`cluster.fileroot`**: The root path for saving training outputs (logs and
checkpoints).
- **`n_nodes`**: The number of nodes in the cluster.
- **`n_gpus_per_node`**: The number of GPUs per node.
- **`allocation_mode`**: The GPU allocation strategy and 3D parallelism configuration
for the experiment. Format:
  - `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures the parallel
    strategies for SGLang generation and for training, respectively. Generation and
    training use separate GPU sets, and the counts must satisfy
    DP1×TP1×PP1 + DP2×TP2×PP2 = total number of GPUs (see the worked example after
    this list).
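
As a concrete check, here is how the `allocation_mode` from the example command accounts for the single node's 8 GPUs (d = data parallel, m = tensor parallel, p = pipeline parallel):
```bash
# allocation_mode=sglang.d4p1m1+d2p2m1 with n_nodes=1, n_gpus_per_node=8
#   sglang.d4p1m1 -> SGLang generation: 4 (DP) x 1 (TP) x 1 (PP) = 4 GPUs
#   d2p2m1        -> training:          2 (DP) x 1 (TP) x 2 (PP) = 4 GPUs
#   total: 4 + 4 = 8 GPUs = n_nodes * n_gpus_per_node
```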
### Training Control
- **`exp_ctrl.total_train_epochs`**: Number of training epochs (complete dataset
iterations).
- **`exp_ctrl.save_freq_{epochs|steps|secs}`**: Frequency for saving model parameters to
persistent storage. Set to null to disable saving.
- **`exp_ctrl.ckpt_freq_{epochs|steps|secs}`**: Frequency for saving temporary
parameters for restart capability.
- **`dataset.train_bs_n_seqs`**: Training batch size (number of prompts sampled per
training iteration).
- **`group_size`**: Number of responses sampled per prompt (see the arithmetic sketch
after this list).
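
Putting these two options together, the example command above collects the following amount of data per step (simple arithmetic, not measured output):
```bash
# dataset.train_bs_n_seqs=32  -> 32 prompts sampled per training step
# group_size=8                -> 8 responses generated per prompt
# => 32 * 8 = 256 responses per step, which ppo.ppo_n_minibatches=4 later
#    splits into 4 mini-batches for the PPO update
```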
### Memory and Performance
- **`{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`**: Maximum tokens per
mini-batch for forward/backward passes during reference/actor model inference and actor
model training. Reduce this value to avoid OOM errors (see the sketch after this list).
- **`max_concurrent_rollouts`**: The maximum number of concurrent rollouts. SGLang will
run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
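
If you run into out-of-memory errors with the example settings, halving the token budget is a reasonable first attempt; the exact values depend on your hardware and the numbers below are only a suggestion:
```bash
# Illustrative OOM mitigation: append these overrides to the training command.
#   actor_train.mb_spec.max_tokens_per_mb=16384 \
#   actor_inf.mb_spec.max_tokens_per_mb=16384 \
#   ref_inf.mb_spec.max_tokens_per_mb=16384
```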
### Algorithm Configuration
- **`max_head_offpolicyness`**: The allowed maximum data staleness. 0 recovers
synchronous training. A large value will increase generation throughput but degrade
final performance. We recommend keeping this value at 8 or below.
- **`ppo.recompute_logprob`**: Whether to compute proximal log probabilities for
training. Defaults to True for asynchronous experiments and False for synchronous
baselines.
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training.
Defaults to True.
- **`ppo.gen.max_new_tokens`**: Maximum tokens to generate per prompt.
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO
update.
- **`success_rate_ub`**: Upper bound on the per-prompt success rate. Prompts with a
higher success rate will be filtered out.
- **`success_rate_lb`**: Lower bound on the per-prompt success rate. Prompts with a
lower success rate will be filtered out (see the illustration after this list).
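
As a hypothetical illustration (the thresholds are not tuned defaults), overrides that drop prompts the model almost always or never solves, while allowing moderate staleness, could look like:
```bash
# Illustrative values only; append to the training command if desired.
#   max_head_offpolicyness=4 \
#   success_rate_ub=0.95 \
#   success_rate_lb=0.05
```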
## Monitoring the Training Process
- We recommend using [Weights & Biases (wandb)](https://github.com/wandb/wandb) or
[SwanLab](https://github.com/SwanHubX/SwanLab) for monitoring. Run `wandb login` or
`swanlab login`, or set the corresponding API key environment variable
(`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or
`swanlab.mode="cloud"` in your configuration to upload training statistics (see the
example below). If you cannot reach the server, use `wandb.mode="offline"` or
`swanlab.mode="local"` to save data locally without uploading.
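
A minimal way to wire this up for the example run, assuming the same command-line override style works for the logging options:
```bash
# Illustrative: upload statistics for this run to wandb.
export WANDB_API_KEY=your-key-here   # or run `wandb login` once beforehand
python3 training/main_async_ppo.py \
    <other arguments as in the example above> \
    wandb.mode=online
```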
You can also use TensorBoard by setting the `tensorboard.path` parameter.
The main log will be saved to
`${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the
statistics uploaded to wandb.
If SwanLab is enabled, logs will be saved to the directory specified by
`swanlab.logdir`.
### Key Training Statistics
- **`Epoch 1/5`**: Indicates the total epochs required and the current epoch being
trained.
- **`step 6/19`**: Shows that the current epoch has 19 steps, with the 6th step just
completed.
- **`global step 6`**: Step count across all epochs.
- **`ppo_actor/task_reward/avg`**: Average reward value of all sampled responses in this
step. This should steadily increase during training and eventually stabilize.
- **`ppo_actor/importance_weight/avg`**: Average importance sampling ratio across all
tokens in the PPO loss. This is typically close to 1.0.
- **`ppo_actor/actor_clip_ratio/avg`**: Ratio of clipped tokens in PPO loss to total
tokens. This is usually less than 0.1.
- **`ppo_actor/actor_loss/avg`**: PPO loss value. **This does not show clear trends
during training** and should not be used as a performance indicator.
## Next Steps
[Evaluate your model](eval.md) or check the
[troubleshooting section](troubleshooting.md) if you encounter any issues.