add quickstart

This commit is contained in:
晓雷 2025-07-21 18:48:10 +08:00
parent 4804b05637
commit 92495df9e3
4 changed files with 261 additions and 111 deletions


@@ -7,9 +7,13 @@ parts:
- caption: Tutorial
chapters:
- file: tutorial/installation
- file: tutorial/quickstart_arealite
- file: tutorial/quickstart
- file: tutorial/eval
- file: tutorial/troubleshooting
- caption: Getting Started with AReaLite
chapters:
- file: arealite/gsm8k_grpo
- caption: References
chapters:
- file: references/benchmark

132
docs/arealite/gsm8k_grpo.md Normal file

@@ -0,0 +1,132 @@
# Code Walkthrough: Running GRPO on GSM8K Dataset
In this guide, we walk through the code of an example that runs the GRPO algorithm on the GSM8K dataset, with training script [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py) and configuration file [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
## Launching the Experiment
As shown in the [Quickstart Guide](../tutorial/quickstart_arealite.md), an AReaLite experiment is launched by a standalone launcher with one of the following commands:
```bash
# Local Launcher
python -m arealite.launcher.local <training script> --config <configuration file> <cli args>
# Ray Launcher
python -m arealite.launcher.ray <training script> --config <configuration file> <cli args>
# Slurm Launcher
python -m arealite.launcher.slurm <training script> --config <configuration file> <cli args>
```
In AReaLite, the **training script** is an **SPMD Python script** that serves as the entry point of the experiment.
The launcher runs the training script with its distributed backend (`subprocess` for `LocalLauncher`, `ray.remote` for `RayLauncher`, `srun` for `SlurmLauncher`).
In addition to the training script, the launcher is also responsible for running inference servers (currently only `SGLangServer` is supported).
The distributed launchers (`RayLauncher` and `SlurmLauncher`) run inference servers through a wrapper, [arealite/launcher/sglang_server.py](../../arealite/launcher/sglang_server.py), which manages addresses and ports in distributed settings.
The **configuration file** is a YAML file that sets the options defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
Options can be overridden with CLI arguments such as `actor.path=Qwen/Qwen3-1.7B` and `+sglang.attention_backend=triton`.
The training script uses [load_expr_config(args, config_cls)](../../arealite/api/cli_args.py#L886) to parse the YAML file and CLI overrides into the config class defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
In the example:
```python
config, _ = load_expr_config(args, GRPOConfig)
config: GRPOConfig
```
## Loading and Pre-processing Dataset
In our example, we directly use tools from the `datasets` and `torchdata` packages to load and pre-process the dataset into our dataloader.
We first download `openai/gsm8k` from the Hugging Face Hub, split it by data-parallel rank, and then map it into the format we want.
```python
from datasets import Dataset, load_dataset
from datasets.distributed import split_dataset_by_node


def process_gsm8k_rl_dataset(dataset: Dataset):
    # Convert each sample into the chat-message format consumed by the rollout workflow.
    def process(sample):
        messages = [{"role": "user", "content": sample["question"]}]
        return {"messages": messages}

    dataset = dataset.map(process).remove_columns(["question"])
    return dataset


def get_gsm8k_dataset(split, rank, world_size):
    dataset = load_dataset(path="openai/gsm8k", name="main", split=split)
    # Shard the dataset across data-parallel ranks.
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)
    return process_gsm8k_rl_dataset(dataset)
```
Then we prepare the training and evaluation dataloaders with `torchdata.StatefulDataLoader`; they will serve as inputs for data rollout.
```python
train_dataloader = torchdata.StatefulDataLoader(
    get_gsm8k_dataset("train", rank, world_size),
    batch_size=config.train_dataset.batch_size // world_size,
    shuffle=config.train_dataset.shuffle,
    num_workers=config.train_dataset.num_workers,
    collate_fn=lambda x: x,
    drop_last=config.train_dataset.drop_last,
)
valid_dataloader = ...
```
If you wish to use your own Hugging Face datasets or datasets from local storage, please refer to [Customization: Dataset](../customization/dataset.md) for further details.
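For instance, here is a minimal sketch (not from the example code; the file path and the `prompt` column name are hypothetical) of a loader for a local JSONL dataset that keeps the same interface as `get_gsm8k_dataset` above:
```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node


def get_my_dataset(split, rank, world_size):
    # Hypothetical local JSONL files, e.g. /path/to/my_train.jsonl and /path/to/my_test.jsonl.
    dataset = load_dataset("json", data_files=f"/path/to/my_{split}.jsonl", split="train")
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)
    # Map your own prompt column into the "messages" format expected by the RLVR workflow.
    return dataset.map(
        lambda sample: {"messages": [{"role": "user", "content": sample["prompt"]}]}
    )
```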
## Rollout
Next, we prepare for data rollout. The life cycle of a piece of data is controlled by an `RLVRWorkflow`, which defines how a prompt is turned into complete rollout data with the fields required for training. Note that a workflow can involve multiple turns of generation, tool calling, and reward computation. In this example, we show a single-turn RLVR workflow with a math reward function.
First, we define a math reward function for GSM8K.
```python
from ... import extract_answer, extract_solution


def gsm8k_reward_fn(prompt, completions, prompt_ids, completion_ids, answer, **kwargs):
    # Return 1 if the extracted final answer matches the reference solution, otherwise 0.
    sol = extract_answer(completions, data_name="math")
    ans = extract_solution(solution_str=answer, method="strict")
    if sol is None:
        return 0
    if ans is None:
        return 0
    return int(sol.strip() == ans.strip())
```
Then we initialize the `RLVRWorkflow` with this reward function.
```python
tokenizer = load_hf_tokenizer(config.tokenizer_path)
workflow = RLVRWorkflow(
    reward_fn=gsm8k_reward_fn,
    gconfig=config.gconfig,
    tokenizer=tokenizer,
)
```
As for generation, we assume that the launcher has already started `SGLangServer` instances and set the environment variable `AREAL_LLM_SERVER_ADDRS` to tell the training script the addresses and ports of the inference servers to connect to.
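As a rough illustration (assuming the variable holds a comma-separated list of `host:port` entries; the exact format is defined by the launcher, so treat this as a sketch rather than the engine's actual parsing code), the training side can read it like this:
```python
import os

# Read the inference server addresses exported by the launcher (format assumed here).
addrs = [addr for addr in os.environ.get("AREAL_LLM_SERVER_ADDRS", "").split(",") if addr]
print(f"Connecting to {len(addrs)} inference server(s): {addrs}")
```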
In the next step, we initialize a `RemoteSGLangEngine` in the training script. Its APIs fall into two categories:
- Sending requests, such as generation and weight updates, to remote inference servers and returning the replies. Related APIs include `agenerate` and `update_weights`.
- Executing rollout workflows. The engine manages the data streaming through a workflow to control the parameter version gap between generation and training (data off-policyness), and collates completed rollouts into a batched training sample. Related APIs include `prepare_batch` and `rollout_batch`.
```python
rollout = RemoteSGLangEngine(config.rollout)
rollout.initialize()
eval_rollout = ...

data_generator = iter(train_dataloader)
for global_step in range(max_steps):
    # rollout batched training data for current step
    if config.async_training:
        batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    else:
        try:
            data = next(data_generator)
        except StopIteration:
            data_generator = iter(train_dataloader)
            data = next(data_generator)
        batch = rollout.rollout_batch(data, workflow=workflow)
```
If you want to customize your own rollout workflow with custom reward functions or agentic tool calling, please refer to [Customization: Rollout Workflows](agent.md).
## Training
## Utilities

docs/tutorial/quickstart.md

@@ -1,35 +1,10 @@
# Quickstart
# Quickstart (Legacy)
This guide walks you through a simple example of training an LLM to solve math problems.
Please ensure you have properly
[installed dependencies and set up the runtime environment](installation.md) before
proceeding.
> **Note**: This is a quickstart guide for launching AReaL experiments with the legacy code in [realhf/](../../realhf/). We strongly recommend users try AReaLite for a better experience. [Click here](quickstart_arealite.md) for the AReaLite quickstart guide!
## Option 1: Using *AReaLite*
This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
AReaLite is an RL training framework that provides the same functionality as AReaL, but
is much easier to use, customize, and understand. It does not depend on AReaL except for
some common core utilities such as logging.
We provide usage examples in the `examples/arealite` folder. To launch an experiment
that trains your LLM to solve GSM8k math problems, run the following command:
```bash
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml
```
You can modify any options in `examples/arealite/configs/gsm8k_grpo.yaml`, such as the
base model to use and hyperparameters. Note that this example does not support changing
the dataset through configuration modifications. Users can modify the dataset processing
logic using the HuggingFace `datasets` package in the training script
`examples/arealite/gsm8k_grpo.py` to use other datasets.
> **Note**: This command assumes you can connect to the HuggingFace Hub to download
> models and datasets. Use [hf-mirror](https://hf-mirror.com/) if necessary.
## Option 2: Using the old version of AReaL
### Dataset
## Dataset
Use `huggingface-cli` to download our open-source dataset:
@@ -37,24 +12,20 @@ Use `huggingface-cli` to download our open-source dataset:
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data
```
> **Note**: The command above will display the path of the downloaded dataset. You'll
> need to pass this path to the training command.
> **Note**: The command above will display the path of the downloaded dataset. You'll need to pass this path to the training command.
### Model
## Model
We train using open-source models available on Hugging Face Hub. You can either download
the model in advance or use the model identifier when running the experiment.
We train using open-source models available on Hugging Face Hub. You can either download the model in advance or use the model identifier when running the experiment.
```bash
# If you want to download it in advance
huggingface-cli download Qwen/Qwen3-1.7B
```
Refer to the
[official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for
more information on using `huggingface-cli`.
Refer to the [official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for more information on using `huggingface-cli`.
### Training
## Training
From the repository directory, run:
@@ -79,13 +50,11 @@ python3 training/main_async_ppo.py \
max_head_offpolicyness=4
```
::::{important} Running `main_async_ppo.py` with `ppo.recompute_logprob=False`,
`ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` will essentially
replicate the behavior of synchronous PPO. Therefore, it's usually not recommended to
run synchronous PPO directly (i.e., `main_sync_ppo.py`). The workflow of asynchronous RL
is more stable and easier to customize. ::::
::::{important}
Running `main_async_ppo.py` with `ppo.recompute_logprob=False`, `ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` will essentially replicate the behavior of synchronous PPO. Therefore, it's usually not recommended to run synchronous PPO directly (i.e., `main_sync_ppo.py`). The workflow of asynchronous RL is more stable and easier to customize.
::::
### Command Line Options
## Command Line Options
To view all available options:
@@ -93,97 +62,62 @@ To view all available options:
python3 training/main_sync_ppo.py --help
```
#### Configuration Parameters
### Configuration Parameters
- **`experiment_name`**: The name of your project.
- **`trial_name`**: The name of this trial in your project.
- **`{actor|ref}.path`**: The path to the model files.
- **`dataset.path`**: The path to the dataset JSONL file.
- **`cluster.fileroot`**: The root path for saving training outputs (logs and
checkpoints).
- **`cluster.fileroot`**: The root path for saving training outputs (logs and checkpoints).
- **`n_nodes`**: The number of nodes in the cluster.
- **`n_gpus_per_node`**: The number of GPUs per node.
- **`allocation_mode`**: The GPU allocation strategy and 3D parallelism configuration
for the experiment. Format:
- `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures parallel strategies
for SGLang generation and training respectively. Generation and training use
separate GPU sets, and the total GPU count must equal: DP1×TP1×PP1 + DP2×TP2×PP2 =
#GPUs.
- **`allocation_mode`**: The GPU allocation strategy and 3D parallelism configuration for the experiment. Format:
- `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures parallel strategies for SGLang generation and training respectively. Generation and training use separate GPU sets, and the total GPU count must equal: DP1×TP1×PP1 + DP2×TP2×PP2 = #GPUs.
#### Training Control
### Training Control
- **`exp_ctrl.total_train_epochs`**: Number of training epochs (complete dataset
iterations).
- **`exp_ctrl.save_freq_{epochs|steps|secs}`**: Frequency for saving model parameters to
persistent storage. Set to null to disable saving.
- **`exp_ctrl.ckpt_freq_{epochs|steps|secs}`**: Frequency for saving temporary
parameters for restart capability.
- **`dataset.train_bs_n_seqs`**: Training batch size (number of prompts sampled per
training iteration).
- **`exp_ctrl.total_train_epochs`**: Number of training epochs (complete dataset iterations).
- **`exp_ctrl.save_freq_{epochs|steps|secs}`**: Frequency for saving model parameters to persistent storage. Set to null to disable saving.
- **`exp_ctrl.ckpt_freq_{epochs|steps|secs}`**: Frequency for saving temporary parameters for restart capability.
- **`dataset.train_bs_n_seqs`**: Training batch size (number of prompts sampled per training iteration).
- **`group_size`**: Number of responses sampled per prompt.
#### Memory and Performance
### Memory and Performance
- **`{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`**: Maximum tokens per
mini-batch for forward/backward passes during reference model inference and actor
model training. Reduce this value to avoid OOM errors.
- **`max_concurrent_rollouts`**: The maximum number of concurrent rollouts. SGLang will
run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
- **`{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`**: Maximum tokens per mini-batch for forward/backward passes during reference model inference and actor model training. Reduce this value to avoid OOM errors.
- **`max_concurrent_rollouts`**: The maximum number of concurrent rollouts. SGLang will run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
#### Algorithm Configuration
### Algorithm Configuration
- **`max_head_offpolicyness`**: The allowed maximum data staleness. 0 recovers
synchronous training. A large value will increase generation throughput but degrade
final performance. We recommend keeping this value at 8 or below.
- **`ppo.recompute_logprob`**: Whether to compute proximal log probabilities for
training. Defaults to True for asynchronous experiments and False for synchronous
baselines.
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training.
Defaults to True.
- **`max_head_offpolicyness`**: The allowed maximum data staleness. 0 recovers synchronous training. A large value will increase generation throughput but degrade final performance. We recommend keeping this value at 8 or below.
- **`ppo.recompute_logprob`**: Whether to compute proximal log probabilities for training. Defaults to True for asynchronous experiments and False for synchronous baselines.
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training. Defaults to True.
- **`ppo.gen.max_new_tokens`**: Maximum tokens to generate per prompt.
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO
update.
- **`success_rate_ub`**: Upper bound of success rate. Prompts with a higher success rate
will be filtered out.
- **`success_rate_lb`**: Lower bound of success rate. Prompts with a lower success rate
will be filtered out.
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO update.
- **`success_rate_ub`**: Upper bound of success rate. Prompts with a higher success rate will be filtered out.
- **`success_rate_lb`**: Lower bound of success rate. Prompts with a lower success rate will be filtered out.
### Monitoring the Training Process
## Monitoring the Training Process
+ We recommend using [Weights & Biases (wandb)](https://github.com/wandb/wandb) or [SwanLab](https://github.com/SwanHubX/SwanLab) for monitoring—run `wandb login` or `swanlab login`, or set the corresponding environment variable API key (`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or `swanlab.mode="cloud"` in your configuration to upload training statistics. If you cannot connect to the server, you can also use `wandb.mode="offline"` or `swanlab.mode="local"` to save data locally without uploading.
- We recommend using [Weights & Biases (wandb)](https://github.com/wandb/wandb) or
[SwanLab](https://github.com/SwanHubX/SwanLab) for monitoring—run `wandb login` or
`swanlab login`, or set the corresponding environment variable API key
(`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or
`swanlab.mode="cloud"` in your configuration to upload training statistics. If you
cannot connect to the server, you can also use `wandb.mode="offline"` or
`swanlab.mode="local"` to save data locally without uploading.
You can also use TensorBoard by setting the `tensorboard.path` parameter.
The main log will be saved to
`${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the
statistics uploaded to wandb.
The main log will be saved to `${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the statistics uploaded to wandb.
If SwanLab is enabled, logs will be saved to the directory specified by
`swanlab.logdir`.
If SwanLab is enabled, logs will be saved to the directory specified by `swanlab.logdir`.
#### Key Training Statistics
### Key Training Statistics
- **`Epoch 1/5`**: Indicates the total epochs required and the current epoch being
trained.
- **`step 6/19`**: Shows that the current epoch has 19 steps, with the 6th step just
completed.
- **`Epoch 1/5`**: Indicates the total epochs required and the current epoch being trained.
- **`step 6/19`**: Shows that the current epoch has 19 steps, with the 6th step just completed.
- **`global step 6`**: Step count across all epochs.
- **`ppo_actor/task_reward/avg`**: Average reward value of all sampled responses in this
step. This should steadily increase during training and eventually stabilize.
- **`ppo_actor/importance_weight/avg`**: Average importance sampling ratio across all
tokens in the PPO loss. This is typically close to 1.0.
- **`ppo_actor/actor_clip_ratio/avg`**: Ratio of clipped tokens in PPO loss to total
tokens. This is usually less than 0.1.
- **`ppo_actor/actor_loss/avg`**: PPO loss value. **This does not show clear trends
during training** and should not be used as a performance indicator.
- **`ppo_actor/task_reward/avg`**: Average reward value of all sampled responses in this step. This should steadily increase during training and eventually stabilize.
- **`ppo_actor/importance_weight/avg`**: Average importance sampling ratio across all tokens in the PPO loss. This is typically close to 1.0.
- **`ppo_actor/actor_clip_ratio/avg`**: Ratio of clipped tokens in PPO loss to total tokens. This is usually less than 0.1.
- **`ppo_actor/actor_loss/avg`**: PPO loss value. **This does not show clear trends during training** and should not be used as a performance indicator.
## Next Steps
[Evaluate your model](eval.md) or check the
[troubleshooting section](troubleshooting.md) if you encounter any issues.
[Evaluate your model](eval.md) or check the [troubleshooting section](troubleshooting.md) if you encounter any issues.

docs/tutorial/quickstart_arealite.md Normal file

@@ -0,0 +1,80 @@
# Quickstart
Welcome to the AReaLite quickstart guide!
In this guide, we provide an example that runs an AReaLite experiment training an LLM on the GSM8K dataset with the GRPO algorithm and function-based rewards.
Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
## Running the Experiment (on a single node)
To run the experiment, you need a training script and a config YAML file.
- Training script: [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py)
- Config YAML: [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml)
Our training script will automatically download the dataset (`openai/gsm8k`) and the model (`Qwen/Qwen3-1.7B`) for you.
You do not need to prepare any dataset or model files before running the experiment.
To run the example with the default configuration, execute the following command from the repository directory:
```bash
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml experiment_name=<your experiment name> trial_name=<your trial name>
```
> **Note**: The command above uses `LocalLauncher`, which only works on a single node (`cluster.n_nodes == 1`). For launching distributed experiments, please check out [Distributed Experiments with Ray or Slurm](#distributed-experiments-with-ray-or-slurm).
### Modifying configuration
All available options for experiment configuration are listed in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
To change the experiment configuration, including models, resource allocation, and algorithm options, you can:
1. Directly modify the config YAML file at [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
2. Add command-line options. For entries that exist in the config YAML, append the options directly to your command, for example `actor.path=Qwen/Qwen3-1.7B`. For options defined in `cli_args.py` but not present in the YAML, add them with a `+` prefix, for example `+sglang.attention_backend=triton`.
For example, here is the command to launch a customized configuration based on our GSM8K GRPO example:
```bash
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
--config examples/arealite/configs/gsm8k_grpo.yaml \
experiment_name=<your experiment name> \
trial_name=<your trial name> \
allocation_mode=sglang.d2p1t1+d2p1t1 \
cluster.n_nodes=1 \
cluster.n_gpus_per_node=4 \
gconfig.max_new_tokens=2048 \
train_dataset.batch_size=1024 \
+sglang.attention_backend=triton
```
::::{important}
Since we are working on a refactor from legacy AReaL to AReaLite, some of the available options in AReaLite differ slightly from those in legacy AReaL. For convenience, we provide a **config converter** that transfers an old AReaL config into an AReaLite YAML file. [Click here](xxx) for the usage of the **config converter**.
::::
## Distributed Experiments with Ray or Slurm
AReaLite also provides standalone Ray and Slurm launchers for distributed experiments. Once you have properly set up your Ray or Slurm cluster, you can launch your experiment with `arealite.launcher.ray` or `arealite.launcher.slurm`, similar to the `LocalLauncher`:
```bash
# Launch with the Ray launcher: 4 nodes with 4 GPUs each (3 nodes for generation, 1 node for training).
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \
--config examples/arealite/configs/gsm8k_grpo.yaml \
experiment_name=<your experiment name> \
trial_name=<your trial name> \
allocation_mode=sglang.d12p1t1+d4p1t1 \
cluster.n_nodes=4 \
cluster.n_gpus_per_node=4 \
...
# Launch with the Slurm launcher: 16 nodes with 8 GPUs each (12 nodes for generation, 4 nodes for training).
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
--config examples/arealite/configs/gsm8k_grpo.yaml \
experiment_name=<your experiment name> \
trial_name=<your trial name> \
allocation_mode=sglang.d96p1t1+d32p1t1 \
cluster.n_nodes=16 \
cluster.n_gpus_per_node=8 \
...
```
[Click here](installation.md#optional-launch-ray-cluster-for-distributed-training) for a guide on how to set up a Ray cluster. For more launcher options, check `LauncherConfig` in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
> **Note**: Before launching distributed experiments, please check that your `allocation_mode` matches your cluster configuration. Make sure the number of GPUs allocated by `allocation_mode` equals `cluster.n_nodes * cluster.n_gpus_per_node`.
> **Note**: Ray and Slurm launchers only work for distributed experiments with more than one node (`cluster.n_nodes > 1`). They allocate GPUs for training and generation at the granularity of **nodes**, which means the numbers of GPUs allocated for generation and for training must each be an integer multiple of `cluster.n_gpus_per_node`.
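As a quick sanity check before launching, you can verify that the GPU count implied by `allocation_mode` adds up. The snippet below is an illustrative sketch (not part of AReaLite) that assumes the `sglang.d{DP}p{PP}t{TP}+d{DP}p{PP}t{TP}` form used in the examples above:
```python
import re


def gpus_in_allocation_mode(allocation_mode: str) -> int:
    """Sum DP * PP * TP over the generation and training parts of the string."""
    total = 0
    for part in allocation_mode.removeprefix("sglang.").split("+"):
        degrees = {key: int(value) for key, value in re.findall(r"([dpt])(\d+)", part)}
        total += degrees["d"] * degrees["p"] * degrees["t"]
    return total


# The Ray example above: 3 nodes for generation + 1 node for training, 4 GPUs per node.
assert gpus_in_allocation_mode("sglang.d12p1t1+d4p1t1") == 4 * 4
```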
## Next Steps
Check [Getting Started with AReaLite](../arealite/gsm8k_grpo.md) for a complete code walkthrough of the GSM8K GRPO example.