finish code walkthrough doc

This commit is contained in:
晓雷 2025-07-22 13:10:49 +08:00
parent 92495df9e3
commit 9f822a9275
3 changed files with 177 additions and 48 deletions

View File

# Running GRPO on GSM8K Dataset
This guide walks through the code for running the GRPO algorithm on the GSM8K dataset, using the training script [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py) and configuration file [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
## Launching the Experiment
As shown in the [Quickstart Guide](../tutorial/quickstart_arealite.md), AReaLite experiments are launched by standalone launchers with the following commands:
```
# Local Launcher
python -m arealite.launcher.local <training script> --config <configuration file> <cli args>
# Ray Launcher
python -m arealite.launcher.ray <training script> --config <configuration file> <cli args>
# Slurm Launcher
python -m arealite.launcher.slurm <training script> --config <configuration file> <cli args>
```
In AReaLite:
- The **training script** is an SPMD Python script that serves as the experiment entry point.
- The launcher runs the training script with its distributed backend (`subprocess` for `LocalLauncher`, `ray.remote` for `RayLauncher`, `srun` for `SlurmLauncher`).
- The launcher also manages inference servers (currently only supporting `SGLangServer`).
- For distributed launchers (`RayLauncher` and `SlurmLauncher`), inference servers run with a wrapper [arealite/launcher/sglang_server.py](../../arealite/launcher/sglang_server.py) to handle addresses and ports in distributed settings.
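
For intuition, here is a minimal sketch of what a `subprocess`-based local launcher does. This is illustrative only, not the actual `LocalLauncher` implementation; the server address and the `RANK`/`WORLD_SIZE` variables are assumptions:

```python
import os
import subprocess
import sys

def launch_local(training_script: str, config_path: str, n_procs: int):
    # Assumed address of a locally launched SGLang server; the real
    # launcher starts and manages these servers itself.
    env = dict(os.environ, AREAL_LLM_SERVER_ADDRS="127.0.0.1:30000")
    procs = [
        subprocess.Popen(
            [sys.executable, training_script, "--config", config_path],
            env={**env, "RANK": str(rank), "WORLD_SIZE": str(n_procs)},
        )
        for rank in range(n_procs)
    ]
    # The SPMD training script runs in every process; wait for all of them.
    for proc in procs:
        proc.wait()
```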
The **configuration file** is a YAML file that sets the options defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
It can be modified via CLI arguments such as `actor.path=Qwen/Qwen3-1.7B` and `+sglang.attention_backend=triton`.
The training script uses `load_expr_config` to parse the YAML file and CLI overrides into the config class defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py):
```python
config, _ = load_expr_config(args, GRPOConfig)
config: GRPOConfig
```
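
For example, both kinds of CLI overrides can be combined in a single launch command:

```
python -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    actor.path=Qwen/Qwen3-1.7B \
    +sglang.attention_backend=triton
```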
## Loading and Preprocessing Dataset
We use the `datasets` and `torchdata` packages to load and preprocess the dataset into our dataloader. First, we download `openai/gsm8k` from Huggingface and split it by data parallel ranks, then map it to our desired format:
```python
def process_gsm8k_rl_dataset(dataset: Dataset):
    def process(sample):
        ...  # reformat each sample (elided in this hunk)
    ...  # map and return the processed dataset (elided in this hunk)

def get_gsm8k_dataset(split, rank, world_size):
    ...  # download and split by rank (elided in this hunk)
    return process_gsm8k_rl_dataset(dataset)
```
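
The download-and-split step elided above can be expressed with the `datasets` package's built-in sharding. A minimal sketch, assuming `shard`-based splitting (the actual example script may split differently):

```python
from datasets import load_dataset

def get_gsm8k_dataset(split, rank, world_size):
    # Download openai/gsm8k (its "main" config) from the HuggingFace Hub,
    # then give each data-parallel rank its own shard of the split.
    dataset = load_dataset("openai/gsm8k", "main", split=split)
    dataset = dataset.shard(num_shards=world_size, index=rank)
    return process_gsm8k_rl_dataset(dataset)
```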
We then prepare training and evaluation dataloaders with `torchdata.StatefulDataLoader`:
```python
train_dataloader = torchdata.StatefulDataLoader(
    ...  # dataset, batch size, and collation arguments elided in this hunk
)
```
If you wish to use your own huggingface datasets or datasets on your local storage, you can adapt this loading code accordingly.
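
`torchdata.StatefulDataLoader` is used rather than a plain `DataLoader` because it can checkpoint and restore its position in the data stream, so a resumed experiment neither replays nor skips data:

```python
# Save the dataloader position alongside model checkpoints...
dataloader_state = train_dataloader.state_dict()
# ...and restore it after a restart to resume from the same sample.
train_dataloader.load_state_dict(dataloader_state)
```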
## Rollout
The data lifecycle is controlled by an `RLVRWorkflow`, which defines how data progresses from a prompt to complete rollout data with the fields required for training. A workflow can involve multiple turns of generation, tool calling, and reward calculation; our example shows a single-turn RLVR workflow with a math reward function.
First, we define a math reward function for GSM8K.
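
The function body is elided in this hunk; as an illustration, a binary GSM8K reward can be as simple as comparing final numbers. This is a sketch under an assumed signature and assumed parsing rules, not the example's exact implementation:

```python
import re

def gsm8k_reward_fn(completion: str, answer: str, **kwargs) -> float:
    # GSM8K reference solutions end with "#### <number>". Compare the
    # last number in the model output against the reference answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == answer.strip() else 0.0
```

The reward function is then passed into the `RLVRWorkflow` constructed below.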
```python
workflow = RLVRWorkflow(
    ...  # reward function and generation settings elided in this hunk
)
```
For generation, we assume the launchers have started `SGLangServer` instances and set the `AREAL_LLM_SERVER_ADDRS` environment variable with their addresses and ports.
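
Inside the training script, the servers can be discovered by reading that variable; the comma-separated format shown here is an assumption for illustration:

```python
import os

# Assumed format: "host1:port1,host2:port2,...". The rollout engine
# dispatches generation requests across these servers.
server_addrs = os.environ["AREAL_LLM_SERVER_ADDRS"].split(",")
```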
We initialize `RemoteSGLangEngine`, whose APIs fall into two categories:
- Sending requests, such as generation and weight updates, to the remote inference servers. Related APIs include `agenerate` and `update_weights`.
- Executing the rollout workflow: managing the data streaming through it to control the parameter version gap between generation and training (data off-policyness), and collating completed rollout data into batched training samples. Related APIs include `prepare_batch` and `rollout_batch`.
The following code shows how `RemoteSGLangEngine` generates data batches for RL training. Whether RL training is asynchronous is determined solely by which API produces the batches: `prepare_batch` processes data in a streaming fashion and only batches the output, while `rollout_batch` submits the data as one batch and waits for the results synchronously.
```python
rollout = RemoteSGLangEngine(config.rollout)
...  # engine initialization and dataloader setup elided in this hunk
for global_step in range(max_steps):
    ...
    data = next(data_generator)
    batch = rollout.rollout_batch(data, workflow=workflow)
```
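
The asynchronous variant replaces `rollout_batch` with `prepare_batch`, as the full training loop later in this guide shows:

```python
# Streaming rollout: returns a batch as soon as enough samples finish,
# allowing generation to run ahead of training.
batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
```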
If you want to use rollout workflows with custom reward functions or agentic tool calling, see [Customization: Rollout Workflows](../customization/agent.md).
## Training
After obtaining the training batch, we use `FSDPPPOActor` to compute losses and update weights. Each train engine corresponds to one model, so we need an additional engine for the reference model.
```python
actor = FSDPPPOActor(config=config.actor)
actor.initialize(None, ft_spec)
ref = None
if config.actor.kl_ctl > 0 and config.ref is not None:
    ref = FSDPPPOActor(config=config.ref)
    ref.initialize(None, ft_spec)
```
The following code shows a GRPO training step:
```python
logp = actor.compute_logp(batch)
batch["prox_logp"] = logp
if ref is not None:
    batch["ref_logp"] = ref.compute_logp(batch)
    log_gpu_stats("ref logp")
actor.compute_advantages(batch)
stats = actor.ppo_update(batch)
actor.step_lr_scheduler()
```
`FSDPPPOActor` is a high-level engine with algorithm-specific APIs such as `compute_logp`, `compute_advantages`, and `ppo_update`.
It is powered by the lower-level train engine `FSDPEngine`, which provides only basic model APIs such as `train_batch` and `forward`.
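
As a mental model only, the layering looks roughly like this. This is a hypothetical sketch; the real classes live in AReaLite's engine modules and differ in detail:

```python
class FSDPEngineSketch:
    """Stand-in for the low-level FSDPEngine: generic model APIs."""

    def forward(self, batch):
        ...  # a plain forward pass over the batch

    def train_batch(self, batch, loss_fn):
        ...  # forward, compute loss_fn, backward, optimizer step


class FSDPPPOActorSketch:
    """Stand-in for FSDPPPOActor: algorithm-specific APIs on top."""

    def __init__(self, engine: FSDPEngineSketch):
        self.engine = engine

    def compute_logp(self, batch):
        # Wraps the generic forward pass to produce log-probabilities.
        return self.engine.forward(batch)

    def ppo_update(self, batch):
        # Builds the PPO surrogate loss, then delegates the generic
        # train step to the low-level engine.
        return self.engine.train_batch(batch, loss_fn=self._ppo_loss)

    def _ppo_loss(self, outputs, batch):
        ...  # clipped policy-gradient loss from advantages and logps
```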
## Transferring Weights to Inference Servers
After training, we transfer the updated parameters to the remote inference servers through cooperation between `FSDPPPOActor` and `RemoteSGLangEngine`.
Our example shows a simple case in which parameters are transferred via disk:
```python
path = update_weight_path(global_step)
meta = WeightUpdateMeta(
    type="disk",
    path=path,
    model_version=global_step + 1,
)
# send requests to remote servers, telling them to update weights
if dist.get_rank() == 0:
    future = rollout.update_weights(meta)
# the actor saves weights to disk
actor.upload_weights(meta)
# remote servers return after finishing the update
if dist.get_rank() == 0:
    future.result()
    shutil.rmtree(path, ignore_errors=True)
# synchronize rollout processes for the model version update
dist.barrier()
torch.cuda.synchronize()
# update the version on the rollout engine
rollout.set_version(global_step + 1)
```
The core GRPO training logic in AReaLite can be summarized as:
```python
data_generator = iter(train_dataloader)
for global_step in range(max_steps):
    if config.async_training:
        batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    else:
        try:
            data = next(data_generator)
        except StopIteration:
            data_generator = iter(train_dataloader)
            data = next(data_generator)
        batch = rollout.rollout_batch(data, workflow=workflow)

    batch = batch.to(actor.device)
    # Barrier to synchronize all rollout processes.
    dist.barrier()
    torch.cuda.synchronize()

    logp = actor.compute_logp(batch)
    batch["prox_logp"] = logp
    if ref is not None:
        batch["ref_logp"] = ref.compute_logp(batch)
        log_gpu_stats("ref logp")
    actor.compute_advantages(batch)
    stats = actor.ppo_update(batch)
    actor.step_lr_scheduler()

    path = update_weight_path(global_step)
    meta = WeightUpdateMeta(
        type="disk",
        path=path,
        model_version=global_step + 1,
    )
    # send requests to remote servers, telling them to update weights
    if dist.get_rank() == 0:
        future = rollout.update_weights(meta)
    # the actor saves weights to disk
    actor.upload_weights(meta)
    # remote servers return after finishing the update
    if dist.get_rank() == 0:
        future.result()
        shutil.rmtree(path, ignore_errors=True)
    # synchronize rollout processes for the model version update
    dist.barrier()
    torch.cuda.synchronize()
    # update the version on the rollout engine
    rollout.set_version(global_step + 1)
```
## Utilities
AReaLite provides a range of utilities for the basic functionality needed to observe and tune your experiments, including:
- `Saver` ([arealite/utils/saver.py](../../arealite/utils/saver.py)): Saves model checkpoints at a frequency set in the config.
- `Evaluator` ([arealite/utils/evaluator.py](../../arealite/utils/evaluator.py)): Evaluates the model at a frequency set in the config.
- `StatsLogger` ([arealite/utils/stats_logger.py](../../arealite/utils/stats_logger.py)): Logs training data to backends like `wandb` and `tensorboard`. Also manages outputs to terminal or log files.
- `stats_tracker` ([realhf/base/stats_tracker.py](../../realhf/base/stats_tracker.py)): Gathers and manages training statistics.
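
A training loop typically invokes these once per step. The sketch below is hedged; the exact constructor and method signatures live in the linked files:

```python
# Hypothetical per-step calls; consult the linked utility files for
# the actual signatures.
saver.save(actor, epoch, step, global_step)            # checkpoint if due
evaluator.evaluate(eval_fn, epoch, step, global_step)  # evaluate if due
stats_logger.commit(epoch, step, global_step, stats)   # push metrics
```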

View File

# Quickstart (Legacy)
> **Note**: This is a quickstart guide for launching AReaL experiments with the legacy code in `realhf/`. We strongly recommend trying AReaLite for a better experience. [Click here](quickstart_arealite.md) for the AReaLite quickstart guide!
This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.

View File

# Quickstart
Welcome to AReaLite quickstart guide!
In this guide, we provide an example that runs an AReaLite experiment that trains an LLM on GSM8K dataset with GRPO algorithm and function-based rewards.
Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
Welcome to the **AReaLite** Quickstart Guide!
This guide demonstrates how to run an AReaLite experiment training an LLM on the GSM8K dataset using the GRPO algorithm with function-based rewards.
Ensure you've completed [the installation and environment setup](installation.md) before proceeding.
## Running the Experiment (on a single node)
To run the experiment, you will need:
- Training script: [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py)
- Config YAML: [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml)
Our training scripts will automatically download the dataset (openai/gsm8k) and model (Qwen/Qwen3-1.7B).
To run the example with default configuration, execute from the repository directory:
```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml experiment_name=<your experiment name> trial_name=<your trial name>
```
> **Note**: The command above uses `LocalLauncher`, which only works for a single node (`cluster.n_nodes == 1`). For distributed experiments, see [Distributed Experiments with Ray or Slurm](quickstart_arealite.md#distributed-experiments-with-ray-or-slurm).
## Modifying configuration
All available configuration options are listed in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
To customize the experiment (models, resources, algorithm options), you can:
1. Edit the YAML file directly at [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
2. Add command-line options:
- For existing options in the YAML file, directly add the option: `actor.path=Qwen/Qwen3-1.7B`.
- For options in `cli_args.py` that do not appear in the YAML file, add them with a `+` prefix: `+sglang.attention_backend=triton`.
For example, here is the command to launch a customized configuration, based on our GSM8K GRPO example:
```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    ...
```
::::{important}
We're currently refactoring from legacy AReaL to AReaLite, which introduces some configuration differences. For convenience, we provide a **config converter** that transforms an old AReaL config into an AReaLite YAML file. [Click here](xxx) to learn how to use the **config converter**.
::::
## Distributed Experiments with Ray or Slurm
AReaLite provides standalone launchers for distributed experiments. After setting up your Ray or Slurm cluster, launch experiments similarly to `LocalLauncher`:
```
# Launch with Ray launcher. 4 nodes (4 GPUs each), 3 nodes for generation, 1 node for training.
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    ...
    cluster.n_gpus_per_node=4 \
    ...
# Launch with Slurm launcher. 16 nodes (8 GPUs each), 12 nodes for generation, 4 nodes for training
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    ...
```
Additional references:
- For more launcher options, check `LauncherConfig` in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
- See the [Ray cluster setup guide](installation.md#optional-launch-ray-cluster-for-distributed-training) for how to set up a Ray cluster.
> **Important Notes**:
> 1. Ensure `allocation_mode` matches your cluster configuration (`#GPUs == cluster.n_nodes * cluster.n_gpus_per_node`).
> 2. Ray/Slurm launchers only work with more than one node (`cluster.n_nodes > 1`). For single-node scenarios, use `LocalLauncher`.
> 3. With Ray/Slurm launchers, GPUs are allocated at node granularity, which means the numbers of GPUs assigned to generation and to training must each be integer multiples of `cluster.n_gpus_per_node`.
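> For example, the Ray command above uses `cluster.n_nodes=4` and `cluster.n_gpus_per_node=4` (16 GPUs total): assigning 12 GPUs to generation and 4 to training satisfies both rules, since 12 + 4 = 16 and both are multiples of 4.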
## Next Steps
Check [Getting Started with AReaLite](../arealite/gsm8k_grpo.md) for a complete code walkthrough of the GRPO GSM8K example.