# Running GRPO on GSM8K Dataset
This guide walks through the code for running the GRPO algorithm on the GSM8K dataset, using the training script [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py) and configuration file [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
## Launching the Experiment
As shown in [Quickstart Guide](../tutorial/quickstart_arealite.md), experiments in AReaLite are launched using standalone launchers with the following commands:

```
# Local Launcher
python -m arealite.launcher.local <training script> --config <configuration file> <cli args>
# Ray Launcher
python -m arealite.launcher.ray <training script> --config <configuration file> <cli args>
# Slurm Launcher
python -m arealite.launcher.slurm <training script> --config <configuration file> <cli args>
```
In AReaLite:

- The **training script** is an SPMD Python script that serves as the experiment entry point.
- The launcher runs the training script with its distributed backend (`subprocess` for `LocalLauncher`, `ray.remote` for `RayLauncher`, `srun` for `SlurmLauncher`).
- The launcher also manages inference servers (currently only `SGLangServer` is supported).
- For distributed launchers (`RayLauncher` and `SlurmLauncher`), inference servers run with a wrapper, [arealite/launcher/sglang_server.py](../../arealite/launcher/sglang_server.py), which handles addresses and ports in distributed settings.
The **configuration file** is a YAML file that sets the options defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
It can be modified via CLI arguments such as `actor.path=Qwen/Qwen3-1.7B` and `+sglang.attention_backend=triton`.
The training script uses [load_expr_config(args, config_cls)](../../arealite/api/cli_args.py#L886) to parse the YAML file together with the CLI arguments into the config class defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py):

```python
config, _ = load_expr_config(args, GRPOConfig)
config: GRPOConfig
```
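
Such overrides are simply appended to the launch command. A hypothetical invocation with the local launcher, for illustration only (the paths match the example; the override values are just the ones mentioned above):

```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=my-gsm8k-grpo trial_name=trial0 \
    actor.path=Qwen/Qwen3-1.7B \
    +sglang.attention_backend=triton
```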
## Loading and Preprocessing Dataset

We use the `datasets` and `torchdata` packages to load and preprocess the dataset into our dataloader. First, we download `openai/gsm8k` from Hugging Face and split it by data-parallel rank, then map it to the desired format:

```python
def process_gsm8k_rl_dataset(dataset: Dataset):
    def process(sample):
        ...

def get_gsm8k_dataset(split, rank, world_size):
    ...
    return process_gsm8k_rl_dataset(dataset)
```
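
The elided bodies live in the example script. Purely as a rough, hypothetical sketch of what the mapping and sharding might look like (the exact field names and splitting logic in the repository may differ):

```python
from datasets import load_dataset

def process_gsm8k_rl_dataset(dataset):
    def process(sample):
        # Wrap the GSM8K question as a chat-style prompt and keep the
        # reference answer so the reward function can score completions.
        return {
            "messages": [{"role": "user", "content": sample["question"]}],
            "answer": sample["answer"],
        }
    return dataset.map(process)

def get_gsm8k_dataset(split, rank, world_size):
    # Each data-parallel rank keeps only its own shard of the split.
    dataset = load_dataset("openai/gsm8k", "main", split=split)
    dataset = dataset.shard(num_shards=world_size, index=rank)
    return process_gsm8k_rl_dataset(dataset)
```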

We then prepare training and evaluation dataloaders with `torchdata.StatefulDataLoader`, which will later feed the rollout engine:

```python
train_dataloader = torchdata.StatefulDataLoader(
    ...
)
```
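
The constructor arguments are elided above; a minimal sketch of how such a dataloader might be built (the argument values are assumptions, not the example's exact settings):

```python
train_dataloader = torchdata.StatefulDataLoader(
    train_dataset,
    batch_size=config.train_dataset.batch_size,
    shuffle=True,
    collate_fn=lambda samples: samples,  # keep raw dicts; batching happens in the rollout engine
)
```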

If you wish to use your own Hugging Face datasets or datasets from local storage, …

## Rollout
The data lifecycle is controlled by an `RLVRWorkflow`, which defines how a prompt progresses to complete rollout data with the fields required for training. A workflow can involve multiple turns of generation, tool calling, and reward computation; our example shows a single-turn RLVR workflow with a math reward function.
First, we define a math reward function for GSM8K.
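The exact reward implementation is in the example script. Purely as an illustration, a GSM8K-style reward typically extracts the final number after the `####` marker in the reference answer and compares it with the last number in the completion; a hypothetical sketch (the real function's signature may differ):

```python
import re

def gsm8k_reward_fn(completion: str, answer: str, **kwargs) -> float:
    # GSM8K reference answers end with "#### <final number>".
    target = answer.split("####")[-1].strip().replace(",", "")
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(target) else 0.0
    except ValueError:
        return 0.0
```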
The reward function is then passed into the `RLVRWorkflow` constructor (the remaining arguments are shown in the example script):

```python
workflow = RLVRWorkflow(
    ...
)
```

For generation, we assume the launchers have already started `SGLangServer` instances and set the `AREAL_LLM_SERVER_ADDRS` environment variable to tell the training processes the addresses and ports of the inference servers to connect to.
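
The exact format of this variable is defined by the launcher; as an assumed illustration, the engine can read a comma-separated list of `host:port` entries:

```python
import os

# Assumed format: "host1:port1,host2:port2" (set by the launcher for every training process).
addrs = os.environ["AREAL_LLM_SERVER_ADDRS"].split(",")
```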

Next, we initialize `RemoteSGLangEngine` in the training script. Its APIs fall into two categories:

- Sending requests, such as generation and weight updates, to the remote inference servers and returning the replies. Related APIs include `agenerate` and `update_weights`.
- Executing the rollout workflow: managing the data streaming through it, controlling the parameter-version gap between generation and training (data off-policyness), and collating completed rollouts into a batched training sample. Related APIs include `prepare_batch` and `rollout_batch`.

Whether asynchronous RL training is enabled depends solely on which of these APIs produces the training batches: `prepare_batch` processes data in a streaming fashion and only batches the outputs, while `rollout_batch` submits a whole batch of prompts and waits for the results synchronously.

The following code shows how `RemoteSGLangEngine` generates data batches for RL training:
```python
rollout = RemoteSGLangEngine(config.rollout)
...

for global_step in range(max_steps):
    ...
    data = next(data_generator)
    batch = rollout.rollout_batch(data, workflow=workflow)
```
If you want to use rollout workflows with custom reward functions or agentic tool calling, see [Customization: Rollout Workflows](../customization/agent.md).
## Training

After obtaining the training batch, we use `FSDPPPOActor` to compute losses and update weights. Each train engine corresponds to one model, so we need an additional engine for the reference model.

```python
actor = FSDPPPOActor(config=config.actor)
actor.initialize(None, ft_spec)
ref = None
if config.actor.kl_ctl > 0 and config.ref is not None:
    ref = FSDPPPOActor(config=config.ref)
    ref.initialize(None, ft_spec)
```

The following code shows a GRPO training step:

```python
logp = actor.compute_logp(batch)
batch["prox_logp"] = logp
if ref is not None:
    batch["ref_logp"] = ref.compute_logp(batch)
    log_gpu_stats("ref logp")
actor.compute_advantages(batch)
stats = actor.ppo_update(batch)
actor.step_lr_scheduler()
```

`FSDPPPOActor` is a high-level engine with algorithm-specific APIs such as `compute_logp`, `compute_advantages`, and `ppo_update`.
It is powered by the lower-level train engine `FSDPEngine`, which only provides basic model APIs such as `train_batch` and `forward`.

## Transferring Weights to Inference Servers

After training, we transfer the updated parameters to the remote inference servers through cooperation between `FSDPPPOActor` and `RemoteSGLangEngine`.
Our example shows a simple case in which parameters are transferred through disk:

```python
path = update_weight_path(global_step)
meta = WeightUpdateMeta(
    type="disk",
    path=path,
    model_version=global_step + 1,
)
# Ask the remote inference servers to update their weights.
if dist.get_rank() == 0:
    future = rollout.update_weights(meta)
# The actor saves its weights to the path specified in `meta`.
actor.upload_weights(meta)
# The remote servers return once they finish updating.
if dist.get_rank() == 0:
    future.result()
    shutil.rmtree(path, ignore_errors=True)
# Synchronize rollout processes before bumping the model version.
dist.barrier()
torch.cuda.synchronize()
# Update the weight version recorded by the rollout engine.
rollout.set_version(global_step + 1)
```

The core GRPO training logic in AReaLite can be summarized as:

```python
data_generator = iter(train_dataloader)
for global_step in range(max_steps):
    if config.async_training:
        batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    else:
        try:
            data = next(data_generator)
        except StopIteration:
            data_generator = iter(train_dataloader)
            data = next(data_generator)
        batch = rollout.rollout_batch(data, workflow=workflow)

    batch = batch.to(actor.device)
    # Create a barrier to synchronize all rollout processes.
    dist.barrier()
    torch.cuda.synchronize()

    logp = actor.compute_logp(batch)
    batch["prox_logp"] = logp
    if ref is not None:
        batch["ref_logp"] = ref.compute_logp(batch)
        log_gpu_stats("ref logp")
    actor.compute_advantages(batch)
    stats = actor.ppo_update(batch)
    actor.step_lr_scheduler()

    path = update_weight_path(global_step)
    meta = WeightUpdateMeta(
        type="disk",
        path=path,
        model_version=global_step + 1,
    )
    # Ask the remote inference servers to update their weights.
    if dist.get_rank() == 0:
        future = rollout.update_weights(meta)
    # The actor saves its weights to the path specified in `meta`.
    actor.upload_weights(meta)
    # The remote servers return once they finish updating.
    if dist.get_rank() == 0:
        future.result()
        shutil.rmtree(path, ignore_errors=True)
    # Synchronize rollout processes before bumping the model version.
    dist.barrier()
    torch.cuda.synchronize()
    # Update the weight version recorded by the rollout engine.
    rollout.set_version(global_step + 1)
```
## Utilities

AReaLite provides a range of utilities for observing and tuning your experiments, including:

- `Saver` ([arealite/utils/saver.py](../../arealite/utils/saver.py)): saves checkpoints at a frequency set in the config.
- `Evaluator` ([arealite/utils/evaluator.py](../../arealite/utils/evaluator.py)): evaluates the model at a frequency set in the config.
- `StatsLogger` ([arealite/utils/stats_logger.py](../../arealite/utils/stats_logger.py)): logs training statistics to backends like `wandb` and `tensorboard`, and manages output to the terminal and log files.
- `stats_tracker` ([realhf/base/stats_tracker.py](../../realhf/base/stats_tracker.py)): gathers and manages training statistics.
# Quickstart (Legacy)

> **Note**: This is a quickstart guide for launching AReaL experiments with the legacy code in `realhf/`. We strongly recommend users try AReaLite for a better experience. [Click here](quickstart_arealite.md) for the AReaLite quickstart guide!

This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
# Quickstart

Welcome to the **AReaLite** Quickstart Guide!
This guide demonstrates how to run an AReaLite experiment that trains an LLM on the GSM8K dataset using the GRPO algorithm with function-based rewards.
Ensure you have completed [the installation and environment setup](installation.md) before proceeding.
## Running the Experiment (on a single node)

To run the experiment, you will need:

- Training script: [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py)
- Config YAML: [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml)

The training script automatically downloads the dataset (openai/gsm8k) and model (Qwen/Qwen3-1.7B), so you do not need to prepare any dataset or model files in advance.
To run the example with the default configuration, execute the following command from the repository directory:
```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml experiment_name=<your experiment name> trial_name=<your trial name>
```

> **Note**: The command above uses `LocalLauncher`, which only works on a single node (`cluster.n_nodes == 1`). For distributed experiments, see [Distributed Experiments with Ray or Slurm](quickstart_arealite.md#distributed-experiments-with-ray-or-slurm).
## Modifying configuration

All available configuration options are listed in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
To customize the experiment (models, resource allocation, algorithm options), you can:

1. Edit the YAML file directly at [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
2. Add command-line options:
   - For options that already exist in the YAML file, append them directly: `actor.path=Qwen/Qwen3-1.7B`.
   - For options in `cli_args.py` that are not in the YAML file, add them with a `+` prefix: `+sglang.attention_backend=triton` (see the schematic YAML excerpt below).
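
For instance, the dotted CLI key `actor.path` corresponds to a nested entry in the YAML file; the `+` prefix is only needed for keys that have no such entry yet. A schematic excerpt (not the full config file):

```yaml
actor:
  path: Qwen/Qwen3-1.7B
```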

For example, here is a command that launches a customized configuration based on our GSM8K GRPO example:

```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    ...
```

::::{important}
We are currently refactoring from legacy AReaL to AReaLite, which introduces some configuration differences. For convenience, we provide a **config converter** that transforms an old AReaL config into an AReaLite YAML file. [Click here](xxx) to learn how to use the **config converter**.
::::
## Distributed Experiments with Ray or Slurm

AReaLite also provides standalone Ray and Slurm launchers for distributed experiments. Once your Ray or Slurm cluster is set up, launch experiments with `arealite.launcher.ray` or `arealite.launcher.slurm`, similarly to `LocalLauncher`:

```
# Launch with the Ray launcher: 4 nodes (4 GPUs each), 3 nodes for generation, 1 node for training.
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    ...
    cluster.n_gpus_per_node=4 \
    ...

# Launch with the Slurm launcher: 16 nodes (8 GPUs each), 12 nodes for generation, 4 nodes for training.
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    ...
```

Additional references:

- For more launcher options, check `LauncherConfig` in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
- See the [Ray cluster setup guide](installation.md#optional-launch-ray-cluster-for-distributed-training) for how to set up a Ray cluster.

> **Important Notes**:
> 1. Ensure `allocation_mode` matches your cluster configuration: the number of GPUs allocated by `allocation_mode` must equal `cluster.n_nodes * cluster.n_gpus_per_node` (for example, 4 nodes with 4 GPUs each means `allocation_mode` must account for exactly 16 GPUs).
> 2. The Ray and Slurm launchers only work with more than one node (`cluster.n_nodes > 1`). For a single node, use `LocalLauncher`.
> 3. The Ray and Slurm launchers allocate GPUs at node granularity, so the number of GPUs used for generation or training must be an integer multiple of `cluster.n_gpus_per_node`.
## Next Steps

Check [Getting Started with AReaLite](../arealite/gsm8k_grpo.md) for a complete code walkthrough of the GRPO GSM8K example.