mirror of https://github.com/inclusionAI/AReaL
add quickstart
This commit is contained in: parent 4804b05637, commit 92495df9e3
@ -7,9 +7,13 @@ parts:
- caption: Tutorial
  chapters:
  - file: tutorial/installation
  - file: tutorial/quickstart_arealite
  - file: tutorial/quickstart
  - file: tutorial/eval
  - file: tutorial/troubleshooting
- caption: Getting Started with AReaLite
  chapters:
  - file: arealite/gsm8k_grpo
- caption: References
  chapters:
  - file: references/benchmark
@ -0,0 +1,132 @@

# Code Walkthrough: Running GRPO on GSM8K Dataset

In this guide, we walk through the code of an example that runs the GRPO algorithm on the GSM8K dataset, using the training script [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py) and the configuration file [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).

## Launching the Experiment

As shown in the [Quickstart Guide](../tutorial/quickstart_arealite.md), an AReaLite experiment is launched by a standalone launcher with one of the following commands:
```
# Local Launcher
python -m arealite.launcher.local <training script> --config <configuration file> <cli args>
# Ray Launcher
python -m arealite.launcher.ray <training script> --config <configuration file> <cli args>
# Slurm Launcher
python -m arealite.launcher.slurm <training script> --config <configuration file> <cli args>
```
In AReaLite, the **training script** is **an SPMD python script** that serves as the entry point to launch the experiment. The launcher runs the training script directly with its distributed backend (`subprocess` for `LocalLauncher`, `ray.remote` for `RayLauncher`, `srun` for `SlurmLauncher`). In addition to the training script, the launcher is also responsible for running the inference servers (currently only `SGLangServer` is supported). The distributed launchers (`RayLauncher` and `SlurmLauncher`) run inference servers through a wrapper, [arealite/launcher/sglang_server.py](../../arealite/launcher/sglang_server.py), which manages addresses and ports in distributed settings.

The **configuration file** is a YAML file that sets the options provided in [arealite/api/cli_args.py](../../arealite/api/cli_args.py). Its entries can be overridden by CLI arguments such as `actor.path=Qwen/Qwen3-1.7B` and `+sglang.attention_backend=triton`. The training script uses [load_expr_config(args, config_cls)](../../arealite/api/cli_args.py#L886) to parse the config file together with the CLI arguments into the config class defined in [arealite/api/cli_args.py](../../arealite/api/cli_args.py).
In the example:

```python
config, _ = load_expr_config(args, GRPOConfig)
config: GRPOConfig
```
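After parsing, CLI overrides are reflected on the resulting config object; for example, passing `actor.path=Qwen/Qwen3-1.7B` on the command line should leave `config.actor.path` set to `"Qwen/Qwen3-1.7B"`.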
## Loading and Pre-processing Dataset

In our example, we directly use tools from the `datasets` and `torchdata` packages to load and pre-process the dataset for our dataloader. We first download `openai/gsm8k` from the Hugging Face Hub, split it across data-parallel ranks, and then map it into the format we want.
```python
from datasets import Dataset, load_dataset
from datasets.distributed import split_dataset_by_node


def process_gsm8k_rl_dataset(dataset: Dataset):
    def process(sample):
        # Convert each GSM8K question into a chat-style prompt.
        messages = [{"role": "user", "content": sample["question"]}]
        return {"messages": messages}

    dataset = dataset.map(process).remove_columns(["question"])
    return dataset


def get_gsm8k_dataset(split, rank, world_size):
    dataset = load_dataset(path="openai/gsm8k", name="main", split=split)
    # Shard the dataset across data-parallel ranks.
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)
    return process_gsm8k_rl_dataset(dataset)
```
Then we prepare training and evaluation dataloaders with `torchdata.StatefulDataLoader`, which will serve as the input source for data rollout.
```python
train_dataloader = torchdata.StatefulDataLoader(
    get_gsm8k_dataset("train", rank, world_size),
    batch_size=config.train_dataset.batch_size // world_size,
    shuffle=config.train_dataset.shuffle,
    num_workers=config.train_dataset.num_workers,
    collate_fn=lambda x: x,
    drop_last=config.train_dataset.drop_last,
)
valid_dataloader = ...
```
If you wish to use your own datasets from the Hugging Face Hub or from local storage, please refer to [Customization: Dataset](../customization/dataset.md) for further details.
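For instance, here is a minimal sketch of swapping in a local JSONL dataset in place of `get_gsm8k_dataset` (the file path and the `problem` field name are hypothetical; what matters is producing the same `messages` column):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node


def get_custom_dataset(split, rank, world_size):
    # Hypothetical local file with one JSON object per line, e.g. {"problem": "...", "answer": "..."}.
    dataset = load_dataset("json", data_files=f"/path/to/my_{split}.jsonl", split="train")
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

    def process(sample):
        return {"messages": [{"role": "user", "content": sample["problem"]}]}

    return dataset.map(process).remove_columns(["problem"])
```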
## Rollout

Next, we prepare for data rollout. The life cycle of a piece of data is controlled by an `RLVRWorkflow`, which defines how it is processed from a prompt into complete rollout data with the fields required for training. Note that a workflow can involve multiple turns of generation, tool calling, and reward computation. In our example, we only show a single-turn RLVR workflow with a math reward function.
First, we define a math reward function for GSM8K.

```python
from ... import extract_answer, extract_solution


def gsm8k_reward_fn(prompt, completions, prompt_ids, completion_ids, answer, **kwargs):
    # Extract the model's final answer from the completion and the ground-truth
    # answer from the dataset, then compare them for an exact match.
    sol = extract_answer(completions, data_name="math")
    ans = extract_solution(solution_str=answer, method="strict")
    if sol is None:
        return 0
    if ans is None:
        return 0
    return int(sol.strip() == ans.strip())
```
Then we initialize an `RLVRWorkflow` with this reward function.

```python
tokenizer = load_hf_tokenizer(config.tokenizer_path)
workflow = RLVRWorkflow(
    reward_fn=gsm8k_reward_fn,
    gconfig=config.gconfig,
    tokenizer=tokenizer,
)
```
As for generation, we assume that the launcher has already started the `SGLangServer` instances and has set the environment variable `AREAL_LLM_SERVER_ADDRS` to tell the training script the addresses and ports of the inference servers it should connect to.
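As a rough illustration of what the training side receives (the engine handles the actual parsing; the comma-separated format shown here is an assumption for illustration only):

```python
import os

# Assumed format: a comma-separated list such as "10.0.0.1:30000,10.0.0.2:30001".
server_addrs = os.environ.get("AREAL_LLM_SERVER_ADDRS", "").split(",")
print(f"{len(server_addrs)} inference server(s) available: {server_addrs}")
```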
In the next step, we initialize a `RemoteSGLangEngine` in the training script. Its APIs can be categorized into two types:

- Sending requests, such as generation and weight-update requests, to the remote inference servers and returning the replies. Related APIs include `agenerate` and `update_weights`.
- Executing the rollout workflow. The engine manages the data streaming through the rollout workflow to control the parameter version gap between generation and training (data off-policyness), and collates completed rollout data into a batched training sample. Related APIs include `prepare_batch` and `rollout_batch`.
```python
rollout = RemoteSGLangEngine(config.rollout)
rollout.initialize()
eval_rollout = ...

data_generator = iter(train_dataloader)
for global_step in range(max_steps):
    # Roll out batched training data for the current step.
    if config.async_training:
        batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    else:
        try:
            data = next(data_generator)
        except StopIteration:
            data_generator = iter(train_dataloader)
            data = next(data_generator)
        batch = rollout.rollout_batch(data, workflow=workflow)
```
If you want to customize your own rollout workflow with custom reward functions or agentic tool calling, please refer to [Customization: Rollout Workflows](agent.md).

## Training

## Utilities
@ -1,35 +1,10 @@

# Quickstart (Legacy)

> **Note**: This is a quickstart guide for launching AReaL experiments with the legacy code in [realhf/](../../realhf/). We strongly recommend users try AReaLite instead for a better experience. [Click here](quickstart_arealite.md) for the AReaLite quickstart guide!

This guide walks you through a simple example of training an LLM to solve math problems. Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
## Dataset
Use `huggingface-cli` to download our open-source dataset:
@ -37,24 +12,20 @@ Use `huggingface-cli` to download our open-source dataset:
```
huggingface-cli download --repo-type=dataset inclusionAI/AReaL-RL-Data
```
> **Note**: The command above will display the path of the downloaded dataset. You'll need to pass this path to the training command.
## Model
We train using open-source models available on Hugging Face Hub. You can either download the model in advance or use the model identifier when running the experiment.
```bash
# If you want to download it in advance
huggingface-cli download Qwen/Qwen3-1.7B
```
Refer to the [official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for more information on using `huggingface-cli`.
## Training
From the repository directory, run:
@ -79,13 +50,11 @@ python3 training/main_async_ppo.py \
```
python3 training/main_async_ppo.py \
    ... \
    max_head_offpolicyness=4
```
::::{important}
Running `main_async_ppo.py` with `ppo.recompute_logprob=False`, `ppo.use_decoupled_loss=False`, and `max_head_offpolicyness=0` will essentially replicate the behavior of synchronous PPO. Therefore, it's usually not recommended to run synchronous PPO directly (i.e., `main_sync_ppo.py`). The workflow of asynchronous RL is more stable and easier to customize.
::::
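For reference, here is a sketch of those flags applied to the `main_async_ppo.py` command shown above (a sketch only; all other options stay as in the original command):

```
python3 training/main_async_ppo.py \
    ... \
    ppo.recompute_logprob=False \
    ppo.use_decoupled_loss=False \
    max_head_offpolicyness=0
```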
## Command Line Options
To view all available options:
@ -93,97 +62,62 @@ To view all available options:
```
python3 training/main_sync_ppo.py --help
```
### Configuration Parameters
- **`experiment_name`**: The name of your project.
- **`trial_name`**: The name of this trial in your project.
- **`{actor|ref}.path`**: The path to the model files.
- **`dataset.path`**: The path to the dataset JSONL file.
- **`cluster.fileroot`**: The root path for saving training outputs (logs and checkpoints).
- **`n_nodes`**: The number of nodes in the cluster.
- **`n_gpus_per_node`**: The number of GPUs per node.
- **`allocation_mode`**: The GPU allocation strategy and 3D parallelism configuration for the experiment (see the worked example after this list). Format:
  - `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures parallel strategies for SGLang generation and training respectively. Generation and training use separate GPU sets, and the total GPU count must equal: DP1×TP1×PP1 + DP2×TP2×PP2 = #GPUs.
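For instance (a hypothetical layout on a single 8-GPU node), `allocation_mode=sglang.d4m1p1+d4m1p1` dedicates 4 GPUs (DP=4, TP=1, PP=1) to SGLang generation and 4 GPUs to training: 4×1×1 + 4×1×1 = 8 GPUs in total.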
### Training Control
- **`exp_ctrl.total_train_epochs`**: Number of training epochs (complete dataset
|
||||
iterations).
|
||||
- **`exp_ctrl.save_freq_{epochs|steps|secs}`**: Frequency for saving model parameters to
|
||||
persistent storage. Set to null to disable saving.
|
||||
- **`exp_ctrl.ckpt_freq_{epochs|steps|secs}`**: Frequency for saving temporary
|
||||
parameters for restart capability.
|
||||
- **`dataset.train_bs_n_seqs`**: Training batch size (number of prompts sampled per
|
||||
training iteration).
|
||||
- **`exp_ctrl.total_train_epochs`**: Number of training epochs (complete dataset iterations).
|
||||
- **`exp_ctrl.save_freq_{epochs|steps|secs}`**: Frequency for saving model parameters to persistent storage. Set to null to disable saving.
|
||||
- **`exp_ctrl.ckpt_freq_{epochs|steps|secs}`**: Frequency for saving temporary parameters for restart capability.
|
||||
- **`dataset.train_bs_n_seqs`**: Training batch size (number of prompts sampled per training iteration).
|
||||
- **`group_size`**: Number of responses sampled per prompt.
### Memory and Performance
- **`{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`**: Maximum tokens per
|
||||
mini-batch for forward/backward passes during reference model inference and actor
|
||||
model training. Reduce this value to avoid OOM errors.
|
||||
- **`max_concurrent_rollouts`**: The maximum number of concurrent rollouts. SGLang will
|
||||
run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
|
||||
- **`{actor_train|ref_inf|actor_inf}.mb_spec.max_tokens_per_mb`**: Maximum tokens per mini-batch for forward/backward passes during reference model inference and actor model training. Reduce this value to avoid OOM errors.
|
||||
- **`max_concurrent_rollouts`**: The maximum number of concurrent rollouts. SGLang will run out of memory if this value is too large. Defaults to `dataset.train_bs_n_seqs`.
### Algorithm Configuration
- **`max_head_offpolicyness`**: The allowed maximum data staleness. 0 recovers
|
||||
synchronous training. A large value will increase generation throughput but degrade
|
||||
final performance. We recommend keeping this value at 8 or below.
|
||||
- **`ppo.recompute_logprob`**: Whether to compute proximal log probabilities for
|
||||
training. Defaults to True for asynchronous experiments and False for synchronous
|
||||
baselines.
|
||||
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training.
|
||||
Defaults to True.
|
||||
- **`max_head_offpolicyness`**: The allowed maximum data staleness. 0 recovers synchronous training. A large value will increase generation throughput but degrade final performance. We recommend keeping this value at 8 or below.
|
||||
- **`ppo.recompute_logprob`**: Whether to compute proximal log probabilities for training. Defaults to True for asynchronous experiments and False for synchronous baselines.
|
||||
- **`ppo.use_decoupled_loss`**: Use decoupled loss to stabilize asynchronous training. Defaults to True.
|
||||
- **`ppo.gen.max_new_tokens`**: Maximum tokens to generate per prompt.
- **`ppo.ppo_n_minibatches`**: Number of mini-batches for dividing data during each PPO update.
- **`success_rate_ub`**: Upper bound of success rate. Prompts with a higher success rate will be filtered out.
- **`success_rate_lb`**: Lower bound of success rate. Prompts with a lower success rate will be filtered out.
## Monitoring the Training Process
- We recommend using [Weights & Biases (wandb)](https://github.com/wandb/wandb) or [SwanLab](https://github.com/SwanHubX/SwanLab) for monitoring—run `wandb login` or `swanlab login`, or set the corresponding environment variable API key (`WANDB_API_KEY` or `SWANLAB_API_KEY`). Set `wandb.mode="online"` or `swanlab.mode="cloud"` in your configuration to upload training statistics. If you cannot connect to the server, you can also use `wandb.mode="offline"` or `swanlab.mode="local"` to save data locally without uploading.
You can also use TensorBoard by setting the `tensorboard.path` parameter.
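For instance, a sketch of how this monitoring setup might look (this assumes `wandb.mode` and `tensorboard.path` are passed on the command line like the other training options above; the key and path are placeholders):

```
export WANDB_API_KEY=<your API key>   # or run `wandb login` once
python3 training/main_async_ppo.py \
    ... \
    wandb.mode="online" \
    tensorboard.path=/path/to/tensorboard/logdir
```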
The main log will be saved to `${fileroot}/logs/${USER}/${experiment_name}/${trial_name}/main.log` and contains the statistics uploaded to wandb.

If SwanLab is enabled, logs will be saved to the directory specified by `swanlab.logdir`.
### Key Training Statistics
- **`Epoch 1/5`**: Indicates the total epochs required and the current epoch being
|
||||
trained.
|
||||
- **`step 6/19`**: Shows that the current epoch has 19 steps, with the 6th step just
|
||||
completed.
|
||||
- **`Epoch 1/5`**: Indicates the total epochs required and the current epoch being trained.
|
||||
- **`step 6/19`**: Shows that the current epoch has 19 steps, with the 6th step just completed.
|
||||
- **`global step 6`**: Step count across all epochs.
- **`ppo_actor/task_reward/avg`**: Average reward value of all sampled responses in this step. This should steadily increase during training and eventually stabilize.
- **`ppo_actor/importance_weight/avg`**: Average importance sampling ratio across all tokens in the PPO loss. This is typically close to 1.0.
- **`ppo_actor/actor_clip_ratio/avg`**: Ratio of clipped tokens in PPO loss to total tokens. This is usually less than 0.1.
- **`ppo_actor/actor_loss/avg`**: PPO loss value. **This does not show clear trends during training** and should not be used as a performance indicator.
## Next Steps
[Evaluate your model](eval.md) or check the [troubleshooting section](troubleshooting.md) if you encounter any issues.
@ -0,0 +1,80 @@
# Quickstart
Welcome to the AReaLite quickstart guide! In this guide, we provide an example that runs an AReaLite experiment to train an LLM on the GSM8K dataset with the GRPO algorithm and function-based rewards. Please ensure you have properly [installed dependencies and set up the runtime environment](installation.md) before proceeding.
## Running the Experiment (on a single node)
To run the experiment, you need a training script and a config YAML file.

- Training script: [examples/arealite/gsm8k_grpo.py](../../examples/arealite/gsm8k_grpo.py)
- Config YAML: [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml)
Our training script will automatically download the dataset (openai/gsm8k) and the model (Qwen/Qwen3-1.7B) for you. You do not need to prepare any dataset or model files before running the experiment. To run the example with the default configuration, execute the following command from the repository directory:
```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py --config examples/arealite/configs/gsm8k_grpo.yaml experiment_name=<your experiment name> trial_name=<your trial name>
```
> **Note**: The command above uses `LocalLauncher`, which only works for a single node (`cluster.n_nodes == 1`). For launching distributed experiments, please check out [Distributed Experiments with Ray or Slurm](quickstart.md#distributed-experiments-with-ray-or-slurm).
### Modifying configuration
All available options for experiment configuration are listed in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
To change the experiment configuration, including models, resource allocation, and algorithm options, you can:

1. Directly modify the config YAML file at [examples/arealite/configs/gsm8k_grpo.yaml](../../examples/arealite/configs/gsm8k_grpo.yaml).
2. Add command-line options after your command. For entries that exist in the config YAML, you can directly append the option, for example `actor.path=Qwen/Qwen3-1.7B`. For options defined in `cli_args.py` but not present in the YAML, add them with a `+` prefix, for example `+sglang.attention_backend=triton`.
For example, here is the command to launch a customized configuration, based on our GSM8K GRPO example:
```
python3 -m arealite.launcher.local examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    trial_name=<your trial name> \
    allocation_mode=sglang.d2p1t1+d2p1t1 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=4 \
    gconfig.max_new_tokens=2048 \
    train_dataset.batch_size=1024 \
    +sglang.attention_backend=triton
```
::::{important}
Since we are working on a refactor from legacy AReaL to AReaLite, the available options for AReaLite are slightly different from those of legacy AReaL. For convenience, we provide a **config converter** that transfers an old AReaL config into an AReaLite YAML file. [Click here](xxx) for the usage of the **config converter**.
::::
### Distributed Experiments with Ray or Slurm
AReaLite also provides standalone Ray and Slurm launchers for distributed experiments. Once you have properly set up your Ray or Slurm cluster, you can launch your experiment with `arealite.launcher.ray` or `arealite.launcher.slurm`, similar to the `LocalLauncher`:
```
# Launch with Ray launcher. 4 nodes with 4 GPUs each node, 3 nodes for generation, 1 node for training.
python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    trial_name=<your trial name> \
    allocation_mode=sglang.d12p1t1+d4p1t1 \
    cluster.n_nodes=4 \
    cluster.n_gpus_per_node=4 \
    ...

# Launch with Slurm launcher. 16 nodes with 8 GPUs each node, 12 nodes for generation, 4 nodes for training.
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
    --config examples/arealite/configs/gsm8k_grpo.yaml \
    experiment_name=<your experiment name> \
    trial_name=<your trial name> \
    allocation_mode=sglang.d96p1t1+d32p1t1 \
    cluster.n_nodes=16 \
    cluster.n_gpus_per_node=8 \
    ...
```
[Click here](installation.md#optional-launch-ray-cluster-for-distributed-training) for a guide on how to set up a Ray cluster. For more launcher options, check `LauncherConfig` in [arealite/api/cli_args.py](https://github.com/inclusionAI/AReaL/blob/main/arealite/api/cli_args.py).
> **Note**: Before launching distributed experiments, please check that your `allocation_mode` matches your cluster configuration. Make sure the number of GPUs allocated by `allocation_mode` equals `cluster.n_nodes * cluster.n_gpus_per_node`.
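> For the Ray example above, `sglang.d12p1t1+d4p1t1` allocates 12×1×1 = 12 GPUs for generation plus 4×1×1 = 4 GPUs for training, i.e. 16 GPUs in total, which matches `cluster.n_nodes=4` × `cluster.n_gpus_per_node=4`.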
> **Note**: Ray and Slurm launchers only work for distributed experiments with more than 1 node (`cluster.n_nodes > 1`). They allocate GPUs for training and generation at the granularity of **nodes**, which means the number of GPUs allocated for generation and training must be integer multiples of `cluster.n_gpus_per_node`.
## Next Steps
Check [Getting Started with AReaLite](../arealite/gsm8k_grpo.md) for a complete code walkthrough of the GRPO GSM8K example.