mirror of https://github.com/inclusionAI/AReaL
PullRequest: 68 Update tutorials
Merge branch update-tutorials of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/68 Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Commit: 046a28908d
@ -34,7 +34,7 @@ This tutorial provides a Docker image. Below are the tested software versions:
| Git LFS | Refer to: https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage. Mainly used for downloading models, datasets, and AReaL project code. |
| Docker | 27.5.1 |
| NVIDIA Container Toolkit | [Installing the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) |
| AReaL Image | `ghcr.io/inclusionai/areal-runtime:v0.1.0`. This image includes AReaL's runtime dependencies and Ray components. |
| AReaL Image | `ghcr.io/inclusionai/areal-runtime:v0.2.0`. This image includes AReaL's runtime dependencies and Ray components. |
Since the installation of NVIDIA Drivers and CUDA, as well as the mounting of shared storage, depends on node configurations and system versions, please complete these installations independently. This tutorial does not cover their setup.
@ -76,7 +76,7 @@ If the script in this section fails to execute or encounters errors due to envir
Since shared storage is used, downloading only needs to be done on one node.

## Code and Cluster Configuration
## Code

Clone the AReaL project code to `/storage/codes`:
@ -86,27 +86,6 @@ cd /storage/codes/
git clone https://github.com/inclusionAI/AReaL
```

Create the cluster configuration file `/storage/ray/cluster_config_on_ray.json`:

```bash
mkdir -p /storage/ray/
cd /storage/ray/
```

Write the following configuration to `/storage/ray/cluster_config_on_ray.json`:

```
{
    "cluster_type": "ray",
    "cluster_name": "ray_cluster",
    "fileroot": "/storage/ray/experiments",
    "default_mount": "/storage:/storage",
    "n_gpus_per_node": 8
}
```

This configuration file describes the cluster on which the AReaL training job runs. In particular, the `fileroot` path is where logs and checkpoints are stored during training.
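
A quick way to confirm the file is valid JSON before launching (a sketch, not part of the original instructions):

```bash
# Parse the cluster config; any syntax error will be reported immediately.
python3 -c "import json; print(json.load(open('/storage/ray/cluster_config_on_ray.json')))"
```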

## Dataset

We provide a dataset for training. Download the dataset and place it in `/storage/datasets/`:
@ -114,8 +93,8 @@ We provide a dataset for training. Download the dataset and place it in `/storag
```bash
mkdir -p /storage/datasets/
cd /storage/datasets/
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/full_prompts_for_r1_distilled.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/full_orz_zero.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/boba_106k_0319.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/orz-zero_56k_0319.jsonl?download=true
```
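
Note: depending on your `wget` version and shell, the `?download=true` query string may end up in the saved filename. If that happens, one workaround is to name the output file explicitly (a sketch, not part of the original instructions):

```bash
# Save the file under its intended name regardless of the query string.
wget -O boba_106k_0319.jsonl "https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/boba_106k_0319.jsonl?download=true"
```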

## Model
@ -127,6 +106,7 @@ mkdir -p /storage/models
cd /storage/models
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
```
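
Note: `GIT_LFS_SKIP_SMUDGE=1` clones only the LFS pointer files, not the actual weight files. If the weights are not downloaded by other means, you may need to fetch them explicitly afterwards (a hedged note, not part of the original instructions):

```bash
# Fetch the real LFS objects (model weights) for a cloned repository.
cd /storage/models/DeepSeek-R1-Distill-Qwen-1.5B && git lfs pull
```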

You can also download the models with the HuggingFace CLI after installing `huggingface_hub` from PyPI. Refer to the [official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for details.
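
For example (a sketch based on the huggingface_hub CLI; exact flags may differ across versions):

```bash
pip install -U huggingface_hub
# Download a model snapshot directly into the shared storage path.
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir /storage/models/DeepSeek-R1-Distill-Qwen-1.5B
```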
@ -138,7 +118,7 @@ Before proceeding, pull the AReaL environment image, which already includes Ray
On the first node, start the Ray Head with the following command:

```bash
docker run -d --name r1-ray-head --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "ray start --head --port=6379 && tail -f /dev/null"
docker run -d --name r1-ray-head --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "ray start --head --port=6379 && tail -f /dev/null"
```

On all other nodes, start the Ray Worker with the following command (skip this step if you only have one node):
@ -146,7 +126,7 @@ On all other nodes, start the Ray Worker with the following command (skip this s
```bash
# RAY_HEAD_IP is the IP of the first node
RAY_HEAD_IP=xxx.xxx.xxx.xxx
docker run -d --name r1-ray-worker --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "ray start --address=$RAY_HEAD_IP:6379 && tail -f /dev/null"
docker run -d --name r1-ray-worker --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "ray start --address=$RAY_HEAD_IP:6379 && tail -f /dev/null"
```

Once all nodes are up, check the Ray cluster status by entering the container on the first node:
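
A sketch of the kind of check this refers to, using Ray's standard status command:

```bash
docker exec -it r1-ray-head bash
# Inside the container: the number of nodes and GPUs reported should match your cluster.
ray status
```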
@ -198,87 +178,45 @@ Demands:

# RL Training

## Single-Node Training

For a single node, execute the following command to start training:

```bash
docker exec -it r1-ray-head bash
cd /storage/codes/AReaL
mkdir /storage/ray/train_batch_logs/
nohup bash ./examples/train_batch_1.5B_n1.sh &> /storage/ray/train_batch_logs/n1.log &
```

After starting, check the training launch information in the log file `/storage/ray/train_batch_logs/n1.log`:

```
Log Dir: /storage/ray/train_batch_logs/ppo-zero-distill-1.5B-n1/20250222-104411
Task Count: 1
2025-02-22 10:44.11 Task 0 started: ppo-zero-distill-1.5B-n1 deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B prompts.jsonl 1024 8 1 actor_gen:d4p1m2,*:d4p2m1 16384 128 1 0.001
```

Based on the Log Dir, you can check the specific logs for the currently running training task. The log path is `{Log Dir}/{task_id}.log`. For example, `/storage/ray/train_batch_logs/ppo-zero-distill-1.5B-n1/20250222-104411/0.log`:

```
20250222-10:44:15.581 quickstart INFO: Running ppo-math experiment.
20250222-10:44:15.581 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-zero-distill-1.5B-n1/1024x8-n1
20250222-10:44:15.581 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-zero-distill-1.5B-n1/1024x8-n1
20250222-10:44:17.100 quickstart INFO: Launching experiments with RAY...
```

If errors occur during execution (e.g., keywords like "Error" appear), refer to the troubleshooting section.

## Distributed Training

Before starting distributed training, ensure the Ray cluster is up and running properly.
Then, on the first node (where the Ray Head is located), enter the container:

```
docker exec -it r1-ray-head bash
cd /storage/codes/AReaL
mkdir /storage/ray/train_batch_logs/
```

Choose a task that matches your hardware environment and run it:
Choose a config file that matches your hardware environment and run it:

```bash
# For 1.5B model on 4 nodes, log file is n4.log
nohup bash ./examples/train_batch_1.5B_n4.sh &> /storage/ray/train_batch_logs/n4.log &
# For 1.5B model on 16 nodes, log file is n16.log
nohup bash ./examples/train_batch_1.5B_n16.sh &> /storage/ray/train_batch_logs/n16.log &
# For 7B model on 4 nodes, log file is 7n4.log
nohup bash ./examples/train_batch_7B_n4.sh &> /storage/ray/train_batch_logs/7n4.log &
# For 7B model on 16 nodes, log file is 7n16.log
nohup bash ./examples/train_batch_7B_n16.sh &> /storage/ray/train_batch_logs/7n16.log &
python3 -m realhf.apps.quickstart ppo-math --config ./examples/configs/7B-distill/ppo-7B-distill-gpus-128.yaml
```
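
If you want the new config-based launch to run in the background with a log file, mirroring the old script-based workflow, a sketch (the log file name is only an example):

```bash
# Launch the 7B / 128-GPU experiment in the background and capture its output.
nohup python3 -m realhf.apps.quickstart ppo-math \
  --config ./examples/configs/7B-distill/ppo-7B-distill-gpus-128.yaml \
  &> /storage/ray/train_batch_logs/ppo-7B-distill-gpus-128.log &
```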

After starting, check the training launch information in the log file `/storage/ray/train_batch_logs/{corresponding log file name}.log` (e.g., `7n16.log`):
After starting, check the training launch information:

```
Log Dir: /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631
Task Count: 1
2025-02-22 10:26.31 Task 0 started: ppo-zero-distill-7B-n16 deepseek-ai__DeepSeek-R1-Distill-Qwen-7B prompts_7b_progress_20k.jsonl 1024 16 16 vllm.d16p1m4+d32p2m1 16384 128 4 0.01
```
╭─────────────────────────────────────────────────╮
│ Setting PPOMATHConfig with the Following Values │
╰─────────────────────────────────────────────────╯

Based on the Log Dir, you can check the specific logs for the currently running training task. The log path is `{Log Dir}/{task_id}.log`. For example, `/storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log`:

```
───────────────────────── Current Configuration Begin ──────────────────────────
actor (ModelTrainEvalConfig)
actor.type (ModelFamily)
actor.type._class (str) - qwen2
actor.type.size (int) - 7
actor.type.is_critic (bool) - False
...
────────────────────────── Current Configuration End ───────────────────────────

20250222-10:26:34.877 quickstart INFO: Running ppo-math experiment.
20250222-10:26:34.877 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-zero-distill-7B-n16/1024x16-n16
20250222-10:26:34.877 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-zero-distill-7B-n16/1024x16-n16
20250222-10:44:15.581 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-7B-distill-gpus-128/512x16
20250222-10:44:15.581 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-7B-distill-gpus-128/512x16
20250222-10:26:36.408 quickstart INFO: Launching experiments with RAY...
```

If errors occur during execution (e.g., keywords like "Error" appear), refer to the troubleshooting section.

## Commandline Options

The `./examples/train_batch_{1.5/7}B_n{1/4/16}.sh` scripts contain pre-configured training parameters, and all of these scripts ultimately launch training with the following command:

```bash
python3 -m realhf.apps.quickstart ppo-math option1=arg1 option2=arg2 ...
```

Command-line arguments like `option1=arg1` are parsed by [hydra](https://hydra.cc/), and each configuration item is a `dataclasses.dataclass` in the Python code. You can use the following command to view all the command-line arguments that can be passed to the experiment:

```bash
python3 -m realhf.apps.quickstart ppo-math --help
@ -287,17 +225,15 @@ python3 -m realhf.apps.quickstart ppo-math --help

The descriptions of the important parameters are as follows (a combined example command is sketched after this list):

+ `MODE`: It is always `ray`; do not change it to other values when following this tutorial.
+ `BASE_MODEL_PATH`: The path of the model.
+ `DATA_PATH`: The path of the dataset jsonl file.
+ `CLUSTER_SPEC_PATH`: Set it to the path of cluster_config.json.
+ `mode`: It is always `ray`; do not change it to other values when following this tutorial.
+ `{actor|critic|ref}.path`: The path of the model.
+ `dataset.path`: The path of the dataset jsonl file.
+ `external_configs.cluster_config`: Cluster configuration overrides; e.g., `fileroot` is the root path for saving training outputs.
+ `n_nodes`: The number of nodes.
+ `n_gpus_per_node`: The number of GPUs per node.
+ `allocation_mode`: The GPU allocation and 3D parallel strategy of the model in the experiment, mainly in the following two forms:
  + `d${DP}m${TP}p${PP}`: The three integers DP, TP, and PP represent the degrees of data, tensor, and pipeline parallelism, and their product must equal the total number of GPUs (i.e., DPxTPxPP=#GPUs). In this configuration, generation and training use the entire GPU cluster and the same parallel strategy. If you want to use vLLM for generation, you also need to set `actor.vllm.hybrid_train=True` and `actor.vllm.enforce_eager=True`. Note that PP must be 1 (vLLM does not support PP for now).
  + `actor_gen:d${DP1}p${PP1}m${TP1},*:d${DP2}p${PP2}m${TP2}`: Sets the parallel strategies for generation and training separately. Generation and training use the entire GPU cluster but may use different parallel strategies. The configuration must satisfy DP1xTP1xPP1=DP2xTP2xPP2=#GPUs. If you want to use vLLM for generation, you also need to set `actor.vllm.hybrid_train=True` and `actor.vllm.enforce_eager=True`. Note that PP1 must be 1 (vLLM does not support PP for now).
  + `vllm.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures the parallel strategies for vLLM generation and training respectively. Generation and training use disjoint sets of GPUs, and the sum of the GPUs used by the two must equal the total number of GPUs, i.e., DP1xTP1xPP1+DP2xTP2xPP2=#GPUs. In this configuration you must set `actor.vllm.hybrid_train=False` and PP1=1. It is recommended to set `actor.vllm.enforce_eager=False` to accelerate vLLM generation.
+ `allocation_mode`: The GPU allocation and 3D parallel strategy of the model in the experiment, mainly in the following form:
  + `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: Configures the parallel strategies for SGLang generation and training respectively. Generation and training use disjoint sets of GPUs, and the sum of the GPUs used by the two must equal the total number of GPUs, i.e., DP1xTP1xPP1+DP2xTP2xPP2=#GPUs.

+ `exp_ctrl.total_train_epochs`: The number of training epochs (i.e., the number of times to iterate over the entire dataset).
+ `exp_ctrl.save_freq_{epochs|steps|secs}`: The frequency of saving model parameters to persistent storage. If set to null, the model will not be saved.
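
To make these options concrete, here is a sketch of a launch command combining several of them (values are illustrative; the comments show how `allocation_mode` must match the GPU count):

```bash
# 16 nodes x 8 GPUs = 128 GPUs.
# sglang.d64p1m1+d32p2m1 -> generation 64*1*1 = 64 GPUs, training 32*2*1 = 64 GPUs, 64 + 64 = 128.
python3 -m realhf.apps.quickstart ppo-math \
  mode=ray \
  n_nodes=16 n_gpus_per_node=8 \
  allocation_mode=sglang.d64p1m1+d32p2m1 \
  actor.path=/storage/models/DeepSeek-R1-Distill-Qwen-7B \
  dataset.path=/storage/datasets/boba_106k_0319.jsonl \
  exp_ctrl.total_train_epochs=5 \
  exp_ctrl.save_freq_epochs=1
```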
@ -318,7 +254,6 @@ Here, we use the logs from a 16-node run (the same applies to 1-node and 4-node
Search for the keyword `Epoch` in the logs to see the total number of Epochs and Steps:
```bash
# grep "Epoch" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:11:56.997 master worker INFO: Epoch 1/1 step 1/19 (global step 1) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2124.429*s. Total time consumption: 2283.862s.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.719 master worker INFO: Epoch 1/1 step 2/19 (global step 2) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2405.716*s. Total time consumption: 4689.584s.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.084 master worker INFO: Epoch 1/1 step 3/19 (global step 3) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2122.318*s. Total time consumption: 6811.949s. Estimated remaining time: 33957.093s.
@ -342,7 +277,6 @@ Search for the keyword `task_reward` in the logs.
```bash
# grep "task_reward" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:11:56.991 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.2640759198111482e-05, 'actor_loss': 1.1128166761409375e-06, 'actor_clip_ratio': 2.1122002635820536e-07, 'importance_weight': 1.0000014305114746, 'task_reward': -0.2996826171875, 'kl_reward': -2.27004832709099e-07, 'final_reward': -0.30145370960235596, 'advantage': 0.003593671601265669, 'avg_seq_len': 7907.8955078125, 'avg_prompt_len': 105.845703125, 'n_tokens': 127828786.0, 'n_valid_tokens': 127828786.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.122802734375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.712 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.493159263394773e-05, 'actor_loss': -3.846728588996484e-07, 'actor_clip_ratio': 3.16789424914532e-07, 'importance_weight': 0.9999996423721313, 'task_reward': -0.6793212890625, 'kl_reward': -2.536311853873485e-07, 'final_reward': -0.6813737154006958, 'advantage': 0.004844569601118565, 'avg_seq_len': 8203.9453125, 'avg_prompt_len': 111.892578125, 'n_tokens': 132580185.0, 'n_valid_tokens': 132580185.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.13812255859375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.077 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.572356243035756e-05, 'actor_loss': -5.036404786551429e-07, 'actor_clip_ratio': 1.8960582792715286e-07, 'importance_weight': 0.9999992251396179, 'task_reward': -0.6280517578125, 'kl_reward': -2.988609537624143e-07, 'final_reward': -0.6303607225418091, 'advantage': 0.004505862481892109, 'avg_seq_len': 7834.6328125, 'avg_prompt_len': 108.900390625, 'n_tokens': 126578395.0, 'n_valid_tokens': 126578395.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.11761474609375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
@ -367,7 +301,7 @@ The evaluation code is located in the `evaluation` folder of the repository. As

Start a new container to execute the evaluation script (note: evaluation requires updates to certain Python libraries; avoid using the training container for this task):

```
docker run -d --name r1-eval --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "tail -f /dev/null"
docker run -d --name r1-eval --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "tail -f /dev/null"
docker exec -it r1-eval bash
```
@ -431,48 +365,26 @@ The runtime of the evaluation depends on factors such as the maximum generation

If the following content does not address your issue, feel free to raise a GitHub Issue.

## Automatic Recover

### How to

When `recover_mode=auto` is set and the experiment configuration remains the same, AReaL will try to discover previous recover checkpoints and resume the experiment from the latest one.
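
A sketch of how this looks as command-line overrides (the values match the example configs in this PR; model and dataset paths and other required options are omitted):

```bash
# Enable automatic recovery and control how often recover checkpoints are written.
python3 -m realhf.apps.quickstart ppo-math \
  recover_mode=auto recover_retries=10 \
  exp_ctrl.ckpt_freq_secs=600
```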

The training is exclusively initiated through the `./examples/train_batch_{1.5/7}B_n{1/4/16}.sh` scripts. These scripts include parameter entries in the format below and automatically exit after the parameter set has been executed:
If automatic recovery fails, please check the following possibilities:

```bash
ALL_PARAMS=(
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)
```
* The `experiment_name` and `trial_name` in the training script differ from the previous run.

Training may abort due to OOM errors or hardware failures. In such scenarios, manually rerunning the train_batch script will automatically resume from the latest recoverable checkpoint.

For persistent failures requiring frequent manual intervention, modify the train_batch script by duplicating identical parameter sets to enable automated retries. For instance, to allow 3 retry attempts, configure three identical parameter groups as demonstrated:

```bash
ALL_PARAMS=(
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)
```

### Why does the training task restart from the beginning instead of resuming from the last Step?

Check the following possibilities:

* The EXP_NAME and TRIAL_NAME in the training script differ from the previous run.

* Changes in Batch Size (1024 in the parameters), Group Size (16 in the parameters), or the number of nodes (${NODES} in the parameters).
* Changes in Batch Size (`dataset.train_bs_n_seqs`), Group Size (`group_size`), or the number of nodes (`n_nodes`).

* No recover checkpoint was created in the previous run. By default, recover checkpoints are generated under two conditions:

  * After the completion of the second Step.

  * When a Step completes and more than 600 seconds have passed since the last recover checkpoint. This parameter is in the `examples/train_{tiny|small|large}_on_ray.sh` scripts, named `exp_ctrl.ckpt_freq_secs=600`.
  * When a Step completes and more than 600 seconds have passed since the last recover checkpoint. This parameter is in `./examples/configs/*/*.yaml`, named `exp_ctrl.ckpt_freq_secs=600`.

You can confirm whether a recover checkpoint was generated by searching the log:

```bash
# grep "Dumped recover" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.105 master worker INFO: Dumped recover info to file.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-13:05:58.264 master worker INFO: Dumped recover info to file.
@ -42,7 +42,7 @@
| Git LFS | See: [Git LFS installation guide](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage). Mainly used for downloading models, datasets, and the AReaL project code. |
| Docker | Version: 27.5.1 |
| NVIDIA Container Toolkit | [NVIDIA Container Toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) |
| Image | ghcr.io/inclusionai/areal-runtime:v0.1.0. This image contains the runtime dependencies and Ray components. |
| Image | ghcr.io/inclusionai/areal-runtime:v0.2.0. This image contains the runtime dependencies and Ray components. |

Since the installation of the NVIDIA Driver and CUDA, as well as the mounting of shared storage, depends on the nodes and system version, please complete them yourself; this tutorial does not cover them.
@ -84,7 +84,7 @@ python ./examples/env/setup_env_and_start_train.py setup --private_key_file /pat

Since shared storage is used, downloading only needs to be done on one node.

## Code and Cluster Configuration
## Code

Clone the AReaL project code into `/storage/codes`:
@ -94,34 +94,14 @@ cd /storage/codes/
git clone https://github.com/inclusionAI/AReaL.git
```

Create the cluster configuration file `/storage/ray/cluster_config_on_ray.json`:
```bash
mkdir -p /storage/ray/
cd /storage/ray/
```

Write the following configuration to `/storage/ray/cluster_config_on_ray.json`:

```
{
    "cluster_type": "ray",
    "cluster_name": "ray_cluster",
    "fileroot": "/storage/ray/experiments",
    "default_mount": "/storage:/storage",
    "n_gpus_per_node": 8
}
```

The cluster configuration file describes the cluster on which the AReaL training job runs. The path pointed to by `fileroot` is where logs and checkpoints are stored during training.

## Dataset

We provide a dataset for training. Please download it and place it in `/storage/datasets/`:
```bash
mkdir -p /storage/datasets/
cd /storage/datasets/
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/full_prompts_for_r1_distilled.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/full_orz_zero.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/boba_106k_0319.jsonl?download=true
wget https://huggingface.co/datasets/inclusionAI/AReaL-RL-Data/resolve/main/data/orz-zero_56k_0319.jsonl?download=true
```

## Model
@ -132,6 +112,7 @@ mkdir -p /storage/models
cd /storage/models
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
```

You can also download the models with the HuggingFace CLI after installing `huggingface_hub` from PyPI; see the [official documentation](https://huggingface.co/docs/huggingface_hub/guides/cli) for details.
@ -144,7 +125,7 @@ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-D

On the first node, start the Ray Head with the following command:

```bash
docker run -d --name r1-ray-head --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "ray start --head --port=6379 && tail -f /dev/null"
docker run -d --name r1-ray-head --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "ray start --head --port=6379 && tail -f /dev/null"
```

On every node except the first one, start a Ray Worker with the following command (skip this step if you only have one node):
@ -152,7 +133,7 @@ docker run -d --name r1-ray-head --privileged --gpus all --network host --shm-si
```bash
# RAY_HEAD_IP is the IP of the first node
RAY_HEAD_IP=xxx.xxx.xxx.xxx
docker run -d --name r1-ray-worker --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "ray start --address=$RAY_HEAD_IP:6379 && tail -f /dev/null"
docker run -d --name r1-ray-worker --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "ray start --address=$RAY_HEAD_IP:6379 && tail -f /dev/null"
```

After everything has started, enter the container on the first node via docker exec and check the status of the Ray cluster:
@ -204,88 +185,44 @@ Demands:

# RL Training

## Single-Node Training

If you only have one node, run the following command to start training:

```bash
docker exec -it r1-ray-head bash
cd /storage/codes/AReaL
mkdir /storage/ray/train_batch_logs/
nohup bash ./examples/train_batch_1.5B_n1.sh &> /storage/ray/train_batch_logs/n1.log &
```

After starting, check the training launch information in the log file `/storage/ray/train_batch_logs/n1.log`:

```
Log Dir: /storage/ray/train_batch_logs/ppo-zero-distill-1.5B-n1/20250222-104411
Task Count: 1
2025-02-22 10:44.11 Task 0 started: ppo-zero-distill-1.5B-n1 deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B prompts.jsonl 1024 8 1 actor_gen:d4p1m2,*:d4p2m1 16384 128 1 0.001
```

Based on the Log Dir, you can check the specific log of the currently running training task at `{Log Dir}/{task id}.log`, e.g. `/storage/ray/train_batch_logs/ppo-zero-distill-1.5B-n1/20250222-104411/0.log`:

```
20250222-10:44:15.581 quickstart INFO: Running ppo-math experiment.
20250222-10:44:15.581 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-zero-distill-1.5B-n1/1024x8-n1
20250222-10:44:15.581 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-zero-distill-1.5B-n1/1024x8-n1
20250222-10:44:17.100 quickstart INFO: Launching experiments with RAY...
```

If errors occur during execution (e.g., the keyword "Error" appears), refer to the Troubleshooting section.

## Distributed Training

Before starting distributed training, make sure the Ray cluster has been started and is healthy.
Then, on the first node (where the Ray Head is located), enter the container:

```
docker exec -it r1-ray-head bash
cd /storage/codes/AReaL
mkdir /storage/ray/train_batch_logs/
```

Pick the task that matches your hardware environment and run it:
Pick the config that matches your hardware environment and run it:

```bash
# For the 1.5B model on 4 nodes, the log file is n4.log
nohup bash ./examples/train_batch_1.5B_n4.sh &> /storage/ray/train_batch_logs/n4.log &
# For the 1.5B model on 16 nodes, the log file is n16.log
nohup bash ./examples/train_batch_1.5B_n16.sh &> /storage/ray/train_batch_logs/n16.log &
# For the 7B model on 4 nodes, the log file is 7n4.log
nohup bash ./examples/train_batch_7B_n4.sh &> /storage/ray/train_batch_logs/7n4.log &
# For the 7B model on 16 nodes, the log file is 7n16.log
nohup bash ./examples/train_batch_7B_n16.sh &> /storage/ray/train_batch_logs/7n16.log &
python3 -m realhf.apps.quickstart ppo-math --config ./examples/configs/7B-distill/ppo-7B-distill-gpus-128.yaml
```

After starting, check the training launch information in the log file `/storage/ray/train_batch_logs/{corresponding log file name}.log` (e.g., `7n16.log`):

```
Log Dir: /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631
Task Count: 1
2025-02-22 10:26.31 Task 0 started: ppo-zero-distill-7B-n16 deepseek-ai__DeepSeek-R1-Distill-Qwen-7B prompts_7b_progress_20k.jsonl 1024 16 16 vllm.d16p1m4+d32p2m1 16384 128 4 0.01
After starting, you can see the launch log in the terminal:
```
╭─────────────────────────────────────────────────╮
│ Setting PPOMATHConfig with the Following Values │
╰─────────────────────────────────────────────────╯

Based on the Log Dir, you can check the specific log of the currently running training task at `{Log Dir}/{task id}.log`, e.g. `/storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log`:

```
───────────────────────── Current Configuration Begin ──────────────────────────
actor (ModelTrainEvalConfig)
actor.type (ModelFamily)
actor.type._class (str) - qwen2
actor.type.size (int) - 7
actor.type.is_critic (bool) - False
...
────────────────────────── Current Configuration End ───────────────────────────

20250222-10:26:34.877 quickstart INFO: Running ppo-math experiment.
20250222-10:26:34.877 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-zero-distill-7B-n16/1024x16-n16
20250222-10:26:34.877 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-zero-distill-7B-n16/1024x16-n16
20250222-10:44:15.581 quickstart INFO: Logs will be dumped to /storage/ray/experiments/logs/root/ppo-7B-distill-gpus-128/512x16
20250222-10:44:15.581 quickstart INFO: Model checkpoints will be saved to /storage/ray/experiments/checkpoints/root/ppo-7B-distill-gpus-128/512x16
20250222-10:26:36.408 quickstart INFO: Launching experiments with RAY...
```

If errors occur during execution (e.g., the keyword "Error" appears), refer to the Troubleshooting section.

## Commandline Options

The `./examples/train_batch_{1.5/7}B_n{1/4/16}.sh` scripts contain pre-configured training parameters, and all of them ultimately launch training with the following command:

```bash
python3 -m realhf.apps.quickstart ppo-math option1=arg1 option2=arg2 ...
```

Command-line arguments such as `option1=arg1` are parsed by [hydra](https://hydra.cc/), and each configuration item is a `dataclasses.dataclass` in the Python code. Use the following command to view all command-line arguments that can be passed to the experiment:

```bash
python3 -m realhf.apps.quickstart ppo-math --help
@ -293,16 +230,15 @@ python3 -m realhf.apps.quickstart ppo-math --help

The important parameters are described as follows:

+ MODE: always `ray`; do not change it to other values when following this tutorial.
+ BASE_MODEL_PATH: the path of the model.
+ DATA_PATH: the path of the dataset jsonl file.
+ CLUSTER_SPEC_PATH: set it to the path of cluster_config.json.
+ mode: always `ray`; do not change it to other values when following this tutorial.
+ {actor|critic|ref}.path: the path of the model.
+ dataset.path: the path of the dataset jsonl file.
+ external_configs.cluster_config: cluster configuration overrides; for example, fileroot is the root directory for training outputs.
+ n_nodes: the number of nodes.
+ n_gpus_per_node: the number of GPUs per node.
+ allocation_mode: the GPU allocation and 3D parallel strategy of the models in the experiment; the recommended strategies mainly take the following two forms:
  + `actor_gen:d${DP1}p${PP1}m${TP1},*:d${DP2}p${PP2}m${TP2}`: configure the parallel strategies for generation and training separately. Generation and training share all GPUs but may use different parallel strategies. In both strategies the product of the three integers must equal the total number of GPUs, i.e., DP1xTP1xPP1=DP2xTP2xPP2=#GPUs. If you want to use vLLM to accelerate generation, you also need to set `actor.vllm.hybrid_train=True` and `actor.vllm.enforce_eager=True`, and PP1 must be 1 (vLLM does not support PP for now).
  + `vllm.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: configure the parallel strategies for vLLM generation and training separately. Generation and training are disaggregated onto two disjoint sets of GPUs, whose sizes must add up to the total number of GPUs, i.e., DP1xTP1xPP1+DP2xTP2xPP2=#GPUs. In this configuration you must set `actor.vllm.hybrid_train=False`; you can set `actor.vllm.enforce_eager=False` to accelerate vLLM generation. PP1 must also be 1 when using vLLM.
+ n_gpus_per_node: the number of GPUs per node.
+ allocation_mode: the GPU allocation and 3D parallel strategy of the models in the experiment; the recommended strategy takes the following form:
  + `sglang.d${DP1}m${TP1}p${PP1}+d${DP2}m${TP2}p${PP2}`: configure the parallel strategies for SGLang generation and training separately. Generation and training are disaggregated onto two disjoint sets of GPUs, whose sizes must add up to the total number of GPUs, i.e., DP1xTP1xPP1+DP2xTP2xPP2=#GPUs.

+ exp_ctrl.total_train_epochs: the number of training epochs (i.e., the number of passes over the entire dataset).
+ exp_ctrl.save_freq_{epochs|steps|secs}: the frequency of saving model parameters to persistent storage; if set to null, the model will not be saved.
@ -323,7 +259,6 @@ python3 -m realhf.apps.quickstart ppo-math --help

Search the logs for the keyword `Epoch` to see the total number of epochs and steps:

```bash
# grep "Epoch" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:11:56.997 master worker INFO: Epoch 1/1 step 1/19 (global step 1) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2124.429*s. Total time consumption: 2283.862s.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.719 master worker INFO: Epoch 1/1 step 2/19 (global step 2) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2405.716*s. Total time consumption: 4689.584s.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.084 master worker INFO: Epoch 1/1 step 3/19 (global step 3) finishes. Average #tokens per batch is 111847. #End to end# execution time: *2122.318*s. Total time consumption: 6811.949s. Estimated remaining time: 33957.093s.
@ -346,7 +281,6 @@ python3 -m realhf.apps.quickstart ppo-math --help

Search the logs for the keyword `task_reward`:

```bash
# grep "task_reward" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:11:56.991 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.2640759198111482e-05, 'actor_loss': 1.1128166761409375e-06, 'actor_clip_ratio': 2.1122002635820536e-07, 'importance_weight': 1.0000014305114746, 'task_reward': -0.2996826171875, 'kl_reward': -2.27004832709099e-07, 'final_reward': -0.30145370960235596, 'advantage': 0.003593671601265669, 'avg_seq_len': 7907.8955078125, 'avg_prompt_len': 105.845703125, 'n_tokens': 127828786.0, 'n_valid_tokens': 127828786.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.122802734375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.712 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.493159263394773e-05, 'actor_loss': -3.846728588996484e-07, 'actor_clip_ratio': 3.16789424914532e-07, 'importance_weight': 0.9999996423721313, 'task_reward': -0.6793212890625, 'kl_reward': -2.536311853873485e-07, 'final_reward': -0.6813737154006958, 'advantage': 0.004844569601118565, 'avg_seq_len': 8203.9453125, 'avg_prompt_len': 111.892578125, 'n_tokens': 132580185.0, 'n_valid_tokens': 132580185.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.13812255859375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.077 master worker INFO: RPC name actor_train returns {'ppo_approx_kl': -2.572356243035756e-05, 'actor_loss': -5.036404786551429e-07, 'actor_clip_ratio': 1.8960582792715286e-07, 'importance_weight': 0.9999992251396179, 'task_reward': -0.6280517578125, 'kl_reward': -2.988609537624143e-07, 'final_reward': -0.6303607225418091, 'advantage': 0.004505862481892109, 'avg_seq_len': 7834.6328125, 'avg_prompt_len': 108.900390625, 'n_tokens': 126578395.0, 'n_valid_tokens': 126578395.0, 'n_seqs': 16384.0, 'no_eos_ratio': 0.11761474609375, 'disable_value': 1.0, 'mask_no_eos_with_zero': 0.0}
@ -371,7 +305,7 @@ python3 -m realhf.apps.quickstart ppo-math --help

Start a new container for running the evaluation script (evaluation requires updating some Python libraries; please do not do this in the training container):
```
docker run -d --name r1-eval --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.1.0 /bin/bash -c "tail -f /dev/null"
docker run -d --name r1-eval --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.2.0 /bin/bash -c "tail -f /dev/null"
docker exec -it r1-eval bash
```
@ -434,43 +368,22 @@ nohup python eval_and_aggregate.py \

If the content below does not answer your question, feel free to ask in a GitHub Issue.

## Automatic Restart
## Automatic Recovery

### How to

When `recover_mode=auto` is set and the training configuration is the same as before, AReaL will try to find previously generated checkpoints and resume training from them.

Training is always launched through the `./examples/train_batch_{1.5/7}B_n{1/4/16}.sh` scripts. Each script contains one line of launch parameters in the following format, and the `train_batch` script stops automatically after this parameter set has been executed:
If automatic recovery fails, the following are possible causes:

```bash
ALL_PARAMS=(
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)
```
An OOM error or a hardware failure will terminate training; in that case you can manually rerun the `train_batch` script, and training will resume automatically from the last recover checkpoint.

If failures are frequent and manual restarts become a burden, you can modify the `train_batch` script and set several identical parameter groups so that the script retries automatically. For example, to rerun this parameter set up to 3 times, simply set three identical groups, as shown below:
```bash
ALL_PARAMS=(
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
    "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)
```

### Why does the training task restart from scratch instead of continuing from the last Step?

Check the following possibilities:

+ The EXP_NAME and TRIAL_NAME in the training script differ from the previous run.
+ Batch Size (1024 in the parameters), Group Size (16 in the parameters), or the number of nodes (${NODES} in the parameters) has changed.
+ The `experiment_name` and `trial_name` in the training config differ from the previous run.
+ Batch Size (`dataset.train_bs_n_seqs`), Group Size (`group_size`), or the number of nodes (`n_nodes`) has changed.
+ No recover checkpoint was created in the previous run. There are 2 default rules for recover checkpoints:
  + A recover checkpoint is only generated after the second step has completed.
  + A new recover checkpoint is generated when a step finishes and more than 600 seconds have passed since the last recover checkpoint. This parameter is in the `examples/train_{tiny|small|large}_on_ray.sh` scripts, named `exp_ctrl.ckpt_freq_secs=600`.
  + A new recover checkpoint is generated when a step finishes and more than 600 seconds have passed since the last recover checkpoint. This parameter is in the `./examples/configs/*/*.yaml` files, named `exp_ctrl.ckpt_freq_secs=600`.

You can confirm whether a recover checkpoint was generated by searching for Dumped recover
You can confirm whether a recover checkpoint was generated by searching for `Dumped recover`:

```bash
# grep "Dumped recover" /storage/ray/train_batch_logs/ppo-zero-distill-7B-n16/20250222-102631/0.log
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-12:27:25.105 master worker INFO: Dumped recover info to file.
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-13:05:58.264 master worker INFO: Dumped recover info to file.
@ -1,15 +0,0 @@
{
    "cluster_type": "slurm",
    "cluster_name": "my_cluster",
    "fileroot": "/storage/openpsi/experiments",
    "default_mount": "/storage:/storage",
    "node_type_from_node_name": {
        "slurmd-\\d+$": ""
    },
    "gpu_type_from_node_name": {
        "slurmd-\\d+$": "tesla"
    },
    "cpu_image": "/storage/images/real-gpu.sif",
    "gpu_image": "/storage/images/real-gpu.sif",
    "node_name_prefix": "slurmd-"
}
@ -1,7 +0,0 @@
{
    "cluster_type": "ray",
    "cluster_name": "ray_cluster",
    "fileroot": "/storage/ray/experiments",
    "default_mount": "/storage:/storage",
    "n_gpus_per_node": 8
}
@ -0,0 +1,86 @@
experiment_name: ppo-1.5B-distill-gpus-128
trial_name: 512x16
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d64p1m1+d32p2m1'
n_nodes: 16
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
actor:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
  optimizer:
    lr: 2e-05
    lr_scheduler_type: constant
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    mem_fraction_static: 0.8
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
    is_critic: true
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
  init_critic_from_actor: true
ref:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 16
group_adv_norm: false
external_configs:
  cluster_config:
    fileroot: "/storage/ray/experiments"
    envs:
      REAL_GPU_MEMORY_KILL_THRESHOLD: "1"
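
These YAML files are the new experiment configurations added by this PR. A hedged usage sketch (the exact relative path under `examples/configs/` is an assumption based on the 7B example referenced in the tutorial above):

```bash
# Launch PPO training from a config file instead of passing hydra overrides one by one.
python3 -m realhf.apps.quickstart ppo-math --config ./examples/configs/1.5B-distill/ppo-1.5B-distill-gpus-128.yaml
```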
@ -0,0 +1,86 @@
experiment_name: ppo-1.5B-distill-gpus-32
trial_name: 512x16
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d16p1m1+d8p2m1'
n_nodes: 4
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
actor:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
  optimizer:
    lr: 2e-05
    lr_scheduler_type: constant
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    mem_fraction_static: 0.8
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
    is_critic: true
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
  init_critic_from_actor: true
ref:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 16
group_adv_norm: false
external_configs:
  cluster_config:
    fileroot: "/storage/ray/experiments"
    envs:
      REAL_GPU_MEMORY_KILL_THRESHOLD: "1"
@ -1,16 +1,16 @@
experiment_name: areal-1.5B-distill-gpus-8
trial_name: 1024x8
experiment_name: ppo-1.5B-distill-gpus-8
trial_name: 512x16
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d4p1m1+d4p1m1'
allocation_mode: 'sglang.d4p1m1+d2p2m1'
n_nodes: 1
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 10
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
@ -19,16 +19,14 @@ actor:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
  optimizer:
    lr: 1.0e-05
    lr: 2e-05
    lr_scheduler_type: constant
    initial_loss_scale: 262144.0
    loss_scale_window: 5.0
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    disable_radix_cache: true
    context_length: 18432
    mem_fraction_static: 0.8
    max_running_requests: 128
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
@ -41,43 +39,45 @@ ref:
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-1.5B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 19456
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/prompts_for_r1_distilled_0319.jsonl'
  max_prompt_len: 2048
  train_bs_n_seqs: 1024
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 16384
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.001
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5.0
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 8
group_size: 16
group_adv_norm: false
external_configs:
  cluster_config:
@ -0,0 +1,86 @@
experiment_name: ppo-32B-distill-gpus-128
trial_name: 512x32
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d8m8p1+d4p4m4'
n_nodes: 16
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
actor:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-32B'
  optimizer:
    lr: 2e-05
    lr_scheduler_type: constant
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    mem_fraction_static: 0.8
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
    is_critic: true
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-32B'
  init_critic_from_actor: true
ref:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-32B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 32
group_adv_norm: false
external_configs:
  cluster_config:
    fileroot: "/storage/ray/experiments"
    envs:
      REAL_GPU_MEMORY_KILL_THRESHOLD: "1"
@ -0,0 +1,86 @@
experiment_name: ppo-7B-distill-gpus-128
trial_name: 512x16
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d64p1m1+d32p2m1'
n_nodes: 16
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
actor:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
  optimizer:
    lr: 2e-05
    lr_scheduler_type: constant
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    mem_fraction_static: 0.8
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
    is_critic: true
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
  init_critic_from_actor: true
ref:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 16
group_adv_norm: false
external_configs:
  cluster_config:
    fileroot: "/storage/ray/experiments"
    envs:
      REAL_GPU_MEMORY_KILL_THRESHOLD: "1"
@ -0,0 +1,86 @@
experiment_name: ppo-7B-distill-gpus-32
trial_name: 512x16
mode: ray
wandb:
  mode: disabled
recover_mode: auto
recover_retries: 10
allocation_mode: 'sglang.d16p1m1+d8p2m1'
n_nodes: 4
n_gpus_per_node: 8
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_epochs: 1
  ckpt_freq_secs: 600
torch_cache_mysophobia: true
actor:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
  optimizer:
    lr: 2e-05
    lr_scheduler_type: constant
    eps: 1e-5
    warmup_steps_proportion: 0.001
    hysteresis: 2
  sglang:
    mem_fraction_static: 0.8
    triton_attention_num_kv_splits: 16
critic:
  type:
    _class: qwen2
    is_critic: true
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
  init_critic_from_actor: true
ref:
  type:
    _class: qwen2
  path: '/storage/models/DeepSeek-R1-Distill-Qwen-7B'
actor_train:
  mb_spec:
    max_tokens_per_mb: 30720
critic_train:
  mb_spec:
    max_tokens_per_mb: 30720
actor_gen:
  mb_spec:
    max_tokens_per_mb: 30720
critic_inf:
  mb_spec:
    max_tokens_per_mb: 30720
actor_inf:
  mb_spec:
    max_tokens_per_mb: 30720
ref_inf:
  mb_spec:
    max_tokens_per_mb: 30720
dataset:
  path: '/storage/datasets/boba_106k_0319.jsonl'
  max_prompt_len: 1024
  train_bs_n_seqs: 512
ppo:
  gen:
    max_new_tokens: 27648
    min_new_tokens: 0
    top_p: 1.0
    top_k: 1000000
    temperature: 1.0
    force_no_logits_mask: True
    use_cuda_graph: True
  ppo_n_minibatches: 4
  kl_ctl: 0.0
  discount: 1.0
  value_eps_clip: 0.2
  disable_value: true
  reward_output_scaling: 5
  reward_output_bias: 0.0
  adv_norm: true
  value_norm: true
group_size: 16
group_adv_norm: false
external_configs:
  cluster_config:
    fileroot: "/storage/ray/experiments"
    envs:
      REAL_GPU_MEMORY_KILL_THRESHOLD: "1"
@ -1,48 +0,0 @@
# MODEL_FAMILY specifies how the pretrained checkpoint is loaded, e.g., as a LLaMA model or a GPT model.
MODEL_FAMILY=qwen2

# PRETRAINED_PATH is the HuggingFace checkpoint.
PRETRAINED_PATH=/storage/openpsi/models/Qwen__Qwen2.5-7B-Instruct
TRAIN_DATA_PATH=/storage/openpsi/data/ppu_test_data/examples_test_data/sft_pos-train.jsonl
VALID_DATA_PATH=/storage/openpsi/data/ppu_test_data/examples_test_data/sft_pos-train.jsonl

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to the path of it.
MODE=slurm

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
EXP_NAME=quickstart-sft
TRIAL_NAME=$MODEL_FAMILY-$MODE-run1

# We use the "manual" allocation mode here to manually specify the parallelism strategy,
# which is pipeline=2, tensor-model=2, and data=2, using in total of 8 GPUs.

# The `sft` subcommand specifies that this is a supervised fine-tuning experiment.
export CLUSTER_SPEC_PATH="/storage/realhf/examples/cluster_config.json"
python3 -m realhf.apps.quickstart sft \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    exp_ctrl.total_train_epochs=8 \
    exp_ctrl.save_freq_steps=50 \
    exp_ctrl.eval_freq_epochs=1 \
    model.optimizer.type=adam \
    model.optimizer.lr_scheduler_type=cosine \
    model.optimizer.lr=1e-5 \
    model.optimizer.warmup_steps_proportion=0.02 \
    model.type._class=$MODEL_FAMILY \
    model.path=$PRETRAINED_PATH \
    dataset.train_path=${TRAIN_DATA_PATH} \
    dataset.valid_path=${VALID_DATA_PATH} \
    dataset.max_seqlen=1024 \
    dataset.train_bs_n_seqs=512 \
    dataset.valid_bs_n_seqs=512 \
    allocation_mode=d4m4p2 \
    n_nodes=4 n_gpus_per_node=8 \
    allocation.mb_spec.n_mbs=2
@ -1,122 +0,0 @@
#!/bin/sh
MODEL_FAMILY=qwen2

EXP_NAME="ds-r1-distill-qwen-1.5b-16nodes"
TRAIN_BATCH_SIZE="1024"
GROUP_SIZE="8"
NODES="16"
ALLOCATION_MODE="vllm.d64p1m1+d32p2m1"
MAX_NEW_TOKENS=$3
MAX_NUM_SEQS=128
PPO_MBS=4
KL_CTL=0.001

MAX_TOKEN_PER_MB=$(expr 2048 + ${MAX_NEW_TOKENS} + 1024)
MAX_SEQ_LEN_TO_CAPTURE=$(expr 2048 + ${MAX_NEW_TOKENS})

BASE_MODEL_PATH="$1"

# original data
DATA_PATH="$2"

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster.
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster.
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to its path.
MODE=ray

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
#EXP_NAME=ppo-zero--${MODEL_NAME}--${DATASET_NAME}
#EXP_NAME=ppo-zero-distill-1.5B-default
TRIAL_NAME="${TRAIN_BATCH_SIZE}x${GROUP_SIZE}-n${NODES}"

# ALLOCATION_MODE assigns an explicit parallelism strategy to each model
# function call, i.e., actor generation, critic inference, actor train, etc.
# The total number of GPUs is `n_nodes` * `n_gpus_per_node` (8 GPUs per node here).

# The alternative `heuristic` allocation mode is not guaranteed to work with every
# model configuration. For example, if the vocabulary size is an odd number, model
# parallelism may not work. In these cases, use `ppo_manual.sh` to specify the
# parallelism strategy manually.

# The `ppo-math` subcommand specifies that this is a PPO experiment for math reasoning.
# Checkpoints are saved every epoch (`exp_ctrl.save_freq_epochs=1`) and a recovery
# checkpoint is written every 600 seconds (`exp_ctrl.ckpt_freq_secs=600`).
# The `ppo.*` options control the generation and PPO algorithm hyperparameters.
# Note that the performance of PPO is sensitive to the pre-trained model and hyperparameters.
# It's the user's responsibility to tune them appropriately.
unset CLUSTER_SPEC_PATH
CLUSTER_SPEC_PATH=/storage/ray/cluster_config_on_ray.json \
REAL_GPU_MEMORY_KILL_THRESHOLD=1 \
python3 -m realhf.apps.quickstart ppo-math \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    wandb.mode=disabled \
    exp_ctrl.total_train_epochs=10 \
    exp_ctrl.save_freq_epochs=1 \
    exp_ctrl.ckpt_freq_secs=600 \
    group_size=${GROUP_SIZE} \
    group_adv_norm=False \
    use_dense_reward=False \
    reward_delta=True \
    rw_type=sparse \
    check_xml_format=False \
    actor.type._class=$MODEL_FAMILY \
    actor.path=$BASE_MODEL_PATH \
    actor.vllm.hybrid_train=False \
    actor.vllm.enforce_eager=False \
    actor.vllm.max_seq_len_to_capture=${MAX_SEQ_LEN_TO_CAPTURE} \
    actor.vllm.max_num_seqs=${MAX_NUM_SEQS} \
    actor.vllm.gpu_memory_utilization=1 \
    actor.vllm.swap_space=64 \
    critic.type._class=$MODEL_FAMILY \
    critic.type.is_critic=True \
    critic.init_critic_from_actor=True \
    critic.path=$BASE_MODEL_PATH \
    ref.type._class=$MODEL_FAMILY \
    ref.path=$BASE_MODEL_PATH \
    rew.type._class=$MODEL_FAMILY \
    rew.type.is_critic=True \
    rew.init_critic_from_actor=True \
    rew.path=$BASE_MODEL_PATH \
    dataset.path=$DATA_PATH \
    dataset.max_prompt_len=2048 \
    dataset.train_bs_n_seqs=${TRAIN_BATCH_SIZE} \
    ppo.gen.max_new_tokens=${MAX_NEW_TOKENS} \
    ppo.gen.min_new_tokens=0 \
    ppo.disable_value=True \
    ppo.gen.top_p=1 ppo.gen.top_k=1000000 \
    ppo.gen.use_cuda_graph=True \
    ppo.gen.force_no_logits_mask=True \
    ppo.ppo_n_minibatches=${PPO_MBS} \
    ppo.gen.temperature=0.6 \
    ppo.kl_ctl=${KL_CTL} \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=5 \
    ppo.reward_output_bias=0.0 \
    ppo.adv_norm=True ppo.value_norm=True \
    mask_too_long=False \
    ppo.discount=1.0 \
    actor.optimizer.lr=1e-5 \
    actor.optimizer.lr_scheduler_type=constant \
    actor.optimizer.initial_loss_scale=262144.0 \
    actor.optimizer.loss_scale_window=5 \
    actor.optimizer.hysteresis=2 \
    actor_gen.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    ref_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    rew_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    cache_clear_freq=1 \
    n_nodes=${NODES} \
    allocation_mode="'${ALLOCATION_MODE}'" n_gpus_per_node=8 \
    recover_mode=auto \
    recover_retries=10 \
    torch_cache_mysophobia=True
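As a worked example of the arithmetic at the top of this script (illustrative only; 16384 is the value the batch scripts below pass for `MAX_NEW_TOKENS`, and the 2048 presumably mirrors `dataset.max_prompt_len=2048` with 1024 tokens of headroom):

```bash
# Not part of the script; just evaluating the two expressions for MAX_NEW_TOKENS=16384.
MAX_NEW_TOKENS=16384
echo $(expr 2048 + ${MAX_NEW_TOKENS} + 1024)   # MAX_TOKEN_PER_MB = 19456
echo $(expr 2048 + ${MAX_NEW_TOKENS})          # MAX_SEQ_LEN_TO_CAPTURE = 18432
```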
@ -1,119 +0,0 @@
#!/bin/sh
MODEL_FAMILY=qwen2

EXP_NAME="ds-r1-distill-qwen-7b-zero-16nodes"
TRAIN_BATCH_SIZE="512"
GROUP_SIZE="64"
NODES="16"
ALLOCATION_MODE="vllm.d16p1m4+d32p2m1"
MAX_NEW_TOKENS=$3
MAX_NUM_SEQS=128
PPO_MBS=4
KL_CTL=0.0

MAX_TOKEN_PER_MB=$(expr 2048 + ${MAX_NEW_TOKENS} + 1024)
MAX_SEQ_LEN_TO_CAPTURE=$(expr 2048 + ${MAX_NEW_TOKENS})

BASE_MODEL_PATH="$1"

# original data
DATA_PATH="$2"

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster.
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster.
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to its path.
MODE=ray

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
#EXP_NAME=ppo-zero--${MODEL_NAME}--${DATASET_NAME}
#EXP_NAME=ppo-zero-distill-1.5B-default
TRIAL_NAME="${TRAIN_BATCH_SIZE}x${GROUP_SIZE}-n${NODES}"

# ALLOCATION_MODE assigns an explicit parallelism strategy to each model
# function call, i.e., actor generation, critic inference, actor train, etc.
# The total number of GPUs is `n_nodes` * `n_gpus_per_node` (8 GPUs per node here).

# The alternative `heuristic` allocation mode is not guaranteed to work with every
# model configuration. For example, if the vocabulary size is an odd number, model
# parallelism may not work. In these cases, use `ppo_manual.sh` to specify the
# parallelism strategy manually.

# The `ppo-math` subcommand specifies that this is a PPO experiment for math reasoning.
# Checkpoints are saved every epoch (`exp_ctrl.save_freq_epochs=1`) and a recovery
# checkpoint is written every 600 seconds (`exp_ctrl.ckpt_freq_secs=600`).
# The `ppo.*` options control the generation and PPO algorithm hyperparameters.
# Note that the performance of PPO is sensitive to the pre-trained model and hyperparameters.
# It's the user's responsibility to tune them appropriately.
unset CLUSTER_SPEC_PATH
CLUSTER_SPEC_PATH=/storage/ray/cluster_config_on_ray.json \
REAL_GPU_MEMORY_KILL_THRESHOLD=1 \
python3 -m realhf.apps.quickstart ppo-math \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    wandb.mode=disabled \
    exp_ctrl.total_train_epochs=10 \
    exp_ctrl.save_freq_epochs=1 \
    exp_ctrl.ckpt_freq_secs=600 \
    group_size=${GROUP_SIZE} \
    group_adv_norm=False \
    use_dense_reward=False \
    reward_delta=True \
    rw_type=sparse \
    check_xml_format=False \
    actor.type._class=$MODEL_FAMILY \
    actor.path=$BASE_MODEL_PATH \
    actor.vllm.hybrid_train=False \
    actor.vllm.enforce_eager=False \
    actor.vllm.max_seq_len_to_capture=${MAX_SEQ_LEN_TO_CAPTURE} \
    actor.vllm.max_num_seqs=${MAX_NUM_SEQS} \
    actor.vllm.gpu_memory_utilization=1 \
    actor.vllm.swap_space=64 \
    critic.type._class=$MODEL_FAMILY \
    critic.type.is_critic=True \
    critic.init_critic_from_actor=True \
    critic.path=$BASE_MODEL_PATH \
    ref.type._class=$MODEL_FAMILY \
    ref.path=$BASE_MODEL_PATH \
    rew.type._class=$MODEL_FAMILY \
    rew.type.is_critic=True \
    rew.init_critic_from_actor=True \
    rew.path=$BASE_MODEL_PATH \
    dataset.path=$DATA_PATH \
    dataset.max_prompt_len=2048 \
    dataset.train_bs_n_seqs=${TRAIN_BATCH_SIZE} \
    ppo.gen.max_new_tokens=${MAX_NEW_TOKENS} \
    ppo.gen.min_new_tokens=0 \
    ppo.disable_value=True \
    ppo.gen.top_p=1 ppo.gen.top_k=1000000 \
    ppo.gen.use_cuda_graph=True \
    ppo.gen.force_no_logits_mask=True \
    ppo.ppo_n_minibatches=${PPO_MBS} \
    ppo.gen.temperature=1 \
    ppo.kl_ctl=${KL_CTL} \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=0.5 \
    ppo.reward_output_bias=-1.0 \
    ppo.adv_norm=True ppo.value_norm=True \
    mask_too_long=False \
    ppo.discount=1.0 \
    actor.optimizer.lr=1e-6 \
    critic.optimizer.lr=5e-6 \
    actor.optimizer.lr_scheduler_type=constant \
    actor_gen.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    ref_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    rew_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    cache_clear_freq=1 \
    n_nodes=${NODES} \
    allocation_mode="'${ALLOCATION_MODE}'" n_gpus_per_node=8 \
    recover_mode=auto \
    recover_retries=10 \
    torch_cache_mysophobia=True
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-distill-1.5B-n1
MODEL_NAME="DeepSeek-R1-Distill-Qwen-1.5B"
DATASET_NAME="full_prompts_for_r1_distilled.jsonl"
NODES=1
ALLOCATION_MODE="actor_gen:d4p1m2,*:d4p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 1 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_tiny_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    #sleep 120
    echo >&3
  } &

  #sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
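The script above throttles concurrent submissions with a FIFO used as a counting semaphore. Below is a minimal standalone sketch of that pattern (plain bash, independent of AReaL; the task names are placeholders):

```bash
#!/bin/bash
# Each token written to the FIFO is one free worker slot: `read -u3` blocks
# until a token is available, and `echo >&3` puts the token back when done.
FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"    # keep the FIFO open read-write on fd 3
rm -f "$FIFO_NAME"      # the open fd keeps it usable after unlinking
echo >&3                # grant a single worker slot
for task in a b c; do
  read -u3              # acquire a slot (blocks while another task holds it)
  {
    echo "running ${task}"
    sleep 1
    echo >&3            # release the slot
  } &
done
wait
exec 3>&-
```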
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-distill-1.5B-n16
MODEL_NAME="DeepSeek-R1-Distill-Qwen-1.5B"
DATASET_NAME="full_prompts_for_r1_distilled.jsonl"
NODES=16
ALLOCATION_MODE="vllm.d64p1m1+d32p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 16 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_small_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    sleep 120
    echo >&3
  } &

  sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
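For readability, each `ALL_PARAMS` entry expands into eleven positional arguments. Assuming the invoked training script follows the same argument convention as the `ppo-math` launch scripts later in this diff (`$1` through `${11}`), the fields map as follows:

```bash
# EXP_NAME MODEL_NAME DATASET_NAME TRAIN_BATCH_SIZE GROUP_SIZE NODES \
#   ALLOCATION_MODE MAX_NEW_TOKENS MAX_NUM_SEQS PPO_MBS KL_CTL
# So "... 1024 8 16 vllm.d64p1m1+d32p2m1 16384 128 1 0.001" means:
#   1024 prompts per train batch, group size 8 (responses sampled per prompt),
#   16 nodes, at most 16384 generated tokens, 128 concurrent vLLM sequences,
#   1 PPO minibatch, and a KL coefficient of 0.001.
```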
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-distill-1.5B-n4
MODEL_NAME="DeepSeek-R1-Distill-Qwen-1.5B"
DATASET_NAME="full_prompts_for_r1_distilled.jsonl"
NODES=4
ALLOCATION_MODE="vllm.d16p1m1+d8p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 4 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 8 ${NODES} ${ALLOCATION_MODE} 16384 128 1 0.001"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_small_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    sleep 120
    echo >&3
  } &

  sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-distill-7B-n16
MODEL_NAME="DeepSeek-R1-Distill-Qwen-7B"
DATASET_NAME="full_prompts_for_r1_distilled.jsonl"
NODES=16
ALLOCATION_MODE="vllm.d16p1m4+d32p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 16 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_small_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    sleep 120
    echo >&3
  } &

  sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-distill-7B-n4
MODEL_NAME="DeepSeek-R1-Distill-Qwen-7B"
DATASET_NAME="prompts_for_r1_distilled.jsonl"
NODES=4
ALLOCATION_MODE="vllm.d4p1m4+d8p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 4 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 1024 16 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.01"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_small_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    sleep 120
    echo >&3
  } &

  sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
@ -1,52 +0,0 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

EXP_NAME=ppo-zero-7B-zero-n16
MODEL_NAME="Qwen2.5-7B"
DATASET_NAME="full_orz_zero.jsonl"
NODES=16
ALLOCATION_MODE="vllm.d64p1m1+d32p2m1"

LOG_DIR="/storage/ray/train_batch_logs/${EXP_NAME}/$(date +'%Y%m%d-%H%M%S')"
mkdir -p ${LOG_DIR}
echo "Log Dir: ${LOG_DIR}"

MAX_WORKERS=$(expr 16 / ${NODES})

FIFO_NAME=$(mktemp -u)
mkfifo "$FIFO_NAME"
exec 3<>"$FIFO_NAME"
rm -f "$FIFO_NAME"

for ((i=0; i<MAX_WORKERS; i++)); do
  echo >&3
done

ALL_PARAMS=(
  "${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 512 64 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.0"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 512 64 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.0"
  #"${EXP_NAME} ${MODEL_NAME} ${DATASET_NAME} 512 64 ${NODES} ${ALLOCATION_MODE} 16384 128 4 0.0"
)

echo "Task Count: ${#ALL_PARAMS[@]}"

for ((i=0; i<${#ALL_PARAMS[@]}; i++)); do
  read -u3

  {
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i started: ${ALL_PARAMS[$i]}"
    bash -c "bash ${SCRIPT_DIR}/train_zero_on_ray.sh ${ALL_PARAMS[$i]} &> ${LOG_DIR}/${i}.log"
    echo "$(date +"%Y-%m-%d %H:%M.%S") Task $i completed with exit code: $?, ${ALL_PARAMS[$i]}"
    sleep 120
    echo >&3
  } &

  sleep 120
done

wait

exec 3>&-
echo "All tasks completed"
@ -1,115 +0,0 @@
#!/bin/sh
MODEL_FAMILY=qwen2

EXP_NAME="$1"
MODEL_NAME="$2"
DATASET_NAME="$3"
TRAIN_BATCH_SIZE="$4"
GROUP_SIZE="$5"
NODES="$6"
ALLOCATION_MODE="$7"
MAX_NEW_TOKENS=$8
MAX_NUM_SEQS=$9
PPO_MBS=${10}
KL_CTL=${11}

MAX_TOKEN_PER_MB=$(expr 2048 + ${MAX_NEW_TOKENS} + 1024)
MAX_SEQ_LEN_TO_CAPTURE=$(expr 2048 + ${MAX_NEW_TOKENS})

BASE_MODEL_PATH="/storage/models/${MODEL_NAME}"

# original data
DATA_PATH="/storage/datasets/${DATASET_NAME}"

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster.
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster.
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to its path.
MODE=ray

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
#EXP_NAME=ppo-zero--${MODEL_NAME}--${DATASET_NAME}
#EXP_NAME=ppo-zero-distill-1.5B-default
TRIAL_NAME="${TRAIN_BATCH_SIZE}x${GROUP_SIZE}-n${NODES}"

# ALLOCATION_MODE assigns an explicit parallelism strategy to each model
# function call, i.e., actor generation, critic inference, actor train, etc.
# The total number of GPUs is `n_nodes` * `n_gpus_per_node` (8 GPUs per node here).

# The alternative `heuristic` allocation mode is not guaranteed to work with every
# model configuration. For example, if the vocabulary size is an odd number, model
# parallelism may not work. In these cases, use `ppo_manual.sh` to specify the
# parallelism strategy manually.

# The `ppo-math` subcommand specifies that this is a PPO experiment for math reasoning.
# Checkpoints are saved every epoch (`exp_ctrl.save_freq_epochs=1`) and a recovery
# checkpoint is written every 600 seconds (`exp_ctrl.ckpt_freq_secs=600`).
# The `ppo.*` options control the generation and PPO algorithm hyperparameters.
# Note that the performance of PPO is sensitive to the pre-trained model and hyperparameters.
# It's the user's responsibility to tune them appropriately.
unset CLUSTER_SPEC_PATH
CLUSTER_SPEC_PATH=/storage/ray/cluster_config_on_ray.json \
REAL_GPU_MEMORY_KILL_THRESHOLD=1 \
python3 -m realhf.apps.quickstart ppo-math \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    wandb.mode=disabled \
    exp_ctrl.total_train_epochs=10 \
    exp_ctrl.save_freq_epochs=1 \
    exp_ctrl.ckpt_freq_secs=600 \
    group_size=${GROUP_SIZE} \
    group_adv_norm=False \
    rw_type=sparse \
    check_xml_format=False \
    actor.type._class=$MODEL_FAMILY \
    actor.path=$BASE_MODEL_PATH \
    actor.vllm.hybrid_train=False \
    actor.vllm.enforce_eager=False \
    actor.vllm.max_seq_len_to_capture=${MAX_SEQ_LEN_TO_CAPTURE} \
    actor.vllm.max_num_seqs=${MAX_NUM_SEQS} \
    actor.vllm.gpu_memory_utilization=0.85 \
    actor.vllm.swap_space=64 \
    critic.type._class=$MODEL_FAMILY \
    critic.type.is_critic=True \
    critic.init_critic_from_actor=True \
    critic.path=$BASE_MODEL_PATH \
    ref.type._class=$MODEL_FAMILY \
    ref.path=$BASE_MODEL_PATH \
    dataset.path=$DATA_PATH \
    dataset.max_prompt_len=2048 \
    dataset.train_bs_n_seqs=${TRAIN_BATCH_SIZE} \
    ppo.gen.max_new_tokens=${MAX_NEW_TOKENS} \
    ppo.gen.min_new_tokens=0 \
    ppo.disable_value=True \
    ppo.gen.top_p=1 ppo.gen.top_k=1000000 \
    ppo.ppo_n_minibatches=${PPO_MBS} \
    ppo.gen.temperature=0.6 \
    ppo.kl_ctl=${KL_CTL} \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=5 \
    ppo.reward_output_bias=0.0 \
    ppo.adv_norm=True ppo.value_norm=True \
    ppo.fuse_rew_ref=False \
    mask_too_long=False \
    ppo.discount=1.0 \
    actor.optimizer.lr=1e-6 \
    critic.optimizer.lr=5e-6 \
    actor.optimizer.lr_scheduler_type=constant \
    actor_gen.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    ref_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    cache_clear_freq=1 \
    n_nodes=${NODES} \
    allocation_mode="'${ALLOCATION_MODE}'" n_gpus_per_node=8 \
    recover_mode=auto \
    recover_retries=10 \
    torch_cache_mysophobia=True
@ -1,115 +0,0 @@
#!/bin/sh
MODEL_FAMILY=qwen2

EXP_NAME="$1"
MODEL_NAME="$2"
DATASET_NAME="$3"
TRAIN_BATCH_SIZE="$4"
GROUP_SIZE="$5"
NODES="$6"
ALLOCATION_MODE="$7"
MAX_NEW_TOKENS=$8
MAX_NUM_SEQS=$9
PPO_MBS=${10}
KL_CTL=${11}

MAX_TOKEN_PER_MB=$(expr 2048 + ${MAX_NEW_TOKENS} + 1024)
MAX_SEQ_LEN_TO_CAPTURE=$(expr 2048 + ${MAX_NEW_TOKENS})

BASE_MODEL_PATH="/storage/models/${MODEL_NAME}"

# original data
DATA_PATH="/storage/datasets/${DATASET_NAME}"

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster.
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster.
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to its path.
MODE=ray

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
#EXP_NAME=ppo-zero--${MODEL_NAME}--${DATASET_NAME}
#EXP_NAME=ppo-zero-distill-1.5B-default
TRIAL_NAME="${TRAIN_BATCH_SIZE}x${GROUP_SIZE}-n${NODES}"

# ALLOCATION_MODE assigns an explicit parallelism strategy to each model
# function call, i.e., actor generation, critic inference, actor train, etc.
# The total number of GPUs is `n_nodes` * `n_gpus_per_node` (8 GPUs per node here).

# The alternative `heuristic` allocation mode is not guaranteed to work with every
# model configuration. For example, if the vocabulary size is an odd number, model
# parallelism may not work. In these cases, use `ppo_manual.sh` to specify the
# parallelism strategy manually.

# The `ppo-math` subcommand specifies that this is a PPO experiment for math reasoning.
# Checkpoints are saved every epoch (`exp_ctrl.save_freq_epochs=1`) and a recovery
# checkpoint is written every 600 seconds (`exp_ctrl.ckpt_freq_secs=600`).
# The `ppo.*` options control the generation and PPO algorithm hyperparameters.
# Note that the performance of PPO is sensitive to the pre-trained model and hyperparameters.
# It's the user's responsibility to tune them appropriately.
unset CLUSTER_SPEC_PATH
CLUSTER_SPEC_PATH=/storage/ray/cluster_config_on_ray.json \
REAL_GPU_MEMORY_KILL_THRESHOLD=1 \
python3 -m realhf.apps.quickstart ppo-math \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    wandb.mode=disabled \
    exp_ctrl.total_train_epochs=10 \
    exp_ctrl.save_freq_epochs=1 \
    exp_ctrl.ckpt_freq_secs=600 \
    group_size=${GROUP_SIZE} \
    group_adv_norm=False \
    rw_type=sparse \
    check_xml_format=False \
    actor.type._class=$MODEL_FAMILY \
    actor.path=$BASE_MODEL_PATH \
    actor.vllm.hybrid_train=True \
    actor.vllm.enforce_eager=True \
    actor.vllm.max_seq_len_to_capture=${MAX_SEQ_LEN_TO_CAPTURE} \
    actor.vllm.max_num_seqs=${MAX_NUM_SEQS} \
    actor.vllm.gpu_memory_utilization=0.8 \
    actor.vllm.swap_space=64 \
    critic.type._class=$MODEL_FAMILY \
    critic.type.is_critic=True \
    critic.init_critic_from_actor=True \
    critic.path=$BASE_MODEL_PATH \
    ref.type._class=$MODEL_FAMILY \
    ref.path=$BASE_MODEL_PATH \
    dataset.path=$DATA_PATH \
    dataset.max_prompt_len=2048 \
    dataset.train_bs_n_seqs=${TRAIN_BATCH_SIZE} \
    ppo.gen.max_new_tokens=${MAX_NEW_TOKENS} \
    ppo.gen.min_new_tokens=0 \
    ppo.disable_value=True \
    ppo.gen.top_p=1 ppo.gen.top_k=1000000 \
    ppo.ppo_n_minibatches=${PPO_MBS} \
    ppo.gen.temperature=0.6 \
    ppo.kl_ctl=${KL_CTL} \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=5 \
    ppo.reward_output_bias=0.0 \
    ppo.adv_norm=True ppo.value_norm=True \
    ppo.fuse_rew_ref=False \
    mask_too_long=False \
    ppo.discount=1.0 \
    actor.optimizer.lr=1e-6 \
    critic.optimizer.lr=5e-6 \
    actor.optimizer.lr_scheduler_type=constant \
    actor_gen.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    ref_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    cache_clear_freq=1 \
    n_nodes=${NODES} \
    allocation_mode="'${ALLOCATION_MODE}'" n_gpus_per_node=8 \
    recover_mode=auto \
    recover_retries=10 \
    torch_cache_mysophobia=True
@ -1,115 +0,0 @@
#!/bin/sh
MODEL_FAMILY=qwen2

EXP_NAME="$1"
MODEL_NAME="$2"
DATASET_NAME="$3"
TRAIN_BATCH_SIZE="$4"
GROUP_SIZE="$5"
NODES="$6"
ALLOCATION_MODE="$7"
MAX_NEW_TOKENS=$8
MAX_NUM_SEQS=$9
PPO_MBS=${10}
KL_CTL=${11}

MAX_TOKEN_PER_MB=$(expr 2048 + ${MAX_NEW_TOKENS} + 1024)
MAX_SEQ_LEN_TO_CAPTURE=$(expr 2048 + ${MAX_NEW_TOKENS})

BASE_MODEL_PATH="/storage/models/${MODEL_NAME}"

# original data
DATA_PATH="/storage/datasets/${DATASET_NAME}"

# Option 1: The experiment runs locally with subprocesses.
# MODE=local
# Option 2: The experiment runs in a Ray cluster.
# MODE=ray
# Option 3: The experiment runs in a SLURM + pyxis cluster.
# Using the slurm mode requires a cluster spec file
# and setting CLUSTER_SPEC_PATH to its path.
MODE=ray

# `experiment_name` and `trial_name` can be arbitrary.
# Logs and saved checkpoints will be indexed by them.
#EXP_NAME=ppo-zero--${MODEL_NAME}--${DATASET_NAME}
#EXP_NAME=ppo-zero-distill-1.5B-default
TRIAL_NAME="${TRAIN_BATCH_SIZE}x${GROUP_SIZE}-n${NODES}"

# ALLOCATION_MODE assigns an explicit parallelism strategy to each model
# function call, i.e., actor generation, critic inference, actor train, etc.
# The total number of GPUs is `n_nodes` * `n_gpus_per_node` (8 GPUs per node here).

# The alternative `heuristic` allocation mode is not guaranteed to work with every
# model configuration. For example, if the vocabulary size is an odd number, model
# parallelism may not work. In these cases, use `ppo_manual.sh` to specify the
# parallelism strategy manually.

# The `ppo-math` subcommand specifies that this is a PPO experiment for math reasoning.
# Checkpoints are saved every epoch (`exp_ctrl.save_freq_epochs=1`) and a recovery
# checkpoint is written every 600 seconds (`exp_ctrl.ckpt_freq_secs=600`).
# The `ppo.*` options control the generation and PPO algorithm hyperparameters.
# Note that the performance of PPO is sensitive to the pre-trained model and hyperparameters.
# It's the user's responsibility to tune them appropriately.
unset CLUSTER_SPEC_PATH
CLUSTER_SPEC_PATH=/storage/ray/cluster_config_on_ray.json \
REAL_GPU_MEMORY_KILL_THRESHOLD=1 \
python3 -m realhf.apps.quickstart ppo-math \
    mode=$MODE \
    experiment_name=$EXP_NAME \
    trial_name=$TRIAL_NAME \
    wandb.mode=disabled \
    exp_ctrl.total_train_epochs=10 \
    exp_ctrl.save_freq_epochs=1 \
    exp_ctrl.ckpt_freq_secs=600 \
    group_size=${GROUP_SIZE} \
    group_adv_norm=False \
    rw_type=sparse \
    check_xml_format=True \
    actor.type._class=$MODEL_FAMILY \
    actor.path=$BASE_MODEL_PATH \
    actor.vllm.hybrid_train=False \
    actor.vllm.enforce_eager=False \
    actor.vllm.max_seq_len_to_capture=${MAX_SEQ_LEN_TO_CAPTURE} \
    actor.vllm.max_num_seqs=${MAX_NUM_SEQS} \
    actor.vllm.gpu_memory_utilization=0.9 \
    actor.vllm.swap_space=64 \
    critic.type._class=$MODEL_FAMILY \
    critic.type.is_critic=True \
    critic.init_critic_from_actor=True \
    critic.path=$BASE_MODEL_PATH \
    ref.type._class=$MODEL_FAMILY \
    ref.path=$BASE_MODEL_PATH \
    dataset.path=$DATA_PATH \
    dataset.max_prompt_len=2048 \
    dataset.train_bs_n_seqs=${TRAIN_BATCH_SIZE} \
    ppo.gen.max_new_tokens=${MAX_NEW_TOKENS} \
    ppo.gen.min_new_tokens=0 \
    ppo.disable_value=True \
    ppo.gen.top_p=1 ppo.gen.top_k=1000000 \
    ppo.ppo_n_minibatches=${PPO_MBS} \
    ppo.gen.temperature=1 \
    ppo.kl_ctl=${KL_CTL} \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=0.5 \
    ppo.reward_output_bias=-1.0 \
    ppo.adv_norm=True ppo.value_norm=True \
    ppo.fuse_rew_ref=False \
    mask_too_long=False \
    ppo.discount=1.0 \
    actor.optimizer.lr=1e-6 \
    critic.optimizer.lr=5e-6 \
    actor.optimizer.lr_scheduler_type=constant \
    actor_gen.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    ref_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_inf.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    actor_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    critic_train.mb_spec.max_tokens_per_mb=${MAX_TOKEN_PER_MB} \
    cache_clear_freq=1 \
    n_nodes=${NODES} \
    allocation_mode="'${ALLOCATION_MODE}'" n_gpus_per_node=8 \
    recover_mode=auto \
    recover_retries=10 \
    torch_cache_mysophobia=True
@ -145,7 +145,7 @@ def launch_hydra_task(
):
    # Disable hydra logging.
    if not any("hydra/job_logging=disabled" in x for x in sys.argv):
        sys.argv += ["hydra/job_logging=disabled"]
        sys.argv.insert(2, "hydra/job_logging=disabled")

    if (
        "--multirun" in sys.argv

@ -155,10 +155,10 @@ def launch_hydra_task(
        raise NotImplementedError("Hydra multi-run is not supported.")

    # non-multirun mode, add hydra run dir
    sys.argv += [
    sys.argv.insert(2,
        f"hydra.run.dir={cluster_spec.fileroot}/logs/{getpass.getuser()}/"
        f"{experiment_name}/{trial_name}/hydra-outputs/"
    ]
    )

    sys.argv.pop(1)
@ -59,10 +59,9 @@ class GlobalMemoryBuffer:
        return res

# 30 minutes. Transferring super-large batches via NCCL bcast
# for the first time may consume over 600 secs, which is
# PyTorch's default. Increase this value to 30 minutes.
NCCL_DEFAULT_TIMEOUT = datetime.timedelta(seconds=1800)
# For large models, generation may consume more than 3600s.
# We set a large value to avoid NCCL timeout issues during generation.
NCCL_DEFAULT_TIMEOUT = datetime.timedelta(seconds=7200)

# We may want to use CPU for testing even when CUDA is available.
TORCH_FORCE_CPU = False