[Feature] Create docs and examples for multi-turn agent RL (#60)

* PullRequest: 168 Add Codeforces tests and fix other test issues

Merge branch areal-eval-0.3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/168

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix eval and add codeforces elo calc
* fix codeforce test
* fix qwen3 prompt
* change annotations to eng
* add code verify files

* PullRequest: 173 [FIX] format code and fix a recover error in rollout worker

Merge branch fw/fix-rollout-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/173

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* format code and fix a recover error in rollout worker

* PullRequest: 171 Update the evaluation documentation

Merge branch eval-doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/171?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* update eval doc
* complete eval doc
* complete eval doc
* fix ood info
* add data obtaining guide
* fix supported datasets

* PullRequest: 174 decouple max_behav_imp_weight and c_clip & track entropy, positive_seq_len and negative_seq_len

Merge branch xss/max_behav_imp_weight of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/174

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* decouple max_behav_imp_weight and c_clip
* rename log: positive_* -> correct_*, negative_* -> incorrect_*
* rename hyper-parameter: max_behav_imp_weight -> behav_imp_weight_cap

* PullRequest: 175 [Fix] Fix the "event loop is already running" error in ray scripts

Merge branch fw/fix-ray-asyncio of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/175

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format code and fix a recover error in rollout worker
* .

* add docs

* .

---------

Co-authored-by: 乘鹭 <hechuyi.hcy@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Author: Wei Fu, 2025-05-30 16:26:07 +08:00 (committed by GitHub)
Parent: c0200f10d0
Commit: 473eeb2db0
8 changed files with 256 additions and 5 deletions


@@ -3,7 +3,7 @@
 title: AReaL Documentation
 author: AReaL Team
-copyright: 2025
+copyright: "2025"
 logo: figures/logo.png
 # Force re-execution of notebooks on each build.


@@ -6,10 +6,11 @@ root: intro
 parts:
 - caption: Tutorial
   chapters:
-  - file: installation
-  - file: training
-  - file: eval
-  - file: troubleshooting
+  - file: tutorial/installation
+  - file: tutorial/training
+  - file: tutorial/eval
+  - file: tutorial/agent
+  - file: tutorial/troubleshooting
 - caption: Developer Manual
   chapters:
   - file: developer/exp_launch

docs/tutorial/agent.md (new file, 142 lines)

@@ -0,0 +1,142 @@
# Agentic RL
This guide walks through an example of training a multi-turn math agent with end-to-end RL. On each turn the agent reasons about the problem and proposes an answer; if the answer is wrong, it receives feedback and tries again, until it answers correctly or exhausts its turn budget.
## Define Your Agent
Create a new file under `realhf/impl/agent/`, for example, `math_multi_turn_agent.py`. Your `Agent` must implement the interface defined in `realhf/api/core/agent.py`, which requires implementing a single method: `collect_trajectory`.
```python
class MathMultiTurnAgent(Agent):
    def __init__(
        self,
        ...  # Any required configurations here
    ):
        ...

    async def collect_trajectory(
        self,
        prompt: SequenceSample,
        env: EnvironmentService,
        obs_queue: asyncio.Queue,
        act_queue: asyncio.Queue,
    ) -> List[SequenceSample]:
        ...
```
## Implement the `collect_trajectory` Logic
The `collect_trajectory` function takes a task prompt, an environment, and two queues as input, then produces several trajectories for the RL trainer. Within this function, you can create arbitrary data processing logic to produce the input prompt for the inference engine and extract the action from the generated tokens.
In this example, the initial observation is the math problem itself, which is already included in the `prompt` parameter. We put the token IDs and generation config into `obs_queue` and wait for the action produced by the inference engine from `act_queue`. After the inference engine returns, we extract the generated answers and send them to the environment.
```python
for turn in range(self.num_turns):
    await obs_queue.put((qid, token_ids, self.gconfig))
    act: BundledGenerationOutputs = await act_queue.get()
    ...
    _, success, *_ = await env.step((qid, answers))
```
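The `...` above elides how the sampled answers are recovered from the generation output. Below is a minimal sketch of that step; the helper name `extract_answers` and the `output_ids` argument are illustrative only, and the exact field on `BundledGenerationOutputs` that carries the sampled token sequences may differ.
```python
# Hypothetical helper for the elided answer-extraction step above.
from typing import List

from transformers import PreTrainedTokenizerFast


def extract_answers(
    tokenizer: PreTrainedTokenizerFast, output_ids: List[List[int]]
) -> List[str]:
    # Decode each sampled completion into plain text so the environment's
    # verifier can compare it against the reference answer.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```
The decoded strings are what `env.step((qid, answers))` receives in the loop above.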
The environment is similar to a [gym environment](https://github.com/Farama-Foundation/Gymnasium), which defines two methods: `reset` and `step`. However, to maintain efficiency, we use an asynchronous implementation to avoid mutual blocking across different environment instances.
Although the environment can be quite complex (e.g., for a SWE-agent), the implementation in this example is straightforward. The math environment is single-step and essentially serves as a wrapper around the reward function:
```python
class MathCodeSingleStepEnv(EnvironmentService):
    async def reset(self, seed=None, options=None):
        return None, {}

    async def step(self, action: Tuple[str, List[str]]):
        qid, answers = action
        group_size = len(answers)
        qid = qid.split("@")[0]
        cur_task = self.id2info[qid]["task"]
        format_rewards = await asyncio.to_thread(
            math_verify_call,
            self.id2info,
            answers,
            [qid for _ in range(group_size)],
        )
        return None, format_rewards, True, False, {}
```
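To make the asynchronous `reset`/`step` protocol concrete, here is a self-contained toy environment that follows the same calling convention. It does not depend on AReaL; the grading function and reward values are placeholders, and a real implementation would subclass `EnvironmentService` as shown above.
```python
import asyncio
from typing import List, Tuple


def _slow_grade(answers: List[str]) -> List[float]:
    # Stand-in for a blocking verifier such as math_verify_call.
    return [1.0 if a.strip() == "42" else 0.0 for a in answers]


class ToyMathEnv:
    async def reset(self, seed=None, options=None):
        return None, {}

    async def step(self, action: Tuple[str, List[str]]):
        qid, answers = action
        # Run the blocking verifier in a worker thread so other environment
        # instances and the rollout loop are not blocked while grading.
        rewards = await asyncio.to_thread(_slow_grade, answers)
        # (obs, reward, terminated, truncated, info), mirroring gym's convention.
        return None, rewards, True, False, {}


async def _demo():
    env = ToyMathEnv()
    await env.reset()
    print(await env.step(("q0", ["42", "7"])))


if __name__ == "__main__":
    asyncio.run(_demo())
```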
After `env.step` returns the reward for the current step, we can check whether the answer is correct. If not, we can append a user prompt and send it to `obs_queue` again to enter the next round.
```python
for turn in range(self.num_turns):
    ...
    feedback = None
    if success[0]:
        feedback = "Congratulations! You are correct!"
    else:
        feedback = "Unfortunately your answer is wrong. Let's try again."
    feedback = "\n" + self.tokenizer.apply_chat_template(
        [dict(content=feedback, role="user")],
        add_generation_prompt=True,
        tokenize=False,
    )
    feedback = self.tokenizer(feedback)["input_ids"]
    token_ids.extend(feedback)
```
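To see what this feedback construction produces in isolation, the standalone snippet below applies the same chat template and tokenization outside the agent. The model path is copied from the example configuration in this commit and is assumed to exist locally; any chat-tuned tokenizer works.
```python
from transformers import AutoTokenizer

# Path taken from the example YAML config; substitute any chat-tuned model.
tokenizer = AutoTokenizer.from_pretrained(
    "/storage/models/deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B"
)

feedback = "Unfortunately your answer is wrong. Let's try again."
templated = "\n" + tokenizer.apply_chat_template(
    [dict(content=feedback, role="user")],
    add_generation_prompt=True,
    tokenize=False,
)
feedback_ids = tokenizer(templated)["input_ids"]

# These IDs are appended to the running `token_ids`, so the next request
# placed on `obs_queue` contains the full dialogue history.
print(templated)
print(len(feedback_ids), "feedback tokens")
```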
## Modify the Configuration
Finally, arrange all the data in the proper format and return it. You're now close to running the end-to-end RL loop. The final step is to register and import your implementation, then modify the experiment configuration.
```python
# in realhf/impl/agent/math_multi_turn_agent.py
register_agent("math-multi-turn", MathMultiTurnAgent)
```
```python
# in realhf/impl/agent/__init__.py
import realhf.impl.agent.math_multi_turn_agent
```
```diff
# in realhf/experiments/async_exp/async_ppo_math_exp.py
@dataclasses.dataclass
class AsyncPPOMATHConfig(AsyncRLExperimentConfig, PPOMATHConfig):

    @property
    def agent(self) -> AgentAbstraction:
        return AgentAbstraction(
-            "math-single-step",
+            "math-multi-turn",  # Your registered name
            args=dict(
-                gconfig=self.generation_config,
-                tokenizer_path=self.actor.path,
-                success_rate_lb=self.success_rate_lb,
-                success_rate_ub=self.success_rate_ub,
-                reward_scaling=self.ppo.reward_output_scaling,
-                reward_bias=self.ppo.reward_output_bias,
+                # Any configurations for your agent
+                ...
            ),
        )

    @property
    def env(self) -> EnvServiceAbstraction:
-        return EnvServiceAbstraction(
-            "math-code-single-step", args=dict(dataset_path=self.dataset.path)
-        )
+        # Change to your customized environment if necessary
+        # The same registration and importing mechanism as Agents
+        return EnvServiceAbstraction(
+            "my-env", args=dict(...)
+        )
```
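As the diff notes, a custom environment is registered and imported the same way as the agent. The sketch below assumes an environment-side registration helper mirroring `register_agent`; the name `register_environment` and the module paths are illustrative and may differ in the actual codebase.
```python
# Illustrative paths and names, mirroring the agent registration above.

# in realhf/impl/environment/my_env.py
register_environment("my-env", MyEnv)

# in realhf/impl/environment/__init__.py
import realhf.impl.environment.my_env
```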
## Run Training
Please follow the guide in [training.md](training.md). Generally, create a YAML configuration called `math-multi-turn.yaml` under `training/configs/async-ppo` and run:
```bash
python3 training/main_async_ppo.py --config-name=math-multi-turn
```
Happy coding!


@@ -0,0 +1,108 @@
actor:
  type:
    _class: qwen2
  path: /storage/models/deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B
  optimizer:
    type: adam
    lr: 1.0e-05
    weight_decay: 0.05
    beta1: 0.9
    beta2: 0.95
    eps: 1.0e-05
    min_lr_ratio: 0.0
    lr_scheduler_type: constant
    warmup_steps_proportion: 0.001
    offload: false
    initial_loss_scale: 4294967296.0
    min_loss_scale: 1.0
    loss_scale_window: 5.0
    hysteresis: 2
    gradient_clipping: 1.0
  sglang:
    attention_backend: flashinfer
    context_length: 32768
    mem_fraction_static: 0.8
    max_running_requests: null
    max_prefill_tokens: 32768
actor_train:
  mb_spec:
    n_mbs: 1
    max_tokens_per_mb: 32768
actor_inf:
  mb_spec:
    n_mbs: 1
    max_tokens_per_mb: 32768
dataset:
  path: /storage/datasets/boba_106k_0319_qwen3_think.jsonl
  max_prompt_len: 2048
  train_bs_n_seqs: 512
  fill_to_max_length: false
ppo:
  gen:
    n: 1
    max_new_tokens: 4096
    min_new_tokens: 0
    greedy: false
    top_p: 1.0
    top_k: 100000000
    temperature: 1.0
    use_cuda_graph: true
    force_cudagraph_recapture: true
    force_no_logits_mask: true
  ppo_n_minibatches: 4
  eps_clip: 0.2
  c_clip: null
  value_eps_clip: 0.2
  early_stop_imp_ratio: 5.0
  actor_sample_reuse: 1
  critic_sample_reuse: 1
  max_reward_clip: 20.0
  reward_output_scaling: 5.0
  reward_output_bias: -1.0
  fuse_rew_ref: true
  discount: 1.0
  gae_lambda: 1.0
  adv_norm: false
  kl_ctl: 0.0
  use_adaptive_kl_ctl: false
  disable_value: true
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: null
group_size: 4
new_tokens_per_chunk: 1024
max_head_offpolicyness: 4
max_concurrent_rollouts: 256
flush_request_timeout: 300
experiment_name: multi-turn-rl-math
trial_name: my-trial
mode: ray
partition: dev
schedule_strategy: empty_first
wandb:
  mode: online
tensorboard:
  path: null
image_name: null
recover_mode: auto
recover_retries: 0
recover_after: 10
ignore_worker_error: false
allocation_mode: sglang.d32m1p1+d16p2m1
n_nodes: 8
n_gpus_per_node: 8
seed: 1
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_steps: 20
  ckpt_freq_secs: 3600
torch_cache_mysophobia: true
shuffle_dataset: true
ray_temp_path: /tmp/ray
cluster:
  config_path: ''
  cluster_name: local
  fileroot: /storage/ray/experiments
turn_level_discount: 1.0
num_turns: 4