[Feature] Create docs and examples for multi-turn agent RL (#60)

* PullRequest: 168 Add Codeforces tests and fix other test issues

Merge branch areal-eval-0.3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/168

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix eval and add codeforces elo calc
* fix codeforce test
* fix qwen3 prompt
* change annotations to eng
* add code verify files

* PullRequest: 173 [FIX] format code and fix a recover error in rollout worker

Merge branch fw/fix-rollout-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/173

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* format code and fix a recover error in rollout worker

* PullRequest: 171 Update the evaluation documentation

Merge branch eval-doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/171?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* update eval doc
* complete eval doc
* complete eval doc
* fix ood info
* add data obtaining guide
* fix supported datasets

* PullRequest: 174 decouple max_behav_imp_weight and c_clip & track entropy, positive_seq_len and negative_seq_len

Merge branch xss/max_behav_imp_weight of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/174

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* decouple max_behav_imp_weight and c_clip
* rename log: positive_* -> correct_*, negative_* -> incorrect_*
* rename hyper-parameter: max_behav_imp_weight -> behav_imp_weight_cap

* PullRequest: 175 [Fix] Fix the "event loop is already running" error in ray scripts

Merge branch fw/fix-ray-asyncio of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/175

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format code and fix a recover error in rollout worker
* .

* add docs

* .

---------

Co-authored-by: 乘鹭 <hechuyi.hcy@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Author: Wei Fu, 2025-05-30 16:26:07 +08:00 (committed by GitHub)
Parent: c0200f10d0
Commit: 473eeb2db0
8 changed files with 256 additions and 5 deletions


@@ -3,7 +3,7 @@
 title: AReaL Documentation
 author: AReaL Team
-copyright: 2025
+copyright: "2025"
 logo: figures/logo.png
 # Force re-execution of notebooks on each build.


@@ -6,10 +6,11 @@ root: intro
 parts:
 - caption: Tutorial
   chapters:
-  - file: installation
-  - file: training
-  - file: eval
-  - file: troubleshooting
+  - file: tutorial/installation
+  - file: tutorial/training
+  - file: tutorial/eval
+  - file: tutorial/agent
+  - file: tutorial/troubleshooting
 - caption: Developer Manual
   chapters:
   - file: developer/exp_launch

docs/tutorial/agent.md (new file, 142 lines)

@@ -0,0 +1,142 @@
# Agentic RL
This guide walks through an example of training a multi-turn math agent with end-to-end RL. On each turn the agent reasons about the problem and proposes an answer; if the answer is wrong, it receives feedback and tries again, until it answers correctly or exhausts its turn budget.
## Define Your Agent
Create a new file under `realhf/impl/agent/`, for example, `math_multi_turn_agent.py`. Your `Agent` must implement the interface defined in `realhf/api/core/agent.py`, which requires implementing a single method: `collect_trajectory`.
```python
class MathMultiTurnAgent(Agent):
    def __init__(
        self,
        ...  # Any required configurations here
    ):
        ...

    async def collect_trajectory(
        self,
        prompt: SequenceSample,
        env: EnvironmentService,
        obs_queue: asyncio.Queue,
        act_queue: asyncio.Queue,
    ) -> List[SequenceSample]:
        ...
```
## Implement the `collect_trajectory` Logic
The `collect_trajectory` function takes a task prompt, an environment, and two queues as input, then produces several trajectories for the RL trainer. Within this function, you can create arbitrary data processing logic to produce the input prompt for the inference engine and extract the action from the generated tokens.
In this example, the initial observation is the math problem itself, which is already included in the `prompt` parameter. We put the token IDs and generation config into `obs_queue` and wait for the action produced by the inference engine from `act_queue`. After the inference engine returns, we extract the generated answers and send them to the environment.
```python
for turn in range(self.num_turns):
    await obs_queue.put((qid, token_ids, self.gconfig))
    act: BundledGenerationOutputs = await act_queue.get()
    ...
    _, success, *_ = await env.step((qid, answers))
```
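The `...` above elides how the sampled answers are recovered from the generation output. Below is a minimal sketch of that step; the helper name `extract_answers` and the `output_ids` argument are illustrative only, and the exact field on `BundledGenerationOutputs` that carries the sampled token sequences may differ.
```python
# Hypothetical helper for the elided answer-extraction step above.
from typing import List

from transformers import PreTrainedTokenizerFast


def extract_answers(
    tokenizer: PreTrainedTokenizerFast, output_ids: List[List[int]]
) -> List[str]:
    # Decode each sampled completion into plain text so the environment's
    # verifier can compare it against the reference answer.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```
The decoded strings are what `env.step((qid, answers))` receives in the loop above.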
The environment is similar to a [gym environment](https://github.com/Farama-Foundation/Gymnasium), which defines two methods: `reset` and `step`. However, to maintain efficiency, we use an asynchronous implementation to avoid mutual blocking across different environment instances.
Although the environment can be quite complex (e.g., for a SWE-agent), the implementation in this example is straightforward. The math environment is single-step and essentially serves as a wrapper around the reward function:
```python
class MathCodeSingleStepEnv(EnvironmentService):
    async def reset(self, seed=None, options=None):
        return None, {}

    async def step(self, action: Tuple[str, List[str]]):
        qid, answers = action
        group_size = len(answers)
        qid = qid.split("@")[0]
        cur_task = self.id2info[qid]["task"]
        format_rewards = await asyncio.to_thread(
            math_verify_call,
            self.id2info,
            answers,
            [qid for _ in range(group_size)],
        )
        return None, format_rewards, True, False, {}
```
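To make the asynchronous `reset`/`step` protocol concrete, here is a self-contained toy environment that follows the same calling convention. It does not depend on AReaL; the grading function and reward values are placeholders, and a real implementation would subclass `EnvironmentService` as shown above.
```python
import asyncio
from typing import List, Tuple


def _slow_grade(answers: List[str]) -> List[float]:
    # Stand-in for a blocking verifier such as math_verify_call.
    return [1.0 if a.strip() == "42" else 0.0 for a in answers]


class ToyMathEnv:
    async def reset(self, seed=None, options=None):
        return None, {}

    async def step(self, action: Tuple[str, List[str]]):
        qid, answers = action
        # Run the blocking verifier in a worker thread so other environment
        # instances and the rollout loop are not blocked while grading.
        rewards = await asyncio.to_thread(_slow_grade, answers)
        # (obs, reward, terminated, truncated, info), mirroring gym's convention.
        return None, rewards, True, False, {}


async def _demo():
    env = ToyMathEnv()
    await env.reset()
    print(await env.step(("q0", ["42", "7"])))


if __name__ == "__main__":
    asyncio.run(_demo())
```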
After `env.step` returns the reward for the current step, we can check whether the answer is correct. If not, we can append a user prompt and send it to `obs_queue` again to enter the next round.
```python
for turn in range(self.num_turns):
    ...
    feedback = None
    if success[0]:
        feedback = "Congratulations! You are correct!"
    else:
        feedback = "Unfortunately your answer is wrong. Let's try again."
    feedback = "\n" + self.tokenizer.apply_chat_template(
        [dict(content=feedback, role="user")],
        add_generation_prompt=True,
        tokenize=False,
    )
    feedback = self.tokenizer(feedback)["input_ids"]
    token_ids.extend(feedback)
```
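To see what this feedback construction produces in isolation, the standalone snippet below applies the same chat template and tokenization outside the agent. The model path is copied from the example configuration in this commit and is assumed to exist locally; any chat-tuned tokenizer works.
```python
from transformers import AutoTokenizer

# Path taken from the example YAML config; substitute any chat-tuned model.
tokenizer = AutoTokenizer.from_pretrained(
    "/storage/models/deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B"
)

feedback = "Unfortunately your answer is wrong. Let's try again."
templated = "\n" + tokenizer.apply_chat_template(
    [dict(content=feedback, role="user")],
    add_generation_prompt=True,
    tokenize=False,
)
feedback_ids = tokenizer(templated)["input_ids"]

# These IDs are appended to the running `token_ids`, so the next request
# placed on `obs_queue` contains the full dialogue history.
print(templated)
print(len(feedback_ids), "feedback tokens")
```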
## Modify the Configuration
Finally, arrange all the data in the proper format and return it. You're now close to running the end-to-end RL loop. The final step is to register and import your implementation, then modify the experiment configuration.
```python
# in realhf/impl/agent/math_multi_turn_agent.py
register_agent("math-multi-turn", MathMultiTurnAgent)
```
```python
# in realhf/impl/agent/__init__.py
import realhf.impl.agent.math_multi_turn_agent
```
```diff
# in realhf/experiments/async_exp/async_ppo_math_exp.py
@dataclasses.dataclass
class AsyncPPOMATHConfig(AsyncRLExperimentConfig, PPOMATHConfig):

    @property
    def agent(self) -> AgentAbstraction:
        return AgentAbstraction(
-            "math-single-step",
+            "math-multi-turn",  # Your registered name
            args=dict(
-                gconfig=self.generation_config,
-                tokenizer_path=self.actor.path,
-                success_rate_lb=self.success_rate_lb,
-                success_rate_ub=self.success_rate_ub,
-                reward_scaling=self.ppo.reward_output_scaling,
-                reward_bias=self.ppo.reward_output_bias,
+                # Any configurations for your agent
+                ...
            ),
        )

    @property
    def env(self) -> EnvServiceAbstraction:
-        return EnvServiceAbstraction(
-            "math-code-single-step", args=dict(dataset_path=self.dataset.path)
-        )
+        # Change to your customized environment if necessary
+        # The same registration and importing mechanism as Agents
+        return EnvServiceAbstraction(
+            "my-env", args=dict(...)
+        )
```
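As the diff notes, a custom environment is registered and imported the same way as the agent. The sketch below assumes an environment-side registration helper mirroring `register_agent`; the name `register_environment` and the module paths are illustrative and may differ in the actual codebase.
```python
# Illustrative paths and names, mirroring the agent registration above.

# in realhf/impl/environment/my_env.py
register_environment("my-env", MyEnv)

# in realhf/impl/environment/__init__.py
import realhf.impl.environment.my_env
```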
## Run Training
Please follow the guide in [training.md](training.md). Generally, create a YAML configuration called `math-multi-turn.yaml` under `training/configs/async-ppo` and run:
```bash
python3 training/main_async_ppo.py --config-name=math-multi-turn
```
Happy coding!


@@ -0,0 +1,108 @@
actor:
  type:
    _class: qwen2
  path: /storage/models/deepseek-ai__DeepSeek-R1-Distill-Qwen-1.5B
  optimizer:
    type: adam
    lr: 1.0e-05
    weight_decay: 0.05
    beta1: 0.9
    beta2: 0.95
    eps: 1.0e-05
    min_lr_ratio: 0.0
    lr_scheduler_type: constant
    warmup_steps_proportion: 0.001
    offload: false
    initial_loss_scale: 4294967296.0
    min_loss_scale: 1.0
    loss_scale_window: 5.0
    hysteresis: 2
    gradient_clipping: 1.0
  sglang:
    attention_backend: flashinfer
    context_length: 32768
    mem_fraction_static: 0.8
    max_running_requests: null
    max_prefill_tokens: 32768
actor_train:
  mb_spec:
    n_mbs: 1
    max_tokens_per_mb: 32768
actor_inf:
  mb_spec:
    n_mbs: 1
    max_tokens_per_mb: 32768
dataset:
  path: /storage/datasets/boba_106k_0319_qwen3_think.jsonl
  max_prompt_len: 2048
  train_bs_n_seqs: 512
  fill_to_max_length: false
ppo:
  gen:
    n: 1
    max_new_tokens: 4096
    min_new_tokens: 0
    greedy: false
    top_p: 1.0
    top_k: 100000000
    temperature: 1.0
    use_cuda_graph: true
    force_cudagraph_recapture: true
    force_no_logits_mask: true
  ppo_n_minibatches: 4
  eps_clip: 0.2
  c_clip: null
  value_eps_clip: 0.2
  early_stop_imp_ratio: 5.0
  actor_sample_reuse: 1
  critic_sample_reuse: 1
  max_reward_clip: 20.0
  reward_output_scaling: 5.0
  reward_output_bias: -1.0
  fuse_rew_ref: true
  discount: 1.0
  gae_lambda: 1.0
  adv_norm: false
  kl_ctl: 0.0
  use_adaptive_kl_ctl: false
  disable_value: true
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: null
group_size: 4
new_tokens_per_chunk: 1024
max_head_offpolicyness: 4
max_concurrent_rollouts: 256
flush_request_timeout: 300
experiment_name: multi-turn-rl-math
trial_name: my-trial
mode: ray
partition: dev
schedule_strategy: empty_first
wandb:
  mode: online
tensorboard:
  path: null
image_name: null
recover_mode: auto
recover_retries: 0
recover_after: 10
ignore_worker_error: false
allocation_mode: sglang.d32m1p1+d16p2m1
n_nodes: 8
n_gpus_per_node: 8
seed: 1
cache_clear_freq: 1
exp_ctrl:
  total_train_epochs: 5
  save_freq_steps: 20
  ckpt_freq_secs: 3600
torch_cache_mysophobia: true
shuffle_dataset: true
ray_temp_path: /tmp/ray
cluster:
  config_path: ''
  cluster_name: local
  fileroot: /storage/ray/experiments
turn_level_discount: 1.0
num_turns: 4