Rollout and Agentic RL
This guide demonstrates how to customize rollout behavior for PPO training by implementing a multi-turn math agent that uses end-to-end reinforcement learning. Our example agent will continuously try to solve a math problem until it reaches the correct answer.
Approach: Using AReaLite (Recommended)
The complete implementation can be found at arealite/workflow/multi_turn.py.
Step 1: Define Your Workflow
AReaLite takes a flexible approach to agent definition. Rather than using rigid Agent classes that might limit your agentic capabilities, AReaLite captures all rollout behavior in a RolloutWorkflow class. This design gives you complete freedom to customize your agent's behavior.
# arealite/api/workflow_api.py
class RolloutWorkflow:
    async def arun_episode(
        self, engine: InferenceEngine, data: Dict[str, Any]
    ) -> TensorDict:
        """Run a single episode of the workflow.

        See concrete example implementations under the `arealite/workflow` directory.
        """
        raise NotImplementedError()
The workflow exposes a single arun_episode method that runs and collects data from a single episode. This method takes two key arguments:
- InferenceEngine: Provides the agenerate method for generating responses to user inputs
- data: The prompt data loaded from your RL dataset
Within this method, you have complete control over how your agent and environment interact.
Setting Up the Multi-Turn Math Workflow
Let's build a multi-turn rollout workflow for solving math problems. First, we'll define the __init__ method to capture the utilities we need during rollout:
Note: You have complete flexibility in defining the __init__ method. Pass any arguments needed to construct your workflow. If you want to use tools, pass the corresponding environment here so your agent can call it in the arun_episode method.
class MultiTurnWorkflow(RolloutWorkflow):
    def __init__(
        self,
        reward_fn,
        gconfig: GenerationHyperparameters,  # aka sampling_params
        tokenizer: PreTrainedTokenizerFast,
        max_turns: int,
        turn_discount: float,
    ):
        self.reward_fn = reward_fn
        self.gconfig = gconfig
        self.tokenizer = tokenizer
        self.max_turns = max_turns
        # Discount rewards if the agent takes longer to find the correct answer
        self.turn_discount = turn_discount
Implementing the Episode Logic
Now let's implement the arun_episode method. We'll start by tokenizing the prompt data and converting it into an LLMRequest object for the inference engine:
class MultiTurnWorkflow(RolloutWorkflow):
    # ... __init__ method above ...

    async def arun_episode(self, engine: InferenceEngine, data) -> TensorDict:
        # Initialize result containers
        seq, logprobs, loss_mask, versions = [], [], [], []
        messages = data["messages"]
        # Run multi-turn rollout until we get the correct answer
        t = reward = 0
        discount = 1.0
        rid = uuid.uuid4().hex
        while reward == 0 and t < self.max_turns:
            # Convert the conversation into input tokens
            input_ids = self.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                add_generation_prompt=True,
            )
            # Generate response from the model
            req = LLMRequest(
                rid=rid,
                input_ids=input_ids,
                gconfig=self.gconfig.new(n_samples=1),
            )
            resp = await engine.agenerate(req)
            # ... continue processing ...
Note: This example accesses the "messages" key from the prompt data to get OpenAI-compatible messages. This isn't mandatory; the key and prompt format depend entirely on your implementation. For instance, if your dataset stores prompt strings in a "prompt" column, you could get input token IDs via self.tokenizer.encode(data["prompt"]).
Note: The rid field in LLMRequest is the request ID. Requests with the same ID will reuse the LLM inference server's KV caches for efficiency.
Handling Multi-Turn Conversations
Next, we'll evaluate whether the current answer is correct using our reward_fn. This function should return 1 for correct answers and 0 otherwise. When the answer is wrong, we'll apply a discount, add feedback to the conversation, and let the model try again:
class MultiTurnWorkflow(RolloutWorkflow):
    # ... previous methods ...

    async def arun_episode(self, engine: InferenceEngine, data) -> TensorDict:
        # ... initialization code ...
        while reward == 0 and t < self.max_turns:
            # Add feedback if the previous answer was incorrect
            if t > 0:
                messages += [
                    {"role": "assistant", "content": completions_str},
                    {
                        "role": "user",
                        "content": "Your answer is not correct. Please try to answer it again.",
                    },
                ]
            # Generate response (code from above)
            # ...
            # Evaluate the response
            prompt_str = self.tokenizer.decode(input_ids)
            completions_str = self.tokenizer.decode(resp.output_tokens)
            reward = self.reward_fn(
                prompt=prompt_str,
                completions=completions_str,
                prompt_ids=resp.input_tokens,
                completion_ids=resp.output_tokens,
                **data,
            )
            # Update counters
            t += 1
            discount *= self.turn_discount
Reward Function Signature
For convenience when switching between different reward functions, we recommend following this pre-defined signature:
def reward_fn(
    prompt: str,
    completions: str,
    prompt_ids: List[int],
    completion_ids: List[int],
    **kwargs,
):
    """Reward function for evaluating agent performance.

    This signature is recommended for compatibility with predefined workflows,
    but you can modify it freely in custom implementations.

    Args:
        prompt: The task description string
        completions: The agent's response string
        prompt_ids: Tokenized prompt
        completion_ids: Tokenized response
        **kwargs: Additional dataset attributes (solutions, input_outputs, etc.)

    Returns:
        float: Reward value (typically 1.0 for correct, 0.0 for incorrect)
    """
While this signature is convenient, there are no strict restrictions on reward functions in custom workflows—modify them as needed for your specific use case.
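As a concrete illustration, a GSM8K-style reward function matching this signature might extract the final number from the completion and compare it against the dataset's ground-truth answer. This is only a sketch: the "answer" key and the extraction logic are assumptions for illustration, and the actual gsm8k_reward_fn used in the training script below may differ.

import re
from typing import List

def gsm8k_style_reward_fn(
    prompt: str,
    completions: str,
    prompt_ids: List[int],
    completion_ids: List[int],
    **kwargs,
) -> float:
    # The ground-truth answer is assumed to arrive through the dataset kwargs.
    answer = str(kwargs.get("answer", "")).strip()
    # Take the last number that appears in the model's response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completions.replace(",", ""))
    if not answer or not numbers:
        return 0.0
    return 1.0 if numbers[-1] == answer else 0.0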
Collecting Training Data
Finally, let's complete the implementation by collecting trajectories in the TensorDict format:
class MultiTurnWorkflow(RolloutWorkflow):
    # ... previous methods ...

    async def arun_episode(self, engine: InferenceEngine, data) -> TensorDict:
        # ... episode logic above ...
        while reward == 0 and t < self.max_turns:
            # ... generation and evaluation ...
            # Collect trajectory data
            input_len = len(resp.input_tokens) - len(seq)
            seq += resp.input_tokens[-input_len:] + resp.output_tokens
            logprobs += [0.0] * input_len + resp.output_logprobs
            loss_mask += [0] * input_len + [1] * resp.output_len
            versions += [-1] * input_len + resp.output_versions

        # Package results
        res = dict(
            input_ids=torch.tensor(seq),
            logprobs=torch.tensor(logprobs),
            loss_mask=torch.tensor(loss_mask),
            versions=torch.tensor(versions),
            rewards=torch.tensor(float(reward * discount)),
            attention_mask=torch.ones(len(seq), dtype=torch.bool),
        )
        res = {k: v.unsqueeze(0) for k, v in res.items()}
        return concat_padded_tensors([res])
Important: The returned TensorDict must follow HuggingFace's padded data format, where each tensor has shape [batch_size, sequence_length, *]. This allows AReaLite to automatically batch multiple trajectories for the training engine. Since this example returns a single trajectory, we use unsqueeze(0) to create a batch of size 1.
Note: There are no restrictions on the keys in your TensorDict; different algorithms require different keys. This example targets the GRPO algorithm, so we include input_ids, loss_mask, attention_mask, and logprobs (needed for computing importance ratios).
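To illustrate what the padded format buys you, the sketch below combines two single-trajectory results of different lengths. It assumes that concat_padded_tensors (the same helper used above) pads every tensor along the sequence dimension to the longest length before concatenating along the batch dimension; treat it as an illustration rather than a specification of the helper's behavior.

import torch
# Assumption: concat_padded_tensors is the helper used in the workflow above;
# adjust the import to wherever it lives in your checkout.

traj_a = {
    "input_ids": torch.tensor([[1, 2, 3]]),                # shape [1, 3]
    "attention_mask": torch.ones(1, 3, dtype=torch.bool),
}
traj_b = {
    "input_ids": torch.tensor([[4, 5, 6, 7, 8]]),          # shape [1, 5]
    "attention_mask": torch.ones(1, 5, dtype=torch.bool),
}
batch = concat_padded_tensors([traj_a, traj_b])
# Expected result under the stated assumption: batch["input_ids"].shape == (2, 5);
# the shorter trajectory is padded, and its attention_mask marks the padding as invalid.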
Step 2: Training with Your Custom Workflow
Using your custom workflow is straightforward: construct it in your training script and pass it to the rollout_batch or prepare_batch method:
def main(args):
    # ... setup code ...

    # Create your custom workflow
    workflow = MultiTurnWorkflow(
        reward_fn=gsm8k_reward_fn,
        gconfig=config.gconfig,
        tokenizer=tokenizer,
        turn_discount=0.9,
        max_turns=5,
    )

    # Run training; no other changes needed!
    data_generator = iter(train_dataloader)
    for global_step in range(max_steps):
        with stats_tracker.record_timing("rollout"):
            if config.async_training:
                batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
            else:
                try:
                    data = next(data_generator)
                except StopIteration:
                    data_generator = iter(train_dataloader)
                    data = next(data_generator)
                batch = rollout.rollout_batch(data, workflow=workflow)
        # ... continue with training loop ...
That's it! Your custom multi-turn math agent is now ready to train with reinforcement learning. The workflow will automatically handle the multi-turn conversations, reward computation, and data collection needed for effective RL training.
Alternative Approach: Using the Legacy Version (Not Recommended)
While we strongly recommend using AReaLite for new projects, you might encounter legacy code that uses the older Agent-based approach. Here's how it works for reference, though we suggest migrating to the workflow-based system when possible.
Step 1: Define Your Agent Class
Create a new file under realhf/impl/agent/, such as math_multi_turn_agent.py. Your Agent must implement the interface defined in realhf/api/core/agent.py, which requires a single method: collect_trajectory.
class MathMultiTurnAgent(Agent):
    async def collect_trajectory(
        self,
        prompt,
        env: EnvironmentService,
        obs_queue: asyncio.Queue,
        act_queue: asyncio.Queue,
    ):
        # Implementation goes here
        ...
Step 2: Implement the Trajectory Collection Logic
The collect_trajectory method takes a task prompt, an environment, and two communication queues. Within this method, you control the data flow between your agent and the inference engine using these queues:
- obs_queue: Send observations (token IDs and generation config) to the inference engine
- act_queue: Receive actions (generated responses) from the inference engine
Here's how the multi-turn conversation works:
for turn in range(self.num_turns):
    # Send the current state to the inference engine
    await obs_queue.put((qid, token_ids, self.gconfig))

    # Get the generated response
    act: BundledGenerationOutputs = await act_queue.get()

    # Evaluate the response through the environment
    success, rewards = await env.step((qid, answers))

    # ... process results ...
Environment Integration
The environment follows a Gym-like interface with reset and step methods, but uses asynchronous implementations to prevent blocking across different environment instances.
For math problems, the environment is typically stateless and acts as a wrapper around your reward function:
class MathCodeSingleStepEnv(EnvironmentService):
    async def step(self, action: Tuple[str, List[str]]):
        qid, answers = action
        # ... setup code ...

        # Run reward computation asynchronously
        format_rewards = await asyncio.to_thread(
            math_verify_call,
            answers,
            # ... other parameters ...
        )
        # Gym-style 5-tuple: (obs, reward, terminated, truncated, info)
        return None, format_rewards, True, False, {}
Handling Multi-Turn Feedback
After receiving the reward from env.step, check if the answer is correct. If not, provide feedback and continue to the next turn:
for turn in range(self.num_turns):
    # ... generation and evaluation code ...

    # Provide feedback based on the result
    if success[0]:
        feedback = "Congratulations! You are correct!"
    else:
        feedback = "Unfortunately your answer is wrong. Let's try again."

    # Format feedback as a user message
    feedback = "\n" + self.tokenizer.apply_chat_template(
        [{"content": feedback, "role": "user"}],
        add_generation_prompt=True,
        tokenize=False,
    )

    # Add feedback tokens to the conversation
    feedback_tokens = self.tokenizer(feedback)["input_ids"]
    token_ids.extend(feedback_tokens)
Step 3: Register and Configure Your Agent
First, register your agent implementation:
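A minimal sketch of the registration call is shown below. The register_agent helper and its import path are assumptions here; mirror the registration utility used by the existing agents under realhf/impl/agent/ in your checkout. The "math-multi-turn" name must match the agent name referenced in the experiment configuration below.

# realhf/impl/agent/math_multi_turn_agent.py
# Assumed registration helper; adjust the import to match existing agents.
from realhf.api.core.agent_api import register_agent

register_agent("math-multi-turn", MathMultiTurnAgent)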
Then update your experiment configuration in realhf/experiments/async_exp/async_math_ppo.py:
@dataclasses.dataclass
class AsyncPPOMATHConfig(AsyncRLExperimentConfig, PPOMATHConfig):
    # Add any new CLI arguments your agent needs
    my_param: float = 1.0

    @property
    def agent(self) -> AgentAbstraction:
        return AgentAbstraction(
            "math-multi-turn",  # Your registered agent name
            args=dict(
                # Pass any arguments needed for your __init__ method
                my_param=self.my_param,
                # ... other configuration ...
            ),
        )

    @property
    def env(self) -> EnvServiceAbstraction:
        # Update to use your custom environment if needed
        return EnvServiceAbstraction(
            "math-code-single-step",
            args=dict(dataset_path=self.dataset.path),
        )
Step 4: Run Training
Follow the standard training procedure outlined in the quickstart guide. Launch your experiment with:
python3 training/main_async_ppo.py my_param=5.0 # plus any additional CLI arguments
Training Results
The training reward curve for the multi-turn math agent shows accuracy improving steadily over the course of training, demonstrating the effectiveness of the multi-turn approach.
Note: While this legacy approach works, we strongly recommend using the AReaLite workflow system for new projects. It provides better flexibility, cleaner abstractions, and easier maintenance. Consider migrating existing legacy agents to the workflow-based approach when possible.
Happy coding!