mirror of https://github.com/inclusionAI/AReaL
This commit is contained in: parent 7f876e7182, commit e6bf47f7b0
README.md
@@ -20,10 +20,11 @@ like how you enjoy real-world milk tea (cheers).

**AReaL Highlights**

- ⚡ **\[NEW\] Light-weight & AI-centric:** Our new release **AReaLite** follows an
  **AI-centric** design that prioritizes a better development experience for AI
  researchers. As a result, **AReaLite** delivers most AReaL functionalities with a much
  more light-weight codebase, letting users build their own **agentic** and **RLVR**
  training workflows with less effort.
- 🔥 **Asynchronous RL**: With algorithm-system co-design, AReaL supports fully
  asynchronous RL for **the fastest training**! Experimental support for multi-turn
  agentic RL is also provided.

@@ -37,9 +38,12 @@ like how you enjoy real-world milk tea (cheers).

## News

**\[2025/07/31\] (v0.4, AReaLite)** We introduce **AReaLite**, a **light-weight**
version of AReaL designed specifically for AI researchers and rapid prototyping.
AReaLite features an **AI-centric** API design that prioritizes ease of use and
algorithm development, while inherently supporting fully asynchronous **agentic RL**.
With 80% fewer lines of code, AReaLite maintains 90% of AReaL's core functionality.
Check out [our AReaLite design doc](/arealite/README.md) and
[the quickstart guide](/docs/tutorial/quickstart.md) to begin your journey with
**AReaLite**!

**\[2025/06/03\] (v0.3, boba²)** We release **boba²** (double-boba) for fully

@@ -64,7 +68,7 @@ New highlights in AReaLite:

- Follows an *AI-centric* API design instead of the *system-centric* architecture in old
  AReaL, which makes it easier for AI researchers to adopt, understand, and develop
  effectively and efficiently. To learn more about the design principles of AReaL,
  please read the [AReaLite design doc](/arealite/README.md)!

- A much more *light-weight* codebase compared to the old AReaL codebase, with only
  **20%** of the lines of code, with a detailed
  [code walkthrough](/docs/arealite/gsm8k_grpo.md) on an

@@ -29,8 +29,8 @@ show you how these steps are done in detail.

## Launching the Experiment

As shown in the [quickstart guide](../tutorial/quickstart.md), experiments in AReaLite
are launched using standalone launchers with the following commands:

```
# Local Launcher
...
```

@@ -72,7 +72,7 @@ config: GRPOConfig

## Loading and Preprocessing Dataset

We use the `datasets` and `torchdata` packages to load and preprocess the dataset into
our dataloader. First, we download `openai/gsm8k` from Hugging Face and split it by data
parallel ranks, then map it to our desired format:

```python
...
```
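
As a rough sketch (not the repository's exact code) of what this step amounts to: load
`openai/gsm8k`, keep only the current data-parallel rank's shard, and map each sample
into a chat-style format. The function name and the exact message format below are
illustrative assumptions:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node


def get_gsm8k_dataset(rank: int, world_size: int):
    # Download openai/gsm8k from Hugging Face and keep only this rank's shard.
    dataset = load_dataset("openai/gsm8k", "main", split="train")
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

    # Map each sample into the chat-style format consumed by the rollout workflow.
    def to_messages(sample):
        return {
            "messages": [{"role": "user", "content": sample["question"]}],
            "answer": sample["answer"],
        }

    return dataset.map(to_messages)
```

A `torchdata` dataloader can then iterate over this per-rank shard inside the training
loop.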

@@ -116,14 +116,14 @@ on separate GPUs from those used for training. The `RemoteSGLangEngine` acts as

that interacts with the servers. `RemoteSGLangEngine` runs in an SPMD manner on every
training process, without occupying any GPUs.

`RemoteSGLangEngine` provides two APIs, `agenerate` and `async_update_weights`. It is
worth mentioning that, in asynchronous RL experiments in AReaLite, an inference-side
weight update can happen **in the middle of** generating the response to a prompt. In
other words, one output sequence can be generated by multiple model versions. Let us
look at the code of `agenerate` and `async_update_weights` for a better understanding.

In `async_update_weights`, the engine first sends `pause_generation` requests to all
inference servers, notifying them that a weight update is about to happen. Upon
receiving `pause_generation`, inference servers immediately stop generating and respond
with the tokens generated so far. Then, the engine sends `update_weights_from_distributed`
(for NCCL update) or `update_weights_from_disk` (for disk update). After the update is

@@ -133,8 +133,8 @@ start working again.

```python
class RemoteSGLangEngine:
    ...
    def async_update_weights(self, meta: WeightUpdateMeta):
        # `async_update_weights` is completely async.
        # It submits a task to a ProcessPoolExecutor and returns a future.
        for addr in self.addresses:
            res = requests.post(f"http://{addr}/pause_generation")
        ...
```

@@ -148,7 +148,7 @@ class RemoteSGLangEngine:

```python
        def callback(future):
            for addr in self.addresses:
                requests.post(f"http://{addr}/continue_generation")

        future.add_done_callback(callback)
        return future
```
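
The method above follows a common submit-and-callback pattern: the update work is handed
to an executor, a future is returned to the caller, and a completion callback resumes
generation on the servers. A small self-contained illustration of that pattern in plain
Python (not AReaLite code):

```python
from concurrent.futures import ProcessPoolExecutor


def do_update(payload: str) -> str:
    # Stand-in for the actual weight-transfer work.
    return f"updated with {payload}"


def on_done(fut) -> None:
    # Runs once the update finishes; this is where generation would be resumed.
    print("done:", fut.result())


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(do_update, "new-weights")
        future.add_done_callback(on_done)
        # The caller can keep working and only block on the result when needed.
        print(future.result())
```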

@@ -285,15 +285,15 @@ class WorkflowExecutor:

```python
            rid += 1
        # Wait for rollout completion
        tasks = list(rollout_tasks.values())
        completed_tasks = []
        if tasks:
            completed_tasks, _ = await asyncio.wait(
                tasks,
                timeout=ROLLOUT_POLL_WAIT_TIME,
                return_when=asyncio.FIRST_COMPLETED,
            )
        # Collect done results, put the results into output queue
        for task in completed_tasks:
            traj = await task
            task_rid = task.get_name()
            rollout_tasks.pop(task_rid)
```
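
The fragment above relies on a standard asyncio idiom: wait on a set of tasks with a
timeout and `FIRST_COMPLETED`, then drain whatever finished before submitting more work.
A self-contained illustration of the same idiom (independent of AReaLite):

```python
import asyncio


async def worker(i: int) -> int:
    await asyncio.sleep(0.1 * i)
    return i


async def main() -> None:
    pending = {asyncio.create_task(worker(i), name=str(i)) for i in range(4)}
    results = []
    while pending:
        # Wake up at least every 50 ms, even if nothing has completed yet.
        done, pending = await asyncio.wait(
            pending, timeout=0.05, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            results.append(await task)
    print(results)  # [0, 1, 2, 3]


asyncio.run(main())
```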

@@ -432,25 +432,24 @@ weight_update_meta = WeightUpdateMeta.from_disk(config.saver)

After a training step is finished, we transfer the new weights from the actor engine to
the remote inference servers:

1. The rollout engine needs to stop sending generation requests to remote servers
   (`rollout.pause()`) to avoid server-side congestion.
1. Since we need to invoke the weight update on the trainer engine and the remote
   inference servers at the same time, the training script asynchronously sends requests
   to the remote inference servers and then immediately uploads weights on the trainer
   engine.

```python
# 1. Pause rollout on remote inference servers
rollout.pause()
# 2. Send requests to remote servers, telling them to update weights
if dist.get_rank() == 0:
    future = rollout.async_update_weights(weight_update_meta)
# 3. The actor begins to transfer weights
actor.upload_weights(weight_update_meta)
# 4. Wait for remote servers to return after finishing their updates
if dist.get_rank() == 0:
    future.result()
# 5. Synchronize rollout processes for the model version update
dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
# 6. Resume rollout on remote inference servers
rollout.resume()
# 7. Set versions, ensuring the actor and rollout engine versions stay identical
actor.set_version(global_step + 1)
rollout.set_version(global_step + 1)
```

@@ -482,7 +481,7 @@ for global_step in range(max_steps):

```python
    rollout.pause()
    if dist.get_rank() == 0:
        future = rollout.async_update_weights(weight_update_meta)
    actor.upload_weights(weight_update_meta)
    if dist.get_rank() == 0:
        future.result()
```

@@ -8,10 +8,12 @@ You can find the complete implementation in `arealite/workflow/multi_turn.py`.

## Step 1: Define Your Workflow

AReaLite gives you flexibility in how you design your agents to run **an episode**. An
episode defines how your agent rolls out a complete training sample from an input
prompt, using tools, reward functions, and (multi-turn) generation. Instead of rigid
`Agent` classes that might constrain your agent's capabilities, AReaLite captures all
rollout behavior in a `RolloutWorkflow` class. This approach allows you to customize
your agent's behavior however you need.

```python
# arealite/api/workflow_api.py
...
```
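
As an illustration of the idea only (the method and parameter names below are
assumptions, not necessarily the actual arealite API), such a workflow class boils down
to one async entry point that takes an inference engine plus a single prompt's data and
returns the collected trajectory:

```python
from abc import ABC, abstractmethod


class RolloutWorkflow(ABC):
    # Illustrative sketch; see arealite/api/workflow_api.py for the real interface.
    @abstractmethod
    async def arun_episode(self, engine, data):
        """Roll out one episode for a single prompt and return the sequences,
        logprobs, loss masks, and rewards needed for training."""
```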

@@ -40,7 +42,7 @@ interact.

> generated from that prompt—it's not batched. However, you can generate multiple
> trajectories from a single prompt (for example, with GRPO or tree search).

### Setting Up the Multi-turn Math Workflow

Let's build a multi-turn rollout workflow for solving math problems. First, we'll define
the `__init__` method to set up what we need during rollout:
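
A plausible sketch of such an `__init__` (the argument list is an assumption; only the
tokenizer, the reward function, and `max_turns` are actually referenced later in this
walkthrough):

```python
class RolloutWorkflow:
    # Minimal stand-in for the real base class in arealite/api/workflow_api.py.
    ...


class MultiTurnWorkflow(RolloutWorkflow):
    def __init__(self, reward_fn, gconfig, tokenizer, max_turns):
        # Keep everything the episode loop needs at hand.
        self.reward_fn = reward_fn    # returns 1 for a correct answer, 0 otherwise
        self.gconfig = gconfig        # generation/sampling settings (assumed name)
        self.tokenizer = tokenizer    # used to apply the chat template below
        self.max_turns = max_turns    # upper bound on retry turns
```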

@@ -82,10 +84,11 @@ class MultiTurnWorkflow(RolloutWorkflow):

```python
        seq, logprobs, loss_mask, versions = [], [], [], []
        messages = data["messages"]
        # Run multi-turn rollout until we get the correct answer
        turn_index = 0
        reward = 0
        discount = 1.0
        rid = uuid.uuid4().hex
        while reward == 0 and turn_index < self.max_turns:
            # Convert the conversation into input tokens
            input_ids = self.tokenizer.apply_chat_template(
                messages,
                ...
            )
```

@@ -111,7 +114,7 @@ class MultiTurnWorkflow(RolloutWorkflow):

> **Note**: The `rid` field in `LLMRequest` is the request ID. Requests with the same ID
> will reuse the LLM inference server's KV caches for better efficiency.
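
Concretely, `rid` is created once per episode (see `rid = uuid.uuid4().hex` above) and
is meant to be reused for each turn's request, so the server can keep serving the same
conversation from its cache. A small stand-in illustration (only the `rid` field is
documented here; the other field is an assumption):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LLMRequest:
    # Stand-in for the request type used by the workflow; `rid` is the request ID.
    rid: str
    input_ids: List[int] = field(default_factory=list)


# Both turns share one request ID, so the inference server can reuse its KV cache
# for the conversation instead of recomputing it from scratch.
first_turn = LLMRequest(rid="episode-42", input_ids=[1, 2, 3])
second_turn = LLMRequest(rid="episode-42", input_ids=[1, 2, 3, 4, 5, 6])
```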

### Handling Multi-turn Conversations

Next, we'll check if the current answer is correct using our `reward_fn`. This function
should return 1 for correct answers and 0 otherwise. When the answer is wrong, we'll
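
As a rough sketch of that logic (names such as `completion_text` and the feedback
message are illustrative assumptions, not the walkthrough's exact code):

```python
def handle_turn(messages, completion_text, answer, reward_fn):
    # Illustrative only: score the model's answer; if it is wrong, append feedback
    # so the next turn of the while-loop above can retry.
    reward = reward_fn(completion_text, answer)
    if reward == 0:
        messages.append({"role": "assistant", "content": completion_text})
        messages.append(
            {"role": "user", "content": "Your answer is incorrect. Please try again."}
        )
    return reward
```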

@@ -75,7 +75,6 @@ python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \

```
allocation_mode=sglang.d12p1t1+d4p1t1 \
cluster.n_nodes=4 \
cluster.n_gpus_per_node=4 \
...

# Launch with Slurm launcher. 16 nodes (8 GPUs each), 12 nodes for generation, 4 nodes for training
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
```

@@ -85,7 +84,6 @@ python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \

```
allocation_mode=sglang.d96p1t1+d32p1t1 \
cluster.n_nodes=16 \
cluster.n_gpus_per_node=8 \
...
```

Additional references:

@@ -99,9 +97,9 @@ Additional references:

>
> 1. Ensure `allocation_mode` matches your cluster configuration
>    (`#GPUs == cluster.n_nodes * cluster.n_gpus_per_node`). For example, the Ray command
>    above uses `sglang.d12p1t1+d4p1t1`, i.e. 12 + 4 = 16 GPUs, which matches 4 nodes
>    with 4 GPUs each.
> 1. Ray / Slurm launchers only work with more than one node (`cluster.n_nodes > 1`). For
>    single-node scenarios, please use `LocalLauncher`.
> 1. In Ray / Slurm launchers, GPUs are allocated at node granularity, which means the
>    number of GPUs for generation or training must be an integer multiple of
>    `cluster.n_gpus_per_node`.