mirror of https://github.com/inclusionAI/AReaL
This commit is contained in: parent 7f876e7182, commit e6bf47f7b0
README.md
@@ -20,10 +20,11 @@ like how you enjoy real-world milk tea (cheers).

**AReaL Highlights**

- ⚡ **\[NEW\] Light-weight & AI-centric:** Our new release **AReaLite** follows an
  **AI-centric** design that prioritizes a better development experience for AI
  researchers. As a result, **AReaLite** delivers most AReaL functionalities with a much
  more light-weight codebase, letting users build their own **agentic** and **RLVR**
  training workflows with less effort.
- 🔥 **Asynchronous RL**: With algorithm-system co-design, AReaL supports fully
  asynchronous RL for **the fastest training**! Experimental support for multi-turn
  agentic RL is also provided.

@@ -37,9 +38,12 @@ like how you enjoy real-world milk tea (cheers).

## News

**\[2025/07/31\] (v0.4, AReaLite)** We introduce **AReaLite**, a **light-weight**
version of AReaL designed specifically for AI researchers and rapid prototyping.
AReaLite features an **AI-centric** API design that prioritizes ease of use and
algorithm development, while inherently supporting fully asynchronous **agentic RL**.
With 80% fewer lines of code, AReaLite maintains 90% of AReaL's core functionality.
Check out [our AReaLite design doc](/arealite/README.md) and
[the quickstart guide](/docs/tutorial/quickstart.md) to begin your journey with
**AReaLite**!

**\[2025/06/03\] (v0.3, boba²)** We release **boba²** (double-boba) for fully

@@ -64,7 +68,7 @@ New highlights in AReaLite:

- Follows an *AI-centric* API design instead of the *system-centric* architecture in old
  AReaL, which makes it easier for AI researchers to adopt, understand, and develop
  effectively and efficiently. To learn more about the design principles of AReaL,
  please read the [AReaLite design doc](/arealite/README.md)!

- A much more *light-weight* codebase compared to the old AReaL codebase, with only
  **20%** of the lines of code, with a detailed
  [code walkthrough](/docs/arealite/gsm8k_grpo.md) on an

@@ -29,8 +29,8 @@ show you how these steps are done in detail.

## Launching the Experiment

As shown in the [quickstart guide](../tutorial/quickstart.md), experiments in AReaLite
are launched using standalone launchers with the following commands:

```
# Local Launcher
...
```

@@ -72,7 +72,7 @@ config: GRPOConfig

## Loading and Preprocessing Dataset

We use the `datasets` and `torchdata` packages to load and preprocess the dataset into
our dataloader. First, we download `openai/gsm8k` from Hugging Face and split it by data
parallel ranks, then map it to our desired format:

```python
...
```
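
As a rough sketch (not the repository's exact code) of what this step amounts to: load
`openai/gsm8k`, keep only the current data-parallel rank's shard, and map each sample
into a chat-style format. The function name and the exact message format below are
illustrative assumptions:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node


def get_gsm8k_dataset(rank: int, world_size: int):
    # Download openai/gsm8k from Hugging Face and keep only this rank's shard.
    dataset = load_dataset("openai/gsm8k", "main", split="train")
    dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

    # Map each sample into the chat-style format consumed by the rollout workflow.
    def to_messages(sample):
        return {
            "messages": [{"role": "user", "content": sample["question"]}],
            "answer": sample["answer"],
        }

    return dataset.map(to_messages)
```

A `torchdata` dataloader can then iterate over this per-rank shard inside the training
loop.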

@@ -116,14 +116,14 @@ on separate GPUs from those used for training. The `RemoteSGLangEngine` acts as

that interacts with the servers. `RemoteSGLangEngine` runs in an SPMD manner on every
training process, without occupying any GPUs.

`RemoteSGLangEngine` provides two APIs, `agenerate` and `async_update_weights`. It is
worth mentioning that, in asynchronous RL experiments in AReaLite, an inference-side
weight update can happen **in the middle of** generating the response to a prompt. In
other words, one output sequence can be generated by multiple model versions. Let us
look at the code of `agenerate` and `async_update_weights` for a better understanding.

In `async_update_weights`, the engine first sends `pause_generation` requests to all
inference servers, notifying them that a weight update is about to happen. Upon
receiving `pause_generation`, inference servers immediately stop generating and respond
with the tokens generated so far. Then, the engine sends `update_weights_from_distributed`
(for NCCL update) or `update_weights_from_disk` (for disk update). After the update is

@@ -133,8 +133,8 @@ start working again.

```python
class RemoteSGLangEngine:
    ...
    def async_update_weights(self, meta: WeightUpdateMeta):
        # `async_update_weights` is completely async.
        # It submits a task to a ProcessPoolExecutor and returns a future.
        for addr in self.addresses:
            res = requests.post(f"http://{addr}/pause_generation")
        ...
```

@@ -148,7 +148,7 @@ class RemoteSGLangEngine:

```python
        def callback(future):
            for addr in self.addresses:
                requests.post(f"http://{addr}/continue_generation")

        future.add_done_callback(callback)
        return future
```
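
The method above follows a common submit-and-callback pattern: the update work is handed
to an executor, a future is returned to the caller, and a completion callback resumes
generation on the servers. A small self-contained illustration of that pattern in plain
Python (not AReaLite code):

```python
from concurrent.futures import ProcessPoolExecutor


def do_update(payload: str) -> str:
    # Stand-in for the actual weight-transfer work.
    return f"updated with {payload}"


def on_done(fut) -> None:
    # Runs once the update finishes; this is where generation would be resumed.
    print("done:", fut.result())


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(do_update, "new-weights")
        future.add_done_callback(on_done)
        # The caller can keep working and only block on the result when needed.
        print(future.result())
```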

@@ -285,15 +285,15 @@ class WorkflowExecutor:

```python
            rid += 1
        # Wait for rollout completion
        tasks = list(rollout_tasks.values())
        completed_tasks = []
        if tasks:
            completed_tasks, _ = await asyncio.wait(
                tasks,
                timeout=ROLLOUT_POLL_WAIT_TIME,
                return_when=asyncio.FIRST_COMPLETED,
            )
        # Collect done results, put the results into output queue
        for task in completed_tasks:
            traj = await task
            task_rid = task.get_name()
            rollout_tasks.pop(task_rid)
```
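
The fragment above relies on a standard asyncio idiom: wait on a set of tasks with a
timeout and `FIRST_COMPLETED`, then drain whatever finished before submitting more work.
A self-contained illustration of the same idiom (independent of AReaLite):

```python
import asyncio


async def worker(i: int) -> int:
    await asyncio.sleep(0.1 * i)
    return i


async def main() -> None:
    pending = {asyncio.create_task(worker(i), name=str(i)) for i in range(4)}
    results = []
    while pending:
        # Wake up at least every 50 ms, even if nothing has completed yet.
        done, pending = await asyncio.wait(
            pending, timeout=0.05, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            results.append(await task)
    print(results)  # [0, 1, 2, 3]


asyncio.run(main())
```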

@@ -432,25 +432,24 @@ weight_update_meta = WeightUpdateMeta.from_disk(config.saver)

After a training step is finished, we transfer the new weights from the actor engine to
the remote inference servers:

1. The rollout engine needs to stop sending generation requests to remote servers
   (`rollout.pause()`) to avoid server-side congestion.
1. Since we need to invoke the weight update on the trainer engine and the remote
   inference servers at the same time, the training script asynchronously sends requests
   to the remote inference servers and then immediately uploads weights on the trainer
   engine.

```python
# 1. Pause rollout on remote inference servers
rollout.pause()
# 2. Send requests to remote servers, telling them to update weights
if dist.get_rank() == 0:
    future = rollout.async_update_weights(weight_update_meta)
# 3. The actor begins to transfer weights
actor.upload_weights(weight_update_meta)
# 4. Wait for remote servers to return after finishing their updates
if dist.get_rank() == 0:
    future.result()
# 5. Synchronize rollout processes for the model version update
dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
# 6. Resume rollout on remote inference servers
rollout.resume()
# 7. Set versions, ensuring the actor and rollout engine versions stay identical
actor.set_version(global_step + 1)
rollout.set_version(global_step + 1)
```

@@ -482,7 +481,7 @@ for global_step in range(max_steps):

```python
    rollout.pause()
    if dist.get_rank() == 0:
        future = rollout.async_update_weights(weight_update_meta)
    actor.upload_weights(weight_update_meta)
    if dist.get_rank() == 0:
        future.result()
```

@@ -8,10 +8,12 @@ You can find the complete implementation in `arealite/workflow/multi_turn.py`.

## Step 1: Define Your Workflow

AReaLite gives you flexibility in how you design your agents to run **an episode**. An
episode defines how your agent rolls out a complete training sample from an input
prompt, using tools, reward functions, and (multi-turn) generation. Instead of rigid
`Agent` classes that might constrain your agent's capabilities, AReaLite captures all
rollout behavior in a `RolloutWorkflow` class. This approach allows you to customize
your agent's behavior however you need.

```python
# arealite/api/workflow_api.py
...
```
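
As an illustration of the idea only (the method and parameter names below are
assumptions, not necessarily the actual arealite API), such a workflow class boils down
to one async entry point that takes an inference engine plus a single prompt's data and
returns the collected trajectory:

```python
from abc import ABC, abstractmethod


class RolloutWorkflow(ABC):
    # Illustrative sketch; see arealite/api/workflow_api.py for the real interface.
    @abstractmethod
    async def arun_episode(self, engine, data):
        """Roll out one episode for a single prompt and return the sequences,
        logprobs, loss masks, and rewards needed for training."""
```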

@@ -40,7 +42,7 @@ interact.

> generated from that prompt—it's not batched. However, you can generate multiple
> trajectories from a single prompt (for example, with GRPO or tree search).

### Setting Up the Multi-turn Math Workflow

Let's build a multi-turn rollout workflow for solving math problems. First, we'll define
the `__init__` method to set up what we need during rollout:
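
A plausible sketch of such an `__init__` (the argument list is an assumption; only the
tokenizer, the reward function, and `max_turns` are actually referenced later in this
walkthrough):

```python
class RolloutWorkflow:
    # Minimal stand-in for the real base class in arealite/api/workflow_api.py.
    ...


class MultiTurnWorkflow(RolloutWorkflow):
    def __init__(self, reward_fn, gconfig, tokenizer, max_turns):
        # Keep everything the episode loop needs at hand.
        self.reward_fn = reward_fn    # returns 1 for a correct answer, 0 otherwise
        self.gconfig = gconfig        # generation/sampling settings (assumed name)
        self.tokenizer = tokenizer    # used to apply the chat template below
        self.max_turns = max_turns    # upper bound on retry turns
```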

@@ -82,10 +84,11 @@ class MultiTurnWorkflow(RolloutWorkflow):

```python
        seq, logprobs, loss_mask, versions = [], [], [], []
        messages = data["messages"]
        # Run multi-turn rollout until we get the correct answer
        turn_index = 0
        reward = 0
        discount = 1.0
        rid = uuid.uuid4().hex
        while reward == 0 and turn_index < self.max_turns:
            # Convert the conversation into input tokens
            input_ids = self.tokenizer.apply_chat_template(
                messages,
                ...
            )
```

@@ -111,7 +114,7 @@ class MultiTurnWorkflow(RolloutWorkflow):

> **Note**: The `rid` field in `LLMRequest` is the request ID. Requests with the same ID
> will reuse the LLM inference server's KV caches for better efficiency.
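
Concretely, `rid` is created once per episode (see `rid = uuid.uuid4().hex` above) and
is meant to be reused for each turn's request, so the server can keep serving the same
conversation from its cache. A small stand-in illustration (only the `rid` field is
documented here; the other field is an assumption):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LLMRequest:
    # Stand-in for the request type used by the workflow; `rid` is the request ID.
    rid: str
    input_ids: List[int] = field(default_factory=list)


# Both turns share one request ID, so the inference server can reuse its KV cache
# for the conversation instead of recomputing it from scratch.
first_turn = LLMRequest(rid="episode-42", input_ids=[1, 2, 3])
second_turn = LLMRequest(rid="episode-42", input_ids=[1, 2, 3, 4, 5, 6])
```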

### Handling Multi-turn Conversations

Next, we'll check if the current answer is correct using our `reward_fn`. This function
should return 1 for correct answers and 0 otherwise. When the answer is wrong, we'll
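
As a rough sketch of that logic (names such as `completion_text` and the feedback
message are illustrative assumptions, not the walkthrough's exact code):

```python
def handle_turn(messages, completion_text, answer, reward_fn):
    # Illustrative only: score the model's answer; if it is wrong, append feedback
    # so the next turn of the while-loop above can retry.
    reward = reward_fn(completion_text, answer)
    if reward == 0:
        messages.append({"role": "assistant", "content": completion_text})
        messages.append(
            {"role": "user", "content": "Your answer is incorrect. Please try again."}
        )
    return reward
```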

@@ -75,7 +75,6 @@ python3 -m arealite.launcher.ray examples/arealite/gsm8k_grpo.py \

```
allocation_mode=sglang.d12p1t1+d4p1t1 \
cluster.n_nodes=4 \
cluster.n_gpus_per_node=4 \
...

# Launch with Slurm launcher. 16 nodes (8 GPUs each), 12 nodes for generation, 4 nodes for training
python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \
```

@@ -85,7 +84,6 @@ python3 -m arealite.launcher.slurm examples/arealite/gsm8k_grpo.py \

```
allocation_mode=sglang.d96p1t1+d32p1t1 \
cluster.n_nodes=16 \
cluster.n_gpus_per_node=8 \
...
```

Additional references:

@@ -99,9 +97,9 @@ Additional references:

>
> 1. Ensure `allocation_mode` matches your cluster configuration
>    (`#GPUs == cluster.n_nodes * cluster.n_gpus_per_node`). For example, the Ray command
>    above uses `sglang.d12p1t1+d4p1t1`, i.e. 12 + 4 = 16 GPUs, which matches 4 nodes
>    with 4 GPUs each.
> 1. Ray / Slurm launchers only work with more than one node (`cluster.n_nodes > 1`). For
>    single-node scenarios, please use `LocalLauncher`.
> 1. In Ray / Slurm launchers, GPUs are allocated at node granularity, which means the
>    number of GPUs for generation or training must be an integer multiple of
>    `cluster.n_gpus_per_node`.