mirror of https://github.com/inclusionAI/AReaL
Update time estimation in README (#23)
* fix: `self.tasks_ids` should also be filtered
* PullRequest: 67 Update v0.2.0 Dockerfile. Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/67 Signed-off-by: 温差 <xushusheng.xss@antgroup.com>
* fw/v0.2.0-dockerfile
* PullRequest: 66 Update v0.2.0 cover letter. Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/66 Signed-off-by: 温差 <xushusheng.xss@antgroup.com>
* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .
* upload 7B zero and 32B sft config
* clean ci
* PullRequest: 72 change the condition of using etcd. Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/72 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* change the condition of using etcd
* PullRequest: 60 Change the default SGLang parameters to avoid precision issues. Merge branch fw/fix-sglang of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/60 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* change vllm config
* .
* .
* PullRequest: 73 Fix a setup issue when using ETCD. Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/73 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix etcd
* .
* .
* PullRequest: 75 Fix epoch counter before model function call execution. Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/75 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* PullRequest: 76 Update from opensource repository. Merge branch mzy/update-from-opensource of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/76 Signed-off-by: 博惟 <bowei.fw@antgroup.com>
* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* fw/v0.2.0-dockerfile
* .
* V0.2.0 prerelease (#8)
* V0.2.0 prerelease (#9)
* Update README.md
* Clean up CI (#11)
* Update README.md citation (#15)
* Xss/readme (#16)
* Merge updates from ant repository. (#18)
* update readme
* update readme
* .
* .

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: chucai.dzq <chucai.dzq@alibaba-inc.com>
This commit is contained in:
parent 1c33379c93
commit aff05c2544
README.md
@@ -34,11 +34,11 @@ Thanks to a series of system-level optimizations, AReaL v0.2 improves its end-to

In the following table, we show the convergence time under different resource settings:

-| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B (SFT)** |
-| --- | :---: | :---: | :---: | :---: | :---: | :---: |
-| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
-| Step | 250 | 250 | 250 | 400 | 400 | 300 |
-| Time (h) | ~230 | ~70 | ~25 | ~290 | ~80 | ~3.5 |
+| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B (SFT)** |
+| --- |:--------:|:--------:|:--------:|:------:|:------:|:-------:|
+| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
+| Step | 250 | 250 | 250 | 400 | 400 | 300 |
+| Time (h) | ~240 | ~69 | ~27 | ~252 | ~90 | ~3.5 |

#### SOTA 7B model using RL in math reasoning
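For a quick sense of what the updated convergence table implies about cost, the sketch below (illustrative Python, not part of the AReaL codebase; the `settings` list and its values are simply copied from the new table above) converts each GPU-count / wall-clock pair into total GPU-hours.

```python
# Illustrative sketch only -- not AReaL code. Wall-clock hours are the updated
# values from the convergence table above.
settings = [
    # (model size, number of GPUs, approximate wall-clock hours)
    ("1.5B", 8, 240),
    ("1.5B", 32, 69),
    ("1.5B", 128, 27),
    ("7B", 32, 252),
    ("7B", 128, 90),
]

for model, num_gpus, hours in settings:
    gpu_hours = num_gpus * hours
    print(f"{model}: {num_gpus:>3} GPUs x ~{hours} h = ~{gpu_hours} GPU-hours")
```

Scaling out shortens wall-clock time but is not free: for the 1.5B run, the table's numbers work out to roughly 1,900 GPU-hours on 8 GPUs versus about 3,500 GPU-hours on 128 GPUs.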
@@ -5,19 +5,23 @@

Check if your hardware meets these minimum requirements:

-|**Model Size**| **1.5B** |**1.5B**|**1.5B**| **7B** | **7B** |
-|---|:---:|:---:|:---:|:---:|:---:|
-| **Nodes** | **1** | **4** | **16** | **4** | **16** |
-| GPU | 8x H800 |8x H800 per node| 8x H800 per node |8x H800 per node| 8x H800 per node |
-| CPU | 48 cores |48 cores per node|48 cores per node| 48 cores per node |48 cores per node|
-| Memory | 1 TB |1 TB per node|1 TB per node| 1 TB per node |1 TB per node|
-| Network | NVSwitch |NVSwitch + RoCE 3.2 Tbps|NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps |NVSwitch + RoCE 3.2 Tbps|
-| Storage | 1TB |Shared storage (NAS) 10TB|Shared storage (NAS) 10TB| Shared storage (NAS) 10TB |Shared storage (NAS) 10TB|
-| **Total Time (Hours)** | **230** | **70** | **25** | **290** | **80** |
+|**Model Size**| **1.5B** |**1.5B**|**1.5B**| **7B** | **7B** | **32B** |
+|---|:---:|:---:|:---:|:---:|:---:|:---:|
+| **Nodes** | **1** | **4** | **16** | **4** | **16** | **16** |
+| GPU | 8x H800 |8x H800 per node| 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
+| CPU | 48 cores |48 cores per node|48 cores per node| 48 cores per node | 48 cores per node| 48 cores per node|
+| Memory | 1 TB |1 TB per node|1 TB per node| 1 TB per node | 1 TB per node| 1 TB per node|
+| Network | NVSwitch |NVSwitch + RoCE 3.2 Tbps|NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps|
+| Storage | 1TB |Shared storage (NAS) 10TB|Shared storage (NAS) 10TB| Shared storage (NAS) 10TB |Shared storage (NAS) 10TB| Shared storage (NAS) 10TB|
+| BatchSize x GroupSize | 512x16 | 512x16 | 512x16 | 512x16 | 512x16 | 512x32 |
+| **Single-step Time (seconds)** | **3461** | **997** | **391** | **2275** | **815** | **6707** |
+| **#Steps Until Convergence** | **~250** | **~250** | **~250** | **~400** | **~400** | - |
+| **Total Time (Hours)** | **~240** | **~69** | **~27** | **~252** | **~90** | - |

Notes:

- GPUs need to have 80GB memory. Other GPU models with similar specs are acceptable.
- Single-node training can use local storage, but multi-node training requires shared storage.
- We haven't successfully trained a powerful 32B model, so we cannot estimate the required steps and time.

## Software Requirements

This tutorial provides a Docker image. Below are the tested software versions:
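The hardware-requirements table in the hunk above reports single-step time, steps until convergence, and total hours; the totals follow directly from the first two. The snippet below is a minimal sanity check (not code from this repository; the `rows` list and variable names are mine, the numbers are copied from the updated table) that re-derives the reported totals from the per-step figures.

```python
# Minimal sanity check, not AReaL code: total time ~= single-step time x steps.
# Per-step seconds, step counts, and reported hours are taken from the updated
# hardware-requirements table above.
rows = [
    # (model size, GPUs, seconds per step, steps to convergence, reported hours)
    ("1.5B", 8, 3461, 250, 240),
    ("1.5B", 32, 997, 250, 69),
    ("1.5B", 128, 391, 250, 27),
    ("7B", 32, 2275, 400, 252),
    ("7B", 128, 815, 400, 90),
]

for model, gpus, step_seconds, steps, reported_hours in rows:
    estimated_hours = step_seconds * steps / 3600
    print(f"{model} on {gpus} GPUs: ~{estimated_hours:.0f} h estimated, "
          f"~{reported_hours} h reported")
```

Each estimate lands within about an hour of the total reported in the table, so the updated hour figures are consistent with the listed single-step times and step counts.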
@@ -6,15 +6,18 @@

To run the training workflow successfully, check the table below to confirm that your hardware meets the requirements:

-| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** |
-|---|---|---|---|---|---|
-| Nodes | 1 | 4 | 16 | 4 | 16 |
-| GPU | 8x H800 | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
-| CPU | 48 cores | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node |
-| Memory | 1 TB | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node |
-| Network | NVSwitch | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps |
-| Storage | 1TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB |
-| Total Time (hours) | **230** | **70** | **25** | **290** | **80** |
+| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B** |
+|---|---|---|---|---|---|---|
+| Nodes | 1 | 4 | 16 | 4 | 16 | 16 |
+| GPU | 8x H800 | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
+| CPU | 48 cores | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node |
+| Memory | 1 TB | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node |
+| Network | NVSwitch | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps |
+| Storage | 1TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB |
+| BatchSize x GroupSize | 512x16 | 512x16 | 512x16 | 512x16 | 512x16 | 512x32 |
+| Single-step Time (seconds) | **3461** | **997** | **391** | **2275** | **815** | **6707** |
+| #Steps Until Convergence | **~250** | **~250** | **~250** | **~400** | **~400** | - |
+| Total Time (hours) | **~240** | **~69** | **~27** | **~252** | **~90** | - |

Notes on hardware requirements:
@@ -22,7 +25,7 @@

- Single-node training can use local storage, but multi-node training requires shared storage; without it, multi-node training cannot proceed.

- All training runs use a 16K context length.
- The 32B model has not yet produced meaningful results, so we cannot estimate the number of steps or the time needed to reach convergence.

## Software Requirements