mirror of https://github.com/inclusionAI/AReaL
Update time estimation in README (#23)
* fix: `self.tasks_ids` should also be filtered
* PullRequest: 67 Update v0.2.0 Dockerfile. Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/67 Signed-off-by: 温差 <xushusheng.xss@antgroup.com>
* fw/v0.2.0-dockerfile
* PullRequest: 66 Update v0.2.0 cover letter. Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/66 Signed-off-by: 温差 <xushusheng.xss@antgroup.com>
* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .
* upload 7B zero and 32B sft config
* clean ci
* PullRequest: 72 change the condition of using etcd. Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/72 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* change the condition of using etcd
* PullRequest: 60 Change the default SGLang parameters to avoid precision issues. Merge branch fw/fix-sglang of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/60 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* change vllm config
* .
* .
* PullRequest: 73 Fix a setup issue when using ETCD. Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/73 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix etcd
* .
* .
* PullRequest: 75 Fix epoch counter before model function call execution. Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/75 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* PullRequest: 76 Update from opensource repository. Merge branch mzy/update-from-opensource of git@code.alipay.com:inclusionAI/AReaL.git into main. https://code.alipay.com/inclusionAI/AReaL/pull_requests/76 Signed-off-by: 博惟 <bowei.fw@antgroup.com>
* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* fw/v0.2.0-dockerfile
* .
* V0.2.0 prerelease (#8)
* V0.2.0 prerelease (#9)
* Update README.md
* Clean up CI (#11)
* Update README.md citation (#15)
* Xss/readme (#16)
* Merge updates from ant repository. (#18)
* update readme
* update readme
* .
* .

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: chucai.dzq <chucai.dzq@alibaba-inc.com>
This commit is contained in:
parent 1c33379c93
commit aff05c2544
README.md
@@ -34,11 +34,11 @@ Thanks to a series of system-level optimizations, AReaL v0.2 improves its end-to

In the following table, we show the convergence time under different resource settings:

-| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B (SFT)** |
-| --- | :---: | :---: | :---: | :---: | :---: | :---: |
-| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
-| Step | 250 | 250 | 250 | 400 | 400 | 300 |
-| Time (h) | ~230 | ~70 | ~25 | ~290 | ~80 | ~3.5 |
+| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B (SFT)** |
+| --- |:--------:|:--------:|:--------:|:------:|:------:|:-------:|
+| #GPU | 8 | 32 | 128 | 32 | 128 | 64 |
+| Step | 250 | 250 | 250 | 400 | 400 | 300 |
+| Time (h) | ~240 | ~69 | ~27 | ~252 | ~90 | ~3.5 |

#### SOTA 7B model using RL in math reasoning
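For a quick sense of what the updated convergence table implies about cost, the sketch below (illustrative Python, not part of the AReaL codebase; the `settings` list and its values are simply copied from the new table above) converts each GPU-count / wall-clock pair into total GPU-hours.

```python
# Illustrative sketch only -- not AReaL code. Wall-clock hours are the updated
# values from the convergence table above.
settings = [
    # (model size, number of GPUs, approximate wall-clock hours)
    ("1.5B", 8, 240),
    ("1.5B", 32, 69),
    ("1.5B", 128, 27),
    ("7B", 32, 252),
    ("7B", 128, 90),
]

for model, num_gpus, hours in settings:
    gpu_hours = num_gpus * hours
    print(f"{model}: {num_gpus:>3} GPUs x ~{hours} h = ~{gpu_hours} GPU-hours")
```

Scaling out shortens wall-clock time but is not free: for the 1.5B run, the table's numbers work out to roughly 1,900 GPU-hours on 8 GPUs versus about 3,500 GPU-hours on 128 GPUs.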
@@ -5,19 +5,23 @@

Check if your hardware meets these minimum requirements:

-|**Model Size**| **1.5B** |**1.5B**|**1.5B**| **7B** | **7B** |
-|---|:---:|:---:|:---:|:---:|:---:|
-| **Nodes** | **1** | **4** | **16** | **4** | **16** |
-| GPU | 8x H800 |8x H800 per node| 8x H800 per node |8x H800 per node| 8x H800 per node |
-| CPU | 48 cores |48 cores per node|48 cores per node| 48 cores per node |48 cores per node|
-| Memory | 1 TB |1 TB per node|1 TB per node| 1 TB per node |1 TB per node|
-| Network | NVSwitch |NVSwitch + RoCE 3.2 Tbps|NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps |NVSwitch + RoCE 3.2 Tbps|
-| Storage | 1TB |Shared storage (NAS) 10TB|Shared storage (NAS) 10TB| Shared storage (NAS) 10TB |Shared storage (NAS) 10TB|
-| **Total Time (Hours)** | **230** | **70** | **25** | **290** | **80** |
+|**Model Size**| **1.5B** |**1.5B**|**1.5B**| **7B** | **7B** | **32B** |
+|---|:---:|:---:|:---:|:---:|:---:|:---:|
+| **Nodes** | **1** | **4** | **16** | **4** | **16** | **16** |
+| GPU | 8x H800 |8x H800 per node| 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
+| CPU | 48 cores |48 cores per node|48 cores per node| 48 cores per node | 48 cores per node| 48 cores per node|
+| Memory | 1 TB |1 TB per node|1 TB per node| 1 TB per node | 1 TB per node| 1 TB per node|
+| Network | NVSwitch |NVSwitch + RoCE 3.2 Tbps|NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps| NVSwitch + RoCE 3.2 Tbps|
+| Storage | 1TB |Shared storage (NAS) 10TB|Shared storage (NAS) 10TB| Shared storage (NAS) 10TB |Shared storage (NAS) 10TB| Shared storage (NAS) 10TB|
+| BatchSize x GroupSize | 512x16 | 512x16 | 512x16 | 512x16 | 512x16 | 512x32 |
+| **Single-step Time (seconds)** | **3461** | **997** | **391** | **2275** | **815** | **6707** |
+| **#Steps Until Convergence** | **~250** | **~250** | **~250** | **~400** | **~400** | - |
+| **Total Time (Hours)** | **~240** | **~69** | **~27** | **~252** | **~90** | - |

Notes:

- GPUs need to have 80GB memory. Other GPU models with similar specs are acceptable.
- Single-node training can use local storage, but multi-node training requires shared storage.
- We haven't successfully trained a powerful 32B model, so we cannot estimate the required steps and time.

## Software Requirements

This tutorial provides a Docker image. Below are the tested software versions:
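The hardware-requirements table in the hunk above reports single-step time, steps until convergence, and total hours; the totals follow directly from the first two. The snippet below is a minimal sanity check (not code from this repository; the `rows` list and variable names are mine, the numbers are copied from the updated table) that re-derives the reported totals from the per-step figures.

```python
# Minimal sanity check, not AReaL code: total time ~= single-step time x steps.
# Per-step seconds, step counts, and reported hours are taken from the updated
# hardware-requirements table above.
rows = [
    # (model size, GPUs, seconds per step, steps to convergence, reported hours)
    ("1.5B", 8, 3461, 250, 240),
    ("1.5B", 32, 997, 250, 69),
    ("1.5B", 128, 391, 250, 27),
    ("7B", 32, 2275, 400, 252),
    ("7B", 128, 815, 400, 90),
]

for model, gpus, step_seconds, steps, reported_hours in rows:
    estimated_hours = step_seconds * steps / 3600
    print(f"{model} on {gpus} GPUs: ~{estimated_hours:.0f} h estimated, "
          f"~{reported_hours} h reported")
```

Each estimate lands within about an hour of the total reported in the table, so the updated hour figures are consistent with the listed single-step times and step counts.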
@@ -6,15 +6,18 @@

To run the training workflow successfully, check the table below to confirm that your hardware meets the requirements:

-| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** |
-|---|---|---|---|---|---|
-| Nodes | 1 | 4 | 16 | 4 | 16 |
-| GPU | 8x H800 | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
-| CPU | 48 cores | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node |
-| Memory | 1 TB | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node |
-| Network | NVSwitch | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps |
-| Storage | 1TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB |
-| Total Time (hours) | **230** | **70** | **25** | **290** | **80** |
+| **Model Size** | **1.5B** | **1.5B** | **1.5B** | **7B** | **7B** | **32B** |
+|---|---|---|---|---|---|---|
+| Nodes | 1 | 4 | 16 | 4 | 16 | 16 |
+| GPU | 8x H800 | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node | 8x H800 per node |
+| CPU | 48 cores | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node | 48 cores per node |
+| Memory | 1 TB | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node | 1 TB per node |
+| Network | NVSwitch | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps | NVSwitch + RoCE 3.2 Tbps |
+| Storage | 1TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB | Shared storage (NAS) 10TB |
+| BatchSize x GroupSize | 512x16 | 512x16 | 512x16 | 512x16 | 512x16 | 512x32 |
+| Single-step Time (seconds) | **3461** | **997** | **391** | **2275** | **815** | **6707** |
+| #Steps Until Convergence | **~250** | **~250** | **~250** | **~400** | **~400** | - |
+| Total Time (hours) | **~240** | **~69** | **~27** | **~252** | **~90** | - |

Notes on hardware requirements:
@@ -22,7 +25,7 @@

- Single-node training can use local storage, but multi-node training requires shared storage; without it, multi-node training cannot proceed.

- All training runs use a 16K context length.
- The 32B model has not yet produced meaningful results, so we cannot estimate the number of steps or the time needed to reach convergence.

## Software Requirements