Commit Graph

194 Commits

Author SHA1 Message Date
博惟 9b9b0af9c3 PullRequest: 61 [Patch v0.2.0] Fix the PPO bug in old environments.
Merge branch fw/patch20250326 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/61

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
2025-03-28 09:58:26 +08:00
君末 f8afa97484 PullRequest: 49 Enable Parallel Execution for Code, Math, and Reference Tasks
Merge branch async-ref-rew of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/49

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-28 09:40:54 +08:00
meijun 5773645acd fix review issue 2025-03-27 21:43:17 +08:00
meijun.mei abfe8bd30f update dataset for fused-refrw 2025-03-27 18:38:22 +08:00
meijun.mei fc79f21622 update examples script for fuse-ref-rw 2025-03-25 18:18:57 +08:00
博惟 9f77f96580 PullRequest: 58 Support ETCD3 name resolving repo
Merge branch fw/etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/58

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* .
* .
2025-03-25 16:05:04 +08:00
博惟 6ccbb01ca8 PullRequest: 56 Support the cuda 12.8 image with megatron v0.11.0 and SGLang 0.4.4
Merge branch fw/megatron-v0.11.0 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/56

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* update trial
* add moe test script
* .
* .
* .
* .
* .
* .
* .
* .
* .
* remove gae2d
* .
2025-03-25 16:02:10 +08:00
博惟 46b7d3d32b PullRequest: 59 Disable sliding window and chunked prefill for vLLM by default
Merge branch fw/fix-vllm of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/59

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* change vllm config
* .
2025-03-25 11:26:52 +08:00
bowei.fw 89fb1a8009 Merge branch 'main' of code.alipay.com:inclusionAI/AReaL into async-ref-rew 2025-03-24 11:02:54 +08:00
bowei.fw 6547112a9d . 2025-03-24 11:02:40 +08:00
博惟 df90bb512a PullRequest: 55 Fix a recover bug caused by dataset filtering
Merge branch fw/fix-recover20250322 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/55

Signed-off-by: 乘鹭 <hechuyi.hcy@antgroup.com>


* fix recover
2025-03-22 22:54:14 +08:00
bowei.fw 0300f4b94b . 2025-03-22 14:58:19 +08:00
bowei.fw bd8df23a13 . 2025-03-22 14:57:12 +08:00
bowei.fw 88e99f887a . 2025-03-22 12:33:17 +08:00
bowei.fw 25c45c7e83 . 2025-03-22 12:22:34 +08:00
bowei.fw 9dcdb7a684 . 2025-03-21 22:38:21 +08:00
bowei.fw d1554585a4 Merge branch 'main' of code.alipay.com:inclusionAI/AReaL into async-ref-rew 2025-03-21 21:25:35 +08:00
bowei.fw de8243cc78 . 2025-03-21 21:22:50 +08:00
博惟 f90fe19e00 PullRequest: 53 Fix a potential reward hacking issue related to "emptyset"
Merge branch fw/fix-rwd-hacking of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/53

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
2025-03-21 16:42:24 +08:00
bowei.fw eb1e8a7592 dataset loading fixed 2025-03-20 21:52:01 +08:00
bowei.fw f8586e47c8 . 2025-03-20 15:24:31 +08:00
博惟 16e82698e0 PullRequest: 52 Fix ckpt ctl
Merge branch fw/fix-save of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/52

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
2025-03-20 14:28:39 +08:00
博惟 86a4e69d0f PullRequest: 51 Fix a bug when ckpt_freq* are not set.
Merge branch fw/fix-save of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/51

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-20 11:19:13 +08:00
博惟 5759579aa8 PullRequest: 48 Misc changes for supporting the async worker in the future
Merge branch fw/async-worker of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/48

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
2025-03-20 10:18:26 +08:00
Jun Mo 8732811c73 create ref-rw inference async mode 2025-03-19 18:00:31 +08:00
博惟 b8f1fc3ebf PullRequest: 47 Add a filelock for NFS name resolve to avoid the concurrency issue.
Merge branch fw/locked-nameresolve of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/47

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-19 16:31:23 +08:00
博惟 9c55827b32 PullRequest: 46 Call all runtime barriers upon CPU process groups and fix the SGLang performance with TP > 1
Merge branch fw/fix-gpu-barrier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/46?tab=diff

Signed-off-by: 闻通 <albert.zty@antgroup.com>


* .
2025-03-19 16:29:31 +08:00
怀颉 9a9d86112e PullRequest: 45 fix: multiple model families
Merge branch whj/fix-model-family of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/45

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-18 15:46:45 +08:00
wanghuaijie.whj 8c592f8ca1 fix: multiple model families 2025-03-18 15:42:03 +08:00
晓雷 4ac9595295 PullRequest: 43 Reduce GPU memory used by data transfer.
Merge branch mzy/fix-data-transfer-oom of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/43

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* add oom observe logs
* tested
* format and clear code
* .
* format
* remove logging
* .
* add comments
2025-03-18 15:20:57 +08:00
君末 122bf6f214 PullRequest: 39 optimize code/math functioncall param
Merge branch fix/math-code of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/39?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-18 09:05:07 +08:00
Jun Mo 8310d7beb7 code format 2025-03-18 09:00:13 +08:00
博惟 b619f64cda PullRequest: 42 Fix the error raising logic and the bug when using sglang with tensor parallel
Merge branch fw/patch20250317-3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/42

Signed-off-by: 郭唯 <kira.gw@antgroup.com>


* .
2025-03-17 20:37:43 +08:00
博惟 312e84b62c PullRequest: 41 fix typo
Merge branch fw/patch20250317-2 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/41

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-17 17:00:04 +08:00
博惟 22548adf11 PullRequest: 40 Patch fix duplicated argument
Merge branch fw/patch20250317 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/40

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-17 16:53:54 +08:00
郭唯 583077c3d8 PullRequest: 32 Support profiling on ray
Merge branch train-profiling of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/32

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-17 16:13:07 +08:00
bowei.fw 843418ab42 format file 2025-03-17 16:12:49 +08:00
meijun.mei 92e0777178 optimize code/math functioncall param 2025-03-17 15:45:52 +08:00
博惟 767bb7bf47 PullRequest: 38 Support SGLang generation based on the 24.07 docker image
Merge branch fw/sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/38?tab=comment#note_168700036

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
* .
* .
* cleanup
2025-03-17 15:45:00 +08:00
博惟 7b3e33430a PullRequest: 37 Support Megatron v0.9.0
Merge branch fw/megatron-v0.9.0 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/37

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
2025-03-17 14:49:26 +08:00
博惟 69681d9fe4 PullRequest: 35 Support log probability recomputation in PPO.
Merge branch fw/recompute-logprob of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/35

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
2025-03-17 14:38:53 +08:00
晓雷 c074d6142b PullRequest: 36 Fix SLURM singularity environments on PPU cluster
Merge branch mzy/fix-ppu-no-home of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/36

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* .
2025-03-17 12:33:03 +08:00
博惟 17a4ada06e PullRequest: 33 Force the partition to be balanced when the capacity is a large number
Merge branch fw/balanced-pack of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/33

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* .
2025-03-17 10:47:29 +08:00
博惟 0b15ead835 PullRequest: 34 Fix the fp16 training issue.
Merge branch fw/fix-fp16 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/34

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* fix bf16 training issue
2025-03-17 10:46:57 +08:00
穰侯 548aaef642 PullRequest: 22 Avoid OOM when computing log_softmax
Merge branch ranghou-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/22?tab=comment#note_168300024

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-14 11:21:42 +08:00
穰侯 d70586f85e memory saving for log_softmax and gather 2025-03-14 10:18:28 +08:00
kira.gw 07e724741b Support profiling on ray 2025-03-13 11:40:37 +08:00
博惟 234e3dd3a0 PullRequest: 31 Fix typo in topology name
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/31

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
* fix typo
2025-03-12 18:31:18 +08:00
博惟 56e4dd1f5d PullRequest: 30 Change the topology order of vLLM for better locality of pipeline parallelism
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/30?tab=comment

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
2025-03-12 17:26:50 +08:00
晓雷 3d8be914af PullRequest: 20 Fix bugs in auto evaluation
Merge branch mzy/refactor-eval of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/20?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* test
* move evaluator to main process
* .
* clear codes
* add docstring
* .
* separate wandb groups
* .
* handle eval error
* add check for failed eval
* refactor evaluator
* refactor evaluator, fix master worker wandb login
* .
2025-03-12 16:10:12 +08:00