Commit Graph

235 Commits

Author SHA1 Message Date
bowei.fw 6547112a9d . 2025-03-24 11:02:40 +08:00
博惟 df90bb512a PullRequest: 55 Fix a recover bug caused by dataset filtering
Merge branch fw/fix-recover20250322 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/55

Signed-off-by: 乘鹭 <hechuyi.hcy@antgroup.com>


* fix recover
2025-03-22 22:54:14 +08:00
bowei.fw 0300f4b94b . 2025-03-22 14:58:19 +08:00
bowei.fw bd8df23a13 . 2025-03-22 14:57:12 +08:00
bowei.fw 88e99f887a . 2025-03-22 12:33:17 +08:00
bowei.fw 25c45c7e83 . 2025-03-22 12:22:34 +08:00
bowei.fw 9dcdb7a684 . 2025-03-21 22:38:21 +08:00
bowei.fw d1554585a4 Merge branch 'main' of code.alipay.com:inclusionAI/AReaL into async-ref-rew 2025-03-21 21:25:35 +08:00
bowei.fw de8243cc78 . 2025-03-21 21:22:50 +08:00
博惟 f90fe19e00 PullRequest: 53 Fix a potential reward hacking issue related to "emptyset"
Merge branch fw/fix-rwd-hacking of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/53

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
2025-03-21 16:42:24 +08:00
bowei.fw eb1e8a7592 dataset loading fixed 2025-03-20 21:52:01 +08:00
bowei.fw f8586e47c8 . 2025-03-20 15:24:31 +08:00
博惟 16e82698e0 PullRequest: 52 Fix ckpt ctl
Merge branch fw/fix-save of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/52

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
2025-03-20 14:28:39 +08:00
博惟 86a4e69d0f PullRequest: 51 Fix a bug when ckpt_freq* are not set.
Merge branch fw/fix-save of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/51

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-20 11:19:13 +08:00
博惟 5759579aa8 PullRequest: 48 Misc changes for supporting the async worker in the future
Merge branch fw/async-worker of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/48

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
2025-03-20 10:18:26 +08:00
Jun Mo 8732811c73 create ref-rw inference async mode 2025-03-19 18:00:31 +08:00
博惟 b8f1fc3ebf PullRequest: 47 Add a filelock for NFS name resolve to avoid the concurrency issue.
Merge branch fw/locked-nameresolve of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/47

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-19 16:31:23 +08:00
博惟 9c55827b32 PullRequest: 46 Call all runtime barriers upon CPU process groups and fix the SGLang performance with TP > 1
Merge branch fw/fix-gpu-barrier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/46?tab=diff

Signed-off-by: 闻通 <albert.zty@antgroup.com>


* .
2025-03-19 16:29:31 +08:00
怀颉 9a9d86112e PullRequest: 45 fix: multiple model families
Merge branch whj/fix-model-family of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/45

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-18 15:46:45 +08:00
wanghuaijie.whj 8c592f8ca1 fix: multiple model families 2025-03-18 15:42:03 +08:00
晓雷 4ac9595295 PullRequest: 43 Reduce GPU memory used by data transfer.
Merge branch mzy/fix-data-transfer-oom of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/43

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* add oom observe logs
* tested
* format and clear code
* .
* format
* remove logging
* .
* add comments
2025-03-18 15:20:57 +08:00
君末 122bf6f214 PullRequest: 39 optimize code/math functioncall param
Merge branch fix/math-code of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/39?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-18 09:05:07 +08:00
Jun Mo 8310d7beb7 code format 2025-03-18 09:00:13 +08:00
博惟 b619f64cda PullRequest: 42 Fix the error raising logic and the bug when using sglang with tensor parallel
Merge branch fw/patch20250317-3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/42

Signed-off-by: 郭唯 <kira.gw@antgroup.com>


* .
2025-03-17 20:37:43 +08:00
博惟 312e84b62c PullRequest: 41 fix typo
Merge branch fw/patch20250317-2 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/41

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-17 17:00:04 +08:00
博惟 22548adf11 PullRequest: 40 Patch fix duplicated argument
Merge branch fw/patch20250317 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/40

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-03-17 16:53:54 +08:00
郭唯 583077c3d8 PullRequest: 32 Support profiling on ray
Merge branch train-profiling of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/32

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-17 16:13:07 +08:00
bowei.fw 843418ab42 format file 2025-03-17 16:12:49 +08:00
meijun.mei 92e0777178 optimize code/math functioncall param 2025-03-17 15:45:52 +08:00
博惟 767bb7bf47 PullRequest: 38 Support SGLang generation based on the 24.07 docker image
Merge branch fw/sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/38?tab=comment#note_168700036

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
* .
* .
* cleanup
2025-03-17 15:45:00 +08:00
博惟 7b3e33430a PullRequest: 37 Support Megatron v0.9.0
Merge branch fw/megatron-v0.9.0 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/37

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
2025-03-17 14:49:26 +08:00
博惟 69681d9fe4 PullRequest: 35 Support log probability recomputation in PPO.
Merge branch fw/recompute-logprob of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/35

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
2025-03-17 14:38:53 +08:00
晓雷 c074d6142b PullRequest: 36 Fix SLURM singularity environments on PPU cluster
Merge branch mzy/fix-ppu-no-home of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/36

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* .
2025-03-17 12:33:03 +08:00
博惟 17a4ada06e PullRequest: 33 Force the partition to be balanced when the capacity is a large number
Merge branch fw/balanced-pack of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/33

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* .
2025-03-17 10:47:29 +08:00
博惟 0b15ead835 PullRequest: 34 Fix the fp16 training issue.
Merge branch fw/fix-fp16 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/34

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* fix bf16 training issue
2025-03-17 10:46:57 +08:00
穰侯 548aaef642 PullRequest: 22 Avoid OOM issues during log_softmax
Merge branch ranghou-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/22?tab=comment#note_168300024

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-14 11:21:42 +08:00
穰侯 d70586f85e memory saving for log_softmax and gather 2025-03-14 10:18:28 +08:00
kira.gw 07e724741b Support profiling on ray 2025-03-13 11:40:37 +08:00
博惟 234e3dd3a0 PullRequest: 31 Fix typo in topology name
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/31

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
* fix typo
2025-03-12 18:31:18 +08:00
博惟 56e4dd1f5d PullRequest: 30 Change the topology order of vLLM for better locality of pipeline parallelism
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/30?tab=comment

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
2025-03-12 17:26:50 +08:00
晓雷 3d8be914af PullRequest: 20 Fix bugs in auto evaluation
Merge branch mzy/refactor-eval of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/20?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* test
* move evaluator to main process
* .
* clear codes
* add docstring
* .
* separate wandb groups
* .
* handle eval error
* add check for failed eval
* refactor evaluator
* refactor evaluator, fix master worker wandb login
* .
2025-03-12 16:10:12 +08:00
博惟 26a48be73e PullRequest: 29 fix the dataloading bug during recover
Merge branch fw/fix-recover-dataloading of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/29

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix the dataloading bug during recover
2025-03-12 16:00:49 +08:00
博惟 fb23009e99 PullRequest: 27 support bf16 training
Merge branch fw/bf16 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/27

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* support bf16 training
2025-03-12 13:04:30 +08:00
博惟 9a07ab9460 PullRequest: 28 Fix a dataloading bug
Merge branch fw/patch20250312 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/28

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix dataloading bug
2025-03-12 13:04:15 +08:00
君末 cdf13ff852 PullRequest: 26 Set functioncall disabled by default for opensource
Merge branch fix/math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/26

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
2025-03-11 19:39:31 +08:00
meijun.mei 6fd0db01c7 add functioncall switch 2025-03-11 18:41:46 +08:00
温差 480f0ef502 PullRequest: 25 fix the save version in rw interface
Merge branch rw_save_version of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/25

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* fix the save version in rw interface
* format
2025-03-11 16:30:03 +08:00
晓雷 ffe1cd71ea PullRequest: 23 Fix a problem about singularity environments on SLURM
Merge branch mzy/fix-slurm of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/23

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* fix slurm + singularity environment
2025-03-10 20:31:57 +08:00
博惟 32b92a08f9 PullRequest: 21 fix timeutil consistency during recover
Merge branch fw/fix-timeutil-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/21

Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* fix timeutil consistency during recover
2025-03-10 16:27:55 +08:00
郭唯 e968ac47eb PullRequest: 19 Fix training time of 7B in README
Merge branch fix-tutorials of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/19

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
2025-03-08 18:35:47 +08:00