Commit Graph

257 Commits

Author SHA1 Message Date
meijun.mei 92e0777178 optimize code/math functioncall param 2025-03-17 15:45:52 +08:00
博惟 767bb7bf47 PullRequest: 38 Support SGLang generation based on the 24.07 docker image
Merge branch fw/sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/38?tab=comment#note_168700036

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
* .
* .
* cleanup
2025-03-17 15:45:00 +08:00
博惟 7b3e33430a PullRequest: 37 Support Megatron v0.9.0
Merge branch fw/megatron-v0.9.0 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/37

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* format
2025-03-17 14:49:26 +08:00
博惟 69681d9fe4 PullRequest: 35 Support log probability recomputation in PPO.
Merge branch fw/recompute-logprob of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/35

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
2025-03-17 14:38:53 +08:00
晓雷 c074d6142b PullRequest: 36 Fix SLURM singularity environments on PPU cluster
Merge branch mzy/fix-ppu-no-home of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/36

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* .
2025-03-17 12:33:03 +08:00
博惟 17a4ada06e PullRequest: 33 Force the partition to be balanced when the capacity is a large number
Merge branch fw/balanced-pack of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/33

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* .
2025-03-17 10:47:29 +08:00
博惟 0b15ead835 PullRequest: 34 Fix the fp16 training issue.
Merge branch fw/fix-fp16 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/34

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* fix bf16 training issue
2025-03-17 10:46:57 +08:00
穰侯 548aaef642 PullRequest: 22 Avoid OOM during log_softmax
Merge branch ranghou-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/22?tab=comment#note_168300024

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-14 11:21:42 +08:00
穰侯 d70586f85e memory saving for log_softmax and gather 2025-03-14 10:18:28 +08:00
kira.gw 07e724741b Support profiling on ray 2025-03-13 11:40:37 +08:00
博惟 234e3dd3a0 PullRequest: 31 Fix typo in topology name
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/31

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
* fix typo
2025-03-12 18:31:18 +08:00
博惟 56e4dd1f5d PullRequest: 30 Change the topology order of vLLM for better locality of pipeline parallelism
Merge branch fw/topo of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/30?tab=comment

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change non-training topo order
* .
* fix the dataloading bug during recover
* fix typo
2025-03-12 17:26:50 +08:00
晓雷 3d8be914af PullRequest: 20 Fix bugs in auto evaluation
Merge branch mzy/refactor-eval of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/20?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* test
* move evaluator to main process
* .
* clear codes
* add docstring
* .
* separate wandb groups
* .
* handle eval error
* add check for failed eval
* refactor evaluator
* refactor evaluator, fix master worker wandb login
* .
2025-03-12 16:10:12 +08:00
博惟 26a48be73e PullRequest: 29 fix the dataloading bug during recover
Merge branch fw/fix-recover-dataloading of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/29

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix the dataloading bug during recover
2025-03-12 16:00:49 +08:00
博惟 fb23009e99 PullRequest: 27 support bf16 training
Merge branch fw/bf16 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/27

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* support bf16 training
2025-03-12 13:04:30 +08:00
博惟 9a07ab9460 PullRequest: 28 Fix a dataloading bug
Merge branch fw/patch20250312 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/28

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix dataloading bug
2025-03-12 13:04:15 +08:00
君末 cdf13ff852 PullRequest: 26 Set functioncall disabled by default for opensource
Merge branch fix/math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/26

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
2025-03-11 19:39:31 +08:00
meijun.mei 6fd0db01c7 add functioncall switch 2025-03-11 18:41:46 +08:00
温差 480f0ef502 PullRequest: 25 fix the save version in rw interface
Merge branch rw_save_version of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/25

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* fix the save version in rw interface
* format
2025-03-11 16:30:03 +08:00
晓雷 ffe1cd71ea PullRequest: 23 Fix a problem about singularity environments on SLURM
Merge branch mzy/fix-slurm of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/23

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* fix slurm + singularity environment
2025-03-10 20:31:57 +08:00
博惟 32b92a08f9 PullRequest: 21 fix timeutil consistency during recover
Merge branch fw/fix-timeutil-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/21

Signed-off-by: 步偶 <sam.gjx@antgroup.com>


* fix timeutil consistency during recover
2025-03-10 16:27:55 +08:00
郭唯 e968ac47eb PullRequest: 19 Fix training time of 7B in README
Merge branch fix-tutorials of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/19

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
2025-03-08 18:35:47 +08:00
博惟 ca42e43638 PullRequest: 9 Refactoring data transfer for v2 workers.
Merge branch fw/datatransfer-v2 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/9

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* fw/fix-dataloading-not-shuffle
* .
* .
* .
* .
* .
* add v2 master worker
* cpu test pass
* ppo run
* .
* pass sft test
* pass ppo dp test
* format
* fix
* run
* .
* cleanup
* .
* format
* run
* merge and format
* refactor
* sft pass
* .
* format
* format
* format
* .
* .
2025-03-08 18:26:24 +08:00
君末 b3bedd7b9d PullRequest: 17 Support functioncall for math and code verify
Merge branch functioncall-code of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/17

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* test code evaluation with faas
* support functioncall for code
* fix code crash bug
* format
* .
2025-03-07 11:26:57 +08:00
博惟 c1bb4770ff PullRequest: 6 Implement master worker v2 for refactoring and uvloop support
Merge branch fw/uvloop of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/6

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* fw/fix-dataloading-not-shuffle
* .
* .
* .
* .
* .
* add v2 master worker
* cpu test pass
* ppo run
* .
* format
* fix
* merge and format
* change default env vars to v1 worker
2025-03-07 11:07:54 +08:00
kira.gw d9816c03b0 Fix training time of 7B 2025-03-07 10:49:34 +08:00
穰侯 ad5baa74ee PullRequest: 18 Support TensorBoard logging
Merge branch ranghou-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/18

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* support specifying number of gpus and mems for actors
* PR fix
* support tensorboard
* bug fix
* add doc
2025-03-06 15:18:12 +08:00
博惟 bc6567dd39 PullRequest: 16 Fix the semantics of `avg_seq_len` in tutorials.
Merge branch fw/fix-avgseqlen of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/16

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
2025-03-05 18:57:08 +08:00
晓雷 e9bf229581 PullRequest: 11 Support automatically launching evaluation jobs during training
Merge branch mzy/auto-eval of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/11

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* test
* move evaluator to main process
* .
* clear codes
* add docstring
* .
* separate wandb groups
* .
2025-03-05 18:06:40 +08:00
穰侯 deccd47a22 PullRequest: 15 Support CLI configuration of master/model worker CPU and memory allocation
Merge branch ranghou-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/15?tab=diff

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* support specifying number of gpus and mems for actors
* PR fix
2025-03-05 11:04:18 +08:00
温差 054d323979 PullRequest: 13 update 7B-zero figures
Merge branch 0304-7b-zero-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/13

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* update 7B zero figures
* update readme
* update amc
* update readme
* Delete Data Comparison
* typo
* update figures
* fix typo
2025-03-04 19:11:52 +08:00
博惟 f04a8ee598 PullRequest: 14 format and update requirements.txt
Merge branch fw/update-requirements of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/14

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format and update requirements.txt
* cleanup
2025-03-04 19:10:46 +08:00
郭唯 14d402f117 PullRequest: 12 Update tutorials
Merge branch update-tutorial of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/12?tab=comment

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* support setup env and start train
* update tutorial
* update training script
2025-03-04 15:41:43 +08:00
博惟 e1063a0fe6 PullRequest: 10 Fix the dataset indexing bug after filtering during recover.
Merge branch fw/fix-recover-filter of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/10

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* cleanup
* cleanup
2025-03-03 17:51:29 +08:00
温差 ecf5fec87f PullRequest: 8 debug: math verifier timeout
Merge branch math_verifier_timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/8

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* debug: math verifier timeout
* debug: math verifier timeout
* timeout in math_verify_utils
* format
2025-03-03 10:20:06 +08:00
博惟 46c5a10eb9 PullRequest: 5 Modify the micro-batch splitting logic
Merge branch fw/balanced-datapck of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/5

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* fw/fix-dataloading-not-shuffle
* .
* .
* .
* .
* .
2025-03-03 09:10:04 +08:00
博惟 ceee49454a PullRequest: 7 Support the key-value allocation when using decoupled vLLM generation.
Merge branch fw/vllm-key-value-alloc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/7?tab=commit

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* remove print
2025-03-02 13:13:47 +08:00
博惟 e7c4a49adc PullRequest: 4 Fix the dataloader shuffle and random seed issue.
Merge branch fw/fix-dataloading-not-shuffle of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/4

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fw/fix-dataloading-not-shuffle
* .
* .
* .
2025-02-28 14:56:47 +08:00
怀颉 d04f17031a PullRequest: 3 fix: dataloader shuffle
Merge branch whj/fix-dataloader of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/3

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
2025-02-28 10:48:37 +08:00
wanghuaijie.whj 22f357b31c fix: dataloader shuffle
different seeds for every epoch & every rank
2025-02-28 10:16:44 +08:00
博惟 4b11997721 PullRequest: 2 Add CLI options such that users can control loss scale window and initial loss scale.
Merge branch fw/loss-scale of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/2

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
2025-02-27 16:48:12 +08:00
晓雷 db3e3ded9c PullRequest: 1 Normalize loss scale by tokens across micro batches
Merge branch fix-loss-scale of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/1

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-02-27 16:47:04 +08:00
meizhiyu.mzy b7815bb2c3 . 2025-02-27 16:38:42 +08:00
meizhiyu.mzy e2e1b231d8 . 2025-02-27 16:06:19 +08:00
meizhiyu.mzy 40b5d12490 . 2025-02-27 15:51:17 +08:00
bowei.fw e30790db11 fix 2025-02-27 15:47:27 +08:00
bowei.fw 3c5f603f59 update doc 2025-02-27 15:44:39 +08:00
meizhiyu.mzy 35f50e3f53 . 2025-02-27 15:40:21 +08:00
meizhiyu.mzy 2436ce519e normalize loss scale by tokens 2025-02-27 11:36:26 +08:00
Wei Guo 241185227d
Merge pull request #2 from kdada/main
Fix evaluation script and update tutorials
2025-02-25 21:23:54 +08:00