Commit Graph

257 Commits

Author SHA1 Message Date
Wei Fu ce4d7354bf
[Doc] Fix documentation for using Docker containers and customized agents (#64)
* test env setup

* .

* fix a missing cherry-pick

* .

* .

* .

* update docker instruction

* fix
2025-06-01 16:33:29 +08:00
Wei Fu afe5a2c880
[Fix] Fix tutorial async_ppo script and doc structure (#63)
* test env setup

* .

* fix a missing cherry-pick
2025-06-01 15:46:46 +08:00
Wei Fu 4fab3ac769
[Doc & Fix] Simplify the environment setup procedure (#62)
* PullRequest: 176 [FIX] clear sensitive info

Merge branch fw/fix-sensitive-info of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/176

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .

* .

* .

* test env setup

* fix

* allow cached model

* .

* revise docs

* change docs

* format docs

* update readme
2025-06-01 14:57:21 +08:00
Wei Fu d87c898d36
[Feature] Add link to documentation in README (#61) 2025-05-30 19:58:58 +08:00
Wei Fu 473eeb2db0
[Feature] Create docs and examples for multi-turn agent RL (#60)
* PullRequest: 168 Add Codeforces tests and fix other test issues

Merge branch areal-eval-0.3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/168

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix eval and add codeforces elo calc
* fix codeforce test
* fix qwen3 prompt
* change annotations to eng
* add code verify files

* PullRequest: 173 [FIX] format code and fix a recover error in rollout worker

Merge branch fw/fix-rollout-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/173

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* format code and fix a recover error in rollout worker

* PullRequest: 171 Update evaluation documentation

Merge branch eval-doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/171?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* update eval doc
* complete eval doc
* complete eval doc
* fix ood info
* add data obtaining guide
* fix supported datasets

* PullRequest: 174 decouple max_behav_imp_weight and c_clip & track entropy, positve_seq_len and negative_seq_len

Merge branch xss/max_behav_imp_weight of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/174

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* decouple max_behav_imp_weight and c_clip
* rename log: positve_* -> correct_*, negative_* -> incorrect_*
* rename hyper-parameter: max_behav_imp_weight -> behav_imp_weight_cap

* PullRequest: 175 [Fix] Fix the "event loop is already running" error in ray scripts

Merge branch fw/fix-ray-asyncio of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/175

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format code and fix a recover error in rollout worker
* .

* add docs

* .

---------

Co-authored-by: 乘鹭 <hechuyi.hcy@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
2025-05-30 16:26:07 +08:00
Wei Fu c0200f10d0
[Feature] Support behavior importance weight capping and update evaluation scripts (#59)
* PullRequest: 168 Add Codeforces tests and fix other test issues

Merge branch areal-eval-0.3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/168

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix eval and add codeforces elo calc
* fix codeforce test
* fix qwen3 prompt
* change annotations to eng
* add code verify files

* PullRequest: 173 [FIX] format code and fix a recover error in rollout worker

Merge branch fw/fix-rollout-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/173

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* format code and fix a recover error in rollout worker

* PullRequest: 171 Update evaluation documentation

Merge branch eval-doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/171?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* update eval doc
* complete eval doc
* complete eval doc
* fix ood info
* add data obtaining guide
* fix supported datasets

* PullRequest: 174 decouple max_behav_imp_weight and c_clip & track entropy, positve_seq_len and negative_seq_len

Merge branch xss/max_behav_imp_weight of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/174

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* decouple max_behav_imp_weight and c_clip
* rename log: positve_* -> correct_*, negative_* -> incorrect_*
* rename hyper-parameter: max_behav_imp_weight -> behav_imp_weight_cap

* PullRequest: 175 [Fix] Fix the "event loop is already running" error in ray scripts

Merge branch fw/fix-ray-asyncio of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/175

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format code and fix a recover error in rollout worker
* .

---------

Co-authored-by: 乘鹭 <hechuyi.hcy@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
2025-05-30 10:29:21 +08:00
Wei Fu 0815bb6494
[CI] Try to fix doc CI (#58)
* .

* .

* .

* .

* .

* .

* .

* .

* .

* .

* .
2025-05-29 12:45:10 +08:00
Wei Fu b3375607c6
[CI] Fix doc CI again (#57)
* .

* .

* .

* .

* .

* .
2025-05-29 11:22:27 +08:00
Wei Fu 409fa9842d
[CI] Fix doc CI (#56)
* .

* .

* .
2025-05-29 11:09:01 +08:00
Wei Fu b240a9cf80
. (#55) 2025-05-28 20:29:32 +08:00
Wei Fu d13b517cbf
add doc ci (#54) 2025-05-28 20:01:10 +08:00
Wei Fu 7826fdbb87
[Feature] Amend yaml configurations for Ray experiments (#53)
* feat: one buffer for each task

* feat: support "one buffer for each task" for async

* make kv_cache_dtype configurable

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>

* style: use plural form

fix: use _seed_from_key to set different seeds for data loaders
fix: call load_data for one buffer each time

* PullRequest: 125 Support running async experiments in the 2407 image.

Merge branch fw/async2407 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/125

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* fix: handle multiple datasets in recover indices
fix: `isinstance(self.__datasets, PullerStreamDataset)`
feat: use the "spec" request to obtain the number of datasets
fix: revert rollout worker

* fix: revert async_rl_exp.py

* fix flag for list (cuda_graph_bs)

* format

* [FIX] fix async task reward [sglang bf16-> fp16]

* fix: define `self.__datasets` in advance

* PullRequest: 130 [Refactor] Remove deprecated search related code

Merge branch mzy/remove-search of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/130

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* remove search related

* PullRequest: 131 [Refactor] Change terminology "model parallel" into "tensor parallel" to align with megatron.

Merge branch mzy/mp-to-tp of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/131?tab=comment

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* change mp to tp
* .
* .

* PullRequest: 142 Fix an error for megatron backend destroy

Merge branch fw/fix-meagatron-destroy of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/142

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 143 Fix the port conflict issue of generation servers

Merge branch fw/fix-gen-port of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/143?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* somehow fix the port issue
* add clearance period
* .
* .

* PullRequest: 145 Add code environment

Merge branch fw/code-env of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/145?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add code env
* somehow fix the port issue
* fix

* PullRequest: 144 Add decoupled PPO loss

Merge branch fw/decoupled-ppo-loss of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/144?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* fix typo

* PullRequest: 146 Merge SLURM logs and save experiment configs in yaml format.

Merge branch fw/better-logging of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/146

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* merge all slurm logs into one
* write config to yaml

* PullRequest: 141 Merge changes during NeurIPS submission

Merge branch fw/async-dev of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/141

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* .
* .
* .
* .
* update script
* .
* .
* .
* .
* [ADD] add least req scheduling
* fix test genreq
* .
* .
* fix stats tracker nan
* .
* .
* .
* .
* .
* .
* .
* upper clip decoupled objective
* add throughput exp script
* .
* remove behav upper clip param
* .
* .
* .
* plot curve
* update thpt script
* .
* master worker raise error when exiting
* update script
* add gen throughput logging
* .
* .
* add decoupled wandb data
* .
* fix port issue and add no training option
* .
* enlarge ttl
* remove gserver manager await staled
* update weights in groups
* .
* .
* .
* add port clearance period
* .
* .
* .
* add plot script
* add sft throughput eval
* .
* log tokens in null interface
* ablation experiments and interruptible generation
* plotting scripts / run scripts / data results
* .
* remove scripts
* add port test
* remove force_sync_reward
* revert some changes
* .
* revert
* revert fix
* fix
* revert
* fix typo

* support qwen3 training

* PullRequest: 147 Support interruption in SGLang and fix a KeyError in gather-scatter communication

Merge branch fw/sglang046-with-abort-request of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/147?tab=diff

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* initial commit
* add interrupt request
* fix data transfer issue
* max concurrent rollouts defaults to train batch size
* merge main
* add patch
* fix patch typo
* revert sglang
* fix typo
* fix minor typo
* .
* pip show editable sglang path

* PullRequest: 149 fix: code faas max_retries

Merge branch xss/fix_code_verifier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/149

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix: code faas max_retries

* PullRequest: 150 [Bug Fix] Fix key errors in `_run_scatter` in data transfer

Merge branch mzy/fix-scatter-groups of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/150

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix scatter groups key error

* fix test

* .

* PullRequest: 151 Fix Qwen3 import error when using transformers with a lower version

Merge branch fw/fix-qwen3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/151

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* merge all slurm logs into one
* write config to yaml
* .

* PullRequest: 152 Support sglang0.4.6 and fix master_worker import error

Merge branch adopt_sglang046 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/152

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* Support sglang0.4.6 and fix master_worker import error
* remove disable_mla option

* PullRequest: 155 [FIX] reduce port conflicts

Merge branch sxj/reduce_port_conflict of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/155

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* [FIX] reduce port conflicts

* PullRequest: 153 Fix stuck and recover issues for async experiments

Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/153

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout

* PullRequest: 158 [Fix] Fix the error where "accepted" is not defined

Merge branch fw/fix-rollout-accepted of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/158

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* .

* PullRequest: 154 Fix unit tests and simplify package installation

Merge branch fw/v0.3.0-tests of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/154?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix some tests
* fix tests except for experiments
* fix tests
* fix tests
* .
* .

* PullRequest: 159 [fix] Enlarge the default aiohttp connection timeout and fix a recover error in model worker

Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/159

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout
* .

* PullRequest: 160 set sock_connect as rollout_request_timeout in partial_rollout.py

Merge branch xss/rollout_timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/160

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* set sock_connect as rollout_request_timeout in partial_rollout.py

* PullRequest: 161 Prioritize rollouts that are submitted earlier rather than arrived earlier

Merge branch fw/birth-time of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/161

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* blocking push

* PullRequest: 163 [bugfix] Fix synchronized training when birth time is absent

Merge branch fw/fix-sync-birthtime of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/163

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 164 [Refactor] Move cluster spec into CLI args

Merge branch fw/refactor-cluster-spec of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/164?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* set cluster spec path in args
* .
* fix
* add default cluster spec

* PullRequest: 165 Normally exit all workers after experiment completion

Merge branch fw/exit-all-workers of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/165

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .

* PullRequest: 167 [Feature] Use chunked logits computation to alleviate SGLang OOM

Merge branch fw/patch-sglang-oom of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/167

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 166 [Feature] Support single-script experiment launch with Ray

Merge branch fw/turbolaunch of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/166?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add training script without ray name resolve
* add ray name resolve
* ray worker
* run
* run async
* local run
* set cluster spec path in args
* .
* .
* fix
* .
* .
* .
* .
* .
* update config
* .
* minor renaming

* PullRequest: 169 [Doc] Add v0.3.0 docs based on jupyter-book

Merge branch fw/doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/169

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add docs
* refine doc
* refine doc

* PullRequest: 170 [Feature] Amend configs for ray scripts

Merge branch fw/ray-configs of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/170

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

---------

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: kira.gw <kira.gw@antgroup.com>
Co-authored-by: shenxujie.sxj <shenxujie.sxj@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: sam.gjx <sam.gjx@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Co-authored-by: 履渊 <yuhong.gyh@antgroup.com>
2025-05-28 19:32:42 +08:00
Wei Fu cf46993a30
[Feature & Doc & Bug Fix] Add docs, simplified ray-based scripts, and fix issues to stabilize asynchronous experiments (#52)
* feat: one buffer for each task

* feat: support "one buffer for each task" for async

* make kv_cache_dtype configurable

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>

* style: use plural form

fix: use _seed_from_key to set different seeds for data loaders
fix: call load_data for one buffer each time

* PullRequest: 125 Support running async experiments in the 2407 image.

Merge branch fw/async2407 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/125

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* fix: handle multiple datasets in recover indices
fix: `isinstance(self.__datasets, PullerStreamDataset)`
feat: use the "spec" request to obtain the number of datasets
fix: revert rollout worker

* fix: revert async_rl_exp.py

* fix flag for list (cuda_graph_bs)

* format

* [FIX] fix async task reward [sglang bf16-> fp16]

* fix: define `self.__datasets` in advance

* PullRequest: 130 [Refactor] Remove deprecated search related code

Merge branch mzy/remove-search of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/130

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* remove search related

* PullRequest: 131 [Refactor] Change terminology "model parallel" into "tensor parallel" to align with megatron.

Merge branch mzy/mp-to-tp of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/131?tab=comment

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* change mp to tp
* .
* .

* PullRequest: 142 Fix an error for megatron backend destroy

Merge branch fw/fix-meagatron-destroy of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/142

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 143 Fix the port conflict issue of generation servers

Merge branch fw/fix-gen-port of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/143?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* somehow fix the port issue
* add clearance period
* .
* .

* PullRequest: 145 Add code environment

Merge branch fw/code-env of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/145?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add code env
* somehow fix the port issue
* fix

* PullRequest: 144 Add decoupled PPO loss

Merge branch fw/decoupled-ppo-loss of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/144?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* fix typo

* PullRequest: 146 Merge SLURM logs and save experiment configs in yaml format.

Merge branch fw/better-logging of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/146

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* merge all slurm logs into one
* write config to yaml

* PullRequest: 141 Merge changes during NeurIPS submission

Merge branch fw/async-dev of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/141

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* .
* .
* .
* .
* update script
* .
* .
* .
* .
* [ADD] add least req scheduling
* fix test genreq
* .
* .
* fix stats tracker nan
* .
* .
* .
* .
* .
* .
* .
* upper clip decoupled objective
* add throughput exp script
* .
* remove behav upper clip param
* .
* .
* .
* plot curve
* update thpt script
* .
* master worker raise error when exiting
* update script
* add gen throughput logging
* .
* .
* add decoupled wandb data
* .
* fix port issue and add no training option
* .
* enlarge ttl
* remove gserver manager await staled
* update weights in groups
* .
* .
* .
* add port clearance period
* .
* .
* .
* add plot script
* add sft throughput eval
* .
* log tokens in null interface
* ablation experiments and interruptible generation
* plotting scripts / run scripts / data results
* .
* remove scripts
* add port test
* remove force_sync_reward
* revert some changes
* .
* revert
* revert fix
* fix
* revert
* fix typo

* support qwen3 training

* PullRequest: 147 Support interruption in SGLang and fix a KeyError in gather-scatter communication

Merge branch fw/sglang046-with-abort-request of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/147?tab=diff

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* initial commit
* add interrupt request
* fix data transfer issue
* max concurrent rollouts defaults to train batch size
* merge main
* add patch
* fix patch typo
* revert sglang
* fix typo
* fix minor typo
* .
* pip show editable sglang path

* PullRequest: 149 fix: code faas max_retries

Merge branch xss/fix_code_verifier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/149

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix: code faas max_retries

* PullRequest: 150 [Bug Fix] Fix key errors in `_run_scatter` in data transfer

Merge branch mzy/fix-scatter-groups of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/150

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix scatter groups key error

* fix test

* .

* PullRequest: 151 Fix Qwen3 import error when using transformers with a lower version

Merge branch fw/fix-qwen3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/151

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* merge all slurm logs into one
* write config to yaml
* .

* PullRequest: 152 Support sglang0.4.6 and fix master_worker import error

Merge branch adopt_sglang046 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/152

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* Support sglang0.4.6 and fix master_worker import error
* remove disable_mla option

* PullRequest: 155 [FIX] reduce port conflicts

Merge branch sxj/reduce_port_conflict of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/155

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* [FIX] reduce port conflicts

* PullRequest: 153 Fix stuck and recover issues for async experiments

Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/153

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout

* PullRequest: 158 [Fix] Fix the error where "accepted" is not defined

Merge branch fw/fix-rollout-accepted of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/158

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* .

* PullRequest: 154 Fix unit tests and simplify package installation

Merge branch fw/v0.3.0-tests of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/154?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix some tests
* fix tests except for experiments
* fix tests
* fix tests
* .
* .

* PullRequest: 159 [fix] Enlarge the default aiohttp connection timeout and fix a recover error in model worker

Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/159

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout
* .

* PullRequest: 160 set sock_connect as rollout_request_timeout in partial_rollout.py

Merge branch xss/rollout_timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/160

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* set sock_connect as rollout_request_timeout in partial_rollout.py

* PullRequest: 161 Prioritize rollouts that are submitted earlier rather than arrived earlier

Merge branch fw/birth-time of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/161

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* blocking push

* PullRequest: 163 [bugfix] Fix synchronized training when birth time is absent

Merge branch fw/fix-sync-birthtime of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/163

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 164 [Refactor] Move cluster spec into CLI args

Merge branch fw/refactor-cluster-spec of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/164?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* set cluster spec path in args
* .
* fix
* add default cluster spec

* PullRequest: 165 Normally exit all workers after experiment completion

Merge branch fw/exit-all-workers of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/165

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .

* PullRequest: 167 [Feature] Use chunked logits computation to alleviate SGLang OOM

Merge branch fw/patch-sglang-oom of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/167

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 166 [Feature] Support single-script experiment launch with Ray

Merge branch fw/turbolaunch of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/166?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add training script without ray name resolve
* add ray name resolve
* ray worker
* run
* run async
* local run
* set cluster spec path in args
* .
* .
* fix
* .
* .
* .
* .
* .
* update config
* .
* minor renaming

* PullRequest: 169 [Doc] Add v0.3.0 docs based on jupyter-book

Merge branch fw/doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/169

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add docs
* refine doc
* refine doc

---------

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: kira.gw <kira.gw@antgroup.com>
Co-authored-by: shenxujie.sxj <shenxujie.sxj@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: sam.gjx <sam.gjx@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Co-authored-by: 履渊 <yuhong.gyh@antgroup.com>
2025-05-28 19:18:05 +08:00
Wei Fu 89cc3a7400
Update issue templates (#48) 2025-05-26 14:38:17 +08:00
Wei Fu c60d128b14
Support asynchronous RL training, Qwen3, and the latest SGLang (#47)
* feat: one buffer for each task

* feat: support "one buffer for each task" for async

* make kv_cache_dtype configurable

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>

* style: use plural form

fix: use _seed_from_key to set different seeds for data loaders
fix: call load_data for one buffer each time

* PullRequest: 125 Support running async experiments in the 2407 image.

Merge branch fw/async2407 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/125

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* fix: handle multiple datasets in recover indices
fix: `isinstance(self.__datasets, PullerStreamDataset)`
feat: use the "spec" request to obtain the number of datasets
fix: revert rollout worker

* fix: revert async_rl_exp.py

* fix flag for list (cuda_graph_bs)

* format

* [FIX] fix async task reward [sglang bf16-> fp16]

* fix: define `self.__datasets` in advance

* PullRequest: 130 [Refactor] Remove deprecated search related code

Merge branch mzy/remove-search of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/130

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* remove search related

* PullRequest: 131 [Refactor] Change terminology "model parallel" into "tensor parallel" to align with megatron.

Merge branch mzy/mp-to-tp of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/131?tab=comment

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* change mp to tp
* .
* .

* PullRequest: 142 Fix an error for megatron backend destroy

Merge branch fw/fix-meagatron-destroy of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/142

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 143 Fix the port conflict issue of generation servers

Merge branch fw/fix-gen-port of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/143?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* somehow fix the port issue
* add clearance period
* .
* .

* PullRequest: 145 Add code environment

Merge branch fw/code-env of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/145?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* add code env
* somehow fix the port issue
* fix

* PullRequest: 144 Add decoupled PPO loss

Merge branch fw/decoupled-ppo-loss of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/144?tab=comment

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* fix typo

* PullRequest: 146 Merge SLURM logs and save experiment configs in yaml format.

Merge branch fw/better-logging of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/146

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* merge all slurm logs into one
* write config to yaml

* PullRequest: 141 Merge changes during NeurIPS submission

Merge branch fw/async-dev of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/141

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .
* .
* .
* .
* .
* .
* .
* update script
* .
* .
* .
* .
* [ADD] add least req scheduling
* fix test genreq
* .
* .
* fix stats tracker nan
* .
* .
* .
* .
* .
* .
* .
* upper clip decoupled objective
* add throughput exp script
* .
* remove behav upper clip param
* .
* .
* .
* plot curve
* update thpt script
* .
* master worker raise error when exiting
* update script
* add gen throughput logging
* .
* .
* add decoupled wandb data
* .
* fix port issue and add no training option
* .
* enlarge ttl
* remove gserver manager await staled
* update weights in groups
* .
* .
* .
* add port clearance period
* .
* .
* .
* add plot script
* add sft throughput eval
* .
* log tokens in null interface
* ablation experiments and interruptible generation
* plotting scripts / run scripts / data results
* .
* remove scripts
* add port test
* remove force_sync_reward
* revert some changes
* .
* revert
* revert fix
* fix
* revert
* fix typo

* support qwen3 training

* PullRequest: 147 Support interruption in SGLang and fix a KeyError in gather-scatter communication

Merge branch fw/sglang046-with-abort-request of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/147?tab=diff

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* initial commit
* add interrupt request
* fix data transfer issue
* max concurrent rollouts defaults to train batch size
* merge main
* add patch
* fix patch typo
* revert sglang
* fix typo
* fix minor typo
* .
* pip show editable sglang path

* PullRequest: 149 fix: code faas max_retries

Merge branch xss/fix_code_verifier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/149

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix: code faas max_retries

* PullRequest: 150 [Bug Fix] Fix key errors in `_run_scatter` in data transfer

Merge branch mzy/fix-scatter-groups of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/150

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix scatter groups key error

* fix test

* .

* PullRequest: 151 Fix Qwen3 import error when using transformers with a lower version

Merge branch fw/fix-qwen3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/151

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* merge all slurm logs into one
* write config to yaml
* .

* PullRequest: 152 Support sglang0.4.6 and fix master_worker import error

Merge branch adopt_sglang046 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/152

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* Support sglang0.4.6 and fix master_worker import error
* remove disable_mla option

---------

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: kira.gw <kira.gw@antgroup.com>
Co-authored-by: shenxujie.sxj <shenxujie.sxj@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: sam.gjx <sam.gjx@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Co-authored-by: 履渊 <yuhong.gyh@antgroup.com>
2025-05-26 09:45:13 +08:00
Wei Fu f7ab31d050
Delete .github/PULL_REQUEST_TEMPLATE directory (#46) 2025-05-25 17:06:37 +08:00
Wei Fu 5236f382b1
Create bugfix.md (#45) 2025-05-25 16:59:45 +08:00
Wei Fu 919b2872a2
Add 2025-05-25 16:37:06 +08:00
Wei Fu 070a1216ce
Update and rename feature-request-issue-template.md to feature.md (#44) 2025-05-25 16:23:49 +08:00
Wei Fu f1a9251fb6
Update and rename bug-issue-template.md to bug.md (#43) 2025-05-25 16:23:07 +08:00
Wei Fu 21e14f63f5
Add feature request template (#42) 2025-05-25 16:19:40 +08:00
Wei Fu 03c160d21a
Update issue templates (#41) 2025-05-25 16:18:56 +08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 22eb63ef6e
[fix] Support local mode without setting CUDA_VISIBLE_DEVICES (#38)
If `CUDA_VISIBLE_DEVICES` is not set, it means that all the
devices will be available for use. If it's an empty string,
it means no device is available for the process to use.

As a result, if `CUDA_VISIBLE_DEVICES` does not exist, the
default value should be equivalent to all the devices instead
of an empty one.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-05-16 18:23:28 +08:00
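The `CUDA_VISIBLE_DEVICES` semantics described in the commit above (#38) can be illustrated with a minimal sketch. This is not the code from that commit: the helper name `visible_devices` and its parameters are hypothetical, and the parsing assumes the variable holds only comma-separated integer indices (real values may also be GPU UUIDs).

```python
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str], total_devices: int) -> List[int]:
    """Hypothetical helper: device indices a process may use.

    Assumes CUDA_VISIBLE_DEVICES, when present, contains only
    comma-separated integer indices (no UUIDs).
    """
    raw: Optional[str] = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable not set: all devices are available.
        return list(range(total_devices))
    if raw.strip() == "":
        # Variable set to an empty string: no device is available.
        return []
    # Variable set to an explicit list such as "1,3".
    return [int(tok) for tok in raw.split(",") if tok.strip()]
```

The key design point is distinguishing "missing" from "empty": a lookup such as `env.get("CUDA_VISIBLE_DEVICES", "")` would collapse the two cases and wrongly treat an unset variable as "no devices". In a real process one might call the helper with `os.environ` and the device count reported by the framework in use.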
Wei Fu e9eff73d90
Update README.md (#35) 2025-04-27 22:28:54 +08:00
nuzant ffc52a1520
Merge updates from ant repository. (#34)
* Cherry-pick commit 90dfd575 "PullRequest: 84 [ADD..." to the current branch

* Cherry-pick commit 15e787b7 "PullRequest: 44 eval..." to the current branch

* Cherry-pick commit f255ef60 "PullRequest: 85 add ..." to the current branch

* Cherry-pick commit c2b4006a "PullRequest: 86 Supp..." to the current branch

* Cherry-pick commit fa6c0f3d "PullRequest: 87 upda..." to the current branch

* Cherry-pick commit a9ff4af0 "PullRequest: 88 Bump..." to the current branch

* Cherry-pick commit 763839aa "PullRequest: 89 Add ..." to the current branch

* Cherry-pick commit 21e8064a "PullRequest: 90 Merg..." to the current branch

* Cherry-pick commit 94e97670 "PullRequest: 92 Supp..." to the current branch

* Cherry-pick commit 92710522 "PullRequest: 91 Supp..." to the current branch

* Cherry-pick commit 95aa3f28 "PullRequest: 93 Supp..." to the current branch

* Cherry-pick commit 62191f8f "PullRequest: 94 Add ..." to the current branch

* Cherry-pick commit baa0249a "PullRequest: 95 Form..." to the current branch

* Cherry-pick commit e32945f2 "PullRequest: 96 Chan..." to the current branch

* Cherry-pick commit b59286e3 "PullRequest: 98 fix ..." to the current branch

* Cherry-pick commit ca2ba43e "PullRequest: 97 Move..." to the current branch

* Cherry-pick commit f941700b "PullRequest: 99 Refa..." to the current branch

* Cherry-pick commit 95439e70 "PullRequest: 100 Add..." to the current branch

* Cherry-pick commit f3ebd941 "PullRequest: 101 Add..." to the current branch

* Cherry-pick commit ee4779ea "PullRequest: 103 [Fe..." to the current branch

* Cherry-pick commit ce5e24ec "PullRequest: 104 [Fi..." to the current branch

* Cherry-pick commit b385761f "PullRequest: 105 [Bu..." to the current branch

* Cherry-pick commit 4c21fbb5 "PullRequest: 106 [Bu..." to the current branch

* Cherry-pick commit 7f3f14e0 "PullRequest: 108 [Fi..." to the current branch

* Cherry-pick commit 8de62701 "PullRequest: 107 [Fe..." to the current branch

* Cherry-pick commit ea864b21 "PullRequest: 24 [Fea..." to the current branch

* Cherry-pick commit 4a658db3 "PullRequest: 109 [Bu..." to the current branch

* Cherry-pick commit aaa12bf1 "PullRequest: 110 [Bu..." to the current branch

* Cherry-pick commit 6adb6d9f "PullRequest: 112 [Fi..." to the current branch

* Cherry-pick commit 55556bc5 "PullRequest: 111 [Fe..." to the current branch

* Cherry-pick commit bfe5ec94 "PullRequest: 114 pri..." to the current branch

* Cherry-pick commit 44529c9b "PullRequest: 113 spl..." to the current branch

* Cherry-pick commit b1cc73df "PullRequest: 116 [FI..." to the current branch

* Cherry-pick commit eff598ce "PullRequest: 115 [Fi..." to the current branch

* Cherry-pick commit f7149475 "PullRequest: 119 [Fi..." to the current branch

* Cherry-pick commit f1017bfe "PullRequest: 121 add..." to the current branch

* Cherry-pick commit 56f6de8d "PullRequest: 120 set..." to the current branch

---------

Co-authored-by: 冰临 <shenxujie.sxj@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Co-authored-by: 郭唯 <kira.gw@antgroup.com>
Co-authored-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: 君末 <meijun.mei@antgroup.com>
2025-04-27 11:09:25 +08:00
Ximingwang-09 42afcdb53a
fix typos (#32)
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
2025-04-22 18:10:17 +08:00
Wei Fu 5da0c62145
fix: remove the file lock of NFS-based name resolve and add testcases (#27)
* .

* add nfs name resolve tests
2025-04-07 21:50:54 +08:00
nuzant 62e51c3109
Merge updates from ant repository. (#29)
* fix: `self.tasks_ids` should also be filtered

* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .

* upload 7B zero and 32B sft config

* PullRequest: 72 change the condition of using etcd

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/72

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change the condition of using etcd

* PullRequest: 60 Change the default SGLang parameters to avoid precision issues.

Merge branch fw/fix-sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/60

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change vllm config
* .
* .

* PullRequest: 73 Fix a setup issue when using ETCD

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/73

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix etcd
* .
* .

* PullRequest: 75 Fix epoch counter before model function call execution.

Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/75

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 76 Update from opensource repository.

Merge branch mzy/update-from-opensource of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/76

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* fw/v0.2.0-dockerfile
* .
* V0.2.0 prerelease (#8)
* V0.2.0 prerelease (#9)
* Update README.md
* Clean up CI (#11)
* Update README.md citation (#15)
* Xss/readme (#16)
* Merge updates from ant repository. (#18)

* update readme
update readme

update readme

* update readme

* .

* .

* PullRequest: 77 Fix epoch counter for saving models

Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/77

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .
* .
* .

* PullRequest: 78 Increase SGLang init timeout from 120 secs to 300 secs

Merge branch fw/incr-sglang-init-timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/78

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 79 Fix default timeout of ETCD name resolve entries.

Merge branch mzy/fix-etcd-default-timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/79?tab=diff

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* fix etcd default timeout
* .

* PullRequest: 81 Fix save-load backend states for Megatron v0.11

Merge branch fw/fix-megatron-v0.11-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/81

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fix megatron v0.11 recover

* PullRequest: 82 Add a push-pull stream for IPC communication between the inference client and the model worker.

Merge branch fw/ipc-stream of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/82?tab=diff

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix megatron v0.11 recover
* .

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
Co-authored-by: chucai.dzq <chucai.dzq@alibaba-inc.com>
2025-04-07 21:49:34 +08:00
Wei Fu aff05c2544
Update time estimation in README (#23)
* fix: `self.tasks_ids` should also be filtered

* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .

* upload 7B zero and 32B sft config

* clean ci

* PullRequest: 72 change the condition of using etcd

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/72

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change the condition of using etcd

* PullRequest: 60 Change the default SGLang parameters to avoid precision issues.

Merge branch fw/fix-sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/60

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change vllm config
* .
* .

* PullRequest: 73 Fix a setup issue when using ETCD

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/73

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix etcd
* .
* .

* PullRequest: 75 Fix epoch counter before model function call execution.

Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/75

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

* PullRequest: 76 Update from opensource repository.

Merge branch mzy/update-from-opensource of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/76

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* fw/v0.2.0-dockerfile
* .
* V0.2.0 prerelease (#8)
* V0.2.0 prerelease (#9)
* Update README.md
* Clean up CI (#11)
* Update README.md citation (#15)
* Xss/readme (#16)
* Merge updates from ant repository. (#18)

* update readme
update readme

update readme

* update readme

* .

* .

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: chucai.dzq <chucai.dzq@alibaba-inc.com>
2025-04-02 08:09:11 +08:00
nuzant 1c33379c93
Merge updates from ant repository. (#18)
* fix: `self.tasks_ids` should also be filtered

* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .

* upload 7B zero and 32B sft config

* PullRequest: 72 change the condition of using etcd

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/72

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change the condition of using etcd

* PullRequest: 60 Change the default SGLang parameters to avoid precision issues.

Merge branch fw/fix-sglang of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/60

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* change vllm config
* .
* .

* PullRequest: 73 Fix a setup issue when using ETCD

Merge branch fw/fix-etcd of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/73

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* fix etcd
* .
* .

* PullRequest: 75 Fix epoch counter before model function call execution.

Merge branch fw/fix-epoch-counter of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/75

Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* .

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
2025-03-31 21:05:57 +08:00
xssstory f4bd798ed9
Xss/readme (#16)
* update readme

* update readme

* update readme

* update readme

* update readme
2025-03-31 18:57:39 +08:00
Wei Fu 22838f3288
Update README.md citation (#15) 2025-03-31 18:17:35 +08:00
Wei Fu de3f66a90c
Clean up CI (#11)
* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .

* upload 7B zero and 32B sft config

* clean ci

---------

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
Co-authored-by: meijun <meijun.mei@antgroup.com>
2025-03-31 08:39:16 +08:00
Ant OSS 9c54166bf9
Merge pull request #10 from inclusionAI/jxwuyi-patch-1
Update README.md
2025-03-31 07:54:57 +08:00
Yi Wu 2f148c3ed7
Update README.md 2025-03-31 00:07:50 +08:00
Wei Fu df7d84cb0e
V0.2.0 prerelease (#9)
* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .

* update readme
2025-03-30 21:26:48 +08:00
Wei Fu 5b21c9add0
V0.2.0 prerelease (#8)
* PullRequest: 67 Update v0.2.0 Dockerfile

Merge branch fw/v0.2.0-dockerfile of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/67

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* fw/v0.2.0-dockerfile

* PullRequest: 66 Update v0.2.0 cover letter

Merge branch fw/v0.2.0-readme of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/66

Signed-off-by: 温差 <xushusheng.xss@antgroup.com>


* .
* .
* .
* .
* .
* update thpt fig
* update readme 20250329-20:16
* update
* update tutorial
* .
2025-03-30 21:25:07 +08:00
garrett4wade 9ac5b9bdc3 . 2025-03-30 20:43:51 +08:00
博惟 840c23f102 Merge branch 'fw/v0.2.0-dockerfile' of https://code.alipay.com/inclusionAI/AReaL into fw/v0.2.0-readme 2025-03-30 20:39:01 +08:00
博惟 0094d2f113 update tutorial 2025-03-30 20:36:55 +08:00
博惟 6794b2f07e update 2025-03-30 20:32:15 +08:00
博惟 bbf61d72c6 Merge branch 'main' of https://code.alipay.com/inclusionAI/AReaL into fw/v0.2.0-readme 2025-03-30 20:23:52 +08:00
温差 1b0306631b PullRequest: 70 update evaluation: aime25, gpqa
Merge branch xss/eval of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/70

Signed-off-by: 博惟 <bowei.fw@antgroup.com>


* update evaluation: aime25, gpqa
* aime25 dataset
* format code
2025-03-30 20:16:08 +08:00
博惟 b54ffe1193 update readme 20250329-20:16 2025-03-29 20:16:18 +08:00
郭唯 046a28908d PullRequest: 68 Update tutorials
Merge branch update-tutorials of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/68

Signed-off-by: 博惟 <bowei.fw@antgroup.com>
2025-03-28 20:15:45 +08:00
kira.gw b7e90fbd48 fix timeout for NCCL 2025-03-28 20:12:01 +08:00
kira.gw f86b78c1c4 fix flags for hydra 2025-03-28 20:11:51 +08:00
kira.gw 3f7907217b update tutorials 2025-03-28 20:11:49 +08:00
博惟 5997ba8b47 update thpt fig 2025-03-28 17:43:13 +08:00
博惟 5bf51a82bf fw/v0.2.0-dockerfile 2025-03-28 17:23:55 +08:00