* add a preprocessing script for code training data and update readme
* fix eval doc
---------
Co-authored-by: hcy <hechuyi.hcy@antgroup.com>
* Support using SwanLab for experiment tracking
* docs: improve WandB and SwanLab integration documentation
- Added official links for better user reference
- Used backticks to quote commands and parameters
- Unified mode settings to use "online" / "cloud" convention
- Merged WandB and SwanLab descriptions into a single concise statement
- Added note on using `swanlab.mode="local"` when server connection is unavailable
* refactor: update default value of api_key
* fix: correct help description from WandB to SwanLab in SwanLabConfig
* refactor: merge log_swanlab_tensorboard and log_wandb_tensorboard into log_swanlab_wandb_tensorboard
- Unified logging logic for SwanLab, WandB, and TensorBoard to reduce code duplication
* chore: update swanlab version in dependency config files
- Updated SwanLab version in pyproject.toml
- Updated SwanLab version in requirements.txt
* refactor: enhance SwanLab config handling for logging purposes
- Config now uses provided arguments first
- Falls back to reading from config.yaml if no input is given
* docs: add note on using `swanlab.mode="local"` when server connection is unavailable
* refactor: merge _LATEST_WANDB_STEP and _LATEST_SWANLAB_STEP into _LATEST_LOG_STEP
* Format code with black and isort
* chore: update swanlab version in dependency config files
- Updated SwanLab version in requirements.txt
* refactor: rename swanlab_wandb_data to log_data
---------
Co-authored-by: dubingnan <dubingnan@360.cn>
* update benchmark script
* .
* add benchmark docs
* PullRequest: 178 multi turn math agent training
Merge branch gjx/multi-turn-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/178?tab=diff
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* multi turn math agent training
* training data logging and clean math multi-turn exp
* fix
* .
* format
* .
---------
Co-authored-by: 步偶 <sam.gjx@antgroup.com>
* Update .gitignore and modify dataset paths in scripts for improved file management and compatibility with Hugging Face datasets. Additionally, refactor dataset loading functions to utilize load_hf_or_local_file for better flexibility.
* Remove sglang subproject and update dataset path format in load_hf_or_local_file function for compatibility with Hugging Face datasets.
* Refactor imports in grader.py and parser.py to include sympy for improved functionality.
* update benchmark script
* .
* add benchmark docs
* add v0.3.0 configs
* .
* PullRequest: 178 multi turn math agent training
Merge branch gjx/multi-turn-math of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/178?tab=diff
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* multi turn math agent training
* training data logging and clean math multi-turn exp
* fix
* .
* fix
* add docs and config
* format
* revert multi-turn agent
* add config
---------
Co-authored-by: 步偶 <sam.gjx@antgroup.com>
* feat: one buffer for each task
* feat: support "one buffer for each task" for async
* make kv_cache_dtype configurable
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
* style: use plural form
fix: use _seed_from_key to set different seeds for data loaders
fix: call load_data for one buffer each time
* PullRequest: 125 Support running async experiments in the 2407 image.
Merge branch fw/async2407 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/125
Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* fix: handle multiple datasets in recover indices
fix: `isinstance(self.__datasets, PullerStreamDataset)`
feat: use the "spec" request to obtain the number of datasets
fix: revert rollout worker
* fix: revert async_rl_exp.py
* fix flag for list (cuda_graph_bs)
* format
* [FIX] fix async task reward [sglang bf16 -> fp16]
* fix: define `self.__datasets` in advance
* PullRequest: 130 [Refactor] Remove deprecated search related code
Merge branch mzy/remove-search of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/130
Signed-off-by: 博惟 <bowei.fw@antgroup.com>
* remove search related
* PullRequest: 131 [Refactor] Change terminology "model parallel" into "tensor parallel" to align with megatron.
Merge branch mzy/mp-to-tp of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/131?tab=comment
Signed-off-by: 博惟 <bowei.fw@antgroup.com>
* change mp to tp
* .
* .
* PullRequest: 142 Fix an error for megatron backend destroy
Merge branch fw/fix-meagatron-destroy of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/142
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* PullRequest: 143 Fix the port conflict issue of generation servers
Merge branch fw/fix-gen-port of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/143?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* somehow fix the port issue
* add clearance period
* .
* .
* PullRequest: 145 Add code environment
Merge branch fw/code-env of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/145?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* add code env
* somehow fix the port issue
* fix
* PullRequest: 144 Add decoupled PPO loss
Merge branch fw/decoupled-ppo-loss of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/144?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* fix typo
* PullRequest: 146 Merge SLURM logs and save experiment configs in yaml format.
Merge branch fw/better-logging of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/146
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* merge all slurm logs into one
* write config to yaml
* PullRequest: 141 Merge changes during NeurIPS submission
Merge branch fw/async-dev of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/141
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* .
* .
* .
* .
* .
* .
* .
* .
* update script
* .
* .
* .
* .
* [ADD] add least req scheduling
* fix test genreq
* .
* .
* fix stats tracker nan
* .
* .
* .
* .
* .
* .
* .
* upper clip decoupled objective
* add throughput exp script
* .
* remove behav upper clip param
* .
* .
* .
* plot curve
* update thpt script
* .
* master worker raise error when exiting
* update script
* add gen throughput logging
* .
* .
* add decoupled wandb data
* .
* fix port issue and add no training option
* .
* enlarge ttl
* remove gserver manager await staled
* update weights in groups
* .
* .
* .
* add port clearance period
* .
* .
* .
* add plot script
* add sft throughput eval
* .
* log tokens in null interface
* ablation experiments and interruptible generation
* plotting scripts / run scripts / data results
* .
* remove scripts
* add port test
* remove force_sync_reward
* revert some changes
* .
* revert
* revert fix
* fix
* revert
* fix typo
* support qwen3 training
* PullRequest: 147 Support interruption in SGLang and fix a KeyError in gather-scatter communication
Merge branch fw/sglang046-with-abort-request of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/147?tab=diff
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix ppo step logging, nan in stats tracker, and add decoupled loss
* .
* somehow fix the port issue
* initial commit
* add interrupt request
* fix data transfer issue
* max concurrent rollouts defaults to train batch size
* merge main
* add patch
* fix patch typo
* revert sglang
* fix typo
* fix minor typo
* .
* pip show editable sglang path
* PullRequest: 149 fix: code faas max_retries
Merge branch xss/fix_code_verifier of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/149
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* fix: code faas max_retries
* PullRequest: 150 [Bug Fix] Fix key errors in `_run_scatter` in data transfer
Merge branch mzy/fix-scatter-groups of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/150
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* fix scatter groups key error
* fix test
* .
* PullRequest: 151 Fix Qwen3 import error when using transformers with a lower version
Merge branch fw/fix-qwen3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/151
Reviewed-by: 温差 <xushusheng.xss@antgroup.com>
* merge all slurm logs into one
* write config to yaml
* .
* PullRequest: 152 Support sglang0.4.6 and fix master_worker import error
Merge branch adopt_sglang046 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/152
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* Support sglang0.4.6 and fix master_worker import error
* remove disable_mla option
* PullRequest: 155 [FIX] reduce port conflicts
Merge branch sxj/reduce_port_conflict of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/155
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* [FIX] reduce port conflicts
* PullRequest: 153 Fix stuck and recover issues for async experiments
Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/153
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout
* PullRequest: 158 [Fix] Fix the error where "accepted" is not defined
Merge branch fw/fix-rollout-accepted of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/158
Reviewed-by: 温差 <xushusheng.xss@antgroup.com>
* .
* PullRequest: 154 Fix unit tests and simplify package installation
Merge branch fw/v0.3.0-tests of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/154?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix some tests
* fix tests except for experiments
* fix tests
* fix tests
* .
* .
* PullRequest: 159 [fix] Enlarge the default aiohttp connection timeout and fix a recover error in model worker
Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/159
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* fix sample cnt stuck
* fix recover
* code cleanup
* merge all slurm logs into one
* write config to yaml
* .
* .
* .
* revert birth time change
* .
* enlarge sock connect timeout
* .
* PullRequest: 160 set sock_connect as rollout_request_timeout in partial_rollout.py
Merge branch xss/rollout_timeout of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/160
Reviewed-by: 博惟 <bowei.fw@antgroup.com>
* set sock_connect as rollout_request_timeout in partial_rollout.py
* PullRequest: 161 Prioritize rollouts that are submitted earlier rather than arrived earlier
Merge branch fw/birth-time of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/161
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* blocking push
* PullRequest: 163 [bugfix] Fix synchronized training when birth time is absent
Merge branch fw/fix-sync-birthtime of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/163
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* PullRequest: 164 [Refactor] Move cluster spec into CLI args
Merge branch fw/refactor-cluster-spec of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/164?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* set cluster spec path in args
* .
* fix
* add default cluster spec
* PullRequest: 165 Normally exit all workers after experiment completion
Merge branch fw/exit-all-workers of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/165
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* .
* PullRequest: 167 [Feature] Use chunked logits computation to alleviate SGLang OOM
Merge branch fw/patch-sglang-oom of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/167
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
* PullRequest: 166 [Feature] Support single-script experiment launch with Ray
Merge branch fw/turbolaunch of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/166?tab=comment
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* add training script without ray name resolve
* add ray name resolve
* ray worker
* run
* run async
* local run
* set cluster spec path in args
* .
* .
* fix
* .
* .
* .
* .
* .
* update config
* .
* minor renaming
* PullRequest: 169 [Doc] Add v0.3.0 docs based on jupyter-book
Merge branch fw/doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/169
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* add docs
* refine doc
* refine doc
* PullRequest: 170 [Feature] Amend configs for ray scripts
Merge branch fw/ray-configs of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/170
Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>
* .
---------
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: kira.gw <kira.gw@antgroup.com>
Co-authored-by: shenxujie.sxj <shenxujie.sxj@antgroup.com>
Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>
Co-authored-by: sam.gjx <sam.gjx@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Co-authored-by: 履渊 <yuhong.gyh@antgroup.com>
If `CUDA_VISIBLE_DEVICES` is not set, all devices are available to
the process. If it is set to an empty string, no device is
available to the process.
Therefore, when `CUDA_VISIBLE_DEVICES` does not exist, the default
value should be equivalent to all devices rather than an empty
string.
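
The distinction between "unset" and "empty" can be sketched as follows. This is a minimal illustration, not the code changed by this commit: the helper name `visible_devices` and its `total` and `env` parameters are hypothetical, and it assumes the variable holds only comma-separated integer indices (no GPU UUIDs).

```python
import os
from typing import List, Mapping, Optional


def visible_devices(total: int, env: Optional[Mapping[str, str]] = None) -> List[int]:
    """Return the device indices a process may use (hypothetical helper).

    `total` is the number of devices physically present on the node.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable not set: every device is visible.
        return list(range(total))
    if raw.strip() == "":
        # Variable set to an empty string: no device is visible.
        return []
    # Assumption: entries are plain integer indices such as "0,2,3".
    return [int(tok) for tok in raw.split(",") if tok.strip()]
```

Passing a mapping as `env` keeps the sketch testable without touching the real environment; calling `visible_devices(total)` with no second argument reads `os.environ`.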
Signed-off-by: Hollow Man <hollowman@opensuse.org>