AReaL

Commit Graph

Author	SHA1	Message	Date
Wei Fu	311bcd7697	[lite] [feature] Bump to SGLang v0.4.9.post2 and use NCCL to update weights (#196 ) * PullRequest: 353 [Lite] Add gradient checkpointing to FSDPEngine Merge branch mzy/add-gradient-ckpt of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/353 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * add gradient checkpointing * PullRequest: 354 [lite] GRPO pre-commit: minor changes in FSDP engine Merge branch fw/lite-fix1 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/354 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * PullRequest: 355 [Lite] GRPO pre-commit 2: Refactor RemoteSGLangEngine thread and SGLang configuration Merge branch fw/lite-fix1 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/355?tab=commit Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * . * . * fix * . * PullRequest: 357 [lite] GRPO pre-commit 3: Fix typos and experiment utilities Merge branch fw/lite-fix2 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/357?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * . * fix destroy process group * PullRequest: 358 [lite] Support GRPO training locally with the GSM8k dataset Merge branch fw/lite-fix3 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/358 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * fix loss mask * fix * . * PullRequest: 368 [lite] Refactor train engine after merging contributions from GitHub Merge branch fw/lite-train-engine of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/368 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * PullRequest: 371 [lite] [fix] fix misc bugs in GRPO implementation Merge branch fw/lite-fix0716 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/371 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * PullRequest: 370 [lite] Add Slurm Launcher and Ray Launcher Merge branch mzy/lite/launcher of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/370 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * . * . * . * fix * PullRequest: 392 [lite] Fix several bugs regarding RL learning and add an example to reproduce boba-math results. Merge branch fw/lite-boba of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/392 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * support fsdp engine and sglang remote engine * minor fix * . * refactor trainer * add close * rm mb_spec * . * fix * . * qwen2 grpo works * fix * fix * async works * fix * slurm launcher not tested * fix arg parse * . * sglang server wrapper * . * . * slurm run * ready for boba * debug * 32k run * . * . * fix * . * . * . * . * . * fix * . * fix * . * . * . * . * fix * . * . * . * . * . * . * . * refactor train engine * refactor train engine * . * fix update weight error * . * . * match train * format * . * fix * seems to work * . * . * . * . * . * PullRequest: 408 [Feature] Bump SGLang version to v0.4.9.post2 Merge branch fw/sgl049 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/408 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * bump arealite to sglang 0.4.9.post2 * . * PullRequest: 412 [lite] Minor refactor on `UpdateWeightMeta` * PullRequest: 422 [lite] Fix tests and scripts after updating sgl to 0.4.9 Merge branch fw/sgl049 of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/422 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * bump arealite to sglang 0.4.9.post2 * . * PullRequest: 412 [lite] Minor refactor on `UpdateWeightMeta` * . * PullRequest: 423 [lite] Remove the boba example for github release. Merge branch fw/remove-boba of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/423 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . --------- Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com>	2025-07-24 15:34:52 +08:00
Jayon02	ef4215d6f1	[Feat][Refactor]Support DeepSpeed AutoTP; Refactor hf_engine.py and unit test. (#161 ) * refactor hf engine * format file * revert file format * Squashed commit of the following: commit `8d4b8dc90f` Author: Wei Fu <36355462+garrett4wade@users.noreply.github.com> Date: Thu Jul 10 13:14:10 2025 +0800 [Doc] Add an instruction about how to run the SFT example. (#164) commit `3bf9c85e40` Author: Wei Fu <36355462+garrett4wade@users.noreply.github.com> Date: Thu Jul 10 12:56:24 2025 +0800 [Fix] Merge previous contributions from fw/refactor to lite (#163) * initial proposal * add arealite * . * change api * . * remove LOG_ROOT * remove MODEL_SAVE_PATH * remove PARAM_REALLOC_PATH, DATASET_CACHE * prepare for testing * prepare for testing * ready for run * local run * tests mainly pass * format * . * amend cluster.py * . * . * client test pass * pass rollout test * remove unused imports * add arealite readme * change api * . * . * . * . * . * . * . * . * format * . * implement iteraptable generation (#112) Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com> * . * fix * . * . * . * pass controller generate batch test * . * refactor rollout controller into worker and controller * . * . * . * change to async rollout * pass rollout controller test * pass test * . * update readme * . * sft debug * . * add lisence * remove unused files * remove unsed args in ppo * add hf engine wrapper (#116) * add hf engine * fix issues * fix ppo bugs and add test * add hf client interface and modify cli args * fix bugs * fix issues * Merge fw/refactor * Finish hf wrapper test * add test --------- Co-authored-by: Wei Fu <36355462+garrett4wade@users.noreply.github.com> * format * format * . * refine hf engine * . * fix * add fsdp engine and sft tests * . * . * . * pass ppo unittest * pass ppo and rollout controller tests * clear unused imports * rename ppo to grpo * change reward function organization * reorganize code * add dataset api * . * . * . * format * chmod fix * . * rename workflow to collector * refactor llm_client location * . * . * fix llm server api * refactor config structure * . * fix tests * . * . * . * Fix unresolved issue in SFTTrainer PR (#139) * . * . * efficient loading * format * . * . * . * . * . * . * Add CI for testing AReaLite (#150) * ci: add test-arealite * ci: add checkout before running test-arealite * ci: add USERNAME * ci: add test script * ci: add GitHub mirror * ci: fix typo * ci: clone one commit * ci: fix condition * ci: set command timeout to 60m * ci: enable pip cache * ci: optimize container lifecycle * ci: split into many stages * ci(test-arealite): fix typo * ci: fix wrong env * ci: fix pytest * ci: uninstall transformer-engine * ci: uninstall transformer-engine * ci: fix model paths * ci: show stdout/stderr * ci: fix not clean up * ci: backup sglang * ci: remove tmp repo dir when run * ci: fix docker run exit 1 condition * ci(test-arealite): limit the concurrency and extend command timeout * . * merge fw/refactor * revert some changes * fix --------- Co-authored-by: meizhiyu.mzy <meizhiyu.mzy@antgroup.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com> Co-authored-by: Jayon02 <qiujiangc@outlook.com> Co-authored-by: root <meizhiyu.mzy> Co-authored-by: Zijian Zhang <futrime@outlook.com> commit `d48bf007cf` Merge: `42c717b` `b9dbd4a` Author: 博惟 <bowei.fw@antgroup.com> Date: Thu Jul 10 12:53:30 2025 +0800 Merge branch 'main' of https://github.com/inclusionAI/AReaL into lite commit `42c717b6e4` Merge: `c38cffc` `a203c7c` Author: 博惟 <bowei.fw@antgroup.com> Date: Thu Jul 10 11:15:01 2025 +0800 Merge branch 'lite' of https://github.com/inclusionAI/AReaL into lite commit `c38cffc023` Author: 博惟 <bowei.fw@antgroup.com> Date: Thu Jul 10 11:10:10 2025 +0800 PullRequest: 340 [lite] Refactor trainer API into utilities and remove mb_spec in engine methods Merge branch fw/lite-dev of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/340 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * support fsdp engine and sglang remote engine * minor fix * . * refactor trainer * add close * rm mb_spec * fix commit `b9dbd4a2c1` Author: Wei Fu <36355462+garrett4wade@users.noreply.github.com> Date: Wed Jul 9 10:50:19 2025 +0800 Update to persistent wechat QR code. (#159) commit `17ea7fe94d` Author: xssstory <33601810+xssstory@users.noreply.github.com> Date: Mon Jul 7 15:49:13 2025 +0800 fix math reward verifier (#156) * PullRequest: 293 fix get_param_realloc_path Merge branch xss/debug of git@code.alipay.com:inclusionAI/AReaL.git into gh https://code.alipay.com/inclusionAI/AReaL/pull_requests/293 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * fix get_param_realloc_path * PullRequest: 297 bugfix: reward is always -5 Merge branch xss/debug of git@code.alipay.com:inclusionAI/AReaL.git into gh https://code.alipay.com/inclusionAI/AReaL/pull_requests/297 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * bugfix: reward is always -5 * PullRequest: 321 fix checkpoint save dir Merge branch xss/debug of git@code.alipay.com:inclusionAI/AReaL.git into gh https://code.alipay.com/inclusionAI/AReaL/pull_requests/321 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * fix checkpoint save dir * PullRequest: 328 [Doc] update installation Merge branch sxj/doc of git@code.alipay.com:inclusionAI/AReaL.git into gh https://code.alipay.com/inclusionAI/AReaL/pull_requests/328 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * [Doc] update installation * PullRequest: 329 bugfix: math verifier blocks the async training Merge branch xss/debug of git@code.alipay.com:inclusionAI/AReaL.git into gh https://code.alipay.com/inclusionAI/AReaL/pull_requests/329 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * bugfix: math verifier block the async training * format --------- Co-authored-by: 冰临 <shenxujie.sxj@antgroup.com> Co-authored-by: garrett4wade <fuwth17@gmail.com> * add autotp for hf * refactor test * fix bugs * fix issues * format files * Squashed commit of the following: commit `9ed043f6ab` Author: Wei Fu <36355462+garrett4wade@users.noreply.github.com> Date: Tue Jul 15 10:24:48 2025 +0800 format (#174) commit `8cc9b1feb5` Author: Night <32424487+PrinsYin@users.noreply.github.com> Date: Mon Jul 14 19:22:00 2025 -0700 added LocalSGlangEngine and test (#170) * added LocalSGLangEngine * upload test file * add build args * fix sgl_local generate * improved sgl local robustness * test * test updated * added fallback when sgl engine isn't initialized * finish test local engine * added LocalSGlangEngine and test * format and fix format and fix, raise when generate missing field format * change cli_args.py * add comment header format --------- Co-authored-by: ChangyiYang <changyiyang2023@gmail.com> --------- Co-authored-by: Jayon02 <12012211@mail..sustech.edu.cn> Co-authored-by: Wei Fu <36355462+garrett4wade@users.noreply.github.com>	2025-07-16 12:44:10 +08:00
Wei Fu	3bf9c85e40	[Fix] Merge previous contributions from fw/refactor to lite (#163 ) * initial proposal * add arealite * . * change api * . * remove LOG_ROOT * remove MODEL_SAVE_PATH * remove PARAM_REALLOC_PATH, DATASET_CACHE * prepare for testing * prepare for testing * ready for run * local run * tests mainly pass * format * . * amend cluster.py * . * . * client test pass * pass rollout test * remove unused imports * add arealite readme * change api * . * . * . * . * . * . * . * . * format * . * implement iteraptable generation (#112) Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com> * . * fix * . * . * . * pass controller generate batch test * . * refactor rollout controller into worker and controller * . * . * . * change to async rollout * pass rollout controller test * pass test * . * update readme * . * sft debug * . * add lisence * remove unused files * remove unsed args in ppo * add hf engine wrapper (#116) * add hf engine * fix issues * fix ppo bugs and add test * add hf client interface and modify cli args * fix bugs * fix issues * Merge fw/refactor * Finish hf wrapper test * add test --------- Co-authored-by: Wei Fu <36355462+garrett4wade@users.noreply.github.com> * format * format * . * refine hf engine * . * fix * add fsdp engine and sft tests * . * . * . * pass ppo unittest * pass ppo and rollout controller tests * clear unused imports * rename ppo to grpo * change reward function organization * reorganize code * add dataset api * . * . * . * format * chmod fix * . * rename workflow to collector * refactor llm_client location * . * . * fix llm server api * refactor config structure * . * fix tests * . * . * . * Fix unresolved issue in SFTTrainer PR (#139) * . * . * efficient loading * format * . * . * . * . * . * . * Add CI for testing AReaLite (#150) * ci: add test-arealite * ci: add checkout before running test-arealite * ci: add USERNAME * ci: add test script * ci: add GitHub mirror * ci: fix typo * ci: clone one commit * ci: fix condition * ci: set command timeout to 60m * ci: enable pip cache * ci: optimize container lifecycle * ci: split into many stages * ci(test-arealite): fix typo * ci: fix wrong env * ci: fix pytest * ci: uninstall transformer-engine * ci: uninstall transformer-engine * ci: fix model paths * ci: show stdout/stderr * ci: fix not clean up * ci: backup sglang * ci: remove tmp repo dir when run * ci: fix docker run exit 1 condition * ci(test-arealite): limit the concurrency and extend command timeout * . * merge fw/refactor * revert some changes * fix --------- Co-authored-by: meizhiyu.mzy <meizhiyu.mzy@antgroup.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com> Co-authored-by: Jayon02 <qiujiangc@outlook.com> Co-authored-by: root <meizhiyu.mzy> Co-authored-by: Zijian Zhang <futrime@outlook.com>	2025-07-10 12:56:24 +08:00
博惟	15dfbe837c	PullRequest: 332 [lite] Support FSDP engines Merge branch mzy/lite/engines of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/332 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fsdp2 engine * fix utils * add fsdp engine test * . * fsdp engine test passed * unsqueeze immediately before model inputs and after model outputts * add optimizer save/load, add position id calculation for input * . * format * not to squeeze * add train and eval api * . * . * improve fsdp engine data preprocessing * format * PullRequest: 337 [lite] Add SFT trainer example. * trainer log * minor changes * add update weights from disk * fix type annotation	2025-07-09 16:24:25 +08:00
博惟	8771778995	PullRequest: 331 [lite] Support remote sglang engine with corresponding testcases. Merge branch fw/lite of git@code.alipay.com:inclusionAI/AReaL.git into lite https://code.alipay.com/inclusionAI/AReaL/pull_requests/331 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * add test for sglang remote engine * fix	2025-07-09 14:18:55 +08:00
xichengpro	bb14f022dc	Support using SwanLab for experiment tracking (#98 ) * Support using SwanLab for experiment tracking * docs: improve WandB and SwanLab integration documentation - Added official links for better user reference - Used backticks to quote commands and parameters - Unified mode settings to use "online" / "cloud" convention - Merged WandB and SwanLab descriptions into a single concise statement - Added note on using `swanlab.mode="local"` when server connection is unavailable * refactor: update default value of api_key * fix: correct help description from WandB to SwanLab in SwanLabConfig * refactor: merge log_swanlab_tensorboard and log_wandb_tensorboard into log_swanlab_wandb_tensorboard - Unified logging logic for SwanLab, WandB, and TensorBoard to reduce code duplication * chore: update swanlab version in dependency config files - Updated SwanLab version in pyproject.toml - Updated SwanLab version in requirements.txt * refactor: enhance SwanLab config handling for logging purposes - Config now uses provided arguments first - Falls back to reading from config.yaml if no input is given * docs: add note on using when server connection is unavailable * refactor: merge _LATEST_WANDB_STEP and _LATEST_SWANLAB_STEP into _LATEST_LOG_STEP * Format code with black and isort * chore: update swanlab version in dependency config files - Updated SwanLab version in requirements.txt * refactor: rename swanlab_wandb_data to log_data --------- Co-authored-by: dubingnan <dubingnan@360.cn>	2025-06-16 19:51:31 +08:00
Wei Fu	b3f5392f44	[Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92 ) * change to math local eval * . * update docker image tag	2025-06-08 21:11:35 +08:00
Wei Fu	4fab3ac769	[Doc & Fix] Simplify the environment setup procedure (#62 ) * PullRequest: 176 [FIX] clear sensitive info Merge branch fw/fix-sensitive-info of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/176 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * . * test env setup * fix * allow cached model * . * revise docs * change docs * format docs * update readme	2025-06-01 14:57:21 +08:00
Wei Fu	cf46993a30	[Feature & Doc & Bug Fix] Add docs, simplified ray-based scripts, and fix issues to stablize asynchronous experiments (#52 ) * feat: one buffer for each task * feat: support "one buffer for each task" for async * make kv_cache_dtype configurable Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> * style: use plural form fix: use _seed_from_key to set different seeds for data loaders fix: call load_data for one buffer each time * PullRequest: 125 Support running async experiments in the 2407 image. Merge branch fw/async2407 of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/125 Signed-off-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * fix: handle multiple datasets in recover indices fix: `isinstance(self.__datasets, PullerStreamDataset)` feat: use the "spec" request to obtain the number of datasets fix: revert rollout worker * fix: revert async_rl_exp.py * fix flag for list (cuda_graph_bs) * format * [FIX] fix async task reward [sglang bf16-> fp16] * fix: define `self.__datasets` in advance * PullRequest: 130 [Refactor] Remove deprecated search related code Merge branch mzy/remove-search of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/130 Signed-off-by: 博惟 <bowei.fw@antgroup.com> * remove search related * PullRequest: 131 [Refactor] Change terminology "model parallel" into "tensor parallel" to align with megatron. Merge branch mzy/mp-to-tp of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/131?tab=comment Signed-off-by: 博惟 <bowei.fw@antgroup.com> * change mp to tp * . * . * PullRequest: 142 Fix an error for megatron backend destroy Merge branch fw/fix-meagatron-destroy of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/142 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * PullRequest: 143 Fix the port conflict issue of generation servers Merge branch fw/fix-gen-port of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/143?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * somehow fix the port issue * add clearance period * . * . * PullRequest: 145 Add code environment Merge branch fw/code-env of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/145?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * add code env * somehow fix the port issue * fix * PullRequest: 144 Add decoupled PPO loss Merge branch fw/decoupled-ppo-loss of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/144?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fix ppo step logging, nan in stats tracker, and add decoupled loss * . * somehow fix the port issue * fix typo * PullRequest: 146 Merge SLURM logs and save experiment configs in yaml format. Merge branch fw/better-logging of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/146 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * merge all slurm logs into one * write config to yaml * PullRequest: 141 Merge changes during NeurIPS submission Merge branch fw/async-dev of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/141 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * . * . * . * . * . * . * . * update script * . * . * . * . * [ADD] add least req scheduling * fix test genreq * . * . * fix stats tracker nan * . * . * . * . * . * . * . * uppper clip decoupled objective * add throughput exp script * . * remove behav upper clip param * . * . * . * plot curve * update thpt script * . * master worker raise error when exiting * update script * add gen throughput logging * . * . * add decoupled wandb data * . * fix port issue and add no training option * . * enlarge ttl * remove gserver manager await staled * update weights in groups * . * . * . * add port clearance period * . * . * . * add plot script * add sft throughput eval * . * log tokens in null interface * 消融实验和interruptible generation * 画图脚本/运行脚本/数据结果 * . * remove scripts * add port test * remove force_sync_reward * revert some changes * . * revert * revert fix * fix * revert * fix typo * support qwen3 training * PullRequest: 147 Support interruption in SGLang and fix a KeyError in gather-scatter communication Merge branch fw/sglang046-with-abort-request of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/147?tab=diff Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fix ppo step logging, nan in stats tracker, and add decoupled loss * . * somehow fix the port issue * initial commit * add interupt request * fix data transfer issue * max concurrent rollouts defaults to train batch size * merge main * add patch * fix patch typp * revert sglang * fix typo * fix minor typo * . * pip show editable sglang path * PullRequest: 149 fix: code faas max_retries Merge branch xss/fix_code_verifier of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/149 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * fix: code faas max_retries * PullRequest: 150 [Bug Fix] Fix key errors in `_run_scatter` in data transfer Merge branch mzy/fix-scatter-groups of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/150 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * fix scatter groups key error * fix test * . * PullRequest: 151 Fix Qwen3 import error when using transformers with a lower version Merge branch fw/fix-qwen3 of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/151 Reviewed-by: 温差 <xushusheng.xss@antgroup.com> * merge all slurm logs into one * write config to yaml * . * PullRequest: 152 Support sglang0.4.6 and fix master_worker import error Merge branch adopt_sglang046 of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/152 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * Support sglang0.4.6 and fix master_worker import error * remove disable_mla option * PullRequest: 155 [FIX] reduce port conflicts Merge branch sxj/reduce_port_conflict of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/155 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * [FIX] reduce port conflicts * PullRequest: 153 Fix stuck and recover issues for async experiments Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/153 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fix sample cnt stuck * fix recover * code cleanup * merge all slurm logs into one * write config to yaml * . * . * . * revert birth time change * . * enlarge sock connect timeout * PullRequest: 158 [Fix] Fix the error where "accepted" is not defined Merge branch fw/fix-rollout-accepted of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/158 Reviewed-by: 温差 <xushusheng.xss@antgroup.com> * . * PullRequest: 154 Fix unit tests and simplify package installation Merge branch fw/v0.3.0-tests of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/154?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fix some tests * fix tests except for experiments * fix tests * fix tests * . * . * PullRequest: 159 [fix] Enlarge the default aiohttp connection timeout and fix a recover error in model worker Merge branch fw/stable-async of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/159 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * fix sample cnt stuck * fix recover * code cleanup * merge all slurm logs into one * write config to yaml * . * . * . * revert birth time change * . * enlarge sock connect timeout * . * PullRequest: 160 set sock_connect as rollout_request_timeout in partial_rollout.py Merge branch xss/rollout_timeout of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/160 Reviewed-by: 博惟 <bowei.fw@antgroup.com> * set sock_connect as rollout_request_timeout in partial_rollout.py * PullRequest: 161 Prioritize rollouts that are submitted earlier rather than arrived earlier Merge branch fw/birth-time of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/161 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * blocking push * PullRequest: 163 [bugfix] Fix synchronized training when birth time is absent Merge branch fw/fix-sync-birthtime of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/163 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * PullRequest: 164 [Refactor] Move cluster spec into CLI args Merge branch fw/refactor-cluster-spec of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/164?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * set cluster spec path in args * . * fix * add default cluster spec * PullRequest: 165 Normally exit all workers after experiment completion Merge branch fw/exit-all-workers of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/165 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * . * PullRequest: 167 [Feature] Use chunked logits computation to alleviate SGLang OOM Merge branch fw/patch-sglang-oom of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/167 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * . * PullRequest: 166 [Feature] Support single-script experiment launch with Ray Merge branch fw/turbolaunch of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/166?tab=comment Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * add training script without ray name resolve * add ray name resolve * ray worker * run * run async * local run * set cluster spec path in args * . * . * fix * . * . * . * . * . * update config * . * minor renaming * PullRequest: 169 [Doc] Add v0.3.0 docs based on jupyter-book Merge branch fw/doc of git@code.alipay.com:inclusionAI/AReaL.git into main https://code.alipay.com/inclusionAI/AReaL/pull_requests/169 Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com> * add docs * refine doc * refine doc --------- Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Co-authored-by: wanghuaijie.whj <wanghuaijie.whj@antgroup.com> Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com> Co-authored-by: kira.gw <kira.gw@antgroup.com> Co-authored-by: shenxujie.sxj <shenxujie.sxj@antgroup.com> Co-authored-by: 晓雷 <meizhiyu.mzy@antgroup.com> Co-authored-by: sam.gjx <sam.gjx@antgroup.com> Co-authored-by: 温差 <xushusheng.xss@antgroup.com> Co-authored-by: 履渊 <yuhong.gyh@antgroup.com>	2025-05-28 19:18:05 +08:00
晓雷	2963a67311	Initial commit.	2025-02-24 18:58:19 +08:00

10 Commits