AReaL/evaluation
nuzant ffc52a1520
Merge updates from ant repository. (#34)
2025-04-27 11:09:25 +08:00
data PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
latex2sympy Initial commit. 2025-02-24 18:58:19 +08:00
sh PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
LEGAL.md Initial commit. 2025-02-24 18:58:19 +08:00
LICENSE Initial commit. 2025-02-24 18:58:19 +08:00
README.md Xss/readme (#16) 2025-03-31 18:57:39 +08:00
aggregate_acc_from_generated.py Merge updates from ant repository. (#34) 2025-04-27 11:09:25 +08:00
data_loader.py Initial commit. 2025-02-24 18:58:19 +08:00
eval_and_aggregate.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
evaluate.py Initial commit. 2025-02-24 18:58:19 +08:00
examples.py Initial commit. 2025-02-24 18:58:19 +08:00
grader.py Initial commit. 2025-02-24 18:58:19 +08:00
math_eval.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
math_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
model_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
parser.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
python_executor.py Initial commit. 2025-02-24 18:58:19 +08:00
requirements.txt PullRequest: 11 Support automatically launching evaluate tasks during training 2025-03-05 18:06:40 +08:00
rm_maj_eval.py Initial commit. 2025-02-24 18:58:19 +08:00
trajectory.py Initial commit. 2025-02-24 18:58:19 +08:00
utils.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00

README.md

Evaluate

This evaluation package was modified from Qwen2.5-Math.

Install the following packages:

cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt 
pip install vllm --no-build-isolation
pip install transformers==4.47.0
pip install prettytable timeout_decorator
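
After installing, it can help to confirm that the packages are visible to the interpreter before starting a long evaluation run. The following is a minimal sketch, not part of this package: the helper name `missing_modules` is hypothetical, and latex2sympy is omitted because its import name is not stated in this README.

```python
from importlib.util import find_spec

# Import names for the packages installed above. latex2sympy is left out
# because its import name is not stated in this README.
REQUIRED_MODULES = ["vllm", "transformers", "prettytable", "timeout_decorator"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

`find_spec` only locates a top-level module without importing it, so the check stays fast even for heavy packages such as vllm.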

Run evaluation:

# --max_gen_tokens: maximum number of tokens to generate (defaults to 32768)
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24 \
--max_gen_tokens 32768

The results are saved in ${OUTPUT_PATH}/math_eval_32768.
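
To locate the results directory from a script, a small sketch like the one below may be useful. It assumes that the numeric suffix of `math_eval_32768` follows the value of `--max_gen_tokens`; this README only shows the 32768 case, so treat that as an assumption. The helper name `results_dir` is hypothetical.

```python
from pathlib import Path


def results_dir(output_path, max_gen_tokens=32768):
    """Return <output_path>/math_eval_<max_gen_tokens>.

    Assumption: the numeric suffix tracks --max_gen_tokens; the README only
    shows the 32768 case.
    """
    return Path(output_path) / f"math_eval_{max_gen_tokens}"


if __name__ == "__main__":
    out = results_dir("outputs")
    print(out)
    # List whatever the evaluation wrote, if the directory exists.
    if out.is_dir():
        for entry in sorted(out.iterdir()):
            print(" ", entry.name)
```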

Evaluate AReaL-boba-RL-7B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba \
--temperature 1.0

Evaluate AReaL-boba-SFT-32B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba-SFT \
--samples_per_node 2 --num_sample_nodes 16 \
--temperature 0.6
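
The two model-specific commands above differ only in a handful of flags, so they can be generated from one table of settings. This is a sketch under stated assumptions: the preset keys (`rl-7b`, `sft-32b`) and the helper `build_command` are hypothetical names introduced here, while the flag names and values are copied from the commands above.

```python
import shlex

# Flag values copied from the two commands above; the preset keys are
# hypothetical labels chosen for this sketch.
PRESETS = {
    "rl-7b": {"prompt_type": "AReaL-boba", "temperature": 1.0},
    "sft-32b": {
        "prompt_type": "AReaL-boba-SFT",
        "temperature": 0.6,
        "samples_per_node": 2,
        "num_sample_nodes": 16,
    },
}
DATA_NAMES = "aime24,aime25,gpqa_diamond"


def build_command(preset, model_path, output_path):
    """Return the argv list for eval_and_aggregate.py under a given preset."""
    argv = [
        "python", "eval_and_aggregate.py",
        "--model_path", model_path,
        "--output_path", output_path,
        "--data_names", DATA_NAMES,
    ]
    # Append the preset-specific flags in the order they are declared.
    for flag, value in PRESETS[preset].items():
        argv += [f"--{flag}", str(value)]
    return argv


if __name__ == "__main__":
    for name in PRESETS:
        print(shlex.join(build_command(name, "/path/to/model", "outputs")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.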