AReaL/evaluation
Wei Fu 4fab3ac769
[Doc & Fix] Simplify the environment setup procedure (#62)
2025-06-01 14:57:21 +08:00
code_verifier [Doc & Fix] Simplify the environment setup procedure (#62) 2025-06-01 14:57:21 +08:00
data PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
latex2sympy Initial commit. 2025-02-24 18:58:19 +08:00
sh [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
LEGAL.md Initial commit. 2025-02-24 18:58:19 +08:00
LICENSE Initial commit. 2025-02-24 18:58:19 +08:00
README.md Xss/readme (#16) 2025-03-31 18:57:39 +08:00
aggregate_acc_from_generated.py Merge updates from ant repository. (#34) 2025-04-27 11:09:25 +08:00
cf_elo_caculator.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
code_eval.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
data_loader.py Initial commit. 2025-02-24 18:58:19 +08:00
eval_and_aggregate.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
evaluate.py Initial commit. 2025-02-24 18:58:19 +08:00
examples.py Initial commit. 2025-02-24 18:58:19 +08:00
grader.py Initial commit. 2025-02-24 18:58:19 +08:00
math_eval.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
math_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
model_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
parser.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
python_executor.py Initial commit. 2025-02-24 18:58:19 +08:00
requirements.txt [Doc & Fix] Simplify the environment setup procedure (#62) 2025-06-01 14:57:21 +08:00
rm_maj_eval.py Initial commit. 2025-02-24 18:58:19 +08:00
trajectory.py Initial commit. 2025-02-24 18:58:19 +08:00
utils.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00

README.md

Evaluate

This evaluation package is adapted from Qwen2.5-Math.

Install the following packages:

cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt 
pip install vllm --no-build-isolation
pip install transformers==4.47.0
pip install prettytable timeout_decorator
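
After installing, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch, not part of this repository: `check_environment` and `REQUIRED` are hypothetical names, and the distribution names are assumed to match the `pip install` commands above. The `latex2sympy` package is left out because its installed distribution name is not stated here.

```python
from importlib import metadata

# Distribution names taken from the pip commands above (assumption: they match
# the names registered in the environment). The only pin in the instructions
# is transformers==4.47.0.
REQUIRED = {
    "vllm": None,
    "transformers": "4.47.0",
    "prettytable": None,
    "timeout-decorator": None,
}


def check_environment(required=REQUIRED):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Only compare versions for packages that the instructions pin.
        if pinned is not None and installed != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "environment looks OK")
```

Running the script prints one line per missing or mismatched package, so a version drift in `transformers` shows up before a long evaluation run starts.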

Run evaluation:

# --max_gen_tokens is the maximum number of tokens to generate (defaults to 32768)
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24 \
--max_gen_tokens 32768

The results are saved in ${OUTPUT_PATH}/math_eval_32768.
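
To see what a run produced, the result directory can be listed from Python. This is a rough sketch with a hypothetical helper, `list_result_files`. It assumes the directory suffix follows `--max_gen_tokens`, which the example above suggests but this README does not state, and it makes no assumption about the file names inside.

```python
from pathlib import Path


def list_result_files(output_path, max_gen_tokens=32768):
    """Return all files under <output_path>/math_eval_<max_gen_tokens>, sorted.

    Assumption: the directory suffix equals the max_gen_tokens value used
    for the run. Returns an empty list if the directory does not exist.
    """
    result_dir = Path(output_path) / f"math_eval_{max_gen_tokens}"
    if not result_dir.is_dir():
        return []
    return sorted(
        p.relative_to(result_dir).as_posix()
        for p in result_dir.rglob("*")
        if p.is_file()
    )
```

Calling `list_result_files("outputs")` after a default run returns the relative paths of everything written under `outputs/math_eval_32768`.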

Evaluate AReaL-boba-RL-7B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba \
--temperature 1.0

Evaluate AReaL-boba-SFT-32B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba-SFT \
--samples_per_node 2 --num_sample_nodes 16 \
--temperature 0.6
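
When several models are evaluated in a row, the commands above can be assembled programmatically. The sketch below is not part of this package: `build_eval_command` is a hypothetical helper that uses only the flags shown in this README, and the model path is a placeholder.

```python
import shlex


def build_eval_command(model_path, output_path, data_names,
                       prompt_type=None, temperature=None, extra_args=()):
    """Build the argument list for eval_and_aggregate.py.

    Only flags that appear in this README are emitted; anything else can be
    passed verbatim through extra_args.
    """
    cmd = [
        "python", "eval_and_aggregate.py",
        "--model_path", str(model_path),
        "--output_path", str(output_path),
        "--data_names", ",".join(data_names),
    ]
    if prompt_type is not None:
        cmd += ["--prompt_type", prompt_type]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    cmd += list(extra_args)
    return cmd


# Settings from the AReaL-boba-SFT-32B example; "/path/to/model" is a placeholder.
sft_cmd = build_eval_command(
    "/path/to/model",
    "outputs",
    ["aime24", "aime25", "gpqa_diamond"],
    prompt_type="AReaL-boba-SFT",
    temperature=0.6,
    extra_args=["--samples_per_node", "2", "--num_sample_nodes", "16"],
)
print(shlex.join(sft_cmd))
```

The returned list can be handed to `subprocess.run(sft_cmd, check=True)` from the evaluation directory. Building a list rather than a single string avoids shell quoting problems when paths contain spaces.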