AReaL/evaluation
Wei Fu b3f5392f44
[Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92)
* change to math local eval

* .

* update docker image tag
2025-06-08 21:11:35 +08:00
code_verifier [Doc & Fix] Simplify the environment setup procedure (#62) 2025-06-01 14:57:21 +08:00
data PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
latex2sympy Initial commit. 2025-02-24 18:58:19 +08:00
sh [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
LEGAL.md Initial commit. 2025-02-24 18:58:19 +08:00
LICENSE Initial commit. 2025-02-24 18:58:19 +08:00
README.md Xss/readme (#16) 2025-03-31 18:57:39 +08:00
aggregate_acc_from_generated.py Merge updates from ant repository. (#34) 2025-04-27 11:09:25 +08:00
cf_elo_caculator.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
code_eval.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
data_loader.py Initial commit. 2025-02-24 18:58:19 +08:00
eval_and_aggregate.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
evaluate.py [Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92) 2025-06-08 21:11:35 +08:00
examples.py Initial commit. 2025-02-24 18:58:19 +08:00
grader.py Initial commit. 2025-02-24 18:58:19 +08:00
math_eval.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00
math_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
model_utils.py Initial commit. 2025-02-24 18:58:19 +08:00
parser.py PullRequest: 70 update evalution: aime25, gpqa 2025-03-30 20:16:08 +08:00
python_executor.py Initial commit. 2025-02-24 18:58:19 +08:00
requirements.txt [Doc & Fix] Simplify the environment setup procedure (#62) 2025-06-01 14:57:21 +08:00
rm_maj_eval.py [Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92) 2025-06-08 21:11:35 +08:00
trajectory.py Initial commit. 2025-02-24 18:58:19 +08:00
utils.py [Feature] Support behavior importance weight capping and update evaluation scripts (#59) 2025-05-30 10:29:21 +08:00

README.md

Evaluate

This evaluation package is adapted from Qwen2.5-Math.

Install the dependencies, starting from this directory:

cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm --no-build-isolation
pip install transformers==4.47.0
pip install prettytable timeout_decorator
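
After installation, it can be useful to confirm that the pinned version actually ended up in the active environment. The following is a minimal sketch using only the standard library. The `EXPECTED` table is an assumption assembled from the install commands above, not a file shipped with this package, and the distribution names are taken as written in those commands.

```python
from importlib import metadata

# Assumed expectations, copied from the install commands above.
# None means "any installed version is acceptable".
EXPECTED = {
    "transformers": "4.47.0",
    "vllm": None,
    "prettytable": None,
    "timeout_decorator": None,
}


def check_environment(expected=EXPECTED, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Only compare versions for packages that are pinned.
        if wanted is not None and found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent"]:
        print(line)
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment.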

Run evaluation:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24 \
--max_gen_tokens 32768  # max number of tokens to generate; defaults to 32768

The results are saved in ${OUTPUT_PATH}/math_eval_32768.
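
To see what a run produced, the results directory can be listed with a few lines of Python. This is a sketch under one assumption: that the `32768` suffix in the directory name follows `--max_gen_tokens`. The README only confirms the name for the default value, and the helper names `results_dir` and `list_result_files` are hypothetical. No particular file layout inside the directory is assumed.

```python
from pathlib import Path


def results_dir(output_path, max_gen_tokens=32768):
    """Build the results path described above.

    Assumption: the numeric suffix tracks --max_gen_tokens.
    """
    return Path(output_path) / f"math_eval_{max_gen_tokens}"


def list_result_files(output_path, max_gen_tokens=32768):
    """Return all files under the results directory as sorted relative paths."""
    root = results_dir(output_path, max_gen_tokens)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

For example, `list_result_files("outputs")` returns an empty list until an evaluation has written something under `outputs/math_eval_32768`.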

Evaluate AReaL-boba-RL-7B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba \
--temperature 1.0

Evaluate AReaL-boba-SFT-32B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba-SFT \
--samples_per_node 2 --num_sample_nodes 16 \
--temperature 0.6
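
The two model-specific invocations differ only in a handful of flags, so they can be generated from one table. The sketch below builds the argument list for `eval_and_aggregate.py` using only flags that appear in the commands above. The `MODEL_PRESETS` table and the `build_eval_command` helper are hypothetical conveniences, not part of this package, and the paths in the usage example are placeholders.

```python
import shlex

# Per-model settings copied from the commands above.
MODEL_PRESETS = {
    "AReaL-boba-RL-7B": {"prompt_type": "AReaL-boba", "temperature": 1.0},
    "AReaL-boba-SFT-32B": {
        "prompt_type": "AReaL-boba-SFT",
        "samples_per_node": 2,
        "num_sample_nodes": 16,
        "temperature": 0.6,
    },
}


def build_eval_command(model_name, model_path, output_path,
                       data_names=("aime24", "aime25", "gpqa_diamond")):
    """Return the command as a list of arguments, suitable for subprocess.run."""
    preset = MODEL_PRESETS[model_name]
    cmd = [
        "python", "eval_and_aggregate.py",
        "--model_path", model_path,
        "--output_path", output_path,  # passed exactly once
        "--data_names", ",".join(data_names),
    ]
    for flag, value in preset.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute real ones before running.
    cmd = build_eval_command("AReaL-boba-RL-7B", "/path/to/model", "outputs")
    print(shlex.join(cmd))
```

Passing `--output_path` only once matters because a repeated option is typically resolved in favor of the last occurrence, which would silently ignore `${OUTPUT_PATH}`.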