AReaL

Wei Fu b3f5392f44 [Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92 ) * change to math local eval * . * update docker image tag		2025-06-08 21:11:35 +08:00
code_verifier	[Doc & Fix] Simplify the environment setup procedure (#62 )	2025-06-01 14:57:21 +08:00
data	PullRequest: 70 update evalution: aime25, gpqa	2025-03-30 20:16:08 +08:00
latex2sympy	Initial commit.	2025-02-24 18:58:19 +08:00
sh	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00
LEGAL.md	Initial commit.	2025-02-24 18:58:19 +08:00
LICENSE	Initial commit.	2025-02-24 18:58:19 +08:00
README.md	Xss/readme (#16 )	2025-03-31 18:57:39 +08:00
aggregate_acc_from_generated.py	Merge updates from ant repository. (#34 )	2025-04-27 11:09:25 +08:00
cf_elo_caculator.py	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00
code_eval.py	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00
data_loader.py	Initial commit.	2025-02-24 18:58:19 +08:00
eval_and_aggregate.py	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00
evaluate.py	[Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92 )	2025-06-08 21:11:35 +08:00
examples.py	Initial commit.	2025-02-24 18:58:19 +08:00
grader.py	Initial commit.	2025-02-24 18:58:19 +08:00
math_eval.py	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00
math_utils.py	Initial commit.	2025-02-24 18:58:19 +08:00
model_utils.py	Initial commit.	2025-02-24 18:58:19 +08:00
parser.py	PullRequest: 70 update evalution: aime25, gpqa	2025-03-30 20:16:08 +08:00
python_executor.py	Initial commit.	2025-02-24 18:58:19 +08:00
requirements.txt	[Doc & Fix] Simplify the environment setup procedure (#62 )	2025-06-01 14:57:21 +08:00
rm_maj_eval.py	[Bug] Fix the dependency of a virtual environment for sympy==1.12 (#92 )	2025-06-08 21:11:35 +08:00
trajectory.py	Initial commit.	2025-02-24 18:58:19 +08:00
utils.py	[Feature] Support behavior importance weight capping and update evaluation scripts (#59 )	2025-05-30 10:29:21 +08:00

README.md

Evaluate

This evaluation package is adapted from Qwen2.5-Math.

Install the following packages:

cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm --no-build-isolation
pip install transformers==4.47.0
pip install prettytable timeout_decorator
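
After installation, a quick sanity check can confirm that the pinned version is in place. The following is a minimal sketch using only the standard library; the helper name check_pins is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin taken from the install commands above; extend as needed.
EXPECTED = {"transformers": "4.47.0"}
# Packages that only need to be present (no version is pinned for them here).
REQUIRED = ["vllm", "prettytable", "timeout_decorator"]


def check_pins(expected=EXPECTED, required=REQUIRED, get_version=version):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, want in expected.items():
        try:
            have = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name} is not installed (expected {want})")
            continue
        if have != want:
            problems.append(f"{name}=={have} installed, expected {want}")
    for name in required:
        try:
            get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name} is not installed")
    return problems


if __name__ == "__main__":
    for line in check_pins() or ["environment looks consistent"]:
        print(line)
```

The get_version parameter exists only so the check can be exercised without touching the real environment.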

Run evaluation:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24 \
--max_gen_tokens 32768

Here --max_gen_tokens is the maximum number of tokens to generate; it defaults to 32768.
The results are saved in ${OUTPUT_PATH}/math_eval_32768.
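
To see what a run produced, the results directory can be walked with a few lines of Python. This is a sketch only: the function name list_result_files is hypothetical, and nothing is assumed about the files' names or format beyond their location under math_eval_32768.

```python
import os
from pathlib import Path


def list_result_files(output_path, subdir="math_eval_32768"):
    """Return all files under <output_path>/<subdir>, sorted by path."""
    root = Path(output_path) / subdir
    if not root.is_dir():
        raise FileNotFoundError(f"no results directory at {root}")
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # OUTPUT_PATH is the same variable used in the command above.
    try:
        for path in list_result_files(os.environ.get("OUTPUT_PATH", "outputs")):
            print(path)
    except FileNotFoundError as err:
        print(err)
```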

Evaluate AReaL-boba-RL-7B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba \
--temperature 1.0

Evaluate AReaL-boba-SFT-32B:

python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
--data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba-SFT \
--samples_per_node 2 --num_sample_nodes 16 \
--temperature 0.6
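
When both models are evaluated repeatedly, it can help to keep the per-model flags in one place. The sketch below assembles the argument list for eval_and_aggregate.py from the flag values shown in the two commands above; PRESETS and build_command are hypothetical names for illustration and are not part of the evaluation package.

```python
import os
import shlex

# Flag values copied from the two commands above. The preset keys are
# illustrative labels, not names the evaluation scripts know about.
PRESETS = {
    "AReaL-boba-RL-7B": {
        "prompt_type": "AReaL-boba",
        "temperature": "1.0",
    },
    "AReaL-boba-SFT-32B": {
        "prompt_type": "AReaL-boba-SFT",
        "samples_per_node": "2",
        "num_sample_nodes": "16",
        "temperature": "0.6",
    },
}


def build_command(preset, model_path, output_path,
                  data_names=("aime24", "aime25", "gpqa_diamond")):
    """Assemble the argv list for eval_and_aggregate.py from a preset."""
    argv = [
        "python", "eval_and_aggregate.py",
        "--model_path", model_path,
        "--output_path", output_path,
        "--data_names", ",".join(data_names),
    ]
    for flag, value in PRESETS[preset].items():
        argv += [f"--{flag}", value]
    return argv


if __name__ == "__main__":
    cmd = build_command(
        "AReaL-boba-SFT-32B",
        os.environ.get("MODEL_PATH", "/path/to/model"),
        os.environ.get("OUTPUT_PATH", "outputs"),
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell line-continuation pitfalls and makes it obvious when a flag such as --output_path would be passed more than once.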