# Evaluation

This evaluation package was modified from Qwen2.5-Math.
The evaluation code is located in the `evaluation` folder of the repository. Following the previous tutorial, trained checkpoints will be saved under `${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/`.
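For example, with hypothetical values for these variables (substitute your own experiment settings), you can list the saved checkpoints like this:

```bash
# Hypothetical example values; replace with your own experiment settings.
fileroot=/storage/areal
experiment_name=my-math-experiment
trial_name=trial0

ls ${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/
```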
## Setup Evaluation Environment

**Note:** Evaluation requires updates to certain Python libraries, so avoid using the training container or virtual environment for this task.
From the repository directory, create a new conda environment:
```bash
conda create -n areal-eval python=3.12
conda activate areal-eval
```
Install dependencies:

```bash
bash examples/env/scripts/setup-eval-pip-deps.sh
```
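To sanity-check the new environment, you can verify that vLLM (which the evaluation relies on for generation) imports cleanly; this assumes the setup script installs it:

```bash
# Quick sanity check; assumes the setup script installs vLLM.
python -c "import vllm; print(vllm.__version__)"
```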
## Run Evaluation

Specify an `output_path` to save the test results. If not specified, the results will be saved in `model_path`.
### Math Evaluation

```bash
cd evaluation
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names math_500,aime24,amc23 \
    --prompt_type qwen3-think \
    --task math &> eval_and_aggregate_parallel.log &
```
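Because the job runs in the background under `nohup`, you can follow its progress with:

```bash
tail -f eval_and_aggregate_parallel.log
```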
### Code Evaluation

Obtaining data:

- Due to the size of code datasets (some test cases are relatively large), we have uploaded all our code datasets to Hugging Face.
- Once you have downloaded the code dataset, place it under `./evaluation/data/`.
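A download sketch using the Hugging Face CLI is shown below; the dataset repo ID here is a placeholder, so substitute the actual ID published for this project:

```bash
# <org>/<dataset-name> is a placeholder; use the dataset ID from the
# project's release notes or README.
huggingface-cli download <org>/<dataset-name> \
    --repo-type dataset \
    --local-dir ./evaluation/data/
```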
Running evaluation:

```bash
cd evaluation
# Define these as shell variables so the arithmetic expansion below works.
num_sample_nodes=8
samples_per_node=1
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names codeforces,lcb_v5 \
    --prompt_type qwen3-think-pure \
    --num_sample_nodes ${num_sample_nodes} \
    --samples_per_node ${samples_per_node} \
    --n_sampling $((num_sample_nodes * samples_per_node)) \
    --task code &> eval_and_aggregate_parallel.log &
```
## Command-Line Parameters

- `--model_path`: Path to the saved model parameters.
- `--output_path`: Path to store generated answers and log files during evaluation.
- `--data_names`: Dataset(s) to evaluate. Multiple datasets can be separated by commas. Available options:
  - Math: `math_500`, `aime24`, `aime25`, `amc23`
  - Code: `lcb_v5`, `lcb_v5_2410_2502`, `codeforces`, `code_contest_all`
- `--max_gen_tokens`: Maximum length of generated answers (default: 32768).
- `--prompt_type`: The prompt template to use. For our latest model, we use `qwen3-think` for math datasets and `qwen3-think-pure` for code datasets. Feel free to customize prompts for your own dataset by referencing the existing prompt templates in `utils.py` (see the sketch after this list).
- `--num_sample_nodes`: Number of sampling seeds, used to ensure sampling diversity.
- `--samples_per_node`: Number of samples to generate per seed for each problem.
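As a sketch of what a custom prompt template might look like: the tuple-of-strings layout below mirrors the convention in Qwen2.5-Math's `utils.py`, from which this package was adapted, but the exact structure in this repository may differ, so treat the entry name and format as assumptions to verify against the actual file:

```python
# Hypothetical entry added to PROMPT_TEMPLATES in evaluation/utils.py.
# The (input, output, splitter) tuple layout follows Qwen2.5-Math's
# convention; check the actual file before relying on it.
PROMPT_TEMPLATES = {
    "my-custom-think": (
        "Question: {input}\nLet's think step by step.",  # input template
        "{output}",                                      # output template
        "\n\n",                                          # splitter between examples
    ),
}
```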
## Logs and Evaluation Results

Check `${output_path}/math_eval_${max_gen_tokens}/logs` to review the log of each worker.
The evaluation script will output a results table in the terminal:
```text
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset  | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500           | 6757.4        | 4139.5        | 84.4       | 92.7          | 97.3   | 97.7    |
| aime24   | 30            | 19328.0       | 13663.5       | 50.0       | 50.4          | 77.3   | 80.0    |
| amc23    | 40            | 8850.0        | 6526.2        | 80.0       | 90.5          | 96.8   | 98.8    |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
```
### Metrics Explanation

- `{greedy|sample}_length`: Average answer length under the greedy or random sampling strategy.
- `greedy_acc`: Average accuracy under greedy decoding.
- `sample_pass@{k}`: Probability of generating a correct answer within `k` attempts under random sampling.
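For reference, `pass@k` is commonly computed with the unbiased estimator from the Codex paper; the sketch below illustrates the math and may differ from the exact implementation in `eval_and_aggregate.py`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n total samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct -> pass@8
print(pass_at_k(n=16, c=4, k=8))  # ~0.9615
```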
For the Codeforces dataset, we use the Elo rating algorithm to evaluate model performance, following CodeElo and rllm:
```text
+------------+----------------+-----------+
| Dataset    | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9           | 590       |
+------------+----------------+-----------+
```
- `CF Rating`: The overall Elo rating of the model across 57 Codeforces contests.
- `Percentile`: The model's Elo-rating percentile among all Codeforces users.

**Note:** As the penalty mechanism may cause fluctuations in Elo rankings, we suggest performing multiple evaluations and taking the average score as the final result.
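For background, ratings of this kind build on the standard Elo expected-score formula. The snippet below shows only that formula, not the contest-level procedure CodeElo uses (see the CodeElo paper for the exact algorithm):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A model rated 590 is expected to score ~0.24 against a 790-rated opponent.
print(elo_expected_score(590, 790))
```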
## Configuration Details

### Sampling Parameters

- The evaluation script defaults to averaging over 32 samples with temperature 1.0. For the code datasets, we use 8 samples.
- We observed that the `enforce_eager` parameter in vLLM significantly impacts evaluation performance. With `enforce_eager=True`, we can reproduce the model performance reported in previous work; without it, evaluation results may fall below reported performance. Therefore, we set `enforce_eager=True` during evaluation.
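The evaluation script configures this internally; as a standalone illustration, the flag corresponds to the following in vLLM's offline API:

```python
from vllm import LLM, SamplingParams

# enforce_eager=True disables CUDA-graph capture (eager PyTorch execution),
# which we found necessary to reproduce reported results, at some speed cost.
llm = LLM(model="/path/to/checkpoint", enforce_eager=True)
outputs = llm.generate(
    ["Prove that the sum of two even numbers is even."],
    SamplingParams(temperature=1.0, max_tokens=32768),
)
print(outputs[0].outputs[0].text)
```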
### Runtime Expectations

Due to the sampling requirements and the `enforce_eager` setting, the evaluation process typically takes considerable time.
Runtime depends on several factors:
- Maximum generation length
- Number of questions in the dataset
- Model size
Performance benchmarks (on 8x H100 GPUs):
- AIME dataset: ~80 minutes
- MATH_500 dataset: ~160 minutes