Evaluation
The evaluation code is located in the evaluation folder of the repository. Following the previous tutorial, trained checkpoints will be saved under /storage/ray/experiments/checkpoints/root/.
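Checkpoint directories are named by epoch and global step (for example, epoch1epochstep20globalstep20 in the command further below). If several checkpoints exist, a small helper like the following (hypothetical, not part of the repository) can list them under the checkpoint root assumed in this tutorial:

from pathlib import Path

# Hypothetical helper: list saved checkpoints under the path used in this tutorial.
ckpt_root = Path("/storage/ray/experiments/checkpoints/root")
for ckpt in sorted(ckpt_root.glob("*/*/epoch*globalstep*")):
    print(ckpt)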
Set Up the Evaluation Environment
Start a new container to execute the evaluation script. Note: Evaluation requires updates to certain Python libraries, so avoid using the training container for this task.
docker run -d --name areal-eval --privileged --gpus all --network host --shm-size 700g -v /storage:/storage ghcr.io/inclusionai/areal-runtime:v0.3.0 /bin/bash -c "tail -f /dev/null"
docker exec -it areal-eval bash
Install Dependencies and Run Evaluation
Execute the following commands inside the Docker container:
cd /storage/codes/AReaL/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.8.5 --no-build-isolation
pip install transformers==4.51.1
pip install prettytable timeout_decorator
mkdir -p /storage/ray/eval_output/
nohup python eval_and_aggregate.py \
--model_path /storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/ \
--output_path /storage/ray/eval_output/ \
--data_names "math_500,aime24,amc23" \
--max_gen_tokens 32768 &> /storage/ray/eval_output/eval_and_aggregate_parallel.log &
Command Line Parameters
- --model_path: Path to the saved model parameters
- --output_path: Path where generated answers and log files are stored during evaluation
- --data_names: Dataset(s) to evaluate; separate multiple datasets with commas. Available options: math_500, math, gsm8k, train_amc_aime, aime24, amc23
- --max_gen_tokens: Maximum length of generated answers (default: 32768)
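As a concrete illustration of these flags, the sketch below launches the same script for a single dataset from Python; the values are examples only, and running the shell command shown earlier is equally valid:

import subprocess

# Illustrative single-dataset run; the flags are those documented above,
# the checkpoint path and values are examples.
subprocess.run(
    [
        "python", "eval_and_aggregate.py",
        "--model_path", "/storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/",
        "--output_path", "/storage/ray/eval_output/",
        "--data_names", "aime24",
        "--max_gen_tokens", "32768",
    ],
    cwd="/storage/codes/AReaL/evaluation",
    check=True,
)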
Evaluation Results
The evaluation script will output a results table in the terminal:
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500 | 6757.4 | 4139.5 | 84.4 | 92.7 | 97.3 | 97.7 |
| aime24 | 30 | 19328.0 | 13663.5 | 50.0 | 50.4 | 77.3 | 80.0 |
| amc23 | 40 | 8850.0 | 6526.2 | 80.0 | 90.5 | 96.8 | 98.8 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
Metrics Explanation
- {greedy|sample}_length: Average answer length under the greedy or random sampling strategy
- greedy_acc: Average accuracy under greedy sampling
- sample_pass@{k}: Probability of generating a correct answer within k attempts under random sampling
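For reference, pass@k can be computed from n samples per question with the standard unbiased estimator, sketched below; whether eval_and_aggregate.py uses this exact formula is an assumption on our part:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: probability that at least one of k attempts,
    # drawn from n generated samples of which c are correct, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples per question, 12 of them correct
print(pass_at_k(32, 12, 1))   # 0.375
print(pass_at_k(32, 12, 8))   # close to 1.0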
Configuration Details
Sampling Parameters
- The evaluation script defaults to averaging 32 samples with temperature 0.6
- We observed that the enforce_eager parameter in vLLM significantly impacts evaluation performance
- When enforce_eager=True, we can reproduce the model performance reported in previous work; without it, evaluation results may fall below the reported numbers
- Therefore, we set enforce_eager=True during evaluation (see the sketch below)
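The snippet below shows these settings expressed directly with the vLLM Python API. It is a standalone sketch, not how eval_and_aggregate.py is wired internally; the checkpoint path and prompt are placeholders.

from vllm import LLM, SamplingParams

llm = LLM(
    model="/storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/",
    enforce_eager=True,        # disable CUDA graph capture to reproduce reported accuracy
)
sampling = SamplingParams(
    n=32,                      # 32 samples per question
    temperature=0.6,
    max_tokens=32768,
)
outputs = llm.generate(["Solve: what is 1 + 1?"], sampling)
for completion in outputs[0].outputs:
    print(completion.text)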
Runtime Expectations
Due to the sampling requirements and the enforce_eager setting, the evaluation process typically takes considerable time.
Runtime depends on several factors:
- Maximum generation length
- Number of questions in the dataset
- Model size
Performance benchmarks (on 8x H100 GPUs):
- AIME dataset: ~80 minutes
- MATH_500 dataset: ~160 minutes