mirror of https://github.com/inclusionAI/AReaL
fix readme in evaluation (#103)
Co-authored-by: hcy <hechuyi.hcy@antgroup.com>
# Evaluation

This evaluation package was adapted from [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math/tree/main).

The evaluation code is located in the `evaluation` folder of the repository. Following the previous tutorial, trained checkpoints will be saved under `${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/`.
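
For example, the `--model_path` passed to the evaluation commands below would point at one of these saved checkpoints. A sketch with hypothetical paths (the checkpoint subdirectory name and the output location are placeholders):

```bash
# Hypothetical paths: adjust fileroot, experiment, and trial names to your own run,
# and replace CHECKPOINT_DIR with the actual checkpoint folder inside the trial directory.
export MODEL_PATH=${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/CHECKPOINT_DIR
export OUTPUT_PATH=${fileroot}/eval-outputs/${experiment_name}/${trial_name}
```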

## Setup Evaluation Environment

> **Note**: Evaluation requires updates to certain Python libraries, so avoid using the training container or virtual environment for this task.

From the repository directory, create a new conda environment:

```bash
conda create -n areal-eval python=3.12
conda activate areal-eval
```

Install dependencies:

```bash
bash examples/env/scripts/setup-eval-pip-deps.sh
```

## Run Evaluation

Specify an `output_path` to save the test results; if not specified, the results are saved under `model_path`.

### Math Evaluation

```bash
cd evaluation
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names math_500,aime24,amc23 \
    --prompt_type qwen3-think \
    --task math &> eval_and_aggregate_parallel.log &
```
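
The job runs in the background via `nohup`, so you can follow its progress with:

```bash
tail -f eval_and_aggregate_parallel.log
```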

### Code Evaluation

**Obtaining Data:**
- Due to the size of code datasets (some test cases are relatively large), we have uploaded all our code datasets to [Hugging Face](https://huggingface.co/inclusionAI).
- Once you have downloaded a code dataset, place it under `./evaluation/data/` (one way to fetch it is sketched below).
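
A possible download command (a sketch: `huggingface-cli` comes from the `huggingface_hub` package, and `DATASET_NAME` is a placeholder for the actual dataset repository under the inclusionAI org):

```bash
# Sketch: fetch a code dataset from the inclusionAI org into ./evaluation/data/.
# Replace DATASET_NAME with the actual dataset repository name.
pip install -U huggingface_hub
huggingface-cli download inclusionAI/DATASET_NAME \
    --repo-type dataset \
    --local-dir ./evaluation/data/
```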

**Running Evaluation:**

```bash
cd evaluation
# Define these first so the $(( ... )) arithmetic expansion below works.
num_sample_nodes=8
samples_per_node=1
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names codeforces,lcb_v5 \
    --prompt_type qwen3-think-pure \
    --num_sample_nodes ${num_sample_nodes} \
    --samples_per_node ${samples_per_node} \
    --n_sampling $((num_sample_nodes * samples_per_node)) \
    --task code &> eval_and_aggregate_parallel.log &
```

### Command Line Parameters

- **`--model_path`**: Path to the saved model parameters.
- **`--output_path`**: Path to store generated answers and log files during evaluation.
- **`--data_names`**: Dataset(s) to evaluate; separate multiple datasets with commas. Available options:
  - Math: `math_500`, `aime24`, `aime25`, `amc23`
  - Code: `lcb_v5`, `lcb_v5_2410_2502`, `codeforces`, `code_contest_all`
- **`--max_gen_tokens`**: Maximum length of generated answers (default: 32768).
- **`--prompt_type`**: Prompt template to use. For our latest models we use `qwen3-think` for math datasets and `qwen3-think-pure` for code datasets; the earlier AReaL-boba releases used the `AReaL-boba` (RL) and `AReaL-boba-SFT` (SFT) templates. Feel free to customize prompts for your own dataset by referencing the existing templates in `utils.py`.
- **`--num_sample_nodes`**: Number of sampling seeds, used to ensure sampling diversity.
- **`--samples_per_node`**: Number of samples generated per seed for each problem.
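
The total number of samples drawn per problem is the product of the last two flags, which is exactly what the code-evaluation example above passes via `--n_sampling`:

```bash
# 8 seeds x 1 sample per seed = 8 total samples per problem
num_sample_nodes=8
samples_per_node=1
echo $((num_sample_nodes * samples_per_node))   # prints 8
```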

## Logs and Evaluation Results

Check `${output_path}/math_eval_${max_gen_tokens}/logs` to review the log of each worker.

The evaluation script will output a results table in the terminal:

```
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset  | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500           | 6757.4        | 4139.5        | 84.4       | 92.7          | 97.3   | 97.7    |
| aime24   | 30            | 19328.0       | 13663.5       | 50.0       | 50.4          | 77.3   | 80.0    |
| amc23    | 40            | 8850.0        | 6526.2        | 80.0       | 90.5          | 96.8   | 98.8    |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
```

### Metrics Explanation

- **`{greedy|sample}_length`**: Average answer length under the greedy or random sampling strategy.
- **`greedy_acc`**: Average accuracy under greedy decoding.
- **`sample_pass@{k}`**: Probability of generating a correct answer within `k` attempts under random sampling.
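
For reference, `pass@k` is commonly estimated with the unbiased estimator from the Codex paper (Chen et al., 2021), where `n` is the number of samples drawn per problem and `c` the number of correct ones; whether this script uses exactly this estimator is an assumption to verify against the code:

```
\mathrm{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
```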

For the Codeforces dataset, we use the Elo rating algorithm to evaluate model performance, following [CodeElo](https://github.com/QwenLM/CodeElo) and [rllm](https://github.com/agentica-project/rllm):

```
+------------+----------------+-----------+
| Dataset    | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9           | 590       |
+------------+----------------+-----------+
```

- **`CF Rating`**: The model's overall Elo rating across 57 Codeforces contests.
- **`Percentile`**: The model's Elo percentile among all Codeforces users.

> **Note**: Because the penalty mechanism can cause fluctuations in Elo ratings, we suggest running the evaluation multiple times and taking the average score as the final result.

## Configuration Details

### Sampling Parameters

- The evaluation script defaults to averaging over 32 samples at temperature 1.0; for the code datasets, we use 8 samples.
- We observed that vLLM's `enforce_eager` parameter significantly impacts evaluation performance. With `enforce_eager=True`, we can reproduce the model performance reported in previous work; without it, results may fall below the reported numbers. We therefore force `enforce_eager=True` during evaluation (see the sketch below).
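
For ad-hoc experiments outside this script, the same behavior can be toggled on vLLM's command line. A sketch (the evaluation code sets the parameter internally, and the `vllm serve` entrypoint assumes a recent vLLM release):

```bash
# Hypothetical manual run: disable CUDA graph capture to match the evaluation setting.
vllm serve /path/to/checkpoint --enforce-eager
```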

### Runtime Expectations

Due to the sampling requirements and the `enforce_eager` setting, the evaluation process typically takes considerable time.

Runtime depends on several factors:
- Maximum generation length
- Number of questions in the dataset
- Model size

**Performance benchmarks** (on 8x H100 GPUs):
- **AIME dataset**: ~80 minutes
- **MATH_500 dataset**: ~160 minutes