mirror of https://github.com/inclusionAI/AReaL
fix readme in evaluation (#103)
Co-authored-by: hcy <hechuyi.hcy@antgroup.com>
# Evaluation

This evaluation package was adapted from [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math/tree/main).

The evaluation code is located in the `evaluation` folder of the repository. Following the previous tutorial, trained checkpoints will be saved under `${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/`.
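
For example, the `--model_path` passed to the evaluation commands below would point at one of these saved checkpoints. A sketch with hypothetical paths (the checkpoint subdirectory name and the output location are placeholders):

```bash
# Hypothetical paths: adjust fileroot, experiment, and trial names to your own run,
# and replace CHECKPOINT_DIR with the actual checkpoint folder inside the trial directory.
export MODEL_PATH=${fileroot}/checkpoints/${USER}/${experiment_name}/${trial_name}/CHECKPOINT_DIR
export OUTPUT_PATH=${fileroot}/eval-outputs/${experiment_name}/${trial_name}
```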

## Setup Evaluation Environment

> **Note**: Evaluation requires updates to certain Python libraries, so avoid using the training container or virtual environment for this task.

From the repository directory, create a new conda environment:

```bash
conda create -n areal-eval python=3.12
conda activate areal-eval
```

Install dependencies:

```bash
bash examples/env/scripts/setup-eval-pip-deps.sh
```

## Run Evaluation

Specify an `output_path` to save the test results; if not specified, the results are saved under `model_path`.

### Math Evaluation

```bash
cd evaluation
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names math_500,aime24,amc23 \
    --prompt_type qwen3-think \
    --task math &> eval_and_aggregate_parallel.log &
```
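
The job runs in the background via `nohup`, so you can follow its progress with:

```bash
tail -f eval_and_aggregate_parallel.log
```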

### Code Evaluation

**Obtaining Data:**
- Due to the size of code datasets (some test cases are relatively large), we have uploaded all our code datasets to [Hugging Face](https://huggingface.co/inclusionAI).
- Once you have downloaded a code dataset, place it under `./evaluation/data/` (one way to fetch it is sketched below).
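
A possible download command (a sketch: `huggingface-cli` comes from the `huggingface_hub` package, and `DATASET_NAME` is a placeholder for the actual dataset repository under the inclusionAI org):

```bash
# Sketch: fetch a code dataset from the inclusionAI org into ./evaluation/data/.
# Replace DATASET_NAME with the actual dataset repository name.
pip install -U huggingface_hub
huggingface-cli download inclusionAI/DATASET_NAME \
    --repo-type dataset \
    --local-dir ./evaluation/data/
```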

**Running Evaluation:**

```bash
cd evaluation
# Define these first so the $(( ... )) arithmetic expansion below works.
num_sample_nodes=8
samples_per_node=1
nohup python eval_and_aggregate.py \
    --model_path /path/to/checkpoint \
    --output_path /path/to/outputs \
    --max_gen_tokens 32768 \
    --data_names codeforces,lcb_v5 \
    --prompt_type qwen3-think-pure \
    --num_sample_nodes ${num_sample_nodes} \
    --samples_per_node ${samples_per_node} \
    --n_sampling $((num_sample_nodes * samples_per_node)) \
    --task code &> eval_and_aggregate_parallel.log &
```

### Command Line Parameters

- **`--model_path`**: Path to the saved model parameters.
- **`--output_path`**: Path to store generated answers and log files during evaluation.
- **`--data_names`**: Dataset(s) to evaluate; separate multiple datasets with commas. Available options:
  - Math: `math_500`, `aime24`, `aime25`, `amc23`
  - Code: `lcb_v5`, `lcb_v5_2410_2502`, `codeforces`, `code_contest_all`
- **`--max_gen_tokens`**: Maximum length of generated answers (default: 32768).
- **`--prompt_type`**: Prompt template to use. For our latest models we use `qwen3-think` for math datasets and `qwen3-think-pure` for code datasets; the earlier AReaL-boba releases used the `AReaL-boba` (RL) and `AReaL-boba-SFT` (SFT) templates. Feel free to customize prompts for your own dataset by referencing the existing templates in `utils.py`.
- **`--num_sample_nodes`**: Number of sampling seeds, used to ensure sampling diversity.
- **`--samples_per_node`**: Number of samples generated per seed for each problem.
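
The total number of samples drawn per problem is the product of the last two flags, which is exactly what the code-evaluation example above passes via `--n_sampling`:

```bash
# 8 seeds x 1 sample per seed = 8 total samples per problem
num_sample_nodes=8
samples_per_node=1
echo $((num_sample_nodes * samples_per_node))   # prints 8
```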

## Logs and Evaluation Results

Check `${output_path}/math_eval_${max_gen_tokens}/logs` to review the log of each worker.

The evaluation script will output a results table in the terminal:

```
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| dataset  | num_questions | greedy_length | sample_length | greedy_acc | sample_pass@1 | pass@8 | pass@16 |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
| math_500 | 500           | 6757.4        | 4139.5        | 84.4       | 92.7          | 97.3   | 97.7    |
| aime24   | 30            | 19328.0       | 13663.5       | 50.0       | 50.4          | 77.3   | 80.0    |
| amc23    | 40            | 8850.0        | 6526.2        | 80.0       | 90.5          | 96.8   | 98.8    |
+----------+---------------+---------------+---------------+------------+---------------+--------+---------+
```

### Metrics Explanation

- **`{greedy|sample}_length`**: Average answer length under the greedy or random sampling strategy.
- **`greedy_acc`**: Average accuracy under greedy decoding.
- **`sample_pass@{k}`**: Probability of generating a correct answer within `k` attempts under random sampling.
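
For reference, `pass@k` is commonly estimated with the unbiased estimator from the Codex paper (Chen et al., 2021), where `n` is the number of samples drawn per problem and `c` the number of correct ones; whether this script uses exactly this estimator is an assumption to verify against the code:

```
\mathrm{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
```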

For the Codeforces dataset, we use the Elo rating algorithm to evaluate model performance, following [CodeElo](https://github.com/QwenLM/CodeElo) and [rllm](https://github.com/agentica-project/rllm):

```
+------------+----------------+-----------+
| Dataset    | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9           | 590       |
+------------+----------------+-----------+
```

- **`CF Rating`**: The model's overall Elo rating across 57 Codeforces contests.
- **`Percentile`**: The model's Elo percentile among all Codeforces users.

> **Note**: Because the penalty mechanism can cause fluctuations in Elo ratings, we suggest running the evaluation multiple times and taking the average score as the final result.

## Configuration Details

### Sampling Parameters

- The evaluation script defaults to averaging over 32 samples at temperature 1.0; for the code datasets, we use 8 samples.
- We observed that vLLM's `enforce_eager` parameter significantly impacts evaluation performance. With `enforce_eager=True`, we can reproduce the model performance reported in previous work; without it, results may fall below the reported numbers. We therefore force `enforce_eager=True` during evaluation (see the sketch below).
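
For ad-hoc experiments outside this script, the same behavior can be toggled on vLLM's command line. A sketch (the evaluation code sets the parameter internally, and the `vllm serve` entrypoint assumes a recent vLLM release):

```bash
# Hypothetical manual run: disable CUDA graph capture to match the evaluation setting.
vllm serve /path/to/checkpoint --enforce-eager
```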

### Runtime Expectations

Due to the sampling requirements and the `enforce_eager` setting, the evaluation process typically takes considerable time.

Runtime depends on several factors:
- Maximum generation length
- Number of questions in the dataset
- Model size

**Performance benchmarks** (on 8x H100 GPUs):
- **AIME dataset**: ~80 minutes
- **MATH_500 dataset**: ~160 minutes