Xss/readme (#16)

* update readme
xssstory 2025-03-31 18:57:39 +08:00 committed by GitHub
parent 22838f3288
commit f4bd798ed9
3 changed files with 11 additions and 3 deletions


@@ -52,7 +52,7 @@ In the following table, we show the convergence time under different resource settings:
We use **R1-Distill-Qwen-7B** as our base model. After RL training, the pass@1 scores on AIME 2024 and AIME 2025 improve by 6.9 and 8.6 points, respectively, achieving SOTA performance among 7B models in mathematical reasoning. We have released the training data at [AReaL-boba-106k](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data/blob/main/AReaL-boba-106k.jsonl).
- Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. We plan to open-source more datasets in the future, including code, STEM, and other domains.
+ Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. We plan to open-source more datasets in the future, including code, STEM, and other domains. All reported numbers are re-evaluated using our own evaluation code; more details are given in our [blog](blog/AReaL_v0_2.md#eval_detail).
#### Approaching QwQ-32B performance using only 200 data samples
| **Model** | **AIME 2024** |


@@ -27,6 +27,14 @@ We are excited to release AReaL v0.2 (boba), featuring three major milestones:
*Table 1: The performance of AReaL-boba-RL-7B and AReaL-boba-SFT-32B. We obtain a SOTA 7B model using RL on math reasoning. Although our dataset primarily consists of math and logic problems, we observed that RL training led to measurable improvements on the challenging STEM benchmark GPQA. Additionally, we train a highly competitive 32B model using only 200 data samples, replicating **QwQ-32B's** inference performance on AIME 2024.*
+ <span id="eval_detail"></span>
+ Note that all reported numbers are based on our own re-evaluation, with each number representing the average over 32 sampled responses. The evaluation code is available in our [evaluation](/evaluation/) folder.
+ For the baselines and the SFT model AReaL-boba-SFT-32B, we follow the recommended configuration (temperature = 0.6, top_p = 0.95, as suggested by DeepSeek) and use the default R1-Distill-Qwen prompt template. For the RL-trained models, we retain the same temperature as used during RL rollout (1.0).
+ Notably, during GPQA testing, we found that the ground-truth answer to every question was "A." To address this bias, we randomized the answer options.
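As an illustration of this de-biasing step, here is a minimal sketch of shuffling the options while remapping the answer letter. The schema and helper name are hypothetical; the actual logic lives in the evaluation code.

```python
import random

def shuffle_options(question: dict) -> dict:
    """Shuffle the multiple-choice options of one GPQA-style question so the
    correct answer no longer always sits at "A", remapping the answer letter."""
    letters = ["A", "B", "C", "D"]
    # Text of the currently correct option (assumed fields: "options", "answer").
    correct_text = question["options"][letters.index(question["answer"])]
    shuffled = question["options"][:]
    random.shuffle(shuffled)
    return {
        **question,
        "options": shuffled,
        "answer": letters[shuffled.index(correct_text)],
    }
```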
### Training Speed Comparison
![throughput_comparision_with_v0.1.0.png](/assets/thpt_comparison.png)


@@ -29,7 +29,7 @@ Evaluate AReaL-boba-RL-7B:
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
- --data_names aime24,aime25 \
+ --data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba \
--temperature 1.0
```
@@ -39,7 +39,7 @@ Evaluate AReaL-boba-SFT-32B:
python eval_and_aggregate.py \
--model_path ${MODEL_PATH} \
--output_path ${OUTPUT_PATH} \
- --data_names aime24,aime25 \
+ --data_names aime24,aime25,gpqa_diamond \
--prompt_type AReaL-boba-SFT \
--samples_per_node 2 --num_sample_nodes 16 \
--temperature 0.6
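Note that `--samples_per_node 2 --num_sample_nodes 16` presumably yields 2 × 16 = 32 responses per question, matching the 32-sample averaging described in the blog. A minimal sketch of that aggregation follows; the helper names are hypothetical, and the actual implementation is `eval_and_aggregate.py`.

```python
import statistics

def pass_at_1(correct_flags: list[bool]) -> float:
    """Fraction of a question's sampled responses that are correct."""
    return sum(correct_flags) / len(correct_flags)

def average_pass_at_1(per_question: dict[str, list[bool]]) -> float:
    """Average pass@1 over all questions; each question carries
    samples_per_node * num_sample_nodes = 2 * 16 = 32 sampled responses."""
    return statistics.mean(pass_at_1(flags) for flags in per_question.values())
```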