[Feature] Support behavior importance weight capping and update evaluation scripts (#59)

* PullRequest: 168 Add Codeforces tests and fix other test issues

Merge branch areal-eval-0.3 of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/168

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* fix eval and add codeforces elo calc
* fix codeforce test
* fix qwen3 prompt
* change annotations to eng
* add code verify files

* PullRequest: 173 [FIX] Format code and fix a recover error in rollout worker

Merge branch fw/fix-rollout-recover of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/173

Reviewed-by: 温差 <xushusheng.xss@antgroup.com>


* format code and fix a recover error in rollout worker

* PullRequest: 171 Update evaluation documentation

Merge branch eval-doc of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/171?tab=diff

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* update eval doc
* complete eval doc
* complete eval doc
* fix ood info
* add data obtaining guide
* fix supported datasets

* PullRequest: 174 decouple max_behav_imp_weight and c_clip & track entropy, positve_seq_len and negative_seq_len

Merge branch xss/max_behav_imp_weight of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/174

Reviewed-by: 博惟 <bowei.fw@antgroup.com>


* decouple max_behav_imp_weight and c_clip
* rename log: positve_* -> correct_*, negative_* -> incorrect_*
* rename hyper-parameter: max_behav_imp_weight -> behav_imp_weight_cap

* PullRequest: 175 [Fix] Fix the "event loop is already running" error in ray scripts

Merge branch fw/fix-ray-asyncio of git@code.alipay.com:inclusionAI/AReaL.git into main
https://code.alipay.com/inclusionAI/AReaL/pull_requests/175

Reviewed-by: 晓雷 <meizhiyu.mzy@antgroup.com>


* format code and fix a recover error in rollout worker
* .

---------

Co-authored-by: 乘鹭 <hechuyi.hcy@antgroup.com>
Co-authored-by: 温差 <xushusheng.xss@antgroup.com>
Wei Fu 2025-05-30 10:29:21 +08:00 committed by GitHub
parent 0815bb6494
commit c0200f10d0
18 changed files with 2203 additions and 45 deletions


@ -11,7 +11,7 @@ docker run -d --name areal-eval --privileged --gpus all --network host --shm-siz
docker exec -it areal-eval bash
```
## Install Dependencies
Execute the following commands inside the Docker container:
@ -24,20 +24,58 @@ pip install -r requirements.txt
pip install vllm==0.8.5 --no-build-isolation
pip install transformers==4.51.1
pip install prettytable timeout_decorator
```
## Run Evaluation
Specify an output_path to save the test results (optional — if not specified, the results will be saved in `model_path`):
```bash
mkdir -p /storage/ray/eval_output/
```
### Math Eval
```bash
nohup python eval_and_aggregate.py \
--model_path /storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/ \
--output_path /storage/ray/eval_output/ \
--data_names "math_500,aime24,amc23" \
--max_gen_tokens 32768 &> /storage/ray/eval_output/eval_and_aggregate_parallel.log &
--max_gen_tokens 32768
--data_names math_500,aime24,amc23 \
--prompt_type qwen3-think \
--task math &> /storage/ray/eval_output/eval_and_aggregate_parallel.log &
```
### Code Eval
**Obtaining Data:**
- Because some test cases are relatively large, we upload all of our code datasets to Hugging Face: [todo:upload the code dataset to Huggingface]().
- Once you have downloaded the code dataset, place it under **`./evaluation/data/`** (a download sketch follows this list).
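A minimal download sketch using the Hugging Face Hub client; the dataset repository ID below is a placeholder, since the upload is still marked as a TODO above:

```python
from huggingface_hub import snapshot_download

# Placeholder repository ID: substitute the real dataset repo once it is published.
snapshot_download(
    repo_id="inclusionAI/AReaL-code-eval-data",  # hypothetical
    repo_type="dataset",
    local_dir="./evaluation/data/",
)
```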
**Running Eval:**
```bash
nohup python eval_and_aggregate.py \
--model_path /storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/ \
--output_path /storage/ray/eval_output/ \
--max_gen_tokens 32768 \
--data_names codeforces,lcb_v5 \
--prompt_type qwen3-think-pure \
--num_sample_nodes 8 \
--samples_per_node 1 \
--n_sampling $((num_sample_nodes * samples_per_node)) \
--task code &> /storage/ray/eval_output/eval_and_aggregate_parallel.log &
```
### Command Line Parameters
- **`--model_path`**: Path to the saved model parameters
- **`--output_path`**: Path to store generated answers and log files during evaluation
- **`--data_names`**: Dataset(s) to evaluate. Multiple datasets can be separated by commas. Available options:
  - math: `math_500`, `aime24`, `aime25`, `amc23`
  - code: `lcb_v5`, `lcb_v5_2410_2502`, `codeforces`, `code_contest_all`
- **`--max_gen_tokens`**: Maximum length of generated answers (default: 32768)
- **`--prompt_type`**: Prompt template to use. For our latest model, we use `qwen3-think` for the math datasets and `qwen3-think-pure` for the code datasets.
- **`--num_sample_nodes`**: Number of sampling seeds, used to ensure sampling diversity.
- **`--samples_per_node`**: Number of samples to generate per seed for each problem. The total sampling budget is `num_sample_nodes * samples_per_node`, which determines which `pass@{k}` metrics are reported (see the sketch below).
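A small sketch of how the sampling flags relate to the reported metrics, mirroring the thresholds in `eval_and_aggregate.py` (the concrete values are illustrative):

```python
# Illustrative values: the total sampling budget is num_sample_nodes * samples_per_node.
num_sample_nodes = 8
samples_per_node = 1
n_sampling = num_sample_nodes * samples_per_node

# pass@k (and maj@32 / pass@32) are only reported when enough samples exist.
for k in (1, 8, 16, 32):
    print(f"pass@{k}:", "reported" if n_sampling >= k else "-")
```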
## Evaluation Results
@ -59,11 +97,25 @@ The evaluation script will output a results table in the terminal:
- **`greedy_acc`**: Average accuracy under greedy sampling
- **`sample_pass@{k}`**: Probability of generating a correct answer within `k` attempts under random sampling, computed with the unbiased pass@k estimator sketched below
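A minimal sketch of the unbiased pass@k estimator used by the scripts (it mirrors `pass_at_k_v2` in `evaluation/code_eval.py`): with n generated samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn samples is correct."""
    if n - c < k:  # fewer than k incorrect samples: a correct one is always drawn
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 32 samples per problem, 10 of them correct.
print(pass_at_k(n=32, c=10, k=8))
```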
For the Codeforces dataset, we use an Elo rating algorithm to evaluate model performance, following [CodeElo](https://github.com/QwenLM/CodeElo) and [rllm](https://github.com/agentica-project/rllm):
```
+------------+----------------+-----------+
| Dataset | Percentile (%) | CF Rating |
+------------+----------------+-----------+
| codeforces | 17.9 | 590 |
+------------+----------------+-----------+
```
- **`CF Rating`**: The model's overall Elo rating across 57 Codeforces contests.
- **`Percentile`**: The model's Elo percentile among all Codeforces users.
**Note**: Because the penalty mechanism can cause fluctuations in Elo rankings, we suggest performing multiple evaluations and taking the average score as the final result. The rank-to-rating calculation is sketched below.
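For reference, the offline calculator (`cf_elo_caculator.py`, added in this commit) ranks the model's score and penalty against the cached contest standings, then inverts the Elo expected-rank ("seed") formula by binary search to obtain a rating; a minimal sketch of that inversion:

```python
def expected_rank(rating: float, opponent_ratings: list[float]) -> float:
    # Elo "seed": 1 plus the sum of each opponent's probability of beating `rating`.
    return 1.0 + sum(1.0 / (1.0 + 10 ** ((rating - r) / 400.0)) for r in opponent_ratings)

def rating_for_rank(rank: int, opponent_ratings: list[float], hi: int = 4000) -> int:
    # Binary search for the highest rating whose expected rank is still at least `rank`.
    lo = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_rank(mid, opponent_ratings) < rank:
            hi = mid
        else:
            lo = mid
    return lo
```

The percentile is then the resulting rating's position within the cached distribution of real user ratings.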
## Configuration Details
### Sampling Parameters
- The evaluation script defaults to averaging 32 samples with temperature 1.0; for the code datasets, we use 8 samples.
- We observed that the `enforce_eager` parameter in vLLM significantly impacts evaluation performance
- When `enforce_eager=True`, we can reproduce the model performance reported in previous work
- Without this setting, evaluation results may fall below the reported performance (a sketch of how the scripts construct the vLLM engine follows)
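A minimal sketch of how the evaluation scripts construct the vLLM engine (mirroring `setup()` in `evaluation/code_eval.py`); the model path here is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/storage/ray/experiments/checkpoints/root/my-exp/my-trial/epoch1epochstep20globalstep20/",
    tensor_parallel_size=1,
    enforce_eager=True,          # needed to reproduce previously reported results
    trust_remote_code=True,
    disable_sliding_window=True,
    max_model_len=32768,
    enable_chunked_prefill=False,
    swap_space=32,
)
sampling_params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=32768)
outputs = llm.generate(["<your prompt here>"], sampling_params)
```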


@ -0,0 +1,344 @@
import bisect
import json
import os
import re
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple
from tqdm import tqdm
def get_percentile(rating: float, sorted_ratings: List[float]) -> float:
idx = bisect.bisect_left(sorted_ratings, float(rating))
return round(idx / len(sorted_ratings) * 100, 1)
def read_ratings(file_path: str) -> List[float]:
with open(file_path, "r") as f:
ratings_dict = json.load(f)
sorted_ratings = []
for rating, count in ratings_dict.items():
sorted_ratings.extend([float(rating)] * count)
return sorted(sorted_ratings)
def load_cached_contest_data(cache_file_path: str) -> Dict:
if not os.path.exists(cache_file_path):
raise FileNotFoundError(f"Cache file does not exist: {cache_file_path}")
with open(cache_file_path, "r") as f:
data = json.load(f)
print(f"Loaded {len(data)} contest data from cache file")
return data
def get_contest_data_from_cache(
contest_id: int, cached_data: Dict
) -> Tuple[Optional[Dict], Optional[Dict]]:
contest_id_str = str(contest_id)
if contest_id_str not in cached_data:
print(f"Warning: Contest {contest_id} data not found in cache")
return None, None
contest_data = cached_data[contest_id_str]
try:
standings = contest_data["standings"]
rating_changes = contest_data["rating_changes"]
if standings.get("status") != "OK" or rating_changes.get("status") != "OK":
print(f"Warning: Contest {contest_id} cached data status abnormal")
return None, None
return standings, rating_changes
except KeyError as e:
print(f"Warning: Contest {contest_id} cached data structure abnormal: {e}")
return None, None
def calc_elo_rating_offline(
contest_id: int,
problem_status: Dict[str, List[bool]],
sorted_ratings: List[float],
cached_data: Dict,
pass_n=None,
) -> Optional[Tuple[int, float]]:
try:
standings, rating_changes = get_contest_data_from_cache(contest_id, cached_data)
if standings is None or rating_changes is None:
return None
handle_set: Set[str] = set()
try:
handle_set_standings = set(
standings["result"]["rows"][i]["party"]["members"][0]["handle"]
for i in range(len(standings["result"]["rows"]))
)
handle_set_ratings = set(
rating_changes["result"][i]["handle"]
for i in range(len(rating_changes["result"]))
)
handle_set = handle_set_standings.intersection(handle_set_ratings)
standings["result"]["rows"] = [
row
for row in standings["result"]["rows"]
if row["party"]["members"][0]["handle"] in handle_set
]
rating_changes["result"] = [
change
for change in rating_changes["result"]
if change["handle"] in handle_set
]
assert (
len(standings["result"]["rows"]) == len(rating_changes["result"])
and len(standings["result"]["rows"]) > 200
)
except Exception:
return None
if (
"result" not in standings
or "result" not in rating_changes
or len(standings["result"]["rows"]) != len(rating_changes["result"])
or len(standings["result"]["rows"]) <= 200
):
return None
max_rating = max(change["oldRating"] for change in rating_changes["result"])
score = 0
penalty = 0
for problem in standings["result"]["problems"]:
prob = f"{problem['contestId']}{problem['index']}"
if prob in problem_status:
if pass_n is None:
pass_n = len(problem_status[prob])
for ith, status in enumerate(problem_status[prob][:pass_n]):
if status == 1.0:
if "points" in problem:
score += max(0, problem["points"] - 50 * ith)
else:
score += 1
penalty += ith * 10
break
n = len(standings["result"]["rows"])
rank = n
for i in range(n):
if standings["result"]["rows"][i]["points"] < score or (
standings["result"]["rows"][i]["points"] == score
and standings["result"]["rows"][i]["penalty"] > penalty
):
rank = i
break
l, r = 0, max_rating + 100
while r - l > 1:
mid = (l + r) // 2
new_seed = 1
for i in range(n):
new_seed += 1 / (
1 + 10 ** ((mid - rating_changes["result"][i]["oldRating"]) / 400)
)
if new_seed < rank:
r = mid
else:
l = mid
percentile = get_percentile(l, sorted_ratings)
return l, percentile
except Exception as e:
print(f"Error calculating contest {contest_id} ELO rating: {e}")
return None
def format_grouped_contest_data(
submissions: List[List[bool]], problem_ids: List[str]
) -> List[Tuple[int, Dict[str, List[bool]]]]:
if len(submissions) != len(problem_ids):
raise ValueError("Length of submissions and problem_ids must be the same.")
grouped_data = defaultdict(dict)
for problem_id, submission in zip(problem_ids, submissions):
# Extract contest ID using regex to capture leading digits
match = re.match(r"(\d+)([A-Z].*)", problem_id)
if not match:
raise ValueError(f"Invalid problem ID format: {problem_id}")
contest_id = int(match.group(1))
problem_letter = match.group(0)
grouped_data[contest_id][problem_letter] = submission
combined_data = [
(contest_id, problems) for contest_id, problems in grouped_data.items()
]
return combined_data
def convert_score_to_cf_format(
all_samples: List[Dict], metadata: List[str]
) -> List[List[bool]]:
cf_results = []
sorted_samples = sorted(all_samples, key=lambda x: x["idx"])
for sample in sorted_samples:
if "score" in sample:
cf_results.append([bool(s) for s in sample["score"]])
else:
cf_results.append([False])
return cf_results
class CFEloCalculator:
def __init__(
self,
metadata_path: str = None,
ratings_path: str = None,
cache_file_path: str = None,
):
# Set default paths
# current_dir = os.path.dirname(os.path.abspath(__file__))
current_dir = "/storage/openpsi/data/code/test_set/codeforces"
self.metadata_path = metadata_path or os.path.join(
current_dir, "metadata_cf.json"
)
self.ratings_path = ratings_path or os.path.join(
current_dir, "ratings_2024.json"
)
self.cache_file_path = cache_file_path or os.path.join(
current_dir, "all_contest_data.json"
)
# Preload data
self._load_data()
def _load_data(self):
try:
self.sorted_ratings = read_ratings(self.ratings_path)
print(f"✓ Loaded {len(self.sorted_ratings)} historical rating data")
with open(self.metadata_path, "r") as file:
self.metadata = json.load(file)
print(f"✓ Loaded {len(self.metadata)} problem metadata")
self.cached_data = load_cached_contest_data(self.cache_file_path)
print(f"✓ Loaded cached data")
except Exception as e:
raise RuntimeError(f"Failed to load data files: {e}")
def calculate_elo(
self, all_samples: List[Dict], pass_n: int = 1, verbose: bool = True
) -> Optional[Dict]:
try:
if verbose:
print("\n" + "=" * 50)
print("Starting Codeforces ELO rating calculation...")
print("=" * 50)
# Convert data format
cf_results = convert_score_to_cf_format(all_samples, self.metadata)
if verbose:
print(f"✓ Converted {len(cf_results)} test results")
# Format data
model_results = format_grouped_contest_data(cf_results, self.metadata)
if verbose:
print(f"✓ Data grouped by {len(model_results)} contests")
# Calculate ELO rating for each contest
contest_elos = []
skipped_contests = []
iterator = (
tqdm(model_results, desc="Calculating ELO ratings")
if verbose
else model_results
)
for contest_id, problems in iterator:
elo_result = calc_elo_rating_offline(
contest_id, problems, self.sorted_ratings, self.cached_data, pass_n
)
if elo_result is not None:
contest_elos.append((contest_id, elo_result))
else:
skipped_contests.append(contest_id)
# Calculate average percentile
percentiles = [elo[1][1] for elo in contest_elos if elo[1] is not None]
ratings = [elo[1][0] for elo in contest_elos if elo[1] is not None]
if not percentiles:
print("Error: No valid percentiles calculated")
return None
estimated_rating = sum(ratings) / len(ratings)
est_percentile = get_percentile(estimated_rating, self.sorted_ratings)
# Display results
if verbose:
print("\n" + "=" * 50)
print("CODEFORCES EVALUATION RESULTS")
print("=" * 50)
print(f"Estimated percentile: {est_percentile:.1f}%")
print(f"Estimated Codeforces rating: {estimated_rating:.0f}")
if skipped_contests:
print(
f"Skipped contest IDs: {skipped_contests[:10]}{'...' if len(skipped_contests) > 10 else ''}"
)
print("=" * 50)
# Return detailed results
return {
"estimated_percentile": est_percentile,
"estimated_rating": estimated_rating,
"contests_processed": len(contest_elos),
"contests_skipped": len(skipped_contests),
"skipped_contest_ids": skipped_contests,
"individual_contest_results": [
{
"contest_id": contest_id,
"rating": rating,
"percentile": percentile,
}
for contest_id, (rating, percentile) in contest_elos
],
}
except Exception as e:
print(f"Error calculating CF ELO rating: {e}")
return None
def calculate_cf_elo_from_samples(
all_samples: List[Dict],
pass_n: int = 1,
metadata_path: str = None,
ratings_path: str = None,
cache_file_path: str = None,
verbose: bool = True,
) -> Optional[Dict]:
calculator = CFEloCalculator(metadata_path, ratings_path, cache_file_path)
return calculator.calculate_elo(all_samples, pass_n, verbose)

evaluation/code_eval.py

@ -0,0 +1,548 @@
import argparse
import ast
import re
import itertools
import json
import os
import random
import time
from datetime import datetime
from parser import *
import numpy as np
import ray
import torch
from code_verifier.local_verify import evaluate
from data_loader import load_data
from model_utils import generate_completions, load_hf_lm_and_tokenizer
from tqdm import tqdm
from trajectory import *
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils import construct_prompt, load_jsonl, save_jsonl, set_seed
from vllm import LLM, SamplingParams
def extract_python_code(text, min_length=20, strict_syntax=False):
code_pattern = r"(?i)```(?:python|py|cpp|CPP)?\s*\n?(.*?)\n?```"
code_blocks = re.findall(code_pattern, text, re.DOTALL)
valid_blocks = []
for block in code_blocks:
clean_block = block.strip()
if len(clean_block) < min_length:
continue
# verify code syntax
if strict_syntax:
try:
ast.parse(clean_block, mode="exec")
except (SyntaxError, IndentationError):
continue
valid_blocks.append(clean_block)
if not valid_blocks:
# logger.warning(f"failed to extract python code from {text}")
return None
# return the last code block
return valid_blocks[-1]
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--data_names", default="lcb", type=str)
parser.add_argument(
"--data_dir", default="/storage/openpsi/data/code/test_set", type=str
)
parser.add_argument("--model_name_or_path", default="gpt-4", type=str)
parser.add_argument("--output_dir", default="./output", type=str)
parser.add_argument("--prompt_type", default="tool-integrated", type=str)
parser.add_argument("--split", default="test", type=str)
parser.add_argument("--num_test_sample", default=-1, type=int) # -1 for full data
parser.add_argument("--seed", default=0, type=int)
parser.add_argument("--start", default=0, type=int)
parser.add_argument("--end", default=-1, type=int)
parser.add_argument("--temperature", default=0, type=float)
parser.add_argument("--n_sampling", default=1, type=int)
parser.add_argument("--top_p", default=1, type=float)
parser.add_argument("--top_k", default=-1, type=int)
parser.add_argument("--max_tokens_per_call", default=4096, type=int)
parser.add_argument("--shuffle", action="store_true")
parser.add_argument("--use_vllm", action="store_true")
parser.add_argument("--save_outputs", action="store_true")
parser.add_argument("--overwrite", action="store_true")
parser.add_argument("--use_safetensors", action="store_true")
parser.add_argument("--num_shots", type=int, default=0)
parser.add_argument(
"--apply_chat_template",
action="store_true",
help="Apply chat template to prompt.",
)
parser.add_argument("--tensor_parallel_size", type=int, default=1)
parser.add_argument(
"--adapt_few_shot",
action="store_true",
help="Few shot for multiple-choice questions, zero shot for others.",
)
args = parser.parse_args()
args.top_p = (
1 if args.temperature == 0 else args.top_p
) # top_p must be 1 when using greedy sampling (vllm)
args.top_k = -1 if args.temperature == 0 else args.top_k
available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
args.data_parallel_size = len(available_gpus) // args.tensor_parallel_size
return args
def pass_at_k_v2(data_list, k=8):
def cur_pass_k(n, c, k):
if n - c < k:
return 1.0
return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
# count, right_count = 0, 0
pass_at_ks = []
for sample in data_list:
assert len(sample["score"]) >= k, sample
correct = sum(sample["score"])
pass_at_ks.append(cur_pass_k(len(sample["score"]), correct, k))
return np.mean(pass_at_ks) * 100
def generate_in_parallel(requests, model_args, sampling_params, data_parallel_size):
@ray.remote
def run_inference_one_model(
model_args: dict, sampling_params, requests, cuda_visisble_devices
):
os.environ["VLLM_LOGGING_LEVEL"] = "ERROR"
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
[str(x) for x in cuda_visisble_devices]
)
# print("OS.ENVIRON", json.dumps({x: os.environ[x] for x in sorted(dict(os.environ))}))
llm = LLM(**model_args)
return llm.generate(requests, sampling_params=sampling_params)
# print("OUT_OS_ENVIRON", json.dumps({x: os.environ[x] for x in sorted(dict(os.environ))}))
all_cuda_visisble_devices = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
requests = [list(x) for x in distribute(data_parallel_size, requests)]
inputs = (
(model_args, sampling_params, req, cuda_visisble_devices)
for req, cuda_visisble_devices in zip(
requests, np.array_split(all_cuda_visisble_devices, data_parallel_size)
)
)
object_refs = [run_inference_one_model.remote(*x) for x in inputs]
results = ray.get(object_refs)
ray.shutdown()
return undistribute(results)
from itertools import islice, tee
def distribute(n, iterable):
"""Distribute the items from *iterable* among *n* smaller iterables.
>>> group_1, group_2 = distribute(2, [1, 2, 3, 4, 5, 6])
>>> list(group_1)
[1, 3, 5]
>>> list(group_2)
[2, 4, 6]
If the length of *iterable* is not evenly divisible by *n*, then the
length of the returned iterables will not be identical:
>>> children = distribute(3, [1, 2, 3, 4, 5, 6, 7])
>>> [list(c) for c in children]
[[1, 4, 7], [2, 5], [3, 6]]
If the length of *iterable* is smaller than *n*, then the last returned
iterables will be empty:
>>> children = distribute(5, [1, 2, 3])
>>> [list(c) for c in children]
[[1], [2], [3], [], []]
This function uses :func:`itertools.tee` and may require significant
storage.
If you need the order items in the smaller iterables to match the
original iterable, see :func:`divide`.
"""
if n < 1:
raise ValueError("n must be at least 1")
children = tee(iterable, n)
return [islice(it, index, None, n) for index, it in enumerate(children)]
def undistribute(iterable):
"""
Undoes https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.distribute .
Re-interleaves results that have been split using more_itertools.distribute:
>>> group_1, group_2 = distribute(2, [1, 2, 3, 4, 5, 6])
>>> list(group_1)
[1, 3, 5]
>>> list(group_2)
[2, 4, 6]
>>> undistribute([group_1, group_2])
[1, 2, 3, 4, 5, 6]
Handles non-uniform component lengths:
>>> children = distribute(3, [1, 2, 3, 4, 5, 6, 7])
>>> [list(c) for c in children]
[[1, 4, 7], [2, 5], [3, 6]]
>>> undistribute(children)
[1, 2, 3, 4, 5, 6, 7]
Also handles when some iterables are empty:
>>> children = distribute(5, [1, 2, 3])
>>> [list(c) for c in children]
[[1], [2], [3], [], []]
>>> undistribute(children)
[1, 2, 3]
"""
return [
x
for x in itertools.chain.from_iterable(
itertools.zip_longest(*[list(x) for x in iterable])
)
if x is not None
]
def prepare_data(data_name, args):
examples = load_data(data_name, args.split, args.data_dir)
# sample `num_test_sample` from dataset
if args.num_test_sample > 0:
# examples = random.sample(examples, min(args.num_test_sample, len(examples)))
examples = examples[: args.num_test_sample]
# shuffle
if args.shuffle:
raise RuntimeError
random.seed(datetime.now().timestamp())
random.shuffle(examples)
# select start and end
examples = examples[args.start : len(examples) if args.end == -1 else args.end]
# get out_file name
dt_string = datetime.now().strftime("%m-%d_%H-%M")
model_name = "/".join(args.model_name_or_path.split("/")[-2:])
out_file_prefix = f"{args.split}_{args.prompt_type}_{args.num_test_sample}_seed{args.seed}_t{args.temperature:.1f}_topp{args.top_p:.2f}_topk{args.top_k}"
output_dir = args.output_dir
if not os.path.exists(output_dir):
output_dir = f"outputs/{output_dir}"
eval_dir = f"math_eval_{args.max_tokens_per_call}"
out_file = f"{output_dir}/{eval_dir}/{data_name}/{out_file_prefix}_s{args.start}_e{args.end}_n{args.n_sampling}.jsonl"
os.makedirs(f"{output_dir}/{eval_dir}/{data_name}", exist_ok=True)
# load all processed samples
processed_samples = []
if not args.overwrite:
processed_files = [
f
for f in os.listdir(f"{output_dir}/{eval_dir}/{data_name}/")
# if f.endswith(".jsonl") and f.startswith(out_file_prefix)
if f == os.path.basename(out_file)
]
for f in processed_files:
processed_samples.extend(
list(load_jsonl(f"{output_dir}/{eval_dir}/{data_name}/{f}"))
)
# dedepulicate
idx2inoutput = {x["idx"]: x["input_output"] for x in examples}
processed_samples = {sample["idx"]: sample for sample in processed_samples}
for sample_idx in processed_samples:
processed_samples[sample_idx]["input_output"] = idx2inoutput[sample_idx]
processed_idxs = list(processed_samples.keys())
processed_samples = list(processed_samples.values())
examples = [example for example in examples if example["idx"] not in processed_idxs]
return examples, processed_samples, out_file
def setup(args):
# load model
available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
if args.use_vllm:
# breakpoint()
if args.data_parallel_size <= 1:
llm = LLM(
model=args.model_name_or_path,
tensor_parallel_size=args.tensor_parallel_size,
# distributed_executor_backend="ray",
enforce_eager=True,
# dtype="float16",
trust_remote_code=True,
disable_sliding_window=True,
max_model_len=32768,
enable_chunked_prefill=False,
swap_space=32,
)
else:
print(
f"TP = {args.tensor_parallel_size}\n",
f"DP = {args.data_parallel_size}",
)
llm = dict(
model=args.model_name_or_path,
tensor_parallel_size=args.tensor_parallel_size,
# distributed_executor_backend="ray",
trust_remote_code=True,
enforce_eager=True,
# dtype="float16",
disable_custom_all_reduce=True,
disable_sliding_window=True,
max_model_len=32768,
enable_chunked_prefill=False,
swap_space=32,
)
tokenizer = AutoTokenizer.from_pretrained(
args.model_name_or_path, trust_remote_code=True
)
else:
llm, tokenizer = load_hf_lm_and_tokenizer(
model_name_or_path=args.model_name_or_path,
load_in_half=True,
use_fast_tokenizer=True,
use_safetensors=args.use_safetensors,
)
# infer & eval
data_list = args.data_names.split(",")
results = []
for data_name in data_list:
results.append(main(llm, tokenizer, data_name, args))
# add "avg" result to data_list and results
data_list.append("avg")
results.append(
{
"acc": sum([result["acc"] for result in results]) / len(results),
}
)
# print all results
pad = max([len(data_name) for data_name in data_list])
print("\t".join(data_name.ljust(pad, " ") for data_name in data_list))
print("\t".join([f"{result['acc']:.1f}".ljust(pad, " ") for result in results]))
def main(llm, tokenizer, data_name, args):
available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
examples, processed_samples, out_file = prepare_data(data_name, args)
print("=" * 50)
print("data:", data_name, " ,remain samples:", len(examples))
# if len(examples) > 0:
# print(examples[0])
samples = []
for example in tqdm(examples, total=len(examples)):
idx = example["idx"]
full_prompt = construct_prompt(example, data_name, args)
if idx == args.start:
print(full_prompt)
sample = {
"idx": idx,
"question": example["question"],
"input_output": example["input_output"],
"prompt": full_prompt,
}
# add remain fields
# for key in [
# "level",
# "type",
# "unit",
# "solution_type",
# "choices",
# "solution",
# "ques_type",
# "ans_type",
# "answer_type",
# "dataset",
# "subfield",
# "filed",
# "theorem",
# "answer",
# ]:
# if key in example:
# sample[key] = example[key]
samples.append(sample)
# repeat n times
input_prompts = [
sample["prompt"] for sample in samples for _ in range(args.n_sampling)
]
if args.apply_chat_template:
input_prompts = [
tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize=False,
add_generation_prompt=True,
)
for prompt in input_prompts
]
remain_prompts = input_prompts
remain_prompts = [(i, prompt) for i, prompt in enumerate(remain_prompts)]
end_prompts = []
# max_func_call = 1 if args.prompt_type in ["cot", "pal"] else 4
max_func_call = 1 if args.prompt_type in ["cot", "pal"] or args.use_vllm else 4
stop_words = ["</s>", "<|im_end|>", "<|endoftext|>"]
# start inference
# measure time use
start_time = time.time()
for epoch in range(max_func_call):
print("-" * 20, "Epoch", epoch)
current_prompts = remain_prompts
if len(current_prompts) == 0:
break
# get all outputs
prompts = [item[1] for item in current_prompts]
if args.use_vllm:
sampling_params = SamplingParams(
temperature=args.temperature,
seed=args.seed,
top_p=args.top_p,
top_k=args.top_k,
max_tokens=args.max_tokens_per_call,
n=args.n_sampling,
stop=stop_words,
stop_token_ids=(
[151645, 151643]
if "qwen2" in args.model_name_or_path.lower()
else None
),
)
if args.data_parallel_size <= 1:
outputs = llm.generate(prompts[:: args.n_sampling], sampling_params)
else:
outputs = generate_in_parallel(
prompts[:: args.n_sampling],
llm,
sampling_params,
(
args.data_parallel_size - 1
if len(available_gpus) == 16
else args.data_parallel_size
),
)
outputs = sorted(
outputs, key=lambda x: int(x.request_id)
) # sort outputs by request_id
outputs = [x.text for output in outputs for x in output.outputs]
else:
outputs = generate_completions(
model=llm,
tokenizer=tokenizer,
prompts=prompts,
max_new_tokens=args.max_tokens_per_call,
batch_size=16,
stop_id_sequences=stop_words,
)
assert len(outputs) == len(current_prompts)
# process all outputs
remain_prompts = []
remain_codes = []
for (i, query), output in zip(current_prompts, outputs):
output = output.rstrip()
query += output
end_prompts.append((i, query))
# unsolved samples
print("Unsolved samples:", len(remain_prompts))
end_prompts.extend(remain_prompts)
# sort by idx
end_prompts = sorted(end_prompts, key=lambda x: x[0])
# remove input_prompt from end_prompt
codes = []
assert len(input_prompts) == len(end_prompts)
for i in range(len(input_prompts)):
_, end_prompt = end_prompts[i]
code = end_prompt.split(input_prompts[i])[-1].strip()
for stop_word in stop_words:
if stop_word in code:
code = code.split(stop_word)[0].strip()
codes.append(code)
if len(codes) != 0:
code_lens = tokenizer(codes, return_length=True)["length"]
else:
code_lens = []
# extract code
results = [extract_python_code(code) for code in codes]
time_use = time.time() - start_time
print("time_use", time_use)
# put results back to examples
all_samples = []
for i, sample in enumerate(samples):
code = codes[i * args.n_sampling : (i + 1) * args.n_sampling]
code_len = code_lens[i * args.n_sampling : (i + 1) * args.n_sampling]
preds = results[i * args.n_sampling : (i + 1) * args.n_sampling]
# preds = results
sample.pop("prompt")
sample.update({"code": code, "output_len": code_len, "pred": preds})
all_samples.append(sample)
# add processed samples
all_samples.extend(processed_samples)
all_samples, result_json = evaluate(samples=all_samples)
if args.n_sampling > 1:
result_json[f"pass@{args.n_sampling}"] = pass_at_k_v2(
all_samples, k=args.n_sampling
)
if args.n_sampling > 16:
result_json[f"pass@16"] = pass_at_k_v2(all_samples, k=16)
if args.n_sampling > 8:
result_json[f"pass@8"] = pass_at_k_v2(all_samples, k=8)
result_json[f"pass@1"] = pass_at_k_v2(all_samples, k=1)
# save outputs
# if len(processed_samples) < len(all_samples) and args.save_outputs:
if args.save_outputs:
for sample in all_samples:
sample.pop("input_output")
save_jsonl(all_samples, out_file)
result_json["time_use_in_second"] = time_use
result_json["time_use_in_minite"] = (
f"{int(time_use // 60)}:{int(time_use % 60):02d}"
)
with open(
out_file.replace(".jsonl", f"_{args.prompt_type}_metrics.json"), "w"
) as f:
json.dump(result_json, f, indent=4)
return result_json
if __name__ == "__main__":
args = parse_args()
set_seed(args.seed)
setup(args)


@ -0,0 +1,203 @@
import concurrent.futures
import json
import os
import signal
import subprocess
import sys
import time
import traceback
import uuid
from collections import defaultdict
from io import StringIO
import numpy as np
SINGLE_CASE_EXEC_TIMEOUT = 6
import logging
from utils import load_jsonl
logger = logging.getLogger("function call")
def capture_stdout(code):
original_stdout = sys.stdout
fake_stdout = StringIO()
try:
sys.stdout = fake_stdout
exec(code, {"__builtins__": __builtins__})
except Exception as e:
return f"error: {str(e)}, traceback: {traceback.format_exc()}"
finally:
sys.stdout = original_stdout
return fake_stdout.getvalue()
def call_verify(problem, generation, debug, timeout=SINGLE_CASE_EXEC_TIMEOUT):
tmp_id = str(uuid.uuid4())
input_data = {
"sample": problem,
"test": generation,
"debug": debug,
"timeout": timeout,
}
with open(f"/tmp/{tmp_id}-input.json", "w") as temp_file:
json.dump(input_data, temp_file)
start_time = time.time()
venv_python = "python3"
pro = subprocess.Popen(
" ".join(
[
venv_python,
"code_verifier/testing_util.py",
"--tmp_id",
tmp_id,
]
),
shell=True,
preexec_fn=os.setsid,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
)
try:
pro.wait(600)
except Exception as e:
pass
try:
os.killpg(os.getpgid(pro.pid), signal.SIGTERM)
except ProcessLookupError:
pass
result = {"result": [False], "info": {}}
try:
with open(f"/tmp/{tmp_id}-output.json", "r") as f:
result = json.load(f)
except FileNotFoundError as e:
logger.warning(
f"{problem['query_id']}: Failed to parse generated answers. FileNotFoundError. Set 0 reward."
)
except Exception as e:
logger.warning(
f"{problem['query_id']}: Failed to parse generated answers. {e}. Set 0 reward."
)
finally:
if os.path.exists(f"/tmp/{tmp_id}-input.json"):
os.remove(f"/tmp/{tmp_id}-input.json")
if os.path.exists(f"/tmp/{tmp_id}-output.json"):
os.remove(f"/tmp/{tmp_id}-output.json")
execution_time = time.time() - start_time
logger.info(
f'[call_verify] query_id: {problem["query_id"]}, start_time: {str(start_time)}, Time elapsed: {execution_time * 1000:.0f} ms'
)
return result["result"], result["info"]
def code_verify(id2info, generateds, query_ids, debug=False):
assert len(generateds) == len(query_ids)
problems = [id2info[qid] for qid in query_ids]
final_results = []
infer_args = []
for query_id, generated, problem in zip(query_ids, generateds, problems):
infer_args.append((problem, generated, debug, SINGLE_CASE_EXEC_TIMEOUT))
run_results = []
num_process = os.cpu_count()
with concurrent.futures.ProcessPoolExecutor(num_process) as executor:
run_results = executor.map(call_verify, *zip(*infer_args))
for run_result in run_results:
curr_res, metadata = run_result
if any(x != True for x in curr_res):
final_results.append(0)
else:
final_results.append(1)
return final_results
def evaluate(samples):
infer_args = []
scores = []
for sample in samples:
for pred in sample["pred"]:
problem = {
"input_output": sample["input_output"],
"query_id": sample["idx"],
}
infer_args.append((problem, pred, False, SINGLE_CASE_EXEC_TIMEOUT))
run_results = []
num_process = os.cpu_count()
with concurrent.futures.ProcessPoolExecutor(num_process) as executor:
run_results = executor.map(call_verify, *zip(*infer_args))
for run_result in run_results:
curr_res, metadata = run_result
if any(x != True for x in curr_res):
scores.append(0)
else:
scores.append(1)
idx = 0
score_mat = []
for sample in samples:
sample["score"] = scores[idx : idx + len(sample["pred"])]
assert len(sample["score"]) == len(sample["pred"])
score_mat.append(sample["score"])
idx += len(sample["pred"])
col_means = np.array(score_mat).mean(axis=0)
mean_score = list(np.round(col_means * 100, decimals=1))
max_len = max([len(s) for s in score_mat])
for i, s in enumerate(score_mat):
assert len(s) == max_len
result_json = {
"num_samples": len(samples),
"num_scores": len(scores),
"empty_samples": len([s for s in samples if not s["pred"][-1]]),
"acc": np.mean(mean_score),
}
return samples, result_json
if __name__ == "__main__":
# data_list = load_jsonl("functioncall/test/test_success_dataset.jsonl")
data_path = "/storage/openpsi/data/code/deepcoder/deepcoder_0415_v3_verify_new_correct.jsonl"
id2info = defaultdict(dict)
for item in load_jsonl(data_path):
id2info[item["query_id"]] = item
def create_test_params(count=10):
query_ids = []
generateds = []
cnt = 0
for d in load_jsonl(data_path):
if cnt >= count:
break
if not d["solutions"] or d["query_id"] not in id2info:
continue
query_ids.append(d["query_id"])
generateds.append(d["solutions"][0])
cnt += 1
return generateds, query_ids
generateds, query_ids = create_test_params(100)
scale = 1
print(f"generateds:, query_ids:{query_ids}")
result = code_verify(id2info, generateds * scale, query_ids * scale)
print(result)


@ -0,0 +1,789 @@
import ast
import faulthandler
import json
import platform
# to run the solution files we're using a timing based approach
import signal
import sys
import types
# used for debugging to time steps
from datetime import datetime
from enum import Enum
# for capturing the stdout
from io import StringIO
# used for testing the code that reads from input
from unittest.mock import mock_open, patch
import numpy as np
# Note: adapted from https://github.com/LiveCodeBench/LiveCodeBench
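# This module is run as a standalone subprocess by code_verifier/local_verify.py:
# it loads a /tmp/<tmp_id>-input.json payload, executes the candidate solution against
# the problem's test cases (call-based or stdin-based), and writes the per-case results
# to /tmp/<tmp_id>-output.json.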
class RuntimeModule:
@staticmethod
def from_string(name, _, code):
module = types.ModuleType(name)
exec(code, module.__dict__)
return module
def truncatefn(s, length=300):
assert isinstance(s, str)
if len(s) <= length:
return s
return s[: length // 2] + "...(truncated) ..." + s[-length // 2 :]
class CODE_TYPE(Enum):
call_based = 0
standard_input = 1
# stuff for setting up signal timer
class TimeoutException(Exception):
pass
def timeout_handler(signum, frame):
print("alarm went off")
# return
raise TimeoutException
signal.signal(signal.SIGALRM, timeout_handler)
# timeout = 6 # seconds
# used to capture stdout as a list
# from https://stackoverflow.com/a/16571630/6416660
# alternative use redirect_stdout() from contextlib
class Capturing(list):
def __enter__(self):
self._stdout = sys.stdout
sys.stdout = self._stringio = StringIO()
# Make closing the StringIO a no-op
self._stringio.close = lambda x: 1
return self
def __exit__(self, *args):
self.append(self._stringio.getvalue())
del self._stringio # free up some memory
sys.stdout = self._stdout
def only_int_check(val):
return isinstance(val, int)
def string_int_check(val):
return isinstance(val, str) and val.isdigit()
def combined_int_check(val):
return only_int_check(val) or string_int_check(val)
def run_test(sample, test=None, debug=False, timeout=6):
"""
if test(generated_code) is not None it'll try to run the code.
otherwise it'll just return an input and output pair.
"""
# Disable functionalities that can make destructive changes to the test.
reliability_guard()
if debug:
print(f"start = {datetime.now().time()}")
try:
in_outs = json.loads(sample["input_output"])
except ValueError:
in_outs = None
which_type = CODE_TYPE.call_based
if in_outs:
if in_outs.get("fn_name", "") == "":
which_type = CODE_TYPE.standard_input # Standard input
method_name = None
else:
which_type = CODE_TYPE.call_based # Call-based
method_name = in_outs["fn_name"]
if debug:
print(f"loaded input_output = {datetime.now().time()}")
if test is None:
return [False] * len(in_outs["inputs"]), {"error": "no test code provided"}
# assert False, "should not happen: test code is none"
return in_outs, {"error": "no test code provided"}
elif test is not None:
results = []
sol = "from string import *\nfrom re import *\nfrom datetime import *\nfrom collections import *\nfrom heapq import *\nfrom bisect import *\nfrom copy import *\nfrom math import *\nfrom random import *\nfrom statistics import *\nfrom itertools import *\nfrom functools import *\nfrom operator import *\nfrom io import *\nfrom sys import *\nfrom json import *\nfrom builtins import *\nfrom typing import *\nimport string\nimport re\nimport datetime\nimport collections\nimport heapq\nimport bisect\nimport copy\nimport math\nimport random\nimport statistics\nimport itertools\nimport functools\nimport operator\nimport io\nimport sys\nimport json\nsys.setrecursionlimit(6*10**5)\n"
if debug:
print(f"loading test code = {datetime.now().time()}")
if which_type == CODE_TYPE.call_based:
sol += test
if debug:
print(f"sol = {sol}")
signal.alarm(timeout)
try:
tmp_sol = RuntimeModule.from_string("tmp_sol", "", sol)
if "class Solution" not in test:
tmp = tmp_sol
else:
tmp = tmp_sol.Solution()
signal.alarm(0)
except Exception as e:
signal.alarm(0)
if debug:
print(f"type 0 compilation error = {e}")
results.append(-2)
return results, {
"error": repr(e),
"error_code": -1,
"error_message": "Compilation Error",
}
signal.alarm(0)
elif which_type == CODE_TYPE.standard_input:
# sol
# if code has if __name__ == "__main__": then remove it
try:
astree = ast.parse(test)
last_block = astree.body[-1]
if isinstance(last_block, ast.If):
condition = last_block.test
if ast.unparse(condition).strip() == "__name__ == '__main__'":
test = (
ast.unparse(astree.body[:-1])
+ "\n"
+ ast.unparse(last_block.body)
)
except:
pass
tmp_test = test.split("\n")
new_test = []
for x in tmp_test:
if (not x.startswith("from ")) and (not x.startswith("import ")):
new_test.append("\t" + x + "\n")
else:
new_test.append(x + "\n")
tmp_test = new_test
new_test = ""
started = False
for i in tmp_test:
if i.startswith("\t") and not started:
new_test += "stdin = sys.stdin\nstdout = sys.stdout\n"
new_test += "def code():\n"
new_test += i
started = True
elif started and ((i.startswith("from ")) or (i.startswith("import "))):
new_test += "\t" + i
else:
new_test += i
tmp_test = new_test
sol += tmp_test
if debug:
print(f"sol = {sol}")
method_name = "code"
signal.alarm(timeout)
try:
tmp_sol = RuntimeModule.from_string("tmp_sol", "", sol)
tmp = tmp_sol
signal.alarm(0)
except Exception as e:
signal.alarm(0)
if debug:
print(f"type 1 compilation error = {e}")
results.append(-2)
return results, {
"error": repr(e),
"error_code": -1,
"error_message": "Compilation Error",
}
signal.alarm(0)
if debug:
print(f"get method = {datetime.now().time()}")
try:
method = getattr(tmp, method_name) # get_attr second arg must be str
except:
signal.alarm(0)
e = sys.exc_info()
print(f"unable to get function error = {e}")
results.append(-2)
return results, {
"error": repr(e),
"error_code": -1,
"error_message": "Unable to extract code",
}
for index, inputs in enumerate(in_outs["inputs"]):
raw_inputs = inputs
raw_outputs = in_outs["outputs"][index]
if which_type == CODE_TYPE.call_based:
inputs = [json.loads(line) for line in inputs.split("\n")]
in_outs["outputs"][index] = json.loads(in_outs["outputs"][index])
truncate_line_size = 300 // (raw_inputs.count("\n") + 1)
raw_inputs = "\n".join(
[
truncatefn(line, truncate_line_size)
for line in raw_inputs.strip().split("\n")
]
)
raw_outputs = truncatefn(raw_outputs, 200)
else:
raw_inputs = truncatefn(raw_inputs)
raw_outputs = truncatefn(raw_outputs, 200)
# JSON forces dictionaries to have string keys; this undoes this (assuming a singleton list)
try:
if isinstance(inputs[0], dict):
inputs = [{int(k): v for k, v in inputs[0].items()}]
except:
True
try:
if isinstance(in_outs["outputs"][index], dict):
in_outs["outputs"][index] = [
{int(k): v for k, v in in_outs["outputs"][index].items()}
]
except:
True
try:
if isinstance(in_outs["outputs"][index][0], dict):
in_outs["outputs"][index] = [
{int(k): v for k, v in in_outs["outputs"][index][0].items()}
]
except:
True
if debug:
print(
f"time: {datetime.now().time()} testing index = {index} inputs = {inputs}, {type(inputs)}. type = {which_type}"
)
if which_type == CODE_TYPE.call_based: # Call-based
signal.alarm(timeout)
faulthandler.enable()
try:
output = method(*inputs)
raw_true_output = output
raw_true_output_copy = json.dumps(output)
raw_true_output_copy = truncatefn(raw_true_output_copy, 200)
# ground truth sequences are not tuples
if isinstance(output, tuple):
output = list(output)
tmp_result = output == in_outs["outputs"][index]
if (
isinstance(in_outs["outputs"][index], list)
and in_outs["outputs"][index]
):
tmp_result = tmp_result or (
output == in_outs["outputs"][index][0]
)
# ground truth sequences are not tuples
try:
if isinstance(output[0], tuple):
tmp_result = tmp_result or (
[list(x) for x in output]
== in_outs["outputs"][index][0]
)
except:
True
results.append(tmp_result)
if tmp_result != True:
return results, {
"output": raw_true_output_copy,
"expected": raw_outputs,
"inputs": raw_inputs,
"error_code": -2,
"error_message": "Wrong Answer",
}
# reset the alarm
signal.alarm(0)
except Exception as e:
signal.alarm(0)
faulthandler.disable()
if debug:
print(
f"Standard input runtime error or time limit exceeded error = {e}"
)
results.append(-1)
if "timeoutexception" in repr(e).lower():
return results, {
"error": repr(e),
"error_code": -3,
"error_message": "Time Limit Exceeded",
"inputs": raw_inputs,
"expected": raw_outputs,
}
else:
return results, {
"error": repr(e),
"error_code": -4,
"error_message": "Runtime Error",
"inputs": raw_inputs,
"expected": raw_outputs,
}
faulthandler.disable()
signal.alarm(0)
if debug:
print(
f"outputs = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs}, {type(inputs)}, {output == [in_outs['outputs'][index]]}"
)
elif which_type == CODE_TYPE.standard_input: # Standard input
faulthandler.enable()
passed = False
if isinstance(inputs, list):
inputs = "\n".join(inputs)
if isinstance(in_outs["outputs"][index], list):
in_outs["outputs"][index] = "\n".join(in_outs["outputs"][index])
signal.alarm(timeout)
with Capturing() as output:
try:
call_method(method, inputs)
# reset the alarm
signal.alarm(0)
passed = True
except Exception as e:
# runtime error or took too long
signal.alarm(0)
print(
f"Call-based runtime error or time limit exceeded error = {repr(e)}{e}"
)
results.append(-1)
if "timeoutexception" in repr(e).lower():
return results, {
"error": repr(e),
"error_code": -3,
"error_message": "Time Limit Exceeded",
"inputs": raw_inputs,
"expected": raw_outputs,
}
else:
return results, {
"error": repr(e),
"error_code": -4,
"error_message": "Runtime Error",
"inputs": raw_inputs,
"expected": raw_outputs,
}
signal.alarm(0)
raw_true_output = output[0]
raw_true_output_copy = truncatefn(raw_true_output, 200)
output = raw_true_output.splitlines()
if not passed:
if debug:
nl = "\n"
if not isinstance(inputs, list):
print(
f"not passed output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs.replace(nl,' new-line ')}, {type(inputs)}, {output == [in_outs['outputs'][index]]}"
)
else:
print(
f"not passed output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs}, {type(inputs)}, {output == [in_outs['outputs'][index]]}"
)
continue
if passed and debug:
print(
f"==> output = {output}, test outputs = {in_outs['outputs'][index]}"
)
if custom_compare_(output, in_outs["outputs"][index]):
tmp_result = True
results.append(tmp_result)
continue
# ground truth sequences are expressed as lists not tuples
if isinstance(output, tuple):
output = list(output)
tmp_result = False
try:
tmp_result = output == [in_outs["outputs"][index]]
if isinstance(in_outs["outputs"][index], list):
tmp_result = tmp_result or (output == in_outs["outputs"][index])
if isinstance(output[0], str):
tmp_result = tmp_result or (
[e.strip() for e in output] == in_outs["outputs"][index]
)
except Exception as e:
if debug:
print(f"Failed check1 exception = {e}")
pass
if tmp_result == True:
results.append(tmp_result)
continue
# try one more time without \n
if isinstance(in_outs["outputs"][index], list):
for tmp_index, i in enumerate(in_outs["outputs"][index]):
in_outs["outputs"][index][tmp_index] = i.split("\n")
in_outs["outputs"][index][tmp_index] = [
x.strip() for x in in_outs["outputs"][index][tmp_index] if x
]
else:
in_outs["outputs"][index] = in_outs["outputs"][index].split("\n")
in_outs["outputs"][index] = list(
filter(len, in_outs["outputs"][index])
)
in_outs["outputs"][index] = list(
map(lambda x: x.strip(), in_outs["outputs"][index])
)
try:
tmp_result = output == [in_outs["outputs"][index]]
if isinstance(in_outs["outputs"][index], list):
tmp_result = tmp_result or (output == in_outs["outputs"][index])
except Exception as e:
if debug:
print(f"Failed check2 exception = {e}")
pass
if tmp_result == True:
results.append(tmp_result)
continue
# try by converting the output into a split up list too
if isinstance(output, list):
output = list(filter(len, output))
if debug:
nl = "\n"
if not isinstance(inputs, list):
print(
f"@1 output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs.replace(nl,' new-line ')}, {type(inputs)}, {output == [in_outs['outputs'][index]]} {tmp_result=}"
)
else:
print(
f"@1 output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs}, {type(inputs)}, {output == [in_outs['outputs'][index]]} {tmp_result=}"
)
if tmp_result == True:
results.append(tmp_result)
continue
if debug:
print(f"{tmp_result=} @a")
try:
tmp_result = output == [in_outs["outputs"][index]]
if isinstance(in_outs["outputs"][index], list):
tmp_result = tmp_result or (output == in_outs["outputs"][index])
except Exception as e:
if debug:
print(f"Failed check3 exception = {e}")
pass
if debug:
print(f"{tmp_result=} @b")
try:
all_ints = all(
combined_int_check(e1) and combined_int_check(e2)
for e1, e2 in zip(output, in_outs["outputs"][index])
)
if not all_ints:
if debug:
print(
[
combined_int_check(e1) and combined_int_check(e2)
for e1, e2 in zip(output, in_outs["outputs"][index])
]
)
output_float = [float(e) for e in output]
gt_float = [float(e) for e in in_outs["outputs"][index]]
tmp_result = tmp_result or (
(len(output_float) == len(gt_float))
and np.allclose(output_float, gt_float)
)
except Exception as e:
pass
if debug:
print(f"{tmp_result=} @c")
try:
if isinstance(output[0], list):
all_ints = all(
combined_int_check(e1) and combined_int_check(e2)
for e1, e2 in zip(output[0], in_outs["outputs"][index])
)
if not all_ints:
output_float = [float(e) for e in output[0]]
gt_float = [float(e) for e in in_outs["outputs"][index][0]]
tmp_result = tmp_result or (
(len(output_float) == len(gt_float))
and np.allclose(output_float, gt_float)
)
except Exception as e:
pass
if tmp_result == True:
results.append(tmp_result)
continue
if debug:
print(f"{tmp_result=} @d")
# try by converting the stuff into split up list
if isinstance(in_outs["outputs"][index], list):
for tmp_index, i in enumerate(in_outs["outputs"][index]):
in_outs["outputs"][index][tmp_index] = set(i.split())
else:
in_outs["outputs"][index] = set(in_outs["outputs"][index].split())
if debug:
print(f"{tmp_result=} @e")
try:
tmp_result = output == in_outs["outputs"][index]
except Exception as e:
if debug:
print(f"Failed check4 exception = {e}")
continue
if tmp_result == True:
results.append(tmp_result)
continue
if debug:
print(f"{tmp_result=} @f")
# try by converting the output into a split up list too
if isinstance(output, list):
for tmp_index, i in enumerate(output):
output[tmp_index] = i.split()
output = list(filter(len, output))
for tmp_index, i in enumerate(output):
output[tmp_index] = set(i)
else:
output = output.split()
output = list(filter(len, output))
output = set(output)
if debug:
print(f"{tmp_result=} @g")
# try:
# tmp_result = set(frozenset(s) for s in output) == set(
# frozenset(s) for s in in_outs["outputs"][index]
# )
# except Exception as e:
# if debug:
# print(f"Failed check5 exception = {e}")
# if they are all numbers, round so that similar numbers are treated as identical
# try:
# all_ints = all(
# combined_int_check(e1) and combined_int_check(e2)
# for e1, e2 in zip(output, in_outs["outputs"][index])
# )
# tmp_result = tmp_result or (
# set(frozenset(round(float(t), 3) for t in s) for s in output)
# == set(
# frozenset(round(float(t), 3) for t in s)
# for s in in_outs["outputs"][index]
# )
# )
# except Exception as e:
# if debug:
# print(f"Failed check6 exception = {e}")
if debug:
print(f"{tmp_result=} @h")
if tmp_result == True and debug:
print("PASSED")
results.append(tmp_result)
if tmp_result != True:
return results, {
"output": raw_true_output_copy,
"expected": raw_outputs,
"inputs": raw_inputs,
"error_code": -2,
"error_message": "Wrong Answer",
}
if debug:
nl = "\n"
if not isinstance(inputs, list):
print(
f"@2 output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs.replace(nl,' new-line ')}, {type(inputs)}, {output == [in_outs['outputs'][index]]}"
)
else:
print(
f"@2 output = {output}, test outputs = {in_outs['outputs'][index]}, inputs = {inputs}, {type(inputs)}, {output == [in_outs['outputs'][index]]}"
)
print(f"results = {results}")
return results, {}
def custom_compare_(output, ground_truth):
if isinstance(output, list):
output_1 = "\n".join(output)
if stripped_string_compare(output_1, ground_truth):
return True
if isinstance(output, list):
output_2 = [o.lstrip().rstrip() for o in output]
output_2 = "\n".join(output_2)
if stripped_string_compare(output_2, ground_truth):
return True
return False
def stripped_string_compare(s1, s2):
s1 = s1.lstrip().rstrip()
s2 = s2.lstrip().rstrip()
return s1 == s2
def call_method(method, inputs):
if isinstance(inputs, list):
inputs = "\n".join(inputs)
inputs_line_iterator = iter(inputs.split("\n"))
# sys.setrecursionlimit(10000)
# @patch('builtins.input', side_effect=inputs.split("\n"))
@patch("builtins.open", mock_open(read_data=inputs))
@patch("sys.stdin", StringIO(inputs))
@patch("sys.stdin.readline", lambda *args: next(inputs_line_iterator))
@patch("sys.stdin.readlines", lambda *args: inputs.split("\n"))
@patch("sys.stdin.read", lambda *args: inputs)
# @patch('sys.stdout.write', print)
def _inner_call_method(_method):
try:
return _method()
except SystemExit as e:
pass
finally:
pass
return _inner_call_method(method)
def reliability_guard(maximum_memory_bytes=None):
"""
This disables various destructive functions and prevents the generated code
from interfering with the test (e.g. fork bomb, killing other processes,
removing filesystem files, etc.)
WARNING
This function is NOT a security sandbox. Untrusted code, including, model-
generated code, should not be blindly executed outside of one. See the
Codex paper for more information about OpenAI's code sandbox, and proceed
with caution.
"""
if maximum_memory_bytes is not None:
import resource
resource.setrlimit(
resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes)
)
resource.setrlimit(
resource.RLIMIT_DATA, (maximum_memory_bytes, maximum_memory_bytes)
)
if not platform.uname().system == "Darwin":
resource.setrlimit(
resource.RLIMIT_STACK, (maximum_memory_bytes, maximum_memory_bytes)
)
faulthandler.disable()
import builtins
# builtins.exit = None
builtins.quit = None
import os
if os.putenv:
os.environ["OMP_NUM_THREADS"] = "20"
os.kill = None
os.system = None
os.putenv = None
os.remove = None
os.removedirs = None
os.rmdir = None
os.fchdir = None
os.setuid = None
os.fork = None
os.forkpty = None
os.killpg = None
os.rename = None
os.renames = None
os.truncate = None
os.replace = None
os.unlink = None
os.fchmod = None
os.fchown = None
os.chmod = None
os.chown = None
os.chroot = None
os.fchdir = None
os.lchflags = None
os.lchmod = None
os.lchown = None
os.getcwd = None
os.chdir = None
import shutil
shutil.rmtree = None
shutil.move = None
shutil.chown = None
import subprocess
subprocess.Popen = None # type: ignore
# __builtins__["help"] = None
import sys
sys.modules["ipdb"] = None
sys.modules["joblib"] = None
sys.modules["resource"] = None
sys.modules["psutil"] = None
sys.modules["tkinter"] = None
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--tmp_id", type=str, required=True)
args = parser.parse_args()
all_input_data = []
with open(f"/tmp/{args.tmp_id}-input.json", "r") as temp_file:
input_data = json.load(temp_file)
result, info = run_test(**input_data)
saved_result = {"result": result, "info": info}
with open(f"/tmp/{args.tmp_id}-output.json", "w", encoding="utf-8") as temp_file:
json.dump(saved_result, temp_file)


@ -5,6 +5,7 @@ import subprocess
from glob import glob
import numpy as np
from cf_elo_caculator import calculate_cf_elo_from_samples
from rm_maj_eval import group_pred
from tqdm import tqdm
from transformers import AutoTokenizer
@ -13,12 +14,10 @@ from utils import load_jsonl
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--data_names", default="aime24,aime25", type=lambda x: x.split(",")
)
parser.add_argument("--data_names", required=True, type=lambda x: x.split(","))
parser.add_argument(
"--model_path",
default="/storage/openpsi/models/Qwen__Qwen2-1.5B-Instruct/",
required=True,
type=str,
)
parser.add_argument("--output_path", type=str)
@ -32,9 +31,23 @@ def parse_args():
parser.add_argument("--temperature", default=0.6, type=float)
parser.add_argument("--top_p", default=0.95, type=float)
parser.add_argument("--top_k", default=-1, type=int)
parser.add_argument("--task", default="math", type=str)
parser.add_argument(
"--cf_cache_file", type=str, default="./data/codeforces/all_contest_data.json"
)
parser.add_argument(
"--cf_metadata_path", type=str, default="./data/codeforces/metadata_cf.json"
)
parser.add_argument(
"--cf_ratings_path", type=str, default="./data/codeforces/sorted_rating.json"
)
args = parser.parse_args()
if args.output_path is None:
args.output_path = args.model_path
args.cf_pass_n = args.num_sample_nodes
return args
@ -73,6 +86,45 @@ def pass_at_k(data_list, k=8):
return np.mean(pass_at_ks) * 100
def calculate_cf_elo_for_dataset(data_name, results, args):
if "codeforces" not in data_name.lower():
return None
try:
print(f"\nCalculating CF ELO rating for dataset {data_name}...")
print(f"Dataset contains {len(results)} problems")
all_samples = []
for idx, sample_data in results.items():
sample = {"idx": idx, "score": sample_data["score"]}
all_samples.append(sample)
cf_result = calculate_cf_elo_from_samples(
all_samples=all_samples,
pass_n=args.cf_pass_n,
metadata_path=args.cf_metadata_path,
ratings_path=args.cf_ratings_path,
cache_file_path=args.cf_cache_file,
verbose=True,
)
if cf_result:
print(f"{data_name} CF ELO calculation success:")
print(f" Estimated percentile: {cf_result['estimated_percentile']:.1f}%")
print(f" Estimated CF rating: {cf_result['estimated_rating']:.0f}")
return {
"cf_percentile": cf_result["estimated_percentile"],
"cf_rating": cf_result["estimated_rating"],
}
else:
print(f"{data_name} CF ELO calculation failed")
return None
except Exception as e:
print(f"{data_name} CF ELO calculation error: {e}")
return None
def get_metrics(fname_pattern, tokenizer, is_greedy):
generated = []
@@ -109,14 +161,32 @@ def get_metrics(fname_pattern, tokenizer, is_greedy):
"greedy_length": np.mean(lengths),
"greedy_acc": pass_at_k(results.values(), 1),
"num_questions": len(lengths),
}
}, results
else:
return {
"sample_length": np.mean(lengths),
"sample_pass@1": pass_at_k(results.values(), 1),
"pass@8": pass_at_k(results.values(), 8),
"pass@16": pass_at_k(results.values(), 16),
}
"pass@8": (
pass_at_k(results.values(), 8)
if args.num_sample_nodes * args.samples_per_node >= 8
else "-"
),
"pass@16": (
pass_at_k(results.values(), 16)
if args.num_sample_nodes * args.samples_per_node >= 16
else "-"
),
"maj@32": (
eval_maj_k_metrics(results.values(), 32)
if args.num_sample_nodes * args.samples_per_node >= 32
else "-"
),
"pass@32": (
pass_at_k(results.values(), 32)
if args.num_sample_nodes * args.samples_per_node >= 32
else "-"
),
}, results
def process_single_data_name(args, data_name, base_dir, tokenizer):
@@ -124,10 +194,10 @@ def process_single_data_name(args, data_name, base_dir, tokenizer):
greedy_prefix = f"test_{args.prompt_type}_-1_seed0_t0.0_topp1.00_topk-1_s0_e-1_n1"
sampling_prefix = f"test_{args.prompt_type}_-1_seed*_t{args.temperature:.1f}_topp{args.top_p:.2f}_topk{args.top_k}_s0_e-1_n{args.samples_per_node}"
greedy_length_metrics = get_metrics(
greedy_length_metrics, _ = get_metrics(
os.path.join(cur_dir, greedy_prefix + ".jsonl"), tokenizer, True
)
sampling_metrics = get_metrics(
sampling_metrics, sampling_results = get_metrics(
os.path.join(cur_dir, sampling_prefix + ".jsonl"), tokenizer, False
)
@@ -140,6 +210,10 @@ def process_single_data_name(args, data_name, base_dir, tokenizer):
**sampling_metrics,
)
cf_elo_result = calculate_cf_elo_for_dataset(data_name, sampling_results, args)
if cf_elo_result:
output.update(cf_elo_result)
return output
@@ -171,6 +245,7 @@ if __name__ == "__main__":
",".join(args.data_names),
args.prompt_type,
args.output_path,
args.task,
],
text=True,
stdout=f,
@@ -199,6 +274,7 @@ if __name__ == "__main__":
str(args.temperature),
str(args.top_p),
str(args.top_k),
args.task,
],
text=True,
stdout=f,
@@ -220,6 +296,14 @@ if __name__ == "__main__":
with open(result_path) as f:
all_results = json.load(f)
cf_elo_summary = {}
for data_name, result in all_results.items():
if any(key.startswith("cf_") for key in result.keys()):
cf_elo_summary[data_name] = {
"percentile": result.get("cf_percentile"),
"rating": result.get("cf_rating"),
}
try:
from prettytable import PrettyTable
@@ -227,7 +311,44 @@ if __name__ == "__main__":
field_names = ["dataset"] + list(all_results[args.data_names[0]].keys())
table.field_names = field_names
for k, v in all_results.items():
table.add_row([k, *[round(v[x], 1) for x in field_names[1:]]])
formatted_values = []
for field in field_names[1:]:
value = v.get(field, "-")
if isinstance(value, (int, float)) and value != "-":
if field.startswith("cf_") and "rating" in field:
formatted_values.append(f"{value:.0f}")
elif field.startswith("cf_"):
formatted_values.append(f"{value:.1f}")
else:
formatted_values.append(f"{value:.1f}")
else:
formatted_values.append(str(value))
table.add_row([k] + formatted_values)
if cf_elo_summary:
print("\n" + "=" * 50)
print("CODEFORCES ELO rating summary")
print("=" * 50)
cf_table = PrettyTable()
cf_table.field_names = ["Dataset", "Percentile (%)", "CF Rating"]
for data_name, cf_data in cf_elo_summary.items():
cf_table.add_row(
[
data_name,
(
f"{cf_data['percentile']:.1f}"
if cf_data["percentile"] is not None
else "-"
),
(
f"{cf_data['rating']:.0f}"
if cf_data["rating"] is not None
else "-"
),
]
)
print(cf_table)
print("=" * 50)
print(table)
except ModuleNotFoundError as e:
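
`pass_at_k` and `eval_maj_k_metrics` are only called in this diff; their bodies are not shown. For reference, a common way to realize them is the unbiased pass@k estimator from the Codex paper plus simple majority voting, sketched below under that assumption (not necessarily the exact implementation used by the evaluation script):

```python
from collections import Counter
import numpy as np

def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def maj_at_k(answers: list, scores: list, k: int) -> float:
    """Majority-vote accuracy over the first k answers; scores[i] > 0 marks
    answers[i] as correct. Ties are broken by first occurrence."""
    top = Counter(answers[:k]).most_common(1)[0][0]
    return float(scores[answers.index(top)] > 0)
```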

View File

@@ -305,7 +305,7 @@ def setup(args):
llm = LLM(
model=args.model_name_or_path,
tensor_parallel_size=args.tensor_parallel_size,
distributed_executor_backend="ray",
# distributed_executor_backend="ray",
trust_remote_code=True,
enforce_eager=True,
# dtype="float16",
@@ -323,7 +323,7 @@ def setup(args):
llm = dict(
model=args.model_name_or_path,
tensor_parallel_size=args.tensor_parallel_size,
distributed_executor_backend="ray",
# distributed_executor_backend="ray",
trust_remote_code=True,
enforce_eager=True,
# dtype="float16",

View File

@@ -13,12 +13,13 @@ PROMPT_TYPE=${4:-"qwen-boxed"}
SPLIT="test"
NUM_TEST_SAMPLE=-1
OUTPUT_DIR=${5:-$MODEL_NAME_OR_PATH}
TASK=${6:-"math"}
# English open datasets
# DATA_NAME="math_500,math,gsm8k,train_amc_aime,aime24,amc23"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
python3 -u ${TASK}_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--output_dir ${OUTPUT_DIR} \

View File

@@ -20,13 +20,14 @@ OUTPUT_DIR=${7:-$MODEL_NAME_OR_PATH}
temperature=${8:-"1.0"}
top_p=${9:-"1.0"}
top_k=${10:-"-1"}
TASK=${11:-"math"}
# English open datasets
# DATA_NAME="math_500,math,gsm8k,train_amc_aime,aime24,amc23"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
python3 -u ${TASK}_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--output_dir ${OUTPUT_DIR} \

View File

@@ -13,6 +13,7 @@ def set_seed(seed: int = 42) -> None:
np.random.seed(seed)
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
print(f"Random seed set as {seed}")
def load_jsonl(file: Union[str, Path]) -> Iterable[Any]:
@@ -143,18 +144,43 @@ PROMPT_TEMPLATES = {
"{output}",
"\n\n",
),
"AReaL-boba-SFT": (
"<begin▁of▁sentence><User>{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<Assistant><think>\n",
"qwen3-think": (
"<|im_start|>user\n{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}./think<|im_end|>\n"
"<|im_start|>assistant\n<think>",
"{output}",
"\n\n",
),
"AReaL-boba": (
"qwen3": (
"<|im_start|>user\n{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}./no_think<|im_end|>\n"
"<|im_start|>assistant\n<think></think>",
"{output}",
"\n\n",
),
"qwen3-think-pure": (
"<|im_start|>user\n{input}\n/think<|im_end|>\n"
"<|im_start|>assistant\n<think>",
"{output}",
"\n\n",
),
"deepscaler-fix": (
"<User>{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<Assistant><think>\n",
"{output}",
"\n\n",
),
"r1-distilled-qwen": (
"<begin▁of▁sentence><User>{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<Assistant><think>\n",
"<User>{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<Assistant><think>\n",
"{output}",
"\n\n",
),
"nvidia-opencode": (
"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n",
"<|im_start|>user\nYou are a helpful and harmless assistant. You should think step-by-step before responding to the instruction below. Please use python programming language only.\n\nYou must use ```python for just the final solution code block with the following format:\n```python\n# Your code here\n```\n\n{input}\n<|im_end|>\n",
"<|im_start|>assistant\n",
"{output}",
"\n\n",
),
"r1-distilled-pure": (
"<User>{input}<Assistant><think>\n",
"{output}",
"\n\n",
),
@@ -171,8 +197,8 @@ PROMPT_TEMPLATES = {
"\n\n",
),
"r1-zero-box": (
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
"The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it."
"The assistant first thinks about the reasoning process in the mind and then provides the user with the answer."
"The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. \nUser: \n{input}\n"
"Please put your final answer within \\boxed{{}}.\n"
"Assistant: \n",

View File

@@ -689,6 +689,12 @@ class PPOHyperparameters:
default=False,
metadata={"help": "Use the decoupled loss. recompute_logprob must be True."},
)
behav_imp_weight_cap: Optional[float] = field(
default=None,
metadata={
"help": "We filter out the tokens where behav_imp_weight exceeds behav_imp_weight_cap when computing the loss, must be > 1.0, use_decoupled_loss must be true"
},
)
## Experiment utilities. ##
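
`behav_imp_weight_cap` is only meaningful together with the decoupled loss. A minimal sketch of a consistent setting, assuming the `use_decoupled_loss` and `recompute_logprob` knobs referenced in the help strings live on the same configuration (only `use_decoupled_loss` and `behav_imp_weight_cap` are visible in this hunk, and the dataclass's import path is not shown):

```python
# Values passed wherever PPOHyperparameters is constructed, e.g.
# ppo = PPOHyperparameters(**ppo_kwargs); the import path is not shown in this diff.
ppo_kwargs = dict(
    use_decoupled_loss=True,   # behav_imp_weight_cap requires the decoupled loss
    recompute_logprob=True,    # required by use_decoupled_loss, per its help text
    behav_imp_weight_cap=5.0,  # must be > 1.0; tokens with a larger behavior
                               # importance weight are dropped from the PPO loss
)
```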

View File

@@ -107,6 +107,7 @@ class PPOMATHConfig(CommonExperimentConfig, PPOMATHExperimentOptions):
"mask_too_long": self.mask_too_long,
"sample_reuse": self.ppo.actor_sample_reuse,
"c_clip": self.ppo.c_clip,
"behav_imp_weight_cap": self.ppo.behav_imp_weight_cap,
},
)

View File

@@ -385,14 +385,27 @@ class SGLangGenerationEngine(PipelinableEngine):
dist.barrier(group=constants.tensor_parallel_cpu_group())
return None, None, None
results = asyncio.run(
self.async_generate(
input_=input_,
mb_spec=mb_spec,
tokenizer=tokenizer,
gconfig=gconfig,
)
)
def run_in_thread():
# Create a new event loop for this thread
new_loop = asyncio.new_event_loop()
asyncio.set_event_loop(new_loop)
try:
return new_loop.run_until_complete(
self.async_generate(
input_=input_,
mb_spec=mb_spec,
tokenizer=tokenizer,
gconfig=gconfig,
)
)
finally:
new_loop.close()
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
future = executor.submit(run_in_thread)
results = future.result()
dist.barrier(group=constants.tensor_parallel_cpu_group())
return results
@@ -401,14 +414,13 @@ class SGLangGenerationEngine(PipelinableEngine):
dist.barrier(group=constants.tensor_parallel_cpu_group())
return
async def _fn():
async with SGLangAPIClient(
generate_url=self.api_urls["generate"],
update_weights_url=self.api_urls["update_weights_from_disk"],
) as client:
await client.async_update_weights_from_disk(path)
asyncio.run(_fn())
resp = requests.post(
url=self.api_urls["update_weights_from_disk"],
json=dict(model_path=path),
)
resp.raise_for_status()
res = resp.json()
assert res["success"]
dist.barrier(group=constants.tensor_parallel_cpu_group())
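
This change and the `MultiTaskRewardInterface` change below replace a bare `asyncio.run(...)` with the same pattern: run the coroutine on a fresh event loop owned by a worker thread, which avoids the "event loop is already running" error when the caller (for example, a Ray actor) already owns a loop. A generic sketch of that pattern, factored into a helper that is not part of this diff:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine

def run_coro_in_fresh_loop(coro: Coroutine) -> Any:
    """Run `coro` to completion on a brand-new event loop owned by a worker
    thread, so it works even if the calling thread already runs a loop."""
    def _runner() -> Any:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            return loop.run_until_complete(coro)
        finally:
            loop.close()

    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(_runner).result()

# e.g. results = run_coro_in_fresh_loop(self.async_generate(input_=input_, ...))
```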

View File

@@ -458,7 +458,20 @@ class MultiTaskRewardInterface(model_api.ModelInterface):
return task_results
task_results = asyncio.run(_run_tasks())
def run_in_thread():
# Create a new event loop for this thread
new_loop = asyncio.new_event_loop()
asyncio.set_event_loop(new_loop)
try:
return new_loop.run_until_complete(_run_tasks())
finally:
new_loop.close()
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
future = executor.submit(run_in_thread)
task_results = future.result()
final_result = self._gather_tasks(task_results, dispatch_indices, data.bs)
final_result = self._gather_tp_and_pp(input_, final_result, backward_indices)

View File

@@ -22,6 +22,7 @@ from realhf.impl.dataset.math_parser import parse_lines_in_parallel
from realhf.impl.model.nn.real_llm_api import ReaLModel
from realhf.impl.model.nn.real_llm_generate import concat_prompt_to_generation_output
from realhf.impl.model.utils.functional import (
build_leave_one_indices,
gather_packed_shifted_log_probs,
masked_normalization,
)
@@ -48,12 +49,20 @@ def topk(scores, gen_lengths, k) -> list:
return [idx for idx, _ in sorted_indices]
def calc_entropy(logits, cu_seqlens):
leave_one_indices = build_leave_one_indices(logits, cu_seqlens)
probs = torch.nn.functional.softmax(logits, dim=-1)
entropy = -torch.sum(probs * torch.log(probs + 1e-7), dim=-1)[leave_one_indices]
return entropy.detach()
def _ppo_actor_loss_from_model_outputs(
logits: torch.FloatTensor, # [tot_seqlen, vocab_size]
input_: SequenceSample,
kl_adapter: ppo_functional.KLController, # const
eps_clip: float, # const
c_clip: float | None,
behav_imp_weight_cap: float | None,
early_stop_imp_ratio: Optional[float], # const
early_stop_kl: Optional[float], # const
temperature: Optional[float] = 1,
@@ -87,8 +96,11 @@ def _ppo_actor_loss_from_model_outputs(
loss_mask=ppo_loss_mask,
c_clip=c_clip,
proximal_logprobs=input_.data.get("prox_logp", None),
behav_imp_weight_cap=behav_imp_weight_cap,
)
entropy = calc_entropy(logits=logits, cu_seqlens=cu_seqlens)
# Log training statistics
stats_tracker.denominator(
n_tokens=torch.ones(logits.shape[0], dtype=torch.bool, device=logits.device),
@@ -102,6 +114,7 @@ def _ppo_actor_loss_from_model_outputs(
approx_kl=ppo_stat["approx_kl"],
new_logp=logprobs.detach(),
old_logp=old_logp,
entropy=entropy.float(),
actor_loss=ppo_stat["loss"],
clip_ratio=ppo_stat["clip_mask"].float(),
dual_clip_ratio=ppo_stat["dual_clip_mask"].float(),
@@ -206,6 +219,7 @@ class PPOActorInterface(model_api.ModelInterface):
eps_clip: float = 0.2
c_clip: Optional[float] = None
behav_imp_weight_cap: Optional[float] = None
value_eps_clip: float = 0.2
max_reward_clip: float = 5.0
@@ -696,11 +710,17 @@ class PPOActorInterface(model_api.ModelInterface):
for idx, task in enumerate(RL_TASKS)
}
result_denominators = {
"correct_n_seqs": (reward_score > 0).bool(),
"incorrect_n_seqs": (reward_score <= 0).bool(),
}
global_denominators = dict(
n_seqs=torch.ones_like(reward_score, dtype=torch.bool),
n_tokens=torch.ones_like(prompt_mask, dtype=torch.bool),
n_valid_tokens=loss_mask.bool(),
**task_denominators,
**result_denominators,
)
stats_tracker.denominator(**global_denominators)
@@ -709,6 +729,13 @@ class PPOActorInterface(model_api.ModelInterface):
**{f"{task}_reward": reward_score}, denominator=f"{task}_n_seqs"
)
stats_tracker.stat(
correct_seq_len=input_lens.float(), denominator="correct_n_seqs"
)
stats_tracker.stat(
incorrect_seq_len=input_lens.float(), denominator="incorrect_n_seqs"
)
stats = dict(
advantages=advantages,
kl_rewards=kl_rewards,
@@ -747,6 +774,8 @@ class PPOActorInterface(model_api.ModelInterface):
scalars["use_dual_clip"] = 1
else:
scalars["use_dual_clip"] = 0
if self.behav_imp_weight_cap is not None:
scalars["behav_imp_weight_cap"] = self.behav_imp_weight_cap
stats_tracker.scalar(**scalars)
global_stats = stats_tracker.export()
@@ -763,6 +792,7 @@ class PPOActorInterface(model_api.ModelInterface):
early_stop_imp_ratio=self.early_stop_imp_ratio,
early_stop_kl=self.early_stop_kl,
c_clip=self.c_clip,
behav_imp_weight_cap=self.behav_imp_weight_cap,
temperature=self.gconfig.temperature,
)

View File

@@ -56,6 +56,7 @@ def actor_loss_fn(
loss_mask: Optional[torch.BoolTensor] = None,
c_clip: Optional[float] = None,
proximal_logprobs: Optional[torch.FloatTensor] = None,
behav_imp_weight_cap: Optional[torch.FloatTensor] = None,
) -> Tuple[torch.Tensor, Dict]:
"""Compute PPO actor loss function.
@@ -117,8 +118,10 @@ def actor_loss_fn(
if proximal_logprobs is not None:
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = behav_kl.exp()
if c_clip is not None:
behav_mask = (behav_imp_weight <= c_clip).logical_and(loss_mask)
if behav_imp_weight_cap is not None:
behav_mask = (behav_imp_weight <= behav_imp_weight_cap).logical_and(
loss_mask
)
else:
behav_mask = loss_mask
behav_kl = torch.where(behav_mask, behav_kl, 0.0)
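
With the decoupled loss, the behavior importance weight is `exp(proximal_logprobs - old_logprobs)`; the new `behav_imp_weight_cap` drops tokens whose weight exceeds the cap from the loss, instead of reusing `c_clip` as that threshold. A standalone sketch of just the masking step, mirroring the code above:

```python
from typing import Optional
import torch

def behav_weight_mask(
    proximal_logprobs: torch.Tensor,  # logprobs under the proximal (recomputed) policy
    old_logprobs: torch.Tensor,       # logprobs under the behavior policy that sampled the data
    loss_mask: torch.Tensor,          # bool mask of valid (generated) tokens
    behav_imp_weight_cap: Optional[float],
) -> torch.Tensor:
    """Return a bool mask keeping tokens whose behavior importance weight
    exp(prox_logp - old_logp) does not exceed the cap (all valid tokens if no cap)."""
    if behav_imp_weight_cap is None:
        return loss_mask
    behav_imp_weight = (proximal_logprobs - old_logprobs).exp()
    return (behav_imp_weight <= behav_imp_weight_cap).logical_and(loss_mask)
```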

View File

@@ -661,6 +661,8 @@ class ModelWorker(worker_base.Worker):
if (
self.__recover_run
and x.ids[0] in self.__recover_info.hash_vals_to_ignore
# The rollout worker has already filtered this sample once
and not isinstance(self.__datasets[dataset_id], PullerStreamDataset)
):
self.__recover_info.hash_vals_to_ignore.remove(x.ids[0])
continue

View File

@@ -86,6 +86,9 @@ class RolloutWorker(AsyncWorker):
# we don't need to recover rollout_stat here.
self.rollout_stat = RolloutStat()
# recover info
self.__recover_run, self.__recover_info = recover.load_recover_info()
return config.worker_info
def make_datasets(self):
@@ -187,6 +190,9 @@ class RolloutWorker(AsyncWorker):
# NOTE: no need to ignore ids during recover, because model workers will do so
data_id = cur_sample.ids[0]
if self.__recover_run and data_id in self.__recover_info.hash_vals_to_ignore:
self.__recover_info.hash_vals_to_ignore.remove(data_id)
return None
assert data_id not in self.rollout_tasks
return cur_sample