from typing import Optional

TASKS_PRETTY = {
    "aggregated": "Aggregated Results",
    "library_based_code_generation": "Library-based code generation",
    "ci_builds_repair": "CI builds repair",
    "project_code_completion": "Project-level code completion",
    "commit_message_generation": "Commit message generation",
    "bug_localization": "Bug localization",
    "module_summarization": "Module Summarization",
}
TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
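
# Illustrative check (not part of the leaderboard logic): TASKS_PRETTY maps internal
# task ids to the display names used in the UI, and TASKS_PRETTY_REVERSE is its exact
# inverse, so a pretty name selected in the interface can be mapped back to its task id.
assert TASKS_PRETTY_REVERSE[TASKS_PRETTY["ci_builds_repair"]] == "ci_builds_repair"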

TASKS_DESCRIPTIONS = {
    "aggregated": """# Aggregated Results\n
        
        Here, we present the aggregated results across all the tasks in BenchName (except for Project-level code completion, whose specifics required a different selection of models). To get more details about each task, visit the corresponding tab.

        To obtain the aggregated results, we first select a single metric from the metric suite of each task:
        * Library-based code generation: `API Recall`
        * CI builds repair: `Pass@1`
        * Commit message generation: `chrF`
        * Bug localization: `F1-score`
        * Module summarization: `CompScore`

        Then, to ensure a fair comparison across tasks with different score ranges, we normalize all scores to a 0-1 scale, where 0 corresponds to the worst-performing model and 1 to the best one. Note that for the mean rank, rather than using strict rankings, we use a ranking scheme with a 10% margin to account for models with similar performance.

        We report the mean rank (with standard deviation) and the mean score across the tasks from BenchName, as well as the scores for each task, in the table below.
        """,
    "library_based_code_generation": """# Library-based code generation\n
        
        Our Library-based code generation benchmark 🤗 [icmlbenchname/library-based-code-generation](https://huggingface.co/datasets/icmlbenchname/library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
        
        For evaluation, we use two metrics:
        * `ChrF`: textual similarity between the generated code and the reference program.  
        * `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code.

        As context, we pass a prefix of the list of APIs available in the target library,
        selecting the APIs for this prefix by their BM-25 similarity to the provided instruction.

        For further details on the dataset and the baselines from the BenchName team, refer to the `library_based_code_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
        
        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 
        """,

    "ci_builds_repair": """# CI builds repair\n
        
        Our CI builds repair benchmark 🤗 [icmlbenchname/ci-builds-repair](https://huggingface.co/datasets/icmlbenchname/ci-builds-repair)
        includes 77 manually curated and assessed data points from 32 Python repositories, in which the model has to fix a failed build.

        The benchmark clones the repository to a local directory, the model fixes the issue based on the build logs and the local repository state,
        and then the benchmark pushes the repository back to GitHub and requests the result of the GitHub CI run.
        We use the `Pass@1` metric to measure CI repair: the ratio of data points for which the build passes successfully after the generated fix.
        
        Models can be evaluated in three settings:
        * `full` – **no** ground truth diffs are used for model evaluation;
        * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
        * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue.

        For further details on the dataset and the baselines from the BenchName team, refer to the `ci-builds-repair` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
        
        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 
        """,

    "project_code_completion": """# Project-level code completion\n
        
        Our Project-level code completion benchmark 🤗 [icmlbenchname/project-level-code-completion](https://huggingface.co/datasets/icmlbenchname/project-level-code-completion) includes four sets of samples:
        * `small-context`: 144 data points,
        * `medium-context`: 224 data points,
        * `large-context`: 270 data points,
        * `huge-context`: 296 data points.
        
        Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below), 
        and a repository snapshot that can be used to build the context.
        
        We use the standard `Exact Match (EM)` metric for one-line code completion.
        We evaluate `Exact Match` for different line categories:
        * *infile* – functions and classes are from the completion file;
        * *inproject* – functions and classes are from the repository snapshot at the moment of completion;
        * *committed* – functions and classes are from the files that were added in the same commit as the completion file;
        * *common* – functions and classes with common names, e.g., `main`, `get`, etc.;
        * *non-informative* – short/long lines, import/print lines, or comment lines;
        * *random* – lines that don't fit any of the previous categories.

        For further details on the dataset and the baselines from the BenchName team, refer to the `project_level_code_completion` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
        
        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 
        """,

    "commit_message_generation": """# Commit message generation\n
        
        Our Commit message generation benchmark 🤗 [icmlbenchname/commit-message-generation](https://huggingface.co/datasets/icmlbenchname/commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, for which the model needs to generate commit messages.
        
        We use the following metrics for evaluation:
        * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
        * [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)
        * [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
        * [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
        
        For further details on the dataset and the baselines from the BenchName team, refer to the `commit_message_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
        
        **Note.** The leaderboard is sorted by the `ROUGE-1` metric by default. 

        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 

        """,

    "bug_localization": """# Bug localization\n
        
        Our Bug localization benchmark 🤗 [icmlbenchname/bug-localization](https://huggingface.co/datasets/icmlbenchname/bug-localization) includes 150 manually verified bug issue descriptions from Python, Java, and Kotlin projects, together with information about the pull requests that fix them.
        The model needs to identify the files within the repository that need to be modified to address the reported bug.
        
        To evaluate baseline performance, we use the following classification metrics: 
        * **P** - precision to estimate how many of the predicted buggy files were correctly identified
        * **R** - recall to indicate how many of the actual buggy files were correctly found
        * **FPR** - false positive rate to indicate how many non-buggy files were incorrectly predicted as buggy
        * **F1-score** - score to provide a balance between precision and recall
        * **All correct** - percentage of cases where all buggy files were correctly identified
        * **All incorrect** - percentage of cases where all buggy files were incorrectly identified
        * **# Output** - average number of buggy files detected, to further assess performance, particularly concerning high **FPR**.

        For further details on the dataset and the baselines from the BenchName team, refer to the `bug_localization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).

        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 
    """,

    "module_summarization": """# Module summarization\n
        Our Module summarization benchmark 🤗 [icmlbenchname/module-summarization](https://huggingface.co/datasets/icmlbenchname/module-summarization) includes 216 manually curated text files with various types of documentation for permissively licensed open-source Python projects.
        The model is required to generate such a description, given the relevant code context and the intent behind the documentation.

        We use a novel metric for evaluation:
        * `CompScore`: a new LLM-as-an-assessor metric proposed for this task. The approach feeds the LLM the relevant code and two versions of the documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/module_summarization/README.md).

        For further details on the dataset and the baselines from the BenchName team, refer to the `module_summarization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
        
        **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). 
        """,
}
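

# The helper sketches below are illustrative only and are NOT the code used to produce
# the leaderboard numbers: the function names, signatures, and the exact handling of
# the 10% margin are assumptions made to clarify the descriptions above.
def _normalize_scores(scores):
    """Min-max normalize per-task scores to the 0-1 scale described in the
    "aggregated" entry: 0 for the worst-performing model, 1 for the best one.

    `scores` maps a model name to its raw score on a single task.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie on this task; map every score to 1.0 by convention.
        return {model: 1.0 for model in scores}
    return {model: (value - lo) / (hi - lo) for model, value in scores.items()}


def _rank_with_margin(normalized_scores, margin=0.10):
    """Rank models by normalized score, letting a model whose score is within
    `margin` of the previous (better) model share its rank, which is one possible
    reading of the "10% margin" mentioned in the "aggregated" entry.
    """
    ordered = sorted(normalized_scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    current_rank = 1
    for position, (model, score) in enumerate(ordered):
        if position > 0 and ordered[position - 1][1] - score > margin:
            current_rank = position + 1
        ranks[model] = current_rank
    return ranks


def _api_recall(reference_api_calls, generated_code):
    """Toy version of the `API Recall` metric from the library-based code generation
    entry: the share of library-specific API calls from the reference program that
    appear in the generated code. The real metric extracts calls from the generated
    code rather than doing plain substring matching.
    """
    if not reference_api_calls:
        return 1.0
    found = sum(1 for call in reference_api_calls if call in generated_code)
    return found / len(reference_api_calls)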


def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
    if not task_pretty:
        return "Please, select a specific task to see more detailed instructions regarding submitting files."

    task_id = TASKS_PRETTY_REVERSE[task_pretty]

    if task_id == "commit_message_generation":
        return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by BenchName Team in  πŸ€— [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example, the rest are optional."""

    return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."