galtimur committed
Commit 3514265 · 1 Parent(s): 49c07f7

Replaced benchname with LCA in README and other files.

Files changed (3)
  1. src/content.py +13 -13
  2. src/get_results_for_task.py +2 -2
  3. src/tasks_content.py +15 -15
src/content.py CHANGED
@@ -3,19 +3,19 @@ from .formatting import styled_warning
 # ================================
 # = ABOUT =
 # ================================
-INTRODUCTION_TITLE = """<h1 align="center">🏟️ BenchName </h1>"""
+INTRODUCTION_TITLE = """<h1 align="center">🏟️ Long Code Arena </h1>"""
 
-INTRODUCTION_TEXT = """🏟️ **BenchName** is a suite of benchmarks for code-related tasks with large contexts, up to a whole code repository.
+INTRODUCTION_TEXT = """🏟️ **Long Code Arena** is a suite of benchmarks for code-related tasks with large contexts, up to a whole code repository.
 It currently spans six different tasks and contains six datasets:
 
-* 🤗 [Library-based code generation](https://huggingface.co/datasets/icmlbenchname/library-based-code-generation)
-* 🤗 [CI builds repair](https://huggingface.co/datasets/icmlbenchname/ci-builds-repair)
-* 🤗 [Project-level code completion](https://huggingface.co/datasets/icmlbenchname/project-level-code-completion)
-* 🤗 [Commit message generation](https://huggingface.co/datasets/icmlbenchname/commit-message-generation)
-* 🤗 [Bug localization](https://huggingface.co/datasets/icmlbenchname/bug-localization)
-* 🤗 [Module summarization](https://huggingface.co/datasets/icmlbenchname/module-summarization)
+* 🤗 [Library-based code generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation)
+* 🤗 [CI builds repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
+* 🤗 [Project-level code completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion)
+* 🤗 [Commit message generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation)
+* 🤗 [Bug localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization)
+* 🤗 [Module summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization)
 
-We are excited to invite you to participate in solving our benchmarks! To submit your results, please send the following materials to our 📩 email (icmlbenchname@gmail.com):
+We are excited to invite you to participate in solving our benchmarks! To submit your results, please send the following materials to our 📩 email (lca@jetbrains.com):
 
 * **Results**: Include the summary of your benchmark outcomes.
 * **Reproduction Package**: To ensure the integrity and reproducibility of your results, please include the code for context collection (if any), generation of predictions, and evaluating. You can follow [baselines](https://anonymous.4open.science/r/icml-benchname-2025/README.md) as a reference.
@@ -30,23 +30,23 @@ We look forward to reviewing your innovative solutions!
 # ================================
 LEADERBOARD_TITLE = '<h2 align="center">🏅Leaderboard</h2>'
 
-LEADERBOARD_TEXT = """The raw results from the leaderboard are available in 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results)."""
+LEADERBOARD_TEXT = """The raw results from the leaderboard are available in 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results)."""
 
 # ================================
 # = SUBMISSION =
 # ================================
 SUBMISSION_TITLE = '<h2 align="center">📩 Make A Submission</h2>'
 
-SUBMISSION_TEXT_INTRO = """Use the form below to submit new results to 🏟️ BenchName. If any problems arise, don't hesitate to contact us by email `TODO` or open a discussion 💛"""
+SUBMISSION_TEXT_INTRO = """Use the form below to submit new results to 🏟️ Long Code Arena. If any problems arise, don't hesitate to contact us by email `TODO` or open a discussion 💛"""
 
 SUBMISSION_TEXT_TASK = """1. Select a task you want to submit results for."""
 
 SUBMISSION_TEXT_METADATA = """2. Fill in some metadata about your submission."""
 
 SUBMISSION_TEXT_FILES = """3. Attach one or more files with your model's predictions.
-* If several files are attached, they will be treated as separate runs of the submitted model (e.g., with different seeds), and the metrics will be averaged across runs. For baselines provided by 🏟️ BenchName Team, the results are averaged across 3 runs.
+* If several files are attached, they will be treated as separate runs of the submitted model (e.g., with different seeds), and the metrics will be averaged across runs. For baselines provided by 🏟️ Long Code Arena Team, the results are averaged across 3 runs.
 """
 
-SUBMISSION_TEXT_SUBMIT = """All set! A new PR to 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results) should be opened when you press "Submit" button. 🏟️ BenchName Team will review it shortly, and the results will appear in the leaderboard.
+SUBMISSION_TEXT_SUBMIT = """All set! A new PR to 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results) should be opened when you press "Submit" button. 🏟️ Long Code Arena Team will review it shortly, and the results will appear in the leaderboard.
 
 ⏳ **Note:** It might take some time (up to 40 minutes) for PR to get created, since it involves computing metrics for your submission."""
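The `SUBMISSION_TEXT_FILES` string above states that several attached prediction files are treated as separate runs of the same model and that metrics are averaged across runs. As a rough illustration of that flow (not the leaderboard's actual code; the file names, the `score_run` helper, and the dummy exact-match metric are hypothetical), each attachment is a JSON Lines file with one record per example, and per-run scores are simply averaged. The `"prediction"`/`"reference"` field names follow the submission instructions in `src/tasks_content.py` below.

```python
import json
from pathlib import Path
from statistics import mean


def score_run(path: Path) -> float:
    """Hypothetical per-run scorer: read a JSONL predictions file and return
    a single metric value (here, a dummy exact-match rate)."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    return mean(r["prediction"].strip() == r["reference"].strip() for r in records)


# Several attached files = several runs of the submitted model (e.g., different seeds).
run_files = [Path("run_seed0.jsonl"), Path("run_seed1.jsonl"), Path("run_seed2.jsonl")]
averaged_metric = mean(score_run(f) for f in run_files)
print(f"Metric averaged across {len(run_files)} runs: {averaged_metric:.3f}")
```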
src/get_results_for_task.py CHANGED
@@ -37,7 +37,7 @@ def _get_results_stub() -> pd.DataFrame:
             "ChrF": "X",
             "BERTScore": "X",
             "BERTScore (Normalized)": "X",
-            "Submitted By": "BenchName Team",
+            "Submitted By": "LCA Team",
             "Resources": "",
         },
         {
@@ -49,7 +49,7 @@ def _get_results_stub() -> pd.DataFrame:
             "ChrF": "X",
             "BERTScore": "X",
             "BERTScore (Normalized)": "X",
-            "Submitted By": "BenchName Team",
+            "Submitted By": "LCA Team",
             "Resources": "",
         },
     ]
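The two hunks above only change the `"Submitted By"` value inside placeholder rows of `_get_results_stub()`. For orientation, here is a minimal sketch of how such a stub could assemble placeholder rows into a `pandas` DataFrame; the surrounding code, the `"Model"` column, and the made-up model names are assumptions, not the repository's actual implementation.

```python
import pandas as pd


def _get_results_stub() -> pd.DataFrame:
    # Placeholder rows shown while real results are unavailable; "X" marks
    # metrics that have not been computed yet. Model names are made up here.
    stub_rows = [
        {
            "Model": "baseline-model-a",  # assumption: not taken from the diff
            "ChrF": "X",
            "BERTScore": "X",
            "BERTScore (Normalized)": "X",
            "Submitted By": "LCA Team",
            "Resources": "",
        },
        {
            "Model": "baseline-model-b",  # assumption: not taken from the diff
            "ChrF": "X",
            "BERTScore": "X",
            "BERTScore (Normalized)": "X",
            "Submitted By": "LCA Team",
            "Resources": "",
        },
    ]
    return pd.DataFrame(stub_rows)


print(_get_results_stub().head())
```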
src/tasks_content.py CHANGED
@@ -14,7 +14,7 @@ TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
 TASKS_DESCRIPTIONS = {
     "aggregated": """# Aggregated Results\n
 
-Here, we present the aggregated results across all the tasks in BenchName (except for Project-level code completion, where its specifics required a different selection of models). To get more details about each task, visit the corresponding tab.
+Here, we present the aggregated results across all the tasks in Long Code Arena (except for Project-level code completion, where its specifics required a different selection of models). To get more details about each task, visit the corresponding tab.
 
 To obtain aggregated results, we first select only one metric from metric suite for each task:
 * Library-based code generation: `API Recall`
@@ -25,11 +25,11 @@ TASKS_DESCRIPTIONS = {
 
 Then, to ensure a fair comparison across tasks with different score ranges, we normalize all scores to a 0-1 scale, where zero corresponds to the worst-performing model, and 1 to the best one. Note that for mean rank, rather than using strict rankings, we implemented a ranking system with a 10% margin to account for models with similar performance.
 
-We report mean rank (with std) and mean score across the tasks from BenchName, and the scores for each task in the table below.
+We report mean rank (with std) and mean score across the tasks from Long Code Arena, and the scores for each task in the table below.
 """,
     "library_based_code_generation": """# Library-based code generation\n
 
-Our Library-based code generation benchmark 🤗 [icmlbenchname/library-based-code-generation](https://huggingface.co/datasets/icmlbenchname/library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
+Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
 
 For evaluation, we use two metrics:
 * `ChrF`: textual similarity between the generated code and the reference program.
@@ -38,14 +38,14 @@ TASKS_DESCRIPTIONS = {
 As a context, we pass a prefix of the list of APIs available in the target library.
 We select the prefix based on their BM-25 similarity with the provided instruction.
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `library_based_code_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "ci_builds_repair": """# CI builds repair\n
 
-Our CI builds repair benchmark 🤗 [icmlbenchname/ci-builds-repair](https://huggingface.co/datasets/icmlbenchname/ci-builds-repair)
+Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
 includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
 The benchmark clones the repo to the local directory, the model fixes the issue according to logs and the local repo state,
@@ -57,14 +57,14 @@ TASKS_DESCRIPTIONS = {
 * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `ci-builds-repair` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "project_code_completion": """# Project-level code completion\n
 
-Our Project-level code completion benchmark 🤗 [icmlbenchname/project-level-code-completion](https://huggingface.co/datasets/icmlbenchname/project-level-code-completion) includes four sets of samples:
+Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
 * `small-context`: 144 data points,
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
@@ -82,14 +82,14 @@ TASKS_DESCRIPTIONS = {
 * *non-informative* – short/long lines, import/print lines, or comment lines;
 * *random* – lines that don't fit any of the previous categories.
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `project_level_code_completion` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "commit_message_generation": """# Commit message generation\n
 
-Our Commit message generation benchmark 🤗 [icmlbenchname/commit-message-generation](https://huggingface.co/datasets/icmlbenchname/commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, which the model needs to generate commit messages for.
+Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, which the model needs to generate commit messages for.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
@@ -97,7 +97,7 @@ TASKS_DESCRIPTIONS = {
 * [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
 * [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `commit_message_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
 
@@ -107,7 +107,7 @@ TASKS_DESCRIPTIONS = {
 
     "bug_localization": """# Bug localization\n
 
-Our Bug localization benchmark 🤗 [icmlbenchname/bug-localization](https://huggingface.co/datasets/icmlbenchname/bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
 The model needs to identify the files within the repository that need to be modified to address the reported bug.
 
 To evaluate baseline performance, we use the following classification metrics:
@@ -119,19 +119,19 @@ TASKS_DESCRIPTIONS = {
 * **All incorrect** - percentage of cases where all buggy files were incorrectly identified
 * **# Output** - average number of buggy files detected, to further assess performance, particularly concerning high **FPR**.
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `bug_localization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "module_summarization": """# Module summarization\n
-Our Module summarization benchmark 🤗 [icmlbenchname/module-summarization](https://huggingface.co/datasets/icmlbenchname/module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
 The model is required to generate such description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
 * `CompScore`: the new metric based on LLM as an assessor proposed for this task. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/module_summarization/README.md).
 
-For further details on the dataset and the baselines from the BenchName team, refer to the `module_summarization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -145,6 +145,6 @@ def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
     task_id = TASKS_PRETTY_REVERSE[task_pretty]
 
     if task_id == "commit_message_generation":
-        return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by BenchName Team in 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example, the rest are optional."""
+        return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by Long Code Arena Team in 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example, the rest are optional."""
 
     return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."