Update src/tasks_content.py
src/tasks_content.py (+29 -30)
```diff
@@ -11,50 +11,49 @@ TASKS_PRETTY = {
 TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
 
 TASKS_DESCRIPTIONS = {
-"library_based_code_generation": """# Library-
+"library_based_code_generation": """# Library-based code generation\n
 
-Our Library-
+Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
 
-For evaluation we use two metrics:
-* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code,
+For evaluation, we use two metrics:
 * `ChrF`: textual similarity between the generated code and the reference program.
+* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code,
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
 """,
 
-"ci_builds_repair": """# CI
+"ci_builds_repair": """# CI builds repair\n
 
-Our CI
+Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair) includes 77 data points.
 
-We use Pass@1 metric for CI repair.
-We evaluate Exact Match for different line categories:
+We use the `Pass@1` metric for CI builds repair.
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
 """,
 
-"project_code_completion": """# Project-
+"project_code_completion": """# Project-level code completion\n
 
-Our Project-
+Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
 * `small-context`: 144 data points,
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
 * `huge-context`: 296 data points.
 
-We use standard Exact Match (EM) metric for one-line code completion.
-We evaluate Exact Match for different line categories:
+We use standard `Exact Match (EM)` metric for one-line code completion.
+We evaluate `Exact Match` for different line categories:
 * *infile* — functions and classes are from the completion file;
-* *inproject* — functions and files are from the repository snapshot;
+* *inproject* — functions and files are from the repository snapshot at the moment of completion;
 * *committed* — functions and classes are from the files that were added on the completion file commit;
 * *common* — functions and classes with common names, e.g., `main`, `get`, etc.;
 * *non-informative* — short/long lines, import/print lines, or comment lines;
-* *random* — lines that
+* *random* — lines that don't fit any of the previous categories.
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
 """,
 
-"commit_message_generation": """# Commit
+"commit_message_generation": """# Commit message generation\n
 
-Our Commit
+Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
@@ -62,27 +61,27 @@
 * [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
 * [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
 
-**Note.** The leaderboard is sorted by ROUGE-1 metric by default.
+**Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
 """,
 
-"bug_localization": """# Bug
+"bug_localization": """# Bug localization\n
 
-Our Bug
-We used information retrieval metrics such as R@k
+Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
 
 """,
 
-"module_summarization": """# Module
-Our Module
+"module_summarization": """# Module summarization\n
+Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
 
-We use
-* `CompScore`:
+We use a novel metric for evaluation:
+* `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
 
-For further details on the dataset and the baselines from 🏟️ Long Code Arena
+For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/).
 """,
 }
 
```
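The task descriptions above lean on a handful of metrics, so a few illustrative sketches follow; none of them are the Long Code Arena team's actual evaluation code. For the library-based code generation metrics, the sketch below takes `ChrF` from the Hugging Face `evaluate` package and approximates `API Recall` by extracting call names with Python's `ast` module; treating every call in the reference program as a library-specific API call, as well as the helper names and the toy snippets, are simplifying assumptions of this example.

```python
# Illustrative sketch only (not the official Long Code Arena evaluation code).
# ChrF comes from Hugging Face `evaluate`; the API-recall part is a simplified
# approximation that treats every function/method call in the reference as an API call.
import ast

import evaluate


def call_names(code: str) -> set[str]:
    """Collect the names of all functions/methods called in a piece of Python code."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def api_recall(generated: str, reference: str) -> float:
    """Share of call names from the reference that also appear in the generated code."""
    ref_calls = call_names(reference)
    if not ref_calls:
        return 1.0
    return len(ref_calls & call_names(generated)) / len(ref_calls)


chrf = evaluate.load("chrf")

# Hypothetical reference and generated programs, just to exercise the functions.
reference = "import requests\nresponse = requests.get('https://example.com')\nresponse.raise_for_status()\n"
generated = "import requests\nresp = requests.get('https://example.com')\nprint(resp.status_code)\n"

print("ChrF:", chrf.compute(predictions=[generated], references=[[reference]])["score"])
print("API Recall:", api_recall(generated, reference))
```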
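For CI builds repair, `Pass@1` is commonly computed with the unbiased pass@k estimator from the HumanEval paper; with a single generated fix per data point it reduces to the share of data points whose repaired build passes. A minimal sketch, with made-up CI outcomes:

```python
# Hedged sketch of the standard pass@k estimator; the benchmark reports Pass@1,
# which with one attempt per data point is simply the fraction of green builds.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated fixes per data point, c of which pass CI."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One attempt per build (n=1): Pass@1 is just the mean success indicator.
ci_results = [True, False, True, True]  # hypothetical CI outcomes for 4 data points
print(sum(ci_results) / len(ci_results))  # 0.75
print(pass_at_k(n=10, c=3, k=1))          # 0.3 when 3 of 10 sampled fixes pass
```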
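For project-level code completion, line-level `Exact Match` can be broken down by the line categories listed in the description. A minimal sketch, assuming whitespace-stripping normalization and hypothetical category labels (the baselines may normalize lines differently):

```python
# Minimal sketch of line-level Exact Match with an optional per-category breakdown.
# The whitespace-stripping normalization and the example data are assumptions.
from collections import defaultdict


def exact_match(predicted_line: str, reference_line: str) -> bool:
    """A completion counts only if it reproduces the reference line exactly (modulo surrounding whitespace)."""
    return predicted_line.strip() == reference_line.strip()


def em_by_category(predictions, references, categories):
    """Average Exact Match overall and per line category (infile, inproject, ...)."""
    buckets = defaultdict(list)
    for pred, ref, category in zip(predictions, references, categories):
        buckets[category].append(exact_match(pred, ref))
    scores = {cat: sum(hits) / len(hits) for cat, hits in buckets.items()}
    scores["overall"] = sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(references)
    return scores


preds = ["return x + 1", "import os", "print(result)"]
refs = ["return x + 1 ", "import sys", "print(result)"]
cats = ["infile", "non-informative", "common"]
print(em_by_category(preds, refs, cats))
```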
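For commit message generation, the linked metric pages correspond to implementations in the Hugging Face `evaluate` package (with `ROUGE-1` mentioned in the sorting note). The sketch below loads them directly; the prediction/reference pair is invented, and the leaderboard's own scripts may apply additional preprocessing:

```python
# Sketch of scoring generated commit messages with the metric implementations
# linked in the description (Hugging Face `evaluate`).
import evaluate

predictions = ["Fix off-by-one error in pagination"]          # hypothetical model output
references = ["Fix off-by-one bug in the page iterator"]       # hypothetical ground truth

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print("BLEU:", bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])
print("ROUGE-1:", rouge.compute(predictions=predictions, references=references)["rouge1"])
print("ChrF:", chrf.compute(predictions=predictions, references=[[r] for r in references])["score"])
print("BERTScore F1:", bertscore.compute(predictions=predictions, references=references, lang="en")["f1"][0])
```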
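For bug localization, `R@k`, `P@k`, `F1-score`, and `MAP` are standard retrieval metrics over a ranked list of repository files, with the files changed by the fixing pull request as ground truth. A self-contained sketch for a single bug report, using a hypothetical ranking (`MAP` is the mean of the per-report average precision):

```python
# Hedged sketch of the retrieval metrics named above for one bug report:
# the model ranks repository files, and `relevant` holds the files actually
# modified by the pull request that fixed the bug.
def precision_recall_at_k(ranked_files, relevant, k):
    top_k = ranked_files[:k]
    hits = sum(1 for f in top_k if f in relevant)
    return hits / k, hits / len(relevant)


def average_precision(ranked_files, relevant):
    hits, total = 0, 0.0
    for rank, f in enumerate(ranked_files, start=1):
        if f in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["src/app.py", "src/utils.py", "tests/test_app.py"]  # hypothetical ranking
relevant = {"src/utils.py"}                                    # hypothetical ground truth

for k in (1, 2):
    p, r = precision_recall_at_k(ranked, relevant, k)
    f1 = 0.0 if p + r == 0 else 2 * p * r / (p + r)
    print(f"P@{k}={p:.2f} R@{k}={r:.2f} F1@{k}={f1:.2f}")
print("AP:", average_precision(ranked, relevant))  # MAP averages this over all bug reports
```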