Areyde committed
Commit 5557fe1 · verified
1 Parent(s): bf01543

Update src/tasks_content.py

Files changed (1)
  1. src/tasks_content.py +7 -7
src/tasks_content.py CHANGED
@@ -38,7 +38,7 @@ TASKS_DESCRIPTIONS = {
 As a context, we pass a prefix of the list of APIs available in the target library.
 We select the prefix based on their BM-25 similarity with the provided instruction.
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -57,7 +57,7 @@ TASKS_DESCRIPTIONS = {
 * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -82,7 +82,7 @@ TASKS_DESCRIPTIONS = {
 * *non-informative* – short/long lines, import/print lines, or comment lines;
 * *random* – lines that don't fit any of the previous categories.
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -97,7 +97,7 @@ TASKS_DESCRIPTIONS = {
 * [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
 * [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
 
@@ -119,7 +119,7 @@ TASKS_DESCRIPTIONS = {
 * **All incorrect** - percentage of cases where all buggy files were incorrectly identified
 * **# Output** - average number of buggy files detected, to further assess performance, particularly concerning high **FPR**.
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -129,9 +129,9 @@ TASKS_DESCRIPTIONS = {
 The model is required to generate such description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
-* `CompScore`: the new metric based on LLM as an assessor proposed for this task. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/module_summarization/README.md).
+* `CompScore`: the new metric based on LLM as an assessor proposed for this task. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/tree/main/module_summarization).
 
-For further details on the dataset and the baselines from the Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
+For further details on the dataset and the baselines from the Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
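
The first hunk above mentions selecting a prefix of the library's API list by its BM-25 similarity with the instruction. As an illustration only — this is not the repository's actual implementation, and `select_api_prefix` and the plain Okapi BM25 scoring below are hypothetical — a minimal pure-Python sketch of that ranking step:

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency of each term (how many docs contain it).
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores


def select_api_prefix(instruction, api_entries, top_k=3):
    """Keep the top_k API entries most similar to the instruction."""
    query = instruction.lower().split()
    docs = [entry.lower().split() for entry in api_entries]
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(api_entries)), key=lambda i: scores[i], reverse=True)
    return [api_entries[i] for i in ranked[:top_k]]
```

In practice a real tokenizer and a tuned BM25 implementation would be used; the sketch only shows how the instruction acts as the query and the API entries as the ranked corpus.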