Update src/tasks_content.py
src/tasks_content.py (+11 -7)
@@ -27,10 +27,10 @@ TASKS_DESCRIPTIONS = {
 Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
 includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
-The benchmark clones the repo to the local
-and then the benchmark pushes the repo to
-We use the `Pass@1` rate metric
-Models can be evaluated in three
+The benchmark clones the repo to the local directory, the model fixes the issue according to logs and the local repo state,
+and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
+We use the `Pass@1` rate metric to measure CI repair, indicating the ratio of data points, for which the build passed successfully after the generated fix.
+Models can be evaluated in three settings:
 * `full` – **no** ground truth diffs are used for model evaluation;
 * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
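
Aside for readers of this hunk: the `Pass@1` rate described above is simply the share of data points whose CI build turns green after the generated fix is pushed. A minimal sketch of that computation, assuming a hypothetical `ci_status` field rather than the benchmark's actual schema:

```python
# Hypothetical sketch of the Pass@1 rate: the fraction of data points whose
# GitHub CI build passed after the model's fix was pushed. The "ci_status"
# field name is an assumption for illustration, not the benchmark's schema.
from typing import Iterable, Mapping


def pass_at_1(results: Iterable[Mapping[str, str]]) -> float:
    results = list(results)
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.get("ci_status") == "success")
    return passed / len(results)


# Two of three pushed fixes passed CI -> Pass@1 = 0.67
print(round(pass_at_1([{"ci_status": "success"},
                       {"ci_status": "failure"},
                       {"ci_status": "success"}]), 2))
```
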
@@ -45,7 +45,9 @@ TASKS_DESCRIPTIONS = {
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
 * `huge-context`: 296 data points.
-
+Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
+and a repository snapshot that can be used to build the context.
+
 We use standard `Exact Match (EM)` metric for one-line code completion.
 We evaluate `Exact Match` for different line categories:
 * *infile* – functions and classes are from the completion file;
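
For context on the `Exact Match (EM)` metric mentioned in this hunk, here is a rough sketch of a per-category EM computation; the `(category, prediction, reference)` triples are an assumed layout for illustration, not the dataset's real schema:

```python
# Rough sketch of Exact Match (EM) for one-line code completion, grouped by
# line category (e.g. "infile"). The (category, prediction, reference) layout
# is assumed for illustration only.
from collections import defaultdict


def exact_match_by_category(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for category, prediction, reference in records:
        totals[category] += 1
        # EM counts a completion as correct only if it reproduces the
        # reference line exactly (here: up to surrounding whitespace).
        hits[category] += int(prediction.strip() == reference.strip())
    return {cat: hits[cat] / totals[cat] for cat in totals}


print(exact_match_by_category([
    ("infile", "return x + 1", "return x + 1"),
    ("infile", "return x", "return x + 1"),
]))  # {'infile': 0.5}
```
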
@@ -60,7 +62,7 @@ TASKS_DESCRIPTIONS = {
 
 "commit_message_generation": """# Commit message generation\n
 
-Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
+Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
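
The BLEU link in this hunk points to the sacrebleu metric on the Hugging Face Hub, so one plausible way to score generated commit messages is through the `evaluate` library (assuming it is installed); the example messages below are made up:

```python
# One possible way to compute BLEU for commit message generation via the
# Hugging Face `evaluate` library; the example prediction/reference pair is
# invented for illustration.
import evaluate

predictions = ["Fix failing CI build by pinning the dependency version"]
references = [["Pin the dependency version to fix the failing CI build"]]

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # corpus-level BLEU, on a 0-100 scale
```
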
@@ -75,7 +77,8 @@ TASKS_DESCRIPTIONS = {
 
 "bug_localization": """# Bug localization\n
 
-Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+The model needs to identify the files within the repository that need to be modified to address the reported bug.
 We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
 
 For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
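
To make the retrieval metrics in this hunk concrete, here is a small sketch of `P@k` and `R@k`, treating the model output as a ranked list of candidate files and the ground truth as the set of files changed by the fixing pull request; the data layout is assumed and this is not the baselines' implementation:

```python
# Sketch of precision@k and recall@k for bug localization. The ranked file
# list and ground-truth set below are invented for illustration.
def precision_recall_at_k(ranked_files, ground_truth_files, k):
    top_k = ranked_files[:k]
    relevant = sum(1 for f in top_k if f in ground_truth_files)
    precision = relevant / k
    recall = relevant / len(ground_truth_files) if ground_truth_files else 0.0
    return precision, recall


ranked = ["src/app.py", "src/utils.py", "tests/test_app.py"]
truth = {"src/utils.py"}
for k in (1, 2):  # the benchmark reports k equal to 1 and 2
    p, r = precision_recall_at_k(ranked, truth, k)
    print(f"P@{k}={p:.2f} R@{k}={r:.2f}")
```
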
@@ -84,6 +87,7 @@ TASKS_DESCRIPTIONS = {
 
 "module_summarization": """# Module summarization\n
 Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+The model is required to generate such description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
 * `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).