Areyde committed
Commit 8314e15 · verified · 1 Parent(s): 96a7ff5

Update src/tasks_content.py

Files changed (1)
  1. src/tasks_content.py +11 -7
src/tasks_content.py CHANGED
@@ -27,10 +27,10 @@ TASKS_DESCRIPTIONS = {
  Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
  includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.

- The benchmark clones the repo to the local folder. The baseline model fixes the issue according to logs and the local repo state,
- and then the benchmark pushes the repo to GitGub and requests the result of the GitHub CI.
- We use the `Pass@1` rate metric for CI repair.
- Models can be evaluated in three types of tasks:
+ The benchmark clones the repo to a local directory; the model fixes the issue according to the logs and the local repo state,
+ and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
+ We use the `Pass@1` rate metric to measure CI repair, indicating the ratio of data points for which the build passed successfully after the generated fix.
+ Models can be evaluated in three settings:
  * `full` – **no** ground truth diffs are used for model evaluation;
  * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
  * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
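As a rough illustration of the `Pass@1` rate described in this hunk: a minimal sketch, assuming each data point gets a single generated fix and a boolean CI verdict collected after the push to GitHub. The `results` list and its contents are hypothetical, not part of the benchmark code.

```python
# Minimal sketch: Pass@1 over CI-repair outcomes.
# `results` is a hypothetical list of per-data-point CI verdicts (one attempt each);
# in the benchmark these verdicts come from the GitHub CI after pushing the fixed repo.

def pass_at_1(results: list[bool]) -> float:
    """Fraction of data points whose build passed after the generated fix."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: 3 of 4 builds turned green after the model's fix -> Pass@1 = 0.75
print(pass_at_1([True, True, False, True]))
```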
@@ -45,7 +45,9 @@ TASKS_DESCRIPTIONS = {
  * `medium-context`: 224 data points,
  * `large-context`: 270 data points,
  * `huge-context`: 296 data points.
-
+ Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
+ and a repository snapshot that can be used to build the context.
+
  We use standard `Exact Match (EM)` metric for one-line code completion.
  We evaluate `Exact Match` for different line categories:
  * *infile* – functions and classes are from the completion file;
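A minimal sketch of per-category `Exact Match (EM)` scoring for one-line completion. The sample schema (`category`, `prediction`, `target` fields) and the whitespace normalization are assumptions for illustration, not the benchmark's exact implementation.

```python
from collections import defaultdict

def exact_match(prediction: str, target: str) -> bool:
    # Assumption: compare after stripping surrounding whitespace;
    # the benchmark's exact normalization may differ.
    return prediction.strip() == target.strip()

def em_by_category(samples: list[dict]) -> dict[str, float]:
    """EM rate per line category, e.g. 'infile' (hypothetical sample schema)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        hits[s["category"]] += exact_match(s["prediction"], s["target"])
    return {cat: hits[cat] / totals[cat] for cat in totals}

print(em_by_category([
    {"category": "infile", "prediction": "return x + 1", "target": "return x + 1"},
    {"category": "infile", "prediction": "return x", "target": "return x + 1"},
]))
```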
@@ -60,7 +62,7 @@ TASKS_DESCRIPTIONS = {

  "commit_message_generation": """# Commit message generation\n

- Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
+ Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects.

  We use the following metrics for evaluation:
  * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
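The BLEU entry above links to the `sacrebleu` metric on the Hugging Face Hub; below is a small usage sketch via the `evaluate` package. This is illustrative only: the example strings are made up, and it is not necessarily the exact evaluation harness used by the benchmark.

```python
# Illustrative only: scoring a generated commit message against a reference
# with the sacrebleu metric linked above, loaded through `evaluate`.
import evaluate

predictions = ["Fix failing CI build by pinning dependency versions"]
references = [["Fix CI build failure by pinning dependency versions"]]

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # corpus-level BLEU on a 0-100 scale
```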
@@ -75,7 +77,8 @@ TASKS_DESCRIPTIONS = {

  "bug_localization": """# Bug localization\n

- Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+ Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about the pull requests that fix them for Python, Java, and Kotlin projects.
+ The model must identify the files within the repository that need to be modified to address the reported bug.
  We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.

  For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
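One plausible way to compute the `P@k` and `R@k` metrics mentioned above over a ranked list of candidate files; the exact definitions used by the benchmark may differ, and the file names in the example are hypothetical.

```python
def precision_at_k(ranked_files: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k ranked files that are actually changed by the fixing PR."""
    top_k = ranked_files[:k]
    return sum(f in relevant for f in top_k) / k

def recall_at_k(ranked_files: list[str], relevant: set[str], k: int) -> float:
    """Share of the ground-truth files recovered within the top-k ranked files."""
    top_k = ranked_files[:k]
    return sum(f in relevant for f in top_k) / len(relevant) if relevant else 0.0

# Hypothetical example: the model ranks repository files; the ground truth
# comes from the files modified by the pull request that fixes the bug.
ranked = ["src/app.py", "src/utils.py", "tests/test_app.py"]
gold = {"src/utils.py"}
for k in (1, 2):
    print(k, precision_at_k(ranked, gold, k), recall_at_k(ranked, gold, k))
```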
@@ -84,6 +87,7 @@ TASKS_DESCRIPTIONS = {

  "module_summarization": """# Module summarization\n
  Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+ The model is required to generate such a description, given the relevant context code and the intent behind the documentation.

  We use a novel metric for evaluation:
  * `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
 