Update src/tasks_content.py
src/tasks_content.py  +11 -7  CHANGED
@@ -27,10 +27,10 @@ TASKS_DESCRIPTIONS = {
         Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
         includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
-        The benchmark clones the repo to the local
-        and then the benchmark pushes the repo to
-        We use the `Pass@1` rate metric
-        Models can be evaluated in three
+        The benchmark clones the repo to a local directory, the model fixes the issue according to the logs and the local repo state,
+        and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
+        We use the `Pass@1` rate metric to measure CI repair: the ratio of data points for which the build passed successfully after the generated fix.
+        Models can be evaluated in three settings:
         * `full` – **no** ground truth diffs are used for model evaluation;
         * `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
         * `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
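The `Pass@1` rate referenced in the hunk above is simply the share of data points whose CI build turns green after the single generated fix is applied. A minimal sketch of that computation; the function name and the toy numbers are illustrative, not taken from the benchmark code:

```python
from typing import Iterable

def pass_at_1(build_passed: Iterable[bool]) -> float:
    """Share of data points whose CI build passed after the generated fix."""
    results = list(build_passed)
    return sum(results) / len(results) if results else 0.0

# Toy example: 77 data points, 31 of them ending in a green build.
print(pass_at_1([True] * 31 + [False] * 46))  # ~0.40
```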
@@ -45,7 +45,9 @@ TASKS_DESCRIPTIONS = {
         * `medium-context`: 224 data points,
         * `large-context`: 270 data points,
         * `huge-context`: 296 data points.
-
+        Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
+        and a repository snapshot that can be used to build the context.
+
         We use standard `Exact Match (EM)` metric for one-line code completion.
         We evaluate `Exact Match` for different line categories:
         * *infile* – functions and classes are from the completion file;
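The hunk above describes per-category `Exact Match` scoring for one-line completion. A minimal sketch of how such a per-category EM could be computed; the sample structure (`prediction`, `reference`, `category` fields) and the whitespace-stripping normalization are assumptions for illustration, not the benchmark's actual evaluation code:

```python
from collections import defaultdict

def exact_match_by_category(samples):
    """EM per line category; samples are dicts with 'prediction',
    'reference', and 'category' keys (assumed structure)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        # Assumption: single lines are compared after whitespace stripping.
        if s["prediction"].strip() == s["reference"].strip():
            hits[s["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

print(exact_match_by_category([
    {"prediction": "return x + 1", "reference": "return x + 1", "category": "infile"},
    {"prediction": "return x", "reference": "return x + 1", "category": "infile"},
]))  # {'infile': 0.5}
```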
@@ -60,7 +62,7 @@ TASKS_DESCRIPTIONS = {
 
     "commit_message_generation": """# Commit message generation\n
 
-        Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
+        Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects.
 
         We use the following metrics for evaluation:
         * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
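The metrics list in the hunk above starts with BLEU, linked to the `sacrebleu` implementation on the 🤗 `evaluate` hub. A minimal usage sketch, assuming the `evaluate` and `sacrebleu` packages are installed; the toy prediction/reference strings are invented:

```python
import evaluate  # pip install evaluate sacrebleu

sacrebleu = evaluate.load("sacrebleu")
predictions = ["Fix failing CI build by pinning the linter version"]
references = [["Pin the linter version to fix the CI build"]]  # one list of references per prediction
result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level BLEU on a 0-100 scale
```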
@@ -75,7 +77,8 @@ TASKS_DESCRIPTIONS = {
 
     "bug_localization": """# Bug localization\n
 
-        Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+        Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about the pull requests that fix them for Python, Java, and Kotlin projects.
+        The model needs to identify the files within the repository that need to be modified to address the reported bug.
         We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
 
         For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
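In the bug localization setup above, a model produces a ranked list of candidate files that is compared against the files actually modified by the fixing pull request. A minimal sketch of `P@k` and `R@k` under that assumption; the function and variable names are illustrative, not from the benchmark code:

```python
def precision_recall_at_k(ranked_files, gold_files, k):
    """P@k and R@k for a single issue: ranked_files is the model's ranking,
    gold_files the set of files changed by the fixing pull request."""
    top_k = ranked_files[:k]
    hits = sum(1 for f in top_k if f in gold_files)
    precision = hits / k
    recall = hits / len(gold_files) if gold_files else 0.0
    return precision, recall

ranked = ["src/app.py", "src/utils.py", "tests/test_app.py"]
gold = {"src/utils.py"}
for k in (1, 2):
    print(k, precision_recall_at_k(ranked, gold, k))
# 1 (0.0, 0.0)
# 2 (0.5, 1.0)
```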
@@ -84,6 +87,7 @@ TASKS_DESCRIPTIONS = {
 
     "module_summarization": """# Module summarization\n
         Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+        The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
 
         We use a novel metric for evaluation:
         * `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
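`CompScore` itself is defined in the baselines repository linked above. As a small, hedged illustration, here is how the dataset named in this hunk can be inspected with the 🤗 `datasets` library; the split name and the printed fields are assumptions, check the dataset card before relying on them:

```python
from datasets import get_dataset_config_names, load_dataset  # pip install datasets

name = "JetBrains-Research/lca-module-summarization"
print(get_dataset_config_names(name))   # list configs, if the dataset defines any
ds = load_dataset(name, split="test")   # assumption: a "test" split exists
print(ds.column_names, ds.num_rows)
```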
