# Test-suite Reduction

## Preparation Work

Test-suite reduction relies on evaluation results, so make sure you have run the evaluation script and that an `eval_results.json` file has been generated for each model under test.

Use the following command to install necessary dependencies:

```bash
# in $EVALPLUS_ROOT
pip install -r requirements-tsr.txt
```

## Usage

```bash
python3 run.py \
  --dataset DATASET \
  --sample_eval_dir SAMPLE_DIR \
  --model MODEL \
  [--report_dir REPORT_DIR]

# Example
python3 run.py --dataset humaneval --sample_eval_dir $HOME/HumanEval --model ALL
```

Parameter descriptions:
* `--dataset`: currently, `humaneval` and `mbpp` are supported.
* `--sample_eval_dir`: the directory containing all the LLM evaluation results. The directory must be structured as follows:
    ```bash
    SAMPLE_EVAL_DIR
    β”œβ”€β”€ LLM_1
    β”‚   β”œβ”€β”€ ...
    β”‚   └── eval_results.json
    β”œβ”€β”€ LLM_2
    β”‚   β”œβ”€β”€ ...
    β”œβ”€β”€ ...
    ```
* `--report_dir`: the directory where intermediate files, pass@k results, and the reduced dataset are stored. If not specified, `REPORT_DIR=./tsr_info` by default.
* `--model`: if `MODEL` is a specific LLM name, cross-validation results for that model will be generated in `REPORT_DIR`; if `MODEL == ALL`, a reduced dataset will be generated in `REPORT_DIR`.
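Before launching a long reduction run, it can help to confirm that every model sub-directory in `SAMPLE_EVAL_DIR` actually contains an `eval_results.json`. Below is a minimal sketch of such a check; the helper name `check_sample_eval_dir` is ours for illustration and is not part of EvalPlus:

```python
from pathlib import Path

def check_sample_eval_dir(sample_eval_dir: str) -> list:
    """Return the names of model sub-directories missing eval_results.json."""
    missing = []
    for model_dir in sorted(Path(sample_eval_dir).iterdir()):
        # Each immediate sub-directory is expected to hold one model's results.
        if model_dir.is_dir() and not (model_dir / "eval_results.json").is_file():
            missing.append(model_dir.name)
    return missing

if __name__ == "__main__":
    import sys
    missing = check_sample_eval_dir(sys.argv[1])
    if missing:
        print("Missing eval_results.json in:", ", ".join(missing))
```

Running it as `python3 check_dir.py $HOME/HumanEval` (hypothetical script name) lists any model directory that would cause the reduction to fail later.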

## Known Issues

If the program appears to be stuck at the mutant generation step, try removing the line
```python
assert len(completion_id) == len(problems), "Missing problems in samples"
```
from `evalplus/evaluate.py`.