## Model Eval

HumanEval is the most common code generation benchmark for evaluating model performance, especially on the completion of code exercises.
To some extent, model evaluation is an art: different models are sensitive to different decoding methods, parameters, and instructions.
It is impractical for us to hand-tune a specific configuration for each fine-tuned model, because a genuinely capable LLM should perform well regardless of the parameters users set.
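To make the point concrete, a decoding configuration is just a handful of sampling knobs. A minimal sketch with `transformers`' `GenerationConfig` follows; the values are illustrative, not the settings used for the scores below:

```python
from transformers import GenerationConfig

# One shared decoding configuration applied to every model under test,
# instead of hand-tuning per-model settings. Values are illustrative.
shared_config = GenerationConfig(
    do_sample=True,    # sampling; greedy decoding would set this to False
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=256,
)
```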

Thus, OpenCSG strove to provide a relatively fair method for comparing the fine-tuned models on the HumanEval benchmark.
To simplify the comparison, we chose the Pass@1 metric for the Python language, though our fine-tuning dataset contains samples in multiple languages.
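For reference, Pass@k is conventionally estimated with the unbiased formula from the HumanEval paper, `1 - C(n-c, k) / C(n, k)`. A minimal sketch follows; the helper name `pass_at_k` and the example call are ours:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: n completions sampled per problem, c of them
    pass the unit tests; returns 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Pass@1 with a single sample per problem reduces to pass/fail.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
```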

| Model                          | HumanEval Python Pass@1 |
| ------------------------------ | ----------------------- |
| …                              | …                       |
| CodeLlama-34b-hf               | 48.2%                   |
| opencsg-CodeLlama-34b-v0.1(4k) | **48.8%**               |

**TODO**
- We will provide more benchmark scores for fine-tuned models in the future.
- We will provide practical problems for evaluating fine-tuned models in the field of software engineering.

```python
from transformers import AutoTokenizer
import transformers
```
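As a minimal sketch of how these imports are typically used with the `transformers` text-generation pipeline (the checkpoint id `opencsg/opencsg-CodeLlama-7b-v0.1`, the prompt, and the generation settings are illustrative assumptions, not the project's published values):

```python
import torch
import transformers
from transformers import AutoTokenizer

# Assumed checkpoint id for illustration; substitute the actual
# repository id of the released model.
model = "opencsg/opencsg-CodeLlama-7b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Complete a function signature; generation settings are illustrative.
sequences = pipeline(
    "def fibonacci(n):",
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
)
for seq in sequences:
    print(seq["generated_text"])
```

`device_map="auto"` lets the weights be placed across available devices, and `float16` halves the memory footprint relative to full precision.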