All open-weight models were tested using [vLLM](https://github.com/vllm-project/vllm).

We provide access to the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breakdown is as follows:
- Success: 42 cases
- Failure (code runs but incorrect or timed out): 37 cases
- Fail to compile: 8 cases
- No code: 220 cases
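After extracting the archive (e.g. `tar -xzf codefu-7b-v0.1_usaco.csv.tgz`), the breakdown above can be recomputed with the standard library alone. The snippet below is a minimal sketch: the column names come from the field list above, but the sample rows and the short status labels are synthetic stand-ins, not values from the real file.

```python
import csv
import io
from collections import Counter

# Synthetic rows mimicking the shape of codefu-7b-v0.1_usaco.csv.
# For the real file, replace io.StringIO(sample) with
# open("codefu-7b-v0.1_usaco.csv", newline="").
sample = """problem_name,response_length,status,score
problem_a,4980,Success,1.0
problem_b,32000,No code,0.0
problem_c,31500,No code,0.0
problem_d,6100,Failure,0.0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Tally the `status` column, most frequent first.
breakdown = Counter(row["status"] for row in rows)
print(breakdown.most_common())
# [('No code', 2), ('Success', 1), ('Failure', 1)]
```

`Counter.most_common()` orders equal counts by first appearance, so the output above is deterministic.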
Analysis of the response length distribution shows that successful solutions typically have concise responses around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. While some correct solutions do exceed 20,000 tokens, the vast majority of long responses correspond to the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed on training objectives that better distinguish useful deliberation from unproductive verbosity.
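A per-status length summary of this kind can be sketched as follows. The `(status, response_length)` pairs below are synthetic stand-ins for the CSV's `status` and `response_length` columns, and 32768 is an assumed token cap used only for illustration:

```python
from statistics import mean

# Synthetic (status, response_length) pairs; 32768 is an assumed,
# not documented, maximum-token limit.
records = [
    ("Success", 4800),
    ("Success", 5200),
    ("Failure", 9500),
    ("No code", 32768),
    ("No code", 32768),
]

# Group response lengths by status.
by_status: dict[str, list[int]] = {}
for status, length in records:
    by_status.setdefault(status, []).append(length)

for status in sorted(by_status):
    print(f"{status}: mean length {mean(by_status[status]):.0f} tokens")
```

With the real CSV, the same grouping over all 307 rows would reproduce the distribution described above.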
## Usage