chenwuml committed on
Commit 9e2b6d5 · 1 Parent(s): b4ec84a

initial commit

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -53,11 +53,11 @@ All open-weight models were tested using [vLLM](https://github.com/vllm-project/

  We provide access to the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breakdown is as follows:
  - Success: 42 cases
- - Failure (code runs but incorrect): 37 cases
+ - Failure (code runs but incorrect or timed out): 37 cases
  - Fail to compile: 8 cases
  - No code: 220 cases

- Analysis of the response length distribution shows that successful solutions typically have concise responses around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. While some correct solutions do exceed 20,000 tokens, the vast majority of long responses correspond to the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. More future work is needed to address this long output issue. Future work is needed to improve training objectives that better distinguish between useful deliberation and unproductive verbosity.
+ Analysis of the response length distribution shows that successful solutions typically have concise responses around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. While some correct solutions do exceed 20,000 tokens, the vast majority of long responses correspond to the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed to improve training objectives that better distinguish between useful deliberation and unproductive verbosity.

  ## Usage
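For reference, below is a minimal sketch of how the released archive could be loaded to reproduce the `status` breakdown and the response-length comparison described in the diff above. It assumes pandas is available, that the `.tgz` archive contains a single CSV, and that the `status` column uses the literal labels listed in the README; none of these details are confirmed by the commit itself.

```python
import tarfile

import pandas as pd

# The CSV member name inside the .tgz is an assumption; take the first CSV found.
with tarfile.open("codefu-7b-v0.1_usaco.csv.tgz", "r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".csv"))
    df = pd.read_csv(tar.extractfile(member))

# Status breakdown (Success / Failure / Fail to compile / No code).
print(df["status"].value_counts())

# Response lengths for successful vs. unsuccessful attempts, as discussed above.
print(df.groupby(df["status"] == "Success")["response_length"].describe())
```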