chenwuml committed on
Commit 9e2b6d5 · 1 Parent(s): b4ec84a

initial commit

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -53,11 +53,11 @@ All open-weight models were tested using [vLLM](https://github.com/vllm-project/

  We provide access to the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breakdown is as follows:
  - Success: 42 cases
- - Failure (code runs but incorrect): 37 cases
+ - Failure (code runs but incorrect or timed out): 37 cases
  - Fail to compile: 8 cases
  - No code: 220 cases

- Analysis of the response length distribution shows that successful solutions typically have concise responses around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. While some correct solutions do exceed 20,000 tokens, the vast majority of long responses correspond to the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. More future work is needed to address this long output issue. Future work is needed to improve training objectives that better distinguish between useful deliberation and unproductive verbosity.
+ Analysis of the response length distribution shows that successful solutions typically have concise responses around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. While some correct solutions do exceed 20,000 tokens, the vast majority of long responses correspond to the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed to improve training objectives that better distinguish between useful deliberation and unproductive verbosity.

  ## Usage
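For reference, below is a minimal sketch of how the released archive could be loaded to reproduce the `status` breakdown and the response-length comparison described in the diff above. It assumes pandas is available, that the `.tgz` archive contains a single CSV, and that the `status` column uses the literal labels listed in the README; none of these details are confirmed by the commit itself.

```python
import tarfile

import pandas as pd

# The CSV member name inside the .tgz is an assumption; take the first CSV found.
with tarfile.open("codefu-7b-v0.1_usaco.csv.tgz", "r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".csv"))
    df = pd.read_csv(tar.extractfile(member))

# Status breakdown (Success / Failure / Fail to compile / No code).
print(df["status"].value_counts())

# Response lengths for successful vs. unsuccessful attempts, as discussed above.
print(df.groupby(df["status"] == "Success")["response_length"].describe())
```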