Add download, generation mode
README.md
CHANGED
@@ -16,7 +16,7 @@ An example of a complex debugging scenario is where you build library A on top o

-IQ
+IQ here refers to Imatrix Quantization. For a performance comparison against regular GGUF, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).

<br>

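A minimal sketch of loading one of the imatrix-quantized GGUF files listed below, assuming a llama-cpp-python runtime (an assumption; any GGUF-capable runtime works):

```python
# Illustrative only: load one of the IQ/K-quant GGUF files from the table below.
from llama_cpp import Llama

llm = Llama(
    model_path="./AutoCoder.IQ4_K.gguf",  # any of the local GGUF quants in the table
    n_ctx=8192,        # context window; large enough for the debugging prompt plus output
    n_gpu_layers=-1,   # offload all layers to the GPU if VRAM allows, otherwise lower this
)
```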
@@ -34,10 +34,10 @@ Evaluated on two programming tasks: debugging and generation. It may be a bit su
| 6 | GPT-4o-mini | N/A | Decent, but struggles with complex debugging tasks | Reliable for shorter or simpler code generation tasks | Suitable for less complex coding tasks. |
| 7 | AutoCoder.IQ4_K.gguf | 26.43 | Average, offers different approaches but can be incorrect | Generates useful short code segments | Use Precise Mode for better results. |
| 8 | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf | 2.55 | Poor, too slow to be practical in day-to-day workflows | Occasionally helps generate ideas | Speed is a significant limitation. |
-| 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in
+| 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in outputting the same code | Decent, but often repeats code | Similar problem to DeepSeekV2; not recommended for my complex tasks. |
| 10 | DeepSeekV2 Coder Lite Instruct Q_8L | N/A | Poor, repeats code similar to other models in its family | Not as effective in my context | Not recommended overall based on my criteria. |

+Code debugging prompt template used:
```
<code>
<current output>
@@ -49,12 +49,63 @@ Think step by step. Solve this problem without removing any existing functionali

<br>

+## Generation Kwargs
+
+Balanced Mode:
+```python
+generation_kwargs = {
+    "max_tokens": 8192,
+    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+    "temperature": 0.7,
+    "stream": True,
+    "top_k": 50,
+    "top_p": 0.95,
+}
+```
+
+Precise Mode:
+```python
+generation_kwargs = {
+    "max_tokens": 8192,
+    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+    "temperature": 0.0,
+    "stream": True,
+    "top_p": 1.0,
+}
+```
+
+Qwen2 7B:
+```python
+generation_kwargs = {
+    "max_tokens": 8192,
+    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+    "temperature": 0.4,
+    "stream": True,
+    "top_k": 20,
+    "top_p": 0.8,
+}
+```
+
+Other variations of temperature, top_k, and top_p were also tested 5-8 times per model, but I'm sticking with the above three.
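A minimal sketch of how these kwargs can be consumed, again assuming a llama-cpp-python backend (an assumption; the keys map directly onto its `create_chat_completion` parameters):

```python
# Illustrative only: stream a reply using one of the kwargs dicts above.
from llama_cpp import Llama

llm = Llama(model_path="./codestral-22b-v0.1-IQ6_K.gguf", n_ctx=8192)

# Placeholder prompt; the actual code-debugging template is shown (truncated) earlier.
prompt = "<code>\n...\n<current output>\n...\nThink step by step."

for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    **generation_kwargs,  # Balanced, Precise, or Qwen2 7B settings from above
):
    delta = chunk["choices"][0].get("delta", {})
    print(delta.get("content", ""), end="", flush=True)
```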
+
+<br>
+
## New Discoveries

-The following are tested, but may not generalize well to other workflows.
+The following are tested in my workflow, but may not generalize well to other workflows.

- In general, if there's an error in the code, copy-pasting the last few rows of the stack trace to the LLM seems to work (see the sketch after this list).
- Adding "Now, reflect." sometimes allows Claude-3.5-Sonnet to generate the correct solution.
- If GPT-4o reasons correctly in its first response and the conversation is then sent to GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o.
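A minimal sketch of the first and last points, assuming the OpenAI Python client (the prompts are placeholders):

```python
# Illustrative only: feed back the tail of a stack trace, and let GPT-4o-mini
# continue a conversation that GPT-4o started.
import traceback
from openai import OpenAI

client = OpenAI()

def run_generated_code() -> None:
    """Hypothetical stand-in for executing the code under debug."""
    raise ValueError("shapes (3, 4) and (5,) are not aligned")

messages = [{"role": "user", "content": "Debugging request built from the prompt template above."}]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

try:
    run_generated_code()
except Exception:
    # Only the last few rows of the stack trace, per the first point above.
    tail = "\n".join(traceback.format_exc().splitlines()[-5:])
    messages.append({
        "role": "user",
        "content": f"The code now raises:\n{tail}\nThink step by step and fix it without removing existing functionality.",
    })
    # Hand the same conversation to the cheaper mini model, per the last point above.
    second = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(second.choices[0].message.content)
```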

+<br>
+
+## Download
+
+```
+pip install -U "huggingface_hub[cli]"
+```
+
+```
+huggingface-cli download FredZhang7/claudegpt-code-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./
+```
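A minimal sketch of the same download from Python, using the `huggingface_hub` package installed above:

```python
# Illustrative alternative to the CLI command above.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-debugger-v0.1",
    filename="codestral-22b-v0.1-IQ6_K.gguf",
    local_dir="./",
)
print(local_path)  # path to the downloaded GGUF file
```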