---
license: other
license_name: apache-2.0-or-mnpl-0.1
license_link: https://mistral.ai/licences/MNPL-0.1.md
tags:
- code
- generation
- debugging
- editing
pipeline_tag: text-generation
---

# Code Logic Debugger v0.1

Hardware requirement for running the models in this repo at roughly ChatGPT (GPT-4o) inference speed: >=24 GB VRAM.

Note: The following results are based solely on my day-to-day workflows on an RTX 3090. My goal was to run private models that could beat GPT-4o and Claude-3.5 at code debugging and generation, so that I could ‘load balance’ between OpenAI/Anthropic’s free plans and local models to avoid hitting rate limits, and upload as few lines of my code and ideas to their servers as possible.

An example of a complex debugging scenario: you build library A on top of library B, which requires library C as a dependency, and the root cause of a bug turns out to be a variable in library C. In cases like this, the following workflow guided me to correctly identify the problem.

<br>

## Throughput

![](./model_v0.1_throughput_comparison.png)

IQ here refers to Importance Matrix Quantization. For a performance comparison against regular GGUF quants, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/). For more info on the technique, please see [this GitHub discussion](https://github.com/ggerganov/llama.cpp/discussions/5006/).

<br>

## Personal Preference Ranking

Models were evaluated on two programming tasks: debugging and generation. The ranking may be a bit subjective. `DeepSeekV2 Coder Instruct` is ranked lower because DeepSeek's Privacy Policy states that they may collect "text input, prompt" and there is no way to opt out.


Code debugging/editing prompt template used:
```
<code>
<current output>
<the problem description of the current output>
<expected output (in English is fine)>
<any hints>
Think step by step. Solve this problem without removing any existing functionalities, logic, or checks, except any incorrect code that interferes with your edits.
```
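
For illustration, here is a minimal sketch of filling in the template above as a single prompt string. The `build_debug_prompt` helper and its parameter names are hypothetical, not part of any model or library:
```python
def build_debug_prompt(
    code: str,
    current_output: str,
    problem: str,
    expected_output: str,
    hints: str = "",
) -> str:
    """Assemble the code debugging/editing prompt in the order shown above."""
    sections = [code, current_output, problem, expected_output]
    if hints:
        sections.append(hints)
    # Closing instruction from the template.
    sections.append(
        "Think step by step. Solve this problem without removing any existing "
        "functionalities, logic, or checks, except any incorrect code that "
        "interferes with your edits."
    )
    return "\n\n".join(sections)
```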

| **Rank** | **Model Name**                               | **Token Speed (tokens/s)** | **Debugging Performance**                                             | **Code Generation Performance**                                      | **Notes**                                                                                 |
|----------|----------------------------------------------|----------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| 1*       | codestral-22b-v0.1-IQ6_K.gguf (this repo)    | 34.21                       | Excellent at complex debugging, often surpasses GPT-4o and Claude-3.5  | Good, but may not be on par with GPT-4o                                        | One of the best overall for debugging in my workflow; use Balanced Mode.                  |
| 1*       | Claude-3.5-Sonnet                            | N/A                         | Poor in complex debugging compared to Codestral                         | Excellent, better in design and more creative than GPT-4o in code generation  | Great for code generation, but weaker in debugging.                                       |
| 1*       | GPT-4o                                       | N/A                         | Good at complex debugging but can be outperformed by Codestral          | Excellent, generally reliable for code generation, more knowledgeable         | Balanced performance between code debugging and generation.                               |
| 4        | DeepSeekV2 Coder Instruct                    | N/A                         | Good, but outputs the same code in complex scenarios                    | Excellent at general code generation, rivals GPT-4o                           | Excellent at code generation, but has data privacy concerns as per Privacy Policy.        |
| 5*       | Qwen2-7b-Instruct bf16                       | 78.22                       | Average, can think of correct approaches                                | Sometimes helps generate new ideas                                            | High speed, useful for generating ideas.                                                  |
| 5*       | AutoCoder.IQ4_K.gguf (this repo)             | 26.43                       | Excellent at solutions that require one to few lines of edits           | Generates useful short code segments                                          | Try Precise Mode or Balanced Mode.                                                      |
| 7        | GPT-4o-mini                                  | N/A                         | Decent, but struggles with complex debugging tasks                      | Reliable for shorter or simpler code generation tasks                         | Suitable for less complex coding tasks.                                                   |
| 8        | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf      | 2.55                        | Poor, occasionally helps generate ideas                                 | ---                                                                           | Speed is a significant limitation.                                                        |
| 9        | Trinity-2-Codestral-22B-Q6_K_L               | N/A                         | Poor, similar issues to DeepSeekV2 in outputting the same code          | ---                                                                           | Similar problem to DeepSeekV2, not recommended for my complex tasks.                      |
| 10       | DeepSeekV2 Coder Lite Instruct Q_8L          | N/A                         | Poor, repeats code similar to other models in its family                | Not as effective in my context                                                | Not recommended overall based on my criteria.                                             |


<br>

## Generation Kwargs

Balanced Mode:
```python
generation_kwargs = {
    "max_tokens":8192,
    "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature":0.7,
    "stream":True,
    "top_k":50,
    "top_p":0.95,
}
```

Precise Mode:
```python
generation_kwargs = {
    "max_tokens":8192,
    "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature":0.0,
    "stream":True,
    "top_p":1.0,
}
```

Qwen2 7B:
```python
generation_kwargs = {
    "max_tokens":8192,
    "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature":0.4,
    "stream":True,
    "top_k":20,
    "top_p":0.8,
}
```

Other variations of temperature, top_k, and top_p were also tested 5-8 times per model, but I'm sticking with the three configurations above.
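
Below is a minimal sketch of how these kwargs could be passed to a local GGUF model. It assumes llama-cpp-python as the inference backend; the model path, context size, and prompt are placeholders:
```python
from llama_cpp import Llama

# Load the downloaded GGUF file; adjust n_ctx and n_gpu_layers to fit your VRAM.
llm = Llama(
    model_path="./codestral-22b-v0.1-IQ6_K.gguf",
    n_ctx=16384,
    n_gpu_layers=-1,  # offload all layers to the GPU
)

# Balanced Mode kwargs from above.
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<eos>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}

# With stream=True, create_completion returns an iterator of partial chunks.
for chunk in llm.create_completion(prompt="<your debugging prompt>", **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
```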

<br>

## New Discoveries

The following are tested in my workflow, but may not generalize well to other workflows.

- In general, if there's an error in the code, copy-pasting the last few lines of the stack trace (excluding the library-internal frames) to the LLM seems to work.
- Adding "Reflect." after a failed attempt at code generation sometimes allows Claude-3.5-Sonnet to generate the correct version.
- If GPT-4o reasons correctly in its first response and the conversation is then continued with GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o (see the sketch after this list).
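
A minimal sketch of that GPT-4o to GPT-4o-mini handoff, assuming the official `openai` Python client; the prompts are illustrative placeholders:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user", "content": "<debugging prompt for the hard part>"}]

# First turn: let GPT-4o do the heavy reasoning.
first = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turns: continue the same conversation with the cheaper model.
messages.append({"role": "user", "content": "<follow-up question>"})
followup = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(followup.choices[0].message.content)
```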

<br>

## License

A reminder that `codestral-22b-v0.1-IQ6_K.gguf` should only be used for non-commercial projects.

Please use `Qwen2-7b-Instruct bf16` and `AutoCoder.IQ4_K.gguf` as alternatives for commercial activities.

<br>

## Download

```
pip install -U "huggingface_hub[cli]"
```

Commercial use:
```
huggingface-cli download FredZhang7/claudegpt-code-logic-debugger-v0.1 --include "AutoCoder.IQ4_K.gguf" --local-dir ./
```

Non-commercial use (e.g. testing, research, personal, or evaluation purposes):
```
huggingface-cli download FredZhang7/claudegpt-code-logic-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./
```
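
If you prefer doing this from Python instead of the CLI, a rough equivalent using the `huggingface_hub` library (same repo and filenames as above) would be:
```python
from huggingface_hub import hf_hub_download

# Commercial use:
hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-logic-debugger-v0.1",
    filename="AutoCoder.IQ4_K.gguf",
    local_dir="./",
)

# Non-commercial use:
hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-logic-debugger-v0.1",
    filename="codestral-22b-v0.1-IQ6_K.gguf",
    local_dir="./",
)
```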