FredZhang7 committed (verified)
Commit 613406c · Parent: e96d022

Add download, generation mode

Files changed (1): README.md (+56, -5)

README.md CHANGED
@@ -16,7 +16,7 @@ An example of a complex debugging scenario is where you build library A on top o

![](./model_v0.1_throughput_comparison.png)

- IQ in model names mean Imatrix Quantizations. For performance comparison against regular GGUF, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).
+ IQ here refers to Imatrix Quantization. For performance comparison against regular GGUF, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).

<br>

@@ -34,10 +34,10 @@ Evaluated on two programming tasks: debugging and generation. It may be a bit su
| 6 | GPT-4o-mini | N/A | Decent, but struggles with complex debugging tasks | Reliable for shorter or simpler code generation tasks | Suitable for less complex coding tasks. |
| 7 | AutoCoder.IQ4_K.gguf | 26.43 | Average, offers different approaches but can be incorrect | Generates useful short code segments | Use Precise Mode for better results. |
| 8 | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf | 2.55 | Poor, too slow to be practical in day-to-day workflows | Occasionally helps generate ideas | Speed is a significant limitation. |
- | 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in debugging | Decent, but often repeats code | Similar problem to DeepSeekV2, not recommended for my complex tasks. |
+ | 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in outputting the same code | Decent, but often repeats code | Similar problem to DeepSeekV2, not recommended for my complex tasks. |
| 10 | DeepSeekV2 Coder Lite Instruct Q_8L | N/A | Poor, repeats code similar to other models in its family | Not as effective in my context | Not recommended overall based on my criteria. |

- Prompt format:
+ Code debugging prompt template used:
```
<code>
<current output>
@@ -49,12 +49,63 @@ Think step by step. Solve this problem without removing any existing functionali

<br>

+ ## Generation Kwargs
+
+ Balanced Mode:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.7,
+     "stream":True,
+     "top_k":50,
+     "top_p":0.95,
+ }
+ ```
+
+ Precise Mode:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.0,
+     "stream":True,
+     "top_p":1.0,
+ }
+ ```
+
+ Qwen2 7B:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.4,
+     "stream":True,
+     "top_k":20,
+     "top_p":0.8,
+ }
+ ```
+
+ Other variations in temperature, top_k, and top_p were also tested 5-8 times per model, but I'm sticking with the above three.
+
+ <br>
+
## New Discoveries

- The following are tested, but may not generalize well to other workflows.
+ The following are tested in my workflow, but may not generalize well to other workflows.

- In general, if there's an error in the code, copy-pasting the last few lines of the stack trace to the LLM seems to work.
- Adding "Now, reflect." sometimes allows Claude-3.5-Sonnet to generate the correct solution.
- If GPT-4o reasons correctly in its first response and the conversation is then sent to GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o.

- <br>
+ <br>
+
+ ## Download
+
+ ```
+ pip install -U "huggingface_hub[cli]"
+ ```
+
+ ```
+ huggingface-cli download FredZhang7/claudegpt-code-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./
+ ```
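
The diff only shows the first lines of the debugging prompt template (`<code>`, `<current output>`) plus the truncated instruction in the hunk header, so the full template is not visible here. The following is a minimal, hypothetical sketch of how such a prompt might be assembled; the `expected_output` field and the exact wording of the final instruction are assumptions, not taken from the README.

```python
# Hypothetical sketch only: the README diff truncates the middle of the prompt
# template, so the fields and the final instruction below are assumptions.
def build_debug_prompt(code: str, current_output: str, expected_output: str) -> str:
    return (
        f"{code}\n\n"
        f"{current_output}\n\n"
        f"{expected_output}\n\n"  # assumed field; the real template may differ
        "Think step by step. Solve this problem without removing any existing functionality."
    )

prompt = build_debug_prompt(
    code=open("buggy_script.py").read(),
    current_output="TypeError: 'NoneType' object is not iterable",
    expected_output="A list of parsed records, with no exception raised.",
)
```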
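The diff defines three `generation_kwargs` presets but does not show the inference code that consumes them. Below is a minimal sketch assuming llama-cpp-python as the runtime (a common choice for GGUF files, not confirmed by the README); it loads the GGUF from the Download section and streams a completion with the Balanced Mode settings. The `n_ctx` and `n_gpu_layers` values are illustrative assumptions.

```python
# Sketch under assumptions: llama-cpp-python runtime, model file already downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./codestral-22b-v0.1-IQ6_K.gguf",  # file from the Download section
    n_ctx=8192,        # assumed context size, not specified in the README
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

# Balanced Mode preset from the diff above.
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>",
             "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}

prompt = "..."  # e.g. a debugging prompt built from the template shown earlier in the diff
for chunk in llm.create_completion(prompt, **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
```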
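For scripted setups, the CLI download in the diff can also be done from Python with `huggingface_hub`; the snippet below is roughly equivalent to the `huggingface-cli download` command above.

```python
from huggingface_hub import hf_hub_download

# Roughly equivalent to the huggingface-cli command in the Download section.
model_path = hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-debugger-v0.1",
    filename="codestral-22b-v0.1-IQ6_K.gguf",
    local_dir="./",
)
print(model_path)  # local path to the GGUF file
```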