FredZhang7 committed (verified)
Commit 613406c · Parent: e96d022

Add download, generation mode

Files changed (1): README.md (+56, -5)

README.md CHANGED
@@ -16,7 +16,7 @@ An example of a complex debugging scenario is where you build library A on top o

![](./model_v0.1_throughput_comparison.png)

- IQ in model names mean Imatrix Quantizations. For performance comparison against regular GGUF, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).
+ IQ here refers to Imatrix Quantization. For performance comparison against regular GGUF, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).

<br>

@@ -34,10 +34,10 @@ Evaluated on two programming tasks: debugging and generation. It may be a bit su
| 6 | GPT-4o-mini | N/A | Decent, but struggles with complex debugging tasks | Reliable for shorter or simpler code generation tasks | Suitable for less complex coding tasks. |
| 7 | AutoCoder.IQ4_K.gguf | 26.43 | Average, offers different approaches but can be incorrect | Generates useful short code segments | Use Precise Mode for better results. |
| 8 | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf | 2.55 | Poor, too slow to be practical in day-to-day workflows | Occasionally helps generate ideas | Speed is a significant limitation. |
- | 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in debugging | Decent, but often repeats code | Similar problem to DeepSeekV2, not recommended for my complex tasks. |
+ | 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in outputting the same code | Decent, but often repeats code | Similar problem to DeepSeekV2, not recommended for my complex tasks. |
| 10 | DeepSeekV2 Coder Lite Instruct Q_8L | N/A | Poor, repeats code similar to other models in its family | Not as effective in my context | Not recommended overall based on my criteria. |

- Prompt format:
+ Code debugging prompt template used:
```
<code>
<current output>
@@ -49,12 +49,63 @@ Think step by step. Solve this problem without removing any existing functionali

<br>

+ ## Generation Kwargs
+
+ Balanced Mode:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.7,
+     "stream":True,
+     "top_k":50,
+     "top_p":0.95,
+ }
+ ```
+
+ Precise Mode:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.0,
+     "stream":True,
+     "top_p":1.0,
+ }
+ ```
+
+ Qwen2 7B:
+ ```python
+ generation_kwargs = {
+     "max_tokens":8192,
+     "stop":["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
+     "temperature":0.4,
+     "stream":True,
+     "top_k":20,
+     "top_p":0.8,
+ }
+ ```
+
+ Other variations in temperature, top_k, and top_p were also tested 5-8 times per model, but I'm sticking with the above three.
+
+ <br>
+
## New Discoveries

- The following are tested, but may not generalize well to other workflows.
+ The following are tested in my workflow, but may not generalize well to other workflows.

- In general, if there's an error in the code, copy-pasting the last few lines of the stack trace to the LLM seems to work.
- Adding "Now, reflect." sometimes allows Claude-3.5-Sonnet to generate the correct solution.
- If GPT-4o reasons correctly in its first response and the conversation is then sent to GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o.

- <br>
+ <br>
+
+ ## Download
+
+ ```
+ pip install -U "huggingface_hub[cli]"
+ ```
+
+ ```
+ huggingface-cli download FredZhang7/claudegpt-code-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./
+ ```
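
The diff only shows the first lines of the debugging prompt template (`<code>`, `<current output>`) plus the truncated instruction in the hunk header, so the full template is not visible here. The following is a minimal, hypothetical sketch of how such a prompt might be assembled; the `expected_output` field and the exact wording of the final instruction are assumptions, not taken from the README.

```python
# Hypothetical sketch only: the README diff truncates the middle of the prompt
# template, so the fields and the final instruction below are assumptions.
def build_debug_prompt(code: str, current_output: str, expected_output: str) -> str:
    return (
        f"{code}\n\n"
        f"{current_output}\n\n"
        f"{expected_output}\n\n"  # assumed field; the real template may differ
        "Think step by step. Solve this problem without removing any existing functionality."
    )

prompt = build_debug_prompt(
    code=open("buggy_script.py").read(),
    current_output="TypeError: 'NoneType' object is not iterable",
    expected_output="A list of parsed records, with no exception raised.",
)
```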
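The diff defines three `generation_kwargs` presets but does not show the inference code that consumes them. Below is a minimal sketch assuming llama-cpp-python as the runtime (a common choice for GGUF files, not confirmed by the README); it loads the GGUF from the Download section and streams a completion with the Balanced Mode settings. The `n_ctx` and `n_gpu_layers` values are illustrative assumptions.

```python
# Sketch under assumptions: llama-cpp-python runtime, model file already downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./codestral-22b-v0.1-IQ6_K.gguf",  # file from the Download section
    n_ctx=8192,        # assumed context size, not specified in the README
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

# Balanced Mode preset from the diff above.
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>",
             "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}

prompt = "..."  # e.g. a debugging prompt built from the template shown earlier in the diff
for chunk in llm.create_completion(prompt, **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
```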
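For scripted setups, the CLI download in the diff can also be done from Python with `huggingface_hub`; the snippet below is roughly equivalent to the `huggingface-cli download` command above.

```python
from huggingface_hub import hf_hub_download

# Roughly equivalent to the huggingface-cli command in the Download section.
model_path = hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-debugger-v0.1",
    filename="codestral-22b-v0.1-IQ6_K.gguf",
    local_dir="./",
)
print(model_path)  # local path to the GGUF file
```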