---
license: apache-2.0
---

# Code Debugger v0.1

Hardware requirement: >= 24 GB of VRAM (e.g. an RTX 3090) to run the models below at roughly ChatGPT (GPT-4o) level inference speed.

Note: the following results are based only on my day-to-day workflows. My goal was to run private models that could beat GPT-4o and Claude-3.5 at code debugging and generation, both to "load balance" between OpenAI/Anthropic's free plans and local models (to avoid hitting rate limits) and to upload as few lines of my code and ideas to their servers as possible.

By a complex debugging task, I mean scenarios where you build library A on top of library B, which requires library C as a dependency, but the root cause is a variable in library C. In such cases, the following workflow guided me to correctly identify the problem.
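As a toy illustration of that scenario (all class and variable names below are hypothetical, invented for this sketch), the error surfaces in A while the actual fix is a single variable inside C:

```python
# "Library C": defines a module-level default that is silently wrong.
class LibC:
    MAX_RETRIES = 0  # root cause: 0 means the retry loop never runs

    def fetch(self):
        for _ in range(self.MAX_RETRIES):
            return "data"
        raise RuntimeError("no attempts made")


# "Library B": built on C, just forwards the call.
class LibB:
    def __init__(self):
        self.c = LibC()

    def load(self):
        return self.c.fetch()


# "Library A": built on B; the traceback starts here, far from the cause.
class LibA:
    def run(self):
        return LibB().load()


# The fix is not in A or B, but in C's variable:
LibC.MAX_RETRIES = 3
print(LibA().run())  # prints "data"
```

A model that only looks at library A's code cannot find this; the prompt has to carry enough of the dependency chain for the model to trace the failure down to C.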

<br>

## Personal Preference Ranking

Evaluated on two programming tasks: debugging and generation. The ranking may be somewhat subjective. `DeepSeekV2 Coder Instruct` is ranked lower because its privacy policy states that it may collect "text input, prompt", with no way to opt out.

| **Rank** | **Model Name** | **Token Speed (tokens/s)** | **Debugging Performance** | **Code Generation Performance** | **Notes** |
|----------|----------------|----------------------------|---------------------------|---------------------------------|-----------|
| 1 | codestral-22b-v0.1-IQ6_K.gguf (this model) | 34.21 | Excellent at complex debugging; often surpasses GPT-4o and Claude-3.5 | Good, but may not be on par with GPT-4o | Best overall for debugging in my workflow; use Balanced Mode. |
| 2 | Claude-3.5-Sonnet | N/A | Poor at complex debugging compared to Codestral | Excellent; better than GPT-4o at long code generation | Great for code generation, but weaker at debugging. |
| 3 | GPT-4o | N/A | Good at complex debugging, but can be outperformed by Codestral | Excellent; generally reliable for code generation | Balanced performance between debugging and generation. |
| 4 | DeepSeekV2 Coder Instruct | N/A | Poor; outputs the same code in complex scenarios | Great at general code generation; rivals GPT-4o | Excellent at code generation, but has data-privacy concerns per its Privacy Policy. |
| 5 | qwen2 7b instruct bf16 | 78.22 | Average; can think of correct approaches | Sometimes helps generate new ideas | High speed; useful for generating ideas. |
| 6 | GPT-4o-mini | N/A | Decent, but struggles with complex debugging tasks | Reliable for shorter or simpler code generation tasks | Suitable for less complex coding tasks. |
| 7 | AutoCoder.IQ4_K.gguf | 26.43 | Average; offers different approaches, but they can be incorrect | Generates useful short code segments | Use Precise Mode for better results. |
| 8 | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf | 2.55 | Poor; too slow to be practical in day-to-day workflows | Occasionally helps generate ideas | Speed is a significant limitation. |
| 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor; similar issues to DeepSeekV2 in debugging | Decent, but often repeats code | Same problem as DeepSeekV2; not recommended for my complex tasks. |
| 10 | DeepSeekV2 Coder Lite Instruct Q_8L | N/A | Poor; repeats code like other models in its family | Not as effective in my context | Not recommended overall based on my criteria. |

Prompt format:
```
<code>
<current output>
<the problem description of the current output>
<expected output (in English is fine)>
<any hints>
Think step by step. Solve this problem without removing any existing functionalities, logic, or checks, except any incorrect code that interferes with your edits.
```
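As an illustrative sketch of filling in this template programmatically (the `build_debug_prompt` helper and its parameter names are my own, not part of the model card), the sections are simply concatenated in order:

```python
# Hypothetical helper that assembles the debugging prompt template above.
def build_debug_prompt(code, current_output, problem, expected, hints=""):
    parts = [
        code,
        current_output,
        problem,
        expected,
        hints,
        "Think step by step. Solve this problem without removing any "
        "existing functionalities, logic, or checks, except any incorrect "
        "code that interferes with your edits.",
    ]
    # Drop empty sections (e.g. when there are no hints).
    return "\n".join(p for p in parts if p)


prompt = build_debug_prompt(
    code="def add(a, b): return a - b",
    current_output="add(2, 2) == 0",
    problem="The function subtracts instead of adding.",
    expected="add(2, 2) should return 4.",
)
print(prompt)
```

The closing instruction matters: without it, models tend to "simplify" the snippet by deleting checks that are unrelated to the bug.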

<br>

## Debugging with Reflection

The following are personal opinions.

In general, if there's an error in the code, pasting the last few rows of the stack trace into the LLM seems to work.
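For example, a small stdlib-only sketch of trimming a traceback to its last few rows before pasting it into a prompt (the function name is my own):

```python
import traceback


def last_rows_of_traceback(n=5):
    """Return the last n lines of the current exception's traceback."""
    lines = traceback.format_exc().rstrip().splitlines()
    return "\n".join(lines[-n:])


try:
    {}["missing"]  # provoke a KeyError for demonstration
except KeyError:
    snippet = last_rows_of_traceback(3)
    print(snippet)  # last line is: KeyError: 'missing'
```

The last rows usually name the failing file, line, and exception, which is what the model needs; the earlier frames mostly add noise and tokens.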

Adding "Now, reflect." sometimes allows Claude-3.5-Sonnet to generate the correct solution.
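The exchange can be sketched as a generic chat-message list (the message contents are placeholders and no specific client API is assumed; the reflection turn is just appended as a follow-up user message):

```python
# Two-turn "reflection" pattern: send the debugging prompt, keep the
# model's first attempt in the history, then ask it to reflect.
messages = [
    {"role": "user", "content": "<debugging prompt>"},
]
# ...after receiving the model's first answer:
messages.append({"role": "assistant", "content": "<model's first attempt>"})
messages.append({"role": "user", "content": "Now, reflect."})
```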