OpenCodeEdit Series Models Quick Start Guide (OpenCodeEdit-DSC-6.7B)

For more details, please refer to our arXiv paper.

We advise you to use the latest version of transformers.

Requirements:

transformers
torchvision
torchaudio
tensorboard
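
Assuming a standard pip environment (this card does not pin versions), the requirements above can be installed with:

```shell
pip install --upgrade transformers torchvision torchaudio tensorboard
```

torchvision and torchaudio pull in a matching torch build as a dependency.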

Model Overview

OpenCodeEdit-DSC-6.7B is fine-tuned from DeepSeek-Coder-6.7B-Base and has the following features:

  • Type: Causal Language Model

  • Number of Parameters: 6.7B
  • Number of Parameters (Non-Embedding): 5.9B
  • Number of Layers: 32
  • Number of Attention Heads (GQA): 32 for Q and 4 for KV
  • Context Length: 16,384 tokens

The model expects prompts in the following template; please construct your inputs accordingly.

Prompt Template:

System Prompt:
You are a code editor. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose.

User Prompt:
## Code Before:
{pre_edit_code}

## Instruction:
{instruction}

## Code After:
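
The template above can be assembled with a small helper like the following (the function name `build_user_prompt` is ours for illustration, not part of any API):

```python
def build_user_prompt(pre_edit_code: str, instruction: str) -> str:
    """Fill the user-prompt template with the original code and the edit instruction."""
    return (
        "## Code Before:\n"
        f"{pre_edit_code}\n\n"
        "## Instruction:\n"
        f"{instruction}\n\n"
        "## Code After:\n"
    )

prompt = build_user_prompt("def add(a, b):\n    return a + b", "Add type hints.")
print(prompt)
```

The trailing "## Code After:" header is left open so the model completes it with the edited code.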

The following code snippet illustrates how to use the model to generate content from the given inputs.

import re
from transformers import AutoModelForCausalLM, AutoTokenizer

def extract_first_python_block(text: str) -> str:
    """Return the contents of the first ```python fenced block in text, or "" if none."""
    pattern = r"```python\s*(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return ""

model_name = "zkzhang88/OpenCodeEdit-DSC-6.7B"  # also available: "zkzhang88/OpenCodeEdit-Qwen3-8B", "zkzhang88/OpenCodeEdit-Qwen2.5-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

pre_edit_code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
SYSTEM_PROMPT = "You are a code editor. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose."
instruction = "Optimize the calculation method for the Fibonacci sequence by reducing recursive calls and employing dynamic programming to enhance efficiency."

formatted_input = f"""
## Code Before:
{pre_edit_code}
## Instruction:
{instruction}
## Code After:
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_input}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)


# Drop the prompt tokens so that only the newly generated text is decoded
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(extract_first_python_block(content))
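
The `extract_first_python_block` helper can be sanity-checked without loading the model; the model reply below is fabricated for illustration:

````python
import re

def extract_first_python_block(text: str) -> str:
    pattern = r"```python\s*(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return ""

# A made-up model reply wrapping its answer in a fenced python block
sample = "Here is the edit:\n```python\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n```\n"
print(extract_first_python_block(sample))            # prints the code between the fences
print(extract_first_python_block("no fenced block"))  # prints an empty string
````

If the model ever answers without fences, the helper returns "", so callers may want to fall back to the raw decoded text in that case.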

Citation

If you find our work helpful, please cite our paper.

@misc{zhang2025generatinghighqualitydatasetscode,
      title={Generating High-Quality Datasets for Code Editing via Open-Source Language Models}, 
      author={Zekai Zhang and Mingwei Liu and Zhenxi Chen and Linxi Liang and Yuxuan Chen and Guangsheng Ou and Yanlin Wang and Dan Li and Xin Peng and Zibin Zheng},
      year={2025},
      eprint={2509.25203},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2509.25203}, 
}