metadata
base_model:
  - Snowflake/snowflake-arctic-embed-m-long

CodeRankEmbed

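CodeRankEmbed is a 137M-parameter bi-encoder for code retrieval that supports a context length of 8192 tokens. On the CSN and CoIR benchmarks reported below, it outperforms the listed open-source and proprietary code embedding models, including CodeSage-Large, which has roughly ten times as many parameters.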

Performance Benchmarks

Name             Parameters  CSN   CoIR
CodeRankEmbed    137M        77.9  60.1
CodeSage-Large   1.3B        71.2  59.4
Jina-Code-v2     161M        67.2  58.4
CodeT5+          110M        74.2  45.9
Voyage-Code-002  Unknown     68.5  56.3

Usage

Important: every query must begin with the task instruction prefix "Represent this query for searching relevant code", followed by a colon and a space, as in the example below. Code snippets are encoded without a prefix.

from sentence_transformers import SentenceTransformer

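# trust_remote_code=True lets sentence-transformers run the custom model code shipped in the repository.
model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)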
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]
# Queries carry the instruction prefix; code snippets are encoded as-is.
query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
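
The example above stops at printing the embeddings. For retrieval, the query embedding still has to be compared against the code embeddings. The following is a minimal sketch of that step, assuming cosine similarity is the scoring function (the model card does not state which similarity to use). The helpers `add_query_prefix` and `rank_by_cosine` are hypothetical names introduced here for illustration, and toy vectors stand in for real embeddings so the snippet runs without downloading the model.

```python
import numpy as np

# Prefix quoted in the Usage section; the ": " separator follows the example query.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def add_query_prefix(queries):
    """Prepend the task instruction to each query that lacks it (hypothetical helper)."""
    return [q if q.startswith(QUERY_PREFIX) else QUERY_PREFIX + q for q in queries]


def rank_by_cosine(query_embedding, code_embeddings):
    """Return (indices sorted best-first, cosine scores) for one query (hypothetical helper).

    Cosine similarity is an assumption; the model card does not name a metric.
    """
    q = np.asarray(query_embedding, dtype=np.float64)
    c = np.asarray(code_embeddings, dtype=np.float64)
    q = q / np.linalg.norm(q)                          # unit-length query vector
    c = c / np.linalg.norm(c, axis=1, keepdims=True)   # unit-length row per snippet
    scores = c @ q                                     # one cosine score per snippet
    order = np.argsort(-scores)                        # highest score first
    return order, scores


# Toy 3-dimensional vectors stand in for real embeddings so the sketch runs offline.
toy_query = np.array([1.0, 0.0, 0.0])
toy_codes = np.array([[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]])
order, scores = rank_by_cosine(toy_query, toy_codes)
print(order, scores)
```

With the real model, the same helpers would be called as `rank_by_cosine(model.encode(add_query_prefix(["Calculate the n-th Fibonacci number"]))[0], model.encode(codes))`, which returns the snippet indices ordered from most to least similar.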