---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
---
# CodeRankEmbed

CodeRankEmbed is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms a range of open-source and proprietary code embedding models across code retrieval benchmarks.
## Performance Benchmarks

| Name | Parameters | CSN | CoIR |
|---|---|---|---|
| CodeRankEmbed | 137M | 77.9 | 60.1 |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |
## Usage

**Important:** the query prompt must include the following task instruction prefix: "Represent this query for searching relevant code".
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix; code snippets do not.
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
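To actually rank code snippets for a query, compare the two sets of embeddings with cosine similarity. A minimal sketch using NumPy on placeholder vectors (the arrays here are illustrative stand-ins for the output of `model.encode` above, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    # L2-normalize each row, then take the pairwise dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings standing in for model.encode(...) output.
query_embeddings = np.array([[0.1, 0.9, 0.2]])
code_embeddings = np.array([[0.1, 0.8, 0.3],   # relevant snippet
                            [0.9, 0.1, 0.0]])  # unrelated snippet

scores = cosine_similarity(query_embeddings, code_embeddings)  # shape (1, 2)
best = int(np.argmax(scores, axis=1)[0])  # index of the best-matching snippet
```

The snippet with the highest score is the retrieval result; with real embeddings you would sort all candidate snippets by their score against the query.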