---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
---

# CodeRankEmbed

`CodeRankEmbed` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms open-source and proprietary code embedding models on a range of code retrieval tasks.

# Performance Benchmarks

| Name | Parameters | CSN (CodeSearchNet) | CoIR |
| :--- | :--- | :---: | :---: |
| **CodeRankEmbed** | 137M | **77.9** | **60.1** |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix; code documents are embedded as-is.
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
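
For retrieval, candidates are typically ranked by cosine similarity between the query and code embeddings. Below is a minimal sketch continuing the snippet above; the `normalize_embeddings` flag is standard `sentence_transformers` `encode()` usage, not something specific to this model.

```python
import numpy as np

# Re-encode with L2 normalization so a plain dot product equals cosine similarity.
query_embeddings = model.encode(queries, normalize_embeddings=True)
code_embeddings = model.encode(codes, normalize_embeddings=True)

# Score every code snippet against each query and rank best-first.
scores = query_embeddings @ code_embeddings.T  # shape: (n_queries, n_codes)
for idx in np.argsort(-scores[0]):
    print(f"score={scores[0, idx]:.4f}")
```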