---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
---


# CodeRankEmbed

`CodeRankEmbed` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms both open-source and proprietary code embedding models across a range of code retrieval tasks.


# Performance Benchmarks

| Model             | Parameters | CSN      | CoIR     |
| :---------------- | :--------: | :------: | :------: |
| **CodeRankEmbed** | 137M       | **77.9** | **60.1** |
| CodeSage-Large    | 1.3B       | 71.2     | 59.4     |
| Jina-Code-v2      | 161M       | 67.2     | 58.4     |
| CodeT5+           | 110M       | 74.2     | 45.9     |
| Voyage-Code-002   | Unknown    | 68.5     | 56.3     |


# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code". Code snippets are embedded without a prefix.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix.
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']

# Code snippets are embedded as-is, with no prefix.
codes = ["""def fibonacci(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
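Once query and code embeddings are computed, retrieval ranks snippets by cosine similarity. A minimal sketch of that ranking step, using NumPy on placeholder vectors standing in for the `model.encode` outputs (the shapes and random values here are assumptions for illustration, not model outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    # a: (n_queries, dim), b: (n_codes, dim) -> (n_queries, n_codes) score matrix
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Placeholder embeddings standing in for model.encode(...) outputs.
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(1, 768))
code_embeddings = rng.normal(size=(3, 768))

scores = cosine_similarity(query_embeddings, code_embeddings)
best = int(np.argmax(scores, axis=1)[0])
print("Best-matching snippet index:", best)
```

Ranking all candidate snippets by these scores and returning the top-k is the standard bi-encoder retrieval setup.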