Update README.md
---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---

# cadet-embed-base-v1

**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned **from `intfloat/e5-base-unsupervised`** with

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`)
* **purely synthetic queries** (generated by Llama-3.1 8B: questions, claims, titles, keywords, zero-shot & few-shot web queries) over 400k passages total from the MS MARCO, DBpedia and Wikipedia corpora.

The result: highly effective BERT-base retrieval. Toy sketches of both training ingredients are shown below.
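
For intuition, listwise distillation trains the student so that its similarity scores over a candidate list for a query match the teacher reranker's score distribution. Below is a minimal sketch of such an objective (a KL divergence over softmaxed score lists); the function name, temperature, and toy scores are illustrative assumptions, not the exact training recipe:

```python
import torch
import torch.nn.functional as F

def listwise_distill_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between the teacher's and the student's score
    distributions over the same candidate list for each query."""
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: one query, four candidate passages
teacher_scores = torch.tensor([[4.2, 1.3, 0.5, -2.0]])  # e.g. cross-encoder reranker scores
student_scores = torch.tensor([[0.71, 0.55, 0.52, 0.30]], requires_grad=True)  # bi-encoder similarities

loss = listwise_distill_loss(student_scores, teacher_scores)
loss.backward()  # gradients flow back into the student's scores
print(loss.item())
```

Similarly, synthetic query generation amounts to prompting an instruction-tuned LLM once per passage and query type. The sketch below uses the `transformers` text-generation pipeline; the prompt wording and decoding settings are placeholders, not the prompts used to train this model:

```python
from transformers import pipeline

# NOTE: the prompt below is an illustrative placeholder
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

passage = "Paris is the capital and largest city of France."
prompt = (
    "Write a short search question that the following passage answers.\n\n"
    f"Passage: {passage}\n\nQuestion:"
)
out = generator(prompt, max_new_tokens=32, do_sample=False)
print(out[0]["generated_text"])
```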

---

## Quick start

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

# E5-style models expect "query: " and "passage: " prefixes
query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode; normalize_embeddings=True L2-normalises the output vectors
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)  # shape (n_passages, dim)

# Cosine similarity = dot product of normalised vectors
scores = np.dot(p_embs, q_emb)  # shape (n_passages,)

# Rank passages by descending similarity
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```
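
For larger corpora you would typically encode all passages once and search in batch. As a minimal sketch (the three-passage corpus and `top_k=3` are placeholders), sentence-transformers' `util.semantic_search` performs the same cosine-similarity ranking:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

corpus = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France.",
]
corpus_embs = model.encode(corpus, normalize_embeddings=True, convert_to_tensor=True)

queries = ["query: capital of France"]
query_embs = model.encode(queries, normalize_embeddings=True, convert_to_tensor=True)

# Cosine-similarity search; returns the top_k hits per query
hits = util.semantic_search(query_embs, corpus_embs, top_k=3)
for hit in hits[0]:
    print(f"{hit['score']:.3f}\t{corpus[hit['corpus_id']]}")
```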

If you use this model, please cite:

```bibtex
@article{tamber2025teaching,
  title={Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation},
  author={Tamber, Manveer Singh and Kazi, Suleman and Sourabh, Vivek and Lin, Jimmy},
  journal={arXiv preprint arXiv:2502.19712},
  year={2025}
}
```