manveertamber committed · verified
Commit 82c932f · Parent(s): 8056d11

Update README.md

Files changed (1): README.md (+62, -3)

---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---

# cadet-embed-base-v1

**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned from **`intfloat/e5-base-unsupervised`** with:

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`), sketched below
* **purely synthetic queries** (generated by Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot & few-shot web queries) over 400k passages in total from the MSMARCO, DBPedia, and Wikipedia corpora

The result: highly effective BERT-base retrieval.
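
For context, listwise distillation trains the bi-encoder student to match the teacher rerankers' score distribution over each query's candidate passage list, rather than fitting passage labels in isolation. The snippet below is a minimal illustrative sketch of such an objective (a KL divergence between softmax-normalised score lists); the temperature and the toy scores are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    # Both tensors hold relevance scores for the SAME candidate passage
    # list of one query, shape (num_passages,).
    # Computes KL(teacher || student) over the softmax-normalised lists.
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="sum")

# Toy example: student = embedding dot products, teacher = reranker scores
student = torch.tensor([0.62, 0.10, 0.45, 0.05])
teacher = torch.tensor([9.1, 1.2, 7.8, 0.3])
print(listwise_distillation_loss(student, teacher))
```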

---

## Quick start
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode and L2-normalise, so cosine similarity reduces to a dot product
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)  # shape (n_passages, dim)

# Cosine similarity = dot product of normalised vectors
scores = np.dot(p_embs, q_emb)  # shape (n_passages,)

# Rank passages by score, highest first
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```
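
Note that, following the E5 convention of the base model, queries are prefixed with `query: ` and passages with `passage: `.

For retrieval over more than a handful of passages, the same normalised embeddings can be dropped into a vector index. Below is a minimal sketch using FAISS (`pip install faiss-cpu`); FAISS is an assumption of this example rather than a requirement of the model, and any inner-product index works since the vectors are unit-length:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Unit-length float32 vectors: inner product == cosine similarity
p_embs = model.encode(passages, normalize_embeddings=True)

index = faiss.IndexFlatIP(p_embs.shape[1])  # exact inner-product search
index.add(p_embs)

q_emb = model.encode(["query: capital of France"], normalize_embeddings=True)
scores, ids = index.search(q_emb, 2)  # top-2 passages
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}\t{passages[i]}")
```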

If you use this model, please cite:

```bibtex
@article{tamber2025teaching,
  title={Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation},
  author={Tamber, Manveer Singh and Kazi, Suleman and Sourabh, Vivek and Lin, Jimmy},
  journal={arXiv preprint arXiv:2502.19712},
  year={2025}
}
```