jacklin64 committed on
Commit
85c4e6d
·
1 Parent(s): a927e3e
README.md ADDED
@@ -0,0 +1,36 @@
+ Aggretriever is an encoder that aggregates both lexical and semantic text information into a single dense vector for dense retrieval. It is fine-tuned on the MS MARCO corpus with BM25 negative sampling, following the approach described in [Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval](https://arxiv.org/abs/2208.00511).
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/castorini/dhr/main/fig/aggretriever_teaser.png" width="600">
+ </p>
+
+ The associated GitHub repository for fine-tuning is available [here](https://github.com/castorini/dhr), and the Pyserini reproduction is [here]. The following variants are also available:
+
+ | Model | Initialization | MARCO Dev | Encoder Path |
+ |---|---|---|---|
+ | aggretriever-distilbert | distilbert-base-uncased | 34.1 | [castorini/aggretriever-distilbert](https://huggingface.co/castorini/aggretriever-distilbert) |
+ | aggretriever-cocondenser | Luyu/co-condenser-marco | 36.2 | [castorini/aggretriever-cocondenser](https://huggingface.co/castorini/aggretriever-cocondenser) |
+
+ ## Usage (HuggingFace Transformers)
+ The model is available directly through HuggingFace Transformers. We use the Aggretriever implementation from Pyserini [here](https://github.com/castorini/pyserini/blob/master/pyserini/encode/_aggretriever.py).
+
+ ```python
+ from pyserini.encode._aggretriever import AggretrieverQueryEncoder
+ from pyserini.encode._aggretriever import AggretrieverDocumentEncoder
+
+ model_name = 'castorini/aggretriever-distilbert'
+ query_encoder = AggretrieverQueryEncoder(model_name, device='cpu')
+ context_encoder = AggretrieverDocumentEncoder(model_name, device='cpu')
+
+ query = "Where was Marie Curie born?"
+ contexts = [
+     "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
+     "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
+ ]
+ # Compute embeddings
+ query_emb = query_encoder.encode(query)
+ ctx_emb = context_encoder.encode(contexts)
+ # Compute similarity scores using dot product
+ score1 = query_emb @ ctx_emb[0] # 47.667152
+ score2 = query_emb @ ctx_emb[1] # 39.054127
+ ```
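The dot-product scoring at the end of the README snippet extends naturally to ranking any number of passages. The following is a minimal sketch, assuming the embeddings are NumPy arrays as in the snippet; the `rank_by_dot_product` helper and the toy 3-dimensional vectors are hypothetical and only stand in for real Aggretriever embeddings.

```python
import numpy as np

def rank_by_dot_product(query_emb, ctx_embs):
    """Return (indices, scores) of contexts sorted by descending dot product."""
    scores = ctx_embs @ query_emb   # one score per context
    order = np.argsort(-scores)     # highest score first
    return order, scores[order]

# Toy vectors for illustration only (not real model outputs).
query_emb = np.array([1.0, 0.0, 2.0])
ctx_embs = np.array([
    [0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.5],
])

order, scores = rank_by_dot_product(query_emb, ctx_embs)
print(order.tolist(), scores.tolist())
```

With real embeddings, `ctx_embs` would be the array returned by the document encoder and `query_emb` the query vector, so the same helper could replace the two hand-written `score1`/`score2` lines.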
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "_name_or_path": "/store2/scratch/s269lin/Aggretriever/results/experiments/msmarco/DistilBERT-Aggretriever",
+   "activation": "gelu",
+   "architectures": [
+     "AggretrieverEncoder"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_hidden_states": true,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.15.0",
+   "vocab_size": 30522
+ }
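A few architecture figures follow directly from the values in this config. The sketch below copies a subset of the fields by hand (it does not load the file) and derives the per-head dimension, the feed-forward expansion ratio, and the size of the token embedding matrix. The variable names are illustrative and not part of the commit.

```python
# Subset of config.json, copied by hand for illustration.
config = {
    "dim": 768,
    "hidden_dim": 3072,
    "n_heads": 12,
    "n_layers": 6,
    "vocab_size": 30522,
}

# Each attention head works on an equal slice of the hidden size.
head_dim = config["dim"] // config["n_heads"]

# Ratio between the feed-forward width and the hidden size.
ffn_ratio = config["hidden_dim"] // config["dim"]

# Number of weights in the token embedding matrix (vocab_size x dim).
token_embedding_params = config["vocab_size"] * config["dim"]

print(head_dim, ffn_ratio, token_embedding_params)
```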
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:038ff235cdeb2a975b0b351ce0dfe9e435ff870c511ad31a925db888dbb71199
+ size 268380775
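The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. As a minimal sketch, the pointer can be parsed to recover the expected checksum and file size; the `parse_lfs_pointer` helper is hypothetical and written only for this illustration.

```python
def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:038ff235cdeb2a975b0b351ce0dfe9e435ff870c511ad31a925db888dbb71199
size 268380775
"""

pointer = parse_lfs_pointer(pointer_text)
size_mib = pointer["size"] / 2**20  # bytes to MiB
print(pointer["oid"], round(size_mib, 1))
```

The parsed size works out to just under 256 MiB, which gives a rough idea of the download before fetching the actual `pytorch_model.bin`.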
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "Luyu/co-condenser-marco", "special_tokens_map_file": "/bos/tmp0/luyug/outputs/condenser/models/l2-s6-km-L128-e8-lr1e-4-b256/special_tokens_map.json", "tokenizer_file": null, "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff