Sheshera Mysore committed
Commit 7e9ce39 · 1 Parent(s): 2ade824

Update usage instructions.

Files changed (1): README.md (+24, -7)
README.md CHANGED
@@ -18,11 +18,11 @@ Model included in a paper for modeling fine grained similarity between documents
 
  ## Model Card
 
- **Model description:** This model is a BERT bi-encoder model trained for similarity of title-abstract pairs in biomedical scientific papers. The model is **initialized with the SciBert model**. This model inputs the title and abstract of a paper and represents it with a single vector obtained by a scalar mix of the CLS token at every layer of the SciBert encoder. These scalar mix parameters can be important for performance in some datasets. Importantly, these scalar mix weights are not included as part of this HF model, if you wish to use these parameters please download the model at: [`aspire-biencoder-biomed-scib-full.zip`](https://drive.google.com/file/d/1X6S5qwaKUlI3N3RDQSG-tJCzMBWAnqxP/view?usp=sharing).
+ **Model description:** This model is a BERT bi-encoder trained for similarity of title-abstract pairs in biomedical scientific papers. The model is **initialized with the SciBert model**. It takes a paper's title and abstract as input and represents the paper with a single vector obtained by a scalar mix of the CLS token at every layer of the SciBert encoder. These scalar mix parameters can be important for performance on some datasets. Importantly, these scalar mix weights are not included as part of this HF model; if you wish to use these parameters, please download the full model at: [`aspire-biencoder-biomed-scib-full.zip`](https://drive.google.com/file/d/1X6S5qwaKUlI3N3RDQSG-tJCzMBWAnqxP/view?usp=sharing).
 
**Training data:** The model is trained on pairs of co-cited papers in a contrastive learning setup, using 1.2 million biomedical paper pairs. In training, negative examples for the contrastive loss are obtained as random in-batch negatives. Co-citations are obtained from the full text of papers; for example, the papers cited in parentheses below are all co-cited, and each pair's titles and abstracts would be used as a training pair:
 
- > "The idea of distant supervision has been proposed and used widely in Relation Extraction (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) , where the source of labels is an external knowledge base."
+ > The idea of distant supervision has been proposed and used widely in Relation Extraction (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012), where the source of labels is an external knowledge base.
 
 
**Training procedure:** The model was trained with the Adam optimizer and a learning rate of 2e-5, with 1000 warm-up steps followed by linear decay of the learning rate. Training convergence is checked with the loss on a held-out dev set consisting of co-cited paper pairs.
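
For intuition, the in-batch-negative objective described above can be sketched as follows. This is an illustrative sketch, not the released training code; the `query_reps`/`pos_reps` names and the use of negative L2 distance as the training similarity score are assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup

def in_batch_contrastive_loss(query_reps, pos_reps):
    # Score every paper against every other paper in the batch: the diagonal
    # holds the true co-cited pairs, and all off-diagonal entries act as the
    # random in-batch negatives described above.
    scores = -torch.cdist(query_reps, pos_reps, p=2)  # negative L2 distance as similarity (assumed)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# Optimizer and schedule per the description: Adam at 2e-5, 1000 warm-up
# steps, then linear decay (num_training_steps depends on the run).
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
# scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps)
```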
 
@@ -30,9 +30,21 @@ Model included in a paper for modeling fine grained similarity between documents
**Intended uses & limitations:** This model is trained for document similarity tasks in biomedical scientific text using a single vector per document. Here, the documents are the title and abstract of a paper. With appropriate fine-tuning, the model can also be used for other tasks such as classification. Since the training data comes primarily from biomedicine, performance on other domains may be poorer.
 
 
- **How to use:** This model can be used as a BERT model via the `transformers` library:
+ **How to use:** This model can be used via the `transformers` library:
 
- TODO.
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+ aspire_bienc = AutoModel.from_pretrained('allenai/aspire-biencoder-biomed-scib')
+ aspire_tok = AutoTokenizer.from_pretrained('allenai/aspire-biencoder-biomed-scib')
+ title = "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity"
+ abstract = "We present a new scientific document similarity model based on matching fine-grained aspects of texts."
+ # Concatenate the title and abstract with the tokenizer's separator token.
+ d = [title + aspire_tok.sep_token + abstract]
+ inputs = aspire_tok(d, padding=True, truncation=True, return_tensors="pt", max_length=512)
+ result = aspire_bienc(**inputs)
+ # The final-layer CLS token serves as the single-vector document representation.
+ clsrep = result.last_hidden_state[:, 0, :]
+ ```
+
+ If you choose to use `aspire-biencoder-biomed-scib-full`, download [`aspire-biencoder-biomed-scib-full.zip`](https://drive.google.com/file/d/1X6S5qwaKUlI3N3RDQSG-tJCzMBWAnqxP/view?usp=sharing) and follow the example usage script [`aspire/examples/ex_aspire_bienc.py`](https://github.com/allenai/aspire/blob/main/examples/ex_aspire_bienc.py).
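
The Hugging Face snippet above uses only the final-layer CLS vector. For reference, the scalar mix in the full model combines the CLS vector from every layer with learned, softmax-normalized weights; a minimal sketch of that computation (the parameter names here are illustrative, not the ones stored in the released zip):

```python
import torch

def scalar_mix(layer_cls_reps, mix_weights, gamma):
    # layer_cls_reps: (num_layers, batch, hidden) CLS vectors, e.g. stacked from
    # aspire_bienc(**inputs, output_hidden_states=True).hidden_states.
    # mix_weights: learned (num_layers,) weights; gamma: learned scalar.
    norm_weights = torch.softmax(mix_weights, dim=0)
    return gamma * torch.einsum('l,lbh->bh', norm_weights, layer_cls_reps)
```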
 
**Variables and metrics:**
This model is evaluated on information retrieval datasets with document-level queries. Here we report performance on RELISH and TRECCOVID; both are detailed on [github](https://github.com/allenai/aspire) and in our [paper](https://arxiv.org/abs/2111.08366). These datasets represent an abstract-level retrieval task where, given a query scientific abstract, relevant candidate abstracts must be retrieved.
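
Concretely, ranking by L2 distance over these single-vector representations can be sketched as below; `query_rep` and `cand_reps` are assumed to be CLS vectors computed as in the usage snippet above:

```python
import torch

def rank_by_l2(query_rep, cand_reps):
    # query_rep: (1, hidden); cand_reps: (num_candidates, hidden).
    dists = torch.cdist(query_rep, cand_reps, p=2).squeeze(0)
    # Smallest distance first, i.e. most similar candidates first.
    return torch.argsort(dists)
```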
 
@@ -41,7 +53,7 @@ We rank documents by the L2 distance between the query and candidate documents.

  **Evaluation results:**
 
- The released model `aspire-sentence-embedder` is compared against 1) `all-mpnet-base-v2` a sentence-bert model trained on ~1 billion training examples, 2) `paraphrase-TinyBERT-L6-v2` a sentence-bert model trained on paraphrase pairs, and 3) the `cosentbert` models used in our paper.
+ The released model `aspire-biencoder-biomed-scib` (and `aspire-biencoder-biomed-scib-full`) is compared against `allenai/specter`. The released models `aspire-biencoder-biomed-scib` and `aspire-biencoder-biomed-scib-full` are each the single best run among 3 re-runs; <sup>*</sup> marks the performance reported in our paper, averaged over 3 re-runs of the model.
 
| | TRECCOVID MAP | TRECCOVID NDCG%20 | RELISH MAP | RELISH NDCG%20 |
|-------------------------------------------:|:---------:|:-------:|:------:|:-------:|
 
@@ -51,6 +63,11 @@ The released model `aspire-sentence-embedder` is compared against 1) `all-mpnet-
| `aspire-biencoder-biomed-scib` | 30.74 | 60.16 | 61.52 | 78.07 |
| `aspire-biencoder-biomed-scib-full` | 31.45 | 63.15 | 61.34 | 77.89 |

- <sup>*</sup> Refers to the performance reported in our paper by averaging over 3 re-runs of the model.

- The released models `aspire-biencoder-biomed-scib` and `aspire-biencoder-biomed-scib-full` are the single best run among the 3 re-runs.
+
+ **Alternative models:**
+
+ Besides the above models, also consider these alternative models released in the Aspire paper:
+
+ [`aspire-biencoder-compsci-spec`](https://huggingface.co/allenai/aspire-biencoder-compsci-spec): Use this if you want to run on computer science papers.
+
+ [`aspire-biencoder-biomed-spec`](https://huggingface.co/allenai/aspire-biencoder-biomed-spec): An alternative bi-encoder identical to the model above, except that it is initialized with `allenai/specter` instead of SciBert. This usually under-performs the model released here.