Commit c24d9d7 (verified) by antoinelouis · 1 Parent(s): dd25fb1

Update README.md

Files changed (1): README.md +29 -27
README.md CHANGED
@@ -1,33 +1,30 @@
1
  ---
2
- pipeline_tag: sentence-similarity
3
  language: fr
4
- license: apache-2.0
5
  datasets:
6
  - unicamp-dl/mmarco
7
  metrics:
8
  - recall
9
  tags:
10
- - sentence-similarity
11
  library_name: sentence-transformers
12
  ---
13
- # crossencoder-camembert-base-mmarcoFR
14
 
15
- This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
16
 
17
- It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search]((https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, encode the latter with some candidate passages -- e.g., retrieved with BM25 or a biencoder -- then sort the passages in a decreasing order of relevance according to the model's predictions.
 
 
 
18
 
19
  ## Usage
20
- ***
21
 
22
- #### Sentence-Transformers
23
 
24
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
25
 
26
- ```bash
27
- pip install -U sentence-transformers
28
- ```
29
-
30
- Then you can use the model like this:
31
 
32
  ```python
33
  from sentence_transformers import CrossEncoder
@@ -38,9 +35,9 @@ scores = model.predict(pairs)
38
  print(scores)
39
  ```
40
 
41
- #### 🤗 Transformers
42
 
43
- Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
44
 
45
  ```python
46
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -58,12 +55,13 @@ with torch.no_grad():
58
  print(scores)
59
  ```
60
 
61
- ## Evaluation
62
  ***
63
 
64
- We evaluated the model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.
65
 
66
- Below, we compare the model performance with other cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
 
 
67
 
68
  | | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
69
  |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
@@ -80,23 +78,27 @@ Below, we compare the model performance with other cross-encoder models fine-tun
80
  | 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR) | en | 15M | 62MB | 30.82 | 44.30 | 72.03 | 82.65 | 93.35 | 98.10 |
81
  -->
82
 
83
- ## Training
84
  ***
85
 
86
- #### Background
87
 
88
- We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are relevant and 75% are irrelevant).
89
 
90
- #### Hyperparameters
 
 
 
91
 
92
- We trained the model on a single Tesla V100 GPU with 32GBs of memory during 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the adamw optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
93
 
94
- #### Data
 
 
 
95
 
96
- We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune the model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.
97
 
98
  ## Citation
99
- ***
100
 
101
  ```bibtex
102
  @online{louis2023,
 
1
  ---
2
+ pipeline_tag: text-classification
3
  language: fr
4
+ license: mit
5
  datasets:
6
  - unicamp-dl/mmarco
7
  metrics:
8
  - recall
9
  tags:
10
+ - passage-reranking
11
  library_name: sentence-transformers
12
  ---
 
13
 
14
+ # crossencoder-camembert-base-mmarcoFR
15
 
16
+ This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1.
17
+ The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
18
+ retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
19
+ relevance according to the model's predicted scores.
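
The retrieve-then-rerank step described above can be sketched as follows. This is a minimal illustration, not the author's code: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer only stands in for the model so that the snippet runs without downloading any weights.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort by decreasing relevance."""
    # A cross-encoder scores the query and passage jointly, one pair at a time.
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


candidates = ["Le chat dort.", "Paris est la capitale de la France."]
for passage, score in rerank("capitale de la France", candidates, toy_overlap_score):
    print(f"{score:.2f}  {passage}")
```

With the actual model, `score_fn` would presumably be the `predict` method of a `sentence_transformers.CrossEncoder` instance loaded from this repository, since `predict` takes a list of text pairs and returns one score per pair.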
20
 
21
  ## Usage
 
22
 
23
+ Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers) or [HuggingFace Transformers](#using-huggingface-transformers).
24
 
25
+ #### Using Sentence-Transformers
26
 
27
+ Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
 
 
 
 
28
 
29
  ```python
30
  from sentence_transformers import CrossEncoder
 
35
  print(scores)
36
  ```
37
 
38
+ #### Using HuggingFace Transformers
39
 
40
+ Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
41
 
42
  ```python
43
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
55
  print(scores)
56
  ```
57
 
 
58
  ***
59
 
60
+ ## Evaluation
61
 
62
+ We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking
63
+ subsets of candidate passages comprising at least one relevant passage and up to 200 BM25 negative passages for each query. Below, we compare the model performance with other
64
+ cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
65
 
66
  | | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
67
  |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
 
78
  | 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR) | en | 15M | 62MB | 30.82 | 44.30 | 72.03 | 82.65 | 93.35 | 98.10 |
79
  -->
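
For readers unfamiliar with the reported metrics, here is a minimal per-query sketch of R-precision, reciprocal rank at a cut-off, and recall at a cut-off, using their standard definitions. The function names and the toy passage IDs are illustrative assumptions, and this is not the evaluation script behind the table; the reported numbers would be averages of such per-query values over the 500 evaluation queries.

```python
from typing import Sequence, Set


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for passage_id in ranked_ids[:k] if passage_id in relevant_ids)
    return hits / len(relevant_ids)


def r_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant passages."""
    num_relevant = len(relevant_ids)
    if num_relevant == 0:
        return 0.0
    hits = sum(1 for passage_id in ranked_ids[:num_relevant] if passage_id in relevant_ids)
    return hits / num_relevant


# Toy example: hypothetical passage IDs in the order produced by a reranker.
ranking = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
print(reciprocal_rank_at_k(ranking, relevant, k=10))  # first relevant hit is at rank 2
print(recall_at_k(ranking, relevant, k=2))
print(r_precision(ranking, relevant))
```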
80
 
 
81
  ***
82
 
83
+ ## Training
84
 
85
+ #### Data
86
 
87
+ We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
88
+ that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
89
+ [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are
90
+ relevant and 75% are irrelevant).
91
 
92
+ #### Implementation
93
 
94
+ The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via the binary cross-entropy loss
95
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
96
+ with a batch size of 32 and a peak learning rate of 2e-5, with warmup over the first 500 steps followed by linear decay. We set the maximum sequence length of the
97
+ concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.
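
The ingredients of this recipe can be illustrated in plain Python. This is a sketch under stated assumptions rather than the actual training code: the function names are hypothetical, the step counts are taken from the text above, and the schedule is assumed to decay linearly to zero at the final step, which the text does not state explicitly.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw model logit to a relevance score in (0, 1), computed stably."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def binary_cross_entropy(logit: float, label: int) -> float:
    """BCE loss for one pair: label is 1 if the passage is relevant, else 0."""
    eps = 1e-12  # guards against log(0)
    p = sigmoid(logit)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


def learning_rate(
    step: int,
    peak_lr: float = 2e-5,
    warmup_steps: int = 500,
    total_steps: int = 312_400,
) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(sigmoid(0.0))                     # an uninformative logit gives a score of 0.5
print(binary_cross_entropy(3.0, 1))     # confident and correct: small loss
print(binary_cross_entropy(3.0, 0))     # confident and wrong: large loss
print(learning_rate(250), learning_rate(500), learning_rate(312_400))
```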
98
 
99
+ ***
100
 
101
  ## Citation
 
102
 
103
  ```bibtex
104
  @online{louis2023,