antoinelouis committed on
Commit
b38a926
·
verified ·
1 Parent(s): 871f69a

Update README.md

Files changed (1)
  1. README.md +42 -35
README.md CHANGED
@@ -9,7 +9,28 @@ metrics:
9
  tags:
10
  - passage-reranking
11
  library_name: sentence-transformers
12
- base_model: camembert-base
13
  ---
14
 
15
  # crossencoder-camembert-base-mmarcoFR
@@ -72,27 +93,12 @@ print(scores)
72
  ```
73
 
74
  ***
75
-
76
  ## Evaluation
77
 
78
- We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking
79
- subsets of candidate passages comprising of at least one relevant and up to 200 BM25 negative passages for each query. Below, we compare the model performance with other
80
- cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
81
-
82
- | | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
83
- |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
84
- | 1 | **crossencoder-camembert-base-mmarcoFR** | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
85
- | 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
86
- | 3 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
87
- | 4 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR) | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
88
- | 5 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |
89
- <!--
90
- | x | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | en | 109M | 438MB | 29.68 | 46.13 | 80.45 | 87.90 | 93.15 | 96.60 |
91
- | x | [crossencoder-MiniLM-L12-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L12-msmarco-mmarcoFR) | en | 33M | 134MB | 29.07 | 44.41 | 77.83 | 88.10 | 95.55 | 99.00 |
92
- | x | [crossencoder-MiniLM-L6-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-msmarco-mmarcoFR) | en | 23M | 91MB | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
93
- | x | [crossencoder-MiniLM-L4-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L4-msmarco-mmarcoFR) | en | 19M | 77MB | 30.98 | 46.22 | 76.35 | 85.80 | 94.35 | 97.55 |
94
- | x | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR) | en | 15M | 62MB | 30.82 | 44.30 | 72.03 | 82.65 | 93.35 | 98.10 |
95
- -->
96
 
97
  ***
98
 
@@ -101,28 +107,29 @@ cross-encoder models fine-tuned on the same dataset. We report the R-precision (
101
  #### Data
102
 
103
  We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
104
- that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
105
- [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are
106
- relevant and 75% are irrelevant).
 
107
 
108
  #### Implementation
109
 
110
- The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via the binary cross-entropy loss
111
- (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
112
- with a batch size of 32, a peak learning rate of 2e-5 with warm up along the first 500 steps and linear scheduling. We set the maximum sequence length of the
113
- concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.
114
 
115
  ***
116
 
117
  ## Citation
118
 
119
  ```bibtex
120
- @online{louis2023,
121
- author = 'Antoine Louis',
122
- title = 'crossencoder-camembert-base-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French',
123
- publisher = 'Hugging Face',
124
- month = 'september',
125
- year = '2023',
126
- url = 'https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR',
127
  }
128
- ```
 
9
  tags:
10
  - passage-reranking
11
  library_name: sentence-transformers
12
+ base_model: almanach/camembert-base
13
+ model-index:
14
+ - name: crossencoder-camembert-base-mmarcoFR
15
+ results:
16
+ - task:
17
+ type: text-classification
18
+ name: Passage Reranking
19
+ dataset:
20
+ type: unicamp-dl/mmarco
21
+ name: mMARCO-fr
22
+ config: french
23
+ split: validation
24
+ metrics:
25
+ - type: recall_at_100
26
+ name: Recall@100
27
+ value: 85.34
28
+ - type: recall_at_10
29
+ name: Recall@10
30
+ value: 59.83
31
+ - type: mrr_at_10
32
+ name: MRR@10
33
+ value: 33.40
34
  ---
35
 
36
  # crossencoder-camembert-base-mmarcoFR
 
93
  ```
94
 
95
  ***
 
96
  ## Evaluation
97
 
98
+ The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
99
+ a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
100
+ to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out
101
+ the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
102
 
103
  ***
104
 
 
107
  #### Data
108
 
109
  We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
110
+ that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
111
+ 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
112
+ distillation dataset. Eventually, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
113
+ relevant and 50% are irrelevant).
114
 
115
  #### Implementation
116
 
117
+ The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via the binary cross-entropy loss
118
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
119
+ with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
120
+ We use the sigmoid function to get scores between 0 and 1.
121
 
122
  ***
123
 
124
  ## Citation
125
 
126
  ```bibtex
127
+ @online{louis2024decouvrir,
128
+ author = 'Antoine Louis',
129
+ title = 'DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French',
130
+ publisher = 'Hugging Face',
131
+ month = 'mar',
132
+ year = '2024',
133
+ url = 'https://huggingface.co/spaces/antoinelouis/decouvrir',
134
  }
135
+ ```
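
The updated Evaluation section reports MRR@10 and recall at several cut-offs (R@k) for reranked candidate lists. The snippet below is a minimal sketch of how such metrics are commonly computed, not the author's evaluation script. The function names, the dictionary layout, and the toy passage IDs are all hypothetical, and it assumes recall is averaged per query and that every query has at least one relevant passage.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, pids in ranked.items():
        gold = relevant[qid]
        for rank, pid in enumerate(pids[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of a query's relevant passages found in the top k, averaged over queries."""
    total = 0.0
    for qid, pids in ranked.items():
        gold = relevant[qid]  # assumed non-empty (hypothetical data contract)
        total += len(gold.intersection(pids[:k])) / len(gold)
    return total / len(ranked)


# Toy example: passage IDs sorted by descending reranker score (hypothetical data).
ranked = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4", "p2"]}
relevant = {"q1": {"p1"}, "q2": {"p2", "p5"}}

mrr10 = mrr_at_k(ranked, relevant, k=10)    # (1/2 + 1/3) / 2
recall3 = recall_at_k(ranked, relevant, 3)  # (1/1 + 1/2) / 2
print(f"MRR@10 = {100 * mrr10:.2f}, R@3 = {100 * recall3:.2f}")
```

The values are multiplied by 100 when printed so they are on the same percentage scale as the figures listed in the `model-index` metadata.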