Update README.md
README.md
CHANGED
@@ -9,7 +9,28 @@ metrics:
|
|
9 |
tags:
|
10 |
- passage-reranking
|
11 |
library_name: sentence-transformers
|
12 |
-
base_model: camembert-base
|
13 |
---
|
14 |
|
15 |
# crossencoder-camembert-base-mmarcoFR
|
@@ -72,27 +93,12 @@ print(scores)
|
|
72 |
```
|
73 |
|
74 |
***
|
75 |
-
|
76 |
## Evaluation
|
77 |
|
78 |
-
|
79 |
-
|
80 |
-
|
81 |
-
|
82 |
-
| | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
|
83 |
-
|---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
|
84 |
-
| 1 | **crossencoder-camembert-base-mmarcoFR** | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
|
85 |
-
| 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
|
86 |
-
| 3 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
|
87 |
-
| 4 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR) | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
|
88 |
-
| 5 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |
|
89 |
-
<!--
|
90 |
-
| x | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | en | 109M | 438MB | 29.68 | 46.13 | 80.45 | 87.90 | 93.15 | 96.60 |
|
91 |
-
| x | [crossencoder-MiniLM-L12-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L12-msmarco-mmarcoFR) | en | 33M | 134MB | 29.07 | 44.41 | 77.83 | 88.10 | 95.55 | 99.00 |
|
92 |
-
| x | [crossencoder-MiniLM-L6-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-msmarco-mmarcoFR) | en | 23M | 91MB | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
|
93 |
-
| x | [crossencoder-MiniLM-L4-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L4-msmarco-mmarcoFR) | en | 19M | 77MB | 30.98 | 46.22 | 76.35 | 85.80 | 94.35 | 97.55 |
|
94 |
-
| x | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR) | en | 15M | 62MB | 30.82 | 44.30 | 72.03 | 82.65 | 93.35 | 98.10 |
|
95 |
-
-->
|
96 |
|
97 |
***
|
98 |
|
@@ -101,28 +107,29 @@ cross-encoder models fine-tuned on the same dataset. We report the R-precision (
|
|
101 |
#### Data
|
102 |
|
103 |
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
|
104 |
-
that contains 8.8M passages and 539K training queries. We
|
105 |
-
[
|
106 |
-
|
|
|
107 |
|
108 |
#### Implementation
|
109 |
|
110 |
-
The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via the binary cross-entropy loss
|
111 |
-
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one
|
112 |
-
with a batch size of
|
113 |
-
|
114 |
|
115 |
***
|
116 |
|
117 |
## Citation
|
118 |
|
119 |
```bibtex
|
120 |
-
@online{
|
121 |
-
|
122 |
-
|
123 |
-
|
124 |
-
|
125 |
-
|
126 |
-
|
127 |
}
|
128 |
-
```
|
|
|
9 |
tags:
|
10 |
- passage-reranking
|
11 |
library_name: sentence-transformers
|
12 |
+
base_model: almanach/camembert-base
|
13 |
+
model-index:
|
14 |
+
- name: crossencoder-camembert-base-mmarcoFR
|
15 |
+
results:
|
16 |
+
- task:
|
17 |
+
type: text-classification
|
18 |
+
name: Passage Reranking
|
19 |
+
dataset:
|
20 |
+
type: unicamp-dl/mmarco
|
21 |
+
name: mMARCO-fr
|
22 |
+
config: french
|
23 |
+
split: validation
|
24 |
+
metrics:
|
25 |
+
- type: recall_at_100
|
26 |
+
name: Recall@100
|
27 |
+
value: 85.34
|
28 |
+
- type: recall_at_10
|
29 |
+
name: Recall@10
|
30 |
+
value: 59.83
|
31 |
+
- type: mrr_at_10
|
32 |
+
name: MRR@10
|
33 |
+
value: 33.40
|
34 |
---
|
35 |
|
36 |
# crossencoder-camembert-base-mmarcoFR
|
|
|
93 |
```
|
94 |
|
95 |
***
|
|
|
96 |
## Evaluation
|
97 |
|
98 |
+
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
|
99 |
+
a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
|
100 |
+
to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out
|
101 |
+
the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
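
As a rough illustration of the two metrics named above, the sketch below computes MRR@k and R@k for a single query. The function names and the toy passage IDs are hypothetical, and recall is assumed here to be the fraction of a query's positives found in the top k; the README does not spell out the exact definition used for the reported numbers.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k (assumed definition)."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical reranked list for one query (best first) and its positives.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}

print(mrr_at_k(ranked, relevant, k=10))
print(recall_at_k(ranked, relevant, k=2))
```

Per-query values like these would then be averaged over all development queries to obtain a single score.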
|
102 |
|
103 |
***
|
104 |
|
|
|
107 |
#### Data
|
108 |
|
109 |
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
|
110 |
+
that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
|
111 |
+
12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
|
112 |
+
distillation dataset. In the end, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
|
113 |
+
relevant and 50% are irrelevant).
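
A minimal sketch of what a 1:1 positive-to-negative sampling could look like is given below. It is not the author's pipeline: `build_balanced_pairs`, its arguments, and the toy French strings are all hypothetical, and the hard negatives are assumed to have been mined beforehand.

```python
import random


def build_balanced_pairs(query, positives, hard_negatives, rng):
    """Emit one (query, passage, relevance) triplet per positive and one per sampled negative."""
    triplets = []
    for positive in positives:
        negative = rng.choice(hard_negatives)  # one hard negative per positive keeps the ratio at 1
        triplets.append((query, positive, 1))
        triplets.append((query, negative, 0))
    return triplets


rng = random.Random(42)  # fixed seed so the sampling is reproducible
triplets = build_balanced_pairs(
    query="Quelle est la capitale de la France ?",
    positives=["Paris est la capitale de la France."],
    hard_negatives=[
        "Lyon est une grande ville française.",
        "Berlin est la capitale de l'Allemagne.",
    ],
    rng=rng,
)
print(triplets)
```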
|
114 |
|
115 |
#### Implementation
|
116 |
|
117 |
+
The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via the binary cross-entropy loss
|
118 |
+
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
|
119 |
+
with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
|
120 |
+
We use the sigmoid function to get scores between 0 and 1.
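
To make the objective concrete, here is a small pure-Python sketch of the binary cross-entropy loss on a single raw model output (logit), together with the sigmoid used to map that logit to a score between 0 and 1. The function names are illustrative, and the actual training code presumably relies on a deep learning framework rather than on these helpers.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a relevance score in [0, 1] without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logits(logit, label):
    """Binary cross-entropy for one (logit, label) pair, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident positive prediction is cheap for a relevant pair and costly for an irrelevant one.
print(sigmoid(2.0))
print(bce_with_logits(2.0, 1))
print(bce_with_logits(2.0, 0))
```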
|
121 |
|
122 |
***
|
123 |
|
124 |
## Citation
|
125 |
|
126 |
```bibtex
|
127 |
+
@online{louis2024decouvrir,
|
128 |
+
author = {Antoine Louis},
|
129 |
+
title = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
|
130 |
+
publisher = {Hugging Face},
|
131 |
+
month = {mar},
|
132 |
+
year = {2024},
|
133 |
+
url = {https://huggingface.co/spaces/antoinelouis/decouvrir},
|
134 |
}
|
135 |
+
```
|