antoinelouis/crossencoder-camembert-base-mmarcoFR

@@ -1,33 +1,30 @@
 ---
-pipeline_tag: sentence-similarity
 language: fr
-license: apache-2.0
 datasets:
 - unicamp-dl/mmarco
 metrics:
 - recall
 tags:
-- sentence-similarity
 library_name: sentence-transformers
 ---
-# crossencoder-camembert-base-mmarcoFR
-This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
-It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, encode the latter with some candidate passages -- e.g., retrieved with BM25 or a biencoder -- then sort the passages in decreasing order of relevance according to the model's predictions.
 ## Usage
-***
-#### Sentence-Transformers
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-```bash
-pip install -U sentence-transformers
-```
-Then you can use the model like this:
 ```python
 from sentence_transformers import CrossEncoder
@@ -38,9 +35,9 @@ scores = model.predict(pairs)
 print(scores)
 ```
-#### 🤗 Transformers
-Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -58,12 +55,13 @@ with torch.no_grad():
 print(scores)
 ```
-## Evaluation
 ***
-We evaluated the model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.
-Below, we compare the model performance with other cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
 |    | model                                                                                                                        | Vocab. | #Param. |  Size |     RP |   MRR@10 |  R@10(↑) |   R@20 |   R@50 |   R@100 |
 |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
@@ -80,23 +78,27 @@ Below, we compare the model performance with other cross-encoder models fine-tun
 | 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR)       |     en |     15M |  62MB |  30.82 |    44.30 |    72.03 |  82.65 |  93.35 |   98.10 |
 -->
-## Training
 ***
-#### Background
-We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 1:3 (i.e., 25% of the pairs are relevant and 75% are irrelevant).
-#### Hyperparameters
-We trained the model on a single Tesla V100 GPU with 32GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
-#### Data
-We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune the model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.
 ## Citation
-***
 ```bibtex
 @online{louis2023,

 ---
+pipeline_tag: text-classification
 language: fr
+license: mit
 datasets:
 - unicamp-dl/mmarco
 metrics:
 - recall
 tags:
+- passage-reranking
 library_name: sentence-transformers
 ---
+# crossencoder-camembert-base-mmarcoFR
+This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1.
+The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
+retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), score each query-passage pair and sort the passages in decreasing order of
+relevance according to the model's predicted scores.
 ## Usage
+Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers) or [HuggingFace Transformers](#using-huggingface-transformers).
+#### Using Sentence-Transformers
+Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
 ```python
 from sentence_transformers import CrossEncoder
 print(scores)
 ```
+#### Using HuggingFace Transformers
+Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 print(scores)
 ```
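The reranking workflow described in the model summary can be sketched independently of any particular scorer. The snippet below is only an illustration: `rerank` and `toy_overlap_score` are hypothetical names that do not appear in the model card, and the toy scorer is a stand-in for the cross-encoder so that the example runs without downloading the model.

```python
import re
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_k passages, best first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_k]]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(re.findall(r"\w+", query.lower()))
        passage_words = set(re.findall(r"\w+", passage.lower()))
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Candidates that a first-stage retriever (e.g., BM25) might have returned.
candidates = [
    "Paris est la capitale de la France.",
    "Le camembert est un fromage normand.",
    "La capitale de l'Italie est Rome.",
]
reranked = rerank("capitale de la France", candidates, toy_overlap_score, top_k=2)
for passage, score in reranked:
    print(f"{score:.2f}  {passage}")
```

With the real model, the `model.predict` method from the Sentence-Transformers snippet could be passed as `score_fn` in place of the toy scorer, since it also takes a list of query-passage pairs and returns one score per pair.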
 ***
+## Evaluation
+We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking
+subsets of candidate passages comprising at least one relevant passage and up to 200 BM25 negatives for each query. Below, we compare the model performance with other
+cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
 |    | model                                                                                                                        | Vocab. | #Param. |  Size |     RP |   MRR@10 |  R@10(↑) |   R@20 |   R@50 |   R@100 |
 |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
 | 10 | [crossencoder-MiniLM-L2-msmarco-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L2-msmarco-mmarcoFR)       |     en |     15M |  62MB |  30.82 |    44.30 |    72.03 |  82.65 |  93.35 |   98.10 |
 -->
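For readers unfamiliar with the metrics in the table, here is a minimal sketch of their textbook per-query definitions. The model card does not include the evaluation script, so the function names and the toy ranking below are assumptions for illustration only; the reported figures would be these values averaged over all queries and expressed as percentages.

```python
from typing import Sequence, Set


def r_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Precision within the top R results, where R is the number of relevant passages."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for pid in ranked_ids[:r] if pid in relevant) / r


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage in the top k, or 0 if there is none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for pid in ranked_ids[:k] if pid in relevant) / len(relevant)


# Hypothetical reranked passage ids for one query, best first.
ranking = ["p3", "p1", "p7", "p2"]
gold = {"p1", "p2"}

rp = r_precision(ranking, gold)                 # one of the top 2 is relevant
rr = reciprocal_rank_at_k(ranking, gold, k=10)  # first relevant passage is at rank 2
recall_2 = recall_at_k(ranking, gold, k=2)
recall_4 = recall_at_k(ranking, gold, k=4)
print(rp, rr, recall_2, recall_4)
```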
 ***
+## Training
+#### Data
+We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
+[training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 1:3 (i.e., 25% of the pairs are
+relevant and 75% are irrelevant).
+#### Implementation
+The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via the binary cross-entropy loss
+(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
+with a batch size of 32 and a peak learning rate of 2e-5, with warmup over the first 500 steps followed by linear decay. We set the maximum sequence length of the
+concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.
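The optimization details above can be made concrete with a small pure-Python sketch. It uses the numbers stated in the card (peak learning rate 2e-5, 500 warmup steps, roughly 312.4k total steps) and assumes that the linear schedule decays to zero at the last step, which the card does not state explicitly. The function names are hypothetical, and this is not the authors' training code.

```python
import math

PEAK_LR = 2e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 312_400  # "312.4k steps" as reported, so an approximate value


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach 0 at TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sigmoid(logit: float) -> float:
    """Map a raw model logit to a relevance score in (0, 1), in a numerically safe way."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def binary_cross_entropy(logit: float, label: int) -> float:
    """BCE for one pair: label is 1 for a relevant passage, 0 for an irrelevant one."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


print(learning_rate(250), learning_rate(500), learning_rate(TOTAL_STEPS))
print(sigmoid(0.0), binary_cross_entropy(0.0, 1))
```

A logit of 0 maps to a score of 0.5, and the loss shrinks as the score moves toward the correct label, which is what pushes relevant pairs toward 1 and irrelevant pairs toward 0.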
+***
 ## Citation
 ```bibtex
 @online{louis2023,