---
license: mit
base_model: camembert-base
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: Camembert-base-frenchNER_4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 20
---


# Camembert-base-frenchNER_4entities

## Model Description

We present **Camembert-base-frenchNER_4entities**, a [CamemBERT base](https://huggingface.co/camembert-base) model fine-tuned for French Named Entity Recognition (NER) on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC).  
All these datasets were concatenated and cleaned into a single dataset that we called [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities).  
There are a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.  
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).
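
As a quick sanity check on the figures above, the split sizes can be summed and turned into proportions. This is only a sketch: the variable names are illustrative and are not part of the dataset's API.

```python
# Split sizes as reported in this model card.
split_sizes = {"train": 328_757, "validation": 24_131, "test": 31_885}

total_rows = sum(split_sizes.values())

# Share of each split, as a percentage rounded to one decimal place.
split_share = {
    name: round(100 * size / total_rows, 1) for name, size in split_sizes.items()
}

print(total_rows)   # 384773
print(split_share)  # {'train': 85.4, 'validation': 6.3, 'test': 8.3}
```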



## Dataset

The dataset used is [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities), which contains ~385k sentences labeled with 4 entity categories, plus a background label:
* PER: person;
* LOC: location;
* ORG: organization;
* MISC: miscellaneous;
* O: background (outside any entity).

The distribution of the entities is as follows:

<table>
<thead>
  <tr>
    <th><br>Splits</th>
    <th><br>O</th>
    <th><br>PER</th>
    <th><br>LOC</th>  
    <th><br>ORG</th>
    <th><br>MISC</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><br>train</td>
    <td><br><b>7,539,692</b></td>
    <td><br><b>307,144</b></td>
    <td><br><b>286,746</b></td>
    <td><br><b>127,089</b></td>
    <td><br><b>799,494</b></td>  
  </tr>
  <tr>
    <td><br>validation</td>
    <td><br><b>544,580</b></td>
    <td><br><b>24,034</b></td>
    <td><br><b>21,585</b></td>
    <td><br><b>5,927</b></td>
    <td><br><b>18,221</b></td>  
  </tr>
  <tr>
    <td><br>test</td>
    <td><br><b>720,623</b></td>
    <td><br><b>32,870</b></td>
    <td><br><b>29,683</b></td>
    <td><br><b>7,911</b></td>
    <td><br><b>21,760</b></td>  
  </tr>
</tbody>
</table>
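
The table shows a strong imbalance toward the `O` label. A minimal sketch of how the per-label shares can be derived from these counts is given below; the helper name `label_shares` is hypothetical and the counts are simply copied from the table.

```python
# Counts per label, copied from the table above.
label_counts = {
    "train": {"O": 7_539_692, "PER": 307_144, "LOC": 286_746, "ORG": 127_089, "MISC": 799_494},
    "validation": {"O": 544_580, "PER": 24_034, "LOC": 21_585, "ORG": 5_927, "MISC": 18_221},
    "test": {"O": 720_623, "PER": 32_870, "LOC": 29_683, "ORG": 7_911, "MISC": 21_760},
}


def label_shares(counts):
    """Return each label's share of the total count for one split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for split, counts in label_counts.items():
    shares = label_shares(counts)
    # Print the shares as percentages with one decimal place.
    print(split, {label: f"{share:.1%}" for label, share in shares.items()})
```

In every split the `O` label accounts for more than 80% of the counts, which is why per-entity scores are more informative than the overall score.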


## Evaluation results

The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.

### frenchNER_4entities

<table>
<thead>
    <tr>
      <th><br>Model</th>
      <th><br>Metrics</th>
      <th><br>PER</th>
      <th><br>LOC</th>
      <th><br>ORG</th>
      <th><br>MISC</th>
      <th><br>O</th>
      <th><br>Overall</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
        <td><br>Precision</td>
        <td><br>0.973</td>
        <td><br>0.951</td>
        <td><br>0.8877</td>
        <td><br>0.850</td>
        <td><br>0.993</td>
        <td><br>0.984</td>
    </tr>
    <tr>
        <td><br>Recall</td>
        <td><br>0.983</td>
        <td><br>0.964</td>
        <td><br>0.918</td>
        <td><br>0.781</td>
        <td><br>0.993</td>
        <td><br>0.984</td>
    </tr>
    <tr>
        <td>F1</td>
        <td><br>0.978</td>
        <td><br>0.958</td>
        <td><br>0.903</td>
        <td><br>0.814</td>
        <td><br>0.993</td>
        <td><br>0.984</td>
    </tr>
</tbody>
</table>
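
For readers who want to relate the three rows of the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for a few columns; `f1_score` is an illustrative helper written for this card, not a function of the `evaluate` package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the frenchNER_4entities table above.
reported = {
    "PER": (0.973, 0.983),
    "ORG": (0.8877, 0.918),
    "MISC": (0.850, 0.781),
}

for label, (precision, recall) in reported.items():
    print(label, round(f1_score(precision, recall), 3))
# PER 0.978
# ORG 0.903
# MISC 0.814
```

Because the table values are already rounded, recomputing F1 this way can differ from the reported value by 0.001 for some columns.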


In detail:

### multiconer

<table>
<thead>
    <tr>
      <th><br>Model</th>
      <th><br>Metrics</th>
      <th><br>PER</th>
      <th><br>LOC</th>
      <th><br>ORG</th>
      <th><br>MISC</th>
      <th><br>O</th>
      <th><br>Overall</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
        <td><br>Precision</td>
        <td><br>0.954</td>
        <td><br>0.893</td>
        <td><br>0.851</td>
        <td><br>0.849</td>
        <td><br>0.979</td>
        <td><br>0.954</td>
    </tr>
    <tr>
        <td><br>Recall</td>
        <td><br>0.967</td>
        <td><br>0.887</td>
        <td><br>0.883</td>
        <td><br>0.855</td>
        <td><br>0.974</td>
        <td><br>0.954</td>
    </tr>
    <tr>
        <td>F1</td>
        <td><br>0.960</td>
        <td><br>0.890</td>
        <td><br>0.867</td>
        <td><br>0.852</td>
        <td><br>0.977</td>
        <td><br>0.954</td>
    </tr>
</tbody>
</table>

### multinerd

<table>
<thead>
    <tr>
      <th><br>Model</th>
      <th><br>Metrics</th>
      <th><br>PER</th>
      <th><br>LOC</th>
      <th><br>ORG</th>
      <th><br>MISC</th>
      <th><br>O</th>
      <th><br>Overall</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
        <td><br>Precision</td>
        <td><br>0.976</td>
        <td><br>0.961</td>
        <td><br>0.91</td>
        <td><br>0.829</td>
        <td><br>0.991</td>
        <td><br>0.983</td>
    </tr>
    <tr>
        <td><br>Recall</td>
        <td><br>0.993</td>
        <td><br>0.985</td>
        <td><br>0.967</td>
        <td><br>0.719</td>
        <td><br>0.993</td>
        <td><br>0.983</td>
    </tr>
    <tr>
        <td>F1</td>
        <td><br>0.985</td>
        <td><br>0.973</td>
        <td><br>0.938</td>
        <td><br>0.770</td>
        <td><br>0.992</td>
        <td><br>0.983</td>
    </tr>
</tbody>
</table>


### wikiner

<table>
<thead>
    <tr>
      <th><br>Model</th>
      <th><br>Metrics</th>
      <th><br>PER</th>
      <th><br>LOC</th>
      <th><br>ORG</th>
      <th><br>MISC</th>
      <th><br>O</th>
      <th><br>Overall</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
        <td><br>Precision</td>
        <td><br>0.970</td>
        <td><br>0.944</td>
        <td><br>0.872</td>
        <td><br>0.878</td>
        <td><br>0.996</td>
        <td><br>0.986</td>
    </tr>
    <tr>
        <td><br>Recall</td>
        <td><br>0.969</td>
        <td><br>0.947</td>
        <td><br>0.880</td>
        <td><br>0.866</td>
        <td><br>0.996</td>
        <td><br>0.986</td>
    </tr>
    <tr>
        <td>F1</td>
        <td><br>0.970</td>
        <td><br>0.945</td>
        <td><br>0.876</td>
        <td><br>0.872</td>
        <td><br>0.996</td>
        <td><br>0.986</td>
    </tr>
</tbody>
</table>


## Usage
### Code

```python
from transformers import pipeline

ner = pipeline('token-classification', model='CATIE-AQ/Camembert-base-frenchNER_4entities', tokenizer='CATIE-AQ/Camembert-base-frenchNER_4entities', aggregation_strategy="simple")

results = ner(
"Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
)


# Note: the aggregation_strategy parameter does not always group tokens as expected, so some post-processing is needed.

# Step 1: merge pieces that touch each other (the end of one is the start of the next).
dict_to_del = []
for idx in range(len(results)-1):
    if results[idx]["end"] == results[idx+1]["start"]:
        results[idx+1]["word"] = results[idx]["word"]+results[idx+1]["word"]
        results[idx+1]["score"] = (results[idx]["score"]+results[idx+1]["score"])/2
        results[idx+1]["start"] = results[idx]["start"]
        dict_to_del.append(idx)
results = [j for i, j in enumerate(results) if i not in dict_to_del]

# Step 2: merge pieces separated by a single character (typically a space).
dict_to_del = []
for i in range(len(results)-1):
    if results[i]["end"] == results[i+1]["start"]-1:
        results[i+1]["word"] = results[i]["word"]+" "+results[i+1]["word"]
        results[i+1]["score"] = (results[i]["score"]+results[i+1]["score"])/2
        results[i+1]["start"] = results[i]["start"]
        dict_to_del.append(i)
results = [j for i, j in enumerate(results) if i not in dict_to_del]

print(results)
```

```python
[{'entity_group': 'MISC',
  'score': 0.9404951632022858,
  'word': 'Euro 2024',
  'start': 22,
  'end': 31},
 {'entity_group': 'LOC',
  'score': 0.96980727,
  'word': 'Allemagne',
  'start': 35,
  'end': 44},
 {'entity_group': 'LOC',
  'score': 0.8612850904464722,
  'word': 'Pays-Bas',
  'start': 112,
  'end': 120},
 {'entity_group': 'ORG',
  'score': 0.8148028254508972,
  'word': 'les Bleus',
  'start': 122,
  'end': 131},
 {'entity_group': 'PER',
  'score': 0.9994482398033142,
  'word': 'Didier Deschamps',
  'start': 250,
  'end': 266},
 {'entity_group': 'MISC',
  'score': 0.84807388484478,
  'word': 'dernière Coupe du monde',
  'start': 296,
  'end': 319},
 {'entity_group': 'PER',
  'score': 0.9996860176324844,
  'word': 'Kylian Mbappé',
  'start': 352,
  'end': 365},
 {'entity_group': 'PER',
  'score': 0.9996881932020187,
  'word': 'Aurélien Tchouameni',
  'start': 367,
  'end': 386},
 {'entity_group': 'PER',
  'score': 0.9996924996376038,
  'word': 'Antoine Griezmann',
  'start': 388,
  'end': 405},
 {'entity_group': 'PER',
  'score': 0.9996860027313232,
  'word': 'Ibrahima Konaté',
  'start': 407,
  'end': 422},
 {'entity_group': 'PER',
  'score': 0.9996623992919922,
  'word': 'Mike Maignan',
  'start': 433,
  'end': 445}]
```
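
Once the entities are extracted, a common next step is to group them by type and drop low-confidence predictions. The sketch below does this on a few entries copied from the output above, so it runs without downloading the model; `group_entities` and the `min_score` threshold are illustrative choices, not part of the `transformers` API.

```python
from collections import defaultdict

# A few entries copied from the output above (same keys as the pipeline result).
sample_results = [
    {"entity_group": "MISC", "score": 0.9404951632022858, "word": "Euro 2024", "start": 22, "end": 31},
    {"entity_group": "LOC", "score": 0.96980727, "word": "Allemagne", "start": 35, "end": 44},
    {"entity_group": "PER", "score": 0.9994482398033142, "word": "Didier Deschamps", "start": 250, "end": 266},
    {"entity_group": "PER", "score": 0.9996860176324844, "word": "Kylian Mbappé", "start": 352, "end": 365},
]


def group_entities(results, min_score=0.0):
    """Map each entity type to the detected words whose score is at least min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_entities(sample_results, min_score=0.95))
# {'LOC': ['Allemagne'], 'PER': ['Didier Deschamps', 'Kylian Mbappé']}
```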

### Try it through Space
A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Camembert-NER).


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0407        | 1.0   | 41095  | 0.0547          | 0.9816    | 0.9816 | 0.9816 | 0.9816   |
| 0.0242        | 2.0   | 82190  | 0.0488          | 0.9843    | 0.9843 | 0.9843 | 0.9843   |
| 0.018         | 3.0   | 123285 | 0.0542          | 0.9844    | 0.9844 | 0.9844 | 0.9844   |
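
The `Step` column can be cross-checked against the training split size and the batch size. The sketch below assumes a single device and no gradient accumulation (neither is stated in this card, but the numbers are consistent with it), and also shows what a linear schedule looks like under the additional assumption of zero warmup steps.

```python
import math

train_rows = 328_757   # training split size reported in this card
train_batch_size = 8
num_epochs = 3
learning_rate = 2e-5

# ASSUMPTION: one device, no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step: int, warmup_steps: int = 0) -> float:
    """Linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the card does not report a warmup value.
    """
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return learning_rate * remaining


print(steps_per_epoch, total_steps)  # 41095 123285
for epoch in range(1, num_epochs + 1):
    # Learning rate reached at the end of each epoch under these assumptions.
    print(epoch, linear_lr(epoch * steps_per_epoch))
```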


### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.16.1
- Tokenizers 0.15.0


## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 1h45min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.046 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for the day of January 4, 2024.)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.02 kg eq. CO2
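
The formula above can be checked with a short calculation. The average power draw is not given in this card, so the 0.25 kW used below is purely an assumption chosen for illustration; the runtime and carbon efficiency are the values listed above.

```python
def carbon_emitted_kg(power_kw: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Power consumption x time x carbon intensity of the power grid."""
    return power_kw * hours * kg_co2_per_kwh


hours_used = 1 + 45 / 60    # 1h45min, from the list above
carbon_efficiency = 0.046   # kg CO2 eq. per kWh, from the list above
assumed_power_kw = 0.25     # ASSUMPTION: average draw of about 250 W, not stated in this card

estimate = carbon_emitted_kg(assumed_power_kw, hours_used, carbon_efficiency)
print(round(estimate, 2))   # 0.02
```

Under this assumption the result rounds to the 0.02 kg eq. CO2 reported above.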



## Citations

### Camembert-frenchNER_4entities
```
TODO
```

### multiconer

> @inproceedings{multiconer2-report,  
    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},  
    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},  
    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},  
    year={2023},  
    publisher={Association for Computational Linguistics}}

> @article{multiconer2-data,  
    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},  
    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},  
    year={2023}}


### multinerd

> @inproceedings{tedeschi-navigli-2022-multinerd,  
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",  
    author = "Tedeschi, Simone and  Navigli, Roberto",  
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",  
    month = jul,  
    year = "2022",  
    address = "Seattle, United States",  
    publisher = "Association for Computational Linguistics",  
    url = "https://aclanthology.org/2022.findings-naacl.60",  
    doi = "10.18653/v1/2022.findings-naacl.60",  
    pages = "801--812"}

### pii-masking-200k

> @misc {ai4privacy_2023,  
author = { {ai4Privacy} },  
title = { pii-masking-200k (Revision 1d4c0a1) },  
year = 2023,  
url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },  
doi = { 10.57967/hf/1532 },  
publisher = { Hugging Face }}

### wikiner

> @article{NOTHMAN2013151,  
title = {Learning multilingual named entity recognition from Wikipedia},  
journal = {Artificial Intelligence},  
volume = {194},  
pages = {151-175},  
year = {2013},  
note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},  
issn = {0004-3702},  
doi = {https://doi.org/10.1016/j.artint.2012.03.006},  
url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},  
author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}


### frenchNER_4entities
```
TODO
```

### CamemBERT
> @inproceedings{martin2020camembert,  
  title={CamemBERT: a Tasty French Language Model},  
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},  
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},  
  year={2020}}


## License
 [cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)