---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
pipeline_tag: token-classification
tags:
- mwe
---
# Multiword expression recognition
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs.
## Model description
`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert/camembert-large) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset for the MWER task.
## How to use
You can use this model directly with a pipeline for token classification:
```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
[{'entity': 'B-MWE',
'score': 0.99492574,
'index': 4,
'word': '▁rendez',
'start': 15,
'end': 22},
{'entity': 'I-MWE',
'score': 0.9344883,
'index': 5,
'word': '-',
'start': 22,
'end': 23},
{'entity': 'I-MWE',
'score': 0.99398583,
'index': 6,
'word': 'vous',
'start': 23,
'end': 27},
{'entity': 'B-VID',
'score': 0.9827843,
'index': 22,
'word': '▁mettre',
'start': 106,
'end': 113},
{'entity': 'I-VID',
'score': 0.9835186,
'index': 23,
'word': '▁en',
'start': 113,
'end': 116},
{'entity': 'I-VID',
'score': 0.98324823,
'index': 24,
'word': '▁bouche',
'start': 116,
'end': 123}]
>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
'score': 0.9744666,
'word': 'rendez-vous',
'start': 15,
'end': 27},
{'entity_group': 'VID',
'score': 0.9831837,
'word': 'mettre en bouche',
'start': 106,
'end': 123}]
```
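The grouped output above can be post-processed into simple pairs of category and surface form. The helper below is an illustrative sketch (not part of the model card's API); it runs on a copy of the sample output so it works without loading the model:

```python
# Illustrative helper: turn the grouped-entity output shown above
# into simple (category, surface form) pairs.
def extract_mwes(grouped_entities):
    """Return a list of (category, text) tuples from pipeline output."""
    return [(e["entity_group"], e["word"]) for e in grouped_entities]

# Sample output copied from the example above.
grouped = [
    {"entity_group": "MWE", "score": 0.9744666, "word": "rendez-vous", "start": 15, "end": 27},
    {"entity_group": "VID", "score": 0.9831837, "word": "mettre en bouche", "start": 106, "end": 123},
]

print(extract_mwes(grouped))
# [('MWE', 'rendez-vous'), ('VID', 'mettre en bouche')]
```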
## Training data
The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia (total) | train | dev | test |
| :----------: | :-------------: | :---: | :-: | :--: |
| #sentences   | 3099            | 1955  | 273 | 871  |
| #MWEs        | 3450            | 2170  | 306 | 974  |
| #Unseen MWEs | –               | –     | 100 | 300  |
This dataset has 6 distinct categories:
* MWE: Non-verbal MWEs (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)
## Training procedure
### Preprocessing
The sequence labeling scheme used for this task is inside–outside–beginning (IOB2) tagging: the first token of an MWE is labeled `B-<category>`, subsequent tokens `I-<category>`, and all other tokens `O`.
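As a minimal sketch of IOB2 tagging (the tokens and span below are hypothetical examples, not the actual preprocessing code):

```python
# Minimal IOB2 tagging sketch: spans maps
# (start_token, end_token_exclusive) -> MWE category.
def iob2_tags(tokens, spans):
    tags = ["O"] * len(tokens)
    for (start, end), category in spans.items():
        tags[start] = f"B-{category}"          # first token of the expression
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # continuation tokens
    return tags

tokens = ["mettre", "en", "bouche", "les", "participants"]
print(iob2_tags(tokens, {(0, 3): "VID"}))
# ['B-VID', 'I-VID', 'I-VID', 'O', 'O']
```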
### Fine-tuning
The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, over the course of 15 epochs.
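These hyperparameters could be expressed with `transformers.TrainingArguments` roughly as follows. This is a hypothetical reconstruction, not the actual training script; only the learning rate, batch size, and epoch count come from this model card, and `output_dir` is an arbitrary placeholder:

```python
from transformers import TrainingArguments

# Hypothetical config fragment reflecting the hyperparameters stated above.
training_args = TrainingArguments(
    output_dir="camembert-mwer",        # placeholder path
    learning_rate=3e-5,                 # from the model card
    per_device_train_batch_size=10,     # from the model card
    num_train_epochs=15,                # from the model card
)
```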
## Evaluation results
On the test set, this model achieves the following results:
<table>
<tr>
<td colspan="3">Global MWE-based</td>
<td colspan="3">Unseen MWE-based</td>
</tr>
<tr>
<td>Precision</td><td>Recall</td><td>F1</td>
<td>Precision</td><td>Recall</td><td>F1</td>
</tr>
<tr>
<td>83.78</td><td>83.78</td><td>83.78</td>
<td>57.05</td><td>60.67</td><td>58.80</td>
</tr>
</table>
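As a quick sanity check on the table, F1 is the harmonic mean of precision and recall; for the unseen-MWE column:

```python
# F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(57.05, 60.67), 2))
# 58.8
```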
## BibTeX entry and citation info
```bibtex
@article{martin2019camembert,
title={CamemBERT: a tasty French language model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
journal={arXiv preprint arXiv:1911.03894},
year={2019}
}
@article{candito2020french,
title={A French corpus annotated for multiword expressions and named entities},
author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
journal={Journal of Language Modelling},
volume={8},
number={2},
year={2020},
publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```