|
--- |
|
language: fr |
|
license: mit |
|
datasets: |
|
- Sequoia |
|
widget: |
|
- text: Aucun financement politique occulte n'a pu être mis en évidence. |
|
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue. |
|
pipeline_tag: token-classification |
|
tags: |
|
- mwe |
|
--- |
|
|
|
# Multiword expression recognition
|
|
|
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs. |
|
|
|
## Model description |
|
|
|
`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert/camembert-large) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset for the MWER task.
|
|
|
## How to use |
|
|
|
You can use this model directly with a pipeline for token classification: |
|
|
|
```python |
|
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline |
|
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer") |
|
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer") |
|
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer) |
|
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants." |
|
>>> mwes = mwe_classifier(sentence)

>>> mwes
|
|
|
[{'entity': 'B-MWE', |
|
'score': 0.99492574, |
|
'index': 4, |
|
'word': '▁rendez', |
|
'start': 15, |
|
'end': 22}, |
|
{'entity': 'I-MWE', |
|
'score': 0.9344883, |
|
'index': 5, |
|
'word': '-', |
|
'start': 22, |
|
'end': 23}, |
|
{'entity': 'I-MWE', |
|
'score': 0.99398583, |
|
'index': 6, |
|
'word': 'vous', |
|
'start': 23, |
|
'end': 27}, |
|
{'entity': 'B-VID', |
|
'score': 0.9827843, |
|
'index': 22, |
|
'word': '▁mettre', |
|
'start': 106, |
|
'end': 113}, |
|
{'entity': 'I-VID', |
|
'score': 0.9835186, |
|
'index': 23, |
|
'word': '▁en', |
|
'start': 113, |
|
'end': 116}, |
|
{'entity': 'I-VID', |
|
'score': 0.98324823, |
|
'index': 24, |
|
'word': '▁bouche', |
|
'start': 116, |
|
'end': 123}] |
|
|
|
>>> mwe_classifier.group_entities(mwes) |
|
|
|
[{'entity_group': 'MWE', |
|
'score': 0.9744666, |
|
'word': 'rendez-vous', |
|
'start': 15, |
|
'end': 27}, |
|
{'entity_group': 'VID', |
|
'score': 0.9831837, |
|
'word': 'mettre en bouche', |
|
'start': 106, |
|
'end': 123}] |
|
``` |
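Under the hood, `group_entities` merges contiguous `B-`/`I-` tokens of the same category into character-level spans. The grouping logic can be sketched in plain Python on the token-level output shown above (`merge_mwe_spans` is a hypothetical illustration, not a transformers API):

```python
# Minimal sketch of how contiguous B-/I- tags are merged into spans,
# mirroring what `group_entities` does on the output shown above.
# `merge_mwe_spans` is a hypothetical helper, not part of transformers.

def merge_mwe_spans(entities):
    """Group a B-X token and its following I-X tokens into (label, start, end) spans."""
    spans = []
    for ent in entities:
        prefix, label = ent["entity"].split("-", 1)
        if prefix == "B" or not spans or spans[-1][0] != label:
            spans.append([label, ent["start"], ent["end"]])
        else:  # an I- tag continuing the current span
            spans[-1][2] = ent["end"]
    return [tuple(s) for s in spans]

# The token-level predictions from the example above:
entities = [
    {"entity": "B-MWE", "start": 15, "end": 22},
    {"entity": "I-MWE", "start": 22, "end": 23},
    {"entity": "I-MWE", "start": 23, "end": 27},
    {"entity": "B-VID", "start": 106, "end": 113},
    {"entity": "I-VID", "start": 113, "end": 116},
    {"entity": "I-VID", "start": 116, "end": 123},
]

print(merge_mwe_spans(entities))
# [('MWE', 15, 27), ('VID', 106, 123)]
```

The resulting spans match the `group_entities` output above: *rendez-vous* (characters 15–27) and *mettre en bouche* (characters 106–123).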
|
|
|
## Training data |
|
|
|
The Sequoia dataset is divided into train/dev/test sets: |
|
|
|
| | Sequoia (total) | train | dev | test |
| :----: | :---: | :----: | :---: | :----: |
| #sentences | 3099 | 1955 | 273 | 871 |
| #MWEs | 3450 | 2170 | 306 | 974 |
| #Unseen MWEs | – | – | 100 | 300 |
|
|
|
This dataset has 6 distinct categories: |
|
* MWE: Non-verbal MWEs (e.g. **à peu près**) |
|
* IRV: Inherently reflexive verb (e.g. **s'occuper**) |
|
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**) |
|
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
|
* MVC: Multi-verb construction (e.g. **faire remarquer**) |
|
* VID: Verbal idiom (e.g. **voir le jour**) |
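Under the IOB2 scheme used for preprocessing, these 6 categories expand into 13 token-level labels: a `B-`/`I-` pair per category, plus `O` for tokens outside any MWE. A small sketch (the actual label order stored in the model's config may differ):

```python
# The 6 Sequoia categories expand into 13 token-level labels under IOB2:
# one B- and one I- label per category, plus O for "outside".
categories = ["MWE", "IRV", "LVC.cause", "LVC.full", "MVC", "VID"]

labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]
print(len(labels))  # 13
print(labels[:3])   # ['O', 'B-MWE', 'I-MWE']
```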
|
|
|
## Training procedure |
|
|
|
### Preprocessing |
|
|
|
The sequence labeling scheme employed for this task is Inside–Outside–Beginning (IOB2).
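For instance, in the example sentence above, the verbal idiom *mettre en bouche* receives `B-VID` on its first token and `I-VID` on the rest, while surrounding tokens get `O`:

```python
# IOB2 tagging illustration: B- marks the first token of an expression,
# I- marks its continuation, O marks tokens outside any expression.
tokens = ["pour", "mettre", "en",    "bouche", "les", "participants"]
tags   = ["O",    "B-VID",  "I-VID", "I-VID",  "O",   "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```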
|
|
|
### Fine-tuning



The model was fine-tuned on the train+dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, over 15 epochs.
|
|
|
### Evaluation results |
|
|
|
On the test set, this model achieves the following results: |
|
|
|
<table> |
|
<tr> |
|
<td colspan="3">Global MWE-based</td> |
|
<td colspan="3">Unseen MWE-based</td> |
|
</tr> |
|
<tr> |
|
<td>Precision</td><td>Recall</td><td>F1</td> |
|
<td>Precision</td><td>Recall</td><td>F1</td> |
|
</tr> |
|
<tr> |
|
<td>83.78</td><td>83.78</td><td>83.78</td> |
|
<td>57.05</td><td>60.67</td><td>58.80</td> |
|
</tr> |
|
</table> |
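The F1 scores follow directly from precision and recall as their harmonic mean, $F_1 = 2PR/(P+R)$; a quick check against the unseen-MWE figures in the table:

```python
# F1 is the harmonic mean of precision and recall; this reproduces
# the unseen-MWE score reported in the table above.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(57.05, 60.67), 2))  # 58.8, matching the 58.80 in the table
```

(For the global scores, precision and recall are both 83.78, so F1 is trivially 83.78 as well.)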
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@article{martin2019camembert, |
|
title={CamemBERT: a tasty French language model}, |
|
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t}, |
|
journal={arXiv preprint arXiv:1911.03894}, |
|
year={2019} |
|
} |
|
|
|
@article{candito2020french, |
|
title={A French corpus annotated for multiword expressions and named entities}, |
|
author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo}, |
|
journal={Journal of Language Modelling}, |
|
volume={8}, |
|
number={2}, |
|
year={2020}, |
|
publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN} |
|
} |
|
|
|
``` |