---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
pipeline_tag: token-classification
tags:
- mwe
---
# Multiword expression recognition
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs.
## Model description
`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert/camembert-large) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset for the MWER task.
## How to use
You can use this model directly with a pipeline for token classification:
```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
[{'entity': 'B-MWE',
'score': 0.99492574,
'index': 4,
'word': '▁rendez',
'start': 15,
'end': 22},
{'entity': 'I-MWE',
'score': 0.9344883,
'index': 5,
'word': '-',
'start': 22,
'end': 23},
{'entity': 'I-MWE',
'score': 0.99398583,
'index': 6,
'word': 'vous',
'start': 23,
'end': 27},
{'entity': 'B-VID',
'score': 0.9827843,
'index': 22,
'word': '▁mettre',
'start': 106,
'end': 113},
{'entity': 'I-VID',
'score': 0.9835186,
'index': 23,
'word': '▁en',
'start': 113,
'end': 116},
{'entity': 'I-VID',
'score': 0.98324823,
'index': 24,
'word': '▁bouche',
'start': 116,
'end': 123}]
>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
'score': 0.9744666,
'word': 'rendez-vous',
'start': 15,
'end': 27},
{'entity_group': 'VID',
'score': 0.9831837,
'word': 'mettre en bouche',
'start': 106,
'end': 123}]
```
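The grouped output above can be post-processed into simple pairs of category and surface form. The helper below is an illustrative sketch (not part of the model card's API); it runs on a copy of the sample output so it works without loading the model:

```python
# Illustrative helper: turn the grouped-entity output shown above
# into simple (category, surface form) pairs.
def extract_mwes(grouped_entities):
    """Return a list of (category, text) tuples from pipeline output."""
    return [(e["entity_group"], e["word"]) for e in grouped_entities]

# Sample output copied from the example above.
grouped = [
    {"entity_group": "MWE", "score": 0.9744666, "word": "rendez-vous", "start": 15, "end": 27},
    {"entity_group": "VID", "score": 0.9831837, "word": "mettre en bouche", "start": 106, "end": 123},
]

print(extract_mwes(grouped))
# [('MWE', 'rendez-vous'), ('VID', 'mettre en bouche')]
```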
## Training data
The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia (total) | train | dev | test |
| :----------: | :-------------: | :---: | :-: | :--: |
| #sentences   | 3099            | 1955  | 273 | 871  |
| #MWEs        | 3450            | 2170  | 306 | 974  |
| #Unseen MWEs | –               | –     | 100 | 300  |
This dataset has 6 distinct categories:
* MWE: Non-verbal MWEs (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)
## Training procedure
### Preprocessing
The sequence labeling scheme used for this task is inside–outside–beginning (IOB2) tagging: the first token of an MWE is labeled `B-<category>`, subsequent tokens `I-<category>`, and all other tokens `O`.
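As a minimal sketch of IOB2 tagging (the tokens and span below are hypothetical examples, not the actual preprocessing code):

```python
# Minimal IOB2 tagging sketch: spans maps
# (start_token, end_token_exclusive) -> MWE category.
def iob2_tags(tokens, spans):
    tags = ["O"] * len(tokens)
    for (start, end), category in spans.items():
        tags[start] = f"B-{category}"          # first token of the expression
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # continuation tokens
    return tags

tokens = ["mettre", "en", "bouche", "les", "participants"]
print(iob2_tags(tokens, {(0, 3): "VID"}))
# ['B-VID', 'I-VID', 'I-VID', 'O', 'O']
```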
### Fine-tuning
The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, over the course of 15 epochs.
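These hyperparameters could be expressed with `transformers.TrainingArguments` roughly as follows. This is a hypothetical reconstruction, not the actual training script; only the learning rate, batch size, and epoch count come from this model card, and `output_dir` is an arbitrary placeholder:

```python
from transformers import TrainingArguments

# Hypothetical config fragment reflecting the hyperparameters stated above.
training_args = TrainingArguments(
    output_dir="camembert-mwer",        # placeholder path
    learning_rate=3e-5,                 # from the model card
    per_device_train_batch_size=10,     # from the model card
    num_train_epochs=15,                # from the model card
)
```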
## Evaluation results
On the test set, this model achieves the following results:
<table>
<tr>
<td colspan="3">Global MWE-based</td>
<td colspan="3">Unseen MWE-based</td>
</tr>
<tr>
<td>Precision</td><td>Recall</td><td>F1</td>
<td>Precision</td><td>Recall</td><td>F1</td>
</tr>
<tr>
<td>83.78</td><td>83.78</td><td>83.78</td>
<td>57.05</td><td>60.67</td><td>58.80</td>
</tr>
</table>
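As a quick sanity check on the table, F1 is the harmonic mean of precision and recall; for the unseen-MWE column:

```python
# F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(57.05, 60.67), 2))
# 58.8
```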
## BibTeX entry and citation info
```bibtex
@article{martin2019camembert,
title={CamemBERT: a tasty French language model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
journal={arXiv preprint arXiv:1911.03894},
year={2019}
}
@article{candito2020french,
title={A French corpus annotated for multiword expressions and named entities},
author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
journal={Journal of Language Modelling},
volume={8},
number={2},
year={2020},
publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```