---
language:
- cy
tags:
- punctuation prediction
- punctuation
license: mit
widget:
- text: "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn ne ddwyrain Cymru"
  example_title: "Example 1"
- text: "Mae Pwllheli yn dref yng Ngwynedd Gogledd Cymru ac mae Llandrindod ym Mhowys"
  example_title: "Example 2"
metrics:
- f1
---
This model predicts punctuation for Welsh-language text. It was created to restore punctuation in text transcribed by speech recognition models such as https://huggingface.co/techiaith/wav2vec2-xlsr-ft-cy. The model restores the following punctuation markers: "." "," "?" "-" ":"
The model was trained on Welsh texts extracted from the Welsh Parliament / Senedd Record of Proceedings from 1999 to 2010 and from 2016 to the present day. Please note that the training data consists of political speeches, some originally spoken in Welsh and some translated. The model might therefore perform differently on texts from other domains.
This model is based on the work of [oliverguhr/fullstop-deep-punctuation-prediction](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction) and [softcatala/fullstop-catalan-punctuation-prediction](https://huggingface.co/softcatala/fullstop-catalan-punctuation-prediction).
## Install
To get started, install the `deepmultilingualpunctuation` package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):
```bash
pip install deepmultilingualpunctuation
```
### Restore Punctuation
```python
from deepmultilingualpunctuation import PunctuationModel

# Load the Welsh punctuation model from the Hugging Face Hub
model = PunctuationModel("techiaith/fullstop-welsh-punctuation-prediction")
# Unpunctuated input, for example a speech recognition transcript
text = "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn ne ddwyrain Cymru"
result = model.restore_punctuation(text)
print(result)
```
**output**
```
[
  {
    "entity_group": "LABEL_0",
    "score": 0.9999812841415405,
    "word": "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn",
    "start": 0,
    "end": 58
  },
  {
    "entity_group": "LABEL_4",
    "score": 0.9787278771400452,
    "word": "ne",
    "start": 59,
    "end": 61
  },
  {
    "entity_group": "LABEL_0",
    "score": 0.9999902248382568,
    "word": "ddwyrain",
    "start": 62,
    "end": 70
  },
  {
    "entity_group": "LABEL_3",
    "score": 0.9484745860099792,
    "word": "Cymru",
    "start": 71,
    "end": 76
  }
]
```
> A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn ne-ddwyrain Cymru?
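
The labelled word groups in the output map onto the restored sentence roughly as follows. This is a minimal sketch, not the package's own implementation: the `LABEL_TO_MARK` table and the `apply_labels` helper are hypothetical names, and the label order is an assumption inferred from the marker list and the example output (`LABEL_4` producing "-" and `LABEL_3` producing "?").

```python
# Assumed label order, inferred from the marker list and the example output.
LABEL_TO_MARK = {
    "LABEL_0": "",   # no punctuation after the word
    "LABEL_1": ".",
    "LABEL_2": ",",
    "LABEL_3": "?",
    "LABEL_4": "-",
    "LABEL_5": ":",
}


def apply_labels(groups):
    """Rebuild punctuated text from labelled word groups (hypothetical helper)."""
    pieces = []
    for group in groups:
        mark = LABEL_TO_MARK.get(group["entity_group"], "")
        # Every word in a group carries the group's label.
        for word in group["word"].split():
            pieces.append(word + mark)
            # A hyphen joins the next word directly; otherwise add a space.
            pieces.append("" if mark == "-" else " ")
    return "".join(pieces).strip()


groups = [
    {"entity_group": "LABEL_0",
     "word": "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn"},
    {"entity_group": "LABEL_4", "word": "ne"},
    {"entity_group": "LABEL_0", "word": "ddwyrain"},
    {"entity_group": "LABEL_3", "word": "Cymru"},
]
print(apply_labels(groups))
```

Only the `entity_group` and `word` fields are used here; the `score`, `start` and `end` fields are ignored for simplicity.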
## Results
The model achieves the following precision, recall and F1 scores for the different punctuation markers (label `0` means no punctuation after the word):
| Label | Precision | Recall | f1-score | Support |
| ------------- | ----- | ----- | ----- | ----- |
| 0 | 0.99 | 0.99 | 0.99 | 12124280 |
| . | 0.88 | 0.89 | 0.88 | 455896 |
| , | 0.84 | 0.82 | 0.83 | 771813 |
| ? | 0.92 | 0.88 | 0.90 | 54878 |
| - | 0.95 | 0.94 | 0.95 | 31545 |
| : | 0.91 | 0.87 | 0.89 | 39618 |
| | | | | |
| accuracy | | | 0.98 | 13478030 |
| macro avg | 0.91 | 0.90 | 0.91 | 13478030 |
| weighted avg | 0.97 | 0.98 | 0.97 | 13478030 |
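
The summary rows can be related to the per-label rows with a short calculation. The sketch below copies the rounded F1 scores and supports from the table into a hypothetical `rows` dictionary and recomputes the total support, the macro average (unweighted mean over labels) and the weighted average (mean weighted by support). Because the inputs are rounded to two decimals, the recomputed averages may differ from the table in the last digit.

```python
# Per-label (F1 score, support) pairs copied from the table above.
rows = {
    "0": (0.99, 12124280),
    ".": (0.88, 455896),
    ",": (0.83, 771813),
    "?": (0.90, 54878),
    "-": (0.95, 31545),
    ":": (0.89, 39618),
}

# Total number of evaluated tokens across all labels.
total_support = sum(support for _, support in rows.values())

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each label counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.3f}")
print(f"weighted F1:   {weighted_f1:.3f}")
```

The gap between the two averages reflects how strongly label `0` dominates the data: it pulls the weighted average up, while the macro average gives the rarer markers equal weight.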