# WtP usage in wtpsplit (Legacy)
This doc details how to use the old WtP models. You should probably use SaT instead.
## Usage
```python
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# optionally run on GPU for better performance
# also supports TPUs via e.g. wtp.to("xla:0"); in that case, pass `pad_last_batch=True` to wtp.split
wtp.half().to("cuda")

# returns ["Hello ", "This is a test."]
wtp.split("Hello This is a test.")

# returns an iterator yielding a list of sentences for every text
# do this instead of calling wtp.split on every text individually for much better performance
wtp.split(["Hello This is a test.", "And some more texts..."])

# if you're using a model with language adapters, also pass a `lang_code`
wtp.split("Hello This is a test.", lang_code="en")

# depending on your use case, adaptation to e.g. the Universal Dependencies style may give better results
# this always requires a language code
wtp.split("Hello This is a test.", lang_code="en", style="ud")
```
## ONNX support
You can enable ONNX inference for the `wtp-bert-*` models:

```python
wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
```

This requires `onnxruntime` and `onnxruntime-gpu`. It should give a good speedup on GPU!
```python
>>> from wtpsplit import WtP
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model = WtP("wtp-bert-mini")
>>> model.half().to("cuda")
>>> %timeit list(model.split(texts))
272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# onnxruntime GPU
>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
>>> %timeit list(model.split(texts))
198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
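If ONNX inference does not seem to run on GPU, you can check which execution providers your `onnxruntime` installation actually exposes. This is just a quick sanity check, not part of the wtpsplit API:

```python
import onnxruntime as ort

# with onnxruntime-gpu installed, this list should include "CUDAExecutionProvider"
print(ort.get_available_providers())
```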
Notes:

- The `wtp-canine-*` models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
- This does not work with Python 3.7 because `onnxruntime` does not support the opset we need for py37.
## Available Models
Pro tips: I recommend `wtp-bert-mini` for speed-sensitive applications, otherwise `wtp-canine-s-12l`. The `*-no-adapters` models provide a good tradeoff between speed and performance. You should probably not use `wtp-bert-tiny`.
| Model | English Score | English Score (adapted) | Multilingual Score | Multilingual Score (adapted) |
|---|---|---|---|---|
| wtp-bert-tiny | 83.8 | 91.9 | 79.5 | 88.6 |
| wtp-bert-mini | 91.8 | 95.9 | 84.3 | 91.3 |
| wtp-canine-s-1l | 94.5 | 96.5 | 86.7 | 92.8 |
| wtp-canine-s-1l-no-adapters | 93.1 | 96.4 | 85.1 | 91.8 |
| wtp-canine-s-3l | 94.4 | 96.8 | 86.7 | 93.4 |
| wtp-canine-s-3l-no-adapters | 93.8 | 96.4 | 86 | 92.3 |
| wtp-canine-s-6l | 94.5 | 97.1 | 87 | 93.6 |
| wtp-canine-s-6l-no-adapters | 94.4 | 96.8 | 86.4 | 92.8 |
| wtp-canine-s-9l | 94.8 | 97 | 87.7 | 93.8 |
| wtp-canine-s-9l-no-adapters | 94.3 | 96.9 | 86.6 | 93 |
| wtp-canine-s-12l | 94.7 | 97.1 | 87.9 | 94 |
| wtp-canine-s-12l-no-adapters | 94.5 | 97 | 87.1 | 93.2 |
The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". "adapted" means adaptation via WtP Punct; check out the paper for details.
For comparison, here are the English scores of some other tools:
| Model | English Score |
|---|---|
| SpaCy (sentencizer) | 86.8 |
| PySBD | 69.8 |
| SpaCy (dependency parser) | 93.1 |
| Ersatz | 91.6 |
| Punkt (nltk.sent_tokenize) | 92.5 |
## Paragraph Segmentation
Since WtP models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.
```python
# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument
wtp.split(text, do_paragraph_segmentation=True)
```
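For example, you can iterate over the nested structure; the `paragraph_threshold` value below is only illustrative, and `text` is assumed to be a multi-paragraph string:

```python
# raise `paragraph_threshold` to get fewer, larger paragraphs
paragraphs = wtp.split(text, do_paragraph_segmentation=True, paragraph_threshold=0.99)
for paragraph in paragraphs:
    for sentence in paragraph:
        print(sentence)
    print("---")  # paragraph boundary
```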
## Adaptation
WtP can adapt to the Universal Dependencies, OPUS100, or Ersatz corpus segmentation style in many languages via punctuation adaptation (preferred) or threshold adaptation.
### Punctuation Adaptation
```python
# this requires a `lang_code`
# check the paper or `wtp.mixtures` for supported styles
wtp.split(text, lang_code="en", style="ud")
```
This also allows changing the threshold, but the threshold values are inherently higher here since it is no longer the newline probability that is being thresholded:
```python
wtp.split(text, lang_code="en", style="ud", threshold=0.7)
```
To get the default threshold for a style:
```python
wtp.get_threshold("en", "ud", return_punctuation_threshold=True)
```
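As a sketch, you could start from that default and nudge it; the scaling factor here is arbitrary and only meant to illustrate making splitting slightly more conservative:

```python
# retrieve the style's default punctuation threshold, then raise it a bit
punct_threshold = wtp.get_threshold("en", "ud", return_punctuation_threshold=True)
wtp.split(text, lang_code="en", style="ud", threshold=punct_threshold * 1.1)
```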
### Threshold Adaptation
```python
threshold = wtp.get_threshold("en", "ud")
wtp.split(text, threshold=threshold)
```
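You can also set the threshold manually; lower values split more aggressively. A minimal sketch with a placeholder text:

```python
text = "Hello this is a test But this is different now Now the next one starts"

# a high threshold only splits where the model is very confident
print(wtp.split(text, threshold=0.9))
# a low threshold produces more (and shorter) segments
print(wtp.split(text, threshold=0.05))
```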
## Advanced Usage
Get the newline or sentence boundary probabilities for a text:
```python
# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")
```
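Assuming the probabilities are returned per character (one score for each character of the input), you can turn them into explicit boundary positions; the threshold value below is only an example:

```python
import numpy as np

text = "Hello This is a test."
probs = wtp.predict_proba(text)  # one probability per character (assumed)

# indices whose probability exceeds the threshold are candidate sentence boundaries
boundaries = np.where(np.asarray(probs) > 0.025)[0]
print(boundaries)
```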
Load a WtP model in HuggingFace transformers:
```python
# import wtpsplit.models to register the custom models
# (character-level BERT w/ hash embeddings and CANINE with language adapters)
import wtpsplit.models
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini")  # or some other model name
```
**New:** Adapt to your own corpus using WtP Punct:
Clone the repository:

```bash
git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit
```
Create your data:
```python
import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ],
                }
            }
        }
    },
    "dummy-dataset.pth",
)
```
Run adaptation:
```bash
python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en
```
This should print something like:
```
en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json
```
i.e. it runs adaptation on your data and saves the mixtures and evaluation results. You can then load and use the mixture like this:
```python
from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")
```
... and adjust the dataset name, language and model in the above to your needs.
## Reproducing the paper
`configs/` contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:
```bash
python wtpsplit/train/train.py configs/<config_name>.json
```
In addition:

- `wtpsplit/data_acquisition` contains the code for obtaining evaluation data and raw text from the mC4 corpus.
- `wtpsplit/evaluation` contains the code for:
  - intrinsic evaluation (i.e. sentence segmentation results) via `adapt.py`. The raw intrinsic results in JSON format are also at `evaluation_results/`
  - extrinsic evaluation on Machine Translation in `extrinsic.py`
  - baseline (PySBD, nltk, etc.) intrinsic evaluation in `intrinsic_baselines.py`
  - punctuation annotation experiments in `punct_annotation.py` and `punct_annotation_wtp.py`
## Supported Languages
| ISO code | Name |
|---|---|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bn | Bengali |
| ca | Catalan |
| ceb | Cebuano |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| he | Hebrew |
| hi | Hindi |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| ja | Japanese |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Central Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kirghiz |
| la | Latin |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| pa | Panjabi |
| pl | Polish |
| ps | Pushto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sq | Albanian |
| sr | Serbian |
| sv | Swedish |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zu | Zulu |