File size: 11,398 Bytes
0231c29 fda3442 8335936 fda3442 8335936 3468b6d fda3442 0231c29 9df5160 eb01b22 9df5160 eb01b22 9df5160 4ac704b eb01b22 9df5160 4ac704b 9df5160 942de75 9df5160 d329075 942de75 d329075 9df5160 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 |
---
language:
- uk
license: mit
library_name: transformers
datasets:
- skypro1111/ubertext-2-news-verbalized
widget:
- text: Очікувалось, що цей застосунок буде запущено о 11 ранку 22.08.2025, але розробники
затягнули святкування і запуск був відкладений на 2 тижні.
---
# Model Card for mbart-large-50-verbalization
## Model Description
`mbart-large-50-verbalization` is a fine-tuned version of the [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) model, specifically designed for the task of verbalizing Ukrainian text to prepare it for Text-to-Speech (TTS) systems. This model aims to transform structured data like numbers and dates into their fully expanded textual representations in Ukrainian.
## Architecture
This model is based on the [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) architecture, renowned for its effectiveness in translation and text generation tasks across numerous languages.
## Training Data
The model was fine-tuned on a subset of 457,610 sentences from the Ubertext dataset, focusing on news content. The verbalized equivalents were created using Google Gemini Pro, providing a rich basis for learning text transformation tasks.
Dataset [skypro1111/ubertext-2-news-verbalized](https://huggingface.co/datasets/skypro1111/ubertext-2-news-verbalized)
## Training Procedure
The model underwent 410,000 training steps (1 epoch).
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch
model_name = "facebook/mbart-large-50"
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized")
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
'train': dataset['train'],
'test': dataset['test']
})
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"
def preprocess_data(examples):
model_inputs = tokenizer(examples["inputs"], max_length=1024, truncation=True, padding="max_length")
with tokenizer.as_target_tokenizer():
labels = tokenizer(examples["labels"], max_length=1024, truncation=True, padding="max_length")
model_inputs["labels"] = labels["input_ids"]
return model_inputs
datasets = datasets.map(preprocess_data, batched=True)
model = MBartForConditionalGeneration.from_pretrained(model_name)
training_args = TrainingArguments(
output_dir=f"./results/{model_name}-verbalization",
evaluation_strategy="steps",
eval_steps=5000,
save_strategy="steps",
save_steps=1000,
save_total_limit=40,
learning_rate=2e-5,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
num_train_epochs=2,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=datasets["train"],
eval_dataset=datasets["test"],
)
trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
```
## Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "skypro1111/mbart-large-50-verbalization"
model = MBartForConditionalGeneration.from_pretrained(
model_name,
low_cpu_mem_usage=True,
device_map=device,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"
input_text = "<verbalization>:Цей додаток вийде 15.06.2025."
encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```
## ONNX usage
```bash
poetry new verbalizer
rm -rf verbalizer/tests/ verbalizer/verbalizer/ verbalizer/README.md
cd verbalizer/
poetry shell
wget https://huggingface.co/skypro1111/mbart-large-50-verbalization/resolve/main/onnx/infer_onnx_hf.py
poetry add transformers huggingface_hub onnxruntime-gpu torch
python infer_onnx_hf.py
```
```python
import onnxruntime
import numpy as np
from transformers import AutoTokenizer
import time
import os
from huggingface_hub import hf_hub_download
model_name = "skypro1111/mbart-large-50-verbalization"
def download_model_from_hf(repo_id=model_name, model_dir="onnx"):
"""Download ONNX models from HuggingFace Hub."""
os.makedirs(model_dir, exist_ok=True)
files = ["onnx/encoder_model.onnx", "onnx/decoder_model.onnx", "onnx/decoder_model.onnx_data"]
for file in files:
hf_hub_download(
repo_id=repo_id,
filename=file,
local_dir=model_dir
)
return files
def create_onnx_session(model_path, use_gpu=True):
"""Create an ONNX inference session."""
# Session options
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.enable_mem_pattern = True
session_options.enable_mem_reuse = True
session_options.intra_op_num_threads = 8
session_options.log_severity_level = 1
cuda_provider_options = {
'device_id': 0,
'arena_extend_strategy': 'kSameAsRequested',
'gpu_mem_limit': 0, # 0 means no limit
'cudnn_conv_algo_search': 'DEFAULT',
'do_copy_in_default_stream': True,
}
print(f"Available providers: {onnxruntime.get_available_providers()}")
if use_gpu and 'CUDAExecutionProvider' in onnxruntime.get_available_providers():
providers = [('CUDAExecutionProvider', cuda_provider_options)]
print("Using CUDA for inference")
else:
providers = ['CPUExecutionProvider']
print("Using CPU for inference")
session = onnxruntime.InferenceSession(
model_path,
providers=providers,
sess_options=session_options
)
return session
def generate_text(text, tokenizer, encoder_session, decoder_session, max_length=128):
"""Generate text for a single input."""
# Prepare input
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
input_ids = inputs["input_ids"].astype(np.int64)
attention_mask = inputs["attention_mask"].astype(np.int64)
# Run encoder
encoder_outputs = encoder_session.run(
output_names=["last_hidden_state"],
input_feed={
"input_ids": input_ids,
"attention_mask": attention_mask,
}
)[0]
# Initialize decoder input
decoder_input_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
# Generate sequence
for _ in range(max_length):
# Run decoder
decoder_outputs = decoder_session.run(
output_names=["logits"],
input_feed={
"input_ids": decoder_input_ids,
"encoder_hidden_states": encoder_outputs,
"encoder_attention_mask": attention_mask,
}
)[0]
# Get next token
next_token = decoder_outputs[:, -1:].argmax(axis=-1)
decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
# Check if sequence is complete
if tokenizer.eos_token_id in decoder_input_ids[0]:
break
# Decode sequence
output_text = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
return output_text
def main():
# Print available providers
print("Available providers:", onnxruntime.get_available_providers())
# Download models from HuggingFace
print("\nDownloading models from HuggingFace...")
encoder_path, decoder_path, _ = download_model_from_hf()
# Load tokenizer and models
print("\nLoading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_UA"
tokenizer.tgt_lang = "uk_UA"
# Create ONNX sessions
print("\nLoading encoder...")
encoder_session = create_onnx_session(encoder_path)
print("\nLoading decoder...")
decoder_session = create_onnx_session(decoder_path)
# Test examples
test_inputs = [
"мій телефон 0979456822",
"квартира площею 11 тис кв м.",
"Пропонували хабар у 1 млрд грн.",
"1 2 3 4 5 6 7 8 9 10.",
"Крім того, парламентарій володіє шістьма ділянками землі (дві площею 25000 кв м, дві по 15000 кв м та дві по 10000 кв м) розташованими в Сосновій Балці Луганської області.",
"Підписуючи цей документ у 2003 році, голови Росії та України мали намір зміцнити співпрацю та сприяти розширенню двосторонніх відносин.",
"Очікується, що цей застосунок буде запущено 22.08.2025.",
"За інформацією від Державної служби з надзвичайних ситуацій станом на 7 ранку 15 липня.",
]
print("\nWarming up...")
_ = generate_text(test_inputs[0], tokenizer, encoder_session, decoder_session)
print("\nRunning inference...")
for text in test_inputs:
print(f"\nInput: {text}")
t = time.time()
output = generate_text(text, tokenizer, encoder_session, decoder_session)
print(f"Output: {output}")
print(f"Time: {time.time() - t:.2f} seconds")
if __name__ == "__main__":
main()
```
## Performance
Evaluation metrics were not explicitly used for this model. Its performance is primarily demonstrated through its application in enhancing the naturalness of TTS outputs.
## Limitations and Ethical Considerations
Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.
## Citation
Ubertext 2.0
```
@inproceedings{chaplynskyi-2023-introducing,
title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
author = "Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.1",
pages = "1--10",
}
```
mBart-large-50
```
@article{tang2020multilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## License
This model is released under the MIT License, in line with the base mbart-large-50 model. |