|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- ru |
|
- en |
|
base_model: |
|
- d0rj/rut5-base-summ |
|
pipeline_tag: summarization |
|
tags: |
|
- summarization |
|
- natural-language-processing |
|
- text-summarization |
|
- machine-learning |
|
- deep-learning |
|
- transformer |
|
- artificial-intelligence |
|
- text-analysis |
|
- sequence-to-sequence |
|
- pytorch |
|
- tensorflow |
|
- safetensors |
|
- t5 |
|
library_name: transformers |
|
--- |
|
|
|
 |
|
|
|
# Russian Text Summarization Model - LaciaSUM V1 (small) |
|
This model is a fine-tuned version of `d0rj/rut5-base-summ` for automatic abstractive text summarization. It targets Russian text and was fine-tuned on a custom CSV dataset of source texts paired with their summaries.
|
|
|
# Key Features |
|
* Objective: Automatic abstractive summarization of texts.
* Base Model: `d0rj/rut5-base-summ`.
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: Before tokenization, the prefix `summarize: ` is prepended to the original text, which tells the model to perform the summarization task.
|
# Training Settings
|
* Number of epochs: 9. |
|
* Batch size: 4 per device. |
|
* Warmup steps: 1000. |
|
* FP16 training enabled (if CUDA is available). |
|
* Hardware: Training was performed on an RTX 3070 (approximately 40 minutes of training). |
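The settings listed above could be expressed roughly as follows. This is a minimal sketch, not the original training script: only the epoch count, batch size, warmup steps, and the FP16 condition come from this card. The `output_dir` value and the helper names `cuda_available` and `build_training_args` are hypothetical, and every other hyperparameter is left at the library default because the card does not state it.

```python
import importlib.util

# Values stated in the model card; nothing else is known about the run.
TRAINING_KWARGS = {
    "num_train_epochs": 9,
    "per_device_train_batch_size": 4,
    "warmup_steps": 1000,
}


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def build_training_args(output_dir: str = "./lacia_sum_checkpoints"):
    """Build Seq2SeqTrainingArguments from the documented settings.

    `output_dir` is a placeholder path (assumption, not from the card).
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        fp16=cuda_available(),  # FP16 only when CUDA is available
        **TRAINING_KWARGS,
    )
```

The resulting object would be passed to `Seq2SeqTrainer` through its `args` parameter.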
|
|
|
# Description |
|
The model was fine-tuned with the Hugging Face Transformers library using `Seq2SeqTrainer`. The training script includes:

* Custom Dataset: The `SummarizationDataset` class reads the CSV file (with the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
* Token Processing: Padding tokens in the target text are replaced with `-100` so that the loss function ignores them and the loss is computed only over real summary tokens.
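The dataset handling and label masking described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original `SummarizationDataset`: the maximum lengths, the `utf-8` encoding, the `,` delimiter, the use of the standard `csv` module, and the helper name `mask_padding` are all assumptions. The `tokenizer` argument is expected to behave like a Hugging Face tokenizer (callable, with a `pad_token_id` attribute).

```python
import csv

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in the target sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


class SummarizationDataset:
    """Map-style dataset over a CSV with `Text` and `Summarize` columns."""

    def __init__(self, csv_path, tokenizer, max_source_len=512,
                 max_target_len=150, encoding="utf-8", delimiter=","):
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len  # assumed value
        self.max_target_len = max_target_len  # assumed value
        with open(csv_path, newline="", encoding=encoding) as f:
            reader = csv.DictReader(f, delimiter=delimiter)
            # Trim stray spaces from column names, e.g. " Text " -> "Text"
            self.rows = [
                {key.strip(): value for key, value in row.items()
                 if key is not None}
                for row in reader
            ]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        # Same task prefix that is used at inference time
        source = self.tokenizer(
            "summarize: " + row["Text"],
            max_length=self.max_source_len,
            padding="max_length",
            truncation=True,
        )
        target = self.tokenizer(
            row["Summarize"],
            max_length=self.max_target_len,
            padding="max_length",
            truncation=True,
        )
        return {
            "input_ids": source["input_ids"],
            "attention_mask": source["attention_mask"],
            "labels": mask_padding(target["input_ids"],
                                   self.tokenizer.pad_token_id),
        }
```

Masking matters because the targets are padded to a fixed length: without it, the model would also be trained to predict padding tokens.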
|
|
|
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats. |
|
**The model also accepts English text, but English summarization quality has not been tested.**
|
|
|
# Example Usage |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")

text = "Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях."

# Prepend the "summarize: " prefix used during fine-tuning
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Beam search with 4 beams; passing **inputs also forwards the attention mask
summary_ids = model.generate(**inputs, max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:", summary)
```
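To summarize several texts in one call, the single-text example can be extended roughly as follows. This is a sketch under assumptions: the helper names `add_prefix` and `summarize_batch` are hypothetical, stripping surrounding whitespace is an added convenience, and the length limits and beam settings simply reuse the values from the example above. The `tokenizer` and `model` arguments are the objects loaded with `from_pretrained`.

```python
PREFIX = "summarize: "


def add_prefix(texts):
    """Prepend the task prefix to every input, trimming outer whitespace."""
    return [PREFIX + text.strip() for text in texts]


def summarize_batch(texts, tokenizer, model,
                    max_input_len=512, max_summary_len=150):
    """Summarize a list of texts in a single padded batch."""
    inputs = tokenizer(
        add_prefix(texts),
        return_tensors="pt",
        padding=True,  # pad to the longest text in the batch
        truncation=True,
        max_length=max_input_len,
    )
    # The attention mask in `inputs` lets the model ignore the padding
    summary_ids = model.generate(
        **inputs,
        max_length=max_summary_len,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```

Calling `summarize_batch([text_a, text_b], tokenizer, model)` would return one summary string per input, in the same order.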
|
|
|
# Example of summarization |
|
**RU** |
|
Main text: |
|
|
|
``` |
|
Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. |
|
Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим |
|
новые перспективы в различных областях. |
|
``` |
|
|
|
Summarized text: |
|
|
|
``` |
|
Современные технологии оказывают значительное влияние на повседневную жизнь и рабочие процессы, включая |
|
искусственный интеллект, который помогает оптимизировать задачи и открывать новые перспективы. |
|
``` |
|
**EN** |
|
Main text: |
|
|
|
``` |
|
Modern technologies have a significant impact on our daily lives and work processes. Artificial intelligence |
|
is becoming an important tool that helps optimize tasks and opens up new opportunities in various fields. |
|
``` |
|
|
|
Summarized text: |
|
|
|
``` |
|
Matern technologies have a controration on our daily lives and work processes. Artificial intelligence |
|
is becoming an important tool and helps and opens up new opportunities. |
|
``` |
|
|
|
**Fine-tuned by LaciaStudio | LaciaAI**