---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---
![Official LaciaSUM Logo](https://huggingface.co/LaciaStudio/Lacia_sum_small_v1/resolve/main/LaciaSUM.png)
# Russian Text Summarization Model - LaciaSUM V1 (small)
This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It was adapted specifically for Russian texts and fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.
# Key Features
* Objective: Automatic abstractive summarization of texts.
* Base Model: d0rj/rut5-base-summ.
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: The prefix `summarize: ` is prepended to the original text before tokenization, which helps the model focus on the summarization task.
# Training Settings
* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on an RTX 3070 (approximately 40 minutes of training).
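In the Transformers API, the settings above correspond roughly to the following `Seq2SeqTrainingArguments` sketch. This is not the original training script; the `output_dir` value is a placeholder:

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./lacia_sum_small_v1",  # placeholder path, not from the original script
    num_train_epochs=9,                 # number of epochs
    per_device_train_batch_size=4,      # batch size per device
    warmup_steps=1000,                  # warmup steps
    fp16=torch.cuda.is_available(),     # FP16 training only when CUDA is available
)
```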
# Description
The model was fine-tuned using the Transformers library and the Seq2SeqTrainer from Hugging Face. The training script includes:
* Custom Dataset: The `SummarizationDataset` class reads the CSV file (handling encoding and the separator), strips extra whitespace from column names, and tokenizes both the source text and the target summary.
* Token Processing: Padding token ids in the target sequence are replaced with -100 so that the loss function ignores them.
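The label-masking step can be sketched in plain Python: replacing every padding token id with -100 makes the cross-entropy loss skip those positions. The function name and the pad id in the example are illustrative:

```python
def mask_padding(label_ids, pad_token_id):
    """Replace padding token ids with -100 so the loss ignores those positions."""
    return [tok if tok != pad_token_id else -100 for tok in label_ids]

# e.g. with pad_token_id = 0:
# mask_padding([250, 17, 0, 0], 0) -> [250, 17, -100, -100]
```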
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.
**The model also accepts English input, but English support has not been tested.**
# Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
text = "Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях."
# "summarize: " prefix
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```
# Example of summarization
**RU**
Main text:
```
Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы.
Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим
новые перспективы в различных областях.
```
Summarized text:
```
Современные технологии оказывают значительное влияние на повседневную жизнь и рабочие процессы, включая
искусственный интеллект, который помогает оптимизировать задачи и открывать новые перспективы.
```
**EN**
Main text:
```
Modern technologies have a significant impact on our daily lives and work processes. Artificial intelligence
is becoming an important tool that helps optimize tasks and opens up new opportunities in various fields.
```
Summarized text:
```
Matern technologies have a controration on our daily lives and work processes. Artificial intelligence
is becoming an important tool and helps and opens up new opportunities.
```
**Finetuned by LaciaStudio | LaciaAI**