---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---
![Official LaciaSUM Logo](https://huggingface.co/LaciaStudio/Lacia_sum_small_v1/resolve/main/LaciaSUM.png)
# Russian Text Summarization Model - LaciaSUM V1 (small)
This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It was adapted specifically for Russian texts and fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.
# Key Features
* Objective: Automatic abstractive summarization of texts.
* Base Model: d0rj/rut5-base-summ.
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: The prefix `summarize: ` is prepended to the original text before tokenization, which helps the model focus on the summarization task.
# Training Settings
* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on an RTX 3070 (approximately 40 minutes of training).
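In the Transformers API, the settings above correspond roughly to the following `Seq2SeqTrainingArguments` sketch. This is not the original training script; the `output_dir` value is a placeholder:

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./lacia_sum_small_v1",  # placeholder path, not from the original script
    num_train_epochs=9,                 # number of epochs
    per_device_train_batch_size=4,      # batch size per device
    warmup_steps=1000,                  # warmup steps
    fp16=torch.cuda.is_available(),     # FP16 training only when CUDA is available
)
```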
# Description
The model was fine-tuned using the Transformers library and the Seq2SeqTrainer from Hugging Face. The training script includes:
* Custom Dataset: The `SummarizationDataset` class reads the CSV file (handling encoding and the separator), strips extra whitespace from column names, and tokenizes both the source text and the target summary.
* Token Processing: Padding token ids in the target sequence are replaced with -100 so that the loss function ignores them.
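The label-masking step can be sketched in plain Python: replacing every padding token id with -100 makes the cross-entropy loss skip those positions. The function name and the pad id in the example are illustrative:

```python
def mask_padding(label_ids, pad_token_id):
    """Replace padding token ids with -100 so the loss ignores those positions."""
    return [tok if tok != pad_token_id else -100 for tok in label_ids]

# e.g. with pad_token_id = 0:
# mask_padding([250, 17, 0, 0], 0) -> [250, 17, -100, -100]
```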
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.
**The model also accepts English input, but English support has not been tested.**
# Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
text = "Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях."
# "summarize: " prefix
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```
# Example of summarization
**RU**
Main text:
```
Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы.
Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим
новые перспективы в различных областях.
```
Summarized text:
```
Современные технологии оказывают значительное влияние на повседневную жизнь и рабочие процессы, включая
искусственный интеллект, который помогает оптимизировать задачи и открывать новые перспективы.
```
**EN**
Main text:
```
Modern technologies have a significant impact on our daily lives and work processes. Artificial intelligence
is becoming an important tool that helps optimize tasks and opens up new opportunities in various fields.
```
Summarized text:
```
Matern technologies have a controration on our daily lives and work processes. Artificial intelligence
is becoming an important tool and helps and opens up new opportunities.
```
**Finetuned by LaciaStudio | LaciaAI**