LaciaStudio
/

Lacia_sum_small_v1

LaciaStudio committed on Feb 9

Commit 8855296 · verified · 1 parent: 63f41f5

Update README.md

Files changed (1)

README.md +72 -3


@@ -1,3 +1,72 @@
----
-license: cc-by-nc-4.0
----

+---
+license: cc-by-nc-4.0
+language:
+- ru
+- en
+base_model:
+- d0rj/rut5-base-summ
+pipeline_tag: summarization
+tags:
+- summarization
+- natural-language-processing
+- text-summarization
+- machine-learning
+- deep-learning
+- transformer
+- artificial-intelligence
+- text-analysis
+- sequence-to-sequence
+- pytorch
+- tensorflow
+- safetensors
+- t5
+library_name: transformers
+---
+# Russian Text Summarization Model - LaciaSUM V1 (small)
+This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It was adapted for Russian text and fine-tuned on a custom CSV dataset of original texts paired with their summaries.
+# Key Features
+* Objective: Automatic abstractive summarization of texts.
+* Base Model: d0rj/rut5-base-summ.
+* Dataset: A custom CSV file with columns Text (original text) and Summarize (summary).
+* Preprocessing: Before tokenization, the prefix `summarize: ` is added to the original text, which signals the summarization task to the model.
+# Training Settings
+* Number of epochs: 9.
+* Batch size: 4 per device.
+* Warmup steps: 1000.
+* FP16 training enabled (if CUDA is available).
+* Hardware: Training ran on an RTX 3070 and took approximately 40 minutes.
+# Description
+The model was fine-tuned with the Hugging Face Transformers library and its Seq2SeqTrainer. The training script includes:
+* Custom Dataset: The SummarizationDataset class reads the CSV file (with the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
+* Token Processing: Padding tokens in the target text are replaced with -100 so that they are ignored when the loss is computed.
+
+This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.
+**The model also supports English, but English support has not been tested.**
+# Example Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+# Load the tokenizer and the model.
+# Replace the placeholder with the model's Hub repo id, e.g. "LaciaStudio/Lacia_sum_small_v1".
+model_id = "your_username/model_name"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+
+# Example text to summarize
+text = "Your long text that needs summarizing."
+
+# Add the same prefix that was used during training
+input_text = "summarize: " + text
+inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
+
+# Generate the summary; **inputs passes both input_ids and attention_mask
+summary_ids = model.generate(**inputs, max_length=150, num_beams=4, early_stopping=True)
+summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
+print("Summary:", summary)
+```
+**Created by LaciaStudio | LaciaAI**
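
The prefix and label-masking steps that the README describes can be sketched in plain Python as follows. This is a minimal illustration and not the author's training script, which is not part of this commit: the helper names `add_prefix` and `mask_padding_labels`, the token ids, and the pad id of 0 are all assumptions chosen for the example.

```python
from typing import List

PREFIX = "summarize: "  # task prefix named in the README
IGNORE_INDEX = -100     # default ignore_index of torch.nn.CrossEntropyLoss


def add_prefix(text: str) -> str:
    """Prepend the task prefix to the source text before tokenization."""
    return PREFIX + text


def mask_padding_labels(label_ids: List[int], pad_token_id: int) -> List[int]:
    """Replace padding ids in the target sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Hypothetical token ids for a target summary padded to length 6.
# The pad id of 0 is an assumption made for this example.
pad_token_id = 0
target_ids = [1871, 42, 9, 1, 0, 0]

source = add_prefix("Your long text that needs summarizing.")
labels = mask_padding_labels(target_ids, pad_token_id)
print(source)
print(labels)
```

With a real tokenizer, the pad id would typically be read from `tokenizer.pad_token_id` rather than hard-coded, and the masked list would be passed to the model as `labels`.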