---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---
# Russian Text Summarization Model - LaciaSUM V1 (small)

This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It has been adapted specifically for Russian texts and was fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.

# Key Features
* Objective: automatic abstractive summarization of texts.
* Base model: d0rj/rut5-base-summ.
* Dataset: a custom CSV file with the columns Text (original text) and Summarize (summary).
* Preprocessing: before tokenization, the prefix summarize: is added to the original text, which helps the model focus on the summarization task.

# Training Settings
* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: training was performed on an RTX 3070 (approximately 40 minutes).

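For concreteness, here is a minimal sketch of how these settings map onto Hugging Face Seq2SeqTrainingArguments. The output_dir and CSV path are hypothetical, and SummarizationDataset refers to the class sketched under Description below; this illustrates the listed hyperparameters, not the exact original training script.

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Base model named in the model card
tokenizer = AutoTokenizer.from_pretrained("d0rj/rut5-base-summ")
model = AutoModelForSeq2SeqLM.from_pretrained("d0rj/rut5-base-summ")

# Settings listed above; output_dir is an illustrative assumption
training_args = Seq2SeqTrainingArguments(
    output_dir="./laciasum-v1-small",
    num_train_epochs=9,
    per_device_train_batch_size=4,
    warmup_steps=1000,
    fp16=torch.cuda.is_available(),  # FP16 only when CUDA is available
)

# SummarizationDataset is sketched under Description below;
# "dataset.csv" is a hypothetical path
train_dataset = SummarizationDataset("dataset.csv", tokenizer)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```
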
# Description
The model was fine-tuned using the Transformers library and the Seq2SeqTrainer from Hugging Face. The training script includes:

* Custom dataset: the SummarizationDataset class reads the CSV file (ensuring correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
* Token processing: padding tokens in the target text are replaced with -100 so that they are ignored during loss computation (see the sketch after this list).

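A minimal sketch of what such a dataset class could look like, assuming pandas for CSV reading; the encoding, separator, and maximum lengths are illustrative assumptions rather than the original script's values:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class SummarizationDataset(Dataset):
    """Reads a CSV of (Text, Summarize) pairs and tokenizes them for seq2seq training."""

    def __init__(self, csv_path, tokenizer, max_source_len=512, max_target_len=150):
        # encoding and separator are assumptions; adjust to your file
        df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
        df.columns = df.columns.str.strip()  # trim extra spaces from column names
        self.texts = df["Text"].tolist()
        self.summaries = df["Summarize"].tolist()
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Prepend the task prefix, as described under Key Features
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_source_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_target_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding tokens with -100 so they are ignored by the loss
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```
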
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.

**The model also supports English, but this support has not been tested.**

# Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("your_username/model_name")
model = AutoModelForSeq2SeqLM.from_pretrained("your_username/model_name")

# Example text to summarize
text = "Your long text that needs summarizing."

# Add the prefix, as during training
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:", summary)
```
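The model can also be driven through the high-level pipeline API. A brief sketch, using the same placeholder repo id; the summarize: prefix is prepended manually here because it was used during training and may not be applied automatically by the checkpoint's configuration:

```python
from transformers import pipeline

# Same placeholder repo id as above
summarizer = pipeline("summarization", model="your_username/model_name")

# Prepend the training-time prefix manually; automatic prefixing
# depends on the checkpoint's config and is not guaranteed here
text = "Your long text that needs summarizing."
result = summarizer("summarize: " + text, max_length=150, num_beams=4)

print("Summary:", result[0]["summary_text"])
```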

**Created by LaciaStudio | LaciaAI**