---
license: mit
tags:
- unsloth
- trl
- sft
datasets:
- olafdil/French_MultiSpeaker_Diarization
language:
- fr
base_model:
- unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
pipeline_tag: text-generation
---
# Fine-Tuned Model: Meta-Llama-3.1-8B-Instruct-bnb-4bit

This is a fine-tuned version of the [Meta-Llama-3.1-8B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit) model, adapted for French multi-speaker diarization. Below you'll find details about the fine-tuning process, the dataset, and how to use this model.

---

## Model Details

- **Base Model**: Meta-Llama-3.1-8B-Instruct-bnb-4bit
- **Quantization**: 4-bit quantization for reduced memory usage
- **Purpose**: Fine-tuned for multi-speaker diarization in French
- **Techniques** (see the sketch after this list):
  - LoRA (Low-Rank Adaptation) for efficient fine-tuning
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  - Rank: `16`
  - LoRA alpha: `16`
  - Gradient checkpointing: enabled
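
For reference, here is a minimal sketch of this configuration using Unsloth's `FastLanguageModel` API. The exact training script was not published, so treat it as illustrative rather than the original code:

```python
# Illustrative reconstruction of the LoRA setup described above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=120_000,   # value from the Training Configuration section
    load_in_4bit=True,        # 4-bit quantization, as stated on this card
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```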

---

## Dataset

The model was fine-tuned on the `French_MultiSpeaker_Diarization` dataset, hosted on the Hugging Face Hub (a loading snippet follows the list):

- **Dataset Name**: [French_MultiSpeaker_Diarization](https://huggingface.co/datasets/olafdil/French_MultiSpeaker_Diarization)
- **Split Used**: train
- **Dataset Content**:
  - Multi-speaker conversational data in French
  - Labeled diarization information used as supervision for the diarization task
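
The split can be pulled directly with the `datasets` library, for example:

```python
from datasets import load_dataset

# Train split used for fine-tuning, per this card.
dataset = load_dataset("olafdil/French_MultiSpeaker_Diarization", split="train")
print(dataset)  # inspect the size and column names
```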

---

## Training Configuration

### Hyperparameters

- **Max Sequence Length**: `120000`
- **LoRA Dropout**: `0`
- **Bias**: `none`
- **Use Gradient Checkpointing**: enabled for efficiency
- **Custom Prompting**: chat templates applied to format prompts, using the `llama-3.1` template (see the snippet below)
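
A minimal sketch of that chat-template step, applied to the tokenizer loaded in the Model Details snippet above:

```python
from unsloth.chat_templates import get_chat_template

# Patch the tokenizer (from the Model Details snippet) so that
# apply_chat_template() renders prompts in llama-3.1 format.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
```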

### Training Workflow

1. **Model Loading**:
   - Loaded the base model with `FastLanguageModel.from_pretrained()`.
   - Applied 4-bit quantization for memory efficiency.

2. **Dataset Preparation**:
   - Tokenized the dataset using a chat template from `unsloth.chat_templates`.
   - Formatted prompts with `apply_chat_template()` to suit the diarization task.

3. **Fine-Tuning**:
   - Applied LoRA to the target layers listed above.
   - Enabled gradient checkpointing to reduce memory overhead during training (an end-to-end sketch follows this list).
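
Putting the pieces together, here is a hedged sketch of steps 2 and 3. It reuses `model`, `tokenizer`, and `dataset` from the snippets above; the `conversations` column name and all trainer hyperparameters are assumptions (the original script was not published), and the `SFTTrainer` usage is inferred from this card's `trl`/`sft` tags:

```python
# Assumes: `model` and `tokenizer` from the Model Details snippet (tokenizer
# patched with the llama-3.1 template) and `dataset` from the Dataset
# section. The column name and hyperparameters below are placeholders.
from trl import SFTTrainer
from transformers import TrainingArguments

def format_prompts(batch):
    # Render each conversation (a list of {"role", "content"} messages)
    # into a single training string via the llama-3.1 chat template.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False)
        for convo in batch["conversations"]  # assumed column name
    ]
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # column produced by format_prompts
    max_seq_length=120_000,      # matches the card's max sequence length
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,   # placeholder
        gradient_accumulation_steps=4,   # placeholder
        num_train_epochs=1,              # placeholder
        learning_rate=2e-4,              # placeholder
    ),
)
trainer.train()
```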

---

## Usage

### Load the Model

You can load this model directly from the Hugging Face Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "olafdil/FrDiarization-Llama-3.1-8B-4bit"

# torch_dtype="auto" keeps the dtype stored in the checkpoint;
# device_map="auto" places the model on a GPU when one is available.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

### Inference Example

```python
template = """
I have an audio transcription where multiple speakers are involved in a conversation.
Your task is to distinguish the different speakers and diarize the text accordingly.
Each speaker's dialogue should be clearly labeled, such as 'Speaker 1:', 'Speaker 2:', etc.
Ensure that the labels remain consistent throughout the transcription and that the text is formatted neatly.
Here's the transcription:
"""

transcription = "Your input transcription here"
prompt = template + transcription

# Move inputs to the model's device and leave enough room for the
# diarized output (generate() produces very short completions by default).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Dependencies

The following libraries are required:

- `transformers`
- `datasets`
- `unsloth`
- `torch`

Install them with:

```bash
pip install transformers datasets torch unsloth
```

---

## Limitations

- The model has been fine-tuned specifically for French multi-speaker diarization and may not generalize well to other tasks or languages.
- 4-bit quantization reduces memory usage but may slightly reduce precision.

---

## Citation

If you use this model, please consider citing the base model and the dataset:

- **Base Model**: [Meta-Llama-3.1-8B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit)
- **Dataset**: [French_MultiSpeaker_Diarization](https://huggingface.co/datasets/olafdil/French_MultiSpeaker_Diarization)