---
license: mit
tags:
- unsloth
- trl
- sft
datasets:
- olafdil/French_MultiSpeaker_Diarization
language:
- fr
base_model:
- unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
pipeline_tag: text-generation
---
# Fine-Tuned Model: Meta-Llama-3.1-8B-Instruct-bnb-4bit
This is a fine-tuned version of the [Meta-Llama-3.1-8B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit) model, adapted for French multi-speaker diarization tasks. Below, you'll find details about the fine-tuning process, dataset, and how to use this model.
---
## Model Details
- **Base Model**: Meta-Llama-3.1-8B-Instruct-bnb-4bit
- **Quantization**: 4-bit quantization for reduced memory usage
- **Purpose**: Fine-tuned for multi-speaker diarization in French.
- **Techniques**:
  - LoRA (Low-Rank Adaptation) for efficient fine-tuning (see the setup sketch after this list).
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
  - Rank: `16`
  - LoRA alpha: `16`
  - Gradient checkpointing: enabled.
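The configuration above maps directly onto Unsloth's API. Below is a minimal, untested sketch of how such a setup could be reproduced; anything not stated in this card (dtype, random seed, etc.) is left at its library default.
```python
from unsloth import FastLanguageModel

# Load the 4-bit base model; max_seq_length matches the training configuration below
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=120000,
    load_in_4bit=True,
)

# Attach LoRA adapters to the projection layers listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```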
---
## Dataset
The model was fine-tuned on the `French_MultiSpeaker_Diarization` dataset, hosted on the Hugging Face Hub (a loading snippet follows the list below):
- **Dataset Name**: [French_MultiSpeaker_Diarization](https://huggingface.co/datasets/olafdil/French_MultiSpeaker_Diarization)
- **Split Used**: Train
- **Dataset Content**:
  - Multi-speaker conversational data in French.
  - Labeled speaker information that serves as diarization supervision.
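The train split can be pulled directly from the Hub with the `datasets` library:
```python
from datasets import load_dataset

# Train split used for fine-tuning
dataset = load_dataset("olafdil/French_MultiSpeaker_Diarization", split="train")
print(dataset)  # inspect the column names before formatting
```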
---
## Training Configuration
### Hyperparameters
- **Max Sequence Length**: `120,000`
- **LoRA Dropout**: `0`
- **Bias**: `none`
- **Use Gradient Checkpointing**: Enabled for efficiency.
- **Custom Prompting**: prompts formatted with a chat template (the `llama-3.1` template), as sketched below.
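Continuing from the loading sketches above, the formatting step could look roughly like this. Note that `conversations` is a placeholder column name, not necessarily the one used in the dataset.
```python
from unsloth.chat_templates import get_chat_template

# Attach the llama-3.1 chat template to the tokenizer
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_example(example):
    # "conversations" is a placeholder column name; adapt it to the dataset's schema
    text = tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

dataset = dataset.map(format_example)
```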
### Training Workflow
1. **Model Loading**:
   - Loaded the base model with `FastLanguageModel.from_pretrained()`.
   - Applied 4-bit quantization for memory efficiency.
2. **Dataset Preparation**:
   - Tokenized the dataset with a chat template from `unsloth.chat_templates`.
   - Formatted prompts with `apply_chat_template()` to suit the diarization task.
3. **Fine-Tuning** (see the training sketch after this list):
   - Applied LoRA to the projection layers listed above.
   - Enabled gradient checkpointing to reduce memory overhead during training.
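A minimal training loop in that spirit, using `trl`'s `SFTTrainer` on the model, tokenizer, and formatted dataset from the sketches above. The batch size, learning rate, and step count are illustrative placeholders rather than the values used for this checkpoint, and newer `trl` releases expect `dataset_text_field`/`max_seq_length` inside an `SFTConfig` instead of as direct arguments.
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                 # LoRA-wrapped model from the setup sketch
    tokenizer=tokenizer,
    train_dataset=dataset,       # dataset with the formatted "text" column
    dataset_text_field="text",
    max_seq_length=120000,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,   # illustrative values, not the card's
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,
        logging_steps=1,
    ),
)
trainer.train()
```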
---
## Usage
### Load the Model
You can load this model directly from Hugging Face:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "olafdil/FrDiarization-Llama-3.1-8B-4bit"
# device_map="auto" places the 4-bit weights on the available GPU
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
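Because the checkpoint is stored in 4-bit bitsandbytes format, loading it this way generally requires a CUDA GPU with the `bitsandbytes` package installed.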
### Inference Example
```python
template = """
I have an audio transcription where multiple speakers are involved in a conversation.
Your task is to distinguish the different speakers and diarize the text accordingly.
Each speaker's dialogue should be clearly labeled, such as 'Speaker 1:', 'Speaker 2:', etc.
Ensure that the labels remain consistent throughout the transcription and that the text is formatted neatly.
Here's the transcription:
"""
transcription = "Your input transcription here"
prompt = template + transcription

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens is an illustrative cap; size it to the expected output length
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
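Because fine-tuning used the `llama-3.1` chat template, wrapping the same instruction in the chat format may match the training setup more closely. A sketch, reusing `template` and `transcription` from the example above (`max_new_tokens` is again an arbitrary placeholder):
```python
messages = [{"role": "user", "content": template + transcription}]

# Render the conversation with the model's chat template and tokenize it
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```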
---
## Dependencies
The following libraries were used:
- `transformers`
- `datasets`
- `unsloth`
- `torch`
To install the dependencies, you can use:
```bash
pip install transformers datasets torch unsloth
```
---
## Limitations
- The model has been fine-tuned specifically for French multi-speaker diarization tasks and may not generalize well to other tasks or languages.
- 4-bit quantization reduces memory usage but can slightly degrade output quality compared to full-precision weights.
---
## Citation
If you use this model, please consider citing the base model and the dataset:
- **Base Model**: Meta-Llama-3.1-8B-Instruct-bnb-4bit
- **Dataset**: French MultiSpeaker Diarization