---
license: mit
tags:
- unsloth
- trl
- sft
datasets:
- olafdil/French_MultiSpeaker_Diarization
language:
- fr
base_model:
- unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
pipeline_tag: text-generation
---
# Fine-Tuned Model: Meta-Llama-3.1-8B-Instruct-bnb-4bit
This is a fine-tuned version of the [Meta-Llama-3.1-8B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit) model, adapted for French multi-speaker diarization tasks. Below, you'll find details about the fine-tuning process, dataset, and how to use this model.
---
## Model Details
- **Base Model**: Meta-Llama-3.1-8B-Instruct-bnb-4bit
- **Quantization**: 4-bit quantization for reduced memory usage
- **Purpose**: Fine-tuned for multi-speaker diarization in French.
- **Techniques**:
  - LoRA (Low-Rank Adaptation) for efficient fine-tuning (see the setup sketch after this list).
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
  - Rank: `16`
  - LoRA alpha: `16`
  - Gradient checkpointing: enabled.
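The configuration above maps directly onto Unsloth's API. Below is a minimal, untested sketch of how such a setup could be reproduced; anything not stated in this card (dtype, random seed, etc.) is left at its library default.
```python
from unsloth import FastLanguageModel

# Load the 4-bit base model; max_seq_length matches the training configuration below
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=120000,
    load_in_4bit=True,
)

# Attach LoRA adapters to the projection layers listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```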
---
## Dataset
The model was fine-tuned on the `French_MultiSpeaker_Diarization` dataset, hosted on the Hugging Face Hub (a loading snippet follows the list below):
- **Dataset Name**: [French_MultiSpeaker_Diarization](https://huggingface.co/datasets/olafdil/French_MultiSpeaker_Diarization)
- **Split Used**: Train
- **Dataset Content**:
  - Multi-speaker conversational data in French.
  - Labeled speaker information that serves as diarization supervision.
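The train split can be pulled directly from the Hub with the `datasets` library:
```python
from datasets import load_dataset

# Train split used for fine-tuning
dataset = load_dataset("olafdil/French_MultiSpeaker_Diarization", split="train")
print(dataset)  # inspect the column names before formatting
```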
---
## Training Configuration
### Hyperparameters
- **Max Sequence Length**: `120,000`
- **LoRA Dropout**: `0`
- **Bias**: `none`
- **Use Gradient Checkpointing**: Enabled for efficiency.
- **Custom Prompting**: prompts formatted with a chat template (the `llama-3.1` template), as sketched below.
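Continuing from the loading sketches above, the formatting step could look roughly like this. Note that `conversations` is a placeholder column name, not necessarily the one used in the dataset.
```python
from unsloth.chat_templates import get_chat_template

# Attach the llama-3.1 chat template to the tokenizer
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_example(example):
    # "conversations" is a placeholder column name; adapt it to the dataset's schema
    text = tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

dataset = dataset.map(format_example)
```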
### Training Workflow
1. **Model Loading**:
   - Loaded the base model with `FastLanguageModel.from_pretrained()`.
   - Applied 4-bit quantization for memory efficiency.
2. **Dataset Preparation**:
   - Tokenized the dataset with a chat template from `unsloth.chat_templates`.
   - Formatted prompts with `apply_chat_template()` to suit the diarization task.
3. **Fine-Tuning** (see the training sketch after this list):
   - Applied LoRA to the projection layers listed above.
   - Enabled gradient checkpointing to reduce memory overhead during training.
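A minimal training loop in that spirit, using `trl`'s `SFTTrainer` on the model, tokenizer, and formatted dataset from the sketches above. The batch size, learning rate, and step count are illustrative placeholders rather than the values used for this checkpoint, and newer `trl` releases expect `dataset_text_field`/`max_seq_length` inside an `SFTConfig` instead of as direct arguments.
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                 # LoRA-wrapped model from the setup sketch
    tokenizer=tokenizer,
    train_dataset=dataset,       # dataset with the formatted "text" column
    dataset_text_field="text",
    max_seq_length=120000,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,   # illustrative values, not the card's
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,
        logging_steps=1,
    ),
)
trainer.train()
```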
---
## Usage
### Load the Model
You can load this model directly from Hugging Face:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "olafdil/FrDiarization-Llama-3.1-8B-4bit"
# device_map="auto" places the 4-bit weights on the available GPU
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
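Because the checkpoint is stored in 4-bit bitsandbytes format, loading it this way generally requires a CUDA GPU with the `bitsandbytes` package installed.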
### Inference Example
```python
template = """
I have an audio transcription where multiple speakers are involved in a conversation.
Your task is to distinguish the different speakers and diarize the text accordingly.
Each speaker's dialogue should be clearly labeled, such as 'Speaker 1:', 'Speaker 2:', etc.
Ensure that the labels remain consistent throughout the transcription and that the text is formatted neatly.
Here's the transcription:
"""
transcription = "Your input transcription here"
prompt = template + transcription

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens is an illustrative cap; size it to the expected output length
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
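Because fine-tuning used the `llama-3.1` chat template, wrapping the same instruction in the chat format may match the training setup more closely. A sketch, reusing `template` and `transcription` from the example above (`max_new_tokens` is again an arbitrary placeholder):
```python
messages = [{"role": "user", "content": template + transcription}]

# Render the conversation with the model's chat template and tokenize it
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```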
---
## Dependencies
The following libraries were used:
- `transformers`
- `datasets`
- `unsloth`
- `torch`
To install the dependencies, you can use:
```bash
pip install transformers datasets torch unsloth
```
---
## Limitations
- The model has been fine-tuned specifically for French multi-speaker diarization tasks and may not generalize well to other tasks or languages.
- 4-bit quantization reduces memory usage but can slightly degrade output quality compared to full-precision weights.
---
## Citation
If you use this model, please consider citing the base model and the dataset:
- **Base Model**: Meta-Llama-3.1-8B-Instruct-bnb-4bit
- **Dataset**: French MultiSpeaker Diarization