---
license: mit
datasets:
- Intel/orca_dpo_pairs
language:
- en
base_model:
- unsloth/Llama-3.2-3B-Instruct
pipeline_tag: question-answering
---

# Fine-tuned Language Model for Preference Optimization (DPO)

## Model Overview

This model is a fine-tuned version of Llama 3.2-3B-Instruct trained with Direct Preference Optimization (DPO) to align its responses with human preference data. It was optimized using memory-efficient techniques including 4-bit quantization, gradient checkpointing, and parameter-efficient fine-tuning (PEFT). The model is tailored for tasks requiring language comprehension, instruction-based response generation, and preference-based ranking of responses.

## Model Details

- **Base Model:** `unsloth/Llama-3.2-3B-Instruct`
- **Fine-Tuning Objective:** Direct Preference Optimization (DPO) using pairs of chosen and rejected responses.
- **Training Framework:** Built on Unsloth with integration into Hugging Face Datasets and Transformers.
- **Quantization:** 4-bit quantization for reduced memory usage, suitable for low-VRAM devices.
- **Optimizations:** Gradient checkpointing for improved memory efficiency during training, combined with parameter-efficient fine-tuning (PEFT) via LoRA (Low-Rank Adaptation).
- **Training Data:** Trained on the Intel/orca_dpo_pairs dataset, which pairs prompts with chosen and rejected responses for preference-based learning.

## Model Capabilities

- **Text Generation:** Generates detailed, coherent responses to instructions and prompts.
- **Preference-Based Optimization:** Fine-tuned to favor chosen responses over rejected ones, as labeled in the preference data.
- **Long Contexts:** Handles sequences of up to 2048 tokens (the fine-tuning sequence length), with RoPE scaling handled internally.
- **Faster Inference:** Optimized for real-time text generation with streaming support and low-latency responses.

## Intended Use

This model can be applied to various natural language processing (NLP) tasks, including:

- **Question Answering:** Responding to user queries with detailed and contextually accurate information.
- **Instruction Following:** Generating responses based on user-defined tasks.
- **Preference Modeling:** Ranking alternative responses according to the preferences expressed in the training data.
- **Text Completion:** Completing partially given texts based on provided instructions.

## Limitations

- **Context Length:** Inputs longer than 2048 tokens must be truncated or otherwise adapted.
- **Precision:** 4-bit quantization may introduce minor precision loss in edge cases that demand high numerical accuracy.
- **Dataset Bias:** The model reflects biases present in the preference-pair labels of the training dataset.

## Technical Details

- **Model Architecture:** Llama 3.2 with 3 billion parameters.
- **Training Method:** Direct Preference Optimization (DPO).
- **Optimizer:** 8-bit AdamW for memory efficiency.
- **Batch Size:** Effective batch size of 8 (2 per device with 4-step gradient accumulation).
- **Training Configuration:**
  - Learning rate: 5e-6
  - Warm-up ratio: 0.1
  - Epochs: 1
  - Max sequence length: 2048 tokens
- **Mixed Precision Training:** FP16 or BFloat16, depending on hardware support.

## Usage Instructions

### Install Dependencies

Ensure `torch`, `transformers`, `unsloth`, and the other required libraries are installed for inference and fine-tuning.
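For example, a minimal setup might look like the line below. The package list is an assumption rather than a pinned requirements file (`trl` and `datasets` are included because the fine-tuning sketch at the end of this card relies on them); adjust it to your CUDA and hardware setup.

```bash
pip install torch transformers datasets trl unsloth
```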
### Load Pretrained Model

Load the model with `FastLanguageModel.from_pretrained()`, specifying the model name and optimization settings such as sequence length and 4-bit loading (see the loading and inference sketch at the end of this card).

### Fine-Tuning

Apply PEFT (LoRA adapters) together with 4-bit quantization and gradient checkpointing, and train on the dataset of preference pairs (a hedged configuration sketch is provided at the end of this card).

### Inference

Use `FastLanguageModel.for_inference()` to enable optimized text generation, including streaming inference for real-time output, as shown in the loading and inference sketch at the end of this card.

## Performance Metrics

- **Training Loss:** 1.19
- **Training Runtime:** 1974.06 seconds (approximately 33 minutes)
- **Steps Per Second:** 0.063
- **Samples Per Second:** 0.507

## Model Version

- **Version:** Unsloth 2025.1.7 (patched)
- **Training Date:** January 2025

## Acknowledgements

This model was trained using the Unsloth framework, with the dataset contributed by Intel and tooling from Hugging Face.

## Notebook

Access the implementation notebook for this model [here](https://github.com/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/fine_tuning_llama_3_2_3b_dpo_peft.ipynb). The notebook walks through fine-tuning and deploying the model step by step.
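## Example: DPO Fine-Tuning Sketch

The sketch below mirrors the configuration listed under Technical Details (4-bit loading, LoRA adapters, gradient checkpointing, 8-bit AdamW, learning rate 5e-6, warm-up ratio 0.1, one epoch, effective batch size 8), but it is not the exact training script. It assumes TRL's `DPOTrainer`/`DPOConfig` API (argument names vary between TRL versions), and the LoRA rank/alpha, DPO `beta`, prompt construction, and output directory are illustrative choices rather than values confirmed by this card.

```python
# Hedged sketch of the DPO fine-tuning setup described in this card.
# The TRL argument names below follow recent DPOTrainer/DPOConfig releases and may
# differ in older versions; LoRA rank/alpha and the DPO beta are illustrative values.
import torch
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

max_seq_length = 2048

# Load the base model in 4-bit to fit low-VRAM GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,  # auto-select FP16 or BF16 based on hardware
)

# Attach LoRA adapters (PEFT) and enable gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # illustrative LoRA rank
    lora_alpha=16,  # illustrative scaling factor
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Map Intel/orca_dpo_pairs columns (system/question/chosen/rejected)
# to the prompt/chosen/rejected format expected by DPOTrainer.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.map(
    lambda row: {"prompt": row["system"] + "\n\n" + row["question"]},
    remove_columns=["system", "question"],
)

# Training arguments matching the card: effective batch size 8 (2 x 4),
# learning rate 5e-6, warm-up ratio 0.1, one epoch, 8-bit AdamW.
args = DPOConfig(
    output_dir="llama-3.2-3b-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="adamw_8bit",
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    beta=0.1,  # illustrative DPO temperature
    max_length=max_seq_length,
    max_prompt_length=1024,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with LoRA, the frozen base weights serve as the reference
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```

The reference model is left as `None` because, when training LoRA adapters, the trainer can score the frozen base weights by temporarily disabling the adapters, avoiding a second full copy of the model in memory.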
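## Example: Loading and Inference Sketch

A minimal loading and streaming-generation sketch. It assumes the fine-tuned weights are hosted under a placeholder repository ID and that the tokenizer ships Llama 3.2's chat template; swap in the actual model ID before running.

```python
# Hedged sketch of loading the fine-tuned model and running streaming inference.
# "YOUR_USERNAME/llama-3.2-3b-dpo" is a placeholder, not the actual repository name.
from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/llama-3.2-3b-dpo",  # placeholder model ID
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference path

# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=input_ids,
    streamer=streamer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
```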