---
license: mit
datasets:
- Intel/orca_dpo_pairs
language:
- en
base_model:
- unsloth/Llama-3.2-3B-Instruct
pipeline_tag: question-answering
---

# Fine-tuned Language Model for Preference Optimization (DPO)

## Model Overview

This model is a fine-tuned version of Llama 3.2-3B-Instruct trained with Direct Preference Optimization (DPO) to align its responses with human preference data. It was optimized using memory-efficient techniques including 4-bit quantization, gradient checkpointing, and parameter-efficient fine-tuning (PEFT). The model is tailored for tasks requiring language comprehension, instruction-based response generation, and preference-based ranking of responses.

## Model Details

- **Base Model:** `unsloth/Llama-3.2-3B-Instruct`
- **Fine-Tuning Objective:** Direct Preference Optimization (DPO) using pairs of chosen and rejected responses.
- **Training Framework:** Built on Unsloth with integration into Hugging Face Datasets and Transformers.
- **Quantization:** 4-bit quantization for reduced memory usage, suitable for low-VRAM devices.
- **Optimizations:** Gradient checkpointing for improved memory efficiency during training, combined with parameter-efficient fine-tuning (PEFT) via LoRA (Low-Rank Adaptation).
- **Training Data:** Trained on the Intel/orca_dpo_pairs dataset, which pairs prompts with chosen and rejected responses for preference-based learning.

## Model Capabilities

- **Text Generation:** Generates detailed, coherent responses to instructions and prompts.
- **Preference-Based Optimization:** Fine-tuned to favor chosen responses over rejected ones, as labeled in the preference data.
- **Long Contexts:** Handles sequences of up to 2048 tokens (the fine-tuning sequence length), with RoPE scaling handled internally.
- **Faster Inference:** Optimized for real-time text generation with streaming support and low-latency responses.

## Intended Use

This model can be applied to various natural language processing (NLP) tasks, including:

- **Question Answering:** Responding to user queries with detailed and contextually accurate information.
- **Instruction Following:** Generating responses based on user-defined tasks.
- **Preference Modeling:** Ranking alternative responses according to the preferences expressed in the training data.
- **Text Completion:** Completing partially given texts based on provided instructions.

## Limitations

- **Context Length:** Inputs longer than 2048 tokens must be truncated or otherwise adapted.
- **Precision:** 4-bit quantization may introduce minor precision loss in edge cases that demand high numerical accuracy.
- **Dataset Bias:** The model reflects biases present in the preference-pair labels of the training dataset.

## Technical Details

- **Model Architecture:** Llama 3.2 with 3 billion parameters.
- **Training Method:** Direct Preference Optimization (DPO).
- **Optimizer:** 8-bit AdamW for memory efficiency.
- **Batch Size:** Effective batch size of 8 (2 per device with 4-step gradient accumulation).
- **Training Configuration:**
  - Learning rate: 5e-6
  - Warm-up ratio: 0.1
  - Epochs: 1
  - Max sequence length: 2048 tokens
- **Mixed Precision Training:** FP16 or BFloat16, depending on hardware support.

## Usage Instructions

### Install Dependencies

Ensure `torch`, `transformers`, `unsloth`, and the other required libraries are installed for inference and fine-tuning.
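For example, a minimal setup might look like the line below. The package list is an assumption rather than a pinned requirements file (`trl` and `datasets` are included because the fine-tuning sketch at the end of this card relies on them); adjust it to your CUDA and hardware setup.

```bash
pip install torch transformers datasets trl unsloth
```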
### Load Pretrained Model

Load the model with `FastLanguageModel.from_pretrained()`, specifying the model name and optimization settings such as sequence length and 4-bit loading (see the loading and inference sketch at the end of this card).

### Fine-Tuning

Apply PEFT (LoRA adapters) together with 4-bit quantization and gradient checkpointing, and train on the dataset of preference pairs (a hedged configuration sketch is provided at the end of this card).

### Inference

Use `FastLanguageModel.for_inference()` to enable optimized text generation, including streaming inference for real-time output, as shown in the loading and inference sketch at the end of this card.

## Performance Metrics

- **Training Loss:** 1.19
- **Training Runtime:** 1974.06 seconds (approximately 33 minutes)
- **Steps Per Second:** 0.063
- **Samples Per Second:** 0.507

## Model Version

- **Version:** Unsloth 2025.1.7 (patched)
- **Training Date:** January 2025

## Acknowledgements

This model was trained using the Unsloth framework, with the dataset contributed by Intel and tooling from Hugging Face.

## Notebook

Access the implementation notebook for this model [here](https://github.com/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/fine_tuning_llama_3_2_3b_dpo_peft.ipynb). The notebook walks through fine-tuning and deploying the model step by step.
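## Example: DPO Fine-Tuning Sketch

The sketch below mirrors the configuration listed under Technical Details (4-bit loading, LoRA adapters, gradient checkpointing, 8-bit AdamW, learning rate 5e-6, warm-up ratio 0.1, one epoch, effective batch size 8), but it is not the exact training script. It assumes TRL's `DPOTrainer`/`DPOConfig` API (argument names vary between TRL versions), and the LoRA rank/alpha, DPO `beta`, prompt construction, and output directory are illustrative choices rather than values confirmed by this card.

```python
# Hedged sketch of the DPO fine-tuning setup described in this card.
# The TRL argument names below follow recent DPOTrainer/DPOConfig releases and may
# differ in older versions; LoRA rank/alpha and the DPO beta are illustrative values.
import torch
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

max_seq_length = 2048

# Load the base model in 4-bit to fit low-VRAM GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,  # auto-select FP16 or BF16 based on hardware
)

# Attach LoRA adapters (PEFT) and enable gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # illustrative LoRA rank
    lora_alpha=16,  # illustrative scaling factor
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Map Intel/orca_dpo_pairs columns (system/question/chosen/rejected)
# to the prompt/chosen/rejected format expected by DPOTrainer.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.map(
    lambda row: {"prompt": row["system"] + "\n\n" + row["question"]},
    remove_columns=["system", "question"],
)

# Training arguments matching the card: effective batch size 8 (2 x 4),
# learning rate 5e-6, warm-up ratio 0.1, one epoch, 8-bit AdamW.
args = DPOConfig(
    output_dir="llama-3.2-3b-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="adamw_8bit",
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    beta=0.1,  # illustrative DPO temperature
    max_length=max_seq_length,
    max_prompt_length=1024,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with LoRA, the frozen base weights serve as the reference
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```

The reference model is left as `None` because, when training LoRA adapters, the trainer can score the frozen base weights by temporarily disabling the adapters, avoiding a second full copy of the model in memory.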
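## Example: Loading and Inference Sketch

A minimal loading and streaming-generation sketch. It assumes the fine-tuned weights are hosted under a placeholder repository ID and that the tokenizer ships Llama 3.2's chat template; swap in the actual model ID before running.

```python
# Hedged sketch of loading the fine-tuned model and running streaming inference.
# "YOUR_USERNAME/llama-3.2-3b-dpo" is a placeholder, not the actual repository name.
from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/llama-3.2-3b-dpo",  # placeholder model ID
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference path

# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=input_ids,
    streamer=streamer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
```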