---
license: cc-by-nd-4.0
language:
- es
base_model:
- Qwen/Qwen2.5-0.5B
---

# Zyphen-CO-Legal-Autocomplete Micro

## Overview

**Zyphen-CO-Legal-Autocomplete Micro** is an experimental language model fine-tuned to assist legal professionals with autocomplete suggestions. It was trained on data from 100 Colombian jurisdictions and aims to improve efficiency and accuracy in legal documentation tasks. The current training dataset is limited; as testing continues, additional data sources will be added to broaden coverage and improve reliability.

## Features

- **Experimental Model**: Currently in the testing phase, with foundational training on 100 Colombian jurisdictions.
- **Domain-Specific Expertise**: Tailored to the Colombian legal framework for relevance and precision in legal contexts.
- **Efficient Inference**: Optimized with LoRA adapters and 4-bit quantization to reduce memory usage and speed up responses.
- **Scalable Architecture**: Designed to handle long legal documents, with support for up to 30,000 tokens of context.
- **Seamless Integration**: Compatible with a range of applications and services, so it can be embedded into existing legal workflows.
- **Spanish-Language Support**: Understands and generates content in Spanish.

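
To give a rough sense of why 4-bit quantization matters for memory, the sketch below estimates the size of the weights alone at different precisions. It assumes a parameter count of 0.5e9, rounded from the "0.5B" in the base model name, and it ignores quantization metadata, activations, and the KV cache, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GiB."""
    return num_params * bits_per_param / 8 / (1024 ** 3)


# Assumption: rounded from the "0.5B" in the base model name
NUM_PARAMS = 0.5e9

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit weights take one quarter of the fp16 footprint.
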
## Installation

Ensure you have the necessary libraries installed:

```bash
pip install transformers huggingface_hub torch
```

## Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "AcropolisLabs/Zyphen-CO-Legal-Autocomplete-micro"

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Move the model to the GPU for faster inference (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


def generate_autocomplete(prompt, max_new_tokens=100):
    """Return the prompt followed by a sampled continuation."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
        )
    # The decoded text contains the prompt plus the generated tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example usage. The prompt is Spanish for:
# "In this ruling of the Supreme Court of Justice it was ordered that"
prompt = "En el presente fallo de la Corte Suprema de Justicia se dispuso que"
suggestion = generate_autocomplete(prompt)
print("Autocomplete Suggestion:", suggestion)
```

## Data Preparation

The model was trained on a curated dataset of legal judgments from 100 Colombian jurisdictions. The data went through the following preprocessing steps:

1. **Filtering**: Excluding unavailable or irrelevant content to ensure data quality.
2. **Chunking**: Splitting long texts into manageable segments, each with an end-of-sequence token appended to support coherent text generation.
3. **Tokenization**: Converting text into tokens using the `unsloth/Qwen2.5-0.5B` tokenizer.

**Future Plans**: As testing progresses, the dataset will be expanded to include additional jurisdictions and more comprehensive legal documents, improving the model's robustness and applicability.

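
The chunking step described above can be illustrated with a minimal sketch. The actual chunk size, and whether splitting happened at the token or the character level, are not stated here, so the function `chunk_with_eos`, its parameters, and the toy token ids below are all hypothetical.

```python
from typing import List


def chunk_with_eos(token_ids: List[int], chunk_size: int, eos_id: int) -> List[List[int]]:
    """Split token_ids into pieces of at most chunk_size tokens and append eos_id to each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        piece = token_ids[start:start + chunk_size]
        chunks.append(piece + [eos_id])
    return chunks


# Toy example: ten token ids, chunks of four, and a made-up EOS id of 0
example_chunks = chunk_with_eos(list(range(1, 11)), chunk_size=4, eos_id=0)
print(example_chunks)  # [[1, 2, 3, 4, 0], [5, 6, 7, 8, 0], [9, 10, 0]]
```

In a real pipeline the ids would come from the tokenizer and `eos_id` would be its `eos_token_id`; the toy integers only show the shape of the result.
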
## Fine-Tuning Details

- **Base Model**: `unsloth/Qwen2.5-0.5B`
- **Adapter Method**: LoRA (Low-Rank Adaptation) with the following configuration:
  - **Rank (`r`)**: 16
  - **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  - **Alpha (`lora_alpha`)**: 16
  - **Dropout (`lora_dropout`)**: 0
- **Training Parameters**:
  - **Batch Size**: 1
  - **Gradient Accumulation Steps**: 4
  - **Learning Rate**: 2e-4
  - **Optimizer**: AdamW with 8-bit precision
  - **Weight Decay**: 0.01
  - **Scheduler**: Linear
  - **Total Training Steps**: 500

## Model Evaluation

**Zyphen-CO-Legal-Autocomplete** has undergone preliminary evaluations of its ability to generate contextually relevant and legally accurate autocomplete suggestions within the scope of the 100 jurisdictions it was trained on. Feedback from legal professionals indicates promising utility, and ongoing assessments aim to identify areas for improvement as the dataset expands.

## Usage Guidelines

To integrate **Zyphen-CO-Legal-Autocomplete** into your applications:

1. **Import the Model and Tokenizer**: As demonstrated in the [Loading the Model](#loading-the-model) section.
2. **Generate Autocomplete Suggestions**: Call the `generate_autocomplete` function with appropriate legal prompts.
3. **Integrate with User Interfaces**: Embed the autocomplete functionality in your legal software tools to enhance productivity.

## Contributions

Contributions are highly valued! If you have suggestions or improvements, or if you encounter any issues, please submit an issue or a pull request. Ensure that your contributions align with the project's focus on legal domain expertise and efficiency.

## Contact

For inquiries or support, please contact [support@acropolisis.com](mailto:support@acropolisia.com).

## Acknowledgements

- **Unsloth**: For their efficient model optimization techniques.
- **Hugging Face**: For providing robust tools and an excellent community.
- **TRL**: For their insightful fine-tuning methodologies.
- **Colombian Legal Institutions**: For providing comprehensive legal data essential for training.

## License

cc-by-nd-4.0