---
library_name: transformers
tags:
- language
- detection
- classification
license: mit
datasets:
- hac541309/open-lid-dataset
pipeline_tag: text-classification
---

# Language Detection Model

This project trains a **BERT-based language detection model** on the Hugging Face [`hac541309/open-lid-dataset`](https://huggingface.co/datasets/hac541309/open-lid-dataset), which contains **121 million sentences across 200 languages**. The trained model is designed for **fast and accurate language identification** in text classification tasks.

## 📌 Model Details

- **Architecture**: `BertForSequenceClassification`
- **Hidden Size**: `384`
- **Layers**: `4`
- **Attention Heads**: `6`
- **Max Sequence Length**: `512`
- **Dropout**: `0.1`
- **Vocabulary Size**: `50,257`

A configuration sketch with these dimensions is shown in the example section below.

## 🚀 Training Process

- **Dataset**: Preprocessed and split into **train (90%)** and **test (10%)** sets.
- **Tokenizer**: Custom `PreTrainedTokenizerFast` for text tokenization.
- **Evaluation Metrics**: Tracked with a `compute_metrics` function.
- **Hyperparameters**:
  - Learning Rate: `2e-5`
  - Batch Size: `256` (train) / `512` (test)
  - Epochs: `1`
  - Scheduler: `cosine`
- **Trainer**: Uses the Hugging Face `Trainer` API with `wandb` logging (see the training sketch below).

## 📊 Evaluation Results

The model was evaluated on a separate held-out test set; the full results are available in the accompanying Weights & Biases report:

https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
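
## 🧩 Example: Model Configuration

A minimal sketch of a configuration matching the dimensions listed in Model Details. The values come from that section; `num_labels=200` is an assumption based on the dataset's 200-language coverage, and any setting not listed on this card (e.g., intermediate size) falls back to the `BertConfig` defaults.

```python
from transformers import BertConfig, BertForSequenceClassification

# Dimensions from the Model Details section above.
# num_labels=200 is an assumption based on the dataset's 200 languages;
# unlisted settings fall back to BertConfig defaults.
config = BertConfig(
    vocab_size=50257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,
)

model = BertForSequenceClassification(config)
print(f"Parameters: {model.num_parameters():,}")
```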
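
## ⚙️ Example: Training Setup

A sketch of how the training run could be wired up with the hyperparameters listed above. The `output_dir` name and the accuracy-only `compute_metrics` are illustrative stand-ins, and `train_dataset`, `test_dataset`, and `tokenizer` are assumed to come from the preprocessing described in Training Process, which is not shown on this card.

```python
import numpy as np
from transformers import Trainer, TrainingArguments

# Illustrative stand-in for the card's compute_metrics function.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Hyperparameters from the Training Process section; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="bert-language-detection",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",
)

# train_dataset / test_dataset are assumed to be the tokenized 90/10 splits,
# and tokenizer the custom PreTrainedTokenizerFast described above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```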
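
## 🔍 Example: Inference

Once a trained checkpoint is published, inference is a one-liner with the `text-classification` pipeline. The repo id below is a placeholder, since this card does not name the final checkpoint, and the exact label strings depend on how the 200 language classes were encoded during training.

```python
from transformers import pipeline

# "your-username/bert-language-detection" is a placeholder repo id.
detector = pipeline("text-classification", model="your-username/bert-language-detection")

print(detector("Bonjour tout le monde !"))
# Expected output shape (the label string depends on the label encoding):
# [{'label': 'fra_Latn', 'score': 0.99}]
```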