---
library_name: transformers
tags:
- language
- detection
- classification
license: mit
datasets:
- hac541309/open-lid-dataset
pipeline_tag: text-classification
---

# Language Detection Model

This project trains a **BERT-based language detection model** on the Hugging Face [`hac541309/open-lid-dataset`](https://huggingface.co/datasets/hac541309/open-lid-dataset), which contains **121 million sentences across 200 languages**. The trained model is designed for **fast and accurate language identification** in text classification tasks.

## 📌 Model Details

- **Architecture**: `BertForSequenceClassification`
- **Hidden Size**: `384`
- **Layers**: `4`
- **Attention Heads**: `6`
- **Max Sequence Length**: `512`
- **Dropout**: `0.1`
- **Vocabulary Size**: `50,257`

A configuration sketch with these dimensions is shown in the example section below.

## 🚀 Training Process

- **Dataset**: Preprocessed and split into **train (90%)** and **test (10%)** sets.
- **Tokenizer**: Custom `PreTrainedTokenizerFast` for text tokenization.
- **Evaluation Metrics**: Tracked with a `compute_metrics` function.
- **Hyperparameters**:
  - Learning Rate: `2e-5`
  - Batch Size: `256` (train) / `512` (test)
  - Epochs: `1`
  - Scheduler: `cosine`
- **Trainer**: Uses the Hugging Face `Trainer` API with `wandb` logging (see the training sketch below).

## 📊 Evaluation Results

The model was evaluated on a separate held-out test set; the full results are available in the accompanying Weights & Biases report:

https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
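
## 🧩 Example: Model Configuration

A minimal sketch of a configuration matching the dimensions listed in Model Details. The values come from that section; `num_labels=200` is an assumption based on the dataset's 200-language coverage, and any setting not listed on this card (e.g., intermediate size) falls back to the `BertConfig` defaults.

```python
from transformers import BertConfig, BertForSequenceClassification

# Dimensions from the Model Details section above.
# num_labels=200 is an assumption based on the dataset's 200 languages;
# unlisted settings fall back to BertConfig defaults.
config = BertConfig(
    vocab_size=50257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,
)

model = BertForSequenceClassification(config)
print(f"Parameters: {model.num_parameters():,}")
```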
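
## ⚙️ Example: Training Setup

A sketch of how the training run could be wired up with the hyperparameters listed above. The `output_dir` name and the accuracy-only `compute_metrics` are illustrative stand-ins, and `train_dataset`, `test_dataset`, and `tokenizer` are assumed to come from the preprocessing described in Training Process, which is not shown on this card.

```python
import numpy as np
from transformers import Trainer, TrainingArguments

# Illustrative stand-in for the card's compute_metrics function.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Hyperparameters from the Training Process section; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="bert-language-detection",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",
)

# train_dataset / test_dataset are assumed to be the tokenized 90/10 splits,
# and tokenizer the custom PreTrainedTokenizerFast described above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```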
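
## 🔍 Example: Inference

Once a trained checkpoint is published, inference is a one-liner with the `text-classification` pipeline. The repo id below is a placeholder, since this card does not name the final checkpoint, and the exact label strings depend on how the 200 language classes were encoded during training.

```python
from transformers import pipeline

# "your-username/bert-language-detection" is a placeholder repo id.
detector = pipeline("text-classification", model="your-username/bert-language-detection")

print(detector("Bonjour tout le monde !"))
# Expected output shape (the label string depends on the label encoding):
# [{'label': 'fra_Latn', 'score': 0.99}]
```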