---
license: apache-2.0
datasets:
- arbml/SANAD
language:
- ar
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- arabic
---

# ModernBERT Arabic Model Card

## Overview

This model is an Arabic adaptation of ModernBERT, a modernized bidirectional encoder-only Transformer (BERT-style). The base ModernBERT model was pre-trained on 2 trillion tokens of English text and code, with a native context length of up to 8,192 tokens. For details on the base model, see [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

This proof of concept uses a tokenizer trained on Arabic Wikipedia:

- **Dataset:** Arabic Wikipedia
- **Size:** 1.8 GB
- **Tokens:** 228,788,529

The model demonstrates how ModernBERT can be adapted to Arabic for tasks such as topic classification.

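To illustrate the tokenizer step, here is a minimal sketch of how a new tokenizer could be trained on Arabic text while reusing the base model's tokenizer configuration. It is not necessarily the procedure used for this model: the helper names `batch_iterator` and `train_arabic_tokenizer`, the batch size, and the `vocab_size` value are hypothetical, and the corpus is assumed to be an iterable of plain-text strings. It relies on `train_new_from_iterator`, which `transformers` provides for fast tokenizers.

```python
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` non-empty, stripped texts."""
    batch: List[str] = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip blank lines
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly smaller, batch


def train_arabic_tokenizer(texts: Iterable[str], vocab_size: int = 50_000):
    """Train a new tokenizer that reuses the base tokenizer's pipeline.

    `vocab_size` is a hypothetical value, not one taken from this model card.
    """
    # Imported lazily so the batching helper works without transformers installed.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    return base.train_new_from_iterator(batch_iterator(texts), vocab_size=vocab_size)
```

Calling `train_arabic_tokenizer` on the lines of a Wikipedia dump and saving the result with `save_pretrained` would produce a tokenizer directory that can be loaded alongside a model.
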
## Model Details

- **Epochs:** 3
- **Evaluation Metrics:**
  - **F1 Score:** 0.9587811491105839
  - **Loss:** 0.19986020028591156
  - **Runtime:** 46.4942 seconds
  - **Samples per second:** 305.006
  - **Steps per second:** 38.134
- **Training Step:** 47,862

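The runtime and throughput figures above are consistent with one another, and multiplying them gives a rough sense of how much data the run covered. The sketch below does that arithmetic. It assumes the three values are evaluation-time statistics of the kind the `transformers` `Trainer` reports, which the card does not state explicitly, and the implied batch size is an inference rather than a documented setting.

```python
# Figures copied from the Model Details list.
runtime_s = 46.4942
samples_per_s = 305.006
steps_per_s = 38.134

# Throughput multiplied by runtime approximates the totals processed in the run.
approx_samples = round(runtime_s * samples_per_s)
approx_steps = round(runtime_s * steps_per_s)

# Samples per step approximates the batch size (inferred, not documented).
approx_batch_size = round(samples_per_s / steps_per_s)

print(approx_samples, approx_steps, approx_batch_size)
```

Under these assumptions the run works out to roughly 14,181 samples in 1,773 steps, or about 8 samples per step.
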
## How to Use

The model can be used for text classification through the `transformers` `pipeline` API, as in the example below:

```python
from transformers import pipeline

# Load the fine-tuned classifier. The identifier below looks like a training
# checkpoint path; substitute the model's Hub repository ID or a local
# checkpoint directory as appropriate.
classifier = pipeline(
    task="text-classification",
    model="ModernBERT-domain-classifier/checkpoint-47862",
)

# Arabic sample text (roughly: "A number of expatriates arriving in the
# Kingdom of Saudi Arabia converted to Islam").
sample = '''
اسلام عددا من الوافدين الى الممكلة العربية السعوديه
'''

classifier(sample)
# [{'label': 'health', 'score': 0.6779336333274841}]
```