---
license: apache-2.0
datasets:
- arbml/SANAD
language:
- ar
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- arabic
---
# ModernBERT Arabic Model Card

## Overview
This is an Arabic version of ModernBERT, a modernized bidirectional encoder-only Transformer model (BERT-style). ModernBERT was pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. You can find more about the base model at [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
For this proof of concept, a tokenizer trained on Arabic Wikipedia was utilized (a sketch of this step follows below):
- Dataset: Arabic Wikipedia
- Size: 1.8 GB
- Tokens: 228,788,529
This model demonstrates how ModernBERT can be adapted to Arabic for tasks like topic classification.
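As a rough illustration of the tokenizer step, the snippet below retrains the ModernBERT tokenizer on an Arabic Wikipedia dump using `train_new_from_iterator`. The dataset identifier (`wikimedia/wikipedia`, `20231101.ar`), the vocabulary size, and the output path are assumptions for this sketch, not the exact configuration used for this model.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from the original (English) ModernBERT tokenizer.
base_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# Hypothetical source: an Arabic Wikipedia dump hosted on the Hugging Face Hub.
wiki = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")

def text_iterator(batch_size=1000):
    # Yield batches of raw article text for tokenizer training.
    for i in range(0, len(wiki), batch_size):
        yield wiki[i : i + batch_size]["text"]

# Learn a new Arabic vocabulary while keeping the original tokenization
# algorithm and special tokens. 50,368 matches ModernBERT's original vocab size.
arabic_tokenizer = base_tokenizer.train_new_from_iterator(text_iterator(), vocab_size=50368)
arabic_tokenizer.save_pretrained("modernbert-arabic-tokenizer")
```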
## Model Details
- Epochs: 3
- Evaluation Metrics (see the fine-tuning sketch below):
  - F1 Score: 0.9588
  - Loss: 0.1999
  - Runtime: 46.49 seconds
  - Samples per second: 305.006
  - Steps per second: 38.134
- Training Step: 47,862
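The metrics above come from fine-tuning for topic classification on the SANAD corpus listed in the metadata. The snippet below is a hedged sketch of such a run with the `Trainer` API; the split handling, column names (`text`/`label`), tokenizer path, and hyperparameters are assumptions, not the exact training recipe behind this checkpoint.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load SANAD and carve out an evaluation split (split/column names are assumptions).
raw = load_dataset("arbml/SANAD", split="train")
splits = raw.train_test_split(test_size=0.1, seed=42)

# Hypothetical path to the Arabic tokenizer trained in the previous sketch.
tokenizer = AutoTokenizer.from_pretrained("modernbert-arabic-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = splits.map(tokenize, batched=True)
num_labels = splits["train"].features["label"].num_classes  # assumes a ClassLabel column

model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=num_labels
)
model.resize_token_embeddings(len(tokenizer))  # align embeddings with the new Arabic vocabulary

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ModernBERT-domain-classifier",
        num_train_epochs=3,
        per_device_train_batch_size=32,
        eval_strategy="epoch",
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```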
## How to Use

The model can be used for text classification with the `transformers` library. Below is an example:
```python
from transformers import pipeline

# Load the fine-tuned classifier from the training checkpoint
classifier = pipeline(
    task="text-classification",
    model="ModernBERT-domain-classifier/checkpoint-47862",
)

# "A number of newcomers to the Kingdom of Saudi Arabia converted to Islam"
sample = '''
اسلام عددا من الوافدين الى الممكلة العربية السعوديه
'''

classifier(sample)
# [{'label': 'health', 'score': 0.6779336333274841}]
```
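If you need raw logits or batched inference, the same checkpoint can also be loaded explicitly. The checkpoint path mirrors the pipeline example above; the input sentence is just a hypothetical Arabic headline.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ModernBERT-domain-classifier/checkpoint-47862"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# "The national team won the Asian Cup final" (hypothetical sports headline)
texts = ["فاز المنتخب الوطني بنهائي كأس آسيا"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label name.
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
print(labels)
```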