---
license: apache-2.0
datasets:
- arbml/SANAD
language:
- ar
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- arabic
---


# ModernBERT Arabic Model Card

## Overview
This model is an Arabic adaptation of ModernBERT, a modernized bidirectional encoder-only Transformer (BERT-style). The base ModernBERT model was pre-trained on 2 trillion tokens of English text and code, with a native context length of up to 8,192 tokens. For details on the base model, see [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

For this proof of concept, the model uses a tokenizer trained on Arabic Wikipedia:
- **Dataset:** Arabic Wikipedia
- **Size:** 1.8 GB
- **Tokens:** 228,788,529

This model demonstrates how ModernBERT can be adapted to Arabic for tasks like topic classification.

## Model Details
- **Epochs:** 3
- **Evaluation Metrics:**
  - **F1 Score:** 0.9587811491105839
  - **Loss:** 0.19986020028591156
  - **Runtime:** 46.4942 seconds
  - **Samples per second:** 305.006
  - **Steps per second:** 38.134
- **Training Step:** 47,862
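
The throughput figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming the runtime and per-second rates all describe the same evaluation pass; the variable names are illustrative, and the evaluation set size and batch size it estimates are not stated anywhere in this card.

```python
# Values copied from the evaluation metrics reported above.
eval_runtime_s = 46.4942
samples_per_second = 305.006
steps_per_second = 38.134

# Throughput multiplied by runtime gives the approximate totals.
eval_samples = round(eval_runtime_s * samples_per_second)
eval_steps = round(eval_runtime_s * steps_per_second)

# Average samples per step: an estimate of the evaluation batch size
# (an inference from the numbers, not a documented setting).
approx_batch_size = eval_samples / eval_steps

print(eval_samples, eval_steps, round(approx_batch_size))
```

Under these assumptions the figures work out to roughly 14,181 evaluation samples over 1,773 steps, or about 8 samples per step.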

## How to Use
The model can be used for text classification using the `transformers` library. Below is an example:

```python
from transformers import pipeline

# Load the model by its Hub repository ID or a local checkpoint path
classifier = pipeline(
    task="text-classification",
    model="ModernBERT-domain-classifier/checkpoint-47862",
)

sample = '''
اسلام عددا من الوافدين الى الممكلة العربية السعوديه
'''

classifier(sample)
# [{'label': 'health', 'score': 0.6779336333274841}]
```