NAMAA-Space/AraModernBert-Topic-Classifier

	---
	license: apache-2.0
	datasets:
	- arbml/SANAD
	language:
	- ar
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: text-classification
	library_name: transformers
	tags:
	- modernbert
	- arabic
	---


	# ModernBERT Arabic Model Card

	## Overview
	This is an Arabic adaptation of ModernBERT, a modernized bidirectional encoder-only Transformer (BERT-style). The base ModernBERT model was pre-trained on 2 trillion tokens of English text and code, with a native context length of up to 8,192 tokens. For more on the base model, see [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

	This proof of concept uses a tokenizer trained on Arabic Wikipedia:
	- Dataset: Arabic Wikipedia
	- Size: 1.8 GB
	- Tokens: 228,788,529

	This model demonstrates how ModernBERT can be adapted to Arabic for tasks such as topic classification. The card metadata lists `arbml/SANAD` as the dataset.

	## Model Details
	- Epochs: 3
	- Evaluation Metrics:
	  - F1 Score: 0.9587811491105839
	  - Loss: 0.19986020028591156
	  - Runtime: 46.4942 seconds
	  - Samples per second: 305.006
	  - Steps per second: 38.134
	- Training Step: 47,862
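
The throughput figures above can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not part of the original training code. It assumes that the runtime, samples-per-second, and steps-per-second values all describe the same evaluation pass, and that the reported training step is the final step after all 3 epochs. The variable names are illustrative.

```python
# Figures copied from the Model Details list.
eval_runtime_s = 46.4942
samples_per_second = 305.006
steps_per_second = 38.134
training_step = 47_862
epochs = 3

# Throughput multiplied by runtime recovers the totals for the evaluation pass.
eval_samples = round(eval_runtime_s * samples_per_second)
eval_steps = round(eval_runtime_s * steps_per_second)

# Samples processed per step approximates the evaluation batch size.
approx_eval_batch_size = samples_per_second / steps_per_second

# Assumption: the reported training step is the last step of the final epoch.
steps_per_epoch = training_step / epochs

print(f"evaluation samples:     {eval_samples}")
print(f"evaluation steps:       {eval_steps}")
print(f"approx. eval batch size: {approx_eval_batch_size:.2f}")
print(f"training steps / epoch: {steps_per_epoch:.0f}")
```

Under these assumptions the evaluation set holds about 14,181 samples processed in roughly 1,773 steps, which suggests an evaluation batch size close to 8, and training ran for about 15,954 steps per epoch. The card does not state the batch size or the dataset split sizes, so treat these as estimates.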

	## How to Use
	The model can be used for text classification using the `transformers` library. Below is an example:

	```python
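	from transformers import pipeline

	# Load the model from huggingface.co/models using its repository ID.
	# To load from disk instead, pass a local checkpoint directory such as
	# "ModernBERT-domain-classifier/checkpoint-47862".
	classifier = pipeline(
	    task="text-classification",
	    model="NAMAA-Space/AraModernBert-Topic-Classifier",
	)

	# Arabic input, roughly: "A number of expatriates in the Kingdom of
	# Saudi Arabia embrace Islam."
	sample = '''
	اسلام عددا من الوافدين الى الممكلة العربية السعوديه
	'''

	classifier(sample)
	# Output reported for this sample:
	# [{'label': 'health', 'score': 0.6779336333274841}]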