# Pashto BERT (BERT-Base)

## Model Overview
This is a monolingual Pashto BERT (BERT-Base) model pretrained on a large Pashto corpus. It produces contextual representations of Pashto text, making it suitable for a range of downstream Natural Language Processing (NLP) tasks.
## Model Details
- Architecture: BERT-Base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- Language: Pashto (ps)
- Training Corpus: A diverse set of Pashto text data, including news articles, books, and web content.
- Special Tokens: `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]`
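
As a quick sanity check, the special tokens can be inspected after loading the tokenizer. The repository ID below is the same placeholder used throughout this card:

```python
from transformers import AutoTokenizer

# Placeholder repository ID; replace with the actual model path.
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/pashto-bert-base")

# Prints the special tokens configured for this tokenizer,
# e.g. {'cls_token': '[CLS]', 'sep_token': '[SEP]', ...}
print(tokenizer.special_tokens_map)
```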
## Intended Use
This model can be fine-tuned for various Pashto-specific NLP tasks, such as:
- Sequence Classification: Sentiment analysis, topic classification, and document categorization.
- Sequence Tagging: Named entity recognition (NER) and part-of-speech (POS) tagging.
- Text Generation & Understanding: Question answering, text summarization, and machine translation.
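
For example, a sequence-classification head can be attached in the usual `transformers` way before fine-tuning. This is only a sketch: the repository ID and label count are placeholders, not values shipped with this model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "your-huggingface-username/pashto-bert-base"  # placeholder repository ID

tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is task-specific; 3 is only an illustrative choice
# (e.g. negative / neutral / positive sentiment).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# The resulting model can then be fine-tuned with the Trainer API or a custom training loop.
```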
## How to Use

This model can be loaded using the `transformers` library from Hugging Face:
```python
from transformers import AutoModel, AutoTokenizer

model_name = "your-huggingface-username/pashto-bert-base"

# Load the tokenizer and model from the same repository.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "ستاسو نننۍ ورځ څنګه وه؟"
tokens = tokenizer(text, return_tensors="pt")
out = model(**tokens)
```
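
Because the model was pretrained with a masked-language-modeling objective, the `fill-mask` pipeline gives a quick qualitative check. The repository ID is again the placeholder used above, and the example simply masks one word of the sentence from the snippet:

```python
from transformers import pipeline

# Placeholder repository ID; replace with the actual model path.
fill_mask = pipeline("fill-mask", model="your-huggingface-username/pashto-bert-base")

# Predict the masked word in the example sentence.
for prediction in fill_mask("ستاسو نننۍ [MASK] څنګه وه؟"):
    print(prediction["token_str"], prediction["score"])
```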
## Training Details
- Optimization: AdamW
- Sequence Length: 128
- Warmup Steps: 10,000
- Warmup Ratio: 0.06
- Learning Rate: 1e-4
- Weight Decay: 0.01
- Adam Optimizer Parameters:
  - Epsilon: 1e-8
  - Betas: (0.9, 0.999)
- Gradient Accumulation Steps: 1
- Max Gradient Norm: 1.0
- Scheduler: `linear_schedule_with_warmup`
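
A minimal sketch of how these hyperparameters map onto a PyTorch/`transformers` setup, assuming the standard AdamW optimizer and linear warmup scheduler. The repository ID and total step count are placeholders, not values reported for this model:

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

# Placeholder repository ID and step count, used only to illustrate the settings above.
model = AutoModelForMaskedLM.from_pretrained("your-huggingface-username/pashto-bert-base")
num_training_steps = 1_000_000  # placeholder; not a value reported for this model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # Learning Rate
    betas=(0.9, 0.999),   # Adam betas
    eps=1e-8,             # Adam epsilon
    weight_decay=0.01,    # Weight Decay
)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,  # Warmup Steps
    num_training_steps=num_training_steps,
)

# During training, gradients are clipped to the reported max norm before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```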
## Limitations & Biases
- The model may reflect biases present in the training data.
- Performance on low-resource or domain-specific tasks may require additional fine-tuning.
- It is not trained for code-switching scenarios (e.g., mixing Pashto with English or other languages).