# Pashto BERT (BERT-Base)

## Model Overview
This is a monolingual Pashto BERT (BERT-Base) model pretrained on a large Pashto corpus. It produces contextual representations of Pashto text, making it suitable for a range of downstream Natural Language Processing (NLP) tasks.
## Model Details
- Architecture: BERT-Base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- Language: Pashto (ps)
- Training Corpus: A diverse set of Pashto text data, including news articles, books, and web content.
- Special Tokens: `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]`
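
As a quick sanity check, the special tokens can be inspected after loading the tokenizer. The repository ID below is the same placeholder used throughout this card:

```python
from transformers import AutoTokenizer

# Placeholder repository ID; replace with the actual model path.
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/pashto-bert-base")

# Prints the special tokens configured for this tokenizer,
# e.g. {'cls_token': '[CLS]', 'sep_token': '[SEP]', ...}
print(tokenizer.special_tokens_map)
```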
## Intended Use
This model can be fine-tuned for various Pashto-specific NLP tasks, such as:
- Sequence Classification: Sentiment analysis, topic classification, and document categorization.
- Sequence Tagging: Named entity recognition (NER) and part-of-speech (POS) tagging.
- Text Generation & Understanding: Question answering, text summarization, and machine translation.
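
For example, a sequence-classification head can be attached in the usual `transformers` way before fine-tuning. This is only a sketch: the repository ID and label count are placeholders, not values shipped with this model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "your-huggingface-username/pashto-bert-base"  # placeholder repository ID

tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is task-specific; 3 is only an illustrative choice
# (e.g. negative / neutral / positive sentiment).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# The resulting model can then be fine-tuned with the Trainer API or a custom training loop.
```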
## How to Use

This model can be loaded using the `transformers` library from Hugging Face:
```python
from transformers import AutoModel, AutoTokenizer

model_name = "your-huggingface-username/pashto-bert-base"

# Load the tokenizer and model from the same repository.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "ستاسو نننۍ ورځ څنګه وه؟"
tokens = tokenizer(text, return_tensors="pt")
out = model(**tokens)
```
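
Because the model was pretrained with a masked-language-modeling objective, the `fill-mask` pipeline gives a quick qualitative check. The repository ID is again the placeholder used above, and the example simply masks one word of the sentence from the snippet:

```python
from transformers import pipeline

# Placeholder repository ID; replace with the actual model path.
fill_mask = pipeline("fill-mask", model="your-huggingface-username/pashto-bert-base")

# Predict the masked word in the example sentence.
for prediction in fill_mask("ستاسو نننۍ [MASK] څنګه وه؟"):
    print(prediction["token_str"], prediction["score"])
```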
## Training Details
- Optimization: AdamW
- Sequence Length: 128
- Warmup Steps: 10,000
- Warmup Ratio: 0.06
- Learning Rate: 1e-4
- Weight Decay: 0.01
- Adam Optimizer Parameters:
  - Epsilon: 1e-8
  - Betas: (0.9, 0.999)
- Gradient Accumulation Steps: 1
- Max Gradient Norm: 1.0
- Scheduler: `linear_schedule_with_warmup`
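
A minimal sketch of how these hyperparameters map onto a PyTorch/`transformers` setup, assuming the standard AdamW optimizer and linear warmup scheduler. The repository ID and total step count are placeholders, not values reported for this model:

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

# Placeholder repository ID and step count, used only to illustrate the settings above.
model = AutoModelForMaskedLM.from_pretrained("your-huggingface-username/pashto-bert-base")
num_training_steps = 1_000_000  # placeholder; not a value reported for this model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # Learning Rate
    betas=(0.9, 0.999),   # Adam betas
    eps=1e-8,             # Adam epsilon
    weight_decay=0.01,    # Weight Decay
)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,  # Warmup Steps
    num_training_steps=num_training_steps,
)

# During training, gradients are clipped to the reported max norm before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```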
## Limitations & Biases
- The model may reflect biases present in the training data.
- Performance on low-resource or domain-specific tasks may require additional fine-tuning.
- It is not trained for code-switching scenarios (e.g., mixing Pashto with English or other languages).