FinAraT5 – A T5-based Arabic Financial Text Generation Model

FinAraT5 is the first T5-based text-to-text model for the Arabic financial domain. This model is a fine-tuned version of FinAraT5_MSA on the Alarabya-news-summarisation dataset. It is based on AraT5 and was trained on domain-specific Arabic financial corpora.

📘 Official Paper (LDK 2023)
📘 Authors: Nadhem Zmandar, Mo El-Haj, and Paul Rayson

---

🔧 Model Use Case

This model is designed for:

Generating short, informative headlines for Arabic financial news articles
Summarising long financial texts into concise titles or summary statements

It can assist news agencies, financial analysts, and media platforms in streamlining content production.

⚠️ Note: The model was fine-tuned on data collected from a single source (Al Arabiya), which may limit generalisability to other domains or styles.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 22.0
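
As a rough illustration of what the linear scheduler does with these values, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps, a single device, and no gradient accumulation (none of these are stated above), and the dataset size of 1000 examples is hypothetical.

```python
import math

# Values taken from the hyperparameter list above.
LEARNING_RATE = 5e-5
TRAIN_BATCH_SIZE = 8
NUM_EPOCHS = 22


def total_training_steps(num_examples, batch_size=TRAIN_BATCH_SIZE, num_epochs=NUM_EPOCHS):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=0):
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset size, used only to make the numbers concrete.
total = total_training_steps(num_examples=1000)
print(total)                         # 125 steps per epoch * 22 epochs
print(linear_lr(0, total))           # starts at the base learning rate
print(linear_lr(total // 2, total))  # roughly half the base rate at the midpoint
print(linear_lr(total, total))       # reaches zero at the final step
```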

💡 Example Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

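# Load the fine-tuned weights from the Hugging Face Hub
model = T5ForConditionalGeneration.from_pretrained("drelhaj/FinAraT5")
# If required: load the SentencePiece vocabulary from a local spiece.model file
tokenizer = T5Tokenizer(vocab_file="spiece.model")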

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

input_text = "أعلنت الشركة عن ارتفاع أرباحها بنسبة ١٥٪ في الربع الثاني من العام نتيجة لزيادة المبيعات في السوق الخليجية"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

with torch.no_grad():
    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=30)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)

💡 If using this model locally, ensure that spiece.model is included in the model directory for proper tokenisation.
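
A small pre-flight check along these lines can catch a missing vocabulary file before the tokenizer is built. This is only a sketch: the helper name `find_sentencepiece_vocab` and the `./FinAraT5` directory are hypothetical, not part of the model repository.

```python
from pathlib import Path


def find_sentencepiece_vocab(model_dir, filename="spiece.model"):
    """Return the path to the vocabulary file, or raise if it is missing."""
    vocab_path = Path(model_dir) / filename
    if not vocab_path.is_file():
        raise FileNotFoundError(
            f"Expected {filename} in {model_dir}; the tokenizer needs it."
        )
    return vocab_path


# Usage with a hypothetical local directory:
# vocab_path = find_sentencepiece_vocab("./FinAraT5")
# tokenizer = T5Tokenizer(vocab_file=str(vocab_path))
```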

📝 Example Output

input_text: 'صعدت أسعار الذهب على نحو طفيف اليوم الاثنين، حيث أدى ارتفاع التضخم في الولايات المتحدة إلى تعزيز جاذبيته كملاذ آمن، في حين يترقب المستثمرون اجتماع مجلس الاحتياطي الاتحادي لمعرفة مدى السرعة التي يعتزم بها إلغاء برنامج شراء السندات., وارتفع الذهب في المعاملات الفورية 0.2% إلى 1785.20 دولار للأونصة، وزادت العقود الأميركية الآجلة للذهب 0.1% إلى 1785.70 دولار., ومن المرجح أن يعلن مجلس الاحتياطي الاتحادي (البنك المركزي الأميركي) عن خفض أسرع في مشتريات السندات لكن المخاوف الأكثر وضوحا بشأن التضخم يمكن أن تزعج الأسواق., ورغم أن الذهب يعتبر أداة للتحوط من التضخم، فإن خفض التحفيز ورفع أسعار الفائدة عادة ما يؤديان إلى دفع عوائد السندات الحكومية للصعود، مما يرفع تكلفة الفرصة البديلة لحيازة المعدن الأصفر الذي لا يدر عائدا., وتتجه الأنظار الآن إلى اجتماع مجلس الاحتياطي المقرر في 14-15 ديسمبر/ كانون الأول., وارتفعت الفضة في المعاملات الفورية 0.3% إلى 22.22 دولار للأونصة., وزاد البلاتين 0.5% إلى 946.74 دولار، وارتفع البلاديوم 0.5% إلى 1769.61 دولار.'

Output: الذهب يصعد مع ارتفاع التضخم في أميركا

🏗️ Fine-tuning Example

Below is an example of fine-tuning any text-to-text model for news title generation on any dataset:

!python run_trainer_seq2seq_huggingface.py \
    --learning_rate 5e-5 \
    --max_target_length 256 --max_source_length 128 \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --model_name_or_path "<model_id>" \
    --output_dir finarat5_base_title_generation --overwrite_output_dir \
    --num_train_epochs 22 \
    --train_file "train_file" \
    --validation_file "valid_file" \
    --task "title_generation" --text_column "text" --summary_column "summary" \
    --evaluation_strategy epoch  --save_strategy epoch  \
    --load_best_model_at_end --metric_for_best_model "bertscore" \
    --greater_is_better True  \
    --logging_strategy epoch --predict_with_generate \
    --do_train --do_eval \
    --push_to_hub "True" \
    --push_to_hub_token "<HF_ID>" \
    --report_to "wandb" \
    --run_name "fine tuning a model on financial arabic news summarization dataset"
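
The command above names a "text" column and a "summary" column, so the training and validation files need one article/headline pair per record. A minimal sketch for producing such a file follows. It assumes the script accepts JSON Lines input, as Hugging Face example scripts commonly do; the file name `train.jsonl`, the helper `write_jsonl`, and the placeholder records are all hypothetical.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line, leaving Arabic text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder records; replace with real article/headline pairs.
records = [
    {"text": "<article body 1>", "summary": "<headline 1>"},
    {"text": "<article body 2>", "summary": "<headline 2>"},
]
write_jsonl("train.jsonl", records)
```

The resulting path would then be passed to --train_file, with a second file built the same way for --validation_file.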

Framework versions

Transformers 4.23.0.dev0
Pytorch 1.12.1+cu102
Datasets 2.5.1
Tokenizers 0.13.0

🙏 Acknowledgements

Many thanks to Dr Nadhem Zmandar (AI Research Engineer) for his great effort in building this model. Nadhem did this work as part of his PhD thesis, titled Multilingual Financial Text Summarisation. Please get in touch with Nadhem on LinkedIn. For any other questions, please contact Dr Mo El-Haj: https://elhaj.uk.

We gratefully acknowledge the Google TensorFlow Research Cloud (TFRC) program for the free TPU v3-8 access, and we thank the Google Cloud team for the free GCP credits.
