---
language:
- fa
- en
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- persian
- scientific-qa
- e5
base_model: intfloat/multilingual-e5-large
datasets:
- PersianSciQA
---
# Sentence Transformer for Persian Scientific Text (persian-science-qa-e5-large)

## Abstract

This repository contains **persian-science-qa-e5-large**, a sentence-transformer model fine-tuned on the novel PersianSciQA dataset. The model is based on `intfloat/multilingual-e5-large` and is specifically designed to bridge the language gap in scientific question answering and information retrieval for Persian. The development of specialized applications in this area is hindered by the shortage of datasets for low-resource languages like Persian. This work introduces both a new, large-scale dataset of 39,809 question-abstract pairs and a powerful embedding model trained on it, facilitating a new wave of knowledge-centered applications for the Persian scientific community.
## The PersianSciQA Dataset

The foundation of this model is the PersianSciQA dataset, a large-scale resource created to address the critical shortage of NLP datasets for scientific texts in Persian.
### Dataset Creation

- **Data Source:** The dataset was built from 10,846 scientific abstracts sourced from IranDoc's "Ganj" repository, a comprehensive digital collection of Persian scientific and technical documents. The initial subset consists mainly of engineering theses and dissertations.
- **Generation Methodology:** A two-stage, LLM-powered pipeline using `gpt-4o-mini` was employed; a sketch of both stages appears after this list.
  - **Stage 1 (Query Generation):** For each abstract, the model generated four distinct Persian queries with varying target relevance levels (Direct, Related, Tangential, Distant). A high temperature setting (0.8) was used to maximize query diversity.
  - **Stage 2 (Relevance Assessment):** To mitigate bias, each generated question-abstract pair was independently assessed by the LLM (using a low temperature of 0.3 for more deterministic output) to assign a final, objective relevance score from 0 to 3.
- **Data Refinement:** The raw output of 39,883 pairs underwent a rigorous cleaning process, including text normalization and two-tier deduplication (exact and semantic matching, sketched below) to remove identical or near-duplicate queries. This resulted in the final dataset of 39,809 unique question-abstract pairs.
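The generation code itself is not part of this repository, so the following is a minimal sketch of what such a two-stage pipeline might look like with the official `openai` Python client. The prompt wording and the helper names `generate_queries` and `assess_relevance` are illustrative assumptions; only the model name and the temperature settings (0.8 and 0.3) come from the description above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_queries(abstract: str) -> str:
    """Stage 1: generate four Persian queries at varying relevance levels."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.8,  # high temperature to maximize query diversity
        messages=[{
            "role": "user",
            "content": (
                "Given the following Persian scientific abstract, write four "
                "distinct Persian queries with relevance levels Direct, Related, "
                f"Tangential, and Distant:\n\n{abstract}"
            ),
        }],
    )
    return response.choices[0].message.content

def assess_relevance(query: str, abstract: str) -> str:
    """Stage 2: independently score a query-abstract pair from 0 to 3."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,  # low temperature for more deterministic scoring
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 0 (irrelevant) to 3 (directly relevant), rate "
                f"how relevant this query is to the abstract.\n"
                f"Query: {query}\nAbstract: {abstract}"
            ),
        }],
    )
    return response.choices[0].message.content
```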
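The two-tier deduplication can likewise be approximated: exact matching on normalized strings, followed by semantic matching on embedding cosine similarity. This is a minimal sketch, not the authors' pipeline; in particular, the 0.95 similarity threshold is an assumed value.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(queries: list[str], threshold: float = 0.95) -> list[str]:
    # Tier 1: exact deduplication on whitespace-normalized text
    seen, unique = set(), []
    for q in queries:
        key = " ".join(q.split())  # minimal normalization, for illustration only
        if key not in seen:
            seen.add(key)
            unique.append(q)

    # Tier 2: semantic deduplication via cosine similarity of embeddings
    model = SentenceTransformer("intfloat/multilingual-e5-large")
    emb = model.encode(unique, normalize_embeddings=True)
    keep: list[int] = []
    for i in range(len(unique)):
        # keep a query only if it is not near-identical to any kept query
        if all(np.dot(emb[i], emb[j]) < threshold for j in keep):
            keep.append(i)
    return [unique[i] for i in keep]
```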
### Dataset Statistics

| Metric | Value |
|---|---|
| Total Query-Abstract Pairs | 39,809 |
| Unique Queries | 39,809 |
| Unique Abstracts | 10,235 |
| Avg. Queries per Abstract | 3.89 |
| Query Vocabulary Size | 17,497 words |
| Abstract Vocabulary Size | 86,109 words |
### Human Validation

To ensure the quality and reliability of the LLM-generated data, a rigorous human validation study was conducted.

- **Process:** A random, stratified sample of 1,000 question-abstract pairs was independently assessed by two domain experts proficient in Persian.
- **Inter-Annotator Agreement (IAA):** The annotation process proved highly reliable, with near-perfect agreement between the two experts on all metrics (Cohen's Kappa $\kappa > 0.99$); a sketch of this computation follows the list.
- **Key Findings:**
  - There was substantial agreement between the LLM's relevance scores and the final human-adjudicated labels ($\kappa = 0.6642$).
  - Experts found that 88.60% of the generated queries were of high linguistic quality (grammatically correct and clear).
  - The LLM tended to be conservative, often underestimating the relevance of a query compared to the human judges, which suggests it is unlikely to mistakenly assign high relevance.
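To reproduce this kind of agreement analysis on your own annotations, Cohen's Kappa is available directly in scikit-learn; the labels below are made-up placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder relevance labels (0-3) from two annotators on the same pairs
annotator_a = [3, 2, 0, 1, 3, 2, 0, 0, 1, 3]
annotator_b = [3, 2, 0, 1, 3, 2, 0, 1, 1, 3]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.4f}")
```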
## Model Details

- **Model Type:** Sentence Transformer
- **Base Model:** `intfloat/multilingual-e5-large`
- **Output Dimensions:** 1024
## Training

- **Training Data:** The model was fine-tuned on the PersianSciQA training split, which contains 31,837 question-abstract pairs from 8,188 unique abstracts.
- **Loss Function:** The model was trained with `CosineSimilarityLoss`, which is well suited to producing meaningful sentence embeddings for semantic similarity tasks. A minimal fine-tuning sketch follows.
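The training script itself is not included here, so the following is a minimal sketch of fine-tuning with `CosineSimilarityLoss` via the sentence-transformers `fit` API. The placeholder data, batch size, epoch count, and the mapping of 0-3 relevance scores to [0, 1] labels (dividing by 3) are all illustrative assumptions, not necessarily the authors' choices.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Hypothetical rows: (Persian query, abstract, relevance score in 0-3)
train_rows = [
    ("...", "...", 3),
    ("...", "...", 1),
]

# CosineSimilarityLoss expects a similarity label in [0, 1]; scaling the
# 0-3 relevance score by 1/3 is one plausible mapping, assumed here.
train_examples = [
    InputExample(texts=[f"query: {q}", f"passage: {p}"], label=score / 3.0)
    for q, p, score in train_rows
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```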
## How to Use

You can use this model directly with the sentence-transformers library. Note that, following the E5 convention of the base model, queries should be prefixed with `"query: "` and documents with `"passage: "`, as in the example below.

```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("safora/persian-science-qa-e5-large")

sentences = [
    # "In which scientific and industrial fields are accelerometers used?"
    "query: شتاب سنجها در کدام زمینههای علمی و صنعتی کاربرد دارند؟",
    # "In this thesis, a three-axis capacitive accelerometer is designed and
    # simulated using micromachining technology with only a single moving mass."
    "passage: در این پایان نامه یک شتاب سنج خازنی سه محوره با استفاده از تکنولوژی میکروماشین و تنها با یک جرم متحرک طراحی و شبیه سازی شده است.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 1024)

# Compute the cosine similarity between the query and the passage
similarities = model.similarity(embeddings[0:1], embeddings[1:])
print(similarities)
```
## Potential Use Cases

This dataset and model directly support a range of applications in the Persian scientific domain:

- **Scientific Question Answering:** Training and evaluating systems that answer questions based on scientific abstracts.
- **Relevance Ranking Models:** Building models that rank documents or passages by their relevance to a given query (see the retrieval sketch after this list).
- **Paraphrase Identification:** The semantically rich queries can support research on identifying paraphrases in a technical context.
- **Evaluating LLMs:** The dataset serves as a robust benchmark for assessing the capabilities of LLMs on Persian scientific text.
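As a concrete instance of the ranking use case, the released model can score a query against a set of candidate abstracts and sort them by cosine similarity. This is a minimal sketch; the query and passage strings are placeholders.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("safora/persian-science-qa-e5-large")

query = "query: ..."  # a Persian question, with the E5-style prefix
passages = [
    "passage: ...",   # candidate abstract 1 (placeholder)
    "passage: ...",   # candidate abstract 2 (placeholder)
]

query_emb = model.encode([query])
passage_embs = model.encode(passages)

# Cosine similarity between the query and every passage, then rank descending
scores = model.similarity(query_emb, passage_embs)[0]
ranking = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
for passage, score in ranking:
    print(f"{score:.3f}  {passage[:60]}")
```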
## Limitations

- The dataset is primarily LLM-generated and, while rigorously validated, may retain inherent LLM biases.
- The current scope is predominantly engineering abstracts, which may limit generalizability to other scientific fields.
- Abstract snippets can occasionally have formatting issues due to mixed right-to-left and left-to-right text, or abrupt truncation.
## Citation

If you use this model or the PersianSciQA dataset in your research, please cite the following paper:

```bibtex
@inproceedings{anonymous-2025-persiansciqa,
    title = "{P}ersian{S}ci{QA}: A new Dataset for Bridging the Language Gap in Scientific Question Answering",
    author = "Anonymous",
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2025)",
    month = "September",
    year = "2025",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/2025.ranlp-1.0"
}
```