|
--- |
|
language: |
|
- ar |
|
- en |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:34436 |
|
- loss:MatryoshkaLoss |
|
- loss:CoSENTLoss |
|
base_model: AhmedZaky1/DIMI-embedding-v2 |
|
widget: |
|
- source_sentence: الرجل يركب حصاناً |
|
sentences: |
|
- رجل يُبث الجبن الممزق على البيتزا |
|
- source_sentence: المرأة تقلي لحم خنزير مشوي |
|
sentences: |
|
- امرأة تطبخ لحم خنزير مخبوز |
|
- طائرة طيران تقلع |
|
- source_sentence: امرأة تحمل في ذراعها طفل كنغر |
|
sentences: |
|
- امرأة تعزف على الغيتار |
|
- امرأة تحمل و تحمل طفل كنغر |
|
- source_sentence: رجل يعزف على الناي |
|
sentences: |
|
- طائرة ستقلع |
|
- رجل يعزف على فرقة الخيزران |
|
- source_sentence: ثلاثة رجال يلعبون الشطرنج. |
|
sentences: |
|
- رجلين يلعبان الشطرنج |
|
- بعض الرجال يقاتلون |
|
datasets: |
|
- silma-ai/silma-arabic-english-sts-dataset-v1.0 |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
--- |
|
|
|
|
|
# DIMI Embedding model |
|
|
|
<div align="center"> |
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/65fb3ac20cfe262da2bb0fcc/uOuEn0LNhSVEBbOLwfFUu.jpeg" width="300"/> |
|
|
|
*State-of-the-art Multilingual Sentence Embeddings for Arabic-English Semantic Similarity* |
|
|
|
</div> |
|
|
|
## 🚀 Model Description |
|
|
|
**DIMI-embedding-v3-silma-sts-matryoshka** is a multilingual sentence embedding model fine-tuned for Arabic-English semantic textual similarity. Built on the DIMI-embedding-v2 architecture, it combines **Matryoshka Representation Learning** with **CoSENT Loss** to deliver strong performance across multiple embedding dimensions.
|
|
|
### ✨ Key Features |
|
|
|
- **Multi-dimensional embeddings**: Supports output dimensions of 768, 512, 256, 128, and 64 |
|
- **Bilingual expertise**: Optimized for Arabic and English text processing |
|
- **Matryoshka architecture**: Efficient embedding computation at multiple granularities |
|
- **State-of-the-art performance**: Fine-tuned on the comprehensive Silma Arabic-English STS dataset |
|
- **Cosine similarity optimized**: Well suited to semantic similarity and retrieval tasks
|
|
|
## 📊 Model Performance |
|
|
|
The model was evaluated at every supported embedding dimension; the training setup and results are summarized below.
|
|
|
### Training Techniques |
|
|
|
This model was trained with the following techniques (a minimal sketch of the loss setup follows the list):
|
|
|
- **Matryoshka Representation Learning**: Enables efficient embeddings at multiple dimensions [768, 512, 256, 128, 64] without retraining |
|
- **CoSENT Loss Function**: Cosine-based sentence embedding loss for superior semantic similarity learning |
|
- **Multi-dimensional Evaluation**: Simultaneous optimization across all target dimensions during training |
|
- **Mixed Precision Training (FP16)**: Accelerated training with maintained numerical stability |
|
- **Warmup Learning Rate Schedule**: Gradual learning rate increase for stable convergence |
|
- **Best Model Selection**: Automatic selection based on highest Spearman correlation on 768d embeddings |
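
The snippet below is a minimal sketch of how MatryoshkaLoss can wrap CoSENTLoss with the sentence-transformers v3 trainer API; it is not the exact training script. The dataset split, column layout, and output directory are assumptions.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Starting checkpoint named in this model card
model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v2")

# STS-style pairs; CoSENTLoss expects two text columns plus a similarity score
# (rename columns if the dataset uses different names)
train_dataset = load_dataset(
    "silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train"
)

# CoSENT: cosine-based ranking loss over (sentence1, sentence2, score) triples
base_loss = losses.CoSENTLoss(model)

# Matryoshka wrapper applies the same objective at every truncated dimension
loss = losses.MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="dimi-embedding-v3-silma-sts-matryoshka",  # assumed output path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```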
|
|
|
### Final Model Performance |
|
|
|
#### Development Set Results (Silma STS Dataset) |
|
Final evaluation on the held-out development set: |
|
|
|
| Dimension | Pearson Correlation | Spearman Correlation | |
|
|-----------|-------------------|---------------------| |
|
| 768d | 0.8894 | 0.8358 | |
|
| 512d | 0.8959 | 0.8395 | |
|
| 256d | 0.8979 | 0.8470 | |
|
| 128d | 0.9182 | 0.8562 | |
|
| 64d | 0.9066 | 0.8434 | |
|
|
|
#### MTEB STS17 Arabic Test Results |
|
Performance on the standard MTEB STS17 (ar-ar) benchmark: |
|
|
|
| Dimension | Pearson Correlation | Spearman Correlation | |
|
|-----------|-------------------|---------------------| |
|
| **768d** | **0.8205** | **0.8258** | |
|
| **512d** | **0.8193** | **0.8227** | |
|
| **256d** | **0.8191** | **0.8246** | |
|
| **128d** | **0.8115** | **0.8183** | |
|
| **64d** | **0.7962** | **0.8077** | |
|
|
|
**Sequential Score**: 0.8077 (based on 64d performance) |
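
The dimension-wise correlations above come from evaluating each truncated embedding size separately. A minimal sketch of that procedure is shown below; the sentence pairs and gold scores are illustrative placeholders, and a real evaluation would use the full dev/test split.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka")

# Illustrative STS pairs and hypothetical gold similarity labels
sentences1 = ["الرجل يركب حصاناً", "رجل يعزف على الناي", "ثلاثة رجال يلعبون الشطرنج."]
sentences2 = ["رجل يمتطي حصاناً", "طائرة ستقلع", "رجلين يلعبان الشطرنج"]
gold_scores = [4.8, 0.5, 3.5]

emb1 = model.encode(sentences1)
emb2 = model.encode(sentences2)

for dim in [768, 512, 256, 128, 64]:
    # Truncate to the target Matryoshka dimension, then score each pair by cosine
    a, b = emb1[:, :dim], emb2[:, :dim]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    print(f"{dim}d  Pearson={pearsonr(cos, gold_scores)[0]:.4f}  "
          f"Spearman={spearmanr(cos, gold_scores)[0]:.4f}")
```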
|
|
|
## 🔧 Usage |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Load the model |
|
model = SentenceTransformer('AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka', trust_remote_code=True) |
|
|
|
# Example sentences in Arabic and English |
|
sentences = [ |
|
"هذا مثال جميل للذكاء الاصطناعي", # Arabic |
|
"This is a beautiful example of artificial intelligence", # English |
|
"التعلم الآلي يغير العالم", # Arabic |
|
"Machine learning is changing the world" # English |
|
] |
|
|
|
# Generate embeddings |
|
embeddings = model.encode(sentences) |
|
print(f"Embedding shape: {embeddings.shape}") |
|
|
|
# Calculate cosine similarity |
|
from sklearn.metrics.pairwise import cosine_similarity |
|
similarity_matrix = cosine_similarity(embeddings) |
|
print("Similarity matrix:") |
|
print(similarity_matrix) |
|
``` |
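
With sentence-transformers 3.0 or newer, the model's built-in similarity function can replace the scikit-learn call above (assuming such a version is installed):

```python
# sentence-transformers >= 3.0 exposes the model's similarity function directly
similarities = model.similarity(embeddings, embeddings)  # returns a torch tensor
print(similarities)
```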
|
|
|
### Matryoshka Embeddings Usage |
|
|
|
```python |
|
# Use different embedding dimensions |
|
dimensions = [768, 512, 256, 128, 64] |
|
|
|
for dim in dimensions: |
|
# Truncate embeddings to specific dimension |
|
truncated_embeddings = embeddings[:, :dim] |
|
print(f"Dimension {dim}: {truncated_embeddings.shape}") |
|
|
|
# Calculate similarity with truncated embeddings |
|
similarity = cosine_similarity(truncated_embeddings) |
|
print(f"Average similarity at {dim}d: {similarity.mean():.4f}") |
|
``` |
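
Alternatively, recent sentence-transformers releases (2.7 and later) accept a `truncate_dim` argument at load time, so `encode()` returns vectors already at the target size. A sketch, assuming such a version is installed:

```python
from sentence_transformers import SentenceTransformer

# Ask the model to return 256-dimensional vectors directly
model_256 = SentenceTransformer(
    "AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka",
    truncate_dim=256,
)

embeddings_256 = model_256.encode(sentences)  # `sentences` from the basic usage example
print(embeddings_256.shape)  # (len(sentences), 256)
```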
|
|
|
### Semantic Search Example |
|
|
|
```python |
|
import numpy as np |
|
|
|
# Query and corpus |
|
query = "ما هو الذكاء الاصطناعي؟" # "What is artificial intelligence?" |
|
corpus = [ |
|
"الذكاء الاصطناعي هو محاكاة الذكاء البشري", |
|
"Machine learning is a subset of AI", |
|
"Deep learning uses neural networks", |
|
"التعلم العميق يستخدم الشبكات العصبية" |
|
] |
|
|
|
# Encode query and corpus |
|
query_embedding = model.encode([query]) |
|
corpus_embeddings = model.encode(corpus) |
|
|
|
# Find most similar documents |
|
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0] |
|
top_indices = np.argsort(similarities)[::-1] |
|
|
|
print(f"Query: {query}") |
|
print("\nMost similar documents:") |
|
for i, idx in enumerate(top_indices[:3]): |
|
print(f"{i+1}. {corpus[idx]} (similarity: {similarities[idx]:.4f})") |
|
``` |
|
|
|
## 🏗️ Model Architecture |
|
|
|
- **Base Model**: DIMI-embedding-v2 |
|
- **Training Objective**: CoSENT Loss with Matryoshka Learning |
|
- **Supported Dimensions**: [768, 512, 256, 128, 64] |
|
- **Max Sequence Length**: 512 tokens |
|
- **Pooling Method**: Mean pooling |
|
- **Similarity Function**: Cosine similarity |
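
These settings can be checked directly from the loaded model; the printed module layout depends on the packaged configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka", trust_remote_code=True)

print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model)                                     # shows the Transformer + Pooling modules
```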
|
|
|
## 📊 Training Details |
|
|
|
### Dataset |
|
- **Primary Dataset**: silma-ai/silma-arabic-english-sts-dataset-v1.0 |
|
- **Evaluation Dataset**: MTEB STS17 (ar-ar) |
|
- **Training Samples**: ~24,000 multilingual sentence pairs
|
- **Evaluation Samples**: 100 held-out pairs |
|
|
|
### Training Configuration |
|
- **Batch Size**: 16 |
|
- **Epochs**: 4 |
|
- **Learning Rate Schedule**: Linear warmup (warmup ratio 0.1)
|
- **Precision**: FP16 |
|
- **Evaluation Strategy**: Every 100 steps |
|
- **Best Model Selection**: Highest Spearman correlation on 768d embeddings |
|
|
|
### Hardware Requirements |
|
- **GPU**: CUDA-compatible GPU recommended |
|
- **Memory**: 16GB+ RAM for training |
|
- **Storage**: 2GB+ for model weights |
|
|
|
## 🎯 Applications |
|
|
|
This model excels in various NLP tasks (a short clustering sketch follows the list):
|
|
|
- **Semantic Textual Similarity**: Measure similarity between Arabic-English text pairs |
|
- **Information Retrieval**: Find relevant documents in multilingual corpora |
|
- **Paraphrase Detection**: Identify semantically equivalent sentences |
|
- **Cross-lingual Search**: Search Arabic content with English queries and vice versa |
|
- **Clustering**: Group similar multilingual documents |
|
- **Recommendation Systems**: Content-based recommendations across languages |
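
As an illustration of the clustering use case, here is a minimal sketch using scikit-learn's KMeans on mixed Arabic/English sentences; the documents and cluster count are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka")

# Illustrative mixed Arabic/English documents
docs = [
    "الذكاء الاصطناعي يغير الصناعة",                 # AI is transforming industry
    "Machine learning powers modern search engines",
    "وصفة سهلة لتحضير الخبز في المنزل",              # An easy recipe for baking bread at home
    "A simple recipe for homemade bread",
]

embeddings = model.encode(docs, normalize_embeddings=True)

# Two clusters expected here: technology vs. cooking (illustrative choice)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for label, doc in zip(labels, docs):
    print(label, doc)
```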
|
|
|
## 📝 Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{dimi-embedding-v3-2024, |
|
title={DIMI-embedding-v3-silma-sts-matryoshka: Multilingual Sentence Embeddings for Arabic-English Semantic Similarity}, |
|
author={Ahmed Zaky}, |
|
year={2024}, |
|
publisher={Hugging Face}, |
|
url={https://huggingface.co/AhmedZaky1/DIMI-embedding-v3-silma-sts-matryoshka} |
|
} |
|
``` |
|
|
|
## 📧 Contact |
|
|
|
**Author**: Ahmed Zaky |
|
**Email**: [email protected] |
|
**GitHub**: [@AhmedZaky1](https://github.com/AhmedZaky1) |
|
|
|
## 📄 License |
|
|
|
This model is released under the **MIT License**. |
|
|
|
``` |
|
MIT License |
|
|
|
Copyright (c) 2024 Ahmed Zaky |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
``` |
|
|
|
## 🙏 Acknowledgments |
|
|
|
- **Silma AI** for providing the high-quality Arabic-English STS dataset |
|
- **Sentence Transformers** library for the excellent framework |
|
- **Hugging Face** for model hosting and distribution |
|
- The **MTEB** benchmark for evaluation standards |
|
|
|
--- |
|
|
|
<div align="center"> |
|
|
|
**Built with ❤️ by Ahmed Zaky** |
|
|
|
*Advancing Arabic NLP through state-of-the-art embedding models* |
|
|
|
</div> |