mhaseeb1604/bge-m3-law

This model is a fine-tuned version of BAAI/bge-m3, specialized for sentence similarity tasks on legal texts in both Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space, useful for tasks such as clustering, semantic search, and retrieval.

Model Overview

  • Architecture: Based on sentence-transformers.
  • Training Data: Fine-tuned on a large legal dataset containing bilingual Arabic and English data.
  • Embedding Size: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text.
  • Applications: Ideal for legal applications such as semantic similarity comparison, document clustering, and retrieval in a bilingual Arabic-English context (see the retrieval sketch below).
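
For example, bilingual retrieval can be run directly over the embeddings. The sketch below is illustrative only: the corpus sentences and the query are hypothetical, and it assumes the sentence-transformers setup described in the Usage section.

from sentence_transformers import SentenceTransformer, util

# Load the model (downloads from the Hugging Face Hub on first use)
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Hypothetical bilingual legal corpus (illustrative sentences, not training data)
corpus = [
    "The lessee must return the leased property in its original condition.",
    "يلتزم المستأجر بإعادة العين المؤجرة بالحالة التي تسلمها عليها.",
    "The penalty clause applies in case of late delivery.",
]
query = "ما هي التزامات المستأجر عند انتهاء عقد الإيجار؟"  # "What are the lessee's obligations when the lease ends?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the closest corpus entries, regardless of language
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])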

Installation

To use this model, you need to have the sentence-transformers library installed. You can install it via pip:

pip install -U sentence-transformers

Usage

You can easily load and use this model in Python with the following code:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Sample sentences
sentences = ["This is an example sentence", "Each sentence is converted"]

# Generate embeddings
embeddings = model.encode(sentences)

# Output embeddings
print(embeddings)
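
To compare two texts across languages, compute the cosine similarity of their embeddings. A minimal sketch follows; the Arabic/English pair is illustrative, not taken from the training data.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# An Arabic clause and an illustrative English rendering of the same clause
arabic = "يحق للطرف الأول فسخ العقد في حال الإخلال بأي من بنوده."
english = "The first party may terminate the contract if any of its terms are breached."

embeddings = model.encode([arabic, english], convert_to_tensor=True)

# The model L2-normalizes its outputs, so cosine similarity equals the dot product
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))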

Model Training

The model was fine-tuned on Arabic and English legal texts with the following configuration (an approximate reconstruction in code follows the list):

  • DataLoader:
    • Batch size: 4
    • Sampler: SequentialSampler
  • Loss Function: MultipleNegativesRankingLoss with cosine similarity.
  • Optimizer: AdamW with learning rate 2e-05.
  • Training Parameters:
    • Epochs: 2
    • Warmup Steps: 20
    • Weight Decay: 0.01
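
The configuration above can be reproduced roughly with the sentence-transformers fit() API. The sketch below is an approximation of that setup, not the exact training script, and the training pairs shown are hypothetical.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base model
model = SentenceTransformer('BAAI/bge-m3')

# Hypothetical (anchor, positive) pairs, e.g. an Arabic clause and its English counterpart
train_examples = [
    InputExample(texts=["يلتزم البائع بتسليم المبيع.", "The seller must deliver the goods."]),
    # ... more pairs ...
]

# Batch size 4 with sequential sampling (shuffle=False uses a SequentialSampler)
train_dataloader = DataLoader(train_examples, shuffle=False, batch_size=4)

# MultipleNegativesRankingLoss scores pairs with cosine similarity by default
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=20,
    optimizer_params={'lr': 2e-05},  # AdamW is the default optimizer
    weight_decay=0.01,
)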

Full Model Architecture

This model consists of three main components:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False})
  (2): Normalize()
)
  • Transformer Layer: Uses XLM-Roberta model with a max sequence length of 8192.
  • Pooling Layer: Utilizes CLS token pooling to generate sentence embeddings.
  • Normalization Layer: L2-normalizes the output vectors, so cosine similarity can be computed as a simple dot product in similarity tasks.
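
Because of the final Normalize() module, each output is a unit-length 1024-dimensional vector. A quick check (sketch; the two example strings are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mhaseeb1604/bge-m3-law')
embeddings = model.encode(["نص قانوني قصير", "A short legal text"])  # the same phrase in Arabic and English

print(embeddings.shape)                    # (2, 1024): one 1024-dimensional vector per input
print(np.linalg.norm(embeddings, axis=1))  # ~[1. 1.]: outputs are L2-normalized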

Citing & Authors

If you find this repository useful, please consider giving it a star and citing it:

@misc{muhammad_haseeb_2024,
    author    = {Muhammad Haseeb},
    title     = {bge-m3-law (Revision 2fc0289)},
    year      = {2024},
    url       = {https://huggingface.co/mhaseeb1604/bge-m3-law},
    doi       = {10.57967/hf/3217},
    publisher = {Hugging Face}
}