metadata
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
datasets:
  - yahyaabd/allstats-semantic-search-synthetic-dataset-v2-mini
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:70280
  - loss:CosineSimilarityLoss
widget:
  - source_sentence: Data SBH tahun 2012 di Mamuju
    sentences:
      - >-
        Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Harmonized
        System November 2013
      - SBH 2012 - Mamuju
      - IHK di 66 Kota di Indonesia 2013
  - source_sentence: Statistik konstruksi tahun 2020
    sentences:
      - Indeks Ketimpangan Gender 2022
      - >-
        Angka Kematian Bayi/AKB (Infant Mortality Rate/IMR) Menurut Provinsi,
        1971-2020
      - >-
        Perkembangan Beberapa Indikator Utama sosial-Ekonomi Indonesia Edisi
        Februari 2016
  - source_sentence: Berapa besar inflasi pada bulan Oktober 2008?
    sentences:
      - >-
        Tinjauan Ekonomi Regional Indonesia Berdasarkan Data PDRB 2004-2008 Buku
        2
      - Statistik Sumber Daya Laut dan Pesisir 2020
      - Inflasi September 2008 sebesar 0,97 persen.
  - source_sentence: 'Sektor konstruksi Indonesia: data statistik 1990-2013'
    sentences:
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Provinsi dan Lapangan Pekerjaan Utama, 2023
      - Direktori Perusahaan Kehutanan 2019
      - Sensus Ekonomi 2006 Hasil Pendaftaran Perusahaan Sumatera Selatan
  - source_sentence: Perdagangan luar negeri, impor, Oktober 2020
    sentences:
      - Indikator Ekonomi September 2005
      - Statistik Potensi Desa Provinsi DI Yogyakarta 2005
      - Indikator Ekonomi November 1999
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic search mini v2 eval
          type: allstats-semantic-search-mini-v2-eval
        metrics:
          - type: pearson_cosine
            value: 0.9617082550278393
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8518022238549516
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic search mini v2 test
          type: allstat-semantic-search-mini-v2-test
        metrics:
          - type: pearson_cosine
            value: 0.9604638064122318
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8480797444308495
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the allstats-semantic-search-synthetic-dataset-v2-mini dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

Model Details

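Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v2-mini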

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
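In the Pooling module above, only pooling_mode_mean_tokens is enabled, so each sentence embedding is the average of its token embeddings with padded positions left out. The following is a minimal NumPy sketch of that operation, not the library's internal code; the helper name `mean_pool` and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                  # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                  # guard against empty rows
    return summed / counts

# Toy batch: 2 sentences, up to 3 tokens each, 4-dim vectors (the real model uses 384).
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # third token is padding
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

The first sentence averages only its two real tokens, which is why the attention mask has to be applied before dividing.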

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2-2")
# Run inference
sentences = [
    'Perdagangan luar negeri, impor, Oktober 2020',
    'Indikator Ekonomi November 1999',
    'Indikator Ekonomi September 2005',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
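
For semantic search, the usual next step is to rank candidate titles against a single query by cosine similarity. The sketch below shows that ranking with plain NumPy; `rank_by_cosine` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real embeddings so the example runs without downloading the model.

```python
import numpy as np

def rank_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_similarity) pairs, best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each document to the query
    order = np.argsort(-scores)         # descending by score
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for the 384-dim output of model.encode(...).
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])
ranking = rank_by_cosine(query, docs)
print(ranking)
```

With the real model, the toy arrays would presumably be replaced by `model.encode(query_text)` and `model.encode(list_of_titles)`.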

Evaluation

Metrics

Semantic Similarity

| Metric | allstats-semantic-search-mini-v2-eval | allstat-semantic-search-mini-v2-test |
|---|---|---|
| pearson_cosine | 0.9617 | 0.9605 |
| spearman_cosine | 0.8518 | 0.8481 |
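
Both metrics compare the cosine similarity of each (query, doc) embedding pair with its gold label: pearson_cosine measures linear agreement, and spearman_cosine measures agreement in rank order. The sketch below is one way to compute them with NumPy only. It is not the evaluator's own code, and the function names and toy data are assumptions.

```python
import numpy as np

def cosine_scores(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, dim) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def average_ranks(x: np.ndarray) -> np.ndarray:
    """1-based ranks; tied values share the mean of their positions."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y) -> float:
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y) -> float:
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(average_ranks(np.asarray(x)), average_ranks(np.asarray(y)))

# Toy 2-dim "embeddings" for four (query, doc) pairs and their gold labels.
queries = np.array([[1.0, 0.0]] * 4)
docs = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.0]])
labels = np.array([0.0, 0.5, 0.6, 1.0])

scores = cosine_scores(queries, docs)
print("pearson_cosine:", round(pearson(scores, labels), 4))
print("spearman_cosine:", round(spearman(scores, labels), 4))
```

A gap between the two metrics, as in the table above, suggests that predicted scores track label values closely overall while the relative order of some pairs still differs.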

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v2-mini

  • Dataset: allstats-semantic-search-synthetic-dataset-v2-mini at 8222b01
  • Size: 70,280 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    | | query | doc | label |
    |---|---|---|---|
    | type | string | string | float |
    | details | min: 3 tokens, mean: 10.92 tokens, max: 50 tokens | min: 4 tokens, mean: 14.68 tokens, max: 59 tokens | min: 0.0, mean: 0.52, max: 1.0 |
  • Samples:
    | query | doc | label |
    |---|---|---|
    | Statistik perusahaan pembudidaya tanaman kehutanan 2018 | Statistik Perusahaan Pembudidaya Tanaman Kehutanan 2018 | 0.97 |
    | Berapa persen pertumbuhan PDB Indonesia pada Triwulan III Tahun 2002? | Inflasi Bulan November 2002 Sebesar 1,85 % | 0.0 |
    | Perdagangan luar negeri Indonesia, impor 2019, jilid 2 | Pendataan Sapi Potong Sapi Perah (PSPK 2011) Sulawesi Barat | 0.06 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
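
With loss_fct set to MSELoss, CosineSimilarityLoss amounts to the mean squared error between the cosine similarity of each (query, doc) embedding pair and its float label. The NumPy sketch below illustrates that objective on toy vectors; the function name and inputs are assumptions, and it is not the library's implementation, which operates on torch tensors during training.

```python
import numpy as np

def cosine_similarity_loss(query_embs: np.ndarray, doc_embs: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between pairwise cosine similarity and the target label."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    predicted = (q * d).sum(axis=1)          # cosine similarity per pair
    return float(np.mean((predicted - labels) ** 2))

# Two toy pairs: the first is identical (cosine 1), the second orthogonal (cosine 0).
query_embs = np.array([[1.0, 0.0], [1.0, 0.0]])
doc_embs = np.array([[1.0, 0.0], [0.0, 1.0]])

perfect = cosine_similarity_loss(query_embs, doc_embs, np.array([1.0, 0.0]))
off_by_half = cosine_similarity_loss(query_embs, doc_embs, np.array([0.5, 0.5]))
print(perfect, off_by_half)
```

Because the labels in this dataset are graded scores between 0.0 and 1.0 rather than binary, a regression-style objective like this fits them directly.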
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v2-mini

  • Dataset: allstats-semantic-search-synthetic-dataset-v2-mini at 8222b01
  • Size: 15,060 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    | | query | doc | label |
    |---|---|---|---|
    | type | string | string | float |
    | details | min: 4 tokens, mean: 10.96 tokens, max: 48 tokens | min: 4 tokens, mean: 14.74 tokens, max: 70 tokens | min: 0.0, mean: 0.5, max: 1.0 |
  • Samples:
    | query | doc | label |
    |---|---|---|
    | Review PDRB daerah di Pulau Sumatera 2010-2013 | Statistik Pendidikan 2006 | 0.12 |
    | Analisis data angkatan kerja Agustus 2021 | Booklet Survei Angkatan Kerja Nasional Agustus 2021 | 0.9 |
    | Berapa persen inflasi yang terjadi pada Juli 2015? | Inflasi pada bulan lain tidak disebutkan | 0.0 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 24
  • warmup_ratio: 0.1
  • bf16: True
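
These settings determine the length of the run and the shape of the learning-rate curve. The sketch below works through the arithmetic. It assumes a single device and no gradient accumulation, and it assumes that warmup steps are rounded up and that the linear scheduler decays to zero at the final step. Constant names and the `linear_lr` helper are hypothetical. The resulting total matches the 26376 steps reported in the Training Logs.

```python
import math

TRAIN_SAMPLES = 70_280   # size of the training dataset
BATCH_SIZE = 64          # per_device_train_batch_size (single device assumed)
EPOCHS = 24              # num_train_epochs
WARMUP_RATIO = 0.1       # warmup_ratio
PEAK_LR = 5e-5           # learning_rate

# dataloader_drop_last is False, so the final partial batch counts as a step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # rounding up is an assumption

def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(500), linear_lr(warmup_steps), linear_lr(total_steps))
```

The same arithmetic explains the Epoch column in the Training Logs: step 500 divided by 1099 steps per epoch gives roughly 0.4550.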

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 24
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-search-mini-v2-eval_spearman_cosine | allstat-semantic-search-mini-v2-test_spearman_cosine |
|---|---|---|---|---|---|
| 0.4550 | 500 | 0.0643 | 0.0413 | 0.6996 | - |
| 0.9099 | 1000 | 0.0348 | 0.0280 | 0.7533 | - |
| 1.3649 | 1500 | 0.0254 | 0.0238 | 0.7737 | - |
| 1.8198 | 2000 | 0.0223 | 0.0205 | 0.7831 | - |
| 2.2748 | 2500 | 0.0181 | 0.0197 | 0.7894 | - |
| 2.7298 | 3000 | 0.0173 | 0.0184 | 0.7876 | - |
| 3.1847 | 3500 | 0.0152 | 0.0170 | 0.7954 | - |
| 3.6397 | 4000 | 0.0123 | 0.0175 | 0.7970 | - |
| 4.0946 | 4500 | 0.0125 | 0.0163 | 0.8118 | - |
| 4.5496 | 5000 | 0.01 | 0.0161 | 0.8047 | - |
| 5.0045 | 5500 | 0.0103 | 0.0157 | 0.8126 | - |
| 5.4595 | 6000 | 0.0079 | 0.0150 | 0.8224 | - |
| 5.9145 | 6500 | 0.0087 | 0.0156 | 0.8219 | - |
| 6.3694 | 7000 | 0.0071 | 0.0152 | 0.8145 | - |
| 6.8244 | 7500 | 0.0068 | 0.0153 | 0.8172 | - |
| 7.2793 | 8000 | 0.0061 | 0.0147 | 0.8216 | - |
| 7.7343 | 8500 | 0.0062 | 0.0146 | 0.8267 | - |
| 8.1893 | 9000 | 0.0055 | 0.0145 | 0.8325 | - |
| 8.6442 | 9500 | 0.005 | 0.0146 | 0.8335 | - |
| 9.0992 | 10000 | 0.0052 | 0.0143 | 0.8356 | - |
| 9.5541 | 10500 | 0.0043 | 0.0144 | 0.8313 | - |
| 10.0091 | 11000 | 0.0051 | 0.0144 | 0.8362 | - |
| 10.4641 | 11500 | 0.004 | 0.0145 | 0.8376 | - |
| 10.9190 | 12000 | 0.0039 | 0.0142 | 0.8331 | - |
| 11.3740 | 12500 | 0.0034 | 0.0141 | 0.8397 | - |
| 11.8289 | 13000 | 0.0033 | 0.0140 | 0.8398 | - |
| 12.2839 | 13500 | 0.0032 | 0.0143 | 0.8411 | - |
| 12.7389 | 14000 | 0.003 | 0.0141 | 0.8407 | - |
| 13.1938 | 14500 | 0.0031 | 0.0141 | 0.8379 | - |
| 13.6488 | 15000 | 0.0026 | 0.0141 | 0.8419 | - |
| 14.1037 | 15500 | 0.0028 | 0.0141 | 0.8442 | - |
| 14.5587 | 16000 | 0.0023 | 0.0138 | 0.8455 | - |
| 15.0136 | 16500 | 0.0025 | 0.0147 | 0.8359 | - |
| 15.4686 | 17000 | 0.0021 | 0.0141 | 0.8459 | - |
| 15.9236 | 17500 | 0.0023 | 0.0140 | 0.8433 | - |
| 16.3785 | 18000 | 0.002 | 0.0139 | 0.8465 | - |
| 16.8335 | 18500 | 0.002 | 0.0139 | 0.8461 | - |
| 17.2884 | 19000 | 0.0018 | 0.0139 | 0.8482 | - |
| 17.7434 | 19500 | 0.0018 | 0.0138 | 0.8477 | - |
| 18.1984 | 20000 | 0.0017 | 0.0138 | 0.8503 | - |
| 18.6533 | 20500 | 0.0016 | 0.0136 | 0.8493 | - |
| 19.1083 | 21000 | 0.0016 | 0.0139 | 0.8501 | - |
| 19.5632 | 21500 | 0.0015 | 0.0138 | 0.8478 | - |
| 20.0182 | 22000 | 0.0015 | 0.0139 | 0.8501 | - |
| 20.4732 | 22500 | 0.0013 | 0.0139 | 0.8508 | - |
| 20.9281 | 23000 | 0.0015 | 0.0139 | 0.8511 | - |
| 21.3831 | 23500 | 0.0013 | 0.0139 | 0.8517 | - |
| 21.8380 | 24000 | 0.0013 | 0.0139 | 0.8512 | - |
| 22.2930 | 24500 | 0.0012 | 0.0139 | 0.8512 | - |
| 22.7480 | 25000 | 0.0012 | 0.0138 | 0.8520 | - |
| 23.2029 | 25500 | 0.0012 | 0.0139 | 0.8520 | - |
| 23.6579 | 26000 | 0.0011 | 0.0139 | 0.8518 | - |
| 24.0 | 26376 | - | - | - | 0.8481 |

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}