|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- generated_from_trainer |
|
- dataset_size:24593 |
|
- loss:CoSENTLoss |
|
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
- pearson_manhattan |
|
- spearman_manhattan |
|
- pearson_euclidean |
|
- spearman_euclidean |
|
- pearson_dot |
|
- spearman_dot |
|
- pearson_max |
|
- spearman_max |
|
model-index: |
|
- name: >- |
|
SentenceTransformer based on |
|
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
name: Unknown |
|
type: unknown |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.03594393239556079 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: -0.00047007527052389596 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.02486157492330912 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: -0.002126248151952068 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.024692776461385596 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: -0.0020342683424227027 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: -0.005055107350691934 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.0015424580293819054 |
|
name: Spearman Dot |
|
- type: pearson_max |
|
value: 0.03594393239556079 |
|
name: Pearson Max |
|
- type: spearman_max |
|
value: 0.0015424580293819054 |
|
name: Spearman Max |
|
license: mit |
|
language: |
|
- nl |
|
--- |
|
|
|
# SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for stylistic and semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. I personally used it to give LLM-generated sentences a rating between 0 and 1 for how well they match the writing style of the city of Antwerp.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) <!-- at revision 8d6b950845285729817bf8e1af1861502c2fed0c --> |
|
- **Maximum Sequence Length:** 128 tokens |
|
- **Output Dimensionality:** 384 dimensions
|
- **Similarity Function:** Cosine Similarity |
|
<!-- - **Training Dataset:** Unknown --> |
|
- **Language:** Dutch, Flemish |
|
<!-- - **License:** Unknown --> |
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
) |
|
``` |
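The loaded model exposes these settings programmatically; a small sketch for verifying them (the model id is a placeholder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id
print(model.max_seq_length)                      # 128
print(model.get_sentence_embedding_dimension())  # 384
```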
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("sentence_transformers_model_id")  # placeholder: replace with this model's repo id
|
# Run inference |
|
sentences = [ |
|
'"Daarnaast willen ze hun bestaande platform DETECT, waarmee onderzoekers unieke inzichten kunnen verwerven in de respons tegen een vaccin, commercialiseren."', |
|
'"Ze zijn van plan om het platform DETECT, dat onderzoekers helpt bij het verkrijgen van unieke inzichten over hoe een vaccin reageert, verder te ontwikkelen en commercieel beschikbaar te maken."', |
|
'"In februari 2020 hield buurtcomit Stadspark een eerste gesprek over het Stadspark."', |
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# [3, 384] |
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape) |
|
# [3, 3] |
|
``` |
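Since `model.similarity` returns cosine similarities in [-1, 1], a simple style score in [0, 1] (as described above) can be derived by comparing a candidate sentence against reference sentences written in the target style. A minimal sketch; the references below are copied from the samples further down and stand in for a real reference set:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Illustrative reference sentences in the target (city of Antwerp) style
references = [
    "Bij een noodsituatie zoals een grote brand stuurt BE-Alert automatisch berichten uit.",
    "Vrouwen van 50 tot 69 jaar ontvangen een uitnodiging voor een gratis mammografie.",
]
candidate = "In een noodgeval waarschuwt BE-Alert ons direct via sms."

ref_emb = model.encode(references)
cand_emb = model.encode([candidate])

# Mean cosine similarity against the references, clamped to [0, 1]
scores = model.similarity(cand_emb, ref_emb)  # shape [1, 2]
style_score = min(1.0, max(0.0, float(scores.mean())))
print(style_score)
```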
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
#### Semantic Similarity |
|
|
|
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) |
|
|
|
| Metric | Value | |
|
|:--------------------|:------------| |
|
| pearson_cosine | 0.0359 | |
|
| **spearman_cosine** | **-0.0005** | |
|
| pearson_manhattan | 0.0249 | |
|
| spearman_manhattan | -0.0021 | |
|
| pearson_euclidean | 0.0247 | |
|
| spearman_euclidean  | -0.0020     |
|
| pearson_dot | -0.0051 | |
|
| spearman_dot | 0.0015 | |
|
| pearson_max | 0.0359 | |
|
| spearman_max | 0.0015 | |
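The evaluator can also be invoked directly on labelled pairs; a hedged sketch with illustrative data (the real evaluation set is described under Training Details):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Bij een noodsituatie stuurt BE-Alert automatisch berichten uit."],
    sentences2=["In een noodgeval waarschuwt BE-Alert ons direct via sms."],
    scores=[1.0],
    name="dev",
)
print(evaluator(model))  # dict with pearson/spearman metrics per distance function
```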
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
|
|
* Size: 24,593 training samples |
|
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code> |
|
* Approximate statistics based on the first 1000 samples: |
|
| | sentence1 | sentence2 | label | |
|
|:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:---------------------------------------------------------------| |
|
| type | string | string | float | |
|
| details | <ul><li>min: 18 tokens</li><li>mean: 34.72 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 34.48 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.63</li><li>max: 1.0</li></ul> | |
|
* Samples: |
|
| sentence1 | sentence2 | label | |
|
|:-------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------| |
|
| <code>"Bij een noodsituatie zoals een grote brand, een overstroming of een stroomonderbreking stuurt BE-Alert automatisch berichten uit."</code> | <code>"In een noodgeval zoals een grote brand, een overstroming of een stroomuitval, waarschuwt BE-Alert ons direct via sms."</code> | <code>1.0</code> | |
|
| <code>"Nationale test BE-Alert 18 steden en gemeenten in de provincie Antwerpen namen deel aan de nationale test op donderdag 7 oktober 2021."</code> | <code>"In de provincie Antwerpen deden 18 stadsdelen en districten mee aan de nationale test van BE-Alert op donderdag 7 oktober 2021."</code> | <code>0.9</code> | |
|
| <code>"Vrouwen van 50 tot 69 jaar die de voorbije 2 jaar geen mammografie lieten maken, ontvangen een uitnodiging voor een gratis mammografie."</code> | <code>"Vrouwen tussen de 50 en 69 jaar die de afgelopen twee jaar geen mammografie hebben laten doen, ontvangen een uitnodiging voor een gratis mammografie."</code> | <code>1.0</code> | |
|
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters: |
|
```json |
|
{ |
|
"scale": 20.0, |
|
"similarity_fct": "pairwise_cos_sim" |
|
} |
|
``` |
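A minimal sketch of how a dataset with these columns and this loss could be set up; the single pair below is copied from the samples above and stands in for the full 24,593 rows:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Column names must match: sentence1, sentence2, label
train_dataset = Dataset.from_dict({
    "sentence1": ["Bij een noodsituatie zoals een grote brand stuurt BE-Alert automatisch berichten uit."],
    "sentence2": ["In een noodgeval zoals een grote brand waarschuwt BE-Alert ons direct via sms."],
    "label": [1.0],
})

# scale=20.0 and pairwise cosine similarity are the CoSENTLoss defaults,
# matching the parameters listed above
loss = CoSENTLoss(model, scale=20.0)
```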
|
|
|
### Evaluation Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
|
|
* Size: 10,540 evaluation samples |
|
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code> |
|
* Approximate statistics based on the first 1000 samples: |
|
| | sentence1 | sentence2 | label | |
|
|:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:---------------------------------------------------------------| |
|
| type | string | string | float | |
|
| details | <ul><li>min: 18 tokens</li><li>mean: 37.23 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 13 tokens</li><li>mean: 36.14 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.64</li><li>max: 1.0</li></ul> | |
|
* Samples: |
|
| sentence1 | sentence2 | label | |
|
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------| |
|
| <code>"Op dinsdag 23 mei verschijnt de Stadskroniek ‘Tingeling. 150 jaar tram in Antwerpen’ Deze Stadskroniek neemt de lezer mee in het dagelijkse leven van de reizigers en de bemanning van de trams in Antwerpen."</code> | <code>"Op dinsdag 23 mei verschijnt de Stadskroniek 'Tingeling. 150 jaar tram in Antwerpen'. Deze Stadskroniek neemt je mee in het dagelijkse leven van de reizigers en de bemanning van de trams in Antwerpen."</code> | <code>1.0</code> | |
|
| <code>"De pers wordt vriendelijk uitgenodigd op de lancering van de Stadskroniek ‘Tingeling. 150 jaar tram in Antwerpen’ op dinsdag 23 mei om 20 uur in het Vlaams Tram- en Autobusmuseum, Diksmuidelaan 42, 2600 Antwerpen Verwelkoming door Bob Morren, auteur Toespraak door Nabilla Ait Daoud, schepen voor cultuur Toespraak door Koen Kennis, schepen voor mobiliteit Korte gegidste rondleiding in het trammuseum door Bob Morren Stadskronieken zijn erfgoedverhalen over Antwerpen en de Antwerpse districten."</code> | <code>"De pers is van harte uitgenodigd voor de lancering van 'Tingeling. 150 jaar tram in Antwerpen' op dinsdag 23 mei om 20 uur bij het Vlaams Tram- en Autobusmuseum, Diksmuidelaan 42, in Antwerpen. Bob Morren, bekend van zijn boek 'Toespraak door Nabilla Ait Daoud, schepen voor cultuur, zal de avond openen met een welkomstwoord. Ook Koen Kennis, schepen voor mobiliteit, spreekt over de impact van trams op onze stad. Na deze toespraken volgt een korte rondleiding door Bob Morren in het museum. Stadskronieken zijn verhalen die ons erfgoed vieren en leren over Antwerpen en haar districten."</code> | <code>1.0</code> | |
|
|
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters: |
|
```json |
|
{ |
|
"scale": 20.0, |
|
"similarity_fct": "pairwise_cos_sim" |
|
} |
|
``` |
|
|
|
### Training Hyperparameters |
|
#### Non-Default Hyperparameters |
|
|
|
- `eval_strategy`: steps |
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 32 |
|
- `learning_rate`: 4e-06 |
|
- `num_train_epochs`: 2 |
|
- `fp16`: True |
|
- `load_best_model_at_end`: True |
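
The non-default values above correspond to the Sentence Transformers v3 Trainer API; a hedged, minimal sketch of a matching training run (`output_dir` is a placeholder, and the one-pair datasets stand in for the real ones described under Training Details):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Stand-in datasets with the sentence1/sentence2/label column format
pair = {
    "sentence1": ["Bij een noodsituatie stuurt BE-Alert automatisch berichten uit."],
    "sentence2": ["In een noodgeval waarschuwt BE-Alert ons direct via sms."],
    "label": [1.0],
}
train_dataset, eval_dataset = Dataset.from_dict(pair), Dataset.from_dict(pair)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    eval_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=4e-6,
    num_train_epochs=2,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=CoSENTLoss(model, scale=20.0),
)
trainer.train()
```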
|
|
|
#### All Hyperparameters |
|
<details><summary>Click to expand</summary> |
|
|
|
- `overwrite_output_dir`: False |
|
- `do_predict`: False |
|
- `eval_strategy`: steps |
|
- `prediction_loss_only`: True |
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 32 |
|
- `per_gpu_train_batch_size`: None |
|
- `per_gpu_eval_batch_size`: None |
|
- `gradient_accumulation_steps`: 1 |
|
- `eval_accumulation_steps`: None |
|
- `torch_empty_cache_steps`: None |
|
- `learning_rate`: 4e-06 |
|
- `weight_decay`: 0.0 |
|
- `adam_beta1`: 0.9 |
|
- `adam_beta2`: 0.999 |
|
- `adam_epsilon`: 1e-08 |
|
- `max_grad_norm`: 1.0 |
|
- `num_train_epochs`: 2 |
|
- `max_steps`: -1 |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_kwargs`: {} |
|
- `warmup_ratio`: 0.0 |
|
- `warmup_steps`: 0 |
|
- `log_level`: passive |
|
- `log_level_replica`: warning |
|
- `log_on_each_node`: True |
|
- `logging_nan_inf_filter`: True |
|
- `save_safetensors`: True |
|
- `save_on_each_node`: False |
|
- `save_only_model`: False |
|
- `restore_callback_states_from_checkpoint`: False |
|
- `no_cuda`: False |
|
- `use_cpu`: False |
|
- `use_mps_device`: False |
|
- `seed`: 42 |
|
- `data_seed`: None |
|
- `jit_mode_eval`: False |
|
- `use_ipex`: False |
|
- `bf16`: False |
|
- `fp16`: True |
|
- `fp16_opt_level`: O1 |
|
- `half_precision_backend`: auto |
|
- `bf16_full_eval`: False |
|
- `fp16_full_eval`: False |
|
- `tf32`: None |
|
- `local_rank`: 0 |
|
- `ddp_backend`: None |
|
- `tpu_num_cores`: None |
|
- `tpu_metrics_debug`: False |
|
- `debug`: [] |
|
- `dataloader_drop_last`: False |
|
- `dataloader_num_workers`: 0 |
|
- `dataloader_prefetch_factor`: None |
|
- `past_index`: -1 |
|
- `disable_tqdm`: False |
|
- `remove_unused_columns`: True |
|
- `label_names`: None |
|
- `load_best_model_at_end`: True |
|
- `ignore_data_skip`: False |
|
- `fsdp`: [] |
|
- `fsdp_min_num_params`: 0 |
|
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} |
|
- `fsdp_transformer_layer_cls_to_wrap`: None |
|
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} |
|
- `deepspeed`: None |
|
- `label_smoothing_factor`: 0.0 |
|
- `optim`: adamw_torch |
|
- `optim_args`: None |
|
- `adafactor`: False |
|
- `group_by_length`: False |
|
- `length_column_name`: length |
|
- `ddp_find_unused_parameters`: None |
|
- `ddp_bucket_cap_mb`: None |
|
- `ddp_broadcast_buffers`: False |
|
- `dataloader_pin_memory`: True |
|
- `dataloader_persistent_workers`: False |
|
- `skip_memory_metrics`: True |
|
- `use_legacy_prediction_loop`: False |
|
- `push_to_hub`: False |
|
- `resume_from_checkpoint`: None |
|
- `hub_model_id`: None |
|
- `hub_strategy`: every_save |
|
- `hub_private_repo`: False |
|
- `hub_always_push`: False |
|
- `gradient_checkpointing`: False |
|
- `gradient_checkpointing_kwargs`: None |
|
- `include_inputs_for_metrics`: False |
|
- `eval_do_concat_batches`: True |
|
- `fp16_backend`: auto |
|
- `push_to_hub_model_id`: None |
|
- `push_to_hub_organization`: None |
|
- `mp_parameters`: |
|
- `auto_find_batch_size`: False |
|
- `full_determinism`: False |
|
- `torchdynamo`: None |
|
- `ray_scope`: last |
|
- `ddp_timeout`: 1800 |
|
- `torch_compile`: False |
|
- `torch_compile_backend`: None |
|
- `torch_compile_mode`: None |
|
- `dispatch_batches`: None |
|
- `split_batches`: None |
|
- `include_tokens_per_second`: False |
|
- `include_num_input_tokens_seen`: False |
|
- `neftune_noise_alpha`: None |
|
- `optim_target_modules`: None |
|
- `batch_eval_metrics`: False |
|
- `eval_on_start`: False |
|
- `use_liger_kernel`: False |
|
- `eval_use_gather_object`: False |
|
- `batch_sampler`: batch_sampler |
|
- `multi_dataset_batch_sampler`: proportional |
|
|
|
</details> |
|
|
|
### Training Logs |
|
| Epoch | Step | Training Loss | Validation Loss | spearman_cosine | |
|
|:----------:|:-------:|:-------------:|:---------------:|:---------------:| |
|
| 0.1664 | 128 | - | 5.8279 | -0.0016 | |
|
| 0.3329 | 256 | - | 5.8067 | -0.0052 | |
|
| 0.4993 | 384 | - | 5.8030 | -0.0042 | |
|
| 0.6502 | 500 | 5.997 | - | - | |
|
| **0.6658** | **512** | **-** | **5.8018** | **-0.0036** | |
|
| 0.8322 | 640 | - | 5.8020 | -0.0023 | |
|
| 0.9987 | 768 | - | 5.8033 | -0.0021 | |
|
| 1.1651 | 896 | - | 5.8056 | -0.0012 | |
|
| 1.3004 | 1000 | 5.7987 | - | - | |
|
| 1.3316 | 1024 | - | 5.8079 | -0.0017 | |
|
| 1.4980 | 1152 | - | 5.8090 | -0.0015 | |
|
| 1.6645 | 1280 | - | 5.8033 | -0.0005 | |
|
| 1.8309 | 1408 | - | 5.8039 | -0.0003 | |
|
| 1.9506 | 1500 | 5.8021 | - | - | |
|
| 1.9974 | 1536 | - | 5.8043 | -0.0005 | |
|
|
|
* The bold row denotes the saved checkpoint. |
|
|
|
### Framework Versions |
|
- Python: 3.11.10 |
|
- Sentence Transformers: 3.2.0 |
|
- Transformers: 4.45.0 |
|
- PyTorch: 2.5.1+cu124 |
|
- Accelerate: 1.1.1 |
|
- Datasets: 3.1.0 |
|
- Tokenizers: 0.20.3 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### CoSENTLoss |
|
```bibtex |
|
@online{kexuefm-8847, |
|
title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT}, |
|
author={Su Jianlin}, |
|
year={2022}, |
|
month={Jan}, |
|
url={https://kexue.fm/archives/8847}, |
|
} |
|
``` |