ODeNy committed on
Commit 797983a · verified · 1 Parent(s): 8ac3d8c

Upload 11 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+unigram.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:24593
- loss:CoSENTLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
model-index:
- name: SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: pearson_cosine
      value: 0.03594393239556079
      name: Pearson Cosine
    - type: spearman_cosine
      value: -0.00047007527052389596
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.02486157492330912
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: -0.002126248151952068
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.024692776461385596
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: -0.0020342683424227027
      name: Spearman Euclidean
    - type: pearson_dot
      value: -0.005055107350691934
      name: Pearson Dot
    - type: spearman_dot
      value: 0.0015424580293819054
      name: Spearman Dot
    - type: pearson_max
      value: 0.03594393239556079
      name: Pearson Max
    - type: spearman_max
      value: 0.0015424580293819054
      name: Spearman Max
---

# SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) <!-- at revision 8d6b950845285729817bf8e1af1861502c2fed0c -->
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
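
The pooling module mean-pools token embeddings into a single 384-dimensional sentence vector, and inputs longer than 128 tokens are truncated. Both properties can be checked programmatically; a minimal sketch, using the same placeholder model id as the usage example below:

```python
from sentence_transformers import SentenceTransformer

# "sentence_transformers_model_id" is a placeholder, as in the usage example below.
model = SentenceTransformer("sentence_transformers_model_id")
print(model.max_seq_length)                      # 128: longer inputs are truncated
print(model.get_sentence_embedding_dimension())  # 384
```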

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    '"Daarnaast willen ze hun bestaande platform DETECT, waarmee onderzoekers unieke inzichten kunnen verwerven in de respons tegen een vaccin, commercialiseren."',
    '"Ze zijn van plan om het platform DETECT, dat onderzoekers helpt bij het verkrijgen van unieke inzichten over hoe een vaccin reageert, verder te ontwikkelen en commercieel beschikbaar te maken."',
    '"In februari 2020 hield buurtcomité Stadspark een eerste gesprek over het Stadspark."',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
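
The introduction also lists semantic search among possible uses. A minimal sketch with `util.semantic_search`; the corpus, query, and model id below are illustrative placeholders (the Dutch strings echo the card's own samples), not part of this repo:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence_transformers_model_id")  # same placeholder as above

# Hypothetical two-document corpus and query, for illustration only.
corpus = [
    "BE-Alert stuurt automatisch berichten uit bij een noodsituatie.",
    "Het Stadspark wordt opgewaardeerd naar zijn oorspronkelijke landschapsstijl.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("Hoe waarschuwt BE-Alert bij een noodgeval?", convert_to_tensor=True)

# Retrieve the best-matching corpus sentence by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
best = hits[0][0]
print(corpus[best["corpus_id"]], best["score"])
```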

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Semantic Similarity

* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value       |
|:--------------------|:------------|
| pearson_cosine      | 0.0359      |
| **spearman_cosine** | **-0.0005** |
| pearson_manhattan   | 0.0249      |
| spearman_manhattan  | -0.0021     |
| pearson_euclidean   | 0.0247      |
| spearman_euclidean  | -0.0020     |
| pearson_dot         | -0.0051     |
| spearman_dot        | 0.0015      |
| pearson_max         | 0.0359      |
| spearman_max        | 0.0015      |

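These figures come from `EmbeddingSimilarityEvaluator`. A minimal sketch of how such an evaluation could be re-run, with a hypothetical two-pair stand-in for the unnamed evaluation set described under Training Details (the model id is the same placeholder as above):

```python
from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Hypothetical pairs standing in for the real 10,540-pair evaluation set.
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Eerste zin.", "Nog een zin."],
    sentences2=["Een eerste zin.", "Een heel andere zin."],
    scores=[0.9, 0.3],  # gold similarity labels in [0, 1]
    main_similarity=SimilarityFunction.COSINE,
)
results = evaluator(model)  # dict with pearson_cosine, spearman_cosine, etc.
print(results)
```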

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 24,593 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | label |
  |:--------|:----------|:----------|:------|
  | type    | string    | string    | float |
  | details | <ul><li>min: 18 tokens</li><li>mean: 34.72 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 34.48 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.63</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | label |
  |:----------|:----------|:------|
  | <code>"Bij een noodsituatie zoals een grote brand, een overstroming of een stroomonderbreking stuurt BE-Alert automatisch berichten uit."</code> | <code>"In een noodgeval zoals een grote brand, een overstroming of een stroomuitval, waarschuwt BE-Alert ons direct via sms."</code> | <code>1.0</code> |
  | <code>"Nationale test BE-Alert 18 steden en gemeenten in de provincie Antwerpen namen deel aan de nationale test op donderdag 7 oktober 2021."</code> | <code>"In de provincie Antwerpen deden 18 stadsdelen en districten mee aan de nationale test van BE-Alert op donderdag 7 oktober 2021."</code> | <code>0.9</code> |
  | <code>"Vrouwen van 50 tot 69 jaar die de voorbije 2 jaar geen mammografie lieten maken, ontvangen een uitnodiging voor een gratis mammografie."</code> | <code>"Vrouwen tussen de 50 en 69 jaar die de afgelopen twee jaar geen mammografie hebben laten doen, ontvangen een uitnodiging voor een gratis mammografie."</code> | <code>1.0</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```
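
A rough reconstruction of this setup with `CoSENTLoss` and `SentenceTransformerTrainer` might look as follows. This is a sketch under the card's stated parameters, not the author's actual training script, and the two-row `Dataset` is a hypothetical stand-in for the 24,593 real samples:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical stand-in for the unnamed dataset (same column names).
train_dataset = Dataset.from_dict({
    "sentence1": ["Eerste zin.", "Tweede zin."],
    "sentence2": ["Een eerste zin.", "Een tekst over iets anders."],
    "label": [1.0, 0.0],
})

# scale=20.0 with the default pairwise cosine similarity, matching the parameters above.
loss = CoSENTLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```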

### Evaluation Dataset

#### Unnamed Dataset

* Size: 10,540 evaluation samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | label |
  |:--------|:----------|:----------|:------|
  | type    | string    | string    | float |
  | details | <ul><li>min: 18 tokens</li><li>mean: 37.23 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 13 tokens</li><li>mean: 36.14 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.64</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | label |
  |:----------|:----------|:------|
  | <code>"Op dinsdag 23 mei verschijnt de Stadskroniek ‘Tingeling. 150 jaar tram in Antwerpen’ Deze Stadskroniek neemt de lezer mee in het dagelijkse leven van de reizigers en de bemanning van de trams in Antwerpen."</code> | <code>"Op dinsdag 23 mei verschijnt de Stadskroniek 'Tingeling. 150 jaar tram in Antwerpen'. Deze Stadskroniek neemt je mee in het dagelijkse leven van de reizigers en de bemanning van de trams in Antwerpen."</code> | <code>1.0</code> |
  | <code>"De pers wordt vriendelijk uitgenodigd op de lancering van de Stadskroniek ‘Tingeling. 150 jaar tram in Antwerpen’ op dinsdag 23 mei om 20 uur in het Vlaams Tram- en Autobusmuseum, Diksmuidelaan 42, 2600 Antwerpen Verwelkoming door Bob Morren, auteur Toespraak door Nabilla Ait Daoud, schepen voor cultuur Toespraak door Koen Kennis, schepen voor mobiliteit Korte gegidste rondleiding in het trammuseum door Bob Morren Stadskronieken zijn erfgoedverhalen over Antwerpen en de Antwerpse districten."</code> | <code>"De pers is van harte uitgenodigd voor de lancering van 'Tingeling. 150 jaar tram in Antwerpen' op dinsdag 23 mei om 20 uur bij het Vlaams Tram- en Autobusmuseum, Diksmuidelaan 42, in Antwerpen. Bob Morren, bekend van zijn boek 'Toespraak door Nabilla Ait Daoud, schepen voor cultuur, zal de avond openen met een welkomstwoord. Ook Koen Kennis, schepen voor mobiliteit, spreekt over de impact van trams op onze stad. Na deze toespraken volgt een korte rondleiding door Bob Morren in het museum. Stadskronieken zijn verhalen die ons erfgoed vieren en leren over Antwerpen en haar districten."</code> | <code>1.0</code> |
  | <code>"Stad Antwerpen Mediagalerijen Categorieën Categorieën Nederlands Herwaarderingsplan Stadspark goedgekeurd Definitief ontwerp voor toekomst historisch park 4 april 2023 District en stad Antwerpen slaan de handen in elkaar om het Stadspark de komende jaren op te waarderen naar zijn oorspronkelijke landschapsstijl, rekening houdend met de hedendaagse noden."</code> | <code>"Stad Antwerpen Mediagalerijen Categorieën Nederlands Herwaarderingsplan Stadspark goedgekeurd Definitief ontwerp voor toekomst historisch park 4 april 2023 District en stad Antwerpen slaan de handen in elkaar om het Stadspark de komende jaren op te waarderen naar zijn oorspronkelijke landschapsstijl, rekening houdend met de hedendaagse noden. 🌳✍️🗓️💼🎉"</code> | <code>0.9</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `learning_rate`: 4e-06
- `num_train_epochs`: 2
- `fp16`: True
- `load_best_model_at_end`: True

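These settings map directly onto `SentenceTransformerTrainingArguments`; a minimal sketch, assuming all other arguments keep their defaults (`"outputs"` is a placeholder path):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Mirrors the non-default hyperparameters listed above.
args = SentenceTransformerTrainingArguments(
    output_dir="outputs",  # placeholder
    eval_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=4e-6,
    num_train_epochs=2,
    fp16=True,
    load_best_model_at_end=True,
)
```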

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 4e-06
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 2
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

</details>

### Training Logs
| Epoch      | Step    | Training Loss | Validation Loss | spearman_cosine |
|:----------:|:-------:|:-------------:|:---------------:|:---------------:|
| 0.1664     | 128     | -             | 5.8279          | -0.0016         |
| 0.3329     | 256     | -             | 5.8067          | -0.0052         |
| 0.4993     | 384     | -             | 5.8030          | -0.0042         |
| 0.6502     | 500     | 5.997         | -               | -               |
| **0.6658** | **512** | **-**         | **5.8018**      | **-0.0036**     |
| 0.8322     | 640     | -             | 5.8020          | -0.0023         |
| 0.9987     | 768     | -             | 5.8033          | -0.0021         |
| 1.1651     | 896     | -             | 5.8056          | -0.0012         |
| 1.3004     | 1000    | 5.7987        | -               | -               |
| 1.3316     | 1024    | -             | 5.8079          | -0.0017         |
| 1.4980     | 1152    | -             | 5.8090          | -0.0015         |
| 1.6645     | 1280    | -             | 5.8033          | -0.0005         |
| 1.8309     | 1408    | -             | 5.8039          | -0.0003         |
| 1.9506     | 1500    | 5.8021        | -               | -               |
| 1.9974     | 1536    | -             | 5.8043          | -0.0005         |

* The bold row denotes the saved checkpoint.

### Framework Versions
- Python: 3.11.10
- Sentence Transformers: 3.2.0
- Transformers: 4.45.0
- PyTorch: 2.5.1+cu124
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### CoSENTLoss
```bibtex
@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}
```

config.json ADDED
{
  "_name_or_path": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.45.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 250037
}

config_sentence_transformers.json ADDED
{
  "__version__": {
    "sentence_transformers": "3.2.0",
    "transformers": "4.45.0",
    "pytorch": "2.5.1+cu124"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": null
}

model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:00ebc17d65eadaff28c4d8a8726bb4a22c2f94f01536de3b4d94a6b07ff2b1d1
size 470637416

modules.json ADDED
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]

sentence_bert_config.json ADDED
{
  "max_seq_length": 128,
  "do_lower_case": false
}

special_tokens_map.json ADDED
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.json ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
size 17082987

tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "max_length": 128,
  "model_max_length": 128,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "<unk>"
}

unigram.json ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:da145b5e7700ae40f16691ec32a0b1fdc1ee3298db22a31ea55f57a966c4a65d
size 14763260