SentenceTransformer based on BAAI/bge-base-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: en
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
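The three modules above amount to taking the hidden state of the first ([CLS]) token and scaling it to unit length. The following is a minimal NumPy sketch of that pooling and normalization step only. It uses random arrays in place of real BertModel outputs, and the function name `cls_pool_and_normalize` is a hypothetical helper, not part of the library.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Mimic Pooling(pooling_mode_cls_token=True) followed by Normalize().

    token_embeddings: array of shape (batch, seq_len, hidden).
    Returns an array of shape (batch, hidden) with unit L2 norm per row.
    """
    cls = token_embeddings[:, 0, :]  # first token of each sequence
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)  # guard against division by zero

rng = np.random.default_rng(0)
fake_hidden_states = rng.normal(size=(3, 16, 768))  # stand-in for BertModel output
sentence_embeddings = cls_pool_and_normalize(fake_hidden_states)
print(sentence_embeddings.shape)  # (3, 768)
```

Because every row has unit norm, cosine similarity between two embeddings reduces to a plain dot product.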
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v14")
# Run inference
sentences = [
"office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Eligibility criteria include being at least 35 years old, appropriate qualifications in the field of data protection law gained through relevant professional experience. The Commissioner's term is for five years, which can be extended once. The Commissioner has the responsibility to act as the primary office responsible for enforcing the Federal Data Protection Act within Germany. Some of the office's key responsibilities include: Advising the Bundestag, the Bundesrat, and the Federal Government on administrative and legislative measures related to data protection within the country; To oversee and implement both the GDPR and Federal Data Protection Act within Germany; To promote awareness within the public related to the risks, rules, safeguards, and rights concerning the processing of personal data; To handle all, within Germany. It supplements and aligns with the requirements of the EU GDPR. Yes, Germany is covered by GDPR (General Data Protection Regulation). GDPR is a regulation that applies uniformly across all EU member states, including Germany. The Federal Data Protection Act established the office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Germany's interpretation is the Bundesdatenschutzgesetz (BDSG), the German Federal Data Protection Act. It mirrors the GDPR in all key areas while giving local German regulatory authorities the power to enforce it more efficiently nationally. ## Join Our Newsletter Get all the latest information, law updates and more delivered to your inbox ### Share Copy 14 ### More Stories that May Interest You View More",
'What are the main responsibilities of the Federal Commissioner for Data Protection and Freedom of Information in enforcing data protection laws in Germany, including the GDPR and the Federal Data Protection Act?',
'What is the collection and use of personal information by businesses?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
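Since the model was trained with MatryoshkaLoss at 768, 512, 256, 128, and 64 dimensions, the leading components of each embedding can likely be used as a smaller embedding. The sketch below shows the idea with NumPy: keep the first `dim` components, renormalize, and compare. The random array stands in for the output of `model.encode(sentences)`, and `truncate_and_normalize` is a hypothetical helper.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for `model.encode(sentences)`; real embeddings would come from the model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768))

small = truncate_and_normalize(embeddings, 256)
# For unit-length rows, cosine similarity is just the dot product.
similarities = small @ small.T
print(small.shape, similarities.shape)  # (3, 256) (3, 3)
```

Sentence Transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`; whether the truncated vectors are renormalized is worth checking in the library documentation before relying on dot-product scores.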
Evaluation
Metrics
Information Retrieval
- Dataset: dim_768
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6804 |
cosine_accuracy@3 | 0.9072 |
cosine_accuracy@5 | 0.9485 |
cosine_accuracy@10 | 0.9691 |
cosine_precision@1 | 0.6804 |
cosine_precision@3 | 0.3024 |
cosine_precision@5 | 0.1897 |
cosine_precision@10 | 0.0969 |
cosine_recall@1 | 0.6804 |
cosine_recall@3 | 0.9072 |
cosine_recall@5 | 0.9485 |
cosine_recall@10 | 0.9691 |
cosine_ndcg@10 | 0.8366 |
cosine_mrr@10 | 0.7925 |
cosine_map@100 | 0.7937 |
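In the table above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example, 0.9072 / 3 ≈ 0.3024). That pattern is what one would expect if each query has exactly one relevant passage, which is an inference from the numbers rather than something the card states. Under that assumption, the metrics can be sketched from the rank of the relevant passage alone. The function `ir_metrics` and the example ranks are hypothetical.

```python
def ir_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    ranks: 1-based rank of the single relevant document for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    accuracy = hits / n          # fraction of queries with a hit in the top k
    precision = hits / (n * k)   # one relevant doc among k retrieved slots
    recall = hits / n            # one relevant doc per query, so same as accuracy
    mrr = sum(1.0 / r for r in ranks if r is not None and r <= k) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mrr": mrr}

# Hypothetical ranks for five queries (not taken from the evaluation set).
example_ranks = [1, 1, 2, 4, None]
metrics_at_3 = ir_metrics(example_ranks, k=3)
print(metrics_at_3)
```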
Information Retrieval
- Dataset: dim_512
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6907 |
cosine_accuracy@3 | 0.8763 |
cosine_accuracy@5 | 0.9278 |
cosine_accuracy@10 | 0.9691 |
cosine_precision@1 | 0.6907 |
cosine_precision@3 | 0.2921 |
cosine_precision@5 | 0.1856 |
cosine_precision@10 | 0.0969 |
cosine_recall@1 | 0.6907 |
cosine_recall@3 | 0.8763 |
cosine_recall@5 | 0.9278 |
cosine_recall@10 | 0.9691 |
cosine_ndcg@10 | 0.833 |
cosine_mrr@10 | 0.7889 |
cosine_map@100 | 0.7896 |
Information Retrieval
- Dataset: dim_256
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6907 |
cosine_accuracy@3 | 0.8557 |
cosine_accuracy@5 | 0.8969 |
cosine_accuracy@10 | 0.9278 |
cosine_precision@1 | 0.6907 |
cosine_precision@3 | 0.2852 |
cosine_precision@5 | 0.1794 |
cosine_precision@10 | 0.0928 |
cosine_recall@1 | 0.6907 |
cosine_recall@3 | 0.8557 |
cosine_recall@5 | 0.8969 |
cosine_recall@10 | 0.9278 |
cosine_ndcg@10 | 0.8132 |
cosine_mrr@10 | 0.7759 |
cosine_map@100 | 0.7795 |
Information Retrieval
- Dataset: dim_128
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.5979 |
cosine_accuracy@3 | 0.7732 |
cosine_accuracy@5 | 0.8247 |
cosine_accuracy@10 | 0.8866 |
cosine_precision@1 | 0.5979 |
cosine_precision@3 | 0.2577 |
cosine_precision@5 | 0.1649 |
cosine_precision@10 | 0.0887 |
cosine_recall@1 | 0.5979 |
cosine_recall@3 | 0.7732 |
cosine_recall@5 | 0.8247 |
cosine_recall@10 | 0.8866 |
cosine_ndcg@10 | 0.7462 |
cosine_mrr@10 | 0.701 |
cosine_map@100 | 0.7047 |
Information Retrieval
- Dataset: dim_64
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.5155 |
cosine_accuracy@3 | 0.6907 |
cosine_accuracy@5 | 0.7113 |
cosine_accuracy@10 | 0.7732 |
cosine_precision@1 | 0.5155 |
cosine_precision@3 | 0.2302 |
cosine_precision@5 | 0.1423 |
cosine_precision@10 | 0.0773 |
cosine_recall@1 | 0.5155 |
cosine_recall@3 | 0.6907 |
cosine_recall@5 | 0.7113 |
cosine_recall@10 | 0.7732 |
cosine_ndcg@10 | 0.6471 |
cosine_mrr@10 | 0.6064 |
cosine_map@100 | 0.6137 |
Training Details
Training Dataset
Unnamed Dataset
- Size: 7,872 training samples
- Columns: positive and anchor
- Approximate statistics based on the first 1000 samples:
 | positive | anchor |
---|---|---|
type | string | string |
details | min: 18 tokens, mean: 206.12 tokens, max: 414 tokens | min: 9 tokens, mean: 21.62 tokens, max: 102 tokens |
- Samples:
positive | anchor |
---|---|
Automation PrivacyCenter.Cloud Data Mapping on both in terms of material and territorial scope. ### 1.1 Material Scope The Spanish data protection law affords blanket protection for all data that may have been collected on a data subject. There are only a handful of exceptions that include: Information subject to a pending legal case Information collected concerning the investigation of terrorism or organised crime Information classified as "Confidential" for matters related to Spain's national security ### 1.2 Territorial Scope The Spanish data protection law applies to all data handlers that are: Carrying out data collection activities in Spain Not established in Spain but carrying out data collection activities on Spanish territory Not established within the European Union but carrying out data collection activities on Spanish residents unless for data transit purposes only ## 2. Obligations for Organizations Under Spanish Data Protection Law The Spanish data protection law and GDPR lay out specific obligations for all data handlers. These obligations ensure, . ### 2.3 Privacy Policy Requirements Spain's data protection law requires all data handlers to inform the data subject of the following in their privacy policy: The purpose of collecting the data and the recipients of the information The obligatory or voluntary nature of the reply to the questions put to them The consequences of obtaining the data or of refusing to provide them The possibility of exercising rights of access, rectification, erasure, portability, and objection The identity and address of the controller or their local Spanish representative ### 2.4 Security Requirements Article 9 of Spain's Data Protection Law is direct and explicit in stating the responsibility of the data handler is to take adequate measures to ensure the protection of any data collected. It mandates all data handlers to adopt technical and organisational measures necessary to ensure the security of the personal data and prevent their alteration, loss, and unauthorised processing or access. Additionally, collection of any | What are the requirements for organizations under the Spanish data protection law regarding privacy policies and security measures? |
before the point of collection of their personal information. ## Right to Erasure The right to erasure gives consumers the right to request deleting all their data stored by the organization. Organizations are supposed to comply within 45 days and must deliver a report to the consumer confirming the deletion of their information. ## Right to Opt-in for Minors Personal information containing minors' personal information cannot be sold by a business unless the minor (age of 13 to 16 years) or the Parent/Guardian (if the minor is aged below 13 years) opt-ins to allow this sale. Businesses can be held liable for the sale of minors' personal information if they either knew or wilfully disregarded the consumer's status as a minor and the minor or Parent/Guardian had not willingly opted in. ## Right to Continued Protection Even when consumers choose to allow a business to collect and sell their personal information, businesses' must sign written | What are the conditions under which businesses can sell minors' personal information? |
- Loss: MatryoshkaLoss with these parameters: { "loss": "MultipleNegativesRankingLoss", "matryoshka_dims": [ 768, 512, 256, 128, 64 ], "matryoshka_weights": [ 1, 1, 1, 1, 1 ], "n_dims_per_step": -1 }
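To make the loss configuration above concrete, here is a NumPy sketch of the general technique: an in-batch negatives ranking loss evaluated on truncated prefixes of the embeddings and summed with per-dimension weights. It is an illustration under stated assumptions, not the library's implementation. The function names are hypothetical, the similarity scale of 20.0 is an assumed value, and the random vectors stand in for real anchor and positive embeddings.

```python
import numpy as np

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities; the i-th positive is the
    correct match for the i-th anchor, all other positives act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(anchors, positives, dims=(768, 512, 256, 128, 64), weights=None):
    """Weighted sum of the ranking loss over truncated prefixes of the embeddings."""
    weights = weights or [1.0] * len(dims)
    return sum(w * in_batch_ranking_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))  # positives close to their anchors
print(matryoshka_loss(anchors, positives))
```

With `n_dims_per_step` set to -1, every listed dimension contributes at each step, which is what the sum over all `dims` mirrors here.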
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 2
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: True
- tf32: True
- load_best_model_at_end: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates
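The learning rate, warmup ratio, and cosine scheduler above combine into a schedule that ramps up linearly and then decays along a half cosine. The sketch below shows the general shape of such a schedule in plain Python. It has not been checked against the Trainer's exact implementation, `lr_at_step` is a hypothetical helper, and using 492 as the total step count is an assumption based on the last step in the training log.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup followed by half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 492 is the last step in the training log; treating it as the total is an assumption.
TOTAL_STEPS = 492
for step in (0, 25, 50, 271, 492):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```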
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: True
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
---|---|---|---|---|---|---|---|
0.0407 | 10 | 7.3941 | - | - | - | - | - |
0.0813 | 20 | 6.0968 | - | - | - | - | - |
0.1220 | 30 | 4.9439 | - | - | - | - | - |
0.1626 | 40 | 3.8622 | - | - | - | - | - |
0.2033 | 50 | 3.0938 | - | - | - | - | - |
0.2439 | 60 | 1.8775 | - | - | - | - | - |
0.2846 | 70 | 2.3808 | - | - | - | - | - |
0.3252 | 80 | 4.0718 | - | - | - | - | - |
0.3659 | 90 | 2.2182 | - | - | - | - | - |
0.4065 | 100 | 1.914 | - | - | - | - | - |
0.4472 | 110 | 1.5123 | - | - | - | - | - |
0.4878 | 120 | 1.7047 | - | - | - | - | - |
0.5285 | 130 | 2.9509 | - | - | - | - | - |
0.5691 | 140 | 1.0605 | - | - | - | - | - |
0.6098 | 150 | 1.762 | - | - | - | - | - |
0.6504 | 160 | 1.6545 | - | - | - | - | - |
0.6911 | 170 | 3.0971 | - | - | - | - | - |
0.7317 | 180 | 1.3791 | - | - | - | - | - |
0.7724 | 190 | 1.9717 | - | - | - | - | - |
0.8130 | 200 | 5.1309 | - | - | - | - | - |
0.8537 | 210 | 1.4047 | - | - | - | - | - |
0.8943 | 220 | 1.4391 | - | - | - | - | - |
0.9350 | 230 | 3.6443 | - | - | - | - | - |
0.9756 | 240 | 3.721 | - | - | - | - | - |
1.0122 | 249 | - | 0.6625 | 0.7330 | 0.7497 | 0.5784 | 0.7568 |
1.0041 | 250 | 1.3171 | - | - | - | - | - |
1.0447 | 260 | 5.2603 | - | - | - | - | - |
1.0854 | 270 | 4.0513 | - | - | - | - | - |
1.1260 | 280 | 2.5508 | - | - | - | - | - |
1.1667 | 290 | 1.7385 | - | - | - | - | - |
1.2073 | 300 | 1.1692 | - | - | - | - | - |
1.2480 | 310 | 0.788 | - | - | - | - | - |
1.2886 | 320 | 1.2322 | - | - | - | - | - |
1.3293 | 330 | 3.3735 | - | - | - | - | - |
1.3699 | 340 | 1.2204 | - | - | - | - | - |
1.4106 | 350 | 0.8458 | - | - | - | - | - |
1.4512 | 360 | 0.7586 | - | - | - | - | - |
1.4919 | 370 | 0.8964 | - | - | - | - | - |
1.5325 | 380 | 1.9721 | - | - | - | - | - |
1.5732 | 390 | 0.5605 | - | - | - | - | - |
1.6138 | 400 | 0.9648 | - | - | - | - | - |
1.6545 | 410 | 1.0002 | - | - | - | - | - |
1.6951 | 420 | 2.138 | - | - | - | - | - |
1.7358 | 430 | 0.8221 | - | - | - | - | - |
1.7764 | 440 | 2.124 | - | - | - | - | - |
1.8171 | 450 | 2.7892 | - | - | - | - | - |
1.8577 | 460 | 0.9088 | - | - | - | - | - |
1.8984 | 470 | 0.9254 | - | - | - | - | - |
1.9390 | 480 | 3.1205 | - | - | - | - | - |
1.9797 | 490 | 3.014 | - | - | - | - | - |
1.9878 | 492 | - | 0.7047 | 0.7795 | 0.7896 | 0.6137 | 0.7937 |
- The saved checkpoint is the final row (step 492); its cosine_map@100 scores match the values in the Evaluation tables.
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.1.2+cu121
- Accelerate: 0.31.0
- Datasets: 2.19.1
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}