	---
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- generated_from_trainer
	- dataset_size:400
	- loss:MatryoshkaLoss
	- loss:MultipleNegativesRankingLoss
	widget:
	- source_sentence: Which specific areas of law are mentioned as being unaffected by
	    this Regulation?
	  sentences:
	  - (4)
	  - '(45)



	    Practices that are prohibited by Union law, including data protection law, non-discrimination
	    law, consumer protection law, and competition law, should not be affected by this
	    Regulation.













	    (46)'
	  - Union harmonisation legislation in an optimal manner. AI systems identified as
	    high-risk should be limited to those that have a significant harmful impact on
	    the health, safety and fundamental rights of persons in the Union and such limitation
	    should minimise any potential restriction to international trade.
	- source_sentence: How does AI contribute to environmentally beneficial outcomes?
	  sentences:
	  - AI is a fast evolving family of technologies that contributes to a wide array
	    of economic, environmental and societal benefits across the entire spectrum of
	    industries and social activities. By improving prediction, optimising operations
	    and resource allocation, and personalising digital solutions available for individuals
	    and organisations, the use of AI can provide key competitive advantages to undertakings
	    and support socially and environmentally beneficial outcomes, for example in healthcare,
	    agriculture, food safety, education and training, media, sports, culture, infrastructure
	    management, energy, transport and logistics, public services, security, justice,
	    resource and energy efficiency, environmental monitoring, the conservation
	  - To mitigate the risks from high-risk AI systems placed on the market or put into
	    service and to ensure a high level of trustworthiness, certain mandatory requirements
	    should apply to high-risk AI systems, taking into account the intended purpose
	    and the context of use of the AI system and according to the risk-management system
	    to be established by the provider. The measures adopted by the providers to comply
	    with the mandatory requirements of this Regulation should take into account the
	    generally acknowledged state of the art on AI, be proportionate and effective
	    to meet the objectives of this Regulation. Based on the New Legislative Framework,
	    as clarified in Commission notice ‘The “Blue Guide” on the implementation of EU
	    product rules
	  - 'Having regard to the proposal from the European Commission,



	    After transmission of the draft legislative act to the national parliaments,



	    Having regard to the opinion of the European Economic and Social Committee (1),



	    Having regard to the opinion of the European Central Bank (2),



	    Having regard to the opinion of the Committee of the Regions (3),



	    Acting in accordance with the ordinary legislative procedure (4),


	    Whereas:








	    (1)'
	- source_sentence: What is the role of the Commission in providing guidance for the
	    implementation of conditions for non-high-risk AI systems?
	  sentences:
	  - of suspects should not be ignored, in particular the difficulty in obtaining meaningful
	    information on the functioning of those systems and the resulting difficulty in
	    challenging their results in court, in particular by natural persons under investigation.
	  - of the conditions referred to above should draw up documentation of the assessment
	    before that system is placed on the market or put into service and should provide
	    that documentation to national competent authorities upon request. Such a provider
	    should be obliged to register the AI system in the EU database established under
	    this Regulation. With a view to providing further guidance for the practical implementation
	    of the conditions under which the AI systems listed in an annex to this Regulation
	    are, on an exceptional basis, non-high-risk, the Commission should, after consulting
	    the Board, provide guidelines specifying that practical implementation, completed
	    by a comprehensive list of practical examples of use cases of AI systems that
	  - completed human activity that may be relevant for the purposes of the high-risk
	    uses listed in an annex to this Regulation. Considering those characteristics,
	    the AI system provides only an additional layer to a human activity with consequently
	    lowered risk. That condition would, for example, apply to AI systems that are
	    intended to improve the language used in previously drafted documents, for example
	    in relation to professional tone, academic style of language or by aligning text
	    to a certain brand messaging. The third condition should be that the AI system
	    is intended to detect decision-making patterns or deviations from prior decision-making
	    patterns. The risk would be lowered because the use of the AI system follows a previously
	- source_sentence: How does the context surrounding the number 39 influence its interpretation?
	  sentences:
	  - (39)
	  - requested by the European Parliament (6).
	  - under the UN Convention relating to the Status of Refugees done at Geneva on 28 July
	    1951 as amended by the Protocol of 31 January 1967. Nor should they be used to
	    in any way infringe on the principle of non-refoulement, or to deny safe and effective
	    legal avenues into the territory of the Union, including the right to international
	    protection.
	- source_sentence: How does the number 63 relate to the overall theme or subject being
	    discussed?
	  sentences:
	  - (60)
	  - (63)
	  - The deployment of AI systems in education is important to promote high-quality
	    digital education and training and to allow all learners and teachers to acquire
	    and share the necessary digital skills and competences, including media literacy,
	    and critical thinking, to take an active part in the economy, society, and in
	    democratic processes. However, AI systems used in education or vocational training,
	    in particular for determining access or admission, for assigning persons to educational
	    and vocational training institutions or programmes at all levels, for evaluating
	    learning outcomes of persons, for assessing the appropriate level of education
	    for an individual and materially influencing the level of education and training
	    that individuals
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	metrics:
	- cosine_accuracy@1
	- cosine_accuracy@3
	- cosine_accuracy@5
	- cosine_accuracy@10
	- cosine_precision@1
	- cosine_precision@3
	- cosine_precision@5
	- cosine_precision@10
	- cosine_recall@1
	- cosine_recall@3
	- cosine_recall@5
	- cosine_recall@10
	- cosine_ndcg@10
	- cosine_mrr@10
	- cosine_map@100
	model-index:
	- name: SentenceTransformer
	  results:
	  - task:
	      type: information-retrieval
	      name: Information Retrieval
	    dataset:
	      name: Unknown
	      type: unknown
	    metrics:
	    - type: cosine_accuracy@1
	      value: 0.9583333333333334
	      name: Cosine Accuracy@1
	    - type: cosine_accuracy@3
	      value: 1.0
	      name: Cosine Accuracy@3
	    - type: cosine_accuracy@5
	      value: 1.0
	      name: Cosine Accuracy@5
	    - type: cosine_accuracy@10
	      value: 1.0
	      name: Cosine Accuracy@10
	    - type: cosine_precision@1
	      value: 0.9583333333333334
	      name: Cosine Precision@1
	    - type: cosine_precision@3
	      value: 0.3333333333333333
	      name: Cosine Precision@3
	    - type: cosine_precision@5
	      value: 0.19999999999999998
	      name: Cosine Precision@5
	    - type: cosine_precision@10
	      value: 0.09999999999999999
	      name: Cosine Precision@10
	    - type: cosine_recall@1
	      value: 0.9583333333333334
	      name: Cosine Recall@1
	    - type: cosine_recall@3
	      value: 1.0
	      name: Cosine Recall@3
	    - type: cosine_recall@5
	      value: 1.0
	      name: Cosine Recall@5
	    - type: cosine_recall@10
	      value: 1.0
	      name: Cosine Recall@10
	    - type: cosine_ndcg@10
	      value: 0.9791666666666666
	      name: Cosine Ndcg@10
	    - type: cosine_mrr@10
	      value: 0.9722222222222223
	      name: Cosine Mrr@10
	    - type: cosine_map@100
	      value: 0.9722222222222222
	      name: Cosine Map@100
	---

	# SentenceTransformer

	This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	<!-- - Base model: [Unknown](https://huggingface.co/unknown) -->
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 1024 dimensions
	- Similarity Function: Cosine Similarity
	<!-- - Training Dataset: Unknown -->
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
	  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	  (2): Normalize()
	)
	```
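
The last two modules in the architecture listing are simple to state: with `pooling_mode_cls_token` enabled, the sentence embedding is the vector of the first (`[CLS]`) token, and `Normalize()` rescales it to unit length so that cosine similarity reduces to a dot product. The following is a minimal, dependency-free sketch of those two steps, not the library's implementation; the helper names `cls_pool` and `l2_normalize` and the toy 4-dimensional token vectors are hypothetical and stand in for the real 1024-dimensional encoder output.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Hypothetical encoder output for a 3-token input, shrunk to 4 dimensions.
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # vector at the [CLS] position
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Only the first row influences the result here; the other token vectors are ignored by CLS pooling.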

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("MikeCraBash/legal-ft-1")
	# Run inference
	sentences = [
	    'How does the number 63 relate to the overall theme or subject being discussed?',
	    '(63)',
	    '(60)',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# [3, 1024]

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# [3, 3]
	```
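
Because the model was trained with `MatryoshkaLoss` on the dimensions 768, 512, 256, 128 and 64, the leading components of an embedding are intended to remain usable on their own. The sketch below shows the truncate-then-renormalize step in plain Python; `truncate_embedding` is a hypothetical helper and the input vector is synthetic, not real model output. Sentence Transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which may be the more convenient route in practice; how much quality is lost at each size has not been measured for this model.

```python
import math

# Dimensions listed in the MatryoshkaLoss training configuration.
MATRYOSHKA_DIMS = (768, 512, 256, 128, 64)


def truncate_embedding(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim > len(embedding):
        raise ValueError("dim exceeds the embedding length")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Synthetic stand-in for one 1024-dimensional embedding.
full = [math.sin(i + 1) for i in range(1024)]

small = truncate_embedding(full, 256)
print(len(small))
```

Renormalizing after truncation keeps dot products equal to cosine similarities at the smaller size.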

	<!--
	### Direct Usage (Transformers)

	<details><summary>Click to see the direct usage in Transformers</summary>

	</details>
	-->

	<!--
	### Downstream Usage (Sentence Transformers)

	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	</details>
	-->

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	## Evaluation

	### Metrics

	#### Information Retrieval

	* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

	| Metric              | Value  |
	|:--------------------|:-------|
	| cosine_accuracy@1   | 0.9583 |
	| cosine_accuracy@3   | 1.0    |
	| cosine_accuracy@5   | 1.0    |
	| cosine_accuracy@10  | 1.0    |
	| cosine_precision@1  | 0.9583 |
	| cosine_precision@3  | 0.3333 |
	| cosine_precision@5  | 0.2    |
	| cosine_precision@10 | 0.1    |
	| cosine_recall@1     | 0.9583 |
	| cosine_recall@3     | 1.0    |
	| cosine_recall@5     | 1.0    |
	| cosine_recall@10    | 1.0    |
	| cosine_ndcg@10      | 0.9792 |
	| cosine_mrr@10       | 0.9722 |
	| cosine_map@100      | 0.9722 |
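
The precision values can look odd next to perfect accuracy at the same cutoff. They follow from each query having a single relevant passage, so precision@k is at most 1/k. The sketch below recomputes the metrics from a list of ranks under that assumption. The rank list is a hypothetical reconstruction (24 queries, 23 answered at rank 1 and one at rank 3) that happens to reproduce the reported numbers; the card itself does not state the evaluation set size or the per-query ranks, and `ir_metrics` is an illustrative helper, not part of `InformationRetrievalEvaluator`.

```python
import math


def ir_metrics(ranks, k):
    """Metrics at cutoff k when every query has exactly one relevant document.

    `ranks` holds the 1-based rank of that document for each query.
    """
    n = len(ranks)
    hits = [r for r in ranks if r <= k]
    accuracy = len(hits) / n
    return {
        "accuracy": accuracy,
        "precision": len(hits) / (k * n),  # at most one hit among k results
        "recall": accuracy,                # one relevant document per query
        "mrr": sum(1 / r for r in hits) / n,
        # The ideal DCG is 1 for a single relevant document, so NDCG equals DCG.
        "ndcg": sum(1 / math.log2(r + 1) for r in hits) / n,
    }


# Hypothetical ranks: 23 queries answered at rank 1, one at rank 3.
ranks = [1] * 23 + [3]

for k in (1, 3, 5, 10):
    print(k, {name: round(value, 4) for name, value in ir_metrics(ranks, k).items()})
```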

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Dataset

	#### Unnamed Dataset

	* Size: 400 training samples
	* Columns: <code>sentence_0</code> and <code>sentence_1</code>
	* Approximate statistics based on the first 400 samples:
	  |         | sentence_0 | sentence_1 |
	  |:--------|:-----------|:-----------|
	  | type    | string     | string     |
	  | details | <ul><li>min: 10 tokens</li><li>mean: 20.45 tokens</li><li>max: 35 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 93.01 tokens</li><li>max: 186 tokens</li></ul> |
	* Samples:
	  | sentence_0 | sentence_1 |
	  |:-----------|:-----------|
	  | <code>What types of risk analytics are permitted according to the context provided?</code> | <code>solely on profiling them or on assessing their personality traits and characteristics should be prohibited. In any case, that prohibition does not refer to or touch upon risk analytics that are not based on the profiling of individuals or on the personality traits and characteristics of individuals, such as AI systems using risk analytics to assess the likelihood of financial fraud by undertakings on the basis of suspicious transactions or risk analytic tools to predict the likelihood of the localisation of narcotics or illicit goods by customs authorities, for example on the basis of known trafficking routes.</code> |
	  | <code>Why is profiling individuals based on their personality traits prohibited?</code> | <code>solely on profiling them or on assessing their personality traits and characteristics should be prohibited. In any case, that prohibition does not refer to or touch upon risk analytics that are not based on the profiling of individuals or on the personality traits and characteristics of individuals, such as AI systems using risk analytics to assess the likelihood of financial fraud by undertakings on the basis of suspicious transactions or risk analytic tools to predict the likelihood of the localisation of narcotics or illicit goods by customs authorities, for example on the basis of known trafficking routes.</code> |
	  | <code>What criteria determine whether an AI system is classified as high-risk?</code> | <code>of AI systems that are high-risk and use cases that are not.</code> |
	* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
	```json
	{
	"loss": "MultipleNegativesRankingLoss",
	"matryoshka_dims": [
	768,
	512,
	256,
	128,
	64
	],
	"matryoshka_weights": [
	1,
	1,
	1,
	1,
	1
	],
	"n_dims_per_step": -1
	}
	```
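
The configuration above wraps `MultipleNegativesRankingLoss` in `MatryoshkaLoss`: the ranking loss is evaluated on embeddings truncated to each listed dimension, and the results are combined using the listed weights (with `n_dims_per_step` at -1, every dimension is used at every step). The following is a rough, dependency-free sketch of that idea, not the library's code. The function names are hypothetical, the vectors are toy 4-dimensional stand-ins with dimensions `[4, 2]` instead of `[768, ..., 64]`, and `scale=20.0` mirrors the library's default for the ranking loss, which is assumed rather than confirmed for this training run.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: positives[i] is the target for anchors[i]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Scaled cosine similarity of this anchor to every positive in the batch.
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    return sum(
        w * mnr_loss([v[:d] for v in anchors], [v[:d] for v in positives])
        for d, w in zip(dims, weights)
    )


# Toy batch of two (question, passage) pairs.
questions = [[1.0, 0.2, 0.0, 0.1], [0.1, 1.0, 0.3, 0.0]]
passages = [[0.9, 0.1, 0.1, 0.0], [0.0, 0.8, 0.2, 0.1]]

print(matryoshka_loss(questions, passages, dims=[4, 2], weights=[1, 1]))
```

Every other passage in the batch acts as a negative for a given question, which is why this loss needs only positive pairs such as the `sentence_0` and `sentence_1` columns.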

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `eval_strategy`: steps
	- `per_device_train_batch_size`: 10
	- `per_device_eval_batch_size`: 10
	- `num_train_epochs`: 10
	- `multi_dataset_batch_sampler`: round_robin

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: steps
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 10
	- `per_device_eval_batch_size`: 10
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 5e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1
	- `num_train_epochs`: 10
	- `max_steps`: -1
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: {}
	- `warmup_ratio`: 0.0
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `use_ipex`: False
	- `bf16`: False
	- `fp16`: False
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `dispatch_batches`: None
	- `split_batches`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: False
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: False
	- `prompts`: None
	- `batch_sampler`: batch_sampler
	- `multi_dataset_batch_sampler`: round_robin

	</details>

	### Training Logs
	| Epoch | Step | cosine_ndcg@10 |
	|:-----:|:----:|:--------------:|
	| 1.0   | 40   | 0.9715         |
	| 1.25  | 50   | 0.9792         |
	| 2.0   | 80   | 0.9715         |
	| 2.5   | 100  | 0.9715         |
	| 3.0   | 120  | 0.9715         |
	| 3.75  | 150  | 0.9715         |
	| 4.0   | 160  | 0.9792         |
	| 5.0   | 200  | 0.9792         |
	| 6.0   | 240  | 0.9688         |
	| 6.25  | 250  | 0.9792         |
	| 7.0   | 280  | 0.9715         |
	| 7.5   | 300  | 0.9792         |
	| 8.0   | 320  | 0.9792         |
	| 8.75  | 350  | 0.9792         |
	| 9.0   | 360  | 0.9792         |
	| 10.0  | 400  | 0.9792         |
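
The Epoch and Step columns follow from the dataset size and batch size: 400 samples at 10 per batch give 40 steps per epoch and 400 steps over 10 epochs. The short sketch below reproduces the logged steps. It assumes single-device training and an evaluation every 50 steps in addition to each epoch end; that interval is inferred from the table and is not stated in the hyperparameters.

```python
import math

TRAIN_SAMPLES = 400  # from the training dataset section
BATCH_SIZE = 10      # per_device_train_batch_size
EPOCHS = 10          # num_train_epochs
EVAL_EVERY = 50      # assumption inferred from the table, not a documented setting

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Evaluation points: the end of every epoch plus every EVAL_EVERY steps.
epoch_ends = range(steps_per_epoch, total_steps + 1, steps_per_epoch)
periodic = range(EVAL_EVERY, total_steps + 1, EVAL_EVERY)
logged_steps = sorted(set(epoch_ends) | set(periodic))

for step in logged_steps:
    print(f"epoch {step / steps_per_epoch:g}  step {step}")
```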


	### Framework Versions
	- Python: 3.11.11
	- Sentence Transformers: 3.4.1
	- Transformers: 4.48.2
	- PyTorch: 2.5.1+cu124
	- Accelerate: 1.3.0
	- Datasets: 3.2.0
	- Tokenizers: 0.21.0

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### MatryoshkaLoss
	```bibtex
	@misc{kusupati2024matryoshka,
	title={Matryoshka Representation Learning},
	author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
	year={2024},
	eprint={2205.13147},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```

	#### MultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	title={Efficient Natural Language Response Suggestion for Smart Reply},
	author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	year={2017},
	eprint={1705.00652},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->