---
language:
- en
license: apache-2.0
tags:
- biencoder
- sentence-transformers
- text-classification
- sentence-pair-classification
- semantic-similarity
- semantic-search
- retrieval
- reranking
- generated_from_trainer
- dataset_size:483820
- loss:MultipleNegativesSymmetricRankingLoss
base_model: Alibaba-NLP/gte-modernbert-base
widget:
- source_sentence: 'See Precambrian time scale # Proposed Geologic timeline for another set of periods 4600 -- 541 MYA .'
  sentences:
  - In 2014 election , Biju Janata Dal candidate Tathagat Satapathy Bharatiya Janata party candidate Rudra Narayan Pany defeated with a margin of 1.37,340 votes .
  - In Scotland , the Strathclyde Partnership for Transport , formerly known as Strathclyde Passenger Transport Executive , comprises the former Strathclyde region , which includes the urban area around Glasgow .
  - 'See Precambrian Time Scale # Proposed Geological Timeline for another set of periods of 4600 -- 541 MYA .'
- source_sentence: It is also 5 kilometers northeast of Tamaqua , 27 miles south of Allentown and 9 miles northwest of Hazleton .
  sentences:
  - In 1948 he moved to Massachusetts , and eventually settled in Vermont .
  - Suddenly I remembered that I was a New Zealander , I caught the first plane home and came back .
  - It is also 5 miles northeast of Tamaqua , 27 miles south of Allentown , and 9 miles northwest of Hazleton .
- source_sentence: The party has a Member of Parliament , a member of the House of Lords , three members of the London Assembly and two Members of the European Parliament .
  sentences:
  - The party has one Member of Parliament , one member of the House of Lords , three Members of the London Assembly and two Members of the European Parliament .
  - Grapsid crabs dominate in Australia , Malaysia and Panama , while gastropods Cerithidea scalariformis and Melampus coeffeus are important seed predators in Florida mangroves .
  - Music Story is a music service website and international music data provider that curates , aggregates and analyses metadata for digital music services .
- source_sentence: 'The play received two 1969 Tony Award nominations : Best Actress in a Play ( Michael Annals ) and Best Costume Design ( Charlotte Rae ) .'
  sentences:
  - Ravishanker is a fellow of the International Statistical Institute and an elected member of the American Statistical Association .
  - 'In 1969 , the play received two Tony - Award nominations : Best Actress in a Theatre Play ( Michael Annals ) and Best Costume Design ( Charlotte Rae ) .'
  - AMD and Nvidia both have proprietary methods of scaling , CrossFireX for AMD , and SLI for Nvidia .
- source_sentence: He was a close friend of Ángel Cabrera and is a cousin of golfer Tony Croatto .
  sentences:
  - He was a close friend of Ángel Cabrera , and is a cousin of golfer Tony Croatto .
  - Eugenijus Bartulis ( born December 7 , 1949 in Kaunas ) is a Lithuanian Roman Catholic priest , and Bishop of Šiauliai .
  - UWIRE also distributes its members content to professional media outlets , including Yahoo , CNN and CBS News .
datasets:
- redis/langcache-sentencepairs-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
- cosine_accuracy_threshold
- cosine_f1
- cosine_f1_threshold
- cosine_precision
- cosine_recall
- cosine_ap
- cosine_mcc
model-index:
- name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
  results:
  - task:
      type: binary-classification
      name: Binary Classification
    dataset:
      name: test
      type: test
    metrics:
    - type: cosine_accuracy
      value: 0.7035681462730365
      name: Cosine Accuracy
    - type: cosine_accuracy_threshold
      value: 0.8473721742630005
      name: Cosine Accuracy Threshold
    - type: cosine_f1
      value: 0.712274188436637
      name: Cosine F1
    - type: cosine_f1_threshold
      value: 0.8116312026977539
      name: Cosine F1 Threshold
    - type: cosine_precision
      value: 0.5987668417446905
      name: Cosine Precision
    - type: cosine_recall
      value: 0.8788826815642458
      name: Cosine Recall
    - type: cosine_ap
      value: 0.6473811496690576
      name: Cosine AP
    - type: cosine_mcc
      value: 0.4419218320172892
      name: Cosine MCC
---

# Redis fine-tuned BiEncoder model for semantic caching on LangCache

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) on the [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for sentence-pair similarity.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) <!-- at revision e7f32e3c00f91d699e8c43b53106206bcc72bb22 -->
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
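
The architecture above shows an 8192-token ModernBERT encoder followed by CLS-token pooling. If you want to verify these settings programmatically, a quick check like the following should work (the model id matches the usage example below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("redis/langcache-embed-v3")
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
print(model[1].pooling_mode_cls_token)           # True -> CLS-token pooling
```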
					
						
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("redis/langcache-embed-v3")
# Run inference
sentences = [
    'He was a close friend of Ángel Cabrera and is a cousin of golfer Tony Croatto .',
    'He was a close friend of Ángel Cabrera , and is a cousin of golfer Tony Croatto .',
    'UWIRE also distributes its members content to professional media outlets , including Yahoo , CNN and CBS News .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[0.9922, 0.9922, 0.5352],
#         [0.9922, 0.9961, 0.5391],
#         [0.5352, 0.5391, 1.0000]], dtype=torch.bfloat16)
```
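
In a semantic-caching setting, these similarity scores are typically compared against a tuned decision threshold to decide whether a new query is a paraphrase of an already-cached one; the `cosine_f1_threshold` of roughly 0.81 reported under Evaluation below is a reasonable starting point. The snippet below is a minimal sketch of such a lookup, assuming a plain in-memory dict as the cache; a production system such as LangCache would use a vector store (e.g. Redis) instead:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("redis/langcache-embed-v3")

# Illustrative in-memory cache mapping a cached prompt to its response
cache = {
    "He was a close friend of Ángel Cabrera and is a cousin of golfer Tony Croatto .": "cached response",
}
THRESHOLD = 0.81  # cosine_f1_threshold from the Evaluation section

def lookup(query: str):
    """Return the cached response for the closest prompt, or None on a miss."""
    keys = list(cache)
    sims = model.similarity(model.encode([query]), model.encode(keys))[0]
    best = int(sims.argmax())
    return cache[keys[best]] if float(sims[best]) >= THRESHOLD else None

print(lookup("He was a close friend of Ángel Cabrera , and is a cousin of golfer Tony Croatto ."))
# "cached response" -- the paraphrase clears the threshold
```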
					
						
<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Binary Classification

* Dataset: `test`
* Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

| Metric                    | Value      |
|:--------------------------|:-----------|
| cosine_accuracy           | 0.7036     |
| cosine_accuracy_threshold | 0.8474     |
| cosine_f1                 | 0.7123     |
| cosine_f1_threshold       | 0.8116     |
| cosine_precision          | 0.5988     |
| cosine_recall             | 0.8789     |
| **cosine_ap**             | **0.6474** |
| cosine_mcc                | 0.4419     |
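
The same evaluator can be run on your own labeled sentence pairs to produce metrics in this format; the pairs and labels below are illustrative placeholders, not the test set behind the table above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("redis/langcache-embed-v3")

# Placeholder pairs: label 1 = paraphrase, label 0 = not a paraphrase
evaluator = BinaryClassificationEvaluator(
    sentences1=[
        "The 12F was officially homologated on August 21 , 1929 .",
        "In 1948 he moved to Massachusetts , and eventually settled in Vermont .",
    ],
    sentences2=[
        "The 12F was officially homologated on 21 August 1929 .",
        "AMD and Nvidia both have proprietary methods of scaling .",
    ],
    labels=[1, 0],
    name="test",
)
print(evaluator(model))  # e.g. {'test_cosine_accuracy': ..., 'test_cosine_ap': ..., ...}
```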
					
						
<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### LangCache Sentence Pairs (all)

* Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
* Size: 26,850 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1                                                                          | sentence2                                                                          | label                        |
  |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------|
  | type    | string                                                                             | string                                                                             | int                          |
  | details | <ul><li>min: 8 tokens</li><li>mean: 27.35 tokens</li><li>max: 53 tokens</li></ul>  | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 52 tokens</li></ul>  | <ul><li>1: 100.00%</li></ul> |
* Samples:
  | sentence1                                                                                                              | sentence2                                                                                                                       | label          |
  |:-----------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------|:---------------|
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code>  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code>            | <code>1</code> |
  | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
  | <code>The 12F was officially homologated on August 21 , 1929 and exhibited at the Paris Salon in 1930 .</code>        | <code>The 12F was officially homologated on 21 August 1929 and displayed at the 1930 Paris Salon .</code>                      | <code>1</code> |
* Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
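
Note that every sampled pair carries label 1, which fits the loss: MultipleNegativesSymmetricRankingLoss only needs positive pairs and treats the other sentences in a batch as negatives, scoring in both directions. A minimal fine-tuning sketch with this loss follows; the split name, column selection, and trainer defaults are assumptions, since the card does not spell out the full training configuration:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# The loss consumes (anchor, positive) text pairs, so keep only the two
# sentence columns; the "train" split name is an assumption
dataset = load_dataset("redis/langcache-sentencepairs-v1", split="train")
dataset = dataset.select_columns(["sentence1", "sentence2"])

loss = MultipleNegativesSymmetricRankingLoss(model, scale=20.0)  # scale matches this card

trainer = SentenceTransformerTrainer(model=model, train_dataset=dataset, loss=loss)
trainer.train()
```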
					
						
### Evaluation Dataset

#### LangCache Sentence Pairs (all)

* Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
* Size: 26,850 evaluation samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1                                                                          | sentence2                                                                          | label                        |
  |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------|
  | type    | string                                                                             | string                                                                             | int                          |
  | details | <ul><li>min: 8 tokens</li><li>mean: 27.35 tokens</li><li>max: 53 tokens</li></ul>  | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 52 tokens</li></ul>  | <ul><li>1: 100.00%</li></ul> |
* Samples:
  | sentence1                                                                                                              | sentence2                                                                                                                       | label          |
  |:-----------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------|:---------------|
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code>  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code>            | <code>1</code> |
  | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
  | <code>The 12F was officially homologated on August 21 , 1929 and exhibited at the Paris Salon in 1930 .</code>        | <code>The 12F was officially homologated on 21 August 1929 and displayed at the 1930 Paris Salon .</code>                      | <code>1</code> |
* Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```

### Training Logs
| Epoch | Step | test_cosine_ap |
|:-----:|:----:|:--------------:|
| -1    | -1   | 0.6474         |


### Framework Versions
- Python: 3.12.3
- Sentence Transformers: 5.1.0
- Transformers: 4.56.0
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->