	---
	license: mit
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- gte
	- mteb
	model-index:
	- name: gte-micro-test
	  results:
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/amazon_counterfactual
	      name: MTEB AmazonCounterfactualClassification (en)
	      config: en
	      split: test
	      revision: e8379541af4e31359cca9fbcf4b00f2671dba205
	    metrics:
	    - type: accuracy
	      value: 71.43283582089552
	    - type: ap
	      value: 33.56235301308992
	    - type: f1
	      value: 65.18510976313922
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/amazon_polarity
	      name: MTEB AmazonPolarityClassification
	      config: default
	      split: test
	      revision: e2d317d38cd51312af73b3d32a06d1a08b442046
	    metrics:
	    - type: accuracy
	      value: 77.72055
	    - type: ap
	      value: 72.30281215701287
	    - type: f1
	      value: 77.62429097469116
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/amazon_reviews_multi
	      name: MTEB AmazonReviewsClassification (en)
	      config: en
	      split: test
	      revision: 1399c76144fd37290681b995c656ef9b2e06e26d
	    metrics:
	    - type: accuracy
	      value: 38.956
	    - type: f1
	      value: 38.59075995638611
	  - task:
	      type: Clustering
	    dataset:
	      type: mteb/arxiv-clustering-p2p
	      name: MTEB ArxivClusteringP2P
	      config: default
	      split: test
	      revision: a122ad7f3f0291bf49cc6f4d32aa80929df69d5d
	    metrics:
	    - type: v_measure
	      value: 41.14317775707504
	  - task:
	      type: Clustering
	    dataset:
	      type: mteb/arxiv-clustering-s2s
	      name: MTEB ArxivClusteringS2S
	      config: default
	      split: test
	      revision: f910caf1a6075f7329cdf8c1a6135696f37dbd53
	    metrics:
	    - type: v_measure
	      value: 31.79440862639374
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/banking77
	      name: MTEB Banking77Classification
	      config: default
	      split: test
	      revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
	    metrics:
	    - type: accuracy
	      value: 80.40259740259741
	    - type: f1
	      value: 80.33885811790022
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/emotion
	      name: MTEB EmotionClassification
	      config: default
	      split: test
	      revision: 4f58c6b202a23cf9a4da393831edf4f9183cad37
	    metrics:
	    - type: accuracy
	      value: 44.54
	    - type: f1
	      value: 39.40201192446353
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/imdb
	      name: MTEB ImdbClassification
	      config: default
	      split: test
	      revision: 3d86128a09e091d6018b6d26cad27f2739fc2db7
	    metrics:
	    - type: accuracy
	      value: 70.5904
	    - type: ap
	      value: 64.61751544665012
	    - type: f1
	      value: 70.47776028292148
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/mtop_domain
	      name: MTEB MTOPDomainClassification (en)
	      config: en
	      split: test
	      revision: d80d48c1eb48d3562165c59d59d0034df9fff0bf
	    metrics:
	    - type: accuracy
	      value: 90.49703602371181
	    - type: f1
	      value: 90.05253119123799
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/mtop_intent
	      name: MTEB MTOPIntentClassification (en)
	      config: en
	      split: test
	      revision: ae001d0e6b1228650b7bd1c2c65fb50ad11a8aba
	    metrics:
	    - type: accuracy
	      value: 67.52393980848153
	    - type: f1
	      value: 49.95609666042009
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/amazon_massive_intent
	      name: MTEB MassiveIntentClassification (en)
	      config: en
	      split: test
	      revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
	    metrics:
	    - type: accuracy
	      value: 68.4969737726967
	    - type: f1
	      value: 66.32116772424203
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/amazon_massive_scenario
	      name: MTEB MassiveScenarioClassification (en)
	      config: en
	      split: test
	      revision: 7d571f92784cd94a019292a1f45445077d0ef634
	    metrics:
	    - type: accuracy
	      value: 73.54741089441829
	    - type: f1
	      value: 73.47537036064044
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/toxic_conversations_50k
	      name: MTEB ToxicConversationsClassification
	      config: default
	      split: test
	      revision: edfaf9da55d3dd50d43143d90c1ac476895ae6de
	    metrics:
	    - type: accuracy
	      value: 66.6912
	    - type: ap
	      value: 12.157396278930436
	    - type: f1
	      value: 51.00574525406295
	  - task:
	      type: Classification
	    dataset:
	      type: mteb/tweet_sentiment_extraction
	      name: MTEB TweetSentimentExtractionClassification
	      config: default
	      split: test
	      revision: d604517c81ca91fe16a244d1248fc021f9ecee7a
	    metrics:
	    - type: accuracy
	      value: 59.29258630447085
	    - type: f1
	      value: 59.6485358241374
	---
	# gte-micro-v3

	This is a distilled version of [gte-tiny](https://huggingface.co/TaylorAI/gte-tiny).

	## Intended purpose

	<span style="color:blue">This model is designed for use in semantic-autocomplete ([click here for demo](https://mihaiii.github.io/semantic-autocomplete/)).</span>

	## Usage (Sentence-Transformers) (same as [gte-tiny](https://huggingface.co/TaylorAI/gte-tiny))

	The simplest way to use this model is through [sentence-transformers](https://www.SBERT.net). Install it with:

	```
	pip install -U sentence-transformers
	```

	Then you can use the model like this:

	```python
	from sentence_transformers import SentenceTransformer
	sentences = ["This is an example sentence", "Each sentence is converted"]

	model = SentenceTransformer('Mihaiii/gte-micro-v3')
	embeddings = model.encode(sentences)
	print(embeddings)
	```
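Since the model card lists sentence similarity as the pipeline task, a common next step is to compare the returned embeddings. The sketch below is a minimal, dependency-free illustration of ranking candidates by cosine similarity. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the short vectors are made-up placeholders standing in for rows of `model.encode(sentences)`, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): in practice these would be rows of
# `model.encode(sentences)`.
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With real embeddings, the same ranking step could order autocomplete suggestions by how close they are to the text typed so far.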



	## Usage (HuggingFace Transformers) (same as [gte-tiny](https://huggingface.co/TaylorAI/gte-tiny))
	Without [sentence-transformers](https://www.SBERT.net), you can use the model directly through Transformers: first pass the input through the transformer model, then apply the appropriate pooling operation (mean pooling here) to the contextualized token embeddings.

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch


	# Mean pooling: use the attention mask so padding tokens do not skew the average
	def mean_pooling(model_output, attention_mask):
	    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
	    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
	    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


	# Sentences we want sentence embeddings for
	sentences = ['This is an example sentence', 'Each sentence is converted']

	# Load model from HuggingFace Hub
	tokenizer = AutoTokenizer.from_pretrained('Mihaiii/gte-micro-v3')
	model = AutoModel.from_pretrained('Mihaiii/gte-micro-v3')

	# Tokenize sentences
	encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

	# Compute token embeddings
	with torch.no_grad():
	    model_output = model(**encoded_input)

	# Perform pooling. In this case, mean pooling.
	sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

	print("Sentence embeddings:")
	print(sentence_embeddings)
	```
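To make the role of the attention mask concrete, here is a small worked example of masked mean pooling in plain Python. It mirrors the arithmetic of `mean_pooling` on one toy sequence. The function name `masked_mean_pool` and all numeric values are assumptions made for illustration, not output of the model.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                totals[j] += value
    count = max(count, 1e-9)  # same idea as the clamp: avoid division by zero
    return [t / count for t in totals]


# Two real tokens followed by one padding token (made-up values).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
# Naive mean over all positions, including the padding token.
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # mean of the two real tokens only
print(naive)   # distorted by the padding position
```

The masked result averages only the two real tokens, while the naive mean is pulled toward whatever vector sits in the padded position. This is why the pooling function weights by the attention mask.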

	### Limitation (same as [gte-small](https://huggingface.co/thenlper/gte-small))
	This model supports English text only, and inputs longer than 512 tokens are truncated.
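
One possible workaround for the 512-token limit, not something this model card prescribes, is to split a long token sequence into windows and embed each window separately. The sketch below assumes the text has already been tokenized into a list of ids. The helper `split_into_windows`, the overlap value, and the fake ids are all hypothetical.

```python
def split_into_windows(token_ids, max_len=512, overlap=0):
    """Split a token-id list into consecutive windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows


# Stand-in for tokenizer output (assumption): 1200 fake token ids.
fake_ids = list(range(1200))
windows = split_into_windows(fake_ids, max_len=512, overlap=32)
print([len(w) for w in windows])
```

In practice the window size would need to leave room for any special tokens the tokenizer adds. How the per-window embeddings are combined afterwards (for example by averaging) is a separate design choice that should be validated on the target task.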