|
# Quickstart |
|
Once you have SentenceTransformers [installed](installation.md), the usage is simple: |
|
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Our sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
```
|
|
|
|
|
With `SentenceTransformer('all-MiniLM-L6-v2')` we define which sentence transformer model we would like to load. In this example, we load *all-MiniLM-L6-v2*, a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs.
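As a quick sanity check, you can inspect the dimensionality of the embeddings the loaded model produces (a minimal sketch; the exact dimensionality depends on the model, and for *all-MiniLM-L6-v2* it is 384):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Each sentence is mapped to one fixed-size vector of this dimensionality
print(model.get_sentence_embedding_dimension())  # 384 for all-MiniLM-L6-v2
```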
|
|
|
BERT (and other transformer networks) output an embedding for each token in the input text. To create a fixed-size sentence embedding out of this, the model applies mean pooling, i.e., the output embeddings for all tokens are averaged to yield a fixed-size vector.
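To illustrate what mean pooling does, here is a minimal sketch in plain PyTorch (assuming token embeddings and an attention mask as produced by a transformer; padded positions are excluded from the average):

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    # Expand the mask to the embedding dimension so padded tokens contribute nothing
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # one fixed-size vector per input text

# Example with random data: batch of 2 texts, 4 tokens each, 384 dimensions
token_embeddings = torch.randn(2, 4, 384)
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(mean_pooling(token_embeddings, attention_mask).shape)  # torch.Size([2, 384])
```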
|
|
|
## Comparing Sentence Similarities |
|
|
|
The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure similarity in vector space is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). For two sentences, this can be done as follows:
|
|
|
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)
```
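`util.cos_sim` returns the pairwise cosine similarities as a tensor. For intuition, here is a minimal sketch of the same computation for two single vectors (assuming NumPy; the `cosine_similarity` helper below is a hypothetical name for illustration only):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors, normalized by their lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0: parallel vectors are maximally similar
```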
|
|
|
If you have a list of several sentences and want to compare all pairs, you can use the following code example:
|
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'Someone in a gorilla costume is playing a set of drums.']

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))
```
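For large sentence collections the manual double loop above becomes slow. The library also provides `util.paraphrase_mining`, which encodes the sentences and mines similar pairs in one call (a brief sketch; it returns `[score, i, j]` triplets sorted by decreasing similarity):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['A man is eating food.',
             'A man is eating a piece of bread.',
             'A monkey is playing drums.',
             'Someone in a gorilla costume is playing a set of drums.']

# Encodes the sentences and finds the most similar pairs in one call
pairs = util.paraphrase_mining(model, sentences)
for score, i, j in pairs[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], score))
```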
|
|
|
See the *Usage* sections on the left for more examples of how to use SentenceTransformers.
|
|
|
## Pre-Trained Models |
|
Various pre-trained models optimized for many tasks exist. For a full list, see **[Pretrained Models](pretrained_models.md)**.
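Switching models only requires changing the model name (a brief sketch; *paraphrase-multilingual-MiniLM-L12-v2* is one of the available multilingual models):

```python
from sentence_transformers import SentenceTransformer

# Multilingual model: similar sentences in different languages map close together
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(['Der Fuchs springt.', 'The fox jumps.'])
print(embeddings.shape)
```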
|
|
|
|
|
|
|
## Training your own Embeddings |
|
|
|
Training your own sentence embedding models for all types of use cases is easy and often requires only minimal coding effort. For a comprehensive tutorial, see [Training/Overview](training/overview.md).
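As a taste of what training looks like, here is a minimal sketch assuming the classic `model.fit` training API, with a couple of hypothetical toy training pairs (see the tutorial for the full picture):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical toy training pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man eats something.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A monkey is playing drums.'], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short fine-tuning pass over the toy data
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```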
|
|
|
You can also easily extend existing sentence embedding models to **further languages**. For details, see [Multi-Lingual Training](../examples/training/multilingual/README).
|
|