# Semantic Textual Similarity
Once you have [sentence embeddings computed](../../examples/applications/computing-embeddings/README.md), you usually want to compare them to each other. Here, I show you how you can compute the cosine similarity between embeddings, for example, to measure the semantic similarity of two texts.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
```
We pass the `convert_to_tensor=True` parameter to the encode function. This returns a PyTorch tensor containing our embeddings. We can then call `util.cos_sim(A, B)`, which computes the cosine similarity between all vectors in *A* and all vectors in *B*.

In the above example, it returns a 3x3 matrix with the respective cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*.
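Because the full matrix is available, you are not limited to the diagonal. As a minimal sketch (reusing the `sentences1`, `sentences2`, and `cosine_scores` variables from the snippet above), you could find the best match in *sentences2* for each sentence in *sentences1* by taking the argmax over each row:

```python
import torch

# For each sentence in sentences1, find the most similar sentence in sentences2
# by taking the argmax over the corresponding row of the score matrix.
best_match = torch.argmax(cosine_scores, dim=1)

for i, j in enumerate(best_match.tolist()):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], cosine_scores[i][j]))
```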
You can also use this function to find the pairs with the highest cosine similarity scores:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

# Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))
```
Note that the above uses a brute-force approach to find the highest scoring pairs, which has quadratic complexity. For long lists of sentences, this might be infeasible. If you want to find the highest scoring pairs in a long list of sentences, have a look at [Paraphrase Mining](../../examples/applications/paraphrase-mining/README.md).
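As a preview, here is a minimal sketch using `util.paraphrase_mining`, which encodes and compares the sentences in chunks (so the full similarity matrix never has to be held in memory) and returns `[score, i, j]` triples sorted by decreasing score; the short sentence list below is just a placeholder for a much longer one:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# In practice this would be a much longer list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'The new movie is so great']

# paraphrase_mining handles encoding and chunked comparison internally
paraphrases = util.paraphrase_mining(model, sentences)

# Each entry is [score, i, j], sorted by decreasing score
for score, i, j in paraphrases[0:10]:
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
```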