|
--- |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
|
|
--- |
|
|
|
# Bert-MLM_arXiv-MP-class_zbMath |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
|
|
|
The model is specifically designed to compute similarities of short mathematical texts. |
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed: |
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.", |
|
"We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."] |
|
|
|
model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath') |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
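
Since the model is meant for similarity computations, a natural next step is to compare the two embeddings, for instance with cosine similarity. A minimal sketch continuing from the snippet above, using the `util` helpers from sentence-transformers:

```python
from sentence_transformers import util

# Cosine similarity between the two abstract embeddings computed above
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```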
|
|
|
|
|
|
|
## Usage (HuggingFace Transformers) |
|
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
import torch |
|
|
|
|
|
# Mean pooling - take the attention mask into account for correct averaging
|
def mean_pooling(model_output, attention_mask): |
|
token_embeddings = model_output[0] #First element of model_output contains all token embeddings |
|
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() |
|
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) |
|
|
|
|
|
# Sentences we want sentence embeddings for |
|
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.", |
|
"We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."] |
|
|
|
# Load model from HuggingFace Hub |
|
tokenizer = AutoTokenizer.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath') |
|
model = AutoModel.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath') |
|
|
|
# Tokenize sentences |
|
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') |
|
|
|
# Compute token embeddings |
|
with torch.no_grad(): |
|
model_output = model(**encoded_input) |
|
|
|
# Perform pooling. In this case, mean pooling. |
|
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask']) |
|
|
|
print("Sentence embeddings:") |
|
print(sentence_embeddings) |
|
``` |
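
As with the sentence-transformers variant, you will typically compare the resulting embeddings with cosine similarity. A minimal sketch continuing from the snippet above, using plain PyTorch:

```python
import torch.nn.functional as F

# Normalize the embeddings; the dot product of unit vectors is the cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_similarity = (sentence_embeddings[0] * sentence_embeddings[1]).sum()
print("Cosine similarity:", cosine_similarity.item())
```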
|
|
|
--------- |
|
|
|
|
|
|
## Intended uses |
|
|
|
Our model is intended to be used as a sentence and short-paragraph encoder for mathematical texts. Given an input text, it outputs a vector that captures its semantic information. The sentence vector may be used for information retrieval, clustering, or sentence-similarity tasks.
|
|
|
By default, input text longer than 256 word pieces is truncated. |
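
When the model is loaded with sentence-transformers, this limit is exposed through the `max_seq_length` attribute; a minimal sketch for inspecting it:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
print(model.max_seq_length)  # maximum number of word pieces before truncation
```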
|
|
|
## Training procedure |
|
|
|
### Domain adaptation
|
|
|
We use the domain-adapted [math-similarity/Bert-MLM_arXiv](https://huggingface.co/math-similarity/Bert-MLM_arXiv) model. Please refer to its model card for more detailed information about the domain-adaptation procedure.
|
|
|
### Pooling |
|
|
|
We add a mean-pooling layer on top of the domain-adapted model. |
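
In sentence-transformers terms, this amounts to stacking a `Transformer` module and a mean `Pooling` module. A hedged sketch of how such a model can be assembled (the exact construction used for this model may differ slightly; see the training notebook linked below):

```python
from sentence_transformers import SentenceTransformer, models

# Wrap the domain-adapted BERT encoder and add a mean-pooling layer on top
word_embedding_model = models.Transformer('math-similarity/Bert-MLM_arXiv', max_seq_length=256)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode='mean'
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```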
|
|
|
### Fine-tuning |
|
|
|
We fine-tune the model using a cosine-similarity objective. Formally, it computes the embeddings `u = model(sentence_A)` and `v = model(sentence_B)` and minimizes the loss `||input_label - cos_score_transformation(cosine_sim(u,v))||_2`, i.e. the mean squared error between the gold similarity label and the cosine similarity of the two embeddings.
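
This objective corresponds to `CosineSimilarityLoss` in sentence-transformers. The following is an illustrative sketch of such a fine-tuning setup, not our exact configuration; the title pairs and hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Illustrative pairs: label 1.0 = semantically similar, 0.0 = dissimilar
train_examples = [
    InputExample(texts=["title of paper A", "title of paper B"], label=1.0),
    InputExample(texts=["title of paper C", "title of paper D"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # `model` as assembled above

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```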
|
|
|
We use title pairs from [zbMath](https://zbmath.org) as the fine-tuning dataset and model semantic similarity with their MSC codes. Two titles are defined as semantically similar if they share their primary MSC<sub>5</sub> code and at least one secondary MSC<sub>5</sub> code; otherwise, they are defined as semantically dissimilar.
|
The training set contains 351,472 title pairs and the evaluation set contains 43,935 pairs. See the [training notebook](https://github.com/math-collab/text-similarity/blob/main/Bert-MLM%20%2B%20mean%20pooling%20%2B%20fine-tune%20zbMath-class.ipynb) for more information.
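
For illustration only, the labelling rule can be expressed as a small helper; the actual preprocessing lives in the linked notebook:

```python
def is_similar(primary_a, secondaries_a, primary_b, secondaries_b):
    """Two titles count as similar if their primary MSC5 codes match
    and they share at least one secondary MSC5 code."""
    return primary_a == primary_b and bool(set(secondaries_a) & set(secondaries_b))
```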
|
|
|
Unfortunately, we cannot include a dataset with titles due to licensing issues. However, we have created a dataset that only contains the respective zbMath identifiers (also known as *an*) together with their primary and secondary MSC classifications, but without titles. It is available as [datasets/math-similarity/class-zbmath-identifier](https://huggingface.co/datasets/math-similarity/class-zbmath-identifier).
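
The identifier dataset can be loaded with the `datasets` library; a minimal sketch (inspect the returned object for the exact column names):

```python
from datasets import load_dataset

# zbMath identifiers with primary/secondary MSC classification, but no titles
ds = load_dataset("math-similarity/class-zbmath-identifier")
print(ds)
```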
|
|
|
## Citing & Authors |
|
|
|
This model is an additional resource for the [CICM'24](https://cicm-conference.org/2024/cicm.php) submission *On modelling similarity of short mathematical texts*. |