Model Card

mStyleDistance is a multilingual style embedding model that aims to embed texts with similar writing styles close together and texts with different styles far apart, regardless of content or language. You may find this model useful for stylistic analysis of multilingual text, clustering, authorship identification and verification tasks, and automatic style transfer evaluation.

This model is a multilingual version of the English-only StyleDistance model.

Training Data and Variants of StyleDistance

mStyleDistance was contrastively trained on mSynthSTEL, a synthetically generated dataset of positive and negative examples of ~40 style features being used in text in 9 non-English languages. By utilizing this synthetic dataset, mStyleDistance is able to achieve stronger content-independence than other style embedding models currently available and is able to operate on multilingual text.
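
To make the contrastive setup concrete, here is a minimal sketch of a triplet-style objective on toy vectors. It is an illustration only: this card does not state the exact loss used to train mStyleDistance, and the function names `cosine` and `triplet_loss`, the margin of 0.1, and the toy embeddings are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hypothetical triplet-style contrastive loss (not necessarily the training loss).

    The loss is zero once the anchor is at least `margin` more similar to the
    positive (same style feature) than to the negative (opposite style feature).
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy 3-D vectors standing in for real style embeddings.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]   # same style as the anchor
negative = [0.0, 1.0, 0.0]   # different style

well_separated = triplet_loss(anchor, positive, negative)  # positive already closer
violated = triplet_loss(anchor, negative, positive)        # roles swapped on purpose
```

In this sketch, `well_separated` is zero because the positive is already much closer to the anchor than the negative, while `violated` is positive because the swapped roles break the margin. Minimizing such a loss over many triplets is one common way to pull same-style texts together and push different-style texts apart.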

Example Usage

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/mstyledistance')  # Load model

# Anchor text: Spanish, written in all caps
anchor = model.encode("ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA PARA COCINAR LA GALLINA CORNISH.")
# Candidates: (1) different content, same all-caps style; (2) same content, normal casing
others = model.encode(["TOCARÁS LA GUITARRA CON TU AMIGO; SERÁ UNA EXCELENTE OPORTUNIDAD PARA MEJORAR TUS HABILIDADES MUSICALES.", "Él tiene problemas para lograr la temperatura adecuada para cocinar la gallina Cornish."])
# Cosine similarity of the anchor to each candidate; the model aims to score the same-style text higher
print(cos_sim(anchor, others))
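
For retrieval or authorship verification, the same similarity scores can be turned into a ranking or a yes/no decision. The sketch below is a minimal illustration under stated assumptions: `rank_by_style` and `same_style` are hypothetical helpers, not part of sentence-transformers or this model, the 0.8 threshold is arbitrary and would need tuning on held-out data, and the toy 2-D vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_style(query_emb, candidate_embs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def same_style(emb_a, emb_b, threshold=0.8):
    """Hypothetical verification rule: the threshold is an assumption, not a tuned value."""
    return cosine(emb_a, emb_b) >= threshold

# Toy 2-D vectors standing in for style embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

ranking = rank_by_style(query, candidates)
is_match = same_style(query, candidates[1])
```

With the real model, the vectors would come from `model.encode(...)` as in the example above; the helpers only iterate over the values, so they should accept the arrays it returns as well as plain lists.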

Trained with DataDreamer

This model was trained on a synthetic dataset using DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.

Funding Acknowledgements

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
