---
license: mit
language:
- zh
pipeline_tag: sentence-similarity
---

## Model List

All evaluation datasets are in Chinese, and every method uses the same backbone language model, **RoBERTa base**. Because the test sets of some datasets are small, which can cause large deviations in the measured accuracy, the evaluation here uses the train, validation, and test splits together, and the final score is reported as a **weighted average (w-avg)**.

| Model | STS-B (w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------:|:-------------:|:------------:|:----------:|
| BERT-Whitening | 65.27 | - | - | - | - | - |
| SimBERT | 70.01 | - | - | - | - | - |
| SBERT-Whitening | 71.75 | - | - | - | - | - |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | 78.61 | - | - | - | - | - |
| [hellonlp/simcse-base-zh](https://huggingface.co/hellonlp/simcse-roberta-base-zh) | 80.96 | - | - | - | - | - |
| [hellonlp/promcse-base-zh-v1.0](https://huggingface.co/hellonlp/promcse-bert-base-zh) | **81.57** | - | - | - | - | - |
| [hellonlp/promcse-base-zh-v1.1](https://huggingface.co/hellonlp/promcse-bert-base-zh) | **82.02** | - | - | - | - | - |
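The weighted average (w-avg) used for the scores can be illustrated with a short sketch. It assumes that each split's score is weighted by the number of sentence pairs in that split, which this card does not state explicitly. The function name `weighted_average` and the scores and split sizes below are hypothetical placeholders, not results from this evaluation.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each split by its sample count."""
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and of equal length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical per-split scores (train, valid, test) and split sizes.
split_scores = [80.0, 82.0, 81.0]
split_sizes = [5000, 1500, 1500]

w_avg = weighted_average(split_scores, split_sizes)
print(round(w_avg, 2))
```

With this weighting, large splits dominate the final number, so a noisy score on a small test split moves the result less than it would in a plain mean.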
## Uses

To use the model, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/):

```bash
pip install promcse
```

After installing the package, you can load the model with two lines of code:

```python
from promcse import PromCSE

# Arguments: model name, pooler type, and prefix (soft prompt) length.
model = PromCSE("hellonlp/promcse-bert-base-zh-v1.1", "cls", 10)
```

You can then encode a sentence into an embedding:

```python
embeddings = model.encode("武汉是一个美丽的城市。")
print(embeddings.shape)
# torch.Size([768])
```

You can also compute the cosine similarities between two groups of sentences:

```python
sentences_a = ['你好吗']
sentences_b = ['你怎么样', '我吃了一个苹果', '你过的好吗', '你还好吗', '你',
               '你好不好', '你好不好呢', '我不开心', '我好开心啊', '你吃饭了吗',
               '你好吗', '你现在好吗', '你好个鬼']

# Score every sentence in sentences_b against the sentence in sentences_a.
similarities = model.similarity(sentences_a, sentences_b)
print(similarities)
# [(1.0, '你好吗'),
#  (0.9029, '你好不好'),
#  (0.8945, '你好不好呢'),
#  (0.8478, '你还好吗'),
#  (0.7746, '你现在好吗'),
#  (0.7607, '你过的好吗'),
#  (0.7399, '你怎么样'),
#  (0.5967, '你'),
#  (0.5395, '你好个鬼'),
#  (0.5262, '你吃饭了吗'),
#  (0.3608, '我好开心啊'),
#  (0.2308, '我不开心'),
#  (0.0626, '我吃了一个苹果')]
```
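For intuition, the kind of ranking printed above can be sketched directly from embedding vectors. This is a minimal illustration, assuming that the scores are plain cosine similarities sorted in descending order, as the printed output suggests; it is not the package's actual implementation. `cosine` and `rank_by_cosine` are hypothetical helpers, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, candidates):
    """Return (score, text) pairs sorted from most to least similar.

    candidates is a list of (text, vector) pairs.
    """
    scored = [(round(cosine(query_vec, vec), 4), text) for text, vec in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors standing in for sentence embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
candidates = [
    ("same direction", [2.0, 0.0, 0.0]),
    ("orthogonal", [0.0, 1.0, 0.0]),
    ("diagonal", [1.0, 1.0, 0.0]),
]

ranked = rank_by_cosine(query, candidates)
print(ranked)
```

Cosine similarity depends only on the angle between vectors, so the candidate that points in the same direction as the query scores 1.0 regardless of its length.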