update readme zh
- README.md +3 -3
- README_zh.md +79 -0
README.md
CHANGED
@@ -1073,7 +1073,7 @@ effectively harnessing textual data and labels from diverse downstream
 tasks. In addition, Piccolo2 scales up the embedding dimension and uses
 MRL training to support more flexible vector dimensions.
 
-## Model Highlights
+## 💡 Model Highlights
 The main feature of Piccolo2 is that it uses a multi-task hybrid loss during training.
 For retrieval/reranking tasks, we use the standard InfoNCE loss with in-batch negatives:
 <p align='left'>

@@ -1092,7 +1092,7 @@ it can easily lead to conflicting training targets:
 <img src='assets/3.png' width='400' height='80'>
 </p>
 
-## Experiments and Results
+## 📃 Experiments and Results
 Piccolo2 primarily focuses on the general downstream fine-tuning paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for about 2,500 steps on 32 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932).
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |

@@ -1102,7 +1102,7 @@ Piccolo2 primarily focuses on the general downstream fine-tuning paradigm. Our open
 | [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.7 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |
 
 
-## Usage
+## 🔨 Usage
 The Piccolo model can be easily accessed through the sentence-transformers package:
 ```python
 # for s2s/s2p dataset, you can use piccolo as below
README_zh.md
ADDED
@@ -0,0 +1,79 @@
[EN](README.md) | [简体中文](README_zh.md)

**News**

**[2024-05-14]**
We have now released the model weights, training code, and technical report; feedback is welcome.
Our training code is available on GitHub: https://github.com/hjq133/piccolo-embedding
For training details, please refer to our technical report: https://arxiv.org/abs/2405.06932


**[2024-04-22]**

piccolo-large-zh-v2 currently ranks first on the C-MTEB leaderboard, leading the next-best BERT model by about 1.9 points.

## Piccolo-large-zh-v2

piccolo-large-zh-v2 is a Chinese embedding model developed by the General Model Group of SenseTime Research. This upgraded version of Piccolo focuses on a general downstream fine-tuning approach: Piccolo2 uses an efficient multi-task hybrid-loss training method to make effective use of text data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions.

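To make the MRL idea concrete, here is a minimal sketch (not the released training code) of applying a loss to truncated, re-normalized prefixes of the embedding so that shorter vectors stay usable. The dimension list mirrors the supported matryoshka dimensions; the equal weighting and the `loss_fn` interface are illustrative assumptions:

```python
# Illustrative MRL sketch: compute the same loss on truncated prefixes of the
# embeddings and average. Dimension list and weighting are assumptions.
import torch
import torch.nn.functional as F

def mrl_loss(query_emb, doc_emb, loss_fn, dims=(256, 512, 768, 1024, 1280, 1536, 1792)):
    losses = []
    for k in dims:
        q_k = F.normalize(query_emb[:, :k], dim=-1)  # truncate, then re-normalize
        d_k = F.normalize(doc_emb[:, :k], dim=-1)
        losses.append(loss_fn(q_k, d_k))
    return torch.stack(losses).mean()
```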
## 💡 Model Highlights
The main feature of Piccolo2 is that it uses a multi-task hybrid loss during training.
For retrieval/reranking tasks, we use the standard InfoNCE loss with in-batch negatives (a minimal sketch follows the figure):
<p align='left'>
<img src='assets/1.png' width='400' height='80'>
</p>
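A minimal PyTorch version of in-batch-negative InfoNCE looks roughly like this (the temperature value is an assumption, not the exact training setting):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, pos_emb, tau=0.05):
    """InfoNCE with in-batch negatives: every other positive in the batch
    acts as a negative for a given query."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / tau                              # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the true pairs
    return F.cross_entropy(logits, labels)
```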

For STS and pair-classification tasks, we use a ranking loss, the CoSENT loss. On datasets with fine-grained labels (e.g., graded similarity scores), ranking losses have generally been shown to perform better (a sketch follows the figure):
<p align='left'>
<img src='assets/2.png' width='450' height='90'>
</p>
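As a rough reference (not the exact released implementation), the CoSENT ranking loss penalizes every pair of sentence pairs whose predicted cosine order contradicts the labeled order; the scale 1/tau is an illustrative choice:

```python
import torch

def cosent_loss(cos_sim, labels, tau=0.05):
    """cos_sim: (N,) predicted cosine similarity of N sentence pairs;
    labels: (N,) graded similarity scores for the same pairs."""
    s = cos_sim / tau
    diff = s[:, None] - s[None, :]              # diff[i, j] = s_i - s_j
    mask = labels[:, None] < labels[None, :]    # pair j should rank above pair i
    diff = diff[mask]                           # terms that incur a penalty
    zero = torch.zeros(1, device=cos_sim.device)
    # log(1 + sum(exp(diff))) via a numerically stable logsumexp
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)
```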

For classification and clustering tasks, we convert the dataset into triplets by treating a text and its semantic labels as positive and negative pairs, and again optimize with InfoNCE. However, in-batch negatives can no longer be used here, because they easily lead to conflicting training targets (a sketch follows the figure):
<p align='left'>
<img src='assets/3.png' width='400' height='80'>
</p>
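A minimal sketch of this setup (illustrative, not the released code): each text is scored only against its own label embedding and explicitly sampled non-matching labels, so other texts in the batch, which may share the same class, never act as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_label_triplets(text_emb, pos_label_emb, neg_label_emb, tau=0.05):
    """text_emb: (B, D); pos_label_emb: (B, D) embedding of each text's true label;
    neg_label_emb: (B, K, D) embeddings of K non-matching labels. No in-batch negatives."""
    a = F.normalize(text_emb, dim=-1)
    p = F.normalize(pos_label_emb, dim=-1)
    n = F.normalize(neg_label_emb, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau              # (B, 1) positive logit
    neg = torch.einsum('bd,bkd->bk', a, n) / tau           # (B, K) explicit negative logits
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(a.size(0), dtype=torch.long, device=a.device)  # index 0 = positive
    return F.cross_entropy(logits, target)
```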

## 📃 Experiments and Results
Piccolo2 primarily focuses on a general downstream fine-tuning paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for about 2,500 steps on 32 A100 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932) and [training code](https://github.com/hjq133/piccolo-embedding).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [**piccolo-large-zh-v2**](https://huggingface.co/sensenova/piccolo-large-zh-v2) | 1.21 | 1792 | 512 | 74.59 | 62.17 | 90.24 | 70.00 | 74.36 | 63.50 | 70.95 |
| [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) | 26.45 | 32768 | 4096 | 73.35 | 67.08 | 88.52 | 66.38 | 70.62 | 62.32 | 69.56 |
| [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.70 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |


## 🔨 Usage
The Piccolo model can be used directly with the sentence-transformers package:
```python
# for s2s/s2p dataset, you can use piccolo as below
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
matryoshka_dim = 1792  # supported dimensions: 256, 512, 768, 1024, 1280, 1536, 1792
model = SentenceTransformer('sensenova/piccolo-large-zh-v2')
embeddings_1 = model.encode(sentences, normalize_embeddings=False)
embeddings_2 = model.encode(sentences, normalize_embeddings=False)
embeddings_1 = normalize(embeddings_1[..., :matryoshka_dim], norm="l2", axis=1)
embeddings_2 = normalize(embeddings_2[..., :matryoshka_dim], norm="l2", axis=1)
similarity = embeddings_1 @ embeddings_2.T
```
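As a small follow-up (not from the original README), the same pattern can be used for a toy retrieval run at a reduced matryoshka dimension; the query and passages below are made-up examples:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sensenova/piccolo-large-zh-v2')
dim = 512  # any supported matryoshka dimension

query = ["如何加载piccolo模型"]
docs = ["piccolo可以通过sentence-transformers加载", "今天天气很好"]

# encode, truncate to the chosen dimension, then L2-normalize
q = normalize(model.encode(query, normalize_embeddings=False)[..., :dim], norm="l2", axis=1)
d = normalize(model.encode(docs, normalize_embeddings=False)[..., :dim], norm="l2", axis=1)

scores = (q @ d.T)[0]
for i in np.argsort(-scores):                 # highest cosine similarity first
    print(round(float(scores[i]), 4), docs[i])
```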

## 🤗 **Model List**
| Model | Language | Description | prompt |
|:-|:-:|:-:|:--:|
| [sensenova/piccolo-large-zh-v2](https://huggingface.co/sensenova/piccolo-large-zh-v2) | Chinese | version 2: fine-tuned with multi-task hybrid-loss training | None |
| [sensenova/piccolo-large-zh](https://huggingface.co/sensenova/piccolo-large-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |
| [sensenova/piccolo-base-zh](https://huggingface.co/sensenova/piccolo-base-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |


## Citation
If our technical report, model, or training code has helped you, please cite our paper as below, or give us a star on GitHub or Hugging Face!
```bibtex
@misc{2405.06932,
  Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
  Title = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
  Year = {2024},
  Eprint = {arXiv:2405.06932},
}
```