Update README.md
README.md
CHANGED
@@ -11,174 +11,3 @@ license: mit

# Experimental Sparse Vector Repository

This repository is a fork of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) repository, aimed at creating sparse vectors. It is an experimental project based on the BGE-M3 model, which is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

For more details, please refer to the original [github repo](https://github.com/FlagOpen/FlagEmbedding).

## BGE-M3 Overview ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

BGE-M3 is a highly versatile embedding model that supports:

- **Multi-Functionality**: Capable of dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: Supports over 100 languages.
- **Multi-Granularity**: Handles inputs from short sentences to long documents up to 8192 tokens.

## Retrieval Pipeline Recommendations

We recommend a hybrid retrieval + re-ranking pipeline:

- **Hybrid Retrieval**: Combines embedding retrieval with the BM25 algorithm for higher accuracy and better generalization. BGE-M3 supports both embedding and sparse retrieval, allowing it to produce token weights similar to BM25 at no additional cost; a minimal end-to-end sketch follows this list.
  - Refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py) for hybrid retrieval examples.
- **Re-Ranking**: After retrieval, re-score the candidates with cross-encoder models such as [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker) or [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) for higher accuracy.
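
Below is a minimal sketch of this pipeline, assuming the `FlagEmbedding` package is installed (see Installation below). The `BAAI/bge-reranker-v2-m3` checkpoint, the 0.6/0.4 fusion weights, and the toy corpus are illustrative choices, not prescribed values:

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

query = "What is BGE M3?"
docs = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
    "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document",
]

q_out = model.encode([query], return_dense=True, return_sparse=True)
d_out = model.encode(docs, return_dense=True, return_sparse=True)

# Stage 1: hybrid retrieval, fusing dense similarity with sparse (lexical) matching scores.
hybrid_scores = []
for i in range(len(docs)):
    dense = float(q_out['dense_vecs'][0] @ d_out['dense_vecs'][i])
    sparse = model.compute_lexical_matching_score(q_out['lexical_weights'][0],
                                                  d_out['lexical_weights'][i])
    hybrid_scores.append(0.6 * dense + 0.4 * sparse)

# Keep the best candidates (all of them here, since the toy corpus is tiny).
candidates = sorted(range(len(docs)), key=lambda i: hybrid_scores[i], reverse=True)

# Stage 2: re-rank the candidates with a cross-encoder.
rerank_scores = reranker.compute_score([[query, docs[i]] for i in candidates])
print(list(zip(candidates, rerank_scores)))
```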

## News

- **2024/7/1**: Updated MIRACL evaluation results for BGE-M3. Refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr) for details.
- **2024/3/20**: Milvus now supports hybrid retrieval with BGE-M3. See [hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- **2024/3/8**: BGE-M3 achieves top performance in multilingual benchmarks. See [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05).
- **2024/3/2**: Released unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
- **2024/2/6**: Released [MLDR](https://huggingface.co/datasets/Shitao/MLDR) dataset and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- **2024/2/1**: Vespa now supports multiple modes of BGE-M3. See [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

## Model Specifications

| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

## Data

| Dataset | Introduction |
|:-------:|:------------:|
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document retrieval dataset covering 13 languages |
| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |

## FAQ

### 1. Introduction to Different Retrieval Methods

- **Dense Retrieval**: Maps the text into a single dense embedding vector.
- **Sparse Retrieval**: Produces a sparse vector of weights for the tokens present in the text (lexical matching).
- **Multi-Vector Retrieval**: Uses multiple vectors, one per token, to represent a text (as in ColBERT).
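
All three representations come from a single `encode` call. A minimal sketch, mirroring the usage examples later in this README (the dense dimension of 1024 matches the `BAAI/bge-m3` entry in the table above):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

out = model.encode(
    ["What is BGE M3?"],
    return_dense=True,        # one dense vector per text
    return_sparse=True,       # token -> weight map (lexical weights)
    return_colbert_vecs=True  # one vector per token (multi-vector / ColBERT)
)

print(out['dense_vecs'].shape)       # dense: (1, 1024)
print(out['lexical_weights'][0])     # sparse: {token_id: weight, ...}
print(out['colbert_vecs'][0].shape)  # multi-vector: (num_tokens, dim)
```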

### 2. How to Use BGE-M3 in Other Projects?

For embedding retrieval, use the BGE-M3 model similarly to BGE. For hybrid retrieval, refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

### 3. How to Fine-Tune the BGE-M3 Model?

Follow the [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) for dense embedding fine-tuning. For unified fine-tuning (dense, sparse, and colbert), refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).
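
The fine-tuning examples expect training data in jsonl form, where each line pairs a query with positive and negative passages; see the linked examples for the authoritative description. A minimal illustrative record, assuming that query/pos/neg layout (the file name and texts are placeholders):

```python
import json

# One training record in the query / pos / neg jsonl layout used by the
# FlagEmbedding fine-tuning examples; file name and texts are placeholders.
record = {
    "query": "What is BGE M3?",
    "pos": ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."],
    "neg": ["BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"],
}

with open("toy_train_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```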

## Usage

### Installation

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

or

```bash
pip install -U FlagEmbedding
```

### Generate Embedding for Text

#### Dense Embedding

```python
from FlagEmbedding import BGEM3FlagModel

# use_fp16=True speeds up encoding at a slight cost in precision.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# Setting a smaller max_length speeds up encoding when long inputs are not needed.
embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=8192)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']

# Score every pair of sentences with the inner product of their dense embeddings.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

#### Sparse Embedding (Lexical Weight)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# Display the learned token weights (the sparse vector) with readable tokens.
print(model.convert_id_to_token(output_1['lexical_weights']))

# Score a pair of texts by the overlap of their weighted tokens.
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
```

#### Multi-Vector (ColBERT)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# colbert_score computes a fine-grained, token-level (late-interaction) similarity.
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
```

### Compute Score for Text Pairs

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

# weights_for_different_modes weights the dense, sparse, and multi-vector scores, in that order.
print(model.compute_score(sentence_pairs, max_passage_length=128, weights_for_different_modes=[0.4, 0.2, 0.4]))
```

## Evaluation

Evaluation scripts are provided for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

## Acknowledgement

Thanks to the authors of open-sourced datasets such as MIRACL, MKQA, and NarrativeQA, and of open-source libraries such as [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).

## Citation

If you find this repository useful, please consider giving a star :star: and citation:

```bibtex
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```