p0x0q committed on
Commit 02c2c9e · verified · 1 Parent(s): 09f2f11

Update README.md

Files changed (1)
  1. README.md +0 -171
README.md CHANGED
@@ -11,174 +11,3 @@ license: mit
# Experimental Sparse Vector Repository

This repository is a fork of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) repository, aimed at creating sparse vectors. It is an experimental project based on the BGE-M3 model, which is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

For more details, please refer to the original [github repo](https://github.com/FlagOpen/FlagEmbedding).

## BGE-M3 Overview ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

BGE-M3 is a highly versatile embedding model that supports:
- **Multi-Functionality**: Capable of dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: Supports over 100 languages.
- **Multi-Granularity**: Handles inputs from short sentences to long documents up to 8192 tokens.

## Retrieval Pipeline Recommendations

We recommend a hybrid retrieval + re-ranking pipeline:
- **Hybrid Retrieval**: Combines embedding retrieval with the BM25 algorithm for higher accuracy and better generalization. BGE-M3 supports both dense and sparse retrieval, so it can produce BM25-like token weights at no additional cost; see the sketch after this list.
  - Refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py) for hybrid retrieval examples.
- **Re-Ranking**: After retrieval, use cross-encoder models such as [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker) or [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) for higher accuracy.

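The snippet below is a minimal sketch of the hybrid idea, using only the FlagEmbedding calls shown in the Usage section of this README; the 0.7/0.3 weights are illustrative assumptions, not tuned values.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

query = ["What is BGE M3?"]
docs = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."]

# Encode once, requesting both dense vectors and sparse (lexical) weights.
q_out = model.encode(query, return_dense=True, return_sparse=True)
d_out = model.encode(docs, return_dense=True, return_sparse=True)

# Dense score: inner product of the two dense vectors.
dense_score = q_out['dense_vecs'][0] @ d_out['dense_vecs'][0]

# Sparse score: overlap of lexical weights, similar in spirit to BM25.
sparse_score = model.compute_lexical_matching_score(q_out['lexical_weights'][0],
                                                    d_out['lexical_weights'][0])

# Illustrative weighted combination (weights are assumptions; tune for your data).
print(0.7 * dense_score + 0.3 * sparse_score)
```

In production, this weighting is usually done inside the vector database (see the Vespa and Milvus examples above) rather than in application code.
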
## News

- **2024/7/1**: Updated MIRACL evaluation results for BGE-M3. Refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr) for details.
- **2024/3/20**: Milvus now supports hybrid retrieval with BGE-M3. See [hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- **2024/3/8**: BGE-M3 achieves top performance in multilingual benchmarks. See [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05).
- **2024/3/2**: Released unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
- **2024/2/6**: Released [MLDR](https://huggingface.co/datasets/Shitao/MLDR) dataset and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- **2024/2/1**: Vespa now supports multiple modes of BGE-M3. See [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

## Model Specifications

| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

## Data

| Dataset | Introduction |
|:-------:|:------------:|
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset covering 13 languages |
| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |

## FAQ

### 1. Introduction to Different Retrieval Methods

- **Dense Retrieval**: Maps the text into a single embedding vector.
- **Sparse Retrieval**: Represents the text as a vector of weights over the tokens that appear in it.
- **Multi-Vector Retrieval**: Uses multiple per-token vectors to represent a text.

All three representations can be produced from a single `encode` call, as the sketch below shows.

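The following sketch is a condensed rewrite of the Usage examples later in this document; the printed shapes are implementation details of FlagEmbedding and are noted here only as expectations, not guarantees.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

out = model.encode(["What is BGE M3?"],
                   return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(out['dense_vecs'].shape)       # one 1024-dimensional vector per input text
print(out['lexical_weights'][0])     # per-text mapping of token ids to weights
print(out['colbert_vecs'][0].shape)  # one vector per token of the input text
```
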
### 2. How to Use BGE-M3 in Other Projects?

For embedding retrieval, use the BGE-M3 model in the same way as BGE. For hybrid retrieval, refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

### 3. How to Fine-Tune the BGE-M3 Model?

Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) for dense embedding fine-tuning. For unified fine-tuning (dense, sparse, and colbert), refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).

## Usage

### Installation

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

or

```bash
pip install -U FlagEmbedding
```

### Generate Embedding for Text

#### Dense Embedding

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=8192)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
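
A note on the parameters above: `max_length=8192` is the model's maximum supported input length; if your texts are much shorter, passing a smaller `max_length` generally speeds up encoding, and `batch_size` can be adjusted to fit your GPU memory. This is general guidance, not a setting prescribed by this repository.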

#### Sparse Embedding (Lexical Weight)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

print(model.convert_id_to_token(output_1['lexical_weights']))

lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
```
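
To make the sparse output easier to read, the short sketch below assumes (consistent with the call above) that `convert_id_to_token` returns one token-to-weight mapping per input sentence, and simply lists the highest-weighted tokens.

```python
# Inspection helper (illustrative): show the top-weighted tokens for the first sentence.
token_weights = model.convert_id_to_token(output_1['lexical_weights'])[0]
top_tokens = sorted(token_weights.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top_tokens)  # a handful of (token, weight) pairs, typically dominated by query terms
```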

#### Multi-Vector (ColBERT)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
```
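
Here `colbert_score` compares the two sets of per-token vectors with a ColBERT-style late-interaction computation, so the first call (question vs. its matching definition) should score higher than the second (question vs. the unrelated BM25 definition).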

### Compute Score for Text Pairs

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs, max_passage_length=128, weights_for_different_modes=[0.4, 0.2, 0.4]))
```
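
The three entries of `weights_for_different_modes` weight the dense, sparse, and multi-vector (colbert) scores, respectively, in the combined score; `[0.4, 0.2, 0.4]` is the example setting used here, not a universally recommended value.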

## Evaluation

Evaluation scripts are provided for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

## Acknowledgement

Thanks to the authors of open-sourced datasets such as MIRACL, MKQA, and NarrativeQA, and of libraries such as [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).

## Citation

If you find this repository useful, please consider giving it a star :star: and citing:

```bibtex
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
 