---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
---
For more details, please refer to our GitHub repo: https://github.com/FlagOpen/FlagEmbedding

# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It supports more than 100 working languages.
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
**Some suggestions for the retrieval pipeline in RAG**

We recommend using the following pipeline: hybrid retrieval + re-ranking.
- Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
A classic example: using both embedding retrieval and the BM25 algorithm.
Now you can try BGE-M3, which supports both embedding and sparse retrieval.
This allows you to obtain token weights (similar to BM25) at no additional cost when generating dense embeddings.
To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- As cross-encoder models, re-rankers demonstrate higher accuracy than bi-encoder embedding models.
Utilizing a re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text (see the sketch after this list).
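A minimal sketch of this hybrid retrieval + re-ranking pipeline, assuming `BAAI/bge-reranker-v2-m3` as the re-ranker and arbitrary 0.6/0.4 weights for combining the dense and sparse scores; both choices are illustrative and should be tuned for your data.
```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel, FlagReranker

retriever = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

query = "What is BGE M3?"
corpus = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
    "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document",
]

q_out = retriever.encode([query], return_dense=True, return_sparse=True)
d_out = retriever.encode(corpus, return_dense=True, return_sparse=True)

# Stage 1: hybrid retrieval = weighted sum of dense similarity and lexical matching.
dense_scores = (q_out['dense_vecs'] @ d_out['dense_vecs'].T)[0]
sparse_scores = np.array([
    retriever.compute_lexical_matching_score(q_out['lexical_weights'][0], w)
    for w in d_out['lexical_weights']
])
hybrid_scores = 0.6 * dense_scores + 0.4 * sparse_scores

# Stage 2: re-rank the top candidates with a cross-encoder re-ranker.
top_k = np.argsort(-hybrid_scores)[:2]
rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_k])
for i, score in sorted(zip(top_k, rerank_scores), key=lambda x: -x[1]):
    print(round(float(score), 4), corpus[i])
```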
## News:
- 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as those from OpenAI.**
- 2024/3/2: Release the unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
- 2024/2/6: We release [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and an [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).
## Specs

- Model

| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised |
| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | multilingual; contrastive learning from bge-m3-retromae |
| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | multilingual; extends the max_length of [xlm-roberta](https://huggingface.co/FacebookAI/xlm-roberta-large) to 8192 and is further pretrained via [retromae](https://github.com/staoxiao/RetroMAE) |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |
- Data

| Dataset | Introduction |
|:----------------------------------------------------------:|:-------------------------------------------------:|
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document retrieval dataset, covering 13 languages |
| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |
## FAQ

**1. Introduction to the different retrieval methods** (a toy scoring sketch follows this list)

- Dense retrieval: map the text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906), [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding).
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720).
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
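To make the differences concrete, here is a minimal sketch with toy vectors and weights (not real model outputs) showing how each representation is typically scored:
```python
import numpy as np

# Dense retrieval: one vector per text, scored by inner product (toy 4-dim vectors).
q_dense, d_dense = np.array([0.1, 0.6, 0.2, 0.3]), np.array([0.2, 0.5, 0.1, 0.4])
dense_score = float(q_dense @ d_dense)

# Sparse retrieval: token -> weight; the score sums the products of weights
# for tokens shared by the query and the document.
q_sparse, d_sparse = {"bge": 0.5, "m3": 0.4}, {"bge": 0.3, "embedding": 0.2, "m3": 0.6}
sparse_score = sum(w * d_sparse[t] for t, w in q_sparse.items() if t in d_sparse)

# Multi-vector retrieval (ColBERT-style): one vector per token; for each query token
# take its best match among the document tokens (MaxSim), then average over query tokens.
q_vecs, d_vecs = np.random.rand(3, 4), np.random.rand(5, 4)
colbert_score = float((q_vecs @ d_vecs.T).max(axis=1).mean())

print(dense_score, sparse_score, colbert_score)
```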
**2. How to use BGE-M3 in other projects?**

For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.
For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
**3. How to fine-tune the BGE-M3 model?**

You can follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
to fine-tune the dense embedding.
If you want to fine-tune all embedding functions of M3 (dense, sparse, and ColBERT), you can refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).
## Usage

Install:
```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```
or:
```
pip install -U FlagEmbedding
```
### Generate Embedding for text

- Dense Embedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
```
You can also use sentence-transformers and Hugging Face Transformers to generate dense embeddings.
Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details; a short sentence-transformers sketch is shown below.
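The sketch below covers only the dense output (the sparse and multi-vector modes require the FlagEmbedding usage above); the sentences are reused from the example above.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
sentences = ["What is BGE M3?",
             "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."]
# normalize_embeddings=True gives unit-length vectors, so dot products are cosine similarities.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings @ embeddings.T)
```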
- Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
```
- Multi-Vector (ColBERT)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7797
# 0.4620
```
### Compute score for text pairs

Given a list of text pairs, you can get the scores computed by the different methods.
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs,
                          max_passage_length=128,  # a smaller max length leads to lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes(w) is used to do the weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
# {
#   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
#   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
#   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
#   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
# }
```
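As a sanity check, the combined entries can be reproduced from the per-mode scores. Judging from the values printed above, each combination appears to be a weighted average normalized by the sum of the weights it uses (with `[0.4, 0.2, 0.4]` the three-mode weights already sum to 1). A small sketch continuing the example above:
```python
# Recompute the combined scores from the per-mode outputs of compute_score.
scores = model.compute_score(sentence_pairs,
                             max_passage_length=128,
                             weights_for_different_modes=[0.4, 0.2, 0.4])
w_dense, w_sparse, w_colbert = 0.4, 0.2, 0.4

for d, s, c, ds, dsc in zip(scores['dense'], scores['sparse'], scores['colbert'],
                            scores['sparse+dense'], scores['colbert+sparse+dense']):
    # 'sparse+dense' averages the dense and sparse scores with their two weights...
    assert abs((w_dense * d + w_sparse * s) / (w_dense + w_sparse) - ds) < 1e-3
    # ...and 'colbert+sparse+dense' additionally folds in the ColBERT score.
    assert abs(w_dense * d + w_sparse * s + w_colbert * c - dsc) < 1e-3
```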
## Evaluation

We provide evaluation scripts for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

### Benchmarks from the open-source community

![avatar](./imgs/others.webp)

The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
For more details, please refer to the [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) and the [GitHub repo](https://github.com/Yannael/multilingual-embeddings).
### Our results

- Multilingual (MIRACL dataset)

![avatar](./imgs/miracl.jpg)

- Cross-lingual (MKQA dataset)

![avatar](./imgs/mkqa.jpg)

- Long Document Retrieval
  - MLDR:

![avatar](./imgs/long.jpg)

Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLMs.
It covers 13 languages and includes test, validation, and training sets.
We utilized the training set from MLDR to enhance the model's long-document retrieval capabilities.
Therefore, comparing baselines with `Dense w.o.long` (fine-tuned without the long-document dataset) is more equitable.
Additionally, this long-document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long-text retrieval datasets.
We believe that this data will be helpful for the open-source community in training document retrieval models.
  - NarrativeQA:

![avatar](./imgs/nqa.jpg)
- Comparison with BM25

We utilized Pyserini to implement BM25; the test results can be reproduced with this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
We tested BM25 with two different tokenizers:
one using the Lucene analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta).
The results indicate that BM25 remains a competitive baseline,
especially for long document retrieval.

![avatar](./imgs/bm25.jpg)
## Training

- Self-knowledge Distillation: combine the outputs from the different retrieval modes into a reward signal to enhance the performance of each single mode (especially sparse retrieval and multi-vector (ColBERT) retrieval).
- Efficient Batching: improve efficiency when fine-tuning on long text.
The small-batch strategy is simple but effective, and it can also be used to fine-tune large embedding models.
- MCLS: a simple method to improve performance on long text without fine-tuning.
It is useful if you don't have enough resources to fine-tune the model on long text.

Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
## Acknowledgement

Thanks to the authors of the open-source datasets, including MIRACL, MKQA, NarrativeQA, etc.
Thanks to open-source libraries like [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).
## Citation

If you find this repository useful, please consider giving it a star :star: and a citation:
```
@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```