Update README.md
README.md
CHANGED
@@ -11,174 +11,3 @@ license: mit

# Experimental Sparse Vector Repository

This repository is a fork of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) repository, aimed at creating sparse vectors. It is an experimental project based on the BGE-M3 model, which is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

For more details, please refer to the original [github repo](https://github.com/FlagOpen/FlagEmbedding).

## BGE-M3 Overview ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))

BGE-M3 is a highly versatile embedding model that supports:

- **Multi-Functionality**: Capable of dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality**: Supports over 100 languages.
- **Multi-Granularity**: Handles inputs from short sentences to long documents up to 8192 tokens.

## Retrieval Pipeline Recommendations

We recommend a hybrid retrieval + re-ranking pipeline:

- **Hybrid Retrieval**: Combines embedding retrieval with the BM25 algorithm for higher accuracy and better generalization. BGE-M3 supports both embedding and sparse retrieval, allowing it to produce token weights similar to BM25 at no additional cost; a minimal end-to-end sketch follows this list.
  - Refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py) for hybrid retrieval examples.
- **Re-Ranking**: After retrieval, re-score the candidates with cross-encoder models such as [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker) or [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker) for higher accuracy.
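
Below is a minimal sketch of this pipeline, assuming the `FlagEmbedding` package is installed (see Installation below). The `BAAI/bge-reranker-v2-m3` checkpoint, the 0.6/0.4 fusion weights, and the toy corpus are illustrative choices, not prescribed values:

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

query = "What is BGE M3?"
docs = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
    "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document",
]

q_out = model.encode([query], return_dense=True, return_sparse=True)
d_out = model.encode(docs, return_dense=True, return_sparse=True)

# Stage 1: hybrid retrieval, fusing dense similarity with sparse (lexical) matching scores.
hybrid_scores = []
for i in range(len(docs)):
    dense = float(q_out['dense_vecs'][0] @ d_out['dense_vecs'][i])
    sparse = model.compute_lexical_matching_score(q_out['lexical_weights'][0],
                                                  d_out['lexical_weights'][i])
    hybrid_scores.append(0.6 * dense + 0.4 * sparse)

# Keep the best candidates (all of them here, since the toy corpus is tiny).
candidates = sorted(range(len(docs)), key=lambda i: hybrid_scores[i], reverse=True)

# Stage 2: re-rank the candidates with a cross-encoder.
rerank_scores = reranker.compute_score([[query, docs[i]] for i in candidates])
print(list(zip(candidates, rerank_scores)))
```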

## News

- **2024/7/1**: Updated MIRACL evaluation results for BGE-M3. Refer to [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr) for details.
- **2024/3/20**: Milvus now supports hybrid retrieval with BGE-M3. See [hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- **2024/3/8**: BGE-M3 achieves top performance in multilingual benchmarks. See [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05).
- **2024/3/2**: Released unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data).
- **2024/2/6**: Released [MLDR](https://huggingface.co/datasets/Shitao/MLDR) dataset and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- **2024/2/1**: Vespa now supports multiple modes of BGE-M3. See [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).

## Model Specifications

| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | Multilingual; unified fine-tuning (dense, sparse, and colbert) |
| [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | Multilingual; contrastive learning |
| [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | Multilingual; extended max_length of xlm-roberta to 8192 |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |

## Data

| Dataset | Introduction |
|:-------:|:------------:|
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document retrieval dataset covering 13 languages |
| [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |

## FAQ

### 1. Introduction to Different Retrieval Methods

- **Dense Retrieval**: Maps the text into a single dense embedding vector.
- **Sparse Retrieval**: Produces a sparse vector of weights for the tokens present in the text (lexical matching).
- **Multi-Vector Retrieval**: Uses multiple vectors, one per token, to represent a text (as in ColBERT).
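
All three representations come from a single `encode` call. A minimal sketch, mirroring the usage examples later in this README (the dense dimension of 1024 matches the `BAAI/bge-m3` entry in the table above):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

out = model.encode(
    ["What is BGE M3?"],
    return_dense=True,        # one dense vector per text
    return_sparse=True,       # token -> weight map (lexical weights)
    return_colbert_vecs=True  # one vector per token (multi-vector / ColBERT)
)

print(out['dense_vecs'].shape)       # dense: (1, 1024)
print(out['lexical_weights'][0])     # sparse: {token_id: weight, ...}
print(out['colbert_vecs'][0].shape)  # multi-vector: (num_tokens, dim)
```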

### 2. How to Use BGE-M3 in Other Projects?

For embedding retrieval, use the BGE-M3 model similarly to BGE. For hybrid retrieval, refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).

### 3. How to Fine-Tune the BGE-M3 Model?

Follow the [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) for dense embedding fine-tuning. For unified fine-tuning (dense, sparse, and colbert), refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).
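
The fine-tuning examples expect training data in jsonl form, where each line pairs a query with positive and negative passages; see the linked examples for the authoritative description. A minimal illustrative record, assuming that query/pos/neg layout (the file name and texts are placeholders):

```python
import json

# One training record in the query / pos / neg jsonl layout used by the
# FlagEmbedding fine-tuning examples; file name and texts are placeholders.
record = {
    "query": "What is BGE M3?",
    "pos": ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."],
    "neg": ["BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"],
}

with open("toy_train_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```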

## Usage

### Installation

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

or

```bash
pip install -U FlagEmbedding
```

### Generate Embedding for Text

#### Dense Embedding

```python
from FlagEmbedding import BGEM3FlagModel

# use_fp16=True speeds up encoding at a slight cost in precision.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# Setting a smaller max_length speeds up encoding when long inputs are not needed.
embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=8192)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']

# Score every pair of sentences with the inner product of their dense embeddings.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

#### Sparse Embedding (Lexical Weight)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# Display the learned token weights (the sparse vector) with readable tokens.
print(model.convert_id_to_token(output_1['lexical_weights']))

# Score a pair of texts by the overlap of their weighted tokens.
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
```

#### Multi-Vector (ColBERT)

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# colbert_score computes a fine-grained, token-level (late-interaction) similarity.
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
```

### Compute Score for Text Pairs

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

# weights_for_different_modes weights the dense, sparse, and multi-vector scores, in that order.
print(model.compute_score(sentence_pairs, max_passage_length=128, weights_for_different_modes=[0.4, 0.2, 0.4]))
```

## Evaluation

Evaluation scripts are provided for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).

## Acknowledgement

Thanks to the authors of open-sourced datasets such as MIRACL, MKQA, and NarrativeQA, and of open-source libraries such as [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).

## Citation

If you find this repository useful, please consider giving a star :star: and citation:

```bibtex
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```