Shitao committed on
Commit a45d34f · verified · 1 Parent(s): 6a63eab

Update README.md

Files changed (1)
  1. README.md +40 -136
README.md CHANGED
@@ -2487,33 +2487,11 @@ model-index:
2487
  <h1 align="center">FlagEmbedding</h1>
2488
 
2489
 
2490
- <h4 align="center">
2491
- <p>
2492
- <a href=#model-list>Model List</a> |
2493
- <a href=#frequently-asked-questions>FAQ</a> |
2494
- <a href=#usage>Usage</a> |
2495
- <a href="#evaluation">Evaluation</a> |
2496
- <a href="#train">Train</a> |
2497
- <a href="#contact">Contact</a> |
2498
- <a href="#citation">Citation</a> |
2499
- <a href="#license">License</a>
2500
- <p>
2501
- </h4>
2502
 
2503
- For more details please refer to our Github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
2504
-
2505
- If you are looking for a model with rich semantic expression capabilities, consider choosing **BGE-EN-Mistral**. It combines the ability of in-context learning with the strengths of large models and dense retrieval, achieving outstanding results.
2506
-
2507
- **BGE-EN-Mistral** primarily demonstrates the following capabilities:
2508
- - In-context learning ability: By providing few-shot examples in the query, it can significantly enhance the model's ability to handle new tasks.
2509
- - Outstanding performance: The model has achieved state-of-the-art (SOTA) performance on both BEIR and AIR-Bench.
2510
 
2511
- We will release a technical report about **BGE-EN-Mistral** soon with more details.
2512
-
2513
- [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
2514
 
2515
  FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently:
2516
-
2517
  - **LLM-based Dense Retrieval**: BGE-EN-Mistral, BGE-Multilingual-Gemma2
2518
  - **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)
2519
  - **Fine-tuning of LM** : [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
@@ -2521,75 +2499,25 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
2521
  - **Reranker Model**: [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
2522
  - **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)
2523
 
2524
- ## News
2525
- - 7/26/2024: Release **BGE-EN-Mistral**, a Mistral-7B-based dense retriever. By integrating in-context learning abilities into the embedding model, it achieves new state-of-the-art results on both MTEB and AIR-Bench.
2526
- - 1/30/2024: Release **BGE-M3**, a new member to BGE model series! M3 stands for **M**ulti-linguality (100+ languages), **M**ulti-granularities (input length up to 8192), **M**ulti-Functionality (unification of dense, lexical, multi-vec/colbert retrieval).
2527
- It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks.
2528
- [Technical Report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf) and [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3). :fire:
2529
- - 1/9/2024: Release [Activation-Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon), an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. [Technical Report](https://arxiv.org/abs/2401.03462) :fire:
2530
- - 12/24/2023: Release **LLaRA**, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. [Technical Report](https://arxiv.org/abs/2312.15503) :fire:
2531
- - 11/23/2023: Release [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail), a method to maintain general capabilities during fine-tuning by merging multiple language models. [Technical Report](https://arxiv.org/abs/2311.13534) :fire:
2532
- - 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Technical Report](https://arxiv.org/pdf/2310.07554.pdf)
2533
- - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) and [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE have been released
2534
- - 09/12/2023: New models:
2535
- - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
2536
- - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance their retrieval ability without instruction.
2537
-
2538
-
2539
- <details>
2540
- <summary>More</summary>
2541
- <!-- ### More -->
2542
 
2543
- - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
2544
- - 08/09/2023: BGE Models are integrated into **Langchain**, you can use it like [this](#using-langchain); C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2545
- - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2546
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
2547
- - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2548
-
2549
-
2550
- </details>
2551
-
2552
-
2553
- ## Model List
2554
-
2555
- `bge` is short for `BAAI general embedding`.
2556
 
2557
- | Model | Language | | Description | query instruction for retrieval [1] |
2558
- |:--------------------------------------------------------------------------|:-------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|
2559
- | [BAAI/bge-en-mistral](https://huggingface.co/BAAI/bge-en-mistral) | English | - | A LLM-based dense retriever with in-context learning capabilities can fully leverage the model's potential based on a few shot examples(4096 tokens) | Provide instructions and few-shot examples freely based on the given task. |
2560
- | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#usage) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | |
2561
- | [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](./FlagEmbedding/llm_embedder/README.md) [Fine-tune](./FlagEmbedding/llm_embedder/README.md) | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](./FlagEmbedding/llm_embedder/README.md) |
2562
- | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2563
- | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2564
- | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2565
- | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2566
- | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2567
- | [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2568
- | [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2569
- | [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2570
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | :trophy: rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2571
- | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
2572
- | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2573
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | :trophy: rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2574
- | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2575
- | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2576
 
2577
- [1\]: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
2578
 
2579
- [2\]: Unlike the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by simpler models.
2580
- For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
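To make this retrieve-then-rerank pipeline concrete, here is a minimal sketch. It relies on the `FlagModel` and `FlagReranker` classes from the FlagEmbedding package used throughout this README; the corpus, candidate counts, and variable names are illustrative assumptions, not code from the original document.

```python
import numpy as np
from FlagEmbedding import FlagModel, FlagReranker

# Illustrative mini-corpus and query (placeholders, not from the README).
corpus = [
    "The giant panda is a bear species endemic to China.",
    "Pandas feed almost exclusively on bamboo.",
    "The summit is the highest point of a mountain.",
]
query = "what is panda?"

# Step 1: retrieve candidates with the bi-encoder embedding model.
embedder = FlagModel('BAAI/bge-large-en-v1.5',
                     query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                     use_fp16=True)
q_emb = embedder.encode_queries([query])
d_emb = embedder.encode(corpus)
scores = (q_emb @ d_emb.T)[0]
top_k = np.argsort(-scores)[:2]  # keep the top-2 here; top-100 in the scenario described above

# Step 2: re-rank the candidates with the cross-encoder reranker.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_k])
best = top_k[int(np.argmax(rerank_scores))]
print(corpus[best])
```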
 
 
2581
 
2582
- All models have been uploaded to the Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
2583
- If you cannot access the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models.
2584
 
2585
 
2586
  ## Usage
2587
 
2588
- ### Usage for Embedding Model
2589
-
2590
- Here are some examples of using the `bge-en-mistral` model with [FlagEmbedding](#using-flagembedding) or [Huggingface Transformers](#using-huggingface-transformers).
2591
-
2592
- #### Using FlagEmbedding
2593
  ```
2594
  pip install -U FlagEmbedding
2595
  ```
@@ -2610,22 +2538,21 @@ examples = [
2610
  'query': 'causes of back pain in female for a week',
2611
  'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."}
2612
  ]
2613
- model = FlagICLModel('BAAI/bge-en-mistral',
2614
  query_instruction_for_retrieval="Given a web search query, retrieve relevant passages that answer the query.",
2615
- examples_for_task=examples,
2616
  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
2617
  embeddings_1 = model.encode_queries(queries)
2618
  embeddings_2 = model.encode_corpus(documents)
2619
  similarity = embeddings_1 @ embeddings_2.T
2620
  print(similarity)
2621
  ```
2622
- For the value of the argument `query_instruction_for_retrieval`, you can refer to [e5-mistral-7b](https://huggingface.co/intfloat/e5-mistral-7b-instruct); note that we append `.` at the end of each instruction.
2623
 
2624
  By default, FlagICLModel will use all available GPUs when encoding. Please set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs.
2625
  You can also set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
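A small illustration of the GPU selection described above; the device IDs are placeholders.

```python
import os

# Restrict encoding to GPUs 0 and 1 (set this before creating the model).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Or hide all GPUs to force CPU-only encoding.
# os.environ["CUDA_VISIBLE_DEVICES"] = ""
```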
2626
 
2627
 
2628
- #### Using HuggingFace Transformers
2629
 
2630
  With the transformers package, you can use the model like this: first, you pass your input through the transformer model, then you select the last hidden state of the last token (i.e., [EOS]) as the sentence embedding.
2631
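The pooling code itself falls outside this hunk; as a sketch, assuming last-token pooling in the style of e5-mistral-7b-instruct (the `last_token_pool` helper below is an assumption, not necessarily the exact code in the full README), it could look like this:

```python
import torch
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position is always the last real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise, pick the last non-padding token of each sequence.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
```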
 
@@ -2669,7 +2596,7 @@ queries = [
2669
  examples_prefix + get_detailed_instruct(task, 'how much protein should a female eat'),
2670
  examples_prefix + get_detailed_instruct(task, 'summit define')
2671
  ]
2672
- # No need to add instruction for retrieval documents
2673
  documents = [
2674
  "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
2675
  "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
@@ -2695,52 +2622,9 @@ print(scores.tolist())
2695
  ```
2696
 
2697
 
2698
- ### Usage for Reranker
2699
-
2700
- Unlike the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
2701
- You can get a relevance score by feeding a query and a passage to the reranker.
2702
- The reranker is optimized with cross-entropy loss, so the relevance score is not bounded to a specific range.
2703
-
2704
-
2705
- #### Using FlagEmbedding
2706
- ```
2707
- pip install -U FlagEmbedding
2708
- ```
2709
-
2710
- Get relevance scores (higher scores indicate more relevance):
2711
- ```python
2712
- from FlagEmbedding import FlagReranker
2713
- reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
2714
-
2715
- score = reranker.compute_score(['query', 'passage'])
2716
- print(score)
2717
-
2718
- scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
2719
- print(scores)
2720
- ```
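Because the raw scores are unbounded (as noted above), you can map them to the (0, 1) range with a sigmoid if you prefer probability-like values; this post-processing step is a suggestion rather than part of the original snippet.

```python
import math

def to_probability(score: float) -> float:
    # Map an unbounded relevance score to the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-score))

print([round(to_probability(s), 4) for s in [-2.3, 0.0, 5.1]])  # example values, not real model output
```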
2721
-
2722
-
2723
- #### Using Huggingface transformers
2724
-
2725
- ```python
2726
- import torch
2727
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
2728
-
2729
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
2730
- model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
2731
- model.eval()
2732
-
2733
- pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
2734
- with torch.no_grad():
2735
- inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
2736
- scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
2737
- print(scores)
2738
- ```
2739
-
2740
  ## Evaluation
2741
 
2742
  `bge-en-mistral` achieves **state-of-the-art performance on both the MTEB and AIR-Bench leaderboards!**
2743
- For more details and evaluation tools see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md).
2744
 
2745
  - **MTEB**:
2746
 
@@ -2808,11 +2692,32 @@ For more details and evaluation tools see our [scripts](https://github.com/FlagO
2808
  | **bge-en-mistral few-shot** | **79.63** | **79.36** | **74.80** | 67.79 | **74.83** |
2809

 
2811
 
2812
- ## Contact
2813
 
2814
- If you have any questions or suggestions related to this project, feel free to open an issue or pull request.
2815
- You can also email Shitao Xiao ([email protected]) and Zheng Liu ([email protected]).
2816
 
2817
 
2818
  ## Citation
@@ -2831,5 +2736,4 @@ If you find this repository useful, please consider giving a star :star: and cit
2831
  ```
2832
 
2833
  ## License
2834
- FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
2835
-
 
2487
  <h1 align="center">FlagEmbedding</h1>
2488
 
2489

2490

2491
 
2492
+ For more details, please refer to our GitHub repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
 
 
2493
 
2494
  FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently:
 
2495
  - **LLM-based Dense Retrieval**: BGE-EN-Mistral, BGE-Multilingual-Gemma2
2496
  - **Long-Context LLM**: [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)
2497
  - **Fine-tuning of LM** : [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
 
2499
  - **Reranker Model**: [BGE Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
2500
  - **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)
2501

2502
 
2503
+ **BGE-EN-Mistral** primarily demonstrates the following capabilities:
2504
+ - In-context learning ability: providing few-shot examples in the query can significantly enhance the model's ability to handle new tasks.
2505
+ - Outstanding performance: The model has achieved state-of-the-art (SOTA) performance on both BEIR and AIR-Bench.

2506

2507
 
2508
+ ## 📑 Open-source Plan
2509
 
2510
+ - [x] Checkpoint
2511
+ - [ ] Training Data
2512
+ - [ ] Evaluation Pipeline
2513
+ - [ ] Technical Report
2514
 
2515
+ We will release the technical report and training data for **BGE-EN-Mistral** in the future.
 
2516
 
2517
 
2518
  ## Usage
2519
 
2520
+ ### Using FlagEmbedding
 
 
 
 
2521
  ```
2522
  pip install -U FlagEmbedding
2523
  ```
 
2538
  'query': 'causes of back pain in female for a week',
2539
  'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."}
2540
  ]
2541
+ model = FlagICLModel('BAAI/bge-en-icl',
2542
  query_instruction_for_retrieval="Given a web search query, retrieve relevant passages that answer the query.",
2543
+ examples_for_task=examples, # set `examples_for_task=None` to use model without examples
2544
  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
2545
  embeddings_1 = model.encode_queries(queries)
2546
  embeddings_2 = model.encode_corpus(documents)
2547
  similarity = embeddings_1 @ embeddings_2.T
2548
  print(similarity)
2549
  ```
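Following the `examples_for_task=None` note in the snippet above, a zero-shot variant would look roughly like this; treat it as a sketch under the same `FlagICLModel` API rather than an official example.

```python
from FlagEmbedding import FlagICLModel

# Zero-shot usage: keep the task instruction but omit the in-context examples.
model = FlagICLModel('BAAI/bge-en-icl',
                     query_instruction_for_retrieval="Given a web search query, retrieve relevant passages that answer the query.",
                     examples_for_task=None,  # no few-shot examples
                     use_fp16=True)

embeddings = model.encode_queries(["how much protein should a female eat"])
```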
 
2550
 
2551
  By default, FlagICLModel will use all available GPUs when encoding. Please set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs.
2552
  You can also set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
2553
 
2554
 
2555
+ ### Using HuggingFace Transformers
2556
 
2557
  With the transformers package, you can use the model like this: first, you pass your input through the transformer model, then you select the last hidden state of the last token (i.e., [EOS]) as the sentence embedding.
2558
 
 
2596
  examples_prefix + get_detailed_instruct(task, 'how much protein should a female eat'),
2597
  examples_prefix + get_detailed_instruct(task, 'summit define')
2598
  ]
2599
+ # No need to add instructions for documents
2600
  documents = [
2601
  "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
2602
  "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
 
2622
  ```
2623
 
2624

2625
  ## Evaluation
2626
 
2627
  `bge-en-mistral` achieves **state-of-the-art performance on both the MTEB and AIR-Bench leaderboards!**
 
2628
 
2629
  - **MTEB**:
2630
 
 
2692
  | **bge-en-mistral few-shot** | **79.63** | **79.36** | **74.80** | 67.79 | **74.83** |
2693
 
2694
 
2695
+ ## Model List
2696
+
2697
+ `bge` is short for `BAAI general embedding`.
2698
+
2699
+ | Model | Language | | Description | query instruction for retrieval [1] |
2700
+ |:--------------------------------------------------------------------------|:-------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|
2701
+ | [BAAI/bge-en-mistral](https://huggingface.co/BAAI/bge-en-mistral) | English | - | An LLM-based dense retriever with in-context learning capabilities that can fully leverage the model's potential based on few-shot examples (4096 tokens) | Provide instructions and few-shot examples freely based on the given task. |
2702
+ | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [Inference](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#usage) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | |
2703
+ | [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](./FlagEmbedding/llm_embedder/README.md) [Fine-tune](./FlagEmbedding/llm_embedder/README.md) | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](./FlagEmbedding/llm_embedder/README.md) |
2704
+ | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2705
+ | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2706
+ | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2707
+ | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2708
+ | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2709
+ | [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2710
+ | [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2711
+ | [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
2712
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | :trophy: rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2713
+ | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
2714
+ | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2715
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | :trophy: rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2716
+ | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2717
+ | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2718
+
2719
 
 
2720
 
 
 
2721
 
2722
 
2723
  ## Citation
 
2736
  ```
2737
 
2738
  ## License
2739
+ FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE).