writinwaters and jinhai-2012 committed on
Commit
4c39067
·
1 Parent(s): a4c4dfd

DRAFT: Updated chunk APIs (#2901)


### What problem does this PR solve?



### Type of change


- [x] Documentation Update

---------

Signed-off-by: Jin Hai <[email protected]>
Co-authored-by: Jin Hai <[email protected]>

Files changed (2)
  1. api/http_api.md +3 -3
  2. api/python_api_reference.md +228 -139
api/http_api.md CHANGED
@@ -37,7 +37,7 @@ Creates a dataset.
37
  # "name": name is required and can't be duplicated.
38
  # "tenant_id": tenant_id must not be provided.
39
  # "embedding_model": embedding_model must not be provided.
40
- # "navie" means general.
41
  curl --request POST \
42
  --url http://{address}/api/v1/dataset \
43
  --header 'Content-Type: application/json' \
@@ -236,7 +236,7 @@ Updates a dataset by its id.
236
  # "chunk_count": If you update chunk_count, it can't be changed.
237
  # "document_count": If you update document_count, it can't be changed.
238
  # "parse_method": If you update parse_method, chunk_count must be 0.
239
- # "navie" means general.
240
  curl --request PUT \
241
  --url http://{address}/api/v1/dataset/{dataset_id} \
242
  --header 'Content-Type: application/json' \
@@ -247,7 +247,7 @@ curl --request PUT \
247
  "embedding_model": "BAAI/bge-zh-v1.5",
248
  "chunk_count": 0,
249
  "document_count": 0,
250
- "parse_method": "navie"
251
  }'
252
  ```
253
 
 
37
  # "name": name is required and can't be duplicated.
38
  # "tenant_id": tenant_id must not be provided.
39
  # "embedding_model": embedding_model must not be provided.
40
+ # "naive" means general.
41
  curl --request POST \
42
  --url http://{address}/api/v1/dataset \
43
  --header 'Content-Type: application/json' \
 
236
  # "chunk_count": If you update chunk_count, it can't be changed.
237
  # "document_count": If you update document_count, it can't be changed.
238
  # "parse_method": If you update parse_method, chunk_count must be 0.
239
+ # "naive" means general.
240
  curl --request PUT \
241
  --url http://{address}/api/v1/dataset/{dataset_id} \
242
  --header 'Content-Type: application/json' \
 
247
  "embedding_model": "BAAI/bge-zh-v1.5",
248
  "chunk_count": 0,
249
  "document_count": 0,
250
+ "parse_method": "naive"
251
  }'
252
  ```
253
 
api/python_api_reference.md CHANGED
@@ -3,10 +3,10 @@
3
  **THE API REFERENCES BELOW ARE STILL UNDER DEVELOPMENT.**
4
 
5
  :::tip NOTE
6
- Knowledge Base Management
7
  :::
8
 
9
- ## Create knowledge base
10
 
11
  ```python
12
  RAGFlow.create_dataset(
@@ -17,12 +17,12 @@ RAGFlow.create_dataset(
17
  permission: str = "me",
18
  document_count: int = 0,
19
  chunk_count: int = 0,
20
- parse_method: str = "naive",
21
  parser_config: DataSet.ParserConfig = None
22
  ) -> DataSet
23
  ```
24
 
25
- Creates a knowledge base (dataset).
26
 
27
  ### Parameters
28
 
@@ -42,38 +42,24 @@ The unique name of the dataset to create. It must adhere to the following requir
42
 
43
  Base64 encoding of the avatar. Defaults to `""`
44
 
45
- #### description
46
-
47
- #### tenant_id: `str`
48
-
49
- The id of the tenant associated with the created dataset is used to identify different users. Defaults to `None`.
50
-
51
- - If creating a dataset, tenant_id must not be provided.
52
- - If updating a dataset, tenant_id can't be changed.
53
-
54
  #### description: `str`
55
 
56
- The description of the created dataset. Defaults to `""`.
57
 
58
  #### language: `str`
59
 
60
- The language setting of the created dataset. Defaults to `"English"`. ????????????
61
-
62
- #### permission
63
-
64
- Specify who can operate on the dataset. Defaults to `"me"`.
65
 
66
- #### document_count: `int`
 
67
 
68
- The number of documents associated with the dataset. Defaults to `0`.
69
-
70
- #### chunk_count: `int`
71
 
72
- The number of data chunks generated or processed by the created dataset. Defaults to `0`.
73
 
74
- #### parse_method, `str`
75
 
76
- The method used by the dataset to parse and process data. Defaults to `"naive"`.
77
 
78
  #### parser_config
79
 
@@ -100,19 +86,19 @@ ds = rag_object.create_dataset(name="kb_1")
100
 
101
  ---
102
 
103
- ## Delete knowledge bases
104
 
105
  ```python
106
  RAGFlow.delete_datasets(ids: list[str] = None)
107
  ```
108
 
109
- Deletes knowledge bases by name or ID.
110
 
111
  ### Parameters
112
 
113
  #### ids
114
 
115
- The IDs of the knowledge bases to delete.
116
 
117
  ### Returns
118
 
@@ -127,7 +113,7 @@ rag.delete_datasets(ids=["id_1","id_2"])
127
 
128
  ---
129
 
130
- ## List knowledge bases
131
 
132
  ```python
133
  RAGFlow.list_datasets(
@@ -140,7 +126,7 @@ RAGFlow.list_datasets(
140
  ) -> list[DataSet]
141
  ```
142
 
143
- Retrieves a list of knowledge bases.
144
 
145
  ### Parameters
146
 
@@ -158,7 +144,7 @@ The field by which the records should be sorted. This specifies the attribute or
158
 
159
  #### desc: `bool`
160
 
161
- Whether the sorting should be in descending order. Defaults to `True`.
162
 
163
  #### id: `str`
164
 
@@ -170,19 +156,19 @@ The name of the dataset to be got. Defaults to `None`.
170
 
171
  ### Returns
172
 
173
- - Success: A list of `DataSet` objects representing the retrieved knowledge bases.
174
  - Failure: `Exception`.
175
 
176
  ### Examples
177
 
178
- #### List all knowledge bases
179
 
180
  ```python
181
  for ds in rag_object.list_datasets():
182
  print(ds)
183
  ```
184
 
185
- #### Retrieve a knowledge base by ID
186
 
187
  ```python
188
  dataset = rag_object.list_datasets(id = "id_1")
@@ -191,23 +177,22 @@ print(dataset[0])
191
 
192
  ---
193
 
194
- ## Update knowledge base
195
 
196
  ```python
197
  DataSet.update(update_message: dict)
198
  ```
199
 
200
- Updates the current knowledge base.
201
 
202
  ### Parameters
203
 
204
  #### update_message: `dict[str, str|int]`, *Required*
205
 
206
- - `"name"`: `str` The name of the knowledge base to update.
207
- - `"tenant_id"`: `str` The `"tenant_id` you get after calling `create_dataset()`. ?????????????????????
208
  - `"embedding_model"`: `str` The embedding model for generating vector embeddings.
209
  - Ensure that `"chunk_count"` is `0` before updating `"embedding_model"`.
210
- - `"parser_method"`: `str` The default parsing method for the knowledge base.
211
  - `"naive"`: General
212
  - `"manual"`: Manual
213
  - `"qa"`: Q&A
@@ -233,22 +218,24 @@ from ragflow import RAGFlow
233
 
234
  rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
235
  dataset = rag.list_datasets(name="kb_name")
236
- dataset.update({"embedding_model":"BAAI/bge-zh-v1.5", "parse_method":"manual"})
237
  ```
238
 
239
  ---
240
 
241
  :::tip API GROUPING
242
- File Management within Knowledge Base
243
  :::
244
 
 
 
245
  ## Upload documents
246
 
247
  ```python
248
  DataSet.upload_documents(document_list: list[dict])
249
  ```
250
 
251
- Updloads documents to the current knowledge base.
252
 
253
  ### Parameters
254
 
@@ -256,9 +243,8 @@ Updloads documents to the current knowledge base.
256
 
257
  A list of dictionaries representing the documents to upload, each containing the following keys:
258
 
259
- - `"name"`: (Optional) File path to the document to upload.
260
- Ensure that each file path has a suffix.
261
- - `"blob"`: (Optional) The document to upload in binary format.
262
 
263
  ### Returns
264
 
@@ -268,8 +254,8 @@ A list of dictionaries representing the documents to upload, each containing the
268
  ### Examples
269
 
270
  ```python
271
- dataset = rag.create_dataset(name="kb_name")
272
- dataset.upload_documents([{"name": "1.txt", "blob": "123"}])
273
  ```
274
 
275
  ---
@@ -284,9 +270,27 @@ Updates configurations for the current document.
284
 
285
  ### Parameters
286
 
287
- #### update_message: `dict[str, str|int]`, *Required*
288
 
289
- only `name`, `parser_config`, and `parser_method` can be changed
290
 
291
  ### Returns
292
 
@@ -303,7 +307,7 @@ dataset=rag.list_datasets(id='id')
303
  dataset=dataset[0]
304
  doc = dataset.list_documents(id="wdfxb5t547d")
305
  doc = doc[0]
306
- doc.update([{"parser_method": "manual"}])
307
  ```
308
 
309
  ---
@@ -314,19 +318,21 @@ doc.update([{"parser_method": "manual"}])
314
  Document.download() -> bytes
315
  ```
316
 
 
 
317
  ### Returns
318
 
319
- Bytes of the document.
320
 
321
  ### Examples
322
 
323
  ```python
324
  from ragflow import RAGFlow
325
 
326
- rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
327
- ds=rag.list_datasets(id="id")
328
- ds=ds[0]
329
- doc = ds.list_documents(id="wdfxb5t547d")
330
  doc = doc[0]
331
  open("./ragflow.txt", "wb+").write(doc.download())
332
  print(doc)
@@ -340,15 +346,17 @@ print(doc)
340
  Dataset.list_documents(id: str = None, keywords: str = None, offset: int = 0, limit: int = 1024, orderby: str = "create_time", desc: bool = True) -> list[Document]
341
  ```
342
 
 
 
343
  ### Parameters
344
 
345
  #### id
346
 
347
- The id of the document to retrieve.
348
 
349
  #### keywords
350
 
351
- List documents whose name has the given keywords. Defaults to `None`.
352
 
353
  #### offset
354
 
@@ -360,11 +368,14 @@ Records number to return, `-1` means all of them. Records number to return, `-1`
360
 
361
  #### orderby
362
 
363
- The field by which the records should be sorted. This specifies the attribute or column used to order the results.
364
 
365
  #### desc
366
 
367
- A boolean flag indicating whether the sorting should be in descending order.
368
 
369
  ### Returns
370
 
@@ -375,8 +386,8 @@ A `Document` object contains the following attributes:
375
 
376
  - `id` Id of the retrieved document. Defaults to `""`.
377
  - `thumbnail` Thumbnail image of the retrieved document. Defaults to `""`.
378
- - `knowledgebase_id` Knowledge base ID related to the document. Defaults to `""`.
379
- - `parser_method` Method used to parse the document. Defaults to `""`.
380
  - `parser_config`: `ParserConfig` Configuration object for the parser. Defaults to `None`.
381
  - `source_type`: Source type of the document. Defaults to `""`.
382
  - `type`: Type or category of the document. Defaults to `""`.
@@ -414,7 +425,7 @@ for d in dataset.list_documents(keywords="rag", offset=0, limit=12):
414
  DataSet.delete_documents(ids: list[str] = None)
415
  ```
416
 
417
- Deletes specified documents or all documents from the current knowledge base.
418
 
419
  ### Returns
420
 
@@ -434,11 +445,10 @@ ds.delete_documents(ids=["id_1","id_2"])
434
 
435
  ---
436
 
437
- ## Parse and stop parsing document
438
 
439
  ```python
440
  DataSet.async_parse_documents(document_ids:list[str]) -> None
441
- DataSet.async_cancel_parse_documents(document_ids:list[str])-> None
442
  ```
443
 
444
  ### Parameters
@@ -476,6 +486,47 @@ print("Async bulk parsing cancelled")
476
 
477
  ---
478
479
  ## List chunks
480
 
481
  ```python
@@ -590,13 +641,18 @@ doc.delete_chunks(["id_1","id_2"])
590
  ```python
591
  Chunk.update(update_message: dict)
592
  ```
 
 
 
593
  ### Parameters
594
 
595
- #### update_message: *Required*
596
 
597
- - `content`: `str` Contains the main text or information of the chunk
598
- - `important_keywords`: `list[str]` List the key terms or phrases that are significant or central to the chunk's content
599
- - `available`: `int` Indicating the availability status, `0` means unavailable and `1` means available
 
 
600
 
601
  ### Returns
602
 
@@ -608,8 +664,8 @@ Chunk.update(update_message: dict)
608
  ```python
609
  from ragflow import RAGFlow
610
 
611
- rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
612
- dataset = rag.list_datasets(id="123")
613
  dataset = dataset[0]
614
  doc = dataset.list_documents(id="wdfxb5t547d")
615
  doc = doc[0]
@@ -619,7 +675,7 @@ chunk.update({"content":"sdfx..."})
619
 
620
  ---
621
 
622
- ## Retrieval
623
 
624
  ```python
625
  RAGFlow.retrieve(question: str = "", datasets: list[str] = None, document: list[str] = None, offset: int = 1, limit: int = 30, similarity_threshold: float = 0.2, vector_similarity_weight: float = 0.3, top_k: int = 1024, rerank_id: str = None, keyword: bool = False, highlight: bool = False) -> list[Chunk]
@@ -627,25 +683,25 @@ RAGFlow.retrieve(question:str="", datasets:list[str]=None, document=list[str]=No
627
 
628
  ### Parameters
629
 
630
- #### question: `str`, *Required*
631
 
632
  The user query or query keywords. Defaults to `""`.
633
 
634
- #### datasets: `list[Dataset]`, *Required*
635
 
636
- The scope of datasets.
637
 
638
- #### document: `list[Document]`
639
 
640
- The scope of document. `None` means no limitation. Defaults to `None`.
641
 
642
  #### offset: `int`
643
 
644
- The beginning point of retrieved records. Defaults to `0`.
645
 
646
  #### limit: `int`
647
 
648
- The maximum number of records needed to return. Defaults to `6`.
649
 
650
  #### similarity_threshold: `float`
651
 
@@ -653,40 +709,47 @@ The minimum similarity score. Defaults to `0.2`.
653
 
654
  #### vector_similarity_weight: `float`
655
 
656
- The weight of vector cosine similarity, 1 - x is the term similarity weight. Defaults to `0.3`.
657
 
658
  #### top_k: `int`
659
 
660
- Number of records engaged in vector cosine computaton. Defaults to `1024`.
 
 
661
 
662
- #### rerank_id:`str`
663
- ID of the rerank model. Defaults to `None`.
664
 
665
- #### keyword:`bool`
666
- Indicating whether keyword-based matching is enabled (True) or disabled (False).
667
 
668
  #### highlight: `bool`
669
 
670
  Specifies whether to highlight matched terms in the results: `True` enables highlighting; `False` disables it.
 
671
  ### Returns
672
 
673
- list[Chunk]
 
674
 
675
  ### Examples
676
 
677
  ```python
678
  from ragflow import RAGFlow
679
 
680
- rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
681
- ds = rag.list_datasets(name="ragflow")
682
  ds = ds[0]
683
  name = 'ragflow_test.txt'
684
  path = './test_data/ragflow_test.txt'
685
- rag.create_document(ds, name=name, blob=open(path, "rb").read())
686
  doc = ds.list_documents(name=name)
687
  doc = doc[0]
688
  ds.async_parse_documents([doc.id])
689
- for c in rag.retrieve(question="What's ragflow?",
690
  datasets=[ds.id], documents=[doc.id],
691
  offset=1, limit=30, similarity_threshold=0.2,
692
  vector_similarity_weight=0.3,
@@ -705,9 +768,9 @@ Chat Assistant Management
705
 
706
  ```python
707
  RAGFlow.create_chat(
708
- name: str = "assistant",
709
- avatar: str = "path",
710
- knowledgebases: list[DataSet] = [],
711
  llm: Chat.LLM = None,
712
  prompt: Chat.Prompt = None
713
  ) -> Chat
@@ -715,51 +778,74 @@ RAGFlow.create_chat(
715
 
716
  Creates a chat assistant.
717
 
718
- ### Returns
719
-
720
- - Success: A `Chat` object representing the chat assistant.
721
- - Failure: `Exception`
722
 
723
  The following shows the attributes of a `Chat` object:
724
 
725
- - `name`: `str` The name of the chat assistant. Defaults to `"assistant"`.
726
- - `avatar`: `str` Base64 encoding of the avatar. Defaults to `""`.
727
- - `knowledgebases`: `list[str]` The associated knowledge bases. Defaults to `["kb1"]`.
728
- - `llm`: `LLM` The llm of the created chat. Defaults to `None`. When the value is `None`, a dictionary with the following values will be generated as the default.
729
- - `model_name`, `str`
730
- The chat model name. If it is `None`, the user's default chat model will be returned.
731
- - `temperature`, `float`
732
- Controls the randomness of the model's predictions. A lower temperature increases the model's conficence in its responses; a higher temperature increases creativity and diversity. Defaults to `0.1`.
733
- - `top_p`, `float`
734
- Also known as “nucleus sampling”, this parameter sets a threshold to select a smaller set of words to sample from. It focuses on the most likely words, cutting off the less probable ones. Defaults to `0.3`
735
- - `presence_penalty`, `float`
736
- This discourages the model from repeating the same information by penalizing words that have already appeared in the conversation. Defaults to `0.2`.
737
- - `frequency penalty`, `float`
738
- Similar to the presence penalty, this reduces the model’s tendency to repeat the same words frequently. Defaults to `0.7`.
739
- - `max_token`, `int`
740
- This sets the maximum length of the model’s output, measured in the number of tokens (words or pieces of words). Defaults to `512`.
741
- - `Prompt`: `Prompt` Instructions for the LLM to follow.
742
- - `"similarity_threshold"`: `float` A similarity score to evaluate distance between two lines of text. It's weighted keywords similarity and vector cosine similarity. If the similarity between query and chunk is less than this threshold, the chunk will be filtered out. Defaults to `0.2`.
743
- - `"keywords_similarity_weight"`: `float` It's weighted keywords similarity and vector cosine similarity or rerank score (0~1). Defaults to `0.7`.
744
- - `"top_n"`: `int` Not all the chunks whose similarity score is above the 'similarity threshold' will be feed to LLMs. LLM can only see these 'Top N' chunks. Defaults to `8`.
745
- - `"variables"`: `list[dict[]]` If you use dialog APIs, the variables might help you chat with your clients with different strategies. The variables are used to fill in the 'System' part in prompt in order to give LLM a hint. The 'knowledge' is a very special variable which will be filled-in with the retrieved chunks. All the variables in 'System' should be curly bracketed. Defaults to `[{"key": "knowledge", "optional": True}]`
746
- - `"rerank_model"`: `str` If it is not specified, vector cosine similarity will be used; otherwise, reranking score will be used. Defaults to `""`.
747
- - `"empty_response"`: `str` If nothing is retrieved in the knowledge base for the user's question, this will be used as the response. To allow the LLM to improvise when nothing is retrieved, leave this blank. Defaults to `None`.
748
  - `"opener"`: `str` The opening greeting for the user. Defaults to `"Hi! I am your assistant, can I help you?"`.
749
  - `"show_quote"`: `bool` Indicates whether the source of the text should be displayed. Defaults to `True`.
750
- - `"prompt"`: `str` The prompt content. Defaults to `You are an intelligent assistant. Please summarize the content of the knowledge base to answer the question. Please list the data in the knowledge base and answer in detail. When all knowledge base content is irrelevant to the question, your answer must include the sentence "The answer you are looking for is not found in the knowledge base!" Answers need to consider chat history.
751
  Here is the knowledge base:
752
  {knowledge}
753
  The above is the knowledge base.`.
754
755
  ### Examples
756
 
757
  ```python
758
  from ragflow import RAGFlow
759
 
760
  rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
761
- knowledge_base = rag.list_datasets(name="kb_1")
762
- assistant = rag.create_chat("Miss R", knowledgebases=knowledge_base)
763
  ```
764
 
765
  ---
@@ -778,7 +864,7 @@ Updates the current chat assistant.
778
 
779
  - `"name"`: `str` The name of the chat assistant to update.
780
  - `"avatar"`: `str` Base64 encoding of the avatar. Defaults to `""`
781
- - `"knowledgebases"`: `list[str]` Knowledge bases to update.
782
  - `"llm"`: `dict` The LLM settings:
783
  - `"model_name"`, `str` The chat model name.
784
  - `"temperature"`, `float` Controls the randomness of the model's predictions.
@@ -792,7 +878,7 @@ Updates the current chat assistant.
792
  - `"top_n"`: `int` Not all the chunks whose similarity score is above the 'similarity threshold' will be feed to LLMs. LLM can only see these 'Top N' chunks. Defaults to `8`.
793
  - `"variables"`: `list[dict[]]` If you use dialog APIs, the variables might help you chat with your clients with different strategies. The variables are used to fill in the 'System' part in prompt in order to give LLM a hint. The 'knowledge' is a very special variable which will be filled-in with the retrieved chunks. All the variables in 'System' should be curly bracketed. Defaults to `[{"key": "knowledge", "optional": True}]`
794
  - `"rerank_model"`: `str` If it is not specified, vector cosine similarity will be used; otherwise, reranking score will be used. Defaults to `""`.
795
- - `"empty_response"`: `str` If nothing is retrieved in the knowledge base for the user's question, this will be used as the response. To allow the LLM to improvise when nothing is retrieved, leave this blank. Defaults to `None`.
796
  - `"opener"`: `str` The opening greeting for the user. Defaults to `"Hi! I am your assistant, can I help you?"`.
797
  - `"show_quote"`: `bool` Indicates whether the source of the text should be displayed. Defaults to `True`.
798
  - `"prompt"`: `str` The prompt content. Defaults to `You are an intelligent assistant. Please summarize the content of the knowledge base to answer the question. Please list the data in the knowledge base and answer in detail. When all knowledge base content is irrelevant to the question, your answer must include the sentence "The answer you are looking for is not found in the knowledge base!" Answers need to consider chat history.
@@ -879,7 +965,7 @@ The attribute by which the results are sorted. Defaults to `"create_time"`.
879
 
880
  #### desc
881
 
882
- Indicates whether to sort the results in descending order. Defaults to `True`.
883
 
884
  #### id: `string`
885
 
@@ -1017,25 +1103,25 @@ The content of the message. Defaults to `"Hi! I am your assistant, can I help yo
1017
 
1018
  A list of `Chunk` objects representing references to the message, each containing the following attributes:
1019
 
1020
- - **id**: `str`
1021
  The chunk ID.
1022
- - **content**: `str`
1023
  The content of the chunk.
1024
- - **image_id**: `str`
1025
  The ID of the snapshot of the chunk.
1026
- - **document_id**: `str`
1027
  The ID of the referenced document.
1028
- - **document_name**: `str`
1029
  The name of the referenced document.
1030
- - **position**: `list[str]`
1031
  The location information of the chunk within the referenced document.
1032
- - **knowledgebase_id**: `str`
1033
- The ID of the knowledge base to which the referenced document belongs.
1034
- - **similarity**: `float`
1035
  A composite similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity.
1036
- - **vector_similarity**: `float`
1037
  A vector similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity between vector embeddings.
1038
- - **term_similarity**: `float`
1039
  A keyword similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity between keywords.
1040
 
1041
 
@@ -1091,11 +1177,14 @@ The number of records on each page. Defaults to `1024`.
1091
 
1092
  #### orderby
1093
 
1094
- The field by which the records should be sorted. This specifies the attribute or column used to sort the results. Defaults to `"create_time"`.
1095
 
1096
  #### desc
1097
 
1098
- Whether the sorting should be in descending order. Defaults to `True`.
1099
 
1100
  #### id
1101
 
 
3
  **THE API REFERENCES BELOW ARE STILL UNDER DEVELOPMENT.**
4
 
5
  :::tip NOTE
6
+ Dataset Management
7
  :::
8
 
9
+ ## Create dataset
10
 
11
  ```python
12
  RAGFlow.create_dataset(
 
17
  permission: str = "me",
18
  document_count: int = 0,
19
  chunk_count: int = 0,
20
+ chunk_method: str = "naive",
21
  parser_config: DataSet.ParserConfig = None
22
  ) -> DataSet
23
  ```
24
 
25
+ Creates a dataset.
26
 
27
  ### Parameters
28
 
 
42
 
43
  Base64 encoding of the avatar. Defaults to `""`
44
45
  #### description: `str`
46
 
47
+ A brief description of the dataset to create. Defaults to `""`.
48
 
49
  #### language: `str`
50
 
51
+ The language setting of the dataset to create. Available options:
52
 
53
+ - `"English"` (Default)
54
+ - `"Chinese"`
55
 
56
+ #### permission
 
 
57
 
58
+ Specifies who can operate on the dataset. You can set it only to `"me"` for now.
59
 
60
+ #### chunk_method: `str`
61
 
62
+ The default parsing method of the dataset. Defaults to `"naive"`.
63
 
64
  #### parser_config
65
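For reference, a minimal sketch of a call that sets several of the parameters above (hypothetical values; assumes an authenticated `RAGFlow` client named `rag_object`):

```python
from ragflow import RAGFlow

# Placeholder credentials; replace with your own.
rag_object = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")

# Create a Chinese-language dataset that chunks documents with the
# default "naive" (general) method.
dataset = rag_object.create_dataset(
    name="kb_1",
    description="A sample dataset",
    language="Chinese",
    chunk_method="naive",
)
```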
 
 
86
 
87
  ---
88
 
89
+ ## Delete datasets
90
 
91
  ```python
92
  RAGFlow.delete_datasets(ids: list[str] = None)
93
  ```
94
 
95
+ Deletes datasets by name or ID.
96
 
97
  ### Parameters
98
 
99
  #### ids
100
 
101
+ The IDs of the datasets to delete.
102
 
103
  ### Returns
104
 
 
113
 
114
  ---
115
 
116
+ ## List datasets
117
 
118
  ```python
119
  RAGFlow.list_datasets(
 
126
  ) -> list[DataSet]
127
  ```
128
 
129
+ Retrieves a list of datasets.
130
 
131
  ### Parameters
132
 
 
144
 
145
  #### desc: `bool`
146
 
147
+ Indicates whether the retrieved datasets should be sorted in descending order. Defaults to `True`.
148
 
149
  #### id: `str`
150
 
 
156
 
157
  ### Returns
158
 
159
+ - Success: A list of `DataSet` objects representing the retrieved datasets.
160
  - Failure: `Exception`.
161
 
162
  ### Examples
163
 
164
+ #### List all datasets
165
 
166
  ```python
167
  for ds in rag_object.list_datasets():
168
  print(ds)
169
  ```
170
 
171
+ #### Retrieve a dataset by ID
172
 
173
  ```python
174
  dataset = rag_object.list_datasets(id = "id_1")
 
177
 
178
  ---
179
 
180
+ ## Update dataset
181
 
182
  ```python
183
  DataSet.update(update_message: dict)
184
  ```
185
 
186
+ Updates the current dataset.
187
 
188
  ### Parameters
189
 
190
  #### update_message: `dict[str, str|int]`, *Required*
191
 
192
+ - `"name"`: `str` The name of the dataset to update.
 
193
  - `"embedding_model"`: `str` The embedding model for generating vector embeddings.
194
  - Ensure that `"chunk_count"` is `0` before updating `"embedding_model"`.
195
+ - `"chunk_method"`: `str` The default parsing method for the dataset.
196
  - `"naive"`: General
197
  - `"manual"`: Manual
198
  - `"qa"`: Q&A
 
218
 
219
  rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
220
  dataset = rag.list_datasets(name="kb_name")
221
+ dataset.update({"embedding_model":"BAAI/bge-zh-v1.5", "chunk_method":"manual"})
222
  ```
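The same call can update other permitted fields, for example the dataset name (a sketch; assumes the `dataset` object from the example above):

```python
# Rename the dataset; "name" is one of the updatable keys listed above.
dataset.update({"name": "kb_name_renamed"})
```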
223
 
224
  ---
225
 
226
  :::tip API GROUPING
227
+ File Management within Dataset
228
  :::
229
 
230
+ ---
231
+
232
  ## Upload documents
233
 
234
  ```python
235
  DataSet.upload_documents(document_list: list[dict])
236
  ```
237
 
238
+ Uploads documents to the current dataset.
239
 
240
  ### Parameters
241
 
 
243
 
244
  A list of dictionaries representing the documents to upload, each containing the following keys:
245
 
246
+ - `"display_name"`: (Optional) The file name to display in the dataset.
247
+ - `"blob"`: (Optional) The binary content of the file to upload.
 
248
 
249
  ### Returns
250
 
 
254
  ### Examples
255
 
256
  ```python
257
+ dataset = rag_object.create_dataset(name="kb_name")
258
+ dataset.upload_documents([{"display_name": "1.txt", "blob": "<BINARY_CONTENT_OF_THE_DOC>"}, {"display_name": "2.pdf", "blob": "<BINARY_CONTENT_OF_THE_DOC>"}])
259
  ```
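Because `"blob"` takes raw bytes, a local file can be uploaded by reading it in binary mode (a sketch; the path is a placeholder):

```python
# Read a file from disk and upload its bytes under a display name.
with open("./test_data/report.pdf", "rb") as f:
    dataset.upload_documents([{"display_name": "report.pdf", "blob": f.read()}])
```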
260
 
261
  ---
 
270
 
271
  ### Parameters
272
 
273
+ #### update_message: `dict[str, str|dict[]]`, *Required*
274
 
275
+ - `"name"`: `str` The name of the document to update.
276
+ - `"parser_config"`: `dict[str, Any]` The parsing configuration for the document:
277
+ - `"chunk_token_count"`: Defaults to `128`.
278
+ - `"layout_recognize"`: Defaults to `True`.
279
+ - `"delimiter"`: Defaults to `'\n!?。;!?'`.
280
+ - `"task_page_size"`: Defaults to `12`.
281
+ - `"chunk_method"`: `str` The parsing method to apply to the document.
282
+ - `"naive"`: General
283
+ - `"manual"`: Manual
284
+ - `"qa"`: Q&A
285
+ - `"table"`: Table
286
+ - `"paper"`: Paper
287
+ - `"book"`: Book
288
+ - `"laws"`: Laws
289
+ - `"presentation"`: Presentation
290
+ - `"picture"`: Picture
291
+ - `"one"`: One
292
+ - `"knowledge_graph"`: Knowledge Graph
293
+ - `"email"`: Email
294
 
295
  ### Returns
296
 
 
307
  dataset=dataset[0]
308
  doc = dataset.list_documents(id="wdfxb5t547d")
309
  doc = doc[0]
310
+ doc.update({"parser_config": {"chunk_token_count": 256}, "chunk_method": "manual"})
311
  ```
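Any subset of the permitted keys can be sent in a single call; for instance, renaming the document (a sketch; assumes the `doc` object from the example above):

```python
# A hypothetical new name; keep a file suffix that matches the document type.
doc.update({"name": "renamed_1.txt"})
```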
312
 
313
  ---
 
318
  Document.download() -> bytes
319
  ```
320
 
321
+ Downloads the current document from RAGFlow.
322
+
323
  ### Returns
324
 
325
+ The downloaded document in bytes.
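A sketch of saving the returned bytes to disk (note that Python's `open()` does not expand `~`, so expand user paths explicitly):

```python
import os

# Write the downloaded bytes to a file in the user's home directory.
with open(os.path.expanduser("~/ragflow.txt"), "wb") as f:
    f.write(doc.download())
```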
326
 
327
  ### Examples
328
 
329
  ```python
330
  from ragflow import RAGFlow
331
 
332
+ rag_object = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
333
+ dataset = rag_object.list_datasets(id="id")
334
+ dataset = dataset[0]
335
+ doc = dataset.list_documents(id="wdfxb5t547d")
336
  doc = doc[0]
337
  open("./ragflow.txt", "wb+").write(doc.download())
338
  print(doc)
 
346
  Dataset.list_documents(id: str = None, keywords: str = None, offset: int = 0, limit: int = 1024, orderby: str = "create_time", desc: bool = True) -> list[Document]
347
  ```
348
 
349
+ Retrieves a list of documents from the current dataset.
350
+
351
  ### Parameters
352
 
353
  #### id
354
 
355
+ The ID of the document to retrieve. Defaults to `None`.
356
 
357
  #### keywords
358
 
359
+ The keywords to match document titles. Defaults to `None`.
360
 
361
  #### offset
362
 
 
368
 
369
  #### orderby
370
 
371
+ The field by which the documents should be sorted. Available options:
372
+
373
+ - `"create_time"` (Default)
374
+ - `"update_time"`
375
 
376
  #### desc
377
 
378
+ Indicates whether the retrieved documents should be sorted in descending order. Defaults to `True`.
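Combining these parameters, a sketch of a filtered, sorted listing (assumes `dataset` is a `DataSet` obtained from `list_datasets()`; the keyword value is a placeholder):

```python
# Fetch the ten most recently updated documents whose titles match "report".
docs = dataset.list_documents(
    keywords="report",
    offset=0,
    limit=10,
    orderby="update_time",
    desc=True,
)
for doc in docs:
    print(doc.id)
```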
379
 
380
  ### Returns
381
 
 
386
 
387
  - `id` Id of the retrieved document. Defaults to `""`.
388
  - `thumbnail` Thumbnail image of the retrieved document. Defaults to `""`.
389
+ - `knowledgebase_id` Dataset ID related to the document. Defaults to `""`.
390
+ - `chunk_method` Method used to parse the document. Defaults to `""`.
391
  - `parser_config`: `ParserConfig` Configuration object for the parser. Defaults to `None`.
392
  - `source_type`: Source type of the document. Defaults to `""`.
393
  - `type`: Type or category of the document. Defaults to `""`.
 
425
  DataSet.delete_documents(ids: list[str] = None)
426
  ```
427
 
428
+ Deletes specified documents or all documents from the current dataset.
429
 
430
  ### Returns
431
 
 
445
 
446
  ---
447
 
448
+ ## Parse documents
449
 
450
  ```python
451
  DataSet.async_parse_documents(document_ids:list[str]) -> None
 
452
  ```
453
 
454
  ### Parameters
 
486
 
487
  ---
488
 
489
+ ## Stop parsing documents
490
+
491
+ ```python
492
+ DataSet.async_cancel_parse_documents(document_ids:list[str])-> None
493
+ ```
494
+
495
+ ### Parameters
496
+
497
+ #### document_ids: `list[str]`
498
+
499
+ The IDs of the documents to stop parsing.
500
+
501
+ ### Returns
502
+
503
+ - Success: No value is returned.
504
+ - Failure: `Exception`
505
+
506
+ ### Examples
507
+
508
+ ```python
509
+ # Parse documents, then cancel the parsing
510
+ rag = RAGFlow(API_KEY, HOST_ADDRESS)
511
+ ds = rag.create_dataset(name="dataset_name")
512
+ documents = [
513
+ {'name': 'test1.txt', 'blob': open('./test_data/test1.txt',"rb").read()},
514
+ {'name': 'test2.txt', 'blob': open('./test_data/test2.txt',"rb").read()},
515
+ {'name': 'test3.txt', 'blob': open('./test_data/test3.txt',"rb").read()}
516
+ ]
517
+ ds.upload_documents(documents)
518
+ documents=ds.list_documents(keywords="test")
519
+ ids=[]
520
+ for document in documents:
521
+     ids.append(document.id)
522
+ ds.async_parse_documents(ids)
523
+ print("Async bulk parsing initiated")
524
+ ds.async_cancel_parse_documents(ids)
525
+ print("Async bulk parsing cancelled")
526
+ ```
527
+
528
+ ---
529
+
530
  ## List chunks
531
 
532
  ```python
 
641
  ```python
642
  Chunk.update(update_message: dict)
643
  ```
644
+
645
+ Updates the current chunk.
646
+
647
  ### Parameters
648
 
649
+ #### update_message: `dict[str, str|list[str]|int]`, *Required*
650
 
651
+ - `"content"`: `str` Content of the chunk.
652
+ - `"important_keywords"`: `list[str]` A list of key terms to attach to the chunk.
653
+ - `"available"`: `int` The chunk's availability status in the dataset.
654
+ - `0`: Unavailable
655
+ - `1`: Available
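Putting these fields together, a sketch of toggling a chunk's availability while refreshing its keywords (assumes `chunk` was obtained from a chunk-listing call):

```python
# Temporarily hide the chunk from retrieval...
chunk.update({"available": 0})

# ...then restore it and attach key terms in the same update call.
chunk.update({"available": 1, "important_keywords": ["api", "chunk"]})
```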
656
 
657
  ### Returns
658
 
 
664
  ```python
665
  from ragflow import RAGFlow
666
 
667
+ rag_object = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
668
+ dataset = rag_object.list_datasets(id="123")
669
  dataset = dataset[0]
670
  doc = dataset.list_documents(id="wdfxb5t547d")
671
  doc = doc[0]
 
675
 
676
  ---
677
 
678
+ ## Retrieve chunks
679
 
680
  ```python
681
  RAGFlow.retrieve(question: str = "", datasets: list[str] = None, document: list[str] = None, offset: int = 1, limit: int = 30, similarity_threshold: float = 0.2, vector_similarity_weight: float = 0.3, top_k: int = 1024, rerank_id: str = None, keyword: bool = False, highlight: bool = False) -> list[Chunk]
 
683
 
684
  ### Parameters
685
 
686
+ #### question: `str` *Required*
687
 
688
  The user query or query keywords. Defaults to `""`.
689
 
690
+ #### datasets: `list[str]`, *Required*
691
 
692
+ The datasets to search from.
693
 
694
+ #### document: `list[str]`
695
 
696
+ The documents to search from. `None` means no limitation. Defaults to `None`.
697
 
698
  #### offset: `int`
699
 
700
+ The beginning point of retrieved chunks. Defaults to `0`.
701
 
702
  #### limit: `int`
703
 
704
+ The maximum number of chunks to return. Defaults to `6`.
705
 
706
  #### similarity_threshold: `float`
707
 
 
709
 
710
  #### vector_similarity_weight: `float`
711
 
712
+ The weight of vector cosine similarity. Defaults to `0.3`. If x represents the vector cosine similarity, then (1 - x) is the term similarity weight.
713
 
714
  #### top_k: `int`
715
 
716
+ The number of chunks engaged in vector cosine computation. Defaults to `1024`.
717
+
718
+ #### rerank_id
719
 
720
+ The ID of the rerank model. Defaults to `None`.
 
721
 
722
+ #### keyword
723
+
724
+ Indicates whether keyword-based matching is enabled:
725
+
726
+ - `True`: Enabled.
727
+ - `False`: Disabled.
728
 
729
  #### highlight: `bool`
730
 
731
  Specifies whether to highlight matched terms in the results: `True` enables highlighting; `False` disables it.
732
+
733
  ### Returns
734
 
735
+ - Success: A list of `Chunk` objects representing the document chunks.
736
+ - Failure: `Exception`
737
 
738
  ### Examples
739
 
740
  ```python
741
  from ragflow import RAGFlow
742
 
743
+ rag_object = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
744
+ ds = rag_object.list_datasets(name="ragflow")
745
  ds = ds[0]
746
  name = 'ragflow_test.txt'
747
  path = './test_data/ragflow_test.txt'
748
+ rag_object.create_document(ds, name=name, blob=open(path, "rb").read())
749
  doc = ds.list_documents(name=name)
750
  doc = doc[0]
751
  ds.async_parse_documents([doc.id])
752
+ for c in rag_object.retrieve(question="What's ragflow?",
753
  datasets=[ds.id], documents=[doc.id],
754
  offset=1, limit=30, similarity_threshold=0.2,
755
  vector_similarity_weight=0.3,
 
768
 
769
  ```python
770
  RAGFlow.create_chat(
771
+ name: str,
772
+ avatar: str = "",
773
+ knowledgebases: list[str] = [],
774
  llm: Chat.LLM = None,
775
  prompt: Chat.Prompt = None
776
  ) -> Chat
 
778
 
779
  Creates a chat assistant.
780
 
781
+ ### Parameters
 
 
 
782
 
783
  The following shows the attributes of a `Chat` object:
784
 
785
+ #### name: *Required*
786
+
787
+ The name of the chat assistant.
788
+
789
+ #### avatar
790
+
791
+ Base64 encoding of the avatar. Defaults to `""`.
792
+
793
+ #### knowledgebases: `list[str]`
794
+
795
+ The IDs of the associated datasets. Defaults to `[]`.
796
+
797
+ #### llm
798
+
799
+ The LLM settings of the chat assistant to create. Defaults to `None`. When the value is `None`, an `LLM` object with the following default values is generated.
800
+
801
+ An `LLM` object contains the following attributes:
802
+
803
+ - `model_name`, `str`
804
+ The chat model name. If it is `None`, the user's default chat model will be returned.
805
+ - `temperature`, `float`
806
+ Controls the randomness of the model's predictions. A lower temperature increases the model's confidence in its responses; a higher temperature increases creativity and diversity. Defaults to `0.1`.
807
+ - `top_p`, `float`
808
+ Also known as “nucleus sampling”, this parameter sets a threshold to select a smaller set of words to sample from. It focuses on the most likely words, cutting off the less probable ones. Defaults to `0.3`
809
+ - `presence_penalty`, `float`
810
+ This discourages the model from repeating the same information by penalizing words that have already appeared in the conversation. Defaults to `0.2`.
811
+ - `frequency_penalty`, `float`
812
+ Similar to the presence penalty, this reduces the model’s tendency to repeat the same words frequently. Defaults to `0.7`.
813
+ - `max_token`, `int`
814
+ This sets the maximum length of the model’s output, measured in the number of tokens (words or pieces of words). Defaults to `512`.
815
+
816
+ #### Prompt
817
+
818
+ Instructions for the LLM to follow. A `Prompt` object contains the following attributes:
819
+
820
+ - `"similarity_threshold"`: `float` A similarity score to evaluate distance between two lines of text. It's weighted keywords similarity and vector cosine similarity. If the similarity between query and chunk is less than this threshold, the chunk will be filtered out. Defaults to `0.2`.
821
+ - `"keywords_similarity_weight"`: `float` It's weighted keywords similarity and vector cosine similarity or rerank score (0~1). Defaults to `0.7`.
822
+ - `"top_n"`: `int` Not all the chunks whose similarity score is above the 'similarity threshold' will be feed to LLMs. LLM can only see these 'Top N' chunks. Defaults to `8`.
823
+ - `"variables"`: `list[dict[]]` If you use dialog APIs, the variables might help you chat with your clients with different strategies. The variables are used to fill in the 'System' part in prompt in order to give LLM a hint. The 'knowledge' is a very special variable which will be filled-in with the retrieved chunks. All the variables in 'System' should be curly bracketed. Defaults to `[{"key": "knowledge", "optional": True}]`
824
+ - `"rerank_model"`: `str` If it is not specified, vector cosine similarity will be used; otherwise, reranking score will be used. Defaults to `""`.
825
+ - `"empty_response"`: `str` If nothing is retrieved in the dataset for the user's question, this will be used as the response. To allow the LLM to improvise when nothing is retrieved, leave this blank. Defaults to `None`.
826
  - `"opener"`: `str` The opening greeting for the user. Defaults to `"Hi! I am your assistant, can I help you?"`.
827
  - `"show_quote"`: `bool` Indicates whether the source of the text should be displayed. Defaults to `True`.
828
+ - `"prompt"`: `str` The prompt content. Defaults to `You are an intelligent assistant. Please summarize the content of the dataset to answer the question. Please list the data in the knowledge base and answer in detail. When all knowledge base content is irrelevant to the question, your answer must include the sentence "The answer you are looking for is not found in the knowledge base!" Answers need to consider chat history.
829
  Here is the knowledge base:
830
  {knowledge}
831
  The above is the knowledge base.`.
832
 
833
+ ### Returns
834
+
835
+ - Success: A `Chat` object representing the chat assistant.
836
+ - Failure: `Exception`
837
+
838
  ### Examples
839
 
840
  ```python
841
  from ragflow import RAGFlow
842
 
843
  rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://<YOUR_BASE_URL>:9380")
844
+ kbs = rag.list_datasets(name="kb_1")
845
+ list_kb=[]
846
+ for kb in kbs:
847
+     list_kb.append(kb.id)
848
+ assi = rag.create_chat("Miss R", knowledgebases=list_kb)
849
  ```
850
 
851
  ---
 
864
 
865
  - `"name"`: `str` The name of the chat assistant to update.
866
  - `"avatar"`: `str` Base64 encoding of the avatar. Defaults to `""`
867
+ - `"knowledgebases"`: `list[str]` The datasets to update.
868
  - `"llm"`: `dict` The LLM settings:
869
  - `"model_name"`, `str` The chat model name.
870
  - `"temperature"`, `float` Controls the randomness of the model's predictions.
 
878
  - `"top_n"`: `int` Not all the chunks whose similarity score is above the 'similarity threshold' will be feed to LLMs. LLM can only see these 'Top N' chunks. Defaults to `8`.
879
  - `"variables"`: `list[dict[]]` If you use dialog APIs, the variables might help you chat with your clients with different strategies. The variables are used to fill in the 'System' part in prompt in order to give LLM a hint. The 'knowledge' is a very special variable which will be filled-in with the retrieved chunks. All the variables in 'System' should be curly bracketed. Defaults to `[{"key": "knowledge", "optional": True}]`
880
  - `"rerank_model"`: `str` If it is not specified, vector cosine similarity will be used; otherwise, reranking score will be used. Defaults to `""`.
881
+ - `"empty_response"`: `str` If nothing is retrieved in the dataset for the user's question, this will be used as the response. To allow the LLM to improvise when nothing is retrieved, leave this blank. Defaults to `None`.
882
  - `"opener"`: `str` The opening greeting for the user. Defaults to `"Hi! I am your assistant, can I help you?"`.
883
  - `"show_quote"`: `bool` Indicates whether the source of the text should be displayed. Defaults to `True`.
884
  - `"prompt"`: `str` The prompt content. Defaults to `You are an intelligent assistant. Please summarize the content of the knowledge base to answer the question. Please list the data in the knowledge base and answer in detail. When all knowledge base content is irrelevant to the question, your answer must include the sentence "The answer you are looking for is not found in the knowledge base!" Answers need to consider chat history.
 
965
 
966
  #### desc
967
 
968
+ Indicates whether the retrieved chat assistants should be sorted in descending order. Defaults to `True`.
969
 
970
  #### id: `string`
971
 
 
1103
 
1104
  A list of `Chunk` objects representing references to the message, each containing the following attributes:
1105
 
1106
+ - `id` `str`
1107
  The chunk ID.
1108
+ - `content` `str`
1109
  The content of the chunk.
1110
+ - `image_id` `str`
1111
  The ID of the snapshot of the chunk.
1112
+ - `document_id` `str`
1113
  The ID of the referenced document.
1114
+ - `document_name` `str`
1115
  The name of the referenced document.
1116
+ - `position` `list[str]`
1117
  The location information of the chunk within the referenced document.
1118
+ - `knowledgebase_id` `str`
1119
+ The ID of the dataset to which the referenced document belongs.
1120
+ - `similarity` `float`
1121
  A composite similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity.
1122
+ - `vector_similarity` `float`
1123
  A vector similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity between vector embeddings.
1124
+ - `term_similarity` `float`
1125
  A keyword similarity score of the chunk ranging from `0` to `1`, with a higher value indicating greater similarity between keywords.
1126
 
1127
 
 
1177
 
1178
  #### orderby
1179
 
1180
+ The field by which the sessions should be sorted. Available options:
1181
+
1182
+ - `"create_time"` (Default)
1183
+ - `"update_time"`
1184
 
1185
  #### desc
1186
 
1187
+ Indicates whether the retrieved sessions should be sorted in descending order. Defaults to `True`.
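Assuming these parameters belong to a session-listing call on a `Chat` object (the enclosing signature is elided from this hunk), a sketch:

```python
# List the sessions of an existing chat assistant, newest first.
# `assistant` is assumed to come from rag.create_chat(...).
for session in assistant.list_sessions(orderby="create_time", desc=True):
    print(session.id)
```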
1188
 
1189
  #### id
1190