# Ingesting & Managing Documents
Documents can be ingested in several ways:
* Using the `/ingest` API (see the example below)
* Using the Gradio UI
* Using the Bulk Local Ingestion functionality (check the next section)
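For instance, you can ingest a single file over HTTP with any client. The sketch below assumes the API is reachable at `http://localhost:8001` and that file ingestion is exposed as a multipart upload at `/v1/ingest/file`; verify the exact route in your running server's API docs.
```bash
# Upload a single document through the ingest API.
# The /v1/ingest/file route is an assumption; check your server's
# OpenAPI reference (e.g. the /docs page) for the exact endpoint.
curl -X POST http://localhost:8001/v1/ingest/file \
  -F "file=@/path/to/document.pdf"
```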
## Bulk Local Ingestion
When running PrivateGPT in a fully local setup, you can ingest a complete folder for convenience (containing
pdf, text files, etc.)
and optionally watch it for changes with the command:
```bash
make ingest /path/to/folder -- --watch
```
To log the processed and failed files to an additional file, use:
```bash
make ingest /path/to/folder -- --watch --log-file /path/to/log/file.log
```
**Note for Windows Users:** Depending on your Windows version and whether you are using PowerShell to execute
PrivateGPT API calls, you may need to include the parameter name before passing the folder path for consumption:
```bash
make ingest arg=/path/to/folder -- --watch --log-file /path/to/log/file.log
```
After ingestion is complete, you should be able to chat with your documents
by navigating to http://localhost:8001 and using the option `Query documents`,
or by using the completions / chat API.
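As a rough illustration of the API route, the request below assumes a `/v1/completions` endpoint that accepts a `prompt` plus `use_context` and `include_sources` flags for querying ingested documents; the exact schema may differ across versions, so check the server's API reference.
```bash
# Ask a question grounded in the ingested documents.
# The route and the use_context / include_sources fields are assumptions
# based on a typical PrivateGPT setup; verify against the API docs.
curl -X POST http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "What is the summary of my documents?",
        "use_context": true,
        "include_sources": true
      }'
```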
## Ingestion troubleshooting
### Running out of memory
To avoid running out of memory, you should ingest your documents without the LLM loaded in your (video) memory.
To do so, change your configuration to set `llm.mode: mock`.
You can also use the existing `mock` profile (`PGPT_PROFILES=mock`), which will set the following configuration for you:
```yaml
llm:
  mode: mock
embedding:
  mode: local
```
This configuration allows you to use hardware acceleration for creating embeddings while avoiding loading the full LLM into (video) memory.
Once your documents are ingested, you can set the `llm.mode` value back to `local` (or your previous custom value).
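Concretely, this can be split into an ingestion run and a serving run. The `make run` target and the `local` profile name below are assumptions based on a standard PrivateGPT checkout; substitute your own profile:
```bash
# 1. Ingest with the mock LLM so only the embedding model uses memory.
PGPT_PROFILES=mock make ingest /path/to/folder --

# 2. Restart with your usual profile (llm.mode back to local) to chat.
PGPT_PROFILES=local make run
```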
### Ingestion speed
The ingestion speed depends on the number of documents you are ingesting, and the size of each document.
To speed up ingestion, you can change the ingestion mode in the configuration.
The following ingestion modes exist:
* `simple`: historic behavior, ingest one document at a time, sequentially
* `batch`: read, parse, and embed multiple documents using batches (batch read, then batch parse, then batch embed)
* `parallel`: read, parse, and embed multiple documents in parallel. This is the fastest ingestion mode for local setups.
To change the ingestion mode, use the `embedding.ingest_mode` configuration value. The default value is `simple`.
To configure the number of workers used for parallel or batched ingestion, use
the `embedding.count_workers` configuration value. If you set this value too high, you might run out of
memory, so be mindful when setting it. The default value is `2`.
For `batch` mode, you can easily set this value to the number of threads available on your CPU without
running out of memory. For `parallel` mode, you should be more careful and set it to a lower value.
The configuration below should be enough for users who want to stress their hardware a bit more:
```yaml
embedding:
  ingest_mode: parallel
  count_workers: 4
```
If your hardware is powerful enough and you are loading heavy documents, you can increase the number of workers.
It is recommended to run your own tests to find the optimal value for your hardware.
If you have a `bash` shell, you can use this set of commands to run your own benchmark:
```bash
# Wipe your local data, to put yourself in a clean state
# This will delete all your ingested documents
make wipe
time PGPT_PROFILES=mock python ./scripts/ingest_folder.py ~/my-dir/to-ingest/
```
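To compare modes, one approach is to keep a small extra profile per mode and time each run. The `settings-bench-<mode>.yaml` files and the comma-separated `PGPT_PROFILES` syntax below are assumptions about how profiles compose in your setup; adapt them to how your configuration files are organized:
```bash
# Time each ingestion mode against the same folder.
# Assumes one settings-bench-<mode>.yaml per mode exists alongside
# settings.yaml, each setting embedding.ingest_mode (and count_workers).
for mode in simple batch parallel; do
  make wipe                                  # start from a clean state each run
  echo "ingest_mode=$mode"
  time PGPT_PROFILES=mock,bench-$mode \
    python ./scripts/ingest_folder.py ~/my-dir/to-ingest/
done
```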
## Supported file formats
privateGPT by default supports all file formats that contain clear text (for example, `.txt` files, `.html`, etc.).
However, these text-based file formats are only treated as text files, and are not pre-processed in any other way.
It also supports the following file formats:
* `.hwp`
* `.pdf`
* `.docx`
* `.pptx`
* `.ppt`
* `.pptm`
* `.jpg`
* `.png`
* `.jpeg`
* `.mp3`
* `.mp4`
* `.csv`
* `.epub`
* `.md`
* `.mbox`
* `.ipynb`
* `.json`
**Please note the following nuance**: while `privateGPT` supports these file formats, it **might** require additional
dependencies to be installed in your Python virtual environment.
For example, if you try to ingest `.epub` files, `privateGPT` might fail to do so, and will instead display an
explanatory error asking you to download the necessary dependencies for this file format.
**Other file formats might work**, but they will be treated as plain text
files (in other words, they will be ingested as `.txt` files).
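As an illustration of what resolving such an error can look like, the package names below (`EbookLib` and `html2text` for `.epub` support) are assumptions based on common document-reader dependencies; install whatever packages the error message actually names:
```bash
# Install extra parsing dependencies inside the project's virtual environment.
# EbookLib and html2text are examples of packages an .epub reader may need;
# use the exact package names reported by the ingestion error.
poetry add EbookLib html2text
```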