# Run or Build h2oGPT Docker

## Setup Docker for CPU Inference

No special Docker instructions are required; just follow [these instructions](https://docs.docker.com/engine/install/ubuntu/) to get Docker set up. Add your user to the `docker` group, exit the shell, log back in, and run:
```bash
newgrp docker
```
which avoids having to reboot. Alternatively, just reboot to get Docker access.
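
For reference, a minimal sketch of one way to install Docker Engine on Ubuntu (the linked Docker documentation is authoritative; this uses Docker's convenience script):
```bash
# Install Docker Engine via Docker's convenience script
# (see the linked docs for the repository-based install if preferred).
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Allow running docker without sudo, then log back in (or use `newgrp docker` as above).
sudo usermod -aG docker $USER
```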

## Setup Docker for GPU Inference

Ensure Docker is installed and ready (requires sudo); this can be skipped if the system is already capable of running NVIDIA containers. The example here is for Ubuntu; see [NVIDIA Containers](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) for more examples.
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
       sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
       sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base
sudo apt-get install -y nvidia-container-runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
If running on A100s, this might require [Installing Fabric Manager](INSTALL.md#install-and-run-fabric-manager-if-have-multiple-a100100s) and [Installing GPU Manager](INSTALL.md#install-nvidia-gpu-manager-if-have-multiple-a100h100s).
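
To verify that the NVIDIA container runtime is working before proceeding, one can run a throwaway CUDA container (the image tag here is just an example; any available `nvidia/cuda` base image works):
```bash
# Should print the same GPU table as running nvidia-smi directly on the host.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```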

## Run h2oGPT using Docker

All available public h2oGPT Docker images can be found in the [Google Container Registry](https://console.cloud.google.com/gcr/images/vorvan/global/h2oai/h2ogpt-runtime).

Ensure the image is up-to-date by running:
```bash
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0
```
An example of running h2oGPT via Docker with the LLaMa2 7B model is:
```bash
mkdir -p ~/.cache
mkdir -p ~/save
export CUDA_VISIBLE_DEVICES=0
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --use_safetensors=True \
        --prompt_type=llama2 \
        --save_dir='/workspace/save/' \
        --use_gpu_id=False \
        --score_model=None \
        --max_max_new_tokens=2048 \
        --max_new_tokens=1024
```
Use `docker run -d` to run in a detached background process. Then go to http://localhost:7860/ or http://127.0.0.1:7860/.
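
When running detached, one can check that the container came up before opening the browser; a quick sketch (the container ID is whatever `docker ps` reports):
```bash
docker ps                                      # note the CONTAINER ID of the h2ogpt-runtime container
docker logs -f <container_id>                  # watch startup logs until the Gradio URL is printed
curl -sI http://localhost:7860/ | head -n 1    # expect an HTTP 200 once the UI is ready
```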

An example of running h2oGPT via Docker using AutoGPTQ (4-bit, so using less GPU memory) with the LLaMa2 7B model is:
```bash
mkdir -p $HOME/.cache
mkdir -p $HOME/save
export CUDA_VISIBLE_DEVICES=0
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
        --load_gptq="gptq_model-4bit-128g" \
        --use_safetensors=True \
        --prompt_type=llama2 \
        --save_dir='/workspace/save/' \
        --use_gpu_id=False \
        --score_model=None \
        --max_max_new_tokens=2048 \
        --max_new_tokens=1024
```
Use `docker run -d` to run in a detached background process. Then go to http://localhost:7860/ or http://127.0.0.1:7860/.

If one needs a Hugging Face token to access gated Hugging Face models, such as the Meta version of LLaMa2, one can pass the token into the container, e.g. via the `HUGGING_FACE_HUB_TOKEN` environment variable recognized by Hugging Face Hub:
```bash
# Assumes a valid token has been set beforehand, e.g.: export HUGGING_FACE_HUB_TOKEN=<your token>
mkdir -p ~/.cache
mkdir -p ~/save
export CUDA_VISIBLE_DEVICES=0
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --prompt_type=llama2 \
        --save_dir='/workspace/save/' \
        --use_gpu_id=False \
        --score_model=None \
        --max_max_new_tokens=2048 \
        --max_new_tokens=1024
```
Use `docker run -d` to run in a detached background process.

For [GGML/GPT4All models](FAQ.md#adding-models), one should either download the model file and mount that path from outside Docker to a path given to h2oGPT inside Docker, or pass a URL so the model is downloaded inside the Docker container.

See [README_GPU](README_GPU.md) for more details about what to run.
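
As an illustration of the download-and-mount route, a hedged sketch follows; it assumes a llama.cpp GGML file and an option like `--model_path_llama` as described in the FAQ, so check the FAQ and README_GPU.md for the exact flags and currently supported file formats:
```bash
# Hypothetical example: a GGML file downloaded on the host and mounted into the container.
mkdir -p ~/llamacpp_path
# (download e.g. llama-2-7b-chat.ggmlv3.q8_0.bin into ~/llamacpp_path first)
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=llama \
        --model_path_llama=/workspace/llamacpp_path/llama-2-7b-chat.ggmlv3.q8_0.bin \
        --prompt_type=llama2 \
        --score_model=None
```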

## Run h2oGPT + vLLM or vLLM using Docker

One can run an inference server in one Docker container and h2oGPT in another.

For the vLLM server running on 2 GPUs with the h2oai/h2ogpt-4096-llama2-7b-chat model, run:
```bash
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0
unset CUDA_VISIBLE_DEVICES
mkdir -p $HOME/.cache/huggingface/hub
mkdir -p $HOME/save
docker run \
    --runtime=nvidia \
    --gpus '"device=0,1"' \
    --shm-size=10.24gb \
    -p 5000:5000 \
    --rm --init \
    --entrypoint /h2ogpt_conda/vllm_env/bin/python3.10 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    --network host \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 -m vllm.entrypoints.openai.api_server \
        --port=5000 \
        --host=0.0.0.0 \
        --model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --tokenizer=hf-internal-testing/llama-tokenizer \
        --tensor-parallel-size=2 \
        --seed 1234 \
        --trust-remote-code \
        --download-dir=/workspace/.cache/huggingface/hub &>> logs.vllm_server.txt
```
Use `docker run -d` to run in a detached background process.

Check the logs in `logs.vllm_server.txt` to make sure the server is running. If one sees output similar to the below, then the endpoint is up and running.
```bash
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
```

### Curl Test

One can also verify the endpoint by running the following curl command.
```bash
curl http://localhost:5000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "h2oai/h2ogpt-4096-llama2-7b-chat",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```
If one sees output similar to the below, then the endpoint is up and running.
```json
{
    "id": "cmpl-4b9584f743ff4dc590f0c168f82b063b",
    "object": "text_completion",
    "created": 1692796549,
    "model": "h2oai/h2ogpt-4096-llama2-7b-chat",
    "choices": [
        {
            "index": 0,
            "text": "city in Northern California that is known",
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "total_tokens": 12,
        "completion_tokens": 7
    }
}
```
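
One can also list what the server is serving via the OpenAI-compatible `/v1/models` route exposed by vLLM, which is a quick readiness check (host and port assumed to match the run above):
```bash
# Expects a JSON listing containing h2oai/h2ogpt-4096-llama2-7b-chat once the server is ready.
curl http://localhost:5000/v1/models
```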
If one only needs to set up vLLM, one can stop here.

### Run h2oGPT

```bash
mkdir -p ~/.cache
mkdir -p ~/save
docker run \
    --gpus '"device=2,3"' \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --inference_server="vllm:0.0.0.0:5000" \
        --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --langchain_mode=UserData
```
Make sure to set the `--inference_server` argument to the correct vLLM endpoint.

When one is done with the Docker instance, run `docker ps` to find the container ID's hash, then run `docker stop <hash>`.

Follow [README_InferenceServers.md](README_InferenceServers.md) for more information on how to set up vLLM.

## Run h2oGPT and TGI using Docker

One can run an inference server in one Docker container and h2oGPT in another.

For the TGI server, run (e.g. to run on GPU 0):
```bash
export MODEL=h2oai/h2ogpt-4096-llama2-7b-chat
export CUDA_VISIBLE_DEVICES=0
docker run -d --gpus all \
    --shm-size 1g \
    --network host \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -p 6112:80 \
    -v $HOME/.cache/huggingface/hub/:/data ghcr.io/huggingface/text-generation-inference:0.9.3 \
    --model-id $MODEL \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --max-stop-sequences 6 &>> logs.infserver.txt
```
Each Docker container can run on any system the network can reach, or on the same system on different GPUs. E.g. replace `--gpus all` with `--gpus '"device=0,3"'` to run on GPUs 0 and 3 (note the extra quotes), then `unset CUDA_VISIBLE_DEVICES` and avoid passing that into the Docker image. This multi-device format is required to avoid the TGI server getting confused about which GPUs are available.

On a low-memory GPU system, one can add other options to limit batching, e.g.:
```bash
mkdir -p $HOME/.cache/huggingface/hub/
export MODEL=h2oai/h2ogpt-4096-llama2-7b-chat
unset CUDA_VISIBLE_DEVICES
docker run -d --gpus '"device=0"' \
    --shm-size 1g \
    -p 6112:80 \
    -v $HOME/.cache/huggingface/hub/:/data ghcr.io/huggingface/text-generation-inference:0.9.3 \
    --model-id $MODEL \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-batch-prefill-tokens 2048 \
    --max-batch-total-tokens 2048 \
    --max-stop-sequences 6 &>> logs.infserver.txt
```
Then wait until it comes up (e.g. run `docker logs` on the detached container hash written to logs.infserver.txt), about 30 seconds for 7B LLaMa2 on 1 GPU. Then for h2oGPT, run one of the commands like the above, but add e.g. `--inference_server=192.168.0.1:6112` to the docker command line. E.g. using the same exports as above, run:
```bash
export GRADIO_SERVER_PORT=7860
export CUDA_VISIBLE_DEVICES=0
mkdir -p ~/.cache
mkdir -p ~/save
docker run -d \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
    --rm --init \
    --network host \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=$MODEL \
        --inference_server=http://localhost:6112 \
        --prompt_type=llama2 \
        --save_dir='/workspace/save/' \
        --use_gpu_id=False \
        --score_model=None \
        --max_max_new_tokens=4096 \
        --max_new_tokens=1024
```
or change `max_max_new_tokens` to `2048` for the low-memory case. Note the h2oGPT container uses `--network host` with the same port inside and outside so that the other container on the same host can see it. Otherwise use actual IP addresses if on separate hosts.

For maximal summarization performance when connecting to the TGI server, auto-detection of file changes in `--user_path` on every query, and maximum document filling of the context, add these options:
```
--num_async=10 \
--top_k_docs=-1 \
--detect_user_path_changes_every_query=True
```
When one is done with the Docker instance, run `docker ps` to find the container ID's hash, then run `docker stop <hash>`.

Follow [README_InferenceServers.md](README_InferenceServers.md) for similar (and more) examples of how to launch a TGI server using Docker.

## Make UserData db for generate.py using Docker

To make the UserData db for generate.py, put PDFs, etc. into the local path `user_path` (bind-mounted below via `$(pwd)` so Docker uses the local directory rather than creating a named volume) and run:
```bash
mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p user_path
mkdir -p db_dir_UserData
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    --rm --init \
    --network host \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    -v "$(pwd)"/user_path:/workspace/user_path \
    -v "$(pwd)"/db_dir_UserData:/workspace/db_dir_UserData \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/src/make_db.py
```
Once the db is made, one can use it in generate.py like:
```bash
export CUDA_VISIBLE_DEVICES=0
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    --rm --init \
    --network host \
    -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    -v "$(pwd)"/user_path:/workspace/user_path \
    -v "$(pwd)"/db_dir_UserData:/workspace/db_dir_UserData \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
        --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --use_safetensors=True \
        --prompt_type=llama2 \
        --save_dir='/workspace/save/' \
        --use_gpu_id=False \
        --score_model=None \
        --max_max_new_tokens=2048 \
        --max_new_tokens=1024 \
        --langchain_mode=UserData
```
For a more detailed description of the other parameters of the make_db script, check out the definition in this file: https://github.com/h2oai/h2ogpt/blob/main/src/make_db.py
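
As an illustration only (verify these option names against src/make_db.py before use), pointing at a custom document path and collection name might look like replacing the final line of the docker command above with something like:
```bash
# Hypothetical illustration; check src/make_db.py for the supported options.
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/src/make_db.py \
    --user_path=/workspace/user_path \
    --collection_name=UserData
```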

## Build Docker

```bash
# build image
touch build_info.txt
docker build -t h2ogpt .
```
Then, to run this version of the Docker image, just replace `gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0` with `h2ogpt:latest` in the run commands above.

When any of the prebuilt dependencies are changed, e.g. duckdb or auto-gptq, you need to run `make docker_build_deps` or the equivalent of what is in that Makefile target.
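
For example, a minimal run of the locally built image (same pattern as the earlier commands, with the image name swapped; most optional flags trimmed here for brevity):
```bash
# Run the locally built image instead of the published runtime image.
docker run --gpus all --runtime=nvidia --rm --init \
    --network host -p 7860:7860 \
    -v "${HOME}"/.cache:/workspace/.cache \
    -v "${HOME}"/save:/workspace/save \
    h2ogpt:latest /workspace/generate.py \
        --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
        --prompt_type=llama2 \
        --score_model=None
```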

## Docker Compose Setup & Inference

1. (optional) Change the desired model and weights under `environment` in `docker-compose.yml`.
2. Build and run the container:
    ```bash
    docker-compose up -d --build
    ```
3. Open `http://localhost:7860` in the browser.
4. See logs:
    ```bash
    docker-compose logs -f
    ```
5. Clean everything up:
    ```bash
    docker-compose down --volumes --rmi all
    ```