Update README.md

README.md CHANGED

@@ -33,7 +33,7 @@ Training is light-weight and can be completed in only a few days depending on ba
 
 _Note: For all samples, your environment must have access to cuda_
 
-### Production
+### Use in IBM Production TGIS
 
 *To try this out running in a production-like environment, please use the pre-built docker image:*
 
@@ -101,6 +101,27 @@ python sample_client.py
 
 _Note: first prompt may be slower as there is a slight warmup time_
 
+### Use in Huggingface TGI
+
+#### start the server
+
+```bash
+model=ibm-fms/llama-13b-accelerator
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
+```
+
+_note: for tensor parallel, add --num-shard_
+
+#### make a request
+
+```bash
+curl 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
 ### Minimal Sample
 
 *To try this out with the fms-native compiled model, please execute the following:*
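
The added section's note mentions `--num-shard` for tensor parallelism but does not show it in context. As a minimal sketch of the sharded launch, assuming a machine with two GPUs (the shard count of 2 is illustrative; it should match the number of GPUs you want to use):

```bash
# Sketch: the same launch as in the diff above, sharded across 2 GPUs
# with tensor parallelism. The --num-shard value is an assumption; set it
# to the number of GPUs available on your machine.
model=ibm-fms/llama-13b-accelerator
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --num-shard 2
```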
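
The request example in the diff uses TGI's streaming route, which returns tokens as server-sent events. TGI also exposes a non-streaming `/generate` route that accepts the same payload and is often easier to script against; a sketch:

```bash
# Sketch: non-streaming variant of the request above.
# /generate returns the completed text in a single JSON response
# instead of a token-by-token event stream.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```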