llama.cpp modification to run Falcon (work in progress)
**TheBloke provides fine-tuned weights in GGML v3 with various quantization options:**
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
https://huggingface.co/TheBloke/falcon-7b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML
**The official HF models are here:**
https://huggingface.co/tiiuae/falcon-40b/
https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct
**Conversion:**
1) Use falcon_convert.py to produce a GGML v1 binary from the HF model - not recommended for direct use
2) Use examples/falcon_quantize to convert it into a memory-aligned GGML v3 binary of your choice, with mmap support from there on (see the sketch after this list)
_Important: The Falcon 7B model features tensor sizes that are not yet supported by K-type quantizers - use the traditional quantization types for those_
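A minimal sketch of the two conversion steps. The paths and output filenames are placeholders, and the exact argument order of falcon_convert.py and falcon_quantize is an assumption here - check each tool's usage/help output before running:
```
# Step 1: HF checkpoint -> GGML v1 file (assumed arguments: input dir, output dir, 1 = fp16)
python3 falcon_convert.py ~/models/falcon-7b-instruct ~/models/falcon-7b-instruct-ggml 1

# Step 2: re-quantize the GGML v1 file into a memory-aligned GGML v3 binary (Q5_1 chosen as an example)
./build/bin/falcon_quantize ~/models/falcon-7b-instruct-ggml/ggml-model-f16.bin \
                            ~/models/falcon-7b-instruct-q5_1.bin q5_1
```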
**Status/Bugs:**
* On Linux, one user reports a context-memory issue during batched token ingestion with Q5_1 7B; with -b 1 it is gone. Not reproduced on Windows. A workaround example is shown below.
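A sketch of the reported workaround, using the falcon_main flags that appear in the examples further down (thread count, model path and prompt are placeholders):
```
# force single-token ingestion (-b 1) to avoid the batched-ingestion memory issue on Linux
./build/bin/falcon_main -t 8 -m ~/models/falcon-7b/q5_1.bin -p "Hello, Falcon" -n 64 -b 1
```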
**How to compile:**
```
1) Recommended: cmake (set -DLLAMA_CUBLAS=0 to disable the CUDA requirement and support)
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
# find the binaries in ./bin

2) Installing on WSL (Windows Subsystem for Linux)
# I am getting slightly better timings on WSL than on native Windows
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/), or it will get stuck loading (thanks @nauful)
# Choose a current distro:
wsl.exe --list --online
wsl --install -d <distro>
# cmake 3.16 and the CUDA toolkit are required
# If you run an old distro you can upgrade it first, e.g.:
#   apt update; apt upgrade; apt full-upgrade
#   pico /etc/apt/sources.list   (point the sources at the current release)
#   apt update; apt upgrade; apt full-upgrade; apt autoremove; lsb_release -a
#   then run `wsl --shutdown` and restart the distro
# install the CUDA WSL toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update; apt-get -y install cuda
# you might need to add CUDA to your paths:
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-12.1/bin:$PATH"
# now start with a fresh cmake run and all should work
```
**CUDA:**
Only some tensors are supported currently; only the mul_mat operation is offloaded so far.
q3_k timing on 3090 of Falcon 40B:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
q4_k timing on 3090 of Falcon 40B (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
q4_1 timing on 3090 of Falcon 7B:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
CUDA sidenotes:
1) Use one thread fewer than you have physical processor cores.
2) If it's too slow and GPU memory is at 100%, the automatic tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at first inference (see the example below).
It appears the Q5 Falcon 40B inference time on CPU matches the A100 fp16 inference time, at about 2 tokens/second.
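For instance, on a machine with 16 physical cores one might start with a partial offload like this and lower -ngl further if VRAM still fills up (thread count, layer count and model path are placeholders, not recommended settings):
```
# 15 threads on 16 physical cores; offload 20 of the 60 layers, reduce -ngl if VRAM saturates at first inference
./build/bin/falcon_main -t 15 -m ~/models/falcon-40b/q4_k.bin -p "Love relates to hate like" -n 50 -ngl 20
```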
CPU inference examples:
```
Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format = ggjt v3 (latest)
falcon_model_load_internal: n_vocab = 65024
falcon_model_load_internal: n_ctx = 512
falcon_model_load_internal: n_embd = 8192
falcon_model_load_internal: n_head = 128
falcon_model_load_internal: n_head_kv = 8
falcon_model_load_internal: n_layer = 60
falcon_model_load_internal: version = 40
falcon_model_load_internal: ftype = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff = 32768
falcon_model_load_internal: n_parts = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required = 33513.70 MB (+ 120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size = 120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You can’t

falcon_print_timings: load time = 4420.23 ms
falcon_print_timings: sample time = 11.34 ms / 50 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 785.42 ms / 5 tokens ( 157.08 ms per token)
falcon_print_timings: eval time = 27512.23 ms / 49 runs ( 561.47 ms per token)
falcon_print_timings: total time = 28315.91 ms
```
Below are Falcon 7B tests:
**Q5_1 is working and comes with GGML v3 as a bonus (mmap support)**
```
falcon_model_load_internal: ftype = 9 (mostly Q5_1)
falcon_print_timings: load time = 952.24 ms
falcon_print_timings: sample time = 67.91 ms / 300 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 370.94 ms / 14 tokens ( 26.50 ms per token)
falcon_print_timings: eval time = 50367.68 ms / 299 runs ( 168.45 ms per token)
```
**Q4_1 is working as well**
```
falcon_print_timings: load time = 864.40 ms
falcon_print_timings: sample time = 22.68 ms / 100 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 287.00 ms / 14 tokens ( 20.50 ms per token)
falcon_print_timings: eval time = 12233.39 ms / 99 runs ( 123.57 ms per token)
```
Q_K_*: not working (no segfaults anymore; it looks like an error in QKV handling, as the output is garbage).