llama.cpp modification to run Falcon (work in progress)
**TheBloke provides fine-tuned weights in GGML v3 with various quantization options:**
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
https://huggingface.co/TheBloke/falcon-7b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML
**The official HF models are here:**
https://huggingface.co/tiiuae/falcon-40b/
https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct
**Conversion:**
1) Use falcon_convert.py to produce a GGML v1 binary from the HF model - not recommended for direct use
2) Use examples/falcon_quantize to convert it into a memory-aligned GGML v3 binary of your choice, with mmap support from there on (see the sketch after this list)
_Important: The Falcon 7B model features tensor sizes that are not yet supported by K-type quantizers - use the traditional quantization types for those_
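A minimal sketch of the two conversion steps. The paths and output filenames are placeholders, and the exact argument order of falcon_convert.py and falcon_quantize is an assumption here - check each tool's usage/help output before running:
```
# Step 1: HF checkpoint -> GGML v1 file (assumed arguments: input dir, output dir, 1 = fp16)
python3 falcon_convert.py ~/models/falcon-7b-instruct ~/models/falcon-7b-instruct-ggml 1

# Step 2: re-quantize the GGML v1 file into a memory-aligned GGML v3 binary (Q5_1 chosen as an example)
./build/bin/falcon_quantize ~/models/falcon-7b-instruct-ggml/ggml-model-f16.bin \
                            ~/models/falcon-7b-instruct-q5_1.bin q5_1
```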
**Status/Bugs:**
* On Linux, one user reports a context-memory issue during batched token ingestion with Q5_1 7B; with -b 1 it is gone. Not reproduced on Windows. A workaround example is shown below.
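A sketch of the reported workaround, using the falcon_main flags that appear in the examples further down (thread count, model path and prompt are placeholders):
```
# force single-token ingestion (-b 1) to avoid the batched-ingestion memory issue on Linux
./build/bin/falcon_main -t 8 -m ~/models/falcon-7b/q5_1.bin -p "Hello, Falcon" -n 64 -b 1
```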
**How to compile:**
```
1) Recommended: cmake (set -DLLAMA_CUBLAS=0 to disable the CUDA requirement and support)
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
# find the binaries in ./bin

2) Installing on WSL (Windows Subsystem for Linux)
# I am getting slightly better timings on WSL than on native Windows
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/), or it will get stuck loading (thanks @nauful)
# Choose a current distro:
wsl.exe --list --online
wsl --install -d <distro>
# cmake 3.16 and the CUDA toolkit are required
# If you run an old distro you can upgrade it first, e.g.:
#   apt update; apt upgrade; apt full-upgrade
#   pico /etc/apt/sources.list   (point the sources at the current release)
#   apt update; apt upgrade; apt full-upgrade; apt autoremove; lsb_release -a
#   then run `wsl --shutdown` and restart the distro
# install the CUDA WSL toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update; apt-get -y install cuda
# you might need to add CUDA to your paths:
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-12.1/bin:$PATH"
# now start with a fresh cmake run and all should work
```
**CUDA:**
Only some tensors are supported currently; only the mul_mat operation is offloaded so far.
q3_k timing on 3090 of Falcon 40B:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
q4_k timing on 3090 of Falcon 40B (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
q4_1 timing on 3090 of Falcon 7B:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
CUDA sidenotes:
1) Use one thread fewer than you have physical processor cores.
2) If it's too slow and GPU memory is at 100%, the automatic tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at first inference (see the example below).
It appears the Q5 Falcon 40B inference time on CPU matches the A100 fp16 inference time, at about 2 tokens/second.
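For instance, on a machine with 16 physical cores one might start with a partial offload like this and lower -ngl further if VRAM still fills up (thread count, layer count and model path are placeholders, not recommended settings):
```
# 15 threads on 16 physical cores; offload 20 of the 60 layers, reduce -ngl if VRAM saturates at first inference
./build/bin/falcon_main -t 15 -m ~/models/falcon-40b/q4_k.bin -p "Love relates to hate like" -n 50 -ngl 20
```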
CPU inference examples:
```
Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format = ggjt v3 (latest)
falcon_model_load_internal: n_vocab = 65024
falcon_model_load_internal: n_ctx = 512
falcon_model_load_internal: n_embd = 8192
falcon_model_load_internal: n_head = 128
falcon_model_load_internal: n_head_kv = 8
falcon_model_load_internal: n_layer = 60
falcon_model_load_internal: version = 40
falcon_model_load_internal: ftype = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff = 32768
falcon_model_load_internal: n_parts = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required = 33513.70 MB (+ 120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size = 120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You can’t

falcon_print_timings: load time = 4420.23 ms
falcon_print_timings: sample time = 11.34 ms / 50 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 785.42 ms / 5 tokens ( 157.08 ms per token)
falcon_print_timings: eval time = 27512.23 ms / 49 runs ( 561.47 ms per token)
falcon_print_timings: total time = 28315.91 ms
```
Below are Falcon 7B tests:
**Q5_1 is working and comes with GGML v3 as a bonus (mmap support)**
```
falcon_model_load_internal: ftype = 9 (mostly Q5_1)
falcon_print_timings: load time = 952.24 ms
falcon_print_timings: sample time = 67.91 ms / 300 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 370.94 ms / 14 tokens ( 26.50 ms per token)
falcon_print_timings: eval time = 50367.68 ms / 299 runs ( 168.45 ms per token)
```
**Q4_1 is working as well**
```
falcon_print_timings: load time = 864.40 ms
falcon_print_timings: sample time = 22.68 ms / 100 runs ( 0.23 ms per token)
falcon_print_timings: prompt eval time = 287.00 ms / 14 tokens ( 20.50 ms per token)
falcon_print_timings: eval time = 12233.39 ms / 99 runs ( 123.57 ms per token)
```
Q_K_*: not working (no segfaults anymore; it looks like an error in QKV handling, as the output is garbage).