---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models (Qwen2.5, Gemma-3, Phi-4) with llama.cpp
---

This Streamlit app lets you run **chat-based inference** on different GGUF models with `llama.cpp` and `llama-cpp-python`.

### 🔄 Supported Models:

- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q5_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q5_K_M.gguf`
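
Each entry maps a Hub repository to one specific quantized file. A minimal sketch of how such a pair can be fetched and loaded (assuming `huggingface_hub` and `llama-cpp-python` are installed; the actual `app.py` may organize this differently):

```python
# Sketch only: download one GGUF file from the Hub and load it with
# llama-cpp-python. The repo/filename pair comes from the list above.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="gemma-3-4b-it-Q5_K_M.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,  # reduced context length; see the memory notes below
)
```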

### ⚙️ Features:

- Model selection in sidebar
- Custom system prompt and generation parameters
- Chat-style UI with streaming responses (sketched below)
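
A hedged sketch of how these pieces can fit together in Streamlit (`llm` is the `Llama` instance from the loading sketch above; widget labels and defaults are illustrative, not necessarily what `app.py` uses):

```python
import streamlit as st

# Sidebar: system prompt and generation parameters.
system_prompt = st.sidebar.text_area("System prompt", "You are a helpful assistant.")
temperature = st.sidebar.slider("Temperature", 0.0, 2.0, 0.7)
max_tokens = st.sidebar.slider("Max tokens", 64, 2048, 512)

st.session_state.setdefault("messages", [])  # chat history across reruns

if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})

    def token_stream():
        # Prepend the system prompt, then stream deltas from llama.cpp.
        history = [{"role": "system", "content": system_prompt}, *st.session_state.messages]
        for chunk in llm.create_chat_completion(
            messages=history,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=True,
        ):
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    with st.chat_message("assistant"):
        reply = st.write_stream(token_stream())  # renders tokens as they arrive
    st.session_state.messages.append({"role": "assistant", "content": reply})
```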

### 🧠 Memory-Safe Design (for HuggingFace Spaces):

- Only **one model is loaded at a time** (no persistent memory bloat)
- Uses **manual unloading and `gc.collect()`** to free memory when switching
- Reduces the `n_ctx` context length to stay under the 16 GB RAM limit
- Automatically downloads models only when needed
- Trims history to the **last 8 user-assistant turns** to avoid context overflow (see the sketch below)
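
One plausible implementation of the unload-and-trim logic (the helper names `switch_model` and `trim_history` are hypothetical, not taken from `app.py`):

```python
import gc

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

llm = None      # at most one Llama instance resident at a time
MAX_TURNS = 8   # keep only the last 8 user-assistant turns

def switch_model(repo_id: str, filename: str) -> None:
    """Drop the current model before loading the next one (hypothetical helper)."""
    global llm
    if llm is not None:
        llm = None       # release the only reference to the old model
        gc.collect()     # manual collection frees llama.cpp buffers promptly
    # hf_hub_download caches files, so a model is fetched only on first use.
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    llm = Llama(model_path=path, n_ctx=2048)  # reduced n_ctx bounds KV-cache memory

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the last MAX_TURNS turns (2 messages per user-assistant turn)."""
    return messages[-2 * MAX_TURNS:]
```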

Perfect for deploying multi-GGUF chat models on **free-tier HuggingFace Spaces**!

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference