---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models (Qwen2.5, Gemma-3, Phi-4) with llama.cpp
---

This Streamlit app lets you run **chat-based inference** on different GGUF models with `llama.cpp` and `llama-cpp-python`.

### 🔄 Supported Models:

- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q5_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q5_K_M.gguf`
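
Each entry maps a Hub repository to one specific quantized file. A minimal sketch of how such a pair can be fetched and loaded (assuming `huggingface_hub` and `llama-cpp-python` are installed; the actual `app.py` may organize this differently):

```python
# Sketch only: download one GGUF file from the Hub and load it with
# llama-cpp-python. The repo/filename pair comes from the list above.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="gemma-3-4b-it-Q5_K_M.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,  # reduced context length; see the memory notes below
)
```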

### ⚙️ Features:

- Model selection in sidebar
- Custom system prompt and generation parameters
- Chat-style UI with streaming responses (sketched below)
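
A hedged sketch of how these pieces can fit together in Streamlit (`llm` is the `Llama` instance from the loading sketch above; widget labels and defaults are illustrative, not necessarily what `app.py` uses):

```python
import streamlit as st

# Sidebar: system prompt and generation parameters.
system_prompt = st.sidebar.text_area("System prompt", "You are a helpful assistant.")
temperature = st.sidebar.slider("Temperature", 0.0, 2.0, 0.7)
max_tokens = st.sidebar.slider("Max tokens", 64, 2048, 512)

st.session_state.setdefault("messages", [])  # chat history across reruns

if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})

    def token_stream():
        # Prepend the system prompt, then stream deltas from llama.cpp.
        history = [{"role": "system", "content": system_prompt}, *st.session_state.messages]
        for chunk in llm.create_chat_completion(
            messages=history,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=True,
        ):
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    with st.chat_message("assistant"):
        reply = st.write_stream(token_stream())  # renders tokens as they arrive
    st.session_state.messages.append({"role": "assistant", "content": reply})
```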

### 🧠 Memory-Safe Design (for HuggingFace Spaces):

- Only **one model is loaded at a time** (no persistent memory bloat)
- Uses **manual unloading and `gc.collect()`** to free memory when switching
- Reduces the `n_ctx` context length to stay under the 16 GB RAM limit
- Automatically downloads models only when needed
- Trims history to the **last 8 user-assistant turns** to avoid context overflow (see the sketch below)
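
One plausible implementation of the unload-and-trim logic (the helper names `switch_model` and `trim_history` are hypothetical, not taken from `app.py`):

```python
import gc

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

llm = None      # at most one Llama instance resident at a time
MAX_TURNS = 8   # keep only the last 8 user-assistant turns

def switch_model(repo_id: str, filename: str) -> None:
    """Drop the current model before loading the next one (hypothetical helper)."""
    global llm
    if llm is not None:
        llm = None       # release the only reference to the old model
        gc.collect()     # manual collection frees llama.cpp buffers promptly
    # hf_hub_download caches files, so a model is fetched only on first use.
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    llm = Llama(model_path=path, n_ctx=2048)  # reduced n_ctx bounds KV-cache memory

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the last MAX_TURNS turns (2 messages per user-assistant turn)."""
    return messages[-2 * MAX_TURNS:]
```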

Perfect for deploying multi-GGUF chat models on **free-tier HuggingFace Spaces**!

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference