---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models (Qwen2.5, Gemma-3, Phi-4) with llama.cpp
---
This Streamlit app runs **chat-based inference** on a selectable set of GGUF models using `llama.cpp` through the `llama-cpp-python` bindings.
### 🔄 Supported Models:
- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q5_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q5_K_M.gguf`
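
The repo-to-file mapping above could be kept in a small registry and resolved lazily. The following is a minimal sketch, not the code in `app.py`: the dictionary labels, the `ensure_model` helper, and the `models` directory are hypothetical names chosen for illustration, while `hf_hub_download` is the standard `huggingface_hub` download function.

```python
from pathlib import Path

# Repo IDs and file names come from the list above.
# The dictionary keys are illustrative labels, not names used by the app.
MODELS = {
    "Qwen2.5-7B-Instruct (Q2_K)": (
        "Qwen/Qwen2.5-7B-Instruct-GGUF",
        "qwen2.5-7b-instruct-q2_k.gguf",
    ),
    "Gemma-3-4B-it (Q5_K_M)": (
        "unsloth/gemma-3-4b-it-GGUF",
        "gemma-3-4b-it-Q5_K_M.gguf",
    ),
    "Phi-4-mini-instruct (Q5_K_M)": (
        "unsloth/Phi-4-mini-instruct-GGUF",
        "Phi-4-mini-instruct-Q5_K_M.gguf",
    ),
}


def ensure_model(label, models_dir="models"):
    """Return the local path of a model file, downloading it only if missing."""
    repo_id, filename = MODELS[label]
    path = Path(models_dir) / filename
    if not path.exists():
        # Imported here so the registry can be used without the package installed.
        from huggingface_hub import hf_hub_download

        hf_hub_download(repo_id=repo_id, filename=filename, local_dir=models_dir)
    return path
```

Calling `ensure_model("Gemma-3-4B-it (Q5_K_M)")` would then trigger a download only on first use and return the cached file afterwards.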
### ⚙️ Features:
- Model selection in sidebar
- Custom system prompt and generation parameters
- Chat-style UI with streaming responses
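
As a rough illustration of the streaming feature: `llama-cpp-python`'s `create_chat_completion(..., stream=True)` yields OpenAI-style chunks whose `delta` may or may not carry a `content` field. The helper below is a sketch under that assumption; `stream_text` is a hypothetical name and is not taken from `app.py`.

```python
def stream_text(chunks):
    """Yield only the text pieces from a stream of chat-completion chunks."""
    for chunk in chunks:
        # The first chunk usually carries only the role, and the last one an
        # empty delta, so a missing "content" key is expected and skipped.
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:
            yield piece
```

In a Streamlit chat UI, a generator like this can be passed to `st.write_stream`, which renders the pieces as they arrive and returns the full string.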
### 🧠 Memory-Safe Design (for Hugging Face Spaces):
- Only **one model is loaded at a time** (no persistent memory bloat)
- Uses **manual unloading and `gc.collect()`** to free memory when switching
- Reduces the `n_ctx` context length to stay under the 16 GB RAM limit
- Automatically downloads models only when needed
- Trims history to the **last 8 user-assistant turns** to avoid context overflow
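
The single-model rule and the history trimming described above can be sketched as follows. This is an assumption-laden illustration rather than the app's actual code: `ModelSlot`, `trim_history`, and the `loader` callable are hypothetical names, and the system prompt is assumed to be stored separately from the message list.

```python
import gc

MAX_TURNS = 8  # from the list above: keep the last 8 user-assistant turns


def trim_history(messages, max_turns=MAX_TURNS):
    """Keep the most recent `max_turns` user-assistant pairs."""
    recent = messages[-2 * max_turns:]
    # If the cut landed mid-turn, drop the orphaned reply at the front.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


class ModelSlot:
    """Holds at most one loaded model; switching unloads the previous one."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable: model name -> model object
        self.name = None
        self.model = None

    def get(self, name):
        if name != self.name:
            self.unload()
            self.model = self._loader(name)
            self.name = name
        return self.model

    def unload(self):
        # Drop the only reference, then ask the garbage collector to run.
        self.model = None
        self.name = None
        gc.collect()
```

With `llama-cpp-python`, the `loader` would typically construct `Llama(model_path=..., n_ctx=...)` with a reduced `n_ctx`; the exact value used by the app is not stated here.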
Well suited to deploying multi-GGUF chat models on **free-tier Hugging Face Spaces**.
For the front-matter options used above, see the Spaces configuration reference at https://huggingface.co/docs/hub/spaces-config-reference