---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models (Qwen2.5, Gemma-3, Phi-4) with llama.cpp
---

This Streamlit app lets you run chat-based inference on a choice of GGUF models, using llama.cpp through the llama-cpp-python bindings.

🔄 Supported Models:

  • Qwen/Qwen2.5-7B-Instruct-GGUF → qwen2.5-7b-instruct-q2_k.gguf
  • unsloth/gemma-3-4b-it-GGUF → gemma-3-4b-it-Q5_K_M.gguf
  • unsloth/Phi-4-mini-instruct-GGUF → Phi-4-mini-instruct-Q5_K_M.gguf
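
As a rough sketch, fetching and loading one of these files looks like the following, assuming the huggingface_hub and llama-cpp-python packages; the repo/filename pair comes from the list above, while the variable names and n_ctx value are illustrative:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

MODEL_REPO = "unsloth/gemma-3-4b-it-GGUF"  # one of the repos listed above
MODEL_FILE = "gemma-3-4b-it-Q5_K_M.gguf"   # the quantized file inside it

# hf_hub_download caches the file locally and only downloads it when missing
model_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)

# A reduced context window keeps peak RAM usage modest
llm = Llama(model_path=model_path, n_ctx=4096)
```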

βš™οΈ Features:

  • Model selection in sidebar
  • Custom system prompt and generation parameters
  • Chat-style UI with streaming responses (see the sketch below)
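
Here is a hedged sketch of what the streaming chat loop might look like with Streamlit's chat elements, reusing the llm handle from the sketch above; the widget labels and default values are placeholders rather than the app's actual code:

```python
import streamlit as st

# Sidebar controls for the system prompt and generation parameters
system_prompt = st.sidebar.text_area("System prompt", "You are a helpful assistant.")
temperature = st.sidebar.slider("Temperature", 0.0, 2.0, 0.7)

if prompt := st.chat_input("Ask something"):
    st.chat_message("user").write(prompt)

    # llama-cpp-python yields OpenAI-style chunks when stream=True
    stream = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        stream=True,
    )

    def token_text():
        # Pass only the text deltas through to st.write_stream
        for chunk in stream:
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    with st.chat_message("assistant"):
        st.write_stream(token_text())
```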

🧠 Memory-Safe Design (for HuggingFace Spaces):

  • Only one model is loaded at a time (no persistent memory bloat)
  • Unloads the current model and calls gc.collect() to free memory when switching (see the sketch after this list)
  • Reduces the n_ctx context length to stay under the 16 GB RAM limit
  • Automatically downloads models only when needed
  • Trims history to the last 8 user-assistant turns to avoid context overflow (also sketched below)
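
The unload-and-trim logic could look roughly like this; switch_model, trim_history, and MAX_TURNS are hypothetical names for illustration, not the app's real identifiers:

```python
import gc

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

MAX_TURNS = 8   # keep only the most recent user/assistant exchanges
llm = None      # the single model held in memory at any time

def switch_model(repo_id: str, filename: str) -> None:
    """Drop the current model, then download (if needed) and load the new one."""
    global llm
    if llm is not None:
        llm = None       # drop the only reference to the llama.cpp context
        gc.collect()     # force Python to reclaim the memory right away
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    llm = Llama(model_path=path, n_ctx=4096)   # reduced context to fit in RAM

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the system prompt plus the last MAX_TURNS user/assistant pairs."""
    system, chat = messages[:1], messages[1:]
    return system + chat[-2 * MAX_TURNS:]
```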

Perfect for deploying multi-GGUF chat models on free-tier HuggingFace Spaces!

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference