# HuggingFace Spaces GPU Setup Guide 🚀
This guide will help you enable GPU acceleration for GRDN AI on HuggingFace Spaces with your Nvidia T4 grant.
## Prerequisites
- HuggingFace Space with GPU enabled (Nvidia T4 small: 4 vCPU, 15GB RAM, 16GB GPU)
- Model files uploaded to your Space
## Setup Steps

### 1. Enable GPU in Space Settings
- Go to your Space settings on HuggingFace
- Navigate to "Hardware" section
- Select "T4 small" (or your granted GPU tier)
- Save changes
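If you prefer to script this step, the `huggingface_hub` library exposes the same hardware setting. A minimal sketch, with `your-username/grdn-ai` as a placeholder for your actual Space id:

```python
# Sketch: request T4 small hardware for a Space programmatically.
# Assumes you are logged in (huggingface-cli login) with a write token;
# the repo_id below is a placeholder.
from huggingface_hub import HfApi, SpaceHardware

api = HfApi()
api.request_space_hardware(
    repo_id="your-username/grdn-ai",      # placeholder: your Space
    hardware=SpaceHardware.T4_SMALL,      # the granted GPU tier
)
```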
### 2. Upload Model Files
Your Space needs the GGUF model files in the `src/models/` directory:

- `llama-2-7b-chat.Q4_K_M.gguf` (for Llama2)
- `decilm-7b-uniform-gqa-q8_0.gguf` (for DeciLM)
You can upload these via:
- HuggingFace web interface (Files tab)
- Git LFS (recommended for large files)
- HuggingFace Hub CLI
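If you script the upload, the `huggingface_hub` Python library handles large files via LFS automatically. A minimal sketch, again with a placeholder Space id:

```python
# Sketch: upload a GGUF model file into src/models/ of a Space.
# Requires a write token (huggingface-cli login); repo_id is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="llama-2-7b-chat.Q4_K_M.gguf",          # local file
    path_in_repo="src/models/llama-2-7b-chat.Q4_K_M.gguf",  # target path in the Space
    repo_id="your-username/grdn-ai",                        # placeholder: your Space
    repo_type="space",
)
```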
### 3. Install Dependencies

Make sure your Space has the updated `requirements.txt`, which includes:

```
torch>=2.0.0
```
### 4. Verify GPU Detection
Once your Space restarts, check the sidebar in the app for:
- 🚀 GPU Acceleration: ENABLED - GPU is working!
- ⚠️ GPU Acceleration: DISABLED - something's wrong
You should also see in the logs:
```
🤗 Running on HuggingFace Spaces
🚀 GPU detected: Tesla T4 with 15.xx GB memory
🚀 Will offload all layers to GPU (n_gpu_layers=-1)
✅ GPU acceleration ENABLED with -1 layers
```
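You can also run the same checks by hand from a Python session on the Space. This probe uses only the PyTorch calls the app itself relies on:

```python
# Quick GPU probe: mirrors the checks behind the log lines above.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU detected: {props.name} with {props.total_memory / 1e9:.2f} GB memory")
else:
    print("No GPU detected via torch.cuda")
```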
## How It Works
The app now automatically:
- Detects HuggingFace Spaces environment via
SPACE_IDorSPACE_AUTHOR_NAMEenv variables - Checks for GPU availability using PyTorch's
torch.cuda.is_available() - Configures LlamaCPP to use GPU with
n_gpu_layers=-1(all layers on GPU) - Shows status in the sidebar UI
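The real implementation lives in `src/backend/chatbot.py` (see the summary below); as a rough sketch of the logic just described, with illustrative return values:

```python
# Sketch of the detection flow: HF Spaces env check plus CUDA probe.
# The dict keys here are illustrative, not the actual return shape.
import os
import torch

def detect_gpu_and_environment():
    on_spaces = bool(os.getenv("SPACE_ID") or os.getenv("SPACE_AUTHOR_NAME"))
    has_gpu = torch.cuda.is_available()
    # Offload every layer when a GPU is present, otherwise stay on CPU.
    n_gpu_layers = -1 if has_gpu else 0
    return {"on_spaces": on_spaces, "has_gpu": has_gpu, "n_gpu_layers": n_gpu_layers}
```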
## GPU Configuration

- CPU Mode: `n_gpu_layers=0` - all computation on CPU (slow)
- GPU Mode: `n_gpu_layers=-1` - all model layers offloaded to GPU (fast)
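For illustration, here is the flag in use with the `llama-cpp-python` package; treat the exact constructor call as a sketch (the app wires this through its own `init_llm()`), but the parameter is the same one named above:

```python
# Illustration: the n_gpu_layers flag decides CPU vs GPU offload.
# Model path is the Llama2 file from step 2.
from llama_cpp import Llama

llm = Llama(
    model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload all layers to GPU; 0 = CPU only
)
```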
## Performance Expectations
With GPU acceleration on Nvidia T4:
- Response time: ~2-5 seconds (vs 30-60+ seconds on CPU)
- Token generation: ~20-50 tokens/sec (vs 1-3 tokens/sec on CPU)
- Memory: Model fits comfortably in 16GB VRAM
## Troubleshooting

### GPU Not Detected
- Check Space hardware: Ensure T4 is selected in settings
- Check logs: Look for GPU detection messages
- Verify torch installation: `torch.cuda.is_available()` should return `True`
- Try restarting: a Space restart is sometimes required after a hardware change
### Model File Not Found

If you see: `⚠️ Model not found at src/models/...`
- Upload the model files to the correct path
- Check file names match exactly
- Ensure files aren't corrupted during upload
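To confirm the files landed where the app expects them, a small check you can run from the Space's terminal (file names from step 2; a missing file or a size near zero points to an upload problem):

```python
# List the expected model files and their sizes.
from pathlib import Path

for name in ["llama-2-7b-chat.Q4_K_M.gguf", "decilm-7b-uniform-gqa-q8_0.gguf"]:
    path = Path("src/models") / name
    if path.exists():
        print(f"{path}: {path.stat().st_size / 1e9:.2f} GB")
    else:
        print(f"MISSING: {path}")
```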
### Out of Memory Errors
If the GPU runs out of memory:
- The quantized models (Q4_K_M, q8_0) are designed to fit in 16GB
- Try restarting the Space
- Check if other processes are using GPU memory
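To check the last point, PyTorch can report free vs. total VRAM directly; a minimal probe:

```python
# Report free vs total GPU memory; usage by other processes shows up here too.
import torch

free, total = torch.cuda.mem_get_info()
print(f"GPU memory: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB total")
```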
### Still Slow After GPU Setup
- Verify GPU is actually being used (check logs)
- Ensure `n_gpu_layers=-1` is set (check initialization logs)
- Check that the HuggingFace Space isn't in "Sleeping" mode
- Verify model is fully loaded before making requests
## Code Changes Summary
The following changes enable automatic GPU detection:
`src/backend/chatbot.py`:
- Added `detect_gpu_and_environment()` function
- Modified `init_llm()` to use dynamic GPU configuration
- Added automatic path detection for HF Spaces vs local

`app.py`:
- Added GPU status indicator in sidebar
- Shows real-time GPU availability

`src/requirements.txt`:
- Added `torch>=2.0.0` for GPU detection
## Testing Locally
To test GPU detection locally (if you have an Nvidia GPU):
```bash
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Run the app
streamlit run app.py
```
Without a GPU locally, you'll see:

```
⚠️ No GPU detected via torch.cuda
⚠️ Running on CPU (no GPU detected)
```
Note: This GPU setup is backward compatible - the app will still work on CPU if no GPU is available!