HuggingFace Spaces GPU Setup Guide 🚀

This guide will help you enable GPU acceleration for GRDN AI on HuggingFace Spaces with your Nvidia T4 grant.

Prerequisites

  • HuggingFace Space with GPU enabled (Nvidia T4 small: 4 vCPU, 15GB RAM, 16GB VRAM)
  • Model files uploaded to your Space

Setup Steps

1. Enable GPU in Space Settings

  1. Go to your Space settings on HuggingFace
  2. Navigate to "Hardware" section
  3. Select "T4 small" (or your granted GPU tier)
  4. Save changes

2. Upload Model Files

Your Space needs the GGUF model files in the src/models/ directory:

  • llama-2-7b-chat.Q4_K_M.gguf (for Llama2)
  • decilm-7b-uniform-gqa-q8_0.gguf (for DeciLM)

You can upload these via:

  • HuggingFace web interface (Files tab)
  • Git LFS (recommended for large files)
  • HuggingFace Hub CLI
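
If you prefer to script the upload, the huggingface_hub Python library also works. A minimal sketch, assuming you are already logged in (or pass a token) - the repo_id and local file name below are placeholders for your own Space and files:

```python
# Sketch: upload a GGUF model file into a Space with huggingface_hub.
# Replace repo_id and the file names with your own values.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="llama-2-7b-chat.Q4_K_M.gguf",           # local file
    path_in_repo="src/models/llama-2-7b-chat.Q4_K_M.gguf",   # destination in the Space
    repo_id="your-username/your-space-name",                  # placeholder
    repo_type="space",
)
```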

3. Install Dependencies

Make sure your Space has the updated requirements.txt, which includes:

torch>=2.0.0

4. Verify GPU Detection

Once your Space restarts, check the sidebar in the app for:

  • 🚀 GPU Acceleration: ENABLED - the GPU is working
  • ⚠️ GPU Acceleration: DISABLED - the GPU was not detected or is not configured

You should also see in the logs:

🤗 Running on HuggingFace Spaces
🚀 GPU detected: Tesla T4 with 15.xx GB memory
🚀 Will offload all layers to GPU (n_gpu_layers=-1)
✅ GPU acceleration ENABLED with -1 layers

How It Works

The app now automatically:

  1. Detects HuggingFace Spaces environment via SPACE_ID or SPACE_AUTHOR_NAME env variables
  2. Checks for GPU availability using PyTorch's torch.cuda.is_available()
  3. Configures LlamaCPP to use GPU with n_gpu_layers=-1 (all layers on GPU)
  4. Shows status in the sidebar UI
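
A simplified sketch of that detection flow (the actual detect_gpu_and_environment() in src/backend/chatbot.py may differ in details):

```python
# Simplified sketch of the GPU/environment detection described above.
import os
import torch

def detect_gpu_and_environment():
    # HF Spaces sets SPACE_ID / SPACE_AUTHOR_NAME in the container environment
    on_hf_spaces = bool(os.getenv("SPACE_ID") or os.getenv("SPACE_AUTHOR_NAME"))

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"🚀 GPU detected: {name} with {mem_gb:.2f} GB memory")
        n_gpu_layers = -1   # offload all layers to the GPU
    else:
        print("⚠️ Running on CPU (no GPU detected)")
        n_gpu_layers = 0

    return {"on_hf_spaces": on_hf_spaces, "n_gpu_layers": n_gpu_layers}
```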

GPU Configuration

  • CPU Mode: n_gpu_layers=0 - All computation on CPU (slow)
  • GPU Mode: n_gpu_layers=-1 - All model layers offloaded to GPU (fast)
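
With the llama-cpp-python bindings, the setting is passed when the model is loaded; a sketch (the model path and context size here are examples, and the project may pass the same parameter through a wrapper):

```python
# Sketch: n_gpu_layers controls how many layers are offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 = all layers on GPU; 0 = CPU only
    n_ctx=4096,        # example context window
)
```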

Performance Expectations

With GPU acceleration on Nvidia T4:

  • Response time: ~2-5 seconds (vs 30-60+ seconds on CPU)
  • Token generation: ~20-50 tokens/sec (vs 1-3 tokens/sec on CPU)
  • Memory: Model fits comfortably in 16GB VRAM

Troubleshooting

GPU Not Detected

  1. Check Space hardware: Ensure T4 is selected in settings
  2. Check logs: Look for GPU detection messages
  3. Verify torch installation: torch.cuda.is_available() should return True
  4. Try restarting: the Space sometimes needs a restart after a hardware change
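
A quick check you can run from a Python shell on the Space (or add temporarily to the app) to confirm what PyTorch sees:

```python
# Quick diagnostic: prints what PyTorch can see on the current hardware.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Device count:", torch.cuda.device_count())
```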

Model File Not Found

If you see: ⚠️ Model not found at src/models/...

  • Upload the model files to the correct path
  • Check file names match exactly
  • Ensure files aren't corrupted during upload
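
A small sketch for verifying the expected paths (the file names are the ones listed in step 2):

```python
# Checks that the GGUF files exist where the app looks for them.
from pathlib import Path

for name in ["llama-2-7b-chat.Q4_K_M.gguf", "decilm-7b-uniform-gqa-q8_0.gguf"]:
    path = Path("src/models") / name
    if path.exists():
        print(f"{path}: {path.stat().st_size / 1024**3:.2f} GB")
    else:
        print(f"{path}: MISSING")
```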

Out of Memory Errors

If GPU runs out of memory:

  • The quantized models (Q4_K_M, q8_0) are small enough to fit within the T4's 16GB of VRAM
  • Try restarting the Space
  • Check if other processes are using GPU memory
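
To see how much VRAM is free before the model loads, PyTorch exposes the totals directly; a sketch:

```python
# Prints free vs. total GPU memory as reported by the CUDA driver.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # values in bytes
    print(f"GPU memory: {free / 1024**3:.2f} GB free of {total / 1024**3:.2f} GB total")
```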

Still Slow After GPU Setup

  1. Verify GPU is actually being used (check logs)
  2. Ensure n_gpu_layers=-1 is set (check initialization logs)
  3. Check HuggingFace Space isn't in "Sleeping" mode
  4. Verify model is fully loaded before making requests

Code Changes Summary

The following changes enable automatic GPU detection:

  1. src/backend/chatbot.py:

    • Added detect_gpu_and_environment() function
    • Modified init_llm() to use dynamic GPU configuration
    • Automatic path detection for HF Spaces vs local
  2. app.py:

    • Added GPU status indicator in sidebar
    • Shows real-time GPU availability
  3. src/requirements.txt:

    • Added torch>=2.0.0 for GPU detection
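
The sidebar indicator in app.py can be as simple as the following sketch (the exact wording and logic in the repo may differ):

```python
# Sketch of the sidebar GPU status indicator shown in the app.
import streamlit as st
import torch

if torch.cuda.is_available():
    st.sidebar.success("🚀 GPU Acceleration: ENABLED")
else:
    st.sidebar.warning("⚠️ GPU Acceleration: DISABLED")
```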

Testing Locally

To test GPU detection locally (if you have an Nvidia GPU):

# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Run the app
streamlit run app.py

Without GPU locally, you'll see:

⚠️ No GPU detected via torch.cuda
⚠️ Running on CPU (no GPU detected)

Note: This GPU setup is backward compatible - the app will still work on CPU if no GPU is available!