HuggingFace Spaces GPU Setup Guide 🚀

This guide will help you enable GPU acceleration for GRDN AI on HuggingFace Spaces with your Nvidia T4 grant.

Prerequisites

  • HuggingFace Space with GPU enabled (Nvidia T4 small: 4 vCPU, 15GB RAM, 16GB VRAM)
  • Model files uploaded to your Space

Setup Steps

1. Enable GPU in Space Settings

  1. Go to your Space settings on HuggingFace
  2. Navigate to "Hardware" section
  3. Select "T4 small" (or your granted GPU tier)
  4. Save changes

2. Upload Model Files

Your Space needs the GGUF model files in the src/models/ directory:

  • llama-2-7b-chat.Q4_K_M.gguf (for Llama2)
  • decilm-7b-uniform-gqa-q8_0.gguf (for DeciLM)

You can upload these via:

  • HuggingFace web interface (Files tab)
  • Git LFS (recommended for large files)
  • HuggingFace Hub CLI
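
If you prefer to script the upload, the huggingface_hub Python library also works. A minimal sketch, assuming you are already logged in (or pass a token) - the repo_id and local file name below are placeholders for your own Space and files:

```python
# Sketch: upload a GGUF model file into a Space with huggingface_hub.
# Replace repo_id and the file names with your own values.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="llama-2-7b-chat.Q4_K_M.gguf",           # local file
    path_in_repo="src/models/llama-2-7b-chat.Q4_K_M.gguf",   # destination in the Space
    repo_id="your-username/your-space-name",                  # placeholder
    repo_type="space",
)
```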

3. Install Dependencies

Make sure your Space has the updated requirements.txt, which includes:

torch>=2.0.0

4. Verify GPU Detection

Once your Space restarts, check the sidebar in the app for:

  • 🚀 GPU Acceleration: ENABLED - the GPU is working
  • ⚠️ GPU Acceleration: DISABLED - the GPU was not detected or is not configured

You should also see in the logs:

🤗 Running on HuggingFace Spaces
🚀 GPU detected: Tesla T4 with 15.xx GB memory
🚀 Will offload all layers to GPU (n_gpu_layers=-1)
✅ GPU acceleration ENABLED with -1 layers

How It Works

The app now automatically:

  1. Detects HuggingFace Spaces environment via SPACE_ID or SPACE_AUTHOR_NAME env variables
  2. Checks for GPU availability using PyTorch's torch.cuda.is_available()
  3. Configures LlamaCPP to use GPU with n_gpu_layers=-1 (all layers on GPU)
  4. Shows status in the sidebar UI
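
A simplified sketch of that detection flow (the actual detect_gpu_and_environment() in src/backend/chatbot.py may differ in details):

```python
# Simplified sketch of the GPU/environment detection described above.
import os
import torch

def detect_gpu_and_environment():
    # HF Spaces sets SPACE_ID / SPACE_AUTHOR_NAME in the container environment
    on_hf_spaces = bool(os.getenv("SPACE_ID") or os.getenv("SPACE_AUTHOR_NAME"))

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"🚀 GPU detected: {name} with {mem_gb:.2f} GB memory")
        n_gpu_layers = -1   # offload all layers to the GPU
    else:
        print("⚠️ Running on CPU (no GPU detected)")
        n_gpu_layers = 0

    return {"on_hf_spaces": on_hf_spaces, "n_gpu_layers": n_gpu_layers}
```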

GPU Configuration

  • CPU Mode: n_gpu_layers=0 - All computation on CPU (slow)
  • GPU Mode: n_gpu_layers=-1 - All model layers offloaded to GPU (fast)
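
With the llama-cpp-python bindings, the setting is passed when the model is loaded; a sketch (the model path and context size here are examples, and the project may pass the same parameter through a wrapper):

```python
# Sketch: n_gpu_layers controls how many layers are offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 = all layers on GPU; 0 = CPU only
    n_ctx=4096,        # example context window
)
```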

Performance Expectations

With GPU acceleration on Nvidia T4:

  • Response time: ~2-5 seconds (vs 30-60+ seconds on CPU)
  • Token generation: ~20-50 tokens/sec (vs 1-3 tokens/sec on CPU)
  • Memory: Model fits comfortably in 16GB VRAM

Troubleshooting

GPU Not Detected

  1. Check Space hardware: Ensure T4 is selected in settings
  2. Check logs: Look for GPU detection messages
  3. Verify torch installation: torch.cuda.is_available() should return True
  4. Try restarting: the Space sometimes needs a restart after a hardware change
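
A quick check you can run from a Python shell on the Space (or add temporarily to the app) to confirm what PyTorch sees:

```python
# Quick diagnostic: prints what PyTorch can see on the current hardware.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Device count:", torch.cuda.device_count())
```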

Model File Not Found

If you see: ⚠️ Model not found at src/models/...

  • Upload the model files to the correct path
  • Check file names match exactly
  • Ensure files aren't corrupted during upload
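
A small sketch for verifying the expected paths (the file names are the ones listed in step 2):

```python
# Checks that the GGUF files exist where the app looks for them.
from pathlib import Path

for name in ["llama-2-7b-chat.Q4_K_M.gguf", "decilm-7b-uniform-gqa-q8_0.gguf"]:
    path = Path("src/models") / name
    if path.exists():
        print(f"{path}: {path.stat().st_size / 1024**3:.2f} GB")
    else:
        print(f"{path}: MISSING")
```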

Out of Memory Errors

If GPU runs out of memory:

  • The quantized models (Q4_K_M, q8_0) are small enough to fit within the T4's 16GB of VRAM
  • Try restarting the Space
  • Check if other processes are using GPU memory
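
To see how much VRAM is free before the model loads, PyTorch exposes the totals directly; a sketch:

```python
# Prints free vs. total GPU memory as reported by the CUDA driver.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # values in bytes
    print(f"GPU memory: {free / 1024**3:.2f} GB free of {total / 1024**3:.2f} GB total")
```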

Still Slow After GPU Setup

  1. Verify GPU is actually being used (check logs)
  2. Ensure n_gpu_layers=-1 is set (check initialization logs)
  3. Check HuggingFace Space isn't in "Sleeping" mode
  4. Verify model is fully loaded before making requests

Code Changes Summary

The following changes enable automatic GPU detection:

  1. src/backend/chatbot.py:

    • Added detect_gpu_and_environment() function
    • Modified init_llm() to use dynamic GPU configuration
    • Automatic path detection for HF Spaces vs local
  2. app.py:

    • Added GPU status indicator in sidebar
    • Shows real-time GPU availability
  3. src/requirements.txt:

    • Added torch>=2.0.0 for GPU detection
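
The sidebar indicator in app.py can be as simple as the following sketch (the exact wording and logic in the repo may differ):

```python
# Sketch of the sidebar GPU status indicator shown in the app.
import streamlit as st
import torch

if torch.cuda.is_available():
    st.sidebar.success("🚀 GPU Acceleration: ENABLED")
else:
    st.sidebar.warning("⚠️ GPU Acceleration: DISABLED")
```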

Testing Locally

To test GPU detection locally (if you have an Nvidia GPU):

# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Run the app
streamlit run app.py

Without GPU locally, you'll see:

⚠️ No GPU detected via torch.cuda
⚠️ Running on CPU (no GPU detected)

Note: This GPU setup is backward compatible - the app will still work on CPU if no GPU is available!