---
title: MultiModel LLM ERAV2
emoji: 🚀
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Multi-Modal LLM Gradio App

Project Overview

This project is a multi-modal language model Gradio app that accepts text, image, and audio inputs, and outputs text responses. The app mimics a ChatGPT-style interface, allowing users to interact using multiple input modes.

The app leverages:

  • CLIP for image processing
  • Whisper for audio transcription (ASR)
  • A text-based model (like GPT or Phi) for generating text responses
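
For reference, here is a minimal sketch of how the three components could be loaded with Hugging Face transformers. The specific checkpoints (openai/clip-vit-base-patch32, openai/whisper-base, microsoft/phi-2) are illustrative assumptions, not necessarily the ones app.py uses:

    # Illustrative loading of the three model components via `transformers`.
    # Checkpoint names below are assumptions, not necessarily app.py's choices.
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer,                # text LLM
        CLIPModel, CLIPProcessor,                           # image encoder
        WhisperForConditionalGeneration, WhisperProcessor,  # speech-to-text
    )

    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
    whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-base")

    llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")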

Features

  • Text Input: Users can input text directly for response generation.
  • Image Input: Users can upload images, which are processed by the CLIP model.
  • Audio Input: Users can upload or record audio files, which are transcribed by the Whisper model and then processed for response.
  • ChatGPT-Like Interface: Simple and intuitive interface to handle multi-modal inputs and provide text-based output.
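
As a sketch, assuming a simple gr.Interface layout (app.py may structure the UI differently), the three input modes map to standard Gradio components:

    # Sketch of a multi-modal Gradio UI; app.py's actual layout may differ.
    import gradio as gr

    def respond(text, image, audio):
        # Stub handler: the real app routes each input through CLIP,
        # Whisper, and the language model (see "How It Works" below).
        return f"got text={bool(text)}, image={image is not None}, audio={audio is not None}"

    demo = gr.Interface(
        fn=respond,
        inputs=[
            gr.Textbox(label="Text"),
            gr.Image(type="pil", label="Image"),
            gr.Audio(type="filepath", label="Audio"),  # upload or record
        ],
        outputs=gr.Textbox(label="Response"),
        title="MultiModel LLM ERAV2",
    )

    if __name__ == "__main__":
        demo.launch()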

Installation

  1. Clone the repository:

    git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
    cd MultiModel_LLM_ERAV2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Run the app:

    python app.py
    

How It Works

  1. Text Processing: Input text is passed to a language model (like GPT or Phi) to generate a response.
  2. Image Processing: Images are processed using CLIP, which extracts embeddings. These embeddings are then mapped into a representation the language model can consume (see the sketch after this list).
  3. Audio Processing: Audio files are transcribed into text using Whisper. This text is passed into the language model for response generation.
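
Put together, the three paths might look like the following sketch. It reuses the models loaded in the overview above; the projection of CLIP embeddings into the language model's input space is an assumption about the app's design:

    # Sketch of the three processing paths (reuses the models loaded above).
    import torch

    def generate_text(prompt, max_new_tokens=128):
        # Text path: tokenize, generate, decode.
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = llm.generate(ids, max_new_tokens=max_new_tokens)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    def embed_image(pil_image):
        # Image path: CLIP embedding; a learned projection layer (assumed
        # here) would map it into the language model's input space.
        inputs = clip_processor(images=pil_image, return_tensors="pt")
        with torch.no_grad():
            return clip_model.get_image_features(**inputs)

    def transcribe(audio_array, sampling_rate=16000):
        # Audio path: Whisper turns raw audio into a transcript, which is
        # then handled like ordinary text input.
        inputs = whisper_processor(audio_array, sampling_rate=sampling_rate,
                                   return_tensors="pt")
        ids = whisper_model.generate(inputs.input_features)
        return whisper_processor.batch_decode(ids, skip_special_tokens=True)[0]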

Usage

  • Text Input: Enter text in the provided textbox and click "Submit" to generate a response.
  • Image Input: Upload an image and click "Submit" to generate a response based on the image.
  • Audio Input: Upload or record an audio file, then click "Submit" to transcribe it and generate a response.
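
The Space can also be queried programmatically with gradio_client. The endpoint name and argument order below are assumptions about how app.py registers its handler:

    # Calling the Space from Python; "/predict" and the argument order
    # are assumptions about app.py's API.
    from gradio_client import Client

    client = Client("Vasudevakrishna/MultiModel_LLM_ERAV2")
    result = client.predict(
        "Hello, what can you do?",  # text input
        None,                       # image input (omitted)
        None,                       # audio input (omitted)
        api_name="/predict",
    )
    print(result)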

Future Improvements

  • Add advanced features like drag-and-drop file upload or live audio recording for a better user experience.
  • Speed up image embedding by running CLIP in real time with more GPU resources.
  • Implement end-to-end training of all components for better response quality.

License

This project is licensed under the MIT License.