---
title: MultiModel LLM ERAV2
emoji: πŸš€
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---


Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


# Multi-Modal LLM Gradio App

## Project Overview

This project is a **multi-modal language model** app built with Gradio that accepts **text**, **image**, and **audio** inputs and returns **text** responses. It provides a **ChatGPT-style interface**, letting users interact through multiple input modes.

The app leverages:
- **CLIP** for image processing
- **Whisper** for audio transcription (ASR)
- A **text-based model** (such as GPT or Phi) for generating the text responses (see the loading sketch below)
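
The sketch below shows one way these three components could be loaded with the Hugging Face `transformers` library. The checkpoint names are illustrative assumptions and may differ from what `app.py` actually uses.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
    pipeline,
)

# Vision encoder: turns images into embeddings (assumed CLIP checkpoint).
clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Speech recognition: turns audio into text (assumed Whisper checkpoint).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Language model: generates the final text response (assumed Phi checkpoint).
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
```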

## Features

- **Text Input**: Users can input text directly for response generation.
- **Image Input**: Users can upload images, which are processed by the CLIP model.
- **Audio Input**: Users can upload or record audio files, which are transcribed by the Whisper model and then processed for response.
- **ChatGPT-Like Interface**: A simple, intuitive interface that handles multi-modal inputs and returns a text response (a minimal layout sketch follows this list).
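
As a rough illustration of the interface, the following is a minimal Gradio layout with the three input modes wired to a single submit button. The `generate_response` handler here is a hypothetical placeholder, not the real logic in `app.py`.

```python
import gradio as gr

def generate_response(text, image, audio):
    # Placeholder handler: report which inputs were received.
    # The real app would route these through CLIP, Whisper, and the text model.
    parts = []
    if text:
        parts.append(f"text: {text}")
    if image is not None:
        parts.append("image received")
    if audio:
        parts.append("audio received")
    return " | ".join(parts) if parts else "No input provided."

with gr.Blocks() as demo:
    with gr.Row():
        text_in = gr.Textbox(label="Text input")
        image_in = gr.Image(label="Image input", type="pil")
        audio_in = gr.Audio(label="Audio input", sources=["upload", "microphone"], type="filepath")
    response = gr.Textbox(label="Response")
    gr.Button("Submit").click(generate_response, [text_in, image_in, audio_in], [response])

if __name__ == "__main__":
    demo.launch()
```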

## Installation

1. Clone the repository:
   ```bash
   git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
   cd MultiModel_LLM_ERAV2
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Run the app:
   ```bash
   python app.py
   ```

## How It Works

1. **Text Processing**: Input text is passed to a language model (such as GPT or Phi) to generate a response.
2. **Image Processing**: Images are encoded with CLIP, which extracts embeddings; these embeddings are then mapped into a representation the text model can consume.
3. **Audio Processing**: Audio files are transcribed to text with Whisper, and the transcription is passed to the language model for response generation. A hedged sketch of this flow follows.
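
The following self-contained sketch illustrates this flow under the same assumptions as the loading sketch in the Project Overview: CLIP and Whisper checkpoints, a Phi-style language model, and a hypothetical linear `projection` layer that maps image embeddings into the language model's space.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
    pipeline,
)

# Components as in the Project Overview sketch (checkpoint names are assumptions).
clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Hypothetical projection from CLIP's hidden size into the LLM's embedding space.
projection = torch.nn.Linear(clip_model.config.hidden_size, llm.config.hidden_size)

def generate_text(prompt, max_new_tokens=128):
    """Step 1: generate a text response with the language model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = llm.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def embed_image(pil_image):
    """Step 2: extract CLIP embeddings and project them for the text model."""
    pixels = clip_processor(images=pil_image, return_tensors="pt").pixel_values
    with torch.no_grad():
        features = clip_model(pixel_values=pixels).last_hidden_state  # (1, seq, dim)
    return projection(features)

def transcribe_audio(audio_path):
    """Step 3: transcribe an uploaded or recorded audio file to text with Whisper."""
    return asr(audio_path)["text"]
```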

## Usage

- **Text Input**: Enter text in the provided textbox and click "Submit" to generate a response.
- **Image Input**: Upload an image and click "Submit" to generate a response based on the image.
- **Audio Input**: Upload or record an audio file, then click "Submit" to transcribe it and generate a response.

## Future Improvements

- Add advanced features such as drag-and-drop file upload and live audio recording for a better user experience.
- Speed up image embedding by running CLIP in real time on additional GPU resources.
- Implement end-to-end training of all components for better response quality.

## License

This project is licensed under the MIT License.