shukdevdatta123 committed f757676 (verified) · Parent(s): a7e7cfd

Update README.md

Files changed (1): README.md (+160 -0)

short_description: A multimodal chatbot that supports both text and image chat.
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

---

# Building a Multimodal Chatbot with Gradio and OpenAI

In recent years, the field of artificial intelligence (AI) has seen an exciting leap in multimodal capabilities. Multimodal systems can understand and generate multiple types of input, such as text and images, to provide richer, more dynamic responses. One such example is a multimodal chatbot that can process both text and image inputs using the OpenAI API.

In this article, we'll walk through how to create a multimodal chatbot using **Gradio** and the **OpenAI API** that lets users submit text and images, interact with the model, and receive insightful responses.

## Key Components

Before we dive into the code, let's break down the core components of this chatbot (a minimal dependency sketch follows the list):

- **Gradio**: A simple, open-source Python library for building UIs for machine learning models. It lets you quickly create and deploy interfaces for any ML model, including ones that take images, text, or audio as input.

- **OpenAI API**: This is the engine behind our chatbot. OpenAI provides chat models such as `gpt-3.5-turbo` and `gpt-4`, as well as reasoning models such as `o1` that also accept image input, which is what enables the multimodal behavior here.

- **Python and Pillow**: To handle image preprocessing, we use `PIL` (the Python Imaging Library, installed via the Pillow package) to convert uploaded images into a format that can be passed to the OpenAI model.
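For reference, here is a minimal sketch of the dependencies and imports these components imply. The package list is an assumption based on the snippets below, not the Space's actual `requirements.txt`:

```python
# Assumed requirements.txt: gradio, openai, Pillow

import base64          # encode image bytes for the API request
import io              # in-memory buffer used when serializing the image

import gradio as gr    # UI components and event wiring
import openai          # OpenAI client (legacy pre-1.0 interface, as used below)
from PIL import Image  # Pillow; Gradio returns uploads as PIL images when type="pil"
```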
## The Chatbot Overview

The chatbot can take two main types of input:
1. **Text Input**: Ask a question or give a prompt to the model.
2. **Image Input**: Upload an image, and the model will interpret the image and provide a response based on its content.

The interface lets the user adjust two main settings:
- **Reasoning Effort**: Controls how much reasoning the model spends before answering (and, indirectly, how detailed the answer is). The options are `low`, `medium`, and `high`.
- **Model Choice**: Users can select between two models: `o1` (which accepts image input) and `o3-mini` (text-only).

The interface is simple, intuitive, and interactive, with the chat history displayed on the side.

## Step-by-Step Code Explanation

### 1. Set Up Gradio UI

Gradio makes it easy to create clean interfaces for AI models. We start by defining a custom interface with the following components (the event wiring is sketched just after the code block):

- **Textbox for the OpenAI API Key**: Users provide their OpenAI API key to authenticate requests.
- **Image Upload and Text Input Fields**: Users can upload an image, enter text, or both.
- **Dropdowns for Reasoning Effort and Model Selection**: Choose how much reasoning the model should apply and which model to use.
- **Submit and Clear Buttons**: These trigger the logic to process user inputs and clear the chat history, respectively.

```python
with gr.Blocks(css=custom_css) as demo:  # custom_css holds the stylesheet from Step 6
    gr.Markdown("""
    <div class="gradio-header">
        <h1>Multimodal Chatbot (Text + Image)</h1>
        <h3>Interact with a chatbot using text or image inputs</h3>
    </div>
    """)

    # User inputs and chat history
    openai_api_key = gr.Textbox(label="Enter OpenAI API Key", type="password", placeholder="sk-...", interactive=True)
    image_input = gr.Image(label="Upload an Image", type="pil")
    input_text = gr.Textbox(label="Enter Text Question", placeholder="Ask a question or provide text", lines=2)

    # Reasoning effort and model selection
    reasoning_effort = gr.Dropdown(label="Reasoning Effort", choices=["low", "medium", "high"], value="medium")
    model_choice = gr.Dropdown(label="Select Model", choices=["o1", "o3-mini"], value="o1")

    submit_btn = gr.Button("Ask!", elem_id="submit-btn")
    clear_btn = gr.Button("Clear History", elem_id="clear-history")

    # Chat history display
    chat_history = gr.Chatbot()
```
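The block above only declares the components. A minimal sketch of the event wiring, assuming the `chatbot` and `clear_history` handlers defined in Steps 4 and 5, would sit inside the same `gr.Blocks` context:

```python
    # Still inside `with gr.Blocks(css=custom_css) as demo:`
    # Send the inputs to the chatbot handler and refresh the history display
    submit_btn.click(
        fn=chatbot,
        inputs=[input_text, image_input, openai_api_key, reasoning_effort, model_choice, chat_history],
        outputs=[input_text, chat_history],
    )

    # Reset the conversation when "Clear History" is clicked
    clear_btn.click(fn=clear_history, inputs=[], outputs=[input_text, chat_history])
```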
### 2. Handle Image and Text Inputs

The `generate_response` function processes both image and text inputs by sending them to OpenAI's API. If an image is uploaded, it is converted into a **base64 string** so it can be embedded in the request as a data URL.

For text inputs, the prompt is passed to the model directly.

```python
def generate_response(input_text, image, openai_api_key, reasoning_effort="medium", model_choice="o1"):
    openai.api_key = openai_api_key

    # If an image was uploaded, replace the prompt with a base64-encoded data URL
    if image:
        image_info = get_base64_string_from_image(image)
        input_text = f"data:image/png;base64,{image_info}"

    # o1 accepts image input; o3-mini is text-only
    if model_choice == "o1":
        messages = [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": input_text}}]}]
    elif model_choice == "o3-mini":
        messages = [{"role": "user", "content": [{"type": "text", "text": input_text}]}]

    # API request
    response = openai.ChatCompletion.create(
        model=model_choice,
        messages=messages,
        reasoning_effort=reasoning_effort,
        max_completion_tokens=2000
    )
    return response["choices"][0]["message"]["content"]
```
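The call above uses the legacy `openai<1.0` Python SDK (`openai.ChatCompletion.create`). If you are on `openai>=1.0`, a roughly equivalent request, shown here only as a sketch, looks like this:

```python
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)

response = client.chat.completions.create(
    model=model_choice,
    messages=messages,
    reasoning_effort=reasoning_effort,   # accepted by reasoning models such as o1 / o3-mini
    max_completion_tokens=2000,
)
reply = response.choices[0].message.content
```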

### 3. Image-to-Base64 Conversion

To make sure the image is properly formatted, we convert it into a **base64** string that can be embedded directly in the OpenAI request. This conversion is handled by the `get_base64_string_from_image` function, which uses the standard-library `io` and `base64` modules:

```python
def get_base64_string_from_image(pil_image):
    # Serialize the PIL image to PNG bytes in memory, then base64-encode them
    buffered = io.BytesIO()
    pil_image.save(buffered, format="PNG")
    img_bytes = buffered.getvalue()
    base64_str = base64.b64encode(img_bytes).decode("utf-8")
    return base64_str
```
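For example, this is how the returned string plugs into the data URL that `generate_response` builds (here `img` stands for the PIL image Gradio passes in):

```python
# Build the data URL that is sent to the API as the image_url
data_url = f"data:image/png;base64,{get_base64_string_from_image(img)}"
```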

### 4. Chat History and Interaction

The chat history is stored and displayed using Gradio's `gr.Chatbot`. Each time the user submits a question or an image, the conversation history is updated, showing both the user's message and the assistant's response in an easy-to-read format.

```python
def chatbot(input_text, image, openai_api_key, reasoning_effort, model_choice, history=[]):
    # Generate the assistant's reply and append the (user, assistant) pair to the history
    response = generate_response(input_text, image, openai_api_key, reasoning_effort, model_choice)
    history.append((f"User: {input_text}", f"Assistant: {response}"))
    # Clear the text box and return the updated history for the Chatbot component
    return "", history
```
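The tuple-based history above is the classic Gradio format. Recent Gradio releases also support an OpenAI-style messages format via `gr.Chatbot(type="messages")`; if you prefer that, the handler could be adapted roughly as follows (a sketch assuming a recent Gradio version; `chatbot_messages` is a hypothetical name):

```python
def chatbot_messages(input_text, image, openai_api_key, reasoning_effort, model_choice, history=None):
    # Same logic, but each history entry is a dict with "role" and "content" keys
    history = history or []
    response = generate_response(input_text, image, openai_api_key, reasoning_effort, model_choice)
    history.append({"role": "user", "content": input_text})
    history.append({"role": "assistant", "content": response})
    return "", history
```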

### 5. Clear History Function

To reset the conversation, a simple function clears the chat history (and the text box) when the "Clear History" button is clicked.

```python
def clear_history():
    # Empty the text box and reset the chat history
    return "", []
```

### 6. Custom CSS for Styling

To keep the interface visually appealing, custom CSS is passed to `gr.Blocks` via the `css` argument. The design includes animations for chat messages and custom button styles to make the interaction feel smoother. Only the selectors are reproduced here; the full rules are omitted.

```css
/* Custom CSS for the chat interface */
.gradio-container { ... }
.gradio-header { ... }
.gradio-chatbot { ... }
```

### 7. Launch the Interface

Finally, we call the `create_interface()` function and launch the Gradio app. Users can then start interacting with the chatbot by uploading images, entering text, and receiving responses based on the selected model and reasoning effort.

```python
if __name__ == "__main__":
    demo = create_interface()
    demo.launch()
```
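`create_interface()` itself is not shown in the snippets above; it is assumed to be a small wrapper that builds the Blocks layout from Step 1, attaches the button handlers, and returns the app object. A minimal sketch of that assumption:

```python
def create_interface():
    # Build the UI from Step 1 and wire up the handlers from Steps 4 and 5,
    # then return the Blocks app so the launcher above can start it.
    with gr.Blocks(css=custom_css) as demo:
        # component definitions (textboxes, dropdowns, buttons, chat_history) go here,
        # followed by submit_btn.click(...) and clear_btn.click(...)
        pass
    return demo
```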

## Conclusion

This multimodal chatbot handles both text and image inputs, offering a rich conversational experience. By combining **Gradio** for building intuitive UIs with **OpenAI's models** for natural language processing and image understanding, the application shows how to integrate multiple forms of input into a single, easy-to-use interface.

Feel free to try it out yourself and experiment with different settings, including the reasoning effort and model selection. Whether you're building a customer support bot or an image-based query system, this setup provides a flexible foundation for multimodal applications.

---