Upload 6 files
- LICENSE +21 -0
- README.md +78 -13
- functions.py +67 -0
- main.py +65 -0
- requirements.txt +6 -0
- tools.py +61 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Computer vision engineer
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,13 +1,78 @@
+# ask-question-image-web-app-streamlit-langchain
+
+
+<p align="center">
+    <a href="https://www.youtube.com/watch?v=71EOM5__vkI">
+        <img width="600" src="https://utils-computervisiondeveloper.s3.amazonaws.com/thumbnails/with_play_button/ask_question_image.jpg" alt="Watch the video">
+    </br>Watch on YouTube: Ask questions to an image using Python, Streamlit and Langchain!
+    </a>
+</p>
+
+This is a Streamlit application that allows users to ask questions about an uploaded image and receive responses from a conversational AI agent. The agent uses the OpenAI GPT-3.5 Turbo model to generate answers based on the provided image and user input.
+
+## installation
+
+1. Clone the repository:
+
+        git clone https://github.com/your-username/image-question-answering.git
+
+2. Change to the project directory:
+
+        cd ask-question-image-web-app-streamlit-langchain
+
+3. Install the required dependencies:
+
+        pip install -r requirements.txt
+
+4. Obtain an **OpenAI API key**. You can sign up for an API key at [OpenAI](https://platform.openai.com).
+
+5. Replace the placeholder API key in the main.py file with your actual OpenAI API key:
+
+        llm = ChatOpenAI(
+            openai_api_key='YOUR_API_KEY',
+            temperature=0,
+            model_name="gpt-3.5-turbo"
+        )
+
+6. Run the Streamlit application:
+
+        streamlit run main.py
+
+7. Open your web browser and go to http://localhost:8501 to access the application.
+
+## usage
+
+1. Upload an image by clicking the file upload button.
+
+2. The uploaded image will be displayed.
+
+3. Enter a question about the image in the text input field.
+
+4. The conversational AI agent will generate a response based on the provided question and image.
+
+5. The response will be displayed below the question input.
+
+## tools
+
+The application utilizes the following custom tools:
+
+- **ImageCaptionTool**: Generates a textual caption for the uploaded image.
+- **ObjectDetectionTool**: Performs object detection on the uploaded image and identifies the objects present.
+
+## contributing
+
+Contributions are welcome! If you have any ideas, improvements, or bug fixes, please submit a pull request.
+
+## license
+
+This project is licensed under the MIT License.
+
+## acknowledgements
+
+This project uses the OpenAI GPT-3.5 Turbo model. Visit [OpenAI](https://openai.com/) for more information.
+
+The Streamlit library is used for building the interactive user interface. Visit the [Streamlit documentation](https://docs.streamlit.io/) for more information.
+
+The LangChain library is used for managing the conversational AI agent and tools. Visit the [LangChain GitHub repository](https://github.com/hwchase17/langchain) for more information.
+
+The Transformers library is used to run inference with the vision models. Visit the [BLIP image captioning](https://huggingface.co/Salesforce/blip-image-captioning-large) and [DETR object detection](https://huggingface.co/facebook/detr-resnet-50) model pages for a more comprehensive description of the models used.
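For a sense of what those two tools return, the helpers committed in functions.py below can also be exercised on their own, outside Streamlit and the agent. A minimal sketch, assuming the dependencies are installed and a local test image exists (the image path here is illustrative, not part of the commit):

    # Illustrative only: calls the helpers defined in functions.py directly.
    from functions import get_image_caption, detect_objects

    image_path = 'test.jpg'  # hypothetical local image
    print(get_image_caption(image_path))  # one-line BLIP caption of the image
    print(detect_objects(image_path))     # one '[x1, y1, x2, y2] class_name confidence' line per detection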
functions.py
ADDED
@@ -0,0 +1,67 @@
+from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+from PIL import Image
+import torch
+
+
+def get_image_caption(image_path):
+    """
+    Generates a short caption for the provided image.
+
+    Args:
+        image_path (str): The path to the image file.
+
+    Returns:
+        str: A string representing the caption for the image.
+    """
+    image = Image.open(image_path).convert('RGB')
+
+    model_name = "Salesforce/blip-image-captioning-large"
+    device = "cpu"  # cuda
+
+    processor = BlipProcessor.from_pretrained(model_name)
+    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+    inputs = processor(image, return_tensors='pt').to(device)
+    output = model.generate(**inputs, max_new_tokens=20)
+
+    caption = processor.decode(output[0], skip_special_tokens=True)
+
+    return caption
+
+
+def detect_objects(image_path):
+    """
+    Detects objects in the provided image.
+
+    Args:
+        image_path (str): The path to the image file.
+
+    Returns:
+        str: A string with all the detected objects. Each object as '[x1, y1, x2, y2, class_name, confidence_score]'.
+    """
+    image = Image.open(image_path).convert('RGB')
+
+    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+    inputs = processor(images=image, return_tensors="pt")
+    outputs = model(**inputs)
+
+    # convert outputs (bounding boxes and class logits) to COCO API
+    # let's only keep detections with score > 0.9
+    target_sizes = torch.tensor([image.size[::-1]])
+    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+    detections = ""
+    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+        detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+        detections += ' {}'.format(model.config.id2label[int(label)])
+        detections += ' {}\n'.format(float(score))
+
+    return detections
+
+
+if __name__ == '__main__':
+    image_path = '/home/phillip/Desktop/todays_tutorial/52_langchain_ask_questions_video/code/test.jpg'
+    detections = detect_objects(image_path)
+    print(detections)
main.py
ADDED
@@ -0,0 +1,65 @@
+from tempfile import NamedTemporaryFile
+
+import streamlit as st
+from langchain.agents import initialize_agent
+from langchain.chat_models import ChatOpenAI
+from langchain.chains.conversation.memory import ConversationBufferWindowMemory
+
+from tools import ImageCaptionTool, ObjectDetectionTool
+
+
+##############################
+### initialize agent #########
+##############################
+tools = [ImageCaptionTool(), ObjectDetectionTool()]
+
+conversational_memory = ConversationBufferWindowMemory(
+    memory_key='chat_history',
+    k=5,
+    return_messages=True
+)
+
+llm = ChatOpenAI(
+    openai_api_key='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',
+    temperature=0,
+    model_name="gpt-3.5-turbo"
+)
+
+agent = initialize_agent(
+    agent="chat-conversational-react-description",
+    tools=tools,
+    llm=llm,
+    max_iterations=5,
+    verbose=True,
+    memory=conversational_memory,
+    early_stopping_method='generate'
+)
+
+# set title
+st.title('Ask a question to an image')
+
+# set header
+st.header("Please upload an image")
+
+# upload file
+file = st.file_uploader("", type=["jpeg", "jpg", "png"])
+
+if file:
+    # display image
+    st.image(file, use_column_width=True)
+
+    # text input
+    user_question = st.text_input('Ask a question about your image:')
+
+    ##############################
+    ### compute agent response ###
+    ##############################
+    with NamedTemporaryFile(dir='.') as f:
+        f.write(file.getbuffer())
+        image_path = f.name
+
+        # write agent response
+        if user_question and user_question != "":
+            with st.spinner(text="In progress..."):
+                response = agent.run('{}, this is the image path: {}'.format(user_question, image_path))
+                st.write(response)
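Note that main.py above hardcodes the OpenAI key passed to ChatOpenAI, while the README tells users to paste their own key in its place. A common alternative, not part of this commit, is to read the key from the OPENAI_API_KEY environment variable; a minimal sketch of that variant:

    # Hypothetical variant (not in this commit): read the key from the environment
    # instead of hardcoding it in main.py.
    import os
    from langchain.chat_models import ChatOpenAI

    llm = ChatOpenAI(
        openai_api_key=os.environ['OPENAI_API_KEY'],  # export OPENAI_API_KEY=... before launching
        temperature=0,
        model_name="gpt-3.5-turbo"
    )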
requirements.txt
ADDED
@@ -0,0 +1,6 @@
+langchain==0.0.171
+streamlit==1.22.0
+openai==0.27.6
+tabulate==0.9.0
+timm==0.9.2
+transformers==4.29.2
tools.py
ADDED
@@ -0,0 +1,61 @@
+from langchain.tools import BaseTool
+from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+from PIL import Image
+import torch
+
+
+class ImageCaptionTool(BaseTool):
+    name = "Image captioner"
+    description = "Use this tool when given the path to an image that you would like to be described. " \
+                  "It will return a simple caption describing the image."
+
+    def _run(self, img_path):
+        image = Image.open(img_path).convert('RGB')
+
+        model_name = "Salesforce/blip-image-captioning-large"
+        device = "cpu"  # cuda
+
+        processor = BlipProcessor.from_pretrained(model_name)
+        model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+        inputs = processor(image, return_tensors='pt').to(device)
+        output = model.generate(**inputs, max_new_tokens=20)
+
+        caption = processor.decode(output[0], skip_special_tokens=True)
+
+        return caption
+
+    def _arun(self, query: str):
+        raise NotImplementedError("This tool does not support async")
+
+
+class ObjectDetectionTool(BaseTool):
+    name = "Object detector"
+    description = "Use this tool when given the path to an image in which you would like to detect objects. " \
+                  "It will return a list of all detected objects. Each element in the list is in the format: " \
+                  "[x1, y1, x2, y2] class_name confidence_score."
+
+    def _run(self, img_path):
+        image = Image.open(img_path).convert('RGB')
+
+        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+        inputs = processor(images=image, return_tensors="pt")
+        outputs = model(**inputs)
+
+        # convert outputs (bounding boxes and class logits) to COCO API
+        # let's only keep detections with score > 0.9
+        target_sizes = torch.tensor([image.size[::-1]])
+        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+        detections = ""
+        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+            detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+            detections += ' {}'.format(model.config.id2label[int(label)])
+            detections += ' {}\n'.format(float(score))
+
+        return detections
+
+    def _arun(self, query: str):
+        raise NotImplementedError("This tool does not support async")
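Because both classes subclass LangChain's BaseTool, the agent invokes them through the standard tool interface; they can also be sanity-checked on their own by calling run(), which dispatches to _run(). A small sketch, with the image path again being illustrative:

    # Illustrative only: exercise a tool directly, without the Streamlit app or the agent.
    from tools import ImageCaptionTool, ObjectDetectionTool

    print(ImageCaptionTool().run('test.jpg'))     # BaseTool.run() forwards the path to _run()
    print(ObjectDetectionTool().run('test.jpg'))  # hypothetical local image path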