7jimmy committed
Commit 1c0296c · 1 Parent(s): 96b1212

Upload 6 files

Files changed (6)
  1. LICENSE +21 -0
  2. README.md +78 -13
  3. functions.py +67 -0
  4. main.py +65 -0
  5. requirements.txt +6 -0
  6. tools.py +61 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Computer vision engineer
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,13 +1,78 @@
- ---
- title: Ask To Image
- emoji: 🌖
- colorFrom: yellow
- colorTo: yellow
- sdk: streamlit
- sdk_version: 1.29.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ask-question-image-web-app-streamlit-langchain
+
+ <p align="center">
+     <a href="https://www.youtube.com/watch?v=71EOM5__vkI">
+         <img width="600" src="https://utils-computervisiondeveloper.s3.amazonaws.com/thumbnails/with_play_button/ask_question_image.jpg" alt="Watch the video">
+         <br>Watch on YouTube: Ask questions to an image using Python, Streamlit and Langchain!
+     </a>
+ </p>
+
+ This is a Streamlit application that allows users to ask questions about an uploaded image and receive responses from a conversational AI agent. The agent uses the OpenAI GPT-3.5 Turbo model to generate answers based on the provided image and user input.
+
+ ## installation
+
+ 1. Clone the repository:
+
+     git clone https://github.com/your-username/ask-question-image-web-app-streamlit-langchain.git
+
+ 2. Change to the project directory:
+
+     cd ask-question-image-web-app-streamlit-langchain
+
+ 3. Install the required dependencies:
+
+     pip install -r requirements.txt
+
+ 4. Obtain an **OpenAI API key**. You can sign up for an API key at [OpenAI](https://platform.openai.com).
+
+ 5. Replace the placeholder API key in the main.py file with your actual OpenAI API key (an environment-variable alternative is sketched after this list):
+
+     llm = ChatOpenAI(
+         openai_api_key='YOUR_API_KEY',
+         temperature=0,
+         model_name="gpt-3.5-turbo"
+     )
+
+ 6. Run the Streamlit application:
+
+     streamlit run main.py
+
+ 7. Open your web browser and go to http://localhost:8501 to access the application.
+
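+ As an alternative to hardcoding the key in step 5 (a minimal sketch, not part of the original setup), you can read it from the `OPENAI_API_KEY` environment variable:
+
+     import os
+
+     llm = ChatOpenAI(
+         openai_api_key=os.getenv("OPENAI_API_KEY"),
+         temperature=0,
+         model_name="gpt-3.5-turbo"
+     )
+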
+ ## usage
+
+ 1. Upload an image by clicking the file upload button.
+
+ 2. The uploaded image will be displayed.
+
+ 3. Enter a question about the image in the text input field.
+
+ 4. The conversational AI agent will generate a response based on the provided question and image.
+
+ 5. The response will be displayed below the question input.
+
+ ## tools
+
+ The application uses the following custom tools (a quick standalone check is sketched after the list):
+
+ - **ImageCaptionTool**: Generates a textual caption for the uploaded image.
+ - **ObjectDetectionTool**: Performs object detection on the uploaded image and identifies the objects present.
+
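+ A minimal standalone check of both tools (a sketch, assuming a local image named test.jpg; the BLIP and DETR weights are downloaded on first use):
+
+     from tools import ImageCaptionTool, ObjectDetectionTool
+
+     print(ImageCaptionTool().run("test.jpg"))      # short caption of the image
+     print(ObjectDetectionTool().run("test.jpg"))   # one "[x1, y1, x2, y2] class_name confidence" line per object
+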
+ ## contributing
+
+ Contributions are welcome! If you have any ideas, improvements, or bug fixes, please submit a pull request.
+
+ ## license
+
+ This project is licensed under the MIT License.
+
+ ## acknowledgements
+
+ This project uses the OpenAI GPT-3.5 Turbo model. Visit [OpenAI](https://openai.com/) for more information.
+
+ The Streamlit library is used to build the interactive user interface. Visit the [Streamlit documentation](https://docs.streamlit.io/) for more information.
+
+ The LangChain library is used to manage the conversational AI agent and its tools. Visit the [LangChain GitHub repository](https://github.com/hwchase17/langchain) for more information.
+
+ The Transformers library is used to run inference with the vision models. See the [BLIP image captioning](https://huggingface.co/Salesforce/blip-image-captioning-large) and [DETR object detection](https://huggingface.co/facebook/detr-resnet-50) model cards for a more detailed description of the models used.
functions.py ADDED
@@ -0,0 +1,67 @@
+ from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+ from PIL import Image
+ import torch
+
+
+ def get_image_caption(image_path):
+     """
+     Generates a short caption for the provided image.
+
+     Args:
+         image_path (str): The path to the image file.
+
+     Returns:
+         str: A string representing the caption for the image.
+     """
+     image = Image.open(image_path).convert('RGB')
+
+     model_name = "Salesforce/blip-image-captioning-large"
+     device = "cpu"  # set to "cuda" if a GPU is available
+
+     processor = BlipProcessor.from_pretrained(model_name)
+     model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+     inputs = processor(image, return_tensors='pt').to(device)
+     output = model.generate(**inputs, max_new_tokens=20)
+
+     caption = processor.decode(output[0], skip_special_tokens=True)
+
+     return caption
+
+
+ def detect_objects(image_path):
+     """
+     Detects objects in the provided image.
+
+     Args:
+         image_path (str): The path to the image file.
+
+     Returns:
+         str: A string listing the detected objects, one per line, in the format
+             '[x1, y1, x2, y2] class_name confidence_score'.
+     """
+     image = Image.open(image_path).convert('RGB')
+
+     processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+     model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+     inputs = processor(images=image, return_tensors="pt")
+     outputs = model(**inputs)
+
+     # convert outputs (bounding boxes and class logits) to COCO API format
+     # and keep only detections with score > 0.9
+     target_sizes = torch.tensor([image.size[::-1]])
+     results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+     detections = ""
+     for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+         detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+         detections += ' {}'.format(model.config.id2label[int(label)])
+         detections += ' {}\n'.format(float(score))
+
+     return detections
+
+
+ if __name__ == '__main__':
+     image_path = '/home/phillip/Desktop/todays_tutorial/52_langchain_ask_questions_video/code/test.jpg'
+     detections = detect_objects(image_path)
+     print(detections)
main.py ADDED
@@ -0,0 +1,65 @@
+ from tempfile import NamedTemporaryFile
+
+ import streamlit as st
+ from langchain.agents import initialize_agent
+ from langchain.chat_models import ChatOpenAI
+ from langchain.chains.conversation.memory import ConversationBufferWindowMemory
+
+ from tools import ImageCaptionTool, ObjectDetectionTool
+
+
+ ##############################
+ ### initialize agent #########
+ ##############################
+ tools = [ImageCaptionTool(), ObjectDetectionTool()]
+
+ conversational_memory = ConversationBufferWindowMemory(
+     memory_key='chat_history',
+     k=5,
+     return_messages=True
+ )
+
+ llm = ChatOpenAI(
+     openai_api_key='YOUR_API_KEY',  # replace with your own OpenAI API key; never commit a real key
+     temperature=0,
+     model_name="gpt-3.5-turbo"
+ )
+
+ agent = initialize_agent(
+     agent="chat-conversational-react-description",
+     tools=tools,
+     llm=llm,
+     max_iterations=5,
+     verbose=True,
+     memory=conversational_memory,
+     early_stopping_method='generate'
+ )
+
+ # set title
+ st.title('Ask a question to an image')
+
+ # set header
+ st.header("Please upload an image")
+
+ # upload file
+ file = st.file_uploader("", type=["jpeg", "jpg", "png"])
+
+ if file:
+     # display image
+     st.image(file, use_column_width=True)
+
+     # text input
+     user_question = st.text_input('Ask a question about your image:')
+
+     ##############################
+     ### compute agent response ###
+     ##############################
+     with NamedTemporaryFile(dir='.') as f:
+         f.write(file.getbuffer())
+         image_path = f.name
+
+         # write agent response
+         if user_question and user_question != "":
+             with st.spinner(text="In progress..."):
+                 response = agent.run('{}, this is the image path: {}'.format(user_question, image_path))
+                 st.write(response)
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ langchain==0.0.171
+ streamlit==1.22.0
+ openai==0.27.6
+ tabulate==0.9.0
+ timm==0.9.2
+ transformers==4.29.2
tools.py ADDED
@@ -0,0 +1,61 @@
+ from langchain.tools import BaseTool
+ from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+ from PIL import Image
+ import torch
+
+
+ class ImageCaptionTool(BaseTool):
+     name = "Image captioner"
+     description = "Use this tool when given the path to an image that you would like to be described. " \
+                   "It will return a simple caption describing the image."
+
+     def _run(self, img_path):
+         image = Image.open(img_path).convert('RGB')
+
+         model_name = "Salesforce/blip-image-captioning-large"
+         device = "cpu"  # set to "cuda" if a GPU is available
+
+         processor = BlipProcessor.from_pretrained(model_name)
+         model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+         inputs = processor(image, return_tensors='pt').to(device)
+         output = model.generate(**inputs, max_new_tokens=20)
+
+         caption = processor.decode(output[0], skip_special_tokens=True)
+
+         return caption
+
+     def _arun(self, query: str):
+         raise NotImplementedError("This tool does not support async")
+
+
+ class ObjectDetectionTool(BaseTool):
+     name = "Object detector"
+     description = "Use this tool when given the path to an image in which you would like to detect objects. " \
+                   "It will return a list of all detected objects. Each element in the list is in the format: " \
+                   "[x1, y1, x2, y2] class_name confidence_score."
+
+     def _run(self, img_path):
+         image = Image.open(img_path).convert('RGB')
+
+         processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+         model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+         inputs = processor(images=image, return_tensors="pt")
+         outputs = model(**inputs)
+
+         # convert outputs (bounding boxes and class logits) to COCO API format
+         # and keep only detections with score > 0.9
+         target_sizes = torch.tensor([image.size[::-1]])
+         results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+         detections = ""
+         for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+             detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+             detections += ' {}'.format(model.config.id2label[int(label)])
+             detections += ' {}\n'.format(float(score))
+
+         return detections
+
+     def _arun(self, query: str):
+         raise NotImplementedError("This tool does not support async")