Upload 6 files
- LICENSE +21 -0
- README.md +78 -13
- functions.py +67 -0
- main.py +65 -0
- requirements.txt +6 -0
- tools.py +61 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Computer vision engineer
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,13 +1,78 @@
+# ask-question-image-web-app-streamlit-langchain
+
+
+<p align="center">
+    <a href="https://www.youtube.com/watch?v=71EOM5__vkI">
+        <img width="600" src="https://utils-computervisiondeveloper.s3.amazonaws.com/thumbnails/with_play_button/ask_question_image.jpg" alt="Watch the video">
+    </br>Watch on YouTube: Ask questions to an image using Python, Streamlit and Langchain!
+    </a>
+</p>
+
+This is a Streamlit application that allows users to ask questions about an uploaded image and receive responses from a conversational AI agent. The agent uses the OpenAI GPT-3.5 Turbo model to generate answers based on the provided image and user input.
+
+## installation
+
+1. Clone the repository:
+
+        git clone https://github.com/your-username/image-question-answering.git
+
+2. Change to the project directory:
+
+        cd ask-question-image-web-app-streamlit-langchain
+
+3. Install the required dependencies:
+
+        pip install -r requirements.txt
+
+4. Obtain an **OpenAI API key**. You can sign up for an API key at [OpenAI](https://platform.openai.com).
+
+5. Replace the placeholder API key in the main.py file with your actual OpenAI API key:
+
+        llm = ChatOpenAI(
+            openai_api_key='YOUR_API_KEY',
+            temperature=0,
+            model_name="gpt-3.5-turbo"
+        )
+
+6. Run the Streamlit application:
+
+        streamlit run main.py
+
+7. Open your web browser and go to http://localhost:8501 to access the application.
+
+## usage
+
+1. Upload an image by clicking the file upload button.
+
+2. The uploaded image will be displayed.
+
+3. Enter a question about the image in the text input field.
+
+4. The conversational AI agent will generate a response based on the provided question and image.
+
+5. The response will be displayed below the question input.
+
+## tools
+
+The application utilizes the following custom tools:
+
+- **ImageCaptionTool**: Generates a textual caption for the uploaded image.
+- **ObjectDetectionTool**: Performs object detection on the uploaded image and identifies the objects present.
+
+## contributing
+
+Contributions are welcome! If you have any ideas, improvements, or bug fixes, please submit a pull request.
+
+## license
+
+This project is licensed under the MIT License.
+
+## acknowledgements
+
+This project uses the OpenAI GPT-3.5 Turbo model. Visit [OpenAI](https://openai.com/) for more information.
+
+The Streamlit library is used for building the interactive user interface. Visit the [Streamlit documentation](https://docs.streamlit.io/) for more information.
+
+The LangChain library is used for managing the conversational AI agent and tools. Visit the [LangChain GitHub repository](https://github.com/hwchase17/langchain) for more information.
+
+The Transformers library is used to run inference with the vision models. Visit the [BLIP image captioning](https://huggingface.co/Salesforce/blip-image-captioning-large) and [DETR object detection](https://huggingface.co/facebook/detr-resnet-50) model pages for a more comprehensive description of the models used.
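For a sense of what those two tools return, the helpers committed in functions.py below can also be exercised on their own, outside Streamlit and the agent. A minimal sketch, assuming the dependencies are installed and a local test image exists (the image path here is illustrative, not part of the commit):

    # Illustrative only: calls the helpers defined in functions.py directly.
    from functions import get_image_caption, detect_objects

    image_path = 'test.jpg'  # hypothetical local image
    print(get_image_caption(image_path))  # one-line BLIP caption of the image
    print(detect_objects(image_path))     # one '[x1, y1, x2, y2] class_name confidence' line per detection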
functions.py
ADDED
@@ -0,0 +1,67 @@
+from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+from PIL import Image
+import torch
+
+
+def get_image_caption(image_path):
+    """
+    Generates a short caption for the provided image.
+
+    Args:
+        image_path (str): The path to the image file.
+
+    Returns:
+        str: A string representing the caption for the image.
+    """
+    image = Image.open(image_path).convert('RGB')
+
+    model_name = "Salesforce/blip-image-captioning-large"
+    device = "cpu"  # cuda
+
+    processor = BlipProcessor.from_pretrained(model_name)
+    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+    inputs = processor(image, return_tensors='pt').to(device)
+    output = model.generate(**inputs, max_new_tokens=20)
+
+    caption = processor.decode(output[0], skip_special_tokens=True)
+
+    return caption
+
+
+def detect_objects(image_path):
+    """
+    Detects objects in the provided image.
+
+    Args:
+        image_path (str): The path to the image file.
+
+    Returns:
+        str: A string with all the detected objects. Each object as '[x1, y1, x2, y2, class_name, confidence_score]'.
+    """
+    image = Image.open(image_path).convert('RGB')
+
+    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+    inputs = processor(images=image, return_tensors="pt")
+    outputs = model(**inputs)
+
+    # convert outputs (bounding boxes and class logits) to COCO API
+    # let's only keep detections with score > 0.9
+    target_sizes = torch.tensor([image.size[::-1]])
+    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+    detections = ""
+    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+        detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+        detections += ' {}'.format(model.config.id2label[int(label)])
+        detections += ' {}\n'.format(float(score))
+
+    return detections
+
+
+if __name__ == '__main__':
+    image_path = '/home/phillip/Desktop/todays_tutorial/52_langchain_ask_questions_video/code/test.jpg'
+    detections = detect_objects(image_path)
+    print(detections)
main.py
ADDED
@@ -0,0 +1,65 @@
+from tempfile import NamedTemporaryFile
+
+import streamlit as st
+from langchain.agents import initialize_agent
+from langchain.chat_models import ChatOpenAI
+from langchain.chains.conversation.memory import ConversationBufferWindowMemory
+
+from tools import ImageCaptionTool, ObjectDetectionTool
+
+
+##############################
+### initialize agent #########
+##############################
+tools = [ImageCaptionTool(), ObjectDetectionTool()]
+
+conversational_memory = ConversationBufferWindowMemory(
+    memory_key='chat_history',
+    k=5,
+    return_messages=True
+)
+
+llm = ChatOpenAI(
+    openai_api_key='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',
+    temperature=0,
+    model_name="gpt-3.5-turbo"
+)
+
+agent = initialize_agent(
+    agent="chat-conversational-react-description",
+    tools=tools,
+    llm=llm,
+    max_iterations=5,
+    verbose=True,
+    memory=conversational_memory,
+    early_stopping_method='generate'
+)
+
+# set title
+st.title('Ask a question to an image')
+
+# set header
+st.header("Please upload an image")
+
+# upload file
+file = st.file_uploader("", type=["jpeg", "jpg", "png"])
+
+if file:
+    # display image
+    st.image(file, use_column_width=True)
+
+    # text input
+    user_question = st.text_input('Ask a question about your image:')
+
+    ##############################
+    ### compute agent response ###
+    ##############################
+    with NamedTemporaryFile(dir='.') as f:
+        f.write(file.getbuffer())
+        image_path = f.name
+
+        # write agent response
+        if user_question and user_question != "":
+            with st.spinner(text="In progress..."):
+                response = agent.run('{}, this is the image path: {}'.format(user_question, image_path))
+                st.write(response)
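Note that main.py above hardcodes the OpenAI key passed to ChatOpenAI, while the README tells users to paste their own key in its place. A common alternative, not part of this commit, is to read the key from the OPENAI_API_KEY environment variable; a minimal sketch of that variant:

    # Hypothetical variant (not in this commit): read the key from the environment
    # instead of hardcoding it in main.py.
    import os
    from langchain.chat_models import ChatOpenAI

    llm = ChatOpenAI(
        openai_api_key=os.environ['OPENAI_API_KEY'],  # export OPENAI_API_KEY=... before launching
        temperature=0,
        model_name="gpt-3.5-turbo"
    )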
requirements.txt
ADDED
@@ -0,0 +1,6 @@
+langchain==0.0.171
+streamlit==1.22.0
+openai==0.27.6
+tabulate==0.9.0
+timm==0.9.2
+transformers==4.29.2
tools.py
ADDED
@@ -0,0 +1,61 @@
+from langchain.tools import BaseTool
+from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
+from PIL import Image
+import torch
+
+
+class ImageCaptionTool(BaseTool):
+    name = "Image captioner"
+    description = "Use this tool when given the path to an image that you would like to be described. " \
+                  "It will return a simple caption describing the image."
+
+    def _run(self, img_path):
+        image = Image.open(img_path).convert('RGB')
+
+        model_name = "Salesforce/blip-image-captioning-large"
+        device = "cpu"  # cuda
+
+        processor = BlipProcessor.from_pretrained(model_name)
+        model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
+
+        inputs = processor(image, return_tensors='pt').to(device)
+        output = model.generate(**inputs, max_new_tokens=20)
+
+        caption = processor.decode(output[0], skip_special_tokens=True)
+
+        return caption
+
+    def _arun(self, query: str):
+        raise NotImplementedError("This tool does not support async")
+
+
+class ObjectDetectionTool(BaseTool):
+    name = "Object detector"
+    description = "Use this tool when given the path to an image in which you would like to detect objects. " \
+                  "It will return a list of all detected objects. Each element in the list is in the format: " \
+                  "[x1, y1, x2, y2] class_name confidence_score."
+
+    def _run(self, img_path):
+        image = Image.open(img_path).convert('RGB')
+
+        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+        inputs = processor(images=image, return_tensors="pt")
+        outputs = model(**inputs)
+
+        # convert outputs (bounding boxes and class logits) to COCO API
+        # let's only keep detections with score > 0.9
+        target_sizes = torch.tensor([image.size[::-1]])
+        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
+
+        detections = ""
+        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+            detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
+            detections += ' {}'.format(model.config.id2label[int(label)])
+            detections += ' {}\n'.format(float(score))
+
+        return detections
+
+    def _arun(self, query: str):
+        raise NotImplementedError("This tool does not support async")
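Because both classes subclass LangChain's BaseTool, the agent invokes them through the standard tool interface; they can also be sanity-checked on their own by calling run(), which dispatches to _run(). A small sketch, with the image path again being illustrative:

    # Illustrative only: exercise a tool directly, without the Streamlit app or the agent.
    from tools import ImageCaptionTool, ObjectDetectionTool

    print(ImageCaptionTool().run('test.jpg'))     # BaseTool.run() forwards the path to _run()
    print(ObjectDetectionTool().run('test.jpg'))  # hypothetical local image path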