# Vision Agents with smolagents


This notebook is part of the [Hugging Face Agents Course](https://www.hf.co/learn/agents-course), a free course that takes you from beginner to expert in building agents.

![Agents course share](https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/communication/share.png)

## Let's install the dependencies and log in to our HF account to access the Inference API

If you haven't installed `smolagents` yet, you can do so by running the following command:

In [None]:
!pip install smolagents

Let's also log in to the Hugging Face Hub to get access to the Inference API.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Providing Images at the Start of the Agent's Execution

In this approach, images are passed to the agent at the start and stored as `task_images` alongside the task prompt. The agent then processes these images throughout its execution.  

Consider the case where Alfred wants to verify the identities of the superheroes attending the party. He already has a dataset of images from previous parties with the names of the guests. Given a new visitor's image, the agent can compare it with the existing dataset and make a decision about letting them in.  

In this case, a guest is trying to enter, and Alfred suspects that this visitor might be The Joker impersonating Wonder Woman. Alfred needs to verify their identity to prevent anyone unwanted from entering.  

Let’s build the example. First, we load the images. Here we use images from Wikipedia to keep the example minimal, but imagine the possible use cases!

In [None]:
from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg",
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg"
]

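# Wikimedia may reject requests that lack a descriptive User-Agent; this value is a placeholder
headers = {"User-Agent": "agents-course-notebook/0.1 (educational example)"}

images = []
for url in image_urls:
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail early if a download did not succeed
    # Convert to RGB so every image has a consistent mode
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)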

Now that we have the images, the agent will tell us whether the guest is actually a superhero (Wonder Woman) or a villain (The Joker).

In [None]:
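import os

from google.colab import userdata  # Only available when running in Google Colab

# Read the OpenAI API key from Colab's secrets and expose it as an environment variable
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")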

In [None]:
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(model_id="gpt-4o")

# Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

In [None]:
response

{'description': '\n1. Costume:\n   - A purple suit with a yellow shirt and a large purple bow tie.\n   - Features a white flower lapel and a playing card in the second image.\n   - The style is flamboyant, consistent with a comic villain.\n\n2. Makeup:\n   - White face makeup covering the entire face.\n   - Red lips forming a wide, exaggerated smile.\n   - Blue eyeshadow with dark eye accents.\n   - Slicked-back green hair.\n',
 'character': 'The Joker'}

In this case, the output identifies the guest as The Joker rather than Wonder Woman, so the visitor is impersonating someone else and we can prevent The Joker from entering the party!
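The admission decision described above could be automated with a small check on the agent's answer. The following is a minimal sketch, not part of the course code: the `should_admit` helper and its `expected_guest` parameter are hypothetical names, and it assumes the answer is a dictionary with a `character` key like the output shown above. An agent's final answer is not guaranteed to have that shape, so the helper falls back to treating the answer as plain text and denies entry when nothing matches.

```python
def should_admit(result, expected_guest="Wonder Woman"):
    """Return True only if the agent's answer names the expected guest.

    `result` is assumed to look like the dictionary shown above
    ({"description": ..., "character": ...}); this shape is an assumption,
    so any other answer is compared as plain text instead.
    """
    if isinstance(result, dict):
        character = str(result.get("character", ""))
    else:
        character = str(result)
    # Compare case-insensitively and ignore surrounding whitespace
    return character.strip().lower() == expected_guest.strip().lower()


# Example using the same structure as the agent's answer above
example_response = {
    "description": "Purple suit, white face makeup, green hair.",
    "character": "The Joker",
}
print(should_admit(example_response))  # prints False
```

In the notebook, you would call `should_admit(response)` on the value returned by `agent.run`. Denying entry by default when the answer cannot be matched keeps the check on the safe side.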

## Providing Images with Dynamic Retrieval

This example is provided as a `.py` file because it browses the web and therefore needs to be run locally. See the [Hugging Face Agents Course](https://www.hf.co/learn/agents-course) for more details.