Does the Llama-3.2 Vision model support multiple images?

#43
by JOJOHuang - opened

Does this model support multiple images? If so, would it be used like this?

import requests
from PIL import Image

# processor / model assumed already loaded (AutoProcessor / MllamaForConditionalGeneration)
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "please describe these two images"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)

Meta Llama org

Thanks for the Q! We recommend using one image for inference; the model doesn't work reliably with multiple images.

Ok~ Thanks for your reply!

Sanyam changed discussion status to closed

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

I get a CUDA out-of-memory error when I use multiple images.

I have the same question: can this model run inference on video files, for example by using cv2 to extract a set of frames?

I have the same question. I am trying to run inference on video files by extracting frames and transcripts so the model can reason over the video as a whole. However, that requires accumulating understanding across frames rather than inferring on a single frame, and Llama 3.2 Vision does not seem able to do this.
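One workaround, since Meta only recommends single-image inference, is to sample frames with cv2 and describe them one at a time, then summarize the per-frame descriptions in a final text-only pass. A rough sketch (assuming processor and model are loaded as in the snippets above; the sampling interval and prompt are placeholders):

import cv2
from PIL import Image

def sample_frames(video_path, every_n=30):
    """Yield every every_n-th frame of a video as a PIL image."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # cv2 returns BGR arrays; convert to RGB for PIL / the processor
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

descriptions = []
for frame in sample_frames("clip.mp4"):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this frame briefly."}
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(frame, prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    descriptions.append(processor.decode(out[0], skip_special_tokens=True))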

Same here. I would also like multi-image support within a single conversation. Is there an ETA for this, or will it be supported in the future?

And what about images across history?

messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "please describe the image"}
        ]
    },
    {
        "role": "assistant", "content": "It shows a cat fighting with a dog"
    },
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you explain more? Here's another perspective"}
        ]
    },
]
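As far as I can tell, on the transformers side a history like this is fed exactly like the single-turn multi-image case: apply the chat template to the whole message list and pass the PIL images in the order they appear (here image1 and image2 stand for the two images). Whether the model answers well is a separate question, given Meta's reply above:

# image1 / image2 are the PIL images for the two {"type": "image"} entries, in order
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))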

Did we get an answer to this?
I have a set of images and a set of context passages from my retriever engine; I now need to pass these to my generation model (any vision model) to get the final response.
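In case it helps: given Meta's recommendation above, one pragmatic pattern is to put the retrieved text chunks into the prompt and send only one retrieved image per request, then merge the answers afterwards. A rough sketch, where retrieved_chunks and retrieved_images are placeholder names for whatever the retriever returns, and processor / model are loaded as in the earlier snippets:

# retrieved_chunks: list[str], retrieved_images: list[PIL.Image.Image]
context = "\n\n".join(retrieved_chunks)
question = "Answer the user's question using the context and the image."

answers = []
for img in retrieved_images:
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": f"Context:\n{context}\n\n{question}"}
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(img, prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    answers.append(processor.decode(out[0], skip_special_tokens=True))
# the per-image answers can then be merged with a final text-only call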

"Our training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data." - from LLAMA 3.2 blog by Meta

I don't think Llama 3.2 models can handle multiple images, as they were not trained that way. I am planning to use MIVC (https://assets.amazon.science/5b/f4/131b6a25445fae6d1fec2befbb84/mivc-multiple-instance-visual-component-for-visual-language-models.pdf) with Llama 3.2 models to aggregate the embeddings of multiple images into one embedding. If anyone is interested, you can join me.
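For anyone curious, the aggregation idea (independent of MIVC's specific formulation) boils down to pooling several per-image embeddings into a single vector before it reaches the language model. A toy attention-pooling sketch with made-up dimensions, just to illustrate the shape of the problem, not MIVC itself:

import torch
import torch.nn as nn

class ImageEmbeddingPooler(nn.Module):
    """Toy attention pooling: N per-image embeddings -> one fused embedding."""
    def __init__(self, dim=4096):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, image_embeds):                               # (num_images, dim)
        weights = torch.softmax(self.score(image_embeds), dim=0)   # (num_images, 1)
        return (weights * image_embeds).sum(dim=0)                 # (dim,)

pooler = ImageEmbeddingPooler(dim=4096)
fused = pooler(torch.randn(3, 4096))   # three dummy image embeddings -> one vector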

There are a few other VLMs that allow multiple images at inference time: NVLM, LLaVA, GPT-4o.

vLLM appears to have added support for multiple images with Llama 3.2 here: https://github.com/vllm-project/vllm/pull/9095 (v0.6.3.post1 and later)
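If you go the vLLM route, multi-image prompting goes through vLLM's generic multi-modal API; the exact <|image|> prompt layout and the per-prompt image limit below are my assumptions, so check the vLLM docs for your version:

from vllm import LLM, SamplingParams

# image1 / image2: PIL images loaded as in the earlier snippets
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 2},  # allow up to two images per prompt
)

# Prompt layout is an assumption; prefer building it from the model's chat template.
prompt = "<|image|><|image|><|begin_of_text|>Please describe these two images."
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": [image1, image2]}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)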

The official answer from Meta, however, is that this model doesn't work well with more than one image: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/discussions/43#66f98f742094ed9e5f5107d4

Based on those sources and some experimentation, it appears that the answer is: yes, it can support multiple images, but the response quality will suffer, and you should really only use one image with this model.
