Transformers documentation
Idefics2
Idefics2
Overview
The Idefics2 model was proposed in What matters when building vision-language models? by Léo Tronchon, Hugo Laurencon, Victor Sanh. The accompanying blog post can be found here.
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, or visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.
The abstract from the paper is the following:
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
 Idefics2 architecture. Taken from the original paper.
 Idefics2 architecture. Taken from the original paper. This model was contributed by amyeroberts. The original code can be found here.
Usage tips
- Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images in a batch for input to the model.
- The processor has a do_image_splittingoption. IfTrue, each input image will be split into 4 sub-images, and concatenated with the original to form 5 images. This is useful for increasing model performance. Make sureprocessor.image_processor.do_image_splittingis set toFalseif the model was not trained with this option.
- textpassed to the processor should have the- <image>tokens where the images should be inserted. And- <end_of_utterance>at the end of each utterance if the text is a chat message.
- The processor has its own apply_chat_templatemethod to convert chat messages to text that can then be passed astextto the processor.
Example of how to use the processor on chat messages:
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]
processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)
# at inference time, one needs to pass `add_generation_prompt=True` in order to make sure the model completes the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)
print(text)
# 'User: What’s the difference between these two images?<image><image><end_of_utterance>\nAssistant:'
inputs = processor(images=images, text=text, return_tensors="pt").to(device)
generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_text, skip_special_tokens=True)[0]
print("Generated text:", generated_text)Model optimizations: Flash Attention
The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging Flash Attention, which is a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
pip install -U flash-attn --no-build-isolation
Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the flash attention repository. Make also sure to load your model in half-precision (e.g. torch.float16)
To load and run a model using Flash Attention-2, simply change the code snippet above with the following change:
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,    
+    attn_implementation="flash_attention_2",
).to(device)Shrinking down Idefics2 using quantization
As the Idefics2 model has 8 billion parameters, that would require about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using quantization. If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 3.5GB of RAM.
Quantizing a model is as simple as passing a quantization_config to the model. One can change the code snippet above with the changes below. We’ll leverage the BitsAndyBytes quantization (but refer to this page for other quantization methods):
+ from transformers import BitsAndBytesConfig
+ quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.float16
+ )
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,    
+    quantization_config=quantization_config,
).to(device)Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- A notebook on how to fine-tune Idefics2 on a custom dataset using the Trainer can be found here. It supports both full fine-tuning as well as (quantized) LoRa.
- A script regarding how to fine-tune Idefics2 using the TRL library can be found here.
- Demo notebook regarding fine-tuning Idefics2 for JSON extraction use cases can be found here. 🌎
Idefics2Config
class transformers.Idefics2Config
< source >( use_cache = True image_token_id = 32001 tie_word_embeddings = False vision_config = None perceiver_config = None text_config = None **kwargs )
Parameters
-  use_cache (bool, optional, defaults toTrue) — Whether or not the model should cache the key/value pairs of the attention mechanism.
-  image_token_id (int, optional, defaults to 32001) — The id of the “image” token.
-  tie_word_embeddings (bool, optional, defaults toFalse) — Whether or not to tie the word embeddings with the token embeddings.
-  vision_config (IdeficsVisionConfigordict, optional) — Custom vision config or dict
-  perceiver_config (IdeficsPerceiverConfigordict, optional) — Custom perceiver config or dict
-  text_config (MistralConfigordict, optional) — Custom text config or dict for the text model
This is the configuration class to store the configuration of a Idefics2Model. It is used to instantiate a Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the model of the Idefics2 HuggingFaceM4/idefics2-8b architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import Idefics2Model, Idefics2Config
>>> # Initializing configuration
>>> configuration = Idefics2Config()
>>> # Initializing a model from the configuration
>>> model = Idefics2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configIdefics2Model
class transformers.Idefics2Model
< source >( config: Idefics2Config )
Parameters
-  config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Idefics2 model consisting of a SIGLIP vision encoder and Mistral language decoder This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None pixel_values: Optional = None pixel_attention_mask: Optional = None image_hidden_states: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastdecoder_input_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]. What are position IDs?
-  past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) — Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head).Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.If past_key_valuesare used, the user can optionally input only the lastdecoder_input_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of alldecoder_input_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  pixel_values (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.41.2/en/model_doc/auto#transformers.AutoImageProcessor). See [CLIPImageProcessor.__call__()](/docs/transformers/v4.41.2/en/model_doc/videomae#transformers.VideoMAEFeatureExtractor.__call__) for details ([]LlavaProcessor`] uses CLIPImageProcessor for processing images).
-  pixel_attention_mask (torch.Tensorof shape(batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
-  image_hidden_states (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The Idefics2Model forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Inputs fed to the model can have an arbitrary number of images. To account for this, pixel_values fed to the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths) where max_num_images is the maximum number of images among the batch_size samples in the batch.
Padding images are not needed beyond padding the pixel_values at the entrance of the model. For efficiency, we only pass through the vision_model’s forward the real images by discarding the padding images i.e. pixel_values of size (image_batch_size, 3, height, width) where image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.
Idefics2ForConditionalGeneration
class transformers.Idefics2ForConditionalGeneration
< source >( config )
Parameters
-  config (Idefics2Config or Idefics2VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Idefics2 Model with a language modeling head. It is made up a SigLIP vision encoder, with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None pixel_values: Optional = None pixel_attention_mask: Optional = None image_hidden_states: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None  ) → transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastdecoder_input_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]. What are position IDs?
-  past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) — Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head).Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.If past_key_valuesare used, the user can optionally input only the lastdecoder_input_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of alldecoder_input_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  pixel_values (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.41.2/en/model_doc/auto#transformers.AutoImageProcessor). See [CLIPImageProcessor.__call__()](/docs/transformers/v4.41.2/en/model_doc/videomae#transformers.VideoMAEFeatureExtractor.__call__) for details ([]LlavaProcessor`] uses CLIPImageProcessor for processing images).
-  pixel_attention_mask (torch.Tensorof shape(batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
-  image_hidden_states (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.Args — labels ( torch.LongTensorof shape(batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]ormodel.image_token_id(wheremodelis your instance ofIdefics2ForConditionalGeneration). Tokens with indices set tomodel.image_token_idare ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].
Returns
transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Idefics2Config) and inputs.
- loss (torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensorof shape(batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) — Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)) Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (seepast_key_valuesinput) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple oftorch.FloatTensor(one for the output of the image embeddings,(batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
The Idefics2ForConditionalGeneration forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO
>>> from transformers import AutoProcessor, AutoModelForVision2Seq
>>> from transformers.image_utils import load_image
>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
>>> model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b-base", device_map="auto")
>>> BAD_WORDS_IDS = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> EOS_WORDS_IDS = [processor.tokenizer.eos_token_id]
>>> # Create inputs
>>> prompts = [
...   "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
...   "In which city is that bridge located?<image>",
... ]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(text=prompts, padding=True, return_tensors="pt").to("cuda")
>>> # Generate
>>> generated_ids = model.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=20)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_texts)
['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of New York, and more specifically the Statue of Liberty.\n\n', 'In which city is that bridge located?\n\nThe bridge is located in the city of Pittsburgh, Pennsylvania.\n\n\nThe bridge is']Idefics2ImageProcessor
class transformers.Idefics2ImageProcessor
< source >( do_convert_rgb: bool = True do_resize: bool = True size: Dict = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_pad: bool = True do_image_splitting: bool = False **kwargs )
Parameters
-  do_convert_rgb (bool, optional, defaults toTrue) — Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA. Only has an effect if the input image is in the PIL format.
-  do_resize (bool, optional, defaults toTrue) — Whether to resize the image. The longest edge of the image is resized to be <=size["longest_edge"], with the shortest edge resized to keep the input aspect ratio, with a minimum size ofsize["shortest_edge"].
-  size (Dict, optional) — Controls the size of the output image. This is a dictionary containing the keys “shortest_edge” and “longest_edge”.
-  resample (Resampling, optional, defaults toResampling.BILINEAR) — Resampling filter to use when resizing the image.
-  do_rescale (bool, optional, defaults toTrue) — Whether to rescale the image. If set toTrue, the image is rescaled to have pixel values between 0 and 1.
-  rescale_factor (float, optional, defaults to1/255) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toTrue) — Whether to normalize the image. If set toTrue, the image is normalized to have a mean ofimage_meanand a standard deviation ofimage_std.
-  image_mean (floatorList[float], optional, defaults toIDEFICS_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_meanparameter in thepreprocessmethod. Can be overridden by theimage_meanparameter in thepreprocessmethod.
-  image_std (floatorList[float], optional, defaults toIDEFICS_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_stdparameter in thepreprocessmethod. Can be overridden by theimage_stdparameter in thepreprocessmethod.
-  do_pad (bool, optional, defaults toTrue) — Whether or not to pad the images to the largest height and width in the batch and number of images per sample in the batch, such that the returned tensor is of shape (batch_size, max_num_images, num_channels, max_height, max_width).
-  do_image_splitting (bool, optional, defaults toFalse) — Whether to split the image into a sequence 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.
Constructs a Idefics image processor.
preprocess
< source >( images: Union do_convert_rgb: Optional = None do_resize: Optional = None size: Optional = None resample: Resampling = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None do_pad: Optional = None do_image_splitting: Optional = None return_tensors: Union = None input_data_format: Optional = None data_format: Optional = <ChannelDimension.FIRST: 'channels_first'> )
Parameters
-  images (ImageInput) — A list of images to preprocess.
-  do_convert_rgb (bool, optional, defaults toself.do_convert_rgb) — Whether to convert the image to RGB.
-  do_resize (bool, optional, defaults toself.do_resize) — Whether to resize the image.
-  size (Dict[str, int], optional, defaults toself.size) — Size of the image after resizing. Shortest edge of the image is resized to size[“shortest_edge”], with the longest edge resized to keep the input aspect ratio.
-  resample (int, optional, defaults toself.resample) — Resampling filter to use if resizing the image. This can be one of the enumPILImageResampling. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toself.do_rescale) — Whether to rescale the image.
-  rescale_factor (float, optional, defaults toself.rescale_factor) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toself.do_normalize) — Whether to normalize the image.
-  image_mean (floatorList[float], optional, defaults toself.image_mean) — Image mean to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  image_std (floatorList[float], optional, defaults toself.image_std) — Image standard deviation to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  do_pad (bool, optional, defaults toself.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
-  do_image_splitting (bool, optional, defaults toself.do_image_splitting) — Whether to split the image into a sequence 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://arxiv.org/abs/2311.06607.
-  return_tensors (strorTensorType, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOWor- 'tf': Return a batch of type- tf.Tensor.
- TensorType.PYTORCHor- 'pt': Return a batch of type- torch.Tensor.
- TensorType.NUMPYor- 'np': Return a batch of type- np.ndarray.
- TensorType.JAXor- 'jax': Return a batch of type- jax.numpy.ndarray.
 
- Unset: Return a list of 
-  data_format (ChannelDimensionorstr, optional, defaults toChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
Preprocess a batch of images.
Idefics2Processor
class transformers.Idefics2Processor
< source >( image_processor tokenizer = None image_seq_len: int = 64 **kwargs )
Parameters
-  image_processor (Idefics2ImageProcessor) — An instance of Idefics2ImageProcessor. The image processor is a required input.
-  tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model’s text model. The tokenizer is a required input.
-  image_seq_len (int, optional, defaults to 64) — The length of the image sequence i.e. the number oftokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the config.perceiver_config.resampler_n_latents value for the model used. 
Constructs a IDEFICS2 processor which wraps a LLama tokenizer and IDEFICS2 image processor into a single processor.
IdeficsProcessor offers all the functionalities of Idefics2ImageProcessor and LlamaTokenizerFast. See
the docstring of call() and decode() for more information.
__call__
< source >( text: Union = None images: Union = None image_seq_len: Optional = None padding: Union = False truncation: Union = None max_length: Optional = None is_split_into_words: bool = False add_special_tokens: bool = True return_tensors: Union = None )
Parameters
-  text (Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True(to lift the ambiguity with a batch of sequences).Wherever an image token, <image>is encountered it is expanded to<fake_token_around_image>+<image>image_seq_len`. 
-  images (PIL.Image.Image,np.ndarray,torch.Tensor,List[PIL.Image.Image],List[np.ndarray],List[torch.Tensor], optional) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. If is of typeList[ImageInput], it’s assumed that this is for a single prompt i.e. of batch size 1.
-  image_seq_len (int, optional) — The length of the image sequence. If not provided, the default value is used.
-  padding (Union[bool, str, PaddingStrategy], optional, defaults toFalse) — Padding strategy applied to the input ids. See PreTrainedTokenizerFast.pad() for more information.
-  truncation (Union[bool, str, TruncationStrategy], optional) — Truncation strategy applied to the input ids. SeePreTrainedTokenizerFast.truncatefor more information.
-  max_length (int, optional) — Maximum length of the returned list and optionally padding/truncation length. See PreTrainedTokenizerFast.call() for more information.
-  is_split_into_words (bool, optional, defaults toFalse) — Whether the input text is split into words or not. If set toTrue, the tokenizer will skip the tokenization process and assume the input is already tokenized.
-  add_special_tokens (bool, optional, defaults toTrue) — Whether to add special tokens or not. See PreTrainedTokenizerFast.call() for more information.
-  return_tensors (Union[str, TensorType], optional) — If set, will return tensors of a particular framework. See PreTrainedTokenizerFast.call() for more information.
Processes the input prompts and returns a BatchEncoding.
Example:
>>> import requests
>>> from transformers import Idefics2Processor
>>> from transformers.image_utils import load_image
>>> processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", image_seq_len=2)
>>> processor.image_processor.do_image_splitting = False  # Force as False to simplify the example
>>> url1 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
>>> url2 = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
>>> image1, image2 = load_image(url1), load_image(url2)
>>> images = [[image1], [image2]]
>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(text=text, images=images, return_tensors="pt", padding=True)
>>> input_ids = outputs.input_ids
>>> input_tokens = processor.tokenizer.batch_decode(input_ids)
>>> print(input_tokens)
['<s><fake_token_around_image><image><image><fake_token_around_image> In this image, we see', '<s> bla bla bla<fake_token_around_image><image><image><fake_token_around_image>']