This demo uses a CLIP-mBART50 model checkpoint to predict a caption for a given image in four languages (English, French, German, and Spanish). The model, an image encoder paired with a text decoder, was trained on approximately 5 million image-text pairs taken from the Conceptual 12M dataset, with the captions translated using mBART.
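
As a rough illustration of how the four supported languages could be addressed on the decoder side, the sketch below maps each language name to its mBART-50 language code. The codes themselves are the ones defined by the mBART-50 tokenizer; the `MBART50_LANG_CODES` dictionary and the `lang_code` helper are hypothetical names introduced here, and the demo's actual code may handle language selection differently.

```python
# Language codes as defined in the mBART-50 tokenizer vocabulary.
# The dictionary and helper below are illustrative, not taken from the demo.
MBART50_LANG_CODES = {
    "English": "en_XX",
    "French": "fr_XX",
    "German": "de_DE",
    "Spanish": "es_XX",
}


def lang_code(language: str) -> str:
    """Return the mBART-50 code for one of the demo's four languages."""
    key = language.strip().title()  # tolerate case and surrounding spaces
    try:
        return MBART50_LANG_CODES[key]
    except KeyError:
        supported = ", ".join(MBART50_LANG_CODES)
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        ) from None
```

A caller would typically pass the resulting code to the decoder as the target-language token, so a single checkpoint can produce captions in any of the four languages.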

The model predicts one of 3,129 classes in English, which can be found here, and the translated versions are then provided according to the language chosen as Answer Language. The question can be written in any of the following languages: English, French, German, or Spanish.

For more details, click on Usage or Article 🤗 below.