
This demo uses a CLIP-mBART50 model checkpoint to predict a caption for a given image in four languages (English, French, German, and Spanish). The model combines an image encoder with a text decoder and was trained on approximately 5 million image-text pairs taken from the Conceptual 12M dataset, translated using mBART.
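As a rough illustration of how one of the four caption languages might be selected, the sketch below maps each language name to the language code used by mBART-50 tokenizers (`en_XX`, `fr_XX`, `de_DE`, `es_XX`). The helper name `caption_language_code` and the lookup table are hypothetical and are not taken from the demo's code. How the demo actually conditions the decoder on a language is an assumption here.

```python
# Hypothetical lookup table (not from the demo's source): the four demo
# languages mapped to their mBART-50 language codes.
MBART50_LANG_CODES = {
    "english": "en_XX",
    "french": "fr_XX",
    "german": "de_DE",
    "spanish": "es_XX",
}


def caption_language_code(language: str) -> str:
    """Return the mBART-50 language code for one of the four supported languages."""
    key = language.strip().lower()
    if key not in MBART50_LANG_CODES:
        supported = ", ".join(sorted(MBART50_LANG_CODES))
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        )
    return MBART50_LANG_CODES[key]


if __name__ == "__main__":
    # Print the code that would be used for each supported language.
    for name in ("English", "French", "German", "Spanish"):
        print(name, "->", caption_language_code(name))
```

A decoder that follows the mBART-50 convention would typically be prompted with the token for the returned code, so rejecting unsupported names early avoids generating text in an unintended language.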

The model predicts one of 3129 classes in English, which can be found here, and the translated versions are then provided based on the language chosen as Answer Language. The question can be written in any of the following languages: English, French, German, and Spanish.

For more details, click on Usage or Article 🤗 below.