# Kosmos-2: Multimodal Large Language Model and OpenVINO
[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) that has new capabilities of multimodal grounding and
referring. KOSMOS-2 can understand multimodal input, follow instructions,
perceive object descriptions (e.g., bounding boxes), and ground language to the visual world.
Multimodal Large Language Models (MLLMs) have successfully served as a general-purpose interface across a wide
range of language, vision, and vision-language tasks. MLLMs can perceive general modalities, including
text, images, and audio, and generate responses as free-form text under zero-shot and few-shot settings.
[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large
language models. Grounding
enables more convenient and efficient human-AI interaction for vision-language tasks: the user can point
directly to an object or region in the image rather than type a detailed text description to refer to it, and the
model understands that image region through its spatial location. Grounding also enables the model to respond
with visual answers (i.e., bounding boxes), which supports additional vision-language tasks such as referring expression
comprehension. Visual answers are more accurate than text-only responses and resolve coreference ambiguity.
In addition, grounding can link noun phrases and referring expressions in the generated free-form
text response to image regions, providing more accurate, informative, and comprehensive answers.
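
To make the grounding workflow concrete, here is a minimal sketch of running the original model with Hugging Face Transformers, using the public `microsoft/kosmos-2-patch14-224` checkpoint and the special `<grounding>` prompt prefix. The prompt, image URL, and generation parameters are illustrative and may differ from what the notebook uses.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Public Kosmos-2 checkpoint on the Hugging Face Hub (assumed here for illustration).
model_id = "microsoft/kosmos-2-patch14-224"
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# The <grounding> token asks the model to link generated phrases to image regions.
prompt = "<grounding>An image of"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Recover the cleaned caption and the grounded entities: (phrase, character span, bounding boxes).
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```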

## Notebook contents
- Prerequisites
- Infer the original model
- Convert the model to OpenVINO IR (a conversion sketch follows this list)
- Inference
- Interactive inference
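
The conversion and inference steps follow the usual OpenVINO pattern: trace the PyTorch submodules, save them as OpenVINO IR, then compile the IR for a target device. The sketch below shows that pattern for the image encoder only; the attribute name `vision_model`, the input shape, and the file name are assumptions, and the notebook may split and name the submodels differently (e.g., converting the text decoder with KV-cache inputs separately).

```python
import torch
import openvino as ov
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
model.eval()

# Assumption: the checkpoint exposes its CLIP-style image encoder as `vision_model`.
vision_encoder = model.vision_model
# Return plain tuples instead of ModelOutput objects so tracing is straightforward.
vision_encoder.config.return_dict = False

# Trace the encoder with a dummy image tensor and convert it to OpenVINO IR.
example_pixels = torch.zeros((1, 3, 224, 224))
ov_vision = ov.convert_model(vision_encoder, example_input=example_pixels)
ov.save_model(ov_vision, "kosmos2_vision_encoder.xml")

# Compile the IR for a target device and run a quick sanity check.
core = ov.Core()
compiled = core.compile_model(ov_vision, "CPU")
outputs = compiled(example_pixels.numpy())
last_hidden = outputs[compiled.output(0)]
print(last_hidden.shape)
```
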
## Installation instructions
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).