In-browser unified multimodal understanding and generation.
Generate text from images and prompts.