PrismCaptioner Model Card

Model details

PrismCaptioners are open-source captioners built on the LLaVA architecture and finetuned on the GPT4V-assisted dataset ALLaVA. We have released PrismCaptioner-7B and PrismCaptioner-2B.

PrismCaptioner-2B details

  • Vision Backbone: google/siglip-so400m-patch14-384
  • Language Backbone: internlm/internlm2-1_8b
  • Dataset: 1x ALLaVA-Caption-[LAION/VFLAN], 2x Evol-Instruct-GPT4-Turbo-143K

Please refer to the paper and codebase for more information: [Paper] [Code]

Intended uses

  • Perception Module: The model can be integrated into Prism as a perception module to solve vision-language tasks by utilizing an external LLM (see the sketch after this list).
  • Effective Captioner: The model can produce high-quality captions for given images.
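
The following is a minimal sketch (not the official Prism pipeline code) of how the captioner can serve as the perception module in a two-stage setup: only supported_VLM and generate() come from the usage snippet below; the OpenAI client and the gpt-4-turbo model name are illustrative stand-ins for whichever external LLM you pair it with.

# In the Prism repo folder
from decouple import supported_VLM
from openai import OpenAI

# Stage 1 (perception): turn the image into a detailed textual description.
captioner = supported_VLM['prismcaptioner-2b']()
caption = captioner.generate([
    'assets/case1.png',
    'Given the image below, please provide a detailed description of what you see.',
])

# Stage 2 (reasoning): hand the caption plus the actual question to an external LLM.
question = 'How many people appear in the image?'
client = OpenAI()  # assumes OPENAI_API_KEY is set; swap in any LLM client you prefer
response = client.chat.completions.create(
    model='gpt-4-turbo',  # illustrative choice of external reasoning LLM
    messages=[{
        'role': 'user',
        'content': f'Image description: {caption}\n'
                   f'Question: {question}\n'
                   'Answer based only on the description.',
    }],
)
print(response.choices[0].message.content)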

Model usage

Clone the Prism repo and complete the preparation steps. You can then use PrismCaptioners following the usage example or demo below.

# In the Prism repo folder
from decouple import supported_VLM

model = supported_VLM['prismcaptioner-2b']()
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
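
The call returns the model's textual output, i.e. the generated caption, which can then be forwarded to an external LLM for reasoning as in the sketch above.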