PrismCaptioner Model Card

Model details

PrismCaptioners are open-source captioners built on the LLaVA architecture and finetuned on the GPT4V-assisted dataset ALLaVA. We have released PrismCaptioner-7B and PrismCaptioner-2B.

PrismCaptioner-2B details

  • Vision Backbone: google/siglip-so400m-patch14-384
  • Language Backbone: internlm/internlm2-1_8b
  • Dataset: 1x ALLaVA-Caption-[LAION/VFLAN], 2x Evol-Instruct-GPT4-Turbo-143K

Please refer to the paper and codebase for more information: [Paper] [Code]

Intended uses

  • Perception Module: The model can be integrated into Prism as a perception module to solve vision-language tasks by utilizing an external LLM (see the sketch after this list).
  • Effective Captioner: The model can produce high-quality captions for given images.
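
The following is a minimal sketch (not the official Prism pipeline code) of how the captioner can serve as the perception module in a two-stage setup: only supported_VLM and generate() come from the usage snippet below; the OpenAI client and the gpt-4-turbo model name are illustrative stand-ins for whichever external LLM you pair it with.

# In the Prism repo folder
from decouple import supported_VLM
from openai import OpenAI

# Stage 1 (perception): turn the image into a detailed textual description.
captioner = supported_VLM['prismcaptioner-2b']()
caption = captioner.generate([
    'assets/case1.png',
    'Given the image below, please provide a detailed description of what you see.',
])

# Stage 2 (reasoning): hand the caption plus the actual question to an external LLM.
question = 'How many people appear in the image?'
client = OpenAI()  # assumes OPENAI_API_KEY is set; swap in any LLM client you prefer
response = client.chat.completions.create(
    model='gpt-4-turbo',  # illustrative choice of external reasoning LLM
    messages=[{
        'role': 'user',
        'content': f'Image description: {caption}\n'
                   f'Question: {question}\n'
                   'Answer based only on the description.',
    }],
)
print(response.choices[0].message.content)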

Model usage

Clone the Prism repo and complete the preparation steps. You can then use PrismCaptioners following the usage example or demo below.

# In the Prism repo folder
from decouple import supported_VLM

model = supported_VLM['prismcaptioner-2b']()
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
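
The call returns the model's textual output, i.e. the generated caption, which can then be forwarded to an external LLM for reasoning as in the sketch above.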