LLaVA-RLHF Model Card

Model details

Model type: LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding, achieving strong visual reasoning and perception capabilities in the spirit of the multimodal GPT-4. Trained with Factually Augmented RLHF, LLaVA-RLHF is designed to be more helpful and to hallucinate less than LLaVA and other open-source LMMs.

Usage: NOTE: The RLHF-ed model is trained with LoRA in the bfloat16 data type. Users must apply the PEFT LoRA adapter on top of the LLaVA-SFT+ base model, as shown below.

import torch
from llava.model import LlavaLlamaForCausalLM  # model class from the LLaVA codebase
from peft import PeftModel

dtype = torch.bfloat16

model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"
lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"

# Load the LLaVA-SFT+ base model in bfloat16.
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)

# Apply the RLHF LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(
    model,
    lora_path,
)
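
To reduce the overhead of the PEFT wrapper at inference time, the LoRA weights can optionally be merged into the base model. The following is a minimal sketch, assuming the standard PEFT merge_and_unload API and that tokenizer files are stored alongside the SFT model weights (an assumption about this particular release):

from transformers import AutoTokenizer

# Optional: fold the RLHF LoRA adapter into the base weights (standard PEFT API).
model = model.merge_and_unload()
model.eval()

# Assumption: the tokenizer ships in the same directory as the SFT model.
tokenizer = AutoTokenizer.from_pretrained(model_path)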

Model date: LLaVA-RLHF was trained in Sept 2023.

Paper or resources for more information: https://llava-rlhf.github.io/

License: Apache License 2.0

Where to send questions or comments about the model: https://github.com/llava-rlhf/LLaVA-RLHF/issues

Intended use

Primary intended uses: The primary use of LLaVA-RLHF is research on large multimodal chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

595K filtered image-text pairs from CC3M.

150K GPT-generated multimodal instruction-following chat data.

83K VQA v2 instruction-following VQA data.

16K A-OKVQA instruction-following CoT-VQA data.

23K FLICKR instruction-following spotting captioning data.

10K LLaVA-based human preference data.
