YuchengShi
/

LLaVA-v1.5-7B-CUB-200

	---
	base_model: llava-hf/llava-1.5-7b-hf
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags: []
	---

	# Fine-Grained Visual Classification on CUB-200

	Project Page: [SelfSynthX](https://github.com/sycny/SelfSynthX).

	Paper on arXiv: [Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data](https://arxiv.org/abs/2502.14044)

	This model is a multimodal foundation model fine-tuned from [LLaVA-1.5-7B-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) for fine-grained visual classification and explainability on the CUB-200 dataset.

	## Key Details

	- Base Model: LLaVA-1.5-7B
	- Dataset: CUB-200 (Caltech-UCSD Birds-200-2011)
	- Innovation:
	  - Self-Synthesized Data: Generates interpretable explanations by extracting image-specific visual concepts using the Information Bottleneck principle.
	  - Iterative Fine-Tuning: Uses reward model-free rejection sampling to progressively improve classification accuracy and explanation quality.
	- Intended Use: Fine-grained bird species identification with human-verifiable explanations.
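The reward model-free rejection sampling step mentioned above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the `Candidate` class, the `concept_coverage` score, and the `select_candidate` helper are hypothetical names introduced here, and the selection rule (keep answers with the correct label, then prefer the one that mentions the most image-specific concepts) is an assumption about how such a filter could work without a learned reward model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled answer (hypothetical container, not from the SelfSynthX code)."""
    text: str             # generated answer including the explanation
    predicted_label: str  # class name parsed from the answer


def concept_coverage(text, concepts):
    """Fraction of the given visual concepts mentioned in the text (case-insensitive)."""
    if not concepts:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for concept in concepts if concept.lower() in lowered)
    return hits / len(concepts)


def select_candidate(candidates, true_label, concepts):
    """Reject answers with the wrong label, then return the best-covered one.

    Returns None when every candidate is rejected.
    """
    correct = [
        c for c in candidates
        if c.predicted_label.lower() == true_label.lower()
    ]
    if not correct:
        return None
    return max(correct, key=lambda c: concept_coverage(c.text, concepts))
```

In an iterative setup, the selected answers from one round would become fine-tuning data for the next round; images for which `select_candidate` returns `None` would simply be skipped in that round.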

	## How to Use

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import AutoProcessor, LlavaForConditionalGeneration

	# Load the fine-tuned checkpoint in half precision and move it to the GPU.
	model_id = "YuchengShi/LLaVA-v1.5-7B-CUB-200"
	model = LlavaForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    low_cpu_mem_usage=True,
	).to("cuda")
	processor = AutoProcessor.from_pretrained(model_id)

	# Build a single-turn conversation with one text question and one image slot.
	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "text", "text": "What is the bird name? Give your reasoning"},
	            {"type": "image"},
	        ],
	    },
	]
	prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

	# Download an example image and preprocess it together with the prompt.
	image_file = "https://www.allaboutbirds.org/guide/assets/photo/297602831-1280px.jpg"
	raw_image = Image.open(requests.get(image_file, stream=True).raw)
	inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

	# Greedy decoding; the decoded text contains the prompt followed by the answer.
	output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
	print(processor.decode(output[0][2:], skip_special_tokens=True))
	```

	## Training & Evaluation

	- Training: Fine-tuned using LoRA on CUB-200 with iterative rejection sampling.
	- Evaluation: Compared with baseline models, the fine-tuned model achieves higher classification accuracy and produces more robust, interpretable explanations (see the paper for details).
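Because the model answers in free text, measuring classification accuracy requires mapping each answer back to a class label. The following is a minimal sketch of one way to do that, assuming an answer counts as correct when the ground-truth class name appears in it. The helpers `label_in_answer` and `accuracy` are hypothetical and are not taken from the paper's evaluation code; the matching rule is an assumption.

```python
def _normalize(text):
    """Lowercase and treat underscores and hyphens as spaces."""
    cleaned = text.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())


def label_in_answer(answer, label):
    """True if the normalized class name occurs in the normalized answer."""
    return _normalize(label) in _normalize(answer)


def accuracy(answers, labels):
    """Fraction of answers that mention their ground-truth class name."""
    if len(answers) != len(labels):
        raise ValueError("answers and labels must have the same length")
    if not answers:
        return 0.0
    correct = sum(label_in_answer(a, l) for a, l in zip(answers, labels))
    return correct / len(answers)
```

Substring matching is deliberately simple and can overcount when one class name is contained in another (for example, a label such as "Crow" would also match an answer that names "Fish Crow"), so a stricter parser may be preferable for reporting results.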

	## Citation

	If you use this model, please cite:

	```bibtex
	@inproceedings{shi2025enhancing,
	  title={Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data},
	  author={Yucheng Shi and Quanzheng Li and Jin Sun and Xiang Li and Ninghao Liu},
	  booktitle={The Thirteenth International Conference on Learning Representations},
	  year={2025},
	  url={https://openreview.net/forum?id=lHbLpwbEyt}
	}
	```