⚡ Can Stable Diffusion's visual expertise enhance Llama-3.2?
Lavender efficiently fine-tunes advanced vision-language models by aligning their text-vision attention with Stable Diffusion's.
Paper: Diffusion Instruction Tuning (2502.06814)
Key Highlights:
✅ Significant Gains: +30% on 20 tasks, +68% on OOD WorldMedQA
✅ Data-Efficient: Needs only 0.13M samples (~2.5% of typical VLM datasets)
✅ Low Compute: Fine-tunes in ~1 day on 8 NVIDIA A10G GPUs
✅ Model-Agnostic: Works with Llama-3.2-11B, MiniCPM-Llama3-v2.5 & more
✅ Precise Alignment: Transfers strong text-vision alignment from Stable Diffusion (sketch below)
✅ Open-Source: Code, data & fine-tuned models will be available
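
For a rough sense of what aligning a VLM's text-vision attention with Stable Diffusion could look like, here is a minimal sketch, not the paper's actual implementation: it assumes you can extract per-token text-to-image attention maps from the VLM and per-token cross-attention maps from Stable Diffusion, then regresses one toward the other. Names like `attention_alignment_loss`, `vlm_attn`, `sd_attn`, and `align_weight` are illustrative only.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(vlm_attn: torch.Tensor,
                             sd_attn: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch: pull a VLM's text-to-vision attention maps
    toward Stable Diffusion's cross-attention maps for the same tokens.

    vlm_attn: (batch, tokens, H1, W1) attention over image patches from the VLM
    sd_attn:  (batch, tokens, H2, W2) cross-attention maps from Stable Diffusion
    """
    # Resize the SD maps to the VLM's patch grid so the two are comparable.
    sd_resized = F.interpolate(sd_attn, size=vlm_attn.shape[-2:],
                               mode="bilinear", align_corners=False)

    def normalize(a: torch.Tensor) -> torch.Tensor:
        # Normalize each map into a distribution over spatial locations.
        flat = a.flatten(-2)
        return (flat / (flat.sum(-1, keepdim=True) + eps)).reshape(a.shape)

    return F.mse_loss(normalize(vlm_attn), normalize(sd_resized))

# During fine-tuning, a loss like this would be added to the usual
# language-modeling objective, e.g.:
# total_loss = lm_loss + align_weight * attention_alignment_loss(vlm_attn, sd_attn)
```

The exact alignment objective, which attention layers are used, and how the term is weighted against supervised fine-tuning are specified in the paper itself.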
🔥 Discuss live at: https://www.alphaxiv.org/abs/2502.06814
Project Page: https://astrazeneca.github.io/vlm/