From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures such as CLIP or ViT and on specific tasks such as recognition and classification; LVLMs (e.g., LLaVA) and more general tasks (e.g., Visual Question Answering and Visual Reasoning) remain under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on these observations, we propose an Adaptive Data Refinement Framework (ADR) consisting of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and the scarce images themselves to supplement the underrepresented portions of the data. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 by a relative 4.36% without increasing the training data volume.
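To make the DR-stage idea concrete, here is a minimal sketch of entity-frequency capping in Python. This is an illustration, not the paper's released code: `rebalance_by_entity`, `entities_of`, and `cap_percentile` are hypothetical names, and the actual ADR framework derives its rebalancing thresholds adaptively from the entity distribution rather than from a fixed percentile.

```python
from collections import Counter
import random

def rebalance_by_entity(samples, entities_of, cap_percentile=0.9, seed=0):
    """Down-sample instances of over-represented (head) entities
    while keeping all tail data.

    samples        -- list of training instances
    entities_of    -- callable mapping an instance to its entity strings
    cap_percentile -- entities above this point of the frequency
                      distribution are capped (illustrative choice)
    """
    rng = random.Random(seed)
    freq = Counter(e for s in samples for e in entities_of(s))
    if not freq:
        return list(samples)

    counts = sorted(freq.values())
    cap = counts[int(cap_percentile * (len(counts) - 1))]

    pool = list(samples)
    rng.shuffle(pool)  # avoid ordering bias when capping head entities
    kept, seen = [], Counter()
    for s in pool:
        ents = entities_of(s)
        # Keep an instance if any of its entities is still below the cap,
        # so tail entities are never dropped due to head co-occurrence.
        if not ents or any(seen[e] < cap for e in ents):
            kept.append(s)
            seen.update(ents)
    return kept

# Usage, assuming each instance stores its entities under an "entities" key:
# balanced = rebalance_by_entity(data, entities_of=lambda s: s["entities"])
```

Capping by entity rather than by whole instance is what lets such a scheme shrink head redundancy without discarding the rare co-occurring tail entities.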
Community
The long-tail (LT) problem is severe in LVLMs. We analyze it in depth and introduce a comprehensive Adaptive Data Refinement (ADR) framework that effectively mitigates it without increasing data volume or requiring additional training. ADR is both model-agnostic and data-agnostic, making it easily transferable to any open-source dataset.
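The DS stage can be pictured in a similar spirit: starting from the few available images of a tail entity, a diffusion model generates additional visual variants. The sketch below uses the Hugging Face `diffusers` img2img pipeline as a stand-in for the paper's DDPM-based synthesis; the checkpoint, prompt template, and `strength` value are assumptions, not the authors' settings.

```python
# pip install torch diffusers transformers accelerate
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Any text-to-image diffusion checkpoint works here; SD 1.5 is one example.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def synthesize_tail_variants(image_path, entity, n=4, strength=0.6):
    """Generate n synthetic variants of a scarce tail-entity image."""
    src = Image.open(image_path).convert("RGB").resize((512, 512))
    result = pipe(
        prompt=f"a photo of a {entity}",  # naive prompt template (assumption)
        image=src,
        strength=strength,       # how far the output may drift from the source
        guidance_scale=7.5,
        num_images_per_prompt=n,
    )
    return result.images         # PIL images to add to the tail portion
```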
The following papers were recommended by the Semantic Scholar API
- LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation (2025)
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs (2025)
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models (2025)
- Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations (2025)
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression (2025)
- Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing (2025)