[[GitHub]](https://github.com/deepglint/unicom)
## Model
We use our model as the vision encoder in [LLaVA-Next](https://huggingface.co/lmms-lab/llava-next-qwen-32b); it uses the same Vision Transformer architecture ([ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336)) as CLIP.
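
Because the encoder follows the CLIP ViT-L/14@336px layout, it can usually be loaded with the standard CLIP vision classes from `transformers`. The snippet below is a minimal sketch under that assumption; the model ID and image path are placeholders, not guaranteed names.

```python
# Minimal sketch: load the vision encoder with the standard CLIP classes.
# Assumption: the checkpoint is stored in CLIP vision format on the Hub.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder ID; use this repo's ID
processor = CLIPImageProcessor.from_pretrained(model_id)
model = CLIPVisionModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features an MLLM projector would consume:
# shape [1, 577, 1024] for ViT-L/14@336px (576 patches + 1 CLS token).
print(outputs.last_hidden_state.shape)
```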

## Data
Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.

## Performance and Limitations
In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to evaluate MLCD within Multimodal Large Language Models (MLLMs), using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. The MLCD-based model outperforms the CLIP baseline on 15 of the 17 benchmarks below, validating the effectiveness of MLCD within MLLMs (a minimal swap sketch follows the table).
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:-------------|:-------------|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | **76.98** | 73.15 |
| ScienceQA_img | **78.09** | 76.35 |
| GQA | **64.17** | 63.31 |
| InfoVQA_val | **43.48** | 38.88 |
| MMBench_cn_dev | **74.83** | 72.51 |
| MMBench_en_dev | **76.37** | 74.57 |
| MME(cognition) | **432** | 384 |
| MME(perception) | **1598** | 1512 |
| SeedBench | **68.20** | 66.80 |
| SeedBench_img | **73.75** | 72.72 |
| MMStar | **50.98** | 48.98 |
| MMMU | **44.30** | 44.20 |
| OCRBench | **531.00** | 525.00 |
| ChartQA | **67.84** | 66.52 |
| DocVQA_val | **76.46** | 75.21 |
| POPE | 88.69 | **88.83** |
| TextVQA_val | 61.69 | **62.47** |
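
To reproduce this setup, the swap conceptually amounts to pointing a LLaVA-style vision-tower builder at the MLCD checkpoint instead of `openai/clip-vit-large-patch14-336`. The sketch below is illustrative only: the config field names are ours, not the literal LLaVA-NeXT options, and the MLCD model ID is a placeholder.

```python
# Illustrative sketch of the vision-tower swap in a LLaVA-NeXT-style setup.
# Assumptions: the MLCD checkpoint loads with the CLIP vision classes, and
# the field/ID names below are placeholders rather than LLaVA-NeXT's own.
from dataclasses import dataclass

import torch
from transformers import CLIPImageProcessor, CLIPVisionModel


@dataclass
class VisionTowerConfig:
    # Change this single path to switch between the CLIP and MLCD towers.
    vision_tower_path: str = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder ID
    select_layer: int = -2  # LLaVA convention: features from the penultimate block


def build_vision_tower(cfg: VisionTowerConfig):
    """Load the image processor and (frozen) vision encoder for the chosen tower."""
    processor = CLIPImageProcessor.from_pretrained(cfg.vision_tower_path)
    tower = CLIPVisionModel.from_pretrained(cfg.vision_tower_path).eval()
    tower.requires_grad_(False)  # the tower stays frozen during MLLM pretraining
    return processor, tower


def encode(tower: CLIPVisionModel, pixel_values: torch.Tensor, select_layer: int) -> torch.Tensor:
    """Return patch features from the selected hidden layer, dropping the CLS token."""
    with torch.no_grad():
        out = tower(pixel_values, output_hidden_states=True)
    return out.hidden_states[select_layer][:, 1:]  # [B, 576, 1024] for ViT-L/14@336px
```

Because both towers expose the same ViT-L/14@336px interface, the projector and the rest of the LLaVA-NeXT pipeline are unchanged.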
### Limitations
Models trained on larger datasets generally perform better across a wider range of tasks. We are currently training such models and will release them soon.
## Acknowledgments
We would like to express our gratitude to [Xiang An](https://huggingface.co/xiangan) and [Kaicheng Yang](https://huggingface.co/Kaichengalex) for their significant work on [MLCD](https://arxiv.org/abs/2407.17331).