---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- rp-yu/VPT_Datasets
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
---
|
|
|
# Introducing Visual Perception Token into Multimodal Large Language Model |
|
|
|
This repository contains models from the paper [Introducing Visual Perception Token into Multimodal Large Language Model](https://arxiv.org/abs/2502.17425). The models are equipped with Visual Perception Tokens, which allow a multimodal large language model (MLLM) to control its own visual perception process, for example by selecting image regions to attend to or triggering vision re-encoding, thereby improving its visual perception capabilities.
|
|
|
Code: https://github.com/yu-rp/VisualPerceptionToken |
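
## Usage

For quick experimentation, the checkpoints can be loaded through `transformers`. The sketch below assumes the model keeps the standard Qwen2-VL interface of its base model; `MODEL_ID` is a placeholder for this repository's model id, and the image URL is purely illustrative. See the GitHub repository above for the full inference pipeline, including handling of the perception tokens.

```python
# Minimal inference sketch, assuming the checkpoint keeps the standard
# Qwen2-VL interface of its base model (Qwen/Qwen2-VL-2B-Instruct).
# MODEL_ID is a placeholder, not a real repo id.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "path/to/this-checkpoint"  # placeholder

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Build a chat prompt containing one image and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Any RGB image works; this URL is illustrative.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```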