VLE (Visual-Language Encoder) is an image-text multimodal understanding model built on pre-trained text and image encoders. It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval. It achieves significant improvements especially on the visual commonsense reasoning (VCR) task, which requires high-level language understanding and reasoning skills.

For more details see https://github.com/iflytek/VLE.

Online VLE demo for Visual Question Answering: https://huggingface.co/spaces/hfl/VQA_VLE_LLM
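
As a rough illustration, the sketch below shows how the model might be loaded and run on an image-text pair. The class names `VLEModel` and `VLEProcessor`, the module path `models.VLE`, and the checkpoint name `hfl/vle-base` are assumptions based on the GitHub repository linked above; consult that repository for the authoritative usage.

```python
# Minimal usage sketch (assumptions noted above); see https://github.com/iflytek/VLE
# for the actual loading and inference code.
import torch
from PIL import Image
from models.VLE import VLEModel, VLEProcessor  # provided by the VLE repo (assumption)

model_name = "hfl/vle-base"  # example checkpoint name; pick one from the hfl org
model = VLEModel.from_pretrained(model_name)
processor = VLEProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")
text = ["A dog is lying on the grass."]

# Encode the image-text pair and run a forward pass to obtain fused multimodal features.
inputs = processor(text=text, images=[image], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
```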
