How to use only the text and visual embeddings?
#2, opened by Labmem009
Interesting work! I want to use the alignment between images and text in the encoder of this model for downstream tasks. How should I use it?
+1. Is it possible to use only the visual encoder for downstream tasks such as classification?