import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("ColPali")

st.success("""[Original tweet](https://x.com/mervenoyann/status/1811003265858912670) (Jul 10, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but that's overkill and doesn't work well! 🤓

ColPali uses a vision language model instead, which is better at document understanding 📑
""")
st.markdown(""" """)

st.image("pages/ColPali/image_1.png", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Check out the [ColPali model](https://huggingface.co/vidore/colpali) (MIT license!)

Check out the [blog](https://huggingface.co/blog/manu/colpali)

The authors also released a new benchmark for document retrieval, [ViDoRe Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard), submit your model! """)
st.markdown(""" """)

st.image("pages/ColPali/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Regular document retrieval systems use OCR + layout detection + another model to retrieve information from documents, and then use output representations in applications like RAG 🥲

Meanwhile, modern image encoders demonstrate out-of-the-box document understanding capabilities!""")
st.markdown(""" """)

st.markdown("""
ColPali marries the idea of modern vision language models with retrieval 🤝

The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). Then they feed the patch embedding outputs to PaliGemma and create BiPali 🖇️
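
A toy sketch of the bi-encoder idea, not the authors' code: pool the patch embeddings into one vector per page and compare it to one query vector. Mean pooling, cosine similarity, the function names and the 2-d vectors are all assumptions for illustration.

```python
import math

def mean_pool(vectors):
    # Average a list of equal-length vectors into a single vector.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not real model outputs.
page_patches = [[1.0, 0.0], [0.0, 1.0]]
query_tokens = [[1.0, 0.0], [1.0, 0.0]]

page_vec = mean_pool(page_patches)    # [0.5, 0.5]
query_vec = mean_pool(query_tokens)   # [1.0, 0.0]
bi_score = cosine(query_vec, page_vec)
```

Pooling squeezes every patch into one vector, so a single score has to stand in for the whole page.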
""")
st.markdown(""" """)

st.image("pages/ColPali/image_3.png", use_column_width=True)
st.markdown(""" """)

st.markdown("""
BiPali natively feeds image patch embeddings to an LLM, which enables ColBERT-like late interaction between text tokens and image patches (hence the name ColPali!) 🤩
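
A toy sketch of ColBERT-style late interaction (MaxSim), not the authors' implementation: each query token keeps its best-matching patch, and those maxima are summed into the page score. The function names and 2-d vectors are made up for illustration; real embeddings would come from the model.

```python
def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def late_interaction_score(query_tokens, page_patches):
    # For each query token, take the max similarity over all patches,
    # then sum those maxima over the query tokens.
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)

# Hypothetical toy embeddings, not real model outputs.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]
page_b = [[0.5, 0.5], [0.4, 0.4]]

scores = {
    "page_a": late_interaction_score(query_tokens, page_a),
    "page_b": late_interaction_score(query_tokens, page_b),
}
best = max(scores, key=scores.get)
```

Here `page_a` ranks first because each query token finds a closely matching patch on it.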
""")
st.markdown(""" """)

st.markdown("""
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.

See below how ColPali and the other models perform on ViDoRe 👇🏻
""")
st.markdown(""" """)

st.image("pages/ColPali/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Aside from performance improvements, ColPali is very fast for offline indexing as well!
""")
st.markdown(""" """)

st.image("pages/ColPali/image_5.png", use_column_width=True)
st.markdown(""" """)


st.info("""
Resources:  
- [ColPali: Efficient Document Retrieval with Vision Language Models](https://huggingface.co/papers/2407.01449) 
by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (2024)  
        
- [GitHub](https://github.com/illuin-tech/colpali)  
        
- [Link to Models](https://huggingface.co/models?search=vidore)
        
- [Link to Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("MiniGemini")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Home")