|
--- |
|
library_name: transformers |
|
tags: |
|
- retrieval |
|
- constbert |
|
- colbert |
|
- multi-vector |
|
- embedding |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
--- |
|
|
|
# ConstBERT |
|
|
|
ConstBERT (Constant-Space BERT) is a multi-vector retrieval model designed for efficient and effective passage retrieval. It modifies the ColBERT architecture by encoding documents into a fixed number of learned embeddings, rather than one embedding per token. This significantly reduces storage costs and, because every document representation has the same size, simplifies memory layout and OS paging, while retaining most of the effectiveness of token-level multi-vector models.
|
|
|
|
|
|
|
## Details |
|
ConstBERT addresses the high storage cost associated with traditional multi-vector retrieval methods like ColBERT, where each token in a document collection is stored as a vector. Instead, ConstBERT proposes a learned pooling mechanism that projects the token-level embeddings of a document into a smaller, fixed number (`C`) of document-level embeddings. Each of these `C` embeddings captures distinct semantic facets of the document. This projection is achieved through an additional linear transformation layer learned end-to-end during training. The relevance score between a query and a document is then computed using a late interaction mechanism (MaxSim) over these `C` document embeddings and the query's token embeddings. |
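The pooling step can be illustrated with a minimal NumPy sketch. The shapes match this model (`C = 32`, dimension 128), but the pooling matrix here is random and the parameterization is simplified; the actual learned layer lives in the model's remote code:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, C = 180, 128, 32  # token count, embedding dim, fixed output count

# Token-level embeddings from the BERT encoder (random stand-ins here).
token_embs = rng.standard_normal((N, D))

# A learned pooling maps the N token embeddings to C document embeddings.
# The matrix is random in this sketch; in ConstBERT it is learned end-to-end
# during training.
pool = rng.standard_normal((C, N)) / np.sqrt(N)

doc_embs = pool @ token_embs  # (C, D): constant-size document representation
```

However long the input document, the output is always `C` vectors, which is what makes the index entries fixed-size.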
|
|
|
This approach offers a trade-off between storage/computational efficiency and retrieval effectiveness, configurable by the choice of `C`. The paper demonstrates that ConstBERT can achieve performance comparable to ColBERT on benchmarks like MSMARCO and BEIR, with substantially smaller index sizes. |
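A back-of-envelope comparison makes the storage trade-off concrete. All numbers below are illustrative assumptions (approximate MSMARCO passage count, an assumed average passage length, fp16 values, no compression), not measured index sizes:

```python
# Rough index-size comparison: one vector per token vs. C fixed vectors.
dim = 128
bytes_per_value = 2          # fp16
num_passages = 8_800_000     # roughly the MSMARCO passage count (approximate)
avg_tokens = 80              # assumed average passage length in tokens
C = 32                       # ConstBERT's fixed number of document vectors

colbert_bytes = num_passages * avg_tokens * dim * bytes_per_value
constbert_bytes = num_passages * C * dim * bytes_per_value

print(f"Token-level index: {colbert_bytes / 1e9:.1f} GB")
print(f"ConstBERT index:   {constbert_bytes / 1e9:.1f} GB")
```

Under these assumptions the index shrinks by a factor of `avg_tokens / C`; the real savings depend on tokenization and any compression applied.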
|
|
|
This checkpoint was trained to produce `C = 32` document vectors of dimension 128.
|
|
|
### Model Sources |
|
|
|
For more details, please refer to our [official repository](https://github.com/pisa-engine/ConstBERT), [paper](https://www.pinecone.io/research/efficient-constant-space-multi-vector-retrieval/) and [blog](https://www.pinecone.io/blog/cascading-retrieval-with-multi-vector-representations/)! |
|
|
|
### Direct Use |
|
|
|
ConstBERT is intended for semantic search and passage retrieval tasks. It can be used for: |
|
|
|
- First-stage retrieval in large document collections. |
|
- Reranking candidates produced by another retrieval system. |
|
|
|
The model produces fixed-size multi-vector representations for documents, which can be indexed efficiently. Queries are represented as sets of token embeddings. |
|
|
|
Example code: |
|
```python |
|
from transformers import AutoModel |
|
import numpy as np |
|
|
|
def max_sim(q: np.ndarray, d: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    q: (num_query_tokens, dim) query token embeddings.
    d: (C, dim) fixed-size document embeddings.
    """
    assert q.ndim == 2, "q must be a 2-dimensional array"
    assert d.ndim == 2, "d must be a 2-dimensional array"
    scores = np.dot(d, q.T)              # (C, num_query_tokens) similarities
    max_scores = np.max(scores, axis=0)  # best document vector per query token
    return float(np.sum(max_scores))     # sum the per-token maxima
|
|
|
model = AutoModel.from_pretrained("pinecone/ConstBERT", trust_remote_code=True) |
|
|
|
# Example queries and documents |
|
queries = ["What is the capital of France?", "latest advancements in AI"] |
|
documents = [ |
|
"Paris is the capital and most populous city of France.", |
|
"Artificial intelligence is rapidly evolving with new breakthroughs.", |
|
"The Eiffel Tower is a famous landmark in Paris." |
|
] |
|
|
|
# Encode queries and documents |
|
query_embeddings = model.encode_queries(queries).numpy() |
|
document_embeddings = model.encode_documents(documents).numpy() |
|
|
|
# The Paris passage should match the France query better than the AI passage.
assert max_sim(query_embeddings[0], document_embeddings[0]) > max_sim(query_embeddings[0], document_embeddings[1])
|
``` |
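For retrieval over a collection, the same MaxSim score ranks every document for a query. A self-contained sketch with random toy embeddings, whose shapes are chosen to match this checkpoint's 32 vectors of size 128 (the helper mirrors `max_sim` above):

```python
import numpy as np

def max_sim(q: np.ndarray, d: np.ndarray) -> float:
    # Sum, over query tokens, of the best-matching document vector.
    return float(np.max(d @ q.T, axis=0).sum())

rng = np.random.default_rng(1)
query = rng.standard_normal((12, 128))    # 12 query token embeddings
docs = rng.standard_normal((3, 32, 128))  # 3 documents, 32 vectors each

scores = [max_sim(query, d) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best first
```

With real embeddings from `encode_queries` / `encode_documents`, the same loop ranks candidate passages; for large collections you would index the document vectors instead of scoring exhaustively.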
|
|
|
## Citation (BibTeX) |
|
```bibtex
|
@inproceedings{macavaney2025constbert, |
|
author = {MacAvaney, Sean and Mallia, Antonio and Tonellotto, Nicola}, |
|
title = {Efficient Constant-Space Multi-vector Retrieval}, |
|
year = {2025}, |
|
isbn = {978-3-031-88713-0}, |
|
publisher = {Springer-Verlag}, |
|
address = {Berlin, Heidelberg}, |
|
url = {https://doi.org/10.1007/978-3-031-88714-7_22}, |
|
doi = {10.1007/978-3-031-88714-7_22}, |
|
booktitle = {Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part III}, |
|
pages = {237–245}, |
|
numpages = {9}, |
|
keywords = {Multi-Vector Retrieval, Efficiency, Dense Retrieval}, |
|
location = {Lucca, Italy} |
|
} |
|
``` |