|
--- |
|
library_name: transformers |
|
tags: |
|
- retrieval |
|
- constbert |
|
- colbert |
|
- multi-vector |
|
- embedding |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
--- |
|
|
|
# ConstBERT |
|
|
|
ConstBERT (Constant-Space BERT) is a multi-vector retrieval model designed for efficient and effective passage retrieval. It modifies the ColBERT architecture by encoding documents into a fixed number of learned embeddings, rather than one embedding per token. This significantly reduces storage costs and, because every document representation has the same size, simplifies memory layout and OS paging, while retaining most of the effectiveness of token-level multi-vector models.
|
|
|
|
|
|
|
## Details |
|
ConstBERT addresses the high storage cost associated with traditional multi-vector retrieval methods like ColBERT, where each token in a document collection is stored as a vector. Instead, ConstBERT proposes a learned pooling mechanism that projects the token-level embeddings of a document into a smaller, fixed number (`C`) of document-level embeddings. Each of these `C` embeddings captures distinct semantic facets of the document. This projection is achieved through an additional linear transformation layer learned end-to-end during training. The relevance score between a query and a document is then computed using a late interaction mechanism (MaxSim) over these `C` document embeddings and the query's token embeddings. |
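The pooling step can be illustrated with a minimal NumPy sketch. The shapes match this model (`C = 32`, dimension 128), but the pooling matrix here is random and the parameterization is simplified; the actual learned layer lives in the model's remote code:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, C = 180, 128, 32  # token count, embedding dim, fixed output count

# Token-level embeddings from the BERT encoder (random stand-ins here).
token_embs = rng.standard_normal((N, D))

# A learned pooling maps the N token embeddings to C document embeddings.
# The matrix is random in this sketch; in ConstBERT it is learned end-to-end
# during training.
pool = rng.standard_normal((C, N)) / np.sqrt(N)

doc_embs = pool @ token_embs  # (C, D): constant-size document representation
```

However long the input document, the output is always `C` vectors, which is what makes the index entries fixed-size.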
|
|
|
This approach offers a trade-off between storage/computational efficiency and retrieval effectiveness, configurable by the choice of `C`. The paper demonstrates that ConstBERT can achieve performance comparable to ColBERT on benchmarks like MSMARCO and BEIR, with substantially smaller index sizes. |
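A back-of-envelope comparison makes the storage trade-off concrete. All numbers below are illustrative assumptions (approximate MSMARCO passage count, an assumed average passage length, fp16 values, no compression), not measured index sizes:

```python
# Rough index-size comparison: one vector per token vs. C fixed vectors.
dim = 128
bytes_per_value = 2          # fp16
num_passages = 8_800_000     # roughly the MSMARCO passage count (approximate)
avg_tokens = 80              # assumed average passage length in tokens
C = 32                       # ConstBERT's fixed number of document vectors

colbert_bytes = num_passages * avg_tokens * dim * bytes_per_value
constbert_bytes = num_passages * C * dim * bytes_per_value

print(f"Token-level index: {colbert_bytes / 1e9:.1f} GB")
print(f"ConstBERT index:   {constbert_bytes / 1e9:.1f} GB")
```

Under these assumptions the index shrinks by a factor of `avg_tokens / C`; the real savings depend on tokenization and any compression applied.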
|
|
|
This checkpoint was trained to produce `C = 32` document vectors of dimension 128.
|
|
|
### Model Sources |
|
|
|
For more details, please refer to our [official repository](https://github.com/pisa-engine/ConstBERT), [paper](https://www.pinecone.io/research/efficient-constant-space-multi-vector-retrieval/) and [blog](https://www.pinecone.io/blog/cascading-retrieval-with-multi-vector-representations/)! |
|
|
|
### Direct Use |
|
|
|
ConstBERT is intended for semantic search and passage retrieval tasks. It can be used for: |
|
|
|
- First-stage retrieval in large document collections. |
|
- Reranking candidates produced by another retrieval system. |
|
|
|
The model produces fixed-size multi-vector representations for documents, which can be indexed efficiently. Queries are represented as sets of token embeddings. |
|
|
|
Example code: |
|
```python |
|
from transformers import AutoModel |
|
import numpy as np |
|
|
|
def max_sim(q: np.ndarray, d: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    q: (num_query_tokens, dim) query token embeddings.
    d: (C, dim) fixed-size document embeddings.
    """
    assert q.ndim == 2, "q must be a 2-dimensional array"
    assert d.ndim == 2, "d must be a 2-dimensional array"
    scores = np.dot(d, q.T)              # (C, num_query_tokens) similarities
    max_scores = np.max(scores, axis=0)  # best document vector per query token
    return float(np.sum(max_scores))     # sum the per-token maxima
|
|
|
model = AutoModel.from_pretrained("pinecone/ConstBERT", trust_remote_code=True) |
|
|
|
# Example queries and documents |
|
queries = ["What is the capital of France?", "latest advancements in AI"] |
|
documents = [ |
|
"Paris is the capital and most populous city of France.", |
|
"Artificial intelligence is rapidly evolving with new breakthroughs.", |
|
"The Eiffel Tower is a famous landmark in Paris." |
|
] |
|
|
|
# Encode queries and documents |
|
query_embeddings = model.encode_queries(queries).numpy() |
|
document_embeddings = model.encode_documents(documents).numpy() |
|
|
|
# The Paris passage should match the France query better than the AI passage.
assert max_sim(query_embeddings[0], document_embeddings[0]) > max_sim(query_embeddings[0], document_embeddings[1])
|
``` |
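For retrieval over a collection, the same MaxSim score ranks every document for a query. A self-contained sketch with random toy embeddings, whose shapes are chosen to match this checkpoint's 32 vectors of size 128 (the helper mirrors `max_sim` above):

```python
import numpy as np

def max_sim(q: np.ndarray, d: np.ndarray) -> float:
    # Sum, over query tokens, of the best-matching document vector.
    return float(np.max(d @ q.T, axis=0).sum())

rng = np.random.default_rng(1)
query = rng.standard_normal((12, 128))    # 12 query token embeddings
docs = rng.standard_normal((3, 32, 128))  # 3 documents, 32 vectors each

scores = [max_sim(query, d) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best first
```

With real embeddings from `encode_queries` / `encode_documents`, the same loop ranks candidate passages; for large collections you would index the document vectors instead of scoring exhaustively.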
|
|
|
## Citation (BibTeX) |
|
```bibtex
|
@inproceedings{macavaney2025constbert, |
|
author = {MacAvaney, Sean and Mallia, Antonio and Tonellotto, Nicola}, |
|
title = {Efficient Constant-Space Multi-vector Retrieval}, |
|
year = {2025}, |
|
isbn = {978-3-031-88713-0}, |
|
publisher = {Springer-Verlag}, |
|
address = {Berlin, Heidelberg}, |
|
url = {https://doi.org/10.1007/978-3-031-88714-7_22}, |
|
doi = {10.1007/978-3-031-88714-7_22}, |
|
booktitle = {Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part III}, |
|
pages = {237–245}, |
|
numpages = {9}, |
|
keywords = {Multi-Vector Retrieval, Efficiency, Dense Retrieval}, |
|
location = {Lucca, Italy} |
|
} |
|
``` |