|
--- |
|
language: |
|
- en |
|
pipeline_tag: sentence-similarity |
|
--- |
|
# Model Card for gowitheflow/LASER-cubed-bert-base-unsup |
|
|
|
Official model checkpoint of **LA(SER)<sup>3</sup>** (LASER-cubed) from the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics".
|
|
|
### Model Summary |
|
|
|
LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. Although its training data contains no long texts, it generalizes surprisingly well to long-document retrieval.
|
|
|
- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed |
|
- **Shared by:** Chenghao Xiao |
|
- **Model type:** BERT-base |
|
- **Language(s) (NLP):** English |
|
- **Finetuned from model:** bert-base-uncased
|
|
|
### Model Sources |
|
|
|
- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
|
- **Paper:** https://aclanthology.org/2023.emnlp-main.86/ |
|
|
|
|
|
### Usage |
|
Use the model with Sentence Transformers: |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup") |
|
|
|
text = "LASER-cubed is a dope model - It generalizes to long texts without needing the training sets to have long texts." |
|
representation = model.encode(text) |
|
``` |
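
To compare documents, score the embeddings with cosine similarity. A minimal sketch (the two example texts below are made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

# two illustrative documents of different lengths
doc_a = "LASER-cubed generalizes to long documents."
doc_b = "The model retrieves long documents well, even though it was only trained on short texts."

# encode both documents and compute their cosine similarity
embeddings = model.encode([doc_a, doc_b], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score.item())
```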
|
### Evaluation |
|
Evaluate it with the BEIR framework: |
|
```python |
|
from beir.retrieval import models |
|
from beir.datasets.data_loader import GenericDataLoader |
|
from beir.retrieval.evaluation import EvaluateRetrieval |
|
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES |
|
|
|
# download the dataset yourself first with the original BEIR repo (see the sketch after this block for one way to do it)
|
data_path = './datasets/arguana' |
|
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test") |
|
model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512) |
|
retriever = EvaluateRetrieval(model, score_function="cos_sim") |
|
results = retriever.retrieve(corpus, queries) |
|
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values) |
|
|
|
``` |
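
If you prefer not to download the dataset manually, BEIR's own helper can fetch it. A small sketch, assuming BEIR's standard dataset URL scheme:

```python
from beir import util

# download and unzip ArguAna into ./datasets/arguana
dataset = "arguana"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "./datasets")
```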
|
### Downstream Use |
|
|
|
Information Retrieval |
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not intended to be fine-tuned further for other tasks (such as classification), as it is trained for representation tasks based on similarity matching.
|
|
|
|
|
|
|
## Training Details |
|
|
|
- Max sequence length: 256
- Batch size: 128
- Learning rate: 3e-05
- Epochs: 1
- Warmup: 10%
- Hardware: 1 A100 GPU
|
|
|
### Training Data |
|
|
|
wiki1M (1 million English Wikipedia sentences)
|
|
|
### Training Procedure |
|
|
|
Please refer to the paper and the GitHub repository for the full training procedure.
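
For orientation only, below is a hedged sketch of how the hyperparameters above could be wired into a sentence-transformers training loop. The loss is a generic unsupervised contrastive stand-in (each sentence paired with itself, relying on dropout noise as augmentation), not the actual LA(SER)<sup>3</sup> objective; use the GitHub repo for the real training code.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hyperparameters taken from "Training Details" above; the objective below is
# a placeholder, NOT the LA(SER)^3 loss from the paper.
model = SentenceTransformer("bert-base-uncased")
model.max_seq_length = 256

# Replace these with the wiki1M sentences.
sentences = ["an example sentence", "another example sentence"]
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),
    optimizer_params={"lr": 3e-05},
)
```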
|
|
|
## Evaluation |
|
|
|
|
|
### Results

Please refer to the paper for the full retrieval results.
|
|
|
|
|
|
|
## Citation

**BibTeX:**
|
```bibtex |
|
@inproceedings{xiao2023length, |
|
title={Length is a Curse and a Blessing for Document-level Semantics}, |
|
author={Xiao, Chenghao and Li, Yizhi and Hudson, G Thomas and Lin, Chenghua and Al Moubayed, Noura},
|
booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}, |
|
pages={1385--1396}, |
|
year={2023} |
|
} |
|
``` |