+---
+language: en
+tags:
+- longformer
+- cdlm
+license: apache-2.0
+inference: false
+---
+# Cross-Document Language Modeling
+CDLM: Cross-Document Language Modeling.
+Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, and Ido Dagan. In Findings of EMNLP, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)
+Note that pretraining used document and sentence separator tokens, so you may want to add them to your own inputs. The document separators are `<doc-s>` and `</doc-s>` (the last two tokens in the vocabulary), and the sentence separators are `<s>` and `</s>`.
+```python
+from transformers import AutoTokenizer, AutoModel
+# load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
+model = AutoModel.from_pretrained('biu-nlp/cdlm')
+```
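The separator tokens described above can be added to raw text before tokenization. The following is a minimal sketch, not the authors' preprocessing code: the helper name `format_documents` and the exact layout (each sentence wrapped in `<s> ... </s>`, each document wrapped in `<doc-s> ... </doc-s>`) are assumptions made for illustration, so check the original repository for the format used in pretraining.

```python
from typing import List

# Separator tokens named in the model card.
DOC_START, DOC_END = "<doc-s>", "</doc-s>"
SENT_START, SENT_END = "<s>", "</s>"


def format_documents(documents: List[List[str]]) -> str:
    """Join documents (each a list of sentences) into one separator-marked string.

    Assumed layout (hypothetical, not confirmed by the model card):
    every sentence is wrapped in <s> ... </s> and every document in
    <doc-s> ... </doc-s>.
    """
    parts = []
    for sentences in documents:
        # Wrap each sentence, then wrap the whole document.
        body = " ".join(f"{SENT_START} {s.strip()} {SENT_END}" for s in sentences)
        parts.append(f"{DOC_START} {body} {DOC_END}")
    return " ".join(parts)


# Two toy documents about the same event.
docs = [
    ["A fire broke out downtown.", "No one was hurt."],
    ["Firefighters contained the blaze."],
]
text = format_documents(docs)
print(text)
```

The resulting string can be passed to the tokenizer loaded above. Because the text already contains `<s>` and `</s>`, you may want to call the tokenizer with `add_special_tokens=False` to avoid a second pair being added automatically; whether that matches the pretraining setup is also an assumption.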
+The original code is available in the [CDLM repository](https://github.com/aviclu/CDLM).
+If you find our work useful, please cite the paper as:
+```bibtex
+@article{caciularu2021cross,
+  title={Cross-Document Language Modeling},
+  author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
+  journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
+  year={2021}
+}
+```