tanikina/longformer-large-science

metadata

language:
  - en
base_model:
  - allenai/longformer-large-4096

This is a fine-tuned version of the longformer-large-4096 model, additionally pre-trained on the S2ORC corpus (Lo et al., 2020), a large corpus of 81.1M English-language academic papers from different disciplines. The weights come from the Longformer large science checkpoint that served as the starting point for training the MultiVerS model (Wadden et al., 2022) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from that of the original longformer-large-4096 (50265) because 10 new tokens were added:

`<|par|>`, `</|title|>`, `</|sec|>`, `<|sec-title|>`, `<|sent|>`, `<|title|>`, `<|abs|>`, `<|sec|>`, `</|sec-title|>`, `</|abs|>`.
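The vocabulary arithmetic above can be checked with a short script. This is a minimal sketch, not an official usage recipe: it assumes the `transformers` library is installed, that the checkpoint is published under the repository id `tanikina/longformer-large-science`, and that the repository ships tokenizer files containing the added tokens. The helper name `load_and_check` is hypothetical.

```python
# Vocabulary sizes and added tokens as stated in this model card.
BASE_VOCAB_SIZE = 50265  # original longformer-large-4096
SPECIAL_TOKENS = [
    "<|par|>", "</|title|>", "</|sec|>", "<|sec-title|>", "<|sent|>",
    "<|title|>", "<|abs|>", "<|sec|>", "</|sec-title|>", "</|abs|>",
]
EXPECTED_VOCAB_SIZE = BASE_VOCAB_SIZE + len(SPECIAL_TOKENS)


def load_and_check(model_id="tanikina/longformer-large-science"):
    """Hypothetical helper: load the checkpoint and report vocabulary facts.

    Assumes the repository id above and that tokenizer files are included.
    Downloads the weights on first use, so it needs network access.
    """
    # Imported lazily so the constants above are usable without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    # Number of rows in the input embedding matrix.
    embedding_rows = model.get_input_embeddings().num_embeddings

    # Tokens that the tokenizer maps to the unknown-token id are not in its vocabulary.
    missing = [
        tok for tok in SPECIAL_TOKENS
        if tokenizer.convert_tokens_to_ids(tok) == tokenizer.unk_token_id
    ]
    return {
        "embedding_rows": embedding_rows,
        "tokenizer_size": len(tokenizer),
        "matches_expected": embedding_rows == EXPECTED_VOCAB_SIZE,
        "missing_tokens": missing,
    }
```

Calling `load_and_check()` returns a small report rather than raising, so a mismatch between the embedding matrix and the tokenizer (for example, if the tokenizer files turn out not to include the added tokens) shows up in `matches_expected` and `missing_tokens`.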