Update README.md

## Model Description

ChronoBERT is a series of **high-performance chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining good language understanding in time-sensitive applications. The models are pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.

All models in the series achieve **GLUE benchmark scores that surpass standard BERT.** This approach preserves the integrity of historical analysis and enables more reliable economic and financial modeling.

- **Paper:** "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)
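
The date suffix in each checkpoint name (e.g. `chrono-bert-v1-19991231`) encodes that checkpoint's training-data cutoff, so a chronologically consistent workflow picks the latest vintage dated before the period being analyzed. The sketch below only illustrates that selection logic and is not part of the official card: it lists just the vintages mentioned here, and the helper `checkpoint_for` is a hypothetical name. Check the Hugging Face hub for the full set of available checkpoints.

```python
# Illustrative sketch (not from the model card): pick the latest ChronoBERT vintage whose
# training-data cutoff precedes the analysis date, so no post-dated text leaks in.
# Only vintages named in this card are listed; `checkpoint_for` is a hypothetical helper.
VINTAGES = {
    "1999-12-31": "manelalab/chrono-bert-v1-19991231",
    "2020-12-31": "manelalab/chrono-bert-v1-20201231",
}

def checkpoint_for(analysis_date: str) -> str:
    eligible = [cutoff for cutoff in VINTAGES if cutoff <= analysis_date]
    if not eligible:
        raise ValueError("no checkpoint with a cutoff on or before this date")
    return VINTAGES[max(eligible)]

print(checkpoint_for("2016-06-30"))  # manelalab/chrono-bert-v1-19991231
```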

## 🚀 Quickstart

You can try ChronoBERT directly in your browser via Google Colab:

<p align="left">
  <a href="https://colab.research.google.com/gist/jimmywucm/64e70e3047bb126989660c92221abf3c/chronobert_tutorial.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
  </a>
</p>

Or run it locally with:

```sh
pip install -U "transformers>=4.48.0"
pip install flash-attn
```

### Extract Embeddings

The following snippet shows how to use the model to generate embeddings for a given input.

```python
from transformers import AutoTokenizer, AutoModel
device = 'cuda:0'

model_name = "manelalab/chrono-bert-v1-19991231"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)

text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality. -- Dr. Brown, Back to the Future Part II"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
```
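
The forward pass above returns token-level hidden states. If you need a single vector per text (for retrieval, clustering, or as a regression feature), one common choice is attention-mask-aware mean pooling over `outputs.last_hidden_state`. The card does not prescribe a pooling strategy, so the sketch below, which continues from the snippet above, is only one reasonable option.

```python
# Continues from the snippet above. Mean pooling is an assumption for illustration;
# the model card does not prescribe a pooling strategy.
hidden = outputs.last_hidden_state                       # (batch, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)                          # torch.Size([1, hidden_size])
```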

### Masked Language Modeling (MLM) Prediction

The following snippet shows how to use the model to predict a missing token in an incomplete sentence.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
device = 'cuda:0'

model_name = "manelalab/chrono-bert-v1-20201231"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)

year_election = 2016
year_begin = year_election + 1
text = f"After the {year_election} U.S. presidential election, President [MASK] was inaugurated as U.S. President in the year {year_begin}."

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
```
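
To see more than the single most likely token, the logits at the masked position can be ranked directly. The short sketch below continues from the snippet above and is an illustrative addition, not part of the original card.

```python
# Continues from the snippet above: print the top-5 candidate tokens for the [MASK] slot.
top5 = outputs.logits[0, masked_index].topk(5)
for token_id, score in zip(top5.indices.tolist(), top5.values.tolist()):
    print(f"{tokenizer.decode([token_id]).strip()}: {score:.2f}")
```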

## Training Details

### Training Data

### Results

- **GLUE Score:** chrono-bert-v1-19991231 and chrono-bert-v1-20241231 achieve GLUE scores of 84.71 and 85.54, respectively, outperforming BERT (84.52).
- **Stock return predictions:** Over the 2008-01 to 2023-07 sample, chrono-bert-v1-realtime achieves a long-short portfolio **Sharpe ratio of 4.80**, outperforming BERT, FinBERT, and StoriesLM-v1-1963, and comparable to **Llama 3.1 8B (4.90)**; a brief sketch of how such an annualized Sharpe ratio is computed follows below.
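
For readers less familiar with the metric, the toy sketch below shows how an annualized Sharpe ratio of a long-short portfolio is typically computed from monthly returns. The return series is fabricated purely for illustration and is not data from the paper.

```python
import numpy as np

# Toy illustration only: annualized Sharpe ratio from monthly long-short returns.
# The numbers below are made up and are not the paper's data.
monthly_long_short_returns = np.array([0.021, -0.004, 0.013, 0.018, -0.002, 0.009])

mean_monthly = monthly_long_short_returns.mean()
vol_monthly = monthly_long_short_returns.std(ddof=1)
annualized_sharpe = np.sqrt(12) * mean_monthly / vol_monthly
print(round(annualized_sharpe, 2))
```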