jimmywustl committed on
Commit 972f0d4 · verified · 1 Parent(s): 6034af7

Update README.md

Files changed (1)
  1. README.md +43 -7
README.md CHANGED
@@ -14,7 +14,7 @@ inference: false
 
 ## Model Description
 
- ChronoBERT is a series **high-performance chronologically consistent large language models (LLM)** designed to eliminate lookahead bias and training leakage while maintain good language understanding in time-sensitive applications. The model is pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.
+ ChronoBERT is a series of **high-performance chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining good language understanding in time-sensitive applications. The model is pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.
 
 All models in the series achieve **GLUE benchmark scores that surpass standard BERT.** This approach preserves the integrity of historical analysis and enables more reliable economic and financial modeling.
 
@@ -27,23 +27,35 @@ All models in the series achieve **GLUE benchmark scores that surpass standard B
 
 - **Paper:** "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)
 
- ## How to Get Started with the Model
+ ## 🚀 Quickstart
 
- The model is compatible with the `transformers` library starting from v4.48.0:
+ You can try ChronoBERT directly in your browser via Google Colab:
+
+ <p align="left">
+ <a href="https://colab.research.google.com/gist/jimmywucm/64e70e3047bb126989660c92221abf3c/chronobert_tutorial.ipynb" target="_blank">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
+ </a>
+ </p>
+
+ Or run it locally with:
 
 ```sh
 pip install -U transformers>=4.48.0
 pip install flash-attn
 ```
 
- Here is an example code of using the model:
+ ### Extract Embeddings
+
+ The following code snippet shows how to use the model to generate embeddings for a given input.
 
 ```python
 from transformers import AutoTokenizer, AutoModel
 device = 'cuda:0'
 
- tokenizer = AutoTokenizer.from_pretrained("manelalab/chrono-bert-v1-19991231")
- model = AutoModel.from_pretrained("manelalab/chrono-bert-v1-19991231").to(device)
+ model_name = "manelalab/chrono-bert-v1-19991231"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name).to(device)
 
 text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality. -- Dr. Brown, Back to the Future Part II"
 
@@ -51,6 +63,30 @@ inputs = tokenizer(text, return_tensors="pt").to(device)
 outputs = model(**inputs)
 ```
 
+ ### Masked Language Modeling (MLM) Prediction
+
+ The following code snippet shows how to use the model to predict a missing token in an incomplete sentence.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ device = 'cuda:0'
+
+ model_name = "manelalab/chrono-bert-v1-20201231"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)
+
+ year_election = 2016
+ year_begin = year_election+1
+ text = f"After the {year_election} U.S. presidential election, President [MASK] was inaugurated as U.S. President in the year {year_begin}."
+
+ inputs = tokenizer(text, return_tensors="pt").to(device)
+ outputs = model(**inputs)
+ masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
+ predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
+ predicted_token = tokenizer.decode(predicted_token_id)
+ ```
+
 ## Training Details
 
 ### Training Data
@@ -73,7 +109,7 @@ outputs = model(**inputs)
 
 ### Results
 
- - **GLUE Score:** chrono-bert-v1-19991231 and chrono-bert-v1-20241231 achieved GLUE score of 84.71 and 85.54 respectively, outperforming BERT (84.52).
+ - **GLUE Score:** chrono-bert-v1-19991231 and chrono-bert-v1-20241231 achieved GLUE scores of 84.71 and 85.54, respectively, outperforming BERT (84.52).
 - **Stock return predictions:** During the sample from 2008-01 to 2023-07, chrono-bert-v1-realtime achieves a long-short portfolio **Sharpe ratio of 4.80**, outperforming BERT, FinBERT, and StoriesLM-v1-1963, and comparable to **Llama 3.1 8B (4.90)**.
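
The Extract Embeddings snippet above stops at the model's forward pass. A minimal sketch of pooling that output into a fixed-size sentence embedding, reusing `model`, `tokenizer`, and `inputs` from the snippet; the mean-pooling choice is an assumption, not something the README specifies:

```python
import torch

# Re-run the forward pass without gradients; outputs.last_hidden_state has
# shape (batch, seq_len, hidden_size).
with torch.no_grad():
    outputs = model(**inputs)

# Mask out padding tokens and average the remaining token embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()          # (batch, seq_len, 1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # (1, hidden_size) for the single example above
```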
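
Similarly, for the MLM example, a short sketch of inspecting the prediction, assuming `outputs`, `masked_index`, `predicted_token`, and `tokenizer` from the snippet above are in scope; listing the top five candidates is an arbitrary choice:

```python
import torch

# Softmax over the vocabulary at the [MASK] position, then list the top candidates.
probs = torch.softmax(outputs.logits[0, masked_index], dim=-1)
top_probs, top_ids = probs.topk(5)
for p, tok_id in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{tokenizer.decode(tok_id).strip()}: {p:.3f}")

print(predicted_token)  # the single most likely fill-in computed in the snippet above
```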