Update README.md

## Model Description

ChronoBERT is a series of **high-performance chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining good language understanding in time-sensitive applications. The models are pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.

All models in the series achieve **GLUE benchmark scores that surpass standard BERT.** This approach preserves the integrity of historical analysis and enables more reliable economic and financial modeling.

- **Paper:** "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)
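
The date suffix in each checkpoint name (e.g. `chrono-bert-v1-19991231`) encodes that checkpoint's training-data cutoff, so a chronologically consistent workflow picks the latest vintage dated before the period being analyzed. The sketch below only illustrates that selection logic and is not part of the official card: it lists just the vintages mentioned here, and the helper `checkpoint_for` is a hypothetical name. Check the Hugging Face hub for the full set of available checkpoints.

```python
# Illustrative sketch (not from the model card): pick the latest ChronoBERT vintage whose
# training-data cutoff precedes the analysis date, so no post-dated text leaks in.
# Only vintages named in this card are listed; `checkpoint_for` is a hypothetical helper.
VINTAGES = {
    "1999-12-31": "manelalab/chrono-bert-v1-19991231",
    "2020-12-31": "manelalab/chrono-bert-v1-20201231",
}

def checkpoint_for(analysis_date: str) -> str:
    eligible = [cutoff for cutoff in VINTAGES if cutoff <= analysis_date]
    if not eligible:
        raise ValueError("no checkpoint with a cutoff on or before this date")
    return VINTAGES[max(eligible)]

print(checkpoint_for("2016-06-30"))  # manelalab/chrono-bert-v1-19991231
```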

## 🚀 Quickstart

You can try ChronoBERT directly in your browser via Google Colab:

<p align="left">
  <a href="https://colab.research.google.com/gist/jimmywucm/64e70e3047bb126989660c92221abf3c/chronobert_tutorial.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
  </a>
</p>

Or run it locally with:

```sh
pip install -U "transformers>=4.48.0"
pip install flash-attn
```

### Extract Embeddings

The following snippet shows how to use the model to generate embeddings for a given input.

```python
from transformers import AutoTokenizer, AutoModel
device = 'cuda:0'

model_name = "manelalab/chrono-bert-v1-19991231"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)

text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality. -- Dr. Brown, Back to the Future Part II"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
```
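
The forward pass above returns token-level hidden states. If you need a single vector per text (for retrieval, clustering, or as a regression feature), one common choice is attention-mask-aware mean pooling over `outputs.last_hidden_state`. The card does not prescribe a pooling strategy, so the sketch below, which continues from the snippet above, is only one reasonable option.

```python
# Continues from the snippet above. Mean pooling is an assumption for illustration;
# the model card does not prescribe a pooling strategy.
hidden = outputs.last_hidden_state                       # (batch, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)                          # torch.Size([1, hidden_size])
```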

### Masked Language Modeling (MLM) Prediction

The following snippet shows how to use the model to predict a missing token in an incomplete sentence.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
device = 'cuda:0'

model_name = "manelalab/chrono-bert-v1-20201231"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)

year_election = 2016
year_begin = year_election + 1
text = f"After the {year_election} U.S. presidential election, President [MASK] was inaugurated as U.S. President in the year {year_begin}."

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
```
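
To see more than the single most likely token, the logits at the masked position can be ranked directly. The short sketch below continues from the snippet above and is an illustrative addition, not part of the original card.

```python
# Continues from the snippet above: print the top-5 candidate tokens for the [MASK] slot.
top5 = outputs.logits[0, masked_index].topk(5)
for token_id, score in zip(top5.indices.tolist(), top5.values.tolist()):
    print(f"{tokenizer.decode([token_id]).strip()}: {score:.2f}")
```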

## Training Details

### Training Data

### Results

- **GLUE Score:** chrono-bert-v1-19991231 and chrono-bert-v1-20241231 achieve GLUE scores of 84.71 and 85.54, respectively, outperforming BERT (84.52).
- **Stock return predictions:** Over the 2008-01 to 2023-07 sample, chrono-bert-v1-realtime achieves a long-short portfolio **Sharpe ratio of 4.80**, outperforming BERT, FinBERT, and StoriesLM-v1-1963, and comparable to **Llama 3.1 8B (4.90)**; a brief sketch of how such an annualized Sharpe ratio is computed follows below.
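
For readers less familiar with the metric, the toy sketch below shows how an annualized Sharpe ratio of a long-short portfolio is typically computed from monthly returns. The return series is fabricated purely for illustration and is not data from the paper.

```python
import numpy as np

# Toy illustration only: annualized Sharpe ratio from monthly long-short returns.
# The numbers below are made up and are not the paper's data.
monthly_long_short_returns = np.array([0.021, -0.004, 0.013, 0.018, -0.002, 0.009])

mean_monthly = monthly_long_short_returns.mean()
vol_monthly = monthly_long_short_returns.std(ddof=1)
annualized_sharpe = np.sqrt(12) * mean_monthly / vol_monthly
print(round(annualized_sharpe, 2))
```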